Object detection, sitting at the intersection between Deep Learning and Computer Vision, has recently gained a lot of attention due to its extensive use in multiple real-world applications. International Development is not alien to that trend. In this blog entry, we introduce some of the basic concepts behind the workflow that powers such Artificial Intelligence solutions. For that, we use a real-world application based on satellite imagery from one of our projects at AKTEK.
Contextual awareness is key for the successful implementation of operations, projects and investments in many industries. While field information is still fundamental, collecting that data can be very expensive and may only be possible assuming high risks, especially in fragile and conflict-affected areas. Furthermore, contextual information in such areas becomes outdated very fast, which aggravates those limitations. The rise of object detection techniques, through machine learning and computer vision solutions, is facilitating new applications where images are automatically ingested to inform decision-making processes. This rise has been possible thanks to a combination of factors, the two main ones being the better quality and cost of image recording systems; coupled with the more powerful, but also more affordable, computational resources that power advanced solutions such as deep learning algorithms.
This conjunction of factors enables the implementation of object detection in multiple industries. While some of the techniques are common to all types of images (even videos), here we focus in Satellite imagery. Satellite imagery is a source of remote sensing data that can help increase the level of contextual awareness for international development and for other industry actors, potentially everywhere, and at reduced cost.
Some inspirational references
Satellite imagery has already been used in applications within international development and beyond. A non-exhaustive list of examples includes the study of Land Use and Land Cover (LULC) change detection to identify illegal deforestation, or to inform agricultural projects; the use of satellite imagery to study refugee settlements; or its analysis to help with rescue operations after natural catastrophes such as floods or fires. Another example of those impressive applications was published in a paper two years ago by authors from the Sustainability and Artificial Intelligence Lab at Stanford. The study combined deep learning, day satellite imagery, night satellite imagery, and socio-economic traditional surveys, all together to develop a solution for the predictiction of poverty estimates in several places in Africa. Even with its limitations, new ways to improve on the estimation of poverty are still required in the fight for the first UN sustainable development goal, the end of poverty in all its forms, everywhere. That study is a very inspirational example of how satellite imagery and artificial intelligence can merge together to be add value international development.
Invisible ship detection
At AKTEK, we have been working on including satellite imagery as a new source of remote sensing data, enhancing that way the contextual awareness in our projects. In this blog post we will use one of those projects to illustrate the workflow that powers deep learning applications for object detection. In this project we aim to complement the traditional ship tracking systems adding an invisible ship detection algorithm, that works autonomously on satellite imagery.
The Automatic Identification System (AIS) is extensively used as an automatic tracking system for ships, and it serves to monitor sea traffic. It is based on transponders and navigation systems on ships, land stations and satellites. However, even with all the law regulations that enforce the use of the system, ships can turn off their AIS transponders and effectively go dark. That has massive consequences in very sensitive international issues. The list is long, but invisible ships are related to very urgent development and security problems: dangerous sea migration, trafficking, of people or goods, illegal fishing, and piracy. Even sea accident rescue operations are affected by this. All these problems have terrible consequences on human lives, not mentioning the economic impact they can have. In response to this, advanced solutions are being developed combining different types of data and analysis techniques. AIS data and anomaly detection, radar signal processing, including Synthetic Aperture Radar (SAR), air photography and object detection, in addition to modern naval patrols. Satellite imagery is yet another source of data that will complement such data sources for a better detection solution. While the commercial success of such a solution relies on the combination of data sources and the use of several analysis techniques for each, something we will not discuss further here; in this post we will focus on visual ship detection in satellite imagery. As such, we present the basics behind object detection powered by deep learning and other computer vision techniques.
Technological advances and cost reductions have led to an acute rise in commercial aerospace products, including the availability of satellite imagery. The list of satellite imagery providers is extensive, and includes not only public providers whose images are open to all citizens, such as Landsat or Sentinel missions, but also privately own constellations of satellites that provide and license their images, for instance DigitalGlobe, Airbus or Planet.
Satellite images have several singularities with respect to other types of images. Modern satellite images are multi-million pixel arrays with a variety of ground resolutions depending on the specific provider and spectral band. Ground resolutions cover a wide range of values per pixel, from 60cm to more than 120m of ground sample distances per pixel for different cases. The ground resolution of the images determines the size of the objects that can be detected on them. Those images are multi-spectral, we have one array for each of the spectral bands recorded. This includes the usual Red, Green and Blue (RGB) bands, but also the near infrared, the shortwave infrared, the thermal or the panchromatic bands to mention a few others. The combined analysis of the bands enable the construction and study of indices, such as vegetation, water, soil, or thermal indices, that then power several solutions, for instance around land use and land cover detection. While satellite imagery can arguably cover continuously the entire Earth, there are limitations associated with taking images from the sky, revisit rates are key when developing solutions. Those are dependent on the provider, and for a given place, they range from a few images per day, to one image every few days. Different spectral bands show different visibility behaviour in function of weather events (such as clouds or storms), or simply the time of the day. Several infrared wavelengths are still visible at night, while the usual RGB images are not. The list of singularities attached to satellite imagery still goes on, nadir points, ortho-rectifications, etc. It is true that some of these aspects are usually part of the processing the imagery provider does before releasing the scenes itself to the user, but many solutions still rely on analysing and properly combining all these aspects, so we encourage you to look for more information on this if you are interested. Satellite imagery is becoming an amazing source of contextual information for many industries, it is worth the time investment.
In this blog post we start with a more curated scenario, the purpose is to take a RGB satellite scene such as the one below (ignoring weather effects here), and design and train a machine learning algorithm that automatically detects and locates the out-of-port ships in a given scene.
This type of scene covers a few tens of squared kilometers of the Earth surface (~45km2). A realistic monitoring solution based on satellite images will have to browse around tens of thousands or more of those scenes on a timely manner. Therefore, it is unfeasible to rely on a continuous manual supervision for such a solution. This is a suitable scenario for machine learning. Relying only initially on a minimal human supervision, a system can be designed and trained to perform ship detection without further human intervention. This will enable the automated monitoring of vast surfaces within feasible computing times. Enabling organizations to monitor much faster and cover larger surfaces.
Notice that the satellite scenes are multi-million pixel images (~5MP), with a ground resolution of 3m per pixel in this case, building a big numerical array per color band. Therefore, on top of aiming to reach a high out-of-port ship detection accuracy, the other key metric for the algorithm will be the speed of detection. Otherwise, the size of the scenes to monitor would present a prohibitive challenge for a real world solution looking at many square miles.
In order to build the ship detection solution we need to follow a simple two-stage workflow, allowing us to illustrate two areas at the intersection of deep learning and computer vision: image segmentation and image classification. The workflow proceeds as follows.
First, a selective segmentation algorithm is used to identify the areas within a scene that must be further analyzed in the second stage. Big pieces of land and/or water are already discarded in this stage. This enables us to reach a fast solution when dealing with such heavy satellite scenes, while improving the accuracy metrics of the overall solution.
In the second stage, we implement a moving window browsing mechanism only around the areas selected in the previous stage. Each of those windows is fed into a binary classifier that outputs then the probability there is a ship in that specific window. Drawing a box if it estimates there is one. As we will describe, such a machine learning algorithm is trained on a collection of previously labeled or annotated images the machine can learn from.
This way, given a satellite scene, the solution outputs a new image of the scene with a box drawn around each detected out-of-port ship in only a few seconds.
For the solution used to illustrate this post, we use a simple segmentation algorithm. However, we note that more advanced segmentation algorithms, selective or not, can be used, and new ones are being researched as we speak. If you are interested, have a look at the state of the art region proposal algorithms, where regions are learnt, like Faster R-CNN, or at YOLOimplementations as well.
Here it suffices however to use a well known unsupervised algorithm, based on k-means clustering. In a scene, each pixel can be understood as a three-dimensional numerical vector. This is, each pixel is defined by its values for the red, green and blue spectral bands. The pixels then form a collection of vectors. The k-means algorithm proceeds partitioning and distributing all those pixels in k different clusters, in our case based only on color proximity. This is a difficult computational problem, but usually, and specifically here, the numerical algorithms converge rapidly into an optimal solution. Such optimization algorithms work iteratively by assigning each pixel to the cluster whose mean vector is the nearest to the pixel (in the RGB space; the first assignment may be arbitrary), to then re-compute the new mean of each resulting cluster after the new pixel assignments, and iteratively repeat these two same actions again and again until the distribution of pixels in clusters is stable. In our case at hands, the optimal number of clusters is usually low (2), this suffices to cluster the pixels into water-like and into non-water-like pixels. For this, the clustering can completely ignore pixel proximity in a geo-spatial sense, as the clustering can be performed solely on the three-dimensional color vectors.
Once the clustering has been performed, we proceed with a few simple actions that improve the algorithm performance. The water-like pixels can be established by analysing the mean color vectors of each of the resulting clusters from the k-means algorithm, to then be safely ignored in the next steps. Provided the scene is not just water, we can then focus on non-water-likepixels. We proceed by excluding the areas formed by adjacent pixels that are either too large, or too small, or have very non-linear shapes, for them to be ships. The final detection of the ships will still happen in the next stage of the solution, but this is a very fast an efficient way to remove pieces of water and land that need not be further processed by the deep learning algorithm coming next, enhancing efficiency and speed. For this, we need to take into account the ground resolution of the scene at hands, the color normalization of the scene, and indeed this step would be more convoluted if we were interested in detecting ships at port.
For the scene we are using to illustrate this blog post, the areas selected by the selective segmentation (that still need to be further analyzed) are illustrated by the yellow-boxed-areas in the image above. There, only small non-water-like pixels have been depicted, everything else is black. If we had proceeded without this segmentation stage, and instead gone for a brute force approach, fixed boxes would need to parse across the whole scene in the next stage. Using 80 x 80 pixels boxes, as we describe below, that would have meant around 50 million windows to be analyzed by the coming algorithm, which would have led to a prohibitive solution in terms of computing speed. Conversely, with the simple selective segmentation we have performed and described here, only a few hundred windows need to be further analyzed. This reduces dramatically the time spent on the full analysis of a scene to a few seconds, (if using only a single cheap CPU without any parallelization on the predictions), all while helping improve the overall solution accuracy as well.
Supervised Learning and Image Classification
Once the segmentation has automatically selected the regions that need further analysis, the solution proceeds to the second stage. In there, a Convolutional Neural Network (CNN) is designed and trained to decide if each of those regions contain ships or not. For this, we browse those regions moving a fixed sized window of 80 x 80 pixels across. Such windows are suited for detecting ships of up to ~250–300 meters (like the largest cargo ships on the scene).
Thus, this means that the CNN we design always ingest 80 x 80 pixel RGB box images, not directly the larger scenes, enhancing the efficiency. As we described in the previous blog entry on International Development and Data Science projects based on text, here we also work with Supervised Learning techniques. In order to train the CNN we need to first manually label a smart set of images, and assign each of those to a binary category: ship or no-ship. We consider this assignment to be the ground truth label for each training set image, from which the algorithm has to learn to extrapolate the decision-making process.
Those training set labels, together with the RGB array of values of the corresponding training set images are then fed into the algorithm. For the real world solution, we are combining labeled images from several providers. This is possible after some simple interpolation or pooling processing on the labeled images, and only because the resolutions of those providers are compatible. This provides us with a larger dataset of a few hundred thousand 80 x 80 images, relatively balanced between ship images and no-ship ones. However, for the simpler solution presented in this blog post, we only use a few thousand of those images, balanced between the two categories, to train a first working version of the solution. Note that we are referring here about 80 x 80 boxes, not full scenes.
This is possible because we perform data augmentation on those images. Data augmentation refers to the process of artificially increasing the number of labeled images by taking the originally manually labeled ones and extending the labeled set in a controlled manner. This is done by performing a few basic actions on them: rotations, image flips, translations, etc, while assigning to those newly derived images the same label than the original one had. This is another rich set of techniques inside computer vision, so we encourage you to research further into this if you are interested. Data augmentation can significantly help reduce the amount of images to manually label and feed to the algorithms during the training phase without reducing accuracy.
Convolutional Neural Networks
As we described in our first blog post, at a very high level, algorithms are mathematical recipes that operate on the image numerical arrays and output a probability that the image belongs to each specific class. To be able to classify new images as accurately as possible, algorithms work to minimize the error between their predictions on the training set and the true classifications of those same training set images, in what is called the training phase. Provided this training is performed within a proper validation framework, that same algorithm can then predict on new images (and thus new scenes), without further human supervision needed. This can then be done almost near real-time, in a process which is now scalable, and whose accuracy can be rigorously established through a test set.
Convolutional Neural Networks are a set of algorithms being extensively used in state-of-the-art image solutions in artificial intelligence. The inspiration, history, and details behind CNN implementations can be found elsewhere. As the list of books, papers, blog entries, and tutorials describing their technical details is already so long, it is really hard to pick just one reference. For this blog entry, we oversimplify CNNs by saying they are Neural Network algorithms with tens-of-thousands to multiple millions parameters to be trained, all operating in a very structured way. These Neural Networks follow a systematic pattern of mathematical operations on the numerical arrays: convolutions, poolings and regularization layers. This construction makes them suitable to automatically learn how to extract relevant features on images, such as shapes, edges and more complex visual features. Combining this with the powerful minimization algorithms that train them have made them a very powerful solution for object detection in images and beyond.
Currently, many CNN-based solutions leverage what are called pre-trained models. Those are huge CNN constructions (such as ResNet, VGG, Inception and others), that have been previously trained in generic labeled image datasets publicly available, such as the image sets in ImageNet. Through transfer learning techniques those pre-trained constructions are extended and several of the layers are then trained further with the specific new labeled data set for the new solution at hands. That way the new model is tuned for the desired new object detection, here the detection of ships in images. While this may lead to highly accurate solutions, those are also usually heavier and thus slower. Instead, for this blog entry, it suffices to build and train a new CNN from scratch, what can be done using the existing libraries in your prefered programming language. With only a few blocks of convolutional and pooling layers (each with only tens of different small-sized convolutional filters), some dense layers, and using proper regularization techniques (batch normalization for instance), we can already reach high validation and consistent binary test accuracies (above 95%) on the selected subsample of the simplified training set used for this blog entry, further enhancing efficiency.
The training process is implemented feeding all the images and the pre-processing steps in batches, so that parallelization is possible and speeds the training of the algorithm. The training times can be really short if we keep the model light. All the workflow described here can be nowadays implemented through the extensive set of available libraries on image processing and machine learning in your prefered programming language. In python for instance, the usual numpy, scipy, scikit-learn (and image), openCV, keras and tensorflow libraries suffice to build the solution. For all those, there are excellent blogs and tutorials online.
To sum up, our two stage solution combines a relatively simple k-means clustering algorithm for the initial selective segmentation, followed by a relatively light CNN construction trained for this blog post on only a few thousand images that are further augmented. The workflow of the solution can then ingest a satellite scene such as the one above, and in only a few seconds output the location in the scene of the detected ships (if any). This is illustrated for the scene in the image below.
As it can be observed, this lighter solution is already pretty accurate when working on a scene. It only misses a very small ship (barely visible for the human eye), and it only fails mistaking a ship-shaped deck as if it was a ship indeed. Of course, the real world solution is more complete, as it uses the larger dataset, and is further validated in a proper collection of scenes. However, this lighter algorithm is already good enough to illustrate the overall workflow process while reaching quite a high performance already. It also serves to highlight some of the main challenges faced on ship detection. The segmentation has helped dealt with most of the coast based objects, but in there, decks and other ship-like land surfaces may sometimes be confused with ships. Beyond improving on the selective segmentation, a sample of those objects could be further manually labeled and fed again to a new training phase, improving that way the algorithm performs. In addition, the smallest ships remain the hardest to detect. This is related to the resolution of the image, and could be improved relying on higher resolution scenes.
You can see the performance on another scene, this time for Long Beach bay (with the same resolution than the one from San Francisco bay).
On top of the improvements we have already suggested above in order to reach a more robust object detector, the real solution at this point would proceed integrating the algorithm with AIS data. Once the ships are located on the satellite scenes, we have their geolocation at a moment in time, contrasting that with AIS data, we could go on an identify which of those are invisible ships, and that may thus be engaged in illegal activities.
Satellite imagery, in combination with machine learning and computer vision, can be used to design and train new solutions and increase contextual awareness. The process has been illustrated here for a project in international development, but the potential applications go beyond ships and this industry, to cover almost anything you can think of. Actually, the same workflow is not constrained to satellite images, as it is also applied in other types of images or even videos. We’re just beginning to scratch the surface.