Places and Spaces: a Giant Image Dataset
Zander Bailey
Posted on June 11, 2019
An explanation of the paper "Learning Deep Features for Scene Recognition using Places Database" by Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva
Visual classification and object detection are becoming increasingly important. We've all seen the examples of images with bounding boxes around cars, people, or other objects. But building these programs requires enormous datasets composed of hundreds of thousands of images. Where do these databases come from? How are they put together, and how are they determined to be useful? The Places database is one such collection, and this paper explains how Zhou et al. put together the Places database and determined its efficacy.
Before we get to Places, however, let's take a look at some existing databases. The first benchmark for scene classification was a database called Scene15. Scene15 has only 15 categories, with several hundred images per class. The MIT Indoor67 database has 67 categories, but covers only indoor places. The SUN database has 397 categories, with over 100 images per category. All of these are relatively small in comparison to ImageNet.
ImageNet is a huge database, but it is object-focused, with only a proportionally small number of scene categories.
This is where the idea for Places arose. Places is a scene-centric database with over 7 million images for 476 place categories. It is the largest image database of scenes and places so far and the first scene database large enough to be useful for training programs that require vast amounts of data.
Creating the Database
Categories are taken from the SUN database to start with, giving a set of basic location types. These categories are then combined with adjectives and run through three different search engines: Flickr, Google Images, and Bing Images. The pictures have to be large enough to be useful, so any images smaller than 200x200 pixels are discarded.
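As a rough sketch of that collection step (the category and adjective lists below are made up, and search_images is a hypothetical stand-in for the actual search-engine queries), the filtering might look like this:

```python
from itertools import product

# Illustrative seed data only; the real category and adjective lists come from
# the SUN database and are much longer.
CATEGORIES = ["bedroom", "harbor", "playground"]
ADJECTIVES = ["messy", "spare", "sunny", "wooded"]

MIN_SIZE = 200  # images smaller than 200x200 pixels are discarded

def search_images(query):
    """Hypothetical stand-in for querying Flickr, Google Images, and Bing Images.

    Expected to yield (url, width, height) tuples for each result."""
    raise NotImplementedError

def collect_candidate_images():
    kept = []
    for adjective, category in product(ADJECTIVES, CATEGORIES):
        for url, width, height in search_images(f"{adjective} {category}"):
            if width >= MIN_SIZE and height >= MIN_SIZE:
                kept.append((category, url))
    return kept
```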
At this point we need to designate names for the subsets sampled from each database, which will be used during testing.
Places 205 has 205 categories, with at least 5,000 images per category.
Places 88 has the 88 categories in common with ImageNet, each with at least 1,000 images.
Similar sets taken from SUN and ImageNet are referred to as SUN 205 and SUN 88, and ImageNet 205 and ImageNet 88.
Measuring the Database
The quality of a database can depend on the task it is being used to perform. In general, a good database should be dense and diverse. But what do these terms actually mean? Before we use them as measures, it is important to define them.
For an image set, high density means that any given image most likely has similar neighbors. But you cannot measure the quality of a dataset on density alone. So what is diversity? A dataset is diverse when the images within each category are visually varied: two images randomly selected from the same category are unlikely to look alike.
Density and Diversity
To measure density and diversity, workers were presented with pairs of images and asked to select the pair that looked most similar. The only difference between the two experiments was how the pairs were generated: for diversity, pairs were sampled at random; for density, pairs were selected to be more likely visually similar.
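In the paper these measures are estimated from human pairwise judgments rather than computed directly. Purely as an illustration, here is a rough Python proxy that substitutes distances between precomputed image embeddings (feats_a and feats_b are assumed lists of feature vectors, one list per dataset) for those judgments:

```python
import random
import numpy as np

def pair_distance(feats, i, j):
    """Euclidean distance between two image embeddings."""
    return np.linalg.norm(feats[i] - feats[j])

def relative_diversity(feats_a, feats_b, trials=10_000):
    """1 - p(a random pair from A looks more alike than a random pair from B)."""
    closer_a = 0
    for _ in range(trials):
        a1, a2 = random.sample(range(len(feats_a)), 2)
        b1, b2 = random.sample(range(len(feats_b)), 2)
        if pair_distance(feats_a, a1, a2) < pair_distance(feats_b, b1, b2):
            closer_a += 1
    return 1.0 - closer_a / trials

def relative_density(feats_a, feats_b, trials=10_000, pool=20):
    """p(a random image from A has a closer near-neighbor than one from B),
    with the near-neighbor drawn from a small random pool."""
    closer_a = 0
    for _ in range(trials):
        a = random.sample(range(len(feats_a)), pool)
        b = random.sample(range(len(feats_b)), pool)
        nn_a = min(pair_distance(feats_a, a[0], j) for j in a[1:])
        nn_b = min(pair_distance(feats_b, b[0], j) for j in b[1:])
        if nn_a < nn_b:
            closer_a += 1
    return closer_a / trials
```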
All three datasets were found to have similar density, but there was a larger variation in diversity. Places is the most diverse, with a relative diversity of 0.83, while ImageNet has a diversity of 0.67 and SUN has a diversity of 0.5. The categories with the largest variation in diversity across all three are playground, veranda, and waiting room.
Cross Dataset Generalization
Separate models were trained on each of the three databases, and then all three models were tested on each database. In each case, training and testing on the same database gave the best results when the number of training samples was held fixed. However, the Places database is so large that it achieves the best score on two of the three test sets when all of its training data is used.
Source: Learning Deep Features for Scene Recognition using Places Database
Training a Neural Network for Scene Recognition and Deep Features
The next step is to show that a Convolutional Neural Network trained on the Places database can achieve a significant improvement on previous scene-centric benchmarks, as compared with a network trained on the ImageNet database, referred to as the ImageNet-CNN.
The Places-CNN is trained using 2,448,873 images from 205 categories of Places as the training set, with between 5,000 and 15,000 images per category. The test set has 200 images per category and the validation set has 100 images per category.
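The original network follows the same AlexNet-style architecture as the ImageNet-CNN and was trained with the Caffe framework. Purely as an illustration (not the authors' code, and with assumed paths and hyperparameters), a comparable setup in PyTorch might look like this:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Basic preprocessing; the exact crops and augmentation used in the paper may differ.
preprocess = transforms.Compose([
    transforms.RandomResizedCrop(227),
    transforms.ToTensor(),
])

# Assumed directory layout: one folder per scene category.
train_set = datasets.ImageFolder("places205/train", transform=preprocess)
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

model = models.alexnet(num_classes=205)  # AlexNet-style network with 205 scene classes
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)

for epoch in range(90):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```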
Visualization of the Deep Features
To gain a better understanding of the differences between ImageNet-CNN and Places-CNN, we will examine the responses at various layers of the networks. We use a combination of the test sets from ImageNet LSVRC2012 (100,000 images) and SUN397 (108,754 images) as input to both networks.
After Places-CNN is trained, the output of its final layer (the softmax) can be used to classify images. For this we use the test sets of Places 205 and SUN 205.
Places-CNN had accuracy of 50.0% on Places 205, and 66.2% on SUN 205. ImageNet-CNN had accuracy of 40.8% on Places 205 and 49.6% on SUN 205.
Places-CNN performs better across both sets.
The performance of Places-CNN is further assessed in terms of the top-5 error rate: a sample counts as misclassified if the true label is not among the top 5 predicted labels. The top-5 error rate for Places-CNN is 18.9% on Places 205 and 8.1% on SUN 205.
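For concreteness, a minimal sketch of computing the top-5 error rate from a network's class scores (assumed tensor shapes, not code from the paper) could look like this:

```python
import torch

def top5_error(logits, labels):
    """Fraction of samples whose true label is not among the 5 highest-scoring predictions.

    logits: (N, num_classes) tensor of class scores; labels: (N,) tensor of true class indices.
    """
    top5 = logits.topk(5, dim=1).indices             # (N, 5) indices of the 5 best guesses
    hit = (top5 == labels.unsqueeze(1)).any(dim=1)   # True where the true label is in the top 5
    return 1.0 - hit.float().mean().item()
```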
Using the deep features from each network to train a linear SVM classifier, with the same default parameters for both ImageNet-CNN and Places-CNN, accuracy is determined on the SUN397 and SUN Attribute datasets.
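As an illustration of that setup (not the authors' exact pipeline), the features from a late layer of each network can be fed to scikit-learn's LinearSVC with default parameters; the feature extraction step is assumed to have already been done:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def evaluate_features(train_feats, train_labels, test_feats, test_labels):
    """Train a linear SVM with default parameters on precomputed CNN features.

    train_feats / test_feats: (N, D) arrays of deep features (e.g. a late
    fully connected layer) taken from either ImageNet-CNN or Places-CNN.
    """
    clf = LinearSVC()  # same default parameters for features from both networks
    clf.fit(train_feats, train_labels)
    return accuracy_score(test_labels, clf.predict(test_feats))
```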
Places-CNN outperforms ImageNet-CNN on all scene classification benchmarks, although ImageNet-CNN still performs better on object-focused databases. This demonstrates that Places-CNN and ImageNet-CNN have complementary strengths on scene-related and object-related tasks.
Hybrid-CNN
Lastly, a Hybrid-CNN was trained using images from the training sets of both Places-CNN and ImageNet-CNN. After removing all overlapping categories, the training set of Hybrid-CNN has 3.5 million images from 1,183 categories. The training process runs for over 700,000 iterations. Combining the datasets yields an additional improvement on several benchmarks.
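One way to picture that combination (purely illustrative; the folder names and PyTorch tooling are my assumptions, not the paper's setup) is to concatenate the two training sets after the overlapping scene categories have been dropped from the ImageNet side:

```python
from torch.utils.data import ConcatDataset
from torchvision import datasets, transforms

preprocess = transforms.Compose([transforms.RandomResizedCrop(227), transforms.ToTensor()])

# Assumed directory layout: one folder per category in each training set.
places = datasets.ImageFolder("places205/train", transform=preprocess)
imagenet = datasets.ImageFolder(
    "imagenet_no_scene_overlap/train",
    transform=preprocess,
    # Offset the object-class indices so they do not collide with the scene classes.
    target_transform=lambda y: y + len(places.classes),
)

# Hybrid training set: scene categories plus object categories, overlaps removed beforehand.
hybrid = ConcatDataset([places, imagenet])
```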
Conclusion
Large amounts of data are important for the performance of Deep CNNs.
To this end, the Places database is introduced as a new benchmark, with millions of labeled images representing locations and landscapes from the real world. Introducing measures of density and diversity makes it easier to estimate biases and helps with comparisons to other datasets. Through an array of tests against current benchmarks, Places can be seen to give performance equal to or greater than that of image datasets of similar size but with a different focus.
You can read the original paper here for a more technical overview.