How to use COCO for Object Detection
To train a detection model, we need images, labels, and bounding box annotations. The COCO (Common Objects in Context) dataset is a popular choice and benchmark, since it covers a variety of objects in different settings. The COCO website also explains how models are evaluated on the dataset.
Setting up
To get started, we first download the images and annotations from the COCO website. We create a folder for the dataset and add two folders named `images` and `annotations`. Next, we add the downloaded folder `train2017` (around 20GB) to `images` and the file `instances_train2017.json` to `annotations`. Our dataset folder should then look like this:
```
cocoDataset/
├── annotations/
│   └── instances_train2017.json
└── images/
    └── train2017/
        ├── 000000000009.jpg
        └── ...
```
**Images** - Images are in the `.jpg` format, come in different sizes, and are named with a number. All image names are 12 digits long with leading 0s.
**COCO annotation file** - The file `instances_train2017.json` contains the annotations. These include the COCO class label, bounding box coordinates, and the coordinates of the segmentation mask. Next, we explore how this file is structured in more detail.
Annotation file structure
The annotation file consists of nested key-value pairs. On the top level there are five such objects:

- `'info'`
- `'licenses'`
- `'categories'`
- `'images'`
- `'annotations'`
In Python we can access these objects by loading the file with the json module.
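As a sketch, the loading step looks like this. With the real dataset you would point `json.load` at `cocoDataset/annotations/instances_train2017.json`; here we write and read a tiny toy file with the same top-level structure so the snippet is self-contained:

```python
import json

# With the real dataset you would load the downloaded annotation file:
# with open('cocoDataset/annotations/instances_train2017.json') as f:
#     data = json.load(f)

# For illustration, a tiny toy file with the same five top-level keys:
toy = {'info': {}, 'licenses': [], 'categories': [], 'images': [], 'annotations': []}
with open('toy_annotations.json', 'w') as f:
    json.dump(toy, f)

with open('toy_annotations.json') as f:
    data = json.load(f)

print(list(data.keys()))
# → ['info', 'licenses', 'categories', 'images', 'annotations']
```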
The first two objects contain information about the dataset, such as its date of creation, and the licenses under which the images are used.
Before we explore the remaining three objects, let's define three variables.
| Variable | Meaning |
| --- | --- |
| `category_id` | Maps a label to the class name |
| `image_id` | Image name without file extension and leading zeros |
| `annotation_id` | Annotation identifier |
Each of these ids is unique.
The `categories` key contains a list of category objects, which map the `category_id` to the class name. For example, the first two are:
```
{'supercategory': 'person', 'id': 1, 'name': 'person'},
{'supercategory': 'vehicle', 'id': 2, 'name': 'bicycle'}
```
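In practice it is convenient to turn this list into a lookup dictionary; a small sketch using the two entries shown above:

```python
# The first two entries of the 'categories' list, as shown above
categories = [
    {'supercategory': 'person', 'id': 1, 'name': 'person'},
    {'supercategory': 'vehicle', 'id': 2, 'name': 'bicycle'},
]

# Build a category_id -> class name lookup
id_to_name = {cat['id']: cat['name'] for cat in categories}
print(id_to_name[2])  # → bicycle
```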
The image object contains image meta information.
```
{'license': 4,
 'file_name': '000000522418.jpg',
 'coco_url': 'http://images.cocodataset.org/train2017/000000522418.jpg',
 'height': 480,
 'width': 640,
 'date_captured': '2013-11-14 11:38:44',
 'flickr_url': 'http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg',
 'id': 522418}
```
Note that the two fields `file_name` and `id` (which is the `image_id`) are the same except for the leading zeros in the file name (and the extension). That means we can later use the `image_id` to retrieve image files.
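Recovering the file name from an `image_id` is a one-liner: zero-pad the id to 12 digits and append the extension. Using the example image above:

```python
# Zero-pad the image_id to 12 digits to recover the file name
image_id = 522418
file_name = f'{image_id:012d}.jpg'
print(file_name)  # → 000000522418.jpg
```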
For each image, there are one or more annotation objects. Each of these annotations contains multiple key-value pairs:
```
{'segmentation': [[239.97,
   260.24,
   ...
   222.04,
   228.87,
   271.34]],
 'area': 2765.1486500000005,
 'iscrowd': 0,
 'image_id': 558840,
 'bbox': [199.84, 200.46, 77.71, 70.88],
 'category_id': 58,
 'id': 156}
```
The `image_id` maps this annotation to the image object, while the `category_id` provides the class information. Each annotation is uniquely identifiable by its `id` (the `annotation_id`).
The `bbox` field provides the bounding box coordinates in the COCO format `[x, y, w, h]`, where `(x, y)` are the coordinates of the top-left corner of the box and `(w, h)` are its width and height.
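Many drawing and plotting libraries expect corner coordinates `(x1, y1, x2, y2)` instead, so a small conversion helper is handy. A sketch, applied to the example bbox above:

```python
def coco_to_corners(bbox):
    """Convert a COCO [x, y, w, h] box to (x1, y1, x2, y2) corners."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

# e.g. the bbox from the annotation example above:
corners = coco_to_corners([199.84, 200.46, 77.71, 70.88])
print(corners)  # top-left stays (199.84, 200.46); bottom-right is (x + w, y + h)
```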
COCO API
If you don't want to write your own code to access the annotations, you can use the COCO API (the `pycocotools` package). As a brief example, let's say we want to train a bicycle detector. To get annotated bicycle images, we can subsample the COCO dataset for the `bicycle` class (COCO label 2).
First, we clone the repository and add the folders `images` and `annotations` to the root of the repository. Then we can use the COCO API to get a list of all `image_id`s that contain annotated bicycles.
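This step can be sketched as follows. The `getCatIds` and `getImgIds` calls are part of `pycocotools`; the path assumes the folder layout above, and since loading the real annotation file takes a while, the actual `COCO(...)` call is left as a comment:

```python
def bicycle_image_ids(coco):
    """Return the ids of all images containing at least one bicycle annotation."""
    cat_ids = coco.getCatIds(catNms=['bicycle'])  # → [2] for COCO
    return coco.getImgIds(catIds=cat_ids)

# With the real dataset (requires `pip install pycocotools` and the files above):
# from pycocotools.coco import COCO
# coco = COCO('annotations/instances_train2017.json')
# img_ids = bicycle_image_ids(coco)
```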
With this list of `image_id`s we can get annotations. For example, to get all annotations containing bicycles in the image `000000196610.jpg`, we use two filters, which results in five bicycle annotations.
These five annotation objects can then be loaded into a list `anns`.
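A sketch of the two-filter lookup: `getAnnIds` accepts both an `imgIds` and a `catIds` filter, and `loadAnns` resolves the resulting ids into annotation objects (the `coco` object is assumed from the setup above):

```python
def bicycle_annotations(coco, image_id):
    """Fetch all bicycle annotations for one image using two filters."""
    cat_ids = coco.getCatIds(catNms=['bicycle'])
    # Filter by image AND category; iscrowd=None keeps crowd and non-crowd boxes
    ann_ids = coco.getAnnIds(imgIds=[image_id], catIds=cat_ids, iscrowd=None)
    return coco.loadAnns(ann_ids)

# With the real dataset, image 000000196610.jpg yields five bicycle annotations:
# anns = bicycle_annotations(coco, 196610)
```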
Now we can access the bounding box coordinates by iterating over the annotations.
To visualize the image with all bicycle annotations, we can use `matplotlib` and `PIL`, for example.
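One possible sketch using PIL's `ImageDraw`. The image path and the `anns` list are assumed from the steps above; here we draw on a blank canvas so the snippet is self-contained:

```python
from PIL import Image, ImageDraw

def draw_boxes(img, anns, color=(255, 0, 0)):
    """Draw each annotation's COCO [x, y, w, h] bbox onto a PIL image."""
    draw = ImageDraw.Draw(img)
    for ann in anns:
        x, y, w, h = ann['bbox']
        draw.rectangle([x, y, x + w, y + h], outline=color, width=3)
    return img

# Self-contained demo on a blank 640x480 canvas; with the real data use
# Image.open('images/train2017/000000196610.jpg') and the anns list from above.
img = draw_boxes(Image.new('RGB', (640, 480)), [{'bbox': [100, 100, 200, 150]}])
```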
Figure 1 shows the image with the drawn bounding boxes.
And that is how we can access the bicycle images and their annotations.
In conclusion, we have seen how the images and annotations of the popular COCO dataset can be used for new projects, particularly in object detection.