The Mislabelled Objects in COCO
Benchmark datasets, such as COCO or ImageNet, have driven the development of newer and better neural network models. An agreed-upon dataset keeps the training and test data constant, so different models can be trained on identical data and their predictions scored against each other in a competition. This, however, comes with the assumption that the benchmark dataset does not contain any errors, which is often not true. From my experience going through the traffic lights in the COCO dataset, labelling errors are not uncommon.
Mislabelled data may limit model accuracy because the model is trying to fit data that is quite simply wrong. These errors are also magnified when comparing models or performing an error analysis: they create uncertainty that can lead to false conclusions about model performance.
Having gone through all of the traffic lights to relabel them for another task (see COCO Dataset Extension), we found that 6.9% (927/13,521) of the traffic light annotations are mislabelled. There are likely many more false negatives (unlabelled traffic lights) in addition to this. These errors may be caused by inconsistencies between different labellers, labeller fatigue, or simple mistakes. This post will detail some of the more notable ones.
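For anyone who wants to audit the annotations themselves, a minimal sketch of pulling out the traffic light annotations with pycocotools is shown below. The annotation file path is a placeholder, and the exact counts will depend on which splits are included.

```python
from pycocotools.coco import COCO

# Hypothetical local path to the COCO 2017 instance annotations.
coco = COCO("annotations/instances_train2017.json")

# "traffic light" is one of the 80 COCO object categories.
cat_ids = coco.getCatIds(catNms=["traffic light"])
ann_ids = coco.getAnnIds(catIds=cat_ids)
img_ids = coco.getImgIds(catIds=cat_ids)

print(f"{len(ann_ids)} traffic light annotations across {len(img_ids)} images")
```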
Examples of Mislabelled Traffic Lights
Datasets are most often labelled by human labellers who create the ground truth values. Even though supervised training algorithms treat these labels as if humans were infallible, errors and mistakes can slip into the dataset. These errors include misclassifying one class as another, labelling inconsistently, and errors of omission or commission. Such errors can affect the performance of a model trained on the dataset. As the widely used data aphorism goes: garbage in, garbage out.
Figure 1 shows some of the more common errors. Several very obvious traffic lights at the top of the image were missed. A part of the building on the right is labelled as a traffic light. Finally, some of the street lamps are labelled as traffic lights, which is not an uncommon occurrence throughout the dataset.
It should be noted that there may be dozens of bounding boxes in an image; Figure 1 shows only the ones labelled as traffic lights, and each car requires labelling as well. Even so, some of the smaller traffic lights can be difficult to identify, such as the pedestrian light to the right or some of the lights near the top of the image, highlighting how arduous the labelling task can be for even a single image.
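To see how easy these small boxes are to miss, here is a rough sketch of overlaying just the traffic light boxes on one image. It reuses the variables from the earlier snippet, and the train2017 image directory path is an assumption.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

# Pick any image that contains at least one traffic light annotation.
img_info = coco.loadImgs(img_ids[0])[0]
image = Image.open(f"train2017/{img_info['file_name']}")  # assumed image directory

# Load only the traffic light annotations for this image.
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_info["id"], catIds=cat_ids))

fig, ax = plt.subplots()
ax.imshow(image)
for ann in anns:
    x, y, w, h = ann["bbox"]  # COCO boxes are stored as [x, y, width, height]
    ax.add_patch(patches.Rectangle((x, y), w, h, fill=False, edgecolor="red", linewidth=1))
plt.show()
```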
There is also a degree of confidence involved when humans label images. Sometimes the object is too small, obscured, or far away for the labeller to be confident that it is a traffic light (Figure 1). It could be a sign, a car light, or a reflection off a window. The difficulty of labelling these images is often underappreciated.
The Traffic Light Tree (Figure 2) is an example of how tiresome labelling can be, with over two dozen traffic lights in a small area. There are several images of the Traffic Light Trees; however, none of them are perfectly labelled. Many of the traffic lights are correctly labelled, though, suggesting that the labeller initially attempted to label them all before tiring of the numerous traffic lights on this art installation. It is also difficult to keep track of which lights have already been labelled.
Some errors, on the other hand, are particularly egregious (Figure 3). This may be due to labeller inexperience, not knowing how to delete a bounding box, or simply a mistake. These types of errors could be avoided by having someone else check the annotations to ensure data quality. Alternatively, automated ways to detect and remove mislabelled objects are being developed.
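As an illustration only (not any particular published method), one common idea is to flag annotations where a trained model confidently disagrees with the given label, in the spirit of confident-learning tools such as cleanlab. The function name and threshold below are made up for this sketch.

```python
import numpy as np

def flag_possible_label_errors(labels, pred_probs, threshold=0.95):
    """Flag annotations where a trained model strongly disagrees with the given label."""
    predicted = pred_probs.argmax(axis=1)
    confidence = pred_probs.max(axis=1)
    return np.where((predicted != labels) & (confidence >= threshold))[0]

# Toy example: the third annotation is labelled class 0, but the model is
# ~98% sure it is class 2, so it gets flagged for human review.
labels = np.array([0, 1, 0])
pred_probs = np.array([[0.90, 0.05, 0.05],
                       [0.10, 0.85, 0.05],
                       [0.01, 0.01, 0.98]])
print(flag_possible_label_errors(labels, pred_probs))  # -> [2]
```

The flagged annotations would still need a human pass; the point is to shrink the review set rather than to delete labels automatically.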
Labelling Confusion
Occasionally, there is some ambiguity in the labelling task that can lead to labelling inconsistency. Figure 4 is an example of such an error, where the lit-up part of the traffic light is labelled separately from the rest of the traffic light. This type of error happens occasionally throughout the COCO dataset and could have been prevented by giving the labeller clear instructions on what counts as a traffic light.
Instructions can also clear up ambiguities about what counts as a traffic light. Do reflections, images of traffic lights, or toy traffic lights count (Figures 5 and 6), or should they remain unlabelled? It can end up like the painting The Treachery of Images by René Magritte, where a picture of a pipe is labelled as “not a pipe”. Decisions have to be made when determining what should count as a traffic light, and these decisions were likely left to the various labellers hired to label the images, some of whom will make different decisions from others.
Conclusion
The COCO dataset has a moderate number of annotation errors in the traffic light object class, some of them very obvious. These errors are not specific to COCO and plague other datasets such as CIFAR-100 and ImageNet. Fixing or removing these errors can substantially improve data quality and, in turn, the trained models, as Deepomatic found.
Labelling, however, can be a difficult task that is prone to errors, with causes including inconsistent labelling guidelines and uncertainty about small or ambiguous objects. To fix some of these errors, there should be clear guidelines on how to label particular instances, and these instructions should be updated as new edge cases are encountered. Better annotation and labelling procedures can resolve these ambiguities and produce consistently labelled, correct datasets. The annotations can then be reviewed by another person to ensure that the most egregious errors are removed. Another option is to develop and deploy an automated way to identify errors and improve object recognition models.