COCO is a visual dataset that plays an important role in computer vision. In this article, we will cover everything you need to know about the popular Microsoft COCO dataset, which is widely used for machine learning projects. Learn what you can do with MS COCO and what makes it different from alternatives such as Google’s OID (Open Images Dataset).
About us: Viso.ai provides the end-to-end computer vision platform Viso Suite. Leading organizations use our technology to gather training data, train models, and develop computer vision applications. Learn more or get a demo for your organization.

The COCO Dataset
The MS COCO dataset is a large-scale object detection, image segmentation, and captioning dataset published by Microsoft. Machine learning and computer vision engineers commonly use the COCO dataset for a wide variety of computer vision projects.
Understanding visual scenes is a primary goal of computer vision; it involves recognizing what objects are present, localizing them in 2D and 3D, determining their attributes, and characterizing the relationships between objects. Algorithms for object detection and object classification can therefore be trained and evaluated on the dataset.
What is COCO?
COCO stands for Common Objects in Context; the image dataset was created with the goal of advancing image recognition research. The COCO dataset contains challenging, high-quality visual data that is mostly used to train and evaluate state-of-the-art neural networks.
For example, COCO is often used as a benchmark for comparing the performance of real-time object detection models. The COCO annotation format is understood out of the box by most modern deep learning libraries.

Features of the COCO dataset
- Object segmentation with detailed instance annotations
- Recognition in context
- Superpixel stuff segmentation
- Over 200,000 of the total 330,000 images are labeled
- 1.5 million object instances
- 80 object categories, the “COCO classes”, which include “things” for which individual instances may be easily labeled (person, car, chair, etc.)
- 91 stuff categories, where “COCO stuff” includes materials and objects with no clear boundaries (sky, street, grass, etc.) that provide significant contextual information.
- 5 captions per image
- 250,000 people labeled with 17 keypoints each, popularly used for pose estimation
List of the COCO Object Classes
The COCO dataset provides annotations for object detection and tracking across the following 80 object classes:
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'

List of the COCO Keypoints
The COCO keypoint annotations cover 17 keypoints (classes), each annotated with three values (x, y, v). The x and y values are the pixel coordinates, and v indicates visibility (0: not labeled, 1: labeled but not visible, 2: labeled and visible).
"nose", "left_eye", "right_eye", "left_ear", "right_ear", "left_shoulder", "right_shoulder", "left_elbow", "right_elbow", "left_wrist", "right_wrist", "left_hip", "right_hip", "left_knee", "right_knee", "left_ankle", "right_ankle"

Annotated COCO images
The large dataset comprises annotated photos of everyday scenes that show common objects in their natural context. The objects are labeled using pre-defined classes such as “chair” or “banana”. This labeling process, also called image annotation, is a very popular technique in computer vision.
While other object recognition datasets have focused on 1) image classification, 2) object bounding-box localization, or 3) semantic pixel-level segmentation, the MS COCO dataset focuses on 4) segmenting individual object instances.

Why common objects in natural context?
For many object categories, iconic views are readily available. For example, a web-based image search for a specific object category (for example, “chair”) returns top-ranked results that show the object in profile, unobstructed, and near the center of a neatly composed photo.
While image recognition systems usually perform well on such iconic views, they struggle to recognize objects in real-life scenes that are cluttered or where the object is partially occluded. Hence, it is an essential aspect of the COCO images that they are natural images containing multiple objects per scene.

How to use the COCO dataset
Is the COCO dataset free to use?
Yes, the MS COCO images dataset is licensed under a Creative Commons Attribution 4.0 License. This license lets you distribute, remix, tweak, and build upon the work, even commercially, as long as you credit the original creator.
How to download the COCO dataset
There are different dataset splits available to download for free. Each year’s images are associated with different tasks such as object detection, keypoint detection, image captioning, and more.
To download them and see the most recent Microsoft COCO 2020 challenges, visit the official MS COCO website. To download the COCO images efficiently, it is recommended to use gsutil rsync rather than downloading large zip files. You can then use the COCO API to work with the downloaded data, as sketched below.
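For example, here is a small sketch of that setup with the Python COCO API (pycocotools), assuming the 2017 validation annotations have been extracted to a local annotations/ folder:

```python
# Sketch: load COCO annotations with the Python COCO API (pycocotools).
# Assumes `pip install pycocotools` and a downloaded instances_val2017.json.
from pycocotools.coco import COCO

ann_file = "annotations/instances_val2017.json"   # adjust to your local path
coco = COCO(ann_file)

# List the 80 "thing" categories used in the detection annotations.
categories = coco.loadCats(coco.getCatIds())
print([cat["name"] for cat in categories])

# Find all images that contain at least one person.
person_id = coco.getCatIds(catNms=["person"])
image_ids = coco.getImgIds(catIds=person_id)
print(f"{len(image_ids)} images contain people")
```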
COCO recommends using the open-source tool FiftyOne to access the MS COCO dataset for building computer vision models.
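A minimal sketch of that route, assuming FiftyOne is installed via pip install fiftyone (the split and sample count below are only illustrative choices):

```python
# Sketch: download a small slice of COCO 2017 through the FiftyOne dataset zoo
# and browse it in the FiftyOne app.
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    max_samples=100,   # pull only a small subset for a quick look
)

session = fo.launch_app(dataset)   # interactive browser-based viewer
session.wait()
```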

Comparison of COCO Dataset vs. Open Images Dataset (OID)
A popular alternative to the COCO Dataset is the Open Images Dataset (OID), created by Google. Before choosing one for a project, it is worth understanding how the two visual datasets differ so you can make the best use of the available resources.
Open Images Dataset (OID)
What makes it unique? Google annotated all images in the OID dataset with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. This slightly broader annotation scheme means OID can be used for a somewhat wider range of computer vision tasks than COCO. The OID home page also claims it is the largest existing dataset with object location annotations.
Data. Open Images is a dataset of approximately 9 million pre-annotated images. Most, if not all, images of Google’s Open Images Dataset have been hand-annotated by professional image annotators. This ensures accuracy and consistency for each image and leads to higher accuracy in the computer vision applications that use it.
Common Objects in Context (COCO)
What makes it unique? With COCO, Microsoft introduced a visual dataset that contains a massive number of photos depicting common objects in complex everyday scenes. This sets COCO apart from other object recognition datasets that focus on specific sub-tasks of artificial intelligence, such as image classification, object bounding-box localization, or semantic pixel-level segmentation.
Meanwhile, the annotations of COCO are mainly focused on segmenting multiple, individual object instances. This broader focus allows COCO to be used for more use cases than other popular datasets such as CIFAR-10 and CIFAR-100. Compared to the OID dataset, however, COCO does not stand out as much, and in most cases either could be used.
Data. With 2.5 million labeled instances in 328k images, COCO is a very large and expansive dataset that allows many uses. However, this amount does not compare to Google’s OID, which contains a whopping 9 million annotated images.
While Google states that OID’s 9 million images were annotated with human involvement, it also discloses that many of the object bounding boxes and segmentation masks were generated with automated, machine-assisted methods. Neither COCO nor OID publishes detailed bounding-box accuracy figures, so it remains up to the user to judge whether machine-assisted boxes reach the precision of manually drawn ones.

What’s Next?
The COCO dataset and benchmark are used in a wide range of AI vision tasks and disciplines. Models trained on COCO are used for object detection, people detection, face detection, pose estimation, and many more computer vision tasks.
Check out the following related articles:
- AI in Sports: How Computer Vision is Changing the Game
- Everything you need to know about Image Annotation
- What is Computer Vision? A beginner’s guide
- Data Preprocessing Techniques for Machine Learning (Tutorial)
- What you need to know about Mask R-CNN
- AI to create ultra-realistic images from text
FAQs
What is included in the COCO dataset?
The COCO (Common Objects in Context) dataset is a large-scale image recognition dataset for object detection, segmentation, and captioning tasks. It contains over 330,000 images annotated across 80 object categories, with 5 captions per image describing the scene.
What is the Microsoft COCO dataset?
MS COCO (Microsoft Common Objects in Context) is a large-scale image dataset containing 328,000 images of everyday objects and humans. The dataset contains annotations you can use to train machine learning models to recognize, label, and describe objects.
What is the COCO dataset for image captioning?
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
How many classes are there in the COCO dataset?
The original COCO paper defines 91 object categories, but only 80 of them are used in the released detection annotations (the 80 classes listed above).
How to train a model using COCO dataset? ›- 1) COCO format. ...
- 2) Creating a Dataset class for your data. ...
- 3) Adding dataset paths. ...
- 4) Evaluation file. ...
- 5) Training script. ...
- 6) Changing the hyper-parameters. ...
- 7) Finetuning the model. ...
- Now all it is ready for trainnig!!
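As a hedged sketch of such a workflow with PyTorch and torchvision (the paths, model choice, and hyper-parameters are assumptions for illustration, not the pipeline of any specific tutorial):

```python
# Sketch: load COCO-format data with torchvision and run a single training step
# on a pre-trained detector. Paths and hyper-parameters are illustrative only.
import torch
import torchvision
from torchvision.datasets import CocoDetection
from torchvision.transforms import functional as F


def to_tensor(image, target):
    # CocoDetection yields (PIL image, list of COCO annotation dicts).
    return F.to_tensor(image), target


def coco_to_detection_target(anns):
    # Convert COCO [x, y, w, h] boxes into the [x1, y1, x2, y2] tensors
    # expected by torchvision detection models.
    boxes = [
        [a["bbox"][0], a["bbox"][1],
         a["bbox"][0] + a["bbox"][2], a["bbox"][1] + a["bbox"][3]]
        for a in anns
    ]
    return {
        "boxes": torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4),
        "labels": torch.as_tensor([a["category_id"] for a in anns], dtype=torch.int64),
    }


dataset = CocoDetection(
    root="coco/train2017",                                  # image folder
    annFile="coco/annotations/instances_train2017.json",    # COCO JSON annotations
    transforms=to_tensor,
)

# Pre-trained Faster R-CNN; on older torchvision versions use pretrained=True instead.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

image, anns = dataset[0]                # one sample, just to show the mechanics
target = coco_to_detection_target(anns)
loss_dict = model([image], [target])    # in train mode the model returns its losses
loss = sum(loss_dict.values())

optimizer.zero_grad()
loss.backward()
optimizer.step()
print({k: round(v.item(), 3) for k, v in loss_dict.items()})
```

In practice you would wrap the dataset in a DataLoader, iterate for many epochs, and evaluate with the COCO metrics, but the data flow stays the same as in this single step.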
How do you visualize the COCO dataset?
- Step 1: Download and extract the COCO dataset.
- Step 2: Understand the structure of the COCO format.
- Step 3: Create the COCOParser class.
- Step 4: Load and visualize the dataset (see the sketch below).
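A hedged sketch of the loading and visualization step, using pycocotools and matplotlib instead of a custom COCOParser class (the local paths are assumptions about where you extracted the files):

```python
# Sketch: visualize the annotations of one COCO image with pycocotools + matplotlib.
# Assumes the val2017 images and instances_val2017.json are available locally.
import matplotlib.pyplot as plt
from PIL import Image
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")

img_id = coco.getImgIds()[0]                    # pick the first image
img_info = coco.loadImgs(img_id)[0]
image = Image.open(f"val2017/{img_info['file_name']}")

ann_ids = coco.getAnnIds(imgIds=img_id)
anns = coco.loadAnns(ann_ids)

plt.imshow(image)
coco.showAnns(anns)                             # draws the segmentation masks
plt.axis("off")
plt.show()
```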
How to build a COCO dataset?
- Step 1: Plan Dataset. ...
- Step 2: Setup Project. ...
- Step 2.5: Define Annotation Classes. ...
- Step 3: Gather Data. ...
- Step 4: Upload Files. ...
- Step 5: Organize Files into Datasets. ...
- Step 6: Distribute Annotation Tasks. ...
- Step 7: Annotate Data.
What is COCO-Text?
The goal of COCO-Text is to advance the state of the art in text detection and recognition in natural images. The dataset is based on the MS COCO dataset, which contains images of complex everyday scenes. The images were not collected with text in mind and thus contain a broad variety of text instances.
What is the COCO file format?
COCO is a JSON-based format for specifying large-scale object detection, segmentation, and captioning datasets.
What animals are in the COCO dataset?
The full COCO class list contains ten animal categories: bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, and giraffe. The smaller “COCO animals” subset used in some tutorials has 800 training images and 200 test images covering 8 of these classes: bear, bird, cat, dog, giraffe, horse, sheep, and zebra.
How big is the COCO dataset?
COCO contains roughly 330,000 images with 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, and keypoint annotations for 250,000 people. The size of the dataset is about 25 GB.
What are the annotations in the COCO dataset?
According to cocodataset.org/#format-data, COCO has five annotation types: object detection, keypoint detection, stuff segmentation, panoptic segmentation, and image captioning. The annotations are stored using JSON.
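To make the JSON structure concrete, here is a heavily abbreviated, made-up instances-style annotation file written from Python (real files contain many more fields and entries):

```python
# Sketch: the core structure of a COCO-style instances annotation file.
# All values below are made-up placeholders; real files are much larger.
import json

coco_annotations = {
    "info": {"description": "Tiny COCO-style example", "year": 2023},
    "images": [
        {"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "person", "supercategory": "person"},
    ],
    "annotations": [
        {
            "id": 101,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100.0, 120.0, 80.0, 200.0],   # [x, y, width, height] in pixels
            "area": 16000.0,
            "iscrowd": 0,
            "segmentation": [[100, 120, 180, 120, 180, 320, 100, 320]],  # polygon
        },
    ],
}

with open("tiny_instances.json", "w") as f:
    json.dump(coco_annotations, f, indent=2)
```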
What are the COCO tools?
The COCO API is the official toolset for loading, parsing, and visualizing COCO annotations; its Python version is distributed as pycocotools. The dataset itself offers object segmentation, recognition in context, and superpixel stuff segmentation.
How to convert a COCO dataset to YOLO format?
- Step 1: Create a free Roboflow public workspace. Roboflow is a conversion tool for computer vision annotation formats.
- Step 2: Upload your data into Roboflow.
- Step 3: Generate a dataset version.
- Step 4: Export the dataset version in YOLO format.
Alternatively, the bounding boxes can be converted directly with a few lines of code, as sketched below.
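As a minimal sketch of the underlying box conversion (remapping category ids to zero-based, contiguous class indices is left out for brevity):

```python
# Sketch: convert a COCO bounding box to a YOLO-format label line.
# COCO boxes are [x_min, y_min, width, height] in absolute pixels;
# YOLO expects "class x_center y_center width height" normalized to [0, 1].
def coco_bbox_to_yolo(bbox, img_width, img_height, class_index):
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_width
    y_center = (y_min + h / 2) / img_height
    return (f"{class_index} {x_center:.6f} {y_center:.6f} "
            f"{w / img_width:.6f} {h / img_height:.6f}")

# Example: an 80x200 px box at (100, 120) in a 640x480 image, class 0 ("person").
print(coco_bbox_to_yolo([100, 120, 80, 200], 640, 480, 0))
# -> "0 0.218750 0.458333 0.125000 0.416667"
```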
How do you keep only certain categories in a COCO annotation file?
- Look through your annotation file, e.g. instances_val2017.json.
- Remove any extra categories.
- Give the remaining categories new ids (counting up from 1).
- Find any annotations that reference the desired categories.
- Filter out extra annotations.
- Filter out images not referenced by any annotations.
A short sketch of these steps is shown below.
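A hedged sketch of those steps in plain Python, keeping only the person and dog categories as an example (the file names are assumptions):

```python
# Sketch: keep only selected categories in a COCO instances annotation file.
import json

KEEP = {"person", "dog"}  # example categories to keep

with open("instances_val2017.json") as f:
    data = json.load(f)

# 1) Keep the desired categories and give them new ids counting up from 1.
kept_cats = [c for c in data["categories"] if c["name"] in KEEP]
id_map = {c["id"]: new_id for new_id, c in enumerate(kept_cats, start=1)}
for c in kept_cats:
    c["id"] = id_map[c["id"]]

# 2) Keep only annotations that reference those categories, remapping category_id.
kept_anns = [a for a in data["annotations"] if a["category_id"] in id_map]
for a in kept_anns:
    a["category_id"] = id_map[a["category_id"]]

# 3) Keep only images that are still referenced by at least one annotation.
used_image_ids = {a["image_id"] for a in kept_anns}
kept_images = [img for img in data["images"] if img["id"] in used_image_ids]

data.update(categories=kept_cats, annotations=kept_anns, images=kept_images)

with open("instances_val2017_filtered.json", "w") as f:
    json.dump(data, f)
```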
What is COCO Annotator?
COCO Annotator is a web-based image annotation tool designed for versatile and efficient labeling of images to create training data for image localization and object detection.
What is the typical image size in the COCO dataset?
COCO images vary in resolution; a typical image is around 640×480 pixels.
How many classes are in coco.names?
As written in the original research paper, there are 91 object categories in COCO, but the coco.names file used by detectors such as YOLO lists the 80 classes that appear in the released annotations.
How long does it take to train on the COCO dataset?
Training time depends on the model, the amount of data, and the hardware. As a reference point, training a small model for 300 iterations takes about 6 minutes on a Colab K80 GPU. If you switch to your own dataset, change the number of classes, learning rate, and max iterations accordingly.
How can I create my own dataset?
- Collect the raw data.
- Identify feature and label sources.
- Select a sampling strategy.
- Split the data.
When did the COCO dataset come out?
The first version of the MS COCO (Microsoft Common Objects in Context) dataset was released in 2014. It is widely used to benchmark the performance of computer vision methods, and due to its popularity, the COCO annotation format is often the go-to format when creating a new custom object detection dataset.