AIMultiple Research

Image Data Collection in 2024: What it is and Best Practices

From retail to healthcare, computer vision (CV) is revolutionizing almost every industry. However, successful development and implementation of computer vision depend on high-quality image data. While some work with data collection services, others gather their own data to train their computer vision systems.

This article explores image data collection and how business leaders and developers can gather relevant image data.

If you wish to work with an image data collection service, click here.

What is image data collection for AI?

Image data collection for AI/ML training involves gathering and preparing images to be added to datasets that will train AI/ML algorithms. 

This can include images of people, animals, objects, locations, etc. For instance, a CV-based system for detecting the quality of fruits on a conveyor belt might require thousands of images to be trained. Such datasets can be large or small, depending on the scope of the project.

Here is a sample dataset for a quality control computer vision system that scans apples. 

a collection of apple images with different variations in its health, shape, and colors
Source: researchgate

An image dataset for a facial recognition system might look something like this (only larger):

Face image dataset to train a facial recognition system
Source: researchgate

Challenges in image data collection

Gathering image data can be costly

A computer vision system requires a lot of images to be trained; however, this varies with the scope of the project. 

According to a study, a facial recognition system trained on 300,000 or 400,000 images achieves significantly higher accuracy than one trained on 100,000 or 200,000 images (see Figure 1).

However, collecting datasets of this magnitude, as useful as they might be, requires expensive cameras and an additional workforce.

Figure 1. Facial recognition accuracy with different dataset sizes

A bar chart showing the accuracy of the computer vision system according to the size of the image dataset.

For instance, a facial recognition system that is being deployed in different countries will require data from the population of those specific countries. If this is done in-house, it can raise the project’s budget to unreasonable heights.

Gathering images can also raise ethical and legal concerns. For instance, a facial recognition system might require face data for training. But since face images are biometric data, it can be difficult to gather and use them.

Other biometric image data that CV systems may use includes fingerprint images and retina scans. Companies that do not follow the relevant ethical and legal requirements can end up facing expensive lawsuits.

Watch this video to learn about one of Facebook’s lawsuits over unethically collecting users’ biometric data:

Gathering data of any sort can be biased

Another issue when collecting data is the risk of the dataset becoming biased. This unconscious bias is transferred from the collector to the dataset to the AI/ML model that’s being trained.
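One simple way to spot this kind of imbalance is to count how often each group or class appears in the dataset. The following is a minimal sketch, assuming every image carries a label; the function name `find_underrepresented`, the 10% threshold, and the group labels are illustrative assumptions, not values from a study.

```python
from collections import Counter


def find_underrepresented(labels, min_share=0.10):
    """Return {label: share} for labels whose share is below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        label: count / total
        for label, count in counts.items()
        if count / total < min_share
    }


# Hypothetical group labels attached to each collected image.
sample_labels = ["group_a"] * 70 + ["group_b"] * 25 + ["group_c"] * 5
flagged = find_underrepresented(sample_labels, min_share=0.10)
print(flagged)  # {'group_c': 0.05}
```

A check like this only flags groups that appear in the data at least once. A group that is missing entirely would need to be compared against an explicit list of expected groups.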

Best practices to consider while collecting image data

Leverage automation

Leveraging automation to collect image data can reduce bias because automated collection is less influenced by the individual collector's preferences and prejudices. You can program a bot or leverage an automated data collection application to automatically add data to your existing dataset.

However, this can create storage issues since the dataset will increase every time someone uses the program. In such a scenario, setting parameters to regulate the collection can be helpful.
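As an illustration of such parameters, the sketch below caps the number of stored images, rejects oversized files, and skips exact duplicates by hashing the file contents. The class name `ImageCollector` and the default limits are hypothetical; a real pipeline would choose limits based on its own storage budget.

```python
import hashlib


class ImageCollector:
    """Stores raw image bytes, skipping duplicates and enforcing limits."""

    def __init__(self, max_images=1000, max_bytes_per_image=5_000_000):
        # Both limits are illustrative defaults, not recommendations.
        self.max_images = max_images
        self.max_bytes_per_image = max_bytes_per_image
        self._images = {}  # SHA-256 content hash -> image bytes

    def add(self, image_bytes):
        """Return True if the image was stored, False if a rule rejected it."""
        if len(self._images) >= self.max_images:
            return False  # dataset is full
        if len(image_bytes) == 0 or len(image_bytes) > self.max_bytes_per_image:
            return False  # empty or oversized file
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in self._images:
            return False  # exact duplicate of a stored image
        self._images[digest] = image_bytes
        return True

    def __len__(self):
        return len(self._images)


collector = ImageCollector(max_images=2)
results = [
    collector.add(b"img-1"),
    collector.add(b"img-1"),  # duplicate, rejected
    collector.add(b"img-2"),
    collector.add(b"img-3"),  # over the cap, rejected
]
print(results, len(collector))  # [True, False, True, False] 2
```

Hashing the bytes only catches exact copies. Near-duplicates, such as resized or re-encoded versions of the same photo, would need a perceptual comparison that is outside the scope of this sketch.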

Having said that, automation in itself can be another issue. As Jo and Gebru mention in their paper, “Taking data in masses without critiquing its origin, motivation, platform, and potential impact results in minimally supervised data collection.”

Leverage crowdsourcing

Crowdsourcing can be another way to overcome the challenges described above. Since the crowdsourcing model works with data collectors from around the world through a micro-job mechanism, the collected data can be diverse.

Crowdsourcing data collection is one way to reduce AI bias.

In our data collection ethics article, we list the following best practices:

  • Providing ethical training to all data collection staff
  • Obtaining the consent of the person providing the image data
  • Ensuring that the consent given is explicit, clear, and understandable
  • Ensuring that ethical considerations are applied consistently to all data providers
  • Ensuring that data providers have control over how their images are used

Ensure consistency

Consistency in image data is an important factor in determining the model’s performance level. If 500 images need to be in front of a green background, then that should be the case for all of them.
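A consistency requirement like this can be checked automatically if each image has a metadata record. The sketch below assumes such records exist as dictionaries; the field names (`width`, `height`, `background`), the file names, and the function `find_violations` are hypothetical and only serve the example.

```python
def find_violations(records, spec):
    """Return {file name: [fields that do not match the spec]}."""
    violations = {}
    for record in records:
        mismatched = [
            key for key, expected in spec.items()
            if record.get(key) != expected
        ]
        if mismatched:
            violations[record["file"]] = mismatched
    return violations


# Hypothetical requirements every image in the dataset must meet.
spec = {"width": 640, "height": 480, "background": "green"}

# Hypothetical metadata records for three collected images.
records = [
    {"file": "img_001.jpg", "width": 640, "height": 480, "background": "green"},
    {"file": "img_002.jpg", "width": 640, "height": 480, "background": "blue"},
    {"file": "img_003.jpg", "width": 800, "height": 600, "background": "green"},
]

violations = find_violations(records, spec)
print(violations)
# {'img_002.jpg': ['background'], 'img_003.jpg': ['width', 'height']}
```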

Ensure quality

The quality of the overall image dataset is evaluated through its:

  • Relevance to the scope of the project
  • Comprehensiveness so that it can cover all requirements of the AI/ML model
  • Authenticity and validity
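Two of these criteria lend themselves to simple automated checks. The sketch below tests validity by looking for the standard PNG or JPEG file signature at the start of each file, and tests comprehensiveness by confirming that every required class has at least one valid image. The function names and the apple-quality class labels are illustrative assumptions, and a signature check does not prove that a file can be fully decoded.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"      # first 3 bytes of every JPEG file


def detect_format(data):
    """Return 'png', 'jpeg', or None based on the file signature."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


def quality_report(items, required_labels):
    """items is a list of (label, image_bytes) pairs."""
    valid_labels = set()
    invalid_files = 0
    for label, data in items:
        if detect_format(data) is None:
            invalid_files += 1
        else:
            valid_labels.add(label)
    return {
        "invalid_files": invalid_files,
        "missing_labels": sorted(set(required_labels) - valid_labels),
    }


# Hypothetical samples: placeholder bytes stand in for real image files.
items = [
    ("healthy", PNG_SIGNATURE + b"placeholder"),
    ("bruised", JPEG_SIGNATURE + b"placeholder"),
    ("rotten", b"not an image"),
]
report = quality_report(items, ["healthy", "bruised", "rotten"])
print(report)  # {'invalid_files': 1, 'missing_labels': ['rotten']}
```

Relevance and authenticity are harder to automate and usually still require human review.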

To learn more about data collection quality assurance, check out this quick read.

You can also check our data-driven list of data collection/harvesting services to find the option that best suits your needs.

Cem Dilmegani
Principal Analyst

Shehmir Javaid
Shehmir Javaid is an industry analyst at AIMultiple. He has a background in logistics and supply chain technology research. He completed his MSc in logistics and operations management and his Bachelor's in international business administration at Cardiff University, UK.
