Building an image dataset
ashish
Posted on November 21, 2020
Creating an image dataset is a bit of a hectic process. To my understanding, it basically consists of the pipeline below.
- Model Bias.
- What's your model's goal?
- Ways to collect images.
- Cleaning the data.
- Resizing the images.
Model Bias
Can you solve this riddle??
A man and his son are in a terrible accident and are rushed to the hospital in critical care. The doctor looks at the boy and exclaims "I can't operate on this boy, he's my son!" How could this be?
Most people, me included 😃, get stuck because they assume the doctor must be a man, when the doctor is actually the boy's mother. This is an example of human bias.
If you train your model mostly on cat images and still expect it to do well at detecting both cats and dogs, this happens:
source: Sidney Harris
For more details on data bias, you can go through these excellent slides from cs224n: Bias in the Vision and Language of Artificial Intelligence
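One cheap way to spot the cats-versus-dogs kind of imbalance before training is to count the images per class. The sketch below assumes a folder layout of root/<class_name>/<image files>; that layout, the function names, and the extension list are my assumptions for illustration, not something this post prescribes.

```python
from collections import Counter
from pathlib import Path

# Assumption: the dataset is laid out as root/<class_name>/<image files>.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images_per_class(root):
    """Return a Counter mapping each class folder name to its image count."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if not class_dir.is_dir():
            continue
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


def imbalance_ratio(counts):
    """Largest class size divided by smallest; 1.0 means perfectly balanced."""
    sizes = list(counts.values())
    if not sizes or min(sizes) == 0:
        return float("inf")
    return max(sizes) / min(sizes)
```

A ratio far above 1.0 is a hint to collect more images for the smaller classes. What counts as "too imbalanced" depends on your model goal.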
Ways to collect data
Here's just a sample list of sources for collecting image data:
- Search engines 🔍 (Google, Bing, Yandex, DuckDuckGo)
- Social media, through hashtags #️⃣
- YouTube videos and Flickr 📹
- Take a camera or mobile phone, go around, and collect the data yourself.
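Whichever source you pick, you usually end up with a list of image URLs to fetch. Here is a minimal sketch of that last step using only the Python standard library. The function name and the naming scheme (a hash of the URL) are assumptions made for illustration; the scrapers listed later in this post handle the harder part of finding the URLs.

```python
import hashlib
import http.client
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_images(urls, out_dir, timeout=10):
    """Download each URL into out_dir; return (saved_paths, failed_urls)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        # Name files by a hash of the URL so repeated runs don't collide.
        suffix = Path(urlparse(url).path).suffix.lower() or ".jpg"
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + suffix
        target = out_dir / name
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                target.write_bytes(response.read())
            saved.append(target)
        except (OSError, ValueError, http.client.HTTPException):
            # Network errors and timeouts are OSError subclasses;
            # ValueError covers malformed URLs.
            failed.append(url)
    return saved, failed
```

Keeping the list of failed URLs lets you retry later instead of silently ending up with a smaller dataset.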
Cleaning the data
- Trash images that are corrupted or can't be loaded.
- Find duplicate images (the same image often comes back from several search engines).
- Do what's necessary...
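As a first pass at the "can't be loaded" step, you can check whether each file even starts like an image. Scrapers often save an HTML error page under a .jpg name, and a file-signature check catches that cheaply. This is a sketch under assumptions: the function names are mine, only a handful of formats are covered, and it will not catch truncated downloads. For those you still need to fully decode each file with an image library.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
    "bmp": (b"BM",),
}


def sniff_image_format(path):
    """Return the format name if the file starts with a known signature, else None."""
    with open(path, "rb") as f:
        header = f.read(16)
    for name, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):
            return name
    # WebP files start with "RIFF", then 4 size bytes, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None


def find_suspect_files(folder):
    """List files that are empty or do not start with a known image signature."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and sniff_image_format(p) is None
    )
```

Review the returned list before deleting anything, since a valid image in a format not listed above would also show up as suspect.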
Resizing the images
- Resize while maintaining the aspect ratio.
- If your images come in different sizes, try resizing with padding (filling the extra pixels with black or white).
- The smaller your images, the faster your model trains.
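The arithmetic behind "resize with padding" can be sketched without any image library: scale the longer side down to the target, scale the other side by the same factor, and split the leftover space into padding. The function name and the choice of a square target with centered padding are assumptions for illustration. The actual pixel work would be done by whatever image library you use.

```python
def letterbox_geometry(width, height, target):
    """Scale (width, height) to fit a target x target square, keeping aspect ratio.

    Returns (new_width, new_height, pad_left, pad_top, pad_right, pad_bottom).
    """
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("width, height and target must be positive")
    # One scale factor for both sides preserves the aspect ratio.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Whatever is left over becomes padding, split evenly on both sides.
    pad_w, pad_h = target - new_w, target - new_h
    pad_left, pad_top = pad_w // 2, pad_h // 2
    return new_w, new_h, pad_left, pad_top, pad_w - pad_left, pad_h - pad_top
```

For example, a 1920x1080 frame fitted into 224x224 becomes 224x126 with 49 pixels of padding on the top and on the bottom. It also shows why smaller images train faster: 224x224 is 50,176 pixels per image versus 2,073,600 for the original frame.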
Code you need (💪 open source)
Some of these require ChromeDriver and Selenium.
For downloading images from search engines:
- Image Downloader by sczhengyabin [google | bing]
- google images download by hardikvasa
- yandex images download by bobokvsky
- Bulk Bing Image downloader by ostrolucky
- Flickr image-scraping software developed by Ultralytics LLC
For downloading from Instagram based on hashtags:
- Instagram-scraper by arc298
For cleaning up duplicate images:
- Imagededup by idealo✨ 😎
The imagededup package uses a convolutional neural network (CNN) and hashing algorithms to find duplicate images.
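To make the hashing idea concrete, here is a much simpler stand-in that uses only the standard library. It is not imagededup's API and not how imagededup works internally. It hashes the raw file bytes, so it only finds byte-for-byte identical copies and will miss resized or recompressed versions, which is the gap that perceptual hashing and CNN-based methods are meant to fill. The function name is an assumption for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(folder):
    """Group files under folder by the SHA-256 of their bytes.

    Returns a list of groups (each a sorted list of paths) with 2+ identical files.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

From each returned group you would keep one file and trash the rest.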