Image scraping with Python

petercour

Posted on July 7, 2019

The web has many different types of content: images, video, text, audio and more. You can use Python to download data from the web.

The program below downloads images from two search engines, Google and Baidu.

Why these two? Because both index a very large number of images.

Each function takes the search keyword as its only argument:

download_baidu(word)
download_google(word)

Code to scrape images:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import re
import subprocess
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

def download_baidu(keyword):
    # URL-encode the keyword so spaces and non-ASCII characters are safe in the query
    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + quote(keyword) + '&ct=201326592&v=flip'
    result = requests.get(url, timeout=10)
    html = result.text
    # The result page embeds each image address as "objURL":"..."
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 0

    for each in pic_url:
        print(each)
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.RequestException:
            # Covers connection errors as well as timeouts
            print('exception')
            continue

        # Saved in the current directory, e.g. pictures_cat_0.jpg
        filename = 'pictures_' + keyword + '_' + str(i) + '.jpg'
        with open(filename, 'wb') as fp:
            fp.write(pic.content)
        i += 1

def download_google(word):
    url = 'https://www.google.com/search?q=' + quote(word) + '&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'
    page = requests.get(url, timeout=10).text
    soup = BeautifulSoup(page, 'html.parser')

    for raw_img in soup.find_all('img'):
        link = raw_img.get('src')
        # Skip <img> tags without an http(s) source (missing src, inline data: URIs)
        if not link or not link.startswith('http'):
            continue
        # Requires wget on the PATH; the list form avoids passing the URL through a shell
        subprocess.run(['wget', link])

if __name__ == '__main__':
    word = input("Input key word: ")
    download_baidu(word)
    # download_google(word)
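
To see what the regular expression in download_baidu picks out, it can be tried on a small string without touching the network. This is only a sketch: the sample fragment below is made up for illustration, the helper name extract_image_urls is hypothetical, and the markup of the real result page may differ or may have changed since this was written.

```python
import re

# Same pattern as in download_baidu: capture whatever sits between
# "objURL":" and the next ",
OBJ_URL = re.compile(r'"objURL":"(.*?)",', re.S)

def extract_image_urls(html):
    """Return every objURL value found in the given page source."""
    return OBJ_URL.findall(html)

# Hand-written sample, NOT real Baidu output
sample = (
    '{"objURL":"http://example.com/a.jpg","width":1},'
    '{"objURL":"http://example.com/b.png","width":2}'
)

print(extract_image_urls(sample))
# ['http://example.com/a.jpg', 'http://example.com/b.png']
```

The non-greedy (.*?) matters here: a greedy (.*) would run from the first objURL to the last quote-comma pair in the page and return one huge match.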

The script saves the images into the current working directory. The downside of this implementation is that it does not use threading: it downloads one image at a time, and most of that time is spent waiting on the network.

If you want to download lots of images, you should use threading so that several downloads run in parallel.
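
A threaded version could look roughly like the sketch below. It is not part of the original script: the names fetch and download_all and the fetcher, directory and max_workers parameters are made up for this example, and it uses urllib.request from the standard library instead of requests so that it has no dependencies. The thread pool comes from concurrent.futures.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """Return the body of url as bytes, or None if the download fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except (OSError, ValueError):
        # OSError covers network errors and timeouts, ValueError covers malformed URLs
        return None

def download_all(urls, prefix, directory='.', fetcher=fetch, max_workers=8):
    """Download urls in parallel and save them as <prefix>_<n>.jpg."""
    saved = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetcher in worker threads and yields results in input order
        for i, content in enumerate(pool.map(fetcher, urls)):
            if content is None:
                continue  # skip failed downloads
            filename = os.path.join(directory, prefix + '_' + str(i) + '.jpg')
            with open(filename, 'wb') as fp:
                fp.write(content)
            saved.append(filename)
    return saved
```

In download_baidu, the for loop over pic_url could then be replaced by a call such as download_all(pic_url, 'pictures_' + keyword). Threads fit this job because each download spends most of its time waiting for the server, so the waits overlap. How much faster it gets depends on the network and on how many parallel requests the remote hosts tolerate.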

Either way, it's interesting to compare the search results from Google and Baidu.
