Scraping

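The basic tools for web scraping are the standard library module urllib.request or Requests (install with pip install requests or conda install requests), together with the regular expression module re. A higher-level library is Beautiful Soup (install with pip install beautifulsoup4 or conda install beautifulsoup4). For details, the article "PythonでWebスクレイピングする時の知見をまとめておく" (notes on web scraping with Python) is a useful reference.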

For example, let us find out how many links this site has. First, the older approach with urllib.request:

import urllib.request
import re

with urllib.request.urlopen('https://okumuralab.org/~okumura/python/') as f:
    s = f.read().decode('utf-8')
    a = re.findall('<a href="(.*?)"', s)

Using requests:

import requests
import re

r = requests.get('https://okumuralab.org/~okumura/python/')
r.raise_for_status()  # error check
r.encoding = 'UTF-8'  # just in case
a = re.findall('<a href="(.*?)"', r.text)

Now a holds a list such as ['../', '../stat/', 'zero.html', ...].
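To answer the original question of how many links there are, taking the length of the list is enough, and relative hrefs can be resolved against the page URL with urllib.parse.urljoin. The following is a minimal sketch; the short HTML string is a hypothetical stand-in for the downloaded page text so that the example runs without network access.

```python
import re
from urllib.parse import urljoin

BASE = 'https://okumuralab.org/~okumura/python/'

# Hypothetical sample standing in for the downloaded page text (s or r.text)
html = ('<p><a href="../">up</a> <a href="../stat/">stat</a> '
        '<a href="zero.html">zero</a></p>')

a = re.findall(r'<a href="(.*?)"', html)
print(len(a))                             # number of links found

# Turn relative links into absolute URLs
absolute = [urljoin(BASE, h) for h in a]
print(absolute)
```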

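Even when the page contains <meta charset="UTF-8">, r.encoding may come out as ISO-8859-1, because Requests decides the encoding from the HTTP Content-Type header and not from the HTML itself. If you want UTF-8 to be detected automatically, the server must be configured with AddDefaultCharset UTF-8 (an Apache directive).
Besides calling r.raise_for_status(), another way to check for errors is to test whether r.status_code is 200 (OK).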
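When you cannot change the server, one workaround is to choose the encoding yourself. The sketch below uses a hypothetical helper, pick_encoding (not part of Requests), which prefers the charset in the Content-Type header, falls back to a <meta> declaration near the top of the body, and otherwise returns a default. It is a rough illustration using regular expressions, not a full implementation of the HTML encoding rules.

```python
import re

def pick_encoding(content_type, body, default='utf-8'):
    """Return the charset from the header, else from a <meta> tag, else default."""
    m = re.search(r'charset=["\']?([\w.:-]+)', content_type or '', re.I)
    if m:
        return m.group(1)
    # Look only at the start of the body, where <meta charset=...> normally is
    m = re.search(rb'<meta[^>]*charset=["\']?([\w.:-]+)', body[:2048], re.I)
    if m:
        return m.group(1).decode('ascii')
    return default

# Header without charset, body with a <meta> declaration
print(pick_encoding('text/html', b'<html><head><meta charset="UTF-8"></head>'))
```

With Requests this could be applied as r.encoding = pick_encoding(r.headers.get('Content-Type'), r.content) before reading r.text. Requests also offers r.apparent_encoding, its own guess based on the content.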

Using Beautiful Soup:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://okumuralab.org/~okumura/python/')
r.raise_for_status()  # error check
r.encoding = 'utf-8'
s = BeautifulSoup(r.text, "html.parser") # or "lxml" or "html5lib"
s                                        # the whole HTML document
s.select('a')                            # the <a ...> tags
[x.get('href') for x in s.select('a')]   # the href of each <a ...> tag

In general, regular expressions cannot cope with HTML written in unexpected ways, so a higher-level method such as Beautiful Soup is considered preferable.
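The difference is easy to see on a small example. The sketch below compares the regular expression used above with a real HTML parser on three differently written links. To keep it runnable without installing anything, it uses the standard library's html.parser.HTMLParser directly rather than Beautiful Soup; the class name HrefCollector and the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case
        if tag == 'a':
            for key, value in attrs:
                if key == 'href' and value is not None:
                    self.hrefs.append(value)

# Hypothetical HTML: extra attribute, upper case, and single quotes
html = '''<a href="a.html">A</a>
<a class="x" href="b.html">B</a>
<A HREF='c.html'>C</A>'''

by_regex = re.findall(r'<a href="(.*?)"', html)

collector = HrefCollector()
collector.feed(html)
collector.close()
by_parser = collector.hrefs

print(by_regex)   # only the link written exactly as the pattern expects
print(by_parser)  # all three links
```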

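The first argument of BeautifulSoup() may also be the raw bytes r.content, before any encoding has been applied; the encoding is then detected automatically. The second argument is the parser. It can be omitted, but for reproducibility it is considered better to specify it. "html.parser" is the one that comes with Python. The faster "lxml" or the more browser-like "html5lib" is recommended, but lxml or html5lib must first be installed with pip or conda.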

A binary file can be downloaded like this, for example:

import requests
import pathlib

r = requests.get('https://okumuralab.org/~okumura/python/img/iris.png')
r.raise_for_status()  # stop here rather than save an error page as iris.png
pathlib.Path("iris.png").write_bytes(r.content)

In addition, r.headers['Last-Modified'] gives the last-modified time, so it should also be possible to set the file's timestamp with os.utime().
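One way this might be done is sketched below. The helper name apply_last_modified is hypothetical. It parses the HTTP date string with email.utils.parsedate_to_datetime from the standard library and passes the resulting timestamp to os.utime() as both access time and modification time.

```python
import os
from email.utils import parsedate_to_datetime

def apply_last_modified(path, last_modified):
    """Set the access and modification times of path from an HTTP date string."""
    timestamp = parsedate_to_datetime(last_modified).timestamp()
    os.utime(path, (timestamp, timestamp))
    return timestamp
```

After saving the file as above, this could be called as apply_last_modified("iris.png", r.headers['Last-Modified']). Not every server sends Last-Modified, so it is safer to check that the header is present first.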

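The User-Agent (the type of browser) is recorded in the remote server's log as something like "python-requests/2.27.1". If you do not like that, you can change it to whatever you want: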

r = requests.get('https://okumuralab.org/~okumura/python/',
                 headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15'})
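The same thing can be done with urllib.request by building a Request object with a headers dictionary. In the sketch below the User-Agent string is a hypothetical example. Constructing the Request does not contact the server, so the header can be inspected before anything is sent.

```python
import urllib.request

# Hypothetical User-Agent string; replace it with whatever you want logged
UA = 'my-scraper/0.1 (+https://example.com/contact)'

req = urllib.request.Request('https://okumuralab.org/~okumura/python/',
                             headers={'User-Agent': UA})

# urllib stores header names as 'User-agent' (first letter capitalized only)
print(req.get_header('User-agent'))

# To send the request:
#     with urllib.request.urlopen(req) as f:
#         s = f.read().decode('utf-8')
```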

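[The following was tried on 2019-09-05.] Finally, here is a simple code example that scrapes the thumbnails (256x256) of images in the "科学" (science) category from Irasutoya. It assumes that a subdirectory named irasutoya exists under the current directory.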

import requests
import re
import pathlib
import time

path = pathlib.Path("irasutoya")
names = [n.name for n in path.iterdir()]  # file names already downloaded

url = 'https://www.irasutoya.com/search/label/科学'
while True:
    print("Requesting", url)
    r = requests.get(url)
    if r.status_code != 200:
        print("Status:", r.status_code)
        break
    r.encoding = 'utf-8'
    a = re.findall(r'document\.write\(bp_thumbnail_resize\("(.*?)"', r.text)
    for i in a:
        name = re.sub("^.*/", "", i)
        if name in names:
            continue
        names.append(name)
        png256 = re.sub('/s72-c/', '/s256-c/', i)
        print("Downloading", png256)
        time.sleep(1)  # always wait between requests so as not to burden the server
        r1 = requests.get(png256)
        if r1.status_code != 200:
            print("Status:", r1.status_code)
            continue
        pathlib.Path("irasutoya/" + name).write_bytes(r1.content)
    m = re.search("<a class='blog-pager-older-link' href='(.*?)'", r.text)
    if not m:
        break
    url = m.group(1)

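While we are at it, the following fetches all the 256×256 thumbnails found under Irasutoya's various face icon pages (humans only, excluding animals and monsters) and puts them in the irasutoya2 subdirectory of the current directory: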

import requests
import re
import pathlib
import time

path = pathlib.Path("irasutoya2")
names = [n.name for n in path.iterdir()]  # file names already downloaded

urls = [
    'http://www.irasutoya.com/2013/10/blog-post_5077.html',
    'http://www.irasutoya.com/2013/10/blog-post_3974.html',
    'http://www.irasutoya.com/2013/10/blog-post_9098.html',
    'http://www.irasutoya.com/2013/10/blog-post_6907.html',
    'http://www.irasutoya.com/2013/10/blog-post_872.html',
    'http://www.irasutoya.com/2013/10/blog-post_8683.html',
    'http://www.irasutoya.com/2013/10/blog-post_2022.html',
    'http://www.irasutoya.com/2013/10/blog-post_1473.html',
    'http://www.irasutoya.com/2015/10/blog-post_59.html',
    'http://www.irasutoya.com/2015/10/blog-post_29.html',
    'http://www.irasutoya.com/2015/10/blog-post_405.html',
    'http://www.irasutoya.com/2015/10/blog-post_135.html' ]

for u in urls:
    print("Requesting", u)
    r = requests.get(u)
    if r.status_code != 200:
        print("Status:", r.status_code)
        break
    r.encoding = 'utf-8'
    a = re.findall('"([^"]*/s800/[^"]*)"', r.text)
    for i in a:
        name = re.sub("^.*/", "", i)
        if name in names:
            continue
        names.append(name)
        png256 = re.sub('/s800/', '/s256-c/', i)
        if png256.startswith('//'):  # protocol-relative URL: add a scheme
            png256 = 'http:' + png256
        print("Downloading", png256)
        time.sleep(1)
        r1 = requests.get(png256)
        if r1.status_code != 200:
            print("Status:", r1.status_code)
            continue
        pathlib.Path("irasutoya2/" + name).write_bytes(r1.content)
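Both scripts do the same two string manipulations: they rewrite the size segment of the image URL to /s256-c/ and take the part after the last slash as the file name. As a sketch, these steps could be factored into small functions that are easy to test without touching the network. The function names thumbnail_url and file_name and the example URL are hypothetical, and the pattern assumes the size segment looks like /s72-c/ or /s800/ as in the code above.

```python
import re

def thumbnail_url(url, size=256):
    """Rewrite a size segment such as /s72-c/ or /s800/ to /s<size>-c/."""
    url = re.sub(r'/s\d+(-c)?/', f'/s{size}-c/', url, count=1)
    if url.startswith('//'):      # protocol-relative URL: add a scheme
        url = 'http:' + url
    return url

def file_name(url):
    """Return the part of the URL after the last slash."""
    return url.rsplit('/', 1)[-1]

# Hypothetical image URL in the style the scripts above expect
example = '//1.bp.blogspot.com/-abc/XYZ/s800/kagaku_test.png'
print(thumbnail_url(example))
print(file_name(example))
```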

Haruhiko Okumura (奥村晴彦)

Last modified: 2022-05-08 15:38:14 JST