Writing an Artwork Scraper

Swagato Chatterjee

July 2019

Background

To prepare the dataset for the project, I needed artwork image-iconography pairs. The Iconclass project tackles the iconography half of the problem with a coding scheme for iconography partly inspired by the Dewey Decimal System. It is a tree-structured dictionary in which each entry (code) describes an iconography and the search keywords associated with that code, followed by codes for more detailed variants of that iconography. For example, my work was with Christian artworks, so I decided to focus on the New Testament (code: 73), which has more detailed entries like the Birth and Youth of Christ (code: 73B) and is itself part of the Bible (code: 7). Hence, if I could train a model to predict these codes, I would solve the iconography problem. Now, where can I get the images and codes from? Thanks to the RKD Museum's site, we have a wonderful gallery of images with Iconclass codes attached to them (yes, an image might have multiple iconographies).
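
To make the hierarchy concrete, here is a toy sketch (the three codes are the ones mentioned above; the dictionary is only an illustration, not the actual Iconclass data format). Since a child code extends its parent's code, membership in a subtree is just a prefix check:

# A toy slice of the Iconclass hierarchy: a child code extends its parent's
# code, so everything under the New Testament starts with '73'.
iconclass = {
    '7':   'Bible',
    '73':  'New Testament',
    '73B': 'Birth and Youth of Christ',
}

def under_new_testament(code):
    # A code belongs to the New Testament subtree iff it starts with '73'.
    return code.startswith('73')

print(under_new_testament('73B'))  # True
print(under_new_testament('48'))   # False (not under code 73)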

I just have to figure out a way to mine them.

Deciding on the scraping strategy

To scrape the requisite images out of the site, we need both the image files and the Iconclass codes attached to them.

Well, if you are planning to use the frontend for scraping, you are out of luck. However, by a stroke of luck, I discovered something amazing: RKD has an API endpoint! Hence, if I can develop some hacks, I can download the images and the codes. You can do exactly that with the /api/search/{database} query. The database is obviously images, but the devil is in the details. Now, the best way to extract the information is the JSON format (because Python handles it well), but the server doesn't return all the image links, and returns no Iconclass code at all, when I query

api.rkd.nl/api/search/images?filters[iconclass_code]=73*&filters[periode]=1400||1735&format=json

i.e., set the iconclass filter to everything under code 73, which means the New Testament (filters[iconclass_code]=73*), and the time period to the 15th-18th century (filters[periode]=1400||1735). Hence, to get hold of the image links and the codes, we need to add the option fieldset=detail. The whole query now becomes

api.rkd.nl/api/search/images?filters[iconclass_code]=73*&filters[periode]=1400||1735&format=json&fieldset=detail
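
As a quick sanity check, the query can be issued straight from Python. The sketch below uses the third-party requests library and simply lists the fields of the first returned record; the field names it relies on are the ones used later in this post:

import requests

url = ('https://api.rkd.nl/api/search/images'
       '?filters[iconclass_code]=73*&filters[periode]=1400||1735'
       '&format=json&fieldset=detail')

resp = requests.get(url)
doc = resp.json()['response']['docs'][0]  # the first matching artwork record
print(sorted(doc.keys()))                 # includes picturae_images, permalink, iconclass_tekst_search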

If we study the response to this query now, each returned record contains, among other things, an array of image identifiers (picturae_images), the artwork's permalink, and its Iconclass text/code pairs (iconclass_tekst_search).

After tinkering a bit, I came to the conclusion that the most important image is the last one in the picturae_images array. You solve one problem, you are hit with another: you can't download hi-res, watermark-free images from RKD! Cool. The only images you can get hold of are 650x650px thumbnails, so I decided to download those. Since each query returns just a single record, I had to add the start={i} option to the query, where i is an offset into the N images matching the query constraints (so start=i-1 fetches the i-th image). We therefore end up sending 2N requests in total (one for the metadata and one for the image of each record). We can get the value of N from the response field num_queries.
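
Concretely, the paging looks roughly like the sketch below (again with requests; the num_queries field name is the one mentioned above, and the thumbnail URL pattern is the one used in the download code later on):

import requests

url = ('https://api.rkd.nl/api/search/images'
       '?filters[iconclass_code]=73*&filters[periode]=1400||1735'
       '&format=json&fieldset=detail')

first = requests.get(url + '&start=0').json()
N = first['response']['num_queries']    # total number of records matching the query
doc = first['response']['docs'][0]      # start=i returns the record at offset i
thumb = doc['picturae_images'][-1]      # last image in the array
print(N, f'https://images.rkd.nl/rkd/thumb/650x650/{thumb}.jpg')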

Organising the data

Now that we have figured out how to collect the data, let's write functions to fetch and organise it:

# Synchronous version, using the requests library for HTTP.
import requests

data = []  # global list of (code, text, image, permalink, id) rows

def download_images(im, id):
    with requests.Session() as session:
        url = f'https://images.rkd.nl/rkd/thumb/650x650/{im}.jpg'
        with session.get(url) as resp:
            if resp.status_code == 200:
                with open(f'images/{id}.jpg', mode='wb') as f:
                    f.write(resp.content)

def get_info(offset):
    url = ('https://api.rkd.nl/api/search/images?filters[iconclass_code]=73*'
           '&format=json&language=en&filters[periode]=1400||1735'
           f'&fieldset=detail&start={offset}')
    with requests.Session() as session:
        with session.get(url) as resp:
            if resp.status_code == 200:
                q = resp.json()
                return q['response']['docs'][0]

def extract(i):
    resp = get_info(i)
    imgl = resp['picturae_images'][-1]           # last image in the array is the useful one
    permalink = resp['permalink']
    q = [st.split('@') for st in resp['iconclass_tekst_search']]  # 'text@code' entries
    id = permalink.rsplit('/', 1)[-1]            # record id is the last path segment of the permalink
    for sq in q:
        t, c = sq
        if c.startswith('73'):                   # keep only codes under 73 (New Testament)
            data.append((c, t, imgl, permalink, id))  # keeping it at 1NF, data is global
    return imgl, id

The next part is simple: for each of the N records, call extract() and download_images(), then save data as a CSV file (see the sketch below). You will soon notice that this takes a lot of time to download and process. Can we do better? One observation is that download_images() takes far more time than get_info() and extract() combined. If only we could convert it into a pipelined process, where extract() and get_info() don't sit idle.
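
A minimal sketch of that serial driver, using the functions defined above (the value of N comes from the num_queries field; the CSV filename and column names here are my own choices):

import csv

N = 100  # stand-in: in practice, read this from the response's num_queries field

for i in range(N):
    imgl, id = extract(i)        # one metadata query per record; also fills the global data list
    download_images(imgl, id)    # one thumbnail download per record

with open('rkd_73.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['iconclass_code', 'iconclass_text', 'image', 'permalink', 'id'])
    writer.writerows(data)       # data holds the (code, text, image, permalink, id) rows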

Going asynchronous with async-await

The asyncio package in Python 3 allows us to write asynchronous, pipelined code. Check the following YouTube video for a tutorial.

[Video: asyncio]
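
For a flavour of what asyncio buys us, here is a toy example of my own (not from the video): the three simulated fetches run concurrently, so the whole thing takes about one second instead of three.

import asyncio

async def fetch(i):
    await asyncio.sleep(1)   # stand-in for a slow network call
    return i

async def main():
    # all three "fetches" are in flight at the same time
    return await asyncio.gather(fetch(0), fetch(1), fetch(2))

print(asyncio.run(main()))   # [0, 1, 2], after roughly one second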

Now we apply the wisdom passed on by the lecturer and add @unsync decorators (from the unsync library). The code hence becomes:

import aiofiles               # asynchronous file I/O
import aiohttp                # asynchronous HTTP client
from unsync import unsync

@unsync
async def download_images(task):
    im, id = task.result()    # result of the extract() stage
    async with aiohttp.ClientSession() as session:
        url = f'https://images.rkd.nl/rkd/thumb/650x650/{im}.jpg'
        async with session.get(url) as resp:
            if resp.status == 200:
                f = await aiofiles.open(f'images/{id}.jpg', mode='wb')
                await f.write(await resp.read())
                await f.close()

@unsync
async def get_info(offset):
    url = ('https://api.rkd.nl/api/search/images?filters[iconclass_code]=73*'
           '&format=json&language=en&filters[periode]=1400||1735'
           f'&fieldset=detail&start={offset}')
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            if resp.status == 200:
                q = await resp.json()
                return q['response']['docs'][0]

@unsync
def extract(task):
    resp = task.result()
    imgl = resp['picturae_images'][-1]
    permalink = resp['permalink']
    q = [st.split('@') for st in resp['iconclass_tekst_search']]
    id = permalink.rsplit('/', 1)[-1]
    for sq in q:
        t, c = sq
        if c.startswith('73'):
            data.append((c, t, imgl, permalink, id))  # keeping it at 1NF
    return imgl, id
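
A sketch of how these pieces might be chained together, assuming unsync's .then() continuation API (with N and the global data list as before):

# Chain the stages so that metadata queries, extraction and downloads overlap:
# each get_info() future is continued by extract(), whose future is in turn
# continued by download_images().
tasks = [get_info(i).then(extract).then(download_images) for i in range(N)]

for t in tasks:
    t.result()   # block until every pipeline has finished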

This improves the performance significantly: for me, it cut the running time down from 6 hours to 4 hours.

Next: Cleaning up the Data