Scraping an HTTP API with Python and distributed task queues

The company behind the game World of Tanks provides a public HTTP API for third-party developers to create their own applications. API features include OpenID-based login, clan and player statistics, and information about the game. I've already used this API for my attendance tracker application, but now wanted to try something on a larger scale and with different technologies. One idea was to create an application that tracks the clan membership of players over time.

The API offers methods to get a list of clans and the members of a clan, but it returns at most 100 clans per request. There are about 40000 clans on the EU server cluster, which means many requests are required to get the data of all clans. Furthermore, the API is rate-limited and only allows around 10 requests per second (although a higher limit can be negotiated).
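To get a feel for the numbers, a quick back-of-the-envelope calculation helps. The sketch below uses only the approximate figures quoted above; the variable names are illustrative.

```python
import math

TOTAL_CLANS = 40_000       # approximate number of clans on the EU cluster
CLANS_PER_REQUEST = 100    # maximum number of clans returned per request
REQUESTS_PER_SECOND = 10   # approximate rate limit

# Number of requests needed to page through the complete clan list.
list_requests = math.ceil(TOTAL_CLANS / CLANS_PER_REQUEST)

# Lower bound on the time for the clan list alone if the rate limit is fully used.
min_seconds = list_requests / REQUESTS_PER_SECOND

print(list_requests)  # 400
print(min_seconds)    # 40.0
```

Fetching the clan list is therefore cheap; the requests for the members of each clan make up most of the work.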

Celery, an asynchronous task queue, seemed perfectly suited for this. For storing the player clan history itself I chose MongoDB because, as a schema-less document database, it is very convenient to use.

In Python, a Celery task is declared by decorating a function:

Now we can queue up tasks for all possible pages in Celery and have multiple workers execute them in parallel:

get() blocks until all tasks are finished and returns a list of lists (because each get_clans task returns a list) with the JSON response data from the API. We can then start a task for each clan to request the members and store the results in the database.
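Because the result is nested by page, it has to be flattened before one task per clan can be queued. A small sketch with made-up sample data (the field names `clan_id` and `tag` are assumptions about the response format):

```python
from itertools import chain

# Shape returned by get(): one list of clan records per page (sample data).
pages = [
    [{"clan_id": 1, "tag": "AAA"}, {"clan_id": 2, "tag": "BBB"}],
    [{"clan_id": 3, "tag": "CCC"}],
]

# Flatten the list of lists into a single list of clan records.
clans = list(chain.from_iterable(pages))

# The IDs are what a per-clan member task would take as its argument.
clan_ids = [clan["clan_id"] for clan in clans]
print(clan_ids)  # [1, 2, 3]
```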

Workers can be started from the celery command line tool:

In this case, 10 processes handle the tasks in parallel, but Celery can also use threads or green threads.

This works quite well and retrieves the information of 40000 clans and around 550000 players in an hour. With all the information in the local MongoDB, a few interesting statistics are already possible. For example, this histogram shows the member count of clans:

Member count of EU clans

Unsurprisingly, most clans have few members and only 1030 clans have at least 80 members.
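The counting behind such a histogram is simple. The sketch below uses made-up member counts rather than the real data, and a bin width of 10 chosen for illustration; it is not necessarily how the plot above was produced.

```python
from collections import Counter

# Made-up member counts per clan, not real data.
member_counts = [1, 3, 3, 12, 25, 80, 99, 100, 7, 85]

BIN_WIDTH = 10

# Map each clan to the lower edge of its bin, then count clans per bin.
histogram = Counter((count // BIN_WIDTH) * BIN_WIDTH for count in member_counts)

# Number of clans with at least 80 members.
large_clans = sum(1 for count in member_counts if count >= 80)

# Print a simple text histogram, one row per non-empty bin.
for lower_edge in sorted(histogram):
    upper_edge = lower_edge + BIN_WIDTH - 1
    print(f"{lower_edge:3d}-{upper_edge:3d}: {'#' * histogram[lower_edge]}")

print(large_clans)  # 4
```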

This project is mostly for the learning experience and is currently a work in progress. The source code is on GitHub.