
Python: Asyncio and Aiohttp

Introduction

Suppose your program is trying to execute a series of tasks (1 – 6). If each task takes a different amount of time to complete, the program has to wait for each task to finish sequentially before it can proceed to the next.

Asyncio is useful in such scenarios because it lets the program continue running other tasks while waiting for a specific task to complete. To use asyncio, you will need compatible libraries. For example, instead of requests (a popular HTTP library in Python), you would use aiohttp.

Note: To install the aiohttp library on a Windows system, you will need to download and install Microsoft C++ Build Tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/

When to use Asyncio?

  • You want to speed up a specific part of your program that runs a list of tasks sequentially over a large number of items.
  • Suppose you are making API calls based on a list of different values for a parameter; you can use asyncio and aiohttp to make those requests concurrently.
  • You do not need to convert your entire program to async/await syntax. Observe which part of the program is the bottleneck and explore whether asyncio can improve performance on that particular flow.
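As a sketch of the pattern behind these points: the hypothetical fetch() below stands in for an API call, using asyncio.sleep() to simulate network I/O. asyncio.gather() schedules all the calls concurrently, so the total runtime is close to the slowest single call rather than the sum of all of them.

```python
import asyncio
import time

async def fetch(param):
    # Hypothetical stand-in for an API call; sleeps to simulate network I/O
    await asyncio.sleep(0.1)
    return f"result for {param}"

async def main():
    params = ["a", "b", "c", "d", "e"]
    # Schedule all calls concurrently instead of awaiting them one by one
    return await asyncio.gather(*(fetch(p) for p in params))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
print(results)
print(f"{elapsed:.2f}s")  # close to 0.1s, not 0.5s
```

With a real API, fetch() would make the request through aiohttp rather than sleeping, but the concurrency structure stays the same.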

Example: Crawling Wikipedia for info on Football (Soccer) Clubs

In this demo, we are going to perform the list of tasks below:

  1. Read the list of football clubs from a CSV file.
  2. Get the Wikipedia URL of each football club.
  3. Get the Wikipedia HTML page of each football club.
  4. Write the HTML page into an HTML file for each football club.
  5. From the HTML page, parse out information (Full Name, Ground, Founded Date, etc.) using the BeautifulSoup library.
  6. Append each club's information to a DataFrame.
  7. Finally, print out the DataFrame to see if the information is correct.
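The steps above can be sketched with the standard library alone. In this stub, CSV_DATA stands in for the CSV file, fetch_page() returns canned HTML where the real script would await an aiohttp.ClientSession response, and parsing with BeautifulSoup is omitted; the club names and file layout are placeholders, not the ones from the repo.

```python
import asyncio
import csv
import io
import tempfile
from pathlib import Path

CSV_DATA = "club\nLiverpool\nArsenal\nChelsea\n"  # stand-in for the CSV file

async def fetch_page(club):
    # Stand-in for aiohttp: session.get(url) followed by `await resp.text()`
    await asyncio.sleep(0.05)
    return f"<html><body><h1>{club}</h1></body></html>"

async def process(club, out_dir):
    html = await fetch_page(club)
    # Step 4: write the HTML page into a file per club
    path = Path(out_dir) / f"{club}.html"
    path.write_text(html)
    return path

async def main(out_dir):
    # Step 1: read the club list from the CSV
    clubs = [row["club"] for row in csv.DictReader(io.StringIO(CSV_DATA))]
    # Steps 2-4 run concurrently, one co-routine per club
    return await asyncio.gather(*(process(c, out_dir) for c in clubs))

with tempfile.TemporaryDirectory() as d:
    paths = asyncio.run(main(d))
    print([p.name for p in paths])
```

Because the pages are fetched concurrently, the slow network step dominates only once rather than once per club.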

See the synchronous example and asynchronous example in my Github repo. If we execute both scripts, we can see an estimated difference where asyncio completes the execution faster by about 20-30%.

Execution time for Asyncio : 17.885913610458374
Execution time for Synchronous: 23.075875997543335

Refactor Tips

  • As a common practice, a co-routine main is defined and run in an event loop (e.g. asyncio.run(main())). Inside main, all the other co-routines are awaited.
  • If each request has a consistent response time, you may prefer to stick to the synchronous approach. For example, if you are using Pandas, apply() on a function is often sufficient. For parts of the program that are bottlenecks, try asyncio and measure whether the speed actually improves.

Key Terms

Event Loop. You must use an event loop to run a co-routine.

# Running event loop for Python 3.7+
asyncio.run(main())

# Older syntax, before Python 3.7
loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
finally:
    loop.close()

Co-routine

async / await. This is the syntax for defining co-routines in Python. You declare a co-routine by putting async def in front of a function definition. await is used inside a co-routine and tells the program to come back to foo() when do_something() is ready. Make sure that do_something() is also a co-routine.

async def foo():
    x = await do_something()
    return x
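A minimal runnable version of the snippet above, with do_something() stubbed out using asyncio.sleep():

```python
import asyncio

async def do_something():
    # Simulates an I/O-bound operation; must itself be a co-routine
    await asyncio.sleep(0.1)
    return "done"

async def foo():
    # foo() pauses here; the event loop is free to run other tasks
    x = await do_something()
    return x

print(asyncio.run(foo()))  # done
```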

Recommended Resources

Black Hat Programming Series

Recently, I have been planning to work through two technical books (Black Hat Python and Black Hat Go).

One of my motivations for going through these books is to understand how to build tools for content discovery and brute-forcing. I would also like to develop my Python scripting skills further.

In Black Hat Python, the sample code for the chapters is in Python 2. I decided to convert the Python 2 code to Python 3. I will also use libraries such as requests to replace some of the steps that were performed with urllib and urllib2.

Here are some sample projects from Black Hat Python that were converted to Python 3:

Web Application Mapper
Once you have identified the open-source technology used by the target web app, you can download its source code to a local directory. The mapper then sends requests to the target and spiders it using the directory and file names found in the open-source code.

The script uses the known directories of that particular technology to map out the attack surface of the web app.
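The core of that idea can be sketched as below. map_paths() is a hypothetical helper (not the book's code) that walks the downloaded source tree and turns every file into a candidate URL on the target; in the actual mapper each URL would then be requested (e.g. with requests.get) and non-404 responses reported. The target hostname here is a placeholder.

```python
import os
import tempfile

def map_paths(local_root, target_base):
    """Turn every file in the downloaded source tree into a candidate URL."""
    urls = []
    for dirpath, _dirs, files in os.walk(local_root):
        for name in sorted(files):
            rel = os.path.relpath(os.path.join(dirpath, name), local_root)
            urls.append(f"{target_base}/{rel.replace(os.sep, '/')}")
    return urls

# Demo with a throwaway tree standing in for a downloaded open-source project
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "wp-admin"))
    open(os.path.join(root, "index.php"), "w").close()
    open(os.path.join(root, "wp-admin", "admin.php"), "w").close()
    urls = map_paths(root, "http://target.example.com")
print(urls)
```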

Content Brute Forcing
In cases where you do not know the exact technology stack, you will need to brute-force using a common word list containing common directory and file names. In the book, the script allows extension brute-forcing as well. I have added a filter method that allows the script to display only responses with specific status codes (e.g. 200).

Notice that only responses with status 200 are displayed.

A common workflow that we can observe from these tooling scripts:

  • A word list or a list of test cases is generated or taken from an open-source project. These are added to a queue.
  • A filter or a list of specific information is supplied based on what we are interested in during recon.
  • Brute forcing can be done faster with threads.
  • The code can be simpler with requests instead of urllib.
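That workflow can be sketched with the standard library's queue and threading modules. This is a hypothetical skeleton, not the book's script: the network request is replaced by a canned status code (here, paths ending in .php "exist"), which keeps the queue-plus-workers structure and the status-code filter visible.

```python
import queue
import threading

def brute_force(words, extensions, num_threads=4):
    # Step 1: build the candidate queue from the word list
    candidates = queue.Queue()
    for word in words:
        candidates.put(word)
        for ext in extensions:
            candidates.put(f"{word}{ext}")

    found = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                path = candidates.get_nowait()
            except queue.Empty:
                return
            # Stand-in for requests.get(target + path); a real run would
            # use the response's actual status code here
            status = 200 if path.endswith(".php") else 404
            # Step 2: filter, keeping only the status codes of interest
            if status == 200:
                with lock:
                    found.append(path)

    # Step 3: drain the queue faster with several threads
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(found)

print(brute_force(["admin", "login"], [".php", ".bak"]))
```

Each worker pulls candidates until the queue is empty, so the work is shared evenly across threads regardless of how long individual requests take.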

All source code in this blog post can be found here