Mastering Asynchronous Web Crawling with Python’s asyncio

Published on April 13, 2025 by Andrey Gubarev

Web crawling is inherently an I/O-bound task – most of the time is spent waiting for network responses rather than CPU processing. This makes it a perfect candidate for asynchronous programming. In this comprehensive Python asyncio web crawler tutorial, we’ll explore how to build high-performance scrapers using real-world patterns and implementations.

What is Asynchronous Web Crawling?

Asynchronous web crawling is a technique that allows multiple web requests to be processed concurrently rather than sequentially. While traditional synchronous crawlers wait for each request to complete before starting the next one, async crawlers can handle hundreds of URLs simultaneously, resulting in 60-90% performance improvements.
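
To see why, consider a minimal, self-contained sketch (the requests here are simulated with asyncio.sleep, so it runs without any HTTP library): the sequential version pays each delay in turn, while the asyncio.gather version overlaps them.

import asyncio
import time

# Simulated fetch: each "request" just sleeps for half a second,
# standing in for network latency.
async def fetch(url: str) -> str:
    await asyncio.sleep(0.5)
    return f"fetched {url}"

async def crawl_sequential(urls):
    # One request at a time: total time is roughly 0.5s * len(urls).
    return [await fetch(url) for url in urls]

async def crawl_concurrent(urls):
    # All requests in flight at once: total time is roughly 0.5s.
    return await asyncio.gather(*(fetch(url) for url in urls))

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    for crawl in (crawl_sequential, crawl_concurrent):
        start = time.perf_counter()
        await crawl(urls)
        print(f"{crawl.__name__}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())

On ten simulated half-second requests, the sequential version takes roughly five seconds and the concurrent one roughly half a second.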

How Do You Build an Asyncio Web Crawler?

Building an effective asyncio web crawler requires three key components: queues for work distribution, workers for processing, and semaphores for concurrency control.

How Does asyncio.Queue Distribute Work?

At the heart of our crawler is asyncio.Queue, which helps distribute work among different workers. We use two queues: one for sitemaps and another for individual pages:

import asyncio

class Crawler:
    def __init__(self, config):
        # One queue per stage: sitemap workers feed page URLs
        # into the page queue for the page workers to consume.
        self.queue_sitemap = asyncio.Queue()
        self.queue_page = asyncio.Queue()

The queues act as task pools, allowing workers to pick up new URLs as they become available. This creates a producer-consumer pattern where sitemap workers produce page URLs that page workers consume.
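
As a rough illustration of that flow (the SitemapWorker name and the fetch_sitemap callable here are hypothetical, not the article's actual implementation), the producer side might look like this:

# Hypothetical sketch of the producer side of the pipeline.
class SitemapWorker:
    def __init__(self, queue_sitemap, queue_page, fetch_sitemap):
        self.queue_sitemap = queue_sitemap    # input: sitemap URLs
        self.queue_page = queue_page          # output: page URLs
        self.fetch_sitemap = fetch_sitemap    # coroutine returning page URLs

    async def process(self, sitemap_url):
        # Produce: every page URL found in the sitemap becomes a new
        # task for the page workers (the consumers) to pick up.
        for page_url in await self.fetch_sitemap(sitemap_url):
            await self.queue_page.put(page_url)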

How Do Async Workers Process Requests?

Our workers follow an async pattern that continuously processes items from the queue:

class Worker:
    async def run(self):
        while True:
            try:
                # Wait for the next item, but give up after `timeout`
                # seconds of inactivity so the worker can exit cleanly.
                item = await asyncio.wait_for(self.queue.get(), timeout=self.timeout)
                await self.process(item)
                self.queue.task_done()
            except asyncio.TimeoutError:
                # No new work arrived within the timeout: treat the
                # crawl as finished and let the worker return.
                return

Key aspects of this implementation:

  • await asyncio.wait_for(self.queue.get(), ...) asynchronously waits for new items without blocking other coroutines
  • timeout ensures workers can shut down gracefully when work is complete
  • task_done() marks tasks as complete, essential for queue management
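
The article doesn't show how the workers are started, so here is one plausible way to wire them to the queues, sketched under that assumption: launch a fixed number of each worker type as tasks and gather them. Because every worker exits when its queue times out, the gather returns once the crawl is finished.

class Crawler:
    async def run(self, sitemap_urls, num_workers=10):
        # Seed the pipeline with the initial sitemap URLs.
        for url in sitemap_urls:
            await self.queue_sitemap.put(url)

        # Placeholder constructor arguments ("..."); the real workers
        # take whatever dependencies your implementation needs.
        workers = [
            *(SitemapWorker(...) for _ in range(num_workers)),
            *(PageWorker(...) for _ in range(num_workers)),
        ]

        # Every worker returns after its queue times out, so gathering
        # them waits for the whole crawl to drain.
        await asyncio.gather(*(worker.run() for worker in workers))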

How Do You Control Concurrency in Async Crawlers?

While asyncio enables concurrent operations, we need to control how many requests run simultaneously to avoid overwhelming target servers or exhausting our own resources. Semaphores provide this control by limiting the number of concurrent operations:

class PageWorker:
    def __init__(self, ..., max_concurrent_requests):
        # Cap how many requests may be in flight at any moment.
        self.semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def fetch(self, url, profile):
        # Acquire a slot before issuing the request; the slot is
        # released automatically when the `async with` block exits.
        async with self.semaphore:
            response = await self.client.head(url, headers=profile.headers)

The semaphore ensures that no more than max_concurrent_requests are running at any time, preventing resource exhaustion while maintaining high throughput.
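
To verify the limit actually holds, a small standalone experiment like this one (simulated requests, illustrative numbers) tracks how many tasks are inside the semaphore at once; the reported peak never exceeds the configured value:

import asyncio

async def fetch(url, semaphore, counter):
    async with semaphore:
        counter["active"] += 1
        counter["peak"] = max(counter["peak"], counter["active"])
        await asyncio.sleep(0.1)   # stand-in for the HTTP request
        counter["active"] -= 1

async def main():
    semaphore = asyncio.Semaphore(5)   # at most 5 "requests" in flight
    counter = {"active": 0, "peak": 0}
    urls = [f"https://example.com/{i}" for i in range(50)]
    await asyncio.gather(*(fetch(url, semaphore, counter) for url in urls))
    print(f"peak concurrency: {counter['peak']}")   # prints 5, never more

asyncio.run(main())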

How Do You Handle Graceful Shutdown in Async Crawlers?

Proper shutdown handling is crucial for any crawler to prevent data loss and ensure clean termination. Our implementation handles this at two levels:

  1. Signal Handling:

import signal
import sys

def handle_sigint(signum, frame):
    print("\nReceived Ctrl+C. Shutting down gracefully...")
    sys.exit(0)

signal.signal(signal.SIGINT, handle_sigint)

  2. Worker Timeout:

try:
    item = await asyncio.wait_for(self.queue.get(), timeout=self.timeout)
except asyncio.TimeoutError:
    print(f"Worker {self.__class__.__name__} finished")
    return
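
For completeness, here is one way the two shutdown paths might fit together in a runnable skeleton (my own minimal assembly of the article's pieces, with the fetch simulated): letting the queue drain exercises the timeout path, and pressing Ctrl+C while it runs exercises the signal path.

import asyncio
import signal
import sys

class PageWorker:
    def __init__(self, queue, timeout=2.0):
        self.queue = queue
        self.timeout = timeout

    async def run(self):
        while True:
            try:
                url = await asyncio.wait_for(self.queue.get(), timeout=self.timeout)
                await asyncio.sleep(0.2)   # stand-in for the real fetch
                print(f"crawled {url}")
                self.queue.task_done()
            except asyncio.TimeoutError:
                print(f"Worker {self.__class__.__name__} finished")
                return

def handle_sigint(signum, frame):
    print("\nReceived Ctrl+C. Shutting down gracefully...")
    sys.exit(0)

async def main():
    queue = asyncio.Queue()
    for i in range(20):
        await queue.put(f"https://example.com/page/{i}")
    await asyncio.gather(*(PageWorker(queue).run() for _ in range(5)))

if __name__ == "__main__":
    signal.signal(signal.SIGINT, handle_sigint)
    asyncio.run(main())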

Conclusion

Building high-performance web crawlers with Python’s asyncio requires understanding three core concepts: asynchronous queues for work distribution, worker patterns for concurrent processing, and semaphores for concurrency control. By implementing these patterns, you can achieve significant performance improvements over traditional synchronous approaches while maintaining code clarity and robustness.

The key to successful async crawling lies in proper resource management, graceful error handling, and respecting the servers you’re crawling. Start with conservative concurrency limits and adjust based on your specific use case and target servers’ capacity.


Need help with Python or DevOps? Reach out at andrey@andreygubarev.com.