Mahi
My notes, blogs and thoughts on tech

Make log file reading faster through parallel processing

I have been doing a lot of access log analysis lately for an application at work, looking for insights such as changes in HTTP response status and user access patterns. We do have an internal custom application that I developed for live usage monitoring (with Elasticsearch as the backend), but it targets specific data patterns. That keeps storage overhead down by skipping log entries we were not interested in, and it also speeds up ingestion.

If you have analysed large log files, you know it can be painfully slow. A single process reads and processes every line one by one, and for a file with millions of entries that adds up fast.

A natural first thought is to use Python's threading module: split the file into chunks and process them simultaneously. The problem is Python's GIL (Global Interpreter Lock), which allows only one thread to execute Python bytecode at a time. Threads end up taking turns rather than running in true parallel, so you get little to no speedup for CPU-bound work like parsing. For I/O-heavy tasks the picture is different: threads release the GIL while they wait on I/O, so multi-threading can still give a good speedup there.
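To make the GIL point concrete, here is a minimal sketch that runs the same kind of parsing work once in a plain loop and once on a thread pool. The helper names (count_errors, count_errors_threaded) and the generated sample lines are made up for illustration. Both versions return the same count; on a standard CPython build with the GIL, expect the threaded timing to be about the same as the sequential one rather than several times faster.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: many copies of one access-log line with status 500.
LINE = '192.168.1.10 - - [07/Nov/2024:10:23:45 +0000] "GET /api/users HTTP/1.1" 500 1024 312'
LINES = [LINE] * 200_000


def count_errors(lines):
    """CPU-bound work: count lines whose status code is 500 or above."""
    errors = 0
    for line in lines:
        status = int(line.split('"')[2].split()[0])
        if status >= 500:
            errors += 1
    return errors


def count_errors_threaded(lines, num_threads=4):
    """Same work split across threads; the GIL makes the threads take turns."""
    size = max(1, -(-len(lines) // num_threads))  # ceiling division, at least 1
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(count_errors, chunks))


start = time.perf_counter()
sequential = count_errors(LINES)
t_seq = time.perf_counter() - start

start = time.perf_counter()
threaded = count_errors_threaded(LINES)
t_thr = time.perf_counter() - start

print(f"sequential: {sequential} errors in {t_seq:.3f}s")
print(f"threaded  : {threaded} errors in {t_thr:.3f}s")
```

The exact timings depend on your machine and Python build, so none are quoted here.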

The right tool for CPU-bound tasks is Python's multiprocessing module. Each process gets its own Python interpreter and its own GIL, so parsing genuinely runs in parallel across CPU cores.

To demonstrate the idea, this post uses a sample log pattern similar to a Tomcat access log.

The log format

A typical Tomcat access log line looks like this:

192.168.1.10 - - [07/Nov/2024:10:23:45 +0000] "GET /api/users HTTP/1.1" 200 1024 312

The last number is the response time in milliseconds. Suppose we want to scan the entire file and find:

  • Total number of requests
  • Count of slow requests (say, response time > 1000 ms)
  • Count of errors (HTTP 5xx status codes)
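As a quick illustration of the three checks above, here is a small sketch that pulls the status code and response time out of the sample line with a regular expression. The group names are my own labels, not official Tomcat field names, and treating the middle number (1024) as the response size is an assumption; only the status code and the trailing response time are used for the counts.

```python
import re

SAMPLE = '192.168.1.10 - - [07/Nov/2024:10:23:45 +0000] "GET /api/users HTTP/1.1" 200 1024 312'

# Illustrative pattern for the sample format; group names are hypothetical labels.
LINE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) (?P<response_ms>\d+)$'
)


def classify(line, slow_threshold_ms=1000):
    """Return the fields needed for the counts, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    status = int(match.group("status"))
    response_ms = int(match.group("response_ms"))
    return {
        "status": status,
        "response_ms": response_ms,
        "slow": response_ms > slow_threshold_ms,   # slow request check
        "error": 500 <= status <= 599,             # HTTP 5xx check
    }


print(classify(SAMPLE))
# {'status': 200, 'response_ms': 312, 'slow': False, 'error': False}
```

A regular expression is stricter than splitting on quotes and will reject lines that deviate from the pattern, which may or may not be what you want for messy logs.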

Single-process program - the slow way

First, let's see what a simple analysis program looks like. The program below runs in a single process.

import time

LOG_FILE = "access.log"


def parse_line(line):
    """Returns (status_code, response_time_ms) or None if line is unparseable."""
    try:
        parts = line.split('"')
        # parts[2] looks like: ' 200 1024 312'
        trailing = parts[2].strip().split()
        status = int(trailing[0])
        response_time = int(trailing[2])
        return status, response_time
    except (IndexError, ValueError):
        return None


def analyse_single(filepath):
    total = 0
    slow = 0
    errors = 0

    with open(filepath, "r") as f:
        for line in f:
            result = parse_line(line)
            if result is None:
                continue
            status, response_time = result
            total += 1
            if response_time > 1000:
                slow += 1
            if status >= 500:
                errors += 1

    return {"total": total, "slow": slow, "errors": errors}

start = time.time()
stats = analyse_single(LOG_FILE)
elapsed = time.time() - start

print(f"Total requests : {stats['total']}")
print(f"Slow requests  : {stats['slow']}")
print(f"Errors (5xx)   : {stats['errors']}")
print(f"Time taken     : {elapsed:.2f}s")

For a 2 GiB log file with ~20 million lines, this typically takes 18-22 seconds on a modern laptop.

Multi-process program - the faster way

The idea is the same as threading — split the file into N equal chunks and process them in parallel — but each chunk is handled by a separate OS process instead of a thread. This bypasses the GIL entirely.

The number of processes is controlled by a single constant at the top: change NUM_PROCESSES and the rest adjusts automatically. A good starting value is the number of CPUs on your machine as reported by os.cpu_count(). Note that this counts logical CPUs, so it can be higher than the number of physical cores.

import time
import os
from multiprocessing import Pool

LOG_FILE = "access.log"
NUM_PROCESSES = 8  # <-- change this to tune parallelism (try os.cpu_count())


def parse_line(line):
    """Returns (status_code, response_time_ms) or None if line is unparseable."""
    try:
        parts = line.split('"')
        trailing = parts[2].strip().split()
        status = int(trailing[0])
        response_time = int(trailing[2])
        return status, response_time
    except (IndexError, ValueError):
        return None


def analyse_chunk(lines):
    """Process a list of log lines and return stats dict.
    This function runs in a separate process."""
    total = 0
    slow = 0
    errors = 0

    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        status, response_time = parsed
        total += 1
        if response_time > 1000:
            slow += 1
        if status >= 500:
            errors += 1

    return {"total": total, "slow": slow, "errors": errors}


def analyse_parallel(filepath, num_processes):
    # Read all lines upfront — keep a check on available memory in your system
    with open(filepath, "r") as f:
        all_lines = f.readlines()

    # Split lines into equal chunks, one per process (using integer division '//')
    chunk_size = len(all_lines) // num_processes
    chunks = [
        all_lines[i * chunk_size : (i + 1) * chunk_size]
        for i in range(num_processes)
    ]
    # Push leftover lines into the last chunk
    chunks[-1].extend(all_lines[num_processes * chunk_size :])

    # Pool.map distributes chunks across processes and collects results
    with Pool(num_processes) as pool:
        results = pool.map(analyse_chunk, chunks)

    # Merge results from all processes
    merged = {"total": 0, "slow": 0, "errors": 0}
    for r in results:
        merged["total"] += r["total"]
        merged["slow"] += r["slow"]
        merged["errors"] += r["errors"]

    return merged


if __name__ == "__main__":
    start = time.time()
    stats = analyse_parallel(LOG_FILE, NUM_PROCESSES)
    elapsed = time.time() - start

    print(f"Total requests : {stats['total']}")
    print(f"Slow requests  : {stats['slow']}")
    print(f"Errors (5xx)   : {stats['errors']}")
    print(f"Time taken     : {elapsed:.2f}s  (using {NUM_PROCESSES} processes)")

Pool.map is the key simplification in multiprocessing: you hand it a function and a list of inputs, it distributes them across the worker pool and returns the results in the same order as the inputs. No manual process management is needed.
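The ordering guarantee is easy to see on a toy input. The sketch below uses two hypothetical helpers: split_evenly, which spreads leftover items across the first chunks instead of pushing them all into the last one, and chunk_summary, which stands in for the real per-chunk analysis. Neither is part of multiprocessing itself.

```python
from multiprocessing import Pool


def split_evenly(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def chunk_summary(chunk):
    """Runs in a worker process: return (first item, number of items)."""
    return (chunk[0] if chunk else None, len(chunk))


if __name__ == "__main__":
    chunks = split_evenly(list(range(10)), 3)
    with Pool(3) as pool:
        # Results come back in the order of the input chunks,
        # regardless of which worker finishes first.
        print(pool.map(chunk_summary, chunks))
        # [(0, 4), (4, 3), (7, 3)]
```

The `if __name__ == "__main__":` guard matters here: on platforms where worker processes are started by re-importing the main module, it stops each worker from trying to create its own pool.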

Results

On the same 2 GiB log file:

| Processes | Time    | Speedup |
|-----------|---------|---------|
| 1         | 20.1 s  | 1.0×    |
| 2         | 10.5 s  | 1.9×    |
| 4         | 5.4 s   | 3.7×    |
| 8         | 2.9 s   | 6.9×    |

The speedup here is noticeable because the parsing work actually runs in parallel. Beyond the number of physical CPU cores, gains flatten out — you can't get more parallel CPU work than you have cores.
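One way to avoid asking for more workers than the machine can run at once is to cap the requested number at what os.cpu_count() reports. This is only a sketch: default_num_processes is a hypothetical helper, capping is a design choice rather than a requirement, and os.cpu_count() reports logical CPUs, which can be more than the physical core count.

```python
import os


def default_num_processes(requested=None):
    """Pick a worker count between 1 and the number of CPUs the OS reports."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    if requested is None:
        return available
    return max(1, min(requested, available))


print(f"CPUs reported : {os.cpu_count()}")
print(f"Workers to use: {default_num_processes(8)}")
```

It is still worth timing a few values around this number on your own data, since the best setting depends on the machine and the workload.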

One thing to watch out for

Each worker process receives its chunk of lines as an argument, which means Python has to serialise (pickle) the data and send it to the child process. For very large chunks this inter-process communication overhead can eat into the gains.
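To get a feel for that cost, you can pickle a chunk yourself and look at the payload size and round-trip time. The sketch below builds a hypothetical chunk of 100,000 distinct log lines; the line contents are generated for illustration. It approximates the serialisation work only, not the cost of moving the bytes between processes.

```python
import pickle
import time

# Hypothetical chunk: 100,000 distinct lines (distinct so pickle cannot reuse one object).
chunk = [
    f'192.168.1.{i % 256} - - [07/Nov/2024:10:23:45 +0000] '
    f'"GET /api/users/{i} HTTP/1.1" 200 1024 312\n'
    for i in range(100_000)
]

start = time.perf_counter()
payload = pickle.dumps(chunk)      # roughly what happens before a chunk is sent to a worker
restored = pickle.loads(payload)   # and what the worker does on receipt
elapsed = time.perf_counter() - start

print(f"Lines in chunk : {len(chunk)}")
print(f"Pickled size   : {len(payload) / 1_000_000:.1f} MB")
print(f"Round trip     : {elapsed * 1000:.1f} ms")
```

If the round trip is a noticeable fraction of the time a worker spends parsing the chunk, the transfer is eating into the speedup.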

If memory is tight, an alternative is to pass file byte offsets to each worker and have each process open and read its own slice of the file directly — this avoids loading everything into the main process first.
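A minimal sketch of that byte-offset approach follows. All function names here (byte_ranges, analyse_range, merge_stats, analyse_parallel_by_offset) are hypothetical, and the file is read in binary mode so that offsets are exact byte positions. The rule for chunk boundaries is that a line belongs to the range in which its first byte falls, so each worker skips a line it lands in the middle of and reads past its end offset to finish its last line.

```python
import os
from multiprocessing import Pool


def parse_line(line):
    """Bytes version of the parser: (status, response_time_ms) or None."""
    try:
        trailing = line.split(b'"')[2].split()
        return int(trailing[0]), int(trailing[2])
    except (IndexError, ValueError):
        return None


def byte_ranges(filepath, num_chunks):
    """Cut [0, file_size) into num_chunks contiguous (filepath, start, end) tasks."""
    size = os.path.getsize(filepath)
    step = size // num_chunks
    bounds = [i * step for i in range(num_chunks)] + [size]
    return [(filepath, bounds[i], bounds[i + 1]) for i in range(num_chunks)]


def analyse_range(task):
    """Count the lines whose first byte lies in [start, end). Runs in a worker."""
    filepath, start, end = task
    stats = {"total": 0, "slow": 0, "errors": 0}
    with open(filepath, "rb") as f:
        if start > 0:
            # Step back one byte and read to the next newline. If start is
            # mid-line, this skips the partial line (the previous range owns it);
            # if start is exactly at a line start, it consumes only the newline.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            parsed = parse_line(line)
            if parsed is None:
                continue
            status, response_time = parsed
            stats["total"] += 1
            if response_time > 1000:
                stats["slow"] += 1
            if status >= 500:
                stats["errors"] += 1
    return stats


def merge_stats(results):
    """Add up the per-range stats dicts."""
    merged = {"total": 0, "slow": 0, "errors": 0}
    for r in results:
        for key in merged:
            merged[key] += r[key]
    return merged


def analyse_parallel_by_offset(filepath, num_processes):
    """Only small (path, start, end) tuples are pickled, never the lines."""
    tasks = byte_ranges(filepath, num_processes)
    with Pool(num_processes) as pool:
        return merge_stats(pool.map(analyse_range, tasks))
```

Call analyse_parallel_by_offset("access.log", 8) from inside an `if __name__ == "__main__":` block, as with any Pool-based script. Each worker reads the file line by line, so neither the parent nor the workers hold the whole file in memory.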

Key takeaway

When log analysis is slow because the parsing is CPU-bound, threads will not help much because of the GIL. multiprocessing sidesteps it: split the file into chunks, farm them out with Pool.map, merge the results. The code is only marginally more complex than the single-process version, and the speedup is real. 🙂