Deep Dive · Python Internals · Part 1

The GIL isn't so bad after all

A deep dive into what CPython's Global Interpreter Lock solves.

It's the year 1992. Virtually every desktop computer runs on a single core. A programming model called threading is gaining traction. Guido van Rossum faces a challenge: he needs to change CPython so that developers can use multithreading safely. He must come up with a solution that lets the interpreter safely share memory between multiple threads. He introduces the Global Interpreter Lock, or the GIL.

A few years pass, and chip manufacturers realise they can speed up computation by putting multiple cores on a single chip. However, Python developers realise that the GIL prevents a single Python process from executing Python code in parallel across those cores. Unfortunately, what was once an elegant solution to enable threading becomes a ceiling for performance.

Python is infamous for the GIL. But it solves a key problem. Before multiple cores became a thing, it was probably the simplest solution to support threading. In this post, we'll explore why the GIL exists, what problem it solves, and why it may not be as bad as its reputation suggests.

Note: Python is just a language. There are multiple interpreters that can run code written in the Python language. CPython is the most commonly used interpreter, and we'll focus on that.

How CPython manages memory

To begin, we need to understand how CPython manages memory. If you've programmed in a low-level language like C, you are probably familiar with the stack and the heap. They are two distinct regions in memory.

The stack holds local variables used in a function. It is a Last-In-First-Out (LIFO) structure. Whenever a function is called, its local variables and return address are pushed onto the stack. When the function returns, all this data is popped. The CPU and compiler automatically manage the stack. On many Linux systems, a thread's stack is limited to roughly 8 MB by default.

The heap, on the other hand, is a large pool of memory managed by the C programmer. The programmer requests memory from the heap to store data, and that data exists until they free it. It's ideal for data that needs to persist across multiple functions, and for data structures whose size is determined at runtime.

If you've only programmed in Python, or other high-level languages, you are unlikely to have managed memory yourself. So who is managing it?

In C, int x = 20 typically stores the value in 4 bytes on the stack (when x is a local variable).

In Python, integers are wrapped in a PyLongObject struct. A variable like x is just a name bound to an object sitting on the heap. Internally, CPython maintains mappings from variable names to the objects they refer to. In CPython releases before 3.12, the struct contains the following fields (3.12 and later replace ob_size with a packed lv_tag field, but the idea is the same):

ob_refcnt: the reference count
ob_type: a pointer to a PyTypeObject describing the type
ob_size: the number of "digits" used to store the value
ob_digit: the array of digits that holds the actual integer value

python

import sys
x = 2781
print(sys.getsizeof(x))  # 28 bytes on a 64-bit system

If you check the size of the same integer in Python, it is 7x larger than a 4-byte native int in C. The 28 bytes store the actual value and all the corresponding metadata. That is the fee Python charges to offer developers tremendous flexibility.
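To see where the 28 bytes come from, here is a small sketch that measures integers needing one, two, and three internal digits. It assumes a typical 64-bit CPython build, where each digit holds 30 bits and occupies 4 bytes; the helper name size_report is made up for illustration, and the exact sizes may differ on other builds.

```python
import sys

def size_report(values):
    """Return (value, size in bytes) pairs using sys.getsizeof."""
    return [(v, sys.getsizeof(v)) for v in values]

# bits_per_digit is 30 on typical 64-bit builds (an assumption; check your own).
bits = sys.int_info.bits_per_digit

# These values need one, two, and three internal digits respectively.
values = [1, 2 ** bits, 2 ** (2 * bits)]
report = size_report(values)

for value, size in report:
    print(f"{value:>22} -> {size} bytes")
```

Each extra digit should add sys.int_info.sizeof_digit bytes on top of the fixed header, which is why small and moderately large integers report the same size.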

In C, if you declare int x = 2781 and then write x = "c", the compiler rejects the assignment or, at best, warns about an incompatible conversion. In Python, you can rebind x to anything at runtime. Python simply makes x refer to a different object on the heap:

python

x = 2781
print(id(x))
x = "how do you like them apples"
print(id(x))

4318596496
4318584688

Reference counting

Since all objects live on the heap, CPython must track them and reclaim their memory promptly, otherwise the process would eventually run out of memory. To do so, it uses the reference count stored in every object (the ob_refcnt field of PyObject).

python

import sys
x = 2781
print(sys.getrefcount(x))  # 2 — getrefcount itself holds one ref
y = x
print(sys.getrefcount(x))  # 3
print(id(x))
print(id(y))              # same address — same object

2
3
4318597424
4318597424

If you set x = 2781, you get a name x in the current namespace that refers to a PyLongObject on the heap. If you then set y = x, CPython binds a second name, y, to the same PyLongObject. You can confirm this with the id function. The initial count, 2, might confuse you. Why is it not 1? The call to getrefcount itself holds a temporary reference to its argument. The crux, however, is that an object's reference count increases whenever a new name is bound to it, and CPython decrements the count when such a name is removed or rebound. So if you delete y and print the refcount of x, it goes back to its initial value, 2. The exact numbers can differ between CPython versions and between the REPL and a script (where the compiled code also holds a reference to the constant 2781), but the increments and decrements follow the same pattern.

python

del y
print(sys.getrefcount(x))  # back to 2

You don't necessarily have to delete y. You can also rebind it to some other object, such as a string, and the refcount of the integer still decreases, since y then refers to a different object.

python

import sys
x = 2781
print(sys.getrefcount(x))
y = x
print(sys.getrefcount(y))
print(sys.getrefcount(x))
y = "how do you like them apples"
print(sys.getrefcount(y))
print(sys.getrefcount(x))

2
3
3
2
2
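What happens when the count reaches zero can be observed directly. The sketch below uses a hypothetical Tracked class whose __del__ method records the moment CPython destroys the instance; it relies on CPython's reference counting, so other Python implementations may run the finalizer later.

```python
class Tracked:
    """Hypothetical helper that records when CPython destroys it."""

    def __init__(self, log):
        self.log = log

    def __del__(self):
        # Called by CPython when the reference count drops to zero.
        self.log.append("freed")

log = []
a = Tracked(log)
b = a                          # a second name for the same object

del a                          # one reference (b) remains
still_alive = (log == [])

del b                          # last reference removed, object is destroyed
freed_now = (log == ["freed"])

print(still_alive, freed_now)
```

On CPython both flags come out True: removing the first name changes nothing visible, and removing the last one triggers the cleanup immediately.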

How CPython manages threads

Now let's address the elephant in the room: threading. To illustrate what happens without the GIL, let's switch to C, where threads can truly run in parallel. You can paste any of the C examples below into programiz.com/c-programming/online-compiler to run them.

Let's create a simple struct to mimic PyLongObject and run two threads that both increment a shared reference count 100 000 times each:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define ITERS 100000

typedef struct {
    int val;
    int ref_counts;
} int_obj;

int_obj* create_int_obj(int val) {
    int_obj* obj = (int_obj*)malloc(sizeof(int_obj));
    obj->val = val;
    obj->ref_counts = 1;
    return obj;
}

int inc_ref(int_obj* obj) {
    obj->ref_counts++;
    return 1;
}

void* thread_inc(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    for (int i = 0; i < ITERS; i++) {
        inc_ref(obj);
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int_obj* obj = create_int_obj(42);
    pthread_create(&t1, NULL, thread_inc, obj);
    pthread_create(&t2, NULL, thread_inc, obj);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Final ref_count = %d\n", obj->ref_counts);
    return 0;
}

Any guesses what the final ref-count will be? If you said 200001, run it a few times. You'll most likely get a different answer on each run. This is a classic example of a race condition: two threads are reading and writing the same value without any coordination, and obj->ref_counts++ is not a single atomic step but a read, an add, and a write.
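One unlucky interleaving is enough to lose an increment. The sketch below replays such an interleaving deterministically in Python, using generators as stand-ins for the two threads (this is a simulation for illustration, not real threading; the names increment, thread_a, and thread_b are made up).

```python
def increment(obj):
    """One increment, split into its read and write halves."""
    current = obj["ref_counts"]        # read the shared value
    yield                              # a thread switch could happen here
    obj["ref_counts"] = current + 1    # write back a possibly stale value

obj = {"ref_counts": 1}
thread_a = increment(obj)
thread_b = increment(obj)

next(thread_a)             # A reads 1
next(thread_b)             # B also reads 1, before A has written
for step in (thread_a, thread_b):
    next(step, None)       # A writes 2, then B writes 2 again

print(obj["ref_counts"])   # 2, although two increments ran
```

Two increments started from 1, yet the result is 2 rather than 3, because the second write overwrote the first.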

Adding dec_ref: segfaults and memory corruption

Now let's add a decrement thread that frees the object when ref_count hits zero, which is what CPython's reference counting does when an object's count drops to zero. Let's also add checks in the inc_ref and dec_ref functions to try to avoid touching a deleted object.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define ITERS 100000

typedef struct {
    int val;
    int ref_counts;
} int_obj;

int_obj* create_int_obj(int val) {
    int_obj* obj = (int_obj*)malloc(sizeof(int_obj));
    obj->val = val;
    obj->ref_counts = 1;
    return obj;
}

int inc_ref(int_obj* obj) {
    if (obj->ref_counts <= 0) {
        printf("Cannot increase ref_count of a deleted object\n");
        return 0;
    }
    obj->ref_counts++;
    return 1;
}

int dec_ref(int_obj* obj) {
    if (obj->ref_counts <= 0) {
        printf("Cannot decrease ref_count of a deleted object\n");
        return 0;
    }
    obj->ref_counts--;
    if (obj->ref_counts == 0) {
        free(obj);
        printf("Deleted object\n");
        return 0;
    }
    return 1;
}

void* thread_inc(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    int err;
    for (int i = 0; i < ITERS; i++) {
        err = inc_ref(obj);
        if (err == 0) break;
    }
    return NULL;
}

void* thread_dec(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    int err;
    for (int i = 0; i < ITERS; i++) {
        err = dec_ref(obj);
        if (err == 0) break;
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int_obj* obj = create_int_obj(42);
    pthread_create(&t1, NULL, thread_inc, obj);
    pthread_create(&t2, NULL, thread_dec, obj);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Final ref_count = %d\n", obj->ref_counts);
    return 0;
}

Run it a few times. The results are completely unreliable. Since the threads are not synchronized, they can interfere with each other's operations. As a result, thread_inc can increase the reference count of an object that has already been freed, or thread_dec can free the same object more than once.

Deleted object
Segmentation fault
=== Code Exited With Errors ===

Deleted object
Cannot increase ref_count of a deleted object
Segmentation fault
=== Code Exited With Errors ===

Final ref_count = 16317
=== Code Execution Successful ===

The fix: a global mutex lock

The fix? We add a lock. Each thread must acquire it before operating on the shared object, and release it after. This ensures they function cooperatively:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define ITERS 100000

pthread_mutex_t LOCK = PTHREAD_MUTEX_INITIALIZER; // initialize the mutex before first use

typedef struct {
    int val;
    int ref_counts;
} int_obj;

int_obj* create_int_obj(int val) {
    int_obj* obj = (int_obj*)malloc(sizeof(int_obj));
    obj->val = val;
    obj->ref_counts = 1;
    return obj;
}

int inc_ref(int_obj* obj) {
    if (obj->ref_counts <= 0) {
        printf("Cannot increase ref_count of a deleted object\n");
        return 0;
    }
    obj->ref_counts++;
    return 1;
}

int dec_ref(int_obj* obj) {
    if (obj->ref_counts <= 0) {
        printf("Cannot decrease ref_count of a deleted object\n");
        return 0;
    }
    obj->ref_counts--;
    if (obj->ref_counts == 0) {
        free(obj);
        return 0;
    }
    return 1;
}

void* thread_inc(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    int err;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&LOCK);
        err = inc_ref(obj);
        pthread_mutex_unlock(&LOCK);
        if (err == 0) break;
    }
    return NULL;
}

void* thread_dec(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    int err;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&LOCK);
        err = dec_ref(obj);
        pthread_mutex_unlock(&LOCK);
        if (err == 0) break;
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;

    int_obj* obj = create_int_obj(42);
    pthread_create(&t1, NULL, thread_inc, obj);
    pthread_create(&t2, NULL, thread_inc, obj);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Final ref_count = %d\n", obj->ref_counts);

    pthread_t t3, t4;
    obj = create_int_obj(42);
    pthread_create(&t3, NULL, thread_inc, obj);
    pthread_create(&t4, NULL, thread_dec, obj);
    pthread_join(t3, NULL);
    pthread_join(t4, NULL);
    printf("Final ref_count = %d\n", obj->ref_counts);

    return 0;
}

With both threads incrementing, we reliably get 200001 every run. With one increment and one decrement thread, two outcomes are possible:

1. If the ref-count never reaches 0, the final count is 1 (100 000 increments minus 100 000 decrements, starting from 1):

Final ref_count = 200001
Final ref_count = 1
=== Code Execution Successful ===

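2. If the ref-count reaches 0 and the object is freed, the increment thread sees the zero count on its next locked check and stops, with no segmentation fault in this run. Strictly speaking, that check and the final printf still read memory that has already been freed, which is undefined behavior in C; a real implementation must never touch an object after freeing it.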

Final ref_count = 200001
Cannot increase ref_count of a deleted object
Final ref_count = 0
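The same locking pattern can be sketched in Python with threading.Lock. This is only an illustration of the idea at the Python level (the dictionary obj stands in for the C struct and is not how CPython stores reference counts): every read-modify-write of the shared counter happens while holding one global lock.

```python
import threading

ITERS = 100_000
lock = threading.Lock()        # one global lock, like LOCK in the C version
obj = {"ref_counts": 1}        # stand-in for the C struct

def thread_inc():
    for _ in range(ITERS):
        with lock:             # acquire before touching shared state, release after
            obj["ref_counts"] += 1

threads = [threading.Thread(target=thread_inc) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(obj["ref_counts"])       # 200001
```

Because only one thread can be inside the with block at a time, no increment is lost and the result is the same on every run.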

Per-object locks vs. one global lock

We can also add a lock to each object instead of using one global lock. This will allow two threads working on different objects to run truly in parallel:

typedef struct {
    int val;
    int ref_counts;
    pthread_mutex_t lock; // per-object lock
} int_obj;

However, locks are expensive. Acquiring and releasing a mutex for every reference count mutation on every object adds serious overhead. One global lock is simpler and cheaper per operation, at the cost of true CPU parallelism.
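To get a feel for that overhead, the rough sketch below times the same counting loop with and without a lock around each increment. It measures Python's threading.Lock on a single thread, not the C-level mutexes CPython would need, so treat the numbers as an illustration of the trend only; the function names are made up.

```python
import threading
import timeit

N = 200_000
lock = threading.Lock()

def plain_increments():
    """Count to N with no locking."""
    count = 0
    for _ in range(N):
        count += 1
    return count

def locked_increments():
    """Count to N, acquiring and releasing a lock around every increment."""
    count = 0
    for _ in range(N):
        with lock:
            count += 1
    return count

plain_time = timeit.timeit(plain_increments, number=5)
locked_time = timeit.timeit(locked_increments, number=5)

print(f"plain:  {plain_time:.3f}s")
print(f"locked: {locked_time:.3f}s")
```

Even without any contention, the locked loop does extra work on every iteration, and that cost would be paid on every reference count change in a per-object-lock design.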

CPython doesn't let a thread hold the GIL for its entire lifetime. It uses a time-slice model: a thread acquires the GIL and runs, and once the switch interval has elapsed while other threads are waiting, it is asked to release the GIL so another thread gets a turn. The interval has defaulted to 5 ms since Python 3.2; you can read it with sys.getswitchinterval() and change it with sys.setswitchinterval():

python

import sys
print(sys.getswitchinterval())  # 0.005  (5 ms)

0.005
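The interval can also be changed at runtime. The sketch below shortens it to 1 ms and then restores the original value; the 1 ms figure is an arbitrary choice for illustration, not a recommendation.

```python
import sys

original = sys.getswitchinterval()

# Ask CPython to consider switching threads more often (value in seconds).
sys.setswitchinterval(0.001)
shortened = sys.getswitchinterval()
print(f"default: {original}s, shortened: {shortened}s")

# Restore the previous setting so the rest of the program is unaffected.
sys.setswitchinterval(original)
restored = sys.getswitchinterval()
```

A shorter interval makes threads take turns more frequently at the cost of more switching overhead, so the default is usually left alone.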

Threading for concurrent I/O

During blocking I/O operations like network requests, CPython releases the GIL, so other threads can run while one is waiting. The script below runs in ~2 seconds with threads versus ~6 seconds sequentially:

python

import threading
import time
import urllib.request

def fetch(url):
    print(f"[{threading.current_thread().name}] starting request")
    urllib.request.urlopen(url)
    print(f"[{threading.current_thread().name}] done")

urls = [
    "https://httpbin.org/delay/2",
    "https://httpbin.org/delay/2",
    "https://httpbin.org/delay/2",
]

# Sequential — takes ~6 seconds
start = time.time()
for url in urls:
    fetch(url)
print(f"Sequential: {time.time() - start:.1f}s")

# Threaded — takes ~2 seconds despite the GIL
start = time.time()
threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.time() - start:.1f}s")

[Thread-1] starting request
[Thread-2] starting request
[Thread-3] starting request
[Thread-1] done
[Thread-2] done
[Thread-3] done
Sequential: 6.0s
Threaded: 2.0s
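If you'd rather not depend on the network, the same effect can be reproduced offline. The sketch below uses time.sleep as a stand-in for network latency (an assumption made for illustration; like blocking I/O, sleeping releases the GIL), and the names fake_request, run_sequential, and run_threaded are made up.

```python
import threading
import time

DELAY = 0.2  # seconds each simulated request waits

def fake_request():
    time.sleep(DELAY)          # releases the GIL while waiting

def run_sequential(n):
    start = time.perf_counter()
    for _ in range(n):
        fake_request()
    return time.perf_counter() - start

def run_threaded(n):
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

sequential = run_sequential(3)
threaded = run_threaded(3)
print(f"Sequential: {sequential:.2f}s")
print(f"Threaded:   {threaded:.2f}s")
```

The sequential run has to wait for each delay in turn, while the threaded run overlaps the waits, so it should finish in roughly the time of a single request.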

Tip: If you want to make concurrent network requests in Python, coroutines (via asyncio) are the better option. But that's a topic for the next post.

Conclusion

While the GIL limits CPU-bound multithreading, it simplifies development by ensuring thread-safe memory management and object reference counting. It also makes life easier for developers maintaining C extensions. They can write code without worrying about thread safety at the object level.

Python 3.13 introduced an experimental free-threaded build (PEP 703) that can run without the GIL, but widespread adoption is slow. Without the GIL, much of the burden of thread safety shifts onto C extensions, each of which has to be updated to support it.
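You can check which kind of interpreter you are running. The sketch below is a best-effort check: it reads the Py_GIL_DISABLED build variable and calls sys._is_gil_enabled() where it exists (Python 3.13+; the leading underscore marks it as a private API that may change). The helper name gil_status is made up, and on older versions it simply assumes the GIL is on.

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled) for this interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled exists only on Python 3.13+; older versions always use the GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = checker() if checker is not None else True

    return free_threaded_build, gil_enabled

free_threaded_build, gil_enabled = gil_status()
print(f"free-threaded build: {free_threaded_build}")
print(f"GIL enabled now:     {gil_enabled}")
```

On a standard build both lines report the familiar situation: not free-threaded, GIL enabled.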

In practice, the GIL isn't a bottleneck for most tasks. Python is great at orchestrating performant libraries. It's generally not used directly for heavy computational workloads that benefit from multiple cores. Instead, it relies on libraries like NumPy, PyTorch, OpenCV, etc. These libraries, written mostly in C/C++, release the GIL around heavy operations anyway and can use multiple cores to parallelize work like matrix multiplications.