A deep dive into what CPython's Global Interpreter Lock solves.
It's the year 1992. Most computers run on a single core. A new programming paradigm called threading is gaining traction. Guido van Rossum faces a challenge: he needs to change CPython so that developers can use multithreading safely. He must come up with a solution so that the interpreter can safely share memory between multiple threads. He introduces the Global Interpreter Lock, or the GIL.
A few years pass, and chip manufacturers realise they can speed up computation by putting multiple cores on a single chip. However, Python developers realise that the GIL prevents a single Python process from executing Python code in parallel across multiple cores. Unfortunately, what was once an elegant solution to enable threading becomes a ceiling for performance.
Python is infamous for the GIL. But it solves a key problem. Before multiple cores became a thing, it was probably the simplest solution to support threading. In this post, we'll explore why the GIL exists, what problem it solves, and why it may not be as bad as its reputation suggests.
To begin, we need to understand how CPython manages memory. If you've programmed in a low-level language like C, you are probably familiar with the stack and the heap. They are two distinct regions in memory.
The stack holds local variables used in a function. It is a Last-In-First-Out (LIFO) structure. Whenever a function is called, its local variables and return address are pushed onto the stack. When the function returns, all this data is popped. The CPU and compiler automatically manage the stack. On many Linux systems, a thread's stack is limited to roughly 8 MB by default.
The heap, on the other hand, is a large pool of memory managed by the C programmer. The programmer requests memory from the heap to store data, and that data exists until they free it. It's ideal for data that needs to persist across multiple functions, and for data structures whose size is determined at runtime.
If you've only programmed in Python, or other high-level languages, you are unlikely to have managed memory yourself. So who is managing it?
In C, a local declaration like int x = 20 typically stores the value in 4 bytes on the stack.
In Python, integers are wrapped in a PyLongObject struct. A variable like x is just a name bound to some object sitting on the heap. Internally, CPython maintains mappings from variable names to object locations.
The struct contains:
ob_type: a pointer to a PyTypeObject describing the type
ob_refcnt: the reference count
ob_size: number of "digits" used to store the value
ob_digit: the actual integer value
import sys
x = 2781
print(sys.getsizeof(x)) # 28 bytes on a 64-bit system
Check the size of the same integer in Python and it is 7x larger than the native 4-byte int in C. The 28 bytes store the actual value plus all the accompanying metadata. That is the fee Python charges for offering developers tremendous flexibility.
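To see where those bytes go, here is a minimal sketch, assuming CPython. Exact sizes depend on the interpreter version and platform, so the snippet prints them rather than assuming particular numbers. The values are arbitrary picks that need progressively more internal digits.

```python
import sys

# Arbitrary values that need progressively more internal "digits".
# Byte counts are CPython- and platform-specific, so they are only
# printed here, never hard-coded.
values = [1, 2**30, 2**60, 2**1000]
sizes = [sys.getsizeof(v) for v in values]

for value, size in zip(values, sizes):
    print(f"{value.bit_length():>5} bits -> {size} bytes")
```

The fixed metadata stays the same for every int; only the digit storage grows with the magnitude of the value.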
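In C, if you declare int x = 2781 and then write x = "c", the compiler complains because a string literal is not an int (depending on the compiler and its flags, this is a warning or a hard error).
In Python, you can rebind x to anything at runtime. Python simply switches x to point to a different heap location: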
x = 2781
print(id(x))
x = "how do you like them apples"
print(id(x))
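Because names are only references, two names can share one object until one of them is rebound. The following is a small sketch of that idea; the function name rebinding_demo and the list contents are illustrative, not taken from CPython.

```python
def rebinding_demo():
    a = [1, 2, 3]
    b = a                  # b is a second name for the same list object
    same_before = a is b   # both names refer to one heap object
    b.append(4)            # mutating through b is visible through a
    b = "how do you like them apples"  # rebinding b; a still refers to the list
    return same_before, a, b

same_before, a, b = rebinding_demo()
print(same_before)  # True
print(a)            # [1, 2, 3, 4]
print(b)            # how do you like them apples
```

Rebinding changes which object a name points to; it never modifies the object the name used to point to.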
Since all the data sits on the heap, CPython must track it and reclaim memory promptly so that the heap does not fill up with dead objects. To do so, it relies on the reference count (ob_refcnt) stored in every PyObject.
import sys
x = 2781
print(sys.getrefcount(x)) # 2 — getrefcount itself holds one ref
y = x
print(sys.getrefcount(x)) # 3
print(id(x))
print(id(y)) # same address — same object
If you set x = 2781, you get a name x in the current namespace that points to a PyLongObject on the heap. If you then set y = x, CPython creates another name y that points to the same PyLongObject as x. You can confirm this with the id function. The initial count, 2, might confuse you. Why is it not 1? Because passing x to getrefcount creates a temporary reference of its own. (The exact numbers can differ with the Python version and with how the code is run, so focus on how the count changes.) The crux, however, is that the reference count of a PyObject increases whenever a new name is bound to it. When a name pointing to it is removed, CPython decrements the count. So, if you delete y and print the refcount of x, it goes back to its initial value, 2.
del y
print(sys.getrefcount(x)) # back to 2
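Names are not the only things that hold references; containers do too. Here is a minimal sketch, assuming a standard CPython build. Thing is a hypothetical placeholder class, and only differences between counts are compared, since absolute values vary between versions.

```python
import sys

class Thing:
    """Hypothetical placeholder type, so no interpreter caching is involved."""

obj = Thing()
base = sys.getrefcount(obj)            # absolute value is version-dependent

holder = [obj, obj]                    # the list stores two more references
in_list = sys.getrefcount(obj) - base

holder.clear()                         # emptying the list drops both of them
after_clear = sys.getrefcount(obj) - base

print(in_list, after_clear)            # 2 0
```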
You don't necessarily have to delete y. You can also rebind it to some other object, such as a string, and the refcount of the integer still decreases, since y then points to a different object.
import sys
x = 2781
print(sys.getrefcount(x))
y = x
print(sys.getrefcount(y))
print(sys.getrefcount(x))
y = "how do you like them apples"
print(sys.getrefcount(y))
print(sys.getrefcount(x))
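What happens when the count reaches zero can be observed with weakref.finalize, which runs a callback when its target is reclaimed. This is a sketch assuming CPython, where reclamation happens immediately once the last reference disappears. Payload and demo are hypothetical names, and a small class is used because plain ints cannot be weakly referenced.

```python
import weakref

class Payload:
    """Hypothetical placeholder class; ints do not support weak references."""

def demo():
    events = []
    obj = Payload()
    weakref.finalize(obj, events.append, "freed")  # runs when obj is reclaimed
    alias = obj                        # second reference to the same object
    del obj                            # one reference left, nothing is freed
    events.append("after del obj")
    del alias                          # last reference gone, object reclaimed
    events.append("after del alias")
    return events

events = demo()
print(events)  # ['after del obj', 'freed', 'after del alias']
```

The callback fires between the two del statements' markers, which shows that the object survives as long as any one name still refers to it.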
Now let's address the elephant in the room: threading. To illustrate what happens without the GIL, let's switch to C, where threads can truly run in parallel. You can paste any of the C examples below into programiz.com/c-programming/online-compiler to run them.
Let's create a simple struct to mimic PyLongObject and run two threads that both increment a shared reference count 100 000 times each:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define ITERS 100000

typedef struct {
    int val;
    int ref_counts;
} int_obj;

int_obj* create_int_obj(int val) {
    int_obj* obj = (int_obj*)malloc(sizeof(int_obj));
    obj->val = val;
    obj->ref_counts = 1;
    return obj;
}

int inc_ref(int_obj* obj) {
    obj->ref_counts++;
    return 1;
}

void* thread_inc(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    for (int i = 0; i < ITERS; i++) {
        inc_ref(obj);
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int_obj* obj = create_int_obj(42);
    pthread_create(&t1, NULL, thread_inc, obj);
    pthread_create(&t2, NULL, thread_inc, obj);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Final ref_count = %d\n", obj->ref_counts);
    return 0;
}
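Any guesses what the final ref-count will be? If you said 200001, run it a few times. You'll likely get a different answer on each run, usually less than 200001. This is a classic example of a race condition. Two threads are reading and writing the same value without any coordination: ref_counts++ is a read, an add, and a write, and the two threads' steps can interleave so that updates get lost.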
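The lost update is easier to see when one unlucky interleaving is written out by hand. The sketch below is a single-threaded simulation, not real threading; lost_update and the variable names are illustrative only.

```python
def lost_update(start=1):
    """Replay one bad interleaving of two 'ref_count += 1' operations."""
    ref_count = start
    seen_by_a = ref_count        # thread A reads the current value
    seen_by_b = ref_count        # thread B reads it before A writes back
    ref_count = seen_by_a + 1    # A writes start + 1
    ref_count = seen_by_b + 1    # B also writes start + 1, overwriting A
    return ref_count

print(lost_update())  # 2, although two increments ran
```

Two increments ran, but the count rose by only one. Repeat that across 200 000 operations and the final total falls short by an unpredictable amount.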
Now let's add a decrement thread that frees the object when ref_count hits zero, which is what CPython's reference counting does. Let's also add if-checks in the inc_ref and dec_ref functions to try to avoid touching a deleted object.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define ITERS 100000

typedef struct {
    int val;
    int ref_counts;
} int_obj;

int_obj* create_int_obj(int val) {
    int_obj* obj = (int_obj*)malloc(sizeof(int_obj));
    obj->val = val;
    obj->ref_counts = 1;
    return obj;
}

int inc_ref(int_obj* obj) {
    if (obj->ref_counts <= 0) {
        printf("Cannot increase ref_count of a deleted object\n");
        return 0;
    }
    obj->ref_counts++;
    return 1;
}

int dec_ref(int_obj* obj) {
    if (obj->ref_counts <= 0) {
        printf("Cannot decrease ref_count of a deleted object\n");
        return 0;
    }
    obj->ref_counts--;
    if (obj->ref_counts == 0) {
        free(obj);
        printf("Deleted object\n");
        return 0;
    }
    return 1;
}

void* thread_inc(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    int err;
    for (int i = 0; i < ITERS; i++) {
        err = inc_ref(obj);
        if (err == 0) break;
    }
    return NULL;
}

void* thread_dec(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    int err;
    for (int i = 0; i < ITERS; i++) {
        err = dec_ref(obj);
        if (err == 0) break;
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int_obj* obj = create_int_obj(42);
    pthread_create(&t1, NULL, thread_inc, obj);
    pthread_create(&t2, NULL, thread_dec, obj);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Final ref_count = %d\n", obj->ref_counts);
    return 0;
}
Run it a few times. The results are completely unreliable. Since the threads are not synchronized, they can interfere with each other's operations. The thread_inc thread can increment the reference count of an object that has already been freed, or the thread_dec thread can free the same object more than once.
The fix? We add a lock. Each thread must acquire it before operating on the shared object, and release it after. This ensures they function cooperatively:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define ITERS 100000

// One global lock shared by all threads, statically initialized.
pthread_mutex_t LOCK = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    int val;
    int ref_counts;
} int_obj;

int_obj* create_int_obj(int val) {
    int_obj* obj = (int_obj*)malloc(sizeof(int_obj));
    obj->val = val;
    obj->ref_counts = 1;
    return obj;
}

int inc_ref(int_obj* obj) {
    if (obj->ref_counts <= 0) {
        printf("Cannot increase ref_count of a deleted object\n");
        return 0;
    }
    obj->ref_counts++;
    return 1;
}

int dec_ref(int_obj* obj) {
    if (obj->ref_counts <= 0) {
        printf("Cannot decrease ref_count of a deleted object\n");
        return 0;
    }
    obj->ref_counts--;
    if (obj->ref_counts == 0) {
        free(obj);
        return 0;
    }
    return 1;
}

void* thread_inc(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    int err;
    for (int i = 0; i < ITERS; i++) {
        // Only one thread at a time may touch the shared object.
        pthread_mutex_lock(&LOCK);
        err = inc_ref(obj);
        pthread_mutex_unlock(&LOCK);
        if (err == 0) break;
    }
    return NULL;
}

void* thread_dec(void* arg) {
    int_obj* obj = (int_obj*)(arg);
    int err;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&LOCK);
        err = dec_ref(obj);
        pthread_mutex_unlock(&LOCK);
        if (err == 0) break;
    }
    return NULL;
}

int main() {
    // Case 1: two incrementing threads.
    pthread_t t1, t2;
    int_obj* obj = create_int_obj(42);
    pthread_create(&t1, NULL, thread_inc, obj);
    pthread_create(&t2, NULL, thread_inc, obj);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Final ref_count = %d\n", obj->ref_counts);

    // Case 2: one incrementing and one decrementing thread.
    pthread_t t3, t4;
    obj = create_int_obj(42);
    pthread_create(&t3, NULL, thread_inc, obj);
    pthread_create(&t4, NULL, thread_dec, obj);
    pthread_join(t3, NULL);
    pthread_join(t4, NULL);
    printf("Final ref_count = %d\n", obj->ref_counts);
    return 0;
}
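With both threads incrementing, we get 200001 on every run. With one increment and one decrement thread, two outcomes are possible:
1. If the ref-count never reaches 0, the final count is 1 (100 000 increments minus 100 000 decrements, starting from 1).
2. If the ref-count reaches 0 and the object is freed, the increment thread sees a non-positive count on its next attempt and stops. Strictly speaking, this toy program is then reading freed memory, which is undefined behavior in C. Real reference counting avoids the problem because every thread that can reach an object holds its own reference, so the count cannot hit zero while another thread is still using it.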
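The same global-lock pattern can be sketched with Python's threading module. This is an illustration only: the names LOCK, ITERS and thread_inc mirror the C example, while the state dictionary and run_two_incrementers are hypothetical helpers.

```python
import threading

ITERS = 100_000
LOCK = threading.Lock()              # one lock shared by every thread
state = {"ref_count": 1}             # stand-in for the shared int_obj

def thread_inc():
    for _ in range(ITERS):
        with LOCK:                   # acquire, update, release
            state["ref_count"] += 1

def run_two_incrementers():
    state["ref_count"] = 1
    threads = [threading.Thread(target=thread_inc) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["ref_count"]

final_count = run_two_incrementers()
print(final_count)  # 200001
```

Because each read-modify-write happens entirely inside the lock, no increment can be lost, whatever the thread scheduling.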
We can also add a lock to each object instead of using one global lock. This will allow two threads working on different objects to run truly in parallel:
typedef struct {
int val;
int ref_counts;
pthread_mutex_t lock; // per-object lock
} int_obj;
However, locks are expensive. Acquiring and releasing a mutex for every reference count mutation on every object adds serious overhead. One global lock is simpler and cheaper per operation, at the cost of true CPU parallelism.
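For comparison, a per-object lock could look like the following Python sketch. IntObj and bump are hypothetical names mirroring the C struct. Under the GIL these Python threads still do not execute bytecode in parallel, so the snippet only illustrates the structure, not a speedup.

```python
import threading

class IntObj:
    """Hypothetical Python mirror of the int_obj struct with its own lock."""

    def __init__(self, val):
        self.val = val
        self.ref_counts = 1
        self.lock = threading.Lock()   # per-object lock

    def inc_ref(self):
        with self.lock:                # contends only with users of this object
            self.ref_counts += 1

def bump(obj, times):
    for _ in range(times):
        obj.inc_ref()

obj_a, obj_b = IntObj(1), IntObj(2)
threads = [
    threading.Thread(target=bump, args=(obj_a, 50_000)),
    threading.Thread(target=bump, args=(obj_a, 50_000)),  # shares obj_a's lock
    threading.Thread(target=bump, args=(obj_b, 50_000)),  # independent lock
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(obj_a.ref_counts, obj_b.ref_counts)  # 100001 50001
```

Every object now carries a lock and every count update pays for an acquire and a release, which is the overhead described above.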
CPython doesn't let a thread hold the GIL for its entire lifetime. It uses a time-slice model: a thread acquires the GIL and runs, and once the switch interval has elapsed and another thread is waiting, it is asked to release the GIL so other threads get a turn. The interval is controlled by sys.setswitchinterval() and defaults to 5 ms (the default since Python 3.2):
import sys
print(sys.getswitchinterval()) # 0.005 (5 ms)
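The interval can also be changed at runtime. Here is a small sketch that tries a shorter interval and then restores the original; the 1 ms value is an arbitrary choice for illustration, not a recommendation.

```python
import sys

original = sys.getswitchinterval()    # interval in seconds

sys.setswitchinterval(0.001)          # arbitrary example value: 1 ms
changed = sys.getswitchinterval()

sys.setswitchinterval(original)       # restore the previous setting
restored = sys.getswitchinterval()

print(original, changed, restored)
```

A shorter interval makes threads take turns more often, at the cost of more switching overhead.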
During blocking I/O operations like network requests, Python releases the GIL. The script below runs in roughly 2 seconds with threads versus roughly 6 seconds sequentially (exact timings depend on the network):
import threading
import time
import urllib.request

def fetch(url):
    print(f"[{threading.current_thread().name}] starting request")
    urllib.request.urlopen(url)
    print(f"[{threading.current_thread().name}] done")

urls = [
    "https://httpbin.org/delay/2",
    "https://httpbin.org/delay/2",
    "https://httpbin.org/delay/2",
]

# Sequential: takes ~6 seconds
start = time.time()
for url in urls:
    fetch(url)
print(f"Sequential: {time.time() - start:.1f}s")

# Threaded: takes ~2 seconds despite the GIL
start = time.time()
threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threaded: {time.time() - start:.1f}s")
For workloads with many concurrent I/O-bound tasks, coroutines (asyncio) are the better option. But that's a topic for the next post.
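The same effect can be reproduced without a network connection. The sketch below uses time.sleep as a stand-in for a blocking I/O call, since sleeping also releases the GIL. The 0.2 second delay and the helper names wait, timed, sequential and threaded are arbitrary choices.

```python
import threading
import time

def wait(seconds):
    time.sleep(seconds)          # blocking call; the GIL is released meanwhile

def timed(run):
    start = time.perf_counter()
    run()
    return time.perf_counter() - start

def sequential():
    for _ in range(3):
        wait(0.2)

def threaded():
    threads = [threading.Thread(target=wait, args=(0.2,)) for _ in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

seq_time = timed(sequential)     # roughly 3 x 0.2 s
thr_time = timed(threaded)       # roughly 0.2 s, the waits overlap
print(f"Sequential: {seq_time:.2f}s, Threaded: {thr_time:.2f}s")
```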
While the GIL limits CPU-bound multithreading, it simplifies development by ensuring thread-safe memory management and object reference counting. It also makes life easier for developers maintaining C extensions. They can write code without worrying about thread safety at the object level.
Python 3.13 introduced an experimental free-threaded build (PEP 703) that runs without the GIL, but widespread adoption is slow. Without the GIL, much of the burden of thread safety shifts onto C extensions.
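To find out which mode an interpreter is running in, one option is the sys._is_gil_enabled() function available on Python 3.13 and later. Here is a hedged sketch; the wrapper name gil_enabled is hypothetical, and it assumes that any interpreter without this function is a regular GIL build.

```python
import sys

def gil_enabled():
    """Report whether this interpreter is currently running with the GIL."""
    check = getattr(sys, "_is_gil_enabled", None)   # exists on Python 3.13+
    if check is None:
        return True      # assumption: older interpreters always have the GIL
    return check()

print(gil_enabled())
```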
In practice, the GIL isn't a bottleneck for most tasks. Python is great at orchestrating performant libraries. It's generally not used directly for heavy computational workloads that benefit from multiple cores. Instead, it relies on libraries like NumPy, PyTorch, OpenCV, etc. These libraries, written mostly in C/C++, typically release the GIL during heavy computations and use multiple cores to parallelize operations like matrix multiplications.