Reverse engineering Python's memory - Integers

View this post in Jupyter notebook: https://nbviewer.jupyter.org/github/nowke/notebooks/blob/master/Reverse%20engineering%20Python%27s%20memory%20-%20Integers.ipynb

If you are using Python for more than a year, you would have heard about different variants of it.

CPython - original implementation which can be downloaded from python.org. Python code is compiled to bytecode using C.
Jython - uses Java to convert the program to bytecode. Enables the program to run on JVM.
PyPy - written in Python itself! It uses a Just-In-Time (JIT) compiler, which translates directly to machine code (instead of bytecode). Since native machine code is faster than running bytecode, PyPy provides speed improvements compared to CPython.

Here we talk only about CPython variant.

Basic data types

Determining the size of primitive data types

C, similar to other programming languages has built-in primitive data types - int, char, long, bool etc.

Let's look at how we can find the size of each data types in your machine (I use 64-bit MacBook Air). We can use ctypes module.

import ctypes

types = [('int', ctypes.c_int), ('char', ctypes.c_char), 
         ('long', ctypes.c_long), ('bool', ctypes.c_bool), 
         ('voidp', ctypes.c_voidp), ('ulong', ctypes.c_ulong)]

for ctype in types:
    size = ctypes.sizeof(ctype[1])
    print(f'c_{ctype[0]:<5}  - {size} byte(s)')

c_int    - 4 byte(s)
c_char   - 1 byte(s)
c_long   - 8 byte(s)
c_bool   - 1 byte(s)
c_voidp  - 8 byte(s)
c_ulong  - 8 byte(s)

Variable representation

In Python, when a variable is declared var1 = 5, var1 doesn't hold the value 5 (like in C). Rather, var1 points to the object containing value 5.

var1 = 5
var2 = 5

Accessing address of variables

We can access the logical address of a variable using built-in the id() function.

id(var1), id(var2), id(5)

(4418119136, 4418119136, 4418119136)

var1, var2 and 5 refer to the same object in memory.

Let us inspect the address of integers from 1 to 10 and the difference between the addresses.

def print_addr(nums):
    for num in nums:
        addr = id(num)
        print(f'Num = {num:<2}, id = {addr}, diff = {addr - id(num-1)}')

print_addr(range(1, 11))

Num = 1 , id = 4418119008, diff = 32
Num = 2 , id = 4418119040, diff = 32
Num = 3 , id = 4418119072, diff = 32
Num = 4 , id = 4418119104, diff = 32
Num = 5 , id = 4418119136, diff = 32
Num = 6 , id = 4418119168, diff = 32
Num = 7 , id = 4418119200, diff = 32
Num = 8 , id = 4418119232, diff = 32
Num = 9 , id = 4418119264, diff = 32
Num = 10, id = 4418119296, diff = 32

The numbers are placed in contiguously in memory. But is it true for large numbers? Let's try from 250 to 265

print_addr(range(250, 266))

Num = 250, id = 4418126976, diff = 32
Num = 251, id = 4418127008, diff = 32
Num = 252, id = 4418127040, diff = 32
Num = 253, id = 4418127072, diff = 32
Num = 254, id = 4418127104, diff = 32
Num = 255, id = 4418127136, diff = 32
Num = 256, id = 4418127168, diff = 32
Num = 257, id = 4463633392, diff = 45506224
Num = 258, id = 4463633328, diff = 32
Num = 259, id = 4463633296, diff = -96
Num = 260, id = 4463633392, diff = 64
Num = 261, id = 4463633328, diff = 32
Num = 262, id = 4463633296, diff = -96
Num = 263, id = 4463633392, diff = 64
Num = 264, id = 4463633328, diff = 32
Num = 265, id = 4463633296, diff = -96

Let's try negative numbers

print_addr(range(-10, 5))

Num = -10, id = 4462902928, diff = -730432
Num = -9, id = 4463633360, diff = 96
Num = -8, id = 4463633264, diff = 730336
Num = -7, id = 4462902928, diff = -730432
Num = -6, id = 4463633360, diff = 96
Num = -5, id = 4418118816, diff = -44784112
Num = -4, id = 4418118848, diff = 32
Num = -3, id = 4418118880, diff = 32
Num = -2, id = 4418118912, diff = 32
Num = -1, id = 4418118944, diff = 32
Num = 0 , id = 4418118976, diff = 32
Num = 1 , id = 4418119008, diff = 32
Num = 2 , id = 4418119040, diff = 32
Num = 3 , id = 4418119072, diff = 32
Num = 4 , id = 4418119104, diff = 32

We can observe that the numbers are contiguous from -5 to 256. This is because, Python pre-allocates memory for numbers from -5 to 256 and creates the objects before your program starts.

Integers

Accessing integers from address

Using id() function, we saw how to convert from variable -> address. How about the other way. Given an address, how to dig what's stored inside?

x = 45
addr_x = id(x)
print(addr_x)

4418120416

The function ctypes.<type>.from_address takes address and returns the <type> value. Let's give addr_x to c_int type. Remember: We need to get 45 back from addr_x

print(ctypes.c_int.from_address(addr_x))
print(ctypes.c_int.from_address(addr_x + 8))
print(ctypes.c_int.from_address(addr_x + 16))
print(ctypes.c_int.from_address(addr_x + 24))

c_int(81)
c_int(122850480)
c_int(1)
c_int(45)

x_value = ctypes.c_int.from_address(addr_x + 24)
x_value.value

Why add 24 to the original address? An integer is stored similar to the below struct in C (actual struct maybe different). The allocated 32 bytes are divided as follows.

struct {
    long   ob_refcnt;  // long   = 8 bytes (c_long)
    void*  ob_type;    // void*  = 8 bytes (c_voidp)
    long   ob_size;    // long   = 8 bytes (c_long)
    uint   ob_digit[]; // uint[] = (8 x ob_size) bytes (c_uint)
}

For the digit 45, array size of ob_digit will be 1. Now let's unpack the above fields using the same from_address function and actually make sense of the returned values.

1. `ob_refcnt` -> represents the total number of references made to the object

ob_refcnt_x = ctypes.c_long.from_address(addr_x)
print(f'Total references made to {x} - {ob_refcnt_x.value}')

Total references made to 45 - 88

2. `ob_type` -> Points to the type object, in our case `int`

Since int is an object, it has an address in memory!

id(int)

4417817776

ob_type_x = ctypes.c_void_p.from_address(addr_x + 8)
ob_type_x.value

4417817776

Observe that, id(int) is same as ob_type_x.value

3. `ob_size` -> size of `ob_digit` array.

We will see about its use further below

ob_size_x = ctypes.c_long.from_address(addr_x + 16)
ob_size_x.value

4. `ob_digit[]`

ob_digit_x = ctypes.c_uint.from_address(addr_x + 24)
ob_digit_x.value

print(f'''
Integer {x} is represented as,

struct_ {{
    long   ob_refcnt  = {ctypes.c_long.from_address(addr_x).value}
    void*  ob_type    = {ctypes.c_long.from_address(addr_x + 8).value}
    long   ob_size    = {ctypes.c_long.from_address(addr_x + 16).value}
    uint   ob_digit[] = {ctypes.c_long.from_address(addr_x + 24).value}
}}
''')

Integer 45 is represented as,

struct_ {
    long   ob_refcnt  = 91
    void*  ob_type    = 4417817776
    long   ob_size    = 1
    uint   ob_digit[] = 45
}

Big integers

In Python, we can declare and operate on integers of size as big as you want!

big_x = 2**1000
big_x

10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376

big_x + 1

10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069377

How does Python store this value? It uses the fields ob_size and ob_digit[]. The formula for calculating the integer representation stored inside ob_digit is as follows

where n = ob_size

Example 1

ob_size = 1
ob_digit = [45]

value = ob_digit[0] x (1024^3)^0
      = 45 x 1
      = 45

Example 2

ob_size  = 2
ob_digit = [2, 3]

value = (ob_digit[0] x (1024^3)^0) +
        (ob_digit[1] x (1024^3)^1)
      = (2 x 1) + (3 x 1024^3)
      = 3,221,225,474

Let's verify Example 2

y = 3221225474
addr_y = id(y)
y, addr_y

(3221225474, 4463632688)

ob_size_y = ctypes.c_long.from_address(addr_y + 16)
ob_size_y.value

ob_digit_y_0 = ctypes.c_uint.from_address(addr_y + 24)
ob_digit_y_0

c_uint(2)

ob_digit_y_1 = ctypes.c_uint.from_address(addr_y + 28)
ob_digit_y_1

c_uint(3)

Ta! We got back the values ob_size = 2 and ob_digit = [2, 3]

Size of an integer

Both x and y take 32 bytes of memory. How about even bigger integers? We can make use of sys.getsizeof function

import sys

sys.getsizeof(x), sys.getsizeof(y)

(28, 32)

for power in range(10, 101, 10):
    num = 10 ** power
    size = sys.getsizeof(num)
    print(f'Number = 10^{power:<3}, Size = {size}')

Number = 10^10 , Size = 32
Number = 10^20 , Size = 36
Number = 10^30 , Size = 40
Number = 10^40 , Size = 44
Number = 10^50 , Size = 48
Number = 10^60 , Size = 52
Number = 10^70 , Size = 56
Number = 10^80 , Size = 60
Number = 10^90 , Size = 64
Number = 10^100, Size = 72

Recreating integers

We made use of from_address to retrieve various parts of an integer. Let's actually write a custom structure that takes an integer using ctypes.Structure abstract base class.

class IntStructure(ctypes.Structure):
    _fields_ = [("ob_refcnt", ctypes.c_long),
                ("ob_type"  , ctypes.c_void_p),
                ("ob_size"  , ctypes.c_long),
                ("ob_digit" , ctypes.c_uint * 0)]

    def __repr__(self):
        digits   = self.get_digits()
        int_size = self.calc_size()
        int_num  = self.calc_integer(digits)
        return (f'Int(ob_refcnt={self.ob_refcnt},\n'
                f'    ob_size={self.ob_size},\n'
                f'    ob_digit={digits},\n'
                f'    int_size={int_size} bytes,\n'
                f'    int_num={int_num})')

    def get_digits(self):
        """Internal representation of `ob_digit` as a list"""

        digits = ctypes.cast(
            ctypes.byref(self.ob_digit),
            ctypes.POINTER(ctypes.c_uint * self.ob_size)
        ).contents

        return list(digits)

    def calc_size(self):
        """Number of bytes consumed by the integer"""

        size = (ctypes.sizeof(ctypes.c_long)   + 
                ctypes.sizeof(ctypes.c_void_p) +
                ctypes.sizeof(ctypes.c_long)   + 
                ctypes.sizeof(ctypes.c_long))
        if self.ob_size > 2:
            size += (self.ob_size - 2) * ctypes.sizeof(ctypes.c_uint)
        return size

    def calc_integer(self, digits):
        """Retrieve the actual integer"""

        return sum(
            [digit * (1024**3)**i
             for i, digit in enumerate(digits)]
        )

a = 50
addr_a = id(a)
int_a = IntStructure.from_address(addr_a)
int_a

Int(ob_refcnt=77,
    ob_size=1,
    ob_digit=[50],
    int_size=32 bytes,
    int_num=50)

b = 10**40
addr_b = id(b)
int_b = IntStructure.from_address(addr_b)
int_b

Int(ob_refcnt=1,
    ob_size=5,
    ob_digit=[0, 668304384, 172751547, 175927511, 7523],
    int_size=44 bytes,
    int_num=10000000000000000000000000000000000000000)

Changing the value of an integer

Referred from r/Python

Do not do this anywhere in your actual programs! Very bad things can happen

Let's modify the integer 120

# Define a simpler structure (similar to IntStructure)

class MyInt(ctypes.Structure):
    _fields_ = [("ob_refcnt", ctypes.c_long),
                ("ob_type"  , ctypes.c_void_p),
                ("ob_size"  , ctypes.c_ulong),
                ("ob_digit" , ctypes.c_long)]

addr_120 = id(120)
struct_120 = MyInt.from_address(addr_120)
struct_120.ob_digit = 25

120 + 120

60 + 60

What next?

Using similar techniques, we can decode how the basic data structures are implemented efficiently in Python.

Python Lists
Dictionaries

Blog

Reverse engineering Python's memory - Integers

Navaneesh Kumar

Basic data types

Determining the size of primitive data types

Variable representation

Accessing address of variables

Integers

Accessing integers from address

1. `ob_refcnt` -> represents the total number of references made to the object

2. `ob_type` -> Points to the type object, in our case `int`

3. `ob_size` -> size of `ob_digit` array.

4. `ob_digit[]`

Big integers

Size of an integer

Recreating integers

Changing the value of an integer

What next?

Join Our Newsletter. No Spam, Only the good stuff.

Related

Reverse engineering Python's memory - Integers

Navaneesh Kumar

Basic data types

Determining the size of primitive data types

Variable representation

Accessing address of variables

Integers

Accessing integers from address

1. ob_refcnt -> represents the total number of references made to the object

2. ob_type -> Points to the type object, in our case int

3. ob_size -> size of ob_digit array.

4. ob_digit[]

Big integers

Size of an integer

Recreating integers

Changing the value of an integer

What next?

Join Our Newsletter. No Spam, Only the good stuff.

Related

1. `ob_refcnt` -> represents the total number of references made to the object

2. `ob_type` -> Points to the type object, in our case `int`

3. `ob_size` -> size of `ob_digit` array.

4. `ob_digit[]`