Reverse engineering Python's memory - Integers

nowke

Navaneesh Kumar

Posted on March 31, 2019

Reverse engineering Python's memory - Integers

Python integers

View this post in Jupyter notebook: https://nbviewer.jupyter.org/github/nowke/notebooks/blob/master/Reverse%20engineering%20Python%27s%20memory%20-%20Integers.ipynb

If you are using Python for more than a year, you would have heard about different variants of it.

  • CPython - original implementation which can be downloaded from python.org. Python code is compiled to bytecode using C.
  • Jython - uses Java to convert the program to bytecode. Enables the program to run on JVM.
  • PyPy - written in Python itself! It uses a Just-In-Time (JIT) compiler, which translates directly to machine code (instead of bytecode). Since native machine code is faster than running bytecode, PyPy provides speed improvements compared to CPython.

Here we talk only about CPython variant.

Basic data types

Determining the size of primitive data types

C, similar to other programming languages has built-in primitive data types - int, char, long, bool etc.

Let's look at how we can find the size of each data types in your machine (I use 64-bit MacBook Air). We can use ctypes module.

import ctypes

types = [('int', ctypes.c_int), ('char', ctypes.c_char), 
         ('long', ctypes.c_long), ('bool', ctypes.c_bool), 
         ('voidp', ctypes.c_voidp), ('ulong', ctypes.c_ulong)]

for ctype in types:
    size = ctypes.sizeof(ctype[1])
    print(f'c_{ctype[0]:<5}  - {size} byte(s)')
Enter fullscreen mode Exit fullscreen mode
c_int    - 4 byte(s)
c_char   - 1 byte(s)
c_long   - 8 byte(s)
c_bool   - 1 byte(s)
c_voidp  - 8 byte(s)
c_ulong  - 8 byte(s)
Enter fullscreen mode Exit fullscreen mode

Variable representation

In Python, when a variable is declared var1 = 5, var1 doesn't hold the value 5 (like in C). Rather, var1 points to the object containing value 5.

var1 = 5
var2 = 5
Enter fullscreen mode Exit fullscreen mode

Variable representation

Accessing address of variables

We can access the logical address of a variable using built-in the id() function.

id(var1), id(var2), id(5)
Enter fullscreen mode Exit fullscreen mode
(4418119136, 4418119136, 4418119136)
Enter fullscreen mode Exit fullscreen mode

var1, var2 and 5 refer to the same object in memory.

Let us inspect the address of integers from 1 to 10 and the difference between the addresses.

def print_addr(nums):
    for num in nums:
        addr = id(num)
        print(f'Num = {num:<2}, id = {addr}, diff = {addr - id(num-1)}')

print_addr(range(1, 11))
Enter fullscreen mode Exit fullscreen mode
Num = 1 , id = 4418119008, diff = 32
Num = 2 , id = 4418119040, diff = 32
Num = 3 , id = 4418119072, diff = 32
Num = 4 , id = 4418119104, diff = 32
Num = 5 , id = 4418119136, diff = 32
Num = 6 , id = 4418119168, diff = 32
Num = 7 , id = 4418119200, diff = 32
Num = 8 , id = 4418119232, diff = 32
Num = 9 , id = 4418119264, diff = 32
Num = 10, id = 4418119296, diff = 32
Enter fullscreen mode Exit fullscreen mode

The numbers are placed in contiguously in memory. But is it true for large numbers? Let's try from 250 to 265

print_addr(range(250, 266))
Enter fullscreen mode Exit fullscreen mode
Num = 250, id = 4418126976, diff = 32
Num = 251, id = 4418127008, diff = 32
Num = 252, id = 4418127040, diff = 32
Num = 253, id = 4418127072, diff = 32
Num = 254, id = 4418127104, diff = 32
Num = 255, id = 4418127136, diff = 32
Num = 256, id = 4418127168, diff = 32
Num = 257, id = 4463633392, diff = 45506224
Num = 258, id = 4463633328, diff = 32
Num = 259, id = 4463633296, diff = -96
Num = 260, id = 4463633392, diff = 64
Num = 261, id = 4463633328, diff = 32
Num = 262, id = 4463633296, diff = -96
Num = 263, id = 4463633392, diff = 64
Num = 264, id = 4463633328, diff = 32
Num = 265, id = 4463633296, diff = -96
Enter fullscreen mode Exit fullscreen mode

Let's try negative numbers

print_addr(range(-10, 5))
Enter fullscreen mode Exit fullscreen mode
Num = -10, id = 4462902928, diff = -730432
Num = -9, id = 4463633360, diff = 96
Num = -8, id = 4463633264, diff = 730336
Num = -7, id = 4462902928, diff = -730432
Num = -6, id = 4463633360, diff = 96
Num = -5, id = 4418118816, diff = -44784112
Num = -4, id = 4418118848, diff = 32
Num = -3, id = 4418118880, diff = 32
Num = -2, id = 4418118912, diff = 32
Num = -1, id = 4418118944, diff = 32
Num = 0 , id = 4418118976, diff = 32
Num = 1 , id = 4418119008, diff = 32
Num = 2 , id = 4418119040, diff = 32
Num = 3 , id = 4418119072, diff = 32
Num = 4 , id = 4418119104, diff = 32
Enter fullscreen mode Exit fullscreen mode

We can observe that the numbers are contiguous from -5 to 256. This is because, Python pre-allocates memory for numbers from -5 to 256 and creates the objects before your program starts.

Integers

Accessing integers from address

Using id() function, we saw how to convert from variable -> address. How about the other way. Given an address, how to dig what's stored inside?

x = 45
addr_x = id(x)
print(addr_x)
Enter fullscreen mode Exit fullscreen mode
4418120416
Enter fullscreen mode Exit fullscreen mode

The function ctypes.<type>.from_address takes address and returns the <type> value. Let's give addr_x to c_int type. Remember: We need to get 45 back from addr_x

print(ctypes.c_int.from_address(addr_x))
print(ctypes.c_int.from_address(addr_x + 8))
print(ctypes.c_int.from_address(addr_x + 16))
print(ctypes.c_int.from_address(addr_x + 24))
Enter fullscreen mode Exit fullscreen mode
c_int(81)
c_int(122850480)
c_int(1)
c_int(45)
Enter fullscreen mode Exit fullscreen mode
x_value = ctypes.c_int.from_address(addr_x + 24)
x_value.value
Enter fullscreen mode Exit fullscreen mode
45
Enter fullscreen mode Exit fullscreen mode

Why add 24 to the original address? An integer is stored similar to the below struct in C (actual struct maybe different). The allocated 32 bytes are divided as follows.

struct {
    long   ob_refcnt;  // long   = 8 bytes (c_long)
    void*  ob_type;    // void*  = 8 bytes (c_voidp)
    long   ob_size;    // long   = 8 bytes (c_long)
    uint   ob_digit[]; // uint[] = (8 x ob_size) bytes (c_uint)
}
Enter fullscreen mode Exit fullscreen mode

For the digit 45, array size of ob_digit will be 1. Now let's unpack the above fields using the same from_address function and actually make sense of the returned values.

1. ob_refcnt -> represents the total number of references made to the object

ob_refcnt_x = ctypes.c_long.from_address(addr_x)
print(f'Total references made to {x} - {ob_refcnt_x.value}')
Enter fullscreen mode Exit fullscreen mode
Total references made to 45 - 88
Enter fullscreen mode Exit fullscreen mode

2. ob_type -> Points to the type object, in our case int

Since int is an object, it has an address in memory!

id(int)
Enter fullscreen mode Exit fullscreen mode
4417817776
Enter fullscreen mode Exit fullscreen mode
ob_type_x = ctypes.c_void_p.from_address(addr_x + 8)
ob_type_x.value
Enter fullscreen mode Exit fullscreen mode
4417817776
Enter fullscreen mode Exit fullscreen mode

Observe that, id(int) is same as ob_type_x.value

3. ob_size -> size of ob_digit array.

We will see about its use further below

ob_size_x = ctypes.c_long.from_address(addr_x + 16)
ob_size_x.value
Enter fullscreen mode Exit fullscreen mode
1
Enter fullscreen mode Exit fullscreen mode

4. ob_digit[]

ob_digit_x = ctypes.c_uint.from_address(addr_x + 24)
ob_digit_x.value
Enter fullscreen mode Exit fullscreen mode
45
Enter fullscreen mode Exit fullscreen mode
print(f'''
Integer {x} is represented as,

struct_ {{
    long   ob_refcnt  = {ctypes.c_long.from_address(addr_x).value}
    void*  ob_type    = {ctypes.c_long.from_address(addr_x + 8).value}
    long   ob_size    = {ctypes.c_long.from_address(addr_x + 16).value}
    uint   ob_digit[] = {ctypes.c_long.from_address(addr_x + 24).value}
}}
''')
Enter fullscreen mode Exit fullscreen mode
Integer 45 is represented as,

struct_ {
    long   ob_refcnt  = 91
    void*  ob_type    = 4417817776
    long   ob_size    = 1
    uint   ob_digit[] = 45
}
Enter fullscreen mode Exit fullscreen mode

Big integers

In Python, we can declare and operate on integers of size as big as you want!

big_x = 2**1000
big_x
Enter fullscreen mode Exit fullscreen mode
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
Enter fullscreen mode Exit fullscreen mode
big_x + 1
Enter fullscreen mode Exit fullscreen mode
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069377
Enter fullscreen mode Exit fullscreen mode

How does Python store this value? It uses the fields ob_size and ob_digit[]. The formula for calculating the integer representation stored inside ob_digit is as follows

Python integer formula

where n = ob_size

Example 1

ob_size = 1
ob_digit = [45]

value = ob_digit[0] x (1024^3)^0
      = 45 x 1
      = 45
Enter fullscreen mode Exit fullscreen mode

Example 2

ob_size  = 2
ob_digit = [2, 3]

value = (ob_digit[0] x (1024^3)^0) +
        (ob_digit[1] x (1024^3)^1)
      = (2 x 1) + (3 x 1024^3)
      = 3,221,225,474
Enter fullscreen mode Exit fullscreen mode

Let's verify Example 2

y = 3221225474
addr_y = id(y)
y, addr_y
Enter fullscreen mode Exit fullscreen mode
(3221225474, 4463632688)
Enter fullscreen mode Exit fullscreen mode
ob_size_y = ctypes.c_long.from_address(addr_y + 16)
ob_size_y.value
Enter fullscreen mode Exit fullscreen mode
2
Enter fullscreen mode Exit fullscreen mode
ob_digit_y_0 = ctypes.c_uint.from_address(addr_y + 24)
ob_digit_y_0
Enter fullscreen mode Exit fullscreen mode
c_uint(2)
Enter fullscreen mode Exit fullscreen mode
ob_digit_y_1 = ctypes.c_uint.from_address(addr_y + 28)
ob_digit_y_1
Enter fullscreen mode Exit fullscreen mode
c_uint(3)
Enter fullscreen mode Exit fullscreen mode

Ta! We got back the values ob_size = 2 and ob_digit = [2, 3]

Size of an integer

Both x and y take 32 bytes of memory. How about even bigger integers? We can make use of sys.getsizeof function

import sys

sys.getsizeof(x), sys.getsizeof(y)
Enter fullscreen mode Exit fullscreen mode
(28, 32)
Enter fullscreen mode Exit fullscreen mode
for power in range(10, 101, 10):
    num = 10 ** power
    size = sys.getsizeof(num)
    print(f'Number = 10^{power:<3}, Size = {size}')
Enter fullscreen mode Exit fullscreen mode
Number = 10^10 , Size = 32
Number = 10^20 , Size = 36
Number = 10^30 , Size = 40
Number = 10^40 , Size = 44
Number = 10^50 , Size = 48
Number = 10^60 , Size = 52
Number = 10^70 , Size = 56
Number = 10^80 , Size = 60
Number = 10^90 , Size = 64
Number = 10^100, Size = 72
Enter fullscreen mode Exit fullscreen mode

Recreating integers

We made use of from_address to retrieve various parts of an integer. Let's actually write a custom structure that takes an integer using ctypes.Structure abstract base class.

class IntStructure(ctypes.Structure):
    _fields_ = [("ob_refcnt", ctypes.c_long),
                ("ob_type"  , ctypes.c_void_p),
                ("ob_size"  , ctypes.c_long),
                ("ob_digit" , ctypes.c_uint * 0)]

    def __repr__(self):
        digits   = self.get_digits()
        int_size = self.calc_size()
        int_num  = self.calc_integer(digits)
        return (f'Int(ob_refcnt={self.ob_refcnt},\n'
                f'    ob_size={self.ob_size},\n'
                f'    ob_digit={digits},\n'
                f'    int_size={int_size} bytes,\n'
                f'    int_num={int_num})')

    def get_digits(self):
        """Internal representation of `ob_digit` as a list"""

        digits = ctypes.cast(
            ctypes.byref(self.ob_digit),
            ctypes.POINTER(ctypes.c_uint * self.ob_size)
        ).contents

        return list(digits)

    def calc_size(self):
        """Number of bytes consumed by the integer"""

        size = (ctypes.sizeof(ctypes.c_long)   + 
                ctypes.sizeof(ctypes.c_void_p) +
                ctypes.sizeof(ctypes.c_long)   + 
                ctypes.sizeof(ctypes.c_long))
        if self.ob_size > 2:
            size += (self.ob_size - 2) * ctypes.sizeof(ctypes.c_uint)
        return size

    def calc_integer(self, digits):
        """Retrieve the actual integer"""

        return sum(
            [digit * (1024**3)**i
             for i, digit in enumerate(digits)]
        )
Enter fullscreen mode Exit fullscreen mode
a = 50
addr_a = id(a)
int_a = IntStructure.from_address(addr_a)
int_a
Enter fullscreen mode Exit fullscreen mode
Int(ob_refcnt=77,
    ob_size=1,
    ob_digit=[50],
    int_size=32 bytes,
    int_num=50)
Enter fullscreen mode Exit fullscreen mode
b = 10**40
addr_b = id(b)
int_b = IntStructure.from_address(addr_b)
int_b
Enter fullscreen mode Exit fullscreen mode
Int(ob_refcnt=1,
    ob_size=5,
    ob_digit=[0, 668304384, 172751547, 175927511, 7523],
    int_size=44 bytes,
    int_num=10000000000000000000000000000000000000000)
Enter fullscreen mode Exit fullscreen mode

Changing the value of an integer

Referred from r/Python

Do not do this anywhere in your actual programs! Very bad things can happen

Let's modify the integer 120

# Define a simpler structure (similar to IntStructure)

class MyInt(ctypes.Structure):
    _fields_ = [("ob_refcnt", ctypes.c_long),
                ("ob_type"  , ctypes.c_void_p),
                ("ob_size"  , ctypes.c_ulong),
                ("ob_digit" , ctypes.c_long)]
Enter fullscreen mode Exit fullscreen mode
addr_120 = id(120)
struct_120 = MyInt.from_address(addr_120)
struct_120.ob_digit = 25
Enter fullscreen mode Exit fullscreen mode
120
Enter fullscreen mode Exit fullscreen mode
25
Enter fullscreen mode Exit fullscreen mode
120 + 120
Enter fullscreen mode Exit fullscreen mode
50
Enter fullscreen mode Exit fullscreen mode
60 + 60
Enter fullscreen mode Exit fullscreen mode
25
Enter fullscreen mode Exit fullscreen mode




What next?

Using similar techniques, we can decode how the basic data structures are implemented efficiently in Python.

  • Python Lists
  • Dictionaries
💖 💪 🙅 🚩
nowke
Navaneesh Kumar

Posted on March 31, 2019

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related