Building fake data for tests using Python

rqmlr

Erick Müller

Posted on July 8, 2020

Sometimes you start a project from scratch and get that empty feeling of having no data to test your code with. Or you receive the "gift" of a legacy project whose code is marked as "done" but comes with no data to prove that it really works.

We’ve all been there, even you, novice programmer. And every time we do the same thing: insert some data manually, a tedious job always done with headphones on, listening to that song that makes time pass SLOWER.

So, to help you avoid these problems, I'll show you some techniques I've been using over the last few years to generate test data, or fake data, if you prefer that name. And we'll use Python, because it is the best language for this kind of thing.

Using the random package

Python comes bundled with a nice module for handling randomness, called random, which can help us generate random data based on an existing sample. And, yes, you can build this sample yourself and have random work with it.

To use it, let's start by importing the package.

import random

random.seed()

Important: random.seed() initializes the random number generator. If you want the generator to always return the same data, pass the same value to this call (e.g. random.seed(42)). This is especially useful for reproducible tests. If you pass nothing, the seed is taken from the operating system's randomness source when one is available, or from the current system time otherwise, so each run produces different random data.
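As a minimal sketch of the reproducibility point: seeding twice with the same value replays the same sequence. The helper name draw_prices and the seed value 42 are only illustrative, they are not part of the random module.

```python
import random


def draw_prices(seed, count=5):
    """Return `count` pseudo-random prices after seeding with `seed` (illustrative helper)."""
    random.seed(seed)
    # randrange(1, 10001) yields integers 1..10000, i.e. 0.01..100.00 after dividing
    return [random.randrange(1, 10001) / 100 for _ in range(count)]


first_run = draw_prices(42)
second_run = draw_prices(42)

print(first_run == second_run)  # True: same seed, same sequence
```

Calling draw_prices with a different seed would, in practice, give a different list.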

OK. For the first example, we want to generate a list of products and prices. We know a few things: each product has a product_id, a category_id and a price. The product_id has at least 4 digits, the category_id is always a single value taken from a known list, and the price falls within a known range, from 0.01 to 100.00. And we need exactly 273 items in this list.

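import random

random.seed()

known_categories = ['DRY', 'FRESH', 'LIQUID']    # my data sample
# random.sample needs the number of items to draw; ask for 273 distinct ids up front
product_ids = random.sample(range(1000, 10000), 273)
items = [
    {
        "product_id": product_id,
        "category_id": random.choice(known_categories),
        # randrange excludes the stop value, so 10001 allows a price of 100.00
        "price": random.randrange(1, 10001) / 100
    }
    for product_id in product_ids
]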

After running this code, items will hold the 273 products, each following the rules above.

I've used a list comprehension, which is an expression that generates a list.
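For readers new to the syntax, here is a small sketch comparing an explicit loop with the equivalent list comprehension. The variable names are only illustrative.

```python
# building a list with an explicit loop
squares_loop = []
for n in range(5):
    squares_loop.append(n * n)

# the same list built with a list comprehension
squares_comprehension = [n * n for n in range(5)]

print(squares_comprehension)  # [0, 1, 4, 9, 16]
```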

The three functions used from random are:

  • random.sample, which draws a given number of distinct items from the range. sample guarantees that no item is repeated.

  • random.choice, which selects one item from the list.

  • random.randrange, which selects one integer from the range. Since randrange only returns integers, we divide the returned number by 100 to get our price.
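One way to gain confidence in the generated list is to check it against the rules stated earlier. The sketch below wraps the three functions in a hypothetical build_items helper (the name and the seed parameter are assumptions for illustration) and asserts each rule.

```python
import random


def build_items(count, categories, seed=None):
    """Build `count` product dicts with distinct four-digit ids (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, leaves the global one untouched
    product_ids = rng.sample(range(1000, 10000), count)  # distinct ids, no repeats
    return [
        {
            "product_id": product_id,
            "category_id": rng.choice(categories),
            "price": rng.randrange(1, 10001) / 100,  # 0.01 .. 100.00
        }
        for product_id in product_ids
    ]


known_categories = ["DRY", "FRESH", "LIQUID"]
items = build_items(273, known_categories, seed=42)

# check the generated data against the stated rules
assert len(items) == 273
assert len({item["product_id"] for item in items}) == 273  # no repeated ids
assert all(item["category_id"] in known_categories for item in items)
assert all(0.01 <= item["price"] <= 100.00 for item in items)
```

Using random.Random(seed) instead of the module-level functions is a design choice here: it keeps this generator independent from any other code that also uses random.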

Using the faker package

For other data formats, specific cases, or other scenarios, we can use the faker package.

This package generates data based on common concepts, like "name" and "address".

To install it, use:

pip install faker

In your code, import and use it. For our second example, we will generate a list of users, with full name, address and phone number.

from faker import Faker

fake = Faker()
user_list = [
    {
        "full_name": "{} {}".format(fake.first_name(), fake.last_name()),
        "address": fake.address(),
        "phone": fake.phone_number(),
    }
    for _ in range(3)
]

The resulting user_list looks like this (your values will differ):

[{'address': '3005 Sydney Isle\nCombsmouth, FL 14919',
  'full_name': 'Rachel Weaver',
  'phone': '691-992-2752'},
 {'address': '290 Rich Walk\nQuinntown, MT 05852',
  'full_name': 'Elijah Hood',
  'phone': '9308368606'},
 {'address': '787 Andrea Valley\nWoodfurt, VA 86763',
  'full_name': 'Janice Jones',
  'phone': '(889)096-4636x1433'}]

If you specify a locale code when instantiating the Faker object, the same functions return data specific to that locale. If you want data tailored for Germany, you can use:

fake = Faker('de_DE')

With the locale parameter updated, the same example now returns:

[{'address': 'Scheelplatz 863\n68737 Pinneberg',
  'full_name': 'Tim Jessel',
  'phone': '+49(0)5917 36668'},
 {'address': 'Bertold-Dussen van-Weg 78\n89881 Hohenstein-Ernstthal',
  'full_name': 'Sophia Lorch',
  'phone': '(01072) 277426'},
 {'address': 'Herrmannstr. 4/6\n98896 Lübeck',
  'full_name': 'Elke Bohlander',
  'phone': '(03509) 08874'}]

Some localizations even have their own special functions. The full list of locales can be found in the Faker documentation, and the special functions for each localization are described in the provider documentation for each "locale".

As a last note, you can use the seed method of the Faker class, just as you can with Python's random, and it yields the same results given the same seed.

And now...

Using these two tools, you can generate the data you need to test your code. From this starting point, some options are available:

  • you can send the generated data to a database and regenerate it for each test. If you set the seed to the same value, you always get the same results, which ensures that the tests always run with the same test data.

  • you can write the data to a file, in a format like CSV, and share it with users who can validate the test data you're using in your tests.
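Both options can be sketched with the standard library alone. The example below is only an illustration under a few assumptions: the table name products, the in-memory SQLite database, and the StringIO buffer standing in for a real file are all choices made for this sketch.

```python
import csv
import io
import random
import sqlite3

random.seed(42)  # fixed seed so every run produces the same rows
categories = ["DRY", "FRESH", "LIQUID"]
rows = [
    (product_id, random.choice(categories), random.randrange(1, 10001) / 100)
    for product_id in random.sample(range(1000, 10000), 10)
]

# Option 1: load the rows into a database (in-memory SQLite here)
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, category_id TEXT, price REAL)"
)
connection.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
row_count = connection.execute("SELECT COUNT(*) FROM products").fetchone()[0]
connection.close()

# Option 2: write the rows as CSV (a StringIO buffer stands in for a file)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["product_id", "category_id", "price"])
writer.writerows(rows)
csv_text = buffer.getvalue()

print(row_count)                  # 10
print(csv_text.splitlines()[0])   # product_id,category_id,price
```

To write a real file instead, open it with open("products.csv", "w", newline="") and pass that file object to csv.writer; the file name is again just an example.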
