Using faker and pandas Python Libraries to Create Synthetic Data for Testing

rahulbhave

Posted on September 15, 2024

Introduction:
Comprehensive testing is essential for data-driven applications, but it depends on having suitable datasets, which are not always available. Whether you are developing web applications, machine learning models, or backend systems, you need realistic, structured data to validate behavior and performance. Access to real-world data is often limited by privacy concerns, licensing restrictions, or the simple unavailability of relevant data. This is where synthetic data becomes valuable.

In this blog, we will explore how Python can be used to generate synthetic data for different scenarios, including:

  1. Interrelated Tables: Representing one-to-many relationships.
  2. Hierarchical Data: Often used in organizational structures.
  3. Complex Relationships: Such as many-to-many relationships in enrollment systems.

We’ll leverage the faker and pandas libraries to create realistic datasets for these use cases.


Example 1: Creating Synthetic Data for Customers and Orders (One-to-Many Relationship)

In many applications, data is stored in multiple tables with foreign key relationships. Let’s generate synthetic data for customers and their orders. A customer can place multiple orders, representing a one-to-many relationship.

Generating the Customers Table

The Customers table contains basic information such as CustomerID, name, and email address.

import pandas as pd
from faker import Faker
import random

fake = Faker()

def generate_customers(num_customers):
    customers = []
    for _ in range(num_customers):
        customer_id = fake.uuid4()
        name = fake.name()
        email = fake.email()
        customers.append({'CustomerID': customer_id, 'CustomerName': name, 'Email': email})
    return pd.DataFrame(customers)

customers_df = generate_customers(10)

This code generates 10 random customers using Faker to create realistic names and email addresses.

Generating the Orders Table

Now, we generate the Orders table, where each order is associated with a customer through CustomerID.

def generate_orders(customers_df, num_orders):
    # Build the list of valid foreign keys once rather than on every iteration
    customer_ids = customers_df['CustomerID'].tolist()
    orders = []
    for _ in range(num_orders):
        order_id = fake.uuid4()
        customer_id = random.choice(customer_ids)
        product = fake.random_element(elements=('Laptop', 'Phone', 'Tablet', 'Headphones'))
        price = round(random.uniform(100, 2000), 2)
        orders.append({'OrderID': order_id, 'CustomerID': customer_id, 'Product': product, 'Price': price})
    return pd.DataFrame(orders)

orders_df = generate_orders(customers_df, 30)

In this case, the Orders table links each order to a customer using the CustomerID. Each customer can place multiple orders, forming a one-to-many relationship.
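Once both tables exist, it can be worth checking that the relationship actually holds. The sketch below uses small hand-written stand-ins for customers_df and orders_df (the IDs, names, and prices are made up for illustration) to show one way to look for orphaned orders and to count orders per customer with pandas.

```python
import pandas as pd

# Hand-written stand-ins for customers_df and orders_df (illustrative values)
customers_df = pd.DataFrame({
    'CustomerID': ['c1', 'c2', 'c3'],
    'CustomerName': ['Ann', 'Bob', 'Cy'],
})
orders_df = pd.DataFrame({
    'OrderID': ['o1', 'o2', 'o3', 'o4'],
    'CustomerID': ['c1', 'c1', 'c2', 'c1'],
    'Price': [100.0, 250.5, 80.0, 40.0],
})

# Referential integrity: every CustomerID in Orders should exist in Customers
orphans = orders_df[~orders_df['CustomerID'].isin(customers_df['CustomerID'])]

# Orders per customer; the left merge keeps customers that have no orders
order_counts = (
    orders_df.groupby('CustomerID').size().rename('OrderCount').reset_index()
)
summary = customers_df.merge(order_counts, on='CustomerID', how='left')
summary['OrderCount'] = summary['OrderCount'].fillna(0).astype(int)
print(summary)
```

Because orders are assigned with random.choice, some customers may end up with no orders at all, which is why the left merge and fillna step are included.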


Example 2: Generating Hierarchical Data for Departments and Employees

Hierarchical data is often used in organizational settings, where departments have multiple employees. Let’s simulate an organization with departments, each of which has multiple employees.

Generating the Departments Table

The Departments table contains each department's unique DepartmentID, name, and manager.

def generate_departments(num_departments):
    departments = []
    for _ in range(num_departments):
        department_id = fake.uuid4()
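        # company_suffix() returns values such as 'Inc' or 'LLC'; swap in a
        # fixed list of department names if more realistic labels are needed
        department_name = fake.company_suffix()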
        manager = fake.name()
        departments.append({'DepartmentID': department_id, 'DepartmentName': department_name, 'Manager': manager})
    return pd.DataFrame(departments)

departments_df = generate_departments(10)

Generating the Employees Table

Next, we generate the Employees table, where each employee is associated with a department via DepartmentID.

def generate_employees(departments_df, num_employees):
    # Build the list of valid department keys once rather than on every iteration
    department_ids = departments_df['DepartmentID'].tolist()
    employees = []
    for _ in range(num_employees):
        employee_id = fake.uuid4()
        employee_name = fake.name()
        email = fake.email()
        department_id = random.choice(department_ids)
        salary = round(random.uniform(40000, 120000), 2)
        employees.append({
            'EmployeeID': employee_id,
            'EmployeeName': employee_name,
            'Email': email,
            'DepartmentID': department_id,
            'Salary': salary
        })
    return pd.DataFrame(employees)

employees_df = generate_employees(departments_df, 100)

This hierarchical structure links each employee to a department through DepartmentID, forming a parent-child relationship.
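Some tests need the hierarchy as a nested structure (for example, a JSON payload) instead of two flat tables. The sketch below shows one possible way to nest employees under their departments. The helper name build_org_tree and the sample rows are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hand-written stand-ins for departments_df and employees_df (illustrative values)
departments_df = pd.DataFrame({
    'DepartmentID': ['d1', 'd2'],
    'DepartmentName': ['Engineering', 'Sales'],
})
employees_df = pd.DataFrame({
    'EmployeeID': ['e1', 'e2', 'e3'],
    'EmployeeName': ['Ann', 'Bob', 'Cy'],
    'DepartmentID': ['d1', 'd1', 'd2'],
})


def build_org_tree(departments_df, employees_df):
    """Return a list of department dicts, each holding its employees."""
    by_department = {
        dept_id: group.drop(columns='DepartmentID').to_dict('records')
        for dept_id, group in employees_df.groupby('DepartmentID')
    }
    tree = []
    for dept in departments_df.to_dict('records'):
        # Departments without employees get an empty list
        dept['Employees'] = by_department.get(dept['DepartmentID'], [])
        tree.append(dept)
    return tree


org_tree = build_org_tree(departments_df, employees_df)
print(org_tree)
```

The result is a plain list of dictionaries, so it can be passed to json.dumps or compared directly in assertions.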


Example 3: Simulating Many-to-Many Relationships for Course Enrollments

In a many-to-many relationship, each record on one side can relate to many records on the other. Let’s simulate this with students enrolling in multiple courses, where each course also has multiple students.

Generating the Courses Table

def generate_courses(num_courses):
    courses = []
    for _ in range(num_courses):
        course_id = fake.uuid4()
        course_name = fake.bs().title()
        instructor = fake.name()
        courses.append({'CourseID': course_id, 'CourseName': course_name, 'Instructor': instructor})
    return pd.DataFrame(courses)

courses_df = generate_courses(20)

Generating the Students Table

def generate_students(num_students):
    students = []
    for _ in range(num_students):
        student_id = fake.uuid4()
        student_name = fake.name()
        email = fake.email()
        students.append({'StudentID': student_id, 'StudentName': student_name, 'Email': email})
    return pd.DataFrame(students)

students_df = generate_students(50)
print(students_df)

Generating the Course Enrollments Table

The CourseEnrollments table captures the many-to-many relationship between students and courses.

def generate_course_enrollments(students_df, courses_df, num_enrollments):
    # Build the lists of valid keys once rather than on every iteration
    student_ids = students_df['StudentID'].tolist()
    course_ids = courses_df['CourseID'].tolist()
    enrollments = []
    for _ in range(num_enrollments):
        enrollment_id = fake.uuid4()
        student_id = random.choice(student_ids)
        course_id = random.choice(course_ids)
        enrollment_date = fake.date_this_year()
        enrollments.append({
            'EnrollmentID': enrollment_id,
            'StudentID': student_id,
            'CourseID': course_id,
            'EnrollmentDate': enrollment_date
        })
    return pd.DataFrame(enrollments)

enrollments_df = generate_course_enrollments(students_df, courses_df, 200)

In this example, we create a linking table to represent many-to-many relationships between students and courses.
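Because each student and course is picked independently at random, the same student can be enrolled in the same course more than once. If the system under test enforces a unique (StudentID, CourseID) constraint, one option is to sample pairs without replacement. This is a minimal sketch with a hypothetical helper, generate_unique_enrollments, and placeholder IDs; EnrollmentID and EnrollmentDate are left out for brevity.

```python
import itertools
import random

import pandas as pd


def generate_unique_enrollments(student_ids, course_ids, num_enrollments, seed=None):
    """Sample distinct (StudentID, CourseID) pairs without replacement."""
    all_pairs = list(itertools.product(student_ids, course_ids))
    if num_enrollments > len(all_pairs):
        raise ValueError('num_enrollments exceeds the number of possible pairs')
    rng = random.Random(seed)  # local generator, so global random state is untouched
    pairs = rng.sample(all_pairs, num_enrollments)
    return pd.DataFrame(pairs, columns=['StudentID', 'CourseID'])


enrollments_df = generate_unique_enrollments(
    student_ids=[f's{i}' for i in range(50)],
    course_ids=[f'c{i}' for i in range(20)],
    num_enrollments=200,
    seed=1,
)
print(enrollments_df.duplicated().sum())  # prints 0: no repeated pairs
```

Materializing every possible pair is fine for small tables like these; for very large tables, drawing random pairs and dropping duplicates with drop_duplicates may be the more practical route.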


Conclusion:
Using Python and libraries like Faker and Pandas, you can generate realistic and diverse synthetic datasets to meet a variety of testing needs. In this blog, we covered:

  1. Interrelated Tables: Demonstrating a one-to-many relationship between customers and orders.
  2. Hierarchical Data: Illustrating a parent-child relationship between departments and employees.
  3. Complex Relationships: Simulating many-to-many relationships between students and courses.

These examples provide a foundation for generating synthetic data tailored to your needs. Further enhancements can increase complexity and specificity, such as:

  1. Database-Specific Data: Customizing data generation for different database systems (e.g., SQL vs. NoSQL).
  2. More Complex Relationships: Creating additional interdependencies, such as temporal relationships, multi-level hierarchies, or unique constraints.
  3. Scaling Data: Generating larger datasets for performance testing or stress testing, ensuring the system can handle real-world conditions at scale.

By generating synthetic data tailored to your needs, you can simulate realistic conditions for developing, testing, and optimizing applications without relying on sensitive or hard-to-acquire datasets.
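As one possible take on the database-specific enhancement listed above, the generated DataFrames can be loaded into a SQL database and queried there. The sketch below assumes an in-memory SQLite database and small hand-written tables; the table names customers and orders and the sample values are illustrative choices.

```python
import sqlite3

import pandas as pd

# Hand-written stand-ins for customers_df and orders_df (illustrative values)
customers_df = pd.DataFrame({
    'CustomerID': ['c1', 'c2'],
    'CustomerName': ['Ann', 'Bob'],
})
orders_df = pd.DataFrame({
    'OrderID': ['o1', 'o2', 'o3'],
    'CustomerID': ['c1', 'c1', 'c2'],
    'Price': [100.0, 250.5, 80.0],
})

# Load both tables into an in-memory SQLite database
conn = sqlite3.connect(':memory:')
customers_df.to_sql('customers', conn, index=False)
orders_df.to_sql('orders', conn, index=False)

# Join the tables in SQL, as application code under test might do
query = """
    SELECT c.CustomerName, COUNT(o.OrderID) AS OrderCount, SUM(o.Price) AS Total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID, c.CustomerName
    ORDER BY c.CustomerName
"""
report = pd.read_sql_query(query, conn)
conn.close()
print(report)
```

For other database systems, to_sql is normally used with a SQLAlchemy connection instead of a raw sqlite3 one.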

If you like the article, please share it with your friends and colleagues. You can connect with me on LinkedIn to discuss any further ideas.

