Using faker and pandas Python Libraries to Create Synthetic Data for Testing
rahulbhave
Posted on September 15, 2024
Introduction:
Comprehensive testing is essential for data-driven applications, but it depends on having suitable datasets, which are not always available. Whether you are developing web applications, machine learning models, or backend systems, you need realistic, structured data to validate behavior and performance. Real-world data is often out of reach because of privacy concerns, licensing restrictions, or simply because relevant data does not exist. This is where synthetic data becomes valuable.
In this blog, we will explore how Python can be used to generate synthetic data for different scenarios, including:
- Interrelated Tables: Representing one-to-many relationships.
- Hierarchical Data: Often used in organizational structures.
- Complex Relationships: Such as many-to-many relationships in enrollment systems.
We’ll leverage the faker and pandas libraries to create realistic datasets for these use cases.
Example 1: Creating Synthetic Data for Customers and Orders (One-to-Many Relationship)
In many applications, data is stored in multiple tables with foreign key relationships. Let’s generate synthetic data for customers and their orders. A customer can place multiple orders, representing a one-to-many relationship.
Generating the Customers Table
The Customers table contains basic information such as CustomerID, name, and email address.
import random
import pandas as pd
from faker import Faker

# One shared Faker instance is reused by all generator functions in this post.
fake = Faker()

def generate_customers(num_customers):
    """Return a DataFrame with one row per synthetic customer."""
    customers = []
    for _ in range(num_customers):
        customer_id = fake.uuid4()  # random UUID string used as the primary key
        name = fake.name()
        email = fake.email()
        customers.append({'CustomerID': customer_id, 'CustomerName': name, 'Email': email})
    return pd.DataFrame(customers)

customers_df = generate_customers(10)
This code generates 10 random customers using Faker to create realistic names and email addresses.
Generating the Orders Table
Now, we generate the Orders table, where each order is associated with a customer through CustomerID.
def generate_orders(customers_df, num_orders):
    """Return a DataFrame of orders, each pointing at an existing customer."""
    orders = []
    # Build the list of valid foreign keys once instead of on every iteration.
    customer_ids = customers_df['CustomerID'].tolist()
    for _ in range(num_orders):
        order_id = fake.uuid4()
        customer_id = random.choice(customer_ids)
        product = fake.random_element(elements=('Laptop', 'Phone', 'Tablet', 'Headphones'))
        price = round(random.uniform(100, 2000), 2)
        orders.append({'OrderID': order_id, 'CustomerID': customer_id, 'Product': product, 'Price': price})
    return pd.DataFrame(orders)

orders_df = generate_orders(customers_df, 30)
In this case, the Orders table links each order to a customer using the CustomerID. Each customer can place multiple orders, forming a one-to-many relationship.
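Since customers are picked at random, some may receive several orders and others none. As a rough sketch, the relationship can be checked and summarized with pandas. The small hand-written tables below are stand-ins for customers_df and orders_df, so the example runs on its own.

```python
import pandas as pd

# Hand-written stand-ins for customers_df and orders_df (illustrative data only).
customers_df = pd.DataFrame({
    "CustomerID": ["c1", "c2", "c3"],
    "CustomerName": ["Ann", "Bob", "Cy"],
})
orders_df = pd.DataFrame({
    "OrderID": ["o1", "o2", "o3", "o4"],
    "CustomerID": ["c1", "c1", "c3", "c3"],
    "Price": [100.0, 250.5, 80.0, 20.0],
})

# Referential integrity: every CustomerID in Orders must exist in Customers.
orphans = orders_df[~orders_df["CustomerID"].isin(customers_df["CustomerID"])]
assert orphans.empty

# A left join keeps customers that have no orders, so they show up with a count of 0.
summary = (
    customers_df.merge(orders_df, on="CustomerID", how="left")
    .groupby("CustomerID")
    .agg(OrderCount=("OrderID", "count"), TotalSpent=("Price", "sum"))
    .reset_index()
)
print(summary)
```

Here customer c2 has no orders, which mirrors what can happen when `random.choice` never selects a given customer.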
Example 2: Generating Hierarchical Data for Departments and Employees
Hierarchical data is common in organizational settings, where a parent record such as a department owns many child records such as employees. Let’s simulate an organization with several departments, each of which has multiple employees.
Generating the Departments Table
The Departments table contains each department's unique DepartmentID, name, and manager.
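def generate_departments(num_departments):
    """Return a DataFrame with one row per synthetic department."""
    departments = []
    for _ in range(num_departments):
        department_id = fake.uuid4()
        # company_suffix() returns values such as 'Inc' or 'LLC'; they serve only as
        # placeholder department names and may repeat across rows.
        department_name = fake.company_suffix()
        manager = fake.name()
        departments.append({'DepartmentID': department_id, 'DepartmentName': department_name, 'Manager': manager})
    return pd.DataFrame(departments)

departments_df = generate_departments(10)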
Generating the Employees Table
Next, we generate the Employees table, where each employee is associated with a department via DepartmentID.
def generate_employees(departments_df, num_employees):
    """Return a DataFrame of employees, each assigned to an existing department."""
    employees = []
    # Build the list of valid foreign keys once instead of on every iteration.
    department_ids = departments_df['DepartmentID'].tolist()
    for _ in range(num_employees):
        employee_id = fake.uuid4()
        employee_name = fake.name()
        email = fake.email()
        department_id = random.choice(department_ids)
        salary = round(random.uniform(40000, 120000), 2)
        employees.append({
            'EmployeeID': employee_id,
            'EmployeeName': employee_name,
            'Email': email,
            'DepartmentID': department_id,
            'Salary': salary
        })
    return pd.DataFrame(employees)

employees_df = generate_employees(departments_df, 100)
This hierarchical structure links each employee to a department through DepartmentID, forming a parent-child relationship.
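Purely random assignment does not guarantee that every department ends up with at least one employee. If a test needs that guarantee, one possible approach is sketched below. The helper name `assign_departments` and its `rng` parameter are hypothetical and are not part of the code above.

```python
import random


def assign_departments(department_ids, num_employees, rng=None):
    """Return one department ID per employee, covering every department at least once."""
    if num_employees < len(department_ids):
        raise ValueError("need at least one employee per department")
    rng = rng or random.Random()
    # Give each department one employee first, then fill the remaining slots at random.
    assignments = list(department_ids)
    assignments += rng.choices(department_ids, k=num_employees - len(department_ids))
    # Shuffle so the guaranteed assignments are not clustered at the start.
    rng.shuffle(assignments)
    return assignments


department_ids = ["d1", "d2", "d3"]
assignments = assign_departments(department_ids, 10, rng=random.Random(0))
print(assignments)
```

The returned list could then be used to fill the DepartmentID column instead of calling `random.choice` for each employee.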
Example 3: Simulating Many-to-Many Relationships for Course Enrollments
In a many-to-many relationship, each record on one side can relate to many records on the other side, and vice versa. Let’s simulate this with students enrolling in multiple courses, where each course also has multiple students.
Generating the Courses Table
def generate_courses(num_courses):
    """Return a DataFrame with one row per synthetic course."""
    courses = []
    for _ in range(num_courses):
        course_id = fake.uuid4()
        # bs() returns a business buzz-phrase; title-cased, it stands in for a course name.
        course_name = fake.bs().title()
        instructor = fake.name()
        courses.append({'CourseID': course_id, 'CourseName': course_name, 'Instructor': instructor})
    return pd.DataFrame(courses)

courses_df = generate_courses(20)
Generating the Students Table
def generate_students(num_students):
    students = []
    for _ in range(num_students):
        student_id = fake.uuid4()
        student_name = fake.name()
        email = fake.email()
        students.append({'StudentID': student_id, 'StudentName': student_name, 'Email': email})
    return pd.DataFrame(students)

students_df = generate_students(50)
print(students_df)
Generating the Course Enrollments Table
The CourseEnrollments table captures the many-to-many relationship between students and courses.
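def generate_course_enrollments(students_df, courses_df, num_enrollments):
    """Return a linking table that pairs existing students with existing courses."""
    enrollments = []
    # Build the lists of valid foreign keys once instead of on every iteration.
    student_ids = students_df['StudentID'].tolist()
    course_ids = courses_df['CourseID'].tolist()
    for _ in range(num_enrollments):
        enrollment_id = fake.uuid4()
        # Pairs are drawn independently, so the same student-course pair can repeat.
        student_id = random.choice(student_ids)
        course_id = random.choice(course_ids)
        enrollment_date = fake.date_this_year()
        enrollments.append({
            'EnrollmentID': enrollment_id,
            'StudentID': student_id,
            'CourseID': course_id,
            'EnrollmentDate': enrollment_date
        })
    return pd.DataFrame(enrollments)

enrollments_df = generate_course_enrollments(students_df, courses_df, 200)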
In this example, we create a linking table to represent many-to-many relationships between students and courses.
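Because the student and the course are chosen independently for each row, the same student can be enrolled in the same course more than once. If the target schema has a unique constraint on that pair, one option is to sample from all possible pairs without replacement. This is a minimal sketch; the helper name `sample_unique_enrollments` and its `rng` parameter are hypothetical.

```python
import itertools
import random

import pandas as pd


def sample_unique_enrollments(student_ids, course_ids, num_enrollments, rng=None):
    """Return a DataFrame of distinct (StudentID, CourseID) pairs."""
    rng = rng or random.Random()
    # Enumerate every possible pair, then draw without replacement.
    all_pairs = list(itertools.product(student_ids, course_ids))
    if num_enrollments > len(all_pairs):
        raise ValueError("more enrollments requested than distinct pairs available")
    pairs = rng.sample(all_pairs, k=num_enrollments)
    return pd.DataFrame(pairs, columns=["StudentID", "CourseID"])


student_ids = [f"s{i}" for i in range(5)]
course_ids = [f"c{i}" for i in range(4)]
unique_enrollments = sample_unique_enrollments(
    student_ids, course_ids, 12, rng=random.Random(1)
)
print(unique_enrollments.duplicated().any())
```

Enumerating every pair is fine for small tables such as 50 students and 20 courses, but it grows with the product of the two table sizes, so very large datasets would likely need a different strategy.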
Conclusion:
Using Python and libraries like Faker and Pandas, you can generate realistic and diverse synthetic datasets to meet a variety of testing needs. In this blog, we covered:
- Interrelated Tables: Demonstrating a one-to-many relationship between customers and orders.
- Hierarchical Data: Illustrating a parent-child relationship between departments and employees.
- Complex Relationships: Simulating many-to-many relationships between students and courses.
These examples provide a foundation for generating synthetic data tailored to your needs. Further enhancements can increase complexity and specificity, such as:
- Database-Specific Data: Customizing data generation for different database systems (e.g., SQL vs. NoSQL).
- More Complex Relationships: Creating additional interdependencies, such as temporal relationships, multi-level hierarchies, or unique constraints.
- Scaling Data: Generating larger datasets for performance testing or stress testing, ensuring the system can handle real-world conditions at scale.
By generating synthetic data tailored to your needs, you can simulate realistic conditions for developing, testing, and optimizing applications without relying on sensitive or hard-to-acquire datasets.
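As one illustration of the database-specific direction mentioned above, generated DataFrames can be loaded into an in-memory SQLite database and queried with SQL. This is only a sketch under stated assumptions: the table names `customers` and `orders` are chosen for illustration, and the hand-written rows stand in for generated data.

```python
import sqlite3

import pandas as pd

# Hand-written stand-ins for generated tables (illustrative data only).
customers_df = pd.DataFrame({
    "CustomerID": ["c1", "c2", "c3"],
    "CustomerName": ["Ann", "Bob", "Cy"],
})
orders_df = pd.DataFrame({
    "OrderID": ["o1", "o2", "o3"],
    "CustomerID": ["c1", "c1", "c3"],
})

conn = sqlite3.connect(":memory:")
try:
    # Write each DataFrame to its own table; index=False skips the pandas row index.
    customers_df.to_sql("customers", conn, index=False)
    orders_df.to_sql("orders", conn, index=False)
    # Count orders per customer, keeping customers that have no orders.
    result = pd.read_sql_query(
        "SELECT c.CustomerName, COUNT(o.OrderID) AS OrderCount "
        "FROM customers c LEFT JOIN orders o ON o.CustomerID = c.CustomerID "
        "GROUP BY c.CustomerID, c.CustomerName "
        "ORDER BY c.CustomerName",
        conn,
    )
finally:
    conn.close()

print(result)
```

The same DataFrames could be written to other targets with different pandas writers; SQLite is used here only because it ships with Python and needs no server.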
If you like the article, please share it with your friends and colleagues. You can connect with me on LinkedIn to discuss any further ideas.