Nicholas Mendez
Posted on July 27, 2022
Contaminated databases are a drag on performance.
In machine learning , datasets contaminated with duplicates cause models to learn artificial patterns that do not exist. In the real-life context of emergency dispatch call centers, multiple calls about the same complaint cause multiple officers to respond to the same incident.
Consequently , misallocated resources resulting from poor dataset structure create larger problems than originally encountered.
The solution is naïvely simple : Don't enter conflicting records in our dataset.
While the constraints and parameters of every dataset will vary , the general process and application of said solution is the same.
- Identify which parameters will be used to find existing records in your model's dataset.
- Uniquely identify each entry when submitting an entry/record.
- Handle all responses from querying the dataset.
Simple enough , right? Fortunately, ActiveRecord offers a few finder methods. These methods allow us to determine whether or not a conflicting record exists prior to entering a new record into our database.
Let's explore how to maintain a stable and clean database through practical examples.
Queue Wag n' Walk , a schedule planner tailored to dog walking. There are a few different headaches that will come from contaminating the database ; let's identify the different entries that may compromise our table of appointments.
This example requires a basic understanding of ActiveRecord gem , table associations , and SQL tables.
1. Identify Useful Parameters
Each scheduled appointment has 4 different user inputs ; a date/time , a dog , a walker (employee) , and the walk duration. A few potential scheduling conflicts should stand out:
- Scheduling a walker for a time slot that conflicts with their current appointments
- Scheduling a dog for a time slot that conflicts with their current appointments
- Scheduling the same appointment twice
The first method to consider is the find_or_create_by method.
This method either finds the first record with the provided attributes, or creates a new record if one is not found. This method used on its own is useful when we're searching for an exact record , or when our dataset is simple.
Appointment.find_or_create_by({
employee_id: params[:employee_id] ,
start: params[:start]})
If no record with the provided employee id and starting appointment date/time is found in our Appointments Table, then a new record is entered.
If we attempt schedule an appointment while an existing appointment is in progress , our find_or_create_by method will not notice and will enter a conflicting appointment for our walker into our schedule.
Method Chaining
The second method to consider is actually a combination of multiple methods, known as method chaining. By method chaining , we are able to apply multiple conditions to vet and find records in our database.
.where Method
exist = Appointment.where({
start: params[:start] ,
dog_id: params[:dog_id]
})
.where selects all records that match the attributes passed to the method. In this case , exist will equal all appointments with the start time matching the provided start time , and the dog id matching the dog id.
.or Method
exist = Appointment.where({
start: params[:start] ,
dog_id: params[:dog_id]
}).or(Appointment.where({start: params[:start] ,
employee_id: params[:employee_id]}))
Chain the .or method to add a second condition to your query. In this case , we are looking for any appointment matching the provided start time and dog id , or any appointment matching the provided start time and employee id.
.find_all Method
appt_in_progress_dogs = Appointment.all
.find_all {|a| a[:dog_id] == params[:dog_id]}
The find_all method returns an array containing the records selected for which the given block returns a true value. In this case , we are finding all records from the Appointments table whose dog id matches the provided dog id.
.between? Method
appt_in_progress_dogs = Appointment.all
.find_all {|a| a[:dog_id] == params[:dog_id]}
.find_all {|a| params[:start].between?(a[:start] , a[:end])}
The between method returns a true or false value. It determines whether or not a value is between a provided minimum or maximum. In this case , we want to find all records with a dog id matching the provided dog id. Then , out of those records , we want to find all cases where the new appointment starts while an existing appointment for the provided dog is in progress.
In other words , we do not want to schedule a walk for a dog if the dog is in the middle of a walk. The same case applies for any walker (employee).
.find_in_batches Method
Appointment.find_in_batches(batch_size: 1000) do |a|
a.find_by({start: params[:start]})
end
The find_in_batches method is useful when working with large datasets. Provided the size of a batch , this method will only query the provided batch size at a time. This method will cut some of the performance issues that occur from working with larger datasets.
We are able to be more intentional and specific about the types of records we want to find through method chaining.
2. Uniquely Identify Records
This answer is fairly simple with Active Record. By using Active Record Migrations , a primary key column is automatically generated , and each record entered to our table is assigned a unique id.
3. Handling Query Responses
Now that we've queried the appointments table , we will conditionally respond to the user's new appointment request.
IF Statement
if !exist.exists? &&[*appt_in_progress_dogs,*appt_in_progress_walkers].length < 1
The IF statement in our POST method makes use of the queries we executed. If the new appointment request does not match any existing record exactly , then it passes the first condition in the IF statement. The .exists? method returns true or false. In this case , if no record exists , then we proceed to test the second condition.
The second condition of the IF statement utilizes the splat (*) operator, which functions similarly to Javascript (ES6) spread (...) operator. If no appointments in progress are found for the provided dog or walker, then the length of our array will be zero.
If both conditions are met , then a new Appointment will be created with the provided attributes sent by our front end and our response will be sent as a JSON object.
Else Statement
If either of the conditions return false , then our backend will send an error message response as a JSON object. Some errors may be more complex than others , and therefore , you may setup multiple conditions for multiple errors.
fetch(`http://localhost:3005/appointments` , {
method : 'POST' ,
headers: { "Content-Type" : 'application/json'} ,
body : JSON.stringify(newAppointment)
})
.then(r => r.json())
.then((appointment) => {
if(Object.keys(appointment).length === 1) {
alert(appointment.error)
} else {
setAppointments([...appointments , appointment])
alert(`Appointment for ${newDate} at ${time} has been scheduled`)
}
})
Given the response received from our initial POST request , we alert the user of either a successfully scheduled appointment or with an error we sent from out backend.
The same queries and conditions may also be applied to PATCH request. We also need to validate whether updating an existing appointment will cause the same contamination as our POST request.
In Conclusion
ActiveRecord provides many useful methods for querying our databases. Chaining methods allow us to be more intentional and specific with our queries. Determine whether or not a conflicting record exists in our database , and conditionally respond to the possible outcomes.
Resources
How to Deal With Duplicate Entries
CS: Duplicate Records
What is Data Cleaning
Ruby on Rails Docs
Active Record Migrations Docs
Posted on July 27, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.