Aditya
Posted on February 19, 2024
Database imports are a crucial part of many applications and need to be done right. Whether it's asking a user to fill out a form or uploading a CSV file through a UI, all data that enters the database must go through proper validation. For large CSV imports, balancing security and performance can be challenging. In this article, I'll demonstrate how to import 1 million records into a MySQL database, with validation, using Laravel APIs.
There is also an accompanying YouTube video on my channel, WebDevWithArtisan.
The key is to optimize for speed while not compromising data integrity. I'll use a REST API approach so any front-end can upload files. To handle the load, we'll employ a divide-and-conquer tactic by splitting the large CSV into chunks and harnessing Laravel's powerful job batching for queues.
Validation Strategy
We need to validate the structure and data types of the CSV before insertion. Each row will be checked against a set of rules covering required fields, data formats, and relationships (for example, that the user_id in a row actually exists in the users table).
Import Workflow
- The API endpoint accepts a CSV file upload
- The file gets divided into smaller chunks
- Each chunk is pushed into a job queue
- Jobs will validate rows and batch insert into the database
Optimization Considerations
- Use queue workers and supervised jobs for failure resiliency
- Tune batch sizes to balance speed and memory usage
- Add rate limiting to manage concurrent API requests
- Wrap inserts in database transactions for data integrity
By following these patterns, we can reliably ingest datasets of any size while protecting data quality. Now let's examine the code to implement our CSV import solution using Laravel and MySQL.
Prerequisite: This article assumes the reader has a basic understanding of Laravel and its core concepts.
Create a Laravel project with a Breeze starter kit.
As I am creating a project from scratch, I will use the Laravel Breeze API starter kit as it is minimal and comes with all the features (Laravel Sanctum configs) that we need to create a Laravel API.
If you are integrating this into an existing project, you might want to skip this step, as starter kits can overwrite your current code.
Run the following command to create a Laravel project
composer create-project laravel/laravel laravelapi
Then run the following commands to create the Breeze API scaffolding.
composer require laravel/breeze --dev
php artisan breeze:install api
Creating model, controller and migration
I have added the database credentials to my .env file. The fake dataset I have is for page analytics, so let's create a model, migration, and controller for a page_analytics table.
php artisan make:model PageAnalytic -mc
This is the schema of my page_analytics table
<?php
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;
return new class extends Migration
{
/**
* Run the migrations.
*/
public function up(): void
{
Schema::create('page_analytics', function (Blueprint $table) {
$table->id();
$table->string('url');
$table->string('activity');
$table->foreignId('user_id')->nullable()->constrained('users')->nullOnDelete();
$table->timestamps();
});
}
/**
* Reverse the migrations.
*/
public function down(): void
{
Schema::dropIfExists('page_analytics');
}
};
Let's push this table to the database.
php artisan migrate
I have also seeded 10k fake users into the users table using the default UserFactory that comes with a Laravel project.
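If you want to reproduce that quickly, here is a minimal sketch of the seeding step (assuming the stock UserFactory) placed in database/seeders/DatabaseSeeder.php:
<?php
namespace Database\Seeders;
use App\Models\User;
use Illuminate\Database\Seeder;
class DatabaseSeeder extends Seeder
{
/**
* Seed the application's database.
*/
public function run(): void
{
// Seed 10,000 fake users with the default UserFactory
User::factory()->count(10000)->create();
}
}
Then run it with
php artisan db:seed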
Laravel Queue and Batch Setup
I will be using the database queue connection; you can use any driver you like (SQS, Redis, etc.). So we will change the QUEUE_CONNECTION environment variable in the .env file
QUEUE_CONNECTION=database
Then we will run these two commands to create the migration files
php artisan queue:table
php artisan queue:batches-table
Then migrate these two tables to our database.
php artisan migrate
Possible PHP errors
When uploading large files in PHP you might get errors related to
- Memory limit
- Request Timeout
- Post Content limit
To solve the file upload limit errors you will need to change the configuration in the php.ini file: increase upload_max_filesize and post_max_size, and make sure post_max_size is greater than upload_max_filesize.
To handle request timeout errors we will use PHP generators and Laravel job batches.
Working on API
With the Laravel Breeze API setup we get great scaffolding for auth. I will modify the store (login) method in App\Http\Controllers\Auth\AuthenticatedSessionController so that it returns an API token.
<?php
namespace App\Http\Controllers\Auth;
// use statements
class AuthenticatedSessionController extends Controller
{
/**
* Handle an incoming authentication request.
*/
public function store(LoginRequest $request): JsonResponse
{
$request->authenticate();
$token = $request->user()->createToken('authToken')->plainTextToken;
return response()->json([
'token' => $token,
'user' => $request->user(),
]);
}
}
I will be using Postman to test our API endpoints. The problem is that our API might throw a CSRF cookie error for the register and login routes. To handle this, we will set the CSRF cookie by adding this pre-request script in Postman
pm.sendRequest({
url: 'http://localhost:8000/sanctum/csrf-cookie',
method: 'GET'
}, function (error, response, {cookies}) {
if (!error){
pm.collectionVariables.set('xsrf-cookie', cookies.get('XSRF-TOKEN'))
}
})
After this setup, if we send a POST request to the login route (passing the stored XSRF token along as an X-XSRF-TOKEN header if Sanctum complains about a CSRF mismatch), we should get the user details and an API token. We will use this token to make further requests.
API routes
Let's add two API routes, one for checking the batch status and one for importing the CSV file into the database.
<?php
use App\Http\Controllers\BatchController;
use App\Http\Controllers\PageAnalyticController;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;
Route::middleware(['auth:sanctum'])->group(function(){
Route::get('/user', function (Request $request) {
return $request->user();
});
Route::get("/import-status/{batch_id}", [BatchController::class, 'status']);
Route::post("/import-analytics", [PageAnalyticController::class, 'store']);
});
API for batch status
As we are importing records using jobs dispatched in a batch, we can use the batch ID to get the progress of the entire import process.
//BatchController.php
<?php
namespace App\Http\Controllers;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Bus;
class BatchController extends Controller
{
public function status(string $batch_id)
{
return response()->json([
'details' => Bus::findBatch($batch_id)
]);
}
}
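The controller above serializes the entire batch object. If you'd rather expose only the fields the front-end needs, a trimmed-down variant of status() could look something like this (a sketch, not part of the original code; it also guards against an unknown batch ID):
public function status(string $batch_id)
{
$batch = Bus::findBatch($batch_id);
// findBatch() returns null when no batch matches the given ID
if ($batch === null) {
return response()->json(['message' => 'Batch not found'], 404);
}
return response()->json([
'progress' => $batch->progress(), // percentage of processed jobs (0-100)
'total_jobs' => $batch->totalJobs,
'pending_jobs' => $batch->pendingJobs,
'failed_jobs' => $batch->failedJobs,
'finished' => $batch->finished(),
]);
}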
API for file import
Now this is where the real work happens. Let's add the code below to PageAnalyticController.php
<?php
namespace App\Http\Controllers;
use App\Services\PageAnalyticsImportService;
use Illuminate\Http\Request;
class PageAnalyticController extends Controller
{
public function store(Request $request)
{
$data = $request->validate([
'file' => ['required', 'file', 'mimes:csv', 'max:61440']
]);
$batchID = (new PageAnalyticsImportService())->import($data);
return response()->json(['batch_id' => $batchID], 201);
}
}
I am validating the file for the CSV MIME type and a maximum upload size of 60 MB (the max rule for files is expressed in kilobytes, hence 61440).
I have also created a PageAnalyticsImportService service in app/Services to keep the controller code cleaner. It's a PHP class which will handle the import process.
<?php
namespace App\Services;
use App\Jobs\AnalyticsImportJob;
use Generator;
use Illuminate\Support\Facades\Bus;
class PageAnalyticsImportService
{
public function import(array $data): string
{
$batch = Bus::batch([])->dispatch();
$filePath = $data['file']->getRealPath();
foreach($this->chunkAsGenerator($filePath) as $chunk)
{
$batch->add(new AnalyticsImportJob($chunk));
}
return $batch->id;
}
public function chunkAsGenerator(string $filePath) : Generator
{
$handle = fopen($filePath, 'r');
if($handle !== false)
{
// Read and discard the header row
fgetcsv($handle, 0, ',');
$chunkData = [];
$chunkSize = 0;
while(($row = fgetcsv($handle, 0, ',')) !== false)
{
$chunkData[] = $row;
$chunkSize++;
// Yield a chunk once 500 rows have been accumulated
if($chunkSize >= 500)
{
yield $chunkData;
$chunkData = [];
$chunkSize = 0;
}
}
// Yield any leftover rows that didn't fill a full chunk
if(!empty($chunkData))
{
yield $chunkData;
}
fclose($handle);
}
}
}
Let's break down this code.
import(): This method orchestrates the import process and is designed to handle large CSV files efficiently without overwhelming the server's resources. The method starts by creating an empty batch and dispatching it. It accepts the validated request input and extracts the uploaded file's temporary path. The chunkAsGenerator function is called with the file path as its argument. This generator function reads the CSV file in chunks, yielding each chunk as it goes. This approach is key to managing memory usage effectively, allowing the server to handle large files without running into memory limit issues. For each chunk yielded by the generator, a new AnalyticsImportJob is instantiated with the chunk data and added to the previously created batch. This design ensures that the heavy lifting of processing CSV data is offloaded to Laravel's queue system, allowing for asynchronous processing and better resource management. Finally, the import method returns the batch's id. This ID can be used to monitor the batch's progress, check for completion, or handle failures, providing flexibility in managing the import process.
chunkAsGenerator(): At the heart of the import process is the chunkAsGenerator method. This method cleverly breaks down the CSV file into manageable chunks, reducing the risk of exhausting the server's memory. The method attempts to open the provided CSV file path for reading. If successful, it proceeds; otherwise, it exits. The first call to fgetcsv reads and ignores the CSV header row. The method then enters a loop, reading each row of the CSV file. Rows are accumulated into an array ($chunkData) until the specified chunk size is reached (in this case, 500 rows which can be changed as needed). Once the limit is hit, the current chunk is yielded, and the accumulation starts anew for the next chunk. After the loop, if there’s any remaining data that didn't meet the chunk size threshold, it's yielded as the final chunk. This ensures no data is left behind. Finally, the file handle is closed, ensuring no resources are left hanging.
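If you want to sanity-check the generator on its own, you can loop over it in php artisan tinker against any local CSV (the path below is just a hypothetical example) and confirm that every chunk holds at most 500 rows:
$service = new \App\Services\PageAnalyticsImportService();
// storage/app/page_analytics.csv is a hypothetical local copy of the CSV
foreach ($service->chunkAsGenerator(storage_path('app/page_analytics.csv')) as $index => $chunk) {
echo $index . ' => ' . count($chunk) . ' rows' . PHP_EOL;
}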
Creating the job
AnalyticsImportJob is a Laravel job. It is designed to process chunks of CSV data related to page analytics asynchronously. It's another crucial component of the system that handles the efficient insertion of large volumes of data into the database, ensuring data integrity and validation.
Run the command below to create the job for import
php artisan make:job AnalyticsImportJob
Job code
<?php
namespace App\Jobs;
use App\Models\PageAnalytic;
use App\Models\User;
use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Validator;
class AnalyticsImportJob implements ShouldQueue
{
use Dispatchable, InteractsWithQueue, Queueable, SerializesModels, Batchable;
public array $chunkData;
/**
* Create a new job instance.
*/
public function __construct(array $chunk)
{
$this->chunkData = $chunk;
}
/**
* Execute the job.
*/
public function handle(): void
{
$validatedInputs = $this->validateEachRowAndDiscard();
// Store the validated inputs in the database
PageAnalytic::insert($validatedInputs);
}
public function validateEachRowAndDiscard()
{
$validatedInputs = [];
$size = count($this->chunkData);
for($i=0; $i<$size; $i++)
{
$inputArray = [
'user_id' => $this->chunkData[$i][0],
'created_at' => $this->chunkData[$i][1],
'activity' => $this->chunkData[$i][2],
'url' => $this->chunkData[$i][3],
];
$validator = Validator::make($inputArray, [
'user_id' => ['required', 'integer'],
'activity' => ['required', 'string'],
'url' => ['required', 'string'],
]);
if($validator->fails())
{
continue;
}
$validatedInputs[] = $inputArray;
}
$userIDsArray = array_column($validatedInputs, 'user_id');
$uniqueUserIDs = array_unique($userIDsArray);
$userIDsFromDB = User::whereIn('id', $uniqueUserIDs)->pluck('id')->toArray();
$validatedInputs = array_filter($validatedInputs, fn($input) => in_array($input['user_id'], $userIDsFromDB));
return $validatedInputs;
}
}
Let's break down this code
Batchable: Used for handling job batching, making it part of a larger batch of jobs if needed. Make sure to add this trait.
__construct(): When an instance of this job is created, it receives a chunk of CSV data as an array. This data is assigned to the $chunkData property for processing during job execution.
handle(): This method is automatically called by Laravel's queue worker when the job is processed. It's where the main logic of the job resides. In this case, it calls the validation method and then bulk inserts the validated record in the page_analytics table.
validateEachRowAndDiscard(): This function iterates through each row in the chunk, validating it against a set of rules. If a row fails validation, it's skipped. We also want to make sure that the user_id exists in the users table, but running 500 queries (one per row) to check each user_id wouldn't be efficient. Instead, we collect all the user_ids with array_column(), pass them through array_unique() so we have only distinct IDs, make a single database call with a whereIn clause to fetch the IDs that actually exist, and finally use array_filter() to discard every validated row whose user_id doesn't exist in the database.
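One small refinement worth considering that isn't in the job above: jobs belonging to a cancelled batch still execute unless they check for cancellation themselves, so Laravel's batching documentation recommends an early return at the top of handle(). A sketch of that change:
public function handle(): void
{
// Bail out early if the batch this job belongs to was cancelled
if ($this->batch()?->cancelled()) {
return;
}
$validatedInputs = $this->validateEachRowAndDiscard();
// Store the validated inputs in the database
PageAnalytic::insert($validatedInputs);
}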
Testing it out
Let's test it out through Postman and run the queue worker using
php artisan queue:work
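If you prefer the command line over Postman, an equivalent upload request could look roughly like this (assuming the app is served at localhost:8000, the routes live under the default /api prefix, and <token> is the token returned by the login endpoint):
curl -X POST http://localhost:8000/api/import-analytics -H "Accept: application/json" -H "Authorization: Bearer <token>" -F "file=@page_analytics.csv"
You can then poll the progress with the batch_id from the response:
curl http://localhost:8000/api/import-status/<batch_id> -H "Accept: application/json" -H "Authorization: Bearer <token>"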
Conclusion
And that's it: we managed to create a Laravel API that imports large amounts of data in a much faster and more efficient way. Let me know in the comments section about any further improvements that could make the code even more efficient :)