Laravel API - Import 1 million records with validation in a few seconds.

Aditya (adityar15)

Posted on February 19, 2024


Database import is a crucial element and needs to be done right. Whether it's asking a user to fill out a form or uploading a CSV file through a UI, all data that enters the database must go through proper validation. For large CSV imports, balancing security and performance can be challenging. In this article, I'll demonstrate how to import 1 million records into a MySQL database with validation using Laravel APIs.

Here is the YouTube video from my channel WebDevWithArtisan

The key is to optimize for speed while not compromising data integrity. I'll use a REST API approach so any front-end can upload files. To handle the load, we'll employ a divide-and-conquer tactic by splitting the large CSV into chunks and harnessing Laravel's powerful job batching for queues.

Validation Strategy

We need to validate the structure and data types of the CSV before insertion. I'll create validator classes for each model that will be populated. These will check for required fields, data formats, relationships, etc.
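As a minimal sketch of what such per-row rules look like for the page_analytics data used later in this article (the field names and rules are taken from the job shown further down; the relationship check on user_id is handled separately there with a single query per chunk):

use Illuminate\Support\Facades\Validator;

// Illustrative per-row rules; $row is assumed to be one associative
// array built from a CSV line. The real validation runs inside the
// queued import job shown later in this article.
$validator = Validator::make($row, [
    'user_id'  => ['required', 'integer'],
    'activity' => ['required', 'string'],
    'url'      => ['required', 'string'],
]);

if ($validator->fails()) {
    // Discard (or log) the invalid row and move on.
}
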

Import Workflow

  • The API endpoint accepts a CSV file upload
  • The file gets divided into smaller chunks
  • Each chunk is pushed into a job queue
  • Jobs will validate rows and batch insert into the database

Optimization Considerations

  • Use queue workers and supervised jobs for failure resiliency
  • Tune batch sizes to balance speed and memory usage
  • Add rate limiting to manage concurrent API requests
  • Wrap inserts in a single database transaction for data integrity (see the sketch after this list)
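
As a quick sketch of that last point: the per-chunk insert you will see in the job later in this article could be wrapped in a transaction like this (the article's job inserts without an explicit transaction, so treat this as an optional hardening step):

use App\Models\PageAnalytic;
use Illuminate\Support\Facades\DB;

// Roll the whole chunk back if any part of the bulk insert fails.
DB::transaction(function () use ($validatedInputs) {
    PageAnalytic::insert($validatedInputs);
});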

By following these patterns, we can reliably ingest datasets of any size while protecting data quality. Now let's examine the code to implement our CSV import solution using Laravel and MySQL.

Prerequisite: This article assumes the reader has a basic understanding of Laravel and its core concepts.

Create a Laravel project with a Breeze starter kit.

Since I am creating a project from scratch, I will use the Laravel Breeze API starter kit: it is minimal and ships with everything (including the Laravel Sanctum configuration) we need to build a Laravel API.

If you are integrating this into an existing project, you might want to skip this step, as starter kits will overwrite your current code.

Run the following command to create a Laravel project



composer create-project laravel/laravel laravelapi



Then run the following commands to install Breeze and scaffold the API. When the installer prompts you for a stack, choose the API stack.



composer require laravel/breeze --dev




php artisan breeze:install



The terminal output should look something like this:

Terminal output

Creating model, controller and migration

I have added the database credentials to my .env file. The fake dataset I have is page_analytics, so let's create a model, migration, and controller for it.



php artisan make:model PageAnalytic -mc



This is the schema of my page_analytics table



<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    /**
     * Run the migrations.
     */
    public function up(): void
    {
        Schema::create('page_analytics', function (Blueprint $table) {
            $table->id();
            $table->string('url');
            $table->string('activity');
            $table->foreignId('user_id')->nullable()->constrained('users')->nullOnDelete();
            $table->timestamps();
        });
    }

    /**
     * Reverse the migrations.
     */
    public function down(): void
    {
        Schema::dropIfExists('page_analytics');
    }
};



Let's run the migration to create this table in the database.



php artisan migrate



I have also seeded 10k fake users into the users table using the default UserFactory that ships with a Laravel project.
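
For reference, that seeding is a one-liner in the default DatabaseSeeder (10,000 is just the count I chose; the factory itself ships with Laravel):

<?php

namespace Database\Seeders;

use App\Models\User;
use Illuminate\Database\Seeder;

class DatabaseSeeder extends Seeder
{
    /**
     * Seed the application's database.
     */
    public function run(): void
    {
        // Create 10k fake users with the default UserFactory.
        User::factory()->count(10000)->create();
    }
}

Run it with php artisan db:seed.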

Laravel Queue and Batch Setup

I will be using the database queue connection, but you can use any driver you like (SQS, Redis, etc.). Change the QUEUE_CONNECTION environment variable in the .env file:



QUEUE_CONNECTION=database



Then we will run these two commands to create the migration files:



php artisan queue:table




php artisan queue:batches-table



Then migrate these two tables to our database.



php artisan migrate



Possible PHP errors

When uploading large files in PHP, you might run into errors related to:

  1. Memory limit
  2. Request Timeout
  3. Post Content limit

To solve the file upload limit error, you will need to change the configuration in your php.ini file, specifically the upload_max_filesize and post_max_size directives. Make sure post_max_size is greater than upload_max_filesize.
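
For example, the relevant php.ini directives could look like this (the values are only illustrative; they comfortably cover the 60 MB upload limit used later in this article):

; php.ini - illustrative values
upload_max_filesize = 64M
post_max_size = 128M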

To handle request timeout errors, we will use PHP generators and Laravel job batches.

Working on the API

With the Laravel Breeze API setup we get great scaffolding for auth. I will modify the store method (the login request handler) in App\Http\Controllers\Auth\AuthenticatedSessionController so that it returns an API token.



<?php

namespace App\Http\Controllers\Auth;

use App\Http\Controllers\Controller;
use App\Http\Requests\Auth\LoginRequest;
use Illuminate\Http\JsonResponse;

class AuthenticatedSessionController extends Controller
{
    /**
     * Handle an incoming authentication request.
     */
    public function store(LoginRequest $request): JsonResponse
    {
        $request->authenticate();

        $token = $request->user()->createToken('authToken')->plainTextToken;

        return response()->json([
            'token' => $token,
            'user' => $request->user(),
        ]);
    }
}




I will be using Postman to test our API endpoints. The problem is that our API might throw a CSRF cookie error for the register and login routes. For that reason, we will set the CSRF cookie by adding this pre-request script in Postman:



pm.sendRequest({
    url: 'http://localhost:8000/sanctum/csrf-cookie',
    method: 'GET'
}, function (error, response, {cookies}) {
    if (!error){
        pm.collectionVariables.set('xsrf-cookie', cookies.get('XSRF-TOKEN'))
    }
})



Postman PreRequest Script

After this setup, if we send a POST request to the login route, we should get the user details and an API token. We will use this token to authenticate further requests.
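
For the requests that follow (including the import endpoint below), attach the token as a standard Sanctum bearer token, either through Postman's Authorization tab or as a plain header; the value below is just a placeholder:

Authorization: Bearer <token-from-login-response>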

API routes

Let's add two API routes, one for checking the batch status and one for importing the CSV file into the database.



<?php

use App\Http\Controllers\BatchController;
use App\Http\Controllers\PageAnalyticController;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::middleware(['auth:sanctum'])->group(function(){
    Route::get('/user', function (Request $request) {
        return $request->user();
    });

    Route::get("/import-status/{batch_id}", [BatchController::class, 'status']);

    Route::post("/import-analytics", [PageAnalyticController::class, 'store']);
});



API for batch status

As we are importing records using jobs that run as part of a batch, we can use the batch ID to track the progress of the entire import process.



//BatchController.php
<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Bus;

class BatchController extends Controller
{
    public function status(string $batch_id)
    {
        return response()->json([
            'details' =>  Bus::findBatch($batch_id)
        ]);
    }
}


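Bus::findBatch() returns an Illuminate\Bus\Batch instance, which serializes to JSON with its progress counters. A response from this endpoint should therefore look roughly like the following (the field names come from the Batch class; the values here are made up):

{
    "details": {
        "id": "9b1f3a62-example-batch-id",
        "name": "",
        "totalJobs": 2000,
        "pendingJobs": 1250,
        "processedJobs": 750,
        "progress": 37,
        "failedJobs": 0,
        "options": [],
        "createdAt": "2024-02-19T10:00:00.000000Z",
        "cancelledAt": null,
        "finishedAt": null
    }
}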

API for file import

Now for the main part: let's add the code below to PageAnalyticController.php.



<?php

namespace App\Http\Controllers;

use App\Services\PageAnalyticsImportService;
use Illuminate\Http\Request;

class PageAnalyticController extends Controller
{
    public function store(Request $request)
    {
        $data = $request->validate([
            'file' => ['required', 'file', 'mimes:csv', 'max:61440']
        ]);

        $batchID = (new PageAnalyticsImportService())->import($data);

        return response()->json(['batch_id' => $batchID], 201);
    }
}



I am validating the file for the CSV MIME type and a maximum upload size of 60 MB (Laravel's max rule is expressed in kilobytes, hence 61440).

I have also created a PageAnalyticsImportService class in app/Services to keep the controller code cleaner. It's a plain PHP class that will handle the import process.



<?php

namespace App\Services;

use App\Jobs\AnalyticsImportJob;
use Generator;
use Illuminate\Support\Facades\Bus;

class PageAnalyticsImportService
{
    public function import(array $data): string
    {
        $batch = Bus::batch([])->dispatch();

        $filePath = $data['file']->getRealPath();

        foreach($this->chunkAsGenerator($filePath) as $chunk)
        {
            $batch->add(new AnalyticsImportJob($chunk));
        }

        return $batch->id;
    }

    public function chunkAsGenerator(string $filePath) : Generator
    {
        $handle = fopen($filePath, 'r');

        if($handle !== false)
        {
            // Read and discard the header row.
            fgetcsv($handle, 0, ',');

            $chunkData = [];
            $chunkSize = 0;

            while(($row = fgetcsv($handle, 0, ',')) !== false)
            {
                $chunkData[] = $row;
                $chunkSize++;
                if($chunkSize >= 500)
                {
                    yield $chunkData;
                    $chunkData = [];
                    $chunkSize = 0;
                }
            }

            if(!empty($chunkData))
            {
                yield $chunkData;
            }

            fclose($handle);
        }

    }
}



Let's break down this code.

  1. import(): This method orchestrates the import process and is designed to handle large CSV files efficiently without overwhelming the server's resources. The method starts by creating an empty batch and dispatching it. It accepts the validated request input and extracts the uploaded file's temporary path. The chunkAsGenerator function is called with the file path as its argument. This generator function reads the CSV file in chunks, yielding each chunk as it goes. This approach is key to managing memory usage effectively, allowing the server to handle large files without running into memory limit issues. For each chunk yielded by the generator, a new AnalyticsImportJob is instantiated with the chunk data and added to the previously created batch. This design ensures that the heavy lifting of processing CSV data is offloaded to Laravel's queue system, allowing for asynchronous processing and better resource management. Finally, the import method returns the batch's id. This ID can be used to monitor the batch's progress, check for completion, or handle failures, providing flexibility in managing the import process.

  2. chunkAsGenerator(): At the heart of the import process is the chunkAsGenerator method. This method cleverly breaks down the CSV file into manageable chunks, reducing the risk of exhausting the server's memory. The method attempts to open the provided CSV file path for reading. If successful, it proceeds; otherwise, it exits. The first call to fgetcsv reads and ignores the CSV header row. The method then enters a loop, reading each row of the CSV file. Rows are accumulated into an array ($chunkData) until the specified chunk size is reached (in this case, 500 rows which can be changed as needed). Once the limit is hit, the current chunk is yielded, and the accumulation starts anew for the next chunk. After the loop, if there’s any remaining data that didn't meet the chunk size threshold, it's yielded as the final chunk. This ensures no data is left behind. Finally, the file handle is closed, ensuring no resources are left hanging.

Creating the job

AnalyticsImportJob is a Laravel job designed to process chunks of page analytics CSV data asynchronously. It's the component that handles the efficient insertion of large volumes of data into the database while enforcing validation and data integrity.

Run the command below to create the job for import



php artisan make:job AnalyticsImportJob



Job code



<?php

namespace App\Jobs;

use App\Models\PageAnalytic;
use App\Models\User;
use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Validator;

class AnalyticsImportJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels, Batchable;

    public array $chunkData;
    /**
     * Create a new job instance.
     */
    public function __construct(array $chunk)
    {
        $this->chunkData = $chunk;
    }

    /**
     * Execute the job.
     */
    public function handle(): void
    {
        $validatedInputs = $this->validateEachRowAndDiscard();
        // Store the validated inputs in the database
        PageAnalytic::insert($validatedInputs);
    }

    public function validateEachRowAndDiscard()
    {
        $validatedInputs = [];
        $size = count($this->chunkData);

        for($i=0; $i<$size; $i++)
        {
           $inputArray = [
            'user_id' => $this->chunkData[$i][0],
            'created_at' => $this->chunkData[$i][1],
            'activity' => $this->chunkData[$i][2],
            'url' => $this->chunkData[$i][3],
           ];

           $validator = Validator::make($inputArray, [
                'user_id' => ['required', 'integer'],
                'activity' => ['required', 'string'],
                'url' => ['required', 'string'],
           ]);

            if($validator->fails())
            {
                continue;
            }  

            $validatedInputs[] = $inputArray;
        }

        // Check user existence with a single query instead of one query per row:
        // collect the user IDs, de-duplicate them, then resolve them against
        // the users table with one whereIn call.
        $userIDsArray = array_column($validatedInputs, 'user_id');
        $uniqueUserIDs = array_unique($userIDsArray);

        $userIDsFromDB = User::whereIn('id', $uniqueUserIDs)->pluck('id')->toArray();

        // Drop validated rows whose user_id does not exist in the database.
        $validatedInputs = array_filter($validatedInputs, fn($input) => in_array($input['user_id'], $userIDsFromDB));

        return $validatedInputs;
    }
}



Let's break down this code

  1. Batchable: Used for handling job batching, making it part of a larger batch of jobs if needed. Make sure to add this trait.

  2. __construct(): When an instance of this job is created, it receives a chunk of CSV data as an array. This data is assigned to the $chunkData property for processing during job execution.

  3. handle(): This method is automatically called by Laravel's queue worker when the job is processed. It's where the main logic of the job resides. In this case, it calls the validation method and then bulk inserts the validated records into the page_analytics table.

  4. validateEachRowAndDiscard(): This method iterates through each row in the chunk and validates it against a set of rules; rows that fail validation are skipped. We also want to make sure that the user_id exists in the users table, and running 500 queries (one per row) to check it would not be efficient. Instead, we collect all the user_ids with array_column(), pass them through array_unique() so we only keep distinct IDs, make a single database call with a whereIn clause to fetch the IDs that actually exist, and finally use array_filter() to drop any validated rows whose user_id doesn't exist in the database. See the toy example after this list.
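
To make that concrete, here is a toy version of the same pipeline with made-up IDs (assume only user 1 exists in the users table):

use App\Models\User;

// Two rows that already passed field-level validation; the second
// one references a user that doesn't exist.
$validatedInputs = [
    ['user_id' => 1,     'created_at' => '2024-02-19 10:00:00', 'activity' => 'click', 'url' => '/home'],
    ['user_id' => 99999, 'created_at' => '2024-02-19 10:00:01', 'activity' => 'view',  'url' => '/blog'],
];

// Distinct user IDs referenced by this chunk: [1, 99999]
$uniqueUserIDs = array_unique(array_column($validatedInputs, 'user_id'));

// One query for the whole chunk instead of one per row: [1]
$userIDsFromDB = User::whereIn('id', $uniqueUserIDs)->pluck('id')->toArray();

// Only the first row survives and gets inserted.
$validatedInputs = array_filter(
    $validatedInputs,
    fn ($input) => in_array($input['user_id'], $userIDsFromDB)
);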

Testing it out

Let's test it through Postman and run the queue worker (you can run several workers in parallel to process the batch faster) using



php artisan queue:work



Jobs Terminal

page_analytics table

Conclusion

And that's it, we managed to create a Laravel API that imports a large dataset in a much faster and more efficient way. Let me know in the comment section about any improvements that could make the code even more efficient :)
