Can Data Lakes Accelerate Building ML Data Pipelines?
Taavi Rehemägi
Posted on April 20, 2021
Investigating time-to-insights for data science with a data lake vs. data warehouse
Photo by Emil Jarfelt on Unsplash
A common challenge in data engineering is to combine traditional data warehousing and BI reporting with experiment-driven machine learning projects. Many data scientists tend to work more with Python and ML frameworks rather than SQL. Therefore, their data needs are often different from those of data analysts. In this article, we'll explore why having a data lake often provides tremendous help for data science use cases. We'll finish up with a fun computer-vision demo extracting text from images stored in an S3 data lake.
1. Data lakes are data agnostic
Having only a purely relational data warehouse limits the variety of data formats your platform can support. Many data warehouse solutions let you analyze nested, JSON-like structures, but that is still only a fraction of the data formats a data lake can handle.
While nothing beats relational table structure for analytics, it's still beneficial to have an additional platform that allows you to do more than that. Data lakes are data agnostic. They support a large variety of data crucial for data science:
- different file types (csv, tsv, json, txt, parquet, orc), data encryption, and compression formats (snappy, gzip, zlib, lzo),
- images, audio, and video files, enabling deep learning use cases with computer vision algorithms,
- model checkpoints created by ML training jobs,
- joins across relational and non-relational data, both server-side (e.g. Presto or Athena) and client-side (e.g. in your Python script),
- data from the web: clickstream data, shopping cart data, social media (e.g. tweets, Reddit, blog posts, and news articles scraped for Natural Language Processing analytics),
- time-series data: IoT and sensor data, weather data, financial data.
2. Increased development efficiency
Manual ingestion of raw data into a data warehouse is quite a tedious and slow process.
You need to define a schema and specify all data types in advance, create a table, open a JDBC connection in your script or ETL tool, and only then can you start loading your data. In contrast, the load step in a data lake is often as simple as a single command. For instance, ingesting a Pandas dataframe into an S3-based data lake with an AWS Glue catalog can be accomplished in a single line of Python code (the syntax in PySpark and Dask is quite similar):
Ingestion into an S3 data lake --- Image by author
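The snippet above is embedded as an image. As a minimal sketch of the same idea, assuming the awswrangler library (AWS SDK for pandas) and placeholder bucket, database, and table names, the ingestion boils down to a single call:

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "score": [0.9, 0.7]})

# Write the dataframe to S3 as Parquet and register it in the Glue catalog.
# Bucket, database, and table names below are placeholders.
wr.s3.to_parquet(
    df=df,
    path="s3://data-lake-bronze/scores/",
    dataset=True,
    database="data_lake",
    table="scores",
    mode="append",
)
```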
If you are mainly using Python for analytics and data engineering, you will likely find it much easier to write and read data using a data lake rather than using a data warehouse.
3. Support for a wider range of data processing tools
We discussed the variety of data supported by data lakes, but they also support a variety of processing frameworks. While a data warehouse encourages in-memory processing primarily with SQL and UDFs, a data lake makes it easy to retrieve the data in the programming language or platform of your choice. Because no proprietary format is enforced, you have more freedom. This way, you can leverage the power of a Spark or Dask cluster and the wide range of extremely useful libraries built on top of them, simply using Python. See the example below demonstrating how to read a Parquet file in Dask and Spark:
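The original example was embedded as an image. A minimal sketch, assuming a placeholder Parquet dataset on S3 and that S3 credentials and connectors (e.g. s3fs for Dask, hadoop-aws for Spark) are already configured:

```python
# Dask: lazily read a Parquet dataset from S3 (path is a placeholder)
import dask.dataframe as dd

ddf = dd.read_parquet("s3://data-lake-bronze/scores/")
print(ddf.head())

# PySpark: read the same dataset with a SparkSession
# (use the s3a:// scheme on open-source Hadoop/Spark setups)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-parquet-demo").getOrCreate()
sdf = spark.read.parquet("s3a://data-lake-bronze/scores/")
sdf.show(5)
```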
Note that this isn't an either-or decision: a good data engineer can handle both a data warehouse and a data lake, but data lakes make it easier to feed data into those distributed processing frameworks.
4. Failures in your data pipelines become easier to fix
Traditional ETL pipelines are difficult and time-consuming to fix when only the "load" step fails, because you typically have to rerun the full pipeline from scratch. Data lakes encourage and enable the ELT approach: you can load extracted data in its raw format straight into the data lake and transform it later, either in the same or in an entirely separate data pipeline. Many tools let you sync raw data into your data warehouse or data lake and run the transformations later when needed. Decoupling raw data ingestion from transformation leads to more resilient data workloads.
Demo: using data lake to provide ML as a service
To illustrate the benefits of data lakes for data science projects, we'll do a simple demo of the AWS Rekognition service to extract text from images.
What's our use case? We upload an image to an S3 bucket that stores raw data. This triggers a Lambda function that extracts text from those images. Finally, we store the extracted text into a DynamoDB table and inspect the results using SQL.
How can you use it in your architecture? Instead of DynamoDB, you could just as well write to a data warehouse table or to another S3 location that can be queried with Athena. Also, instead of the detect_text method of AWS Rekognition used in the code snippet below, you can modify the code to call:
- detect_faces
- detect_custom_labels
- recognize_celebrities
- ...and many more.
How to implement this? First, I created an S3 bucket and a DynamoDB table. The table is configured with img_filename, i.e. the file name of an uploaded image, as a partition key so that rerunning our function will not cause any duplicates (idempotency).
Create DynamoDB table for our demo --- image by the author
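The screenshot above shows the table created in the console. A minimal sketch of the equivalent call with boto3, assuming a placeholder table name, could look like this:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# img_filename as the partition key makes repeated runs for the same image idempotent.
# The table name is a placeholder.
dynamodb.create_table(
    TableName="rekognition_demo",
    AttributeDefinitions=[{"AttributeName": "img_filename", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "img_filename", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```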
I already have an S3 bucket with a folder called images: s3://data-lake-bronze/images.
We also need to create a Lambda function with an IAM role to which I attached IAM policies for S3, Rekognition, and DynamoDB. The function shown below uses the handler lambda_function.lambda_handler and the Python 3.8 runtime. It also has an S3 trigger attached, which invokes the function on any PUT object operation, i.e. on any file upload to the /images/ folder.
Lambda function to detect text in images uploaded to data lake --- image by author
The code of the function (sketched after the list):
- creates a client for Rekognition and corresponding S3 and DynamoDB resource objects,
- extracts the S3 bucket name and key (filename) from the event trigger,
- reads the image object and passes it to the Rekognition client,
- finally, it retrieves the detected text and uploads it to our DynamoDB table.
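The full function is shown as an image above. A minimal sketch of such a handler, assuming boto3 and the placeholder DynamoDB table name from the earlier sketch, might look like this:

```python
import urllib.parse

import boto3

rekognition = boto3.client("rekognition")
s3 = boto3.resource("s3")
table = boto3.resource("dynamodb").Table("rekognition_demo")  # placeholder name


def lambda_handler(event, context):
    # Extract bucket name and object key from the S3 PUT event
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the uploaded image and pass its bytes to Rekognition
    image_bytes = s3.Object(bucket, key).get()["Body"].read()
    response = rekognition.detect_text(Image={"Bytes": image_bytes})

    # Collect the detected lines of text and store them in DynamoDB
    detected = [
        d["DetectedText"] for d in response["TextDetections"] if d["Type"] == "LINE"
    ]
    table.put_item(Item={"img_filename": key, "detected_text": " ".join(detected)})

    return {"statusCode": 200, "detected_text": detected}
```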
Let's test it with some images:
Photo by Nadi Lindsay from Pexels
Photo by Alexas Fotos from Pexels
Photo by Mudassir Ali from Pexels
After uploading those images to the S3 bucket that we defined in our Lambda trigger, the Lambda should be invoked once for each image upload. Once finished, we can inspect the results in DynamoDB.
PartiQL query editor in DynamoDB (1) --- image by the author
PartiQL query editor in DynamoDB (2) --- image by the author
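The screenshots above use the PartiQL editor in the DynamoDB console. The same query can also be issued programmatically; here is a minimal sketch using boto3's execute_statement, assuming the placeholder table name from the earlier sketches:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# PartiQL lets you query DynamoDB with SQL-like syntax (table name is a placeholder)
response = dynamodb.execute_statement(
    Statement='SELECT img_filename, detected_text FROM "rekognition_demo"'
)
for item in response["Items"]:
    print(item["img_filename"]["S"], "->", item["detected_text"]["S"])
```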
It looks like the first two images were recognized pretty well, but the difficult image on the right ("Is this just fantasy") was not.
How can we do it at scale?
If you want to implement a similar use case at scale, it may become challenging to monitor those resources and to implement proper alerting about errors. In that case, you may want to consider a serverless observability platform such as Dashbird. You could group all your Lambda functions and corresponding DynamoDB, ECS, Kinesis, Step Functions, SNS, or SQS resources into a project dashboard that allows you to see the current status of your microservices and ML pipelines at a glance.
Tracking resources with Dashbird
Each color represents a different resource such as a specific Lambda function. By hovering over each of them, you can see more details. Then, you can drill down for more information.
Side note: the zero costs in the image above are thanks to the always-free tier of AWS Lambda and DynamoDB. I've been using Lambda for my personal projects for the last two years and have never been billed for it so far.
AWS always-free tier. Source--- screenshot by the author
Conclusion
To answer the question from the title: yes, a data lake can definitely speed up the development of data pipelines, especially those related to data science use cases. The ability to deal with a wide range of data formats and easily integrate this data with distributed processing and ML frameworks makes data lakes particularly useful for teams that do data science at scale. Still, if using raw data from a data lake requires too much cleaning, data scientists and data engineers may still prefer to use already preprocessed and historized data from DWH. As always, consider carefully what works best for your use case.