Serverless Speech-to-Text with AssemblyAI
Majdi Dhissi
Posted on November 25, 2024
This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.
What I Built
This project is a serverless solution for performing speech-to-text using AWS Lambda, SQS, and the AssemblyAI API, integrated with a front-end Blazor Web project. When a user uploads an audio file through the web application, the file is stored in an S3 bucket, which triggers a Lambda function to process the file. The speech-to-text conversion is handled by the AssemblyAI API, and the results are communicated back to the front-end via SQS and a background polling service.
The solution demonstrates a scalable and efficient way to leverage cloud technologies for advanced AI-powered transcription.
Demo
The target S3 bucket used to store transcript data in JSON
Journey
Incorporating AssemblyAI’s Universal-2 Model
The Universal-2 model provided by AssemblyAI played a central role in this project. Its robust transcription capabilities ensured that audio files were processed with high accuracy.
Architecture
The solution was designed as follows:
Blazor Web Front-End:
The web project serves as the user interface for uploading audio files. Users can drag and drop files or use a browse button to select files. Once uploaded, the file is sent to an S3 bucket, with status updates and results displayed dynamically.
AWS Lambda Function:
Built using .NET 8, the Lambda function is triggered by an S3 event whenever a new file is uploaded. This function downloads the file, processes it using the AssemblyAI API, and sends a success message containing transcription results to an SQS queue.
SQS Integration:
SQS acts as a communication bridge, decoupling the Lambda function from the Blazor application. This ensures that the system remains robust and scalable, handling spikes in audio processing without impacting the UI.
Blazor Background Service:
A background polling service in the Blazor project checks the SQS queue for new messages. When results are fetched, they are displayed in the web application in real time.
Technical Highlights
Background Service Overview
The Blazor front-end includes a background service that interacts with AWS SQS to fetch and process transcription results. This service ensures that messages from the SQS queue are retrieved and used to update the UI dynamically.
Here is a summary of its key components:
1. SqsService Class
The SqsService class encapsulates logic for communicating with AWS SQS.
Core Functionality:
- Uses the AWS SDK for .NET to fetch messages from the SQS queue with long polling to reduce unnecessary API calls.
- After processing a message, it deletes the message from the queue to ensure it is not reprocessed.
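The receive, process, delete cycle described above can be sketched as follows. This is a minimal illustration in Python rather than the project's .NET code; `fetch_message`, `handle`, and the injected `sqs` client are hypothetical names, and the client is only assumed to expose `receive_message` and `delete_message` with the keyword arguments that the AWS SDKs use for SQS.

```python
from typing import Any, Callable, Optional


def fetch_message(
    sqs: Any,
    queue_url: str,
    handle: Callable[[str], None],
    wait_seconds: int = 20,
) -> Optional[str]:
    """Receive at most one message with long polling, process it, then delete it."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        # Long polling: the call waits up to wait_seconds for a message
        # instead of returning immediately with an empty result.
        WaitTimeSeconds=wait_seconds,
    )
    messages = response.get("Messages", [])
    if not messages:
        return None

    message = messages[0]
    handle(message["Body"])

    # Delete only after the handler returned without raising, so a failed
    # message becomes visible again and can be retried.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return message["Body"]
```

Deleting after processing, rather than before, is the design choice that matters here: if the handler throws, the message is not lost.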
2. SqsBackgroundService Class
The SqsBackgroundService is a hosted service that continuously polls the SQS queue for messages.
Core Functionality:
- Calls FetchMessageAsync from SqsService to retrieve messages.
- Upon receiving a message, invokes a delegate (Func) that triggers a refresh in the Blazor UI to display transcription results.
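As a rough sketch of this pattern (in Python for illustration, not the project's hosted service), a background worker can loop over a fetch function and invoke a callback for every message until it is asked to stop. `PollingWorker`, `fetch`, `on_message`, and `idle_delay` are hypothetical names introduced for this example.

```python
import threading
from typing import Callable, Optional


class PollingWorker:
    """Polls a fetch function on a background thread and forwards messages."""

    def __init__(
        self,
        fetch: Callable[[], Optional[str]],
        on_message: Callable[[str], None],
        idle_delay: float = 1.0,
    ) -> None:
        self._fetch = fetch
        self._on_message = on_message
        self._idle_delay = idle_delay
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> None:
        # Signal the loop to exit, then wait for the thread to finish.
        self._stop.set()
        self._thread.join(timeout)

    def is_running(self) -> bool:
        return self._thread.is_alive()

    def _run(self) -> None:
        while not self._stop.is_set():
            body = self._fetch()
            if body is not None:
                # In the article's setup this callback would refresh the UI.
                self._on_message(body)
            else:
                # Nothing received: pause briefly, waking early if stopped.
                self._stop.wait(self._idle_delay)
```

The callback plays the role of the delegate mentioned above: the worker knows nothing about the UI, it only reports that a message arrived.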
Blazor Front-End Overview
The Blazor front-end component serves as the user interface for uploading files, displaying data from DynamoDB, and reflecting updates in real time via integration with the background SQS service.
Here is a breakdown of the main functionalities:
1. File Upload Feature
- The file input component is used to select a file for upload. Once a file is selected, details such as the file name and extension are displayed to the user.
- Files are uploaded to an S3 bucket using the TransferUtility from the AWS SDK for .NET.
- A unique file key is generated using Guid to ensure no name conflicts.
- The UI displays a modal during the upload process, indicating that the operation is in progress.
- After the upload is complete, the SuccessMessage is updated to notify the user of the outcome.
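The unique key generation can be illustrated with a short Python sketch. `make_object_key` is a hypothetical helper, and keeping the original file extension (lowercased) is an assumption made here; the article only states that a GUID is used to avoid name conflicts.

```python
import uuid
from pathlib import PurePosixPath


def make_object_key(original_name: str) -> str:
    """Build a collision-resistant object key from a random UUID.

    Assumption: the original extension is kept so the file type stays
    recognizable in the bucket.
    """
    extension = PurePosixPath(original_name).suffix.lower()
    return f"{uuid.uuid4()}{extension}"
```

Because the key no longer depends on the user's file name, two users uploading `recording.mp3` at the same time do not overwrite each other.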
2. DynamoDB Integration
- The ListDynamoDBItems method retrieves all items from a DynamoDB table and loads them into the DynamoDBItems list.
- The table is displayed on the page, showing each item's Id, Transcribed text, and Timestamp.
- A refresh button allows the table to be updated dynamically.
- Text fields with more than 2,000 characters are truncated for readability.
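The truncation rule in the last point is simple enough to show directly. The sketch below is in Python for illustration; `truncate_for_display` and the trailing `...` marker are assumptions, since the article does not say how truncated text is marked.

```python
def truncate_for_display(text: str, limit: int = 2000, marker: str = "...") -> str:
    """Return text unchanged up to `limit` characters, otherwise cut it and add a marker."""
    if len(text) <= limit:
        return text
    return text[:limit] + marker
```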
3. SQS Background Service Integration
- The component starts the SqsBackgroundService during initialization, allowing real-time updates when new transcription results are available in SQS.
- When the service receives a message, it triggers the RefreshDynamoDbTable method, which reloads the data and refreshes the UI.
- The service runs in the background and is gracefully stopped when the component is disposed.
AWS Lambda Function Overview
The Lambda function integrates S3, DynamoDB, and AssemblyAI to handle audio transcription, storage, and processing. Here is a breakdown of its functionality:
1. S3 Event Trigger
The function is triggered by S3 events, such as an object being created in a bucket.
- Metadata is retrieved for validation purposes.
- A pre-signed URL is generated to provide external access to the file for the AssemblyAI API.
- The file is then submitted to the AssemblyAI API for transcription using an HTTP client.
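To make the trigger concrete, here is a hedged Python sketch of how a handler might pull the bucket name and object key out of an S3 notification event. `extract_s3_objects` is a hypothetical helper, not the project's .NET handler. The event layout (`Records`, `eventName`, `s3.bucket.name`, `s3.object.key`) follows the S3 event notification format, in which object keys arrive URL-encoded.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for every object-created record in the event."""
    objects = []
    for record in event.get("Records", []):
        # Ignore anything that is not an object creation (for example deletions).
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification; spaces arrive as '+'.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Decoding the key before using it matters for file names containing spaces or special characters, which would otherwise fail to resolve when the object is fetched or pre-signed.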
2. AssemblyAI Integration
The function initializes an AssemblyAIClient using an API key from environment variables.
- The StabilityAIProcessor handles file transcription via the pre-signed S3 URL.
- The transcription result includes the text and metadata, which are logged and processed further.
- Logging: transcription text is unescaped and logged for debugging or auditing.
3. DynamoDB Integration
Each transcription result is converted into a DynamoDB document and stored in the AssemblyAI table.
The table stores:
- Id: The unique transcription ID.
- Text: The transcribed text.
- Timestamp: The upload time in UTC.
Any issues during the database operation are logged and re-thrown.
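The shape of a stored record can be sketched as below (Python, for illustration only). `build_item` is a hypothetical helper, and the ISO 8601 timestamp format is an assumption; the article only states that the timestamp is in UTC.

```python
from datetime import datetime, timezone
from typing import Optional


def build_item(transcript_id: str, text: str, now: Optional[datetime] = None) -> dict:
    """Build the record to store: Id, Text, and a UTC Timestamp."""
    moment = now or datetime.now(timezone.utc)
    return {
        "Id": transcript_id,
        "Text": text,
        # Normalize to UTC and serialize as ISO 8601 (assumed format).
        "Timestamp": moment.astimezone(timezone.utc).isoformat(),
    }
```

Passing `now` explicitly keeps the function easy to test; in normal use it is omitted and the current UTC time is used.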
4. Enhanced S3 Functionality
- The transcription results are uploaded to a designated bucket (e.g., assemblyai-challenge-transcripts) with a .json extension.
- The uploaded transcription files use a consistent naming convention: -transcription.json.
- A short-lived pre-signed URL (120 seconds) is generated for secure external access to the uploaded files. Since the S3 bucket is not public, the pre-signed URL is what allows AssemblyAI to download the file for processing within that short time frame.
In addition to the transcribed text, we also retrieve several data items such as:
- The Confidence Score
- Total number of words
- Audio duration
- Number of speakers
- List of highlights
- Sentiment Analysis (Negative, Neutral, Positive) for each sentence
- Detected language
- Number of chapters
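A summary like the list above could be derived from the transcript JSON roughly as follows. This Python sketch is illustrative only: `summarize_transcript` is a hypothetical helper, and the field names (`confidence`, `words`, `audio_duration`, `utterances`, `auto_highlights_result`, `sentiment_analysis_results`, `language_code`, `chapters`) are assumed to mirror AssemblyAI's transcript response and should be checked against the API reference.

```python
from collections import Counter


def summarize_transcript(transcript: dict) -> dict:
    """Collect headline metrics from a transcript dict (assumed field names)."""
    sentiments = Counter(
        result["sentiment"]
        for result in transcript.get("sentiment_analysis_results") or []
    )
    speakers = {u["speaker"] for u in transcript.get("utterances") or []}
    highlights = [
        h["text"]
        for h in (transcript.get("auto_highlights_result") or {}).get("results", [])
    ]
    return {
        "confidence": transcript.get("confidence"),
        "word_count": len(transcript.get("words") or []),
        "audio_duration": transcript.get("audio_duration"),
        "speaker_count": len(speakers),
        "highlights": highlights,
        "sentiment_counts": dict(sentiments),
        "language": transcript.get("language_code"),
        "chapter_count": len(transcript.get("chapters") or []),
    }
```

Missing or null sections are treated as empty, so the summary still works when a feature was not enabled for the transcription request.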
Deployment
The solution is deployed using the following:
Terraform IaC
It enables building the necessary infrastructure to host the solution including:
Source and Target S3 buckets
Serve as secure storage locations where data is ingested and processed. The source bucket contains raw data, while the target bucket stores the transformed or processed output.
SQS Queue
A highly reliable and scalable messaging service that decouples components by queuing messages between producers and consumers, ensuring asynchronous communication.
DynamoDB Table
A fully managed NoSQL database optimized for high availability and low-latency access to application data, structured by a key-value or document data model.
IAM Policies and Roles
Define and enforce granular access permissions, ensuring resources and services are accessed securely. IAM roles enable temporary, controlled access to AWS services by trusted entities.
Lambda Function
A serverless compute service that executes custom logic in response to events or triggers, such as processing SQS messages or transforming data from the source bucket.
ECR repository for the respective Docker images
A secure, scalable container registry to store and manage Docker images required for deployment, facilitating seamless integration with ECS and other AWS services.
ECS and Task Definitions that will run the images
ECS orchestrates containerized applications, while task definitions specify the configuration for running containers, including image, CPU, memory, and networking requirements.
ALB for exposing the front-end running on ECS Fargate
An Application Load Balancer distributes incoming HTTP/HTTPS traffic across front-end tasks running on ECS Fargate, ensuring high availability, scalability, and secure access.
This submission was crafted for the challenge, and the full source code is available on GitHub. Try it out, and feel free to share feedback!
Conclusion
AssemblyAI's advanced features make it a powerful tool for creating sophisticated speech-to-text solutions. With capabilities such as sentiment analysis, speaker identification, language detection, and detailed transcription metadata, it goes beyond basic transcription to deliver valuable insights from audio content. These features enable developers to build robust, intelligent, and scalable applications tailored to diverse use cases.
This article focused on the solution's overall architecture and integration. Detailed infrastructure and code are outside its scope, which is to demonstrate a serverless speech-to-text solution using the AssemblyAI API.
Feel free to explore the provided repositories and try out the solution. Feedback is always welcome!
If you find this article insightful, please:
- Share it on your feed or social media
- Follow me to receive updates
- Keep in touch on LinkedIn