How AWS Service - Amazon Transcribe acts on PII
Revathi Joshi
Posted on December 15, 2022
In my previous article, I showed you how to use Amazon Transcribe, an automatic speech recognition service, to create a text transcript of a pre-recorded speech file in English.
In this article, I am going to show you how to use Amazon Transcribe to add privacy to your transcriptions by not exposing personal and sensitive information (PII) after uploading your audio file to an S3 bucket via the AWS Management Console. Each piece of PII is treated as an entity type, and its content is masked with that entity type in the transcript output; for example, the Social Security Number 123-45-6789 is masked as [SSN].
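To make the masking idea concrete, here is a toy sketch in Python. It only illustrates the before/after shape of the text: Amazon Transcribe detects PII in speech with its own models, not with a regular expression like this one, and `mask_ssn` is a hypothetical helper name used for illustration.

```python
import re

# Illustration only: a simple pattern for SSNs written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn(text: str) -> str:
    """Replace anything that looks like an SSN with the [SSN] entity-type tag."""
    return SSN_PATTERN.sub("[SSN]", text)


print(mask_ssn("My Social Security Number is 123-45-6789."))
```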
Amazon Transcribe
It is an automatic speech recognition service.
You can use it to transcribe media files stored in an Amazon S3 bucket (batch transcription) or audio in real time (streaming transcription).
The following types of PII are recognized for batch transcriptions:
- SSN
- CREDIT_DEBIT_NUMBER
- CREDIT_DEBIT_EXPIRY
- CREDIT_DEBIT_CVV
- BANK_ACCOUNT_NUMBER
- BANK_ROUTING
- PIN
- NAME
- PHONE (10 digits)
- ADDRESS
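If you later drive the redaction from code instead of the console, it can help to validate the entity types you ask for. The sketch below assumes the list above is the full set you want to allow, plus the special value ALL; `SUPPORTED_PII_TYPES` and `normalize_pii_types` are hypothetical names, not part of any AWS SDK.

```python
# Entity types taken from the list above; "ALL" is handled separately.
SUPPORTED_PII_TYPES = frozenset({
    "SSN", "CREDIT_DEBIT_NUMBER", "CREDIT_DEBIT_EXPIRY", "CREDIT_DEBIT_CVV",
    "BANK_ACCOUNT_NUMBER", "BANK_ROUTING", "PIN", "NAME", "PHONE", "ADDRESS",
})


def normalize_pii_types(requested):
    """Upper-case, de-duplicate and sort the requested types; reject unknown names."""
    names = {name.strip().upper() for name in requested}
    if "ALL" in names:
        # "ALL" already covers every type, so nothing else needs to be listed.
        return ["ALL"]
    unknown = sorted(names - SUPPORTED_PII_TYPES)
    if unknown:
        raise ValueError(f"Unknown PII entity types: {unknown}")
    return sorted(names)


print(normalize_pii_types(["ssn", "Name"]))
```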
PII redaction for batch transcription is available with US English (en-US).
The output also includes the word-for-word portion of the transcription.
A typical use case is an organization that does not want to expose certain transcription data to every team member.
In such situations, personally identifiable information (PII) may need to be removed to protect privacy and comply with local laws and regulations.
With Amazon Transcribe, it is easy to get an accurate transcript with the sensitive text redacted, a task that would otherwise be error-prone and time-consuming to do manually.
Let’s get started!
Please visit my GitHub Repository for S3 articles on various topics, which is updated on a regular basis.
Objectives:
1. Create an S3 bucket
2. Upload an audio PII file into the S3 bucket
3. Create a transcription job
4. Review transcription results
Pre-requisites:
An AWS user account with admin access (not the root account).
An IAM role with the AmazonS3FullAccess policy.
Resources Used:
Steps for implementing this project:
1. Create an S3 bucket
On the Amazon S3 console / Create bucket / Under General configuration /
- Bucket name: pii-bucket12
- AWS Region: US East (N. Virginia) us-east-1
- Take all defaults and Create bucket
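The same step can be scripted. The following is a minimal sketch, assuming you have boto3 installed and credentials configured; `bucket_request` is a hypothetical helper that only builds the arguments for the S3 `create_bucket` call, so it can be inspected without touching AWS.

```python
def bucket_request(bucket_name: str, region: str = "us-east-1") -> dict:
    """Build keyword arguments for the S3 create_bucket call."""
    params = {"Bucket": bucket_name}
    # us-east-1 is the default location; other regions need an explicit
    # LocationConstraint in CreateBucketConfiguration.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params


# With boto3 and AWS credentials available, the call would look like:
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   s3.create_bucket(**bucket_request("pii-bucket12"))
print(bucket_request("pii-bucket12"))
```

Bucket names are globally unique, so pii-bucket12 may already be taken; pick your own name if the call fails for that reason.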
2. Upload an audio PII file into the S3 bucket
Amazon Transcribe supports MP3, MP4, WAV, FLAC, AMR, OGG, and WebM formats.
Click on your bucket’s name to navigate to the bucket / On the Buckets Home page / Select Upload / Add files / Add the PII-file.mp3 file / Upload
- Select the PII-file.mp3 file / Under Properties / For Object overview / Copy the S3 URI / Save it for future use
s3://pii-bucket12/PII-file.mp3
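As a rough sketch of the upload step in code, the helpers below check the file extension against the formats listed above and build the S3 URI you need later. `media_format` and `s3_uri` are hypothetical names; the actual upload would go through boto3's `upload_file`, shown only in the comment.

```python
from pathlib import PurePosixPath

# Formats named in this article as supported by Amazon Transcribe.
SUPPORTED_FORMATS = {"mp3", "mp4", "wav", "flac", "amr", "ogg", "webm"}


def media_format(file_name: str) -> str:
    """Return the lower-cased extension, or raise if it is not in the list above."""
    ext = PurePosixPath(file_name).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported media format: {file_name!r}")
    return ext


def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// URI that the transcription job takes as input."""
    return f"s3://{bucket}/{key}"


# With boto3 and AWS credentials available, the upload would look like:
#   import boto3
#   boto3.client("s3").upload_file("PII-file.mp3", "pii-bucket12", "PII-file.mp3")
print(media_format("PII-file.mp3"), s3_uri("pii-bucket12", "PII-file.mp3"))
```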
3. Create a transcription job
From the top menu bar, select Services, then begin typing Transcribe in the search bar and select Amazon Transcribe to open the service console.
On the Amazon Transcribe Console / Transcription jobs page, click Create job / Under Specify job details / Job settings /
- Name: PII-transcribe-job
- Language: English, US (en-US)
- Input data / Input file location on S3: s3://pii-bucket12/PII-file.mp3
- Output data location type: take the default, Service-managed S3 bucket
- Next
- On the Configure page / Under Content removal / Check PII redaction / Take the default Select ALL
- Create job
- Wait for the status of your job to change from In progress to Complete
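The console settings above map fairly directly onto the boto3 `start_transcription_job` call. The sketch below is one possible way to express them, assuming a redacted-only output; `redaction_job_request` and `wait_for_job` are hypothetical helpers, and the polling interval and limit are arbitrary choices.

```python
import time


def redaction_job_request(job_name, media_uri, pii_types=("ALL",)):
    """Build keyword arguments for start_transcription_job with PII redaction."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": media_uri},
        # No OutputBucketName is set, so the service-managed bucket is used.
        "ContentRedaction": {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",  # assumption: redacted transcript only
            "PiiEntityTypes": list(pii_types),
        },
    }


def wait_for_job(get_status, poll_seconds=5.0, max_polls=120):
    """Call get_status() until the job reports COMPLETED or FAILED."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Transcription job did not finish in time")


# With boto3 and AWS credentials available, usage would look like:
#   import boto3
#   transcribe = boto3.client("transcribe", region_name="us-east-1")
#   transcribe.start_transcription_job(
#       **redaction_job_request("PII-transcribe-job", "s3://pii-bucket12/PII-file.mp3"))
#   wait_for_job(lambda: transcribe.get_transcription_job(
#       TranscriptionJobName="PII-transcribe-job"
#   )["TranscriptionJob"]["TranscriptionJobStatus"])
```

Passing the status lookup in as a function keeps the polling loop independent of the AWS client, which makes it easy to try out with a stub.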
4. Review transcription results
Click on PII-transcribe-job / Under Transcription preview / Text
You can see that all personally identifiable information (PII) in the transcript is masked with the PII entity-type.
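If you download the transcript JSON instead of reading the preview, a short script can list which tags were applied. This sketch assumes the usual transcript layout with the full text under `results.transcripts[0].transcript`; the sample document is made up for illustration, and `redaction_tags` is a hypothetical helper that counts any bracketed upper-case tag it finds.

```python
import json
import re
from collections import Counter

# Matches bracketed tags such as [SSN] or [NAME].
TAG_PATTERN = re.compile(r"\[([A-Z_]+)\]")


def redaction_tags(transcript_json: str) -> Counter:
    """Count the bracketed redaction tags in a transcript JSON document."""
    doc = json.loads(transcript_json)
    text = doc["results"]["transcripts"][0]["transcript"]
    return Counter(TAG_PATTERN.findall(text))


# Hypothetical sample, not real Transcribe output.
sample = json.dumps({
    "results": {
        "transcripts": [
            {"transcript": "My name is [NAME] and my social security number is [SSN]."}
        ]
    }
})
print(redaction_tags(sample))
```

For a redacted job, the `get_transcription_job` response carries the download location under `TranscriptionJob.Transcript.RedactedTranscriptFileUri`.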
Cleanup
- Delete the audio file: PII-file.mp3
- Delete the S3 bucket: pii-bucket12
- Delete the transcription job: PII-transcribe-job
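The cleanup can also be scripted. The order matters, because a bucket has to be empty before it can be deleted. The sketch below uses the boto3 method names `delete_object`, `delete_bucket` and `delete_transcription_job`; `cleanup_plan` and `run_cleanup` are hypothetical helpers.

```python
def cleanup_plan(bucket: str, key: str, job_name: str):
    """Return (client, method, kwargs) steps in the order they should run."""
    return [
        # The object goes first: S3 refuses to delete a non-empty bucket.
        ("s3", "delete_object", {"Bucket": bucket, "Key": key}),
        ("s3", "delete_bucket", {"Bucket": bucket}),
        ("transcribe", "delete_transcription_job", {"TranscriptionJobName": job_name}),
    ]


def run_cleanup(clients: dict, plan) -> None:
    """Execute each step against the matching client object."""
    for client_name, method, kwargs in plan:
        getattr(clients[client_name], method)(**kwargs)


# With boto3 and AWS credentials available, usage would look like:
#   import boto3
#   clients = {"s3": boto3.client("s3"), "transcribe": boto3.client("transcribe")}
#   run_cleanup(clients, cleanup_plan("pii-bucket12", "PII-file.mp3", "PII-transcribe-job"))
```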
What we have done so far
Using Amazon Transcribe, an automatic speech recognition service, we have successfully redacted personally identifiable information (PII) from a transcript.