In my previous article, I showed you how to use Amazon Transcribe, an automatic speech recognition service, to create a text transcript of a pre-recorded speech file in English.

In this article, I am going to show you how to use Amazon Transcribe to protect privacy in your transcriptions by redacting personally identifiable information (PII) from the transcript of an audio file that you upload to an S3 bucket via the AWS Management Console. Each piece of PII that is detected is treated as an entity and replaced in the transcript output with its entity type; for example, the Social Security Number 123-45-6789 is masked as [SSN].
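To make the masking idea concrete, here is a small Python sketch of what "replace the value with its entity type" looks like. This is only an illustration: the regular expression and the `mask_ssn` function are my own assumptions for the example, and Amazon Transcribe's detection is a managed service that does not work this way internally.

```python
import re

# Illustrative only: matches SSNs written as 123-45-6789.
# This is NOT how Amazon Transcribe detects PII; it just shows the masking result.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn(text: str) -> str:
    """Replace every SSN-shaped substring with the [SSN] entity tag."""
    return SSN_PATTERN.sub("[SSN]", text)


print(mask_ssn("My Social Security Number is 123-45-6789."))
# prints: My Social Security Number is [SSN].
```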

Amazon Transcribe

It is an automatic speech recognition service.
You can use it to transcribe media files stored in an Amazon S3 bucket (batch transcription) or to transcribe audio in real time (streaming transcription).
The following types of PII are recognized for batch transcriptions:
- SSN
- CREDIT_DEBIT_NUMBER
- CREDIT_DEBIT_EXPIRY
- CREDIT_DEBIT_CVV
- BANK_ACCOUNT_NUMBER
- BANK_ROUTING
- PIN
- NAME
- EMAIL
- PHONE (10 digits)
- ADDRESS
PII redaction for batch transcription is available with US English (en-US).
You also get the word-for-word portion of the transcription output.
A good use case is an organization that does not want to expose certain transcription data to every team member.
In such situations, personally identifiable information (PII) may need to be removed to protect privacy and comply with local laws and regulations.
With Amazon Transcribe, it is easy to get an accurate transcript with the sensitive text redacted, a task that is error-prone and time-consuming when done manually.
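If you work with the AWS SDK instead of the console, the entity types above are passed as a redaction setting. The sketch below builds that setting as a plain dictionary. The dictionary keys follow the `ContentRedaction` parameter of the boto3 `start_transcription_job` call as I understand it; the `content_redaction` helper and the `PII_ENTITY_TYPES` constant are hypothetical names used only for this example.

```python
# Entity types listed in this article for batch transcription.
PII_ENTITY_TYPES = {
    "SSN", "CREDIT_DEBIT_NUMBER", "CREDIT_DEBIT_EXPIRY", "CREDIT_DEBIT_CVV",
    "BANK_ACCOUNT_NUMBER", "BANK_ROUTING", "PIN", "NAME", "EMAIL",
    "PHONE", "ADDRESS",
}


def content_redaction(entity_types=None):
    """Build a redaction setting; with no selection, redact ALL entity types."""
    if not entity_types:
        selected = ["ALL"]
    else:
        unknown = set(entity_types) - PII_ENTITY_TYPES
        if unknown:
            raise ValueError(f"Unknown PII entity types: {sorted(unknown)}")
        selected = sorted(set(entity_types))
    return {
        "RedactionType": "PII",
        "RedactionOutput": "redacted",
        "PiiEntityTypes": selected,
    }


print(content_redaction(["SSN", "NAME"]))
```

Calling `content_redaction()` with no arguments corresponds to redacting every supported entity type.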

Let’s get started!

Please visit my GitHub repository for S3 articles on various topics, which is updated on a regular basis.

Objectives:

1. Create an S3 bucket

2. Upload an audio PII file into the S3 bucket

3. Create a transcription job

4. Review transcription results

Pre-requisites:

An AWS user account with admin access (not the root account).
An IAM role with the AmazonS3FullAccess policy attached.

Resources Used:

Amazon Transcribe

IAM Access Policy

S3 Bucket

Steps to implement this project:

1. Create an S3 bucket

On the Amazon S3 console / Create bucket / Under General configuration /

Bucket name: - pii-bucket12

AWS Region: - US East (N. Virginia) us-east-1

Take all defaults and Create bucket
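The same bucket can also be created from code. Here is a minimal sketch, assuming the boto3 library is installed and your AWS credentials are configured. The helper names `create_bucket_request` and `create_bucket` are my own; the S3 client is passed in so that the request can be inspected without calling AWS.

```python
def create_bucket_request(bucket: str, region: str = "us-east-1") -> dict:
    """Build the arguments for the S3 create_bucket call."""
    request = {"Bucket": bucket}
    # us-east-1 is the default location and must not be given as a
    # LocationConstraint; every other region has to be named explicitly.
    if region != "us-east-1":
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def create_bucket(s3_client, bucket: str, region: str = "us-east-1"):
    """Create the bucket with an S3 client such as boto3.client('s3')."""
    return s3_client.create_bucket(**create_bucket_request(bucket, region))


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   create_bucket(boto3.client("s3", region_name="us-east-1"), "pii-bucket12")
```

Bucket names are globally unique, so pii-bucket12 may already be taken when you try this; pick your own name in that case.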

2. Upload an audio PII file into the S3 bucket

Amazon Transcribe supports MP3, MP4, WAV, FLAC, AMR, OGG, and WebM formats.
On the Buckets home page, click on your bucket’s name to navigate to the bucket / Select Upload / Add files / Choose the PII-file.mp3 file

Upload

Select the PII-file.mp3 file / Under Properties / In Object overview / Copy the S3 URI / Save it for future use

s3://pii-bucket12/PII-file.mp3
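For readers who prefer a script, the upload and the resulting S3 location can be sketched as follows. This assumes boto3 and configured credentials; `media_format`, `s3_uri`, and `upload_audio` are hypothetical helper names, and the format check only looks at the file extension against the formats listed in this article.

```python
from pathlib import PurePosixPath

# Formats named in this article as supported by Amazon Transcribe.
SUPPORTED_FORMATS = {"mp3", "mp4", "wav", "flac", "amr", "ogg", "webm"}


def media_format(key: str) -> str:
    """Return the lowercase file extension, or fail if it is not supported."""
    extension = PurePosixPath(key).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported media format: {key!r}")
    return extension


def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// location that the transcription job will read from."""
    return f"s3://{bucket}/{key}"


def upload_audio(s3_client, local_path: str, bucket: str, key: str) -> str:
    """Upload the file with an S3 client and return its s3:// location."""
    media_format(key)  # fail early on an unsupported extension
    s3_client.upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   upload_audio(boto3.client("s3"), "PII-file.mp3", "pii-bucket12", "PII-file.mp3")
```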

3. Create a transcription job

From the top menu bar, select Services then begin typing Transcribe in the search bar and select Amazon Transcribe to open the service console.
On the Amazon Transcribe Console / Transcription jobs page, click Create job / Under Specify job details / Job settings /

Name: - PII-transcribe-job

Language: - English, US (en-US)

Input data / Input file location on S3: s3://pii-bucket12/PII-file.mp3

Output data location type: take the default - Service-managed S3 bucket.

Next

On the Configure page / Under Content removal / Check PII redaction / Take the default Select ALL

Create job

Wait for the status of your job to change from In progress to Complete
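The console steps above can also be scripted. Below is a minimal sketch, assuming boto3 and configured credentials, that starts a job with PII redaction for all entity types and waits for it to finish. The function names and the polling interval are my own choices for illustration; no output bucket is given, which I understand leaves the output in the service-managed bucket, as in the console default.

```python
import time


def transcription_job_request(job_name: str, media_uri: str) -> dict:
    """Build the arguments for start_transcription_job with PII redaction."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": media_uri},
        "ContentRedaction": {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
            "PiiEntityTypes": ["ALL"],  # same as the console's Select ALL
        },
    }


def run_job(transcribe_client, job_name: str, media_uri: str, poll_seconds: float = 10):
    """Start the job, then poll until it is COMPLETED or FAILED."""
    transcribe_client.start_transcription_job(
        **transcription_job_request(job_name, media_uri)
    )
    while True:
        response = transcribe_client.get_transcription_job(
            TranscriptionJobName=job_name
        )
        job = response["TranscriptionJob"]
        if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
            return job
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   run_job(boto3.client("transcribe", region_name="us-east-1"),
#           "PII-transcribe-job", "s3://pii-bucket12/PII-file.mp3")
```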

4. Review transcription results

Click on PII-transcribe-job / Under Transcription preview / Text
You can see that all personally identifiable information (PII) in the transcript is masked with the PII entity-type.
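If you download the transcript as JSON, you can check the masking programmatically. The sketch below counts the bracketed tags in the transcript text. The `sample` document is made up for this example and only mimics the `results` / `transcripts` layout of the output as I understand it; `redaction_tags` is a hypothetical helper name.

```python
import re
from collections import Counter

# A redaction tag is an upper-case label in square brackets, such as [SSN].
TAG_PATTERN = re.compile(r"\[([A-Z_]+)\]")


def redaction_tags(transcript_doc: dict) -> Counter:
    """Count how often each redaction tag appears in the transcript text."""
    counts = Counter()
    for entry in transcript_doc["results"]["transcripts"]:
        counts.update(TAG_PATTERN.findall(entry["transcript"]))
    return counts


# Made-up sample for illustration, not real job output.
sample = {
    "results": {
        "transcripts": [
            {"transcript": "My name is [NAME] and my social security number "
                           "is [SSN]. Please call [NAME] back."}
        ]
    }
}

print(redaction_tags(sample))
```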

Cleanup

Delete the audio file - PII-file.mp3
Delete the S3 bucket - pii-bucket12
Delete the Transcription job - PII-transcribe-job
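The cleanup can be scripted in the same order. This is a minimal sketch, assuming boto3 and configured credentials; the `cleanup` function name is my own, and the clients are passed in as parameters. The object is deleted first because a bucket must be empty before it can be deleted.

```python
def cleanup(s3_client, transcribe_client, bucket: str, key: str, job_name: str) -> None:
    """Remove the audio file, the bucket, and the transcription job."""
    # 1. Delete the audio file so that the bucket is empty.
    s3_client.delete_object(Bucket=bucket, Key=key)
    # 2. Delete the (now empty) bucket.
    s3_client.delete_bucket(Bucket=bucket)
    # 3. Delete the transcription job.
    transcribe_client.delete_transcription_job(TranscriptionJobName=job_name)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   cleanup(boto3.client("s3"), boto3.client("transcribe"),
#           "pii-bucket12", "PII-file.mp3", "PII-transcribe-job")
```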

What we have done so far

Using Amazon Transcribe, an automatic speech recognition service, we have successfully redacted personally identifiable information (PII) from the transcript of an audio file.


How AWS Service - Amazon Transcribe acts on PII

Revathi Joshi
