Day 17: Speech-to-Text with gRPC and Golang
Dilek Karasoy
Posted on January 24, 2023
To have a working gRPC microservice, three components are essential:
- a .proto file to define the gRPC services and messages
- a server to process the submitted audio and return the transcription
- a client to talk to the server
.proto file
syntax = "proto3";
package messaging;
option go_package = "go-grpc/messaging";
service LeopardService {
    rpc GetTranscriptionFile(stream Chunk) returns (transcriptResponse) {}
}
message Chunk {
    bytes Content = 1;
}
We define only one RPC (GetTranscriptionFile) in the service for simplicity.
By default, gRPC limits incoming messages to 4 MB. Hence, the RPC is declared as client-side streaming, so that audio files can be sent as a sequence of byte chunks.
enum StatusCode {
    Unknown = 0;
    Ok = 1;
    Failed = 2;
}
message transcriptResponse {
    string transcript = 1;
    StatusCode Code = 2;
}
Now, let's compile the .proto file with protoc to generate Go code, as we are going to write both the server and the client in Go.
Client:
First, we need a client for the defined LeopardService service.
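func main() {
	// plaintext connection for the demo; newer grpc-go versions prefer
	// grpc.WithTransportCredentials(insecure.NewCredentials())
	opts := grpc.WithInsecure()
	conn, err := grpc.Dial(*serverAddressArg, opts)
	if err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer conn.Close()
	client := messaging.NewLeopardServiceClient(conn)
	// the audio file is opened inside runTranscriptionFile
	if err := runTranscriptionFile(client, *inputAudioPathArg); err != nil {
		log.Fatalf("transcription failed: %v", err)
	}
}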
Inside the runTranscriptionFile function, the audio file is read in chunks of 1 MB and sent to the server, with a timeout of 10 seconds for the whole call. Finally, the stream is closed, and the server response is received by calling the CloseAndRecv function.
func runTranscriptionFile(client messaging.LeopardServiceClient, filePath string) error {
	file, err := os.Open(filePath)
	if err != nil {
		return err
	}
	defer file.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	stream, err := client.GetTranscriptionFile(ctx)
	if err != nil {
		return err
	}
	buf := make([]byte, 1024*1024) // 1 MB
	for {
		n, err := file.Read(buf)
		if n > 0 {
			// send the loaded bytes to the server
			if sendErr := stream.Send(&messaging.Chunk{Content: buf[:n]}); sendErr != nil {
				return sendErr
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
	}
	// signal the server that the upload is done and wait for its response
	reply, err := stream.CloseAndRecv()
	if err != nil {
		return err
	}
	log.Printf("reply: %v", reply)
	return nil
}
Server:
On the server side, a gRPC server instance is created and the service implementation is registered to answer LeopardService calls.
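func main() {
	addr := fmt.Sprintf("localhost:%d", *port)
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	grpcServer := grpc.NewServer()
	messaging.RegisterLeopardServiceServer(grpcServer, newServer(*accessKeyArg))
	if err := grpcServer.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}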
After getting a transcription request, the server starts an instance of Leopard and keeps reading the incoming bytes until it receives EOF. Then, the bytes are stored in a temporary file and passed to Leopard. Finally, the transcription is sent back to the client along with a status code.
func (s *leopardServer) GetTranscriptionFile(stream messaging.LeopardService_GetTranscriptionFileServer) error {
	// default response if any error happens
	failed := &messaging.TranscriptResponse{Code: messaging.StatusCode_Failed}
	// define an instance of Leopard and init it
	engine := leopard.NewLeopard(s.accessKey)
	if err := engine.Init(); err != nil {
		return stream.SendAndClose(failed)
	}
	defer engine.Delete()
	var audio []byte
	for {
		// keep reading bytes from the stream until it reaches the end
		audioFileChunk, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		audio = append(audio, audioFileChunk.Content...)
	}
	// create a temporary file to store the received audio stream
	f, err := os.CreateTemp("", "audio_temp_file")
	if err != nil {
		return stream.SendAndClose(failed)
	}
	defer os.Remove(f.Name())
	_, err = f.Write(audio)
	if closeErr := f.Close(); err == nil {
		err = closeErr
	}
	if err != nil {
		return stream.SendAndClose(failed)
	}
	// process the audio file without any preprocessing, using Leopard's ProcessFile method
	transcription, _, err := engine.ProcessFile(f.Name())
	if err != nil {
		return stream.SendAndClose(failed)
	}
	// send back the result and close the stream connection
	return stream.SendAndClose(&messaging.TranscriptResponse{
		Transcript: transcription,
		Code:       messaging.StatusCode_Ok,
	})
}
We could also have sent the audio in raw (PCM) format and fed it directly to Leopard without storing it, but there are two caveats:
- more preprocessing is needed on the client side to decode the audio file.
- the amount of data to be transferred is significantly larger for the raw format than for compressed formats such as MP3 or OGG.
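To get a rough feel for the second caveat, the sketch below compares the size of one minute of audio in both forms. The numbers are assumptions chosen for illustration: raw PCM at 16 kHz, 16-bit, mono, and a constant 64 kbit/s bitrate for the compressed file. Real files vary with the encoder settings.

```go
package main

import "fmt"

// pcmBytes returns the size of uncompressed PCM audio in bytes.
func pcmBytes(seconds, sampleRate, bytesPerSample, channels int) int {
	return seconds * sampleRate * bytesPerSample * channels
}

// compressedBytes returns the approximate size in bytes of a
// constant-bitrate stream, ignoring container overhead.
func compressedBytes(seconds, bitsPerSecond int) int {
	return seconds * bitsPerSecond / 8
}

func main() {
	const seconds = 60
	pcm := pcmBytes(seconds, 16000, 2, 1)  // assumed: 16 kHz, 16-bit, mono
	mp3 := compressedBytes(seconds, 64000) // assumed: 64 kbit/s
	fmt.Printf("PCM: %d bytes, MP3: %d bytes\n", pcm, mp3)
}
```

Under these assumptions the program prints `PCM: 1920000 bytes, MP3: 480000 bytes`, so the raw stream is about four times larger.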
Learn more about Leopard, and check out the open-source demos.