Making your CV talk 🤖 Explaining the concept and system design👨‍💻

nmitic

Nikola Mitic

Posted on June 5, 2024


Explaining the concept (in non-technical terms)

Here is where things finally get more interesting. To make it easy to understand, I will simplify the concept so that we can grasp it first and build on top of it later.

Imagine we have a house with 3 floors, and each floor unlocks something new.

  1. The first floor is the Question floor.
  2. The second floor is the Knowledge floor.
  3. The third floor is the Speech floor.

Question floor ❓

Now let's imagine you enter the first floor. Here you can ask your question. This represents our web page with its chat and audio interface.

Knowledge floor 📚

You can take a piece of paper and write your question down. But you want an answer, and getting the answer requires knowledge. The problem is, you do not have time to wait. You tried going up to the next floor, the knowledge floor, but it takes far too long to find an answer, write it down, and come back to your question floor, where you can ask more.

So instead, you bring a friend with you. This friend will take your question and look for the answer, and as soon as he has something that might be an answer to your question, he will write it down on multiple papers, each containing part of the answer, and hand them to you so that you can read them one by one instead of waiting for him to write out the complete answer.

The knowledge floor is our interview AI backend service, leveraging OpenAI (or Groq, or any other LLM service of your choice) while being aware of our CV data.

But wait, there is one more floor to unlock: the Speech floor! This is because you got tired of reading and typing; you want to ask questions using your voice, and you want the answer to be voice as well!

Speech floor 🗣️

The Speech floor is where we have our cloned voice! All it does is take a paper with text written on it and speak whatever is written. And here, just as on the knowledge floor, you want to hear the voice as soon as an answer is available!

So you have a third friend who is very loud and can speak at a volume that you can hear all the way down on the first floor. Your friend in the knowledge room now hands the papers with the chunked answer (a text stream, in tech terms) to your friend in the speech room, who starts reading them to you part by part (paper by paper) as soon as he gets the first one (an audio stream, in tech terms). The Speech floor is our Eleven Labs voice clone API.

System design - how does it all come together?

Personal AI clone system design flow chart
Click here for link to diagram

OK, back to our world of ones and zeros. You can familiarize yourself with our system design using the diagram above.

It looks more complicated than it actually is.

We have 5 main components:

  1. Client - Next JS
  2. Express JS - server with two routes (/api/talk and /api/ask)
  3. Headless CMS - Hygraph
  4. OpenAI API (and the Groq API as a second choice for reduced costs)
  5. Eleven Labs API

The name of the game here is STREAM. 🏄‍♂️

Express JS always returns a stream of either text chunks or audio chunks. Both Eleven Labs and OpenAI support streaming, which is in a way just proxied through our custom Express JS server.

One tricky part here is how to feed the Eleven Labs WebSocket with the stream from OpenAI. We will talk about that in the following chapters.
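To make the "always return a stream" idea concrete, here is a tiny sketch of the pattern in Express JS. The route name and the hard-coded chunks are placeholders purely for illustration; in the real service, the chunks come from OpenAI or Eleven Labs instead of an array.

```typescript
import express from "express";

const app = express();

// Toy route: write each chunk to the response as soon as it is available,
// instead of buffering the whole answer and sending it at once.
app.get("/api/stream-demo", async (req, res) => {
  res.setHeader("Content-Type", "text/plain; charset=utf-8");

  const chunks = ["Hello, ", "this ", "is ", "a ", "streamed ", "answer."];
  for (const chunk of chunks) {
    res.write(chunk); // flush each piece immediately
    await new Promise((resolve) => setTimeout(resolve, 200)); // simulate upstream latency
  }
  res.end();
});

app.listen(3001);
```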

Here is what happens when a user types a question in the chat interface (see the sketch after this list):

  1. The client makes an API call to the backend service via the /api/ask path.
  2. The backend service makes a call to the headless CMS to get the latest data.
  3. Once the CMS data is returned, the backend makes another request towards OpenAI to get the answer in the form of a stream.
  4. And finally, the backend returns the stream to the client.
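As a rough sketch of those four steps, an /api/ask route could look like this, assuming the official openai npm package. The fetchCvData helper, its GraphQL query, the system prompt and the model name are illustrative placeholders, not the exact implementation.

```typescript
import express from "express";
import OpenAI from "openai";

const app = express();
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical helper: fetch the latest CV content from the headless CMS (Hygraph).
async function fetchCvData(): Promise<string> {
  const res = await fetch(process.env.HYGRAPH_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: "{ cv { markdown } }" }), // illustrative query
  });
  const json = await res.json();
  return JSON.stringify(json.data);
}

app.get("/api/ask", async (req, res) => {
  const question = String(req.query.question ?? "");
  const cv = await fetchCvData(); // step 2: get the latest CMS data

  // Step 3: ask OpenAI for the answer as a stream, grounded in the CV data.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // any streaming-capable model works
    stream: true,
    messages: [
      { role: "system", content: `Answer as the CV owner, using only this CV data: ${cv}` },
      { role: "user", content: question },
    ],
  });

  // Step 4: forward each text chunk to the client as soon as it arrives.
  res.setHeader("Content-Type", "text/plain; charset=utf-8");
  for await (const chunk of completion) {
    res.write(chunk.choices[0]?.delta?.content ?? "");
  }
  res.end();
});

app.listen(3001);
```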

Here is what happens when a user asks a question using the audio interface (see the sketch after this list):

  1. The client converts audio into text using the client-side speech recognition API.
  2. The client makes an API call to the backend service via the /api/talk path.
  3. The backend service makes a call to the headless CMS to get the latest data.
  4. Once the CMS data is returned, the backend makes another request towards OpenAI to get the answer in the form of a stream.
  5. As soon as the first chunk of streamed data is returned, the backend creates a buffer, representing the array of words needed for the Eleven Labs WebSocket API.
  6. The backend makes a request towards the Eleven Labs WebSocket API as soon as the first buffered word is ready.
  7. Eleven Labs returns an audio stream to the backend.
  8. The backend returns the stream and exposes a route which can be played in the client using the web audio API.
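Here is a hedged sketch of steps 5-8: bridging the OpenAI text stream into the Eleven Labs input-streaming WebSocket and piping the audio chunks back out through the HTTP response. The endpoint, message shapes and field names follow the Eleven Labs docs at the time of writing, so double-check them against the current API reference; the voice settings are illustrative.

```typescript
import WebSocket from "ws";
import type { Response } from "express";

// Bridge an async iterable of text chunks (e.g. mapped from the OpenAI stream)
// into the Eleven Labs input-streaming WebSocket, writing the resulting audio
// straight into the Express response.
async function streamSpeech(
  textChunks: AsyncIterable<string>,
  res: Response
) {
  const voiceId = process.env.ELEVENLABS_VOICE_ID!;
  const ws = new WebSocket(
    `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input?model_id=eleven_monolingual_v1`
  );

  ws.on("open", async () => {
    // Initial message: voice settings + API key.
    ws.send(
      JSON.stringify({
        text: " ",
        voice_settings: { stability: 0.5, similarity_boost: 0.8 },
        xi_api_key: process.env.ELEVENLABS_API_KEY,
      })
    );

    // Steps 5-6: forward each buffered piece of text as soon as it is ready.
    for await (const chunk of textChunks) {
      ws.send(JSON.stringify({ text: chunk, try_trigger_generation: true }));
    }

    // An empty string tells Eleven Labs that the input is finished.
    ws.send(JSON.stringify({ text: "" }));
  });

  res.setHeader("Content-Type", "audio/mpeg");

  // Steps 7-8: Eleven Labs answers with base64-encoded audio chunks,
  // which we decode and write into the HTTP response stream.
  ws.on("message", (data) => {
    const message = JSON.parse(data.toString());
    if (message.audio) res.write(Buffer.from(message.audio, "base64"));
    if (message.isFinal) {
      ws.close();
      res.end();
    }
  });

  ws.on("error", () => res.end());
}
```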

I won't go into detail about each part of the system and how to implement it. However, there are certain problems you will face regardless of your tech stack of choice, and I would like to share how I went about solving them.

  1. How to have the OpenAI API answer based on your custom data?
  2. How to send a text stream from Express JS?
  3. How to send an audio stream from Express JS?
  4. How to read a text stream on the client using Next JS?
  5. How to read an audio stream on the client using Next JS?
  6. How to combine the OpenAI text stream response with Eleven Labs WebSocket streaming in TypeScript?
  7. How to use Express JS routes to listen for a WebSocket API?
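As a small teaser for point 4, here is a minimal sketch of reading the text stream on the client with the Fetch API and a ReadableStream reader, for example inside a Next JS client component. The endpoint shape and the onChunk callback are illustrative placeholders.

```typescript
// Read the /api/ask text stream chunk by chunk and hand each decoded piece
// to the UI, so the answer appears word by word, just like the paper analogy.
export async function readAnswerStream(
  question: string,
  onChunk: (text: string) => void
) {
  const res = await fetch(`/api/ask?question=${encodeURIComponent(question)}`);
  if (!res.body) throw new Error("No response body to stream");

  const reader = res.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}
```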

❤️ If you would like to stay in touch, please feel free to connect ❤️

  1. X
  2. Linkedin
  3. nikola.mitic.dev@gmail.com
