Learning WebRTC to make browser calls for the Twilio hackathon
David Pereira
Posted on April 24, 2020
Currently I'm learning more about WebRTC in order to develop the next feature for my Twilio hackathon application - Client Connector. The feature I'm trying to implement is making a phone call using the browser, and this post will be more of a record of my thought process and what I've been learning, rather than a tutorial or a guide. I don't have all the answers yet.
The application is currently deployed and you can check it out here. It's pretty basic since it only lets you send an SMS to a phone number (I haven't tested sending to other countries, only Portugal).
I've already created a separate branch for this feature. During development I hit a lot of walls following this tutorial, and getting something working took a while. At a certain point I stopped and started asking questions.
TL;DR
getUserMedia is used to ask the user for permission to use audio devices such as the microphone
Twilio secures the audio of the phone call with the AES_CM_128_HMAC_SHA1_80 crypto suite and TLS
How does this work?
In the midst of the errors and my 30 tabs of documentation or YouTube (maybe more), I asked myself: "How does the browser connect to my TwiML application?" I mean, I created that application, gave it the URL to my ngrok server, and I still don't see any log on the server of a request to that route. So what's wrong?
I researched and stumbled upon this image on the Twilio docs, explaining the phone call process:
After this I thought "Ok, seems simple enough. Twilio abstracts most stuff so that I only need to give them instructions in TwiML (which is a language specific for this) and the client uses a library to connect to Twilio". But I still kept getting stuck while following the tutorial, and seeing their code repo helped to a certain extent. So I decided to dig deeper and understand more concepts about this whole process.
First, on the link to the docs above it says: "You setup your device and establish a connection to Twilio". This is done with the Device.connect() method of the twilio-client npm module.
twilioDevice.connect({ phone: phoneNumber })
Code example of the connect method
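To make that call a bit more concrete, here is a minimal sketch of how I might wrap it. The DeviceLike interface, normalizePhoneNumber and callNumber are names I made up for illustration and are not part of twilio-client; the only thing taken from my code is the { phone } parameter passed to connect(). As far as I understand, Twilio forwards those custom parameters to the TwiML application's voice URL, so it seems worth validating the number before connecting.

```typescript
// Hypothetical minimal shape of the object that exposes connect(); only what this sketch needs.
interface DeviceLike {
  connect(params: Record<string, string>): unknown;
}

// Strip spaces, dashes and parentheses, then require "+" followed by 2 to 15 digits (E.164-style).
function normalizePhoneNumber(input: string): string | null {
  const compact = input.replace(/[\s\-()]/g, "");
  return /^\+[1-9]\d{1,14}$/.test(compact) ? compact : null;
}

// Returns false without connecting when the number does not look valid.
function callNumber(device: DeviceLike, rawNumber: string): boolean {
  const phone = normalizePhoneNumber(rawNumber);
  if (phone === null) return false;
  device.connect({ phone }); // same custom parameter name as in the post
  return true;
}

// A fake device that only records what it was asked to do, so the sketch runs without Twilio.
const calls: Record<string, string>[] = [];
const fakeDevice: DeviceLike = {
  connect: (params) => {
    calls.push(params);
    return null;
  },
};

console.log(callNumber(fakeDevice, "+351 912 345 678")); // true
console.log(callNumber(fakeDevice, "not a number")); // false
```

In the real application the fake device would be replaced by the twilioDevice instance from the snippet above.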
So, what is happening when I call that method? What is that connection that is being created? Is the data going through that connection secure, or can someone listen in? Let's dig deeper into each question.
What is happening when I call that method?
To figure this out, I took a look at the library code because I wanted to know if they use the classes I had read about while researching WebRTC (plus I'm curious). In the Device.setup() method I found references to the RTCPeerConnection class that is part of the WebRTC API and some other terms related to WebRTC like ICE candidate, but those terms seem to be about other APIs of WebRTC and I was focused on the audio side.
By this point I've mentioned WebRTC quite a bit, so let's talk about it.
WebRTC
I wasn't aware of all the APIs that came with HTML5, and that WebRTC (Web Real-Time Communications) was one of them. It consists of three APIs:
MediaStream - access devices like the camera and microphone
RTCPeerConnection - set up and manage the audio or video connection between peers
RTCDataChannel - real-time P2P transfer of generic data
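For the MediaStream part, asking for the microphone could look roughly like the sketch below. MediaDevicesLike and requestMicrophone are hypothetical names of mine; in a browser you would pass navigator.mediaDevices, whose getUserMedia() returns a promise that rejects when the user denies permission or no device is found.

```typescript
// Hypothetical minimal type so the sketch also compiles outside a browser.
interface MediaDevicesLike {
  getUserMedia(constraints: { audio: boolean; video: boolean }): Promise<unknown>;
}

// Resolves to true when a stream was obtained, false when the request was rejected.
async function requestMicrophone(devices: MediaDevicesLike): Promise<boolean> {
  try {
    // Audio only: a phone call does not need the camera.
    await devices.getUserMedia({ audio: true, video: false });
    return true;
  } catch (error) {
    console.warn("Microphone not available:", error);
    return false;
  }
}

// In a browser: requestMicrophone(navigator.mediaDevices).then((ok) => console.log(ok));
```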
The function getUserMedia(), which I use to ask the user for permission to use the microphone and audio devices, is part of the MediaStream API. Underneath, WebRTC uses codecs to determine how to compress and send the data. As I was studying I was introduced to the opus codec, and it seemed interesting because it adapts the audio quality based on, for example, the connection speed. Also, twilio-client seems to support it as a valid codec. I found this snippet in the library code:
/**
 * Valid audio codecs to use for the media connection.
 */
enum Codec {
  Opus = "opus",
  PCMU = "pcmu",
}
I still don't know which codec is used, since I couldn't find where the default codec is defined and I don't specify one in particular in my code. If you know, feel free to post a comment below and I'll be happy to read it.
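One way I could try to answer this is to look at the SDP that the browser negotiates: the m=audio line lists payload types from most to least preferred, and the a=rtpmap lines give each one a codec name. The sketch below parses that. The function name and the sample SDP are made up for illustration; a real SDP (for example from an RTCPeerConnection's localDescription) has many more lines, and the codec that ends up being used depends on what both sides agree on.

```typescript
// Returns the audio codec names from an SDP, in the order the m=audio line prefers them.
function audioCodecsInPreferenceOrder(sdp: string): string[] {
  const lines = sdp.split(/\r?\n/);

  // Map payload type -> codec name, from lines like "a=rtpmap:111 opus/48000/2".
  const names: Record<string, string> = {};
  let audioLine: string | undefined;
  for (const line of lines) {
    const match = /^a=rtpmap:(\d+) ([^/]+)\//.exec(line);
    if (match) names[match[1]] = match[2];
    if (audioLine === undefined && line.indexOf("m=audio ") === 0) audioLine = line;
  }
  if (audioLine === undefined) return [];

  // "m=audio <port> <proto> <payload types...>": everything after the third field is a payload type.
  const payloadTypes = audioLine.split(" ").slice(3);
  const codecs: string[] = [];
  for (const payloadType of payloadTypes) {
    const name = names[payloadType];
    if (name !== undefined) codecs.push(name); // payload types without an rtpmap line are skipped
  }
  return codecs;
}

// Made-up sample, trimmed to the lines this sketch cares about.
const sampleSdp = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
].join("\r\n");

console.log(audioCodecsInPreferenceOrder(sampleSdp).join(", ")); // opus, PCMU
```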
What is that connection that is being created?
The first few times I looked at that image in the docs, I totally skimmed past the "VoIP Connection" and started wondering whether it was a peer-to-peer connection or a TCP connection, since I was reading that WebRTC uses those.
When I looked at the twilio-client npm module I got a bit more confused because I saw PSTN instead of VoIP.
More questions started to arise, and stress from feeling unproductive began to bubble up as well, since I was coding way less. To combat this I took a little break and tried asking the community and other people. Turns out, the connection that is created is a UDP connection, which makes sense now that I think about it: when sending an audio stream we are less worried about losing some data packets and more worried about avoiding lag on the phone call, which TCP could potentially add.
Is the data going through that connection secure?
From their docs we can see they have a table about security, where there is some information about what is used to secure the connection:
DTLS-SRTP is a key exchange mechanism, and the DTLS part is the most important to me since it basically means UDP + security. AES_CM_128_HMAC_SHA1_80 seems to be the crypto suite, that is, the set of algorithms used to secure the audio data going through that connection. The name is quite lengthy because there are different algorithms in it:
AES stands for Advanced Encryption Standard and CM stands for Counter Mode. From what I understood this is the algorithm used for encrypting and decrypting the data, with a master-key length of 128 bits
HMAC is the MAC (Message Authentication Code) algorithm, used here with the SHA1 hash function and an 80-bit authentication tag that carries the message authentication data
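To get a feel for what those two pieces do, here is a small sketch using Node's built-in crypto module. It only illustrates the primitives named in the suite (AES-128 in counter mode, and HMAC-SHA1 truncated to 80 bits); it is not SRTP itself and not how Twilio or the browser implements it. The keys are random placeholders, whereas real SRTP derives its session keys from a master key and salt, and the function names are mine.

```typescript
import { createCipheriv, createDecipheriv, createHmac, randomBytes } from "crypto";

// Placeholder keys for illustration only.
const encryptionKey = randomBytes(16); // 128-bit AES key
const authKey = randomBytes(20); // key for HMAC-SHA1
const counterBlock = randomBytes(16); // initial counter block; must never be reused across frames

// AES in counter mode: the ciphertext has the same length as the plaintext.
function encrypt(payload: Buffer): Buffer {
  const cipher = createCipheriv("aes-128-ctr", encryptionKey, counterBlock);
  return Buffer.concat([cipher.update(payload), cipher.final()]);
}

function decrypt(ciphertext: Buffer): Buffer {
  const decipher = createDecipheriv("aes-128-ctr", encryptionKey, counterBlock);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}

// HMAC-SHA1 produces 20 bytes; only the first 10 bytes (80 bits) are kept as the tag.
function authTag(data: Buffer): Buffer {
  return createHmac("sha1", authKey).update(data).digest().subarray(0, 10);
}

const audioFrame = Buffer.from("pretend this is an audio frame");
const protectedFrame = encrypt(audioFrame);
const tag = authTag(protectedFrame);

console.log(tag.length * 8); // 80
console.log(decrypt(protectedFrame).toString()); // pretend this is an audio frame
```

The receiver would recompute the tag over what it received and drop the packet if the tags differ, which is how tampering gets detected.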
This table helped to answer my initial question, but I still didn't know what the "Signaling" channel was, for example. So, wanting to know more, I searched "what is the signalling channel of twilio" on Google and it led me here. It didn't seem to be the right information for what I wanted, so I kept researching browser signaling and signaling in WebRTC. In the end I found some information about SIP being a signaling protocol and figured that's what is used in the Twilio Client JS SDK.
Anywho, the topic of WebRTC security is vast and I definitely didn't read everything about it. Maybe a subject for another day.
This application uses the dotenv module to read the environment variable configuration. So in order to run the server, you must create a .env file and set the appropriate value for each variable. Below is a table with the variables you need to set, or check the file .env-sample (optional values aren't on the table):
That's it for now, thanks for reading this post! If you have any knowledge about the inner details of how the whole process of making a phone call using a web browser works, the protocols used, etc., I'd love to hear about it and learn from it. I'm very much in a "question everything" mindset, and I tried to put together all the online resources I've read or seen.
Also, do post comments if I got any information wrong or you have feedback.
Additional Resources
Here are some links I've been using to learn more about WebRTC and other concepts in general: