Everything About Developing Voice Assistant Applications

lankinen

Posted on May 24, 2020

Voice assistants are growing in popularity, which makes them an interesting platform to develop content for. Making money by developing applications for voice assistants is still in its early days, but there are already people doing it for a living and making good money.

We are used to developing content for visual interfaces (like a mobile phone or a computer), but voice assistants require an understanding of the voice user interface (VUI). In most VUIs the input is voice (there might also be an option to type, but it's still the same idea) and the output is voice. It is somewhat popular to have a screen on the device, making it possible to also show information visually.

The reason voice assistants are gaining importance is that natural language processing (NLP) has taken some big leaps in the last few years, making it possible to understand human language much better. If we compare a voice assistant from 5 years ago with one today, the difference is huge. The biggest difference is the speech-to-text feature. It's now so accurate that even a foreigner with an accent is understood nearly perfectly. This is an important improvement, because if a user needs to say "call mom" 5 times before the assistant understands, it's probably faster to just take out a mobile phone and tap a few times.

Right now voice assistants are mostly used to get answers to general questions (How tall is the Empire State Building?), get the weather (What's the weather today?), set alarms (Set a 10 minute alarm), control music or other smart home devices (Play Magic by Mr Jukes), or read the news (Read TechCrunch).

Problems

Long queries don't work well because it's difficult for a machine to understand what the user wants if they say multiple sentences or take back what they said previously. When talking with a voice assistant, I have noticed myself always thinking how to phrase the request simply so that Alexa understands it. That is not something I need to do when talking with another human. It's a bit like talking with a person with bad hearing, where you want to keep the sentences simple and on point.

Another problem is that speaking out loud in a public place is awkward. Mobile phones are great because people rarely see what the other person is doing on their phone, but voice assistants require the user to say things out loud, making them public to everyone. This will probably become easier as more people start doing it and it becomes a norm, but there are still things you don't want everyone in a subway to hear.

Showing information is the biggest problem for voice assistants. It's impossible to show a lot of information quickly; that is just something visual interfaces do better. If someone knows exactly what flight they want, they can book it using a voice assistant, but if they want to browse different options, a visual interface is required. Some voice assistants solve this by having a screen on the device, making it possible to show information and then control everything with voice.

Speaking is faster than typing but reading is faster than listening.

When is it practical?

Voice is great for specific requests, and something people don't realize is how often their intents are specific enough. Some people complain that they don't want to buy milk using a voice assistant because they want to look at different options. In reality they are looking for something specific, like a price, a brand, or a combination of multiple variables. A voice assistant can learn a user's behavior fairly easily and predict what milk they would like to buy.

There will be things where a touch screen is better, the same way we still have computers, and that is why voice will probably never replace other ways of interacting with devices. Still, voice is a really useful interface in many cases, and it's intuitive because that's how we have interacted with other humans for thousands of years. It's not hard to say that voice is here to stay, and even the biggest deniers will be using it in some situations.

Designing Voice Application

Application here refers to a skill, a capsule, etc., depending on which assistant we are talking about. It's really confusing that every voice assistant has its own name for applications.

Voice commands contain four key elements:

  • Intent
    • What the user wants.
  • Utterance
    • The same question can be asked in multiple ways, and the machine needs to understand all of them.
  • Context
    • To take voice assistants to the next level, it's important to understand the context really well. For example, when booking a flight, the voice assistant should already know where the user is going and that way offer the right flight more easily.
  • Slots
    • The changing fields in a request: asking for the weather is always the same request, but what changes is the date the user wants to know the weather for.

The user didn't define what day they want to know the weather for, but it's very likely that they meant "today" if there is no other context. If the user just asked about the Mets' game tomorrow, it's more likely that they wanted to know the weather on that day.
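To illustrate intents, utterances, and slots, here is a toy, entirely hypothetical matcher. Real platforms (Alexa, DialogFlow) do this with trained NLU models rather than regular expressions, but the elements are the same: several utterance patterns map to one intent, and a missing slot falls back to a sensible default like "today".

```python
import re

# Hypothetical utterance patterns for a single intent. A real skill would
# declare these in the platform's interaction model, not in regexes.
UTTERANCES = {
    "GetWeatherIntent": [
        r"what(?:'s| is) the weather(?: on (?P<day>\w+))?",
        r"weather(?: for (?P<day>\w+))?",
    ],
}

def parse(utterance):
    """Return (intent, slots) for an utterance, or (None, {}) if no match."""
    for intent, patterns in UTTERANCES.items():
        for pattern in patterns:
            match = re.fullmatch(pattern, utterance.lower())
            if match:
                slots = {k: v for k, v in match.groupdict().items() if v}
                # No explicit day slot: assume the user means today.
                slots.setdefault("day", "today")
                return intent, slots
    return None, {}
```

Multiple phrasings ("What is the weather", "weather for tomorrow") resolve to the same intent; only the slot values differ.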

1 Which interactions need a voice feature?

When designing voice interactions, it's important to think about which things are used often and which are almost never used. The things that are never used might not need a voice interface, but the things users use regularly might be useful to have as voice. An Uber skill can be used to book rides but not to sign up.

Voice assistants offer developers a way to log in to or sign up for the service, so for example it's possible to create a voice-only Uber and not have a mobile app at all if signing up is the only thing people do on the mobile app.

2 Designing the interactions

In a voice user interface (VUI), all the information is in a flat hierarchy. It means that the user can do anything with one command, whereas in a visual interface some of the information might be hidden behind menus, so the user might need to press certain buttons before they can do some action. It's still important to have some flows in a VUI, because not everyone remembers exactly what they need to say. For example, if someone is about to book a table for two at a local restaurant, they might forget to mention the time, so it's important that instead of giving an error the VUI asks "What time do you want the table to be reserved?"
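The restaurant example can be sketched as simple slot filling: prompt for whatever required information is missing instead of erroring out. This is a hypothetical sketch, not any platform's actual dialog API; the slot names and prompts are made up.

```python
def book_table(slots):
    """Return the assistant's next response for a table booking request.

    `slots` holds whatever the user has provided so far, e.g.
    {"party_size": 2}. Missing required slots trigger a follow-up question.
    """
    required = {
        "party_size": "For how many people?",
        "time": "What time do you want the table to be reserved?",
    }
    for slot, prompt in required.items():
        if slot not in slots:
            return prompt  # re-prompt instead of giving an error
    return f"Booked a table for {slots['party_size']} at {slots['time']}."
```

The flow keeps asking until all required slots are filled, which matches how platform dialog managers handle incomplete requests.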

3 Where do users use the product?

Car, home, public transportation, the office? This affects the sound quality but also what users feel comfortable saying. Some voice assistants are supported in certain kinds of places more than others. A lot of them try to partner with car companies, and that's something to consider if the application is used mostly in cars.

4 How to use a screen?

If the device offers a screen, it might be a really powerful tool. Some people just put the exact same text the device says on the screen to get it done, but this is not a good use of the real estate given. As mentioned before, a screen is a great place to show a lot of information at once, and that's probably its best use case. When creating a voice application for Uber, it might make sense to show the different rides, even if there are only 3 options, just in case this time the user doesn't want the option they usually pick.

5 Testing

When testing, it's important to really say things with the real device in a real situation to make sure the machine understands them. Sometimes a word might be pronounced the same way as three other words, and that can cause problems. Some words are hard to pronounce for people who are not native speakers.

One big problem voice assistant applications have is that retention drops much faster than, for example, in mobile apps. Users don't see an icon on their home screen reminding them to use the application; they just need to remember to open it. But if the application solves some important problem, the user will probably remember it every time they have that problem.

Voice assistants are now racing with each other to get the biggest market share, because in a few years there might be only a few winners getting all the users thanks to the network effect. As developers and as users we can benefit from this race. The devices are sold at low prices because it's more important for the device makers to build the network and gather data than to make a profit in the short term. There are all kinds of competitions and rewards for people who create content for voice assistants. There was a campaign on Alexa where all developers got a free Echo Dot when they created their first skill.

Popular Voice Assistants and Thoughts About Developing Content

Alexa (Amazon)

The frontend is developed using the Alexa Developer Console. The backend can be built on any server, but they offer a simple way to connect with AWS Lambda functions. Their ASK SDK library is available for Python and Node.js.

ASK CLI is a console tool they created to make it easier to code locally. They offer an online code editor, but it's far from perfect, and I don't think anyone working on a bigger project wants to use it.

Testing is still primitive and slow because you need to deploy the code and then go to the browser. If changes are made in the browser, it's still really slow to test them because changing tabs takes a few seconds (and no, I don't have a slow connection). Local unit tests are probably the best way to test things quickly, but ASK doesn't support them that easily. As all developers know, tests should run very quickly, because otherwise developers won't run them often, which leads to finding mistakes late.
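One way around the slow deploy-and-test cycle (a sketch, not an official ASK pattern) is to keep the skill's business logic in plain functions with no SDK dependency. The thin request handler then just calls these functions, and the logic itself can be unit tested locally in milliseconds. The function below is a hypothetical example of such SDK-free logic.

```python
def greeting_speech(name=None):
    """Pure business logic: build the text the skill will speak.

    No ask_sdk imports here, so this runs (and is testable) anywhere.
    The SDK handler would only extract the slot and call this function.
    """
    if name:
        return f"Hello, {name}! Welcome back."
    return "Hello! What's your name?"

# A local test needs no deployment and no browser tab:
assert greeting_speech("Ada") == "Hello, Ada! Welcome back."
assert greeting_speech() == "Hello! What's your name?"
```

The deployed handler becomes a thin shell around functions like this, so most bugs are caught before any deploy.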

Google Assistant

The frontend uses DialogFlow, but the projects are made and published on the Google Assistant developer pages. The backend can run on any server, but they recommend Firebase Cloud Functions and offer a simple way to integrate them.

There is a simple code editor in DialogFlow that can be used, but it's easy to set up a local coding environment. Firebase Cloud Functions support only JavaScript, but the Dialogflow library is available for C#, Go, Java, Node.js, PHP, Python, and Ruby, so it's possible to use those languages by running some other server.
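As a sketch of what that backend does: a DialogFlow (v2) webhook receives the already-parsed intent and parameters in JSON and replies with JSON, where `fulfillmentText` carries the spoken answer. The intent and parameter names below are hypothetical; only the request/response field names come from DialogFlow's webhook format.

```python
import json

def fulfillment(request_body):
    """Build a minimal DialogFlow webhook response for a parsed request.

    DialogFlow sends the matched intent under queryResult.intent.displayName
    and the filled slots under queryResult.parameters.
    """
    intent = request_body["queryResult"]["intent"]["displayName"]
    params = request_body["queryResult"].get("parameters", {})
    if intent == "GetWeatherIntent":  # hypothetical intent name
        day = params.get("day", "today")
        text = f"Here is the weather for {day}."
    else:
        text = "Sorry, I can't help with that yet."
    return json.dumps({"fulfillmentText": text})
```

Because the handler is a plain function over dicts, it runs on any server, not just Firebase Cloud Functions.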

Testing is as bad as on Alexa. Saving changes and testing online is maybe a slightly better experience than on Alexa, but the library doesn't easily support local unit tests.

Bixby (Samsung)

Bixby definitely has the best tooling (excluding the programming language) for developing capsules. Bixby Studio offers a local code editor where you can also test capsules quickly. The whole experience just feels smooth.

The problem they have is that they created their own programming language for building the frontend. They say on their website that the language is designed to be easy to write both automatically and manually. I personally felt that the language was hard to understand and the names weren't that intuitive. The whole structure is pretty complex even for a simple "Hello World" program.

It's a bit like not having Xcode to develop iOS apps: you need to code the UI without seeing live how things change. I really hope that they soon create some tool to remove this step, because it's complicated for no reason.

The backend can be developed in the same editor using JavaScript, but of course you can use a separate server with any programming language and then just use JavaScript to call those APIs.

Having a separate tool for developing capsules is definitely a really good idea, even though not everyone will like it, because changing computers and access rights might cause problems. In my opinion this was the best experience; if only the language were a little bit easier, or they supported some existing language.

As a platform, I'm most interested in Bixby because it works on Samsung's smartwatches. I personally believe the smartwatch will be a really popular device for voice assistants, because its screen is too small for a lot of things, but compared to, for example, smart headphones it still offers a screen to show information.

Voiceflow (tool)

This is a tool that can be used to develop Google Assistant Actions and Alexa Skills. It's a bit like Scratch.

It's promoted as a no-code solution, which scares coders away because that often means a lot of limitations and slower development, but that's not true in this case. The tool is designed well, and it's relatively easy to create voice products with it.

Testing is similar to the Alexa Developer Console, where you need to go to a separate tab to test. It's also possible to deploy the project to an Alexa or Google Assistant device, but that's more complex and takes some time. Because there is no code, unit testing is impossible, and not having that option is definitely a big fat minus. When changing something, the project has to be tested manually, which takes some time.

Other Voice Assistants

Siri (Apple)

Everyone knows Siri as Apple's voice assistant. It supports third-party content, but not the same way Alexa and the others do. It's possible to add features to Siri either using the Shortcuts app on iOS or by adding Siri support to an existing app. This means it's not possible to create fully custom, independent, voice-only applications for Siri without requiring the user to install an app.

Cortana (Microsoft)

People might have seen it on their Windows computers. It used to offer third-party content, but just recently they pivoted to focus on enterprise usage. That means they focus on productivity-related features, dropping support for things like playing music and third-party content. It's a different approach, but I'm not sure it's a good one. Platforms are the reason products succeed nowadays, so this kind of approach is a little bit weird.

Chinese Voice Assistants

In China the voice assistant market is booming. The market is younger than in the United States, but smart speaker sales are already bigger. Baidu, for example, is already selling more smart speakers than Google [1]. People outside China probably don't hear much about them because they only support Chinese. Having tested these, it's fair to say that the technology is at least not far ahead of the English voice assistants: the speakers seem to sometimes have trouble understanding, and there are no extra features compared to the English alternatives. But that's not a surprise, because the English voice assistants got a head start. It will be interesting to see how the market evolves, as there are many more users in China than in English-speaking countries.

AliGenie (Alibaba), DuerOS (Baidu), Xiaowei (Tencent), Xiao Ai (Xiaomi), and Xiaoyi (Huawei) are some of the biggest players in the Chinese voice assistant market. Most of them are pretty new but betting big to win the market.

[1] https://www.ft.com/content/9d923d82-e37d-11e9-9743-db5a370481bc
