The challenge of unconstrained voice queries

Petya Petkova

Automatic speech recognition (ASR) is an essential component of any voice AI. It hears what a user is saying and, depending on the approach, may also be involved in understanding it. At a very high level, a typical ASR system consists of statistical models: an acoustic model that converts the acoustic signal into phonetic representations, and a language model that turns those representations into meaningful words, phrases and sentences (Fig. 1). Since a statistical model can only learn from the data it has been trained on, one of the biggest challenges of unconstrained voice queries is training these models on anything that speakers may say.

Fig. 1. High-level ASR architecture.
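To make this decomposition more concrete, below is a minimal sketch of how a decoder might combine the two models. The candidate transcriptions and all scores are invented for illustration: the acoustic model scores how well each candidate matches the audio, the language model scores how plausible the word sequence is on its own, and the decoder picks the candidate that maximises the combined score.

```python
# Toy acoustic-model scores: log P(audio | words) for a few candidate
# transcriptions of the same utterance. In a real system these come
# from the acoustic model; the numbers here are invented.
acoustic_scores = {
    "show me dark blue cotton hoodies": -12.1,
    "show me dark blew cotton hoodies": -11.8,  # acoustically very similar
    "show we dark blue cotton who dies": -12.0,
}

# Toy language-model scores: log P(words). A real n-gram or neural LM
# would assign the implausible word sequences much lower probability.
lm_scores = {
    "show me dark blue cotton hoodies": -8.5,
    "show me dark blew cotton hoodies": -14.2,
    "show we dark blue cotton who dies": -16.9,
}

LM_WEIGHT = 1.0  # decoders typically weight the LM relative to the acoustic model

def combined_score(sentence):
    """log P(audio | words) + weight * log P(words)."""
    return acoustic_scores[sentence] + LM_WEIGHT * lm_scores[sentence]

best = max(acoustic_scores, key=combined_score)
print(best)  # -> "show me dark blue cotton hoodies"
```

Note how the language model rescues the decoder here: the acoustically best candidate contains "blew", but the language model knows that "dark blue cotton" is far more plausible. This is exactly why the language model needs training data that covers what speakers actually say.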

Human speech and the challenges it poses to automatic speech recognition

Below are some characteristics of human speech which make the task of automatic speech recognition difficult:

    • People are creative, and this is strongly expressed when they speak: they coin new words and phrases on the spot, which leads to lexical and grammatical variability and unpredictability.
    • When they speak, people very often rely on situational and cultural context and on background knowledge, and as a result they often intentionally skip syntactic and semantic information that would aid automatic speech recognition.
    • Speech is fast and errors are common. These errors are sometimes signalled by redundant expressions, interjections, repetitions or silences, none of which can be predicted.
    • People use various pronunciations depending on their accent and regional language variety, on their social and cultural background, or for purposes of emphasis or irony.
    • Speech is acoustically and prosodically variable and dynamic.

Corpus generation for language model training

Training an ASR system for a specific domain and purpose still requires solving these problems, but from a linguistic point of view it is far more feasible to predict the lexicon, syntax and semantics of the user queries.

Consider an online clothes shop as an example. User requests could look like:

“I’m looking to buy a red summer dress.”
“Show me dark blue cotton hoodies.”
“What medium sized Levi jeans do you have available?”

Given a shop’s database of products, brands, sizes, colours, designs, etc., we need a corpus of a few million natural-sounding sentences to train our language model. There are three approaches to creating such a corpus.

The manual approach

We can manually collect millions of domain-specific user queries from a large number of native speakers. This process is time-consuming and expensive, and in the end we are still not guaranteed an exhaustive list of everything people normally say in a given situation.

The semi-automatic approach 

The semi-automatic approach is based on syntactic and semantic analysis of a smaller manually collected dataset. The goal of this analysis is to find underlying repetitive syntactic and semantic structures and to produce a set of abstract patterns. These abstract patterns can be used to automatically generate a large training corpus.

For example, the sentences above come from patterns like:

<SEARCH_EXPRESSION> <ARTICLE> <COLOUR> <SEASON> <PRODUCT>

<SEARCH_EXPRESSION>  <COLOUR> <MATERIAL> <PRODUCT>

<QUESTION_WORD> <SIZE> <SIZE_EXPRESSION> <BRAND> <PRODUCT> <AVAILABILITY_EXPRESSION>

Each of these placeholders stands for a set of possible words and expressions that, when put together, form natural-sounding sentences, as the sketch below illustrates.
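Here is a minimal sketch of how such patterns could be expanded into training sentences. The placeholder vocabularies below are invented examples; in practice they would be populated from the shop’s database and the manually collected queries.

```python
import itertools

# Invented example vocabularies; in practice these come from the shop's
# product database (colours, materials, products) and collected queries.
slots = {
    "SEARCH_EXPRESSION": ["i'm looking to buy", "show me", "i want"],
    "ARTICLE": ["a"],
    "COLOUR": ["red", "dark blue"],
    "SEASON": ["summer", "winter"],
    "MATERIAL": ["cotton", "denim"],
    "PRODUCT": ["dress", "hoodies", "jeans"],
}

patterns = [
    ["SEARCH_EXPRESSION", "ARTICLE", "COLOUR", "SEASON", "PRODUCT"],
    ["SEARCH_EXPRESSION", "COLOUR", "MATERIAL", "PRODUCT"],
]

def expand(pattern):
    """Yield every sentence the pattern can generate."""
    for words in itertools.product(*(slots[slot] for slot in pattern)):
        yield " ".join(words)

corpus = [sentence for pattern in patterns for sentence in expand(pattern)]
print(len(corpus))  # 72 sentences from just two patterns
print(corpus[0])    # "i'm looking to buy a red summer dress"
```

Because each pattern multiplies out the sizes of its placeholder vocabularies, even a modest set of hand-analysed patterns over realistic product databases quickly reaches the few million sentences the language model needs.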

Applying this approach is much faster and requires far less human effort than the manual approach.

The automatic approach

Here, the task of discovering patterns and underlying structures in the data is delegated to the computer. A computer can find abstract patterns in a small native-speaker dataset using unsupervised machine learning methods. Depending on what datasets are available, domain adaptation techniques can also be applied to generate a text corpus in our target domain from a similar domain where data already exists.
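As one illustration of what such unsupervised pattern discovery might look like, the sketch below clusters words by the contexts they appear in, in the spirit of distributional word clustering. This is a deliberately simplified assumption, not a description of a production system: words that fill the same slot, such as colours or materials, tend to share contexts and so land in the same cluster, and each cluster can then be abstracted into a placeholder.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

# A small, invented set of collected queries.
queries = [
    "show me red cotton hoodies",
    "show me blue denim jeans",
    "show me green cotton dresses",
    "i want red denim jackets",
    "i want blue cotton hoodies",
    "i want green denim jeans",
]

# Represent each word by the words immediately around it: words that
# fill the same slot tend to appear in the same left/right contexts.
contexts = defaultdict(lambda: defaultdict(int))
for query in queries:
    words = query.split()
    for i, word in enumerate(words):
        if i > 0:
            contexts[word]["L=" + words[i - 1]] += 1
        if i < len(words) - 1:
            contexts[word]["R=" + words[i + 1]] += 1

vocab = sorted(contexts)
vectors = DictVectorizer(sparse=False).fit_transform(contexts[w] for w in vocab)

# Cluster the context vectors; each cluster is a candidate placeholder.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(vectors)

clusters = defaultdict(list)
for word, label in zip(vocab, labels):
    clusters[label].append(word)
for label, words in sorted(clusters.items()):
    print(label, words)  # colours, materials, products tend to group together
```

With only six toy queries the clusters are rough, but the principle scales: given a realistic dataset, the induced word classes play the role of the hand-written placeholders from the semi-automatic approach.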

The automatic approach requires suitable software and hardware resources and a certain amount of experimentation, but it largely eliminates the manual collection effort and can make use of existing data.

At Voysis, we are particularly interested in automatic approaches to generating high-quality, natural-sounding text corpora.

We are excited to be working at the cutting edge of AI and are always looking to discuss and share our views and experience. We would be thrilled to hear your opinions!
