
Tuesday, January 3, 2017

Google voice search: faster and more accurate



Back in 2012, we announced that Google voice search had taken a new turn by adopting Deep Neural Networks (DNNs) as the core technology used to model the sounds of a language. These replaced the 30-year-old standard in the industry: the Gaussian Mixture Model (GMM). DNNs were better able to assess which sound a user is producing at every instant in time, and with this they delivered greatly increased speech recognition accuracy.

Today, we’re happy to announce we built even better neural network acoustic models using Connectionist Temporal Classification (CTC) and sequence discriminative training techniques. These models are a special extension of recurrent neural networks (RNNs) that are more accurate, especially in noisy environments, and they are blazingly fast!

In a traditional speech recognizer, the waveform spoken by a user is split into small consecutive slices or “frames” of 10 milliseconds of audio. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model such as a DNN that outputs a probability distribution over all the phonemes (sounds) in the model. A Hidden Markov Model (HMM) helps to impose some temporal structure on this sequence of probability distributions. This is then combined with other knowledge sources such as a Pronunciation Model that links sequences of sounds to valid words in the target language and a Language Model that expresses how likely given word sequences are in that language. The recognizer then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example - /m j u z i @ m/ in phonetic notation - it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recognizer doesn’t care where exactly that transition happens: All it cares about is that these sounds were spoken.
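
To make the framing step concrete, here is a minimal sketch (my own illustration, not Google's code) of slicing a waveform into consecutive 10 millisecond frames; the 16 kHz sample rate is an assumption:

#include <vector>

// Split a waveform into consecutive 10 ms slices, as described above.
std::vector<std::vector<float>> SplitIntoFrames(const std::vector<float>& waveform,
                                                int sample_rate = 16000) {
  const int frame_size = sample_rate / 100;  // 10 ms of samples per frame
  std::vector<std::vector<float>> frames;
  for (std::size_t start = 0; start + frame_size <= waveform.size();
       start += frame_size) {
    frames.emplace_back(waveform.begin() + start,
                        waveform.begin() + start + frame_size);
  }
  return frames;
}

Each frame would then be converted to a feature vector and scored by the acoustic model.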

Our improved acoustic models rely on Recurrent Neural Networks (RNNs). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. Try saying it out loud - “museum” - it flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly.

The next step was to train the models to recognize phonemes in an utterance without requiring them to make a prediction for each time instant. With Connectionist Temporal Classification, the models are trained to output a sequence of “spikes” that reveals the sequence of sounds in the waveform. They can place these spikes anywhere in time, as long as the sequence of sounds is correct.
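
The collapsing rule that makes this work can be sketched in a few lines (an illustration of the idea, not the production implementation): merge repeated labels and drop the "blank" frames, so only the order of the spikes matters.

#include <string>
#include <vector>

// Collapse a per-frame CTC label sequence into a phoneme sequence:
// repeated labels merge into one, and blanks are discarded.
std::vector<std::string> CollapseCtcOutput(
    const std::vector<std::string>& frame_labels) {
  const std::string kBlank = "<blank>";
  std::vector<std::string> phonemes;
  std::string previous = kBlank;
  for (const std::string& label : frame_labels) {
    if (label != kBlank && label != previous) {
      phonemes.push_back(label);  // a new spike: emit this phoneme once
    }
    previous = label;
  }
  return phonemes;
}

For “museum”, frame labels like {m, m, <blank>, j, u, u, <blank>, z, ...} collapse to {m, j, u, z, ...} no matter where each transition falls.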

The tricky part, though, was how to make this happen in real time. After many iterations, we managed to train streaming, unidirectional models that consume the incoming audio in larger chunks than conventional models but do actual computations less often. With this, we drastically reduced computations and made the recognizer much faster. We also added artificial noise and reverberation to the training data, making the recognizer more robust to ambient noise. You can watch a model learning a sentence here.
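
One way to picture the "larger chunks" idea is to stack several consecutive feature frames into a single network input and then advance by a stride, so the model runs only once every few frames. The sketch below illustrates this; the stack and stride sizes are assumptions for illustration:

#include <vector>

// Stack groups of frames and advance by a stride so the acoustic model
// is evaluated far less often than once per 10 ms frame.
std::vector<std::vector<float>> StackAndStride(
    const std::vector<std::vector<float>>& frames, int stack = 8, int stride = 3) {
  std::vector<std::vector<float>> inputs;
  for (std::size_t i = 0; i + stack <= frames.size(); i += stride) {
    std::vector<float> chunk;
    for (int j = 0; j < stack; ++j) {
      chunk.insert(chunk.end(), frames[i + j].begin(), frames[i + j].end());
    }
    inputs.push_back(std::move(chunk));  // one network evaluation per chunk
  }
  return inputs;
}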

We now had a faster and more accurate acoustic model and were excited to launch it on real voice traffic. However, we had to solve another problem - the model was delaying its phoneme predictions by about 300 milliseconds: it had just learned it could make better predictions by listening further ahead in the speech signal! This was smart, but it would mean extra latency for our users, which was not acceptable. We solved this problem by training the model to output phoneme predictions much closer to the ground-truth timing of the speech.
The CTC recognizer outputs spikes as it identifies various phonetic units (in various colors) in the input speech signal. The x-axis shows the acoustic input timing for phonemes and the y-axis shows the posterior probabilities as predicted by the neural network. The dotted line shows where the model chooses not to output a phoneme.
We are happy to announce that our new acoustic models are now used for voice searches and commands in the Google app (on Android and iOS), and for dictation on Android devices. In addition to requiring much lower computational resources, the new models are more accurate, robust to noise, and faster to respond to voice search queries - so give it a try, and happy (voice) searching!

Friday, December 2, 2016

The neural networks behind Google Voice transcription



Over the past several years, deep learning has shown remarkable success on some of the world’s most difficult computer science challenges, from image classification and captioning to translation to model visualization techniques. Recently we announced improvements to Google Voice transcription using Long Short-term Memory Recurrent Neural Networks (LSTM RNNs)—yet another place neural networks are improving useful services. We thought we’d give a little more detail on how we did this.

Since it launched in 2009, Google Voice transcription had used Gaussian Mixture Model (GMM) acoustic models, the state of the art in speech recognition for 30+ years. Sophisticated techniques like adapting the models to the speaker's voice augmented this relatively simple modeling method.

Then around 2012, Deep Neural Networks (DNNs) revolutionized the field of speech recognition. These multi-layer networks distinguish sounds better than GMMs by using “discriminative training,” differentiating phonetic units instead of modeling each one independently.

But things really improved rapidly with Recurrent Neural Networks (RNNs), and especially LSTM RNNs, first launched in Android’s speech recognizer in May 2012. Compared to DNNs, LSTM RNNs have additional recurrent connections and memory cells that allow them to “remember” the data they’ve seen so far—much as you interpret the words you hear based on previous words in a sentence.

By then, Google’s old voicemail system, still using GMMs, was far behind the new state of the art. So we decided to rebuild it from scratch, taking advantage of the successes demonstrated by LSTM RNNs. But there were some challenges.
An LSTM memory cell, showing the gating mechanisms that allow it to store
and communicate information. Image credit: Alex Graves
There’s more to speech recognition than recognizing individual sounds in the audio: sequences of sounds need to match existing words, and sequences of words should make sense in the language. This is called “language modeling.” Language models are typically trained over very large corpora of text, often orders of magnitude larger than the acoustic data. It’s easy to find lots of text, but not so easy to find sources that match naturally spoken sentences. Shakespeare’s plays in 17th-century English won’t help on voicemails.

We decided to retrain both the acoustic and language models, and to do so using existing voicemails. We already had a small set of voicemails users had donated for research purposes and that we could transcribe for training and testing, but we needed much more data to retrain the language models. So we asked our users to donate their voicemails in bulk, with the assurance that the messages wouldn’t be looked at or listened to by anyone—only to be used by computers running machine learning algorithms. But how does one train models from data that’s never been human-validated or hand-transcribed?

We couldn't just use our old transcriptions, because they were already tainted with recognition errors: garbage in, garbage out. Instead, we developed a delicate iterative pipeline to retrain the models. Using improved acoustic models, we could recognize existing voicemails offline to get newer, better transcriptions the language models could be retrained on, and with better language models we could re-recognize the same data and repeat the process. Step by step, the recognition error rate dropped, finally settling at roughly half what it was with the original system! That was an excellent surprise.

There were other (not so positive) surprises too. For example, sometimes the recognizer would skip entire audio segments; it felt as if it was falling asleep and waking up a few seconds later. It turned out that the acoustic model would occasionally get into a “bad state” where it would think the user was not speaking anymore and what it heard was just noise, so it stopped outputting words. When we retrained on that same data, the transcripts told the model that those spoken sounds should indeed be ignored, reinforcing the behavior even further. It took careful tuning to get the recognizer out of that state of mind.

It was also tough to get punctuation right. The old system relied on hand-crafted rules or “grammars,” which, by design, can't easily take textual context into account. For example, in an early test our algorithms transcribed the audio “I got the message you left me” as “I got the message. You left me.” To tackle this, we again tapped into neural networks, teaching an LSTM to insert punctuation at the right spots. It's still not perfect, but we're continually working on ways to improve our accuracy.

In speech recognition as in many other complex services, neural networks are rapidly replacing previous technologies. There’s always room for improvement of course, and we’re already working on new types of networks that show even more promise!

Sunday, November 20, 2016

Voice Control on the Raspberry Pi

Note: Updated version here:
http://stevenhickson.blogspot.com/2013/05/voice-command-v20-for-raspberry-pi.html

I'm finally releasing my Raspberry Pi voice control software.
The beauty of this is that you can use it and customize it without programming anything.
This will work on any Linux machine, but I find it uniquely suited for the Raspberry Pi, and that's what I developed it for.
You just type voicecommand -e to edit the configuration file and add any speech and the corresponding command. You can see all of the nifty commands it can recognize just with my config file in the video. I use all of these pretty often. It is very flexible and works with lots of other programs.

Watch this video to really get a feel for the program:




The commands should be in the format speech=command. Ex:
music=xterm -e pianobar
Doctor Who=playvideo -r -f Doctor Who

Unfortunately, the voice control is case-sensitive right now, but you can test the input by running
speech-recog.sh
on your machine to see what the speech recognition will give you. It uses Google's API and is really good at recognizing what you say. I used to use espeak instead of festival for the text-to-speech because of the speed difference. However, after some very helpful comments from people, I've switched to using Google's TTS API, which is much more pleasant to listen to. The video above is still using espeak but the newest code does not.

I hope you will try this out and make some cool things with it. Instructions to install are below.
I'm in the process of setting up a repository, so until then, here are the instructions (do this in the terminal):


sudo apt-get install git-core
git clone git://github.com/StevenHickson/PiAUISuite.git
cd PiAUISuite/Install/
./InstallAUISuite.sh

Update Instructions 

cd PiAUISuite
git pull
cd Install
sudo ./UpdateAUISuite.sh


Here are the dependencies required to run and build:
sudo apt-get install libboost1.50-dev libboost-regex1.50-dev youtube-dl axel curl xterm libcurl4-gnutls-dev mpg123 flac sox

**Note: If the file .commands.conf doesn't exist in your home directory and you have an old version of the code, the program will exit. You should either grab the newest code from GitHub or create the file yourself.

**To make it listen for longer, edit the file /usr/bin/speech-recog.sh and change -f cd -t wav -d 3 to -f cd -t wav -d #, where # is how many seconds it should listen.

Consider donating to further my tinkering



Friday, October 14, 2016

Voice Command v3.0 for the Raspberry Pi

Voicecommand v3.0 Changes


I've made some big changes, and there are even bigger things in the works. Here is a small list. I've had some help from a couple of committed users who have come up with some good new ideas, which is awesome.


  • There is now a ~ option that finds the word anywhere in a command. For instance, ~music==pianobar will work if you say: let's hear some music, play music, or music.
  • !filler is now a string so you can set it manually. If you set it to 0, it will be empty, and if you set it to 1, it will be FILLER FILL for compatibility.
  • Example scripts have been added in the Misc folder for you to play with. These can send and receive emails and text messages, as well as post to Facebook, all using only your voice.
  • Flags can now overwrite the config options and can be reversed by following them with a 0 or enforced by following them with a 1. Ex: if !continuous==1 is in your config file, you can force it to run only once with voicecommand -c0.
  • The commands and keywords are now case insensitive. So no tricky case matching.
  • Multiple language support has been added. This is based on your country code which I think you can find here (plus en_uk and en_us). Look up your country code and use that. Ex. For US: !language==en_us, for Spain !language==es, for Germany !language==de.
  • You can set a Wolfram Alpha API and maxResponse (the number of branches) like !api==XXXXXX-XXXXXXXXXX and !maxResponse==3. This will give you better answers. You can sign up for a Wolfram Alpha API on their website for free.
  • Logging has been enabled: output now goes to /dev/shm/voice.log instead of /dev/null.
  • The need for tts-nofill has been removed! Now tts doesn't use any filler unless you send it yourself.
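
To tie these together, here's what a v3.0 config file using several of the new options might look like (the API key and commands are placeholder values):

# sample ~/.commands.conf (placeholder values)
!keyword==pi
!continuous==1
!language==en_us
!api==XXXXXX-XXXXXXXXXX
!maxResponse==3
~music==pianobar
Doctor Who==playvideo -r -f Doctor Who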

New Install and Update videos have been added. They can be found here:
http://stevenhickson.blogspot.com/2013/06/installing-and-updating-piauisuite-and.html

Consider donating to further my tinkering since I do all this and help people out for free.
As always, the updated man page can be found below:

voicecommand

Section: voicecommand man page (8)
Updated: 13 May 2013

NAME

voicecommand - Listen to user defined strings and run the corresponding command   

SYNOPSIS

voicecommand [OPTIONS]...  

DESCRIPTION

voicecommand was developed for the Raspberry Pi but will work on any Linux system with a microphone attached. It is a crude program that uses basic comparisons to determine if your voice command fits a format specified in a config file; if it does, it runs the corresponding Linux command. It supports auto-completion and variables, as well as command verification, a continuous mode, and other options. For help/comments/questions, feel free to e-mail me at help@stevenhickson.com. I answer sporadically but do eventually respond.


Note: All of the flags that turn something on or off can be reversed and overwritten by following them with a 1 or 0. So, for instance, if you have !continuous==1 in your config file, you can run voicecommand -c0 to turn continuous off.
 

OPTIONS

-?
Same as -h
-b
Turns off the FILL audio. The FILL audio exists because the Raspberry Pi (or mine at least) cuts off the first few seconds of audio; this flag turns that feature off. You should only be concerned with this if you hear FILL before everything it says.
-c
Makes voicecommand run in continuous mode, where it will keep listening over and over again.
-d
Sets the duration for listening to the audio for voice commands
-D
Sets the audio hardware. The default is plughw:1,0.
-e
Edits the voicecommand config file.
The format is voice==command
You can use any character except for newlines or ==
If the voice starts with ~, the program looks for the keyword anywhere. Ex: ~weather would pick up on weather or what's the weather
You can use ... at the end of the command to specify that everything after the given keyword should be options to the command.
Ex: play==playvideo ...
This means that if you say "play Futurama", it will run the command playvideo Futurama
You can use $# (where # is any number 1 to 9) to represent a variable. These should go in order from 1 to 9
Ex: $1 season $2 episode $3==playvideo -s $2 -e $3 $1
This means if you say game of thrones season 1 episode 2, it will run playvideo with the -s flag as 1, the -e flag as 2, and the main argument as game of thrones, i.e. playvideo -s 1 -e 2 game of thrones
Because of these options, it is important that the arguments range from most strict to least strict.
This means that ~ arguments should probably be at the end.
You can also put comments if the line starts with # and special options if the line starts with a !
Default options are shown as follows:
!keyword==pi,!verify==1,!continuous==1,!quiet==0,!ignore==0,!thresh==0.7,!maxResponse==-1
!api==BLANK,!filler==FILLER FILL,!response==Yes Sir?,!duration==3,!com_dur==2,!hardware==plughw:1,0,!language==en_us
Keyword, filler, and response accept strings. verify, continuous, quiet, and ignore accept 1 or 0 (true or false, respectively). thresh accepts a floating point number. These allow you to set some of the flags as permanent options (if these are set, you can overwrite them with the flag options).
You can set a WolframAlpha API key and maxResponse (the number of branches) like !api==XXXXXX-XXXXXXXXXX and !maxResponse==3.
You can now customize the language support for speech recognition and some text to speech with the language flag. Look up your country code and use that. Ex. For US: !language==en_us, for Spain !language==es, for Germany !language==de.
-f /my-location/config-file
This allows you to load a different config file located in a different spot. The default one is in your home directory and is ~/.commands.conf
The config file must be formatted the same way.
-h
Shows this man page.
-i
Sets the ignore mode. When this flag is activated, if a command is not in the config file, nothing happens. The default behavior is to try to find an answer or response to that question and then speak it. This turns off that behavior.
-I string
Sets the forced input mode. This allows you to test it without the microphone or get it to parse typed information. It will not run in continuous mode with this.
-k word
Sets the keyword. The default is pi. If this flag is set, the verify and continuous flags are also set since this is only checked during those two modes.
      Ex. voicecommand -c -v -k Jarvis

-l
Sets the duration for listening to the audio for the command keyword. This is different than the -d flag that listens for the voice commands.
-s
Runs a setup operation that attempts to set all of the config options in the config file so that voicecommand works properly
-r word
Sets the response. The default is "Yes Sir?" (for version 1.0, it was "Ready?"). If the response is more than one word, it should be put in quotes; otherwise it doesn't need to be.
      Ex. voicecommand -r Ready?

-t #
Sets the threshold for volume to determine if the keyword was spoken. This should be a floating point number. The default value is 0.7 which works well with the Logitech C310 camera/mic from about 6 feet away.
Ex. voicecommand -t 1.2
-p
Sets passthrough mode on so that instead of running the commands, it just prints them. This is going to be used for the XBMC plugin and Android app.
-q
Sets quiet mode on so that voicecommand never speaks through the audio output. It still prints everything but doesn't ever respond. This includes the keyword response.
-v
Makes voicecommand verify the keyword. This only happens in continuous mode so if this flag is set, the continuous flag will be set as well. The default mode is to not verify. When voicecommand hears any sound above the threshold, it says the response then listens for a command. The default keyword is pi. When the verify flag is set, after the threshold is met, voicecommand verifies that the keyword was spoken.

AUTHOR

Steven Hickson (help@stevenhickson.com)  

BUGS

No known bugs. To report bugs, send a clear description to help@stevenhickson.com. Since this program is fairly crude, user typos could cause crashes/failed responses. Please read the man page thoroughly before submitting a bug.

COPYRIGHT

Copyright © 2013 Steven Hickson. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>. This is free software: you are free to change and redistribute it as long as you give credit to the author and include this license. There is NO WARRANTY, to the extent permitted by law.  

HISTORY

This is the second major version of this program  

SEE ALSO

http://stevenhickson.blogspot.com/



Saturday, October 1, 2016

Voice Command v2.0 for the Raspberry Pi

Voice Command 2.0 differences

Note: Updated version here
http://stevenhickson.blogspot.com/2013/06/voice-command-v30-for-raspberry-pi.html

I've made some big changes since my last post on voice control with the Raspberry Pi.
You can now verify the keyword, change the keyword, change the response, put it in quiet mode to not talk to you, and put it in ignore mode to not try to answer questions not in your config file.
The config file format has also been changed from voice=command to voice==command, comments have been allowed in the config file by starting a line with #, and special settings can be done by starting a line with !.
I've updated the TTS from espeak to Google's API since it sounds a lot better.
Finally, I've made an update script in the Install folder, so you don't have to reinstall every time new changes get pushed out to GitHub. All of the source code is at:
https://github.com/StevenHickson/PiAUISuite

Video:

Here's a video demonstrating the new changes:


Special Options


The default special options are as follows:
!keyword==pi
!verify==1
!continuous==1
!quiet==0
!ignore==0
!filler==1
!thresh==0.7
!response==Yes sir?

response and keyword can be any string.
verify, continuous, quiet, filler, and ignore can be 1 or 0 (true or false respectively).
thresh can be any floating point number to set the appropriate volume.

Install Instructions

(this requires git)

sudo apt-get install git-core
git clone git://github.com/StevenHickson/PiAUISuite.git
cd PiAUISuite/Install/
./InstallAUISuite.sh


Update Instructions 

cd PiAUISuite
git pull
cd Install
sudo ./UpdateAUISuite.sh


Consider donating to further my tinkering



I've also created a man page for voicecommand. You can access it with man voicecommand. It is shown below:

voicecommand

Section: voicecommand man page (8)
Updated: 13 May 2013

NAME

voicecommand - Listen to user defined strings and run the corresponding command  

SYNOPSIS

voicecommand [OPTIONS]...  

DESCRIPTION

voicecommand was developed for the Raspberry Pi but will work on any Linux system with a microphone attached. It is a crude program that uses basic comparisons to determine if your voice command fits a format specified in a config file; if it does, it runs the corresponding Linux command. It supports auto-completion and variables, as well as command verification, a continuous mode, and other options. For help/comments/questions, feel free to e-mail me at me@stevenhickson.com. I answer sporadically but do eventually respond.
 

OPTIONS

-?
Same as -h
-b
Turns off the FILL audio. The FILL audio exists because the Raspberry Pi (or mine at least) cuts off the first few seconds of audio; this flag turns that feature off. You should only be concerned with this if you hear FILL before everything it says.
-c
Makes voicecommand run in continuous mode, where it will keep listening over and over again.
-e
Edits the voicecommand config file.
The format is voice==command
You can use any character except for newlines or ==
You can also put comments if the line starts with # and special options if the line starts with a !
Default options are shown as follows:
!keyword==pi,!verify==1,!continuous==1,!quiet==0,!ignore==0,!thresh==0.7,!response==Yes sir?
keyword and response accept strings. verify, continuous, quiet, and ignore accept 1 or 0 (true or false, respectively). thresh accepts a floating point number. These allow you to set some of the flags as permanent options (though the flags can overwrite them temporarily).
-f /my-location/config-file

This allows you to load a different config file located in a different spot. The default one is in your home directory and is ~/.commands.conf
The config file must be formatted the same way.
-h

Shows this man page.
-i

Sets the ignore mode. When this flag is activated, if a command is not in the config file, nothing happens. The default behavior is to try to find an answer or response to that question and then speak it. This turns off that behavior.
-k word

Sets the keyword. The default is pi. If this flag is set, the verify and continuous flags are also set since this is only checked during those two modes.
      Ex. voicecommand -c -v -k Jarvis

-r word

Sets the response. The default is "Yes Sir?" (for version 1.0, it was "Ready?"). If the response is more than one word, it should be put in quotes; otherwise it doesn't need to be.
      Ex. voicecommand -r Ready?

-t #

Sets the threshold for volume to determine if the keyword was spoken. This should be a floating point number. The default value is 0.7 which works well with the Logitech C310 camera/mic from about 6 feet away.
Ex. voicecommand -t 1.2
-q

Sets quiet mode on so that voicecommand never speaks through the audio output. It still prints everything but doesn't ever respond. This includes the keyword response.
-v

Makes voicecommand verify the keyword. This only happens in continuous mode so if this flag is set, the continuous flag will be set as well. The default mode is to not verify. When voicecommand hears any sound above the threshold, it says the response then listens for a command. The default keyword is pi. When the verify flag is set, after the threshold is met, voicecommand verifies that the keyword was spoken.

AUTHOR

Steven Hickson (me@stevenhickson.com)  

BUGS

No known bugs. To report bugs, send a clear description to me@stevenhickson.com. Since this program is fairly crude, user typos could cause crashes/failed responses. Please read the man page thoroughly before submitting a bug.

COPYRIGHT

Copyright © 2013 Steven Hickson. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>. This is free software: you are free to change and redistribute it as long as you give credit to the author and include this license. There is NO WARRANTY, to the extent permitted by law.  

HISTORY

This is the second major version of this program  

SEE ALSO

http://stevenhickson.blogspot.com/



Consider donating to further my tinkering

Thursday, September 29, 2016

Sawasdee ka, Voice Search




Typing on mobile devices can be difficult, especially when you're on the go. Google Voice Search gives you a fast, easy, and natural way to search by speaking your queries instead of typing them. In Thailand, Voice Search has been one of the most requested services, so we're excited to now offer users there the ability to speak queries in Thai, adding to the over 75 languages and accents in which you can talk to Google.

To power Voice Search, we teach computers to understand the sounds and words that build spoken language. We trained our speech recognizer to understand Thai by collecting speech samples from hundreds of volunteers in Bangkok, which enabled us to build this recognizer in just a fraction of the time it took to build other models. Our helpers were asked to read popular queries in their native tongue, in a variety of acoustic conditions such as in restaurants, out on busy streets, and inside cars.

Each new language often requires our research team to tackle unique challenges, and Thai was no exception:
  • Segmentation is a major challenge in Thai: the Thai script has no spaces between words, so it is harder to tell where one word ends and the next begins. We therefore created a Thai segmenter to help our system recognize words better. For example, ????? can be segmented to ??? ?? or ?? ???. We collected a large corpus of text and asked Thai speakers to manually annotate plausible segmentations. We then trained a sequence segmenter on this data, allowing it to generalize beyond the annotated examples.
  • Numbers are an important part of any language: the string “87” appears on a web page, and we need to know how people would say it. As with over 40 other languages, we included a number grammar for Thai that tells the system that “87” would be read as ??????????.
  • Thai users often mix English words with Thai, such as brand or artist names, in both spoken and written Thai, which adds complexity to our acoustic, lexicon, and segmentation models. We addressed this by introducing ‘code switching’, which allows Voice Search to recognize when different languages are being spoken interchangeably and adjust the phonetic transliteration accordingly.
  • Many Thai users leave out accents and tone markers when they search (e.g. ??????? instead of ???????? or ??????? instead of ????????), so we created a special algorithm to restore accents and tones, ensuring that our Thai users see properly formatted text in their search results in the majority of cases.

We’re particularly excited that Voice Search can help people find locally relevant information, ranging from travel directions to the nearest restaurant, without having to type long phrases in Thai.

Voice Search is available for Android devices running Jelly Bean and above. It will be available for older Android releases and iOS users soon.



Monday, August 29, 2016

Using the Google Voice C++ API

First of all, this obviously isn't an official Google API. I wrote it because I wanted one in C++ and couldn't find one that worked.

That being said, the code is on GitHub here (in the TextCommand folder) and is available under GPLv3 for anyone to use.
I know this could (maybe even should) have its own GitHub project, but since I use it for my home automation projects, it's included in the PiAUISuite. Note that the current gvapi binary is compiled for the Pi, so you will need to run make gvapi if you want to use it on any other Linux machine.

To include the C++ code in your own project, you just need to include gvoice.h in your file, and then you should be able to access the GoogleVoice class. An excellent example to look at is gvapi.cpp, since it shows how to interact with the class. You should only need to call Init and Login; after that you should be able to use any of the Google Voice functions.
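
As a rough sketch, a minimal program might look like the following (Init, Login, and CheckSMS are names mentioned in this post, but their exact signatures and return types here are my assumptions; treat gvapi.cpp as the authoritative example):

#include <iostream>
#include <string>
#include "gvoice.h"

int main() {
  GoogleVoice gv;
  gv.Init();                               // set up curl, cookie storage, etc.
  gv.Login("user@gmail.com", "password");  // placeholder credentials
  std::string messages = gv.CheckSMS();    // assumed to return unread SMS text
  std::cout << messages << std::endl;
  return 0;
}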

If you only want to use gtextcommand to send commands to your Pi, see here. If you want to send or check SMS without having to include the source in your code/project, just use the gvapi program. It can get contacts and send, receive, and delete SMS. It also comes with built-in spoof protection (this is included in GoogleVoice::CheckSMS).

Here's a quick video demo as an example of the kinds of things you can do with gvapi:


Using the gvapi program is fairly simple, and its man page is below:


gvapi


gvapi - Use the Google Voice API to send and receive SMS

SYNOPSIS

gvapi [OPTIONS]...  

DESCRIPTION

gvapi was compiled for use with home automation on the Raspberry Pi but will work on any Linux system with an internet connection. It uses curl and boost regex to log in, and it stores the cookie in /dev/shm. It only logs in once every 24 hours to save time. It supports a multitude of different options and parameters. For help/comments/questions, feel free to e-mail me at help@stevenhickson.com. I answer sporadically but do eventually respond.
 

OPTIONS

-?
Same as -h
-c
Checks your incoming text messages to see if you have anything unread. It will mark whatever it outputs as read unless you also use the -r flag. If you include a number with the -n flag, you can specify it to only check messages received from that number. You can also use the -k flag to specify that the message must start with a certain keyword.
-d
Sets debug mode on. You can also specify debug mode from 1 - 3, where 3 is the most verbose.
Ex: gvapi -c -d2
-h
Asks for some quick general use help for gvapi.
-i
Outputs the contacts of your Google Voice account in the form name==number.
-k KEYWORD
Sets a keyword that has to be in the beginning of the text message for it to be returned. This is useful as a password for parsing only certain messages.
-m MESSAGE
Sends the message you specify. This should be in quotes if it contains special characters or a space. The number also has to be specified with the -n flag; otherwise there is no recipient to send the message to.
Ex: gvapi -n +15551234 -m Hello
-n NUMBER
Can be used with the -c flag to check messages from a certain number, or with the -m flag to send a message. The number can be given without the country code, with just the country code, or with a plus sign and the country code. In the first case, it is interpreted as a US number.
Ex: gvapi -n 5551234 -c
-p PASSWORD
Can be used with the -u flag to log in manually. If not specified, the program uses the default username and password specified in ~/.gv
-r
Receives and deletes SMS messages. Can be used with the -c flag or without. It is the same as the -c flag, but instead of just marking the messages as read, it deletes them.
-u USERNAME
Can be used with the -p flag to log in manually. See the -p flag for more details. Cannot be used without the -p flag.
-v
Outputs version and creator information.

AUTHOR

Steven Hickson (help@stevenhickson.com)  

BUGS

No known bugs. To report bugs, send a clear description to help@stevenhickson.com. Since this program is fairly crude, user typos could cause crashes/failed responses. Please read the man page thoroughly before submitting a bug.

COPYRIGHT

Copyright © 2013 Steven Hickson. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>. This is free software: you are free to change and redistribute it as long as you give credit to the author and include this license. There is NO WARRANTY, to the extent permitted by law.  

HISTORY

This is the second major version of this program  

SEE ALSO

http://stevenhickson.blogspot.com/

Consider donating to further my tinkering.



Friday, June 3, 2016

Under the hood of Croatian, Filipino, Ukrainian, and Vietnamese in Google Voice Search



Although we’ve been working on speech recognition for several years, every new language requires our engineers and scientists to tackle unique challenges. Our most recent additions - Croatian, Filipino, Ukrainian, and Vietnamese - required creative solutions to reflect how each language is used across devices and in everyday conversations.

For example, since Vietnamese is a tonal language, we had to explore how to take tones into consideration. One simple technique is to model the tone and vowel combinations (tonemes) directly in our lexicons. This, however, has the side effect of a larger phonetic inventory. As a result we had to come up with special algorithms to handle the increased complexity. Additionally, Vietnamese is a heavily diacritized language, with tone markers on a majority of syllables. Since Google Search is very good at returning valid results even when diacritics are omitted, our Vietnamese users frequently omit the diacritics when typing their queries. This creates difficulties for the speech recognizer, which selects its vocabulary from typed queries. For this purpose, we created a special diacritic restoration algorithm which enables us to present properly formatted text to our users in the majority of cases.

Filipino also presented interesting challenges. Much like in other multilingual societies such as Hong Kong, India, South Africa, etc., Filipinos often mix several languages in their daily life. This is called code switching. Code switching complicates the design of pronunciation, language, and acoustic models. Speech scientists are effectively faced with a dilemma: should we build one system per language, or should we combine all languages into one?

In such situations we prefer to model the reality of daily language use in our speech recognizer design. If users mix several languages, our recognizers should do their best in modeling this behavior. Hence our Filipino voice search system, while mainly focused on the Filipino language, also allows users to mix in English terms.

The algorithms we’re using to model how speech sounds are spoken in each language make use of our distributed large-scale neural network learning infrastructure (yes, the same one that spontaneously discovered cats on YouTube!). By partitioning the gigantic parameter set of the model, and by evaluating each partition on a separate computation server, we’re able to achieve unprecedented levels of parallelism in training acoustic models.

The more people use Google speech recognition products, the more accurate the technology becomes. These new neural network technologies will help us bring you lots of improvements and many more languages in the future.

Friday, May 20, 2016

Text-to-Speech for low resource languages, episode 2: Building a parametric voice



This is the second episode in the series of posts reporting on the work we are doing to build text-to-speech (TTS) systems for low resource languages. In the previous episode, we described the crowdsourced data collection effort for Project Unison. In this episode, we describe our work to construct a parametric voice based on that data.

In our previous episode, we described building TTS systems for low resource languages, and how one of the objectives of data collection for such systems was to quickly build a database representing multiple speakers. There are two main justifications for this approach. First, professional voice talents are often not available for under-resourced languages, so we need to record ordinary people who get tired reading tedious text rather quickly. Hence, the amount of text a person can record is rather limited and we need multiple speakers for a reasonably sized database that can be used by others as well. Second, we wanted to be able to create a voice that sounds human but is not identifiable as a real person. Various concatenative approaches to speech synthesis, such as unit selection, are not very suitable for this problem. This is because the selection algorithm may join acoustic units from different speakers generating a very unnatural sounding result.

Adopting parametric speech synthesis techniques is an attractive approach to building the multi-speaker corpora described above. This is because in parametric synthesis the training stage of the statistical component takes care of multiple speakers by estimating an averaged-out representation of the various acoustic parameters representing each individual speaker. Depending on the number of speakers in the corpus, their acoustic similarity, and the ratio of speaker genders, the resulting acoustic model can represent an average voice that is indistinguishable from human and yet cannot be traced back to any of the actual speakers recorded during the data collection.

We decided to use two different approaches to acoustic modeling in our experiments. The first approach uses Hidden Markov Models (HMMs). This well-established technique was pioneered by Prof. Keiichi Tokuda at Nagoya Institute of Technology, Japan and has been widely adopted in academia and industry. It is also supported by a dedicated open-source HMM synthesis toolkit. The resulting models are small enough to fit on mobile devices.

The second approach relies on Recurrent Neural Networks (RNNs) and vocoders that jointly mimic the human speech production system. Vocoders mimic the vocal apparatus to provide a parametric representation of speech audio that is amenable to statistical mapping. RNNs provide a statistical mapping from the text to the audio and have feedback loops in their topology, allowing them to model temporal dependencies between various phonemes in human speech. In 2015, Yannis Agiomyrgiannakis proposed Vocaine, a vocoder that outperforms the state-of-the-art technology in speed as well as quality. In 2013, Heiga Zen, Andrew Senior and Mike Schuster proposed a neural network-based model that mimics deep structure of human speech production for speech synthesis. The model has further been extended into a Long Short-Term Memory (LSTM) RNN. This allows long term memorization, which is good for speech applications. Earlier this year, Heiga Zen and Hasim Sak described the LSTM RNN architecture that has been specifically designed for fast speech synthesis. The LSTM RNNs are also used in our Automatic Speech Recognition (ASR) systems recently mentioned in our blog.

Using the Hidden Markov Model (HMM) and LSTM RNN synthesizers described above, we experimented with a multi-speaker Bangla corpus totaling 1526 utterances (waveforms and corresponding transcriptions) from five different speakers. We also built a third system that utilizes an LSTM RNN acoustic model, but this time we made it small and fast enough to run on a mobile phone.

We synthesized the following Bangla sentence "??? ???? ????? ??????? ??????" translated from “This is an example sentence in Bangla”. Though HMM synthesizer output can sound intelligible, it does exhibit some classic downsides with a voice that sounds buzzy and muffled. With the LSTM RNN configuration for mobile devices, the resulting audio sounds clearer and has improved intonation over the HMM version. We also tried a LSTM RNN configuration with more network nodes (and thus not suitable for low-end mobile devices) to generate this waveform - the quality is slightly better but is not a huge improvement over the more lightweight LSTM RNN version. We hypothesize that this is due to the fact that a neural network with many nodes has more parameters and thus requires more data to train.

These early results are encouraging for several reasons. First, they confirm that natural-sounding speech synthesis based on multiple speakers is practically possible. It is also significant that the total number of recordings used was relatively small, yet we were able to build intelligible parametric speech synthesis. This means that it is possible to collect training data for such a speech synthesizer by engaging the help of volunteers who are not professional voice artists, for a short period of time per person. Using multiple volunteers is an advantage: it results in more diverse data, and the resulting synthetic voice does not represent any specific individual. This approach may well be the foundation for bringing speech technology to many more traditionally under-served languages.

NEXT UP: But can it say, “Google”? (Ep.3)

Saturday, May 14, 2016

Crowdsourcing a Text-to-Speech voice for low resource languages, episode 1



Building a decent text-to-speech (TTS) voice for any language can be challenging, but creating one – a good, intelligible one – for a low resource language can be downright impossible. By definition, working with low resource languages can feel like a losing proposition – from the get go, there is not enough audio data, and the data that exists may be questionable in quality. High quality audio data, and lots of it, is key to developing a high quality machine learning model. To make matters worse, most of the world’s oldest, richest spoken languages fall into this category. There are currently over 300 languages, each spoken by at least one million people, and most will be overlooked by technologists for various reasons. One important reason is that there is not enough data to conduct meaningful research and development.

Project Unison is an on-going Google research effort, in collaboration with the Speech team, to explore innovative approaches to building a TTS voice for low resource languages – quickly, inexpensively and efficiently. This blog post will be one of several to track progress of this experiment and to share our experience with the research community at large – our successes and failures in a trial and error, iterative approach – as our adventure plays out.

One of the most critical aspects of building a TTS system is acquiring audio data. The traditional way to do this is in a professional recording studio with a voice talent, sound engineer and a voice director. The process can take considerable time and can be quite expensive. People often mistake voice talent work to be similar to a news reader, but it is highly specialized and the work can be very difficult.

Such investments in time and money may yield great audio, but the catch is that even if you've created the best TTS voice from these recordings, at best it will still sound exactly like the voice talent - the person who provided the raw audio data. (We've read the articles about people who have fallen for their GPS voice, only to find that it belongs to a real person with a real name.) So the interesting problem here from a research perspective is how to create a voice that sounds human but is not identifiable as a singular person.

Crowd-sourcing projects for automatic speech recognition (ASR) for Google Voice Search had been successful in the past, with public volunteers eager to participate by providing voice samples. For ASR, the objective is to collect from a diversity of speakers and environments, capturing varying regional accents. The polar opposite is true of TTS, where one unique speaker with a standard accent, recorded in a soundproof studio, is the basic requirement.

Many years ago, Yannis Agiomyrgiannakis, Digital Signal Processing researcher on the TTS team in Google London, wrote a “manifesto” for acoustic data collection for 2000 languages. In his document, he gave technical specifications on how to convert an average room into a recording studio. Knot Pipatsrisawat, software engineer in Google Research for Low Resource Languages, built a tool that we call “ChitChat”, a portable recording studio, using Yannis’ specifications. This web app allows users to read the prompt, playback the recording and even assess the noise level of the room.
From other past research in ASR, we knew that the right tool could solve the crowd sourcing problem. ChitChat allowed us to experiment in different environments to get an idea of what kind of office space would work and what kind of problems we might encounter. After experimenting with several different laptops and tablets, we were able to find a computer that recognized the necessary peripherals (the microphone, USB converter, and preamp) for under $2,000 – much cheaper than a recording studio!

Now we needed multiple speakers of a single language. For us, it was a no-brainer to pilot Project Unison with Bangladeshi Googlers, all of whom are passionate about getting Google products to their home country (the success of Android products in Bangladesh is an example of this). Googlers by and large are passionate about their work and many offer their 20% time as a way to help, to improve or to experiment on something that may or may not work because they care. The Bangladeshi Googlers are no exception. They embodied our objectives for a crowdsourcing innovation: out of many, we could achieve (literally) one voice.

With multiple speakers, we would target speakers of similar vocal profiles and adapt them to create a blended voice. Statistical parametric synthesis is not new, but the advances in recent technology have improved quality and proved to be a lightweight solution for a project like ours.

In May of this year, we auditioned 15 Bangladeshi Googlers in Mountain View. From these recordings, the broader Bangladeshi Google community voted blindly for their preferred voice. Zakaria Haque, software engineer in Machine Intelligence, was chosen as our reference for the Bangla voice. We then narrowed down the group to five speakers based on these criteria: Dhaka accent, male (to match Zakaria's), similarity in pitch and tone, and availability for recordings. The original plan of a spectral analysis using PRAAT proved to be unnecessary with our limited pool of candidates.

All 5 software engineers – Ahmed Chowdury, Mohammad Hossain, Syeed Faiz, Md. Arifuzzaman Arif, Sabbir Yousuf Sanny – plus Zakaria Haque recorded over 3 days in the anechoic chamber, a makeshift sound-proofed room at the Mountain View campus just before Ramadan. HyunJeong Choe, who had helped with the Korean TTS recordings, directed our volunteers.
Left: TPM Mohammad Khan measures the distance from the speaker to the mic to keep the sound quality consistent across all speakers. Right: Analytical Linguist HyunJeong Choe coaches SWE Ahmed Chowdury on how to speak in a friendly, knowledgeable, "Googly" voice
ChitChat allowed us to troubleshoot on the fly as recordings could be monitored from another room using the admin panel. In total, we recorded 2000 Bangla and English phrases mined from Wikipedia. In 30-60 minute intervals, the participants recorded over 250 sentences each.

In this session, we discovered an issue: a sudden drop in amplitude at high frequencies in a few recordings. We were worried that all the recordings might have to be scrapped.
As illustrated in the third image, speaker3 has a drop in energy above 13kHz, which is visible in the graph and may be present in the speech, distorting the speaker's voice to sound as if he were speaking through a tube.
Another challenge was that we didn’t have a pronunciation lexicon for Bangla as spoken in Bangladesh. We worked initially with the publicly available TTS data from the Indian Institute of Information Technology, but this represented the variant of Bangla spoken in West Bengal (India), which differs from the speech we recorded. Our internally designed pronunciation rules for Bengali were also aimed at West Bengal and would need to be revised later.

Deciding to proceed anyway, Alexander Gutkin, Speech software engineer and lead for TTS for Low Resource Languages in Google London, built an initial prototype voice. Using the preliminary text normalization rules created by Richard Sproat, Speech and Language Processing researcher, the first voice we attempted proved to be surprisingly good. The problem in the high frequencies we had seen in the recordings is undetectable in the parametric voice.
When we return to the sound studio to record an additional 200 longer sentences, we plan to try an upgrade of the USB converter. Meanwhile, Martin Jansche, Natural Language Understanding software engineer, has worked with a team of native speakers on a pronunciation lexicon and model that better matches the phonology of colloquial Bangladeshi Bangla. Alexander will use the additional recordings and the new pronunciation dictionary to build the second version.

NEXT UP: Building a parametric voice with multiple speaker data (Ep.2)

Friday, March 11, 2016

Voicecommand image file and controlling electronics with your voice

With the introduction of the Raspberry Pi B+, a lot of people are finding that image files aren't working unless they're updated first. Because of this, I went ahead and made a fresh image file for voicecommand on Raspbian that is A/B/B+ compatible.

You can download it at:
https://mega.co.nz/#!MM8W1JxR!4PlZ_1-dumasDUCYRI4LuiBwEJgtqhfoin0R8ls90NQ

I've also taken the liberty of putting wiringPi and pilight on it. This means it's easier than ever to control electronics with your voice (I'll post more on that later).
You can read more about it at the Hackaday projects page here.

And here is a quick video demo:

Consider donating to further my tinkering since I do all this and help people out for free.


