Technical choices

Speech recognition and synthesis

The first technical choice concerned the software for speech recognition and speech synthesis. Many programs of this kind are available nowadays, but some are paid or restrict the amount of data that can be processed. Putting these aside (for example Google's neural-network-based cloud services for recognition and synthesis), we eventually opted for Microsoft's .NET libraries, which handle both Speech-to-Text input and Text-to-Speech output. This decision led us to install the Windows 10 IoT Core operating system (specifically designed for devices with modest computing power) on the Raspberry Pi and to use C# as the coding language. The main advantage of this setup was the possibility to work inside a single, complete environment made up of the .NET Framework and the Universal Windows Platform. Moreover, since our vocal assistant is a UWP application, it can run on any machine or device equipped with Windows 10 and its derivatives (PC, Xbox, HoloLens and so on).
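
As a concrete reference, the sketch below shows how these two libraries are typically used from a UWP application; it is a simplified illustration, not the project's actual code.

// A simplified illustration of the two UWP speech APIs:
// SpeechRecognizer lives in Windows.Media.SpeechRecognition,
// SpeechSynthesizer in Windows.Media.SpeechSynthesis,
// MediaSource/MediaPlayer in Windows.Media.Core and Windows.Media.Playback
public async Task DemoSpeechAsync()
{
    // Speech-to-Text: listen once using the default dictation grammar
    var recognizer = new SpeechRecognizer();
    await recognizer.CompileConstraintsAsync();
    SpeechRecognitionResult heard = await recognizer.RecognizeAsync();

    // Text-to-Speech: synthesize a reply and play it back
    var synthesizer = new SpeechSynthesizer();
    var stream = await synthesizer.SynthesizeTextToStreamAsync("You said " + heard.Text);
    var player = new MediaPlayer();
    player.Source = MediaSource.CreateFromStream(stream, stream.ContentType);
    player.Play();
}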

Software structure

Before coding, we thought about what kind of structure to give the software; from another perspective, that meant defining the operating rules of the assistant. The solution we ended up with was to shape the software as a finite-state machine: each recognized input is interpreted according to the current state of the assistant and may trigger a transition to another state.
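
To make the idea concrete, here is a minimal sketch of such a state machine; the state names and the wake phrase are illustrative, not the project's actual ones.

// A minimal finite-state machine sketch (illustrative names)
public enum AssistantState { Idle, ListeningForCommand, ExecutingTask }

public class Assistant
{
    private AssistantState state = AssistantState.Idle;

    // Each recognized sentence is interpreted according to the current
    // state and may trigger a transition to a new state
    public void OnSpeechRecognized(string text)
    {
        switch (state)
        {
            case AssistantState.Idle:
                if (text.Contains("hey assistant"))        // wake phrase (illustrative)
                    state = AssistantState.ListeningForCommand;
                break;
            case AssistantState.ListeningForCommand:
                state = AssistantState.ExecutingTask;      // run the matched command...
                break;
            case AssistantState.ExecutingTask:
                state = AssistantState.Idle;               // ...then return to idle
                break;
        }
    }
}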

Reminder grammar

Inside the function that manages the reminders, we decided to rely on a custom grammar to recognize the spoken date. The reason for this choice is that, in this step, we need to accurately understand only a finite set of words, defined by the day-month combinations, rather than any possible sentence. In particular, the sequence to pronounce in order to specify a date has the form "month, day" (with the day as a cardinal number), e.g. "January fifteen". We know this is not the usual way to express a date in English, but we chose this format to decrease the odds of misunderstandings (pronouncing the date in an ordinal fashion often produced errors). Conversely, the comprehension of the task to memorize is carried out using the whole English vocabulary, since any arbitrary phrase is allowed (for example "Dental examination at 3 PM" or "Buy groceries"). The box below contains the XML code that defines our custom grammar.

<grammar
    version="1.0"
    xml:lang="en-US"
    xmlns="http://www.w3.org/2001/06/grammar"
  root="rootRule">
  <rule id="rootRule">
    <one-of>
      <item><ruleref uri="#rule1"/></item>
      <item><ruleref uri="#rule2"/></item>
      <item><ruleref uri="#rule3"/></item>
      <item><ruleref uri="#rule4"/></item>
      <item><ruleref uri="#rule5"/></item>
      <item><ruleref uri="#rule6"/></item>
      <item><ruleref uri="#rule7"/></item>
      <item><ruleref uri="#rule8"/></item>
      <item><ruleref uri="#rule9"/></item>
      <item><ruleref uri="#rule10"/></item>
      <item><ruleref uri="#rule11"/></item>
      <item><ruleref uri="#rule12"/></item>
    </one-of>
  </rule>
  <rule id="rule1">
    <item>january</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule2">
    <item>february</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule3">
    <item>march</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule4">
    <item>april</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule5">
    <item>may</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule6">
    <item>june</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule7">
    <item>july</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule8">
    <item>august</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule9">
    <item>september</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule10">
    <item>october</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule11">
    <item>november</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule12">
    <item>december</item>
    <ruleref uri="#rule13"/>
  </rule>
  <rule id="rule13">
    <one-of>
      <item>1</item>
      <item>2</item>
      <item>3</item>
      <item>4</item>
      <item>5</item>
      <item>6</item>
      <item>7</item>
      <item>8</item>
      <item>9</item>
      <item>10</item>
      <item>11</item>
      <item>12</item>
      <item>13</item>
      <item>14</item>
      <item>15</item>
      <item>16</item>
      <item>17</item>
      <item>18</item>
      <item>19</item>
      <item>20</item>
      <item>21</item>
      <item>22</item>
      <item>23</item>
      <item>24</item>
      <item>25</item>
      <item>26</item>
      <item>27</item>
      <item>28</item>
      <item>29</item>
      <item>30</item>
      <item>31</item>
    </one-of>
  </rule>
</grammar>
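
For context, the snippet below sketches how a grammar file like this one can be attached to a UWP SpeechRecognizer; the file name "date_grammar.xml" and the method name are illustrative, not the project's actual identifiers.

// A sketch of loading an SRGS grammar file into a recognizer
// (requires Windows.Storage and Windows.Media.SpeechRecognition)
public async Task<string> RecognizeDateAsync()
{
    var grammar_file = await StorageFile.GetFileFromApplicationUriAsync(
        new Uri("ms-appx:///date_grammar.xml"));

    var date_recognizer = new SpeechRecognizer();
    date_recognizer.Constraints.Add(new SpeechRecognitionGrammarFileConstraint(grammar_file));
    await date_recognizer.CompileConstraintsAsync();

    // The recognizer can now only return one of the month-day
    // combinations defined by the grammar, e.g. "january 15"
    SpeechRecognitionResult date = await date_recognizer.RecognizeAsync();
    return date.Text;
}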

User interface

As far as the user interface is concerned, we decided to handle the visual output as shown in the following image. In general, UWP allows the UI to be arranged in several so-called pages; however, with the aim of keeping the system simple without sacrificing usability, we used a single page made up of distinct layers (called grids). When necessary, one of these layers is made visible while the others are hidden. Overall there are five grids (default, weather, media player, stopwatch and recipe); a grid is brought to the foreground when the corresponding feature is in use. For example, the media player grid includes some buttons, a slider and a few pieces of textual information, but these stay invisible while the user is reading the weather forecast.

// Shows the media player grid and hides all the others
public void SwitchToMediaPlayerGrid()
{
    default_grid.Visibility = Visibility.Collapsed;
    weather_grid.Visibility = Visibility.Collapsed;
    media_player_grid.Visibility = Visibility.Visible;
    stopwatch_grid.Visibility = Visibility.Collapsed;
    recipe_grid.Visibility = Visibility.Collapsed;
}

Challenges

Avoiding speech recognition errors

Initially the speech capture functions were based on a limited grammar, i.e. a grammar characterized by fixed recognition rules and a small set of allowed words. We did this to reduce the probability of the assistant misunderstanding the user's voice command. After some tests, though, we realized that this choice was detrimental: the constraints of the limited grammar forced the system to map whatever the user said onto one of the allowed commands, so even generic sentences were forcefully interpreted as commands. Knowing that, we decided to use an unconstrained grammar (one that allows any word of the English language) and to compare the captured sentence with the available commands afterwards. This choice is also supported by the fact that the recognition software understands generic sentences with fairly good accuracy; given that, we believe the remaining misunderstandings should mostly be attributed to the poor quality of the USB microphone we are using.
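
The sketch below illustrates this "recognize freely, then compare" approach; the command list is made up for the example, and the dictation topic constraint is the standard UWP way to obtain an unconstrained grammar.

// A sketch of free recognition followed by command matching (illustrative commands).
// The recognizer is assumed to have been compiled with
// new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.Dictation, "dictation")
private static readonly string[] known_commands =
    { "what's the weather like today", "start the stopwatch", "stop the music" };

public async Task<string> CaptureCommandAsync(SpeechRecognizer recognizer)
{
    SpeechRecognitionResult result = await recognizer.RecognizeAsync();

    // Only afterwards is the captured sentence compared with the available
    // commands, so generic speech is no longer forced onto the nearest command
    string heard = result.Text.Trim().TrimEnd('.', '?', '!').ToLowerInvariant();
    return Array.IndexOf(known_commands, heard) >= 0 ? heard : null;
}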

Avoiding "self-listening"

Another problem that arose during the early tests was that the speech synthesis function returned before the assistant had finished speaking. As a consequence, whenever the user had to speak right after the assistant, the program would listen to itself, wrongly interpreting its own output as an input command. To solve this, we added a loop in which the program waits for the audio playback to terminate. The function that handles speech synthesis is shown below.

// Text-to-Speech method
public async Task SayAsync(string text)
{
    // Synthesize the text into an audio stream and hand it to the player
    var stream = await speech_synth.SynthesizeTextToStreamAsync(text);
    speech_player.Source = MediaSource.CreateFromStream(stream, stream.ContentType);
    speech_player.Play();

    // Wait for the Playing -> Paused transition, which marks the end of
    // playback; the Task.Delay yields the thread instead of busy-waiting
    var old_state = speech_player.PlaybackSession.PlaybackState;
    while (true)
    {
        var current_state = speech_player.PlaybackSession.PlaybackState;
        if (old_state == MediaPlaybackState.Playing && current_state == MediaPlaybackState.Paused)
            break;
        old_state = current_state;
        await Task.Delay(50);
    }

    // Dispose of the stream only once playback has finished
    stream.Dispose();
}

Having vocal and visual outputs

On a structural level, UWP applications separate the code that implements the functionalities (the App class) from the code that controls the graphics (the MainPage class). The problem is that MainPage methods cannot be called directly (i.e. with the dot operator) from the App class, since UI elements may only be accessed from the UI thread. To solve this issue, we used a dispatcher call that asks the UI thread to execute some specific lines of code even though we are inside the App class.

// Vocal and visual output handling
private async Task GUIOutput(string text, bool has_to_speak)
{
    // Marshal the UI update onto the UI thread through its dispatcher
    await Windows.ApplicationModel.Core.CoreApplication.MainView.CoreWindow.Dispatcher.RunAsync(
        CoreDispatcherPriority.Normal, () => main_page.PrintText(text));

    if (has_to_speak)
        await speaker.SayAsync(text);
}

Possible future developments

Hotword-based command recognition

A first improvement would be a command recognition system based on hotword sequences instead of exact string equality. That would make the interaction with the assistant much more natural and easy, since the user would no longer have to pronounce very specific sentences to activate the desired feature. For example, instead of saying exactly "What's the weather like today?" to get the weather forecast, any sentence containing the words "weather" and "today" (in that order) would suffice; a sketch of this idea follows.
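
A minimal sketch of such a matcher (the helper name is hypothetical):

// Returns true if all hotwords appear in the sentence, in the given order
private static bool MatchesHotwords(string sentence, params string[] hotwords)
{
    int position = 0;
    foreach (var word in hotwords)
    {
        position = sentence.IndexOf(word, position, StringComparison.OrdinalIgnoreCase);
        if (position < 0)
            return false;
        position += word.Length;
    }
    return true;
}

// e.g. MatchesHotwords("Tell me the weather forecast for today", "weather", "today") -> true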

New features

We could obviously add new features such as a digital frame function, sending e-mails or setting alarms. We could even try to make the virtual assistant work as a command hub for household IoT devices like smart lamps.

Better geolocation

It could also be worth enhancing the geolocation capabilities of the device with a GPS module. Currently, the position is determined through the Internet connection, starting from the IP address, and the results are quite rough. Making the positioning more accurate would, among other things, give us even better weather forecasts.
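
For reference, this is roughly how UWP exposes the device position (a sketch, not the project's code); without a GPS module the Geolocator falls back to the coarse, network-based positioning described above.

// Reading the device position through Windows.Devices.Geolocation
public async Task<BasicGeoposition> GetPositionAsync()
{
    var locator = new Geolocator { DesiredAccuracy = PositionAccuracy.High };
    Geoposition position = await locator.GetGeopositionAsync();
    return position.Coordinate.Point.Position;   // latitude and longitude in degrees
}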