Gabriel Skantze: Professor in Speech Communication and Technology at KTH Royal Institute of Technology
“Prof. Gabriel Skantze’s research is on human communication and the development of computational models that allow computers and robots to have face-to-face conversations with humans. This involves both verbal and non-verbal (gaze, prosody, etc.) aspects of communication, and his research covers phenomena such as turn-taking, feedback, joint attention, and language acquisition. Since social robots are likely to play an important role in our future society, the technology has direct applications, but it can also be used to increase our understanding of the mechanisms behind human communication. This requires an interdisciplinary approach, which includes phonetics, linguistics, language technology, artificial intelligence, and machine learning.
In 2014, Prof. Gabriel Skantze co-founded the company Furhat Robotics at KTH together with Samer Al Moubayed and Jonas Beskow, and he works part-time as Chief Scientist in the company.
He is also the President of SIGdial, the ACL (Association for Computational Linguistics) Special Interest Group on Discourse and Dialogue, and Associate Editor for the Human-Robot Interaction section of Frontiers in Robotics and AI.”
Below is a transcript of the Q&A session.
You have only the head, but what about the body? What kind of benefits would it add? Or why is it not incorporated?
So, obviously, you would like to have a body as well. One thing it would add is the ability to point to things; especially in this kind of situated interaction, the robot could point to things on the table, and so on, and potentially also manipulate things. But even if it can’t manipulate things, like the Pepper robot, at least it can point. So that is obviously something that would be nice. However, it is technically much more expensive to also put arms on the robot, and it is challenging to make the arms move in a natural way, in the same way we can now animate the face in a very natural way. So it is technically challenging and expensive, and the robot becomes bigger, and so on. If I had to choose between having arms and having an expressive face like Furhat’s, I think the face is more important. You could do a thought experiment: if you talked to someone and could only see their face, or could see their arms but not their face, I think you would prefer to see the face, as there is more going on there, even though I fully acknowledge that other parts of the body are important as well.
Have you observed any differences in terms of context? What I mean is that we typically have different kinds of conversations, for example among friends vs. between a manager and an employee. Do you think there would also be different cues for turn-taking, or how would that differ?
Absolutely. We have, for example, the interview scenario with the robot, and there the turn-taking protocol is special because you have the interviewer leading the conversation, and it is trying to understand when to give backchannels like “u-hum” and when to proceed to the next question, which is actually very tricky to do. But those are very different dynamics compared to the card game, where you have a three-party discussion and everyone can take the turn at any point. And then there are other factors, like status and intimacy, that affect turn-taking. So absolutely, there are different dynamics, and this also makes these things very hard to model, because you might train a model on one dataset for a certain type of conversation, but if you apply that model to another conversation, it might make the wrong kind of predictions. So, when you train a model, you want to do that on a very diverse set of conversation types.
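To make the interviewer’s decision concrete, here is a minimal, purely illustrative sketch of how such a choice could be framed; the thresholds, the probability input, and the function itself are assumptions, not the system described above.

```python
# Minimal, hypothetical sketch: at each detected pause, choose between giving a
# backchannel, moving on to the next question, or continuing to listen.
# The thresholds and the p_turn_shift input are illustrative assumptions.
def interviewer_action(p_turn_shift, pause_ms):
    """p_turn_shift: a model's probability that the user is yielding the turn."""
    if pause_ms < 200:
        return "wait"                   # too early to react at all
    if p_turn_shift < 0.3:
        return "backchannel"            # user will likely continue: say "u-hum"
    if p_turn_shift > 0.7 and pause_ms > 600:
        return "next_question"          # user seems done: take the turn
    return "wait"                       # uncertain: keep listening

print(interviewer_action(p_turn_shift=0.2, pause_ms=350))   # -> "backchannel"
print(interviewer_action(p_turn_shift=0.9, pause_ms=800))   # -> "next_question"
```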
Can you elaborate a bit more on the part where you talked about the neural network and predicting from audio? And, relating this to the previous question, is it because it is a particular type of data that is amenable to prediction, or if you took any data, would you always be able to predict who the next speaker is?
We get fairly good predictions. For example, we have looked at task-oriented dialogue, human-human dialogue like Map Task, where one person is instructing another person on a route on a map; that is very task-oriented and more structured. And then we have other datasets like Switchboard, which is more like social chit-chat and very free. We make fairly good predictions on both of these datasets, especially if you train on that dataset and that type of dialogue. So I think you can make fairly good predictions. The challenge is that you risk training on a specific type of dialogue. One of the big challenges we have is to find good enough datasets to train on.
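As an illustration of what such a predictor might look like, here is a minimal sketch of a frame-level model over audio features; the architecture, feature dimensions, and labelling scheme are assumptions made for the example, not the model discussed in the talk.

```python
# Minimal, illustrative sketch of a frame-level turn-shift predictor.
# The architecture, feature size, and labelling scheme are assumptions.
import torch
import torch.nn as nn

class TurnShiftPredictor(nn.Module):
    """For each audio frame, predict the probability that the current
    speaker is about to yield the turn."""
    def __init__(self, n_features=40, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames):                 # frames: (batch, time, n_features)
        out, _ = self.rnn(frames)              # encode the acoustic history
        return torch.sigmoid(self.head(out)).squeeze(-1)   # (batch, time)

# Training would pair each frame with a corpus-derived label, e.g. 1 if a
# speaker change follows the upcoming silence, 0 if the same speaker resumes.
model = TurnShiftPredictor()
features = torch.randn(2, 300, 40)             # 2 dialogues, 300 frames each
p_shift = model(features)                      # per-frame turn-shift probabilities
```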
I am just curious how ubiquitous all these signals are across different languages and cultures and whether that is something that you are looking at as well?
We haven’t really looked at that much. The datasets available to us are quite limited: British or American English, and to some extent Swedish. It is quite limited, actually; other researchers have looked at Japanese and so on. There are differences; backchannels such as “u-hum”, for example, are apparently much more common in Japanese. I also think that gaze behavior, to what extent you can maintain mutual gaze, differs across cultures. Again, in Japan you cannot maintain mutual gaze for as long as you can in, for example, France. I have seen numbers where people have compared these things. So there are definitely cultural differences; it is just something that we haven’t really investigated so far.
And then, by extension, would you see that what you are developing at the moment, which sounds like quite a generic model, may eventually become quite a dynamic model, where the AI tries to learn, based on its current interaction with a person, how that person speaks and then adjusts its behavior and its signalling accordingly?
Definitely, that would be super interesting to do, to try to adapt the model. In turn-taking you would probably get quite a clear signal that you are doing something wrong, for example if you are constantly interrupting each other, so you would know that the model has to be adapted, and it is probably possible to do that online, during a conversation. So, that’s a super interesting direction.
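One very simple way to picture such online adaptation, purely as a sketch and not as the approach described here, is to nudge a turn-taking decision threshold whenever the conversation signals trouble; the class and its numbers below are hypothetical.

```python
# Hypothetical sketch: adapt a turn-taking decision threshold online, using
# overlaps and long silences as the "something is wrong" signal.
class AdaptiveEndpointer:
    def __init__(self, threshold=0.5, step=0.05):
        self.threshold = threshold   # min turn-shift probability to take the turn
        self.step = step

    def should_take_turn(self, p_shift):
        return p_shift >= self.threshold

    def report_overlap(self):
        # The robot spoke while the user was still speaking: become more cautious.
        self.threshold = min(0.95, self.threshold + self.step)

    def report_long_silence(self):
        # The robot waited too long and caused an awkward gap: become more eager.
        self.threshold = max(0.05, self.threshold - self.step)
```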
We have already addressed some cultural differences; do you know of any age differences, for example between children and adults? Do children use the same cues for turn-taking as adults, or are their cues more easily recognizable?
That is very interesting and actually something that we have looked into. In the dataset we collected at the museum, which is quite large, we recorded around 400 interactions, and we looked at both age and gender differences in how people interacted with the robot. And yes, age makes a difference. We didn’t really look so much at which cues they used, but at how much they talked under different configurations. When you have two children interacting with the robot, they are much more talkative and much more likely to talk to the robot than when you have a child and an adult talking to the robot, because then the child starts to speak much less and the adult takes over. That is an interesting dynamic, and it is similar to what you find in the literature: two children talking together talk much more than a child and an adult do, because as a child you expect the adult to lead the conversation. That is something we could definitely see in the data we had. It is interesting from an educational perspective. The card game scenario that I showed you is something we are now using in a project on robots for language learning. We are playing this card game with children, Swedish children learning English. By talking to the robot in English, we hope that they can learn English and pick up new words. That is where you might have two children, and I think that is also when it is interesting to allow the child to talk to the robot without an adult next to them.
As a follow-up: in this card game, with two humans and a robot, did you see, for instance, that the humans interact more with each other than with the robot? Or is that not the case?
Actually, this is one of the conditions in this language-learning experiment: we are comparing one child and the robot vs. two children and the robot. We haven’t analysed the data yet, so I am very curious to see what it will show. But what typically happened in the museum setting is that you get one dominant and one less dominant speaker; on average, one person speaks twice as much as the other. One thing we tried there was to see whether, if the robot addressed the less dominant speaker, that person would be more likely to speak. And yes, it looked like that: the robot can actually act as someone who tries to balance the interaction. So, this is an interesting use case for a robot in this setting.
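As a rough illustration of the balancing idea, the sketch below tracks each participant’s speaking time and picks the quieter one as the next addressee; the class and its interface are hypothetical, not the museum system.

```python
# Hypothetical sketch: track speaking time per participant and address the
# less dominant speaker to balance the interaction.
from collections import defaultdict

class DominanceBalancer:
    def __init__(self):
        self.speaking_time = defaultdict(float)    # seconds spoken per participant

    def add_speech(self, speaker_id, duration_s):
        self.speaking_time[speaker_id] += duration_s

    def next_addressee(self):
        # Pick the participant who has spoken the least so far.
        return min(self.speaking_time, key=self.speaking_time.get)

balancer = DominanceBalancer()
balancer.add_speech("left_user", 42.0)
balancer.add_speech("right_user", 21.0)
print(balancer.next_addressee())    # -> "right_user"
```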
How do bad speech habits (unnecessary “likes”, etc.) affect AI recognition of turn-taking cues?
Some people would argue that there are no bad speech habits, just variation in how people speak, and the robot should simply be able to handle that. I guess it comes back to the previous question about having an adaptive model that can handle a variety of behaviors. We are all different in different ways, and ultimately I think this is something the robot should be able to handle. As it is now, probably not; it would handle the most average speaker best, because that is what is representative in the data. But this is not what we want in the future, of course.
Sometimes you have a new concept or slang, where you start calling an object something different among friends. So, again, the robot has to understand the meaning, that, say, we all call this object X and another object Y.
Actually, we have another research project related to exactly this question: how do you ground language when you are talking about things? How do people use different words for different things, and how can you adapt to that? Because that is what you see humans doing: when humans talk about something, they basically develop a language that is specific to them in order to solve the task. And that is something that robots don’t do. They have a fixed vocabulary for talking about things, and the human has to adapt to the robot, not the other way around. So how can you make the robot pick up new ways of using language during a conversation?
Those models that you were talking about, how are they implemented? Will they run locally on a Furhat, or will they be accessible through an API? Delays are usually associated with the latter; how do you foresee implementing this technology?
That’s a constant question: when do we put the processing on the robot and when do we put it in the cloud? As researchers we can be flexible; we can put a big computer with a lot of GPUs next to the robot and connect it to that. But once we move to the more practical case, that makes an obvious difference. For example, the camera, the face detection, and so on are done on the robot, not in the cloud, because I think that would create too much of a delay. So, all the face processing is done on the robot computer, while the speech processing is done in the cloud. For everything, you have to think a bit about where it should be done. And for turn-taking we would like to get down to this 200-millisecond latency; that’s challenging not just when developing the model but also from an engineering perspective. For now, those turn-taking models are not in the robot; they are things we are doing on the research side, so that we can be very flexible in how we practically solve that.
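To make the 200-millisecond target concrete, here is a small sketch of a latency-budget check for a pipeline split between the robot and the cloud; the stage names and timings are made-up assumptions, not measurements from the actual system.

```python
# Illustrative sketch: check whether a robot/cloud pipeline fits a response-
# latency budget. Stage names and numbers are hypothetical.
BUDGET_MS = 200    # target gap between the user's turn end and the robot's response

def check_budget(stage_timings_ms):
    total = sum(stage_timings_ms.values())
    for stage, ms in stage_timings_ms.items():
        print(f"{stage:>24s}: {ms:6.1f} ms")
    verdict = "within" if total <= BUDGET_MS else "over"
    print(f"{'total':>24s}: {total:6.1f} ms ({verdict} the {BUDGET_MS} ms budget)")

check_budget({
    "face tracking (robot)": 15.0,
    "turn-end prediction": 30.0,
    "cloud ASR round-trip": 120.0,
    "response generation": 60.0,
})
```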
Currently, as I understand it, you have some animations that you can play on the face, but have you considered, for the future, coming up with models that generate various facial behaviors more autonomously?
Yeah, we want to get away from the pre-recorded stuff and generate more and more of everything dynamically. One thing you can always do, and should always do, is insert a lot of random variation. If you produce a filled pause for turn-taking, like “Uh”, in the robot, you want to have 20 different ways of saying “Uh” and then switch between those, because if you always play exactly the same sound, it will sound super strange. The nice thing with multimodality is that you can use different signals: sometimes you can use a smile or gaze to show turn-hold, and sometimes you can use “Uh”. So, the fact that we have these different modalities and that they are all effective means that we can also create that kind of variation. That is also the strength of having multiple modalities and having a robot. But yeah, the key is definitely variation, even though when you think about something robotic, you think about something that doesn’t have variation.
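A trivial sketch of what such variation could look like in code is shown below; the cue inventory is invented for illustration and is not the Furhat SDK’s actual API.

```python
# Hypothetical sketch: pick a randomly varied turn-hold cue across modalities,
# avoiding an immediate repeat. The cue inventory is invented for illustration.
import random

TURN_HOLD_CUES = (
    [("audio", f"uh_{i:02d}.wav") for i in range(20)]      # 20 renditions of "Uh"
    + [("face", "smile_hold"), ("gaze", "look_aside")]
)

_last_cue = None

def pick_turn_hold_cue():
    """Choose a cue at random, never repeating the previous one back-to-back."""
    global _last_cue
    choices = [c for c in TURN_HOLD_CUES if c != _last_cue]
    _last_cue = random.choice(choices)
    return _last_cue

print(pick_turn_hold_cue())    # e.g. ('audio', 'uh_07.wav') or ('gaze', 'look_aside')
```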
This question relates to AI more broadly: what ethical considerations are you looking at, or are you not looking at anything like that at the moment?
That’s a very interesting discussion, indeed. Some of these ethical concerns are more futuristic and might not apply to the kind of systems we have today. But some are more common, and not just for robots; privacy is very important and something all of these systems have to deal with: what kind of data can you store, and so on. Maybe it is an even bigger problem for robots because you have all these cameras, so that is obviously something we have to deal with. To what extent do you need to tell someone that they are being recorded? These are the obvious things. But you might also have other ethical issues: does the person know how intelligent the robot really is? Are you deceiving people if you let them believe that the robot is more intelligent than it is? Especially if you had people with dementia talking to the robot, who might not fully understand that it is a robot and not a person, that would obviously be ethically problematic, and the same goes for children. And if you are creating something like a chatbot that just answers questions without really knowing what it is talking about, there is a big risk that the robot starts saying things it shouldn’t say, or encourages you to do things you shouldn’t do, just because it is playing along. These are also interesting ethical problems. At the current stage, the technology is not advanced enough to raise all of these concerns, but of course they will come eventually, even more so.
Have you already experimented with Furhat’s barge-in function? Does it work well? Do users find it natural that they can interrupt Furhat while it is still speaking?
Yeah, we are currently trying to avoid that, because interruptions can lead to a lot of confusion. We want to have clear turn-taking signals so that we avoid having to interrupt each other, because as soon as you start interrupting each other’s utterances, that becomes very challenging to handle. If you are interrupted in the middle of what you are saying, you need to keep track of: how much did I actually say before I got interrupted? Where should I resume after the interruption? And that’s typically not how these systems are built; they assume that as soon as they have started to say something, it’s out there, it has been said. So, I think handling interruptions is a big challenge in itself, and the way to avoid it for now is to have clear turn-taking signals. But eventually we need to be much better at handling interruptions, and I would say that one of the main sources of confusion at the moment is when you end up in these kinds of interruptions and don’t know how to get out of them.
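As a sketch of the bookkeeping this would require, here is a hypothetical utterance tracker that records how much was actually delivered before a barge-in; the callback-based interface is an assumption, not how any particular system is built.

```python
# Hypothetical sketch: track how many words of an utterance were actually
# spoken before an interruption, so the system knows where it could resume.
class UtteranceTracker:
    def __init__(self, words):
        self.words = words
        self.delivered = 0               # words already spoken aloud

    def mark_word_spoken(self):
        # Assumed to be called from a TTS word-boundary callback.
        self.delivered += 1

    def on_interruption(self):
        """Return the undelivered remainder, to be rephrased or resumed later."""
        return self.words[self.delivered:]

utt = UtteranceTracker("so the next card shows a picture of a lighthouse".split())
for _ in range(4):
    utt.mark_word_spoken()               # the user barges in after four words
print(utt.on_interruption())             # -> ['shows', 'a', 'picture', 'of', 'a', 'lighthouse']
```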
A deep dive into the topic:
Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: a review. Computer Speech & Language, 67, 101178. [link]