Rachael Jack: Professor of Computational Social Cognition at the University of Glasgow
“Her research has produced significant advances in understanding facial expression of emotion within and across cultures using a novel interdisciplinary approach that combines psychophysics, social psychology, dynamic 3D computer graphics, and mathematical psychology. Most notably, she has revealed cultural specificities in facial expressions of emotion; that four, not six, expressive patterns are common across cultures; and that facial expressions transmit information in a hierarchical structure over time. Together, Jack’s work has challenged the dominant view that six basic facial expressions of emotion are universal, which has led to a new theoretical framework of facial expression communication that she is now transferring to digital agents to synthesize culturally sensitive social avatars and robots.”
Below is a transcript of the Q&A session.
You mentioned this research where you disrupted the order of facial movement sequences. What actually happened there? Can we say that the order of movements is important?
Yes, in a study we did a while ago, we switched the order of facial movements. We built a model of dynamic facial movements, so we know the order in which people expect the movements to appear. If you switch the order of the Action Units, it is more difficult for people to recognize the facial expressions, because they are expecting them in a different order. We also did a study with adults with autism, and their expectations were the opposite of adults without autism: when you reversed the order they expected, it was more difficult for them to recognize the facial expressions. When you see a facial expression, you are of course accumulating evidence as you watch the video, but if it is not really what you are expecting, it looks a bit weird. What was quite interesting about that is that if facial movements are expected in a particular order, and assuming they are then also expressed in that way, that could be one reason why some adults with autism have more trouble decoding facial expressions: the Action Units are not in the order in which they expect them to be, which would slightly disrupt the social perception process.
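As a rough illustration of the manipulation described above, here is a minimal sketch that takes a toy dynamic expression, represented as Action Units with onset times, and reverses the order in which the movements appear; the AU names and timings are illustrative placeholders, not the actual stimuli used in the study.

```python
# Toy sketch: represent a dynamic expression as Action Units with onset times,
# then disrupt the expected temporal order by reversing which AU gets which onset.
from dataclasses import dataclass

@dataclass
class AUEvent:
    name: str        # e.g. "AU12" (lip corner puller); placeholder labels
    onset_ms: int    # when the movement starts in the animation

def disrupt_order(sequence: list[AUEvent]) -> list[AUEvent]:
    """Keep the same onset times, but assign them to the AUs in reverse order,
    so each movement appears at a time the observer does not expect."""
    onsets = sorted(ev.onset_ms for ev in sequence)
    reversed_aus = sorted(sequence, key=lambda ev: ev.onset_ms, reverse=True)
    return [AUEvent(ev.name, t) for ev, t in zip(reversed_aus, onsets)]

expected = [AUEvent("AU6", 100), AUEvent("AU12", 250), AUEvent("AU25", 400)]
print(disrupt_order(expected))  # AU25 now leads, AU6 arrives last
```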
Is it possible to train an AI system to learn these expressions, if it knows that a given video shows a particular expression?
Well, that depends on whether you can quantify the face information. Often you have deep learning or machine learning where they take videos of faces, have them annotated, and then learn a mapping between an image and its annotated label. But you don’t know what the features are. So why is that person labeled as happy? If the stimulus is tractable, if you can identify the features in it, then you can understand which features the human or the network is using to classify the images or videos. But without a tractable stimulus you can’t do that, and that’s a problem with a lot of machine learning and deep learning.
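To illustrate why a tractable, parameterized stimulus matters, here is a minimal sketch in which each stimulus is a known combination of Action Units, so the observer’s (or a classifier’s) “happy” responses can be related back to individual AUs; the AU set and the observer function are hypothetical stand-ins, not the lab’s actual method or data.

```python
# Minimal sketch: because each stimulus is a known set of Action Units,
# we can ask which AUs are over-represented on trials judged "happy".
import random

AUS = ["AU6", "AU12", "AU25", "AU4", "AU9"]  # placeholder Action Unit set

def random_stimulus() -> set[str]:
    # Each AU is independently present or absent in the stimulus.
    return {au for au in AUS if random.random() < 0.5}

def observer_says_happy(stim: set[str]) -> bool:
    # Stand-in for a human judgment or a classifier output.
    return "AU12" in stim and "AU6" in stim

trials = [random_stimulus() for _ in range(5000)]
happy = [s for s in trials if observer_says_happy(s)]
for au in AUS:
    p_happy = sum(au in s for s in happy) / len(happy)
    p_all = sum(au in s for s in trials) / len(trials)
    print(f"{au}: present on {p_happy:.2f} of 'happy' trials vs {p_all:.2f} overall")
```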
I was wondering, how much contextual information are we using to interpret these facial expressions? Is there a lot of semantic information about the situation one is in that we use to interpret whether a frown indicates anger or concentration, or do we really have different facial expressions for all of these emotions?
Yeah, this is a great question, the question of context. What is context? It could be the weather, it could be the person’s identity. My face is the context for my facial movements: if I smile and have a nose wrinkler, one is the context for the other. The background is a context, the conversation we just had is a context, your memory of me is a context. This is not to say context is not important. It is absolutely important, but it is not a question that is easily tractable at the moment, because context is just more information. And yes, we want to know what information is combined to reach a perceptual decision, to understand what information the observer uses to reach that decision. Someone once told me this and I think it is very true: if you are a scene researcher, faces are context; if you are a face researcher, scenes are context. If we try to be pan-domain about it and think simply in terms of information, when we look around there is so much information coming into our visual stream that we use to arrive at perceptual decisions, and knowing which of that information drives a given decision, as I have shown with facial movements alone, is not trivial. How do you identify the features in a complex scene that is dynamic and changing every moment? That is really not a trivial thing to do. What we plan to do is to reduce the information space of facial expressions and then pair them with other types of information, such as speech and face identity, each of which could arguably be described as context, or other information, such as semantic information, to understand how this additional information modulates perception. The context question is really important and something that we can address now, because we have things like virtual reality. We can have proper experimental control, because you need to know what’s driving the behavior, otherwise you don’t know, and we can have it in a close-enough-to-real-life context.
What do you think about having a body? How important is it for an agent to have a body? Here we have just the face, but we also express emotions through gestures.
That is a very good question as well: when do we need a body, and do we really need one? In part, it really depends on the function of the digital agent, because some assistants don’t even have a face, they have only a voice, and, for their purposes, they do pretty well. They don’t really need an articulated body. So it really depends on what you want your digital agent to do. There are some things that the body can do that the face can’t really do. Even in terms of finger articulation, there are so many degrees of freedom there; you can make all kinds of things, you can make shadow puppets, you can have iconic gestures, and these gestures are very rich. There might be certain contexts in which hand gestures are really required. It depends a bit on the intended role and utility of the agent. In some roles, they won’t need the face, they won’t need the body, and that would be an over-engineered agent: it has all these great social signalling features, but they are not actually used.
Besides the cultural differences that you found, have you found any differences in facial expressions between men and women?
I don’t think we have found any substantive differences in the facial expression models between men and women, or according to the sex or gender of the stimulus face. Facial movements do tend to override perceptions that are derived just from morphology. There might be limits there for sure, and we are exploring those limits; we are exploring how facial morphology and complexion might push around some of the interpretation of facial expressions, but we haven’t found anything so far with respect to the gender of the participants or the stimulus face. One thing we do find, and other researchers have found, is that there are differences in the social perception of face shape with respect to social traits like competence. A competent-looking woman has features that correspond with other social traits, like trustworthiness, in ways that are different for men. This is quite a robust finding, probably related to the concept that a competent woman would also be x, y, z in terms of other social traits, whereas a competent man sits more similarly on these other social traits, like untrustworthiness, and so on.
We established that context is a difficult thing to tackle. What about contingencies in social interaction? For example, one person says or does something and then influences the other person. And you mentioned that you will be pairing speech segments with non-verbal behaviors in your future research. Is contingency, which can be understood as dyadic interaction, a difficult aspect to address, or do you think it is more manageable than context?
This is a next step for us: dyadic interactions, really in reference to how facial expressions are used by listeners, when one person is speaking and the other is listening and conveying information via the face. In a dyadic interaction where you are exchanging facial expressions and speech at the same time, obviously, this is more complex. You have more parameters, you have a dynamic component, and you also want to know what information is actually being transmitted by that person. You can have, say, 10 bits of information transmitted by the transmitter, but only 5 of them actually used by the receiver; this is part of normal signal design, where you have redundancy and degeneracy, et cetera. So you have to know what subset is being used; you can’t assume that all of it is. And now we have much better technologies and methods to control complex social stimuli, like voices, which I don’t do but colleagues do very well, and faces, which we do. That gives us good traction on these complex signals and the ability to understand what’s used in the exchange and what is really driving the different behaviors in these dyads. I’d like to do it and understand more of the complexities of doing it. In short, we now have an opportunity to look at these much more complex situations because technology has caught up, fortunately.
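The “bits transmitted versus bits used” point can be made concrete with a small sketch: on simulated trial data (purely hypothetical), a mutual information estimate shows that a feature the receiver relies on carries information about the decision, while a transmitted but ignored feature carries essentially none.

```python
# Toy sketch of "10 bits sent, 5 bits used": estimate how much each transmitted
# feature tells us about the receiver's decision, using mutual information
# computed from simulated (hypothetical) trial data.
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

random.seed(0)
n_trials = 10_000
feature_a = [random.randint(0, 1) for _ in range(n_trials)]  # used by the receiver
feature_b = [random.randint(0, 1) for _ in range(n_trials)]  # transmitted but ignored
decision = [a if random.random() < 0.9 else 1 - a for a in feature_a]

print("MI(feature_a; decision):", round(mutual_information(feature_a, decision), 3))  # ~0.5 bits
print("MI(feature_b; decision):", round(mutual_information(feature_b, decision), 3))  # ~0 bits
```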
You mentioned this idea that variability of facial expressions is important. What about another aspect, subtlety versus exaggeration? If you look at the human data, humans are certainly expressive with their faces, but in a subtle manner, which raises the question: should we replicate that subtlety in virtual agents and social robots, or should we make them really expressive?
Yeah, that’s a really good question, because there’s also this other layer of question: how do digital agents need to express themselves for humans to believe them, or perceive them in a particular way? We know that they are not humans, and there’s definitely an argument for us not to be tricked into thinking that they are. If you think of them as a different culture, then perhaps they need to express themselves in a slightly different way than humans do, based on how humans perceive digital agents. And I think it is pretty much an open question whether they need to express themselves in slightly different ways. The other part is about the function of the agent and the social role that it is playing. Would exaggerated facial expressions be unnecessary, would they be disruptive? Or would they be necessary, because some digital agents need to express or communicate certain messages with their face across longer viewing distances, or on a much noisier channel, and so you want to amplify those signals? Agents can amplify them; we can’t do that. We can make them have popping eyes or large movements so that humans can see them from longer distances, which might serve them well in certain situations and not so well in others, where those signals need to be much more subtle. Equally, some facial movements have more utility when subtle, for example in applications where they are used for behavioural change. So it depends on the particular role the agent is playing and whether it would serve it well in that role to have these exaggerated, ritualized facial expressions or more subtle facial movements. It is about the role: why are you building this agent, what do you want it to do, what is its job going to be? If it needs to express itself across long distances, it probably needs the capacity to generate those amplified signals and to know that my receiver is way over there, so I need a big smile. But if my receiver is here, close by, I don’t need big smiles, because they might be off-putting.
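As a hedged sketch of the amplification idea, one might scale a set of Action Unit amplitudes according to how far away the receiver is; the distance thresholds and gain values below are illustrative placeholders, not empirical parameters.

```python
# Illustrative sketch: exaggerate an expression for distant receivers,
# keep it subtle for close ones. Thresholds and gains are made up.
def scale_expression(au_amplitudes: dict[str, float], receiver_distance_m: float) -> dict[str, float]:
    if receiver_distance_m > 10.0:
        gain = 1.8          # far away: big smile, wide eyes
    elif receiver_distance_m > 3.0:
        gain = 1.2
    else:
        gain = 0.8          # close by: keep movements subtle
    # Clamp amplitudes to the valid [0, 1] range.
    return {au: min(1.0, amp * gain) for au, amp in au_amplitudes.items()}

smile = {"AU6": 0.4, "AU12": 0.5}
print(scale_expression(smile, receiver_distance_m=15.0))  # amplified
print(scale_expression(smile, receiver_distance_m=1.0))   # subtle
```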
Are emotions less ambiguous when expressed through speech than through facial expressions alone?
It is probably not so much the modality, whether it is speech or the face, but rather what’s voluntary and what’s involuntary. If signals are involuntary, you can’t fake them; they are honest, reliable signals. This is something we are working on at the moment: combining speech and facial movements to understand how they come together to support a perceptual decision, what is really driving the decision in the end, and what information is discounted when it is presented alongside other information. If I am blushing and I say that I am totally calm, what’s your decision about me? Do you think that I am nervous? Yes, you probably do, and you will probably ignore what I am saying.
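This is not a model described in the talk, but a minimal sketch of the intuition that involuntary cues can outweigh conflicting voluntary ones: each cue contributes evidence of nervousness weighted by how reliable it is, and the weights and values are illustrative placeholders.

```python
# Toy weighted cue combination: involuntary cues get a higher reliability
# weight, so they dominate when cues conflict. All numbers are made up.
def combine_cues(cues: list[tuple[str, float, float]]) -> float:
    """cues: (name, evidence of nervousness in [0, 1], reliability weight).
    Returns the weighted-average evidence that the sender is nervous."""
    total_weight = sum(w for _, _, w in cues)
    return sum(e * w for _, e, w in cues) / total_weight

cues = [
    ("blushing (involuntary)", 0.9, 3.0),      # hard to fake, weighted heavily
    ("says 'I am calm' (voluntary)", 0.1, 1.0),
]
print(f"combined evidence of nervousness: {combine_cues(cues):.2f}")  # ~0.70
```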
A deep dive into the topic:
Blais, C., Jack, R. E., Scheepers, C., Fiset, D., & Caldara, R. (2008). Culture shapes how we look at faces. PLoS ONE, 3(8), e3022. [link]
Jack, R. E., & Schyns, P. G. (2015). The human face as a dynamic tool for social communication. Current Biology, 25(14), R621-R634. [link]
Jack, R. E., Garrod, O. G., Yu, H., Caldara, R., & Schyns, P. G. (2012). Facial expressions of emotion are not culturally universal. Proceedings of the National Academy of Sciences, 109(19), 7241-7244. [link]