"Multi-modelling" of multimodality interaction
Dorothy Rachovides, Zoë Swiderski, Alan P. Parkes
Computing Department,
Lancaster University
Lancaster. LA1 4YR. U.K.
{dot | zoe | app}@comp.lancs.ac.uk
Keywords: Multimodality interaction; multimedia
Introduction
Information can be conveyed between people in a number of ways. People draw on a range of materials (e.g. pen and paper) and physical abilities (e.g. gesturing) in order to express themselves. As machines have become more prevalent as providers of information, the methods used to convey and receive information have undergone some radical changes. This evolution from human-human to machine-human information sharing was merely the beginning. It was not sufficient for machines simply to replace humans as providers of information; they had to provide larger quantities of well-presented information and more possibilities for interactivity.
The traditional keyboard/mouse interaction style has long been regarded as limiting in terms of expressiveness, efficiency and how naturally it can be used, which has led to an interest in the development of alternative methods of input. Similarly, the output produced by such systems has become more dynamic, and exploits enhanced graphical interfaces to provide an enriched audio-visual experience. Much development has occurred in terms of input and output facilities. Output, in particular, can consist of a richly structured and interlinked collection of multimedia objects. However, little thought has been devoted to the expressiveness of combined multiple input modalities. Users are unlikely to receive the information they require, or be able to refer to it appropriately when they receive it, unless the same consideration is given to the input as is increasingly being given to the output.
This paper discusses the relationship between multimodality input and multimedia output, with specific consideration of the implications of multimedia for the form, content and meaning of multimodality input. The paper concludes by discussing the potential for adopting a "multi-modelling" approach to multimodality input. The proposed approach considers the relationship between utterances comprised of interrelated expressions in various modalities.
The implications of multimedia output for multimodality input
Consider the transmission of information from one person to another, or a group of people, perhaps in a classroom setting. The interaction that takes place in this situation is clearly multimodal. The instructor uses voice as a primary channel of communication, both for teaching-lecturing and the question-answer process that takes place in instructional interaction. At times, the instructor may turn to a black/white board, thereon writing text or drawing diagrams. Text may also be included in the form of handouts, textbooks and notebooks. From the moment of physically entering the classroom, even if not a word is uttered, interaction is taking place - via body language, or the location of the participants in the classroom (typically the instructor at the front and the students at desks), through posture, facial expressions, gaze, and hand gestures. The instructor will use gestures to support speech, but also to interact with other modalities present in the classroom, for example, images (diagrams, pictures, maps and posters), videos or overhead projections. Students will also use these modalities in their interaction within the classroom; speech in the question-answering and discussion processes, gestures and text.
Figure 1. Human-Human Information Transmission
Figure 1 illustrates information transmission between humans. In this scenario, the instructor is the information provider, disseminating information to the receivers using speech, body language and various artefacts. At the same time, the information receiver (e.g. a student) expects to obtain information in the form of speech, the instructor’s body language and through artefacts. The dialogue that takes place involves interaction featuring mainly the speech and body language of the two parties. However, it also involves their individual interactions with artefacts that function as vehicles of interaction. Consider what happens when such a situation takes a computer-based form, as depicted in Figure 2.

Figure 2. Machine-Human Information Transmission
Over recent years, many systems have been developed for the dissemination of information. One example is a multimedia encyclopaedia. Another is the computer-based training system. In such circumstances, there is no longer a human information provider. The digital artefact assumes this role. This has consequences for the information receiver. Today, the information receiver, or user, is usually faced with multimedia information. This information may be delivered in a highly structured form, where the user is guided to the result of the interaction, so that interaction and control are probably limited. Alternatively, the information may be poorly structured, so that a high level of interaction and user control is required. Neither of these two extreme situations is optimal for the user. They may result in a confused user who feels overwhelmed or insufficiently informed, or is perhaps reduced to being little more than a "page turner" of an "electronic book". The facilities available for the user to use the presented output as the basis for further specification or refinement of their information needs are severely limited. The output has a much higher level of expressiveness than that provided by the input.

Figure 3. Modalities of Input and Output
When read left to right, Figure 1 suggests an instructor passing on information to students. A much less prescribed model is suggested by Figure 3. Now the learner/user is constantly providing inputs, whether these be answers to questions, requests for information on a particular topic, or the selection of links in order to browse through the constantly changing information. Of course, a superficially similar dialogue can happen in human-human communication, but then the use and understanding of a much richer combination of modalities is available to both participants.
The implications of the immediately preceding discussion are as follows. The user has access to a wide range of modalities through which he or she typically interacts. Interaction should involve bi-directional communication. In human-computer interaction, while there is often great richness in the system’s output, users’ ability to make use of the modalities available to them is severely limited. Moreover, having received the multimedia output, the facilities available for the user to specify to the system what it is that he or she really requires are also typically very limited. Facilities such as pointing, clicking, and maybe simple speech and primitive gestures are provided for, but almost always each modality is considered in isolation from any other.
To date, most research in multimodal interaction has focused on using each modality separately, or in pairs, i.e. speech and gesture, gesture and gaze. Very little research has been carried out on the basis of a detailed analysis of human-human multimodal interaction. As a result, crucial contextual metadata are overlooked, such as the effect of facial expressions, gestures, or voice intonation on the meaning of an utterance.
However, the above discussion is not meant to imply that a multimodality interface will always be used in a multimodal way by its users. Oviatt (1999) points out that human–human communication is typically a mixture of unimodal and multimodal. In a well-designed system, individuals would be able to choose whether to interact multimodally or not. Often, this choice would be made on the basis of the activity being carried out, or the context in which that activity takes place. Though users favour the ability to interact multimodally, they do not always choose to do so, and they usually explore the use of each modality separately and then form their own pattern of interaction.
A further problem in human-computer interaction is whether designers should tailor their systems to the user, or users tailor their interaction patterns to the system. Considering multimodality interaction, the problem is the extent to which we can assume that multimodality will be exploited in a uniform way by different users. In Oviatt’s study (Oviatt, 1999), users adopted either simultaneous or sequential integration patterns when combining speech and pen input. Each user’s integration pattern was established early and remained consistent, but nevertheless, each user’s pattern was unique. As an aside, it is probably the case that similar individual patterns of usage also apply to users’ use of multimedia output.
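As an illustration only, the distinction between simultaneous and sequential integration can be sketched in code. The following Python fragment is a minimal sketch that assumes each input signal is available as a time-stamped interval. The names `Signal`, `integration_pattern` and `dominant_pattern`, and the use of temporal overlap as the deciding criterion, are our own assumptions for the purpose of illustration and are not taken from Oviatt's study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One input signal in a single modality (hypothetical representation)."""
    modality: str  # e.g. "speech" or "pen"
    start: float   # onset time in seconds
    end: float     # offset time in seconds


def integration_pattern(a: Signal, b: Signal) -> str:
    """Label a pair of signals by whether their time intervals overlap.

    Assumption: "simultaneous" means the intervals overlap,
    "sequential" means one ends before the other begins.
    """
    overlaps = a.start < b.end and b.start < a.end
    return "simultaneous" if overlaps else "sequential"


def dominant_pattern(pairs) -> str:
    """Return the most frequent label over a user's (signal, signal) pairs."""
    labels = Counter(integration_pattern(a, b) for a, b in pairs)
    return labels.most_common(1)[0][0]
```

Under these assumptions, a user whose early inputs are mostly overlapping would be labelled "simultaneous", and a system could use that label to decide how long to wait for a second modality before interpreting the first.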
A further important factor in multimodal systems is the extent to which the integration of modalities introduces redundancy in the content specified by different modalities. However, such redundancy is often better regarded as complementarity. The ability to convey the same information in several different modalities does not imply that a user will use all of these modalities to interact at any one time; rather, the user may choose which modality or combination of modalities is suitable at the given moment, in the particular context. Likewise, if the system produces output involving multiple media types, the user will often focus on a preferred media format, which carries the risk of missing important information. Thus, redundancy is often a useful property of multimedia output. Its implications for multimodality user input have yet to be fully explored.
The relationship between modalities
When developing multimodal systems, most designers consider speech and pointing to be the dominant integrated modalities. This has resulted in a variety of "point-and-speak" systems that are elaborate versions of "point-and-click" systems. Linguistic analysis of spontaneous manual gesturing in human multimodal communication shows that pointing gestures account for less than 20% of all gestures. However, this frequency should not be assumed to reflect their level of significance. "Point-and-speak" systems ignore other modalities such as manual gestures and facial expressions, which are capable of generating symbolic information that is more richly expressive than simple object selection. This clearly limits the expressiveness available to the user.
To us, the term "multimodal input" implies the existence of simultaneous or temporally co-ordinated expressions in a variety of modalities. The two most frequently combined modalities in human-human multimodal interaction, speech and gesture, are highly interdependent and synchronized during interaction. They are not always simultaneous, as gesture can often precede speech, or complement it by conveying information that is not explicitly uttered. Such cases typically involve a quick switch from speech to gesture and back to speech. This is accomplished so quickly and blended so naturally that it is perceived as simultaneous.
The view of linguists and some computer scientists that speech is a primary input mode has biased early multimodal systems towards speech input and "point-and-speak" systems. This has made speech the primary input mode in most multimodal systems in which it is included. Unfortunately, this has led to systems that treat the other modalities employed as secondary, and thus fail to recognize information that is not present in the speech. Speech is not the exclusive carrier of information. Even in a simple "point-and-speak" interface, it is possible to imagine a scenario in which both modalities in a particular activity are indispensable components of the meaning of the "utterance". Consider telling a system to move a previously marked block of text to a new location:
"move that" [spoken, accompanied by] pointing to block of text "to there" [spoken, accompanied by] pointing to target location
As this simple example demonstrates, when users interact multimodally they selectively eliminate linguistic complexities and replace them with an interaction pattern that involves unimodal and multimodal aspects. However, what results is a complex "linguistic" structure in which meaning depends on the temporal and significant relationship between expressions in the two modalities.
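The dependence of meaning on the temporal relationship between the two modalities can be made concrete with a small sketch. The Python fragment below is illustrative only. It assumes that a speech recognizer supplies time-stamped words and that a pointing device supplies time-stamped targets. The names `Word`, `Point` and `bind_deictics`, the set of deictic words, the one-second window and the nearest-in-time rule are all our own hypothetical choices, not a description of any existing system.

```python
from dataclasses import dataclass

# Hypothetical set of spoken words that need a pointing gesture to be resolved.
DEICTICS = {"that", "this", "there", "here"}


@dataclass(frozen=True)
class Word:
    text: str
    time: float  # time at which the word was spoken, in seconds


@dataclass(frozen=True)
class Point:
    target: str  # what the pointing gesture selected
    time: float  # time of the gesture, in seconds


def bind_deictics(words, points, max_gap=1.0):
    """Pair each deictic word with the nearest unused pointing gesture.

    A gesture may precede or follow the word, so the absolute time
    difference is used. Words with no gesture within max_gap seconds
    are bound to None.
    """
    unused = list(points)
    bindings = []
    for word in words:
        if word.text not in DEICTICS:
            continue
        candidates = [p for p in unused if abs(p.time - word.time) <= max_gap]
        if not candidates:
            bindings.append((word.text, None))
            continue
        best = min(candidates, key=lambda p: abs(p.time - word.time))
        unused.remove(best)
        bindings.append((word.text, best.target))
    return bindings


words = [Word("move", 0.0), Word("that", 0.3), Word("to", 1.5), Word("there", 1.8)]
points = [Point("text_block", 0.2), Point("target_location", 1.9)]
print(bind_deictics(words, points))
# [('that', 'text_block'), ('there', 'target_location')]
```

Without the pointing events, both deictic words remain unbound, which reflects the point that neither modality alone carries the full meaning of the "utterance".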
Different input modalities can be used to specify different content. The different modalities found in emerging technologies that recognize speech, handwriting, manual gesturing, head movement and gaze can significantly differ in the information they specify. They can also differ in their functionality during communication, the ways in which they are integrated with each other and their suitability for incorporation into different interface styles. In some cases, a given modality can be a simple analogue of another, in the sense that there is a direct translation between one and the other. However, in many cases, modalities vary in the degree to which they represent similar information, with some groups of modalities being more similar (speech and writing) than others (speech and facial expression).
Conclusions: towards "multi-models" of multimodality input
Multimodality has the potential to facilitate richer interaction styles in both information retrieval and learning environments. However, its true potential will not be realised unless consideration is given to the application of combined modalities, both simultaneously and over time. Progress has long been made in the structural and grammatical analysis of language, where the term is usually meant in the unimodal sense, as it applies to, say, spoken English, the structural and semantic analysis of which is, of course, a well-established field. However, mixed modality interaction, while drawing on the various languages of speech, gesture, etc., implies that account must be taken of the relationship between the simultaneously expressed statements from each of these languages. For example, the utterance "we’ll get this paper finished by this evening", when accompanied by the quickly raised eyebrows of the speaker, might mean something quite different from the same utterance accompanied by the speaker’s reassuring smile. For multimodality interaction, then, the corresponding "grammar" would describe the structure of mixed modality "sentences", and the lexicon would map out the meaning of mixed modality "words". The meaning of an utterance would be inextricably linked with all of the multimodality components of the "utterance" and the relationship between them. In this respect, what is required is a "multi-model" of multimodality communication. Such a model would enable us to specify and interpret mixed-modality inputs, and support an expressiveness and flexibility of input to match that increasingly found in forms of output.
At the time of writing, multimodality interaction in HCI is much less sophisticated than that offered by the combination of speech, gestures and other modalities found in everyday human-human interaction. However, even the standard typing, pointing, and clicking interface offers gestural possibilities (the selection of a portion of text with a mouse is essentially gestural, after all) that have hitherto been almost exclusively applied unimodally. Thus, the central argument of this paper applies to current, as well as future, systems.
Finally, we assert that multimedia output from a system actually requires multimodality on the part of the user. Communicating with a system about a diagram, for example, requires more than just speech, text and simple pointing. The effectiveness of a diagram may be lost if the participants in a discussion about that diagram must constantly translate their knowledge of the diagram into an alternative form to express it to the other participants. In other words, a final requirement of the "multi-model" of multimodality is that it considers the role played by the media that are referred to by the input, since, for example, even the meaning of a simple gesture such as a wave of the hand will depend partly on properties of the referent of that gesture. The diversity of these properties of multimedia information will open up new expressive possibilities for multimodal communication in human-computer interaction. The "multi-model" of multimodality communication may provide a framework in which to address such issues.
Acknowledgements
The authors are supported by the Distributed Multimedia Research Group. Zoë Swiderski’s research is also supported by the Engineering and Physical Sciences Research Council (EPSRC) and British Telecom Laboratories.
References
Oviatt, S. (1999). "Ten Myths of Multimodal Interaction." Communications of the ACM 42(11): 74-81.