A Computational Semiotic Framework for Interactive Cinematic Virtual Worlds

Craig A. Lindley

CSIRO Mathematical and Information Sciences

E6B Macquarie University, North Ryde, NSW Australia 2113

Craig.Lindley@cmis.csiro.au

 

Abstract

Film semiotics has analysed the nature of codification in cinematic images, resulting in the identification of several distinct and interacting levels of codification. Studies in film semiotics have also considered questions of syntactic form in cinema, and narrative form has been extensively analysed. Research in interactive video presentation generation has developed techniques based upon film semiotics for automatically selecting and assembling video presentations from a database of underlying video clips; the resulting presentations may have either a narrative or a categorical syntactic form. Real-time graphics rendering from three dimensional object models supports a more complete synthesis of moving image content, based upon the simulation of the diegetic world of a production. The design of computational agents within such a virtual world can be based upon a variety of behavior generation methods, ranging from reaction rules to goal-directed and deliberative planning, and then to high level, natural language-like scripts. Behavior generation can occur at varying levels of encapsulation, from the world level to object, agent, and discursive agent levels. These concepts can be integrated into an overall framework in which the multi-level semiotic model provides the high level structure of an ontology for action generation, where detailed actions may be scripted or created either reactively or deliberatively from categorical, narrative, or hybrid principles at each level of encapsulation. The result is a multi-dimensional analytical and synthetic matrix implying a wide range of interactive cinematic forms, each providing different expressive possibilities within computational systems viewed as a meta-medium.

Introduction

Interactive cinema refers to interactive media experiences that may have cinematic qualities, in the sense of achieving cinematic levels of presentation quality, viewer impact, and narrative strength in visual storytelling. Interactive narrative systems that emphasise story elements have frequently been text-based (see Murray, 1997). Multi-path movies (see www.bde3d.com) and similar commercial or experimental interactive video productions have replicated the interactive fiction model by having users select between alternative video segments at key decision points in a branching narrative structure. Computer games typically have more intensive interaction requirements, although the narrative content is frequently limited to highly stylized and/or ritualized scenarios (eg. involving actions limited to shooting weapons, dodging missiles, evading monsters, and collecting "easter eggs") within high level structures characterized by phases of high interaction leading to very strict points of narrative convergence (at level transitions). Computer games achieve high levels of engagement among their audience, often with very long play times for individual productions. This naturally raises the question of whether such levels of intense interaction and engagement can be achieved for broader audiences by creating game-like systems that support broader and deeper forms of narrative.

The development of more interactive and deeper narrative systems is an issue of both aesthetics (in a broad sense of artistic function) and of technology. On one hand, aesthetic functions can provide specifications for the development of new technical approaches, while on the other hand, new computational techniques can suggest new aesthetic functions and interactive possibilities. The ongoing development of interactive media must therefore be a dialectical process between media arts theory and practice, and technical developments in computing science. The computational semiotics of new media can be seen as an attempt to capture that dialectical process, by drawing upon detailed analyses of signification in both traditional and new media forms as a foundation for defining computational techniques. This paper considers a number of distinctions defined in film semiotics and film theory, and brings those distinctions together with a number of computational techniques for generating behavior in autonomous agents and world models. The result is a multidimensional matrix onto which some existing interactive cinema forms and productions can be mapped, but also suggesting new forms that have not yet been realized in particular media productions. The matrix provides a framework by which existing or new productions can be analysed or specified. It also suggests a model for a comprehensive computational architecture for interactive cinema, within which the realization of a specific form can be achieved at a high level of abstraction and content authorship. Ongoing research and development is required to realize the complete architecture.

Levels of Cinematic Codification

Based upon the film semiotics pioneered by the film theorist Christian Metz (1974), we identify five levels of cinematic codification, representing levels of semantics that a cinematic presentation may have:

  1. the perceptual level: the level at which visual phenomena become perceptually meaningful, the level at which distinctions are perceived by a viewer within the perceptual object. This level includes perceptible visual characteristics, such as colours and textures.
  2. the diegetic level: at this level the basic perceptual features of an image are organised into the four-dimensional spatio-temporal world posited by a video image or sequence of video images, including the spatiotemporal descriptions of agents, objects, actions, and events that take place within that world.
  3. the cinematic level: the specifics of formal film and video techniques incorporated in the production of expressive artifacts ("a film", or "a video"). This level includes camera operations (pan, tilt, zoom), lighting schemes, and optical effects. It should also include more synthetic elements of cinematic construction, such as layers and inserts within a composited image, image elements consisting of different modalities (eg. a video with overlayed still image areas and text blocks), and possibly also image elements supporting viewer interaction (buttons and menus).
  4. the connotative level: this is the level of metaphorical, analogical, and associative meanings that the denoted (ie. diegetic) objects and events of a video may have. The connotative level captures the codes that define the culture of a social group and are considered "natural" within the group.
  5. the subtextual level: this is the level of more specialised meanings of symbols and signifiers. Examples might include feminist analyses of the power relationships between characters, or a Jungian analysis of particular characters as representing specific cultural archetypes.

A model of the meaning of a video sequence can involve the description of the sequence at any or all of the levels described above. The different levels interact, so that, for example, particular cinematic devices can be used to create different connotations or subtextual meanings while dealing with similar diegetic material. In the case of video, cinematic and perceptual level descriptions may be generated automatically to an increasing extent (eg. see Aigrain et al, 1996). Subtextual and connotative descriptions are necessarily created by hand. The diegetic level represents an interface between what may be detected automatically and what must be defined manually, with ongoing research addressing the further automation of diegetic modelling (eg. Kim et al, 1998).

These levels of cinematic codification apply to computer-generated animation and virtual worlds as much as they do to film and video, where the function of a physical camera provides the model for the projection of an image of the world onto a framed viewing plane from a particular virtual viewpoint. For computer generated imagery, the basis for perceptual and diegetic levels of codification is explicitly modeled in the system (eg. as 3D structure models and texture maps). The diegesis is then generated from the models of one or more virtual environments containing computational agents with behavioral capacities representing action potentials, where the specific unfolding diegesis is a result of the users’ interaction with the underlying world model and its interactive narrative possibilities. Those narrative possibilities may also encode specific connotative and subtextual messages, functioning to deepen the overall semantics of a production, and realize its themes. Connotative and subtextual coding applies to the visual, dynamic, and behavioral design of agents, the world, objects, locations, architecture, etc., as elaborated below.

The advantages of separating these layers in the design of ontologies for producing interactive cinema are currently mostly hypothetical. A video semantics metamodel based upon these levels is the core component of the FRAMES system for generating video presentations dynamically from an underlying database of video clips (Srinivasan et al, 1999, Lindley, 2000). The FRAMES schema includes thirty-two video annotation types, subclassified by codification layer, with a variable number of description types available within each annotation type, depending upon the production. Users of the FRAMES system to date have tended to use small but different and partially overlapping subsets of the full set of available annotation types, depending upon the author of the annotation space design. The multi-layered model provides a set of useful distinctions for discussing semantic modelling. It is hoped that incorporating this model into a more generic framework for interactive cinema will assist in the identification of levels of image and interactive experience synthesis knowledge that may be differentially transferable across productions. For example, a subtextual and connotative ontology may be transferable between productions involving different media modalities, from text to video and computer-generated animation, that have quite different presentation levels (eg. text relies upon layout and literary stylistic codes, rather than cinematic codes).
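To make the layered annotation idea concrete, the following is a minimal Python sketch of how annotations might be grouped by codification level. The class names, field names, and example annotation types (such as "camera_operation") are illustrative assumptions; they are not taken from the FRAMES schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    """The five levels of cinematic codification listed above."""
    PERCEPTUAL = 1
    DIEGETIC = 2
    CINEMATIC = 3
    CONNOTATIVE = 4
    SUBTEXTUAL = 5


@dataclass(frozen=True)
class Annotation:
    level: Level      # codification layer the annotation belongs to
    type_name: str    # hypothetical annotation type, e.g. "camera_operation"
    value: str        # description value chosen for a given production


@dataclass
class VideoComponent:
    component_id: str
    annotations: list = field(default_factory=list)

    def at_level(self, level):
        """Return only the annotations made at one codification level."""
        return [a for a in self.annotations if a.level == level]


# A hypothetical clip annotated at four of the five levels.
clip = VideoComponent("clip-001", [
    Annotation(Level.PERCEPTUAL, "dominant_colour", "red"),
    Annotation(Level.DIEGETIC, "action", "walking"),
    Annotation(Level.CINEMATIC, "camera_operation", "pan"),
    Annotation(Level.CONNOTATIVE, "mood", "menace"),
])
```

Keeping the level as an explicit field means that, for example, the connotative and subtextual annotations of a production could be extracted and reused separately from its cinematic annotations.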

Models of Syntactic Form for Cinematic Presentations

Different theorists focus on different qualities that characterise narrative (see Stam et al, 1992), and the meaning of the term "narrative" varies from narrow interpretations involving strong spatio-temporal continuity to very broad interpretations in terms of the overall formal, rhetorical, or thematic coherence of a production. Narrative in a broad sense has been the goal of numerous research projects dealing with diverse media, from text to interactive 3D systems (see, for example, Mateas and Sengers, 1999). Research concerned with the construction of narrative video sequences by the selection and ordering of clips from a video database has tended to use a narrow interpretation of narrative in the sense of continuity-edited depictions of causally interconnected actions and events. When narrative is understood in this narrow sense, it is meaningful to characterize alternative, non-narrative forms for the organization of filmic material. Bordwell and Thompson (1997) identify four non-narrative forms: categorical, associational, abstract, and rhetorical.

A close analysis of these forms suggests a reorganization into three primary types. Categorical films use subjects or categories as a basis for their syntactic organisation, typically basing each segment of the film on one category or subcategory. Common examples of stereotyped categorical films include lifestyle and gardening programs, travelogues, and sporting programs. Narrative and rhetorical film forms are both distinguished from categorical films by their creation of new meanings (ie. meanings not expressed by the constituent cinematic sequences prior to their conjunction) by the sequential association of initially distinct video sequences. That is, any basic video component in a categorical or associational film can represent a designated meaning expressed in an annotation irrespective of what precedes or follows it. For rhetorical and narrative films, however, the rhetorical and narrative meanings created by the sequential juxtaposition of basic video components are not, and generally cannot be, conveyed by those components in isolation. Rhetorical films present an argument and lay out evidence to support it. The aim of a rhetorical film is to persuade the audience to hold a particular opinion or belief. Rhetorical films will frequently present arguments as if they are observations or facts, and will typically fail to present any opposing views. A standard description of rhetorical form suggests that it begins with an introduction of the situation, goes on to a discussion of the relevant facts, then presents proofs that a given solution fits those facts, and ends with an epilogue that summarises what has gone before. Common examples of rhetorical films are television commercials, although news programs can also be analysed in terms of the rhetorical functions of their parts (Lindley et al, 2000).

Each of these forms represents a different (partially codified) syntactic structure for film sequences, and for time-based media presentations in general. All of these forms are associative, since each is concerned with meaning created by the sequential association of images. Hence associative form as described in Bordwell and Thompson’s original classification scheme is best regarded as a supercategory, and includes forms not covered by the three primary categories identified here (ie. it should include meanings created by a combination of sequences that are not limited to a pattern of categorical similarities and dissimilarities, or to rhetorical or narrative functions). The various forms apply at multiple levels of cinematic structure: a given cinematic sequence may involve multiple forms at the same level, and multiple forms may occur at different levels. The various formal models support different algorithmic presentation generation techniques in the context of interactive cinema.

Levels of Behavior Generation in Virtual Worlds

Action generation in virtual worlds can be regarded in terms of four levels of control, as depicted in Figure 1. These levels and their components are identified at a level of authorship that lies between the represented world that a user of the media system experiences and the implementation architecture; this is the level at which a priori classes of represented entity can manifest different behaviors within a range of possible behaviors that is constrained by the behavior generation representations provided for a particular production. The arrows in the figure suggest the inheritance of behaviors from higher levels to lower levels of the model. (A similar architecture might be used for implementation, in which case a one-to-one mapping between software components and representational components cannot necessarily be assumed.)

Each level is distinguished by a scale of encapsulation, and by a set of primitive components to which it may refer. For instance, world level scripts are not encapsulated within any kind of representational structure at a level smaller than the world level in Figure 1, while agent-level scripts are encapsulated within a representation of an agent class or instance. Each level may be realized by the interpretation of a more or less detailed script or system of scripts, by deliberative planning to achieve high level goals for that level of control, or by the execution of reaction rules in response to user actions or external state changes.


Figure 1. Levels of behavioral control representation for virtual worlds.

The distinction between planning approaches and behavioral approaches is based upon the extent to which action generation relies upon the creation of a model of (some part of) the world and its contents, as opposed to the creation of behaviors. A purely behavioral control system can only refer to goals, internal state representations, simulated sensory inputs and behavioral outputs at a given level of the hierarchy. Alternatively, a more traditional planning-based approach, referred to by Brooks (1999) as the sense-model-plan-act, or SMPA approach, relies upon the construction of a model of the external (virtual) environment as a basis for action determination, and may maintain a description of the current state of progress with respect to a complex hierarchical or extended linear action plan. Behaviors may also be scripted at higher levels, where the scripting language is closer to a natural language (eg. Goldberg, 1997). This distinction suggests levels of authorship for behavioral control, such that a set of agents may be authored at a planning and/or behavioral level, and then the behaviors created for those agents provide the semantics for higher level scripts. This is not a necessary architecture, however, and a system could be completely authored at an agent level, or a general scripting language may be detailed enough to directly drive a rendering system interface.

The world level of the hierarchy in Figure 1 defines behavior from a global perspective. An example of this is a multi-path movie limited to world level control, which moves the user through a branching narrative structure of video clips according to simple responses to questions and options (www.bde3d.com presents a number of examples of systems of this kind). In this case, all agent or character behaviors are presented as fixed performance representations within the video clips, and are not actively varied by the system other than at the story level achieved by the sequencing of clips.

Beneath the world level, more variability can be achieved by scripting object and agent behaviors. Objects are generally passive, with most object control amounting to modeled physical responses to outside perturbations (eg. how to bounce when kicked). As such, object planning is not necessary, and object behavior may be implemented by a generic physics or mechanics engine that does not require detailed scripting. Object scripting in this case may be limited to basic existential statements (that an object of a particular kind exists at a particular time and place under specific conditions), and parameterisation of object characteristics. If objects have autonomous changes over time, they are better regarded as agents.

Agents can be regarded as objects having autonomous behaviors. Autonomous behaviors are those that are not simply the result of generic physical laws within the virtual world (eg. falling under the influence of gravity at the world level), but originate within the control architecture of the agent itself. Autonomous behaviors range from simple reactive stimulus-response rules characteristic of behavioral systems, to complex sequences of behavior requiring higher level, deliberative AI-planning techniques.

Discursive agents are those having some behaviors that serve purely communicative or expressive functions. Discursive behaviors also potentially range from reactive responses (eg. using ELIZA-style language generation rules, or Expressivator-style gesture rules, Sengers, 1998) to the autonomous generation of stories (Bickmore and Cassell, 1999) and the ability to use a conversational input as a goal for complex planning (Cavazza et al, 1999). More advanced discursive capabilities might include the ability of a virtual agent to interact with another agent or the user through a more complex discourse structure, such as an extended conversation or explanations for problem solving, commentary, post-mortems or tutoring (eg. André and Rist, 2000). The distinction between discursive and non-discursive agents is not meant to suggest that behavior does not or cannot have communicative functions; rather it is intended to separate explicitly controlled communicative behavior from behavior conceived pragmatically that has only implicit communicative functions.

These techniques can be combined, for instance, by using a script interpreter to read a high level or natural language expression that is converted into one or more high level goals that are then passed by the interpreter to a deliberative planner, or to instantiate the parameters of, or activate, a set of reaction rules. Different interactive media systems use different subsets of the components identified in Figure 1, in some cases relying upon a single approach. In effect this is a two dimensional taxonomy for classifying behavior generation techniques, with scripting/planning/behavioral categories along one dimension and world/object/agent/discursive agent categories along the second dimension. This taxonomy can also be used as a specification for behavior generation facilities that could be made available within a comprehensive software architecture for interactive cinema.
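As a rough illustration, the two-dimensional taxonomy can be written down as the product of two enumerations, against which a given system is classified by the cells it uses. All names here are hypothetical, and classifying the multi-path movie as world-level scripting is our reading of the earlier description rather than a definitive classification.

```python
from enum import Enum
from itertools import product


class Technique(Enum):
    SCRIPTING = "scripting"
    PLANNING = "planning"
    BEHAVIORAL = "behavioral"


class Encapsulation(Enum):
    WORLD = "world"
    OBJECT = "object"
    AGENT = "agent"
    DISCURSIVE_AGENT = "discursive agent"


# Every cell of the two-dimensional taxonomy (3 techniques x 4 levels).
TAXONOMY = set(product(Technique, Encapsulation))


def classify(system_cells):
    """Return the taxonomy cells a system uses, rejecting unknown cells."""
    cells = set(system_cells)
    unknown = cells - TAXONOMY
    if unknown:
        raise ValueError(f"not in taxonomy: {unknown}")
    return cells


# Assumption: a multi-path movie uses only a world-level branching script.
multi_path_movie = classify({(Technique.SCRIPTING, Encapsulation.WORLD)})

# Cells that such a system leaves unused, ie. facilities it does not need.
unused = TAXONOMY - multi_path_movie
```

Read as a specification, the unused cells indicate which behavior generation facilities a comprehensive architecture would need to offer beyond those required by this particular form.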


Categorical, Narrative, and Hybrid Behavior and Path Generation

Categorical Sequence Generation

The CSIRO FRAMES project has developed a categorical system for dynamic virtual video sequence synthesis from databases of video data (see Lindley, 2000). The generation of dynamic virtual videos in the FRAMES system is based upon annotations of stored video, together with a specification of the videos that are to be created, and queries embedded within specifications expressed using descriptors common to the content models. Video annotations are based upon the multi-level model of video semantics described above. Once video components have annotations generated for them, the annotations are stored in a database. The high level structure of a virtual video (ie. an interactively generated video presentation) is expressed in a virtual video prescription that can incorporate direct references to specific video components, parametric queries based upon exact or approximate matching of annotations to a query expression, and specifications that initiate the generation of an associative chain of video content. The generation of video sequences by categorical (referred to as associative) chaining was first demonstrated in the MIT Automatist system (Davenport and Murtaugh, 1995, Murtaugh, 1996). The FRAMES prototype extends this concept with the development of a multi-level semantic model for video, together with a flexible specification language and chaining algorithm, and weighting mechanisms to implement deep or broad coverage within particular annotation types.

Categorical chaining in the FRAMES system is a method of generating video sequences based upon patterns of similarity and dissimilarity in annotations. Chaining starts with specific parameters that are progressively substituted as the chain develops. At each step of categorical chaining, the video component selected for presentation at the next step is the component having annotations that most match the association specification when parameterised using values from the annotations attached to the video segment presented at the current step. The high-level algorithm for categorical chaining is:

  1. Initialise the current state description according to the association specification. The current state description includes:

  2. Generate a ranked list of video sequences matching the current state description.
  3. Replace the current state description using annotation values from the most highly ranked matching video component: this becomes the new current state description.
  4. Output the associated video component identification for the new current state description to the media server.
  5. If further matches can be made and the termination condition (specified as a play length, number of items, or associative weight threshold) is not yet satisfied, go back to step 2.
  6. End.

Since categorical association is conducted progressively against annotations associated with each successive video component, paths may evolve significantly away from the annotations that match the initial specification. This algorithm has been implemented in the current FRAMES demonstrator. Specific filmic structures and forms can be generated in FRAMES by using particular annotations, association criteria and constraints. In this way the sequencing mechanisms remain generic, with emphasis shifting to the authoring of metamodels, annotations, and specifications for the creation of specific types of dynamic virtual video productions. However, the basic form created explicitly by the chaining engine is categorical. The data model associates typed annotations with video segments. Annotation types can be created that represent category types, and the annotations themselves can be category names. A categorical associative chain is initiated by sending an association specification to the association engine. The specification includes the category types to chain on, as well as initial category values and possible constraints upon values. For a categorical film, the supercategory or general topic (if specified) can be represented by a constrained (and hence unchanging) category value. The subcategories to move through are then represented as unconstrained category types. The rate at which the categories change can be determined by a weighting attached to the subcategories: the higher the positive weighting, the more slowly the categories will change, while the more negative the weighting, the faster the categories will change. Hence for n category types, the association engine moves through a video annotation search space of n dimensions.
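The chaining loop and the effect of weighting can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the FRAMES implementation: annotations are reduced to one value per annotation type, matching is exact, components are never repeated, termination is by item count only, and the function and parameter names are hypothetical.

```python
def categorical_chain(components, initial, weights, constraints=None, max_items=10):
    """Greedy categorical chaining over annotated components.

    components:  {component_id: {annotation_type: value}}
    initial:     starting values for the chained annotation types
    weights:     {annotation_type: weight}; a positive weight rewards matching
                 the current value, a negative weight rewards differing from it
    constraints: {annotation_type: value} every selected component must match
    """
    constraints = constraints or {}
    state = dict(initial)            # current state description
    remaining = dict(components)
    played = []

    while remaining and len(played) < max_items:   # termination condition
        # Rank the candidates that satisfy the constrained category values.
        candidates = [
            (sum(w for t, w in weights.items() if ann.get(t) == state.get(t)), cid)
            for cid, ann in remaining.items()
            if all(ann.get(t) == v for t, v in constraints.items())
        ]
        if not candidates:
            break                    # no further matches can be made
        # Highest score wins; ties are broken by component id for determinism.
        _, best = min(candidates, key=lambda c: (-c[0], c[1]))
        # The chosen component's annotations become the new state description.
        state = {t: remaining[best].get(t) for t in weights}
        played.append(best)          # stands in for output to the media server
        del remaining[best]          # assumption: no component is repeated
    return played


clips = {
    "a": {"topic": "garden", "plant": "rose"},
    "b": {"topic": "garden", "plant": "rose"},
    "c": {"topic": "garden", "plant": "fern"},
    "d": {"topic": "sport", "plant": "rose"},
}
# Positive weight: stay within the current subcategory as long as possible.
slow = categorical_chain(clips, {"plant": "rose"}, {"plant": 1.0}, {"topic": "garden"})
# Negative weight: change subcategory at every opportunity.
fast = categorical_chain(clips, {"plant": "rose"}, {"plant": -1.0}, {"topic": "garden"})
```

With the constrained topic held at "garden", the positively weighted chain exhausts the "rose" clips before moving on, while the negatively weighted chain switches subcategory as soon as it can; clip "d" is never selected because it violates the constraint.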

The annotations for video data may be regarded as a set of categories used to characterise a semantic subspace. Applied to virtual environments, the annotation space defines a semantic hyperspace for the classification of virtual objects at several levels of resolution. In terms of the layered world representation model described above, these semantic subspaces may be used to classify state subspaces or behavior generation structures at the world, object, agent, and discursive agent levels. The association algorithm therefore defines a principled method of moving through these subspaces according to patterns of categorical similarity and dissimilarity at the perceptual, cinematic, diegetic, connotative, and/or subtextual levels. A variety of strategies for invoking this associative path generation method are described below.

Semantic Space Characterisation

A complete set of annotation values for a hyperspace at a particular level of resolution constitutes a form of index over that hyperspace, where a tuple of annotation values for each available type may identify one or more subspaces of the hyperspace. The interpretation of the semantic subspace represented by the index depends upon the level of modelling that the subspace is used to index (eg. it may be a subworld or an agent behavior). The behavior of a sequencing algorithm can then be regarded in terms of a search or traversal across the index structure.

A complete index expression may be said to be one that includes an instance value for each annotation type in the annotation set; ie. it is a conjunction of index terms that includes a term for each annotation type in the annotation scheme. In order to relate index structure to search behavior, and hence path or behavior generation results, it is useful to define two characteristics that the annotation space (ie. index) may have:

1. completeness: the index (or annotation set) is said to be complete if and only if each semantic hyperspace component is uniquely determined by some complete index expression (ie. unique set of annotations).

By this definition, an index is incomplete if a complete index expression determines more than one hyperspace component (the 1 : n case, category to subspace), or if there are hyperspace components that have no complete indexation (the 0 : n case).

2. minimality: the index (or annotation set) is said to be minimal if and only if every unique complete index expression uniquely determines some hyperspace component or set of components.

By this definition, an index is non-minimal if more than one complete index expression determines some particular subspace, if a subset of index terms within a complete index expression determines some particular subspace (the n : 1 case), or if there are complete index expressions that determine no hyperspace component (the n : 0 case).

In these terms, a minimal complete index provides a 1 : 1 mapping between the set of complete index terms and the set of hyperspace components. An incomplete index may still be useful or desirable, except in the case where there is no index at all for one or more subspaces (the 0 : n case), since a non-indexed subspace cannot be entered by the algorithm; the 1 : n case can be used to design an interactive system in terms of selection from sets of subspaces, rather than individual subspaces. Similarly, a non-minimal index may also be useful, in order to use a particular subspace in more than one context; the exception to this is when a complete index refers to no subspace (the n : 0 case), since that index is then superfluous. An incomplete and non-minimal index provides an n : n mapping between the set of index terms and the set of subspaces that may also be useful. Hence there are four useful types of index to consider: a minimal complete index (the 1 : 1 case), a non-minimal complete index (the n : 1 case), a minimal incomplete index (the 1 : n case), and a non-minimal incomplete index (the n : n case).
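The four useful index types, and the two degenerate cases, can be checked mechanically. The following Python sketch assumes the index is given as a mapping from complete index expressions (here, tuples of annotation values) to sets of hyperspace component identifiers; the function name and data layout are hypothetical. It does not test the further non-minimal condition in which a subset of the terms in an expression already determines a subspace.

```python
from collections import defaultdict


def classify_index(mapping, components):
    """Classify an index as '1:1', 'n:1', '1:n' or 'n:n'.

    mapping:    {complete_index_expression: set_of_component_ids}
    components: all hyperspace component ids
    Raises ValueError for the two degenerate cases (0 : n and n : 0).
    """
    expressions_for = defaultdict(set)
    for expr, comps in mapping.items():
        if not comps:
            raise ValueError(f"n : 0 case, superfluous expression: {expr!r}")
        for comp in comps:
            expressions_for[comp].add(expr)

    unindexed = set(components) - set(expressions_for)
    if unindexed:
        raise ValueError(f"0 : n case, unreachable components: {sorted(unindexed)}")

    # Incomplete: some expression determines more than one component.
    incomplete = any(len(comps) > 1 for comps in mapping.values())
    # Non-minimal: some component is determined by more than one expression.
    non_minimal = any(len(exprs) > 1 for exprs in expressions_for.values())

    if incomplete and non_minimal:
        return "n:n"
    if incomplete:
        return "1:n"
    if non_minimal:
        return "n:1"
    return "1:1"


components = {"s1", "s2", "s3"}
one_to_one = {("garden", "rose"): {"s1"}, ("garden", "fern"): {"s2"}, ("sport", "golf"): {"s3"}}
one_to_n = {("garden", "rose"): {"s1", "s2"}, ("sport", "golf"): {"s3"}}
n_to_one = {("garden", "rose"): {"s1"}, ("garden", "flower"): {"s1"},
            ("garden", "fern"): {"s2"}, ("sport", "golf"): {"s3"}}
```

A check of this kind could be run over an annotation space at authoring time, so that unreachable components or superfluous index expressions are reported before the association engine is used.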

A non-minimal complete annotation space includes the following forms of the n : 1 case:

- some annotation types are not used in the unique identification of any subspace (hence some annotation types are redundant),

- some annotation instances are not used in the unique identification of any subspaces (hence some instances are redundant)

This notion of "redundancy" is, however, defined in terms of the annotation set functioning as an index. Such redundancy can be a useful and/or desirable feature of an annotation space from other perspectives, such as the aesthetic, phenomenological, or pedagogical. For example, multiple index terms corresponding to alternative semantic characterisations may be used to convey different interpretative perspectives on the same visual content.

From the viewpoint of interaction design, as n becomes larger in the n : 1 case, compared to the overall size of the search space nS, alternative subspace specifications and user interactions that have a semantics defined in terms of the subspace semantics will have a correspondingly decreased influence upon which hyperspace components are accessed. This is because an increasing number of index expressions refer to the same semantic subspace, so the specifications and user interactions that determine different index expressions have decreasing selection value.

A minimal incomplete annotation space is one in which each hyperspace component has a classification in terms of all of the annotation types available, but is not uniquely identified by its classification.

The case of components that have no complete indexation (the 0 : n case) represents a bad design, since those components are inaccessible to the association engine, and hence cannot be accessed via any interactions mediated by the association engine. Cases in which a complete index expression determines more than one component (the 1 : n case) can be used as a general design strategy, so that the total set of subspaces S is a set of subspaces having components s_i that are non-intersecting subsets of hyperspace components.

The behavior of the categorical association algorithm in this case will be the same as for the minimal complete annotation set, except that the roots of the search tree correspond with sets of hyperspace components that may be revisited until all of the components in a given set have been selected. This is a useful annotation strategy for creating specific patterns of sampling by the association algorithm from different groupings (by category) of hyperspace components. It also serves phenomenological, aesthetic, and/or pedagogical functions of manifesting a variety of visual material that has been designated as having the same conceptual characterisation.

A non-minimal, incomplete annotation space (an n : n index) means that multiple index terms can refer to the same subsets of hyperspace components. This case combines the characteristics of the 1 : n incomplete and n : 1 non-minimal cases discussed above. These effects combine to create overlapping semantic subspaces. Overlapping hyperspace components create a more complex search space, and the resulting behaviour of the association engine will be less predictable. The ways in which two subspaces may overlap are:

In both cases, the current context of the categorical association algorithm (ie. the set of specified semantic annotations for a currently selected hyperspace component) may include annotations of more than one index expression involved in the definition of the intersecting subspace. A positively weighted specification will then favor a search restricted to the intersection subspace until it is exhausted, and then move to one of the non-overlapping regions of the intersecting subspaces. The space into which the search process emerges may be different from that from which it entered the intersecting subspace, so the intersection represents a path to a different region of the annotation space. The emergent subspace may intersect with other subspaces, which may in turn intersect with further subspaces. The overall effect may be to create clusters and linked paths of subspaces, the strength of which may have a greater influence upon search behavior, and hence path generation, than either original association specifications or subsequent modifications to specifications created by user interaction. This has been observed in practice with the initial FRAMES virtual video demonstrator application, in which the dynamically generated video sequences quickly evolved towards recurring content in a recurring sequence.

Narrative Sequencing

As mentioned above, narrative film is concerned with the creation of a pattern of cause-effect relationships among the diegetic events, actions, and situations represented by a film. Film editing has developed a strong set of conventions, referred to collectively as the technique of continuity editing, which aims to convey a very strong impression of the continuity of action through multiple shots in order to enhance the creation of narrative meaning. Research in automatically generating narrative video sequences from a database of underlying video clips has attempted to codify editing rules in order to ensure narrative coherence between successive video clips (for example, Nack, 1996, and Nack and Parkes, 1997, describe the AUTEUR system for narrative generation). Continuity editing rules include rules for matching shots on elements of actions to minimize the effects of cuts, as well as rules for how shots of different subjects can be combined to create new meanings, such as conjoining a shot of a person looking with a shot of another object or event to imply that the object or event is the subject of a gaze. In general it is possible to distinguish rules for the construction of a coherent diegesis from rules for cinematically representing that diegesis in order to convey particular connotative or subtextual meanings. Here it is postulated that:

- diegetic generation is a matter of action generation in order to satisfy state goals within the four dimensional diegesis. Hence diegetic generation can draw upon a broad range of techniques from research in autonomous planning and behavior generation for autonomous agents. Diegetic generation is largely concerned with what is to be done within the diegetic world and how it is to be done.

- narrative generation concerns the dynamic generation of diegetic goals according to metalevel goals (or metagoals) of connotation and subtext. This is a metaplanning task from the diegetic perspective. Narrative generation is concerned with the teleology of actions within the diegetic world (ie., why).

- cinematic presentation rules can be applied to an unfolding diegesis to maximize the attainment of narrative metagoals. Cinematic presentation generation is concerned with how diegetic action is to be presented.

These three elements are addressed by the AUTEUR architecture. However, cinematic presentation generation by the selection of predefined video clips from a video database has limited flexibility in practice, due to the difficulty of defining a set of clips that sustain a large number of narratively meaningful sequential combinations. Computer games and virtual environments have strong action generation techniques, supporting a much larger number of possible interactive experiences, but tend to be very weak on narrative, and rarely employ cinematic techniques going beyond simplistic first or third person perspectives (a point analysed in greater detail by Clarke and Mitchell, 2000). A more active presentation system may be disruptive to the sense of immersion that virtual environments create. The key to deeper and more interesting narrative experiences in interactive cinema may therefore lie in systems that more actively modify or contextualise diegetic events on the basis of connotative and subtextual criteria. This amounts to systems in which unfolding virtual events have a deeper semantics, expressed in terms of character personalities, purposes, relationships, and dimensions of form beyond the dramaturgical.

Strategies for Mixed Categorical and Narrative Sequencing

There are a number of distinct strategies by which categorical and narrative sequence generation might be combined at a given level of the video syntax (described in Lindley and Nack, 2000). These strategies make sense in terms of the available computations demonstrated in the AUTEUR and FRAMES systems. From a production perspective, a combined narrative/categorical sequencing system provides a wider range of formal options for media producers. Combined strategies are:

1. a narrative sequence generator that cannot proceed due to lack of material and/or exhaustion of rules for satisfying a current goal can use an associative matching step to shift context for lower level goals, after which narrative generation might resume in service of higher level goals but beginning with a state describing the shifted context. This is a strategy for getting out of a dead end by shifting to a new, but closely related, state description.

2. a categorical sequence generator that fails to find material matching the current state description at some point in sequence generation can resort to narrative synthesis to create a new sequence from lower level components, where the current state description serves as (part of) the goal for narrative generation. Once a narrative component has been created, categorical sequencing resumes, with a state description modified according to the complete description of the narrative segment.

3. a narrative sequence generator creates a narrative sequence that is then subjected to post-processing by an association engine. This represents a (syntagmatic) reordering of the narrated order away from the diegetic time order of the synthesized narrative material.

4. a categorical sequence generator creates a categorical sequence that is then subjected to post-processing by a narrative engine. This represents a (syntagmatic) reordering of the narrated order away from the arbitrary time order of the selected categorical material towards a sequence reflecting the diegetic time order of the content of those segments, and ordering them to reflect causal interrelationships.

5. a narrative sequence generator incorporates specific mechanisms for the explicit insertion of a categorical sequence.

6. a categorical sequence generator incorporates specific mechanisms for the insertion of a narrative sequence.

7. a narrative sequence generator uses associative mechanisms to structure a narrative presentation as a series of episodes related by theme, topic, etc. This strategy is very similar to strategy 3 above, but presents the articulation of the categorical structure as an outcome of narrative reasoning.

8. a categorical sequence generator uses generated narrative sequences as bridge material to connect categorically distinct segments.

9. the detailed mechanisms for associative matching and action sequencing are integrated. In this case the progression of selected video material is made by a synthesis of matching on continuity rules according to a high level thematic goal, and matching on patterns of similarity and dissimilarity of associations (represented in annotations) of the individual video segments. This can be envisaged as a kind of fuzzy resolution theorem prover, where predicate matching is partial and weighted with weights modified by an association specification and current state description.

These strategies combine demonstrated video sequence generation methods. The strategies can be used in the context of video sequencing, but also represent strategies for behavior generation for fully synthesised images. Video sequencing takes a representation as a content description at some level of abstraction and seeks to find matching descriptions among the content annotations of video clips within a database. Image synthesis approaches can take the representations as specifications for further decomposition if necessary down to the level required to map onto the behavioral control interfaces of a virtual world at the world, object, agent, or discursive agent levels. This use of the strategies amounts to the selection of different subspaces within the space of behavior generation rules, script elements, reaction rules, or state descriptions at the world, object, agent, or discursive agent levels.
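Strategy 1 can be illustrated with a small sketch. The Python fragment below is not the AUTEUR or FRAMES implementation; it assumes a hypothetical clip representation (a dictionary with an identifier, a set of annotation tags, and a single precondition tag), a single toy continuity rule, and an associative step that simply maximises tag overlap with the current state. It shows only the control flow: narrative generation proceeds until it reaches a dead end, an associative step shifts the context, and narrative generation resumes from the shifted state.

```python
def narrative_step(state, clips, rules):
    """Return the first clip that some continuity rule allows after `state`, or None."""
    for clip in clips:
        if any(rule(state, clip) for rule in rules):
            return clip
    return None

def associative_step(state, clips):
    """Return the clip whose annotations overlap most with the current state."""
    return max(clips, key=lambda c: len(state & c["tags"]), default=None)

def generate(start_tags, clips, rules, length):
    state, remaining, sequence = set(start_tags), list(clips), []
    while remaining and len(sequence) < length:
        clip, mode = narrative_step(state, remaining, rules), "narrative"
        if clip is None:
            # Dead end: shift context with an associative match (strategy 1).
            clip, mode = associative_step(state, remaining), "associative"
        remaining.remove(clip)
        sequence.append((clip["id"], mode))
        state = set(clip["tags"])  # narrative resumes from the shifted context
    return sequence

# Hypothetical clip database and a single toy continuity rule.
clips = [
    {"id": "c1", "tags": {"door", "enter"}, "pre": "street"},
    {"id": "c2", "tags": {"room", "sit"}, "pre": "enter"},
    {"id": "c3", "tags": {"sit", "window"}, "pre": "garden"},
    {"id": "c4", "tags": {"sky"}, "pre": "window"},
]
rules = [lambda state, clip: clip["pre"] in state]
sequence = generate({"street"}, clips, rules, length=4)
```

In this example the narrative rule fails after the second clip, the associative step selects `c3` on the shared tag `sit`, and the narrative rule then applies again to reach `c4`. Strategy 2 has the same shape with the roles of the two steps exchanged.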

The combination of narrative and categorical sequencing represents a rich variety of aesthetic strategies, supporting potentially dense cultural interpretations. Narrative is a highly conditioned expectation in dominant cinema, with well defined conventions of establishment, conflict, and resolution (Dancyger and Rush, 1995). However, linear narrative alone leads to a predictable story form, and fails to model frequent nonlinear pathways in human discourse, behavior, communication, expression, and cognition. Hybrid continuity narrative/categorical algorithmic strategies provide a variety of approaches representing different priorities of linear narrative and associative path generation, and a broader range of experiential and expressive possibilities that can be modeled and generated. Strategies 1 and 2 above propose a primary strategy with the alternate strategy being invoked in the event of failure of the primary strategy to generate a satisfactory solution. This supports a number of interpretations based upon the meaning of the failure of the primary strategy. In the case when the primary strategy is narrative, failure can represent the failure of the narrative ideology underlying presentation generation, with the shift into associative linking representing a lapse into "irrational" behavior under circumstances when the rationalist frame is untenable, the associative episode representing a moment of absurdity in the light of failed reason. Absurdity is itself a rationalist surface interpretation, where the dynamics of ongoing behavior generation are nevertheless systematic, the system being represented by the principle of association that is invoked. 
Where the primary structure is categorical, the resort to narrative generation in the event of failure may represent the invocation of instrumentalist reason to cope with deficiencies manifested in the dialectic between a purpose (the association specification under a current contextual state description) and the contingent state of the virtual universe represented by the set of atomic video clips (or behavior generation primitives defined at a particular level of representation within the behavioral interface of a virtual environment). This is the function of narrative as a discourse of power, finding the world inadequate, and resorting to instrumental construction as a strategy to impose will to create a solution.

Strategies 3 and 7, involving the reordering of a narrative sequence by categorical criteria, or the generation of independent categorically ordered narrative sequences, respectively, may function as strategies that foreground thematic elements underlying the narrative(s) that a (narratively) linear presentation may tend to hide. Alternatively, post-processing of a categorical sequence by narrative criteria (strategy 4) may serve to weave the thematic threads of a presentation into a more memorable order (the mnemonic function of narrative), or simply create another level of order to unify the formal integrity of the presentation to a degree beyond that achieved by the initial categorical structure.

The explicit insertion of a categorical subsequence within a primarily narrative sequence (strategy 5) may implement a categorical subsequence as a diegetic element of the narrative, such as the inclusion of a news presentation within a story. Alternatively, the explicit insertion of a narrative sequence by a predominantly categorical sequence generator (strategy 6) may represent a more self-conscious form of control over the generated sequence than the ‘resort to narrative’ approach represented by strategy 2. This might occur when it is known beforehand that the underlying clip database, or behavior generation system at the level of abstraction represented by the categorical specification, cannot satisfy a particular state description, or where generation from more primitive elements may more directly serve aesthetic, pedagogical, and/or ideological goals than selection from a predefined database or behavioral potential at the represented level.

Strategy 8, in which a categorical sequence generator uses generated narrative sequences as bridge material to connect categorically distinct segments, can provide a narrative frame for contextualising categorical sequences. Such a frame can provide a high level formal integrity to a generated presentation, and could include explication of the topical and thematic principles by which the categorical elements have been generated.

Strategy 9, in which the detailed mechanisms for associative matching and action sequencing are integrated, provides a more ‘organic’ model for the generation of meaning in presentations, potentially modelling the divergence of monologues or multi-agent discourse from a strict linear, rationalist semantic vector, and also from the systematic thematic changes of a pure categorical format. This may represent the meandering of natural conversation (perhaps in the service of higher level but implicit goals of contact, relationship establishment (as discussed by Bickmore and Cassell, 1999), or the transmission of information) under circumstances in which the strict linear pursuit of conversational goals is inappropriate (for example, when the agent functioning as the information source must not appear to be too didactic for reasons of relative status). Intrinsic hybrid strategies may also function to model the manifestation of madness (by rationalist definitions), trickster behavior, or the diegetic elements of dreams.
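The integrated matching of strategy 9 can be sketched as a single score that combines partial predicate matching against a thematic goal with weighted similarity or dissimilarity of associative annotations. The Python fragment below is an illustrative assumption only: the scoring formula, the field names (`asserts`, `tags`), and the sign convention for weights (positive favours similarity to the current state, negative favours dissimilarity) are hypothetical and are not drawn from a demonstrated system.

```python
def match_score(goal, state, clip, spec):
    """Weighted partial match of a clip against a thematic goal and current state.

    goal:  set of predicates that the high level thematic goal wants satisfied
    state: dict of annotation type -> value for the currently selected clip
    clip:  dict with "asserts" (set of predicates) and "tags" (dict of annotations)
    spec:  association specification, annotation type -> weight
    """
    # Partial predicate match: fraction of goal predicates the clip satisfies.
    continuity = len(goal & clip["asserts"]) / len(goal) if goal else 0.0
    # Associative match: reward or penalise agreement with the current state.
    association = 0.0
    for kind, weight in spec.items():
        same = state.get(kind) == clip["tags"].get(kind)
        association += weight if same else -weight
    return continuity + association

def choose_next(goal, state, clips, spec):
    """Select the clip with the highest combined score."""
    return max(clips, key=lambda c: match_score(goal, state, c, spec))

# Hypothetical goal, state, association specification, and clips.
goal = {"arrives(hero)", "meets(hero, guide)"}
state = {"location": "harbour", "mood": "calm"}
spec = {"location": 0.5, "mood": -0.25}
clips = [
    {"id": "a", "asserts": {"arrives(hero)"},
     "tags": {"location": "harbour", "mood": "calm"}},
    {"id": "b", "asserts": {"arrives(hero)", "meets(hero, guide)"},
     "tags": {"location": "desert", "mood": "calm"}},
    {"id": "c", "asserts": {"arrives(hero)"},
     "tags": {"location": "harbour", "mood": "tense"}},
]
best = choose_next(goal, state, clips, spec)
```

Here clip `b` fully satisfies the goal but is outweighed by clip `c`, which only partially satisfies it while matching the association specification on both annotation types. This is the sense in which the associative weights can pull a sequence away from a strictly goal-directed path.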

A Framework and Architecture for Behavior Generation in Virtual Worlds

This paper has described a number of distinctions for describing how meaning may be encoded within virtual worlds, and behavior may be generated according to those encodings. Five levels of semiotic codification have been described, based upon the film semiotics of Christian Metz. In addition, four levels of representational modelling have been identified, those being the world level, the object level, the agent level, and the discursive agent level. Behavior generation within those levels could be based upon a more or less abstract script, upon SMPA planning, or upon a more direct definition of agent behaviors without the representation of external (virtual) entities. At any of these levels of control system representation, any of eleven sequencing methods may apply (ie. narrative, categorical, or any of nine hybrid strategies). Hence there are four levels of control, each of which could be achieved by any of three forms of behavior generation representation, and those behavior representations could be processed by any of eleven sequencing strategies. This represents a total of one hundred and thirty two different behavior generation strategies that could in principle be used independently or in any set of combinations. Since any particular strategy can be either present or absent, the number of possible combinations of strategies is n = 2^132 - 1. Each combination of strategies defines a different potential authoring environment for interactive cinema, and hence a different interactive cinematic form from the author’s perspective (although the resulting presentations may not show a corresponding variability). The framework therefore provides a basis for the classification and comparison of virtual world authoring systems.
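The arithmetic behind these figures can be reproduced by enumerating the three dimensions of the framework. The short Python sketch below uses placeholder labels for the nine hybrid strategies; the labels are illustrative, and only the counts come from the text.

```python
from itertools import product

LEVELS = ["world", "object", "agent", "discursive agent"]
REPRESENTATIONS = ["script", "SMPA planning", "direct behavior"]
SEQUENCING = ["narrative", "categorical"] + [f"hybrid {i}" for i in range(1, 10)]

# Each behavior generation strategy is one (level, representation, sequencing) triple.
strategies = list(product(LEVELS, REPRESENTATIONS, SEQUENCING))

# Each strategy is either present or absent; exclude the empty combination.
combinations = 2 ** len(strategies) - 1

print(len(strategies))       # 4 * 3 * 11 strategies
print(f"{combinations:.2e}")  # order of magnitude of the authoring space
```

The enumeration gives 4 x 3 x 11 = 132 strategies and 2^132 - 1 non-empty combinations, a number on the order of 10^39.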

The framework presented here can also be used as a specification for the functionality of a software architecture supporting this range of authoring strategies. Such an architecture is required to support high level scripting, SMPA behavior generation, and direct behavior generation according to the eleven (categorical, narrative and hybrid) sequencing strategies. The eleven sequencing strategies are all based upon the two fundamental narrative and categorical approaches. Implementing these strategies at the four different levels of control is largely a matter of defining different sets of behavioral operators and operands for the four different levels, and is likely to have little if any impact upon the core mechanics of the sequencing algorithms. The levels of encapsulation represented by the four levels of control mean that the lower levels present an abstracted interface to the higher levels that includes both object identifiers and behavioral operators (or methods) that may have an implementation based upon inheritance of a generic definition of each sequencing strategy. Those levels of encapsulation also present a hierarchy of animation programming interfaces for which high level, SMPA-based, or direct behavioral directives can be authored at higher levels of the system.
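One way to read this architectural claim is that a sequencing strategy is defined once, generically, and each level of control supplies only its own operators and operands. The Python sketch below is a hypothetical illustration of that separation, not a description of an implemented system: the class names, the representation of an operator as a (precondition, effect) pair of sets, and the example operators are all assumptions.

```python
from abc import ABC, abstractmethod

class SequencingStrategy(ABC):
    """Generic definition of a sequencing strategy, independent of control level."""

    @abstractmethod
    def choose(self, state, operators):
        """Return the name of the next operator to apply, or None."""

class NarrativeStrategy(SequencingStrategy):
    def choose(self, state, operators):
        # Goal-directed in a minimal sense: take the first operator whose
        # precondition already holds in the current state.
        for name, (precondition, _effect) in operators.items():
            if precondition <= state:
                return name
        return None

class ControlLevel:
    """One level of encapsulation: a name, its operators, and a shared strategy."""

    def __init__(self, name, operators, strategy):
        self.name, self.operators, self.strategy = name, operators, strategy

    def step(self, state):
        chosen = self.strategy.choose(state, self.operators)
        if chosen is None:
            return state, None
        _precondition, effect = self.operators[chosen]
        return state | effect, chosen

# The same strategy class serves two levels that differ only in their operators.
strategy = NarrativeStrategy()
agent = ControlLevel(
    "agent",
    {"walk_to_door": (frozenset({"standing"}), frozenset({"at_door"}))},
    strategy,
)
world = ControlLevel(
    "world",
    {"sunset": (frozenset({"day"}), frozenset({"dusk"}))},
    strategy,
)
agent_state, agent_action = agent.step({"standing"})
world_state, world_action = world.step({"day"})
```

Under these assumptions, adding a categorical or hybrid strategy means adding one further subclass of `SequencingStrategy`, while adding a level means supplying a new operator table.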

Conclusion

The framework presented in this paper has been developed as a combinatorial implication of a number of sequencing strategies derived from demonstrated systems (in particular the AUTEUR and FRAMES systems), together with a layered model of behavior generation in virtual worlds and broad distinctions between scripted, SMPA, and direct behavioral generation techniques drawn from research in autonomous agency. The result is a three dimensional classification scheme defining a large space of possible control models. This scheme can be used for the classification of interactive cinema systems. However, the scheme includes subspaces that are unlikely to have been explored to date, representing new forms for the authoring of interactive cinema productions. The classification can also be treated as a specification for a generic architecture for control authorship in interactive cinema. That all of the subspaces within the scheme are feasible to implement is currently a hypothesis. The actual usefulness of the different subspaces, alone or in various combinations, can only be explored by creating different interactive cinema productions that use the different control models. The search space of possible authoring forms is very large, but it is hoped that the orthogonality of the classification dimensions will allow a real system to be developed that will allow system authors to freely and conveniently explore this space.

References

Aigrain P., Zhang H., and Petkovic D. 1996 "Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review", Multimedia Tools and Applications, vol. 3, pp. 179-202, Kluwer Academic Publishers, The Netherlands.

André E. and Rist T. 2000 "Presenting Through Performing: On the Use of Multiple Animated Characters in Knowledge-Based Presentation Systems", Proceedings of the Second International Conference on Intelligent User Interfaces (IUI 2000), pp. 1-8.

Bickmore T. and Cassell J. 1999 "Small Talk and Conversational Storytelling In Embodied Conversational Interface Agents", AAAI 1999 Fall Symposium on Narrative Intelligence, http://www.cs.cmu.edu/~michaelm/narrative.html.

Bordwell D. and Thompson K. 1997 Film Art: An Introduction, 5th edn., McGraw-Hill.

Brooks R. 1999 Cambrian Intelligence: The Early History of the New AI, MIT Press.

Cavazza M., Bandi S., and Palmer I. 1999 "'Situated AI' in Video Games: Integrating NLP, Path Planning and 3D Animation", 1999 AAAI Spring Symposium on Artificial Intelligence and Computer Games, AAAI Technical Report SS-99-02.

Clarke A. and Mitchell G. 2000 "Screen Play: Film and the Future of Interactive Entertainment", BCS Computer Graphics & Displays Group Conference on Digital Content Creation, Bradford, UK, 12-13 April.

Davenport G. and Murtaugh M. 1995 "ConText: Towards the Evolving Documentary" Proceedings, ACM Multimedia, San Francisco, California, Nov. 5-11.

Dancyger K. and Rush J. 1995 Alternative Scriptwriting: Writing Beyond the Rules, 2nd Edition, Focal Press.

Fencott C. 2000 "Comparative Content Analysis of Virtual Environments Using Perceptual Opportunities", BCS Computer Graphics & Displays Group Conference on Digital Content Creation, Bradford, UK, 12-13 April.

Goldberg A. 1997 "IMPROV: A System for Real-Time Animation of Behavior-Based Interactive Synthetic Actors", in Trappl R. and Petta P. (Eds.), Creating Personalities for Synthetic Actors, Springer-Verlag Lecture Notes in Artificial Intelligence (LNAI) 1195.

Kim M., Choi J. G., and Lee M. H. 1998 "Localising Moving Objects in Image Sequences Using a Statistical Hypothesis Test", Proceedings of the International Conference on Computational Intelligence and Multimedia Applications, pp. 836-841.

Laurel B. 1993 Computers as Theatre, Addison-Wesley Publishing Co.

Lindley C. A. 2000 "A Video Annotation Methodology for Interactive Video Sequence Generation", BCS Computer Graphics & Displays Group Conference on Digital Content Creation, Bradford, UK, 12-13 April.

Lindley C. A. and Nack F. 2000 "Hybrid Narrative and Associative/Categorical Strategies for Interactive and Dynamic Video Presentation Generation", submitted for publication.

Lindley C. A., Davis J., Nack F. and Rutledge L. 2000 "The Application of Rhetorical Structure Theory to Interactive News Program Generation from Digital Archives", submitted for publication.

Mateas M. and Sengers P. 1999 "Introduction to NI Symposium", AAAI 1999 Fall Symposium on Narrative Intelligence, http://www.cs.cmu.edu/~michaelm/narrative.html.

Metz C. 1974 Film Language: A Semiotics of the Cinema, Oxford University Press, New York.

Murray J. 1997 Hamlet on the Holodeck: the Future of Narrative in Cyberspace, MIT Press.

Murtaugh M. 1996 The Automatist Storytelling System, Masters Thesis, MIT Media Lab, http://ic.www.media.mit.edu/groups/ic/icPeople/murtaugh/thesis/index.html.

Nack F. and Parkes A. 1997 "The Application of Video Semantics and Theme Representation in Automated Video Editing", Multimedia Tools and Applications (Ed. Zhang H.), vol. 4, no. 1, pp. 57-83.

Nack F. 1996 AUTEUR: The Application of Video Semantics and Theme Representation for Automated Film Editing. Ph.D. Thesis, Lancaster University, UK.

Sengers P. 1998 Anti-Boxology: Agent Design in Cultural Context, PhD Thesis, CMU Department of Computer Science and Program in Literary and Cultural Theory, August 1998.

Srinivasan U., Lindley C., Simpson-Young B. 1999 "A Multi-model framework for Video Information Systems", "Semantic Issues in Multimedia Systems", 8th IFIP 2.6 Working Conference on Database Semantics (DS-8), Jan 5-8 1999, Rotorua, New Zealand.

Stam R., Burgoyne R., and Flitterman-Lewis S. 1992 New Vocabularies in Film Semiotics: Structuralism, Post-Structuralism and Beyond, Routledge.