Signal Processing:

Does it mean anything?

Ed Hartley, Adam T. Lindsay, Alan P. Parkes

Distributed Multimedia Research Group

Computing Dept.

Lancaster University

{e.hartley@lancaster.ac.uk, atl@comp.lancs.ac.uk, app@comp.lancs.ac.uk}

 

 

Introduction

The title of this paper was chosen to be deliberately provocative in order to introduce the following question: can any meaning be obtained directly from the decomposition of non-textual media through the application of signal processing techniques? The authors argue that this question lies at the heart of any consideration of the application of computational semiotics to multimedia content.

The application of signal processing techniques to the task of determining meaning results in one of the following:

  1. the direct comparison of content elements against predefined categories or
  2. the decomposition of the content into elements that are subsequently matched against categories.

We would argue that ascribing any meaning to the aforementioned categories is a result of an a priori mapping, determined by the designer of the analytical or segmentation tool, from some domain of knowledge to the categories. This mapping is the result of conscious decision making when humans assign meaning to derived characteristics. The fact that this mapping is occurring is often overlooked by the designers of these tools. Yet in a consideration of computational semiotics this issue must be placed at the forefront, for it is here that the distinction between the denotative meaning and the connotative meaning lies. The aim of this paper is to provoke discussion concerning the extent to which denotative meaning can be extracted by signal processing techniques, and whether such techniques can actually derive connotative meaning at all.
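The a priori mapping described above can be made concrete with a minimal sketch. The feature pairs, prototype values and category names below are all hypothetical, invented purely for illustration. The point is that the matching operates on numbers alone, and the labels come from a table written by the designer.

```python
import math

# Hypothetical designer-supplied prototypes: (mean brightness, edge density).
# The label strings ARE the a priori mapping; nothing in the signal produces them.
PROTOTYPES = {
    "night scene": (0.15, 0.30),
    "snow field": (0.90, 0.10),
    "city street": (0.50, 0.80),
}

def classify(features, prototypes=PROTOTYPES):
    """Return the label whose prototype lies nearest to the feature vector."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], features))

# The same numeric prototypes under different, equally arbitrary labels.
RENAMED = {
    "category A": PROTOTYPES["night scene"],
    "category B": PROTOTYPES["snow field"],
    "category C": PROTOTYPES["city street"],
}

MEASURED = (0.20, 0.25)  # hypothetical output of a feature extractor
LABEL = classify(MEASURED)
RENAMED_LABEL = classify(MEASURED, RENAMED)
```

The signal processing is identical in both calls; only the designer's table decides whether the result reads as "night scene" or as "category A".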

This paper identifies key issues that the authors believe signal-processing techniques cannot address. The paper also discusses the computational challenges that arise in attempting to derive a semiotic account of multimedia.

What signal processing can do

It is now useful to review the capabilities of signal processing techniques. Signal processing and associated statistical techniques provide tools that can perform the types of segmentation and characterisation discussed next.

Analytic techniques in the visual domain can be broadly divided into those that perform temporal categorisation and those that perform spatial categorisation. A further subdivision can be made between inter- and intra-media techniques, i.e. between techniques that are wholly dependent on a characterisation of the media and those that depend on the analysis of a database of media instances. We now briefly discuss currently available analysis methodologies and their location on the axes identified above, and identify the way in which meaning is ascribed to the results of the analysis. The discussion moves from techniques that are based wholly on either temporal or spatial analysis to hybrid techniques that combine elements of both.

There have been decades of research into machine vision and digital image processing (Gonzalez and Woods, 93; Chen, Pau, Wang, 93). More recently there has been considerable work driven in part by the needs of the communications industry (VLBV 98; WIAMIS, 99). Many of these techniques have been presented to ISO/IEC SC29 WG11 (MPEG) (Nack 99) as part of the development of the Multimedia content representation.

Many of the early image analysis techniques relied on the application of a succession of image transform functions; typical of these are the edge detection techniques (Canny 1983, 1986; Marr 82). These techniques can be applied successively and are able to provide some measure of meaningful segmentation. It should be noted that the meaning is supplied by the viewer and not by the system. Of particular interest to this discussion is the approach adopted in the MOODS system (Griffioen, Yavatkar, Mehrotra, 95), where a human expert applies successive image processing operations to an image class. These operations form a data model, which is constructed within the system and supports the assignment of semantic meaning to the results of the image transformations. This combination is then used to process weather satellite images automatically.
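As an illustration of what such an image transform function produces, the following minimal sketch computes a Sobel gradient magnitude. The Sobel operator is substituted here for brevity; it is not the Canny or Marr method cited above, and the tiny test image is hypothetical. The output is only an array of numbers: reading a column of large values as "the boundary of an object" is left to the viewer.

```python
import math

def sobel_magnitude(image):
    """Return the Sobel gradient magnitude of a 2-D list of intensities.

    Border pixels are left at 0.0 because the 3x3 window does not fit there.
    """
    gx_kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy_kernel = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    height, width = len(image), len(image[0])
    result = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += gx_kernel[dy][dx] * pixel
                    gy += gy_kernel[dy][dx] * pixel
            result[y][x] = math.hypot(gx, gy)
    return result

# Hypothetical 5x6 image: dark left half, bright right half.
IMAGE = [[0, 0, 0, 9, 9, 9] for _ in range(5)]
EDGES = sobel_magnitude(IMAGE)
```

In this example the only non-zero responses fall in the two columns that straddle the dark-to-bright transition.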

Object motion analysis provides the capability to identify object boundaries and track the paths followed by these objects in image time and space. This has applications in low bit rate coding for video transmission, object coding, and analysis.

In moving image analysis, some types of media genre can be successfully distinguished through a statistical analysis of edge frequency, motion and colour characteristics (Fischer and Lienhart 1995). It is notable that the correlation between the genre style characteristics identified by each of the analytic modules described is again made via reference to human experts who provide the reference against which the matching is carried out.

Temporal segmentation analysis has been used to identify the location of filmic effects such as cuts and dissolves (Zabih, Miller, Mai 1995, 1999) and shot breaks (Lew, Haas, Touber & Wentzler), amongst others. It is noteworthy in this context that the methods identified to date can falsely interpret instances of rapid motion within a shot as shot breaks, and can fail to detect shot breaks where the edge differential is low.
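Both failure modes can be reproduced with a deliberately simplified detector. The sketch below thresholds the mean absolute difference between consecutive frames; this is a stand-in chosen for brevity, not the edge-based methods cited above, and the one-dimensional "frames" and the threshold are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute intensity change between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def detect_breaks(frames, threshold):
    """Return each index i where a break is declared between frames i-1 and i."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]

# Hypothetical sequence of 8-pixel frames.
FRAMES = [
    [10] * 8,             # 0: shot A
    [10] * 8,             # 1: shot A
    [200] * 4 + [10] * 4, # 2: shot A, a bright object rushes into view
    [10] * 4 + [200] * 4, # 3: shot A, the object has crossed the frame
    [10] * 8,             # 4: shot A, the object has gone
    [30] * 8,             # 5: shot B begins (the only true cut, low contrast)
    [30] * 8,             # 6: shot B
]
TRUE_BREAKS = [5]
DETECTED = detect_breaks(FRAMES, threshold=50)
```

With these values the detector reports breaks at frames 2, 3 and 4, all caused by motion within shot A, and misses the genuine cut at frame 5. Lowering the threshold far enough to catch the cut would still leave the three false reports in place.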

Continuing the survey across other sensory domains, we find that signal processing has proven most effective in recognition contexts (i.e. matching to pre-determined categories) utilising features that do not necessarily have any basis in psychophysics (i.e. the way humans perceive multimedia). Speech recognition bases its categorical comparisons most commonly on cepstral features, the product of a homomorphic signal processing technique that no one asserts has a relation to what happens in the human mind. Nevertheless, cepstral processing’s effectiveness in teasing out the differences and transitions between phonemes is well-recognised and yields good results when combined with a language model in a speech recognition system.
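The "homomorphic" property can be shown in a few lines. The sketch below computes the plain real cepstrum (the inverse transform of the log magnitude spectrum) with a naive discrete Fourier transform. It is an illustration only: practical recognisers commonly use refinements such as mel-frequency cepstral coefficients, which are not shown, and the two short sequences standing in for an excitation and a filter are hypothetical.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (quadratic time, fine for short inputs)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def real_cepstrum(signal, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum; `floor` guards against log(0)."""
    n = len(signal)
    log_mag = [math.log(abs(value) + floor) for value in dft(signal)]
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]

def circular_convolve(a, b):
    """Circular convolution of two equally long sequences."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

# Hypothetical stand-ins for an excitation and a filter response.
SOURCE = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
FILTER = [1.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]

COMBINED = real_cepstrum(circular_convolve(SOURCE, FILTER))
SUMMED = [a + b for a, b in zip(real_cepstrum(SOURCE), real_cepstrum(FILTER))]
```

Convolution in the signal domain becomes addition in the cepstral domain, so `COMBINED` and `SUMMED` agree to within rounding error. That makes the two components separable, but nothing in the computation refers to perception or to meaning.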

The language model is critical at this point: the mapping of the hypothesised phonemes to hypothesised phrases is a key point at which the system is "designed," or configured to assign pre-determined meaning to particular input sequences. Every language model imposes some restrictions on the output, based on its assumptions about the input: that it be grammatical and that it follow some sort of continuity of topic.
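A minimal sketch of this design step follows, assuming a bigram language model. The word pairs, probabilities and candidate phrases are hypothetical values typed in for illustration, not estimates from any corpus. The model simply prefers the word sequence that its table makes most probable.

```python
import math

# Hypothetical bigram probabilities supplied by the system designer.
BIGRAM = {
    ("<s>", "recognise"): 0.02,
    ("recognise", "speech"): 0.30,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}
UNSEEN = 1e-6  # assumed probability for any word pair absent from the table

def log_prob(words):
    """Sum of log bigram probabilities, starting from the marker <s>."""
    tokens = ["<s>"] + list(words)
    return sum(math.log(BIGRAM.get(pair, UNSEEN)) for pair in zip(tokens, tokens[1:]))

def best_hypothesis(hypotheses):
    """Pick the candidate phrase that the language model scores highest."""
    return max(hypotheses, key=log_prob)

# Hypothetical candidates that a phoneme recogniser might find similar.
HYPOTHESES = [
    ("recognise", "speech"),
    ("wreck", "a", "nice", "beach"),
    ("speech", "recognise"),
]
BEST = best_hypothesis(HYPOTHESES)
```

The ungrammatical ordering is rejected only because its word pairs are missing from the table, which is exactly the restriction on output described above.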

One of the most successful examples of signal processing on media has thus gone from signal to linguistic tokens. Even though we have achieved that much, where is the meaning? We argue that although signal processing gets us to that point, representing meaning falls to other domains such as Artificial Intelligence. Other examples in the visual domain include facial recognition and other pattern matching feats that do not necessarily rely on human models, such as comparing textures through the fast scale space method, but do require precise mapping of the target recognition class.

A Semiotic Approach to Description

For this paper we draw freely from Hartley, Parkes and Hutchison’s (Hartley 99) analysis of semiotic terms in the context of content-based multimedia applications. At its heart, it describes six different planes possible with human perception and machine description of multimedia. Five planes appear in the model of machine description. The transformed data plane holds encoded digital media data. The data plane holds analogue or uncompressed video or audio ready to be sent to a transducer. Once such data is sent into the perceivable world, it forms the expression plane, which may then be analysed by computational means to arrive at a description plane. The description may in turn be further encoded and/or compressed for easy transmission or manipulation in the transformed description plane. Please note that this simplified description is not intended to describe a temporal ordering or prescribe a processing order, but presents the general relationship between planes. Figure one presents this model.

 

Figure 1: the various planes possible when generating a machine-derived description of content. Typical transitions of the data in its different forms from plane to plane are noted.

Figure two presents the model in the case of human perception and understanding of the media, leading from the expression plane to the content plane, which is based on the important but oft-ignored axiom that multimedia does not become content in the absence of a human perceiver. In creating a machine description, one hopes that there is a relationship between the human’s content plane and the machine’s description plane.


Figure 2: the planes in effect with a human perceiver

Since the models are identical through the first three planes described, we shall concentrate on the relationship between the human’s content plane and the machine’s description plane, which fulfil similar roles in their respective models.

A ladder towards meaning

One way of looking at a computational process from expression (signal) to description (meaning) is to imagine several steps on a ladder going from a signal towards meaning. This is very much inspired by Marr’s (Marr 82) representational framework laid out in his seminal work. We abstract from his reliance on the primal and 2½-D sketches to a more extensive survey of what may plausibly happen along the lines of human perception and understanding. One instance is illustrated in figure three.

Figure 3: one series of steps taken to extract more "meaning" from a signal. An interpretation may be iteratively re-segmented and "chunked" into larger items of meaning.

We initially present the ladder as a series of possible steps in the signal analysis chain, a "zooming in" on the analysis process between the expression and description planes. However, the steps in the ladder also represent a model of human perception. Thus a very similar model gets one from the expression to the content planes.

We also note that at each of the processing steps one may extract a representation that is itself a valid type of description of the multimedia data. A physical description, which can be seen as translating to a sub-plane of or an axis on the description plane, relates to such features as colour intensity or fundamental frequency. A perceptual description relates to features that adhere more closely to psychophysical and perceptual principles, such as a perceived colour within a colour model, or pitch height. Segmentation gets us to the point of "This agglomeration of pixels/samples/frames is an object."

It is important to note what is happening with the processing at this point: given one representation, more information—whether it be in the form of assumptions, heuristics, or the model implicit in the processing—is added, and a new representation is derived. Each of the multiple representations is valid when describing the content. Although each progressive step up the ladder may be a higher-level description that approaches "meaning," the number of a priori assumptions that must typically be made in machine processing to get to that point makes the description successively more fragile.

Once the segmented object is recognised as belonging to a category or as similar to another object already processed, there may be a token attached to it, such as, "This object belongs to this class." Signal processing can get us to the point of simple relationships being drawn.

Beyond this point, one leaves the realm of signal processing, and relies more on heuristics to give a greater scope to the objects that have been recognised. One might design a system that hypothesises that "If this object is present in the context of the other, then another assertion may be true." Given a new assertion, more processing might be achieved.
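The kind of heuristic step described here can be sketched as a small forward-chaining rule system. The rules and tokens below are hypothetical examples. Each rule is an assumption written down by the designer, and each new assertion may in turn trigger further rules.

```python
# Hypothetical rules: if every premise is asserted, assert the conclusion.
RULES = [
    ({"object:ball", "object:goalposts"}, "scene:football match"),
    ({"scene:football match", "audio:crowd roar"}, "event:possible goal"),
]

def infer(tokens, rules=RULES):
    """Apply the rules repeatedly until no new assertion is produced."""
    asserted = set(tokens)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= asserted and conclusion not in asserted:
                asserted.add(conclusion)
                changed = True
    return asserted

# Tokens as they might arrive from a recognition stage (hypothetical).
RECOGNISED = {"object:ball", "object:goalposts", "audio:crowd roar"}
ASSERTIONS = infer(RECOGNISED)
```

Here the second rule fires only because the first has already added its conclusion. Any error in the recognised tokens or in a rule propagates to every assertion built on it.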

Although the most popular methods for achieving some meaning from a signal get a fair distance up the ladder (which perhaps is drawn with those limits in mind), it still requires a human to ascribe a meaning to a recognised sentence. Another issue presents itself at this point: is this in fact setting the goal too high, given that it falls well outside the scope of signal processing? So, again, what sorts of units of meaning can signal processing realistically achieve?

 

Staying close to the surface

Another area in which signal processing might have some success is the examination of "surface" features, or aspects of the media that are closer to the level of the signifier than the signified. Signal processing has some potential in examining features that are closer to the style or genre plane of expression. Granted, making assertions about these features from low-level feature extraction relies very heavily on heuristic guides, again creating a heavy reliance on the assumptions of the system designer.

The presence of strong, regular amplitude peaks across frequency bands suggests dance music. High contrast caused by specifically limited illumination suggests film noir. This approach quickly runs against limits, but it points to another approach of extracting some meaning from a signal.
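A minimal sketch of the first heuristic follows, using the autocorrelation of a single amplitude envelope as a measure of regularity. The envelopes, the lag range and the threshold of 0.5 are hypothetical choices, and a real system would examine several frequency bands rather than one.

```python
def normalised_autocorrelation(envelope, lag):
    """Autocorrelation of the mean-removed envelope at `lag`, scaled by its energy."""
    mean = sum(envelope) / len(envelope)
    centred = [value - mean for value in envelope]
    energy = sum(value * value for value in centred)
    if energy == 0:
        return 0.0
    overlap = sum(centred[i] * centred[i + lag] for i in range(len(centred) - lag))
    return overlap / energy

def strongest_period(envelope, min_lag=2):
    """Return the (lag, score) pair with the highest autocorrelation score."""
    candidates = range(min_lag, len(envelope) // 2 + 1)
    return max(
        ((lag, normalised_autocorrelation(envelope, lag)) for lag in candidates),
        key=lambda pair: pair[1],
    )

def suggests_dance_music(envelope, threshold=0.5):
    """Designer's heuristic: a strongly periodic envelope 'means' dance music."""
    return strongest_period(envelope)[1] > threshold

BEAT = [1.0, 0.1, 0.1, 0.1] * 8       # a peak every fourth sample
DRONE = [0.5] * 32                    # constant level, no peaks at all
RAMP = [i / 32 for i in range(32)]    # a steady crescendo, no beat
```

The regular beat is accepted and the drone rejected, as intended. The crescendo is also accepted, because a slowly rising envelope correlates strongly with a shifted copy of itself, which is one example of how quickly such a heuristic runs against limits.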

 

What Signal Processing Cannot do

The above discussion supports the view that, while signal processing can characterise media sufficiently to support search applications, it is capable, with the exception of speech recognition applications, of providing little beyond an outline of the meaning of non-speech media objects. We contend that there are two reasons for this. Firstly, there is the lack of a canonical notation for such objects. The creation of such a canonical representation in itself represents a major intellectual challenge. Secondly, there is the problem of distinguishing connotative and denotative aspects of the results of such processing.

Research has begun to develop such canonical representations of media objects. Many organisations are working on content description and "metadata" (SMPTE, EBU, MPEG, and the Dublin Core, among others). Given the theoretical challenges posed, it seems unlikely that such a canonical representation will be developed in the short term. However, promising work has been done in the development of formal methodologies to represent video content.

 

Computational Challenges

A significant challenge is presented by the need to identify and distinguish between the denotative and connotative meanings associated with media instances. Implicit in an ability to make this distinction is the development of representation structures that can accommodate both the denotative meaning and the connotative meaning of media instances.

Another area where significant work is needed is the development of mechanisms for the disambiguation of meaning and/or the representation of the multiple meanings that might be ascribed to a given media instance. Again it is useful to contrast the well-defined techniques in Natural Language Processing for quantifying the probability (Garside 97) of a given text production having a given grammatical structure, and therefore meaning, with the techniques available in statistical pattern recognition. This contrast points again to the lack of a common representation formalism. It should be noted that the capability of NLP systems to provide such disambiguation has been based on many years of analysis of large corpora of annotated material. Given the rudimentary nature of current representation schemas, it is inevitable that such corpora of annotated content do not yet exist in any complete sense.

 

Conclusions

This paper has presented an analysis of several aspects of the current state of the art in what might be called computational semiotics of multimedia. We have drawn attention to the limited accounts of meaning that can be derived by pure signal processing. We have identified some of the computational challenges that lie ahead in representing and defining the meaning that can be attributed to content.

From this analysis we have also begun to identify some of the steps needed to develop a program towards a computational semiotics for multimedia.

 

References

Gonzalez and Woods, 93, R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison Wesley Publishing Company, Reading, Massachusetts, 1993.

Chen, Pau, Wang, 93, C. H. Chen, L. F. Pau and P. S. P. Wang (Eds.), Handbook of Pattern Recognition and Computer Vision, World Scientific Publishing, 1993.

Canny 1986, J. Canny, A Computational Approach to Edge Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.

Garside 97, R. Garside, G. Leech and A. McEnery (Eds.), Corpus Annotation: Linguistic Information from Computer Text Corpora, Longman, 1997, ISBN 0-582-29837-7.

Hartley 99, Hartley E., Parkes A.P. & Hutchison D. (1999), A Conceptual Framework to Support Content-Based Multimedia Applications, Proceedings of ECMAST '99, 4th European Conference on Multimedia Applications, Services and Techniques. 26-28 May 1999, Madrid, Spain. Published as Lecture Notes in Computer Science 1629, Leopold H. & Garcia N. (Eds.), Springer-Verlag, Berlin, pp. 297-315, ISBN 3-540-66082-8.

Marr 82, D. Marr, Vision, Freeman, San Francisco, 1982.

Nack 99, F. Nack and A. T. Lindsay, "Everything You Wanted to Know About MPEG-7, Part 1," in IEEE Multimedia, July-Sept 1999.

VLBV 98, Proceedings of the International Workshop on Very Low Bitrate Video Coding, October 8-9, Urbana, Illinois.

WIAMIS, 99, Proceedings of the Workshop on Image Analysis for Multimedia Services, 31 May - 1 June 1999, Heinrich Hertz Institute, Berlin.

Griffioen, Yavatkar, Mehrotra, 95, J. Griffioen, R. Yavatkar and R. Mehrotra, A Semantic Model for Embedded Image Information, in Proceedings of the 1994 IST/SPIE Symposium on Electronic Imaging and Technology.

Fischer and Lienhart 1995, S. Fischer, R. Lienhart and W. Effelsberg, Automatic Recognition of Film Genres, Proceedings of ACM Multimedia 1995, San Francisco, Nov 1995.

Zabih, Miller, Mai 1995, R. Zabih, J. Miller and K. Mai, Feature Based Classification Algorithms for Detecting and Classifying Scene Breaks.

Lew, Haas, Touber & Wentzler