
November 2001
Newsletter

The first Euromedia workshop on Image processing and Computer-Aided Video Retrieval (3 to 5 March 1997)

Setting and attendance

The workshop was hosted by ORF in Vienna. The original plan called for a four-day event with three major speakers and a day for discussion. This was shortened to three days with speakers from only two institutions after Prof. Jaime had to decline and no other distinguished expert could be found within the time constraints of the preparation phase. Total attendance was 44 persons. Most had an archive background; only a few came from the technical side. About a third of all attendees were from non-Euromedia organizations.

Besides the speakers from the Euromedia consortium, Prof. Desai Narasimhalu from the University of Singapore and Prof. Effelsberg (together with his assistant Mr. Lienhart) were featured. All speakers brought presentations of their most current research results and developments.

Introductory talks

The introductory talks set the framework for the workshop. After the opening statements by ORF officials, Dr. Peter Thomas gave an introduction to the Euromedia project for those attendees who were not yet familiar with it. This talk did not go into any software-technical detail but instead presented the general organizational framework and, importantly, the "visible components": the hardware concept and a prototype of the user interface. Three hardware configurations were presented, corresponding to the three possible hardware options for the prototype, namely the broadcast quality storage system, the (upgradable) preview quality system and the small archive preview solution.

For the user interface discussion, the central point raised was the portability and scalability of the purely web-based interface, which allows the user to be fully ignorant of the actual server back-end configuration. In particular, this interface presents the opportunity to use the same client software for preview archives, broadcast quality playout systems and even contribution quality solutions supporting full studio quality editing.

The second talk, by Mr. Neidecker-Lutz, raised the central problem of the workshop, namely "If image processing is the answer, what is the question?" Currently image processing, with all its features, is still more of a lab curiosity than a commercially applicable technology. This is mostly not because it is unaffordable or unavailable, but because the people who could make use of it often have no idea what it can actually offer them and how the algorithms could be usefully integrated into their image data processing systems.

To illustrate the power of image processing, Mr. Neidecker-Lutz presented several basic algorithms designed and implemented for Euromedia's indexing facility.

Cut detection was found to be a key element of any system involving video processing. Reliably detecting cuts and transitions makes it possible to split a video stream into its syntactic building blocks, called shots (defined as contiguous pieces of uninterrupted camera work). As each shot constitutes a coherent self-contained content item, it can usually be used as the first level of abstraction above the single frame. Current cut detection algorithms already achieve beyond 98% accuracy on hard cuts and approach 90% or better on the various types of soft transitions.

The algorithm developed by Digital is special in one respect: it does not work on the original uncompressed image data but directly on the stored compressed video stream. This saves the extra processing load of decompressing the video and then detecting image-to-image differences, since most of this information is already contained in the compressed stream: the compression algorithm itself exploits the similarity between temporally adjacent frames to achieve good compression factors.

The basic principle of the cut detection algorithm is thus to look at the coded differences between adjacent images, which are usually very small. A sudden peak with a large difference between two adjacent frames, followed by a drop back to the normal low differences, is a good indicator of a hard cut between two normal shots. Longer runs of increased but not excessive differences can indicate either a soft transition or camera work (zooms or pans, see below for an explanation of how these are distinguished from each other). Lastly, two or more peaks in very rapid succession can indicate either a camera flash or other disturbance within a shot, or several cuts very close to each other. This ambiguity can be resolved by comparing the frames just outside the run of possible cuts: if they are very similar, a disturbance can be assumed; if they are significantly different, it is very probable that at least one of the peaks indicates a real cut.
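
The decision rules in the paragraph above can be sketched roughly as follows. This is only an illustration, not Digital's algorithm: each frame is reduced to a single hypothetical number (for example a mean brightness) instead of coded differences from a compressed stream, and the function name `detect_events` and the thresholds `peak`, `raised`, `min_run` and `burst_gap` are assumptions chosen for the example.

```python
def detect_events(frames, peak=50.0, raised=10.0, min_run=5, burst_gap=2):
    """Label hard cuts, gradual runs and peak bursts in a frame sequence.

    frames: one hypothetical scalar per frame (stand-in for real image data).
    Returns a list of (kind, first_diff_index, last_diff_index) tuples, where
    difference i is the difference between frame i and frame i + 1.
    """
    diffs = [abs(b - a) for a, b in zip(frames, frames[1:])]
    events = []
    i = 0
    while i < len(diffs):
        if diffs[i] >= peak:
            # Collect further peaks that lie at most burst_gap differences apart.
            first = last = i
            j = i + 1
            while j < len(diffs) and j - last <= burst_gap:
                if diffs[j] >= peak:
                    last = j
                j += 1
            if first == last:
                events.append(("hard_cut", first, last))
            else:
                # Compare the frames just outside the burst of peaks.
                outside = abs(frames[last + 1] - frames[first])
                kind = "disturbance" if outside < raised else "cut_in_burst"
                events.append((kind, first, last))
            i = last + 1
        elif diffs[i] >= raised:
            # Run of raised but not excessive differences.
            start = i
            while i < len(diffs) and raised <= diffs[i] < peak:
                i += 1
            if i - start >= min_run:
                events.append(("gradual", start, i - 1))
        else:
            i += 1
    return events
```

A "gradual" result only says that a soft transition or camera work is possible; telling the two apart needs the motion analysis described further below.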

The next step in a video analysis can then go in two directions, both of which have been explored in Digital's algorithms. Shot fusion aims at aggregating the single shots into larger contexts by detecting temporally close occurrences of very similar shots. By evaluating these fused shots, an attempt can be made at detecting scenes (semantic instead of syntactic units), although this is not yet implemented in this system. The other direction is key frame extraction, aimed at creating a still picture overview of a complete video sequence that is as concise as possible. For this extraction to work, an intra-shot analysis of motion has to be conducted to find out at which points the accrued change in image content warrants or demands the inclusion of another key frame.

This motion analysis step is also very important for a reliable detection of soft transitions. Again, the algorithm works in the compressed domain and utilizes the motion vectors already calculated by the compression algorithm. The motion analysis averages the motion vectors for five (overlapping) image areas and then compares them over a short period of time. There are several possibilities for the kind of motion vectors that can be found:

  1. Fluctuating: The vectors show temporally uncorrelated motion. This can be caused by an image with very poor contrast, by an effect transition or on-screen effect (e.g. explosion) or chaotic motion of several objects. This kind of motion vector pattern is currently ignored except for detecting soft transitions.
  2. Static: All five motion vectors are very short. If this occurs, the picture is poor in overall motion. No further key frames are extracted until motion is found. If found with gradual scene content change (transition indicator), this strongly suggests a fade or dissolve.
  3. Temporally correlated, spatially uncorrelated: This pattern can be found if there is object motion. The background is static (or moves in one direction), and at least one object of significant size moves in front of it. This type is counter-indicative for a transition unless there are significant other signs that a wipe could be present. For key frame extraction, a running total of detected motion is calculated and if exceeding a given threshold, a new key frame is chosen.
  4. Spatially and temporally constant, significant value: This type of pattern indicates a camera pan. A new key frame is chosen when the total pan distance equals half the screen size. Although this pattern could indicate a wipe, it is usually treated as a pan (far more common) unless a hard edge can be detected orthogonal to the movement direction and moving over the entire screen during the motion phase.
  5. Temporally constant, radial (center vector near static, all others point to or from the center and are of similar magnitude): This pattern is indicative of a zoom. If the motion vectors point to the center, a zoom out is indicated; if they point away from the center, a zoom in.
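
As a rough illustration of the spatial part of this classification, the sketch below labels one set of five averaged motion vectors and applies the half-screen rule for pans. The area layout (center first, then four outer areas), the names `classify_motion` and `pan_key_frames`, and all thresholds are assumptions made for the example; the temporal comparison needed to separate fluctuating motion from object motion is left out, so both end up as "other". The mapping of vector direction to zoom in or zoom out follows the convention stated in item 5.

```python
import math

# Assumed layout: direction of each outer area as seen from the image center.
OUTER_DIRS = [(-1, -1), (1, -1), (-1, 1), (1, 1)]


def classify_motion(vectors, still=0.5, tol=0.5):
    """Classify five averaged motion vectors [(dx, dy), ...].

    vectors[0] belongs to the center area, vectors[1:] to the outer areas
    in OUTER_DIRS order. Returns "static", "pan", "zoom_in", "zoom_out"
    or "other".
    """
    mags = [math.hypot(dx, dy) for dx, dy in vectors]
    if max(mags) < still:
        return "static"
    mean_dx = sum(dx for dx, _ in vectors) / len(vectors)
    mean_dy = sum(dy for _, dy in vectors) / len(vectors)
    spread = max(math.hypot(dx - mean_dx, dy - mean_dy) for dx, dy in vectors)
    if spread < tol:
        return "pan"  # all areas move the same way
    if mags[0] < still:
        # Project each outer vector onto the direction away from the center.
        radial = [(dx * ox + dy * oy) / math.hypot(ox, oy)
                  for (dx, dy), (ox, oy) in zip(vectors[1:], OUTER_DIRS)]
        outer = mags[1:]
        is_radial = all(abs(r) >= 0.9 * m for r, m in zip(radial, outer))
        similar = max(outer) - min(outer) < tol
        if is_radial and similar and min(outer) >= still:
            if all(r > 0 for r in radial):
                return "zoom_in"   # vectors point away from the center
            if all(r < 0 for r in radial):
                return "zoom_out"  # vectors point to the center
    return "other"


def pan_key_frames(dx_per_frame, screen_width):
    """Pick a new key frame whenever the pan has covered half the screen."""
    keys, travelled = [], 0.0
    for index, dx in enumerate(dx_per_frame):
        travelled += abs(dx)
        if travelled >= screen_width / 2:
            keys.append(index)
            travelled = 0.0
    return keys
```

For instance, a steady pan of 100 pixels per frame on a 640 pixel wide picture would yield a key frame every fourth frame under these assumptions.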

Note that only part of this detection is already implemented or evaluated in Digital's algorithms; a more complete realization was presented by Mr. Brunelli (see below).
Currently the output of Digital's key frame extraction algorithm is presented as an HTML page, with the key frames grouped by shot.

The talk by Mr. Brunelli (Istituto Trentino di Cultura) presented a very similar approach, although with more technical background and far more research-oriented. The user interface was less self-explanatory but more powerful, geared to the more technical user who requires a maximum of functionality and flexibility.

Again, a shot detection and key frame selection algorithm is at the core of the system. The results are however presented in a hierarchical view with the first structure level being the shot. Each shot is at the basic level represented by one key frame. However, it is possible to expand the view to include progressively more frames per shot in up to four levels of detail (depending on shot content, not all four levels may be available).

In addition to this basic structuring functionality, advanced means of retrieval are available in this system. Similarity retrieval is the most prominent one. Unfortunately it was not detailed which measurements of similarity are being used, but the algorithm performed reasonably well when used to find similar shots within one movie, so it could also be put to good use in shot fusion and scene detection applications.

Automatic Audio and Video Content Analysis

Automatic multimedia content analysis has been an ongoing research subject at the University of Mannheim since the late 1980s. Considerable progress has been made there since, and several functioning research prototypes have been developed. Prof. Effelsberg, head of the institute, and his colleague Mr. Lienhart presented their most current state of the art. Before starting the actual talk about content analysis, Prof. Effelsberg gave an introduction to the basics of video and audio compression, presenting a few of the major ideas to the non-technical personnel. This part shall not be covered here, as the most relevant compression algorithms are well covered in the reports on standards. It is important to mention, however, that even though lossy compression algorithms are designed to lose data only in parts of the image or audio signal not perceived by humans, they can lose data in a way that causes automatic analysis algorithms to fail or to give significantly different results than when applied to the uncompressed data.

An interesting side note was the mention of the current developments in the MPEG consortium leading to the formation of a working group developing a standard for the inclusion of textual and other descriptive information into a video stream (MPEG-7).

In the talk on audio analysis, the primary application fields of the related techniques were given as audio-based segmentation and semantic audio analysis. Audio segmentation is employed mainly to assist the video segmentation process, where it often helps to resolve ambiguous situations in the detection of shot and/or scene transitions.

Using audio analysis semantically requires a great deal of preparation and human knowledge before the algorithms can be productively employed. For example, the university's content analysis project attempts to detect violence using the audio track. Three common audible events associated with violence are shots, cries and explosions. All three have very distinctive characteristics (e.g. a very short onset time for a shot, or the presence of voice frequency bands for cries), but only an analysis of the combination of all these features can give a reliable identification, with the added complication that the weights of the various parameters must be different for each type of sound to be identified. For example, when detecting shots, two of the seven calculated parameters (onset and frequency spectrum) already make up nearly 60% of the weighted total. By contrast, the onset time is a very bad indicator for cries, so its importance for detecting these is close to none. Current recognition rates are between 40% and 85% and depend heavily on the type of sound to be recognized.
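
The idea of per-sound-type weights can be illustrated with a small sketch. Everything numeric here is invented for the example: the talk only reported that onset and frequency spectrum together carry nearly 60% of the weight for shots and that onset is almost irrelevant for cries. The parameter names `p3` to `p7` are placeholders for the five parameters that were not named, and each feature value is assumed to be normalized to the range 0 to 1.

```python
# Hypothetical weights, chosen only to mirror the proportions mentioned in the talk.
WEIGHTS = {
    "shot": {"onset": 0.30, "spectrum": 0.28, "p3": 0.10, "p4": 0.10,
             "p5": 0.08, "p6": 0.07, "p7": 0.07},
    "cry": {"onset": 0.02, "spectrum": 0.30, "p3": 0.20, "p4": 0.15,
            "p5": 0.13, "p6": 0.10, "p7": 0.10},
}


def weighted_score(features, sound_type):
    """Combine normalized feature values with the weights for one sound type."""
    weights = WEIGHTS[sound_type]
    total = sum(weights.values())
    return sum(weights[name] * features[name] for name in weights) / total


# A sound with a sharp onset and a matching spectrum, but nothing else.
sample = {"onset": 1.0, "spectrum": 1.0, "p3": 0.0, "p4": 0.0,
          "p5": 0.0, "p6": 0.0, "p7": 0.0}
shot_score = weighted_score(sample, "shot")
cry_score = weighted_score(sample, "cry")
```

With these made-up weights the same measurements score much higher as a shot than as a cry, which is the effect the different weightings are meant to achieve.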

Low- to medium-level video analysis is the basis for all content analysis run on the visual part of the material. Most low-level operations are classified as point operations or neighborhood operations. Point operations are mathematical functions that are applied to each point / pixel singly, without reference to other pixels, whereas neighborhood operations also take into account the pixels surrounding the one for which the function value is being calculated. Some common point operations are histograms and pseudo color images, whereas neighborhood operations encompass edge detection and color coherence vector calculation. (Color coherence vector analysis is an improved scheme similar to histogram analysis, except that it takes into account whether colors occur in small contiguous areas or in larger ones. It could thus, for example, distinguish the French national flag, with its large area of white, from the US national flag, with its many small areas of white, a distinction a regular color histogram cannot make.)
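
The difference between a plain histogram and a color coherence vector can be shown on two tiny made-up images. This is a minimal sketch, assuming already quantized colors (one character per pixel), 4-connected areas and a size threshold `tau`; the function name and the test images are illustrative and not taken from the MoCA system.

```python
from collections import Counter, deque


def color_coherence_vector(image, tau):
    """Return {color: (coherent_pixels, incoherent_pixels)}.

    A pixel counts as coherent if its 4-connected area of equal color
    contains at least tau pixels.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    ccv = {}
    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue
            color = image[y][x]
            seen[y][x] = True
            queue = deque([(y, x)])
            size = 0
            while queue:  # flood fill one connected area
                cy, cx = queue.popleft()
                size += 1
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and image[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            coherent, incoherent = ccv.get(color, (0, 0))
            if size >= tau:
                coherent += size
            else:
                incoherent += size
            ccv[color] = (coherent, incoherent)
    return ccv


# Two images with identical histograms: one large white area vs. many small ones.
blocks = ["WWWBBB", "WWWBBB", "WWWBBB", "WWWBBB"]
checker = ["WBWBWB", "BWBWBW", "WBWBWB", "BWBWBW"]
same_histogram = Counter("".join(blocks)) == Counter("".join(checker))
```

Both images contain 12 white and 12 black pixels, so a histogram cannot tell them apart, while the coherence vectors differ completely.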

One vital component of a system designed to recognize content in video is a segmentation system, i.e. an algorithm capable of extracting objects and/or regions from the image. Unfortunately the progress in this direction is not yet very good as there are very many factors that can negatively affect the outcome (for a simple example see below).

One approach towards segmentation is of course edge detection. Edge detection itself can be attempted starting from two different ideas: one is to try to detect the edges themselves, the other is to detect contiguous regions in the image and declare their boundaries to be edges. Detecting the edges directly is not a very successful approach, as it is far too sensitive to noise. A single off-color pixel in a line causes the line to be interrupted and to be detected as two different lines instead. In addition, there has to be a distinction between what constitutes a line and what is rather a (near) monochrome area or object.

Approaches based on regions seem more promising. Again, there are two options, namely Region Splitting and Region Growing. Region Splitting is a two-step procedure. In the first step, the examination starts with the entire picture as one region. Each region that is already homogeneous is stored as such; otherwise, it is split into several (usually two or four) smaller regions that are then recursively examined the same way. Once all of these proto-regions have been found, the system tries in the second step to merge adjacent regions so that they grow as large as possible while still being homogeneous. Single noise pixels do not cause as much trouble here: they will be treated as a one-pixel region (and then usually ignored by higher level components of the video analysis).
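
The first step of Region Splitting can be sketched as a recursive split into four parts. This is a simplified illustration under stated assumptions: a grey-value image given as a list of rows, homogeneity defined as "largest minus smallest value is at most `tol`", and a function name `split_regions` made up for the example. The second step (merging adjacent homogeneous regions) is not shown.

```python
def split_regions(image, tol, y0=0, x0=0, height=None, width=None):
    """Recursively split a grey-value image into homogeneous rectangles.

    Returns a list of (y, x, height, width) tuples.
    """
    if height is None:
        height, width = len(image), len(image[0])
    values = [image[y][x]
              for y in range(y0, y0 + height)
              for x in range(x0, x0 + width)]
    if max(values) - min(values) <= tol:
        return [(y0, x0, height, width)]
    # Split into up to four parts; a side of length 1 is not split further.
    half_h, half_w = max(height // 2, 1), max(width // 2, 1)
    regions = []
    for sub_y, sub_h in ((y0, half_h), (y0 + half_h, height - half_h)):
        for sub_x, sub_w in ((x0, half_w), (x0 + half_w, width - half_w)):
            if sub_h > 0 and sub_w > 0:
                regions += split_regions(image, tol, sub_y, sub_x, sub_h, sub_w)
    return regions


# A flat 4x4 image with a single noise pixel in the top-left corner.
noisy = [[100] * 4 for _ in range(4)]
noisy[0][0] = 0
noisy_regions = split_regions(noisy, tol=8)
```

In this small case the noise pixel ends up as its own one-pixel region, while three quarters of the image remain untouched 2x2 regions.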

The other approach is Region Growing, which is basically using the second step of Region Splitting after dividing the entire image into one-pixel regions. It is far more sensitive to image content abnormalities as this example might show:

Imagine a picture that actually consists of two regions. The top region is a continuous smooth grey scale running from black at the left to white at the right. The bottom region is the exact inverse of the top, having white at the left and black at the right.

Region Splitting will fairly soon arrive at homogeneous regions in the split. Assuming, for example, a tolerance of 8 (on a 255 value scale) and a square 255 x 255 picture, the split regions will be of size 8x8. In the very center of the image there is a small area where the top and the bottom region have very similar colors. However, the area where this difference is less than 8 grey levels is smaller than 8x8 pixels, so the top and bottom regions are not joined, which is consistent with what a human viewer would say.

However, with Region Growing, the center vertical line is absolutely monochrome and thus allows the top region to spill over into the middle of the bottom region and from there cover the entire bottom half. The image is thus treated as one region, which is obviously not correct. (Note that one single noise pixel at the right position, in the very center, could prevent this misdetection at a threshold of 1: because of the "negative" property of the bottom half, the smallest non-zero difference between the halves would be 2, i.e. 1 step darker on the top and 1 step lighter at the bottom or vice versa. This is an extreme example, but it shows that even one pixel a human observer would not notice can vastly change the output of a segmentation algorithm.)
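
The spill-over effect can be reproduced on a scaled-down version of the example picture. The sketch below assumes a simple form of Region Growing in which two neighbouring pixels are joined whenever their grey values differ by at most `tol`; the 9 pixel wide test picture, the helper `two_gradients` and the noise value 100 are assumptions chosen to keep the example small.

```python
from collections import deque


def grow_regions(image, tol):
    """Label 4-connected regions; neighbours join if they differ by <= tol.

    Returns (labels, region_count).
    """
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if labels[y][x] != -1:
                continue
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and abs(image[ny][nx] - image[cy][cx]) <= tol):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count


def two_gradients(width=9, rows_per_half=2):
    """Top half: grey ramp from dark to light; bottom half: the inverse ramp."""
    top = [list(range(width)) for _ in range(rows_per_half)]
    bottom = [list(range(width - 1, -1, -1)) for _ in range(rows_per_half)]
    return top + bottom


clean = two_gradients()
_, clean_count = grow_regions(clean, tol=1)

noisy = two_gradients()
noisy[1][4] = 100  # one noise pixel in the center, at the border of the top half
_, noisy_count = grow_regions(noisy, tol=1)
```

On the clean picture the two halves merge through the center column into a single region. With the one noise pixel in place, the halves stay separate and the noise pixel forms a third region of its own.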

Using these techniques, the MoCA (Movie Content Analysis) system as a first step detects Shot boundaries, object motion (approximated), Fades and Dissolves as medium level syntactic elements. The system then performs a shot fusion and shot clustering analysis (going from the syntactic to the semantic level), attempting to detect scenes (already very successfully).

Besides these general attempts at video analysis, the MoCA system already performs a few specialized tasks that are also helpful during the indexing of video.

The first of these is face detection, which is done with a neural net. This net is supplied with various rectangular excerpts from each frame, varying in both size and orientation. Each of these rectangles is first subjected to a normalization process, adjusting its brightness and contrast to standardized distributions so that the pattern can be extracted. This lighting compensation can also cope with inhomogeneous lighting, as long as the lighting situation can be approximated by a linear gradient (a common case). Then the rectangle is fed into the neural network, which attempts to detect the typical pattern of a face (eyes, mouth, nose). The network only recognizes the area from the forehead to the chin and thus avoids the high variability associated with hair features. To improve the system's accuracy, all raw detections are then grouped by location, and a final detection is generated only if enough single positive results fall into one rectangle. (The neural net delivers a significant number of false positives; however, these are far more sensitive to small environment changes and thus far less likely to reoccur after a slight displacement or rescaling than a true detection.)
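
The final grouping step can be sketched as follows. How MoCA actually groups its raw detections was not described in detail, so this is only one plausible reading: raw hits are clustered greedily by the distance between box centers, and a cluster is accepted once it has collected enough hits. The function name `confirm_detections` and the parameters `max_dist` and `min_hits` are assumptions.

```python
import math


def confirm_detections(boxes, max_dist=4.0, min_hits=3):
    """Group raw detections (x, y, width, height) by location.

    Returns (center_x, center_y, hit_count) for every group that contains
    at least min_hits raw detections.
    """
    clusters = []  # each cluster is a list of box centers
    for x, y, w, h in boxes:
        cx, cy = x + w / 2, y + h / 2
        for cluster in clusters:
            mean_x = sum(c[0] for c in cluster) / len(cluster)
            mean_y = sum(c[1] for c in cluster) / len(cluster)
            if math.hypot(cx - mean_x, cy - mean_y) <= max_dist:
                cluster.append((cx, cy))
                break
        else:
            clusters.append([(cx, cy)])
    confirmed = []
    for cluster in clusters:
        if len(cluster) >= min_hits:
            mean_x = sum(c[0] for c in cluster) / len(cluster)
            mean_y = sum(c[1] for c in cluster) / len(cluster)
            confirmed.append((mean_x, mean_y, len(cluster)))
    return confirmed


# Three overlapping hits around one face and one isolated false positive.
raw = [(8, 8, 4, 4), (9, 8, 4, 4), (8, 9, 4, 4), (48, 48, 4, 4)]
faces = confirm_detections(raw)
```

The isolated hit is dropped because it does not reoccur at slightly shifted positions, which mirrors the argument about false positives made above.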

The next important element of semantic analysis is text segmentation and recognition, allowing the system to automatically extract captions and other text (e.g. credits) from a video sequence. The algorithm is as follows:

  1. Extract all monochrome or near monochrome areas from the image
  2. Out of these, remove all that have insufficient contrast with their background to be legible and/or have a size that is not compatible with the assumption that it could be a letter.
  3. Out of the remaining areas, remove all those that are not static or slowly moving in a direction parallel to one of the screen edges.
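
Steps 2 and 3 amount to a chain of filters over candidate areas, which might look like the sketch below. Step 1 (extracting the near monochrome areas) is assumed to have happened already. The `Region` fields, the function name `is_text_candidate` and every threshold are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    contrast: float  # contrast against the background, arbitrary units
    width: int       # size in pixels
    height: int
    dx: float        # motion per frame in pixels
    dy: float


def is_text_candidate(region, min_contrast=40.0, min_size=2, max_size=60,
                      max_speed=3.0, axis_tol=0.2):
    """Apply the contrast, size and motion filters to one candidate area."""
    # Step 2: legible contrast and a size compatible with a letter.
    if region.contrast < min_contrast:
        return False
    if not (min_size <= region.width <= max_size
            and min_size <= region.height <= max_size):
        return False
    # Step 3: static, or moving slowly parallel to one of the screen edges.
    if math.hypot(region.dx, region.dy) > max_speed:
        return False
    return abs(region.dx) <= axis_tol or abs(region.dy) <= axis_tol


candidates = [
    Region(contrast=90, width=10, height=14, dx=0.0, dy=0.0),   # static caption
    Region(contrast=90, width=10, height=14, dx=0.0, dy=-1.5),  # scrolling credits
    Region(contrast=10, width=10, height=14, dx=0.0, dy=0.0),   # too faint
    Region(contrast=90, width=200, height=150, dx=0.0, dy=0.0), # too large
    Region(contrast=90, width=10, height=14, dx=2.0, dy=2.0),   # diagonal motion
]
kept = [is_text_candidate(r) for r in candidates]
```

Only the static caption and the vertically scrolling credits survive the filters in this made-up set.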

At this point, most (50 to 95%) of the extracted areas are actually characters (letters or digits), and over 97% of all actual characters are correctly extracted. The extracted areas are then fed into an OCR (Optical Character Recognition) algorithm and translated into ASCII text. Currently the OCR software is very simple and will match each area to a character regardless of recognition certainty (i.e. even a randomly shaped blob will be matched to whatever character it resembles most closely). The character recognition rate is around 80%, good enough for a human user to take the output and quickly grasp the meaning of the text, as well as to recognize sequences of garbage characters created by false detections. The system could easily benefit from more sophisticated OCR software; however, this is outside the scope of the project, as well-performing OCR programs are already available.

The last of the specialized recognition systems is the commercials detector. This component uses various syntactic video properties to detect and delimit commercial blocks in television broadcasts. It can be used for several purposes, among which are

  1. Automatically monitoring TV channels to verify the correct broadcasting of contracted commercials
  2. Automatically monitoring TV channels for new commercials that could possibly be competitors' spots and extraction of all new commercials for viewing / further automatic analysis (e.g. text detection to find the brand name of the product being advertised)
  3. Shielding children from commercials by automatically denying them access to this kind of programming
  4. Assisting impact studies aimed at determining the best placement for commercials.

Two approaches are possible and both are used: one is the detection of commercial blocks based on video features, the other is detection based on the re-recognition of already known commercials. These approaches complement each other and are used together to arrive at a very high detection and delimiting accuracy. Feature-based detection uses the distinguishing syntax of the visual language of commercial blocks. Commercial blocks are characterized by

  1. a length of 8 minutes or less
  2. short groups of 5 to 12 monochrome (often black) frames occurring between spots, in distances of 90 seconds or less (usually less than 30 seconds apart)
  3. a higher general volume of the audio signal during the commercial block than during the surrounding programming.
  4. high cut rates and rapid change between shots with very high motion and stills.
  5. high probabilities of finding text, especially in the stills.

All these features except for the audio volume and the text content are detected and evaluated during commercial block detection using the monochrome frames and hard cut criteria for coarse scanning of the video stream and the motion rate feature for determination of the exact limits. The length criterion is used as a plausibility check once the block has been detected. To recognize individual spots, the Color Coherence Vectors of each frame are calculated and stored as that spot's "fingerprint". Comparison of these fingerprints is computationally cheap. Equipped with this type of recognition system the commercials detector can be extended to an autonomous self-learning commercials charting and cataloguing system.
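
The coarse scan and the plausibility check can be illustrated with a short sketch. It assumes that the groups of monochrome frames have already been found and are given as time stamps in seconds; separators at most 90 seconds apart are chained into one candidate block, and candidates longer than 8 minutes are rejected. The function name and the requirement of at least two separators per block are assumptions, and the refinement of the exact block limits by motion rate is not shown.

```python
def find_commercial_blocks(separator_times, max_gap=90.0, max_length=480.0):
    """Chain monochrome-frame separators into candidate commercial blocks.

    separator_times: time stamps (seconds) of monochrome frame groups.
    Returns a list of (start, end) tuples.
    """
    blocks = []
    run = []
    for t in sorted(separator_times):
        if run and t - run[-1] > max_gap:
            _close_run(run, blocks, max_length)
            run = []
        run.append(t)
    _close_run(run, blocks, max_length)
    return blocks


def _close_run(run, blocks, max_length):
    # Keep a run only if it has at least two separators and a plausible length.
    if len(run) >= 2 and run[-1] - run[0] <= max_length:
        blocks.append((run[0], run[-1]))


times = [100, 125, 155, 180, 1000, 2000, 2030]
blocks = find_commercial_blocks(times)
```

The single separator at 1000 seconds is ignored here, since an isolated black frame group on its own says little about commercials.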

The most advanced project research is currently conducted in two directions, genre recognition and video abstracting. Genre recognition uses feature-based heuristics similar to the audio classification system to sort broadcasts into genres like newscast, commercial, music video or tennis match. Most operators described above are calculated and weighted against each other. When confronted with seven possible genres (and no material from outside the genre space), the recognition rates of the system are 90% and better.

Lastly, video abstracting is the creation of a short moving picture summary of a longer video. This has only been implemented in the form of a very rough prototype and, because of its computational intensity, has only been tested on a few materials. However, the abstracts generated match human-edited trailer material of the same length with accuracies of between 30% (thrillers) and 80% (action movies) and generally give an acceptable indication of the material's content. The system does not currently use the MoCA low-level routines but can be adapted to use them. A significant rise in accuracy can be expected from utilizing the more sophisticated low-level processing routines available.

Content based Image retrieval

In contrast to the more academically oriented research of Prof. Effelsberg, most of Prof. Narasimhalu's research is from the beginning inspired by a practical application and often actually financed by a user.

His research group has been active in the areas of image processing and retrieval since the early to mid 1980s, and the focus of the research has changed a lot over this time. The main challenge presented by image retrieval is that graphical information, while very intuitive to a user, is from a computer standpoint very inexact and unclear data.

The first image-enhanced database system created by the group in 1988 was a hotel database that stored images and free text data with the formal data about more than 100 hotels in Singapore. The main challenge at that time was not yet retrieval on images but just storing several hundred images on disks and making them available for display. The group solved this problem by using magneto-optical disks as mass storage and only transferring them to the normal magnetic hard disk on demand, an approach that would today be known as a hierarchical storage management system.

The second creation was a newspaper photo archiving system (1989 / 90) that had to deal with many of the problems Euromedia faces today, namely the fact that the source material existed in many different formats, that the full quality digitized material cannot be held in on line storage, that searches are only possible based on the verbal descriptions and that users will often request their materials to be output in different formats and/or media.

Some of the solutions are very similar to what Euromedia is planning to do. The newspaper archive, for example, stored its images on a hard disk in preview quality for fast access (thumbnails) and on other media (optical disks, tape) in full quality for output. Searching on the images themselves was not possible, but the researching editor was able to refine his text-based queries using thesaurus technology and relevance feedback.

The Document Development System was created for Fujitsu and allowed searching image databases based on contents and structures. It contained sophisticated tools for version management and a database engine based on SGML. It can be considered an early object-relational database system.

Sponsored by the Singapore Police, the CAFIIR Face Recognition System is a system for feature-based image retrieval designed for homogeneous images. The images taken of faces are homogeneous because they fulfill the following conditions:

  • The objects depicted are of the same kind (face)
  • The lighting under which the objects are photographed is controlled and constant
  • All objects are depicted using the same orientation

The system extracts several features from each face; currently the chin, hair, eyes, eyebrows, nose and mouth are evaluated. The input query consists of a graphical image having one or more of these features set, and a set of weights reflecting how sure the observer is that each particular feature looks the way he described it. The database then matches the input query with the stored images, after applying an aging model to the stored faces if the stored pictures are old enough that the person might have changed significantly. A similarity retrieval is conducted, and relevance feedback and interactive query refinement allow the user to iteratively narrow down the search space until a perfect match has been found.

For recognizing heterogeneous images, STAR is a system to catalogue and retrieve trademark logos based on typographical, phonetic, graphical, structural and even semantic similarity. The system allows trademark granting authorities to quickly search the database of existing trademarks for those that could be confused with a newly submitted trademark logo, or with a logo that some trademark owner claims to be challenging his trademark. After a (human-assisted) optical character recognition and an analysis of image features, the system can present trademark logos that are similar to an existing one on any weighted combination of

  • the letters in the word (house vs. oust)
  • the sounds (shout vs. loud)
  • the shapes in the image (e.g. a 5-point star)
  • the meaning of the image (Bird in a circle)
  • the general layout of elements.

The system then presents the user with the logos that most closely resemble the input image by any of these criteria (either by a very close match in one category or by several moderately good matches) and allows him to quickly reach a decision on whether or not the input conflicts with a catalogued trademark.

The Craniofacial Growth System is a medical application of similarity retrieval based on a new technique: rather than treating the source images as a matrix of pixels, it vectorizes them and bases the analysis on these vectors. The advantage of this approach is that it can easily be adapted to three-dimensional images and is thus extremely well suited for medical applications. Besides the image data, the system stores textual data about various details of the patient, the diagnosis and the treatment. The system is developed far enough to give the doctor advice about possible diagnoses and treatments.

On the subject of archiving and retrieving moving images, three projects have been attempted at the University of Singapore, two of which are research-oriented and very similar to the approaches already mentioned. The third, concerned with archiving legacy videos, used preview video and textual descriptions in databases and is used industrially, for example during DVD production and in the ImageBank. The system is far more interesting from its hardware aspect than its software aspect and shall not be discussed in detail here.

Discussion Results

The most important result of the discussion was that a pure preview video archive was not deemed sensible by the users. Even though they see the high costs associated with the creation of a broadcast or even contribution quality archive, they also see that the labour cost of digitizing all necessary existing material is even higher and that thus digitization should be done only once and at the maximum possible quality, which means all multimedia data should be losslessly compressed and stored.

On the same topic, several informal discussions revolved around the best way to manage digitization for mixed format stock. Several interesting concepts were brought up, which shall be discussed in detail in the workflow report. The bottom line of them all, however, is to automate as much of the process as is technologically possible.

Generally, the exchange of ideas was considered very fruitful by all sides: the researchers gained an insight into the desires of the users, the implementers were able to assess both the economic and the technical feasibility of their ideas, and the users got an idea of what can be done and how it could be applied to their problems.

Herbert Hayduck

 

EDITORS: Gösta Johansson, Stellan Norrlander (Sveriges Television, S-105 10 Stockholm, Sweden).
LAYOUT: Ragnar Lilliestierna (Stockholm, Sweden)
HTML LAYOUT: Karl Erik Andersen (National Library, Mo i Rana, Norway)
Last update : 01/03/06
© 2006 - FIAT/IFTA