Erling Wold, Thom Blum, Douglas Keislar, and James Wheaton

Muscle Fish LLC

2550 Ninth Street, Suite 207 B

Berkeley, CA 94710, USA


Abstract: Many audio and multimedia applications would benefit if they could interpret the content of audio rather than relying on descriptions or keywords. These applications include multimedia databases and file systems, digital libraries, automatic segmentation or indexing of video (e.g., news or sports footage), surveillance, as well as sound browsers for effects designers and musicians. This chapter describes an audio analysis, search, and classification engine that reduces sounds to acoustical and perceptual features. This analysis allows the sounds to be classified or queried by their audio content. Queries can be based on any one or a combination of the acoustical features, by specifying previously learned classes based on these features, or by selecting or entering reference sounds and asking the engine to retrieve sounds that are similar or dissimilar to them. We present examples of this engine as it would be used in some of the application areas listed above.



The rapid increase in speed and capacity of computers and networks has allowed the inclusion of audio as a data type in many modern computer applications. Until recently, the audio has been treated as an opaque collection of bytes with only primitive fields attached: name, sampling rate, and so on. Developers and users who are accustomed to searching, scanning and retrieving text data can be frustrated by the inability to look at the content inside an audio object.

For example, multimedia databases or file systems can easily contain thousands of audio recordings. These could range from sound effects or music libraries to archives of news footage. Such libraries are often sparsely indexed to begin with. Even if someone has assigned keywords or indices to the data, these keywords are highly subjective. They may be useless to another person who comes to the data from a different perspective or who uses a different taxonomy. To make things worse, audio is hard to browse directly, since it must be auditioned in real-time, unlike video, which can be keyframed. Searching for a particular sound or class of sound (e.g., applause, music, or a particular person's speech) can be a daunting task.

Here are some typical ways that one might want to access sound:

• Physical attributes: describing a sound in terms of commonly understood acoustical characteristics, such as brightness, pitch and loudness.

• By example: saying one sound is like another sound or a group of sounds in terms of some characteristics. For example, "like the sound of a herd of elephants." A simpler example would be to say that it belongs to the class of speech sounds or the class of applause sounds, where the system has previously been trained on other sounds in this class. As another example, the user could make a buzzing sound to find bees or electrical hum.

• Subjective features: describing the sounds using personal descriptive language. This requires training the system (in our case, by example) to understand the meaning of these descriptive terms. For example, a user might be looking for a "shimmering" sound.

• Semantic content: text content for speech recordings and the score (events, instruments and rhythms) for musical recordings.

In a retrieval application, all of the above could be used in combination with traditional keyword and text queries.

To accomplish the above, we develop analyses of sound which compute the instantaneous acoustic characteristics. These characteristics have been chosen based on knowledge of what is psychoacoustically important. Next, we apply statistical and heuristic algorithms to reduce this data to a more compact or more easily compared set of features. Finally, we use several different distance measures to compare these features and thus to compare, classify and search for sound. Note that this approach is not unique to audio. It has been used in the image retrieval world and, more generally, in the field of pattern recognition for many years. What is unique here are the particular features which are appropriate for audio and the use of these features in audiocentric applications.


Sounds have traditionally been described by their pitch, loudness, and timbre. The first two of these psychological percepts are well-understood and can be accurately modeled by measurable acoustic features. Timbre, on the other hand, is an ill-defined attribute that encompasses all the distinctive qualities of a sound other than its pitch and loudness. The effort to discover the components of timbre underlies much of the previous psychoacoustic research that is relevant to content-based audio retrieval [1].

Salient components of timbre include the amplitude envelope, harmonicity, and spectral envelope. The attack portions of a tone are often essential for identifying the timbre. Timbres with similar spectral energy distributions tend to be judged as perceptually similar. However, research has shown that the time-varying spectrum of a single musical instrument tone cannot generally be treated as a "fingerprint" identifying the instrument, because there is too much variation across the instrument's range of pitches and dynamic levels.

In the computer music community, various researchers have discussed or prototyped algorithms capable of extracting audio structure from a sound [2]. The goal was to allow queries such as "find the first occurrence of the note G-sharp." These algorithms were tuned to specific musical constructs and were not appropriate for all sounds.

There are a number of people who have done work which overlaps what we will present here. Feiten and Günzel have looked at audio databases with content analysis using neural nets [3]. Foote has developed algorithms for classification and comparison using cepstral coefficients and histogram representations of their probability densities [4]. Fischer et al. at the University of Mannheim have been working in the area of audio content analysis [5]. Scheirer and Slaney have developed a music/speech classification system which uses features similar to ours as well as many features designed specifically for this task [6]. They also examine a broad range of classifier strategies. Many in the field of speaker identification, although focused on that particular task, have used the strategy of feature extraction and statistical comparison. Speech recognition systems can be used to transcribe speech, allowing for text-based queries [7]. Finally, there is our own early work [8-10].


In this section, we present techniques for analyzing audio signals in a way that facilitates audio classification and search.


Figure 1 shows an overview of the feature extraction process. For each frame of audio data (25 to 40 msecs), we measure a number of acoustic features of the sound. These are the basis for all of the higher-level analyses that follow. The features analyzed here are not the only acoustic features which could be extracted. We have found these features to be of wide utility, but, for specific applications, it may be advantageous to look at other low-level parameters.

Figure 1. Overview of feature extraction

Note that each class of features has its own set of useful distance measures. In addition, some distance measures may be more useful in particular applications.


The following features are currently extracted from each frame of sound, producing, for the entire sound, a time-series {fⱼ} where each element is a vector of floating point numbers representing the instantaneous values of the features. For some applications, this is the appropriate level of content description. For example, a melody-matching algorithm would compare the pitch functions of two or more musical selections more-or-less directly. Similarly, an audio copyright-infringement detection system could be built using correlations of some or all of the time functions.
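To make the frame-level representation concrete, here is a minimal sketch (our illustration, not the engine's actual implementation) that slices a signal into overlapping frames and computes one instantaneous feature per frame, a time-domain RMS loudness in decibels; the function name and the frame and hop lengths are hypothetical values in the 25 to 40 msec range mentioned above:

```python
import numpy as np

def frame_features(signal, sample_rate, frame_ms=32, hop_ms=16):
    """Slice a signal into short overlapping frames and compute one
    instantaneous feature per frame (here: RMS loudness in dB).
    Returns the feature time-series {f_j} as a 1-D array."""
    signal = np.asarray(signal, float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        feats.append(20.0 * np.log10(max(rms, 1e-10)))  # dB, floored
    return np.array(feats)
```

Each additional feature (pitch, brightness, bandwidth, and so on) would contribute one more component to every element of the time-series.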

2.2.1. Features

Loudness

Loudness is approximated by the square root of the energy of the signal computed from the short-time Fourier transform (STFT), in decibels. The loudness time series is highpass filtered to remove the long-term average volume level. A more accurate loudness estimate would account for the frequency response of the human ear; if desired, the necessary equalization can be added by applying the Fletcher-Munson equal-loudness contours and even more detailed models of loudness summation. The human ear can hear over a 120 decibel range. Our software produces estimates over a 100 decibel range given 16-bit audio recordings.

Pitch

The monophonic pitch is estimated from the STFT. For each frame, the frequencies and amplitudes of the spectral peaks are computed using a parabolic interpolation. An approximate greatest common divisor algorithm is used to calculate an estimate of the fundamental frequency. The pitch time-series is cleaned up by heuristics which remove harmonic jump and noise errors. We store the pitch as a log frequency. The pitch algorithm also returns a pitch confidence value that can be used to weight the pitch in later calculations. A perfect young human ear can hear frequencies in the 20 Hz to 20 kHz range. Our software can measure pitches in the range of 50 Hz to about 8 kHz.

Tone (brightness and bandwidth)

Brightness is computed as the centroid of the STFT, again stored as a log frequency. It is a measure of the higher-frequency content of the signal. As an example, putting your hand over your mouth as you speak reduces the brightness of the speech sound as well as the loudness. This feature varies over the same range as the pitch, although it can't be less than the pitch estimate at any given instant.

Bandwidth is computed as the magnitude-weighted average of the differences between the spectral components and the centroid. As examples, a single sine wave has a bandwidth of 0 and ideal white noise has an infinite bandwidth.
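A sketch of the two tone features, assuming the caller supplies one frame's STFT magnitudes and their center frequencies; the conversion of the centroid to a log frequency is omitted here for clarity:

```python
import numpy as np

def tone_features(mags, freqs):
    """Brightness as the spectral centroid and bandwidth as the
    magnitude-weighted average deviation from that centroid."""
    mags = np.asarray(mags, float)
    freqs = np.asarray(freqs, float)
    total = mags.sum()
    centroid = (freqs * mags).sum() / total
    bandwidth = (np.abs(freqs - centroid) * mags).sum() / total
    return centroid, bandwidth
```

As in the text, a single sine (one nonzero magnitude) yields a bandwidth of 0, and spreading energy across the spectrum increases it.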

Figure 2: Bulgarian singer.

Cepstrum

We compute a vector of Mel-Filtered Cepstral Coefficients (MFCCs) by applying a mel-spaced set of triangular filters to the STFT and following this with the discrete cosine transform. We have an option which can alleviate channel and noise effects using cepstral mean subtraction and other post-filtering techniques.

Derivatives

Since the dynamic behavior of a sound is important, the low-level analyzer also computes the instantaneous derivative (time differences) for all of the features above.
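The derivative computation can be sketched as simple frame-to-frame differences; padding the first frame with zero so the series keeps its length is our choice, not necessarily the authors':

```python
import numpy as np

def feature_derivatives(series):
    """Instantaneous derivative of each feature as frame-to-frame time
    differences. `series` is (num_frames x num_features); the first
    frame's derivative is set to zero so the output keeps the same
    number of frames as the input."""
    series = np.asarray(series, float)
    deriv = np.diff(series, axis=0)
    return np.vstack([np.zeros((1, series.shape[1])), deriv])
```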

2.2.2. Example

Figure 2 shows the time-varying nature of selected acoustic parameters for a recording of a Bulgarian vocalist. At the beginning of the recording, she holds a long tone with a large vibrato. About halfway through, she switches to a pattern of notes which is repeated twice. Both of these are fairly clear in the pitch and loudness traces. There are clear correlations between the pitch, loudness and spectral features. The vowel changes, especially the one at the end of the first held note, are visible in the spectral features.

2.2.3. Distance

Distance measures are necessary for classification and query-by-example. There are several useful distance measures for comparing the frame data of two sounds. A straightforward least-squares error or correlation measure is appropriate for applications where an exact match is required, e.g., finding a piece of copyrighted audio in an internet broadcast. Given two feature-vector time-series {fⱼ} and {gⱼ} of length L from two sounds, the least-squares distance is given as follows. For simplicity, we show the 1-dimensional case. For N dimensions, the distances can be computed separately for each dimension and summed.

D = (1/L) (Σⱼ (fⱼ - gⱼ)²)^(1/2)   (1)

The correlation measure is given below. Again, the 1-dimensional case is shown. For N dimensions, the 1-dimensional correlations can be multiplied together.

C = (1/L) Σⱼ fⱼgⱼ   (2)

Both of the above should be normalized by the length of the series. Some policy must be decided upon for series of different lengths. For the correlation measure, the mean is typically removed before the correlation is performed.
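Equations (1) and (2) can be sketched as follows for the 1-dimensional, equal-length case, with the mean removed before correlating as noted above (function names are ours):

```python
import numpy as np

def least_squares_distance(f, g):
    """Length-normalized least-squares distance of Eq. (1) between two
    equal-length 1-D feature time-series."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return np.sqrt(np.sum((f - g) ** 2)) / len(f)

def correlation_measure(f, g):
    """Length-normalized correlation of Eq. (2), with the means removed
    first as described above."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return np.sum((f - f.mean()) * (g - g.mean())) / len(f)
```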

In some applications, such as searching for sound effects by example, it is more useful to compare the shape of the time function rather than the exact values. This can be accomplished by scaling the time functions in both dimensions to lie in a 1 × 1 square, say, and then comparing these two functions using either of the measures above.

In all these distance measures, it is useful to provide user-controllable weights on the components of the distances. For example, a user might only want to consider the timbral features when doing a comparison.


Once we have a time series of frame values, we can extract higher-level information from it. The first thing we do is a statistical analysis. When determining a general classification for a sound, e.g., speech or music, or when comparing the quality of two sounds, we don't want to compare the above time functions directly. Rather, we would like to compare their statistical properties.

2.3.1. Gaussian model

Analysis

In the simplest case, we compute the mean μ and standard deviation σ of the frame-level time series for each parameter, including the parameter derivatives. This is detailed enough to distinguish between many classes of sound and can achieve subjectively reasonable sound-to-sound comparisons. When computing the mean and standard deviation, the frame-level features are weighted by the instantaneous loudness so that the perceptually important sections of the sound are emphasized.
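A sketch of the loudness-weighted statistics for a single feature; it assumes loudness on a non-negative linear scale rather than in dB, and the function name is ours:

```python
import numpy as np

def loudness_weighted_stats(values, loudness):
    """Mean and standard deviation of one frame-level feature, with each
    frame weighted by its instantaneous loudness so that perceptually
    prominent sections of the sound dominate the statistics."""
    v = np.asarray(values, float)
    w = np.asarray(loudness, float)
    w = w / w.sum()                      # normalize weights
    mean = float(np.sum(w * v))
    std = float(np.sqrt(np.sum(w * (v - mean) ** 2)))
    return mean, std
```

With uniform loudness this reduces to the plain mean and (population) standard deviation; a silent frame contributes nothing.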

For applications where finer sound distinctions are necessary, such as speaker ID, a simple mean and standard deviation is not enough for satisfactory performance. The Gaussian model can be extended to include Gaussian mixtures. The increase in performance is accompanied by additional costs in storage and computation.

Note that in query applications, it can be useful to directly specify constraints on the values of the N-vector described above. For example, the user could be looking for musical instrument samples in a certain range of pitch. However, it is also possible to query by example. In this case, the user trains the system given examples of sounds.

The above analysis makes sense when the input sound is homogeneous in character (statistically stationary), for example a short sound effect like a door slam or a recording of rain ambience. When analyzing a longer, heterogeneous recording, for example a live news broadcast, it is more reasonable to segment the recording and compute a feature vector for each segment. The segmentation can be done arbitrarily, e.g., in overlapping equal-sized windows of a few seconds each, where we hope that each window is relatively homogeneous, or it can be done by a scene-change analysis which we will describe later.

Modeling a class

If the system is presented with several homogeneous sound recordings which are all examples of a single class of sound, this can be interpreted in two ways. The first is to just treat the sounds as if they were parts of one long sound. This is reasonable if the resulting sound is homogeneous. For example, in speaker ID, there may be several recordings of the same speaker, all of which can be treated together.

The second is to treat them as different members of the class. In this case, we can infer something from the variability of the parameters across the different recordings. For example, there may be several samples of oboe tones, each at a different pitch. If one of these is presented to the system as an example of an oboe sound, the system has no a priori way of determining that it is the timbre of the sound that determines the class, rather than the particular pitch of this sample. However, if all the samples are presented to the system, the variability of the pitch across the samples can be noted and then used to weight the different parameters in comparisons. This information can then be used when comparing new sounds to this class. This variability can be stored in the standard deviation portion of the N-vector above, or, if the standard deviation portion of the N-vector is important to the class identity, it can be stored as additional information.

Distance

For the single Gaussian model, there are several different useful distance measures. When the two sounds are both homogeneous, an effective and simple measure is the Euclidean distance between the two sounds' N-vectors.

D = ||a₀ - a₁||

where a₀ and a₁ are the feature vectors for the two sounds being compared.

When one of the feature vectors, say a₀, represents a class of sound such as the oboe example above, the inverse of the standard deviation of its N-vector can be used to weight the distance calculation. If we break the N-vector a into its component mean μ and standard deviation σ, we have the sound-to-class distance given by:

D = ((μ₁ - μ₀)ᵀ R₀⁻¹ (μ₁ - μ₀))^(1/2)   (3)

where R₀ is the covariance matrix containing the squared elements of the σ vector down the diagonal. If, as was discussed in the previous section, the variability information was stored as additional information, the distance measure should be modified to be

D = ((a₁ - a₀)ᵀ R₀⁻¹ (a₁ - a₀))^(1/2)   (4)

where R₀ now contains this additional variability information.
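The class-weighted distance of Eq. (4) can be sketched as follows, under the assumption made above that R₀ is diagonal with the squared per-dimension variabilities on its diagonal:

```python
import numpy as np

def sound_to_class_distance(a1, a0, sigma0):
    """Weighted Euclidean distance of Eq. (4): with a diagonal R0, each
    dimension of the difference a1 - a0 is inversely weighted by the
    class's variability sigma0 in that dimension."""
    a1, a0, sigma0 = (np.asarray(x, float) for x in (a1, a0, sigma0))
    return float(np.sqrt(np.sum(((a1 - a0) / sigma0) ** 2)))
```

In the oboe example, a large pitch variability across the training samples makes pitch differences count for little, so timbre dominates the comparison.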

When a Gaussian mixture has been estimated from the frame data, we need to use a density distance measure such as the joint entropy.

In all these distance measures, the user should be given the ability to apply their own weights to the different elements in the N-dimensional computation based on their a priori knowledge of the importance of the different features for the task at hand.

2.3.2. Histogram model

Analysis

A Gaussian mixture model can represent the N-dimensional probability density of the frame features with arbitrarily high accuracy. However, another model which, as we shall see, can be useful in certain applications is a histogram model. This is simply a count of the number of frames of the sound in each volume of a particular partitioning of the feature space. Determining useful and efficient partitionings of the space is the difficult part of the computation. We will discuss just one possible partitioning strategy which is useful for discrimination applications, i.e., applications which require the classification of sounds or sound segments into one of several classes. When the classes are very broad, e.g., speech and music, it is desirable to find those places in the space which best distinguish the different classes.

Our strategy is simple. We build a decision tree by searching for optimal split points along each dimension of the frame feature data. These split points are chosen to maximize the separation of the different classes. The algorithm is as follows. Starting with the unpartitioned N-dimensional space and M classes represented by their frame-level feature data, we do a Monte Carlo optimization looking for the dimension d and value v which maximize a quality-of-separation metric. If the total number of frames in class i is given by nᵢ, the total number of frames in all classes by n, the number of frames in class i which lie above v in dimension d by cᵢ, and the sum of this over all classes by c, the quality metric is

r₀ = cᵢ/nᵢ

r₁ = (c - cᵢ)/(n - nᵢ)

Q = Σᵢ max[r₀ + (1 - r₁), r₁ + (1 - r₀)]   (5)


After the best split is found, the algorithm is run again on each of the resulting partitions. No split is computed for a partition if it is dominated by features of only one class. When this condition is met, this partition is made a leaf of the decision tree and is given a unique leaf index. We keep track of these leaves, their index numbers, and how many points from each class are in each leaf.
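The quality metric of Eq. (5) for one candidate split can be sketched as follows; the surrounding Monte Carlo search over (d, v) and the recursive partitioning are omitted, and the function name is ours:

```python
import numpy as np

def split_quality(class_frames, d, v):
    """Separation quality of Eq. (5) for a candidate split (dimension d,
    value v). `class_frames` is a list of per-class arrays of shape
    (num_frames_i x N). For each class, r0 is the fraction of its frames
    above the split and r1 the corresponding fraction over all other
    classes; a clean separation drives one of the two max terms to 2."""
    n_i = np.array([len(f) for f in class_frames], float)
    c_i = np.array([np.sum(np.asarray(f)[:, d] > v) for f in class_frames],
                   float)
    n, c = n_i.sum(), c_i.sum()
    q = 0.0
    for ci, ni in zip(c_i, n_i):
        r0 = ci / ni
        r1 = (c - ci) / (n - ni)
        q += max(r0 + (1.0 - r1), r1 + (1.0 - r0))
    return q
```

A perfect split of M classes scores Q = 2M, while a split that separates nothing scores M.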

When the process finishes, we have a set of histogram vectors, one for each class of sound. We normalize the histograms so that the nth element of the vector for a particular class of sound contains the number of points from that class in the nth leaf divided by the total number of points in that class. Thus it is a measure of how densely that class populates all the leaves of the decision tree. How distinct these histogram vectors are from each other is a measure of how well the sound classes can be distinguished given the current set of frame-level features.

Distance

Given a target segment of sound, we can use the decision tree above to classify the sound. The sound segment is run through the preliminary audio analysis, producing a set of frame data points. Each point is used to traverse the decision tree and will end up in one of the leaves. The total number of points from the segment's data set in each leaf is put into a normalized histogram vector sorted by leaf index as before.

This vector is then compared to the vectors for each class derived in the training process above using the correlation distance. If the two histograms are given by the sequences {gᵢ} and {hᵢ}, the similarity is given by

S = Σᵢ gᵢhᵢ   (6)

The target sound is assigned to the class with the highest similarity value.
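The classification step can be sketched as below, taking the target's normalized leaf histogram and the per-class training histograms as already computed (the function name and the dictionary interface are ours):

```python
import numpy as np

def classify_by_histogram(target_hist, class_hists):
    """Assign a target segment's normalized leaf histogram to the class
    with the highest correlation similarity S = sum_i g_i * h_i, Eq. (6).
    `class_hists` maps class names to their training histograms."""
    sims = {name: float(np.dot(target_hist, hist))
            for name, hist in class_hists.items()}
    return max(sims, key=sims.get)
```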


When the audio recording is known or has been classified to be music, one would like to extract music-level information, including rhythms, events such as notes (which might include fields such as pitch, duration, loudness) and instrument identification. The overall sound of the music is a useful measure as well, but this is handled by the statistical features above.

2.4.1. Rhythm

Analysis

Various beat detection and rhythm analysis algorithms have been reported in the literature. Ours is relatively simple, as we do not need a sophisticated analysis, just one which can produce enough features for reasonable comparisons.

To set up the rhythm analysis, another frame-level parameter needs to be computed. Since, for many types of music, the bass instruments are a more reliable indicator of the beat, we produce an alternate version of the loudness feature using a linearly bass-weighted version of the STFT.

The first step is to perform an FFT on the bass loudness time-series. This yields a spectrum whose x-axis measures distances in time, and whose peaks indicate the most frequent separation in time between bass events. The spectrum is normalized so that its maximum value is 1.0.

We search through the spectrum's frequencies for a peak over a search band corresponding to reasonable maximum and minimum tempi. This way, peaks at very large time separations are interpreted not as the actual beat of the music, but as a time interval comprising several beats. Similarly, the minimum time separation avoids excessively small time intervals. We compute a parabolic interpolation of the amplitude around the peak to arrive at a tempo estimate.
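A sketch of the tempo estimator, assuming the bass-loudness series has already been computed at a known frame rate; the tempo band limits are hypothetical defaults, not values from the chapter:

```python
import numpy as np

def estimate_tempo(bass_loudness, frame_rate, bpm_min=60.0, bpm_max=240.0):
    """Tempo estimate: take the magnitude spectrum of the bass-loudness
    series, search for its peak inside a plausible tempo band, and refine
    the peak position by parabolic interpolation. `frame_rate` is the
    sampling rate of the loudness series in Hz."""
    x = np.asarray(bass_loudness, float)
    x = x - x.mean()
    spec = np.abs(np.fft.rfft(x))
    spec = spec / spec.max()                       # normalize peak to 1.0
    freqs = np.fft.rfftfreq(len(x), d=1.0 / frame_rate)
    band = np.flatnonzero((freqs >= bpm_min / 60.0) &
                          (freqs <= bpm_max / 60.0))
    k = band[np.argmax(spec[band])]
    if 0 < k < len(spec) - 1:                      # parabolic refinement
        a, b, c = spec[k - 1], spec[k], spec[k + 1]
        denom = a - 2.0 * b + c
        if denom != 0.0:
            k = k + 0.5 * (a - c) / denom
    return 60.0 * k * frame_rate / len(x)          # beats per minute
```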

Once the tempo is known, the bass loudness time series is normalized into a rhythm feature vector in two ways. First, it is normalized in value so that it has an average loudness of 1.0, and, second, it is downsampled to a rate of 1/6th of a beat. For simple rhythms made of duplets and triplets, this will capture the most important sub-beat information. A higher sampling rate would pick up finer subdivisions, but would also tend to pick up rhythmically unimportant details that are due to deviations of the musician's performance from a strictly metronomic, theoretical performance. This normalization makes the rhythm feature vector independent of tempo. This independence allows the system to identify two musical excerpts as having similar rhythms, even if one excerpt has a faster tempo.

Distance

The distance between two tempi is trivial to compute, although it should be computed in a log space for perceptual scaling. As explained earlier, rhythm feature vectors are always normalized in time so that their sampling rates are all 1/6th of a beat, allowing rhythms with different tempi to be meaningfully compared. One problem is that the rhythm feature vectors might be offset from each other. For example, they might be identical except that one starts with an upbeat while the other starts on the downbeat.

To detect such cases, we slide the two rhythm feature vectors past each other, one sample at a time, computing the least-squares distance at each point. We then take the average of the two smallest values of the distance as the overall rhythm distance.
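A sketch of this comparison for equal-length vectors; sliding circularly (wrapping around the vector) is our simplification of sliding the vectors past each other:

```python
import numpy as np

def rhythm_distance(r1, r2):
    """Slide one rhythm vector past the other (circularly), compute a
    length-normalized least-squares distance at every offset, and average
    the two smallest values as the overall rhythm distance."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    n = len(r1)
    dists = sorted(np.sqrt(np.sum((r1 - np.roll(r2, s)) ** 2)) / n
                   for s in range(n))
    return (dists[0] + dists[1]) / 2.0
```

Two identical rhythms offset by one sample therefore score the same distance as two perfectly aligned copies.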

2.4.2. Events

Score-level descriptions of music are typically in the form of events. These events are typically characterized by a starting time, a duration, and a series of parameters such as pitch, loudness, articulation, vibrato and so on. All of these parameters can vary in time during the event. MIDI data is widely used as a score-level representation of music in computer environments. It is very limited in terms of the musical concepts it contains, but can be coerced into representing a wide range of music well enough.

Since we already have continuous frame-level data for pitch, loudness and so on, it is straightforward to convert these into the MIDI values for key number, pitch bend, key velocity and so on. The major difficulty is determining the event boundaries. If you look at figure 2, you will see that the note boundaries are not self-evident. The heuristics we follow for extracting musical note events are as follows:

A new event begins when:

• the loudness increases above a silence threshold

The current event ends when:

• the loudness decreases below a silence threshold

The current event transitions to a new event when:

• the loudness increases by a factor above the current average level

• the pitch changes suddenly

• the pitch has been changing, then settles into a new steady value

• the spectrum changes suddenly
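A toy rendering of a subset of these heuristics (silence-threshold start/end plus sudden pitch change; the loudness-jump, settling-pitch, and spectral-change rules are omitted, and all names and thresholds are hypothetical):

```python
def segment_events(loudness, pitch, silence_db=-60.0, pitch_jump=1.0):
    """Segment frame-level loudness (dB) and log-pitch series into note
    events: start when loudness rises above the silence threshold, end
    when it falls back below, and transition to a new event when the
    pitch jumps by more than `pitch_jump` between adjacent frames.
    Returns a list of (start_frame, end_frame) pairs."""
    events, start = [], None
    for i in range(len(loudness)):
        audible = loudness[i] > silence_db
        if start is None:
            if audible:
                start = i            # silence -> sound: event begins
        else:
            if not audible:
                events.append((start, i))   # sound -> silence: event ends
                start = None
            elif abs(pitch[i] - pitch[i - 1]) > pitch_jump:
                events.append((start, i))   # sudden pitch change: new event
                start = i
    if start is not None:
        events.append((start, len(loudness)))
    return events
```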

2.4.3. Instrument identification

We would like to identify the type of the sound source. For musical examples, this would typically be one of the standard musical instruments. Instrument identification is accomplished using the histogram classification system described in section 2.3.2. This requires that the system be trained on all possible instruments.


When the audio recording is known or has been classified to contain speech, it is very useful to provide a text version of the speech. Since there is a large body of literature on the problem of transcribing speech to text as well as on the problem of searching text in useful ways, we will not discuss it in detail here. Suffice it to say that it is a necessary part of a comprehensive audio content-based system. It may be useful in a query environment to keep an intermediate, phonetic form of the speech rather than the final text. The advantage of this is that the final text may have mistranscribed a word to a similar sounding word, whereas the intermediate form will have several possible phonetic transcriptions with likelihoods attached and can thus allow for fuzzier matches.


When performing a query-by-example in a small database, it is easy to compute the distance measure for all the sounds in the database from the example sound or model and then to choose the sounds that match the desired result. For large databases, this can be prohibitively expensive. To speed up the search, we produce an index of the sounds in the database by all the acoustic features. This allows us to quickly retrieve any desired hyper-rectangle of sounds in the database by requesting all the sounds whose feature values fall in a set of desired ranges. Requesting such hyper-rectangles allows a much more efficient search. This technique has the advantage that it can be implemented on top of the very efficient index-based search algorithms in existing databases.

As an example, consider a query to retrieve the top M sounds in a class. If the database has M₀ sounds in total, we first ask for all the sounds in a hyper-rectangle which is centered around the mean and has volume V such that

V/V₀ = M/M₀   (7)

where V₀ is the volume of the hyper-rectangle surrounding the entire database. The extent of the hyper-rectangle in each dimension is proportional to the standard deviation of the class in that dimension.

We then compute the distance measure for all the sounds returned and return the closest M sounds. If the first attempt does not retrieve enough sounds that match the query, we increase the hyper-rectangle volume by the ratio of the number requested to the number found and try again.
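The retry loop can be sketched as follows, with an in-memory comparison standing in for the database's index lookup, the rectangle's initial extent set from a crude reading of Eq. (7), and all names ours:

```python
import numpy as np

def query_top_m(features, mu, sigma, m, max_tries=10):
    """Indexed-search sketch: request candidates inside a hyper-rectangle
    centered on the class mean `mu`, with per-dimension half-extent
    proportional to `sigma` and scaled so V/V0 is roughly m / (database
    size); grow the rectangle by the requested/found ratio when too few
    candidates come back; finally rank candidates by weighted Euclidean
    distance and return the closest m row indices."""
    features = np.asarray(features, float)
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    n_dims, total = len(mu), len(features)
    half = sigma * (m / total) ** (1.0 / n_dims)
    for _ in range(max_tries):
        cand = np.flatnonzero(np.all(np.abs(features - mu) <= half, axis=1))
        if len(cand) >= m:
            d = np.sqrt(np.sum(((features[cand] - mu) / sigma) ** 2, axis=1))
            return cand[np.argsort(d)[:m]].tolist()
        # too few found: enlarge by the requested/found ratio and retry
        half = half * (m / max(len(cand), 1)) ** (1.0 / n_dims)
    return cand.tolist()
```

In a real deployment the candidate-gathering line would be a range query against the database's B-tree or R-tree index; only the final ranking needs the full distance computation.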

Note that the above discussion is a simplification of our current algorithm, which asks for bigger volumes to begin with to correct for two factors. First, for our distance measure, we really want a hypersphere of volume V, which means we want the hyper-rectangle that circumscribes this sphere and, second, the distribution of sounds in the feature space is not perfectly regular. If we assume some reasonable distribution of the sounds in the database, we can easily compute how much larger V has to be to achieve some desired confidence level that the search will be successful.


We now show the behavior of some of the above algorithms on a test sound database. These sound files were culled from various sound effects and musical instrument sample libraries. A wide variety of sounds is represented: animals, machines, musical instruments, speech and nature. The sounds vary in duration from less than a second to about 15 seconds.

A number of classes were made by running the simple single-Gaussian classification algorithm of section 2.3.1 on some perceptually similar sets of sounds. These classes were then used to reorder the sounds in the database by their likelihood of membership in the class. The likelihood is a mapping of the distance measure given above, where a distance of 0 corresponds to a high likelihood and an infinite distance corresponds to a likelihood of 0. The likelihood is normalized so that 1.0 corresponds approximately to the boundary of the original set of sounds which made up the class. The following shows the results of this process for several sound sets. These examples show the character of the process and the fuzzy nature of the retrieval. An interactive demonstration (with sound) is available at the Muscle Fish web site.


For this example, all of the recordings of laughter except two were used in creating the class. Figure 3 shows a plot of the class membership likelihood values (the Y-axis) for all of the sound files in the test database. Each vertical strip along the X-axis is a user-defined category (the directory in which the sound resides). The highest returned likelihoods are for the laughing sounds, including the two that were not included in the original training set, as well as one of the animal recordings. This animal recording is of a chicken coop and has strong similarities in sound to the laughter recordings, consisting of a number of strong sound bursts.


Our test database contains a number of very short recordings of a group of female and male speakers. For this example, the female spoken phrase "tear gas" was used. Figure 4 shows a plot of each test-database sound file's likelihood of matching this sound, using a default value for the covariance matrix R. The highest likelihoods are for the other female speech recordings, with the male speech recordings following close behind.


Figure 3: Likelihoods of membership in laughter class.

Figure 4: Similarity to female speech.


The above technology is relevant to a number of application areas. The examples in this section will show the power this capability can bring to a user or developer working in these areas.


Any audio database, or, equivalently, any file system designed to work with large numbers of audio files, benefits from content-based capabilities. Both require that the audio data be represented or supplemented by a so-called metadata record or object that points to the sound and adds the necessary analysis data.

When a sound is added to the database, the analyses presented in section 2 are performed and a new database record or object is formed with this supplemental information. Typically, the database would allow the user to add his or her own information to this record. In a multi-user system, each user could have their own copy of the database records that they could modify for their particular requirements.

Figure 5 shows an example database schema appropriate for a sound effects database. Fields in this record include the sound file's name and properties, some of the acoustic features as computed by the system's analysis routines, and user-defined keywords and comments. This level of functionality is available commercially in the current INFORMIX-Universal Server. The Bulldog Group’s asset management system has similar capabilities.

sound file attributes:
    sample rate
    sample size
    sound file format
    number of channels
    creation date
    analysis date
    user attributes

Gaussian statistics feature vector:
    loudness [mean, stdev]
    Δ loudness [mean, stdev]
    pitch [mean, stdev]
    Δ pitch [mean, stdev]
    tone [mean, stdev]
    Δ tone [mean, stdev]
    MFCCs [mean, stdev]
    Δ MFCCs [mean, stdev]

Figure 5: Database record.
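A record of this shape could be sketched as a simple data structure. The field names below mirror the Figure 5 schema, but the exact types, the `brightness` naming for the tone feature, and the example values are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SoundRecord:
    """Sketch of the Figure 5 schema: sound file attributes plus the
    Gaussian feature statistics. Field names are illustrative."""
    path: str
    sample_rate: int
    sample_size: int
    file_format: str
    channels: int
    # [mean, stdev] pairs for each acoustic feature and its delta
    loudness: tuple = (0.0, 0.0)
    d_loudness: tuple = (0.0, 0.0)
    pitch: tuple = (0.0, 0.0)
    d_pitch: tuple = (0.0, 0.0)
    brightness: tuple = (0.0, 0.0)
    d_brightness: tuple = (0.0, 0.0)
    keywords: list = field(default_factory=list)  # user-supplied metadata

rec = SoundRecord("laugh01.aiff", 44100, 16, "AIFF", 1,
                  pitch=(180.0, 25.0), keywords=["laughter", "male"])
print(rec.pitch)  # (180.0, 25.0)
```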

Any user of the database can form an audio class. Audio classes are formed by presenting a set of sounds to the classification algorithm of the last section. The object returned by the algorithm contains a list of the sounds and the resulting statistical information. This class can be private to the user or could be made available to all users of the database. The kinds of classes that would be useful depend on the application area. For example, a user who was doing automatic segmentation of sports and news footage might develop classes that allow the recognition of various audience sounds (applause, cheers and so on), referee's whistles, close-miked speech, etc.
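Forming a class from a set of member sounds amounts to fitting statistics over their feature vectors. A minimal sketch, assuming a per-feature mean and variance model (the actual classification algorithm is the one described in the preceding section):

```python
def train_class(vectors):
    """Fit a class model (per-feature mean and variance) from the
    feature vectors of the member sounds."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    var = [sum((v[i] - mean[i]) ** 2 for v in vectors) / n for i in range(dim)]
    return mean, var

# Hypothetical 2-feature vectors [mean pitch, mean loudness] for
# three laughter recordings presented to the classifier
laughs = [[190.0, -10.0], [200.0, -11.0], [210.0, -9.0]]
mean, var = train_class(laughs)
print(mean)  # [200.0, -10.0]
```

The returned model, together with the list of member sounds, is what the class object would store; it can then be evaluated against any sound in the database using the likelihood computation.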

The database should support the queries that were described in the last section in combination with more standard queries on the keywords, sampling rate and so on.


In this section, we present the SoundFisher audio browser, a front-end database application written in Java that allows a user to search for sounds using combinations of content-based and traditional text queries. In addition, it permits general maintenance of the database's entries through adding, deleting, and editing of sounds and sound record data.

Figure 6 shows the GUI for the application after a search has been performed. This is an artificially complex query, but it serves to illustrate the functionality. The upper portion of the window contains the query itself. The Search button initiates a search using the query and the results are then displayed in the Current Records window.

A query is formed using a combination of constraints on the various fields in the database schema as well as query-by-example. The query shown in Figure 6 finds mid-fidelity sounds in the database, in the "animals" or "laughter" categories, that are similar to the previously defined class called "laughter young male".

Figure 6. SoundFisher browser.

The topmost element of the query indicates that it is to be applied to the entire database; it could also have been applied to the list in the Current Records window. Below this is a set of rows, each of which is a component of the total query. Each component includes the name of the field, a constraint operator appropriate for the data type of that field, and the value to which the operator is applied. Pressing one of the buttons in the row pops up a menu of possibilities. In Figure 6, there is one component that constrains the categories and one that specifies a medium sampling rate. The OR subcomponent on the category field is added through a menu item. There are also menu items for adding and deleting query components. The final component specifies the query-by-example. The "laughter young male" class was built using functions available through the Models menu, and consists of a single laughter recording.

Although not visible in this figure, many of the query component operators are fuzzy. For example, the user can constrain the pitch to be approximately 100 Hz. This constraint will cause the system to compute the distance between each sound’s mean pitch and 100 Hz and use this as part of the query resolution.
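A fuzzy constraint of this kind can be sketched as a graded score rather than a hard pass/fail test. The Gaussian falloff and the tolerance value below are assumptions for illustration, not the product's exact scoring rule.

```python
import math

def fuzzy_pitch_score(mean_pitch_hz, target_hz=100.0, tolerance_hz=50.0):
    """Turn 'pitch approximately 100 Hz' into a score in (0, 1]:
    1.0 at the target, falling off smoothly with distance."""
    d = mean_pitch_hz - target_hz
    return math.exp(-0.5 * (d / tolerance_hz) ** 2)

print(fuzzy_pitch_score(100.0))  # 1.0: exactly at the target
# A sound nearer the target pitch contributes a higher score
# to the overall query resolution than one farther away.
print(fuzzy_pitch_score(120.0) > fuzzy_pitch_score(200.0))  # True
```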

There are a number of ways to refine searches through this interface. Any query can be saved (bookmarked) under a name given by the user. These queries can be recalled and modified. The Navigate menu contains these commands as well as a history mechanism which remembers all the queries on the current query path. The Back and Forward commands allow navigation along this path. An entry is made in the path each time the Search button is pressed. It is of course possible to start over from scratch.


Current audio editors operate directly on the samples of the audio waveform. The user can specify locations and values numerically or graphically, but the editor has no knowledge of the audio content. The audio content is only accessible by auditioning the sound, which is tedious when editing long recordings.

A more useful editor would include knowledge of the audio content. Using the techniques presented in this chapter, a variety of sound classes appropriate for the particular application domain would be developed. For example, editing a concert recording would be aided by classes for audience applause, solo instruments, loud and soft ensemble playing, and so on. Using the classes, the editor could have the entire concert recording initially segmented into regions and indexed, allowing quick access to each musical piece and subsections thereof. During the editing process, all the types of queries presented in the preceding sections could be used to navigate through the recording. For example, the editor could ask the system to highlight the first C sharp in the oboe solo section for pitch correction.

A graphical editor with these capabilities would have Search or Find commands that functioned like the query command of the sound browser of the last section. Since it would often be necessary to build new classes on the fly, there should be commands for classification and analysis or there should be tight integration with a database application such as the audio browser of the previous section.


The application of content-based retrieval to surveillance is identical to that of the audio editor, except that the identification and classification are done in real-time. Many offices are already equipped with computers that have built-in audio input devices. These could be used to listen for the sounds of people, glass breaking, and so on. Also, a number of police jurisdictions use microphones and video cameras to continuously survey areas with a high incidence of criminal activity or a low tolerance for such activity. Again, such surveillance could be made more efficient and easier to monitor with the ability to detect sounds associated with criminal activity.


The audio soundtrack of video as well as audio-only recordings can be automatically indexed and segmented. Segmentation can be accomplished in a number of ways.

Analogous to automatic keyframing in the video world, we can find audio windows which represent each homogeneous section of audio. First, we extract the desired audio features from a short (1-5 seconds) window of the recording. As new sections of audio arrive and are analyzed, they are compared to the first window using the appropriate distance measure. When the distance exceeds a threshold, this window is marked as a new "keyframe."

Alternately, one can look for scene changes by measuring the distance between neighboring windows of audio. When the distance between neighbors exceeds a threshold, this represents a sudden change in the audio sound characteristics.

A trivial segmentation is to treat fixed-size overlapping windows of the raw analysis data as the individual sound segments. Once the segmentation is done, each of these segments can be classified and thus indexed.
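The scene-change approach above can be sketched in a few lines: compute the distance between the feature vectors of neighboring windows and mark a boundary wherever it exceeds a threshold. The Euclidean distance, the threshold value, and the two-feature vectors are assumptions for illustration.

```python
def segment_boundaries(windows, threshold):
    """Mark a boundary at every window whose feature vector is more
    than `threshold` (Euclidean distance) from its predecessor."""
    bounds = []
    for i in range(1, len(windows)):
        d = sum((a - b) ** 2 for a, b in zip(windows[i - 1], windows[i])) ** 0.5
        if d > threshold:
            bounds.append(i)
    return bounds

# Hypothetical per-window features: two music-like windows
# followed by two speech-like windows
feats = [[0.9, 0.1], [0.88, 0.12], [0.2, 0.8], [0.22, 0.78]]
print(segment_boundaries(feats, threshold=0.5))  # [2]: one scene change
```

The keyframing variant differs only in that each new window is compared against the most recent "keyframe" window instead of its immediate neighbor.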

Figure 7: Real-time classification.


Figure 7 shows a block diagram of a real-time system capable of classifying audio on the fly. This system has been used at Muscle Fish to create a real-time speech/music classifier, and it is used commercially in the Virage video logging and archival system. An interactive demo is available on the Muscle Fish web site.
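The classification stage of such a pipeline can be sketched as follows: each incoming analysis window is assigned to the nearest class model. The nearest-mean rule here is a simplified stand-in for the likelihood test described earlier, and the class means are hypothetical.

```python
def classify_stream(windows, classes):
    """Assign each incoming analysis window to the class whose mean
    feature vector is nearest (squared Euclidean distance)."""
    labels = []
    for w in windows:
        best = min(classes, key=lambda name: sum(
            (a - b) ** 2 for a, b in zip(w, classes[name])))
        labels.append(best)
    return labels

# Hypothetical class means over two features, as in a
# speech/music discriminator
classes = {"speech": [0.2, 0.8], "music": [0.9, 0.1]}
print(classify_stream([[0.85, 0.15], [0.25, 0.75]], classes))
# ['music', 'speech']
```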


This chapter has outlined some of the main features of an engine for content-based classification and retrieval of audio. The engine analyzes the acoustic features of the audio and extracts higher-level features from them. The analyzed features are relatively straightforward but suffice to describe a large universe of sounds. More analyses could be added to handle specific problem domains.

We have presented examples that show the efficacy and useful fuzzy nature of the search. The results of searches are sometimes surprising in that they cross semantic boundaries, but aurally the results are reasonable.

We have also presented some example applications, including an audio database or file system, an audio browser, a real-time classifier, as well as a content-aware audio editor and surveillance system. We have found that the basic approach presented here works well for a variety of interesting audio applications.


1. Plomp, R., "Aspects of Tone Sensation: A Psychophysical Study." Academic Press, London, 1976.

2. Foster, S., Schloss, W., and Rockmore, A. J., "Towards an Intelligent Editor of Digital Audio: Signal Processing Methods," Computer Music Journal , 6(1), 42-51, 1982.

3. Feiten, B., and Günzel, S., "Automatic Indexing of a Sound Database Using Self-Organizing Neural Nets," Computer Music Journal, 18(3), 53-65, Summer 1994.

4. Foote, J., "A Similarity Measure for Automatic Audio Classification," Proceedings AAAI 1997 Spring Symposium on Intelligent Integration and Use of Text, Image, Video and Audio Corpora, Stanford, California, March 1997.

5. Pfeiffer, S., Fischer, S., and Effelsberg, W., "Automatic audio content analysis," Tech. Rep. TR-96-008, University of Mannheim, D-68131 Mannheim, Germany, April 1996.

6. Scheirer, E. and Slaney, M., "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator," Proceedings ICASSP-97, Munich, Germany, 1997.

7. Brown, M., Foote, J., Jones, G., Spärck-Jones, K., Young, S., "Open-Vocabulary Speech Indexing for Voice and Video Mail Retrieval," Proceedings ACM Multimedia 96, Boston, November 1996.

8. Keislar, D., Blum, T., Wheaton, J., and Wold, E., "Audio Databases with Content-Based Retrieval," Proceedings of the International Computer Music Conference 1995, International Computer Music Association, San Francisco, 1995, pp. 199-202.

9. Blum, T., Keislar, D., Wheaton, J., and Wold, E., "Audio Databases with Content-Based Retrieval," in Intelligent Multimedia Information Retrieval, AAAI Press, Menlo Park, California, 113-135, 1997.

10. Wold, E., Blum, T., Keislar, D., and Wheaton, J., "Content-Based Classification, Search and Retrieval of Audio," IEEE Multimedia, 3(3), 27-36, Fall 1996.