

Ostitch: MIR applied to musical instruments

Abram Hindle

Software Engineering Group

April 15, 2005

University of Victoria

Abstract:

This paper discusses the use of MIR in computer music instruments, and proposes and implements a performance-time, MIR-based instrument (Ostitch) that produces ``audio mosaics'' or ``audio collages''. Buffering, overlapping, and stitching (audio concatenation) algorithms are discussed, and the problems surrounding them are evaluated in detail. Overlapping and mixing algorithms are proposed and implemented.

Keywords: Collage, Overlapping, Mosaicing

INTRODUCTION

Music Information Retrieval (MIR) has commonly been used for querying repositories of sound and music. One aspect of MIR that has been only somewhat explored is the application of MIR techniques to live music performance. Often MIR is applied to conventional instruments (Kapur et al., 2004b) or voice samples (Kapur et al., 2004a).

What I propose in this paper is an instrument that uses MIR technology to produce sound, replacing the input with sounds drawn from a corpus of other sounds. The instrument is controlled by sound, much like a filter, but unlike a classical filter it can learn about the incoming sound and use it to produce new sounds. The main instrument demonstrated in this paper is a performance-time (real-time for music performance purposes) audio collager. Using either a fixed corpus of sound or a growing corpus of incoming sound, similar sounds are queried and played in place of the original signal. An instrument like the collager (Ostitch) attempts to imitate the control signal, or to use it as ``inspiration'' for its decisions.

MIR can be used to improve human-computer interaction, especially the interactions of skilled musicians. A musician could play a tune to query for a song, or use their instrument to control a computer instrument. Interfaces could range from simple pitch trackers to more complex feature trackers that use domain-specific features to describe the instrument input (Kapur et al., 2004b).

MIR can also provide cost-effective input to a computer music instrument. A user can simply use a microphone and their own voice or sound-making tools to provide an input signal (Kapur et al., 2004a). The voice is an excellent source because it can produce a very wide range of sounds and noises, from vowels and consonants to throat and mouth sounds. The voice is also a very natural instrument for most people.

The inputs commonly associated with a computer, the keyboard and mouse, are often not expressive enough to replicate the input a user might provide with a real instrument. Why not, then, let the user play the real instrument, with all of its expressiveness, to control another instrument?

Related Work

Schwarz has produced a PhD thesis about audio concatenation (Schwarz, 2004), so many of these related works are drawn from his thesis:

Contribution

My contribution is the ad-hoc testing and implementation of certain kinds of stitching and overlapping techniques used to produce mosaics for interactive use, as well as observations on the performance-time use of audio collage.

SYSTEM DESCRIPTION

Ostitch is an audio collager. It makes audio collages or audio mosaics out of input signals based upon a corpus of audio or the input signal itself.

There are two main sources of signal input to Ostitch: the sound card (whether CD, Line In, or a microphone) and files. Output can go to a file, to the sound card, or to both.

User input can be directed into the system via the command line, or remotely via a UDP port to which ``setter'' commands can be sent (communication is one way). Setter commands include turning classifiers on and off and starting or stopping recording. UDP is used so that GUIs can be built that are decoupled from the running instrument process, and it is explicitly chosen for its connectionless, low-latency characteristics.

Command-line options include the sampling rate (-sr), the number of overlap chunks (-o), the number of samples per block (-ol), the FFT frame size (-fft), the window to use as an envelope over the samples (-han, -ham, -black, -bart), and feature-disabling options (-nozc, -norms, -noflux, -noroll, -nocent).

File I/O switches include the input corpus (-i file), saving the rendered output (-o outfile), disabling recording to the database (-norec), and disabling sound output to the sound card (-nosound).

FFT mixers can be chosen using the following switches: -lowpass, -hipass, -dither, -pass, -dither.
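As an illustration only (the ostitch binary name and this particular flag combination are assumed, not taken from the program's usage output), an invocation might look like:

  ostitch -sr 44100 -fft 1024 -han -i corpus.wav -lowpass

which would read the corpus from corpus.wav, stitch Hann-windowed blocks, and mix FFT frames with the low-pass mixer.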

Usage Scenarios

Ostitch can be used for various purposes, with varied results; some of these scenarios include:

PROBLEMS

In this section I discuss various problems that make this research interesting and slightly different from other sound mosaicers. The concatenation of sound fragments, together with the mixing of the sounds during concatenation, will be referred to as stitching. Sample size refers to the size of the chunks of audio being mixed.

Performance Issues

Any kind of stitching sounds different when different-sized samples are used. If samples are small then the grains of sound are small, so the lower frequencies are interrupted and often lost. If the grains are large, the source of each grain becomes more and more obvious, and the more obvious the source of a grain, often the less enjoyable it is to hear. A medium-sized grain seems the most perceptually pleasing (at least to the author).

Overlapping is useful because it joins samples together more tightly and allows for time domain or frequency domain mixing. There are some caveats, however. Overlapping can pitch-shift the sound, usually upward. The smaller the overlap, the less noticeable the pitch shift, although a smaller overlap also raises the chance of a popping sound. Overlapping samples of different sizes can produce much more musical results.

Simplified Overlapping Algorithm

Figure 1: Overlapper Algorithm (image: overlapper.eps)

A performance-time overlapping algorithm with minimal buffering was implemented (see figure 1). The algorithm has mix blocks (triangles) and blit blocks (rectangles): mix blocks are mixed together using a mixing algorithm, while blit blocks are copied verbatim. Figure 1 illustrates the overlapping algorithm for $n$ blocks, where the number of blocks is the number of equally sized segments each sample is cut into. Each outlined rectangle indicates a block of time destined for playback. Samples starting on the left side are buffered from the last call to the overlapper.

Mix indexes are important because they determine the work of the overlapper for the current time unit. Mix indexes are decremented from $n-1$ to $1$ for cases where $n \ge 2$ (where $n$ is the number of blocks).

Note how the time units are numbered with mix indexes from $0$ to $n-1$, where $n$ is the number of blocks. The first case is a special case, so normally mix indexes start at 1. Mix indexes are significant because a time unit with mix index 1 implies that two samples plus the last sample are needed to complete that frame (notice how for $n = 2$ all the frames have a mix index of 1). When the mix index is 1 there are two pairs of mix blocks; when the mix index is greater than 1 there is only one pair of mix blocks. When a mix index is greater than 1, let $m$ be the mix index: blocks $n - m$ to $m - 1$ (zero indexed) of the last buffer are blitted into the output buffer, then block $m$ of that buffer is mixed with the first block of the first supplied buffer, and finally the $n - m$ blocks after the first block of the first new buffer are copied to the output buffer. If the mix index is 1, the last block of the last buffer is mixed with the first block of the first new buffer, the middle $n - 2$ blocks of the new buffer are blitted to the output buffer, and the last block of the new buffer is mixed with the first block of the second buffer; in this case the second buffer is kept as the last buffer for the next iteration.

The algorithm is described in point form below; assume copying to the output buffer is done in order, so as to avoid explicitly indexing the output buffer.
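To make the scheme concrete, the following much-simplified OCaml sketch collapses the rotating mix-index scheme to its simplest behaviour: one crossfaded block of overlap per buffer, with the buffer body blitted verbatim. The linear crossfade stands in for whichever mixing algorithm is in use; this is an illustration only, not Ostitch's actual overlapper.

  (* Linear crossfade of two equal-length blocks: fade a out while
     fading b in. *)
  let crossfade a b =
    let n = Array.length a in
    Array.init n (fun i ->
      let w = float_of_int i /. float_of_int n in
      (1.0 -. w) *. a.(i) +. w *. b.(i))

  (* [make_stitcher block] returns a stitching function that consumes
     buffers of at least [2 * block] samples and yields the playable
     portion for this call: the crossfaded join followed by the buffer
     body. The final [block] samples are withheld (buffered) to be
     joined against the next buffer, mirroring the buffering shown in
     figure 1. *)
  let make_stitcher block =
    let tail = ref (Array.make block 0.0) in
    fun buf ->
      let n = Array.length buf in
      let mixed = crossfade !tail (Array.sub buf 0 block) in
      let body = Array.sub buf block (n - 2 * block) in
      tail := Array.sub buf (n - block) block;
      Array.append mixed body

A stitcher built with make_stitcher 256 is then fed successive sample buffers and emits playback-ready audio on each call.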

Time Domain Stitching

Time domain stitching is the simplest form of stitching: one simply concatenates the time domain data. Simple concatenation is problematic because it can introduce pops or other extraneous noise.

Solutions for dealing with pops and noise are to window or envelope the samples, using a window such as the Hann window or an ADSR envelope. One problem with enveloping is that it can add other sounds, namely the sound of the envelope itself. For example, if we have a block size of 1024 samples at a sampling rate of 44100 Hz and apply a Hann window to each block, we produce another mixed wave at around 43 Hz (44100/1024). That wave could of course be canceled out, but with a quiet signal holding just above 0 we would produce a time domain waveform that looks like many concatenated Hann windows, a far cry from silence.
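As a minimal sketch (not Ostitch's actual windowing code), a Hann window can be precomputed once and applied to each block like this:

  (* Hann window of length n, precomputed once; applying it to every
     block suppresses edge pops at the cost of the roughly 43 Hz
     envelope modulation described above (1024 samples at 44100 Hz). *)
  let pi = 4.0 *. atan 1.0

  let hann n =
    Array.init n (fun i ->
      0.5 *. (1.0 -. cos (2.0 *. pi *. float_of_int i /. float_of_int (n - 1))))

  (* Pointwise application of the window to one block of samples. *)
  let envelope win block =
    Array.mapi (fun i x -> win.(i) *. x) block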

We found that time domain overlapping worked fairly well, although the edges of each sample had to be mixed to avoid popping. If there was too much overlap ($n = 2$, or multiple similar samples played at the same time), the pitch seemed to shift and it sounded as if the samples were being played at double rate.

If random samples are chosen, the pitch can seem perceptually higher because few continuous lower frequencies survive. This happens more when the edges of the samples are mixed out.

There are multiple ways to mix time domain sound; high-pass and low-pass filters, weighted averaging, and dithering are a few examples. Figure 2 depicts FFT-based mixing, but many of the examples apply to time domain mixing as well.

FFT Based Overlapping and Stitching

FFT-based overlapping allows ``filter''-style overlapping. Frames are blocks of data of the FFT size, which are subsections of samples. One difficulty with FFT overlapping is that if the FFT is too large and there are not enough overlapping frames, the mixing algorithms do not work well and the mixing can be rather abrupt and harsh. Thus one problem with FFT-based overlapping is that it can easily introduce extra noise.

There are many ways to mix FFT frames. One can simply sum the frames for linear mixing (though it would probably have been computationally cheaper to do linear mixing in the time domain). With access to each frame's FFT, it is also very easy to apply a steep high- or low-pass filter to each frame and then sum the frames.

Another mixing type is dithering, in which samples from each FFT frame are interlocked in a dithered pattern. As one FFT frame mixes further into the other, it contributes a higher proportion of the FFT samples.

See figure 2 for a diagram of Fourier Transform Based Mixers.

Figure 2: From Left to Right, Top to Bottom: Low-pass Mixer, Hi-pass Mixer, Bandpass Mixer, Smear Mixer, Big Dither Mixer, Small Dither (image: mix.eps)
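The following OCaml sketch illustrates two of these mixers over frames held as separate real and imaginary arrays (the FFTData layout described in the implementation section). The random bin selection in the dither mixer is one plausible reading of the ``dithered pattern''; neither function is Ostitch's exact code.

  type fftdata = { re : float array; im : float array }

  (* Steep low/high-pass crossover: bins below [cut] come from frame
     [a], bins at or above [cut] from frame [b]. *)
  let crossover_mix cut a b =
    let n = Array.length a.re in
    { re = Array.init n (fun i -> if i < cut then a.re.(i) else b.re.(i));
      im = Array.init n (fun i -> if i < cut then a.im.(i) else b.im.(i)) }

  (* Dither mix: each bin is drawn from [b] with probability [p], so as
     [p] grows, frame [b] dominates. [sel] keeps the real and imaginary
     choices in step for each bin. *)
  let dither_mix p a b =
    let n = Array.length a.re in
    let sel = Array.init n (fun _ -> Random.float 1.0 < p) in
    { re = Array.init n (fun i -> if sel.(i) then b.re.(i) else a.re.(i));
      im = Array.init n (fun i -> if sel.(i) then b.im.(i) else a.im.(i)) }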

Collage

The collage process consisted of:

The parameters of the collager were the sampling rate, the overlap chunks, the number of samples per ``sample'' (sample refers to a block of audio read in), the FFT size, and the envelope used for time domain stitching.

Similar chunks were found using a nearest neighbor algorithm with Euclidean distance. (Mahalanobis distance was not used because it would require continuously recalculating the covariance matrix; Mahalanobis distance makes more sense on static corpora.)
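A minimal OCaml sketch of such a query (a linear scan using squared Euclidean distance, since the square root does not change the ranking; Ostitch's exact search code may differ) follows:

  (* Squared Euclidean distance between two feature vectors of equal
     length. *)
  let sq_dist a b =
    let d = ref 0.0 in
    Array.iteri (fun i x -> let dx = x -. b.(i) in d := !d +. dx *. dx) a;
    !d

  (* Index of the corpus chunk whose feature vector is nearest to
     [query]; [corpus] is an array of feature vectors. *)
  let nearest corpus query =
    let best = ref 0 and best_d = ref infinity in
    Array.iteri (fun i v ->
      let d = sq_dist query v in
      if d < !best_d then begin best := i; best_d := d end) corpus;
    !best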

The collager can be further parameterized by modifying the chunk similarity algorithm; it was found during performance that some inaccuracy in piece selection was sometimes more musically interesting than highly accurate selection.

IMPLEMENTATION

The tools used by Ostitch include (bold items indicate those needed to run Ostitch):

Most of the important functionality was written in OCaml. C was used primarily to talk to ALSA and to provide a STDIN-to-UDP client program. OCaml was chosen because it is a functional, type-safe language whose speed rivals that of C. OCaml is an elegant and clear language, which made writing much of the code easy and intuitive.

OCaml suffers from some debugging issues (such as backtraces and the clarity of compiler error messages). OCaml also suffers from painfully awkward syntax for floating point numbers (+ and - are for integers; +. and -. are for floating point numbers). Another OCaml quirk is that it does not read or write native ints or floats directly; extLib was required to allow OCaml to read and write native types.

OCaml's C integration is very nice and consists largely of includes full of type-munging macros.

OCaml was appropriate for the project because the program deals with audio, which is stream based, and audio filters are usually recursively defined (OCaml supports tail recursion). Unfortunately, when I tried to use OCaml with linked lists representing streams of sound I ran into grave garbage collection problems: the program consumed memory very quickly. Because of this I had to switch to an array/block based architecture, which complicated things somewhat. For instance, it meant I had to use more statements (rather than expressions) than I would like, and I had to be explicit about what was mutable and what was not. I also could not use functions like map, because that would require allocating a new array. The program mostly relies on static buffers, though some functions make their own buffers (hidden by closures).

One challenge was importing an FFT into OCaml. I did not want to use FFTW because I did not feel like munging around with FFTW's types; I wanted types I could manipulate directly. So I ported the FFT from Marsyas. Unfortunately, when I attempted a reverse FFT the algorithm took an inordinate amount of CPU time, and a reverse FFT should not be that complex. So I ported over an FFT I had written in Java (which was already array bounds checked and somewhat statically typed). The new FFT worked well; its most useful aspect was that output arrays for real and imaginary values had to be provided explicitly. Even more helpful was that real and imaginary values were separated into their own arrays (an FFTData type was created to handle these two arrays). In summary, porting the FFT to OCaml caused no end of trouble, but I was able to re-implement it without using references or mutables (except for the arrays, whose values are mutable).

SNDLib is a C library I had built for ALSA sound output. I modified it a bit to allow OCaml integration and to enable it to read sound as well. SNDLib is wrapped in the Sndcaml module to provide an interface to SNDLib within OCaml.

Commlib is a module I wrote that enables remote control of a program via commands sent to a UDP port on which Commlib listens. This style of interaction is very nice because one continuous instance of a GUI can keep running while Ostitch is stopped and restarted. Unfortunately it does not lend itself to very good feedback inside the GUI. Perl and Tk were used to provide a simple GUI for Ostitch.
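As a rough sketch of this style of control (the socket handling is standard; the command strings and dispatch are assumptions, not Ostitch's actual wire protocol), a Commlib-like listener in OCaml could look like:

  (* Bind a UDP socket and hand each textual ``setter'' command to a
     dispatch function. Communication is one way: no reply is sent. *)
  let listen port dispatch =
    let sock = Unix.socket Unix.PF_INET Unix.SOCK_DGRAM 0 in
    Unix.bind sock (Unix.ADDR_INET (Unix.inet_addr_any, port));
    let buf = Bytes.create 512 in
    while true do
      let len, _sender = Unix.recvfrom sock buf 0 (Bytes.length buf) [] in
      dispatch (Bytes.sub_string buf 0 len)
    done

A dispatcher would simply pattern-match on the received string and flip the corresponding state (recording on or off, a classifier enabled or disabled, and so on).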

Audio is a module containing miscellaneous audio functions. These include windowing algorithms; feature extractors such as zero crossings, RMS, and flux; Euclidean distance; overlappers; and numerous array utilities such as arrayapplyi (an in-place indexed map), array2fold (folds two arrays into one scalar value), and arrayfoldi (folds one array into a scalar value while providing the indices of the values being folded).
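For illustration, minimal versions of two of these extractors, plus an arrayapplyi-style utility, might look like the following (sketches, not the module's actual code):

  (* Count sign changes between adjacent samples in a block. *)
  let zero_crossings block =
    let zc = ref 0 in
    for i = 1 to Array.length block - 1 do
      if (block.(i) >= 0.0) <> (block.(i - 1) >= 0.0) then incr zc
    done;
    !zc

  (* Root-mean-square energy of a block. *)
  let rms block =
    let sum = Array.fold_left (fun s x -> s +. x *. x) 0.0 block in
    sqrt (sum /. float_of_int (Array.length block))

  (* In-place indexed map, in the style of arrayapplyi. *)
  let arrayapplyi f a =
    for i = 0 to Array.length a - 1 do
      a.(i) <- f i a.(i)
    done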

Findarg is a module much like getopt, but far simpler; it supports flags and string-based input from the command line.

Ostitch calls upon all of these modules, plus the overlapper, to produce a usable system. One of the difficulties was providing near real-time performance: issues arise in sizing buffers, avoiding overflow of the output buffer, and scheduling the reading of samples while simultaneously outputting samples. Some buffering inside the overlapper was needed because samples were consumed at a faster rate than they were read or played. Of course, buffering is always an issue in any sort of audio programming.

CONCLUSIONS

In summary, it is hard to tell how well certain aspects of the project work because most of the project is so perceptual. Artistically, Ostitch seems to have merit; it is quite fun to scream into a microphone as heavy metal samples replace your screaming.

For live performance I think this instrument is a success. I plan to use it to perform music at the next Victoria Noise Festival, and I gave a small demo in front of George Tzanetakis's MIR class. With corpus pre-loading, one can have predetermined music sets without worrying about training.

Lessons learned from the instrument are to be wary of overlap and sample size: both of these parameters can increase the perceived pitch of the audio. Small sample sizes sound noisy and poppy and do not provide any hooks into the real audio. Medium-sized samples provide some coherency while still being separate from the audio they were extracted from.

Future Work

There are many future directions one can take.

One useful direction would be to implement perceptual metrics based on MOS (Mean Opinion Score), or at least to attempt to model MOS. I would like to test the quality of the various parameters against users, or at least against models of users.

Stitching

There are various kinds of stitching that could be explored.

For sample selection, edge metrics might be useful: matching the start and end of samples to produce a smoother transition.

Non-linear mixing would also be appropriate, such as logarithmic scaling of the audio in the time domain.

A good question to evaluate is: ``Do we need samples to be played at a consistent time, or can we play samples at any time?'' It might be the case that the proposed algorithm is no better than, say, randomly layering samples at various start times.

Different mixing techniques should be evaluated as well; convolving the audio, for example, might produce a nicer transition between samples.

Collage

Other directions for audio collage that should be evaluated include:

Applications of MIR to Performance

In the more general case of applying MIR to performance, there are some areas that others or I should evaluate:

Bibliography


Hazel (2001)
Steven Hazel.
Soundmosaic.
http://thalassocracy.org/soundmosaic/, 2001.

Hoskinson (2002)
Reynald Hoskinson.
Manipulation and resynthesis with natural grains.
Master's thesis, University of British Columbia, 2002.

Kapur et al. (2004a) Kapur, Benning, and Tzanetakis
A. Kapur, M. Benning, and G. Tzanetakis.
Query-by-beatboxing: Music retrieval for the DJ.
In Proc. Int. Conf. on Music Information Retrieval (ISMIR), 2004a.

Kapur et al. (2004b) Kapur, Tzanetakis, and Driessen
A. Kapur, G. Tzanetakis, and P.F. Driessen.
Audio-based gesture extraction on the eSitar controller.
In Proceedings of the International Conference on Digital Audio Effects, pages 17-21, October 2004b.

Lazier and Cook (2003)
Ari Lazier and Perry Cook.
MoSievius: Feature driven interactive audio mosaicing.
In DAFX 2003, 2003.

Lu et al. (2004) Lu, WenYin, and Zhang
Lie Lu, Liu WenYin, and Hong-Jiang Zhang.
Audio textures: Theory and applications.
In IEEE Trans. on Speech and Audio Processing, volume 12, pages 156-167. Institute of Electrical and Electronics Engineers, Inc., March 2004.

Oswald (1999)
John Oswald.
Plunderphonics.
http://www.plunderphonics.com/, 1999.
Last accessed April 2005.

Schwarz (2004)
Diemo Schwarz.
Data-Driven Concatenative Sound Synthesis.
PhD thesis, Université Paris 6, January 2004.

Sturm (2004)
Bob L. Sturm.
MATConcat: An application for exploring concatenative sound synthesis using MATLAB.
In ICMC 2004, 2004.

Tzanetakis (2002)
George Tzanetakis.
Manipulation, Analysis and Retrieval Systems for Audio Signals.
PhD thesis, Princeton University, 2002.

Zils and Pachet (2001)
A. Zils and F. Pachet.
Musical mosaicing.
In Proceedings of DAFX 01, University of Limerick, December 2001.
