July 1996

The New Alchemists
Spinning Sound into MIDI and Back Again

by Paul D. Lehrman

The hottest things at music and audio trade shows these days, besides those 48-channel automated mixers that fit in your shirt pocket, are the dog and pony shows in which audio recordings are amazingly turned into MIDI data and back again. A high-jumping flute passage comes out of a cello, and medieval choirs sing in be-bop harmonies. A symphony orchestra is conducted by drawing accelerandos on a screen, and the Beatles are forced into a 110-bpm techno beat.

These tools are cool. You can do a lot of things with them that would have been difficult, if not impossible, before they came along. For the right people, in the right situations, they can be a real spur to creativity. So naturally, tons of manufacturers are jumping on the bandwagon, proclaiming that their conversion algorithms and DSP are faster/cleaner/better sounding than the next guy's. But snazzy technologies have a way of taking over people's imaginations in excess of their real worth, which makes it wise--before anyone decides to call it "Whoopee!" and base a $50-million IPO on it--to take a fairly sober look at what this stuff is for, and what it isn't.

The promise of audio-to-MIDI-to-audio conversion (how about we call it "AMAC" for convenience?) is that we can now dissect a recorded musical performance and isolate individual nuances, tweak them just as if they were MIDI notes and controllers, and re-apply them to the original; thus we can create performances with the sonic quality of digital recordings but with a whole new level of expressive control. Computer-based composition can be liberated from the strictures of discrete synthesis and one-shot sampling, and the manipulative techniques MIDI composers know and love can be applied to any sound at all.

How true is this? Well, let's look at the process one step at a time. The first part of AMAC involves extracting performance data from an audio signal on disk. Using FFTs, spectrum analysis, weighting, and other tools, the process derives initial pitch, pitch change, initial level, level change, high-frequency content, and duration from the signal, translating them into, respectively, note-on, pitchbend, velocity, volume, "brightness", and note-off MIDI data. "Brightness" is not strictly defined in the MIDI spec, and there is little commonality among synthesizers as to how to interpret it, so at least for now, brightness data from this process is not all that useful. The derived MIDI data can be used to play a MIDI synthesizer in close step with the original audio, or used as a timing reference to trigger other MIDI tracks. This in itself, if done well, is worthwhile.
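
To make that mapping a little more concrete, here is a rough sketch in Python (every name and number in it is my own illustrative assumption, not anybody's shipping algorithm) of how one analyzed note might become a stream of MIDI-style events. Brightness is sent here as controller 74, one common convention, but as noted above, don't count on your synth doing anything sensible with it.

```python
# Illustrative only: a hand-rolled mapping from derived audio features to
# MIDI-style events, with no real analysis behind it.

def cents_to_pitchbend(cents, bend_range_cents=200.0):
    """Map a pitch deviation in cents onto the 14-bit pitchbend range,
    assuming the receiving synth is set to bend +/- bend_range_cents."""
    value = int(8192 + (cents / bend_range_cents) * 8191)
    return max(0, min(16383, value))

def note_to_midi_events(note):
    """Turn one analyzed note (a dict of derived features) into a list of
    (time_in_seconds, message_dict) tuples."""
    events = []
    t0 = note["onset"]
    # Initial pitch and initial level become note-on and velocity.
    events.append((t0, {"type": "note_on",
                        "note": note["midi_pitch"],
                        "velocity": int(note["initial_level"] * 127)}))
    # Pitch drift becomes pitchbend, level drift becomes volume (CC 7),
    # and high-frequency content becomes "brightness" (CC 74).
    for t, cents, level, hf in note["envelope"]:
        events.append((t, {"type": "pitchbend",
                           "value": cents_to_pitchbend(cents)}))
        events.append((t, {"type": "control_change",
                           "control": 7, "value": int(level * 127)}))
        events.append((t, {"type": "control_change",
                           "control": 74, "value": int(hf * 127)}))
    # Duration becomes note-off.
    events.append((t0 + note["duration"],
                   {"type": "note_off", "note": note["midi_pitch"]}))
    return events

# One made-up note: middle C, half a second long, drifting 20 cents sharp.
example = {"onset": 0.0, "midi_pitch": 60, "initial_level": 0.8,
           "duration": 0.5,
           "envelope": [(0.1, 5.0, 0.75, 0.4), (0.3, 20.0, 0.6, 0.3)]}
for t, msg in note_to_midi_events(example):
    print(f"{t:5.2f}s  {msg}")
```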

Turning an audio signal into MIDI data is nothing new--companies like Fairlight, IVL Technologies, Digitech, and Roland have been doing it for many years, not to mention all the hapless MIDI guitar makers who continue to bang their heads against the wall of pitch-to-MIDI conversion. This hardware's task has been to do the process in real time, that is, to extract a MIDI line from a voice or a trumpet while it's playing. The new software systems, however, do it off-line, which gives them plenty of time to sift, analyze, transform, and compare, so theoretically their performance should be much better. But many old problems remain: where do you draw the line between pitchbend and a new pitch; between a fundamental and a harmonic; between the noise at the beginning of a note and the note itself; or between voices in a chord?

Just as a hardware convertor must be programmed with gate times, attack times, volume and pitch-change thresholds, etc., so that it knows what kinds of limits to set on its processing, a software convertor has to be told what to do when it's confronted with ambiguous data--which, unless it's analyzing a diatonic Theremin, is just about all of the time. If the parameters are not set correctly (and sometimes even if they are), the result is a jumble of micro-notes, appoggiaturas, flying pitchbends, and wrong octaves. In some ways, the software convertor's job is harder: in a hardware convertor designed for live performance, these minor errors fly by in an instant and are forgotten about, but once they are part of a recording or a sequence, they must all be dealt with.
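
Here's a toy illustration, in Python, of the kind of limit-setting I'm talking about; the parameter names and numbers are invented for the example, not lifted from any product. The point is simply that somebody (or something) has to decide how short is too short for a note and how big a wobble has to be before it counts as a new pitch rather than a bend.

```python
# Invented parameters and data: the kind of thresholds a converter needs
# before it can decide what ambiguous material means.

THRESHOLDS = {
    "min_note_ms": 60,       # anything shorter is treated as a micro-note or glitch
    "retrigger_cents": 80,   # pitch moves smaller than this stay as pitchbend
    "min_velocity": 10,      # level changes below this never open a gate
}

def clean_notes(raw_notes, p=THRESHOLDS):
    """Filter a raw note list (dicts with duration_ms, pitch_change_cents,
    velocity) the way a converter's gate and threshold settings would."""
    kept = []
    for n in raw_notes:
        if n["duration_ms"] < p["min_note_ms"]:
            continue                    # discard micro-notes and grace-note debris
        if n["velocity"] < p["min_velocity"]:
            continue                    # too quiet to have been a real attack
        if kept and abs(n["pitch_change_cents"]) < p["retrigger_cents"]:
            kept[-1]["duration_ms"] += n["duration_ms"]   # fold into the previous note as a bend
            continue
        kept.append(dict(n))
    return kept

raw = [
    {"duration_ms": 400, "pitch_change_cents": 0,    "velocity": 90},
    {"duration_ms": 25,  "pitch_change_cents": 1200, "velocity": 40},  # spurious octave blip
    {"duration_ms": 300, "pitch_change_cents": 40,   "velocity": 85},  # really just vibrato
]
print(clean_notes(raw))   # -> one note, 700 ms long
```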

Of course, unless the audio is truly monophonic (that is, one instrument playing one note at a time), the output will be essentially worthless. The day when a machine can make an accurate analysis of polyphonic music or chords is still very far in the future. Even in the best of cases, for the resultant data to be of much use, it has to be massaged carefully, with an eye towards how it's going to be used. Many of these programs include sets of algorithms optimized for different instruments and vocal ranges, but the parameters involved in setting up these algorithms are fiendishly difficult and rarely under user control, so if the material you're working with doesn't fit any of the presets, there's little you can do. And even if the "right" algorithm is there, it might not work consistently in all cases: try a few different recordings of the Bach 'Cello Suites, or of Debussy's Syrinx for flute solo, or any a cappella blues or ethnic singer, and see how varied the MIDI tracks will be from each other, and from the record.

Now let's talk about the other end. This is where the jaws drop at the trade shows: when the derived data is modified in ways that only MIDI data can be, and then re-applied to the original audio track. Thus, instruments can be made to harmonize with themselves, flat or sharp notes can be corrected, and tracks can have their tempos changed--not just statically, but over time--or be rhythmically quantized, to fit with other tracks.

The chief advantage here is not that we are now able to do time-based audio processing--that's also old hat. It's that we now have a dynamic front end to do it with, one that can be manipulated in musical ways. So, to choose a simple example, instead of merely telling a horn note we want it a minor third down, we can now tell it we want it to start a minor third down and drop another half-step over its duration. If we want to create a whole new musical phrase out of a recording, we don't have to do it in a sample editor note by note and then assemble the pieces: it can all happen in one operation.
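
As a sketch of that horn-note example (the function and the roughly 10-millisecond granularity are my own illustrative choices), the "front end" might hand the shifter a curve instead of a single transposition amount:

```python
# Illustrative sketch: describe a time-varying transposition as a list of
# (time, semitone offset) points that a pitch-shifter could follow.

def pitch_curve(duration_s, start_semitones, end_semitones, steps=200):
    """A linear glide from start_semitones to end_semitones over the note."""
    points = []
    for i in range(steps + 1):
        frac = i / steps
        points.append((round(frac * duration_s, 3),
                       start_semitones + frac * (end_semitones - start_semitones)))
    return points

# Start a minor third down (-3 semitones), end another half-step lower (-4).
curve = pitch_curve(duration_s=2.0, start_semitones=-3.0, end_semitones=-4.0)
print(curve[0], curve[len(curve) // 2], curve[-1])
# (0.0, -3.0) (1.0, -3.5) (2.0, -4.0)
```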

Whether this is successful is largely dependent on the quality of the pitch- or tempo-shifting process (they are essentially the same thing, just turned upside down). Pitch shifting an audio file isn't rocket science: actually, it's a lot harder--although, fortunately, mistakes aren't quite as expensive. While great progress has been made since the first chipmunking algorithms appeared in samplers and sample-editing software, there's still a lot that can't be done. Sharping a singer who sings a couple of flat notes is easy, but turning a symphony orchestra playing "Also Sprach Zarathustra" into "Mary Had a Little Lamb" is dicey. Any changes of more than about 7% or 8% are pretty much out of the question, and for many recordings, especially when stereo image stability or room ambience are at issue, the restrictions are much tighter. (I got an excited call from a Very Famous Conductor recently who had read an article of mine describing pitch and tempo shifting, and she was crestfallen to hear that she might not be able to change an Adagio recording she had made into an Allegro. She was even more disappointed when I told her this wasn't something she could learn to do by herself in a couple of days!)
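
A quick back-of-the-envelope check shows why the conductor was out of luck. The tempo figures below are hypothetical, just to show the scale of the problem against that 7% or 8% ceiling:

```python
# Hypothetical numbers, real arithmetic: how big a stretch are we asking for?

def stretch_percent(old_bpm, new_bpm):
    """The stretch or compression required, as a percentage of the original."""
    return abs(new_bpm - old_bpm) / old_bpm * 100

print(stretch_percent(100, 107))  # nudging a slightly dragging take: 7.0%, borderline
print(stretch_percent(60, 120))   # Adagio to Allegro: 100.0%, forget it
```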

The fact that there are dozens of pitch-shifting algorithms out there shows that engineers are still searching for this particular Holy Grail. Some of the algorithms are optimized for certain types of music, but none will do right by everything. Many of the programs let you decide what factors--pitch, timbral quality, or rhythmic accuracy--the processing should favor. I've never seen any worthwhile documentation on these settings, and I don't know whether that's because the manufacturers are lazy or because they figure it would be meaningless anyway: trial and error, for just about every type of audio signal you want to work with (if not every piece of audio), is still an integral part of the process. And even on the fastest computers, like Pentiums and PowerPCs, these processes are s-l-o-w-w-w: about the best ratio of processing time to file length I've seen is 15-to-1. Of course, you never get it right the first time, so in real-world terms you'll have to at least double or quadruple that.

There's also a problem, sort of a basic philosophical flaw, in the concept of imposing expressiveness--whether it's time-, pitch-, or level-based--onto an audio track after it's been recorded. In MIDI sequencing this is done all the time, but to my ears (and this is something I try constantly to beat into my students' heads), an artificially expression-enhanced MIDI track is far inferior to one which was played expressively in the first place. When the basic track is audio, that problem multiplies. For a performance to be expressive, there must be audible feedback to the player: a singer or instrumentalist who can't hear himself can't make those instantaneous micro-decisions that determine how he will impose his will on the instrument physically, and make the changes in the sound that we recognize as expression. This feedback loop is crucial, and it works even if it's highly convoluted: a church organist who's playing pipes 50 yards away can, once she gets used to the delay, be just as expressive on her instrument as someone playing a set of tablas in her lap. Interrupt that loop, and the performance suffers greatly: if you would like to see a pianist have a nervous breakdown, make him play an electronic keyboard in which the attack times of the notes are varied, very slightly, at random.

When the loop is eliminated completely, performance parameters are in danger of losing all of their meaning. Do too much to the parameters, and they will sever any relationship they had to the original track. Re-apply them to the original, and they will be irrelevant: instead of a real performance, you'll end up with a stilted track that's neither fish nor fowl.

But before you go away thinking that I consider all this stuff worthless, let me hasten to say that the idea of putting a MIDI front end onto a pitch-shifting algorithm is fascinating. You can use it to bring new levels of complexity and immediacy to sound editing. For example, if you have a sampled sound effect that you want to make into a musical instrument (a very popular technique in advertising production these days), you could put it into a sampler and play it from a sequencer, but each note will have different timbral, loop, and vibrato characteristics, and when you go more than a certain distance up or down you will invoke the dreaded munchkin effect. If, on the other hand, a pitch-shifting template, based on a MIDI sequence complete with volume and pitchbend information, is laid on the sound, the timbral qualities don't change in the same ways, and the effect can be much more musical. Granted, if you go too far astray the result can still be ghastly, but we're talking sound effects here, not the New York Philharmonic.
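
Here's roughly what I mean by a template, sketched in Python. The melody, the sample's nominal pitch, and the render_shifted() call are all made up for the example; the last one stands in for whatever pitch-shifter you actually use. The point is that every note is a shift applied to the same recording, not a re-triggered sample played back at a wildly different rate.

```python
# Illustrative only: turn a MIDI-style melody into per-note shift
# instructions for one sampled sound.

ORIGINAL_PITCH = 60                 # pretend the sample was recorded at middle C

melody = [(0.0, 64, 0.5),           # (start time in s, MIDI note, duration in s)
          (0.5, 67, 0.5),
          (1.0, 72, 1.0)]           # a made-up three-note phrase

template = [{"start": start,
             "duration": dur,
             "shift_semitones": note - ORIGINAL_PITCH}
            for start, note, dur in melody]

for step in template:
    print(step)
    # render_shifted("door_slam.aif", **step)   # hypothetical call to your shifter
```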

And there's certainly nothing wrong with taking an audio track of a funky rhythm section, or even a piece of "The Rite of Spring", and imposing a hip-hop tempo map over it, if that's going to make your day. I doubt anyone will worry about whether it's being true to Stravinsky. When the quest for new sounds is more important than the need to preserve the fidelity of the original material, this kind of processing can be great fun. Just don't try to make "Yellow Submarine" fit the beat of a Ramones track. There's way too much to lose there (and besides, you'll probably get sued).

So enjoy the dog and pony shows, and think about how AMAC can work for you--in your fantasies, and in reality. It won't change your life, but once it settles down, and if you have some patience with it, it will add some neat tricks to your production arsenal.

* * *

I saw the movie "Mr. Holland's Opus" the other night, which will probably be out on video by the time you read this. See it. The dialog is creaky, you can see the plot developments coming a mile away, you'll want to scream "Stop whining and get a MIDI setup!" at Richard Dreyfuss's frustrated composer, and the music (by Michael Kamen, who should know better) is drivel. But the point the film makes, that music transforms and inspires people in an infinite variety of ways, is of the highest importance to all of us--and it expresses beautifully the fact that snatching music away from schoolchildren in phony austerity moves is shameful and ultimately self-defeating. As the title character says, "You can teach them reading and writing, but without the arts, what are they going to write about?"


Paul D. Lehrman writes at his original pitch and tempo.

These materials copyright ©1996 by Paul D. Lehrman and Intertec Publishing