Physical Modeling Theory

A Light Treatment for Musicians

The Problem

Given that you want to synthesize complex musical effects from relatively simple control inputs, as happens with acoustic instruments, how do you do it? It doesn't take long to abandon the sampler paradigm. Imagine how many samples you would need, and the complex crossfading involved, just to represent a single sustaining acoustic instrument across a healthy range of its expressive possibilities. Imagine how weird it would be to try to play such a beast. Picture, by contrast, how straightforward it is to perform fluid expressive gestures on acoustic instruments. When a cello player increases the pressure and slows the velocity of her bow, it creates a complex, wonderful change in timbre and texture, even though the same note is sounding. The full range of such expressive playing is what makes cello lead parts exciting. You really can't do that with a sample-based synth. It's just not feasible, and what's more, from a design standpoint, it's 'ugly'. So the bottom line is that when you are talking about expressive playing for lead parts on sustaining instruments, a sample-based approach à la conventional synths just cannot do it.

What we need is a synthesis technique that allows a musician to perform with lots of flair and emotion using relatively simple control inputs, and create those nice complex effects that acoustic instruments do. As a bonus, if you could also emulate analog synths and such, that would be nice, but the real challenge is in the sentence above. In order to accomplish this we are going to have to generate a waveform using processing of some sort guided by simple control inputs (midi).

Synthesis Approaches

Among synthesis techniques that are at least reasonably well developed, we have four contenders:

  • Analog
  • Frequency Modulation (FM)
  • Additive
  • Physical Modeling

Analog synths make 'analog sounds', but the difficulty of accurately emulating the organic character of acoustic instruments, and the attendant costs, rule out this approach. All of the traditional analog problems with noise and distortion will plague you here as well.

Some people are doing research into advanced FM techniques, which I suppose could be considered hopeful. Personally, I don't think FM realistically has the potential to do what we want, because its model is too far removed from the way acoustic instruments actually work.

Additive synthesis has its adherents. It remains to be seen whether anyone can come up with a synth that sounds great at acoustic emulation using this technique. The idea is that you break a complex waveform into small components in frequency and time, and then you put them back together again to construct variations. I know of only one commercial 'synth' that uses this technique, and it is not really a synth at all: it is a high-end noise-removal and sound-effects box aimed mostly at the broadcast and film industries. In any case, the overall model is not nearly as intuitively straightforward for the musician.
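
To make that idea concrete, here is a minimal additive sketch in C. The partial amplitudes and decay rates are made-up placeholders standing in for the analysis data a real system would extract from recordings: a handful of harmonic sine partials, each with its own decaying envelope, summed one output sample at a time.

#include <math.h>

#define NUM_PARTIALS 8
#define TWO_PI 6.283185307179586

/* One output sample of a crude additive voice at time t (in seconds).
   amp[] and decay[] are per-partial envelope settings; in a serious
   additive system they would come from analysis of a real recording. */
double additive_sample(double t, double fundamental_hz,
                       const double amp[NUM_PARTIALS],
                       const double decay[NUM_PARTIALS])
{
    double out = 0.0;
    for (int i = 0; i < NUM_PARTIALS; i++) {
        double freq = fundamental_hz * (i + 1);     /* harmonic series          */
        double env  = amp[i] * exp(-decay[i] * t);  /* simple decaying envelope */
        out += env * sin(TWO_PI * freq * t);
    }
    return out;
}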

Physical modeling has the lovely characteristic that, on the surface at least, the structure of the processing model at the top levels matches the structure of the acoustic instrument being modeled. You have, for example, the mouthpiece, the resonant tube, and the flared bell. You input things like breath pressure and lip pressure, plus midi note control, and it emulates an acoustic instrument. Besides the fluid, organic control during performance, you want the ability to 'redesign' your instrument, including minor timbre tweaks, and so forth. The challenges with this approach are:

  • Substantial Processing Requirements
  • Algorithm Development
  • Voice Refinement

We appear to be in the latter stages of resolving these challenges and producing usable commercial synths based on PM technology. I will address each of these three issues below.
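
Before digging into those issues, here is a rough feel for how the block structure of a model mirrors the instrument itself. This is a conceptual C sketch only, with made-up names and coefficients and no claim to physical accuracy; it just shows the mouthpiece, tube, and bell appearing as stages in a per-sample processing loop.

#define BORE_LEN 512   /* tube length in samples; sets the pitch */

/* State for a toy wind-instrument model.  Zero-initialize before use. */
typedef struct {
    double bore[BORE_LEN];   /* delay line standing in for the resonant tube   */
    int    pos;              /* current read/write position in the delay line  */
    double bell_state;       /* one-pole filter state standing in for the bell */
} ToyWind;

/* One output sample.  'breath' is the performance input (e.g. breath pressure);
   'reflect' and 'bell_coef' are voicing parameters a designer would set by ear. */
double toy_wind_tick(ToyWind *m, double breath, double reflect, double bell_coef)
{
    double from_bore = m->bore[m->pos];    /* wave returning down the tube */

    /* Mouthpiece: a crude stand-in for the reed's interaction with the
       returning pressure wave.  Real reed models are far more involved. */
    double excitation = breath - 0.5 * from_bore;

    /* Bell: low-pass part of the wave out into the room as sound. */
    m->bell_state = bell_coef * m->bell_state + (1.0 - bell_coef) * from_bore;
    double radiated = m->bell_state;

    /* Tube: send the new excitation plus the reflected remainder back in. */
    m->bore[m->pos] = excitation + reflect * (from_bore - radiated);
    m->pos = (m->pos + 1) % BORE_LEN;

    return radiated;
}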

Digital Signal Processing

Digital signal processors, aka DSP chips, are quite a ways down their evolutionary path. Although they have grown in size and complexity over the years, the typical chips used for audio processing are much smaller and cheaper than traditional CPUs like the modern Intel Pentium series, PowerPC, SPARC, and so on. DSP chips are targeted at the specific characteristics of digital signal processing:

  • Small data space because numbers are crunched in a stream passing through. This reduces memory costs.
  • Small code space. The processing algorithms are quite small compared to modern 'bloatware' that is served up for PC use. This enables entire algorithms to be kept on-chip or in cache, and to be carefully optimized for speed.
  • No need for floating point capabilities. You are usually processing integers, although some modern DSPs now use floating point math.
  • Fast multiplier(s). We want to be able to do a multiply and overlap it with an add in one clock cycle (see the filter sketch just after this list).
  • Good speed. Traditionally DSPs have used separate buses for program and data to improve speed. They also have multiple execution units internally in order to do several things during the same clock cycle.
  • Low cost. This characteristic, combined with the 'streaming' character of the data, allows multiple DSP chips to be used in a 'pipeline' configuration to increase processing capability.
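
To see why the single-cycle multiply-accumulate matters, look at the inner loop of a plain FIR filter, which is essentially nothing but multiply-accumulates. This is generic C, not code for any particular DSP:

#define NUM_TAPS 32

/* One FIR filter output sample: a weighted sum of the last NUM_TAPS inputs.
   On a DSP, each pass through the loop body is ideally one
   multiply-accumulate cycle, with the data fetch overlapped. */
double fir_sample(const double coef[NUM_TAPS], const double history[NUM_TAPS])
{
    double acc = 0.0;
    for (int i = 0; i < NUM_TAPS; i++) {
        acc += coef[i] * history[i];   /* the multiply-accumulate */
    }
    return acc;
}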

To give you a feel for this, consider a typical older DSP that runs at 50 MHz and can do a multiply and add (plus a fetch or store) in one clock cycle. We will be conservative and say that it can execute one instruction per clock cycle. If our data is coming through at about a 50 kHz sample rate, we can execute 1000 (50 million divided by 50 thousand) instructions per sample. Considering first a waveguide calculation, which takes about three instructions per sample, we could use a few waveguides to represent our instrument and still have plenty of instructions to spare. There is 'overhead' beyond the three instructions mentioned, and other calculations such as filters are more complex, but you can see that a single DSP has a decent 'budget' of instructions for processing. That budget is not infinite: we cannot support much polyphony, and we will need elegant, efficient algorithms to make the most of our DSP cycles, but we have a realistic 'engine' for crunching out waveforms. DSP price/performance continues to improve, and the synth makers are riding that wave.
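
Here is that instruction budget spelled out in a tiny C program, using the same round numbers (the three-instruction waveguide cost is the rough figure quoted above):

#include <stdio.h>

int main(void)
{
    double clock_hz  = 50e6;  /* 50 MHz DSP, assumed one instruction per cycle  */
    double sample_hz = 50e3;  /* sample rate, rounded up from CD-grade 44.1 kHz */

    double budget = clock_hz / sample_hz;   /* instructions available per sample */
    printf("Instructions per sample: %.0f\n", budget);   /* prints 1000 */

    /* At roughly 3 instructions per waveguide update, a handful of waveguides
       leaves most of the budget free for filters, control, and overhead. */
    printf("Budget left after 5 waveguides: %.0f\n", budget - 5 * 3);
    return 0;
}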

PM Algorithms

This is a critical area that has received some attention in the last several years, but not much, considering the potential value. I will start by giving you a feel for how waveforms are basically manipulated in digital form. These methods apply to all digital synths and modern digital effects boxes.

Keep in mind that inside a digital music box, each waveform exists as a stream of samples. The samples are normally 16-bit integers (numbers that can vary between approximately plus and minus 32000). Common CD-grade sampling gives us about 44000 of these numbers per second, per waveform, streaming through. To do something simple like adjust volume, we simply multiply each sample by some factor. Most digital synths have three or four places where volume parameters are set, and the overall volume is simply the product of those factors. In digital form, when you want to mix two sounds, you just add the matching sample numbers from each waveform together. If you want to make a hard limiter (really a clipper), you would write an algorithm something like this:

if (sample > high_limit) sample = high_limit;        /* clamp peaks above the ceiling */
else if (sample < low_limit) sample = low_limit;     /* clamp dips below the floor    */
/* otherwise the sample passes through unchanged */
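
The volume scaling and mixing described above are just as terse in code. Here is a hedged sketch of one loop that scales two 16-bit sample streams, mixes them, and applies the same hard limiting so the sum cannot wrap around (the buffer and gain names are made up for illustration):

#include <stddef.h>
#include <stdint.h>

/* Scale streams a and b by their volume factors and mix them into out. */
void scale_and_mix(const int16_t *a, const int16_t *b, int16_t *out,
                   size_t count, float gain_a, float gain_b)
{
    for (size_t i = 0; i < count; i++) {
        /* Do the math in 32 bits, then clamp back into 16-bit range. */
        int32_t mixed = (int32_t)(a[i] * gain_a) + (int32_t)(b[i] * gain_b);
        if (mixed >  32767) mixed =  32767;
        if (mixed < -32768) mixed = -32768;
        out[i] = (int16_t)mixed;
    }
}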

From this you can see that the typical things done in most synths (enveloping, scaling, and layering) take almost no processing power at all. They can be done with just a few instructions per sample.

Now let's consider a simple delay (echo). To do this we simply put each sample aside in a holding area, and after a delay of so many samples, we multiply it by some factor (the wet/dry mix ratio) and add it to the main sample stream. To do reverb, we in essence add together a bunch of delays of varying length to represent a confluence of echoes. We don't necessarily need to store the sample stream multiple times to do this, as we can just pull from our holding area in multiple places to represent echoes from different distances. We should filter the echoes, however, as real echoes are normally quite muted in the high frequencies. We may also want to do some left and right channel mixing of the reverb samples, as this occurs in natural settings. To do it right you have to simulate multiple-bounce echoes, so this picture of digital reverb is somewhat simplistic, but from the basics you can see that reverb takes considerably more processing power than scaling and mixing. This is why one of the big tests of a good reverb unit is how 'dense', thick, and uniform its large-hall reverbs are: it takes lots of processing power to simulate all of the varying echoes from each point in a large hall.
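
Going back to the simple delay for a moment, here is a minimal sketch of one. The delay length, feedback amount, and wet level are placeholder values; the feedback term (my addition here) is what turns a single echo into a repeating one.

#define DELAY_SAMPLES 22050   /* about half a second at 44.1 kHz */

static float delay_buf[DELAY_SAMPLES];   /* the 'holding area' */
static int   delay_pos = 0;

/* One sample of a basic echo: read what we stored DELAY_SAMPLES ago,
   mix some of it into the output, and store the current input (plus a
   little feedback) so it comes back around later. */
float echo_sample(float in, float feedback, float wet)
{
    float delayed = delay_buf[delay_pos];
    delay_buf[delay_pos] = in + feedback * delayed;
    delay_pos = (delay_pos + 1) % DELAY_SAMPLES;
    return in + wet * delayed;
}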

This business of filtered delays has application in physical modeling, also. Musical instruments develop pitched tones by setting up standing waves in tubes, chambers, or on strings. The resonant frequency of the standing waves is manipulated by the musician, and we get various notes. The short delay lines used in musical instrument modeling are not nearly as processing intensive as large hall reverbs, but there is more to the picture. The complex acoustic characteristics of a specific material (e.g. the absorption/reflectance of a particular wood or brass, of a given thickness, with a particular coating, and so on) are normally lumped together into a filter, which is tuned based on measurements of actual instruments. This filter may have to be executed at each bounce of a short delay line, and now we're starting to get into some computation. Furthermore, the length of the delay line in an instrument model is usually not a whole number of samples, so we have to interpolate between samples, adding more computation.
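
Here is a hedged sketch of those two refinements together: a short delay loop whose reflection passes through a one-pole low-pass (standing in for all of the lumped material losses) and whose length can be a fractional number of samples, handled by interpolating linearly between the two nearest stored samples. The buffer size and coefficients are illustrative only.

#define LOOP_MAX 2048

static float loop_buf[LOOP_MAX];
static int   write_idx = 0;
static float damp_state = 0.0f;

/* One sample of a damped, fractionally tuned delay loop.  loop_len may be
   non-integer (e.g. 440.6 samples); 'damping' sets the one-pole low-pass
   that mutes the high frequencies of each reflection, and 'loss' scales
   the energy that survives each trip around the loop. */
float waveguide_loop(float excitation, float loop_len, float damping, float loss)
{
    /* Read from 'loop_len' samples behind the write position. */
    float read_pos = (float)write_idx - loop_len;
    while (read_pos < 0.0f) read_pos += LOOP_MAX;
    int   i0   = (int)read_pos;
    int   i1   = (i0 + 1) % LOOP_MAX;
    float frac = read_pos - (float)i0;
    float delayed = (1.0f - frac) * loop_buf[i0] + frac * loop_buf[i1];

    /* Filter and scale the reflection, then feed it back with the new input. */
    damp_state = damping * damp_state + (1.0f - damping) * delayed;
    loop_buf[write_idx] = excitation + loss * damp_state;
    write_idx = (write_idx + 1) % LOOP_MAX;

    return delayed;
}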

To model various types of noise, we will need some noise generators and a way to mix their output in. In itself, noise modeling is not so hard, but getting acoustically accurate, modulated (varying with performance) noise characteristics takes more processing. For wind instruments we have to model the complex acoustics of the mouthpiece and its interaction with the reed, 'lip reed' (brass), or 'jet reed' (flutes). This part of the model has a dramatic effect on the all-important note attacks and legato transitions. In other words, when an excitation begins but the instrument has not yet settled into a resonant mode, we get the transition sound, which is unique to each instrument and to the performer's action. This is one of the hardest parts to get right in a model, but it is important, because people recognize instruments and artists more by attack characteristics and other expressive gestures than by steady-state timbre. Some instrument models (such as those in the Yamaha VL line) even model the effect of harmonic 'overblowing', as is done on wind instruments to get higher notes with the same fingering. For some wind instruments we should also model the effect of the player's throat cavity, and have a 'growl' feature.
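
As a tiny illustration of performance-modulated noise (nothing like a full reed model, just the idea that the breath noise should ride on the breath input), one might scale white noise by the breath pressure before it enters the model:

#include <stdlib.h>

/* Crude breath excitation: the noise component scales with breath pressure,
   so blowing harder brings the breath noise up along with the tone.
   'noise_amount' is a voicing parameter a patch designer would set by ear. */
float breath_excitation(float breath_pressure, float noise_amount)
{
    /* White noise in roughly [-1, 1]; a real model would also shape its spectrum. */
    float noise = 2.0f * ((float)rand() / (float)RAND_MAX) - 1.0f;
    return breath_pressure * (1.0f + noise_amount * noise);
}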

To model plucked strings, you end up with a variation on the damped delay-line model presented above. You have a resonating string that is initially excited by a sharp pluck and gradually dies out due to absorption at the ends and in the air. The absorption is modeled with a filter and scaling, and the resonant body is modeled with a filter. Modeling bowed strings, by contrast, is more difficult. We have to model the complex slip/grab action that occurs when the bow rubs on the string, and how it changes with pressure, velocity, bow position (relative to the bridge), rosin on the bow, and perhaps bow angle (which affects contact area). All the while, we have to model the coupling to the complex resonant cavity of the instrument body. A salient characteristic of bowed strings is the rich 'texture' (as I call it) created by the excitation starting and stopping many times per second as the bow alternately slips and grabs the string in transit. Modeling this properly, so that it changes with pressure and velocity, would give a great bowed-string model. Modeling the acoustic resonances of the instrument body fairly accurately as well would be wonderful. Some work along these lines has been done by Julius Smith at Stanford, and also by the Yamaha PM development team. Many of us are eagerly awaiting further developments.
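
The classic embodiment of that damped delay-line idea for plucked strings is the Karplus-Strong algorithm: fill a delay line with noise (the pluck), then circulate it through a gentle low-pass so the tone decays and mellows the way a real string does. A minimal sketch, with a fixed integer delay length and a made-up loss value:

#include <stdlib.h>

#define STRING_LEN 200   /* delay length in samples; about 220 Hz at 44.1 kHz */

static float string_buf[STRING_LEN];
static int   string_pos = 0;

/* The pluck: fill the delay line with random values (a burst of noise). */
void pluck(void)
{
    for (int i = 0; i < STRING_LEN; i++)
        string_buf[i] = 2.0f * ((float)rand() / (float)RAND_MAX) - 1.0f;
}

/* One output sample: read the oldest value, then replace it with the average
   of itself and its neighbour, slightly scaled down.  The averaging is the
   damping filter; the scaling ('loss', just under 1.0) is the energy lost
   at the string ends and into the air. */
float string_sample(float loss)
{
    int next = (string_pos + 1) % STRING_LEN;
    float out = string_buf[string_pos];
    string_buf[string_pos] = loss * 0.5f * (out + string_buf[next]);
    string_pos = next;
    return out;
}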

Many musicians might like to be able to select, for example, a teak versus a poplar violin body, and have the physical model give them the correct sound. Or you might like to have a menu for the trumpet model that gives you a choice of plating material and wall thickness. The models that I have seen are not yet at that level of development. What happens, as stated above, is that these characteristics are lumped into one or more filters, and the filter parameters are set by ear. Tweaking the models for accurate sound or specific characteristics leads us into the next section.

Voice Development

Once the algorithm developers have done their thing, plenty of editing work remains for the patch developers. PM patch development is probably the hardest of all, if you are really digging into the guts of the model. Fortunately for us mortal musicians, we can usually use a supplied patch and just adjust simple parameters like breath noise type, noise level, or noise sensitivity to pressure. It's important to understand, however, that basic voice design in a PM synth is a tedious, painstaking art form, comparable to designing physical instruments. Once you realize that PM instruments are just a bunch of algorithms controlled by parameters, you can see how someone can set up the algorithms with provision for editing the parameters, and that part of the job is done. Getting all of the parameters set so that someone would actually want to listen to the result is still a long road. People who work with basic PM algorithms know how easy it is to get 'instruments' that either make no sound or sound wacky, to put it kindly. If you look at a sophisticated physical model, such as the Yamaha VL model, and see all of the different parts of the model, each with its own parameters, you begin to get a feel for the complexity involved. In the future, improved models will present a simpler, yet suitably adjustable, interface to musicians.

Yamaha's approach is a good one. They have divided voice editing into two parts: basic element design and everything else. Basic element design, where you design the fundamental acoustics of your virtual instrument, is the hard part. It is best done by skilled professionals who have a good command of musical acoustics AND of the strange nuances of the particular model. Where to draw the line here is not an easy judgment call, however, and Yamaha has opted to release even the Expert Editor, which allows access to all VL voice parameters, for free download and use in the field. The problem comes when you want to alter an attack characteristic of the model, one that you as a musician decide is crucial to your sound, and it turns out that it is created in the 'guts' of the model. When you mess with the relevant parameters, the whole model goes haywire, and it takes more than a little skill to get back to a playable voice. Welcome to PM! Even just adjusting the parameters on the VL front panel can get you in more trouble than you might think.

I think we can compare PM and sample-based synths to 3D modeling and 2D art. Tweaking a filter in a sample-based synth is like adjusting the color balance on a photo in Adobe Photoshop. Layering sounds is like compositing layers in Photoshop, because in both cases you are dealing with 'photographic images' of sounds or objects. The sample and the photo both nicely capture all of the detailed texture, color, and shape of a sound or scene. When you do exotic stuff with a synth, like modulating analog waves in real time, it is more like using Fractal Painter to render brush strokes to simulate a shape. When you want to show motion, however, you have to either take a BUNCH of photos (a movie) or use a 3D modeler to create the images. The 3D modeler allows you to design a basic organic shape, such as an animal, and then, using reasonably simple parametric controls, to 'tell' it to move in fluid, organic ways that don't look robotic. Modern movies demonstrate that 3D technology can produce images that look extremely fluid, textured, and lifelike. Similarly, if you want a playable instrument that models all of the great expressive nuances of an acoustic instrument, you use physical modeling, and you 'tell' it to play a certain way by controlling the breath and midi inputs. The sophistication of the software and the time spent tweaking parameters are extreme in both cases, because fluid organic control is complex, but that's what people want. The 3D modeling case has much higher computational demands, though, and the motion capture and facial expression capture equipment is prohibitively expensive and not yet in common use. By contrast, PM instruments are playable today, and some careful parameter tweaking has already given us some excellent models to play.