To mimic a real acoustic space successfully requires that:
- the sound at the listener's position is calculated,
- changes happen smoothly as the listener moves around, and
- the sound is projected to the listener.
Each of these three stages is a challenge in itself.
From this data one can generate the impulse response of the room. Another algorithm adds HRTF filtering to the sound for spatialization, and a third creates the late reverberation.
These algorithms are programmed for a dedicated DSP (Digital Signal Processing) chip that convolves the sounds with the given impulse responses and HRTFs, either in real time or off-line.
Such methods run into problems in virtual reality environments: neither the listener nor the sound source can move, because the impulse response of the room and the HRTFs used are fixed.
Whenever the listener moves, one must check how the auralized sound should change. If this calculation lags, the listener will notice the sound dragging behind his movements.
If the listener only turns his head, one can get away with updating the azimuth and elevation of the current image sources.
If the listener's position changes, one has to run a visibility check on all the image sources. It is also necessary to calculate the delay time and amplitude of each image source (distance and distance attenuation), as well as its filtering (matching air and wall absorption) at different frequencies.
Auralization is applied to the direct sound and the first reflections to create a 3-D sound image. Currently this is done by applying ITDs (Interaural Time Differences) and HRTFs for headphone listening (binaural techniques). Together these create spatial panning, both vertical and horizontal. The HRTFs are FIR-type filters, and individually tailored filters (for different listeners) can be used if available.
Full auralization is not applied to all the first reflections, because HRTF filtering demands a great deal of CPU power. As an alternative, short HRTF filters are used, or even simple amplitude panning with ITD.
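Such a cheap reflection cue can be sketched as constant-power amplitude panning combined with an ITD from the classic spherical-head (Woodworth) approximation. The head radius, the panning law and the names below are illustrative assumptions, not taken from the DIVA code:

```cpp
#include <cmath>

struct StereoCue {
    double gainLeft, gainRight;
    double itdSeconds;  // positive = source on the right (left ear lags)
};

// Approximate a reflection's direction with amplitude panning plus an
// interaural time difference, instead of a full HRTF convolution.
StereoCue cheapPan(double azimuthRadians) {
    constexpr double kPi = 3.14159265358979323846;
    constexpr double headRadius = 0.0875;  // meters, assumed average head
    constexpr double c = 343.0;            // speed of sound, m/s
    StereoCue s;
    // Constant-power pan: map azimuth [-pi/2, pi/2] to a pan angle [0, pi/2].
    const double pan = (azimuthRadians + kPi / 2.0) / 2.0;
    s.gainLeft  = std::cos(pan);
    s.gainRight = std::sin(pan);
    // Woodworth's spherical-head ITD approximation.
    s.itdSeconds = (headRadius / c) * (azimuthRadians + std::sin(azimuthRadians));
    return s;
}
```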
The sound can be reproduced with headphones or loudspeakers. Headphones give the best control of the sound field at the listener's ears, as the surrounding acoustics have no effect on the sound. If headphone listening is undesirable, an array of loudspeakers can be used instead. We have adopted Vector Base Amplitude Panning (VBAP) for interpolating sound sources between loudspeakers. This method delivers 3-D sound to a large area, as opposed to one or a few listeners. We are also studying the possibility of using a pair of loudspeakers with HRTF filtering (the transaural technique). This method suffers from a very limited listening area: the illusion of a virtual acoustic space is disturbed if the listener moves or rotates his head.
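VBAP computes loudspeaker gains by expressing the source direction in the basis formed by the active loudspeaker directions. A two-dimensional, single-pair sketch follows (the function name is invented; real VBAP selects the active pair from the whole array, and uses loudspeaker triplets in 3-D):

```cpp
#include <array>
#include <cmath>

// Solve [g1 g2] * [l1; l2] = p for the gains of one loudspeaker pair,
// where l1, l2 and p are unit direction vectors of the speakers and
// the virtual source. Gains are normalised to constant power.
std::array<double, 2> vbapPair(double srcAngle, double spk1Angle, double spk2Angle) {
    const double l1x = std::cos(spk1Angle), l1y = std::sin(spk1Angle);
    const double l2x = std::cos(spk2Angle), l2y = std::sin(spk2Angle);
    const double px  = std::cos(srcAngle),  py  = std::sin(srcAngle);
    // 2x2 inverse via the determinant (speakers must not be collinear).
    const double det = l1x * l2y - l1y * l2x;
    double g1 = (px * l2y - py * l2x) / det;
    double g2 = (l1x * py - l1y * px) / det;
    const double norm = std::sqrt(g1 * g1 + g2 * g2);
    if (norm > 0.0) { g1 /= norm; g2 /= norm; }
    return {g1, g2};
}
```

A source exactly at a loudspeaker direction gets all the gain in that speaker; a source midway between the pair gets equal gains, which is what makes the panning smooth as a source moves across the array.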
Interpolating image sources is another problem. As the listener moves, the image sources change: sometimes a new image source appears or disappears, while at other times the same image source remains visible but with slightly different parameters.
Usually we check the listener's position about twenty times per second and, if necessary, rerun the image source computation. When image sources appear or disappear, a simple fade-in or fade-out is performed. The case of the same image source changing its parameters is more complex: one must move smoothly from the old parameters to the new ones, so interpolation is needed. The quantities to interpolate are the delay time of the image source, its amplitude and its ITD; the HRTF must also be changed. As a bonus of interpolating the distance (delay time) one gets a real Doppler effect.
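The delay interpolation can be pictured as a delay line whose fractional read position is moved smoothly; ramping the delay while reading is exactly what produces the Doppler shift when the distance changes. The class below is an illustrative sketch under that assumption (the names and the choice of linear interpolation are mine, not the DIVA implementation):

```cpp
#include <cmath>
#include <vector>

// A circular delay line supporting fractional (linearly interpolated)
// read positions, so the delay can be ramped smoothly between the old
// and new image-source values.
class InterpolatedDelay {
public:
    explicit InterpolatedDelay(int maxSamples)
        : buf_(maxSamples, 0.0f), write_(0) {}

    void write(float x) {
        buf_[write_] = x;
        write_ = (write_ + 1) % static_cast<int>(buf_.size());
    }

    // Read `delay` samples (possibly fractional) behind the write head.
    float read(double delay) const {
        const int n = static_cast<int>(buf_.size());
        double pos = write_ - 1 - delay;
        while (pos < 0.0) pos += n;
        const int i0 = static_cast<int>(pos) % n;
        const int i1 = (i0 + 1) % n;
        const double frac = pos - std::floor(pos);
        return static_cast<float>((1.0 - frac) * buf_[i0] + frac * buf_[i1]);
    }

private:
    std::vector<float> buf_;
    int write_;  // next write index
};
```

Sweeping `delay` a little per sample between the old and new values resamples the signal slightly, which is heard as the Doppler pitch change.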
In practice one keeps a record of the old image sources and tries to find out whether any of the new image sources are in fact slightly different versions of old, disappearing ones. When this is the case, the image sources are interpolated. The listening space can be arbitrarily complex. If the space were a simple box, one could get away with neat analytical calculations of the reflection paths, but this is seldom the case. Real spaces such as concert halls may have hundreds of reflecting surfaces, and most of them must be taken into account to make a good simulation of the space. Luckily one does not have to model every surface (a concrete wall may have millions of small facets), only the ones large enough to affect the acoustics.
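One plausible bookkeeping scheme for this matching is to key each image source by its reflection path, i.e. the ordered list of walls it bounced off: identical keys mean the same image source with updated parameters. This keying is an assumption on my part, not necessarily how DIVA identifies its image sources:

```cpp
#include <map>
#include <vector>

struct ImageSource { double delay, amplitude; };
using PathKey = std::vector<int>;  // wall ids along the reflection path

enum class Action { FadeIn, Interpolate };

// Decide, for every image source visible after a listener move,
// whether it is new (fade it in) or a survivor whose parameters
// should be interpolated from the old values.
std::map<PathKey, Action> matchSources(
        const std::map<PathKey, ImageSource>& oldSet,
        const std::map<PathKey, ImageSource>& newSet) {
    std::map<PathKey, Action> plan;
    for (const auto& kv : newSet)
        plan[kv.first] = oldSet.count(kv.first) ? Action::Interpolate
                                                : Action::FadeIn;
    return plan;
}
```

Old image sources absent from the new set are the disappearing ones and are simply faded out.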
Multiple sound sources each require their own processing. For example, if there are musicians in a concert hall, every instrument must be auralized separately to place it at its own position in the space. This can be considered the next step up from typical surround sound systems.
Late reverberation does not have to be derived exactly from the room geometry. The most important thing is to match its tone color and reverberation time to the space; this fine-tuning must be done by ear in advance. There are a few implemented algorithms that can be used in the signal chain, and one thing we are studying is how best to get 3-D sound out of any of them, as they share some common features.
As a result, the audio processing has been implemented in software written in C++, rather than with the conventional hybrid of a host computer coupled to dedicated DSP chips. This also removes the problem of communication between the CPU and the DSP. The system can be used in real time or, for higher-resolution sound, off-line.
The system requirements for real-time operation are strict: one needs a fair amount of computational power and multichannel audio capabilities (digital preferred). This has led us to use Silicon Graphics workstations.
Two programs handle the processing. They can run on different computers and communicate over a standard UDP socket. One program calculates the image sources at the listener's position; in practice this means computing the direct sound and some reflections from all sound sources to the listener. The acoustic model can be changed during normal operation, so theoretically one could even study the effect of closing or opening a door, or of moving a wall.
The other program handles the audio: processing the various image sources (direct sound and reflections) and running a reverberation algorithm. All sound sources are kept separate during processing and mixed together only for listening.
The system was thoroughly tested at the SIGGRAPH 97 conference: five days of heavy usage without a single crash or uncontrolled audio burst. Thousands of people listened to the 3-D sound on headphones, and most were very pleased. Four virtual musicians (sound sources) played in good harmony throughout the week.
The DIVA sound system has also been used to create soundtracks for video and for a short film featuring 3-D image and sound.
This page is maintained by Tommi Ilmonen, e-mail: Tommi.Ilmonen(at)hut.fi.
The page was last updated on 30.9.1999.