Sound Localization & Spatialization Techniques
for Portable Digital Environments
When considering what techniques are available for positioning a sound in space, it is essential first to develop an understanding of how, with only two ears, humans can perceive a sound's location in three dimensions. While most of the spatialization techniques developed from replicating human spatial audio perception remain relatively consistent across audio delivery systems (e.g., stereo vs. surround, speakers vs. headphones), each system, and the audio file formats associated with it, has unique difficulties as well as compelling benefits and future potential. As a result, a discussion of possible delivery systems and their related file formats will help guide a creator in making choices and troubleshooting any issues that may arise when spatializing audio within each format. Finally, with the proliferation of powerful personal electronic devices that are making immersive, and more importantly interactive, virtual environments (e.g., YouTube/Facebook 360, Augmented Reality, Virtual Reality) more widely accessible, replicating or modeling the ways a sound interacts with its surroundings is becoming increasingly vital to giving the listener a realistic sense of auditory presence.
Part 1: Spatial Perception
Identifying a sound’s spatial location requires listening for different cues in each of the four sonic dimensions: lateral, front/back, elevation, and distance. In each of these dimensions, the component frequencies or qualities of a sound will affect how, or if, localization is possible.
The lateral location of a sound can be reliably identified by perceiving differences in amplitude and timing between each ear.
In general, when a sound is louder in one ear than the other, the listener will perceive the sound as being on the louder side. The bigger the difference in amplitude (interaural intensity difference, IID), the further to one side or the other the sound will appear to be. However, for sounds below ~800 hertz (Hz), IIDs become increasingly difficult to discern because of the size of their wavelengths relative to the listener's head. In this range, interaural time differences may help.
Human ears are on average 21.5 centimeters apart. Sound travels through air at approximately 343 meters (34,300 centimeters) per second which, depending on a sound's location, can result in small timing differences in the range of 0 to ~625 microseconds. These differences are called interaural time differences, or ITD, and two aspects of ITD help a listener locate a sound: onset/arrival and, more subtly, phase relationship.
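The ~625 µs figure follows directly from these two numbers; a quick back-of-the-envelope check (treating the ear spacing as a straight-line path, which is a simplification):

```python
EAR_SPACING_M = 0.215        # ~21.5 cm between the ears
SPEED_OF_SOUND_M_S = 343.0   # speed of sound in air

# Worst case: the sound arrives from directly to one side, so it must
# travel the full ear spacing before reaching the far ear.
max_itd_us = EAR_SPACING_M / SPEED_OF_SOUND_M_S * 1_000_000
print(round(max_itd_us))  # ~627 microseconds, commonly rounded to ~625
```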
A sound that originates from one side will arrive at the nearest ear first before arriving at the other. The greater the ITD of the sound’s arrival, the more to one side or the other it will appear to be. If a sound has no clear onset quality, listening for interaural phase differences can aid in locating the sound.
Interaural Phase Relationships
From 0 to ~80 hertz, wavelengths are too large relative to most heads, and both IIDs and interaural phase differences are too minuscule to be detected, giving the sensation of an omnidirectional sound.
For frequencies from around 80 to ~800 Hz, IID are too challenging to identify, but differences in phase can be detected and relied upon for localization.
There is a sweet spot from ~800 to ~1600 Hz where both IID and ITD are useful for sound localization, but sounds above this range have wavelengths that are too small, completing more than half a cycle within 21.5 cm. While interaural phase differences can still be detected, they can no longer be used to locate a sound accurately. Much like the illusion of a forward-rotating wheel appearing to rotate backward when captured on film, the interaural phase relationship of a sound above ~1600 Hz can be deceptive as to its direction of origin.
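These crossover points track the wavelengths involved; computing λ = c/f for the boundary frequencies shows why ~80, ~800, and ~1600 Hz mark the transitions:

```python
SPEED_OF_SOUND_M_S = 343.0  # speed of sound in air

for freq_hz in (80, 800, 1600):
    wavelength_cm = SPEED_OF_SOUND_M_S / freq_hz * 100
    print(f"{freq_hz} Hz -> wavelength of {wavelength_cm:.1f} cm")

# 80 Hz   -> ~428.8 cm, far larger than any head
# 800 Hz  -> ~42.9 cm, so half a cycle spans ~21.4 cm, about the ear spacing
# 1600 Hz -> ~21.4 cm, a full cycle fits between the ears, making phase ambiguous
```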
The wavelengths of frequencies above ~1600 Hz are small enough to be obstructed by the average human head. If a sound contains frequencies above ~1600 Hz, subtle high-frequency attenuation will be audible at the ear opposite the sound's origin. This phenomenon is known as head shadowing, and listening for it can help with localization.
Front/Back and Elevation
If a listener cannot alter the axis relationship of their ears to a sound, they cannot rely upon IID or ITD to determine whether a sound is in front of, behind, above, or below them. Instead, they must listen for spectral modifications caused by their head, neck, torso, and pinnae (outer ears). While the lateral perception of a sound is already somewhat person-specific (the distance between the listener's ears determines the frequency crossovers between the different lateral localization modes), the way bodies spectrally modify sounds depending on their spatial relationship to the sound is extremely person-specific, and even clothing-specific. The individual nature of sound localization becomes a significant hurdle for delivering 360° audio environments that translate accurately to each listener.
Direction Selective Filters
Similar to head shadowing, the shape of a listener's head, neck, torso, and pinnae filters a sound differently depending upon its relative location. For example, consider a sound directly behind a listener. The sound arrives at both ears simultaneously, eliminating the use of both IID and ITD. But because the listener's ears are angled slightly forward, some higher frequencies will be filtered out (reflected off the backs of the ears). The sound received by the listener's ears will have a different spectral characteristic than if the same sound source had originated in front of the listener.
Head-Related Transfer Functions
A person's direction selective filters are captured in an anechoic chamber as Head-Related Impulse Responses (HRIR). This process involves placing small microphones in each of the listener's ears and recording either white noise pulses or, for a higher signal-to-noise ratio (SNR), full-spectrum sine tone sweeps at as many points as possible on a one-meter sphere surrounding the listener. An HRIR records how the listener's body alters any frequency arriving from any of those recorded locations. A captured HRIR can then be applied to any sound as a Head-Related Transfer Function (HRTF) to deceive a listener into believing the sound is coming from the same location as the applied HRTF.
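In code, applying an HRTF amounts to convolving a mono source with the left- and right-ear HRIRs. A minimal NumPy sketch; the toy HRIR arrays below are invented placeholders, where a real application would load measured responses for the desired direction:

```python
import numpy as np

def apply_hrtf(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right HRIR pair to
    produce a two-channel (binaural) signal."""
    return np.stack([np.convolve(mono, hrir_left),
                     np.convolve(mono, hrir_right)])

# Toy HRIRs (hypothetical): the right ear's response is delayed and
# attenuated, crudely imitating a source to the listener's left.
hrir_left = np.array([1.0, 0.2, 0.0, 0.0])
hrir_right = np.array([0.0, 0.0, 0.6, 0.1])

mono = np.random.default_rng(0).standard_normal(1024)
binaural = apply_hrtf(mono, hrir_left, hrir_right)  # shape (2, 1027)
```

Real-time systems typically perform this convolution in the frequency domain and interpolate between measured HRIR positions as the source, or the listener's head, moves.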
Head motion is one final localization strategy: a listener can rotate their head to align their ears along any axis, turning any challenging perceptual axis into a lateral localization problem.
Detecting the distance to a sound is relatively simple, but as will be seen later, accurately recreating the sensation of distance in synthetic audio environments can be the key to successfully immersing the listener. Perceiving a sound's distance requires listening beyond IID, ITD, and HRTFs to cues given to us by the sound’s environmental interactions.
For sound sources that are familiar – a car engine, a musical instrument, someone speaking – one way to determine distance with relative accuracy is to compare previous experiences of similar sounds against the perceived volume of the current sound. With unfamiliar sounds, listening for relative volume changes over time can reveal whether the sound is, at the very least, moving closer or further away. If the sound is not moving, listening for other environmental cues can help determine its distance.
Initial Time Delay
If the environment has sonically reflective surfaces, a sound's approximate distance can be discerned by comparing the arrival time of the direct sound against the arrival of its first reflections off the surrounding surfaces. The longer the time between the direct sound and the reflected sound, the closer the source is perceived to be. This method is not an option in a wide-open environment without nearby reflective surfaces.
Ratio of Direct Sound to Reverberation
The late reflections, or diffusion, of a sound within a reverberant space will eventually overpower the direct sound once the source is too far from the listener. This can also occur if the sound's direct path to the listener is occluded by a wall or large object.
How quickly a sound can shift perceptually from the left side to the right can indicate its distance. A sound that is very close will transfer quickly from one side to the other, while a sound that is further away will take longer.
Higher frequencies tend to be reflected and absorbed more readily by objects, and by the medium (air) through which they travel, so as a sound gets further away its higher-frequency content decreases in amplitude more rapidly than its lower-frequency components. The effect is very subtle and usually not audible until the sound is hundreds, if not thousands, of feet away, but it is one final sonic attribute that can provide distance cues.
Part 2: Audio Delivery Systems
Transitioning from thinking about how to perceptually locate a sound to looking at how to spatialize a sound in artificial environments, a brief pause should be taken to consider the different systems currently available for delivering audio. Each system has its own set of benefits and limitations.
Personal Delivery Systems
For a growing portion of the population, the most common way to consume music is through a personal listening system, like over-ear headphones or earbuds. Personal listening systems are currently the ideal means for experiencing spatial audio environments: they typically allow better acoustic isolation from the external world, simplify the implementation of head-tracking, avoid many of the issues that can arise when applying HRTFs for sound localization, keep the listener in the “sweet spot,” and are very portable, to name a few benefits.
Personal delivery systems come in a variety of forms.
Closed-back headphones can offer great acoustic isolation and generally better low-frequency response but at the potential cost of long-term comfort and accurate sound reproduction due to internal resonance and pinnae deformations. Closed-back headphones can either be on-ear, or over-ear. Over-ear closed-back headphones offer superior acoustic isolation and comfort. But on-ear closed-back headphones tend to be more portable and more affordable.
Open-back headphones can be more comfortable over extended listening sessions and tend to deliver a more natural-sounding spatial representation, though at the cost of less acoustic isolation from the external world. Almost all open-back headphones are designed to fit over the listener's ears.
The cheapest and most accessible personal listening solution, earbuds also tend to offer the lowest audio fidelity, least acoustic isolation, and poorest bass response. Excellent low-frequency response is not necessarily relevant to audio spatialization, but it can significantly raise the level of immersion, or sense of presence the listener experiences.
In-ear monitors create the best situation for delivering high-quality spatialized audio experiences. They usually offer excellent frequency response across the full range of human hearing and a high degree of acoustic isolation from the outside world. They also bypass any spectral modification from the listener's pinnae. However, they also bypass the ear canal, which can prevent the proper reception of HRTFs.
External Delivery Systems
External delivery systems (speakers) are historically thought of as providing a better listening experience, but consumers are now trending away from traditional hi-fi home stereo systems and towards more portable wireless solutions. Both traditional and contemporary wireless home stereo systems are inferior to personal systems at delivering spatial audio. It is still important to discuss these systems and their specific spatialization techniques, but their limitations make them impractical for widely distributed interactive and immersive audio environments.
Today, probably the most used external delivery systems are the speakers built into our personal electronic devices (e.g., mobile phones, laptops, hand-held game consoles). These speakers typically have very narrow frequency response ranges and little, if any, separation. Some listeners may use Bluetooth/wireless speakers that offer broader frequency responses but deliver poor channel separation and add audio/visual synchronization issues. While there are undoubtedly spatial translation issues when using personal delivery systems, many more issues can arise when using speakers: poor isolation, no head-tracking, narrow “sweet spots,” and room effects.
Two final and significant limitations of delivering spatial audio with external delivery systems are accessibility and portability. An ideal speaker setup would require two speakers to the front left and front right of the listener for delivering lateral stereo audio, and two speakers to the rear left and rear right for front/back stereo audio.
So far, with four speakers, only two dimensions of audio are possible. To complete the third dimension, another set of four speakers above the listener and another set of four below are needed, as well as the ability to suspend the listener at the exact center of all twelve speakers. And this list does not even account for the additional equipment and cabling required to power and run the system. The cost, ongoing maintenance, and expertise required to properly install, calibrate, and operate such a system dramatically limit the availability and portability of an external setup capable of delivering compelling spatial audio experiences.
Negatives aside, there are benefits to using external speaker systems for delivering spatial audio experiences: It makes having a group experience and true spatialization possible. And it can simplify the computations and processing required to create complex spatial effects.
Standard Speaker Configurations
Many configurations are possible, but there are several standard surround-sound speaker setups. Each of these configurations commonly has a subwoofer and a center channel added. The subwoofer extends the frequency range, and because low frequencies are omnidirectional, its positioning is reasonably flexible. The center channel is usually reserved for dialog and placed directly in front of the listener, in line with the front two speakers.
Stereo
Two speakers positioned in an equilateral triangle with the listener, at ear level.
Quadraphonic
A front stereo pair and a rear stereo pair of speakers. There are a variety of ways four speakers might be arranged; most computer/academic music calls for the speakers to be positioned in a perfect square, at ear level, with the listener at the very center.
5.1 Surround
Commercial and home theaters typically expand on the quadraphonic setup by adding a center speaker in front of the listener and a subwoofer. The front speakers are positioned at ear level, angled 20-30˚ towards the listener; the rear speakers sit nearly alongside the listener, slightly above ear level, at a 90-110˚ orientation.
Octophonic
Eight equally spaced speakers at ear level, in either a circle or a square, with the listener at the center. Two conventional channel orderings are used: stereo pairs spaced equally from front to rear, or a clockwise, sequential ring starting in front of the listener.
Further Channel Expansion
Larger commercial movie theaters and specialized music venues/research facilities often expand upon these standard speaker arrangements by adding speakers along the horizontal and vertical planes.
Adding speakers along the horizontal plane can increase lateral localization accuracy as well as expand the “sweet spot,” positioning more listeners in the ideal listening zone. Adding speakers above and below the audience not only allows for the vertical placement of sounds but can also substantially add to the audience's sensation of envelopment in the sonic environment.
Part 3: Distribution Formats
Knowing the delivery system and the format in which the audio will be delivered together dictates which spatialization techniques will be available and most effective.
Here are some of the terms that will be heard in reference to audio file formats, and their significance to delivering spatialized audio.
Compressed vs. Uncompressed
Digital audio file types fall into two major categories: compressed and uncompressed. Most portable electronic devices default to compressed audio files because they require less storage space than uncompressed. Both compressed and uncompressed audio files can deliver spatial audio, but some compressed audio formats can introduce sonic degradation that may alter or confuse spatial effects.
The two primary uncompressed audio file formats are Waveform Audio File Format (.WAV) and Audio Interchange File Format (.AIFF).
Lossless compression algorithms decrease the amount of storage a file requires without removing or losing any information.
As an example we could take this series of numbers:
00000000 11111111 000 111 00 11 0 1
And express it more efficiently as:
80 81 30 31 20 21 10 11
Compressing an audio file without losing any of the original data is a little more complicated than this, but hopefully this conveys the general concept of how lossless compression can work.
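The number example above is essentially run-length encoding: each pair stores a run length followed by the repeated symbol. A toy implementation of that idea (real lossless codecs such as FLAC use linear prediction rather than simple run counting, but the principle of removing redundancy without losing information is the same):

```python
def rle_encode(bits):
    """Collapse runs of repeated symbols into [count, symbol] pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][1] == bit:
            runs[-1][0] += 1
        else:
            runs.append([1, bit])
    return runs

def rle_decode(runs):
    """Expand [count, symbol] pairs back into the original sequence."""
    return "".join(symbol * count for count, symbol in runs)

data = "00000000" + "11111111" + "000" + "111" + "00" + "11" + "0" + "1"
encoded = rle_encode(data)
assert rle_decode(encoded) == data  # no information lost
print(encoded)  # [[8, '0'], [8, '1'], [3, '0'], [3, '1'], [2, '0'], [2, '1'], [1, '0'], [1, '1']]
```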
Some common lossless audio file formats are:
Free Lossless Audio Codec (FLAC) uses a linear prediction method for compressing files by 40% to 70%.
Apple Lossless Audio Codec (ALAC), commonly stored within an MPEG-4 Audio container file (M4A), also uses a linear prediction method capable of reducing files by as much as 60% of their original size.
Lossy compression can be a significant concern even for non-spatialized audio because file sizes are reduced by actually removing information from the file.
The Moving Picture Experts Group Layer-3 (MP3) file was conceived at a time when exchanging and storing large amounts of data was still costly. The developers of the MP3 discovered they could reduce an audio file's size by removing information considered less perceptually relevant. As has been seen, however, accurate delivery and localization of spatial audio depend on the faithful reproduction of the entire audio signal. Removing or altering the audio in any way can distort or confuse the listener's ability to locate the sound.
Some common lossy audio file formats are:
Moving Picture Experts Group Layer-3 Audio (MP3)
Advanced Audio Coding (AAC)
File Container Formats
Arising in the early 2000s, the unique Ogg file is a free and popular example of a flexible file container format that can carry a range of both lossless and lossy audio formats along with video. Several other container formats are QuickTime's .mov, the Moving Picture Experts Group (MPEG) file stream, and the Audio Video Interleave (AVI) file.
While file container formats can offer many conveniences for developers and audio professionals, they can be risky to work with because it can be difficult to determine which file formats are contained within them.
Discrete vs. Matrixed
In surround audio setups, information is delivered to each output channel either as a discrete stream of data or matrixed. For a discrete channel, the preferred and more precise of the two, all the audio information streaming to an output channel is independent of all the others. A matrixed channel, on the other hand, borrows portions of another channel's stream of audio information to create a faux surround experience.
Ambisonics
Perhaps the most useful, and most misunderstood, area of research and development for creating easily distributable surround audio experiences is Ambisonics: an audio format into which spatial audio information can be encoded and then decoded for any delivery system.
Ambisonic audio has been around since its development in the 1970s with the support of the British National Research Development Corporation. For most of its lifetime, it has been used only by academics and audiophiles. However, with the recent arrival of more portable, visually immersive, and interactive digital environments, Ambisonics is finally finding large-scale commercial applications.
Ambisonics treats the area around the listener as one large sphere that is then divided into smaller zones called spherical harmonic components, which a sound can be placed within. The greater the number of zones the listener’s space is divided into, the higher the spatial resolution.
Ambisonic audio is ranked in terms of orders. The greater the order of Ambisonics, the higher the spatial resolution and the more channels of audio required to capture and store all the spatial information.
The way Ambisonic audio works and what differentiates the orders is best explained by imagining an array of microphones with different pickup-patterns all capturing the sound arriving at a single point from different directions.
Zeroth-Order Ambisonics would be the equivalent of listening to a room with an omnidirectional microphone – all sounds are heard but without any spatial information.
First-Order Ambisonics requires four channels of audio, referred to as W, X, Y, and Z. Continuing with the imagined array of microphones, First-Order Ambisonics maintains the omnidirectional microphone (W) but adds three bidirectional microphones oriented along the three-dimensional axes: left/right, front/back, and up/down.
Second-Order Ambisonics requires nine channels of audio (WXYZRSTUV) and adds five additional bidirectional microphones to the imagined array each oriented along evenly distributed axes.
Third-Order Ambisonics requires 16 channels of audio (WXYZRSTUVKLMNOPQ) and adds another seven bidirectional microphones.
Higher order Ambisonics is possible but is currently uncommon because of the number of channels required. As an example, sixth-order Ambisonics would require 49 channels of audio.
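The channel counts above follow a simple pattern: full-sphere Ambisonics of order N requires (N + 1)² channels. First-order encoding itself is just a set of gains derived from the source direction; a sketch using the traditional B-format (FuMa) equations, which are one common convention among several:

```python
import math

def ambisonic_channels(order):
    """Channels required for full-sphere Ambisonics of a given order."""
    return (order + 1) ** 2

def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode a mono sample to first-order B-format (W, X, Y, Z)
    using the traditional FuMa weighting (W attenuated by 1/sqrt(2))."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2)                  # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)   # front/back
    y = sample * math.sin(az) * math.cos(el)   # left/right
    z = sample * math.sin(el)                  # up/down
    return w, x, y, z

assert ambisonic_channels(1) == 4   # first order
assert ambisonic_channels(3) == 16  # third order
assert ambisonic_channels(6) == 49  # sixth order
```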
Ambisonic audio is commonly discussed in terms of formats, each referencing a different state of the audio from capture to playback. In practice, B-Format is the only one regularly discussed.
A-Format: the direct signal captured by an Ambisonic microphone.
B-Format: the encoded audio signal stored as a multichannel-audio file ready for decoding/playback. Usually refers to first-order (4-channel) files, but can also be used to describe higher orders.
C-Format (Consumer Format): a format proposed in the early days of Ambisonic audio that strove to make playback possible on home stereos. It was not widely adopted and is rarely used today.
D-Format: An Ambisonic audio stream decoded for any delivery system/configuration.
G-Format (Geoffrey’s Format): named for its inventor, Geoffrey Barton, this was another attempt at driving the commercial adoption of Ambisonics by creating a default decoded format for the popular 5.1 home theater setup.
In addition to the lettered formats listed above, several conventions exist within the 4-channel B-format standard which differ in how they organize the four channels (e.g., the traditional WXYZ ordering versus the ACN ordering, WYZX, used by the AmbiX convention).
New conventions may be adopted as higher orders of ambisonics become more practical, so it will be essential to monitor how the channels are organized with each new convention.
Pros and Cons of Ambisonics
There are positives and negatives to working with Ambisonic audio. The greatest strength of Ambisonic audio is undoubtedly the separation of its two primary states: encoded and decoded. With this separation, we can take an audio source that was captured/generated in any format (mono, stereo, quadraphonic, A-Format) and encode it to any order of Ambisonics. We can then decode that Ambisonic audio file to a given delivery system (stereo, headphones, quadraphonic, 5.1) and the spatial information will be preserved at a relatively high level of accuracy.
Ambisonic audio is also much more efficient at storing and delivering multichannel audio with fewer channels than would be required by a discrete multichannel format. For example, first-order Ambisonics only requires four channels to deliver six directions of spatial information which could be decoded to any number of speakers.
A final selling point of Ambisonic audio is that the format is free of patents and there are many free tools for capturing, manipulating, encoding and decoding Ambisonic audio available for all major operating systems.
The format is not without its weaknesses. One major shortcoming is that Ambisonic audio, despite a recent reinvigoration driven by the increased distributability of AR, VR, and 360 video, has not been widely adopted by audio professionals, likely because it is conceptually difficult to understand and its perceptual results can be difficult to judge.
Sonically it is also not perfect for delivering spatial audio. The “sweet spot” is tiny, spatial confusion commonly occurs, the audio can be heavily colored from comb-filtering when played over speakers, and setting up an ideal delivery system can be extremely challenging even for experienced engineers.
Binaural Audio
A final format that is essential to making spatial audio easily consumable by large audiences using readily accessible delivery systems is binaural audio. The term mostly refers to a recording technique that mounts microphones either in a dummy head or in an actual human's ears. This is the same recording technique used to measure and capture HRIRs. As a result, the recorded sounds are imprinted with the IIDs, ITDs, and spectral modifications (HRTFs) of the body, head, and ears the microphones were placed on or within. On playback, the listener has the same aural experience as if they had actually been there.
Because the crosstalk (sound emitted by one speaker which is then received by both ears) that would occur when playing binaural audio over speakers would distort the binaural reproduction, binaural audio can only be accurately played back using a personal-delivery system. Binaural audio makes an ideal partner to Ambisonics for mass delivery of immersive and interactive audio experiences to individual consumers. The creators of the audio experience can encode all the spatial audio as a B-format Ambisonic audio file. Then the consumer’s playback system (a cellphone and cheap earbuds) can decode and render a binaural realization of the spatial audio in real time as determined by any interaction their playback system may afford them.
Part 4: Spatialization Techniques
Audio spatialized for one delivery system will not translate accurately to another. This is because of the natural crosstalk that occurs with speakers and the acoustic and stereo isolation that results from wearing headphones/earbuds. If a listener is wearing headphones, head tracking must also be accounted for: monitoring the orientation of the head in order to apply the transformations that keep the sonic environment stationary as the listener turns.
External Delivery Systems
From the perspective of spatialization techniques and processes, external delivery systems are the easiest to use. To make a sound appear to come from a specific location, place a speaker at that location and play the sound through it.
When working with a fixed speaker arrangement (e.g., quadraphonic), a sound can be made to appear to come from a point precisely between two speakers by reducing the sound's volume by about 3 dB and playing it back from both speakers simultaneously. The listener will hear the sum of both speakers' outputs and will perceive the sound at its original volume, as if it is coming from a phantom source at the center of the two speakers. Using gradations of this technique enables a sound to be positioned anywhere between two speakers.
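The ~3 dB reduction is the constant-power pan law; a sketch of equal-power panning between two adjacent speakers (the cos/sin law is a common convention, not the only one):

```python
import math

def constant_power_pan(sample, position):
    """position: -1.0 = fully in the left speaker, +1.0 = fully right.
    cos/sin gains keep the total acoustic power constant."""
    angle = (position + 1) / 2 * (math.pi / 2)
    return sample * math.cos(angle), sample * math.sin(angle)

left, right = constant_power_pan(1.0, 0.0)  # phantom center
# each speaker receives ~0.707 of full scale, i.e. roughly -3 dB
```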
The problem with employing psychoacoustic spatialization techniques like the phantom image is that proper reception of the spatial audio relies on the listener remaining in the center, or “sweet spot,” of the speaker arrangement used to deliver that technique.
Wave Field Synthesis
There is one method of speaker-based sound spatialization that does not contend with the “sweet spot” issue. Rather than using psychoacoustic tricks, Wave Field Synthesis (WFS) relies upon a massive array of equally spaced full-range speakers to physically create the desired sound field. While WFS can create a much more realistic spatial sound environment, the number of speakers required makes it an extraordinarily impractical technique that will likely never see successful consumer-level applications.
Balancing a sound across more than two speakers will make the sound appear to occupy more physical space. Subtle spectral alteration to the sound as it plays through each speaker can enhance this effect, but taken too far can lead to localization confusion.
Head Shadowing, IID, and ITD
The beautiful thing about working with external delivery systems is that they take care of all the person-specific spectral modification required for sound localization. However, playing with the timing and frequency differences of a sound as it simultaneously transmits from different speakers can create more intriguing and immersive sonic environments at the cost of confused localization.
Spatial Effect Ideas for use with External Delivery Systems
Spatial Filter Sweeps: Distribute a sound across multiple speakers, apply the same filter (lowpass, highpass, bandpass) to all channels. Then modulate/automate the filter’s cutoff frequency either in a synchronized or randomized fashion.
Spatial Tremolo: Distribute a sound across multiple speakers then modulate the amplitude of each channel in either a synchronized or randomized rate, somewhere between 0 and ~20 Hz. Synchronizing the rate of the tremolo across all channels and adjusting each channel’s phase alignment can result in repeating movements around the space. Asynchronous rates will result in endless variations in spatial movements.
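The spatial tremolo above can be sketched as per-channel amplitude modulation with phase-offset LFOs; the channel count, rate, and offsets below are illustrative:

```python
import numpy as np

def spatial_tremolo(mono, sample_rate, num_channels=4, rate_hz=2.0):
    """Modulate a mono signal across channels with phase-offset LFOs,
    creating the impression of movement around the speaker array."""
    t = np.arange(len(mono)) / sample_rate
    channels = []
    for ch in range(num_channels):
        phase = 2 * np.pi * ch / num_channels   # evenly spaced offsets
        lfo = 0.5 * (1 + np.sin(2 * np.pi * rate_hz * t + phase))
        channels.append(mono * lfo)
    return np.stack(channels)

sr = 44_100
mono = np.random.default_rng(1).standard_normal(sr)  # 1 s of noise
out = spatial_tremolo(mono, sr)                      # shape (4, 44100)
```

Randomizing `rate_hz` per channel instead of sharing one rate gives the asynchronous, endlessly varying movement described above.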
Microtonal Pitch Inflections: Using either fixed or time-varying processing, discretely raise or lower the pitch of a single spatially distributed sound. Small alterations in pitch will thicken the resulting sound and give it a “shimmering” quality. More significant alterations in pitch may cause the listener to no longer perceive them as being a single sound source.
Complex Delay Networks: To achieve a more advanced technique, send a single audio source through a complex network of delay lines with feedback. Send each channel of delayed sound directly to an output speaker as well as to feedback into one, or multiple delay line inputs. This technique works best with single impulse sounds, like the single strike of a piano key or drum hit. This technique can be taken further by inserting additional processing into the feedback network. For example a lowpass filter, or reverb effect. These effects will repeatedly process the sound as it echoes through the feedback network and around the space.
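A minimal version of the delay-network idea: each channel's delay line feeds the next channel's input, with a feedback gain below 1.0 so the echoes decay as they circulate. Delay times and gains here are illustrative:

```python
import numpy as np

def circular_delay_network(source, delays, feedback=0.6):
    """Each channel delays its signal and feeds the next channel,
    so a single impulse echoes around the speaker ring while decaying."""
    num_ch = len(delays)
    length = len(source) + max(delays) * 10     # room for the echo tail
    out = np.zeros((num_ch, length))
    out[0, :len(source)] = source               # inject into channel 0
    for n in range(length):                     # process sample by sample
        for ch in range(num_ch):
            if n >= delays[ch]:
                nxt = (ch + 1) % num_ch
                out[nxt, n] += out[ch, n - delays[ch]] * feedback
    return out

impulse = np.array([1.0])
echoes = circular_delay_network(impulse, delays=[5, 7, 6, 4], feedback=0.5)
# the impulse reappears in channel 1 after 5 samples at half amplitude,
# in channel 2 after 12 samples at a quarter, and so on around the ring
```

Inserting a filter or reverb into the feedback path (processing each delayed value before adding it to the next channel) reproduces the repeated-processing effect described above.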
Personal Delivery Systems
Creating realistic spatial audio environments inside personal delivery systems requires much more work because we do not have the luxury of an acoustic environment to provide the physical interactions necessary for localization cues. For example, if a listener is wearing headphones and only the left earphone emits a sound, the listener will undoubtedly identify that only their left ear hears the sound. It will, however, sound unnatural, because they will not hear any portion of that sound with their right ear as they would if the same audio file were played back over speakers.
Of course, for creative purposes, this problem could be exploited. Considering what we know about the omnidirectional behavior of low frequencies, receiving completely different low-frequency content in each ear can be very unnerving and anxiety-inducing for most listeners. Cutting the low-frequency content entirely from one ear may disrupt a listener’s equilibrium and induce nausea.
The interaural differences that we listen for as cues to a sound’s lateral location can be easily created using audio tools bundled with most DAWs. Below they will be discussed in isolation, but combining two or more of these processes can result in a more dramatic spatial effect.
Creating Interaural Intensity Differences
Most DAWs provide the user with either a balance or a pan control in every audio track’s mixer section. While on the surface these may appear to have the same effect, there are some critical differences.
On a mono audio source, balance and pan both produce the same results and can be used to set the amplitude distribution of a sound across two output (stereo) channels. On a stereo audio source, however, balance adjusts the independent level of the left and right channels of the audio source. If the source audio has different information on the left and the right channels, one channel will be wholly lost whenever the balance control is adjusted to either extreme. A stereo pan control, on the other hand, allows for the independent amplitude distribution of a sound between two output channels. In this case, no stereo information is lost when the pan control is set to either extreme. Panning has the added benefit of being able to control the stereo width. Setting both the left and the right pan controls to the same value will result in folding the stereo information down to mono and placing it at that location within the stereo field.
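The distinction can be sketched in code. The balance control below scales the two source channels independently (losing one entirely at the extremes), while the pan control distributes a mono signal between two outputs; the constant-power law used here is one common choice, and the function names are assumptions for illustration:

```python
import math

def balance(stereo, pos):
    """pos in [-1, 1]. Scales the L/R source channels independently;
    at pos = +1 the left source channel is muted entirely (and vice
    versa), so stereo information is lost at the extremes."""
    lgain = min(1.0, 1.0 - pos)
    rgain = min(1.0, 1.0 + pos)
    return [(l * lgain, r * rgain) for l, r in stereo]

def pan_mono(mono, pos):
    """Constant-power pan of a mono signal between two outputs:
    pos = -1 is hard left, +1 is hard right, 0 is center (-3 dB each)."""
    theta = (pos + 1) * math.pi / 4        # maps [-1, 1] to [0, pi/2]
    lgain, rgain = math.cos(theta), math.sin(theta)
    return [(s * lgain, s * rgain) for s in mono]
```

Panning each channel of a stereo source with two such controls, both set to the same position, folds the material to mono at that location, matching the width behavior described above.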
Due to the aural separation of personal listening systems, it is advisable to apply only minor changes to amplitude to preserve a more natural spatial sensation.
Creating Interaural Time Differences
Most DAWs come bundled with a variety of delay effects. However, only a few allow a sound to be delayed on the same timescale as ITDs (< ~625 µs). For example, Logic Pro X includes the “Sample Delay” audio effect, which enables the user to independently set the delay time for the left and right channels by a number of samples.
If the Logic Pro X project is set to a sample rate of 48,000 samples per second, then one sample is equal to approximately twenty-one microseconds, and thirty samples of delay are equal to approximately 625 microseconds. This audio effect could be used to suggest a lateral location by adjusting either the left or the right delay to ≤ 30 samples.
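The same sample-delay idea can be sketched directly (the function name and clamping behavior are illustrative assumptions): delay one ear's copy of a mono signal by up to ~30 samples so the other ear receives the sound first.

```python
SAMPLE_RATE = 48_000
MAX_ITD_SAMPLES = 30   # ~625 microseconds; one sample is ~20.8 us at 48 kHz

def apply_itd(mono, delay_samples, delay_left=True):
    """Delay one channel by a few samples to create an interaural time
    difference. Delaying the left ear makes the sound arrive first at
    the right ear, pulling the perceived location toward the right.
    Delays are clamped to the plausible ITD range."""
    d = max(0, min(delay_samples, MAX_ITD_SAMPLES))
    delayed = [0.0] * d + list(mono)
    other = list(mono) + [0.0] * d    # pad so both channels match in length
    return (delayed, other) if delay_left else (other, delayed)
```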
Head shadowing can be mimicked by applying a high-shelf filter to either the left or the right channel to attenuate frequencies above ~1600 Hz.
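A crude first-order approximation of that shelf can be built from a one-pole lowpass: pass the band below the cutoff unchanged and attenuate the remainder. The cutoff, gain, and function name below are assumptions for illustration; a DAW's shelf filter would typically be a higher-quality biquad.

```python
import math

def head_shadow(far_ear, sample_rate=48_000, fc=1600.0, high_gain=0.4):
    """First-order high-shelf sketch: split the signal into a lowpass
    band (below ~fc) and the residual highs, then attenuate only the
    highs, mimicking the head's shadowing of high frequencies at the
    far ear."""
    rc = 1.0 / (2 * math.pi * fc)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)            # one-pole lowpass coefficient
    low, out = 0.0, []
    for x in far_ear:
        low += alpha * (x - low)      # low band, passed through unchanged
        out.append(low + high_gain * (x - low))
    return out
```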
Front, Back, and Elevation
Without tracking the listener’s head movement and allowing them to reorient their ears to check for front/back or elevation cues, the other dimensions of spatial audio can be extremely difficult to mimic accurately and universally using the tools included in most DAWs.
In recent years, however, many freely available Ambisonic software plugins have been released that make applying generic HRTFs and rendering the resulting spectral modifications as a binaural audio file relatively painless. Still, until there is a way for users to capture their own HRIRs, even these specialized tools cannot guarantee accurate sound localization for every listener.
Distance is quickly and commonly replicated using any DAW’s reverb effect. Start by either selecting or designing the type of space the sound is located within (cathedral, bathroom, hallway). Then adjust the reverb effect’s pre-delay and dry/wet parameters, as well as the overall level of the resulting sound.
To achieve a sound that appears to be close in a very reverberant space, increase the pre-delay time and decrease the wet level.
To achieve a sound that appears farther away, decrease the pre-delay time, increase the wet level, and decrease the dry level.
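These two rules of thumb can be folded into a single mapping from distance to reverb settings. The function below is a hypothetical sketch; the parameter ranges are illustrative, not calibrated values from the text.

```python
def distance_reverb_params(distance_m, max_distance_m=50.0):
    """Map a source distance to (pre_delay_ms, wet, dry) following the
    guidelines above: near sources get a longer pre-delay and a drier
    mix; far sources get a shorter pre-delay and a wetter mix."""
    d = max(0.0, min(distance_m, max_distance_m)) / max_distance_m  # 0..1
    pre_delay_ms = 80.0 * (1.0 - d)   # close: up to ~80 ms; far: ~0 ms
    wet = 0.2 + 0.7 * d               # distant sounds are mostly reverb
    dry = 1.0 - wet
    return pre_delay_ms, wet, dry
```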
To create a more realistic sense of distance, the reverb (the acoustic interaction between a sound and a space) should be derived from the actual space the listener and the sounds are within.
Part 5: Environmental Modeling
Something that is crucial to delivering an immersive and realistic aural experience is placing both the sounds and the audience in an acoustically responsive environment.
Now that computers, gaming consoles, mobile devices, and other content delivery systems have become powerful enough to handle the computational requirements of real-time acoustic-environment modeling, the consumer demand for more realistic audio experiences has begun to grow. Many new tools have become available that make it easier for content creators to deploy these environmental modeling techniques and meet the demand.
Modeling a three-dimensional acoustic environment boils down to three main components: direct sound, indirect sound, and occlusions.
Direct sound is the sound heard without any transformations or modifications applied to it by the space or surrounding objects. The direct sound carries with it the original sound plus all the information needed to detect its direction of arrival. All other iterations of a sound – the slap-back of a snare drum in a small club, an echo from deep within a canyon, the indecipherable roar that fills an excited stadium – are the indirect sound.
There are two parts to indirect sound: early reflections, and diffusion. Both of these elements are derived from the geometry and material composition of the space and the objects which inhabit it.
The size and shape of the room will determine, first, how long it will take the sound emitted from an internal location to reach each surface, and, second, with what trajectories the sound will reflect off each surface. Then, depending on the construction materials of each surface, the sound will be absorbed and spectrally altered to varying degrees. The harder the material, the more reflective the surface will be. The softer and more porous the material, the more high-frequency absorption occurs.
Early reflections can be replicated in three steps. First, measure the distance of every potential path from the sound source’s location to the listener’s ears that encounters only one reflection point. Second, apply to each reflection both the spectral modification that the reflective surface imposes on it and all the alterations necessary to perceive from what direction the reflection is arriving. Finally, delay each reflection by the time it would take the sound to travel the distance of its path and mix them all with the original (direct) sound.
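The first step is commonly done with the image-source method. The sketch below, a simplifying assumption rather than a prescribed algorithm, handles only a rectangular ("shoebox") room with one corner at the origin: mirroring the source across each of the six walls yields one image per wall, and the distance from each image to the listener is the length of that single-bounce path, which fixes the reflection's arrival delay.

```python
SPEED_OF_SOUND = 343.0   # meters per second, approximate at room temperature

def first_order_reflections(src, listener, room, sample_rate=48_000):
    """Return the delay (in samples) of each first-order reflection in
    a shoebox room of dimensions room = (W, D, H). Each of the six
    walls contributes one image source; the image-to-listener distance
    equals the length of the corresponding one-bounce path."""
    images = []
    for axis in range(3):
        lo = list(src); lo[axis] = -src[axis]                  # wall at 0
        hi = list(src); hi[axis] = 2 * room[axis] - src[axis]  # far wall
        images += [lo, hi]
    delays = []
    for img in images:
        dist = sum((a - b) ** 2 for a, b in zip(img, listener)) ** 0.5
        delays.append(round(dist / SPEED_OF_SOUND * sample_rate))
    return delays
```

Each delay would then be paired with the surface's spectral filtering and a directional cue, per the second and third steps above.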
The human ear is extremely sensitive to early reflections, and they play a considerable part in spatial awareness. Because of this, when a device’s computational resources are in high demand by other processes, most of the available processing can be focused on calculating and rendering early reflections and less on a sound’s spatial diffusion.
Depending on the material construction of the room, the sound will likely continue to reflect around the room until its energy has either dissipated or been completely absorbed. This behavior is referred to as diffusion.
Any space's complex diffusion network could be recreated using an infinite set of delays and filters. However, because diffusion is less critical to spatial awareness, two more generalized processes for fabricating a sound’s spatial diffusion can work just as well: convolution or algorithmic reverb.
Using a process similar to capturing HRTFs, convolution reverb begins with capturing an impulse response (IR) of a space by recording the results of either a short burst of white noise (all frequencies at equal amplitude) or a full-spectrum sine-tone sweep. This recording can then be analyzed for how all the produced frequencies respond in the space over time. The IR can then be used to digitally simulate the resonance of any audio source produced within the captured space by convolving the audio source with the IR. Typically, time- and frequency-domain convolution methods are combined (partitioned convolution) to keep latency low while rendering detailed, realistic responses. This requires far more processing power than algorithmic reverb techniques, but the level of detail makes convolution reverb particularly well suited to simulating outdoor spaces.
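The underlying operation can be shown in its naive time-domain form (real-time engines use partitioned FFT convolution for efficiency, but the math is the same):

```python
def convolve(dry, impulse_response):
    """Direct time-domain convolution: each output sample is the sum of
    the dry signal's past samples weighted by the corresponding IR
    samples, so the IR acts as the room's fingerprint applied to every
    input sample."""
    n, m = len(dry), len(impulse_response)
    out = [0.0] * (n + m - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out
```

Convolving any dry recording with a captured IR places that recording inside the captured space; note the cost grows with the product of the two lengths, which is why partitioned FFT methods are preferred in practice.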
Algorithmic reverb, on the other hand, is more computationally efficient, but at the cost of realism. There are various methods for creating an algorithmic reverb, each with its unique characteristics and set of capabilities. Typically they involve a network of delays for simulating the early reflections, feeding a parallel bank of feedback comb filters whose outputs are mixed together and passed through a series of allpass filters to form the reverb’s diffusion. This popular structure was developed by Manfred Schroeder and Ben Logan in the early 1960s and is at the heart of most algorithmic reverb effects.
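A minimal sketch of that Schroeder-style structure follows; the delay lengths and gain are illustrative values, not a tuned design.

```python
def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x, delay, g):
    """Schroeder allpass: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay].
    Smears the echoes in time without coloring the long-term spectrum."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        y[n] = -g * x[n]
        if n >= delay:
            y[n] += x[n - delay] + g * y[n - delay]
    return y

def schroeder_reverb(x, comb_delays=(1116, 1188, 1277, 1356),
                     allpass_delays=(225, 556), g=0.7):
    """Parallel bank of feedback combs, mixed down, then a short series
    of allpass filters to increase echo density."""
    mixed = [sum(vals) / len(comb_delays)
             for vals in zip(*(comb(x, d, g) for d in comb_delays))]
    for d in allpass_delays:
        mixed = allpass(mixed, d, g)
    return mixed
```

The mutually prime comb delays stagger the echo patterns so no single periodicity dominates, while the allpass stages thicken the tail into something resembling diffusion.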
Occlusion and Obstruction
Acoustic occlusion occurs any time a sound is completely blocked from the listener. Acoustic obstruction is caused by smaller objects that do not completely block the sound from the listener but are substantial enough to cause spectral modifications to the sound. The most significant difference between occlusion and obstruction is that an object that occludes a sound also occludes the reverberance of that sound, whereas an object that obstructs a sound mostly affects the direct sound. This is a simplification of the two; in most real-world situations, any object between the listener and a sound will cause occlusion of some aspects of the sound and obstruction of others.
If, for example, a listener is standing in a large office space with a pillar very close on their right-hand side and a radio playing about 20’ in front of them, the pillar will not obstruct the direct sound or the early reflections/diffusion from their left. It will, however, almost wholly occlude all the reflections/diffusion of the sound on their right side. It will also provide some fascinating reflections of both the direct and the reflected sound.
Even if the listener and all the sound sources remain stationary, acoustic environment modeling involves monitoring and calculating many factors. Rendering the environment in real time becomes hugely complex if the listener and the sounds are moving. However, when done well this can be the difference between a listener losing themselves in the sonic world, or hearing it as a rudimentary attempt at reality.
Along with the renewed interest in virtual reality, acoustic environment modeling is undergoing rapid research and development, which has already yielded a wide variety of new tools and techniques. So rather than going into how to implement an acoustic modeling system, it is recommended that the reader research what new tools are available and read their respective manuals.
Part 6: Sound Design Guidelines for Spatialized Audio
Not all sounds spatialize easily. There are many concerns when designing or composing music and sounds for new sonic territories. Here are some general guidelines to keep in mind in the early stages of spatial audio projects.
It is easier to spatialize a sound that starts as a single channel (mono) audio source. There are two reasons for this:
- If a sound is already a multichannel audio file any difference between the channels must be dealt with while also attempting to position it spatially.
- Most spatialization tools treat a sound as if it is coming from an infinitely small point in space and will fold any multichannel audio file down to mono without necessarily notifying the user.
If an audio source needs more than one channel (e.g., pre-rendered stereo/multichannel delay effects), break each channel out into a separate mono file so that it may be spatialized independently.
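The channel break-out step is a simple de-interleave; the sketch below (function name assumed for illustration) splits an interleaved multichannel buffer into independent mono buffers, each of which can then be handed to the spatializer on its own.

```python
def split_channels(interleaved, num_channels):
    """De-interleave a multichannel sample buffer into one mono buffer
    per channel (samples are assumed to be stored frame by frame:
    L, R, L, R, ... for stereo)."""
    return [list(interleaved[c::num_channels]) for c in range(num_channels)]
```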
Most sound localization cues are derived from listening for small interaural differences in the strength and timing of different frequencies. If the sound has too pure of a timbre, like a sine tone, it can be challenging to locate it or detect movement accurately.
Head-locked vs Spatialized
Most spatial audio mediums are capable of both head-locked and spatialized audio. Head-locked audio is the same as traditional stereo audio. It moves with the listener as they turn their head rather than remaining in some spatial location.
The term “Head-locked” is interchangeable with “2D Audio,” and the term “Spatialized Audio” is interchangeable with “3D Audio” and “360 Audio.”
In these new interactive environments, one of the creative challenges is choosing which audio to spatialize and which to lock to the listener’s head: the listener’s virtual heartbeat and breath, or the narrator’s voice, might be head-locked, while the sounds of lasers that flash past and explode remain in specific spatial locations independent of the listener.
If audio environments are embedded in visual worlds, attaching sounds to objects within the experiencer’s field of view can dramatically increase their ability to accurately localize the sound, even if the sonic cues are far from what they would expect to hear coming from that object.
Keep it simple
Not everything needs to be spatialized and moving all the time. This may overwhelm the listener and cause them to disengage with the experience. Instead, if most things remain in a fixed location and only one or two elements are moving, the listener may be more intrigued.
The proliferation of powerful personal electronic devices is opening up new environments that are more portable, more accessible, and potentially more engaging and inclusive than any other music venue in history. The possibilities of these new spaces are endless and limited only by a creator’s willingness to learn and drive the development of new tools and to use those tools to push forward their creativity and imagination. Each great master begins by emulating what others have already done. This paper looks at how humans can locate a sound as a means of understanding how fundamental audio effects can be used to recreate those same sensations in a virtual audio environment. With this knowledge of copying reality, creators can now look past emulation towards imagining and creating what has never been possible.