As 3D Audio is recognized as a key factor in providing immersion and presence in VR content, it is important to understand the various sound formats (Multi-channel, Object, Ambisonics) and their implications for audio content creation and rendering. We therefore propose a quick overview of the three paradigms, as well as an analysis of why Ambisonics makes sense for VR Audio.
In the channel-based representation, the unit of information is the loudspeaker. Each channel is associated with a loudspeaker, and sound reproduction is achieved by mixing the various channels across several speakers. The more channels, the greater the spatial sound capabilities. The channel-based representation has been the traditional sound representation for the past 50 years or more. The Stereo, 5.1 and 7.1 formats are channel-based horizontal representations. 3D is obtained by adding elevated speakers, as in the 11.1 format, where 4 ceiling speakers are added to a 7.1 horizontal speaker layout. One of the main drawbacks of the multi-channel representation is that it is loudspeaker-set-up dependent: one needs a separate mix for each type of set-up, whereas Object-based and Ambisonics content is independent of the loudspeaker set-up.
In the Object-based representation, the unit of information is the sound source. A scene is made of several sound sources together with information about their locations, their directivity patterns and the rendering environment (room size, reverberation parameters…). The 3D audio rendering is performed by computing the combination of all the sources, including the reverberation, at the listener's position. This is a great paradigm for interactively creating content, but it also consumes a lot of CPU resources: the more complex (number of sound sources) and realistic (precision of the reverberation) the scene, the more CPU is needed.
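The idea of combining all sources at the listener position can be sketched as a minimal free-field renderer. This is an illustrative simplification (one sample per source, 1/distance attenuation, no reverberation or directivity); the function and parameter names are our own:

```python
import math

def render_objects(sources, listener):
    """Sum the contribution of each sound object at the listener position.

    sources: list of (sample_value, (x, y, z)) pairs
    listener: (x, y, z) position of the listener
    """
    out = 0.0
    for sample, pos in sources:
        d = math.dist(pos, listener)          # distance source -> listener
        out += sample / max(d, 1.0)           # 1/distance gain, clamped near the listener
    return out

# Two objects rendered at the origin: 1.0/2 + 0.5/4 = 0.625
mix = render_objects([(1.0, (2.0, 0.0, 0.0)), (0.5, (0.0, 4.0, 0.0))],
                     (0.0, 0.0, 0.0))
```

Note how the cost grows linearly with the number of objects, and a realistic renderer would add per-object filtering and reverberation on top, which is where the CPU load comes from.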
Unlike the two other representations, the Ambisonics format does not rely on the description of individual sound sources (speakers or objects) but instead represents the resulting sound field at the listener's position. The mathematical formalism used to describe the sound field is called spherical harmonics, and the unit of information is the number of components (or the Order) of this spherical representation. The more components, or the higher the order, the more precise the spatial representation of the scene. This paradigm is not new: it has been used by a small community of sound professionals for several decades under the name B-Format, which is in fact the first-order case of the Higher Order Ambisonics (HOA) representation.
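As a concrete illustration, a mono source can be encoded into the four first-order components by weighting it with the first spherical-harmonic terms. This sketch assumes the SN3D normalization and ACN channel ordering (one common modern convention; classic B-Format uses a different ordering and gains):

```python
import math

def encode_first_order(sample, azimuth, elevation):
    """Encode a mono sample at (azimuth, elevation) in radians
    into the four first-order Ambisonics channels (ACN order, SN3D)."""
    w = sample                                           # order 0: omnidirectional
    y = sample * math.sin(azimuth) * math.cos(elevation) # order 1, left/right
    z = sample * math.sin(elevation)                     # order 1, up/down
    x = sample * math.cos(azimuth) * math.cos(elevation) # order 1, front/back
    return [w, y, z, x]

# A source straight ahead (azimuth 0, elevation 0) -> [1, 0, 0, 1]
channels = encode_first_order(1.0, 0.0, 0.0)
```

Higher orders simply append more spherical-harmonic channels to this list, refining the directional resolution.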
Ambisonics is a very attractive solution for VR
Several key players in the VR industry (Google, Sony…) are now embracing the concept of Higher Order Ambisonics and are developing commercial applications based on this paradigm, like the emblematic YouTube 360 platform, which uses HOA as its default audio format. There are several reasons behind this choice. The most important ones are the following:
1. It provides the best compromise between 3D audio realism and computing resources. With the 4 channels of a B-format (1st-order representation), you can realistically represent a 3D sound scene, whereas it is very difficult to do that with only four objects or 4 speakers!
2. It is built with a hierarchical structure organized in layers of Orders, which makes it uniquely scalable. One can adapt the level of spatial precision to the resources of the platform (CPU load, bandwidth…). This is very convenient when you want content to be available on both high-end PCs and basic Android smartphones, or when you have variable bandwidth to transport the content! With the Object-based representation, on the contrary, if you do not have enough resources to process the full content, the only option is to drop some of the objects, which compromises the integrity of the sound scene (missing information!).
3. Ambisonics is the best format to represent recorded 3D audio content, as a real audio world is best represented by a sound field rather than a collection of sound objects or speaker positions.
4. It is “headtracking friendly”. In the spherical harmonics domain, head movements are modeled as rotations of the sound field, which are very simple operations.
5. It is loudspeaker-set-up independent: one piece of content can be decoded to any loudspeaker layout.
6. Unlike the Object-based representation, Ambisonics preserves the content integrity. When the content is made of sound objects, positions and acoustic parameters, the final user experience depends on the algorithms used to reconstruct the sound field from all this information. In Ambisonics, the end-user experience is “baked” into the content.
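The “headtracking friendly” point above can be made concrete: in first-order Ambisonics, compensating a yaw head movement only mixes the X and Y channels through a plain 2D rotation, leaving W and Z untouched. This sketch assumes the SN3D/ACN convention with yaw measured counter-clockwise:

```python
import math

def rotate_yaw(w, y, z, x, yaw):
    """Rotate a first-order Ambisonics frame by `yaw` radians about the
    vertical axis. W (omni) and Z (height) are invariant under yaw."""
    x_r = x * math.cos(yaw) - y * math.sin(yaw)
    y_r = y * math.cos(yaw) + x * math.sin(yaw)
    return w, y_r, z, x_r

# A frontal source rotated by 90 degrees ends up on the side:
# X energy moves entirely into Y, W and Z are unchanged.
frame = rotate_yaw(1.0, 0.0, 0.0, 1.0, math.pi / 2)
```

Pitch and roll rotations are only slightly more involved (a full 3D rotation matrix on the three first-order channels), which is why head tracking is so cheap in this domain compared with re-rendering every object.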
At 3D Sound Labs, we believe that multi-channel is gradually becoming a legacy format not well suited to the needs of VR audio, and that Object-based and Ambisonics are the formats VR needs. We have developed the following vision of the future:
The content creation stage will mainly use sound objects as a convenient way to interactively create sound scenes, and will marginally use Ambisonics to “import” real-life recordings from sound field microphones into the scene.
“Recorded” content rendering, like VR 360, will increasingly use the Ambisonics format, as its intrinsically scalable nature makes it perfectly suited to a large range of platforms. YouTube choosing Ambisonics is a clear illustration of this trend.
For interactive content rendering, like VR gaming, the Object-based paradigm makes a lot of sense. However, rendering complex sound scenes made of many sound objects is very computing intensive and requires resources not necessarily available on mass-market platforms. This can be solved by converting all or part of the Object-based representation into Ambisonics and leveraging the scalability of the rendering to adapt to the available CPU resources.
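The scalability that makes this conversion attractive comes from the layered channel structure: an order-N scene uses (N+1)² channels, and adapting to a weaker platform just means keeping the lower-order layers. A minimal sketch (the helper names are our own, and ACN channel ordering is assumed):

```python
def hoa_channels(order):
    """Number of Ambisonics channels for a given order: (N+1)^2."""
    return (order + 1) ** 2

def truncate_order(hoa_frame, target_order):
    """Degrade an HOA frame gracefully by keeping only the leading
    lower-order channels (ACN ordering assumed). Unlike dropping
    sound objects, no source disappears; only spatial precision drops."""
    return hoa_frame[:hoa_channels(target_order)]

# An order-3 frame (16 channels) reduced to first order (4 channels)
order3_frame = list(range(hoa_channels(3)))
first_order = truncate_order(order3_frame, 1)
```

This is exactly the trade-off described above: the same content plays everywhere, with spatial precision scaling to the CPU and bandwidth at hand.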