WaveFile Gem

Wave File Format

The Wave file format has changed over time and is defined in several documents:

Besides this information being split into multiple documents, the first two documents contain extraneous information not related to Wave files. This article attempts to consolidate the relevant information in a single place.

Getting Started

If you’re new to audio programming, you might want to read up on some of the basics of digital audio first. Check out this blog post for an introduction.

Wave Files Store Audio Data

Wave files are a container format that allows storing many types of audio data. The most common format is integer PCM. This is raw, uncompressed sample data where each sample is an integer. (PCM stands for pulse code modulation). Similarly, PCM data can be defined using a floating point value for each sample, although this is technically considered a different format.

There are many other audio formats officially defined. However, most are rare or obsolete and unlikely to be encountered.

Currently, the WaveFile gem supports these sample formats:

Wave Files are RIFF Files

In 1985, Electronic Arts came up with a general container file format that could be used to store different types of data – audio, graphics, etc. It was called IFF, for Interchange File Format. Microsoft then took this format, switched the byte order from big-endian to little-endian to better suit Intel processors, and dubbed it RIFF (Resource Interchange File Format). The RIFF format was then used for the *.wav file format.

All multi-byte numbers in a RIFF file are stored as little-endian. (Some non-numeric data is stored as a sequence of bytes, in which endianness isn’t relevant per se).
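As a quick illustration of what little-endian means in practice, here is a minimal Python sketch using the standard library's struct module. The value 1000 is just an arbitrary example number.

```python
import struct

# Pack 1000 (0x000003E8) as a 32-bit little-endian unsigned integer.
# "<" selects little-endian, "I" is a 32-bit unsigned integer.
packed = struct.pack("<I", 1000)
print(packed.hex())  # e8030000 (least significant byte first)

# Unpacking with the same format string restores the number.
(value,) = struct.unpack("<I", packed)
print(value)  # 1000
```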

RIFF Files Contain “Chunks”

Like an IFF file, a RIFF file is broken up into “chunks” of data. Each chunk starts with an 8-byte header containing a 4-byte identifier code, and a 4-byte size field.

The identifier code, called a FourCC, is a sequence of 4 bytes. When each byte is interpreted as an 8-bit ASCII character, they typically form a human readable string. For example, 0x52 0x49 0x46 0x46 (i.e. "RIFF"), or 0x64 0x61 0x74 0x61 (i.e. "data"). Since this is a raw sequence of bytes, the characters are case-sensitive.

The size field indicates the size of the chunk’s body in bytes, as a 32-bit unsigned integer. The size should not include the 8-byte header. I.e., if a chunk consists of the 8-byte header followed by 1,000 bytes of data, the size field should indicate 1000, not 1008. Chunks can internally contain nested child chunks, if the spec for that chunk type allows it.
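The header layout described above can be sketched in a few lines of Python. The function name parse_chunk_header is made up for this example and is not part of any library.

```python
import struct

def parse_chunk_header(header: bytes):
    """Split an 8-byte chunk header into (FourCC, body size)."""
    if len(header) != 8:
        raise ValueError("a chunk header is exactly 8 bytes")
    # "<4sI": 4 raw bytes (the FourCC), then a little-endian
    # 32-bit unsigned integer (the body size, header excluded).
    fourcc, body_size = struct.unpack("<4sI", header)
    return fourcc, body_size

# A "data" chunk header announcing a 1,000 byte body.
header = b"data" + struct.pack("<I", 1000)
print(parse_chunk_header(header))  # (b'data', 1000)
```

The FourCC is kept as raw bytes rather than decoded to a string, since it is compared byte-for-byte and is case-sensitive.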

Important! If a chunk body has an odd number of bytes, it must be followed by a padding byte with value 0. In other words, a chunk must always occupy an even number of bytes in the file. The padding byte should not be counted in the chunk header’s size field. For example, if a chunk body is 17 bytes in size, the header’s size field should be set to 17, even though the chunk body occupies 18 bytes (17 bytes of data followed by the padding byte).
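The padding rule boils down to a little arithmetic. A minimal sketch, with hypothetical helper names:

```python
def padded_body_size(body_size: int) -> int:
    """Bytes the chunk body occupies in the file, including any padding byte."""
    return body_size + (body_size % 2)

def chunk_size_on_disk(body_size: int) -> int:
    """Bytes the whole chunk occupies: 8-byte header + body + padding."""
    return 8 + padded_body_size(body_size)

print(padded_body_size(17))      # 18 (the size field still says 17)
print(chunk_size_on_disk(17))    # 26
print(chunk_size_on_disk(1000))  # 1008 (even body, no padding needed)
```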

High Level Wave File Structure

At the top level, a Wave file consists of a single "RIFF" chunk, which contains all of the data for the Wave file. The RIFF chunk body starts with a format code "WAVE" which indicates that the child chunks are for a Wave file (since a RIFF file can also contain other types of data). This is followed by the child chunks.

A Wave file is required to contain at minimum a format chunk ("fmt ") and a data chunk ("data"), and the format chunk must come before the data chunk. If the format code in the format chunk is not 1 (see below), then it must also contain a "fact" chunk. It can also contain other optional chunks.

For example a typical file might look like this:

RIFF Chunk ID ("RIFF"), RIFF Chunk Body Size, Format Code: "WAVE"
  Format Chunk ID ("fmt "), Format Chunk Body Size, Chunk Body
  Data Chunk ID ("data"), Data Chunk Body Size, Chunk Body
Important! Other than the format chunk coming before the data chunk, there isn’t any requirement that the chunks come in any particular order. You shouldn’t assume that the data chunk is the last chunk. (Although in practice, it often is).
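To make the structure concrete, here is a rough Python sketch that builds a tiny Wave file in memory and then walks its child chunks without assuming any particular order. The helper names are invented for this example, and the "xtra" chunk is a made-up chunk type that only exists here to show the padding byte being skipped.

```python
import struct

def make_chunk(fourcc: bytes, body: bytes) -> bytes:
    """Header + body, plus a zero padding byte if the body length is odd."""
    pad = b"\x00" if len(body) % 2 else b""
    return fourcc + struct.pack("<I", len(body)) + body + pad

def iter_chunks(wave_bytes: bytes):
    """Yield (FourCC, body) for each child chunk of the RIFF chunk."""
    riff_id, riff_size, form = struct.unpack("<4sI4s", wave_bytes[:12])
    if riff_id != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12              # first child chunk starts after "WAVE"
    end = 8 + riff_size   # RIFF size excludes the 8-byte RIFF header
    while pos + 8 <= end:
        fourcc, size = struct.unpack("<4sI", wave_bytes[pos:pos + 8])
        yield fourcc, wave_bytes[pos + 8:pos + 8 + size]
        pos += 8 + size + (size % 2)  # skip the padding byte, if any

# Mono, 16-bit, 44,100 Hz integer PCM with three sample frames.
fmt_body = struct.pack("<HHIIHH", 1, 1, 44100, 88200, 2, 16)
data_body = struct.pack("<3h", 0, 1000, -1000)
children = (make_chunk(b"fmt ", fmt_body)
            + make_chunk(b"xtra", b"hello")   # hypothetical chunk, odd size
            + make_chunk(b"data", data_body))
wave = make_chunk(b"RIFF", b"WAVE" + children)

for fourcc, body in iter_chunks(wave):
    print(fourcc, len(body))
```

A real reader would also want to handle truncated files and sizes that run past the end of the RIFF chunk, which this sketch ignores.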

The RIFF Chunk

Like all chunks, the RIFF chunk starts with a FourCC ID code. In this case, it is "RIFF". Next is the size field, which is the size of the entire Wave file except for the 8-byte RIFF chunk header.

The first 4 bytes following the header will identify the type of RIFF chunk. In the case of a Wave file, it will be "WAVE". Immediately following that will be the inner Wave file chunks.

Field Bytes Description
Chunk ID 4 0x52 0x49 0x46 0x46 (i.e. "RIFF")
Chunk Body Size 4 32-bit unsigned integer
RIFF Format Code 4 0x57 0x41 0x56 0x45 (i.e. "WAVE")
Child Chunks Variable Variable

The Format Chunk

The format chunk describes the format that the samples in the data chunk are encoded in. The exact structure of the format chunk depends on the value of the format code field. If the format code is 1 (integer PCM), then the format chunk will only contain the fields above the dashed line in the diagram below. If it’s not 1, the chunk will also contain the fields after the dashed line.

Field Bytes Description
Chunk ID 4 0x66 0x6d 0x74 0x20 (i.e. "fmt ")
Chunk Body Size 4 32-bit unsigned integer
Format Code 2 16-bit unsigned integer
Number of Channels 2 16-bit unsigned integer
Samples per second 4 32-bit unsigned integer
Bytes per Second (a.k.a. byte rate) 4 32-bit unsigned integer
Bytes per Sample Frame (a.k.a. block align) 2 16-bit unsigned integer
Bits per sample 2 16-bit unsigned integer
- - - - - - - - - - - - - - - - - - - -
These fields are only present if format code is not 1:
Extension Size 2 16-bit unsigned integer
Extra fields Variable It depends on the format code

Good to Know: The reason for the different types of extension is that the Wave format is a container for many different kinds of sample formats, and because the Wave format has evolved over time to support new formats. Extra fields that are needed for one sample format might not be needed for another sample format. This also allows new fields to be added without having to change pre-existing Wave files.

While some of these fields have a large range of possible values, in practice there are only a few that will actually be used. For some background on what some of this terminology means, check out this blog post.

Format Code – Indicates how the sample data for the wave file is stored. The most common format is integer PCM, which has a code of 1. Other formats include floating point PCM (3), ADPCM (2), A-law (6), μ-law (7), and WaveFormatExtensible (65534).

Number of channels – Typically a file will have 1 channel (mono) or 2 channels (stereo). A 5.1 surround sound file will have 6 channels.

Sample rate – The number of sample frames that occur each second. A typical value would be 44,100, which is the same as an audio CD.

Bytes per second (a.k.a. byte rate) – The spec calls this byte rate, which means the number of bytes required for one second of audio data. This is equal to the bytes per sample frame times the sample rate. So with a bytes per sample frame of 4, and a sample rate of 44,100, this should equal 176,400.

Bytes per sample frame (a.k.a. block align) – Called block align by the spec, this is the number of bytes required to store a single sample frame, i.e. a single sample for each channel. (Sometimes a sample frame is also referred to as a block). To calculate, first round bits per sample to the next multiple of 8 (if necessary), divide by 8, then multiply by the number of channels. For example:

Bits Per Sample Channels Bytes Per Sample Frame
5 → 8 1 (8 ÷ 8) × 1 = 1
8 1 (8 ÷ 8) × 1 = 1
8 2 (8 ÷ 8) × 2 = 2
12 → 16 1 (16 ÷ 8) × 1 = 2
16 1 (16 ÷ 8) × 1 = 2
16 2 (16 ÷ 8) × 2 = 4
32 6 (32 ÷ 8) × 6 = 24

This field can be used to calculate the bytes per second field. Another possible use is for seeking around in a file. For example, if the bytes per sample frame is 4, then to seek forward 10 sample frames you need to seek forward 40 bytes.

For PCM data, this field is essentially redundant since it can be calculated from the other fields. However, be sure to note the point about rounding bits per sample values up to the next multiple of 8.
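The calculations for bytes per sample frame and bytes per second can be sketched like this in Python. The function names are invented for the example, and the formula is only meant for PCM data as described above.

```python
def bytes_per_sample_frame(bits_per_sample: int, channels: int) -> int:
    """Block align: round bits per sample up to whole bytes, times channels."""
    bytes_per_sample = (bits_per_sample + 7) // 8  # rounds up to next multiple of 8 bits
    return bytes_per_sample * channels

def bytes_per_second(bits_per_sample: int, channels: int, sample_rate: int) -> int:
    """Byte rate: bytes per sample frame times sample frames per second."""
    return bytes_per_sample_frame(bits_per_sample, channels) * sample_rate

print(bytes_per_sample_frame(12, 1))    # 2
print(bytes_per_sample_frame(32, 6))    # 24
print(bytes_per_second(16, 2, 44100))   # 176400

# Seeking forward 10 sample frames in a 16-bit stereo file:
print(10 * bytes_per_sample_frame(16, 2))  # 40 bytes
```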

Bits per sample – For integer PCM data, typical values will be 8, 16, 24, or 32. If the sample format doesn’t require this field, it should be set to 0.

Extension Size – This field should only be present if the format code is not 1. This indicates the size of the extra fields in bytes. It does not include the bytes in this field itself. If the given sample format has no extra fields, then this field should still be present, but set to 0.

Extra Fields – It depends on the format code! The next sections describe the extra fields for a few audio formats.
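Putting the fields together, a format chunk body could be parsed along these lines. This is a sketch under the layout described above, not the WaveFile gem's implementation. FormatChunk and parse_format_chunk are names made up for the example, and the extra fields are returned as raw bytes because their meaning depends on the format code.

```python
import struct
from typing import NamedTuple

class FormatChunk(NamedTuple):
    format_code: int
    channels: int
    sample_rate: int
    byte_rate: int
    block_align: int
    bits_per_sample: int
    extension: bytes  # raw extra fields, empty if there are none

def parse_format_chunk(body: bytes) -> FormatChunk:
    """Parse a format chunk body (the bytes after the 8-byte chunk header)."""
    # The six fixed fields take 16 bytes: 2 + 2 + 4 + 4 + 2 + 2.
    fields = struct.unpack("<HHIIHH", body[:16])
    extension = b""
    if fields[0] != 1 and len(body) >= 18:
        # Extension size counts only the extra fields, not itself.
        (ext_size,) = struct.unpack("<H", body[16:18])
        extension = body[18:18 + ext_size]
    return FormatChunk(*fields, extension)

# Stereo 32-bit floating point at 44,100 Hz, extension size 0.
body = struct.pack("<HHIIHHH", 3, 2, 44100, 352800, 8, 32, 0)
fmt = parse_format_chunk(body)
print(fmt.format_code, fmt.channels, fmt.block_align, fmt.extension)
```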

Extra Format Fields for Floating Point

If the format code is 3, then the sample data is stored as PCM using floating point numbers. There are no extra fields for this format, so the extension size field should be set to 0.

Field Bytes Description
(Other fields in format chunk)
Extension Size 2 16-bit unsigned integer (value 0)

Extra Format Fields for EXTENSIBLE format

If the format code is 65534, then the format is called “WAVE_FORMAT_EXTENSIBLE”. This comes from the name of a data structure given to this format in the Windows API. The extensible format is a container format (…within *.wav, which is itself a container format). It exists to work around some ambiguities in the original Wave file format without having to break compatibility with pre-existing files.

When the format is WAVE_FORMAT_EXTENSIBLE, the extension size in the format chunk should be 22, and the following three fields should be included:

Field Bytes Description
(Other fields in format chunk)
Extension Size 2 16-bit unsigned integer (value 22)
Valid Bits Per Sample 2 16-bit unsigned integer
Channel Mask 4 32-bit unsigned integer
Sub Format GUID 16 16-byte GUID

Valid Bits Per Sample – Allows storing samples with bit-depths that are not a byte multiple in size. For example, to store 12-bit samples, this value can be set to 12, and the bits-per-sample field in the format chunk set to 16. Each sample will still take up 16 bits (2 bytes) on disk, but the reader can be informed that only 12 of those bits carry sample data.

Channel Mask – In a vanilla non-WAVE_FORMAT_EXTENSIBLE file the 1st channel is defined to be mapped to the left speaker, the 2nd channel to the right speaker, but the remaining channel→speaker mappings are undefined. The channel mask field lets you define specific speaker mappings for each channel (e.g. for a surround sound file with 6 channels). There are 18 defined speakers that can be mapped.

The least significant 18 bits of this field are used as a bit field to indicate which channel maps to which speaker (if any). Each defined speaker corresponds to a bit:

Bit Speaker
1 Front Left
2 Front Right
3 Front Center
4 Low Frequency
5 Back Left
6 Back Right
7 Front Left of Center
8 Front Right of Center
9 Back Center
10 Side Left
11 Side Right
12 Top Center
13 Top Front Left
14 Top Front Center
15 Top Front Right
16 Top Back Left
17 Top Back Center
18 Top Back Right

For example, if in a 4-channel file the channels should be mapped (in order) to the front left, front right, back left, and back right speakers, the channel mask field should be set to 00000000 00000000 00000000 00110011, which is equivalent to 0x00000033. (However, remember that this value is stored in the file as little-endian, so the bytes on disk are 0x33 0x00 0x00 0x00, not the big-endian order that the notation here might imply).

The channels will be mapped in the order of the speakers list above. E.g. if in a 2 channel file the channel mask field has the bits for Back Left and Top Center speakers set, the first channel will be mapped to Back Left and the second channel to Top Center (because Back Left comes earlier than Top Center in the list above).

If there are more channels than bits set in this field, the remaining channels will have an undefined speaker mapping. If there are more bits set in the speaker mapping than there are channels, the extra bits should be ignored. To explicitly indicate that no channel is mapped to any specific speaker, set this field to 0.
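The mapping rules above can be sketched in Python as follows. The SPEAKERS list mirrors the table above (bit 1 is the least significant bit), and map_channels is a made-up helper name. Channels with an undefined mapping are represented here as None, which is just a choice made for this example.

```python
SPEAKERS = [
    "Front Left", "Front Right", "Front Center", "Low Frequency",
    "Back Left", "Back Right", "Front Left of Center",
    "Front Right of Center", "Back Center", "Side Left", "Side Right",
    "Top Center", "Top Front Left", "Top Front Center", "Top Front Right",
    "Top Back Left", "Top Back Center", "Top Back Right",
]

def map_channels(channel_mask: int, channel_count: int):
    """Return the speaker for each channel, in channel order."""
    # Collect the speakers whose bit is set, lowest bit first.
    set_speakers = [name for bit, name in enumerate(SPEAKERS)
                    if channel_mask & (1 << bit)]
    # Extra set bits beyond the channel count are ignored.
    mapping = set_speakers[:channel_count]
    # Channels without a corresponding set bit have no defined speaker.
    mapping += [None] * (channel_count - len(mapping))
    return mapping

print(map_channels(0x33, 4))
# ['Front Left', 'Front Right', 'Back Left', 'Back Right']
print(map_channels(0x810, 2))  # bits for Back Left and Top Center
# ['Back Left', 'Top Center']
```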

The spec says that if the most significant bit of this field is set to 1 (e.g. 0xFFFFFFFF) it indicates “an entity supports all possible channel configurations.” To be honest, I don’t understand what that means.

Sub Format GUID – Identifies the format of the sample data in the data chunk. Since the format code will already be set to 65534 to indicate WAVE_FORMAT_EXTENSIBLE, this GUID indicates the sample format instead. Some GUID mappings include:

Format Original Format Code Extensible GUID
PCM Integer 1 0x0100000000001000800000aa00389b71
ADPCM 2 0x0200000000001000800000aa00389b71
PCM Float 3 0x0300000000001000800000aa00389b71
A-law 6 0x0600000000001000800000aa00389b71
μ-law 7 0x0700000000001000800000aa00389b71
MPEG 80 0x5000000000001000800000aa00389b71
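Every GUID in the table above follows the same pattern: the original format code as a little-endian 16-bit value, followed by the same 14 bytes. Assuming that pattern, and only for the formats listed in the table, the 22-byte extension could be read like this in Python. The names GUID_SUFFIX, sub_format_guid, and parse_extensible_fields are invented for the example.

```python
import struct

# The 14 bytes shared by every GUID in the table above.
GUID_SUFFIX = bytes.fromhex("000000001000800000aa00389b71")

def sub_format_guid(format_code: int) -> bytes:
    """Build the 16-byte GUID for one of the format codes in the table."""
    return struct.pack("<H", format_code) + GUID_SUFFIX

def parse_extensible_fields(extension: bytes):
    """Split the 22 bytes that follow the extension size field."""
    # 2 bytes valid bits per sample, 4 bytes channel mask, 16 bytes GUID.
    valid_bits, channel_mask, guid = struct.unpack("<HI16s", extension)
    return valid_bits, channel_mask, guid

print(sub_format_guid(1).hex())  # 0100000000001000800000aa00389b71

extension = struct.pack("<HI", 12, 0x33) + sub_format_guid(1)
valid_bits, channel_mask, guid = parse_extensible_fields(extension)
print(valid_bits, hex(channel_mask), guid == sub_format_guid(1))
```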

The Fact Chunk

Field Bytes Description
Chunk ID 4 0x66 0x61 0x63 0x74 (i.e. "fact")
Chunk Body Size 4 32-bit unsigned integer
Number of sample frames 4 32-bit unsigned integer

The fact chunk indicates how many sample frames are in the file. If the format code is 1 it’s optional; otherwise it’s required.

Note that this field indicates the total number of sample frames, not the total number of samples. For example, if a stereo file contains 1,000 samples for the left channel and 1,000 samples for the right channel, the value of this field should be 1000, not 2000.

The reason this chunk exists is that with some sample formats the total number of sample frames can’t be determined without reading the entire data chunk (e.g. because they store data in a compressed format that has to be decoded). This gives a way of determining e.g. the playing time for the file without having to do that.

It’s not needed for integer PCM data (format code 1), because the total number of sample frames can be derived by dividing the total bytes in the data chunk body by the bytes-per-sample-frame field in the format chunk. For example, if the data chunk body is 352,800 bytes, and bytes-per-sample-frame is 4 (e.g. two 16-bit channels), then the total number of sample frames is 352,800 ÷ 4 = 88,200.
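The same arithmetic as a small Python sketch. The function name is made up, and the 44,100 sample rate used for the playing time is an assumption added for the example.

```python
def sample_frame_count(data_body_size: int, bytes_per_sample_frame: int) -> int:
    """Total sample frames in an integer PCM data chunk."""
    return data_body_size // bytes_per_sample_frame

frames = sample_frame_count(352_800, 4)
print(frames)  # 88200

# Playing time, assuming a sample rate of 44,100 sample frames per second.
seconds = frames / 44_100
print(seconds)  # 2.0
```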

The Data Chunk

Field Bytes Description
Chunk ID 4 0x64 0x61 0x74 0x61 (i.e. "data")
Chunk Body Size 4 32-bit unsigned integer
Sample Data Various It depends on the format code

The layout for the data chunk is simpler than the format chunk: the normal 8-byte chunk header, followed by nothing but raw sample data. The sample data can be stored in a number of formats, which will be indicated by the format code field in the format chunk.

The following sections describe several formats that sample data in the data chunk can be stored as.

Integer PCM Data Chunk

Format code: 1

This is the most common format, and consists of raw PCM samples as integers. The bits per sample field will indicate the range of the samples:

Bits per sample Min Value Mid Value Max Value
8 0 128 255
16 -32,768 0 32,767
24 -8,388,608 0 8,388,607
32 -2,147,483,648 0 2,147,483,647

Important! Notice that 8-bit samples are unsigned, while larger bit depths are signed.

Samples in a multi-channel PCM wave file are interleaved. That is, in a stereo file, one sample for the left channel will be followed by one sample for the right channel, followed by another sample for the left channel, then right channel, and so forth.

One set of interleaved samples is called a sample frame (also called a block). A sample frame will contain one sample for each channel. In a monophonic file, a sample frame will consist of 1 sample. In a stereo file, a sample frame has 2 samples (one for the left channel, one for the right channel). In a 5-channel file, a sample frame has 5 samples. The bytes per sample frame field in the format chunk gives the size in bytes of each sample frame. This can be useful when seeking to a particular sample frame in the file.

For example, for a 2 channel file with 16-bit PCM samples, the sample data would look like this:

Sample Frame 1: Left Channel (LSB, MSB), Right Channel (LSB, MSB)
Sample Frame 2: Left Channel (LSB, MSB), Right Channel (LSB, MSB)
...etc

LSB means “least significant byte”, and MSB means “most significant byte.”
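For 16-bit samples, the layout above could be decoded with something like the following Python sketch. decode_pcm16 is a made-up name, and the function only handles 16-bit integer PCM. Other bit depths need different handling (8-bit samples are unsigned, for example).

```python
import struct

def decode_pcm16(data: bytes, channels: int):
    """Turn 16-bit integer PCM sample data into a list of sample frames."""
    frame_size = 2 * channels              # 2 bytes per sample
    frame_count = len(data) // frame_size  # ignore any trailing partial frame
    # "<h" is a little-endian signed 16-bit integer (LSB first, then MSB).
    samples = struct.unpack(f"<{frame_count * channels}h",
                            data[:frame_count * frame_size])
    # Group the interleaved samples: one tuple per sample frame.
    return [samples[i:i + channels] for i in range(0, len(samples), channels)]

# Two stereo sample frames: (left, right), (left, right).
data = struct.pack("<4h", 1000, -1000, 32767, -32768)
print(decode_pcm16(data, 2))  # [(1000, -1000), (32767, -32768)]
```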

Floating Point PCM Data Chunk

Format code: 3

Alternatively, PCM samples can be stored as floating point values. This is essentially the same as integer PCM format (i.e. format code 1), except that samples are in the range -1.0 to 1.0. The bits per sample field should be set to 32 or 64 to indicate the precision of the values. Sample frames should be laid out in the same way as described in the “Integer PCM Data Chunk” section above.
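A floating point data chunk can be read the same way as integer PCM, just with a different struct format character. A minimal Python sketch for the 32-bit case, with decode_float32 as a made-up name:

```python
import struct

def decode_float32(data: bytes, channels: int):
    """Turn 32-bit floating point PCM sample data into sample frames."""
    frame_size = 4 * channels              # 4 bytes per sample
    frame_count = len(data) // frame_size  # ignore any trailing partial frame
    # "<f" is a little-endian 32-bit float.
    samples = struct.unpack(f"<{frame_count * channels}f",
                            data[:frame_count * frame_size])
    return [samples[i:i + channels] for i in range(0, len(samples), channels)]

# Two stereo sample frames, using values a 32-bit float stores exactly.
data = struct.pack("<4f", 0.5, -0.25, 1.0, -1.0)
print(decode_float32(data, 2))  # [(0.5, -0.25), (1.0, -1.0)]
```

For 64-bit samples, the format character would be "d" and each sample would take 8 bytes.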

EXTENSIBLE Data Chunk

Format code: 65534

Since WAVE_FORMAT_EXTENSIBLE is a container format, the format code of 65534 doesn’t imply any particular sample format. Instead, the sample format is indicated by the sub format GUID in the format chunk extension. For example, if the sub format GUID is the GUID for integer PCM, then the samples are in the same format as if the format code was 1.

References

Documents from Microsoft defining the initial file format, and changes over time:

Other links:


Copyright © Joel Strait 2009-20