WaveFile Gem - Wave File Format

Wave File Format

The Wave file format has changed over time and is defined in several documents.

This article attempts to consolidate the most important information in one place. This hopefully removes the need to cross-reference multiple documents, weed out extraneous information about other file formats, etc.

Getting Started

If you’re new to audio programming, you might want to read up on some of the basics of digital audio first. Check out this blog post for an introduction.

Wave Files Store Audio Data

Wave files are a container format that allows storing many types of audio data. The most common format is integer PCM (pulse code modulation): raw, uncompressed sample data where each sample is an integer. PCM data can also be stored using a floating point value for each sample, although this is technically considered a different format.

There are many other audio formats officially defined. However, most are rare or obsolete.

Currently, the WaveFile gem supports these sample formats:

Integer PCM at 8, 16, 24, or 32 bits per sample (format tag 1)
Floating point PCM at 32 or 64 bits per sample (format tag 3)
The formats above when using WAVE_FORMAT_EXTENSIBLE (format tag 65534)

Wave Files are RIFF Files

Back in the late 80s Electronic Arts came up with a general container file format that could be used to store different types of data – audio, graphics, etc. It was called IFF, for Interchange File Format. Microsoft then took this format, switched the byte order from big-endian to little-endian to better suit Intel processors, and dubbed it RIFF (Resource Interchange File Format). The RIFF format was then used for the *.wav file format.

All multi-byte numbers in a RIFF file are stored as little-endian. (Some non-numeric data is stored as a sequence of bytes, in which endianness isn’t relevant per se).

RIFF Files Contain “Chunks”

Like an IFF file, a RIFF file is broken up into “chunks” of data. Each chunk starts with an 8-byte header containing a 4-byte identifier code, and a 4-byte size field. This is followed by the chunk body.

The identifier code, called a FourCC, is a sequence of 4 bytes. When each byte is interpreted as an 8-bit ASCII character, the sequence typically forms a human readable string. For example, 0x52 0x49 0x46 0x46 (i.e. "RIFF"), or 0x64 0x61 0x74 0x61 (i.e. "data"). Since this is a raw sequence of bytes, the characters are case-sensitive.

The size field indicates the size of the chunk’s body in bytes, as a 32-bit unsigned integer. The size should not include the 8-byte header. I.e., if a chunk consists of the 8-byte header followed by 1,000 bytes of data, the size field should indicate 1000, not 1008.

Important! If a chunk body has an odd number of bytes, it must be followed by a padding byte with value 0. In other words, a chunk must always occupy an even number of bytes in the file. The padding byte should not be counted in the chunk header’s size field. For example, if a chunk body is 17 bytes in size, the header’s size field should be set to 17, even though the chunk body occupies 18 bytes (17 bytes of data followed by the padding byte).
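
As a rough sketch of the header, size, and padding rules above, the following Python reads a single chunk out of a byte string. The function name `read_chunk` and its return shape are illustrative choices made for this article, not part of any spec or of the WaveFile gem.

```python
import struct

def read_chunk(data, offset=0):
    """Read one chunk starting at `offset`.

    Returns (chunk_id, body, next_offset). Hypothetical helper.
    """
    # 8-byte header: 4-byte FourCC, then a little-endian unsigned 32-bit size.
    chunk_id, body_size = struct.unpack_from("<4sI", data, offset)
    body_start = offset + 8
    body = data[body_start:body_start + body_size]
    # An odd-sized body is followed by one padding byte that the size field
    # does not count, so the chunk always occupies an even number of bytes.
    next_offset = body_start + body_size + (body_size % 2)
    return chunk_id, body, next_offset

# A 17-byte body: the size field says 17, but the chunk occupies 8 + 18 bytes.
raw = b"note" + struct.pack("<I", 17) + b"x" * 17 + b"\x00"
chunk_id, body, next_offset = read_chunk(raw)
```

Here `"note"` is just a made-up FourCC for the demonstration; the FourCC is compared as raw bytes, which is why the match is case-sensitive.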

High Level Wave File Structure

At the top level, a Wave file consists of a single RIFF chunk ("RIFF"), which contains all of the data for the wave file. The RIFF chunk body starts with a field indicating the RIFF form type. This field indicates what type of data the RIFF file contains. In a Wave file, this field should contain the value "WAVE". This is followed by child chunks, nested inside the parent RIFF chunk.

At minimum, the child chunks must include a format chunk ("fmt ") and a data chunk ("data"), and the format chunk must come before the data chunk. If the format tag in the format chunk is not 1 (see below), then there must also be a fact chunk ("fact").

A file can optionally contain other chunks. A "smpl" or "inst" chunk allows storing data that helps the file to be used with a synthesizer, such as the musical note the audio should be assigned to. A "cue " chunk (note the space at the end) allows storing shortcuts (i.e. cue points) to different parts of the sample data. A "LIST" chunk allows giving labels to the cue points, or storing metadata such as author or copyright info. There are other types of chunk as well.

If a chunk has an unrecognized type it should be skipped and ignored.

A file with the minimum number of chunks looks like this:

RIFF Chunk ID: "RIFF"
RIFF Chunk Body Size
RIFF Form Type: "WAVE"

    Format Chunk ID: "fmt "
    Format Chunk Body Size
    Chunk Body

    Data Chunk ID: "data"
    Data Chunk Body Size
    Chunk Body

Important! Other than the format chunk coming before the data chunk, there isn’t any requirement that the child chunks come in any particular order. You shouldn’t assume that the data chunk is the last chunk. (Although in practice, it often is).
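
To illustrate walking the child chunks without assuming their order, here is a minimal Python sketch. The helpers `list_child_chunks` and `make_chunk`, and the `"xyz "` chunk type, are invented for this example, and the code does only minimal validation.

```python
import struct

def list_child_chunks(wave_bytes):
    """Return [(chunk_id, body_offset, body_size), ...] for a RIFF/WAVE file."""
    riff_id, riff_size, form_type = struct.unpack_from("<4sI4s", wave_bytes, 0)
    if riff_id != b"RIFF" or form_type != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    chunks = []
    offset = 12                            # first child follows the form type
    end = min(8 + riff_size, len(wave_bytes))
    while offset + 8 <= end:
        chunk_id, size = struct.unpack_from("<4sI", wave_bytes, offset)
        chunks.append((chunk_id, offset + 8, size))
        offset += 8 + size + (size % 2)    # skip the body and any padding byte
    return chunks

def make_chunk(chunk_id, body):
    """Serialize one chunk, adding the padding byte for odd-sized bodies."""
    pad = b"\x00" if len(body) % 2 else b""
    return chunk_id + struct.pack("<I", len(body)) + body + pad

# Minimal mono 8-bit file with an extra, unrecognized chunk in the middle.
fmt_body = struct.pack("<HHIIHH", 1, 1, 8000, 8000, 1, 8)
children = (make_chunk(b"fmt ", fmt_body)
            + make_chunk(b"xyz ", b"abc")              # unknown type
            + make_chunk(b"data", bytes([128, 130, 126])))
wave_bytes = b"RIFF" + struct.pack("<I", 4 + len(children)) + b"WAVE" + children

found = list_child_chunks(wave_bytes)
ids = [chunk_id for chunk_id, _, _ in found]
```

A reader built this way can look up the `"fmt "` and `"data"` entries by ID and simply ignore anything it does not recognize.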

The RIFF Chunk

Like all chunks, the RIFF chunk starts with a FourCC ID code. In this case, it is "RIFF". Next is the size field, which is the size of the entire Wave file except for the 8-byte RIFF chunk header.

The first 4 bytes following the header will identify the type of RIFF chunk. In the case of a Wave file, it will be "WAVE". Immediately following that will be the child chunks.

Field	Bytes	Description
Chunk ID	4	`0x52` `0x49` `0x46` `0x46` (i.e. `"RIFF"`)
Chunk Body Size	4	32-bit unsigned integer
RIFF Form Type	4	`0x57` `0x41` `0x56` `0x45` (i.e. `"WAVE"`)
Child Chunks	Variable	Variable

The Format Chunk

The format chunk describes the format that the samples in the data chunk are encoded in. The exact structure of the format chunk depends on the value of the format tag field. If the format tag is 1 (integer PCM), then the format chunk will only contain the fields up to and including bits per sample in the table below. If it’s not 1, the chunk will also contain the extension size field and any extra fields that follow it.

Field	Bytes	Description
Chunk ID	4	`0x66` `0x6d` `0x74` `0x20` (i.e. `"fmt "`)
Chunk Body Size	4	32-bit unsigned integer
Format Tag	2	16-bit unsigned integer
Number of Channels	2	16-bit unsigned integer
Samples per second	4	32-bit unsigned integer
Average Bytes per Second	4	32-bit unsigned integer
Block align	2	16-bit unsigned integer
Bits per sample	2	16-bit unsigned integer
These fields are only present if format tag is not `1`:
Extension Size	2	16-bit unsigned integer
Extra fields	Variable	It depends on the format tag
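
The table above could be decoded along these lines. This is only a sketch: `parse_format_chunk` and the dictionary keys are names made up for this article, and the function expects the chunk body only (without the 8-byte header).

```python
import struct

def parse_format_chunk(body):
    """Decode a format chunk body into a dict (hypothetical helper)."""
    names = ("format_tag", "channels", "sample_rate",
             "avg_bytes_per_sec", "block_align", "bits_per_sample")
    # Six little-endian fields: 2 + 2 + 4 + 4 + 2 + 2 = 16 bytes.
    fields = dict(zip(names, struct.unpack_from("<HHIIHH", body, 0)))
    # The extension size is only expected when the format tag is not 1.
    if fields["format_tag"] != 1 and len(body) >= 18:
        (ext_size,) = struct.unpack_from("<H", body, 16)
        fields["extension_size"] = ext_size
        fields["extra_fields"] = body[18:18 + ext_size]
    return fields

# 16-bit stereo integer PCM at 44,100 Hz: 16 bytes, no extension.
pcm = parse_format_chunk(struct.pack("<HHIIHH", 1, 2, 44100, 176400, 4, 16))
# 32-bit mono float PCM: the extension size is present but set to 0.
flt = parse_format_chunk(struct.pack("<HHIIHHH", 3, 1, 44100, 176400, 4, 32, 0))
```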

Good to Know: The reason for the different types of extension is that the Wave format is a container for many different kinds of sample formats, and because the Wave format has evolved over time to support new formats. Extra fields that are needed for one sample format might not be needed for another sample format. This also allows new fields to be added without having to change pre-existing Wave files.

While some of these fields have a large range of possible values, in practice there are only a few that will actually be used. For some background on what some of this terminology means, check out this blog post.

Format Tag – Indicates how the sample data for the wave file is stored. The most common format tag is 1, for integer PCM. Other formats include ADPCM (2), floating point PCM (3), A-law (6), and μ-law (7).

A special format called WAVE_FORMAT_EXTENSIBLE (65534) allows adding two extra fields to any of the other formats (see below).

Number of channels – Typically a file will have 1 channel (mono) or 2 channels (stereo). A 5.1 surround sound file will have 6 channels.

Samples per second – The sample rate, i.e. the number of sample frames that occur each second. A typical value would be 44,100, which is the same as an audio CD.

Average bytes per second – The number of bytes required for one second of audio data. For integer PCM or float PCM data, this is equal to the block align times the sample rate. So with a block align of 4, and a sample rate of 44,100, this should equal 176,400. For other formats the calculation can be different.

Block align – Indicates the size in bytes of a unit of sample data. The meaning of “unit of sample data” depends on the format.

When the format is integer PCM or float PCM this is the number of bytes required to store a single sample frame. A sample frame contains a single sample for each channel. Each sample, and the sample frame as a whole, must be a whole number of bytes (i.e. it must be a multiple of 8 bits). For example, a sample frame size of 16 bits is valid, but 12 bits is not.

To calculate this value, first round bits per sample up to the next multiple of 8 (if necessary), divide by 8, then multiply by the number of channels. For example:

Bits Per Sample	Channels	Block Align
5 → 8	1	(8 ÷ 8) × 1 = 1
8	1	(8 ÷ 8) × 1 = 1
8	2	(8 ÷ 8) × 2 = 2
12 → 16	1	(16 ÷ 8) × 1 = 2
16	1	(16 ÷ 8) × 1 = 2
16	2	(16 ÷ 8) × 2 = 4
32	6	(32 ÷ 8) × 6 = 24

This field can be used to calculate the average bytes per second field. Another possible use is for seeking around in a file. For example, if the block align (i.e. bytes per sample frame) is 4, then to seek forward 10 sample frames you need to seek forward 40 bytes.
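
The block align arithmetic for integer and float PCM, and the two uses just mentioned, can be sketched in a few lines of Python. The function names are illustrative only.

```python
def block_align(bits_per_sample, channels):
    """Bytes per sample frame for integer or float PCM."""
    bytes_per_sample = (bits_per_sample + 7) // 8   # round up to whole bytes
    return bytes_per_sample * channels

def average_bytes_per_second(bits_per_sample, channels, sample_rate):
    """Block align times the sample rate (integer or float PCM only)."""
    return block_align(bits_per_sample, channels) * sample_rate

def byte_offset_of_frame(frame_index, bits_per_sample, channels):
    """Offset of a sample frame, relative to the start of the sample data."""
    return frame_index * block_align(bits_per_sample, channels)

stereo_16 = block_align(16, 2)                         # 4 bytes per frame
cd_rate = average_bytes_per_second(16, 2, 44100)       # 176,400 bytes
ten_frames_ahead = byte_offset_of_frame(10, 16, 2)     # 40 bytes
```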

For other formats the way to determine the value of this field can be different.

Bits per sample – This field has one of two meanings. For all format tags except WAVE_FORMAT_EXTENSIBLE (65534), it means the size in bits of a single sample. (Not a sample frame, but a single sample). Each format can define a valid set of possible values.

When the format tag is 1 (i.e. integer PCM), this value can be 8 or 16. Integer PCM data can also be stored with a different number of bits per sample, but in that case the WAVE_FORMAT_EXTENSIBLE format tag (65534) should be used instead.

When the format tag is 3 (i.e. float PCM), this value is normally 32 or 64.

Other formats have their own rules for the valid set of values. If a format doesn’t use this field, it should be set to 0.

The WAVE_FORMAT_EXTENSIBLE format tag (65534) is a special case. For this format, this field instead indicates the size of the “container” each sample is in. A different “valid bits per sample” field is used to record the bits per sample value instead. For example, if 24-bit samples are each stored in a 32-bit container, this field should be set to 32, not 24. The container size must always be a multiple of 8 bits. If bits per sample is not relevant for the format, this field should be set to 0. See below for more info.

Extension Size – This field should only be present if the format tag is not 1. This indicates the size of the extra fields in bytes. It does not include the bytes in this field itself. If the given sample format has no extra fields, then this field should still be present, but set to 0.

Extra Fields – It depends on the format tag! The next sections describe the extra fields for a few audio formats.

Extra Format Fields for Floating Point

If the format tag is 3, then the sample data is stored as PCM using floating point numbers. There are no extra fields for this format, so the extension size field should be set to 0.

Field	Bytes	Description
↑ Other fields in format chunk
Extension Size	2	16-bit unsigned integer (value `0`)

Extra Format Fields for EXTENSIBLE format

The format tag 65534 is called WAVE_FORMAT_EXTENSIBLE. This is a special format tag that can be used in place of any of the other defined format tags to add two additional fields to the format chunk. These two fields help clarify info that can be ambiguous when using the original chunk structure. However, the first field has one of three different meanings depending on the format. Since the original format tag is replaced by 65534, a third additional field indicates the actual format tag instead.

These three new fields occupy a total of 22 bytes. If the non-WAVE_FORMAT_EXTENSIBLE version of an official format contains extra fields, then these extra fields should also be included in the WAVE_FORMAT_EXTENSIBLE version, after the Sub Format GUID field. This means that while the minimum size of the extra fields is 22 bytes, it could also be larger.

Note that when using this format tag in place of format tag 1 (i.e. integer PCM data) the file should include a fact chunk, even though it’s not normally needed for that format and is redundant for the purpose of determining the number of sample frames.

When the format is WAVE_FORMAT_EXTENSIBLE, the format chunk should look like this:

Field	Bytes	Description
Chunk ID	4	`0x66` `0x6d` `0x74` `0x20` (i.e. `"fmt "`)
Chunk Body Size	4	32-bit unsigned integer
Format Tag	2	16-bit unsigned integer (value `65534`)
Number of Channels	2	16-bit unsigned integer
Samples per second	4	32-bit unsigned integer
Average Bytes per Second	4	32-bit unsigned integer
Block align	2	16-bit unsigned integer
Bits per sample	2	16-bit unsigned integer
Extension Size	2	16-bit unsigned integer (min value `22`)
Valid Bits Per Sample …or Samples Per Block …or Reserved	2	16-bit unsigned integer
Channel Mask	4	32-bit unsigned integer
Sub Format GUID	16	16-byte GUID
Other extra fields (if any) ↓
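
As a sketch of how the three extension fields in the table might be read, the following Python decodes the bytes that follow the extension size field. The name `parse_extensible_extension` and the dictionary keys are assumptions made for this example. It relies on the standard library's `uuid.UUID(bytes_le=...)`, which reads the first three GUID groups as little-endian.

```python
import struct
import uuid

def parse_extensible_extension(extra_fields):
    """Decode the bytes that follow the extension size field."""
    if len(extra_fields) < 22:
        raise ValueError("WAVE_FORMAT_EXTENSIBLE needs at least 22 extra bytes")
    # 2-byte first field, 4-byte channel mask, 16-byte GUID = 22 bytes.
    first_field, channel_mask, guid_bytes = struct.unpack_from(
        "<HI16s", extra_fields, 0)
    return {
        "valid_bits_or_samples_per_block": first_field,
        "channel_mask": channel_mask,
        "sub_format": uuid.UUID(bytes_le=guid_bytes),
        "other_extra_fields": extra_fields[22:],
    }

# 20 valid bits, front left + front right speakers, integer PCM sub format.
pcm_guid = uuid.UUID("00000001-0000-0010-8000-00aa00389b71")
extra = struct.pack("<HI", 20, 0b11) + pcm_guid.bytes_le
info = parse_extensible_extension(extra)
```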

Valid Bits Per Sample – For integer PCM and float PCM data (and possibly other formats) the first field contains the number of valid bits per sample.

The original Wave file spec is not clear about whether the bits per sample field represents the number of bits in each sample, or the size of the container each sample is in. For example, if each sample is 20 bits inside a 24 bit container, should this field be set to 20, or 24? The spec says this field “specifies the number of bits of data used to represent each sample of each channel”, which has apparently been interpreted either way.

In a WAVE_FORMAT_EXTENSIBLE file, this ambiguity is resolved. In this type of file, the bits per sample field should always be set to the container size of one sample, and the valid bits per sample field in the extension should be set to the number of bits that are actually relevant. For example, if a file contains 20 bit samples in 24 bit containers, then the original bits per sample field should be set to 24, and the valid bits per sample field in the extension should be set to 20.

Samples Per Block – For compressed formats the first field instead contains the number of samples per block. If the count is variable per block, then this field should be set to 0 to indicate that. The spec implies that this field should be used if bits per sample is 0.

Reserved – If neither valid bits per sample nor samples per block makes sense for the format then this field should be considered reserved for future use, and the value should be set to 0.

Channel Mask – In a vanilla non-WAVE_FORMAT_EXTENSIBLE file the 1st channel is defined to be mapped to the left speaker, the 2nd channel to the right speaker, but the remaining channel → speaker mappings are undefined. The channel mask field lets you define specific speaker mappings for each channel (e.g. for a surround sound file with 6 channels). There are 18 defined speakers that can be mapped.

The least significant 18 bits of this field are used as a bit field to indicate which channel maps to which speaker (if any). Each defined speaker corresponds to a bit:

Bit	Speaker
1	Front Left
2	Front Right
3	Front Center
4	Low Frequency
5	Back Left
6	Back Right
7	Front Left of Center
8	Front Right of Center
9	Back Center
10	Side Left
11	Side Right
12	Top Center
13	Top Front Left
14	Top Front Center
15	Top Front Right
16	Top Back Left
17	Top Back Center
18	Top Back Right

For example, if a file has 4 channels, and the channels should be mapped (in order) to the front left, front right, side left, and side right speakers, the 4 bytes of the channel mask field should be set to 00000011 00000110 00000000 00000000. (Since this is a little-endian value, the bytes are ordered from least → most significant).

The channels will be mapped in the order of the speakers list above. E.g. if a file has 2 channels and the channel mask field only has the bits for Back Left and Top Center speakers set, the first channel will be mapped to Back Left and the second channel to Top Center (because Back Left comes earlier than Top Center in the list above).

If there are more channels than bits set in this field, the remaining channels will have an undefined speaker mapping. If there are more bits set in the speaker mapping than there are channels, the extra bits should be ignored. To explicitly indicate that no channel is mapped to any specific speaker, set this field to 0.
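
The mapping rules above can be sketched as follows. The `SPEAKERS` list mirrors the table in this article, and the function names are made up for illustration. Undefined mappings are represented here as `None`, which is a choice of this sketch rather than anything the format defines.

```python
import struct

SPEAKERS = [
    "Front Left", "Front Right", "Front Center", "Low Frequency",
    "Back Left", "Back Right", "Front Left of Center", "Front Right of Center",
    "Back Center", "Side Left", "Side Right", "Top Center",
    "Top Front Left", "Top Front Center", "Top Front Right",
    "Top Back Left", "Top Back Center", "Top Back Right",
]

def speakers_for_channels(channel_mask, channel_count):
    """Return one speaker name (or None if undefined) per channel, in order."""
    # Bit 1 in the table is the least significant bit, i.e. 1 << 0.
    mapped = [name for i, name in enumerate(SPEAKERS) if channel_mask & (1 << i)]
    mapped = mapped[:channel_count]                 # extra bits are ignored
    return mapped + [None] * (channel_count - len(mapped))

def channel_mask_for(speaker_names):
    """Build a channel mask with the bit for each named speaker set."""
    mask = 0
    for name in speaker_names:
        mask |= 1 << SPEAKERS.index(name)
    return mask

quad = channel_mask_for(["Front Left", "Front Right", "Side Left", "Side Right"])
quad_bytes = struct.pack("<I", quad)    # as stored in the file, little-endian
```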

The spec says that if the most significant bit of this field is set to 1 (e.g. 0xFFFFFFFF) it indicates “an entity supports all possible channel configurations.” To be honest, I don’t understand what that means.

Sub Format GUID – Identifies the format of the sample data in the data chunk. Since the format tag will already be set to 65534 to indicate WAVE_FORMAT_EXTENSIBLE, this GUID indicates the sample format instead. For example, the GUID for integer PCM format (i.e. format tag 1) is 00000001-0000-0010-8000-00aa00389b71.

The GUID for any format tag can be determined by plugging the 4 digit hex representation of the format code into the GUID template 0000____-0000-0010-8000-00aa00389b71. For example, the format tag for A-law format is 6 in decimal, which is 0x0006 as a 4 digit hex number. Therefore the GUID for this format is 00000006-0000-0010-8000-00aa00389b71.

However, it’s also possible to use WAVE_FORMAT_EXTENSIBLE for other formats that don’t have a format tag defined. This allows new formats to be defined without having to coordinate with Microsoft to reserve a format tag. In that situation the GUID won’t match the template above.

The GUID’s bytes are not stored in an order that matches the text representation. Each byte (i.e. every two hex digits) of the first three groups is stored in reverse order from the text format, unlike the final two groups. For example, the GUID 00000001-0000-0010-8000-00aa00389b71 is stored as the bytes 0x01 0x00 0x00 0x00 – 0x00 0x00 – 0x10 0x00 – 0x80 0x00 – 0x00 0xaa 0x00 0x38 0x9b 0x71.

This means that the bytes of an existing format tag’s GUID can be determined by taking the two bytes of the original format tag (in little-endian order) and appending 0x00 0x00 0x00 0x00 0x10 0x00 0x80 0x00 0x00 0xaa 0x00 0x38 0x9b 0x71.
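
A small Python sketch of that byte-level rule, with `uuid.UUID(bytes_le=...)` from the standard library used to recover the text form. The constant and function names are illustrative.

```python
import struct
import uuid

# The 14 bytes shared by every GUID that follows the template.
GUID_SUFFIX = bytes([0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x80, 0x00,
                     0x00, 0xAA, 0x00, 0x38, 0x9B, 0x71])

def sub_format_guid_bytes(format_tag):
    """On-disk 16 bytes of the GUID for a format that has a format tag."""
    return struct.pack("<H", format_tag) + GUID_SUFFIX

def format_tag_from_guid_bytes(guid_bytes):
    """Return the format tag, or None if the GUID does not fit the template."""
    if len(guid_bytes) == 16 and guid_bytes[2:] == GUID_SUFFIX:
        return struct.unpack("<H", guid_bytes[:2])[0]
    return None

a_law = sub_format_guid_bytes(6)
text_form = str(uuid.UUID(bytes_le=a_law))
```

Returning `None` for a GUID outside the template is one way to flag a sample format that has no format tag of its own.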

The Fact Chunk

Field	Bytes	Description
Chunk ID	4	`0x66` `0x61` `0x63` `0x74` (i.e. `"fact"`)
Chunk Body Size	4	32-bit unsigned integer
Number of sample frames	4	32-bit unsigned integer

The fact chunk indicates how many sample frames are in the file. If the format tag is 1 it’s optional; otherwise it’s required.

Note that this field indicates the total number of sample frames, not the total number of samples. For example, if a stereo file contains 1,000 samples for the left channel and 1,000 samples for the right channel, the value of this field should be 1000, not 2000.

The reason this chunk exists is that with some sample formats the total number of sample frames can’t be determined without reading the entire data chunk (e.g. because they store data in a compressed format that has to be decoded). This gives a way of determining e.g. the playing time for the file without having to do that.

It’s not needed for integer PCM data (format tag 1), because the total number of sample frames can be derived by dividing the total bytes in the data chunk body by the block align field (i.e. bytes per sample frame) in the format chunk. For example, if the data chunk body is 352,800 bytes, and bytes-per-sample-frame is 4 (e.g. two 16-bit channels), then the total number of sample frames is 352,800 ÷ 4 = 88,200.
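
For integer PCM the calculation is a single division, and a fact chunk is just the normal 8-byte header followed by one 32-bit count. A minimal sketch follows; the function names are hypothetical, and the 44,100 sample rate used for the playing time is an assumption of the example.

```python
import struct

def sample_frame_count(data_body_size, block_align):
    """Sample frames in an integer PCM data chunk body."""
    return data_body_size // block_align

def make_fact_chunk(frame_count):
    """Serialize a fact chunk: 8-byte header plus a 32-bit frame count."""
    return b"fact" + struct.pack("<II", 4, frame_count)

frames = sample_frame_count(352_800, 4)
seconds = frames / 44_100        # playing time, assuming a 44,100 sample rate
fact = make_fact_chunk(frames)
```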

The Data Chunk

Field	Bytes	Description
Chunk ID	4	`0x64` `0x61` `0x74` `0x61` (i.e. `"data"`)
Chunk Body Size	4	32-bit unsigned integer
Sample Data	Variable	It depends on the format tag

The layout for the data chunk is simpler than the format chunk: the normal 8-byte chunk header, followed by nothing but raw sample data. The sample data can be stored in a number of formats, which will be indicated by the format tag field in the format chunk.

The following sections describe several formats that sample data in the data chunk can be stored as.

Integer PCM Data Chunk

Format tag: 1

This is the most common format, and consists of raw PCM samples as integers. The bits per sample field will indicate the range of the samples:

Bits per sample	Min Value	Mid Value	Max Value
8	0	128	255
16	-32,768	0	32,767
24	-8,388,608	0	8,388,607
32	-2,147,483,648	0	2,147,483,647

Important! Notice that 8-bit samples are unsigned, while larger bit depths are signed.
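
The signed/unsigned difference is easy to get wrong, so here is a small Python sketch of decoding a single sample. It assumes signed samples are stored as two's complement, and the function names are made up for this article.

```python
def decode_int_sample(sample_bytes):
    """Decode one little-endian integer PCM sample of 1, 2, 3, or 4 bytes."""
    # 8-bit samples are unsigned; larger samples are signed.
    signed = len(sample_bytes) > 1
    return int.from_bytes(sample_bytes, "little", signed=signed)

def sample_range(bits_per_sample):
    """Return (min, mid, max) for a given bit depth, as in the table above."""
    if bits_per_sample == 8:
        return 0, 128, 255
    half = 1 << (bits_per_sample - 1)
    return -half, 0, half - 1

silence_8bit = decode_int_sample(b"\x80")          # the 8-bit mid value
lowest_16bit = decode_int_sample(b"\x00\x80")      # the 16-bit min value
highest_24bit = decode_int_sample(b"\xff\xff\x7f") # the 24-bit max value
```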

Samples in a multi-channel PCM wave file are interleaved. That is, in a stereo file, one sample for the left channel will be followed by one sample for the right channel, followed by another sample for the left channel, then right channel, and so forth.

One set of interleaved samples is called a sample frame (also called a block). A sample frame will contain one sample for each channel. In a monophonic file, a sample frame will consist of 1 sample. In a stereo file, a sample frame has 2 samples (one for the left channel, one for the right channel). In a 5-channel file, a sample frame has 5 samples. The block align field in the format chunk gives the size in bytes of each sample frame. This can be useful when seeking to a particular sample frame in the file.

For example, for a 2 channel file with 16-bit PCM samples, the sample data would look like this:

Sample Frame	Channel	Bytes
1	Left	LSB, MSB
1	Right	LSB, MSB
2	Left	LSB, MSB
2	Right	LSB, MSB
...etc

LSB means “least significant byte”, and MSB means “most significant byte.”
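
To make the interleaved layout concrete, this Python sketch splits 16-bit sample data into one list per channel. The helper name `deinterleave_16bit` is an illustrative choice, and it handles only the 16-bit case.

```python
import struct

def deinterleave_16bit(data_body, channels):
    """Split 16-bit integer PCM sample data into one list per channel."""
    sample_count = len(data_body) // 2
    # "h" is a signed 16-bit integer; "<" means least significant byte first.
    samples = struct.unpack("<%dh" % sample_count, data_body[:sample_count * 2])
    # Every `channels`-th sample belongs to the same channel.
    return [list(samples[ch::channels]) for ch in range(channels)]

# Two stereo sample frames: (L=1, R=-1), then (L=2, R=-2).
data_body = struct.pack("<4h", 1, -1, 2, -2)
left, right = deinterleave_16bit(data_body, 2)
```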

Floating Point PCM Data Chunk

Format tag: 3

Alternatively, PCM samples can be stored as floating point values. This is essentially the same as integer PCM format (i.e. format tag 1), except that samples are in the range -1.0 to 1.0. The bits per sample field should be set to 32 or 64 to indicate the precision of the values. Sample frames should be laid out in the same way as described in the “Integer PCM Data Chunk” section above.
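
Decoding float samples might look like the sketch below. It assumes the values are little-endian IEEE 754 floats, which is what Python's `struct` codes `f` (32-bit) and `d` (64-bit) read. The function name is made up for this example.

```python
import struct

def decode_float_samples(data_body, bits_per_sample):
    """Decode little-endian float PCM samples (still interleaved)."""
    code = {32: "f", 64: "d"}[bits_per_sample]
    count = len(data_body) // (bits_per_sample // 8)
    return list(struct.unpack_from("<%d%s" % (count, code), data_body, 0))

samples_32 = decode_float_samples(struct.pack("<3f", -1.0, 0.0, 0.5), 32)
samples_64 = decode_float_samples(struct.pack("<2d", 0.25, -0.75), 64)
```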

EXTENSIBLE Data Chunk

Format tag: 65534

Since WAVE_FORMAT_EXTENSIBLE doesn’t imply any particular sample format, the sample format is instead indicated by the sub format GUID in the format chunk extension. For example, if the sub format GUID is the GUID for integer PCM, then the samples are in the same format as if the format tag was 1.
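
One way a reader could handle this is to resolve the "effective" format tag before decoding samples, as in the Python sketch below. The lookup table covers only integer and float PCM, and all names are hypothetical.

```python
import uuid

WAVE_FORMAT_EXTENSIBLE = 65534

# Sub format GUIDs built from the template, keyed to their format tags.
KNOWN_SUB_FORMATS = {
    uuid.UUID("00000001-0000-0010-8000-00aa00389b71"): 1,  # integer PCM
    uuid.UUID("00000003-0000-0010-8000-00aa00389b71"): 3,  # float PCM
}

def effective_format_tag(format_tag, sub_format_guid_bytes=None):
    """Return the format tag that describes the data chunk's samples.

    Returns None for an EXTENSIBLE file whose GUID is not in the table.
    """
    if format_tag != WAVE_FORMAT_EXTENSIBLE:
        return format_tag
    # The 16 bytes are in on-disk order, so use the little-endian form.
    guid = uuid.UUID(bytes_le=sub_format_guid_bytes)
    return KNOWN_SUB_FORMATS.get(guid)

float_guid = uuid.UUID("00000003-0000-0010-8000-00aa00389b71").bytes_le
resolved = effective_format_tag(WAVE_FORMAT_EXTENSIBLE, float_guid)
```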

References

Documents from Microsoft defining the initial file format, and changes over time: