This article is a work in progress.

Getting Started

If you're just getting started with audio programming, you might want to read up on some of the basics of digital audio first. Check out this blog post for an introduction.

Wave Files Store Audio Data

Wave files store audio data, encoded using one of several sample formats. Some sample formats contain raw, uncompressed data, while others (e.g. μ-law) are compressed. The most common sample format is PCM, which stands for pulse code modulation. This is raw, uncompressed sample data where each sample is an integer.

Currently, the WaveFile gem supports these sample formats:

Wave Files are RIFF Files

Back in 1985, Electronic Arts came up with a general container file format that could be used to store different types of data – audio, graphics, etc. It was called IFF, for Interchange File Format. Microsoft then took this format, switched the byte order to little-endian to match Intel processors, and dubbed it RIFF (Resource Interchange File Format). Several venerable Microsoft multimedia file formats are stored as RIFF files, including *.avi (a basically obsolete movie format), *.ani (animated cursors), and of course, *.wav.

As mentioned above, all data in a RIFF file is stored as little-endian, owing to its Wintel heritage.

RIFF Files Contain “Chunks”

An IFF file, and therefore a RIFF file, is broken up into several “chunks” of data. Each chunk has an 8-byte header containing a 4-byte identifier code, and a 4-byte size field.

The identifier code, called a FourCC, is typically a more-or-less human-readable ASCII string. For example, “RIFF”, “fmt ”, or “data”. This identifier is case-sensitive.

The size field indicates the size of the chunk in bytes, not including the 8 bytes in the header. That is, if a chunk consists of the header plus 1,000 bytes of data, the size field will indicate 1,000, not 1,008. Chunks can internally contain nested sub-chunks, if the spec for that chunk allows it.
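The header layout above can be sketched in Ruby (a hypothetical helper of my own, not part of the WaveFile gem):

```ruby
require "stringio"

# A minimal sketch of reading a RIFF chunk header: a 4-byte ASCII
# FourCC followed by a 4-byte little-endian unsigned size. The size
# covers only the chunk body, not these 8 header bytes.
def read_chunk_header(io)
  four_cc = io.read(4)            # e.g. "fmt " or "data"
  size = io.read(4).unpack1("V")  # "V" = 32-bit unsigned little-endian
  [four_cc, size]
end

# A fake header: chunk ID "data", body size 1,000 bytes
io = StringIO.new("data" + [1000].pack("V"))
read_chunk_header(io)  # => ["data", 1000]
```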

A Wave file consists of two levels of nested chunks. At the top level, it consists of a single “RIFF” chunk, which contains all of the data for the wave file. The RIFF chunk includes a format code “WAVE” which indicates that the sub-chunks are for a Wave file. Internally, the “RIFF” chunk includes at minimum a format chunk (“fmt ”) and a data chunk (“data”). As the name suggests, the format chunk describes the, well, format of the wave file. The data chunk contains all of the raw sample data. A wave file might also contain other optional chunks, but it must include a format and data chunk, and the format chunk must come first.

Visually this is what it looks like:

RIFF Chunk
    Format code: “WAVE”
    Format Chunk (“fmt ”)
    Other optional chunks*1
    Data Chunk (“data”)

The RIFF Chunk

Like all chunks, the RIFF chunk starts with an ID code, in this case the ASCII string “RIFF”. Next is the size field, which is the size of the entire Wave file except for the 8-byte RIFF header.

The first 4 bytes following the header will identify the type of RIFF chunk. In the case of Wave files, it will be “WAVE”. Immediately following that will be the inner Wave file chunks.

Field             Size  Description
Chunk ID          4     ASCII string "RIFF"
Chunk Size        4     Size of entire file, except for the 8-byte RIFF chunk header
RIFF Format Code  4     ASCII string "WAVE"
Sub Chunks        ?     The Wave format sub chunks (format, data, etc.)
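As a sketch, checking those first three fields might look like this (the helper is mine, not the WaveFile gem's API):

```ruby
require "stringio"

# Check the outer RIFF chunk of a Wave file: the ASCII string "RIFF",
# a 4-byte little-endian size (everything after these first 8 bytes),
# then the format code "WAVE".
def riff_wave_header?(io)
  chunk_id = io.read(4)
  chunk_size = io.read(4).unpack1("V")  # total file size minus 8
  format_code = io.read(4)
  chunk_id == "RIFF" && format_code == "WAVE"
end

# A degenerate, chunk-less "Wave file": just the 12-byte RIFF header
fake_file = "RIFF" + [4].pack("V") + "WAVE"
riff_wave_header?(StringIO.new(fake_file))  # => true
```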

The Format Chunk (Basic Version)

The format chunk describes the format that the samples in the data chunk are encoded in. At minimum, it contains these fields:

Field                   Size  Description
Chunk ID                4     ASCII string "fmt " (note the space after 't')
Chunk Size              4     16, 18, or 40 (why not always 16? see below)
Format Code             2     Indicates PCM, floating point, μ-law, etc.
Number of Channels      2     1 for mono, 2 for stereo, up to 65,535*
Sample Rate             4     Samples per second; 44,100 for CD quality
Bytes per Second        4     Bytes per sample frame × samples per second
Bytes per Sample Frame  2     a.k.a. block align; channels × bytes per sample
Bits per Sample         2     8, 16, 32, etc.
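In Ruby, the 16-byte body of a basic format chunk can be unpacked in one call (the field names here are my own, and this isn't the WaveFile gem's API):

```ruby
# "v" = 16-bit and "V" = 32-bit unsigned little-endian, matching the
# field sizes in the table above.
def parse_basic_format_chunk(body)
  fields = body.unpack("vvVVvv")
  { format_code:     fields[0],
    channels:        fields[1],
    sample_rate:     fields[2],
    byte_rate:       fields[3],
    block_align:     fields[4],
    bits_per_sample: fields[5] }
end

# CD-quality stereo PCM: code 1, 2 channels, 44,100 Hz, 16 bits
body = [1, 2, 44_100, 176_400, 4, 16].pack("vvVVvv")
parse_basic_format_chunk(body)[:sample_rate]  # => 44100
```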

While some of these fields have a large range of possible values, in practice there are only a few that will actually be used.

Here is more detail on what these fields mean. For some background on what some of this terminology means, check out this blog post.

Format Code – Indicates how the sample data for the wave file is stored. The most common format is PCM, which has a code of 1. Other formats include ADPCM (2), IEEE floating point (3), μ-law (7), and WaveFormatExtensible (65534).
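Those codes can be captured in a small lookup table (the symbol names are informal, not from the spec):

```ruby
# Format codes from the paragraph above, keyed by their numeric value
FORMAT_CODES = {
  1      => :pcm,
  2      => :adpcm,
  3      => :ieee_float,
  7      => :mu_law,
  65_534 => :wave_format_extensible,
}.freeze

FORMAT_CODES[1]       # => :pcm
FORMAT_CODES[65_534]  # => :wave_format_extensible
```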

Number of channels – Typically a file will have 1 channel (mono) or 2 channels (stereo). A surround sound file would have 6* channels. Although this field technically allows you to have up to 65,535 channels, for audio data that would be flat out ridiculous. You would only hear all of the channels if you had 65,535 different speakers, and since a chunk can only hold 4GB of data (due to the 32-bit size field), you would only be able to store about a second and a half** of 8-bit PCM data.

Sample rate – The number of sample frames that occur each second. A typical value would be 44,100, which is the same as an audio CD*. Another reasonable value is 22,050. Although this field supports any arbitrary value between 1 and ______, in practice there are only a few values you should use. TODO: What kind of support do audio programs have for wacky sample rate values?

Bytes per second (byte rate) – The spec calls this byte rate, which means the number of bytes required for one second of audio data. This is equal to the bytes per sample frame times the sample rate. So with a bytes per sample frame of 4 (16-bit stereo), and a sample rate of 44,100, this should equal 176,400. TODO: Does the spec have a flaw that would cause the proper value of this to be larger than the field supports?

Bytes per sample frame – Called block align by the spec, this is the number of bytes required to store a single sample frame, i.e. a single sample for each channel. (Sometimes a sample frame is also referred to as a block). It should be equal to the number of channels times the bytes per sample, where the bytes per sample is the bits per sample rounded up to a multiple of 8, divided by 8. For example:

Channels  Bits Per Sample  Bytes per sample frame
1         8                1
2         8                2
1         16               2
2         16               4
6         32               24
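The arithmetic above can be sketched as (helper names are mine):

```ruby
# Bits per sample is rounded up to a whole number of bytes before
# multiplying by the channel count.
def bytes_per_sample_frame(channels, bits_per_sample)
  bytes_per_sample = (bits_per_sample + 7) / 8  # round up to whole bytes
  channels * bytes_per_sample
end

# Byte rate, as described under "Bytes per second" above
def byte_rate(channels, bits_per_sample, sample_rate)
  bytes_per_sample_frame(channels, bits_per_sample) * sample_rate
end

bytes_per_sample_frame(2, 16)  # => 4
byte_rate(2, 16, 44_100)       # => 176400
```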

This field can be used to calculate the bytes per second field. Another possible use is for seeking around in a file. For example, if the bytes per sample frame is 4, then to seek forward 10 sample frames you need to seek forward 40 bytes.

Note that for PCM data, this field is essentially redundant since it can be calculated from the other fields. However, be sure to note that bits per sample values are rounded up to the next multiple of 8.
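Seeking by sample frame might look like this (a sketch; it assumes you've already located the start of the sample data):

```ruby
require "stringio"

# Jump to a given sample frame by multiplying the frame index by the
# block align (bytes per sample frame).
def seek_to_sample_frame(io, data_start, frame_index, block_align)
  io.seek(data_start + frame_index * block_align)
end

# 100 fake stereo 16-bit sample frames (4 bytes each), data at offset 0
io = StringIO.new("\x00" * 400)
seek_to_sample_frame(io, 0, 10, 4)
io.pos  # => 40
```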

Bits per sample – For PCM data, typical values will be 8, 16, or 32. I'm not sure why this is a two-byte field in the spec, since any values over 255 would seem excessive. TODO: Do other non-PCM formats take advantage of it being two bytes?

The Format Chunk (More Exact Version)

OK, so I simplified some things a bit. That last section described the format chunk when the sample format is PCM, has 1 or 2 channels, and has 8 or 16 bits per sample. (In other words, a classic old school Wave file). The full format chunk spec actually includes an extension if the format is anything else:

Field                  Size  Description
Extension Size         2     0 or 22
Valid Bits Per Sample  2     TODO
Channel Mask           4     TODO
Sub Format             16    TODO

If the format doesn't meet the "classic" criteria described in the previous section, then the extension size field should be present:

For all formats except WaveFormatExtensible (i.e. 65534), the extension size should be 0 (i.e., two zero bytes), and that should be the end of the format chunk.

When the format is WaveFormatExtensible, the extension size should be 22, and the remaining three fields should be included:

Valid Bits Per Sample – TODO

Channel Mask – TODO

Sub Format – TODO

Note that the presence or absence of these fields determines the format chunk size – if the extension is absent entirely, the chunk size should be 16; if only the extension size field is present, the chunk size should be 18; and if all of the extension fields are present, the chunk size should be 40.
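That logic can be sketched as (a hypothetical helper; 65,534 is the WaveFormatExtensible code):

```ruby
# Pick the format chunk size based on the rules above: 40 for
# WaveFormatExtensible, 16 for a "classic" chunk with no extension,
# and 18 when a zero-valued extension size field is included.
def format_chunk_size(format_code, has_extension_size: true)
  return 40 if format_code == 65_534
  has_extension_size ? 18 : 16
end

format_chunk_size(1, has_extension_size: false)  # => 16 (classic PCM)
format_chunk_size(3)                             # => 18 (e.g. floating point)
format_chunk_size(65_534)                        # => 40
```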

These variations in the format chunk are a result of adding support for formats that the original wave file spec wasn't capable of specifying.

The Data Chunk

The layout for the data chunk is simpler than the format chunk: the normal 8-byte chunk header, followed by nothing but sweet, raw, unfiltered sample data. The sample data can be stored in a number of formats, which will be indicated by the format chunk.

The next several sections describe various formats that data in the data chunk can be stored as.

PCM Data Chunk

The simplest and most common option is to store PCM samples (format code 1). This is just raw sample data stored as integers. The bits per sample field will indicate the range of the sample data:

Bits per sample  Minimum Sample  Maximum Sample
8                0               255
16               -32,768         32,767
32               -2,147,483,648  2,147,483,647

Notice that 8-bit samples are unsigned, while other bit depths are signed. Not weird at all. TODO: Add note about non-multiple-of-8 bit depths
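Ruby's pack/unpack directives make that signed/unsigned split explicit ("C" is unsigned 8-bit; "s<" is signed 16-bit little-endian):

```ruby
# 8-bit PCM is unsigned, so silence is 128, not 0
eight_bit = [0, 128, 255].pack("C*")
# 16-bit PCM is signed little-endian, so silence is 0
sixteen_bit = [-32_768, 0, 32_767].pack("s<*")

eight_bit.unpack("C*")     # => [0, 128, 255]
sixteen_bit.unpack("s<*")  # => [-32768, 0, 32767]
```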

Samples in a multi-channel PCM wave file are interleaved. That is, in a stereo file, one sample for the left channel will be followed by one sample for the right channel, followed by another sample for the left channel, then right channel, and so forth.

The samples for all channels at a moment in time are called a sample frame (also called a block). That is, a sample frame will contain one sample for each channel. In a monophonic file, a sample frame will consist of 1 sample. In a stereo file, a sample frame has 2 samples (one for the left channel, one for the right channel). In a 5-channel file, a sample frame has 5 samples. The block align field in the format chunk gives the size in bytes of each sample frame. This can be useful when seeking to a particular sample frame in the file.
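De-interleaving a stereo stream into sample frames is then a matter of slicing (a sketch with made-up sample values):

```ruby
interleaved = [10, -10, 20, -20, 30, -30]  # L, R, L, R, L, R

# Each sample frame holds one sample per channel
frames = interleaved.each_slice(2).to_a  # => [[10, -10], [20, -20], [30, -30]]

left  = frames.map { |l, _r| l }  # => [10, 20, 30]
right = frames.map { |_l, r| r }  # => [-10, -20, -30]
```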

Floating Point Data Chunk

Another basic format is to store samples as floating point values (format code 3). This is essentially the same as PCM format, except that samples are in the range -1.0 to 1.0. The bits per sample field for floating point files should be set to 32 or 64. TODO: What about non-32 or 64 bit sizes?
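Converting between 16-bit PCM and floating point samples is mostly a matter of scaling. The divisors below are one common convention, an assumption on my part rather than something the spec mandates:

```ruby
# Map a signed 16-bit PCM sample into the -1.0..1.0 range
def pcm16_to_float(sample)
  sample / 32_768.0
end

# Map a -1.0..1.0 float back to a signed 16-bit PCM sample
def float_to_pcm16(sample)
  (sample * 32_767).round
end

pcm16_to_float(16_384)  # => 0.5
float_to_pcm16(-1.0)    # => -32767
```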

Compressed Data Chunk Formats

Sample data can also be stored in a compressed format. Examples include ______. I'm not too familiar with how these algorithms work, and the WaveFile gem doesn't currently support them. But, you can read up on them at Wikipedia if you're interested:

The More You Know: Even Chunk Sizes

According to the RIFF spec*, chunk sizes must be an even number of bytes. So, for example, if writing a monophonic 8-bit file with an odd number of samples, you should append a blank* byte to the end of the data chunk to get it to an even size. TODO: How will this be interpreted as audio?
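Padding to an even chunk size can be sketched as (per the RIFF spec, the pad byte is written to the file but not counted in the chunk size field):

```ruby
# Append a zero byte when the data has an odd length
def pad_chunk_data(data)
  data.bytesize.odd? ? data + "\x00" : data
end

pad_chunk_data("abc").bytesize   # => 4
pad_chunk_data("abcd").bytesize  # => 4
```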

References

*1 - I'm not sure whether the spec prevents chunks from also coming after the data chunk, but in any case it doesn't seem common, and it doesn't seem like a good idea to do that.