Sampling Rates, Sample Depths, and Bit Rates: Basic Audio Concepts

Ragy Morkos January 17, 2020

In a previous blog post, Rahul talked about audio codecs and transcoding, some of which “compress” audio in order to save storage space. But what exactly do these audio codecs do in terms of compression, and what determines the quality of an audio file in the first place?

When it comes to audio processing, there is a lot of terminology that most people have heard before but do not really understand. I used to be one of those people before I had to work on audio processing. To that end, I wanted to talk about some of these terms, describe what they are, and showcase what they mean for the quality of an audio recording or stream. For the rest of this post, we’ll assume that we are dealing with only one channel of uncompressed audio.

(1) Sampling Rate / Sampling Frequency

The first term we often hear about is the sampling rate or sampling frequency, which both refer to the same thing. Some of the values you might have come across are 8kHz, 44.1kHz, and 48kHz. What exactly is the sampling rate of an audio file?

The sampling rate refers to the number of samples of audio recorded every second. It is measured in samples per second or Hertz (abbreviated as Hz or kHz, with one kHz being 1000 Hz). An audio sample is just a number representing the measured acoustic wave value at a specific point in time. It’s very important to note that these samples are taken at temporally equispaced instants in a second. For example, if the sampling rate is 8000 Hz, it’s not enough that there be 8000 samples sampled during a second; they must be taken at exactly 1/8000 of a second apart. The 1/8000 number in this case would be called the sampling interval (measured in seconds), and the sampling rate is simply the multiplicative inverse of that.
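The relationship between the sampling rate and the sampling interval can be sketched in a few lines of Python. This is only an illustration; the function names `sampling_interval` and `sample_times` are made up for this example and do not come from any audio library.

```python
def sampling_interval(sampling_rate_hz):
    """Return the time between two consecutive samples, in seconds."""
    return 1.0 / sampling_rate_hz


def sample_times(sampling_rate_hz, num_samples):
    """Return the instants (in seconds) at which the first samples are taken."""
    # Computing n / rate directly keeps the instants evenly spaced and
    # avoids accumulating floating-point error from repeated addition.
    return [n / sampling_rate_hz for n in range(num_samples)]


print(sampling_interval(8000))   # 0.000125
print(sample_times(8000, 4))     # [0.0, 0.000125, 0.00025, 0.000375]
```

At 8000 Hz, every sample lands exactly 1/8000 of a second after the previous one, which is the equispacing requirement described above.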

The sampling rate is analogous to the frame rate or FPS (frames per second) measurement for videos. A video is simply a series of pictures, usually called “frames” in this context, displayed back to back very quickly to give the illusion (at least to us humans) of continuous, uninterrupted motion.

While the audio sampling rate and the video frame rate are similar, the usual numerical minimum for guaranteed usability in each one is very different. For video, a minimum of 24 frames per second is required in order to guarantee that motion be depicted accurately; less than that, and the motion might appear choppy, and the illusion of continuous non-interrupted movement cannot be maintained. This especially holds true the more motion is occurring between frames. Moreover, a video with 1 or 2 frames per second might have “split-second” events that are guaranteed to be missed between the frames.

For audio, the minimum sampling rate needed to unambiguously represent English speech is 8000 Hz. Using less than that would result in speech that might not be comprehensible for a variety of reasons, one of which is that similar utterances become indistinguishable from one another. Lower sampling rates confound phonemes, or sounds in a language, which have significant high-frequency energy; for example, at 5000 Hz, it is difficult to distinguish /s/ from /sh/ or /f/.

Since we mentioned video frames, another term worth elaborating on is that of audio frames. Although the audio sample rate and the audio frame rate are both measured in Hertz, samples and frames are not the same thing. An audio frame is the group of audio samples, drawn from one or more audio channels, that belong to the same instant in time. With a single channel, each frame therefore contains exactly one sample.
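To make the difference between samples and frames concrete, here is a minimal sketch that groups a flat list of samples into frames. It assumes the samples are interleaved by channel (left, right, left, right, and so on), which is a common layout but not the only one; the function name `group_into_frames` and the sample values are hypothetical.

```python
def group_into_frames(interleaved_samples, num_channels):
    """Split a flat list of interleaved samples into frames.

    Each frame is a tuple holding one sample per channel for a single instant.
    """
    if len(interleaved_samples) % num_channels != 0:
        raise ValueError("sample count must be a multiple of the channel count")
    return [
        tuple(interleaved_samples[i:i + num_channels])
        for i in range(0, len(interleaved_samples), num_channels)
    ]


# Made-up stereo data: samples alternate L0, R0, L1, R1, L2, R2.
stereo = [10, -10, 20, -20, 30, -30]
print(group_into_frames(stereo, num_channels=2))
# [(10, -10), (20, -20), (30, -30)]

# With one channel, every frame holds exactly one sample.
print(group_into_frames([10, 20, 30], num_channels=1))
# [(10,), (20,), (30,)]
```

Six stereo samples make only three frames, which is why the sample count and the frame count differ as soon as there is more than one channel.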

The most common values for the sampling rate are the aforementioned 8kHz (most common for telephone communications), 44.1kHz (most common for music CDs), and 48kHz (most common for audio tracks in movies). Lower sampling rates mean fewer samples per second, which in turn means less audio data, since there are fewer sample points to represent the audio. The sampling rate is chosen for a certain application depending on what acoustic artifacts need to be captured. Some acoustic artifacts, like speech utterances, require a lower sampling rate than others, such as a tune on a music CD. It’s important to note that higher sampling rates require more storage space and processing power to handle, though this might not be as big of an issue now as it used to be in the old days, when digital storage and processing power were primary considerations.

(2) Sample Depth / Sample Precision / Sample Size

In addition to the sampling rate, which is how many data points of audio we have, there is also the sample depth. Measured in bits per sample, the sample depth (also known as the sample precision or sample size) is the second important property of an audio file or stream, and it represents the level of detail, or “quality”, each sample has. As we mentioned above, each audio sample is just a number, and while having a lot of numbers is helpful to represent audio, you also need the range or “quality” of every individual number to be large enough to represent each sample or data point accurately.

What does “quality” mean? For an audio sample, it simply means that the audio sample can represent a higher range of amplitudes. A sample depth of 8 bits means that we have 2^8 = 256 distinct amplitudes that each audio sample can represent, and a sample depth of 16 bits means that we have 2^16 = 65,536 distinct amplitudes that an audio sample can represent, and so on for higher sample depths. The most common sample depths for telephony audio are 16 bits and 32 bits. The more distinct amplitudes one has in a digital recording, the closer the digital recording sounds to the original acoustic event.
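The effect of the sample depth can be illustrated with a simplified uniform quantizer. This is a sketch under stated assumptions, not how any particular audio format stores samples: it assumes the input amplitude is a floating-point value between -1.0 and 1.0, and the names `num_levels`, `quantize`, and `dequantize` are invented for this example.

```python
def num_levels(sample_depth_bits):
    """Number of distinct amplitudes a sample of the given depth can hold."""
    return 2 ** sample_depth_bits


def quantize(value, sample_depth_bits):
    """Map an amplitude in [-1.0, 1.0] to the index of the nearest level."""
    levels = num_levels(sample_depth_bits)
    clipped = max(-1.0, min(1.0, value))  # keep the input inside the range
    return round((clipped + 1.0) / 2.0 * (levels - 1))


def dequantize(index, sample_depth_bits):
    """Map a level index back to an amplitude in [-1.0, 1.0]."""
    levels = num_levels(sample_depth_bits)
    return index / (levels - 1) * 2.0 - 1.0


print(num_levels(8))    # 256
print(num_levels(16))   # 65536

original = 0.3
for bits in (8, 16):
    restored = dequantize(quantize(original, bits), bits)
    print(bits, abs(restored - original))  # the error shrinks as depth grows
```

With more bits per sample, the levels sit closer together, so the stored value lands closer to the amplitude that was actually measured.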

Again, this is analogous to the 8-bit or 16-bit numbers we might hear about regarding image quality. For images or videos, each pixel in an image or a video frame also has a number of bits to represent color. A higher bit depth in a pixel yields a pixel that is more color-accurate, since the pixel has more bits to “describe” the color to be represented on a screen, and the pixel or image overall would look closer to how one would see it in real life. More technically, the bit depth of a pixel indicates how many distinct colors can be represented in the pixel. If you permit each of R, G, and B to be represented by an 8-bit number, then each pixel is represented by 3 x 8 = 24 bits. This means that there are 2^24 ~ 17 million different colors that can be represented by that pixel.

(3) Bit Rate

Tying the sampling rate and the sample depth together is the bit rate, which is simply the product of both. Since the sampling rate is measured in samples per second and the sample depth is measured in bits per sample, the bit rate is measured in (samples per second) x (bits per sample) = bits per second, abbreviated as bps or kbps. It’s worth noting that because the sample depth and the bit rate are related, the two terms are frequently, yet erroneously, used interchangeably.
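As a worked example of that product, the sketch below computes the bit rate and the resulting storage size for one channel of uncompressed audio. The function names are hypothetical, and pairing 8 kHz with 16 bits is only an illustrative combination of values mentioned earlier in this post.

```python
def bit_rate_bps(sampling_rate_hz, sample_depth_bits):
    """Bit rate of one channel of uncompressed audio, in bits per second."""
    return sampling_rate_hz * sample_depth_bits


def storage_bytes(sampling_rate_hz, sample_depth_bits, duration_seconds):
    """Storage needed for one uncompressed channel, in bytes (8 bits each)."""
    return bit_rate_bps(sampling_rate_hz, sample_depth_bits) * duration_seconds / 8


print(bit_rate_bps(8000, 16))        # 128000 bps, i.e. 128 kbps
print(bit_rate_bps(44100, 16))       # 705600 bps, i.e. 705.6 kbps
print(storage_bytes(8000, 16, 60))   # 960000.0 bytes for one minute
```

Doubling either the sampling rate or the sample depth doubles the bit rate, and with it the storage needed for a recording of the same length.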

The bit rate in audio varies according to application. Applications that require high audio quality, like music, usually have a higher bit rate yielding higher quality, or “crisper” audio. Telephony audio, including that of call centers, doesn’t need a high bit rate, and so the bit rate for an ordinary phone call is usually much lower than that of a music CD. For either the sampling rate or the bit rate, lower values might (literally) sound worse, but again, depending on the application, lower values save storage space and/or processing power.

All in all, what does compression really mean, then, when it comes to audio? Compressed audio formats, such as AAC or MP3, have a bit rate that is smaller than the true product of the sampling rate and the sample depth. The formats achieve this by having information "surgically" removed from the bit stream on perceptual grounds, which means that, in dynamic contexts, those frequencies or amplitudes which are not heard by the human ear for biological reasons are not stored, leading to an overall smaller file size.
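One way to put a number on this is to compare the uncompressed bit rate with the bit rate of the compressed stream. The sketch below does only that arithmetic; the 64 kbps figure is a hypothetical compressed bit rate chosen for illustration, not a property of any specific codec, and the function names are invented.

```python
def uncompressed_bit_rate_bps(sampling_rate_hz, sample_depth_bits):
    """Bit rate of one channel of uncompressed audio, in bits per second."""
    return sampling_rate_hz * sample_depth_bits


def compression_ratio(sampling_rate_hz, sample_depth_bits, compressed_bit_rate_bps):
    """How many times smaller the compressed stream is than the uncompressed one."""
    uncompressed = uncompressed_bit_rate_bps(sampling_rate_hz, sample_depth_bits)
    return uncompressed / compressed_bit_rate_bps


# Hypothetical case: one channel at 44.1 kHz and 16 bits, compressed to 64 kbps.
ratio = compression_ratio(44100, 16, 64000)
print(f"{ratio:.1f}x smaller")  # roughly 11 times smaller
```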

Credits to Kornel Laskowski, Voci’s Chief Scientist, for reviewing the technical details of this article.

Ragy Morkos

Ragy Morkos is a software engineer, embedded with the research staff at Voci. He develops web demonstrations of emerging and future speech technologies, builds automated verification infrastructure and test cases, and authors data preparation tools that interface with Voci's existing product line. His responsibilities also extend into the machine learning realm, where he is working on turn-taking analysis and age identification from speech.
