
Sunday, February 26, 2012

Digital Dexterity: An Audio Primer

From the Tech Vault Dept.: Please don't try to make it through this piece. It's four thousand words explaining digital audio technology as practiced twenty years ago, and thus probably has more value now as a museum piece. But, damn, there's a lot of authorly splendor packed into it, which is one reason I'm sharing it. The other is to remind myself that there was a time in my life, ever so long ago, when I actually got paid for this shit.


BACK IN MY REBELLIOUS teen years, I discovered that I could listen in on my parents’ whispered discussions by running two wires from the woofer of their low-fi record player under the carpet, out the window, up to my room and into an AUX jack on my slightly better stereo. Aside from its usefulness in planning my day, it demonstrated the simplicity of analog sound reproduction.

Sound itself – whether it be speech, music, or the angry whispers of frustrated parents – is a series of changes in air pressure. The old string-and-cups telephone system demonstrates how a flexible diaphragm can intercept that sound – much as the eardrum does – and, by tugging on an attached string and stimulating another, similar diaphragm, reproduce it.

Edison’s phonograph, patented in 1877, used a cone to capture and focus sound, with a needle taking the place of taut string. The needle inscribed a series of pits in the tinfoil wrapped around a cylinder. Sending the needle over those pits again reproduced the sound. And the quality of the reproduction was so impressive to early auditors that many claimed it was gimmicked, with a talking human hidden nearby.

The cylinder was flattened by Emile Berliner in the 1890s, and mass production of 78s – familiar to turn-of-the-century Caruso fans and late-‘40s boppers – was the result. The 78 was, so far, the longest-lived format, but it went through noticeable changes. The most significant was the transition from acoustic to electrical recording in the ’20s. Before then, performers gathered in a cramped studio in front of a big horn, which funneled the sounds they produced into a cutting stylus. In other words, the earliest recordings were “direct to disk,” a technique that resurfaced, albeit more cleanly, in the waning days of the LP.


Then microphones took over. The diaphragm in a microphone changes sound waves into electrical impulses, which may be transmitted along a wire and even broadcast. (As we’ll see, broadcasting techniques were among the important precursors to digital reproduction.) At the receiving or reproduction end, those impulses are amplified and sent to a loudspeaker, which effectively reverses the microphone’s process. As I discovered, a speaker can even serve as a crude microphone, although the woofer gave my parents a Chaliapin-like basso.

At first, microphones also recorded direct to disk; with the refinement of magnetic tape in the 1940s, it was possible to introduce an intermediate stage that allowed editing. Every stage of the process added noise. Microphones, wire, amplifiers, tape, and especially the LPs themselves put in their own sounds, to the distress of audiophiles. As sound reproduction got cleaner, noise became more and more of an issue.

In 1948, 78s gave way to long-playing records; ten years later, following 20 years of experimentation, stereo was introduced to the consumer for the most natural-sounding listening experience yet.

Ten years after that, the Philips “compact cassette” came on the scene. Like the Edison cylinder, it was intended for spoken word: dictation, talking books. The 1/8th-inch-wide tape, moving at 1 7/8 inches per second, was horribly noisy compared to the industry recording standard of 1/2-inch tape at 30 ips. Even home consumer machines, typically running 1/4-inch tape at 7 ½ ips, sounded vastly better.

But the cassette was too convenient to be relegated to the low end. Dolby noise reduction applied a compression-expansion scheme to the high frequencies, boosting the quiet ones during recording and trimming them back on playback, letting a so-equipped deck suppress the high-frequency hiss imposed by the tape itself. Better tape coatings, typically using a formula of chromium dioxide, also improved the sound.

By the 1980s, as the personal computer began its own fast evolution, cassettes had eclipsed LPs as the preferred sound-storage medium. But by then a new contender had crested the horizon, based on new technology and presaging an inevitable conjunction with PCs.


As any record collector can tell you, there’s an invisible bandit with jam-stained fingers who likes to fondle the vinyl of newly-acquired albums, rendering them practically unplayable a week after they’re brought into the house. Cassettes generally sounded lousier but stood up to abuse better. Digital technology eliminated those problems (although the lifespan of a compact disc is still being debated) by eliminating all of the noisy analog steps of the recording and reproduction process, presenting the result on a medium impervious to noise.

The binary digits familiar to computerists were the key. The undulating waveform of an analog sound still undulates when captured, say, on magnetic tape (Fig. 1). But the brain theoretically can’t process every millisecond of sound. Motion pictures are based on the principle of persistence of vision – when you’re watching a movie in a theater, the screen is actually blank two-thirds of the time, but the brain is still studying one frame by the time the next one gets projected. Digital audio takes a similar approach, but, being more abstract than a picture, the details are far more complicated.
Figure 1: Acoustic Waveform

Sampling describes the process of capturing digital representations of pieces of sound. The optimal sampling rate was determined in the late 1920s by an AT&T engineer named Harry Nyquist, whose formula suggests that a signal should be sampled at least twice as often as the highest frequency in that signal.
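Nyquist’s rule reduces to a single multiplication. A quick sketch in Python (the function name is my own invention):

```python
def min_sampling_rate(highest_freq_hz):
    """Nyquist: sample at least twice as fast as the highest frequency present."""
    return 2 * highest_freq_hz

# Capturing the full 20 kHz range of human hearing demands
# at least 40,000 samples per second.
print(min_sampling_rate(20_000))  # 40000
```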

The human ear has a range of 20 cycles per second (Hertz) to 20,000 Hz, although by adulthood (or after a few really rockin’ concerts) that upper limit deteriorates to about 16-17 kHz. Compact disc technology is based on a sampling rate of 44.1 kHz, which is also the high-end rate of many computer audio boards. Professional studio rigs, like digital audio tape (DAT) machines, sample at the rate of 48 kHz.

What happens if you sample too slowly? When the frequency of a sound exceeds half of the sampling rate, a process called aliasing kicks in. The sound is folded back down to a lower frequency, mirrored around half the sampling rate, but it’s nothing as pleasant as simply dropping it down an octave. In fact, it sounds downright ugly.

Which means that even in an optimal sampling situation, sounds inaudible to the human ear could still trigger the aliasing effect. To avoid that, low-pass filtering is built into the recording chain, which cuts off the too-high sounds. Low-pass filtering requires its own headroom, which is why the sampling rates in use are more than twice the 20 kHz ear limit.
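Exactly where a too-high tone lands can be computed: sampling folds the spectrum around half the sampling rate. A Python sketch (names mine):

```python
def alias_frequency(f_hz, sample_rate_hz):
    """Frequency at which a tone actually appears once sampled.
    The spectrum repeats every sample_rate_hz and folds at the halfway point."""
    nyquist = sample_rate_hz / 2
    f = f_hz % sample_rate_hz
    return f if f <= nyquist else sample_rate_hz - f

# An inaudible 30 kHz tone, sampled at 44.1 kHz, folds down
# to a thoroughly audible 14.1 kHz.
print(alias_frequency(30_000, 44_100))  # 14100
```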


Microsoft’s first multimedia specifications, released late in 1992, called for a sound board capable of recording a single (mono) track of 8-bit sound sampled at 11 kHz and playing back 8-bit sounds sampled at 22 kHz. Obviously, we’re not talking about much in the way of high frequencies or dynamic range.

You could usually play an audio CD through a so-equipped board with acceptable results, however, because CD-ROM players have their own DAC circuitry. But the conservative specs were designed to minimize storage requirements. This is where the numbers start to get nasty.

Even a mere minute of 8-bit audio sampled at 11 kHz takes over half a megabyte of disk space. As a full-blown stereo audio CD file, it would require over 10 MB. We’ll look at storage considerations and compromises in a moment.
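The arithmetic behind those figures is plain multiplication; a Python sketch:

```python
def audio_bytes(seconds, sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM storage requirement, in bytes."""
    return seconds * sample_rate_hz * (bits_per_sample // 8) * channels

# One minute of 8-bit mono at 11 kHz: over half a megabyte.
print(audio_bytes(60, 11_000, 8, 1))   # 660000
# The same minute as 16-bit stereo at 44.1 kHz: over 10 MB.
print(audio_bytes(60, 44_100, 16, 2))  # 10584000
```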


Sampling gets more complicated still. Each sample in an audio CD is stored as a 16-bit number, which gives it 65,536 possible values. To the eye, this would be a reasonably acceptable number of colors (but compare it to a 24-bit video card’s 16.7 million colors to appreciate how much detail the eye really craves), but the ear has a somewhat smaller spectrum. Translated into decibels (dBs), 16 bits allow a range of 96 dB, which is greater than most people can stand. A decibel is defined as the smallest change in volume that a good ear can distinguish; 0 dB is silence, a difficult-to-find quantity, while 96 dB is the sound of a subway express car rushing past while you’re waiting, annoyed, at a local station.
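That 96 dB figure comes straight from the bit count: each bit of resolution buys roughly 6 dB of dynamic range. A quick sketch:

```python
import math

def dynamic_range_db(bits):
    """Theoretical dynamic range of linear PCM: 20 * log10 of the level count."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16)))  # 96 -- CD audio
print(round(dynamic_range_db(8)))   # 48 -- early computer sound boards
```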

Figure 2: Analog vs. Digital Waveforms
Quantization divides the amplitude range into discrete levels – in this case, 65,536 of them. Seems (or sounds) like plenty, but no more are allowed. Obviously, sound, being truly analog, contains more information still. Therefore, the recording circuitry in an analog-to-digital converter (ADC) rounds the in-between values up or down to the nearest level. What results is the audio equivalent of an enlarged bitmapped image: the jaggies. The smooth analog wave gets staircased (Fig. 2), and the result can sound unpleasantly harsh. Audiophiles who lament the passing of LPs (“vinyl Luddites,” I’ve seen them termed) point to quantization noise as one of the evils of digital technology.
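The staircasing is easy to reproduce. A Python sketch that quantizes a smooth sine wave to a deliberately crude 3 bits (the function is mine):

```python
import math

def quantize(value, bits):
    """Round a sample in the range -1.0..1.0 to the nearest of 2**bits levels."""
    levels = 2 ** (bits - 1)          # half the levels above zero, half below
    return round(value * levels) / levels

# One cycle of a sine wave, reduced to a staircase.
samples = [math.sin(2 * math.pi * i / 16) for i in range(16)]
stairs = [quantize(s, 3) for s in samples]
print(stairs[:4])  # [0.0, 0.5, 0.75, 1.0]
```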

Odd as it seems, the way to get around that noise is to introduce noise. A small amount of white noise is added to the recording, masking the quantization noise in a process known as dithering – again, the audio equivalent of something familiar to graphics users.
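A sketch of that trick in Python: add a trace of noise before rounding, and the rounding errors stop correlating with the signal. (The scaling here is illustrative, not any particular converter’s recipe.)

```python
import random

def dither_and_quantize(value, bits, rng):
    """Add white noise of about one quantization step before rounding,
    turning correlated quantization error into a benign hiss."""
    levels = 2 ** (bits - 1)
    noise = (rng.random() - 0.5) / levels
    return round((value + noise) * levels) / levels

rng = random.Random(0)   # seeded so the sketch is repeatable
print(dither_and_quantize(0.3, 8, rng))
```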

Although computer audio limped along at inferior sampling and quantization rates during its infancy, it has now emerged with full-blown Red Book standard sound – referring to the international standard (IEC 908) for compact disc digital audio that was published in a book with a, you guessed it, red jacket. You’ll find settings for 16-bit recording and playback at 44.1 kHz in high-end cards from all of the major manufacturers, with Media Vision, Turtle Beach, and Creative Labs leading the way.


Digitizing audio is actually even more complicated, but whether you’re using a professional DAT system or making wiseguy WAV files on a home computer, the process is fairly consistent.

Once the sound is captured by a microphone, it passes into a sample-and-hold circuit, which freezes a portion of the signal. And remember, at the high end, it freezes 44,100 portions each second. Each portion is simply a voltage measurement that is then sent to the ADC, which converts the amplitude measurement into the appropriate number of bits – usually two bytes’ worth.

Low frequencies, which are slower moving, are sampled many times per cycle. Concert A, the note an orchestra tunes to, vibrates at 440 Hz, and therefore each of its cycles gets sampled about 100 times. Although none of the common instruments of the orchestra produces a fundamental much above 6 kHz, all sounds, especially musical ones, produce overtones: harmonically related vibrations at mathematical intervals above the fundamental frequency. These sounds give an instrument its characteristic sound, or timbre. So each cycle of a 12 kHz overtone is sampled three times, while a brain-maddening whistle up in the 16 kHz range will at least be sampled twice per cycle to prevent the aforementioned aliasing.
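At a 44.1 kHz sampling rate, the number of samples landing inside one cycle of a tone is a single division. A sketch:

```python
def samples_per_cycle(freq_hz, sample_rate_hz=44_100):
    """How many samples land within one cycle of a tone."""
    return sample_rate_hz / freq_hz

print(int(samples_per_cycle(440)))     # 100 -- concert A
print(int(samples_per_cycle(12_000)))  # 3   -- a high overtone
print(int(samples_per_cycle(16_000)))  # 2   -- just clearing the Nyquist bar
```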

Another problem is the more human one of error. Storing or transmitting those numbers – and there are lots and lots of them – opens up the possibility of losing or scrambling some of them. Recording tape, for example, is just as likely to screw up digital information as it will audio, but the ear is less forgiving. After all, if a grease speck gets on a cassette and causes a dropout in the middle of “Mean Mr. Mustard,” you’re usually able to remember or at least fill in the portion of the song that’s obscured.

A dropout on a digital tape is an abstraction, a paint-by-numbers canvas with a bunch of the numbers wiped out. Misplace even one bit of a 2-byte word and you’re talking about a completely different sound. Which, if you exaggerate the possibilities, could end up sounding like a Berlioz chorus just snuck into the midst of The Beatles. A lot more information needs to be stored for a digital recording than an analog one, so inch for inch there’s room for a lot more error.

Hence the use of error-correction techniques like interpolation and parity bits. Interpolation is the digital equivalent of the benevolent ear described above. When the playback circuit of a so-equipped system discovers that bits are missing, it examines the information adjacent to what’s missing and concocts a sound that lies somewhere in between. In fact, the really good interpolation systems draw an average from more than simply the adjacent signals for an even better guess. Interestingly, this was a technique used by click suppressors (SAE called theirs an “impulse noise reduction system”) in the LP days, which detected abnormal spikes in amplitude such as might be caused by a scratch and replaced them with a sound averaged from adjacent information.
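The simplest form of that guesswork, averaging the neighbors of a lost sample, fits in a few lines. A Python sketch (names mine):

```python
def interpolate_dropout(samples, missing_index):
    """Patch a lost sample with the average of its two neighbors."""
    left = samples[missing_index - 1]
    right = samples[missing_index + 1]
    return (left + right) / 2

stream = [100, 120, None, 160, 180]   # one sample lost in transit
stream[2] = interpolate_dropout(stream, 2)
print(stream)  # [100, 120, 140.0, 160, 180]
```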

Parity bits, commonly used in a computer’s memory banks and in serial communications, are extra bits that record whether the number of 1s in a given sample is even or odd, based on the idea that only one bit (or at least an odd number of them) will go wrong. As error police, then, they’re not foolproof, and they’re only capable of indicating that something’s wrong – they don’t offer corrections.
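A sketch of parity at work: it catches a single flipped bit but, true to the text, waves a double error right through. (Function names are mine.)

```python
def parity_bit(sample):
    """1 if the sample holds an odd number of 1s, 0 if even."""
    return bin(sample).count("1") % 2

def parity_ok(sample, stored_parity):
    """True if the sample still agrees with its parity bit."""
    return parity_bit(sample) == stored_parity

word = 0b1011_0010_1100_0001          # a 16-bit sample
p = parity_bit(word)
print(parity_ok(word, p))             # True
print(parity_ok(word ^ 0b1000, p))    # False -- one flipped bit is caught
print(parity_ok(word ^ 0b1100, p))    # True  -- two flips slip past unnoticed
```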

Error correction codes are the most reliable, and the most space-consuming. They can include redundant information to supplement what’s already on the recording, or they can be part of a scheme that spreads a given information stream over a larger area, interleaving samples over the disc or tape. Another technique is to provide supplemental information: it’s all math, so an error code can take the form of the difference between adjacent samples.
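Interleaving is easy to sketch: write the samples out of order, so that a burst of damage on the medium lands on samples that were originally far apart. In Python (names and the depth of 3 are mine):

```python
def interleave(samples, depth):
    """Reorder samples so originally-adjacent ones end up far apart on the medium."""
    return [samples[i] for d in range(depth)
                       for i in range(d, len(samples), depth)]

def deinterleave(samples, depth):
    """Restore the original order on playback."""
    rows = len(samples) // depth
    return [samples[d * rows + r] for r in range(rows) for d in range(depth)]

data = list(range(12))
laid_down = interleave(data, 3)
print(laid_down)                           # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
print(deinterleave(laid_down, 3) == data)  # True
```

A scratch that wipes out three consecutive values on the medium now destroys samples 0, 3, and 6 of the restored stream – each of them isolated, and each interpolatable from its surviving neighbors.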


Recording digital information on a computer is easy. Computers are themselves digital, and love working with bits and bytes. But getting digital info into tape-recordable form is more difficult. One of the simple techniques is frequency shift keying (FSK), which was used by cheap computers to write to cassettes and is still used in modems for signal processing. It simply requires that the 1s and 0s of a byte be translated into different frequencies, which can be re-digitized on playback.
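FSK can be sketched as a lookup: each bit of the byte picks one of two tones. (The 1200/2400 Hz pair below is illustrative, not any particular standard.)

```python
def fsk_frequencies(byte, f_zero=1200, f_one=2400):
    """Translate a byte, most significant bit first, into a tone sequence."""
    bits = [(byte >> i) & 1 for i in range(7, -1, -1)]
    return [f_one if bit else f_zero for bit in bits]

print(fsk_frequencies(0b10110001))
# [2400, 1200, 2400, 2400, 1200, 1200, 1200, 2400]
```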

When digital recording came on the scene in the early ’80s, it was common to see a bulky PCM box hooked into a video tape recorder. The pulse code modulator was usually a Sony unit known as an F1 (just like the Help key). It worked by creating a video-like output, at 30 frames per second (the standard NTSC TV projection speed) with video sync pulses to guide it.

Sony F1
Pulse code modulation can be understood as the digital outgrowth of common broadcast techniques whose names have come to label the bands they occupy on the dial: AM and FM.

Because radio waves lie well out of the range of human sight or hearing, early broadcasters came up with a system that took an invisible carrier frequency – say, 810 kHz – and modulated it with the amplitude characteristics of the audible signal. Amplitude modulation (AM) is rugged but susceptible to noise. A more sophisticated technique, called frequency modulation (FM), subjected the carrier – now up in the 88-108 MHz range –  to a cleaner alteration.

Because the carrier is a known quantity, it can be examined at the receiving end for variations. By calculating the difference in amplitude or frequency, the original sound is reconstructed.

A digital recording system uses a carrier signal in the form of a square or rectangular wave, which is detected as a series of pulses. Two discernible voltages define the top and bottom of each pulse, which changes extremely quickly. With the pulse as a constant, the carrier can be changed by varying its height, for PAM (easy acronym), or its width (PWM), which is analogous to FM (Fig. 3).

In pulse code modulation, which is the method now used by most digital processors, the shape of the carrier pulse stays the same, but it is broken into bursts of varying length that correspond to each analog sample. Incredibly, PCM was invented back in 1939, and adapted for some forms of communications ten years later.

PCM data is recorded by a digital audio tape recorder equipped with a helical scan system similar to that used by video machines, in which a rotating head drum is mounted to read and write diagonally to the passing tape. This allows a greater surface area for each frame than the perpendicular, immobile head on an open reel machine.

And within each VCR head drum are two to four heads, the higher number ensuring snappier startups when editing. For digital use, the main requirement is storage area. DAT machines lay diagonal stripes of information from two record/playback heads whose gaps are tilted 20 degrees in opposite directions – a 40-degree difference that prevents crosstalk between adjacent tracks.

Because everything in the DAT process is purely digital, there’s no signal degradation from copy to copy. It’s just like copying computer disks, and the attendant fear of rampant piracy held up the introduction of DAT in this country for a couple of years. When it finally was introduced, a form of copy protection – the Serial Copy Management System – was added: you can make a digital copy of a compact disc, but not a copy of that copy.


Compact discs offer an even more rugged format by storing digital information as a series of microscopic pits, read by a laser beam that’s 1.7 millionths of an inch wide. Over 70 minutes of audio information can be stored on a typical music CD (the story goes that it was designed to hold a complete performance of Beethoven’s Ninth Symphony), while over 600Mb of text can be stored on a computer CD-ROM.

Optical devices like a CD-ROM depend on one of two basic rotation methods. Constant angular velocity (CAV) describes the way in which a turntable turns: at the same speed, no matter where the tone-arm is sitting. Therefore, a lot of information is crammed into those tiny inner tracks compared to the large outer ones.

Constant linear velocity (CLV) changes the motor speed to compensate for track size, which is critical in something as finely wrought as a compact disc. Because a CD reads from the inside out, the speed of its spin gradually slows, something you can see if you watch a disc in a window-equipped portable player. The trade-off is a slower access time, however, what with the need to adjust motor speed.

That spiral of information, which turns about 20,000 times, is a series of “pits” and “lands.” Each land is just an area of regular surface, and reflects the laser beam to a photo sensor. It represents an “on” condition. A pit throws the laser light in all directions, and represents an “off.” But they’re not the 1s and 0s of a computer’s ons and offs. The changes between pits and lands during a given period of time produce channel bits, and 14 channel bits represent a data unit that translates to eight computer bits, or one byte. This allows a little data redundancy for error-correction.
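That scheme – transitions carry the 1s, steady stretches carry the 0s – can be sketched in a couple of lines. (The pit/land string here is an illustration, not a legal EFM pattern, which also restricts how close together transitions may fall.)

```python
def channel_bits(surface):
    """Read a row of pit ('P') and land ('L') cells: a change between
    neighboring cells is a 1, a continuation is a 0."""
    return [int(a != b) for a, b in zip(surface, surface[1:])]

print(channel_bits("LLLPPLLLLP"))  # [0, 0, 1, 0, 1, 0, 0, 0, 1]
```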

Coding a byte of digital information into 14 channel bits is known as eight-to-fourteen modulation (EFM), and every CD player has a ROM chip that contains a lookup table to decode the process. 

Channel bits – 588 of them – are clustered into a data frame that holds a wealth of data and supporting information. At the heart of it are 24 bytes of information: 12 sixteen-bit audio samples, six per channel. There are also 8 bytes of error-correction parity info, and one byte is a control and display word. Each byte’s 14 channel bits are separated from the next by three merge bits, and 24 more channel bits (plus their own merge bits) are used for synchronization.
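The frame arithmetic tallies up neatly; a quick check in Python (the sync pattern, like each 14-bit symbol, is followed by three merge bits):

```python
sync = 24 + 3                            # sync pattern plus its merge bits
symbols = 24 + 8 + 1                     # data bytes, parity bytes, control byte
frame_bits = sync + symbols * (14 + 3)   # each byte: 14 channel bits + 3 merge bits
print(frame_bits)  # 588
```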

In the lingo of the CD audio realm, a block is a group of 98 frames. As far as a CD-ROM or CD-I is concerned, it’s a sector. Whatever you call it, a sector functions like its hard disk equivalent. It’s the smallest unit you can write to. Because of constant linear velocity, the same amount of CD-based information is read every second: 75 sectors, or 7,350 frames per second.
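Those rates all follow from the Red Book numbers; a quick tally:

```python
bytes_per_second = 44_100 * 2 * 2           # samples x 2 bytes x 2 channels
frames_per_second = bytes_per_second // 24  # 24 audio bytes per frame
print(frames_per_second)                    # 7350
print(frames_per_second // 98)              # 75 sectors (blocks) per second
```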


The Red Book standard specifies the data transfer rate, the style of pulse code modulation, the sampling rate, and quantization for CD audio. Other compact disc formats are based on variations of that format. CD-ROM (compact disc read-only memory) describes what began as a text-only device and now includes digitized audio and video; CD-XA (CD-ROM extended architecture) improves upon that by storing even more audio data – in compressed form – along with the other CD-ROM data, and the standard defines the way those audio signals are combined with text and graphics on a single track.

CD-I, which stands for compact disc interactive, uses a technique called adaptive differential pulse code modulation (ADPCM) on its audio for up to two hours of music (in stereo!) or as much as 20 hours of a monaural voice track droning on and on. Better still, CD-I lets you interleave the audio with graphics and even video for something that starts to resemble full-blown motion pictures with accompanying stereo sound.


Unless you’re relying on a CD-ROM player or an outboard DAT machine, storing digitally-sampled information requires an oppressive amount of disk space. There are good reasons for it: numbers are easy to manipulate, and it’s possible to edit and enhance digital sound and pictures in ways that were unimaginable in the days of magnetic tape and razor blades. These days it’s common for an audio studio to install several gigabytes’ worth of storage on a speedy machine.

But the watchword these days is compression. Less is never more in the computer world: the challenge really is to make it something approaching the same. As the vogue in hard disk compression shows, text and even program files can shrink to fractions of their former selves without any loss of data. It’s not the same when audio and video gets compressed.

The Joint Photographic Experts Group has come up with an eponymous acronym to describe a format for saving still images (JPEG) based on studies of human sight perception. The luminance of an image area eight by eight pixels is evaluated, and frequencies too fine to perceive are converted to 0s. Slight color gradations are quantized on a sliding scale tied to the desired amount of compression, and the result gets a further dose of Huffman coding. A reverse process is applied at the viewing end, but not all data is regained. It’s termed lossy compression, and the result, applied too drastically, is easily discernible. Used correctly, however, JPEG files require far less storage space. Compression ratios of 200:1 can be achieved; realistically, expect something on the lines of a 10:1 or even 20:1 ratio.

A similar approach has been taken to audio, and the technology has been unleashed with a barrage of record-store hype. Studies of audio perception have determined that not all of the sounds present in a digital – or, for that matter, analog – recording can be heard by even a very good listener. Compression techniques have been developed that incorporate algorithms designed to eliminate information that is otherwise masked, and two competing formats have emerged: Sony’s MiniDisc, and the Philips Digital Compact Cassette.

Both are recordable, and both are small. The DCC is the size of a standard analog cassette – which, in fact, can itself be played (but not recorded) in a DCC machine. The MiniDisc, on the other hand, looks like a tiny floppy disk. Unlike the genesis of CD audio, which brought those two companies together and ensured a consistent standard, this promises to be a battle of the mavericks, with no clear consumer consensus over the acceptability of the sound of either format.

Still keeping a turntable to one side of the audio system? Good idea. The march of technology doesn’t necessarily obliterate everything in its path. But if the evolution of computers has taught us nothing else, it’s shown us that successive standards have shorter lives. There’s no going back from digital sampling to the old analog days, but we’ll no doubt be taking smaller steps more frequently forward.

Computer Shopper, November 1993
