There is an old and often misunderstood philosophical query: "if a tree fall in a forest with no one around to hear the crash, is a sound produced?" The practical physicist dismisses the question as absurd. Obviously vibrations in the air are produced independently of the presence or absence of hearers, so of course there is sound! The physicist misses the point, however, for what is suggested here is that the aerial vibrations and the perceived noise are not the same things by any means. The word "sound" is being used for two distinct concepts. The physical sound is the stimulus, which when received by the ear creates nerve impulses (physical things as well) that are then interpreted by the brain. The perception of sound is very different from the physical stimulus, and occurs only in the mind.
This is completely analogous to the sense of sight. The physical stimulus, light, is received by the retina, converted into nerve impulses, and sent to the brain. These impulses are interpreted to create the wonderful mental picture of the world around us, which seems so real we scarcely realize that it is really an impression in our minds overlaying the real world so perfectly that we perceive little difference between them. We might ask whether a tree is seen to fall if it falls when there is no observer, with the same significance as in the case of sound. Differences between our perceived world, whether visual or aural, and the real world can exist. Optical illusions are one indication of this, and there are aural illusions as well.
The eye and the ear are remote sensors, working by means of wave motions arriving from distant points, giving us information of things not immediately in contact with us. Touch, taste and smell all are excited by objects in physical and chemical contact with our bodies. The visual sense is a kind of remote tasting, since the light causes a chemical reaction in its detection (the change in shape of a protein). Similarly, the aural sense is a remote touching, since the receptors are modified touch receptors. Although the visual sense is by far the most developed and complex, involving yet unknown mental processes and pathways, the aural sense is also quite wonderful and intricate. Many similarities between the two senses can be recognized.
The mammalian ear has been developed by evolution to detect faint aerial vibrations. It can be traced back to organs of balance in fishes, which consist of three semicircular canals at the sides of the head, one for each rotation axis, that are connected directly with the brain. These semicircular canals still are prominent, and have retained their utility in mammals, but have no function in hearing. A lower lobe of these organs has evolved into a wonderful acoustic detector, using hair cell sensors like those in the semicircular canals. The whole structure is encased in bone and is well-protected.
A section of the human ear is shown at the right. It is conventionally divided into three parts, the outer, middle and inner ears. The outer ear is a rudimentary horn, receiving sound vibrations with the pinna, and channeling them through the meatus to the slightly conical tympanum, or eardrum. The pinna in some mammals can be directed under muscular control to change the angular sensitivity of the ear, but in humans this is restricted to useless wiggling of the ears. The tension in the tympanum can be changed by one or two small muscles not under voluntary control. If the tympanum tends to waffle in the presence of low-frequency sound, its resonant frequency can be raised at the cost of a slight loss in sensitivity by these muscles.
The vibration of the tympanum is transmitted to the sensory organ through small bones, or ossicles, in the middle ear. The pressure in this small chamber can be equalized with that outside through the Eustachian tube, whose opening is under voluntary control, generally by swallowing. The tympanum drives the malleus (hammer) directly. The surfaces and ligaments connecting the malleus and the incus (anvil) act as cams to decrease the amplitude of motion and to increase its force. The anvil drives the stapes (stirrup) that presses on the fenestra ovalis (oval window), communicating the motion to the fluid of the inner ear. Muscles in the middle ear can batten down the ossicles so they do not rattle with loud sounds.
A schematic diagram of the ear is shown at the left, in which the parts are unrolled to show their relationships. The outer and middle ears are shown, but our principal interest is in the inner ear. This is a snail-like tapered cone called from its shape the cochlea, and is filled with fluid. There are actually three longitudinal sections of the cochlea, of which only the most important two are shown here. The upper one is the scala vestibuli (SV), and the lower is the scala tympani (ST). At the large end of the SV is the oval window communicating with the stapes. At the small end is the helicotrema, a small hole that equalizes pressures between the SV and ST, but only with a rather long time constant. The ST then expands as we go back, ending in the fenestra rotunda (round window). High frequencies do not propagate far down the cochlea; only low frequencies can easily reach the small end. The flexible windows at the ends of the SV and ST allow the fluid to oscillate within the rigid bony chamber.
A cross-section of the cochlea is shown at the right. The three sections are now seen. The cochlear canal is the one not labelled. Reissner's membrane separates the SV from the canal, while the basilar membrane and the tectorial membrane separate the ST from the canal, and enclose the sensory structure, the organ of Corti, which rests on the basilar membrane. A plan view of the basilar membrane is shown in the schematic diagram of the ear. We note that it is tapered oppositely to the taper of the cochlea. Helmholtz suggested that there were transverse ligaments supporting it, and they were more tightly stretched in the thinner part, more loosely in the wider part. This now seems not to be the case, but the natural frequency of vibration of the basilar membrane decreases from left to right, from about 8000 Hz to 250 Hz, whatever the structure of the membrane may be. The oscillations are about critically damped, with relaxation times of 50 to 150 ms.
The basilar membrane seems to perform one of the basic functions of the ear, the Fourier analysis of signals impressed on the oval window by the stapes. The delicate hair cells of the organ of Corti are supported on the basilar membrane, under the protection of the tectorial membrane, and detect vibrations excited in it. There are some 25 000 outer hair cells, each with 140 hairs, and 3500 inner hair cells, each with 40 hairs, approximately. There are two layers of hair cells, separated by the tunnel of Corti. Their signals then are passed to ganglion cells, whose axons comprise the aural nerve. As in the eye, there are many, many more hair cells than axons in the aural nerve, so coding must be present. The nerve then leads away through the bony shelf dividing the cochlea, to the brain. There is crossover in the aural nerve pathways, so both hemispheres of the brain receive information from both ears, again as in the case of the eyes. The organ of Corti is separated from the circulation of the blood, for otherwise the pulse would be deafening. The ear is much less complicated than the retina, and is much more easily investigated histologically. Nevertheless, its functioning is mysterious, and many questions remain. The brain probably receives a Fourier transform of what has stimulated the ear, and develops its perceptions from there.
It was once the fashion to assume that the properties of the ear would determine all the peculiarities of hearing, or that every such peculiarity had a physical basis in the structure and functioning of the ear. The ear is surely important, since it is the intermediate between acoustic disturbances and the brain, but it is not everything. The eye is not a camera, presenting a finished picture to the brain, and neither is the ear a phonograph, presenting finished tone impressions to the brain. The ear only provides data; the mind interprets. For example, if two tones f and f' are simultaneously presented to the ear, the difference tone f - f' can sometimes be perceived. This difference tone could arise from nonlinearity in the ear, a physical cause, or could be created in the mind, in which case it would have no physical cause. It is very difficult to distinguish these two cases. Just as memory plays a great role in vision (in the identification of things seen) it may play a role in hearing, as well. However, in hearing the objects are identified as tones, not as objects, which is a great difference.
The physical stimulus of a sinusoidal acoustic wave, or simple tone, can be specified by two parameters, the frequency f in Hz and the overpressure amplitude p in dyne/cm² or μbar. This stimulus plays as important a role in hearing as a spectral color does in vision, but is more fundamental. Ohm's Law, stated by G. S. Ohm (1789-1854) in 1843, says that the audible sensation of a simple tone cannot be analyzed further, and any aural sensation is analyzed by the ear into simple tones. This is a far-reaching generalization that expresses a basic feature of the aural sense, and that experiment proves to be valid. It was very well substantiated by Hermann von Helmholtz.
The phase differences between the partials in a complex sound do not influence its perception. The aural sense does not distinguish waveshapes, only the content of harmonics and partials. Curiously, phase differences between sounds received in the two ears are used for source location in binaural hearing. It should be recognized that phase differences are only significant between signals of exactly the same frequency; otherwise the phase is subject to continual change.
For a simple tone of any frequency f, sensation begins when the pressure reaches a level called the threshold of audibility. As the pressure is increased, the sensation becomes louder and louder until it becomes unpleasant and harsh at a level called the threshold of feeling. These thresholds are plotted in the diagram at the right. They intersect at a low frequency around 20 Hz, and again at a high frequency around 10,000 Hz, enclosing an area. This is only schematic, since it is difficult to determine the curves close to the frequency limits of audibility. Any point in this area corresponds to an aural sensation of a certain pitch and loudness. These curves vary with different ears, but this shows the general behavior. Some say there is aural sensation from as low as 16 Hz to as high as 28,000 Hz, but the limits are very difficult to determine. The lower limit, in particular, is subject to effects resulting from the large amplitude of vibration and possible nonlinearities. The threshold of hearing rises at higher frequencies for older people, significantly above 8000 Hz. A much smaller bandwidth of 100 Hz to 5000 Hz includes most speech and music, however, and few ears would notice restriction to this bandwidth.
The sensitivity of the ear is quite remarkable. The ultimate sensitivity, at 3500 Hz, appears to be around 8 × 10⁻⁵ μbar, rms. This corresponds to an energy flux of 1.55 × 10⁻¹⁷ W/cm², a particle velocity of 2.74 × 10⁻⁶ cm/s, a condensation (Δρ/ρ) of 8.07 × 10⁻¹¹, and a particle displacement of 1.25 × 10⁻¹⁰ cm, which is only 0.0125 Å, about a hundredth of the diameter of a hydrogen atom! The ear is so sensitive that any further sensitivity would pick up the white noise of the molecular density fluctuations, or Brownian movement. The fact that we cannot hear as well as a cat is due more to a lack of attention than to lesser sensitivity. The extremes of pressure sensitivity of the ear, from 0.00008 μbar to 3000 μbar, span a ratio of 3.75 × 10⁷, or 151 dB, an enormous dynamic range. Our senses handle such enormous dynamic ranges by logarithmic response.
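The figures in this paragraph can be checked roughly with plane-wave relations. The sketch below is only an order-of-magnitude check under stated assumptions: it takes the impedance of air as the 42.6 quoted later in this article, assumes a sound speed of 3.44 × 10⁴ cm/s (not given in the text), and treats the velocity and displacement as peak amplitudes. With these choices the results land within several percent of the quoted values.

```python
import math

P_RMS = 8e-5     # threshold pressure at 3500 Hz, dyne/cm^2 (ubar) rms, from the text
FREQ = 3500.0    # Hz, from the text
R_AIR = 42.6     # wave impedance of air, g/(cm^2 s), value quoted later in the article
C_AIR = 3.44e4   # speed of sound in air, cm/s (ASSUMED, not stated in the text)


def threshold_quantities(p_rms=P_RMS, freq=FREQ, r=R_AIR, c=C_AIR):
    """Return (flux in W/cm^2, peak velocity in cm/s, condensation, peak displacement in cm)."""
    flux_w = (p_rms ** 2 / r) * 1e-7            # erg/(s cm^2) converted to W/cm^2
    u_peak = math.sqrt(2) * p_rms / r           # peak particle velocity
    condensation = u_peak / c                   # fractional density change
    displacement = u_peak / (2 * math.pi * freq)
    return flux_w, u_peak, condensation, displacement


def dynamic_range_db(p_min=8e-5, p_max=3000.0):
    """Pressure ratio between the loudest and faintest sounds, in decibels."""
    return 20 * math.log10(p_max / p_min)


flux, velocity, condensation, displacement = threshold_quantities()
print(f"energy flux   {flux:.3g} W/cm^2")
print(f"velocity      {velocity:.3g} cm/s")
print(f"condensation  {condensation:.3g}")
print(f"displacement  {displacement:.3g} cm")
print(f"dynamic range {dynamic_range_db():.1f} dB")
```

The displacement is left in centimetres; since 1 Å is 10⁻⁸ cm, the conversion is a single division.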
The intensity of sound waves is usually specified by decibels (dB) above a certain reference level, usually 0.0002 dyne/cm² or μbar rms, through the formula dB = 20 log (p/0.0002). This is often called sound pressure level, or SPL. Ordinary conversation occurs at an SPL of 50-60 dB. Power level can also be used, with a reference level of 10⁻¹⁶ W/cm², with dB = 10 log (P/10⁻¹⁶). The two reference levels are almost equivalent, with 0.0002 μbar rms corresponding to 0.95 × 10⁻¹⁶ W/cm² in air. The power in a sound wave, in terms of the rms overpressure p, is p²/r erg/s-cm², where r is the wave impedance 42.6 g/cm²-s. To convert to watts, 1 erg/s = 10⁻⁷ W.
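These conversions are easy to mechanize. The following is a minimal sketch using only the reference levels and the impedance value given above; the function names are our own, chosen for illustration.

```python
import math

P_REF = 0.0002   # reference pressure, ubar (dyne/cm^2) rms
I_REF = 1e-16    # reference intensity, W/cm^2
R_AIR = 42.6     # wave impedance of air in CGS units, as quoted in the text


def spl_db(p_rms):
    """Sound pressure level in dB for an rms overpressure in ubar."""
    return 20 * math.log10(p_rms / P_REF)


def intensity_w_cm2(p_rms):
    """Intensity of a plane wave: p^2/r in erg/(s cm^2), converted to W/cm^2."""
    return (p_rms ** 2 / R_AIR) * 1e-7


def power_level_db(intensity):
    """Power level in dB for an intensity in W/cm^2."""
    return 10 * math.log10(intensity / I_REF)


for p in (0.0002, 0.02, 0.2):   # example pressures in ubar
    i = intensity_w_cm2(p)
    print(f"p = {p} ubar: SPL = {spl_db(p):.1f} dB, power level = {power_level_db(i):.1f} dB")
```

The two scales differ by only a fraction of a decibel, which is the sense in which the reference levels are almost equivalent.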
A musical note, as played by an instrument or sung by the voice, is said to have three characteristics: pitch, loudness and quality. All of these characteristics are sense perceptions, not physical realities. The pitch of any note can be matched by the pitch of a simple tone of a certain frequency f, called the fundamental frequency. The loudness can be matched by the loudness of this simple tone. To some degree, the loudness can even be matched by the loudness of a tone of a different fundamental, but with less assurance. Quite often the pitch and loudness are assumed to be exactly the same as the physical frequency and pressure of the equivalent simple tone, but this is not rigorous. The pitch of a sound is simpler than the hue of a color, to which it is analogous, since it is one-dimensional. Nevertheless, pitch and frequency are not the same thing, and neither are intensity and loudness.
The pitch of a low note has been observed to decrease as the sound becomes louder. On the other hand, the pitch of a high note seems to increase with intensity. A curious cyclic sequence of notes can be created that seems always to increase or decrease in pitch, like one of Escher's staircases, as the notes are played in sequence, but never gets anywhere. Of course, these are not simple tones. We are pretty sure, however, that we can always arrange any set of simple tones in a monotonic scale of pitch, and that there is a close correlation between frequency and pitch. Assuming that the smallest detectable pitch increment at any frequency represents the same increment of pitch sensation, a fully subjective pitch scale can be established, independent of the physical frequency. One such scale is in mels, with 1000 mel = 1000 Hz. This program is rather difficult to carry out, and its utility is not very great.
A string of a certain length, tension and linear mass density will vibrate in simple harmonic motion, and produce a simple tone. It has been known since Pythagoras' time that harmonic tones are produced by string lengths that are in the ratios of small integers. Galileo (1564-1642) found that the frequency of vibration is inversely proportional to the length of the string, proportional to the square root of the tension, and inversely proportional to the diameters of wires of the same material (that is, inversely proportional to the square root of the linear mass density). M. Mersenne (1588-1648) also was studying sound, and published first, so these are known as Mersenne's Laws. Mersenne determined the frequencies of organ pipes by comparison with a brass wire, and was famous as well for measuring the speed of sound.
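The three proportionalities combine into the familiar formula for the fundamental of an ideal flexible string, f = (1/2L)·√(T/μ), where L is the length, T the tension and μ the linear mass density. The formula is supplied here and does not appear in the text above; the string parameters in the sketch are hypothetical values picked only to exercise it.

```python
import math


def string_fundamental(length_m, tension_n, mu_kg_per_m):
    """Fundamental frequency in Hz of an ideal stretched string (SI units)."""
    return math.sqrt(tension_n / mu_kg_per_m) / (2 * length_m)


# Hypothetical string: 0.65 m long, 70 N tension, 1 g per metre.
base = string_fundamental(0.65, 70.0, 0.001)
print(f"fundamental:        {base:.1f} Hz")
print(f"half the length:    {string_fundamental(0.325, 70.0, 0.001):.1f} Hz")
print(f"four times tension: {string_fundamental(0.65, 280.0, 0.001):.1f} Hz")
print(f"four times density: {string_fundamental(0.65, 70.0, 0.004):.1f} Hz")
```

Halving the length or quadrupling the tension doubles the frequency, while quadrupling the linear density (doubling the wire diameter) halves it.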
A string sounds its fundamental when vibrating as a whole with a loop, or maximum amplitude, at the centre, and nodes at its ends. If held at the centre, half the string will vibrate at twice the frequency, and if at a 1/3 point, at three times the frequency. Because they corresponded to vibrations of parts of the string, these higher frequencies were called partials. Another term was overtones, or harmonics. Harmonics may be restricted to frequencies that are a multiple of the fundamental; otherwise the term "partial" is better usage. To a physicist, the fundamental was the first harmonic, and the double frequency was the first overtone or second harmonic. To a musician, the first harmonic is the first overtone. The ratios of these frequencies were as 1:2:3:..., the integers. Bells and bars do not have harmonic partials, as do strings and organ pipes.
It was found in every case that musical harmonies of every kind were determined by the ratios of frequencies, not by additive relationships. The octave is a ratio of 2:1 and the fifth is 3:2, for example. A musical interval was a ratio, not a difference, of frequencies. Musical intervals can, therefore, be measured in logarithmic units, so that the composition of ratios would be the addition of intervals. If I is the measure of an interval, or ratio of frequencies, then we might set I = k log (f/f'), where I is the interval between frequencies f and f', and k is an arbitrary constant. By choosing k, we can make the interval between two pitches any number we please. It does not matter whether we use natural or common logarithms (it merely affects the value of k), so common logarithms will be used as it seems the usual practice. For musical scales, see Music.
If we are measuring an interval I in octaves, then 1 = k log 2, or k = 3.32. The interval between 440 Hz and 1000 Hz is then 1.18 octaves. The modern even-tempered musical scale divides the octave into 12 parts called semitones. If I is to be in semitones, then, 12 = k log 2, or k = 39.86. For a very fine division, we may want each semitone to be divided into 100 cents. Then, 1200 = k log 2, or k = 3986. An octave can also be divided into 1000 parts (millioctaves), so 1000 = k log 2, or k = 3322. The savart takes k = 1000 directly, so that an octave is 1000 log 2, or about 301 savarts.
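Rather than tabulating k for each unit, one function can take the number of units per octave as a parameter. This is a small sketch of I = k log (f/f') with k fixed by the condition that an octave equals the chosen number of units; the function name and argument names are our own.

```python
import math


def interval(f_high, f_low, units_per_octave=1.0):
    """Interval between two frequencies, with k = units_per_octave / log10(2)."""
    k = units_per_octave / math.log10(2)
    return k * math.log10(f_high / f_low)


print(f"440 to 1000 Hz: {interval(1000, 440):.2f} octaves")
print(f"440 to 1000 Hz: {interval(1000, 440, 12):.2f} semitones")
print(f"440 to 660 Hz:  {interval(660, 440, 1200):.1f} cents")
```

Because the measure is logarithmic, stacking two intervals adds their values: the interval from 440 Hz to 660 Hz plus that from 660 Hz to 880 Hz is exactly one octave.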
The smallest frequency increment Δf that can just be detected as a change in pitch is proportional to the frequency. In fact, experiment shows that Δf/f = 0.003 over the mid-frequency range where pitch is well determined. Now, 1 + Δf/f = (f + Δf)/f, so the interval is I = k log(1.003). In cents, this is 5.2 cents, or about 5% of a semitone.
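The same arithmetic in code, as a quick check: the sketch converts the fractional increment Δf/f into cents and into a fraction of a 100-cent semitone, and lists the corresponding Δf at a few frequencies. The frequencies in the loop are arbitrary examples.

```python
import math

CENTS_PER_OCTAVE = 1200


def jnd_cents(fraction=0.003):
    """Interval in cents corresponding to a fractional frequency change df/f."""
    return CENTS_PER_OCTAVE * math.log2(1 + fraction)


cents = jnd_cents()
print(f"just-noticeable step: {cents:.2f} cents = {cents / 100:.3f} semitone")

for f in (250, 440, 1000, 2000):   # example frequencies in Hz
    print(f"at {f} Hz the step is about {0.003 * f:.2f} Hz")
```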
The loudness of a sound is related to its physical intensity in terms of overpressure in μbar, or to its intensity in W/cm², but the relation is not a simple or obvious one, even for sounds of the same frequency. For a given frequency, the loudness is approximately proportional to the logarithm of the intensity, in accordance with the Weber-Fechner law. Just as in the case of vision, there are significant departures from this law.
The problem arises when we consider different frequencies. An intensity that is 20 dB above threshold at 1000 Hz is not even audible at 100 Hz. Fortunately, different observers can agree when tones of different frequency are equal in loudness. That is, the intensity of the 100 Hz signal can be increased until it becomes as loud as a 1000 Hz signal that is 20 dB above threshold. A 100 Hz signal of this intensity is then said to have a loudness of 20 phons. In this case, the threshold at 100 Hz is 37 dB, while the 20 phon level is 52 dB, 15 dB above threshold (all re 0.0002 μbar rms). This can be carried out for a number of frequencies, and for various phon levels, and curves plotted.
Phons relate the loudness of different frequencies, but unfortunately are not accurately proportional to perceived loudness, and so cannot be added to give an overall loudness of a complex sound. Psychophysical experiments can relate phons to loudness to establish a fully subjective measure that is additive. One sone is the loudness of a 1000 Hz tone at 40 dB intensity. A sound of 2 sone should seem twice as loud as one of 1 sone. Above 40 phon, log L(sone) = 0.033(phon - 40), approximately, but the loudness drops more quickly for weaker sounds. This does not agree with Fechner's Law, but makes loudness about proportional to the 1/3 power of the intensity. In practice, loudness is usually measured by a weighted average of intensity, and expressed as SPL.
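The approximate relation above 40 phon can be sketched as follows. The code uses only the formula log L = 0.033(phon - 40) from the text and refuses levels below 40 phon, where the text says the relation no longer holds; the function names are our own.

```python
import math

SLOPE = 0.033   # coefficient from the text, valid above 40 phon


def sones_from_phons(phon):
    """Loudness in sones for a loudness level in phons (40 phon and above only)."""
    if phon < 40:
        raise ValueError("the approximation applies only at 40 phon and above")
    return 10 ** (SLOPE * (phon - 40))


def phons_from_sones(sone):
    """Inverse relation, valid for 1 sone and above."""
    if sone < 1:
        raise ValueError("the approximation applies only at 1 sone and above")
    return 40 + math.log10(sone) / SLOPE


for level in (40, 50, 60, 80):
    print(f"{level} phon is about {sones_from_phons(level):.1f} sone")
```

Since 10 dB is a factor of 10 in intensity, a slope of 0.033 per dB means loudness grows as intensity to the power 0.33, which is the cube-root behavior mentioned above.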
In speech and music, pitch and loudness are significant, but the property of quality, or timbre, is supreme. A simple tone carries no language information, and is a very plain and uninteresting musical statement. It was earlier assumed that a tone of the fundamental frequency carried most of the energy, and its quality was modified by the higher harmonics that it contained. This is, in a way, true, but the facts are often very different. When an oboe sounds a note, there is practically no energy in the apparent fundamental tone. Most of the energy is in the 4th and 5th harmonics, with smaller amounts in the 6th to 12th. The pitch of the note that it plays is created in the mind; the corresponding frequency hardly appears in the emitted sound. The clarinet shows fairly strong 1st and 3rd harmonics, but then most of the remaining energy is in the 7th to 11th harmonics. Only the flute agrees with our naive expectations, with most of the energy in the fundamental, and small contributions from each of the higher harmonics. If you played an A, 440 Hz, on each of these instruments into a filter that removed everything below 1000 Hz, the oboe would sound as before, the clarinet would have a strange tinny sound, while the flute would be silent. This gives some indication of the difficulty of specifying the quality of a musical note.
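The filter experiment can be imitated with a toy calculation. In the sketch below the relative energies assigned to each harmonic are hypothetical numbers invented to follow the verbal description, not measurements, and the common spacing of the surviving partials is used as a crude stand-in for the pitch the mind assigns, which is an assumption of this illustration rather than a model of hearing.

```python
from functools import reduce
from math import gcd

FUNDAMENTAL_HZ = 440   # the A used in the text
CUTOFF_HZ = 1000       # high-pass cutoff used in the text

# HYPOTHETICAL relative energies by harmonic number (each set sums to 1).
SPECTRA = {
    "oboe": {4: 0.35, 5: 0.35, 6: 0.08, 7: 0.06, 8: 0.05,
             9: 0.04, 10: 0.03, 11: 0.02, 12: 0.02},
    "clarinet": {1: 0.30, 3: 0.25, 7: 0.10, 8: 0.10, 9: 0.10, 10: 0.08, 11: 0.07},
    "flute": {1: 0.90, 2: 0.05, 3: 0.03, 4: 0.02},
}


def high_pass(spectrum, f0=FUNDAMENTAL_HZ, cutoff=CUTOFF_HZ):
    """Keep only the harmonics whose frequency n*f0 lies at or above the cutoff."""
    return {n: e for n, e in spectrum.items() if n * f0 >= cutoff}


def surviving_fraction(spectrum, f0=FUNDAMENTAL_HZ, cutoff=CUTOFF_HZ):
    """Fraction of the original energy that passes the filter."""
    return sum(high_pass(spectrum, f0, cutoff).values()) / sum(spectrum.values())


def common_spacing(spectrum, f0=FUNDAMENTAL_HZ, cutoff=CUTOFF_HZ):
    """Largest frequency of which every surviving partial is a whole multiple."""
    freqs = [n * f0 for n in high_pass(spectrum, f0, cutoff)]
    return reduce(gcd, freqs) if freqs else None


for name, spectrum in SPECTRA.items():
    print(f"{name:9s} keeps {surviving_fraction(spectrum):.0%} of its energy; "
          f"surviving partials are multiples of {common_spacing(spectrum)} Hz")
```

With these made-up spectra the oboe loses nothing, the clarinet loses its fundamental, and the flute keeps only a few percent of its energy, yet in every case the surviving partials are still spaced at 440 Hz.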
Hearing is subject to illusions, as is sight, and mental processing is very important. Again, the point is to provide us with useful information about our surroundings, and our chances of staying alive in the near future. One interesting example is that the apparent pitch of a bell, called the strike tone, often does not correspond to any of the normal modes of vibration of the bell. Curiously, it is usually an octave below the fifth partial tone. The first, or lowest, partial tone corresponds to the simple vibration of the bell with four nodal meridians, and is called the hum tone. A goblet vibrates in similar modes when stroked around the rim with a wet finger. The partial tones of a bell are not harmonics of a fundamental, but are incommensurate.
When the same sound reaches us by different pathways, the weaker delayed signals are usually eliminated by the auditory system as unimportant, an effect known as precedence, or the Haas effect. This protects us from continually hearing echoes. Sometimes the delayed signals may even contain more energy than the direct one. If the time delay is too great, or the delayed sound too loud, we do hear an echo. Precedence occurs for delays less than about 40 ms in complex sounds, but only 5 ms for clicks. Precedence complicates the hearing of stereophonic music (in which phase clues are eliminated as far as possible to minimize the criticality of location). Visual clues interact with auditory clues as to sound location; the sound will be perceived to come from its logical sources (such as the image of an actor in cinema) instead of from actual sources (loudspeaker behind the screen).
In speech, the maximum acoustic power is around 160 Hz, with 50% of the energy below 350 Hz. Nevertheless, speech passed through a low-pass filter with a cutoff at 500 Hz is an unintelligible rumbling noise. If the same speech is passed through a high-pass filter of the same cutoff, so that everything below 500 Hz is removed, the speech is perfectly intelligible and normal. The telephone bandwidth of 300 Hz to 3000 Hz is carefully designed to pass the maximum possible intelligibility in the minimum possible bandwidth. I say designed, but this also happens to be the bandwidth of the early carbon granule transmitter, which originally determined the choice. A very fortunate coincidence!
Speech consists of vowel sounds formed in the resonating chambers of the mouth, nose and pharynx, excited by puffs of air from the vocal cords, and modified by consonantal sounds that distinguish the phonemes, or units of communication. Consonants are formed by the lips, teeth, tongue and palate. Vowels are generally the superposition of a lower tone, from 400 to 800 Hz, and a higher tone, from 800 Hz to 2400 Hz. It is the higher tones that generally distinguish the vowels from each other. Consonants depend on higher frequencies, introducing and terminating the vowel sounds. The highest frequencies are necessary to distinguish s, f and th, unvoiced, and voiced as z, v and ð.
The audibility of one sound in the presence of others is a subject of considerable practical and theoretical interest. We have a remarkable ability to identify and concentrate on one component of a complex sound, though the skill is usually poorly developed. The different partials in a musical note can be heard separately by trained observers. There is no analogous capability in vision; it would be like seeing the red and green separately in yellow, or whether the yellow was a pure spectral color or a mixture.
If we are listening to a certain intensity of one pure tone, say 440 Hz, and another tone at, say, 1000 Hz is slowly increased in intensity from zero, there comes a point at which the 1000 Hz tone becomes perceptible. This intensity level is greater than the threshold for 1000 Hz alone. If the 1000 Hz is first detected at an intensity of 20 dB, then we say that the 440 Hz tone masks the 1000 Hz tone by 20 dB. If the masking is measured at a number of frequencies both below and above 440 Hz, a masking curve can be plotted of dB vs. frequency, as shown schematically in the diagram. The dip at 440 Hz is due to the appearance of beats in this region.
The general characteristics of masking by a simple tone are that the masking increases with the intensity of the tone, and that the masking of higher frequencies is greater than that of lower frequencies. Masking by a continuous spectrum, as in the important case of noise, is more complicated. For the masking of any frequency, there is a critical bandwidth depending on frequency such that noise outside this bandwidth has little effect. The minimum critical bandwidth is about 35 Hz at around 400 Hz, rising to 60 Hz for masking 100 Hz, and 600 Hz for masking 10 kHz.
It is a very great advantage to be able to concentrate on one sound in the presence of others, and the mind has developed this power as far as possible. In the case of masking, however, it seems that the mind cannot extract the necessary information from what is supplied to it, and so masking must be caused mainly by the characteristics of the ear. That is, the masking tone affects the basilar membrane so that it is not as sensitive to the masked tone, so that masking is a physical, not a mental, effect.
We are equipped with two ears, just as with two eyes. The first thing we note is that we are not always hearing two sounds with slight time delays, which is what the ears actually receive. As in the case of vision, the mind presents us with a single, fused image, but may use the additional information offered by two sensors in various ways. Our binocular vision gives us accurate depth perception within the range of our hands, which is very useful though seldom appreciated. This facility rests on the accurate spatial registration of images of nearby objects with respect to the distant background. The ears provide us only with time and intensity information, which cannot locate objects as accurately. In spite of this, estimates of the locations of sound sources can often be made with surprising accuracy. If you try to become aware of this facility, you will be surprised by the accuracy of the sense of location. Binaural reception aids the discrimination of sounds in certain directions so they are not masked by others. Cats seem very adept at using this information. With no more sensitive ears than our own, though with ears sensitive to the higher frequencies that make location easier, they seem to perceive an auditory picture of their surroundings, and can follow objects by sound when we are totally at a loss. Ordinary human life appears to make few demands on the remarkable properties of our senses, which seem largely wasted on us. We assume birds have better sight, and cats have better hearing, while they both just pay more attention.
Rayleigh showed that intensity information plays only a small role in localization. The pinna of the ear can also be moved to affect received intensity, giving directional information, though this is not important in humans with relatively immobile and (usually) flat-lying ears. Some people can also detect the presence of nearby sound-reflecting objects by a kind of echo location. In this case, the objects seem to be directly felt in some way. This sense is naturally best-developed in the blind. The head diffracts sound of wavelengths comparable to its diameter and smaller, creating sound shadows, and also intensification. This is effective mainly for frequencies higher than 1 kHz (wavelength about one foot), and can amount to as much as 20 dB. The head forms a low-pass filter for sound.
The major role in source localization is played by arrival time or phase differences between the two ears. For a pure tone, phase differences are the only possible time information, while amplitude variations may give clues with complex sounds. Right-left localization is usually easy, but front-back discrimination is difficult. Localization is easier for high frequencies than for low. Head movements can reduce ambiguity in sound localization. Binaural beats occur when slightly different frequencies are applied to the two ears; this shows sensitivity to relative phase between the ears, and occurs only for low frequencies, below 1 kHz.
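A simple two-point model gives a feeling for the quantities involved. The sketch below treats the ears as two points a fixed distance apart with no head between them, so that the extra path to the farther ear is d sin θ for a distant source at angle θ from straight ahead. The ear spacing of 0.20 m and the sound speed of 343 m/s are assumed values, and the model is ours, not taken from the text.

```python
import math

EAR_SPACING_M = 0.20       # ASSUMED distance between the ears
SPEED_OF_SOUND = 343.0     # m/s, ASSUMED room-temperature value


def interaural_delay(angle_deg):
    """Arrival-time difference in seconds for a distant source at the given angle.

    The angle is measured from straight ahead; 90 is directly to one side
    and angles beyond 90 lie behind the listener.
    """
    return EAR_SPACING_M * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND


def interaural_phase(freq_hz, angle_deg):
    """Phase difference in radians between the ears for a pure tone."""
    return 2 * math.pi * freq_hz * interaural_delay(angle_deg)


def phase_is_unambiguous(freq_hz, angle_deg):
    """True when the phase difference is under half a cycle."""
    return abs(interaural_phase(freq_hz, angle_deg)) < math.pi


print(f"delay at 90 degrees: {interaural_delay(90) * 1e6:.0f} microseconds")
print(f"front (30 deg) and back (150 deg) delays: "
      f"{interaural_delay(30) * 1e6:.0f} and {interaural_delay(150) * 1e6:.0f} microseconds")
for f in (250, 500, 1000, 2000):
    print(f"{f} Hz from the side: unambiguous phase = {phase_is_unambiguous(f, 90)}")
```

In this model a source at 30° in front and one at 150° behind give identical delays, which is one way to see why front-back discrimination is difficult, and the phase of a tone from the side stops being unambiguous somewhere below 1 kHz, in keeping with the low-frequency limit for binaural beats.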
Stereophonic, or two-channel, sound popular in "hi-fi" is really something of a fraud. It most certainly is not equivalent to binaural hearing, and is called the pseudostereo effect. It does give a much better presence to the sound emitted, and the shifting of prominence from one speaker to the other gives an interesting spatial effect. More channels give an even better effect. Actually, to reproduce an orchestra faithfully in three-dimensional sound requires a microphone and speaker for every instrument. This might also be managed with an acoustic hologram analogous to a white-light hologram, but I do not think this has ever been attempted.
Some success has been obtained in presenting sound from two loudspeakers that gives a strong impression of spatial location. It is certainly not enough to use ordinary stereophonic sound with its pseudostereo effect. What is required is discussed in Scientific American, February 2002, p. 94 [there are no references to the actual recent work by Alastair Sibbald]. An important consideration is the Haas or precedence effect, in which the mind suppresses similar signals following the first within 40 milliseconds. This is, of course, an adaptation to avoid disturbing echoes. It is not a fatigue or adaptation effect, but another example of the mind's effort to interpret sensory data correctly, rather than to reproduce exactly what is sensed.
Composed by J. B. Calvert
Created 3 September 2003