Realtime speech and voice transmission on the Internet

April, 1997
Jarkko Ahonen & Arttu Laine
Helsinki University of Technology
Telecommunications Software and Multimedia Laboratory
Jarkko.Ahonen@iki.fi & Arttu.Laine@eunet.fi
 

Abstract

The last year or so has seen explosive growth in realtime voice applications that use the Internet as their transport medium. The public's interest has shown that these applications are very much in demand. The problem with today's software is that most of it uses proprietary protocols, so vendor interoperability is comparatively low. Recently some coalitions have emerged that intend to make vendor interoperability common, but it is far too early to say what the household standard will be. The ITU-T [5] has issued a recommendation that describes a protocol for realtime multimedia transmission over the Internet: recommendation H.323, which describes voice, video and data communications over a network with a non-guaranteed quality of service. This paper focuses on the part of H.323 that describes voice transmission over TCP/IP, and on other widely used audio coding standards and recommendations.


1. Introduction
2. The ITU-T H.32x recommendations
3. The ITU-T H.323 recommendation
3.1 The protocol overview
4. Common audio coding standards and recommendations
4.1 Basics of coding an analog audio signal for digital transmission
4.2 Some ways of coding a quantized sample
4.3 Common coding recommendations and standards
4.3.1 G.711 (PCM, mu-law & a-law)
4.3.2 G.722 (SB-ADPCM)
4.3.3 G.723.1
4.3.4 G.728 (LD-CELP)
4.3.5 G.729(A) (CS-ACELP)
4.3.6 I-30036 (GSM)
5. Attributes of coding realtime speech
5.1 Bit Rate
5.2 Delay
5.3 Complexity
5.4 Quality
6. Real-life software
7. Conclusions and discussion of the implications of ...
8. References

1. Introduction

This is a short introduction to low-bandwidth digital speech coding. Real-time operation over shared-capacity systems challenges both algorithms and messaging protocols. So far, plain data transmission has been sufficient in networks that do not provide Quality of Service (QoS): operations have been carried out in a virtual batch mode, and variable transmission latencies, packet loss and non-existent bandwidth management have been tolerated as part of the medium's characteristics. Today we have lightweight, low-delay algorithms for situations where computing power is limited but bandwidth is not a major concern. Other areas require the transmission to be as compact as possible, but this costs a lot of computing power, additional delay and/or power-consuming additional memory. Real-time speech and clarity at the receiving end suffer mostly from overall system delay and very heterogeneous transit routes. The choice of a speech coding algorithm always involves trade-offs, so the chosen algorithm must be the one fine-tuned for the given task. With an increasing number of standard (ITU-T [5], ETSI) and proprietary audio/voice compression algorithms, the selection process can be complicated. There is also continuing controversy regarding the CPU/bandwidth/frequency-range balance of given algorithms.

2. The ITU-T H.32x recommendations.

H.320: approved 1990. Network: narrowband switched digital ISDN. Audio: G.711, G.722, G.728. Comm. interface: I.400.
H.321: approved 1995. Network: broadband ISDN, ATM, LAN. Audio: G.711, G.722, G.728. Comm. interface: AAL I.363, ATM I.361, PHY I.400.
H.322: approved 1995. Network: guaranteed-bandwidth packet-switched networks. Audio: G.711, G.722, G.728. Comm. interface: I.400 & TCP/IP.
H.323: approved 1996. Network: non-guaranteed-bandwidth packet-switched networks (e.g. Ethernet). Audio: G.711, G.722, G.728, G.723, G.729. Comm. interface: TCP/IP.
H.324: approved 1996. Network: PSTN or POTS, the analog phone system. Audio: G.723. Comm. interface: V.34 modem.

An index of the audio coding recommendations and communication interfaces used in the various H.32x recommendations. [3]

    As can clearly be seen from the index above, the H.323 recommendation covers by far the largest number of audio codecs. In fact the audio codecs of the other recommendations are a subset of the audio codecs outlined in the H.323 recommendation. H.323 is also the recommendation made for non-guaranteed-bandwidth, packet-switched networks such as the Internet. Therefore in this paper we will focus on the H.323 recommendation and the codecs defined by it.

3. The ITU-T H.323 recommendation.

3.1 The protocol overview

    The H.323 recommendation covers the technical requirements for audio and video communication services in LANs that do not provide a guaranteed Quality of Service (QoS). H.323 references the T.120 specification for data conferencing and enables conferences that include a data capability. The scope of H.323 does not include the LAN itself or the transport layer which may be used to connect various LANs. Only elements needed for interaction with the Switched Circuit Network (SCN) are within the scope of H.323. Figure 1 [3] outlines an H.323 system and its components.

    H.323 defines four major components for a network-based communications system: Terminals, Gateways,
Gatekeepers, and Multipoint Control Units (MCUs).

 
Figure 1. Interoperability of H.323 Terminals with other H.32x terminals and networks.
 

    The H.323 recommendation actually covers a complete visual telephony system for non-guaranteed quality of service over local and wide area networks. However, the recommendation's only requirement for the end equipment is that the terminal must support voice transmission, with data and video transfer being optional. Every terminal must attempt to locate a gatekeeper and abide by the configuration that the gatekeeper responds with.

    The H.323 recommendation uses the H.245 signalling procedures for opening, closing and synchronizing the various transmission channels. The transport for the H.245 channel must be reliable (e.g. TCP or SPX).

    The H.323 recommendation does have some minimal requirements for the media over which it passes, but it does not require any guaranteed quality of service (QoS). Since the recommendation does not cover the actual LAN itself, any LAN medium that fulfils the transmission-media requirements stated in recommendation H.323 can be used to transmit voice and data within the H.323 specification. Further discussion of the transmission-media requirements of H.323 is outside the scope of this paper.

    The first commercial product to abide by the H.323 recommendation is Intel's Internet Phone. It seems that the H.323 recommendation is going to be a strong contender as the multimedia framework for the Internet, as Microsoft, Netscape, MCI and IBM have also announced support for H.323. The firewall vendors Checkpoint and Trusted Information Systems have announced H.323 proxies, and the directory services Bigfoot, WhoWhere? and Four11 have also announced support for H.323. [14]

4. Common audio coding standards and recommendations.

4.1 Basics of coding an analog audio signal for digital transmission.

4.1.1 The coding process.
The coding of an analog signal into a digital signal fit to be transmitted over low-bandwidth communication channels consists of the following steps: bandpass filtering, sampling, quantization and coding (Figure 2).


    The bandpass filter is used to limit the frequency range that we want to sample, in order to lower the cost of the sampler and to lower the bit rate and bit resolution needed for digitization of the analog signal. Since we are sampling human speech, a frequency range of roughly 200 Hz-3400 Hz is ample; this is the frequency range where normal human speech lies. Usually the bandpass filters used in speech codecs are built to filter a 4 kHz band. The other reason we want to put the signal through a bandpass filter is that signals often have significant energy or noise above the highest frequency of interest, so the bandpass filter is the first point where we filter out noise.

    The sampling process takes the analog signal, which is a continuous-time, continuous-valued signal, and turns it into a discrete-time signal. What actually happens is that the sampler takes a measurement of the signal at discrete time intervals (which determine the sampling frequency). The sampling frequency must be at least twice the highest sampled frequency (the Nyquist theorem). If we use a lower sampling frequency, we may experience an effect called aliasing, in which two different signals produce exactly the same sampled values. Note that after sampling, the continuous analog signal becomes a signal that is represented at discrete times, with the values of the samples equalling those of the original signal at those exact times.
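    As a quick illustration of the Nyquist criterion and aliasing (a minimal Python sketch, not part of any codec), the fragment below samples a 3 kHz tone first at 8 kHz, which is above twice the signal frequency, and then at 4 kHz, which is not; in the second case the samples coincide (up to a sign flip) with those of a 1 kHz tone, so the two signals can no longer be told apart.

        import math

        def sample(freq_hz, fs_hz, n_samples):
            # n_samples of a unit-amplitude sine of frequency freq_hz,
            # taken at sampling frequency fs_hz
            return [math.sin(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]

        ok = sample(3000, 8000, 8)         # 8 kHz > 2 * 3 kHz: no aliasing
        aliased = sample(3000, 4000, 8)    # 4 kHz < 2 * 3 kHz: aliases to 1 kHz
        one_khz = sample(1000, 4000, 8)
        print([round(a, 6) for a in aliased])    # same values as one_khz, sign flipped
        print([round(b, 6) for b in one_khz])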

    Quantization of the signal is the process where the series of samples obtained in the sampling process is assigned a numerical value or quantity (hence the name quantization). In quantization, each analog voltage level is assigned one of 2^B values, and each binary value corresponds to a specific voltage level. The number of bits (B in this case) determines the level of error that is introduced during quantization. The error in question comes from the difference between the original analog signal and the approximation obtained by selecting the level closest to the actual analog value. This error is called the quantization error.
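    A minimal sketch of uniform quantization (illustrative Python only, not any particular codec): each added bit doubles the number of levels and roughly halves the worst-case quantization error.

        def quantize(x, bits, vmin=-1.0, vmax=1.0):
            # Map the analog value x onto one of 2**bits levels spanning
            # [vmin, vmax] and return the voltage of the chosen level.
            levels = 2 ** bits
            step = (vmax - vmin) / levels
            index = min(int((x - vmin) / step), levels - 1)
            return vmin + (index + 0.5) * step

        x = 0.4137                          # an arbitrary analog sample value
        for bits in (3, 8, 16):
            y = quantize(x, bits)
            print(bits, y, abs(x - y))      # the quantization error shrinks as bits grow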
4.1.2 Reducing the required sampling frequency.
    Because human perception of voice is much more fine-grained at low frequencies than at high frequencies, we can "afford" to lose more information from the higher-frequency parts of the signal than from the lower-frequency parts. Therefore, in order to lower the number of samples per second (which directly affects the number of bits per second needed for transmission over our communications channel), we can slice our frequency band into smaller sub-bands and sample each sub-band at a different sampling frequency. The sampling frequency for each sub-band is chosen so that the perceived quality of the signal does not decrease significantly.
4.1.3 Reducing the quantization error.
    By far the simplest method of reducing the quantization error is to increase the number of bits, thereby decreasing the voltage difference between the quantization levels. The drawback of this approach is that it also increases the number of bits that must be transmitted over our transmission channel. For low-bandwidth media this is not an acceptable solution.

    We can use the characteristics of human hearing to our advantage in the sampling process. Humans are more sensitive to changes in volume when the amplitude is low; when the amplitude gets higher, the absolute change in amplitude has to be bigger in order to be noticed. Because of this characteristic phenomenon, we can use a logarithmic amplitude scale instead of a linear one with no noticeable difference in the human perception of the sound.
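    This is exactly the principle behind the mu-law companding used in G.711 (section 4.3.1). The Python sketch below uses the continuous mu-law formula with mu = 255; the bit-exact G.711 encoder approximates this curve with piecewise linear segments, so this is an illustration of the idea rather than a reference implementation.

        import math

        MU = 255.0

        def mu_law_compress(x):
            # Map a linear sample x in [-1, 1] onto the logarithmic mu-law scale.
            return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

        def mu_law_expand(y):
            # Inverse mapping back to the linear amplitude scale.
            return math.copysign((math.pow(1.0 + MU, abs(y)) - 1.0) / MU, y)

        # Small amplitudes are spread over many code levels, large amplitudes over few:
        for x in (0.01, 0.1, 0.5, 1.0):
            print(x, round(mu_law_compress(x), 3))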

4.2 Some ways of coding a quantized sample.

    There are two main approaches to coding a quantized signal: waveform coding and vocalization of the signal.
4.2.1 The waveform coding.
    The idea behind waveform coding is to reproduce the input signal's waveform. Waveform coders are generally signal independent, so they can successfully be used to code a wide variety of signals, and they are somewhat resistant to noise and transmission errors. Their drawback, however, is a need for medium bit rates.

    Pulse Code Modulation (PCM) is the simplest type of waveform coding, as it is essentially just the quantization process. Differential Pulse Code Modulation (DPCM), on the other hand, uses a feedback mechanism where the output of the quantizer is fed back to the input. This feedback makes the DPCM coder code only the difference between successive samples, which effectively reduces the amount of information that needs to be transmitted over the communications channel. Adaptive Differential Pulse Code Modulation (ADPCM) is a more advanced version of DPCM, where the codec adapts itself to different types of signal, dynamically increasing or decreasing the resolution of the quantization steps. The quality of ADPCM coding at medium bit rates (e.g. 32 kb/s) is good.
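    The following Python sketch shows the difference-coding idea of DPCM in its simplest first-order form; real ADPCM codecs such as G.726 add adaptive prediction and adaptive step sizes, so this is only a toy illustration of the feedback loop described above.

        import math

        def dpcm_encode(samples, step=0.05):
            # Quantize the difference between each sample and the prediction,
            # where the prediction is the previously reconstructed sample.
            codes, prediction = [], 0.0
            for s in samples:
                code = round((s - prediction) / step)   # quantized difference
                codes.append(code)
                prediction += code * step               # track what the decoder will rebuild
            return codes

        def dpcm_decode(codes, step=0.05):
            out, prediction = [], 0.0
            for code in codes:
                prediction += code * step
                out.append(prediction)
            return out

        signal = [math.sin(2 * math.pi * 400 * n / 8000) for n in range(16)]
        print(dpcm_decode(dpcm_encode(signal))[:4])     # close to the original samples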
4.2.2 The vocalization process.
    The vocalization mechanism does not even try to code the waveform input to the system. Instead it tries to mimic the human speech forming process. The coders that try to vocalize a signal are often called vocoders.

    When a human produces voiced sounds (usually vowels), the vocal cords vibrate in a pulsing manner at certain frequencies. Voiced human sounds have peaks at roughly odd multiples of 500 Hz in their frequency spectrum, and these peaks are the most important parts of the spectrum for speech intelligibility. When a human produces unvoiced sounds (s, sh), the vocal cords are immobile and the sounds are produced by forcing air through various constrictions in the mouth or vocal tract (like clenched teeth etc.).

    Vocoders try to mimic these characteristics of human speech. At the coding end, the coder tries to code the human speech into a series of voiced and unvoiced sounds. The receiver then reconstructs this analysis into a simulation of the original speech. The receiver has two sound generators: a random noise generator (for unvoiced sounds) and a periodically excited generator (for voiced sounds) that are switched continuously according to the transmitted data.
 
    Vocoders are very efficient in what they do, as they usually require as little as 2 kb/s of bandwidth. The drawback with vocoders is that they are sensitive to transmission errors and the sound they produce sounds very synthetic. Also vocoders make for very lousy transmitters of music or song.
4.2.3 Hybrid solutions.
    Because the waveform reproduction mechanisms give good speech quality but have a moderately high bandwidth requirement, as opposed to the poor quality but low bandwidth requirement of vocoders, hybrid systems that try to get the best of both worlds have been engineered. The most commonly used hybrid coding techniques are Residual Excited Linear Prediction (RELP), Multi-Pulse Coding (MPC), Codebook Excited Linear Prediction (CELP), Sinusoidal Modelling (STC), Multiband Excitation (MBE) and Time-Frequency Interpolation (TFI). All of these coding algorithms have numerous variations, each optimized for some specific trait (good quality, low bandwidth, etc.). The most widely used algorithms of this type, however, are the different variations of CELP.

4.3 Common coding recommendations and standards


    The minimum audio codec requirement of H.323 is G.711 (PCM-coded voice), but in real life (PSTN modem connections) the software should also support G.723.1, since the bandwidth requirement of G.711 is too high for modern POTS modems. Optional codecs are G.722 (7 kHz audio within 64 kbps), G.723.1 (speech at 6.3 and 5.3 kbps), G.728 (speech coding at 16 kbps) and G.729 (speech coding at 8 kbps).

4.3.1 G.711 (PCM)

    G.711 (ITU-T, PCM: Pulse Code Modulation) is an internationally agreed standard that is widely used for converting analogue voice signals for transit in digital transmission networks. G.711 (toll) quality and characteristics are widely used as a reference point when new or improved algorithms are matched against the present standard for speech coding. Two sub-methods exist for US and non-US use, namely mu-law and a-law. Current PCM implementations signal at 64 kbps and carry a narrowband (roughly 4 kHz) audio signal.
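    The 64 kbps figure follows directly from the sampling and quantization parameters: an 8 kHz sample rate with 8 bits per companded sample. A trivial check (illustrative arithmetic only):

        sample_rate_hz  = 8000      # narrowband telephone speech
        bits_per_sample = 8         # one mu-law or a-law byte per sample
        print(sample_rate_hz * bits_per_sample)   # 64000 bits/s, i.e. 64 kbps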

4.3.2 G.722 (SB-ADPCM)

    G.722 (ITU-T, SB-ADPCM: Sub-Band Adaptive Differential Pulse Code Modulation) describes how medium-quality audio signals should be encoded using a variant of ADPCM [12]. SB-ADPCM is used with ISDN 64 kbps B-channel connectivity and covers frequencies up to 7 kHz. Coding delay is almost non-existent (5 ms).

4.3.3 G.723.1

    ITU-T G.723.1 (G.721 + G.723 combined) produces digital voice compression ratios of 20:1 and 24:1, operating at 6.3 kbps and 5.3 kbps respectively. The only practical difference between these two transmission speeds is the amount of processing power needed from the CPU. The frame size is 30 ms, look-ahead is 7.5 ms, and the overall one-way system delay is 95 ms. The algorithm's complexity requires 16 MIPS and 14.6 MIPS respectively, with 2.2 Kwords of RAM.

    The low bandwidth requirement is ideal for real-time Internet telephony and usage over POTS/PSTN lines. G.723.1 has become an emerging standard for cross-platform interoperability in the transmission of voice. Tests have shown quality equal to PSTN toll-quality services at roughly one tenth of the bandwidth of PCM [8]. As of March 12, 1997, the International Multimedia Teleconferencing Consortium's (IMTC) [9] Voice over IP (VoIP) Forum has recommended G.723.1 as the default low-bitrate audio coder for the overall H.323 standard.
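    The 20:1 and 24:1 compression figures quoted above appear to be relative to 16-bit linear PCM at an 8 kHz sample rate (128 kbps); relative to 64 kbps G.711 the ratios are roughly 10:1 and 12:1. A quick check (illustrative arithmetic, with the reference rate as our assumption):

        pcm_16bit_kbps = 8000 * 16 / 1000.0     # 128 kbps, 16-bit linear PCM at 8 kHz
        g711_kbps      = 64.0
        for rate in (6.3, 5.3):                 # the two G.723.1 bit rates
            print(rate, round(pcm_16bit_kbps / rate, 1), round(g711_kbps / rate, 1))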

4.3.4 G.728 (LD-CELP)

    LD-CELP (ITU-T, Low Delay Code Excited Linear Prediction) is an ITU-T variant of US Federal Standard 1016 CELP [6]. LD-CELP digitizes 4 kHz speech at 16 kbps with low delay. Speech compression is done with an analytical model of the vocal tract, and the coder also computes the error between the original voice and the modelled speech. Both the transmitting and receiving peers share a common code table for the speech model and the errors. CELP and LD-CELP are hybrids of vocoder and waveform coding, having features from both techniques. The penetration of LD-CELP applications is not widely known.

4.3.5 G.729(A) (CS-ACELP)

    G.729 (ITU-T) uses CS-ACELP coding (Conjugate-Structure Algebraic Code Excited Linear Prediction) at 8 kbps. The frame size is 10 ms and the coding delay is low (algorithmic delay is 15 ms); even the overall peer-to-peer system delay is only 35 ms. The complexity of the algorithm requires 17 MIPS of CPU and 3 Kwords of RAM.
A lighter version of G.729 is also available as G.729A. It is bit-stream compatible with G.729 and requires only 10 MIPS of CPU and 2 Kwords of RAM.

4.3.6 I-30036 (GSM)

    GSM (ETSI I- 30036, GSM:Global System for Mobile Communication) is widely used in European mobile radio networks for speech and low speed data communications.
The GSM full-rate speech codec operates at 13 kbps and uses a Regular Pulse Excited (RPE) codec with an 8 kHz sample rate. A half-rate GSM codec is also available at 7 kbps with a 5 kHz sample rate. The input speech is split up into frames 20 ms long, and for each frame a set of 8 short-term predictor coefficients is found. Each frame is then further split into four 5 ms sub-frames, and for each sub-frame the encoder finds a delay and a gain for the codec's long-term predictor. Finally, the residual signal remaining after both short- and long-term filtering is quantized for each sub-frame.
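    As a quick sanity check of these figures (illustrative arithmetic only): at an 8 kHz sample rate a 20 ms frame contains 160 input samples, and a 13 kbps bit rate leaves 260 bits to describe each such frame.

        sample_rate_hz = 8000
        frame_ms       = 20
        bit_rate_bps   = 13000
        print(sample_rate_hz * frame_ms // 1000)    # 160 samples per frame
        print(bit_rate_bps * frame_ms // 1000)      # 260 coded bits per frame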
    The GSM codec produces good speech quality, although the G.728 (LD-CELP) codec still performs slightly better at its higher bit rate. The GSM codec is lighter, and can be run in real time without a DSP or special audio hardware, even on an i486-66. [10]

5. Attributes of coding realtime speech.

5.1 Bit Rate

    The bit rate attribute is a very simple term to understand: it simply states how many bits per second are transmitted into or out of the codec to code or decode the desired signal. But it does have its own considerations that are not as simple as they might seem at first.

    Since we're transmitting speech on a shared channel, the speech coding algorithm has to take into account the fact that it cannot use all (or even most) of the available bandwidth for speech transmission. Therefore it is desirable that the peak bit rate of the voice coding be as low as possible, to allow transmission of other types of data over the communications channel without hogging it. Many speech coders code and transmit at a fixed bit rate regardless of the characteristics of the incoming signal (active speech vs. background noise or static). In a mixed data environment like modern multimedia or Internet environments this is not acceptable, so we want to use a variable-rate codec that takes into account the variable nature of the input signal.
5.1.1 Silence suppression
    The simplest way to achieve variable-rate coding is to use silence compression techniques to lower the bandwidth usage during pauses in speech, and to use fixed-bit-rate coding during active speech.

    Modern silence compression techniques consist of two algorithms used together: the voice activity detector (VAD) and the comfort noise generation (CNG) algorithm.

    The VAD algorithm tries to determine whether the input signal is speech or background noise. If the signal is speech, it is coded at the full bit rate. If the signal is found to be background noise, it is either coded at a very low bit rate (low quality) or not at all. If we do not code silence, we need the CNG algorithm to generate some background noise at the receiving end. The need for this is purely one of human preference: people like to hear a certain amount of noise from their receiver as opposed to total silence, since it brings them a comfortable feeling of "being connected".

    The use of silence suppression algorithms does bring problems, though. If the VAD's quality is low, we start to run into certain types of problems. If the VAD is not sensitive enough, it might fail to notice the beginning of speech, which results in the beginnings of sentences being cut off; this is called front-end clipping. If, on the other hand, the VAD is too sensitive, it will seldom notice the difference between background noise and actual voice, which results in inefficient bandwidth utilization. The performance (speed) and quality of the VAD algorithm become apparent when we try to encode speech in a noisy environment such as an office, where we have telephones, music, conversation etc. in the background.
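    A minimal energy-threshold VAD sketch in Python (purely illustrative; standardized VADs, such as the one defined in G.729 Annex B, use considerably more elaborate decision logic, and the parameter values here are arbitrary assumptions):

        def frame_energy(frame):
            # Mean squared amplitude of one frame of samples.
            return sum(s * s for s in frame) / len(frame)

        def is_speech(frame, noise_floor, margin=4.0):
            # Declare speech when the frame energy clearly exceeds the estimated
            # background level. Too small a margin codes noise as speech
            # (wasting bandwidth); too large a margin causes front-end clipping.
            return frame_energy(frame) > margin * noise_floor

        def update_noise_floor(noise_floor, frame, alpha=0.95):
            # Slowly track the background level during non-speech frames.
            return alpha * noise_floor + (1.0 - alpha) * frame_energy(frame)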

    The CNG has problems too (even though generating background noise is simple). If we don't keep the generation of comfort noise synchronized between the transmitter and the receiver, we may run into problems during transitions between active and inactive periods of speech. We may also run into transmission channel timeout problems. Therefore, even if we don't transmit encoded data during periods of silence, we must occasionally transmit some synchronization and keepalive packets in order to keep everyone happy.

5.2 Delay

5.2.1 The components of delay.
    The delay of speech coding components usually consists of three major components: algorithmic delay, processing delay and communications delay.

    The algorithms that are usually used in speech coding form their data stream by sampling windows (frames) of fixed length. The algorithm cannot complete the coding until a full frame has been recorded, which leads to a minimum coding delay equal to the length of the frame. The algorithms also frequently analyze data that extends beyond the current frame; this is called look-ahead, and it adds to the delay caused by using fixed-length frames. The combined delay caused by framing and look-ahead is called the algorithmic delay, and it cannot be reduced by changing the implementation of the algorithm. All other delay components can be reduced by optimizing the implementation.

    Because the data transmitted over the communications channel is radically reduced in size from the original sampled data, the sampled data has to be processed before and after transmission. The delay caused by this encoding/decoding is called the processing delay. Processing delay can be reduced with more efficient implementations of the coding algorithms and possible migration to hardware-accelerated platforms.

The sum of the algorithmic delay and the processing delay is called the one-way codec delay. 

    The third major delay component is the delay caused by the transit channel between the encoder and the decoder. This latency is caused by the transit paths (especially satellite links), router and link congestion and queuing policies. This delay is called the communications delay.

The total sum of these three major delays is called the one-way system delay.
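    As a worked example of these terms (a sketch only: the 95 ms total is quoted in section 4.3.3, but how the non-algorithmic remainder splits between processing and transmission is left open there, so the split below is our assumption to avoid):

        frame_ms       = 30.0                        # one G.723.1 coding frame
        look_ahead_ms  = 7.5                         # G.723.1 look-ahead
        algorithmic_ms = frame_ms + look_ahead_ms    # 37.5 ms, fixed by the algorithm
        total_ms       = 95.0                        # quoted one-way system delay
        print(algorithmic_ms, total_ms - algorithmic_ms)   # remainder = processing + network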

    One delay component not discussed above does not really come from the algorithm but from the implementation of the codecs: the delay caused by the heterogeneous transit latencies of the connection, where we may receive a flood of packets at one moment (possibly even overflowing our buffers) and then receive no packets at all for a certain period of time.
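    The usual remedy is a playout (jitter) buffer at the receiver, which trades a little extra delay for smooth playback. Below is a minimal fixed-delay sketch in Python, assuming each packet carries a sequence number and one fixed-length frame; real implementations adapt the buffering delay to the measured jitter.

        import heapq

        class JitterBuffer:
            # Reorders incoming packets and releases one frame per playout tick.

            def __init__(self, playout_delay_frames=3):
                self.delay = playout_delay_frames
                self.heap = []                 # (sequence number, payload) pairs
                self.next_seq = 0              # next frame the decoder expects

            def push(self, seq, payload):
                heapq.heappush(self.heap, (seq, payload))

            def pop(self):
                # Called once per frame interval by the playout clock.
                if not self.heap:
                    return None                # underrun: play silence or comfort noise
                if self.heap[0][0] > self.next_seq and len(self.heap) < self.delay:
                    return None                # wait a little for the missing/late frame
                seq, payload = heapq.heappop(self.heap)
                self.next_seq = seq + 1        # skips over a frame that never arrived
                return payload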
5.2.2 Effects of delay
    The calculated maximum delay for voice connections is around 400 ms. In tests, however, it has been noted that in order to achieve comfortable communication, the preferred one-way system delay is below 200 ms in environments without echo. If echo is brought into the equation, the maximum tolerable delay is only 25 ms [1]. The cure for this is to use good-quality echo cancellation devices, which in practice are almost mandatory.

    In situations where many participants must be connected together (such as teleconferencing), we need to bridge the different channels together. The bridge takes all the input channels, decodes them, sums the signals into one signal, encodes the sum and sends it out to all participants. Needless to say, this at least doubles the one-way system delay. So in order to be able to offer conferencing capabilities, manufacturers need to make their codecs capable of one-way delays below 100 ms. [VIITE]

5.3 Complexity

    Speech codecs are most often implemented on specialized hardware such as DSPs, although software implementations are not exactly rare. The complexity of an algorithm or implementation can be expressed as a combination of three variables: required processing power (in MIPS, Millions of Instructions Per Second), Random Access Memory (RAM) and Read Only Memory (ROM). At the moment, an algorithm requiring 15 MIPS is considered low-complexity, whereas an algorithm requiring 30 MIPS is considered complex. In the case of RAM and ROM, the issue is the higher cost resulting from a larger die size.

    Systems designers always want codecs that are fast, small and cheap. If an algorithm is complex, the implementation is more difficult and requires more power, which increases cost. On the other hand, if it is simple but requires a lot of memory, the codec takes up too much space on the system board or consumes too much power. This can be a crucial point, for example, in mobile phones, where the manufacturer with the smallest phone has the upper hand on the market.

5.4 Quality

5.4.1 About quality
    Quality may be (and most probably is) the most difficult and multifaceted of the four attributes. The importance of this attribute varies greatly from person to person and from application to application, so its weight in the decision of which codec to use should be weighed against the actual use. The thing to remember, however, is that the quality-of-speech specification for any codec assumes ideal conditions: clean speech, no transmission errors, a single encoding and no background noise.

    The differences begin to arise when we take into account packet loss, background noise, distortion, transmission errors etc. For example, an algorithm may or may not detect single-bit errors or the loss of whole frames. The algorithm may handle the loss of frames with silence, static, waveform substitution or some more advanced method of "taking a good guess" at the lost frame(s). The quality may also degrade when the speech is encoded and decoded several times in series while passing through audio bridges or gateways. And what about people talking in a language that the algorithm or codec was not designed for (Finnish or Japanese phonemes differ greatly from Anglo-Saxon phonemes)?
5.4.2 Testing the quality of speech coding
    To test the quality of an algorithm or a codec, there really isn't any objective scientific method, since the ultimate juror is the user. The most widely used test is the absolute category rating (ACR) test. The subjects listen to 8-10 seconds of speech material and are then asked to rate the quality of what they have just heard. The rating usually uses a five-point scale ranging from 1 to 5, with bad equalling 1 and excellent equalling 5. Toll quality, the quality of speech heard through a normal telephone line, is rated at about 4. After the ratings have been recorded, a mean opinion score (MOS) is calculated as the average of the listeners' scores. The problem with these kinds of tests is twofold. The tests are always run in perfect conditions, so we don't get the "real life" values for any specific codec. A codec getting an excellent (5) rating in ACR testing may get a poor (2) rating in real life because its silence suppression technique is inferior by comparison. On the other hand, a codec getting "only" a good (4) rating in ACR testing may still get a rating between fair (3) and good (4) in real-life conditions. Now which would you choose for your application? The argument can also be turned around the other way: if test conditions are deliberately made poor to test the differences between codecs in real life, there may be nothing separating various implementations or algorithms, whereas in optimum conditions the differences might be dramatic.
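    The MOS itself is nothing more than the arithmetic mean of the listeners' category ratings; with hypothetical scores made up purely for illustration:

        ratings = [4, 5, 3, 4, 4, 5, 4, 3]      # 1-5 ACR scores from eight listeners
        mos = sum(ratings) / len(ratings)
        print(round(mos, 2))                    # 4.0, i.e. roughly toll quality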

    Because of this, two new tests have been developed, namely the degradation category rating (DCR) test and the comparative category rating (CCR) test. Neither of these two tests has gathered much support to date. The DCR test has failed to gain acceptance because it introduces another type of problem while solving the problem with ACR testing: as DCR compares the original sample to the coded one and asks the listener to rate how much the sample was degraded during coding, the listeners usually equate any difference in sound with degradation, which is contrary to what is desired. CCR testing, on the other hand, takes a slightly different approach: the listeners are asked to rate whether the coded sample sounds better, the same or worse than the original sample. The CCR test uses a seven-point scale with three points for better, three for worse and one for "no change". The CCR test was first used in the G.729 characterization testing phase.

Comparison of various 8 kHz mono audio formats
Courtesy of [http://xanadu.com.au/sc/audio.html]

16-bit PCM: .wav or .aiff; 128 kbps; 960K per minute; compression 1:1; sound quality 5; relative compression speed N/A; players: Win, Mac, Unix; higher sample rates: yes; streaming playback file: none.
G.711 mu-law: .au; 64 kbps; 480K per minute; compression 2:1; sound quality 4; relative compression speed 10; players: Win, Mac, Unix; higher sample rates: yes, but rarely used; streaming playback file: none.
32 Kbps MPEG-1: .mpa or .mp2; 32 kbps; 240K per minute; compression 4:1; sound quality 4; relative compression speed not tested; players: Win, Mac, Unix; higher sample rates: yes; streaming playback file: none.
IMA/DVI ADPCM: .wav; 32 kbps; 240K per minute; compression 4:1; sound quality 3; relative compression speed 1.2; players: Win, Mac, some Unix; higher sample rates: yes; streaming playback file: none.
GSM 06.10: .gsm; 13.2 kbps; 96K per minute; compression 10:1; sound quality 2; relative compression speed 0.75; players: Win, Mac, Unix; higher sample rates: no; streaming playback file: .gsd.
InterWave VSC112: .vmf; 11.2 kbps; 82K per minute; compression 11:1; sound quality 2; relative compression speed 3; players: Win; higher sample rates: yes; streaming playback file: .vmd; credits in .vmd.
TrueSpeech 8.5: .wav; 8.5 kbps; 62K per minute; compression 15:1; sound quality 2; relative compression speed 0.5; players: Win, Mac; higher sample rates: no; streaming playback file: .tsp.
RealAudio v1.0: .ra; 8 kbps; 59K per minute; compression 16:1; sound quality 1; relative compression speed 0.2; players: Win, Mac, Unix (v2.0 only); higher sample rates: in v2.0; streaming playback file: .ram; credits in .ra file.
ToolVox for the Web: .vox; 2.4 kbps; 18K per minute; compression 53:1; sound quality 0-3; relative compression speed 0.25; players: Win, Mac (Unix promised); higher sample rates: in other products; streaming playback file: none.


6. Real-Life software.

6.1 Cooltalk by Insoft.

    CoolTalk uses the RT-24, FR-GSM or HR-GSM codecs. RT-24 uses an 8 kHz sample rate with a bandwidth requirement of 2.4 kbps. RT-24 requires a Pentium-class machine to work properly and is designed solely for speech; therefore music and singing do not work properly with the RT-24 codec. The RT-24 codec used by CoolTalk was developed by Voxware.

6.2 Codecs by Voxware.

MetaSound Family of Codecs

Codec   Bit rate (kbps)   Sampling freq. (kHz)   Audio bandwidth   Encode*   Decode*
AC8     8                 8                      4 kHz             48        -
AC10    10                11                     5.5 kHz           60        -
AC16    16                16                     8 kHz             98        24
AC24    24                22                     11 kHz            131       10

*) % of a 133 MHz Pentium machine's processor usage (- = not available).

The MetaSound family of codecs is optimized for audio and music and uses a waveform algorithm.

Voxware's speech codecs are the RT24 and RT29HQ, both of which are vocoders. The RT24 is an 8 kHz sample rate, 2.4 kbps bit rate codec, while the RT29HQ has a 3 kbps bit rate. The RT29HQ is a newer version of the RT24 and has better sound quality. Objective measurement data could not be found at the time of this writing.

6.3 Streamworks, XingTech

MPEG-1 Audio Streams (Layer II)
  • Per-channel bit rates: 32, 48, 56, 64, 80, 96, 112, 128, 160 or 192 kbits/s
  • Modes: mono, joint stereo
  • Joint stereo bands: 4, 8, 12, 16
  • Sample rates: 32, 44.1 or 48 kHz

MPEG-2 Audio Streams (Layer II)
  • Per-channel bit rates: 8, 16, 24, 32, 40, 48, 56, 64 or 80 kbits/s
  • Modes: mono, joint stereo
  • Joint stereo bands: 4, 8, 12, 16
  • Sample rates: 16, 22.05, 32, 44.1 or 48 kHz

LBR Audio Streams
  • Bit rates: 8, 9, 10, 11, 12, 13, 14, 15 or 16 kbits/s
  • Modes: mono
  • Sample rates: 8 kHz

8. References

      [1] Richard V. Cox & Peter Kroon. Low Bit-Rate Speech Coders for Multimedia Communication, IEEE Communications Magazine, 12(34):34-41, December 1996.
      [2] A primer on the H.323 Series Standard.
      [3] Visual Telephone Systems and Equipment for Local Area Networks Which Provide a Non-Guaranteed Quality of Service. ITU-T Draft H.323, April 22, 1996.
      [4] Speech coding tutorial
      [5] ITU-T BlueBook Fasc III.4, 1990
      [6] Federal Standard 1016, Telecommunications: Analog to Digital conversion of Radio Voice by 4800 bps Code Excited Linear Prediction (CELP)
      [8] TrueSpeech ITU G.723 Speech compression technology for the ITU-T H.324
      [9] International Multimedia Teleconferencing Consortium's (IMTC) Voice Over IP (VoIP) Forum
      [10] GSM-Codec standard
      [12] Summary of ITU-T Recommendation G.726, ADPCM
      [14] Netscape joint press release about H.323
      [15] http://cips02.physik.uni-bonn.de/~scheller/audio/ Comparison of audio compression techniques.
      [16] http://fas.sfu.ca/cs/undergrad/CourseMaterials/CMPT365/material/notes/Chap4/Chap4.3/Chap4.3.html Audio compression techniques and human audio perception