Introduction to the Speex Codec: Translated Material (Chinese-English Parallel Text)
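The Speex codec exists because there is a need for a speech codec that is open source and free of software patent royalties.
These are necessary conditions for it to be usable in any open-source software. In essence, Speex is to speech what Vorbis is
to audio/music. Unlike many other speech codecs, Speex is not designed for mobile phones but for packet networks and voice
over IP (VoIP) applications. File-based compression is of course also supported.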
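The Speex codec is designed to be very flexible and supports a wide range of speech quality and bit-rates. Supporting very
good speech quality also means that, in addition to narrowband speech (telephone quality, 8 kHz sampling rate), Speex can
encode wideband speech (16 kHz sampling rate).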
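Designing for VoIP rather than mobile phones means that Speex is robust to lost packets, but not to corrupted ones. This
rests on the assumption that in VoIP, packets either arrive unaltered or do not arrive at all. Because Speex targets a wide
range of devices, it has modest (adjustable) complexity and a small memory footprint.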
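All of these design goals led to the choice of code-excited linear prediction (CELP) as the coding technique. One of the main
reasons is that CELP has long proven that it works reliably and scales well to both low bit-rates (e.g. DoD CELP @ 4.8 kbps)
and high bit-rates (e.g. G.728 @ 16 kbps).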
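1.1 Getting help
As with many open-source projects, there are many ways to get help with Speex. These include:
• This manual
• Other documentation available on the Speex website
• Mailing list: discuss any Speex-related topic on the mailing list (not just for developers)
• IRC: the main channel is #speex on irc.freenode.net. Note that, because of time differences, it may take a while to reach
someone, so please be patient.
• Email the author privately, but only for private or delicate topics that you do not wish to discuss publicly.
Before asking for help (on the mailing list or IRC), it is important to first read this manual (so if you have made it this
far, that is already a good sign). It is generally considered rude to ask on a mailing list about topics that are clearly
covered in the documentation. On the other hand, it is perfectly fine (and encouraged) to ask for clarification of something
covered in the manual. This manual does not yet cover everything about Speex, so everyone is encouraged to ask questions,
send comments and feature requests, or simply let us know how Speex is being used.
Here are some additional guidelines for the mailing list. Before reporting a Speex bug to the list, it is strongly
recommended (if possible) to first test whether the bug can be reproduced with the speexenc and speexdec command-line tools
(see Section 4). Bugs reported against third-party code are harder to track down and are far too often caused by errors
that have nothing to do with Speex.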
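1.2 About this document
This document is organized as follows. Section 2 describes the different Speex features and defines many basic terms that
are used throughout this manual. Section 4 documents the standard command-line tools provided in the Speex distribution.
Section 5 gives detailed instructions on programming with the libspeex API. Section 7 contains some information about Speex
and standards.
The last three sections describe the algorithms used in Speex. Understanding them requires signal processing knowledge, but
they are not required for merely using Speex. They are intended for people who want to understand how Speex really works
and/or want to do research based on Speex. Section 8 explains the general idea behind CELP, while Sections 9 and 10 are
specific to Speex.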
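2 Codec description
This section describes Speex and its features in more detail.
2.1 Concepts
Before introducing all the Speex features, here are some speech coding concepts that help in understanding the rest of the
manual. Some are general concepts in speech/audio processing, while others are specific to Speex.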
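Sampling rate
The sampling rate, expressed in hertz (Hz), is the number of samples taken from a signal per second. For a sampling rate of
Fs kHz, the highest frequency that can be represented equals Fs/2 kHz (Fs/2 is known as the Nyquist frequency). This is a
fundamental property in signal processing and is described by the sampling theorem. Speex is mainly designed for three
sampling rates: 8 kHz, 16 kHz, and 32 kHz, referred to respectively as narrowband, wideband, and ultra-wideband.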
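Bit-rate
When encoding a speech signal, the bit-rate is defined as the number of bits per unit of time required to encode the speech.
It is measured in bits per second (bps), or more commonly in kilobits per second. It is important to distinguish between
kilobits per second (kbps) and kilobytes per second (kBps).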
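Quality (variable)
Speex is a lossy codec, which means that it achieves compression at the expense of the fidelity of the input speech signal.
Unlike some other speech codecs, Speex makes it possible to control the trade-off between quality and bit-rate. Most of the
time, the Speex encoding process is controlled by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR)
operation the quality parameter is an integer, while for variable bit-rate (VBR) it is a float.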
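Complexity (variable)
With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is
performed with an integer ranging from 1 to 10, in a way similar to the -1 to -9 options of the gzip and bzip2 compression
utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU
requirements at complexity 10 are about 5 times higher than at complexity 1. In practice, the best trade-off is between
complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds such as DTMF tones.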
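Variable Bit-Rate (VBR)
Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to the "difficulty" of the audio being
encoded. In the case of Speex, sounds such as vowels and high-energy transients require a higher bit-rate to achieve good
quality, while fricatives (e.g. s and f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve a
lower bit-rate for the same quality, or better quality at a given bit-rate. Despite these advantages, VBR has two main
drawbacks. First, when only the quality is specified, there is no guarantee about the final average bit-rate. Second, for
some real-time applications such as voice over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for
the communication channel.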
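Average Bit-Rate (ABR)
Average bit-rate solves one of the problems of VBR by dynamically adjusting the VBR quality to meet a specific target
bit-rate. Because the quality/bit-rate is adjusted in real time (open-loop), the overall quality will be slightly lower
than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.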
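Voice Activity Detection (VAD)
When enabled, voice activity detection detects whether the audio being encoded is speech or silence/background noise. VAD is
always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In that case, Speex
detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called "comfort
noise generation" (CNG).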
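Discontinuous Transmission (DTX)
Discontinuous transmission is an addition to VAD/VBR operation that allows transmission to stop completely when the
background noise is stationary. In file-based operation, since we cannot simply stop writing to the file, only 5 bits are
used for such frames (corresponding to 250 bps).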
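Perceptual enhancement
Perceptual enhancement is a part of the decoder which, when turned on, attempts to reduce the perception of the
noise/distortion produced by the encoding/decoding process. In most cases, perceptual enhancement moves the sound
objectively further from the original (e.g. when considering only SNR), but in the end it still sounds better (a subjective
improvement).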
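Latency and algorithmic delay
Every speech codec introduces a delay in the transmission. For Speex, this delay equals the frame size plus some amount of
"look-ahead" required to process each frame. In narrowband operation (8 kHz) the delay is 30 ms, while for wideband (16 kHz)
it is 34 ms. These values do not account for the CPU time it takes to encode or decode the frames.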
1 Introduction to Speex
The Speex codec exists because there is a need for a speech codec that is open-source and
free from software patent royalties. These are essential conditions for being usable in any open-source software. In essence,
Speex is to speech what Vorbis is to audio/music. Unlike many other speech codecs, Speex is not designed for mobile phones
but rather for packet networks and voice over IP (VoIP) applications. File-based compression is of course also supported.
The Speex codec is designed to be very flexible and to support a wide range of speech quality and bit-rates. Support for very
good quality speech also means that Speex can encode wideband speech (16 kHz sampling rate) in addition to narrowband
speech (telephone quality, 8 kHz sampling rate).
Designing for VoIP instead of mobile phones means that Speex is robust to lost packets, but not to corrupted ones. This is
based on the assumption that in VoIP, packets either arrive unaltered or don’t arrive at all. Because Speex is targeted at a wide
range of devices, it has modest (adjustable) complexity and a small memory footprint.
All the design goals led to the choice of CELP as the encoding technique. One of the main reasons is that CELP has long
proved that it could work reliably and scale well to both low bit-rates (e.g. DoD CELP @ 4.8 kbps) and high bit-rates (e.g.
G.728 @ 16 kbps).
1.1 Getting help
As with many open-source projects, there are many ways to get help with Speex. These include:
• This manual
• Other documentation on the Speex website
• Mailing list: Discuss any Speex-related topic on the mailing list (not just for developers)
• IRC: The main channel is #speex on irc.freenode.net. Note that due to time differences, it may take a while to get
someone, so please be patient.
• Email the author privately, but only for private/delicate topics you do not wish to discuss
publicly.
Before asking for help (mailing list or IRC), it is important to first read this manual (OK, so if you made it here it’s already
a good sign). It is generally considered rude to ask on a mailing list about topics that are clearly detailed in the documentation.
On the other hand, it’s perfectly OK (and encouraged) to ask for clarifications about something covered in the manual. This
manual does not (yet) cover everything about Speex, so everyone is encouraged to ask questions, send comments, feature
requests, or just let us know how Speex is being used.
Here are some additional guidelines related to the mailing list. Before reporting bugs in Speex to the list, it is strongly
recommended (if possible) to first test whether these bugs can be reproduced using the speexenc and speexdec (see Section 4)
command-line utilities. Bugs reported based on 3rd party code are both harder to find and far too often caused by errors that
have nothing to do with Speex.
1.2 About this document
This document is divided in the following way. Section 2 describes the different Speex features and defines many basic terms
that are used throughout this manual. Section 4 documents the standard command-line tools provided in the Speex distribution.
Section 5 includes detailed instructions about programming using the libspeex API. Section 7 has some information related to
Speex and standards.
The last three sections describe the algorithms used in Speex. These sections require signal processing knowledge, but are
not required for merely using Speex. They are intended for people who want to understand how Speex really works and/or
want to do research based on Speex. Section 8 explains the general idea behind CELP, while sections 9 and 10 are specific to
Speex.
2 Codec description
This section describes Speex and its features in more detail.
2.1 Concepts
Before introducing all the Speex features, here are some concepts in speech coding that help in understanding the rest of the
manual. Although some are general concepts in speech/audio processing, others are specific to Speex.
Sampling rate
The sampling rate expressed in Hertz (Hz) is the number of samples taken from a signal per second. For a sampling rate
of Fs kHz, the highest frequency that can be represented is equal to Fs/2 kHz (Fs/2 is known as the Nyquist frequency).
This is a fundamental property in signal processing and is described by the sampling theorem. Speex is mainly designed for
three different sampling rates: 8 kHz, 16 kHz, and 32 kHz. These are respectively referred to as narrowband, wideband and
ultra-wideband.
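As a small illustration of the Fs/2 relationship above, the following Python sketch computes the Nyquist frequency for the three sampling rates named in this section. The function and dictionary names are hypothetical and are not part of Speex.

```python
def nyquist_hz(sampling_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate (Fs/2)."""
    return sampling_rate_hz / 2.0

# The three sampling rates this section lists, keyed by their band names.
SPEEX_BANDS = {
    "narrowband": 8_000,
    "wideband": 16_000,
    "ultra-wideband": 32_000,
}

for band, fs in SPEEX_BANDS.items():
    print(f"{band}: Fs = {fs} Hz, Nyquist = {nyquist_hz(fs):.0f} Hz")
```

So narrowband audio can represent content up to 4 kHz, wideband up to 8 kHz, and ultra-wideband up to 16 kHz.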
Bit-rate
When encoding a speech signal, the bit-rate is defined as the number of bits per unit of time required to encode the speech. It
is measured in bits per second (bps), or generally kilobits per second. It is important to make the distinction between kilobits
per second (kbps) and kilobytes per second (kBps).
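The kbps/kBps distinction is a factor of eight, which the following Python sketch makes concrete. The helper names are hypothetical, and 1 kilobit is taken to mean 1000 bits, which is an assumption about the convention in use.

```python
BITS_PER_BYTE = 8

def kbps_to_kBps(kilobits_per_second: float) -> float:
    """Convert kilobits per second to kilobytes per second."""
    return kilobits_per_second / BITS_PER_BYTE

def stream_size_bytes(kilobits_per_second: float, seconds: float) -> float:
    """Payload size of a stream, assuming 1 kilobit = 1000 bits."""
    return kilobits_per_second * 1000 * seconds / BITS_PER_BYTE

print(kbps_to_kBps(8))           # 8 kbps is 1 kBps
print(stream_size_bytes(8, 60))  # one minute at 8 kbps, in bytes
```

Confusing the two units misestimates storage or bandwidth needs by a factor of eight.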
Quality (variable)
Speex is a lossy codec, which means that it achieves compression at the expense of fidelity of the input speech signal. Unlike
some other speech codecs, it is possible to control the tradeoff made between quality and bit-rate. The Speex encoding process
is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality
parameter is an integer, while for variable bit-rate (VBR), the parameter is a float.
Complexity (variable)
With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is
performed with an integer ranging from 1 to 10 in a way that’s similar to the -1 to -9 options to gzip and bzip2 compression
utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU
requirements for complexity 10 are about 5 times higher than for complexity 1. In practice, the best trade-off is between
complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF tones.
Variable Bit-Rate (VBR)
Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to the “difficulty” of the audio being
encoded. In the example of Speex, sounds like vowels and high-energy transients require a higher bit-rate to achieve good
quality, while fricatives (e.g. s, f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve a lower
bit-rate for the same quality, or a better quality for a certain bit-rate. Despite its advantages, VBR has two main drawbacks:
first, by only specifying quality, there is no guarantee about the final average bit-rate. Second, for some real-time applications like voice
over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for the communication channel.
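The second drawback can be illustrated with a short Python sketch that compares the average and peak bit-rates of a VBR stream. The per-frame bit counts and the channel capacity are hypothetical values chosen for illustration, and the 20 ms frame duration is an assumption rather than something stated in this section.

```python
FRAME_MS = 20  # assumption: 20 ms frames, not stated in this section

# Hypothetical bits per frame for a short VBR stream (illustrative values only).
frame_bits = [300, 300, 120, 120, 120, 480, 300, 120]

def average_bps(bits_per_frame, frame_ms=FRAME_MS):
    """Average bit-rate over the whole stream, in bits per second."""
    return sum(bits_per_frame) * 1000 / (len(bits_per_frame) * frame_ms)

def peak_bps(bits_per_frame, frame_ms=FRAME_MS):
    """Bit-rate of the single most expensive frame, in bits per second."""
    return max(bits_per_frame) * 1000 / frame_ms

CHANNEL_BPS = 16_000  # hypothetical channel capacity

avg = average_bps(frame_bits)
peak = peak_bps(frame_bits)
print(f"average: {avg:.0f} bps, fits channel: {avg <= CHANNEL_BPS}")
print(f"peak:    {peak:.0f} bps, fits channel: {peak <= CHANNEL_BPS}")
```

In this made-up stream the average stays below the channel capacity while the most expensive frame exceeds it, which is why a real-time application has to budget for the maximum rather than the average.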
Average Bit-Rate (ABR)
Average bit-rate solves one of the problems of VBR, as it dynamically adjusts VBR quality in order to meet a specific target
bit-rate. Because the quality/bit-rate is adjusted in real-time (open-loop), the global quality will be slightly lower than that
obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.
Voice Activity Detection (VAD)
When enabled, voice activity detection detects whether the audio being encoded is speech or silence/background noise. VAD
is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In this case, Speex
detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called “comfort
noise generation” (CNG).
Discontinuous Transmission (DTX)
Discontinuous transmission is an addition to VAD/VBR operation that allows transmission to stop completely when the
background noise is stationary. In file-based operation, since we cannot just stop writing to the file, only 5 bits are used for
such frames (corresponding to 250 bps).
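The 250 bps figure follows from the frame duration. The Python sketch below reproduces the arithmetic, assuming 20 ms frames (an assumption not stated in this section, but consistent with 5 bits per frame giving 250 bps). The names are hypothetical.

```python
FRAME_MS = 20        # assumption: 20 ms frames
DTX_FRAME_BITS = 5   # bits written per DTX frame in file-based operation

def frame_bits_to_bps(bits: int, frame_ms: int = FRAME_MS) -> float:
    """Bit-rate in bits per second when every frame carries `bits` bits."""
    return bits * 1000 / frame_ms

print(frame_bits_to_bps(DTX_FRAME_BITS))  # 5 bits every 20 ms -> 250.0 bps
```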
Perceptual enhancement
Perceptual enhancement is a part of the decoder which, when turned on, attempts to reduce the perception of the noise/distortion
produced by the encoding/decoding process. In most cases, perceptual enhancement brings the sound further from the
original objectively (e.g. considering only SNR), but in the end it still sounds better (subjective improvement).
Latency and algorithmic delay
Every speech codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount
of “look-ahead” required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16
kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to encode or decode the frames.
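The delay figures can be broken down as frame size plus look-ahead. The Python sketch below does this under the assumption of 20 ms frames, which is not stated in this section, so the look-ahead values it prints are derived from that assumption rather than quoted from the text. The names are hypothetical.

```python
FRAME_MS = 20  # assumption: 20 ms frames

# (sampling rate in Hz, total algorithmic delay in ms), as given in the text.
ALGORITHMIC_DELAY = {
    "narrowband": (8_000, 30),
    "wideband": (16_000, 34),
}

def lookahead_ms(total_delay_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Look-ahead implied by delay = frame size + look-ahead."""
    return total_delay_ms - frame_ms

def ms_to_samples(duration_ms: int, sampling_rate_hz: int) -> int:
    """Number of samples spanning the given duration."""
    return duration_ms * sampling_rate_hz // 1000

for band, (fs, delay) in ALGORITHMIC_DELAY.items():
    print(f"{band}: delay {delay} ms = {ms_to_samples(delay, fs)} samples, "
          f"implied look-ahead {lookahead_ms(delay)} ms")
```

An end-to-end latency budget would add network, buffering, and CPU time on top of these algorithmic values.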
2013.2.1





