本文译自《A Tutorial on Using the ALSA Audio API》[1],基于 DeepL [2] 的翻译结果润色而得。部分翻译术语参考了另一篇翻译[3]。
开始
This document attempts to provide an introduction to the ALSA Audio API. It is not a complete reference manual for the API, and it does not cover many specific issues that more complex software will need to address. However, it does try to provide enough background and information for a reasonably skilled programmer who is new to ALSA to write a simple program that uses the API.
本文简单介绍了 ALSA 音频 API。这不是一份完整的 API 参考手册,也不涉及复杂软件中涉及到的许多具体问题。它试图为熟练但不了解 ALSA 的程序员提供足够的背景和信息,使其能通过该 API 编写一些简单的程序。
All code in the document is licensed under the GNU Public License. If you plan to write software using ALSA under some other license, then I suggest you find some other documentation.
本文档中的所有代码均采用 GPL 协议。如果你打算以其他协议编写使用 ALSA 的软件,建议查阅其他文档。
理解音频接口
Let us first review the basic design of an audio interface. As an application developer, you don’t need to worry about this level of operation - it’s all taken care of by the device driver (which is one of the components that ALSA provides). But you do need to understand what is going on at a conceptual level if you want to write efficient and flexible software.
首先让我们回顾一下音频接口的基本设计思想。作为应用开发者,你不需要考虑这个层面的操作,它们全部由设备驱动程序(ALSA 提供的组件之一)处理。但是,如果你想编写高效灵活的软件,就有必要在概念层面上了解其中发生了什么。
An audio interface is a device that allows a computer to receive and to send audio data from/to the outside world. Inside of the computer, audio data is represented as a stream of bits, just like any other kind of data. However, the audio interface may send and receive audio as either an analog signal (a time-varying voltage) or as a digital signal (some stream of bits). In either case, the set of bits that the computer uses to represent a particular sound will need to be transformed before it is delivered to the outside world, and likewise, the external signal received by the interface will need to be transformed before it is useful to the computer. These two transformations are the raison d’être of the audio interface.
音频接口是一种允许计算机从外界接收和向外界发送音频数据的设备。和其他数据一样,音频数据在计算机内部是以比特流的形式表示的。
音频接口可以通过模拟信号(时变电压)或数字信号(比特流)的形式发送和接收音频,但无论是哪种情况,信号都需要经过转换:计算机用来表示声音的数字信号在传送到外界之前需要先经过转换,而接口接收到的外部信号也需要先经过转换才能为计算机所用。音频接口就是为这两种转换而生的。
Within the audio interface is an area referred to as the “hardware buffer”. As an audio signal arrives from the outside world, the interface converts it into a stream of bits usable by the computer and stores it in the part of the hardware buffer used to send data to the computer. When it has collected enough data in the hardware buffer, the interface interrupts the computer to tell it that it has data ready for it. A similar process happens in reverse for data being sent from the computer to the outside world. The interface interrupts the computer to tell it that there is space in the hardware buffer, and the computer proceeds to store data there. The interface later converts these bits into whatever form is needed to deliver it to the outside world, and delivers it. It is very important to understand that the interface uses this buffer as a “circular buffer”. When it gets to the end of the buffer, it continues by wrapping around to the start.
音频接口内有一个被称为“硬件缓冲区”的区域。当音频信号从外部到达时,接口将其转换成计算机可用的比特流,并将其存储在用于向计算机发送数据的部分硬件缓冲区中。当硬件缓冲区中收集到了足够的数据时,接口就会产生一个中断,告知数据已经准备完毕。
计算机向外界发送数据的过程与上述过程恰恰相反。接口产生一个中断,向计算机告知缓冲区有空间,于是计算机就在缓冲区中存储数据。随后,接口将这些数据转换成所需要的形式并发送到外界。
重要的一点是:接口的缓冲区是循环缓冲区。即如果当前数据存储的位置到达了缓冲区的末尾,那么之后的数据会从缓冲区的起点开始继续存储。
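The wrap-around behaviour of a circular buffer comes down to modular arithmetic. The following is a minimal C sketch, not ALSA code: `ring_advance` and `ring_distance` are hypothetical helpers that track positions (in frames) inside a buffer of `size` frames, assuming positions always stay below `size`.

```c
#include <assert.h>
#include <stddef.h>

/* Advance a position by n frames inside a circular buffer of `size`
 * frames, wrapping back to the start when the end is reached. */
size_t ring_advance(size_t pos, size_t n, size_t size)
{
    assert(size > 0 && pos < size);
    return (pos + n) % size;
}

/* Frames between `from` (e.g. the read position) and `to` (e.g. the
 * write position), walking forward with wrap-around. Note that a
 * completely full buffer and an empty one both give 0 here; real code
 * has to track that case separately. */
size_t ring_distance(size_t from, size_t to, size_t size)
{
    assert(size > 0 && from < size && to < size);
    return (to + size - from) % size;
}
```

For example, writing 3 frames starting at frame 6 of an 8-frame buffer ends at frame 1, because the position wraps past the end of the buffer.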
For this process to work correctly, there are a number of variables that need to be configured. They include:
为了让这一过程正确运作,需要配置一些变量,其中包括:
what format should the interface use when converting between the bitstream used by the computer and the signal used in the outside world?
- 在进行数模/模数转换时,接口应该使用什么格式?
at what rate should samples be moved between the interface and the computer?
- 采样率应该是多少?
how much data (and/or space) should there be before the device interrupts the computer?
- 积累多少数据(和/或空间)后,设备应该向计算机产生一次中断?
how big should the hardware buffer be?
- 硬件缓冲区应该多大?
The first two questions are fundamental in governing the quality of the audio data. The second two questions affect the “latency” of the audio signal. This term refers to the delay between
前两个问题是制约音频数据质量的根本,而后两个问题则会影响音频信号的延迟。
术语“延迟”指以下二者:
1. data arriving at the audio interface from the outside world, and it being available to the computer (“input latency”)
2. data being delivered by the computer, and it being delivered to the outside world (“output latency”)
- 输入延迟:数据从到达音频接口到提供给计算机之间的延迟。
- 输出延迟:数据从计算机送出到被传送到外界之间的延迟。
Both of these are very important for many kinds of audio software, though some programs do not need to be concerned with such matters.
上述二者对许多音频软件而言都很重要。当然,有些程序不需要关注这些问题。
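Because latency is governed by how much data must accumulate, a first estimate follows from a frame count and the sample rate. A rough C sketch, assuming latency is approximated as buffered frames divided by the rate; it deliberately ignores converter, driver and scheduling delays, and `frames_to_ms` is a hypothetical helper name.

```c
#include <assert.h>

/* Convert a number of frames into milliseconds at a given sample rate.
 * Rough estimate only: converter, driver and scheduling delays are
 * not included. */
double frames_to_ms(unsigned long frames, unsigned int rate)
{
    assert(rate > 0);
    return (double)frames * 1000.0 / (double)rate;
}
```

For instance, 1024 frames at 44100 Hz correspond to roughly 23.2 ms, so waiting for 1024 frames before each interrupt adds about that much delay.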
一个典型的音频应用做了什么
A typical audio application has this rough structure:
一个典型的音频应用大致是这样的结构:
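That rough structure (open the device, set its parameters, move audio in a loop, close it) can be sketched as the C skeleton below. The functions `open_device`, `set_parameters`, `transfer_chunk` and `close_device` are hypothetical stubs, not ALSA functions; in a real program they would wrap calls such as `snd_pcm_open`, `snd_pcm_hw_params`, `snd_pcm_writei`/`snd_pcm_readi` and `snd_pcm_close`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state; a real program would hold a snd_pcm_t *. */
struct device {
    bool open;
    bool configured;
    int chunks_left;   /* stands in for "is there more audio to move?" */
};

static bool open_device(struct device *d, int chunks)
{
    d->open = true;
    d->configured = false;
    d->chunks_left = chunks;
    return true;
}

/* A real program would set rate, format, channels, access, ... here. */
static bool set_parameters(struct device *d)
{
    d->configured = d->open;
    return d->configured;
}

/* Moves one chunk to or from the device; returns false when done. */
static bool transfer_chunk(struct device *d)
{
    if (d->chunks_left == 0)
        return false;
    d->chunks_left--;
    return true;
}

static void close_device(struct device *d) { d->open = false; }

/* The overall shape of a typical audio application. Returns the number
 * of chunks transferred, or -1 if setup failed. */
int run_audio_app(int chunks)
{
    struct device dev;
    int transferred = 0;

    if (!open_device(&dev, chunks))
        return -1;
    if (!set_parameters(&dev)) {
        close_device(&dev);
        return -1;
    }
    while (transfer_chunk(&dev))   /* read and/or write until done */
        transferred++;
    close_device(&dev);
    return transferred;
}
```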
最低限度的播放程序
This program opens an audio interface for playback, configures it for stereo, 16 bit, 44.1kHz, interleaved conventional read/write access. Then it delivers a chunk of random data to it, and exits. It represents about the simplest possible use of the ALSA Audio API, and isn’t meant to be a real program.
该程序打开一个用于播放的音频接口,将其配置为立体声、16 位、44.1kHz、交错式常规读/写访问,然后向接口传送一块(chunk)随机数据,最后退出。
这个例子展示的大概是 ALSA 音频 API 最简单的用法,并不是一个实际可用的程序。
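The chunk handed to the device has to match the configured format. A C sketch, assuming the stereo, signed 16-bit, interleaved configuration above, that fills one chunk with pseudo-random samples; `fill_random_chunk` is a hypothetical helper. In a real program the buffer would then be passed to `snd_pcm_writei`, whose size argument counts frames, not bytes or samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHANNELS 2  /* stereo, as configured above */

/* Fill `frames` interleaved stereo frames of signed 16-bit samples with
 * pseudo-random noise. The buffer must hold frames * CHANNELS samples.
 * Returns the number of bytes written. */
size_t fill_random_chunk(int16_t *buf, size_t frames)
{
    size_t samples = frames * CHANNELS;
    for (size_t i = 0; i < samples; i++)
        buf[i] = (int16_t)((rand() % 65536) - 32768);
    return samples * sizeof(int16_t);
}
```

In this configuration a 128-frame chunk occupies 128 × 2 × 2 = 512 bytes, while the frame count passed to `snd_pcm_writei` would be 128.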
最低限度的捕获程序
This program opens an audio interface for capture, configures it for stereo, 16 bit, 44.1kHz, interleaved conventional read/write access. Then it reads a chunk of random data from it, and exits. It isn’t meant to be a real program.
该程序打开一个用于捕获的音频接口,将其配置为立体声、16 位、44.1kHz、交错式常规读/写访问,然后从接口读取一块随机数据,最后退出。它并不是一个实际可用的程序。
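A single read can return fewer frames than requested, so capture code commonly loops until a whole chunk has arrived. A C sketch under that assumption: `read_fn` is a hypothetical callback type modelled on the return convention of `snd_pcm_readi` (frames read, or a negative error code), and `fake_reader` is a stand-in device that delivers at most 50 frames per call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHANNELS 2

/* Hypothetical reader: reads up to `frames` interleaved frames into buf,
 * returning the number of frames read or a negative error code. */
typedef long (*read_fn)(int16_t *buf, size_t frames);

/* Keep reading until `frames` frames have been captured.
 * Returns the frame count on success or the first negative error. */
long read_full_chunk(read_fn reader, int16_t *buf, size_t frames)
{
    size_t got = 0;
    while (got < frames) {
        long n = reader(buf + got * CHANNELS, frames - got);
        if (n < 0)
            return n;          /* e.g. an xrun; the caller must recover */
        got += (size_t)n;
    }
    return (long)got;
}

/* Stand-in for a real device: delivers at most 50 frames per call,
 * all of them silent (zero). */
long fake_reader(int16_t *buf, size_t frames)
{
    size_t n = frames < 50 ? frames : 50;
    for (size_t i = 0; i < n * CHANNELS; i++)
        buf[i] = 0;
    return (long)n;
}
```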
最低限度的中断驱动程序
This program opens an audio interface for playback, configures it for stereo, 16 bit, 44.1kHz, interleaved conventional read/write access. It then waits till the interface is ready for playback data, and delivers random data to it at that time. This design allows your program to be easily ported to systems that rely on a callback-driven mechanism, such as JACK, LADSPA, CoreAudio, VST and many others.
该程序打开一个用于播放的音频接口,并将其配置为立体声、16 位、44.1kHz、交错式常规读/写访问。然后等待接口准备好接收播放数据,并在那时将随机数据传送给它。
这种设计使你的程序可以很容易地移植到依赖回调驱动机制的系统中,如 JACK、LADSPA、CoreAudio、VST 等。
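The callback-driven shape can be modelled without any audio hardware. In this C sketch, a hypothetical `process_cb` is called once per period whenever at least a period's worth of space is available, similar in spirit to how JACK calls a client's process callback with a frame count. The simulated driver loop and all names are assumptions, not ALSA or JACK API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: produce `nframes` frames of audio. */
typedef void (*process_cb)(size_t nframes, void *user);

/* Simulate a device that frees `space_per_tick` frames of buffer space
 * per tick. Whenever at least one period of space is available, the
 * callback is asked for exactly one period. Returns the callback count. */
size_t run_callback_driver(process_cb cb, void *user, size_t period,
                           size_t space_per_tick, size_t ticks)
{
    size_t avail = 0, calls = 0;
    assert(period > 0);
    for (size_t t = 0; t < ticks; t++) {
        avail += space_per_tick;          /* hardware consumed some data */
        while (avail >= period) {         /* "interrupt": room for a period */
            cb(period, user);
            avail -= period;
            calls++;
        }
    }
    return calls;
}

/* Example callback: count the frames it was asked to produce. */
void count_frames(size_t nframes, void *user)
{
    *(size_t *)user += nframes;
}
```

The application code only ever sees "produce this many frames now", which is why this design ports easily to callback-based systems.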
最低限度的全双工程序
Full duplex can be implemented by combining the playback and capture designs shown above. Although many existing Linux audio applications use this kind of design, in this author’s opinion, it is deeply flawed. The interrupt-driven example represents a fundamentally better design for many situations. It is, however, rather complex to extend to full duplex. This is why I suggest you forget about all of this.
全双工可以通过结合上述的播放和捕获设计来实现。虽然很多现有的 Linux 音频应用都使用了这种设计,但在笔者看来,它存在很大的缺陷。在很多情况下,中断驱动的例子代表了一种根本上更好的设计,然而将它扩展到全双工相当复杂。这就是我建议你忘掉这一切的原因。
术语
Terminology | Description |
---|---|
capture | Receiving data from the outside world (different from “recording” which implies storing that data somewhere, and is not part of ALSA’s API) |
playback | Delivering data to the outside world, presumably, though not necessarily, so that it can be heard. |
duplex | A situation where capture and playback are occurring on the same interface at the same time. |
xrun | Once the audio interface starts running, it continues to do so until told to stop. It will be generating data for the computer to use and/or sending data from the computer to the outside world. For various reasons, your program may not keep up with it. For playback, this can lead to a situation where the interface needs new data from the computer, but it isn’t there, forcing it to use old data left in the hardware buffer. This is called an “underrun”. For capture, the interface may have data to deliver to the computer, but nowhere to store it, so it has to overwrite part of the hardware buffer that contains data the computer has not received. This is called an “overrun”. For simplicity, we use the generic term “xrun” to refer to either of these conditions. |
pcm | Pulse Code Modulation. This phrase (and acronym) describes one method of representing an analog signal in digital form. It’s the method used by almost all computer audio interfaces, and it is used in the ALSA API as a shorthand for “audio”. |
channel | |
frame | A sample is a single value that describes the amplitude of the audio signal at a single point in time, on a single channel. When we talk about working with digital audio, we often want to talk about the data that represents all channels at a single point in time. This is a collection of samples, one per channel, and is generally called a “frame”. When we talk about the passage of time in terms of frames, it’s roughly equivalent to what people mean when they measure in terms of samples, but is more accurate; more importantly, when we’re talking about the amount of data needed to represent all the channels at a point in time, it’s the only unit that makes sense. Almost every ALSA Audio API function uses frames as its unit of measurement for data quantities. |
interleaved | a data layout arrangement where the samples of each channel that will be played at the same time follow each other sequentially. See “non-interleaved” |
non-interleaved | a data layout where the samples for a single channel follow each other sequentially; samples for another channel are either in another buffer or another part of this buffer. Contrast with “interleaved” |
sample clock | a timing source that is used to mark the times at which a sample should be delivered and/or received to/from the outside world. Some audio interfaces allow you to use an external sample clock, either a “word clock” signal (typically used in many studios), or “autosync” which uses a clock signal present in incoming digital data. All audio interfaces have at least one sample clock source that is present on the interface itself, typically a small crystal clock. Some interfaces do not allow the rate of the clock to be varied, and some have clocks that do not actually run at precisely the rates you would expect (44.1kHz, etc). No two sample clocks can ever be expected to run at precisely the same rate - if you need two sample streams to remain synchronized with each other, they MUST be run from the same sample clock. |
术语 | 解释 |
---|---|
捕获(capture) | 从外界接收数据(不同于“录制”,后者意味着把数据存储到某处,而这不是 ALSA API 的一部分) |
播放(playback) | 向外界传送数据,通常(但不一定)是为了让人能够听到 |
双工(duplex) | 在同一个接口上同时进行捕获和播放的情形 |
xrun | 音频接口一旦开始运行,就会一直运行下去,直到被叫停。它会为计算机生成数据,和/或把计算机的数据发送到外界。由于各种原因,你的程序可能跟不上它。对于播放,可能出现接口需要计算机提供新数据、但数据尚未到位的情况,迫使接口使用硬件缓冲区中残留的旧数据,这被称为 underrun。对于捕获,接口可能有要传送给计算机的数据,却没有地方存放,因此不得不覆盖硬件缓冲区中计算机尚未接收的部分,这被称为 overrun。简单起见,我们使用通用术语 xrun 表示二者中的任意一种 |
pcm | 脉冲编码调制(Pulse Code Modulation)。这个短语(及其首字母缩写)描述了一种以数字形式表示模拟信号的方法。几乎所有计算机音频接口都使用这种方法,ALSA API 也用它作为“音频”的简称 |
通道(channel) | |
帧(frame) | 采样(sample)是一个单一的值,描述音频信号在单一时间点、单一通道上的振幅。谈论数字音频时,我们常常需要谈论在某一时间点上代表所有通道的数据。这是一组采样(每个通道一个)的集合,通常被称为“帧”。用帧来描述时间的流逝,大致相当于人们用采样数来衡量时间,但更加准确。更重要的是,在谈论某一时间点上代表所有通道所需的数据量时,帧是唯一有意义的单位。几乎所有的 ALSA 音频 API 函数都使用帧作为数据量的计量单位 |
交错式(interleaved) | 一种数据布局:同一时刻要播放的各通道采样依次相邻存储。见“非交错式” |
非交错式(non-interleaved) | 一种数据布局:单个通道的采样依次相邻存储,其他通道的采样则存放在另一个缓冲区或本缓冲区的另一部分。与“交错式”相对 |
采样时钟(sample clock) | 一种时钟源,用于标记向外界传送和/或从外界接收采样的时刻。一些音频接口允许使用外部采样时钟,可以是字时钟信号[4](许多录音棚中常用),也可以是“自动同步”(autosync,使用输入数字数据中携带的时钟信号)。所有音频接口本身都至少有一个采样时钟源,通常是一个小型晶体振荡器。有些接口不允许改变时钟速率,有些接口的时钟实际上并不精确地以你期望的速率(如 44.1kHz)运行。不能指望任何两个采样时钟以完全相同的速率运行,如果你需要两个采样流彼此保持同步,它们必须由同一个采样时钟驱动 |
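The frame and layout definitions translate directly into index arithmetic. A small C sketch (hypothetical helpers, not ALSA calls) giving the byte size of a frame and the position of sample `ch` of frame `f` in an interleaved buffer versus a non-interleaved one, where each channel is assumed to occupy its own contiguous block of `frames` samples within a single buffer.

```c
#include <assert.h>
#include <stddef.h>

/* One frame holds one sample per channel. */
size_t frame_bytes(size_t channels, size_t bytes_per_sample)
{
    return channels * bytes_per_sample;
}

/* Interleaved: L0 R0 L1 R1 ... (all channels of a frame are adjacent). */
size_t interleaved_index(size_t frame, size_t ch, size_t channels)
{
    assert(ch < channels);
    return frame * channels + ch;
}

/* Non-interleaved, one buffer: L0 L1 ... L(n-1) R0 R1 ... R(n-1). */
size_t noninterleaved_index(size_t frame, size_t ch, size_t frames)
{
    assert(frame < frames);
    return ch * frames + frame;
}

/* Convert an interleaved buffer into the non-interleaved layout above. */
void deinterleave(const short *in, short *out, size_t frames, size_t channels)
{
    for (size_t f = 0; f < frames; f++)
        for (size_t c = 0; c < channels; c++)
            out[noninterleaved_index(f, c, frames)] =
                in[interleaved_index(f, c, channels)];
}
```

With stereo 16-bit audio a frame is 4 bytes, and an interleaved buffer `{1, 10, 2, 20, 3, 30}` becomes `{1, 2, 3, 10, 20, 30}` after `deinterleave`.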
如何去做…
打开设备
ALSA separates capture and playback …
ALSA 区分了捕获和播放……
设置参数
We mentioned above that there are a number of parameters that need to be set in order for an audio interface to do its job. However, because your program will not actually be interacting with the hardware directly, but instead with a device driver that controls the hardware, there are really two distinct sets of parameters:
我们在上面提到过,为了让音频接口完成它的工作,需要设置一些参数。但是,由于你的程序实际上并不直接与硬件交互,而是与控制硬件的设备驱动程序交互,所以实际上有两组不同的参数:
硬件参数
These are parameters that directly affect the hardware of the audio interface.
这些都是直接影响音频接口硬件的参数。
采样率
This controls the rate at which either A/D or D/A conversion is done, if the interface has analog I/O. For fully digital interfaces, it controls the speed of the clock used to move digital audio data to/from the outside world. On some interfaces, other device-specific configuration may mean that your program cannot control this value (for example, when the interface has been told to use an external word clock source to determine the sample rate).
如果接口有模拟 I/O,它控制 A/D 或 D/A 转换的速率。对于全数字接口,它控制用于与外界之间传送数字音频数据的时钟速度。
在某些接口上,其他设备特有的配置可能意味着你的程序无法控制这个值(例如,当接口被设置为使用外部字时钟源来确定采样率时)。
采样格式
This controls the sample format used to transfer data to and from the interface. It may or may not correspond with a format directly supported by the hardware.
这控制了程序与接口之间传输数据所用的样本格式。它可能与硬件直接支持的格式一致,也可能不一致。
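To make "sample format" concrete: signed 16-bit little-endian (the layout ALSA names `SND_PCM_FORMAT_S16_LE`) stores each sample as two bytes, low byte first. A portable C sketch of that encoding, written with shifts so it does not depend on the host's byte order; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one signed 16-bit sample as little-endian bytes. */
void s16le_encode(int16_t sample, uint8_t out[2])
{
    uint16_t u = (uint16_t)sample;      /* two's-complement bit pattern */
    out[0] = (uint8_t)(u & 0xFF);       /* low byte first */
    out[1] = (uint8_t)(u >> 8);
}

/* Decode two little-endian bytes back into a signed 16-bit sample.
 * Converting values above 32767 to int16_t is implementation-defined
 * before C23, but yields the two's-complement value on common platforms. */
int16_t s16le_decode(const uint8_t in[2])
{
    uint16_t u = (uint16_t)(in[0] | (in[1] << 8));
    return (int16_t)u;
}
```

For example, the sample value -2 is stored as the byte pair `0xFE 0xFF`.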
通道数
Hopefully, this is fairly self-explanatory.
如题。
数据访问与布局
This controls the way that the program will deliver/receive data from the interface. There are two parameters controlled by 4 possible settings. One parameter is whether or not a “read/write” model will be used, in which explicit function calls are used to transfer data. The other option here is to use “mmap mode” in which data is transferred by copying between areas of memory, and API calls are only necessary to note when it has started and finished.
这控制了程序从接口接收/向接口传送数据的方式。这里有两个参数,共 4 种可能的组合。
其中一个参数表示使用读/写模式还是 mmap 模式:读/写模式使用显式函数调用来传输数据;而在 mmap 模式下,数据通过在内存区域之间复制来传输,只需在传输开始和结束时调用 API 加以标记。
The other parameter is whether the data layout will be interleaved or non-interleaved.
而另一个参数则用于表示数据布局是交错的还是非交错的。
中断间隔
This determines how many interrupts the interface will generate per complete traversal of its hardware buffer. It can be set either by specifying a number of periods, or the size of a period. Since this determines the number of frames of space/data that have to accumulate before the interface will interrupt the computer, it is central in controlling latency.
这决定了接口每完整遍历一次其硬件缓冲区会产生多少次中断。可以通过指定周期数量或单个周期的大小来设置。
由于它决定了接口产生中断前必须积累的空间/数据帧数,因此它是控制延迟的核心。
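Because the interrupt interval and the buffer size are linked, fixing the period size and the number of periods determines the rest. A C sketch with hypothetical names, assuming the buffer holds an exact whole number of periods:

```c
#include <assert.h>

struct period_config {
    unsigned long period_frames;   /* frames between interrupts */
    unsigned int periods;          /* interrupts per buffer traversal */
};

/* Total hardware buffer size in frames. */
unsigned long buffer_frames(struct period_config c)
{
    return c.period_frames * c.periods;
}

/* Interrupts generated per second at a given sample rate. */
double interrupts_per_second(struct period_config c, unsigned int rate)
{
    assert(c.period_frames > 0);
    return (double)rate / (double)c.period_frames;
}
```

With two periods of 1024 frames at 48 kHz the buffer holds 2048 frames and the interface interrupts about 46.9 times per second; halving the period size doubles the interrupt rate and shortens the time data waits before the computer is told about it.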
缓冲区大小
This determines how large the hardware buffer is. It can be specified in units of time or frames.
这决定了硬件缓冲区的大小。可以以时间或帧为单位来指定。
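Converting between the two units only needs the sample rate. A C sketch with hypothetical helper names; as far as I know, ALSA itself accepts buffer times in microseconds (for example via `snd_pcm_hw_params_set_buffer_time_near`) and may round the requested value to what the hardware supports.

```c
#include <assert.h>
#include <stdint.h>

/* Buffer length in microseconds -> frames, rounded down. */
uint64_t us_to_frames(uint64_t us, unsigned int rate)
{
    return us * rate / 1000000u;
}

/* Frames -> microseconds, rounded down. */
uint64_t frames_to_us(uint64_t frames, unsigned int rate)
{
    assert(rate > 0);
    return frames * 1000000u / rate;
}
```

For example, a 0.5 s buffer at 44100 Hz holds 22050 frames.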
软件参数
These are parameters that control the operation of the device driver rather than the hardware itself. Most programs that use the ALSA Audio API will not need to set any of these; a few will need to set a subset of them.
这些是控制设备驱动程序(而非硬件本身)操作的参数。大多数使用 ALSA 音频 API 的程序不需要设置这些参数;少数程序需要设置其中的一部分。
何时启动设备
When you open the audio interface, ALSA ensures that it is not active - no data is being moved to or from its external connectors. Presumably, at some point you want this data transfer to begin. There are several options for how to make this happen.
当打开音频接口时,ALSA 会确保它处于非活动状态,即没有数据从其外部连接器移入或移出。你大概会在某个时刻希望开始数据传输,有几种方式可以实现这一点。
The control point here is the start threshold, which defines the number of frames of space/data necessary to start the device automatically. If set to some value other than zero for playback, it is necessary to prefill the playback buffer before the device will start. If set to zero, the first data written to the device (or first attempt to read from a capture stream) will start the device.
这里的控制点是启动阈值,它定义了自动启动设备所需的空间/数据帧数。对于播放流,如果将其设置为非零值,则需要在设备启动前预先填充回放缓冲区。
如果设置为 0,则向设备写入的第一批数据(或第一次尝试从捕获流中读取数据)就会启动设备。
You can also start the device explicitly using snd_pcm_start, but this requires buffer prefilling in the case of the playback stream. If you attempt to start the stream without doing this, you will get -EPIPE as a return code, indicating that there is no data waiting to deliver to the playback hardware buffer.
你也可以使用 snd_pcm_start 显式地启动设备,但对于播放流,这需要预先填充缓冲区。如果未经预填充就试图启动,则会得到返回值 -EPIPE,表示没有等待传送到回放硬件缓冲区的数据。
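The two ways of starting playback can be captured in a small state model. This C sketch is a simplified simulation of the behaviour described above, not the driver's real logic: writes start the stream automatically once the queued frames reach the start threshold (a threshold of 0 starts it on the first write), and an explicit start on an empty playback buffer fails with `-EPIPE`. All names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct playback_model {
    unsigned long queued;            /* frames waiting in the buffer */
    unsigned long start_threshold;   /* frames needed to auto-start */
    bool running;
};

/* Queue frames; auto-start once the start threshold is reached. */
void model_write(struct playback_model *p, unsigned long frames)
{
    p->queued += frames;
    if (!p->running && p->queued >= p->start_threshold)
        p->running = true;
}

/* Explicit start, in the spirit of snd_pcm_start(): playback needs
 * prefilled data, otherwise the start fails with -EPIPE. */
int model_start(struct playback_model *p)
{
    if (p->queued == 0)
        return -EPIPE;   /* nothing to deliver to the hardware */
    p->running = true;
    return 0;
}
```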
如何处理 xruns
If an xrun occurs, the device driver may, if requested, take certain steps to handle it. Options include stopping the device, or silencing all or part of the hardware buffer used for playback.
如果发生 xrun,设备驱动程序可以根据要求采取某些处理步骤,包括停止设备,或将用于播放的硬件缓冲区全部或部分填充为静音。
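An underrun is simply the hardware needing more frames than the application has supplied. A toy C simulation of that accounting (hypothetical names; it does not model stopping or silencing). In real ALSA code a write that runs into an underrun typically fails with `-EPIPE`, and the stream can then be recovered with `snd_pcm_prepare` or `snd_pcm_recover`.

```c
#include <assert.h>
#include <stddef.h>

/* Per tick the hardware consumes `consume` frames while the application
 * writes written[t] frames; `prefill` frames are queued beforehand.
 * Returns the number of ticks on which the hardware found too little
 * data (underruns). The queue is assumed never to overflow. */
size_t count_underruns(const unsigned long *written, size_t ticks,
                       unsigned long consume, unsigned long prefill)
{
    unsigned long queued = prefill;
    size_t underruns = 0;
    for (size_t t = 0; t < ticks; t++) {
        queued += written[t];
        if (queued < consume) {
            underruns++;     /* hardware would replay stale data */
            queued = 0;
        } else {
            queued -= consume;
        }
    }
    return underruns;
}
```

One late write causes an underrun with an empty start, while one period of prefill absorbs the same delay.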
停止阈值
if the number of frames of data/space available meets or exceeds this value, the driver will stop the interface.
如果可用的数据/空间帧数达到或超过这个值,驱动程序将停止接口。
静音阈值
if the number of frames of space available for a playback stream meets or exceeds this value, the driver will fill part of the playback hardware buffer with silence.
如果回放流的可用空间帧数达到或超过该值,驱动程序将用静音填充部分硬件缓冲区。
静音大小
when the silence threshold level is met, this determines how many frames of silence are written into the playback hardware buffer
当达到静音阈值时,这个参数决定有多少帧静音会被写入硬件缓冲区。
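The three thresholds amount to comparisons against the available frame count. A C sketch of that decision logic exactly as the text describes it (struct and function names are hypothetical, and the real driver's handling of special values is not modelled); in ALSA the actual values are set with `snd_pcm_sw_params_set_stop_threshold`, `snd_pcm_sw_params_set_silence_threshold` and `snd_pcm_sw_params_set_silence_size`.

```c
#include <assert.h>
#include <stdbool.h>

struct sw_thresholds {
    unsigned long stop_threshold;     /* stop when avail >= this */
    unsigned long silence_threshold;  /* silence when avail >= this */
    unsigned long silence_size;       /* frames of silence to write */
};

struct driver_action {
    bool stop;
    unsigned long silence_frames;
};

/* Decide what the driver would do for a playback stream with `avail`
 * frames of free space, following the rules described above. */
struct driver_action decide_action(struct sw_thresholds t, unsigned long avail)
{
    struct driver_action a = {false, 0};
    if (avail >= t.stop_threshold)
        a.stop = true;
    if (avail >= t.silence_threshold)
        a.silence_frames = t.silence_size;
    return a;
}
```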
可供唤醒的最小空间/数据
Programs that use poll(2) or select(2) to determine when audio data may be transferred to/from the interface may set this to control at what point, relative to the state of the hardware buffer, they wish to be woken up.
使用 poll(2) 或 select(2) 来判断何时可以向接口传入/从接口传出音频数据的程序,可以设置这个参数,以控制在硬件缓冲区处于什么状态时被唤醒。
传输块大小
this determines the number of frames used when transferring data to/from the device hardware buffer.
这决定了向/从设备硬件缓冲区传输数据时使用的帧数。
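For a poll(2)-based program, the wake-up level and the transfer size work together: the program sleeps until at least `avail_min` frames are available, then moves data in whole blocks. A hedged C sketch of that arithmetic with hypothetical names; in ALSA the wake-up level itself is set with `snd_pcm_sw_params_set_avail_min`.

```c
#include <assert.h>
#include <stdbool.h>

/* Should a poll()-based program be woken up? */
bool should_wake(unsigned long avail, unsigned long avail_min)
{
    return avail >= avail_min;
}

/* Frames to transfer now: the largest multiple of `block` that fits. */
unsigned long frames_to_transfer(unsigned long avail, unsigned long block)
{
    assert(block > 0);
    return avail - avail % block;
}
```

With `avail_min` at 256 frames and 64-frame blocks, a wake-up with 300 frames available would move 256 frames and leave the remaining 44 for the next round.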
There are a couple of other software parameters but they need not concern us here.
还有一些其他的软件参数,但这里我们不去关注。
接收和发送数据
NOT YET WRITTEN
为什么你可能想忘记这一切?
In a word: JACK.
一言以蔽之:[JACK](http://jackit.sf.net/)。
碎碎念
译文中去掉了很多翻译腔的内容,可能改变了原文的意思,但更符合说话习惯。具体如下:
- attempt to ... 译为“…了”:一般中文语境下不会使用这样的谦词。
- more complex software 译为“复杂软件”:这里的 more 应该是和自身对比,但默认语境下本文也不是“复杂软件”,因此省略。
- will need to address 译为“涉及到的”:涉及到的问题就是要解决的,但前者更加通顺。
- However, it does try to... 省略:这里的转折是与上文不提供对比而得,而强调则是增加这种转折感。但该转折可以隐式进行,因此省略。
- write a simple program 译为“编写一些简单的程序”:在这里使用单数会使得句子看起来没有泛用感,复数的表现效果更自然。
- worry about 译为“考虑”:二者在语境中意思相同,“担心”不常出现在技术类文档中。
- bitstream used by the computer 译为“数字信号”:部分情况下使用术语“数字信号”更加简洁。
- signal used in the outside world 译为“模拟信号”:同上。