Audio/Video: Introduction to Deep Learning with PyTorch (5): Loading Audio with Torchaudio

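Introduction to Torchaudio

Torchaudio is a Python library for processing audio data. It is an extension library built on PyTorch that provides audio processing functions and a range of preprocessing methods for machine learning and deep learning research on audio. Specifically, Torchaudio covers reading and loading audio files, audio transforms and augmentation, and audio data visualization. It also bundles some common audio datasets so that audio data can be obtained and processed quickly.

To install it, first install PyTorch; the command pip install torch installs the latest version. Then install the Torchaudio library with pip install torchaudio.

Audio files are read with the torchaudio.load method, which returns a waveform tensor containing the audio data together with the sample rate. Torchaudio also provides common audio transforms and augmentation methods, such as time-domain reverberation and frequency-domain filtering.

This tutorial shows how to use TorchAudio's basic I/O API to inspect audio data, load it into PyTorch Tensors, and save PyTorch Tensors. First check that the Torchaudio version is suitable; version 2.1 or later is recommended.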

import torch

import torchaudio

print(torch.__version__)

print(torchaudio.__version__)

Output:

2.1.1

2.1.0
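
The version check above can also be automated. The sketch below is only an illustration: the helper name meets_minimum is hypothetical, and it assumes version strings of the form major.minor.patch with an optional local suffix such as +cu118.

```python
def meets_minimum(version: str, minimum: tuple) -> bool:
    """Return True if a 'major.minor[.patch][+local]' string is >= minimum."""
    # Drop a local build suffix such as "+cu118" before parsing.
    public = version.split("+", 1)[0]
    # Compare only as many numeric components as `minimum` specifies.
    parts = tuple(int(p) for p in public.split(".")[: len(minimum)])
    return parts >= minimum


print(meets_minimum("2.1.0", (2, 1)))        # the torchaudio version printed above
print(meets_minimum("2.0.2+cu118", (2, 1)))  # an older build, for comparison
```

In a real session the first argument would be torchaudio.__version__.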

Preparation

First, we import the modules and download the audio assets used in this tutorial.

Install the required package with pip:

pip install boto3

import io

import os

import tarfile

import tempfile

import boto3

import matplotlib.pyplot as plt

import requests

from botocore import UNSIGNED

from botocore.config import Config

from IPython.display import Audio

from torchaudio.utils import download_asset

SAMPLE_GSM = download_asset("tutorial-assets/steam-train-whistle-daniel_simon.gsm")

SAMPLE_WAV = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")

SAMPLE_WAV_8000 = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042-8000hz.wav")

def _hide_seek(obj):
    # Wrap obj so that only read() is exposed and seek() is hidden.
    class _wrapper:
        def __init__(self, obj):
            self.obj = obj

        def read(self, n):
            return self.obj.read(n)

    return _wrapper(obj)

Output:

0%| | 0.00/7.99k [00:00

100%|##########| 7.99k/7.99k [00:00<00:00, 7.62MB/s]

0%| | 0.00/53.2k [00:00

100%|##########| 53.2k/53.2k [00:00<00:00, 49.1MB/s]
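
The _hide_seek helper defined above wraps an object so that only read() is visible, presumably to mimic a stream that cannot be rewound. The following standalone sketch shows the effect with an in-memory buffer; the class name ReadOnlyStream and the sample bytes are made up for illustration.

```python
import io


class ReadOnlyStream:
    """Expose only read(), in the same spirit as the _hide_seek wrapper."""

    def __init__(self, obj):
        self.obj = obj

    def read(self, n):
        return self.obj.read(n)


raw = io.BytesIO(b"RIFF....WAVEfmt ")
wrapped = ReadOnlyStream(raw)

# The underlying buffer can seek; the wrapper hides that ability.
print(hasattr(raw, "seek"), hasattr(wrapped, "seek"))

# Reading still works and consumes the stream from the front.
first = wrapped.read(4)
print(first)
```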

Querying audio metadata

The function torchaudio.info() fetches audio metadata. You can provide a path-like object or a file-like object.

metadata = torchaudio.info(SAMPLE_WAV)

print(metadata)

AudioMetaData(sample_rate=16000, num_frames=54400, num_channels=1, bits_per_sample=16, encoding=PCM_S)

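where

sample_rate is the sampling rate of the audio
num_channels is the number of channels
num_frames is the number of frames per channel
bits_per_sample is the bit depth
encoding is the sample coding format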

encoding can take one of the following values:

"PCM_S": signed integer linear PCM
"PCM_U": unsigned integer linear PCM
"PCM_F": floating-point linear PCM
"FLAC": FLAC, Free Lossless Audio Codec
"ULAW": Mu-law
"ALAW": A-law
"MP3": MP3, MPEG-1 Audio Layer III
"VORBIS": OGG Vorbis [xiph.org]
"AMR_NB": Adaptive Multi-Rate
"AMR_WB": Adaptive Multi-Rate Wideband
"OPUS": Opus [opus-codec.org]
"GSM": GSM-FR
"HTK": single-channel 16-bit PCM
"UNKNOWN": none of the above

Note

bits_per_sample can be 0 for formats with compression and/or variable bit rate (such as MP3).

num_frames can be 0 for the GSM-FR format.

metadata = torchaudio.info(SAMPLE_GSM)

print(metadata)

AudioMetaData(sample_rate=8000, num_frames=39680, num_channels=1, bits_per_sample=0, encoding=GSM)
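
The metadata fields are enough to derive a few useful quantities. The sketch below plugs in the two AudioMetaData results printed above; the helper names are hypothetical. It relies only on the field definitions given earlier: the duration is num_frames / sample_rate, and for linear PCM the raw sample payload is num_frames * num_channels * bits_per_sample / 8 bytes. When bits_per_sample is 0 (compressed formats), the payload cannot be computed this way.

```python
def duration_seconds(num_frames: int, sample_rate: int) -> float:
    """Length of the audio in seconds."""
    return num_frames / sample_rate


def pcm_payload_bytes(num_frames: int, num_channels: int, bits_per_sample: int):
    """Size of the raw PCM samples in bytes, or None if the bit depth is unknown."""
    if bits_per_sample == 0:
        return None  # compressed format: the bit depth is not meaningful
    return num_frames * num_channels * bits_per_sample // 8


# Values copied from the WAV metadata printed above.
wav_duration = duration_seconds(54400, 16000)
wav_payload = pcm_payload_bytes(54400, 1, 16)

# Values copied from the GSM metadata printed above.
gsm_duration = duration_seconds(39680, 8000)
gsm_payload = pcm_payload_bytes(39680, 1, 0)

print(wav_duration, wav_payload)
print(gsm_duration, gsm_payload)
```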

Querying file-like objects

torchaudio.info() also works on file-like objects.

url = "https://download.pytorch.org/torchaudio/tutorial-assets/steam-train-whistle-daniel_simon.wav"

with requests.get(url, stream=True) as response:
    metadata = torchaudio.info(_hide_seek(response.raw))

print(metadata)

Output:

AudioMetaData(sample_rate=44100, num_frames=109368, num_channels=2, bits_per_sample=16, encoding=PCM_S)

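When a file-like object is passed, info does not read all of the underlying data; it reads only a portion from the beginning. For some audio formats it may therefore fail to retrieve the correct metadata, including the format itself. In that case, you can pass the format argument to specify the format of the audio.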
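
To see why reading only the start of a stream can be enough, consider a canonical PCM WAV file, whose metadata sits in the first 44 bytes. The sketch below uses only the Python standard library and is not how torchaudio is implemented; the names make_wav_bytes, CountingReader and read_wav_header are hypothetical. Files with extra header chunks, or formats that store metadata elsewhere, would not fit this simple layout, which is one reason a format hint can be needed.

```python
import io
import struct
import wave


def make_wav_bytes(sample_rate=16000, num_channels=1, num_frames=100):
    """Build a small 16-bit PCM WAV file (silence) in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(num_channels)
        writer.setsampwidth(2)  # 2 bytes = 16 bits per sample
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * num_channels * num_frames)
    return buffer.getvalue()


class CountingReader:
    """Read-only stream that records how many bytes were consumed."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, n):
        chunk = self._stream.read(n)
        self.bytes_read += len(chunk)
        return chunk


def read_wav_header(stream):
    """Parse the canonical 44-byte PCM WAV header from the start of a stream."""
    fields = struct.unpack("<4sI4s4sIHHIIHH4sI", stream.read(44))
    (riff, _riff_size, wave_id, fmt_id, _fmt_size, _format_tag,
     channels, rate, _byte_rate, block_align, bits, data_id, data_len) = fields
    if (riff, wave_id, fmt_id, data_id) != (b"RIFF", b"WAVE", b"fmt ", b"data"):
        raise ValueError("not a canonical PCM WAV header")
    return {
        "sample_rate": rate,
        "num_channels": channels,
        "bits_per_sample": bits,
        "num_frames": data_len // block_align,
    }


reader = CountingReader(make_wav_bytes())
header = read_wav_header(reader)
print(header, reader.bytes_read)
```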

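Loading audio data

To load audio data, you can use torchaudio.load().

This function accepts a path-like object or a file-like object as input.

The returned value is a tuple of the waveform (Tensor) and the sample rate (int).

By default, the resulting tensor object has dtype=torch.float32 and its values lie in the range [-1.0, 1.0].

For the list of supported formats, please refer to the torchaudio documentation.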
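
As an illustration of where the [-1.0, 1.0] range comes from, the sketch below converts 16-bit integer samples to floats by dividing by 2^15. This is a common convention and is assumed here for illustration; the exact conversion used inside torchaudio's backend is not shown in this tutorial. The helper name int16_to_float is hypothetical.

```python
def int16_to_float(sample: int) -> float:
    """Map a signed 16-bit sample to a float, assuming scaling by 2**15."""
    return sample / 32768.0


samples = [-32768, 0, 16384, 32767]
normalized = [int16_to_float(s) for s in samples]
print(normalized)
```

The most negative sample maps to exactly -1.0, while the most positive one lands just below 1.0.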

waveform, sample_rate = torchaudio.load(SAMPLE_WAV)

def plot_waveform(waveform, sample_rate):
    waveform = waveform.numpy()

    num_channels, num_frames = waveform.shape
    time_axis = torch.arange(0, num_frames) / sample_rate

    figure, axes = plt.subplots(num_channels, 1)
    if num_channels == 1:
        axes = [axes]
    for c in range(num_channels):
        axes[c].plot(time_axis, waveform[c], linewidth=1)
        axes[c].grid(True)
        if num_channels > 1:
            axes[c].set_ylabel(f"Channel {c+1}")
    figure.suptitle("waveform")

plot_waveform(waveform, sample_rate)

def plot_specgram(waveform, sample_rate, title="Spectrogram"):
    waveform = waveform.numpy()

    num_channels, num_frames = waveform.shape

    figure, axes = plt.subplots(num_channels, 1)
    if num_channels == 1:
        axes = [axes]
    for c in range(num_channels):
        # specgram computes and draws a spectrogram via windowed Fourier transforms
        axes[c].specgram(waveform[c], Fs=sample_rate)
        if num_channels > 1:
            axes[c].set_ylabel(f"Channel {c+1}")
    figure.suptitle(title)

plot_specgram(waveform, sample_rate)

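The result is a time-frequency plot: a sliding window moves over the audio data and a Fourier transform is applied to each window, so the plot shows how the frequency content is distributed over time.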
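
The sliding-window idea can be sketched in pure Python. The code below is a deliberately naive short-time Fourier transform (a direct DFT on Hann-windowed frames), written for clarity rather than speed; the function name stft_magnitudes and the window and hop sizes are arbitrary choices for this illustration and do not reflect matplotlib's specgram defaults. A 1000 Hz test tone should produce a peak at 1000 Hz in every frame.

```python
import cmath
import math


def stft_magnitudes(samples, n_fft=64, hop=32):
    """Return one magnitude spectrum (bins 0..n_fft/2) per windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        # Apply the Hann window to the current frame.
        chunk = [samples[start + n] * window[n] for n in range(n_fft)]
        spectrum = []
        for k in range(n_fft // 2 + 1):
            # Direct DFT for frequency bin k.
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

frames = stft_magnitudes(signal)
# For each frame, find the bin with the largest magnitude and convert it to Hz.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
peak_hz = [b * sample_rate / 64 for b in peak_bins]
print(len(frames), peak_hz)
```

Each row of frames corresponds to one column of a spectrogram image: time advances from frame to frame, and the index within a frame is the frequency bin.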

Audio(waveform.numpy()[0], rate=sample_rate)

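Loading from file-like objects

The I/O functions support file-like objects. This allows audio data to be fetched and decoded from locations inside and beyond the local file system. The following examples illustrate this.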

# Load audio data as HTTP request

url = "https://download.pytorch.org/torchaudio/tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"

with requests.get(url, stream=True) as response:
    waveform, sample_rate = torchaudio.load(_hide_seek(response.raw))

plot_specgram(waveform, sample_rate, title="HTTP datasource")


# Load audio from tar file

tar_path = download_asset("tutorial-assets/VOiCES_devkit.tar.gz")

tar_item = "VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"

with tarfile.open(tar_path, mode="r") as tarfile_:
    fileobj = tarfile_.extractfile(tar_item)
    waveform, sample_rate = torchaudio.load(fileobj)

plot_specgram(waveform, sample_rate, title="TAR file")

Output:

0%| | 0.00/110k [00:00

100%|##########| 110k/110k [00:00<00:00, 76.0MB/s]
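
The tar example works because extractfile() hands back a file-like object instead of writing the member to disk. The sketch below reproduces that pattern with the standard library only: it builds a tiny WAV and a tar archive in memory, then reads the WAV straight out of the archive with the wave module standing in for torchaudio.load. The member name demo/clip.wav is made up for this illustration.

```python
import io
import tarfile
import wave

# Build a tiny 16-bit mono WAV (silence) in memory.
wav_buffer = io.BytesIO()
with wave.open(wav_buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(16000)
    writer.writeframes(b"\x00\x00" * 160)
wav_bytes = wav_buffer.getvalue()

# Pack it into an in-memory tar archive under a made-up member name.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as archive:
    member = tarfile.TarInfo(name="demo/clip.wav")
    member.size = len(wav_bytes)
    archive.addfile(member, io.BytesIO(wav_bytes))
tar_buffer.seek(0)

# extractfile() returns a file-like object; nothing is written to disk.
with tarfile.open(fileobj=tar_buffer, mode="r") as archive:
    fileobj = archive.extractfile("demo/clip.wav")
    with wave.open(fileobj, "rb") as reader:
        sample_rate = reader.getframerate()
        num_frames = reader.getnframes()

print(sample_rate, num_frames)
```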

# Load audio from S3

bucket = "pytorch-tutorial-assets"

key = "VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"

client = boto3.client("s3", config=Config(signature_version=UNSIGNED))

response = client.get_object(Bucket=bucket, Key=key)

waveform, sample_rate = torchaudio.load(_hide_seek(response["Body"]))

plot_specgram(waveform, sample_rate, title="From S3")

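Tips on slicing

Providing the num_frames and frame_offset arguments restricts decoding to the corresponding segment of the input.

The same result can be achieved with ordinary Tensor slicing (i.e. waveform[:, frame_offset:frame_offset+num_frames]). However, providing the num_frames and frame_offset arguments is more efficient.

This is because the function stops acquiring and decoding data once it has finished decoding the requested frames. This is advantageous when the audio data is transferred over a network, because the transfer stops as soon as the necessary amount of data has been fetched.

The following example illustrates this.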

# Illustration of two different decoding methods.

# The first one will fetch all the data and decode them, while

# the second one will stop fetching data once it completes decoding.

# The resulting waveforms are identical.

frame_offset, num_frames = 16000, 16000 # Fetch and decode the 1 - 2 seconds

url = "https://download.pytorch.org/torchaudio/tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"

print("Fetching all the data...")

with requests.get(url, stream=True) as response:
    waveform1, sample_rate1 = torchaudio.load(_hide_seek(response.raw))
    waveform1 = waveform1[:, frame_offset : frame_offset + num_frames]
    print(f" - Fetched {response.raw.tell()} bytes")

print("Fetching until the requested frames are available...")

with requests.get(url, stream=True) as response:
    waveform2, sample_rate2 = torchaudio.load(
        _hide_seek(response.raw), frame_offset=frame_offset, num_frames=num_frames
    )
    print(f" - Fetched {response.raw.tell()} bytes")

print("Checking the resulting waveform ... ", end="")

assert (waveform1 == waveform2).all()

print("matched!")

Output:

Fetching all the data...

- Fetched 108844 bytes

Fetching until the requested frames are available...

- Fetched 108844 bytes

Checking the resulting waveform ... matched!
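
The equivalence of the two approaches can also be checked offline. The sketch below uses the standard-library wave module as a stand-in for torchaudio.load: one path decodes every frame and slices afterwards, the other positions the reader at frame_offset and reads only num_frames frames. Unlike the streaming case above, this version relies on a seekable buffer, and the test signal and offsets are arbitrary values chosen for the illustration.

```python
import io
import struct
import wave

sample_rate = 16000
frame_offset, num_frames = 4000, 2000
width = 2  # bytes per 16-bit mono frame

# One second of an arbitrary, non-constant 16-bit test signal.
values = [(i * 7) % 2000 - 1000 for i in range(sample_rate)]
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(width)
    writer.setframerate(sample_rate)
    writer.writeframes(struct.pack(f"<{len(values)}h", *values))

# Method 1: read every frame, then slice out the segment.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    everything = reader.readframes(reader.getnframes())
sliced = everything[frame_offset * width : (frame_offset + num_frames) * width]

# Method 2: jump to the offset and read only the requested frames.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    reader.setpos(frame_offset)
    partial = reader.readframes(num_frames)

print(len(everything), len(partial), sliced == partial)
```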

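Saving audio to a file

To save audio data in a format that common applications can interpret, you can use torchaudio.save().

This function accepts a path-like object or a file-like object.

When passing a file-like object, you also need to provide the format argument so that the function knows which format to use. For a path-like object, the function infers the format from the extension. If you are saving to a file without an extension, you need to provide the format argument.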
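
The rule described above (an explicit format wins, otherwise the extension decides, and a missing extension is an error) can be sketched as a small helper. This is only an illustration of the rule, not torchaudio's actual implementation; the name resolve_format and the error message are hypothetical.

```python
import os


def resolve_format(path, fmt=None):
    """Pick an audio format: explicit argument first, file extension second."""
    if fmt is not None:
        return fmt
    extension = os.path.splitext(path)[1]
    if not extension:
        raise ValueError("no file extension: pass the format explicitly")
    return extension[1:].lower()


print(resolve_format("out/save_example.flac"))
print(resolve_format("out/save_example", fmt="wav"))
```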

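When saving data in WAV format, the default encoding for a float32 Tensor is 32-bit floating-point PCM. You can provide the encoding and bits_per_sample arguments to change this behavior. For example, to save data in 16-bit signed integer PCM, you can do the following. Note that in the output recorded below, the file saved without any options is reported as 16-bit PCM_S, so the actual default may depend on the torchaudio version and backend in use.

Saving data in an encoding with a lower bit depth reduces the resulting file size but also loses precision.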

waveform, sample_rate = torchaudio.load(SAMPLE_WAV)

def inspect_file(path):
    print("-" * 10)
    print("Source:", path)
    print("-" * 10)
    print(f" - File size: {os.path.getsize(path)} bytes")
    print(f" - {torchaudio.info(path)}")
    print()

Save without any encoding option. The function picks an encoding that fits the provided data.

with tempfile.TemporaryDirectory() as tempdir:
    path = f"{tempdir}/save_example_default.wav"
    torchaudio.save(path, waveform, sample_rate)
    inspect_file(path)

Output:

----------

Source: /tmp/tmph9zhvqey/save_example_default.wav

----------

- File size: 108878 bytes

- AudioMetaData(sample_rate=16000, num_frames=54400, num_channels=1, bits_per_sample=16, encoding=PCM_S)

Save as 16-bit signed integer linear PCM. Compared with 32-bit floating-point PCM, the resulting file takes half the storage but loses precision.

with tempfile.TemporaryDirectory() as tempdir:
    path = f"{tempdir}/save_example_PCM_S16.wav"
    torchaudio.save(path, waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
    inspect_file(path)

Output:

----------

Source: /tmp/tmpnjohn05p/save_example_PCM_S16.wav

----------

- File size: 108878 bytes

- AudioMetaData(sample_rate=16000, num_frames=54400, num_channels=1, bits_per_sample=16, encoding=PCM_S)
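
The size and precision trade-off can be made concrete with a little arithmetic. The sketch below assumes a simple quantizer (scale by 2^15, round, clamp to the int16 range); torchaudio's own conversion may differ in detail, and the helper names are hypothetical. It also compares the raw sample payload of the 54400-frame mono clip at 4 bytes per sample (float32) and 2 bytes per sample (16-bit PCM).

```python
def quantize_16bit(x: float) -> int:
    """Map a float in [-1.0, 1.0] to a signed 16-bit integer (assumed scheme)."""
    scaled = round(x * 32768)
    return max(-32768, min(32767, scaled))


def dequantize_16bit(q: int) -> float:
    """Map a signed 16-bit integer back to a float."""
    return q / 32768


original = [0.123456789, -0.5, 0.999]
restored = [dequantize_16bit(quantize_16bit(x)) for x in original]
errors = [abs(a - b) for a, b in zip(original, restored)]

# Raw payload sizes for the 54400-frame mono clip used in this tutorial.
num_frames = 54400
bytes_float32 = num_frames * 4
bytes_int16 = num_frames * 2

print(max(errors), bytes_float32, bytes_int16)
```

Under these assumptions the round-trip error stays within half a quantization step (0.5 / 32768) for values that are not clamped, and the 16-bit payload is half the size of the float32 payload.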

torchaudio.save() can also handle other formats. To name a few:

formats = [
    "flac",
    # "vorbis",
    # "sph",
    # "amb",
    # "amr-nb",
    # "gsm",
]

waveform, sample_rate = torchaudio.load(SAMPLE_WAV_8000)

with tempfile.TemporaryDirectory() as tempdir:
    for format in formats:
        path = f"{tempdir}/save_example.{format}"
        torchaudio.save(path, waveform, sample_rate, format=format)
        inspect_file(path)

Output:

----------

Source: /tmp/tmpt4f0mdkn/save_example.flac

----------

- File size: 45262 bytes

- AudioMetaData(sample_rate=8000, num_frames=27200, num_channels=1, bits_per_sample=16, encoding=FLAC)

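Saving to a file-like object

As with the other I/O functions, you can save audio to a file-like object. When saving to a file-like object, the format argument is required.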

waveform, sample_rate = torchaudio.load(SAMPLE_WAV)

# Saving to bytes buffer

buffer_ = io.BytesIO()

torchaudio.save(buffer_, waveform, sample_rate, format="wav")

buffer_.seek(0)

print(buffer_.read(16))

Output:

b'RIFFF\xa9\x01\x00WAVEfmt '
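
The 16 bytes printed above are the start of a RIFF/WAVE header and can be decoded with the standard-library struct module. In the RIFF layout, the 4-byte little-endian integer after the "RIFF" tag gives the number of bytes that follow it, so adding 8 (the tag plus the size field itself) yields the total size of the data. The variable names below are chosen for this illustration.

```python
import struct

# The 16 bytes printed by buffer_.read(16) above.
header = b"RIFFF\xa9\x01\x00WAVEfmt "

# 4-byte tag, little-endian uint32 size, 4-byte form type, 4-byte chunk tag.
chunk_id, riff_size, form_type, subchunk_id = struct.unpack("<4sI4s4s", header)
total_size = riff_size + 8

print(chunk_id, riff_size, form_type, subchunk_id, total_size)
```

The computed total of 108878 bytes equals the file size reported by inspect_file for the WAV files saved earlier in this tutorial.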

