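Speech Recognition
Speech recognition is the technology that lets a machine convert a speech signal into the corresponding text or command by recognizing and understanding it. It draws on signal processing, pattern recognition, probability and information theory, the mechanisms of speech production and hearing, and artificial intelligence. Over the past two decades the technology has advanced markedly and begun to move from the laboratory to the market, and it is expected to spread over the next ten years into industry, home appliances, communications, automotive electronics, healthcare, home services, and consumer electronics.
This tutorial shows how to format an audio dataset correctly and then train and test an audio classifier network on it.
First, we import the common torch packages, such as torchaudio, which can be installed by following the instructions on the website.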
# Uncomment the line corresponding to your "runtime type" to run in Google Colab
# CPU:
# !pip install pydub torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
# GPU:
# !pip install pydub torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchaudio
import sys
import matplotlib.pyplot as plt
import IPython.display as ipd
from tqdm import tqdm
Let's check whether a CUDA GPU is available and select our device. Running the network on a GPU will greatly reduce the training and testing time.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
Output
cuda
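Importing the Dataset
We use torchaudio to download and represent the dataset. Here we use SpeechCommands, a dataset of 35 commands spoken by different people. The dataset class SPEECHCOMMANDS is a torch.utils.data.Dataset version of the dataset. All audio files in it are about 1 second long, which at the dataset's 16 kHz sampling rate is about 16000 time frames.
The actual loading and formatting steps happen when a data point is accessed, and torchaudio takes care of converting the audio files to tensors. To load an audio file directly, torchaudio.load() can be used. It returns a tuple containing the newly created tensor along with the sampling frequency of the audio file (16 kHz for SpeechCommands).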
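As a rough illustration of the numbers above, the following sketch writes a one-second, 16 kHz mono tone to a WAV file and reads its header back. It uses Python's standard wave module as a stand-in for torchaudio.load(), so it runs without the dataset; the file name tone.wav, the 440 Hz test tone, and the helper names write_tone and read_info are assumptions made for this example.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000  # Hz, the rate used by SpeechCommands
DURATION_S = 1.0     # clips in the dataset are about one second long


def write_tone(path, freq_hz=440.0):
    """Write a mono 16-bit sine tone (a hypothetical test signal) to `path`."""
    n_samples = int(SAMPLE_RATE * DURATION_S)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)))
        for n in range(n_samples)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)


def read_info(path):
    """Return (num_channels, num_frames, sample_rate) for a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getnframes(), wav.getframerate()


tone_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_tone(tone_path)
channels, num_frames, rate = read_info(tone_path)
print(channels, num_frames, rate)
```

One second at 16 kHz gives 16000 frames in a single channel. With torchaudio, the analogous call torchaudio.load(path) returns the waveform tensor and the sample rate, and for a clip like this the tensor would be expected to have one channel and 16000 time frames.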
Going back to the dataset, here we create a subclass that splits it into the standard training, validation, and testing subsets.
from torchaudio.datasets import SPEECHCOMMANDS
import os


class SubsetSC(SPEECHCOMMANDS):
    def __init__(self, subset: str = None):
        super().__init__("./", download=True)

        def load_list(filename):
            filepath = os.path.join(self._path, filename)
            with open(filepath) as fileobj:
                return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj]

        if subset == "validation":
            self._walker = load_list("validation_list.txt")
        elif subset == "testing":
            self._walker = load_list("testing_list.txt")
        elif subset == "training":
            excludes = load_list("validation_list.txt") + load_list("testing_list.txt")
            excludes = set(excludes)
            self._walker = [w for w in self._walker if w not in excludes]

# Create training and testing split of the data. We do not use validation in this tutorial.
train_set = SubsetSC("training")
test_set = SubsetSC("testing")
waveform, sample_rate, label, speaker_id, utterance_number = train_set[0]
Output
0%| | 0.00/2.26G [00:00, ?B/s]
100%|##########| 2.26G/2.26G [00:13<00:00, 179MB/s]
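The split logic in SubsetSC can be seen in isolation with a small, torch-free sketch. The file names below are made up for illustration, and split_files is a hypothetical helper, not part of torchaudio. The validation and testing lists are taken as given, and the training set is whatever remains.

```python
def split_files(all_files, validation_list, testing_list, subset=None):
    """Mimic the subset selection: training = all files minus the two lists."""
    if subset == "validation":
        return list(validation_list)
    if subset == "testing":
        return list(testing_list)
    if subset == "training":
        # A set makes each membership test constant time on average.
        excludes = set(validation_list) | set(testing_list)
        return [f for f in all_files if f not in excludes]
    return list(all_files)  # no subset requested: keep everything


# Hypothetical file names, for illustration only
all_files = ["yes/a.wav", "yes/b.wav", "no/c.wav", "no/d.wav", "up/e.wav"]
validation_list = ["yes/b.wav"]
testing_list = ["no/d.wav"]

train = split_files(all_files, validation_list, testing_list, "training")
print(train)  # ['yes/a.wav', 'no/c.wav', 'up/e.wav']
```

Converting the exclusion list to a set matters at the scale of the real dataset, since checking every file against a plain list would scan that list once per file.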
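A data point in the SPEECHCOMMANDS dataset is a tuple made of a waveform (the audio signal), the sample rate, the utterance (label), the ID of the speaker, and the number of the utterance.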
print("Shape of waveform: {}".format(waveform.size()))
print("Sample rate of waveform: {}".format(sample_rate))