Speech Information: Isolated-Word Speech Recognition Based on DTW and MFCC

Task

Implement isolated-word speech recognition based on the DTW algorithm.

Approach

MFCC

As covered in class, to recognize an isolated spoken word we first convert the input audio signal into MFCC features, i.e. Mel Frequency Cepstrum Coefficients (MFCC). The MFCC extraction pipeline is shown below:
(Figure: MFCC extraction pipeline)
To recognize a test utterance, we first need a set of template features (template MFCCs). Once the templates are available, each incoming test utterance is matched against every template, and it is assigned to the class of the most similar template.
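As a minimal sketch of this step (assuming the python_speech_features package listed in the requirements; the file path is just an example following the dataset naming scheme), the 39-dimensional feature used later combines 13 static MFCCs with their first- and second-order deltas:

```python
# Minimal sketch: read one wav file and build a 39-dimensional MFCC feature
# (13 static coefficients + delta + delta-delta), as in the code appendix.
import numpy as np
import scipy.io.wavfile as wav
import python_speech_features.base as psf

rate, data = wav.read("./data_en/0-1.wav")                  # example path
static = psf.mfcc(signal=data, samplerate=rate, nfft=1200)  # (frames, 13)
delta = psf.delta(static, 1)                                # first-order deltas
ddelta = psf.delta(static, 2)                               # second-order deltas
mfcc = np.hstack((static, delta, ddelta))                   # (frames, 39)
```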

DTW

DTW, i.e. Dynamic Time Warping, provides the most basic way to match MFCC sequences. The idea behind applying it here is: for two audio segments expressing the same content, differences in the speakers' timbre and speaking rate mean that the same phoneme occupies a different duration in each recording, yet when the two are compared these phonemes must still be aligned with one another; this is exactly the problem DTW addresses.
The DTW process can be illustrated as follows:
(Figure: DTW alignment between two sequences)
So we now need to define the warping path shown in the figure above.
There are two common definitions of the warping path:
(Figure: the two warping-path definitions)
The main difference between them is whether a vertical match is allowed, i.e. whether the current symbol may be skipped or must be matched.

Clearly, solving DTW is a dynamic-programming problem, so for the two path definitions above the DP recurrences are as follows (a short code sketch follows the list):

  • LEVENSHTEIN:
    $$
    cost_{i,j} = dist_{i,j} + \min(cost_{i,j-1},\ cost_{i-1,j},\ cost_{i-1,j-1})
    \\ \text{where} \quad cost_{0,0} = dist_{0,0},\ i > 0,\ j > 0
    $$
  • DTW:
    $$
    cost_{i,j} = dist_{i,j} + \min(cost_{i,j-1},\ cost_{i-1,j-1},\ cost_{i-2,j-1})
    \\ \text{where} \quad cost_{0,0} = dist_{0,0},\ i > 1,\ j > 0
    $$
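A compact standalone sketch of the DTW recurrence above, assuming a precomputed frame-distance matrix `dist` (the class-based version actually used in the experiments is given in the code appendix):

```python
import numpy as np

def dtw_cost(dist):
    """DP over a (T1, T2) frame-distance matrix using the DTW recurrence
    above: predecessors (i, j-1), (i-1, j-1), (i-2, j-1)."""
    T1, T2 = dist.shape
    cost = np.full((T1, T2), np.inf)
    cost[0, 0] = dist[0, 0]
    for j in range(1, T2):                     # first row: horizontal moves only
        cost[0, j] = cost[0, j - 1] + dist[0, j]
    for i in range(2, T1):                     # rows i > 1, as in the recurrence
        for j in range(1, T2):
            cost[i, j] = dist[i, j] + min(cost[i, j - 1],
                                          cost[i - 1, j - 1],
                                          cost[i - 2, j - 1])
    return cost[-1, -1]                        # total alignment cost
```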

Procedure

With the approach above in place, we implement it as follows:

Environment

  • OS: Mac OS X
  • Language: Python 3.6.5
  • Requirements:
    • numpy
    • scipy.io.wavfile: reading wav data
    • python_speech_features.base: computing MFCC features
    • matplotlib

Project structure

To keep the code general, the single-word recognizer is implemented in an object-oriented way as a Digit_Voice_Rec class:

| Method | Description |
| --- | --- |
| `__init__` | Initializes the class and sets the shared member variables |
| `get_input` | Reads a wav file and returns the sample rate and amplitude |
| `get_mfcc` | Takes the data read by `get_input` and returns the corresponding MFCC features |
| `gather_mfcc` | Collects the whole dataset and returns the full list of MFCCs |
| `get_template` | Extracts the template MFCC representing each class |
| `DTW` | DP implementation of the DTW recurrence |
| `LEVENSHTEIN` | DP implementation of the Levenshtein recurrence |
| `run_Voice_rec` | Evaluates the algorithm's accuracy and behaviour |
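A minimal usage sketch (these are the same calls that appear at the end of the code appendix):

```python
DVR = Digit_Voice_Rec()                  # defaults: 16 kHz, mono, data under ./data_en/
DVR.run_Voice_rec()                      # single-template matching
DVR.run_Voice_rec_multi_template(t_k=3)  # multi-template matching, 3 templates per digit
```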

Running steps

The program logic and data flow are shown below:
(Figure: program logic and data flow)

Results

Dataset

The dataset comes from a public resource on GitHub (English spoken digits): 5 recordings per digit for the digits 0~9, 50 recordings in total.
The spectrograms are shown below:
(Figure: spectrograms of the dataset)

Recognition results

DTW

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6
predict digit: 0, true label: 0
predict digit: 0, true label: 0
predict digit: 0, true label: 0
predict digit: 0, true label: 0
predict digit: 9, true label: 1
predict digit: 1, true label: 1
predict digit: 1, true label: 1
predict digit: 1, true label: 1
predict digit: 2, true label: 2
predict digit: 2, true label: 2
predict digit: 2, true label: 2
predict digit: 2, true label: 2
predict digit: 3, true label: 3
predict digit: 3, true label: 3
predict digit: 3, true label: 3
predict digit: 3, true label: 3
predict digit: 4, true label: 4
predict digit: 4, true label: 4
predict digit: 4, true label: 4
predict digit: 4, true label: 4
predict digit: 5, true label: 5
predict digit: 5, true label: 5
predict digit: 5, true label: 5
predict digit: 9, true label: 5
predict digit: 6, true label: 6
predict digit: 6, true label: 6
predict digit: 6, true label: 6
predict digit: 6, true label: 6
predict digit: 7, true label: 7
predict digit: 7, true label: 7
predict digit: 7, true label: 7
predict digit: 7, true label: 7
predict digit: 8, true label: 8
predict digit: 8, true label: 8
predict digit: 8, true label: 8
predict digit: 8, true label: 8
predict digit: 9, true label: 9
predict digit: 9, true label: 9
predict digit: 9, true label: 9
predict digit: 9, true label: 9
Acc: 0.95

Process finished with exit code 0

LEVENSHTEIN

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6
predict digit: 0, true label: 0
predict digit: 0, true label: 0
predict digit: 0, true label: 0
predict digit: 0, true label: 0
predict digit: 9, true label: 1
predict digit: 1, true label: 1
predict digit: 1, true label: 1
predict digit: 1, true label: 1
predict digit: 2, true label: 2
predict digit: 2, true label: 2
predict digit: 2, true label: 2
predict digit: 2, true label: 2
predict digit: 3, true label: 3
predict digit: 3, true label: 3
predict digit: 3, true label: 3
predict digit: 3, true label: 3
predict digit: 4, true label: 4
predict digit: 4, true label: 4
predict digit: 4, true label: 4
predict digit: 4, true label: 4
predict digit: 5, true label: 5
predict digit: 5, true label: 5
predict digit: 5, true label: 5
predict digit: 9, true label: 5
predict digit: 6, true label: 6
predict digit: 6, true label: 6
predict digit: 6, true label: 6
predict digit: 6, true label: 6
predict digit: 7, true label: 7
predict digit: 7, true label: 7
predict digit: 7, true label: 7
predict digit: 7, true label: 7
predict digit: 8, true label: 8
predict digit: 8, true label: 8
predict digit: 8, true label: 8
predict digit: 8, true label: 8
predict digit: 9, true label: 9
predict digit: 9, true label: 9
predict digit: 9, true label: 9
predict digit: 9, true label: 9
Acc: 0.95

As the results show, on this dataset there is little difference between using LEVENSHTEIN and DTW.

Improvements

Compared with LEVENSHTEIN, DTW merely removes the vertical alignment move during template matching; for further improvement, we consider additional strategies on the template-matching side.

The experiments above all use a single template per class. Single-template matching is fast, but there is no guarantee that the chosen template is a clean, representative MFCC, so an improvement is to build the template set from several MFCCs, which reduces, to some extent, the matching error caused by an unlucky template choice.

In the multi-template matching below we assume that each class has $t_k$ candidate templates; $t_k$ may be the same or different across classes.

Multi-template mean matching

The idea: for the candidate templates of a class, the test feature $MFCC_i$ is aligned with every template $template\_MFCC_j,\ j \in t_k$ using DTW, producing a cost $cost_j$; the total cost of assigning $MFCC_i$ to that class is then the weighted average of the per-template costs, $cost_i = \sum_{j \in t_k} p_j \cdot cost_j$.

The weights $p_j$ can be set from prior knowledge; for example, if some templates are observed not to align well with their class (their $cost$ is always very large), they can be manually down-weighted. In practice, equal weights are normally used.
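A standalone sketch of this mean-matching rule with equal weights $p_j = 1/t_k$; the function and argument names are illustrative, and `dtw` stands for a pairwise alignment cost such as `Digit_Voice_Rec.DTW`:

```python
import numpy as np

def mean_template_predict(test_mfcc, class_templates, dtw):
    """class_templates: dict mapping class label -> list of template MFCCs;
    dtw: a pairwise alignment cost function such as Digit_Voice_Rec.DTW."""
    costs = {}
    for label, templates in class_templates.items():
        per_template = [dtw(t, test_mfcc) for t in templates]
        costs[label] = np.mean(per_template)   # equal weights p_j = 1/t_k
    return min(costs, key=costs.get)           # class with the smallest mean cost
```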

predict digit: 0, true label: 0
predict digit: 0, true label: 0
predict digit: 9, true label: 1
predict digit: 9, true label: 1
predict digit: 2, true label: 2
predict digit: 2, true label: 2
predict digit: 3, true label: 3
predict digit: 3, true label: 3
predict digit: 4, true label: 4
predict digit: 4, true label: 4
predict digit: 5, true label: 5
predict digit: 1, true label: 5
predict digit: 6, true label: 6
predict digit: 6, true label: 6
predict digit: 7, true label: 7
predict digit: 7, true label: 7
predict digit: 8, true label: 8
predict digit: 8, true label: 8
predict digit: 9, true label: 9
predict digit: 9, true label: 9
Acc: 0.85

Multi-template global-minimum matching

Mean matching gives a first way of using several templates at once. It can reduce the error caused by randomly picking a single template, but if $t_k - 1$ of the $t_k$ chosen templates are "bad", then, without prior knowledge, equal weights introduce a large error.

Recall that the core idea of the whole matching process is: the predicted class is the class of the minimum-cost template. So with multiple templates we can drop the averaging idea: as long as some template of a class yields a sufficiently small cost $cost_{min}$, and no template of any other class yields a smaller cost than $cost_{min}$, we assign $MFCC_i$ to the class of $cost_{min}$.

The advantage of this change is that it not only removes the chance error of picking a single template, but also, in the absence of prior knowledge, minimizes the combined error introduced by several "bad" templates.
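A sketch of the global-minimum rule; as before, the function and argument names are illustrative, and `dtw` stands for a pairwise alignment cost such as `Digit_Voice_Rec.DTW`:

```python
def min_template_predict(test_mfcc, class_templates, dtw):
    """Assign the class whose best (smallest) single-template DTW cost is
    the global minimum over all classes and all templates."""
    best = {label: min(dtw(t, test_mfcc) for t in templates)
            for label, templates in class_templates.items()}
    return min(best, key=best.get)
```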

The results are as follows ($t_k = 3$):

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6
predict digit: 0, true label: 0
predict digit: 0, true label: 0
predict digit: 1, true label: 1
predict digit: 1, true label: 1
predict digit: 2, true label: 2
predict digit: 2, true label: 2
predict digit: 3, true label: 3
predict digit: 3, true label: 3
predict digit: 4, true label: 4
predict digit: 4, true label: 4
predict digit: 5, true label: 5
predict digit: 1, true label: 5
predict digit: 6, true label: 6
predict digit: 6, true label: 6
predict digit: 7, true label: 7
predict digit: 7, true label: 7
predict digit: 8, true label: 8
predict digit: 8, true label: 8
predict digit: 9, true label: 9
predict digit: 9, true label: 9
Acc: 0.95

Analysis

The results match the analysis above: when there is no prior to set the weights $p_j$, plain equal-weight mean matching does not bring much improvement. Viewed the other way, the global minimum is indeed a reliable method: apart from being slower to match (every template of every class must be compared), its performance is solid.

Code appendix

```python
# -*- coding: utf-8 -*-
import numpy as np
import scipy.io.wavfile as wav
import python_speech_features.base as psf

from matplotlib import pyplot as plt


class Digit_Voice_Rec:
    def __init__(self, digit_size=9, example_size=5, framerate=16000, channels=1, sampwidth=2, datapath='./data_en/'):
        '''
        :param digit_size: digits 0~9
        :param example_size: number of recordings per digit
        :param framerate: sampling rate, 16 kHz
        :param channels: number of channels (mono)
        :param sampwidth: sample width, 2 bytes
        :param datapath: path to the dataset
        '''

        self.digit_size = digit_size
        self.example_size = example_size
        self.framerate = framerate
        self.channels = channels
        self.sampwidth = sampwidth
        self.datapath = datapath

        # counter of correct predictions
        self.correct_cnt = 0

    def get_input(self, filename, show=False):
        rate, data = wav.read(filename)

        if show:
            print(filename, "\tfs, signal:", rate, data)
            plt.title(filename)
            plt.xlabel("Time [s]")
            plt.ylabel("Amplitude")
            plt.plot(np.linspace(0., data.shape[0]/rate, data.shape[0]), data, label=filename)
            plt.show()

        return rate, data

    def get_mfcc(self, rate, data):
        # 13 static MFCCs plus first- and second-order deltas -> 39-dim frames
        feature = psf.mfcc(signal=data, samplerate=rate, nfft=1200)
        delta_feature = psf.delta(feature, 1)
        d_delta_feature = psf.delta(feature, 2)
        mfcc = np.hstack((feature, delta_feature, d_delta_feature))

        return mfcc

    def gather_mfcc(self):
        gather = []
        for i in range(self.digit_size + 1):
            _ = []
            for j in range(self.example_size):
                file_name = self.datapath + str(i) + "-" + str(j + 1) + ".wav"
                r, d = self.get_input(file_name)
                feature = self.get_mfcc(r, d)
                _.append(feature)
            gather.append(_)

        return gather

    def get_template(self, gather):
        template_mfcc = []
        # the 0-th recording of each digit serves as its template
        for mfcc in gather:
            template_mfcc.append(mfcc[0])

        return template_mfcc

    # multiple templates per digit
    def get_multi_template(self, gather, t_k=1):
        template_mfcc = {}
        # the first t_k recordings serve as the templates
        for i in range(len(gather)):
            if i not in template_mfcc:
                template_mfcc[i] = []
            for j in range(t_k):
                template_mfcc[i].append(gather[i][j])
        return template_mfcc

    def DTW(self, mfcc1, mfcc2):

        mfcc1 = np.array(mfcc1)
        mfcc2 = np.array(mfcc2)

        # L1 (cityblock) distance between two MFCC frames
        def get_distance(x1, x2):
            dis = 0
            for i in range(x1.shape[0]):
                dis += abs(x1[i] - x2[i])
            return dis

        cost = np.full((mfcc1.shape[0], mfcc2.shape[0]), np.inf)
        dist = np.zeros((mfcc1.shape[0], mfcc2.shape[0]))

        # fill the (mfcc1, mfcc2) frame-distance matrix
        for i in range(mfcc1.shape[0]):
            for j in range(mfcc2.shape[0]):
                dist[i][j] = get_distance(mfcc1[i], mfcc2[j])

        # initialize cost
        cost[0][0] = dist[0][0]
        # initialize row i = 0 (horizontal moves only)
        for j in range(1, mfcc2.shape[0]):
            cost[0][j] = cost[0][j-1] + dist[0][j]

        # DTW: cost[i][j] = dist[i][j] + min(
        #     cost[i][j-1],     ->
        #     cost[i-1][j-1],   />
        #     cost[i-2][j-1]    //>
        # )
        # note: row i = 1 is left at +inf, matching the recurrence's i > 1 constraint
        for i in range(2, mfcc1.shape[0]):
            for j in range(1, mfcc2.shape[0]):
                cost[i][j] = dist[i][j] + min(cost[i][j-1], cost[i-1][j-1], cost[i-2][j-1])

        final_cost = cost[-1][-1]
        return final_cost

    def LEVENSHTEIN(self, mfcc1, mfcc2):

        mfcc1 = np.array(mfcc1)
        mfcc2 = np.array(mfcc2)

        # L1 (cityblock) distance between two MFCC frames
        def get_distance(x1, x2):
            dis = 0
            for i in range(x1.shape[0]):
                dis += abs(x1[i] - x2[i])
            return dis

        cost = np.zeros((mfcc1.shape[0], mfcc2.shape[0]))
        dist = np.zeros((mfcc1.shape[0], mfcc2.shape[0]))

        # fill the (mfcc1, mfcc2) frame-distance matrix
        for i in range(mfcc1.shape[0]):
            for j in range(mfcc2.shape[0]):
                dist[i][j] = get_distance(mfcc1[i], mfcc2[j])

        # initialize cost
        cost[0][0] = dist[0][0]
        # initialize row i = 0 and column j = 0
        for i in range(1, mfcc1.shape[0]):
            cost[i][0] = cost[i - 1][0] + dist[i][0]
        for j in range(1, mfcc2.shape[0]):
            cost[0][j] = cost[0][j - 1] + dist[0][j]

        # LEVENSHTEIN: cost[i][j] = dist[i][j] + min(
        #     cost[i][j-1],    ->
        #     cost[i-1][j],    |>
        #     cost[i-1][j-1]   />
        # )
        for i in range(1, mfcc1.shape[0]):
            for j in range(1, mfcc2.shape[0]):
                cost[i][j] = dist[i][j] + min(cost[i][j-1], cost[i-1][j], cost[i-1][j-1])

        final_cost = cost[-1][-1]
        return final_cost

    def run_Voice_rec(self):
        self.correct_cnt = 0  # reset the counter for this run
        gather_mfcc = self.gather_mfcc()
        template_mfcc = self.get_template(gather_mfcc)
        test_mfcc = [mfcc for mfcc_i in gather_mfcc for mfcc in mfcc_i[1:]]

        for i in range(len(test_mfcc)):
            # match against every template and keep the minimum cost
            cost = [self.LEVENSHTEIN(template, test_mfcc[i]) for template in template_mfcc]
            # print(cost)
            pred_digit = np.argmin(np.array(cost))
            true_digit = int(i/(self.example_size-1))

            if pred_digit == true_digit:
                self.correct_cnt += 1

            print("predict digit: %d, true label: %d"%(pred_digit, true_digit))

        print("Acc: %.2f"%(self.correct_cnt/len(test_mfcc)))

    # multiple templates per digit
    def run_Voice_rec_multi_template(self, t_k):
        self.correct_cnt = 0  # reset the counter for this run
        gather_mfcc = self.gather_mfcc()
        template_mfcc = self.get_multi_template(gather_mfcc, t_k=t_k)
        # recordings not used as templates serve as the test set
        test_mfcc = [mfcc for mfcc_i in gather_mfcc for mfcc in mfcc_i[t_k:]]

        for i in range(len(test_mfcc)):
            # match against every class's templates and keep the minimum cost
            # cost = [self.LEVENSHTEIN(template, test_mfcc[i]) for template in template_mfcc]
            cost = []
            for key in template_mfcc:
                # c = 0
                c = []
                for j in range(t_k):
                    # weighted-average cost (mean matching)
                    # c += 1.0/t_k * self.DTW(template_mfcc[key][j], test_mfcc[i])
                    # minimum cost (global-minimum matching)
                    c.append(self.DTW(template_mfcc[key][j], test_mfcc[i]))
                # cost.append(c)
                cost.append(min(c))

            # print(cost)
            pred_digit = np.argmin(np.array(cost))
            true_digit = int(i/(self.example_size-t_k))

            if pred_digit == true_digit:
                self.correct_cnt += 1

            print("predict digit: %d, true label: %d"%(pred_digit, true_digit))

        print("Acc: %.2f"%(self.correct_cnt/len(test_mfcc)))


DVR = Digit_Voice_Rec()
DVR.run_Voice_rec()

DVR.run_Voice_rec_multi_template(t_k=3)
```