Multichannel Voice Trigger Detection Based on Transform-average-concatenate


This paper was accepted at the HSCMA workshop at ICASSP 2024.

Voice triggering (VT) enables users to activate their devices by simply speaking a trigger phrase. A front-end system is typically used to perform speech enhancement and/or separation, and it produces multiple enhanced and/or separated signals. Since conventional VT systems take only single-channel audio as input, channel selection is performed. A drawback of this approach is that unselected channels are discarded, even if the discarded channels could contain useful information for VT. In this work, we propose multichannel acoustic models for VT, where the multichannel output from the front-end is fed directly into a VT model. We adopt a transform-average-concatenate (TAC) block and modify the TAC block by incorporating the channel from the conventional channel selection so that the model can attend to a target speaker when multiple speakers are present. The proposed approach achieves up to a 30% reduction in the false rejection rate compared to the baseline channel selection approach.
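To make the transform, average, and concatenate steps concrete, here is a minimal pure-Python sketch that operates on a single frame of per-channel features. It is not the paper's model: the layer sizes, the bias-free ReLU layers, the helper names (`make_weights`, `linear_relu`, `tac_block`), and in particular the way the selected channel is appended to every channel's features are all assumptions made for illustration. Residual connections and normalization, which are commonly used with such blocks, are omitted.

```python
import random


def make_weights(rows, cols, rng):
    """Random weight matrix as a list of rows (illustrative initialisation only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear_relu(weights, vec):
    """Apply a bias-free linear layer followed by ReLU to one feature vector."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in weights]


def tac_block(channels, w_transform, w_average, w_concat, selected=None):
    """Transform-average-concatenate over a list of per-channel feature vectors.

    channels: list of C feature vectors (one frame per channel).
    selected: optional index of the channel picked by channel selection.
              How it is injected here is an assumption, not the paper's design.
    """
    # Transform: the same layer is shared by every channel.
    transformed = [linear_relu(w_transform, ch) for ch in channels]

    # Average: mean over channels, followed by a second layer.
    n = len(transformed)
    mean = [sum(col) / n for col in zip(*transformed)]
    averaged = linear_relu(w_average, mean)

    # Concatenate: append the cross-channel summary to each channel's features,
    # then map the result back to the output size with a third shared layer.
    outputs = []
    for feat in transformed:
        combined = feat + averaged
        if selected is not None:
            # Assumed variant: also expose the selected channel to every channel.
            combined = combined + transformed[selected]
        outputs.append(linear_relu(w_concat, combined))
    return outputs


# Usage: 3 channels with 4 features each, hidden size 8 (all sizes hypothetical).
rng = random.Random(0)
in_dim, hidden = 4, 8
channels = [[rng.random() for _ in range(in_dim)] for _ in range(3)]
w_t = make_weights(hidden, in_dim, rng)
w_a = make_weights(hidden, hidden, rng)
w_c = make_weights(in_dim, 3 * hidden, rng)  # 3 * hidden because of the selected channel
out = tac_block(channels, w_t, w_a, w_c, selected=0)
```

Because the transform and concatenate layers are shared and the average is symmetric, the plain block (with `selected=None` and a `2 * hidden` wide `w_concat`) treats channels interchangeably: reordering the inputs only reorders the outputs. Passing a selected channel breaks that symmetry on purpose, which is one plausible way to give the model a reference for the target speaker.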
