Abstract:
To address the reliance on a single feature extraction method and the low classification accuracy of traditional bird sound recognition algorithms, this paper proposes a bird sound recognition method that combines a convolutional neural network (CNN) with a Transformer network. The method accounts for both local feature learning and the modeling of global context dependencies. First, a short-time Fourier transform (STFT) spectrogram is extracted from the original bird sound signal and fed into the CNN to extract local spectral feature information. In parallel, the log-Mel features and the first-order and second-order difference features of the bird sound signal are extracted and combined into a mixed Mel-frequency cepstral coefficient (MFCC) feature vector, which is fed into the Transformer network to obtain global sequence feature information. Finally, the features from the two branches are fused to obtain richer bird sound feature parameters, and the recognition result is produced by a Softmax classifier. Experiments on the Birdsdata and xeno-canto bird sound datasets show that the method achieves average recognition accuracies of 97.81% and 89.47%, respectively, higher than those of other existing bird sound recognition models.
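As an illustration of the data flow described above, the following minimal sketch shows how per-frame features might be stacked with their first-order and second-order differences, how two branch outputs might be fused, and how a Softmax turns class scores into probabilities. It is not the implementation used in this paper: the two-point difference formula, fusion by concatenation, and all function names (`delta`, `stack_with_deltas`, `fuse`, `softmax`) are assumptions made for illustration, and the CNN and Transformer branches themselves are omitted.

```python
import math


def delta(frames):
    """First-order difference along time (assumed two-point central
    difference, with the first and last frames replicated at the edges)."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_with_deltas(frames):
    """Append first- and second-order differences to every frame,
    giving a mixed feature vector three times the original width."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


def fuse(local_vec, global_vec):
    """Fuse the two branch embeddings. Concatenation is an assumption;
    the abstract does not state which fusion operator is used."""
    return list(local_vec) + list(global_vec)


def softmax(logits):
    """Numerically stable Softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical 3 frames x 2 coefficients standing in for log-Mel features.
    frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
    mixed = stack_with_deltas(frames)
    print(len(mixed), len(mixed[0]))

    # Hypothetical branch outputs and class scores.
    fused = fuse([0.1, 0.2], [0.3, 0.4, 0.5])
    print(fused)
    print(softmax([1.0, 2.0, 3.0]))
```

In a full system, the stacked frames would be the input sequence to the Transformer branch, and the fused vector would pass through a learned linear layer before the Softmax.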