Generative Adversarial Networks (GANs) for Audio-Visual Speech Recognition in Artificial Intelligence IoT

Yibo He; Kah Phooi Seng; Li Minn Ang

doi:10.3390/info14100575

Journal article

Open access

Peer reviewed

Information, Vol.14(10), pp.1-23

2023

Abstract

This paper proposes a novel multimodal generative adversarial network (GAN) architecture for audio-visual speech recognition (AVSR), the multimodal AVSR GAN, to improve both the energy efficiency and the AVSR classification accuracy of artificial intelligence Internet of things (IoT) applications. AVSR is a classical multimodal task that is commonly used in IoT and embedded systems. Examples of suitable IoT applications include in-cabin speech recognition systems for driving systems, AVSR in augmented reality environments, and interactive applications such as virtual aquariums. The application of multimodal sensor data for IoT applications requires efficient information processing to meet the hardware constraints of IoT devices. The proposed multimodal AVSR GAN architecture is composed of a discriminator and a generator, each of which is a two-stream network, with one stream for the audio information and one for the visual information. To validate this approach, we used augmented data from well-known datasets (LRS2, Lip Reading Sentences 2, and LRS3) in the training process, and testing was performed using the original data. The experimental results showed that the proposed multimodal AVSR GAN architecture improved the AVSR classification accuracy. Furthermore, in this study, we discuss the domain of GANs and provide a concise summary of the proposed GANs.

Keywords: Internet of things (IoT); generative adversarial networks (GANs); deep learning; audio-visual speech recognition

Details

Title: Generative Adversarial Networks (GANs) for Audio-Visual Speech Recognition in Artificial Intelligence IoT
Authors: Yibo He (Author) - Xi’an Jiaotong-Liverpool University
Kah Phooi Seng (Author) - University of the Sunshine Coast, Queensland, School of Science, Technology and Engineering
Li Minn Ang (Corresponding Author) - University of the Sunshine Coast, Queensland, School of Science, Technology and Engineering
Publication details: Information, Vol.14(10), pp.1-23
Publisher: MDPI AG
Date published: 2023
DOI: 10.3390/info14100575
ISSN: 2078-2489
Copyright note: This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Data Availability: The data presented in this study are openly available in refs [43,44,45].
Organisation Unit: University of the Sunshine Coast, Queensland; School of Science, Technology and Engineering; Engage Research Lab
Language: English
Record Identifier: 99971188902621
Output Type: Journal article
