Hung-yi Lee

My research team uses deep learning to develop language understanding and speech processing technologies. We have proposed a series of techniques that remarkably reduce the amount of annotated data required for language understanding and speech processing.

Sequence Generation by Generative Adversarial Network (GAN): Proposing a novel GAN training algorithm for sequential data. The proposed approach trains five times faster than the previous state-of-the-art approach. Because text and speech are intrinsically sequential data, the proposed approach is very useful for language understanding and speech processing.
Unsupervised Speech Recognition

Audio word2vec: Unsupervised learning of audio segment representations using a sequence-to-sequence autoencoder.
Audio segment representations learned from one language can be applied to other languages.
Audio segment representations improve video captioning.
Without text transcription, the machine automatically learns to identify word boundaries from audio.
Developing the world’s first unsupervised speech recognition system. Without paired text, it has achieved a phoneme recognition error rate of 33%.
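For readers unfamiliar with the metric: a phoneme recognition error rate is conventionally the edit distance between the recognized and the reference phoneme sequences, divided by the reference length. The sketch below only illustrates that standard definition. The function names and the toy phoneme sequences are illustrative assumptions, not the evaluation code or data behind the result quoted above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy sequences (hypothetical): one substitution and one deletion.
reference = ["k", "ae", "t", "s", "ih"]
hypothesis = ["k", "ae", "p", "s"]
per = phoneme_error_rate(reference, hypothesis)
print(f"phoneme error rate: {per:.0%}")
```

Because insertions count as errors, this rate can exceed 100% when the hypothesis is much longer than the reference.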

Transforming the voice of speaker A into that of speaker B is a typical example of voice conversion (VC). To achieve this, speakers A and B usually have to read hundreds of sentences with the same content to teach the machine how to transform their voices, which is not practical. We proposed new VC approaches based on GAN (one of the 12 finalists for the best student paper award of INTERSPEECH 2018). Only audio from speakers A and B is needed, and they do not have to read the same sentences.

Developing the world’s first spoken language understanding system that can take the TOEFL Listening Comprehension Test (the paper was nominated for the Best Student Paper Award in INTERSPEECH 2016). The machine achieved 49% and 55% accuracy (dataset).
Developing the world’s first deep learning based spoken QA system (a QA system that can answer questions based on spoken content) and benchmark corpora.
Using GAN to deal with the lack of training data for spoken QA.
Proposing to learn Chinese word representations from glyphs of characters.
Leading a team participating in the Formosa Grand Challenge, a competition held by the Ministry of Science and Technology of Taiwan. The task is machine listening comprehension of Chinese spoken content. 143 teams participated in the competition. We won the championship in the final (news).
Leading a team participating in the MovieQA competition 2017, in which the machine answers questions based on the plots of movies. The team ranked 2nd (leaderboard).

Unsupervised Abstractive Summarization: Abstractive summarization generates a summary that describes the core ideas of a document in its own words. Training a summarizer with reasonable performance generally requires millions of paired documents and summaries as training examples, which limits the application of the technology. Based on GAN, we propose unsupervised abstractive summarization, and the approach achieves performance comparable to state-of-the-art approaches with only 20% of the paired data.
Abstractive summarization for spoken content using connectionist temporal classification (CTC).
Abstractive summarization for spoken content using an attention-based sequence-to-sequence network with ASR error modeling.
Key term extraction using neural attention models.

Reinforcement Learning: Proposing to apply reinforcement learning in human-machine interaction to determine the machine actions for interactive retrieval (the paper was nominated for the Best Student Paper Award in INTERSPEECH 2012).
Learnable Simulated User: Reinforcement learning for human-machine interaction relies on hand-crafted user simulators, and building a reliable user simulator is difficult and expensive. Inspired by the GAN framework, we proposed a learnable user simulator that is jointly trained with an interactive agent, removing the need for a hand-crafted user simulator. This paper won the best student paper award of INTERSPEECH 2018 (3 out of 700).
Style Controllable Chatbot: Conventional chatbots are generally emotionless, which is a major limitation because emotion plays a critical role in human social interactions. We propose to train the chatbot to generate responses with scalable sentiment by setting the mode for chatting. This can be achieved by a GAN that transforms the style of the chatbot response. The techniques mentioned here are extended to conversational style adjustment, so the machine may imitate the conversational style of someone the user is familiar with, making the chatbot more friendly or more personal. Demo at the 2016 Intel Asia Innovation Summit (news).

Proposing to use posts on social networks to personalize recurrent neural network based language models (a paper nominated for the Best Student Paper Award in INTERSPEECH 2013) and word embeddings.

I'm a co-author of a tutorial paper summarizing the recent study
Proposing innovative directions beyond the mainstream approach of cascading speech recognition and text information retrieval, with performance shown to be significantly less constrained by recognition errors:

Relevance Feedback: Proposing a new framework integrating recognition and retrieval through user relevance feedback (mentioned in textbook), and a series of approaches using acoustic feature space similarity (a paper nominated for the Best Student Paper Award in ASRU 2011).
Semantic Retrieval: Proposing new frameworks based on acoustic feature similarity, context consistency, and query expansion using automatically discovered acoustic patterns (a paper that received the Spoken Language Processing Student Travel Grant in ICASSP 2012).
Unsupervised Semantic Retrieval: Proposing a novel approach for semantic retrieval of spoken content without using speech recognition at all.
Parameter Learning: Proposing a novel approach for learning the weights on the indexing features by optimizing the evaluation metrics (mentioned in textbook).

Participating in research on managing and organizing knowledge from on-line course materials in the Spoken Language Systems Group, Computer Science and Artificial Intelligence Lab (CSAIL), Massachusetts Institute of Technology (MIT)

A platform developed for helping learners take on-line courses by analyzing, discovering and visualizing the relationships among lectures from similar courses and textbooks of related subjects
Video Demonstration

Participating in the IARPA Babel Program in the Spoken Language Systems Group, CSAIL, MIT

Using acoustic feature space similarity to significantly improve the performance of spoken content retrieval for three different languages (Assamese, Bengali and Lao)

Participating in developing the prototype system of NTU Virtual Instructor in the Digital Speech Processing Laboratory, NTU

An on-line learning platform organizing spoken knowledge in course lectures for efficient personalized learning (responsible for the spoken content retrieval part), slides
Demo System (please browse it with Firefox or Chrome)