The Workshop on Speech Signal Processing is an annual academic event organized by the Association for Computational Linguistics and Chinese Language Processing (ACLCLP). Invited speakers at this year's meeting include Dr. Tomas Mikolov of the Facebook AI Research team, Prof. Björn W. Schuller of the University of Passau, Germany, Prof. Chen-Yu Chiang of National Taipei University, Prof. Jia-Ching Wang of National Central University, Prof. Chi-Chun Lee of National Tsing Hua University, and Dr. Ying-Hui Lai of the Research Center for Information Technology Innovation, Academia Sinica. The talks cover a wide range of topics in speech signal processing, making this an event not to be missed by anyone in Taiwan's academic and industrial communities interested in speech signal processing, natural language processing, or music signal processing.
03/18 SWS 2016!
Tomas Mikolov is a research scientist at the Facebook AI Research lab. His most influential work includes the development of recurrent neural network language models and the discovery of semantic regularities in distributed word representations. These projects have been published as the open-source tools RNNLM and word2vec, which have since been widely used in both academia and industry. His main research interest is to develop intelligent machines.
Chen-Yu Chiang was born in Taipei, Taiwan, in 1980. He received the B.S., M.S., and Ph.D. degrees in communication engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 2002, 2004, and 2009, respectively. In 2009, he was a Postdoctoral Fellow at the Department of Electrical Engineering, NCTU, where he primarily worked on prosody modeling for automatic speech recognition and text-to-speech systems, under the guidance of Prof. Sin-Horng Chen. In 2012, he was a Visiting Scholar at the Center for Signal and Image Processing (CSIP), Georgia Institute of Technology, Atlanta. Currently, he is the director of the Speech and Multimedia Signal Processing Lab and an assistant professor at the Department of Communication Engineering, National Taipei University. His main research interests are in speech processing, in particular prosody modeling, automatic speech recognition, and text-to-speech systems.
Jia-Ching Wang received the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1997 and 2002, respectively. He was an Honorary Fellow with the Department of Electrical and Computer Engineering, University of Wisconsin-Madison in 2008 and 2009. Currently, he is an Associate Professor with the Department of Computer Science and Information Engineering, National Central University, Jhongli City, Taiwan. His research interests include signal processing and VLSI architecture design. Dr. Wang is an honorary member of Phi Tau Phi Scholastic Honor Society and a member of the Association for Computing Machinery and IEICE.
Björn W. Schuller is Full Professor and Chair of Complex and Intelligent Systems at the University of Passau/Germany, Reader (Associate Professor) in Machine Learning at Imperial College London/UK, and the co-founding CEO of audEERING. Further affiliations include HIT/China as Visiting Professor and the University of Geneva/Switzerland and Joanneum Research in Graz/Austria as an Associate. Previously, he was with the CNRS-LIMSI in Orsay/France and headed the Machine Intelligence and Signal Processing Group at TUM in Munich/Germany. There, he received his diploma in 1999, doctoral degree in 2006, his habilitation in 2012, and was entitled Adjunct Teaching Professor – all in electrical engineering and information technology. Best known are his works advancing Intelligent Audio Analysis and Affective Computing. Dr Schuller is President Emeritus of the AAAC, elected member of the IEEE SLTC, and Senior Member of the IEEE. He (co-)authored 5 books and >500 peer reviewed technical contributions (>10,000 citations, h-index = 49). Selected activities include his role as Editor in Chief of the IEEE Transactions on Affective Computing, Associate Editor of Computer Speech and Language, IEEE Signal Processing Letters, IEEE Transactions on Cybernetics, and IEEE Transactions on Neural Networks and Learning Systems. Professor Schuller was General Chair of ACM ICMI 2014, Program Chair of ACM ICMI 2013, IEEE SocialCom 2012, and ACII 2015 and 2011, as well as organiser of the INTERSPEECH 2009-2016 annual Computational Paralinguistics Challenges and the 2011-2016 annual Audio/Visual Emotion Challenges. He won several awards including best results in research challenges such as CHiME, MediaEval, or of ACM Multimedia. In 2015 and 2016 he has been honoured as one of 40 extraordinary scientists under the age of 40 by the World Economic Forum.
Chi-Chun Lee (Jeremy) is an Assistant Professor at the Electrical Engineering Department of National Tsing Hua University (NTHU), Taiwan. He received his B.S. degree with honors, magna cum laude, in electrical engineering from the University of Southern California (USC) in 2007, and his Ph.D. degree in electrical engineering from USC in 2012. He was a data scientist at the id:a lab at ID Analytics in 2013. He was awarded the USC Annenberg Fellowship. He led a team to win the Emotion Challenge at Interspeech 2009 and is a coauthor of a best paper at Interspeech 2010. He is a member of the Tau Beta Pi, Phi Kappa Phi, and Eta Kappa Nu honor societies.
His research interests are in interdisciplinary human-centered behavioral signal processing, emphasizing the development of computational frameworks in recognizing and quantifying human behavioral attributes and interpersonal interaction dynamics using machine learning and signal processing techniques.
Ying-Hui Lai received the B.S. degree in industrial education from National Taiwan Normal University in 2005, and the Ph.D. degree in biomedical engineering from National Yang-Ming University in 2013. From January 2010 to June 2012, Dr. Lai was a research and development (R&D) engineer at Aescu Technology, Taipei, Taiwan, where he engaged in research and product development in hearing aids. Currently, Dr. Lai is a postdoctoral fellow at the Research Center for Information Technology Innovation, Academia Sinica. His research focuses on hearing aids, cochlear implants, speech enhancement, and pattern recognition.
| Time | Session | Speaker |
| --- | --- | --- |
| 08:30 - 09:00 | Registration | - |
| 09:00 - 09:10 | Opening Remarks | Asst. Prof. Hung-yi Lee |
| 09:10 - 10:10 | Recurrent Networks and Beyond | Dr. Tomas Mikolov |
| 10:10 - 10:30 | Coffee Break | - |
| 10:30 - 11:30 | Prosody Modeling and its Applications to Spoken Language Processing | Asst. Prof. Chen-Yu Chiang |
| 11:30 - 12:30 | Robust Sound Event Recognition | Assoc. Prof. Jia-Ching Wang |
| 12:30 - 14:00 | Lunch | - |
| 14:00 - 15:00 | Say no more – the computer already deeply knows you? | Prof. Björn W. Schuller |
| 15:00 - 16:00 | A window into you: BSP effort for quantifying human behaviors across domains of health, education, and psychology | Asst. Prof. Chi-Chun Lee |
| 16:00 - 16:20 | Coffee Break | - |
| 16:20 - 17:20 | Improvement of the speech intelligibility for cochlear implantees by the adaptive compression strategy and deep learning based noise reduction approaches | Dr. Ying-Hui Lai |
| 17:20 - 17:30 | Closing Remarks | - |
In this talk, I will give a brief overview of recurrent networks and their applications. I will then present several extensions that aim to help these powerful models learn more patterns from training data. These will include a simple modification of the architecture that allows it to capture longer context information, and an architecture that can learn complex algorithmic patterns. The talk will conclude with a discussion of a long-term research plan on how to advance machine learning techniques towards the development of artificial intelligence.
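The recurrent update at the heart of such models can be sketched in a few lines of NumPy; the dimensions, random weights, and toy input sequence below are illustrative assumptions, not taken from the RNNLM toolkit:

```python
import numpy as np

def rnn_forward(inputs, Wxh, Whh, Why, h0):
    """Run a simple (Elman-style) recurrent network over a sequence.

    inputs: list of one-hot input vectors (each of shape (vocab_size,)).
    Returns the output logits at each step and the final hidden state.
    """
    h = h0
    outputs = []
    for x in inputs:
        # The hidden state mixes the current input with the previous state;
        # this recurrence is what carries context across time steps.
        h = np.tanh(Wxh @ x + Whh @ h)
        outputs.append(Why @ h)
    return outputs, h

# Toy dimensions: vocabulary of 4 symbols, hidden size 8.
rng = np.random.default_rng(0)
V, H = 4, 8
Wxh = rng.normal(scale=0.1, size=(H, V))
Whh = rng.normal(scale=0.1, size=(H, H))
Why = rng.normal(scale=0.1, size=(V, H))

seq = [np.eye(V)[i] for i in [0, 2, 1, 3]]          # a 4-step symbol sequence
logits, h_final = rnn_forward(seq, Wxh, Whh, Why, np.zeros(H))
```

In a language model these logits would feed a softmax over the vocabulary, and the extensions discussed in the talk modify how the hidden state is propagated.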
The term prosody refers to certain inherent suprasegmental properties that carry melodic, timing, and pragmatic information in continuous speech, encompassing accentuation, intonation, rhythm, speaking rate, prominences, pauses, and the attitudes or emotions a speaker intends to express. Prosodic features are physically encoded in the variations in pitch contour, energy level, duration, and silence of spoken utterances. Prosodic studies have indicated that these features are not produced arbitrarily, but rather realized according to a hierarchically organized structure that demarcates the speech flow into domains of varying lengths by boundary or break cues. It is also known that hierarchical prosodic structures are highly correlated with information sources of the linguistic features (lexical, syntactic, semantic, and pragmatic), the para-linguistic features (intentional, attitudinal, and stylistic), and the non-linguistic features (physical and emotional). Therefore, we can regard prosodic information as an interface between messages generated by humans and the realized acoustic features of speech. We may also regard prosody as a communication protocol between speakers. This talk will introduce some advances in prosody modeling jointly developed by the Speech and Multimedia Signal Processing Lab, NTPU, and the Speech Processing Lab, NCTU. The applications of prosody modeling to automatic speech recognition (ASR) and text-to-speech (TTS) systems will also be addressed.
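As a toy illustration of how one prosodic cue is physically encoded, the sketch below computes a frame-level log-energy contour and flags low-energy frames as pauses; the frame sizes and silence threshold are arbitrary choices for illustration, not parameters of the NTPU/NCTU models:

```python
import numpy as np

def energy_and_pauses(signal, sr, frame_ms=25, hop_ms=10, silence_db=-40):
    """Frame-level log-energy contour with silence (pause) detection.

    A frame is flagged as a pause when its energy falls below `silence_db`
    relative to the loudest frame -- a crude stand-in for the break cues
    that prosodic models exploit.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([
        np.sum(signal[i * hop : i * hop + frame] ** 2) + 1e-12
        for i in range(n_frames)
    ])
    log_e = 10 * np.log10(energy / energy.max())   # dB relative to max
    return log_e, log_e < silence_db

# Synthetic "utterance": tone, silent pause, tone.
sr = 16000
t = np.arange(int(0.3 * sr)) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
sig = np.concatenate([tone, np.zeros(int(0.2 * sr)), tone])
log_e, is_pause = energy_and_pauses(sig, sr)
```

A real prosodic front end would add a pitch contour and duration features on top of this, but the pause boundary in the middle of the signal is already visible in `is_pause`.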
Sound event recognition in home environments has become a new research topic in home automation and smart homes. Identifying sound classes can significantly help home environmental monitoring, and predefined home automation services can be triggered by the associated sound classes. However, various noises and interferences degrade recognition performance. These problems remain unsolved, and research to tackle them is greatly needed. In this talk, we will present several robust sound event recognition techniques, such as front-end processes that filter out noises or interferences, and an approach to extracting robust audio features.
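A minimal sketch of the pipeline the abstract describes -- extract audio features, then classify the clip -- using two coarse spectral features and a nearest-class-mean rule; the features, classes, and test signals below are illustrative stand-ins, not the techniques presented in the talk:

```python
import numpy as np

def spectral_features(signal, sr, n_fft=512):
    """Two coarse features for a short clip: spectral centroid (Hz)
    and log energy. NumPy's rfft crops the signal to n_fft samples."""
    spec = np.abs(np.fft.rfft(signal, n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1 / sr)
    centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
    return np.array([centroid, np.log(np.sum(signal ** 2) + 1e-12)])

def nearest_class(feat, class_means):
    """Assign a clip to the class whose mean feature vector is closest."""
    return min(class_means, key=lambda c: np.linalg.norm(feat - class_means[c]))

# Two toy "sound events": a low rumble and a high clink.
sr = 8000
t = np.arange(sr // 4) / sr
low = np.sin(2 * np.pi * 200 * t)
high = np.sin(2 * np.pi * 3000 * t)
means = {"low": spectral_features(low, sr), "high": spectral_features(high, sr)}

# A noisy version of the high event should still be recognized.
noisy_high = high + 0.05 * np.random.default_rng(1).normal(size=len(t))
label = nearest_class(spectral_features(noisy_high, sr), means)
```

The robustness problem the talk addresses appears when noise levels grow large enough to shift these features across class boundaries, which is where the front-end filtering and robust features come in.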
Recent advances in deep and weakly supervised learning have lent computers new socio-affective skills. Focusing on human speech analysis, this talk highlights current abilities and potential in the automatic characterisation of speakers in rich ways. This includes acquiring information on speakers' sincerity, deception, native language and degree of nativeness, cognitive and physical load, emotion and personality, or health diagnostics, to name just a few. A modern architecture for holistic speech analysis will be shown, including cooperative on-line learning via efficient crowd-sourcing. Further, an approach to end-to-end learning aimed at seamless speech modelling will be featured. Then, a low-resource implementation based on the openSMILE toolkit, co-developed by the presenter, is demonstrated, considering real-time on-device processing on smartphones and similar devices. Examples of application use-cases stem from a number of ongoing European projects; these will showcase the potential, but also current shortcomings. In an outlook, future avenues for overcoming these shortcomings are laid out.
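Toolkits in this space, openSMILE among them, typically summarise variable-length sequences of frame-level low-level descriptors into one fixed-length vector of statistical functionals before classification. The sketch below shows that idea in NumPy; the particular functional set is an assumption for illustration, not an actual openSMILE configuration:

```python
import numpy as np

def functionals(lld):
    """Collapse a variable-length sequence of frame-level descriptors
    (rows = frames, cols = descriptors) into a fixed-length vector of
    statistical functionals, in the spirit of paralinguistic feature sets.
    """
    return np.concatenate([
        lld.mean(axis=0),                 # average level of each descriptor
        lld.std(axis=0),                  # variability over the utterance
        lld.min(axis=0),
        lld.max(axis=0),
        np.percentile(lld, 50, axis=0),   # median, robust to outlier frames
    ])

# E.g. 120 frames of 3 descriptors (placeholder random values).
rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 3))
vec = functionals(frames)   # fixed length: 5 functionals x 3 descriptors
```

Because the output length is independent of the utterance duration, any standard classifier can sit on top; the end-to-end approaches in the talk replace this hand-crafted stage with learned representations.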
The abstraction of humans with a signals and systems framework naturally brings a synergy between engineering and the behavioral sciences. Behavioral signal processing (BSP) offers a new frontier of interdisciplinary research between these communities. The core research in BSP is to model human behaviors, internal states, and perceptual judgements from observational data by using computational methods grounded in signal processing and machine learning. The outcome of BSP offers novel informatics for enhancing the capabilities of domain experts in facilitating better decision making.
In this talk, we will demonstrate the use of BSP techniques in various application domains: affective computing, mental health, and educational research. The heterogeneity in human behavior expression, the subjectivity in human perceptual judgement, and the complex non-linear interplay of multiple influencing factors require not only an advancement in algorithmic development but also a closer collaboration with domain experts. With this emerging effort of BSP, we strive not only to provide engineering solutions to domain experts but also to open up potential opportunities of novel insights in the applications with broad societal impact.
Cochlear implants (CIs) are surgically implanted electronic devices that provide a sense of sound to patients with severe-to-profound hearing loss. The considerable progress of CI technologies over the past three decades has enabled many CI users to enjoy a high level of speech understanding in quiet. For most CI users, however, understanding speech in noisy environments remains a challenge. In this talk, I will present two approaches that address this important issue to further improve speech intelligibility for CI recipients under noisy conditions. First, I will describe the proposed adaptive envelope compression (AEC) strategy, which effectively enhances the modulation depth of the envelope waveform by making the best use of its dynamic range, thereby improving intelligibility for CI recipients compared with traditional static envelope compression. Second, I will introduce a deep-learning-based noise reduction (NR) approach, the deep denoising autoencoder (DDAE), whose effectiveness in improving speech intelligibility for CI recipients has been investigated. Experimental results indicated that, under challenging noisy listening conditions, the AEC strategy and DDAE NR yield higher intelligibility scores than conventional approaches for Mandarin-speaking listeners, suggesting that AEC and DDAE NR could potentially be integrated into a CI processor to overcome speech perception degradation caused by noise.
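To illustrate the difference between static and adaptive envelope compression, the sketch below compresses a weakly modulated toy envelope both ways; the adaptive variant renormalises to the local dynamic range before compressing, which deepens the modulation. This is a hedged stand-in for the general idea only, not the published AEC algorithm:

```python
import numpy as np

def static_compression(env, exponent=0.3):
    """Static power-law envelope compression, as in a conventional
    CI processing chain."""
    return env ** exponent

def adaptive_compression(env, exponent=0.3, win=200):
    """Illustrative adaptive variant: rescale each envelope sample to the
    min/max of a trailing window before compressing, so the full output
    dynamic range is used and modulation depth increases."""
    out = np.empty_like(env)
    for i in range(len(env)):
        seg = env[max(0, i - win): i + 1]
        lo, hi = seg.min(), seg.max()
        out[i] = (env[i] - lo) / (hi - lo + 1e-12)
    return out ** exponent

# Toy envelope: shallow 4 Hz modulation around a mid-level carrier.
t = np.linspace(0, 1, 2000)
env = 0.5 + 0.1 * np.sin(2 * np.pi * 4 * t)
stat = static_compression(env)
adap = adaptive_compression(env)
```

The adaptive output swings over a much wider range than the static one for the same input, which is the property the abstract links to improved intelligibility.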
National Taiwan University - No. 1, Sec. 4, Roosevelt Rd., Taipei 10617, Taiwan