ノセ タカシ
能勢 隆
Takashi Nose
Affiliation
Graduate School of Engineering, Department of Communications Engineering, Intelligent Communication Network Engineering Laboratory (Multimedia Communications)
Position
Associate Professor
Degree
  • Doctor of Engineering (Tokyo Institute of Technology)

Education 1

  • Tokyo Institute of Technology, Interdisciplinary Graduate School of Science and Engineering, Department of Information Processing

    ~ March 2009

Committee Memberships 2

  • Acoustical Society of Japan, Tohoku Chapter, Treasurer

    April 2014 ~ March 2016

  • Technical Committee on Speech, Assistant Secretary

    April 2014 ~ March 2016

Academic Societies 5

  • ISCA

  • Information Processing Society of Japan (IPSJ)

  • Acoustical Society of Japan (ASJ)

  • Institute of Electronics, Information and Communication Engineers (IEICE)

  • IEEE

Research Keywords 7

  • Multimedia information processing

  • Music information processing

  • Speech coding

  • Spoken dialog

  • Speech recognition

  • Speech synthesis

  • Speech information processing

Research Fields 2

  • Informatics / Intelligent robotics

  • Informatics / Perceptual information processing

Papers 88

  1. Dialog-based interactive movie recommendation: Comparison of dialog strategies (Peer-reviewed)

    Hayato Mori, Yuya Chiba, Takashi Nose, Akinori Ito

    Smart Innovation, Systems and Technologies 82 77-83 2018

    Publisher: Springer Science and Business Media Deutschland GmbH

    DOI: 10.1007/978-3-319-63859-1_10

    ISSN:2190-3018

    eISSN:2190-3026

    The user interface based on natural language dialog has been gathering attention. In this paper, we focus on the dialog-based user interface of a movie recommendation system. We compared two kinds of dialog systems: a system-initiative system that presented all the information about the recommended item at once, and a user-initiative system that provided information about the recommended item through a dialog between the system and the user. As a result of the dialog experiment, users preferred the user-initiative system for the availability of obtaining required information, while the system-initiative system was chosen for the simplicity of obtaining the information. In addition, it was found that the appropriateness of the system's replies in the dialog affected the user's preference for the user-initiative system.

  2. Voice conversion from arbitrary speakers based on deep neural networks with adversarial learning (Peer-reviewed)

    Sou Miyamoto, Takashi Nose, Suzunosuke Ito, Harunori Koike, Yuya Chiba, Akinori Ito, Takahiro Shinozaki

    Smart Innovation, Systems and Technologies 82 97-103 2018

    Publisher: Springer Science and Business Media Deutschland GmbH

    DOI: 10.1007/978-3-319-63859-1_13

    ISSN:2190-3018

    eISSN:2190-3026

    In this study, we propose a voice conversion technique from arbitrary speakers based on deep neural networks with adversarial learning, realized by introducing adversarial learning into conventional voice conversion. Adversarial learning is expected to enable more natural voice conversion by using, in addition to a generative model, a discriminative model that classifies input speech as natural or converted. Experiments showed that the proposed method was effective in enhancing the global variance (GV) of the mel-cepstrum, but the naturalness of the converted speech was slightly lower than that of speech obtained with the conventional variance compensation technique.

  3. Response selection of interview-based dialog system using user focus and semantic orientation (Peer-reviewed)

    Shunsuke Tada, Yuya Chiba, Takashi Nose, Akinori Ito

    Smart Innovation, Systems and Technologies 82 84-90 2018

    Publisher: Springer Science and Business Media Deutschland GmbH

    DOI: 10.1007/978-3-319-63859-1_11

    ISSN:2190-3018

    eISSN:2190-3026

    This research examined the response selection method of an interview-based dialog system that obtains the user's information through chat-like conversation. In an interview dialog, the system should ask about subjects the user is interested in so as to obtain the user's information efficiently. In this paper, we propose a method to select the system's utterance based on the user's emotion toward a focus detected from the user's utterance. We prepared question types corresponding to semantic orientation: positive, neutral, and negative. The focus was detected by a CRF, and the question type was estimated from the user's utterance and the system's previous utterance.

  4. Development and evaluation of julius-compatible interface for Kaldi ASR (Peer-reviewed)

    Yusuke Yamada, Takashi Nose, Yuya Chiba, Akinori Ito, Takahiro Shinozaki

    Smart Innovation, Systems and Technologies 82 91-96 2018

    Publisher: Springer Science and Business Media Deutschland GmbH

    DOI: 10.1007/978-3-319-63859-1_12

    ISSN:2190-3018

    eISSN:2190-3026

    In recent years, the use of Kaldi has grown rapidly because it has adopted various DNN-based speech recognition technologies in succession and has shown high recognition performance. On the other hand, the speech recognition engine Julius has been widely used, especially in Japan, and is also attracting attention since DNN-HMM is implemented in it. In this paper, we describe the design of interfaces that make the Kaldi speech recognition engine compatible with Julius, a system overview, and the details of the speech input unit and the recognition result output unit. We also refer to the functions that we are planning to implement.

  5. Detection of singing mistakes from singing voice (Peer-reviewed)

    Isao Miyagawa, Yuya Chiba, Takashi Nose, Akinori Ito

    Smart Innovation, Systems and Technologies 82 130-136 2018

    Publisher: Springer Science and Business Media Deutschland GmbH

    DOI: 10.1007/978-3-319-63859-1_17

    ISSN:2190-3018

    eISSN:2190-3026

    We investigate a method of detecting wrong lyrics from the singing voice. In the proposed method, we compare the input singing voice and a reference singing voice using dynamic time warping, and then observe the frame-by-frame distance to find the error locations. However, the absolute value of the distance is affected by the singer individualities of the reference and input singing voices. Thus, we adapt the input's singer individuality to the reference singer's by a linear transformation. The experimental results showed that we could detect wrong lyrics with high accuracy when the differing part of the lyrics was long. In addition, we investigated the effect of iterating the linear transformation, and found no benefit from the second or third linear transformations.
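
    The core of the method is frame-level alignment by dynamic time warping followed by thresholding of the per-frame distance. Below is a minimal Python sketch of that idea, assuming both voices are given as feature matrices (e.g., MFCCs); the function names, the Euclidean frame distance, and the fixed threshold are illustrative assumptions, and the paper's singer-adaptation linear transform is omitted.

```python
import numpy as np

def dtw_align(ref, inp):
    """Align two feature sequences (T_ref x D and T_in x D) with plain DTW.
    Returns the alignment path and the frame-distance matrix."""
    n, m = len(ref), len(inp)
    dist = np.linalg.norm(ref[:, None, :] - inp[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1], dist

def flag_mistake_frames(ref, inp, threshold=2.0):
    """Aligned frame pairs whose distance exceeds the threshold are
    candidate locations of wrong lyrics."""
    path, dist = dtw_align(ref, inp)
    return [(i, j) for i, j in path if dist[i, j] > threshold]
```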

  6. Evaluation of nonlinear tempo modification methods based on sinusoidal modeling (Peer-reviewed)

    Kosuke Nakamura, Yuya Chiba, Takashi Nose, Akinori Ito

    Smart Innovation, Systems and Technologies 82 104-111 2018

    Publisher: Springer Science and Business Media Deutschland GmbH

    DOI: 10.1007/978-3-319-63859-1_14

    ISSN:2190-3018

    eISSN:2190-3026

    Modifying the tempo of a music signal is one of the basic signal processing operations for music, and many methods have been proposed so far. Nishino et al. proposed a nonlinear tempo modification method based on a sinusoidal model, but its evaluation was insufficient. In this paper, we evaluated tempo modification methods combining a sinusoidal model with nonlinear signal stretching and compression; specifically, we compared the effectiveness of using the residual signal and of several methods for determining the stretchable parts. The experimental results confirmed the efficiency of nonlinear tempo modification, while the effect of the individual methods depended on the input signal.

  7. Analysis of Efficient Multimodal Features for Estimating User's Willingness to Talk: Comparison of Human-Machine and Human-Human Dialog (Peer-reviewed)

    2018-February 1-4 December 13, 2017

    DOI: 10.1109/APSIPA.2017.8282069

  8. A Study on 2D Photo-Realistic Facial Animation Generation Using 3D Facial Feature Points and Deep Neural Networks (Peer-reviewed)

    112-118 August 14, 2017

    DOI: 10.1007/978-3-319-63859-1_15

  9. HMM-Based Photo-Realistic Talking Face Synthesis Using Facial Expression Parameter Mapping with Deep Neural Networks (Peer-reviewed)

    Kazuki Sato, Takashi Nose, Akinori Ito

    Journal of Computer and Communications 5 (10) 55-65 August 2017

    DOI: 10.4236/jcc.2017.510006

  10. Collection and analysis of data for automatic activity-record generation based on everyday sound classification

    古谷崇拓, 千葉祐弥, 能勢隆, 伊藤彰則

    IPSJ SIG Technical Report 1-6 June 17, 2017

  11. Cluster-based approach to discriminate the user's state whether a user is embarrassed or thinking to an answer to a prompt (Peer-reviewed)

    Yuya Chiba, Takashi Nose, Akinori Ito

    Journal on Multimodal User Interfaces 11 (2) 185-196 June 2017

    DOI: 10.1007/s12193-017-0238-y

    ISSN:1783-7677

    eISSN:1783-8738

  12. Sentence Selection Based on Extended Entropy Using Phonetic and Prosodic Contexts for Statistical Parametric Speech Synthesis (Peer-reviewed)

    Takashi Nose, Yusuke Arao, Takao Kobayashi, Komei Sugiura, Yoshinori Shiga

    IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (5) 1107-1116 May 2017

    Publisher: IEEE

    DOI: 10.1109/TASLP.2017.2688585

    ISSN:2329-9290

    eISSN:2329-9304

    This paper proposes a sentence selection technique for constructing phonetically and prosodically balanced compact recording scripts for speech synthesis. In the conventional corpus design of speech synthesis, a greedy algorithm that maximizes phonetic coverage is often used. However, for statistical parametric speech synthesis, balances of multiple phonetic and prosodic contextual factors are important as well as the coverage. To take account of both of the phonetic and prosodic contextual balances in sentence selection, we introduce an extended entropy of phonetic and prosodic contexts, such as biphone/triphone, accent/stress/tone, and sentence length. For detailed investigation, conventional and proposed techniques are evaluated using Japanese, English, and Chinese corpora. The objective experimental results show that the proposed technique achieves better coverage and balance of contexts. In addition, speech synthesis experiments based on hidden Markov models reveal that the generated speech parameters become closer to those of the natural speech compared with other conventional sentence selection techniques. Subjective evaluations show that the proposed sentence selection based on the extended entropy improves the naturalness of the synthetic speech while maintaining the similarity to the original sample.
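
    As a rough illustration of the greedy selection described above: starting from an empty script, the algorithm repeatedly adds the sentence that maximizes the entropy of the pooled context distribution. The Python sketch below collapses the multiple phonetic and prosodic factors into a single bag of context labels, which is a simplification of the paper's extended entropy; all names are illustrative.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy of a discrete context-frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def greedy_select(corpus, k):
    """corpus maps sentence id -> Counter of context labels (e.g.,
    triphones and accent/stress/tone types). Repeatedly add the sentence
    whose inclusion maximizes the pooled-context entropy."""
    pooled, selected = Counter(), []
    candidates = dict(corpus)
    for _ in range(min(k, len(candidates))):
        best = max(candidates, key=lambda s: entropy(pooled + candidates[s]))
        pooled.update(candidates.pop(best))
        selected.append(best)
    return selected
```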

  13. Dimensional paralinguistic information control based on multiple-regression HSMM for spontaneous dialogue speech synthesis with robust parameter estimation (Peer-reviewed)

    Tomohiro Nagata, Hiroki Mori, Takashi Nose

    Speech Communication 88 137-148 April 2017

    Publisher: Elsevier

    DOI: 10.1016/j.specom.2017.01.002

    ISSN:0167-6393

    eISSN:1872-7182

    This paper describes spontaneous dialogue speech synthesis based on the multiple-regression hidden semi-Markov model (MRHSMM), which enables users to specify paralinguistic information of synthesized speech with a dimensional representation. Paralinguistic aspects of synthesized speech are controlled by multiple regression models whose explanatory variables are abstract dimensions such as pleasant-unpleasant and aroused-sleepy. However, in the training phase of the MRHSMM, estimated regression coefficients may have unreasonably large values, which cause fragility in the parameter generation with respect to the paralinguistic information given to the synthesizer. For robust estimation of the regression matrices of the MRHSMM with unbalanced spontaneous dialogue speech samples, the re-estimation formulae were derived in the framework of maximum a posteriori (MAP) estimation. By examining the synthesized speech, it was confirmed that the acoustic features of synthesized speech are well controlled by the dimensions, especially the aroused-sleepy dimension. The result of a perceptual experiment confirmed that the naturalness of synthesized speech was improved by applying MAP estimation to the regression matrices. In addition, a relatively high correlation was observed between given and perceived paralinguistic information, which implies that the proposed method successfully reflects intended paralinguistic messages in the synthesized speech.

  14. Speaker adaptation using shared decision tree context clustering for cross-lingual speech synthesis (Peer-reviewed)

    長濱大樹, 能勢隆, 郡山知樹, 小林隆夫

    IEICE Transactions on Information and Systems (Japanese Edition) J100-D (3) 385-393 2017

  15. Statistical-model-based techniques for synthesizing diverse speech (Peer-reviewed)

    能勢隆

    IEICE Transactions on Information and Systems (Japanese Edition) J100-D (4) 556-569 2017

  16. Collection of example sentences for non-task-oriented dialog using a spoken dialog system and comparison with hand-crafted DB (Peer-reviewed)

    Yukiko Kageyama, Yuya Chiba, Takashi Nose, Akinori Ito

    Communications in Computer and Information Science 713 458-464 2017

    Publisher: Springer Verlag

    DOI: 10.1007/978-3-319-58750-9_63

    ISSN:1865-0929

    Designing a question-answer database is important for natural conversation with an example-based dialog system. We focus on a method of collecting example sentences from actual conversations with the system. In this study, examples were collected from conversation logs, and we investigated the relationship between response accuracy and the number of interactions. In the experiment, the transcriptions of the user's utterances were added to the database at the end of every interaction, and the response sentences in the database were created manually. The results showed that response appropriateness improved as the number of interactions increased and saturated at around 70%. In addition, we compared the collected database with a fully handcrafted database by subjective evaluation. The scores for user satisfaction, dialog engagement, intelligence, and willingness to use were higher than those of the handcrafted database, suggesting that the proposed method obtains examples that are more appropriate to actual conversation from a subjective point of view.

  17. Efficient Implementation of Global Variance Compensation for Parametric Speech Synthesis (Peer-reviewed)

    Takashi Nose

    IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (10) 1694-1704 October 2016

    Publisher: IEEE

    DOI: 10.1109/TASLP.2016.2580298

    ISSN:2329-9290

    This paper proposes a simple and efficient technique for variance compensation to improve the perceptual quality of synthetic speech in parametric speech synthesis. First, we analyze the problem of spectral and F0 enhancement with global variance (GV) in HMM-based speech synthesis. In the conventional GV-based parameter generation, the enhancement is achieved by taking account of a GV probability density function with fixed GV model parameters for every output utterance through the speech parameter generation process. We find that the use of fixed GV parameters results in much smaller variations of GVs in synthesized utterances than those in natural speech. In addition, the computational cost is high because of iterative optimization. This paper examines these issues in terms of multiple objective measures such as variance characteristics, GV distortions, and GV correlations. We propose a simple and fast compensation method based on a global affine transformation that provides a GV distribution closer to that of natural speech and improves the correlation of GVs between natural and generated parameter sequences. The experimental results demonstrate that the proposed variance compensation methods outperform the conventional GV-based parameter generation in terms of objective and subjective speech similarity to natural speech while maintaining speech naturalness.
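
    One plausible reading of the global affine compensation is a per-dimension rescaling of each generated trajectory around its utterance mean so that its variance matches target GV statistics; the paper's actual estimation of the affine coefficients may differ. A minimal NumPy sketch under that assumption:

```python
import numpy as np

def compensate_gv(traj, target_gv):
    """Affine variance compensation for one generated trajectory.

    traj: (T, D) generated parameter sequence (e.g., mel-cepstra).
    target_gv: (D,) target global variance, e.g., the mean GV of natural
    speech. Rescaling around the utterance mean matches the variance to
    the target in closed form, with no iterative optimization."""
    mean = traj.mean(axis=0)
    scale = np.sqrt(target_gv / np.maximum(traj.var(axis=0), 1e-10))
    return scale * (traj - mean) + mean
```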

  18. Estimating the User's State before Exchanging Utterances Using Intermediate Acoustic Features for Spoken Dialog Systems (Peer-reviewed)

    Yuya Chiba, Takashi Nose, Masashi Ito, Akinori Ito

    IAENG International Journal of Computer Science 43 (1) 1-9 February 29, 2016

  19. A study on facial image conversion based on Animation Unit mapping using DNNs (Peer-reviewed)

    齋藤優貴, 能勢隆, 伊藤彰則

    IEICE Transactions on Information and Systems (Japanese Edition) J99-D (11) 1112-1115 2016

  20. Prosodically rich speech synthesis interface using limited data of celebrity voice (Peer-reviewed)

    Takashi Nose, Taiki Kamei

    Journal of Computer and Communications 4 (16) 79-94 2016

  21. Evaluation of a spoken dialog system with cooperative emotional speech synthesis based on utterance-state estimation (Peer-reviewed)

    加瀬嵩人, 能勢隆, 千葉祐弥, 伊藤彰則

    IEICE Transactions on Fundamentals (Japanese Edition) J99-A (1) 25-35 January 2016

  22. Real-time talking avatar on the internet using Kinect and voice conversion (Peer-reviewed)

    Takashi Nose, Yuki Igarashi

    International Journal of Advanced Computer Science and Applications 6 (12) 301-307 December 2015

  23. HMM-based expressive singing voice synthesis with singing style control and robust pitch modeling (Peer-reviewed)

    Takashi Nose, Misa Kanemoto, Tomoki Koriyama, Takao Kobayashi

    Computer Speech and Language 34 (1) 308-322 November 2015

    Publisher: Academic Press / Elsevier

    DOI: 10.1016/j.csl.2015.04.001

    ISSN:0885-2308

    eISSN:1095-8363

    This paper proposes a singing style control technique based on multiple regression hidden semi-Markov models (MRHSMMs) for changing singing styles and their intensities appearing in synthetic singing voices. In the proposed technique, singing styles and their intensities are represented by low-dimensional vectors called style vectors and are modeled in accordance with the assumption that mean parameters of acoustic models are given as multiple regressions of the style vectors. In the synthesis process, we can weaken or emphasize the intensities of singing styles by setting a desired style vector. In addition, the idea of pitch adaptive training is extended to the case of the MRHSMM to improve the modeling accuracy of pitch associated with musical notes. A novel vibrato modeling technique is also presented to extract vibrato parameters from singing voices that sometimes have unclear vibrato expressions. Subjective evaluations show that we can intuitively control singing styles and their intensities while maintaining the naturalness of synthetic singing voices comparable to the conventional HSMM-based singing voice synthesis.
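
    The style-vector idea, in which state means are given as multiple regressions of a low-dimensional style vector, can be illustrated in a few lines. A toy Python sketch with random numbers standing in for trained regression matrices; all names are hypothetical:

```python
import numpy as np

D, L = 3, 2                      # feature dimension, style-vector dimension
H = np.random.randn(D, L + 1)    # stand-in for a trained regression matrix

def state_mean(H, style):
    """Mean parameters as an affine function of the style vector:
    mu = H @ [1, s]."""
    return H @ np.concatenate(([1.0], style))

neutral = state_mean(H, np.zeros(L))              # style vector at the origin
emphasized = state_mean(H, np.array([1.5, 0.0]))  # emphasize the first style
weakened = state_mean(H, np.array([0.5, 0.0]))    # weaken the first style
```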

  24. Statistical Parametric Speech Synthesis Based on Gaussian Process Regression (Peer-reviewed)

    Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    IEEE Journal of Selected Topics in Signal Processing 8 (2) 173-183 April 2014

    Publisher: IEEE

    DOI: 10.1109/JSTSP.2013.2283461

    ISSN:1932-4553

    eISSN:1941-0484

    This paper proposes a statistical parametric speech synthesis technique based on Gaussian process regression (GPR). The GPR model is designed for directly predicting frame-level acoustic features from corresponding information on frame context that is obtained from linguistic information. The frame context includes the relative position of the current frame within the phone and articulatory information and is used as the explanatory variable in GPR. Here, we introduce cluster-based sparse Gaussian processes (GPs), i.e., local GPs and partially independent conditional (PIC) approximation, to reduce the computational cost. The experimental results for both isolated phone synthesis and full-sentence continuous speech synthesis revealed that the proposed GPR-based technique without dynamic features slightly outperformed the conventional hidden Markov model (HMM)-based speech synthesis using minimum generation error training with dynamic features.
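
    For a flavor of frame-level GPR in this setting, the sketch below uses scikit-learn's exact GP on synthetic stand-in data: the inputs play the role of frame-context vectors and the output is one acoustic feature dimension. It does not implement the paper's cluster-based sparse approximations (local GPs and PIC), so it is only an API-level illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Stand-ins: each row of X would hold a frame context (relative position
# within the phone plus encoded surrounding-phone information); y holds
# one acoustic feature value per frame.
X = rng.random((200, 4))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(),
                               normalize_y=True).fit(X, y)
# Predictive mean and per-frame uncertainty for unseen frame contexts.
mean, std = gpr.predict(rng.random((5, 4)), return_std=True)
```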

  25. A Parameter Generation Algorithm Using Local Variance for HMM-Based Speech Synthesis (Peer-reviewed)

    Takashi Nose, Vataya Chunwijitra, Takao Kobayashi

    IEEE Journal of Selected Topics in Signal Processing 8 (2) 221-228 April 2014

    Publisher: IEEE

    DOI: 10.1109/JSTSP.2013.2283459

    ISSN:1932-4553

    eISSN:1941-0484

    This paper proposes a parameter generation algorithm using a local variance (LV) model in HMM-based speech synthesis. In the proposed technique, we define the LV as a feature that represents the local variation of a spectral parameter sequence and model LVs using HMMs. Context-dependent HMMs are used to capture the dependence of LV trajectories on phonetic and prosodic contexts. In addition, the dynamic features of LVs are taken into account as well as the static one to appropriately model the dynamic characteristics of LV trajectories. By introducing the LV model into the spectral parameter generation process, the proposed technique can impose a more precise variance constraint for each frame than the conventional technique with a global variance (GV) model. Consequently, the proposed technique alleviates the excessive spectral peak enhancement that often occurs in GV-based parameter generation. Objective evaluation results show that the proposed technique can generate better spectral parameter trajectories than the GV-based technique in terms of spectral and LV distortion. Moreover, the results of subjective evaluation demonstrate that the proposed technique can generate synthetic speech significantly closer to the original one than the conventional technique while maintaining speech naturalness.

  26. Prosodic variation enhancement using unsupervised context labeling for HMM-based expressive speech synthesis (Peer-reviewed)

    Yu Maeno, Takashi Nose, Takao Kobayashi, Tomoki Koriyama, Yusuke Ijima, Hideharu Nakajima, Hideyuki Mizuno, Osamu Yoshioka

    Speech Communication 57 144-154 February 2014

    Publisher: Elsevier

    DOI: 10.1016/j.specom.2013.09.014

    ISSN:0167-6393

    eISSN:1872-7182

    This paper proposes an unsupervised labeling technique using phrase-level prosodic contexts for HMM-based expressive speech synthesis, which enables users to manually enhance prosodic variations of synthetic speech without degrading the naturalness. In the proposed technique, HMMs are first trained using conventional labels including only linguistic information, and prosodic features are generated from the HMMs. The average difference between original and generated prosodic features for each accent phrase is then calculated and classified into three classes, e.g., low, neutral, and high in the case of fundamental frequency. The created prosodic context label has a practical meaning such as high/low relative pitch at the phrase level, and hence users can modify the prosodic characteristics of synthetic speech in an intuitive way by manually changing the proposed labels. In the experiments, we evaluate the proposed technique in both ideal and practical conditions using sales-talk and fairy-tale speech recorded in a realistic domain. In the evaluation under the practical condition, we evaluate whether users achieve their intended prosodic modification by changing the proposed context label of a certain accent phrase in a given sentence.
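
    The labeling step reduces to classifying the per-phrase average log-F0 difference between original and HMM-generated speech into three classes. A minimal sketch, where the class margin is an illustrative assumption rather than a value from the paper:

```python
import numpy as np

def phrase_f0_label(orig_logf0, gen_logf0, margin=0.05):
    """Unsupervised prosodic context label for one accent phrase, from
    the average log-F0 difference between the original speech and speech
    generated by HMMs trained on linguistic labels only."""
    diff = float(np.mean(orig_logf0) - np.mean(gen_logf0))
    if diff > margin:
        return "high"
    if diff < -margin:
        return "low"
    return "neutral"
```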

  27. Parametric speech synthesis based on Gaussian process regression using global variance and hyperparameter optimization (Peer-reviewed)

    Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    Proceedings of 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing 3862-3866 2014

    DOI: 10.1109/ICASSP.2014.6854319

  28. Tone modeling using stress information for HMM-based Thai speech synthesis (Peer-reviewed)

    Decha Moungsri, Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    Proceedings of the 7th International Conference on Speech Prosody 1057-1061 2014

  29. Controlling Switching Pause Using an AR Agent for Interactive CALL System (Peer-reviewed)

    Naoto Suzuki, Takashi Nose, Akinori Ito, Yutaka Hiroi

    Communications in Computer and Information Science 435 588-593 2014

    Publisher: Springer Verlag

    DOI: 10.1007/978-3-319-07854-0_102

    ISSN:1865-0929

    We are developing a voice-interactive CALL (Computer-Assisted Language Learning) system to provide more opportunities for English conversation exercise. Among the several types of CALL system, we focus on a spoken dialogue system for dialogue practice. When the user answers the system's utterance, the timing of the answer can be unnatural because the system usually does not react while the user keeps silent; the learner therefore tends to take more time to answer the system than a human counterpart, and there has been no framework for suppressing this pause and practicing an appropriate pause duration. In this research, we conducted an experiment to investigate the effect of the presence of an AR character as a dialogue counterpart, and analyzed the pause between the two parties' utterances (the switching pause), which is related to the smoothness of the conversation. We introduced a virtual character realized by AR (Augmented Reality) as the dialogue counterpart to control the switching pause, giving the character a "time pressure" behavior to prevent the learner from taking a long time to consider the utterance. To verify whether this expression is effective for controlling the switching pause, we conducted the experiment with and without the expression. We found that the switching pause duration became significantly shorter when the agent made the time-pressure expression.

  30. Subjective Evaluation of Packet Loss Recovery Techniques for Voice over IP (Peer-reviewed)

    Masahito Okamoto, Takashi Nose, Akinori Ito, Takeshi Nagano

    2014 International Conference on Audio, Language and Image Processing (ICALIP), Vols 1-2, 711-714 2014

    Publisher: IEEE

    DOI: 10.1109/ICALIP.2014.7009887

    We conducted a subjective evaluation experiment for VoIP speech under severe packet loss conditions. The target codec was G.729, and four packet loss concealment (PLC) methods were tested: parameter redundancy, SVM-based parameter redundancy, N-gram-based parameter estimation, and interleaving. We first evaluated the effect of the interleaving block length on the subjective delay and speech quality. We found that interleaving improved the subjective speech quality, but a longer block length did not improve it further. Next, we investigated the effect of the PLC methods on subjective latency and quality, and found that interleaving combined with a simple PLC method gave the best result, while the N-gram-based PLC method made the quality worse.
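
    For readers unfamiliar with interleaving as a loss-recovery tool: writing frames into a square block row-wise and reading them out column-wise spreads a burst of consecutive packet losses into isolated single-frame gaps, at the cost of one block of extra delay. A minimal sketch; the square block shape is an illustrative choice, not necessarily the paper's configuration.

```python
def interleave(frames, block):
    """Row-in, column-out block interleaver: after deinterleaving at the
    receiver, a run of `block` consecutive lost packets appears as
    isolated single-frame gaps that concealment can handle."""
    assert len(frames) == block * block
    return [frames[r * block + c] for c in range(block) for r in range(block)]

# The square interleaver is an involution, so applying the same
# permutation at the receiver restores the original frame order.
frames = list(range(16))
assert interleave(interleave(frames, 4), 4) == frames
```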

  31. A Study on the Effect of Speech Rate on Perception of Spoken Easy Japanese Using Speech Synthesis (Peer-reviewed)

    Hafiyan Prafianto, Takashi Nose, Yuya Chiba, Akinori Ito, Kazuyuki Sato

    2014 International Conference on Audio, Language and Image Processing (ICALIP), Vols 1-2, 476-479 2014

    Publisher: IEEE

    DOI: 10.1109/ICALIP.2014.7009839

    "Easy Japanese" is a controlled natural language, which is designed to convey information correctly in Japanese language to people of various nationalities. In this research, we used synthesized speech with various speech rates to investigate how the speech rate correlates with the perception of Easy Japanese for non-native speakers of Japanese. As a result, we found that the speech rates of 320 and 360 mora per minute are perceived to be close to the ideal speech rate.

  32. Robot: Have I Done Something Wrong? - Analysis of Prosodic Features of Speech Commands under the Robot's Unintended Behavior - (Peer-reviewed)

    Noriko Totsuka, Yuya Chiba, Takashi Nose, Akinori Ito

    2014 International Conference on Audio, Language and Image Processing (ICALIP), Vols 1-2, 887-890 2014

    Publisher: IEEE

    DOI: 10.1109/ICALIP.2014.7009922

    When controlling a mobile robot using speech commands, the user's mistakes or the recognizer's misrecognitions can cause the robot to move in an unintended way, which can be dangerous. We therefore aim to develop a method for the robot to judge, from features of the speech commands, whether its behavior matches the user's intention. To this end, we investigated prosodic features of command utterances given to a mobile robot during intended and unintended behavior. We found that under the robot's unintended behavior the speaking rate becomes slower and the F0 becomes higher. In addition, we investigated whether fatigue or experience affects the prosodic features, and found that the F0 and intensity under the robot's unintended behavior were significantly different at the beginning of the experiment.

  33. Tempo modification of music signal using sinusoidal model and LPC-based residue model (Peer-reviewed)

    Akinori Ito, Yuki Igarashi, Masashi Ito, Takashi Nose

    Proceedings of the 21st International Congress on Sound and Vibration 1-8 2014

  34. User modeling by using bag-of-behaviors for building a dialog system sensitive to the interlocutor's internal state (Peer-reviewed)

    Yuya Chiba, Masashi Ito, Takashi Nose, Akinori Ito

    Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue 74-78 2014

  35. Quantized F0 Context and Its Applications to Speech Synthesis, Speech Coding and Voice Conversion (Peer-reviewed)

    Takashi Nose, Takao Kobayashi

    2014 Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2014) 578-581 2014

    Publisher: IEEE

    DOI: 10.1109/IIH-MSP.2014.149

    This paper describes a technique for language-independent prosody modeling using unsupervised prosodic labeling in HMM-based speech synthesis, and shows its applications to low bit-rate speech coding and speaker-independent voice conversion. In the proposed technique, sequences of prosodic features are roughly quantized at the phone level, and the resultant indexes are used as the prosodic context for model training. Conventional HMM-based speech synthesis requires accurate prosodic labels for the speech samples, and manual correction is often necessary to improve the modeling accuracy, which adds cost and limits its application. In contrast, the proposed technique creates the prosodic labels from the training data itself and applies not only to speech synthesis but also to speech coding and voice conversion. Subjective experimental results show the effectiveness of the quantized F0 context without manual prosodic labeling.

  36. Analysis of English pronunciation of singing voices sung by Japanese speakers (Peer-reviewed)

    Kazumichi Yoshida, Takashi Nose, Akinori Ito

    2014 Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2014) 554-557 2014

    Publisher: IEEE

    DOI: 10.1109/IIH-MSP.2014.143

    Singing is one of the most popular amusements in Japan, and many kinds of songs are sung on occasions such as karaoke. However, it is difficult for most Japanese native speakers to sing English songs because of the difference between the phone inventories of the two languages. There are numerous studies of CALL (Computer-Assisted Language Learning) systems that include training of English pronunciation, but no system evaluates the English pronunciation of sung English. We are investigating how to develop such a system by analyzing English singing voices together with the results of a subjective evaluation. In this paper, we show the results of the subjective evaluation as well as the analysis. We found that not only the number of mispronunciations but also other factors affect the perceived goodness of English pronunciation. We also found that the pronunciation scores of singing voices by singers with singing experience were higher than those of their spoken speech, which may mean that singing experience improves the skill of English singing.

  37. Transform Mapping Using Shared Decision Tree Context Clustering for HMM-Based Cross-Lingual Speech Synthesis (Peer-reviewed)

    Daiki Nagahama, Takashi Nose, Tomoki Koriyama, Takao Kobayashi

    15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), Vols 1-4, 770-774 2014

    Publisher: ISCA

    ISSN:2308-457X

    This paper proposes a novel transform mapping technique based on shared decision tree context clustering (STC) for HMM-based cross-lingual speech synthesis. In conventional cross-lingual speaker adaptation based on state mapping, the adaptation performance is not always satisfactory when there are mismatches of languages and speakers between the average voice models of the input and output languages. In the proposed technique, we alleviate the effect of the mismatches on the transform mapping by introducing a language-independent decision tree constructed by STC, and represent the average voice models using language-independent and language-dependent tree structures. We also use a bilingual speech corpus to keep speaker characteristics consistent between the average voice models of the different languages. The experimental results show that the proposed technique decreases both spectral and prosodic distortions between original and generated parameter trajectories and significantly improves the naturalness of synthetic speech while keeping the speaker similarity compared to state mapping.

  38. Accent type and phrase boundary estimation using acoustic and language models for automatic prosodic labeling (Peer-reviewed)

    Tomoki Koriyama, Hiroshi Suzuki, Takashi Nose, Takahiro Shinozaki, Akinori Ito

    Proceedings of 15th Annual Conference of the International Speech Communication Association 2337-2341 2014

  39. Analysis of spectral enhancement using global variance in HMM-based speech synthesis (Peer-reviewed)

    Takashi Nose, Akinori Ito

    Proceedings of 15th Annual Conference of the International Speech Communication Association 2917-2921 2014

    ISSN:2308-457X

    eISSN:1990-9772

  40. Frame-level acoustic modeling based on Gaussian process regression for statistical nonparametric speech synthesis (Peer-reviewed)

    Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 8007-8011 October 18, 2013

    DOI: 10.1109/ICASSP.2013.6639224

    ISSN:1520-6149

    This paper proposes a new approach to text-to-speech based on Gaussian processes, which are widely used to perform non-parametric Bayesian regression and classification. The Gaussian process regression model is designed for the prediction of frame-level acoustic features from the corresponding frame information. The frame information includes relative position in the phone and preceding and succeeding phoneme information obtained from linguistic information. In this paper, a frame context kernel is proposed as a similarity measure of respective frames. Experimental results using a small data set show the potential of the proposed approach without the state-dependent dynamic features or decision-tree clustering used in a conventional HMM-based approach.

  41. An intuitive style control technique in HMM-based expressive speech synthesis using subjective style intensity and multiple-regression global variance model (Peer-reviewed)

    Takashi Nose, Takao Kobayashi

    Speech Communication 55 (2) 347-357 February 2013

    Publisher: Elsevier

    DOI: 10.1016/j.specom.2012.09.003

    ISSN:0167-6393

    eISSN:1872-7182

    To control intuitively the intensities of emotional expressions and speaking styles for synthetic speech, we introduce subjective style intensities and multiple-regression global variance (MRGV) models into hidden Markov model (HMM)-based expressive speech synthesis. A problem in the conventional parametric style modeling and style control techniques is that the intensities of styles appearing in synthetic speech strongly depend on the training data. To alleviate this problem, the proposed technique explicitly takes into account subjective style intensities perceived for respective training utterances using multiple-regression hidden semi-Markov models (MRHSMMs). As a result, synthetic speech becomes less sensitive to the variation of style expressivity existing in the training data. Another problem is that the synthetic speech generally suffers from the over-smoothing effect of model parameters in the model training, so the variance of the generated speech parameter trajectory becomes smaller than that of the natural speech. To alleviate this problem for the case of style control, we extend the conventional variance compensation method based on a GV model for a single-style speech to the case of multiple styles with variable style intensities by deriving the MRGV modeling. The objective and subjective experimental results show that these two techniques significantly enhance the intuitive style control of synthetic speech, which is essential for the speech synthesis system to communicate para-linguistic information correctly to the listeners.

  42. [Invited Talk] Diversifying speakers and styles in statistical-model-based speech synthesis (Invited)

    能勢 隆

    IEICE Technical Report Vol. 112 (No. 422) 67-72 2013

  43. HMM-based expressive speech synthesis based on phrase-level F0 context labeling (Peer-reviewed)

    Yu Maeno, Takashi Nose, Takao Kobayashi, Tomoki Koriyama, Yusuke Ijima, Hideharu Nakajima, Hideyuki Mizuno, Osamu Yoshioka

    2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 7859-7863 2013

    Publisher: IEEE

    DOI: 10.1109/ICASSP.2013.6639194

    ISSN:1520-6149

    This paper proposes a technique for adding more prosodic variations to the synthetic speech in HMM-based expressive speech synthesis. We create novel phrase-level F0 context labels from the residual information of F0 features between original and synthetic speech for the training data. Specifically, we classify the difference of average log F0 values between the original and synthetic speech into three classes which have perceptual meanings, i.e., high, neutral, and low of relative pitch at the phrase level. We evaluate both ideal and practical cases using appealing and fairy tale speech recorded under a realistic condition. In the ideal case, we examine the potential of our technique to modify the F0 patterns under a condition where the original F0 contours of test sentences are known. In the practical case, we show how the users intuitively modify the pitch by changing the initial F0 context labels obtained from the input text.

  44. Speaker-independent style conversion for HMM-based expressive speech synthesis (Peer-reviewed)

    Hiroki Kanagawa, Takashi Nose, Takao Kobayashi

    2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 7864-7868 2013

    Publisher: IEEE

    DOI: 10.1109/ICASSP.2013.6639195

    ISSN:1520-6149

    This paper proposes a technique for creating a target speaker's expressive-style model from the target speaker's neutral-style speech in HMM-based speech synthesis. The technique is based on style adaptation using linear transforms, where speaker-independent transformation matrices are estimated in advance using pairs of neutral- and target-style speech data of multiple speakers. By applying the obtained transformation matrices to a new speaker's neutral-style model, we can convert the style expressivity of the acoustic model to the target style without preparing any target-style speech of the speaker. In addition, we introduce a speaker adaptive training (SAT) framework into the transform estimation to reduce the acoustic difference among speakers. We subjectively evaluate the performance of the style conversion in terms of naturalness, speaker similarity, and style reproducibility.

  45. A style control technique for singing voice synthesis based on multiple-regression HSMM (Peer-reviewed)

    Takashi Nose, Misa Kanemoto, Tomoki Koriyama, Takao Kobayashi

    Proceedings of 14th Annual Conference of the International Speech Communication Association 378-382 2013

  46. Statistical nonparametric speech synthesis using sparse Gaussian processes (Peer-reviewed)

    Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    Proceedings of 14th Annual Conference of the International Speech Communication Association 1072-1076 2013

  47. Robust estimation of multiple-regression HMM parameters for dimension-based expressive dialogue speech synthesis (Peer-reviewed)

    Tomohiro Nagata, Hiroki Mori, Takashi Nose

    Proceedings of 14th Annual Conference of the International Speech Communication Association 1549-1553 2013

  48. Very low bit-rate F0 coding for phonetic vocoders using MSD-HMM with quantized F0 symbols (Peer-reviewed)

    Takashi Nose, Takao Kobayashi

    Speech Communication 54 (3) 384-392 March 2012

    Publisher: Elsevier

    DOI: 10.1016/j.specom.2011.10.002

    ISSN:0167-6393

    eISSN:1872-7182

    This paper presents a technique of very low bit-rate F0 coding for phonetic vocoders based on a hidden Markov model (HMM) using phone-level quantized F0 symbols. In the proposed technique, an input F0 sequence is converted into an F0 symbol sequence at the phone level using scalar quantization. The quantized F0 symbols represent the rough shape of the original F0 contour and are used as the prosodic context for the HMM in the decoding process. To model the F0 that has voiced and unvoiced regions, we use multi-space probability distribution HMM (MSD-HMM). Synthetic speech is generated from the context-dependent labels and pre-trained MSD-HMMs by using the HMM-based parameter generation algorithm. By taking into account the preceding and succeeding contexts as well as the current one in the modeling and synthesis, we can generate a smooth F0 trajectory similar to that of the original with only a small number of quantization bits. The experimental results reveal that the proposed F0 coding outperforms the conventional segment-based F0 coding technique using MSD-VQ. We also demonstrate that the decoded speech of the proposed vocoder has acceptable quality even when the F0 bit-rate is less than 50 bps.
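
    The encoder side of this scheme boils down to scalar-quantizing the average log F0 of each phone into a few bits and transmitting the symbol sequence. A minimal sketch of that step; the bin layout over mu +/- 2*sigma is an illustrative assumption, and computing mu and sigma per utterance rather than from the training corpus corresponds to the adaptive variant used in the authors' related voice conversion work.

```python
import numpy as np

def quantize_phone_f0(phone_mean_logf0, bits=2, mu=None, sigma=None):
    """Scalar-quantize per-phone average log F0 into 2**bits symbols.

    mu/sigma default to the statistics of the data being encoded; pass
    corpus-level or per-utterance values to switch between global and
    adaptive quantization."""
    x = np.asarray(phone_mean_logf0, dtype=float)
    mu = x.mean() if mu is None else mu
    sigma = x.std() if sigma is None else sigma
    levels = 2 ** bits
    edges = np.linspace(mu - 2 * sigma, mu + 2 * sigma, levels - 1)
    return np.digitize(x, edges)  # symbols in {0, ..., levels - 1}
```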

  49. A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis (Peer-reviewed)

    Vataya Chunwijitra, Takashi Nose, Takao Kobayashi

    Speech Communication 54 (2) 245-255 February 2012

    Publisher: Elsevier

    DOI: 10.1016/j.specom.2011.08.006

    ISSN:0167-6393

    eISSN:1872-7182

    This paper proposes a technique of improving tone correctness in speech synthesis of a tonal language based on an average-voice model trained with a corpus from nonprofessional speakers' speech. We focused on reducing tone disagreements in speech data acquired from nonprofessional speakers without manually modifying the labels. To reduce the distortion in tone caused by inconsistent tonal labeling, quantized F0 symbols were utilized as the context for F0 to obtain an appropriate F0 model. With this technique, the tonal context could be directly extracted from the original speech and this prevented inconsistency between speech data and F0 labels generated from transcriptions, which affect naturalness and the tone correctness in synthetic speech. We examined two types of labeling for the tonal context using phone-based and sub-phone-based quantized F0 symbols. Subjective and objective evaluations of the synthetic voice were carried out in terms of the intelligibility of tone and its naturalness. The experimental results from both the objective and subjective tests revealed that the proposed technique could improve not only naturalness but also the tone correctness of synthetic speech under conditions where a small amount of speech data from nonprofessional target speakers was used.

  50. Context extension for generating diverse prosody in HMM-based dialogue speech synthesis (Peer-reviewed)

    郡山知樹, 能勢 隆, 小林隆夫

    IEICE Transactions on Information and Systems (Japanese Edition) Vol. J95-D (No. 3) 597-607 2012

  51. An F0 modeling technique based on prosodic events for spontaneous speech synthesis (Peer-reviewed)

    Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    Proceedings of 2012 IEEE International Conference on Acoustics, Speech, and Signal Processing 4589-4593 2012

    DOI: 10.1109/ICASSP.2012.6288940

  52. Discontinuous Observation HMM for Prosodic-Event-Based F0 Generation (Peer-reviewed)

    Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    13th Annual Conference of the International Speech Communication Association 2012 (INTERSPEECH 2012), Vols 1-3, 462-465 2012

    Publisher: ISCA

    This paper examines F0 modeling and generation techniques for spontaneous speech synthesis. In a previous study, we proposed a prosodic-unit HMM in which the synthesis unit is defined as a segment between two prosodic events represented in a ToBI label framework. To take advantage of the prosodic-unit HMM, continuous F0 sequences must be modeled from discontinuous F0 data including unvoiced regions. Conventional F0 models such as the MSD-HMM and the continuous F0 HMM are not always appropriate for this demand. To overcome this problem, we propose an alternative F0 model named the discontinuous observation HMM (DO-HMM), in which the unvoiced frames are regarded as missing data. We objectively evaluate the performance of the DO-HMM by comparing it with conventional F0 modeling techniques and discuss the results.

  53. A speech parameter generation algorithm using local variance for HMM-based speech synthesis (Peer-reviewed)

    Vataya Chunwijitra, Takashi Nose, Takao Kobayashi

    Proceedings of 13th Annual Conference of the International Speech Communication Association 1151-1154 2012

  54. Speaker-independent HMM-based voice conversion using adaptive quantization of the fundamental frequency (Peer-reviewed)

    Takashi Nose, Takao Kobayashi

    Speech Communication 53 (7) 973-985 September 2011

    Publisher: Elsevier

    DOI: 10.1016/j.specom.2011.05.001

    ISSN:0167-6393

    eISSN:1872-7182

    This paper describes a speaker-independent HMM-based voice conversion technique that incorporates context-dependent prosodic symbols obtained using adaptive quantization of the fundamental frequency (F0). In the HMM-based conversion of our previous study, the input utterance of a source speaker is decoded into phonetic and prosodic symbol sequences, and the converted speech is generated using the decoded information from the pre-trained target speaker's phonetically and prosodically context-dependent HMM. In our previous work, we generated the F0 symbols by quantizing the average log F0 value of each phone using the global mean and variance calculated from the training data. In the current study, these statistical parameters are obtained from each utterance itself, and this adaptive method improves on the F0 conversion performance of the conventional one. We also introduce a speaker-independent model for decoding the input speech and model adaptation for training the target speaker's model in order to reduce the required amount of training data under a condition where the phonetic transcription is available for the input speech. Objective and subjective experimental results for Japanese speech demonstrate that the adaptive quantization method gives better F0 conversion performance than the conventional one. Moreover, our technique with only ten sentences of the target speaker's adaptation data outperforms the conventional GMM-based one using parallel data of 200 sentences.

  55. Tonal context labeling using quantized F0 symbols for improving tone correctness in average-voice-based speech synthesis (Peer-reviewed)

    Vataya Chunwijitra, Takashi Nose, Takao Kobayashi

    2011 IEEE International Conference on Acoustics, Speech, and Signal Processing 4708-4711 2011

    Publisher: IEEE

    DOI: 10.1109/ICASSP.2011.5947406

    ISSN:1520-6149

    This paper proposes a technique for improving tone correctness in Thai speech synthesis based on an average voice model trained with a nonprofessional speech corpus. The proposed technique utilizes quantized F0 symbols as the tonal context in order to obtain an appropriate F0 model. With this technique, the prosodic context can be extracted directly from real speech, which prevents inconsistency between the speech data and F0 labels generated from transcriptions, an inconsistency that affects the naturalness and tone correctness of synthetic speech. We examine two types of tonal context labeling using the quantized F0 symbols based on phone and sub-phone boundaries. Experimental results of both objective and subjective tests show that the proposed technique can improve not only the naturalness but also the tone correctness of synthetic speech under the condition of using a small amount of speech data from nonprofessional target speakers.

  56. Very low bit-rate F0 coding for phonetic vocoder using MSD-HMM with quantized F0 context (Peer-reviewed)

    Takashi Nose, Takao Kobayashi

    2011 IEEE International Conference on Acoustics, Speech, and Signal Processing 5236-5239 2011

    Publisher: IEEE

    DOI: 10.1109/ICASSP.2011.5947538

    ISSN:1520-6149

    This paper presents a very low bit-rate F0 coding technique for a speaker-dependent phonetic vocoder based on a hidden Markov model (HMM) using quantized F0 context. In the proposed technique, the input F0 sequence is converted into an F0 symbol sequence at the phoneme level using scalar quantization. The quantized F0 symbols are used in the decoding process as the prosodic context for HMM-based speech synthesis. The synthetic speech is generated from the context-dependent labels and the input speaker's pre-trained HMMs by using the HMM-based parameter generation algorithm. By taking account of preceding and succeeding phonemes and F0 symbols as contextual factors, we can generate a smooth F0 trajectory similar to that of the original with only a small number of quantization bits. Experimental results demonstrate that the proposed technique can generate F0 contours with acceptable quality even when the bit-rate is less than 50 bps.

  57. A perceptual expressivity modeling technique for speech synthesis based on multiple-regression HSMM (Peer-reviewed)

    Takashi Nose, Takao Kobayashi

    Proceedings of 12th Annual Conference of the International Speech Communication Association 109-112 2011

  58. HMM-based emphatic speech synthesis using unsupervised context labeling (Peer-reviewed)

    Yu Maeno, Takashi Nose, Takao Kobayashi, Yusuke Ijima, Hideharu Nakajima, Hideyuki Mizuno, Osamu Yoshioka

    Proceedings of 12th Annual Conference of the International Speech Communication Association 1849-1852 2011

  59. Performance prediction of speech recognition using average-voice-based speech synthesis (Peer-reviewed)

    Tatsuhiko Saito, Takashi Nose, Takao Kobayashi, Yohei Okato, Akio Horii

    Proceedings of 12th Annual Conference of the International Speech Communication Association 1953-1956 2011

  60. On the use of extended context for HMM-based spontaneous conversational speech synthesis (Peer-reviewed)

    Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    Proceedings of 12th Annual Conference of the International Speech Communication Association 2657-2660 2011

  61. Recent development of HMM-based expressive speech synthesis and its applications (Peer-reviewed)

    Takashi Nose, Takao Kobayashi

    Proceedings of 2011 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 1-4 2011

  62. HMM-Based Voice Conversion Using Quantized F0 Context (Peer-reviewed)

    Takashi Nose, Yuhei Ota, Takao Kobayashi

    IEICE Transactions on Information and Systems E93-D (9) 2483-2490 September 2010

    Publisher: IEICE

    DOI: 10.1587/transinf.E93.D.2483

    ISSN:0916-8532

    We propose a segment-based voice conversion technique using hidden Markov model (HMM)-based speech synthesis with nonparallel training data. In the proposed technique, the phoneme information with durations and a quantized F0 contour are extracted from the input speech of a source speaker, and are transmitted to a synthesis part. In the synthesis part, the quantized F0 symbols are used as prosodic context. A phonetically and prosodically context-dependent label sequence is generated from the transmitted phoneme and the F0 symbols. Then, converted speech is generated from the label sequence with durations using the target speaker's pre-trained context-dependent HMMs. In the model training, the models of the source and target speakers can be trained separately, hence there is no need to prepare parallel speech data of the source and target speakers. Objective and subjective experimental results show that the segment-based voice conversion with phonetic and prosodic contexts works effectively even if the parallel speech data is not available.

  63. A Rapid Model Adaptation Technique for Emotional Speech Recognition with Style Estimation Based on Multiple-Regression HMM (Peer-reviewed)

    Yusuke Ijima, Takashi Nose, Makoto Tachibana, Takao Kobayashi

    IEICE Transactions on Information and Systems E93-D (1) 107-115 January 2010

    Publisher: IEICE

    DOI: 10.1587/transinf.E93.D.107

    ISSN:0916-8532

    In this paper, we propose a rapid model adaptation technique for emotional speech recognition which enables us to extract paralinguistic information as well as linguistic information contained in speech signals. This technique is based on style estimation and style adaptation using a multiple-regression HMM (MRHMM). In the MRHMM, the mean parameters of the output probability density function are controlled by a low-dimensional parameter vector, called a style vector, which corresponds to a set of the explanatory variables of the multiple regression. The recognition process consists of two stages. In the first stage, the style vector that represents the emotional expression category and the intensity of its expressiveness for the input speech is estimated on a sentence-by-sentence basis. Next, the acoustic models are adapted using the estimated style vector, and then standard HMM-based speech recognition is performed in the second stage. We assess the performance of the proposed technique in the recognition of simulated emotional speech uttered by both professional narrators and non-professional speakers.

  64. A Technique for Estimating Intensity of Emotional Expressions and Speaking Styles in Speech Based on Multiple-Regression HSMM (Peer-reviewed)

    Takashi Nose, Takao Kobayashi

    IEICE Transactions on Information and Systems E93-D (1) 116-124 January 2010

    Publisher: IEICE

    DOI: 10.1587/transinf.E93.D.116

    ISSN:0916-8532

    In this paper, we propose a technique for estimating the degree or intensity of emotional expressions and speaking styles appearing in speech. The key idea is based on a style control technique for speech synthesis using a multiple-regression hidden semi-Markov model (MRHSMM), and the proposed technique can be viewed as the inverse of the style control. In the proposed technique, the acoustic features of spectrum, power, fundamental frequency, and duration are simultaneously modeled using the MRHSMM. We derive an algorithm for estimating explanatory variables of the MRHSMM, each of which represents the degree or intensity of emotional expressions and speaking styles appearing in acoustic features of speech, based on a maximum likelihood criterion. We show experimental results to demonstrate the ability of the proposed technique using two types of speech data: simulated emotional speech and spontaneous speech with different speaking styles. It is found that the estimated values correlate with human perception.

  65. HMM-based speech synthesis with unsupervised labeling of accentual context based on F0 quantization and average voice model (Peer-reviewed)

    Takashi Nose, Koujirou Ooki, Takao Kobayashi

    2010 IEEE International Conference on Acoustics, Speech, and Signal Processing 4622-4625 2010

    Publisher: IEEE

    DOI: 10.1109/ICASSP.2010.5495548

    ISSN:1520-6149

    This paper proposes an HMM-based speech synthesis technique without any manual labeling of accent information for a target speaker's training data. To appropriately model the fundamental frequency (F0) feature of speech, the proposed technique uses coarsely quantized F0 symbols instead of accent types for the context-dependent labeling. By using F0 quantization, we can automatically conduct the labeling of F0 contexts for training data. When synthesizing speech, an average voice model trained in advance using manually labeled multiple speakers' speech data including accent information is used to create the label sequence for synthesis. Specifically, the input text is converted to a full context label sequence, and an F0 contour is generated from the label sequence and the average voice model. Then, a label sequence including the quantized F0 symbols is created from the generated F0 contour. We conduct objective and subjective evaluation tests, and discuss the results.

  66. Lexicon Learning from Continuous Speech Based on Statistical Model Selection Peer-reviewed

    Ryo Taguchi, Naoto Iwahashi, Kotaro Funakoshi, Mikio Nakano, Takashi Nose, Tsuneo Nitta

    Transactions of the Japanese Society for Artificial Intelligence Vol. 25 (No. 4) 549-559 2010

    DOI: 10.1527/tjsai.25.549  

  67. HMM-based robust voice conversion using adaptive F0 quantization Peer-reviewed

    Takashi Nose, Takao Kobayashi

    Proceedings of the 7th ISCA Workshop on Speech Synthesis 80-85 2010

  68. Evaluation of Prosodic Contextual Factors for HMM-based Speech Synthesis Peer-reviewed

    Shuji Yokomizo, Takashi Nose, Takao Kobayashi

    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2 430-433 2010

    Publisher: ISCA-INST SPEECH COMMUNICATION ASSOC

    This paper explores the effect of prosodic contextual factors for speech synthesis based on hidden Markov models (HMMs). In HMM-based speech synthesis, to model not only the phonetic features but also the prosodic ones, a variety of contextual factors are taken into account in the model training. In a baseline system, a large number of contextual factors are used, and the resultant cost of parameter tying by context clustering becomes relatively high compared to that in speech recognition. We examine the choice of prosodic contexts by objective measures for English and Japanese speech data, which have different linguistic and prosodic characteristics. Experimental results show that more compact context sets also give performance comparable or close to that of the conventional full context set.
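
    For intuition, the contextual factors in question are symbolic attributes attached to each phone label before context clustering. A hypothetical Python illustration (the factor names and values are invented for this sketch, not the paper's actual set) of trimming a full context set down to a compact prosodic one:

      # Hypothetical full-context factors for one phone (illustrative only).
      full_context = {
          "phoneme": ("sil", "a", "r"),        # previous, current, next
          "pos_in_mora": 1,                    # position of phone in mora
          "mora_in_word": (2, 4),              # mora position, word length
          "accent_type": 0,                    # accent type of the phrase
          "pos_in_accent_phrase": (2, 5),      # mora position, phrase length
          "pos_in_breath_group": (1, 3),
          "pos_in_utterance": (1, 2),
      }

      # A compact set keeps only the factors that measurably help,
      # trading a little accuracy for a much cheaper context tree.
      keep = ("phoneme", "accent_type", "pos_in_accent_phrase")
      compact_context = {k: full_context[k] for k in keep}
      print(compact_context)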

  69. Conversational Spontaneous Speech Synthesis Using Average Voice Model Peer-reviewed

    Tomoki Koriyama, Takashi Nose, Takao Kobayashi

    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2 853-856 2010

    Publisher: ISCA-INST SPEECH COMMUNICATION ASSOC

    This paper describes conversational spontaneous speech synthesis based on hidden Markov models (HMMs). To reduce the amount of data required for model training, we utilize an average-voice-based speech synthesis framework, which has been shown to be effective for synthesizing speech with an arbitrary speaker's voice using a small amount of training data. We examine several kinds of average voice models using reading-style speech and/or conversation-style speech. We also examine an appropriate utterance unit for conversational speech synthesis. Experimental results show that the proposed two-stage model adaptation method improves the quality of synthetic conversational speech.

  70. Speaker-independent HMM-based Voice Conversion Using Quantized Fundamental Frequency Peer-reviewed

    Takashi Nose, Takao Kobayashi

    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4 1724-1727 2010

    Publisher: ISCA-INST SPEECH COMMUNICATION ASSOC

    This paper proposes a segment-based voice conversion technique between arbitrary speakers with a small amount of training data. In the proposed technique, an input speech utterance of the source speaker is decoded into phonetic and prosodic symbol sequences, and then the converted speech is generated from the pre-trained target speaker's HMM using the decoded information. To reduce the required amount of training data, we use a speaker-independent model in decoding the input speech and model adaptation in training the target speaker's model. Experimental results show that there is no need to prepare the source speaker's training data, and that the proposed technique with only ten sentences of the target speaker's adaptation data outperforms the conventional GMM-based one using parallel data of 200 sentences.
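
    The pipeline described above can be sketched as follows; every class and function here is a hypothetical placeholder (a real system would plug in an ASR decoder, an HTS-style synthesis model, and a vocoder):

      # Placeholder components standing in for real models (illustrative only).
      class SpeakerIndependentDecoder:
          def decode(self, wav):
              # Would return recognized phone and quantized-F0 symbol sequences.
              return ["a", "r", "i"], ["F1", "F2", "F0"]

      class TargetSpeakerHMM:
          def generate(self, labels):
              # Would generate speech parameters from the label sequence.
              return [0.0] * len(labels)

      def make_labels(phones, f0_symbols):
          return [f"{p}/{s}" for p, s in zip(phones, f0_symbols)]

      def vocode(params):
          return params  # would synthesize a waveform from the parameters

      def convert_voice(source_wav, decoder, target_hmm):
          """Decode the source utterance into phonetic and prosodic symbols
          with a speaker-independent model, then regenerate speech from the
          target speaker's adapted model; no source-speaker training data
          or parallel corpus is required."""
          phones, f0_symbols = decoder.decode(source_wav)
          labels = make_labels(phones, f0_symbols)
          return vocode(target_hmm.generate(labels))

      converted = convert_voice(None, SpeakerIndependentDecoder(),
                                TargetSpeakerHMM())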

  71. Grounding new words on the physical world in multi-domain human-robot dialogues Peer-reviewed

    Mikio Nakano, Naoto Iwahashi, Takayuki Nagai, Taisuke Sumii, Xiang Zuo, Ryo Taguchi, Takashi Nose, Akira Mizutani, Tomoaki Nakamura, Muhammad Attamimi, Hiromi Narimatsu, Kotaro Funakoshi, Yuji Hasegawa

    AAAI Publications, 2010 AAAI Fall Symposium Series 74-79 2010

  72. Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis Peer-reviewed

    Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Keiichi Tokuda, Simon King, Steve Renals

    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 17 (6) 1208-1230 August 2009

    Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

    DOI: 10.1109/TASL.2009.2016394  

    ISSN:1558-7916

    eISSN:1558-7924

    This paper describes a speaker-adaptive HMM-based speech synthesis system. The new system, called "HTS-2007," employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available. In addition, a comparison study with several speech synthesis techniques shows the new system is very robust: It is able to build voices from less-than-ideal speech data and synthesize good-quality speech even for out-of-domain sentences.
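
    A hedged sketch of the linear-transform adaptation underlying CSMAPLR (notation assumed): a single affine transform is shared by the mean and covariance of each Gaussian,

        \hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = A \Sigma A^\top

    with (A, b) estimated from the adaptation data under a structural MAP prior propagated down a tree of regression classes; MAP re-estimation of the transformed models (the "+MAP" step) then refines the result.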

  73. HMM-Based Style Control for Expressive Speech Synthesis with Arbitrary Speaker's Voice Using Model Adaptation Peer-reviewed

    Takashi Nose, Makoto Tachibana, Takao Kobayashi

    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E92D (3) 489-497 March 2009

    Publisher: IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG

    DOI: 10.1587/transinf.E92.D.489  

    ISSN:0916-8532

    This paper presents methods for controlling the intensity of emotional expressions and speaking styles of an arbitrary speaker's synthetic speech by using a small amount of his/her speech data in HMM-based speech synthesis. Model adaptation approaches are introduced into the style control technique based on the multiple-regression hidden semi-Markov model (MRHSMM). Two different approaches are proposed for training a target speaker's MRHSMMs. The first one is MRHSMM-based model adaptation, in which the pretrained MRHSMM is adapted to the target speaker's model. For this purpose, we formulate the MLLR adaptation algorithm for the MRHSMM. The second method utilizes simultaneous adaptation of speaker and style from an average voice model to obtain the target speaker's style-dependent HSMMs, which are used for the initialization of the MRHSMM. From the results of subjective evaluation using adaptation data of 50 sentences of each style, we show that the proposed methods outperform conventional speaker-dependent model training when using the same amount of speech data of the target speaker.
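
    As a rough sketch of the first approach (notation assumed here): since the MRHSMM gives style-dependent means \mu_i(v) = H_i \xi with \xi = [1, v^\top]^\top, the MLLR formulation estimates a shared affine transform (W, b) of those means,

        \hat{\mu}_i(v) = W H_i \xi + b

    by maximizing the likelihood of the target speaker's adaptation data, so the style-controllable structure is preserved while the voice identity moves toward the target speaker.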

  74. Emotional Speech Recognition Based on Style Estimation and Adaptation with Multiple-Regression HMM Peer-reviewed

    Yusuke Ijima, Makoto Tachibana, Takashi Nose, Takao Kobayashi

    2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-8, PROCEEDINGS 4157-4160 2009

    Publisher: IEEE

    DOI: 10.1109/ICASSP.2009.4960544  

    ISSN:1520-6149

    This paper proposes a technique for emotional speech recognition which enables us to extract paralinguistic information as well as linguistic information contained in speech signals. The technique is based on style estimation and style adaptation using a multiple-regression HMM. The recognition process consists of two stages. In the first stage, a style vector that represents the emotional expression category and the intensity of its expressiveness of the input speech is estimated on a sentence-by-sentence basis. Then the acoustic models are adapted using the estimated style vector, and standard HMM-based speech recognition is performed in the second stage. We assess the performance of the proposed technique on the recognition of acted emotional speech uttered by both professional narrators and non-professional speakers and show the effectiveness of the technique.
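
    A compact Python sketch of the two-stage procedure (the model class and the grid search are illustrative stand-ins; the paper estimates the style vector by maximum likelihood rather than over a fixed grid):

      # Illustrative stand-ins for an MRHMM acoustic model and an HMM decoder.
      class StyleAdaptiveModel:
          def loglik(self, feats, v):
              # Dummy likelihood that peaks at style vector (1.0, 0.0).
              return -((v[0] - 1.0) ** 2 + v[1] ** 2)

          def adapt(self, v):
              return f"acoustic model adapted to style {v}"

      def recognize(feats, model, decode, candidates):
          # Stage 1: sentence-level estimate of the style vector.
          v_hat = max(candidates, key=lambda v: model.loglik(feats, v))
          # Stage 2: standard HMM decoding with the style-adapted model.
          return decode(feats, model.adapt(v_hat)), v_hat

      grid = [(e, i) for e in (-1.0, 0.0, 1.0) for i in (-1.0, 0.0, 1.0)]
      text, v_hat = recognize([0.0], StyleAdaptiveModel(),
                              lambda feats, model: "recognized text", grid)
      print(text, v_hat)  # recognized text (1.0, 0.0)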

  75. Speaking style adaptation for spontaneous speech recognition using multiple-regression HMM Peer-reviewed

    Yusuke Ijima, Takeshi Matsubara, Takashi Nose, Takao Kobayashi

    Proceedings of the 10th Annual Conference of the International Speech Communication Association 552-555 2009

  76. HMM-based speaker characteristics emphasis using average voice model Peer-reviewed

    Takashi Nose, Junichi Asada, Takao Kobayashi

    Proceedings of the 10th Annual Conference of the International Speech Communication Association 2631-2634 2009

  77. Learning lexicons from spoken utterances based on statistical model selection Peer-reviewed

    Ryo Taguchi, Naoto Iwahashi, Takashi Nose, Kotaro Funakoshi, Mikio Nakano

    Proceedings of the 10th Annual Conference of the International Speech Communication Association 2731-2734 2009

  78. Recent development of the HMM-based speech synthesis system (HTS) Peer-reviewed

    Heiga Zen, Keiichiro Oura, Takashi Nose, Junichi Yamagishi, Shinji Sako, Tomoki Toda, Takashi Masuko, Alan W. Black, Keiichi Tokuda

    Proceedings of the 2009 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 121-130 2009

  79. Performance evaluation of the speaker-independent HMM-based speech synthesis system HTS-2007 for the Blizzard Challenge 2007 Peer-reviewed

    Junichi Yamagishi, Takashi Nose, Heiga Zen, Tomoki Toda, Keiichi Tokuda

    Proceedings of the 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing 3957-3960 2008

    DOI: 10.1109/ICASSP.2008.4518520  

  80. Speaker and style adaptation using average voice model for style control in HMM-based speech synthesis Peer-reviewed

    Makoto Tachibana, Shinsuke Izawa, Takashi Nose, Takao Kobayashi

    2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12 4633-4636 2008

    Publisher: IEEE

    DOI: 10.1109/ICASSP.2008.4518689  

    ISSN:1520-6149

    We propose a technique for synthesizing speech with the desired style expressivity in an arbitrary target speaker's voice. In an MLLR-based speaker adaptation technique for the multiple regression hidden semi-Markov model (MRHSMM), the quality of synthesized speech crucially depends on the initial MRHSMM trained from a certain source speaker's data, and it is not always possible to synthesize natural-sounding speech with a given target speaker's voice. To overcome this problem, we perform simultaneous adaptation of speaker and style from an average voice model. Experimental results show that the proposed technique provides more natural-sounding speech than the conventional one with speaker adaptation only.

  81. An On-line Adaptation Technique for Emotional Speech Recognition Using Style Estimation with Multiple-Regression HMM Peer-reviewed

    Yusuke Ijima, Makoto Tachibana, Takashi Nose, Takao Kobayashi

    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5 1297-1300 2008

    Publisher: ISCA-INST SPEECH COMMUNICATION ASSOC

  82. An Estimation Technique of Style Expressiveness for Emotional Speech Using Model Adaptation Based on Multiple-Regression HSMM Peer-reviewed

    Takashi Nose, Yoichi Kato, Makoto Tachibana, Takao Kobayashi

    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5 2759-2762 2008

    Publisher: ISCA-INST SPEECH COMMUNICATION ASSOC

    This paper describes a technique for estimating style expressiveness for an arbitrary speaker's emotional speech. In the proposed technique, the style expressiveness, representing how much the emotions and/or speaking styles affect the acoustic features, is estimated based on the multiple-regression hidden semi-Markov model (MRHSMM). In the model training, we first train an average voice model using multiple speakers' neutral-style speech. Then, the speaker- and style-adapted HSMMs are obtained by linear transformation from the average voice model with a small amount of the target speaker's data. Finally, the MRHSMM of the target speaker is obtained using the adapted models. For given input emotional speech, the style expressiveness is estimated based on a maximum likelihood criterion. From the experimental results, we show that the estimated values correspond well to the perceptual ratings.

  83. A style control technique for HMM-based expressive speech synthesis Peer-reviewed

    Takashi Nose, Junichi Yamagishi, Takashi Masuko, Takao Kobayashi

    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E90D (9) 1406-1413 September 2007

    Publisher: IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG

    DOI: 10.1093/ietisy/e90-d.9.1406  

    ISSN:0916-8532

    This paper describes a technique for controlling the degree of expressivity of a desired emotional expression and/or speaking style of synthesized speech in an HMM-based speech synthesis framework. With this technique, multiple emotional expressions and speaking styles of speech are modeled in a single model by using a multiple-regression hidden semi-Markov model (MRHSMM). A set of control parameters, called the style vector, is defined, and each speech synthesis unit is modeled by using the MRHSMM, in which the mean parameters of the state output and duration distributions are expressed by multiple regression of the style vector. In the synthesis stage, the mean parameters of the synthesis units are modified by transforming an arbitrarily given style vector that corresponds to a point in a low-dimensional space, called the style space, each of whose coordinates represents a certain specific speaking style or emotion of speech. The results of subjective evaluation tests show that the style and its intensity can be controlled by changing the style vector.
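
    A tiny numeric illustration of this control (all numbers invented for the sketch): each mean is a linear function of the augmented style vector, so scaling the style vector toward one style's coordinate moves the acoustics proportionally, which is exactly the intensity control described above:

      import numpy as np

      # Hypothetical regression matrix for one state: a 3-dim mean,
      # a 2-dim style space; columns are [bias, style dim 1, style dim 2].
      H_i = np.array([[1.0,  0.3, -0.1],
                      [0.5,  0.0,  0.2],
                      [2.0, -0.4,  0.0]])

      def state_mean(v):
          xi = np.concatenate(([1.0], v))  # augmented style vector
          return H_i @ xi

      print(state_mean(np.array([0.0, 0.0])))  # neutral -> bias column
      print(state_mean(np.array([1.0, 0.0])))  # style 1 at full intensity
      print(state_mean(np.array([0.5, 0.0])))  # style 1 at half intensity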

  84. A speaker adaptation technique for MRHSMM-based style control of synthetic speech Peer-reviewed

    Takashi Nose, Yoichi Kato, Takao Kobayashi

    Proceedings of the 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing 833-836 2007

    DOI: 10.1109/ICASSP.2007.367042  

  85. The HMM-based speech synthesis system version 2.0 Peer-reviewed

    Heiga Zen, Takashi Nose, Junichi Yamagishi, Shinji Sako, Takashi Masuko, Alan W. Black, Keiichi Tokuda

    Proceedings of the 6th ISCA Workshop on Speech Synthesis 294-299 2007

  86. Style estimation of speech based on multiple regression hidden semi-Markov model Peer-reviewed

    Takashi Nose, Yoichi Kato, Takao Kobayashi

    Proceedings of the 8th Annual Conference of the International Speech Communication Association 2285-2288 2007

  87. A Style Control Technique for Speech Synthesis Using Multiple Regression HSMM Peer-reviewed

    Takashi Nose, Junichi Yamagishi, Takao Kobayashi

    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5 1324-1327 2006

    Publisher: ISCA-INST SPEECH COMMUNICATION ASSOC

    This paper presents a technique for intuitively controlling the degree or intensity of speaking styles and emotional expressions of synthetic speech. The conventional style control technique based on the multiple regression HMM (MRHMM) has the problem that it is difficult to control the phone duration of synthetic speech, because the HMM has no explicit parameter that appropriately models phone duration. To overcome this problem, we use the multiple regression hidden semi-Markov model (MRHSMM), which has explicit state duration distributions, to control phone duration. From the results of subjective tests, we show that duration control is important for style control of synthetic speech. We also compare the proposed technique with another control technique based on model interpolation.
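
    The motivation for moving to the HSMM can be put in one line (standard formulas, notation assumed): in an ordinary HMM the state duration is implicitly geometric through the self-transition probability, whereas the HSMM models it with an explicit distribution whose mean can itself be regressed on the style vector:

        p_i^{HMM}(d) = a_{ii}^{d-1} (1 - a_{ii}), \qquad p_i^{HSMM}(d) = \mathcal{N}(d; m_i, \sigma_i^2), \quad m_i = H_i^{(d)} \xi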

  88. A Technique for Controlling Voice Quality of Synthetic Speech Using Multiple Regression HSMM Peer-reviewed

    Makoto Tachibana, Takashi Nose, Junichi Yamagishi, Takao Kobayashi

    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5 2438-2441 2006

    Publisher: ISCA-INST SPEECH COMMUNICATION ASSOC

    This paper describes a technique for controlling the voice quality of synthetic speech using a multiple regression hidden semi-Markov model (HSMM). In the technique, we assume that the mean vectors of the output and state duration distributions of the HSMM are modeled by multiple regression with a parameter vector called the voice quality control vector. We first choose three features for controlling voice quality, namely "smooth voice - nonsmooth voice," "warm - cold," and "high-pitched - low-pitched," and then we attempt to control the voice quality of synthetic speech along these features. From the results of several subjective tests, we show that the proposed technique can change these features of voice quality intuitively.

Books and Other Publications 3

  1. 音響キーワードブック [Acoustics Keyword Book]

    Takashi Nose

    March 22, 2016

  2. 進化するヒトと機械の音声コミュニケーション [Evolving Speech Communication between Humans and Machines]

    Takashi Nose

    NTS Co., Ltd. September 2015

  3. Human Machine Interaction - Getting Closer

    Ryo Taguchi, Naoto Iwahashi, Kotaro Funakoshi, Mikio Nakano, Takashi Nose, Tsuneo Nitta

    January 2012