Lecture Notes in Computer Science- P75 pdf

A New Chinese Speech Synthesis Method Apply in Chinese Poetry Learning

[...] constituted of the oral cavity, the nasal cavity, etc., and then turns into voice. Depending on the shape of the vocal tract, the airflow becomes different speech sounds, so the vocal tract parameters determine a particular voice. For example, when a single syllable read as “yi1” is pronounced, the vocal tract stays still during the phoneme “i”. In a word spelled “yi1ge4”, however, the vocal tract is already preparing to change for the phoneme “g” at the end of the “i”. The “i” therefore sounds different when another syllable follows it in a word than when it is pronounced alone. The new speech synthesis method presented in this paper makes the synthesized speech reproduce this coarticulation effect within a word.

3.3 The Sentence Prosody

First we look at the prosody of words in a sentence. Here we introduce a concept called “pitch resetting”, in contrast to “pitch continuity”, which means that within a word the start pitch of each syllable equals the end pitch of the previous one. A sentence can be divided into several words. Within a word the pitch varies continuously from syllable to syllable, but the start pitch of the first syllable of each word is reset to a certain value; we call this “pitch resetting”. Pitch resetting often happens when we take a breath during reading, and we often take a word, a phrase or a short sentence as a breathing unit. As shown in Figure 2, the example sentence is segmented into the spelling sequence “xi4lie4bao4dao4/ gan3shou4er4ling2ling2si4/ jin1tian1bo1chu1/”. The sentence is composed of three phrases, and Figure 2 shows that the pitch is reset to a certain value at the first syllable of each phrase. The prosodic boundaries where the pitch is reset can be determined during Chinese word segmentation in text processing.

Fig. 2. Chinese sentence prosody

Another feature of sentence prosody is the pitch trend of the whole sentence.
In declarative sentences the pitch trend is declining, and this trend is superimposed on every syllable in the sentence. So each time pitch resetting occurs in a declarative sentence, the “definite pitch” is a little lower than at the previous reset. Considering how Chinese poetry is read aloud, we assume that pitch resetting happens at the end of a single syllable or of a single word.

C. Zhu and Y. Zhu

4 The Speech Synthesis Method

The TTS system mainly consists of three parts: a text processing module, a prosody module and a speech synthesis module. In the speech synthesis module, the choice of synthesis algorithm matters most, so we take a close look at it.

4.1 Speech Synthesis Algorithm

This paper presents a new speech synthesis method that takes the time-domain waveform editing algorithm as its basic synthesis algorithm and superimposes the vocal tract cepstrum parameters, obtained by homomorphic analysis of the adjacent syllables in a word, to smooth the transitions between them. It thus combines waveform editing synthesis, whose advantage is fast processing, with vocal tract parametric synthesis, whose advantage is flexible adjustment because it models how the sound is produced.

4.1.1 The Voice Database

Because waveform editing is our basic algorithm, a voice database is needed to store all the elementary waveforms, i.e. the synthesis elements. The choice of the basic synthesis element not only determines the quality of the final speech but is also constrained by the storage capacity of the hardware. Many Chinese TTS systems choose syllables, words, phrases or even sentences as the basic synthesis element, which leads to a large voice database. Our approach follows reference [3] and takes initial consonants and simple/compound vowels as the basic elements. The voice database is thereby cut down to several hundred KB while maintaining a fairly equal level of voice quality.
4.1.2 PSOLA Algorithm

E. Moulines and F. Charpentier proposed a speech synthesis algorithm based on time-domain waveform modification called PSOLA (Pitch Synchronous Overlap Add) [4], which is widely used today. For more detail about the PSOLA algorithm, see references [5] and [6]. The PSOLA algorithm keeps the waveform and the spectrum smooth and continuous while the speech signal is modified. It works in three steps, as shown in Figure 3. First, a transform is applied to a small segment of the original time-domain waveform whose duration is about twice the pitch period; we call the transformed segment a short-time temporary signal. Second, the temporary signal is modified. Finally, the time-domain waveform is rebuilt from the modified temporary signals. The modifications in step 2 are therefore where we shape the speech we require.

Fig. 3. Main steps in PSOLA algorithm: original time domain waveform → (step 1) temporary signal → (step 2) modified temporary signal → (step 3) modified time domain waveform

For example, if we want to synthesize a syllable of 400 ms but the corresponding syllable in the voice database lasts 200 ms, we can process it as shown in Figure 4.

Fig. 4. PSOLA synthesis procedures: the original waveform (duration 200 ms) is transformed into temporary signals 1, 2, 3; the synthesized waveform (duration 400 ms) is built by overlapping the modified temporary signals 1, 1, 2, 2, 3, 3, 3, selected according to pitch information

First, calculate how many temporary signals a 400 ms duration requires and compute all the temporary signals of the original syllable’s waveform. Then, according to the pitch information, determine which original temporary signal each temporary signal to be synthesized should equal, and arrange them along the time axis. Finally, overlap-add them to produce the synthesized speech.
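The 200 ms to 400 ms example can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are not in the paper: the pitch period is constant, the "transform" of step 1 is taken to be a two-period Hann window, and the frame selection is a plain proportional mapping rather than one driven by measured pitch information. All function names are hypothetical.

```python
import math

def extract_frames(signal, period):
    """Step 1: cut two-period, Hann-windowed temporary signals, one per pitch period."""
    window = [0.5 - 0.5 * math.cos(math.pi * i / period) for i in range(2 * period)]
    frames = []
    for start in range(0, len(signal) - 2 * period + 1, period):
        frames.append([signal[start + i] * window[i] for i in range(2 * period)])
    return frames

def frame_map(n_src, n_dst):
    """Step 2: pick the source frame that each synthesized frame repeats.

    Proportional mapping (an assumption standing in for the pitch-based choice).
    """
    return [((j + 1) * n_src - 1) // n_dst for j in range(n_dst)]

def overlap_add(frames, period):
    """Step 3: place the frames one pitch period apart and sum the overlaps."""
    out = [0.0] * ((len(frames) + 1) * period)
    for j, frame in enumerate(frames):
        for i, sample in enumerate(frame):
            out[j * period + i] += sample
    return out

def stretch(signal, period, target_len):
    """Change the duration of `signal` to `target_len` samples, keeping the pitch."""
    frames = extract_frames(signal, period)
    mapping = frame_map(len(frames), target_len // period - 1)
    return overlap_add([frames[i] for i in mapping], period), mapping

# Toy data: 1 sample per ms, pitch period 50 samples, 200 samples in total.
PERIOD = 50
original = [math.sin(2 * math.pi * n / PERIOD) for n in range(200)]
stretched, mapping = stretch(original, PERIOD, 400)
```

With these toy numbers the original yields three temporary signals and the 400-sample target needs seven, so `mapping` repeats the source frames in the pattern 1, 1, 2, 2, 3, 3, 3 of Figure 4 (zero-based in the code). Because the two-period Hann windows sum to one when spaced one period apart, the interior of the stretched waveform keeps the original pitch period.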
4.1.3 Concept of Cepstrum

We call the time-domain sequence $\hat{x}(n)$ the complex cepstrum of the signal sequence $x(n)$. It is calculated by formula 1:

\[ \hat{x}(n) = Z^{-1}[\ln[Z[x(n)]]] \tag{1} \]

If only the real part of the logarithm, $\ln|Z[x(n)]|$, is kept, the result is $c(n)$, which we call the cepstrum. It is calculated by formula 2:

\[ c(n) = Z^{-1}[\ln|Z[x(n)]|] \tag{2} \]

4.1.4 Homomorphism Analysis to Get the Vocal Tract Cepstrum Parameters

In a simple digital speech model, the time-domain speech signal $x(n)$ is the convolution of the speech source signal $e(n)$ and the vocal tract signal $v(n)$. The vocal tract carries the most important information in the speech, so we want to separate the vocal tract signal and modify it to produce the speech we need. There is no good way to separate $v(n)$ from $x(n)$ in the time domain, but homomorphic analysis helps. In homomorphic analysis, we apply the Z transform to both sides of equation 3:

\[ x(n) = e(n) * v(n) \tag{3} \]

The convolution becomes a product, which gives equation 4:

\[ X(k) = E(k)V(k) \tag{4} \]

Taking the logarithm of both sides turns the product into a sum, which gives equation 5:

\[ \ln(X(k)) = \ln(E(k)) + \ln(V(k)) \tag{5} \]

We rewrite it as equation 6:

\[ \hat{X}(k) = \hat{E}(k) + \hat{V}(k) \tag{6} \]

Applying the inverse transform $Z^{-1}$ yields equation 7:

\[ \hat{x}(n) = \hat{e}(n) + \hat{v}(n) \tag{7} \]

Now the vocal tract cepstrum parameter $\hat{v}(n)$ can be obtained with a linear filter. After the vocal tract cepstrum parameter has been modified, the inverse operations convert the cepstrum-domain signal $\hat{v}(n)$ back into the time-domain signal $v(n)$.

4.1.5 Vocal Tract Cepstrum Parameter Speech Synthesis

When dealing with adjacent syllables in one word during speech synthesis, we can synthesize the speech by adding the latter syllable’s vocal tract cepstrum parameters into the former syllable.
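How such vocal tract cepstrum parameters can be obtained is illustrated by the following small numerical sketch of the homomorphic separation in equations 3 to 7. It is only an illustration under stated assumptions: it uses the real cepstrum of formula 2 instead of the complex cepstrum (which avoids phase unwrapping), a naive DFT in place of the Z transform, circular convolution so that equation 4 holds exactly, and a symmetric low-time window as the linear filter. The toy signals and all function names are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (inverse when requested)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [value / n for value in out] if inverse else out

def real_cepstrum(x):
    """Formula 2: inverse transform of the log magnitude spectrum."""
    log_magnitude = [math.log(abs(value)) for value in dft(x)]
    return [value.real for value in dft(log_magnitude, inverse=True)]

def circular_convolve(a, b):
    """Equation 3, in circular form so the DFT turns it into an exact product."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

def low_time_lifter(c, cutoff):
    """Linear filter in the cepstral domain: keep only quefrencies below `cutoff`."""
    n = len(c)
    return [c[i] if (i < cutoff or i > n - cutoff) else 0.0 for i in range(n)]

# Toy model: a decaying vocal tract response excited by a decaying pulse train.
N, PITCH = 32, 8
v = [0.5 ** n for n in range(N)]
e = [0.6 ** (n // PITCH) if n % PITCH == 0 else 0.0 for n in range(N)]
x = circular_convolve(e, v)

c_x, c_e, c_v = real_cepstrum(x), real_cepstrum(e), real_cepstrum(v)
v_hat = low_time_lifter(c_x, PITCH)  # estimate of the vocal tract cepstrum
```

In this toy case the excitation cepstrum is non-zero only at multiples of the pitch period, so below that quefrency (apart from index 0) the windowed `v_hat` coincides with the vocal tract cepstrum `c_v`, and `c_x` equals `c_e + c_v` as equation 7 states. Real speech is less tidy, and the cutoff would have to be chosen below the shortest expected pitch period.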
In step 2 of the PSOLA algorithm, after the temporary signals to be synthesized have been calculated, we combine the vocal tract cepstrum parameters of the last k temporary signals of the first syllable with those of the first k temporary signals of the second syllable by a linear operation, and take the result as the new vocal tract cepstrum parameters of the last k temporary signals of the first syllable. Finally, the cepstrum parameters and the temporary signals are transformed back to the time domain to obtain the synthesized speech. The linear operation is shown in formula 8. The linear coefficient is determined according to reference [7].

\[
\begin{bmatrix} \tilde{v}_{f1} \\ \tilde{v}_{f2} \\ \vdots \\ \tilde{v}_{f25} \end{bmatrix}
=
\begin{bmatrix} v_{f1} \\ v_{f2} \\ \vdots \\ v_{f25} \end{bmatrix}
\times \left[ 1 - \sin\left( \frac{k\pi}{2K} \right) \right]
+
\begin{bmatrix} v_{b1} \\ v_{b2} \\ \vdots \\ v_{b25} \end{bmatrix}
\times \sin\left( \frac{k\pi}{2K} \right)
\tag{8}
\]

In formula 8, $\tilde{v}_{fi}$ is the first syllable’s modified vocal tract cepstrum, $v_{fi}$ is the first syllable’s original vocal tract cepstrum, and $v_{bi}$ is the second syllable’s original vocal tract cepstrum. In this way we account for the influence between adjacent syllables.

4.2 The Programming Implementation

The method described in this paper is implemented in C++ under the VC.net framework. Figure 5 shows the logical flow of the Chinese TTS system. When Chinese poetry text is input, the system first predicts the basic duration and pitch of each syllable and the sentence mood, and then performs word segmentation to mark the pitch-resetting boundaries. Next it synthesizes the speech with the PSOLA algorithm from the consonants, vowels and tones that have already been analyzed, while adjusting the prosody of adjacent syllables within a word with the vocal tract cepstrum parameter synthesis algorithm, and finally outputs the synthesized speech.
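The adjustment of adjacent syllables by formula 8 can be sketched as follows. This is a hedged illustration in Python, not the authors' C++ implementation. It assumes that K is the number of temporary signals being blended, that k runs from 1 to K toward the syllable boundary, and that the k-th blended temporary signal of the first syllable is paired with the k-th temporary signal of the second syllable. None of this is spelled out in the text, and the function names are hypothetical.

```python
import math

def blend_weight(k, K):
    """Weight of the second syllable in formula 8: sin(k * pi / (2 * K))."""
    return math.sin(math.pi * k / (2 * K))

def blend_tail(first_frames, second_frames, K):
    """Apply formula 8 to the last K cepstrum vectors of the first syllable.

    Each frame is a list of vocal tract cepstrum parameters (25 in the paper).
    Returns a modified copy of first_frames; second_frames is left untouched.
    """
    out = [list(frame) for frame in first_frames]
    start = len(first_frames) - K
    for k in range(1, K + 1):
        w = blend_weight(k, K)
        v_f = first_frames[start + k - 1]   # original first-syllable vector
        v_b = second_frames[k - 1]          # paired second-syllable vector
        out[start + k - 1] = [f * (1 - w) + b * w for f, b in zip(v_f, v_b)]
    return out

# Toy data: six frames per syllable, 25 constant parameters per frame.
first = [[1.0] * 25 for _ in range(6)]
second = [[3.0] * 25 for _ in range(6)]
blended = blend_tail(first, second, K=4)
```

Under these assumptions the weight rises from about 0.38 at k = 1 to 1 at k = K, so the first two frames of `first` stay unchanged while the last frame takes the second syllable's parameters entirely, which gives a smooth hand-over at the boundary.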
Fig. 5. The logic flow of TTS system (blocks: Chinese poetry text; words segmentation; Chinese words; Chinese syllables; predicted duration; predicted pitch; pitch resetting boundary; consonant, vowel, tone; changed tone type; modify vocal tract parameter; synthesized by PSOLA algorithm and cepstrum parameters algorithm; sound)

The final user interface, including the waveform synthesized by the system, is shown in Figure 6.

Fig. 6. User interface
