If segmental durations are normally distributed, z-score normalization has some nice properties. Its parameters are easy to estimate from a relatively small set of data. Since the z-score normalization parameters are simply the MLE parameters of a univariate Gaussian fit to the feature (here, duration), you can adapt them to a new speaker or a new utterance using maximum a posteriori adaptation -- if there is reason to believe that the speaking rate might have changed.
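A minimal Python sketch of the idea follows. The function names and the `relevance` factor are assumptions made for illustration: the mean is adapted with the common relevance-weighted MAP interpolation between the prior mean and the new sample mean, and the variance is simply left at its prior value.

```python
from statistics import fmean, pstdev


def fit_gaussian_mle(durations):
    """Return the MLE mean and standard deviation (population form, divides by N)."""
    return fmean(durations), pstdev(durations)


def z_score(duration, mean, std):
    """Standardize one duration with previously estimated parameters."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return (duration - mean) / std


def map_adapt_mean(prior_mean, new_durations, relevance=16.0):
    """MAP-style update of the mean only.

    `relevance` is a hypothetical tuning constant: the larger it is, the more
    new data is needed before the estimate moves away from the prior mean.
    """
    n = len(new_durations)
    if n == 0:
        return prior_mean
    alpha = n / (n + relevance)
    return alpha * fmean(new_durations) + (1.0 - alpha) * prior_mean


# Illustrative durations in seconds (made-up numbers, not corpus data).
train = [0.10, 0.20, 0.30]
mean, std = fit_gaussian_mle(train)

# A new utterance that appears slower: pull the mean toward its durations.
adapted_mean = map_adapt_mean(mean, [0.25, 0.35, 0.30])
normalized = [z_score(d, adapted_mean, std) for d in [0.25, 0.35, 0.30]]
```

With only a handful of new segments the adapted mean stays close to the prior, which is the point of using MAP rather than re-estimating from scratch.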
I've used z-score normalization of word duration in the past, acknowledging that it's poorly motivated. There's no reason to expect word duration to be normally distributed -- phone counts of words are not normally distributed, so why should their durations be? In fact, phone counts may be closer to log-normal than normal. Brief tangent: here is the distribution of phone counts per word.
So I took one speaker (f2) from the Boston University Radio News Corpus and looked at the distributions of word, syllable, vowel and phone durations to see if any look more or less Gaussian.
The distribution of word durations is decidedly non-Gaussian. We can see evidence of the bimodality that is likely coming from the abundance of monosyllabic words in the data (~59% of words in this material). Also, comparing the histogram of phone counts and word durations, the relationship is fairly clear.
Syllable durations don't look too bad. There's a bit of skew to the right, and the Gaussian fit is overestimating longer durations, but this isn't terrible. There is a strange spikiness to the distribution, but I blame this more on an interaction between the resolution of the duration information (in units of 10ms) and the histogram binning than on an underlying phenomenon. If someone were to take a look at this data and decide to go ahead and use z-score normalization on syllable durations, it wouldn't strike me as a terrible way to go.
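The resolution effect is easy to reproduce on synthetic data. In the sketch below (made-up, perfectly uniform durations on a 10ms grid, with a hypothetical `histogram` helper), a bin width that is not a multiple of 10ms makes neighbouring bins cover different numbers of possible values, so the counts alternate even though the underlying data is flat.

```python
from collections import Counter


def histogram(durations_ms, bin_width_ms):
    """Count durations per fixed-width bin, starting from 0 ms."""
    counts = Counter(d // bin_width_ms for d in durations_ms)
    return [counts.get(b, 0) for b in range(max(counts) + 1)]


# One observation at every 10 ms step from 10 ms to 300 ms.
durations_ms = list(range(10, 310, 10))

spiky = histogram(durations_ms, 15)   # bin width not a multiple of 10 ms
smooth = histogram(durations_ms, 20)  # bin width a multiple of 10 ms

print(spiky[2:8])   # [2, 1, 2, 1, 2, 1]
print(smooth[1:5])  # [2, 2, 2, 2]
```

Choosing bin edges that line up with the 10ms grid is one way to check whether the spikes in a real histogram are an artifact of this kind.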
There has been a good amount of discussion (and some writing, cf. C. Wightman, et al. "Segmental durations in the vicinity of prosodic phrase boundaries." JASA, 1992) about the inherent duration of phones, and how phone identity is an important conditioning variable when examining (and normalizing) phone duration. I haven't reexamined that here, but suffice it to say that the distribution of phone durations isn't well modeled by a Gaussian, or likely by any other simple parametric distribution. In all likelihood phone identity is an important factor to consider here, but are all phone identities equally important, or can we just look at vowel vs. consonant or other articulatory clusterings -- frontness, openness, fricatives, etc.? I'll have to come back to that.
But if we look closer at the distribution of vowel durations, while it isn't all that Gaussian, it looks like a pretty decent fit to a half-normal distribution. I don't see any obvious multimodality that would suggest a major effect of another conditioning variable, such as phone identity or lexical stress, but it's possible that the dip at around the 20th percentile is evidence of this. Or it could just be noise.
To properly use a half-normal distribution, you would have to discount the probability mass at zero. Working with vowels also has the advantage that vowel onsets and offsets are much easier to detect than consonant boundaries, syllable boundaries or word boundaries.
So this might be my new normalization scheme for segmental durations: Gaussian MLE z-score normalization for syllables if I have them (either from automatic or manual transcription), and half-normal MLE z-score normalization on vowels that are acoustically approximated (or when ASR output is too poor to be trusted).
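One possible reading of the half-normal variant, sketched in Python: fit the scale by maximum likelihood (for a half-normal with its location fixed, the MLE of sigma squared is the mean of the squared values), then standardize with the mean and standard deviation that the fitted half-normal implies. The function names are assumptions for illustration, and `loc` is a hypothetical parameter for shifting the origin away from zero; dividing by sigma alone would be an equally defensible normalization.

```python
import math


def fit_half_normal_mle(durations, loc=0.0):
    """MLE of the half-normal scale sigma, with the location fixed at `loc`."""
    shifted = [d - loc for d in durations]
    if not shifted:
        raise ValueError("need at least one duration")
    if any(s < 0 for s in shifted):
        raise ValueError("all durations must be >= loc")
    return math.sqrt(sum(s * s for s in shifted) / len(shifted))


def half_normal_z(duration, sigma, loc=0.0):
    """Standardize using the mean and std implied by the fitted half-normal."""
    mean = loc + sigma * math.sqrt(2.0 / math.pi)
    std = sigma * math.sqrt(1.0 - 2.0 / math.pi)
    return (duration - mean) / std


# Illustrative vowel durations in seconds (made-up numbers, not corpus data).
vowels = [0.03, 0.05, 0.08, 0.12, 0.04]
sigma = fit_half_normal_mle(vowels)
normalized = [half_normal_z(d, sigma) for d in vowels]
```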





