If segmental durations are normally distributed, z-score normalization has some nice properties. It's easy to estimate from a relatively small set of data. And since the z-score normalization parameters are simply MLE univariate Gaussian parameters fit to the feature (here, duration), you can adapt them to a new speaker or a new utterance with maximum a posteriori (MAP) adaptation -- if there is reason to believe that the speaking rate might have changed.
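A minimal sketch of what this looks like in practice. The function names and the choice of `tau` (the pseudo-count weight on the prior mean in the standard conjugate-prior MAP update) are my own assumptions, not anything from a particular toolkit:

```python
import math

def fit_gaussian(durations):
    """MLE univariate Gaussian fit: sample mean and (biased) MLE std."""
    n = len(durations)
    mu = sum(durations) / n
    var = sum((d - mu) ** 2 for d in durations) / n
    return mu, math.sqrt(var)

def zscore(d, mu, sigma):
    """Standard z-score normalization of a single duration."""
    return (d - mu) / sigma

def map_adapt_mean(mu_prior, new_durations, tau=10.0):
    """MAP adaptation of the mean under a conjugate Gaussian prior.

    tau acts as a pseudo-count: with few new observations the estimate
    stays near the prior mean; with many it moves toward the new data.
    """
    n = len(new_durations)
    xbar = sum(new_durations) / n
    return (tau * mu_prior + n * xbar) / (tau + n)
```

With, say, ten new-speaker durations and `tau=10`, the adapted mean sits halfway between the prior mean and the new sample mean, which is the behavior you want when the evidence for a rate change is still thin.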
I've used z-score normalization of word duration in the past, while acknowledging that it's poorly motivated. There's no expectation that word duration should be normally distributed -- phone counts of words are not normally distributed, so why should their durations be? In fact, phone counts may be closer to log-normal than normal. Brief tangent: here is the distribution of phone counts per word.
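One quick way to check the "more log-normal than normal" hunch is to compare the log-likelihood of a Gaussian fit to the counts against a log-normal fit (a Gaussian on the log counts, with the Jacobian term included so the two likelihoods are comparable). This is a sketch on hypothetical count data, not the Radio News numbers:

```python
import math

def gaussian_loglik(xs):
    """Total log-likelihood of the data under an MLE-fit Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

def lognormal_loglik(xs):
    """Log-normal log-likelihood: Gaussian fit to log(x), plus the
    change-of-variables (Jacobian) term -log(x) per observation."""
    logs = [math.log(x) for x in xs]
    return gaussian_loglik(logs) - sum(logs)
```

Both models have two parameters, so the one with higher log-likelihood is the better fit without any complexity correction; on right-skewed count data the log-normal typically wins.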
So I took a look at a speaker (f2) from the Boston University Radio News Corpus, and looked at the distributions of word, syllable, vowel and phone durations to see if any look more or less Gaussian.
The distribution of word durations is decidedly non-Gaussian. We can see evidence of bimodality, likely coming from the abundance of monosyllabic words in the data (~59% of words in this material). Also, comparing the histograms of phone counts and word durations, the relationship between the two is fairly clear.
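That "fairly clear" relationship can be quantified with a simple Pearson correlation between per-word phone count and duration. This is an illustrative sketch with made-up values, not the corpus data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical per-word (phone count, duration-in-seconds) observations
phone_counts = [1, 2, 3, 4, 5]
durations = [0.12, 0.19, 0.33, 0.38, 0.51]
```

A strong positive correlation here would support the claim that word duration largely tracks phone count, which is exactly why unconditioned word-level z-scores are weakly motivated.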
There's a classic paper (Wightman et al., "Segmental durations in the vicinity of prosodic phrase boundaries." JASA, 1992) about the inherent duration of phones, and how phone identity is an important conditioning variable when examining (and normalizing) phone duration. I haven't reexamined that here, but suffice it to say, the distribution of phone durations isn't well modeled by a Gaussian or, likely, any other simple parameterized distribution. In all likelihood phone identity is an important factor to consider here, but are all phone identities equally important, or can we just look at vowel vs. consonant or other articulatory clusterings -- frontness, openness, fricatives, etc.? I'll have to come back to that.
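Conditioning on phone identity just means fitting separate normalization parameters per phone label rather than one global Gaussian. A small sketch of that idea (the function names and the `(phone, duration)` input convention are my own):

```python
import math
from collections import defaultdict

def fit_per_phone(segments):
    """Fit an MLE Gaussian per phone label.

    segments: iterable of (phone_id, duration) pairs.
    Returns {phone_id: (mean, std)}.
    """
    by_phone = defaultdict(list)
    for phone, dur in segments:
        by_phone[phone].append(dur)
    params = {}
    for phone, durs in by_phone.items():
        mu = sum(durs) / len(durs)
        var = sum((d - mu) ** 2 for d in durs) / len(durs)
        params[phone] = (mu, math.sqrt(var))
    return params

def phone_zscore(phone, dur, params):
    """z-score a duration relative to its own phone's distribution."""
    mu, sd = params[phone]
    return (dur - mu) / sd
```

The same structure works for coarser conditioning (vowel vs. consonant, or articulatory classes): just map each phone label to its cluster before building the dictionary, trading per-phone precision for more data per parameter.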
To properly use a half-normal distribution, you would have to discount the probability mass at zero. Working with vowel durations has a practical advantage, though: vowel onsets and offsets are much easier to detect acoustically than consonant boundaries, syllable boundaries, or word boundaries.
So this might be my new normalization scheme for segmental durations: Gaussian MLE z-score normalization for syllables when I have them (from either automatic or manual transcription), and half-normal MLE z-score normalization on acoustically approximated vowels (or when ASR output is too poor to be trusted).
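For the half-normal piece, the MLE scale is just the root mean square of the (nonnegative) durations, and the z-score analogue standardizes by the half-normal's own mean and standard deviation. A sketch, with my own function names:

```python
import math

def fit_half_normal(durations):
    """MLE scale parameter for a half-normal: sigma^2 = mean of squares."""
    return math.sqrt(sum(d * d for d in durations) / len(durations))

def half_normal_zscore(d, sigma):
    """Standardize by the half-normal's mean and std:
    mean = sigma*sqrt(2/pi), var = sigma^2 * (1 - 2/pi)."""
    mu = sigma * math.sqrt(2.0 / math.pi)
    sd = sigma * math.sqrt(1.0 - 2.0 / math.pi)
    return (d - mu) / sd
```

Note this assumes durations are strictly positive; any zero-duration artifacts from the vowel detector would need to be discounted first, per the point above about mass at zero.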