Spoken Language Processing: AuToBI Version 1.1

I've made enough improvements to AuToBI to consider the toolkit a milestone more mature.

In addition to some uninteresting bug fixes, and refactoring, the version 1.1 is made up by 3 significant changes. As ever, the AuToBI homepage includes milestone releases, and the project itself is hosted on github.

1) Package restructuring.

The internal structure makes more sense now. Classes are divided into 'core', 'feature extractors', 'classifiers', 'feature sets', 'io' and 'utilities'. With over 100 classes, the code had outgrown a flat package structure. Unfortunately this restructuring changed the serialization signatures of class names. This means that old models won't work with the version 1.1 release. But I have new models trained on the Boston Directions Corpus and Boston Radio News Corpus which will be available from the AuToBI homepage shortly.

2) Implemented reference counting for Feature maintenance.

The feature extraction process allows a user to specify the features they want without explicitly going through all the intermediate steps to extract them. For example, extracting the mean speaker normalized pitch from the second half of a word requires pitch to be extracted, speaker normalization parameters to be calculated or retrieved, the pitch to be normalized, the second half of the word to be identified, and finally the mean calculated. In AuToBI each of these steps are treated as features that are required by the feature extraction of the user-desired feature. In version 1.0, AuToBI was able to identify which features were required, but never recognized when a feature wasn't needed any longer. Version 1.1 includes functionality that maintains a reference count for each feature based on how many features that still need to be extracted are going to need it -- speaker normalization parameters may be required by many requested features. This allows for tidier memory management, which should, in turn, allow AuToBI to operate on more material.

3) Reduced storage for acoustic contours.

In version 1.0 pitch and intensity contours were stored as lists of time-value pairs -- a storage class containing 2 doubles. Assuming 10ms samples, and 16bytes per sample, this is about 1600bytes per second. Not unacceptably inefficient, but definitely could be improved. The other approach would be to only store values, and let the time of each point be specified by a start time (t0) and step size (dt). This allows storage with 8bytes per sample plus 16bytes for the parameters. The problem with this approach is that pitch points are invalid when there is no periodic material detected. The time-value pair approach handled this simply and easily. The new solution uses an array of one-bit booleans for each sample in the contour, describing whether it was a 'valid' point or not. If the rate of periodic to aperiodic material is less than 32:1, the new approach will reduce the memory requirements. (As I'm writing this, I realize that this ratio could be explicitly tested during pitch extraction, selecting the most efficient storage solution at runtime. Keep an eye out in a new release.)

The changes are mostly inside baseball kind of things, but possibly useful for AuToBI users who get their hands dirty in the code -- whoever you are. For everyone else, the most obvious changes you'll notice is that version 1.0 models don't work any more, and a speedup of ~6%.

Spoken Language Processing

Tuesday, January 18, 2011

AuToBI Version 1.1

No comments:

Disqus for Spoken Language Processing Blog