- Using too few Examples.
Everyone, myself especially, learns best from examples. Hands-on examples are even better. My class used hardly any. I think that explains many of the blank stares I got in response to "are there any questions?" It's very easy to ask a question about an example -- "Wait, why does the entropy equal 2.3?" It's much more difficult to ask a question like "Could you clarify the advantages and disadvantages of L2 vs. L1 regularization? You lost me at 'gradient'."
- Starting with the Math.
I spent the first two classes "reviewing" the linear algebra and calculus that would be necessary to really get into the details of some of the more complicated algorithms later in the course. Big mistake. First of all, this material wasn't review for many students -- an unexpected (but rapid) realization. Second of all, I had already lost sight of the point of the math. The math is there to support the big ideas of modeling and evaluation. These can't be accomplished without the math, but I put the cart way before the horse. In the future, I'll start with generalization, using as little math as possible, and bring in the rest as it's needed.
- Ending with Evaluation.
The class included material on mean-squared error and classification error rates far earlier than I introduced the ideas of evaluation. Sure, accuracy is a pretty intuitive notion, but it's a big assumption that everybody in the seats will know what I'm talking about. The relatively simple distinction between linear and squared error takes only 10 minutes to discuss, but it goes a long way towards instilling greater understanding of what's going on.
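Here is the kind of ten-minute example I have in mind, as a minimal sketch. The numbers are made up for illustration, and I'm reading "linear error" as mean absolute error. Two sets of predictions miss by the same amount on average, but squared error punishes the one big miss much harder.

```python
def mean_absolute_error(targets, predictions):
    """Average of |target - prediction| (the "linear" error)."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def mean_squared_error(targets, predictions):
    """Average of (target - prediction)^2."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical toy data, not taken from the course.
targets = [10, 10, 10, 10]
evenly_off = [12, 12, 12, 12]   # every prediction misses by 2
one_big_miss = [10, 10, 10, 18]  # three perfect predictions, one miss by 8

for name, preds in [("evenly_off", evenly_off), ("one_big_miss", one_big_miss)]:
    mae = mean_absolute_error(targets, preds)
    mse = mean_squared_error(targets, preds)
    print(f"{name}: MAE={mae}, MSE={mse}")
```

Both prediction sets have a mean absolute error of 2, but the mean squared error is 4 for the first and 16 for the second. Which one is "better" depends on whether a single large miss is worse than many small ones, and that question is a natural way into evaluation.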
- Ambitious and unclear goals and expectations.
While this was never explicit, on reflection it is obvious to me that my goal for the course was for the students to "know as much as I do about machine learning". It should have been "understand the fundamentals of machine learning". Namely, 1) how can we generalize from data (statistical modeling), 2) how can we apply machine learning (feature extraction), and 3) how do we know if the system works (evaluation).
For instance, I spent a lot of time showing how frequentists and Bayesians can come to the same end point w.r.t. L2 Regularization in Linear Regression. I think this is way cool, but is it more important than spending an hour describing k-nearest neighbors? Only for me, not for the students. Was it more helpful to describe MAP adaptation than decision trees? Almost definitely not. Decision trees are disarmingly intuitive. They can be represented as if-statements, and provide an easy example of overfitting (without requiring that students know that an n-degree polynomial can intersect n+1 points). But I thought they were too simple, and not "statistical" enough to fit in with the rest of the class. Oops.
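To make the if-statement point concrete, here is a rough sketch with two hand-written "trees". Everything in it is hypothetical: the features (hours studied, hours slept), the labels, and the function names are invented for illustration, and the trees are written by hand rather than learned by any algorithm. One training label is deliberately noisy, and the deeper tree adds a branch just to fit it.

```python
# Hypothetical toy data: (hours_studied, hours_slept) -> passed the exam?
train = [
    ((1, 8), False),
    ((2, 6), False),
    ((4, 7), False),
    ((6, 4), False),  # noisy label: studied enough, but recorded as a fail
    ((6, 7), True),
    ((7, 5), True),
    ((8, 8), True),
    ((9, 6), True),
]
test = [
    ((3, 4), False),
    ((5, 4), True),
    ((7, 3), True),
    ((8, 7), True),
    ((2, 9), False),
    ((6, 6), True),
]


def simple_tree(studied, slept):
    """A one-split tree: a single if-statement."""
    if studied >= 5:
        return True
    return False


def deep_tree(studied, slept):
    """Same tree plus one extra split that exists only to fit the noisy point."""
    if studied >= 5:
        if slept <= 4:
            return False
        return True
    return False


def accuracy(tree, data):
    """Fraction of examples the tree labels correctly."""
    correct = sum(tree(*features) == label for features, label in data)
    return correct / len(data)


for tree in (simple_tree, deep_tree):
    print(tree.__name__, accuracy(tree, train), accuracy(tree, test))
```

On this made-up data the deeper tree is perfect on the training set (8 of 8, versus 7 of 8 for the simple tree) but worse on the held-out set (4 of 6, versus 6 of 6). That is overfitting in two if-statements, with no polynomials required.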
Well, that's (at least) four mistakes -- hopefully next time they'll be all new mistakes.