Thursday, November 11, 2010

Cross-validation with one model

This is essentially a repost of Rob J Hyndman's blog post on the relevance of cross-validation for statisticians.


Within this very nice piece, Rob drops this bomb of mathematical knowledge:

It is not necessary to actually fit n separate models when computing the CV statistic for linear models.


Say what?


Here is a broader excerpt and the method itself (after the jump). 
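
To make the claim a little less magical, here is a minimal sketch of my own (not Rob's code). It assumes the shortcut in question is the standard leverage identity for least squares: the leave-one-out residual for point i equals e_i / (1 - h_i), where e_i is the ordinary residual and h_i is the i-th diagonal entry of the hat matrix. To keep it stdlib-only I stick to a one-predictor line, where h_i = 1/n + (x_i - x_bar)^2 / Sxx. The function names and the toy data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def loocv_shortcut(xs, ys):
    """Leave-one-out CV from ONE fit: mean of (e_i / (1 - h_i))^2."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (a + b * x)
        # Leverage of this point (assumed closed form for simple regression).
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx
        total += (residual / (1.0 - leverage)) ** 2
    return total / n


def loocv_brute_force(xs, ys):
    """Leave-one-out CV the slow way: refit n times, once per held-out point."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / n


# Hypothetical toy data, only for comparing the two computations.
demo_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
demo_y = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]
print(loocv_shortcut(demo_x, demo_y), loocv_brute_force(demo_x, demo_y))
```

The two numbers should agree up to floating-point noise, which is the whole point: one fit, n leave-one-out errors.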

Saturday, November 06, 2010

Semantically Related Term Challenge

Joseph Turian over at MetaOptimize.com has posted a fun NLP challenge.

The task is to identify semantically related words from a shared corpus.

So you're thinking, sure, no problem.  I'll look for common co-occurrences.  Maybe I'll start with some seed pairs and do some bootstrapping.  Or you'll do LSA, if you're into that sort of thing.

But here's the rub: there are a few million documents, so you've got to get clever if you're going to use LSA (because that would require an SVD of an impossibly large, sparse matrix).

As if that weren't challenging enough, these "documents" are only a word or two long, so the co-occurrences you find are going to be pretty sparse.
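
For a feel of what the naive co-occurrence route looks like, here is a rough sketch. It is not the challenge's reference approach: scoring pairs by pointwise mutual information (PMI) is my own choice, and the function name, the `min_pair_count` cutoff, and the toy documents are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def related_terms(docs, min_pair_count=2):
    """Rank word pairs by PMI, counting co-occurrence within each tiny document."""
    word_docs = Counter()  # number of documents containing each word
    pair_docs = Counter()  # number of documents containing both words of a pair
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        word_docs.update(words)
        pair_docs.update(combinations(words, 2))

    n = len(docs)
    scores = {}
    for (w1, w2), count in pair_docs.items():
        if count < min_pair_count:
            continue  # too rare to trust; sparse data makes this cutoff bite hard
        # PMI = log( P(w1, w2) / (P(w1) * P(w2)) ), using document frequencies.
        scores[(w1, w2)] = math.log(count * n / (word_docs[w1] * word_docs[w2]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up one- and two-word "documents", in the spirit of the challenge data.
toy_docs = ["new york", "new york", "york city", "new car", "red car", "red car"]
for pair, score in related_terms(toy_docs):
    print(pair, round(score, 3))
```

With documents this short, most pairs show up once or never, so the cutoff throws away nearly everything. That is exactly the sparsity problem.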

So, that's it.  Have at it.

Monday, November 01, 2010

National Novel Writing Month

November is National Novel Writing Month. In past years, I've known two or three people who have set out to write a complete novel within the month.  To all of those writers who push out 50,000 words in a month, you have a huge gold star in my book.

Now, I don't quite have the motivation, creativity or time to try to dig a novel out of my head, but I'm going to put a spin on it.  This November will be Personal Paper Writing Month.

If anyone out there reads this and wants to join me in the effort, I'll post semi-regular updates about our progress.  If I'm all alone in this... well, so be it. You can still keep tabs on how it's going here, though.