Joseph Turian over at MetaOptimize.com has posted a fun NLP challenge.
The task is to identify semantically related words from a shared corpus.
So you're thinking, sure, no problem. I'll look for common co-occurrences. Maybe I'll start with some seed pairs and do some bootstrapping. Or maybe you'd do LSA, if you're into that sort of thing.
But here's the rub: there are a few million documents, so you've got to get clever if you're going to use LSA (because that would require an SVD of an impossibly large, sparse matrix).
As if that weren't challenging enough, these "documents" are only a word or two long, so the co-occurrences you find are going to be pretty sparse.
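
To make the sparsity point concrete, here's a rough sketch of the naive co-occurrence approach in Python. The function name and the toy documents are made up for illustration (they are not from the challenge data), and this is just one way you might start, not a recommended solution.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count how often each unordered word pair shares a document."""
    counts = Counter()
    for doc in docs:
        # Deduplicate and sort so ("a", "b") and ("b", "a") map to one key.
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Toy stand-in for the corpus: each "document" is only a word or two long.
docs = ["new york", "york city", "new york", "city", "new jersey"]
pairs = cooccurrence_counts(docs)
```

Notice that a one-word document contributes no pairs at all, and a two-word document contributes exactly one, which is why the counts stay so thin even with millions of documents.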
So, that's it. Have at it.