Speak of the devil… build your own OpenCalais-like supermachine

Right after my recent post mentioning OpenCalais, the topic extraction tool, there's news from DBpedia, one of the awesome linked open data projects I’ve been using a bunch for Alive.cn: they just released their own topic extraction tool, DBpedia Spotlight. If you are okay with downloading 9GB of Lucene indices and setting up their scripts, you can have your own self-hosted topic extraction tool. They basically open-sourced something that is worth a lot of money, in a space that used to be relatively closed.

What is topic extraction? Check this demo out and enter any block of text (say, a recent news article). The benefit of using DBpedia’s solution (besides it being free) is that it automatically ties each extracted topic back to its DBpedia entry, which already comes with a huge storehouse of Wikipedia-derived linked open data.
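To make the idea concrete, here is a toy sketch of what a topic extractor does at its simplest: spot known names in a text and link each one to a DBpedia-style URI. This is not how DBpedia Spotlight works internally; the `GAZETTEER` table, the `extract_topics` function, and the URIs in it are my own illustrative assumptions, and a real tool adds huge indices and context-based disambiguation on top.

```python
import re

# Hypothetical mini-gazetteer: lowercase surface form -> DBpedia-style URI.
# The URIs follow DBpedia's resource naming pattern but are illustrative only.
GAZETTEER = {
    "ibm": "http://dbpedia.org/resource/IBM",
    "watson": "http://dbpedia.org/resource/Watson_(computer)",
    "jeopardy": "http://dbpedia.org/resource/Jeopardy!",
    "wikipedia": "http://dbpedia.org/resource/Wikipedia",
}


def extract_topics(text):
    """Return (surface form, character offset, URI) for each gazetteer hit."""
    topics = []
    # Walk over alphabetic words and look each one up, ignoring case.
    for match in re.finditer(r"[A-Za-z]+", text):
        uri = GAZETTEER.get(match.group().lower())
        if uri is not None:
            topics.append((match.group(), match.start(), uri))
    return topics


if __name__ == "__main__":
    sample = "IBM built Watson to compete on Jeopardy."
    for surface, offset, uri in extract_topics(sample):
        print(offset, surface, uri)
```

A plain dictionary lookup like this cannot tell Watson the computer from Watson the person, which is exactly the disambiguation problem that makes the real tools valuable.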

The Man versus Machine Jeopardy Challenge

Jeopardy Feb. 14 2011 – Human vs Machine IBM Challenge Day 1 Part 1/2

It’s probably the most public test yet of the advances in linked open structured data and semantic text analysis, and I’m closely following this tournament pitting IBM’s supercomputer Watson against the two most successful Jeopardy champions. I suspect that they’re using the same publicly available data sets that we’re using to build Alive.cn.

I wonder, however, why they chose to rely only on electronically fed questions rather than going the final mile and adding a voice recognition interface on top of the system. Voice recognition accuracy has gotten so good these days, but maybe the last few percent of errors would make a critical difference against the best human players.

There have been some other truly AMAZING projects in this field. Two I’d like to highlight:

  • Google Squared: This Google Labs experiment is an amazing mash-up of topic extraction and turning unstructured web data into structured data. Simply type in any category (example: “Chinese Emperors”) and it will bring up a spreadsheet of items in that category along with some of their properties. Next, you can add your own properties (“Inventions”) and it will automatically fill in the results using data searched from the web and converted back into structured data. It’s truly one of the most remarkable things to come out of Google, and with a bit more work on it (say, a voice recognition interface) it could be a mainstream breakthrough.
  • OpenCalais Topic Extraction: Another semantic analysis tool that will pull out “topics” automatically and link them to linked open data. Try out the free demo and copy-and-paste a news article. After submitting the article, you’ll see the topics it has automatically linked listed on the side.

Like I’ve mentioned before, I feel that we’re right at a tipping point: over the next several years, advances in knowledge extraction and interpolation will have a revolutionary effect on everything, including how we interact with computing, and will bring exponential advances in data forecasting. Projects like Wikipedia (an unstructured data source) are just the beginning.

P.S. My favorite comment about the Man versus Machine Jeopardy contest: “Why couldn’t they have programmed Watson to use the voice of Sean Connery?”