I wanted to invite technically-minded Beijing folks again to a presentation that I’m giving on Thursday at the MongoDB conference. While I’m still relatively new to MongoDB, I’m taking the opportunity to share some insights on building a new multilingual, comprehensive entertainment database using linked open data. The presentation traces an evolution from the early days of Rotten Tomatoes, when we assembled the movie information manually, to my current efforts with Alive.cn.
I’m still not certain whether I’m going to deliver my presentation in English or in Chinese. Obviously, I’m much more comfortable speaking English, but I would like to make sure that the audience gets the message correctly. In any case, I’ve posted both English and Chinese versions of the presentation below. I decided to go with a movie theme in the visuals throughout the presentation to keep things in line with my “entertainment database” topic.
Looks like some of the presentation fonts and layout didn’t get transferred too well with the upload to SlideShare, but you can get the general gist below:
Following up on my recent blog post mentioning the topic extraction tool OpenCalais: DBpedia, one of the awesome linked open data projects I’ve been using a bunch for Alive.cn, just released its own topic extraction tool, DBpedia Spotlight. If you are okay with downloading 9GB of Lucene indices and setting up their scripts, you can have your own self-hosted topic extraction tool. They have basically open-sourced something worth a lot of money in what used to be a relatively closed space.
What is topic extraction? Check this demo out and enter any block of text, say, a recent news article. The benefit of using DBpedia’s solution (besides it being free) is that it automatically ties the extracted topics back to their DBpedia entries, which already have a huge storehouse of Wikipedia-derived linked open data behind them.
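To make the idea a bit more concrete, here is a rough Python sketch of what calling a Spotlight server and pulling out the linked topics could look like. Treat it as an illustration under assumptions rather than the project’s official client: the endpoint URL is a placeholder for wherever your own install listens, the `text` and `confidence` parameters and the `Resources`, `@surfaceForm` and `@URI` field names reflect my understanding of the JSON the Spotlight web service returns, and the function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder address for a self-hosted Spotlight server.
# Point this at wherever your own install actually listens.
SPOTLIGHT_URL = "http://localhost:2222/rest/annotate"


def build_request(text, confidence=0.5, endpoint=SPOTLIGHT_URL):
    """Build a GET request asking the server to annotate `text`, with a JSON reply."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{endpoint}?{query}",
        headers={"Accept": "application/json"},
    )


def extract_topics(response):
    """Turn a parsed Spotlight-style reply into (phrase, DBpedia URI) pairs.

    Assumption: each annotation sits under "Resources" and carries
    "@surfaceForm" (the matched text) and "@URI" (the DBpedia entry).
    The key is treated as optional, so a reply without it yields no topics.
    """
    return [
        (resource["@surfaceForm"], resource["@URI"])
        for resource in response.get("Resources", [])
    ]


def annotate(text, confidence=0.5, endpoint=SPOTLIGHT_URL):
    """Send `text` to the server and return the topics it found (needs a running server)."""
    with urllib.request.urlopen(build_request(text, confidence, endpoint)) as resp:
        return extract_topics(json.load(resp))
```

With a server running, `annotate("IBM's Watson is playing Jeopardy this week.")` would return a list of phrase and URI pairs, and each URI is the hook back into the rest of the linked open data.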
It’s probably the most public test yet of the advances in linked open structured data and semantic text analysis, so I’m closely following this tournament pitting IBM’s supercomputer Watson against the two most successful Jeopardy champions. I suspect that they’re using the same publicly available data sets that we’re using for constructing Alive.cn.
I wonder, however, why they chose to rely only on electronically fed questions rather than going the final mile and adding a voice recognition interface on top of the system. Voice recognition accuracy has gotten so good these days, but I wonder if the last few percentage points of error would make a critical difference against the best human players.
There have been some other truly AMAZING projects in this field. Two I’d like to highlight:
Google Squared: This Google Labs experiment is an amazing mash-up of topic extraction and turning unstructured web data into structured data. Simply type in any category (example: “Chinese Emperors”) and it will bring up a spreadsheet of items in that category along with some of their properties. Next, you can add your own properties (“Inventions”) and it will automatically fill in the results using data searched from the web and converted back into structured form. It’s truly one of the most remarkable things to come out of Google, and with a bit more work on it (say, a voice recognition interface) it could be a mainstream breakthrough.
OpenCalais Topic Extraction: Another semantic analysis tool that will pull out “topics” automatically and link them against linked open data. Try out the free demo and copy-and-paste a news article. After submitting the article, you’ll see it has linked together topics on the side automatically.
Like I’ve mentioned before, I feel that we’re right at a tipping point: over the next several years, advances in knowledge extraction and interpolation will have a revolutionary effect on everything, including how we interact with computers and how quickly data forecasting improves. Projects like Wikipedia (an unstructured data source) are just the beginning.
P.S. My favorite comment about the Man versus Machine Jeopardy contest: “Why couldn’t they have programmed Watson to use the voice of Sean Connery?”