[ml] Slides and Additional Notes for IR talk

Thu Sep 23 03:54:22 UTC 2010

I wanted to send out a few more pointers in the vein of my talk from last
week.  SMART is probably targeted more at academics.  Hackers will probably
be more interested in some of the more practical tools for building
real-world relevance systems.

Specifically, I wanted to introduce you all to Nutch, which is an Apache
project providing crawling and searching utilities built on Hadoop and Lucene
(also Apache projects).  With Nutch you can configure fetcher, parser,
indexer, and searcher plugins to use it for anything from a pimped-out
custom search engine for your website to a relevance engine for any domain.

*What is Hadoop?*
Hadoop is open source software for building scalable, distributed
computing systems.  It has a MapReduce implementation, which Nutch uses to
run its crawling, parsing, and indexing work.
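
To make the MapReduce idea concrete, here is a minimal single-process sketch
in Python.  It is not Hadoop's API and not how Nutch is written; the function
names (map_phase, shuffle, reduce_phase, word_count) are made up for
illustration.  It only shows the shape of the pattern: map emits key/value
pairs, a shuffle groups them by key, and reduce collapses each group.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit a (term, 1) pair for every token in one document."""
    for token in text.lower().split():
        yield token, 1


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, counts):
    """Reduce step: collapse all counts for one term into a total."""
    return term, sum(counts)


def word_count(docs):
    """Run map, shuffle, and reduce in one process over {doc_id: text}."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(term, counts) for term, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count({"a": "nutch crawls", "b": "Nutch indexes"}))
```

In a real cluster the map and reduce calls would run on different machines
over different slices of the data, which is what makes the approach scale.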

*What is Lucene?*
Lucene is a full-text document search engine project.  At the core of
Lucene's search algorithms is tf-idf.  Nutch uses Lucene by generating Lucene
indices as the output of its crawl process.  Nutch extends the Lucene
searcher with its plugins, but the core of the relevance algorithm comes down
to Lucene's:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
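
For anyone who wants to see tf-idf in miniature, here is a rough Python
sketch of the textbook version.  It is an assumption-laden illustration, not
Lucene's actual formula: the Similarity class linked above adds further
factors (length normalization, for example), and the function name
tf_idf_scores and the whitespace tokenizer are made up for this example.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document in {doc_id: text} against a query.

    Textbook tf-idf: sum over query terms of
    (term count in the doc) * log(N / number of docs containing the term).
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term]:  # skip terms that appear in no document
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores


if __name__ == "__main__":
    docs = {
        "d1": "lawyer search lawyer",
        "d2": "search engine",
        "d3": "crawl engine",
    }
    print(tf_idf_scores("lawyer", docs))
```

The point to notice is that a term appearing in few documents gets a large
weight, while a term appearing in every document contributes nothing.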

*What can you do with Nutch?*
I've used Nutch to implement full-text search for http://findlaw.com.  They
had used a Google Search Appliance for years, but we were able to replace the
appliance with a custom in-house search engine built on Nutch.

We also used Nutch to build an ads relevance system for serving up the ads
on FindLaw.com.  This project required custom fetcher, parser, indexer, and
searcher Nutch plugins on top of the Nutch basics, resulting in a
domain-specific relevance system that took advantage of Hadoop's scale and
Lucene's tf-idf implementation, while being totally outside the normal
application domain of Nutch (full-text search of web pages).

*Links:*
Nutch: http://nutch.apache.org/
Hadoop: http://hadoop.apache.org/
Lucene: http://lucene.apache.org/
A tutorial: http://wiki.apache.org/nutch/NutchTutorial

Jared-

PS: I'll move this thread to the wiki for archival purposes eventually...

On Wed, Sep 15, 2010 at 11:26 PM, Jared Dunne <jareddunne at gmail.com> wrote:

> The Slides:
>
> https://docs.google.com/present/edit?id=0Ae0pay6z9C6GZGYzZG1uMm5fNDFmdGt2YnhoYw&hl=en
>
>
> SMART
> Someone asked a good question after the talk about whether there was a "generic"
> vector space model framework out there. We discussed "search appliances"
> such as Google's offerings, but you were looking for something that you
> could hand off the term vectors or data for a given domain and then have a
> toolkit of these vector space algorithms provided to look at it. I mentioned
> Salton on the theory front, but I should have also mentioned the result of his
> research, SMART, which is an implementation of his work with sample data
> sets. It's probably a good thing to play around with in the vein of your
> question.
>
> Lots of good links from its Wikipedia page:
> http://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
>
> SMART unixy bits via FTP:
> ftp://ftp.cs.cornell.edu/pub/smart/
>
> This tutorial looks promising (loving the old school html):
> http://www.tcnj.edu/~mmmartin/CSC485IMME321/Papers/SMART/SmartCourse.html
>
>
> I'll probably send out some additional stuff later on about other areas
> that we started to touch on in the discussion after the talk, like query and
> term expansion and spell correction.
>
> Jared-
>
>
>