
Episode 117: Full Text Search

Author
Jade Robbins and Mark Sanborn
Published
Tue 19 Apr 2011
Episode Link
http://faceoffshow.com/2011/04/19/episode-117-full-text-search/

Add enterprise-level search to your site.


News and Follow/Ups – 01:00



Geek Tools – 14:13





Webapps – 16:12





Full Text Search – 22:11




  • Options

    • Google Custom Search

      • Commercial

      • Benefits

        • Super fast to set up

        • Easy to implement

        • Ability to add AdSense to search results



      • Downsides

        • No way to adjust content ranking or do custom integration

        • Mainly just indexes HTML pages, not search queries or other text





    • Sphinx

      • “Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.”

      • Open source with commercial support

      • Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.

      • The search service daemon (searchd) is pretty low on memory usage – and you can set limits on how much memory the indexer process uses too.

      • API for:

        • Java, PHP, Python, Ruby, Perl, C, and other languages.



      • Written in C++

      • Stats

        • 60+ MB/sec per server

        • 500+ queries/sec

        • The biggest known Sphinx cluster indexes 5 billion documents, resulting in over 6 TB of data. The busiest known one is, unsurprisingly, Craigslist, which serves 50+ million search queries/day.



      • Companies using Sphinx
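
To illustrate the SphinxQL and field-weighting points from the Sphinx notes above, here is a minimal Python sketch that only builds a SphinxQL statement as a string. The index name `articles` and the fields `title` and `body` are hypothetical, and the helper names are ours, not part of Sphinx. Sending the statement to searchd would need a MySQL-protocol client, which is left out here.

```python
from typing import Dict, Optional


def escape_match(text: str) -> str:
    """Escape backslashes and single quotes for a SQL string literal.

    This does not escape Sphinx full-text operators; that is out of scope here.
    """
    return text.replace("\\", "\\\\").replace("'", "\\'")


def build_sphinxql(index: str, terms: str,
                   weights: Optional[Dict[str, int]] = None,
                   limit: int = 20) -> str:
    """Build a SELECT ... WHERE MATCH(...) statement with optional field weights."""
    sql = f"SELECT id FROM {index} WHERE MATCH('{escape_match(terms)}')"
    sql += f" LIMIT {limit}"
    if weights:
        # Higher weight means matches in that field count for more in ranking.
        pairs = ", ".join(f"{field}={weight}" for field, weight in weights.items())
        sql += f" OPTION field_weights=({pairs})"
    return sql


# Hypothetical index and fields, used only for illustration.
statement = build_sphinxql("articles", "full text search",
                           weights={"title": 10, "body": 1}, limit=5)
print(statement)
```

Without the `weights` argument the statement relies on the default relevance ranking mentioned in the notes.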






    • Lucene

      • Developed by the Apache Software Foundation

      • Open source

      • Written in Java

      • Search types

        • ranked searching — best results returned first

        • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more

        • fielded searching (e.g., title, author, contents)

        • date-range searching

        • sorting by any field

        • multiple-index searching with merged results

        • allows simultaneous update and searching



      • Stats

        • over 95GB/hour on modern hardware

        • small RAM requirements — only 1MB heap

        • index size roughly 20-30% the size of text indexed
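
The query types listed under "Search types" above map onto Lucene's classic query-parser syntax. As a rough sketch, the Python helpers below only assemble query strings in that syntax; they do not run a search (Lucene itself is a Java library). The field names `title` and `published` are assumptions for illustration, and the helpers do not escape special characters in their input.

```python
def phrase(text: str) -> str:
    """Phrase query: the words must appear together, in order."""
    return f'"{text}"'


def proximity(text: str, distance: int) -> str:
    """Proximity query: the words must appear within `distance` positions."""
    return f'"{text}"~{distance}'


def fielded(field: str, clause: str) -> str:
    """Fielded query: restrict a clause to one field."""
    return f"{field}:{clause}"


def inclusive_range(field: str, low: str, high: str) -> str:
    """Range query with both endpoints included."""
    return f"{field}:[{low} TO {high}]"


# Hypothetical fields, used only for illustration.
examples = [
    fielded("title", phrase("full text search")),
    fielded("title", "sear*"),                       # wildcard query
    proximity("enterprise search", 5),
    inclusive_range("published", "20110101", "20110419"),
]
for query in examples:
    print(query)
```

Strings like these can be handed to a Lucene query parser, or passed as the query parameter to a server built on Lucene.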





    • Solr

      • Lucene is a library, whereas Solr is a server built on Lucene that supports XML and REST

      • Benefits over Sphinx

        • Solr is easily embeddable in Java applications.

        • Solr can be integrated with Hadoop to build distributed applications

        • Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can’t.



      • Companies using Solr
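
Because Solr runs as a server, a search is just an HTTP request, which is the practical difference from using Lucene as a library. The sketch below builds such a request URL in Python without sending it. The base URL is an assumption (a local install with a default-style path; the exact path depends on the Solr version and setup), and the `title` field is hypothetical.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr select handler; adjust for your install.
DEFAULT_SELECT_URL = "http://localhost:8983/solr/select"


def build_solr_url(query: str, rows: int = 10,
                   base_url: str = DEFAULT_SELECT_URL) -> str:
    """Build a Solr select URL asking for JSON output."""
    params = {
        "q": query,    # query in Lucene query syntax
        "rows": rows,  # maximum number of documents to return
        "wt": "json",  # response writer: JSON instead of the default XML
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical field name, used only for illustration.
url = build_solr_url('title:"full text search"')
print(url)
```

Fetching the URL, for example with `urllib.request.urlopen`, would require a running Solr instance with documents already indexed.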






