Episode 117: Full Text Search

Authors: Jade Robbins and Mark Sanborn
Published: Tue 19 Apr 2011
Episode Link: http://faceoffshow.com/2011/04/19/episode-117-full-text-search/

Add enterprise-level search to your site.

News and Follow/Ups – 01:00

Square now being sold in Apple’s store

Check-Ins dying out?

Dropbox: 25 million users

Geek Tools – 14:13

Yikerz! – Super fun magnet game

Webapps – 16:12

Surfboard – Flipboard as a web app

InstaLyrics – Find lyrics quickly

Full Text Search – 22:11

Options
- Google Custom Search
  - Commercial
  - Benefits
    - Super fast to set up
    - Easy to implement
    - Ability to add AdSense to search results
  - Downsides
    - Can't adjust content ranking or do custom integration
    - Mainly just for indexing HTML pages, not database content or other text
- Sphinx
  - “Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.”
  - Open source with commercial support
  - Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  - The search service daemon (searchd) is pretty low on memory usage – and you can set limits on how much memory the indexer process uses too.
  - API for:
    - Java, PHP, Python, Ruby, Perl, C, and other languages.
  - Written in C++
  - Stats
    - Indexing: 60+ MB/sec per server
    - Searching: 500+ queries/sec
    - The biggest known Sphinx cluster indexes 5 billion documents, resulting in over 6 TB of data. The busiest known one is, unsurprisingly, Craigslist, which serves 50+ million search queries/day.
  - Companies using Sphinx
    - Craigslist
    - Slashdot
    - Mozilla
    - WordPress.org
- Lucene
  - Developed by the Apache Software Foundation
  - Open source
  - Written in Java
  - Search types
    - ranked searching — best results returned first
    - many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
    - fielded searching (e.g., title, author, contents)
    - date-range searching
    - sorting by any field
    - multiple-index searching with merged results
    - allows simultaneous update and searching
  - Stats
    - indexing: over 95GB/hour on modern hardware
    - small RAM requirements: only 1MB heap
    - index size roughly 20-30% of the size of the text indexed
- Solr
  - Lucene is a library, whereas Solr is a search server built on Lucene that you talk to over HTTP (XML, REST-style requests)
  - Benefits over Sphinx
    - Solr is easily embeddable in Java applications.
    - Solr can be integrated with Hadoop to build distributed applications
    - Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can’t.
  - Companies using Solr
    - eHarmony
    - Ticketmaster
    - Digg
    - AOL
    - Zappos
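
To make the Sphinx and Solr notes above a bit more concrete, here is a rough Python sketch of what a search request could look like for each. It only builds the request strings and does not talk to a real server. The index name (articles), the field names (title, body), and the Solr base URL are made up for illustration. The SphinxQL statement uses MATCH() plus the field_weights option to weight one field above another, and the Solr URL uses the common q, fl, rows, and wt parameters.

```python
from urllib.parse import urlencode


def sphinxql_query(index, text, field_weights, limit=10):
    """Build a SphinxQL statement string (a sketch; nothing is sent)."""
    # Escape backslashes and single quotes so the text stays inside MATCH('...').
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    # field_weights gives some fields more influence on relevance ranking.
    weights = ", ".join(
        f"{field}={weight}" for field, weight in field_weights.items()
    )
    return (
        f"SELECT id FROM {index} WHERE MATCH('{escaped}') "
        f"LIMIT {limit} OPTION field_weights=({weights})"
    )


def solr_select_url(base_url, query, fields, rows=10):
    """Build the URL for a Solr select request that asks for JSON results."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical index, fields, and host, used only for illustration.
    print(sphinxql_query("articles", "full text search", {"title": 10, "body": 1}))
    print(solr_select_url("http://localhost:8983/solr", "title:search", ["id", "title"]))
```

In practice the SphinxQL string would be sent to searchd with an ordinary MySQL client, and the Solr URL would be fetched with any HTTP client. Both steps are left out here so the sketch runs without a server.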
