Searching over Streams with Luwak and Apache Samza

A talk at FOSDEM (Open Source Search Room), Brussels, Belgium, 31 Jan 2015

Traditional searches take the form of individual queries over large mostly-stable corpuses of documents. In this talk, we’ll show how we invert this paradigm to allow for searching over streams of documents by combining Samza, a distributed stream-processing framework, with Luwak, a library for efficiently running large numbers of queries over individual documents.

Abstract

Real-time searching over streams is useful in a number of contexts. For example, companies may want to detect whenever they are mentioned in a news feed; or a Twitter user might want to see a continuous stream of tweets for a particular hashtag.

Luwak provides a mechanism for running many thousands of queries over a single document in a highly efficient manner, by filtering out queries that it can detect will not match. Luwak is designed to run on a single node, holding all registered queries in RAM. Scaling to higher document throughput, or to more queries, requires parallelization across multiple machines.

Samza provides a framework for such parallelization, by partitioning and recombining both the document streams and the query set (which can be treated as just another stream), and also provides fault-tolerance mechanisms that allows swift recovery from machine failure, without losing documents or queries.

Searching over Streams with Luwak and Apache Samza

Abstract

Recent posts

Conference talks

Searching over Streams with Luwak and Apache Samza

Abstract

Subscribe

My book

Recent posts

Conference talks