Searching over Streams with Luwak and Apache Samza
A talk at FOSDEM (Open Source Search Room), Brussels, Belgium, 31 Jan 2015
Co-presented with Alan Woodward.
Traditional searches take the form of individual queries over large, mostly stable corpora of
documents. In this talk, we’ll show how we invert this paradigm to allow for searching over streams
of documents by combining Samza, a distributed stream-processing framework, with Luwak, a library
for efficiently running large numbers of queries over individual documents.
Abstract
Real-time searching over streams is useful in a number of contexts. For example, companies may want
to detect whenever they are mentioned in a news feed, or a Twitter user might want to see
a continuous stream of tweets for a particular hashtag.
Luwak provides a mechanism for running many thousands of
queries over a single document in a highly efficient manner, by filtering out queries that it can
detect will not match. Luwak is designed to run on a single node, holding all registered queries in
RAM. Scaling to higher document throughput, or to more queries, requires parallelization across
multiple machines.
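To make the filtering idea concrete, here is a minimal Python sketch. It is not Luwak's API (Luwak is a Java library built on Lucene); the class `ToyMonitor`, its methods, and the restriction to simple "all terms must appear" queries are assumptions made purely for illustration. Each registered query is indexed under one of its terms, so a document only triggers full evaluation of the queries whose index term it actually contains.

```python
from collections import defaultdict


class ToyMonitor:
    """Hypothetical stand-in for a query monitor: all queries are held in memory."""

    def __init__(self):
        self.queries = {}                # query_id -> set of required terms
        self.by_term = defaultdict(set)  # index term -> ids of queries filed under it

    def register(self, query_id, terms):
        terms = {t.lower() for t in terms}
        if not terms:
            raise ValueError("a query needs at least one term")
        # Drop any stale index entry if this id was registered before.
        if query_id in self.queries:
            self.by_term[min(self.queries[query_id])].discard(query_id)
        self.queries[query_id] = terms
        # File the query under a single term. An all-terms query cannot match
        # a document that lacks this term, so it is safe to skip it then.
        # (Assumption: the alphabetically smallest term is an arbitrary choice.)
        self.by_term[min(terms)].add(query_id)

    def match(self, text):
        doc_terms = set(text.lower().split())
        # Step 1: prefilter. Collect only queries whose index term is in the document.
        candidates = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        # Step 2: fully evaluate the surviving candidates.
        matched = sorted(q for q in candidates if self.queries[q] <= doc_terms)
        return matched, len(candidates)


monitor = ToyMonitor()
monitor.register("q1", ["samza", "stream"])
monitor.register("q2", ["luwak"])
monitor.register("q3", ["kafka", "stream"])
matched, evaluated = monitor.match("Samza is a stream processing framework")
print(matched, evaluated)
```

In this toy run only one of the three registered queries is evaluated in full; the other two are skipped because the document does not contain their index term. The saving grows with the number of registered queries, since most queries are irrelevant to any given document.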
Samza provides a framework for such parallelization, by
partitioning and recombining both the document streams and the query set (which can be treated as
just another stream), and also provides fault-tolerance mechanisms that allow swift recovery from
machine failure, without losing documents or queries.
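One way to picture the partitioning is the following single-process Python simulation. It does not use Samza's API, and the layout it shows (queries split across partitions by a hash of their id, every document sent to every partition, partial matches merged per document) is only one possible arrangement, assumed here for illustration. The function names `partition_for`, `contains_all_terms`, and `run_partitioned` are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    # Stable hash, so a given query id always lands on the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def contains_all_terms(query_terms, text):
    # Toy matcher: a query matches when all of its terms occur in the text.
    return {t.lower() for t in query_terms} <= set(text.lower().split())


def run_partitioned(queries, documents, num_partitions):
    # Split the query set: each partition holds only its share of the queries.
    partitions = [dict() for _ in range(num_partitions)]
    for query_id, terms in queries.items():
        partitions[partition_for(query_id, num_partitions)][query_id] = terms

    # Send every document to every partition and match it against the
    # queries held locally. In a real deployment each partition would run
    # on a different machine; here they run one after another.
    partial = []
    for doc_id, text in documents.items():
        for index, local_queries in enumerate(partitions):
            hits = [q for q, terms in local_queries.items()
                    if contains_all_terms(terms, text)]
            partial.append((doc_id, index, hits))

    # Recombine: merge the partial results for each document.
    combined = {doc_id: [] for doc_id in documents}
    for doc_id, _, hits in partial:
        combined[doc_id].extend(hits)
    return {doc_id: sorted(hits) for doc_id, hits in combined.items()}


queries = {"q1": ["samza"], "q2": ["luwak"], "q3": ["stream", "search"]}
documents = {"d1": "Luwak and Samza for stream search", "d2": "unrelated text"}
print(run_partitioned(queries, documents, num_partitions=3))
```

The merged result is the same whatever the number of partitions; what changes is how many queries each partition has to hold and evaluate. Fault tolerance and recovery are not modelled in this sketch.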