Kafka & Samza – LinkedIn's open source stream processing infrastructure
A talk at
STAC Summit,
London, UK, 30 Oct 2014
This talk was given to an audience of technologists in the financial services industry. They have
been users of sophisticated stream processing systems for a long time, and wanted to know what the
new systems such as Samza, coming out of internet companies,
are all about.
Abstract
Only a handful of industries used to be concerned with event-stream processing, such as defense,
sensor-driven manufacturing, and of course capital markets. Today, stream processing is a topic in
many more industries, from retailing to utilities to social media. And as with so many
data-intensive problems today, web companies are creating and open sourcing a large amount of code
to handle them. Retail banking and wealth management technologists are considering these open source
tools to deal with whole new classes of problems, while some trading firms are starting to use them
for old classes of problems.
Apache Kafka and Apache Samza are two big data technologies open sourced by LinkedIn. Kafka is
a publish-subscribe message bus designed for high throughput and reliability. Samza builds on Kafka
and Hadoop to provide high-throughput, stateful stream processing across a cluster (joining,
filtering, transforming, etc.). In this talk, Martin will introduce the motivation and architecture
of these systems and explore their uses and limitations.