Scalable stream processing with Apache Samza and Apache Kafka

A talk at ApacheCon Europe, Budapest, Hungary, 18 Nov 2014

Abstract

Samza, an Apache Incubator project, is a framework for processing and analysing high-volume data streams. It is built upon Apache Kafka and YARN (Hadoop 2.0). You can think of Samza as a real-time, continuously running version of MapReduce.

In this talk, Martin will show why stream processing is becoming an important part of the architecture of data-intensive applications, alongside storage and batch processing. We will explore how Samza works, and show how it reliably processes millions of messages per second. We will also examine what kinds of applications would benefit from using Samza.

This talk is for anyone interested in large-scale data processing problems. Developers working with Hadoop, distributed storage (e.g. HBase, Cassandra) or real-time data flows will find it particularly interesting. You will learn:

  • What kinds of real-time data problems you can solve with Samza;
  • How the stream processing model helps developers write more reliable applications more easily;
  • Apache Samza’s approach to stream processing, and how it compares to other frameworks;
  • How to contribute to Samza’s development.