Data liberation and data integration with Kafka
A talk at Strata + Hadoop World, New York, NY, US, 30 Sep 2015
Abstract
Even the best data scientist can’t do much if they cannot easily get access to the necessary
data. Simply making the data available is the first step towards becoming a data-driven organization. In
this talk, we’ll explore how Apache Kafka can replace slow, fragile ETL processes with real-time
data pipelines, and discuss best practices for data formats and integration with existing systems.
Apache Kafka is a popular open source message broker for high-throughput
real-time event data, such as user activity logs or IoT sensor data. It originated at LinkedIn,
where it reliably handles around a trillion messages per day.
What is less widely known: Kafka is also well suited for extracting data from existing databases,
and making it available for analysis or for building data products. Unlike slow batch-oriented ETL,
Kafka can make database data available to consumers in real time, while also allowing efficient
archiving to HDFS, for use in Spark, Hadoop or data warehouses.
When data science and product teams can process operational data in real time, and combine it with
user activity logs or sensor data, that turns out to be a potent mixture. Having all the data
centrally available in a stream data platform is an exciting enabler for data-driven innovation.
In this talk, we will discuss what a Kafka-based stream data platform looks like, and how it is
useful:
- Examples of the kinds of problems you can solve with Kafka
- Extracting real-time data feeds from databases, and sending them to Kafka
- Using Avro for schema management and future-proofing your data
- Designing your data pipelines to be resilient, but also flexible and amenable to change
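
To make the point about schema management more concrete, here is a minimal sketch of the idea behind Avro-style schema evolution: a consumer using a newer "reader" schema can still process records written with an older "writer" schema, provided every added field declares a default. The `PageView` record, its field names, and the `resolve` helper are hypothetical and exist only for illustration. The sketch uses plain Python dictionaries rather than the Avro library, and it ignores type promotion, aliases and nested records, all of which real Avro schema resolution also handles.

```python
import json

# Hypothetical schemas for illustration; Avro schemas are written as JSON.
# The writer schema is the older version that produced the data.
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "PageView",
 "fields": [
   {"name": "user_id", "type": "long"},
   {"name": "url", "type": "string"}
 ]}
""")

# The reader schema is a newer version that adds an optional field
# with a default, so that old records remain readable.
READER_SCHEMA = json.loads("""
{"type": "record", "name": "PageView",
 "fields": [
   {"name": "user_id", "type": "long"},
   {"name": "url", "type": "string"},
   {"name": "referrer", "type": ["null", "string"], "default": null}
 ]}
""")


def resolve(record, writer_schema, reader_schema):
    """Project a record decoded with the writer schema onto the reader schema.

    Simplified assumption: fields are matched by name only.
    """
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field exists in both schemas: take the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields known only to the writer are silently dropped.
    return resolved


old_record = {"user_id": 42, "url": "/home"}
print(resolve(old_record, WRITER_SCHEMA, READER_SCHEMA))
# → {'user_id': 42, 'url': '/home', 'referrer': None}
```

Because the new field carries a default, producers and consumers can be upgraded independently, which is one way a pipeline stays amenable to change.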
References
- Jay Kreps: “Putting Apache Kafka to use: A practical guide to building a stream data platform (part 1),” 25 February 2015.
- Gwen Shapira: “The problem of managing schemas,” 4 November 2014.
- Martin Kleppmann: “Schema evolution in Avro, Protocol Buffers and Thrift,” 5 December 2012.
- Martin Kleppmann: “Bottled Water: Real-time integration of PostgreSQL and Kafka,” 23 April 2015.
- Martin Kleppmann: “Designing data-intensive applications,” O’Reilly Media, to appear.
- Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012.