Processing a stream of events with Kafka and Flink
Dear reader, in this post you’ll learn how to deploy a Kafka cluster and a Flink cluster. You’ll also learn how to build a simple Flink application that processes a stream of events coming from a Kafka producer. The tutorial is intended to give the reader a first experience with both technologies and to deploy this simple but powerful application successfully.
Configuration of the environment
Let’s first configure the environment. Basically we need to download and configure Kafka, Flink and Eclipse.
Install Kafka
https://kafka.apache.org/downloads
version: kafka_2.12-2.6.0 (the Scala 2.12 build of Kafka 2.6.0, used in this demo)
Unzip the file and move it to an applications folder on your computer
modify configuration files
zookeeper.properties
dataDir=../kafka-logs/zookeeper
server.properties
log.dirs=../kafka-logs/server-0
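These are the only configuration edits needed for this demo. As a minimal sketch, the relevant lines of the two files look like this (the kafka-logs paths are just the ones chosen for this demo, resolved relative to the directory you start Kafka from):
# config/zookeeper.properties: directory where ZooKeeper stores its snapshot data
dataDir=../kafka-logs/zookeeper
# config/server.properties: directory where the Kafka broker stores its log segments
log.dirs=../kafka-logs/server-0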
set the KAFKA_HOME environment variable
export KAFKA_HOME=/Users/david/bigdata/kafka_2.12-2.6.0
Remember to have this variable set in every new terminal that you open. Alternatively, you can define it in your .bashrc profile file.
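For example, a quick way to make it permanent (using the path from this demo) is to append the export line to your profile:
echo 'export KAFKA_HOME=/Users/david/bigdata/kafka_2.12-2.6.0' >> ~/.bashrc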
Now do a quick test if you want
https://kafka.apache.org/documentation/#quickstart
But for our purposes we’ll only need this:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
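The Flink application below reads from a topic called test. If it does not already exist (depending on your broker configuration it may be auto-created on first use), you can create it from another terminal; this assumes the broker listens on the default localhost:9092:
bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1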
Install Flink
https://www.apache.org/dyn/closer.lua/flink/flink-1.11.2/flink-1.11.2-bin-scala_2.11.tgz
Unzip the file and move it to an applications folder on your computer
Go to the Flink directory and start the Flink cluster:
./bin/start-cluster.sh
Check that the Flink cluster is running by opening the web dashboard, which by default is served at http://localhost:8081.
Install Eclipse
https://www.eclipse.org/downloads/
Follow the installer steps and set the installation path to an applications folder on your computer
Coding the application
Create a new Maven project in Eclipse
Right-click in the Package Explorer (on the left-hand side), then New, Project…, Maven, Maven Project
Click Next, check the “Create a simple project” checkbox and then click Next
Set a group id and an artifact id (these will identify your project), then click Finish
Right-click on the project, then New, Package, and give the package a name
Right-click on the newly created package, create a new class, give it a name and click Finish
Then replace the contents of the class with the following code:
package kfp1;

import java.util.Properties;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.*;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class KafkaFlinkC1 {

    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka consumer configuration: point the connector to the local broker
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "127.0.0.1:9092");

        // Source: consume the "test" topic as a stream of strings
        DataStream<String> kafkaData = env.addSource(
                new FlinkKafkaConsumer011<>("test", new SimpleStringSchema(), p));

        kafkaData
                // Split each incoming line into words and emit a (word, 1) pair per word
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
                        String[] words = value.split(" ");
                        for (String word : words)
                            out.collect(new Tuple2<String, Integer>(word, 1));
                    }
                })
                // Key by the word (field 0) and keep a running sum of the counts (field 1)
                .keyBy(0)
                .sum(1)
                // Sink: write the running counts to a text file
                .writeAsText("/Users/david/bigdata/kafka.txt");

        env.execute("Kafka Example");
    }
}
Remember to change the package and class names to the ones you defined when creating the project, package and class.
Now it’s time to declare the dependencies in the pom.xml file. Use the following code:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>kafka-flink</groupId>
  <artifactId>kafka-flink</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-java</artifactId>
      <version>1.11.2</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-java_2.11</artifactId>
      <version>1.11.2</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients_2.11</artifactId>
      <version>1.11.2</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
      <version>1.11.2</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.1</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.1.0</version>
        <configuration>
          <!-- get all project dependencies -->
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <!-- MainClass in manifest makes an executable jar -->
          <archive>
            <manifest>
              <mainClass>kfp1.KafkaFlinkC1</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <!-- bind to the packaging phase -->
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <properties>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>
</project>
Remember to check your Java compiler version as well as the downloaded Flink version. Otherwise you may not see errors at compilation time, but you will face errors when deploying the application on the Flink cluster. For this demo the Flink version is 1.11.2.
Go to the pom.xml file, right-click and select Maven clean, then do a Maven install. The first time, the required jars will be downloaded, so this can take a few minutes. In the Eclipse console at the bottom, check the path of the generated jar and keep it at hand, because we will use that path to tell flink run where the .jar is located.
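If you prefer the command line, the rough equivalent from the project root is:
mvn clean package
Since the assembly plugin is bound to the package phase, this also produces the jar-with-dependencies file under the target folder.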
Testing the system
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
./bin/flink run /Users/david/eclipse-workspace/kafka-flink/target/kafka-flink-0.0.1-SNAPSHOT-jar-with-dependencies.jar
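As a side note, the main class is picked up from the jar manifest we configured in the pom.xml. If it were not set there, you could pass it explicitly to flink run with the -c option, along these lines:
./bin/flink run -c kfp1.KafkaFlinkC1 /Users/david/eclipse-workspace/kafka-flink/target/kafka-flink-0.0.1-SNAPSHOT-jar-with-dependencies.jar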
Now we can check that our job is running in the Flink cluster by going to the Flink dashboard.
Now go back to the Kafka producer terminal and start sending values. Flink will receive those values, turn them into keys, aggregate the counts and write the result to a file sink.
If we want to see what the code is doing under the hood while running on the Flink cluster, we can click on the Flink job in the dashboard.
Notice how the application is structured into two tasks. The first task is in charge of receiving the events that the Kafka producer sends and applying the flat map, whereas the second task performs the aggregation for each key and then sends the results to the sink, which in this case consists of writing to an output file at the designated path.
Now check the output file and voilà, you’ll see how the count for each key increases by 1 each time you type it in the Kafka producer.
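For example, typing these two made-up lines in the producer:
hello flink
hello kafka
should leave something along these lines in the output file, since each Tuple2 is written with its running count:
(hello,1)
(flink,1)
(hello,2)
(kafka,1)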
Conclusions
Taking into account that Kafka and Flink are not easy technologies to learn and deploy, this tutorial is a demonstration of how to set up a local environment in an easy and fast way. Having said that, it is now much easier to learn Flink and to test our ideas locally without needing to run anything on a cluster, on Kubernetes, or even on MSK and EMR in AWS, which would cost us money just for the sake of learning.
Comparing Flink with Spark Streaming would require a whole post. However, at the time of writing, Flink turns out to be a bit more difficult than Spark Structured Streaming, as its operations resemble the MapReduce model more closely, so it is not as straightforward as thinking in terms of DataFrames or Datasets.