How to Build A Distributed Streaming Data Application With Apache Kafka?

    • May 3, 2019
    • Share :

    Apache Kafka is an open-source software platform as per official kafka document. It is used for building real-time data pipelines and streaming apps. It is distributed, horizontally scalable, fault-tolerant, commit log, wicked fast, and runs in production in thousands of companies. Our aim is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

    • It works in a publish-subscribe pattern.
    • Messaging Terminology
    Kafka Cluster

    Kafka was originally developed at LinkedIn in 2011. And handed over to Apache Software Foundation, it’s written in Scala and Java.

    Distributed

    • A distributed system is one which is split into multiple running machines, all of which work together in a cluster to appear as one single node to the end-user. Kafka is distributed in the sense that it stores, receives, and sends messages on different nodes (called brokers). The benefits of this approach are high scalability and fault tolerance.

    Horizontal scalability

    • Horizontal scalability is the ability to increase capacity by connecting multiple hardware or software entities so that they work as a single logical unit.
    • Horizontal scalability is solving the same problem by throwing more machines at it. Adding a new machine does not require downtime nor there is any limit to the number of machines you can have in your cluster. The catch is that not all systems support horizontal scalability, as they are not designed to work in a cluster and those that are usually more complex to work with.

    Fault-tolerant

    • Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components.
    • Distributed systems are designed in such a way to accommodate failures in a configurable way. In a 6-node Kafka cluster, you can have it continue working even if 3 of the nodes are down. It is worth noting that fault-tolerance is at a direct tradeoff with performance, as the more fault-tolerant your system is, the less performant it is.

    Commit Log

    • A commit log (also referred to as a write-ahead log, transaction log) is a persistent ordered data structure which only supports appends. You cannot modify or delete records from it. It is read from left to right and guarantees item ordering.

    Kafka has four core APIs:

    • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
    • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
    • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
    • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
    • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

    How did it work?

    1) Producers

    Producers publish data on the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!

    2) Consumers

    Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

    If all the consumer instances have the same consumer group, then the records will effectively be load-balanced over the consumer instances.

    If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

    3) Kafka Broker

    Kafka Broker A Kafka cluster is made up of multiple Kafka Brokers. Each Kafka Broker has a unique ID (number). Kafka Brokers contain topic log partitions. Connecting to one broker bootstraps a client to the entire Kafka cluster. For failover, you want to start with at least three to five brokers. A Kafka cluster can have, 10, 100, or 1,000 brokers in a cluster if needed.

    4) Kafka Topic

    Kafka Topic A Topic is a category/feed name to which messages are stored and published. Messages are byte arrays that can store any object in any format. As said before, all Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic.

    5) Kafka Stream

    • The real-time processing of data continuously, concurrently, and in a record-by-record fashion is what we call Kafka Stream processing.
    • Basically, Kafka Real-time processing includes a continuous stream of data. Hence, after the analysis of that data, we get some useful data out of it. Now, while it comes to Kafka, real-time processing typically involves reading data from a topic (source), doing some analysis or transformation work, and then writing the results back to another topic (sink). To do this type of work, there are several options.

    Use Case: 1

    Client requirement

    • In our client requirement, they want real-time monitoring of vehicles. a tracking system with Parental control, Alert on real-time critical incidents like overspeeding, harsh braking, sudden acceleration, traveling outside of geofencing, and other vehicle information.
    • To overcome with a requirement we need real-time data from the vehicle with OBD devices that we can get with MQTT but to process those data we need real-time data pipelines that can archive with Kafka and also need to process data that can achieve with Kafka Stream API.
    • Also, we need to monitor vehicle health in real-time and also provide the user some suggestions like change oil, check oil, check tires, coolant level, next service time. Those are all things we can achieve with data processing, also predict the future of vehicle health by its usage.

    Our solution

    • We propose a system which is highly scalable, real-time, and secure with the following tools. We use MQTT, Kafka, and InfluxDB to retrieve, manipulate(Transform) and store data from the OBD device.
    • IoT devices will communicate with our server using the MQTT protocol, MQTT mosquito broker will receive messages and send them to respective subscribers. We use spring boot microservice which acts as an MQTT client and subscribes to all topics which will be produced by the MQTT broker in the future.
    • When spring boot client subscriber receives a message using MQTT paho client and MqttAsyncClient will receive the message in JSON format.
    • After receiving messages from IoT devices we transform JSON data into AVRO which will be recommended by Kafka for internal communication and using Kafka producer we forward generated data to Kafka broker(In future cluster).
    • Using Kafka stream we manipulate data as per our need and forward it to Kafka consumer. During manipulation, we can produce a number of, messages which will be required by other microservices.
    • To store all those data in a time series manner we use InfluxDB. A Kafka consumer which is responsible to store data in InfluxDB will receive messages from a stream and store them into influxDB. Why we used InfluxDB.
    • The scenario was to receive messages from an IoT through MQTT then forward those messages to Kafka. Here Kafka stream comes into the picture. It manipulates data as per our requirement and forwards to Kafka than those messages will be received by Kafka consumers and store into InfluxDB
    • We can scale Kafka clusters as well as MQTT brokers as per our need(horizontal scaling). We can also use the partition system in Kafka to process messages faster. InfluxDB sharding will use for data replication.

    Microservice Architecture (Kafka MQTT Paho client)

    There are be some other microservices like User management, Device management, Searching, Logging.

    ODB

    User management microservice defines users and their role in the current system there will be a hierarchy of users and roles which are listed below.

    One of the important aspects of microservice is security, to make secure interservice communication and API calls. We are using an OAuth server with UAA, Every request to any endpoint of the architecture is performed via any “client”.for internal service communication Secure inter-service-communication using Feign clients.

    Eureka, Ribbon, Hystrix, Feign

    When one service wants to request data from another,

    • Eureka: this is where services (un-)register, so you can ask “odb-service” and get a set of IPs of instances of the odb-service, registered in Eureka (service discovery)
    • Ribbon: when someone asked for “odb-service” and already retrieved a set of IPs, Ribbon does the load balancing over these IPs.
      • So to sum up, when we got a URL like “http://uaa/oauth/token/” with 2 instances of UAA server running on 10.10.10.1:9999 and 10.10.10.2:9999, we may use Eureka and Ribbon to quickly transform that URL either to “http://10.10.10.1:9999/oauth/token” or “http://10.10.10.2:9999/oauth/token” using a Round Robin algorithm (default algo).
    • Hystrix: a circuit breaker system solving fall-back scenarios on service fails.
    • Feign: using all that in a declarative style to communicate with other services.

    Device management

    • All IoT devices of a user will manage here.
    • It contains a number of devices owned by users.
    • Inventory management of devices.
    • Save trip history and alert history.
    • Device active/deactivate managements.

    Kafka/MQTT paho client

    • Receive messages on various topics and forward those to Kafka broker
    • Kafka stream receives and Transform/Manipulate those data and forward to the Kafka broker.
    • Kafka consumer will receive those messages and store into InfluxDB.

    Notification Management

    • It will process stored data in InfluxDB and generate notifications and notify users about it.
    • It will responsible for sending an alert notification to the user.

    Log Management

    • Manage logs which will be generated from other microservices.
    • Elasticsearch will be used to store logs.
    • Logstash will collect, parse, and store logs for future use.
    • Kibana will be used for virtualization.

    Utils

    • Will hold all static data
      • For e.g County, State, City list
    • Other common utilities are going to be used in other microservices.
    • It will expose the REST interface to get consumed by microservices.

    Use Case: 2

    Client requirement

    • The idea was to collect cryptocurrency prices every n second from various exchanges.
    • To collect data from 45 exchanges we need a highly available, Atomic system. Which is capable to communicate with those exchanges via REST APIs or Socket.
    • Also capable to store those data in time series manners.
    • Here Kafka comes to the rescue. To process those data and store them. As cryptocurrency prices fluctuate in meantime. This whole process should take minimal time possibly it takes.
    • Real-time monitoring and analysis of cryptocurrency are required to get the latest low price from an exchange and High price from an exchange from 45 exchanges.
    • Prediction of future prices of currency using past prices required massive data to give accurate results using AI.
    • All these things required software and tools like Kafka, Kafka Stream, InfluxDB, Spring Boot Application, Docker.

    Our Solution

    • We proposed a system that will use Kafka, Kafka Stream to achieve minimum latency time. And able to perform read and write operations at incredible speed.
    • Influx is the key, a time series database with the retention policy, we can set retention policy with a number of days so older data will wipe automatically
    • We use spring batch jobs to take everyday backup of the current database, by currency pair, and per exchange.
    • The current requirement is to sefisy massive number of requests sending from per exchange.
    • We make sure the application is highly available and maintainable.
    • User wallet management system will maintain user's wallet and spring boot microservice which manages user, role and his balance.
    • We use kong as an api gateway which has some prebuild plugs for security and authorization.
    High-level architecture cryptocurrency price Microservice :
    • There are be some other microservices like User management, public-private data gatherer, batch job, utils management, influx dB, Searching, Logging.
    Notification Management
    • It will process stored data in InfluxDB and generate notifications and notify users about it.
    • It will responsible for sending a notification to the user.
    Log Management
    • Manage logs which will be generated from other microservices.
    • Elasticsearch will be used to store logs.
    • Logstash will collect, parse, and store logs for future use.
    • Kibana will be used for virtualization.
    Utils
    • Will hold all static data
      • For e.g County, State, City list
    • Other common utilities are going to be used in other microservices.
    • It will expose the REST interface to get consumed by microservices.