Build A Distributed Streaming Data Application With Apache Kafka

    • May 3, 2019
    • Apache Kafka is an open-source software platform, as described in the official Kafka documentation. It is used for building real-time data pipelines and streaming apps. It is distributed, horizontally scalable, fault-tolerant, a commit log, wicked fast, and runs in production in thousands of companies. Its aim is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
    • It works in a publish-subscribe pattern.
    • Kafka was originally developed at LinkedIn in 2011 and later handed over to the Apache Software Foundation; it is written in Scala and Java.
    Messaging Terminology
    Distributed
    • A distributed system is one which is split across multiple running machines, all of which work together in a cluster to appear as one single node to the end user. Kafka is distributed in the sense that it stores, receives, and sends messages on different nodes (called brokers). The benefits of this approach are high scalability and fault-tolerance.
    Horizontal scalability
    • Horizontal scalability is the ability to increase capacity by connecting multiple hardware or software entities so that they work as a single logical unit.
    • Horizontal scalability means solving the same problem by throwing more machines at it. Adding a new machine does not require downtime, nor is there any limit to the number of machines you can have in your cluster. The catch is that not all systems support horizontal scalability; some are not designed to work in a cluster, and those that are are usually more complex to work with.
    Fault-tolerant
    • Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components.
    • Distributed systems are designed in such a way as to accommodate failures in a configurable way. In a 6-node Kafka cluster, you can have it continue working even if 3 of the nodes are down. It is worth noting that fault-tolerance is in a direct tradeoff with performance: the more fault-tolerant your system is, the less performant it is.
    Commit Log
    • A commit log (also referred to as a write-ahead log, transaction log) is a persistent ordered data structure which only supports appends. You cannot modify or delete records from it. It is read from left to right and guarantees item ordering.
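    • As a toy illustration (not Kafka's actual on-disk implementation), the append-only, read-in-order behaviour of a commit log can be sketched in Java like this:

```java
import java.util.ArrayList;
import java.util.List;

// Toy commit log: records can only be appended, never modified or deleted,
// and they are read back from left to right (oldest to newest).
public class ToyCommitLog {
    private final List<String> records = new ArrayList<>();

    // The only write operation: append to the end, return the record's offset.
    public long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read everything from a given offset onwards, in the original order.
    public List<String> readFrom(long offset) {
        return records.subList((int) offset, records.size());
    }
}
```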
    Kafka has four core APIs:
    • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
    • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
    • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
    • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
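    • To make the Producer API concrete, here is a minimal sketch in Java; the broker address, topic name, key, and value are placeholder values for illustration only:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The key is used by the default partitioner, so records with the
            // same key always land on the same partition (and stay ordered).
            producer.send(new ProducerRecord<>("my-topic", "my-key", "hello kafka"));
        }
    }
}
```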
    How does it work?
    • Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
    • Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
      • If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
      • If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes (a minimal consumer sketch follows this list).
    • Kafka Broker A Kafka cluster is made up of multiple Kafka Brokers. Each Kafka Broker has a unique ID (number). Kafka Brokers contain topic log partitions. Connecting to one broker bootstraps a client to the entire Kafka cluster. For failover, you want to start with at least three to five brokers. A Kafka cluster can have 10, 100, or 1,000 brokers if needed.
    • Kafka Topic A Topic is a category/feed name to which messages are stored and published. Messages are byte arrays that can store any object in any format. As said before, all Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic.
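    • A minimal consumer sketch for the consumer-group behaviour described above; the group name and topic are placeholders. Every instance started with the same group.id shares the topic's partitions, while instances with different group.ids each receive every record:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Instances sharing this group.id load-balance the partitions between them.
        props.put("group.id", "my-consumer-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```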
    Kafka Streams
    • The continuous, concurrent, record-by-record processing of data in real time is what we call Kafka Streams processing.
    • Basically, Kafka real-time processing works on a continuous stream of data, and by analyzing that data we extract useful information from it. In Kafka, real-time processing typically involves reading data from a topic (source), doing some analysis or transformation work, and then writing the results back to another topic (sink). There are several options for doing this type of work, one of which is sketched below.
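    • A minimal Kafka Streams sketch of the source-to-sink pattern just described; the application id and topic names are placeholders, and the transformation is deliberately trivial (upper-casing each value):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SimpleStreamProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "simple-stream-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from the source topic, transform each record, write to the sink topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase())
              .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```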

    Use Case 1

    Client requirement
    • The client wanted real-time monitoring of vehicles: a tracking system with parental control and alerts on real-time critical incidents such as over-speeding, harsh braking, sudden acceleration, and traveling outside of a geofence, along with other vehicle information.
    • To meet this requirement we need real-time data from the vehicle's OBD device, which we can get with MQTT; to process that data we need real-time data pipelines, which we can achieve with Kafka, and the processing itself can be achieved with the Kafka Streams API.
    • We also need to monitor vehicle health in real time and provide the user with suggestions such as change oil, check oil, check tires, check coolant level, and next service time. All of these things can be achieved with data processing, and we can also predict the future health of the vehicle from its usage.
    Our solution
    • We propose a system which is highly scalable, real-time, and secure, built with the following tools. We use MQTT, Kafka, and InfluxDB to retrieve, manipulate (transform), and store data from the OBD device.
    • IoT devices communicate with our server using the MQTT protocol; the Mosquitto MQTT broker receives messages and sends them to the respective subscribers. We use a Spring Boot microservice which acts as an MQTT client and subscribes to all topics that the MQTT broker will produce in future.
    • The Spring Boot subscriber receives messages using the MQTT Paho client; MqttAsyncClient delivers the messages in JSON format.
    • After receiving messages from the IoT devices we transform the JSON data into Avro, which is the format recommended for Kafka communication, and using a Kafka producer we forward the generated data to the Kafka broker (in future, a cluster). A sketch of this MQTT-to-Kafka bridge follows this list.
    • Using Kafka Streams we manipulate the data as per our needs and forward it to the Kafka consumer. During manipulation we can produce a number of messages which will be required by other microservices.
    • To store all that data in a time-series manner we use InfluxDB. A Kafka consumer which is responsible for storing data in InfluxDB receives messages from the stream and writes them into InfluxDB (see the InfluxDB sink sketch after this list). Why we used InfluxDB: it is built for exactly this kind of time-series storage.
    • The scenario was to receive messages from IoT devices through MQTT and then forward those messages to Kafka. Here Kafka Streams comes into the picture: it manipulates the data as per our requirements and forwards it back to Kafka, and those messages are then received by a Kafka consumer and stored into InfluxDB.
    • We can scale the Kafka cluster as well as the MQTT brokers as per our needs (horizontal scaling). We can also use Kafka's partitioning to process messages faster. InfluxDB sharding will be used for data replication.
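    • The MQTT-to-Kafka bridge mentioned above could look roughly like this. It is a sketch only: the broker addresses, client id, topic filter, and target topic are assumptions, and the JSON-to-Avro conversion step is omitted so the raw payload bytes are forwarded as-is:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken;
import org.eclipse.paho.client.mqttv3.MqttAsyncClient;
import org.eclipse.paho.client.mqttv3.MqttCallback;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class MqttKafkaBridge {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

        MqttAsyncClient mqtt = new MqttAsyncClient("tcp://localhost:1883", "obd-bridge");
        mqtt.setCallback(new MqttCallback() {
            @Override public void connectionLost(Throwable cause) { /* reconnect logic would go here */ }
            @Override public void deliveryComplete(IMqttDeliveryToken token) { }
            @Override public void messageArrived(String topic, MqttMessage message) {
                // In the real pipeline the JSON payload is converted to Avro first;
                // here the raw bytes are forwarded to keep the sketch short.
                producer.send(new ProducerRecord<>("obd-raw", topic, message.getPayload()));
            }
        });
        mqtt.connect().waitForCompletion();
        mqtt.subscribe("obd/+", 1);
    }
}
```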
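    • And a sketch of the Kafka consumer that writes into InfluxDB, using the influxdb-java client; the database name, topic, and field layout are assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

public class InfluxSinkConsumer {
    public static void main(String[] args) {
        InfluxDB influx = InfluxDBFactory.connect("http://localhost:8086", "admin", "admin");
        influx.setDatabase("vehicle_telemetry");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "influx-sink");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("obd-processed"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Each Kafka record becomes one time-series point in InfluxDB.
                    influx.write(Point.measurement("vehicle_event")
                            .time(record.timestamp(), TimeUnit.MILLISECONDS)
                            .tag("source", record.topic())
                            .addField("payload", record.value())
                            .build());
                }
            }
        }
    }
}
```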
    Microservice Architecture (Kafka MQTT Paho client)
    • There are some other microservices such as user management, device management, searching, and logging.
    • The user management microservice defines users and their roles in the current system; there is a hierarchy of users and roles, which are listed below.
    • One of the important aspects of microservices is security: inter-service communication and API calls must be secured. We are using an OAuth server with UAA, and every request to any endpoint of the architecture is performed via a “client”. For internal service communication, we secure inter-service communication using Feign clients.
      • Eureka, Ribbon, Hystrix, Feign
        • When one service wants to request data from another, the following components come into play:
        • Eureka: this is where services (un-)register, so you can ask for “odb-service” and get a set of IPs of instances of the odb-service registered in Eureka (service discovery).
        • Ribbon: when someone asked for “odb-service” and already retrieved a set of IPs, Ribbon does the load balancing over these IPs.
          • So to sum up, when we got a URL like “http://uaa/oauth/token/” with 2 instances of UAA server running on 10.10.10.1:9999 and 10.10.10.2:9999, we may use Eureka and Ribbon to quickly transform that URL either to “http://10.10.10.1:9999/oauth/token” or “http://10.10.10.2:9999/oauth/token” using a Round Robin algorithm (default algo).
        • Hystrix: a circuit breaker system solving fall-back scenarios on service failures.
        • Feign: using all that in a declarative style to communicate with other services.
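        • A sketch of such a declarative Feign client; the service name (“device-service”) and endpoint are hypothetical, and the fallback requires Hystrix support to be enabled (feign.hystrix.enabled=true):

```java
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// "device-service" is the logical name registered in Eureka; Ribbon load-balances
// across its instances, so no host or port appears in the client code.
@FeignClient(name = "device-service", fallback = DeviceClientFallback.class)
public interface DeviceClient {

    @GetMapping("/api/devices/{id}/status")
    String getDeviceStatus(@PathVariable("id") Long id);
}

// Hystrix falls back to this bean when device-service is down or the circuit is open.
@Component
class DeviceClientFallback implements DeviceClient {
    @Override
    public String getDeviceStatus(Long id) {
        return "UNKNOWN";
    }
}
```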
    Device management
    • All IoT devices of a user are managed here.
    • It contains a number of devices owned by users.
    • Inventory management of devices.
    • Save trip history and alert history.
    • Device activation/deactivation management.
    Kafka/MQTT paho client
    • Receives messages on various topics and forwards them to the Kafka broker.
    • Kafka Streams receives, transforms/manipulates that data, and forwards it to the Kafka broker.
    • A Kafka consumer receives those messages and stores them into InfluxDB.
    Notification Management
    • It will process data stored in InfluxDB, generate notifications, and notify the user about them.
    • It will be responsible for sending alert notifications to the user.
    Log Management
    • Manages logs which are generated by the other microservices.
    • Elasticsearch will be used to store logs.
    • Logstash will collect, parse, and store logs for future use.
    • Kibana will be used for visualization.
    Utils
    • Will hold all static data.
      • E.g., country, state, and city lists.
    • Other common utilities which are going to be used in other microservices.
    • It will expose a REST interface to be consumed by the other microservices.

    Case Study 2

    Client requirement
    • The idea was to collect cryptocurrency prices every n seconds from various exchanges.
    • To collect data from 45 exchanges we need a highly available, atomic system which is capable of communicating with those exchanges via REST APIs or sockets.
    • It must also be capable of storing that data in a time-series manner.
    • Here Kafka comes to the rescue for processing and storing that data. As cryptocurrency prices fluctuate in the meantime, this whole process should take as little time as possible.
    • Real-time monitoring and analysis of cryptocurrencies is required to get the latest low price and high price across the 45 exchanges.
    • Predicting the future price of a currency using past prices requires massive amounts of data to give accurate results using AI.
    • All of these things require software and tools like Kafka, Kafka Streams, InfluxDB, Spring Boot applications, and Docker.
    Our Solution
    • We proposed a system that uses Kafka and Kafka Streams to achieve minimal latency, and that is able to perform read and write operations at high speed.
    • InfluxDB is the key: a time-series database with retention policies. We can set a retention policy with a number of days so that older data is wiped automatically (a minimal sketch follows).
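    • A sketch of setting such a retention policy with the influxdb-java client; the database name and the 30-day window are illustrative values only:

```java
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Query;

public class RetentionPolicySetup {
    public static void main(String[] args) {
        InfluxDB influx = InfluxDBFactory.connect("http://localhost:8086", "admin", "admin");
        influx.query(new Query("CREATE DATABASE crypto_prices", "crypto_prices"));
        // Points older than 30 days are dropped automatically by InfluxDB.
        influx.query(new Query(
                "CREATE RETENTION POLICY \"thirty_days\" ON \"crypto_prices\" "
                        + "DURATION 30d REPLICATION 1 DEFAULT",
                "crypto_prices"));
        influx.close();
    }
}
```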
    • We use a Spring Batch job to take a daily backup of the current database, by currency pair and per exchange.
    • The current requirement is to satisfy the massive number of requests sent per exchange.
    • We make sure the application is highly available and maintainable.
    • The user wallet management system maintains user wallets; it is a Spring Boot microservice which manages users, roles, and their balances.
    • We use Kong as an API gateway, which has some prebuilt plugins for security and authorization.
    High-level architecture of the cryptocurrency price microservices:
    • There are some other microservices such as user management, public/private data gatherer, batch job, utils management, InfluxDB, searching, and logging.
    Notification Management
    • It will process data stored in InfluxDB, generate notifications, and notify the user about them.
    • It will be responsible for sending notifications to the user.
    Log Management
    • Manages logs which are generated by the other microservices.
    • Elasticsearch will be used to store logs.
    • Logstash will collect, parse, and store logs for future use.
    • Kibana will be used for visualization.
    Utils
    • Will hold all static data.
      • E.g., country, state, and city lists.
    • Other common utilities which are going to be used in other microservices.
    • It will expose a REST interface to be consumed by the other microservices.