This is the 5th post in a series on System Design.
The faster you can extract insights and knowledge from your data, the more quickly your systems can respond to the changing state of the world they observe.
An event indicates that something interesting has happened in the context of the application. It might be an external event that the system captures, or an internal event generated by a state change. The event source emits the event without knowing, or caring, how other components will process it.
Typically, events are published to a messaging system. Interested parties register to receive events and process them accordingly.
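This decoupling can be sketched as a minimal in-memory publish/subscribe broker. The `EventBus` class, topic name, and event payload below are all hypothetical, purely for illustration; real systems would use a broker such as RabbitMQ or Kafka.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory publish/subscribe broker (illustrative only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The source emits the event without knowing how it will be processed.
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
received = []
bus.subscribe("orders", received.append)     # an interested party registers
bus.publish("orders", {"order_id": 1, "status": "created"})
```

Note that the publisher never references its subscribers directly; they only share a topic name.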
Event-based architectures can be implemented using messaging systems like RabbitMQ. An event is removed from the broker once every subscriber has consumed it. This frees up resources for the broker, but it also means no explicit record of the event survives.
I have written a detailed post on message-oriented middleware using RabbitMQ that you might find interesting.
Logging immutable events in a simple data structure has some useful characteristics. Event logs differ from the FIFO queues managed by most message brokers in that they are append-only data structures.
Records are appended to the log, each with a unique entry number. These sequence numbers explicitly capture the order of events in the system.
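As a minimal sketch of this idea, consider an in-memory append-only log where the offset of each record doubles as its sequence number. The `EventLog` class is hypothetical and illustrative only.

```python
class EventLog:
    """Append-only log: each entry gets a monotonically increasing offset."""
    def __init__(self):
        self.entries = []

    def append(self, event):
        offset = len(self.entries)   # the unique entry/sequence number
        self.entries.append(event)
        return offset

log = EventLog()
first = log.append("user_signed_up")
second = log.append("user_upgraded")
# Offsets capture the order of events explicitly: first < second.
```

Entries are never updated or deleted in place, which is what makes the structure append-only.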
Event logs have some key advantages due to their persistent nature:
- New event consumers can be introduced at any time. A new consumer has access to the complete history of events stored in the log and can process both existing and new events.
- Event-processing logic can be modified to add new features or fix bugs. You can then execute the new logic on the entire log to enrich results or correct errors.
- In the event of a server or disk failure, you can restore the last known state and replay events from the log to rebuild the data set. This is analogous to the role of the transaction log in database systems.
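The replay idea above can be sketched as folding a pure function over the full log to rebuild state from scratch. The event tuples and `apply` function here are hypothetical, purely for illustration.

```python
from functools import reduce

def apply(state, event):
    """Pure function: fold one event into the current state (a balance)."""
    kind, amount = event
    if kind == "deposit":
        return state + amount
    if kind == "withdraw":
        return state - amount
    return state

# The durable event log: 100 in, 30 out, 50 in.
log = [("deposit", 100), ("withdraw", 30), ("deposit", 50)]

# Replaying the entire log from an empty initial state rebuilds the data set.
balance = reduce(apply, log, 0)
```

Because `apply` is deterministic, replaying the same log always reproduces the same state.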
At its core, Kafka is a distributed, persistent log store. Kafka uses a dumb broker/smart client architecture. Its brokers are primarily responsible for appending new events to persistent logs, delivering events to consumers, and managing log partitioning and replication for scalability and availability. Because log entries are stored durably, multiple consumers can read them, and read them multiple times, simply by specifying the log offset, or index, of the entries they wish to read. The broker therefore does not have to maintain any complex per-consumer state. This architecture has proven to be extremely scalable and provides very high throughput.
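The dumb broker/smart client split can be illustrated with an in-memory sketch in which each consumer tracks its own offset into a shared log, so the "broker" (here just a list) holds no per-consumer state. The `Consumer` class and its `poll` method are hypothetical; they only mimic the shape of real consumer APIs.

```python
# The durable, shared log held by the broker (just a list in this sketch).
log = ["e0", "e1", "e2", "e3"]

class Consumer:
    """Each consumer tracks its own read position; the broker stays stateless."""
    def __init__(self):
        self.offset = 0

    def poll(self, log, max_records=2):
        batch = log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

a, b = Consumer(), Consumer()
first_batch = a.poll(log)                 # a reads the first two entries
full_read = b.poll(log, max_records=4)    # b independently reads all four
```

Both consumers read the same entries without coordinating, because position is client-side state.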
In traditional data processing applications, you persist data from external feeds into a database and devise queries to extract information. This becomes increasingly difficult as the amount of information your systems process grows. You need fast, scalable write performance from your database, along with low-latency reads, aggregations, and joins. Only after the database writes and reads complete can you perform useful analysis.
An ever-growing number of high-volume data sources has led to a new class of technologies called stream processing systems. These let you process data streams in memory, without having to persist the data first. This is often referred to as real-time analytics. Scalable systems increasingly use stream processing platforms.
Financial services firms use stream processing for real-time fraud detection, retailers use it to gain complete views of customer behavior, and cloud-native companies use it to detect outages before they affect customers.
Traditionally, batch processing has played a major role in processing newly available data in software systems. In a batch processing system, data representing new and updated objects accumulates in files. A batch data load job periodically processes this newly available data and inserts it into the application’s databases. This is known as ETL: extract, transform, load. The ETL process takes batch files containing new data, aggregates it, and transforms it into a format that can be inserted into your storage system.
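A toy version of the three ETL stages might look like the following. The file format (JSON lines), field names, and the dict standing in for the target table are all assumptions made for illustration.

```python
import json

def extract(lines):
    """Extract: parse raw batch records (here, JSON lines)."""
    return [json.loads(line) for line in lines]

def transform(records):
    """Transform: aggregate total units sold per product."""
    totals = {}
    for r in records:
        totals[r["product"]] = totals.get(r["product"], 0) + r["units"]
    return totals

def load(totals, table):
    """Load: insert the aggregated rows into the target store (a dict here)."""
    table.update(totals)

batch = ['{"product": "a", "units": 3}', '{"product": "a", "units": 2}']
table = {}
load(transform(extract(batch)), table)
```

A real pipeline would read the batch from files or object storage and load into a database, but the stage boundaries are the same.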
As soon as a batch has been processed, the data is available to your analytics and external users, and you can run queries against the newly inserted data to produce useful insights.
Large-scale systems depend on batch processing because it is reliable, effective, and efficient. The downside is the lag between when new data arrives and when it becomes available for analysis and querying.
Data streams, or event streams, represent unbounded datasets. Unbounded means infinite and ever-growing: the dataset is unbounded because new records are continually added to it over time.
Data streams also have the following attributes:
- Data streams are ordered
- Data records are immutable
- Data streams are replayable
Stream processing is the continuous processing of one or more event streams: reading data from an unbounded dataset, processing it, and emitting output.
Stream processing is a very simple activity if you only need to process each event individually. It becomes really interesting when you have operations that involve multiple events, such as aggregations, moving averages, or joining two streams to create an enriched stream of information. In those cases you need to keep track of more information than each event alone carries. We call the information stored between events the state.
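A moving average is a compact example of per-event processing that depends on state. In this sketch, the state is simply the last few values held in a bounded deque; the `MovingAverage` class and window size are hypothetical.

```python
from collections import deque

class MovingAverage:
    """State kept between events: the last `size` values seen on the stream."""
    def __init__(self, size):
        self.window = deque(maxlen=size)   # bounded: old values fall out

    def process(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

ma = MovingAverage(size=3)
outputs = [ma.process(v) for v in [10, 20, 30, 40]]
# Each output depends on prior events, i.e. on state, not just the current one.
```

Processing the same value twice can yield different results, which is exactly what distinguishes stateful from stateless processing.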
There are several types of states in stream processing:
- Local or internal state — This state is only accessible by a specific instance of the stream-processing application.
- External state — This type of state is stored in an external data store, usually a NoSQL system like Redis or Cassandra. Its advantages include virtually unlimited size and the fact that it can be accessed from multiple instances of the application.
Stream Processing Platforms
Stream processing platforms ingest data from a variety of sources. In most cases, these are queues, such as Kafka topics, or files in a distributed storage system, such as AWS S3. Processing nodes ingest data objects from the data sources, perform transformations and aggregations, and conceptually pass streams of data objects on to other processing nodes.
In stream processing systems, each processing node transforms its input streams into new streams that are consumed by downstream nodes.
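A chain of processing nodes can be sketched with Python generators, where each node consumes an upstream stream and yields a new one. The node names (`source`, `parse`, `only_even`) are made up for this example.

```python
def source(events):
    """The data source: emits raw events."""
    for e in events:
        yield e

def parse(upstream):
    """Node 1: transform raw strings into a new stream of integers."""
    for e in upstream:
        yield int(e)

def only_even(upstream):
    """Node 2: filter the parsed stream, producing yet another stream."""
    for e in upstream:
        if e % 2 == 0:
            yield e

# Wire the nodes together; nothing runs until the pipeline is consumed.
pipeline = only_even(parse(source(["1", "2", "3", "4"])))
result = list(pipeline)
```

Because generators are lazy, events flow through the whole pipeline one at a time rather than in bulk, which mirrors how streaming topologies behave.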
There are two general types of stream processing applications:
- Stateless applications merely process and transform individual events in the stream, without requiring any context or state information about other events.
- Stateful streaming applications must persist state across the processing of individual data objects, for example to aggregate values over multiple events.
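The distinction above can be made concrete with two tiny functions: one transforms each event independently, the other carries a running count across events. Both functions and the sample events are hypothetical.

```python
def stateless(events):
    """Each event is handled on its own; no context carries over."""
    return [e.upper() for e in events]

def stateful(events):
    """A running count per event value must persist across events."""
    counts = {}
    out = []
    for e in events:
        counts[e] = counts.get(e, 0) + 1   # state shared between events
        out.append((e, counts[e]))
    return out

upper = stateless(["a", "b"])
counted = stateful(["a", "b", "a"])
```

The stateless version could be scaled by simply running more copies; the stateful version needs its `counts` state partitioned or shared to scale out.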
In the data processing space, Apache Flink, Apache Storm, Amazon Kinesis, Kafka Streams, Spark Streaming, and Spring Cloud Data Flow are some of the top contenders.
Batch Processing vs. Stream Processing
Characteristics of Batch Processing
- Latency — Minutes to Hours
- Analytics — This is a complex process that incorporates both new batch data and existing data
Characteristics of Stream Processing
- Latency — Subsecond to seconds
- Analytics — A relatively simple method of detecting events, aggregating events, and calculating metrics over rolling time intervals for newly arriving data
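Aggregating over rolling time intervals is often done with fixed, or tumbling, windows: each event is assigned to the window its timestamp falls into. The function name, timestamp units (seconds), and sample events below are assumptions for illustration.

```python
def tumbling_window_counts(events, window_seconds):
    """Count events per fixed (tumbling) window, keyed by window start time."""
    counts = {}
    for timestamp, _payload in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts

# (timestamp in seconds, payload) pairs arriving on the stream.
events = [(0, "a"), (3, "b"), (5, "c"), (9, "d"), (11, "e")]
counts = tumbling_window_counts(events, window_seconds=5)
```

A real streaming engine computes these windows incrementally as events arrive, emitting a result when each window closes, rather than over a finished list.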
Book: Fundamentals of Stream Processing: Application Design, Systems, and Analytics
Book: Foundations of Scalable Systems