Kinesis

Introduction

  • Kinesis is a managed AWS service to collect, process, and analyze real-time streaming data.

  • Data is automatically and synchronously replicated across 3 AZs.

  • Good for:

    • IoT

    • Realtime Big Data

    • Streaming processing

Feature

  • Kinesis services:

    • Kinesis Data Streams (low-latency streaming ingest at scale)

      • Consists of ordered "shards"; the stream's total capacity is the sum of its shards' capacities. Data can be replayed (re-read) within the retention period.

      • Must manage scaling (shard splitting / merging)

      • Stores streaming data from producers in shards until it is consumed (e.g. by EC2 instances).

      • Retention period: 24 hours (default) up to 7 days. Once inserted, data is immutable (cannot be deleted or changed).

      • Billing is per shard provisioned.

      • Supports batching or per-message calls.

      • Need to write your own code for producer / consumer.

      • Producer:

        • Options:

          • AWS SDK (simple producer)

          • Kinesis Producer Library (KPL): batch, compression, retries, with C++ / Java

          • Kinesis Agent

            • Monitors log files and sends them to Kinesis directly

            • Can write to Kinesis Data Streams and Kinesis Firehose

        • Limit:

          • 1 MB/s or 1000 messages/s write per shard; exceeding this raises ProvisionedThroughputExceededException.
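A quick way to reason about these write limits: the number of shards to provision is driven by whichever per-shard limit you hit first. A minimal sketch (the helper name is hypothetical, the limits are the ones stated above):

```python
import math

# Per-shard write limits for Kinesis Data Streams (from the notes above).
MAX_MB_PER_SHARD = 1       # 1 MB/s write per shard
MAX_MSGS_PER_SHARD = 1000  # 1000 messages/s write per shard

def shards_needed(write_mb_per_s: float, write_msgs_per_s: float) -> int:
    """Shards required so that neither per-shard write limit is exceeded."""
    by_throughput = math.ceil(write_mb_per_s / MAX_MB_PER_SHARD)
    by_messages = math.ceil(write_msgs_per_s / MAX_MSGS_PER_SHARD)
    return max(by_throughput, by_messages, 1)

# 5 MB/s of larger records is throughput-bound; 12,000 tiny messages/s
# is message-count-bound even though the byte rate is low.
print(shards_needed(5, 3000))    # 5
print(shards_needed(0.5, 12000)) # 12
```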

      • Consumer:

        • Options:

          • AWS SDK

          • Lambda (Event source mapping)

          • Kinesis Client Library (KCL): checkpointing, coordinated reads

        • Limits:

          • Classic consumers:

            • ~200 ms latency

            • 2 MB/s at read per shard across all consumers

            • 5 API calls per second per shard across all consumers

          • Enhanced fan-out consumers:

            • ~70 ms latency

            • 2 MB/s at read per shard, per enhanced consumer

            • No API calls needed (push model)
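With Lambda as a consumer (event source mapping), record data arrives base64-encoded inside the event. A minimal handler sketch, assuming JSON payloads; the handler name and the synthetic event below are illustrative, the event shape mirrors the documented Kinesis event structure:

```python
import base64
import json

def handler(event, context=None):
    """Decode Kinesis records delivered by a Lambda event source mapping."""
    payloads = []
    for record in event["Records"]:
        # Record data is base64-encoded in the Lambda event.
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
    return payloads

# Local smoke test with a synthetic event (no AWS calls involved).
fake_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(b'{"sensor": 1, "temp": 21.5}').decode()}}
    ]
}
print(handler(fake_event))  # [{'sensor': 1, 'temp': 21.5}]
```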

    • Kinesis Firehose (delivers only to a few specific destinations, near real time, serverless)

      • Fully managed buffered streaming service, no administration, automatic scaling, serverless

      • The buffer is flushed when its size or time threshold is reached, so delivery is near real time (minimum 60 seconds latency for non-full batches)

      • No data storage: once flushed, the data is gone (cannot be replayed)

      • Supports many data formats, conversions, transformations, and compression (via Lambda; templates available)

      • Pay for the amount of data going through Firehose

      • Doesn't have shards to keep data; only receives streaming data from producers:

        • SDK / KPL / Kinesis Agent

        • Kinesis Data Streams

        • CloudWatch logs & events

        • IoT rules actions

      • Can transform streams with Lambda

      • Then can only load into S3, Redshift, ElasticSearch & Splunk

      • Example of architecture:
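The size-or-time buffer flush described above can be sketched as a toy model. This is illustrative only, not Firehose's implementation; the class name and thresholds are made up (real Firehose buffering hints are configured per destination):

```python
import time

class BufferedDelivery:
    """Toy model of Firehose-style buffering: flush on size OR time threshold."""

    def __init__(self, max_bytes: int, max_seconds: float, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.clock = clock
        self.buffer = []
        self.size = 0
        self.started = None        # when the current buffer began filling
        self.flushed_batches = []  # stands in for delivery to S3 / Redshift / etc.

    def put(self, record: bytes):
        if self.started is None:
            self.started = self.clock()
        self.buffer.append(record)
        self.size += len(record)
        self._maybe_flush()

    def _maybe_flush(self):
        full = self.size >= self.max_bytes
        timed_out = self.clock() - self.started >= self.max_seconds
        if full or timed_out:
            self.flushed_batches.append(self.buffer)
            self.buffer, self.size, self.started = [], 0, None

fh = BufferedDelivery(max_bytes=10, max_seconds=60)
fh.put(b"abcd")     # 4 bytes: below threshold, stays buffered
fh.put(b"efghijk")  # total 11 bytes >= 10: flush is triggered
print(len(fh.flushed_batches))  # 1
```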

    • Kinesis Analytics (perform real-time analytics on streams using SQL)

      • Pay only for resources consumed (but not cheap)

      • Scales automatically, with real-time latency from milliseconds to seconds.

      • Use a SQL-type query language on Kinesis streams / Firehose, then send data to S3, Redshift, or an ElasticSearch cluster.

      • Can use Lambda for pre-processing

      • Use Case:

        • Streaming ETL

        • Continuous metric generation

        • Responsive analytics
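"Continuous metric generation" usually means aggregating the stream over tumbling time windows. In Kinesis Analytics this is expressed in SQL; the Python below only illustrates the idea with a hypothetical per-window event count:

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per tumbling window, keyed by window start time.

    `events` is an iterable of (timestamp_seconds, payload) pairs.
    """
    counts = Counter()
    for ts, _payload in events:
        # Each event falls into exactly one non-overlapping window.
        window_start = (int(ts) // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

stream = [(100, "a"), (105, "b"), (119, "c"), (121, "d")]
print(tumbling_window_counts(stream, 60))  # {60: 3, 120: 1}
```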

  • Scenario:

    • Streaming Architecture comparison of Kinesis and DynamoDB

  • Comparison of storages

Tips

  • Streaming Data != Data Streaming

    • Streaming Data is about data that is continuously generated by different sources.

    • Data Streaming is the process of transferring a stream of data from one place to another
