Ingesting IoT and Sensor Data at Scale – Hacker Noon

With the boom in the number of IoT devices over the past few years, relatively new use-cases for IoT and sensor data, such as smart factories and smart cities, have led to time-series data being produced at a large scale. In such use-cases, a huge number of sensors of different kinds send terabytes of data to be ingested so that real-time monitoring can improve efficiency and avert failures. As an example, a Boeing 787 generates half a terabyte of data during a single flight.

A factory producing bottle packaging

Let’s look at how we can deal with the challenges of time-series data and handle its high throughput at scale, while still remaining highly available.

Time-series Data

Time-series data is any data that is timestamped. If you plot this kind of data on a graph, time will be on one of its axes. The workload of time-series data differs from that of other sorts of data because time-series data is primarily inserted and rarely updated. Time-series databases gain efficiency by treating time as a first-class citizen.

A time-series data graph

Recording a huge amount of time-series data allows us to delve into all aspects of an operation such as:

  1. Analyzing the past: Having a history of how the state of various sensors changes over time helps us understand how performance was affected when the system was in a certain state. It also allows us to go back to a certain point in time and learn why a particular error occurred.
  2. Monitoring the present: Having sensors live-stream data right into your dashboards helps you monitor mission-critical systems, debug them effectively, and identify components which require repairs.
  3. Predicting the future: Rich historical data plugged into machine learning frameworks can generate actionable insights about the future, so that problems are identified and solved even before they appear (predictive maintenance).

Use-cases and Challenges of Time-series Data

On an abstract level, time-series use cases can be divided into two broad categories. The first is traditional IT and system monitoring. Time-series databases such as InfluxDB are great for ingesting time-series data for such use cases. The second is smart factories or smart cities, where industrial time-series data often requires a completely different scale, one that traditional time-series databases are not suited to handle.

IT Monitoring vs Smart Factory use case

Check out this whitepaper to learn more about the functional differences between CrateDB and specialized time-series databases like InfluxDB, as well as a performance benchmark comparing CrateDB and InfluxDB.

What is CrateDB?

The key characteristics we are looking for in such use cases are horizontal scaling and self-healing clusters. CrateDB is a new kind of distributed SQL database that is extremely adept at handling industrial time series data due to its ease of use and ability to work with many terabytes of time series data with thousands of sensor data structures.

CrateDB Admin UI

CrateDB operates in a shared-nothing architecture as a cluster of identically configured servers (nodes). The nodes coordinate seamlessly with each other, and the execution of write and query operations is automatically distributed across the nodes in the cluster.

Increasing or decreasing database capacity is a task of adding or removing nodes. Sharding, replication (for fault tolerance), and rebalancing of data as the cluster changes size are automated. Why don’t you take CrateDB for a spin?

CrateDB is also hosted on Microsoft Azure, which allows you to connect it to the various services that Azure has to offer.

Event Hubs + CrateDB?

Azure Event Hubs is a real-time data ingestion service that allows you to stream millions of events per second from any source to build dynamic data pipelines. It is built for large-scale messaging and for handling streams of data, such as industrial IoT data from smart factories or smart city infrastructure. Data streaming through Event Hubs can be passed to Azure Functions for further enrichment or transformation.

CrateDB on Microsoft Azure

Once that’s done, the processed data is captured into CrateDB for analysis. CrateDB can be connected to Event Hubs using the CrateDB Event Hubs Connector. This makes it even easier to integrate and analyze IoT data in real time in order to monitor, predict, or control the behaviour of smart systems. The Connector can scale to accept millions of telemetry data readings per second from Event Hubs or IoT Hub and insert them into CrateDB.
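To make the enrichment and insert steps a little more concrete, here is a minimal Python sketch. It is not the Connector's implementation. The table name `sensor_readings`, the added `timestamp_ms` field, the ISO 8601 timestamp format and the message fields are all assumptions made for illustration. The code only builds the JSON body that CrateDB's HTTP `/_sql` endpoint accepts for bulk operations (`stmt` plus `bulk_args`); it does not send anything.

```python
import json
from datetime import datetime

# Hypothetical table name, not taken from the article.
INSERT_STMT = "INSERT INTO sensor_readings (payload) VALUES (?)"


def enrich(message: dict) -> dict:
    """Example transformation: add an epoch-millisecond copy of the timestamp."""
    # Assumption: timestamps are ISO 8601 strings with a trailing "Z" for UTC.
    parsed = datetime.fromisoformat(message["timestamp"].replace("Z", "+00:00"))
    enriched = dict(message)  # copy so the input message is left untouched
    enriched["timestamp_ms"] = int(parsed.timestamp() * 1000)
    return enriched


def build_bulk_request(messages: list) -> str:
    """Build the JSON body for one bulk insert (one row per message)."""
    # Each inner list holds the parameters for a single row.
    bulk_args = [[enrich(message)] for message in messages]
    return json.dumps({"stmt": INSERT_STMT, "bulk_args": bulk_args})
```

Sending the returned body in a single POST to a node's `/_sql` endpoint would insert the whole batch at once. Batching like this is one common way to keep per-message overhead low at high ingest rates.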

An Experiment with Ingestion Performance

That does sound like a match made in heaven. We decided to check out how CrateDB performs in this workflow. The first thing that needed to be decided upon was the format of the data coming in. We selected data which looked close to data from sensors in a smart factory.

payload               OBJECT
payload['model']      STRING
payload['objectId']   STRING
payload['timestamp']  STRING
payload['value']      LONG
payload['variable']   STRING
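As a rough illustration, a single message with this structure could look like the following when serialized as JSON. Only the field names come from the structure listed above; every value is made up, and JSON is an assumed wire format.

```python
import json

# Field names follow the payload structure above; all values are invented.
sample_message = {
    "model": "press-01",
    "objectId": "sensor-0042",
    "timestamp": "2019-06-01T12:00:00Z",
    "value": 1234,
    "variable": "temperature",
}

# Compact separators avoid spending bytes on whitespace.
encoded = json.dumps(sample_message, separators=(",", ":")).encode("utf-8")
print(len(encoded), "bytes")
```

With short identifiers like these, the encoded message lands in the same order of magnitude as the payload size used in the experiment.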

The size of the payload was around 100 bytes per message. We then deployed two Event Hubs namespaces, each with twenty throughput units (TPUs). The throughput capacity of Event Hubs is controlled by throughput units. We connected Event Hubs to CrateDB by using the Event Hubs connector for CrateDB and ran thirty-two consumer pods (one consumer per partition). These two settings, the number of throughput units and the number of consumers, can be adjusted to increase throughput.

On the CrateDB Cloud side, we deployed seven nodes with the replication factor set to one and twenty-one shards (three shards per node).
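For readers who want to try a similar layout, the sketch below builds a CrateDB `CREATE TABLE` statement with an explicit shard count and replica setting. It is an assumption-laden illustration rather than the table definition used in the experiment: the table name `sensor_readings` is hypothetical, the columns mirror the payload structure using `TEXT` and `BIGINT`, and "replication factor set to one" is read here as `number_of_replicas = 1`.

```python
def build_create_table(table: str, nodes: int, shards_per_node: int, replicas: int) -> str:
    """Return a CrateDB CREATE TABLE statement for the sensor payload."""
    shards = nodes * shards_per_node  # e.g. 7 nodes * 3 shards per node = 21
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "    payload OBJECT(DYNAMIC) AS (\n"
        # Sub-column names are quoted to preserve case and avoid keyword clashes.
        '        "model" TEXT,\n'
        '        "objectId" TEXT,\n'
        '        "timestamp" TEXT,\n'
        '        "value" BIGINT,\n'
        '        "variable" TEXT\n'
        "    )\n"
        f") CLUSTERED INTO {shards} SHARDS\n"
        f"WITH (number_of_replicas = {replicas})"
    )


# Hypothetical table name; node and shard counts are taken from the text above.
ddl = build_create_table("sensor_readings", nodes=7, shards_per_node=3, replicas=1)
print(ddl)
```

The resulting statement can be run from any CrateDB client or the Admin UI console. Keeping the shard count a multiple of the node count lets the shards spread evenly across the cluster.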

Grafana dashboard showing the ingestion metrics

We were able to ingest more than 110,000 messages per second, which is close to 10 billion messages per day. By the way, ingesting at this speed would produce about 1.5 TB of data per day.
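The daily message count follows directly from the per-second rate, as this short calculation shows. The raw-payload volume at the end is a back-of-the-envelope estimate based only on the 100-byte message size. It comes out lower than the 1.5 TB quoted above, and the article does not break that figure down, so the difference is left unexplained here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

messages_per_second = 110_000  # ingest rate reported in the experiment
payload_bytes = 100            # approximate message size from the experiment

messages_per_day = messages_per_second * SECONDS_PER_DAY
raw_tb_per_day = messages_per_day * payload_bytes / 1e12  # decimal terabytes

print(f"{messages_per_day:,} messages per day")           # 9,504,000,000
print(f"{raw_tb_per_day:.2f} TB of raw payload per day")  # 0.95
```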

Put Machine Data to Work!

This experiment shows that you can ingest huge amounts of data from your IoT devices or sensors into CrateDB with the help of Event Hubs. Such rich time-series data can give us actionable insights for debugging issues and averting faults that might occur in the future.

CrateDB Cloud is hosted on Microsoft Azure, and you can connect it to the many services made available by Azure, such as Machine Learning Studio for generating insights or Grafana for visualising the data. We are in the booming age of data, so why not put it to work to make our lives easier, safer and better!

Give me a shout-out on Twitter! I’d be excited to learn how you are currently scaling your machine data infrastructure and how you plan to use CrateDB.
