Big Data, mind the gap? Part I: The Hadoop ecosystem and Data Ingestion
More and more organizations are talking about big data, but many still rely primarily on traditional relational database management systems to store structured data. Advanced analytics and machine learning also sit at the peak of Gartner’s hype cycle, yet many of the companies we talk to still mainly use conventional business intelligence and corporate reporting. A few have started investigating predictive analytics on their internal structured data, but we are still far from realizing big data’s potential.
Mind the gap?
Innovation always needs to start with a use case, and that is how our Swisscom Analytics Lab began. Imagine an industrial company connecting sensors to its machines to better understand their usage, improve maintenance, and predict breakdowns and failures. Connecting 20 machines and monitoring their data could probably be done with a classical IT architecture, but what if the company wants to industrialize the technology and connect a few thousand machines sending sensor data in real time? We quickly reach the limits of traditional data storage and processing methods.
This is where the Hadoop ecosystem comes into play. Hadoop is an open-source software framework from Apache that enables reliable, scalable, distributed computing. Originally, Hadoop referred to the Apache Hadoop project, composed of three main components: MapReduce (the execution framework), HDFS (distributed storage) and YARN (the resource manager). Hadoop’s strength is its ability to run on commodity hardware bundled into clusters of many nodes and to analyze data massively in parallel. But Hadoop is much more than these three core components: a variety of other frameworks in the Hadoop ecosystem cover SQL and non-SQL databases, in-memory processing, low-latency querying, data ingestion, GUIs, machine learning libraries, as well as security and governance.
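To make the MapReduce idea concrete, here is a minimal, single-process sketch of the programming model in Python. This is an illustration only, not Hadoop’s actual Java API: on a real cluster, the map, shuffle and reduce phases run distributed across nodes.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values belonging to the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values
    return {key: sum(values) for key, values in groups.items()}

lines = ["sensor data from machines", "sensor data arrives fast"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["sensor"])  # 2
```

The same three-phase structure applies whether you count words on a laptop or aggregate sensor readings across a thousand nodes; the framework’s job is to distribute exactly this pattern.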
Enough theory for now, let’s have a deep-dive on Hadoop with some IoT data. We have installed one cluster of eight nodes and a second one of two nodes (16GB RAM, 8 CPU, 250GB Disk space per node) in our Swisscom Cloud using the Hortonworks distribution of Hadoop. The first cluster is used for the data ingestion and the second for the data processing.
Drilling down into one of the NiFi dataflows gives a view of the processors used. We first handle the incoming HTTP request sent by the IoT gateway (SSL as well as HTTP methods such as GET, POST, PUT and DELETE can be configured accordingly), then read the machine ID attribute so that we can later route the data based on it. The dataflow contains two forks because we conducted load tests to assess the performance of our cluster at 12’000 files per minute. We then write the data directly to HDFS, the Hadoop distributed storage system. In short, we store the raw JSON data on our second cluster, the processing cluster (based on Hortonworks HDP).
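The attribute-based routing step can be sketched in a few lines of Python. This is a simplified stand-in for what NiFi processors like EvaluateJsonPath and RouteOnAttribute do; the field name `machine_id` and the destination paths are assumptions for illustration, not the actual schema our gateways send.

```python
import json

def route_by_machine(payload: str, routes: dict) -> str:
    """Pick a destination based on the machine ID found in a JSON payload.

    'machine_id' is a hypothetical field name; in NiFi, EvaluateJsonPath
    would extract whatever attribute your gateway actually sends.
    """
    record = json.loads(payload)
    machine_id = record.get("machine_id")
    # Unknown machines fall through to a default route, similar to the
    # 'unmatched' relationship of NiFi's RouteOnAttribute processor
    return routes.get(machine_id, "unmatched")

routes = {"press-01": "/data/raw/press", "lathe-02": "/data/raw/lathe"}
print(route_by_machine('{"machine_id": "press-01", "temp": 71.3}', routes))
```

In NiFi this logic is configured graphically rather than coded, but the underlying decision per flow file is the same: extract an attribute, match it against rules, send the data down the matching branch.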
Since the JSON files in our case are quite small, we need to handle thousands of files generated continuously (each machine sends a JSON file every 5 seconds). NiFi lets us merge many small files into bigger ones, or route the JSON files to a specific folder based on their creation date. This allows us to process the data more efficiently later on.
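Here is a hedged sketch of that merge-and-route idea, mimicking what NiFi’s MergeContent processor and date-based routing achieve. The batch size of 3 and the `/data/raw/YYYY/MM/DD` folder layout are illustrative assumptions, not our production settings.

```python
import json
from datetime import datetime, timezone

def merge_and_bucket(events, batch_size=3):
    """Merge small JSON events into larger batches and assign each batch
    a date-based folder, similar in spirit to NiFi's MergeContent plus
    date routing. Path layout and batch size are assumptions.
    """
    batches = []
    for i in range(0, len(events), batch_size):
        batch = events[i:i + batch_size]
        now = datetime.now(timezone.utc)
        folder = f"/data/raw/{now:%Y/%m/%d}"
        # One newline-delimited JSON blob per batch, instead of writing
        # many tiny files that would overload the HDFS NameNode
        blob = "\n".join(json.dumps(e) for e in batch)
        batches.append((folder, blob))
    return batches

events = [{"machine_id": f"m{i}", "temp": 70 + i} for i in range(7)]
for folder, blob in merge_and_bucket(events):
    print(folder, len(blob.splitlines()), "records")
```

Merging matters because HDFS is optimized for large files: every file costs NameNode memory, so millions of 5-second sensor snippets stored individually would strain the cluster long before the raw data volume did.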
NiFi enables us to quickly design the needed dataflow, get a live overview of the data flowing in, and observe the queues of actual data. Moreover, we can override several Hadoop settings such as block replication, block size and buffer size. We replicate our data blocks three times, which lets us keep working if one or two nodes of the cluster fail, since the data is replicated across the Hadoop cluster. NiFi handles many incoming data types (emails and attachments, image and media metadata, Twitter feeds, JMS messages, event logs, etc.) through various connection types (HTTP, SFTP, TCP, UDP, etc.) and can persist the data in many formats and stores (Avro, Parquet, ORC, Elasticsearch, Cassandra, Hive, MongoDB, SQL, etc.).
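On the HDFS side, the replication factor and block size are cluster-wide defaults set in `hdfs-site.xml` (they can also be overridden per file by the writing client). The snippet below shows the two relevant properties with illustrative values; check your distribution’s documentation for its actual defaults.

```xml
<!-- hdfs-site.xml: illustrative values, not our exact production config -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <!-- each block is stored on three different DataNodes -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <!-- block size in bytes: 128 MB -->
  </property>
</configuration>
```

With replication set to 3, the cluster tolerates the loss of up to two nodes holding copies of a given block, at the cost of storing every byte three times.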
Having a look at our Hadoop cluster, via the Ambari cockpit, we can see that the data is persisted in HDFS correctly.
Next week, we’ll explain how we can start processing this raw data, and explore some great possibilities with a few Apache frameworks. In case you are interested in investigating Big Data and Advanced Analytics, our Swisscom Analytics Lab can support you. We can help you find relevant and valuable use cases, give you access to an analytics cluster, help you assess its potential for your organization or focus on some specific topics (data lake, data ingestion, predictive analytics, data science, integration with SAP, etc.).
Tim Giger, Hadoop – Data Warehouse – Cloud & DevOps
Thomas Jeschka, Hadoop – Data Architect – Data Warehouse
Olivier Gwynn, Hadoop – Data Science – Predictive Analytics