Big Data, mind the gap? Part III: Data science with Hadoop (Spark and Zeppelin)
With the setup of the data ingestion flow for our IoT sensor data and our investigation of different Hadoop data processing frameworks such as Hive and HBase, we have highlighted one of Hadoop’s advantages: the ability to persist data immediately thanks to its schema-on-read architecture, and to process that data later depending on our use cases.
Obviously, we have various scenarios for our IoT data and therefore do not want to make premature assumptions about its schema. Doing so could lead us to choose the wrong tools or technologies, which would be costly to switch away from later on.
Tim briefly mentioned one use case last week: a real-time dashboard aggregating the sensor data for live machine monitoring. Based on this case, we could start assuming that we need near real-time data, quick aggregations and high performance when querying and displaying the data (a so-called Speed Layer).
Today I would like to focus on other types of use cases related to data science and machine learning. Predictive maintenance, anomaly detection and forecasting are common scenarios combining IoT and analytics. Imagine being able to predict our machines’ utilization in order to save energy, or being able to detect a pattern of sensor values that might lead to a failure of the machine and therefore require quick maintenance. What could be the root cause of a machine’s bottleneck, or how much do the sensor values deviate from their baseline? Can we predict the values of a specific sensor, such as the temperature or pressure, in order to find the best settings for the machine?
Well, this is where Hadoop and Machine Learning algorithms (such as anomaly detection, regression, time series or naïve Bayes) come into play.
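To make the baseline-deviation idea concrete before we dive into Spark, here is a minimal plain-Scala sketch. All names and the threshold are our own illustrative choices, not part of any framework: it flags sensor readings that lie far from their historical baseline using a z-score.

```scala
// Illustrative sketch: flag sensor readings that deviate strongly from
// their historical baseline (a simple z-score anomaly check).
object AnomalySketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  def stdDev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => math.pow(x - m, 2)).sum / xs.size)
  }

  // A reading counts as anomalous if it lies more than `threshold`
  // standard deviations away from the historical baseline.
  def anomalies(history: Seq[Double],
                readings: Seq[Double],
                threshold: Double = 3.0): Seq[Double] = {
    val m = mean(history)
    val s = stdDev(history)
    readings.filter(r => math.abs(r - m) / s > threshold)
  }
}
```

In practice the threshold would be tuned per sensor, and on the cluster this logic would of course run distributed over the data rather than on local collections.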
Spark is a distributed data processing engine designed for in-memory computing and interactive data mining. Several top-level modules are layered on top of Spark Core (the engine), such as MLlib, a built-in library for Machine Learning.
With Spark, data structures can be created on the fly (through Spark’s Resilient Distributed Datasets) and are optimized for scaling. Spark can be called from Scala, Python, Java and, more recently, R via its APIs. We will primarily choose Scala for our case, as it is a very interesting programming language and, moreover, was developed at EPFL in Lausanne.
Apache Spark core and modules
We can call the Spark shell or launch Spark applications directly from our HDP Hadoop cluster. Let’s start getting our hands dirty with the Spark shell and Scala. We’ve loaded our IoT sensor data into so-called Spark DataFrames, a central abstraction. The data can be accessed from HDFS, or via the Spark HBase and Hive connectors, which are very convenient and leverage these powerful frameworks.
IoT sensor data loaded in Spark DataFrames via the Spark shell and Scala
The data schema can be displayed in Spark, as well as the sensor data itself. Spark lets data scientists tune caching and other persistence parameters; since we will query the data repeatedly, it is a good idea to persist the DataFrame in memory.
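As a sketch of what this step looks like in the Spark shell (not runnable outside the cluster; the HDFS path, source format and column handling are illustrative assumptions, not our actual dataset):

```scala
// spark-shell session sketch; path and options are assumptions
val sensorDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")       // schema-on-read: let Spark infer types
  .load("hdfs:///data/iot/sensors/")   // illustrative HDFS location

sensorDF.printSchema()                 // display the recognized data schema
sensorDF.show(5)                       // display the first sensor readings
sensorDF.cache()                       // persist the DataFrame in memory
```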
Regarding performance and computation, often a concern in Machine Learning: Spark runs best on YARN, the Hadoop core for computation (as HDFS is the Hadoop core for data storage), and Spark jobs can also leverage GPUs (Graphics Processing Units). We are not using GPUs here, as we are running Spark on the HDP cluster nodes mentioned in our first blog post.
The Spark shell isn’t the sexiest interface but it’s a good way to get started with native Spark. For the rest of our case, we’ll use the Zeppelin Notebook to continue using Spark and demonstrate the use of Machine Learning for our IoT case.
IoT data is now loaded in Spark and can be used for further Machine Learning or other Analytics tasks
Apache Spark integrates with Zeppelin, a web-based notebook server that supports interactive data analytics and Data Science. Concretely, data engineers and data scientists can develop, organize, execute and share data code, as well as visualize results, without needing access to a coding environment or to the cluster itself. Zeppelin is like a Data Science studio, open to various programming languages through its interpreters and APIs.
A good starting point is that the Spark Context and Spark SQL Context are automatically initialized in Zeppelin, so we can directly create our first Spark DataFrame, as we did in the Spark shell. The sensor data schema is recognized, and we can find our different pressure, temperature and status values.
Loading HDFS or HIVE & HBase data in Spark via the Zeppelin Notebook
The data can be quickly displayed and analyzed, this time with a nicer graphical interface.
Displaying and querying machine sensor temperatures in the Zeppelin Notebook
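A Zeppelin note for this step could consist of two paragraphs along these lines (a sketch only: table and column names are illustrative, and %spark and %sql are the default interpreter bindings):

```scala
%spark
// first paragraph: expose the DataFrame created earlier to the SQL interpreter
sensorDF.registerTempTable("sensors")

%sql
-- second paragraph: query and chart average temperature per machine
SELECT machine_id, avg(temperature) AS avg_temp
FROM sensors
GROUP BY machine_id
```

The result of the %sql paragraph can then be switched between table and chart views directly in the notebook.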
MLlib provides a library of common Machine Learning algorithms (classification, regression, clustering, collaborative filtering, dimensionality reduction, etc.) and utilities which scale well and perform in a distributed environment, using Spark core.
So let’s use some simple Machine Learning algorithms from MLlib on our IoT data in order to predict the future temperature of our machines, which might indicate impending failures. The data needs a little preparation and pre-processing before we can train our predictive model on the historical IoT data. Once our model is available (and has, of course, been analyzed and validated), we can apply it to new or incoming streaming IoT data and thereby predict whether our machines require maintenance before they fail.
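In MLlib this step would typically use a regression model trained on labeled historical data. To illustrate the underlying idea without a cluster, here is a minimal plain-Scala least-squares sketch; all names and data are our own invention, and the real pipeline would use MLlib’s distributed regression instead.

```scala
// Plain-Scala sketch of the regression idea behind the MLlib step:
// fit y = intercept + slope * x by ordinary least squares, where x is
// the current temperature and y the temperature one step ahead.
object TempModelSketch {
  // Returns (intercept, slope) of the least-squares line through the points.
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n   = points.size.toDouble
    val sx  = points.map(_._1).sum
    val sy  = points.map(_._2).sum
    val sxy = points.map { case (x, y) => x * y }.sum
    val sxx = points.map { case (x, _) => x * x }.sum
    val slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    val intercept = (sy - slope * sx) / n
    (intercept, slope)
  }

  // Predict the next temperature from the current one.
  def predict(model: (Double, Double), x: Double): Double =
    model._1 + model._2 * x
}
```

For points lying exactly on y = 2x + 1, fit recovers intercept 1 and slope 2, and predict then extrapolates the next temperature; validating such a model against held-out historical data is the step we referred to above.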
Building our Machine Learning model in Zeppelin with Spark MLlib
Applying the model on our machine streaming data and identifying failure risk
Advanced Machine Learning and Data Science leveraging Apache Spark
We can go beyond Apache Spark’s MLlib by using different Machine Learning tools while still leveraging Apache Spark running on Hadoop clusters. For instance, many of our customers have integrated a large amount of their transactional and master data into the SAP environment, and SAP Predictive Analytics (I presented the SAP PA solution with a Swisscom use case in a previous blog post) can leverage Apache Spark. Imagine integrating the master data of our industrial machines and some transactional data from the SAP systems with the IoT data persisted in HDFS. This would be a good basis for the further development of our Machine Learning cases.
You can also use your favorite tools and libraries on top of Apache Spark: scikit-learn, pandas or NumPy for Python fans, SparkR for R fans, Deeplearning4j, and more.
To close this post, I would like to emphasize that data is naturally key for Machine Learning and Predictive Analytics. Apache Hadoop is a great solution to tackle the problem of finding a common, reliable, scalable and cost-efficient architecture component to serve as a data lake. As I hope you clearly understood, Hadoop doesn’t only act as the data lake itself but also offers a wide variety of frameworks covering different needs such as specific data ingestion or data processing, NoSQL Databases, Data Analytics, Real-time Dashboarding, Predictive Analytics and Machine Learning. And if you want to use your favorite tools on top of Hadoop, it shouldn’t be an issue.
I hope these Hadoop posts have given you a better overview of the possibilities of big data and analytics in your industries and organizations. If you are interested in investigating these topics, our Swisscom Analytics Lab can support you, so don’t hesitate to contact us. And let’s make Analytics great again!
Olivier Gwynn, Hadoop – Data Science – Predictive Analytics
Thomas Jeschka, Hadoop – Data Architect – Data Warehouse
Tim Giger, Hadoop – Data Warehouse – Cloud & DevOps