Streaming analytics is the ability to continuously process, manage, monitor, enrich and analyze live streaming data in real time (typically from sensors and other components of the Internet of Things [IoT]). As organizations across the globe deploy large-scale data processing engines like Apache Spark, leaders and managers are gaining access to real-time or near-real-time insights that can translate into competitive advantage and potential new business models.
Predictive maintenance is a great use case for streaming analytics. This blog post will go ‘under the hood’ of the ‘OpenText™ Magellan™ for IoT demo’, one of the most popular breakouts at this year’s Enterprise World, highlighting the combination of flexible, powerful OpenText analytics and AI technology with state-of-the-art open source tools that support real-time streaming analytics.
Apache Spark, with its Spark Streaming API, provides a robust mechanism for supporting streaming applications. Spark Streaming is a built-in library within Apache Spark for building fault-tolerant streaming applications, and it lets you write streaming jobs the same way you write batch jobs. Spark Streaming supports popular programming languages including Java, Scala and Python.
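To make the "same code for batch and streaming" idea concrete, here is a minimal sketch in plain Python. It is not Spark code: it only imitates the micro-batch approach, where a stream is cut into small batches and ordinary batch logic runs on each one. The sensor names, the readings and the helper functions are all made up for illustration.

```python
from collections import Counter
from itertools import islice


def count_events(records):
    """Batch-style logic: count readings per sensor in a collection of records."""
    return Counter(r["sensor_id"] for r in records)


def micro_batches(stream, batch_size):
    """Cut an iterator of records into small lists, one list per micro-batch."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sensor readings, used here as both the "batch" and the "stream".
readings = [
    {"sensor_id": "pump-1", "temp_c": 71.2},
    {"sensor_id": "pump-2", "temp_c": 64.8},
    {"sensor_id": "pump-1", "temp_c": 72.0},
    {"sensor_id": "pump-1", "temp_c": 73.1},
    {"sensor_id": "pump-2", "temp_c": 65.3},
]

# Batch job: run the logic once over the full data set.
batch_result = count_events(readings)

# Streaming job: run the very same logic on each micro-batch and merge the results.
stream_result = Counter()
for batch in micro_batches(readings, batch_size=2):
    stream_result.update(count_events(batch))

print(batch_result == stream_result)
```

The point of the sketch is that `count_events` never needs to know whether it is handed a full data set or a two-record slice of a live feed.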
Since Magellan is built on Apache Spark, the Spark Streaming API is pre-integrated and configured out of the box, making it readily available to developers for rapid development and deployment of streaming analytics solutions.
Streaming data can originate from sources such as smart meters, sensors and smart devices, web click streams, social media platforms like Twitter and Facebook, or financial B2B or B2C transactions. Some common use cases for real-time analytics include:
- Recommendation engines
- Asset monitoring and predictive maintenance
- Log processing and threat detection
- Fraud detection
- Tracking and optimization
The first task in building a streaming analytics application is collecting, managing and processing streaming data. One popular open source tool for this is Apache NiFi. NiFi is a powerful open source data flow and event processing platform with an easy-to-use UI and more than 200 built-in connectors, which makes designing data flows quick and easy. NiFi’s highly scalable and fault-tolerant architecture makes it well suited to processing all kinds of streaming data sources and routing the streams to open source tools and storage solutions like Kafka, HDFS, HBase, and more.
The diagram below shows a high-level functional deployment of NiFi in conjunction with Magellan.
Running Apache NiFi alongside OpenText Magellan’s data lake platform (built on open source Hadoop technology) to further process this data makes it easy to:
- Consume and route these various data streams in real time
- Transform, process and store enriched data leveraging Spark Streaming APIs
- Trigger automated workflows and alerts as configured
- Track granular-level dataflow and data lineage from beginning to end (data provenance)
- Develop streaming analytics applications with powerful visualizations and dashboards
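As a rough illustration of the routing and data provenance points above, the sketch below routes a record by its source type and keeps a lineage log of every step the record passes through. This is plain Python, not NiFi's API; in NiFi these flows are designed in the UI. The topic names, record fields and step labels are assumptions made up for the example.

```python
from datetime import datetime, timezone

provenance = []  # in-memory lineage log; a stand-in for a real provenance store


def log_event(record_id, step, detail):
    """Append one lineage event for a record."""
    provenance.append({
        "record_id": record_id,
        "step": step,
        "detail": detail,
        "at": datetime.now(timezone.utc).isoformat(),
    })


def route(record):
    """Pick a destination from the record's source type and log the decision."""
    destinations = {"sensor": "sensor-readings", "clickstream": "web-events"}
    destination = destinations.get(record["source"], "unrouted")
    log_event(record["id"], "ROUTE", destination)
    return destination


def lineage(record_id):
    """Return the ordered (step, detail) history for one record."""
    return [(e["step"], e["detail"]) for e in provenance if e["record_id"] == record_id]


# Hypothetical record arriving from a smart meter feed.
record = {"id": "r-001", "source": "sensor", "temp_c": 71.2}
log_event(record["id"], "RECEIVE", "smart-meter feed")
topic = route(record)

print(topic)
print(lineage("r-001"))
```

Because every step writes an event keyed by record ID, the full path of any single record can be reconstructed later, which is the essence of tracking lineage from beginning to end.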
How it all works
A high-level process workflow for a streaming analytics application in action looks something like this:
The whole process can be broken down into six steps:
- Data Acquisition – Data from various streams is collected through appropriate Apache NiFi connectors in an intuitive manner via NiFi’s data flow designer. This real-time data can then be blended with historical or other enterprise data sources to enrich the feed before further processing by downstream systems.
- Data Routing – Once the stream data has been collected and blended with other data as needed, the data is routed to a message broker (Apache Kafka) for further processing. Magellan is configured with a built-in Apache Kafka messaging platform for stream processing and storing in a distributed, replicated, fault-tolerant cluster.
- Stream Processing – The Spark Streaming API reads the streaming data payloads from the Kafka message broker and processes them. Business validations and rules are typically applied at this stage. Once processing is finished, the data is stored in the Magellan data lake (HDFS).
- Machine Learning – The Spark Streaming API applies the previously trained machine learning prediction model to the processed data.
- Prediction Results – The model’s predictions will be saved and persisted to ensure results are always available.
- Actionable Insights – With these real-time descriptive and predictive insights, organizations can make quick, data-driven business decisions to maximize gains and trigger automatic business process workflows for further action. Magellan BI & Reporting enables rapid deployment and distribution of reports, dashboards and self-serve capabilities throughout the organization.
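The six steps above can be sketched end to end in a few lines of plain Python. This is only a toy model of the flow, not Magellan, NiFi, Kafka or Spark code: a `queue.Queue` stands in for the Kafka topic, Python lists stand in for the data lake and the prediction store, and the "model" is a hard-coded temperature threshold. All names, fields and thresholds are assumptions made up for illustration.

```python
import json
import queue

broker = queue.Queue()  # stand-in for a Kafka topic
data_lake = []          # stand-in for HDFS storage
predictions = []        # stand-in for persisted prediction results

# Hypothetical enterprise data used to enrich the raw feed.
ASSET_INFO = {"pump-1": {"site": "plant-a"}, "pump-2": {"site": "plant-b"}}


def acquire(raw_readings):
    """Step 1: collect readings and blend in enterprise data."""
    for reading in raw_readings:
        yield {**reading, **ASSET_INFO.get(reading["sensor_id"], {})}


def route(records):
    """Step 2: publish each enriched record to the broker as JSON."""
    for record in records:
        broker.put(json.dumps(record))


def is_valid(record):
    """Business rule (assumed): temperatures must be physically plausible."""
    return -40.0 <= record["temp_c"] <= 150.0


def predict_failure(record):
    """Step 4: placeholder 'model' that flags hot assets; a real model is trained offline."""
    return record["temp_c"] > 90.0


def process_stream():
    """Steps 3 to 5: read from the broker, validate, store, predict, persist."""
    while not broker.empty():
        record = json.loads(broker.get())
        if not is_valid(record):
            continue  # drop records that fail validation
        data_lake.append(record)
        predictions.append({
            "sensor_id": record["sensor_id"],
            "at_risk": predict_failure(record),
        })


def alerts():
    """Step 6: turn persisted predictions into something actionable."""
    return [p["sensor_id"] for p in predictions if p["at_risk"]]


raw = [
    {"sensor_id": "pump-1", "temp_c": 72.5},
    {"sensor_id": "pump-2", "temp_c": 96.1},
    {"sensor_id": "pump-1", "temp_c": 999.0},  # fails validation
]
route(acquire(raw))
process_stream()
print(alerts())  # prints ['pump-2']
```

Each function maps to one stage of the list, so swapping a stand-in for the real component (for example, a Kafka consumer in place of the queue) would leave the other stages unchanged.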
As the streaming data is stored and persisted, it is good practice to periodically validate the underlying predictive model for accuracy and predictive power. Models are built on historical data to predict future outcomes, so as new data arrives, the model’s predictions may no longer hold because of changes in the operating environment, business landscape, or customer behavior.
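One simple way to do this kind of periodic check is to compare recent predictions against what actually happened and flag the model when accuracy drops. The sketch below is one possible approach in plain Python, not a Magellan feature; the class name, window size and accuracy threshold are all assumptions chosen for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Track prediction accuracy over the most recent outcomes."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # True where prediction matched reality
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Share of correct predictions in the window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag the model once the window is full and accuracy is below the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


# Hypothetical (predicted failure, actual failure) pairs for four assets.
monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [(True, True), (False, False), (True, False), (True, False)]:
    monitor.record(predicted, actual)

print(monitor.accuracy())          # prints 0.5
print(monitor.needs_retraining())  # prints True
```

Waiting until the window is full avoids raising a retraining flag on the basis of only one or two early mistakes.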
With all the necessary tools, components and technologies pre-integrated and configured out of the box, and boosted with Apache NiFi, Magellan simplifies deploying a streaming analytics application at each of the six steps. Magellan reduces total cost of ownership (TCO) and time to market (TTM) for the development and deployment of streaming analytics solutions.
Built on an open product stack, Magellan lets you take advantage of the flexibility, extensibility, and diversity of exciting new open source developments, bundling technologies for advanced analytics, machine learning, data modeling and preparation, and enterprise-grade BI into a single integrated infrastructure.
The platform combines open source machine learning with advanced analytics, enterprise-grade BI, and capabilities to acquire, merge, manage and analyze Big Data and Big Content stored in your Enterprise Information Management (EIM) systems. Magellan enables machine-assisted decision making, automation, and business optimization.
OpenText™ Professional Services can guide you and your team through this entire process and help define and implement a successful streaming analytics strategy for your organization.
This post is part of an ongoing series on machine learning. Learn about how to get started or how to leverage Apache Spark for programming algorithms.