Apache Spark is a widely used open source engine for large-scale data processing and machine learning. OpenText™ Magellan™ provides an open platform with Apache Spark already integrated so that it can easily run on Hadoop clusters. By building on Spark, Magellan offers the flexibility and extensibility of an open stack while ensuring that enterprises maintain full ownership of their data and algorithms.
Apache Spark is central to Magellan’s core purpose: deriving value from diverse sources of big data at scale and keeping up with an ever-changing business landscape.
Apache Spark provides four built-in libraries:
- Spark SQL & DataFrames for data exploration and analysis
- MLlib for developing and publishing machine learning pipelines
- GraphX for graphs and graph-parallel computation
- Spark Streaming for building applications that process streaming data
Magellan ships with these libraries pre-installed and configured out of the box, so developers and data scientists can readily apply predefined processing pipelines and statistical routines to their data and content. The platform allows you to combine these libraries seamlessly in the same application. The libraries are accessible from a web-based interface, called Magellan Notebook, which is used to create and publish MLlib pipelines and models for use by data and business analysts.
The platform provides different tools for different users, matched to their day-to-day roles, responsibilities, and expertise.
1. Magellan Notebook
For data scientists and advanced programmers, the platform provides Magellan Notebook, a web-based application for quickly creating, testing, and deploying machine learning pipelines. MLlib ships with implementations of many common machine learning and statistical algorithms, including classification, regression, clustering, and collaborative filtering, which simplifies building large-scale machine learning pipelines.
Magellan Notebook is based on Jupyter Notebook, but it is much more than a shell extension, with added features such as:
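To make the idea of a machine learning pipeline concrete, here is a minimal sketch in plain Python. It is not the MLlib API and not Magellan code: the class names (`StandardScaler`, `NearestCentroid`, `Pipeline`) and the customer data are hypothetical, chosen only to show the pattern of chaining a preprocessing stage with a classifier so that both can be fitted and reused as one unit.

```python
# Plain-Python illustration of the pipeline pattern.
# All class names and data here are hypothetical; this is not the MLlib API.
from statistics import mean, pstdev


class StandardScaler:
    """Stage 1: rescale each feature to zero mean and unit variance."""

    def fit(self, rows, labels=None):
        columns = list(zip(*rows))
        self.means = [mean(c) for c in columns]
        # Guard against constant columns, whose standard deviation is 0.
        self.stds = [pstdev(c) or 1.0 for c in columns]
        return self

    def transform(self, rows):
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


class NearestCentroid:
    """Stage 2: classify a row by the closest per-class mean."""

    def fit(self, rows, labels):
        grouped = {}
        for row, label in zip(rows, labels):
            grouped.setdefault(label, []).append(row)
        self.centroids = {
            label: [mean(c) for c in zip(*members)]
            for label, members in grouped.items()
        }
        return self

    def predict(self, rows):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [
            min(self.centroids, key=lambda lbl: sq_dist(row, self.centroids[lbl]))
            for row in rows
        ]


class Pipeline:
    """Chain transformer stages and finish with one estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, rows, labels):
        for stage in self.transformers:
            rows = stage.fit(rows, labels).transform(rows)
        self.estimator.fit(rows, labels)
        return self

    def predict(self, rows):
        for stage in self.transformers:
            rows = stage.transform(rows)
        return self.estimator.predict(rows)


# Made-up training data: [monthly_visits, monthly_spend] -> customer segment.
train_rows = [[2, 10.0], [3, 12.0], [20, 150.0], [25, 180.0]]
train_labels = ["casual", "casual", "loyal", "loyal"]

model = Pipeline([StandardScaler()], NearestCentroid()).fit(train_rows, train_labels)
predictions = model.predict([[4, 15.0], [22, 160.0]])
print(predictions)
```

Because the fitted pipeline carries its preprocessing with it, new rows can be passed to `predict` in their raw form. That is the property that makes a pipeline convenient to publish for others to use.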
- Programming support for different kernels: R, Python, and Scala
- Built-in support for additional Python libraries: numpy, scipy, sklearn, matplotlib, pandas, statsmodels, plotly, nose, beautifulsoup, seaborn, and yapf
- Support for adding further Python libraries
- Support for adding further R packages
- Publishing your pipeline models to the Magellan advanced analytics engine
Thus, the Notebook allows you to port your existing algorithms and models onto the Magellan platform for seamless integration and distribution.
2. Magellan advanced analytics engine
For data and business analysts, the platform provides an advanced analytics engine to access machine learning and other pipelines published from Magellan Notebook. The engine provides a rich GUI that enables users to easily perform various data exploration tasks, such as:
- Uploading and exploring their own data sets, selecting and applying a published model, and finally exporting analytics for further investigation
- Segmenting data with drag-and-drop selections
- Pulling and analyzing data directly from the Apache Spark repository
For big data architects, the engine provides a GUI to load, prepare, and enrich data before it is used to build custom models. The engine replaces the traditional console-based approach with a simple, interactive, web-based drag-and-drop interface for ingesting, cleansing, and enriching data from various sources and for scheduling data movement. Data engineering features include tools to create columns for further data analysis. For example, to better quantify data, you can create columns that summarize, rename, and define expressions using existing columns. You can also create ranges, groups, parameters, and ranks based on existing data values. REST and Java-based APIs are also available for integration with external systems.
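The column operations described above can be pictured with a small plain-Python sketch. It does not use the Magellan GUI or its APIs; the order records, the column names, and the `spend_range` thresholds are all made up for illustration. It shows a renamed column, an expression column, a range column, a rank, and a per-group summary.

```python
from collections import defaultdict

# Made-up order records standing in for an ingested data set.
orders = [
    {"cust": "A", "region": "East", "qty": 3, "unit_price": 20.0},
    {"cust": "B", "region": "West", "qty": 1, "unit_price": 250.0},
    {"cust": "C", "region": "East", "qty": 10, "unit_price": 8.0},
    {"cust": "D", "region": "West", "qty": 2, "unit_price": 45.0},
]


def spend_range(total):
    """Bucket an order total into a labeled range (thresholds are arbitrary)."""
    if total < 75:
        return "low"
    if total < 200:
        return "medium"
    return "high"


enriched = []
for row in orders:
    new_row = dict(row)
    new_row["customer_id"] = new_row.pop("cust")             # renamed column
    new_row["total"] = row["qty"] * row["unit_price"]        # expression column
    new_row["spend_range"] = spend_range(new_row["total"])   # range column
    enriched.append(new_row)

# Rank column: 1 is the largest order total.
ordered = sorted(enriched, key=lambda r: r["total"], reverse=True)
for position, row in enumerate(ordered, start=1):
    row["rank"] = position

# Summary column per group: total spend by region.
region_totals = defaultdict(float)
for row in enriched:
    region_totals[row["region"]] += row["total"]

for row in enriched:
    print(row["customer_id"], row["total"], row["spend_range"], row["rank"])
print(dict(region_totals))
```

On a real cluster the same steps would typically be expressed as DataFrame operations so that they run in parallel. The logic of each derived column stays the same.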
3. Magellan text mining and content analytics engine
For natural language processing (NLP) and text mining, the platform provides an engine that can crawl social media, websites, documents, emails, and file share repositories, and then extract, transform, and enrich this data using an on-demand or automated workflow. Combining this data with other structured, sometimes distributed, and seemingly unrelated data for intelligent analysis can uncover new insights and cost optimization opportunities. For example, marketing analysts may want to understand customer sentiment toward the company’s products and services in real time. The engine can connect to online platforms such as Twitter, LinkedIn, and Facebook, perform sentiment analysis on the feeds, and merge the results with structured data from CRM systems and other relational databases to yield valuable insight into public tone and opinion about the brand.
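As a rough illustration of the sentiment example, the sketch below scores a few posts and joins the results to CRM records. Everything in it is an assumption made for the example: the word lists are a toy stand-in for a real sentiment model, and the posts, handles, and CRM fields are invented. It does not call any Magellan or social media API.

```python
import re

# Toy sentiment lexicon; a real engine would use a trained model.
POSITIVE = {"love", "great", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "hate", "refund"}


def sentiment_score(text):
    """Count positive words minus negative words in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_label(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Invented social posts (unstructured data).
posts = [
    {"handle": "@ana", "text": "Love the new release, support was fast and helpful!"},
    {"handle": "@raj", "text": "App is slow and checkout is broken. I want a refund."},
    {"handle": "@ana", "text": "Great update overall."},
    {"handle": "@lee", "text": "Just installed it."},
]

# Invented CRM records (structured data), keyed by social handle.
crm = {
    "@ana": {"account": "Acme Corp", "tier": "gold"},
    "@raj": {"account": "Globex", "tier": "silver"},
}

# Aggregate sentiment per handle across all of that handle's posts.
scores = {}
for post in posts:
    handle = post["handle"]
    scores[handle] = scores.get(handle, 0) + sentiment_score(post["text"])

# Inner join: keep only handles that match a CRM record.
merged = [
    {"handle": h, **crm[h], "score": s, "sentiment": sentiment_label(s)}
    for h, s in scores.items()
    if h in crm
]

for record in merged:
    print(record)
```

The join is what turns raw opinion into something an analyst can act on, because each score is now attached to an account and a customer tier.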
The engine enriches the unstructured content with semantic metadata like:
- Topic and entity extraction (names of people, organizations, places, etc.)
- Concept extraction
- Sentiment analysis
- Automatic language detection and summarization
- Multiple language support
- Taxonomy management and training application (can create or import your own taxonomies)
4. Magellan visualization engine
Once you have defined your AI roadmap and go-forward strategy, the next critical step is to identify the right tools and platform to bring it all together. That choice will help determine whether your AI and ML strategy truly transforms your business.
OpenText Magellan is a flexible artificial intelligence (AI) and analytics platform that combines open source machine learning, advanced analytics, and enterprise-grade business intelligence (BI) with the ability to acquire, merge, manage, and analyze structured and unstructured data, including the Big Data and Big Content stored in your Enterprise Information Management (EIM) systems. Magellan enables machine-assisted decision making, automation, and business optimization.
This post is part of an ongoing series on machine learning. Learn about how to get started or how to leverage Apache Spark for programming algorithms.