Qubole Meets BI Tools: 5 Machine Learning Libraries and their Big Data Use Cases

May 19, 2016 | Updated January 8th, 2024

In an ongoing effort to extract more useful information and insights from massive volumes of structured and unstructured data, many organizations have turned to cloud-based Hadoop big data analytics solutions such as Qubole. And as effective as these solutions are at capturing and analyzing large data volumes, their ability to interact with powerful Business Intelligence (BI) tools and machine learning libraries such as MLlib is taking big data analytics capabilities to a whole new level.

What follows is a look at five machine learning libraries and a big data use case for each.

1. MLlib: No conversation about machine learning tools should begin without mention of Apache Spark’s own open-source machine learning library, which runs on Spark and Hadoop clusters. MLlib features a host of common algorithms and data types, all designed to run at speed and scale. This makes MLlib a good fit for network security as well as other use cases such as predictive intelligence, customer segmentation for marketing purposes, and sentiment analysis.

As is common across the Hadoop ecosystem, MLlib can be used from Java. Python users can work with MLlib through PySpark, which interoperates with the NumPy library (as sketched below), and Scala users can write code against MLlib directly.

While setting up an on-premises Hadoop cluster can be impractical for many organizations, cloud-based Hadoop vendors are readily equipped to run MLlib.
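
To make the PySpark route mentioned above concrete, here is a minimal sketch of training a classifier with MLlib; the toy data, feature values, and parameter settings are placeholders rather than part of any real pipeline.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data as (label, features) rows. Dense vectors accept
# NumPy arrays directly, which is how MLlib connects with NumPy.
train = spark.createDataFrame([
    (0.0, Vectors.dense(np.array([0.0, 1.1, 0.1]))),
    (1.0, Vectors.dense(np.array([2.0, 1.0, -1.0]))),
    (0.0, Vectors.dense(np.array([2.0, 1.3, 1.0]))),
    (1.0, Vectors.dense(np.array([0.0, 1.2, -0.5]))),
], ["label", "features"])

# Fit a logistic regression model and inspect the learned coefficients.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients)

spark.stop()
```

The same model could be trained from Java or Scala against the same algorithms; the Python route is shown only because of the NumPy interoperability noted above.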

2. Weka: Created at the University of Waikato in New Zealand, open-source Weka is a collection of Java machine learning algorithms engineered for data mining tasks. Known for setting the standard in open-source machine learning, Weka boasts a rich set of tools and user interfaces for exploring data and results. Weka is also accompanied by a book that covers numerous machine learning concepts and uses Weka-based examples to explain both the software and the techniques behind it. Those looking to gain a solid understanding of machine learning will find that Weka is a good project to get them started.

NOTE: While Weka wasn’t created with Hadoop users in mind, it now contains new packages for distributed processing in Hadoop.

3. Accord Framework: Accord is a machine learning and signal processing framework for .NET, Microsoft’s software development platform. In this instance, the term “signal processing” refers to Accord’s range of machine learning algorithms for images and audio. Facial recognition analysis is just one use case for Accord. The framework also includes a set of vision processing algorithms that operate on image streams and can be used to implement functions such as tracking moving objects. On top of that, Accord has libraries that provide a number of conventional machine learning functions.

4. H2O: Designed by 0xdata, which has since changed its name to H2O.ai, the H2O library of machine learning algorithms is primarily geared toward business processes. As with MLlib, developers can use Java to interact with H2O, and the framework also provides bindings for Python, R, and Scala. That is a win-win scenario, as it enables cross-interaction with the libraries found in the Python, R, and Scala ecosystems.

As far as big data use cases are concerned, H2O is used by businesses for risk and fraud analysis, insurance and healthcare analytics, and customer intelligence, a field dedicated to using big data science to increase customer retention and profitability.
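
As a rough sketch of what the Python binding looks like in one of these scenarios, the snippet below trains a gradient boosting model on a hypothetical transactions file for fraud detection; the file path and column names are invented for illustration.

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# Start (or connect to) a local H2O cluster.
h2o.init()

# Hypothetical dataset: one row per transaction, with a 0/1 "fraud" label.
transactions = h2o.import_file("transactions.csv")
transactions["fraud"] = transactions["fraud"].asfactor()  # treat the label as categorical

predictors = ["amount", "merchant_category", "hour_of_day", "account_age_days"]

# Train a gradient boosting classifier to flag likely fraudulent transactions.
model = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)
model.train(x=predictors, y="fraud", training_frame=transactions)

print(model.auc())  # performance on the training data
h2o.cluster().shutdown()
```

The R and Scala bindings expose the same estimators, so teams can stay in whichever ecosystem their existing analytics code lives in.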

5. TensorFlow Serving: Released by Google in early 2016, TensorFlow Serving is a flexible, high-performance open-source software system for serving machine learning models. TensorFlow itself was originally developed by researchers and engineers on the Google Brain Team within Google’s Machine Intelligence research organization to conduct machine learning and deep neural network research. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but its general design allows it to be extended to serve other types of models and data as well.

TensorFlow Serving was designed for production environments, i.e., real-time settings where models are deployed, run, and relied on as part of an organization’s daily operations.
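
To make that production workflow concrete, here is a minimal sketch of exporting a small Keras model in the SavedModel format that TensorFlow Serving loads; the model architecture and export path are placeholders, not a recommended setup.

```python
import tensorflow as tf

# A tiny stand-in model; in practice this would be a fully trained production model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Export in the SavedModel format. TensorFlow Serving watches the parent
# directory and treats the numbered subdirectory ("1") as a model version,
# which lets it hot-swap newer versions without downtime.
tf.saved_model.save(model, "/tmp/demo_model/1")
```

A TensorFlow Serving instance pointed at the /tmp/demo_model directory can then expose the model over REST or gRPC for real-time requests.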

Big data is getting bigger and more complex every day. Fortunately, organizations faced with the growing challenge of extracting business value from mountains of data can rely on the winning combination of cloud-based Hadoop solutions such as Qubole and powerful machine learning libraries such as MLlib to meet these challenges and create a competitive advantage.

Ready to give big data in the cloud a test run? Sign up for a free Proof of Concept.
