StreamX as a Service

July 25, 2017 (updated April 15, 2024)

Today, we are excited to announce the private Beta availability of StreamX as a managed service within the Qubole Data Service (QDS) platform. With this release, we give Big Data practitioners the ability to access data from Kafka queues for delivery into Amazon S3 in a partitioned and compressed format.

StreamX

StreamX is an open-source service for persisting Kafka logs into Cloud object stores such as Amazon S3. It is built on the Kafka Connect framework and designed for reliable exactly-once delivery.

Why we built StreamX

Cloud-scale data teams use Apache Kafka to capture high-velocity data generated by thousands of clients. Extracting business value requires persisting these logs into a common storage repository, where familiar Big Data frameworks such as Spark and Hive take over. Ensuring reliability and data integrity is challenging, particularly when it comes to guaranteeing exactly-once delivery. In addition, standard log formats are not efficient for large-scale data analysis. Hence, we built StreamX with the following features:

  • Designed for exactly-once delivery.
  • Automatic delivery of logs to Parquet or Avro output formats (ORC coming soon).
  • Compatible with Kafka and Kafka Connect.
  • Amazon S3 is available as the output destination for Cloud-based data architectures.
  • Options for multiple output paths and output partitioning based on Kafka input partitions.
  • Integration with Hive for automatic creation of table partitions.
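
To make the exactly-once and partitioning ideas above more concrete, here is a minimal Python sketch of one common way a sink can get there: derive each output object's key from the Kafka topic, partition, and offset range, so that redelivering the same batch overwrites the same object instead of creating a duplicate. This illustrates the general technique only and is not StreamX's actual implementation; the key layout, the `FakeObjectStore` class, and the `deliver` helper are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def object_key(prefix: str, topic: str, partition: int,
               start_offset: int, end_offset: int, ext: str) -> str:
    """Build a deterministic key for one batch of Kafka records.

    The layout (one directory per Kafka partition, offsets in the file
    name) is a hypothetical convention chosen for this sketch.
    """
    return (f"{prefix}/{topic}/partition={partition}/"
            f"{topic}+{partition}+{start_offset:010d}+{end_offset:010d}.{ext}")


class FakeObjectStore:
    """In-memory stand-in for an object store; put() overwrites by key."""

    def __init__(self) -> None:
        self.objects: Dict[str, List[str]] = {}

    def put(self, key: str, records: List[str]) -> None:
        self.objects[key] = list(records)


def deliver(store: FakeObjectStore, prefix: str, topic: str, partition: int,
            batch: List[Tuple[int, str]], ext: str = "avro") -> str:
    """Write a batch of (offset, value) pairs under its deterministic key."""
    offsets = [offset for offset, _ in batch]
    key = object_key(prefix, topic, partition, min(offsets), max(offsets), ext)
    store.put(key, [value for _, value in batch])
    return key


store = FakeObjectStore()
batch = [(0, "a"), (1, "b"), (2, "c")]
first = deliver(store, "logs", "clicks", 0, batch)
retry = deliver(store, "logs", "clicks", 0, batch)  # simulated redelivery after a failure
print(first)
print(len(store.objects))
```

Because the key depends only on the topic, partition, and offsets, the retried write lands on the same object, and the store ends up holding a single copy of the batch.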

StreamX Managed Service

As part of the Qubole Platform, StreamX provides the advantages of a managed service, allowing you to:

  • Maximize productivity and reduce complexity with automated cluster lifecycle management.
  • Easily launch a StreamX cluster from the Clusters page in Qubole Data Service (QDS).
  • Control costs: pay only for what you use.
  • Troubleshoot with 24/7 Qubole support.

Getting Started

Getting started is easy with the form-based UI: simply select a Kafka endpoint, a storage output path, and an output data format (such as Parquet). The short video below shows how to launch a StreamX cluster and perform the steps listed above.
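
For readers who like to think in code, the three inputs the form asks for can be sketched as a small Python settings object with basic validation. This is a hedged illustration only: the class name, field names, and validation rules are hypothetical and do not correspond to a documented QDS or StreamX API.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Formats named in this post; ORC is listed as coming soon, so it is left out.
SUPPORTED_FORMATS = {"parquet", "avro"}


@dataclass(frozen=True)
class StreamXJobSettings:
    """Hypothetical container for the three values entered in the form."""
    kafka_endpoint: str   # assumed to be "host:port"
    output_path: str      # assumed to be an "s3://bucket/prefix" URL
    output_format: str    # e.g. "parquet"

    def validate(self) -> None:
        """Raise ValueError if any field fails a basic sanity check."""
        host, sep, port = self.kafka_endpoint.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {self.kafka_endpoint!r}")
        parsed = urlparse(self.output_path)
        if parsed.scheme != "s3" or not parsed.netloc:
            raise ValueError(f"expected an s3:// URL, got {self.output_path!r}")
        if self.output_format.lower() not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format {self.output_format!r}")


settings = StreamXJobSettings(
    kafka_endpoint="broker.example.com:9092",   # placeholder endpoint
    output_path="s3://example-bucket/streamx",  # placeholder bucket
    output_format="parquet",
)
settings.validate()  # passes silently when all three values look sane
```

Checking the values up front mirrors what a form-based UI would typically do before launching a cluster.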

To sign up for the private Beta, please email [email protected] or contact your CSM.
