Apache CarbonData: Data Storage for ACID Ingest, Fast Query, and Machine Learning – Huawei

Growing data volumes raise dozens of new challenges: How do you ingest streaming, mutable data? How do you build and cache indexes for fast queries? How do you analyze data with ML? To address these challenges, we present CarbonData, a data store that offers a SQL API to ingest, query, and analyze data. It empowers ingestion with cloud-native ACID transactions and Streaming Merge SQL. It empowers queries with index and materialized view technologies, and with a novel distributed index caching and pruning system that improves query performance and outperforms existing cloud platforms. It empowers analytics by integrating data with ML frameworks, and it offers a SQL API to track lineage and dependencies among data versions and models in ML pipelines. Our contributions are to call out the requirements of a high-performance data store, share our experience in exploiting novel technologies, share our design for integrating data with ML through SQL, and discuss future challenges.