Tech Blog
Cloud-native Big Data Activation Platform
-
Part 3: Transactions on the Data Lake
Data Lakes are becoming increasingly central to the analytical operations of organizations. This brings in many more ‘transactional’ requirements on the pipeline architecture and the…
-
Part 2: Tuning the Data Ingestion process
In Part 1 of this series, we briefly touched upon the various design considerations to be made when architecting the Data Lake. We saw how…
-
Enhanced Network Security with AWS PrivateLink on Qubole
Increase data security and simplify the infrastructure with Qubole. About Qubole Open Data Lake Platform: Qubole is an open and secure data lake platform for…
-
Part 1: Ingestion into the Data Lake
Data Lakes are a core pillar in an organization’s data strategy. Data lakes make organizational data from different sources accessible to various end-users like business…
-
Qubole University Launches Badge Program
For decades our desks were covered in trophies, certificates, and medals demonstrating our accomplishments, achievements, and competencies. Over time, these methods of recognition have…
-
Enabling Spark SQL MERGE via optimized ACID Data Source v0.6.0
We are pleased to announce the 0.6.0 release of the ACID Data Source for Apache Spark. This release should further empower data lake users in enterprises…
-
Introducing Apache Spark 3.0 on Qubole
We are pleased to announce the availability of Apache Spark 3.0 in the Qubole environment. The Spark 3.0 release comes with a lot of exciting new…
-
Apache Airflow Concepts – DAG Scheduling and Variables
In our last blog, we covered all the basic concepts of Apache Airflow. In this blog, we will cover some of the advanced concepts and…
-
Introducing Capacity Reservation for Application Master to increase Workload Reliability despite Spot Interruptions
AWS Spot instances reduce cloud costs by up to 90% but can be interrupted by AWS at any given time, causing running workloads to fail.…
-
Qviz – Qubole Visualization Framework for Jupyter-Based Notebooks
Data visualization is a critical aspect of Exploratory Data Analysis that helps Data Analysts and Scientists visualize frequency distributions, explore causal/correlated relationships between...
-
Data Discovery Tools – Qubole Workbench
It is common knowledge that data lakes offer the right architecture to support multiple use cases and tools, but can be operationally complex to implement…
-
Apache Airflow Tutorial – DAGs, Tasks, Operators, Sensors, Hooks & XCom
Now that you have read about how different components of Airflow work and how to run Apache Airflow locally, it’s time to start writing our…
-
Presto on Qubole is 2.6x faster than competition!
In the past 2-3 years, Presto has set the bar for fast analytical processing in modern cloud data lake architectures. Qubole has offered a Presto…
-
Terraforming the Open Data Lake
Image credits: https://science.howstuffworks.com/terraforming.htm The Qubole Open Data Lake Platform: Qubole is the open data lake company that provides a simple and secure data lake platform…
-
Logan: A Data-Driven Log Analyzer for Easy Navigation of Apache Spark Logs
Running large distributed Apache Spark clusters in the public cloud that handle an exponential increase in volumes of data to fuel analytics and machine learning (ML)…
-
Cost and Performance efficiency with Multi-tenant Spark Platform
Introduction: Ad-hoc analytics and data exploration require compute resources that can process incoming jobs instantaneously and keep the response time low. Apache Spark is a…
-
Columnar Format in Data Lakes For Dummies
Columnar data formats have become the standard in data lake storage for fast analytics workloads as opposed to row formats. Columnar formats significantly reduce the…
-
Introducing Managed Spot Block Instances that provide up to 40% cost savings
Qubole is excited to announce the general availability of Managed Spot Block instances that provide up to 40% cost savings over On-Demand EC2 Instances. Managed…
-
Boosting Parallelism for ML in Python using scikit-learn, joblib & PySpark
As a general-purpose programming language, Python is universal. It’s quick and easy, yet powerful with plenty of capabilities. It gives you an opportunity to…
-
Introducing Qubole Release 59
Qubole regularly releases its software for processing petabytes of data on the cloud through major releases once a quarter. This is in addition to several…