The 2020 Covid-19 crisis has led to an unprecedented market shift that has emphasized the importance of implementing modern data architecture that accelerates analytics, dictates an ROI-driven approach to business analytics, and keeps costs under control. As per the Mordor Intelligence report, the data lakes market was valued at USD 3.74 billion in 2020 and is expected to reach USD 17.60 billion by 2026.
Reflecting on the steady growth of the data lake during the pandemic, Sandeep Dabade, Solutions Architecture Director at Qubole, explains, “We, at Qubole, have seen continued growth in terms of onboarding of newer use cases across a majority of our enterprise customers. The three driving factors for this have been: a) the need to modernize and unify the data platform, b) faster time to value, and c) cost.”
Ashish Kumar, Technical Director at Qubole, agrees, noting that the pandemic was the most significant factor influencing customers’ decisions on cost and ROI in 2020. “Most organizations tried to keep a check on cost; ROI was also the biggest factor for them. For example, India’s leading on-demand food delivery customer migrated a couple of heavy workloads from Snowflake to Presto to save cost. Likewise, a London-based media customer tried to cut down on their cost by moving their workloads to Qubole Presto. A similar trend was observed among prospective customers as well: some of them chose Qubole on top of their data lake to save cost.”
Data lakes have become an economical option for many organizations: storing data in a centrally managed infrastructure cuts costs and reduces the number of information silos, making data accessible to users across the enterprise.
Vivek Sharma, Senior Solutions Architect at Qubole, shares, “Today organizations don’t want to have siloed data and incur data transfer costs for maintaining copies of data in both a data lake and a data warehouse. Our customers prefer to have a single platform to derive value from various use cases. Moreover, newer or broader use cases are also pushing the adoption of data lakes.”
Purvang Parikh, Senior Solutions Architect at Qubole, also believes that data transformation is happening on data lakes instead of data warehouses.
The data thought leaders at Qubole analyze the five key trends that might dominate this year. Keep reading to discover Qubole’s predictions for data lake trends in 2021.
Trend #1: Data Lake and Data Warehouse – Can They Co-exist?
Many enterprises today are considering the convergence of the two platforms. In a keynote address at the Data Lake Summit held last year, Debanjan Saha, VP and GM of Data Analytics services at Google Cloud, said that convergence between the data lake and the data warehouse is not just talk but a reality. “Convergence is happening from both sides. The data warehouse vendors are gradually moving from their existing model to the convergence of the data warehouse and data lake model. Similarly, the vendors who started their journey on the data lake-side are now expanding into the data warehouse space.”
Sandeep Dabade predicts that the convergence will continue to happen this year. “We have seen early signs of this happening already. We have many customers that land and stage the raw data in the data lake, cleanse and transform it, and then move richer or hot data into cloud data warehouses such as Redshift or Snowflake, all of this using Qubole. For example, one large-scale media and entertainment customer moves their data into Snowflake. Another global enterprise e-commerce customer moves their richer data into Redshift.”
Ashish Kumar says, “The gap between these two platforms is going to narrow this year. Qubole is thinking in that direction, where most of the problems we try to solve can be solved by the data warehouse. The data lake is going to be the superset that can solve all data warehouse problems and offer more capabilities besides.”
Trend #2: Migration of Data Science onto Data Lake
Another key trend Sandeep Dabade predicts this year is the migration of data science use cases onto the data lake. “Up until now, data engineering use cases have dominated the data lake market. That is changing, and data science use cases on data lakes will continue to grow in 2021,” he said.
“Data science appears to be the primary focus in 2021. Some of our customers were looking at our platform specifically to run data science use cases with ML and AI on structured data, as compared to unstructured or semi-structured data. The use of Qubole for machine learning, particularly for prediction and personalization use cases, will grow in 2021,” Purvang added.
Trend #3: Organizations Will Prioritize TCO Optimizations and Execute an ROI-Driven Approach
Running ad hoc analytics, streaming analytics, and machine learning workloads in the cloud offers unique cost, performance, and time-to-value advantages. But the unpredictability of both workload sizes and their associated costs can become an obstacle to growth and innovation if you don’t have efficient ways to monitor and manage them. Having the means to control costs and apply specified governance policies has become an even more critical topic for cloud data lakes due to the pandemic.
Purvang and Ashish both lay out the problem this way: “Because of the pandemic, cost has been a primary concern for many companies as they were struggling to generate revenues as planned. Keeping an eye on ROI is going to be the main focus for prospective customers this year.”
Qubole’s TCO optimization capabilities can help enterprises keep wasteful spending on the cloud data lake platform in check. Sandeep explains, “TCO has been one of the classic plays of the cloud, and yet many organizations continue to struggle with keeping their cloud costs in check. Organizations start prioritizing TCO optimizations after hitting the ceiling. The growing size of data workloads will only accelerate the prioritization of TCO optimization initiatives. Qubole provides four value pillars that are geared towards TCO optimization: Automated Cluster Lifecycle Management, Workload Aware Autoscaling, Aggressive Downscaling, and Spot Intelligence. All of these features come out-of-the-box for our customers.”
“Additionally, Qubole has continued to invest in building features such as Cost Explorer that continue to provide our customers with the visibility and actionable insights for optimizing their TCO,” he further explained.
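To make the ideas behind autoscaling, aggressive downscaling, and spot usage more concrete, here is a minimal Python sketch. It is not Qubole's implementation: the `ClusterState` fields, the function names, the idle threshold, and the prices are all hypothetical, chosen only to show how sizing a cluster to its current demand and shutting it down when idle can limit spend.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Snapshot of a hypothetical cluster; every field is an illustrative assumption."""
    nodes: int            # nodes currently running
    pending_tasks: int    # tasks waiting to be scheduled
    running_tasks: int    # tasks currently executing
    slots_per_node: int   # tasks one node can run at once


def target_nodes(state: ClusterState, min_nodes: int, max_nodes: int) -> int:
    """Return the node count that fits current demand, clamped to the allowed range."""
    demand = state.pending_tasks + state.running_tasks
    needed = math.ceil(demand / state.slots_per_node)
    return max(min_nodes, min(max_nodes, needed))


def should_terminate(state: ClusterState, idle_minutes: float, idle_limit: float) -> bool:
    """Shut the cluster down once it has had no work for at least idle_limit minutes."""
    no_work = state.pending_tasks == 0 and state.running_tasks == 0
    return no_work and idle_minutes >= idle_limit


def hourly_cost(nodes: int, spot_fraction: float,
                on_demand_price: float, spot_price: float) -> float:
    """Blended hourly cost when spot_fraction of the nodes run on spot instances."""
    blended = spot_fraction * spot_price + (1 - spot_fraction) * on_demand_price
    return nodes * blended


if __name__ == "__main__":
    busy = ClusterState(nodes=4, pending_tasks=30, running_tasks=16, slots_per_node=4)
    print("target nodes:", target_nodes(busy, min_nodes=2, max_nodes=20))
    print("hourly cost:", hourly_cost(10, spot_fraction=0.5,
                                      on_demand_price=1.0, spot_price=0.5))
```

A production autoscaler would also account for task duration, data locality, and spot interruption risk; the sketch only shows the basic shape of the decision.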
Trend #4: Data Security and Governance Will Be a Top Concern for Organizations
A data lake gathers organization-wide data from multiple sources, including consumers’ Personally Identifiable Information (PII). This sensitive data must be protected and handled in compliance with privacy laws and regulations, which makes data security and governance critical pillars in designing a data lake. Qubole, an open data platform for machine learning, streaming, and ad hoc analytics, enhanced its platform security in 2019 by adding support for AWS PrivateLink. “Qubole with AWS PrivateLink makes it easy to connect services across different AWS accounts and VPCs, and significantly simplifies network architecture. When a customer configures the Qubole Platform through AWS’s PrivateLink connectivity, the traffic between Qubole VPC and the customer’s VPC does not traverse the public internet,” says Purvang.
Vivek asserted that “data governance is a macro topic, which covers the policies, practices, and approaches to manage the data assets within an organization securely.” He highlighted the following capabilities that Qubole offers for data governance:
- Granular and Efficient Updates and Deletes: ACID capabilities on the data lake help with the right to be forgotten and the right to erasure by making sure that data in the data lake is current and, when deletion is requested, is actually deleted. Qubole supports ACID transactions natively across multiple engines to help avoid lost updates, dirty reads, and stale reads, and to enforce app-specific integrity constraints. Data integrity is maintained when concurrent users read and write data in the data lake simultaneously.
- Granular Data Access Controls: Qubole provides granular data access controls and the ability to mask data with a single policy across multiple engines (Apache Spark, Presto, and Hive) running on multiple clouds. A few of our customers have started enforcing policies around how data (from databases and tables) can be shared with users, based on compliance requirements from regulations such as CCPA and GDPR.
- Role-based Access Control: Qubole enables access controls at a minimum of three levels, covering everything from data ingestion to data access: the infrastructure, platform, and data levels. Together these provide effective policy management. It allows security teams to leverage their cloud provider’s IAM/AD/LDAP services to restrict particular users’ Qubole access to specific computing resources.
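The three capabilities in the list above can be illustrated with a small Python sketch. It is a toy, in-memory model and not Qubole's API: the `Policy` class, the `POLICIES` table, the role names, and the `user_id` and `email` columns are all hypothetical. It shows a role-based table check, column masking driven by one policy, and a delete that removes every record for a data subject.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-role policy: readable tables and columns to mask."""
    tables: frozenset
    masked_columns: frozenset = frozenset()


# Illustrative roles only; not taken from any real deployment.
POLICIES = {
    "analyst": Policy(tables=frozenset({"orders"}),
                      masked_columns=frozenset({"email"})),
    "admin": Policy(tables=frozenset({"orders", "customers"})),
}


def read_rows(role, table, rows):
    """Return the rows a role may see, with masked columns redacted."""
    policy = POLICIES.get(role)
    if policy is None or table not in policy.tables:
        raise PermissionError(f"role {role!r} may not read table {table!r}")
    return [
        {col: "***" if col in policy.masked_columns else val
         for col, val in row.items()}
        for row in rows
    ]


def erase_subject(rows, user_id):
    """Drop every row for one user, as an erasure request would require."""
    return [row for row in rows if row.get("user_id") != user_id]


if __name__ == "__main__":
    orders = [
        {"user_id": 1, "email": "a@example.com", "total": 10},
        {"user_id": 2, "email": "b@example.com", "total": 25},
    ]
    print(read_rows("analyst", "orders", orders))
    print(erase_subject(orders, user_id=1))
```

On a real data lake the delete would run as an ACID transaction against table storage so that concurrent readers never see a partially applied change; the list comprehension here only stands in for that operation.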
Ashish says, “In 2021, GDPR and CCPA compliance will be the top concern for every organization. Also, not many vendors have storage-level security options. In that case, Qubole provides a user-level role override, which can be used for data security. Apart from that, we have typical ACLs for all the resources available in Qubole, along with SQL-level authorization and storage-level security.”
Purvang further adds, “Qubole is committed to providing a secure, reliable, and performant cloud-native platform for the customers. Qubole uses a multi-layer approach to protect the confidentiality, integrity, and availability of customer data. We follow best practices in security and governance to deliver enterprise-ready capabilities. We also provide secure access and protect data artifacts with encryption and Role-Based Access Controls (RBAC), auditing, and compliance with industry and governmental regulations such as SOC 2 Type 2, HIPAA, and GDPR. Qubole protects the data at rest, in transit, and in use, by providing secure access to the platform and encrypting all data and metadata.”
Trend #5: Presto and Spark Combination to Dominate the Data Engine Space
Sandeep predicts that “Spark will continue to make inroads into ETL use cases such as real-time streaming where performance is the primary KPI. It has been an obvious engine of choice for data science and some ETL use cases. Presto will continue to dominate the ad hoc, interactive, and reporting use cases on the cloud, where performance is the primary KPI. Presto has been the popular choice for SQL on the data lake and will continue to dominate those use cases. Hive will continue to cater to batch ETL use cases where cost and reliability are the primary KPIs. We have customers that still rely heavily on Hive for their long-running batch ETL workloads where performance is not the primary KPI, but reliability and cost are the main concerns.”
However, Ashish projects that new customers will prefer a mix of Presto and Spark for batch jobs. “Spark would be utilized for all the ETL, AI and ML workloads, and Presto will be used for low-latency SQL on top of databases and data lakes.”
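The engine choices described above can be summarized as a simple lookup. The Python sketch below follows Sandeep's mapping of workloads to engines; the workload labels and the `pick_engine` function are hypothetical names used only for illustration, and a team following Ashish's projection might route batch ETL to Spark instead of Hive.

```python
# Mapping drawn from the predictions quoted above; the workload labels are assumptions.
ENGINE_BY_WORKLOAD = {
    "streaming_etl": "Spark",          # performance is the primary KPI
    "data_science": "Spark",
    "ad_hoc_sql": "Presto",            # interactive, performance-sensitive queries
    "interactive_reporting": "Presto",
    "batch_etl": "Hive",               # cost and reliability are the primary KPIs
}


def pick_engine(workload: str) -> str:
    """Return the engine suggested for a workload label, or fail loudly if unknown."""
    try:
        return ENGINE_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"no engine mapped for workload {workload!r}") from None


if __name__ == "__main__":
    for name in ("ad_hoc_sql", "batch_etl", "data_science"):
        print(name, "->", pick_engine(name))
```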
The trends behind Qubole’s data lake predictions may change how companies operate in 2021 and give them an edge over their competitors. Some of our customers are already using all of the Qubole cloud data platform capabilities described above to great effect.