Given the sheer amount of data that we produce in this era, choosing the right data platform to manage data has never felt more important. Should it be the venerable data warehouse that has served our needs until now, or should it be the data lake that promises to support any type of data workload?
In a recent virtual roundtable discussion, "Data Lake vs. Data Warehouse – A Modern Strategy to Help Companies Look at the Past, Present, and Future," initiated by Consumex, industry leaders shared their thoughts on how data warehouses and data lakes can co-exist. Delivering the keynote session, Kevin Blaisdell, Senior Solutions Architect at Qubole, gave an overview of how data-driven organizations use data. He also shed light on modern data architecture.
Shree Nair, Head of Strategy at Idera Software, led the virtual debate with a group of industry experts from Biogen, HPE, Novartis, Edcast, and ActionIQ, addressing the challenges Idera sees in the data lake and data warehouse space and how their organizations address the shortcomings. Read the edited excerpts of the roundtable discussion below:
Shree Nair: How do you look at the modern data lake architecture, especially around the evolution of data lakes and the existence of data warehouses?
Neelesh Ajmani, Lead Enterprise Architect, Biogen: The more important point is to look at architecture from a business context – what business problem are you trying to solve? When we talk about data lakes, what differentiates them from data warehouses is the present and future analytics you can conduct on them.
Shree Nair: What challenges are likely in the data lake space given that it is evolving? How do you go about keeping pace with this evolution?
Neelesh Ajmani: Any challenge you want to resolve with the help of a data lake should address a business need or help the business think in a direction that achieves its objectives or goals. So, from that perspective, the biggest challenge on the data lake side, in my view, is streaming data and maintaining data validity.
Shree Nair: What issues do you see in the data lake and the data warehouse space? How does your organization address these shortcomings? Are there any real-life scenarios you can share?
Ravi Achukola, Associate Vice President, Global IT Digital, Data & Analytics, Hewlett Packard Enterprise: One of the challenges we ran into during the digital transformation is that we have discrete sources – data marts, data warehouses, and other data silos – that must be meticulously consolidated into one source. The most distinct contention point is explaining to the businesses the difference between a data lake and a data warehouse: what should I do and where should I go, pressure reporting vs. analytical reporting. The bottom-line challenge is data governance – how we prevent a data lake from turning into a data swamp.
Moreover, if we don’t have the governance or guidelines laid out properly, it is impossible to control what data should be in the lake and what should be in the warehouse. Also, data literacy and culture are key to launching these initiatives successfully. We struggle to decide the level of governance we need to put in place to derive value from the data lake because, in the past, data warehouses or data marts used to restrict easy access to data, making it challenging for businesses. Another important aspect is to understand the real-time use cases for warehouses or data lakes. Our journey in the last three years has been to pivot and position ourselves as a data company that puts ample opportunity or onus on our team to drive data analytics at an enterprise level and, on top of that, redefine our data warehouses.
Shree Nair: Do you think data literacy is just about educating employees, in your data team and across the organization, on the importance of data, or is it more about training them on the common objectives that the company is pursuing?
Ravi Achukola: Data literacy is a broader topic; there are many ways to enable the organization by taking a data awareness and relevancy approach. Everybody wants all their data in the data lake. The first step with a data lake is establishing data stores and data owners for governance, to bring awareness of the platform’s limitations. The second is data definition and cataloging, to help the data stewards do MOCs for their team and understand what we should and should not do.
Shree Nair: How do you cover real-time or near real-time scenarios where you want stats and need the data captured and stored in a data lake, so that it can further power your data science objectives?
Joshua Mathias, Data Scientist, Edcast: It is crucial to have a good data pipeline for ingesting data from various sources in real time. Real-time is about efficiency, which calls for proper data caching. A lot of data and intelligence goes into deciding where to store each type of cached data, matching it to the use cases, and separating what the application uses immediately from what is used for analysis or for preparing the model beforehand.
Shree Nair: When you access real-time data, do you mostly access it through a data lake or data warehouse to serve a particular use case, or do you mostly store the data in the data lake and access it from there?
Joshua Mathias: Yes, we store the data in the data lake first and use the data directly from there. Essentially, we rarely trust the data as it arrives, so we define the structure ourselves and then send the data to the data lake.
Shree Nair: How do you feel about real-time data ingestion into data lakes as opposed to data warehouses? What is your long-term strategy for near real-time or real-time data ingestion into data warehouses vs. data lakes?
Aparna Mangari, Head of Data Science and Analytics, Novartis: We are using some different approaches – we have data warehouse implementation, data lake implementation, and data verification.
With regard to the data warehouse, it is not always real-time. There is a data lag right from the source system to the transactional system to the data warehouse. There is also a scheduling lag of a couple of hours, and we don’t have access to the real-time data. But the advantage is that a data warehouse is easy to implement. However, when it comes to the data lake, data ingestion becomes a challenge. Given that Novartis is a global drug development company, understanding what data can be ingested into the data lake takes a lot of brainstorming. The other two challenges we are facing in the implementation are the validity of data and data governance from the data privacy perspective.
Shree Nair: What best practices do you follow in data lake and data warehouse implementation? How do you maintain the fine balance between performance, scalability, and cost? Could you share some high-level best practices for both platforms?
John Lynch, Senior IT Architect from Biogen: At Biogen, we are using a next-generation database technology called Snowflake that has given us absolute flexibility to bring quick warehouse solutions to the market. It is highly performant; it changes all data warehousing and database implementation, which requires 5-10 parameters wide administration counting and charge performance, speed, scalability separation CPU, and storage. I’m not confident that the data lake is really positioned to serve. Data scientists spend about 99% of their time organizing the data in a data lake to answer basic questions. However, it doesn’t take much effort to put the data into the warehouse and put the data in context. I also notice a fair amount of chicanery in the data lake in terms of AI or predictive analytics. As far as real-time vs. non-real-time is concerned, there is a significant cost associated with it.
Shree Nair: How are you obtaining the benefits of managing the data lake and data warehouse?
John Lynch: Speaking generally of Snowflake, the ingestion of the data is rapid, and the cross-charging is effortless and very transparent. It arguably offers a more robust way to deploy data in terms of statistics response time.
Benjamin Shemmer, Senior Data Scientist, ActionIQ: The distinction that we are making between the data lake and data warehouse is mischaracterized to some extent. Suppose you are working on a data lake: the first step is to create a central data point that mimics a temporary data warehouse for structured data before you solve the problem. The question then becomes: are you going to persist that structure or not? It is essentially a snapshot of the data warehouse that you are momentarily materializing before you go on. We shouldn’t frame this as a competition between the two. The question is: you are always going to have that warehouse structure, but do you need the lake as well?
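To make the point above concrete, here is a minimal sketch of "momentarily materializing" a warehouse-like structure over raw lake data. It is an illustration only, not anything the panelists described using: the sample records, the `events` table, its column names, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for whatever engine a team would actually run.

```python
import json
import sqlite3

# Hypothetical raw events as they might sit in a lake: schemaless JSON lines
# whose fields vary from record to record.
RAW_LAKE_RECORDS = [
    '{"user_id": "a", "event_type": "view", "latency_ms": 120}',
    '{"user_id": "b", "event_type": "click"}',
    '{"user_id": "a", "event_type": "click", "latency_ms": 80, "extra": {"ref": "mail"}}',
]


def materialize(raw_lines, conn):
    """Project schemaless records onto a fixed, warehouse-like table.

    The table is TEMP, so it disappears when the connection closes unless
    the caller decides to persist it somewhere else.
    """
    conn.execute(
        "CREATE TEMP TABLE events (user_id TEXT, event_type TEXT, latency_ms INTEGER)"
    )
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        # Fields missing from a record become NULL; unknown fields are dropped.
        rows.append((rec.get("user_id"), rec.get("event_type"), rec.get("latency_ms")))
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return len(rows)


def clicks_per_user(conn):
    """Answer a structured question against the materialized snapshot."""
    cur = conn.execute(
        "SELECT user_id, COUNT(*) FROM events "
        "WHERE event_type = 'click' GROUP BY user_id ORDER BY user_id"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
loaded = materialize(RAW_LAKE_RECORDS, conn)
result = clicks_per_user(conn)
print(loaded, result)
```

Whether to keep the `events` table after the query runs is exactly the "persist that structure or not" decision: dropping it leaves only the lake, while writing it to durable storage amounts to maintaining a warehouse alongside the lake.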
Shree Nair: What advice would you give to data teams that have their data lakes on S3 and want to check out Snowflake to get the best of both worlds? Or would you advise them to stick to the existing data lake, maintain a single source of truth, and use the compute capabilities of Snowflake?
John Lynch: Snowflake has the ability to use its S3 staging, but it can also use external staging so it can potentially access anything inbound to the data lake. I like the belt-and-suspenders approach, particularly in enterprise data, because you do not want to bet the whole firm on one technology and come up short. The data lake is helpful for specific activities and is not a real-time activity center. The database is running as performant SAR now. It is exciting to give them another look.
Shree Nair: What is the importance of data lakes and their positive impact on a long-term data strategy?
Ravi Achukola: When we say data lake, we are pivoting to the federated data lake approach. Given the multiple cloud vendors, the key is to talk about integrating all kinds of data into the data lake and distributing the data to data marts and warehouses. The data lake serves for Corley to get the data and mash up the data as per how it is being consumed across the ecosystem.
Shree Nair: Any closing thoughts?
Neelesh Ajmani: The data lake and the data warehouse can sit together to meet both types of use cases, which are mutually independent. Data warehouses are essential for analytics, which is vital for any business.
The data lake, by contrast, helps you assemble all kinds of structured, unstructured, and semi-structured data in one place. The data warehouse aggregates and transforms data and makes it easily consumable for businesses. The data lake is still crucial because of predictive analytics. For example, pharma companies using AI and ML would be forced to go in that direction, as the combination of structured, unstructured, and semi-structured data cannot be handled through data warehouses. However, the question arises: how do you combine all that into a modern architecture that solves these problems by keeping both platforms together and managing both types of use cases, with aggregation of data and separation from the business user perspective? That’s where I think both the data lake and the data warehouse can play a role and lead to more use cases.
One of the use cases that I can think of in modern data architecture is the R&D environment. However, the process is so slow that there is no predictive analytics. It’s all based on reactive analytics. You can collate the data points and prepare the datasets to do AI and predictive analytics before you jump into the traditional way of running clinical trials. It all depends upon innovation, and innovation will bring in new use cases.
Ravi Achukola: Data lake and data warehouse have a grey line between them. I wouldn’t say one is parallel to the second to fit the purpose. Businesses are demanding real-time or near real-time analytics. Data lakes and data warehouses are best for this purpose, but when you have discrete data coming from different clouds, all you need to do is mash it up in real time and do a fast turnaround. So it is essential for us to build a modern data architecture to help the customer, make their journey smoother and more effective, and turn that into an opportunity for us.