Data Lakes in a Real-time bidding environment - David Garty, Spotad

November 25, 2020

Spotad supports millions of queries per second, so a well-designed data lake that keeps data reliable and easily accessible is one of the most important parts of our business. In this presentation, I'll focus on key aspects of data lake architecture, cost, data-based optimizations, and clusters. Well-partitioned data reduces query cost and improves performance by limiting the amount of data a query needs to scan to return its results. In particular, I'll cover known and lesser-known aspects of data partitioning, idempotency of data workflows, and caching in support of your business goals. Planning and optimization are among the strongest tools for maintaining a well-designed data lake while keeping cost low and performance high. The most important part of both is always knowing what is going on with your data: monitoring query runtimes continuously, identifying the most and least queried data sources, checking cluster utilization, and optimizing based on the results. I will discuss and demonstrate the importance of developing automated monitoring tools and using their output for optimization. I will also discuss tools for making use of spot nodes, such as heterogeneous cluster nodes and setting a maximum price, in the context of cost reduction and stability.
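To make the partitioning point concrete, here is a minimal sketch of partition pruning in plain Python. It is not taken from the talk: the Hive-style `dt=`/`hour=` layout, the bucket name, and the helper functions `partition_path` and `prune_partitions` are all hypothetical and chosen only for illustration. The idea is that a query filtered on the partition key only has to touch the matching directories rather than the whole table.

```python
from datetime import date, timedelta


def partition_path(table_root: str, day: date, hour: int) -> str:
    """Build a Hive-style partition path, e.g. root/dt=2020-11-25/hour=07."""
    return f"{table_root}/dt={day.isoformat()}/hour={hour:02d}"


def prune_partitions(paths, start: date, end: date):
    """Keep only partitions whose dt value falls within [start, end]."""
    selected = []
    for path in paths:
        # Pull the dt=YYYY-MM-DD component out of the path.
        dt_part = next(p for p in path.split("/") if p.startswith("dt="))
        day = date.fromisoformat(dt_part[len("dt="):])
        if start <= day <= end:
            selected.append(path)
    return selected


# Thirty days of hourly partitions for a hypothetical bid-request table.
first_day = date(2020, 11, 1)
all_paths = [
    partition_path("s3://example-bucket/bids", first_day + timedelta(days=d), h)
    for d in range(30)
    for h in range(24)
]

# A query restricted to one day needs 24 of the 720 partitions.
needed = prune_partitions(all_paths, date(2020, 11, 25), date(2020, 11, 25))
print(len(all_paths), len(needed))
```

In a real query engine the pruning is done by the planner from the partition metadata; the sketch only shows why choosing partition keys that match common filters shrinks the amount of data scanned.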
