Spotad supports millions of queries per second, so a well-designed data lake that keeps data reliable and easily accessible is one of the most important parts of our business. In this presentation, I'll focus on key aspects of data lake architecture, cost, data-based optimizations, and clusters. Well-partitioned data is known to reduce query costs and improve performance by limiting the amount of data a query needs to scan to return its results. In particular, I'll cover known and lesser-known aspects of data partitioning, idempotency of data workflows, and caching, all in support of your business goals.

Planning and optimizing are some of the strongest tools for maintaining a well-designed data lake while keeping cost at a minimum and performance at its best. The most important part of both is to always know what is going on with your data. This includes monitoring query runtimes at all times, checking for the most and least queried data sources, checking cluster utilization, and optimizing based on these results. I will discuss and demonstrate the importance of developing auto-monitoring tools and using their results for optimization. I will also discuss techniques for using spot nodes, such as heterogeneous cluster nodes and setting the maximum price, in the context of cost reduction and stability.
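As a rough illustration of the partitioning point above, the following sketch shows how a date filter can skip whole partitions before any data is read. It is a toy model under stated assumptions, not Spotad's actual setup: the `events/dt=YYYY-MM-DD/` layout, the per-day sizes, and the function names are all hypothetical.

```python
from datetime import date, timedelta


def build_catalog(start, days, bytes_per_day):
    """Hypothetical catalog: one date partition per day, mapped to its size in bytes."""
    return {
        f"events/dt={(start + timedelta(days=i)).isoformat()}/": bytes_per_day
        for i in range(days)
    }


def partition_date(path):
    """Parse the date out of a path that ends in 'dt=YYYY-MM-DD/'."""
    key = path.rstrip("/").split("/")[-1]
    name, _, value = key.partition("=")
    if name != "dt":
        raise ValueError(f"not a dt partition: {path}")
    return date.fromisoformat(value)


def prune(catalog, first_day, last_day):
    """Keep only the partitions whose date falls inside the queried range."""
    return {
        path: size
        for path, size in catalog.items()
        if first_day <= partition_date(path) <= last_day
    }


# Assumed numbers: 30 daily partitions of 10 GiB each.
catalog = build_catalog(date(2019, 1, 1), days=30, bytes_per_day=10 * 1024**3)

# A query that filters on three days only needs three partitions.
selected = prune(catalog, date(2019, 1, 10), date(2019, 1, 12))

full_scan_bytes = sum(catalog.values())
pruned_scan_bytes = sum(selected.values())

print(f"partitions read: {len(selected)} of {len(catalog)}")
print(f"bytes scanned:   {pruned_scan_bytes} instead of {full_scan_bytes}")
```

In this toy setup the three-day query touches a tenth of the stored bytes. In engines that bill or slow down with the volume of data scanned, that reduction is where the cost and performance benefit of partitioning comes from.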