White Papers

The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713

Data Warehouses and Data Lakes

Modern data stores were developed in the 1990s and 2000s, when sites handling big data found relational databases too slow for their needs. One major performance hit in relational databases comes from the separation of data into different tables, as is dictated by normalization. For instance, a typical query might open one table and offer a purchase number in order to obtain the ID for a customer, which it then submits to another table to get an address. Opening all these tables, with each query reading new data into the cache, can drag down an analytics application that consults millions of records.

To accommodate applications that need to be responsive, many organizations created extra tables that duplicated data, stretching the notion of a single source of truth to gain some performance. This task used to provide a central reason for ETL tools.

The most popular model for building storage of this type as a data warehouse involves a "star" or "snowflake" schema. The original numeric data is generally aggregated at the center, and different attribute fields are copied into new tables so that analytics applications can quickly retrieve all the necessary data about a customer, a retail store, and so on. These non-numeric attribute fields are called dimensions.

While traditional normalized databases remain necessary for handling sales and other transactions, data warehouses serve analytics. Hence the distinction between online transaction processing (OLTP) and online analytical processing (OLAP). OLAP can tolerate the duplication done in a data warehouse because analytics depend on batch processing and therefore require less frequent updates (perhaps once a day) than OLTP.

But in modern business settings, such delays in getting data would put organizations at a great competitive disadvantage.
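
The star schema described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely for convenience; the table names (fact_sales, dim_customer, dim_store), the columns, and the sample rows are all hypothetical and do not come from the original text. Numeric measures sit in a central fact table, and descriptive attributes live in the surrounding dimension tables.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny star schema: one fact table plus two dimension tables.

    All table names, column names, and rows are illustrative assumptions.
    """
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT,
            city        TEXT
        );
        CREATE TABLE dim_store (
            store_id   INTEGER PRIMARY KEY,
            store_name TEXT,
            region     TEXT
        );
        -- The fact table holds the numeric data and points at each dimension.
        CREATE TABLE fact_sales (
            sale_id     INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            store_id    INTEGER REFERENCES dim_store(store_id),
            amount      REAL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?, ?)",
        [(1, "Ada", "Lisbon"), (2, "Grace", "Oslo")],
    )
    conn.executemany(
        "INSERT INTO dim_store VALUES (?, ?, ?)",
        [(10, "Downtown", "South"), (11, "Harbor", "North")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(100, 1, 10, 25.0), (101, 1, 11, 40.0), (102, 2, 11, 15.0)],
    )


def sales_by_region(conn):
    """Total the sales amounts per store region with a single join to one dimension."""
    return conn.execute("""
        SELECT s.region, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_store AS s ON s.store_id = f.store_id
        GROUP BY s.region
        ORDER BY s.region
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
print(sales_by_region(conn))
```

In this sketch an analytic question such as "sales by region" needs only one join from the fact table to a single dimension, which is the kind of access pattern the star layout is meant to keep cheap.
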
Organizations need to accept and process streaming data quickly in order to reflect the failure of a machine out in the field or to detect and prevent the fraudulent use of a credit card. Data must also be structured on the fly; data analysts can't wait months or years for a DBA to create a new schema for new data.

Therefore, the structure of relational databases, which handle analytical processing in multiple steps, renders them inappropriate for
