The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713


Data Engineering Today

Key issues that data engineers handle include performance, scalability, fault tolerance, change management, and exception handling. I look at these issues in this report and mention some of the popular tools available to solve them, along with the theoretical and practical knowledge you need.

The next few sections show how the concepts and thought processes data engineers use resemble older ways of thinking about data, and discuss what assumptions need to change:

Data ingestion and transfer

In traditional environments with data warehouses and relational databases, this task is divided among a number of tools for different stages of data use. For instance, an SQL database might ingest data from an outside source, such as a spreadsheet, database, or flat file. To ingest data from an operational database to a data warehouse or a data store used by business intelligence (BI) tools, the developer applies a tool for Extract, Load and Transform (ELT) or the aforementioned ETL. Sometimes, replication or streaming tools are used to keep the target system up to date in near real time. Data virtualization tools can make data available on demand without having to move it and keep it up to date. Backups also use specific tools.

Big data environments divide tasks differently. SQL is used mostly for exploring the fields and characteristics of data sets (such as ranges) before production use, when it is supplanted or complemented by APIs. Backups probably require specialized tools (and are performed in cloud solutions through configuration options), but replication is built into most data stores and is configured as part of their setup. The new environments still use bulk data transfers and ETL tools, which have evolved with the times.

Transformations and enrichment

The data engineer often must add or change fields prior to storage. Reasons include:

• Correcting errors.
• Joining databases on fields they have in common, perhaps with changes to column names for consistency and clarity.
• Adding provenance metadata, such as the source of the data and a timestamp recording when the data was received.
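
The three reasons listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method the report prescribes: the two flat-file extracts, their column names (cust_id, customer, amount), the file name orders.csv, and the helper functions are all hypothetical, and a production pipeline would more likely use a dedicated ETL tool or data-frame library.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical flat-file extracts; the column names and values are made up.
CUSTOMERS_CSV = """cust_id,name,country
1,Ada,uk
2,Grace,US
3,Linus,
"""

ORDERS_CSV = """order_id,customer,amount
100,1,25.50
101,2,n/a
102,1,10.00
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def correct_errors(order):
    """Correct errors: store an unparseable amount as None instead of failing."""
    fixed = dict(order)
    try:
        fixed["amount"] = float(order["amount"])
    except ValueError:
        fixed["amount"] = None
    return fixed


def enrich(orders, customers, source, received_at):
    """Join orders to customers on the shared customer ID and add provenance."""
    # Index customers by their key so each order can be matched in one lookup.
    by_id = {c["cust_id"]: c for c in customers}
    enriched = []
    for order in map(correct_errors, orders):
        customer = by_id.get(order["customer"], {})
        enriched.append({
            "order_id": int(order["order_id"]),
            # The key is called cust_id in one file and customer in the other;
            # the output uses a single consistent name, customer_id.
            "customer_id": int(order["customer"]),
            "customer_name": customer.get("name"),
            # Normalize country codes to upper case; empty values become None.
            "country": (customer.get("country") or "").upper() or None,
            "amount": order["amount"],
            # Provenance metadata: where the record came from and when.
            "source": source,
            "received_at": received_at,
        })
    return enriched


received = datetime.now(timezone.utc).isoformat()
records = enrich(read_rows(ORDERS_CSV), read_rows(CUSTOMERS_CSV),
                 source="orders.csv", received_at=received)
for record in records:
    print(record)
```

Recording the timestamp once per batch, rather than once per record, is a design choice made here so that every row from the same transfer carries the same received_at value; a streaming pipeline might reasonably stamp each record individually instead.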
