White Papers

The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713

to read. Using that method, start by listing all the columns on which you want to partition data. If you want to further subdivide the data, specify next the columns on which you often filter data. You can then subdivide the data even more by listing high-cardinality columns (that is, columns that contain many different values, such as a unique ID). Spark can estimate the sizes represented by the values in the columns and can partition a skewed distribution so that it's evenly divided among partitions. A smart use of orderBy clusters data to make reads faster and avoids having files that are too small.

Because partitioning on the basis of the application is so important, what should you do if different applications need access to the data based on different fields? Creating indexes on those fields will boost performance, if the database supports indexes. (MongoDB, Cassandra, and CouchDB all support indexes, for instance.) Each index adds a small size burden to the database, and a larger time burden that affects each write, so indexes should not be used lavishly. And indexes do not substitute for efficient partitioning. So another option is to duplicate the data, using a data store for each application with the proper partitions.

Choosing the Right Data Processing Engine

In the past, DBAs often added data to their databases manually or by running ad hoc scripts. Employees then used ETL or ELT products to transform the data into a form appropriate for the organization's applications. Organizations exchanged data through file transfers. This can be very efficient for transferring large files using large buffers with very little overhead. The sender might transfer data once a day during a low-volume period such as the middle of the night. Once you have a file on a local system, you can read millions of lines a second, particularly using the readlines call offered in many languages to put large chunks of a file into memory before reading individual lines.
Another process puts the data into the right schema for the database that stores it.
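
As a rough illustration of the file-based approach just described, the following Python sketch reads a transferred file in large chunks with readlines and then maps each line onto a schema. The three-column layout (order_id, customer, amount), the function names, and the sample data are all hypothetical, chosen only for this example; a real pipeline would use whatever schema its target database requires.

```python
import os
import tempfile


def read_lines_in_chunks(path, hint_bytes=1 << 20):
    """Yield lines one at a time, loading roughly hint_bytes of whole lines per read."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # readlines(hint) stops after the line that pushes the total past hint
            chunk = f.readlines(hint_bytes)
            if not chunk:
                break
            for line in chunk:
                yield line.rstrip("\n")


def to_record(line):
    """Map one raw line onto the (hypothetical) order_id, customer, amount schema."""
    order_id, customer, amount = line.split(",")
    return {"order_id": int(order_id), "customer": customer, "amount": float(amount)}


def load_records(path, hint_bytes=1 << 20):
    """Read the file in chunks and convert every non-empty line to a record."""
    return [to_record(line) for line in read_lines_in_chunks(path, hint_bytes) if line]


# Small demonstration with made-up data standing in for a nightly file transfer.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write("1,alice,9.50\n2,bob,12.00\n3,carol,3.25\n")
    sample_path = tmp.name

# A tiny hint forces several chunked reads even on this three-line file.
records = load_records(sample_path, hint_bytes=16)
os.remove(sample_path)
print(records[0])
```

Separating the chunked read from the schema step mirrors the division of labor in the text: one routine moves bytes off disk efficiently, and another shapes them for the data store.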
