The Evolving Role of the Data Engineer

Issue link: https://www.qubole.com/resources/i/1243713

Access to Data Stores: SQL and APIs

Hadoop and other modern, nonrelational databases provide APIs that can quickly carry out the basic database operations known as CRUD (Create, Read, Update, and Delete). It is therefore well worth learning a programming language in order to use these APIs. SQL, even though it is the common way people interact with relational databases, should be considered a high-level language that imposes overhead: the SQL query optimizer spends time on tasks such as deciding the order in which to evaluate a WHERE clause and whether to consult indexes. Thus, even relational databases offer raw, direct APIs. SQL is useful when planning an application and exploring the data, but most production applications in big data use APIs.

The Hadoop family of tools was designed in the mid-2000s and mostly offers Java interfaces, but many people find it easier to use Python, Scala, or other languages with interactive interpreters. Because you can try out statements at an interactive command line, development goes faster, and these languages lend themselves to more compact source code than older languages such as Java. If you learn Python or Scala, you can probably interact with any of the modern databases. Jupyter Notebooks, interactive environments for trying out programming code, are also a popular way to start development.

But because both developers and DBAs are accustomed to using SQL, and because it can be a convenient way to explore a data set, nearly every database supports some SQL-like language, even if the database is a document store or has some other kind of nonrelational structure.

HDFS is just a filesystem designed to handle large files with replication. The filesystem imposes no structure on the data within it, but Hadoop and most other programs use it to store individual records with keys.
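The contrast between the two access styles can be sketched in a few lines of Python. This is a toy illustration, not any particular database's client API: the SQL path uses the standard library's sqlite3 (parser and optimizer included), while a plain dictionary stands in for a nonrelational store's direct key-value API. The table and column names are invented for the example.

```python
import sqlite3

# --- SQL path: every statement passes through a parser and query planner ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "ada"))
row = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()

# --- Direct API path: CRUD calls against a key-value store, no planning step ---
store = {}                      # stand-in for a nonrelational store's client
store[1] = {"name": "ada"}      # Create
record = store[1]               # Read
record["name"] = "ada l."       # Update
del store[1]                    # Delete
```

The direct path does exactly what you ask and nothing more; the SQL path buys you declarative queries at the cost of the optimizer's overhead, which is the trade-off described above.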
Hive and Impala create schemas on top of HDFS to replicate the relational model. You cannot rely on these to execute complicated queries, though, because the underlying data stores lack the relational structure that supports arbitrary queries; include too many clauses and performance becomes unacceptably slow. For instance, early versions of Hive didn't even have UPDATE and DELETE statements, because the underlying HDFS filesystem is optimized for continuous writes.
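The idea behind layering a schema over raw files is often called "schema on read": the bytes in HDFS stay untyped, and the declared columns are projected onto them only when a query reads the file. A minimal sketch of that mechanism, using a comma-delimited string in place of an HDFS file and invented column names:

```python
import csv
import io

# A raw file in HDFS is just delimited text; the schema lives outside it.
raw_file = "1,ada,london\n2,grace,arlington\n"
schema = ["id", "name", "city"]

def read_with_schema(raw, schema):
    """Project a declared schema onto untyped delimited text (schema on read)."""
    reader = csv.reader(io.StringIO(raw))
    return [dict(zip(schema, row)) for row in reader]

rows = read_with_schema(raw_file, schema)

# A filter works, but there is no index or optimizer behind it:
# every "query" is a full scan of the file, which is why piling on
# clauses gets slow in practice.
londoners = [r for r in rows if r["city"] == "london"]
```

Nothing enforces the schema at write time, and nothing accelerates the lookup, which is exactly why complicated queries over such layers perform so differently from queries against a true relational engine.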
