The Evolving Role of the Data Engineer

Azure Data Catalog helps you extract metadata and set up a catalog where users can search for data they can use. Event Hubs is a streaming data ingestion service like Kafka. Table Storage is a database for semi-structured data.

On GCP, Bigtable is a highly scalable key/value data store for semi-structured data. Dataflow is a framework for transforming and enriching both stream and batch data.

In addition to these three major vendors, Databricks deserves a mention for its popular analytical solution, built on Spark.

Object and Tiered Storage

The rethinking of storage among big data users has extended to the architectural depths of disk structures and operating system choices. As an alternative to the standard block storage offered by conventional filesystems, cloud services and open source projects offer object storage, meant for types of data that aren't expected to change or be edited. Object storage is popular among organizations that store large amounts of multimedia files such as audio and video, or that archive large amounts of data that they might need access to in the future. Object stores also scale efficiently, making them a good choice when you want to quickly append BLOBs or store large write-once data such as logs. In the cloud, common object stores include Amazon Simple Storage Service (S3), Azure Blob Storage, and Google Cloud Storage.

Cloud services also offer tiered storage, where you can trade off cost for access time, and even define policies, so that (for instance) an object moves to a slower, cheaper tier after 30 days, then switches into a cold archive after one year, and finally is deleted after a time of your choosing to meet regulatory or tax requirements. Options like this appear on AWS, Azure, and GCP.

Object stores are cheaper than block storage because they don't provide random access to data.
In fact, they don't even provide a directory structure, although sometimes they let you simulate one, just so you can keep a logical inventory of what you have.
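
To make the idea of a simulated directory structure concrete, here is a minimal sketch in plain Python. It assumes the common convention in which an object store holds a flat set of keys and a listing request takes a prefix and a delimiter (S3, for example, accepts both when listing objects). The function name list_prefix and the sample keys are hypothetical, chosen only for illustration, and no cloud service is called.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Emulate a directory listing over a flat key namespace.

    Returns (objects, folders): keys that sit directly under `prefix`,
    and the distinct "subfolder" prefixes one level below it.
    """
    objects, folders = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        # Look only at the part of the key after the requested prefix.
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A delimiter remains, so the key lives in a deeper "folder".
            folders.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(folders)


# Hypothetical keys; the store itself sees only flat strings.
keys = [
    "logs/2019/01/app.log",
    "logs/2019/02/app.log",
    "logs/readme.txt",
    "video/intro.mp4",
]

top_objects, top_folders = list_prefix(keys)
log_objects, log_folders = list_prefix(keys, "logs/")
```

The "folders" here exist only as shared key prefixes. Nothing is created or deleted when a folder appears or disappears, which is why the inventory is logical rather than physical.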

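The tiering policy described earlier (a cheaper tier after 30 days, a cold archive after one year, deletion after a period of your choosing) can be sketched as a small rule table. This is an illustration in plain Python, not any vendor's policy format: the tier names, the POLICY table, the tier_for function, and the seven-year retention period are all assumptions made for the example.

```python
from datetime import date

# Hypothetical policy: (minimum age in days, resulting state),
# ordered from oldest to youngest so the first match wins.
POLICY = [
    (7 * 365, "delete"),   # assumed retention period, set by your own rules
    (365, "archive"),      # cold archive after one year
    (30, "cool"),          # slower, cheaper tier after 30 days
    (0, "hot"),            # default tier for new objects
]


def tier_for(created, today, policy=POLICY):
    """Return the state an object should be in, given its creation date."""
    age_days = (today - created).days
    for min_age, state in policy:
        if age_days >= min_age:
            return state
    # Only reached for objects dated in the future; treat them as new.
    return policy[-1][1]


today = date(2024, 1, 1)
recent = tier_for(date(2023, 12, 18), today)   # 14 days old
older = tier_for(date(2023, 11, 2), today)     # 60 days old
old = tier_for(date(2022, 1, 1), today)        # 730 days old
expired = tier_for(date(2010, 1, 1), today)    # past the assumed retention
```

In a real deployment the provider evaluates rules like these for you. The sketch only shows that a policy amounts to a mapping from object age to storage tier.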
