Data Engineering Community

Data Engineering Community Welcome Page

As data technologies have evolved, data engineering has become a distinct role that creates value for data scientists and data analysts, as well as for the operational business units that rely on data supplied by data engineering pipelines to make key business decisions.

This site is dedicated to data engineers who are interested in ongoing education in big data technologies and in topics that affect the data engineering function. We will continue to add relevant content, including training and best practices, over time. While this site is sponsored by Qubole, our intent is that you should be able to get value from this Data Engineering Community whether or not you are a current Qubole customer.

The Data Engineering Function

Data engineers’ primary function is to build and maintain the pipelines that get data ready for use by the rest of the data team as well as other business units. This means dealing with the variety of systems where the data is generated, with differing formats, and with data quality, governance, and timeliness, while maintaining a scalable infrastructure, all within the budget allocated to the data engineering team.

Types of Data Engineers

There are three distinct types of data engineers. They can be differentiated by the tools they use to build their data pipelines:

  • Distributed Systems Engineer – These are distributed systems engineers with a Ph.D.-level skill set. They are typically authors of frameworks that solve cutting-edge data problems. They tend to write and self-manage their pipelines, and may rely on an infrastructure provider for production runs of their data pipelines.
  • Data Engineer – These are software engineers with a data-oriented skill set who are familiar with the OSS frameworks used to build pipelines. They typically have a language of choice and rely on someone else for infrastructure needs.
  • ETL Developer – These are traditional ETL developers using products such as Informatica, Talend, or Pentaho. They may be new to distributed systems and open source software, and prefer SQL and UI-assisted toolsets such as drag-and-drop or assisted pipelines. They generally don’t have to worry about scaling jobs or infrastructure needs.

Regardless of which type you identify with, data engineers build the pipelines that deliver production-ready data sets, formatting and cleansing the data so that it’s ready for use.
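
As a rough illustration of what formatting and cleansing can look like in practice, here is a minimal sketch in Python. The record layout (string fields named user_id and ts), the timestamp format, and the function name cleanse are all hypothetical, chosen only for this example. Real pipelines would typically do this work with a framework rather than hand-written loops.

```python
from datetime import datetime

def cleanse(records):
    """Drop incomplete rows, normalize formats, and remove duplicates.

    Assumes each record is a dict whose "user_id" and "ts" values are
    strings or None (a hypothetical layout for this sketch).
    """
    seen = set()
    cleaned = []
    for rec in records:
        user_id = (rec.get("user_id") or "").strip()
        raw_ts = (rec.get("ts") or "").strip()
        if not user_id or not raw_ts:
            continue  # skip rows missing a required field
        try:
            ts = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip rows whose timestamp cannot be parsed
        key = (user_id, ts)
        if key in seen:
            continue  # skip exact duplicates after normalization
        seen.add(key)
        # Emit a consistent format: trimmed id, ISO 8601 timestamp.
        cleaned.append({"user_id": user_id, "ts": ts.isoformat()})
    return cleaned
```

Calling cleanse on a list of raw records returns only the rows that passed every check, each in the same normalized shape, which is the kind of guarantee downstream analysts and data scientists depend on.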

Required Skillset

Data engineers need skills in several areas, including relational and non-relational data stores, file formats, ingestion tools, BI and visualization tools, persistent storage, cluster management, programming languages (SQL, Python, Scala, Java), and, to start, some basic knowledge of artificial intelligence.

Natural curiosity about data is important because it leads data engineers to constantly improve their skill sets and to integrate more tightly with the extended data team. That curiosity also extends to a good understanding of the systems where the data originates, who uses it, and how it is consumed.

If you would like to contribute

If you would like to contribute to our Data Engineering Community, please contact us at [email protected]