Nikhil Kumawat
- Dec 6, 2022
- 9 min read

What is Data Engineering? From A to Z.

Updated: Jan 23, 2023

Before exploring what data engineering is, let's look at some interesting figures on how much data is generated from various sources:

Each day, Google processes 8.5 billion searches.
WhatsApp users exchange up to 65 billion messages daily.
The world will produce slightly over 180 zettabytes of data by 2025.
The market of Big Data analytics in banking is set to reach $62.10 billion by 2025.
Using big data, Netflix saves $1 billion per year on customer retention.

Interesting, isn't it? Let's move on.

Data is growing at an unprecedented rate every day: by the end of 2022, the world will produce 94 zettabytes of data. The reason is that data now comes from an enormous range of sources, such as social media platforms, smartphone applications, clickstreams, and Internet of Things (IoT) devices. This immense amount of data, generated at high speed (GB/s), introduces technical challenges for extracting, storing, and managing it, and for deriving the facts and figures that may help business growth.

The huge amount of data that companies receive from various sources can uncover many opportunities for understanding their business and customers, and for making better business decisions by extracting insights from the data. To do this, businesses need skilled people for data governance and strategy, such as Data Engineers, Data Analysts, and Data Scientists, who can turn the vast amount of data into actionable insights.

Thus, the Data Engineering field is concerned with the mechanics of the flow of and access to data, with the goal of making "quality data available for fact-finding and data-driven decision-making". This is where Data Engineers come in: they design, build, and maintain scalable data infrastructure and platforms to make quality data available for decision-making. This infrastructure includes data repositories such as databases, data warehouses, and data lakes, as well as data pipelines for transforming and moving data between these systems. At the same time, Data Engineers make sure that data is highly available, consistent, secure, and recoverable.

Evolution of Data Engineering

Two decades ago, data was not generated in massive amounts, at high speed, from multiple sources. It could be stored on a single machine and analyzed without scalable data infrastructure, which also made it easy to access for finding business insights. The past decade, however, has dramatically changed the way data is perceived and used. Organizations now rely more heavily on collecting users' data and analyzing it to create business value and make business decisions.

Today, data arrives in massive volume, in great variety, and at high velocity, so companies need scalable storage backed by highly distributed computation to handle it. Because the rate of data generation has changed so much over the last two decades, companies need Data Engineers who can build scalable infrastructure for managing an ever-growing variety of data, so that it can be delivered for further analysis and enable the business to make the best decisions based on the findings. For more details about how the Data Engineering field evolved, refer to the link and link.

This is how data creation has changed in the last decade. It is still growing, and so is the need for Data Engineers. Let's move on to the ecosystem of Data Engineering, where you will get to know the terminology and building blocks of the field.

Data Engineering Ecosystem

The ecosystem of the Data Engineering field is the combination of data infrastructure, frameworks, and processes that help make Data Engineering tasks production ready. Let's see what each of these terms means:

Data Infrastructure:

Data Infrastructure contains the various data repositories used as enterprise data storage, into which data has been specifically partitioned for analytical or reporting purposes. Data repositories are categorized as databases, data warehouses, data lakes, and data marts. Selecting the right one is a crucial part of the design, so let's look at an overview of each.

Database: A database stores information, or data, in an organized way in a computer system. It can be relational or non-relational.
Data Warehouse: A data warehouse is a central repository where data from multiple sources is integrated using the ETL process. It stores current and historical data that has been cleaned, conformed, and categorized.
Data Lake: A data lake stores large volumes of raw data in its native format, straight from the source. The data can be structured, semi-structured, or unstructured.
Data Marts: A data mart is a subsection of the data warehouse, built specifically for a particular business function, purpose, or community of users, such as the sales or finance group of an organization.

Note: A well-designed data repository is essential for building a system that is scalable and capable of performing during high workloads.
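To make the relationship between a data warehouse and a data mart more concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for a real warehouse, and the table name, view name, and rows are all hypothetical, made up only for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")

# A warehouse-style table: cleaned, integrated records from several departments.
conn.execute(
    "CREATE TABLE warehouse_transactions ("
    "id INTEGER PRIMARY KEY, department TEXT, amount REAL, booked_on TEXT)"
)
conn.executemany(
    "INSERT INTO warehouse_transactions (department, amount, booked_on) VALUES (?, ?, ?)",
    [
        ("sales", 120.0, "2022-12-01"),
        ("finance", 75.5, "2022-12-01"),
        ("sales", 30.0, "2022-12-02"),
    ],
)

# A data mart modeled as a view: only the slice the sales team needs.
conn.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT booked_on, SUM(amount) AS total_amount "
    "FROM warehouse_transactions WHERE department = 'sales' "
    "GROUP BY booked_on"
)

sales_rows = conn.execute(
    "SELECT booked_on, total_amount FROM sales_mart ORDER BY booked_on"
).fetchall()
print(sales_rows)
```

In a real setup the mart would more likely be a separate schema or database fed from the warehouse. A view is used here only because it keeps the sketch short.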

Frameworks/tools:

The terms "frameworks" and "tools" are used interchangeably in the field of Data Engineering. These tools facilitate storage, analysis, data visualization, and building data pipelines for seamless ETL/ELT operations. Below is a list of tools that are used in Data Engineering.

Apache Hadoop: Hadoop is for storing and analyzing large data sets in a distributed storage environment. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Apache Spark: Spark is a unified engine for large-scale data analytics. It supports a rich set of higher-level tools including Spark SQL, pandas API on Spark, MLlib, GraphX, and Structured Streaming.
Apache Kafka: Apache Kafka is an open-source distributed event streaming platform used for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Apache Hive: Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

There are many more tools, as well as other frameworks that can be accessed from cloud providers such as AWS, Azure, Google Cloud, and IBM Cloud.

Processes:

Processes define how Data Engineering tasks are carried out in the deployment environment. Below is the list of tasks that occur in a Data Engineering environment.

Extracting data from disparate sources.
Architecting and managing data repositories.
Architecting and managing the data pipelines for transformation, integration, and storage of Data.
Automating and optimizing workflows and flows of data between systems.
Developing the applications needed throughout the data engineering workflow.
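The extract, transform, and store steps listed above can be sketched as a tiny pipeline. This is a minimal illustration using only the Python standard library. The CSV content, the `payments` table, and the function names `extract`, `transform`, and `load` are assumptions made up for this example, and an in-memory SQLite database stands in for the destination repository.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (an assumption, not real data).
RAW_CSV = """user_id,country,amount
1, IN ,10.5
2,us,
3,US,4.0
"""


def extract(text):
    """Read raw rows from CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows without an amount and normalize types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["user_id"]), row["country"].strip().upper(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Write cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping each stage as its own function makes it easier to test the stages separately and to swap the source or destination later. Production pipelines would typically add scheduling, logging, and error handling on top of this.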

Responsibilities of a Data Engineer

So far we have covered the definition and ecosystem of Data Engineering. Let's move on to the responsibilities of a Data Engineer. The overarching responsibility of a Data Engineer is to provide analytics-ready data to data consumers.

The responsibilities of a Data Engineer are listed as:

Extract, organize, and integrate data from disparate sources.
Design and manage the data pipelines that encompass the journey of data from source to destination.
Prepare data for analysis and reporting by transforming and cleansing it.
Set up and manage the infrastructure required for the ingestion, processing, and storage of data.

Career opportunities in Data Engineering

Let's look at the statistics:

95% of businesses cite the need to manage unstructured data as a problem for their business. (Source: Forbes)
Big Data in healthcare could be worth $71.6 billion by 2027. (Source: Global News Wire)
The demand for composite data analytics professionals will grow by 31% by 2030. (Source: Forbes)
97.2% of organizations are investing in Big Data and AI. (Source: New Vantage)
96% of companies plan to hire job seekers with big data skills. (Source: The Economic Times)
80% of data analytics adoptions will depict a business's capabilities. (Source: Forbes)
Data Science jobs will increase by around 28% by 2026. (Source: Towards Data Science)
Machine Learning Engineers, Data Scientists, and Big Data Engineers rank among the top emerging jobs on LinkedIn. (Source: Forbes)
- According to LinkedIn's 2020 Emerging Jobs Report, data engineering now joins machine learning and data science as one of the top 10 "jobs experiencing tremendous growth" in the U.S., with industries from retail to automotive taking notice and making this hard-to-hire talent a part of their teams.
- The Dice Tech Job Report of 2020 lists data engineering as the fastest-growing tech occupation, with year-over-year growth of 50%. With more and more companies competing to find the right talent for their expanding data infrastructure, it is expected to grow even further in the years to come.

From the above statistics, it is clear that in the coming years companies will have a strong need for Data Engineers who can handle large amounts of data by creating scalable infrastructure and deliver that data to Data Analysts and Data Scientists, whose analysis enables the business to make the best decisions based on their findings.

What makes you a Data Engineer

For this part, I am collecting information from people who have worked as Data Engineers themselves and have hired data engineering teams. For more details, you can visit the blog "5 things you should know for a career in data engineering", a must-read.

You must be a strong developer:
1. In a blog post about what he looks for in a data engineer, Anderson said, "I can't stress enough how important it is for a data engineer to have a strong programming background. They also need a love of or at least an interest in data, in finding patterns in data. Also, they have to like and have the ability to create systems that are difficult and complex. So, it's a love of data combined with a love of programming to create data pipelines."
2. In addition to being comfortable in coding, Lappas says, "You have to have the operations mindset that uptime is critically important. You have to be careful how you build your infrastructure for reliability so that any changes won't break any of the pieces."
3. Ng says, "Everything is code now: infrastructure as code, pipeline as code, etc. Courses are OK but nothing beats real-world experience. A textbook doesn't teach you how to handle a data pipeline outage - at least none of mine did!"
You need to know about a lot of technologies: Lappas says, "A data engineer has three main duties":
1. To ensure that the data pipeline - the acquisition and processing of data - is working.
2. To serve the needs of internal customers - the data scientists and data analysts.
3. To control the cost of moving and storing data.
Social and communication skills are important: Ng says, "Aside from hard technical skills, a good data engineer should also have certain soft skills and qualities":
1. Good communication skills: A lot of times there's a discovery period when you start to design a pipeline because your data is sitting in different silos that may be located in different areas of your infrastructure. You'll have to talk to people to understand the playing field before you design anything. This discovery step isn't easy, but it's a requirement for making sure you're building the right thing.
2. Excitement about working on back-end systems: Data engineers don't build a lot of UIs and front-end apps. They work deep in the systems stack, so that excitement will encourage you to build systems that are optimized and fulfill the requirements.
3. A love of learning: You have to keep up with new libraries, frameworks, and tools out there in the community. Things change fast and you need to be able to quickly understand, evaluate and learn new tools if necessary.

"Having good people skills is critical," Lappas agrees. "A data engineer serves internal teams, so he or she has to understand the business goal that the data analyst wants to achieve to best support them. If a data scientist has a specific tool they want to use, the data engineer has to set up the environment in a way that lets them use it. So you have to be really good at interacting with the rest of the data team."

Take Away

Lappas says, "Data engineers are responsible for acquiring data for data scientists and data analysts, who need all the company's data available in a format that lets them query it with the tool of their choice. The data engineer has to migrate it from where it lives and transform it so that it makes sense to the data scientists and data analysts. That may require aggregating it and running statistical methods to derive higher insights. For example, if a mobile app generates 10,000 events per second, chances are you're going to have to do some transformation on that raw data to make it useful for the rest of the data team."
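The kind of transformation Lappas mentions, turning a stream of raw events into something the rest of the data team can use, might look like the following minimal sketch. The event tuples, event names, and the `count_per_second` helper are hypothetical, made up for illustration. A real pipeline at 10,000 events per second would use a streaming framework rather than an in-memory list.

```python
from collections import Counter

# Hypothetical raw events: (unix_timestamp_seconds, event_name).
raw_events = [
    (1670000000.10, "click"),
    (1670000000.75, "click"),
    (1670000000.90, "scroll"),
    (1670000001.20, "click"),
]


def count_per_second(events):
    """Roll raw events up into (second, event_name) -> count."""
    counts = Counter()
    for timestamp, name in events:
        # Truncating the timestamp groups events into one-second buckets.
        counts[(int(timestamp), name)] += 1
    return counts


summary = count_per_second(raw_events)
for (second, name), total in sorted(summary.items()):
    print(second, name, total)
```

The aggregated counts are far smaller than the raw stream, which is what makes them practical for analysts to query.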

Tam says, "I've hired people of many different educational backgrounds – from people who've just graduated with a computer science degree to people who've done bootcamp courses in Python. You shouldn't be pigeonholed by your background. It depends on the person's overall goal. If they have the vision and drive, anyone could make a good data engineer with time."

Ng's advice: "Work for a startup and find a great mentor. Whether this is at an internship or your first job, find a place where you can work directly for someone who's a great teacher. More than anything else, a great mentor is the most efficient way to learn the right things and learn those things quickly. By working at a startup you'll be forced to wear multiple hats and will learn an incredible amount while doing that. Each hat is an opportunity to learn something new. Be a hat collector."

With this, I end this blog. If there is anything else that should be added, please comment.

Thanks for reading. I hope you found it insightful and that it helped you get to know who Data Engineers are, the ecosystem of Data Engineering, and, most importantly, what makes you a Data Engineer.

Keep learning and keep growing!

Have a nice day.