The world of technology is brimming with enticing buzzwords. Proficiency in areas like machine learning, artificial intelligence, and neural networks is highly desirable.
In today’s big-data world, writing algorithms doesn’t automatically come with responsibility for data cleaning. Organizing and preparing data for scientists and researchers has become a complex challenge in its own right. This is where data engineers step in and carve out their crucial role in the ecosystem.
Imagine you’ve been hired as a data scientist at a young, data-driven organization. Your task is to develop a model for identifying fraudulent transactions. You plan to apply a sophisticated machine learning technique that you’ve honed over the years. However, after digging into the data for a couple of hours, you realize that it’s scattered across multiple databases. Furthermore, the data is stored in tables that are optimized for running applications, not for conducting analysis. To make matters worse, some of the data has been corrupted due to legacy code. In your previous company, you never faced such issues because the data was neatly organized and readily available. Feeling desperate, you find solace in the arrival of the data engineer.
The data engineer’s role is to make your life as a data scientist easier. If you require data from various sources simultaneously, the data engineer will extract it from those sources and load it into a single, ready-to-use database. Additionally, they will optimize the database schema to enhance query performance. They also monitor the data pipelines to ensure the absence of corrupt data. Whenever an issue arises, they repair the pipeline and schedule or automate tasks to prevent errors during manual work.
In essence, data engineering is a form of software engineering that revolves around designing, developing, testing, and maintaining architectures such as databases and large-scale processing systems.
Now, let’s explore the essential skills and knowledge that data engineers should possess.
- Firstly, they need to be familiar with Linux and comfortable using the command line.
- They should have experience in programming with at least one language, such as Python, Scala, or Java.
- Proficiency in SQL is crucial, including the ability to write queries, extract data, and create database schemas.
- Data engineers should also possess a fundamental understanding of distributed systems and how they differ from traditional storage and processing systems.
- They require a deep comprehension of the ecosystem, encompassing ingestion, processing frameworks, and storage engines. It is vital for them to know the strengths and weaknesses of each tool and determine their optimal applications.
- Finally, they must possess the skills to access and process data effectively.
In a data-driven organization striving for scalability, the data engineer plays a pivotal role and is regarded as one of the most valuable individuals.
Data Engineer vs Data Scientist
Focus and Objectives:
- Data Engineer: Building and maintaining data systems and infrastructure.
- Data Scientist: Extracting insights and knowledge from data.
Skills and Expertise:
- Data Engineer: Strong programming, database, and ETL skills.
- Data Scientist: Mathematics, statistics, and machine learning proficiency.
Workflow and Responsibilities:
- Data Engineer: Data ingestion, integration, cleaning, and ensuring data quality.
- Data Scientist: Research questions, data analysis, modeling, and communication of insights.
Data Engineering Tools
A Data Engineer’s role involves moving, processing, and loading data from various sources into an analytical database using specialized tools.
These tools can be classified into three categories:
- Databases for storage — Databases hold large amounts of data and can be SQL or NoSQL-based.
- Processing frameworks for data manipulation — Processing frameworks, such as Spark, Hive or Kafka, handle data cleaning, aggregation, and joining from different sources, using clusters of machines for efficient processing.
- Scheduling tools for automation — Scheduling tools, like Apache Airflow, ensure timely and ordered execution of data movement and processing tasks.
A good Data Engineer understands the limitations and dependencies of these tools, creating a data engineering pipeline that involves extracting data from multiple sources, transforming it using processing frameworks, and loading it into an analytical database. External APIs or other formats can also serve as data sources.
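The extract-transform-load flow described above can be sketched in plain Python. This is a minimal illustration only: the source records, the cleaning rule, and the in-memory "warehouse" are hypothetical stand-ins for real databases, APIs, and analytical stores.

```python
# A minimal ETL sketch; all data and names below are illustrative.

def extract():
    # Pretend these rows came from two different source systems
    return [
        {"id": 1, "amount": "120.50", "currency": "USD"},
        {"id": 2, "amount": "bad-value", "currency": "USD"},  # corrupt row
        {"id": 3, "amount": "99.99", "currency": "EUR"},
    ]

def transform(rows):
    # Drop corrupt rows and normalize the amount field to a float
    clean = []
    for row in rows:
        try:
            clean.append({**row, "amount": float(row["amount"])})
        except ValueError:
            continue  # skip records that fail validation
    return clean

def load(rows, warehouse):
    # In a real pipeline this would write to an analytical database
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(len(warehouse))  # 2 rows survive the cleaning step
```

Real pipelines follow the same shape, with each stage swapped for a framework-backed implementation.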
Let’s understand databases and their types. Databases are essential for data engineers, as they store and organize information. They allow rapid search and retrieval. The main difference between databases and file systems is the level of organization and functionality.
Databases can handle structured, semi-structured, and unstructured data. Structured data follows a defined structure, like tabular data in relational databases. Semi-structured data has tags or markers but lacks a strict table structure, like JSON data. Unstructured data has no schema and resembles files, such as text, photos or videos. SQL databases use tables and relations, like MySQL and PostgreSQL, while NoSQL databases handle various data types, such as key-value stores (Redis) and document databases (MongoDB).
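The difference between structured and semi-structured data is easy to see with Python's standard library; the sample CSV and JSON values here are made up for illustration:

```python
import csv
import io
import json

# Structured data: tabular rows with a fixed schema (CSV here)
table = list(csv.DictReader(io.StringIO("name,age\nAda,36\nAlan,41")))

# Semi-structured data: tagged fields, nesting, no fixed table structure (JSON)
doc = json.loads('{"name": "Ada", "skills": ["math", "code"], '
                 '"address": {"city": "London"}}')

print(table[0]["name"])        # Ada
print(doc["address"]["city"])  # London
```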
Data engineers work with different database types based on the formats and use cases of the data they retrieve from various sources.
To design and develop databases, we need to understand the concept of a database schema. A schema describes the structure and relationships between tables in a database. In a simple example of a restaurant ordering system, we have tables for customers and orders, connected by a foreign key called customer ID. SQL statements can be used to create such a schema, defining properties and relationships between tables.
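The restaurant-ordering schema described above can be sketched with SQLite, which ships with Python. The table and column names are examples, not a prescribed design:

```python
import sqlite3

# Two tables connected by a customer_id foreign key (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        dish        TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Jane')")
conn.execute("INSERT INTO orders VALUES (10, 1, 'Ramen')")

# Join the two tables through the foreign key
row = conn.execute("""
    SELECT c.name, o.dish
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""").fetchone()
print(row)  # ('Jane', 'Ramen')
```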
For handling large volumes of data from various sources for analytical purposes, companies often build data warehouses. Multidimensional schemas, like the star schema, are designed for these purposes. A star schema consists of one or more fact tables (e.g., orders) referencing dimension tables (e.g., customer, restaurant, time, dish). Dimension tables hold information about the world, while fact tables contain records representing events.
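A toy version of such a star schema, again in SQLite for illustration: one fact table of order events referencing two dimension tables. All names and values here are invented:

```python
import sqlite3

# Star schema sketch: fact_orders references dim_customer and dim_dish
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_dish     (dish_id INTEGER PRIMARY KEY,
                               dish_name TEXT, price REAL);
    CREATE TABLE fact_orders  (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        dish_id     INTEGER REFERENCES dim_dish(dish_id),
        quantity    INTEGER
    );
    INSERT INTO dim_customer VALUES (1, 'Jane');
    INSERT INTO dim_dish     VALUES (1, 'Ramen', 12.0);
    INSERT INTO fact_orders  VALUES (100, 1, 1, 2);
""")

# Analytical queries join the fact table to its dimensions
total = conn.execute("""
    SELECT SUM(f.quantity * d.price)
    FROM fact_orders f JOIN dim_dish d ON d.dish_id = f.dish_id
""").fetchone()[0]
print(total)  # 24.0
```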
Designing an effective database schema is crucial for building a data pipeline as an inefficient schema can consume excessive memory and resources. Planning is essential in this process.
To handle petabytes of data, distributed or parallel computing is essential. Data pipelines involve collecting, joining, cleaning, and aggregating data from various sources. Parallel computing is crucial in the world of big data due to memory and processing power limitations. Big data processing tools divide tasks into smaller subtasks and distribute them across multiple computers, leveraging parallel processing to complete the overall task faster. This approach provides extra processing power and allows data partitioning to reduce memory usage and improve fault tolerance.
However, parallel computing also comes with communication overhead, which can be a bottleneck if processing requirements are not substantial or if there are too few processing units. Additionally, the speed increase is not always linear, and there may be parallel slowdown.
By breaking tasks into smaller subtasks, distributing them among multiple machines, and aggregating the results, data can be transformed and loaded into an analytical database for data scientists to utilize.
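The split-apply-combine idea behind this can be mimicked on a single machine with Python's standard library. Real big-data tools distribute partitions across machines; here threads stand in for workers, and the word list is a toy dataset:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Divide the data into partitions, process each on a separate worker,
# then aggregate the partial results (a word count, for illustration).
words = ["spark", "hive", "spark", "airflow", "hive", "spark"]

def count_partition(partition):
    # "map" step: each worker counts words in its own partition
    return Counter(partition)

partitions = [words[:3], words[3:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_partition, partitions))

# "reduce" step: merge the partial results into one answer
total = sum(partial_counts, Counter())
print(total["spark"])  # 3
```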
Parallel Computing Frameworks
Let’s discuss specific parallel computing frameworks, particularly those popular in the data engineering world.
MapReduce & Hadoop
Hadoop is a widely used framework for big data systems. It consists of an ecosystem of open-source tools maintained by the Apache Software Foundation. Hadoop utilizes the MapReduce algorithm and plays a central role in developing ETL pipelines (Extract, Transform, Load).
Two important projects within Hadoop are HDFS, a distributed file system, and MapReduce, a popular processing technique. However, cloud-managed storage systems like Amazon S3 or Google Cloud Storage have started replacing HDFS.
MapReduce, while effective, was notoriously difficult to write jobs for. To address this, several software projects emerged to simplify the task; they are discussed next.
One notable project under the Hadoop umbrella is Hive, which supports the extraction phase of the ETL pipeline. Hive acts as a layer on top of the Hadoop ecosystem, offering an SQL-like interface, its own dialect called Hive SQL, for structured querying of data from multiple sources, and it integrates with databases and file systems compatible with Hadoop. Before Hive, developers had to implement queries using the challenging MapReduce Java API. A Hive query looks like standard SQL, but behind the scenes it is transformed into a job that runs on a cluster of computers; while MapReduce originally executed Hive jobs, Hive now integrates with multiple data processing engines. Hive also supports a variety of data collection tools, providing flexibility for integration. With data extraction covered, we can explore data transformation tools.
Another parallel computing framework worth mentioning is Spark. Spark distributes data processing tasks across clusters of computers, aiming to minimize disk writes and keep as much processing as possible in memory. It addresses a limitation of MapReduce in interactive, exploratory data analysis, where each step builds on the previous one. Spark originated at the University of California, Berkeley, and is currently maintained by the Apache Software Foundation.
It relies on resilient distributed datasets (RDDs), a distributed data structure similar to a list of tuples, which can undergo transformations and actions. Transformations include functions like filter, map, groupByKey, and union, while actions produce a single result.
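The transformation/action split can be mimicked with Python built-ins. This is only an analogy: real RDD transformations are lazy and distributed across a cluster, while this sketch runs eagerly on one machine.

```python
from functools import reduce

# filter and map play the role of RDD transformations;
# reduce plays the role of an action producing a single result.
ages = [15, 22, 31, 17, 45]

adults = filter(lambda a: a >= 18, ages)      # "transformation": filter
doubled = map(lambda a: a * 2, adults)        # "transformation": map
total = reduce(lambda x, y: x + y, doubled)   # "action": single result

print(total)  # (22 + 31 + 45) * 2 = 196
```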
PySpark, a Python interface, is commonly used with Spark, providing a DataFrame abstraction similar to pandas DataFrames. Spark is engineered for performance and can be significantly faster than Hadoop for large-scale data processing. With the transformation phase covered, it’s time to explore automation in the ETL pipeline.
To understand how workflow scheduling frameworks function, let’s first grasp the concept of a workflow through an example:
You can create a Spark job that retrieves data from a CSV file, filters out corrupt records, and loads the cleaned data into a SQL database for analysis. However, if this job needs to be performed daily, manually running it each time would not be scalable.
In such cases, a simple tool like cron, the Linux job scheduler, can solve the problem. But suppose you have multiple jobs: one to handle the CSV file, another to clean data from an API, and a third to merge the data from both sources. The third job relies on the successful completion of the first two. It quickly becomes apparent that a basic tool like cron is insufficient and a more comprehensive approach is necessary. This is where workflow scheduling frameworks come into play, as they orchestrate these interconnected jobs.
Dependencies between jobs can be effectively visualized using directed acyclic graphs (DAGs). A DAG comprises nodes connected by directed edges, without any cycles in the graph. This means that no path, following the directed edges, encounters a node more than once.
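The three-job scenario above forms a small DAG, which Python's standard library can order for us. The job names below are illustrative:

```python
from graphlib import TopologicalSorter

# Map each job to the jobs it depends on: the merge job
# must wait for both upstream cleaning jobs to finish.
dag = {
    "clean_csv": set(),
    "clean_api": set(),
    "merge_sources": {"clean_csv", "clean_api"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # the merge job always comes last
```

Schedulers like Airflow perform essentially this ordering, plus retries, monitoring, and time-based triggering.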
To manage all these dependencies, Airflow has emerged as the de facto workflow scheduling framework. Initially developed by Airbnb for internal workflow management, Airflow was open-sourced in 2015 and joined the Apache Software Foundation in 2016. Airflow revolves around the concept of DAGs: developers create and test DAGs in Python to construct intricate data pipelines.
Here’s an example of performing ETL using Apache Spark & Apache Airflow for workflow in Python:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from pyspark.sql import SparkSession

# Define the ETL function
def run_etl():
    # Create a SparkSession
    spark = SparkSession.builder \
        .appName("ETL Example") \
        .getOrCreate()

    # Extract data from a CSV file
    raw_data = spark.read.csv("input.csv", header=True, inferSchema=True)

    # Transform the data
    transformed_data = raw_data.filter("age >= 18").select("name", "age")

    # Load the transformed data into a database
    transformed_data.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/mydatabase") \
        .option("dbtable", "mytable") \
        .option("user", "myuser") \
        .option("password", "mypassword") \
        .mode("append") \
        .save()

    # Stop the SparkSession
    spark.stop()

# Define the DAG (workflow)
dag = DAG(
    dag_id="etl_workflow",
    start_date=datetime(2023, 5, 7),
    schedule_interval="0 * * * *",  # Run every hour
)

# Define the ETL task
etl_task = PythonOperator(
    task_id="etl_task",
    python_callable=run_etl,
    dag=dag,
)

# With a single task, there are no dependencies to set
In this example, we start by creating a SparkSession, which is the entry point for using Spark functionality. We then extract data from a CSV file using spark.read.csv(), specifying the file path, whether the file has a header, and that the schema should be inferred.
Next, we perform the transformation by filtering the data and selecting specific columns with Spark’s DataFrame API: we drop records where the age is less than 18 and keep only the “name” and “age” columns.
Finally, we load the transformed data into a PostgreSQL database via the JDBC (Java Database Connectivity) API, specifying the JDBC URL, table name, credentials, and save mode (in this case, “append”, to add the data to an existing table). The data is then written to the database using the save() method.
For the workflow itself, we define a DAG with the ID “etl_workflow”. The DAG is scheduled to start on May 7, 2023, and the schedule_interval parameter is set to "0 * * * *" so that the ETL job runs every hour.
To summarize, we discussed several important tools and frameworks in the field of data engineering. We started by exploring Apache Hadoop, a distributed computing platform designed for big data processing and storage. Hadoop provides a scalable and fault-tolerant framework that enables the processing of large datasets across clusters of computers.
Next, we covered Apache Hive, a data warehousing and SQL-like query engine built on top of Hadoop. Hive allows users to write SQL-like queries to analyze and process data stored in Hadoop Distributed File System (HDFS). It provides a high-level abstraction that simplifies data querying and allows for efficient data processing at scale.
We then moved on to Apache Spark, a powerful and fast open-source data processing engine. Spark offers in-memory processing capabilities, allowing it to handle large datasets and perform complex computations efficiently. Spark provides various APIs, including Scala, Python, and Java, and supports a wide range of data processing tasks, such as batch processing, real-time streaming, machine learning, and graph processing.
Lastly, we discussed Apache Airflow, a platform for programmatically authoring, scheduling, and monitoring workflows. Airflow allows data engineers to create complex data pipelines by defining tasks and their dependencies as Directed Acyclic Graphs (DAGs). It provides a rich set of operators and integrations, making it easier to orchestrate and manage data workflows across various systems and services.
These tools are widely used in the data engineering ecosystem and offer valuable capabilities for processing, analyzing, and managing large-scale data. By gaining knowledge and experience with tools like Hadoop, Hive, Spark, and Airflow, data engineers can effectively tackle the challenges of working with big data and build robust data processing pipelines.