Course Title:
Certificate Course in Data Engineering
Course Description:
This certificate course is designed to equip learners with the essential skills and knowledge to excel in the field of data engineering. Upon completion, participants will be able to describe the data engineering lifecycle, use Python for data manipulation and analysis, work proficiently with relational and NoSQL databases, apply Big Data technologies such as Hadoop and Spark, use machine learning techniques to extract insights, and design and implement robust data pipelines and data warehouses.
Course instructional level:
Beginner/Intermediate
Course Duration:
6 Months
Hours: 150
Course coordinator:
Prof. Aarti Karandikar
Course coordinator's profile:
Prof. Aarti Karandikar completed her B.E. (CSE) in 1999, her M.Tech. (CSE) in 2011, and her Ph.D. at RTMNU, Nagpur. She currently serves as Assistant Professor in the Department of Computer Science and Engineering (Data Science). She has over 22 years of teaching experience, has published more than 45 research papers in peer-reviewed journals and conferences, and holds two copyrights. She is a certified educator trainer for the Alteryx Foundation Course and Alteryx Designer Core. Her areas of interest include data analytics, data mining, artificial intelligence, machine learning, and remote sensing.
Course Contents:
Module/Topic name | Sub-topic | Duration |
1. Introduction to Data Engineering and Python | | 30 Hours |
1a | Data engineering roles, and the stages and concepts of the data engineering lifecycle. | |
1b | Data engineering technologies such as relational databases, NoSQL data stores, and Big Data engines; data security, governance, and compliance. | |
1c | Python for data science and software development. | |
1d | Python programming logic: variables, data structures, branching, loops, functions, objects, and classes. | |
1e | Python libraries such as Pandas and NumPy, and developing code using Jupyter Notebooks. | |
1f | Access and web-scrape data using APIs and Python libraries such as Beautiful Soup (see the web-scraping sketch after this table). | |
2. Python for Data Engineering and RDBMS | | 30 Hours |
2a | Python for working with and manipulating data; implement web scraping and use APIs to extract data with Python. | |
2b | Play the role of a data engineer working on a real project to extract, transform, and load data; use Jupyter Notebooks and IDEs. | |
2c | Data, databases, relational databases, and cloud databases. | |
2d | Data models, relational databases, and relational model concepts (including schemas and tables). | |
2e | Entity Relationship Diagrams, and designing a relational database for a specific use case. | |
2f | Popular DBMSes, including MySQL, PostgreSQL, and IBM DB2. | |
2g | Analyze data within a database using SQL and Python; DDL commands (see the SQL-from-Python sketch after this table). | |
2h | Construct basic to intermediate-level SQL queries using DML commands. | |
2i | Advanced SQL techniques such as views, transactions, stored procedures, and joins. | |
3. Data Warehouses, Data Pipelines, and BI Dashboards | | 30 Hours |
3a | Describe and contrast Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) processes; batch versus concurrent modes of execution. | |
3b | Implement an ETL workflow with Bash and Python functions (see the ETL sketch after this table). | |
3c | Data pipeline components, processes, tools, and technologies. | |
3d | Populate a data warehouse, and model and query data using CUBE, ROLLUP, and materialized views. | |
3e | Popular data analytics and business intelligence tools and vendors; create data visualizations using IBM Cognos Analytics. | |
3f | Design and load data into a data warehouse, write aggregation queries, create materialized query tables, and build an analytics dashboard. | |
3g | Purpose of analytics and Business Intelligence (BI) tools; IBM Cognos Analytics and Google Looker Studio; analyzing DB2 data with IBM Cognos Analytics. | |
4. Introduction to NoSQL Databases and Big Data with Spark and Hadoop | | 30 Hours |
4a | The four main categories of NoSQL repositories; characteristics, features, benefits, limitations, and applications of the more popular Big Data processing tools. | |
4b | MongoDB tasks, including create, read, update, and delete (CRUD) operations (see the pymongo sketch after this table). | |
4c | Execute keyspace, table, and CRUD operations in Cassandra. | |
4d | Impact of big data, including use cases, tools, and processing methods. | |
4e | Apache Hadoop architecture, ecosystem, practices, and user-related applications, including Hive, HDFS, HBase, Spark, and MapReduce. | |
4f | Spark programming basics, including parallel programming with DataFrames, Datasets, and Spark SQL (see the PySpark sketch after this table). | |
4g | Spark RDDs and Datasets; optimize Spark SQL using Catalyst and Tungsten; and use Spark's development and runtime environment options. | |
5. Machine Learning with Apache Spark and Capstone Project | | 30 Hours |
5a | Describe machine learning (ML) and its role in data engineering; summarize generative AI; discuss Spark's uses; and analyze ML pipelines and model persistence. | |
5b | Evaluate ML models; distinguish between regression, classification, and clustering models; and compare data engineering pipelines with ML pipelines. | |
5c | Construct data analysis processes using Spark SQL, and perform regression, classification, and clustering using SparkML. | |
5d | Demonstrate connecting to Spark clusters; build ML pipelines; perform feature extraction and transformation; and persist models (see the SparkML sketch after this table). | |
5e | Design and implement concepts and components of the data engineering lifecycle, such as data repositories. | |
5f | Showcase working knowledge of relational databases, NoSQL data stores, Big Data engines, data warehouses, and data pipelines. | |
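Illustrative code sketches:
The short Python sketches below illustrate, for orientation only, a few of the hands-on skills listed in the table above; they are not course materials. The first accompanies sub-topic 1f: it fetches a page and extracts headings using the requests and Beautiful Soup libraries. The URL and the tag being scraped are hypothetical placeholders.

    # Fetch a page and extract data with requests + Beautiful Soup.
    # The URL and tag name below are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/books", timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Collect the text of every <h2> heading on the page.
    titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
    print(titles)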
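For sub-topics 2g-2i, a minimal sketch of issuing DDL and DML from Python. It uses the built-in sqlite3 module so the example is self-contained; the course itself works with MySQL, PostgreSQL, and IBM DB2, whose Python drivers follow the same DB-API pattern. The table and data are invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()

    # DDL: define a table.
    cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

    # DML: insert rows, then query them.
    cur.executemany(
        "INSERT INTO employees (name, salary) VALUES (?, ?)",
        [("Asha", 55000.0), ("Ravi", 62000.0)],
    )
    cur.execute("SELECT name, salary FROM employees WHERE salary > ?", (60000.0,))
    print(cur.fetchall())  # [('Ravi', 62000.0)]
    conn.close()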
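For sub-topic 3b, a minimal ETL workflow written as three Python functions. The file names, column names, and transformation rule are hypothetical.

    import csv

    def extract(path):
        # Extract: read raw rows from a CSV file.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: keep rows with a positive amount and normalize names.
        return [
            {"name": r["name"].strip().title(), "amount": float(r["amount"])}
            for r in rows
            if float(r["amount"]) > 0
        ]

    def load(rows, path):
        # Load: write the cleaned rows to a target CSV file.
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["name", "amount"])
            writer.writeheader()
            writer.writerows(rows)

    load(transform(extract("raw_sales.csv")), "clean_sales.csv")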
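For sub-topic 4b, the four MongoDB CRUD operations via the pymongo driver. The sketch assumes a MongoDB server listening on localhost; the database, collection, and document are placeholders.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")  # assumes a local server
    books = client["course_demo"]["books"]

    books.insert_one({"title": "Data Engineering 101", "copies": 3})  # Create
    doc = books.find_one({"title": "Data Engineering 101"})           # Read
    books.update_one({"_id": doc["_id"]}, {"$inc": {"copies": 1}})    # Update
    books.delete_one({"_id": doc["_id"]})                             # Delete
    client.close()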
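For sub-topic 4f, Spark DataFrame and Spark SQL basics in PySpark, showing the same query expressed both ways. The data is invented for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("Asha", 55000), ("Ravi", 62000), ("Meera", 71000)],
        ["name", "salary"],
    )

    # DataFrame API...
    df.filter(df.salary > 60000).show()

    # ...and the equivalent Spark SQL query over a temporary view.
    df.createOrReplaceTempView("employees")
    spark.sql("SELECT name, salary FROM employees WHERE salary > 60000").show()
    spark.stop()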
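For sub-topics 5c-5d, a SparkML pipeline of the kind built in the capstone, shown here for classification with logistic regression. The toy data, feature names, and save path are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 2.0, 0.0), (2.0, 3.0, 0.0), (8.0, 9.0, 1.0), (9.0, 10.0, 1.0)],
        ["f1", "f2", "label"],
    )

    # Feature extraction/transformation stage, followed by the estimator.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(df)

    model.transform(df).select("label", "prediction").show()
    model.write().overwrite().save("lr_pipeline_model")  # model persistence
    spark.stop()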
Course Outcomes:
On successful completion of the course, participants shall be able to:
- Understand core concepts of data engineering, databases, and big data technologies.
- Grasp the principles of data pipelines, ETL processes, and data warehousing.
- Use Python and SQL to manipulate and analyze data.
- Debug and troubleshoot data pipelines, optimize performance, and identify data quality issues.
- Evaluate different data engineering tools and techniques to select the best solution for a given problem.
- Design and implement complex data engineering solutions, including data pipelines, data warehouses, and machine learning models.