Data Engineering
Basic Info
Faculty Profile
Course Contents
Course Outcomes
Assignments
Exams
Further Readings

Course Title:

Certificate Course in Data Engineering



Course Description:

This certificate course is designed to equip learners with the essential skills and knowledge to excel in the field of data engineering. Upon completion, participants will be able to comprehend the data engineering lifecycle, master Python programming for data manipulation and analysis, work proficiently with relational and NoSQL databases, apply big data technologies such as Hadoop and Spark, use machine learning techniques to extract valuable insights, and design and implement robust data pipelines and data warehouses.



Course instructional level:


Beginner/Intermediate

Course Duration:


6 Months
Hours: 150

Course coordinator:


Dr. Amit Pimpalkar

Course Co-coordinator:


Dr. Poonam Agarkar


Course coordinator's profile(s):


Dr. Amit Pimpalkar is an assistant professor in the School of Computer Science and Engineering at Ramdeobaba University, Nagpur, with more than 19 years of academic and industrial experience. He earned his PhD in Computer Science and Engineering from Sathyabama Institute of Science and Technology, Chennai; his Master of Technology from Shri Ram Institute of Technology, Jabalpur; his Bachelor of Engineering from Nagpur University, Nagpur; and a diploma in Computer Technology from the Maharashtra State Board of Technical Education, Mumbai. He has published over 80 research articles and book chapters in international journals and conference proceedings, holds ten national patents and one international patent, and has four copyrights to his credit from the Government of India. He is a life member of ISTE, IAENG, ICSES, CSTA, and IEEE, and a Board of Studies member at Suryodaya College of Engineering and Technology, Nagpur. He has served as a judge and keynote speaker for many projects, hackathons, research paper presentations, and conferences, and is an active reviewer for numerous national and international conferences and journals, including IEEE Access, Wiley, Hindawi, Elsevier, Heliyon, and the Journal of Sensors. He is a PhD supervisor at Ramdeobaba University, Nagpur, guiding many postgraduate and undergraduate students. With a demonstrated history of working in the software industry, he is skilled in software testing, Python, and C. His research interests include data mining, machine learning, and natural language processing.


Course Co-coordinator's profile(s):


Dr. Poonam T. Agarkar earned her doctoral degree in the field of Wireless Sensor Networks, designing and developing an efficient routing protocol, from RTMNU, with Y.C.C.E. Nagpur as her research centre. She holds an M.Tech in Electronics and a B.E. in Electronics and Telecommunication from Y.C.C.E., RTMNU. She worked as an assistant professor for 8 years and as a visiting professor at Government Engineering College Nagpur for 3.5 years, giving her a total of 11.5 years of work experience. She has published papers at 8 international conferences (including 4 IEEE conferences), in 6 Scopus-indexed journals, and in 2 SCIE journals. She has taught subjects such as CCN, VHDL, Electronics Instrumentation, Communication Engineering, Image Processing, and Data Science.

Course Contents:



Module/Topic name Sub-topic Duration
1. Introduction to Data Engineering and Python 30 Hours

1a

The data engineer's role, and the various stages and concepts of the data engineering lifecycle.

1b

Data engineering technologies such as Relational Databases, NoSQL Data Stores, and Big Data Engines, data security, governance, and compliance.

1c

Learn Python for Data Science and Software Development.

1d

Python programming logic: Variables, Data Structures, Branching, Loops, Functions, Objects & Classes.

1e

Python libraries such as Pandas and NumPy, and developing code using Jupyter Notebooks.

1f

Access and web scrape data using APIs and Python libraries like Beautiful Soup.
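As a small illustration of the Pandas and NumPy material in this module (not part of the official course content; the dataset and column names are invented for the example), a DataFrame can be cleaned and aggregated in a few lines:

```python
import pandas as pd
import numpy as np

# Toy dataset of the kind used in the course exercises (invented values)
df = pd.DataFrame({
    "city": ["Nagpur", "Mumbai", "Pune", "Nagpur"],
    "temp_c": [31.0, 29.5, np.nan, 33.0],
})

# Fill the missing reading with the column mean, then aggregate per city
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())
summary = df.groupby("city")["temp_c"].mean().round(1)
print(summary.to_dict())  # → {'Mumbai': 29.5, 'Nagpur': 32.0, 'Pune': 31.2}
```

The same fill-then-aggregate pattern recurs throughout the data-cleaning topics later in the course.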
2. Python for Data Engineering and RDBMS 30 Hours

2a

Python for working with and manipulating data; implement web scraping and use APIs to extract data with Python.

2b

Play the role of a Data Engineer working on a real project to extract, transform, and load data, using Jupyter notebooks and IDEs.

2c

Data, databases, relational databases, and cloud databases.

2d

Data models, relational databases, and relational model concepts (including schemas and tables).

2e

Entity Relationship Diagram and design a relational database for a specific use case.

2f

Popular DBMSes including MySQL, PostgreSQL, and IBM DB2

2g

Analyze data within a database using SQL and Python, DDL commands.

2h

Construct basic to intermediate level SQL queries using DML commands.

2i

Advanced SQL techniques like views, transactions, stored procedures, and joins
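A minimal sketch of the DDL and DML material above, using Python's built-in sqlite3 module as a stand-in for the MySQL, PostgreSQL, and DB2 systems the course covers (the table and data are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define a table
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# DML: insert rows
cur.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                [("Asha", 250.0), ("Ravi", 120.5), ("Asha", 75.0)])
conn.commit()

# A basic-to-intermediate aggregation query of the kind covered in 2h
rows = cur.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # → [('Asha', 325.0), ('Ravi', 120.5)]
```

The same queries run largely unchanged against the server-based DBMSes used in the labs; only the connection call differs.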
3. Data Warehouse and Data Pipelines, BI Dashboards 30 Hours

3a

Describe and contrast Extract, Transform, Load (ETL) processes and Extract, Load, Transform (ELT) processes, and batch vs. concurrent modes of execution.

3b

Implement an ETL workflow using Bash and Python functions.

3c

Data pipeline components, processes, tools, and technologies.

3d

Populate a data warehouse, and model and query data using CUBE, ROLLUP, and materialized views.

3e

Popular data analytics and business intelligence tools and vendors, and create data visualizations using IBM Cognos Analytics.

3f

Design and load data into a data warehouse, write aggregation queries, create materialized query tables, and create an analytics dashboard.

3g

Purpose of analytics and Business Intelligence (BI) tools; IBM Cognos Analytics and Google Looker Studio; working with DB2 data in IBM Cognos Analytics.
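The extract-transform-load workflow described in 3b can be sketched as three small Python functions (a simplified illustration, not course material; sqlite3 stands in for a real data warehouse, and the CSV data and table names are invented):

```python
import csv
import io
import sqlite3

def extract(raw_csv: str) -> list:
    # Extract: parse raw CSV text into dict rows
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list) -> list:
    # Transform: normalise names, coerce types, drop unparseable records
    out = []
    for r in rows:
        try:
            out.append({"name": r["name"].strip().title(),
                        "score": float(r["score"])})
        except ValueError:
            continue  # skip rows whose score is not a number
    return out

def load(rows: list, conn: sqlite3.Connection) -> None:
    # Load: write cleaned rows into the target table
    conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score REAL)")
    conn.executemany("INSERT INTO scores VALUES (:name, :score)", rows)
    conn.commit()

raw = "name,score\n alice ,91\nbob,not-a-number\ncarol,78\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
result = conn.execute("SELECT * FROM scores ORDER BY name").fetchall()
print(result)  # → [('Alice', 91.0), ('Carol', 78.0)]
```

Chaining the three functions mirrors the pipeline stages the module builds with Bash scripts and, later, orchestration tools.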
4. Introduction to NoSQL Databases and Big Data with Spark and Hadoop 30 Hours

4a

The four main categories of NoSQL repositories; characteristics, features, benefits, limitations, and applications of the more popular big data processing tools.

4b

MongoDB tasks including create, read, update, and delete (CRUD) operations.

4c

Execute keyspace, table, and CRUD operations in Cassandra.

4d

Impact of big data, including use cases, tools, and processing methods.

4e

Apache Hadoop architecture, ecosystem, practices, and user-related applications, including Hive, HDFS, HBase, Spark, and MapReduce.

4f

Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.

4g

Spark’s RDDs and data sets, optimize Spark SQL using Catalyst and Tungsten, and use Spark’s development and runtime environment options.
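Hadoop's MapReduce model, listed in 4e, can be illustrated in pure Python by spelling out the map, shuffle, and reduce phases on a toy word-count job (a conceptual sketch only; real MapReduce jobs run distributed across a cluster, with the framework performing the shuffle):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str) -> list:
    # Map: emit a (word, 1) pair for each word in a line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs) -> dict:
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups: dict) -> dict:
    # Reduce: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big pipelines", "data pipelines at scale"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(mapped))
print(counts)  # → {'big': 2, 'data': 2, 'pipelines': 2, 'at': 1, 'scale': 1}
```

The same decomposition underlies Spark's RDD transformations, where map and reduce steps are expressed as chained function calls rather than separate jobs.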
5. Machine Learning with Apache Spark and Capstone Project 30 Hours

5a

Describe ML, explain its role in data engineering, summarize generative AI, discuss Spark's uses, and analyze ML pipelines and model persistence.

5b

Evaluate ML models, distinguish between regression, classification, and clustering models, and compare data engineering pipelines with ML pipelines.

5c

Construct the data analysis processes using Spark SQL, and perform regression, classification, and clustering using SparkML.

5d

Demonstrate connecting to Spark clusters, building ML pipelines, performing feature extraction and transformation, and persisting models.

5e

Design and implement various concepts and components in the data engineering lifecycle, such as data repositories.

5f

Showcase working knowledge with relational databases, NoSQL data stores, big data engines, data warehouses, and data pipelines.
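The regression task that module 5 performs with SparkML can be previewed with NumPy's least-squares solver (a toy illustration with invented data, not the SparkML API itself):

```python
import numpy as np

# Toy regression of the kind module 5 performs with SparkML,
# here solved with NumPy so it runs without a Spark cluster.
# Design matrix: a bias column of ones plus one feature column.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])  # roughly y = 1 + 2x (invented data)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
print(round(float(intercept), 2), round(float(slope), 2))  # → 1.15 1.94
```

In SparkML the equivalent fit is distributed over a cluster and wrapped in a pipeline stage, but the fitted coefficients answer the same question: how the target varies with the feature.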


Course Outcomes:


On successful completion of the course, participants shall be able to:
  • Understand core concepts of data engineering, databases, and big data technologies.
  • Grasp the principles of data pipelines, ETL processes, and data warehousing.
  • Use Python and SQL to manipulate and analyze data.
  • Debug and troubleshoot data pipelines, optimize performance, and identify data quality issues.
  • Evaluate different data engineering tools and techniques to select the best solution for a given problem.
  • Design and implement complex data engineering solutions, including data pipelines, data warehouses, and machine learning models.