Professional Certificate course in Data Engineering

130,000.00 + GST

Course Details :

  • Course Type: Online Live Delivery with self-paced courses
  • Duration: 87 hours (3.5 months for weekday batches & 6 months for weekend batches)
  • Total Lectures: 44
  • Skill Level: Intermediate
  • Assessments: Daily Assessments
  • Certificate: Yes

Outcome Expected:

  • Cloud Platform Proficiency:

Outcome: Gain proficiency in working with cloud platforms such as AWS, Azure, or Google Cloud to manage and analyze data.

  • Data Modeling and Database Design:

Outcome: Acquire skills in designing and implementing data models, creating and optimizing databases for efficient storage and retrieval.

  • ETL (Extract, Transform, Load) Processes:

Outcome: Learn to design and implement ETL processes to extract data from various sources, transform it into the desired format, and load it into a data warehouse or database.

  • Big Data Technologies:

Outcome: Gain familiarity with big data technologies such as Hadoop, Spark, and Kafka for processing and analyzing large volumes of data.

  • Streaming Data Processing:

Outcome: Learn to work with real-time data streams, process streaming data, and implement solutions for real-time analytics.

  • Data Visualization:

Outcome: Develop the ability to create meaningful visualizations using Tableau, Power BI, or other visualization tools to communicate insights effectively.

  • Career Readiness:

Outcome: Prepare for a career in data engineering and cloud analytics with a strong foundation in both technical skills and industry best practices.

Requirements:

  • Daily Assessments
  • Mini projects (module wise)
  • Live Evaluation
  • Online classes on Zoom

Target Audience:

  • Undergraduate and postgraduate students in any domain/field

Key Features:

  • No prior programming or technical skills are required; students with no technical background can join
  • Covers various data engineering topics in detail, such as Python, databases, AWS, Snowflake, Kafka, Spark, and many more
  • Placement support is provided for all students who meet the eligibility criteria
  • Industry support for every student

Curriculum

Module 1: Introduction to Data Engineering

This module provides an understanding of data engineering concepts, skills, practices, and tools essential for managing data at scale.

  • What is Data Engineering?
  • Role of Data Engineers in the Industry
  • Importance of Data Engineering in Data-driven Organizations
  • Overview of Data Engineering Tools and Technologies
  • Career Paths and Opportunities in Data Engineering

Module 2 : Python 

We will explore Python, a versatile and beginner-friendly programming language. Python is known for its readability and wide range of applications, from web development and data analysis to artificial intelligence and automation.

  • Introduction to Python
  • Basic Syntax and Data Types
  • Control Structures (Conditional Statements and Looping)
  • Functions
  • Lambda Functions
  • Data Structures (Lists, Tuples, Dictionaries, Sets)
  • File Handling
  • Error Handling (try and except)
  • List Comprehensions
  • Decorators
  • NumPy
  • Pandas
  • Regex
  • Code optimisation
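
To give a flavour of the topics above, here is a minimal, self-contained sketch combining a list comprehension, a lambda function, and a small Pandas DataFrame; the names and sample values are purely illustrative.

```python
import pandas as pd

# List comprehension: squares of the even numbers from 0 to 9
squares = [x ** 2 for x in range(10) if x % 2 == 0]

# Lambda function: a small anonymous function assigned to a name
double = lambda x: x * 2

# Pandas DataFrame built from a dictionary, then filtered
df = pd.DataFrame({"name": ["Asha", "Ravi"], "score": [82, 91]})

print(squares)          # [0, 4, 16, 36, 64]
print(double(21))       # 42
print(df[df["score"] > 85])
```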

Module 3 : RDBMS

We will explore RDBMS (Relational Database Management System) to understand the database technology that organizes data into structured tables with defined relationships. 

  • Introduction to Databases
  • MySQL: Introduction & Installation
  • SQL KEYS
  • PRIMARY KEY
  • FOREIGN KEY
  • UNIQUE KEY
  • COMPOSITE KEY
  • Normalization and Denormalization
  • ACID Properties

Module 4 : SQL

We will dive into SQL (Structured Query Language) to acquire the skills needed for managing and querying relational databases. SQL enables you to retrieve, update, and manipulate data, making it a fundamental tool for working with structured data in various applications.

  • Basic SQL Queries
  • Advanced SQL Queries
  • Joins (INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN)
  • Data Manipulation Language (DML): INSERT, UPDATE, DELETE
  • Data Definition Language (DDL): CREATE, ALTER, DROP
  • Data Control Language (DCL): GRANT, REVOKE
  • Aggregate Functions (SUM, AVG, COUNT, MAX, MIN)
  • Grouping Data with GROUP BY
  • Filtering Groups with HAVING
  • Subqueries
  • Views
  • Indexes
  • Transactions and Concurrency Control
  • Stored Procedures and Functions
  • Triggers
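
For a quick taste of aggregate functions, GROUP BY, and HAVING, here is a minimal sketch run through Python's sqlite3 module; the orders table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount)
    VALUES ('Asha', 120.0), ('Asha', 80.0), ('Ravi', 40.0);
""")

# Total spend per customer, keeping only customers who spent more than 50
rows = conn.execute("""
    SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 50
    ORDER BY total DESC
""").fetchall()

print(rows)  # [('Asha', 2, 200.0)]
```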

Module 5 : MongoDB

We delve into MongoDB to understand this popular NoSQL database, which stores data in flexible, JSON-like documents. We learn how MongoDB’s scalability and speed make it suitable for handling large volumes of unstructured data.

  • Introduction to NoSQL and MongoDB
  • Installation and Setup of MongoDB
  • MongoDB Data Model (Documents, Collections, Databases)
  • CRUD Operations (Create, Read, Update, Delete)
  • Querying Data with MongoDB
  • Indexing and Performance Optimization
  • Aggregation Framework
  • Data Modeling and Schema Design
  • Working with Embedded Documents and Arrays
  • Transactions and Atomic Operations
  • Security in MongoDB (Authentication, Authorization)
  • Replication and High Availability
  • Sharding and Scalability
  • Backup and Disaster Recovery
  • MongoDB Atlas (Cloud Database Service)
  • MongoDB Compass (GUI for MongoDB)
  • MongoDB Drivers and Client Libraries (e.g., pymongo for Python)
  • Using MongoDB with Python
  • Real-world Applications and Case Studies
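
The snippet below is a minimal CRUD sketch with pymongo; it assumes a MongoDB server is running locally on the default port, and the database, collection, and documents are purely illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # local server assumed
db = client["course_db"]
students = db["students"]

# Create
students.insert_one({"name": "Asha", "skills": ["python", "sql"]})

# Read: find documents whose skills array contains "python"
for doc in students.find({"skills": "python"}):
    print(doc["name"])

# Update and Delete
students.update_one({"name": "Asha"}, {"$set": {"level": "intermediate"}})
students.delete_one({"name": "Asha"})
```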

Module 6 : Shell Script

We explore shell scripting in the Linux environment, learning to write and execute scripts from the command-line interface. Shell scripts are text files containing a series of commands, and we discover how to use them to automate tasks.

  • Introduction to Shell Scripting
  • Basics of Shell Scripting (Variables, Comments, Quoting)
  • Input/Output in Shell Scripts
  • Control Structures (Conditional Statements, Loops)
  • Functions and Scripts Organization
  • Command Line Arguments and Options
  • String Manipulation
  • File and Directory Operations
  • Process Management (Running Commands, Background Processes)
  • Text Processing (grep, sed, awk)
  • Error Handling and Exit Status
  • Environment Variables
  • Regular Expressions in Shell Scripts
  • Debugging and Troubleshooting
  • Advanced Topics (Signals, Job Control, Process Substitution)
  • Shell Scripting Best Practices
  • Scripting with Specific Shells (Bash, Zsh, etc.)
  • Scripting for System Administration Tasks
  • Scripting for Automation and Task Orchestration

Module 7 : Git

We will study Git, a distributed version control system, to learn how it tracks changes in software code. Git allows collaborative development, enabling multiple people to work on the same project simultaneously while managing different versions of code.

  • Introduction to Version Control Systems (VCS) and Git
  • Installation and Setup of Git
  • Basic Git Concepts (Repositories, Commits, Branches, Merging)
  • Git Workflow (Local and Remote Repositories)
  • Creating and Cloning Repositories
  • Git Configuration (Global and Repository-specific Settings)
  • Tracking Changes with Git (git add, git commit)
  • Viewing Commit History (git log)
  • Branching and Merging (git branch, git merge)
  • Resolving Merge Conflicts
  • Working with Remote Repositories (git remote, git push, git pull)
  • Collaboration with Git (Forking, Pull Requests, Code Reviews)
  • Git Tags and Releases
  • Git Hooks
  • Rebasing and Cherry-picking
  • Git Reset and Revert
  • Git Stash
  • Git Workflows (e.g., Gitflow, GitHub Flow)

Module 8 : Cloud

We delve into cloud computing, which involves delivering various computing services (such as servers, storage, databases, networking, software, and analytics) over the internet.

  • Introduction to Cloud Computing and Data Engineering
  • Overview of Cloud Providers (AWS and Azure)
  • Cloud Storage Solutions (AWS S3, Azure Blob Storage)
  • Cloud Database Services (AWS RDS, Azure SQL Database)
  • Data Warehousing in the Cloud (AWS Redshift, Azure Synapse Analytics)
  • Cloud Data Integration and ETL (AWS Glue, Azure Data Factory)
  • Big Data Processing in the Cloud (AWS EMR, Azure HDInsight)
  • Real-time Data Processing and Streaming Analytics (AWS Kinesis, Azure Stream Analytics)
  • NoSQL Databases in the Cloud (AWS DynamoDB, Azure Cosmos DB)
  • Data Lakes and Analytics Platforms (AWS Athena, Azure Databricks)
  • Machine Learning and AI Services (AWS SageMaker, Azure Machine Learning)
  • Data Visualization and BI Tools (AWS QuickSight, Power BI)
  • Cloud Security and Compliance
  • Cost Management and Optimization in the Cloud
  • Best Practices for Cloud Data Engineering
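
As a small example of working with cloud storage from code, here is a hedged sketch using boto3 for AWS S3; the bucket name and file names are assumptions, and valid AWS credentials must already be configured in the environment.

```python
import boto3

s3 = boto3.client("s3")  # uses credentials from the environment / AWS config

# Upload a local file into a (hypothetical) bucket under a raw/ prefix
s3.upload_file("daily_sales.csv", "my-data-bucket", "raw/daily_sales.csv")

# List the objects stored under that prefix
response = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```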

Module 9: System Design

This module provides an in-depth exploration of the principles, methodologies, and best practices involved in designing scalable, reliable, and maintainable software systems.

  • Load Balancers and High Availability
  • Horizontal vs Vertical Scaling
  • Monolithic vs Microservice Architectures
  • Distributed Messaging Services and AWS SQS
  • CDN (Content Delivery Network)
  • Caching and Scalability
  • AWS API Gateway

Module 10 : Snowflake

In this module, we will study Snowflake to grasp modern cloud-based data warehousing, focusing on its architecture, data sharing, scalability, and data analytics applications.

  • Introduction to Snowflake
  • Difference between Data Lake, Data Warehouse, Delta Lake, and Database
  • Dimension and Fact Tables
  • Roles and Users
  • Data Modeling and Snowpipe
  • MOLAP and ROLAP
  • Partitioning and Indexing
  • Data Marts, Data Cubes & Caching
  • Data Masking
  • Handling JSON Files
  • Data Loading from S3 and Transformation
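
A minimal connection sketch with the Snowflake Connector for Python is shown below; all account details are placeholders to be replaced with your own, and the warehouse, database, and schema names are invented for the example.

```python
import snowflake.connector

# Placeholder credentials: replace with your own Snowflake account details
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT_IDENTIFIER",
    warehouse="COMPUTE_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)

cur = conn.cursor()
cur.execute("SELECT CURRENT_VERSION()")
print(cur.fetchone())

cur.close()
conn.close()
```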

Module 11 : Data cleaning

We will engage in data cleaning to understand the process of identifying and correcting errors or inconsistencies in datasets, ensuring data accuracy and reliability for analysis and reporting. 

  • Structured vs Unstructured Data using Pandas
  • Common Data issues and how to clean them
  • Data cleaning with Pandas and PySpark
  • Handling JSON Data
  • Meaningful data transformation (Scaling and Normalization)
  • Example: Movies Data Set Cleaning
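
In the spirit of the movies example above, here is a minimal Pandas cleaning sketch; the file name and column names are invented for illustration.

```python
import pandas as pd

df = pd.read_csv("movies.csv")                     # hypothetical input file

df = df.drop_duplicates()                          # remove exact duplicate rows
df["release_year"] = pd.to_numeric(df["release_year"], errors="coerce")
df["genre"] = df["genre"].fillna("Unknown")        # fill missing categories
df = df.dropna(subset=["title"])                   # drop rows without a title

# Min-max scaling of a numeric column (a simple normalization step)
df["rating_scaled"] = (df["rating"] - df["rating"].min()) / (
    df["rating"].max() - df["rating"].min()
)
```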

Module 12 : Hadoop

This module provides a comprehensive introduction to Hadoop, its core components, and the broader ecosystem of tools and technologies for big data processing and analytics. 

  • Introduction to Big Data
  • Characteristics and Challenges of Big Data
  • Overview of Hadoop Ecosystem
  • Hadoop Distributed File System (HDFS)
  • Hadoop MapReduce Framework
  • Hadoop Cluster Architecture
  • Hadoop Distributed Processing
  • Hadoop YARN (Yet Another Resource Negotiator)
  • Hadoop Data Storage and Retrieval
  • Hadoop Data Processing and Analysis
  • Hadoop Streaming for Real-time Data Processing
  • Hadoop Ecosystem Components:
    • HBase for NoSQL Database
    • Hive for Data Warehousing and SQL
    • Pig for Data Flow Scripting
    • Spark for In-memory Data Processing
    • Sqoop for Data Import/Export
    • Flume for Data Ingestion
    • Oozie for Workflow Management
    • Kafka for Real-time Data Streaming
  • Hadoop Security and Governance
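
One way to run MapReduce logic from Python is Hadoop Streaming, which pipes data through ordinary scripts via standard input and output. The word-count mapper below is a minimal sketch; the submission command and paths in the comments are illustrative only.

```python
#!/usr/bin/env python3
# mapper.py: emits "word<TAB>1" for every word read from standard input.
# A matching reducer script would sum the counts per word. The job is
# submitted roughly as:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input /data/in -output /data/out
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```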

Module 13 : Kafka

In this module, we learn about Kafka, an open-source stream-processing platform used for ingesting, storing, processing, and distributing real-time data streams. We explore Kafka’s architecture, topics, producers, consumers, and its role in handling large volumes of data with low latency.

  • Introduction to Kafka
  • Producers, Consumers, and Consumer Groups
  • Topics, Offsets, Partitions, and Brokers
  • ZooKeeper and Replication
  • Batch vs Real-time Streaming
  • Real-time Streaming Process
  • Assignment and Task
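
Below is a minimal producer/consumer sketch using the kafka-python client; it assumes a broker is running on localhost:9092, and the topic name and messages are invented for the example.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: serialize Python dicts to JSON and send them to a topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": "asha", "page": "/home"})
producer.flush()

# Consumer: read the same topic from the beginning
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,          # stop iterating after 5 s of silence
)
for message in consumer:
    print(message.offset, message.value)
```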

Module 14 : Spark

In this module, we will explore Spark, an open-source, distributed computing framework that provides high-speed, in-memory data processing for big data analytics.

  • Introduction to Apache Spark
  • Features and Advantages of Spark over Hadoop MapReduce
  • Spark Architecture Overview
  • Resilient Distributed Datasets (RDDs)
  • Directed Acyclic Graph (DAG) Execution Engine
  • Spark Core and Spark SQL
  • DataFrames and Datasets in Spark
  • Spark Streaming for Real-time Data Processing
  • Structured Streaming for Continuous Applications
  • Machine Learning with MLlib in Spark
  • Graph Processing with GraphX in Spark
  • Spark Performance Tuning and Optimization Techniques
  • Integrating Spark with Other Big Data Technologies (Hive, HBase, Kafka, etc.)
  • Spark Deployment Options (Standalone, YARN, Mesos)
  • Spark Cluster Management and Monitoring
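
As a minimal PySpark sketch, a DataFrame is read and aggregated as shown below; the CSV file and column names are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-demo").getOrCreate()

# Hypothetical input: an events.csv with user_id and event_type columns
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Count events per user and show the ten most active users
df.groupBy("user_id") \
  .agg(F.count("*").alias("event_count")) \
  .orderBy(F.desc("event_count")) \
  .show(10)

spark.stop()
```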

Module 15 : Airflow

Here, we will explore Airflow to understand its role in orchestrating and automating workflows, scheduling tasks, managing data pipelines, and monitoring job execution.

  • Why and What is Airflow
  • Airflow UI
  • Running Your First DAG
  • Grid View
  • Graph View
  • Landing Times View
  • Calendar View
  • Gantt View
  • Code View
  • Core Concepts of Airflow
  • DAGs
  • Scope
  • Operators
  • Control Flow
  • Tasks and Task Instances
  • Databases and Executors
  • ETL/ELT Process Implementation
  • Monitoring ETL Pipelines with Airflow
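
A minimal DAG sketch, written against Airflow 2.x with placeholder task logic, might look like this; the DAG id and schedule are arbitrary choices for the example.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")    # placeholder logic

def load():
    print("writing data to the warehouse")   # placeholder logic

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task   # control flow: extract runs before load
```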

Module 16 : Databricks

This module provides a comprehensive introduction to Databricks. You will learn how to leverage Databricks to build and deploy scalable data pipelines.

  • Introduction to Databricks
  • Overview of Databricks Unified Analytics Platform
  • Setting up Databricks Environment
  • Databricks Workspace: Notebooks, Clusters, and Libraries
  • Spark Architecture in Databricks
  • Spark SQL and DataFrame Operations in Databricks Notebooks
  • Data Import and Export in Databricks
  • Working with Delta Lake for Data Versioning and Transaction Management
  • Performance Optimization Techniques in Databricks
  • Advanced Analytics and Machine Learning with MLlib in Databricks
  • Collaboration and Sharing in Databricks Workspace
  • Monitoring and Debugging Spark Jobs in Databricks
  • Integrating Databricks with Other Data Engineering Tools and Services

Module 17 : Prometheus

We will study Prometheus to explore its role as an open-source monitoring and alerting toolkit, used for collecting and visualizing metrics from various systems, aiding in performance optimization and issue detection.

  • Introduction to Prometheus
  • Prometheus Server and Architecture
  • Installation and Setup of Prometheus
  • Understanding Prometheus UI (User Interface)
  • Node Exporters: Monitoring System Metrics
  • Prometheus Query Language (PromQL) for Aggregation, Functions, and Operators
  • Integrating Python Applications with Prometheus for Custom Metrics
  • Key Metric Types: Counter, Gauge, Summary, and Histogram
  • Recording Rules for Pre-computed Metrics
  • Alerting Rules for Generating Alerts
  • Alert Manager: Installation and Configuration
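
Exposing a custom metric from a Python application can be sketched with the prometheus_client library, as below; the port, metric name, and simulated workload are arbitrary choices for the example.

```python
import random
import time
from prometheus_client import Counter, start_http_server

# A Counter only ever increases; Prometheus scrapes it from /metrics
REQUESTS_TOTAL = Counter("demo_requests_total", "Total requests handled")

start_http_server(8000)   # serves metrics at http://localhost:8000/metrics

while True:
    REQUESTS_TOTAL.inc()                    # simulate handling a request
    time.sleep(random.uniform(0.1, 1.0))
```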

Module 18 : Datadog

We will study Datadog, a monitoring and analytics platform for cloud-scale applications. It provides developers, operations teams, and business users with insights into their applications, infrastructure, and overall performance.

  • Metrics
  • Dashboards
  • Alerts
  • Monitors
  • Tracing
  • Log Monitoring
  • Integrations

Module 19 : Docker

In this module, we will cover Docker, an open-source platform used to develop, ship, and run applications in containers. Containers are lightweight, portable, and self-sufficient units that package an application along with its dependencies, libraries, and configuration files, enabling consistent deployment across different environments. 

  • What is Docker
  • Installation of Docker
  • Docker Images and Containers
  • Dockerfile
  • Docker Volumes
  • Docker Registry
  • Containerizing Applications with Docker (Hands-on)
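
Although Docker is usually driven from the command line, the Docker SDK for Python (not part of the syllabus above, used here only for illustration) shows the same ideas programmatically; it assumes the Docker daemon is running locally and can pull images.

```python
import docker

client = docker.from_env()            # talk to the local Docker daemon

# Run a lightweight container, capture its output, then remove it
output = client.containers.run(
    "python:3.11-slim",
    ["python", "-c", "print('hello from inside a container')"],
    remove=True,
)
print(output.decode())

# List the images available on this machine
for image in client.images.list():
    print(image.tags)
```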

Module 20 : Kubernetes

This module provides a comprehensive introduction to Kubernetes, an open-source container orchestration platform for automating deployment, scaling, and management of containerized applications.

  • Nodes
  • Pods
  • ReplicaSets
  • Deployments
  • Namespaces
  • Ingress

FAQs

Q. Are there any benefits to the certification?

Ans. The certification is provided by IFACET – IIT Kanpur.

Q. Will the certification help with placements?

Ans. Yes, 100% placement support is provided for all students who meet the eligibility criteria.

Q. Does the certification lead to alumni status from IITK?

Ans. No.

Instructor Profile

Name: Shabarinath Premlal

An embedded engineering graduate with a decade of experience in embedded hardware board design and IoT solutions. Has served in leadership positions on automation projects with several industries and institutions. Experienced in visioning, costing, and executing projects from inception to launch, and able to provide a structured framework for breaking complex situations down into simple strategic imperatives.