DeepLearning.AI Data Engineering Specialization 🌟

Welcome to my repository for the DeepLearning.AI Data Engineering Professional Certificate! This repo contains code, quizzes, and personal notes from the specialization, documenting my progress through its data engineering concepts and tools.

📚 Overview

The Data Engineering Specialization is a four-course program that teaches learners to design, build, and manage data pipelines and architectures. This repository documents my hands-on work with the course material.

Courses

Course 1: Introduction to Data Engineering

  • Key Topics:
    • Data engineering lifecycle and undercurrents
    • Designing data architectures on AWS
    • Implementing batch and streaming pipelines
  • Content:
    • Notes on requirements gathering and stakeholder collaboration
    • Code samples for batch and streaming pipelines
    • Architecture diagrams and design considerations

Course 2: Data Ingestion and DataOps

  • Key Topics:
    • Working with source systems (relational and NoSQL databases)
    • Data ingestion techniques (batch and streaming)
    • DataOps practices (CI/CD, Infrastructure as Code, data quality)
  • Content:
    • Scripts for data ingestion from APIs and message queues
    • Terraform configurations for AWS resources
    • Airflow DAGs for orchestrating data pipelines
    • Data quality tests using Great Expectations
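As a rough illustration of the data quality item above, here is a minimal sketch in plain Python. It deliberately does not use the Great Expectations API; the function name `check_records`, the `id` and `amount` fields, and the two rules (completeness and uniqueness) are hypothetical choices made for this example, not taken from the course material.

```python
from typing import Any


def check_records(records: list[dict[str, Any]], required: list[str]) -> list[str]:
    """Return a list of human-readable data quality failures."""
    failures = []
    seen_ids = set()
    for i, row in enumerate(records):
        # Completeness: every required column must be present and non-null.
        for col in required:
            if row.get(col) is None:
                failures.append(f"row {i}: missing {col}")
        # Uniqueness: the hypothetical "id" column must not repeat.
        row_id = row.get("id")
        if row_id is not None:
            if row_id in seen_ids:
                failures.append(f"row {i}: duplicate id {row_id}")
            seen_ids.add(row_id)
    return failures


# Hypothetical sample batch with one null value and one repeated id.
sample = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 1, "amount": 5.0},
]
problems = check_records(sample, required=["id", "amount"])
print(problems)
```

In a pipeline, a check like this would typically run right after ingestion so that a non-empty failure list can stop the batch before it reaches downstream tables.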

Course 3: Data Storage and Retrieval

  • Key Topics:
    • Storage systems (object, block, file storage)
    • Data lake and data warehouse architectures
    • Query optimization and performance tuning
  • Content:
    • Implementations of data lakehouse architectures
    • Advanced SQL queries and performance comparisons
    • Notes on storage formats and indexing strategies

Course 4: Data Modeling and Transformation

  • Key Topics:
    • Data modeling techniques (normalization, star schema, data vault)
    • Transformations for analytics and machine learning
    • Batch and streaming data processing
  • Content:
    • Data models and schemas for different use cases
    • PySpark code for data transformations
    • Preprocessing pipelines for machine learning datasets
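To make the star schema topic above a little more concrete, the following is a small sketch in plain Python (no PySpark) that splits flat order rows into one dimension table and one fact table. The function `to_star_schema`, the column names, and the sample rows are all hypothetical and exist only for illustration.

```python
def to_star_schema(orders):
    """Split flat order rows into a customer dimension and an order fact table."""
    customer_keys = {}  # natural key (customer name) -> surrogate key
    dim_customer = []
    fact_orders = []
    for order in orders:
        name = order["customer"]
        if name not in customer_keys:
            # Assign the next surrogate key the first time a customer appears.
            customer_keys[name] = len(customer_keys) + 1
            dim_customer.append(
                {"customer_key": customer_keys[name], "customer": name, "city": order["city"]}
            )
        # The fact row keeps the measure and points at the dimension by key.
        fact_orders.append(
            {
                "order_id": order["order_id"],
                "customer_key": customer_keys[name],
                "amount": order["amount"],
            }
        )
    return dim_customer, fact_orders


# Hypothetical flat input: customer attributes are repeated on every order.
flat_orders = [
    {"order_id": 100, "customer": "Ana", "city": "Lima", "amount": 25.0},
    {"order_id": 101, "customer": "Ben", "city": "Oslo", "amount": 40.0},
    {"order_id": 102, "customer": "Ana", "city": "Lima", "amount": 15.0},
]
dim_customer, fact_orders = to_star_schema(flat_orders)
print(dim_customer)
print(fact_orders)
```

The point of the split is that descriptive attributes such as `city` are stored once per customer, while the fact table holds only keys and measures.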

🛠 Skills Developed

  • Data Architecture Design
  • Data Ingestion Techniques
  • DataOps Practices
  • Data Storage and Retrieval
  • Data Modeling
  • Data Transformation and Orchestration

🔧 Technologies Used

  • Programming Languages: Python, SQL
  • Cloud Platforms: AWS
  • Data Processing Frameworks: Apache Spark, PySpark, Pandas
  • Orchestration Tools: Apache Airflow
  • Infrastructure as Code: Terraform
  • Data Quality Tools: Great Expectations
  • Databases and Storage: MySQL, PostgreSQL, MongoDB, Amazon S3
  • Others: REST APIs, Message Queues, Streaming Platforms

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📫 Contact

Feel free to reach out via LinkedIn or email with any questions or collaboration ideas!