E-Commerce ETL Pipeline

This project implements an ETL pipeline for the Brazilian E-Commerce dataset using PySpark. The pipeline reads the orders, order items, and products CSV files, processes and aggregates the data, and writes the result in Parquet format.

Prerequisites

Before running the ETL pipeline, ensure you have the following software installed:

Python 3.11
Java Development Kit (JDK) 8 or higher

Installing Dependencies

Install the required Python packages using pip:

pip install pyspark pytest pyyaml

Project Structure

pg_test_task/
│
├── config/
│   └── brazil.yml
│
├── tests/
│   └── test_etl.py
│
├── .github/
│   └── workflows/
│       └── ci.yml
│
├── etl.py
├── config.py
├── main.py
├── requirements.txt
└── README.md

etl.py: Contains the ETL class and its methods.
tests/test_etl.py: Contains unit tests for the ETL pipeline.
config.py: Contains the Config class for loading configurations.
main.py: Main script to run the ETL pipeline.
config/brazil.yml: Configuration file for the Brazilian E-Commerce dataset.
.github/workflows/ci.yml: GitHub Actions workflow that runs the tests on every pull request.
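
The README does not show the contents of config.py, so the following is only a minimal sketch of what a Config class for this layout might look like. It assumes the four top-level keys used in the example configuration in this README (input_paths, output_path, columns, aggregations). The names REQUIRED_KEYS, from_dict, and from_yaml are hypothetical and not taken from the project.

```python
from dataclasses import dataclass

# Top-level keys assumed from the example configuration in this README.
REQUIRED_KEYS = ("input_paths", "output_path", "columns", "aggregations")


@dataclass(frozen=True)
class Config:
    """Hypothetical container for the parsed YAML configuration."""

    input_paths: dict
    output_path: str
    columns: dict
    aggregations: dict

    @classmethod
    def from_dict(cls, raw: dict) -> "Config":
        # Fail early with a clear message if a section is missing.
        missing = [key for key in REQUIRED_KEYS if key not in raw]
        if missing:
            raise ValueError(f"missing configuration keys: {', '.join(missing)}")
        return cls(**{key: raw[key] for key in REQUIRED_KEYS})

    @classmethod
    def from_yaml(cls, path: str) -> "Config":
        # PyYAML is imported lazily so from_dict works without it installed.
        import yaml

        with open(path, encoding="utf-8") as handle:
            return cls.from_dict(yaml.safe_load(handle))
```

With PyYAML installed, Config.from_yaml("config/brazil.yml") would return an object whose output_path attribute holds the configured output location. Validating the keys up front keeps a typo in the YAML file from surfacing later as an error inside a Spark job.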

Configuration

The configuration file config/brazil.yml contains the paths to the input datasets, the output path, the column mappings, and the aggregation settings. Here's an example configuration:

input_paths:
  orders: 'data/olist_orders_dataset.csv'
  order_items: 'data/olist_order_items_dataset.csv'
  products: 'data/olist_products_dataset.csv'
output_path: 'data/output/parquet'
columns:
  orders:
    order_id: 'order_id'
    order_purchase_timestamp: 'order_purchase_timestamp'
  order_items:
    order_id: 'order_id'
    product_id: 'product_id'
    price: 'price'
  products:
    product_id: 'product_id'
    product_category_name: 'product_category_name'
aggregations:
  group_by: ['product_id', 'week']
  aggregate_column: 'price'
  aggregation_function: 'sum'
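
To make the aggregations block more concrete, here is a plain-Python sketch of what it appears to describe: the price of each order item summed per product and week. This is not the project's PySpark code. It assumes that week is derived from order_purchase_timestamp, that a week starts on Monday, that timestamps use the format YYYY-MM-DD HH:MM:SS, and that items are matched to orders on order_id. The function names week_start and aggregate are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Mirrors the `aggregations` block of the example configuration.
AGGREGATIONS = {
    "group_by": ["product_id", "week"],
    "aggregate_column": "price",
    "aggregation_function": "sum",
}


def week_start(timestamp: str) -> str:
    """Return the Monday of the timestamp's week (assumed week definition)."""
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
    return (day - timedelta(days=day.weekday())).isoformat()


def aggregate(orders, order_items, aggregations=AGGREGATIONS):
    """Match items to orders on order_id, derive `week`, and sum per group."""
    if aggregations["aggregation_function"] != "sum":
        raise ValueError("this sketch only handles 'sum'")
    purchased_at = {o["order_id"]: o["order_purchase_timestamp"] for o in orders}
    totals = defaultdict(float)
    for item in order_items:
        timestamp = purchased_at.get(item["order_id"])
        if timestamp is None:
            continue  # inner-join behaviour: skip items without a matching order
        row = dict(item, week=week_start(timestamp))
        key = tuple(row[column] for column in aggregations["group_by"])
        totals[key] += float(row[aggregations["aggregate_column"]])
    return dict(totals)


# Made-up sample rows, for illustration only.
orders = [
    {"order_id": "o1", "order_purchase_timestamp": "2018-01-03 10:00:00"},
    {"order_id": "o2", "order_purchase_timestamp": "2018-01-05 18:30:00"},
    {"order_id": "o3", "order_purchase_timestamp": "2018-01-09 09:15:00"},
]
order_items = [
    {"order_id": "o1", "product_id": "p1", "price": "10.00"},
    {"order_id": "o2", "product_id": "p1", "price": "5.50"},
    {"order_id": "o3", "product_id": "p1", "price": "7.00"},
    {"order_id": "o3", "product_id": "p2", "price": "20.00"},
]
weekly_sales = aggregate(orders, order_items)
```

In this made-up sample, the two orders placed in the week starting 2018-01-01 are combined into a single entry for product p1, while the order from the following week produces separate entries for p1 and p2.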

Running the ETL Pipeline

To run the ETL pipeline, execute the main.py script with the path to the configuration file:

python main.py config/brazil.yml
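
The command above passes the configuration path as a single positional argument. The README does not show main.py itself, so the following is only a sketch of how such an entry point could parse that argument with argparse. The function names parse_args and main, and the commented-out calls to Config and ETL, are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv defaults to sys.argv[1:] when None."""
    parser = argparse.ArgumentParser(description="Run the e-commerce ETL pipeline.")
    parser.add_argument(
        "config_path",
        help="path to a YAML configuration file, e.g. config/brazil.yml",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Hypothetical wiring; the real class and method names may differ:
    #   config = Config(args.config_path)
    #   ETL(config).run()
    return args.config_path
```

Calling main(["config/brazil.yml"]) from a test exercises the same parsing path as the command line, and running the script without an argument makes argparse print a usage message and exit.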

Running Tests

To run the unit tests, use the following command:

pytest -v tests/

Continuous Integration

The project includes a GitHub Actions workflow configuration in .github/workflows/ci.yml to run the tests on every pull request. Here is the configuration:

name: CI

on: [pull_request]

jobs:
  test:
    runs-on: windows-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.11'
    - name: Install dependencies
      run: |
        pip install -r requirements.txt
    - name: Run tests
      env:
        PYTHONPATH: ${{ github.workspace }}
      run: |
        pytest -v tests/
