Skip to content

jdaarevalo/schedule_athena_queries

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Athena Query Execution Automation

Overview

This project automates the execution of Athena queries using AWS EventBridge Scheduler and AWS Step Functions (State Machine). It simplifies the process by allowing users to define queries and schedules in a YAML file. Once set up, the system automatically triggers and manages the execution of Athena queries based on the defined schedule.

Features

  • Automated Query Execution: Schedule Athena queries to run automatically without manual intervention.
  • Flexible Scheduling: Supports 'at', 'rate', and 'cron' expressions for versatile scheduling options.
  • State Machine Integration: Leverages AWS Step Functions for efficient query management and execution tracking.
  • Serverless Deployment: Built on AWS SAM for easy deployment and scalability.

schedule-athena-queries

How It Works

  1. Define Your Query: Create a YAML file into src/queries with the query definition and parameters. Use the template_query.yaml as example.
  2. Deploy the Stack: Run the provided deploy.sh script to deploy the stack to AWS.
  3. Monitor & Manage: Once deployed, the system manages query execution as per the schedule, handling retries and failures gracefully.

Getting Started

Prerequisites

  • AWS Account
  • Configured AWS CLI
  • AWS SAM CLI installed
  • Basic understanding of AWS services (EventBridge, Step Functions, Athena)

Quick Setup

  1. Clone the Repository:
git clone [email protected]:jdaarevalo/schedule_athena_queries.git
  1. Navigate to the Project Directory:
cd schedule_athena_queries
  1. Define Your Queries:
  • Create your queries as template_query.yaml file to define your Athena queries and schedules.
  1. Deploy:
  • Run the deploy.sh script with the desired aws_profile (in your ~/.aws/credentials) and the Athena queries output location:
./deploy.sh [aws_profile] [s3-athena-queries-output-location]

Sample

./deploy.sh default s3://aws-athena-query-results-{AWS_ACCOUNT_ID}-{REGION}/

Configuration

Template Query (template_query.yaml)

  • name: Unique identifier for the query schedule.
  • schedule_expression: When the query is triggered (supports 'at', 'rate', 'cron').
  • query_string: The SQL query to execute.

SAM Template (template.yaml)

  • Defines the AWS resources, roles, and policies needed for the application.

Deployment Script (deploy.sh)

  • A convenient script to package and deploy the stack to AWS.

Cost and Performance Considerations

Why Not an AWS Glue Job?

AWS Glue is a fully managed ETL service that simplifies processing large amounts of data. However, it might not be the most cost-effective for certain types of tasks:

Scenario Breakdown:

  • Daily Queries: Consider 30 daily queries, each taking 15 minutes.
  • Monthly Total: 15 minutes * 30 queries/day * 30 days = 135,000 minutes or 225 hours.

Cost Implications:

While AWS Glue only charges for the runtime ($0.44 per DPU-Hour for Python Shell jobs, billed per second with a 1-minute minimum), you pay for the job's entire duration, including idle time while waiting for Athena queries to execute. This can add up significantly:

  • Monthly Cost: 0.0625 DPUs×225 hours×0.44 USD per DPU-Hour=6.19 USD (Python shell ETL job cost).

Why Not Lambda?

AWS Lambda excels at short, event-driven processes. However, for long-running tasks like executing Athena queries, it may not be the most suitable:

Timeout Limitations:

  • 15-Minute Maximum: Lambda functions have a maximum execution time of 15 minutes, after which they time out.
  • Long-Running Queries: Queries over large datasets or complex operations can easily exceed this limit, making Lambda unsuitable for direct query management.

Cost Implications:

  • Billing for Idle Time: You're billed for the entire runtime, even if the Lambda function is just waiting for a query to complete.
  • Cost Accumulation: If the function hits its timeout and is retried, costs for each execution accumulate, potentially leading to a higher total cost compared to services designed for longer-running tasks.

The Advantages of AWS Step Functions:

AWS Step Functions can manage long-running processes more efficiently and cost-effectively than AWS Glue or Lambda for several reasons:

Efficiency and Cost:

  • Daily Executions: Consider 20 daily executions with 4 state transitions each.
  • Monthly State Transitions: 4 transitions * 900 requests = 3,600 monthly transitions.
  • Minus Free Tier: 3,600 transitions - 4,000 Free Tier = 0 billable transitions.
  • Monthly Cost: 0 billable transitions * 0.000025 USD per transition = 0.00 USD.

Flexibility:

  • Long-Running Tasks: Step Functions can pause execution between steps without incurring costs for idle time, making them ideal for managing Athena queries.
  • Complex Workflows: They allow you to define complex workflows and handle scenarios like retries and failures more gracefully than Lambda.

Conclusion:

While AWS Glue and Lambda offer valuable services for many scenarios, their time constraints and cost structure make them less suitable for executing long-running Athena queries. This project leverages AWS Step Functions to provide a more reliable, scalable, and cost-efficient solution, ensuring you only pay for what you need and avoid charges for idle time. It automates the process in a way that's mindful of AWS billing, offering significant advantages over traditional methods for managing and executing Athena queries.

Contributing

All contributions and suggestions are welcome! Please fork the repository and create a pull request with your improvements.

Guidelines

  • Ensure code and documentation consistency.
  • Provide detailed descriptions and examples for changes.

Support & Troubleshooting

  • For issues and features, please open an issue.
  • Check AWS documentation for service-specific queries and troubleshooting.

About

schedule_athena_queries

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published