Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query performance optimizations #82

Open
zhousun opened this issue Jan 15, 2025 · 0 comments
Open

Query performance optimizations #82

zhousun opened this issue Jan 15, 2025 · 0 comments
Labels

Comments

@zhousun
Copy link
Contributor

zhousun commented Jan 15, 2025

Description

Query performance is critical for any analytics engine. This epic tracks known issues to optimize pg_mooncake's performance.

ClickBench is an excellent starting point for performance engineering. While pg_mooncake already performs great, there is still plenty of room for improvement.

Issues in this epic

Query-specific optimization

  • [ClickBench-Q29] DuckDB performs poorly on aggregation queries over multiple Parquet files, as also seen in "DuckDB (Parquet, partitioned)" and "ParadeDB (Parquet, partitioned)".
SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), SUM(ResolutionWidth + 3), ..., SUM(ResolutionWidth + 89) FROM hits;
  • DuckDB doesn't push down OR and INLIST filter to table scan.
SELECT * FROM t WHERE a=1 OR a=2;
  • [ClickBench-Q23] DuckDB doesn't support late projection, i.e. initially projecting only the columns used in ORDER BY, sorting to find the top rows, and then projecting additional columns only for those top rows. This depends on resolving the previous issue.
SELECT * FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;

To implement these query optimizations, we could: (1) upstream them to DuckDB, (2) implement the optimizations using DuckDB's OptimizerExtension, or (3) implement the optimizations within Postgres and only send optimized queries to DuckDB.

General query execution overhead

  • pg_duckdb has a constant per-query overhead due to running DuckdbPrepare() twice before executing queries with ExecuteQuery().
  • pg_mooncake may also experience per-file overhead, which will require further investigation and optimization.
@zhousun zhousun pinned this issue Jan 15, 2025
@zhousun zhousun changed the title Query Performance Optimization Query Performance Optimizations Jan 15, 2025
@dpxcc dpxcc added the epic label Jan 15, 2025
@dpxcc dpxcc changed the title Query Performance Optimizations Query performance optimizations Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants