Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(batch-exports): Use async producer in Redshift export #25872

Merged
merged 5 commits into from
Nov 1, 2024

Conversation

tomasfarias
Copy link
Contributor

@tomasfarias tomasfarias commented Oct 29, 2024

Problem

Redshift is notorious for being a very inefficient destination that can be very slow. This usually means we end up retrying a lot, due to connections to ClickHouse being dropped due to lack of progress on the insert side.

Similar to the work done for BigQuery, we implement async production of events for Redshift batch export. This gives us a buffer to store events, to ensure we can continue using the connection even if Redshift is being too slow.

Moreover, this PR also adds support for heartbeating in Redshift to ensure we can resume when a batch export fails, thus guaranteeing we will eventually finish.

Changes

  • Implement async production of Arrow record batches for Redshift batch export, similar to BigQuery.
  • Implement heartbeating support.

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

How did you test this code?

Ran all existing Redshift tests. All passed.

@tomasfarias tomasfarias force-pushed the fix/switch-redshift-batch-export-to-async branch 2 times, most recently from 3409f6c to 502424b Compare October 31, 2024 15:51
@tomasfarias tomasfarias force-pushed the fix/switch-redshift-batch-export-to-async branch from 502424b to 53ea1e2 Compare October 31, 2024 16:39
@tomasfarias tomasfarias marked this pull request as ready for review November 1, 2024 09:53
@tomasfarias tomasfarias requested a review from a team November 1, 2024 09:53
Copy link
Contributor

@rossgray rossgray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me (except for a very minor typo).

However, I have to admit I don't understand all the details of batch exports but conceptually it looks good :)

posthog/temporal/batch_exports/redshift_batch_export.py Outdated Show resolved Hide resolved
@tomasfarias tomasfarias merged commit 0f3fb72 into master Nov 1, 2024
89 checks passed
@tomasfarias tomasfarias deleted the fix/switch-redshift-batch-export-to-async branch November 1, 2024 11:05
Copy link

sentry-io bot commented Nov 3, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ RecordBatchProducerError: The record batch producer encountered an error during execution posthog.temporal.batch_exports.batch_exports in... View Issue
  • ‼️ RecordBatchProducerError: The record batch producer encountered an error during execution posthog.temporal.batch_exports.batch_exports in... View Issue
  • ‼️ RecordBatchProducerError: The record batch producer encountered an error during execution posthog.temporal.batch_exports.batch_exports in... View Issue
  • ‼️ RecordBatchProducerError: The record batch producer encountered an error during execution posthog.temporal.batch_exports.batch_exports in... View Issue
  • ‼️ RecordBatchProducerError: The record batch producer encountered an error during execution posthog.temporal.batch_exports.batch_exports in... View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants