Add bulk upsert to lucky #789
Conversation
src/avram/bulk_upsert.cr
Outdated
```crystal
private def set_timestamps(collection)
  collection.map do |record|
    record.created_at.value ||= Time.utc if record.responds_to?(:created_at)
    record.updated_at.value = Time.utc if record.responds_to?(:updated_at)
```
Could these be `NOW()` statements instead? Especially when you're iterating thousands of rows, it's worth saving as much time as you can.
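A rough sketch of the difference, with illustrative SQL only (the `users` table and its columns are hypothetical, not Avram's actual generated queries):

```crystal
# Hypothetical example: with timestamps bound as params, every row ships
# two extra binds and two Time.utc calls from the Crystal side.
sql_with_binds = <<-SQL
  INSERT INTO users (name, created_at, updated_at)
  VALUES ($1, $2, $3)
  SQL

# With NOW(), Postgres fills in the timestamps itself, shortening the
# bind list and skipping the per-row Time.utc work.
sql_with_now = <<-SQL
  INSERT INTO users (name, created_at, updated_at)
  VALUES ($1, NOW(), NOW())
  SQL
```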
Yeah, we were considering this! The one thing we needed to work through was a MAJOR data changeup -- these would then need to be `NOW()` instead of bound `$n` params, which changes up the models a lot.
Related: #393 (note the issue on `updated_at` there that would require a trigger to know when to change). Somewhat related: luckyframework/lucky#1583
I LOVE that you guys are tackling this. I know it's a super difficult problem to solve, and to make it extra hard, we have to take additional considerations into account with the Lucky core goals:
- Is this type-safe? (as in providing more safety than just what Crystal gives you)
- Does this help me catch bugs at compile-time?
- Does it add security by ensuring bad data can't be used?
I know the current `upsert` doesn't quite work like a normal DB upsert. @paulcsmith wanted to add `ON CONFLICT` support directly into the query builder, which would add some extra support. You can read up on the original PR #334.
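For context, a minimal sketch of the kind of statement query-builder-level `ON CONFLICT` support could emit (hypothetical table and columns; not Avram's actual output):

```crystal
# Hypothetical upsert: insert, or update the existing row when the
# unique columns collide.
upsert_sql = <<-SQL
  INSERT INTO users (email, username, updated_at)
  VALUES ($1, $2, NOW())
  ON CONFLICT (email, username)
  DO UPDATE SET updated_at = NOW()
  SQL
```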
Also, while you're working on this, keep in mind how this would look coming from an HTML form. How do the params get built and passed over to this? One tip I've learned while working on Lucky is: don't be afraid to break away from the normal "Rails mindset" coding styles. Sometimes an API that looks a bit funky / off-putting at first may end up being a better solution in the long run. A few extra typed characters could mean a few fewer specs and more confidence!
Feel free to hit me up if you have any questions I can answer. I'll keep an eye on the PR and try to chime in when I can. Stoked to see this! 🚀
Anything that is manual (from a form) I'd generally consider "small enough" to not care about utilizing bulk operations directly from an HTML form. To me, bulk operations are more for:
The tricky part with this is how delicate it is to provide a nice interface to the developer while still retaining as much compile-time safety as possible. It may be possible to do both in an easy way by dividing out the concerns here. Consider the string builder approach. Instead of:

```crystal
"this is a title #{data} #{data.metadata ? data.processed : ""}"
```

String builder has an intuitive and easy-to-read interface:

```crystal
String.build do |b|
  b << "this is a title "
  b << data
  b << data.processed if data.metadata
end
```

In the same way, I wonder if it's possible, or perhaps even safer, to iteratively build the upsert. Here's some "imagine code" that hopefully shows what I mean:

```crystal
class BulkUserOperation < UserOperation
end

upsert_op = BulkUserOperation.build_upsert do |upserter|
  DataSource.each_row do |row|
    upserter << upserter.bind row
    upserter.commit
  end
end
upsert_op.finish
```

Here are my assumptions:
Handling bulk data operations is tricky, and handling upserts properly is tricky too. Perhaps Avram needs some better tooling around bulk data operations independent of upserts.
Nice to haves:
After looking over this and watching the discussion, I'm aware that there were at least three different assumptions about what a bulk insert/upsert might be used for. I observed these assumptions:

They're not fundamentally incompatible, nor is it absolutely necessary that whatever is implemented here facilitates all three of them. I do think it might be helpful to identify the use cases that need to be discussed. It might provide some clearer guidance around validations, compile-time guards, etc.
Or inside the request cycle, like a user uploading a CSV.
will/crystal-pg#244 needs to be merged/released or monkeypatched into Avram before this is viable.
Catching up on all the work y'all have done and threw some comments in there 👍
@grepsedawk totally forgot about this one, lmk if you want to pair on it later this week too
Honestly, this is pretty chonky, but it doesn't look "crazy".... I don't have any immediate use for it, but from what it sounds like, @grepsedawk used it and it seems to work?
The only thing I'm not seeing (and it may be OK for now) is that there's no way to use this with Lucky::Params, right?
@robcole Can you update the original post for this PR with some example code of how someone would use this? That will give me (and other readers) an easy overview of what this PR is really doing.
It was designed for bulk importing, which isn't something I often do with params. I'll see if I can find an example where it might be a fit, but you'd usually be iterating over a collection of responses from some source (API, CSV, or internal model building, typically). Will add the usage examples today or tomorrow. I think you'd just need an array of params since we use the same general interface as upsert, and it should work fine, but I haven't tested that yet.
Ok, I think that's fine. We can worry about adding in support from params later then. I think your usage makes sense, so just a quick code example. Just a few lines, nothing crazy, and that would be good!
Overall, I love the direction, and I think this is a super useful feature. There are just a few things that concern me, so I'd like to talk about possible solutions, or directions we can take. And please forgive me if any of this is incorrect and I missed something.

Right now, this implementation piggybacks off of SaveOperations to help keep things a bit more consistent, and make use of some existing methods so we don't end up duplicating things 👍 However, this leads to several new issues... For example, say I have a normal SaveOperation:

```crystal
class SaveUser < User::SaveOperation
  upsert_lookup_columns :email, :username
  upsert_unique_on :email, :username

  needs passcode : Int32
  attribute password : String

  before_save do
    validate_required(password)
    validate_email_format(email)
    validate_unique(username)
  end

  after_commit do |user|
    EmailUser.new(user, passcode).deliver
  end
end
```

```crystal
# This works fine
SaveUser.upsert(passcode: 1234, email: "...", username: "...") do |op, user|
end

# This leads to some issues
SaveUser.upsert([{email: "...", username: "..."}])
```
So at this level, it's almost not even a SaveOperation... The lack of validations and callbacks is the biggest part, I think. Assuming you're pulling in bulk data from a CSV you don't control, there's no way to guarantee that a malformed email won't get inserted into your DB. Since Postgres won't handle validations other than NULL-or-not, and possibly uniqueness (if you have the index), your app would be at risk of bad data. Here are my suggestions, but I'm totally open to others as well:
I'm sure there are other options I may not be thinking of... @robcole, I know you and @grepsedawk have put a ton of work into this, and I really appreciate it ❤️. I want to make sure it's clear that I'm not shooting down all your efforts and hard work. I think having this feature in Avram will be huge! We just need to make sure that features like this live up to Lucky's core goals. I'd be happy to hop on a voice chat in Discord if either of you (or anyone else) would like to chat more about potential solutions, or just to hash out any concerns.
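To make that concern concrete, here is a hedged sketch (the CSV file, its layout, and the row shape are all hypothetical) of how unvalidated rows could reach the bulk path:

```crystal
require "csv"

# Hypothetical import: rows come straight from an external file, so
# before_save validations like validate_email_format never run on them.
rows = CSV.parse(File.read("users.csv")).map do |row|
  {email: row[0], username: row[1]}
end

SaveUser.upsert(rows) # a malformed email would go straight to Postgres
```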
I side towards 3. It allows us all to feel out the API a bit more before making any big changes. Also, I could be wrong, but I think the way @robcole did the splat, it would actually support the
I also side with 3, but with a note that when we bring in SQL operations or bulk operations, I'm 💯 down to work on refactors to bring the two interfaces into more uniformity. I'd like to have a sensible way to have it still optionally yield an operation, but for the current code, it doesn't feel like it brings enough value to change the API around that much. A few things I'm less certain of for bulk upsert in general, based on how I've used it:

The way I designed it should be a somewhat easy port for things like params once we have the following concepts in Lucky to make sure everything feels smooth.
@jwoertink In the event you're curious - https://github.com/zdennis/activerecord-import is where I got a lot of the API design inspiration (on top of your existing upsert code), as it's what I used up until Rails finally added it. Validations are optional in it, but the interface around reporting and fixing those validations can be a bit wonky - basically you're doing

For the recent Rails additions, AFAIK these are all intended to be bulk operations, so callbacks and validations are skipped as well: https://www.bigbinary.com/blog/bulk-insert-support-in-rails-6

Given this, LMK if you've got more thoughts here - definitely appreciating the feedback, as it'll help me make sure I'm on the same page in general with Lucky API goals.
Oh sweet. Thanks for the links @robcole. I'll take a look at those. Pulling things from Rails can be nice since it's so well established, but it's also a double-edged sword. Since Rails isn't type-safe and doesn't enforce any sort of data integrity, you're more likely to have bad data, which leads to bugs in production. For example, in Rails, an empty string is fine. If the column is

I just had another idea... Since one of the goals for Lucky is to try and always provide an escape hatch for when there's something we can't do at a high level, what if we just dropped this whole API down a level? For example, instead of doing this off the SaveOperation, it becomes lower level, like:

```crystal
upsert = Avram::BulkUpsert(Thing).new(array_of_named_tuples_for_thing)
upsert.run
```

After writing that, I didn't really like it, but the premise is still good... At a lower level, we don't need to necessarily follow Lucky convention. This is similar to when you do something like

What are your thoughts on that?
```
@@ -100,5 +123,9 @@ module Avram::Upsert
\{% raise "Please use the 'upsert_lookup_columns' macro in #{@type} before using '{{ method.id }}'" %}
```
will/crystal-pg@v0.24.0...v0.26.0 crystal-lang/crystal-db@v0.10.1...v0.11.0

This adds the upstream changes from crystal-pg v0.26. This also allows us to remove the monkey patches required for null array elements in Avram, added/utilized by the bulk upsert functionality (#789).
src/ext/db/param.cr
Outdated
```
@@ -0,0 +1,8 @@
# Can be removed once https://github.com/will/crystal-pg/pull/244 is merged.
```
This can be removed now: #807
Adds an `upsert` overload for Arrays to create large numbers of records at the same time. Uses PG's UNNEST behavior to allow for a near-infinite (buyer beware) number of insertions rather than being limited by PG's bind parameter restrictions (64k total binds, which would prevent more than a few thousand upserts at a time depending on the number of column inserts). Co-authored-by: Alex Piechowski <[email protected]> Co-authored-by: robacarp <[email protected]>
@robcole After talking it over with the team, there have been some suggestions: what if we made this a separate shard extension, similar to how we did https://github.com/luckyframework/avram_slugify? If we go that route, you could keep the current implementation as-is. This would allow people to use the feature and give feedback / contribute to really flesh out the different use cases. If you're cool with that, I could even start a new repo under Lucky, and I think I can give you access to that. Let me know your thoughts, or you can hit me up on discord if you want 👍
@jwoertink I think it makes more sense to have this in Avram, but I agree that it isn't ready yet. Have you done any work on the concept of a BulkOperation or SQLOperation? My plan was to refactor this and add a BulkOperation which yields the operation in the same way as others, but would have a concept of multiple records, and initially wouldn't handle validations and callbacks, but could down the road. Haven't had time to get to the refactor yet, but it's on my list still.
Oh, ok. Yeah, I agree that it should be in Avram core too, but just not with the current implementation. If you're open to some refactors, then we can keep this open 👍 I haven't started on any of the bulk stuff. Work has just been way too crazy lately, so I've fallen behind a bit on my Lucky duties.
Cool, I'll take a stab at it soonish (booting two personal projects / business ideas on Lucky now, so I've been short on time recently too), and hopefully we can piece together what a BulkOperation looks like from here with this PR. Once I get to that, I'll also add a couple more tests that help illuminate where this bulk insert is most useful, which is stuff like CSV loading and ETL batch operations that write to rollup tables.
Sweet! Stoked. And no rush on any of that. I don't plan on another release until around summer, and if it's not ready, we can always do the next release 😄 I just wanted to make sure you weren't waiting on me for anything. Sounds like we're on the same page though 👍 Thanks for being understanding!
Adds an `upsert` overload for Arrays to create large numbers of records at the same time. Uses PG's UNNEST behavior to allow for a near-infinite (buyer beware) number of insertions rather than being limited by PG's bind parameter restrictions (64k total binds, which would prevent more than a few thousand upserts at a time depending on the number of column inserts).
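To illustrate why UNNEST sidesteps the bind limit, here is a rough sketch of the statement shape (hypothetical table and columns; not the exact SQL this PR generates): one array bind per column, regardless of row count, instead of one bind per value.

```crystal
# Hypothetical shape: three binds total for any number of rows, since
# each $n is an entire column passed as a Postgres array.
unnest_sql = <<-SQL
  INSERT INTO users (name, email, created_at)
  SELECT * FROM UNNEST($1::text[], $2::text[], $3::timestamptz[])
  SQL
```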
Common use cases for this include building large numbers of records for insertion into stat grouping tables (similar to materialized views, but with more atomicity for which records are refreshed), loading records from CSV files or JSON files into a model, etc.
Example uses (pseudocode):
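A minimal sketch of one such use, assuming the array-of-NamedTuples interface shown earlier in the thread (the model, operation, and data are hypothetical):

```crystal
record Metric, date : Time, visits : Int32

# Hypothetical rollup data; in practice this would come from an API,
# a CSV, or internal model building, as discussed above.
metrics = [Metric.new(Time.utc, 42), Metric.new(Time.utc - 1.day, 17)]

# One UNNEST-backed statement instead of row-by-row inserts.
rows = metrics.map { |m| {recorded_on: m.date, visits: m.visits} }
SaveDailyStat.upsert(rows)
```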