
Bulk import support for SaveOperation #952

Closed
wants to merge 3 commits into from

Conversation

@garymardell (Contributor) commented Jun 28, 2023

Looking for feedback on the approach, and on whether this is something that would likely be merged into Avram, before I finish writing tests. I could also add it as a separate shard.

I have the need for bulk-inserting in the application I am building, and noticed that issue #662 is still open. I also came across the bulk-upsert implementation in #789, but it looks like there hasn't been much activity there in some time.

This PR piggy-backs on the behaviour of a SaveOperation, as I wanted to avoid duplicating logic, validations, callbacks, etc. It essentially follows the same pattern as a regular insert, but for multiple operations at the same time. Initially in my app I had something that looked like this:

operations = [] of SaveThing

data.each do |datum|
  operations << SaveThing.new(...)
end

AppDatabase.transaction do
  operations.each(&.save)
end

This PR:

  • Defines an import method that accepts an array of save operations of a specific class
  • Treats the set of operations as a batch that either all succeed or none
  • Builds a single SQL insert statement
  • Triggers all callbacks including before_save, after_save and after_commit
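To sketch what "a single SQL insert statement" means here: the rows collapse into one multi-row VALUES clause with consecutive placeholders, rather than one statement per operation. The helper below is illustrative only (the table/column names are hypothetical and this is not Avram's actual implementation):

```crystal
# Illustrative only: how consecutive placeholders for a multi-row
# insert can be generated. Not Avram's actual implementation.
def values_clause(row_count : Int32, column_count : Int32) : String
  index = 0
  rows = Array(String).new(row_count) do
    placeholders = Array(String).new(column_count) { "$#{index += 1}" }
    "(#{placeholders.join(", ")})"
  end
  rows.join(", ")
end

# e.g. INSERT INTO things (name, parent_id)
#      VALUES ($1, $2), ($3, $4), ($5, $6)
puts values_clause(3, 2) # => ($1, $2), ($3, $4), ($5, $6)
```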

Changes to Avram::Insert

Initially in my branch I had created a separate MultiInsert class that accepted an array of parameters, but there was a lot of duplication. It also felt like the Insert class, as a representation of an insert statement, should be able to support this case. I refactored the class to accept an array of parameters.

The regular save operation is now an array of params with a single item.

Changes to SaveOperation

I introduced a new values method to the public API which returns the attributes as a hash. This uses the same underlying attributes_to_hash that insert_sql was using.

The difference between values and insert_sql is that values is uncompacted (it preserves nil values). This is important for bulk inserting, as the number of placeholders ($1, $2, $3...) needs to remain consistent across rows.

In my use-case I am inserting tree-like data that has a parent_id field, which for the root node is nil. Without this change, the operation for the root has one fewer parameter than all subsequent nodes, and the insert would fail.
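To make the distinction concrete, here is a hypothetical attributes hash for such a root node (the names are invented for illustration): insert_sql effectively worked from the compacted form, while values keeps the nil entry.

```crystal
# Hypothetical attributes for a root node; column names invented
# for illustration.
attributes = {"name" => "root", "parent_id" => nil}

compacted = attributes.compact # roughly what insert_sql used
values    = attributes         # what the new `values` method preserves

puts compacted # => {"name" => "root"}
puts values    # => {"name" => "root", "parent_id" => nil}
```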

@jwoertink (Member)

Oh wow! Well, this is definitely something we want to add in, and I've had ideas sitting in the back of my head for a while, but no time to put them down. This is an interesting approach, though I'm a bit lost on the interface. What does it look like?

From what I can tell, it might look like this:

class SaveUser < User::SaveOperation
end

operations = params.many_nested(:users).map { |data| SaveUser.new(data) }
SaveUser.import(operations)

I guess some key things to think about before getting too deep into this are:

  • Does this handle mass bulk? (i.e. 100k+) or would it need to paginate somehow?
  • Would the dev always need to build out that array of operations, or can the params just be passed in and know to use many_nested?
  • What happens if 9 insert fine, but 1 fails? Does it rollback all of them and return errors telling you which one?

The only other thing that comes to mind is naming. The name of your operation is intended to be somewhat descriptive of what it does. In this case we're saying "I save a user (SaveUser), but really I save a lot of them too...". A pedantic point, but if I wanted my operation to be called ImportUsers.import(), could I still tell it to use my SaveUser? Not necessary for calling this "good enough", and it could always be a future update.

Thanks for working on this! I know many have requested it, and it's a tricky task ❤️

@garymardell (Contributor, author) commented Jun 28, 2023

Thanks for the quick feedback! Yes, your example is pretty much spot on; using my example from above, it's now:

operations = [] of SaveThing

data.each do |datum|
  operations << SaveThing.new(...)
end

AppDatabase.transaction do
  SaveThing.import(operations)
end

In my use-case I am not using Avram::Paramable. I actually have a protobuf input that gets decoded, so I have an array of objects that I manipulate and then want to insert. In Rails I would have created an array of models; with Avram it's an array of operations instead, which is why my import interface accepts operations.

Does this handle mass bulk? (i.e. 100k+) or would it need to paginate somehow?

In this WIP it will just generate an insert statement for the set of operations you give it, so a very large set of operations will produce a very large insert statement. I feel it is a useful characteristic for the interface to remain transactional and to delegate all pagination/splitting to the developer. The consumer can then decide what to do on failure — for example, roll back all 100k records, skip the page that failed, or stop inserting but keep anything previously inserted — without having to build this into the API.
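Using the SaveThing/AppDatabase names from the example above, that caller-side splitting could look like this (a sketch on top of this PR's import method; the batch size is arbitrary):

```crystal
# Sketch: the developer chooses the batch size and the failure policy.
# `operations`, `SaveThing`, and `AppDatabase` are the names from the
# earlier example; `import` is the method this PR adds.
operations.each_slice(1_000) do |batch|
  AppDatabase.transaction do
    SaveThing.import(batch)
  end
end
```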

Would the dev always need to build out that array of operations, or can the params just be passed in and know to use many_nested?

I'm not too familiar with many_nested and how it's meant to be used, but it should be possible to add another method that accepts Avram::Paramable, provided it can be converted into a set of operations.

SaveUser.import(params.many_nested(:users))

I think to support this the return type would have to change to a set of operations or an exception, rather than the current true/false, as you'd likely want the operations after the fact to get the records.

What happens if 9 insert fine, but 1 fails? Does it rollback all of them and return errors telling you which one?

Postgres will treat the whole insert transactionally, so if one row were to fail the whole set would roll back. There are options, not included in this PR (but they could be added), for telling Postgres what to do on conflict (https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT) if another value exists for a unique index.
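For reference, the conflict-handling clauses from that Postgres page take shapes like the following. Illustrative only — these options are not part of this PR, and the table/column names are hypothetical:

```crystal
# Illustrative only: clause shapes from the linked Postgres docs that a
# future option could append to the generated insert.
ON_CONFLICT_EXAMPLES = <<-SQL
  INSERT INTO things (id, name) VALUES ($1, $2)
    ON CONFLICT DO NOTHING;
  INSERT INTO things (id, name) VALUES ($1, $2)
    ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;
  SQL

puts ON_CONFLICT_EXAMPLES
```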

As for errors: since it's an all-or-nothing affair in this PR, I am triggering the same mark_as_failed and emitting the notification on each of the operations. I don't think it's possible for Postgres to tell us which operation specifically failed.

The only other thing that comes to mind is naming. The name of your operation is intended to be somewhat descriptive of what it does. In this case we're saying "I save a user (SaveUser), but really I save a lot of them too...".

One option is to make a separate BulkOperation(U, T) class that gets generated for an operation similar to how SaveOperation gets generated for a model. With U being the operation class and T being the record type. Something like:

SaveUser::BulkOperation.insert_all([SaveUser.new])

I don't believe this changes the implementation too much as it would still be working on sets of operations (or parameters turned into operations).

A pedantic point, but if I wanted my operation to be called ImportUsers.import(), could I still tell it to use my SaveUser?

I think there are 2 options for this:

  1. Change the definition of import to def self.import(operations : Array(U)) forall U, which allows the user to extend the SaveUser class as ImportUser, but means the interface accepts an array of anything, which is less than ideal.
  2. With the approach above, a BulkOperation class would already be generated with the correct generic parameters, and it could be used directly or extended by the user, similar to how SaveOperation works.

@garymardell (Contributor, author) commented Jun 29, 2023

I just pushed to another branch an example of generating a separate bulk class for each operation:

https://github.com/luckyframework/avram/compare/main...garymardell:avram:bulk-operation?expand=1

operations = [] of SaveThing

data.each do |datum|
  operations << SaveThing.new(...)
end

AppDatabase.transaction do
  SaveThing::Bulk.insert_all(operations)
end

and then in theory you can also do:

class ImportUsers < Avram::BulkOperation(SaveUser, Span)
end

ImportUsers.insert_all(users)

if you want specific naming. Alternatively, this might be a rare enough use-case that we don't generate one for each operation, and extending Avram::BulkOperation is the only way to use it.

@jwoertink (Member)

I think what you have here is a great start. I dig the change of being able to create a separate class designated for handling bulk imports.

We will eventually need support for params, but maybe that can come in a future PR; to start, we just explain that this isn't meant to take direct user input. Same with the pagination part — I agree with you that we can leave that all up to the developer and just say this is an all-or-nothing deal.

Can I get an example of what the error handling would look like? I see this doesn't take a block like the other operations do. That's probably fine, since returning the entire array of what you passed in might be quite expensive.

Are you thinking maybe the interface looks like this?

op = ImportUsers.insert_all(users)
if op.saved?
  # good
else
  # not good
  op.errors
end

@jwoertink (Member)

@garymardell If you have a moment, I think there are just 2 bits left here.

  • Get main merged in and run the formatter and ameba
  • Write out an example of what error handling looks like with this.

Then I think we can just merge it in and get something going, so we can improve on it more later.

Thanks!

@garymardell (Contributor, author)

Sorry for the delay — this got sidelined by other work. I actually have another branch locally that I think is more promising: it provides a BulkOperation class with a better interface, similar to what you suggested above. It would take care of accepting params vs. operations and provide a nicer way of knowing whether it was successful. I will try to get back to it over the next couple of days to get it PR-ready.

@jwoertink (Member)

Oh sweet! If that's the case, then I'm down to hold off. Should we close out this issue for now, and then you can open a new one with your updates?

@garymardell (Contributor, author)

Should we close out this issue for now, then you can re-open a new one with your updates?

Yes, going to close this one for now.

@akitaonrails

Hi, I am currently using my fork with your bulk-insertion PR for the time being. I did a small hack on Avram::Insert#statement, adding a separate statement_for_bulk that appends "on conflict do nothing".

I hope you come up with a more permanent solution soon — data import is quite common.

Thanks.
