-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add Reserve for column. Optimize large block insertion #341
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the comments inline.
Also add some tests here. Some viable test cases:
- Reserve() doesn't change Size() (for both empty and non-empty columns)
- Reserve() on non-empty Column doesn't change stored values.
- Reserve(0) shouldn't crash, also shodun't change stored values.
- How Reserve() and Capacity() work together.
Also please fix the code style, this project uses 'egyptian grackets'
@@ -174,6 +174,12 @@ ColumnLowCardinality::ColumnLowCardinality(std::shared_ptr<ColumnNullable> dicti | |||
ColumnLowCardinality::~ColumnLowCardinality() | |||
{} | |||
|
|||
void ColumnLowCardinality::Reserve(size_t new_cap) | |||
{ | |||
dictionary_column_->Reserve(new_cap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we assume that ALL of the new items are unique, wich is quite an uncommon and quite sub-optimal case for LowCardinality
.
Maybe estimate dict size as ln(new_cap) * sqrt(new_cap) + new_cap / 5
, this way we'll get:
new_cap | dict's new_cap | estimated % of unique values |
---|---|---|
10 | 10 | 100.0 |
20 | 18 | 90.0 |
40 | 32 | 80.0 |
60 | 44 | 73.33 |
80 | 56 | 70.0 |
100 | 67 | 67.0 |
200 | 115 | 57.5 |
400 | 200 | 50.0 |
600 | 277 | 46.17 |
800 | 350 | 43.75 |
1000 | 419 | 41.9 |
2000 | 740 | 37.0 |
4000 | 1325 | 33.12 |
6000 | 1874 | 31.23 |
8000 | 2404 | 30.05 |
10000 | 2922 | 29.22 |
20000 | 5401 | 27.01 |
40000 | 10120 | 25.3 |
60000 | 14695 | 24.49 |
80000 | 19194 | 23.99 |
100000 | 23641 | 23.64 |
1000000 | 213816 | 21.38 |
10000000 | 2050970 | 20.51 |
So it converges to 20% of unique items (e.g. items in the dictionary column) for huge columns, but tolerates a high number of unique values for small columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I do not understand the form, so could you implement ColumnLowCardinality::Reserve ? 🌹
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do
@Enmk thank you for review |
For large block insertion(100,000+), preassign column memory