
add Reserve for column. Optimize large block insertion #341

Merged

merged 3 commits from column-construction into ClickHouse:master on Nov 2, 2023
Conversation

1261385937 (Contributor)

For large block insertion (100,000+ rows), preallocate column memory.

@1261385937 (Contributor, Author) commented Oct 31, 2023

The total CPU improvement is (2650 - 2400) / 2650 = 9.4%.
The hotspot is std::vector reallocation.

Before (part): [profiler screenshot]

After (part): [profiler screenshot]
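For context, a minimal sketch of the insertion pattern this change speeds up, assuming the public clickhouse-cpp API; the table name and single-column layout are illustrative, and Reserve() is the method this PR adds:

```cpp
#include <memory>

#include <clickhouse/client.h>
#include <clickhouse/columns/numeric.h>

// Reserve the column's storage up front so that appending 100,000+ values
// costs one allocation instead of repeated std::vector regrowth.
// "example_table" and the column layout are illustrative, not from the PR.
void InsertLargeBlock(clickhouse::Client& client, size_t rows) {
    auto col = std::make_shared<clickhouse::ColumnUInt64>();
    col->Reserve(rows);  // preallocate once
    for (size_t i = 0; i < rows; ++i) {
        col->Append(i);
    }

    clickhouse::Block block;
    block.AppendColumn("value", col);
    client.Insert("example_table", block);
}
```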

clickhouse/columns/string.h (review thread; outdated, resolved)
clickhouse/columns/tuple.cpp (review thread; outdated, resolved)
@Enmk (Collaborator) left a comment

Please see the comments inline.
Also add some tests here. Some viable test cases:

  • Reserve() doesn't change Size() (for both empty and non-empty columns).
  • Reserve() on a non-empty column doesn't change stored values.
  • Reserve(0) shouldn't crash and shouldn't change stored values.
  • How Reserve() and Capacity() work together.

Also please fix the code style; this project uses 'Egyptian brackets'. (A sketch of the suggested tests follows below.)
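A minimal sketch of those test cases, assuming GoogleTest and ColumnUInt64 as a representative column type; the Capacity() accessor is taken from this review comment, not from the merged code:

```cpp
#include <gtest/gtest.h>

#include <clickhouse/columns/numeric.h>

using clickhouse::ColumnUInt64;

// Reserve() doesn't change Size(), for both empty and non-empty columns.
TEST(ColumnReserve, SizeUnchanged) {
    ColumnUInt64 empty;
    empty.Reserve(100);
    EXPECT_EQ(0u, empty.Size());

    ColumnUInt64 filled;
    filled.Append(1);
    filled.Append(2);
    filled.Reserve(100);
    EXPECT_EQ(2u, filled.Size());
}

// Reserve() on a non-empty column doesn't change stored values;
// Reserve(0) neither crashes nor alters them.
TEST(ColumnReserve, ValuesUnchanged) {
    ColumnUInt64 col;
    col.Append(42);
    col.Reserve(1000);
    EXPECT_EQ(42u, col.At(0));

    col.Reserve(0);
    EXPECT_EQ(42u, col.At(0));
}

// How Reserve() and Capacity() work together: capacity should be at
// least the requested amount (Capacity() is assumed to exist).
TEST(ColumnReserve, CapacityGrows) {
    ColumnUInt64 col;
    col.Reserve(100);
    EXPECT_GE(col.Capacity(), 100u);
}
```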

@@ -174,6 +174,12 @@ ColumnLowCardinality::ColumnLowCardinality(std::shared_ptr<ColumnNullable> dicti
ColumnLowCardinality::~ColumnLowCardinality()
{}

void ColumnLowCardinality::Reserve(size_t new_cap)
{
    dictionary_column_->Reserve(new_cap);
@Enmk (Collaborator) commented Oct 31, 2023

Here we assume that ALL of the new items are unique, which is an uncommon and quite suboptimal case for LowCardinality.

Maybe estimate the dictionary size as ln(new_cap) * sqrt(new_cap) + new_cap / 5; this way we'll get:

| new_cap  | dict's new_cap | estimated % of unique values |
|----------|----------------|------------------------------|
| 10       | 10             | 100.0                        |
| 20       | 18             | 90.0                         |
| 40       | 32             | 80.0                         |
| 60       | 44             | 73.33                        |
| 80       | 56             | 70.0                         |
| 100      | 67             | 67.0                         |
| 200      | 115            | 57.5                         |
| 400      | 200            | 50.0                         |
| 600      | 277            | 46.17                        |
| 800      | 350            | 43.75                        |
| 1000     | 419            | 41.9                         |
| 2000     | 740            | 37.0                         |
| 4000     | 1325           | 33.12                        |
| 6000     | 1874           | 31.23                        |
| 8000     | 2404           | 30.05                        |
| 10000    | 2922           | 29.22                        |
| 20000    | 5401           | 27.01                        |
| 40000    | 10120          | 25.3                         |
| 60000    | 14695          | 24.49                        |
| 80000    | 19194          | 23.99                        |
| 100000   | 23641          | 23.64                        |
| 1000000  | 213816         | 21.38                        |
| 10000000 | 2050970        | 20.51                        |

So it converges to ~20% unique items (i.e. items in the dictionary column) for huge columns, but tolerates a high share of unique values for small columns.
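A standalone sketch of this estimate (a reading of the proposal, not the code that landed in the PR); rounding up with ceil reproduces the table above:

```cpp
#include <cmath>
#include <cstddef>

// Proposed dictionary capacity for a LowCardinality column about to hold
// new_cap items: ceil(ln(new_cap) * sqrt(new_cap) + new_cap / 5).
// E.g. EstimateDictionaryCapacity(100) == 67, matching the table.
size_t EstimateDictionaryCapacity(size_t new_cap) {
    if (new_cap == 0) {
        return 0;
    }
    const double n = static_cast<double>(new_cap);
    return static_cast<size_t>(std::ceil(std::log(n) * std::sqrt(n) + n / 5));
}
```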

@1261385937 (Contributor, Author)

Actually, I do not understand the formula, so could you implement ColumnLowCardinality::Reserve? 🌹

@Enmk (Collaborator)

Sure, will do.

@1261385937 (Contributor, Author)

@Enmk, thank you for the review.

@1261385937 requested review from Enmk on November 1, 2023 at 09:57
@Enmk merged commit 0f8b396 into ClickHouse:master on Nov 2, 2023 (16 checks passed)
@1261385937 deleted the column-construction branch on November 2, 2023 at 12:00