
add Reserve for column. Optimize large block insertion #341

Merged

merged 3 commits from column-construction into ClickHouse:master on Nov 2, 2023
Conversation

1261385937 (Contributor)

For large block insertion (100,000+ rows), preallocate column memory.

@1261385937 (Contributor, Author) commented Oct 31, 2023

The total CPU improvement is (2650 - 2400) / 2650 = 9.4%.
The hotspot is std::vector reallocation.

Before (part): [profiler screenshot]

After (part): [profiler screenshot]
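For context, a minimal sketch of the insertion pattern this change speeds up, assuming the public clickhouse-cpp API; the table name and single-column layout are illustrative, and Reserve() is the method this PR adds:

```cpp
#include <memory>

#include <clickhouse/client.h>
#include <clickhouse/columns/numeric.h>

// Reserve the column's storage up front so that appending 100,000+ values
// costs one allocation instead of repeated std::vector regrowth.
// "example_table" and the column layout are illustrative, not from the PR.
void InsertLargeBlock(clickhouse::Client& client, size_t rows) {
    auto col = std::make_shared<clickhouse::ColumnUInt64>();
    col->Reserve(rows);  // preallocate once
    for (size_t i = 0; i < rows; ++i) {
        col->Append(i);
    }

    clickhouse::Block block;
    block.AppendColumn("value", col);
    client.Insert("example_table", block);
}
```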

clickhouse/columns/string.h (review thread; outdated, resolved)
clickhouse/columns/tuple.cpp (review thread; outdated, resolved)
@Enmk (Collaborator) left a comment

Please see the comments inline.
Also add some tests here. Some viable test cases:

  • Reserve() doesn't change Size() (for both empty and non-empty columns).
  • Reserve() on a non-empty column doesn't change stored values.
  • Reserve(0) shouldn't crash and shouldn't change stored values.
  • How Reserve() and Capacity() work together.

Also please fix the code style; this project uses 'Egyptian brackets'. (A sketch of the suggested tests follows below.)
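A minimal sketch of those test cases, assuming GoogleTest and ColumnUInt64 as a representative column type; the Capacity() accessor is taken from this review comment, not from the merged code:

```cpp
#include <gtest/gtest.h>

#include <clickhouse/columns/numeric.h>

using clickhouse::ColumnUInt64;

// Reserve() doesn't change Size(), for both empty and non-empty columns.
TEST(ColumnReserve, SizeUnchanged) {
    ColumnUInt64 empty;
    empty.Reserve(100);
    EXPECT_EQ(0u, empty.Size());

    ColumnUInt64 filled;
    filled.Append(1);
    filled.Append(2);
    filled.Reserve(100);
    EXPECT_EQ(2u, filled.Size());
}

// Reserve() on a non-empty column doesn't change stored values;
// Reserve(0) neither crashes nor alters them.
TEST(ColumnReserve, ValuesUnchanged) {
    ColumnUInt64 col;
    col.Append(42);
    col.Reserve(1000);
    EXPECT_EQ(42u, col.At(0));

    col.Reserve(0);
    EXPECT_EQ(42u, col.At(0));
}

// How Reserve() and Capacity() work together: capacity should be at
// least the requested amount (Capacity() is assumed to exist).
TEST(ColumnReserve, CapacityGrows) {
    ColumnUInt64 col;
    col.Reserve(100);
    EXPECT_GE(col.Capacity(), 100u);
}
```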

@@ -174,6 +174,12 @@ ColumnLowCardinality::ColumnLowCardinality(std::shared_ptr<ColumnNullable> dicti
ColumnLowCardinality::~ColumnLowCardinality()
{}

void ColumnLowCardinality::Reserve(size_t new_cap)
{
    dictionary_column_->Reserve(new_cap);
@Enmk (Collaborator) commented Oct 31, 2023

Here we assume that ALL of the new items are unique, which is an uncommon and quite suboptimal case for LowCardinality.

Maybe estimate the dictionary size as ln(new_cap) * sqrt(new_cap) + new_cap / 5; this way we'll get:

| new_cap  | dict's new_cap | estimated % of unique values |
|----------|----------------|------------------------------|
| 10       | 10             | 100.0                        |
| 20       | 18             | 90.0                         |
| 40       | 32             | 80.0                         |
| 60       | 44             | 73.33                        |
| 80       | 56             | 70.0                         |
| 100      | 67             | 67.0                         |
| 200      | 115            | 57.5                         |
| 400      | 200            | 50.0                         |
| 600      | 277            | 46.17                        |
| 800      | 350            | 43.75                        |
| 1000     | 419            | 41.9                         |
| 2000     | 740            | 37.0                         |
| 4000     | 1325           | 33.12                        |
| 6000     | 1874           | 31.23                        |
| 8000     | 2404           | 30.05                        |
| 10000    | 2922           | 29.22                        |
| 20000    | 5401           | 27.01                        |
| 40000    | 10120          | 25.3                         |
| 60000    | 14695          | 24.49                        |
| 80000    | 19194          | 23.99                        |
| 100000   | 23641          | 23.64                        |
| 1000000  | 213816         | 21.38                        |
| 10000000 | 2050970        | 20.51                        |

So it converges to ~20% unique items (i.e. items in the dictionary column) for huge columns, but tolerates a high share of unique values for small columns.
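A standalone sketch of this estimate (a reading of the proposal, not the code that landed in the PR); rounding up with ceil reproduces the table above:

```cpp
#include <cmath>
#include <cstddef>

// Proposed dictionary capacity for a LowCardinality column about to hold
// new_cap items: ceil(ln(new_cap) * sqrt(new_cap) + new_cap / 5).
// E.g. EstimateDictionaryCapacity(100) == 67, matching the table.
size_t EstimateDictionaryCapacity(size_t new_cap) {
    if (new_cap == 0) {
        return 0;
    }
    const double n = static_cast<double>(new_cap);
    return static_cast<size_t>(std::ceil(std::log(n) * std::sqrt(n) + n / 5));
}
```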

@1261385937 (Contributor, Author)

Actually, I do not understand the formula, so could you implement ColumnLowCardinality::Reserve? 🌹

@Enmk (Collaborator)

Sure, will do.

@1261385937 (Contributor, Author)

@Enmk, thank you for the review.

@1261385937 requested review from Enmk on November 1, 2023 at 09:57
@Enmk merged commit 0f8b396 into ClickHouse:master on Nov 2, 2023 (16 checks passed)
@1261385937 deleted the column-construction branch on November 2, 2023 at 12:00