-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path16_optimizing_stable_sort.html
324 lines (265 loc) · 11.5 KB
/
16_optimizing_stable_sort.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width">
<title>16. Optimizing stable sort</title>
<link rel="stylesheet" href="template/style.css">
</head>
<body>
<span style="float: right">
[
<a href="15_merge_inplace.html">previous</a>,
<a href="index.html">contents</a>,
<a href="17_adaptive_merge_sort.html">next</a>
]
</span>
<a name="L16.-Optimizing-stable-sort"></a>
<h1>16. Optimizing stable sort</h1>
<a name="The-measure-of-a-good-programmer"></a>
<h2>The measure of a good programmer</h2>
<p>With algorithms and components you have to know the bunch.
It’s like a toolchest.
A good woodworker doesn’t just have one kind of tool.
You have a bunch of tools, whether making a door or whatever,
and then you find the right tools.
The <em>goodness, or best measure of a programmer
is the number of tools, or algorithms, he is fluent with</em>.
There is an erroneous view that good programmer is someone
who can invent new algorithms.
This is good, but most people don’t need to.
A carpenter doesn’t invent new tools.
You can be an exceptionally good, or world class carpenter, without
ever introducing a new woodworking tool.
One of the things you need to know is existing tools.</p>
<p>I constantly hear, “I never needed __ algorithm”.
This might be true.
A carpenter doesn’t use every tool in a given day.
Nevertheless, they have a large tool chest for those
cases when they need it.
Larger the tool chest, better off you are,
easier it is going to be to solve problems.
Moreover, if the tools are known to everyone,
they can understand what you do.</p>
<p>This is why I try to come up with a set of systematic components.
I think programming is a wonderful thing if you know many algorithms
and data structures and just use them.</p>
<a name="How-good-is-our-stable-sort-3f-"></a>
<h2>How good is our stable sort?</h2>
<p>The merge sort algorithm we wrote last week is in
many respects good code.
It’s a good example of a stable sort.
It’s a good basis on which we could get fast results and figure out how to make it very fast.
We will do measurements.</p>
<p>The current algorithm has <code>log(n)</code> recursive levels.
At every level we have a merge which is <code>O(n log(n))</code>.
so the overall complexity is <code>O(n log^2(n))</code>.
It’s better than bubble sort and shell sort.
So how bad is it?
I made a test harness<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup>:</p>
<pre><code>Sorting double from 8 up to 2097152 elements generated with iota at: Sun Jun 27 20:29:54 2021
size stable inplace
8 12 18
16 8 25
32 8 35
64 9 40
128 9 49
256 12 66
512 14 94
1024 23 124
2048 34 150
4096 46 175
8192 54 197
16384 63 218
32768 68 234
65536 73 255
131072 77 276
262144 87 355
524288 92 321
1048576 100 343
2097152 102 364
</code></pre>
<p>Stable here is <a href="https://en.cppreference.com/w/cpp/algorithm/stable_sort"><code>std::stable_sort</code></a>
which is highly optimized library code.
So it tells us we can do better.
The numbers are nano-seconds per element.
Our merge sort is called <code>inplace</code>.
We start off ok, and then gradually get slower and slower in comparison.
Obviously, as the size increases you have to do more work.
It’s not linear.
But, ours is clearly 3-4x slower than that.</p>
<a name="A-plan-for-improvement"></a>
<h2>A plan for improvement</h2>
<p>We are going to write a faster one.
I am going to work with the class but I will not always tell you
the truth.
Let us derive together and have a conversation but if you go on a wrong path I might not
necessarily stop you.
In real life I’m not going to be there to stop you when you write code.
So let us try together.</p>
<a name="Requirements"></a>
<h3>Requirements</h3>
<p>First we will decide our requirements.
These might be more strict than possible,
so we can remove them later if needed.</p>
<ol>
<li><p>Maintain stability.</p>
<p> Otherwise we would just write quicksort.</p></li>
<li><p>Requires only <code>ForwardIterator</code>.</p>
<p> <code>InputIterator</code> is the only weaker thing, and it’s not enough for
sorting.</p></li>
<li><p>Use limited extra memory.</p>
<p> The requirement of no extra memory at all is bogus.
With modern systems we often have more memory than we need.
But, we often have to parallelize stuff,
so, we cannot put too much data on the processor.
We want data to be able to fit in the cache.
One way to do that is by exposing control.</p></li>
<li><p>Fast.</p>
<p> What would be good?
Within 5-10% of the <code>std::stable_sort</code>.
What would be great?
Observe STL also has <a href="https://en.cppreference.com/w/cpp/algorithm/sort"><code>std::sort</code></a> because it’s faster.
If it’s as fast as sort, that would be great.<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup>.</p></li>
</ol>
<a name="Ideas-to-explore"></a>
<h3>Ideas to explore</h3>
<p>So we are going to try to make it fast, using all kinds of brilliant algorithm
ideas.
Let me tell you one great secret.
When designing algorithms there is a huge space of possible inputs.
You might have really big structure.
Or very expensive comparisons.
Should you concentrate on those?
No. First, make your algorithm
work really fast for what I call a common case, such as <code>double</code>.
You don’t know how big integers are, but
the size of double is fixed for centuries to come,
by IEEE standard.
But, integers, doubles, and pointers are all good.
<em>Always make your algorithms as fast as possible
for the most common case</em>.
Then, you can look at bigger things and make other tradeoffs.</p>
<p>What ideas can we come up with?</p>
<p><strong>Using insertion sort for small cases</strong></p>
<p>In <a href="https://en.wikipedia.org/wiki/Insertion_sort">insertion sort</a> you build up a sorted list by inserting elements
one at a time.
One part of the list is always sorted.
We take an element which is not inserted,
and linear search to find where it belongs.</p>
<pre><code>[ sorted | not sorted ]
</code></pre>
<p>What is the complexity of insertion sort?
It’s <code>O(n^2)</code>.
What is the complexity of merge sort, in terms of comparisons?
<code>O(n log(n))</code>.
What is the specific coefficient for insertion sort?
The sorted portion is on average half the length of the original input.
In addition, on average when we insert an element we
only have to go half the length before finding its location.
Therefore the complexity
is roughly: <code>n^2/4</code>.</p>
<p>Now we will solve for when it’s better to do one than the other.</p>
<pre><code>n^2/4 < n log(n)
n < 4 log(n)
2^n < n^4
</code></pre>
<p>So the cutoff is <code>n = 16</code>:</p>
<pre><code>2^16 = 65536 = 16^4
</code></pre>
<p>That’s what theory tells us.
What should we really do?
We should measure.
But, the theory tells us around which numbers to try (around 16).</p>
<p><strong>Binary insertion sort</strong></p>
<p>We can also binary search to find where to insert the element.
That will also make the cutoff point larger.
Is it going to be dramatically better?
In insertion sort, we already have to do a linear complexity
shift to make room.</p>
<p><strong>Get rid of recursion</strong></p>
<p>Perhaps we can use <code>binary_counter</code>.</p>
<p><strong>Write an excellent merge</strong></p>
<p>Basically we need an algorithm that will say,
“do this version if we have enough memory,
otherwise do that”.
The term is <em>adaptive</em>.</p>
<p>You might think everybody can write a good merge.
If you Google “std merge” it shows you <a href="https://web.archive.org/web/20130812111552/https://www.cplusplus.com/reference/algorithm/merge/">this</a>:
Since you’re a normal programmer
you might say, “oh it’s on the Web, therefore I can copy and paste and use
it in my code.”</p>
<pre><code>template <class InputIterator1, class InputIterator2, class OutputIterator>
OutputIterator merge (InputIterator1 first1, InputIterator1 last1,
InputIterator2 first2, InputIterator2 last2,
OutputIterator result)
{
while (true) {
*result++ = (*first2<*first1)? *first2++ : *first1++;
if (first1==last1) return std::copy(first2,last2,result);
if (first2==last2) return std::copy(first1,last1,result);
}
}
</code></pre>
<p>This is a broken piece of code.
An empty range will melt the computer.</p>
<p><strong>Faster rotate</strong></p>
<p>What is the complexity of <code>std::rotate</code>?
It’s tricky, because it depends on the kind of iterators which you have.
With <code>RandomAccessIterator</code> the best theoretical algorithm
does <code>n + gcd(n_1, n_2)</code> assignments, not swaps.
On average GCD is small, but larger than one.
So we can get almost to <code>n</code> assignments, which is a lot better than <code>n</code> swaps.
For <code>ForwardIterator</code> it happens to be <code>n - gcd(n_1, n_2)</code> swaps.
It is roughly <code>n</code> swaps for <code>BidirectionalIterator</code> (<code>3n</code> assignments).</p>
<p>As we will observe, we can use a faster rotate than the rotate in STL because we have this additional storage.
If you want to rotate and you have enough storage, then you only need <code>n + n_1</code> assignments which is for sure less than <code>3n</code>.</p>
<a name="First-steps"></a>
<h3>First steps</h3>
<p>There’s lots of things to do.
How should we go about it?
The problem with programming, specifically designing components
and composing the system, is that you do not know what is right in isolation.
You never know what the correct interface is
until you see it in other algorithms,
and you see how those are used.
This is why you just have to try things and ideas start emerging.</p>
<p>You might think it’s an infinite process.
No it’s not infinite, that’s the wonderful thing about life.
It sort of terminates (I cannot prove it of course).
In practice if you start fitting things together you sort of discover what you need to return,
what you need to pass,
what is the right thing to do, and that’s what
I am trying to teach.</p>
<p>When should we try insertion sort?
As a rule we want to fix the asymptotic complexity.
Doing insertion sort at the bottom won’t help that.
Right now we have a problem with our asymptotic complexity.
It’s <code>O(n log^2(n))</code>.
We want to get rid of that square as fast as possible.</p>
<p>I’m very lazy.
So, we saw how fast we can get using no memory.
What about if we had infinite memory?
How much extra memory do I really need?
I think we can do it in <code>n/2</code>.</p>
<div class="footnotes">
<hr/>
<ol>
<li id="fn:1">
AMD Ryzen 5 2400G (8 core, 3.6 GHz). GCC 9.3.0<a href="#fnref:1" rev="footnote">↩</a></li>
<li id="fn:2">
Alex: It has been my dream for decades
to make my stable sort as fast as my sort,
at which point I could throw away the other and just have one sort.
But, I make progress with stable sort, then I make progress with sort.<a href="#fnref:2" rev="footnote">↩</a></li>
</ol>
</div>
<span style="float: right">
[
<a href="15_merge_inplace.html">previous</a>,
<a href="index.html">contents</a>,
<a href="17_adaptive_merge_sort.html">next</a>
]
</span>
</body>
</html>