-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path13_searching.html
447 lines (374 loc) · 18.2 KB
/
13_searching.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width">
<title>13. Searching</title>
<link rel="stylesheet" href="template/style.css">
</head>
<body>
<span style="float: right">
[
<a href="12_merge_sort.html">previous</a>,
<a href="index.html">contents</a>,
<a href="14_binary_search.html">next</a>
]
</span>
<a name="L13.-Searching"></a>
<h1>13. Searching</h1>
<a name="History-of-binary-search"></a>
<h2>History of binary search</h2>
<p>The first time binary search was described as an algorithm,
it was done by a great American computer scientist and engineer <a href="https://en.wikipedia.org/wiki/John_Mauchly">John Mauchly</a>.
Of course, you have never heard of him.
He was the guy who invented first general purpose computer, but we
don’t remember people like that.
In 1946 he gave a brilliant series of
lectures at the <a href="https://en.wikipedia.org/wiki/Moore_School_of_Electrical_Engineering">Moore School</a> at <a href="https://en.wikipedia.org/wiki/University_of_Pennsylvania">Pennsylvania University</a> on programming.
For the first time, he described things like merge, merge sort, and binary search.
This is not a bad thing to be the first person to describe.
He designed <a href="https://en.wikipedia.org/wiki/ENIAC">ENIAC</a> which should make him very famous.
Indeed, he did some very fundamental work.</p>
<p>Then comes this interesting fact (from “The Art of Computer Programming”)
It takes about 15 years for people to come up with binary search
which sort of works for all possible inputs.
Apparently people didn’t have trouble coding binary search when the length is of the form <code>2^(n-1)</code>.
Because it’s easy, you take the middle element and then both sides will be of the same form and you can keep dividing.
Apparently people couldn’t do it.
Knuth claims that the first correct implementation was done by <a href="https://en.wikipedia.org/wiki/D._H._Lehmer">D.H. Lehmer</a>.
He is someone you should know about as a very great computer scientist.
He did amazing amount of work on computational number theory,
like sieves for discovering large primes and many other important things.
Among other things, he published a binary search which at least always terminated.</p>
<p>I actually disagree with Knuth slightly
and claim that the first correct binary search was published roughly at the same time,
but a couple of years after, by a German computer scientist.
Once again, he is unjustly forgotten.
He does not appear on Wikipedia<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup>.
His name is <a href="https://en.wikipedia.org/wiki/Hermann_Bottenbruch">Herman Bottenbruch</a>.
His claim to fame is he was one of the people who invented <a href="https://en.wikipedia.org/wiki/ALGOL_58">Algol 58</a>, the predecessor of Algol 60.
He is one of the people who tried unsuccessfully to convince American
delegates to Algol 58 committee that they should introduce block structures.
He was one of the inventors of blocks.
American representatives which included such brilliant people as <a href="https://en.wikipedia.org/wiki/John_Backus">John Backus</a>
and <a href="https://en.wikipedia.org/wiki/Alan_Perlis">Alan Perlis</a> actually rejected it as too hard to implement.
They didn’t know how to do stacks.
But sadly enough he doesn’t get much credit, especially credit for correct binary search.
We will be actually studying his version.</p>
<a name="bsearch-is-wrong"></a>
<h2>bsearch is wrong</h2>
<p>If we think about merging two sequences of roughly the same length,
or rather exactly the same length <code>n</code>, the expected number of comparisons is going to be <code>2n - 1</code>.
From which follows a conjecture.
If we have sequences of size <code>n</code> and size <code>m</code> the number of comparisons should be <code>n + m - 1</code>.
Not every conjecture is true however.
This one is definitely false.
Here is a simple counterexample.
Take a sequence of length 1000 and a sequence of length 1.
We only need <code>log(1000)</code> because we can binary search for its index.</p>
<p>So there is a fundamental possibility
for using binary search for merging, dramatically reducing the number of comparisons.
<code>log(n)</code> is much smaller than <code>n</code>.</p>
<p>You might think we can just use binary search from a standard library, such as C <a href="https://man7.org/linux/man-pages/man3/bsearch.3.html"><code>bsearch(3)</code></a>.
Sounds like a plausible idea.
It was written by great UNIX guys.
They know something about programming, so let us see what they provide us with (see <code>man 3 bsearch</code>):</p>
<pre><code>void* bsearch(
const void* key,
const void* base,
size_t nmemb,
size_t size,
int (*compare)(const void*, const void*)
);
</code></pre>
<p>Notice it takes all these parameters, and it’s a little messy because it’s C.
Components are hard for them<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup>.
Nevermind what it takes.
What’s interesting is what it returns.</p>
<blockquote><p>returns a pointer to a matching member of the array, or NULL if no match is found.</p></blockquote>
<p>So for our merge, it will most often return <code>NULL</code>.
At which point, you will have to do linear search.
So observe, ancient interface, done by brilliant people, in the standard library, and it’s utterly useless.</p>
<p>Even if we are so fortunate as to get a pointer to an element back, does it help with merge?
Especially if we want to make it stable?
No.</p>
<blockquote><p>If there are multiple elements that match the key, the element returned is unspecified.</p></blockquote>
<p>You will read similar things in many algorithms books.
It’s a typical story for binary search.
Even when the book is written by famous people.
I’ll show you how to write it.</p>
<a name="What-is-correct-code-3f-"></a>
<h3>What is correct code?</h3>
<p>Here comes another philosophical point.
<em>What does wrong mean?</em>
<em>What does incorrect mean?</em>
At school they told you that the program is incorrect when <em>it doesn’t satisfy its specifications</em>.</p>
<p>Well then <code>bsearch</code> is a correct program.
I looked at the source, it does do what it promises to do.
It will return <code>NULL</code>.
I wish it were not correct.
I wish it returned something useful.</p>
<p>Correctness is a deeper concept than just satisfying specification.
Well in reality, as you guys know, it must be deeper because you haven’t got any specifications.
When you write code, it’s not that you are given specifications and need to encode them.
I suspect that has never happened in your life, nor will it happen in any foreseeable future.
But you still have to attempt to do something which is correct.</p>
<p>Of course the people who advocate writing specifications will say yes, first
you will write specification, and then implement specification.
But, it’s not going to help.
Because, if you write wrong specification, you are the same guy who is going to write the implementation.
Most likely it will not make it correct.</p>
<p>So it’s a deeper thing.
You have to establish correctness from more fundamental principles.
<em>The program is correct if it returns desirable information</em>,
if it does what it’s supposed to do in some absolute sense.
It’s very hard to prove it.</p>
<p>I think one of the lessons of this particular lecture is how hard simple things are.
lots of very bright people cannot give it a correct interface. Same with <code>bsearch</code>.</p>
<p>You might say, “Alex just talks about his beef with the standard committee.”
No.
What I’m trying to tell you is that when you write things like that in your code,
There will be some other guy using your code.
Always think about that other guy.
The great flaw in most code I see is there is no consideration for the other guy.
People think, “oh it works, so it’s done.”
My dream is that we all write code thinking about other people.
Then you say, “well, then I have to do more work.”
This is the beauty of sharing.
You might have attended kindergarten and had a teacher that taught you it’s good to share toys.
She was right.</p>
<a name="Linear-search"></a>
<h2>Linear search</h2>
<p>It would be very contrary to the way I do things to start with binary search.
How could we do binary search if we cannot do linear search?
In STL it is called <a href="https://en.cppreference.com/w/cpp/algorithm/find"><code>std::find</code></a> or <code>std::find_if</code><sup id="fnref:3"><a href="#fn:3" rel="footnote">3</a></sup>.</p>
<p>Let’s see how to write it.
We can assume we know how to do it, and start from the top,
or we could assume we don’t know what we are doing,
which is usually the case when starting new things.
I seldom start writing code from the signature.
I don’t know what the signature is.
I typically have some algorithmic idea, often an inner loop, so I start with that,
Then write code inside out:</p>
<pre><code>while (first != last && ... find the element...) ++first;
</code></pre>
<p>Now write the signature:</p>
<pre><code>template <typename I, typename P>
// I is an InputIterator
// P is a unary predicate
I find_if(I first, I last, P pred) {
// [first, last) is a valid range.
while (first != last && !pred(*first)) ++first;
return first;
}
</code></pre>
<p><code>I</code> is an <code>InputIterator</code> instead of a <code>ForwardIterator</code>
because it is single pass.</p>
<a name="Trimming-the-standard"></a>
<h3>Trimming the standard</h3>
<p>One of the mistakes which frequently happens is people use the principle of <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam’s Razor</a> and say, “we need to only have one <code>find_if</code>”.
That’s what happened.
After I submitted STL it had many fine functions,
but Bjarne was very afraid that STL was too large and would not be accepted, as is.
(It wasn’t that enormous at that point.)
He said, “why don’t I come to Palo Alto (I was at HP Labs)
and bring along bunch of other standard committee people and we will trim it”.
Trimming was a sad thing.
Imagine somebody coming with a knife and cutting pieces of your flesh.
One of the things he said was there should be only one <code>find_if</code>.</p>
<a name="Helper-functions"></a>
<h3>Helper functions</h3>
<p>So what else used to exist?
Everytime I program I put this function back.
It takes about 30 seconds,
it’s called <code>find_if_not</code>.
Sometimes you want guys who are good,
and sometimes bad.
Couldn’t you negate a predicate and use the same function?
Well, I did, I added <a href="https://en.cppreference.com/w/cpp/utility/functional/not1"><code>std::not1</code></a>
and <a href="https://en.cppreference.com/w/cpp/utility/functional/not2"><code>std::not2</code></a>.
But, it’s annoying.
Fortunately, I think the
standard committee in C++11 put it back in the standard.
But they didn’t do all of them,
and some of them are not correct.</p>
<pre><code>template <typename I, typename P>
// I is an InputIterator
// P is a unary predicate
I find_if_not(I first, I last, P pred) {
// [first, last) is a valid range.
while (first != last && pred(first)) ++first;
return first;
}
</code></pre>
<p>Somebody asked me in 1994
why I didn’t write <code>std::all_of</code>, <code>std::none_of</code>, <code>std::any_of</code>
which are now in C++11.
My attitude was I write algorithms,
not wrappers.
They are just wrappers<sup id="fnref:4"><a href="#fn:4" rel="footnote">4</a></sup>.</p>
<pre><code>template <typename I, typename P>
// I is an InputIterator
// P is a unary predicate
inline
bool all_of(I first, I last, P pred) {
return find_if_not(first, last, pred) == last;
}
template <typename I, typename P>
// I is an InputIterator
// P is a unary predicate
inline
bool none_of(I first, I last, P pred) {
return find_if(first, last, pred) == last;
}
template <typename I, typename P>
// I is an InputIterator
// P is a unary predicate
inline
bool any_of(I first, I last, P pred) {
return find_if(first, last, pred) != last;
}
</code></pre>
<a name="Bounded-and-counted-ranges"></a>
<h2>Bounded and counted ranges</h2>
<p>Once upon a time I believed ranges come in two kinds<sup id="fnref:5"><a href="#fn:5" rel="footnote">5</a></sup>.</p>
<ol>
<li><strong>Bounded ranges</strong>: Ranges bounded by a first and last iterator/pointer.</li>
<li><strong>Counted ranges</strong>: Ranges constructed from a first pointer/iterator and an
integer <code>N</code>.</li>
</ol>
<p>Which one is better?
Both are good, and both are different.
You cannot say one is better than the other.
It depends on the algorithm.
There used to be more bounded range algorithms in the STL, but they were taken out.
For example we have <a href="https://en.cppreference.com/w/cpp/algorithm/copy"><code>std::copy</code></a> and
<a href="https://en.cppreference.com/w/cpp/algorithm/copy_n"><code>std::copy_n</code></a>
both are really convenient.</p>
<p>Let’s look at the interface for <code>std::copy_n</code>.</p>
<pre><code>template<typename InputIt, typename Size, typename OutputIt>
OutputIt copy_n(InputIt first, Size count, OutputIt result);
</code></pre>
<p>This is actually wrong.
You have to return where the output iterator advances to,
because it might be gone after the function operates
internally, and there is no way to get it back.
But, it’s not enough.
It needs to also return where the first ends.
There is a general principle, you should be able to restart the algorithm
where you left off.
You copy something, then you want to copy more.
So we need to think carefully about what you return.
You might think everyone gets it, but
that code was reviewed by hundreds of people,
and not noticed.</p>
<p>So let’s try writing <code>find_if_n</code></p>
<pre><code>template <typename I, typename N, typename P>
// I is an InputIterator
// N is an integral type
// P is a unary predicate
std::pair<I, N> find_if_n(I first, N n, P pred) {
while (n != N(0) && !pred(first)) {
--n;
++first;
}
return std::make_pair(first, n);
}
</code></pre>
<a name="Advance-and-distance-functions"></a>
<h2>Advance and distance functions</h2>
<p>How does <a href="https://en.cppreference.com/w/cpp/iterator/advance"><code>std::advance</code></a> work?
It was introduced by me to allow us to do long or fast thing depending on iterator type.
For a pointer it will translate to one instruction.
It’s going to be fast.
In the case of a linked list, it’s going to be linear time.
Here is how it works:</p>
<pre><code>template<typename I, typename N>
inline
void advance(I& first, N n) {
advance(first, n, std::iterator_traits<I>::iterator_category);
}
template<typename I, typename N>
inline
void advance(I& first, N n, std::input_iterator_tag) {
while (n > 0) {
++first;
--n;
}
}
template<typename I, typename N>
inline
void advance(I& first, N n, std::random_access_iterator_tag) {
first += n;
}
</code></pre>
<p>The dispatch between these is clearly done at compile time.
I could have had every iterator have <code>+=</code> and <code>-</code> operators.
My thinking was that people have an expectation that those symbols are fast.
Making it linear time will confuse many people.<sup id="fnref:6"><a href="#fn:6" rel="footnote">6</a></sup></p>
<p>Now let’s implement <a href="https://en.cppreference.com/w/cpp/iterator/distance"><code>std::distance</code></a>:</p>
<pre><code>template<typename I>
inline
typename std::iterator_traits<I>::difference_type distance(I& first, I& last) {
return distance(first, last, std::iterator_traits<I>::iterator_category);
}
template<typename I>
inline
typename std::iterator_traits<I>::difference_type distance(I& first, I& last, std::input_iterator_tag) {
typename std::iterator_traits<I>::difference_type n;
while (first != last) {
++first;
++n;
}
return n;
}
template<typename I>
inline
typename std::iterator_traits<I>::difference_type distance(I& first, I& last, std::random_access_iterator_tag) {
return last - first;
}
</code></pre>
<a name="Code"></a>
<h2>Code</h2>
<ul>
<li><a href="code/algorithm.h">algorithm.h</a></li>
<li><a href="code/search.h">search.h</a></li>
</ul>
<div class="footnotes">
<hr/>
<ol>
<li id="fn:1">
He has a Wikipedia page <a href="https://en.wikipedia.org/wiki/Hermann_Bottenbruch">now</a>!<a href="#fnref:1" rev="footnote">↩</a></li>
<li id="fn:2">
C does not have a type safe form of generics like <code>template</code>.
This makes it difficult to write reusable components in the way Alex teaches.
The workaround used by <code>bsearch</code> and <code>qsort</code> is to return <code>void *</code> which is a pointer to any type.<a href="#fnref:2" rev="footnote">↩</a></li>
<li id="fn:3">
Alex: The name is stolen from Common Lisp <a href="http://clhs.lisp.se/Body/f_find_.htm"><code>find</code></a>.
Always try to borrow from some place.
Originality is frowned upon,
especially for naming.
Everyone loves to make non-standard names.<a href="#fnref:3" rev="footnote">↩</a></li>
<li id="fn:4">
I find this comment amusing because
Alex just got done talking about why <code>find_if_not</code> was such a helpful contribution
It’s not clear how to reconcile his advice to carefully considering convenience of user,
with this comment about not wanting to provide convenient interfaces for algorithms.<a href="#fnref:4" rev="footnote">↩</a></li>
<li id="fn:5">
These two kinds of ranges are discussed in depth in “Elements of Programming” chapter 6.<a href="#fnref:5" rev="footnote">↩</a></li>
<li id="fn:6">
Alex: I am not saying I was right, because when we were writing
“Elements of Programming”, Paul and I, we decided to abandon advance and distance,
and just say that depending on iterator category,
the complexity of these operators change.<a href="#fnref:6" rev="footnote">↩</a></li>
</ol>
</div>
<span style="float: right">
[
<a href="12_merge_sort.html">previous</a>,
<a href="index.html">contents</a>,
<a href="14_binary_search.html">next</a>
]
</span>
</body>
</html>