I’ve been hard at work fleshing out my range library and writing a proposal to get range support into the standard. That proposal describes a foundational range concept: Iterable. An Iterable is anything we can pass to std::begin()
and std::end()
to get an Iterator/Sentinel pair. Sentinels, as I described here earlier this year, make it possible for the Iterable concept to efficiently describe other kinds of ranges besides iterator pairs.
The three types of ranges that we would like the Iterable concept to be able to efficiently model are:
- Two iterators
- An iterator and a predicate
- An iterator and a count
The Iterator/Sentinel abstraction is what makes it possible for the algorithms to handle these three cases with uniform syntax. However, as Sean Parent pointed out here, the third option presents challenges when trying to make some algorithms optimally efficient. Back in February, when Sean offered his criticism, I promised to follow up with a blog post that justified the design. This is that post.
Note 1: I’ve changed terminology since the February posts. In those posts, Iterable represented a range where the begin
and end
have different types, and Range is an Iterable where they’re the same. In my current proposal, Iterable is more or less as it was before, but Range is now an Iterable that doesn’t own its elements.
Note 2: This post uses the syntax of Concepts Lite, which has not been adopted yet. Everything in this post is implementable in C++11 using my library for Concepts Lite emulation, which I describe here.
Counted Ranges
Counted ranges, formed by specifying a position and a count of elements, have iterators — as all Iterables do. The iterators of a counted range must know the range’s extent and how close they are to reaching it. Therefore, the counted range’s iterators must store both an iterator into the underlying sequence and a count — either a count to the end or a count from the front. Here is one potential design:
class counted_sentinel {}; template<WeakIterator I> class counted_iterator { I it_; DistanceType<I> n_; // distance to end public: // ... constructors... using iterator_category = typename iterator_traits<I>::iterator_category; decltype(auto) operator*() const { return *it_; } counted_iterator & operator++() { ++it_; --n_; return *this; } friend bool operator==(counted_iterator const & it, counted_sentinel) { return it.n_ == 0; } // ... other operators... }; template<WeakIterator I> class counted_range { I begin_; DistanceType<I> count_; public: // ... constructors ... counted_iterator<I> begin() const { return {begin_, count_}; } counted_sentinel end() const { return {}; } };
There are some noteworthy things about the code above. First, counted_iterator
bundles an iterator and a count. Right off, we see that copying counted iterators is going to be more expensive, and iterators are copied frequently. A mitigating factor is that the sentinel is empty. Passing a counted_iterator
and a counted_sentinel
to an algorithm copies as much data as passing an iterator and a count. When passed separately, the compiler probably has an easier time fitting them in registers, but some modern compilers are capable passing the members of a struct in registers. This compiler optimization is sometimes called Scalar Replacement of Aggregates1, 2 and is known to be implemented in gcc and LLVM (see this recent LLVM commit for example).
Also, incrementing a counted iterator is expensive: it involves incrementing the underlying iterator and decrementing the internal count. To see why this is potentially expensive, consider the trivial case of passing a counted_iterator<list<int>::iterator>
to advance
. That counted iterator type is bidirectional, and advance
must increment it n times:
template<BidirectionalIterator I> void advance(I & i, DistanceType<I> n) { if(n >= 0) for(; n != 0; --n) ++i; else for(; n != 0; ++n) --i; }
Notice that for each ++i
or --i
here, two increments or decrements are happening when I
is a counted_iterator
. This is sub-optimal. A better implementation for counted_iterator
is:
template<BidirectionalIterator I> void advance(counted_iterator<I> & i, DistanceType<I> n) { i.n_ -= n; advance(i.it_, n); }
This has a noticeable effect on the generated code. As it turns out, advance
is one of the relatively few places in the standard library where special handling of counted_iterator
is advantageous. Let’s examine some algorithms to see why that’s the case.
Single-Pass Algorithms with Counted Iterators
First, let’s look at a simple algorithm like for_each
that makes exactly one pass through its input sequence:
template<InputIterator I, Regular S, Function<ValueType<I>> F> requires EqualityComparable<I, S> I for_each(I first, S last, F f) { for(; first != last; ++first) f(*first); return first; }
When passed counted iterators, at each iteration of the loop, we do an increment, a decrement (for the underlying iterator and the count), and a comparison. Let’s compare this against a hypothetical for_each_n
algorithm that takes the underlying iterator and the count separately. It might look like this:
template<InputIterator I, Function<ValueType<I>> F> I for_each_n(I first, DifferenceType<I> n, F f) { for(; n != 0; ++first, --n) f(*first); return first; }
For the hypothetical for_each_n
, at each loop iteration, we do an increment, a decrement, and a comparison. That’s exactly as many operations as for_each
does when passed counted iterators. So a separate for_each_n
algorithm is probably unnecessary if we have sentinels and counted_iterator
s. This is true for any algorithm that makes only one pass through the input range. That turns out to be a lot of algorithms.
Multi-Pass Algorithms with Counted Iterators
There are other algorithms that make more than one pass over the input sequence. Most of those, however, use advance
when they need to move iterators by more than one hop. Once we have specialized advance
for counted_iterator
, those algorithms that use advance
get faster without any extra work.
Consider partition_point
. Here is one example implementation, taken from libc++ and ported to Concepts Lite and sentinels:
template<ForwardIterator I, Regular S, Predicate<ValueType<I>> P> requires EqualityComparable<I, S> I partition_point(I first, S last, P pred) { DifferenceType<I> len = distance(first, last); while (len != 0) { DifferenceType<I> l2 = len / 2; I m = first; advance(m, l2); if (pred(*m)) { first = ++m; len -= l2 + 1; } else len = l2; } return first; }
Imagine that I
is a forward counted_iterator
and S
is a counted_sentinel
. If the library does not specialize advance
, this is certainly inefficient. Every time advance
is called, needless work is being done. Compare it to a hypothetical partition_point_n
:
template<ForwardIterator I, Predicate<ValueType<I>> P> I partition_point_n(I first, DifferenceType<I> len, P pred) { while (len != 0) { DifferenceType<I> l2 = len / 2; I m = first; advance(m, l2); if (pred(*m)) { first = ++m; len -= l2 + 1; } else len = l2; } return first; }
The first thing we notice is that partition_point_n
doesn’t need to call distance
! The more subtle thing to note is that calling partition_point_n
with a raw iterator and a count saves about O(N) integer decrements over the equivalent call to partition_point
with counted_iterator
s … unless, of course, we have specialized the advance
algorithm as shown above. Once we have, we trade the O(N) integer decrements for O(log N) integer subtractions. That’s a big improvement.
But what about the O(N) call to distance
? Actually, that’s easy, and it’s the reason why I introduced a concept called SizedIteratorRange. counted_iterator
stores the distance to the end. So the distance between a counted_iterator
and a counted_sentinel
(or between two counted_iterators
) is known in O(1) regardless of the iterator’s category. The SizedIteratorRange concept tests whether an iterator I
and a sentinel S
can be subtracted to get the distance. This concept is modeled by random-access iterators by their nature, but also by counted iterators and their sentinels. The distance
algorithm is specialized for SizedIteratorRange, so it is O(1) for counted iterators.
With these changes, we see that partition_point
with counted iterators is very nearly as efficient as a hypothetical partition_point_n
would be, and we had to make no special accommodations. Why can’t we make partition_point
exactly as efficient as partition_point_n
? When partition_point
is called with a counted iterator, it also returns a counted iterator. Counted iterators contain two datums: the position and distance to the end. But when partition_point_n
returns just the position, it is actually computing and returning less information. Sometimes users don’t need the extra information. But sometimes, after calling partition_point_n
, the user might want to pass the resulting iterator to another algorithm. If that algorithm calls distance
(like partition_point
and other algorithms do), then it will be O(N). With counted iterators, however, it’s O(1). So in the case of partition_point
, counted iterators cause the algorithm to do O(log N) extra work, but it sometimes saves O(N) work later.
To see an example, imagine a trivial insertion_sort
algorithm:
template<ForwardIterator I, Regular S> requires EqualityComparable<I, S> && Sortable<I> // from N3351 void insertion_sort(I begin, S end) { for(auto it = begin; it != end; ++it) { auto insertion = upper_bound(begin, it, *it); rotate(insertion, it, next(it)); } }
Imagine that I
is a counted_iterator
. The first thing upper_bound
does is call distance
. Making distance
O(1) for counted_iterator
s saves N calls of an O(N) algorithm. To get comparable performance for an equivalent procedure in today’s STL, users would have to write a separate insertion_sort_n
algorithm that dispatches to an upper_bound_n
algorithm — that they would also need to write themselves.
Counted Algorithms with Counted Iterators
We’ve seen that regular algorithms with counted iterators can be made nearly as efficient as dedicated counted algorithms, and that sometimes we are more than compensated for the small performance loss. All is not roses, however. There are a number of counted algorithms in the standard (the algorithms whose names end with _n
). Consider copy_n
:
template<WeakInputIterator I, WeakOutputIterator<ValueType<I>> O> pair<I, O> copy_n(I in, DifferenceType<I> n, O out) { for(; n != 0; ++in, ++out, --n) *out = *in; return {in, out}; }
(We’ve changed the return type of copy_n
so as not to lose information.) If I
is a counted iterator, then for every ++in
, an increment and a decrement are happening, and in this case the extra decrement is totally unnecessary. For any counted (i.e., _n
) algorithm, something special needs to be done to keep the performance from degrading when passed counted iterators.
The algorithm author has two options here, and neither of them is ideal.
Option 1: Overload the algorithm
The following is an optimized version of copy_n
for counted iterators:
template<WeakInputIterator I, WeakOutputIterator<ValueType<I>> O> pair<I, O> copy_n(counted_iterator<I> in, DifferenceType<I> n, O out) { for(auto m = in.n_ - n; in.n_ != m; ++in.i_, --in.n_, ++out) *out = *in; return {in, out}; }
Obviously, creating an overload for counted iterators is unsatisfying.
Option 2: Separate the iterator from the count
This option shows how a library implementer can write just one version of copy_n
that is automatically optimized for counted iterators. First, we need to provide two utility functions for unpacking and repacking counted iterators:
template<WeakIterator I> I uncounted(I i) { return i; } template<WeakIterator I> I uncounted(counted_iterator<I> i) { return i.it_; } template<WeakIterator I> I recounted(I const &, I i, DifferenceType<I>) { return i; } template<WeakIterator I> counted_iterator<I> recounted(counted_iterator<I> const &j, I i, DifferenceType<I> n) { return {i, j.n_ - n}; }
With the help of uncounted
and recounted
, we can write an optimized copy_n
just once:
template<WeakInputIterator I, WeakOutputIterator<ValueType<I>> O> pair<I, O> copy_n(I in_, DifferenceType<I> n_, O out) { auto in = uncounted(in_); for(auto n = n_; n != 0; ++in, --n, ++out) *out = *in; return {recounted(in_, in, n_), out}; }
This version works optimally for both counted and non-counted iterators. It is not a thing of beauty, however. It’s slightly annoying to have to do the uncounted
/recounted
dance, but it’s mostly needed only in the counted algorithms.
As a final note, the overload of advance
for counted iterators can be eliminated with the help of uncounted
and recounted
. After all, advance
is a counted algorithm.
Benchmark: Insertion Sort
To test how expensive counted ranges and counted iterators are, we wrote a benchmark. The benchmark pits counted ranges against a dedicated _n
algorithm for Insertion Sort. The program is listed in this gist.
The program implements both insertion_sort_n
, a dedicated counted algorithm, and insertion_sort
, a general algorithm that accepts any Iterable, to which we pass a counted range. The latter is implemented in terms of the general-purpose upper_bound
as provided by the Range v3 library, whereas the former requires a dedicated upper_bound_n
algorithm, which is also provided.
The test is run both with raw pointers (hence, random-access) and with an iterator wrapper that only models ForwardIterator. Each test is run three times, and the resulting times are averaged. The test was compiled with g++
version 4.9.0 with -O3 -std=gnu++11 -DNDEBUG
and run on a Linux machine. The results are reported below, for N == 30,000:
insertion_sort_n
|
insertion_sort
|
|
---|---|---|
random-access | 2.692 s | 2.703 s |
forward | 23.853 s | 23.817 s |
The performance difference, if there is any, is lost in the noise. At least in this case, with this compiler, on this hardware, there is no performance justification for a dedicated _n
algorithm.
Summary
In short, counted iterators are not a perfect abstraction. There is some precedent here. The iterators for deque
, and for any segmented data structure, are known to be inefficient (see Segmented Iterators and Hierarchical Algorithms, Austern 1998). The fix for that problem, new iterator abstractions and separate hierarchical algorithm implementations, is invasive and is not attempted in any STL implementation I am aware of. In comparison, the extra complications that come with counted iterators seem quite small. For segmented iterators, the upside was the simplicity and uniformity of the Iterator abstraction. In the case of counted ranges and iterators, the upside is the simplicity and uniformity of the Iterable concept. Algorithms need only one form, not separate bounded, counted, and sentinel forms. The benchmark gives me some reasonable assurance that we aren’t sacrificing too much performance for the sake of a unifying abstraction.