Channel: std – Eric Niebler

Ranges in C++: Counted Iterables and Efficiency


I’ve been hard at work fleshing out my range library and writing a proposal to get range support into the standard. That proposal describes a foundational range concept: Iterable. An Iterable is anything we can pass to std::begin() and std::end() to get an Iterator/Sentinel pair. Sentinels, as I described here earlier this year, make it possible for the Iterable concept to efficiently describe other kinds of ranges besides iterator pairs.

The three types of ranges that we would like the Iterable concept to be able to efficiently model are:

  1. Two iterators
  2. An iterator and a predicate
  3. An iterator and a count

The Iterator/Sentinel abstraction is what makes it possible for the algorithms to handle these three cases with uniform syntax. However, as Sean Parent pointed out here, the third option presents challenges when trying to make some algorithms optimally efficient. Back in February, when Sean offered his criticism, I promised to follow up with a blog post that justified the design. This is that post.
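As a rough illustration of the uniform syntax, here is a minimal C++17 sketch in which one algorithm, written once against an iterator/sentinel pair, handles all three kinds of range. The names `sketch`, `null_sentinel`, and `count_elements` are hypothetical and chosen for this example only; the `counted_iterator` here is a simplified stand-in, not the proposal's actual type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// "Iterator and predicate": the end is wherever *p is the terminating zero.
struct null_sentinel {};
inline bool operator==(char const* p, null_sentinel) { return *p == '\0'; }
inline bool operator!=(char const* p, null_sentinel s) { return !(p == s); }

// "Iterator and count": the iterator carries the distance to the end.
struct counted_sentinel {};

template<class I>
struct counted_iterator {
    I it;              // position in the underlying sequence
    std::ptrdiff_t n;  // distance to end
    decltype(auto) operator*() const { return *it; }
    counted_iterator& operator++() { ++it; --n; return *this; }
    friend bool operator==(counted_iterator const& a, counted_sentinel) { return a.n == 0; }
    friend bool operator!=(counted_iterator const& a, counted_sentinel s) { return !(a == s); }
};

// One algorithm for all three cases: it only needs ++first and first != last.
template<class I, class S>
std::ptrdiff_t count_elements(I first, S last) {
    std::ptrdiff_t n = 0;
    for (; first != last; ++first)
        ++n;
    return n;
}

inline void demo() {
    std::vector<int> v{1, 2, 3, 4};
    // 1. Two iterators
    assert(count_elements(v.begin(), v.end()) == 4);
    // 2. An iterator and a predicate
    char const* s = "hello";
    assert(count_elements(s, null_sentinel{}) == 5);
    // 3. An iterator and a count
    counted_iterator<std::vector<int>::iterator> first{v.begin(), 3};
    assert(count_elements(first, counted_sentinel{}) == 3);
}

} // namespace sketch
```

The sentinel types are empty, so the "end" of the second and third kinds of range costs nothing to pass around.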

Note 1: I’ve changed terminology since the February posts. In those posts, Iterable represented a range where the begin and end have different types, and Range is an Iterable where they’re the same. In my current proposal, Iterable is more or less as it was before, but Range is now an Iterable that doesn’t own its elements.

Note 2: This post uses the syntax of Concepts Lite, which has not been adopted yet. Everything in this post is implementable in C++11 using my library for Concepts Lite emulation, which I describe here.

Counted Ranges

Counted ranges, formed by specifying a position and a count of elements, have iterators — as all Iterables do. The iterators of a counted range must know the range’s extent and how close they are to reaching it. Therefore, the counted range’s iterators must store both an iterator into the underlying sequence and a count — either a count to the end or a count from the front. Here is one potential design:

class counted_sentinel
{};

template<WeakIterator I>
class counted_iterator
{
    I it_;
    DistanceType<I> n_; // distance to end
public:
    // ... constructors...
    using iterator_category =
        typename iterator_traits<I>::iterator_category;
    decltype(auto) operator*() const
    {
        return *it_;
    }
    counted_iterator & operator++()
    {
        ++it_; // advance the underlying iterator
        --n_;  // and shrink the distance to the end
        return *this;
    }
    friend bool operator==(counted_iterator const & it,
                           counted_sentinel)
    {
        return it.n_ == 0;
    }
    // ... other operators...
};

template<WeakIterator I>
class counted_range
{
    I begin_;
    DistanceType<I> count_;
public:
    // ... constructors ...
    counted_iterator<I> begin() const
    {
        return {begin_, count_};
    }
    counted_sentinel end() const
    {
        return {};
    }
};

There are some noteworthy things about the code above. First, counted_iterator bundles an iterator and a count. Right off, we see that copying counted iterators is going to be more expensive, and iterators are copied frequently. A mitigating factor is that the sentinel is empty. Passing a counted_iterator and a counted_sentinel to an algorithm copies as much data as passing an iterator and a count. When passed separately, the compiler probably has an easier time fitting them in registers, but some modern compilers are capable of passing the members of a struct in registers. This compiler optimization is sometimes called Scalar Replacement of Aggregates [1, 2] and is known to be implemented in gcc and LLVM (see this recent LLVM commit for example).
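For readers who want to try the design with a current compiler, the following is a minimal C++17 sketch of the same idea without Concepts Lite. The `sketch` namespace, the `difference_t` alias, and the `sum_first` helper are assumptions made for this example; the constructors are one plausible way to fill in the elided parts.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <type_traits>

namespace sketch {

template<class I>
using difference_t = typename std::iterator_traits<I>::difference_type;

struct counted_sentinel {};
static_assert(std::is_empty<counted_sentinel>::value,
              "the sentinel carries no data");

template<class I>
class counted_iterator {
    I it_;
    difference_t<I> n_; // distance to end
public:
    counted_iterator(I it, difference_t<I> n) : it_(it), n_(n) {}
    decltype(auto) operator*() const { return *it_; }
    counted_iterator& operator++() { ++it_; --n_; return *this; }
    friend bool operator==(counted_iterator const& it, counted_sentinel) { return it.n_ == 0; }
    friend bool operator!=(counted_iterator const& it, counted_sentinel s) { return !(it == s); }
};

template<class I>
class counted_range {
    I begin_;
    difference_t<I> count_;
public:
    counted_range(I first, difference_t<I> count) : begin_(first), count_(count) {}
    counted_iterator<I> begin() const { return {begin_, count_}; }
    counted_sentinel end() const { return {}; }
};

// Sums the first k elements of a list; k must not exceed the list's size.
inline int sum_first(std::list<int> const& xs, std::ptrdiff_t k) {
    int total = 0;
    // C++17 range-for accepts begin() and end() of different types.
    for (int x : counted_range<std::list<int>::const_iterator>(xs.begin(), k))
        total += x;
    return total;
}

inline void demo() {
    std::list<int> xs{1, 2, 3, 4, 5};
    assert(sum_first(xs, 3) == 6);
    assert(sum_first(xs, 0) == 0);
}

} // namespace sketch
```

Note that the range never touches `xs.end()`: the count alone decides where iteration stops.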

Also, incrementing a counted iterator is expensive: it involves incrementing the underlying iterator and decrementing the internal count. To see why this is potentially expensive, consider the trivial case of passing a counted_iterator<list<int>::iterator> to advance. That counted iterator type is bidirectional, and advance must increment it n times:

template<BidirectionalIterator I>
void advance(I & i, DistanceType<I> n)
{
    if(n >= 0)
        for(; n != 0; --n)
            ++i;
    else
        for(; n != 0; ++n)
            --i;
}

Notice that for each ++i or --i here, two increments or decrements are happening when I is a counted_iterator. This is sub-optimal. A better implementation for counted_iterator is:

template<BidirectionalIterator I>
void advance(counted_iterator<I> & i, DistanceType<I> n)
{
    i.n_ -= n;         // adjust the count once
    advance(i.it_, n); // then move only the underlying iterator
}

This has a noticeable effect on the generated code. As it turns out, advance is one of the relatively few places in the standard library where special handling of counted_iterator is advantageous. Let’s examine some algorithms to see why that’s the case.
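The overload can be tried out in C++17 roughly as follows. This is a sketch under assumptions: the `sketch` namespace is hypothetical, the counted iterator's members are public to keep the example short, and the underlying iterator is moved with `std::advance`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

namespace sketch {

template<class I>
struct counted_iterator {
    I it;              // position in the underlying sequence
    std::ptrdiff_t n;  // distance to end
    decltype(auto) operator*() const { return *it; }
    counted_iterator& operator++() { ++it; --n; return *this; }
    counted_iterator& operator--() { --it; ++n; return *this; }
};

// Generic fallback: one step at a time. For a counted iterator, every
// step would modify both the underlying iterator and the count.
template<class I>
void advance(I& i, std::ptrdiff_t n) {
    if (n >= 0)
        for (; n != 0; --n) ++i;
    else
        for (; n != 0; ++n) --i;
}

// Overload for counted iterators: a single subtraction fixes up the
// count, and only the underlying iterator is stepped.
template<class I>
void advance(counted_iterator<I>& i, std::ptrdiff_t n) {
    i.n -= n;
    std::advance(i.it, n);
}

inline void demo() {
    std::list<int> xs{10, 20, 30, 40, 50};
    counted_iterator<std::list<int>::iterator> ci{xs.begin(), 5};
    sketch::advance(ci, 3);  // picks the counted_iterator overload
    assert(*ci == 40 && ci.n == 2);
    sketch::advance(ci, -2); // moving backward grows the distance to end
    assert(*ci == 20 && ci.n == 4);
}

} // namespace sketch
```

Partial ordering of function templates selects the second overload whenever the argument is a `counted_iterator`, so callers do not have to opt in.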

Single-Pass Algorithms with Counted Iterators

First, let’s look at a simple algorithm like for_each that makes exactly one pass through its input sequence:

template<InputIterator I, Regular S,
         Function<ValueType<I>> F>
    requires EqualityComparable<I, S>
I for_each(I first, S last, F f)
{
    for(; first != last; ++first)
        f(*first);
    return first;
}

When passed counted iterators, at each iteration of the loop, we do an increment, a decrement (for the underlying iterator and the count), and a comparison. Let’s compare this against a hypothetical for_each_n algorithm that takes the underlying iterator and the count separately. It might look like this:

template<InputIterator I, Function<ValueType<I>> F>
I for_each_n(I first, DifferenceType<I> n, F f)
{
    for(; n != 0; ++first, --n)
        f(*first);
    return first;
}

For the hypothetical for_each_n, at each loop iteration, we do an increment, a decrement, and a comparison. That’s exactly as many operations as for_each does when passed counted iterators. So a separate for_each_n algorithm is probably unnecessary if we have sentinels and counted_iterators. This is true for any algorithm that makes only one pass through the input range. That turns out to be a lot of algorithms.
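To make the comparison concrete, here is a small C++17 sketch that runs both forms over the same data. The `sketch` namespace and the simplified `counted_iterator` are assumptions for this example; both loops perform one increment, one decrement, and one comparison per element.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>

namespace sketch {

struct counted_sentinel {};

template<class I>
struct counted_iterator {
    I it;
    std::ptrdiff_t n; // distance to end
    decltype(auto) operator*() const { return *it; }
    counted_iterator& operator++() { ++it; --n; return *this; }
    friend bool operator==(counted_iterator const& a, counted_sentinel) { return a.n == 0; }
    friend bool operator!=(counted_iterator const& a, counted_sentinel s) { return !(a == s); }
};

// General form: iterator plus sentinel.
template<class I, class S, class F>
I for_each(I first, S last, F f) {
    for (; first != last; ++first)
        f(*first);
    return first;
}

// Dedicated counted form: iterator plus count.
template<class I, class F>
I for_each_n(I first, std::ptrdiff_t n, F f) {
    for (; n != 0; ++first, --n)
        f(*first);
    return first;
}

inline void demo() {
    std::forward_list<int> xs{1, 2, 3, 4, 5};
    using It = std::forward_list<int>::iterator;

    int a = 0;
    auto end_a = sketch::for_each(counted_iterator<It>{xs.begin(), 3},
                                  counted_sentinel{}, [&](int x) { a += x; });
    int b = 0;
    auto end_b = sketch::for_each_n(xs.begin(), 3, [&](int x) { b += x; });

    assert(a == 6 && b == 6);   // same elements visited
    assert(end_a.it == end_b);  // same final position
    assert(end_a.n == 0);       // plus the remaining count, for free
}

} // namespace sketch
```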

Multi-Pass Algorithms with Counted Iterators

There are other algorithms that make more than one pass over the input sequence. Most of those, however, use advance when they need to move iterators by more than one hop. Once we have specialized advance for counted_iterator, those algorithms that use advance get faster without any extra work.

Consider partition_point. Here is one example implementation, taken from libc++ and ported to Concepts Lite and sentinels:

template<ForwardIterator I, Regular S,
         Predicate<ValueType<I>> P>
    requires EqualityComparable<I, S>
I partition_point(I first, S last, P pred)
{
    DifferenceType<I> len = distance(first, last);
    while (len != 0)
    {
        DifferenceType<I> l2 = len / 2;
        I m = first;
        advance(m, l2);
        if (pred(*m))
        {
            first = ++m;
            len -= l2 + 1;
        }
        else
            len = l2;
    }
    return first;
}

Imagine that I is a forward counted_iterator and S is a counted_sentinel. If the library does not specialize advance, this is certainly inefficient. Every time advance is called, needless work is being done. Compare it to a hypothetical partition_point_n:

template<ForwardIterator I, Predicate<ValueType<I>> P>
I partition_point_n(I first, DifferenceType<I> len, P pred)
{
    while (len != 0)
    {
        DifferenceType<I> l2 = len / 2;
        I m = first;
        advance(m, l2);
        if (pred(*m))
        {
            first = ++m;
            len -= l2 + 1;
        }
        else
            len = l2;
    }
    return first;
}

The first thing we notice is that partition_point_n doesn’t need to call distance! The more subtle thing to note is that calling partition_point_n with a raw iterator and a count saves about O(N) integer decrements over the equivalent call to partition_point with counted_iterators … unless, of course, we have specialized the advance algorithm as shown above. Once we have, we trade the O(N) integer decrements for O(log N) integer subtractions. That’s a big improvement.

But what about the O(N) call to distance? Actually, that’s easy, and it’s the reason why I introduced a concept called SizedIteratorRange. counted_iterator stores the distance to the end. So the distance between a counted_iterator and a counted_sentinel (or between two counted_iterators) is known in O(1) regardless of the iterator’s category. The SizedIteratorRange concept tests whether an iterator I and a sentinel S can be subtracted to get the distance. This concept is modeled by random-access iterators by their nature, but also by counted iterators and their sentinels. The distance algorithm is specialized for SizedIteratorRange, so it is O(1) for counted iterators.
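One way to approximate this dispatch in C++17 is sketched below. The trait `is_sized_iterator_range` is a hypothetical stand-in for the SizedIteratorRange concept: it only checks that `last - first` compiles. The `sketch` namespace and the simplified `counted_iterator` are likewise assumptions for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <type_traits>
#include <utility>

namespace sketch {

struct counted_sentinel {};

template<class I>
struct counted_iterator {
    I it;
    std::ptrdiff_t n; // distance to end
    counted_iterator& operator++() { ++it; --n; return *this; }
    friend bool operator==(counted_iterator const& i, counted_sentinel) { return i.n == 0; }
    friend bool operator!=(counted_iterator const& i, counted_sentinel s) { return !(i == s); }
    // Sentinel minus iterator is just the stored count.
    friend std::ptrdiff_t operator-(counted_sentinel, counted_iterator const& i) { return i.n; }
};

// True when `last - first` is a valid expression.
template<class I, class S, class = void>
struct is_sized_iterator_range : std::false_type {};

template<class I, class S>
struct is_sized_iterator_range<I, S,
    std::void_t<decltype(std::declval<S const&>() - std::declval<I const&>())>>
    : std::true_type {};

template<class I, class S>
std::ptrdiff_t distance(I first, S last) {
    if constexpr (is_sized_iterator_range<I, S>::value) {
        return last - first;      // O(1)
    } else {
        std::ptrdiff_t d = 0;     // O(N) fallback: walk the range
        for (; first != last; ++first)
            ++d;
        return d;
    }
}

inline void demo() {
    std::list<int> xs{1, 2, 3, 4, 5};
    using It = std::list<int>::iterator;
    static_assert(is_sized_iterator_range<counted_iterator<It>, counted_sentinel>::value,
                  "counted iterators are sized even over a list");
    static_assert(!is_sized_iterator_range<It, It>::value,
                  "plain list iterators are not");
    assert(sketch::distance(counted_iterator<It>{xs.begin(), 4}, counted_sentinel{}) == 4);
    assert(sketch::distance(xs.begin(), xs.end()) == 5);
}

} // namespace sketch
```

Calls are qualified with `sketch::` so that argument-dependent lookup does not also consider `std::distance`.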

With these changes, we see that partition_point with counted iterators is very nearly as efficient as a hypothetical partition_point_n would be, and we had to make no special accommodations. Why can’t we make partition_point exactly as efficient as partition_point_n? When partition_point is called with a counted iterator, it also returns a counted iterator. A counted iterator carries two pieces of data: the position and the distance to the end. partition_point_n returns just the position, so it computes and returns less information. Sometimes users don’t need the extra information. But sometimes, after calling partition_point_n, the user might want to pass the resulting iterator to another algorithm. If that algorithm calls distance (as partition_point and other algorithms do), that call will be O(N). With counted iterators, however, it’s O(1). So in the case of partition_point, counted iterators cause the algorithm to do O(log N) extra work, but that sometimes saves O(N) work later.

To see an example, imagine a trivial insertion_sort algorithm:

template<ForwardIterator I, Regular S>
    requires EqualityComparable<I, S> &&
             Sortable<I> // from N3351
void insertion_sort(I begin, S end)
{
    for(auto it = begin; it != end; ++it)
    {
        auto insertion = upper_bound(begin, it, *it);
        rotate(insertion, it, next(it));
    }
}

Imagine that I is a counted_iterator. The first thing upper_bound does is call distance. Making distance O(1) for counted_iterators saves N calls of an O(N) algorithm. To get comparable performance for an equivalent procedure in today’s STL, users would have to write a separate insertion_sort_n algorithm that dispatches to an upper_bound_n algorithm — that they would also need to write themselves.
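For reference, the same algorithm can be written against today's standard library, restricted to ordinary iterator pairs. This is only a sketch (the `sketch` namespace is hypothetical), and it shows the cost described above: with forward iterators, each `std::upper_bound` call has to measure its input range by walking it.

```cpp
#include <algorithm>
#include <cassert>
#include <forward_list>
#include <iterator>

namespace sketch {

// Insertion sort over a plain iterator pair using only std algorithms.
template<class I>
void insertion_sort(I first, I last) {
    for (auto it = first; it != last; ++it) {
        // Find where *it belongs in the already sorted prefix [first, it).
        auto insertion = std::upper_bound(first, it, *it);
        // Rotate the element at `it` down into that position.
        std::rotate(insertion, it, std::next(it));
    }
}

inline void demo() {
    std::forward_list<int> xs{3, 1, 2, 5, 4};
    sketch::insertion_sort(xs.begin(), xs.end());
    assert(std::is_sorted(xs.begin(), xs.end()));
    assert(xs.front() == 1);
}

} // namespace sketch
```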

Counted Algorithms with Counted Iterators

We’ve seen that regular algorithms with counted iterators can be made nearly as efficient as dedicated counted algorithms, and that sometimes we are more than compensated for the small performance loss. All is not roses, however. There are a number of counted algorithms in the standard (the algorithms whose names end with _n). Consider copy_n:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(I in, DifferenceType<I> n, O out)
{
    for(; n != 0; ++in, ++out, --n)
        *out = *in;
    return {in, out};
}

(We’ve changed the return type of copy_n so as not to lose information.) If I is a counted iterator, then for every ++in, an increment and a decrement are happening, and in this case the extra decrement is totally unnecessary. For any counted (i.e., _n) algorithm, something special needs to be done to keep the performance from degrading when passed counted iterators.

The algorithm author has two options here, and neither of them is ideal.

Option 1: Overload the algorithm

The following is an optimized version of copy_n for counted iterators:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<counted_iterator<I>, O> copy_n(counted_iterator<I> in,
                                    DifferenceType<I> n, O out)
{
    // Loop on the iterator's own count; no second counter is needed.
    for(auto m = in.n_ - n; in.n_ != m;
            ++in.it_, --in.n_, ++out)
        *out = *in;
    return {in, out};
}

Obviously, creating an overload for counted iterators is unsatisfying.

Option 2: Separate the iterator from the count

This option shows how a library implementer can write just one version of copy_n that is automatically optimized for counted iterators. First, we need to provide two utility functions for unpacking and repacking counted iterators:

template<WeakIterator I>
I uncounted(I i)
{
    return i;
}

template<WeakIterator I>
I uncounted(counted_iterator<I> i)
{
    return i.it_;
}

template<WeakIterator I>
I recounted(I const &, I i, DifferenceType<I>)
{
    return i;
}

template<WeakIterator I>
counted_iterator<I> recounted(counted_iterator<I> const &j, I i, DifferenceType<I> n)
{
    return {i, j.n_ - n};
}

With the help of uncounted and recounted, we can write an optimized copy_n just once:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(I in_, DifferenceType<I> n_, O out)
{
    auto in = uncounted(in_);
    for(auto n = n_; n != 0; ++in, --n, ++out)
        *out = *in;
    return {recounted(in_, in, n_), out};
}

This version works optimally for both counted and non-counted iterators. It is not a thing of beauty, however. It’s slightly annoying to have to do the uncounted/recounted dance, but it’s mostly needed only in the counted algorithms.
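The uncounted/recounted technique can be exercised in C++17 along these lines. The `sketch` namespace and the aggregate `counted_iterator` with public members are assumptions made to keep the example short; the function names follow the text.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

namespace sketch {

template<class I>
struct counted_iterator {
    I it;
    std::ptrdiff_t n; // distance to end
    decltype(auto) operator*() const { return *it; }
    counted_iterator& operator++() { ++it; --n; return *this; }
};

// Strip the count (identity for ordinary iterators).
template<class I> I uncounted(I i) { return i; }
template<class I> I uncounted(counted_iterator<I> i) { return i.it; }

// Rebuild an iterator of the original kind after moving n steps.
template<class I>
I recounted(I const&, I i, std::ptrdiff_t) { return i; }
template<class I>
counted_iterator<I> recounted(counted_iterator<I> const& j, I i, std::ptrdiff_t n) {
    return {i, j.n - n};
}

// Written once; the loop always runs on the raw underlying iterator.
template<class I, class O>
std::pair<I, O> copy_n(I in_, std::ptrdiff_t n_, O out) {
    auto in = uncounted(in_);
    for (auto n = n_; n != 0; ++in, --n, ++out)
        *out = *in;
    return {recounted(in_, in, n_), out};
}

inline void demo() {
    std::list<int> src{1, 2, 3, 4, 5};
    std::vector<int> dst(3);

    // Counted input: the result is still counted, with the count updated.
    counted_iterator<std::list<int>::iterator> first{src.begin(), 5};
    auto r = sketch::copy_n(first, 3, dst.begin());
    assert(dst[0] == 1 && dst[1] == 2 && dst[2] == 3);
    assert(*r.first == 4 && r.first.n == 2);
    assert(r.second == dst.end());

    // Ordinary input: uncounted and recounted are no-ops.
    auto q = sketch::copy_n(src.begin(), 2, dst.begin());
    assert(*q.first == 3);
}

} // namespace sketch
```

Overload resolution does the dispatch: for a `counted_iterator` argument, the more specialized `uncounted` and `recounted` overloads win, and for anything else the identity versions are the only viable ones.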

As a final note, the overload of advance for counted iterators can be eliminated with the help of uncounted and recounted. After all, advance is a counted algorithm.

Benchmark: Insertion Sort

To test how expensive counted ranges and counted iterators are, we wrote a benchmark. The benchmark pits counted ranges against a dedicated _n algorithm for Insertion Sort. The program is listed in this gist.

The program implements both insertion_sort_n, a dedicated counted algorithm, and insertion_sort, a general algorithm that accepts any Iterable, to which we pass a counted range. The latter is implemented in terms of the general-purpose upper_bound as provided by the Range v3 library, whereas the former requires a dedicated upper_bound_n algorithm, which is also provided.

The test is run both with raw pointers (hence, random-access) and with an iterator wrapper that only models ForwardIterator. Each test is run three times, and the resulting times are averaged. The test was compiled with g++ version 4.9.0 with -O3 -std=gnu++11 -DNDEBUG and run on a Linux machine. The results are reported below, for N == 30,000:

                 insertion_sort_n   insertion_sort
random-access    2.692 s            2.703 s
forward          23.853 s           23.817 s

The performance difference, if there is any, is lost in the noise. At least in this case, with this compiler, on this hardware, there is no performance justification for a dedicated _n algorithm.


In short, counted iterators are not a perfect abstraction. There is some precedent here. The iterators for deque, and for any segmented data structure, are known to be inefficient (see Segmented Iterators and Hierarchical Algorithms, Austern 1998). The fix for that problem, new iterator abstractions and separate hierarchical algorithm implementations, is invasive and is not attempted in any STL implementation I am aware of. In comparison, the extra complications that come with counted iterators seem quite small. For segmented iterators, the upside was the simplicity and uniformity of the Iterator abstraction. In the case of counted ranges and iterators, the upside is the simplicity and uniformity of the Iterable concept. Algorithms need only one form, not separate bounded, counted, and sentinel forms. The benchmark gives me some reasonable assurance that we aren’t sacrificing too much performance for the sake of a unifying abstraction.
