Ranges in C++: Counted Iterables and Efficiency

I’ve been hard at work fleshing out my range library and writing a proposal to get range support into the standard. That proposal describes a foundational range concept: Iterable. An Iterable is anything we can pass to std::begin() and std::end() to get an Iterator/Sentinel pair. Sentinels, as I described here earlier this year, make it possible for the Iterable concept to efficiently describe other kinds of ranges besides iterator pairs.

The three types of ranges that we would like the Iterable concept to be able to efficiently model are:

Two iterators
An iterator and a predicate
An iterator and a count

The Iterator/Sentinel abstraction is what makes it possible for the algorithms to handle these three cases with uniform syntax. However, as Sean Parent pointed out here, the third option presents challenges when trying to make some algorithms optimally efficient. Back in February, when Sean offered his criticism, I promised to follow up with a blog post that justified the design. This is that post.

Note 1: I’ve changed terminology since the February posts. In those posts, Iterable represented a range where the begin and end have different types, and Range is an Iterable where they’re the same. In my current proposal, Iterable is more or less as it was before, but Range is now an Iterable that doesn’t own its elements.

Note 2: This post uses the syntax of Concepts Lite, which has not been adopted yet. Everything in this post is implementable in C++11 using my library for Concepts Lite emulation, which I describe here.

Counted Ranges

Counted ranges, formed by specifying a position and a count of elements, have iterators — as all Iterables do. The iterators of a counted range must know the range’s extent and how close they are to reaching it. Therefore, the counted range’s iterators must store both an iterator into the underlying sequence and a count — either a count to the end or a count from the front. Here is one potential design:

class counted_sentinel
{};

template<WeakIterator I>
class counted_iterator
{
    I it_;
    DistanceType<I> n_; // distance to end
public:
    // ... constructors...
    using iterator_category =
        typename iterator_traits<I>::iterator_category;
    decltype(auto) operator*() const
    {
        return *it_;
    }
    counted_iterator & operator++()
    {
        ++it_;
        --n_;
        return *this;
    }
    friend bool operator==(counted_iterator const & it,
                           counted_sentinel)
    {
        return it.n_ == 0;
    }
    // ... other operators...
};

template<WeakIterator I>
class counted_range
{
    I begin_;
    DistanceType<I> count_;
public:
    // ... constructors ...
    counted_iterator<I> begin() const
    {
        return {begin_, count_};
    }
    counted_sentinel end() const
    {
        return {};
    }
};

There are some noteworthy things about the code above. First, counted_iterator bundles an iterator and a count. Right off, we see that copying counted iterators is going to be more expensive, and iterators are copied frequently. A mitigating factor is that the sentinel is empty. Passing a counted_iterator and a counted_sentinel to an algorithm copies as much data as passing an iterator and a count. When passed separately, the compiler probably has an easier time fitting them in registers, but some modern compilers are capable passing the members of a struct in registers. This compiler optimization is sometimes called Scalar Replacement of Aggregates^{1, 2} and is known to be implemented in gcc and LLVM (see this recent LLVM commit for example).

Also, incrementing a counted iterator is expensive: it involves incrementing the underlying iterator and decrementing the internal count. To see why this is potentially expensive, consider the trivial case of passing a counted_iterator<list<int>::iterator> to advance. That counted iterator type is bidirectional, and advance must increment it n times:

template<BidirectionalIterator I>
void advance(I & i, DistanceType<I> n)
{
    if(n >= 0)
        for(; n != 0; --n)
            ++i;
    else
        for(; n != 0; ++n)
            --i;
}

Notice that for each ++i or --i here, two increments or decrements are happening when I is a counted_iterator. This is sub-optimal. A better implementation for counted_iterator is:

template<BidirectionalIterator I>
void advance(counted_iterator<I> & i, DistanceType<I> n)
{
    i.n_ -= n;
    advance(i.it_, n);
}

This has a noticeable effect on the generated code. As it turns out, advance is one of the relatively few places in the standard library where special handling of counted_iterator is advantageous. Let’s examine some algorithms to see why that’s the case.

Single-Pass Algorithms with Counted Iterators

First, let’s look at a simple algorithm like for_each that makes exactly one pass through its input sequence:

template<InputIterator I, Regular S,
         Function<ValueType<I>> F>
    requires EqualityComparable<I, S>
I for_each(I first, S last, F f)
{
    for(; first != last; ++first)
        f(*first);
    return first;
}

When passed counted iterators, at each iteration of the loop, we do an increment, a decrement (for the underlying iterator and the count), and a comparison. Let’s compare this against a hypothetical for_each_n algorithm that takes the underlying iterator and the count separately. It might look like this:

template<InputIterator I, Function<ValueType<I>> F>
I for_each_n(I first, DifferenceType<I> n, F f)
{
    for(; n != 0; ++first, --n)
        f(*first);
    return first;
}

For the hypothetical for_each_n, at each loop iteration, we do an increment, a decrement, and a comparison. That’s exactly as many operations as for_each does when passed counted iterators. So a separate for_each_n algorithm is probably unnecessary if we have sentinels and counted_iterators. This is true for any algorithm that makes only one pass through the input range. That turns out to be a lot of algorithms.

Multi-Pass Algorithms with Counted Iterators

There are other algorithms that make more than one pass over the input sequence. Most of those, however, use advance when they need to move iterators by more than one hop. Once we have specialized advance for counted_iterator, those algorithms that use advance get faster without any extra work.

Consider partition_point. Here is one example implementation, taken from libc++ and ported to Concepts Lite and sentinels:

template<ForwardIterator I, Regular S,
         Predicate<ValueType<I>> P>
    requires EqualityComparable<I, S>
I partition_point(I first, S last, P pred)
{
    DifferenceType<I> len = distance(first, last);
    while (len != 0)
    {
        DifferenceType<I> l2 = len / 2;
        I m = first;
        advance(m, l2);
        if (pred(*m))
        {
            first = ++m;
            len -= l2 + 1;
        }
        else
            len = l2;
    }
    return first;
}

Imagine that I is a forward counted_iterator and S is a counted_sentinel. If the library does not specialize advance, this is certainly inefficient. Every time advance is called, needless work is being done. Compare it to a hypothetical partition_point_n:

template<ForwardIterator I, Predicate<ValueType<I>> P>
I partition_point_n(I first, DifferenceType<I> len, P pred)
{
    while (len != 0)
    {
        DifferenceType<I> l2 = len / 2;
        I m = first;
        advance(m, l2);
        if (pred(*m))
        {
            first = ++m;
            len -= l2 + 1;
        }
        else
            len = l2;
    }
    return first;
}

The first thing we notice is that partition_point_n doesn’t need to call distance! The more subtle thing to note is that calling partition_point_n with a raw iterator and a count saves about O(N) integer decrements over the equivalent call to partition_point with counted_iterators … unless, of course, we have specialized the advance algorithm as shown above. Once we have, we trade the O(N) integer decrements for O(log N) integer subtractions. That’s a big improvement.

But what about the O(N) call to distance? Actually, that’s easy, and it’s the reason why I introduced a concept called SizedIteratorRange. counted_iterator stores the distance to the end. So the distance between a counted_iterator and a counted_sentinel (or between two counted_iterators) is known in O(1) regardless of the iterator’s category. The SizedIteratorRange concept tests whether an iterator I and a sentinel S can be subtracted to get the distance. This concept is modeled by random-access iterators by their nature, but also by counted iterators and their sentinels. The distance algorithm is specialized for SizedIteratorRange, so it is O(1) for counted iterators.

With these changes, we see that partition_point with counted iterators is very nearly as efficient as a hypothetical partition_point_n would be, and we had to make no special accommodations. Why can’t we make partition_point exactly as efficient as partition_point_n? When partition_point is called with a counted iterator, it also returns a counted iterator. Counted iterators contain two datums: the position and distance to the end. But when partition_point_n returns just the position, it is actually computing and returning less information. Sometimes users don’t need the extra information. But sometimes, after calling partition_point_n, the user might want to pass the resulting iterator to another algorithm. If that algorithm calls distance (like partition_point and other algorithms do), then it will be O(N). With counted iterators, however, it’s O(1). So in the case of partition_point, counted iterators cause the algorithm to do O(log N) extra work, but it sometimes saves O(N) work later.

To see an example, imagine a trivial insertion_sort algorithm:

template<ForwardIterator I, Regular S>
    requires EqualityComparable<I, S> &&
             Sortable<I> // from N3351
void insertion_sort(I begin, S end)
{
    for(auto it = begin; it != end; ++it)
    {
        auto insertion = upper_bound(begin, it, *it);
        rotate(insertion, it, next(it));
    }
}

Imagine that I is a counted_iterator. The first thing upper_bound does is call distance. Making distance O(1) for counted_iterators saves N calls of an O(N) algorithm. To get comparable performance for an equivalent procedure in today’s STL, users would have to write a separate insertion_sort_n algorithm that dispatches to an upper_bound_n algorithm — that they would also need to write themselves.

Counted Algorithms with Counted Iterators

We’ve seen that regular algorithms with counted iterators can be made nearly as efficient as dedicated counted algorithms, and that sometimes we are more than compensated for the small performance loss. All is not roses, however. There are a number of counted algorithms in the standard (the algorithms whose names end with _n). Consider copy_n:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(I in, DifferenceType<I> n, O out)
{
    for(; n != 0; ++in, ++out, --n)
        *out = *in;
    return {in, out};
}

(We’ve changed the return type of copy_n so as not to lose information.) If I is a counted iterator, then for every ++in, an increment and a decrement are happening, and in this case the extra decrement is totally unnecessary. For any counted (i.e., _n) algorithm, something special needs to be done to keep the performance from degrading when passed counted iterators.

The algorithm author has two options here, and neither of them is ideal.

Option 1: Overload the algorithm

The following is an optimized version of copy_n for counted iterators:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(counted_iterator<I> in,
                  DifferenceType<I> n, O out)
{
    for(auto m = in.n_ - n; in.n_ != m;
            ++in.i_, --in.n_, ++out)
        *out = *in;
    return {in, out};
}

Obviously, creating an overload for counted iterators is unsatisfying.

Option 2: Separate the iterator from the count

This option shows how a library implementer can write just one version of copy_n that is automatically optimized for counted iterators. First, we need to provide two utility functions for unpacking and repacking counted iterators:

template<WeakIterator I>
I uncounted(I i)
{
    return i;
}

template<WeakIterator I>
I uncounted(counted_iterator<I> i)
{
    return i.it_;
}

template<WeakIterator I>
I recounted(I const &, I i, DifferenceType<I>)
{
    return i;
}

template<WeakIterator I>
counted_iterator<I> recounted(counted_iterator<I> const &j, I i, DifferenceType<I> n)
{
    return {i, j.n_ - n};
}

With the help of uncounted and recounted, we can write an optimized copy_n just once:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(I in_, DifferenceType<I> n_, O out)
{
    auto in = uncounted(in_);
    for(auto n = n_; n != 0; ++in, --n, ++out)
        *out = *in;
    return {recounted(in_, in, n_), out};
}

This version works optimally for both counted and non-counted iterators. It is not a thing of beauty, however. It’s slightly annoying to have to do the uncounted/recounted dance, but it’s mostly needed only in the counted algorithms.

As a final note, the overload of advance for counted iterators can be eliminated with the help of uncounted and recounted. After all, advance is a counted algorithm.

Benchmark: Insertion Sort

To test how expensive counted ranges and counted iterators are, we wrote a benchmark. The benchmark pits counted ranges against a dedicated _n algorithm for Insertion Sort. The program is listed in this gist.

The program implements both insertion_sort_n, a dedicated counted algorithm, and insertion_sort, a general algorithm that accepts any Iterable, to which we pass a counted range. The latter is implemented in terms of the general-purpose upper_bound as provided by the Range v3 library, whereas the former requires a dedicated upper_bound_n algorithm, which is also provided.

The test is run both with raw pointers (hence, random-access) and with an iterator wrapper that only models ForwardIterator. Each test is run three times, and the resulting times are averaged. The test was compiled with g++ version 4.9.0 with -O3 -std=gnu++11 -DNDEBUG and run on a Linux machine. The results are reported below, for N == 30,000:

	`insertion_sort_n`	`insertion_sort`
random-access	2.692 s	2.703 s
forward	23.853 s	23.817 s

The performance difference, if there is any, is lost in the noise. At least in this case, with this compiler, on this hardware, there is no performance justification for a dedicated _n algorithm.

Summary

In short, counted iterators are not a perfect abstraction. There is some precedent here. The iterators for deque, and for any segmented data structure, are known to be inefficient (see Segmented Iterators and Hierarchical Algorithms, Austern 1998). The fix for that problem, new iterator abstractions and separate hierarchical algorithm implementations, is invasive and is not attempted in any STL implementation I am aware of. In comparison, the extra complications that come with counted iterators seem quite small. For segmented iterators, the upside was the simplicity and uniformity of the Iterator abstraction. In the case of counted ranges and iterators, the upside is the simplicity and uniformity of the Iterable concept. Algorithms need only one form, not separate bounded, counted, and sentinel forms. The benchmark gives me some reasonable assurance that we aren’t sacrificing too much performance for the sake of a unifying abstraction.

Ranges in C++: Counted Iterables and Efficiency

Counted Ranges

Single-Pass Algorithms with Counted Iterators

Multi-Pass Algorithms with Counted Iterators

Counted Algorithms with Counted Iterators

Benchmark: Insertion Sort

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112