Channel: std – Eric Niebler

Input Iterators vs Input Ranges

This post was inspired by some shortcomings of the std::getline solution I described in my previous post, which just goes to show that there is no interface so simple that it can’t be done wrong. Or at least sub-optimally.

Input Iterators and Lazy Ranges

In the previous article, I analyzed the interface of std::getline and proposed a range-based solution as a better alternative. Users of the new range-based getlines API would read lines from a stream like this:

for(std::string const & line : getlines(std::cin))
{
    use_line(line);
}

The range object returned from getlines is lazy; that is, it fetches lines on demand. It’s a good design, and I’m still happy with it. The implementation leaves much to be desired, though. Both the range object itself and the iterators it yields are fatter than they need to be. That got me thinking about std::istream_iterator, and about input iterators and ranges in general. My conclusion: naked input iterators like std::istream_iterator that don’t “belong” to a range have serious problems.

Fat Input Iterators

If you’re not already familiar with std::istream_iterator, take a minute to look it up in your favorite C++ reference. It is parametrized on the type of thing you want to extract from a stream. An istream_iterator<int> reads ints, an istream_iterator<string> reads strings, etc. Although the implementation is unspecified, reading an element would typically happen first when the iterator is constructed, and then each time the iterator is incremented. The element is stored in a data member so that it can be returned when you dereference the iterator. OK so far?

The implication for istream_iterator<string> is that it is a hulking behemoth of an iterator. Not only is it fat because it holds a string, but copying one means copying a string, too. That’s potentially a dynamic allocation, just from copying an iterator! STL algorithms generally assume iterators are cheap to copy and take them by value nonchalantly. What’s more, a default-constructed istream_iterator<string> is used as a dummy end-of-sequence iterator. Naturally, it contains a string too, but it never uses it! istream_iterator definitely needs to go on a diet. We’ll fix that, but we’re not done describing the problems yet. Read on.
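To make the copying cost concrete, here is a minimal sketch that counts how many element copies one iterator copy triggers. The counted type and the copies_per_iterator_copy function are hypothetical names invented for this illustration, and the exact count is up to the standard library implementation; the sketch only assumes that the iterator caches its element, as described above.

```cpp
#include <istream>
#include <iterator>
#include <sstream>

// Hypothetical element type that counts how often it is copy-constructed.
struct counted
{
    int value = 0;
    static int copies;
    counted() = default;
    counted(counted const & that) : value(that.value) { ++copies; }
    counted & operator=(counted const &) = default;
};
int counted::copies = 0;

// Stream extraction so that istream_iterator<counted> can read elements.
std::istream & operator>>(std::istream & in, counted & c)
{
    return in >> c.value;
}

// Returns how many element copies a single iterator copy triggered.
int copies_per_iterator_copy()
{
    std::istringstream in{"1 2 3"};
    std::istream_iterator<counted> first{in};
    int const before = counted::copies;
    std::istream_iterator<counted> second = first; // copy the iterator
    (void)second;
    return counted::copies - before;
}
```

Swap counted for std::string and each of those element copies becomes a string copy, which may allocate.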

Surprising Side-Effects

Say we wanted to return a range of istream_iterator<string>s. We could return a std::pair of them, and that would work, sort of. Better, we could return a boost::iterator_range (which is basically a std::pair of iterators with begin and end member functions) to get something that users could iterate over with a range-based for loop:

// Return a lazy range of strings
boost::iterator_range<std::istream_iterator<std::string>>
get_strings( std::istream & sin )
{
    return boost::make_iterator_range(
        std::istream_iterator<std::string>{sin},
        std::istream_iterator<std::string>{}
    );
}

//...

for(std::string const & str : get_strings( std::cin ))
{
    use_string(str);
}

But think of the waste: the range holds two iterators, each of which holds a string and a reference to the stream. Wouldn’t it be smarter if the returned range just held a reference to the stream, and constructed the iterators on-demand in its begin and end member functions, like this:

template< class T >
class istream_range
{
    std::istream & sin_;
public:
    using iterator = std::istream_iterator<T>;
    using const_iterator = iterator;

    explicit istream_range( std::istream & sin )
      : sin_(sin)
    {}
    iterator begin() const
    {
        return std::istream_iterator<T>{sin_};
    }
    iterator end() const
    {
        return std::istream_iterator<T>{};
    }
};

OMG, isn’t this soooo clever? The range object went from about 24 bytes (with libstdc++ 4.7) to 4 bytes, the size of just one pointer on a 32-bit build! And if you play around with istream_range, it will seem to work. Check it out:

// Read a bunch of strings from a stream
std::istringstream sin{"This is his face"};

for(auto const & str : istream_range<std::string>{sin})
{
    std::cout << str << std::endl;
}

As we might expect, the above prints:

This
is
his
face

But all is not roses. Take a look at this:

std::istringstream sin{"This is his face"};
istream_range<std::string> strings{sin};

if(strings.begin() != strings.end())
    std::cout << *strings.begin() << std::endl;

This code checks to see if the range is non-empty, and if so it prints the first element of the range. What would you expect this to print? This, right? After all, that’s the first string in the stream. If you try it, this is what you’ll get:

is

Huh? That’s hardly what any reasonable person would expect. Chalk this gotcha up to a quirk of the implementation of istream_iterator. As mentioned above, when you construct one from a stream, it eagerly fetches a value out of the stream and saves it (or at least, most implementations do). That’s fine, unless you happen to throw that iterator away and construct a new one, which fetches a second value from the stream. That, sadly, is what the above code is doing: each call to strings.begin() constructs a fresh iterator, so the first call consumes “This” and the second consumes “is”. It’s just not obvious from reading the code.

If the fatness was the first problem with std::istream_iterator, the second is that its constructor has surprising side-effects.
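The side-effect can be isolated without any range at all. The sketch below constructs an istream_iterator, throws it away without ever dereferencing it, and then reads directly from the stream. The function name token_after_discarded_iterator is made up for this illustration. On implementations that read eagerly in the constructor, the discarded iterator has already swallowed the first token.

```cpp
#include <iterator>
#include <sstream>
#include <string>

// Returns the token the stream yields after an istream_iterator has been
// constructed from it and destroyed again without being used.
std::string token_after_discarded_iterator()
{
    std::istringstream sin{"This is his face"};
    {
        // Never dereferenced, never incremented, just constructed.
        std::istream_iterator<std::string> discarded{sin};
    }
    std::string next;
    sin >> next; // with an eager constructor, "This" is already gone
    return next;
}
```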

Lone Range-er to the Rescue!

The solution to istream_iterator’s woes will be to replace it with istream_range. Put simply, if we’re reading strings from a stream, the string needs to live somewhere. The iterator seemed like the logical place when we were all thinking strictly in terms of iterators. But with ranges, we now have a much better place to put it: in the range object.

With the string safely tucked away in the range object, we neatly side-step the issue of fat istream iterators. The iterator only needs to hold a pointer to the range. It goes without saying that the iterator cannot outlive the range that produced it, but that’s true of all the standard containers and their iterators.

The range object also gives us a better place to put the surprising side-effect: in the range object’s constructor. By moving the side-effect out of the iterator’s constructor, it is now perfectly acceptable to construct the iterators on-demand in the begin and end member functions. We’re left with an optimally small range — it holds only a string and an istream & — and an optimally small and efficient iterator — it holds only a pointer.

Without further ado, here is the complete solution:

template< class T >
class istream_range
{
    std::istream & sin_;
    mutable T obj_;

    bool next() const
    {
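        // The stream's operator bool is explicit in C++11,
        // so the conversion must be spelled out in a return
        return static_cast<bool>(sin_ >> obj_);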
    }
public:
    // Define const_iterator and iterator together:
    using const_iterator = struct iterator
      : boost::iterator_facade<
            iterator,
            T const,
            std::input_iterator_tag
        >
    {
        iterator() : rng_{} {}
    private:
        friend class istream_range;
        friend class boost::iterator_core_access;

        explicit iterator(istream_range const & rng)
          : rng_(rng ? &rng : nullptr)
        {}

        void increment()
        {
            // Don't advance a singular iterator
            BOOST_ASSERT(rng_);
            // Fetch the next element, null out the
            // iterator if it fails
            if(!rng_->next())
                rng_ = nullptr;
        }

        bool equal(iterator that) const
        {
            return rng_ == that.rng_;
        }

        T const & dereference() const
        {
            // Don't deref a singular iterator
            BOOST_ASSERT(rng_);
            return rng_->obj_;
        }

        istream_range const *rng_;
    };

    explicit istream_range(std::istream & sin)
      : sin_(sin), obj_{}
    {
        next(); // prime the pump
    }

    iterator begin() const { return iterator{*this}; }
    iterator end() const   { return iterator{};     }

    explicit operator bool() const // any objects left?
    {
        return static_cast<bool>(sin_);
    }

    bool operator!() const { return !sin_; }
};

This solution has a major advantage over std::istream_iterator even in the pre-ranges world of C++98: the iterators are as svelte and cheap to copy as a single pointer. One might go so far as to wonder how a component as potentially inefficient and error-prone as istream_iterator ever made it into the standard in the first place. (But, I just mentioned “efficient” and “iostreams” in the same sentence, so how smart am I, right Andrei?)

As an added bonus, I added a cute contextual conversion to bool for testing whether the range is empty or not. That lets you write code like this:

if( auto strs = istream_range<std::string>{std::cin} )
    std::cout << *strs.begin() << std::endl;

If you don’t like the Boolean conversion trick, you can do it the old, boring way too:

istream_range<std::string> strs{std::cin};
if( strs.begin() != strs.end() )
    std::cout << *strs.begin() << std::endl;

You can call strs.begin() as many times as you like, and it has no untoward side-effects. Adapting this code to improve my getlines implementation from the previous post is a trivial exercise.
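For the curious, here is roughly what that adaptation might look like. This is a sketch rather than the getlines code from the previous post: the names lines_range and split_lines are invented here, and the iterator is written by hand instead of with boost::iterator_facade so that the example stands alone. It implements just enough of an input iterator for a range-based for loop.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical name; a sketch, not the getlines code from the previous post.
class lines_range
{
    std::istream & sin_;
    mutable std::string line_; // the one buffered line lives in the range
    char delim_;

    bool next() const
    {
        return static_cast<bool>(std::getline(sin_, line_, delim_));
    }
public:
    class iterator
    {
        friend class lines_range;
        lines_range const * rng_;
        explicit iterator(lines_range const * rng) : rng_(rng) {}
    public:
        using iterator_category = std::input_iterator_tag;
        using value_type = std::string;
        using difference_type = std::ptrdiff_t;
        using pointer = std::string const *;
        using reference = std::string const &;

        iterator() : rng_(nullptr) {} // end-of-sequence

        reference operator*() const
        {
            assert(rng_); // don't deref the end iterator
            return rng_->line_;
        }
        iterator & operator++()
        {
            assert(rng_); // don't advance the end iterator
            if(!rng_->next())
                rng_ = nullptr;
            return *this;
        }
        friend bool operator==(iterator a, iterator b) { return a.rng_ == b.rng_; }
        friend bool operator!=(iterator a, iterator b) { return a.rng_ != b.rng_; }
    };

    explicit lines_range(std::istream & sin, char delim = '\n')
      : sin_(sin), line_(), delim_(delim)
    {
        next(); // prime the pump
    }

    // Safe to call repeatedly: neither function reads from the stream.
    iterator begin() const { return iterator{sin_ ? this : nullptr}; }
    iterator end() const   { return iterator{}; }
};

// Usage: split a block of text into lines.
std::vector<std::string> split_lines(std::string const & text)
{
    std::istringstream sin{text};
    std::vector<std::string> out;
    for(std::string const & line : lines_range{sin})
        out.push_back(line);
    return out;
}
```

As with istream_range, the only read that happens outside of an increment is the one in the range’s constructor, and the iterator is a single pointer.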

Home on the Range

In the post-ranges world, the advantages of istream_range over istream_iterator are even clearer. As I mentioned in my previous post, ranges are awesome because they compose. With filters and transformers and zippers and the whole zoo of range adapters, you can do things with ranges and range algorithms that you wouldn’t dream of doing with raw iterators before.

Conclusion

So far, the ranges discussion as I’ve heard it has been framed mostly in terms of ranges’ added convenience and power. To this impressive list of advantages, we can now add efficiency. Win, win, win.

Caveat to the Boost.Range Users

Please read this if you are an avid user of Boost’s range adapters. As they are currently written, they interact poorly with the istream_range I’ve presented here. Some things will work, like this:

// read in ints, echo back the evens
auto is_even = [](int i) {return 0==i%2;};
boost::copy( istream_range<int>{std::cin}
               | boost::adaptors::filtered(is_even),
             std::ostream_iterator<int>(std::cout) );

And some things will fail, like this:

// read in ints, echo back the evens
auto is_even = [](int i) {return 0==i%2;};
auto evens = istream_range<int>{std::cin}
               | boost::adaptors::filtered(is_even);
boost::copy( evens, std::ostream_iterator<int>(std::cout) );

The problem is that the temporary istream_range<int> goes out of scope before we have a chance to iterate over it. Had we gone with an iterator_range< std::istream_iterator<int> >, it would have actually worked, but only because of a quirk of the current Boost.Range implementation. The Boost.Range adapters only work when either (A) the adapted range happens to be an lvalue, or (B) the range’s iterators can outlive their range. These less-than-ideal assumptions made sense in C++98, but not in C++11. On modern compilers, Boost.Range can and should store a copy of any adapted rvalue range. In my opinion, it’s time for a range library for the modern world.
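To illustrate what storing a copy of an adapted rvalue could mean in C++11, here is a minimal sketch of a filtering adaptor. It is not how Boost.Range is implemented; filtered_range, make_filtered and evens_of_temporary are hypothetical names. The trick is reference collapsing: when make_filtered is handed an lvalue, Rng is deduced as a reference type and the adaptor merely refers to the range; when it is handed an rvalue, Rng is deduced as a plain object type and the adaptor moves the range into itself.

```cpp
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical adaptor, not Boost.Range's. Rng is `T &` for an lvalue
// argument (store a reference) and `T` for an rvalue (store the object).
template< class Rng, class Pred >
class filtered_range
{
    Rng rng_;
    Pred pred_;
    using base_iterator = decltype(std::begin(std::declval<Rng &>()));
public:
    class iterator
    {
        base_iterator it_, end_;
        Pred const * pred_;
        // Advance to the next element that satisfies the predicate.
        void skip() { while(it_ != end_ && !(*pred_)(*it_)) ++it_; }
    public:
        iterator(base_iterator it, base_iterator end, Pred const * pred)
          : it_(it), end_(end), pred_(pred)
        {
            skip();
        }
        typename std::iterator_traits<base_iterator>::reference
        operator*() const { return *it_; }
        iterator & operator++() { ++it_; skip(); return *this; }
        bool operator!=(iterator const & that) const { return it_ != that.it_; }
    };

    filtered_range(Rng && rng, Pred pred)
      : rng_(std::forward<Rng>(rng)), pred_(std::move(pred))
    {}

    iterator begin() { return iterator(std::begin(rng_), std::end(rng_), &pred_); }
    iterator end()   { return iterator(std::end(rng_), std::end(rng_), &pred_); }
};

template< class Rng, class Pred >
filtered_range<Rng, Pred> make_filtered(Rng && rng, Pred pred)
{
    return filtered_range<Rng, Pred>(std::forward<Rng>(rng), std::move(pred));
}

// Usage: the temporary vector is moved into the adaptor, so `evens` owns it
// and nothing dangles when the loop runs.
std::vector<int> evens_of_temporary()
{
    auto is_even = [](int i) { return i % 2 == 0; };
    auto evens = make_filtered(std::vector<int>{1, 2, 3, 4, 5, 6}, is_even);
    std::vector<int> out;
    for(int i : evens)
        out.push_back(i);
    return out;
}
```

An adaptor built along these lines would keep a temporary istream_range<int> alive for as long as the adapted range itself, which is what the failing example above needs.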

