[This article appeared in the November 1997 issue of C/C++ Users Journal.
There may be small differences between this and the published version.]

ROLLING YOUR OWN INPUT ITERATORS
--------------------------------


Much has been written about how to use the STL (Standard Template
Library) container classes and algorithms.  But little has been
written on how to extend it with your own STL-compliant container
classes, iterators, and algorithms.  That's a shame, because STL was
intended to be much more than just another class library -- it was
intended as an open-ended framework for generic programming.  This
article is a step towards filling that gap in the literature, by
showing you how to create your own input iterator classes, and why you
might want to do so.  It's aimed at people who already use STL, so
I'll assume you know the basics of what iterators are and how
they're used.


WHAT'S SO GREAT ABOUT INPUT ITERATORS?

Input iterators are useful whenever you want to generate, return, or
read in a sequence of values which are not already stored in memory as
part of some container object.  They are used by algorithms that pass
through a sequence a single time, in the forward direction only, and
never try anything sneaky like hanging onto an old iterator value in
hopes of returning to a previous position in the sequence.  In spite
of these limitations, the standard C++ library contains 20
different algorithms (not counting variations with an extra
function-object parameter) that can operate on sequences defined by
input iterators.  Furthermore, such sequences can initialize or be
inserted into any object of one of the standard library's container
classes.

The standard example of an input iterator is the istream_iterator
class template.  An istream_iterator<T,charT> turns a stream with
character type charT, containing textual representations of values of
type T, into an STL sequence of T values.  The appropriate extraction
operator (>>) must be defined for type T.  For example, if "foo" is
the name of a file whose contents are
  "35\n24.7\n-12.76 3.8e+2"
then the code fragment
  typedef istream_iterator<float,char> It;
  ifstream is("foo");
  vector<float> v(It(is), It());
will initialize vector v with the sequence of values
  35f, 24.7f, -12.76f, 380f.

Note that the above example used It() as the past-the-end value
(an iterator value just past the end of the desired sequence).
Successively incrementing an istream_iterator will eventually cause it
to attain the default-constructed value, when the underlying istream
reaches end-of-file.  This is typical of input iterators -- their
one-pass nature means that no intermediate iterator values are
available for specifying subsequences, so the only available
past-the-end value is one indicating some form of end of input.

The standard library also includes an istreambuf_iterator class
template for turning istreams or streambufs into an STL sequence of
characters.  Thus we see a common use for input iterators: as adaptors
that provide an STL-compliant interface to some other class so that it
can be used with STL algorithms (and user-written generic algorithms).

Here are some additional examples of input iterators:

* Reading a file system directory.  Dietmar Kuhl
(http://www.informatik.uni-konstanz.de/~kuehl) has implemented an
input iterator class that serves as a wrapper around the Unix
opendir(), readdir() and closedir() functions.  Thus you can write
  list<string> l(dir_it("foo"), dir_it());
to create a list of the names of all entries of the directory "foo", and
  distance(dir_it("foo"), dir_it())
to compute the number of entries in directory "foo".

* An analog of istream_iterator for binary files.  If you deal with
files that contain a series of values in some binary format, instead
of a textually-encoded format, you may find it useful to define a
class analogous to istream_iterator for reading in these values.

* Lexical analysis.  A lexer converts a sequence of characters into a
sequence of tokens.  For maximum flexibility and ease of interfacing with
STL components, a lexer should then be a class template or function
template that takes an arbitrary iterator range for reading characters, and
creates an input iterator for reading tokens.  E.g., if it0 and it1
are of an input iterator type X, with the range [it0, it1) defining a
sequence of characters, then [lexer<X>(it0,it1), lexer<X>()) is the
corresponding sequence of tokens.  (Recall that [it, itend) means the
sequence of values gotten by incrementing it0 up to, but not
including, it1, and dereferencing every iterator along the way.)

* Database queries.  A query can return multiple results.  How should
you return these results?  One common method is to dictate a
particular kind of container into which the results will be placed.
This has the drawback of being inflexible and requiring an extra
copying stage if the application needs to put the results into a
different kind of container.  Furthermore, sometimes the results can
be processed one by one without any need by the application to ever
store the entire set of results.  Another common method that overcomes
these problems is to have a notion of a database cursor.  A query sets
the cursor, and then you can read in the results one by one.  The only
problem with this is that it may not conform to any standard
interface.  But if you make your database cursors actually be input
iterators -- or write an adaptor for this purpose -- then you obtain a
very flexible interface.  A query returns an input iterator i of some
type X, with [i, X()) being the sequence of return values.  The
iterator pair (i, X()) can then be passed to any number of STL
algorithms, used to initialize the contents of a standard container,
transferred to an existing container via STL's copy() algorithm, or
used directly by your own code.


A DATABASE EXAMPLE

Let's look at that last example above in more detail; in particular,
consider putting an STL-style interface on the Unix dbopen() facilities.
The dbopen() function can actually be used to create and access three
different kinds of database files: flat files of records, hash-table
files storing (key, value) pairs, and B-tree files storing (key,
value) pairs.  With the B-tree option you can access (key, value)
pairs in ascending key order, according to a user-defined comparison
function.  We'll ignore the flat-file and hash-table options, and
concentrate on the B-tree option.

I've written the classes const_db_btree and db_btree to provide the
STL-style interface.  These assume that your keys are of class Date
(Listing 0), and the associated values are ints.  In practice you'd
want to make const_db_btree and db_btree be class templates to allow
arbitrary key and value types -- and that is what I originally did --
but doing so complicates the exposition considerably.  Since this is
an article about input iterators, and not about generic container
classes, I opted for simplicity of description over generality here.

The const_db_btree class is for read-only access to an existing B-tree
file; its header files are given in Listings 0 and 1.  The db_btree
class publicly inherits from const_db_btree, and simply adds
additional functionality to allow you to insert new (key, value)
pairs, delete them, or construct a new B-tree file.  Since db_btree
adds nothing that is relevant to a discussion of input iterators, I
have omitted it.

We'll talk about the implementation later; for now just skip down to
the line that says "INTERFACE STARTS HERE" in Listing 1.  The
constructor argument is the name of the B-tree file you want to
access.  The function begin() returns an input iterator pointing at
the first entry in the B-tree, according to the key ordering.  The
function end() simply returns the default-constructed input iterator;
it is the past-the-end value for all sequences produced by this
class.

The call x.find_from(k) returns an input iterator pointing at the 
first entry whose key is >= k, or x.end() if there is no
such entry.  Thus [x.find_from(k), x.end()) is the sequence of all
entries in the B-tree file of x with keys >= k.

The call x.find_to(k) returns an iterator pointing at the first entry in
the B-tree.  This iterator value differs from x.begin() in its behavior when
repeatedly incremented: after reaching the last element with key <= k,
incrementing once more will cause it to be equal to x.end().  Thus
[x.find_to(k), x.end()) is the sequence of all entries in the B-tree
file of x with keys <= k.

The call x.find_rng(k1,k2) returns an iterator pointing at the first
entry whose key is >= k1, or x.end() if there is no such entry.  This
iterator value differs from x.find_from(k1) in its behavior when
repeatedly incremented: after reaching the last element with key <=
k2, incrementing once more will cause it to be equal to x.end().  Thus
[x.find_rng(k1,k2), x.end()) is the sequence of all entries in the
B-tree file of x with keys between k1 and k2 inclusive.

The dbopen() interface has a notion of a cursor into the database,
which is used in the implementation of const_db_btree's iterators.
However, the dbopen() interface only allows for a single database
cursor.  This prevents us from having more than one valid,
dereferenceable iterator into the database at a time.  So a side
effect of the calls x.begin(), x.find_from(f), x.find_to(k), and
x.find_rng(k1,k2) is that any existing dereferenceable (not x.end())
iterators pointing into x are invalidated.

The call x.get(k, d) returns true if there is an entry in x's B-tree
with key k, and assigns to d the value associated with key k;
otherwise it returns false and leaves d unchanged.  It does not
invalidate any existing iterators into x.

Now for an example of using const_db_btree.  Suppose you have a B-tree
database where the keys are dates and the associated values are the
number of widgets your store sold on that date.  The name of the
database file is "sales".  Listing 2 shows some code that accesses the
database to print out the five best sales days within a given date
range.  The template functions partial_sort_copy() and copy(), and the
class template ostream_iterator, are from the standard library.  The call
  partial_sort_copy(iit0, iit1, rit0, rit1, cmp);
stores in [rit0, ri1) the N "best" values in the sequence [iit0,
iit1), where N is the distance from rit0 to rit1, and "best" is
defined by the comparison object cmp.  rit0 and rit1 must be
random-access iterators.  The call to copy() just writes the values in
array A to cout, using operator<<(), separated by the string "\n".


INPUT ITERATOR REQUIREMENTS

The material in this section is based on sections 24.3 and 24.1 of the
December '96 working paper for the draft proposed C++ standard (which
I'll just call the WP), especially section 24.1.1, which is devoted to
input iterators.  The WP discusses three kinds of iterator values:
dereferenceable values, past-the-end values, and singular values.
Dereferenceable iterator values "point at" some value in a container
or input sequence.  Past-the-end values can't be assumed to point at
anything, but mark the end of a sequence.  Singular values are
basically unusable iterator values; the only thing you can do with an
iterator variable holding a singular value is to assign it a new
value.  Uninitialized pointers are one example of singular values.

Table 1 is taken (with slight modifications) from the WP, and gives
the requirements for a type X to be considered an input iterator for
value type T.

Item 1 says that X must have a copy constructor and destructor.

Item 2 says that X must allow copy assignment.  The WP says that a
copy assignment should return a value of type X, which I suspect is a
typo -- copy assignments (including compiler-generated copy
assignments) usually return a value of type X&, and the descriptions
given for istream_iterator and istreambuf_iterator don't support the
idea of a return type of X.

Items 3 and 4 discuss the == and != operators.  These may return
values of any type that can be converted to bool, which includes bool
itself and integer types.  In the working paper, the phrase "(a,b) is
in the domain of ==" simply means that the expression a == b is
(required to be) defined.  The phrase "a is in the domain of ==" means
that there exists some iterator value b for which the expression a ==
b is (required to be) defined.  Singular values are not in the domain
of ==; i.e., if a is a singular value, then neither a == b nor b ==
a is (required to be) defined.  Presumably a != b is defined
whenever a == b is defined, and vice versa.  If an expression (such
as a == b) is not required to be defined, then the standard doesn't
care what it returns or what it does; it could sit in an infinite
loop, it could cause the program to crash, it could stomp all over
memory, or it could delete all of your files.

Curiously enough, although the WP gives some cases where the
expression a == b is not required to be defined, it never explicitly
says just when a == b must be defined.  However, it seems to be
assumed that if a and b are non-singular iterator values marking
positions in (or past-the-end of) the same sequence, then a == b must
be defined; otherwise loops such as
  for (it = start; it != end; ++it)
wouldn't work.

The requirement that == be an equivalence relation simply means that
it must behave as you expect equality to behave: i.e.,
1. a == a is always true whenever the expression is defined (any
   non-singular iterator is equal to itself);
2. if a == b is defined and true then b == a is defined and true (the
   order of the arguments for == doesn't matter); and
3. if a == b and b == c are both defined and true, then a == c is
   defined and true (you can chain together equalities to deduce new
   equalities).

Item 5 discusses the dereferencing operator.  There's a bit of
circularity in the WP here.  The WP says that a is a dereferenceable
iterator if *a is defined; then it turns around and, in item 5, says
that *a is defined if a is dereferenceable.  In any event, the
intention is that *a returns the value "pointed at" by a, or the first
value in the (sub)sequence whose beginning is marked by a.  There's
some ambiguity about the return type of *a.  Section 24.1.1 says that
*a must return a value of type T, but the description of
istream_iterator<T,charT> indicates a return type of T const &.
Perhaps the committee intended the return type to merely be convertible
to T.

Item 5 also imposes a consistency requirement on == and *: If a == b
is defined and true, then *a and *b are either both undefined or both
defined and return the same value.  Combined with items 3 and 4, this
tells us that two dereferenceable iterator values a and b must be
considered unequal if they point to different values, and that a and b
must be considered unequal if a is dereferenceable and b is a
non-dereferenceable, past-the-end value.

Item 6 simply requires that a->m can be used as an abbreviation for
(*a).m, where m is a member of class T and a is an iterator value.

Item 7 implies that a dereferenceable input iterator can be repeatedly
incremented without obtaining a singular value, until it reaches a
past-the-end value.  (If a past-the-end value is never reached, then
the iterator can be incremented indefinitely, and it defines an
infinite sequence.)  But item 7 is most interesting for what it does
not require.  If it1 and it2 are copies of the same iterator value,
and the operation ++it1 occurs, then it2 is no longer required to be
usable; it may be considered a singular value.  The reason for this is
that input iterators are used for one-pass algorithms -- you aren't
supposed to return to previous values of an iterator.  The
implications for == are that when implementing == you need only
consider comparing an input iterator with itself, copies of itself,
and the past-the-end value; it doesn't matter what == returns in any
other case.

This lack of a requirement is important, as it allows efficient
implementations of input iterators.  The original input iterator
requirements, given in Stepanov and Lee's seminal paper defining STL,
were stronger; as a result, istream_iterators had to be implemented
with three data members:
1. A pointer to an istream.
2. A bool indicating whether this was the past-the-end value.
3. A variable of type T, holding the value (if any) "pointed at" by
   the iterator.
This made istream_iterators rather large structures, which could be
expensive to copy, especially if values of type T were expensive to
copy.  But iterators get copied all the time, because they are
typically passed by value; they should therefore be inexpensive to
copy.

The weakened requirements allow an input iterator to contain a single
data member: a pointer to a structure that holds the value "pointed
at" and whatever information is necessary to generate the remaining
values in the sequence, with a null pointer indicating the
past-the-end value.  Equality of iterators may be implemented as
equality of the enclosed pointers.  When an input iterator is
incremented, it may need to update the values in the structure pointed
at, but it can continue to hold the same pointer; since it is no
longer legal to dereference or compare for equality any copies of the
old value of the iterator, it doesn't matter that these in fact hold
the same value as the incremented iterator.  

Item 8 says that, if you ignore return values, post-increment and
pre-increment of an input iterator are equivalent.  Again, the most
interesting thing is what is not required.  There is no restriction on
the return type of post-increment.  In particular, it need not return
an iterator value.

Item 9, though, adds the additional restriction that *r++ must be
defined, and equivalent to incrementing r and then returning the old
value pointed at.  At first this would seem to require that r++ return
the old value of r, and that this old value remain dereferenceable,
thereby unraveling the implementation scheme I described above.  But
there is another way of satisfying item 9.  You can define a proxy
class whose only purpose is to hold a value of type T, and for which
operator*() is defined and returns the enclosed value.  If the
operation r++ returns a proxy containing the value of *r before
incrementing, item 9 is then satisfied.  I'll show an example of this
in the next section.

Finally, there is a time-complexity requirement: all of the operations
in Table 1 must take constant time (amortized).  What this means is
that, for any particular machine and compiler, you can put a fixed
upper limit t on how long any one of the operations takes, that is
independent of the length of the sequence.  An example of violating
this requirement would be if the first increment operation took one
unit of time, the second took two units, the third took three units,
and so on.  Actually, the requirement is a little more lenient than
this, in two ways:

1. Constructing, assigning, or copying the value pointed at by an
iterator only counts for one unit of time, regardless of how long it
actually takes.  Otherwise it would be impossible to satisfy the time
complexity requirements if you had a value type like vector<int>,
since the time required to copy a vector is proportional to its
length.

2. Individual increment operations may take longer than the upper
limit t, as long as iterating over the entire sequence takes no longer
than a time of Nt, where N is the length of the sequence.  One example
of where this becomes important is when doing a depth-first traversal
of a tree structure.  Going from the rightmost leaf of a large subtree
to the leftmost leaf of the next subtree may take time proportional to
the depth of the tree, since you have to follow many parent links up
to get to the next subtree.  But if you traverse the entire tree, many
steps will only require following one child link or one parent link;
the average will be just under two links followed for each node of the
tree.

In addition there are certain required typedefs that are used by the
generic algorithms in the standard library.  You can satisfy these
requirements simply by making your input iterator class X be publicly
derived from
  std::iterator<
    std::input_iterator_tag,
    T, Distance
  >
where Distance is an integer type that can hold the maximum possible
length of a sequence defined by iterators of type X.

If your compiler does not support partial instantiation of templates,
then your version of STL has to use an older and clumsier method for
getting type information to generic algorithms.  In that case,
publicly derive X from
  std::input_iterator<T, Distance>.
In either case, if your compiler doesn't support namespaces then leave
off "std::".


AN ADAPTOR FOR CREATING INPUT ITERATOR CLASSES

Rather than deal with all of these requirements anew every time I
needed to define an input iterator class, I decided to write an
adaptor to take care of the details once and for all.  The code is
given in Listing 4.  The symbol BRAIN_DEAD_COMPILER should be defined
if your compiler can't handle partial specialization of templates.

The class input_iter_adaptor<P, Distance> is a wrapper around a pointer or
smart pointer class P that makes it behave as an input iterator.  The
type Distance should be chosen as described at the end of the previous
section.  The type P must satisfy the following requirements:

1. pointer_traits<P>::element_type is defined as some type G.  The
pointer_traits class template is given in Listing 3.  This requirement
is automatically satisfied if P is a pointer type, or if the type
P::element_type is defined.  element_type is the type of structure
pointed at by objects of type P.  If the symbol BRAIN_DEAD_COMPILER is
defined, you can ignore this requirement, but you'll have to
explicitly provide the value_type (Val parameter) and deref_type (Der
parameter) mentioned below.

2. G::value_type is defined as some type T with a public copy
constructor.  T is the value type for the input iterator.

3. G::deref_type is defined as either T or T const &.  It is the
return type of the dereference operator (*it).  This is to accomodate
the ambiguity in the WP as to whether the return type of *it must be T
or can be T const &.

The following are required for any value g of type G:

4. g.value() has type G::deref_type, and returns the most recent
value generated in the sequence.  g.value() is undefined if g.next() has
never been called or the last call of g.next() returned false.

5. g.next() generates the next value in the sequence and returns true
if there are more values in the sequence, otherwise it returns false.

Now on to the implementation.  The input_iter_adaptor class has a
single data member p of type P.  The private function inc() gets the next
value in the sequence, setting p to the null pointer if there are no
more values in the sequence.  

The default constructor creates a past-the-end value by initializing p
with the null pointer.  The other constructor allows an iterator to be
constructed from a P value, and fetches the first value of the
sequence (if any).

The class PIR (post-increment return) is the proxy class used to
implement expressions of the form *r++.  The post-increment operator
initializes a PIR object with the current value pointed at, increments
the iterator, then returns the PIR object.

The remaining operations should require no further explanation.

The class P can simply be a pointer to G when the object generating the
sequence of values has a static scope that encloses uses of the input
iterator -- i.e., when you have something like this:
  { G g(...);
    input_iter_adaptor<G *>
      i(&g), iend();
    [code that doesn't pass i out
     of this scope]
  }
The implementation of const_db_btree in the next section is an example
of this.

In some cases you'll want P to be a smart-pointer class for G.  For
example, suppose you want to create an input iterator for the
Fibonacci sequence.  This sequence is defined by
  F(1) = 1
  F(2) = 1
  F(n+2) = F(n) + F(n+1)
Listing 6 shows a function -- fibonacci(n) -- that returns an input
iterator for the first n elements of the Fibonacci sequence.  Class
fib is the generator class G mentioned in the requirements given
above.  The input iterator uses a reference-counted smart pointer
class template defined in Listing 5.  This smart pointer class keeps
track of how many pointers are referencing an object, and deletes the
object when there are no references to it. This is needed because the
input iterator -- along with its enclosed pointer -- is passed out of
the scope in which the fib object is created, and so some means of
reclaiming the fib object's storage is needed.


USING THE ADAPTOR WITH THE DATABASE EXAMPLE

Now we're ready to look at the implementation of the const_db_btree
class described earlier, which is an STL-style interface to the
facilities provided by the dbopen() function.  The implementation is
given in Listings 1 and 7. 

The Unix function dbopen() returns a pointer to a DB struct.  Most of
the DB struct members are pointers to the various functions needed to
access the database -- and all of these functions take the DB pointer
as an argument!  This somewhat convoluted arrangement is used because
there are several different kinds of databases that can be opened with
dbopen(), and each has its own version of each access function.  Keys
and their associated values are represented via DBT values:
  struct DBT {
    void * data;
    size_t size;
  };
data is a pointer to an array of size bytes.  The various database
routines take and return DBT values.  The functions pod2dbt() and
dbt2pod() (Listing 7) convert "plain old data" to and from the DBT
representation.

Class const_db_btree has one data member, body, of type datagen (see
Listing 1).  This is the generator class for its iterator class: an
iterator is just a wrapper around a pointer to body (see the typedef
for iterator and the functions that return an iterator).  The data
members of class datagen are
* value_: the (key,data) pair for the current database cursor position.
* dbp: pointer to the DB struct returned by dbopen().
* key, data: DBT representations of value_.
* ek: greatest key value to retrieve (for find_to() and find_rng()).
* endkey: DBT representation of ek.
* flag: operation code used by the database access routines returned
  by dbopen().

The function const_db_btree::ctor() (Listing 7) is called by the
constructor.  It simply opens the database as a B-tree file, and
specifies that the function compare() be used to compare (the DBT 
representations of) keys.

The function datagen::set(Key * startk, Key * endk) (Listing 7) is
used by begin(), find_from(), find_to() and find_rng().  It sets
things up for the first call of next().  It is always immediately
followed by the statement
  return iterator(&body);
Since the constructor for input_iter_adaptor() calls next(), every
call of set() is followed by a call to next().  startk is (a pointer
to) the lower bound on the range of keys we want to fetch; if it is
null, we want to start with the first entry in the database.  endk is
(a pointer to) the upper bound on the range of keys we want to fetch;
if it is null, we want to continue to the end of the database.

The actual database accesses all occur in datagen::next() (Listing 7),
which is called by the iterator increment operation and the iterator
constructor (Listing 4).  The next() function determines if there are
any more (key,data) pairs to fetch, and if so, fetches the next one.
On entry to this function, flag is set to either R_FIRST, R_CURSOR or
R_NEXT (R_SETCURSOR is used only in db_btree).  The R_FIRST flag tells
the dbp->seq() function to fetch the first (key,data) pair in the
database.  The R_CURSOR flag says to fetch the first (key,data) pair
with key no less than the key data member, and set the database cursor
to this position.  The R_NEXT flag says to advance the database cursor
to the next (key,data) pair and fetch it.  In all cases the new
(key,data) pair ends up in the data members key and data.  If the
return code is > 0, no more database entries could be found.  If the
returned key is > ek (whose DBT representation is in endkey), and we
are returning values from find_to() or find_rng(), this also indicates
the end of the sequence of values, and next() returns false.


CONCLUSION

Input iterators provide a standard interface for returning sets or
sequences of results, one that fits with STL and allows wider use of
generic programming techniques.  Use of the input_iter_adaptor class
template lets you easily define new input iterator classes without
having to pore through the detailed requirements for input iterators.