Iterators and iterables

Reading a large file

In bioinformatics, VCF files (Variant Call Format) are everywhere: variant calling, genotyping, population genetics… and they can be huge (millions of variants).

We could create a Python script to count the number of SNPs and to list the chromosomes.
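A minimal sketch of such a script follows. The three VCF records, the VCF_TEXT name and the read_snps function are invented for illustration, and the file is simulated with an io.StringIO object instead of a real file on disk.

```python
import io

# A tiny, made-up VCF used only for illustration.
VCF_TEXT = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t100\t.\tA\tG\t50\tPASS\t.
chr1\t250\t.\tC\tT\t60\tPASS\t.
chr2\t300\t.\tG\tA\t45\tPASS\t.
"""


def read_snps(vcf_fhand):
    """Return a list with one (chrom, pos, ref, alt) tuple per record."""
    snps = []
    for line in vcf_fhand:
        if line.startswith("#"):  # skip the header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        snps.append((chrom, pos, ref, alt))
    return snps


snps = read_snps(io.StringIO(VCF_TEXT))  # every record is kept in memory
num_snps = len(snps)
chroms = {snp[0] for snp in snps}
print(num_snps, sorted(chroms))
```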

This works just fine, but what would happen if the file had millions of lines?

Even with 10,000 SNPs this is fine, but notice what is happening: we are building a Python list of length 10,000 that stores 10,000 tuples. With 10 million lines, that list could take a lot of memory. We can’t hold millions of SNPs in memory without risking running out of it, and that is exactly what the snps list tries to do in this code.

But… do we really need to store every SNP to answer simple questions like:

How many SNP records are there?
Which chromosomes appear in the file?

No. We can compute those answers while streaming the file, without keeping all SNPs.

Now we’ll write a function that does not build a list. It reads the VCF line by line and only keeps:

a counter (count)
a set of chromosomes (chroms)

That’s a small amount of memory, no matter how many lines the VCF has.
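One possible version of that streaming function is sketched below. The summarize_vcf name and the small example VCF are assumptions made for illustration, not the original code.

```python
import io

# A tiny, made-up VCF used only for illustration.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t250\t.\tC\tT\t60\tPASS\t.\n"
    "chr2\t300\t.\tG\tA\t45\tPASS\t.\n"
)


def summarize_vcf(vcf_fhand):
    """Count the records and collect the chromosomes without storing the records."""
    count = 0
    chroms = set()
    for line in vcf_fhand:
        if not line.strip() or line.startswith("#"):  # skip blank and header lines
            continue
        chrom = line.split("\t", 1)[0]  # only the first column is needed
        count += 1
        chroms.add(chrom)
    return count, chroms


count, chroms = summarize_vcf(io.StringIO(VCF_TEXT))
print(count, sorted(chroms))
```

Only the current line, the counter and the set of chromosomes are alive at any moment, so the memory used does not grow with the number of records.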

With this approach we never store all the SNPs in a list. (In this case all the lines are held in the io.StringIO file-like object, but if you read a regular file from disk, Python would never load the whole file into memory at once.)

A file is something you can loop over:

for line in vcf_fhand:
    ...

The big idea is that the for loop gets one line at a time, so you don’t need more in memory.

If you build a list, you keep everything in memory.
If you stream the data, you keep in memory only what you need (counters, sets, summaries).

Iterables

An iterable is an object that can be looped over (iterated over): it is able to return an iterator. The iterator is the object that does the looping and hands out the items one at a time. For example, a list is an iterable and we can iterate over its elements one at a time. Other examples of Python iterables are tuple, str, dict, set, and range.

flowchart LR
    A["x = [1, 2, 3]"] -->|"iter()"| B["Iterator"]
    B -->|"next()"| 1
    B -->|"next()"| 2
    B -->|"next()"| 3
    B -->|"next()"| StopIteration

For instance, let’s imagine that we want to sum the squares from 1 to 10. We could create a list and then we could iterate over it using a for loop.
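A possible version of the list-based approach (the variable names are arbitrary):

```python
# All ten numbers exist in memory before the loop starts.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

total = 0
for number in numbers:
    total += number ** 2

print(total)  # 385
```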

A list is an iterable because we can get its elements one at a time. But a list has extra capabilities, the main one being that it holds its members in memory, and that is not required of an iterable.

In fact, we don’t need the members of the iterable to exist before we ask for them. For instance, range is also an iterable and it creates the elements when we ask for them. We could use range in the sum of squares example.
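A sketch of the same sum of squares using range, where no list of numbers is ever built:

```python
total = 0
# range(1, 11) produces 1, 2, ..., 10 one at a time, on demand.
for number in range(1, 11):
    total += number ** 2

print(total)  # 385
```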

In the first approach we created a list, but not in the second one.

The lazy advantage

An iterable can be lazy: the data does not need to be in memory, or even to exist when the object is created, for us to be able to access it. The fundamental difference between a range and a list is explained in the Python documentation:

The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed). Ranges are lazy, they build the numbers when they are needed but not before.

One of the advantages of iterables is that they can be used to analyze data that cannot be held in memory all at once. Moreover, they allow us to analyze that data while keeping only one item in memory at a time. We can create a range capable of returning a huge number of integers at almost no cost in time or memory.
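As a small illustration, we can compare the size of a tiny range with that of a huge one. The sizes reported by sys.getsizeof depend on the Python implementation, so the sketch only compares them and does not rely on a particular number of bytes.

```python
import sys

small_range = range(10)
big_range = range(10**12)  # created instantly, no numbers are generated yet

# Both objects only store start, stop and step, so they take the same space.
print(sys.getsizeof(small_range), sys.getsizeof(big_range))

# Individual items are calculated on demand.
last_item = big_range[-1]
print(last_item)  # 999999999999
```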

However, if you tried to transform this range into a list you would run out of memory (don’t run the next cell).

So, to process this data we don’t need to materialize it all in memory. We just need to be able to iterate over it, and that lets us skip the memory-intensive part.

Iterators

There is another concept, the iterator, that is related to the iterable. If the iterable is the object that can be iterated over, the iterator is the object that actually does the iteration. Iterators are to iterables what bookmarks are to books. We can iterate over the pages of a book, but during any particular reading we are always at one particular page, and that position has to be kept somewhere, for instance by leaving the book open at that page or by using a bookmark. You can think of an iterable as a book that can be read many times, while the iterator is the bookmark that moves along.

You can think of an iterator as a stream of data that is processed one item at a time.

In technical terms, as we have already shown, iterables are objects from which we can create iterators, using the __iter__ special method, while iterators are the objects that, once created, do the iteration.

Iterators: iter and next

While the defining characteristic of an iterable is the __iter__ method, the capability of creating iterators, the crucial trait of the iterator is that we can ask for the next item using the next function. Any object capable of returning its members one at a time when we request them using the next function is an iterator.

We can always create an iterator from an iterable using the iter function. For example, we could create an iterator from a list using iter, and then we could iterate over it using next.
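For instance, with a small list (the values are arbitrary):

```python
numbers = [10, 20, 30]

iterator = iter(numbers)  # create an iterator from the iterable

first = next(iterator)   # 10
second = next(iterator)  # 20
third = next(iterator)   # 30

print(first, second, third)
```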

If iter(x) works, x is iterable; if next(x) works, x is an iterator.

Iterable: works with iter(x)
Iterator: works with next(x)

An iterator will keep yielding items until it is exhausted, and then it will raise a StopIteration exception.
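A short sketch of what happens at the end of the items, catching the exception so that the example can run to completion:

```python
iterator = iter([10, 20])

next(iterator)  # 10
next(iterator)  # 20

try:
    next(iterator)  # there are no items left
    exhausted = False
except StopIteration:
    exhausted = True

print(exhausted)  # True
```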

That’s all there is to it: an iterator yields one item at a time until all its items are consumed, and then it raises a StopIteration exception to mark the end of the iteration.

File objects are iterators

Python is very fond of iterables and iterators and uses them for many tasks. For instance, every time you open a file you get a file object that is an iterator.
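We can check this with a small temporary file. The file name and its content are made up for this sketch.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, "w") as fhand:
        fhand.write("first line\nsecond line\n")

    with open(path) as fhand:
        # A file object returns itself when passed to iter...
        is_its_own_iterator = iter(fhand) is fhand
        # ...and it hands out one line at a time with next.
        first_line = next(fhand)
        remaining_lines = list(fhand)

print(is_its_own_iterator, repr(first_line), remaining_lines)
```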

Iterators are consumed, iterables are not

Look at the next cell and predict the result. Then run the cell and explain the result of the second sum (this is the biggest iterator gotcha).
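A cell along these lines would do (the numbers are arbitrary):

```python
numbers = [1, 2, 3]
iterator = iter(numbers)

first_sum = sum(iterator)
second_sum = sum(iterator)  # what will this be?

print(first_sum)
print(second_sum)
```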

This kind of error is common when dealing with files:

f = open("variants.vcf")
# First pass: count lines
count = sum(1 for line in f)

# Second pass: try to process data
for line in f:
    print(line)  # WARNING: This will do nothing! The iterator is empty.

As we have explained, an iterator is an object that represents a stream of data: it yields item after item until it is consumed. Once an iterator is exhausted, it is exhausted for good.

However, iterables are not consumed, you can iterate over them as many times as you want because functions that iterate over them, internally, create a new iterator every time they start a fresh iteration. Iterables are not exhausted because you don’t really iterate over them, you iterate over the iterator objects that are generated from them.

So, iterables are reusable while iterators are consumable.

As we have seen, for loops work with both iterables and iterators, but iterators will be consumed, while iterables can be iterated over as many times as you want.

Be careful: iterators return themselves when you pass them to the iter function, so the original iterator will still be consumed even if you thought you had created a new one.
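A short sketch of this trap:

```python
numbers = [1, 2, 3]
iterator = iter(numbers)

# Calling iter on an iterator does not create a new one.
same_iterator = iter(iterator)
is_same = same_iterator is iterator
print(is_same)  # True

next(same_iterator)         # consumes the first item of the original too
remaining = list(iterator)  # [2, 3]
print(remaining)

# Calling iter on the iterable creates a fresh, independent iterator each time.
fresh_items = list(iter(numbers))  # [1, 2, 3]
print(fresh_items)
```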

Iterators are iterable, but iterables are not iterators

All iterators are iterable because they implement the __iter__ special method (usually by returning themselves). So, if you pass an iterator to the iter function it just returns itself, whereas if you pass it an iterable it creates a new iterator every time. That is why for works with both iterators and iterables.

However, iterables are usually not iterators because they do not implement the __next__ special method. Remember, the requirement to be an iterable is to be able to create iterators, not to be an iterator yourself. So if you try to use the next function directly on an iterable you will get a TypeError exception.
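For example, calling next on a list fails, while calling it on an iterator created from that list works:

```python
numbers = [1, 2, 3]

try:
    next(numbers)  # a list is iterable, but it is not an iterator
    message = ""
except TypeError as error:
    message = str(error)

print(message)

first = next(iter(numbers))  # works: iter gives us an iterator
print(first)  # 1
```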

Iterators have no length

One limitation is that iterators usually have no length. (Some iterators provide a length hint through the __length_hint__ special method, but that is not the usual case and the hint is not guaranteed to be accurate.)
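As an illustration, len fails on a list iterator, while operator.length_hint from the standard library returns an estimate for those iterators that provide one:

```python
import operator

iterator = iter([1, 2, 3])

try:
    len(iterator)
    has_len = True
except TypeError:
    has_len = False

print(has_len)  # False

# An estimate of the remaining items; it is only a hint, not a guarantee.
hint = operator.length_hint(iterator)
print(hint)  # 3
```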

What we do have is the guarantee that they will always give us something back when we ask with the next function, either an item or a StopIteration exception, but the items are given one at a time and we usually don’t know how many there are.

You can materialize them by using list

If you ever try to print an iterator you won’t get much.

If for any reason you need to materialize them you can always use list or tuple.
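A small sketch using a map object, which is an iterator (the exact text printed for the object depends on the memory address):

```python
squares = map(lambda number: number ** 2, range(4))

# Printing the iterator shows the object, not its items.
print(squares)  # something like <map object at 0x...>

# list consumes the iterator and stores all its items in memory.
items = list(squares)
print(items)  # [0, 1, 4, 9]
```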

If you use NumPy, be aware that you can’t materialize an iterator into a NumPy array by passing it directly to numpy.array; you might get really odd results.

You could get the correct result by transforming the iterator into a list first.

Be aware that this solution works for small iterators, but if the number of items is large the intermediate list will take a lot of memory. As an alternative, NumPy has a way to materialize the iterator directly into an array without the intermediate list: numpy.fromiter.

Remember that NumPy assumes that all the elements are materialized in memory and that the number of elements is known. fromiter will perform better if you tell it how many elements the final array will have, using the count argument (if you know it).
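The three options discussed above can be sketched as follows, assuming that NumPy is installed (the values are arbitrary):

```python
import numpy as np

values = [0, 1, 4, 9, 16]

# Passing the iterator directly: NumPy wraps the iterator object itself
# in a zero-dimensional array instead of reading its items.
wrapped = np.array(iter(values))
print(wrapped.shape)  # ()

# Going through a list works, at the cost of an intermediate list.
via_list = np.array(list(iter(values)))
print(via_list)

# fromiter reads the items directly; dtype is required, count is optional.
direct = np.fromiter(iter(values), dtype=np.int64, count=5)
print(direct)
```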

More about the iterables and iterators

Python is fond of iterables

Many Python classes are iterable, like list, tuple, set, frozenset, and range.

Dictionaries are also iterable. By default you get their keys, but you can also iterate over their values and items.

Strings are also iterable over their characters.
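A quick illustration with a dictionary and a string (the genotype counts are made up):

```python
genotype_counts = {"AA": 10, "AG": 5, "GG": 2}

keys = list(genotype_counts)             # iterating a dict gives its keys
values = list(genotype_counts.values())  # the values
items = list(genotype_counts.items())    # (key, value) tuples

bases = list("ACGT")  # iterating a string gives its characters

print(keys, values, items, bases)
```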

How `for` works

The most usual way of iterating over an iterable is to use it in a for loop.

At this point somebody could ask: if iterables create iterators and iterators are what gets iterated over, how can for iterate directly over an iterable without requiring an iterator? That is a very good question. If you are not used to these kinds of objects this can be confusing, because it looks as if for iterates over the iterable itself. That is not really the case. Functions and statements that iterate over things, like the for statement, internally create an iterator before starting the iteration, and then they iterate over that iterator. The for loop does not iterate over the iterable, it iterates over the iterator returned by the iterable.

So what for really does is:

Calls iter(…) once
Calls next(…) repeatedly until it catches the StopIteration exception raised by the iterator.

for i in iterable:
  ...

really means:
  1. iterator = iter(iterable)
  2. repeat:
    x = next(iterator)
    execute body
    until StopIteration

We can reproduce the for behavior using iter and while.
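A rough equivalent of a for loop, written with iter, next and while (the seen list only records what the loop body receives):

```python
numbers = [1, 2, 3]
seen = []

iterator = iter(numbers)      # 1. create the iterator once
while True:                   # 2. repeat
    try:
        item = next(iterator)     # ask for the next item
    except StopIteration:
        break                     # the iterator is exhausted, leave the loop
    seen.append(item)             # the body of the loop

print(seen)  # [1, 2, 3]
```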

for accepts any object that iter accepts, and the first thing it does is create an iterator from that object.

We can show the internal call to iter by creating specially chatty iterable and iterator classes. (You don’t need to worry about how to implement these classes, we will study a much easier way of implementing our own custom iterators by using generators.)
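One possible pair of chatty classes is sketched below. The class names and the calls list used to record the method calls are invented for this illustration.

```python
calls = []  # records every special method call, in order


class ChattyIterator:
    def __init__(self, items):
        self._items = items
        self._index = 0

    def __iter__(self):
        calls.append("ChattyIterator.__iter__")
        return self

    def __next__(self):
        calls.append("ChattyIterator.__next__")
        if self._index >= len(self._items):
            raise StopIteration
        item = self._items[self._index]
        self._index += 1
        return item


class ChattyIterable:
    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        calls.append("ChattyIterable.__iter__")
        return ChattyIterator(self._items)


items = []
for item in ChattyIterable(["a", "b"]):
    items.append(item)

print(items)
print(calls)
```

The recorded calls show one call to the __iter__ method of the iterable followed by three calls to the __next__ method of the iterator: two that return items and a last one that raises StopIteration.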

The iterable users

for is not the only user of iterables that relies on the iterator trick. Many other Python functions do too, like zip, sum, max, min, any, all, enumerate, map, list, etc.
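A few of these functions in action, fed with iterators and other iterables (the values are arbitrary):

```python
numbers = [3, 1, 2]

total = sum(iter(numbers))                  # 6
largest = max(iter(numbers))                # 3
pairs = list(zip("abc", iter(numbers)))     # [('a', 3), ('b', 1), ('c', 2)]
indexed = list(enumerate(iter(numbers)))    # [(0, 3), (1, 1), (2, 2)]
doubled = list(map(lambda number: number * 2, numbers))  # [6, 2, 4]

print(total, largest, pairs, indexed, doubled)
```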

itertools and more-itertools

Once you start using iterables and iterators you should take a look at the itertools standard library module. There you will find many tools to get the most out of them.
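As a small taste, here are three itertools functions. This sketch uses only the standard library; more-itertools is a separate third-party package and is not used here.

```python
import itertools

# count produces an endless stream; islice takes only the first items.
first_evens = list(itertools.islice(itertools.count(0, 2), 5))
print(first_evens)  # [0, 2, 4, 6, 8]

# chain iterates over several iterables as if they were one.
chained = list(itertools.chain([1, 2], (3, 4), "ab"))
print(chained)  # [1, 2, 3, 4, 'a', 'b']
```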

Iterables don’t need to be sequences

Iterables just need to be iterable, nothing else. Lists and ranges have extra capabilities that go beyond the iterable requirement of being able to be iterated over, but these are not required. For instance, lists have random access: we can ask for any of their members at any time.

But that’s not a requirement to be an iterable. For instance, sets have no random access, but they are iterable.

Also, iterables do not need to have a stable order.

Iterables do not even need to have a finite number of elements, or a length at all. Let’s create an iterable that produces random integers forever.
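A possible implementation is sketched below. The helper iterator class, the low and high parameters and their default values are assumptions made for this example.

```python
import itertools
import random


class RandomIntIterator:
    """Iterator that never raises StopIteration."""

    def __init__(self, low, high):
        self._low = low
        self._high = high

    def __iter__(self):
        return self

    def __next__(self):
        return random.randint(self._low, self._high)


class InfiniteRandomInts:
    """Iterable that creates a new endless iterator on every iteration."""

    def __init__(self, low=0, high=9):
        self._low = low
        self._high = high

    def __iter__(self):
        return RandomIntIterator(self._low, self._high)


# A plain for loop over this iterable would never end,
# so we take just the first five items with islice.
first_five = list(itertools.islice(InfiniteRandomInts(), 5))
print(first_five)
```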

InfiniteRandomInts is iterable because we have implemented the __iter__ magic method, and that’s all an iterable needs to be iterable.

If you ever need to represent more constrained behaviors, like an object with a stable order, random access and a length, other protocols are available, like the sequence protocol.

Other resources

A Real Python tutorial on iterators and iterables.
The itertools standard library module.