Lists vs. Sequences: Concatenation

In my previous post, I presented some of the benefits of thinking in sequences rather than lists. This post will discuss a concrete example, concatenation of multiple series of elements.

The case is as follows. You have several series of elements and you would like to aggregate all the contained elements into one concatenated series. This resulting series is for iteration only and does not have to be mutable.

To solve the task using lists, one could write a method like this:

public static IList<T> ConcatenateLists<T>(params IList<T>[] lists)
	List<T> retList = new List<T>();
	foreach (var list in lists)
	return retList;

The elements of all the lists passed to the method are added to a result list which is ultimately returned to the caller. Semantically, there is absolutely nothing wrong with this method. It produces the expected result.

Consider, however, the performance of this method if the input lists contain a million elements altogether. The result list would need to contain one million elements. If you are lucky, the elements are of a reference type, requiring only storage for one million references. If the elements are of a value type, however, one million objects, including their data, would have to be copied to the result list. If each object holds a kilobyte of data, the result list will need to allocate an entire gigabyte of memory before the iteration can even begin!

Surely, a smarter approach would be beneficial. My proposal is to think in sequences. Since the resulting concatenation only needs to be iterated over, there is no specific need for the result of the method to be a list in the first place. A sequence is sufficient. Further, the input elements can be regarded as a series of sequences as well, yielding a method like follows:

public static IEnumerable<T> ConcatenateEnumerables<T>(params IEnumerable<T>[] enumerables)
	foreach (var enumerable in enumerables)
		foreach (var item in enumerable)
			yield return item;

Notice the yield return statement. As previously discussed, this facilitates lazy execution. If, for some reason, the caller decides it only needs the first ten elements, then the yield return statement is only executed ten times.

Consider again the scenario of a million kilobyte-sized value type elements being concatenated. This time, the caller will be served the sequence one element at a time, from already allocated memory, reusing a single kilobyte-sized variable through the entire iteration.

Evidently, thinking in sequences can lead to remarkable performance gains. Of course, you need to consider the requirements and determine whether a sequence will meet your needs. If, for any reason, the resulting series has to be manipulated as a list, consider still using sequences and relying on LINQ’s Enumerable.ToList<TSource> extension method to aggregate the elements into a list when needed.

Thinking in Sequences

When dealing with series of objects, it is easy to think of them as lists, or arrays. That is, after all, the first collection types most people get acquainted with when becoming a programmer. And by all means, lists are very versatile and easy to employ in most situations. However, they often provide more functionality than you strictly need. Rather than considering collections of objects as lists, I find it helpful to think of them as sequences.

Since version 1.0, the .NET Framework has provided the IEnumerable interface for iterating over a sequence of objects.  These days, its generic cousin, IEnumerable<T>, introduced in version 2.0, is often preferred. Initially, those interfaces existed mainly to support the foreach statement. However, with the advent of LINQ to Objects, IEnumerable got a morale boost. Suddenly, anyone could easily create advanced queries against any sequence of objects.

In my opinion, the main advantage of IEnumerable is its stream-based nature. Like with streams, the first items of an IEnumerable sequence can be made available for processing without having to collect every single item first. A sequences does not even have to be finite. Consider, for instance, the following code, returning an infinite sequence of all positive integers.

public static IEnumerable<int> EnumerateAllPositiveIntegers()
	int integer = 0;
		yield return integer;

Note the yield return statement. This is a very handy shortcut which C# provides for creating IEnumerable sequences. Simply yield return every item you wish to include in the sequence. Those who are unfamiliar with such iterator blocks in C# may suspect that the code above will lead to an infinite loop. However, every yield statement will lead to the enumerator’s MoveNext method returning, giving control back to the client object. Of course, if the client iterates the sequence using a regular foreach loop expecting the sequence to terminate, an infinite loop will in fact be the result.

The following code shows how the integer-generating method above can be used in a LINQ query, without risking an infinite loop.

IEnumerable<int> firstFiftyOddPositiveIntegers = EnumerateAllPositiveIntegers()
	.Where(num => num % 2 == 1)

Rather than generating an infinite number of integers, consider an operation which needs to complete some time-consuming task for every value it returns. In such cases, exposing the results through an IEnumerable sequence using yield return will allow the client to process each value without having to wait for every value to be produced, collected and returned.

So, when are sequences preferable over lists and other collections? Obviously, if a series of items is potentially infinite, a sequence has to be used; it cannot be represented as a list or collection. Generally, sequences are intended for items to simply be iterated over, while lists are collections you can add items to and remove items from. My rule of thumb is to expose IEnumerable sequences whenever feasible, relying on LINQ to convert the sequence into lists or arrays, should it be necessary.

By thinking in sequences, LINQ makes even more sense than before, and it also makes it easier to spot those situations where a sequence is not sufficient for the task.