Make Mostly Read, Seldom-Written Lists Much More Efficient

One of the many things that I do at work is run a full-blown Search Engine which I also developed from scratch. This Search Engine feeds all product related information to our websites. A search index consists of a pre-computed collection of products, their properties, a list of words that are correctly spelled, and some pre-computed faceted/guided navigation. A search index, until this week, took up approximately 10.7 gigs of memory. This was becoming too large as we added new products every single day.

As of writing this, it now takes only 4.8 gigs of memory and is only slightly (1-3%) less performant than before. How did I do it? Believe it or not, a very simple data structure and algorithm change.

In the Search Engine, a product’s properties are a key-value pairing of strings… Things like “isInStock” “1” or “color” “red” etc. We store the properties in a collection, per product. The collection was originally:

Dictionary<string, HashSet<string>> _entityProperties;

The key of the Dictionary was the property name and the HashSet of strings were the values for that property name (property names are not a “unique” key – a product could have multiple colors for example). I initially chose this data structure because we have a heavy need for DIRECT lookups to property names and values. Methods like HasProperties(string propertyName) and HasProperty(string propertyName, string propertyValue) are essential to the search engine’s function, and need to be performant. Thus, I figured that a Dictionary and HashSet would be best, since both offer O(1) lookup times and the index is read from 10000x more often than it is written to. O(1 + 1) is pretty good when it comes to complexity.

It turns out that there was a simpler, better data structure for my goals which also satisfies the performance requirements of the aforementioned methods.

As you may or may not know (and I learned the hard way), a HashSet<T> is actually not very efficient when you have only a few items in it. A List<T> is actually more performant for small collections (4 or fewer objects with simple GetHashCode() methods, such as strings, in my testing). This is true even though your average lookup/read case goes from O(1) to (1/2n) since you must traverse the List to find your desired object. The reason that List is faster is that there is no hash key computation, and the List<T> is basically an elastic array and thus takes less memory and has less overhead than a HashSet<T> with the same number of objects in it. Since my product properties typically only consist of 2 or 3 values for a given property name, I changed my data structure to this:

Dictionary<string, List<string>> _entityProperties;

This shaved approximately 10% off of the memory footprint and brought my memory usage down to 9.6 gigs. The performance was basically identical in all performance tests. This was better than my HashSet, but still not great. I wanted to do better. I was sure that somehow I could do better.

I spent the good part of this week trying – and failing – to design a more efficient data structure than the above. I tried a string Trie with nodes that pointed to another Trie, I tried SortedList<TKey, TValue> instead of the above, and everything else that I could think of. Yet no matter what I did, the memory stayed the same and the performance was the same or worse. It sucked. I was still sure that somehow I could do better.

Finally, Wednesday morning, I had a thought in the shower (where I do my best thinking): two dimensional Arrays suck. They are well documented to, in general, have worse memory usage metrics than a one dimensional array (a quick Google will fill you in). A Dictionary of Lists is certainly a two dimensional jagged Array of sorts, and it wasn’t doing what I wanted in terms of memory. So, I took another approach and changed my data structure wildly – I flattened it out and made it one dimensional:

List<KeyValuePair<string, string>> _entityProperties;

Seems insane, right? I go from a Dictionary with an O(1) key lookup to a linear List of all keys and values stored together. And yet, it did the trick for my memory: it went from 9.6 gigs to 4.8 gigs. Half of the amount of memory used. I was stoked.

I saved this memory by both interning strings and taking advantage of the KeyValuePair being a struct. Structs are a lot more efficient than reference types when the object is small, and a KeyValuePair is indeed quite small.

A new problem needed solving, however. Each product has around 60-100 properties associated with it, and I needed them to be accessed efficiently and quickly with near-direct lookups. Traversing the [now giant] List was not acceptable in terms of performance.

As it stood, I went from an O(1 + 1) data structure (key and value lookup costs for Dictionary and HashSet) to an O(1 + 1/2n) data structure (Dictionary and List) and finally to an O(n) data structure (List). And to top it all off, the n in the 1/2n was 3 or 4 where the n in the flat List of KeyValuePair was between 60 and 100. Truly I was getting worse performance with each improvement – at least theoretically. Still, the allure of the memory savings was too great to ignore and I wanted to use this data structure.

It then hit me: why not use BinarySearch on the List<T> to look up items quickly and keep the List sorted while I add AND be able to check for duplicates before adding? It was certainly worth a shot, since binary search is an O(log n) algorithm which is an improvement over the List’s O(n) traversal. So, I modified my Add(string propertyName, string propertyValue) method to keep the List sorted as things were added to it. This is surprisingly easy to do.

note that from here on out I’m simplifying my code greatly to basic concepts from what actually exists in order to avoid giving away trade secrets or violating my NDA and being fired

public void Add(string propertyName, string propertyValue)
{   
    // TODO: null checks, etc.

    var keyValuePair = new KeyValuePair<string, string>(propertyName, propertyValue);

    // Add property name and value
    // First find the identical item if it exists
    var result = _entityProperties.BinarySearch(keyValuePair, _entityPropertiesComparer);
    // If result is >= 0, already exists, done
    if (result >= 0)
    {
        return;
    }

    // If it does not exist, a one's complement of the returned int tells us WHERE to insert to maintain the sort order
    _entityProperties.Insert(~result, keyValuePair);
}

The secret here is two-fold:

I created a custom KeyValuePair<string, string> comparer class that implements IComparer<KeyValuePair<string, string>> and basically does a case-insensitive string compare of first the key strings, then the value strings. This IComparer is required by the List’s BinarySearch method to determine the ordering of objects in the List.
The BinarySearch method returns a very useful value: a simple integer. If the int is < 0, it means that the item was not found in the List. If it is >= 0, it means that the item was found at the index of the value. If it returns a negative integer, it means not only that it was not found, but also that the proper index to insert the item at in order to keep the List sorted is the one’s complement of the value.

A super useful return type, indeed. This allows you to add your elements to a List while preserving an order, at the cost of your add being an O(log n) operation instead of a List’s usual O(1) add operation. However, if you don’t add things as much as you read the List (we only adjust the index once a day for example, but read it thousands of times per hour), this can be worthwhile. Additionally, you could add everything in O(1) time and then do a final List Sort in order to sort the List for a single O(log n) cost if the order of elements does not matter until you’re done adding everything. In my case, the order mattered as I added to the List because I did not ever want to add duplicates (same property name and value). The HashSet handles this for me – Lists do not.

So, now my add costs O(log n) instead of O(n), but the payoff is now my lookups cost O(log n) instead of O(n) as well. I adjusted my earlier mentioned HasProperty and HasProperties methods accordingly:

public List<string> GetSpecificPropertyValues(string propertyName)
{
    // TODO: null checks, etc.

    List<string> result = new List<string>();

    // Binary search the property name - null is the smallest value of string for comparison
    var keyValuePair = new KeyValuePair<string, string>(propertyName, null);
    // One's complement the start index
    var startIndex = ~_entityProperties.BinarySearch(keyValuePair, _entityPropertiesComparer);

    for (int i = startIndex; i < _entityProperties.Count; i++)
    {
        // Leave the loop when the property name no longer matches
        if (!string.Equals(propertyName, _entityProperties[i].Key, 
		    StringComparison.OrdinalIgnoreCase))
        {
            // Leave the loop
            break;
        }
                    
        result.Add(_entityProperties[i].Value);
    }

    return result;
}

public bool HasProperty(string propertyName, string propertyValue)
{
    // TODO: null checks, etc.

    // Binary search the property name
    var keyValuePair = new KeyValuePair<string, string>(propertyName, propertyValue);
    var startIndex = _entityProperties.BinarySearch(keyValuePair, _entityPropertiesComparer);
    return startIndex >= 0;
}

public bool HasProperties(string propertyName)
{
    // TODO: null checks, etc.

    // Binary search the property name
    var keyValuePair = new KeyValuePair<string, string>(propertyName, null);
    // One's complement the start index
    var startIndex = ~_entityProperties.BinarySearch(keyValuePair, _entityPropertiesComparer);
    if (startIndex >= _entityProperties.Count)
    {
        return false;
    }

    // Check that the next element matches the property name
    return string.Equals(propertyName, _entityProperties[startIndex].Key, 
	    StringComparison.OrdinalIgnoreCase);
}

Suddenly, I have the same “direct lookup” methods available as I did with my Dictionary and HashSet/List structure, but in a flat List with O(log n) complexity.

This yielded 50% less memory usage and only a 1-3% increase in performance times. A very acceptable trade for the Search Engine.

If you have a List<T> with a lot of objects in it, and performance is key to your application, consider using BinarySearch and/or Sort to access it in a much more efficient way. As long as you can create an IComparer<T> where T is your List objects type, you’ll have a more efficient List.