Sunday, September 2, 2007

Filtering Lists

Something I've seen done quite often in code-bases I've worked on is the addition of search methods to inherited collections. The developer will add a FindByID() or FindByCustomer() method to a list of purchase orders. While it is sometimes feasible to perform these kinds of searches directly against the database, there are advantages to embedding the search methods directly in container collection itself. For instance, some applications are required to work in a disconnected state. In a PDA application I worked on, the program had to run disconnected from the SQL Server database, and used an XML file to store the local data. The disk I/O was very slow however, and so I found it to be much more performant to load all of the data in memory, then query the data at runtime.

The drawback to these kinds of query methods is that they are very limited in their functionality. This is by design of course--you don't add a query method to a collection unless you have need of it. Query methods usually start out very simple anyway--"FindByCustomer()" is fairly straightforward. But what if you need something a little more robust? Do you write "FindByCustomerForDateRangeWhereOrdersArePendingAndAmountOwedGreaterThan()?" What if the items in the list are in a different kind of collection--e.g., what if the purchase orders are stored in an ArrayList instead of your custom collection that defines the query methods? You'll have to move the items to the custom query-able collection prior to executing the run-time query. That smells...

So I set as my requirements that:

  • query mechanism should be simple; preferably based on a single function.
  • query mechanism should be highly flexible, supporting varied and multiple criteria with ease.
  • query criteria should be decoupled from the query operation.
  • query operation should be useful on any kind of list

With these issues in mind, I began working on a way to dynamically query a list without adding query methods to the underlying collection, or even without caring what the underlying collection is. My approach is of the form "given a list of items, create a new list of items that match a certain criteria." I desired to separate the criteria used to search the list from the operation of creating a new list matching the criteria. The criteria code had to be robust enough to handle multiple criteria.

I started with an interface called IMatch. IMatch has a single function--IsMatch--that returns a boolean indicating a successful or failed match. With this interface, you can create multiple concrete matching criteria without caring what the criteria are.

 

/// <summary>

/// Determines if one item matches another.

/// </summary>

/// <typeparam name="T"></typeparam>

/// <remarks>

/// Collaborates with ListFilter for purposes of filtering target

/// data from lists.

/// </remarks>

public interface IMatch<T>

{

    /// <summary>

    /// True if the item matches the criteria, otherwise False.

    /// </summary>

    /// <param name="value"></param>

    /// <returns></returns>

    bool IsMatch(T value);

} 

For testing purposes, I decided to use a simple regular expression matching criteria. First, the unit test:

[Test]

 public void RegularExpressionMatchTest()

{

    IMatch<String> match = new RegularExpressionMatch("^Smith");

    bool result = match.IsMatch("Smith, Mr.");

    Assert.IsTrue(result);

    result = match.IsMatch("Star Wars");

    Assert.IsFalse(result);

}

The Regular Expression "^Smith" should match "Smith, Mr.", but not "Star Wars." The implementation of RegularExpressionMatch is as follows:

/// <summary>

/// Implements IMatch<String> using a regular expression.

/// </summary>

public class RegularExpressionMatch : IMatch<String>

{

    private String _Expression;

    /// <summary>

    /// Gets or Sets the expression being matched.

    /// </summary>

    public String Expression

    {

        get { return this._Expression; }

        set { this._Expression = value; }

    }

    #region IMatch<string> Members

    /// <summary>

    /// Implementation if IMatch.IsMatch.

    /// </summary>

    /// <param name="value"></param>

    /// <returns>true if the expression matches the value, otherwise false.</returns>

    public bool IsMatch(string value)

    {

        return System.Text.RegularExpressions.Regex.IsMatch(value, this.Expression);

    }

    #endregion

    /// <summary>

    /// Constructor: store the expression.

    /// </summary>

    /// <param name="expression"></param>

    public RegularExpressionMatch(String expression)

    {

        this._Expression = expression;

    }

}

At this point, two of my requirements are met: the query mechanism is simple, and the criteria is decoupled from the filtering operation. Indeed, at this point I have not written the filtering operation. Let's fix that. First, the unit test:

[Test]

public void MatchSimpleExpression()

{

    IMatch<String> match = new RegularExpressionMatch("^Jones");

 

    List<String> list = new List<String>();

    list.Add("Anderson, Larry");

    list.Add("Black, John");

    list.Add("Jones, Tom");

    list.Add("Jones, Steven");

    list.Add("Zachary, James");

    IList<string> results = ListFilter<String>.GetFilteredList(list, match);

    Assert.AreEqual(2, results.Count);

    Assert.AreEqual("Jones, Tom", results[0]);

    Assert.AreEqual("Jones, Steven", results[1]);

}

According to the unit test, we should be able to use a RegularExpressionMatch to filter a list of strings for items that match the regular expression. The actual filtering takes place on the highlighted line. The GetFilteredList is fed a source list that it will filter according to the matching criteria. The implementation should be very simple:

/// <summary>

/// Filters a list of items of T using a specified filter of T.

/// </summary>

/// <typeparam name="T"></typeparam>

public class ListFilter<T>

{

    /// <summary>

    /// Returns a list of T matching the specified criteria.

    /// </summary>

    /// <param name="source">The list being filtered.</param>

    /// <param name="criteria">An IMatch interface that performs the

    /// matching criteria.</param>

    /// <returns></returns>

    public static IList<T>

        GetFilteredList(IEnumerable<T> source, IMatch<T> criteria)

    {

        List<T> results = new List<T>();

        // add matching items to the results list.

        foreach(T item in source)

            if (criteria.IsMatch(item))

                results.Add(item);

        return results;

    }

    /// <summary>

    /// prevent the ListFilter class from being instantiated.

    /// </summary>

    private ListFilter()

    {

    }

}

As you can see, GetFilteredList simply creates a result set and adds items from the source list that match the criteria to the result set. By defining the source list as IEnumerable, we ensure that we can use any kind of list as a source.

At this point, we are left with one requirement: what about complicated queries? How do we execute a query such as "FindByCustomerForDateRangeWhereOrdersArePendingAndAmountOwedGreaterThan()?"  I could argue that the preceding query could be achieved by making repeated calls to GetFilteredList() using the IMatch class for each criteria being sought. However, that method suffers from two defects: 1) it is annoying; 2) it doesn't handle complex "or" criteria. For example, what if I want to include items that meet any one of many criteria in my search? It sounds like what we need is a CompositeMatch; specifically, a CompositeAndMatch, and a CompositeOrMatch. Let's do the CompositeAndMatch first. Here's the test:

[Test]

public void CompositeAndFilterTest()

{

    CompositeAndMatch<String> match = new CompositeAndMatch<String>();

    Assert.IsTrue(match.IsMatch("arbitrary string")); // and match without criteria should be true.

    match.Matches.Add(new RegularExpressionMatch("^Smith"));

    match.Matches.Add(new RegularExpressionMatch("John"));

    bool result = match.IsMatch("Smith, John A.");

    Assert.IsTrue(result);

    result = match.IsMatch("Smith, Albus");

    Assert.IsFalse(result);

    result = match.IsMatch("Black, John");

    Assert.IsFalse(result);

}

The CompositeAndMatch matches the first string because it contains both "Smith" and "John", but fails to match "Smith, Albus" because it does not contain "John", and fails to match "Black, John" because it does not contain "Smith." Here's the implementation:

/// <summary>

/// IMatch class that contains multiple IMatch instances. Matches if and

/// only if all of the contained IMatch instances are true.

/// </summary>

/// <typeparam name="T"></typeparam>

public class CompositeAndMatch<T> : IMatch<T>

{

    private IList<IMatch<T>> _Matches;

      

    /// <summary>

    /// Container for other IMatch instances.

    /// </summary>

    public IList<IMatch<T>> Matches

    {

        get { return this._Matches; }

    }

    #region IMatch<T> Members

    /// <summary>

    /// Implementation of IMatch.IsMatch()

    /// </summary>

    /// <param name="value"></param>

    /// <returns>true if all contained IMatch instances are true, otherwise false.</returns>

    public bool IsMatch(T value)

    {

        bool result = true;

        foreach (IMatch<T> item in this.Matches)

        {

            if (!item.IsMatch(value))

            {

                result = false;

                break;

            }

        }

        return result;

    }

    #endregion

    public CompositeAndMatch()

    {

        this._Matches = new List<IMatch<T>>();

    }

}

 The CompositeOrMatch is similar, except that it should return true if any of the contained matches are true:

[Test]

public void CompositeOrFilterTest()

{

    CompositeOrMatch<String> match = new CompositeOrMatch<String>();

    Assert.IsTrue(match.IsMatch("arbitrary string")); // or match without criteria should be true.

    match.Matches.Add(new RegularExpressionMatch("^Smith"));

    match.Matches.Add(new RegularExpressionMatch("John"));

    bool result = match.IsMatch("Smith, John A.");

    Assert.IsTrue(result);

    result = match.IsMatch("Smith, Albus");

    Assert.IsTrue(result);

    result = match.IsMatch("Black, John");

    Assert.IsTrue(result);

 

    result = match.IsMatch("Potter, Harry");

    Assert.IsFalse(result);

 

}

The matcher should return true for "Smith, John A.", "Smith, Albus", and "Black, John"; but "Potter, Harry" contains neither "Smith" nor "John" so we should not get a match there. Here is the implementation:

/// <summary>

/// IMatch class that contains multiple IMatch instances. Matches if any one

/// of the contained IMatch instances is true.

/// </summary>

/// <typeparam name="T"></typeparam>

public class CompositeOrMatch<T> : IMatch<T>

{

    private IList<IMatch<T>> _Matches;

       

    /// <summary>

    /// Container for the other IMatch instances.

    /// </summary>

    public IList<IMatch<T>> Matches

    {

        get { return this._Matches; }

    }

    #region IMatch<T> Members

    /// <summary>

    /// Implementation for IMatch.IsMatch()

    /// </summary>

    /// <param name="value"></param>

    /// <returns>true if any of the contained IMatch instances are true, otherwise false.</returns>

    public bool IsMatch(T value)

    {

        // result should be initialized to true if there are no criteria

        // otherwise false.

        bool result = this.Matches.Count == 0;

 

        foreach (IMatch<T> item in this.Matches)

        {

            if (item.IsMatch(value))

            {

                result = true;

                break;

            }

        }

        return result;

    }

    #endregion

    public CompositeOrMatch()

    {

        this._Matches = new List<IMatch<T>>();

    }

 

}

Now, since the CompositeAndMatch and CompositeOrMatch classes both implement IMatch, they each CompositeMatch class can itself be used as a component of another composite match. Thus, if you have a query such that ComplicatedMessOfCriteriaMustBeTrueA or ComplicatedMessOfCriteriaMustBeTrueB, you could wrap the complicated mess of criteria for both A and B in CompositeAndFilters, then add each of the CompositeAndFilters to a master CompositeOrFilter. At this point, you can filter the entire list on all of the criteria with a single call to GetFilteredList(). Thus, our final requirement of being able to solve complex queries is now met. Enjoy!