Generating Code with LINQ

Tuesday 3 April, 2007, 07:35 PM

Jafar Husain shows an interesting technique: using LINQ to generate code at runtime. While I think this is a cool idea, I think he missed a trick that could have taken it a step further.

The main problem I have with the example is that the code is harder to read than it needs to be. It took me a little while to work out what was going on. I'll talk through what he's doing, and then show how I think we can tweak it to get a little more out of LINQ. This change might make it more readable, although that's probably a contentious claim.

(By the way, the original article is called LINQ to Code, although that sounds backwards to me: the convention seems to be that with phrases of the form LINQ to X, X is usually the source data type. That would make the original example a case of LINQ to Reflection. However, the code generation technique is the more interesting part, so I guess the title is just poetic license.)

How Jafar Husain's Example Works

The goal of the example is to use reflection to drive the generation of code that performs deserialization. This isn't a new idea: as he says, the example deliberately imitates the .NET XmlSerializer model. However, the example uses two features from LINQ to achieve its goal, and shows that LINQ can make this kind of thing much easier.

The first trick is to use the new query syntax in C# 3.0 to work with the information provided by the reflection API. This illustrates that you can use these query features of LINQ against any collection of objects that implements IEnumerable<T>. (E.g., in this particular case we're just querying a plain old array of PropertyInfo objects.) Here's a simplified version of Husain's code to illustrate the idea:

var props = from property in typeof(T).GetProperties()
  let attributes = property.GetCustomAttributes(
        typeof(FieldRangeAttribute), true)
  where attributes.Length > 0
  let fieldRangeAttribute = (FieldRangeAttribute)attributes[0]
  select new { property, fieldRangeAttribute };

This evaluates to a sequence of tuples, one for each property to which an attribute of type FieldRangeAttribute has been applied. (That attribute type is specific to this example by the way, so don't look for it in the docs. You can find the source in the original article.) Each tuple contains the property to which the attribute has been applied, and the attribute itself.
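If you want to run the query without fetching the original article, here's a minimal sketch. The FieldRangeAttribute below is my guess at the attribute's shape (a start and a length, matching how it's used later in this post), not Husain's actual source, and the Sample class and QueryDemo helper are made up purely for illustration.

```csharp
using System;
using System.Linq;

// Assumed shape of the attribute; the real definition is in the original article.
[AttributeUsage(AttributeTargets.Property)]
public class FieldRangeAttribute : Attribute
{
    public FieldRangeAttribute(int start, int length)
    {
        Start = start;
        Length = length;
    }
    public int Start { get; private set; }
    public int Length { get; private set; }
}

// Hypothetical type used only for this demonstration.
public class Sample
{
    [FieldRange(0, 5)]
    public string Code { get; set; }

    // No attribute here, so the query should skip this property.
    public string Ignored { get; set; }
}

public static class QueryDemo
{
    public static string[] Describe<T>()
    {
        var props = from property in typeof(T).GetProperties()
                    let attributes = property.GetCustomAttributes(
                        typeof(FieldRangeAttribute), true)
                    where attributes.Length > 0
                    let fieldRangeAttribute = (FieldRangeAttribute)attributes[0]
                    select new { property, fieldRangeAttribute };

        // Flatten each tuple into a readable string.
        return props
            .Select(p => p.property.Name + ": " +
                         p.fieldRangeAttribute.Start + "+" +
                         p.fieldRangeAttribute.Length)
            .ToArray();
    }
}
```

Calling QueryDemo.Describe<Sample>() should produce a single entry, "Code: 0+5", since Ignored carries no attribute.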

Of course we could have done this just as easily without LINQ. Here's a C# 2.0 implementation using the iterator syntax to generate a similar collection:

public struct PropAndAttribute
{
    public PropertyInfo property;
    public FieldRangeAttribute fieldRangeAttribute;
}
static IEnumerable<PropAndAttribute> GetProps(Type t)
{
    foreach (PropertyInfo property in t.GetProperties())
    {
        object[] attributes = property.GetCustomAttributes(
            typeof(FieldRangeAttribute), true);
        if (attributes.Length > 0)
        {
            PropAndAttribute item = new PropAndAttribute();
            item.property = property;
            item.fieldRangeAttribute = (FieldRangeAttribute) attributes[0];

            yield return item;
        }
    }
}

The lack of anonymous types in C# 2.0 means I've had to define this PropAndAttribute type to act as the tuple, but otherwise this code does much the same thing as the query-based example.

If you prefer your examples old-school, we could just build a collection object and return that:

static IEnumerable<PropAndAttribute> GetProps(Type t)
{
    List<PropAndAttribute> items = new List<PropAndAttribute>();
    foreach (PropertyInfo property in t.GetProperties())
    {
        object[] attributes = property.GetCustomAttributes(
            typeof(FieldRangeAttribute), true);
        if (attributes.Length > 0)
        {
            PropAndAttribute item = new PropAndAttribute();
            item.property = property;
            item.fieldRangeAttribute = (FieldRangeAttribute) attributes[0];

            items.Add(item);
        }
    }
    return items;
}

These three examples all amount to much the same thing. I'm not going to try and argue which of these is the most readable. The interesting point, I think, is that the new query syntax works fine against an array. When you first see the query syntax in C# 3.0 it looks so SQL-like that it's easy to get the impression that there's some sort of inline database query here. But the fact that there's no database involved in this particular example illustrates that the syntax is actually a whole lot more general-purpose. It's designed to be reminiscent of SQL, but it's not actually SQL. (Daniel Moth has a good explanation of what the C# 3.0 query syntax really means.)

So that's the first of the two LINQ techniques Jafar Husain's example shows. The second, and arguably the more interesting, is the code generation.

Runtime code generation isn't new, of course. In .NET v1 the Code DOM gave us the option to construct code and compile it at runtime, which is how ASP.NET builds pages on the fly. System.Reflection.Emit gave us the ability to emit IL at runtime, something the regular expression engine can exploit. (IL is .NET's byte code, for those of you with a Java background.) .NET 2.0 added LCG (Lightweight Code Gen), which simplifies certain aspects of IL generation, and also removes certain overheads. But both the source-centric and the IL-centric techniques have issues. Firing up the compiler is a bit of a heavyweight operation. Building your own IL is rather grungy, and you don't get a lot of help in ensuring the generated code is verifiably type-safe. (The CLR can verify the generated code for you after you've generated it, but it's nicer to have a model that doesn't let you get it wrong in the first place.)

LINQ introduces a new approach that is somewhere between the low-level IL-based techniques and the heavyweight compiler-based techniques. It introduces a runtime representation for expressions. You could think of this as a language-neutral abstract syntax tree. With a single method call, you can compile these things into IL. So this gives you what is arguably the best of both worlds: a quick, lightweight way to generate runnable code from data at runtime, without the need to dive down to the IL level.
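To make that concrete, here's about the smallest example I can think of: a tree representing x => x + 10, built by hand and then compiled. The CompileDemo class and BuildAddTen method are names I've invented for this sketch; the Expression factory methods are the standard ones in System.Linq.Expressions.

```csharp
using System;
using System.Linq.Expressions;

public static class CompileDemo
{
    public static Func<int, int> BuildAddTen()
    {
        // Represents the parameter 'x' in: x => x + 10
        ParameterExpression x = Expression.Parameter(typeof(int), "x");

        // Represents the body: x + 10
        BinaryExpression body = Expression.Add(x, Expression.Constant(10));

        // Wrap the parameter and body up as a lambda...
        Expression<Func<int, int>> lambda =
            Expression.Lambda<Func<int, int>>(body, x);

        // ...and turn the tree into a runnable delegate.
        return lambda.Compile();
    }
}
```

CompileDemo.BuildAddTen()(32) returns 42. The Compile call is the "single method call" in question; everything before it is just building a data structure.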

Jafar Husain's example uses this technique. But this is where I started to find it hard to follow, and figured there was room for improvement:

var type = typeof(T);
var lambdaParameter = Ex.Parameter(typeof(string), "line");
var subStringMethod = typeof(string).GetMethod("Substring",
      new Type[] { typeof(int), typeof(int) });

var expr =
    Ex.Lambda<Func<string, T>>(
        Ex.MemberInit(
            Ex.New(type.GetConstructor(new Type[] { })),
            from property in type.GetProperties()
            let attributes = property.GetCustomAttributes(
                  typeof(FieldRangeAttribute), true)
            where attributes.Length > 0
            let fieldRangeAttribute =
                  (FieldRangeAttribute)attributes[0]
            select (MemberBinding)Ex.Bind(
                property,
                Ex.Call(
                    lambdaParameter,
                    subStringMethod,
                    Ex.Constant(fieldRangeAttribute.Start),
                    Ex.Constant(fieldRangeAttribute.Length)
                    )
                )
            ),
            lambdaParameter
        );

Func<string, T> parse = expr.Compile();

You should recognize parts of this from earlier. It's using LINQ to retrieve properties from reflection as before. The two main differences are what's being done in the 'select' clause, and what we're doing with the results. (Note that in this example, Ex is presumed to be an alias for System.Linq.Expressions.Expression. You'd need to add using Ex = System.Linq.Expressions.Expression; to the top of your source file if you want to try this code out.)

The majority of this code builds a lambda expression. This is the expression object model's representation for a lambda, which is a form of anonymous method in C# 3.0. This particular lambda takes a single parameter, which is represented by the lambdaParameter variable. The lambda's body is represented by a member initialization expression, i.e. an expression which creates an object and then initializes some or all of that object's members. The type of object is identified by T here. (That's the name of a type parameter: in the original example this code is part of a generic method.)

The select clause generates MemberBinding objects by calling Ex.Bind; it'll create one for each property that the query finds. The member bindings tell the member initialization expression how to initialize each of the members of the newly created object. The property to be initialized is identified by the PropertyInfo passed as the first parameter. So this basically generates a sequence of property assignments, one for each of the properties generated by the query. (I.e., one for every property in the source type that has been annotated with the FieldRangeAttribute.)

The second parameter to Ex.Bind is an expression that will provide the value for the property. In this case, it's a function call expression. The function in question is the Substring method of the normal .NET String class. The target string for the call is identified by lambdaParameter here, which is a placeholder identifying the parameter of the lambda that the whole piece of code generates. Remember the code builds a lambda expression that takes a single parameter, and this part just says to feed that parameter back in here as the target of the method call. Finally, two values from the attribute applied to the target property are passed in as the parameters for the call to Substring.
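It can help to look at that method call expression in isolation, without the query or the member bindings around it. The following sketch builds a lambda whose entire body is the Substring call, with the parameter fed in as the target. CallDemo and BuildSlice are hypothetical names of mine, and the start and length arguments stand in for the values that would come from the attribute.

```csharp
using System;
using System.Reflection;
using Ex = System.Linq.Expressions.Expression;

public static class CallDemo
{
    public static Func<string, string> BuildSlice(int start, int length)
    {
        // Placeholder for the lambda's single string parameter.
        var line = Ex.Parameter(typeof(string), "line");

        // Pick the Substring(int, int) overload via reflection.
        MethodInfo substring = typeof(string).GetMethod("Substring",
            new Type[] { typeof(int), typeof(int) });

        // Represents: line.Substring(start, length)
        var call = Ex.Call(line, substring,
            Ex.Constant(start), Ex.Constant(length));

        // Represents: line => line.Substring(start, length)
        return Ex.Lambda<Func<string, string>>(call, line).Compile();
    }
}
```

CallDemo.BuildSlice(6, 5)("Hello World") returns "World". Note that the same line object appears twice: once as the target of the call, and once in the lambda's parameter list. That shared reference is how the tree knows the call operates on the lambda's argument.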

It took me a while to see the wood for the trees here. But the core of the trick is placing this call to Ex.Bind inside of the select clause. This results in that part of the expression tree being generated based on what the query returns. So if you apply that code to a type such as this:

public class Customer
{
    private string address;

    [FieldRange(0, 35)]
    public string Address
    {
        get { return address; }
        set { address = value; }
    }
    private string name;
    [FieldRange(35, 10)]
    public string Name
    {
        get { return name; }
        set { name = value; }
    }
}

it effectively generates an expression equivalent to this function:

Func<string, Customer> parse = input =>
    new Customer
    {
        Address = input.Substring(0, 35),
        Name = input.Substring(35, 10)
    };

Notice that the property initializers correspond directly to the properties of the target class to which the FieldRangeAttribute has been applied. That, in a nutshell, is what this example is doing.

The important thing to notice is that this ends up being compiled code. Notice that in the generative code, the final line is a call to the generated expression's Compile method. That builds runnable IL from the tree. So reflection only occurs during the method generation. Once you've built the method, it's JIT compiled like any other, and it's no longer relying on reflection. So it should run just as fast as that final equivalent bit of code shown in the last snippet.
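Here's a self-contained sketch of the whole thing that you can paste into a project and run. As before, the FieldRangeAttribute is my assumed version rather than the one from the original article, and Person and ParserFactory are invented for the demonstration. I've used short field widths so the test input is easy to read.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;
using Ex = System.Linq.Expressions.Expression;

// Assumed shape of the attribute; the real definition is in the original article.
[AttributeUsage(AttributeTargets.Property)]
public class FieldRangeAttribute : Attribute
{
    public FieldRangeAttribute(int start, int length)
    {
        Start = start;
        Length = length;
    }
    public int Start { get; private set; }
    public int Length { get; private set; }
}

// Hypothetical record type with two fixed-width fields.
public class Person
{
    [FieldRange(0, 5)]
    public string First { get; set; }

    [FieldRange(5, 3)]
    public string Last { get; set; }
}

public static class ParserFactory
{
    // All the reflection and tree building happens in here. The delegate
    // this returns is compiled code that no longer touches reflection.
    public static Func<string, T> Build<T>() where T : new()
    {
        var line = Ex.Parameter(typeof(string), "line");
        var substring = typeof(string).GetMethod("Substring",
            new Type[] { typeof(int), typeof(int) });

        // One property assignment per annotated property.
        var bindings =
            from property in typeof(T).GetProperties()
            let attributes = property.GetCustomAttributes(
                typeof(FieldRangeAttribute), true)
            where attributes.Length > 0
            let range = (FieldRangeAttribute)attributes[0]
            select (MemberBinding)Ex.Bind(
                property,
                Ex.Call(line, substring,
                    Ex.Constant(range.Start),
                    Ex.Constant(range.Length)));

        var body = Ex.MemberInit(Ex.New(typeof(T)), bindings);
        return Ex.Lambda<Func<string, T>>(body, line).Compile();
    }
}
```

With this in place, ParserFactory.Build<Person>() gives you a delegate, and calling it with "HelloBob" produces a Person whose First is "Hello" and whose Last is "Bob". You'd want to call Build once and hang on to the result, since repeating the reflection and compilation on every parse would throw away the benefit.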

So that's pretty cool. However, I don't know about you, but I actually found the code pretty hard to follow. It was only after trying it out and then writing this explanation that I really followed what was going on. Perhaps we can do better.

Putting the Compiler to Work

As I mentioned way back in 2005 when I first wrote about C# 3.0 expressions, the C# compiler can build these abstract syntax trees for us from C# source code. We don't normally need to go to the effort of building up the object model from scratch. So why does this example do that?

The reason this example does the work itself is that it wants to build the expression based on the data fetched at runtime. However, this doesn't mean we need to completely disregard what the C# compiler can do for us here. Here's my modified version:

public class TextRecordSerializer<T> where T : new ()
{
    public static Func<string, T> parse;
    static TextRecordSerializer()
    {
        var lambdaParameter = Ex.Parameter(typeof(string), "line");

        var expr =
            Ex.Lambda<Func<string, T>>(
                Ex.MemberInit
                (
                    Ex.New(typeof(T).GetConstructor(new Type[0])),
                    from property in typeof(T).GetProperties()
                    let attributes = property.GetCustomAttributes(
                        typeof(FieldRangeAttribute), true)
                    where attributes.Length > 0
                    let fieldRangeAttribute =
                        (FieldRangeAttribute) attributes[0]
                    select BindToExpression(str =>
                          str.Substring(fieldRangeAttribute.Start,
                                        fieldRangeAttribute.Length),
                        property,
                        lambdaParameter)
                ),
                lambdaParameter
        );

        parse = expr.Compile();

    }


    public static MemberBinding BindToExpression(
                    Expression<Func<string, string>> expr,
        PropertyInfo property, ParameterExpression lparam)
    {
        return Ex.Bind(property, Ex.Invoke(expr, lparam));
    }
}

Notice that the call to Substring is now using ordinary C# function call syntax. However, even though this code does not explicitly build a tree of expression objects, one is still created implicitly. That's because of the BindToExpression helper function I added. It takes a parameter of type Expression<Func<string, string>>. The Expression generic type is a special type recognized by the compiler. If you assign a lambda into this type, the C# compiler does not compile the code in the normal way. Instead it generates an expression tree that represents the lambda. Since I'm passing the lambda that contains the call to Substring into BindToExpression, that lambda will be turned into an expression tree.
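You can see the compiler's special treatment of Expression<T> with a much smaller experiment. In this sketch (LambdaDemo and its members are names I've made up), the same lambda text is assigned once to a delegate type and once to an expression type, and only the second can be inspected as data.

```csharp
using System;
using System.Linq.Expressions;

public static class LambdaDemo
{
    // Assigned to a delegate type: compiled to IL in the usual way.
    public static readonly Func<string, string> AsDelegate =
        str => str.Substring(0, 3);

    // Assigned to Expression<...>: the compiler instead emits code
    // that builds a tree describing the lambda.
    public static readonly Expression<Func<string, string>> AsTree =
        str => str.Substring(0, 3);

    public static string DescribeBody()
    {
        // The body is a method call node whose method is String.Substring.
        var call = (MethodCallExpression)AsTree.Body;
        return AsTree.Body.NodeType + ":" + call.Method.Name;
    }
}
```

LambdaDemo.DescribeBody() returns "Call:Substring", which is the same kind of node that the hand-written Ex.Call produced earlier. And LambdaDemo.AsTree.Compile() gives you back a delegate that behaves just like AsDelegate.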

I also made the helper function do the work of building the MemberBinding. This isn't strictly necessary. It just removes some of the clutter from the main loop.

To be fair, the result is still somewhat mind bending. I'm not sure this technique is actually a good idea, as I suspect the only reason I can look at the code and know what it does is because I wrote it. I have no idea how long it would take me to understand it 6 months from now. Also, I don't know if it's usefully more readable than the original. It's slightly more compact certainly, but you could argue that by enabling the use of ordinary C# 3.0 syntax to define the expression subtree that is to be generated, I've actually made it harder to distinguish between the code that generates the expression and the generated expression itself. It feels somewhat akin to writing an ASP.NET page using server-side JavaScript that generates client-side JavaScript, all mixed into one source file. Doubtless there are ways I could break it up to make it easier to read. But I'm not yet sure I know what the best way to do that is. Right now it all seems like minor variations on an essentially obscure theme.

Conclusion

None of this will be new to LISP developers. The idea of writing code that uses data to drive the generation of code is several decades old in that world, as is the idea that the compiler can take ordinary source code and give it to you as a data structure. Still, I think it's cool that we can now do the same in C# 3.0.

But I look at both Jafar Husain's original, and my adaptation, and I am reminded of how some LISP-based systems acquired a reputation for containing write-only code. This is powerful stuff, but is it maintainable? Of course, code generation is often tricky to follow; use of Reflection.Emit is rarely a model of simplicity for example. On balance, I think it's too early to tell. Only once we get familiar with these new language features and develop good taste in their application will we know how best to use them. But I'm not in a hurry to inflict this sort of thing on my customers' production systems.
