Getting more performance out of Entity Framework 6

I’ve been wanting to catch up on my database-related knowledge and have finally gotten around to setting enough time aside to studying it. After a short refresher of basic SQL, I picked the following course on Pluralsight: Entity Framework Database Performance Anti-patterns by Russ Thomas.

This course contains lots of tips and advice on what to consider in order to keep data retrieval up to speed in a project. From the perspective of the software developer, it informs me of the necessary design considerations and trade-offs to make so that the user experience remains smooth and responsive.

Additionally, Russ explains how shifting your thinking from procedural to set-based helps find more efficient ways to retrieve data while still maintaining a high level of abstraction.

This post goes over the following subjects:

  • projection
  • nested LINQ smell
  • improving SQL queries generated by Entity Framework

YAGNI – You ain’t gonna need it

The classical mistake of the programmer is to fetch way more data from the database than what is actually needed.

Filtering the results in the client

public void PrintMatchingUserBusinessIds(string lastNameToFind)
{
    using (var db = new AdventureWorks2012Entities())
    {
         var matchingPeople = from people in db.People
            select people;

        foreach (Person p in matchingPeople)
        {
            if (string.Equals(lastNameToFind, p.LastName))
            {
                Console.WriteLine(p.BusinessEntityID);  
            }                    
        }
    }
}

The query that will get sent to SQL Server looks like this:

SELECT * FROM Person;

The problems we have here is that:

1. we load all entries from the People table, even though we are only interested in those that have the wanted last name.
2. all the columns from the People table are loaded and converted to objects, but we only use the property BusinessEntityID.

Let’s fix problem 1: filtering the results in the database

public void PrintMatchingUserBusinessIds(string lastNameToFind)
{
    using (var db = new AdventureWorks2012Entities())
    {
        var matchingPeople = from people in db.People
                     where people.LastName.Equals(lastNameToFind) // filter in the database
                     select people;

        foreach (Person p in matchingPeople)
        {
            Console.WriteLine(p.BusinessEntityID);
        }
    }
}

Now it’s a little better. The query filters the people in the database:

SELECT * FROM Person p
WHERE p.LastName = @lastname;

Projection

However, we are still fetching i-don’t-know-how-many-columns, but we need just BusinessEntityID. Let’s fix problem 2 using a projection on BusinessEntityID:

public void PrintMatchingUserBusinessIds(string lastNameToFind)
{
    using (var db = new AdventureWorks2012Entities())
    {
        var matchingPeople = from people in db.People
                     where people.LastName.Equals(lastNameToFind)
                     select new { people.BusinessEntityID }; // projection

        foreach (var p in matchingPeople)
        {
            Console.WriteLine(p.BusinessEntityID);                    
        }
    }
}

This time, Entity Framework will generate a SQL query that will look like this:

SELECT p.BusinessEntityID FROM Person p
WHERE p.LastName = @lastname;

@lastname being the parameter containing the last name we are searching for.

Non set-based work

When programming with Entity Framework or with other ORMs, it’s so easy to just think procedurally: I need to fetch these People objects, and then I am going to use this and that property from these objects.
The ORM does its job and abstracts the details of the database away from the coder, which is nice because it makes his life a easier. However, we need to be aware of the impact our code will have on the database and how we can prevent performance problems.

Take a look at the code below:

public void PrintEmailAddresses(string lastNameToFind)
{
    using (var db = new AdventureWorks2012Entities())
    {
        var matchingPeople = 
            from people in db.People 
            where people.LastName.Equals(lastNameToFind)
            select people;

        foreach (var p in matchingPeople)
        {
            foreach (var address in p.EmailAddresses)
            {
                Console.WriteLine(address.EmailAddress1);
            }
        }
    }
}

At first, it may appear pretty harmless. We find a list of people with the given last name and print all their email addresses.

What we need to be careful about is that we are navigating the property EmailAddresses of the table Person: the email addresses are located in another table EmailAddress. In essence, Entity Framework will first fetch all the people with the given last name. We will then iterate over the list of people. When we try to access p.EmailAddresses, Entity Framework will send a query to get all the email addresses for the current person. Note that a separate query is executed for every single person.

-- fetch the matching people
SELECT * FROM Person p
WHERE p.LastName = @lastname;

-- fetch the email addresses for a person
SELECT * FROM  EmailAddress e
WHERE e.BusinessEntityID = @currentPersonBusinessEntityID;

It is inefficient to execute a separate query for each Person. Obviously we should join the tables Person and EmailAddress instead. And as we saw before, we are retrieving too much data with these SELECT * statements: we are only using the EmailAddress1 property (Note: the name of the property is EmailAddress1 to avoid a collision with the name of the table).

The name of the smell is nested LINQ query. Every time we have a LINQ query or a for/foreach loop inside a loop, something is likely to be wrong.

To avoid this kind of problem, the programmer needs to shift his thinking from procedural to set-based.

Set-based thinking

When coding, software engineers are used to thinking in a procedural way.
Procedural thinking:
Get a list of people with the specified last name and print their email addresses.
We are thinking in terms of rows of data.

Set-based thinking:
Join tables Person and EmailAddress, get the email addresses for people with the specified last name. Here, we are thinking in terms of columns.

When it comes to databases, procedural thinking has a lot more overhead than set-based thinking: it produces more queries to the database and tends to fetch more data than needed. With the set-based thinking we focus on how to get to the meat (the email address). With procedural thinking we focus on obtaining People objects to use their EmailAddresses property.

Now what does set-based thinking look like in our example, and how do we make it better?

Back to the email addresses’ example

To improve the email address code sample, let’s now think in a set-based way: we have people, and we want to get to their email addresses. The columns we need are Person.LastName to match the last name and EmailAddress.EmailAddress1. To get from table Person to EmailAddress, we need to join them.

Our first try is to use Entity Framework’s Include method to tell it it needs navigate EmailAddresses.

Solution 1: Include() method

public void PrintEmailAddresses(string lastNameToFind)
{
    using (var db = new AdventureWorks2012Entities())
    {
        var matchingPeople = 
                     from people in db.People.Include("EmailAddresses") // navigate property EmailAddresses
                     where people.LastName.Equals(lastNameToFind)
                     select people;

        foreach (var p in matchingPeople)
        {
            foreach (var address in p.EmailAddresses)
            {
                Console.WriteLine(address.EmailAddress1);
            }
        }
    }
}

The code above will cause Entity Framework to perform a join:

SELECT * FROM Person p
INNER JOIN EmailAddress e ON p.BusinessEntityID = e.BusinessEntityID;
WHERE p.LastName = @lastname;

This is better, but we are reading all the properties of Person and EmailAddress. Unfortunately, we cannot give more details to Entity Framework in the Include method and tell it to do a projection on EmailAddress1.

To have more control, we need to specify our query more precisely by defining the join and the projection explicitly. To do so, we can use the LINQ syntax.

Solution 2: Explicit join with LINQ

public void PrintEmailAddresses(string lastNameToFind)
{
    using (var db = new AdventureWorks2012Entities())
    {
        var emails = from people in db.People
                     join email in db.EmailAddresses                           // explicit join
                     on people.BusinessEntityID equals email.BusinessEntityID
                     where people.LastName.Equals(lastNameToFind)
                     select new { email.EmailAddress1 };                       // projection

        foreach (var e in emails)
        {
            Console.WriteLine(e.EmailAddress1);
        }
    }
}

Thanks to our improved LINQ query, Entity Framework now produces a more efficient SQL query:

SELECT p.BusinessEntityID, e.EmailAddress1 FROM Person p
INENR JOIN EmailAddress e ON p.BusinessEntityID = e.BusinessEntityID
WHERE p.LastName = @lastname;

We now obtain the wanted list of email addresses and got rid of the nested LINQ smell by removing the inner foreach loop.

Note that with this approach, Entity Framework still includes p.BusinessEntityID in the select list, although we do not need it.

An alternative is to use a LINQ extension method chain and performing SelectMany() on EmailAddresses.

Solution 3: LINQ extension method chain with SelectMany()

public void PrintEmailAddresses(string lastNameToFind)
{
    using (var db = new AdventureWorks2012Entities())
    {
        IQueryable< string > emails = db.People
            .Where(p => p.LastName.Equals(lastNameToFind))
            .SelectMany(p => p.EmailAddresses)
            .Select(email => email.EmailAddress1);  // projection
            
        foreach (var e in emails)
        {
            Console.WriteLine(e);
        } 
    }
}

With that code, we actually do not specify the join explicitly: Entity Framework infers from our SelectMany(p => p.EmailAddresses) that it needs to join Person to EmailAddress. The SQL query now looks like this:

SELECT e.EmailAddress1 FROM Person p
INNER JOIN EmailAddress e ON p.BusinessEntityID = e.BusinessEntityID
WHERE p.LastName = @lastname;

We see here that the generated SQL query reads less information than the one from solution 2: p.BusinessEntityID is no longer in the select list. This query is closest to what one would write manually to get the email addresses.

Recap of the email example

The first code example was using a nested foreach loop, in which a navigation property was accessed, and resulted in a separate query being executed for each matching entry from the first table. This is inefficient and does not scale: for large result sets, the performance will really go down the drain.

Solution 1 showed improvement: using the method Include(“PropertyName”), we can inform Entity Framework that we will need information from another table and will generate a single query that will perform a JOIN. This alternative improves scalability. Unfortunately, Entity Framework will read all the columns of both tables and we have no way to tell it to do a projection.

In solution 2, we made up for the lack of projection from the previous solution by using LINQ syntax to explicitly join the tables and project the columns that we need. That way, we avoid read data that we do not need. This solution scales well and the amount of data we retrieve is reduced to almost the minimum. It is however noteworthy to mention that the primary key used to join the table is included in the select list, which is not strictly needed.

Solution 3 uses a LINQ method chain with SelectMany() and is the most efficient solution: improving upon solution 2, a join is performed on the tables, and this time only the necessary columns are included in the select list: the primary key is only used in the JOIN.

Depending on the amount of data and the performance, either of solution 1 to 3 can be used. If you know the size of the result set will always remain small, solution 1 is the easiest to implement. When performance is needed, either solution 2 or 3 can be used, depending on your project’s preference in terms of code style and readability. If performance is critical, solution 3 will be slightly better since it saves a read on the primary key column.

Conclusion

In this post, we saw how to prevent performance problems with Entity Framework by:

  • using projection to avoid reading data we don’t use
  • applying set-based thinking to define how to get the data from the database
  • recognising and fixing the nested LINQ smell
  • using techniques to generate more efficient queries:
    1. Include() method
    2. Explicit join with LINQ
    3. LINQ method chain with SelectMany()