It’s no secret that I am not an ORM fan. Actually that’s a little harsh, ORMs bring some significant advantages and it’s not just having an object orientated view of a database – a lot of junior developers just don’t know SQL and they can save a massive amount of development time. They also discourage people from infecting the code-base with SQL which can prove a real maintenance headache.
So in many ways ORMs are a good thing.
Unfortunately using an ORM can result in serious database abuse. Not so long ago I profiled some of the LINQ-to-SQL that one of our products uses and I nearly had kittens (some of that’s mentioned in this article). It was making thousands of unnecessary round trips to SQL Server to pick up individual records when one query could have got everything we needed.
A project that my team have just inherited demonstrates a different kind of database abuse rather neatly. There’s a web service that basically just provides a view of the DB. One of the methods does this…
return dc.Streets .OrderBy(s => s.Added).ThenByDescending(p => p.GUID.ToUpper()) .Skip(start) .Take(count) .ToArray();
Which is fine, apart from the fact that it transpires that it’s perfectly legitimate to have duplicate keys in the database (it’s not our DB). The Web Service is consumed by our remote client and we’d rather not be sending that any duplicates – so I modified the LINQ to eliminate them.
The situation is slightly more complex than just using .Distinct() because in the case of duplicates what we actually want is the later record, according to the Added datetime field.
return dc.Streets .GroupBy(key => key.GUID.ToUpper()) .Select(g => g.OrderByDescending(x => x.Added).FirstOrDefault()) .OrderBy(s => s.Added).ThenByDescending(p => p.GUID.ToUpper()) .Skip(start) .Take(count) .ToArray();
The .GroupBy groups the records by the key, then the following select only returns one record, the one with the highest “Added”. The result of this is an IEnumberable<Street> so the .OrderBy etc. will work just fine.
It doesn’t work though, it throws a SQLException because the SQL that the Entity Framework generates contains the “Added” column in the order by clause twice. So I thought I’d profile the SQL Server an see exactly what it was trying to do…
SELECT TOP (500) [t4].[test], [t4].[GUID], [t4].[ParishGUID], [t4].[Name], [t4].[Active], [t4].[Added], [t4].[Created] FROM ( SELECT [t1].[value] FROM ( SELECT UPPER([t0].[GUID]) AS [value] FROM [dbo].[Street] AS [t0] ) AS [t1] GROUP BY [t1].[value] ) AS [t2] OUTER APPLY ( SELECT TOP (1) 1 AS [test], [t3].[GUID], [t3].[ParishGUID], [t3].[Name], [t3].[Active], [t3].[Added], [t3].[Created] FROM [dbo].[Street] AS [t3] WHERE (([t2].[value] IS NULL) AND (UPPER([t3].[GUID]) IS NULL)) OR (([t2].[value] IS NOT NULL) AND (UPPER([t3].[GUID]) IS NOT NULL) AND ([t2].[value] = UPPER([t3].[GUID]))) ORDER BY [t3].[Added] DESC ) AS [t4] ORDER BY [t4].[Added], UPPER([t4].[GUID]) DESC, [t4].[Added] DESC
Holy Truffle Oil Batman! That’s bonkers! What makes matters worse is that even if I correct the SQL the query times out and there are only 30,000 records in the DB. Because it’s bonkers.
So I started to think about how I’d actually do it in SQL. This is my first attempt, now I’m not saying it’s optimal. I think there may be ways I could improve it but it works and with 30,000 records it returns pretty much instantly.
SELECT [GUID] ,[ParishGUID] ,[Name] ,[Active] ,[Added] ,[Created] FROM (SELECT *, ROW_NUMBER() OVER(ORDER BY [GUID]) RN FROM (SELECT * ,RANK() OVER (PARTITION BY [GUID] ORDER BY Added,NEWID()) RK FROM dbo.Street WHERE @takeAll = 1 OR Added >= @added) DEDUPE WHERE RK=1) CHUNK WHERE RN > @start and RN <= @start+@take
The Entity Framework is kind enough to provide a generic ExecuteQuery<T> method on the DataContext so I’ll be writing a new method in the DAL effectively to override Entity Framework’s toxic SQL generation in this case. I suspect my new method won’t be lonely for long.