Filtered Statistics

Facebooktwitterredditpinterestlinkedinmail

Filtered Statistics

For me, tuning the Microsoft Data Platform and SQL Server is a simple principle, make the engine do as little work as possible. For the engine and the optimizer to be able to do so, we need to feed it the appropriate information.

The SQL Server Cardinality Estimator(CE) does it’s best to analyze our data, structures and map the best available plan to retrieve the information that we need. To be able to do so, it is helpful for us to give the engine a guiding hand.

Execution plans for a procedure or ad-hoc query are created on the first time running the query. This means that we are at the mercy of the first process running the query. When this happens, the optimizer will analyze many bits of information, including the statistical distribution of the indexes. SQL Server will create the execution plan based on the parameters passed after the creation of the procedure. Based on the statistics, this plan could be sub-optimal.

There are MANY different methodologies to possibly implement for to give SQL Server a nudge, but one of them is Filtered Statistics. This is when we can tell SQL Server to generate statistics on a set of information. For example, we can compare Nashville Predator playoff statistics to Pittsburgh Penguin Playoff statistics. This will cause the CE to choose something to appease both.

While the CE is rather “intelligent”, we still need to help it along. We can give the optimizer a nudge in the right direction by creating Filtered Statistics. Filtering the statistics, with a where clause, will give SQL Server a subset of rows that gives the optimizer that nudge.

Create table objects

Below we will create the base table objects. In this exercise, we are tracking employee complaints.

/**************************************************************************************
** drop and create tables
**************************************************************************************/
        USE     AdventureWorks2012
        GO
        IF      OBJECT_ID('Employee') IS NOT NULL
                DROP TABLE Employee
        GO
        IF      OBJECT_ID('Complaints') IS NOT NULL
                DROP TABLE Complaints
        GO
        CREATE  TABLE Employee
                (
                EmpID           INT             ,
                name            NVARCHAR(100)
                )
        GO
        CREATE  TABLE Complaints
                (
                EmpID           INT             ,
                ComplaintsCount INT             ,
                RecordDate      DATETIME
                )
        GO
        CREATE CLUSTERED INDEX cidx_EmpID ON Employee(EmpID) with(data_compression = page)
        GO
        CREATE INDEX ix_Employee_name ON Employee(name) with(data_compression = page)
        GO
        CREATE STATISTICS ix_Employee_EmpID_name ON Employee(EmpID, name)
        GO
        CREATE CLUSTERED INDEX ix_Complaints_id_amount ON Complaints(EmpID, ComplaintsCount) with(data_compression = page)
        GO

Populate Data

Now, let’s populate the complaints. FlipFlops is a stellar employee, Sunshine should be fired and Wolf is on thin ice.

/**************************************************************************************
** Populate tables
**************************************************************************************/

        set     statistics time, io off

        INSERT  into Employee(EmpID, Name)
        VALUES  (1, 'FlipFlops')

        INSERT  into Employee(EmpID, Name)
        VALUES  (2, 'Sunshine')

        INSERT  into Employee(EmpID, Name)
        VALUES  (3, 'Wolf')
        GO
        SET NOCOUNT ON
        -- few complaints
        INSERT  Complaints VALUES(1, 100,GETDATE())

        -- He should be fired amount of complaints
        DECLARE @loop INT
        SET     @loop = 1
        WHILE   @loop <= 500000
                BEGIN
                        INSERT  Complaints(EmpID, ComplaintsCount, RecordDate)
                        VALUES  (2, @loop, GETDATE()-RAND()*1000)
                
                        SET     @loop = @loop + 1
                END
        GO
        -- He's on thin ice
        DECLARE @loop INT
        SET     @loop = 1
        WHILE   @loop <= 70000
                BEGIN
                        INSERT  Complaints(EmpID, ComplaintsCount, RecordDate)
                        VALUES  (3, @loop, GETDATE()-RAND()*1000)
                
                        SET     @loop = @loop + 1
                END
        GO

        UPDATE STATISTICS Employee WITH fullscan
        GO
        UPDATE STATISTICS Complaints WITH fullscan
        GO

SentryOne Plan Explorer

Upon writing this blog, I am not an employee of SentryOne nor do I endorse them in any official capacity. However, I will use SentryOne Plan Explorer over SSMS builtin any chance that I get. Therefore, the comparison images in the remainder of the document come from SentryOne. You’re Welcome.

Query data with NO filtered statistics


First, let’s look at FlipFlops with no filter.

        declare @name   nvarchar(100) = 'FlipFlops'

        SELECT  ComplaintsCount
        FROM    employee e
                inner join Complaints s on e.empid = s.empid
        WHERE   e.name = @name
        option  (recompile)

In the image below, we see that although only ONE result was returned, CE estimated 166,900 records. Not quite fair to guess that he has caused so much trouble when he has not.

FlipFlops No Filter


Now let’s see Sunshine. Remember he had 500,000 complaints! It must be his cheery misdemeanor.

        declare @name   nvarchar(100) = 'Sunshine'

        SELECT  ComplaintsCount
        FROM    employee e
                inner join Complaints s on e.empid = s.empid
        WHERE   e.name = @name
        option  (recompile)

When we look below, although he had 500,000 complaints, only 166,900 records were estimated for return.

Sunshine No Filter


Finally, let’s checkout Wolf. While he could be worse, he is on thin ice. He has 700 complaints.

        declare @name   nvarchar(100) = 'Wolf'

        SELECT  ComplaintsCount
        FROM    employee e
                inner join Complaints s on e.empid = s.empid
        WHERE   e.name = @name
        option  (recompile)

Wolf only had 700 complaints, but 166,900 records were estimated for return. He is looking much worse than reality shows.

Wolf No Filter

So, what is happening is that there are 3 possible employee results for complaints. It is rather simple. CE is taking the total amount of records(500,701) and dividing by 3 assuming that all 3 will have roughly the same amount of records. We see that along with the estimated number of records being the same, the execution plan operators are the same. For such a variation in amount of records, there must be a better way.

Implementing Filtered Statistics

This is where the “magic” happens. It really is not that complicated. Create Statistics on a column and have a where clause. However, there are some rules:

  • cannot reference a computed column
  • cannot reference a UDT column
  • cannot reference a spatial data type column
  • cannot reference a hierarchyID data type column
  • use “Null” and “is not null” rather than NULL literals
        CREATE STATISTICS Employee_stats_EmpID ON Employee (EmpID)
        WHERE name = 'FlipFlops'
        GO
        CREATE STATISTICS  Employee_stats_EmpID2 ON Employee (EmpID)
        WHERE name = 'Sunshine'
        GO
        CREATE STATISTICS  Employee_stats_EmpID3 ON Employee (EmpID)
        WHERE name = 'Wolf'
        GO

        UPDATE STATISTICS Employee WITH fullscan
        GO
        UPDATE STATISTICS Complaints WITH fullscan
        GO

Results with Filtered Statistics


Let’s see FlipFlops to start

        declare @name   nvarchar(100) = 'FlipFlops'

        SELECT  ComplaintsCount
        FROM    employee e
                inner join Complaints s on e.empid = s.empid
        WHERE   e.name = @name
        option  (recompile)

Now we see that he has only one complaint estimated. We knew that he was alright.

FlipFlops No Filter


Next Sunshine. Let’s see if the estimate is as high as the truth.

        declare @name   nvarchar(100) = 'Sunshine'

        SELECT  ComplaintsCount
        FROM    employee e
                inner join Complaints s on e.empid = s.empid
        WHERE   e.name = @name
        option  (recompile)

Again, this is what we would expect. The CE is estimating much closer to the reality. Also, take note of the join operator. For all of the other execution plans it is a nested loop. For this call, you will see the merge join operator. To this point, the execution plans have been using the nested loop operator, which is essentially a “for/next” loop. A merge operator brings together 2 one-to-many or many-to-many record-sets that are similar in size and order. Think of it like meticulously shuffling two ordered decks of cards. In this situation, this is a more efficient plan estimated by the CE.

The optimizer must know Sunshine personally.

Sunshine No Filter


Finally, Wolf again.

        declare @name   nvarchar(100) = 'Wolf'

        SELECT  ComplaintsCount
        FROM    employee e
                inner join Complaints s on e.empid = s.empid
        WHERE   e.name = @name
        option  (recompile)

700 complaints and the CE guessed 700. He might want to watch out.

Wolf No Filter



Conclusion:

The SQL Server Cardinality Estimator does it’s best but is not perfect. Sometimes it needs a bit of a push. This comes with an understanding of the underlying data. If our data has a rather even distribution, the Filtered Statistics are not likely necessary. However, if we have data that covers many peaks and valleys, each varying, then Filtering Statistics may be a viable answer.

Play with them and leave comments and let me know what you find.

Facebooktwitterredditpinterestlinkedinmail

2 thoughts on “Filtered Statistics

Leave a Reply

Your email address will not be published. Required fields are marked *