SQL Server 2010/2012
In the MPP approach we distribute data over multiple servers, so that a complex query can be divided into several queries that need to handle less data and that run in parallel. This gives the speed for the MPP solution. Each server processes a certain part of the query independently from all the other servers. We call this a shared-nothing infrastructure (see my former post for details).
A similar concept are the distributed partitioned views in the SMP SQL Server. A partitioned view is simply a union over tables of the same layout, usually created in the way of the following example:
CREATE VIEW Sales AS
SELECT * FROM Sales2013
UNION ALL SELECT * FROM Sales2012
UNION ALL SELECT * FROM Sales2011
UNION ALL SELECT * FROM Sales2010
In this example we distributed the sales data to 4 tables (sales by year).
A partitioned views is updatable if
- Each table has a check-constraint on the partitioning column
- The partitioning column is part of the primary key
- Some more requirements (same primary key, same ansi padding, no indexes on computed columns, see http://msdn.microsoft.com/en-us/library/ms188299(v=sql.105).aspx)
If these two conditions are not met, you could still create an instead-of trigger to perform the update, but since we want to use this method to improve performance, the instead-of trigger might not be a good idea.
The check-constraint on the underlying tables could be something like this:
CREATE TABLE Sales2013 (
OrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NULL CHECK (DATEPART(yy, OrderDate) = 2013),
OrderYear INT CHECK (OrderYear = 2013),
CONSTRAINT OrderIDYear PRIMARY KEY(OrderID, OrderYear)
Up till now, this has nothing to do with an MPP approach. But now we will look at a distributed partitioned view. A partitioned view is called distributed if the underlying tables sit on different servers. This can be achieved by using linked servers. So, imagine we have the four tables above on four servers (Server1 to Server4). Then the partitioned view may look like this:
CREATE VIEW Sales AS
SELECT * FROM Server1.MyDatabase.dbo.Sales2013
UNION ALL SELECT * FROM Server2.MyDatabase.dbo.Sales2012
UNION ALL SELECT * FROM Server3.MyDatabase.dbo.Sales2011
UNION ALL SELECT * FROM Server4.MyDatabase.dbo.Sales2010
In this situation, queries to the Sales view can be distributed to the four servers with each server solving a part of the query.
Having a distribution over years is actually a bad idea, unless the typical query load is to compare full years. Usually we want to choose a partition key that guaranties a more equal distribution over the servers. One option could be the OrderID, for example OrderID % 4 =0 for server 1, Order % 4 = 1 for server 2 and so on.
Now, with this we’re getting pretty close to what the PDW does, don’t we? Well, let’s look at the difference in detail:
distributed partioned view
|Maintenance||Rather difficult. Linked server must be created, manual definition of the constraints. Each distribution table has to be created separately. Changes have to be kept consistent over all distributions manually.||Easy. Only one create-/alter-table statement with distribution option. The PDW takes care of all the remaining tasks.|
|Scalability||Adding another servers makes it necessary to manually redistribute the data and to adjust the distribution key (table definitions). Constraints must be dropped and recreated in order to do this.||Easy. After setting up a new scale unit, the PDW can automatically redistribute the table contents to the available nodes.|
|High availability||If one of the linked servers fails, queries will fail if they touch that distribution. It’s difficult to implement high availability in this scenario. One option could be to double the amount of servers, so that each servers has it’s standby server which replicates the data. In case of an error, the distributed partitioned table definition has to be modified in order to switch. But this would be a very inefficient way to implement HA (twice the amount of servers). Operation System Clustering could be an option but all this has to be configured manually.||Out of the box. The PDW automatically switches to an other compute node if one fails and can recreate the failing node using a Hyper-V disk image.
No single point of failure.
|Hash key||Distribution has to be defined in the table definitions. All underlying tables have to be modified if there are changes to the distribution key. Also the data has to be re-shuffled manually.||Distribution key only needs to be managed on the control node. Redistribution (based on other key) is easy using a CTAS statement.|
|Statistics||Statistics are very important for query performance. For distributed partitions views each server maintains the statistics of its distribution, but there is no complete statistics for the entire distributed view. Therefore the server executing the query doesn’t know about the statistics and cannot optimize the queries,||Statistics are very important for query performance. Statistics are automatically created on the compute nodes and transferred to the control node using the create/alter statistics statement. Since the query plan is created on the control node, the query optimizer can access a statistic for the table to build an optimal query plan.|
|Partitioning||If the underlying tables need to be partitioned using table-partitioning, the partitioning (including switch operations, partition merges etc.) have to be done per server.||Partitions of all distributions are managed centrally on the control node. There is no need to execute statements on the compute nodes.|
|Queries with star joins||Distributed partitioned tables can only leverage multiple servers, if the distributed query can be executed on each server separately. A simple star join breaks this rule unless you rewrite it from
inner join DimDate on FactSales.DateKey=DimDate.DateKey
But this means, you have to modify almost every query to leverage the distributed partitioned view.
|With the concept of replicated table, dimension are usually available on each compute node. The query optimizer fully leverages this, so the query can be distributed with no need to rewrite it.
In fact, most queries can be parallelized without any need to modify the query.
|Data load||Inserts to the view are slow. Loading a massive amount of data will be fastest with bulk inserts on each server. However, this makes it necessary to spit the upload files along the distribution key before doing the bulk upload and to know about the servers which host the distributions. In case of modifications (for example, if another server is added) the loading routines have to be adjusted.||Bulk loads are automatically distributed in parallel to all the compute nodes, without the need to split the uploads files beforehand and without the need to know about the distribution key at all.|
So, while distributed partitioned tables sound to be promising, comparing them to the PDW infrastructure shows a lot of disadvantages and required additional tasks to make them work. The PDW handles all this in a fully automatic way, thus giving high performance, scalability and high availability in an easy to maintain environment. If you try to mimic this functionality with distributed partitioned tables you will most likely end up with a lot of work and a very hard to maintain system.
So, to get back to the title of this blog post, we’ll have to say
Distributed Partitioned Views: MPP light? No, not at all. PDW offers so much more in maintenance and usability that distributed partitioned views shouldn’t be considered as a replacement for an MPP infrastructure.