PDW v1 | PDW 2012 | SQL Server 2008-2012
When using the Parallel Data Warehouse (PDW) as a data source for SSIS (Integration Services) packages, some pitfalls have to be avoided. This post is about one of them that I recently came across.
But first some background information about the query slots on the PDW: PDW v1 uses 32 query slots for queries targeting user tables (more for system tables), and each running query takes exactly one of those slots. This guarantees that no single query can take too many resources of the appliance, which means that even under heavy query load the appliance responds on the remaining query slots as usual. This is one of the reasons why the PDW consistently answers queries with high performance, even when the appliance is under heavy workload. If all 32 query slots are in use, newly arriving queries are suspended, up to a certain limit depending on the version of the PDW.
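If you want to observe this queueing behavior yourself, the appliance exposes the current requests through a DMV. A minimal check could look like this (a sketch, assuming you have permission to query the sys.dm_pdw_exec_requests view):

```sql
-- List active requests on the appliance; queued queries show up
-- with status 'Suspended' until a query slot becomes available.
SELECT request_id, [status], command
FROM sys.dm_pdw_exec_requests
WHERE [status] IN ('Running', 'Suspended');
```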
However, in some cases we want to assign more resources to a query, for example for large data processing operations. For this reason, PDW 2012 introduced the concept of resource classes. Here are the four resource classes that exist on PDW 2012:
| Name | Query slots | Max. memory |
|------|-------------|-------------|
| smallrc (default) | 1 | 0.4 GB |
| mediumrc | 3 | |
| largerc | 7 | |
| xlargerc | 22 | |
The total of 32 query slots still applies to PDW 2012, so, to give some examples, PDW 2012 can simultaneously execute
- 32 queries in resource class smallrc
- 3 queries in resource class largerc (taking 21 slots) and 11 queries in resource class smallrc
- 1 query in resource class xlargerc, one in resource class largerc, and one in resource class mediumrc (22 + 7 + 3 = 32 slots)

and so on.
The resource classes are assigned to the login. By default, newly created logins are in the smallrc resource class.
This approach for handling queries on the PDW works perfectly for the usual workload. Even if queries are suspended because all 32 query slots are currently in use, they will run whenever a query slot becomes available again. However, if the client application opens more than one connection at a time and keeps them open, you have to be careful with the resource classes.
Here is a simple example using SQL Server Integration Services. My data flow consists of just two independent data sources with no further processing. I used the FactInternetSales table from AdventureWorksPDW2012, so the number of rows is quite small. As expected, the package runs perfectly well:
Now, let’s assume we want the package to execute faster and therefore we assign a better resource class. Let’s try the largerc class here:
alter server role largerc add member testuser;
The members of the resource classes can be listed using this query taken from the documentation (chm-file, search for meta data queries):
SELECT l.name AS [member], r.name AS [server role]
FROM sys.server_role_members AS rm
JOIN sys.server_principals AS l
    ON l.principal_id = rm.member_principal_id
JOIN sys.server_principals AS r
    ON r.principal_id = rm.role_principal_id
WHERE l.[type] = 'S'
  AND r.[type] = 'R'
  AND r.[name] IN ('mediumrc', 'largerc', 'xlargerc');
Here’s the result:
The package still runs perfectly fine. Note that the two queries in resource class largerc take 2 x 7 = 14 query slots.
Now let’s try the xlargerc resource class by adding the user to this role:
alter server role largerc drop member testuser;
alter server role xlargerc add member testuser;
The query from above shows the user now being member of the xlargerc server role:
Now, when executing the package in SQL Server Data Tools, you will notice that nothing happens. The package passes validation, but no data is transferred:
Checking the SSIS progress reveals that the SSIS package stalls in the pre-execute phase of the data flow.
SSIS tries to open two queries with resource class xlargerc while the PDW allows only one (as 2 x 22 is greater than 32). Therefore the second query is suspended, waiting for the first to finish. But this will never happen. You can clearly see this using the following meta data query:
SELECT request_id, concurrency_slots_used AS [Slots Needed], resource_class AS [Resource Class]
FROM sys.dm_pdw_resource_waits;
Here, query QID15444 is the meta data query itself. So there are two queries that need resource class xlargerc: QID15433 and QID15442. You can also see them in the admin console:
Looking at the query details of one of the two queries shows that the query is in resource class xlargerc and consuming 22 slots as shown in the following screenshot:
Again, moving the user to a lower resource class (smallrc, mediumrc, or largerc) would solve the issue here.
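The statements for that follow the same pattern as above; for example, moving the test login down to largerc, so that two concurrent queries take only 2 x 7 = 14 of the 32 slots:

```sql
-- Move the login out of xlargerc and into largerc so that the two
-- queries of the data flow can run concurrently again:
alter server role xlargerc drop member testuser;
alter server role largerc add member testuser;
```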
In general, all connections within a single data flow of an SSIS package are opened together during the pre-execute phase of the data flow. For example, the same behavior would occur with an OLEDB source and a lookup:
Again, the package stalls during the pre-execute phase:
So, be careful when assigning higher resource classes to logins used by client tools that need multiple queries returning results simultaneously. To be clear, it is not a problem if multiple queries are executed in parallel, as long as the client tool does not stop reading from one query when the other does not return data (yet). In my example, the reason for the ETL package to stall was that SSIS expects both queries to return their result sets at the same time: if only one of the result sets is suspended, the other remains open and SSIS continues waiting. So, for client tools like SSIS that work with the data in a data flow pipeline, such dependencies can lead to deadlock situations if no more query slots are free.
As long as each data flow of the package opens only the number of connections that are allowed by the specific resource class, the data flow executes fine. For example, here are 16 data flows each using a single OLEDB connection:
The reason why this works is that each data flow performs the pre-execute phase independently. When setting the user to resource class xlargerc (consuming 22 slots per query), you can see how the data flows run sequentially (taking about 12 seconds in my example). With the default resource class smallrc, the queries run in parallel (taking about 2.5 seconds in my example). The following video shows the execution for both resource classes:
Be aware, though, that this situation can also occur with resource class smallrc and 32 query slots: ETL packages performing complex operations with many merge joins and lookups could also reach the query-slot limit and stall because of a deadlock within the ETL package. Usually, these packages would run perfectly fine and with good performance on a SQL Server SMP machine but are throttled once the data source is moved to a PDW. So you might also see this situation on a PDW v1 if you're working with more complex ETL packages. The only solution is to re-engineer the package, taking some of the following ideas into account:
- Avoid operations where many data connections are used in a single data flow, for example by combining them into a single query (doing the join/process operation on the PDW)
- Try splitting single data flows with many connections to separate data sources with fewer connections.
- In general, reduce the number of open queries that need to read in parallel if the output streams depend on each other, for example by intermediate caching of the data (using the cache destination)
- Use the smallrc resource class on PDW 2012 for the ETL accounts if you're experiencing SSIS deadlocks.
- In general, re-design the data integration processes to move more workload from ETL to the Data Warehouse Appliance (consider using ELT instead of ETL).
- Also be aware of packages that are executed in parallel. Having multiple packages running simultaneously may reduce the available query slots for each package significantly.
- And finally: Create a style guide with rules for how many connections each package is allowed to open in parallel, and document the number of connections required per package.
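To illustrate the first idea from the list above: instead of reading FactInternetSales through one OLEDB source and resolving product attributes through a second source or an SSIS lookup, the join could be pushed down to the appliance in a single source query. Here is a sketch against the AdventureWorksPDW2012 sample tables (column names as in the sample database):

```sql
-- One source query (one query slot) instead of two open queries
-- plus a lookup inside the SSIS data flow:
SELECT s.SalesOrderNumber, s.OrderDateKey, s.SalesAmount,
       p.EnglishProductName
FROM dbo.FactInternetSales AS s
JOIN dbo.DimProduct AS p
    ON p.ProductKey = s.ProductKey;
```

This way the data flow keeps only one connection open, and the join work is done on the appliance (the ELT idea mentioned above).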
When reading this list of options, it becomes clear that migrating an existing data warehouse project, with maybe hundreds of ETL packages, from an SMP environment to an MPP appliance is a challenge, as most of these ideas contradict the design principles of SMP warehouses. For example, in an SMP environment, putting workload into the ETL pipeline is often more efficient than doing the job in the database. And ETL packages are easier to maintain compared to SQL scripts.