For large tables we usually look at tuning options like creating an index or having a good partitioning strategy in place. For the Parallel Data Warehouse (PDW), additional decisions have to be made about the table layout.
Distributed or replicated?
The first decision is about the way the table is stored on the compute nodes. There are two options:
All data of the table is available on all compute nodes
The data of the table is distributed between the compute nodes
Distributing table data between the compute nodes follows the real nature of the MPP system. In PDW this distribution is established using a hash function on a specific table column, which is referred to as the distribution key. On the other hand, replicated tables have their full content available on every compute node.
Creating a table in one of the two modes is quite easy:
Distributed table:

CREATE TABLE MyTable (
    CustomerKey INT NOT NULL,
    SalesAmount MONEY
)
WITH (DISTRIBUTION = HASH(CustomerKey));

Replicated table:

CREATE TABLE MyTable (
    CustomerKey INT NOT NULL,
    CustomerName NVARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);

(The column lists here are just for illustration; the WITH clause is what makes the table distributed or replicated.)
But when do we choose a distributed or replicated table?
As a rule of thumb, you will want to create tables which contain reference data, or, as we say in the data warehouse environment, dimensions, as replicated tables if they are not too big. The reason is simple. The typical data warehouse query is a star join between the fact and the dimension tables, with where-conditions on columns of the dimension tables, grouping (group by) on columns of the dimension tables, and aggregations based on columns of the fact tables. So, if we distribute the large fact tables in order to leverage the full power of the MPP engine, having the dimensions on each compute node allows each compute node to answer the query without needing data from other compute nodes. Let’s take a look at the following query based on a customer dimension which is linked to a sales table:
select Sum(SalesAmount) from FactSales
inner join DimCustomer on FactSales.CustomerKey = DimCustomer.CustomerKey
where DimCustomer.Region = 'EMEA'  -- Region column assumed for illustration
No matter on which key the FactSales table is distributed, having the DimCustomer table replicated means that each compute node can individually compute the sum of sales for the customers in the EMEA region. There still has to be a final aggregation of the results coming from each compute node, but in this case this is just one row per compute node.
Also consider that reads and writes are much faster on distributed tables (parallel processing) than on replicated tables. This is one reason why replicated tables should only be used for smaller amounts of data.
Choosing a good distribution key
The following aspects are important when choosing a distribution key:
- What kind of workload do we have? (Do we usually see lots of “atomic” reads returning only a few rows, or do we rather expect large scans and aggregations on the table?)
- What are the typically performed joins among the tables?
- What are the typically performed aggregations (group by) on the tables?
- How is the distribution key itself distributed?
Choosing a distribution key whose values are unequally distributed will result in skew. The different distributions of the table should contain roughly the same number of rows in order to achieve good parallelization of queries over the compute nodes. If most of the data sits on one compute node because of a bad distribution key, this node becomes the bottleneck and you cannot expect good performance.
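A quick way to spot skew on a PDW appliance is DBCC PDW_SHOWSPACEUSED, which reports the row count and space used per distribution (the table name FactSales is just an example):

```sql
-- Reports rows and space used for each distribution of the table.
-- A heavily uneven row count across the distributions indicates skew.
DBCC PDW_SHOWSPACEUSED ("dbo.FactSales");
```

If one distribution holds a disproportionate share of the rows, it is worth considering a redistribution on a different key.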
I’ll get back to the challenge of finding a good distribution key in later posts.
Please keep in mind that for both decisions, distributed vs. replicated and the choice of the distribution key, you don’t have to make a decision that lasts forever. In fact, it’s quite unlikely that you come up with the best solution at the very beginning. It’s quite easy to redistribute a table based on another distribution key or to turn a distributed table into a replicated one. For both scenarios, CTAS can be used. CTAS stands for Create Table As Select and is quite similar to the SELECT INTO syntax on the SMP SQL Server. For example, if you want to change the distribution key of a table Sales, you could follow these steps:
- CTAS Sales to SalesNEW having the new distribution key
- rename Sales to SalesBAK
- rename SalesNEW to Sales
- drop SalesBAK
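In PDW SQL, the steps above look roughly like this (a sketch; the new distribution key CustomerKey is just an example):

```sql
-- 1. CTAS Sales into SalesNEW with the new distribution key
CREATE TABLE SalesNEW
WITH (DISTRIBUTION = HASH(CustomerKey))
AS SELECT * FROM Sales;

-- 2./3. Swap the tables by renaming
RENAME OBJECT Sales TO SalesBAK;
RENAME OBJECT SalesNEW TO Sales;

-- 4. Drop the backup once everything checks out
DROP TABLE SalesBAK;
```

Keeping the old table around as SalesBAK until the new one has been verified makes the switch easy to roll back.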