PDW v2 | Big Data

In my former post about Big Data, I used a “definition” which can be abbreviated as

“data that is too big for analysis within the required time”

The key aspects of this phrase are:

- size of data
- time frame for the analysis
- complexity of analysis

The time frame can be real time, near time, a few hours or maybe even days. This depends on the business requirement. The size of data may get bigger than expected because you need additional data sources (for example external data from market places) for your analysis. But today I’d like to focus on the third bullet point: the complexity of analysis.

If you don’t have complex analysis requirements and if you have plenty of time, you can process terabytes of data without any big data issues. Remember that storing a huge amount of data is not the big problem. But retrieving the data and doing analysis on this data is much more challenging.

But what are complex analytical computations? In SQL we can do a lot of computations by aggregating detailed values (sum, average, min, max etc.). And for many of the typical business performance indicators, this works quite well. But what about the following tasks:

- Frequency analysis and decompositions (Fourier-/Cosine-/Wavelet transformation) for example for forecasting or decomposition of time series
- Machine learning and data mining, for example k-means clustering, decision trees, classification, feature selection
- Multivariate analysis, correlation
- Projections, prediction, future prospects
- Statistical tests (for example chi-squared or binomial)
- Trend calculations, predictions and probability for certain trends or scenarios
- Complex models involving simulations (for example Monte Carlo simulation for risk analysis)
- binomial, normal or other types of distributions and density functions
- …

For example, a decomposition of a time series into its main components may look like this:

*(Source: R, temperature in Nottingham taken from the datasets library)*

Decomposing time series can helpful to analyze periodicity and trends of sales data for example. This could be important for calculating the effect of promotions or to understand seasonal effects.

And this is just one example. As long as you can only slice and dice on your existing data, you’re always looking at the past. But in order to derive ideas and guidance for future decisions, higher sophisticated methods are required than just sum/group by. Some people even say, that this is where Business Intelligence starts. Everything else is just an analysis of the past which is also important, but there is so much more to find. The current discussion about data scientists clearly shows the rising demand for getting more out of your data. And to be honest, having a data scientist working just with a tool like Excel is like having Dr. House using just a folding rule as medical instrument instead of all the sophisticated laboratory instruments and equipment…it doesn’t work.

So, there are a lot of calculations that go far beyond the capabilities of traditional SQL. Therefore, we usually need to load the data from our data warehouse into some kind of analytical or statistical tool which is specialized such calculations. The results can then be transferred back into the relational database tables. As the focus of such tools differs from the focus of databases, these tools are usually separated from the database but offer interfaces (for example ODBC or flat file) to load data. Common tools are R, SAS, MatLab, just to name a few of them. R (http://cran.r-project.org), for example, is a toolset for doing advanced calculations and research level statistical computing. R is open source and can easily be extended using packages (libraries). Today, a huge amount of such packages exists for all kinds of different tasks.

However, when it comes to Big Data, the process of unloading all the required data can be very time consuming. So for Big Data analytics it’s important to bring both worlds together. This would be the perfect match. For doing so, the following two options are most promising:

- Using Hadoop (Map/Reduce)
- Using In-Database Analytics

**Hadoop**

PDW v2 offers a seamless integration with Hadoop using Polybase. This makes it easy and fast to export data on a Hadoop infrastructure. Research level analytics can then be performed on the Hadoop cluster. For this purpose, R supports distributed analysis and Map/Reduce jobs using the HadoopStreaming library. But we’re still copying the data out to the analytical environment, right? Yes, but in this scenario, each infrastructure is used in an optimal way:

- PDW for high-performance SQL queries to combine and arrange the data in the format needed for analytical engines (more like a NoSQL format, for example to prepare variables for data mining).
- Hadoop for distributed parallel computing tasks using Map/Reduce jobs
- High performance (massive parallel) data transfer between the MPP (PDW) and Hadoop.
- Transparent access of the analytical results using SQL (seamless integration of relational and non-relational data with Polybase)

Preparing the data for analytics can be a complex and challenging process. Usually data from multiple tables needs to be joined and filtered. Using SQL is the best choice for this task. For example, for preparing call center data for a mining model, it may be necessary to create variables (single row of data) that contain the number of complaints per week over the last weeks. This can then be used to build a decision tree. In SQL, this task is easy and in an MPP environment, we get the best performance for this task. For the decision tree we need to perform a feature selection at each node of the tree. This involves statistical functions and correlation which reach far beyond SQL. Using the analytical environment is the best choice for such advanced calculations. The resulting decision tree (rules, lift chart, support probabilities etc.) can then be stored as a file on the Hadoop cluster and from there being queried or imported back into relational database tables using Polybase.

**In-Database Analytics**

Another approach is to operate the analytical engine on the same platform and on the same data as the MPP database system. This approach ties both worlds together in a very consistent way but it’s currently not available on the PDW (although it is on my personal wishing list). However, in other MPP environments, this approach is not uncommon. For example, in SAP HANA you can write stored procedures in R just like this

CREATE PROCEDURE myCalc (…)

**LANGUAGE RLANG AS**

BEGIN

…

END;

The function body is then standard R code using R syntax, not SQL.

Typical features for In-Database Analytics include:

- Analytical stored procedures
- In-database analytics: direct access to database tables and views from the analytical engine without needing to load/unload the data
- Tables/Views as parameters for the analytical functions (for example R data frames)
- Full utilization of in-memory capabilities
- Full utilization of the parallel query engine

**Conclusion**: In order to perform sophisticated analysis based on your BI data, SQL is not sufficient. Specialized toolsets like R are the the best solution. However, when it comes to Big Data, loading/unloading the data into these toolset may not be efficient anymore. A closer integration is necessary. Using Hadoop or In-Database Analytics are promising approaches for this scenario.