Power Query Preview| Hadoop
In order to make the common task of data acquisition and preparation available to end users, Microsoft developed a very powerful enhancement for Excel: Power Query. Actually Power Query is part of the new Power BI series of tools and there is a lot of information out there on the web showing what things you can do with Power Query and the other Power BI tools. Currently, a preview release of Power Query is available as an Excel addin for testing.
Some cool things you can do with Power Query
- Load data from different sources including relational databases, csv files, Excel files, web pages, OData feeds, Hadoop, even from Active Directory, SharePoint lists or from Facebook.
For example, you can load all files in a given directory into a single Power Pivot or Excel table
- Perform transformation rules on the columns, for example split, format, replace, group by, unpivot – using a comprehensive GUI (no coding or writing of formulas needed)
- Write complex formulas, for example you can use the Table.FillDown function to fill null values in the table with values from the last preceding row that is not null. There is a large number of powerful functions for more advanced users.
- Load the resulting data set to Excel and/or to a Power Pivot data model for further analysis.
A good starting point are the blog posts by Chris Webb:
And of course you should check the product web site which you can find here:
You will find useful specification documents (like the full formula language specification) here:
There are also a lot of cool videos on Youtube showing examples of Power Query and you can find a lot of interesting blogs about the product and its future in Office 365, tutorials and use cases.
For today’s post, let me show an example of loading data into Power Query. For this example, I’m using a text file with weblog data (w3c) stored on Hadoop. First let’s check the Hadoop File System:
The file size is about 500MB large. Let’s load this file into a Power Pivot data model by using Power Query. The first step is to open Excel and from the Power Query Ribbon bar, choose “From Other Sources” / “From Hadoop File (Hdfs)”. You only have to make sure that your Hadoop cluster has WebHDFS enabled.
In the following dialog I’m entering the address of my web hdfs folder:
Power Query fetches a list of all the files from my hdfs that are located under this directory (recursively):
If you like, you may click the arrow-symbol in the header of the Attribute column to include more information (for example the file size). But since we know which file we like to pick, we can simply click on the link “Binary” in the fist row of the table.
Power Query comes back with a sample of the first 100 rows:
We will use this sample to build our transformations. First the leading rows starting with a hash sign need to be removed. The easiest way of filtering out rows in Power Query is to use the column filter. So we click on the column filter and select the text filters:
Unfortunately, there is a filter for “begins with…” but not for “does not begin with…”. Anyway, we use the “begins with…” filter here, enter the hash sign and click ok. Since there are only few rows starting with a hash sign and Power Query shows the first 100 rows resulting from the query, this query would take some time to finish (the complete file is scanned). We don’t have that much time, so we interrupt the query by clicking
We can now edit the formula that was generated by Power Query and negate the boolean operation as shown below:
|Before||= Table.SelectRows(ImportedText, each Text.StartsWith([Column1], “#”))|
|After||= Table.SelectRows(ImportedText, each not Text.StartsWith([Column1], “#”))|
This results in the rows that we need for our data load:
The web log file is a delimited file using space as the delimiter, so we right click the column and choose “Split column…”, then “By Delimiter”. Doing a quick count of the spaces shows that we can go for 15 spaces:
Column 14 is the user agent. In order to get this column properly, we split it at the + sign and then remove the last two columns of the dataset.
Next step is to rename all the columns based on the W3C specification for the web log. Also, we can now rename the Query to “Weblog”.
After clicking Ok, the query definition window closes, and Excel starts to load the data:
Since this is a large data set, we don’t want to load it in to an Excel table (which is enabled by default). So without waiting for the data set to be fully loaded, we change the switch “Load to worksheet” to Off causing the resulting data to disappear from Excel:
Excel continues the load but the data goes nowhere. You still need to wait until the load is complete (data fully checked).
After the load is complete, the link “Load to data model” becomes available. By clicking this link, the data is transferred to a Power Pivot data model:
As you can see, the full data set of 2.9 million rows is now loaded to PowerPivot. From here, we can create a pivot table to analyze the data, for example by status code:
So, with just a few clicks, we were able to load and format the weblog file from Hadoop into a Power Pivot data model for analysis. Saving the Excel file results in 34MB, a compression ratio of approx. 1:14 compared with the original file being stored on Hadoop.