Importing OpenStreetMap data or other large XML files with Integration Services (SSIS)

SQL Server 2005-2016

Last year I wrote about importing data from OpenStreetMap using PowerQuery. You can find the blog post here. In that post I loaded a relatively small area as an example. The reason was that the DOM parser in PowerQuery loads the full XML document into memory before it can be accessed. If you need to process larger areas, the approach with the DOM parser won't work. For example, in a recent project I had to load all addresses in Berlin. I took the OSM file from geofabrik.de. The file is bzip2-compressed down to 68 MB. Once decompressed it expands to about 940 MB (yes, XML is very talkative and compresses very well…). At first, I tried to load the file using PowerQuery and the script from my blog post. But since the DOM parser creates a memory-consuming object model, it failed after about 20 minutes, at roughly 30% of the load, with an out-of-memory error (having consumed about 25 GB of memory).

So, if you need to load a large XML file, you're well advised to use a different parsing approach. This blog post is about using a lightweight XML parser to read the file from above. A SAX parser, for example, reads the file once from beginning to end (forward only), firing events as it goes. In C#, the XmlReader follows a similar approach. Neither parser allows you to search the XML file randomly (for example with XPath), to insert elements, or to go back and forth. What they do is read the file in a very efficient way.

Let’s have a look at a typical node element from the OpenStreetMap OSM file:


<node id="256922190" visible="true" version="7" changeset="29687333" timestamp="2015-03-23T20:13:41Z" user="atpl_pilot"
uid="881429" lat="52.5379749" lon="13.2888659">
  <tag k="addr:city" v="Berlin"/>
  <tag k="addr:country" v="DE"/>
  <tag k="addr:housenumber" v="24"/>
  <tag k="addr:postcode" v="13627"/>
  <tag k="addr:street" v="Halemweg"/>
  <tag k="addr:suburb" v="Charlottenburg-Nord"/>
  <tag k="amenity" v="library"/>
  <tag k="layer" v="1"/>
  <tag k="name" v="Stadtteilbibliothek Halemweg"/>
  <tag k="ref:isil" v="none"/>
</node>

You can clearly see the geo coordinates (latitude and longitude) as well as the address (in key/value pairs below the node). I wasn't interested in points of interest (POIs) here, but you can also see that the amenity key carries this kind of information: in this case, a library.

Since PowerQuery uses the DOM parser and because I wanted the import process to run on a schedule, I used Integration Services (SSIS) to load the file. First I had to create a database table like this:

CREATE TABLE [dbo].[OSMAddress](
    [latitude] [real] NULL,
    [longitude] [real] NULL,
    [street] [nvarchar](255) NULL,
    [housenumber] [nvarchar](20) NULL,
    [postcode] [nvarchar](20) NULL,
    [city] [nvarchar](255) NULL,
    [country] [nvarchar](2) NULL
)

Next, I used a very simple data flow to populate the table:

image

The main logic is contained in the script component. This is the code of the CreateNewOutputRows method in the script component (please note that this code is without any error handling for simplicity):

public override void CreateNewOutputRows()
{
    float latitude = -1;
    float longitude = -1;

    String city = null;
    String country = null;
    String street = null;
    String housenumber = null;
    String postcode = null;

    using (XmlReader reader = XmlReader.Create(Variables.OpenStreetmapFile))
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name.Equals("node"))
                    {
                        if (reader.HasAttributes)
                        {
                            String lt = reader.GetAttribute("lat");
                            String lg = reader.GetAttribute("lon");

                            if (lt != null && lg != null)
                            {
                                if (!(float.TryParse(lt, out latitude) && float.TryParse(lg, out longitude)))
                                    latitude = longitude = -1;
                            }
                        }
                    }
                    else if (reader.Name.Equals("tag"))
                    {
                        if (latitude > -1 && longitude > -1)
                        {
                            String k = reader.GetAttribute("k");
                            String v = reader.GetAttribute("v");
                            if (k != null && v != null)
                            {
                                switch (k)
                                {
                                    case "addr:city":        city = v; break;
                                    case "addr:country":     country = v; break;
                                    case "addr:housenumber": housenumber = v; break;
                                    case "addr:postcode":    postcode = v; break;
                                    case "addr:street":      street = v; break;
                                }
                            }
                        }
                    }

                    break;

                case XmlNodeType.EndElement:
                    if (reader.Name.Equals("node"))
                    {
                        if (latitude > -1 && longitude > -1 && street != null && city != null && housenumber != null)
                        {
                            Output0Buffer.AddRow();
                            Output0Buffer.city = city.Substring(0, Math.Min(city.Length, 255));
                            Output0Buffer.country = (country == null) ? "" : country.Substring(0, Math.Min(country.Length, 2));
                            Output0Buffer.housenumber = housenumber.Substring(0, Math.Min(housenumber.Length, 20));
                            Output0Buffer.latitude = latitude;
                            Output0Buffer.longitude = longitude;
                            Output0Buffer.postcode = (postcode == null) ? "" : postcode.Substring(0, Math.Min(postcode.Length, 20));
                            Output0Buffer.street = street.Substring(0, Math.Min(street.Length, 255));
                        }
                        latitude = longitude = -1;
                        street = postcode = housenumber = country = city = null;
                    }
                    break;
            }
        }
    }
}

The package took about 10 seconds to extract all of the addresses from the OSM file into the database table: a quite impressive result compared to the 20 minutes without success from above. So this clearly shows the advantage of XML parsers like SAX or XmlReader when it comes to reading larger files. If you go for larger areas, it's better to stream directly from the bzip2-compressed file instead of decompressing the file first. For example, the OSM file for Germany (complete) is about 4 GB in size (bzip2-compressed) and expands to a single 48 GB XML file. I used SharpZipLib to decompress the file on the fly, which saves a lot of disk space and IO (a short sketch follows below the image). Using this approach I created the following visualization showing the concentration of fuel stations in Germany:

image
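Here is a minimal sketch of how the on-the-fly decompression could be wired into the script component from above. It assumes a reference to the SharpZipLib assembly (namespace ICSharpCode.SharpZipLib.BZip2) and reuses the OpenStreetmapFile package variable; the loop body is the same node/tag handling as shown earlier.

// Assumes: using System.IO; using System.Xml; using ICSharpCode.SharpZipLib.BZip2;
// Stream the bzip2-compressed OSM file directly into the XmlReader,
// so it never has to be decompressed to disk first.
using (FileStream file = File.OpenRead(Variables.OpenStreetmapFile))
using (BZip2InputStream bzip = new BZip2InputStream(file))
using (XmlReader reader = XmlReader.Create(bzip))
{
    while (reader.Read())
    {
        // ... same node/tag handling as in CreateNewOutputRows above ...
    }
}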

Of course you could retrieve much more information from the OSM file than I did here. For example, you can read borders (city, state etc.), points of interest (restaurants, airports etc.), sometimes even the location of trees. The file format is described here: http://wiki.openstreetmap.org/wiki/OSM_XML.


Polybase in SQL Server 2016 CTP 2

SQL Server 2016

One of the exciting new features in SQL Server 2016 that is already available in the public preview version (CTP 2) is Polybase, the transparent access bridge from SQL to Hadoop. This functionality has been available in the Microsoft Analytics Platform System (APS) since version 2012 and has now made its way into SQL Server. You can find a lot more information, for example, in this blog post: http://blogs.technet.com/b/dataplatforminsider/archive/2014/06/02/polybase-in-aps-yet-another-sql-over-hadoop-solution.aspx

This blog post covers the first steps for using Polybase in the SQL Server 2016 Community Technology Preview 2 (CTP 2).

First of all, you need to install Polybase functionality during the installation process by checking the appropriate option:

image

This installs two services for Polybase as shown below:

image

Also, when opening a database in SQL Server Management Studio you’ll see some new objects:

image

External tables are your gateway to Hadoop. Later in this post we will create an external table for a Hadoop file, which can then be used mostly like a normal SQL Server table.

External data sources define the connection to the Hadoop cluster (more precisely to the Hadoop file system, HDFS), while external file formats are used to define the structure of the file on the Hadoop cluster.

Before we can start, we need to configure SQL Server for Hadoop connectivity. The following code is an example:

EXEC sp_configure 'hadoop connectivity', 5;
GO
RECONFIGURE;
GO

You can use the following configuration values:

  • 0: no Hadoop connectivity (default)
  • 3: Enable connectivity to Cloudera CDH for Linux
  • 4: Enable connectivity to Hortonworks Data Platform for Windows Server (HDP), HDInsight on Analytics Platform System, or HDInsight's Microsoft Azure blob storage
  • 5: Enable connectivity to Hortonworks Data Platform (HDP) for Linux

In my case, I’m using a Hortonworks Sandbox on Linux, so I’m using the configuration value 5.

Also, the user that is used to access Hadoop has to be configured in the Polybase configuration file, which can be found in the Binn\Polybase\Settings subdirectory of the instance, in my case

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

The default user is pdw_user which I changed to ‘hue’ here:

image

To make all these changes active, I had to restart SQL Server and the two Polybase services.

Now, let’s create an external data source. You can create a code snippet from the context menu of the external data sources folder. Here is the code I’m using for my sandbox:

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
)

Next, we need to create an external file format. I’d like to access the tab delimited file sample_07 from Hive which is shipped with the sandbox. Therefore the external file format looks like this:

CREATE EXTERNAL FILE FORMAT TSV
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '\t',
        DATE_FORMAT = 'MM/dd/yyyy'
    )
)

The date format is not needed in my case (there is no date column in this file) but is just there to show the functionality.

The two new server objects now appear in the “External Resources” section of the database.

image

Next, we can create the external table itself using the defined data source and file format.

create external table sample_07
(
    code nvarchar(255),
    description nvarchar(255),
    total_emp int,
    salary nvarchar(255)
)
WITH
(
    LOCATION = '/apps/hive/warehouse/sample_07',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = TSV,
    REJECT_TYPE = value,
    REJECT_VALUE=0
)

The new external table appears in the “External Tables” folder of the database.

image

We can now query the table like any other table in the database, for example:

select * from Sample_07

image

Here is another example:

select left(code,2) code2, sum(total_emp) total_emp, count(*) Count
from Sample_07
group by left(code,2)

And of course you can also write queries with joins between Polybase external tables and local SQL Server tables, which gives a transparent SQL query experience to Hadoop data.
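Such a join could look like the following sketch; the local table dbo.OccupationCategories and its columns are hypothetical and just illustrate mixing external and local data in one query.

-- dbo.OccupationCategories is a hypothetical local table used to enrich the
-- rows coming from the external (Hadoop-backed) table sample_07.
SELECT  s.code,
        s.description,
        s.total_emp,
        c.category_name
FROM    sample_07 AS s
JOIN    dbo.OccupationCategories AS c
    ON  c.code_prefix = LEFT(s.code, 2);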

 

Final notes

  • The functionality and syntax may change in the final product, so this only applies to the Community Technology Preview 2 (CTP 2)
  • Currently, writing to a Hadoop file is not supported (hopefully something similar to the APS's "create external table as select" syntax will follow)
  • This post is just a brief overview. There are many more options (for example more file formats, compression etc.) that can be used. A good orientation can be found in the Microsoft APS support tools download which can be found here: http://www.microsoft.com/en-us/download/details.aspx?id=45294

Incremental data loads in Microsoft Power Query for Excel

Power Query

Power Query is a great and flexible tool for getting and transforming data from different sources into Excel or Power Pivot. The standard procedure for Power Query is to read a full table and to replace the destination with the result of the current query. This is exactly what we need for most use cases of Power Query. However, if you would like to add new data to existing data, you can still use Power Query but you have to follow a slightly different approach.

For example, let's say we want to create a list of the blog posts while the RSS feed only delivers the most recent posts. In this case we would need to add the query results to the existing data.

Let's start with my OLAP blog. The RSS feed delivers only the last 25 entries of my blog. In order to load this with Power Query, I'm using the load-from-web function:

image

The dialog asks for the URL (I’m using my blog’s mirror here at http://ms-olap.blogspot.com/feeds/posts/default)

image

If Power Query does not automatically detect this as an XML table you can insert the Xml.Tables(…) function as shown below:

image

The items are stored in the last column (entry table), which can be expanded by clicking the expand button right beside "entry". For my example I'm only interested in the publication date and the title.

image

By expanding this table we get one row per blog post. In order to get the title, I'm expanding the last column (in my case title.1, since there is already a column title for the title of the blog) to its text value:

image

Finally, I changed the type of the first column to DateTime and I renamed the columns as shown below:

image

The full script generated by Power Query so far looks like this:

let
    Source = Xml.Tables(Web.Contents("http://ms-olap.blogspot.com/feeds/posts/default")),
    #"Expand entry" = Table.ExpandTableColumn(Source, "entry", {"published", "title"}, {"published", "title.1"}),
    #"Expand title.1" = Table.ExpandTableColumn(#"Expand entry", "title.1", {"Element:Text"}, {"Element:Text"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expand title.1",{"id", "updated", "category", "title", "subtitle", "link", "author", "generator", "http://a9.com/-/spec/opensearchrss/1.0/"}),
    #"Split Column by Position" = Table.SplitColumn(#"Removed Columns","published",Splitter.SplitTextByPositions({0, 19}, false),{"published.1", "published.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Position",{{"published.1", type datetime}, {"published.2", type text}}),
    #"Removed Columns1" = Table.RemoveColumns(#"Changed Type",{"published.2"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"published.1", "Date"}, {"Element:Text", "Title"}})
in
    #"Renamed Columns"

We can now load this result to the workbook by clicking the "Close & Load" button. Now, here comes the interesting part. As mentioned above, the RSS feed only delivers the last 25 posts, so whenever there is a new post, the oldest post would be removed from the table. Let's say we'd like to add new posts to the table instead.

To do so, I’m creating a new query named “Existing data” from the data we just loaded into the workbook:

image

Using the "Close & Load To …" function in the ribbon, we can choose to only create a connection (no data is actually moved here):

image

The workbook queries pane now looks like this:

image

We can now edit the first query (“BlogPosts”) again. In the home tab of the Power Query ribbon you can find the options for combining queries.

image

In my case, I simply decide to append the new data to the existing data. If you have a key to match the same rows, you could also do a merge based on the key column (for example if there are updated rows).

image

The result is that we now have 25 rows from the existing table plus the 25 rows recently read from the blog. At this point in time, the entries are identical, so we have 25 duplicates. I remove the duplicated rows using the "Remove duplicates" function (a short sketch of the generated logic follows below the screenshot).

image
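In M, the combination of the append and the duplicate removal boils down to something like the following sketch; it assumes the query names used in this post ("Existing data" for the rows already in the workbook, #"Renamed Columns" as the last step of the RSS query shown above) and matches duplicates on Date and Title.

let
    // Append the freshly loaded posts to the rows already stored in the workbook.
    Combined = Table.Combine({#"Existing data", #"Renamed Columns"}),
    // Keep each post only once, matching on date and title.
    Deduplicated = Table.Distinct(Combined, {"Date", "Title"})
in
    Deduplicated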

We can now save & load the query again. In order to demonstrate the effect, I’m pasting the remaining blog posts (which cannot be pulled by the RSS feed) into the Excel table:

image

Summary
While full loads usually are the most useful approach for loading data with Power Query, you can also build incremental load procedures by joining or appending the new data to the existing data.


Importing data from SAP using the SAP BW Delta Queue and Theobald Software Xtract IS

SQL Server 2005-2014

 

Importing data from SAP is a common task in many BI projects. Since Integration Services (SSIS) doesn't have out-of-the-box support for SAP as a data source, third-party components should be considered to simplify this task. The components we're using, and which I am writing about here, are supplied by Theobald Software and are called Xtract IS. A fully functional trial version is available and the examples in this blog post can be reproduced using this trial version. Xtract IS contains many different options to access a SAP system, for example:

  • Read tables (through rfc_read_table or a custom function included in Xtract IS that also supports compression) or SAP queries
  • Call remote functions (RFCs) and BAPIs
  • Retrieve data from legacy reports
  • Connect to BW via OHS
  • Connect to SAP extractors (aka "delta queue" or "generic data source")
  • much more…

For this short blog series I'd like to concentrate on the extractor interface, which is ideal for reading large tables from SAP. The support for delta detection is especially important when loading tables or structures with potentially billions of rows of data. In order to avoid misunderstandings: Theobald's delta queue component does not require SAP BW to be in place, but it basically acts like BW and uses the same interface to push the data to SSIS in a very efficient way.

Since this post got a little longer, let’s start with an agenda:

 

Agenda

  • How do I get access to a SAP system?
  • Required customizations in SAP to use BW extractors (Delta Queue)
  • How to use existing (pre-defined) extractors in Integration Services
  • Create and use a simple custom extractor based on SAP query
  • Create and use a delta-enabled custom extractor based on SAP query

 

How do I get access to a SAP system?

Before we start, of course we need some access to an existing SAP system. If you have a development system in house the administrators might give you the required access. If not, there are some other options to get started and to create a test environment for connecting SAP to Integration Services. 

Most of the examples I'm using in this blog post work fine with the SAP ABAP NetWeaver trial version (the so-called "mini-SAP"). This edition runs for 90 days and can easily be extended for whatever time is needed. Since the download link changes from time to time, it's best to search for the download using your favorite search engine or to go to the SAP Community website (registration needed) and search for "SAP NetWeaver Trial Version ABAP". There is an excellent blog series by Manfred Lutz about the ABAP trial version covering the installation process as well as many aspects of configuration. In addition, if you don't want to install the trial version, I can also recommend having a look at the Consolut website as they offer free-of-charge access to a mini SAP. Until recently they also offered free-of-charge access to a full SAP IDES system, which is somewhat like the AdventureWorks for Microsoft SQL Server (however, IDES covers a lot more real-life business processes). Unfortunately they cannot offer this service any longer, so you might want to take a look at the SAP offerings: https://training.sap.com/shop/live-access.

 

Required customizations in SAP to use BW extractors (Delta Queue)

Theobald Software has very detailed instructions on its website about how to use and configure the SAP system. You can find the detailed step-by-step instructions here. For this post, I followed those steps exactly, naming all of the created objects as shown in the documentation.

 

How to use existing (pre-defined) extractors in Integration Services

In SSIS we first need to create a connection to the SAP system. With Theobald Software’s Xtract IS installed, you can add a new connection of type “XTRACT”:

image

SAP has some specific settings that need to be entered for making a connection (for example the client or the system number). The following dialog box is used to enter these settings which are usually provided by your administrator.

t2

In order to use the source component in a data flow, we first have to create a data flow (obviously) and then drag the Xtract DeltaQ component to this data flow:

image

When opening the component's settings you first start filling out the four boxes on the right. The search buttons help you to find the correct information.

image

If you're using the Xtract IS components for the first time or if you are not sure whether the customization of the SAP system was correct, it is a good idea to use the "Customization Check" link (marked with the red arrow in the screenshot). This function checks the security settings and customizations of the SAP system and comes back with a checklist:

image

If any of the tests fails, the message usually gives administrators enough information to fix this. The next task is to find the extractor that delivers data to the data flow. If you've installed the SAP trial system, not much content is available. But as mentioned above, you could also connect to the Consolut IDES demo system containing a lot of pre-defined extractors. Now, how do you find the extractor for your data? The easiest thing is to search the web for "sap extractor for <<your topic>>", for example "sap extractor for profit centers". You could also go to http://help.sap.com directly or check the content available on the Consolut website.

For this demonstration, I’m using one of the profit center transaction extractors. The search function makes it easy to find the extractor by name or description. Please note that if you are using the Mini SAP installation, this extractor is not available so you need to go for one of the existing extractors there.

image

I'm using the first of the extractors in the list (0EC_PCA_1). After hitting OK, we're back in the main configuration dialog, which now shows all the fields provided by the extractor:

image

If filters are defined for the columns in the data source/extractor you will notice an “Edit” link right beside the checkbox for including the column in the output. Filters can be defined following the typical way SAP handles filters (include/exclude, cascaded) and package variables may be used for filter values.

Some important settings can be found in the “Settings” dialog (link right below the dialog window header). Pay special attention to the “automatic data type conversion” setting:

image

With automatic type conversion enabled, Xtract IS derives the type correctly for many different columns. You can see this effect on the data flow (meta data):

Without automatic type conversion:

image

With automatic type conversion:

image

A detailed explanation of all the settings in this dialog can be found on the Theobald Software website, for example here. The settings dialog also allows you to name a package variable for passing the delta mode, as shown below:

image

Another great feature is the ability to generate documentation for all fields that are used in the extractor (button "generate documentation"). The output is in Excel format and looks like this:

image

 

Create and use a simple custom extractor based on SAP query

If the data you want to load is not covered by an extractor, you can still use the extractor functions by creating a custom extractor. You can do so by using a SAP View, a Query or by writing custom ABAP code. For this example, I’m using SAP Query to create a query based on the simple flight model data that is included in SAP systems (also in the SAP trial version) for educational purposes.

First we need to create an InfoSet for the query using transaction code SQ02:

t3

After clicking “Create” further information has to be provided:

t4

We reference one of the tables of the flight model (sbook, the table for flight bookings) here. The other tables to join are defined in the next step. If you are unsure which tables are needed, a good starting point is to view the table definition using transaction se11 (ABAP dictionary). On the “Entry help/check” tab you see which tables are used to validate entry values. This is shown in the following dialog:

t5

Also double clicking on the entries in the domain or search help column shows more information on the given structure. For this example, we will at least need the following additional tables:

  • SCARR (airline)
  • SFLIGHT (flight details)
  • SPFLI (flight schedule)

 

From the InfoSet editor you have to click the icon marked in red below in order to add more tables:

t6

For tables that are required more than once (in this case the airport, which is needed as departure and arrival airport) you may create aliases before adding the table to the diagram:

t7

t8

t9

SAP Query tries to automatically detect the links between the tables. However, in my case it didn't work well, so I had to manually delete and recreate some of the links:

ta

tb

After adjusting the table links, we click on the “-> InfoSet” button to continue with the field selection. Here, the available fields of the query are defined. As a default we choose to start with the key fields here:

td

Next you can choose the fields you like by placing them in the right pane of the InfoSet editor (via drag and drop). Field groups are used mainly to structure fields a little bit:

te

Clicking on "Check" validates the structure. In my case, a wrong link had been created in the diagram:

tf

So here is how the full diagram looks after correcting it:

tg

The next thing to do is to activate the InfoSet:

th

We can now create a new extractor (or “generic data source”) by using transaction code rso2. For my example I used “ZTESTHB” as the name for the new extractor. The name will be needed later on to refer to the extractor:

ti

The entry box for the InfoSet becomes available once you hit the "Extraction by Query" button. If our InfoSet is not in the list, it might be because of its visibility. In this case, go back to transaction SQ02 and click the "Role/User Group Assignment" button for the InfoSet.

tj

You can now assign the InfoSet to all required user groups. Back in rso2 you may also need an application component which can be created using transaction code rsa6:

tk

You may now save the extractor (aka generic data source). Depending on the fields you’ve chosen, you might get a warning about text fields:

tl

Especially with large tables (the main reason why we're using the extractor), it's good advice to transfer only the keys, not the text fields. Text fields and other descriptive information could be loaded separately (for example using the Xtract Table component). To remove text fields from the extractor, you can use this menu entry:

tm

When saving the extractor you need to provide a package (or local object) and a workbench request (which you can also create from this dialog).

We're now ready to use the new extractor from within SSIS. As shown above, we need an XTRACT connection to the SAP system and an Xtract IS DeltaQ component in the data flow. We can then configure this component:

tq

The data source/extractor should be easy to find using the search function:

tr

I simply select all fields here and activate the source:

ts

A good way to check if everything works fine is the “Preview” functionality:

tt

From within SAP you can use the gateway monitor (transaction code smgw) to see all active connections:

tu

In the data flow the Xtract DeltaQ component has two outgoing connections. One is for the data retrieved, the other is for the log. Writing the log data to a table could also be helpful for tracking errors:

tv

tw

 

Create and use a delta-enabled custom extractor based on SAP query

So far, the data source/extractor generated does not support any delta mechanism. In order to change this, we open the InfoSet again (using transaction code sq02) and add the booking date to the output (if it is not already there).

tx

After re-generating the InfoSet we can edit the data source/extractor again (using transaction code rso2). Clicking on “Generic Delta” opens the following dialog to define the delta mechanism:

ty

For the field we select our order_date field that we have just added to the field list. The safety interval results in an overlapping data delivery, so that we're sure we don't miss any rows. However, in this case each delta contains some rows from the previous batch, which means that we need to take care of this in our data flow, for example by only inserting rows that are not yet present in the destination table (a sketch follows below).
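A minimal sketch of such a de-duplicating load step in T-SQL; the staging and target tables and the key columns are hypothetical and stand for whatever business key identifies a booking row in your extractor.

-- Rows arrive in stage.SapBookings via the data flow; rows already present in the
-- target (identified by the booking key) are skipped, so the overlap does no harm.
INSERT INTO dbo.SapBookings (CARRID, CONNID, FLDATE, BOOKID, ORDER_DATE)
SELECT s.CARRID, s.CONNID, s.FLDATE, s.BOOKID, s.ORDER_DATE
FROM stage.SapBookings AS s
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.SapBookings AS t
    WHERE t.CARRID = s.CARRID
      AND t.CONNID = s.CONNID
      AND t.FLDATE = s.FLDATE
      AND t.BOOKID = s.BOOKID
);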

We can now switch the update mode to “C – Delta Initialization” (and I will show how this is done using a package variable in a minute). The next run of the package fetches all rows from the table.

tz

Now switching the update mode to “D – Delta Update” gives no new rows on the next run:

tz1

The request log shows the detailed extraction:

tz2

In order to control the update mode from a package variable we first need to create the variable and then enter the variable name in the "Update Mode Variable" field of the settings dialog. Also, don't forget to change the update mode of the Xtract IS DeltaQ component to "V – SSIS variable" for the magic to happen:

tz4

At the end of this post I'd like to add some more tips on using the delta queue. First, you may test each extractor using the Extractor Checker (transaction code rsa3).

tz5

The delta queue maintenance (transaction code rsa7) may be used to check the current value of the delta field. Also, if something fails because of missing rights, a trace (transaction code ST01) may help (look for failed authority checks) and transaction code SU53 may be used to display the most recently failed authority check. And finally, the Delta Queue Monitor (transaction code odqmon) could be helpful to monitor the state of delta-enabled extractors.


Un-pivoting plan or forecast data in Excel with PowerQuery

PowerBI | PowerQuery

PowerQuery enhances Microsoft Excel by providing a self-service data integration toolset to combine external data with local data. But PowerQuery may also be used just to do useful transformation tasks within Excel. In this example, we start with a simple Excel workbook containing planning data for 5 products and 12 periods and a regional allocation key for 4 regions.

Products:

image

Regions and allocation:

image

For further processing a table of the following format is required with the plan values being un-pivoted and the allocation key being applied for each product, month and region:

image
                           …
image

So, how can we do this in Excel? Before PowerQuery we might have used some Excel tricks with functions like INDEX or MOD, but this approach is not very flexible. What about adding an additional product or another region? What we really need is a simple transformation that allows us to refresh the resulting table based on any changes of the underlying tables.
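The core of such a transformation could look roughly like the following M sketch. The table and column names (a Products table with one column per month, a Regions table with Region and Allocation columns) are hypothetical placeholders for the workbook tables shown above.

let
    // Read the two workbook tables (names are assumptions for this sketch).
    Products = Excel.CurrentWorkbook(){[Name="Products"]}[Content],
    Regions  = Excel.CurrentWorkbook(){[Name="Regions"]}[Content],
    // Turn the twelve month columns into Month/PlanValue rows.
    Unpivoted = Table.UnpivotOtherColumns(Products, {"Product"}, "Month", "PlanValue"),
    // Cross join with the regions table and apply the allocation key.
    WithRegions = Table.AddColumn(Unpivoted, "RegionTable", each Regions),
    Expanded = Table.ExpandTableColumn(WithRegions, "RegionTable", {"Region", "Allocation"}),
    Result = Table.AddColumn(Expanded, "Value", each [PlanValue] * [Allocation], type number)
in
    Result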

The following video shows my step-by-step approach to this task.

Unpivoting Excel Data in PowerQuery

You can also download the final Excel file here.


Querying Hadoop from SQL Server

SQL Server 2012 | SQL Server 2014

Microsoft's Analytics Platform System (APS) offers built-in transparent access to Hadoop data sources through the Polybase technology. This includes bidirectional access not only to Hadoop but also to cloud services. The SMP SQL Server currently doesn't contain Polybase, so access to Hadoop needs to be handled differently. Will Polybase be available in an upcoming SMP SQL Server? In the past we have seen some technologies make their way from the PDW to the SMP SQL Server, for example the clustered columnstore index, the new cardinality estimation or the batch mode table operations. So let's hope that Polybase makes it into the SMP SQL Server soon. Until then, one option is to use the Hortonworks ODBC driver and linked tables. To be honest, Polybase is a much more powerful technology since it uses cost-based cross-platform query optimization, which includes the ability to push down tasks to the Hadoop cluster when it makes sense. Also, Polybase doesn't rely on Hive but accesses the files directly in parallel, thus giving great performance. Linked tables are less powerful but may still be useful for some cases.

So, here we go. First, you need to download the ODBC driver from the Hortonworks add-ons page: http://hortonworks.com/hdp/addons/.

Make sure you pick the right version (32 bit/64 bit) for your operating system. After the installation completes, we need to set up an ODBC connection. Therefore, start the ODBC Data Source Administrator (Windows+S, then type 'ODBC'). Again, make sure to start the correct version (32 bit/64 bit). The installer has already created a connection but you still need to supply the connection properties. I created a new connection instead:

p1

I'm connecting to the Hortonworks Sandbox here (HDP 2.1; I had problems connecting to HDP 2.2 with the current version of the ODBC driver). Instead of the host name you can also enter the IP address (usually 127.0.0.1 for the sandbox), but in order to get other tools running (like the Redgate HDFS Explorer) I configured the sandbox virtual machine to run on a bridged network and put the bridged network IP address of the sandbox (console command "ip addr") in my local hosts file.

You should now click on Test to verify that the connection actually works:

p2

In SQL Server Management Studio we can now create a linked server connection to the Hadoop system using the following command:

EXEC master.dbo.sp_addlinkedserver
    @server = N'Hadoop',
    @srvproduct = N'HIVE',
    @provider = N'MSDASQL',
    @datasrc = N'HDP',
    @provstr = N'Provider=MSDASQL.1;Persist Security Info=True;User ID=hue;'

Depending on your Hadoop security settings, you might need to provide a password in the provider string as well. The @server name is used to refer to the linked server later, while @datasrc names the ODBC connection (see "Data Source Name" in the configuration dialog of the connection above).

With the new linked server, we can now explore the Hive database in Management Studio:

p3

In order to run a query on, for example, the table "sample_07", you can use one of the following commands:

select * from openquery (Hadoop, 'select * from Sample_07')

or

select * from [Hadoop].[HIVE].[default].[sample_07]

For both queries, “Hadoop” refers to the name of the linked server (@server parameter in the SQL statement from above).

If you get the following error message, this means that you are not allowed to query the table:

OLE DB provider “MSDASQL” for linked server “Hadoop” returned message “[Hortonworks][HiveODBC] (35) Error from Hive: error code: ‘40000’ error message: ‘Error while compiling statement: FAILED: HiveAccessControlException Permission denied. Principal [name=hue, type=USER] does not have following privileges on Object [type=TABLE_OR_VIEW, name=default.sample_07] : [SELECT]’.”.
Msg 7306, Level 16, State 2, Line 1
Cannot open the table “”HIVE”.”default”.”sample_08″” from OLE DB provider “MSDASQL” for linked server “Hadoop”.

In this case, you should simply give the user from your ODBC connection the SELECT right. To do so, run the following query in Hive:

grant select on sample_07 to user hue;

That’s it. You should now get the contents of the table in SQL Server:

p4

You might want to set the length of string columns manually because Hive does not return the size of a string column (in Hive, the column type is simply "string"). The size returned by the query comes from the advanced ODBC settings of our connection. I left everything at the default here, so this is how it looks:

image

So, the default string column length is 255 here. Let's check and copy the data over to SQL Server:

select * into sample_07 from [Hadoop].[HIVE].[default].[sample_07]

The resulting table looks like this:

image

To have more precise control over the column lengths, you can use the convert function, for example:

select
convert(nvarchar(50),[code]) [code],
convert(nvarchar(80),[description]) [description],
total_emp,
salary
from [Hadoop].[HIVE].[default].[sample_07]

Be careful with the remaining settings in the advanced options dialog. For example, checking "Use native query" means that you pass the query (openquery syntax) as it is to Hive. This can be intended to fully leverage specific features of Hive, but it can also lead to errors if you're not familiar with the HiveQL query syntax. Also, to get better performance with larger tables you might want to adjust the "Rows fetched per block" option to a larger value.
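For example, with the native query option enabled, Hive-specific syntax such as limit can be used inside openquery; a small sketch (column names as in the sample_07 table from above):

select *
from openquery (Hadoop, 'select code, description, salary from sample_07 limit 10')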

With HDP 2.2 you should also be able to write to the table (create a new table, grant all permissions and run an insert into), but I couldn't do this on my HDP 2.1 machine.

Summary

Until Polybase makes it into the SMP SQL Server product, Hadoop data may be queried from SQL Server using the ODBC driver and a linked server object. This could also be an option for Analysis Services to connect to Hadoop by using SQL Server views via the linked server, since Analysis Services doesn't support ODBC in multidimensional mode. However, Polybase on the APS gives much better performance because of its intelligent cross-platform query optimizer, and Polybase can also be used to write data to Hadoop, so I hope we'll find this technology in the SMP SQL Server soon.


Hive file format comparison

Apache Hive 0.11+

In this post I'd like to compare the different file formats for Hive, as well as the different execution times for queries depending on the file format, compression and execution engine. As for the data, I'm using the Uber data set that I also used in my last post. I'm aware that the following query stats are not exhaustive and you may get different results in your environment or with other table formats. Also, you could try different SerDes for Hive as well as consider compression. Still, it gives you some idea that both the file format and the execution engine play an important role for Hive's query performance. However, when choosing a file format, you may also consider data management aspects. For example, if you get your source files in CSV format, then you will likely process the files in this format at least during the first process step.

As test queries I used the query to measure the total trip time (query 1) and the query to find all trips ending at San Francisco airport (query 2) from my last post. Here is the result for the file formats I tested:

image

Here is some information about the different file formats being used here:

  • textfile: separated text file (for example tab-separated fields)
  • rcfile: internal Hive format (binary)
  • orc: columnar storage format (highly compressed, binary, introduced with Hive 0.11); see http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
  • parquet: columnar storage format (compressed, binary); see http://parquet.incubator.apache.org/. Parquet is supported by a plugin in Hive since version 0.10 and natively in Hive 0.13 and later.
  • avro: serialization file format from Apache Avro (contains schema and data, tools available for processing); see http://avro.apache.org/. Avro is supported in Hive since version 0.9.

For the table create/write time I measured a "create table as select" (CTAS) into the specific format (a sketch of such a statement follows below the chart). As you can see, the resulting size of the table depends a lot on the file format. The columnar ORC file format compresses the data in a very efficient way:

image
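Such a CTAS statement could look like the following sketch; the source table name uber_trips is a hypothetical placeholder for the text-format table holding the Uber data set.

-- Copy the text-format table into an ORC table to compare size and query times.
set hive.execution.engine=tez;

create table uber_trips_orc
stored as orc
as
select * from uber_trips;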

Using Tez as the execution engine (set hive.execution.engine=tez) results in a much better performance compared to map reduce (set hive.execution.engine=mr). The total time for the two queries is shown in this table:

image

In map reduce mode, query time does not seem to depend too much on the file format being used:

image

However, when running the queries in Tez, you'll see a significant difference between file formats like Parquet and ORC (with ORC being about 30% faster):

image
