Tuesday, 27 November 2018

Sample Data Warehouse Using SSMS and SSIS


What is a Data Warehouse?


A Data Warehouse is a large-scale data store used to analyze data extracted from different data sources.

Let's look at the steps to create a Data Warehouse.

Step 1 – Create a star schema or snowflake schema to get a clear idea of how the data sets can be represented in tables and to identify the relationships between the tables.
Step 2 – Design an ETL architecture to implement the Data Warehouse.
Step 3 – Create the staging database.
Step 4 – Load the Data warehouse.

Now let's see how to implement the above steps in practice.

Star Schema


A star schema shows, in a simple way, how the dimension tables are connected to the fact table.





ETL Architecture




Creating Staging Database


First, open SSMS (SQL Server Management Studio). It will show a dialog box as below.




Select the server name and press Connect. Then expand the server node as below.



After that, right-click on Databases and create a new database as follows. Give the database a name and press OK.




Now you can see the newly created database under Databases, as below.


Then click on New Query and type a query like the one below to create the tables.
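The exact CREATE statements depend on your source files; as a sketch, a staging table for the branch data might look like the following (the column names and sizes here are illustrative assumptions, not the blog's actual schema):

```sql
-- Illustrative staging table; adjust column names and sizes
-- to match the columns in your .CSV files.
-- Staging columns are kept as VARCHAR; types are tightened later,
-- when loading the dimension tables.
CREATE TABLE Branch_STG (
    BranchCode VARCHAR(50),
    BranchName VARCHAR(100),
    City       VARCHAR(100)
);
```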


After that, select the appropriate database in the drop-down next to the Execute button and press Execute. Then right-click on Tables and select Refresh. Expand Tables and you will see that the tables have been created.




Next, right-click a table and click Select Top 1000 Rows. You will see an empty table.


Now it's time to load the raw data into the tables in the staging database. To load the data into the staging tables, we need to create an SSIS (SQL Server Integration Services) package. For that, first open Visual Studio and select a new project.



After that, select Integration Services Project, give the project a name and a location to save it, and press OK.




Next, expand SSIS Packages on the right-hand side, rename the package as you like, and double-click it.


It will then show the Control Flow tab as above. Into the Control Flow, drag and drop a Sequence Container.


After that, drag and drop a Data Flow Task into the container and rename it as you like. In my case I named it STGbranch.


Next double click on the data flow task STGbranch.


Then, from the toolbox on the left, drag and drop a Flat File Source to import the .CSV files.


After that double click on the flat file source to configure it to import the .CSV file.




Then click New to create a new connection manager. Give the connection manager a name and click Browse to select the .CSV file that needs to be imported.



Then select the .CSV file and press open as below.



After that it will look like below.




Then go to Columns, where you can see the data set.



Next, go to Advanced, where you can change the data type and length of each column if you want.



Then go to preview.






After that, press OK. Next, select Columns in the Flat File Source Editor.




Here you can see the .CSV file columns (External Columns) and the extracted columns (Output Columns).
Next, press OK. After that, drag and drop an OLE DB Destination and connect it to the Flat File Source as below.



Then double click on that and select the table Branch_STG as the destination.



After that, go to Mappings and confirm that the columns from the flat file source are correctly mapped to the destination table.




Next press ok and go back to control flow and right click on the STGbranch data flow task and execute the task.



You can see the progress of the execution, and any errors, by clicking the Progress tab.





After checking the progress, you can go back to edit mode by clicking the Stop button. Then return to SSMS and click Select Top 1000 Rows as before. Now you can see that the data in your .CSV file has been loaded into the staging table.




Load all the data into the staging database in the same way. Now suppose you have two or more .CSV files that contain data for the same table; let's see how to handle this situation. I'm using two .CSV files. As before, drag and drop a data flow task, then double-click it and import both .CSV files using two flat file sources. To combine the rows from the two flat file sources, we use the Union All and Sort tools. First, drag and drop a Union All tool and join it to the two flat file sources as below.
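For intuition, the Union All and Sort tools together do the equivalent of the following SQL; the table names standing in for the two .CSV files are hypothetical:

```sql
-- Rough SQL equivalent of the Union All + Sort data flow.
-- Branch_2017 and Branch_2018 are illustrative stand-ins
-- for the two .CSV flat file sources.
SELECT BranchCode, BranchName, City FROM Branch_2017
UNION ALL
SELECT BranchCode, BranchName, City FROM Branch_2018
ORDER BY BranchCode ASC;
```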



After that, double-click the Union All tool and check that the columns are correctly mapped.




Then press OK. Next, drag and drop a Sort tool and join it to the Union All tool. This tool sorts the rows in ascending or descending order by a chosen column.





Then double-click the Sort tool, tick the column you are going to sort by, select ascending or descending from the list, and press OK.



Finally, join the Sort tool to the OLE DB Destination as before.




After that execute the data flow task and select the table from SSMS.




If you want to run the data flow tasks sequentially, join them as below and execute them all at once by executing the container. To do that, right-click the container and click Execute Container.



Finally, drag and drop an Execute SQL Task.




Then double-click the Execute SQL Task and configure it as follows.



In the SQL statement, we need to write TRUNCATE queries so that all the staging tables are emptied each time the ETL runs.
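As a sketch, the statement could look like this; apart from Branch_STG, the staging table names below are assumptions and should be replaced with your own:

```sql
-- Empty every staging table before each ETL run so that
-- reloaded data is not duplicated.
TRUNCATE TABLE Branch_STG;
TRUNCATE TABLE Customer_STG;  -- example names; use your
TRUNCATE TABLE Product_STG;   -- actual staging tables here
TRUNCATE TABLE Sales_STG;
```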





Then press ok to continue.

The staging database is now complete. Next, let's see how to create the Data Warehouse.

Creating the Data Warehouse


First, go to SSMS and create another database, MonthlySales_DM, to implement the Data Warehouse. After that, go to Visual Studio and create a new package for building the dimension tables and the fact table. Then drag and drop a sequence container and add a data flow task inside it as below.



After that, double-click the data flow task, then drag and drop an OLE DB Source and configure it as below. Remember to select the staging database created earlier in SSMS as the connection manager, and select the staging table that you want to load into the dimension table (Customer_STG).



Then check the data using Preview, and check that the mappings are correct by selecting Columns. Next, press OK.





In the staging database we defined all the customer columns as varchar. If you want to change the data type of the phone number to an integer in the customer dimension table, follow the steps below. Otherwise, you can connect the OLE DB Source directly to the OLE DB Destination (its configuration is explained after the data conversion steps).

First drag and drop a data conversion tool and connect that with the OLE DB source.





Then double-click the Data Conversion tool and tick the column whose data type you want to change, in this case the phone number. After that, change the output alias if you want and select the new data type. I chose eight-byte signed integer as the data type for the phone number.




Then press ok.

Now drag and drop an OLE DB Destination and connect it to the Data Conversion tool.
Next, double-click the OLE DB Destination to configure it.



Then select MonthlySales_DM as the connection manager and click New next to 'Name of the table or the view'. After that, edit the CREATE statement as needed, and add an auto-incrementing surrogate key as well.
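For example, an edited CREATE statement for the customer dimension might look like this; the column names are illustrative assumptions, with CustomerKey as the IDENTITY surrogate key:

```sql
CREATE TABLE Customer_DIM (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,  -- auto-increment surrogate key
    CustomerCode VARCHAR(50),   -- example business-key column
    CustomerName VARCHAR(100),
    PhoneNumber  BIGINT         -- matches the eight-byte signed integer conversion
);
```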




Next check the mappings.




Then press ok.

After that execute the data flow task.



Next go to SSMS and select the table.





Create the other three dimension tables in the same way.

After that, use an Execute SQL Task and write TRUNCATE statements to truncate each dimension table.



Then connect it to the container as follows.




Finally, we have to create the fact table from the transaction table. For that, drag and drop another container and place a data flow task inside it.




After that double click on the data flow task. Then drag and drop an OLE DB source to import the transaction table ‘sales’ from SSMS.



Next, drag and drop a Lookup tool and join it to the OLE DB Source. The Lookup tool joins the dimension tables to the transaction table to retrieve the surrogate keys.




After that double click on the lookup and configure general as below.




Then go to connection and select the OLE DB connection and the dimension table as below.




You can see the data imported from the dimension table using Preview. Next, go to Columns and map the transaction table column to the matching dimension table column (this is a foreign key mapping) to retrieve the surrogate key, as below.




After that press ok.

Likewise, create lookups for each dimension and connect them one after the other. When you connect one lookup to another, a dialog box will appear. Configure it as follows and press OK.
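Conceptually, each lookup performs the equivalent of a join like the one below; the table and column names here are illustrative, not the blog's exact schema:

```sql
-- Each Lookup swaps a business key in the transaction rows
-- for the corresponding dimension's surrogate key, roughly:
SELECT s.*, c.CustomerKey
FROM Sales_STG s
LEFT JOIN Customer_DIM c
    ON s.CustomerCode = c.CustomerCode;
```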



Finally, connect the last lookup to an OLE DB Destination and create the fact table for sales.
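A sketch of what the resulting fact table might look like; the key and measure names below are assumptions to be adapted to your sales data:

```sql
CREATE TABLE Sales_FACT (
    CustomerKey INT,            -- surrogate keys returned by the lookups
    BranchKey   INT,
    ProductKey  INT,
    DateKey     INT,
    Quantity    INT,            -- example measures from the transaction table
    Amount      DECIMAL(18, 2)
);
```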






Then execute the fact container and check the fact table from SSMS.






Now you can use this Data warehouse to analyze data.

