What is a Data Warehouse?
A Data Warehouse is a large scale storage that is
used to analyze data which are extracted from different data sources.
Let’s see what are the steps to create a Data warehouse.
Step 1 – Create a star schema or snow flake schema
to get a clear idea about how the data sets can be represented in tables and to
identify the relationships between the tables.
Step 2 – Design an ETL architecture to implement the
Data warehouse.
Step 3 – Create the staging database.
Step 4 – Load the Data warehouse.
Now let’s see how to implement above steps
practically.
Star Schema
Star schema represents how the dimension tables and
the fact table is connected, in a simple way.
ETL Architecture
Creating Staging Database
First open the SSMS (SQL Server Management Studio) and it
will show a dialog box as below.
Select the server name and press connect. It will show as below and then expand the server as below.
After that right click on database and create new database as following. Give a name to the database and press ok.
Now you can see the newly created database under the
databases as below.
Then click on the new query and type query as below to
create the tables.
After that select the appropriate database in the drop down
near to the execute button and press execute. Then right click on the tables and select
refresh. After that expand the tables and you will see the tables has been
created.
Next select a table and right click and click on select top
1000 rows. Then you will see an empty table.
Now it’s time to load the raw data into the tables in
staging database. To load the data into staging tables we need to create a
SSIS (SQL Server Integration Service) package. For that first open the visual studio. Then select a new project.
After that select Integration Services Project and give a name and a location to save the project and press ok.
Next expand the SSIS packages from right hand side and
rename the package as you want and double click on it.
Then it will show the control flow tab as above. In to the
control flow drag and drop a sequence container.
After
that drag and drop a data flow task and re-name it as you want. In my case I
named it as STGbranch.
Next double click on the data flow task STGbranch.
Then
from the left side tools drag and drop a flat file source to import the .CSV
files.
After
that double click on the flat file source to configure it to import the .CSV
file.
Then click on new to create a new connection manager. After that give a name to connection manager and
click on browse to select the .CSV file that need to import.
Then select the .CSV file and press open as below.
After that it will look like below.
Then go to columns and you can see the data set.
Next go to advanced and if you want, can change the data type and length of the data type.
Then go to preview.
After that press ok. Next select columns in flat file source editor.
Then you can see the .CSV file columns (External column) and
the extract columns (Output column).
Next press ok. After that drag and drop an OLEDB destination
and connect that with the flat file source like below.
Then double click on that and select the table Branch_STG as the destination.
After that go to mappings and confirm that the columns in the flat file source are correctly mapped with the destination table.
Next press ok and go back to control flow and right click on the STGbranch data flow task and execute the task.
You can see the progress of the execution or the errors if any by clicking progress.
After checking the progress, you can go to edit mode by
clicking the stop button. Then go back to SSMS and click on select the top
1000 rows as before. Now you see the data in your .CSV file are loaded in to
the staging table.
Like this load all the data into the staging database. Apart from this if you have two or more .CSV files which
includes data for same table, let’s see how to handle this situation. I’m using two .CSV files. In the similar way drag and drop a data flow task. Then
double click on that and import both .CSV files using flat file sources. To get the data from two flat file sources we need to use
union all and sort tools to connect them. First drag and drop a union tool and join it
with the two flat file sources like below.
After that double click on the union all tool and check the columns are correctly mapped.
Then press ok. Next drag and drop sort tool and join it with
union all tool. This tool is used to sort the table ascending or descending
order using a column.
Then double click on the sort tool and tick the column which you are going to use to sort and select ascending or descending from the list and press ok.
Finally join the sort tool with the OLE DB destination as previous.
After that execute the data flow task and select the table from SSMS.
If you want to run the data flow tasks sequentially then join the data flow tasks like below and execute them all by executing the container. To do that right click on the container and click execute container.
Finally drag and drop an execute SQL task.
Then double click on the execute SQL task and configure it like following.
In the SQL statement we need to write truncate query to truncate all the tables when running the ETL each time.
Then press ok to continue.
Staging database is created. Now let’s see how to create the data warehouse.
Creating the Data Warehouse
First go to SSMS and create an another database MonthlySales_DM
to implement the Data warehouse. After that go to visual studio and create a new package to
create the dimension tables and the fact table. Then drag and drop a sequence container. Next
add a data flow task as below.
After that double click on the data flow task and drag and drop an OLE DB source and configure it like below. Remember to select the Staging database that created in the SSMS as connection manager and select the staging table that you want to load into dimension table(Customer_STG).
Then check the data from preview and after that check the mappings are correct by selecting the columns. Next press ok.
In the staging database we gave all the columns for customer
as varchar. Let’s say if you want to change the data type of phone number into
integer in the customer dimension table, then you have to follow the steps
given below. Else you can directly connect the OLE DB source with the OLE DB
destination (Configuration for this is explained after data conversion steps).
First drag and drop a data conversion tool and connect that with the OLE DB source.
Then double click on the data conversion tool and tick the
column, which you want to change the data type. In this case the phone number. After that change the output alias if you want
and select the data type which you want. I choose eight-byte signed integer as
my data type for the phone number.
Then press ok.
Now drag and drop an OLE DB destination and connect that
with the data conversion tool.
Next double click on the OLE DB destination to
configure that.Then select the MonthlySales_DM as connection manager and select new near to the ‘Name of the table or the view’. After that edit the create statement as you want. Add an auto increment surrogate key as well.
Next check the mappings.
Then press ok.
After that execute the data flow task.
Next go to SSMS and select the table.
Like this create other 3 dimension tables also.
After that use an execute SQL task and write truncate statements to truncate each table.
Then connect that with the container as following.
Finally, we have to create the fact table using the transaction table. For that drag and drop another container and place a data flow task inside that.
After that double click on the data flow task. Then drag and drop an OLE DB source to import the transaction table ‘sales’ from SSMS.
Next drag and drop lookup tool and join it with the OLE DB source. Lookup tool is used to join the dimension tables with the transaction table to get the surrogate keys.
After that double click on the lookup and configure general as below.
Then go to connection and select the OLE DB connection and the dimension table as below.
You can see the data that imported from the dimension table
using preview. Next go to columns and map the transaction table
column with the suitable dimension table column (this is a foreign key mapping)
and get the surrogate key as below.
After that press ok.
Likewise create lookups for each dimension and connect them one after other. When you connect one lookup with another lookup it will show a dialog box. Configure it as following and press ok.
Finally connect it with an OLE DB destination and create the fact table for sales.
Finally execute the fact container and check the fact table from SSMS.
Now you can use this Data warehouse to analyze data.
Thank you for the illustratios. You clearly explain each stepts. Usefull for beginners.
ReplyDeleteIt was a nice blog clear and precise in content! keep up the good work!
ReplyDelete