Hello everyone. In this video, we will talk about partitions in SSIS. For the demo, we will use our ongoing solution SSIS customer, which consists of one multi dimensional project called SSIS customer inside which we have one cube called customer data warehouse. Once we open the queue, you will see a couple of taps and one of them is partitions and this is what today's topic is. In this tab, here you will see something called as fact customer. Now what is this fact customer fact the customer is nothing but it's the measure group table inside our queue.
Now before we go and talk about partitions in detail, and what is the purpose of this measure group table here, let's go and try to understand basics about partitions first. When we talk about Analysis Services, when we talk about SSIS we look at the cube As a single object, which holds all the Dimension Data and measure group data, but there is one point worth to understand here. And that is partition. partition is nothing but it's a simple object inside the cube which actually holds the complex measure group data. We can take partitions as sub containers, we have a big container called cube, which hold Dimension Data and some small sub containers called the partitions and those partitions will actually hold measure group table data. Best part is we are not required to create those partitions manually.
They get created automatically when cube get created. And they will be loaded with data when cube get processed. When we process the cube dimensions will be loaded with appropriate Dimension Data from the data warehouse, and partition will be loaded with appropriate measure group data from the data warehouse, it will happen automatically. Now in our example, we have only one measure group table called faxes. And that's why you can see here that one partition got automatically created for this fact customer measure group table. Now let's prove the fact that this partition actually contain the measure group data.
To do that, simply right click partition and say delete. say okay. Now remember guys, we deleted the partition. That means, measure group data will not be loaded into queue. Because there is no place where measured group data will be kept. When we process the cube, only Dimension Data get loaded to queue.
So let's go and deploy the cube right click You're a multi dimensional project and to deploy is deploying and processing. So deployment done, let's go to browser tab. And let's simply say reconnect drag customer amount from fact customer measure group table and as expecting will not get anything because partition related to fact, customer measure group table is already deleted. Next, let's go to dem country. Expand dem country sorry, country name, again country name, and you will see that data is still here. What does that mean?
That means when we process the cube, at that time, only Dimension Data get added to the cube. That measure group table didn't get added, because there is no partition related to that measure group table. So it's very important that every measure group table must have one at least one partition. So let's go and create it. For that, let's go to partitions. Let's click on new partition, say next.
And in this screen, we will do two things. One, we will select measure group table for which we want to create partition. Secondly, we will select that table from our data source, which contained data for our measure group table. So in our example, we don't have multiple measure group tables, we have only one measure group table that is fact customer, so we don't have any option, then selecting this one. Then to select that table from data source which contain data for our user group table, we will first choose look in and we will say we want to find it from this data source view is the okay click on Find tables. Click on that customer say next.
Let's ignore all the advanced steps. Say next, next. And in the last step, simply select this radio button design aggregation Later, we'll talk about this Options later in the stage. For now simply say finish and deploy the deployment done. Let's go to browser tab. Let's simply say reconnect.
Let's drag customer amount and customer amount is zero. That means now our cube actually contain fact customer measure group data. And I believe now you got a good idea about partitions, it's simply an object inside the cube, which contain measure group data. If there is no partition, then it there will not be any kind of measure group data. That means there will not be any output. So it's that simple.
In real time projects, it's not required to create such kind of partitions because these kinds of partitions will be created by default. When you create the cube automatically based on measure group table, these kind of partitions will be created and that's what happened in our case also, earlier, we deleted it and recreated it. So now it's time to understand now it's time to learn. When we actually go and create custom partitions, it's time to understand the real advantage of partitions. To get started with, let's talk about processing of cube in detail. So far in our project, whenever we thought about refreshing cube, whenever we thought about loading updated data from data warehouse to cube, we simply went right click the project and say process queue.
But when it comes to real time project, processing cube again and again is not considered as best option. In real time scenario, data warehouse will be very huge. It will consist of blacks, crows or billions of records inside it. Processing cube means processing complete data warehouse again and again. Which will be Very slow. And hence, in real time project to make this process faster, we will never process cube again and again what we do in the beginning, we deploy the cube and then we process the cube for the first time.
After that, in future, whenever we think about refreshing queue, instead of processing complete queue, we simply go and process part of the queue. Here, you can see dimension list is there. If you right click one of the dimension you will find option called as process. If you believe that only dimension changes are required inside a cube for now, simply go and process dimensions don't process cube entirely in the same way. If you believe that there are some changes in data warehouse related to measure group data You want those changes in queue. In that case, simply go and right click the partition and say process following this principle will give you our advantage of better processing performance.
Now still there is one small issue, we have dimension processing and we have partition processing. dimension means at the end of the day masters and end user won't go and create dimensions every day. He won't go and change dimensions every day. That means dimension processing will happen very rarely. In the reverse side, when it comes to partitions, partition hold measure group data. That means the whole transaction data and transaction is something which happened in system everyday end user will go and may add some new products every day he may go and buy some product he may go and sell something product.
These are some of these are the kind of operations which happen everyday. And hence, it's very, very required that partition get processed very frequently. And if we do processing of partition very frequently, our queue performance will be reused a lot. Because when it comes to real time projects, we will never have measure group tables, just with nine records in real time projects, we will have billions of records in one measure group table. And when we say process the partition that means load all of those billions records from that partition sorry from that measure group table and dump it into queue partition. And every time queue partition is reprocessed.
This billion record processing will happen every time. That means the processing time will be more and good news is in SQL Server Enterprise Edition, we have an option where we can Go and create more than one partition for each measure group table and distribute data across multiple partitions. To make it more clear, let's do a small demo right now or measure group table fact customer, one partition called factor customer. Let's rename it. Let's call it my partition one. Now, let's go and create one more partition for same major group table.
So for that, simply click on new partition, simply click Next. And just like before, select measure group table which is again in fact customer because we want to create new port partition for same measure group table and then select that table from data source which is going to contain data for our measure group table. It's packed customer, simply go and check this checkbox. But instead of getting this next button enable, we got this error error says the fact customer table is already used by my partition one and you have not supplied a query to restrict the rows. Let me make this explanation easier. Right now whatever partition we have that is my partition one contain all the data of fact customer measure group table.
That's why you cannot go and create one more partition Just think about it guys, you have one partition, which contain an entire measure group table, then how can you go and create one more partition which will take part of that measure group table. So in order to create one more partition for the same measure group table, first you have to say cancel and delete this existing data. For that simply right click and say delete. simply say okay, click on new partition, say next. Keep this measure group as fact customer, check this checkbox. say next.
Now, this screen is very important. If you remember last time, we kept this checkbox Uncheck. That means we want to create a partition, which will hold all the data of measure group table. But this time, our intention is something different. We want to create a partition, which will hold only a part of measure group data. So for that we have to say, specify a query to restrict rows, simply check this checkbox.
And then we have to put a where condition, we will put where condition like this, where ear off sales date is less than 2015. Now this is a simple select query, it will return all the records from fact customer table where sales date is less than 2015. Let's copy the query and simply go to SQL Server simply say new query. let's paste it here and simply say ephi. Or there is an error. What is the Let's select it again let's press f5 or it's correct.
Now you can see that we are getting eight records. Remember guys in our fact table we had nine records and one of them is of 2015. Remember this one. So this Sanjay record simply got skipped here. You can't see Sanjay anywhere. Okay, so query works perfectly fine.
It is returning all the records from fact customer table where sales date is less than 2015. Simply press next. This is the screen where we will set processing location and storage location. Processing location is nothing but it's the analysis server where or partition data get processed. By default it is set to current server instance. That means it is set to analysis server where our cube get deployed.
In order to reduce the overhead on current analysis server. We can simply To change the analysis server to some other remote server as well. But for this demo, let's keep it simple. Let's keep it current server instance. Now the second option is storage location. Now this is the place where by default partition data will be kept.
Now if you want, you can change the path as well. But for this example, let's keep it as it is simply press next. So here is the final step of the bizarre step where we will go and create aggregations for our partition. Now we are going to have a detailed video on aggregations very soon, so don't worry about that. For now, we will simply select this option design aggregations later. But still, you should understand one simple sentence.
More aggregations we create in partition, better will be the performance. But in Riverside, the size of the cube will start increasing. So it's very Simple. More aggregations means better performance, and bigger The bigger the cube. That's it, simply say finish and we are done with creating our custom partition. Let's simply rename this partition to less than 2010 sorry, it should be 2015.
Okay, now remember guys, this partition contain only those records, which are, which happened before 2015 it won't contain all the data so we have to create one more partition. So simply go and say new partition. say next. Again the same setting start customer fact customer again. say next. Check this checkbox again.
But this time, the condition will be year. Sales date greater than 2015. Now this is the biggest mistake Most of the time happens in SSIS. Remember guys, last time the query was year off sales date less than 2015. And right now we are putting it as greater than 2015. That means it will skip 2015 Records, it should not be greater than it should be greater than or equal to, or we should say create three partitions one less than 2015, one greater than 2015 and one equal to 2015.
When we create custom partition, one of the biggest mistake we do is we write wrong queries. So, be careful about that it should be greater than or equal to let's copy this query and let's confirm it once. Let's go to SQL Server. Let's put this query here. Let's simply press f5. Yes, we are getting one record perfectly fine.
Let's go to Visual Studio. simply say next. Keep all everything as it is simply say next. Again. We don't want to create aggregation for this partition as well. Design aggregation later To say finish, let's rename this new partition as greater than or equal to 2015.
Writing the multi dimensional project, say deploy, deploying and processing. Now definitely once the deployment is done, you will go and browse the cube. But before that I'd like to talk about one important concept. Right click your multi dimensional project, go to properties. In deployment section, you will find a property called processing option, which will have three possible values default, do not process and full. Now let's understand what this property is actually about.
When we say right click and deploy what exactly happens? It will simply deploy our cube to Analysis Services. But what after that, after that, or cube will get processed? That's what our understanding is. But in reality, it's not true. When we say deploy the cube, after deployment, whether our cube get processed or not, it depends on this property.
If we set this option to full, that means, after every deployment processing will happen, every time we go and deploy the queue, it will continue with processing of the queue. If we set it to do not process that means, if we deploy the queue, it will not continue with processing of the cube. for processing, we have to right click the queue project explicitly and we have to say process. In real time project, do not process is the best option. We simply go and manually deploy the cube, we simply go and manually process the cube. And once both of them are done later in the future, we simply go and manually process either dimensions or partitions based on our requirement.
Now other than these two options, we have one more option that is default. Now let's talk about this. It says, process the cube only if processing is not already done. If cube is already processed, don't process it again. if q is not processed, then process it. Now let's follow the best practice in the industry.
Let's keep it Do not process, let's say Apply. Okay? Now right click it, say deploy for the first time deployment done. Right click and say process. Unfortunately, because of screen resolution, you can't see this button. So let me maximize it.
Now you can see this run button here. Simply click on this run button, it will start processing the queue and processing done. Let's say close. Close. Let's go to browser. Let's say reconnect.
And let's simply drag customer amount from our fact customer measure group table and we are getting something like 4464 here. Let's go to STD server and reconfirm the data. In SQL Server, we have this customer data warehouse. Let's simply say right click and new query. And let's write down a simple Query SELECT sum of Max sorry my mistake, sum of customer amount. From fact customer it will do or it will return the total customer amount, let's say execute and we are getting 4464 that means, the total value in data warehouse and total value in cube is matching.
It means everything is correct. The only difference is when it comes to cube. This whole 464 is the addition of total customer amount of this partition plus total customer amount of this partition. Now to now to prove this point, let's do one more demonstration. Remember that we have two partitions, one partition hold data Where's date is less than 2015 and one partition hold data where sales date is greater than 2015 greater than and equal to, let's go to SQL Server. And let's open this fact customer table.
And here we have 2015 data, let's simply make this 124021250 we just increase 2015 data by 10. Let's do for 2010 as well. Let's make it 22 that means we are increasing it by 10. That means now, total sum will be increased by 20. So instead of 4464, we should see 448 for now. But it needs processing but this time we will not go and process cube instead of that.
We will process individual partitions. So let's process first partition first say process. Let's maximize it, say run. Its processing. It's done. Let's say close.
Close now. Do the same for second partition, simply say process, maximize it, say run. Close, close. Now let's go to browser. Let's say reconnect. Now we should see 4484.
So let's drag customer amount. And here it is, we are getting the reflection. That means everything is working. Now let's go back to SQL Server. And let's change one to 502. Again, 1240 and 22 to 12.
Again, that means again, we should get 4464. Let's go back to our Visual Studio. Let's go to partitions. And this time, let's go and execute sorry, let's go and process the first partition that is 2015 partition, right click, say process, maximize it, say run. Let it to let it complete. It's done, say close.
Close. Let's go to browser tab. Let's see Connect guys. I didn't part process the second partition that means we should get 4474 Okay, let's go and just drag customer amount here it is 4474 but the actual data inside data warehouses 4464 we are getting 10 more why because we have not processed this the can partition, it proves that data resides inside these partitions and we get reflection of only those data which we processed. So, this is all about basics on partitions. We are going to talk about partitions once again when when it comes to aggregations and storage modes.
Before we go and end this video, let's try to recollect what all things we learned. We started with definition of partitions, then we understood how we can make processing more efficient And faster by processing individual dimensions and partitions instead of processing one big queue. And after that we learned we can take our processing speed to next level by breaking one big partition to multiple small partitions logically. So make sure that you practice it very well. And if you have any question, don't forget to drop a mail. Thank you very much.