In this video we will divide the data into two parts, training and validation. Part of the data will be allocated to training and part to validation, and we will build our model on the training data. We will also do stepwise selection on our data. So let's do it. First we create two data sets, training and validation. As we haven't specified any library name in the DATA statement, both data sets will be created inside WORK. Then comes SET, followed by my library name, a dot, and the data set name, linear_reg_retail. SET is the keyword where we specify the name of our input data set.
So our original data set, linear_reg_retail, which is located inside my library, is the input data set. Its observations are copied into the two data sets we are creating, training and validation, so each observation of linear_reg_retail ends up in one of the two. Both training and validation will be created in WORK. Now we have to specify the proportion in which the data set is divided. For that I'm using ranuni(0). ranuni with 0 in brackets is a function that generates random numbers, and we use it here to divide the data into two parts.
Because we are generating random numbers, the division will not be exact; it will be approximate. We have specified ranuni(0) less than 0.7, so about 70% of the observations go to the data set training and the remaining 30% or so go to the data set validation. In my original data set the total number of observations was 200, and 70% of 200 is 140. So around 140 observations should go to training and around 60 to validation. But because the division is driven by the random numbers from ranuni, exactly 70% and 30% will not go there; the division will be an approximation.
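As a rough check on how approximate the split can be, assume each of the 200 observations lands in training independently with probability 0.7, which is what comparing a uniform random number with 0.7 amounts to. Under that assumption the training count N is binomial:

```latex
N \sim \mathrm{Binomial}(200,\ 0.7), \qquad
\mathbb{E}[N] = 200 \times 0.7 = 140, \qquad
\mathrm{sd}(N) = \sqrt{200 \times 0.7 \times 0.3} = \sqrt{42} \approx 6.5
```

So a training count that is a handful of observations above or below 140 is entirely ordinary.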
And because we have generated random numbers, every time we run this code we will get a different set of results. Before I run the code, let me explain it once more. Here we have created the data sets training and validation from the original data set linear_reg_retail stored in my library. ranuni(0) generates random numbers; about 70% of the data goes to training and the remaining 30% goes to validation. Because we have generated random numbers, the division will not be exactly 70/30; it will be an approximation. So let's run this code first.
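The DATA step described here probably looks something like the sketch below. The code itself is not shown in the transcript, so treat this as a reconstruction: the libref mylib is a placeholder for the speaker's library (its name is not clear from the recording), and the IF/ELSE with OUTPUT is the usual way to route rows, assumed rather than confirmed.

```sas
/* mylib is a placeholder libref, assumed to be assigned already. */
data training validation;            /* no libref given, so both go to WORK */
    set mylib.linear_reg_retail;     /* input data set */
    /* ranuni(0) returns a uniform random number between 0 and 1 */
    if ranuni(0) < 0.7 then output training;   /* about 70% of the rows */
    else output validation;                     /* the remaining rows */
run;
```

Each observation is written to exactly one of the two data sets, so the two counts always add up to the size of the input.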
As we did not specify any library name, the two data sets are created in WORK. So let's open the WORK library. See, this is the training data set. In training there are 143 observations; as I told you, exactly 140 observations will not be allocated to training. Validation has the remaining observations, that is 57. Okay, so now the division of the data is done in a 70 to 30 ratio; you may vary the ratio according to your own choice. Now we'll do stepwise selection and build a model based on our training data.
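If a repeatable split is wanted, one option (not what the video does) is to pass a positive seed to ranuni instead of 0, since a seed of 0 takes its starting value from the clock. The sketch below assumes the same placeholder libref mylib and an arbitrary seed of 12345, and then counts the rows in each data set instead of opening them by hand.

```sas
/* mylib and the seed 12345 are illustrative choices, not from the video. */
data training validation;
    set mylib.linear_reg_retail;
    if ranuni(12345) < 0.7 then output training;   /* same rows on every run */
    else output validation;
run;

/* Check how many observations landed in each data set. */
proc sql;
    select count(*) as n_training   from work.training;
    select count(*) as n_validation from work.validation;
quit;
```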
So we will be using the procedure PROC REG, data equal to training. MODEL is a keyword; we use it here to create the classical linear regression model. Customer satisfaction is my dependent variable. From product quality till price flexibility, they are all my independent variables. We are going to do stepwise selection using the keyword for adjusted R-square. Then we are using the statement RUN and then QUIT.
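A regression with selection on adjusted R-square, as described, could be written roughly as follows. This is a sketch under assumptions: the variable names are placeholders (the transcript only gives spoken names), and the double-dash range assumes the independent variables sit next to each other in the data set, from product quality to price flexibility.

```sas
proc reg data=training;
    /* Placeholder variable names; listing each regressor explicitly also works. */
    model customer_satisfaction = product_quality -- price_flexibility
          / selection=adjrsq;        /* rank candidate models by adjusted R-square */
run;
quit;
```

In PROC REG, SELECTION=ADJRSQ compares subsets of the regressors and lists them by adjusted R-square; SELECTION=STEPWISE is a separate option that adds and removes variables by significance level.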
So let's run the code. We have built the classical linear regression model on the training data, with customer satisfaction as our dependent variable and the independent variables from product quality to price flexibility. We are doing stepwise selection, which is done to select the set of significant variables. The set of significant variables we get here is what we will use in future, in the next regression procedures, to predict our dependent variable. So here I am doing stepwise selection using the technique of adjusted R-square. So let's run this code. See, here the stepwise selection is done: in every step one or another of the variables is either added or removed, based on the values of adjusted R-square and R-square. The steps are sorted in descending order of adjusted R-square.
So the step where my adjusted R-square value is maximum, whatever independent variables are in that step, those will be my significant variables that I will use in future to predict my dependent variable. See, here the maximum adjusted R-square value is 0.8014, that is, 80.14%. The independent variables that are significant are product quality, e-commerce, advertising, product line, salesforce image, competitive pricing, packaging, order billing and price flexibility. So out of all the variables from product quality to price flexibility, only this set, around nine independent variables, is taken as significant for our model. This result will change every time we run the code, because we have generated random numbers.
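For reference, the standard definition of adjusted R-square, with n the number of observations used to fit the model and p the number of independent variables (not counting the intercept), is:

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Plain R-square never decreases when a variable is added, while the factor (n - 1)/(n - p - 1) grows with p, so adjusted R-square only rises when a new variable improves the fit by more than that penalty. That is why models with different numbers of variables can be compared on it.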
So now let me copy this set of variables and keep it, because I need it for my future purpose. I'm keeping this set of variables over here and converting them into comments, because bare variable names written in the editor window would not be accepted as code. So these are the set of independent variables. In this video we'll go till here only. So let's end the video over here.
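Keeping the selected variables as a comment might look like the sketch below. The names are placeholders based on the spoken variable names, and the %let line is an optional alternative that is not part of the video: it stores the same list in a macro variable (here called selected_vars, a made-up name) so a later MODEL statement can reuse it.

```sas
/* Significant independent variables from the training run (placeholder names):
   product_quality e_commerce advertising product_line salesforce_image
   competitive_pricing packaging order_billing price_flexibility */

/* Optional: keep the list in a macro variable for later procedures. */
%let selected_vars = product_quality e_commerce advertising product_line
                     salesforce_image competitive_pricing packaging
                     order_billing price_flexibility;
```

A later step could then write `model customer_satisfaction = &selected_vars;` instead of retyping the list.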
Thank you. Goodbye. See you next week.