Wednesday, July 25, 2018

Parameterization or Dummy coding in Logistic Regression (SAS) - Explained with example

Why is it important to understand the parameterization methods (Dummy coding) in logistic regression analysis?

There are different types of Parameterization methods and irrespective of what you use, you will get similar output. However, understanding these methods will help you interpret the regression output correctly.

Logistic Regression works on the assumption that the Logit transformation of the response variable should have a linear relationship with the predictors. However if one (or more) of your predictor variable is categorical,  this assumption of linearity cannot be met, hence we need dummy variables (called dummy coding), usually in the form of 0 or 1. This is required for each level of the categorical variables.


The two methods are  a) Effect Coding and b) Reference Cell Coding. 

Scenario: Let's say you are trying to predict which customers will Churn (leave) your company. You built a logistic regression model and one of the predictors is Contract Status of the customer.

1) Effect Coding: This is default in SAS. Lets say variable Contract Status of a customer has three values/levels - a) In_Contract: Customers who are still in contract, b)Soon_Out_Of_Contract: customers who are soon coming out of contract and c) Out_Of_Contract. In Effect Coding, SAS compares the effect of each level of a variable with the average effect of all levels of that variable. I.e. It will compares the likelihood of a customer churning (leaving) when he is In_Contract Vs the average likelihood of all three levels. Similarly, it will compare the likelihood of customer churning in second level, Soon_Out_Of_Contract, Vs the average effect of all three levels. SAS repeats this for all levels of that variable.
2) Reference Cell coding : In Reference cell coding, you have to choose one of the levels as a reference level. SAS will compare all the other levels against the chosen reference level. Example, if you choose 'Out_Of_Contract' as a reference level in your regression analysis, SAS will compare 'Out_of_Contract' against 'In_Contract' and 'Out_of_Contract' against 'Soon_Of_Contract'.

Understanding which parametrization method is used in the analysis will help you read and interpret the output correctly. 

The other article (Writing equation for logistic regression) shows how to write code for the above parametrization methods.

Thursday, August 7, 2014

How to automate your code in SAS - Three ways to write Macros

I have given three types - “How to write Macros in SAS”. This is for beginners, so I hope I have made it easy without compromising on technical details.

Before knowing How, it is important to know Why. So I will start with a real scenario and explain Why Macros are important in everyday programming.

Macros are used to automate your program so that you don’t have to write the same piece of code again and again. This automation can be as simple as asking SAS to remember a single value to remember complex joins or even remembering the pages of code which you want to re-use.   

Scenario:
You have been requested to write a report which looks at all customers who bought product “Shoes” launched after 01-Jan-2014. This report does several things like number of pairs sold, region of sales, weekly calls in the call centre, summarizes data, exports to excel etc. Let’s say your code is around 1000 lines. What if you need to look at different products with different launch date every week? One way is to look through 1000 lines of code every week and change Product and launch Date wherever it appears in the code OR something like “Find and Replace” would be a life saver here?

Suppose this was your original code (1000 + lines)...

Data New_Table;
Set Old_Table;
where Product_Name = &Product_N. and Launch_Date = &Launch_D.
run;
----
-----

Type 1 -%Let (Easiest of all Macros - 'Find and Replace'). This works as Find and Replace in MS Word or Excel

This is how you will create a Macro before you start writing your code:

%let Product_N = “Shoes”; 
%let Launch_D = “01JAN2014”;

Followed by your original code……

How this works:

a) Create a Macros: %let Product_N = “Shoes”; 
  • %Let keyword tells SAS that you have created a Macro.
  • Then you give a name to your macro - here you have created two macros Product_N and Launch_D. This can be anything but give it a name which is easy to remember when you read the code next time.
  • Then you give that macro a value with the help of '=' sign. This is the value that can change. For example Product and Launch Date can change next time. 
b) Call the Macro that you have created above in your code: 
  •  Instead of using values of the variables in your code, now you use the Macro names (&Product_N.) with & sign. For example: where Product_Name = &Product_N.  
So basically it works like this - Where ever SAS finds &Product_N in your code, it will replace it with the value you have given to the macro when you declared it. In this case its "Shoes".

 Next time, all you need to do is change the value of the Macro (%let = ) and SAS will replace the Macro (&Product_N) with this value. 


Type 2 – Call Symput

In %let example you saw above, you can easily assign values to the Macro variables after = sign. Another common scenario is that when you want SAS to calculate a value and then assign it to a Macro variable automatically. 

Suppose you want to compare sales at two different dates - a day before product was launched and sales a day after Launch_date. So basically you want SAS to look into products table and calculate two new dates a) Date_before_launch and b) Date_after_launch from Launch_Date. 

You can achieve this easily by call symput routine. Your code will look something like this:

Data _null_;
Set Products;
Call Symput (‘Date_before_launch’, Launch_Date – 1);
Call Symput (‘Date_after_launch’, Launch_Date + 1);
Run;

How this works: 
Data _Null_: You don’t need to create a new table so call it _null_.
Call Symput: Call Symput is a call routine in SAS. It has two parameters separated by a comma (,). First parameter is the Macro_Name (name you want to give to your macro) and second parameter is the value calculated from some existing variable - in this case Launch_Date - 1. 

You use this macro in the same way as you did in example 1:

where sales_date = &date_before_launch.;

Few things to remember while using Call Symput:

  • Macro variables created with Call_Symput call routine can only be used in the next SAS program. It cannot be used in the same program.
  • It can be used both for Character and numeric variables.
  • You can also mention pre-defined macro in SAS as the second parameter. For example, to calculate yesterday’s date, you can do: 
    • Call Symput(“Yest_Date”,today()-1); /*Today() is the pre-defined variable in sas which stores today's date*/
  
Type 3–Proc SQL Into (This is easy and my favorite)

What you did earlier with Call Symput, you can do the same with Proc SQL’s Into statement. Same results can be achieved with this code:

Proc sql noprint;
Select (Launch_Date – 1) into :date_before_launch                  /*This will be the macro*/
From Products;
Quit; 

Another Example: Suppose you want to create a macro “Rows” which stores number of rows in Products table and Maximum sales into macro “Max_Sales”.

Proc SQL noprint;
Select Count(*) into :Rows
Max(Sales) into :Max_Sales
From Products;
Quit; 

Any comments / suggestions, please post below. 


Sunday, August 3, 2014

Why Proc SQL when everything can be done with SAS Base?

Why Proc SQL when everything can be done with SAS Base?

Someone with a year to two years of experience is more likely to ask this question.

First of all, I agree that SAS Base is a complete data analysis tool and in my opinion, its one of the most efficient tools available in the market to handle large volumes of data with excellent processing speed. Also, no doubt that SAS is a winner in advanced statistical analysis like Predictive Analytics.

So why Proc SQL?

If you know only these few things in SQL, you would never ever need to look a those complex merges and daunting procs in SAS Base.In this post I will only write Why and not How. So you will have to search these methods and get yourself familiar with them.

Joins: Left, Right and Inner 
Very easy to understand. You will find lot of videos on youtube explaining different types of joins in SQL. Why Joins and Not Merges - You don't need to sort your tables before joining and 'By Variable' doesn't need to have same name, so no renaming of variables.

Proc SQL summary functions: SUM/MEAN/AVG
Many times you have to sum-up data in SAS and to do that you have to use Proc Means.  Proc Means comes with complex procedure and output format. I don't like the format of the tables that you get with the Output Out = statement.
With Proc SQL its very easy to summarize your data with help of Group by Statement. Best source is Youtube to learn Group By statement in SQL.


Simple Sub-queries in SQL are very handy. Don't go too complex. 
Suppose you want to select everything from table A where there are no matching records in table B. In SAS, again get all things correct first like same variable name, sorting all tables and then merge them. In Proc SQL, this simple statement will work:

select * from A where Name not in (Select Name from B);

No Sorting, No rename of variables.


Again, you will find lot of video on SQL Joins, Group By Statement and Sub-Queries in SQL on youtube. Most of the SQL code works in Proc SQL.

Summing up in SAS or SQL - SUM function with Group By statement

There can be two ways to sum-up numeric data in a table - horizontally (row sum) or vertically (Column sum). I will try to explain both separately in SAS and SQL, you can choose which ever you find easy. Personally I use SQL for both because code is easy to write and understand...

Example:

Name        Salary      Bonus      Year        Gross_Income
A               100         10            2013
B                200        15            2013
A               150         15            2013
A               80           12            2014


Horizontal summation (Row Sum) - You need to calculate Gross_Income for each row. 


SQL 
Proc SQl;
Create table New as
select *, SUM(Salary,Bonus) as Gross_Income
from Old;
quit;
The above code is self explanatory again. Creating a new variable Gross_Income and using SUM function in Proc SQL to sum up all the variables in parenthesis. (* brings all the existing variables as well in the final table)

SAS 
Data New;
Set Old;
Gross_Income = SUM(Salary,Bonus);
run;
The above code is self explanatory. Creating a new variable Gross_Income and using SUM function to sum up all the variables in parenthesis.



Vertical Summation - Summing by Groups: You need to find Total_salary per person per year. That means you need to add up total for A and B for each year. 


Proc SQL: Group by SUM
Proc SQL;
Create table New_Table as
select Name, Year, SUM(Salary)  as Yearly_Salary
from Old_Table
Group by Name, Year;
Quit;

*Same SUM function that you used for horizontal summation followed by new variable name.
*Add Group By Statement and mention the Category or Group Level at which you want to sum up. In this case, Its per person per Year.


SAS (using class statement):
Proc Means data = Old_table noprint;
VAR Salary;
CLASS Name Year;
output out = New_table
SUM = SUM_Salary;
run;

*You have to use Procedure Means to sum up vertically in SAS.
*In VAR Statement: You need to mention all the variables you want to sum up.
*In CLASS Statement: You need to mention the Category or Group Level at which you want to sum up. In this case, Its per person per Year.
*You have to use "Output Out=" statement to get results in the new_table. Otherwise results will be just print in the output window and no table will be created.
* You have to give new variable a name. In this case its Sum_Salary. If you do not give name to new variable that you are creating, SAS will give it a default name which might be confusing you have too many variables to sum up or find averages.



Hence I find Proc SQL easy for summations or averages.

Any questions, please comment.










Friday, August 1, 2014

Up Coming Posts on SAS/SQL

I am thinking to write on few advanced topics on SAS. I will  keep adding/removing list of posts as I publish them:

SAS Hash Joins Vs SQL Joins - Video
Arrays in SAS - Make your data preparation easy in Modeling - Video
Three different ways to create Macros - with Examples - Video
Basic Statistics in Predictive modelling - Article
T-Tests - Article
ANOVA - Video
Linear Regression - Video
Logistic Regression - Video


Thursday, July 31, 2014

Top Five Functions in SAS/SQL with examples. Most commonly used SAS functions.

Based on my experience, here are the most commonly used functions in SAS which you would end up using in almost every program. All are equally important so read them all. 

SUBSTR : It’s used to extract a part of a string. It’s used only with character variables.

Syntax: SUBSTR(Variable_name, Argument_1, Argument_2)
Argument_1 = Starting position from where you want SAS to start extracting.
Argument_2 = Number of characters to extract.

Example: A company has assigned four digit product_ID to each product. You are asked to combine all the Product_IDs starting with 35, 36 and 37 into one Product Category “Mens Shoes” and product IDs starting with 22, 23 and 24 as second product category “Kids Shoes”.
Mens_Shoes = SUBSTR(Product_id,1,2) IN (“35”,”36”,”37”);
Kids_Shoes = SUBSTR(Product_ID,1,2) IN (“22”,”23”,”24”);

IN is the SAS statement which means any of these values in parenthesis.
It’s very common that you have to extract a part of the string in normal day to day analysis job so remember SUBSTR.



COMPRESS: Removes ALL the blanks from a string.

Syntax: COMPRESS(Variable_name)

Example: Sometimes your data in not in the correct format and might have some leading or trailing blanks or you may want to remove all the blanks from a variable “Phone_number”. This usually is the case when you combine data from different sources.
Emp_ID = COMPRESS(Phone_Number).
There are two more similar functions in SAS a) Strip – Strips removes the leading blanks and b) Trim: Removes the trailing blanks.


UPCASE: To capitalize all the characters in a String.

Syntax: UPCASE(Variable_name)

Example: It’s usually used in conjunction with other functions. For example: From variable Products, you want to select any products that start with ‘Men’. This column might contain values like Mens_Shoes, Mens_Shirts, MENS_Socks, MEN_Jackets etc. I am sure you observed that few values have different formats like Upper Case, Lower Case, Capitals etc and also, one of them is MEN (Not Mens). We can combine two functions here – SUBSTR and UPCASE.
If SUBSTR(UPCASE(Products),1,3) EQ “MEN” then New_Product = “Men_Clothing”;
SAS will use the inner most function first, in this UPCASE and then then the SUBSTR function to match values.



It’s very common that you have to convert variable type in SAS. So here are two main functions in SAS which are very frequently used.

PUT: To convert a variable from Numeric to Character. Also used for date conversions.

Syntax: PUT(Variable_Name,format)

Format is the character format you want to assign to new variable.
Example: Taking the same Product_ID example. You have Product_ID stored as Numeric. You want to extract first two characters of Product_ID and if its 35, 36 or 37 then product is Mens_Shoes. However, as the product_ID is numeric, you cannot use the SUBSTR function. So you have to first convert the Numeric to Character variable and then perform SUBSTR. Here you go…
If SUBSTR(PUT(Product_ID,$8.),1,2) IN (“35”,”36”,”37”) then Product EQ “Mens_Shoes”;



INPUT: To convert a variable from Character to Numeric. Also used for Date Conversions.

Syntax: INPUT(Variable_Name, format);
Format is the numeric format you want to assign to new variable.

Example: Suppose you want to match two tables based on variable Product_ID. However in one of the tables you have product_id as numeric and in other table its character format. So you have to change format of these variables. In this example, let’s change numeric variable to a character.
New_Product_ID = INPUT(Product_ID,8.);
Then you can rename, Drop or keep this variable (as you wish) and join with other table. 


Last and most powerful
INDEX: This is one of the most commonly used functions in SAS. I said this for almost all the functions above and I meant it for all of them – they are all important. They are used in everyday programming and very often asked in interviews.
INDEX is used to find Character/word or part of string from a Variable.

Syntax: INDEX(Variable_Name,”MEN”).

Example: In SUBTSR, you knew the starting position and number of characters you want to extract. But what if you want to see if the word “MEN” appears anywhere in the Variable then assign it to a Category – Mens_Clothing. Here you do not know the exact position as word MEN as appear anywhere in the variable. Here is the solution:
IF INDEX(UPCASE(Product),”MEN”) then Category = “MENS_CLOTHING”;
I used UPCASE just in case if MEN appear in the lower case in the variable. 


Ya I know, I said five but I gave you Six. Just couldn’t help.

Any questions/doubts/suggestions/complaints, please comment below.

Sunday, July 27, 2014

Finally Passed "SAS Certified Statistical Business Analyst Using SAS 9: Regression and Modeling"

I postponed it at least five times and finally took this exam last Friday of March-14 . I think I did the right thing in postponing it because whenever I looked at the Example questions on SAS website, I felt I wasn't prepared enough and test is exactly like Example questions. Here is my take of this exam:

Difficultly Level (0-10) - I would say 7.

No need to worry about time - There are 65 questions in 2 hours. I took really long for few questions and still I managed to complete test in 1.5 hours including review. So ignore the time factor.

Test is well distributed across all stages of Predictive modelling. But If I still have to weight it, I would say40% of questions are related to anlaysing the output of the Model.

Theoretical knowledge is must. It tests your basic theoretical knowledge as well Like method names, type of tests, options/curves used etc.

I took "Predictive Modeling Using Logistic Regression" training last year. It is very relevant to this exam. So if you can get hold of this training material, it would definitely help. 

Comment it if need any more information or Contact me.