Wednesday, July 25, 2018

Parameterization or Dummy coding in Logistic Regression (SAS) - Explained with example

Why is it important to understand the parameterization methods (Dummy coding) in logistic regression analysis?

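There are different types of parameterization methods, and whichever one you use, the overall model fit and the predicted probabilities come out the same. The parameter estimates, however, mean different things under each method, so understanding these methods will help you interpret the regression output correctly.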

Logistic regression assumes that the logit (log-odds) of the response variable has a linear relationship with the predictors. However, if one or more of your predictor variables are categorical, their levels have no numeric scale on which this linearity could hold, so we need dummy variables (dummy coding), usually taking the values 0 or 1. Each level of a categorical variable is then represented by its own pattern of values across these dummy variables.


The two methods covered here are a) Effect Coding and b) Reference Cell Coding.

Scenario: Let's say you are trying to predict which customers will Churn (leave) your company. You built a logistic regression model and one of the predictors is Contract Status of the customer.

1) Effect Coding: This is the default in SAS (PROC LOGISTIC). Let's say the variable Contract Status of a customer has three values/levels - a) In_Contract: customers who are still in contract, b) Soon_Out_Of_Contract: customers who are soon coming out of contract and c) Out_Of_Contract: customers who are already out of contract. In Effect Coding, SAS compares the effect of each level of a variable with the average effect of all levels of that variable, measured on the log-odds scale. I.e. it compares the likelihood of a customer churning (leaving) when they are In_Contract vs the average across all three levels. Similarly, it compares the likelihood of a customer churning in the second level, Soon_Out_Of_Contract, vs the average across all three levels. SAS repeats this for all levels of that variable.
2) Reference Cell Coding: In Reference Cell Coding, you have to choose one of the levels as a reference level. SAS will compare each of the other levels against the chosen reference level. For example, if you choose 'Out_Of_Contract' as the reference level in your regression analysis, SAS will compare 'In_Contract' against 'Out_Of_Contract' and 'Soon_Out_Of_Contract' against 'Out_Of_Contract'.
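
To make the two schemes concrete, here is a minimal sketch of the dummy columns each one produces for the three Contract Status levels. It is written in Python purely for illustration and is not SAS code. The function names are hypothetical, and 'Out_Of_Contract' is assumed to be the reference level, as in the example above.

```python
# Illustrative only: mimics the dummy (design) columns for a three-level
# variable under the two coding schemes. This is not SAS code.

LEVELS = ["In_Contract", "Soon_Out_Of_Contract", "Out_Of_Contract"]
REFERENCE = "Out_Of_Contract"  # assumed reference level, as in the example


def reference_coding(level, levels=LEVELS, reference=REFERENCE):
    """One 0/1 column per non-reference level; the reference row is all 0."""
    return [1 if level == other else 0 for other in levels if other != reference]


def effect_coding(level, levels=LEVELS, reference=REFERENCE):
    """Same columns as reference coding, except the reference row is all -1."""
    if level == reference:
        return [-1] * (len(levels) - 1)
    return reference_coding(level, levels, reference)


# Print the two coding patterns side by side for every level.
for level in LEVELS:
    print(f"{level:22} reference={reference_coding(level)} effect={effect_coding(level)}")
```

Both schemes use two columns for three levels. The only difference is the row for the reference level: all zeros under reference cell coding, all -1 under effect coding. That difference is what changes the meaning of the estimates.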

Understanding which parameterization method is used in the analysis will help you read and interpret the output correctly.
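
As a rough sketch of why the reading differs, the Python snippet below (again illustrative, not SAS) derives the estimates each method would report for a model whose only predictor is Contract Status. The churn rates are hypothetical numbers invented for this example, and 'Out_Of_Contract' is assumed to be the reference level. With a single categorical predictor, the fitted probability for each level equals that level's observed churn rate, so the estimates can be worked out by hand.

```python
import math

# Hypothetical churn rates per level, invented purely for illustration.
churn_rate = {
    "In_Contract": 0.05,
    "Soon_Out_Of_Contract": 0.20,
    "Out_Of_Contract": 0.40,
}
REFERENCE = "Out_Of_Contract"  # assumed reference level


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def inverse_logit(x):
    """Probability corresponding to a log-odds value."""
    return 1 / (1 + math.exp(-x))


log_odds = {level: logit(p) for level, p in churn_rate.items()}

# Reference cell coding: the intercept is the log-odds of the reference level,
# and each estimate is a level's log-odds minus the reference log-odds.
ref_intercept = log_odds[REFERENCE]
ref_coef = {lvl: lo - ref_intercept for lvl, lo in log_odds.items() if lvl != REFERENCE}

# Effect coding: the intercept is the unweighted mean of the level log-odds,
# and each estimate is a level's log-odds minus that mean.
eff_intercept = sum(log_odds.values()) / len(log_odds)
eff_coef = {lvl: lo - eff_intercept for lvl, lo in log_odds.items() if lvl != REFERENCE}
# The reference level's effect is implied: minus the sum of the other effects.
eff_coef_reference = -sum(eff_coef.values())


def predict_reference(level):
    return inverse_logit(ref_intercept + ref_coef.get(level, 0.0))


def predict_effect(level):
    return inverse_logit(eff_intercept + eff_coef.get(level, eff_coef_reference))


# The estimates differ between the methods, but the predictions do not.
for level in churn_rate:
    print(level, round(predict_reference(level), 4), round(predict_effect(level), 4))
```

The intercepts and estimates are different numbers under the two methods, yet both reproduce the same churn probability for every level. An estimate under reference cell coding is a difference from the reference level, while an estimate under effect coding is a difference from the average of all levels.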

The other article (Writing equation for logistic regression) shows how to write code for the above parameterization methods.
