## How do I include categorical predictors in my regression model?

One common problem researchers face when running a regression analysis is how to include categorical predictors. Unlike continuous variables, which you can simply add with no prior manipulation, categorical predictors require some extra work, both in performing the analysis and in interpreting the results.

Let’s start with the simplest case: binary variables, that is, categorical predictors with two levels. The first step is to transform the predictor into a 0/1 variable, also known as a dummy variable. Imagine a simple regression model where the dependent variable is salary and the only predictor is gender, which has been coded as 1 if “Male” and 2 if “Female”. We will first need to recode it as 0 if “Male” and 1 if “Female” (or vice versa). The category coded 0 is known as the reference group or reference category. The coefficient for the dummy variable is interpreted as the average difference in the dependent variable between the two levels of the binary predictor. In our example, the coefficient is the difference in average salary between women and men, that is, the group coded 1 minus the reference group. With this coding, a positive and significant coefficient indicates that, on average, women have higher salaries than men; a negative and significant coefficient indicates the opposite.
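The recoding and the interpretation of the coefficient can be checked with a minimal sketch. The data below are made up for illustration, and `simple_ols` is a hypothetical helper (not from any library) that computes the least-squares intercept and slope for a single predictor.

```python
from statistics import mean

# Hypothetical data: gender coded 1 = Male, 2 = Female; salary in thousands.
gender_raw = [1, 1, 1, 2, 2, 2]
salary = [50.0, 54.0, 52.0, 47.0, 49.0, 48.0]

# Recode into a 0/1 dummy: Male (reference group) = 0, Female = 1.
female = [0 if g == 1 else 1 for g in gender_raw]


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = simple_ols(female, salary)

# Group means, computed directly for comparison with the regression output.
mean_male = mean([s for f, s in zip(female, salary) if f == 0])
mean_female = mean([s for f, s in zip(female, salary) if f == 1])

print("intercept:", intercept, "| mean salary, men:", mean_male)
print("slope:", slope, "| women minus men:", mean_female - mean_male)
```

In this made-up sample the intercept matches the average salary of the reference group (men), and the slope matches the average salary of women minus that of men. The slope is negative here, so in this sample men earn more on average.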

Now, let’s look at the case of more than two categories. A categorical predictor with k categories needs to be transformed into k-1 dummy variables before being entered in the model. This process of creating dichotomous variables from a categorical variable is known as dummy coding. For the sake of simplicity, we will consider a categorical predictor with three levels, which requires two dummy variables in the model. For example, let’s consider the categorical variable education (highest level of studies completed), coded as 1 “High School or less”, 2 “College”, and 3 “Advanced graduate degree”. To dummy code this variable, we create a variable called HSorless that is 1 when education is 1 and 0 otherwise. Likewise, we create College, which is 1 if education is 2 and 0 otherwise, and Advanced, which is 1 if education is 3 and 0 otherwise. We will only use two of the three dummies in the regression analysis. For instance, we could choose to include HSorless and College and leave Advanced as the reference category, or we could include College and Advanced and leave HSorless as the reference category. The choice of reference category depends on your research interest. If you want to see how having a college or advanced degree contributes to average salary compared to having high school or less, then HSorless is the appropriate reference group.
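The dummy coding described above can be sketched in a few lines. This is only an illustration: the data are made up, and `dummy_code` and `LEVEL_NAMES` are hypothetical names, not part of any statistics package. Most statistical software offers its own way of doing the same thing.

```python
# Hypothetical education codes:
# 1 = High School or less, 2 = College, 3 = Advanced graduate degree.
education = [1, 2, 3, 2, 1, 3]

LEVEL_NAMES = {1: "HSorless", 2: "College", 3: "Advanced"}


def dummy_code(values, level_names, reference):
    """Return k-1 dummy columns (0/1 lists), omitting the reference level."""
    return {
        name: [1 if v == code else 0 for v in values]
        for code, name in level_names.items()
        if code != reference
    }


# Leave HSorless (code 1) out as the reference category.
dummies = dummy_code(education, LEVEL_NAMES, reference=1)

for name, column in dummies.items():
    print(name, column)
```

Respondents in the reference category are the ones with 0 on every dummy that is kept. Changing `reference` to 3 would keep HSorless and College and leave Advanced as the reference category instead.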

The interpretation of the dummy coefficients is similar to the binary case. The coefficient for College is the average difference in the dependent variable between that level of education and the reference group. If College and Advanced are the dummies included in the model, the coefficient for College shows the average difference in salary between a person who has completed a college degree and a person with high school or less. Likewise, the coefficient for Advanced shows the average difference in salary between a person with an advanced graduate degree and a person with high school or less.
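To see why the coefficients are differences from the reference group, it can help to write the model out. The notation below ($\beta_0$, $\beta_1$, $\beta_2$, $\varepsilon_i$) is an assumption introduced here for illustration, for a model with only the College and Advanced dummies and HSorless as the reference category:

$$
\begin{aligned}
\text{salary}_i &= \beta_0 + \beta_1\,\text{College}_i + \beta_2\,\text{Advanced}_i + \varepsilon_i \\
E[\text{salary} \mid \text{HSorless}] &= \beta_0 \\
E[\text{salary} \mid \text{College}] &= \beta_0 + \beta_1 \\
E[\text{salary} \mid \text{Advanced}] &= \beta_0 + \beta_2
\end{aligned}
$$

Subtracting the first expected value from the other two gives $\beta_1 = E[\text{salary} \mid \text{College}] - E[\text{salary} \mid \text{HSorless}]$ and $\beta_2 = E[\text{salary} \mid \text{Advanced}] - E[\text{salary} \mid \text{HSorless}]$, while the intercept $\beta_0$ is the average salary in the reference group.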