Skip to content

Commit

Permalink
Merge pull request #290 from PSIAIMS/LT1
Browse files Browse the repository at this point in the history
added in latest conference info
  • Loading branch information
statasaurus authored Sep 4, 2024
2 parents 501a6fd + afd2e39 commit 34a18c8
Show file tree
Hide file tree
Showing 3 changed files with 163 additions and 137 deletions.
273 changes: 148 additions & 125 deletions SAS/logistic-regr.qmd
Original file line number Diff line number Diff line change
@@ -1,160 +1,59 @@
---
title: "Logistic Regression"
title: "Logistic Regression in SAS"
output: html_document
date: last-modified
date-format: D MMMM, YYYY
---

For a brief description of what is logistic regression see [here](../R/logistic_regr.html).

### Parameterization of model effects (categorical covariates) in SAS

The most common problem when fitting logistic regression in SAS, is getting SAS to model the binary variable (events) and any categorical covariates correctly.
### Modelling the lung cancer data

There are two procedures which can be used to perform logistic regression in SAS, ***proc logistic***, and ***proc genmod (using dist=bin and link=logit options)***. They obtain the same results when parameterized correctly.
In SAS, we can use proc logistic or proc genmod to perform a logistic regression.

For this example, we will focus on proc logistic but the same concepts apply to both procedures.
To demonstrate the use of logistic regression we examine the same lung dataset as used in the R example [here](../R/logistic_regr.html).

Let's first look at types of parameterization.
**Some Common mistakes**

Parameterization is what SAS does when you include a categorical variable in your model. It codes that variable using new parameters and it fits those into your model.
1) Handling of missing data. Check SAS output that the number of missing values is as you expect. Make sure you have changed any `NA` results in the raw data to be missing, since SAS would consider `NA` as a valid category (a non-missing character result).

The default parameterization is ***PARAM=EFFECT*** - This is shown as Example A in the design matrix below.
2) Make sure you consider continuous or categorical variables as you intended. Just because a variable is character or numeric in the dataset, doesn't mean SAS will treat it that way in the model. You have to use Class row to tell SAS which variables should be treated as character factors (also using `ref=' '` to select the reference category).

For the example of a categorical variable trt - which has 3 treatment levels (Active 1, Active 2, Placebo). With the EFFECT option, the reference level (Placebo) is given values of "-1" for both of the x1 and x2 parameters. The design matrix consists of 1 less parameter than the number of levels of your variable (so that for 3 treatment groups you have x1 and x2 - 2 parameters in the model).
3) Be careful you are modelling the correct event (response vs non-response, or weight_gain vs weight_loss for example)

Which reference category is selected is specified either by using the ref='X' option, or the default which would be the last value of the variable. In our example this is Placebo (as this is last in the alphabet after Active 1 and Active 2).
4) Be careful when interpreting any odds ratios that you have the factor of interest the correct way around (0 vs 1, or 1 vs 0)

Alternatively, you change the parameterization in SAS using the following options:
5) If using proc logistic, be careful of how SAS creates its parameters used in the model as this determines how you can use the parameter estimates! It is often easiest to use `param=glm` so that the exp(maximum liklihood parameter estimate) = odds ratio. Check the class level information (Design variables) is as you would expect. See below for more detail on other parameterization methods.

***PARAM=REF*** - This is shown as Example B below - reference parameter values "0", n-1 variables
**Option 1:** **Using Proc Logistic with sex and ph_ecog as character factors:** sex (2 levels: 1,2) and ph_ecog (4 levels:0,1,2 or 3), meal_cal as continuous. As shown below, the `class` statement and `ref=" "` option can be used to set the direction of the comparison, `/param=glm` ensures exp(estimates)=odds ratio.

```{r eval=FALSE}
proc logistic data=dataset;
CLASS TRTP /PARAM=REF;
MODEL resp (event="Y") = TRTP ;
proc logistic data=lung;
class sex (ref="1") ph_ecog (ref="0")/param=glm;
model wt_grp(event="weight_gain") = age sex ph_ecog meal_cal;
run;
```

***PARAM=GLM*** - This is shown as Example C below - no reference parameter as such (output compares against last value of treatment group), n variables.
**Option 2:** **Using proc Genmod with sex and ph_ecog as character factors:** sex (2 levels: 1,2) and ph_ecog (4 levels:0,1,2 or 3), meal_cal as continuous. This gets the same results as above. You can use exp(maximum likelihood parameter estimate) to obtain the odds ratios and 95% CIs for the odds ratios. The below already uses GLM parameterization.

```{r eval=FALSE}
proc logistic data=dataset;
CLASS TRTP /PARAM=GLM;
MODEL resp (event="Y") = TRTP ;
proc genmod data=lung;
class sex (ref="1") ph_ecog (ref="0");
model wt_grp (event="weight_gain") = age sex ph_ecog meal_cal / dist=bin link=logit;
run;
```

#### **Design Matrix Examples**

```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("../images/logistic_regression/parameterization.png")
```

Depending on which parameterization you use, you have to interpret those effects carefully. It is also very important to check the reference assigned is the correct one?

This is also particularly important if you are using any contrast, estimate or lsmestimate statements since the coefficients used for the contrasts must be in line with the parameterization.

### Example of deriving constrast statements in SAS

The table below shows the how the parameterizations using EFFECT, REF or GLM, translate to the model being fitted.

Using General model: Y= α + β1x1 + β2x1 {+β3x3}, the treatments are represented as follows:

```{r eval=FALSE}
Treatment Effect(A) REF (B) GLM (C)
-----------------------------------------------------------------------------
Active A Y = α + β1 Y = α + β1 Y = α + β1
Active B Y = α + β2 Y = α + β2 Y = α + β2
Placebo Y = α - β1 - β2 Y = α Y = α + β3
--------------------------------------------------------
```

Now let's suppose we wanted to compare the average of the Active Treatment groups versus the Placebo treatment group. (A+B)/2 compared to Placebo.

**EFFECT (A) parameterization**

```{r eval=FALSE}
((α + β1 + α + β2 ) /2 ) - (α - β1 - β2)
= α + 0.5 β1 + 0.5 β2 - α + β1 + β2
= 1.5 β1 + 1.5 β2
```

Therefore, to apply this contrast in SAS we would use:
**Option 3:** **Using proc logistic, however this time all covariates are treated as continuous parameters not categorical.** You do need to ensure `age` `sex` `ph_ecog` and `meal_cal` are numeric SAS variables else you will get an error message suggesting to add the varibles to a class statement.

```{r eval=FALSE}
PROC LOGISTIC data=dataset;
CLASS trt ; -default PARAM=EFFECT option is used so last trt is reference
MODEL resp (event="Y") = trt ;
CONTRAST "Active (A & B) vs. Placebo" trt 1.5 1.5 / e;
RUN;
```

**GLM (C) parameterization**

The GLM parameterization can be more intuitive.

```{r eval=FALSE}
((α + β1 + α + β2)/2 - (α + β3)
= α + 0.5 β1 + 0.5 β2 - α - β3
= 0.5 β1 + 0.5 β2 - β3
```

Therefore, to apply this contrast in SAS we would use:

```{r eval=FALSE}
PROC LOGISTIC data=dataset;
CLASS trt / param=glm;
MODEL resp (event="Y") = trt ;
CONTRAST "Active (2 & 3) vs. Placebo (1)" trt 0.5 0.5 -1 / e;
RUN;
```

For this reason, it is common to either use the default PARAM=Effect with ref="X" option. or to use the param=GLM with no ref="X" option. It is very important to check the output design matrix so you know what parameterization and reference groups are being used.

### Ensuring you are modelling the correct Binary event in SAS

With logistic regression, we often want to model the number of "Successes". However, by default, SAS sorts alphabetically/numerically and selects the first occurring EVENT alphabetically as the one it's going to model.

It's a common mistake, and we find SAS modelling the number of failures instead of successes. Very common when your response is: 'Responder' vs 'Non-responder', SAS will model the Non-responders as 'N' is alphabetically first before 'R'!

For this reason, It is recommended to always use the event="Y" option.

Options such as ORDER=DATA\|FORMATTED\|FREQ\|INTERNAL as well as descending can be used to ensure the correct levels of classification variables are being modelled. More detail [here](https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_logistic_sect006.htm)

### Modelling the lung cancer data

As described above, we can use proc logistic or prog genmod to perform a logistic regression. Just make sure the parameterization is correct. To demonstrate the use of logistic regression we examine the same lung dataset as used in the R example [here](../R/logistic_regr.html).

The following could be used, however, to match R, Sex and ECOG are treated as continuous parameters in this example, hence option 3 will be used.

```{r eval=FALSE}
Option 1
proc logistic data=lung2;
class sex (ref="1") ph_ecog (ref="0");
model wt_grp(event="weight_gain") = age sex ph_ecog meal_cal;
run;
Option 2
proc genmod data=lung;
class sex (ref="1") ph_ecog (ref="0");
model wt_grp (event="weight_gain") = age sex ph_ecog meal_cal / dist=bin link=logit;
run;
Option 3
proc logistic data=lung2;
model wt_grp(event="weight_gain") = age sex ph_ecog meal_cal;
run;
proc logistic data=lung;
model wt_grp(event="weight_gain") = age sex ph_ecog meal_cal;
run;
```

We analyze the weight gain in lung cancer patients in dependency of age, sex, ECOG performance score and calories consumed at meals. Weight is categorized into a binary variable for Loss or Gain. The probability modelled is weight_gain.
**Comparison with R model** Option 3 is used with the following 3 models, in order to match the same analysis performed in R. This is modelling Sex and ECOG as continuous parameters. We analyze the weight gain in lung cancer patients in dependency of age, sex, ECOG performance score and calories consumed at meals. Weight is categorized into a binary variable for Loss or Gain. The probability modelled is weight_gain.

We fit 3 models. In

Expand All @@ -167,7 +66,7 @@ We fit 3 models. In
```{r eval=FALSE}
title1 "Model 1";
proc logistic data=lung2;
proc logistic data=lung;
model wt_grp (event="weight_gain") = age sex ph_ecog meal_cal;
run;
Expand Down Expand Up @@ -255,3 +154,127 @@ meal_cal 1 -0.00083 0.000435 3.6770 0.0552
```

NOTE: the chi-square test summary of backward elimination, p=0.6250 is different to the results in R, which gave a difference in deviance of -0.24046, p=0.6239. This difference is currently being investigated.

### Parameterization of model effects (categorical covariates) in SAS

The most common problem when fitting logistic regression in SAS, is getting SAS to model the binary variable (events) and any categorical covariates correctly.

There are two procedures which can be used to perform logistic regression in SAS, `proc logistic`, and `proc genmod` (using dist=bin and link=logit options). They obtain the same results when parameterized correctly. However `proc logistic` has three ways to parameterize categorical variables, and failure to parameterize correctly can lead to incorrect interpretation of results. Parameterization is what SAS does when you include a categorical variable in your model. It codes that variable using new parameters and it fits those into your model.

The default parameterization is `/PARAM=EFFECT` - This is shown as Example A in the design matrix example figure below.

For the example of a categorical variable trt - which has 3 treatment levels (Active 1, Active 2, Placebo). With the EFFECT option, the reference level (Placebo) is given values of "-1" for both of the x1 and x2 parameters. The design matrix consists of 1 less parameter than the number of levels of your variable (so that for 3 treatment groups you have x1 and x2 - 2 parameters in the model). Which reference category is selected is specified either by using the `ref='X'` option, or the default which would be the last value of the variable. In our example (as we have not specified `ref='X'` this is Placebo (as this is last in the alphabet after Active 1 and Active 2).

```{r eval=FALSE}
proc logistic data=dataset;
CLASS TRTP /PARAM=EFFECT;
MODEL resp (event="Y") = TRTP ;
run;
or
proc logistic data=dataset;
CLASS TRTP;
MODEL resp (event="Y") = TRTP ;
run;
```

Alternatively, you change the parameterization in SAS using the following options:

`PARAM=REF` - This is shown as Example B below - reference parameter values "0", n-1 variables

```{r eval=FALSE}
proc logistic data=dataset;
CLASS TRTP /PARAM=REF;
MODEL resp (event="Y") = TRTP ;
run;
```

`PARAM=GLM` - This is shown as Example C below - no reference parameter as such (output compares against last value of treatment group), n variables.

```{r eval=FALSE}
proc logistic data=dataset;
CLASS TRTP /PARAM=GLM;
MODEL resp (event="Y") = TRTP ;
run;
```

**Design Matrix Examples**

```{r, echo=FALSE, fig.align='center', out.width="75%"}
knitr::include_graphics("../images/logistic_regression/parameterization.png")
```

Depending on which parameterization you use, you have to interpret the estimates you obtains from the model carefully. It is also very important to check the reference assigned is the correct one.

This is also particularly important if you are using any contrast, estimate or lsmestimate statements since the coefficients used for the contrasts must be in line with the parameterization.

### Example of deriving contrast statements in SAS

The table below shows the how the parameterizations using EFFECT, REF or GLM, translate to the model being fitted.

Using General model: Y= α + β1x1 + β2x1 {+β3x3}, the treatments are represented as follows:

```{r eval=FALSE}
Treatment Effect(A) REF (B) GLM (C)
--------------------------------------------------------
Active A Y = α + β1 Y = α + β1 Y = α + β1
Active B Y = α + β2 Y = α + β2 Y = α + β2
Placebo Y = α - β1 - β2 Y = α Y = α + β3
--------------------------------------------------------
```

Now let's suppose we wanted to compare the average of the Active Treatment groups versus the Placebo treatment group. (A+B)/2 compared to Placebo.

**EFFECT (A) parameterization**

```{r eval=FALSE}
((α + β1 + α + β2 ) /2 ) - (α - β1 - β2)
= α + 0.5 β1 + 0.5 β2 - α + β1 + β2
= 1.5 β1 + 1.5 β2
```

Therefore, to apply this contrast in SAS we would use:

```{r eval=FALSE}
PROC LOGISTIC data=dataset;
CLASS trt ; -default PARAM=EFFECT option is used so last trt is reference
MODEL resp (event="Y") = trt ;
CONTRAST "Active (A & B) vs. Placebo" trt 1.5 1.5 / e;
RUN;
```

**GLM (C) parameterization**

The GLM parameterization can be more intuitive.

```{r eval=FALSE}
((α + β1 + α + β2)/2 - (α + β3)
= α + 0.5 β1 + 0.5 β2 - α - β3
= 0.5 β1 + 0.5 β2 - β3
```

Therefore, to apply this contrast in SAS we would use:

```{r eval=FALSE}
PROC LOGISTIC data=dataset;
CLASS trt / param=glm;
MODEL resp (event="Y") = trt ;
CONTRAST "Active (2 & 3) vs. Placebo (1)" trt 0.5 0.5 -1 / e;
RUN;
```

For this reason, it is common to either use the default PARAM=Effect with ref="X" option. or to use the param=GLM with no ref="X" option. It is very important to check the output design matrix so you know what parameterization and reference groups are being used.

### Ensuring you are modelling the correct Binary event in SAS

With logistic regression, we often want to model the number of "Successes". However, by default, SAS sorts alphabetically/numerically and selects the first occurring EVENT alphabetically as the one it's going to model.

It's a common mistake, and we find SAS modelling the number of failures instead of successes. Very common when your response is: 'Responder' vs 'Non-responder', SAS will model the Non-responders as 'N' is alphabetically first before 'R'!

For this reason, It is recommended to always use the event="Y" option.

Options such as `ORDER=DATA|FORMATTED|FREQ|INTERNAL` as well as descending can be used to ensure the correct levels of classification variables are being modelled. More detail [here](https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_logistic_sect006.htm)
2 changes: 1 addition & 1 deletion _quarto.yml
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ website:
href: publication/white_paper.qmd
- text: "Conferences"
href: publication/conference.qmd
- text: "Dissertation"
- text: "Academic projects"
href: publication/dissertation.qmd


Expand Down
Loading

0 comments on commit 34a18c8

Please sign in to comment.