
Commit

Merge pull request #354 from PSIAIMS/LT1
Closed out some issues for the minor updates
statasaurus authored Nov 25, 2024
2 parents 36b37bc + 863efc1 commit 69c4bd3
Showing 7 changed files with 76 additions and 51 deletions.
35 changes: 32 additions & 3 deletions R/ci_for_prop.qmd
@@ -46,7 +46,7 @@ adcibc %>%

## Packages

The {cardx} package is an extension of the {cards} package, providing additional functions to create Analysis Results Data Objects (ARDs)^1^. It was developed as part of {NEST} and pharmaverse. This package requires the binary endpoint to be a logical (TRUE/FALSE) vector or a numeric/integer coded as (0, 1) with 1 (TRUE) being the success you want to calculate the confidence interval for.
**The {cardx} package** is an extension of the {cards} package, providing additional functions to create Analysis Results Data Objects (ARDs)^1^. It was developed as part of {NEST} and pharmaverse. This package requires the binary endpoint to be a logical (TRUE/FALSE) vector or a numeric/integer coded as (0, 1) with 1 (TRUE) being the success you want to calculate the confidence interval for.

If calculating the CI for a difference in proportions, the package requires both the response and the treatment variable to be numeric/integer coded as (0, 1) (or logical vector).

@@ -58,11 +58,11 @@ Example data format needed for {cardx} for a single proportion CI

```{r}
#| echo: FALSE
#Data for use with cardrx takes the format 0s and 1s
#Data for use with cardx takes the format 0s and 1s
head(act2,30)
```

The {PropCIs} package produces CIs for methods such as Blaker's exact method and Midp which aren't available in {cardx} but are available in SAS. We found results agreed with SAS to the 5th decimal place. The package also calculates CIs for Clopper-Pearson, Wald, Wilson, Agresti-coull and these align to results obtained in cardx to at least the 7th decimal place. The {PropsCIs} package requires just the number of events (numerator number of successes) & total number of subjects (denominator) as an input dataset. Given Blaker and Midp are rarely used in practice, and {PropsCIs} isn't a package commonly downloaded from CRAN, further detail is not provided here.
**The {PropCIs} package** produces CIs for methods such as Blaker's exact method and Midp which aren't available in {cardx} but are available in SAS. We found results agreed with SAS to the 5th decimal place. The package also calculates CIs for Clopper-Pearson, Wald, Wilson and Agresti-Coull, and these align with results obtained in {cardx} to at least the 7th decimal place. The {PropCIs} package requires just the number of events (numerator number of successes) and the total number of subjects (denominator) as an input dataset. Given Blaker and Midp are rarely used in practice, and {PropCIs} isn't a package commonly downloaded from CRAN, further detail is not provided here.

Code example for Clopper-Pearson:\
`exactci(x=<count of successes> , n=<Total>, conf.level=0.95)`
@@ -73,6 +73,32 @@ Code example for Mid P method:\
Code example for Blaker's exact method:\
`blakerci(x=<count of successes> , n=<Total>, conf.level=0.95, tolerance=1e-05)`

**The {Hmisc} package** produces CIs using the Clopper-Pearson method. In this example (x=36 and n=154), the results match the {cardx} package. The documentation reports that the method uses the F distribution to compute exact intervals based on the binomial cdf. However, if the percentage of responders is 100% then the upper limit is set to 1. Similarly, if the percentage of responders is 0%, then the lower limit is set to 0. Hence, in extreme cases there may be differences between this package and the standard implementation of the Clopper-Pearson method.

Code example for Clopper-Pearson:\
`binconf(x=<count of successes> , n=<Total>,method="exact",alpha=0.05)`

**The {RBesT} package** produces CIs using the Clopper-Pearson method. In this example (x=36 and n=154), the results match the {cardx} package. However, as described below, there are two cases where the results from the {RBesT} package do not match {cardx} or {Hmisc}.

1) x = 0 (0% responders), in which case the lower limit does not match.
2) x = n (100% responders), in which case the upper limit does not match.

Because of the relationship between the binomial distribution and the beta distribution, this package uses quantiles of the beta distribution to derive exact confidence intervals.

$$ B(\alpha/2;x, n-x+1) < p < B(1-\alpha/2; x+1, n-x)$$

RBesT equations are: \
pLow \<- qbeta(Low, r + (r == 0), n - r + 1)\
pHigh \<- qbeta(High, r + 1, n - r + ((n - r) == 0))

If the equations were updated as follows then it would match Hmisc intervals:\
pLow \<- qbeta(Low, r, n - r + 1)\
pHigh \<- qbeta(High, r + 1, n - r)

`BinaryExactCI(x=<count of successes> , n=<Total>,alpha=0.05)`

It is currently unclear why the RBesT code includes the logical conditions (r==0) and ((n-r)==0). Therefore, we do not currently recommend using this package and suggest using {cardx} or {Hmisc} instead.
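As an illustration of the beta-quantile construction above, here is a minimal Python sketch of the standard Clopper-Pearson interval (the function name and the edge-case handling are ours, mirroring the Hmisc-style behaviour described, and are not taken from any package):

```python
from scipy.stats import beta

def clopper_pearson(x, n, conf_level=0.95):
    """Exact (Clopper-Pearson) interval via beta quantiles, with the
    conventional edge handling: lower = 0 when x = 0, upper = 1 when x = n."""
    alpha = 1 - conf_level
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

# the x = 36, n = 154 example discussed above
low, high = clopper_pearson(36, 154)
print(low, high)
```

Note how the 0% and 100% responder cases are pinned to 0 and 1 explicitly, which is where the package implementations above diverge.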

## Methods for Calculating Confidence Intervals for a single proportion using cardx

For more technical derivation and reasons for use of each of the methods listed below, see the corresponding [SAS page](https://psiaims.github.io/CAMIS/SAS/ci_for_prop.html).
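For a quick cross-check outside R, the statsmodels function `proportion_confint` implements several of the same methods (its `"beta"` method is Clopper-Pearson); a sketch using the x=36, n=154 example from above:

```python
from statsmodels.stats.proportion import proportion_confint

x, n = 36, 154  # successes and total from the example above
for method in ["normal", "wilson", "agresti_coull", "jeffreys", "beta"]:
    low, high = proportion_confint(count=x, nobs=n, alpha=0.05, method=method)
    print(f"{method:>14}: ({low:.4f}, {high:.4f})")
```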
@@ -170,6 +196,9 @@ proportion_ci_jeffreys(act2,conf.level=0.95) %>%
as_tibble()
```


## Methods for Calculating Confidence Intervals for a matched pair proportion

When you have 2 measurements on the same subject, the 2 sets of measures are not independent and you have a matched pair of responses.
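One common approach is the Wald interval for the difference of two paired proportions, which is built only from the discordant counts; a minimal Python sketch (the counts b, c and n here are invented for illustration, and the function name is ours):

```python
import math
from scipy.stats import norm

def paired_prop_diff_ci(b, c, n, conf_level=0.95):
    """Wald CI for the difference of two paired proportions.
    b = pairs that are (yes, no), c = pairs that are (no, yes), n = total pairs."""
    z = norm.ppf(1 - (1 - conf_level) / 2)
    diff = (b - c) / n
    # variance of the difference uses only the discordant counts
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    return diff - z * se, diff + z * se

# illustrative counts: 10 (yes,no) pairs, 4 (no,yes) pairs out of 100
low, high = paired_prop_diff_ci(b=10, c=4, n=100)
print(low, high)
```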
6 changes: 4 additions & 2 deletions R/logistic_regr.qmd
@@ -8,11 +8,13 @@ title: "Logistic Regression"
library(tidyverse)
```

A model of the dependence of binary variables on explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is
In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and non-response as 0, then the corresponding probability of response can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling!

The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is

$$
\text{logit}(\pi_i)= \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, i = 1,\ldots,n
$$

The model is especially useful in case-control studies, where it expresses the effect of risk factors through odds ratios.
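As a quick numeric check of the odds-ratio interpretation (Python used for illustration; the coefficient values here are invented, not fitted):

```python
import math

b0, b1 = -1.5, 0.4   # illustrative intercept and slope, not fitted values

def expit(z):
    # inverse of the logit: maps the linear predictor back to a probability
    return 1 / (1 + math.exp(-z))

# probability of response at x and at x + 1
p1, p2 = expit(b0 + b1 * 1), expit(b0 + b1 * 2)

# the odds ratio for a one-unit increase in x equals exp(b1)
or_from_probs = (p2 / (1 - p2)) / (p1 / (1 - p1))
print(or_from_probs, math.exp(b1))
```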

17 changes: 10 additions & 7 deletions R/rounding.qmd
@@ -8,11 +8,13 @@ library(janitor)
library(dplyr)
```

## round from R base

The **round()** function in Base R rounds to the nearest whole number, 'rounding to the even number' when equidistant, meaning that exactly 12.5 rounds to the integer 12.

## round from R base
Adding a digits argument, as in **round(12.5, digits=1)**, tells R to round to 1 decimal place.

The **round()** function in Base R will round to the nearest whole number and 'rounding to the even number' when equidistant, meaning that exactly 12.5 rounds to the integer 12. However, this is dependent on OS services and on representation error (since e.g. 0.15 is not represented exactly, the rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2).
However, rounding is dependent on OS services and on representation error since, for example, 0.15 is not represented exactly: the stored value could actually be 0.15000000001 or 0.149999999999! The rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2.
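The same behaviour is easy to demonstrate in Python, whose built-in round() also rounds halves to even and operates on the stored binary value:

```python
from decimal import Decimal

# Python's built-in round() also rounds halves to even
print(round(12.5))        # 12, not 13
print(round(13.5))        # 14

# 0.15 is not stored exactly, so the "half" rule applies to the stored value
print(Decimal(0.15))      # shows the true stored binary value
print(round(0.15, 1))     # 0.1 on typical IEEE platforms, because the
                          # stored value is just below 0.15
```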

```{R}
round(1:9/10+0.05,1)
@@ -37,8 +39,7 @@ r_2_dec
r_3_dec
```

If using the janitor package in R, and the function `round_half_up()`, the results would be the same with the exception of rounding 1.2345 to 3 decimal places where a result of 1.235 would be obtained instead of 1.234.
However, in some rare cases, `round_half_up()` does not return result as expected. There are two kinds of cases for it. 1. Round down for positive decimal like 0.xx5.
If using the janitor package in R, and the function `round_half_up()`, the results would be the same with the exception of rounding 1.2345 to 3 decimal places, where a result of 1.235 would be obtained instead of 1.234. However, in some rare cases, `round_half_up()` does not return the result expected. There are two kinds of such cases. 1. Rounding down a positive decimal like 0.xx5.

```{R}
round_half_up(524288.1255, digits = 3)
@@ -51,7 +52,6 @@ options(digits=22)
524288.1255
```


In `round_half_up()`, a small decimal `sqrt(.Machine$double.eps)` is added before rounding. This avoids some incorrect rounding where the stored numeric value is slightly less than the original value, but it does not cover all conditions.
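One way to sidestep the binary representation issue entirely, sketched here in Python with the decimal module (the function name is ours), is to round the decimal string of the number rather than the stored double:

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_away(x, digits=0):
    """Round half away from zero by rounding the printed (decimal)
    representation rather than the stored binary double."""
    q = Decimal(1).scaleb(-digits)   # e.g. Decimal('0.001') for digits=3
    return float(Decimal(str(x)).quantize(q, rounding=ROUND_HALF_UP))

print(round_half_away(1.2345, 3))       # 1.235
print(round_half_away(524288.1255, 3))  # 524288.126
print(round_half_away(-2.5))            # -3.0
```

This gives the "expected" answers for the problem cases above because it rounds the shortest decimal representation, at the cost of a string conversion per value.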

```{R}
@@ -122,11 +122,14 @@ Like `round_half_up()`, it also contains the two kinds of incorrect results. And

To avoid the rounding issue entirely, the only way is to increase precision, e.g. using the `Rmpfr` package. This needs extra CPU resources and is not always necessary considering the accuracy of the current functions.

## round5() from package cards

The `cards::round5()` function performs the same rounding as `janitor::round_half_up()`.

## Conclusion

So far, `round_half_up()` from package janitor is still one of the best solutions to round away from zero, though users may meet incorrect results in rare cases when the numbers are big or the decimal is long.
So far, `round_half_up()` from the janitor package (or `cards::round5()`) is still one of the best solutions to round away from zero, though users may encounter incorrect results in rare cases when the numbers are big or the decimal is long.

```{R}
options(digits = 7) #This just returns the number of displayed digits back to the default
```

38 changes: 24 additions & 14 deletions python/logistic_regression.qmd
@@ -4,6 +4,7 @@ output: html_document
---

# Imports

```{python}
#data manipulation
import pandas as pd
@@ -14,35 +15,37 @@ import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
```


# Background

Logistic regression models the relationship between a binary variable and set of explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is
In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and non-response as 0, then the corresponding probability of response can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling!

The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is

$$
\text{logit}(\pi_i)= \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, i = 1,\ldots,n
$$

The model is especially useful in case-control studies, where it expresses the effect of risk factors through odds ratios.

# Example : Lung cancer data

*Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.*

These data were sourced from the R package {survival} and have been downloaded and stored in the `data` folder.

```{python}
# importing and prepare
lung2 = pd.read_csv("../data/lung_cancer.csv")
#create weight loss factor while respecting missing values
# 1: patients with a weight loss of more than zero
# 0: patients a weight loss of zero oe less
# 0: patients with a weight loss of zero or less
lung2["wt_grp"] = np.where(lung2["wt.loss"].isnull(), np.nan, (lung2["wt.loss"] > 0).astype(int))
```

# Logistic Regression Modelling

Let's further prepare our data for modelling by selecting the explanatory variables and the dependent variable. The Python packages that we are aware of require complete (i.e. no missing values) data, so for convenience of demonstrating these methods we will drop rows with missing values.

```{python}
x_vars = ["age", "sex", "ph.ecog", "meal.cal"]
@@ -57,6 +60,7 @@ y = lung2_complete[y_var]
```

## Statsmodels package

We will use the `sm.Logit()` method to fit our logistic regression model.

```{python}
@@ -69,20 +73,25 @@ print(lr_sm.summary())
```

### Model fitting

In addition to the information contained in the summary, we can display the model coefficients as odds ratios:

```{python}
print("Odds ratios for statsmodels logistic regression:")
print(np.exp(lr_sm.params))
```

We can also provide the 95% confidence intervals for the odds ratios:

```{python}
print("95% CI for statsmodels logistic regression:")
print(np.exp(lr_sm.conf_int(alpha = 0.05)))
```

### Prediction
Let's use our trained model to make a weight loss prediction about a new patient.

```{python}
# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt = pd.DataFrame({
@@ -101,10 +110,10 @@ print(lr_sm.predict(new_pt_sm))

## Scikit-learn Package

The `scikit-learn` package is a popular package for machine learning and predictive modelling.

::: {.callout-warning}
It's important to note that l2 regularisation is applied by default in the `scikit-learn` implementation of logistic regression. More recent releases of this package include an option to have no regularisation penalty.
:::

```{python}
@@ -120,16 +129,17 @@ print("Odds ratios for scikit learn logistic regression:")
print(np.exp(lr_sk.coef_))
```


However, obtaining the confidence intervals and other metrics is not directly supported in `scikit-learn`.

### Prediction
Using the same new patient example we can use our logistic regression model to make a prediction. The `predict_proba` method is used to return the probability for each class. If you are interested in viewing the prediction for `y = 1`, i.e. the probability of weight loss, then you can select the second probability as shown:

```{python}
print("Probability of weight loss using the scikit-learn package:")
print(lr_sk.predict_proba(new_pt)[:,1])
```

## Conclusions

There are two main ways to fit a logistic regression using Python. Each of these packages has its advantages, with `statsmodels` geared more towards model and coefficient interpretation in low dimensional data settings, and in contrast the `scikit-learn` implementation more appropriate for use cases focused on prediction with more complex, higher dimensional data.
12 changes: 2 additions & 10 deletions python/one_sample_t_test.qmd
@@ -1,13 +1,10 @@
---
title: "One Sample t-test"
title: "One Sample t-test in Python"
output: html_document
---

## One Sample t-test in Python

The One Sample t-test is used to compare a single sample against an expected hypothesis value. In the One Sample t-test, the mean of the sample is compared against the hypothesis value. In Python, a One Sample t-test can be performed using the scipy.stats.ttest_1samp(...) function from the scipy package, which accepts the following parameters:


1.*a*: Sample observations.

2.*popmean*: Expected value in null hypothesis. If array_like, then its length along axis must equal 1, and it must otherwise be broadcastable with a.
@@ -20,7 +17,6 @@ The One Sample t-test is used to compare a single sample against an expected hyp

## Data Used


```{python}
import pandas as pd
@@ -35,10 +31,6 @@ df = pd.DataFrame(data)
```

---
subsubtitle: "t-test"
---

The following code was used to test the comparison in Python. Note that the baseline null hypothesis goes in the "popmean" parameter.

```{python}
@@ -61,4 +53,4 @@ if p_value < alpha:
else:
print("Fail to reject null hypothesis: There is no significant difference.")
```
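Pulling the pieces above together, here is a self-contained sketch of the one sample t-test (the sample values are invented for illustration):

```python
import pandas as pd
from scipy import stats

# illustrative sample, tested against a hypothesised mean of 100
data = {"score": [98, 102, 101, 97, 103, 99, 100, 105, 96, 104,
                  101, 99, 98, 102, 100, 97, 103, 101, 99, 100]}
df = pd.DataFrame(data)

# the baseline null hypothesis goes in the popmean parameter
t_stat, p_value = stats.ttest_1samp(df["score"], popmean=100)
print(f"t = {t_stat}, p-value = {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis: There is a significant difference.")
else:
    print("Fail to reject null hypothesis: There is no significant difference.")
```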
11 changes: 2 additions & 9 deletions python/paired_t_test.qmd
@@ -3,13 +3,9 @@ title: "Paired t-test"
output: html_document
---

## Paired t-test in Python

Paired t-tests are used to test the difference of means for two dependent variables. In the Paired t-test, the difference of the means between the two samples is compared to a given number that represents the null hypothesis. For a Paired t-test, the number of observations in each sample must be equal.


In Python, a Paired t-test can be performed using the scipy.stats.ttest_rel(...) function from the scipy package, which accepts the following parameters:

1.*a, b*: Sample observations. The arrays must have the same shape.

@@ -23,7 +19,6 @@ Paired t-tests are used to test the difference of means for two dependant variab

## Data Used


```{python}
import pandas as pd
@@ -38,8 +33,6 @@ df_pressure = pd.DataFrame(data)
```

## Paired t-test

The following code was used to test the comparison in Python. Note that the baseline null hypothesis goes in the "popmean" parameter.

```{python}
@@ -55,4 +48,4 @@ print(f"t = {t_stat}")
print(f"p-value = {p_value}")
```
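A self-contained sketch of the paired t-test above (the before/after readings are invented for illustration):

```python
import pandas as pd
from scipy import stats

# illustrative before/after blood pressure readings for 10 subjects
df_pressure = pd.DataFrame({
    "before": [120, 122, 143, 100, 109, 112, 118, 119, 132, 126],
    "after":  [122, 120, 141, 109, 109, 112, 115, 126, 132, 124],
})

# ttest_rel pairs the observations row by row
t_stat, p_value = stats.ttest_rel(df_pressure["before"], df_pressure["after"])
print(f"t = {t_stat}")
print(f"p-value = {p_value}")
```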
8 changes: 2 additions & 6 deletions python/two_samples_t_test.qmd
@@ -1,10 +1,8 @@
---
title: "Two Sample t-test"
title: "Two Sample t-test in Python"
output: html_document
---

# **Two Sample t-test in Python**

The Two Sample t-test is used to compare two independent samples against each other. In the Two Sample t-test, the mean of the first sample is compared against the mean of the second sample. In Python, a Two Sample t-test can be performed using the **stats** package from scipy.

### Data Used
@@ -25,8 +23,6 @@ data = {
df = pd.DataFrame(data)
```

## Python

If we have normally distributed data, we can use the classic Student's t-test. For a two sample test where the variances are not equal, we should use Welch's t-test. Both of those options are available in the scipy **stats** package.
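Both options can be sketched side by side (the sample values are invented for illustration):

```python
from scipy import stats

# illustrative samples with visibly different spreads
group1 = [8, 9, 10, 11, 12, 10, 9, 11]
group2 = [14, 9, 16, 7, 18, 6, 13, 12]

# classic Student's t-test assumes equal variances
t_student, p_student = stats.ttest_ind(group1, group2, equal_var=True)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.3f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3f}")
```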

### Student's T-Test
@@ -61,4 +57,4 @@ t_stat_welch, p_value_unequal_var = stats.ttest_ind(group1, group2, equal_var=Fa
print("\nWelch's T-Test assuming unequal variances:")
print(f"T-statistic: {t_stat_welch}")
print(f"P-value: {p_value_unequal_var}")
```
