From 2211941ba6cf0b6562870895c7ed64f3c98fa484 Mon Sep 17 00:00:00 2001
From: Taylor
Date: Fri, 22 Nov 2024 15:51:06 +0000
Subject: [PATCH 1/5] As per Issue #335

---
 R/logistic_regr.qmd            |  6 ++++--
 python/logistic_regression.qmd | 38 +++++++++++++++++++++-------------
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/R/logistic_regr.qmd b/R/logistic_regr.qmd
index 5326366ee..c40d744fd 100644
--- a/R/logistic_regr.qmd
+++ b/R/logistic_regr.qmd
@@ -8,11 +8,13 @@ title: "Logistic Regression"
library(tidyverse)
```

-A model of the dependence of binary variables on explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is
+In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and a non-response as 0, then the corresponding probability of response can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling!
+
+The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observe $(y_i, x_i)$, where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is

$$
\text{logit}(\pi_i) = \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, \quad i = 1,\ldots,n
$$

The model is especially useful in case-control studies and leads to the effect of risk factors by odds ratios.

diff --git a/python/logistic_regression.qmd b/python/logistic_regression.qmd
index 682df42b8..af7a0f25f 100644
--- a/python/logistic_regression.qmd
+++ b/python/logistic_regression.qmd
@@ -4,6 +4,7 @@ output: html_document
---

# Imports

```{python}
#data manipulation
import pandas as pd
@@ -14,21 +15,23 @@ import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
```

# Background

-Logistic regression models the relationship between a binary variable and set of explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is
+In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and a non-response as 0, then the corresponding probability of response can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling!
+
+The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observe $(y_i, x_i)$, where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is

$$
\text{logit}(\pi_i) = \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, \quad i = 1,\ldots,n
$$

The model is especially useful in case-control studies and leads to the effect of risk factors by odds ratios.
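Since the linear predictor sits on the log-odds scale, applying the inverse logit maps it back to a probability. Here is a minimal sketch of that relationship (the coefficient values are made up purely for illustration):

```{python}
import numpy as np

# hypothetical fitted coefficients (made-up values, single explanatory variable)
beta_0, beta_1 = -1.0, 0.5
x_new = 2.0

log_odds = beta_0 + beta_1 * x_new   # linear predictor on the log-odds scale
pi = 1 / (1 + np.exp(-log_odds))     # inverse logit recovers the probability

print(f"log-odds = {log_odds}, probability = {pi}")  # log-odds of 0 gives probability 0.5
```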
# Example: Lung cancer data

*Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.*

These data were sourced from the R package {survival} and have been downloaded and stored in the `data` folder.

```{python}
# import and prepare data
lung2 = pd.read_csv("../data/lung_cancer.csv")

#create weight loss factor while respecting missing values
# 1: patients with a weight loss of more than zero
-# 0: patients a weight loss of zero oe less
+# 0: patients with a weight loss of zero or less
lung2["wt_grp"] = np.where(lung2["wt.loss"].isnull(), np.nan, (lung2["wt.loss"] > 0).astype(int))
```

# Logistic Regression Modelling

Let's further prepare our data for modelling by selecting the explanatory variables and the dependent variable. The Python packages that we are aware of require complete (i.e. no missing values) data, so for convenience of demonstrating these methods we will drop rows with missing values.

```{python}
x_vars = ["age", "sex", "ph.ecog", "meal.cal"]
y_var = "wt_grp"

# drop rows with missing values
lung2_complete = lung2.dropna(axis=0)

#select variables for modelling
x = lung2_complete[x_vars]
y = lung2_complete[y_var]
```

## Statsmodels package

We will use the `sm.Logit()` method to fit our logistic regression model.

```{python}
#intercept column
x_sm = sm.add_constant(x)

#fit model
lr_sm = sm.Logit(y, x_sm).fit()
print(lr_sm.summary())
```

### Model fitting

In addition to the information contained in the summary, we can display the model coefficients as odds ratios:

```{python}
print("Odds ratios for statsmodels logistic regression:")
print(np.exp(lr_sm.params))
```

We can also provide the 95% confidence intervals for the odds ratios (i.e. alpha = 0.05, a 5% significance level):

```{python}
print("CI at 5% for statsmodels logistic regression:")
print(np.exp(lr_sm.conf_int(alpha = 0.05)))
```

### Prediction

Let's use our trained model to make a weight loss prediction for a new patient.

```{python}
# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt = pd.DataFrame({
  "age": [56],
  "sex": [2],
  "ph.ecog": [1.00],
  "meal.cal": [2500]
})

# Add intercept term to the new data; for a single row this should be
# forced in using the `has_constant` parameter
new_pt_sm = sm.add_constant(new_pt, has_constant="add")
print(lr_sm.predict(new_pt_sm))
```

## Scikit-learn Package

The `scikit-learn` package is a popular package for machine learning and predictive modelling.

-::: {.callout-warning}
+::: callout-warning
It's important to note that L2 regularisation is applied by default in the `scikit-learn` implementation of logistic regression. More recent releases of this package include an option to have no regularisation penalty.
:::

```{python}
# fit the model to our data
lr_sk = LogisticRegression(penalty=None).fit(x, y)
```

```{python}
print("Odds ratios for scikit learn logistic regression:")
print(np.exp(lr_sk.coef_))
```

However, obtaining the confidence intervals and other metrics is not directly supported in `scikit-learn`.
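One workaround, sketched below, is to bootstrap interval estimates for the coefficients (assumptions: a percentile bootstrap with 200 resamples, reusing `x`, `y` and the imports from the chunks above; this is an illustration, not an official `scikit-learn` API):

```{python}
# percentile bootstrap CIs for the odds ratios (a sketch)
rng = np.random.default_rng(0)
boot_coefs = []
for _ in range(200):                       # 200 resamples kept small for speed
    idx = rng.integers(0, len(x), len(x))  # resample rows with replacement
    fit = LogisticRegression(penalty=None, max_iter=1000).fit(x.iloc[idx], y.iloc[idx])
    boot_coefs.append(fit.coef_[0])

print(np.exp(np.percentile(boot_coefs, [2.5, 97.5], axis=0)))  # odds-ratio scale
```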
### Prediction

Using the same new patient example we can use our logistic regression model to make a prediction. The `predict_proba` method is used to return the probability for each class. If you are interested in viewing the prediction for `y = 1`, i.e. the probability of weight loss, then you can select the second probability as shown:

```{python}
print("Probability of weight loss using the scikit-learn package:")
print(lr_sk.predict_proba(new_pt)[:,1])
```

## Conclusions

There are two main ways to fit a logistic regression using Python, and each package has its advantages: `statsmodels` is geared more towards model and coefficient interpretation in low-dimensional data settings, whereas the `scikit-learn` implementation is more appropriate for use cases focused on prediction with more complex, higher-dimensional data.

From 42a0304d7dd38cc5268c9959a5ec26d2c1509514 Mon Sep 17 00:00:00 2001
From: Taylor
Date: Fri, 22 Nov 2024 16:11:50 +0000
Subject: [PATCH 2/5] Updated as per issue #307

---
 python/one_sample_t_test.qmd  | 12 ++----------
 python/paired_t_test.qmd      | 11 ++---------
 python/two_samples_t_test.qmd |  8 ++------
 3 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/python/one_sample_t_test.qmd b/python/one_sample_t_test.qmd
index 88627f781..64d8cd95d 100644
--- a/python/one_sample_t_test.qmd
+++ b/python/one_sample_t_test.qmd
@@ -1,13 +1,10 @@
---
-title: "One Sample t-test"
+title: "One Sample t-test in Python"
output: html_document
---

-## One Sample t-test in Python
-
The One Sample t-test is used to compare a single sample against an expected hypothesis value. In the One Sample t-test, the mean of the sample is compared against the hypothesis value. In Python, a One Sample t-test can be performed using the scipy.stats.ttest_1samp(...) function from the scipy package, which accepts the following parameters:

1. *a*: Sample observations.

2. *popmean*: Expected value in null hypothesis. If array_like, then its length along axis must equal 1, and it must otherwise be broadcastable with a.

@@ -20,7 +17,6 @@ The One Sample t-test is used to compare a single sample against an expected hyp

## Data Used

```{python}
import pandas as pd

# Create sample data
data = {
@@ -35,10 +31,6 @@ df = pd.DataFrame(data)

```

----
-subsubtitle: "t-test"
----
-
The following code was used to test the comparison in Python. Note that the baseline null hypothesis goes in the "popmean" parameter.

```{python}
alpha = 0.05

if p_value < alpha:
    print("Reject null hypothesis: There is a significant difference.")
else:
    print("Fail to reject null hypothesis: There is no significant difference.")

```
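For intuition about what `ttest_1samp` computes, a small cross-check can help: the statistic is the standardised distance of the sample mean from the hypothesised mean, $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$. A self-contained sketch (the observations and hypothesised mean below are made up):

```{python}
import numpy as np
from scipy import stats

sample = np.array([29.9, 30.4, 30.1, 29.7, 30.2])  # made-up observations
mu0 = 30                                           # hypothesised mean

# manual t statistic: (sample mean - mu0) / standard error
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(len(sample)))

# scipy should agree
t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(t_manual, t_scipy)
```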
diff --git a/python/paired_t_test.qmd b/python/paired_t_test.qmd
index 0a7aa8bb2..2052dff7d 100644
--- a/python/paired_t_test.qmd
+++ b/python/paired_t_test.qmd
@@ -3,13 +3,9 @@ title: "Paired t-test"
output: html_document
---

-## Paired t-test in Python
-
Paired t-tests are used to test the difference of means for two dependent variables. In the Paired t-test, the difference of the means between the two samples is compared to a given number that represents the null hypothesis. For a Paired t-test, the number of observations in each sample must be equal.

In Python, a Paired t-test can be performed using the scipy.stats.ttest_rel(...) function from the scipy package, which accepts the following parameters:

1. *a, b*: Sample observations. The arrays must have the same shape.

@@ -23,7 +19,6 @@ Paired t-tests are used to test the difference of means for two dependent variab

## Data Used

```{python}
import pandas as pd

# Create sample data
data = {
@@ -38,8 +33,6 @@ df_pressure = pd.DataFrame(data)

```

-## Paired t-test
-
The following code was used to test the comparison in Python. Note that the two dependent samples are passed directly as the `a` and `b` arguments of `ttest_rel`; unlike the One Sample t-test, there is no "popmean" parameter.

```{python}
print(f"t = {t_stat}")

print(f"p-value = {p_value}")

```

diff --git a/python/two_samples_t_test.qmd b/python/two_samples_t_test.qmd
index 7b2674c71..93acb0739 100644
--- a/python/two_samples_t_test.qmd
+++ b/python/two_samples_t_test.qmd
@@ -1,10 +1,8 @@
---
-title: "Two Sample t-test"
+title: "Two Sample t-test in Python"
output: html_document
---

-# **Two Sample t-test in Python**
-
The Two Sample t-test is used to compare two independent samples against each other. In the Two Sample t-test, the mean of the first sample is compared against the mean of the second sample. In Python, a Two Sample t-test can be performed using the **stats** package from scipy.

### Data Used

```{python}
import pandas as pd

# Create sample data
data = {
@@ -25,8 +23,6 @@ df = pd.DataFrame(data)

```

-## Python
-
If we have normally distributed data, we can use the classic Student's t-test. For a two sample test where the variances are not equal, we should use Welch's t-test. Both of those options are available in the scipy **stats** package.

### Student's T-Test

```{python}
# Perform Student's t-test assuming equal variances
t_stat, p_value_equal_var = stats.ttest_ind(group1, group2, equal_var=True)

print("Student's T-Test assuming equal variances:")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value_equal_var}")
```

### Welch's T-Test

```{python}
# Perform Welch's t-test assuming unequal variances
t_stat_welch, p_value_unequal_var = stats.ttest_ind(group1, group2, equal_var=False)

print("\nWelch's T-Test assuming unequal variances:")
print(f"T-statistic: {t_stat_welch}")
print(f"P-value: {p_value_unequal_var}")
```

From fdc8c387183bb1450652ec14642e5e0cba225e7c Mon Sep 17 00:00:00 2001
From: Taylor
Date: Fri, 22 Nov 2024 16:23:13 +0000
Subject: [PATCH 3/5] as per issue # 299

---
 R/rounding.qmd | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/R/rounding.qmd b/R/rounding.qmd
index 6d6bb2275..a938ca11d 100644
--- a/R/rounding.qmd
+++ b/R/rounding.qmd
@@ -8,11 +8,13 @@
library(janitor)
library(dplyr)
```

-## round from R base
-
-The **round()** function in Base R will round to the nearest whole number and 'rounding to the even number' when equidistant, meaning that exactly 12.5 rounds to the integer 12. However, this is dependent on OS services and on representation error (since e.g. 0.15 is not represented exactly, the rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2).
+## round from R base
+
+The **round()** function in Base R rounds to the nearest whole number, 'rounding to the even number' when equidistant, meaning that exactly 12.5 rounds to the integer 12.
+
+The call **round(12.5, digits = 1)** tells R to round to 1 decimal place.
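To make both behaviours concrete, here is a quick sketch (the results noted in the comments assume the IEEE-754 double-precision arithmetic R uses on all common platforms):

```{R}
round(2.5)              # 2: an exact tie, and 'round half to even' picks the even integer
round(12.5, digits = 1) # 12.5: already exact at 1 decimal place, so unchanged
sprintf("%.20f", 0.15)  # shows that 0.15 is actually stored slightly below 0.15
```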
+However, rounding is dependent on OS services and on representation error: since, for example, 0.15 is not represented exactly, it could actually be stored as 0.15000000001 or 0.149999999999! The rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2.

```{R}
round(1:9/10+0.05,1)
```

r_2_dec
r_3_dec
```

If using the janitor package in R, and the function `round_half_up()`, the results would be the same with the exception of rounding 1.2345 to 3 decimal places, where a result of 1.235 would be obtained instead of 1.234. However, in some rare cases, `round_half_up()` does not return results as expected. There are two kinds of such cases: 1. rounding down for a positive decimal like 0.xx5.

```{R}
round_half_up(524288.1255, digits = 3)
```

```{R}
options(digits=22)
524288.1255
```

In `round_half_up()`, a small decimal `sqrt(.Machine$double.eps)` is added before rounding. It avoids some incorrect rounding where the stored numeric value is a little less than the original value, but it does not cover all conditions.

Like `round_half_up()`, it also contains the two kinds of incorrect results described above. To avoid the rounding issue totally, the only way is to increase precision, e.g. using the package `Rmpfr`. This needs more CPU resources, and it's not always necessary considering the accuracy of the current functions.

## round5() from package cards

The `cards::round5()` function does the same rounding as `janitor::round_half_up()`.

## Conclusion

So far, `round_half_up()` from package janitor (or `cards::round5()`) is still one of the best solutions to round away from zero, though users may meet incorrect results in rare cases when the numbers are big or the decimal is long.

```{R}
options(digits = 7) #This just returns the number of displayed digits back to the default
```

From 6518b8e6ffca773d32ac666eaed81dfaf707e2af Mon Sep 17 00:00:00 2001
From: Taylor
Date: Fri, 22 Nov 2024 17:46:10 +0000
Subject: [PATCH 4/5] Issue #7

---
 R/ci_for_prop.qmd | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/R/ci_for_prop.qmd b/R/ci_for_prop.qmd
index 9ef4c82d4..0aec2fc18 100644
--- a/R/ci_for_prop.qmd
+++ b/R/ci_for_prop.qmd
@@ -46,7 +46,7 @@ adcibc %>%

## Packages

-The {cardx} package is an extension of the {cards} package, providing additional functions to create Analysis Results Data Objects (ARDs)^1^. It was developed as part of {NEST} and pharmaverse. This package requires the binary endpoint to be a logical (TRUE/FALSE) vector or a numeric/integer coded as (0, 1) with 1 (TRUE) being the success you want to calculate the confidence interval for.
+**The {cardx} package** is an extension of the {cards} package, providing additional functions to create Analysis Results Data Objects (ARDs)^1^. It was developed as part of {NEST} and pharmaverse. This package requires the binary endpoint to be a logical (TRUE/FALSE) vector or a numeric/integer coded as (0, 1) with 1 (TRUE) being the success you want to calculate the confidence interval for.

If calculating the CI for a difference in proportions, the package requires both the response and the treatment variable to be numeric/integer coded as (0, 1) (or logical vector).

Example data format needed for {cardx} for a single proportion CI

```{r}
#| echo: FALSE
-#Data for use with cardrx takes the format 0s and 1s
+#Data for use with cardx takes the format 0s and 1s
head(act2,30)
```
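As a worked illustration of this input format (a sketch; it assumes {cardx} is installed and reuses the `act2` 0/1 vector shown above), a single-proportion Clopper-Pearson CI can be requested directly:

```{r}
cardx::proportion_ci_clopper_pearson(act2, conf.level = 0.95) %>%
  as_tibble()
```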
**The {PropCIs} package** produces CIs for methods such as Blaker's exact method and Midp which aren't available in {cardx} but are available in SAS. We found results agreed with SAS to the 5th decimal place. The package also calculates CIs for Clopper-Pearson, Wald, Wilson, Agresti-coull and these align to results obtained in cardx to at least the 7th decimal place. The {PropCIs} package requires just the number of events (numerator number of successes) & total number of subjects (denominator) as an input dataset. Given Blaker and Midp are rarely used in practice, and {PropCIs} isn't a package commonly downloaded from CRAN, further detail is not provided here.

Code example for Clopper-Pearson:\
`exactci(x= , n=, conf.level=0.95)`

Code example for Mid P method:\
`midPci(x= , n=, conf.level=0.95)`

Code example for Blaker's exact method:\
`blakerci(x= , n=, conf.level=0.95, tolerance=1e-05)`

**The {Hmisc} package** produces CIs using the Clopper-Pearson method. In this example (x=36 and n=154), the results match the cardx package. The documentation reports that the method uses the F distribution to compute exact intervals based on the binomial cdf. However, if the percentage of responders is 100% then the upper limit is set to 1; similarly, if the percentage of responders is 0%, then the lower limit is set to 0. Hence, in extreme cases there may be differences between this package and the standard implementation of the Clopper-Pearson method.

Code example for Clopper-Pearson:\
`binconf(x= , n=, method="exact", alpha=0.05)`

**The {RBesT} package** produces CIs using the Clopper-Pearson method. In this example (x=36 and n=154), the results match the cardx package. However, as described below, there are 2 cases where the results using the RBesT package do not match cardx or Hmisc:

1) x = 0 (0% responders), in which case the lower limit does not match.
2) x = n (100% responders), in which case the upper limit does not match.

Because of the relationship between the binomial distribution and the beta distribution, this package uses quantiles of the beta distribution to derive exact confidence intervals:

$$ B(\alpha/2; x, n-x+1) < p < B(1-\alpha/2; x+1, n-x) $$
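Plugging in the worked example from above (x = 36 responders, n = 154 subjects), a base-R sketch of these beta quantiles should reproduce the cardx/Hmisc Clopper-Pearson limits:

```{r}
x <- 36; n <- 154; alpha <- 0.05

lower <- qbeta(alpha / 2, x, n - x + 1)     # B(alpha/2; x, n-x+1)
upper <- qbeta(1 - alpha / 2, x + 1, n - x) # B(1-alpha/2; x+1, n-x)
c(lower = lower, upper = upper)             # compare with binom.test(36, 154)$conf.int
```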
The RBesT equations are:\
pLow \<- qbeta(Low, r + (r == 0), n - r + 1)\
pHigh \<- qbeta(High, r + 1, n - r + ((n - r) == 0))

If the equations were updated as follows then it would match the Hmisc intervals:\
pLow \<- qbeta(Low, r, n - r + 1)\
pHigh \<- qbeta(High, r + 1, n - r)

Code example for Clopper-Pearson:\
`BinaryExactCI(x= , n=, alpha=0.05)`

It is currently unclear why the RBesT script has the logical conditions (r==0) and ((n-r)==0). Therefore, we currently do not recommend using this package and suggest cardx or Hmisc is used instead.

## Methods for Calculating Confidence Intervals for a single proportion using cardx

For more technical derivation and reasons for use of each of the methods listed below, see the corresponding [SAS page](https://psiaims.github.io/CAMIS/SAS/ci_for_prop.html).

proportion_ci_jeffreys(act2,conf.level=0.95) %>%
  as_tibble()
```

## Methods for Calculating Confidence Intervals for a matched pair proportion

When you have 2 measurements on the same subject, the 2 sets of measures are not independent and you have a matched pair of responses.

From 863efc1e0fa03d39b2b673cbdda096c485867b87 Mon Sep 17 00:00:00 2001
From: Christina Fillmore <55274484+statasaurus@users.noreply.github.com>
Date: Mon, 25 Nov 2024 10:30:49 +0000
Subject: [PATCH 5/5] Update R/ci_for_prop.qmd

---
 R/ci_for_prop.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/ci_for_prop.qmd b/R/ci_for_prop.qmd
index 0aec2fc18..d85f56948 100644
--- a/R/ci_for_prop.qmd
+++ b/R/ci_for_prop.qmd
@@ -62,7 +62,7 @@ head(act2,30)
```

-**The {PropCIs} package** produces CIs for methods such as Blaker's exact method and Midp which aren't available in {cardx} but are available in SAS. We found results agreed with SAS to the 5th decimal place. The package also calculates CIs for Clopper-Pearson, Wald, Wilson, Agresti-coull and these align to results obtained in cardx to at least the 7th decimal place. The {PropCIs} package requires just the number of events (numerator number of successes) & total number of subjects (denominator) as an input dataset. Given Blaker and Midp are rarely used in practice, and {PropCIs} isn't a package commonly downloaded from CRAN, further detail is not provided here.
+**The {PropCIs} package** produces CIs for methods such as Blaker's exact method and Midp which aren't available in {cardx} but are available in SAS. We found results agreed with SAS to the 5th decimal place. The package also calculates CIs for Clopper-Pearson, Wald, Wilson, Agresti-Coull and these align to results obtained in cardx to at least the 7th decimal place. The {PropCIs} package requires just the number of events (numerator number of successes) & total number of subjects (denominator) as an input dataset. Given Blaker and Midp are rarely used in practice, and {PropCIs} isn't a package commonly downloaded from CRAN, further detail is not provided here.

Code example for Clopper-Pearson:\
`exactci(x= , n=, conf.level=0.95)`
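As a final worked illustration of that call, using the same numbers as the examples above (a sketch; it assumes the {PropCIs} package is installed):

```{r}
library(PropCIs)

# 95% Clopper-Pearson CI for 36 responders out of 154 subjects
exactci(x = 36, n = 154, conf.level = 0.95)
```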