
Commit

Merge pull request #354 from PSIAIMS/LT1
Closed out some issues for the minor updates
statasaurus authored Nov 25, 2024
2 parents 36b37bc + 863efc1 commit 69c4bd3
Showing 7 changed files with 76 additions and 51 deletions.
35 changes: 32 additions & 3 deletions R/ci_for_prop.qmd
@@ -46,7 +46,7 @@ adcibc %>%

## Packages

The {cardx} package is an extension of the {cards} package, providing additional functions to create Analysis Results Data Objects (ARDs)^1^. It was developed as part of {NEST} and pharmaverse. This package requires the binary endpoint to be a logical (TRUE/FALSE) vector or a numeric/integer coded as (0, 1) with 1 (TRUE) being the success you want to calculate the confidence interval for.
**The {cardx} package** is an extension of the {cards} package, providing additional functions to create Analysis Results Data Objects (ARDs)^1^. It was developed as part of {NEST} and pharmaverse. This package requires the binary endpoint to be a logical (TRUE/FALSE) vector or a numeric/integer coded as (0, 1) with 1 (TRUE) being the success you want to calculate the confidence interval for.

If calculating the CI for a difference in proportions, the package requires both the response and the treatment variable to be numeric/integer coded as (0, 1) (or logical vector).

@@ -58,11 +58,11 @@ Example data format needed for {cardx} for a single proportion CI

```{r}
#| echo: FALSE
#Data for use with cardrx takes the format 0s and 1s
#Data for use with cardx takes the format 0s and 1s
head(act2,30)
```

The {PropCIs} package produces CIs for methods such as Blaker's exact method and Midp which aren't available in {cardx} but are available in SAS. We found results agreed with SAS to the 5th decimal place. The package also calculates CIs for Clopper-Pearson, Wald, Wilson, Agresti-coull and these align to results obtained in cardx to at least the 7th decimal place. The {PropsCIs} package requires just the number of events (numerator number of successes) & total number of subjects (denominator) as an input dataset. Given Blaker and Midp are rarely used in practice, and {PropsCIs} isn't a package commonly downloaded from CRAN, further detail is not provided here.
**The {PropCIs} package** produces CIs for methods such as Blaker's exact method and Midp which aren't available in {cardx} but are available in SAS. We found results agreed with SAS to the 5th decimal place. The package also calculates CIs for Clopper-Pearson, Wald, Wilson and Agresti-Coull, and these align with results obtained in {cardx} to at least the 7th decimal place. The {PropCIs} package requires just the number of events (numerator number of successes) and the total number of subjects (denominator) as an input dataset. Given Blaker and Midp are rarely used in practice, and {PropCIs} isn't a package commonly downloaded from CRAN, further detail is not provided here.

Code example for Clopper-Pearson:\
`exactci(x=<count of successes> , n=<Total>, conf.level=0.95)`
@@ -73,6 +73,32 @@ Code example for Mid P method:\
Code example for Blaker's exact method:\
`blakerci(x=<count of successes> , n=<Total>, conf.level=0.95, tolerance=1e-05)`

**The {Hmisc} package** produces CIs using the Clopper-Pearson method. In this example (x=36 and n=154), the results match the {cardx} package. The documentation reports that the method uses the F distribution to compute exact intervals based on the binomial cdf. However, if the percentage of responders is 100% then the upper limit is set to 1. Similarly, if the percentage of responders is 0%, then the lower limit is set to 0. Hence, in extreme cases there may be differences between this package and the standard implementation of the Clopper-Pearson method.

Code example for Clopper-Pearson:\
`binconf(x=<count of successes> , n=<Total>,method="exact",alpha=0.05)`

**The {RBesT} package** produces CIs using the Clopper-Pearson method. In this example (x=36 and n=154), the results match the {cardx} package. However, as described below, there are two cases where the results from the {RBesT} package do not match {cardx} or {Hmisc}.

1) x = 0 (0% responders), in which case the lower limit does not match.
2) x = n (100% responders), in which case the upper limit does not match.

Because of the relationship between the binomial distribution and the beta distribution, this package uses quantiles of the beta distribution to derive exact confidence intervals.

$$ B(\alpha/2;x, n-x+1) < p < B(1-\alpha/2; x+1, n-x)$$

RBesT equations are: \
pLow \<- qbeta(Low, r + (r == 0), n - r + 1)\
pHigh \<- qbeta(High, r + 1, n - r + ((n - r) == 0))

If the equations were updated as follows then it would match Hmisc intervals:\
pLow \<- qbeta(Low, r, n - r + 1)\
pHigh \<- qbeta(High, r + 1, n - r)

`BinaryExactCI(x=<count of successes> , n=<Total>,alpha=0.05)`

It is currently unclear why the RBesT code includes the logical conditions (r==0) and ((n-r)==0). Therefore, we do not currently recommend using this package and suggest using {cardx} or {Hmisc} instead.
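As an illustration of the beta-quantile construction above, here is a minimal Python sketch of the standard Clopper-Pearson interval (the function name and the edge-case handling are ours, mirroring the Hmisc-style behaviour described, and are not taken from any package):

```python
from scipy.stats import beta

def clopper_pearson(x, n, conf_level=0.95):
    """Exact (Clopper-Pearson) interval via beta quantiles, with the
    conventional edge handling: lower = 0 when x = 0, upper = 1 when x = n."""
    alpha = 1 - conf_level
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

# the x = 36, n = 154 example discussed above
low, high = clopper_pearson(36, 154)
print(low, high)
```

Note how the 0% and 100% responder cases are pinned to 0 and 1 explicitly, which is where the package implementations above diverge.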

## Methods for Calculating Confidence Intervals for a single proportion using cardx

For more technical derivation and reasons for use of each of the methods listed below, see the corresponding [SAS page](https://psiaims.github.io/CAMIS/SAS/ci_for_prop.html).
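For a quick cross-check outside R, the statsmodels function `proportion_confint` implements several of the same methods (its `"beta"` method is Clopper-Pearson); a sketch using the x=36, n=154 example from above:

```python
from statsmodels.stats.proportion import proportion_confint

x, n = 36, 154  # successes and total from the example above
for method in ["normal", "wilson", "agresti_coull", "jeffreys", "beta"]:
    low, high = proportion_confint(count=x, nobs=n, alpha=0.05, method=method)
    print(f"{method:>14}: ({low:.4f}, {high:.4f})")
```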
@@ -170,6 +196,9 @@ proportion_ci_jeffreys(act2,conf.level=0.95) %>%
as_tibble()
```


## Methods for Calculating Confidence Intervals for a matched pair proportion

When you have 2 measurements on the same subject, the 2 sets of measures are not independent and you have a matched pair of responses.
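One common approach is the Wald interval for the difference of two paired proportions, which is built only from the discordant counts; a minimal Python sketch (the counts b, c and n here are invented for illustration, and the function name is ours):

```python
import math
from scipy.stats import norm

def paired_prop_diff_ci(b, c, n, conf_level=0.95):
    """Wald CI for the difference of two paired proportions.
    b = pairs that are (yes, no), c = pairs that are (no, yes), n = total pairs."""
    z = norm.ppf(1 - (1 - conf_level) / 2)
    diff = (b - c) / n
    # variance of the difference uses only the discordant counts
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    return diff - z * se, diff + z * se

# illustrative counts: 10 (yes,no) pairs, 4 (no,yes) pairs out of 100
low, high = paired_prop_diff_ci(b=10, c=4, n=100)
print(low, high)
```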
6 changes: 4 additions & 2 deletions R/logistic_regr.qmd
@@ -8,11 +8,13 @@ title: "Logistic Regression"
library(tidyverse)
```

A model of the dependence of binary variables on explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is
In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and non-response as 0, then the corresponding probability of response can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling!

The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is

$$
\text{logit}(\pi_i)= \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, i = 1,\ldots,n
$$

The model is especially useful in case-control studies, where it expresses the effect of risk factors through odds ratios.
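As a quick numeric check of the odds-ratio interpretation (Python used for illustration; the coefficient values here are invented, not fitted):

```python
import math

b0, b1 = -1.5, 0.4   # illustrative intercept and slope, not fitted values

def expit(z):
    # inverse of the logit: maps the linear predictor back to a probability
    return 1 / (1 + math.exp(-z))

# probability of response at x and at x + 1
p1, p2 = expit(b0 + b1 * 1), expit(b0 + b1 * 2)

# the odds ratio for a one-unit increase in x equals exp(b1)
or_from_probs = (p2 / (1 - p2)) / (p1 / (1 - p1))
print(or_from_probs, math.exp(b1))
```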

17 changes: 10 additions & 7 deletions R/rounding.qmd
@@ -8,11 +8,13 @@ library(janitor)
library(dplyr)
```

## round from R base

The **round()** function in Base R rounds to the nearest whole number, 'rounding to the even number' when equidistant, meaning that exactly 12.5 rounds to the integer 12.

## round from R base
Adding a digits argument, as in **round(12.5, digits=1)**, tells R to round to 1 decimal place.

The **round()** function in Base R will round to the nearest whole number and 'rounding to the even number' when equidistant, meaning that exactly 12.5 rounds to the integer 12. However, this is dependent on OS services and on representation error (since e.g. 0.15 is not represented exactly, the rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2).
However, rounding is dependent on OS services and on representation error since, for example, 0.15 is not represented exactly: the stored value could actually be 0.15000000001 or 0.149999999999! The rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2.
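The same behaviour is easy to demonstrate in Python, whose built-in round() also rounds halves to even and operates on the stored binary value:

```python
from decimal import Decimal

# Python's built-in round() also rounds halves to even
print(round(12.5))        # 12, not 13
print(round(13.5))        # 14

# 0.15 is not stored exactly, so the "half" rule applies to the stored value
print(Decimal(0.15))      # shows the true stored binary value
print(round(0.15, 1))     # 0.1 on typical IEEE platforms, because the
                          # stored value is just below 0.15
```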

```{R}
round(1:9/10+0.05,1)
@@ -37,8 +39,7 @@ r_2_dec
r_3_dec
```

If using the janitor package in R, and the function `round_half_up()`, the results would be the same with the exception of rounding 1.2345 to 3 decimal places where a result of 1.235 would be obtained instead of 1.234.
However, in some rare cases, `round_half_up()` does not return result as expected. There are two kinds of cases for it. 1. Round down for positive decimal like 0.xx5.
If using the janitor package in R, and the function `round_half_up()`, the results would be the same with the exception of rounding 1.2345 to 3 decimal places, where a result of 1.235 would be obtained instead of 1.234. However, in some rare cases, `round_half_up()` does not return the result expected. There are two kinds of such cases. 1. Rounding down a positive decimal like 0.xx5.

```{R}
round_half_up(524288.1255, digits = 3)
@@ -51,7 +52,6 @@ options(digits=22)
524288.1255
```


In `round_half_up()`, a small decimal `sqrt(.Machine$double.eps)` is added before rounding. This avoids some incorrect rounding where the stored numeric value is slightly less than the original value, but it does not cover all conditions.
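One way to sidestep the binary representation issue entirely, sketched here in Python with the decimal module (the function name is ours), is to round the decimal string of the number rather than the stored double:

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_away(x, digits=0):
    """Round half away from zero by rounding the printed (decimal)
    representation rather than the stored binary double."""
    q = Decimal(1).scaleb(-digits)   # e.g. Decimal('0.001') for digits=3
    return float(Decimal(str(x)).quantize(q, rounding=ROUND_HALF_UP))

print(round_half_away(1.2345, 3))       # 1.235
print(round_half_away(524288.1255, 3))  # 524288.126
print(round_half_away(-2.5))            # -3.0
```

This gives the "expected" answers for the problem cases above because it rounds the shortest decimal representation, at the cost of a string conversion per value.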

```{R}
@@ -122,11 +122,14 @@ Like `round_half_up()`, it also contains the two kinds of incorrect results. And

To avoid the rounding issue entirely, the only way is to increase precision, e.g. using the `Rmpfr` package. This needs extra CPU resources and is not always necessary considering the accuracy of the current functions.

## round5() from package cards

The `cards::round5()` function performs the same rounding as `janitor::round_half_up()`.

## Conclusion

So far, `round_half_up()` from package janitor is still one of the best solutions to round away from zero, though users may meet incorrect results in rare cases when the numbers are big or the decimal is long.
So far, `round_half_up()` from the janitor package (or `cards::round5()`) is still one of the best solutions to round away from zero, though users may encounter incorrect results in rare cases when the numbers are big or the decimal is long.

```{R}
options(digits = 7) #This just returns the number of displayed digits back to the default
```

38 changes: 24 additions & 14 deletions python/logistic_regression.qmd
@@ -4,6 +4,7 @@ output: html_document
---

# Imports

```{python}
#data manipulation
import pandas as pd
@@ -14,35 +15,37 @@ import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
```


# Background

Logistic regression models the relationship between a binary variable and set of explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is
In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and non-response as 0, then the corresponding probability of response can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling!

The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is

$$
\text{logit}(\pi_i)= \log\left\{ \frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta x_i, i = 1,\ldots,n
$$

The model is especially useful in case-control studies, where it expresses the effect of risk factors through odds ratios.

# Example : Lung cancer data

*Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.*

These data were sourced from the R package {survival} and have been downloaded and stored in the `data` folder.

```{python}
# importing and prepare
lung2 = pd.read_csv("../data/lung_cancer.csv")
#create weight loss factor while respecting missing values
# 1: patients with a weight loss of more than zero
# 0: patients a weight loss of zero oe less
# 0: patients with a weight loss of zero or less
lung2["wt_grp"] = np.where(lung2["wt.loss"].isnull(), np.nan, (lung2["wt.loss"] > 0).astype(int))
```

# Logistic Regression Modelling

Let's further prepare our data for modelling by selecting the explanatory variables and the dependent variable. The Python packages that we are aware of require complete (i.e. no missing values) data, so for convenience of demonstrating these methods we will drop rows with missing values.

```{python}
x_vars = ["age", "sex", "ph.ecog", "meal.cal"]
@@ -57,6 +60,7 @@ y = lung2_complete[y_var]
```

## Statsmodels package

We will use the `sm.Logit()` method to fit our logistic regression model.

```{python}
@@ -69,20 +73,25 @@ print(lr_sm.summary())
```

### Model fitting

In addition to the information contained in the summary, we can display the model coefficients as odds ratios:

```{python}
print("Odds ratios for statsmodels logistic regression:")
print(np.exp(lr_sm.params))
```

We can also provide the 95% confidence intervals for the odds ratios:

```{python}
print("95% CI for statsmodels logistic regression:")
print(np.exp(lr_sm.conf_int(alpha = 0.05)))
```

### Prediction
Let's use our trained model to make a weight loss prediction about a new patient.

```{python}
# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt = pd.DataFrame({
@@ -101,10 +110,10 @@ print(lr_sm.predict(new_pt_sm))

## Scikit-learn Package

The `scikit-learn` package is a popular package for machine learning and predictive modelling.

::: {.callout-warning}
It's important to note that l2 regularisation is applied by default in the `scikit-learn` implementation of logistic regression. More recent releases of this package include an option to have no regularisation penalty.
:::

```{python}
@@ -120,16 +129,17 @@ print("Odds ratios for scikit learn logistic regression:")
print(np.exp(lr_sk.coef_))
```


However, obtaining the confidence intervals and other metrics is not directly supported in `scikit-learn`.

### Prediction
Using the same new patient example we can use our logistic regression model to make a prediction. The `predict_proba` method is used to return the probability for each class. If you are interested in viewing the prediction for `y = 1`, i.e. the probability of weight loss, then you can select the second probability as shown:

```{python}
print("Probability of weight loss using the scikit-learn package:")
print(lr_sk.predict_proba(new_pt)[:,1])
```

## Conclusions

There are two main ways to fit a logistic regression using Python. Each of these packages has its advantages, with `statsmodels` geared more towards model and coefficient interpretation in low dimensional data settings, and in contrast the `scikit-learn` implementation more appropriate for use cases focused on prediction with more complex, higher dimensional data.
12 changes: 2 additions & 10 deletions python/one_sample_t_test.qmd
@@ -1,13 +1,10 @@
---
title: "One Sample t-test"
title: "One Sample t-test in Python"
output: html_document
---

## One Sample t-test in Python

The One Sample t-test is used to compare a single sample against an expected hypothesis value. In the One Sample t-test, the mean of the sample is compared against the hypothesis value. In Python, a One Sample t-test can be performed using the scipy.stats.ttest_1samp(...) function from the scipy package, which accepts the following parameters:


1.*a*: Sample observations.

2.*popmean*: Expected value in null hypothesis. If array_like, then its length along axis must equal 1, and it must otherwise be broadcastable with a.
@@ -20,7 +17,6 @@ The One Sample t-test is used to compare a single sample against an expected hyp

## Data Used


```{python}
import pandas as pd
@@ -35,10 +31,6 @@ df = pd.DataFrame(data)
```

---
subsubtitle: "t-test"
---

The following code was used to test the comparison in Python. Note that the baseline null hypothesis goes in the "popmean" parameter.

```{python}
@@ -61,4 +53,4 @@ if p_value < alpha:
else:
print("Fail to reject null hypothesis: There is no significant difference.")
```
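Pulling the pieces above together, here is a self-contained sketch of the one sample t-test (the sample values are invented for illustration):

```python
import pandas as pd
from scipy import stats

# illustrative sample, tested against a hypothesised mean of 100
data = {"score": [98, 102, 101, 97, 103, 99, 100, 105, 96, 104,
                  101, 99, 98, 102, 100, 97, 103, 101, 99, 100]}
df = pd.DataFrame(data)

# the baseline null hypothesis goes in the popmean parameter
t_stat, p_value = stats.ttest_1samp(df["score"], popmean=100)
print(f"t = {t_stat}, p-value = {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis: There is a significant difference.")
else:
    print("Fail to reject null hypothesis: There is no significant difference.")
```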
11 changes: 2 additions & 9 deletions python/paired_t_test.qmd
@@ -3,13 +3,9 @@ title: "Paired t-test"
output: html_document
---

## Paired t-test in Python

Paired t-tests are used to test the difference of means for two dependent variables. In the Paired t-test, the difference of the means between the two samples is compared to a given number that represents the null hypothesis. For a Paired t-test, the number of observations in each sample must be equal.


In Python, a Paired t-test can be performed using the scipy.stats.ttest_rel(...) function from the scipy package, which accepts the following parameters:

1.*a, b*: Sample observations. The arrays must have the same shape.

@@ -23,7 +19,6 @@ Paired t-tests are used to test the difference of means for two dependant variab

## Data Used


```{python}
import pandas as pd
@@ -38,8 +33,6 @@ df_pressure = pd.DataFrame(data)
```

## Paired t-test

The following code was used to test the comparison in Python. Note that the baseline null hypothesis goes in the "popmean" parameter.

```{python}
@@ -55,4 +48,4 @@ print(f"t = {t_stat}")
print(f"p-value = {p_value}")
```
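A self-contained sketch of the paired t-test above (the before/after readings are invented for illustration):

```python
import pandas as pd
from scipy import stats

# illustrative before/after blood pressure readings for 10 subjects
df_pressure = pd.DataFrame({
    "before": [120, 122, 143, 100, 109, 112, 118, 119, 132, 126],
    "after":  [122, 120, 141, 109, 109, 112, 115, 126, 132, 124],
})

# ttest_rel pairs the observations row by row
t_stat, p_value = stats.ttest_rel(df_pressure["before"], df_pressure["after"])
print(f"t = {t_stat}")
print(f"p-value = {p_value}")
```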
8 changes: 2 additions & 6 deletions python/two_samples_t_test.qmd
@@ -1,10 +1,8 @@
---
title: "Two Sample t-test"
title: "Two Sample t-test in Python"
output: html_document
---

# **Two Sample t-test in Python**

The Two Sample t-test is used to compare two independent samples against each other. In the Two Sample t-test, the mean of the first sample is compared against the mean of the second sample. In Python, a Two Sample t-test can be performed using the **stats** package from scipy.

### Data Used
@@ -25,8 +23,6 @@ data = {
df = pd.DataFrame(data)
```

## Python

If we have normally distributed data, we can use the classic Student's t-test. For a two sample test where the variances are not equal, we should use Welch's t-test. Both of those options are available in the scipy **stats** package.
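Both options can be sketched side by side (the sample values are invented for illustration):

```python
from scipy import stats

# illustrative samples with visibly different spreads
group1 = [8, 9, 10, 11, 12, 10, 9, 11]
group2 = [14, 9, 16, 7, 18, 6, 13, 12]

# classic Student's t-test assumes equal variances
t_student, p_student = stats.ttest_ind(group1, group2, equal_var=True)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.3f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3f}")
```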

### Student's T-Test
@@ -61,4 +57,4 @@ t_stat_welch, p_value_unequal_var = stats.ttest_ind(group1, group2, equal_var=Fa
print("\nWelch's T-Test assuming unequal variances:")
print(f"T-statistic: {t_stat_welch}")
print(f"P-value: {p_value_unequal_var}")
```
