Quantiles
head(apisrs) |> gt::gt()Vi scene = list(bgcolor = "lightgray")) fig
The Model
model_tidy <- model_ancova %>% tidy()
model_glance %>% gt()
The Model
model_tidy %>% gt()
The Model
add_row(term = "Total", df = sum(.$df), sumsq = sum(.$sumsq))
model_table %>% gt()
Type 1
get_anova_table() %>% gt()
Type 2
get_anova_table() %>% gt()
Type 3
get_anova_table() %>% gt()
Available R packages
We did not find any R package that delivers all the same measures as SAS at once. Therefore, we tried out multiple packages:
Impute with PMM
imp_pmm$imp$bmi
1 2 3 4 5
-1 27.2 27.2 22.0 22.7 22.5
-3 35.3 26.3 27.2 27.2 30.1
-4 25.5 20.4 27.5 22.5 22.0
-6 22.5 22.5 20.4 24.9 21.7
-10 22.7 27.2 27.4 22.0 29.6
-11 27.2 22.0 27.2 30.1 30.1
-12 27.5 22.5 29.6 25.5 27.5
-16 35.3 30.1 35.3 28.7 33.2
-21 21.7 27.2 30.1 27.2 27.2
+1 20.4 22.7 27.2 26.3 29.6
+3 29.6 28.7 35.3 26.3 22.0
+4 27.4 22.5 20.4 27.4 28.7
+6 20.4 22.5 20.4 27.4 26.3
+10 22.0 22.0 22.7 30.1 35.3
+11 33.2 26.3 22.0 27.2 28.7
+12 24.9 22.5 25.5 27.4 33.2
+16 22.7 22.0 29.6 30.1 27.2
+21 20.4 29.6 35.3 28.7 29.6
An alternative to the standard PMM is midastouch.
Impute with PMM
imp_pmms$imp$bmi
1 2 3 4 5
-1 29.6 30.1 30.1 33.2 35.3
-3 29.6 30.1 30.1 29.6 29.6
-4 21.7 21.7 21.7 21.7 24.9
-6 21.7 25.5 21.7 21.7 24.9
-10 22.0 22.0 22.0 35.3 27.4
-11 30.1 30.1 29.6 33.2 29.6
-12 24.9 22.7 25.5 22.7 27.4
-16 35.3 33.2 33.2 33.2 29.6
-21 29.6 33.2 30.1 33.2 29.6
+1 35.3 27.4 29.6 27.5 30.1
+3 35.3 30.1 30.1 22.5 30.1
+4 24.9 25.5 25.5 21.7 25.5
+6 21.7 25.5 25.5 24.9 25.5
+10 22.7 25.5 29.6 28.7 26.3
+11 35.3 27.4 29.6 22.5 30.1
+12 22.7 20.4 33.2 28.7 35.3
+16 35.3 27.4 30.1 22.5 30.1
+21 35.3 27.4 29.6 27.5 30.1
Simple Survey Designs
head(apisrs) |> gt::gt()
Summary Statistics on Complex Survey Designs
head(nhanes) |> gt::gt()
Simple Survey Designs
We will use the API dataset (“API Data Files” 2006), which contains a number of datasets based on different samples from a dataset of academic performance. Initially we will just cover the methodology with a simple random sample and a finite population correction to demonstrate functionality.
Summary Statistics on Complex Survey Designs
Most of the previous examples and notes still apply to more complex survey designs. Here we demonstrate using a dataset from NHANES (“National Health and Nutrition Examination Survey Data” 2010), which uses both stratification and clustering:
Blogs
Repository
The repository below provides examples of statistical methodology in different software and languages, along with a comparison of the results obtained and description of any discrepancies.
Meeting Minutes
MANOVA in Python
Example 39.6 Multivariate Analysis of Variance from SAS MANOVA User Guide
This example employs multivariate analysis of variance (MANOVA) to measure differences in the chemical characteristics of ancient pottery found at four kiln sites in Great Britain. The data are from Tubb, Parker, and Nickless (1980), as reported in Hand et al. (1994).
For each of 26 samples of pottery, the percentages of oxides of five metals are measured. The following statements create the data set and invoke the GLM procedure to perform a one-way MANOVA. Additionally, it is of interest to know whether the pottery from one site in Wales (Llanederyn) differs from the samples from other sites; a CONTRAST statement is used to test this hypothesis.
-import pandas as pd
from statsmodels.multivariate.manova import MANOVA
diff --git a/python/Rounding.html b/python/Rounding.html
index b4d8b6e8..c54f1035 100644
--- a/python/Rounding.html
+++ b/python/Rounding.html
@@ -217,7 +217,7 @@ Rounding in Python
Python has a built-in round() function that takes two arguments, number and ndigits, and returns number rounded to ndigits decimal places.
If ndigits is omitted, the function returns the nearest integer.
-
+
# For integers
x = 12
print(round(x))
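As a small illustration of the behaviour described above, the sketch below uses only the built-in round(). The input values are made up for the example and do not come from the original page. Note that Python rounds exact ties to the nearest even digit ("round half to even"), which can differ from a round-half-up convention.

```python
# Illustrative values only (not taken from the original page)
x = 12.3456

nearest_int = round(x)      # ndigits omitted: the result is an int
two_places = round(x, 2)    # keep two decimal places

# Exact ties are rounded to the nearest even integer
tie_down = round(2.5)
tie_up = round(3.5)

print(nearest_int, two_places, tie_down, tie_up)
```

If round-half-up behaviour is required, the decimal module with an explicit rounding mode is one option to consider.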
diff --git a/python/Summary_statistics.html b/python/Summary_statistics.html
index ffb65fef..0a28cb5f 100644
--- a/python/Summary_statistics.html
+++ b/python/Summary_statistics.html
@@ -222,7 +222,7 @@ Summary statistics
4. out (optional): An alternate output array where we can place the result.
5. overwrite_input (optional): Used to modify the input array.
6. keepdims (optional): Creates reduced axes with dimensions of one size.
-
+
import numpy as np
=[12, 25, 16, 50, 34, 29, 60, 86, 52, 39, 41]
diff --git a/python/ancova.html b/python/ancova.html
index d3c66c41..7ccbc8a9 100644
--- a/python/ancova.html
+++ b/python/ancova.html
@@ -232,7 +232,7 @@ sample_dataIntroduction
Data Summary
-
+
import pandas as pd
# Input data
@@ -250,7 +250,7 @@ Data Summary
df = pd.DataFrame(data)
-
+
# Descriptive statistics
summary_stats = df.describe()
@@ -279,7 +279,7 @@ Data Summary
Ancova in Python
In Python, Ancova can be performed using the statsmodels library, which builds on NumPy and SciPy.
-
+
import statsmodels.api as sm
import statsmodels.formula.api as smf
from tabulate import tabulate
@@ -313,7 +313,7 @@ Ancova in Python
Model: OLS Adj. R-squared: 0.639
Method: Least Squares F-statistic: 18.10
Date: Mon, 05 Aug 2024 Prob (F-statistic): 1.50e-06
-Time: 10:47:05 Log-Likelihood: -82.054
+Time: 13:27:21 Log-Likelihood: -82.054
No. Observations: 30 AIC: 172.1
Df Residuals: 26 BIC: 177.7
Df Model: 3
@@ -354,7 +354,7 @@ Ancova in Python
Please note that all values match with the corresponding R version, except for the AIC and BIC values, which differ slightly. This should be acceptable for most practical purposes in statistical analysis. Currently, there are ongoing discussions in the statsmodels community regarding the computational details of AIC and BIC.
The following code can be used to enforce complete consistency of AIC and BIC values with R outputs by adding 1 to the number of parameters:
-
+
import numpy as np
# Manual calculation of AIC and BIC to ensure consistency with R
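A minimal sketch of the adjustment, using only figures from the summary output above (log-likelihood -82.054, 30 observations, 3 model degrees of freedom plus an intercept). The function and variable names are illustrative and not taken from the original page. R's AIC for a linear model also counts the residual variance as an estimated parameter, which is where the extra 1 comes from.

```python
import math

# Values read from the OLS summary shown above
log_likelihood = -82.054
n_obs = 30
k_statsmodels = 3 + 1   # three slopes plus the intercept


def aic(llf, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * llf


def bic(llf, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * llf


# statsmodels convention: regression coefficients only
aic_sm = aic(log_likelihood, k_statsmodels)
bic_sm = bic(log_likelihood, k_statsmodels, n_obs)

# R convention: one additional parameter
k_r = k_statsmodels + 1
aic_r = aic(log_likelihood, k_r)
bic_r = bic(log_likelihood, k_r, n_obs)

print(round(aic_sm, 1), round(bic_sm, 1), round(aic_r, 1), round(bic_r, 1))
```

The first pair reproduces the AIC and BIC printed in the statsmodels summary; the second pair is what the adjusted parameter count yields.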
@@ -383,7 +383,7 @@ Ancova in Python
There are different types of anova computations. The statsmodels.stats.anova.anova_lm function supports Type 1, Type 2 and Type 3 sums of squares. The code to compute each type is shown below:
-
+
import statsmodels.formula.api as smf
import statsmodels.stats.anova as ssa
@@ -455,7 +455,7 @@ Ancova in Python
Type 1 Ancova in Python
-
+
print(tabulate(ancova_table_type_1, headers='keys', tablefmt='grid'))
+----+----------+-------+-------+---------+---------+----------+-------------+----------+----------+
@@ -470,7 +470,7 @@ Type 1 Ancova in P
Type 2 Ancova in Python
-
+
print(tabulate(ancova_table_type_2, headers='keys', tablefmt='grid'))
+----+----------+-------+-------+----------+---------+----------+-------------+----------+----------+
@@ -485,7 +485,7 @@ Type 2 Ancova in P
Type 3 Ancova in Python
-
+
print(tabulate(ancova_table_type_3, headers='keys', tablefmt='grid'))
+----+-----------+-------+-------+----------+---------+----------+-------------+----------+----------+
diff --git a/python/correlation.html b/python/correlation.html
index d0a942e3..57a52f47 100644
--- a/python/correlation.html
+++ b/python/correlation.html
@@ -228,7 +228,7 @@ Correlation Analysis in Python
Pearson’s Correlation
It is a parametric correlation test because it depends on the distribution of data. It measures the linear dependence between two variables x and y. It is the ratio between the covariance of two variables and the product of their standard deviations. The result always has a value between -1 and 1.
-
+
import pandas as pd
from scipy.stats import pearsonr
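To make the definition concrete, here is a hedged sketch that computes the coefficient directly as covariance divided by the product of the standard deviations. The helper name pearson_r and the data are hypothetical and only illustrate the formula; in practice scipy.stats.pearsonr also returns a p-value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/(n-1) factors of covariance and variances cancel, so sums suffice
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data, not from the original page
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_perfect = pearson_r(x, [2 * v for v in x])   # exact linear relationship
print(r, r_perfect)
```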
@@ -250,7 +250,7 @@ Pearson’s Correlati
Kendall’s Rank
A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. The Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of −1) rank between the two variables.
-
+
import pandas as pd
from scipy.stats import kendalltau
@@ -273,7 +273,7 @@ Kendall’s Rank
Spearman’s Rank
Spearman’s Rank Correlation is a statistical measure of the strength and direction of the monotonic relationship between two continuous variables. It is computed on the ranks of the observations rather than on their raw values. It is denoted by the symbol “rho” (ρ) and can take values between -1 and +1. A positive value of rho indicates that there exists a positive relationship between the two variables, while a negative value of rho indicates a negative relationship. A rho value of 0 indicates no association between the two variables.
-
+
import pandas as pd
from scipy.stats import spearmanr
diff --git a/python/kruskal_wallis.html b/python/kruskal_wallis.html
index 58bb5e1a..c0f86e5a 100644
--- a/python/kruskal_wallis.html
+++ b/python/kruskal_wallis.html
@@ -226,7 +226,7 @@ Kruskal Wallis in Python
Introduction
The Kruskal-Wallis test is a non-parametric equivalent to the one-way ANOVA. For this example, the data used is a subset of the iris dataset, testing for difference in sepal width between species of flower.
-
+
import pandas as pd
# Define the data
@@ -266,7 +266,7 @@ Introduction
Implementing Kruskal-Wallis in Python
The Kruskal-Wallis test can be implemented in Python using the kruskal function from scipy.stats. The null hypothesis is that the samples are from identical populations.
-
+
from scipy.stats import kruskal
# Separate the data for each species
diff --git a/python/linear_regression.html b/python/linear_regression.html
index 60e3a587..c42358e8 100644
--- a/python/linear_regression.html
+++ b/python/linear_regression.html
@@ -229,7 +229,7 @@ Linear Regression
Descriptive Statistics
The first step is to obtain the simple descriptive statistics for the numeric variables of the htwt data, and one-way frequencies for categorical variables. This is accomplished by employing the summary function. There are 237 participants who are from 13.9 to 25 years old. It is a cross-sectional study, with each participant having one observation. We can use this data set to examine the relationship of participants’ height to their age and sex.
-
+
import pandas as pd
import statsmodels.api as sm
@@ -237,7 +237,7 @@ Descriptive Statist
htwt = pd.read_csv("../data/htwt.csv")
In order to create a regression model to demonstrate the relationship between age and height for females, we first need to create a flag variable identifying females and an interaction variable between age and female gender flag.
-
+
htwt['female'] = (htwt['SEX'] == 'f').astype(int)
htwt['fem_age'] = htwt['AGE'] * htwt['female']
htwt.head()
@@ -319,7 +319,7 @@ Descriptive Statist
Regression Analysis
Next, we fit a regression model, representing the relationships between gender, age, height and the interaction variable created in the datastep above. We again use a where statement to restrict the analysis to those who are less than or equal to 19 years old. We use the clb option to get a 95% confidence interval for each of the parameters in the model. The model that we are fitting is height = b0 + b1 x female + b2 x age + b3 x fem_age + e
-
+
X = htwt[['female', 'AGE', 'fem_age']][htwt['AGE'] <= 19]
X = sm.add_constant(X)
Y = htwt['HEIGHT'][htwt['AGE'] <= 19]
@@ -357,7 +357,7 @@ Regression Analysis
Time:
-10:47:08
+13:27:24
Log-Likelihood:
-534.17
diff --git a/python/logistic_regression.html b/python/logistic_regression.html
index 44fdd8a7..4ba4ff1c 100644
--- a/python/logistic_regression.html
+++ b/python/logistic_regression.html
@@ -267,7 +267,7 @@ Logistic Regression
Imports
-
+
#data manipulation
import pandas as pd
import numpy as np
@@ -289,7 +289,7 @@ Background
Example : Lung cancer data
Data source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.
These data were sourced from the R package {survival} and have been downloaded and stored in the data folder.
-
+
# importing and prepare
lung2 = pd.read_csv("../data/lung_cancer.csv")
@@ -302,7 +302,7 @@ Example : Lung cancer data
Logistic Regression Modelling
Let’s further prepare our data for modelling by selecting the explanatory variables and the dependent variable. The Python packages that we are aware of require complete data (i.e. no missing values), so for convenience in demonstrating these methods we will drop rows with missing values.
-
+
x_vars = ["age", "sex", "ph.ecog", "meal.cal"]
y_var = "wt_grp"
@@ -316,7 +316,7 @@ Logistic Regression Modelling
Statsmodels package
We will use the sm.Logit() method to fit our logistic regression model.
-
+
#intercept column
x_sm = sm.add_constant(x)
@@ -333,7 +333,7 @@ Statsmodels packageStatsmodels package
Model fitting
In addition to the information contained in the summary, we can display the model coefficients as odds ratios:
-
+
print("Odds ratios for statsmodels logistic regression:")
print(np.exp(lr_sm.params))
@@ -364,7 +364,7 @@ Model fitting
We can also provide the 95% confidence intervals (alpha = 0.05) for the odds ratios:
-
+
print("CI at 5% for statsmodels logistic regression:")
print(np.exp(lr_sm.conf_int(alpha = 0.05)))
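The relationship between a fitted coefficient and its odds ratio interval can be sketched with the standard library alone. The coefficient and standard error below are hypothetical, and the sketch assumes a Wald interval built on the log-odds scale with a normal quantile and then exponentiated, which is our reading of what exponentiating conf_int() amounts to.

```python
import math
from statistics import NormalDist

# Hypothetical log-odds coefficient and standard error (not model output)
beta = 0.5
std_err = 0.2
alpha = 0.05

# Two-sided normal quantile for the chosen alpha
z = NormalDist().inv_cdf(1 - alpha / 2)

# Exponentiate the estimate and the interval limits to get odds ratios
odds_ratio = math.exp(beta)
ci_lower = math.exp(beta - z * std_err)
ci_upper = math.exp(beta + z * std_err)

print(odds_ratio, ci_lower, ci_upper)
```

Because the interval is symmetric on the log scale, it is not symmetric around the odds ratio itself.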
@@ -381,7 +381,7 @@ Model fitting
Prediction
Let’s use our trained model to make a weight loss prediction about a new patient.
-
+
# new female, symptomatic but completely ambulatory patient consuming 2500 calories
new_pt = pd.DataFrame({
    "age": [56],
@@ -419,11 +419,11 @@ Scikit-learn Package<
It’s important to note that l2 regularisation is applied by default in the scikit-learn implementation of logistic regression. More recent releases of this package include an option to have no regularisation penalty.
-
+
lr_sk = LogisticRegression(penalty=None).fit(x, y)
Unlike the statsmodels approach, scikit-learn doesn’t have a summary method for the model, but you can extract some of the model parameters as follows:
-
+
print("Intercept for scikit learn logistic regression:")
print(lr_sk.intercept_)
print("Odds ratios for scikit learn logistic regression:")
@@ -439,7 +439,7 @@ Scikit-learn Package<
Prediction
Using the same new patient example, we can use our logistic regression model to make a prediction. The predict_proba method is used to return the probability for each class. If you are interested in viewing the prediction for y = 1, i.e. the probability of weight loss, then you can select the second probability as shown:
-
+
print("Probability of weight loss using the scikit-learn package:")
print(lr_sk.predict_proba(new_pt)[:,1])
diff --git a/python/one_sample_t_test.html b/python/one_sample_t_test.html
index 2273bfac..e14720e8 100644
--- a/python/one_sample_t_test.html
+++ b/python/one_sample_t_test.html
@@ -233,7 +233,7 @@ One Sample t-t
Data Used
-
+
import pandas as pd
# Create sample data
@@ -248,7 +248,7 @@ Data Used
t-test
The following code was used to test the comparison in Python. Note that the baseline null hypothesis goes in the “popmean” parameter.
-
+
import pandas as pd
from scipy import stats
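For reference, the statistic behind a one sample t-test can be written out by hand. The sketch below uses hypothetical data and a hypothetical null value (named popmean to mirror the parameter mentioned above), and it stops at the t statistic and degrees of freedom because the standard library has no t distribution for the p-value.

```python
import math
from statistics import mean, stdev

# Hypothetical sample and null-hypothesis mean (not the page's data)
sample = [2.0, 4.0, 6.0]
popmean = 3.0

n = len(sample)
std_err = stdev(sample) / math.sqrt(n)    # standard error of the mean
t_stat = (mean(sample) - popmean) / std_err
dof = n - 1                               # degrees of freedom

print(t_stat, dof)
```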
diff --git a/python/paired_t_test.html b/python/paired_t_test.html
index 52409d3a..b0c86823 100644
--- a/python/paired_t_test.html
+++ b/python/paired_t_test.html
@@ -235,7 +235,7 @@ Paired t-test in P
Data Used
-
+
import pandas as pd
# Create sample data
@@ -250,7 +250,7 @@ Data Used
Paired t-test
The following code was used to test the comparison in Python. Note that the baseline null hypothesis goes in the “popmean” parameter.
-
+
import pandas as pd
from scipy import stats
diff --git a/python/skewness_kurtosis.html b/python/skewness_kurtosis.html
index ec02a4c2..4d80ab0b 100644
--- a/python/skewness_kurtosis.html
+++ b/python/skewness_kurtosis.html
@@ -258,7 +258,7 @@ Skewness and Kurtosis in Python
Skewness measures the amount of asymmetry in a distribution, while kurtosis describes the “tailedness” of the curve. These measures are frequently used to assess the normality of the data. There are several methods to calculate these measures. In Python, pandas, scipy.stats.skew and scipy.stats.kurtosis can be used.
Data Used
-
+
import pandas as pd
from scipy.stats import skew, kurtosis
@@ -288,7 +288,7 @@ Skewness
\]
All three skewness measures are unbiased under normality. The three methods are illustrated in the following code:
-
+
# Skewness
type1_skew = skew(df['points'])
type2_skew = df['points'].skew()
@@ -319,7 +319,7 @@ Kurtosis
\[b_2 = m_4/s^4-3 = (g_2 + 3)(1-1/n)^2-3\]
Only \(G_2\) (corresponding to type 2) is unbiased under normality. The three methods are illustrated in the following code:
-
+
# Kurtosis
type1_kurt = kurtosis(df['points'])
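The three definitions can also be computed by hand, which makes the relationship between them explicit. This is a sketch under the assumption that the page follows the usual g2, G2 and b2 definitions (type 1, 2 and 3 respectively); the helper name kurtosis_types and the data are hypothetical.

```python
def kurtosis_types(values):
    """Return (g2, G2, b2): type 1, type 2 and type 3 excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    g2 = m4 / m2 ** 2 - 3                           # type 1
    G2 = ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))   # type 2
    b2 = (g2 + 3) * (1 - 1 / n) ** 2 - 3            # type 3
    return g2, G2, b2


# Hypothetical data, not the 'points' column used on the page
data = [1, 2, 3, 4, 10]
g2, G2, b2 = kurtosis_types(data)
print(g2, G2, b2)
```

With its default arguments, scipy.stats.kurtosis corresponds to the type 1 value, while the pandas kurt() method corresponds to type 2.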
diff --git a/python/two_samples_t_test.html b/python/two_samples_t_test.html
index 8c14b426..298b34d8 100644
--- a/python/two_samples_t_test.html
+++ b/python/two_samples_t_test.html
@@ -235,7 +235,7 @@ Two Sample t-test in Python
Data Used
The following data was used in this example.
-
+
import pandas as pd
import numpy as np
from scipy import stats
@@ -257,7 +257,7 @@ Student’s T-Test
Code
The following code was used to test the comparison in Python. Note that we must separate the single variable into two variables to satisfy the scipy stats package syntax.
-
+
# Separate data into two groups
group1 = df[df['trt_grp'] == 'placebo']['WtGain']
group2 = df[df['trt_grp'] == 'treatment']['WtGain']
@@ -281,7 +281,7 @@ Welch’s T-Test
Code
The following code was used to test the comparison in Python using Welch’s t-test.
-
+
# Perform Welch's t-test assuming unequal variances
t_stat_welch, p_value_unequal_var = stats.ttest_ind(group1, group2, equal_var=False)
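To show what equal_var=False changes, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand. The helper name welch_t and the two groups are hypothetical and are not the WtGain data used above.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, dof


# Hypothetical groups with clearly different variances
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_welch, dof_welch = welch_t(group_a, group_b)
print(t_welch, dof_welch)
```

Unlike Student's test, the degrees of freedom are generally not an integer and shrink as the group variances become more unequal.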
diff --git a/search.json b/search.json
index 706cfe56..13538b9a 100644
--- a/search.json
+++ b/search.json
@@ -557,7 +557,7 @@
"href": "python/linear_regression.html",
"title": "Linear Regression",
"section": "",
- "text": "To demonstrate the use of linear regression we examine a dataset that illustrates the relationship between Height and Weight in a group of 237 teen-aged boys and girls. The dataset is available here and is imported to the workspace.\n\nDescriptive Statistics\nThe first step is to obtain the simple descriptive statistics for the numeric variables of htwt data, and one-way frequencies for categorical variables. This is accomplished by employing summary function. There are 237 participants who are from 13.9 to 25 years old. It is a cross-sectional study, with each participant having one observation. We can use this data set to examine the relationship of participants’ height to their age and sex.\n\nimport pandas as pd\nimport statsmodels.api as sm\n\n# Importing CSV\nhtwt = pd.read_csv(\"../data/htwt.csv\")\n\nIn order to create a regression model to demonstrate the relationship between age and height for females, we first need to create a flag variable identifying females and an interaction variable between age and female gender flag.\n\nhtwt['female'] = (htwt['SEX'] == 'f').astype(int)\nhtwt['fem_age'] = htwt['AGE'] * htwt['female']\nhtwt.head()\n\n\n\n\n\n\n\n\nROW\nSEX\nAGE\nHEIGHT\nWEIGHT\nfemale\nfem_age\n\n\n\n\n0\n1\nf\n14.3\n56.3\n85.0\n1\n14.3\n\n\n1\n2\nf\n15.5\n62.3\n105.0\n1\n15.5\n\n\n2\n3\nf\n15.3\n63.3\n108.0\n1\n15.3\n\n\n3\n4\nf\n16.1\n59.0\n92.0\n1\n16.1\n\n\n4\n5\nf\n19.1\n62.5\n112.5\n1\n19.1\n\n\n\n\n\n\n\n\n\nRegression Analysis\nNext, we fit a regression model, representing the relationships between gender, age, height and the interaction variable created in the datastep above. We again use a where statement to restrict the analysis to those who are less than or equal to 19 years old. We use the clb option to get a 95% confidence interval for each of the parameters in the model. 
The model that we are fitting is height = b0 + b1 x female + b2 x age + b3 x fem_age + e\n\nX = htwt[['female', 'AGE', 'fem_age']][htwt['AGE'] <= 19]\nX = sm.add_constant(X)\nY = htwt['HEIGHT'][htwt['AGE'] <= 19]\n\nmodel = sm.OLS(Y, X).fit()\n\nmodel.summary()\n\n\nOLS Regression Results\n\n\nDep. Variable:\nHEIGHT\nR-squared:\n0.460\n\n\nModel:\nOLS\nAdj. R-squared:\n0.452\n\n\nMethod:\nLeast Squares\nF-statistic:\n60.93\n\n\nDate:\nMon, 05 Aug 2024\nProb (F-statistic):\n1.50e-28\n\n\nTime:\n10:47:08\nLog-Likelihood:\n-534.17\n\n\nNo. Observations:\n219\nAIC:\n1076.\n\n\nDf Residuals:\n215\nBIC:\n1090.\n\n\nDf Model:\n3\n\n\n\n\nCovariance Type:\nnonrobust\n\n\n\n\n\n\n\n\n\n\n\ncoef\nstd err\nt\nP>|t|\n[0.025\n0.975]\n\n\nconst\n28.8828\n2.873\n10.052\n0.000\n23.219\n34.547\n\n\nfemale\n13.6123\n4.019\n3.387\n0.001\n5.690\n21.534\n\n\nAGE\n2.0313\n0.178\n11.435\n0.000\n1.681\n2.381\n\n\nfem_age\n-0.9294\n0.248\n-3.750\n0.000\n-1.418\n-0.441\n\n\n\n\n\n\n\n\nOmnibus:\n1.300\nDurbin-Watson:\n2.284\n\n\nProb(Omnibus):\n0.522\nJarque-Bera (JB):\n0.981\n\n\nSkew:\n-0.133\nProb(JB):\n0.612\n\n\nKurtosis:\n3.191\nCond. No.\n450.\n\n\n\nNotes:[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n\n\nFrom the coefficients table b0,b1,b2,b3 are estimated as b0=28.88 b1=13.61 b2=2.03 b3=-0.92942\nThe resulting regression model for height, age and gender based on the available data is height=28.8828 + 13.6123 x female + 2.0313 x age -0.9294 x fem_age"
+ "text": "To demonstrate the use of linear regression we examine a dataset that illustrates the relationship between Height and Weight in a group of 237 teen-aged boys and girls. The dataset is available here and is imported to the workspace.\n\nDescriptive Statistics\nThe first step is to obtain the simple descriptive statistics for the numeric variables of htwt data, and one-way frequencies for categorical variables. This is accomplished by employing summary function. There are 237 participants who are from 13.9 to 25 years old. It is a cross-sectional study, with each participant having one observation. We can use this data set to examine the relationship of participants’ height to their age and sex.\n\nimport pandas as pd\nimport statsmodels.api as sm\n\n# Importing CSV\nhtwt = pd.read_csv(\"../data/htwt.csv\")\n\nIn order to create a regression model to demonstrate the relationship between age and height for females, we first need to create a flag variable identifying females and an interaction variable between age and female gender flag.\n\nhtwt['female'] = (htwt['SEX'] == 'f').astype(int)\nhtwt['fem_age'] = htwt['AGE'] * htwt['female']\nhtwt.head()\n\n\n\n\n\n\n\n\nROW\nSEX\nAGE\nHEIGHT\nWEIGHT\nfemale\nfem_age\n\n\n\n\n0\n1\nf\n14.3\n56.3\n85.0\n1\n14.3\n\n\n1\n2\nf\n15.5\n62.3\n105.0\n1\n15.5\n\n\n2\n3\nf\n15.3\n63.3\n108.0\n1\n15.3\n\n\n3\n4\nf\n16.1\n59.0\n92.0\n1\n16.1\n\n\n4\n5\nf\n19.1\n62.5\n112.5\n1\n19.1\n\n\n\n\n\n\n\n\n\nRegression Analysis\nNext, we fit a regression model, representing the relationships between gender, age, height and the interaction variable created in the datastep above. We again use a where statement to restrict the analysis to those who are less than or equal to 19 years old. We use the clb option to get a 95% confidence interval for each of the parameters in the model. 
The model that we are fitting is height = b0 + b1 x female + b2 x age + b3 x fem_age + e\n\nX = htwt[['female', 'AGE', 'fem_age']][htwt['AGE'] <= 19]\nX = sm.add_constant(X)\nY = htwt['HEIGHT'][htwt['AGE'] <= 19]\n\nmodel = sm.OLS(Y, X).fit()\n\nmodel.summary()\n\n\nOLS Regression Results\n\n\nDep. Variable:\nHEIGHT\nR-squared:\n0.460\n\n\nModel:\nOLS\nAdj. R-squared:\n0.452\n\n\nMethod:\nLeast Squares\nF-statistic:\n60.93\n\n\nDate:\nMon, 05 Aug 2024\nProb (F-statistic):\n1.50e-28\n\n\nTime:\n13:27:24\nLog-Likelihood:\n-534.17\n\n\nNo. Observations:\n219\nAIC:\n1076.\n\n\nDf Residuals:\n215\nBIC:\n1090.\n\n\nDf Model:\n3\n\n\n\n\nCovariance Type:\nnonrobust\n\n\n\n\n\n\n\n\n\n\n\ncoef\nstd err\nt\nP>|t|\n[0.025\n0.975]\n\n\nconst\n28.8828\n2.873\n10.052\n0.000\n23.219\n34.547\n\n\nfemale\n13.6123\n4.019\n3.387\n0.001\n5.690\n21.534\n\n\nAGE\n2.0313\n0.178\n11.435\n0.000\n1.681\n2.381\n\n\nfem_age\n-0.9294\n0.248\n-3.750\n0.000\n-1.418\n-0.441\n\n\n\n\n\n\n\n\nOmnibus:\n1.300\nDurbin-Watson:\n2.284\n\n\nProb(Omnibus):\n0.522\nJarque-Bera (JB):\n0.981\n\n\nSkew:\n-0.133\nProb(JB):\n0.612\n\n\nKurtosis:\n3.191\nCond. No.\n450.\n\n\n\nNotes:[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n\n\nFrom the coefficients table b0,b1,b2,b3 are estimated as b0=28.88 b1=13.61 b2=2.03 b3=-0.92942\nThe resulting regression model for height, age and gender based on the available data is height=28.8828 + 13.6123 x female + 2.0313 x age -0.9294 x fem_age"
},
{
"objectID": "python/kruskal_wallis.html",
@@ -854,242 +854,291 @@
"text": "Results:\nThe output has a p-value 0.0016829 \\(< 0.05\\) (chosen level of significance). Hence, we reject the null hypothesis and conclude that the proportion of death is significantly different from 19%."
},
{
- "objectID": "R/nonpara_wilcoxon_ranksum.html",
- "href": "R/nonpara_wilcoxon_ranksum.html",
- "title": "Wilcoxon Rank Sum (Mann Whitney-U) in R",
+ "objectID": "R/Accelerated_Failure_time_model.html",
+ "href": "R/Accelerated_Failure_time_model.html",
+ "title": "Accelerated Failure Time Model",
"section": "",
- "text": "Wilcoxon rank sum test, or equivalently, Mann-Whitney U-test is a rank based non-parametric method. The aim is to compare two independent groups of observations. Under certain scenarios, it can be thought of as a test for median differences, however this is only valid when: 1) both samples are independent and identically distributed (same dispersion, same shape, not necessarily normal) and 2) are symmetric around their medians.\nGenerally, with two samples of observations (A and B), the test uses the mean of each possible pair of observations in each group (including the pair of each value with itself) to test if the probability that (A>B) > probability (B>A).\nThe Wilcoxon rank sum test is often presented alongside a Hodges-Lehmann estimate of the pseudo-median (the median of the Walsh averages), and an associated confidence interval for the pseudo-median.\nA tie in the data exists when an observation in group A, has the same result as an observation in group B.\n\n\n\nMethods and Formulae\nMann Whitney is not about medians in general\nRelationship between walsh averages and WRS\nHodges Lehmann Problems\n\n\n\n\nThere are three main implementations of the Wilcoxon rank sum test in R.\n\nstats::wilcox.test\nasht::wmwTest()\ncoin::wilcox_test()\n\nThe stats package implements various classic statistical tests, including Wilcoxon rank sum test. Although this is arguably the most commonly applied package, this one does not account for any ties in the data.\n\n# x, y are two unpaired vectors. Do not necessary need to be of the same length.\nstats::wilcox.test(x, y, paired = F)\n\n\n\n\nData source: Table 30.4, Kirkwood BR. and Sterne JAC. Essentials of medical statistics. Second Edition. 
ISBN 978-0-86542-871-3\nComparison of birth weights (kg) of children born to 15 non-smokers with those of children born to 14 heavy smokers.\n\n# bw_ns: non smokers\n# bw_s: smokers\nbw_ns <- c(3.99, 3.89, 3.6, 3.73, 3.31, \n 3.7, 4.08, 3.61, 3.83, 3.41, \n 4.13, 3.36, 3.54, 3.51, 2.71)\nbw_s <- c(3.18, 2.74, 2.9, 3.27, 3.65, \n 3.42, 3.23, 2.86, 3.6, 3.65, \n 3.69, 3.53, 2.38, 2.34)\n\nCan visualize the data on two histograms. Red lines indicate the location of medians.\n\npar(mfrow =c(1,2))\nhist(bw_ns, main = 'Birthweight: non-smokers')\nabline(v = median(bw_ns), col = 'red', lwd = 2)\nhist(bw_s, main = 'Birthweight: smokers')\nabline(v = median(bw_s), col = 'red', lwd = 2)\n\n\n\n\n\n\n\n\nIt is possible to see that for non-smokers, the median birthweight is higher than those of smokers. Now we can formally test it with wilcoxon rank sum test.\nThe default test is two-sided with confidence level of 0.95, and does continuity correction.\n\n# default is two sided\nstats::wilcox.test(bw_ns, bw_s, paired = F)\n\nWarning in wilcox.test.default(bw_ns, bw_s, paired = F): cannot compute exact\np-value with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: bw_ns and bw_s\nW = 164.5, p-value = 0.01001\nalternative hypothesis: true location shift is not equal to 0\n\n\nWe can also carry out a one-sided test, by specifying alternative = greater (if the first item is greater than the second).\n\n# default is two sided\nstats::wilcox.test(bw_ns, bw_s, paired = F, alternative = 'greater')\n\nWarning in wilcox.test.default(bw_ns, bw_s, paired = F, alternative =\n\"greater\"): cannot compute exact p-value with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: bw_ns and bw_s\nW = 164.5, p-value = 0.005003\nalternative hypothesis: true location shift is greater than 0"
+ "text": "The accelerated failure time model is a parametric survival analysis technique used to model the relationship between the time to event of interest (e.g., time to failure) and a set of predictor variables. It assumes that the covariates have a multiplicative effect on the time to the event. In other words, the time to event is accelerated or decelerated by a factor that depends on the values of the covariates. This differs from the Cox proportional hazards model, which assumes that covariates have a multiplicative effect on the hazard rate, not the time to the event."
},
{
- "objectID": "R/nonpara_wilcoxon_ranksum.html#useful-references",
- "href": "R/nonpara_wilcoxon_ranksum.html#useful-references",
- "title": "Wilcoxon Rank Sum (Mann Whitney-U) in R",
- "section": "",
- "text": "Methods and Formulae\nMann Whitney is not about medians in general\nRelationship between walsh averages and WRS\nHodges Lehmann Problems"
+ "objectID": "R/Accelerated_Failure_time_model.html#mathematical-expression",
+ "href": "R/Accelerated_Failure_time_model.html#mathematical-expression",
+ "title": "Accelerated Failure Time Model",
+ "section": "Mathematical Expression",
+ "text": "Mathematical Expression\nMathematically, the AFT model can be expressed as:\n\\(\\log(T) = X\\beta + \\sigma\\varepsilon\\)\nWhere:\n\nT is the survival time\nlog(T) is the logarithm of the survival time\nX is a matrix of predictor variables\nβ is a vector of coefficients representing the effects of the predictor variables on the logarithm of the survival time\nσ is a scaler quantity representing the scale parameter, which influences the variability of the error term ε in the model.\nε is the error term assumed to follow a specific distribution (e.g., normal distribution for log-normal, extreme value distribution for Weibull) that corresponds to the chosen parametric form of the model."
},
{
- "objectID": "R/nonpara_wilcoxon_ranksum.html#available-r-packages",
- "href": "R/nonpara_wilcoxon_ranksum.html#available-r-packages",
- "title": "Wilcoxon Rank Sum (Mann Whitney-U) in R",
- "section": "",
- "text": "There are three main implementations of the Wilcoxon rank sum test in R.\n\nstats::wilcox.test\nasht::wmwTest()\ncoin::wilcox_test()\n\nThe stats package implements various classic statistical tests, including Wilcoxon rank sum test. Although this is arguably the most commonly applied package, this one does not account for any ties in the data.\n\n# x, y are two unpaired vectors. Do not necessary need to be of the same length.\nstats::wilcox.test(x, y, paired = F)"
+ "objectID": "R/Accelerated_Failure_time_model.html#example-of-aft-model-using-log-normal-distribution",
+ "href": "R/Accelerated_Failure_time_model.html#example-of-aft-model-using-log-normal-distribution",
+ "title": "Accelerated Failure Time Model",
+ "section": "Example of AFT model using “Log-Normal Distribution”",
+ "text": "Example of AFT model using “Log-Normal Distribution”\n\nlibrary(survival)\nattach(lung)\n# Fit an AFT model using lognormal distribution\nmodel_aft <- survreg(Surv(time, status) ~ age + sex + ph.ecog, data = lung, dist = \"lognormal\")\n# Model summary\nsummary(model_aft)\n\n\nCall:\nsurvreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung, \n dist = \"lognormal\")\n Value Std. Error z p\n(Intercept) 6.49479 0.58276 11.14 < 2e-16\nage -0.01918 0.00833 -2.30 0.02126\nsex 0.52195 0.15278 3.42 0.00063\nph.ecog -0.35557 0.10331 -3.44 0.00058\nLog(scale) 0.02823 0.05596 0.50 0.61391\n\nScale= 1.03 \n\nLog Normal distribution\nLoglik(model)= -1146.9 Loglik(intercept only)= -1163.2\n Chisq= 32.59 on 3 degrees of freedom, p= 3.9e-07 \nNumber of Newton-Raphson Iterations: 3 \nn=227 (1 observation deleted due to missingness)\n\n\nThe summary output will provide the estimated coefficients, standard errors, and p-values for each predictor variable."
},
{
- "objectID": "R/nonpara_wilcoxon_ranksum.html#example-birth-weight",
- "href": "R/nonpara_wilcoxon_ranksum.html#example-birth-weight",
- "title": "Wilcoxon Rank Sum (Mann Whitney-U) in R",
- "section": "",
- "text": "Data source: Table 30.4, Kirkwood BR. and Sterne JAC. Essentials of medical statistics. Second Edition. ISBN 978-0-86542-871-3\nComparison of birth weights (kg) of children born to 15 non-smokers with those of children born to 14 heavy smokers.\n\n# bw_ns: non smokers\n# bw_s: smokers\nbw_ns <- c(3.99, 3.89, 3.6, 3.73, 3.31, \n 3.7, 4.08, 3.61, 3.83, 3.41, \n 4.13, 3.36, 3.54, 3.51, 2.71)\nbw_s <- c(3.18, 2.74, 2.9, 3.27, 3.65, \n 3.42, 3.23, 2.86, 3.6, 3.65, \n 3.69, 3.53, 2.38, 2.34)\n\nCan visualize the data on two histograms. Red lines indicate the location of medians.\n\npar(mfrow =c(1,2))\nhist(bw_ns, main = 'Birthweight: non-smokers')\nabline(v = median(bw_ns), col = 'red', lwd = 2)\nhist(bw_s, main = 'Birthweight: smokers')\nabline(v = median(bw_s), col = 'red', lwd = 2)\n\n\n\n\n\n\n\n\nIt is possible to see that for non-smokers, the median birthweight is higher than those of smokers. Now we can formally test it with wilcoxon rank sum test.\nThe default test is two-sided with confidence level of 0.95, and does continuity correction.\n\n# default is two sided\nstats::wilcox.test(bw_ns, bw_s, paired = F)\n\nWarning in wilcox.test.default(bw_ns, bw_s, paired = F): cannot compute exact\np-value with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: bw_ns and bw_s\nW = 164.5, p-value = 0.01001\nalternative hypothesis: true location shift is not equal to 0\n\n\nWe can also carry out a one-sided test, by specifying alternative = greater (if the first item is greater than the second).\n\n# default is two sided\nstats::wilcox.test(bw_ns, bw_s, paired = F, alternative = 'greater')\n\nWarning in wilcox.test.default(bw_ns, bw_s, paired = F, alternative =\n\"greater\"): cannot compute exact p-value with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: bw_ns and bw_s\nW = 164.5, p-value = 0.005003\nalternative hypothesis: true location shift is greater than 0"
+ "objectID": "R/Accelerated_Failure_time_model.html#acceleration-factor-calculation",
+ "href": "R/Accelerated_Failure_time_model.html#acceleration-factor-calculation",
+ "title": "Accelerated Failure Time Model",
+ "section": "Acceleration Factor Calculation",
+ "text": "Acceleration Factor Calculation\n\n# Compute acceleration factor (exponentiated coefficients)\nacceleration_factor <- exp(coef(model_aft))\nacceleration_factor\n\n(Intercept) age sex ph.ecog \n661.6830913 0.9810009 1.6853157 0.7007762"
},
{
- "objectID": "R/cmh.html",
- "href": "R/cmh.html",
- "title": "CMH Test",
- "section": "",
- "text": "The CMH procedure tests for conditional independence in partial contingency tables for a 2 x 2 x K design. However, it can be generalized to tables of X x Y x K dimensions.\n\n\nWe did not find any R package that delivers all the same measures as SAS at once. Therefore, we tried out multiple packages:\n\n\n\n\n\n\n\n\nPackage\nGeneral Association\nRow Means Differ\nNonzero Correlation\nM-H Odds Ratio\nHomogeneity Test\nNote\n\n\n\n\nstats::mantelhaen.test()\n✅\n❌\n❌\n✅\n❌\nWorks well for 2x2xK\n\n\nvcdExtra::CMHtest()\n✅\n✅\n✅\n❌\n❌\nProblems with sparsity, potential bug\n\n\nepiDisplay::mhor()\n❌\n❌\n❌\n✅\n✅\nOR are limited to 2x2xK design\n\n\n\n\n\n\n\n\n\n\nWe will use the CDISC Pilot data set, which is publicly available on the PHUSE Test Data Factory repository. We applied very basic filtering conditions upfront (see below) and this data set served as the basis of the examples to follow.\n\n\n# A tibble: 231 × 36\n STUDYID SITEID SITEGR1 USUBJID TRTSDT TRTEDT TRTP TRTPN AGE AGEGR1\n <chr> <chr> <chr> <chr> <date> <date> <chr> <dbl> <dbl> <chr> \n 1 CDISCP… 701 701 01-701… 2014-01-02 2014-07-02 Plac… 0 63 <65 \n 2 CDISCP… 701 701 01-701… 2012-08-05 2012-09-01 Plac… 0 64 <65 \n 3 CDISCP… 701 701 01-701… 2013-07-19 2014-01-14 Xano… 81 71 65-80 \n 4 CDISCP… 701 701 01-701… 2014-03-18 2014-03-31 Xano… 54 74 65-80 \n 5 CDISCP… 701 701 01-701… 2014-07-01 2014-12-30 Xano… 81 77 65-80 \n 6 CDISCP… 701 701 01-701… 2013-02-12 2013-03-09 Plac… 0 85 >80 \n 7 CDISCP… 701 701 01-701… 2014-01-01 2014-07-09 Xano… 54 68 65-80 \n 8 CDISCP… 701 701 01-701… 2012-09-07 2012-09-16 Xano… 54 81 >80 \n 9 CDISCP… 701 701 01-701… 2012-11-30 2013-01-23 Xano… 54 84 >80 \n10 CDISCP… 701 701 01-701… 2014-03-12 2014-09-09 Plac… 0 52 <65 \n# ℹ 221 more rows\n# ℹ 26 more variables: AGEGR1N <dbl>, RACE <chr>, RACEN <dbl>, SEX <chr>,\n# ITTFL <chr>, EFFFL <chr>, COMP24FL <chr>, AVISIT <chr>, AVISITN <dbl>,\n# VISIT <chr>, VISITNUM <dbl>, ADY <dbl>, ADT <date>, PARAMCD <chr>,\n# PARAM 
<chr>, PARAMN <dbl>, AVAL <dbl>, ANL01FL <chr>, DTYPE <chr>,\n# AWRANGE <chr>, AWTARGET <dbl>, AWTDIFF <dbl>, AWLO <dbl>, AWHI <dbl>,\n# AWU <chr>, QSSEQ <dbl>\n\n\n\n\n\n\n\nThis is included in a base installation of R, as part of the stats package. Requires inputting data as a table or as vectors.\n\nmantelhaen.test(x = data$TRTP, y = data$SEX, z = data$AGEGR1)\n\n\n Cochran-Mantel-Haenszel test\n\ndata: data$TRTP and data$SEX and data$AGEGR1\nCochran-Mantel-Haenszel M^2 = 2.482, df = 2, p-value = 0.2891\n\n\n\n\n\nThe vcdExtra package provides results for the generalized CMH test, for each of the three model it outputs the Chi-square value and the respective p-values. Flexible data input methods available: table or formula (aggregated level data in a data frame).\n\nlibrary(vcdExtra)\n\nLoading required package: vcd\n\n\nLoading required package: grid\n\n\nLoading required package: gnm\n\n\n\nAttaching package: 'vcdExtra'\n\n\nThe following object is masked from 'package:dplyr':\n\n summarise\n\n# Formula: Freq ~ X + Y | K\nCMHtest(Freq ~ TRTP + SEX | AGEGR1 , data=data, overall=TRUE) \n\n$`AGEGR1:<65`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:<65 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.33168 1 0.56467\nrmeans Row mean scores differ 1.52821 2 0.46575\ncmeans Col mean scores differ 0.33168 1 0.56467\ngeneral General association 1.52821 2 0.46575\n\n\n$`AGEGR1:>80`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:>80 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.39433 1 0.53003\nrmeans Row mean scores differ 3.80104 2 0.14949\ncmeans Col mean scores differ 0.39433 1 0.53003\ngeneral General association 3.80104 2 0.14949\n\n\n$`AGEGR1:65-80`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:65-80 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.52744 1 0.46768\nrmeans Row mean scores differ 0.62921 2 0.73008\ncmeans Col mean scores differ 
0.52744 1 0.46768\ngeneral General association 0.62921 2 0.73008\n\n\n$ALL\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n Overall tests, controlling for all strata \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.00086897 1 0.97648\nrmeans Row mean scores differ 2.482 2 0.28909\ncmeans Col mean scores differ 0.00086897 1 0.97648\ngeneral General association 2.482 2 0.28909\n\n\n\n\n\nTo get the M-H common odds ratio and the homogeneity test, the epiDisplay package can be used.\n\nlibrary(epiDisplay) \nmhor(x,y,k, graph = FALSE)\n\n\n\nTo tackle the issue with sparse data it is recommended that a use of solve() is replaced with MASS::ginv. This was implemented in the forked version of vcdExtra which can be installed from here:\n\ndevtools::install_github(\"mstackhouse/vcdExtra\")\n\nHowever, also the forked version for the vcdExtra package works only until a certain level of sparsity. In case of our data, it still works if the data are stratified by the pooled Site ID (SITEGR1 - 11 unique values) whereas using the unpooled Site ID (SITEID - 17 unique values) also throws an error. Note: this version is not up to date and sometimes calculates degrees of freedom incorrectly."
+ "objectID": "R/Accelerated_Failure_time_model.html#interpretation",
+ "href": "R/Accelerated_Failure_time_model.html#interpretation",
+ "title": "Accelerated Failure Time Model",
+ "section": "Interpretation",
+ "text": "Interpretation\n\nFor age, an acceleration factor \\(< 1\\) indicates that each one-unit increase in age multiplies the survival time by about 0.98 (a 2% decrease in survival time).\nFor sex, an acceleration factor \\(> 1\\) indicates that the survival time for females (sex = 2) is about 1.69 times that for males (sex = 1), i.e. about 69% longer.\nFor ph.ecog, an acceleration factor \\(< 1\\) indicates that each one-unit increase in ph.ecog multiplies the survival time by about 0.70 (a 30% decrease in survival time)."
},
{
- "objectID": "R/cmh.html#available-r-packages",
- "href": "R/cmh.html#available-r-packages",
- "title": "CMH Test",
- "section": "",
- "text": "We did not find any R package that delivers all the same measures as SAS at once. Therefore, we tried out multiple packages:\n\n\n\n\n\n\n\n\nPackage\nGeneral Association\nRow Means Differ\nNonzero Correlation\nM-H Odds Ratio\nHomogeneity Test\nNote\n\n\n\n\nstats::mantelhaen.test()\n✅\n❌\n❌\n✅\n❌\nWorks well for 2x2xK\n\n\nvcdExtra::CMHtest()\n✅\n✅\n✅\n❌\n❌\nProblems with sparsity, potential bug\n\n\nepiDisplay::mhor()\n❌\n❌\n❌\n✅\n✅\nOR are limited to 2x2xK design"
+ "objectID": "R/Accelerated_Failure_time_model.html#plotting-aft-model-graphically",
+ "href": "R/Accelerated_Failure_time_model.html#plotting-aft-model-graphically",
+ "title": "Accelerated Failure Time Model",
+ "section": "Plotting AFT Model Graphically",
+ "text": "Plotting AFT Model Graphically\n\nsuppressPackageStartupMessages({\nlibrary(survival)\nlibrary(survminer)\nlibrary(ggplot2)\n})\n\n# Fit the AFT model on the lung dataset\naft_model <- survreg(Surv(time, status) ~ age + sex + ph.ecog, data = lung, dist = \"lognormal\")\n\n# Create a new data frame with predicted survival times\ndf <- data.frame(time = lung$time, age = lung$age, sex = lung$sex, ph.ecog = lung$ph.ecog, status=lung$status)\ndf$surv_times <- predict(aft_model, newdata = df)\n\n# Plot the survival curves based on the AFT model\nggsurvplot(survfit(Surv(surv_times, status) ~ 1, data = df),\n data = df, xlab = \"Time\", ylab = \"Survival Probability\")\n\n\n\n\n\n\n\n\nThe survival curve plotted from the AFT model's predicted survival times for the lung dataset illustrates how the probability of survival changes as time progresses. Because the curve is fitted with ~ 1, it is a single overall curve and is not split by covariate level."
},
{
- "objectID": "R/cmh.html#data-used",
- "href": "R/cmh.html#data-used",
- "title": "CMH Test",
- "section": "",
- "text": "We will use the CDISC Pilot data set, which is publicly available on the PHUSE Test Data Factory repository. We applied very basic filtering conditions upfront (see below) and this data set served as the basis of the examples to follow.\n\n\n# A tibble: 231 × 36\n STUDYID SITEID SITEGR1 USUBJID TRTSDT TRTEDT TRTP TRTPN AGE AGEGR1\n <chr> <chr> <chr> <chr> <date> <date> <chr> <dbl> <dbl> <chr> \n 1 CDISCP… 701 701 01-701… 2014-01-02 2014-07-02 Plac… 0 63 <65 \n 2 CDISCP… 701 701 01-701… 2012-08-05 2012-09-01 Plac… 0 64 <65 \n 3 CDISCP… 701 701 01-701… 2013-07-19 2014-01-14 Xano… 81 71 65-80 \n 4 CDISCP… 701 701 01-701… 2014-03-18 2014-03-31 Xano… 54 74 65-80 \n 5 CDISCP… 701 701 01-701… 2014-07-01 2014-12-30 Xano… 81 77 65-80 \n 6 CDISCP… 701 701 01-701… 2013-02-12 2013-03-09 Plac… 0 85 >80 \n 7 CDISCP… 701 701 01-701… 2014-01-01 2014-07-09 Xano… 54 68 65-80 \n 8 CDISCP… 701 701 01-701… 2012-09-07 2012-09-16 Xano… 54 81 >80 \n 9 CDISCP… 701 701 01-701… 2012-11-30 2013-01-23 Xano… 54 84 >80 \n10 CDISCP… 701 701 01-701… 2014-03-12 2014-09-09 Plac… 0 52 <65 \n# ℹ 221 more rows\n# ℹ 26 more variables: AGEGR1N <dbl>, RACE <chr>, RACEN <dbl>, SEX <chr>,\n# ITTFL <chr>, EFFFL <chr>, COMP24FL <chr>, AVISIT <chr>, AVISITN <dbl>,\n# VISIT <chr>, VISITNUM <dbl>, ADY <dbl>, ADT <date>, PARAMCD <chr>,\n# PARAM <chr>, PARAMN <dbl>, AVAL <dbl>, ANL01FL <chr>, DTYPE <chr>,\n# AWRANGE <chr>, AWTARGET <dbl>, AWTDIFF <dbl>, AWLO <dbl>, AWHI <dbl>,\n# AWU <chr>, QSSEQ <dbl>"
+ "objectID": "R/Accelerated_Failure_time_model.html#example-of-aft-model-using-weibull-distribution",
+ "href": "R/Accelerated_Failure_time_model.html#example-of-aft-model-using-weibull-distribution",
+ "title": "Accelerated Failure Time Model",
+ "section": "Example of AFT model using “Weibull Distribution”",
+ "text": "Example of AFT model using “Weibull Distribution”\n\n# Fit an AFT model using weibull distribution\nmodel_aft_wb <- survreg(Surv(futime, fustat) ~ age + resid.ds + rx, data = ovarian, dist = \"weibull\")\n# Model summary\nsummary(model_aft_wb)\n\n\nCall:\nsurvreg(formula = Surv(futime, fustat) ~ age + resid.ds + rx, \n data = ovarian, dist = \"weibull\")\n Value Std. Error z p\n(Intercept) 10.5634 1.3810 7.65 2e-14\nage -0.0661 0.0190 -3.48 0.0005\nresid.ds -0.5002 0.3799 -1.32 0.1879\nrx 0.5152 0.3236 1.59 0.1114\nLog(scale) -0.6577 0.2384 -2.76 0.0058\n\nScale= 0.518 \n\nWeibull distribution\nLoglik(model)= -87.9 Loglik(intercept only)= -98\n Chisq= 20.17 on 3 degrees of freedom, p= 0.00016 \nNumber of Newton-Raphson Iterations: 6 \nn= 26"
},
{
- "objectID": "R/cmh.html#example-code",
- "href": "R/cmh.html#example-code",
- "title": "CMH Test",
- "section": "",
- "text": "This is included in a base installation of R, as part of the stats package. Requires inputting data as a table or as vectors.\n\nmantelhaen.test(x = data$TRTP, y = data$SEX, z = data$AGEGR1)\n\n\n Cochran-Mantel-Haenszel test\n\ndata: data$TRTP and data$SEX and data$AGEGR1\nCochran-Mantel-Haenszel M^2 = 2.482, df = 2, p-value = 0.2891\n\n\n\n\n\nThe vcdExtra package provides results for the generalized CMH test, for each of the three model it outputs the Chi-square value and the respective p-values. Flexible data input methods available: table or formula (aggregated level data in a data frame).\n\nlibrary(vcdExtra)\n\nLoading required package: vcd\n\n\nLoading required package: grid\n\n\nLoading required package: gnm\n\n\n\nAttaching package: 'vcdExtra'\n\n\nThe following object is masked from 'package:dplyr':\n\n summarise\n\n# Formula: Freq ~ X + Y | K\nCMHtest(Freq ~ TRTP + SEX | AGEGR1 , data=data, overall=TRUE) \n\n$`AGEGR1:<65`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:<65 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.33168 1 0.56467\nrmeans Row mean scores differ 1.52821 2 0.46575\ncmeans Col mean scores differ 0.33168 1 0.56467\ngeneral General association 1.52821 2 0.46575\n\n\n$`AGEGR1:>80`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:>80 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.39433 1 0.53003\nrmeans Row mean scores differ 3.80104 2 0.14949\ncmeans Col mean scores differ 0.39433 1 0.53003\ngeneral General association 3.80104 2 0.14949\n\n\n$`AGEGR1:65-80`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:65-80 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.52744 1 0.46768\nrmeans Row mean scores differ 0.62921 2 0.73008\ncmeans Col mean scores differ 0.52744 1 0.46768\ngeneral General association 0.62921 2 0.73008\n\n\n$ALL\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n Overall tests, controlling for all 
strata \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.00086897 1 0.97648\nrmeans Row mean scores differ 2.482 2 0.28909\ncmeans Col mean scores differ 0.00086897 1 0.97648\ngeneral General association 2.482 2 0.28909\n\n\n\n\n\nTo get the M-H common odds ratio and the homogeneity test, the epiDisplay package can be used.\n\nlibrary(epiDisplay) \nmhor(x,y,k, graph = FALSE)\n\n\n\nTo tackle the issue with sparse data it is recommended that a use of solve() is replaced with MASS::ginv. This was implemented in the forked version of vcdExtra which can be installed from here:\n\ndevtools::install_github(\"mstackhouse/vcdExtra\")\n\nHowever, also the forked version for the vcdExtra package works only until a certain level of sparsity. In case of our data, it still works if the data are stratified by the pooled Site ID (SITEGR1 - 11 unique values) whereas using the unpooled Site ID (SITEID - 17 unique values) also throws an error. Note: this version is not up to date and sometimes calculates degrees of freedom incorrectly."
+ "objectID": "R/Accelerated_Failure_time_model.html#acceleration-factor-calculation-1",
+ "href": "R/Accelerated_Failure_time_model.html#acceleration-factor-calculation-1",
+ "title": "Accelerated Failure Time Model",
+ "section": "Acceleration Factor Calculation",
+ "text": "Acceleration Factor Calculation\n\n# Compute acceleration factor (exponentiated coefficients)\nacceleration_factor_wb <- exp(coef(model_aft_wb))\nacceleration_factor_wb\n\n (Intercept) age resid.ds rx \n3.869157e+04 9.360366e-01 6.063911e-01 1.673914e+00"
},
{
- "objectID": "R/summary_skew_kurt.html",
- "href": "R/summary_skew_kurt.html",
- "title": "Skewness/Kurtosis",
- "section": "",
- "text": "Skewness measures the the amount of asymmetry in a distribution, while Kurtosis describes the “tailedness” of the curve. These measures are frequently used to assess the normality of the data. There are several methods to calculate these measures. In R, there are at least four different packages that contain functions for Skewness and Kurtosis. This write-up will examine the following packages: e1071, moments, procs, and sasLM.\n\n\nThe following data was used in this example.\n\n# Create sample data\ndat <- tibble::tribble(\n ~team, ~points, ~assists,\n \"A\", 10, 2,\n \"A\", 17, 5,\n \"A\", 17, 6,\n \"A\", 18, 3,\n \"A\", 15, 0,\n \"B\", 10, 2,\n \"B\", 14, 5,\n \"B\", 13, 4,\n \"B\", 29, 0,\n \"B\", 25, 2,\n \"C\", 12, 1,\n \"C\", 30, 1,\n \"C\", 34, 3,\n \"C\", 12, 4,\n \"C\", 11, 7 \n)\n\n\n\n\nBase R and the stats package have no native functions for Skewness and Kurtosis. It is therefore necessary to use a packaged function to calculate these statistics. The packages examined use three different methods of calculating Skewness, and four different methods for calculating Kurtosis. Of the available packages, the functions in the e1071 package provide the most flexibility, and have options for three of the different methodologies.\n\n\nThe e1071 package contains miscellaneous statistical functions from the Probability Theory Group at the Vienna University of Technology. The package includes functions for both Skewness and Kurtosis, and each function has a “type” parameter to specify the method. There are three available methods for Skewness, and three methods for Kurtosis. 
A portion of the documentation for these functions is included below:\n\n\nThe documentation for the skewness() function describes three types of skewness calculations: Joanes and Gill (1998) discusses three methods for estimating skewness:\n\nType 1: This is the typical definition used in many older textbooks\n\n\\[g_1 = m_1/m_2^{3/2}\\]\n\nType 2: Used in SAS and SPSS\n\\[\nG_1 = g_1\\sqrt{n(n-1)}/(n-2)\n\\]\nType 3: Used in MINITAB and BMDP\n\\[\nb_1 = m_3/s^3 = g_1((n-1)/n)^{3/2}\n\\]\n\nAll three skewness measures are unbiased under normality. The three methods are illustrated in the following code:\n\ntype1 <- e1071::skewness(dat$points, type = 1)\nstringr::str_glue(\"Skewness - Type 1: {type1}\")\n\nSkewness - Type 1: 0.905444204379853\n\ntype2 <- e1071::skewness(dat$points, type = 2)\nstringr::str_glue(\"Skewness - Type 2: {type2}\")\n\nSkewness - Type 2: 1.00931792987094\n\ntype3 <- e1071::skewness(dat$points, type = 3)\nstringr::str_glue(\"Skewness - Type 3: {type3}\")\n\nSkewness - Type 3: 0.816426058828937\n\n\nThe default for the e1071 skewness() function is Type 3.\n\n\n\nThe documentation for the kurtosis() function describes three types of kurtosis calculations: Joanes and Gill (1998) discuss three methods for estimating kurtosis:\n\nType 1: This is the typical definition used in many older textbooks\n\n\\[g_2 = m_4/m_2^{2}-3\\]\n\nType 2: Used in SAS and SPSS\n\\[G_2 = ((n+1)g_2+6)*\\frac{(n-1)}{(n-2)(n-3)}\\]\nType 3: Used in MINITAB and BMDP\n\\[b_2 = m_4/s^4-3 = (g_2 + 3)(1-1/n)^2-3\\]\n\nOnly \\(G_2\\) (corresponding to type 2) is unbiased under normality. 
The three methods are illustrated in the following code:\n\n # Kurtosis - Type 1\ntype1 <- e1071::kurtosis(dat$points, type = 1)\nstringr::str_glue(\"Kurtosis - Type 1: {type1}\")\n\nKurtosis - Type 1: -0.583341077124784\n\n# Kurtosis - Type 2\ntype2 <- e1071::kurtosis(dat$points, type = 2)\nstringr::str_glue(\"Kurtosis - Type 2: {type2}\")\n\nKurtosis - Type 2: -0.299156418435587\n\n# Kurtosis - Type 3\ntype3 <- e1071::kurtosis(dat$points, type = 3)\nstringr::str_glue(\"Kurtosis - Type 3: {type3}\")\n\nKurtosis - Type 3: -0.894821560517589\n\n\nThe default for the e1071 kurtosis() function is Type 3.\n\n\n\n\nThe moments package is a well-known package with a variety of statistical functions. The package contains functions for both Skewness and Kurtosis. But these functions provide no “type” option. The skewness() function in the moments package corresponds to Type 1 above. The kurtosis() function uses a Pearson’s measure of Kurtosis, which corresponds to none of the three types in the e1071 package.\n\n library(moments)\n\n # Skewness - Type 1\n moments::skewness(dat$points)\n\n[1] 0.9054442\n\n # [1] 0.9054442\n \n # Kurtosis - Pearson's measure\n moments::kurtosis(dat$points)\n\n[1] 2.416659\n\n # [1] 2.416659\n\nNote that neither of the functions from the moments package match SAS.\n\n\n\nThe procs package proc_means() function was written specifically to match SAS, and produces a Type 2 Skewness and Type 2 Kurtosis. This package also produces a data frame output, instead of a scalar value.\n\n library(procs)\n\n # Skewness and Kurtosis - Type 2 \n proc_means(dat, var = points,\n stats = v(skew, kurt))\n\n# A tibble: 1 × 5\n TYPE FREQ VAR SKEW KURT\n <dbl> <int> <chr> <dbl> <dbl>\n1 0 15 points 1.01 -0.299\n\n\nViewer Output:\n\n\n\n\n\n\n\n\n\n\n\n\nThe sasLM package was also written specifically to match SAS. 
The Skewness() function produces a Type 2 Skewness, and the Kurtosis() function a Type 2 Kurtosis.\n\n library(sasLM)\n\n # Skewness - Type 2\n Skewness(dat$points)\n\n[1] 1.009318\n\n # [1] 1.009318\n \n # Kurtosis - Type 2\n Kurtosis(dat$points)\n\n[1] -0.2991564\n\n # [1] -0.2991564"
+ "objectID": "R/Accelerated_Failure_time_model.html#interpretation-1",
+ "href": "R/Accelerated_Failure_time_model.html#interpretation-1",
+ "title": "Accelerated Failure Time Model",
+ "section": "Interpretation",
+ "text": "Interpretation\n\nFor age, an acceleration factor of 0.94 indicates that each one-unit increase in age multiplies the survival time by 0.94 (about a 6% decrease in the survival time).\nFor residual disease status, an acceleration factor of 0.61 suggests that a one-unit increase in resid.ds is associated with about a 39% decrease in survival time.\nFor rx, an acceleration factor of 1.67 suggests a 67% longer survival time for patients in treatment group 2 (rx = 2) compared to the reference group (rx = 1)."
},
{
- "objectID": "R/summary_skew_kurt.html#data-used",
- "href": "R/summary_skew_kurt.html#data-used",
- "title": "Skewness/Kurtosis",
- "section": "",
- "text": "The following data was used in this example.\n\n# Create sample data\ndat <- tibble::tribble(\n ~team, ~points, ~assists,\n \"A\", 10, 2,\n \"A\", 17, 5,\n \"A\", 17, 6,\n \"A\", 18, 3,\n \"A\", 15, 0,\n \"B\", 10, 2,\n \"B\", 14, 5,\n \"B\", 13, 4,\n \"B\", 29, 0,\n \"B\", 25, 2,\n \"C\", 12, 1,\n \"C\", 30, 1,\n \"C\", 34, 3,\n \"C\", 12, 4,\n \"C\", 11, 7 \n)"
+ "objectID": "R/Accelerated_Failure_time_model.html#survival-curve-plotting-on-ovarian-dataset",
+ "href": "R/Accelerated_Failure_time_model.html#survival-curve-plotting-on-ovarian-dataset",
+ "title": "Accelerated Failure Time Model",
+ "section": "Survival Curve Plotting on ‘Ovarian’ Dataset",
+ "text": "Survival Curve Plotting on ‘Ovarian’ Dataset\n\n# Fit the AFT model (Weibull distribution) on the ovarian data\nmodel_aft <- survreg(Surv(futime, fustat) ~ age + resid.ds + rx, data = ovarian, dist = \"weibull\")\n\n# Create a prediction grid over age, holding resid.ds and rx at their sample means\nplot_data <- with(ovarian, data.frame(age = seq(min(age), max(age), length.out = 100),\n resid.ds = mean(resid.ds),\n rx = mean(rx)))\n\n# Predict survival times based on the AFT model\nplot_data$survival <- predict(model_aft, newdata = plot_data)\n\n# Plot the predicted survival time against age\nggplot(plot_data, aes(x = age, y = survival, color = factor(rx), linetype = factor(rx))) +\n geom_line() +\n labs(x = \"Age\", y = \"Predicted Survival Time\", color = \"Treatment Group (rx)\", linetype = \"Treatment Group (rx)\") +\n scale_linetype_manual(values = c(\"solid\", \"dashed\", \"dotted\")) +\n scale_color_manual(values = c(\"blue\", \"red\", \"green\"))\n\n\n\n\n\n\n\n\nThe plot shows how the survival time predicted by the Weibull AFT model for the ovarian dataset changes as age increases, with resid.ds and rx held at their sample means."
},
{
- "objectID": "R/summary_skew_kurt.html#package-examination",
- "href": "R/summary_skew_kurt.html#package-examination",
- "title": "Skewness/Kurtosis",
- "section": "",
- "text": "Base R and the stats package have no native functions for Skewness and Kurtosis. It is therefore necessary to use a packaged function to calculate these statistics. The packages examined use three different methods of calculating Skewness, and four different methods for calculating Kurtosis. Of the available packages, the functions in the e1071 package provide the most flexibility, and have options for three of the different methodologies.\n\n\nThe e1071 package contains miscellaneous statistical functions from the Probability Theory Group at the Vienna University of Technology. The package includes functions for both Skewness and Kurtosis, and each function has a “type” parameter to specify the method. There are three available methods for Skewness, and three methods for Kurtosis. A portion of the documentation for these functions is included below:\n\n\nThe documentation for the skewness() function describes three types of skewness calculations: Joanes and Gill (1998) discusses three methods for estimating skewness:\n\nType 1: This is the typical definition used in many older textbooks\n\n\\[g_1 = m_1/m_2^{3/2}\\]\n\nType 2: Used in SAS and SPSS\n\\[\nG_1 = g_1\\sqrt{n(n-1)}/(n-2)\n\\]\nType 3: Used in MINITAB and BMDP\n\\[\nb_1 = m_3/s^3 = g_1((n-1)/n)^{3/2}\n\\]\n\nAll three skewness measures are unbiased under normality. 
The three methods are illustrated in the following code:\n\ntype1 <- e1071::skewness(dat$points, type = 1)\nstringr::str_glue(\"Skewness - Type 1: {type1}\")\n\nSkewness - Type 1: 0.905444204379853\n\ntype2 <- e1071::skewness(dat$points, type = 2)\nstringr::str_glue(\"Skewness - Type 2: {type2}\")\n\nSkewness - Type 2: 1.00931792987094\n\ntype3 <- e1071::skewness(dat$points, type = 3)\nstringr::str_glue(\"Skewness - Type 3: {type3}\")\n\nSkewness - Type 3: 0.816426058828937\n\n\nThe default for the e1071 skewness() function is Type 3.\n\n\n\nThe documentation for the kurtosis() function describes three types of kurtosis calculations: Joanes and Gill (1998) discuss three methods for estimating kurtosis:\n\nType 1: This is the typical definition used in many older textbooks\n\n\\[g_2 = m_4/m_2^{2}-3\\]\n\nType 2: Used in SAS and SPSS\n\\[G_2 = ((n+1)g_2+6)*\\frac{(n-1)}{(n-2)(n-3)}\\]\nType 3: Used in MINITAB and BMDP\n\\[b_2 = m_4/s^4-3 = (g_2 + 3)(1-1/n)^2-3\\]\n\nOnly \\(G_2\\) (corresponding to type 2) is unbiased under normality. The three methods are illustrated in the following code:\n\n # Kurtosis - Type 1\ntype1 <- e1071::kurtosis(dat$points, type = 1)\nstringr::str_glue(\"Kurtosis - Type 1: {type1}\")\n\nKurtosis - Type 1: -0.583341077124784\n\n# Kurtosis - Type 2\ntype2 <- e1071::kurtosis(dat$points, type = 2)\nstringr::str_glue(\"Kurtosis - Type 2: {type2}\")\n\nKurtosis - Type 2: -0.299156418435587\n\n# Kurtosis - Type 3\ntype3 <- e1071::kurtosis(dat$points, type = 3)\nstringr::str_glue(\"Kurtosis - Type 3: {type3}\")\n\nKurtosis - Type 3: -0.894821560517589\n\n\nThe default for the e1071 kurtosis() function is Type 3.\n\n\n\n\nThe moments package is a well-known package with a variety of statistical functions. The package contains functions for both Skewness and Kurtosis. But these functions provide no “type” option. The skewness() function in the moments package corresponds to Type 1 above. 
The kurtosis() function uses a Pearson’s measure of Kurtosis, which corresponds to none of the three types in the e1071 package.\n\n library(moments)\n\n # Skewness - Type 1\n moments::skewness(dat$points)\n\n[1] 0.9054442\n\n # [1] 0.9054442\n \n # Kurtosis - Pearson's measure\n moments::kurtosis(dat$points)\n\n[1] 2.416659\n\n # [1] 2.416659\n\nNote that neither of the functions from the moments package match SAS.\n\n\n\nThe procs package proc_means() function was written specifically to match SAS, and produces a Type 2 Skewness and Type 2 Kurtosis. This package also produces a data frame output, instead of a scalar value.\n\n library(procs)\n\n # Skewness and Kurtosis - Type 2 \n proc_means(dat, var = points,\n stats = v(skew, kurt))\n\n# A tibble: 1 × 5\n TYPE FREQ VAR SKEW KURT\n <dbl> <int> <chr> <dbl> <dbl>\n1 0 15 points 1.01 -0.299\n\n\nViewer Output:\n\n\n\n\n\n\n\n\n\n\n\n\nThe sasLM package was also written specifically to match SAS. The Skewness() function produces a Type 2 Skewness, and the Kurtosis() function a Type 2 Kurtosis.\n\n library(sasLM)\n\n # Skewness - Type 2\n Skewness(dat$points)\n\n[1] 1.009318\n\n # [1] 1.009318\n \n # Kurtosis - Type 2\n Kurtosis(dat$points)\n\n[1] -0.2991564\n\n # [1] -0.2991564"
+ "objectID": "R/Accelerated_Failure_time_model.html#conclusion",
+ "href": "R/Accelerated_Failure_time_model.html#conclusion",
+ "title": "Accelerated Failure Time Model",
+ "section": "Conclusion",
+ "text": "Conclusion\nUnlike Cox proportional hazards models, AFT models assume that survival times follow a parametric distribution (e.g., Weibull, log-logistic, log-normal) and model the effect of covariates directly on the time scale."
},
{
- "objectID": "R/ancova.html",
- "href": "R/ancova.html",
- "title": "Ancova",
+ "objectID": "R/manova.html",
+ "href": "R/manova.html",
+ "title": "Multivariate Analysis of Variance in R",
"section": "",
- "text": "In this example, we’re looking at Analysis of Covariance. ANCOVA is typically used to analyse treatment differences, to see examples of prediction models go to the simple linear regression page."
+ "text": "For a detailed description of MANOVA including assumptions see Renesh Bedre\nExample 39.6 Multivariate Analysis of Variance from SAS MANOVA User Guide\nThis example employs multivariate analysis of variance (MANOVA) to measure differences in the chemical characteristics of ancient pottery found at four kiln sites in Great Britain. The data are from Tubb, Parker, and Nickless (1980), as reported in Hand et al. (1994).\nFor each of 26 samples of pottery, the percentages of oxides of five metals are measured. The following statements create the data set and perform a one-way MANOVA. Additionally, it is of interest to know whether the pottery from one site in Wales (Llanederyn) differs from the samples from other sites.\n\nlibrary(tidyverse)\n\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nlibrary(knitr)\nlibrary(emmeans)\n\nWelcome to emmeans.\nCaution: You lose important information if you filter this package's results.\nSee '? 
untidy'\n\nknitr::opts_chunk$set(echo = TRUE, cache = TRUE)\npottery <- read.csv(\"../data/manova1.csv\")\npottery\n\n site al fe mg ca na\n1 Llanederyn 14.4 7.00 4.30 0.15 0.51\n2 Llanederyn 13.8 7.08 3.43 0.12 0.17\n3 Llanederyn 14.6 7.09 3.88 0.13 0.20\n4 Llanederyn 11.5 6.37 5.64 0.16 0.14\n5 Llanederyn 13.8 7.06 5.34 0.20 0.20\n6 Llanederyn 10.9 6.26 3.47 0.17 0.22\n7 Llanederyn 10.1 4.26 4.26 0.20 0.18\n8 Llanederyn 11.6 5.78 5.91 0.18 0.16\n9 Llanederyn 11.1 5.49 4.52 0.29 0.30\n10 Llanederyn 13.4 6.92 7.23 0.28 0.20\n11 Llanederyn 12.4 6.13 5.69 0.22 0.54\n12 Llanederyn 13.1 6.64 5.51 0.31 0.24\n13 Llanederyn 12.7 6.69 4.45 0.20 0.22\n14 Llanederyn 12.5 6.44 3.94 0.22 0.23\n15 Caldicot 11.8 5.44 3.94 0.30 0.04\n16 Caldicot 11.6 5.39 3.77 0.29 0.06\n17 IslandThorns 18.3 1.28 0.67 0.03 0.03\n18 IslandThorns 15.8 2.39 0.63 0.01 0.04\n19 IslandThorns 18.0 1.50 0.67 0.01 0.06\n20 IslandThorns 18.0 1.88 0.68 0.01 0.04\n21 IslandThorns 20.8 1.51 0.72 0.07 0.10\n22 AshleyRails 17.7 1.12 0.56 0.06 0.06\n23 AshleyRails 18.3 1.14 0.67 0.06 0.05\n24 AshleyRails 16.7 0.92 0.53 0.01 0.05\n25 AshleyRails 14.8 2.74 0.67 0.03 0.05\n26 AshleyRails 19.1 1.64 0.60 0.10 0.03\n\n\n1 Perform one way MANOVA\nResponse ID for ANOVA is in the order 1=al, 2=fe, 3=mg, 4=ca, 5=na.\nWe are testing H0: group mean vectors are the same for all groups or they don't differ significantly vs\nH1: At least one of the group mean vectors is different from the rest.\n\ndep_vars <- cbind(pottery$al,pottery$fe,pottery$mg, pottery$ca, pottery$na)\nfit <- manova(dep_vars ~ pottery$site)\nsummary.aov(fit)\n\n Response 1 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 175.610 58.537 26.669 1.627e-07 ***\nResiduals 22 48.288 2.195 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n Response 2 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 134.222 44.741 89.883 1.679e-12 ***\nResiduals 22 10.951 0.498 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n Response 3 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 103.35 34.450 49.12 6.452e-10 ***\nResiduals 22 15.43 0.701 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n Response 4 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 0.204703 0.068234 29.157 7.546e-08 ***\nResiduals 22 0.051486 0.002340 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n Response 5 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 0.25825 0.086082 9.5026 0.0003209 ***\nResiduals 22 0.19929 0.009059 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\n‘summary(fit)’ outputs the MANOVA test of an overall site effect.\nP<0.001 suggests there is an overall difference between the chemical composition of samples from different sites.\n\nsummary(fit)\n\n Df Pillai approx F num Df den Df Pr(>F) \npottery$site 3 1.5539 4.2984 15 60 2.413e-05 ***\nResiduals 22 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\n2 Now we test to see if the Llanederyn site is different to the other sites\nNOTE: interest may now lie in using pre-planned contrast statements to investigate whether one site differs when compared to the average of the others. You would imagine this could be done using the ‘contrast’ function with something like the code below; however, this result does not match the SAS user guide and so looks to be doing a different analysis. SUGGEST THIS IS NOT USED UNTIL MORE RESEARCH INTO THIS METHOD CAN BE PERFORMED. 
One alternative suggestion is to perform a linear discriminant analysis (LDA).\n\nmanova(dep_vars ~ pottery$site) %>% \n emmeans(\"site\") %>% \n contrast(method=list(\n \"Llanederyn vs other sites\"= c(\"Llanederyn\"=-3, \"Caldicot\"=1, \"IslandThorns\"=1, \"AshleyRails\"=1)))\n\n contrast estimate SE df t.ratio p.value\n Llanederyn vs other sites 1.51 0.661 22 2.288 0.0321\n\nResults are averaged over the levels of: rep.meas \n\n\nNOTE: if you feel you can help with the above discrepancy please contribute to the CAMIS repo by following the instructions on the contributions page."
},
{
- "objectID": "R/ancova.html#introduction",
- "href": "R/ancova.html#introduction",
- "title": "Ancova",
+ "objectID": "R/nparestimate.html",
+ "href": "R/nparestimate.html",
+ "title": "Non-parametric point estimation",
"section": "",
- "text": "In this example, we’re looking at Analysis of Covariance. ANCOVA is typically used to analyse treatment differences, to see examples of prediction models go to the simple linear regression page."
- },
- {
- "objectID": "R/ancova.html#data-summary",
- "href": "R/ancova.html#data-summary",
- "title": "Ancova",
- "section": "Data Summary",
- "text": "Data Summary\n\ndf_sas %>% glimpse()\n\nRows: 30\nColumns: 3\n$ drug <fct> A, A, A, A, A, A, A, A, A, A, D, D, D, D, D, D, D, D, D, D, F, F,…\n$ pre <dbl> 11, 8, 5, 14, 19, 6, 10, 6, 11, 3, 6, 6, 7, 8, 18, 8, 19, 8, 5, 1…\n$ post <dbl> 6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9, 1, 9, 1…\n\ndf_sas %>% summary()\n\n drug pre post \n A:10 Min. : 3.00 Min. : 0.00 \n D:10 1st Qu.: 7.00 1st Qu.: 2.00 \n F:10 Median :10.50 Median : 7.00 \n Mean :10.73 Mean : 7.90 \n 3rd Qu.:13.75 3rd Qu.:12.75 \n Max. :21.00 Max. :23.00"
+ "text": "The Hodges-Lehmann estimator (Hodges and Lehmann 1962) provides a point estimate which is associated with the Wilcoxon rank sum statistic based on location shift. This is typically used for the 2-sample comparison with small sample size. Note: The Hodges-Lehmann estimator estimates the median of the differences and not the difference of the medians. The corresponding distribution-free confidence interval is also based on the Wilcoxon rank sum statistic (Moses).\nThere are several packages covering this functionality. However, we will focus on the wilcox.test function implemented in base R. The {pairwiseCI} package provides further resources to derive various types of confidence intervals for the pairwise comparison case. This package is very flexible and uses the functions of related packages.\nHodges, J. L. and Lehmann, E. L. (1962) Rank methods for combination of independent experiments in analysis of variance. Annals of Mathematical Statistics, 33, 482-4."
},
{
- "objectID": "R/ancova.html#the-model",
- "href": "R/ancova.html#the-model",
- "title": "Ancova",
- "section": "The Model",
- "text": "The Model\n\nmodel_ancova <- lm(post ~ drug + pre, data = df_sas)\nmodel_glance <- model_ancova %>% glance()\nmodel_tidy <- model_ancova %>% tidy()\nmodel_glance %>% gt()\n\n\n\n\n\n\n\nr.squared\nadj.r.squared\nsigma\nstatistic\np.value\ndf\nlogLik\nAIC\nBIC\ndeviance\ndf.residual\nnobs\n\n\n\n\n0.6762609\n0.6389064\n4.005778\n18.10386\n1.501369e-06\n3\n-82.05377\n174.1075\n181.1135\n417.2026\n26\n30\n\n\n\n\n\n\nmodel_tidy %>% gt()\n\n\n\n\n\n\n\nterm\nestimate\nstd.error\nstatistic\np.value\n\n\n\n\n(Intercept)\n-3.8808094\n1.9862017\n-1.9538849\n6.155192e-02\n\n\ndrugD\n0.1089713\n1.7951351\n0.0607037\n9.520594e-01\n\n\ndrugF\n3.4461383\n1.8867806\n1.8264647\n7.928458e-02\n\n\npre\n0.9871838\n0.1644976\n6.0012061\n2.454330e-06\n\n\n\n\n\n\n\n\nmodel_table <- \n model_ancova %>% \n anova() %>% \n tidy() %>% \n add_row(term = \"Total\", df = sum(.$df), sumsq = sum(.$sumsq))\nmodel_table %>% gt()\n\n\n\n\n\n\n\nterm\ndf\nsumsq\nmeansq\nstatistic\np.value\n\n\n\n\ndrug\n2\n293.6000\n146.80000\n9.148553\n9.812371e-04\n\n\npre\n1\n577.8974\n577.89740\n36.014475\n2.454330e-06\n\n\nResiduals\n26\n417.2026\n16.04625\nNA\nNA\n\n\nTotal\n29\n1288.7000\nNA\nNA\nNA\n\n\n\n\n\n\n\n\nType 1\n\ndf_sas %>%\n anova_test(post ~ drug + pre, type = 1, detailed = TRUE) %>% \n get_anova_table() %>%\n gt()\n\n\n\n\n\n\n\nEffect\nDFn\nDFd\nSSn\nSSd\nF\np\np<.05\nges\n\n\n\n\ndrug\n2\n26\n293.600\n417.203\n9.149\n9.81e-04\n*\n0.413\n\n\npre\n1\n26\n577.897\n417.203\n36.014\n2.45e-06\n*\n0.581\n\n\n\n\n\n\n\n\n\nType 2\n\ndf_sas %>% \n anova_test(post ~ drug + pre, type = 2, detailed = TRUE) %>% \n get_anova_table() %>% \n gt()\n\n\n\n\n\n\n\nEffect\nSSn\nSSd\nDFn\nDFd\nF\np\np<.05\nges\n\n\n\n\ndrug\n68.554\n417.203\n2\n26\n2.136\n1.38e-01\n\n0.141\n\n\npre\n577.897\n417.203\n1\n26\n36.014\n2.45e-06\n*\n0.581\n\n\n\n\n\n\n\n\n\nType 3\n\ndf_sas %>%\n anova_test(post ~ drug + pre, type = 3, detailed = TRUE) %>% \n get_anova_table() %>% \n 
gt()\n\n\n\n\n\n\n\nEffect\nSSn\nSSd\nDFn\nDFd\nF\np\np<.05\nges\n\n\n\n\n(Intercept)\n31.929\n417.203\n1\n26\n1.990\n1.70e-01\n\n0.071\n\n\ndrug\n68.554\n417.203\n2\n26\n2.136\n1.38e-01\n\n0.141\n\n\npre\n577.897\n417.203\n1\n26\n36.014\n2.45e-06\n*\n0.581\n\n\n\n\n\n\n\n\n\nLS Means\n\nmodel_ancova %>% emmeans::lsmeans(\"drug\") %>% emmeans::pwpm(pvals = TRUE, means = TRUE) \n\n A D F\nA [ 6.71] 0.9980 0.1809\nD -0.109 [ 6.82] 0.1893\nF -3.446 -3.337 [10.16]\n\nRow and column labels: drug\nUpper triangle: P values adjust = \"tukey\"\nDiagonal: [Estimates] (lsmean) \nLower triangle: Comparisons (estimate) earlier vs. later\n\nmodel_ancova %>% emmeans::lsmeans(\"drug\") %>% plot(comparisons = TRUE)"
+ "objectID": "R/nparestimate.html#base",
+ "href": "R/nparestimate.html#base",
+ "title": "Non-parametric point estimation",
+ "section": "{base}",
+ "text": "{base}\nThe base function provides the Hodges-Lehmann estimate and the Moses confidence interval. The function will provide warnings in case of ties in the data and will not provide the exact confidence interval.\n\nwt <- wilcox.test(x, y, exact = TRUE, conf.int = TRUE)\n\nWarning in wilcox.test.default(x, y, exact = TRUE, conf.int = TRUE): cannot\ncompute exact p-value with ties\n\n\nWarning in wilcox.test.default(x, y, exact = TRUE, conf.int = TRUE): cannot\ncompute exact confidence intervals with ties\n\n# Hodges-Lehmann estimator\nwt$estimate\n\ndifference in location \n 0.5600562 \n\n# Moses confidence interval\nwt$conf.int\n\n[1] -0.3699774 1.1829708\nattr(,\"conf.level\")\n[1] 0.95\n\n\nNote: You can process the long format also for wilcox.test using the formula structure:\n\nwilcox.test(all$value ~ all$treat, exact = TRUE, conf.int = TRUE)\n\nWarning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot\ncompute exact p-value with ties\n\n\nWarning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot\ncompute exact confidence intervals with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: all$value by all$treat\nW = 58, p-value = 0.1329\nalternative hypothesis: true location shift is not equal to 0\n95 percent confidence interval:\n -0.3699774 1.1829708\nsample estimates:\ndifference in location \n 0.5600562"
},
{
- "objectID": "R/ancova.html#saslm-package",
- "href": "R/ancova.html#saslm-package",
- "title": "Ancova",
- "section": "sasLM Package",
- "text": "sasLM Package\nThe following code performs an ANCOVA analysis using the sasLM package. This package was written specifically to replicate SAS statistics. The console output is also organized in a manner that is similar to SAS.\n\nlibrary(sasLM)\n\nsasLM::GLM(post ~ drug + pre, df_sas, BETA = TRUE, EMEAN = TRUE)\n\n$ANOVA\nResponse : post\n Df Sum Sq Mean Sq F value Pr(>F) \nMODEL 3 871.5 290.499 18.104 1.501e-06 ***\nRESIDUALS 26 417.2 16.046 \nCORRECTED TOTAL 29 1288.7 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$Fitness\n Root MSE post Mean Coef Var R-square Adj R-sq\n 4.005778 7.9 50.70604 0.6762609 0.6389064\n\n$`Type I`\n Df Sum Sq Mean Sq F value Pr(>F) \ndrug 2 293.6 146.8 9.1486 0.0009812 ***\npre 1 577.9 577.9 36.0145 2.454e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$`Type II`\n Df Sum Sq Mean Sq F value Pr(>F) \ndrug 2 68.55 34.28 2.1361 0.1384 \npre 1 577.90 577.90 36.0145 2.454e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$`Type III`\n Df Sum Sq Mean Sq F value Pr(>F) \ndrug 2 68.55 34.28 2.1361 0.1384 \npre 1 577.90 577.90 36.0145 2.454e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$Parameter\n Estimate Estimable Std. Error Df t value Pr(>|t|) \n(Intercept) -0.4347 0 2.4714 26 -0.1759 0.86175 \ndrugA -3.4461 0 1.8868 26 -1.8265 0.07928 . \ndrugD -3.3372 0 1.8539 26 -1.8001 0.08346 . \ndrugF 0.0000 0 0.0000 26 \npre 0.9872 1 0.1645 26 6.0012 2.454e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$`Expected Mean`\n LSmean LowerCL UpperCL SE Df\n(Intercept) 7.900000 6.396685 9.403315 0.7313516 26\ndrugA 6.714963 4.066426 9.363501 1.2884943 26\ndrugD 6.823935 4.208337 9.439532 1.2724690 26\ndrugF 10.161102 7.456182 12.866021 1.3159234 26\npre 7.900000 6.396685 9.403315 0.7313516 26\n\n\nNote that the LSMEANS statistics are produced using the EMEAN = TRUE option. 
The BETA = TRUE option is equivalent to the SOLUTION option in SAS. See the sasLM documentation for additional information."
+ "objectID": "R/nparestimate.html#pairwiseci",
+ "href": "R/nparestimate.html#pairwiseci",
+ "title": "Non-parametric point estimation",
+ "section": "{pairwiseCI}",
+ "text": "{pairwiseCI}\nThe pairwiseCI package requires data to be in a long format to use the formula structure. Via the control parameter the direction can be defined. Setting method to “HL.diff” provides exact confidence intervals together with the Hodges-Lehmann point estimate.\n\n# pairwiseCI is using the formula structure \npairwiseCI(value ~ treat, data = all, \n method=\"HL.diff\", control=\"B\",\n conf.level = .95)\n\n \n95 %-confidence intervals \n Method: Difference in location (Hodges-Lehmann estimator) \n \n \n estimate lower upper\nA - B 0.56 -0.22 1.082"
},
{
- "objectID": "R/linear-regression.html",
- "href": "R/linear-regression.html",
- "title": "Linear Regression",
+ "objectID": "R/mcnemar.html",
+ "href": "R/mcnemar.html",
+ "title": "McNemar’s test in R",
"section": "",
- "text": "To demonstrate the use of linear regression we examine a dataset that illustrates the relationship between Height and Weight in a group of 237 teen-aged boys and girls. The dataset is available here and is imported to the workspace.\n\nDescriptive Statistics\nThe first step is to obtain the simple descriptive statistics for the numeric variables of htwt data, and one-way frequencies for categorical variables. This is accomplished by employing summary function. There are 237 participants who are from 13.9 to 25 years old. It is a cross-sectional study, with each participant having one observation. We can use this data set to examine the relationship of participants’ height to their age and sex.\n\nknitr::opts_chunk$set(echo = TRUE)\nhtwt<-read.csv(\"../data/htwt.csv\")\nsummary(htwt)\n\n ROW SEX AGE HEIGHT \n Min. : 1 Length:237 Min. :13.90 Min. :50.50 \n 1st Qu.: 60 Class :character 1st Qu.:14.80 1st Qu.:58.80 \n Median :119 Mode :character Median :16.30 Median :61.50 \n Mean :119 Mean :16.44 Mean :61.36 \n 3rd Qu.:178 3rd Qu.:17.80 3rd Qu.:64.30 \n Max. :237 Max. :25.00 Max. :72.00 \n WEIGHT \n Min. : 50.5 \n 1st Qu.: 85.0 \n Median :101.0 \n Mean :101.3 \n 3rd Qu.:112.0 \n Max. :171.5 \n\n\nIn order to create a regression model to demonstrate the relationship between age and height for females, we first need to create a flag variable identifying females and an interaction variable between age and female gender flag.\n\nhtwt$female <- ifelse(htwt$SEX=='f',1,0)\nhtwt$fem_age <- htwt$AGE * htwt$female\nhead(htwt)\n\n ROW SEX AGE HEIGHT WEIGHT female fem_age\n1 1 f 14.3 56.3 85.0 1 14.3\n2 2 f 15.5 62.3 105.0 1 15.5\n3 3 f 15.3 63.3 108.0 1 15.3\n4 4 f 16.1 59.0 92.0 1 16.1\n5 5 f 19.1 62.5 112.5 1 19.1\n6 6 f 17.1 62.5 112.0 1 17.1\n\n\n\n\nRegression Analysis\nNext, we fit a regression model, representing the relationships between gender, age, height and the interaction variable created in the datastep above. 
We again use a where statement to restrict the analysis to those who are less than or equal to 19 years old. We use the clb option to get a 95% confidence interval for each of the parameters in the model. The model that we are fitting is height = b0 + b1 x female + b2 x age + b3 x fem_age + e\n\nregression<-lm(HEIGHT~female+AGE+fem_age, data=htwt, AGE<=19)\nsummary(regression)\n\n\nCall:\nlm(formula = HEIGHT ~ female + AGE + fem_age, data = htwt, subset = AGE <= \n 19)\n\nResiduals:\n Min 1Q Median 3Q Max \n-8.2429 -1.7351 0.0383 1.6518 7.9289 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 28.8828 2.8734 10.052 < 2e-16 ***\nfemale 13.6123 4.0192 3.387 0.000841 ***\nAGE 2.0313 0.1776 11.435 < 2e-16 ***\nfem_age -0.9294 0.2478 -3.750 0.000227 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.799 on 215 degrees of freedom\nMultiple R-squared: 0.4595, Adjusted R-squared: 0.452 \nF-statistic: 60.93 on 3 and 215 DF, p-value: < 2.2e-16\n\n\nFrom the coefficients table b0,b1,b2,b3 are estimated as b0=28.88 b1=13.61 b2=2.03 b3=-0.92942\nThe resulting regression model for height, age and gender based on the available data is height=28.8828 + 13.6123 x female + 2.0313 x age -0.9294 x fem_age"
+ "text": "Performing McNemar’s test in R\nTo demonstrate McNemar’s test, we use data on the presence or absence of cold symptoms reported by the same children at age 12 and age 14. A total of 2638 participants were involved.\n\nUsing the epibasix::mcNemar function\nTesting for a significant difference in cold symptoms between the two ages using the mcNemar function from the epibasix package can be performed as below. The symptoms for participants at age 12 and age 14 are tabulated and stored as an object, then passed to the mcNemar function. A more complete view of the output is achieved by calling the summary function.\n\nlibrary(epibasix)\n\nX <- table(colds$age12, colds$age14)\nepi_mcn <- mcNemar(X)\nsummary(epi_mcn)\n\n\nMatched Pairs Analysis: McNemar's Statistic and Odds Ratio (Detailed Summary):\n \n \n No Yes\n No 707 256\n Yes 144 212\n\nEntries in above matrix correspond to number of pairs. \n \nMcNemar's Chi^2 Statistic (corrected for continuity) = 30.802 which has a p-value of: 0\nNote: The p.value for McNemar's Test corresponds to the hypothesis test: H0: OR = 1 vs. HA: OR != 1\nMcNemar's Odds Ratio (b/c): 1.778\n95% Confidence Limits for the OR are: [1.449, 2.208]\nThe risk difference is: 0.085\n95% Confidence Limits for the rd are: [0.055, 0.115]\n\n\n\n\nUsing the stats::mcnemar.test function\nMcNemar’s test can also be performed using stats::mcnemar.test as shown below, using the same table X as in the previous section.\n\nmcnemar.test(X)\n\n\n McNemar's Chi-squared test with continuity correction\n\ndata: X\nMcNemar's chi-squared = 30.802, df = 1, p-value = 2.857e-08\n\n\nThe result is shown without continuity correction by specifying correct=FALSE.\n\nmcnemar.test(X, correct=FALSE)\n\n\n McNemar's Chi-squared test\n\ndata: X\nMcNemar's chi-squared = 31.36, df = 1, p-value = 2.144e-08\n\n\n\n\nResults\nBy default, using summary with epibasix::mcNemar gives additional information beyond the McNemar’s chi-square statistic. 
This includes a table to view proportions, and the odds ratio and risk difference with 95% confidence limits. The result uses Edwards’ continuity correction without the option to remove this, which is consistent with other functions within the package.\nstats::mcnemar.test uses a continuity correction by default but does allow for this to be removed. This function does not output any other coefficients for agreement or proportions but (if required) these can be obtained from other functions or packages in R."
},
{
- "objectID": "R/PCA_analysis.html",
- "href": "R/PCA_analysis.html",
- "title": "Principle Component Analysis",
+ "objectID": "R/kruskal_wallis.html",
+ "href": "R/kruskal_wallis.html",
+ "title": "Kruskal Wallis R",
"section": "",
- "text": "Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining most of the information.\n\n\n\nWe will load the iris data.\nStandardize the data and then compute PCA.\n\n\nsuppressPackageStartupMessages({\n library(factoextra)\n library(plotly)\n})\n \ndata <- iris\npca_result <- prcomp(data[, 1:4], scale = T)\npca_result\n\nStandard deviations (1, .., p=4):\n[1] 1.7083611 0.9560494 0.3830886 0.1439265\n\nRotation (n x k) = (4 x 4):\n PC1 PC2 PC3 PC4\nSepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863\nSepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096\nPetal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492\nPetal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971\n\n\nWe print the summary of the PCA result, which includes the standard deviation of each principal component and the proportion of variance explained.\n\nsummary(pca_result)\n\nImportance of components:\n PC1 PC2 PC3 PC4\nStandard deviation 1.7084 0.9560 0.38309 0.14393\nProportion of Variance 0.7296 0.2285 0.03669 0.00518\nCumulative Proportion 0.7296 0.9581 0.99482 1.00000"
+ "text": "The Kruskal-Wallis test is a non-parametric equivalent to the one-way ANOVA. For this example, the data used is a subset of datasets::iris, testing for difference in sepal width between species of flower.\n\n\n Species Sepal_Width\n1 setosa 3.4\n2 setosa 3.0\n3 setosa 3.4\n4 setosa 3.2\n5 setosa 3.5\n6 setosa 3.1\n7 versicolor 2.7\n8 versicolor 2.9\n9 versicolor 2.7\n10 versicolor 2.6\n11 versicolor 2.5\n12 versicolor 2.5\n13 virginica 3.0\n14 virginica 3.0\n15 virginica 3.1\n16 virginica 3.8\n17 virginica 2.7\n18 virginica 3.3"
},
{
- "objectID": "R/PCA_analysis.html#introduction",
- "href": "R/PCA_analysis.html#introduction",
- "title": "Principle Component Analysis",
+ "objectID": "R/kruskal_wallis.html#introduction",
+ "href": "R/kruskal_wallis.html#introduction",
+ "title": "Kruskal Wallis R",
"section": "",
- "text": "Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining most of the information.\n\n\n\nWe will load the iris data.\nStandardize the data and then compute PCA.\n\n\nsuppressPackageStartupMessages({\n library(factoextra)\n library(plotly)\n})\n \ndata <- iris\npca_result <- prcomp(data[, 1:4], scale = T)\npca_result\n\nStandard deviations (1, .., p=4):\n[1] 1.7083611 0.9560494 0.3830886 0.1439265\n\nRotation (n x k) = (4 x 4):\n PC1 PC2 PC3 PC4\nSepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863\nSepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096\nPetal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492\nPetal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971\n\n\nWe print the summary of the PCA result, which includes the standard deviation of each principal component and the proportion of variance explained.\n\nsummary(pca_result)\n\nImportance of components:\n PC1 PC2 PC3 PC4\nStandard deviation 1.7084 0.9560 0.38309 0.14393\nProportion of Variance 0.7296 0.2285 0.03669 0.00518\nCumulative Proportion 0.7296 0.9581 0.99482 1.00000"
+ "text": "The Kruskal-Wallis test is a non-parametric equivalent to the one-way ANOVA. For this example, the data used is a subset of datasets::iris, testing for difference in sepal width between species of flower.\n\n\n Species Sepal_Width\n1 setosa 3.4\n2 setosa 3.0\n3 setosa 3.4\n4 setosa 3.2\n5 setosa 3.5\n6 setosa 3.1\n7 versicolor 2.7\n8 versicolor 2.9\n9 versicolor 2.7\n10 versicolor 2.6\n11 versicolor 2.5\n12 versicolor 2.5\n13 virginica 3.0\n14 virginica 3.0\n15 virginica 3.1\n16 virginica 3.8\n17 virginica 2.7\n18 virginica 3.3"
},
{
- "objectID": "R/PCA_analysis.html#visualize-pca-results",
- "href": "R/PCA_analysis.html#visualize-pca-results",
- "title": "Principle Component Analysis",
- "section": "Visualize PCA Results",
- "text": "Visualize PCA Results\n\nScree Plot\nA scree plot shows the proportion of variance explained by each principal component.\n\nfviz_eig(pca_result)\n\n\n\n\n\n\n\n\n\n\nBiplot\nA biplot shows the scores of the samples and the loadings of the variables on the first two principal components.\n\nplt <- fviz_pca_biplot(pca_result, geom.ind = \"point\", pointshape = 21, \n pointsize = 2, fill.ind = iris$Species, \n col.var = \"black\", repel = TRUE)\nplt"
+ "objectID": "R/kruskal_wallis.html#implementing-kruskal-wallis-in-r",
+ "href": "R/kruskal_wallis.html#implementing-kruskal-wallis-in-r",
+ "title": "Kruskal Wallis R",
+ "section": "Implementing Kruskal-Wallis in R",
+ "text": "Implementing Kruskal-Wallis in R\nThe Kruskal-Wallis test can be implemented in R using stats::kruskal.test. Below, the test is defined using R’s formula interface (dependent ~ independent variable) and specifying the data set. The null hypothesis is that the samples are from identical populations.\n\nkruskal.test(Sepal_Width~Species, data=iris_sub)\n\n\n Kruskal-Wallis rank sum test\n\ndata: Sepal_Width by Species\nKruskal-Wallis chi-squared = 10.922, df = 2, p-value = 0.004249"
},
{
- "objectID": "R/PCA_analysis.html#interpretation",
- "href": "R/PCA_analysis.html#interpretation",
- "title": "Principle Component Analysis",
- "section": "Interpretation",
- "text": "Interpretation\n\nThe Scree Plot suggests to decide the number of principle components to retain by looking an elbow point where the explained variance starts to level off.\nThe biplot visualizes both the samples (points) and the variables (arrows). Points that are close to each other represent samples with similar characteristics, while the direction and length of the arrows indicate the contribution of each variable to the principal components."
+ "objectID": "R/kruskal_wallis.html#results",
+ "href": "R/kruskal_wallis.html#results",
+ "title": "Kruskal Wallis R",
+ "section": "Results",
+ "text": "Results\nAs seen above, R outputs the Kruskal-Wallis rank sum statistic (10.922), the degrees of freedom (2), and the p-value of the test (0.004249). Therefore, the difference in population medians is statistically significant at the 5% level."
},
{
- "objectID": "R/PCA_analysis.html#visualization-of-pca-in-3d-scatter-plot",
- "href": "R/PCA_analysis.html#visualization-of-pca-in-3d-scatter-plot",
- "title": "Principle Component Analysis",
- "section": "Visualization of PCA in 3d Scatter Plot",
- "text": "Visualization of PCA in 3d Scatter Plot\nA 3d scatter plot allows us to see the relationships between three principle components simultaneously and also gives us a better understanding of how much variance is explained by these components.\nIt also allows for interactive exploration where we can rotate the plot and view it from a different angles.\nWe will plot this using plotly package.\n\npca_result2 <- prcomp(data[, 1:4], scale = T, rank. = 3)\npca_result2\n\nStandard deviations (1, .., p=4):\n[1] 1.7083611 0.9560494 0.3830886 0.1439265\n\nRotation (n x k) = (4 x 3):\n PC1 PC2 PC3\nSepal.Length 0.5210659 -0.37741762 0.7195664\nSepal.Width -0.2693474 -0.92329566 -0.2443818\nPetal.Length 0.5804131 -0.02449161 -0.1421264\nPetal.Width 0.5648565 -0.06694199 -0.6342727\n\n\nNext, we will create a dataframe of the 3 principle components and negate PC2 and PC3 for visual preference to make the plot look more organised and symmetric in 3d space.\n\ncomponents <- as.data.frame(pca_result2$x)\ncomponents$PC2 <- -components$PC2\ncomponents$PC3 <- -components$PC3\n\n\nfig <- plot_ly(components, \n x = ~PC1, \n y = ~PC2, \n z = ~PC3, \n color = ~data$Species, \n colors = c('darkgreen','darkblue','darkred')) %>%\n add_markers(size = 12)\n\nfig <- fig %>%\n layout(title = \"3d Visualization of PCA\",\n scene = list(bgcolor = \"lightgray\"))\nfig"
+ "objectID": "R/summary-stats.html",
+ "href": "R/summary-stats.html",
+ "title": "Deriving Quantiles or Percentiles in R",
+ "section": "",
+ "text": "Percentiles can be calculated in R using the quantile function. The function has the argument type which allows for nine different percentile definitions to be used. The default is type = 7, which uses a piecewise-linear estimate of the cumulative distribution function to find percentiles.\nThis is how the 25th and 40th percentiles of aval could be calculated using the default type.\n\nquantile(aval, probs = c(0.25, 0.4))"
},
{
- "objectID": "R/ttest_1Sample.html",
- "href": "R/ttest_1Sample.html",
- "title": "One Sample t-test",
+ "objectID": "R/mmrm.html",
+ "href": "R/mmrm.html",
+ "title": "MMRM in R",
"section": "",
- "text": "The One Sample t-test is used to compare a single sample against an expected hypothesis value. In the One Sample t-test, the mean of the sample is compared against the hypothesis value. In R, a One Sample t-test can be performed using the Base R t.test() from the stats package or the proc_ttest() function from the procs package.\n\n\nThe following data was used in this example.\n\n# Create sample data\nread <- tibble::tribble(\n ~score, ~count,\n 40, 2, 47, 2, 52, 2, 26, 1, 19, 2,\n 25, 2, 35, 4, 39, 1, 26, 1, 48, 1,\n 14, 2, 22, 1, 42, 1, 34, 2 , 33, 2,\n 18, 1, 15, 1, 29, 1, 41, 2, 44, 1,\n 51, 1, 43, 1, 27, 2, 46, 2, 28, 1,\n 49, 1, 31, 1, 28, 1, 54, 1, 45, 1\n)\n\n\n\n\nBy default, the R one sample t-test functions assume normality in the data and use a classic Student’s t-test.\n\n\n\n\nThe following code was used to test the comparison in Base R. Note that the baseline null hypothesis goes in the “mu” parameter.\n\n # Perform t-test\n t.test(read$score, mu = 30)\n\n\n One Sample t-test\n\ndata: read$score\nt = 2.3643, df = 29, p-value = 0.02497\nalternative hypothesis: true mean is not equal to 30\n95 percent confidence interval:\n 30.67928 39.38739\nsample estimates:\nmean of x \n 35.03333 \n\n\n\n\n\n\n\n\nThe following code from the procs package was used to perform a one sample t-test. Note that the null hypothesis value goes in the “options” parameter.\n\n library(procs)\n\n # Perform t-test\n proc_ttest(read, var = score,\n options = c(\"h0\" = 30))\n\n$Statistics\n VAR N MEAN STD STDERR MIN MAX\n1 score 30 35.03333 11.66038 2.128884 14 54\n\n$ConfLimits\n VAR MEAN LCLM UCLM STD LCLMSTD UCLMSTD\n1 score 35.03333 30.67928 39.38739 11.66038 9.286404 15.67522\n\n$TTests\n VAR DF T PROBT\n1 score 29 2.364306 0.0249741\n\n\nViewer Output:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe Base R t.test() function does not have an option for lognormal data. 
Likewise, the procs proc_ttest() function also does not have an option for lognormal data.\nOne possibility may be the tTestLnormAltPower() function from the EnvStats package. This package has not been evaluated yet."
+ "text": "Mixed models for repeated measures (MMRM) are a popular choice for analyzing longitudinal continuous outcomes in randomized clinical trials and beyond; see Cnaan, Laird and Slasor (1997) for a tutorial and Mallinckrodt, Lane and Schnell (2008) for a review.\nThis vignette shows examples from the mmrm package.\nThe mmrm package implements MMRM based on the marginal linear model without random effects using Template Model Builder (TMB) which enables fast and robust model fitting. Users can specify a variety of covariance matrices, weight observations, fit models with restricted or standard maximum likelihood inference, perform hypothesis testing with Satterthwaite or Kenward-Roger adjustment, and extract least square means estimates by using emmeans.\n\n\n\nFlexible covariance specification:\n\nStructures: unstructured, Toeplitz, AR1, compound symmetry, ante-dependence, and spatial exponential.\nGroups: shared covariance structure for all subjects or group-specific covariance estimates.\nVariances: homogeneous or heterogeneous across time points.\n\nInference:\n\nSupports REML and ML.\nSupports weights.\n\nHypothesis testing:\n\nLeast square means: can be obtained with the emmeans package\nOne- and multi-dimensional linear contrasts of model parameters can be tested.\nSatterthwaite adjusted degrees of freedom.\nKenward-Roger adjusted degrees of freedom and coefficients covariance matrix.\nCoefficient Covariance\n\nC++ backend:\n\nFast implementation using C++ and automatic differentiation to obtain precise gradient information for model fitting.\nModel fitting algorithm details used in mmrm.\n\nPackage ecosystems integration:\n\nIntegration with tidymodels package ecosystem\n\nDedicated parsnip engine for linear regression\nIntegration with recipes\n\nIntegration with tern package ecosystems\n\nThe tern.mmrm package can be used to run the mmrm fit and generate tabulation and plots of least square means per visit and treatment arm, tabulation of model 
diagnostics, diagnostic graphs, and other standard model outputs.\n\n\n\n\n\n\nSee also the introductory vignette\nThe code below implements an MMRM fit in R with the mmrm::mmrm function.\n\nlibrary(mmrm)\nfit <- mmrm(\n formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),\n data = fev_data\n)\n\nThe code specifies an MMRM with the given covariates and an unstructured covariance matrix for the timepoints (also called visits in the clinical trial context, here given by AVISIT) within the subjects (here USUBJID). While by default this uses restricted maximum likelihood (REML), it is also possible to use ML, see ?mmrm.\nPrinting the object will show you output which should be familiar to anyone who has used any popular modeling functions such as stats::lm(), stats::glm(), glmmTMB::glmmTMB(), and lme4::nlmer(). From this print out we see the function call, the data used, the covariance structure with number of variance parameters, as well as the likelihood method, and model deviance achieved. 
Additionally the user is provided a printout of the estimated coefficients and the model convergence information:\n\nfit\n\nmmrm fit\n\nFormula: FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID)\nData: fev_data (used 537 observations from 197 subjects with maximum 4 \ntimepoints)\nCovariance: unstructured (10 variance parameters)\nInference: REML\nDeviance: 3386.45\n\nCoefficients: \n (Intercept) RACEBlack or African American \n 30.7774065 1.5305950 \n RACEWhite SEXFemale \n 5.6435679 0.3260274 \n ARMCDTRT AVISITVIS2 \n 3.7744139 4.8396039 \n AVISITVIS3 AVISITVIS4 \n 10.3421671 15.0537863 \n ARMCDTRT:AVISITVIS2 ARMCDTRT:AVISITVIS3 \n -0.0420899 -0.6938068 \n ARMCDTRT:AVISITVIS4 \n 0.6241229 \n\nModel Inference Optimization:\nConverged with code 0 and message: No message provided.\n\n\nThe summary() method then provides the coefficients table with Satterthwaite degrees of freedom as well as the covariance matrix estimate:\n\nfit |>\n summary()\n\nmmrm fit\n\nFormula: FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID)\nData: fev_data (used 537 observations from 197 subjects with maximum 4 \ntimepoints)\nCovariance: unstructured (10 variance parameters)\nMethod: Satterthwaite\nVcov Method: Asymptotic\nInference: REML\n\nModel selection criteria:\n AIC BIC logLik deviance \n 3406.4 3439.3 -1693.2 3386.4 \n\nCoefficients: \n Estimate Std. 
Error df t value Pr(>|t|)\n(Intercept) 30.77741 0.88657 218.79000 34.715 < 2e-16\nRACEBlack or African American 1.53059 0.62446 168.67000 2.451 0.015263\nRACEWhite 5.64357 0.66559 157.14000 8.479 1.56e-14\nSEXFemale 0.32603 0.53194 166.14000 0.613 0.540776\nARMCDTRT 3.77441 1.07416 145.55000 3.514 0.000589\nAVISITVIS2 4.83960 0.80173 143.87000 6.036 1.27e-08\nAVISITVIS3 10.34217 0.82269 155.56000 12.571 < 2e-16\nAVISITVIS4 15.05379 1.31288 138.46000 11.466 < 2e-16\nARMCDTRT:AVISITVIS2 -0.04209 1.12933 138.56000 -0.037 0.970324\nARMCDTRT:AVISITVIS3 -0.69381 1.18764 158.17000 -0.584 0.559924\nARMCDTRT:AVISITVIS4 0.62412 1.85096 129.71000 0.337 0.736520\n \n(Intercept) ***\nRACEBlack or African American * \nRACEWhite ***\nSEXFemale \nARMCDTRT ***\nAVISITVIS2 ***\nAVISITVIS3 ***\nAVISITVIS4 ***\nARMCDTRT:AVISITVIS2 \nARMCDTRT:AVISITVIS3 \nARMCDTRT:AVISITVIS4 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCovariance estimate:\n VIS1 VIS2 VIS3 VIS4\nVIS1 40.5544 14.3960 4.9760 13.3779\nVIS2 14.3960 26.5714 2.7836 7.4773\nVIS3 4.9760 2.7836 14.8980 0.9036\nVIS4 13.3779 7.4773 0.9036 95.5565\n\n\n\n\n\nIn order to extract relevant marginal means (LSmeans) and contrasts we can use the emmeans package. This package includes methods that allow mmrm objects to be used with the emmeans package. emmeans computes estimated marginal means (also called least-square means) for the coefficients of the MMRM.\n\nif (require(emmeans)) {\n emmeans(fit, ~ ARMCD | AVISIT)\n}\n\nLoading required package: emmeans\n\n\nmmrm() registered as emmeans extension\n\n\nWelcome to emmeans.\nCaution: You lose important information if you filter this package's results.\nSee '? 
untidy'\n\n\nAVISIT = VIS1:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 33.3 0.755 148 31.8 34.8\n TRT 37.1 0.763 143 35.6 38.6\n\nAVISIT = VIS2:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 38.2 0.612 147 37.0 39.4\n TRT 41.9 0.602 143 40.7 43.1\n\nAVISIT = VIS3:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 43.7 0.462 130 42.8 44.6\n TRT 46.8 0.509 130 45.7 47.8\n\nAVISIT = VIS4:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 48.4 1.189 134 46.0 50.7\n TRT 52.8 1.188 133 50.4 55.1\n\nResults are averaged over the levels of: RACE, SEX \nConfidence level used: 0.95 \n\n\nNote that the degrees of freedom choice is inherited here from the initial mmrm fit."
},
{
- "objectID": "R/ttest_1Sample.html#normal",
- "href": "R/ttest_1Sample.html#normal",
- "title": "One Sample t-test",
+ "objectID": "R/mmrm.html#fitting-the-mmrm-in-r",
+ "href": "R/mmrm.html#fitting-the-mmrm-in-r",
+ "title": "MMRM in R",
"section": "",
- "text": "By default, the R one sample t-test functions assume normality in the data and use a classic Student’s t-test.\n\n\n\n\nThe following code was used to test the comparison in Base R. Note that the baseline null hypothesis goes in the “mu” parameter.\n\n # Perform t-test\n t.test(read$score, mu = 30)\n\n\n One Sample t-test\n\ndata: read$score\nt = 2.3643, df = 29, p-value = 0.02497\nalternative hypothesis: true mean is not equal to 30\n95 percent confidence interval:\n 30.67928 39.38739\nsample estimates:\nmean of x \n 35.03333 \n\n\n\n\n\n\n\n\nThe following code from the procs package was used to perform a one sample t-test. Note that the null hypothesis value goes in the “options” parameter.\n\n library(procs)\n\n # Perform t-test\n proc_ttest(read, var = score,\n options = c(\"h0\" = 30))\n\n$Statistics\n VAR N MEAN STD STDERR MIN MAX\n1 score 30 35.03333 11.66038 2.128884 14 54\n\n$ConfLimits\n VAR MEAN LCLM UCLM STD LCLMSTD UCLMSTD\n1 score 35.03333 30.67928 39.38739 11.66038 9.286404 15.67522\n\n$TTests\n VAR DF T PROBT\n1 score 29 2.364306 0.0249741\n\n\nViewer Output:"
+ "text": "Mixed models for repeated measures (MMRM) are a popular choice for analyzing longitudinal continuous outcomes in randomized clinical trials and beyond; see Cnaan, Laird and Slasor (1997) for a tutorial and Mallinckrodt, Lane and Schnell (2008) for a review.\nThis vignette shows examples from the mmrm package.\nThe mmrm package implements MMRM based on the marginal linear model without random effects using Template Model Builder (TMB) which enables fast and robust model fitting. Users can specify a variety of covariance matrices, weight observations, fit models with restricted or standard maximum likelihood inference, perform hypothesis testing with Satterthwaite or Kenward-Roger adjustment, and extract least square means estimates by using emmeans.\n\n\n\nFlexible covariance specification:\n\nStructures: unstructured, Toeplitz, AR1, compound symmetry, ante-dependence, and spatial exponential.\nGroups: shared covariance structure for all subjects or group-specific covariance estimates.\nVariances: homogeneous or heterogeneous across time points.\n\nInference:\n\nSupports REML and ML.\nSupports weights.\n\nHypothesis testing:\n\nLeast square means: can be obtained with the emmeans package\nOne- and multi-dimensional linear contrasts of model parameters can be tested.\nSatterthwaite adjusted degrees of freedom.\nKenward-Roger adjusted degrees of freedom and coefficients covariance matrix.\nCoefficient Covariance\n\nC++ backend:\n\nFast implementation using C++ and automatic differentiation to obtain precise gradient information for model fitting.\nModel fitting algorithm details used in mmrm.\n\nPackage ecosystems integration:\n\nIntegration with tidymodels package ecosystem\n\nDedicated parsnip engine for linear regression\nIntegration with recipes\n\nIntegration with tern package ecosystems\n\nThe tern.mmrm package can be used to run the mmrm fit and generate tabulation and plots of least square means per visit and treatment arm, tabulation of model 
diagnostics, diagnostic graphs, and other standard model outputs.\n\n\n\n\n\n\nSee also the introductory vignette\nThe code below implements an MMRM fit in R with the mmrm::mmrm function.\n\nlibrary(mmrm)\nfit <- mmrm(\n formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),\n data = fev_data\n)\n\nThe code specifies an MMRM with the given covariates and an unstructured covariance matrix for the timepoints (also called visits in the clinical trial context, here given by AVISIT) within the subjects (here USUBJID). While by default this uses restricted maximum likelihood (REML), it is also possible to use ML, see ?mmrm.\nPrinting the object will show you output which should be familiar to anyone who has used any popular modeling functions such as stats::lm(), stats::glm(), glmmTMB::glmmTMB(), and lme4::nlmer(). From this print out we see the function call, the data used, the covariance structure with number of variance parameters, as well as the likelihood method, and model deviance achieved. 
Additionally the user is provided a printout of the estimated coefficients and the model convergence information:\n\nfit\n\nmmrm fit\n\nFormula: FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID)\nData: fev_data (used 537 observations from 197 subjects with maximum 4 \ntimepoints)\nCovariance: unstructured (10 variance parameters)\nInference: REML\nDeviance: 3386.45\n\nCoefficients: \n (Intercept) RACEBlack or African American \n 30.7774065 1.5305950 \n RACEWhite SEXFemale \n 5.6435679 0.3260274 \n ARMCDTRT AVISITVIS2 \n 3.7744139 4.8396039 \n AVISITVIS3 AVISITVIS4 \n 10.3421671 15.0537863 \n ARMCDTRT:AVISITVIS2 ARMCDTRT:AVISITVIS3 \n -0.0420899 -0.6938068 \n ARMCDTRT:AVISITVIS4 \n 0.6241229 \n\nModel Inference Optimization:\nConverged with code 0 and message: No message provided.\n\n\nThe summary() method then provides the coefficients table with Satterthwaite degrees of freedom as well as the covariance matrix estimate:\n\nfit |>\n summary()\n\nmmrm fit\n\nFormula: FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID)\nData: fev_data (used 537 observations from 197 subjects with maximum 4 \ntimepoints)\nCovariance: unstructured (10 variance parameters)\nMethod: Satterthwaite\nVcov Method: Asymptotic\nInference: REML\n\nModel selection criteria:\n AIC BIC logLik deviance \n 3406.4 3439.3 -1693.2 3386.4 \n\nCoefficients: \n Estimate Std. 
Error df t value Pr(>|t|)\n(Intercept) 30.77741 0.88657 218.79000 34.715 < 2e-16\nRACEBlack or African American 1.53059 0.62446 168.67000 2.451 0.015263\nRACEWhite 5.64357 0.66559 157.14000 8.479 1.56e-14\nSEXFemale 0.32603 0.53194 166.14000 0.613 0.540776\nARMCDTRT 3.77441 1.07416 145.55000 3.514 0.000589\nAVISITVIS2 4.83960 0.80173 143.87000 6.036 1.27e-08\nAVISITVIS3 10.34217 0.82269 155.56000 12.571 < 2e-16\nAVISITVIS4 15.05379 1.31288 138.46000 11.466 < 2e-16\nARMCDTRT:AVISITVIS2 -0.04209 1.12933 138.56000 -0.037 0.970324\nARMCDTRT:AVISITVIS3 -0.69381 1.18764 158.17000 -0.584 0.559924\nARMCDTRT:AVISITVIS4 0.62412 1.85096 129.71000 0.337 0.736520\n \n(Intercept) ***\nRACEBlack or African American * \nRACEWhite ***\nSEXFemale \nARMCDTRT ***\nAVISITVIS2 ***\nAVISITVIS3 ***\nAVISITVIS4 ***\nARMCDTRT:AVISITVIS2 \nARMCDTRT:AVISITVIS3 \nARMCDTRT:AVISITVIS4 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCovariance estimate:\n VIS1 VIS2 VIS3 VIS4\nVIS1 40.5544 14.3960 4.9760 13.3779\nVIS2 14.3960 26.5714 2.7836 7.4773\nVIS3 4.9760 2.7836 14.8980 0.9036\nVIS4 13.3779 7.4773 0.9036 95.5565\n\n\n\n\n\nIn order to extract relevant marginal means (LSmeans) and contrasts we can use the emmeans package. This package includes methods that allow mmrm objects to be used with the emmeans package. emmeans computes estimated marginal means (also called least-square means) for the coefficients of the MMRM.\n\nif (require(emmeans)) {\n emmeans(fit, ~ ARMCD | AVISIT)\n}\n\nLoading required package: emmeans\n\n\nmmrm() registered as emmeans extension\n\n\nWelcome to emmeans.\nCaution: You lose important information if you filter this package's results.\nSee '? 
untidy'\n\n\nAVISIT = VIS1:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 33.3 0.755 148 31.8 34.8\n TRT 37.1 0.763 143 35.6 38.6\n\nAVISIT = VIS2:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 38.2 0.612 147 37.0 39.4\n TRT 41.9 0.602 143 40.7 43.1\n\nAVISIT = VIS3:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 43.7 0.462 130 42.8 44.6\n TRT 46.8 0.509 130 45.7 47.8\n\nAVISIT = VIS4:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 48.4 1.189 134 46.0 50.7\n TRT 52.8 1.188 133 50.4 55.1\n\nResults are averaged over the levels of: RACE, SEX \nConfidence level used: 0.95 \n\n\nNote that the degrees of freedom choice is inherited here from the initial mmrm fit."
},
{
- "objectID": "R/ttest_1Sample.html#lognormal",
- "href": "R/ttest_1Sample.html#lognormal",
- "title": "One Sample t-test",
+ "objectID": "R/correlation.html",
+ "href": "R/correlation.html",
+ "title": "Correlation Analysis Using R",
"section": "",
- "text": "The Base R t.test() function does not have an option for lognormal data. Likewise, the procs proc_ttest() function also does not have an option for lognormal data.\nOne possibility may be the tTestLnormAltPower() function from the EnvStats package. This package has not been evaluated yet."
+ "text": "The most commonly used correlation analysis methods in clinical trials include:\n\nPearson correlation coefficient: product moment coefficient between two continuous variables, measuring linear associations.\n\n\\[\nr = \\frac{\\sum_{i=1}^n (x_i - m_x)(y_i - m_y)}{\\sqrt{\\sum_{i=1}^n (x_i - m_x)^2\\sum_{i=1}^n (y_i - m_y)^2}},\\]\nwhere \\(x\\) and \\(y\\) are observations from two continuous variables of length \\(n\\) and \\(m_x\\) and \\(m_y\\) are their respective means.\nSpearman correlation coefficient: rank correlation defined through the scaled sum of the squared values of the difference between ranks of two continuous variables.\n\\[\n\\rho = \\frac{\\sum_{i=1}^n (x'_i - m_{x'})(y'_i - m_{y'})}{\\sqrt{\\sum_{i=1}^n (x'_i - m_{x'})^2\\sum_{i=1}^n(y'_i - m_{y'})^2}},\n\\]\nwhere \\(x'\\) and \\(y'\\) are the ranks of \\(x\\) and \\(y\\) and \\(m_{x'}\\) and \\(m_{y'}\\) are the mean ranks of \\(x\\) and \\(y\\), respectively.\nKendall’s rank correlation: rank correlation based on the number of inversions in one ranking as compared with another.\n\\[\n\\tau = \\frac{n_c - n_d}{\\frac{1}{2}\\,n\\,(n-1)},\n\\]\nwhere \\(n_c\\) is the total number of concordant pairs, \\(n_d\\) is the total number of discordant pairs and \\(n\\) is the total number of observations in \\(x\\) and \\(y\\).\n\nOther association measures are available for count data/contingency tables, comparing observed frequencies with those expected under the assumption of independence:\n\nFisher exact test\nChi-Square statistic\n\n\nExample: Lung Cancer Data\nData source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.\nSurvival in patients with advanced lung cancer from the North Central Cancer Treatment Group. 
Performance scores rate how well the patient can perform usual daily activities.\n\nlibrary(survival) \nlibrary(dplyr) # provides glimpse()\n\nglimpse(lung)\n\nRows: 228\nColumns: 10\n$ inst <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22, 16…\n$ time <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654…\n$ status <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …\n$ age <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 57, …\n$ sex <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, …\n$ ph.ecog <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2, 1,…\n$ ph.karno <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80,…\n$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, 70,…\n$ meal.cal <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, NA,…\n$ wt.loss <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 15, …\n\n\n\n\nOverview\ncor() computes the correlation coefficient between continuous variables x and y, where method chooses which correlation coefficient is computed: \"pearson\" (the default), \"kendall\", or \"spearman\".\ncor.test() calculates the test for association between paired samples, using one of Pearson’s product moment correlation coefficient, Kendall’s \\(\\tau\\) or Spearman’s \\(\\rho\\). Besides the correlation coefficient itself, it provides additional information.\nMissing values are assumed to be missing completely at random (MCAR). 
Different strategies are available, see ?cor for details.\n\n\nPearson Correlation\n\ncor.test(x = lung$age, y = lung$meal.cal, method = \"pearson\") \n\n\n Pearson's product-moment correlation\n\ndata: lung$age and lung$meal.cal\nt = -3.1824, df = 179, p-value = 0.001722\nalternative hypothesis: true correlation is not equal to 0\n95 percent confidence interval:\n -0.3649503 -0.0885415\nsample estimates:\n cor \n-0.2314107 \n\n\n\n\nSpearman Correlation\n\ncor.test(x = lung$age, y = lung$meal.cal, method = \"spearman\")\n\nWarning in cor.test.default(x = lung$age, y = lung$meal.cal, method =\n\"spearman\"): Cannot compute exact p-value with ties\n\n\n\n Spearman's rank correlation rho\n\ndata: lung$age and lung$meal.cal\nS = 1193189, p-value = 0.005095\nalternative hypothesis: true rho is not equal to 0\nsample estimates:\n rho \n-0.2073639 \n\n\nNote: Exact p-values cannot be computed when the data contain ties.\n\n\nKendall’s rank correlation\n\ncor.test(x = lung$age, y = lung$meal.cal, method = \"kendall\")\n\n\n Kendall's rank correlation tau\n\ndata: lung$age and lung$meal.cal\nz = -2.7919, p-value = 0.00524\nalternative hypothesis: true tau is not equal to 0\nsample estimates:\n tau \n-0.1443877 \n\n\n\n\nInterpretation of correlation coefficients\nThe correlation coefficient lies between -1 and 1:\n\n\\(-1\\) indicates a perfect negative correlation\n\\(0\\) means that there is no (linear or monotonic) association between the two variables\n\\(1\\) indicates a perfect positive correlation"
},
{
- "objectID": "R/logistic_regr.html",
- "href": "R/logistic_regr.html",
- "title": "Logistic Regression",
+ "objectID": "R/survival.html",
+ "href": "R/survival.html",
+ "title": "Survival Analysis Using R",
"section": "",
- "text": "A model of the dependence of binary variables on explanatory variables. The logit of expectation is explained as linear for of explanatory variables. If we observed \\((y_i, x_i),\\) where \\(y_i\\) is a Bernoulli variable and \\(x_i\\) a vector of explanatory variables, the model for \\(\\pi_i = P(y_i=1)\\) is\n\\[\n\\text{logit}(\\pi_i)= \\log\\left\\{ \\frac{\\pi_i}{1-\\pi_i}\\right\\} = \\beta_0 + \\beta x_i, i = 1,\\ldots,n\n\\]\nThe model is especially useful in case-control studies and leads to the effect of risk factors by odds ratios.\n\nExample: Lung Cancer Data\nData source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.\nSurvival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities (see ?lung for details).\n\nlibrary(survival) \nglimpse(lung)\n\nRows: 228\nColumns: 10\n$ inst <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22, 16…\n$ time <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654…\n$ status <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …\n$ age <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 57, …\n$ sex <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, …\n$ ph.ecog <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2, 1,…\n$ ph.karno <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80,…\n$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, 70,…\n$ meal.cal <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, NA,…\n$ wt.loss <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 15, …\n\n\n\n\nModel Fit\nWe analyze the weight loss in lung cancer 
patients in dependency of age, sex, ECOG performance score and calories consumed at meals.\n\nlung2 <- survival::lung %>% \n mutate(\n wt_grp = factor(wt.loss > 0, labels = c(\"weight loss\", \"weight gain\"))\n ) \n\n\nm1 <- glm(wt_grp ~ age + sex + ph.ecog + meal.cal, data = lung2, family = binomial(link=\"logit\"))\nsummary(m1)\n\n\nCall:\nglm(formula = wt_grp ~ age + sex + ph.ecog + meal.cal, family = binomial(link = \"logit\"), \n data = lung2)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 3.2631673 1.6488207 1.979 0.0478 *\nage -0.0101717 0.0208107 -0.489 0.6250 \nsex -0.8717357 0.3714042 -2.347 0.0189 *\nph.ecog 0.4179665 0.2588653 1.615 0.1064 \nmeal.cal -0.0008869 0.0004467 -1.985 0.0471 *\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 202.36 on 169 degrees of freedom\nResidual deviance: 191.50 on 165 degrees of freedom\n (58 observations deleted due to missingness)\nAIC: 201.5\n\nNumber of Fisher Scoring iterations: 4\n\n\nThe model summary contains the parameter estimates \\(\\beta_j\\) for each explanatory variable \\(x_j\\), corresponding to the log-odds for the response variable to take the value \\(1\\), conditional on all other explanatory variables remaining constant. 
For better interpretation, we can exponentiate these estimates, to obtain estimates for the odds instead and provide \\(95\\)% confidence intervals:\n\nexp(coef(m1))\n\n(Intercept) age sex ph.ecog meal.cal \n 26.1321742 0.9898798 0.4182250 1.5188698 0.9991135 \n\nexp(confint(m1))\n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 1.0964330 730.3978786\nage 0.9495388 1.0307216\nsex 0.1996925 0.8617165\nph.ecog 0.9194053 2.5491933\nmeal.cal 0.9982107 0.9999837\n\n\n\n\nModel Comparison\nTo compare two logistic models, one tests the difference in residual variances from both models using a \\(\\chi^2\\)-distribution with a single degree of freedom (here at the \\(5\\)% level):\n\nm2 <- glm(wt_grp ~ sex + ph.ecog + meal.cal, data = lung2, family = binomial(link=\"logit\"))\nsummary(m2)\n\n\nCall:\nglm(formula = wt_grp ~ sex + ph.ecog + meal.cal, family = binomial(link = \"logit\"), \n data = lung2)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 2.5606595 0.7976887 3.210 0.00133 **\nsex -0.8359241 0.3637378 -2.298 0.02155 * \nph.ecog 0.3794295 0.2469030 1.537 0.12435 \nmeal.cal -0.0008334 0.0004346 -1.918 0.05517 . \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 202.36 on 169 degrees of freedom\nResidual deviance: 191.74 on 166 degrees of freedom\n (58 observations deleted due to missingness)\nAIC: 199.74\n\nNumber of Fisher Scoring iterations: 4\n\nanova(m1, m2, test = \"Chisq\")\n\nAnalysis of Deviance Table\n\nModel 1: wt_grp ~ age + sex + ph.ecog + meal.cal\nModel 2: wt_grp ~ sex + ph.ecog + meal.cal\n Resid. Df Resid. 
Dev Df Deviance Pr(>Chi)\n1 165 191.50 \n2 166 191.75 -1 -0.24046 0.6239\n\n\n\n\nPrediction\nPredictions from the model for the log-odds of a patient with new data to experience a weight loss are derived using predict():\n\n# new female, symptomatic but completely ambulatory patient consuming 2500 calories\nnew_pt <- data.frame(sex=2, ph.ecog=1, meal.cal=2500)\npredict(m2, new_pt, type = \"response\")\n\n 1 \n0.306767"
+ "text": "The most commonly used survival analysis methods in clinical trials include:\nAdditionally, other methods for analyzing time-to-event data are available, such as:\nWhile these models may be explored in a separate document, this particular document focuses solely on the three most prevalent methods: KM estimators, log-rank test and Cox PH model."
},
{
- "objectID": "R/ci_for_prop.html",
- "href": "R/ci_for_prop.html",
- "title": "Confidence Intervals for Proportions",
- "section": "",
- "text": "A confidence interval for binomial proportion is an interval estimate for the probability of success calculated from the outcome of a series of Bernoulli trials.\nThere are several ways to calculate a binomial confidence interval. Normal approximation is one of the most commonly used methods."
+ "objectID": "R/survival.html#example-data",
+ "href": "R/survival.html#example-data",
+ "title": "Survival Analysis Using R",
+ "section": "Example Data",
+ "text": "Example Data\nData source: https://stats.idre.ucla.edu/sas/seminars/sas-survival/\nThe data include 500 subjects from the Worcester Heart Attack Study. This study examined several factors, such as age, gender and BMI, that may influence survival time after heart attack. Follow-up time for all participants begins at the time of hospital admission after heart attack and ends with death or loss to follow-up (censoring). The variables used here are:\n\nlenfol: length of follow-up, terminated either by death or censoring - time variable\nfstat: loss to follow-up = 0, death = 1 - censoring variable\nafb: atrial fibrillation, no = 0, yes = 1 - explanatory variable\ngender: males = 0, females = 1 - stratification factor\n\n\nlibrary(tidyverse)\nlibrary(haven)\nlibrary(survival)\nlibrary(survminer)\nlibrary(broom)\nlibrary(knitr)\nknitr::opts_chunk$set(echo = TRUE)\n\ndat <- read_sas(file.path(\"../data/whas500.sas7bdat\")) %>%\n mutate(LENFOLY = round(LENFOL/365.25, 2), ## change follow-up days to years for better visualization\n AFB = factor(AFB, levels = c(1, 0))) ## change AFB order to use \"Yes\" as the reference group to be consistent with SAS"
},
{
- "objectID": "R/ci_for_prop.html#normal-approximation",
- "href": "R/ci_for_prop.html#normal-approximation",
- "title": "Confidence Intervals for Proportions",
- "section": "Normal approximation",
- "text": "Normal approximation\nIn large random samples from independent trials, the sampling distribution of proportions approximately follows the normal distribution. The expectation of a sample proportion is the corresponding population proportion. Therefore, based on a sample of size \\(n\\), a \\((1-\\alpha)\\%\\) confidence interval for population proportion can be calculated using normal approximation as follows:\n\\(p\\approx \\hat p \\pm z_\\alpha \\sqrt{\\hat p(1-\\hat p)}/{n}\\), where \\(\\hat p\\) is the sample proportion, \\(z_\\alpha\\) is the \\(1-\\alpha/2\\) quantile of a standard normal distribution corresponding to level \\(\\alpha\\), and \\(\\sqrt{\\hat p(1-\\hat p)}/{n}\\) is the standard error."
+ "objectID": "R/survival.html#the-non-stratified-model",
+ "href": "R/survival.html#the-non-stratified-model",
+ "title": "Survival Analysis Using R",
+ "section": "The Non-stratified Model",
+ "text": "The Non-stratified Model\nFirst we try a non-stratified analysis following the mock-up above to describe the association between survival time and afb (atrial fibrillation).\nThe KM estimators come from the survival::survfit function, the log-rank test uses survminer::surv_pvalue, and the Cox PH model is fitted with the survival::coxph function. Numerous R packages and functions are available for performing survival analysis. The author has selected survival and survminer for use in this context, but alternative options can also be employed for survival analysis.\n\nKM estimators\n\nfit.km <- survfit(Surv(LENFOLY, FSTAT) ~ AFB, data = dat)\n\n## quantile estimates\nquantile(fit.km, probs = c(0.25, 0.5, 0.75)) \n\n$quantile\n 25 50 75\nAFB=1 0.26 2.37 6.43\nAFB=0 0.94 5.91 6.44\n\n$lower\n 25 50 75\nAFB=1 0.05 1.27 4.24\nAFB=0 0.55 4.32 6.44\n\n$upper\n 25 50 75\nAFB=1 1.11 4.24 NA\nAFB=0 1.47 NA NA\n\n## landmark estimates at 1, 3, 5-year\nsummary(fit.km, times = c(1, 3, 5)) \n\nCall: survfit(formula = Surv(LENFOLY, FSTAT) ~ AFB, data = dat)\n\n AFB=1 \n time n.risk n.event survival std.err lower 95% CI upper 95% CI\n 1 50 28 0.641 0.0543 0.543 0.757\n 3 27 12 0.455 0.0599 0.351 0.589\n 5 11 6 0.315 0.0643 0.211 0.470\n\n AFB=0 \n time n.risk n.event survival std.err lower 95% CI upper 95% CI\n 1 312 110 0.739 0.0214 0.699 0.782\n 3 199 33 0.642 0.0245 0.595 0.691\n 5 77 20 0.530 0.0311 0.472 0.595\n\n\n\n\nLog-rank test\n\nsurvminer::surv_pvalue(fit.km, data = dat)\n\n variable pval method pval.txt\n1 AFB 0.0009646027 Log-rank p = 0.00096\n\n\n\n\nCox PH model\n\nfit.cox <- coxph(Surv(LENFOLY, FSTAT) ~ AFB, data = dat)\nfit.cox %>% \n tidy(exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) %>%\n select(term, estimate, conf.low, conf.high)\n\n# A tibble: 1 × 4\n term estimate conf.low conf.high\n <chr> <dbl> <dbl> <dbl>\n1 AFB0 0.583 0.421 0.806"
},
{
- "objectID": "R/ci_for_prop.html#example-code",
- "href": "R/ci_for_prop.html#example-code",
- "title": "Confidence Intervals for Proportions",
- "section": "Example code",
- "text": "Example code\nThe following code calculates a confidence interval for a binomial proportion usinng normal approximation.\n\nset.seed(666)\n# generate a random sample of size 100 from independent Bernoulli trials\nn = 100\nmysamp = sample(c(0,1),n,replace = T)\n# sample proportion\np_hat = mean(mysamp)\n# standard error\nse = sqrt(p_hat*(1-p_hat)/n)\n# 95% CI of population proportion\nc(p_hat-qnorm(1-0.05/2)*se, p_hat+qnorm(1-0.05/2)*se)\n\n[1] 0.4936024 0.6863976"
+ "objectID": "R/survival.html#the-stratified-model",
+ "href": "R/survival.html#the-stratified-model",
+ "title": "Survival Analysis Using R",
+ "section": "The Stratified Model",
+ "text": "The Stratified Model\nIn a stratified model, the Kaplan-Meier estimators remain the same as those in the non-stratified model. To implement stratified log-rank tests and Cox proportional hazards models, simply include the strata() function within the model formula.\n\nStratified Log-rank test\n\nfit.km.str <- survfit(Surv(LENFOLY, FSTAT) ~ AFB + strata(GENDER), data = dat)\n\nsurvminer::surv_pvalue(fit.km.str, data = dat)\n\n variable pval method pval.txt\n1 AFB+strata(GENDER) 0.001506607 Log-rank p = 0.0015\n\n\n\n\nStratified Cox PH model\n\nfit.cox.str <- coxph(Surv(LENFOLY, FSTAT) ~ AFB + strata(GENDER), data = dat)\nfit.cox.str %>% \n tidy(exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) %>%\n select(term, estimate, conf.low, conf.high)\n\n# A tibble: 1 × 4\n term estimate conf.low conf.high\n <chr> <dbl> <dbl> <dbl>\n1 AFB0 0.594 0.430 0.823"
},
{
- "objectID": "R/association.html",
- "href": "R/association.html",
- "title": "Association Analysis for Count Data Using R",
+ "objectID": "R/ttest_2Sample.html",
+ "href": "R/ttest_2Sample.html",
+ "title": "Two Sample t-test",
"section": "",
- "text": "The most commonly used association analysis methods for count data/contingency tables compare observed frequencies with those expected under the assumption of independence:\n\\[\nX^2 = \\sum_{i=1}^k \\frac{(x_i-e_i)^2}{e_i},\n\\] where \\(k\\) is the number of contingency table cells.\nOther measures for the correlation of two continuous variables are:"
+ "text": "The Two Sample t-test is used to compare two independent samples against each other. In the Two Sample t-test, the mean of the first sample is compared against the mean of the second sample. In R, a Two Sample t-test can be performed using the Base R t.test() function from the stats package or the proc_ttest() function from the procs package.\n\n\nThe following data was used in this example.\n\n# Create sample data\nd1 <- tibble::tribble(\n ~trt_grp, ~WtGain,\n \"placebo\", 94, \"placebo\", 12, \"placebo\", 26, \"placebo\", 89,\n \"placebo\", 88, \"placebo\", 96, \"placebo\", 85, \"placebo\", 130,\n \"placebo\", 75, \"placebo\", 54, \"placebo\", 112, \"placebo\", 69,\n \"placebo\", 104, \"placebo\", 95, \"placebo\", 53, \"placebo\", 21,\n \"treatment\", 45, \"treatment\", 62, \"treatment\", 96, \"treatment\", 128,\n \"treatment\", 120, \"treatment\", 99, \"treatment\", 28, \"treatment\", 50,\n \"treatment\", 109, \"treatment\", 115, \"treatment\", 39, \"treatment\", 96,\n \"treatment\", 87, \"treatment\", 100, \"treatment\", 76, \"treatment\", 80\n)\n\n\n\n\nIf the data are normally distributed with equal variances in both groups, we can use the classic Student’s t-test. For a two sample test where the variances are not equal, we should use Welch’s t-test. Both of those options are available with the Base R t.test() function.\n\n\n\n\nThe following code was used to test the comparison in Base R. By default, the R two sample t-test function assumes the variances in the data are unequal, and uses Welch’s t-test. Therefore, to use a classic Student’s t-test with normally distributed, equal-variance data, we must specify var.equal = TRUE. 
Also note that we must separate the single variable into two variables to satisfy the t.test() syntax and set paired = FALSE.\n\n d1p <- dplyr::filter(d1, trt_grp == 'placebo')\n d1t <- dplyr::filter(d1, trt_grp == 'treatment')\n\n # Perform t-test\n t.test(d1p$WtGain, d1t$WtGain, \n var.equal = TRUE, paired = FALSE)\n\n\n Two Sample t-test\n\ndata: d1p$WtGain and d1t$WtGain\nt = -0.6969, df = 30, p-value = 0.4912\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -31.19842 15.32342\nsample estimates:\nmean of x mean of y \n 75.1875 83.1250 \n\n\n\n\n\n\n\n\nThe following code was used to test the comparison in Base R using Welch’s t-test. Observe that in this case, the var.equal parameter is set to FALSE.\n\n d1p <- dplyr::filter(d1, trt_grp == 'placebo')\n d1t <- dplyr::filter(d1, trt_grp == 'treatment')\n\n # Perform t-test\n t.test(d1p$WtGain, d1t$WtGain, \n var.equal = FALSE, paired = FALSE)\n\n\n Welch Two Sample t-test\n\ndata: d1p$WtGain and d1t$WtGain\nt = -0.6969, df = 29.694, p-value = 0.4913\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -31.20849 15.33349\nsample estimates:\nmean of x mean of y \n 75.1875 83.1250 \n\n\n\n\n\n\n\n\n\n\n\nThe following code from the procs package was used to perform a two sample t-test. Note that the proc_ttest() function performs both the Student’s t-test and Welch’s (Satterthwaite) t-test in the same call. The results are displayed on separate rows. 
This output is similar to SAS.\n\n library(procs)\n\n # Perform t-test\n proc_ttest(d1, var = WtGain,\n class = trt_grp)\n\n$Statistics\n VAR CLASS METHOD N MEAN STD STDERR MIN MAX\n1 WtGain placebo <NA> 16 75.1875 33.81167 8.452918 12 130\n2 WtGain treatment <NA> 16 83.1250 30.53495 7.633738 28 128\n3 WtGain Diff (1-2) Pooled NA -7.9375 NA 11.389723 NA NA\n4 WtGain Diff (1-2) Satterthwaite NA -7.9375 NA 11.389723 NA NA\n\n$ConfLimits\n VAR CLASS METHOD MEAN LCLM UCLM STD LCLMSTD\n1 WtGain placebo <NA> 75.1875 57.17053 93.20447 33.81167 24.97685\n2 WtGain treatment <NA> 83.1250 66.85407 99.39593 30.53495 22.55632\n3 WtGain Diff (1-2) Pooled -7.9375 -31.19842 15.32342 NA NA\n4 WtGain Diff (1-2) Satterthwaite -7.9375 -31.20849 15.33349 NA NA\n UCLMSTD\n1 52.33003\n2 47.25868\n3 NA\n4 NA\n\n$TTests\n VAR METHOD VARIANCES DF T PROBT\n1 WtGain Pooled Equal 30.00000 -0.6969002 0.4912306\n2 WtGain Satterthwaite Unequal 29.69359 -0.6969002 0.4912856\n\n$Equality\n VAR METHOD NDF DDF FVAL PROBF\n1 WtGain Folded F 15 15 1.226136 0.6980614\n\n\nViewer Output:"
},
{
- "objectID": "R/association.html#chi-squared-test",
- "href": "R/association.html#chi-squared-test",
- "title": "Association Analysis for Count Data Using R",
- "section": "Chi-Squared test",
- "text": "Chi-Squared test\n\nchisq.test(tab)\n\n\n Pearson's Chi-squared test with Yates' continuity correction\n\ndata: tab\nX-squared = 1.8261, df = 1, p-value = 0.1766"
+ "objectID": "R/ttest_2Sample.html#base-r",
+ "href": "R/ttest_2Sample.html#base-r",
+ "title": "Two Sample t-test",
+ "section": "",
+ "text": "If we have normally distributed data with equal variances, we can use the classic Student’s t-test. For a two sample test where the variances are not equal, we should use Welch’s t-test. Both of those options are available with the Base R t.test() function.\n\n\n\n\nThe following code was used to test the comparison in Base R. By default, the R two sample t-test function assumes the variances in the data are unequal, and uses Welch’s t-test. Therefore, to use a classic Student’s t-test with normally distributed, equal-variance data, we must specify var.equal = TRUE. Also note that we must separate the single variable into two variables to satisfy the t.test() syntax and set paired = FALSE.\n\n d1p <- dplyr::filter(d1, trt_grp == 'placebo')\n d1t <- dplyr::filter(d1, trt_grp == 'treatment')\n\n # Perform t-test\n t.test(d1p$WtGain, d1t$WtGain, \n var.equal = TRUE, paired = FALSE)\n\n\n Two Sample t-test\n\ndata: d1p$WtGain and d1t$WtGain\nt = -0.6969, df = 30, p-value = 0.4912\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -31.19842 15.32342\nsample estimates:\nmean of x mean of y \n 75.1875 83.1250 \n\n\n\n\n\n\n\n\nThe following code was used to test the comparison in Base R using Welch’s t-test. Observe that in this case, the var.equal parameter is set to FALSE.\n\n d1p <- dplyr::filter(d1, trt_grp == 'placebo')\n d1t <- dplyr::filter(d1, trt_grp == 'treatment')\n\n # Perform t-test\n t.test(d1p$WtGain, d1t$WtGain, \n var.equal = FALSE, paired = FALSE)\n\n\n Welch Two Sample t-test\n\ndata: d1p$WtGain and d1t$WtGain\nt = -0.6969, df = 29.694, p-value = 0.4913\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -31.20849 15.33349\nsample estimates:\nmean of x mean of y \n 75.1875 83.1250"
},
{
- "objectID": "R/association.html#fisher-exact-test",
- "href": "R/association.html#fisher-exact-test",
- "title": "Association Analysis for Count Data Using R",
- "section": "Fisher Exact Test",
- "text": "Fisher Exact Test\nFor \\(2 \\times 2\\) contingency tables, p-values are obtained directly using the hypergeometric distribution.\n\nfisher.test(tab)\n\n\n Fisher's Exact Test for Count Data\n\ndata: tab\np-value = 0.135\nalternative hypothesis: true odds ratio is not equal to 1\n95 percent confidence interval:\n 0.8158882 3.2251299\nsample estimates:\nodds ratio \n 1.630576"
+ "objectID": "R/ttest_2Sample.html#procs-package",
+ "href": "R/ttest_2Sample.html#procs-package",
+ "title": "Two Sample t-test",
+ "section": "",
+ "text": "The following code from the procs package was used to perform a two sample t-test. Note that the proc_ttest() function performs both the Student’s t-test and Welch’s (Satterthwaite) t-test in the same call. The results are displayed on separate rows. This output is similar to SAS.\n\n library(procs)\n\n # Perform t-test\n proc_ttest(d1, var = WtGain,\n class = trt_grp)\n\n$Statistics\n VAR CLASS METHOD N MEAN STD STDERR MIN MAX\n1 WtGain placebo <NA> 16 75.1875 33.81167 8.452918 12 130\n2 WtGain treatment <NA> 16 83.1250 30.53495 7.633738 28 128\n3 WtGain Diff (1-2) Pooled NA -7.9375 NA 11.389723 NA NA\n4 WtGain Diff (1-2) Satterthwaite NA -7.9375 NA 11.389723 NA NA\n\n$ConfLimits\n VAR CLASS METHOD MEAN LCLM UCLM STD LCLMSTD\n1 WtGain placebo <NA> 75.1875 57.17053 93.20447 33.81167 24.97685\n2 WtGain treatment <NA> 83.1250 66.85407 99.39593 30.53495 22.55632\n3 WtGain Diff (1-2) Pooled -7.9375 -31.19842 15.32342 NA NA\n4 WtGain Diff (1-2) Satterthwaite -7.9375 -31.20849 15.33349 NA NA\n UCLMSTD\n1 52.33003\n2 47.25868\n3 NA\n4 NA\n\n$TTests\n VAR METHOD VARIANCES DF T PROBT\n1 WtGain Pooled Equal 30.00000 -0.6969002 0.4912306\n2 WtGain Satterthwaite Unequal 29.69359 -0.6969002 0.4912856\n\n$Equality\n VAR METHOD NDF DDF FVAL PROBF\n1 WtGain Folded F 15 15 1.226136 0.6980614\n\n\nViewer Output:"
},
{
- "objectID": "R/association.html#chi-squared-test-1",
- "href": "R/association.html#chi-squared-test-1",
- "title": "Association Analysis for Count Data Using R",
- "section": "Chi-Squared Test",
- "text": "Chi-Squared Test\n\nchisq.test(tab2)\n\nWarning in chisq.test(tab2): Chi-squared approximation may be incorrect\n\n\n\n Pearson's Chi-squared test\n\ndata: tab2\nX-squared = 260.76, df = 15, p-value < 2.2e-16\n\n\nThe warning means that the smallest expected frequencies is lower than 5. It is recommended to use the Fisher’s exact test in this case."
+ "objectID": "R/survey-stats-summary.html",
+ "href": "R/survey-stats-summary.html",
+ "title": "Survey Summary Statistics using R",
+ "section": "",
+ "text": "When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample.\nAll of these designs need to be taken into account when calculating statistics, and when producing models. Only summary statistics are discussed in this document, and variances are calculated using the default Taylor series linearisation methods. For a more detailed introduction to survey statistics in R, see (Lohr 2022) or (Lumley 2004)."
},
{
- "objectID": "R/association.html#fisher-exact-test-1",
- "href": "R/association.html#fisher-exact-test-1",
- "title": "Association Analysis for Count Data Using R",
- "section": "Fisher Exact Test",
- "text": "Fisher Exact Test\nFor contingency tables larger than \\(2 \\times 2\\), p-values are based on simulations, which might require a lot of time (see ?fisher.test for details).\n\nfisher.test(tab2, simulate.p.value=TRUE)\n\n\n Fisher's Exact Test for Count Data with simulated p-value (based on\n 2000 replicates)\n\ndata: tab2\np-value = 0.0004998\nalternative hypothesis: two.sided"
+ "objectID": "R/survey-stats-summary.html#mean",
+ "href": "R/survey-stats-summary.html#mean",
+ "title": "Survey Summary Statistics using R",
+ "section": "Mean",
+ "text": "Mean\nIf we want to calculate a mean of a variable in a dataset which has been obtained from a simple random sample such as apisrs, in R we can create a design object using the survey::svydesign function (specifying that there is no PSU using id = ~1 and the finite population correction using fpc=~fpc).\n\nsrs_design <- svydesign(id = ~1, fpc = ~fpc, data = apisrs)\n\nThis design object stores all metadata about the sample alongside the data, and is used by all subsequent functions in the {survey} package. To calculate the mean, standard error, and confidence intervals of the growth variable, we can use the survey::svymean and confint functions:\n\n# Calculate mean and SE of growth. The standard error will be corrected by the finite population correction specified in the design\nsrs_means <- svymean(~growth, srs_design)\n\nsrs_means\n\n mean SE\ngrowth 31.9 2.0905\n\n# Use degf() to get the degrees of freedom\nconfint(srs_means, df=degf(srs_design))\n\n 2.5 % 97.5 %\ngrowth 27.77764 36.02236\n\n\nNote that to obtain correct results, we had to specify the degrees of freedom using the design object."
+ },
+ {
+ "objectID": "R/survey-stats-summary.html#total",
+ "href": "R/survey-stats-summary.html#total",
+ "title": "Survey Summary Statistics using R",
+ "section": "Total",
+ "text": "Total\nCalculating population totals can be done using the survey::svytotal function in R.\n\nsvytotal(~growth, srs_design)\n\n total SE\ngrowth 197589 12949"
+ },
+ {
+ "objectID": "R/survey-stats-summary.html#ratios",
+ "href": "R/survey-stats-summary.html#ratios",
+ "title": "Survey Summary Statistics using R",
+ "section": "Ratios",
+ "text": "Ratios\nTo perform ratio analysis for means or proportions of analysis variables in R, we can use survey::svyratio. Here we request that the ratio estimation is not separated by stratum, as this design is not stratified.\n\nsvy_ratio <- svyratio(\n ~api00,\n ~api99,\n srs_design,\n se=TRUE,\n df=degf(srs_design),\n separate=FALSE\n)\n\nsvy_ratio\n\nRatio estimator: svyratio.survey.design2(~api00, ~api99, srs_design, se = TRUE, \n df = degf(srs_design), separate = FALSE)\nRatios=\n api99\napi00 1.051066\nSEs=\n api99\napi00 0.003603991\n\nconfint(svy_ratio, df=degf(srs_design))\n\n 2.5 % 97.5 %\napi00/api99 1.043959 1.058173"
+ },
+ {
+ "objectID": "R/survey-stats-summary.html#proportions",
+ "href": "R/survey-stats-summary.html#proportions",
+ "title": "Survey Summary Statistics using R",
+ "section": "Proportions",
+ "text": "Proportions\nTo calculate a proportion in R, we use the svymean function on a factor or character column:\n\nprops <- svymean(~sch.wide, srs_design)\n\nprops\n\n mean SE\nsch.wideNo 0.185 0.0271\nsch.wideYes 0.815 0.0271\n\nconfint(props, df=degf(srs_design))\n\n 2.5 % 97.5 %\nsch.wideNo 0.1316041 0.2383959\nsch.wideYes 0.7616041 0.8683959\n\n\nFor proportions close to 0, survey::svyciprop may produce more accurate confidence intervals than confint."
+ },
+ {
+ "objectID": "R/survey-stats-summary.html#quantiles",
+ "href": "R/survey-stats-summary.html#quantiles",
+ "title": "Survey Summary Statistics using R",
+ "section": "Quantiles",
+ "text": "Quantiles\nTo calculate quantiles in R, we can use the survey::svyquantile function. Note that this function was reworked in version 4.1 of {survey}, and prior to this had different arguments and results. The current version of svyquantile has a qrule argument, which is similar to the type argument in quantile and can be used to change how the quantiles are calculated. For more information, see vignette(\"qrule\", package=\"survey\").\n\nsvyquantile(\n ~growth,\n srs_design,\n quantiles = c(0.025, 0.5, 0.975),\n ci=TRUE,\n se=TRUE\n)\n\n$growth\n quantile ci.2.5 ci.97.5 se\n0.025 -16 -21 -12 2.281998\n0.5 27 24 31 1.774887\n0.975 99 84 189 26.623305\n\nattr(,\"hasci\")\n[1] TRUE\nattr(,\"class\")\n[1] \"newsvyquantile\""
+ },
+ {
+ "objectID": "R/count_data_regression.html",
+ "href": "R/count_data_regression.html",
+ "title": "Regression for Count Data",
+ "section": "",
+ "text": "The most commonly used models for count data in clinical trials include:\n\nPoisson regression: assumes the response variable \\(Y\\) has a Poisson distribution, whose mean is linked to the explanatory variables \\(\\bf{x}\\) through the logarithm.\n\n\\[\n\\log(E(Y_i|x_i)) = \\beta_0 + \\beta' x_i, \\; i = 1,\\ldots,n\n\\]\n\nQuasi-Poisson regression: Poisson model that allows overdispersion, i.e. the dispersion parameter is not fixed at one.\nNegative-Binomial regression: popular generalization which loosens the assumption, made by the Poisson model, that the variance is equal to the mean.\n\nOther models include hurdle or zero-inflated models, if data have more zero observations than expected.\n\nExample: Familial Adenomatous Polyposis Data\nData source: F. M. Giardiello, S. R. Hamilton, A. J. Krush, S. Piantadosi, L. M. Hylind, P. Celano, S. V. Booker, C. R. Robinson and G. J. A. Offerhaus (1993), Treatment of colonic and rectal adenomas with sulindac in familial adenomatous polyposis. New England Journal of Medicine, 328(18), 1313–1316.\nData from a placebo-controlled trial of a non-steroidal anti-inflammatory drug in the treatment of familial adenomatous polyposis (FAP). 
(see ?polyps for details).\n\npolyps <- HSAUR2::polyps\nglimpse(polyps)\n\nRows: 20\nColumns: 3\n$ number <dbl> 63, 2, 28, 17, 61, 1, 7, 15, 44, 25, 3, 28, 10, 40, 33, 46, 50,…\n$ treat <fct> placebo, drug, placebo, drug, placebo, drug, placebo, placebo, …\n$ age <dbl> 20, 16, 18, 22, 13, 23, 34, 50, 19, 17, 23, 22, 30, 27, 23, 22,…\n\n\nWe analyze the number of colonic polyps at 12 months in dependency of treatment and age of the patient.\n\npolyps %>% \n ggplot(aes(y = number, x = age, color = treat)) + \n geom_point() + theme_minimal()\n\n\n\n\n\n\n\n\n\n\nModel Fit\nWe fit a generalized linear model for number using the Poisson distribution with default log link.\n\n# Poisson\nm1 <- glm(number ~ treat + age, data = polyps, family = poisson)\nsummary(m1)\n\n\nCall:\nglm(formula = number ~ treat + age, family = poisson, data = polyps)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.529024 0.146872 30.84 < 2e-16 ***\ntreatdrug -1.359083 0.117643 -11.55 < 2e-16 ***\nage -0.038830 0.005955 -6.52 7.02e-11 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 378.66 on 19 degrees of freedom\nResidual deviance: 179.54 on 17 degrees of freedom\nAIC: 273.88\n\nNumber of Fisher Scoring iterations: 5\n\n\nThe parameter estimates are on log-scale. 
For better interpretation, we can exponentiate these estimates to obtain rate ratios with \\(95\\)% confidence intervals:\n\n# Rate ratios and CI\nexp(coef(m1))\n\n(Intercept) treatdrug age \n 92.6681047 0.2568961 0.9619140 \n\nexp(confint(m1))\n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 69.5361752 123.6802476\ntreatdrug 0.2028078 0.3218208\nage 0.9505226 0.9729788\n\n\nPredictions for number of colonic polyps given a new 25-year-old patient on either treatment using predict():\n\n# new 25 year old patient\nnew_pt <- data.frame(treat = c(\"drug\",\"placebo\"), age=25)\npredict(m1, new_pt, type = \"response\")\n\n 1 2 \n 9.017654 35.102332 \n\n\n\n\nModelling Overdispersion\nThe Poisson model assumes that mean and variance are equal, which can be a very restrictive assumption. One option to relax the assumption is adding an overdispersion constant to the relationship, i.e. \\(\\text{Var}(\\text{response}) = \\phi\\cdot \\mu\\), which results in a quasipoisson model:\n\n# Quasi poisson\nm2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson)\nsummary(m2)\n\n\nCall:\nglm(formula = number ~ treat + age, family = quasipoisson, data = polyps)\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 4.52902 0.48106 9.415 3.72e-08 ***\ntreatdrug -1.35908 0.38533 -3.527 0.00259 ** \nage -0.03883 0.01951 -1.991 0.06284 . \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for quasipoisson family taken to be 10.72805)\n\n Null deviance: 378.66 on 19 degrees of freedom\nResidual deviance: 179.54 on 17 degrees of freedom\nAIC: NA\n\nNumber of Fisher Scoring iterations: 5\n\n\nAlternatively, we can explicitly model the count data with overdispersion using the negative Binomial model. 
In this case, the variance is a function of both \\(\\mu\\) and \\(\\mu^2\\):\n\\[\n\\text{Var}(\\text{response}) = \\mu + \\kappa\\,\\mu^2.\n\\]\n\n# Negative Binomial\nm3 <- MASS::glm.nb(number ~ treat + age, data = polyps)\nsummary(m3)\n\n\nCall:\nMASS::glm.nb(formula = number ~ treat + age, data = polyps, init.theta = 1.719491, \n link = log)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.52603 0.59466 7.611 2.72e-14 ***\ntreatdrug -1.36812 0.36903 -3.707 0.000209 ***\nage -0.03856 0.02095 -1.840 0.065751 . \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for Negative Binomial(1.7195) family taken to be 1)\n\n Null deviance: 36.734 on 19 degrees of freedom\nResidual deviance: 22.002 on 17 degrees of freedom\nAIC: 164.88\n\nNumber of Fisher Scoring iterations: 1\n\n Theta: 1.719 \n Std. Err.: 0.607 \n\n 2 x log-likelihood: -156.880 \n\n\nBoth models yield very similar parameter estimates, but differ in their estimated standard errors."
+ },
+ {
+ "objectID": "R/rounding.html",
+ "href": "R/rounding.html",
+ "title": "Rounding in R",
+ "section": "",
+ "text": "The round() function in Base R rounds to the nearest whole number and, when a value is equidistant between two numbers, rounds to the even one (‘round half to even’), meaning that exactly 12.5 rounds to the integer 12.\nNote that the janitor package in R contains a function round_half_up() that rounds to the nearest whole number and, when equidistant, rounds away from zero (‘rounding up’), meaning that exactly 12.5 rounds to the integer 13.\n\n#Example code\nmy_number <-c(2.2,3.99,1.2345,7.876,13.8739)\n\nr_0_dec <- round(my_number, digits=0);\nr_1_dec <- round(my_number, digits=1);\nr_2_dec <- round(my_number, digits=2);\nr_3_dec <- round(my_number, digits=3);\n\nr_0_dec\nr_1_dec\nr_2_dec\nr_3_dec\n\n> r_0_dec\n[1] 2 4 1 8 14\n> r_1_dec\n[1] 2.2 4.0 1.2 7.9 13.9\n> r_2_dec\n[1] 2.20 3.99 1.23 7.88 13.87\n> r_3_dec\n[1] 2.200 3.990 1.234 7.876 13.874\n\nIf using the round_half_up() function from the janitor package, the results would be the same, with the exception of rounding 1.2345 to 3 decimal places, where a result of 1.235 would be obtained instead of 1.234."
+ },
+ {
+ "objectID": "R/wilcoxonsr_hodges_lehman.html",
+ "href": "R/wilcoxonsr_hodges_lehman.html",
+ "title": "Wilcoxon signed-rank test",
+ "section": "",
+ "text": "Introduction\nThe Wilcoxon signed-rank test is a non-parametric test which is sometimes used instead of the paired Student’s t-test when assumptions regarding a normal distribution are not valid. It is a rank test, designed for analyzing repeated measures or paired observations by a paired comparison (a type of location test) to assess whether their population means differ. Whilst it does not ‘compare’ means or medians for a set of paired data, it ranks the results on A and ranks the results on B, then compares if Prob(A>B) > Prob(B>A).\nTies are when you have two observations with the same result. For example, in a 2-period cross-over study, you take the difference between result on Treatment A minus result on Treatment B and find that two or more subjects have the same difference.\nAdditionally, “0s” can cause some trouble as well, for example when the difference between result on Treatment A minus result on Treatment B equals 0.\n\n\nData\nAnalysis will be conducted using anonymized data from a 2-period, cross-over study comparing treatments A and B in patients with asthma and acute airway obstruction induced by repeated mannitol challenges.\nThe Wilcoxon signed-rank test was applied to analyse the time to return to baseline FEV1 post-mannitol challenge 2. Median difference, p value and 95% CI were provided using the Hodges-Lehmann estimate.\n\nhead(blood_p)\n\n patient sex agegrp bp_before bp_after\n1 1 Male 30-45 143.670 153.316\n2 2 Male 30-45 163.082 170.576\n3 3 Male 30-45 153.393 168.599\n4 4 Male 30-45 153.082 142.358\n5 5 Male 30-45 146.720 141.193\n6 6 Male 30-45 150.668 147.204\n\n\n\n\nDataset without ties\nLet’s consider a case where the dataset has no ties.\n\n\nAvailable packages\nIn R, the Wilcoxon signed-rank test can be performed using, for example, the DOS2 (version 0.5.2) or stats (version 3.6.2) package.\n\nstats\nThe wilcox.test function, which performs Wilcoxon rank sum and signed-rank tests, will be applied. 
For more information about that function go here\nWe will focus on the following arguments: alternative, paired, exact, correct, conf.int.\n\n\n\nExamples\n\n# Exact \nstats::wilcox.test(x = blood_p$bp_after, y = blood_p$bp_before, \n paired = TRUE, \n conf.int = TRUE, \n conf.level = 0.9, \n alternative = \"two.sided\", \n exact = TRUE)\n\n\n Wilcoxon signed rank exact test\n\ndata: blood_p$bp_after and blood_p$bp_before\nV = 17251, p-value = 0.009379\nalternative hypothesis: true location shift is not equal to 0\n90 percent confidence interval:\n 1.5045 5.9945\nsample estimates:\n(pseudo)median \n 3.68875 \n\n# No exact & continuity correction\nstats::wilcox.test(x = blood_p$bp_after, y = blood_p$bp_before, \n paired = TRUE, \n conf.int = TRUE, \n conf.level = 0.9, \n alternative = \"two.sided\", \n exact = FALSE, \n correct = TRUE)\n\n\n Wilcoxon signed rank test with continuity correction\n\ndata: blood_p$bp_after and blood_p$bp_before\nV = 17251, p-value = 0.009548\nalternative hypothesis: true location shift is not equal to 0\n90 percent confidence interval:\n 1.504565 5.994467\nsample estimates:\n(pseudo)median \n 3.688796 \n\n# No exact & No continuity correction\nstats::wilcox.test(x = blood_p$bp_after, y = blood_p$bp_before, \n paired = TRUE, \n conf.int = TRUE, \n conf.level = 0.9, \n alternative = \"two.sided\", \n exact = FALSE, \n correct = FALSE)\n\n\n Wilcoxon signed rank test\n\ndata: blood_p$bp_after and blood_p$bp_before\nV = 17251, p-value = 0.009535\nalternative hypothesis: true location shift is not equal to 0\n90 percent confidence interval:\n 1.504991 5.993011\nsample estimates:\n(pseudo)median \n 3.688796 \n\n\n\n\nImportant notes on stats::wilcox.test\n\nBy default an exact p-value is computed if the sample size is less than 50 and there are no ties. 
Otherwise, a normal approximation is used.\nIf exact p-values are available, an exact confidence interval is obtained by the algorithm described in Bauer (1972), and the Hodges-Lehmann estimator is employed. Otherwise, the returned confidence interval and point estimate are based on normal approximations.\nIf a non-exact p-value is calculated, a continuity correction in the normal approximation for the p-value can be applied with the correct argument.\nStatistic V is provided, which is a test statistic based on the Sprent (1993) algorithm.\n\n\nDOS2\nThe senWilcox function, which performs a sensitivity analysis for Wilcoxon’s signed-rank statistic, will be applied. For more information about that function go here\n\n\n\nExamples\n\nDOS2::senWilcox(blood_p$bp_after - blood_p$bp_before, \n gamma = 1, \n conf.int = TRUE, \n alpha = 0.1, \n alternative = \"twosided\")\n\n$pval\n[1] 0.009534732\n\n$estimate\n low high \n3.688796 3.688796 \n\n$ci\n low high \n1.50494 5.99305 \n\n\n\n\nImportant notes on DOS2::senWilcox\n\nGamma >= 1 is the value of the sensitivity parameter. If gamma=1, then you are assuming ignorable treatment assignment or, equivalently, no unmeasured confounding. That is the scenario considered in our example, so no sensitivity analysis is performed.\nOnly the p value, estimate and CI are provided.\n\n\n\nCoin package - coming soon!"
+ },
+ {
+ "objectID": "R/anova.html",
+ "href": "R/anova.html",
+ "title": "ANOVA",
+ "section": "",
+ "text": "Getting Started\nTo demonstrate the various types of sums of squares, we’ll create a data frame called df_disease taken from the SAS documentation. The corresponding data can be found here.\n\n\nThe Model\nFor this example, we’re testing for a significant difference in stem_length using ANOVA. In R, we’re using lm() to run the ANOVA, and then using broom::glance() and broom::tidy() to view the results in a table format.\n\nlm_model <- lm(y ~ drug + disease + drug*disease, df_disease)\n\nThe glance function gives us a summary of the model diagnostic values.\n\nlm_model %>% \n glance() %>% \n pivot_longer(everything())\n\n# A tibble: 12 × 2\n name value\n <chr> <dbl>\n 1 r.squared 0.456 \n 2 adj.r.squared 0.326 \n 3 sigma 10.5 \n 4 statistic 3.51 \n 5 p.value 0.00130\n 6 df 11 \n 7 logLik -212. \n 8 AIC 450. \n 9 BIC 477. \n10 deviance 5081. \n11 df.residual 46 \n12 nobs 58 \n\n\nThe tidy function gives a summary of the model results.\n\nlm_model %>% tidy()\n\n# A tibble: 12 × 5\n term estimate std.error statistic p.value\n <chr> <dbl> <dbl> <dbl> <dbl>\n 1 (Intercept) 29.3 4.29 6.84 0.0000000160\n 2 drug2 -1.33 6.36 -0.210 0.835 \n 3 drug3 -13 7.43 -1.75 0.0869 \n 4 drug4 -15.7 6.36 -2.47 0.0172 \n 5 disease2 -1.08 6.78 -0.160 0.874 \n 6 disease3 -8.93 6.36 -1.40 0.167 \n 7 drug2:disease2 6.58 9.78 0.673 0.504 \n 8 drug3:disease2 -10.9 10.2 -1.06 0.295 \n 9 drug4:disease2 0.317 9.30 0.0340 0.973 \n10 drug2:disease3 -0.900 9.00 -0.100 0.921 \n11 drug3:disease3 1.10 10.2 0.107 0.915 \n12 drug4:disease3 9.53 9.20 1.04 0.306 \n\n\n\n\nThe Results\nYou’ll see that R prints the individual results for each level of the drug and disease interaction. 
We can get the combined F table in R using the anova() function on the model object.\n\nlm_model %>% \n anova() %>% \n tidy() %>% \n kable()\n\n\n\n\nterm\ndf\nsumsq\nmeansq\nstatistic\np.value\n\n\n\n\ndrug\n3\n3133.2385\n1044.4128\n9.455761\n0.0000558\n\n\ndisease\n2\n418.8337\n209.4169\n1.895990\n0.1617201\n\n\ndrug:disease\n6\n707.2663\n117.8777\n1.067225\n0.3958458\n\n\nResiduals\n46\n5080.8167\n110.4525\nNA\nNA\n\n\n\n\n\nWe can add a Total row by using add_row and calculating the sum of the degrees of freedom and sum of squares.\n\nlm_model %>%\n anova() %>%\n tidy() %>%\n add_row(term = \"Total\", df = sum(.$df), sumsq = sum(.$sumsq)) %>% \n kable()\n\n\n\n\nterm\ndf\nsumsq\nmeansq\nstatistic\np.value\n\n\n\n\ndrug\n3\n3133.2385\n1044.4128\n9.455761\n0.0000558\n\n\ndisease\n2\n418.8337\n209.4169\n1.895990\n0.1617201\n\n\ndrug:disease\n6\n707.2663\n117.8777\n1.067225\n0.3958458\n\n\nResiduals\n46\n5080.8167\n110.4525\nNA\nNA\n\n\nTotal\n57\n9340.1552\nNA\nNA\nNA\n\n\n\n\n\n\n\nSums of Squares Tables\nUnfortunately, it is not easy to get the various types of sums of squares calculations using functions from base R. However, the rstatix package offers a solution to produce these various sums of squares tables. For each type, you supply the original dataset and model to the
anova_test function, then specify the type and set detailed = TRUE.\n\nType I\n\ndf_disease %>% \n rstatix::anova_test(\n y ~ drug + disease + drug*disease, \n type = 1, \n detailed = TRUE) %>% \n rstatix::get_anova_table() %>% \n kable()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEffect\nDFn\nDFd\nSSn\nSSd\nF\np\np<.05\nges\n\n\n\n\ndrug\n3\n46\n3133.239\n5080.817\n9.456\n5.58e-05\n*\n0.381\n\n\ndisease\n2\n46\n418.834\n5080.817\n1.896\n1.62e-01\n\n0.076\n\n\ndrug:disease\n6\n46\n707.266\n5080.817\n1.067\n3.96e-01\n\n0.122\n\n\n\n\n\n\n\nType II\n\ndf_disease %>% \n rstatix::anova_test(\n y ~ drug + disease + drug*disease, \n type = 2, \n detailed = TRUE) %>% \n rstatix::get_anova_table() %>% \n kable()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEffect\nSSn\nSSd\nDFn\nDFd\nF\np\np<.05\nges\n\n\n\n\ndrug\n3063.433\n5080.817\n3\n46\n9.245\n6.75e-05\n*\n0.376\n\n\ndisease\n418.834\n5080.817\n2\n46\n1.896\n1.62e-01\n\n0.076\n\n\ndrug:disease\n707.266\n5080.817\n6\n46\n1.067\n3.96e-01\n\n0.122\n\n\n\n\n\n\n\nType III\n\ndf_disease %>% \n rstatix::anova_test(\n y ~ drug + disease + drug*disease, \n type = 3, \n detailed = TRUE) %>% \n rstatix::get_anova_table() %>% \n kable()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEffect\nSSn\nSSd\nDFn\nDFd\nF\np\np<.05\nges\n\n\n\n\n(Intercept)\n20037.613\n5080.817\n1\n46\n181.414\n0.00e+00\n*\n0.798\n\n\ndrug\n2997.472\n5080.817\n3\n46\n9.046\n8.09e-05\n*\n0.371\n\n\ndisease\n415.873\n5080.817\n2\n46\n1.883\n1.64e-01\n\n0.076\n\n\ndrug:disease\n707.266\n5080.817\n6\n46\n1.067\n3.96e-01\n\n0.122\n\n\n\n\n\n\n\nType IV\nIn R there is no equivalent operation to the Type IV sums of squares calculation in SAS."
},
{
"objectID": "R/mi_mar_predictive_mean_match.html",
@@ -1117,217 +1166,245 @@
"href": "R/mi_mar_predictive_mean_match.html#example",
"title": "Multiple Imputation: Predictive Mean Matching",
"section": "Example",
- "text": "Example\nWe use the small dataset nhanes included in mice package. It has 25 rows, and three out of four variables have missings.\nThe original NHANES data is a large national level survey, some are publicly available via R package nhanes.\n\nlibrary(mice)\n\nWarning in check_dep_version(): ABI version mismatch: \nlme4 was built with Matrix ABI version 1\nCurrent Matrix ABI version is 0\nPlease re-install lme4 from source or restore original 'Matrix' package\n\n\n\nAttaching package: 'mice'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\nThe following objects are masked from 'package:base':\n\n cbind, rbind\n\n# load example dataset from mice\nhead(nhanes)\n\n age bmi hyp chl\n1 1 NA NA NA\n2 2 22.7 1 187\n3 1 NA 1 187\n4 3 NA NA NA\n5 1 20.4 1 113\n6 3 NA NA 184\n\nsummary(nhanes)\n\n age bmi hyp chl \n Min. :1.00 Min. :20.40 Min. :1.000 Min. :113.0 \n 1st Qu.:1.00 1st Qu.:22.65 1st Qu.:1.000 1st Qu.:185.0 \n Median :2.00 Median :26.75 Median :1.000 Median :187.0 \n Mean :1.76 Mean :26.56 Mean :1.235 Mean :191.4 \n 3rd Qu.:2.00 3rd Qu.:28.93 3rd Qu.:1.000 3rd Qu.:212.0 \n Max. :3.00 Max. :35.30 Max. :2.000 Max. 
:284.0 \n NA's :9 NA's :8 NA's :10 \n\n\n\nImpute with PMM\nTo impute with PMM is straightforward: specify the method, method = pmm.\n\nimp_pmm <- mice(nhanes, method = 'pmm', m=5, maxit=10)\n\n\n iter imp variable\n 1 1 bmi hyp chl\n 1 2 bmi hyp chl\n 1 3 bmi hyp chl\n 1 4 bmi hyp chl\n 1 5 bmi hyp chl\n 2 1 bmi hyp chl\n 2 2 bmi hyp chl\n 2 3 bmi hyp chl\n 2 4 bmi hyp chl\n 2 5 bmi hyp chl\n 3 1 bmi hyp chl\n 3 2 bmi hyp chl\n 3 3 bmi hyp chl\n 3 4 bmi hyp chl\n 3 5 bmi hyp chl\n 4 1 bmi hyp chl\n 4 2 bmi hyp chl\n 4 3 bmi hyp chl\n 4 4 bmi hyp chl\n 4 5 bmi hyp chl\n 5 1 bmi hyp chl\n 5 2 bmi hyp chl\n 5 3 bmi hyp chl\n 5 4 bmi hyp chl\n 5 5 bmi hyp chl\n 6 1 bmi hyp chl\n 6 2 bmi hyp chl\n 6 3 bmi hyp chl\n 6 4 bmi hyp chl\n 6 5 bmi hyp chl\n 7 1 bmi hyp chl\n 7 2 bmi hyp chl\n 7 3 bmi hyp chl\n 7 4 bmi hyp chl\n 7 5 bmi hyp chl\n 8 1 bmi hyp chl\n 8 2 bmi hyp chl\n 8 3 bmi hyp chl\n 8 4 bmi hyp chl\n 8 5 bmi hyp chl\n 9 1 bmi hyp chl\n 9 2 bmi hyp chl\n 9 3 bmi hyp chl\n 9 4 bmi hyp chl\n 9 5 bmi hyp chl\n 10 1 bmi hyp chl\n 10 2 bmi hyp chl\n 10 3 bmi hyp chl\n 10 4 bmi hyp chl\n 10 5 bmi hyp chl\n\nimp_pmm\n\nClass: mids\nNumber of multiple imputations: 5 \nImputation methods:\n age bmi hyp chl \n \"\" \"pmm\" \"pmm\" \"pmm\" \nPredictorMatrix:\n age bmi hyp chl\nage 0 1 1 1\nbmi 1 0 1 1\nhyp 1 1 0 1\nchl 1 1 1 0\n\n# imputations for bmi\nimp_pmm$imp$bmi\n\n 1 2 3 4 5\n1 27.2 27.2 22.0 22.7 22.5\n3 35.3 26.3 27.2 27.2 30.1\n4 25.5 20.4 27.5 22.5 22.0\n6 22.5 22.5 20.4 24.9 21.7\n10 22.7 27.2 27.4 22.0 29.6\n11 27.2 22.0 27.2 30.1 30.1\n12 27.5 22.5 29.6 25.5 27.5\n16 35.3 30.1 35.3 28.7 33.2\n21 21.7 27.2 30.1 27.2 27.2\n\n\nAn alternative to the standard PMM is midastouch.\n\nimp_pmms <- mice(nhanes, method = 'midastouch', m=5, maxit=10)\n\n\n iter imp variable\n 1 1 bmi hyp chl\n 1 2 bmi hyp chl\n 1 3 bmi hyp chl\n 1 4 bmi hyp chl\n 1 5 bmi hyp chl\n 2 1 bmi hyp chl\n 2 2 bmi hyp chl\n 2 3 bmi hyp chl\n 2 4 bmi hyp chl\n 2 5 bmi hyp chl\n 3 1 bmi hyp chl\n 
3 2 bmi hyp chl\n 3 3 bmi hyp chl\n 3 4 bmi hyp chl\n 3 5 bmi hyp chl\n 4 1 bmi hyp chl\n 4 2 bmi hyp chl\n 4 3 bmi hyp chl\n 4 4 bmi hyp chl\n 4 5 bmi hyp chl\n 5 1 bmi hyp chl\n 5 2 bmi hyp chl\n 5 3 bmi hyp chl\n 5 4 bmi hyp chl\n 5 5 bmi hyp chl\n 6 1 bmi hyp chl\n 6 2 bmi hyp chl\n 6 3 bmi hyp chl\n 6 4 bmi hyp chl\n 6 5 bmi hyp chl\n 7 1 bmi hyp chl\n 7 2 bmi hyp chl\n 7 3 bmi hyp chl\n 7 4 bmi hyp chl\n 7 5 bmi hyp chl\n 8 1 bmi hyp chl\n 8 2 bmi hyp chl\n 8 3 bmi hyp chl\n 8 4 bmi hyp chl\n 8 5 bmi hyp chl\n 9 1 bmi hyp chl\n 9 2 bmi hyp chl\n 9 3 bmi hyp chl\n 9 4 bmi hyp chl\n 9 5 bmi hyp chl\n 10 1 bmi hyp chl\n 10 2 bmi hyp chl\n 10 3 bmi hyp chl\n 10 4 bmi hyp chl\n 10 5 bmi hyp chl\n\nimp_pmm\n\nClass: mids\nNumber of multiple imputations: 5 \nImputation methods:\n age bmi hyp chl \n \"\" \"pmm\" \"pmm\" \"pmm\" \nPredictorMatrix:\n age bmi hyp chl\nage 0 1 1 1\nbmi 1 0 1 1\nhyp 1 1 0 1\nchl 1 1 1 0\n\nimp_pmms$imp$bmi\n\n 1 2 3 4 5\n1 29.6 30.1 30.1 33.2 35.3\n3 29.6 30.1 30.1 29.6 29.6\n4 21.7 21.7 21.7 21.7 24.9\n6 21.7 25.5 21.7 21.7 24.9\n10 22.0 22.0 22.0 35.3 27.4\n11 30.1 30.1 29.6 33.2 29.6\n12 24.9 22.7 25.5 22.7 27.4\n16 35.3 33.2 33.2 33.2 29.6\n21 29.6 33.2 30.1 33.2 29.6"
+ "text": "Example\nWe use the small dataset nhanes included in the mice package. It has 25 rows, and three of its four variables contain missing values.\nThe original NHANES data come from a large national-level survey; some of the data are publicly available via the R package nhanes.\n\nlibrary(mice)\n\nWarning in check_dep_version(): ABI version mismatch: \nlme4 was built with Matrix ABI version 1\nCurrent Matrix ABI version is 0\nPlease re-install lme4 from source or restore original 'Matrix' package\n\n\n\nAttaching package: 'mice'\n\n\nThe following object is masked from 'package:stats':\n\n filter\n\n\nThe following objects are masked from 'package:base':\n\n cbind, rbind\n\n# load example dataset from mice\nhead(nhanes)\n\n age bmi hyp chl\n1 1 NA NA NA\n2 2 22.7 1 187\n3 1 NA 1 187\n4 3 NA NA NA\n5 1 20.4 1 113\n6 3 NA NA 184\n\nsummary(nhanes)\n\n age bmi hyp chl \n Min. :1.00 Min. :20.40 Min. :1.000 Min. :113.0 \n 1st Qu.:1.00 1st Qu.:22.65 1st Qu.:1.000 1st Qu.:185.0 \n Median :2.00 Median :26.75 Median :1.000 Median :187.0 \n Mean :1.76 Mean :26.56 Mean :1.235 Mean :191.4 \n 3rd Qu.:2.00 3rd Qu.:28.93 3rd Qu.:1.000 3rd Qu.:212.0 \n Max. :3.00 Max. :35.30 Max. :2.000 Max. 
:284.0 \n NA's :9 NA's :8 NA's :10 \n\n\n\nImpute with PMM\nTo impute with PMM is straightforward: specify the method, method = pmm.\n\nimp_pmm <- mice(nhanes, method = 'pmm', m=5, maxit=10)\n\n\n iter imp variable\n 1 1 bmi hyp chl\n 1 2 bmi hyp chl\n 1 3 bmi hyp chl\n 1 4 bmi hyp chl\n 1 5 bmi hyp chl\n 2 1 bmi hyp chl\n 2 2 bmi hyp chl\n 2 3 bmi hyp chl\n 2 4 bmi hyp chl\n 2 5 bmi hyp chl\n 3 1 bmi hyp chl\n 3 2 bmi hyp chl\n 3 3 bmi hyp chl\n 3 4 bmi hyp chl\n 3 5 bmi hyp chl\n 4 1 bmi hyp chl\n 4 2 bmi hyp chl\n 4 3 bmi hyp chl\n 4 4 bmi hyp chl\n 4 5 bmi hyp chl\n 5 1 bmi hyp chl\n 5 2 bmi hyp chl\n 5 3 bmi hyp chl\n 5 4 bmi hyp chl\n 5 5 bmi hyp chl\n 6 1 bmi hyp chl\n 6 2 bmi hyp chl\n 6 3 bmi hyp chl\n 6 4 bmi hyp chl\n 6 5 bmi hyp chl\n 7 1 bmi hyp chl\n 7 2 bmi hyp chl\n 7 3 bmi hyp chl\n 7 4 bmi hyp chl\n 7 5 bmi hyp chl\n 8 1 bmi hyp chl\n 8 2 bmi hyp chl\n 8 3 bmi hyp chl\n 8 4 bmi hyp chl\n 8 5 bmi hyp chl\n 9 1 bmi hyp chl\n 9 2 bmi hyp chl\n 9 3 bmi hyp chl\n 9 4 bmi hyp chl\n 9 5 bmi hyp chl\n 10 1 bmi hyp chl\n 10 2 bmi hyp chl\n 10 3 bmi hyp chl\n 10 4 bmi hyp chl\n 10 5 bmi hyp chl\n\nimp_pmm\n\nClass: mids\nNumber of multiple imputations: 5 \nImputation methods:\n age bmi hyp chl \n \"\" \"pmm\" \"pmm\" \"pmm\" \nPredictorMatrix:\n age bmi hyp chl\nage 0 1 1 1\nbmi 1 0 1 1\nhyp 1 1 0 1\nchl 1 1 1 0\n\n# imputations for bmi\nimp_pmm$imp$bmi\n\n 1 2 3 4 5\n1 20.4 22.7 27.2 26.3 29.6\n3 29.6 28.7 35.3 26.3 22.0\n4 27.4 22.5 20.4 27.4 28.7\n6 20.4 22.5 20.4 27.4 26.3\n10 22.0 22.0 22.7 30.1 35.3\n11 33.2 26.3 22.0 27.2 28.7\n12 24.9 22.5 25.5 27.4 33.2\n16 22.7 22.0 29.6 30.1 27.2\n21 20.4 29.6 35.3 28.7 29.6\n\n\nAn alternative to the standard PMM is midastouch.\n\nimp_pmms <- mice(nhanes, method = 'midastouch', m=5, maxit=10)\n\n\n iter imp variable\n 1 1 bmi hyp chl\n 1 2 bmi hyp chl\n 1 3 bmi hyp chl\n 1 4 bmi hyp chl\n 1 5 bmi hyp chl\n 2 1 bmi hyp chl\n 2 2 bmi hyp chl\n 2 3 bmi hyp chl\n 2 4 bmi hyp chl\n 2 5 bmi hyp chl\n 3 1 bmi hyp chl\n 
3 2 bmi hyp chl\n 3 3 bmi hyp chl\n 3 4 bmi hyp chl\n 3 5 bmi hyp chl\n 4 1 bmi hyp chl\n 4 2 bmi hyp chl\n 4 3 bmi hyp chl\n 4 4 bmi hyp chl\n 4 5 bmi hyp chl\n 5 1 bmi hyp chl\n 5 2 bmi hyp chl\n 5 3 bmi hyp chl\n 5 4 bmi hyp chl\n 5 5 bmi hyp chl\n 6 1 bmi hyp chl\n 6 2 bmi hyp chl\n 6 3 bmi hyp chl\n 6 4 bmi hyp chl\n 6 5 bmi hyp chl\n 7 1 bmi hyp chl\n 7 2 bmi hyp chl\n 7 3 bmi hyp chl\n 7 4 bmi hyp chl\n 7 5 bmi hyp chl\n 8 1 bmi hyp chl\n 8 2 bmi hyp chl\n 8 3 bmi hyp chl\n 8 4 bmi hyp chl\n 8 5 bmi hyp chl\n 9 1 bmi hyp chl\n 9 2 bmi hyp chl\n 9 3 bmi hyp chl\n 9 4 bmi hyp chl\n 9 5 bmi hyp chl\n 10 1 bmi hyp chl\n 10 2 bmi hyp chl\n 10 3 bmi hyp chl\n 10 4 bmi hyp chl\n 10 5 bmi hyp chl\n\nimp_pmm\n\nClass: mids\nNumber of multiple imputations: 5 \nImputation methods:\n age bmi hyp chl \n \"\" \"pmm\" \"pmm\" \"pmm\" \nPredictorMatrix:\n age bmi hyp chl\nage 0 1 1 1\nbmi 1 0 1 1\nhyp 1 1 0 1\nchl 1 1 1 0\n\nimp_pmms$imp$bmi\n\n 1 2 3 4 5\n1 35.3 27.4 29.6 27.5 30.1\n3 35.3 30.1 30.1 22.5 30.1\n4 24.9 25.5 25.5 21.7 25.5\n6 21.7 25.5 25.5 24.9 25.5\n10 22.7 25.5 29.6 28.7 26.3\n11 35.3 27.4 29.6 22.5 30.1\n12 22.7 20.4 33.2 28.7 35.3\n16 35.3 27.4 30.1 22.5 30.1\n21 35.3 27.4 29.6 27.5 30.1"
},
{
- "objectID": "R/anova.html",
- "href": "R/anova.html",
- "title": "ANOVA",
+ "objectID": "R/association.html",
+ "href": "R/association.html",
+ "title": "Association Analysis for Count Data Using R",
"section": "",
- "text": "Getting Started\nTo demonstrate the various types of sums of squares, we’ll create a data frame called df_disease taken from the SAS documentation. The corresponding data can be found here.\n\n\nThe Model\nFor this example, we’re testing for a significant difference in stem_length using ANOVA. In R, we’re using lm() to run the ANOVA, and then using broom::glance() and broom::tidy() to view the results in a table format.\n\nlm_model <- lm(y ~ drug + disease + drug*disease, df_disease)\n\nThe glance function gives us a summary of the model diagnostic values.\n\nlm_model %>% \n glance() %>% \n pivot_longer(everything())\n\n# A tibble: 12 × 2\n name value\n <chr> <dbl>\n 1 r.squared 0.456 \n 2 adj.r.squared 0.326 \n 3 sigma 10.5 \n 4 statistic 3.51 \n 5 p.value 0.00130\n 6 df 11 \n 7 logLik -212. \n 8 AIC 450. \n 9 BIC 477. \n10 deviance 5081. \n11 df.residual 46 \n12 nobs 58 \n\n\nThe tidy function gives a summary of the model results.\n\nlm_model %>% tidy()\n\n# A tibble: 12 × 5\n term estimate std.error statistic p.value\n <chr> <dbl> <dbl> <dbl> <dbl>\n 1 (Intercept) 29.3 4.29 6.84 0.0000000160\n 2 drug2 -1.33 6.36 -0.210 0.835 \n 3 drug3 -13 7.43 -1.75 0.0869 \n 4 drug4 -15.7 6.36 -2.47 0.0172 \n 5 disease2 -1.08 6.78 -0.160 0.874 \n 6 disease3 -8.93 6.36 -1.40 0.167 \n 7 drug2:disease2 6.58 9.78 0.673 0.504 \n 8 drug3:disease2 -10.9 10.2 -1.06 0.295 \n 9 drug4:disease2 0.317 9.30 0.0340 0.973 \n10 drug2:disease3 -0.900 9.00 -0.100 0.921 \n11 drug3:disease3 1.10 10.2 0.107 0.915 \n12 drug4:disease3 9.53 9.20 1.04 0.306 \n\n\n\n\nThe Results\nYou’ll see that R print the individual results for each level of the drug and disease interaction. 
We can get the combined F table in R using the anova() function on the model object.\n\nlm_model %>% \n anova() %>% \n tidy() %>% \n kable()\n\n\n\n\nterm\ndf\nsumsq\nmeansq\nstatistic\np.value\n\n\n\n\ndrug\n3\n3133.2385\n1044.4128\n9.455761\n0.0000558\n\n\ndisease\n2\n418.8337\n209.4169\n1.895990\n0.1617201\n\n\ndrug:disease\n6\n707.2663\n117.8777\n1.067225\n0.3958458\n\n\nResiduals\n46\n5080.8167\n110.4525\nNA\nNA\n\n\n\n\n\nWe can add a Total row, by using add_row and calculating the sum of the degrees of freedom and sum of squares.\n\nlm_model %>%\n anova() %>%\n tidy() %>%\n add_row(term = \"Total\", df = sum(.$df), sumsq = sum(.$sumsq)) %>% \n kable()\n\n\n\n\nterm\ndf\nsumsq\nmeansq\nstatistic\np.value\n\n\n\n\ndrug\n3\n3133.2385\n1044.4128\n9.455761\n0.0000558\n\n\ndisease\n2\n418.8337\n209.4169\n1.895990\n0.1617201\n\n\ndrug:disease\n6\n707.2663\n117.8777\n1.067225\n0.3958458\n\n\nResiduals\n46\n5080.8167\n110.4525\nNA\nNA\n\n\nTotal\n57\n9340.1552\nNA\nNA\nNA\n\n\n\n\n\n\n\nSums of Squares Tables\nUnfortunately, it is not easy to get the various types of sums of squares calculations in using functions from base R. However, the rstatix package offers a solution to produce these various sums of squares tables. For each type, you supply the original dataset and model to the. 
anova_test function, then specify the ttype and se detailed = TRUE.\n\nType I\n\ndf_disease %>% \n rstatix::anova_test(\n y ~ drug + disease + drug*disease, \n type = 1, \n detailed = TRUE) %>% \n rstatix::get_anova_table() %>% \n kable()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEffect\nDFn\nDFd\nSSn\nSSd\nF\np\np<.05\nges\n\n\n\n\ndrug\n3\n46\n3133.239\n5080.817\n9.456\n5.58e-05\n*\n0.381\n\n\ndisease\n2\n46\n418.834\n5080.817\n1.896\n1.62e-01\n\n0.076\n\n\ndrug:disease\n6\n46\n707.266\n5080.817\n1.067\n3.96e-01\n\n0.122\n\n\n\n\n\n\n\nType II\n\ndf_disease %>% \n rstatix::anova_test(\n y ~ drug + disease + drug*disease, \n type = 2, \n detailed = TRUE) %>% \n rstatix::get_anova_table() %>% \n kable()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEffect\nSSn\nSSd\nDFn\nDFd\nF\np\np<.05\nges\n\n\n\n\ndrug\n3063.433\n5080.817\n3\n46\n9.245\n6.75e-05\n*\n0.376\n\n\ndisease\n418.834\n5080.817\n2\n46\n1.896\n1.62e-01\n\n0.076\n\n\ndrug:disease\n707.266\n5080.817\n6\n46\n1.067\n3.96e-01\n\n0.122\n\n\n\n\n\n\n\nType III\n\ndf_disease %>% \n rstatix::anova_test(\n y ~ drug + disease + drug*disease, \n type = 3, \n detailed = TRUE) %>% \n rstatix::get_anova_table() %>% \n kable()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEffect\nSSn\nSSd\nDFn\nDFd\nF\np\np<.05\nges\n\n\n\n\n(Intercept)\n20037.613\n5080.817\n1\n46\n181.414\n0.00e+00\n*\n0.798\n\n\ndrug\n2997.472\n5080.817\n3\n46\n9.046\n8.09e-05\n*\n0.371\n\n\ndisease\n415.873\n5080.817\n2\n46\n1.883\n1.64e-01\n\n0.076\n\n\ndrug:disease\n707.266\n5080.817\n6\n46\n1.067\n3.96e-01\n\n0.122\n\n\n\n\n\n\n\nType IV\nIn R there is no equivalent operation to the Type IV sums of squares calculation in SAS."
+ "text": "The most commonly used association analysis methods for count data/contingency tables compare observed frequencies with those expected under the assumption of independence:\n\\[\nX^2 = \\sum_{i=1}^k \\frac{(x_i-e_i)^2}{e_i},\n\\] where \\(k\\) is the number of contingency table cells.\nOther measures for the correlation of two continuous variables are:"
},
{
- "objectID": "R/wilcoxonsr_hodges_lehman.html",
- "href": "R/wilcoxonsr_hodges_lehman.html",
- "title": "Wilcoxon signed-rank test",
- "section": "",
- "text": "Introduction\nWilcoxon signed-rank test is a non-parametric test which is sometimes used instead of the paired Student’s t-test when assumptions regarding a normal distribution are not valid. It is a rank test, designed for analyzing repeated measures or paired observations by a paired comparison (a type of location test) to assess whether their population means differ. Whilst it does not ‘compare’ means or medians for a set of paired data, it ranks the results on A and ranks the results on B, then compares if Prob(A>B) > Prob(B>A).\nTies are when you have two observations with the same result. For example, in a 2-period cross-over study, you take the difference between result on Treatment A minus result on Treatment B and find that two or more subjects have the same difference.\nAdditionally, “0s” can cause some trouble as well. For example when the difference between result on Treatment A minus result on Treatment B equals 0.\n\n\nData\nAnalysis will be conducted on the example of anonymized data from 2-period, cross-over study comparing treatments A and B in patients with asthma and acute airway obstruction induced by repeated mannitol challenges.\nWilcoxon signed rank test was applied to analyse the time to return to baseline FEV1 post-mannitol challenge 2. Median difference, p value and 95% CI were provided using the Hodges-Lehmann estimate.\n\nhead(blood_p)\n\n patient sex agegrp bp_before bp_after\n1 1 Male 30-45 143.670 153.316\n2 2 Male 30-45 163.082 170.576\n3 3 Male 30-45 153.393 168.599\n4 4 Male 30-45 153.082 142.358\n5 5 Male 30-45 146.720 141.193\n6 6 Male 30-45 150.668 147.204\n\n\n\n\nDataset without ties\nLet’s consider a case where the dataset has no ties.\n\n\nAvailable packages\nIn R Wilcoxon signed rank test can be performed using for example DOS (version 0.5.2) or stats (version 3.6.2) package.\n\nstats\nFunction wilcox.test used for Wilcoxon Rank Sum and Signed Rank Tests will be applied. 
For more information about that function go here\nWe will focus on the below arguments: - alternative - paired - exact - correct - conf.int.\n\n\n\nExamples\n\n# Exact \nstats::wilcox.test(x = blood_p$bp_after, y = blood_p$bp_before, \n paired = TRUE, \n conf.int = TRUE, \n conf.level = 0.9, \n alterative = \"two.sided\", \n exact = TRUE)\n\n\n Wilcoxon signed rank exact test\n\ndata: blood_p$bp_after and blood_p$bp_before\nV = 17251, p-value = 0.009379\nalternative hypothesis: true location shift is not equal to 0\n90 percent confidence interval:\n 1.5045 5.9945\nsample estimates:\n(pseudo)median \n 3.68875 \n\n# No exact & continuity correction\nstats::wilcox.test(x = blood_p$bp_after, y = blood_p$bp_before, \n paired = TRUE, \n conf.int = TRUE, \n conf.level = 0.9, \n alterative = \"two.sided\", \n exact = FALSE, \n correct = TRUE)\n\n\n Wilcoxon signed rank test with continuity correction\n\ndata: blood_p$bp_after and blood_p$bp_before\nV = 17251, p-value = 0.009548\nalternative hypothesis: true location shift is not equal to 0\n90 percent confidence interval:\n 1.504565 5.994467\nsample estimates:\n(pseudo)median \n 3.688796 \n\n# No exact & No continuity correction\nstats::wilcox.test(x = blood_p$bp_after, y = blood_p$bp_before, \n paired = TRUE, \n conf.int = TRUE, \n conf.level = 0.9, \n alterative = \"two.sided\" , \n exact = FALSE, \n correct = FALSE)\n\n\n Wilcoxon signed rank test\n\ndata: blood_p$bp_after and blood_p$bp_before\nV = 17251, p-value = 0.009535\nalternative hypothesis: true location shift is not equal to 0\n90 percent confidence interval:\n 1.504991 5.993011\nsample estimates:\n(pseudo)median \n 3.688796 \n\n\n\n\nImportant notes on stats:wilcox.test\n\nBy default an exact p-value is computed if the samples size is less than 50 and there are no ties. 
Otherwise, a normal approximation is used.\nIf exact p-values are available, an exact confidence interval is obtained by the algorithm described in Bauer (1972), and the Hodges-Lehmann estimator is employed. Otherwise, the returned confidence interval and point estimate are based on normal approximations.\nIf non-exact p-value is calculated, continuity correction in the normal approximation for the p-value can be applied with correct argument.\nStatistic V is provided, which is a test statistic based on Sprent (1993) algorithm\n\n\nDOS2\nFunction senWilcox used for Sensitivity Analysis for Wilcoxon’s Signed-rank Statistic will be applied. For more information about that function go here\n\n\n\nExamples\n\nDOS2::senWilcox(blood_p$bp_after - blood_p$bp_before, \n gamma = 1, \n conf.int = TRUE, \n alpha = 0.1, \n alternative = \"twosided\")\n\n$pval\n[1] 0.009534732\n\n$estimate\n low high \n3.688796 3.688796 \n\n$ci\n low high \n1.50494 5.99305 \n\n\n\n\nImportant notes on DOS2:senWilcox\n\nGamma >= 1 is the value of the sensitivity parameter. If gamma=1, then you are assuming ignorable treatment assignment or equivalently no unmeasured confounding - that is the considered scenario in our example, sensitivity analysis is not performed.\nOnly p value, estimate and CI are provided\n\n\n\nCoin package - coming soon!"
+ "objectID": "R/association.html#chi-squared-test",
+ "href": "R/association.html#chi-squared-test",
+ "title": "Association Analysis for Count Data Using R",
+ "section": "Chi-Squared test",
+ "text": "Chi-Squared test\n\nchisq.test(tab)\n\n\n Pearson's Chi-squared test with Yates' continuity correction\n\ndata: tab\nX-squared = 1.8261, df = 1, p-value = 0.1766"
},
{
- "objectID": "R/rounding.html",
- "href": "R/rounding.html",
- "title": "Rounding in R",
- "section": "",
- "text": "The round() function in Base R will round to the nearest whole number and ‘rounding to the even number’ when equidistant, meaning that exactly 12.5 rounds to the integer 12.\nNote that the janitor package in R contains a function round_half_up() that rounds away from zero. in this case it rounds to the nearest whole number and ‘away from zero’ or ‘rounding up’ when equidistant, meaning that exactly 12.5 rounds to the integer 13.\n\n#Example code\nmy_number <-c(2.2,3.99,1.2345,7.876,13.8739)\n\nr_0_dec <- round(my_number, digits=0);\nr_1_dec <- round(my_number, digits=1);\nr_2_dec <- round(my_number, digits=2);\nr_3_dec <- round(my_number, digits=3);\n\nr_0_dec\nr_1_dec\nr_2_dec\nr_3_dec\n\n> r_0_dec\n[1] 2 4 1 8 14\n> r_1_dec\n[1] 2.2 4.0 1.2 7.9 13.9\n> r_2_dec\n[1] 2.20 3.99 1.23 7.88 13.87\n> r_3_dec\n[1] 2.200 3.990 1.234 7.876 13.874\n\nIf using the janitor package in R, and the function round_half_up(), the results would be the same with the exception of rounding 1.2345 to 3 decimal places where a result of 1.235 would be obtained instead of 1.234."
+ "objectID": "R/association.html#fisher-exact-test",
+ "href": "R/association.html#fisher-exact-test",
+ "title": "Association Analysis for Count Data Using R",
+ "section": "Fisher Exact Test",
+ "text": "Fisher Exact Test\nFor \\(2 \\times 2\\) contingency tables, p-values are obtained directly using the hypergeometric distribution.\n\nfisher.test(tab)\n\n\n Fisher's Exact Test for Count Data\n\ndata: tab\np-value = 0.135\nalternative hypothesis: true odds ratio is not equal to 1\n95 percent confidence interval:\n 0.8158882 3.2251299\nsample estimates:\nodds ratio \n 1.630576"
},
{
- "objectID": "R/count_data_regression.html",
- "href": "R/count_data_regression.html",
- "title": "Regression for Count Data",
- "section": "",
- "text": "The most commonly used models for count data in clinical trials include:\n\nPoisson regression: assumes the response variable \\(Y\\) has a Poisson distribution, which is linked using the logarithm with explanatory variables \\(\\bf{x}\\).\n\n\\[\n\\text{log}(E(Y|x))= \\beta_0 + \\beta' x, \\; i = 1,\\ldots,n\n\\]\n\nQuasi-Poisson regression: Poisson model that allows overdispersion, i.e. dispersion parameter is not fixed at one.\nNegative-Binomial regression: popular generalization which loosens the assumption that the variance is equal to the mean made by the Poisson model.\n\nOther models include hurdle or zero-inflated models, if data have more zero observations than expected.\n\nExample: Familial Andenomatous Polyposis Data\nData source: F. M. Giardiello, S. R. Hamilton, A. J. Krush, S. Piantadosi, L. M. Hylind, P. Celano, S. V. Booker, C. R. Robinson and G. J. A. Offerhaus (1993), Treatment of colonic and rectal adenomas with sulindac in familial adenomatous polyposis. New England Journal of Medicine, 328(18), 1313–1316.\nData from a placebo-controlled trial of a non-steroidal anti-inflammatory drug in the treatment of familial andenomatous polyposis (FAP). 
(see ?polyps for details).\n\npolyps <- HSAUR2::polyps\nglimpse(polyps)\n\nRows: 20\nColumns: 3\n$ number <dbl> 63, 2, 28, 17, 61, 1, 7, 15, 44, 25, 3, 28, 10, 40, 33, 46, 50,…\n$ treat <fct> placebo, drug, placebo, drug, placebo, drug, placebo, placebo, …\n$ age <dbl> 20, 16, 18, 22, 13, 23, 34, 50, 19, 17, 23, 22, 30, 27, 23, 22,…\n\n\nWe analyze the number of colonic polyps at 12 months in dependency of treatment and age of the patient.\n\npolyps %>% \n ggplot(aes(y = number, x = age, color = treat)) + \n geom_point() + theme_minimal()\n\n\n\n\n\n\n\n\n\n\nModel Fit\nWe fit a generalized linear model for number using the Poisson distribution with default log link.\n\n# Poisson\nm1 <- glm(number ~ treat + age, data = polyps, family = poisson)\nsummary(m1)\n\n\nCall:\nglm(formula = number ~ treat + age, family = poisson, data = polyps)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.529024 0.146872 30.84 < 2e-16 ***\ntreatdrug -1.359083 0.117643 -11.55 < 2e-16 ***\nage -0.038830 0.005955 -6.52 7.02e-11 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for poisson family taken to be 1)\n\n Null deviance: 378.66 on 19 degrees of freedom\nResidual deviance: 179.54 on 17 degrees of freedom\nAIC: 273.88\n\nNumber of Fisher Scoring iterations: 5\n\n\nThe parameter estimates are on log-scale. 
For better interpretation, we can exponentiate these estimates, to obtain estimates and provide \\(95\\)% confidence intervals:\n\n# OR and CI\nexp(coef(m1))\n\n(Intercept) treatdrug age \n 92.6681047 0.2568961 0.9619140 \n\nexp(confint(m1))\n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 69.5361752 123.6802476\ntreatdrug 0.2028078 0.3218208\nage 0.9505226 0.9729788\n\n\nPredictions for number of colonic polyps given a new 25-year-old patient on either treatment using predict():\n\n# new 25 year old patient\nnew_pt <- data.frame(treat = c(\"drug\",\"placebo\"), age=25)\npredict(m1, new_pt, type = \"response\")\n\n 1 2 \n 9.017654 35.102332 \n\n\n\n\nModelling Overdispersion\nPoisson model assumes that mean and variance are equal, which can be a very restrictive assumption. One option to relax the assumption is adding a overdispersion constant to the relationship, i.e. \\(\\text{Var}(\\text{response}) = \\phi\\cdot \\mu\\), which results in a quasipoisson model:\n\n# Quasi poisson\nm2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson)\nsummary(m2)\n\n\nCall:\nglm(formula = number ~ treat + age, family = quasipoisson, data = polyps)\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 4.52902 0.48106 9.415 3.72e-08 ***\ntreatdrug -1.35908 0.38533 -3.527 0.00259 ** \nage -0.03883 0.01951 -1.991 0.06284 . \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for quasipoisson family taken to be 10.72805)\n\n Null deviance: 378.66 on 19 degrees of freedom\nResidual deviance: 179.54 on 17 degrees of freedom\nAIC: NA\n\nNumber of Fisher Scoring iterations: 5\n\n\nAlternatively, we can explicitly model the count data with overdispersion using the negative Binomial model. 
In this case, the overdispersion is a function of both \\(\\mu\\) and \\(\\mu^2\\):\n\\[\n\\text{Var}(\\text{response}) = \\mu + \\kappa\\,\\mu^2.\n\\]\n\n# Negative Binomial\nm3 <- MASS::glm.nb(number ~ treat + age, data = polyps)\nsummary(m3)\n\n\nCall:\nMASS::glm.nb(formula = number ~ treat + age, data = polyps, init.theta = 1.719491, \n link = log)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 4.52603 0.59466 7.611 2.72e-14 ***\ntreatdrug -1.36812 0.36903 -3.707 0.000209 ***\nage -0.03856 0.02095 -1.840 0.065751 . \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for Negative Binomial(1.7195) family taken to be 1)\n\n Null deviance: 36.734 on 19 degrees of freedom\nResidual deviance: 22.002 on 17 degrees of freedom\nAIC: 164.88\n\nNumber of Fisher Scoring iterations: 1\n\n Theta: 1.719 \n Std. Err.: 0.607 \n\n 2 x log-likelihood: -156.880 \n\n\nBoth model result very similar parameter estimates, but vary in estimates for their respective standard deviation."
+ "objectID": "R/association.html#chi-squared-test-1",
+ "href": "R/association.html#chi-squared-test-1",
+ "title": "Association Analysis for Count Data Using R",
+ "section": "Chi-Squared Test",
+ "text": "Chi-Squared Test\n\nchisq.test(tab2)\n\nWarning in chisq.test(tab2): Chi-squared approximation may be incorrect\n\n\n\n Pearson's Chi-squared test\n\ndata: tab2\nX-squared = 260.76, df = 15, p-value < 2.2e-16\n\n\nThe warning means that the smallest expected frequency is lower than 5, so the chi-squared approximation may be unreliable. Fisher’s exact test is recommended in this case."
},
{
- "objectID": "R/survey-stats-summary.html",
- "href": "R/survey-stats-summary.html",
- "title": "Survey Summary Statistics using R",
- "section": "",
- "text": "When conducting large-scale trials on samples of the population, it can be necessary to use a more complex sampling design than a simple random sample.\nAll of these designs need to be taken into account when calculating statistics, and when producing models. Only summary statistics are discussed in this document, and variances are calculated using the default Taylor series linearisation methods. For a more detailed introduction to survey statistics in R, see (Lohr 2022) or (Lumley 2004)."
+ "objectID": "R/association.html#fisher-exact-test-1",
+ "href": "R/association.html#fisher-exact-test-1",
+ "title": "Association Analysis for Count Data Using R",
+ "section": "Fisher Exact Test",
+ "text": "Fisher Exact Test\nFor contingency tables larger than \\(2 \\times 2\\), exact computation can require a lot of time and memory, so the p-value here is obtained by Monte Carlo simulation via simulate.p.value = TRUE (see ?fisher.test for details).\n\nfisher.test(tab2, simulate.p.value=TRUE)\n\n\n Fisher's Exact Test for Count Data with simulated p-value (based on\n 2000 replicates)\n\ndata: tab2\np-value = 0.0004998\nalternative hypothesis: two.sided"
},
{
- "objectID": "R/survey-stats-summary.html#mean",
- "href": "R/survey-stats-summary.html#mean",
- "title": "Survey Summary Statistics using R",
- "section": "Mean",
- "text": "Mean\nIf we want to calculate a mean of a variable in a dataset which has been obtained from a simple random sample such as apisrs, in R we can create a design object using the survey::svydesign function (specifying that there is no PSU using id = ~1 and the finite population correction using fpc=~fpc).\n\nsrs_design <- svydesign(id = ~1, fpc = ~fpc, data = apisrs)\n\nThis design object stores all metadata about the sample alongside the data, and is used by all subsequent functions in the {survey} package. To calculate the mean, standard error, and confidence intervals of the growth variable, we can use the survey::svymean and confint functions:\n\n# Calculate mean and SE of growth. The standard error will be corrected by the finite population correction specified in the design\nsrs_means <- svymean(~growth, srs_design)\n\nsrs_means\n\n mean SE\ngrowth 31.9 2.0905\n\n# Use degf() to get the degrees of freedom\nconfint(srs_means, df=degf(srs_design))\n\n 2.5 % 97.5 %\ngrowth 27.77764 36.02236\n\n\nNote that to obtain correct results, we had to specify the degrees of freedom using the design object."
+ "objectID": "R/ci_for_prop.html",
+ "href": "R/ci_for_prop.html",
+ "title": "Confidence Intervals for Proportions",
+ "section": "",
+ "text": "A confidence interval for a binomial proportion is an interval estimate for the probability of success, calculated from the outcome of a series of Bernoulli trials.\nThere are several ways to calculate a binomial confidence interval. The normal approximation is one of the most commonly used methods."
},
{
- "objectID": "R/survey-stats-summary.html#total",
- "href": "R/survey-stats-summary.html#total",
- "title": "Survey Summary Statistics using R",
- "section": "Total",
- "text": "Total\nCalculating population totals can be done using the survey::svytotal function in R.\n\nsvytotal(~growth, srs_design)\n\n total SE\ngrowth 197589 12949"
+ "objectID": "R/ci_for_prop.html#normal-approximation",
+ "href": "R/ci_for_prop.html#normal-approximation",
+ "title": "Confidence Intervals for Proportions",
+ "section": "Normal approximation",
+ "text": "Normal approximation\nIn large random samples from independent trials, the sampling distribution of proportions approximately follows the normal distribution. The expectation of a sample proportion is the corresponding population proportion. Therefore, based on a sample of size \\(n\\), a \\(100(1-\\alpha)\\%\\) confidence interval for the population proportion can be calculated using the normal approximation as follows:\n\\(p\\approx \\hat p \\pm z_\\alpha \\sqrt{\\hat p(1-\\hat p)/n}\\), where \\(\\hat p\\) is the sample proportion, \\(z_\\alpha\\) is the \\(1-\\alpha/2\\) quantile of a standard normal distribution corresponding to level \\(\\alpha\\), and \\(\\sqrt{\\hat p(1-\\hat p)/n}\\) is the standard error."
},
{
- "objectID": "R/survey-stats-summary.html#ratios",
- "href": "R/survey-stats-summary.html#ratios",
- "title": "Survey Summary Statistics using R",
- "section": "Ratios",
- "text": "Ratios\nTo perform ratio analysis for means or proportions of analysis variables in R, we can survey::svyratio, here requesting that we do not separate the ratio estimation per Strata as this design is not stratified.\n\nsvy_ratio <- svyratio(\n ~api00,\n ~api99,\n srs_design,\n se=TRUE,\n df=degf(srs_design),\n separate=FALSE\n)\n\nsvy_ratio\n\nRatio estimator: svyratio.survey.design2(~api00, ~api99, srs_design, se = TRUE, \n df = degf(srs_design), separate = FALSE)\nRatios=\n api99\napi00 1.051066\nSEs=\n api99\napi00 0.003603991\n\nconfint(svy_ratio, df=degf(srs_design))\n\n 2.5 % 97.5 %\napi00/api99 1.043959 1.058173"
+ "objectID": "R/ci_for_prop.html#example-code",
+ "href": "R/ci_for_prop.html#example-code",
+ "title": "Confidence Intervals for Proportions",
+ "section": "Example code",
+ "text": "Example code\nThe following code calculates a confidence interval for a binomial proportion using the normal approximation.\n\nset.seed(666)\n# generate a random sample of size 100 from independent Bernoulli trials\nn = 100\nmysamp = sample(c(0,1),n,replace = TRUE)\n# sample proportion\np_hat = mean(mysamp)\n# standard error\nse = sqrt(p_hat*(1-p_hat)/n)\n# 95% CI of population proportion\nc(p_hat-qnorm(1-0.05/2)*se, p_hat+qnorm(1-0.05/2)*se)\n\n[1] 0.4936024 0.6863976"
},
{
- "objectID": "R/survey-stats-summary.html#proportions",
- "href": "R/survey-stats-summary.html#proportions",
- "title": "Survey Summary Statistics using R",
- "section": "Proportions",
- "text": "Proportions\nTo calculate a proportion in R, we use the svymean function on a factor or character column:\n\nprops <- svymean(~sch.wide, srs_design)\n\nprops\n\n mean SE\nsch.wideNo 0.185 0.0271\nsch.wideYes 0.815 0.0271\n\nconfint(props, df=degf(srs_design))\n\n 2.5 % 97.5 %\nsch.wideNo 0.1316041 0.2383959\nsch.wideYes 0.7616041 0.8683959\n\n\nFor proportions close to 0, it can be that survey::svyciprop is more accurate at producing confidence intervals than confint."
+ "objectID": "R/logistic_regr.html",
+ "href": "R/logistic_regr.html",
+ "title": "Logistic Regression",
+ "section": "",
+ "text": "Logistic regression models the dependence of a binary response variable on explanatory variables. The logit of the expected response is modelled as a linear form of the explanatory variables. If we observe \\((y_i, x_i),\\) where \\(y_i\\) is a Bernoulli variable and \\(x_i\\) a vector of explanatory variables, the model for \\(\\pi_i = P(y_i=1)\\) is\n\\[\n\\text{logit}(\\pi_i)= \\log\\left\\{ \\frac{\\pi_i}{1-\\pi_i}\\right\\} = \\beta_0 + \\beta x_i, i = 1,\\ldots,n\n\\]\nThe model is especially useful in case-control studies and expresses the effect of risk factors as odds ratios.\n\nExample: Lung Cancer Data\nData source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.\nSurvival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities (see ?lung for details).\n\nlibrary(survival) \nglimpse(lung)\n\nRows: 228\nColumns: 10\n$ inst <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22, 16…\n$ time <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654…\n$ status <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …\n$ age <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 57, …\n$ sex <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, …\n$ ph.ecog <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2, 1,…\n$ ph.karno <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80,…\n$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, 70,…\n$ meal.cal <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, NA,…\n$ wt.loss <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 15, …\n\n\n\n\nModel Fit\nWe analyze the weight loss in lung cancer 
patients as a function of age, sex, ECOG performance score and calories consumed at meals.\n\nlung2 <- survival::lung %>% \n mutate(\n wt_grp = factor(wt.loss > 0, labels = c(\"no weight loss\", \"weight loss\"))\n ) \n\n\nm1 <- glm(wt_grp ~ age + sex + ph.ecog + meal.cal, data = lung2, family = binomial(link=\"logit\"))\nsummary(m1)\n\n\nCall:\nglm(formula = wt_grp ~ age + sex + ph.ecog + meal.cal, family = binomial(link = \"logit\"), \n data = lung2)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 3.2631673 1.6488207 1.979 0.0478 *\nage -0.0101717 0.0208107 -0.489 0.6250 \nsex -0.8717357 0.3714042 -2.347 0.0189 *\nph.ecog 0.4179665 0.2588653 1.615 0.1064 \nmeal.cal -0.0008869 0.0004467 -1.985 0.0471 *\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 202.36 on 169 degrees of freedom\nResidual deviance: 191.50 on 165 degrees of freedom\n (58 observations deleted due to missingness)\nAIC: 201.5\n\nNumber of Fisher Scoring iterations: 4\n\n\nThe model summary contains the parameter estimates \\(\\beta_j\\) for each explanatory variable \\(x_j\\), corresponding to the change in the log-odds of the response variable taking the value \\(1\\) per unit increase in \\(x_j\\), conditional on all other explanatory variables remaining constant. 
For easier interpretation, we can exponentiate these estimates to obtain odds ratios, together with \\(95\\)% confidence intervals:\n\nexp(coef(m1))\n\n(Intercept) age sex ph.ecog meal.cal \n 26.1321742 0.9898798 0.4182250 1.5188698 0.9991135 \n\nexp(confint(m1))\n\nWaiting for profiling to be done...\n\n\n 2.5 % 97.5 %\n(Intercept) 1.0964330 730.3978786\nage 0.9495388 1.0307216\nsex 0.1996925 0.8617165\nph.ecog 0.9194053 2.5491933\nmeal.cal 0.9982107 0.9999837\n\n\n\n\nModel Comparison\nTo compare two nested logistic models, one tests the difference in residual deviance between the models against a \\(\\chi^2\\)-distribution with degrees of freedom equal to the difference in the number of parameters (here one), at the \\(5\\)% level:\n\nm2 <- glm(wt_grp ~ sex + ph.ecog + meal.cal, data = lung2, family = binomial(link=\"logit\"))\nsummary(m2)\n\n\nCall:\nglm(formula = wt_grp ~ sex + ph.ecog + meal.cal, family = binomial(link = \"logit\"), \n data = lung2)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) 2.5606595 0.7976887 3.210 0.00133 **\nsex -0.8359241 0.3637378 -2.298 0.02155 * \nph.ecog 0.3794295 0.2469030 1.537 0.12435 \nmeal.cal -0.0008334 0.0004346 -1.918 0.05517 . \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 202.36 on 169 degrees of freedom\nResidual deviance: 191.74 on 166 degrees of freedom\n (58 observations deleted due to missingness)\nAIC: 199.74\n\nNumber of Fisher Scoring iterations: 4\n\nanova(m1, m2, test = \"Chisq\")\n\nAnalysis of Deviance Table\n\nModel 1: wt_grp ~ age + sex + ph.ecog + meal.cal\nModel 2: wt_grp ~ sex + ph.ecog + meal.cal\n Resid. Df Resid. 
Dev Df Deviance Pr(>Chi)\n1 165 191.50 \n2 166 191.75 -1 -0.24046 0.6239\n\n\n\n\nPrediction\nThe predicted probability that a new patient experiences a weight loss is obtained using predict() with type = \"response\" (the default, type = \"link\", returns the log-odds instead):\n\n# new female, symptomatic but completely ambulatory patient consuming 2500 calories\nnew_pt <- data.frame(sex=2, ph.ecog=1, meal.cal=2500)\npredict(m2, new_pt, type = \"response\")\n\n 1 \n0.306767"
},
{
- "objectID": "R/survey-stats-summary.html#quantiles",
- "href": "R/survey-stats-summary.html#quantiles",
- "title": "Survey Summary Statistics using R",
- "section": "Quantiles",
- "text": "Quantiles\nTo calculate quantiles in R, we can use the survey::svyquantile function. Note that this function was reworked in version 4.1 of {survey}, and prior to this had different arguments and results. The current version of svyquantile has an qrule which is similar to the type argument in quantile, and can be used to change how the quantiles are calculated. For more information, see vignette(\"qrule\", package=\"survey\").\n\nsvyquantile(\n ~growth,\n srs_design,\n quantiles = c(0.025, 0.5, 0.975),\n ci=TRUE,\n se=TRUE\n)\n\n$growth\n quantile ci.2.5 ci.97.5 se\n0.025 -16 -21 -12 2.281998\n0.5 27 24 31 1.774887\n0.975 99 84 189 26.623305\n\nattr(,\"hasci\")\n[1] TRUE\nattr(,\"class\")\n[1] \"newsvyquantile\""
+ "objectID": "R/ttest_1Sample.html",
+ "href": "R/ttest_1Sample.html",
+ "title": "One Sample t-test",
+ "section": "",
+ "text": "The One Sample t-test is used to compare a single sample against an expected hypothesis value. In the One Sample t-test, the mean of the sample is compared against the hypothesis value. In R, a One Sample t-test can be performed using the Base R t.test() from the stats package or the proc_ttest() function from the procs package.\n\n\nThe following data was used in this example.\n\n# Create sample data\nread <- tibble::tribble(\n ~score, ~count,\n 40, 2, 47, 2, 52, 2, 26, 1, 19, 2,\n 25, 2, 35, 4, 39, 1, 26, 1, 48, 1,\n 14, 2, 22, 1, 42, 1, 34, 2 , 33, 2,\n 18, 1, 15, 1, 29, 1, 41, 2, 44, 1,\n 51, 1, 43, 1, 27, 2, 46, 2, 28, 1,\n 49, 1, 31, 1, 28, 1, 54, 1, 45, 1\n)\n\n\n\n\nBy default, the R one sample t-test functions assume normality in the data and use a classic Student’s t-test.\n\n\n\n\nThe following code was used to test the comparison in Base R. Note that the baseline null hypothesis goes in the “mu” parameter.\n\n # Perform t-test\n t.test(read$score, mu = 30)\n\n\n One Sample t-test\n\ndata: read$score\nt = 2.3643, df = 29, p-value = 0.02497\nalternative hypothesis: true mean is not equal to 30\n95 percent confidence interval:\n 30.67928 39.38739\nsample estimates:\nmean of x \n 35.03333 \n\n\n\n\n\n\n\n\nThe following code from the procs package was used to perform a one sample t-test. Note that the null hypothesis value goes in the “options” parameter.\n\n library(procs)\n\n # Perform t-test\n proc_ttest(read, var = score,\n options = c(\"h0\" = 30))\n\n$Statistics\n VAR N MEAN STD STDERR MIN MAX\n1 score 30 35.03333 11.66038 2.128884 14 54\n\n$ConfLimits\n VAR MEAN LCLM UCLM STD LCLMSTD UCLMSTD\n1 score 35.03333 30.67928 39.38739 11.66038 9.286404 15.67522\n\n$TTests\n VAR DF T PROBT\n1 score 29 2.364306 0.0249741\n\n\nViewer Output:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe Base R t.test() function does not have an option for lognormal data. 
Likewise, the procs proc_ttest() function also does not have an option for lognormal data.\nOne possibility may be the tTestLnormAltPower() function from the EnvStats package. This package has not been evaluated yet."
},
{
- "objectID": "R/ttest_2Sample.html",
- "href": "R/ttest_2Sample.html",
- "title": "Two Sample t-test",
+ "objectID": "R/ttest_1Sample.html#normal",
+ "href": "R/ttest_1Sample.html#normal",
+ "title": "One Sample t-test",
"section": "",
- "text": "The Two Sample t-test is used to compare two independent samples against each other. In the Two Sample t-test, the mean of the first sample is compared against the mean of the second sample. In R, a Two Sample t-test can be performed using the Base R t.test() function from the stats package or the proc_ttest() function from the procs package.\n\n\nThe following data was used in this example.\n\n# Create sample data\nd1 <- tibble::tribble(\n ~trt_grp, ~WtGain,\n \"placebo\", 94, \"placebo\", 12, \"placebo\", 26, \"placebo\", 89,\n \"placebo\", 88, \"placebo\", 96, \"placebo\", 85, \"placebo\", 130,\n \"placebo\", 75, \"placebo\", 54, \"placebo\", 112, \"placebo\", 69,\n \"placebo\", 104, \"placebo\", 95, \"placebo\", 53, \"placebo\", 21,\n \"treatment\", 45, \"treatment\", 62, \"treatment\", 96, \"treatment\", 128,\n \"treatment\", 120, \"treatment\", 99, \"treatment\", 28, \"treatment\", 50,\n \"treatment\", 109, \"treatment\", 115, \"treatment\", 39, \"treatment\", 96,\n \"treatment\", 87, \"treatment\", 100, \"treatment\", 76, \"treatment\", 80\n)\n\n\n\n\nIf we have normalized data, we can use the classic Student’s t-test. For a Two sample test where the variances are not equal, we should use the Welch’s t-test. Both of those options are available with the Base R t.test() function.\n\n\n\n\nThe following code was used to test the comparison in Base R. By default, the R two sample t-test function assumes the variances in the data are unequal, and uses a Welch’s t-test. Therefore, to use a classic Student’s t-test with normalized data, we must specify var.equal = TRUE. 
Also note that we must separate the single variable into two variables to satisfy the t.test() syntax and set paired = FALSE.\n\n d1p <- dplyr::filter(d1, trt_grp == 'placebo')\n d1t <- dplyr::filter(d1, trt_grp == 'treatment')\n\n # Perform t-test\n t.test(d1p$WtGain, d1t$WtGain, \n var.equal = TRUE, paired = FALSE)\n\n\n Two Sample t-test\n\ndata: d1p$WtGain and d1t$WtGain\nt = -0.6969, df = 30, p-value = 0.4912\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -31.19842 15.32342\nsample estimates:\nmean of x mean of y \n 75.1875 83.1250 \n\n\n\n\n\n\n\n\nThe following code was used to test the comparison in Base R using Welch’s t-test. Observe that in this case, the var.equal parameter is set to FALSE.\n\n d1p <- dplyr::filter(d1, trt_grp == 'placebo')\n d1t <- dplyr::filter(d1, trt_grp == 'treatment')\n\n # Perform t-test\n t.test(d1p$WtGain, d1t$WtGain, \n var.equal = FALSE, paired = FALSE)\n\n\n Welch Two Sample t-test\n\ndata: d1p$WtGain and d1t$WtGain\nt = -0.6969, df = 29.694, p-value = 0.4913\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -31.20849 15.33349\nsample estimates:\nmean of x mean of y \n 75.1875 83.1250 \n\n\n\n\n\n\n\n\n\n\n\nThe following code from the procs package was used to perform a two sample t-test. Note that the proc_ttest() function performs both the Student’s t-test and Welch’s (Satterthwaite) t-test in the same call. The results are displayed on separate rows. 
This output is similar to SAS.\n\n library(procs)\n\n # Perform t-test\n proc_ttest(d1, var = WtGain,\n class = trt_grp)\n\n$Statistics\n VAR CLASS METHOD N MEAN STD STDERR MIN MAX\n1 WtGain placebo <NA> 16 75.1875 33.81167 8.452918 12 130\n2 WtGain treatment <NA> 16 83.1250 30.53495 7.633738 28 128\n3 WtGain Diff (1-2) Pooled NA -7.9375 NA 11.389723 NA NA\n4 WtGain Diff (1-2) Satterthwaite NA -7.9375 NA 11.389723 NA NA\n\n$ConfLimits\n VAR CLASS METHOD MEAN LCLM UCLM STD LCLMSTD\n1 WtGain placebo <NA> 75.1875 57.17053 93.20447 33.81167 24.97685\n2 WtGain treatment <NA> 83.1250 66.85407 99.39593 30.53495 22.55632\n3 WtGain Diff (1-2) Pooled -7.9375 -31.19842 15.32342 NA NA\n4 WtGain Diff (1-2) Satterthwaite -7.9375 -31.20849 15.33349 NA NA\n UCLMSTD\n1 52.33003\n2 47.25868\n3 NA\n4 NA\n\n$TTests\n VAR METHOD VARIANCES DF T PROBT\n1 WtGain Pooled Equal 30.00000 -0.6969002 0.4912306\n2 WtGain Satterthwaite Unequal 29.69359 -0.6969002 0.4912856\n\n$Equality\n VAR METHOD NDF DDF FVAL PROBF\n1 WtGain Folded F 15 15 1.226136 0.6980614\n\n\nViewer Output:"
+ "text": "By default, the R one sample t-test functions assume normality in the data and use a classic Student’s t-test.\n\n\n\n\nThe following code was used to test the comparison in Base R. Note that the baseline null hypothesis goes in the “mu” parameter.\n\n # Perform t-test\n t.test(read$score, mu = 30)\n\n\n One Sample t-test\n\ndata: read$score\nt = 2.3643, df = 29, p-value = 0.02497\nalternative hypothesis: true mean is not equal to 30\n95 percent confidence interval:\n 30.67928 39.38739\nsample estimates:\nmean of x \n 35.03333 \n\n\n\n\n\n\n\n\nThe following code from the procs package was used to perform a one sample t-test. Note that the null hypothesis value goes in the “options” parameter.\n\n library(procs)\n\n # Perform t-test\n proc_ttest(read, var = score,\n options = c(\"h0\" = 30))\n\n$Statistics\n VAR N MEAN STD STDERR MIN MAX\n1 score 30 35.03333 11.66038 2.128884 14 54\n\n$ConfLimits\n VAR MEAN LCLM UCLM STD LCLMSTD UCLMSTD\n1 score 35.03333 30.67928 39.38739 11.66038 9.286404 15.67522\n\n$TTests\n VAR DF T PROBT\n1 score 29 2.364306 0.0249741\n\n\nViewer Output:"
},
{
- "objectID": "R/ttest_2Sample.html#base-r",
- "href": "R/ttest_2Sample.html#base-r",
- "title": "Two Sample t-test",
+ "objectID": "R/ttest_1Sample.html#lognormal",
+ "href": "R/ttest_1Sample.html#lognormal",
+ "title": "One Sample t-test",
"section": "",
- "text": "If we have normalized data, we can use the classic Student’s t-test. For a Two sample test where the variances are not equal, we should use the Welch’s t-test. Both of those options are available with the Base R t.test() function.\n\n\n\n\nThe following code was used to test the comparison in Base R. By default, the R two sample t-test function assumes the variances in the data are unequal, and uses a Welch’s t-test. Therefore, to use a classic Student’s t-test with normalized data, we must specify var.equal = TRUE. Also note that we must separate the single variable into two variables to satisfy the t.test() syntax and set paired = FALSE.\n\n d1p <- dplyr::filter(d1, trt_grp == 'placebo')\n d1t <- dplyr::filter(d1, trt_grp == 'treatment')\n\n # Perform t-test\n t.test(d1p$WtGain, d1t$WtGain, \n var.equal = TRUE, paired = FALSE)\n\n\n Two Sample t-test\n\ndata: d1p$WtGain and d1t$WtGain\nt = -0.6969, df = 30, p-value = 0.4912\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -31.19842 15.32342\nsample estimates:\nmean of x mean of y \n 75.1875 83.1250 \n\n\n\n\n\n\n\n\nThe following code was used to test the comparison in Base R using Welch’s t-test. Observe that in this case, the var.equal parameter is set to FALSE.\n\n d1p <- dplyr::filter(d1, trt_grp == 'placebo')\n d1t <- dplyr::filter(d1, trt_grp == 'treatment')\n\n # Perform t-test\n t.test(d1p$WtGain, d1t$WtGain, \n var.equal = FALSE, paired = FALSE)\n\n\n Welch Two Sample t-test\n\ndata: d1p$WtGain and d1t$WtGain\nt = -0.6969, df = 29.694, p-value = 0.4913\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -31.20849 15.33349\nsample estimates:\nmean of x mean of y \n 75.1875 83.1250"
+ "text": "The Base R t.test() function does not have an option for lognormal data. Likewise, the procs proc_ttest() function also does not have an option for lognormal data.\nOne possibility may be the tTestLnormAltPower() function from the EnvStats package. This package has not been evaluated yet."
},
{
- "objectID": "R/ttest_2Sample.html#procs-package",
- "href": "R/ttest_2Sample.html#procs-package",
- "title": "Two Sample t-test",
+ "objectID": "R/PCA_analysis.html",
+ "href": "R/PCA_analysis.html",
+ "title": "Principal Component Analysis",
"section": "",
- "text": "The following code from the procs package was used to perform a two sample t-test. Note that the proc_ttest() function performs both the Student’s t-test and Welch’s (Satterthwaite) t-test in the same call. The results are displayed on separate rows. This output is similar to SAS.\n\n library(procs)\n\n # Perform t-test\n proc_ttest(d1, var = WtGain,\n class = trt_grp)\n\n$Statistics\n VAR CLASS METHOD N MEAN STD STDERR MIN MAX\n1 WtGain placebo <NA> 16 75.1875 33.81167 8.452918 12 130\n2 WtGain treatment <NA> 16 83.1250 30.53495 7.633738 28 128\n3 WtGain Diff (1-2) Pooled NA -7.9375 NA 11.389723 NA NA\n4 WtGain Diff (1-2) Satterthwaite NA -7.9375 NA 11.389723 NA NA\n\n$ConfLimits\n VAR CLASS METHOD MEAN LCLM UCLM STD LCLMSTD\n1 WtGain placebo <NA> 75.1875 57.17053 93.20447 33.81167 24.97685\n2 WtGain treatment <NA> 83.1250 66.85407 99.39593 30.53495 22.55632\n3 WtGain Diff (1-2) Pooled -7.9375 -31.19842 15.32342 NA NA\n4 WtGain Diff (1-2) Satterthwaite -7.9375 -31.20849 15.33349 NA NA\n UCLMSTD\n1 52.33003\n2 47.25868\n3 NA\n4 NA\n\n$TTests\n VAR METHOD VARIANCES DF T PROBT\n1 WtGain Pooled Equal 30.00000 -0.6969002 0.4912306\n2 WtGain Satterthwaite Unequal 29.69359 -0.6969002 0.4912856\n\n$Equality\n VAR METHOD NDF DDF FVAL PROBF\n1 WtGain Folded F 15 15 1.226136 0.6980614\n\n\nViewer Output:"
+ "text": "Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining most of the information.\n\n\n\nWe will load the iris data.\nStandardize the data and then compute PCA.\n\n\nsuppressPackageStartupMessages({\n library(factoextra)\n library(plotly)\n})\n \ndata <- iris\npca_result <- prcomp(data[, 1:4], scale = T)\npca_result\n\nStandard deviations (1, .., p=4):\n[1] 1.7083611 0.9560494 0.3830886 0.1439265\n\nRotation (n x k) = (4 x 4):\n PC1 PC2 PC3 PC4\nSepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863\nSepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096\nPetal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492\nPetal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971\n\n\nWe print the summary of the PCA result, which includes the standard deviation of each principal component and the proportion of variance explained.\n\nsummary(pca_result)\n\nImportance of components:\n PC1 PC2 PC3 PC4\nStandard deviation 1.7084 0.9560 0.38309 0.14393\nProportion of Variance 0.7296 0.2285 0.03669 0.00518\nCumulative Proportion 0.7296 0.9581 0.99482 1.00000"
},
{
- "objectID": "R/survival.html",
- "href": "R/survival.html",
- "title": "Survival Analysis Using R",
+ "objectID": "R/PCA_analysis.html#introduction",
+ "href": "R/PCA_analysis.html#introduction",
+ "title": "Principal Component Analysis",
"section": "",
- "text": "The most commonly used survival analysis methods in clinical trials include:\nAdditionally, other methods for analyzing time-to-event data are available, such as:\nWhile these models may be explored in a separate document, this particular document focuses solely on the three most prevalent methods: KM estimators, log-rank test and Cox PH model."
+ "text": "Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining most of the information.\n\n\n\nWe will load the iris data.\nStandardize the data and then compute PCA.\n\n\nsuppressPackageStartupMessages({\n library(factoextra)\n library(plotly)\n})\n \ndata <- iris\npca_result <- prcomp(data[, 1:4], scale = T)\npca_result\n\nStandard deviations (1, .., p=4):\n[1] 1.7083611 0.9560494 0.3830886 0.1439265\n\nRotation (n x k) = (4 x 4):\n PC1 PC2 PC3 PC4\nSepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863\nSepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096\nPetal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492\nPetal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971\n\n\nWe print the summary of the PCA result, which includes the standard deviation of each principal component and the proportion of variance explained.\n\nsummary(pca_result)\n\nImportance of components:\n PC1 PC2 PC3 PC4\nStandard deviation 1.7084 0.9560 0.38309 0.14393\nProportion of Variance 0.7296 0.2285 0.03669 0.00518\nCumulative Proportion 0.7296 0.9581 0.99482 1.00000"
},
{
- "objectID": "R/survival.html#example-data",
- "href": "R/survival.html#example-data",
- "title": "Survival Analysis Using R",
- "section": "Example Data",
- "text": "Example Data\nData source: https://stats.idre.ucla.edu/sas/seminars/sas-survival/\nThe data include 500 subjects from the Worcester Heart Attack Study. This study examined several factors, such as age, gender and BMI, that may influence survival time after heart attack. Follow up time for all participants begins at the time of hospital admission after heart attack and ends with death or loss to follow up (censoring). The variables used here are:\n\nlenfol: length of followup, terminated either by death or censoring - time variable\nfstat: loss to followup = 0, death = 1 - censoring variable\nafb: atrial fibrillation, no = 0, 1 = yes - explanatory variable\ngender: males = 0, females = 1 - stratification factor\n\n\nlibrary(tidyverse)\nlibrary(haven)\nlibrary(survival)\nlibrary(survminer)\nlibrary(broom)\nlibrary(knitr)\nknitr::opts_chunk$set(echo = TRUE)\n\ndat <- read_sas(file.path(\"../data/whas500.sas7bdat\")) %>%\n mutate(LENFOLY = round(LENFOL/365.25, 2), ## change follow-up days to years for better visualization\n AFB = factor(AFB, levels = c(1, 0))) ## change AFB order to use \"Yes\" as the reference group to be consistent with SAS"
+ "objectID": "R/PCA_analysis.html#visualize-pca-results",
+ "href": "R/PCA_analysis.html#visualize-pca-results",
+ "title": "Principal Component Analysis",
+ "section": "Visualize PCA Results",
+ "text": "Visualize PCA Results\n\nScree Plot\nA scree plot shows the proportion of variance explained by each principal component.\n\nfviz_eig(pca_result)\n\n\n\n\n\n\n\n\n\n\nBiplot\nA biplot shows the scores of the samples and the loadings of the variables on the first two principal components.\n\nplt <- fviz_pca_biplot(pca_result, geom.ind = \"point\", pointshape = 21, \n pointsize = 2, fill.ind = iris$Species, \n col.var = \"black\", repel = TRUE)\nplt"
},
{
- "objectID": "R/survival.html#the-non-stratified-model",
- "href": "R/survival.html#the-non-stratified-model",
- "title": "Survival Analysis Using R",
- "section": "The Non-stratified Model",
- "text": "The Non-stratified Model\nFirst we try a non-stratified analysis following the mock-up above to describe the association between survival time and afb (atrial fibrillation).\nThe KM estimators are from survival::survfit function, the log-rank test uses survminer::surv_pvalue, and Cox PH model is conducted using survival::coxph function. Numerous R packages and functions are available for performing survival analysis. The author has selected survival and survminer for use in this context, but alternative options can also be employed for survival analysis.\n\nKM estimators\n\nfit.km <- survfit(Surv(LENFOLY, FSTAT) ~ AFB, data = dat)\n\n## quantile estimates\nquantile(fit.km, probs = c(0.25, 0.5, 0.75)) \n\n$quantile\n 25 50 75\nAFB=1 0.26 2.37 6.43\nAFB=0 0.94 5.91 6.44\n\n$lower\n 25 50 75\nAFB=1 0.05 1.27 4.24\nAFB=0 0.55 4.32 6.44\n\n$upper\n 25 50 75\nAFB=1 1.11 4.24 NA\nAFB=0 1.47 NA NA\n\n## landmark estimates at 1, 3, 5-year\nsummary(fit.km, times = c(1, 3, 5)) \n\nCall: survfit(formula = Surv(LENFOLY, FSTAT) ~ AFB, data = dat)\n\n AFB=1 \n time n.risk n.event survival std.err lower 95% CI upper 95% CI\n 1 50 28 0.641 0.0543 0.543 0.757\n 3 27 12 0.455 0.0599 0.351 0.589\n 5 11 6 0.315 0.0643 0.211 0.470\n\n AFB=0 \n time n.risk n.event survival std.err lower 95% CI upper 95% CI\n 1 312 110 0.739 0.0214 0.699 0.782\n 3 199 33 0.642 0.0245 0.595 0.691\n 5 77 20 0.530 0.0311 0.472 0.595\n\n\n\n\nLog-rank test\n\nsurvminer::surv_pvalue(fit.km, data = dat)\n\n variable pval method pval.txt\n1 AFB 0.0009646027 Log-rank p = 0.00096\n\n\n\n\nCox PH model\n\nfit.cox <- coxph(Surv(LENFOLY, FSTAT) ~ AFB, data = dat)\nfit.cox %>% \n tidy(exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) %>%\n select(term, estimate, conf.low, conf.high)\n\n# A tibble: 1 × 4\n term estimate conf.low conf.high\n <chr> <dbl> <dbl> <dbl>\n1 AFB0 0.583 0.421 0.806"
+ "objectID": "R/PCA_analysis.html#interpretation",
+ "href": "R/PCA_analysis.html#interpretation",
+ "title": "Principal Component Analysis",
+ "section": "Interpretation",
+ "text": "Interpretation\n\nThe scree plot helps decide how many principal components to retain: look for an elbow point where the explained variance starts to level off.\nThe biplot visualizes both the samples (points) and the variables (arrows). Points that are close to each other represent samples with similar characteristics, while the direction and length of the arrows indicate the contribution of each variable to the principal components."
},
{
- "objectID": "R/survival.html#the-stratified-model",
- "href": "R/survival.html#the-stratified-model",
- "title": "Survival Analysis Using R",
- "section": "The Stratified Model",
- "text": "The Stratified Model\nIn a stratified model, the Kaplan-Meier estimators remain the same as those in the non-stratified model. To implement stratified log-rank tests and Cox proportional hazards models, simply include the strata() function within the model formula.\n\nStratified Log-rank test\n\nfit.km.str <- survfit(Surv(LENFOLY, FSTAT) ~ AFB + strata(GENDER), data = dat)\n\nsurvminer::surv_pvalue(fit.km.str, data = dat)\n\n variable pval method pval.txt\n1 AFB+strata(GENDER) 0.001506607 Log-rank p = 0.0015\n\n\n\n\nStratified Cox PH model\n\nfit.cox.str <- coxph(Surv(LENFOLY, FSTAT) ~ AFB + strata(GENDER), data = dat)\nfit.cox.str %>% \n tidy(exponentiate = TRUE, conf.int = TRUE, conf.level = 0.95) %>%\n select(term, estimate, conf.low, conf.high)\n\n# A tibble: 1 × 4\n term estimate conf.low conf.high\n <chr> <dbl> <dbl> <dbl>\n1 AFB0 0.594 0.430 0.823"
+ "objectID": "R/PCA_analysis.html#visualization-of-pca-in-3d-scatter-plot",
+ "href": "R/PCA_analysis.html#visualization-of-pca-in-3d-scatter-plot",
+ "title": "Principal Component Analysis",
+ "section": "Visualization of PCA in 3d Scatter Plot",
+ "text": "Visualization of PCA in 3d Scatter Plot\nA 3d scatter plot allows us to see the relationships between three principal components simultaneously and also gives us a better understanding of how much variance is explained by these components.\nIt also allows for interactive exploration where we can rotate the plot and view it from different angles.\nWe will plot this using the plotly package.\n\npca_result2 <- prcomp(data[, 1:4], scale = T, rank. = 3)\npca_result2\n\nStandard deviations (1, .., p=4):\n[1] 1.7083611 0.9560494 0.3830886 0.1439265\n\nRotation (n x k) = (4 x 3):\n PC1 PC2 PC3\nSepal.Length 0.5210659 -0.37741762 0.7195664\nSepal.Width -0.2693474 -0.92329566 -0.2443818\nPetal.Length 0.5804131 -0.02449161 -0.1421264\nPetal.Width 0.5648565 -0.06694199 -0.6342727\n\n\nNext, we will create a dataframe of the 3 principal components and negate PC2 and PC3 for visual preference to make the plot look more organised and symmetric in 3d space.\n\ncomponents <- as.data.frame(pca_result2$x)\ncomponents$PC2 <- -components$PC2\ncomponents$PC3 <- -components$PC3\n\n\nfig <- plot_ly(components, \n x = ~PC1, \n y = ~PC2, \n z = ~PC3, \n color = ~data$Species, \n colors = c('darkgreen','darkblue','darkred')) %>%\n add_markers(size = 12)\n\nfig <- fig %>%\n layout(title = \"3d Visualization of PCA\",\n scene = list(bgcolor = \"lightgray\"))\nfig"
},
{
- "objectID": "R/correlation.html",
- "href": "R/correlation.html",
- "title": "Correlation Analysis Using R",
+ "objectID": "R/linear-regression.html",
+ "href": "R/linear-regression.html",
+ "title": "Linear Regression",
"section": "",
- "text": "The most commonly used correlation analysis methods in clinical trials include:\n\nPearson correlation coefficient: product moment coefficient between two continuous variables, measuring linear associations.\n\n\\[\nr = \\frac{\\sum_{i=1}^n (x_i - m_x)(y_i - m_y)}{\\sqrt{\\sum_{i=1}^n (x_i - m_x)^2\\sum_{i=1}^n (y_i - m_y)^2}},\\]\nwhere \\(x\\) and \\(y\\) are observations from two continuous variables of length \\(n\\) and \\(m_x\\) and \\(m_y\\) are their respective means.\nSpearman correlation coefficient: rank correlation defined through the scaled sum of the squared values of the difference between ranks of two continuous variables.\n\\[\n\\rho = \\frac{\\sum_{i=1}^n (x'_i - m_{x'})(y'_i - m_{y'})}{\\sqrt{\\sum_{i=1}^n (x'_i - m_{x'})^2\\sum_{i=1}^n(y'_i - m_{y'})^2}},\n\\]\nwhere \\(x'\\) and \\(y'\\) are the ranks of \\(x\\) and \\(y\\) and \\(m_{x'}\\) and \\(m_{y'}\\) are the mean ranks of \\(x\\) and \\(y\\), respectively.\nKendall’s rank correlation: rank correlation based on the number of inversions in one ranking as compared with another.\n\\[\n\\tau = \\frac{n_c - n_d}{\\frac{1}{2}\\,n\\,(n-1)},\n\\]\nwhere \\(n_c\\) is the total number of concordant pairs, \\(n_d\\) is the total number of disconcordant pairs and \\(n\\) the total size of observations in \\(x\\) and \\(y\\).\n\nOther association measures are available for count data/contingency tables comparing observed frequencies with those expected under the assumption of independence\n\nFisher exact test\nChi-Square statistic\n\n\nExample: Lung Cancer Data\nData source: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.\nSurvival in patients with advanced lung cancer from the North Central Cancer Treatment Group. 
Performance scores rate how well the patient can perform usual daily activities.\n\nlibrary(survival) \n\nglimpse(lung)\n\nRows: 228\nColumns: 10\n$ inst <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22, 16…\n$ time <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654…\n$ status <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …\n$ age <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 57, …\n$ sex <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, …\n$ ph.ecog <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2, 1,…\n$ ph.karno <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80,…\n$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, 70,…\n$ meal.cal <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, NA,…\n$ wt.loss <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 15, …\n\n\n\n\nOverview\ncor() computes the correlation coefficient between continuous variables x and y, where method chooses which correlation coefficient is to be computed (default: \"pearson\", \"kendall\", or \"spearman\").\ncor.test() calulates the test for association between paired samples, using one of Pearson’s product moment correlation coefficient, Kendall’s \\(\\tau\\) or Spearman’s \\(\\rho\\). Besides the correlation coefficient itself, it provides additional information.\nMissing values are assumed to be missing completely at random (MCAR). 
Different strategies are available, see ?cor for details.\n\n\nPearson Correlation\n\ncor.test(x = lung$age, y = lung$meal.cal, method = \"pearson\") \n\n\n Pearson's product-moment correlation\n\ndata: lung$age and lung$meal.cal\nt = -3.1824, df = 179, p-value = 0.001722\nalternative hypothesis: true correlation is not equal to 0\n95 percent confidence interval:\n -0.3649503 -0.0885415\nsample estimates:\n cor \n-0.2314107 \n\n\n\n\nSpearman Correlation\n\ncor.test(x = lung$age, y = lung$meal.cal, method = \"spearman\")\n\nWarning in cor.test.default(x = lung$age, y = lung$meal.cal, method =\n\"spearman\"): Cannot compute exact p-value with ties\n\n\n\n Spearman's rank correlation rho\n\ndata: lung$age and lung$meal.cal\nS = 1193189, p-value = 0.005095\nalternative hypothesis: true rho is not equal to 0\nsample estimates:\n rho \n-0.2073639 \n\n\nNote: Exact p-values require unanimous ranks.\n\n\nKendall’s rank correlation\n\ncor.test(x = lung$age, y = lung$meal.cal, method = \"kendall\")\n\n\n Kendall's rank correlation tau\n\ndata: lung$age and lung$meal.cal\nz = -2.7919, p-value = 0.00524\nalternative hypothesis: true tau is not equal to 0\nsample estimates:\n tau \n-0.1443877 \n\n\n\n\nInterpretation of correlation coefficients\nCorrelation coefficient is comprised between -1 and 1:\n\n\\(-1\\) indicates a strong negative correlation\n\\(0\\) means that there is no association between the two variables\n\\(1\\) indicates a strong positive correlation"
+ "text": "To demonstrate the use of linear regression we examine a dataset that illustrates the relationship between Height and Weight in a group of 237 teen-aged boys and girls. The dataset is available here and is imported to the workspace.\n\nDescriptive Statistics\nThe first step is to obtain the simple descriptive statistics for the numeric variables of htwt data, and one-way frequencies for categorical variables. This is done with the summary() function. There are 237 participants who are from 13.9 to 25 years old. It is a cross-sectional study, with each participant having one observation. We can use this data set to examine the relationship of participants’ height to their age and sex.\n\nknitr::opts_chunk$set(echo = TRUE)\nhtwt<-read.csv(\"../data/htwt.csv\")\nsummary(htwt)\n\n ROW SEX AGE HEIGHT \n Min. : 1 Length:237 Min. :13.90 Min. :50.50 \n 1st Qu.: 60 Class :character 1st Qu.:14.80 1st Qu.:58.80 \n Median :119 Mode :character Median :16.30 Median :61.50 \n Mean :119 Mean :16.44 Mean :61.36 \n 3rd Qu.:178 3rd Qu.:17.80 3rd Qu.:64.30 \n Max. :237 Max. :25.00 Max. :72.00 \n WEIGHT \n Min. : 50.5 \n 1st Qu.: 85.0 \n Median :101.0 \n Mean :101.3 \n 3rd Qu.:112.0 \n Max. :171.5 \n\n\nIn order to create a regression model to demonstrate the relationship between age and height for females, we first need to create a flag variable identifying females and an interaction variable between age and the female flag.\n\nhtwt$female <- ifelse(htwt$SEX=='f',1,0)\nhtwt$fem_age <- htwt$AGE * htwt$female\nhead(htwt)\n\n ROW SEX AGE HEIGHT WEIGHT female fem_age\n1 1 f 14.3 56.3 85.0 1 14.3\n2 2 f 15.5 62.3 105.0 1 15.5\n3 3 f 15.3 63.3 108.0 1 15.3\n4 4 f 16.1 59.0 92.0 1 16.1\n5 5 f 19.1 62.5 112.5 1 19.1\n6 6 f 17.1 62.5 112.0 1 17.1\n\n\n\n\nRegression Analysis\nNext, we fit a regression model, representing the relationships between gender, age, height and the interaction variable created in the step above. 
We use the subset argument of lm() to restrict the analysis to participants who are 19 years old or younger. A 95% confidence interval for each parameter in the model can be obtained with confint(regression). The model that we are fitting is height = b0 + b1 x female + b2 x age + b3 x fem_age + e\n\nregression<-lm(HEIGHT~female+AGE+fem_age, data=htwt, subset=AGE<=19)\nsummary(regression)\n\n\nCall:\nlm(formula = HEIGHT ~ female + AGE + fem_age, data = htwt, subset = AGE <= \n 19)\n\nResiduals:\n Min 1Q Median 3Q Max \n-8.2429 -1.7351 0.0383 1.6518 7.9289 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 28.8828 2.8734 10.052 < 2e-16 ***\nfemale 13.6123 4.0192 3.387 0.000841 ***\nAGE 2.0313 0.1776 11.435 < 2e-16 ***\nfem_age -0.9294 0.2478 -3.750 0.000227 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.799 on 215 degrees of freedom\nMultiple R-squared: 0.4595, Adjusted R-squared: 0.452 \nF-statistic: 60.93 on 3 and 215 DF, p-value: < 2.2e-16\n\n\nFrom the coefficients table, b0, b1, b2 and b3 are estimated as b0=28.88, b1=13.61, b2=2.03 and b3=-0.93.\nThe resulting regression model for height, age and gender based on the available data is height = 28.8828 + 13.6123 x female + 2.0313 x age - 0.9294 x fem_age"
},
{
- "objectID": "R/mmrm.html",
- "href": "R/mmrm.html",
- "title": "MMRM in R",
+ "objectID": "R/ancova.html",
+ "href": "R/ancova.html",
+ "title": "Ancova",
"section": "",
- "text": "Mixed models for repeated measures (MMRM) are a popular choice for analyzing longitudinal continuous outcomes in randomized clinical trials and beyond; see Cnaan, Laird and Slasor (1997) for a tutorial and Mallinckrodt, Lane and Schnell (2008) for a review.\nThis vignette shows examples from the mmrm package.\nThe mmrm package implements MMRM based on the marginal linear model without random effects using Template Model Builder (TMB) which enables fast and robust model fitting. Users can specify a variety of covariance matrices, weight observations, fit models with restricted or standard maximum likelihood inference, perform hypothesis testing with Satterthwaite or Kenward-Roger adjustment, and extract least square means estimates by using emmeans.\n\n\n\nFlexible covariance specification:\n\nStructures: unstructured, Toeplitz, AR1, compound symmetry, ante-dependence, and spatial exponential.\nGroups: shared covariance structure for all subjects or group-specific covariance estimates.\nVariances: homogeneous or heterogeneous across time points.\n\nInference:\n\nSupports REML and ML.\nSupports weights.\n\nHypothesis testing:\n\nLeast square means: can be obtained with the emmeans package\nOne- and multi-dimensional linear contrasts of model parameters can be tested.\nSatterthwaite adjusted degrees of freedom.\nKenward-Roger adjusted degrees of freedom and coefficients covariance matrix.\nCoefficient Covariance\n\nC++ backend:\n\nFast implementation using C++ and automatic differentiation to obtain precise gradient information for model fitting.\nModel fitting algorithm details used in mmrm.\n\nPackage ecosystems integration:\n\nIntegration with tidymodels package ecosystem\n\nDedicated parsnip engine for linear regression\nIntegration with recipes\n\nIntegration with tern package ecosystems\n\nThe tern.mmrm package can be used to run the mmrm fit and generate tabulation and plots of least square means per visit and treatment arm, tabulation of model 
diagnostics, diagnostic graphs, and other standard model outputs.\n\n\n\n\n\n\nSee also the introductory vignette\nThe code below implements an MMRM fit in R with the mmrm::mmrm function.\n\nlibrary(mmrm)\nfit <- mmrm(\n formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),\n data = fev_data\n)\n\nThe code specifies an MMRM with the given covariates and an unstructured covariance matrix for the timepoints (also called visits in the clinical trial context, here given by AVISIT) within the subjects (here USUBJID). While by default this uses restricted maximum likelihood (REML), it is also possible to use ML, see ?mmrm.\nPrinting the object will show you output which should be familiar to anyone who has used any popular modeling functions such as stats::lm(), stats::glm(), glmmTMB::glmmTMB(), and lme4::nlmer(). From this print out we see the function call, the data used, the covariance structure with number of variance parameters, as well as the likelihood method, and model deviance achieved. 
Additionally the user is provided a printout of the estimated coefficients and the model convergence information:\n\nfit\n\nmmrm fit\n\nFormula: FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID)\nData: fev_data (used 537 observations from 197 subjects with maximum 4 \ntimepoints)\nCovariance: unstructured (10 variance parameters)\nInference: REML\nDeviance: 3386.45\n\nCoefficients: \n (Intercept) RACEBlack or African American \n 30.7774065 1.5305950 \n RACEWhite SEXFemale \n 5.6435679 0.3260274 \n ARMCDTRT AVISITVIS2 \n 3.7744139 4.8396039 \n AVISITVIS3 AVISITVIS4 \n 10.3421671 15.0537863 \n ARMCDTRT:AVISITVIS2 ARMCDTRT:AVISITVIS3 \n -0.0420899 -0.6938068 \n ARMCDTRT:AVISITVIS4 \n 0.6241229 \n\nModel Inference Optimization:\nConverged with code 0 and message: No message provided.\n\n\nThe summary() method then provides the coefficients table with Satterthwaite degrees of freedom as well as the covariance matrix estimate:\n\nfit |>\n summary()\n\nmmrm fit\n\nFormula: FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID)\nData: fev_data (used 537 observations from 197 subjects with maximum 4 \ntimepoints)\nCovariance: unstructured (10 variance parameters)\nMethod: Satterthwaite\nVcov Method: Asymptotic\nInference: REML\n\nModel selection criteria:\n AIC BIC logLik deviance \n 3406.4 3439.3 -1693.2 3386.4 \n\nCoefficients: \n Estimate Std. 
Error df t value Pr(>|t|)\n(Intercept) 30.77741 0.88657 218.79000 34.715 < 2e-16\nRACEBlack or African American 1.53059 0.62446 168.67000 2.451 0.015263\nRACEWhite 5.64357 0.66559 157.14000 8.479 1.56e-14\nSEXFemale 0.32603 0.53194 166.14000 0.613 0.540776\nARMCDTRT 3.77441 1.07416 145.55000 3.514 0.000589\nAVISITVIS2 4.83960 0.80173 143.87000 6.036 1.27e-08\nAVISITVIS3 10.34217 0.82269 155.56000 12.571 < 2e-16\nAVISITVIS4 15.05379 1.31288 138.46000 11.466 < 2e-16\nARMCDTRT:AVISITVIS2 -0.04209 1.12933 138.56000 -0.037 0.970324\nARMCDTRT:AVISITVIS3 -0.69381 1.18764 158.17000 -0.584 0.559924\nARMCDTRT:AVISITVIS4 0.62412 1.85096 129.71000 0.337 0.736520\n \n(Intercept) ***\nRACEBlack or African American * \nRACEWhite ***\nSEXFemale \nARMCDTRT ***\nAVISITVIS2 ***\nAVISITVIS3 ***\nAVISITVIS4 ***\nARMCDTRT:AVISITVIS2 \nARMCDTRT:AVISITVIS3 \nARMCDTRT:AVISITVIS4 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCovariance estimate:\n VIS1 VIS2 VIS3 VIS4\nVIS1 40.5544 14.3960 4.9760 13.3779\nVIS2 14.3960 26.5714 2.7836 7.4773\nVIS3 4.9760 2.7836 14.8980 0.9036\nVIS4 13.3779 7.4773 0.9036 95.5565\n\n\n\n\n\nIn order to extract relevant marginal means (LSmeans) and contrasts we can use the emmeans package. This package includes methods that allow mmrm objects to be used with the emmeans package. emmeans computes estimated marginal means (also called least-square means) for the coefficients of the MMRM.\n\nif (require(emmeans)) {\n emmeans(fit, ~ ARMCD | AVISIT)\n}\n\nLoading required package: emmeans\n\n\nmmrm() registered as emmeans extension\n\n\nWelcome to emmeans.\nCaution: You lose important information if you filter this package's results.\nSee '? 
untidy'\n\n\nAVISIT = VIS1:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 33.3 0.755 148 31.8 34.8\n TRT 37.1 0.763 143 35.6 38.6\n\nAVISIT = VIS2:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 38.2 0.612 147 37.0 39.4\n TRT 41.9 0.602 143 40.7 43.1\n\nAVISIT = VIS3:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 43.7 0.462 130 42.8 44.6\n TRT 46.8 0.509 130 45.7 47.8\n\nAVISIT = VIS4:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 48.4 1.189 134 46.0 50.7\n TRT 52.8 1.188 133 50.4 55.1\n\nResults are averaged over the levels of: RACE, SEX \nConfidence level used: 0.95 \n\n\nNote that the degrees of freedom choice is inherited here from the initial mmrm fit."
+ "text": "In this example, we’re looking at Analysis of Covariance (ANCOVA). ANCOVA is typically used to analyse treatment differences; for examples of prediction models, see the simple linear regression page."
},
{
- "objectID": "R/mmrm.html#fitting-the-mmrm-in-r",
- "href": "R/mmrm.html#fitting-the-mmrm-in-r",
- "title": "MMRM in R",
+ "objectID": "R/ancova.html#introduction",
+ "href": "R/ancova.html#introduction",
+ "title": "Ancova",
"section": "",
- "text": "Mixed models for repeated measures (MMRM) are a popular choice for analyzing longitudinal continuous outcomes in randomized clinical trials and beyond; see Cnaan, Laird and Slasor (1997) for a tutorial and Mallinckrodt, Lane and Schnell (2008) for a review.\nThis vignette shows examples from the mmrm package.\nThe mmrm package implements MMRM based on the marginal linear model without random effects using Template Model Builder (TMB) which enables fast and robust model fitting. Users can specify a variety of covariance matrices, weight observations, fit models with restricted or standard maximum likelihood inference, perform hypothesis testing with Satterthwaite or Kenward-Roger adjustment, and extract least square means estimates by using emmeans.\n\n\n\nFlexible covariance specification:\n\nStructures: unstructured, Toeplitz, AR1, compound symmetry, ante-dependence, and spatial exponential.\nGroups: shared covariance structure for all subjects or group-specific covariance estimates.\nVariances: homogeneous or heterogeneous across time points.\n\nInference:\n\nSupports REML and ML.\nSupports weights.\n\nHypothesis testing:\n\nLeast square means: can be obtained with the emmeans package\nOne- and multi-dimensional linear contrasts of model parameters can be tested.\nSatterthwaite adjusted degrees of freedom.\nKenward-Roger adjusted degrees of freedom and coefficients covariance matrix.\nCoefficient Covariance\n\nC++ backend:\n\nFast implementation using C++ and automatic differentiation to obtain precise gradient information for model fitting.\nModel fitting algorithm details used in mmrm.\n\nPackage ecosystems integration:\n\nIntegration with tidymodels package ecosystem\n\nDedicated parsnip engine for linear regression\nIntegration with recipes\n\nIntegration with tern package ecosystems\n\nThe tern.mmrm package can be used to run the mmrm fit and generate tabulation and plots of least square means per visit and treatment arm, tabulation of model 
diagnostics, diagnostic graphs, and other standard model outputs.\n\n\n\n\n\n\nSee also the introductory vignette\nThe code below implements an MMRM fit in R with the mmrm::mmrm function.\n\nlibrary(mmrm)\nfit <- mmrm(\n formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),\n data = fev_data\n)\n\nThe code specifies an MMRM with the given covariates and an unstructured covariance matrix for the timepoints (also called visits in the clinical trial context, here given by AVISIT) within the subjects (here USUBJID). While by default this uses restricted maximum likelihood (REML), it is also possible to use ML, see ?mmrm.\nPrinting the object will show you output which should be familiar to anyone who has used any popular modeling functions such as stats::lm(), stats::glm(), glmmTMB::glmmTMB(), and lme4::nlmer(). From this print out we see the function call, the data used, the covariance structure with number of variance parameters, as well as the likelihood method, and model deviance achieved. 
Additionally the user is provided a printout of the estimated coefficients and the model convergence information:\n\nfit\n\nmmrm fit\n\nFormula: FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID)\nData: fev_data (used 537 observations from 197 subjects with maximum 4 \ntimepoints)\nCovariance: unstructured (10 variance parameters)\nInference: REML\nDeviance: 3386.45\n\nCoefficients: \n (Intercept) RACEBlack or African American \n 30.7774065 1.5305950 \n RACEWhite SEXFemale \n 5.6435679 0.3260274 \n ARMCDTRT AVISITVIS2 \n 3.7744139 4.8396039 \n AVISITVIS3 AVISITVIS4 \n 10.3421671 15.0537863 \n ARMCDTRT:AVISITVIS2 ARMCDTRT:AVISITVIS3 \n -0.0420899 -0.6938068 \n ARMCDTRT:AVISITVIS4 \n 0.6241229 \n\nModel Inference Optimization:\nConverged with code 0 and message: No message provided.\n\n\nThe summary() method then provides the coefficients table with Satterthwaite degrees of freedom as well as the covariance matrix estimate:\n\nfit |>\n summary()\n\nmmrm fit\n\nFormula: FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID)\nData: fev_data (used 537 observations from 197 subjects with maximum 4 \ntimepoints)\nCovariance: unstructured (10 variance parameters)\nMethod: Satterthwaite\nVcov Method: Asymptotic\nInference: REML\n\nModel selection criteria:\n AIC BIC logLik deviance \n 3406.4 3439.3 -1693.2 3386.4 \n\nCoefficients: \n Estimate Std. 
Error df t value Pr(>|t|)\n(Intercept) 30.77741 0.88657 218.79000 34.715 < 2e-16\nRACEBlack or African American 1.53059 0.62446 168.67000 2.451 0.015263\nRACEWhite 5.64357 0.66559 157.14000 8.479 1.56e-14\nSEXFemale 0.32603 0.53194 166.14000 0.613 0.540776\nARMCDTRT 3.77441 1.07416 145.55000 3.514 0.000589\nAVISITVIS2 4.83960 0.80173 143.87000 6.036 1.27e-08\nAVISITVIS3 10.34217 0.82269 155.56000 12.571 < 2e-16\nAVISITVIS4 15.05379 1.31288 138.46000 11.466 < 2e-16\nARMCDTRT:AVISITVIS2 -0.04209 1.12933 138.56000 -0.037 0.970324\nARMCDTRT:AVISITVIS3 -0.69381 1.18764 158.17000 -0.584 0.559924\nARMCDTRT:AVISITVIS4 0.62412 1.85096 129.71000 0.337 0.736520\n \n(Intercept) ***\nRACEBlack or African American * \nRACEWhite ***\nSEXFemale \nARMCDTRT ***\nAVISITVIS2 ***\nAVISITVIS3 ***\nAVISITVIS4 ***\nARMCDTRT:AVISITVIS2 \nARMCDTRT:AVISITVIS3 \nARMCDTRT:AVISITVIS4 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nCovariance estimate:\n VIS1 VIS2 VIS3 VIS4\nVIS1 40.5544 14.3960 4.9760 13.3779\nVIS2 14.3960 26.5714 2.7836 7.4773\nVIS3 4.9760 2.7836 14.8980 0.9036\nVIS4 13.3779 7.4773 0.9036 95.5565\n\n\n\n\n\nIn order to extract relevant marginal means (LSmeans) and contrasts we can use the emmeans package. This package includes methods that allow mmrm objects to be used with the emmeans package. emmeans computes estimated marginal means (also called least-square means) for the coefficients of the MMRM.\n\nif (require(emmeans)) {\n emmeans(fit, ~ ARMCD | AVISIT)\n}\n\nLoading required package: emmeans\n\n\nmmrm() registered as emmeans extension\n\n\nWelcome to emmeans.\nCaution: You lose important information if you filter this package's results.\nSee '? 
untidy'\n\n\nAVISIT = VIS1:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 33.3 0.755 148 31.8 34.8\n TRT 37.1 0.763 143 35.6 38.6\n\nAVISIT = VIS2:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 38.2 0.612 147 37.0 39.4\n TRT 41.9 0.602 143 40.7 43.1\n\nAVISIT = VIS3:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 43.7 0.462 130 42.8 44.6\n TRT 46.8 0.509 130 45.7 47.8\n\nAVISIT = VIS4:\n ARMCD emmean SE df lower.CL upper.CL\n PBO 48.4 1.189 134 46.0 50.7\n TRT 52.8 1.188 133 50.4 55.1\n\nResults are averaged over the levels of: RACE, SEX \nConfidence level used: 0.95 \n\n\nNote that the degrees of freedom choice is inherited here from the initial mmrm fit."
+ "text": "In this example, we’re looking at Analysis of Covariance (ANCOVA). ANCOVA is typically used to analyse treatment differences; for examples of prediction models, see the simple linear regression page."
},
{
- "objectID": "R/summary-stats.html",
- "href": "R/summary-stats.html",
- "title": "Deriving Quantiles or Percentiles in R",
+ "objectID": "R/ancova.html#data-summary",
+ "href": "R/ancova.html#data-summary",
+ "title": "Ancova",
+ "section": "Data Summary",
+ "text": "Data Summary\n\ndf_sas %>% glimpse()\n\nRows: 30\nColumns: 3\n$ drug <fct> A, A, A, A, A, A, A, A, A, A, D, D, D, D, D, D, D, D, D, D, F, F,…\n$ pre <dbl> 11, 8, 5, 14, 19, 6, 10, 6, 11, 3, 6, 6, 7, 8, 18, 8, 19, 8, 5, 1…\n$ post <dbl> 6, 0, 2, 8, 11, 4, 13, 1, 8, 0, 0, 2, 3, 1, 18, 4, 14, 9, 1, 9, 1…\n\ndf_sas %>% summary()\n\n drug pre post \n A:10 Min. : 3.00 Min. : 0.00 \n D:10 1st Qu.: 7.00 1st Qu.: 2.00 \n F:10 Median :10.50 Median : 7.00 \n Mean :10.73 Mean : 7.90 \n 3rd Qu.:13.75 3rd Qu.:12.75 \n Max. :21.00 Max. :23.00"
+ },
+ {
+ "objectID": "R/ancova.html#the-model",
+ "href": "R/ancova.html#the-model",
+ "title": "Ancova",
+ "section": "The Model",
+ "text": "The Model\n\nmodel_ancova <- lm(post ~ drug + pre, data = df_sas)\nmodel_glance <- model_ancova %>% glance()\nmodel_tidy <- model_ancova %>% tidy()\nmodel_glance %>% gt()\n\n\n\n\n\n\n\nr.squared\nadj.r.squared\nsigma\nstatistic\np.value\ndf\nlogLik\nAIC\nBIC\ndeviance\ndf.residual\nnobs\n\n\n\n\n0.6762609\n0.6389064\n4.005778\n18.10386\n1.501369e-06\n3\n-82.05377\n174.1075\n181.1135\n417.2026\n26\n30\n\n\n\n\n\n\nmodel_tidy %>% gt()\n\n\n\n\n\n\n\nterm\nestimate\nstd.error\nstatistic\np.value\n\n\n\n\n(Intercept)\n-3.8808094\n1.9862017\n-1.9538849\n6.155192e-02\n\n\ndrugD\n0.1089713\n1.7951351\n0.0607037\n9.520594e-01\n\n\ndrugF\n3.4461383\n1.8867806\n1.8264647\n7.928458e-02\n\n\npre\n0.9871838\n0.1644976\n6.0012061\n2.454330e-06\n\n\n\n\n\n\n\n\nmodel_table <- \n model_ancova %>% \n anova() %>% \n tidy() %>% \n add_row(term = \"Total\", df = sum(.$df), sumsq = sum(.$sumsq))\nmodel_table %>% gt()\n\n\n\n\n\n\n\nterm\ndf\nsumsq\nmeansq\nstatistic\np.value\n\n\n\n\ndrug\n2\n293.6000\n146.80000\n9.148553\n9.812371e-04\n\n\npre\n1\n577.8974\n577.89740\n36.014475\n2.454330e-06\n\n\nResiduals\n26\n417.2026\n16.04625\nNA\nNA\n\n\nTotal\n29\n1288.7000\nNA\nNA\nNA\n\n\n\n\n\n\n\n\nType 1\n\ndf_sas %>%\n anova_test(post ~ drug + pre, type = 1, detailed = TRUE) %>% \n get_anova_table() %>%\n gt()\n\n\n\n\n\n\n\nEffect\nDFn\nDFd\nSSn\nSSd\nF\np\np<.05\nges\n\n\n\n\ndrug\n2\n26\n293.600\n417.203\n9.149\n9.81e-04\n*\n0.413\n\n\npre\n1\n26\n577.897\n417.203\n36.014\n2.45e-06\n*\n0.581\n\n\n\n\n\n\n\n\n\nType 2\n\ndf_sas %>% \n anova_test(post ~ drug + pre, type = 2, detailed = TRUE) %>% \n get_anova_table() %>% \n gt()\n\n\n\n\n\n\n\nEffect\nSSn\nSSd\nDFn\nDFd\nF\np\np<.05\nges\n\n\n\n\ndrug\n68.554\n417.203\n2\n26\n2.136\n1.38e-01\n\n0.141\n\n\npre\n577.897\n417.203\n1\n26\n36.014\n2.45e-06\n*\n0.581\n\n\n\n\n\n\n\n\n\nType 3\n\ndf_sas %>%\n anova_test(post ~ drug + pre, type = 3, detailed = TRUE) %>% \n get_anova_table() %>% \n 
gt()\n\n\n\n\n\n\n\nEffect\nSSn\nSSd\nDFn\nDFd\nF\np\np<.05\nges\n\n\n\n\n(Intercept)\n31.929\n417.203\n1\n26\n1.990\n1.70e-01\n\n0.071\n\n\ndrug\n68.554\n417.203\n2\n26\n2.136\n1.38e-01\n\n0.141\n\n\npre\n577.897\n417.203\n1\n26\n36.014\n2.45e-06\n*\n0.581\n\n\n\n\n\n\n\n\n\nLS Means\n\nmodel_ancova %>% emmeans::lsmeans(\"drug\") %>% emmeans::pwpm(pvals = TRUE, means = TRUE) \n\n A D F\nA [ 6.71] 0.9980 0.1809\nD -0.109 [ 6.82] 0.1893\nF -3.446 -3.337 [10.16]\n\nRow and column labels: drug\nUpper triangle: P values adjust = \"tukey\"\nDiagonal: [Estimates] (lsmean) \nLower triangle: Comparisons (estimate) earlier vs. later\n\nmodel_ancova %>% emmeans::lsmeans(\"drug\") %>% plot(comparisons = TRUE)"
+ },
+ {
+ "objectID": "R/ancova.html#saslm-package",
+ "href": "R/ancova.html#saslm-package",
+ "title": "Ancova",
+ "section": "sasLM Package",
+ "text": "sasLM Package\nThe following code performs an ANCOVA analysis using the sasLM package. This package was written specifically to replicate SAS statistics. The console output is also organized in a manner that is similar to SAS.\n\nlibrary(sasLM)\n\nsasLM::GLM(post ~ drug + pre, df_sas, BETA = TRUE, EMEAN = TRUE)\n\n$ANOVA\nResponse : post\n Df Sum Sq Mean Sq F value Pr(>F) \nMODEL 3 871.5 290.499 18.104 1.501e-06 ***\nRESIDUALS 26 417.2 16.046 \nCORRECTED TOTAL 29 1288.7 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$Fitness\n Root MSE post Mean Coef Var R-square Adj R-sq\n 4.005778 7.9 50.70604 0.6762609 0.6389064\n\n$`Type I`\n Df Sum Sq Mean Sq F value Pr(>F) \ndrug 2 293.6 146.8 9.1486 0.0009812 ***\npre 1 577.9 577.9 36.0145 2.454e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$`Type II`\n Df Sum Sq Mean Sq F value Pr(>F) \ndrug 2 68.55 34.28 2.1361 0.1384 \npre 1 577.90 577.90 36.0145 2.454e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$`Type III`\n Df Sum Sq Mean Sq F value Pr(>F) \ndrug 2 68.55 34.28 2.1361 0.1384 \npre 1 577.90 577.90 36.0145 2.454e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$Parameter\n Estimate Estimable Std. Error Df t value Pr(>|t|) \n(Intercept) -0.4347 0 2.4714 26 -0.1759 0.86175 \ndrugA -3.4461 0 1.8868 26 -1.8265 0.07928 . \ndrugD -3.3372 0 1.8539 26 -1.8001 0.08346 . \ndrugF 0.0000 0 0.0000 26 \npre 0.9872 1 0.1645 26 6.0012 2.454e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n$`Expected Mean`\n LSmean LowerCL UpperCL SE Df\n(Intercept) 7.900000 6.396685 9.403315 0.7313516 26\ndrugA 6.714963 4.066426 9.363501 1.2884943 26\ndrugD 6.823935 4.208337 9.439532 1.2724690 26\ndrugF 10.161102 7.456182 12.866021 1.3159234 26\npre 7.900000 6.396685 9.403315 0.7313516 26\n\n\nNote that the LSMEANS statistics are produced using the EMEAN = TRUE option. 
The BETA = TRUE option is equivalent to the SOLUTION option in SAS. See the sasLM documentation for additional information."
+ },
+ {
+ "objectID": "R/summary_skew_kurt.html",
+ "href": "R/summary_skew_kurt.html",
+ "title": "Skewness/Kurtosis",
"section": "",
- "text": "Percentiles can be calculated in R using the quantile function. The function has the argument type which allows for nine different percentile definitions to be used. The default is type = 7, which uses a piecewise-linear estimate of the cumulative distribution function to find percentiles.\nThis is how the 25th and 40th percentiles of aval could be calculated using the default type.\n\nquantile(aval, probs = c(0.25, 0.4))"
+ "text": "Skewness measures the amount of asymmetry in a distribution, while Kurtosis describes the “tailedness” of the curve. These measures are frequently used to assess the normality of the data. There are several methods to calculate these measures. In R, there are at least four different packages that contain functions for Skewness and Kurtosis. This write-up will examine the following packages: e1071, moments, procs, and sasLM.\n\n\nThe following data was used in this example.\n\n# Create sample data\ndat <- tibble::tribble(\n ~team, ~points, ~assists,\n \"A\", 10, 2,\n \"A\", 17, 5,\n \"A\", 17, 6,\n \"A\", 18, 3,\n \"A\", 15, 0,\n \"B\", 10, 2,\n \"B\", 14, 5,\n \"B\", 13, 4,\n \"B\", 29, 0,\n \"B\", 25, 2,\n \"C\", 12, 1,\n \"C\", 30, 1,\n \"C\", 34, 3,\n \"C\", 12, 4,\n \"C\", 11, 7 \n)\n\n\n\n\nBase R and the stats package have no native functions for Skewness and Kurtosis. It is therefore necessary to use a packaged function to calculate these statistics. The packages examined use three different methods of calculating Skewness, and four different methods for calculating Kurtosis. Of the available packages, the functions in the e1071 package provide the most flexibility, and have options for three of the different methodologies.\n\n\nThe e1071 package contains miscellaneous statistical functions from the Probability Theory Group at the Vienna University of Technology. The package includes functions for both Skewness and Kurtosis, and each function has a “type” parameter to specify the method. There are three available methods for Skewness, and three methods for Kurtosis. 
A portion of the documentation for these functions is included below:\n\n\nThe documentation for the skewness() function describes three types of skewness calculations: Joanes and Gill (1998) discuss three methods for estimating skewness:\n\nType 1: This is the typical definition used in many older textbooks\n\n\\[g_1 = m_3/m_2^{3/2}\\]\n\nType 2: Used in SAS and SPSS\n\\[\nG_1 = g_1\\sqrt{n(n-1)}/(n-2)\n\\]\nType 3: Used in MINITAB and BMDP\n\\[\nb_1 = m_3/s^3 = g_1((n-1)/n)^{3/2}\n\\]\n\nAll three skewness measures are unbiased under normality. The three methods are illustrated in the following code:\n\ntype1 <- e1071::skewness(dat$points, type = 1)\nstringr::str_glue(\"Skewness - Type 1: {type1}\")\n\nSkewness - Type 1: 0.905444204379853\n\ntype2 <- e1071::skewness(dat$points, type = 2)\nstringr::str_glue(\"Skewness - Type 2: {type2}\")\n\nSkewness - Type 2: 1.00931792987094\n\ntype3 <- e1071::skewness(dat$points, type = 3)\nstringr::str_glue(\"Skewness - Type 3: {type3}\")\n\nSkewness - Type 3: 0.816426058828937\n\n\nThe default for the e1071 skewness() function is Type 3.\n\n\n\nThe documentation for the kurtosis() function describes three types of kurtosis calculations: Joanes and Gill (1998) discuss three methods for estimating kurtosis:\n\nType 1: This is the typical definition used in many older textbooks\n\n\\[g_2 = m_4/m_2^{2}-3\\]\n\nType 2: Used in SAS and SPSS\n\\[G_2 = ((n+1)g_2+6)*\\frac{(n-1)}{(n-2)(n-3)}\\]\nType 3: Used in MINITAB and BMDP\n\\[b_2 = m_4/s^4-3 = (g_2 + 3)(1-1/n)^2-3\\]\n\nOnly \\(G_2\\) (corresponding to type 2) is unbiased under normality. 
The three methods are illustrated in the following code:\n\n # Kurtosis - Type 1\ntype1 <- e1071::kurtosis(dat$points, type = 1)\nstringr::str_glue(\"Kurtosis - Type 1: {type1}\")\n\nKurtosis - Type 1: -0.583341077124784\n\n# Kurtosis - Type 2\ntype2 <- e1071::kurtosis(dat$points, type = 2)\nstringr::str_glue(\"Kurtosis - Type 2: {type2}\")\n\nKurtosis - Type 2: -0.299156418435587\n\n# Kurtosis - Type 3\ntype3 <- e1071::kurtosis(dat$points, type = 3)\nstringr::str_glue(\"Kurtosis - Type 3: {type3}\")\n\nKurtosis - Type 3: -0.894821560517589\n\n\nThe default for the e1071 kurtosis() function is Type 3.\n\n\n\n\nThe moments package is a well-known package with a variety of statistical functions. The package contains functions for both Skewness and Kurtosis. But these functions provide no “type” option. The skewness() function in the moments package corresponds to Type 1 above. The kurtosis() function uses Pearson’s measure of Kurtosis, which is not an excess kurtosis: it equals the Type 1 value plus 3, so it corresponds to none of the three types in the e1071 package.\n\n library(moments)\n\n # Skewness - Type 1\n moments::skewness(dat$points)\n\n[1] 0.9054442\n\n # [1] 0.9054442\n \n # Kurtosis - Pearson's measure\n moments::kurtosis(dat$points)\n\n[1] 2.416659\n\n # [1] 2.416659\n\nNote that neither function from the moments package matches SAS, which uses the Type 2 definitions.\n\n\n\nThe procs package proc_means() function was written specifically to match SAS, and produces a Type 2 Skewness and Type 2 Kurtosis. This package also produces a data frame output, instead of a scalar value.\n\n library(procs)\n\n # Skewness and Kurtosis - Type 2 \n proc_means(dat, var = points,\n stats = v(skew, kurt))\n\n# A tibble: 1 × 5\n TYPE FREQ VAR SKEW KURT\n <dbl> <int> <chr> <dbl> <dbl>\n1 0 15 points 1.01 -0.299\n\n\n\nThe sasLM package was also written specifically to match SAS. 
The Skewness() function produces a Type 2 Skewness, and the Kurtosis() function a Type 2 Kurtosis.\n\n library(sasLM)\n\n # Skewness - Type 2\n Skewness(dat$points)\n\n[1] 1.009318\n\n # [1] 1.009318\n \n # Kurtosis - Type 2\n Kurtosis(dat$points)\n\n[1] -0.2991564\n\n # [1] -0.2991564"
},
{
- "objectID": "R/kruskal_wallis.html",
- "href": "R/kruskal_wallis.html",
- "title": "Kruskal Wallis R",
+ "objectID": "R/summary_skew_kurt.html#data-used",
+ "href": "R/summary_skew_kurt.html#data-used",
+ "title": "Skewness/Kurtosis",
"section": "",
- "text": "The Kruskal-Wallis test is a non-parametric equivalent to the one-way ANOVA. For this example, the data used is a subset of datasets::iris, testing for difference in sepal width between species of flower.\n\n\n Species Sepal_Width\n1 setosa 3.4\n2 setosa 3.0\n3 setosa 3.4\n4 setosa 3.2\n5 setosa 3.5\n6 setosa 3.1\n7 versicolor 2.7\n8 versicolor 2.9\n9 versicolor 2.7\n10 versicolor 2.6\n11 versicolor 2.5\n12 versicolor 2.5\n13 virginica 3.0\n14 virginica 3.0\n15 virginica 3.1\n16 virginica 3.8\n17 virginica 2.7\n18 virginica 3.3"
+ "text": "The following data was used in this example.\n\n# Create sample data\ndat <- tibble::tribble(\n ~team, ~points, ~assists,\n \"A\", 10, 2,\n \"A\", 17, 5,\n \"A\", 17, 6,\n \"A\", 18, 3,\n \"A\", 15, 0,\n \"B\", 10, 2,\n \"B\", 14, 5,\n \"B\", 13, 4,\n \"B\", 29, 0,\n \"B\", 25, 2,\n \"C\", 12, 1,\n \"C\", 30, 1,\n \"C\", 34, 3,\n \"C\", 12, 4,\n \"C\", 11, 7 \n)"
},
{
- "objectID": "R/kruskal_wallis.html#introduction",
- "href": "R/kruskal_wallis.html#introduction",
- "title": "Kruskal Wallis R",
+ "objectID": "R/summary_skew_kurt.html#package-examination",
+ "href": "R/summary_skew_kurt.html#package-examination",
+ "title": "Skewness/Kurtosis",
"section": "",
- "text": "The Kruskal-Wallis test is a non-parametric equivalent to the one-way ANOVA. For this example, the data used is a subset of datasets::iris, testing for difference in sepal width between species of flower.\n\n\n Species Sepal_Width\n1 setosa 3.4\n2 setosa 3.0\n3 setosa 3.4\n4 setosa 3.2\n5 setosa 3.5\n6 setosa 3.1\n7 versicolor 2.7\n8 versicolor 2.9\n9 versicolor 2.7\n10 versicolor 2.6\n11 versicolor 2.5\n12 versicolor 2.5\n13 virginica 3.0\n14 virginica 3.0\n15 virginica 3.1\n16 virginica 3.8\n17 virginica 2.7\n18 virginica 3.3"
+ "text": "Base R and the stats package have no native functions for Skewness and Kurtosis. It is therefore necessary to use a packaged function to calculate these statistics. The packages examined use three different methods of calculating Skewness, and four different methods for calculating Kurtosis. Of the available packages, the functions in the e1071 package provide the most flexibility, and have options for three of the different methodologies.\n\n\nThe e1071 package contains miscellaneous statistical functions from the Probability Theory Group at the Vienna University of Technology. The package includes functions for both Skewness and Kurtosis, and each function has a “type” parameter to specify the method. There are three available methods for Skewness, and three methods for Kurtosis. A portion of the documentation for these functions is included below:\n\n\nThe documentation for the skewness() function describes three types of skewness calculations: Joanes and Gill (1998) discuss three methods for estimating skewness:\n\nType 1: This is the typical definition used in many older textbooks\n\n\\[g_1 = m_3/m_2^{3/2}\\]\n\nType 2: Used in SAS and SPSS\n\\[\nG_1 = g_1\\sqrt{n(n-1)}/(n-2)\n\\]\nType 3: Used in MINITAB and BMDP\n\\[\nb_1 = m_3/s^3 = g_1((n-1)/n)^{3/2}\n\\]\n\nAll three skewness measures are unbiased under normality. 
The three methods are illustrated in the following code:\n\ntype1 <- e1071::skewness(dat$points, type = 1)\nstringr::str_glue(\"Skewness - Type 1: {type1}\")\n\nSkewness - Type 1: 0.905444204379853\n\ntype2 <- e1071::skewness(dat$points, type = 2)\nstringr::str_glue(\"Skewness - Type 2: {type2}\")\n\nSkewness - Type 2: 1.00931792987094\n\ntype3 <- e1071::skewness(dat$points, type = 3)\nstringr::str_glue(\"Skewness - Type 3: {type3}\")\n\nSkewness - Type 3: 0.816426058828937\n\n\nThe default for the e1071 skewness() function is Type 3.\n\n\n\nThe documentation for the kurtosis() function describes three types of kurtosis calculations: Joanes and Gill (1998) discuss three methods for estimating kurtosis:\n\nType 1: This is the typical definition used in many older textbooks\n\n\\[g_2 = m_4/m_2^{2}-3\\]\n\nType 2: Used in SAS and SPSS\n\\[G_2 = ((n+1)g_2+6)*\\frac{(n-1)}{(n-2)(n-3)}\\]\nType 3: Used in MINITAB and BMDP\n\\[b_2 = m_4/s^4-3 = (g_2 + 3)(1-1/n)^2-3\\]\n\nOnly \\(G_2\\) (corresponding to type 2) is unbiased under normality. The three methods are illustrated in the following code:\n\n # Kurtosis - Type 1\ntype1 <- e1071::kurtosis(dat$points, type = 1)\nstringr::str_glue(\"Kurtosis - Type 1: {type1}\")\n\nKurtosis - Type 1: -0.583341077124784\n\n# Kurtosis - Type 2\ntype2 <- e1071::kurtosis(dat$points, type = 2)\nstringr::str_glue(\"Kurtosis - Type 2: {type2}\")\n\nKurtosis - Type 2: -0.299156418435587\n\n# Kurtosis - Type 3\ntype3 <- e1071::kurtosis(dat$points, type = 3)\nstringr::str_glue(\"Kurtosis - Type 3: {type3}\")\n\nKurtosis - Type 3: -0.894821560517589\n\n\nThe default for the e1071 kurtosis() function is Type 3.\n\n\n\n\nThe moments package is a well-known package with a variety of statistical functions. The package contains functions for both Skewness and Kurtosis. But these functions provide no “type” option. The skewness() function in the moments package corresponds to Type 1 above. 
The kurtosis() function uses Pearson’s measure of Kurtosis (without subtracting 3), which corresponds to none of the three types in the e1071 package; it equals the Type 1 value plus 3.\n\nlibrary(moments)\n\n# Skewness - Type 1\nmoments::skewness(dat$points)\n\n[1] 0.9054442\n\n# Kurtosis - Pearson's measure\nmoments::kurtosis(dat$points)\n\n[1] 2.416659\n\nNote that neither of the functions from the moments package matches SAS.\n\n\n\nThe proc_means() function in the procs package was written specifically to match SAS, and produces a Type 2 Skewness and a Type 2 Kurtosis. It also returns a data frame instead of a scalar value.\n\nlibrary(procs)\n\n# Skewness and Kurtosis - Type 2\nproc_means(dat, var = points,\n stats = v(skew, kurt))\n\n# A tibble: 1 × 5\n TYPE FREQ VAR SKEW KURT\n <dbl> <int> <chr> <dbl> <dbl>\n1 0 15 points 1.01 -0.299\n\n\n\nThe sasLM package was also written specifically to match SAS. The Skewness() function produces a Type 2 Skewness, and the Kurtosis() function a Type 2 Kurtosis.\n\nlibrary(sasLM)\n\n# Skewness - Type 2\nSkewness(dat$points)\n\n[1] 1.009318\n\n# Kurtosis - Type 2\nKurtosis(dat$points)\n\n[1] -0.2991564"
},
{
- "objectID": "R/kruskal_wallis.html#implementing-kruskal-wallis-in-r",
- "href": "R/kruskal_wallis.html#implementing-kruskal-wallis-in-r",
- "title": "Kruskal Wallis R",
- "section": "Implementing Kruskal-Wallis in R",
- "text": "Implementing Kruskal-Wallis in R\nThe Kruskal-Wallis test can be implemented in R using stats::kruskal.test. Below, the test is defined using R’s formula interface (dependent ~ independent variable) and specifying the data set. The null hypothesis is that the samples are from identical populations.\n\nkruskal.test(Sepal_Width~Species, data=iris_sub)\n\n\n Kruskal-Wallis rank sum test\n\ndata: Sepal_Width by Species\nKruskal-Wallis chi-squared = 10.922, df = 2, p-value = 0.004249"
+ "objectID": "R/cmh.html",
+ "href": "R/cmh.html",
+ "title": "CMH Test",
+ "section": "",
+ "text": "The CMH procedure tests for conditional independence in partial contingency tables for a 2 x 2 x K design. However, it can be generalized to tables of X x Y x K dimensions.\n\n\nWe did not find any R package that delivers all the same measures as SAS at once. Therefore, we tried out multiple packages:\n\n\n\n\n\n\n\n\nPackage\nGeneral Association\nRow Means Differ\nNonzero Correlation\nM-H Odds Ratio\nHomogeneity Test\nNote\n\n\n\n\nstats::mantelhaen.test()\n✅\n❌\n❌\n✅\n❌\nWorks well for 2x2xK\n\n\nvcdExtra::CMHtest()\n✅\n✅\n✅\n❌\n❌\nProblems with sparsity, potential bug\n\n\nepiDisplay::mhor()\n❌\n❌\n❌\n✅\n✅\nOR are limited to 2x2xK design\n\n\n\n\n\n\n\n\n\n\nWe will use the CDISC Pilot data set, which is publicly available on the PHUSE Test Data Factory repository. We applied very basic filtering conditions upfront (see below) and this data set served as the basis of the examples to follow.\n\n\n# A tibble: 231 × 36\n STUDYID SITEID SITEGR1 USUBJID TRTSDT TRTEDT TRTP TRTPN AGE AGEGR1\n <chr> <chr> <chr> <chr> <date> <date> <chr> <dbl> <dbl> <chr> \n 1 CDISCP… 701 701 01-701… 2014-01-02 2014-07-02 Plac… 0 63 <65 \n 2 CDISCP… 701 701 01-701… 2012-08-05 2012-09-01 Plac… 0 64 <65 \n 3 CDISCP… 701 701 01-701… 2013-07-19 2014-01-14 Xano… 81 71 65-80 \n 4 CDISCP… 701 701 01-701… 2014-03-18 2014-03-31 Xano… 54 74 65-80 \n 5 CDISCP… 701 701 01-701… 2014-07-01 2014-12-30 Xano… 81 77 65-80 \n 6 CDISCP… 701 701 01-701… 2013-02-12 2013-03-09 Plac… 0 85 >80 \n 7 CDISCP… 701 701 01-701… 2014-01-01 2014-07-09 Xano… 54 68 65-80 \n 8 CDISCP… 701 701 01-701… 2012-09-07 2012-09-16 Xano… 54 81 >80 \n 9 CDISCP… 701 701 01-701… 2012-11-30 2013-01-23 Xano… 54 84 >80 \n10 CDISCP… 701 701 01-701… 2014-03-12 2014-09-09 Plac… 0 52 <65 \n# ℹ 221 more rows\n# ℹ 26 more variables: AGEGR1N <dbl>, RACE <chr>, RACEN <dbl>, SEX <chr>,\n# ITTFL <chr>, EFFFL <chr>, COMP24FL <chr>, AVISIT <chr>, AVISITN <dbl>,\n# VISIT <chr>, VISITNUM <dbl>, ADY <dbl>, ADT <date>, PARAMCD <chr>,\n# PARAM 
<chr>, PARAMN <dbl>, AVAL <dbl>, ANL01FL <chr>, DTYPE <chr>,\n# AWRANGE <chr>, AWTARGET <dbl>, AWTDIFF <dbl>, AWLO <dbl>, AWHI <dbl>,\n# AWU <chr>, QSSEQ <dbl>\n\n\n\n\n\n\n\nThis is included in a base installation of R, as part of the stats package. Requires inputting data as a table or as vectors.\n\nmantelhaen.test(x = data$TRTP, y = data$SEX, z = data$AGEGR1)\n\n\n Cochran-Mantel-Haenszel test\n\ndata: data$TRTP and data$SEX and data$AGEGR1\nCochran-Mantel-Haenszel M^2 = 2.482, df = 2, p-value = 0.2891\n\n\n\n\n\nThe vcdExtra package provides results for the generalized CMH test, for each of the three model it outputs the Chi-square value and the respective p-values. Flexible data input methods available: table or formula (aggregated level data in a data frame).\n\nlibrary(vcdExtra)\n\nLoading required package: vcd\n\n\nLoading required package: grid\n\n\nLoading required package: gnm\n\n\n\nAttaching package: 'vcdExtra'\n\n\nThe following object is masked from 'package:dplyr':\n\n summarise\n\n# Formula: Freq ~ X + Y | K\nCMHtest(Freq ~ TRTP + SEX | AGEGR1 , data=data, overall=TRUE) \n\n$`AGEGR1:<65`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:<65 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.33168 1 0.56467\nrmeans Row mean scores differ 1.52821 2 0.46575\ncmeans Col mean scores differ 0.33168 1 0.56467\ngeneral General association 1.52821 2 0.46575\n\n\n$`AGEGR1:>80`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:>80 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.39433 1 0.53003\nrmeans Row mean scores differ 3.80104 2 0.14949\ncmeans Col mean scores differ 0.39433 1 0.53003\ngeneral General association 3.80104 2 0.14949\n\n\n$`AGEGR1:65-80`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:65-80 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.52744 1 0.46768\nrmeans Row mean scores differ 0.62921 2 0.73008\ncmeans Col mean scores differ 
0.52744 1 0.46768\ngeneral General association 0.62921 2 0.73008\n\n\n$ALL\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n Overall tests, controlling for all strata \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.00086897 1 0.97648\nrmeans Row mean scores differ 2.482 2 0.28909\ncmeans Col mean scores differ 0.00086897 1 0.97648\ngeneral General association 2.482 2 0.28909\n\n\n\n\n\nTo get the M-H common odds ratio and the homogeneity test, the epiDisplay package can be used.\n\nlibrary(epiDisplay) \nmhor(x,y,k, graph = FALSE)\n\n\n\nTo tackle the issue with sparse data, it is recommended that the use of solve() be replaced with MASS::ginv(). This was implemented in a forked version of vcdExtra, which can be installed as follows:\n\ndevtools::install_github(\"mstackhouse/vcdExtra\")\n\nHowever, even the forked version of the vcdExtra package works only up to a certain level of sparsity. For our data, it still works when the data are stratified by the pooled Site ID (SITEGR1, 11 unique values), whereas using the unpooled Site ID (SITEID, 17 unique values) still throws an error. Note: this fork is not up to date and sometimes calculates degrees of freedom incorrectly."
},
{
- "objectID": "R/kruskal_wallis.html#results",
- "href": "R/kruskal_wallis.html#results",
- "title": "Kruskal Wallis R",
- "section": "Results",
- "text": "Results\nAs seen above, R outputs the Kruskal-Wallis rank sum statistic (10.922), the degrees of freedom (2), and the p-value of the test (0.004249). Therefore, the difference in population medians is statistically significant at the 5% level."
+ "objectID": "R/cmh.html#available-r-packages",
+ "href": "R/cmh.html#available-r-packages",
+ "title": "CMH Test",
+ "section": "",
+ "text": "We did not find any R package that delivers all the same measures as SAS at once. Therefore, we tried out multiple packages:\n\n\n\n\n\n\n\n\nPackage\nGeneral Association\nRow Means Differ\nNonzero Correlation\nM-H Odds Ratio\nHomogeneity Test\nNote\n\n\n\n\nstats::mantelhaen.test()\n✅\n❌\n❌\n✅\n❌\nWorks well for 2x2xK\n\n\nvcdExtra::CMHtest()\n✅\n✅\n✅\n❌\n❌\nProblems with sparsity, potential bug\n\n\nepiDisplay::mhor()\n❌\n❌\n❌\n✅\n✅\nOR are limited to 2x2xK design"
},
{
- "objectID": "R/mcnemar.html",
- "href": "R/mcnemar.html",
- "title": "McNemar’s test in R",
+ "objectID": "R/cmh.html#data-used",
+ "href": "R/cmh.html#data-used",
+ "title": "CMH Test",
"section": "",
- "text": "Performing McNemar’s test in R\nTo demonstrate McNemar’s test, data was used concerning the presence or absence of cold symptoms reported by the same children at age 12 and age 14. A total of 2638 participants were involved.\n\nUsing the epibasix::mcnemar function\nTesting for a significant difference in cold symptoms between the two ages using the mcNemar function from the epibasix package can be performed as below. The symptoms for participants at age 12 and age 14 are tabulated and stored as an object, then passed to the mcNemar function. A more complete view of the output is achieved by calling the summary function.\n\nlibrary(epibasix)\n\nX <- table(colds$age12, colds$age14)\nepi_mcn <- mcNemar(X)\nsummary(epi_mcn)\n\n\nMatched Pairs Analysis: McNemar's Statistic and Odds Ratio (Detailed Summary):\n \n \n No Yes\n No 707 256\n Yes 144 212\n\nEntries in above matrix correspond to number of pairs. \n \nMcNemar's Chi^2 Statistic (corrected for continuity) = 30.802 which has a p-value of: 0\nNote: The p.value for McNemar's Test corresponds to the hypothesis test: H0: OR = 1 vs. HA: OR != 1\nMcNemar's Odds Ratio (b/c): 1.778\n95% Confidence Limits for the OR are: [1.449, 2.208]\nThe risk difference is: 0.085\n95% Confidence Limits for the rd are: [0.055, 0.115]\n\n\n\n\nUsing the stats::mcnemar.test function\nMcNemar’s test can also be performed using stats::mcnemar.test as shown below, using the same table X as in the previous section.\n\nmcnemar.test(X)\n\n\n McNemar's Chi-squared test with continuity correction\n\ndata: X\nMcNemar's chi-squared = 30.802, df = 1, p-value = 2.857e-08\n\n\nThe result is shown without continuity correction by specifying correct=FALSE.\n\nmcnemar.test(X, correct=FALSE)\n\n\n McNemar's Chi-squared test\n\ndata: X\nMcNemar's chi-squared = 31.36, df = 1, p-value = 2.144e-08\n\n\n\n\nResults\nAs default, using summary with epibasix::mcNemar gives additional information to the McNemar’s chi-square statistic. 
This includes a table to view proportions, and odds ratio and risk difference with 95% confidence limits. The result uses Edward’s continuity correction without the option to remove this, which is consistent with other functions within the package.\nstats::mcnemar.test uses a continuity correction as default but does allow for this to be removed. This function does not output any other coefficients for agreement or proportions but (if required) these can be achieved within other functions or packages in R."
+ "text": "We will use the CDISC Pilot data set, which is publicly available on the PHUSE Test Data Factory repository. We applied very basic filtering conditions upfront (see below) and this data set served as the basis of the examples to follow.\n\n\n# A tibble: 231 × 36\n STUDYID SITEID SITEGR1 USUBJID TRTSDT TRTEDT TRTP TRTPN AGE AGEGR1\n <chr> <chr> <chr> <chr> <date> <date> <chr> <dbl> <dbl> <chr> \n 1 CDISCP… 701 701 01-701… 2014-01-02 2014-07-02 Plac… 0 63 <65 \n 2 CDISCP… 701 701 01-701… 2012-08-05 2012-09-01 Plac… 0 64 <65 \n 3 CDISCP… 701 701 01-701… 2013-07-19 2014-01-14 Xano… 81 71 65-80 \n 4 CDISCP… 701 701 01-701… 2014-03-18 2014-03-31 Xano… 54 74 65-80 \n 5 CDISCP… 701 701 01-701… 2014-07-01 2014-12-30 Xano… 81 77 65-80 \n 6 CDISCP… 701 701 01-701… 2013-02-12 2013-03-09 Plac… 0 85 >80 \n 7 CDISCP… 701 701 01-701… 2014-01-01 2014-07-09 Xano… 54 68 65-80 \n 8 CDISCP… 701 701 01-701… 2012-09-07 2012-09-16 Xano… 54 81 >80 \n 9 CDISCP… 701 701 01-701… 2012-11-30 2013-01-23 Xano… 54 84 >80 \n10 CDISCP… 701 701 01-701… 2014-03-12 2014-09-09 Plac… 0 52 <65 \n# ℹ 221 more rows\n# ℹ 26 more variables: AGEGR1N <dbl>, RACE <chr>, RACEN <dbl>, SEX <chr>,\n# ITTFL <chr>, EFFFL <chr>, COMP24FL <chr>, AVISIT <chr>, AVISITN <dbl>,\n# VISIT <chr>, VISITNUM <dbl>, ADY <dbl>, ADT <date>, PARAMCD <chr>,\n# PARAM <chr>, PARAMN <dbl>, AVAL <dbl>, ANL01FL <chr>, DTYPE <chr>,\n# AWRANGE <chr>, AWTARGET <dbl>, AWTDIFF <dbl>, AWLO <dbl>, AWHI <dbl>,\n# AWU <chr>, QSSEQ <dbl>"
},
{
- "objectID": "R/nparestimate.html",
- "href": "R/nparestimate.html",
- "title": "Non-parametric point estimation",
+ "objectID": "R/cmh.html#example-code",
+ "href": "R/cmh.html#example-code",
+ "title": "CMH Test",
"section": "",
- "text": "The Hodges-Lehman estimator (Hodges and Lehmann 1962) provides a point estimate which is associated with the Wilcoxon rank sum statistics based on location shift. This is typically used for the 2-sample comparison with small sample size. Note: The Hodges-Lehman estimates the median of the difference and not the difference of the medians. The corresponding distribution-free confidence interval is also based on the Wilcoxon rank sum statistics (Moses).\nThere are several packages covering this functionality. However, we will focus on the wilcox.test function implemented in R base. The {pairwiseCI} package provides further resources to derive various types of confidence intervals for the pairwise comparison case. This package is very flexible and uses the functions of related packages.\nHodges, J. L. and Lehmann, E. L. (1962) Rank methods for combination of independent experiments in analysis of variance. Annals of Mathematical Statistics, 33, 482-4."
+ "text": "mantelhaen.test() is included in a base installation of R, as part of the stats package. It requires the data to be supplied as a table or as vectors.\n\nmantelhaen.test(x = data$TRTP, y = data$SEX, z = data$AGEGR1)\n\n\n Cochran-Mantel-Haenszel test\n\ndata: data$TRTP and data$SEX and data$AGEGR1\nCochran-Mantel-Haenszel M^2 = 2.482, df = 2, p-value = 0.2891\n\n\n\n\n\nThe vcdExtra package provides results for the generalized CMH test: for each of the three models it outputs the chi-square value and the respective p-value. Flexible data input methods are available: a table or a formula (aggregated-level data in a data frame).\n\nlibrary(vcdExtra)\n\nLoading required package: vcd\n\n\nLoading required package: grid\n\n\nLoading required package: gnm\n\n\n\nAttaching package: 'vcdExtra'\n\n\nThe following object is masked from 'package:dplyr':\n\n summarise\n\n# Formula: Freq ~ X + Y | K\nCMHtest(Freq ~ TRTP + SEX | AGEGR1 , data=data, overall=TRUE) \n\n$`AGEGR1:<65`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:<65 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.33168 1 0.56467\nrmeans Row mean scores differ 1.52821 2 0.46575\ncmeans Col mean scores differ 0.33168 1 0.56467\ngeneral General association 1.52821 2 0.46575\n\n\n$`AGEGR1:>80`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:>80 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.39433 1 0.53003\nrmeans Row mean scores differ 3.80104 2 0.14949\ncmeans Col mean scores differ 0.39433 1 0.53003\ngeneral General association 3.80104 2 0.14949\n\n\n$`AGEGR1:65-80`\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n in stratum AGEGR1:65-80 \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.52744 1 0.46768\nrmeans Row mean scores differ 0.62921 2 0.73008\ncmeans Col mean scores differ 0.52744 1 0.46768\ngeneral General association 0.62921 2 0.73008\n\n\n$ALL\nCochran-Mantel-Haenszel Statistics for TRTP by SEX \n Overall tests, controlling for all 
strata \n\n AltHypothesis Chisq Df Prob\ncor Nonzero correlation 0.00086897 1 0.97648\nrmeans Row mean scores differ 2.482 2 0.28909\ncmeans Col mean scores differ 0.00086897 1 0.97648\ngeneral General association 2.482 2 0.28909\n\n\n\n\n\nTo get the M-H common odds ratio and the homogeneity test, the epiDisplay package can be used.\n\nlibrary(epiDisplay) \nmhor(x,y,k, graph = FALSE)\n\n\n\nTo tackle the issue with sparse data, it is recommended that the use of solve() be replaced with MASS::ginv(). This was implemented in a forked version of vcdExtra, which can be installed as follows:\n\ndevtools::install_github(\"mstackhouse/vcdExtra\")\n\nHowever, even the forked version of the vcdExtra package works only up to a certain level of sparsity. For our data, it still works when the data are stratified by the pooled Site ID (SITEGR1, 11 unique values), whereas using the unpooled Site ID (SITEID, 17 unique values) still throws an error. Note: this fork is not up to date and sometimes calculates degrees of freedom incorrectly."
},
{
- "objectID": "R/nparestimate.html#base",
- "href": "R/nparestimate.html#base",
- "title": "Non-parametric point estimation",
- "section": "{base}",
- "text": "{base}\nThe base function provides the Hodges-Lehmann estimate and the Moses confidence interval. The function will provide warnings in case of ties in the data and will not provide the exact confidence interval.\n\nwt <- wilcox.test(x, y, exact = TRUE, conf.int = TRUE)\n\nWarning in wilcox.test.default(x, y, exact = TRUE, conf.int = TRUE): cannot\ncompute exact p-value with ties\n\n\nWarning in wilcox.test.default(x, y, exact = TRUE, conf.int = TRUE): cannot\ncompute exact confidence intervals with ties\n\n# Hodges-Lehmann estimator\nwt$estimate\n\ndifference in location \n 0.5600562 \n\n# Moses confidence interval\nwt$conf.int\n\n[1] -0.3699774 1.1829708\nattr(,\"conf.level\")\n[1] 0.95\n\n\nNote: You can process the long format also for wilcox.test using the formula structure:\n\nwilcox.test(all$value ~ all$treat, exact = TRUE, conf.int = TRUE)\n\nWarning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot\ncompute exact p-value with ties\n\n\nWarning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot\ncompute exact confidence intervals with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: all$value by all$treat\nW = 58, p-value = 0.1329\nalternative hypothesis: true location shift is not equal to 0\n95 percent confidence interval:\n -0.3699774 1.1829708\nsample estimates:\ndifference in location \n 0.5600562"
+ "objectID": "R/nonpara_wilcoxon_ranksum.html",
+ "href": "R/nonpara_wilcoxon_ranksum.html",
+ "title": "Wilcoxon Rank Sum (Mann Whitney-U) in R",
+ "section": "",
+ "text": "The Wilcoxon rank sum test, or equivalently the Mann-Whitney U test, is a rank-based non-parametric method. The aim is to compare two independent groups of observations. Under certain scenarios, it can be thought of as a test for median differences; however, this is only valid when: 1) both samples are independent and identically distributed (same dispersion, same shape, not necessarily normal) and 2) are symmetric around their medians.\nGenerally, with two samples of observations (A and B), the test uses the mean of each possible pair of observations in each group (including the pair of each value with itself) to test if the probability that (A>B) > probability (B>A).\nThe Wilcoxon rank sum test is often presented alongside a Hodges-Lehmann estimate of the pseudo-median (the median of the Walsh averages), and an associated confidence interval for the pseudo-median.\nA tie in the data exists when an observation in group A has the same value as an observation in group B.\n\n\n\nMethods and Formulae\nMann Whitney is not about medians in general\nRelationship between Walsh averages and WRS\nHodges Lehmann Problems\n\n\n\n\nThere are three main implementations of the Wilcoxon rank sum test in R.\n\nstats::wilcox.test()\nasht::wmwTest()\ncoin::wilcox_test()\n\nThe stats package implements various classic statistical tests, including the Wilcoxon rank sum test. Although this is arguably the most commonly used implementation, it cannot compute exact p-values when the data contain ties.\n\n# x and y are two unpaired vectors; they do not need to be of the same length.\nstats::wilcox.test(x, y, paired = FALSE)\n\n\n\n\nData source: Table 30.4, Kirkwood BR. and Sterne JAC. Essentials of medical statistics. Second Edition. 
ISBN 978-0-86542-871-3\nComparison of birth weights (kg) of children born to 15 non-smokers with those of children born to 14 heavy smokers.\n\n# bw_ns: non smokers\n# bw_s: smokers\nbw_ns <- c(3.99, 3.89, 3.6, 3.73, 3.31, \n 3.7, 4.08, 3.61, 3.83, 3.41, \n 4.13, 3.36, 3.54, 3.51, 2.71)\nbw_s <- c(3.18, 2.74, 2.9, 3.27, 3.65, \n 3.42, 3.23, 2.86, 3.6, 3.65, \n 3.69, 3.53, 2.38, 2.34)\n\nWe can visualize the data with two histograms. The red lines indicate the medians.\n\npar(mfrow =c(1,2))\nhist(bw_ns, main = 'Birthweight: non-smokers')\nabline(v = median(bw_ns), col = 'red', lwd = 2)\nhist(bw_s, main = 'Birthweight: smokers')\nabline(v = median(bw_s), col = 'red', lwd = 2)\n\n\n\n\n\n\n\n\nThe histograms show that the median birth weight is higher for non-smokers than for smokers. We can now test this formally with the Wilcoxon rank sum test.\nThe default test is two-sided with a confidence level of 0.95 and applies a continuity correction.\n\n# default is two sided\nstats::wilcox.test(bw_ns, bw_s, paired = F)\n\nWarning in wilcox.test.default(bw_ns, bw_s, paired = F): cannot compute exact\np-value with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: bw_ns and bw_s\nW = 164.5, p-value = 0.01001\nalternative hypothesis: true location shift is not equal to 0\n\n\nWe can also carry out a one-sided test by specifying alternative = 'greater' (the alternative hypothesis being that the first sample is shifted towards greater values than the second).\n\n# one-sided test: alternative = 'greater'\nstats::wilcox.test(bw_ns, bw_s, paired = F, alternative = 'greater')\n\nWarning in wilcox.test.default(bw_ns, bw_s, paired = F, alternative =\n\"greater\"): cannot compute exact p-value with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: bw_ns and bw_s\nW = 164.5, p-value = 0.005003\nalternative hypothesis: true location shift is greater than 0"
},
{
- "objectID": "R/nparestimate.html#pairwiseci",
- "href": "R/nparestimate.html#pairwiseci",
- "title": "Non-parametric point estimation",
- "section": "{pairwiseCI}",
- "text": "{pairwiseCI}\nThe pairwiseCI package requires data to be in a long format to use the formula structure. Via the control parameter the direction can be defined. Setting method to “HL.diff” provides exact confidence intervals together with the Hodges-Lehmann point estimate.\n\n# pairwiseCI is using the formula structure \npairwiseCI(value ~ treat, data = all, \n method=\"HL.diff\", control=\"B\",\n conf.level = .95)\n\n \n95 %-confidence intervals \n Method: Difference in location (Hodges-Lehmann estimator) \n \n \n estimate lower upper\nA - B 0.56 -0.22 1.082"
+ "objectID": "R/nonpara_wilcoxon_ranksum.html#useful-references",
+ "href": "R/nonpara_wilcoxon_ranksum.html#useful-references",
+ "title": "Wilcoxon Rank Sum (Mann Whitney-U) in R",
+ "section": "",
+ "text": "Methods and Formulae\nMann Whitney is not about medians in general\nRelationship between walsh averages and WRS\nHodges Lehmann Problems"
},
{
- "objectID": "R/manova.html",
- "href": "R/manova.html",
- "title": "Multivariate Analysis of Variance in R",
+ "objectID": "R/nonpara_wilcoxon_ranksum.html#available-r-packages",
+ "href": "R/nonpara_wilcoxon_ranksum.html#available-r-packages",
+ "title": "Wilcoxon Rank Sum (Mann Whitney-U) in R",
"section": "",
- "text": "For a detailed description of MANOVA including assumptions see Renesh Bedre\nExample 39.6 Multivariate Analysis of Variance from SAS MANOVA User Guide\nThis example employs multivariate analysis of variance (MANOVA) to measure differences in the chemical characteristics of ancient pottery found at four kiln sites in Great Britain. The data are from Tubb, Parker, and Nickless (1980), as reported in Hand et al. (1994).\nFor each of 26 samples of pottery, the percentages of oxides of five metals are measured. The following statements create the data set and perform a one-way MANOVA. Additionally, it is of interest to know whether the pottery from one site in Wales (Llanederyn) differs from the samples from other sites.\n\nlibrary(tidyverse)\n\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nlibrary(knitr)\nlibrary(emmeans)\n\nWelcome to emmeans.\nCaution: You lose important information if you filter this package's results.\nSee '? 
untidy'\n\nknitr::opts_chunk$set(echo = TRUE, cache = TRUE)\npottery <- read.csv(\"../data/manova1.csv\")\npottery\n\n site al fe mg ca na\n1 Llanederyn 14.4 7.00 4.30 0.15 0.51\n2 Llanederyn 13.8 7.08 3.43 0.12 0.17\n3 Llanederyn 14.6 7.09 3.88 0.13 0.20\n4 Llanederyn 11.5 6.37 5.64 0.16 0.14\n5 Llanederyn 13.8 7.06 5.34 0.20 0.20\n6 Llanederyn 10.9 6.26 3.47 0.17 0.22\n7 Llanederyn 10.1 4.26 4.26 0.20 0.18\n8 Llanederyn 11.6 5.78 5.91 0.18 0.16\n9 Llanederyn 11.1 5.49 4.52 0.29 0.30\n10 Llanederyn 13.4 6.92 7.23 0.28 0.20\n11 Llanederyn 12.4 6.13 5.69 0.22 0.54\n12 Llanederyn 13.1 6.64 5.51 0.31 0.24\n13 Llanederyn 12.7 6.69 4.45 0.20 0.22\n14 Llanederyn 12.5 6.44 3.94 0.22 0.23\n15 Caldicot 11.8 5.44 3.94 0.30 0.04\n16 Caldicot 11.6 5.39 3.77 0.29 0.06\n17 IslandThorns 18.3 1.28 0.67 0.03 0.03\n18 IslandThorns 15.8 2.39 0.63 0.01 0.04\n19 IslandThorns 18.0 1.50 0.67 0.01 0.06\n20 IslandThorns 18.0 1.88 0.68 0.01 0.04\n21 IslandThorns 20.8 1.51 0.72 0.07 0.10\n22 AshleyRails 17.7 1.12 0.56 0.06 0.06\n23 AshleyRails 18.3 1.14 0.67 0.06 0.05\n24 AshleyRails 16.7 0.92 0.53 0.01 0.05\n25 AshleyRails 14.8 2.74 0.67 0.03 0.05\n26 AshleyRails 19.1 1.64 0.60 0.10 0.03\n\n\n1 Perform one way MANOVA\nResponse ID for ANOVA is order of 1=al, 2=fe, 3=mg, ca, na.\nWe are testing H0: group mean vectors are the same for all groups or they dont differ significantly vs\nH1: At least one of the group mean vectors is different from the rest.\n\ndep_vars <- cbind(pottery$al,pottery$fe,pottery$mg, pottery$ca, pottery$na)\nfit <-manova(dep_vars ~ pottery$site)\nsummary.aov(fit)\n\n Response 1 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 175.610 58.537 26.669 1.627e-07 ***\nResiduals 22 48.288 2.195 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n Response 2 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 134.222 44.741 89.883 1.679e-12 ***\nResiduals 22 10.951 0.498 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n Response 3 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 103.35 34.450 49.12 6.452e-10 ***\nResiduals 22 15.43 0.701 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n Response 4 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 0.204703 0.068234 29.157 7.546e-08 ***\nResiduals 22 0.051486 0.002340 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n Response 5 :\n Df Sum Sq Mean Sq F value Pr(>F) \npottery$site 3 0.25825 0.086082 9.5026 0.0003209 ***\nResiduals 22 0.19929 0.009059 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\n‘summary(fit)’ outputs the MANOVA testing of an overall site effect.\nP<0.001 suggests there is an overall difference between the chemical composition of samples from different sites.\n\nsummary(fit)\n\n Df Pillai approx F num Df den Df Pr(>F) \npottery$site 3 1.5539 4.2984 15 60 2.413e-05 ***\nResiduals 22 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n\n2 Now we test to see if the Llanaderyn site is different to the other sites\nNOTE: interest may now lie in using pre-planned contrast statements to investigate if one site differs when compared to the average of the others. You would imagine this could be done using the ‘contrast’ function something like the code below, however this result does not match the SAS user guide and so looks to be doing a different analysis. SUGGEST THIS IS NOT USED UNTIL MORE RESEARCH INTO THIS METHOD CAN BE PERFORMED. 
One alternative suggestion is to perform a linear descriminent analysis (LDA).\n\nmanova(dep_vars ~ pottery$site) %>% \n emmeans(\"site\") %>% \n contrast(method=list(\n \"Llanederyn vs other sites\"= c(\"Llanederyn\"=-3, \"Caldicot\"=1, \"IslandThorns\"=1, \"AshleyRails\"=1)))\n\n contrast estimate SE df t.ratio p.value\n Llanederyn vs other sites 1.51 0.661 22 2.288 0.0321\n\nResults are averaged over the levels of: rep.meas \n\n\nNOTE: if you feel you can help with the above discrepancy please contribute to the CAMIS repo by following the instructions on the contributions page."
+ "text": "There are three main implementations of the Wilcoxon rank sum test in R.\n\nstats::wilcox.test()\nasht::wmwTest()\ncoin::wilcox_test()\n\nThe stats package implements various classic statistical tests, including the Wilcoxon rank sum test. Although this is arguably the most commonly used implementation, it cannot compute exact p-values when the data contain ties.\n\n# x and y are two unpaired vectors; they do not need to be of the same length.\nstats::wilcox.test(x, y, paired = FALSE)"
+ },
+ {
+ "objectID": "R/nonpara_wilcoxon_ranksum.html#example-birth-weight",
+ "href": "R/nonpara_wilcoxon_ranksum.html#example-birth-weight",
+ "title": "Wilcoxon Rank Sum (Mann Whitney-U) in R",
+ "section": "",
+ "text": "Data source: Table 30.4, Kirkwood BR. and Sterne JAC. Essentials of medical statistics. Second Edition. ISBN 978-0-86542-871-3\nComparison of birth weights (kg) of children born to 15 non-smokers with those of children born to 14 heavy smokers.\n\n# bw_ns: non smokers\n# bw_s: smokers\nbw_ns <- c(3.99, 3.89, 3.6, 3.73, 3.31, \n 3.7, 4.08, 3.61, 3.83, 3.41, \n 4.13, 3.36, 3.54, 3.51, 2.71)\nbw_s <- c(3.18, 2.74, 2.9, 3.27, 3.65, \n 3.42, 3.23, 2.86, 3.6, 3.65, \n 3.69, 3.53, 2.38, 2.34)\n\nWe can visualize the data with two histograms. The red lines indicate the medians.\n\npar(mfrow =c(1,2))\nhist(bw_ns, main = 'Birthweight: non-smokers')\nabline(v = median(bw_ns), col = 'red', lwd = 2)\nhist(bw_s, main = 'Birthweight: smokers')\nabline(v = median(bw_s), col = 'red', lwd = 2)\n\n\n\n\n\n\n\n\nThe histograms show that the median birth weight is higher for non-smokers than for smokers. We can now test this formally with the Wilcoxon rank sum test.\nThe default test is two-sided with a confidence level of 0.95 and applies a continuity correction.\n\n# default is two sided\nstats::wilcox.test(bw_ns, bw_s, paired = F)\n\nWarning in wilcox.test.default(bw_ns, bw_s, paired = F): cannot compute exact\np-value with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: bw_ns and bw_s\nW = 164.5, p-value = 0.01001\nalternative hypothesis: true location shift is not equal to 0\n\n\nWe can also carry out a one-sided test by specifying alternative = 'greater' (the alternative hypothesis being that the first sample is shifted towards greater values than the second).\n\n# one-sided test: alternative = 'greater'\nstats::wilcox.test(bw_ns, bw_s, paired = F, alternative = 'greater')\n\nWarning in wilcox.test.default(bw_ns, bw_s, paired = F, alternative =\n\"greater\"): cannot compute exact p-value with ties\n\n\n\n Wilcoxon rank sum test with continuity correction\n\ndata: bw_ns and bw_s\nW = 164.5, p-value = 0.005003\nalternative hypothesis: true location shift is greater than 0"
},
{
"objectID": "R/mi_mar_regression.html",
@@ -1607,7 +1684,7 @@
"href": "python/logistic_regression.html#statsmodels-package",
"title": "Logistic Regression",
"section": "Statsmodels package",
- "text": "Statsmodels package\nWe will use the sm.Logit() method to fit our logistic regression model.\n\n#intercept column\nx_sm = sm.add_constant(x)\n\n#fit model\nlr_sm = sm.Logit(y, x_sm).fit() \nprint(lr_sm.summary())\n\nOptimization terminated successfully.\n Current function value: 0.568825\n Iterations 5\n Logit Regression Results \n==============================================================================\nDep. Variable: wt_grp No. Observations: 167\nModel: Logit Df Residuals: 162\nMethod: MLE Df Model: 4\nDate: Mon, 05 Aug 2024 Pseudo R-squ.: 0.05169\nTime: 10:47:00 Log-Likelihood: -94.994\nconverged: True LL-Null: -100.17\nCovariance Type: nonrobust LLR p-value: 0.03484\n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nconst 3.3576 1.654 2.029 0.042 0.115 6.600\nage -0.0126 0.021 -0.598 0.550 -0.054 0.029\nsex -0.8645 0.371 -2.328 0.020 -1.592 -0.137\nph.ecog 0.4182 0.263 1.592 0.111 -0.097 0.933\nmeal.cal -0.0009 0.000 -1.932 0.053 -0.002 1.27e-05\n==============================================================================\n\n\n\nModel fitting\nIn addition to the information contained in the summary, we can display the model coefficients as odds ratios:\n\nprint(\"Odds ratios for statsmodels logistic regression:\")\nprint(np.exp(lr_sm.params))\n\nOdds ratios for statsmodels logistic regression:\nconst 28.719651\nage 0.987467\nsex 0.421266\nph.ecog 1.519198\nmeal.cal 0.999140\ndtype: float64\n\n\nWe can also provide the 5% confidence intervals for the odds ratios:\n\nprint(\"CI at 5% for statsmodels logistic regression:\")\nprint(np.exp(lr_sm.conf_int(alpha = 0.05)))\n\nCI at 5% for statsmodels logistic regression:\n 0 1\nconst 1.121742 735.301118\nage 0.947449 1.029175\nsex 0.203432 0.872354\nph.ecog 0.907984 2.541852\nmeal.cal 0.998269 1.000013\n\n\n\n\nPrediction\nLet’s use our trained model to 
make a weight loss prediction about a new patient.\n\n# new female, symptomatic but completely ambulatory patient consuming 2500 calories\nnew_pt = pd.DataFrame({\n \"age\": [56],\n \"sex\": [2],\n \"ph.ecog\": [1.00], \n \"meal.cal\": [2500]\n})\n\n# Add intercept term to the new data; for a single row this should be \n# forced using the `add_constant` command\nnew_pt_sm = sm.add_constant(new_pt, has_constant=\"add\")\nprint(\"Probability of weight loss using the statsmodels package:\")\nprint(lr_sm.predict(new_pt_sm))\n\nProbability of weight loss using the statsmodels package:\n0 0.308057\ndtype: float64"
+ "text": "Statsmodels package\nWe will use the sm.Logit() method to fit our logistic regression model.\n\n#intercept column\nx_sm = sm.add_constant(x)\n\n#fit model\nlr_sm = sm.Logit(y, x_sm).fit() \nprint(lr_sm.summary())\n\nOptimization terminated successfully.\n Current function value: 0.568825\n Iterations 5\n Logit Regression Results \n==============================================================================\nDep. Variable: wt_grp No. Observations: 167\nModel: Logit Df Residuals: 162\nMethod: MLE Df Model: 4\nDate: Mon, 05 Aug 2024 Pseudo R-squ.: 0.05169\nTime: 13:27:15 Log-Likelihood: -94.994\nconverged: True LL-Null: -100.17\nCovariance Type: nonrobust LLR p-value: 0.03484\n==============================================================================\n coef std err z P>|z| [0.025 0.975]\n------------------------------------------------------------------------------\nconst 3.3576 1.654 2.029 0.042 0.115 6.600\nage -0.0126 0.021 -0.598 0.550 -0.054 0.029\nsex -0.8645 0.371 -2.328 0.020 -1.592 -0.137\nph.ecog 0.4182 0.263 1.592 0.111 -0.097 0.933\nmeal.cal -0.0009 0.000 -1.932 0.053 -0.002 1.27e-05\n==============================================================================\n\n\n\nModel fitting\nIn addition to the information contained in the summary, we can display the model coefficients as odds ratios:\n\nprint(\"Odds ratios for statsmodels logistic regression:\")\nprint(np.exp(lr_sm.params))\n\nOdds ratios for statsmodels logistic regression:\nconst 28.719651\nage 0.987467\nsex 0.421266\nph.ecog 1.519198\nmeal.cal 0.999140\ndtype: float64\n\n\nWe can also provide the 95% confidence intervals (alpha = 0.05) for the odds ratios:\n\nprint(\"CI at 5% for statsmodels logistic regression:\")\nprint(np.exp(lr_sm.conf_int(alpha = 0.05)))\n\nCI at 5% for statsmodels logistic regression:\n 0 1\nconst 1.121742 735.301118\nage 0.947449 1.029175\nsex 0.203432 0.872354\nph.ecog 0.907984 2.541852\nmeal.cal 0.998269 1.000013\n\n\n\n\nPrediction\nLet’s use our trained model to 
make a weight loss prediction about a new patient.\n\n# new female, symptomatic but completely ambulatory patient consuming 2500 calories\nnew_pt = pd.DataFrame({\n \"age\": [56],\n \"sex\": [2],\n \"ph.ecog\": [1.00], \n \"meal.cal\": [2500]\n})\n\n# Add intercept term to the new data; for a single row the constant column must be \n# forced by passing has_constant=\"add\" to `add_constant`\nnew_pt_sm = sm.add_constant(new_pt, has_constant=\"add\")\nprint(\"Probability of weight loss using the statsmodels package:\")\nprint(lr_sm.predict(new_pt_sm))\n\nProbability of weight loss using the statsmodels package:\n0 0.308057\ndtype: float64"
},
{
"objectID": "python/logistic_regression.html#scikit-learn-package",
@@ -1649,7 +1726,7 @@
"href": "python/ancova.html#ancova-in-python",
"title": "Ancova",
"section": "Ancova in Python",
- "text": "Ancova in Python\nIn Python, Ancova can be performed using the statsmodels library from the scipy package.\n\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\nfrom tabulate import tabulate\n\n# Fit the ANCOVA model\nmodel_ancova = smf.ols('post ~ drug + pre', data=df).fit()\n\n# Summary of the model\nmodel_summary = model_ancova.summary()\nprint(model_summary)\n\n# Extracting glance (summary) information\nmodel_glance = {\n 'r_squared': model_ancova.rsquared,\n 'adj_r_squared': model_ancova.rsquared_adj,\n 'f_statistic': model_ancova.fvalue,\n 'f_pvalue': model_ancova.f_pvalue,\n 'aic': model_ancova.aic,\n 'bic': model_ancova.bic\n}\nmodel_glance_df = pd.DataFrame([model_glance])\nprint(tabulate(model_glance_df, headers='keys', tablefmt='grid'))\n\n# Extracting tidy (coefficients) information\nmodel_tidy = model_ancova.summary2().tables[1]\nprint(tabulate(model_tidy, headers='keys', tablefmt='grid'))\n\n OLS Regression Results \n==============================================================================\nDep. Variable: post R-squared: 0.676\nModel: OLS Adj. R-squared: 0.639\nMethod: Least Squares F-statistic: 18.10\nDate: Mon, 05 Aug 2024 Prob (F-statistic): 1.50e-06\nTime: 10:47:05 Log-Likelihood: -82.054\nNo. Observations: 30 AIC: 172.1\nDf Residuals: 26 BIC: 177.7\nDf Model: 3 \nCovariance Type: nonrobust \n==============================================================================\n coef std err t P>|t| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -3.8808 1.986 -1.954 0.062 -7.964 0.202\ndrug[T.D] 0.1090 1.795 0.061 0.952 -3.581 3.799\ndrug[T.F] 3.4461 1.887 1.826 0.079 -0.432 7.324\npre 0.9872 0.164 6.001 0.000 0.649 1.325\n==============================================================================\nOmnibus: 2.609 Durbin-Watson: 2.526\nProb(Omnibus): 0.271 Jarque-Bera (JB): 2.148\nSkew: 0.645 Prob(JB): 0.342\nKurtosis: 2.765 Cond. No. 
39.8\n==============================================================================\n\nNotes:\n[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n+----+-------------+-----------------+---------------+-------------+---------+---------+\n| | r_squared | adj_r_squared | f_statistic | f_pvalue | aic | bic |\n+====+=============+=================+===============+=============+=========+=========+\n| 0 | 0.676261 | 0.638906 | 18.1039 | 1.50137e-06 | 172.108 | 177.712 |\n+----+-------------+-----------------+---------------+-------------+---------+---------+\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n| | Coef. | Std.Err. | t | P>|t| | [0.025 | 0.975] |\n+===========+===========+============+============+=============+===========+==========+\n| Intercept | -3.88081 | 1.9862 | -1.95388 | 0.0615519 | -7.96351 | 0.201887 |\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n| drug[T.D] | 0.108971 | 1.79514 | 0.0607037 | 0.952059 | -3.58098 | 3.79892 |\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n| drug[T.F] | 3.44614 | 1.88678 | 1.82646 | 0.0792846 | -0.432195 | 7.32447 |\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n| pre | 0.987184 | 0.164498 | 6.00121 | 2.45433e-06 | 0.649054 | 1.32531 |\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n\n\nPlease note that all values match with the corresponding R version, except for the AIC and BIC values, which differ slightly. This should be acceptable for most practical purposes in statistical analysis. 
Currently, there are ongoing discussions in the statsmodels community regarding the computational details of AIC and BIC.\nThe following code can be used to enforce complete consistency of AIC and BIC values with R outputs by adding 1 to the number of parameters:\n\nimport numpy as np\n\n# Manual calculation of AIC and BIC to ensure consistency with R\nn = df.shape[0] # number of observations\nk = model_ancova.df_model + 1 # number of parameters (including intercept)\nlog_lik = model_ancova.llf # log-likelihood\n\n# Adjusted number of parameters (including scale parameter)\nk_adjusted = k + 1\n\n# Manually calculate AIC and BIC to match R's behavior\naic_adjusted = 2 * k_adjusted - 2 * log_lik\nbic_adjusted = np.log(n) * k_adjusted - 2 * log_lik\n\nprint(f\"Number of observations (n): {n}\")\nprint(f\"Number of parameters (k_adjusted): {k_adjusted}\")\nprint(f\"Log-likelihood: {log_lik}\")\nprint(f\"AIC (adjusted): {aic_adjusted}\")\nprint(f\"BIC (adjusted): {bic_adjusted}\")\n\nNumber of observations (n): 30\nNumber of parameters (k_adjusted): 5.0\nLog-likelihood: -82.0537744890265\nAIC (adjusted): 174.107548978053\nBIC (adjusted): 181.11353588636376\n\n\nThere are different types of anova computations. The statsmodels.stats.anova.anova_lm function allows the types 1, 2 and 3. 
The code to compute these types is depicted below:\n\nimport statsmodels.formula.api as smf\nimport statsmodels.stats.anova as ssa\n\n# Center the predictor for Type III anova\n#df['pre_centered'] = df['pre'] - df['pre'].mean()\n\n# Fit the model for types I and II anova\nmodel = smf.ols('post ~ C(drug) + pre', data=df).fit()\n\n# Perform anova for types I and II\nancova_table_type_1 = ssa.anova_lm(model, typ=1)\nancova_table_type_2 = ssa.anova_lm(model, typ=2)\n\n# Fit the model for Type III anova with centered predictors\nmodel_type_3 = smf.ols('post ~ C(drug) + pre', data=df).fit()\nancova_table_type_3 = ssa.anova_lm(model_type_3, typ=3)\n\n# Calculate SSd (sum of squares for residuals)\nssd_type1 = ancova_table_type_1['sum_sq'].loc['Residual']\nssd_type2 = ancova_table_type_2['sum_sq'].loc['Residual']\nssd_type3 = ancova_table_type_3['sum_sq'].loc['Residual']\n\n# Calculate ges\nancova_table_type_1['ges'] = ancova_table_type_1['sum_sq'] / (ancova_table_type_1['sum_sq'] + ssd_type1)\nancova_table_type_2['ges'] = ancova_table_type_2['sum_sq'] / (ancova_table_type_2['sum_sq'] + ssd_type2)\nancova_table_type_3['ges'] = ancova_table_type_3['sum_sq'] / (ancova_table_type_3['sum_sq'] + ssd_type3)\n\n# Add SSd column\nancova_table_type_1['SSd'] = ssd_type1\nancova_table_type_2['SSd'] = ssd_type2\nancova_table_type_3['SSd'] = ssd_type3\n\n# Add significance column\nancova_table_type_1['p<0.05'] = ancova_table_type_1['PR(>F)'] < 0.05\nancova_table_type_2['p<0.05'] = ancova_table_type_2['PR(>F)'] < 0.05\nancova_table_type_3['p<0.05'] = ancova_table_type_3['PR(>F)'] < 0.05\n\n# Rename columns to match the R output\nancova_table_type_1.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)\nancova_table_type_1.reset_index(inplace=True)\nancova_table_type_1.rename(columns={'index': 'Effect'}, inplace=True)\n\nancova_table_type_2.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, 
inplace=True)\nancova_table_type_2.reset_index(inplace=True)\nancova_table_type_2.rename(columns={'index': 'Effect'}, inplace=True)\n\nancova_table_type_3.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)\nancova_table_type_3.reset_index(inplace=True)\nancova_table_type_3.rename(columns={'index': 'Effect'}, inplace=True)\n\n# Calculate DFd (degrees of freedom for residuals)\ndfd_type1 = ancova_table_type_1.loc[ancova_table_type_1['Effect'] == 'Residual', 'DFn'].values[0]\ndfd_type2 = ancova_table_type_2.loc[ancova_table_type_2['Effect'] == 'Residual', 'DFn'].values[0]\ndfd_type3 = ancova_table_type_3.loc[ancova_table_type_3['Effect'] == 'Residual', 'DFn'].values[0]\nancova_table_type_1['DFd'] = dfd_type1\nancova_table_type_2['DFd'] = dfd_type2\nancova_table_type_3['DFd'] = dfd_type3\n\n# Filter out the Residual row\nancova_table_type_1 = ancova_table_type_1[ancova_table_type_1['Effect'] != 'Residual']\nancova_table_type_2 = ancova_table_type_2[ancova_table_type_2['Effect'] != 'Residual']\nancova_table_type_3 = ancova_table_type_3[ancova_table_type_3['Effect'] != 'Residual']\n\n# Select and reorder columns to match the R output\nancova_table_type_1 = ancova_table_type_1[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]\nancova_table_type_2 = ancova_table_type_2[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]\nancova_table_type_3 = ancova_table_type_3[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]"
+ "text": "Ancova in Python\nIn Python, Ancova can be performed using the statsmodels library.\n\nimport statsmodels.api as sm\nimport statsmodels.formula.api as smf\nfrom tabulate import tabulate\n\n# Fit the ANCOVA model\nmodel_ancova = smf.ols('post ~ drug + pre', data=df).fit()\n\n# Summary of the model\nmodel_summary = model_ancova.summary()\nprint(model_summary)\n\n# Extracting glance (summary) information\nmodel_glance = {\n 'r_squared': model_ancova.rsquared,\n 'adj_r_squared': model_ancova.rsquared_adj,\n 'f_statistic': model_ancova.fvalue,\n 'f_pvalue': model_ancova.f_pvalue,\n 'aic': model_ancova.aic,\n 'bic': model_ancova.bic\n}\nmodel_glance_df = pd.DataFrame([model_glance])\nprint(tabulate(model_glance_df, headers='keys', tablefmt='grid'))\n\n# Extracting tidy (coefficients) information\nmodel_tidy = model_ancova.summary2().tables[1]\nprint(tabulate(model_tidy, headers='keys', tablefmt='grid'))\n\n OLS Regression Results \n==============================================================================\nDep. Variable: post R-squared: 0.676\nModel: OLS Adj. R-squared: 0.639\nMethod: Least Squares F-statistic: 18.10\nDate: Mon, 05 Aug 2024 Prob (F-statistic): 1.50e-06\nTime: 13:27:21 Log-Likelihood: -82.054\nNo. Observations: 30 AIC: 172.1\nDf Residuals: 26 BIC: 177.7\nDf Model: 3 \nCovariance Type: nonrobust \n==============================================================================\n coef std err t P>|t| [0.025 0.975]\n------------------------------------------------------------------------------\nIntercept -3.8808 1.986 -1.954 0.062 -7.964 0.202\ndrug[T.D] 0.1090 1.795 0.061 0.952 -3.581 3.799\ndrug[T.F] 3.4461 1.887 1.826 0.079 -0.432 7.324\npre 0.9872 0.164 6.001 0.000 0.649 1.325\n==============================================================================\nOmnibus: 2.609 Durbin-Watson: 2.526\nProb(Omnibus): 0.271 Jarque-Bera (JB): 2.148\nSkew: 0.645 Prob(JB): 0.342\nKurtosis: 2.765 Cond. No. 
39.8\n==============================================================================\n\nNotes:\n[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n+----+-------------+-----------------+---------------+-------------+---------+---------+\n| | r_squared | adj_r_squared | f_statistic | f_pvalue | aic | bic |\n+====+=============+=================+===============+=============+=========+=========+\n| 0 | 0.676261 | 0.638906 | 18.1039 | 1.50137e-06 | 172.108 | 177.712 |\n+----+-------------+-----------------+---------------+-------------+---------+---------+\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n| | Coef. | Std.Err. | t | P>|t| | [0.025 | 0.975] |\n+===========+===========+============+============+=============+===========+==========+\n| Intercept | -3.88081 | 1.9862 | -1.95388 | 0.0615519 | -7.96351 | 0.201887 |\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n| drug[T.D] | 0.108971 | 1.79514 | 0.0607037 | 0.952059 | -3.58098 | 3.79892 |\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n| drug[T.F] | 3.44614 | 1.88678 | 1.82646 | 0.0792846 | -0.432195 | 7.32447 |\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n| pre | 0.987184 | 0.164498 | 6.00121 | 2.45433e-06 | 0.649054 | 1.32531 |\n+-----------+-----------+------------+------------+-------------+-----------+----------+\n\n\nPlease note that all values match with the corresponding R version, except for the AIC and BIC values, which differ slightly. This should be acceptable for most practical purposes in statistical analysis. 
Currently, there are ongoing discussions in the statsmodels community regarding the computational details of AIC and BIC.\nThe following code can be used to enforce complete consistency of AIC and BIC values with R outputs by adding 1 to the number of parameters:\n\nimport numpy as np\n\n# Manual calculation of AIC and BIC to ensure consistency with R\nn = df.shape[0] # number of observations\nk = model_ancova.df_model + 1 # number of parameters (including intercept)\nlog_lik = model_ancova.llf # log-likelihood\n\n# Adjusted number of parameters (including scale parameter)\nk_adjusted = k + 1\n\n# Manually calculate AIC and BIC to match R's behavior\naic_adjusted = 2 * k_adjusted - 2 * log_lik\nbic_adjusted = np.log(n) * k_adjusted - 2 * log_lik\n\nprint(f\"Number of observations (n): {n}\")\nprint(f\"Number of parameters (k_adjusted): {k_adjusted}\")\nprint(f\"Log-likelihood: {log_lik}\")\nprint(f\"AIC (adjusted): {aic_adjusted}\")\nprint(f\"BIC (adjusted): {bic_adjusted}\")\n\nNumber of observations (n): 30\nNumber of parameters (k_adjusted): 5.0\nLog-likelihood: -82.0537744890265\nAIC (adjusted): 174.107548978053\nBIC (adjusted): 181.11353588636376\n\n\nThere are different types of anova computations. The statsmodels.stats.anova.anova_lm function allows the types 1, 2 and 3. 
The code to compute these types is shown below:\n\nimport statsmodels.formula.api as smf\nimport statsmodels.stats.anova as ssa\n\n# Center the predictor for Type III anova\n#df['pre_centered'] = df['pre'] - df['pre'].mean()\n\n# Fit the model for types I and II anova\nmodel = smf.ols('post ~ C(drug) + pre', data=df).fit()\n\n# Perform anova for types I and II\nancova_table_type_1 = ssa.anova_lm(model, typ=1)\nancova_table_type_2 = ssa.anova_lm(model, typ=2)\n\n# Fit the model for Type III anova (the centering step above is commented out, so 'pre' is used uncentered)\nmodel_type_3 = smf.ols('post ~ C(drug) + pre', data=df).fit()\nancova_table_type_3 = ssa.anova_lm(model_type_3, typ=3)\n\n# Calculate SSd (sum of squares for residuals)\nssd_type1 = ancova_table_type_1['sum_sq'].loc['Residual']\nssd_type2 = ancova_table_type_2['sum_sq'].loc['Residual']\nssd_type3 = ancova_table_type_3['sum_sq'].loc['Residual']\n\n# Calculate ges\nancova_table_type_1['ges'] = ancova_table_type_1['sum_sq'] / (ancova_table_type_1['sum_sq'] + ssd_type1)\nancova_table_type_2['ges'] = ancova_table_type_2['sum_sq'] / (ancova_table_type_2['sum_sq'] + ssd_type2)\nancova_table_type_3['ges'] = ancova_table_type_3['sum_sq'] / (ancova_table_type_3['sum_sq'] + ssd_type3)\n\n# Add SSd column\nancova_table_type_1['SSd'] = ssd_type1\nancova_table_type_2['SSd'] = ssd_type2\nancova_table_type_3['SSd'] = ssd_type3\n\n# Add significance column\nancova_table_type_1['p<0.05'] = ancova_table_type_1['PR(>F)'] < 0.05\nancova_table_type_2['p<0.05'] = ancova_table_type_2['PR(>F)'] < 0.05\nancova_table_type_3['p<0.05'] = ancova_table_type_3['PR(>F)'] < 0.05\n\n# Rename columns to match the R output\nancova_table_type_1.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)\nancova_table_type_1.reset_index(inplace=True)\nancova_table_type_1.rename(columns={'index': 'Effect'}, inplace=True)\n\nancova_table_type_2.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, 
inplace=True)\nancova_table_type_2.reset_index(inplace=True)\nancova_table_type_2.rename(columns={'index': 'Effect'}, inplace=True)\n\nancova_table_type_3.rename(columns={'sum_sq': 'SSn', 'df': 'DFn', 'F': 'F', 'PR(>F)': 'p'}, inplace=True)\nancova_table_type_3.reset_index(inplace=True)\nancova_table_type_3.rename(columns={'index': 'Effect'}, inplace=True)\n\n# Calculate DFd (degrees of freedom for residuals)\ndfd_type1 = ancova_table_type_1.loc[ancova_table_type_1['Effect'] == 'Residual', 'DFn'].values[0]\ndfd_type2 = ancova_table_type_2.loc[ancova_table_type_2['Effect'] == 'Residual', 'DFn'].values[0]\ndfd_type3 = ancova_table_type_3.loc[ancova_table_type_3['Effect'] == 'Residual', 'DFn'].values[0]\nancova_table_type_1['DFd'] = dfd_type1\nancova_table_type_2['DFd'] = dfd_type2\nancova_table_type_3['DFd'] = dfd_type3\n\n# Filter out the Residual row\nancova_table_type_1 = ancova_table_type_1[ancova_table_type_1['Effect'] != 'Residual']\nancova_table_type_2 = ancova_table_type_2[ancova_table_type_2['Effect'] != 'Residual']\nancova_table_type_3 = ancova_table_type_3[ancova_table_type_3['Effect'] != 'Residual']\n\n# Select and reorder columns to match the R output\nancova_table_type_1 = ancova_table_type_1[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]\nancova_table_type_2 = ancova_table_type_2[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]\nancova_table_type_3 = ancova_table_type_3[['Effect', 'DFn', 'DFd', 'SSn', 'SSd', 'F', 'p', 'p<0.05', 'ges']]"
},
{
"objectID": "python/ancova.html#type-1-ancova-in-python",