This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

Merge pull request #36 from Nima-Jamshidi/milestone-03
final changes without changing explore_data.R
Nima-Jamshidi authored Mar 17, 2020
2 parents b777c6f + 0d6833e commit b947d1e
Showing 13 changed files with 42 additions and 33 deletions.
Binary file modified data/linear_model/augmented.rds
Binary file not shown.
Binary file modified data/linear_model/glanced.rds
Binary file not shown.
Binary file modified data/linear_model/model.rds
Binary file not shown.
Binary file modified data/linear_model/tidied.rds
Binary file not shown.
33 changes: 21 additions & 12 deletions docs/milestone3.Rmd
@@ -5,10 +5,8 @@ date: "14/03/2020"
 output:
   bookdown::html_document2:
     toc: true
-    keep_md: true
   bookdown::pdf_document2:
     toc: true
-
 ---

 ```{r setup, include=FALSE}
@@ -49,15 +47,21 @@ This dataset explains the medical insurance costs of a small sample of the USA p
 ```{r load the data, echo=FALSE}
 # import the data
 costs <- read_csv(
-  here("data", "raw", "data.csv"),
+  here("data", "processed", "processed_data.csv"),
   col_types = cols(
-    age = col_integer(),
-    sex = readr::col_factor(),
-    bmi = col_double(),
-    children = col_integer(),
-    smoker = readr::col_factor(),
-    region = readr::col_factor(),
-    charges = col_double()
+    age = col_double(),
+    sex = col_character(),
+    bmi = col_double(),
+    children = col_double(),
+    smoker = col_character(),
+    region = col_character(),
+    sex_dummy = col_double(),
+    smoker_dummy = col_double(),
+    southeast = col_double(),
+    southwest = col_double(),
+    northwest = col_double(),
+    northeast = col_double(),
+    charges = col_double()
   )
 )
 ```
@@ -81,13 +85,18 @@ Here is a summary of the dataset, and the values of each variable (Table \@ref(t

 ```{r summary, echo=FALSE}
 options(knitr.kable.NA="")
-kable(summary(costs), caption = "summary of the dataset")
+kable(summary(costs %>% select(-sex_dummy,-smoker_dummy,-northeast,-northwest,-southeast,-southwest)), caption = "summary of the dataset")
 ```
+```{r correlation, include=FALSE}
+costs_correlations <- costs %>%
+  select(-sex, -smoker, -region) %>% # drop the character columns; cor() needs numeric input
+  cor()
+```

 Next, we inspect the dataset for correlations between the variables. From now on, we treat charges as the dependent variable.
 To analyze the correlation between variables, each categorical variable with two categories is translated into a binary vector. The only categorical variable with more than two categories is region. We split it into four binary vectors, each indicating whether an observation belongs to that category (1) or not (0).

-After using dummy variables for sex, smoker, and region, the correlogram shown in Figure \@ref(fig:corrplot-png) indicates that smoker and charges have the strongest correlation, 0.79. No high collinearity between the independent variables is observed.
+After using dummy variables for sex, smoker, and region, the correlogram shown in Figure \@ref(fig:corrplot-png) indicates that smoker and charges have the strongest correlation, `r round(costs_correlations[5,10],2)`. No high collinearity between the independent variables is observed.


 ```{r corrplot-png, echo = FALSE, fig.cap="Correlation plot", fig.align = 'center', out.width='75%', out.height='75%'}
30 changes: 15 additions & 15 deletions docs/milestone3.html

Large diffs are not rendered by default.

Binary file modified docs/milestone3.pdf
Binary file not shown.
Binary file modified images/lmplot001.png
Binary file modified images/lmplot002.png
Binary file modified images/lmplot003.png
Binary file modified images/lmplot004.png
Binary file modified images/lmplot005.png
12 changes: 6 additions & 6 deletions scripts/linear_model.R
@@ -50,22 +50,22 @@ main <- function(processed_data,image_path,lm_path) {
   if (!dir.exists(here(lm_path))) {
     dir.create(here(lm_path), recursive = TRUE)
   }
-
+  #read the processed data.
   data <- read.csv(here(processed_data))
-
+  #conduct linear regression
   model <-
     lm(charges ~ age + sex + bmi + children + smoker + region, data = data)
-
+  #plot the first 4 diagnostics graphs
   plots <-
     png(here(image_path, "lmplot%03d.png"))
   plot(model, ask = FALSE)
   dev.off()

-
+  #linear regression statistics
   glanced <- glance(model)
   tidied <- tidy(model)
   augmented <- augment(model)
-
+  #plot the fifth diagnostics graph
   augmented %>%
     ggplot(aes(x = .fitted, y = charges)) +
     geom_point() +
@@ -84,7 +84,7 @@ main <- function(processed_data,image_path,lm_path) {
     )
   )

-
+  #save the statistics in separate .rds files
   flist <-
     list(
       model = model,
