<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<title>Basic care and feeding of data in R</title>
<script src="libs/jquery-1.11.0/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link href="libs/bootstrap-2.3.2/css/united.min.css" rel="stylesheet" />
<link href="libs/bootstrap-2.3.2/css/bootstrap-responsive.min.css" rel="stylesheet" />
<script src="libs/bootstrap-2.3.2/js/bootstrap.min.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
href="libs/highlight/default.css"
type="text/css" />
<script src="libs/highlight/highlight.js"></script>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
window.setTimeout(function() {
hljs.initHighlighting();
}, 0);
}
</script>
<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
</style>
<div class="container-fluid main-container">
<header>
<div class="nav">
<a class="nav-logo" href="index.html">
<img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
</a>
<ul>
<li class="home"><a href="index.html">Home</a></li>
<li class="faq"><a href="faq.html">FAQ</a></li>
<li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
<li class="topics"><a href="topics.html">Topics</a></li>
<li class="people"><a href="people.html">People</a></li>
</ul>
</div>
</header>
<div id="header">
<h1 class="title">Basic care and feeding of data in R</h1>
</div>
<div id="TOC">
<ul>
<li><a href="#buckle-your-seatbelt">Buckle your seatbelt</a></li>
<li><a href="#get-the-gapminder-data">Get the Gapminder data</a></li>
<li><a href="#create-a-data.frame-via-import">Create a data.frame via import</a></li>
<li><a href="#look-at-the-variables-inside-a-data.frame">Look at the variables inside a data.frame</a></li>
<li><a href="#subset-is-a-nice-way-to-isolate-bits-of-data.frames-and-other-things"><code>subset()</code> is a nice way to isolate bits of data.frames (and other things)</a></li>
<li><a href="#review-of-data.frames-and-the-best-ways-to-exploit-them">Review of data.frames and the best ways to exploit them</a></li>
</ul>
</div>
<div id="buckle-your-seatbelt" class="section level3">
<h3>Buckle your seatbelt</h3>
<p><em>Ignore if you don’t need this bit of support.</em></p>
<p>Now is the time to make sure you are working in an appropriate directory on your computer, probably through the use of an <a href="block002_hello-r-workspace-wd-project.html">RStudio Project</a>. Enter <code>getwd()</code> in the Console to see the current working directory or, in RStudio, look at the bar at the top of the Console, where it is displayed.</p>
<p>You should clean out your workspace. In RStudio, click on the “Clear” broom icon from the Environment tab or use Session > Clear Workspace. You can also enter <code>rm(list = ls())</code> in the Console to accomplish the same thing.</p>
<p>Now restart R. This will ensure you don’t have any packages loaded from previous calls to <code>library()</code>. In RStudio, use Session > Restart R. Otherwise, quit R with <code>q()</code> and re-launch it.</p>
<p>Why do we do this? So that the code you write is complete and re-runnable. If you return to a clean slate often, you will root out hidden dependencies where one snippet of code only works because it relies on objects created by code saved elsewhere or, much worse, never saved at all. Similarly, an aggressive clean slate approach will expose any usage of packages that have not been explicitly loaded.</p>
<p>Finally, open a new R script and develop and run your code from there. In RStudio, use File > New File > R Script. Save this script with a name ending in <code>.r</code> or <code>.R</code>, containing no spaces or other funny stuff, and that evokes whatever it is we’re doing today. Example: <code>session03_data-aggregation.r</code>.</p>
</div>
<div id="get-the-gapminder-data" class="section level3">
<h3>Get the Gapminder data</h3>
<p>We will work with some of the data from the <a href="http://www.gapminder.org">Gapminder project</a>. Here is an excerpt prepared for your use. Please save this file locally, for example, in the directory associated with your RStudio Project:</p>
<ul>
<li><a href="http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt">http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt</a></li>
<li>Nicer URL: <a href="http://tiny.cc/gapminder">http://tiny.cc/gapminder</a></li>
</ul>
<p>You should now have a plain text file called <code>gapminderDataFiveYear.txt</code> on your computer, in your working directory. Do this to confirm:</p>
<pre class="r"><code>list.files()</code></pre>
<p>If you <strong>don’t</strong> see <code>gapminderDataFiveYear.txt</code> there, DEAL WITH THAT BEFORE YOU MOVE ON.</p>
</div>
<div id="create-a-data.frame-via-import" class="section level3">
<h3>Create a data.frame via import</h3>
<p>In real life you will usually bring data into R from an outside file. This rarely goes smoothly for “wild caught” datasets, which have little gremlins lurking in them that complicate import and require cleaning. Since this is not our focus today, we will work with a “domesticated” dataset JB uses a lot in teaching, an extract from the Gapminder data Hans Rosling has popularized.</p>
<blockquote>
<p>Assumption: The file <code>gapminderDataFiveYear.txt</code> is saved on your computer and available for reading in R’s current working directory.</p>
</blockquote>
<p>Bring the data into R. Note that RStudio’s tab completion facilities can help you with filenames, as well as function and object names. Try it out!</p>
<pre class="r"><code>gDat <- read.delim("gapminderDataFiveYear.txt")</code></pre>
<p>One can also read data directly from a URL, though this is more of a party trick than a great general strategy.</p>
<pre class="r"><code>## data import from URL
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
## the nicer, shorter URL is an alternative; uncomment to use it instead
## gdURL <- "http://tiny.cc/gapminder"
gDat <- read.delim(file = gdURL)</code></pre>
<p>The R function <code>read.table()</code> is the main workhorse for importing rectangular spreadsheet-y data into an R data.frame. Use it. Read the documentation. There you will learn about handy wrappers around <code>read.table()</code>, such as <code>read.delim()</code> where many arguments have been preset to anticipate some common file formats. Competent use of the many arguments of <code>read.table()</code> can eliminate a very great deal of agony and post-import fussing around.</p>
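<p>To make the wrapper relationship concrete, here is a minimal sketch. It writes a tiny, made-up tab-delimited file to a temporary location (the countries and numbers are invented for illustration and are not Gapminder data) and reads it two ways: with <code>read.delim()</code>, and with <code>read.table()</code> plus the arguments that <code>read.delim()</code> presets.</p>
<pre class="r"><code>## write a small, invented tab-delimited file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("country\tyear\tlifeExp",
             "Aland\t1952\t50.1",
             "Bland\t1957\t61.3"), tmp)

## read it with the wrapper ...
viaDelim <- read.delim(tmp)

## ... and with the workhorse, spelling out what the wrapper presets
viaTable <- read.table(tmp, header = TRUE, sep = "\t", quote = "\"",
                       dec = ".", fill = TRUE, comment.char = "")

identical(viaDelim, viaTable)
## [1] TRUE</code></pre>
<p>If your own file needs different settings (another separator, another decimal mark, comment lines), these are the arguments to look up in the documentation first.</p>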
<p>Whenever you have rectangular, spreadsheet-y data, your default data receptacle in R is a data.frame. Do not depart from this without good reason. data.frames are awesome because…</p>
<ul>
<li>data.frames package related variables neatly together,
<ul>
<li>keeping them in sync vis-a-vis row order</li>
<li>applying any filtering of observations uniformly</li>
</ul></li>
<li>most functions for inference, modelling, and graphing are happy to be passed a data.frame via a <code>data =</code> argument as the place to find the variables you’re working on; the latest and greatest packages actually <strong>require</strong> that your data be in a data.frame</li>
<li>data.frames – unlike general arrays or, specifically, matrices in R – can hold variables of different flavors (heuristic term defined later), such as character data (subject ID or name), quantitative data (white blood cell count), and categorical information (treated vs. untreated)</li>
</ul>
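<p>The last point is easy to see with a toy example. The three subjects below are entirely hypothetical. Binding a character and a numeric vector into a matrix forces everything to one type, whereas a data.frame lets each variable keep its own.</p>
<pre class="r"><code>## hypothetical toy data: three subjects
id <- c("s01", "s02", "s03")
wbc <- c(5.2, 7.1, 6.3)
treated <- factor(c("treated", "untreated", "treated"))

## a matrix can hold only one type, so the numbers become character strings
asMatrix <- cbind(id, wbc)
typeof(asMatrix)
## [1] "character"

## a data.frame keeps each variable's own flavor
asDf <- data.frame(id, wbc, treated)
class(asDf$wbc)
## [1] "numeric"
class(asDf$treated)
## [1] "factor"</code></pre>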
<p>Get an overview of the object we just created with <code>str()</code> which displays the structure of an object. It will provide a sensible description of almost anything and, worst case, nothing bad can actually happen. When in doubt, just <code>str()</code> some of the recently created objects to get some ideas about what to do next.</p>
<pre class="r"><code>str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...</code></pre>
<p>We could print the whole thing to screen (not so useful with datasets of any size) but it’s nicer to look at the first bit or the last bit or a random snippet (I’ve written a function <code>peek()</code> to look at some random rows).</p>
<pre class="r"><code>head(gDat)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
tail(gDat)
## country year pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982 7636524 Africa 60.363 788.8550
## 1700 Zimbabwe 1987 9216418 Africa 62.351 706.1573
## 1701 Zimbabwe 1992 10704340 Africa 60.377 693.4208
## 1702 Zimbabwe 1997 11404948 Africa 46.809 792.4500
## 1703 Zimbabwe 2002 11926563 Africa 39.989 672.0386
## 1704 Zimbabwe 2007 12311143 Africa 43.487 469.7093
#peek(gDat) # you won't have this function!</code></pre>
<p><em>JB note to self: possible sidebar constructing <code>peek()</code> here</em></p>
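<p>In the meantime, here is one possible stand-in. This is a hypothetical sketch, not JB's actual <code>peek()</code>: it assumes all we want is a handful of randomly chosen rows, shown in their original order. It is demonstrated on the built-in <code>mtcars</code> data so that it runs without any file on disk.</p>
<pre class="r"><code>## hypothetical stand-in for peek(): show n randomly chosen rows,
## kept in their original order
peek <- function(df, n = 7) {
  stopifnot(is.data.frame(df))
  n <- min(n, nrow(df))                   # never ask for more rows than exist
  keep <- sort(sample(nrow(df), size = n))
  df[keep, , drop = FALSE]
}

peek(mtcars, n = 5)   # rows differ from call to call</code></pre>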
<p>More ways to query basic info on a data.frame. Note: with some of the commands below we’re benefitting from the fact that even though data.frames are technically NOT matrices, it’s usually fine to think of them that way and many functions have reasonable methods for both types of input.</p>
<pre class="r"><code>names(gDat)# variable or column names
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
ncol(gDat)
## [1] 6
length(gDat)
## [1] 6
head(rownames(gDat)) # boring, in this case
## [1] "1" "2" "3" "4" "5" "6"
dim(gDat)
## [1] 1704 6
nrow(gDat)
## [1] 1704
#dimnames(gDat) # ill-advised here ... too many rows</code></pre>
<p>A statistical overview can be obtained with <code>summary()</code>.</p>
<pre class="r"><code>summary(gDat)
## country year pop continent
## Afghanistan: 12 Min. :1952 Min. :6.001e+04 Africa :624
## Albania : 12 1st Qu.:1966 1st Qu.:2.794e+06 Americas:300
## Algeria : 12 Median :1980 Median :7.024e+06 Asia :396
## Angola : 12 Mean :1980 Mean :2.960e+07 Europe :360
## Argentina : 12 3rd Qu.:1993 3rd Qu.:1.959e+07 Oceania : 24
## Australia : 12 Max. :2007 Max. :1.319e+09
## (Other) :1632
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
## </code></pre>
<p>Although we haven’t begun our formal coverage of visualization yet, it’s so important for smell-testing a dataset that we will make a few figures anyway. Here we use only base R graphics, which are very basic.</p>
<pre class="r"><code>plot(lifeExp ~ year, gDat)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/first-plots-base-R1.png" /></p>
<pre class="r"><code>plot(lifeExp ~ gdpPercap, gDat)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/first-plots-base-R2.png" /></p>
<pre class="r"><code>plot(lifeExp ~ log(gdpPercap), gDat)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/first-plots-base-R3.png" /></p>
<!-- This is a non-sequitur here ... where came from originally?
> Sidebar on equals: A single equal sign `=` is most commonly used to specify values of arguments when calling functions in R, e.g. `group = continent`. It can be used for assignment but we advise against that, in favor of `<-`. A double equal sign `==` is a binary comparison operator, akin to less than `<` or greater than `>`, returning the logical value `TRUE` in the case of equality and `FALSE` otherwise. Although you may not yet understand exactly why, `subset = country == "Colombia"` restricts operation -- scatterplotting, in above examples -- to observations where the country is Colombia.
-->
<p>Let’s go back to the result of <code>str()</code> to talk about data.frames and vectors in R.</p>
<pre class="r"><code>str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...</code></pre>
<p>A data.frame is a special case of a <em>list</em>, which is used in R to hold just about anything. data.frames are the special case where the length of each list component is the same. data.frames are superior to matrices in R because they can hold vectors of different flavors (heuristic term explained below), e.g. numeric, character, and categorical data can be stored together. This comes up a lot.</p>
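<p>A quick check of the list claim, using a small made-up data.frame rather than <code>gDat</code> so that it runs anywhere:</p>
<pre class="r"><code>## a small, invented data.frame
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

is.list(df)                 # TRUE: a data.frame is a list ...
is.matrix(df)               # FALSE: ... and not a matrix
length(df)                  # 2: one list component per variable
unique(sapply(df, length))  # 3: every component has the same length</code></pre>
<p>This is also why <code>length()</code> on a data.frame returns the number of variables, not the number of rows.</p>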
</div>
<div id="look-at-the-variables-inside-a-data.frame" class="section level3">
<h3>Look at the variables inside a data.frame</h3>
<p>To specify a single variable from a data.frame, use the dollar sign <code>$</code>. Let’s explore the numeric variable for life expectancy.</p>
<pre class="r"><code>head(gDat$lifeExp)
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
summary(gDat$lifeExp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.60 48.20 60.71 59.47 70.85 82.60
hist(gDat$lifeExp)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/histogram-lifeExp.png" /></p>
<p>The year variable is a numeric integer variable, but since there are so few unique values it also functions a bit like a categorical variable.</p>
<pre class="r"><code>summary(gDat$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1952 1966 1980 1980 1993 2007
table(gDat$year)
##
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## 142 142 142 142 142 142 142 142 142 142 142 142</code></pre>
<p>The variables for country and continent hold truly categorical information, which is stored as a <em>factor</em> in R.</p>
<pre class="r"><code>class(gDat$continent)
## [1] "factor"
summary(gDat$continent)
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
levels(gDat$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
nlevels(gDat$continent)
## [1] 5</code></pre>
<p>The <strong>levels</strong> of the factor <code>continent</code> are “Africa”, “Americas”, etc. and this is what’s usually presented to your eyeballs by R. In general, the levels are friendly human-readable character strings, like “male/female” and “control/treated”. But never ever ever forget that, under the hood, R is really storing integer codes 1, 2, 3, etc. Look at the result from <code>str(gDat$continent)</code> if you are skeptical.</p>
<pre class="r"><code>str(gDat$continent)
## Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...</code></pre>
<p>This <a href="http://en.wikipedia.org/wiki/Janus">Janus</a>-like nature of factors means they are rich with booby traps for the unsuspecting, but they are a necessary evil. I recommend you resolve to learn how to properly care for and feed factors. The pros far outweigh the cons. Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit.</p>
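<p>Here is one classic booby trap, sketched with a made-up factor whose levels happen to look like numbers. Converting it directly to a number returns the integer codes, not the values you see printed.</p>
<pre class="r"><code>## an invented factor whose levels look numeric
f <- factor(c("10", "20", "10", "30"))

as.integer(f)                 # the integer codes under the hood
## [1] 1 2 1 3

as.numeric(as.character(f))   # go through the levels to get the values
## [1] 10 20 10 30</code></pre>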
<p>Here we count how many observations are associated with each continent and, as usual, try to portray that info visually. This makes it much easier to quickly see that African countries are well represented in this dataset.</p>
<pre class="r"><code>table(gDat$continent)
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
barplot(table(gDat$continent))</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/tabulate-continent.png" /></p>
<p>In the figures below, we see how factors can be put to work in figures. The <code>continent</code> factor is easily mapped into “facets” or colors and a legend by the <code>ggplot2</code> package. <em>Making figures with <code>ggplot2</code> is covered elsewhere so feel free to just sit back and enjoy these plots or blindly copy/paste.</em></p>
<pre class="r"><code>## install ggplot2 if you don't have it!
## install.packages("ggplot2")
library(ggplot2)
p <- ggplot(subset(gDat, continent != "Oceania"),
aes(x = gdpPercap, y = lifeExp)) # just initializes
p <- p + scale_x_log10() # log the x axis the right way
p + geom_point() # scatterplot
p + geom_point(aes(color = continent)) # map continent to color
p + geom_point(alpha = (1/3), size = 3) + geom_smooth(lwd = 3, se = FALSE)
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
p + geom_point(alpha = (1/3), size = 3) + facet_wrap(~ continent)</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/factors-nice-for-plots1.png" title="" alt="" width="49%" /><img src="block006_care-feeding-data_files/figure-html/factors-nice-for-plots2.png" title="" alt="" width="49%" /><img src="block006_care-feeding-data_files/figure-html/factors-nice-for-plots3.png" title="" alt="" width="49%" /><img src="block006_care-feeding-data_files/figure-html/factors-nice-for-plots4.png" title="" alt="" width="49%" /></p>
</div>
<div id="subset-is-a-nice-way-to-isolate-bits-of-data.frames-and-other-things" class="section level3">
<h3><code>subset()</code> is a nice way to isolate bits of data.frames (and other things)</h3>
<p>Logical little pieces of data.frames are useful for sanity checking, prototyping visualizations or computations for later scale-up, etc. Many functions are happy to restrict their operations to a subset of observations via a formal <code>subset =</code> argument. There is a stand-alone function, also confusingly called <code>subset()</code>, that can isolate pieces of an object for inspection or assignment. Although <code>subset()</code> can work on objects other than data.frames, we focus on that usage here.</p>
<p>The <code>subset()</code> function has a <code>subset =</code> argument (sorry, not my fault it’s so confusing) for specifying which observations to keep. This expression will be evaluated within the specified data.frame, which is non-standard but convenient.</p>
<pre class="r"><code>subset(gDat, subset = country == "Uruguay")
## country year pop continent lifeExp gdpPercap
## 1621 Uruguay 1952 2252965 Americas 66.071 5716.767
## 1622 Uruguay 1957 2424959 Americas 67.044 6150.773
## 1623 Uruguay 1962 2598466 Americas 68.253 5603.358
## 1624 Uruguay 1967 2748579 Americas 68.468 5444.620
## 1625 Uruguay 1972 2829526 Americas 68.673 5703.409
## 1626 Uruguay 1977 2873520 Americas 69.481 6504.340
## 1627 Uruguay 1982 2953997 Americas 70.805 6920.223
## 1628 Uruguay 1987 3045153 Americas 71.918 7452.399
## 1629 Uruguay 1992 3149262 Americas 72.752 8137.005
## 1630 Uruguay 1997 3262838 Americas 74.223 9230.241
## 1631 Uruguay 2002 3363085 Americas 75.307 7727.002
## 1632 Uruguay 2007 3447496 Americas 76.384 10611.463</code></pre>
<p>Contrast the above command with this one accomplishing the same thing:</p>
<pre class="r"><code>gDat[1621:1632, ]
## country year pop continent lifeExp gdpPercap
## 1621 Uruguay 1952 2252965 Americas 66.071 5716.767
## 1622 Uruguay 1957 2424959 Americas 67.044 6150.773
## 1623 Uruguay 1962 2598466 Americas 68.253 5603.358
## 1624 Uruguay 1967 2748579 Americas 68.468 5444.620
## 1625 Uruguay 1972 2829526 Americas 68.673 5703.409
## 1626 Uruguay 1977 2873520 Americas 69.481 6504.340
## 1627 Uruguay 1982 2953997 Americas 70.805 6920.223
## 1628 Uruguay 1987 3045153 Americas 71.918 7452.399
## 1629 Uruguay 1992 3149262 Americas 72.752 8137.005
## 1630 Uruguay 1997 3262838 Americas 74.223 9230.241
## 1631 Uruguay 2002 3363085 Americas 75.307 7727.002
## 1632 Uruguay 2007 3447496 Americas 76.384 10611.463</code></pre>
<p>Yes, these both return the same result. But the second command is horrible for these reasons:</p>
<ul>
<li>It contains <a href="http://en.wikipedia.org/wiki/Magic_number_(programming)">Magic Numbers</a>. The reason for keeping rows 1621 to 1632 will be non-obvious to someone else and that includes <strong>you</strong> in a couple of weeks.</li>
<li>It is fragile. If the rows of <code>gDat</code> are reordered or if some observations are eliminated, these rows may no longer correspond to the Uruguay data.</li>
</ul>
<p>In contrast, the first command, using <code>subset()</code>, is self-documenting; one does not need to be an R expert to take a pretty good guess at what’s happening. It’s also more robust. It will still produce the correct result even if <code>gDat</code> has undergone some reasonable set of transformations.</p>
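<p>Indexing with a logical condition inside square brackets is another way to avoid Magic Numbers, but it differs from <code>subset()</code> in one respect worth knowing: missing values. The sketch below uses an invented toy data.frame with one missing country. <code>subset()</code> silently drops rows where the condition is <code>NA</code>, while bare bracket indexing returns an extra row filled with <code>NA</code>s.</p>
<pre class="r"><code>## invented toy data with one missing country
toy <- data.frame(country = c("Aland", "Bland", NA, "Bland"),
                  year = c(1952, 1952, 1957, 1957),
                  lifeExp = c(50.1, 61.3, 52.4, 63.0))

nrow(subset(toy, subset = country == "Bland"))
## [1] 2

nrow(toy[toy$country == "Bland", ])          # includes an all-NA row
## [1] 3

nrow(toy[which(toy$country == "Bland"), ])   # which() discards the NA
## [1] 2</code></pre>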
<p>The <code>subset()</code> function can also be used to select certain variables via the <code>select</code> argument. It also offers unusual flexibility, so you can, for example, provide the names of variables you wish to keep without surrounding by quotes. I suppose this is mostly a good thing, but even the documentation stresses that the <code>subset()</code> function is intended for interactive use (which I interpret more broadly to mean data analysis, as opposed to programming).</p>
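<p>A short sketch of that flexibility, again on an invented toy data.frame. Inside <code>select =</code>, variable names can be given unquoted, negated to drop them, or joined with <code>:</code> to take a run of adjacent columns.</p>
<pre class="r"><code>## invented toy data
toy <- data.frame(country = c("Aland", "Bland"),
                  year = c(1952, 1957),
                  pop = c(1200, 3400),
                  lifeExp = c(50.1, 61.3))

names(subset(toy, select = c(country, lifeExp)))   # keep by unquoted name
## [1] "country" "lifeExp"

names(subset(toy, select = -pop))                  # drop one variable
## [1] "country" "year"    "lifeExp"

names(subset(toy, select = country:pop))           # a run of adjacent variables
## [1] "country" "year"    "pop"</code></pre>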
<p>You can use <code>subset =</code> and <code>select =</code> together to simultaneously filter rows and columns or variables.</p>
<pre class="r"><code>subset(gDat, subset = country == "Mexico",
select = c(country, year, lifeExp))
## country year lifeExp
## 985 Mexico 1952 50.789
## 986 Mexico 1957 55.190
## 987 Mexico 1962 58.299
## 988 Mexico 1967 60.110
## 989 Mexico 1972 62.361
## 990 Mexico 1977 65.032
## 991 Mexico 1982 67.405
## 992 Mexico 1987 69.498
## 993 Mexico 1992 71.455
## 994 Mexico 1997 73.670
## 995 Mexico 2002 74.902
## 996 Mexico 2007 76.195</code></pre>
<!---
TO DO: CLEAN UP AND UN-COMMENT THESE EXERCISES FOR THE STUDENT
Let's get the data for just 2007.
How many rows?
How many observations per continent?
Scatterplot life expectancy against GDP per capita.
Variants of that: indicate continent by color, do for just one continent, do for multiple continents at once but in separate plots
```r
hDat <- subset(gDat, subset = year == 2007)
str(hDat)
## 'data.frame': 142 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ pop : num 31889923 3600523 33333216 12420476 40301927 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
## $ lifeExp : num 43.8 76.4 72.3 42.7 75.3 ...
## $ gdpPercap: num 975 5937 6223 4797 12779 ...
table(hDat$continent)
##
## Africa Americas Asia Europe Oceania
## 52 25 33 30 2
#xyplot(lifeExp ~ gdpPercap, hDat)
#xyplot(lifeExp ~ gdpPercap, hDat, group = continent, auto.key = TRUE)
#xyplot(lifeExp ~ gdpPercap | continent, hDat)
```
## if you want just some rows and/or just some variables, for inspection or to
## assign as a new object, use subset()
subset(gDat, subset = country == "Cambodia")
subset(gDat, subset = country %in% c("Japan", "Belgium"))
subset(gDat, subset = year == 1952)
subset(gDat, subset = country == "Uruguay", select = c(country, year, lifeExp))
plot(lifeExp ~ year, gDat, subset = country == "Zimbabwe")
plot(lifeExp ~ log(gdpPercap), gDat, subset = year == 2007)
## exercise:
## get data for which life expectancy is less than 32 years
## assign to an object
## how many rows? how many observations per continent?
--->
<p>Many of the functions for inference, modelling, and graphics that permit you to specify a data.frame via <code>data =</code> also offer a <code>subset =</code> argument that limits the computation to certain observations. Here’s an example of subsetting the data to make a plot just for Colombia and a similar call to <code>lm()</code> for fitting a linear model to just the data from Colombia.</p>
<pre class="r"><code>p <- ggplot(subset(gDat, country == "Colombia"), aes(x = year, y = lifeExp))
p + geom_point() + geom_smooth(lwd = 1, se = FALSE, method = "lm")</code></pre>
<p><img src="block006_care-feeding-data_files/figure-html/just-colombia.png" /></p>
<pre class="r"><code>(minYear <- min(gDat$year))
## [1] 1952
myFit <- lm(lifeExp ~ I(year - minYear), gDat, subset = country == "Colombia")
summary(myFit)
##
## Call:
## lm(formula = lifeExp ~ I(year - minYear), data = gDat, subset = country ==
## "Colombia")
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7841 -0.3816 0.1840 0.8413 1.8034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.42712 0.71223 75.01 4.33e-15 ***
## I(year - minYear) 0.38075 0.02194 17.36 8.54e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.312 on 10 degrees of freedom
## Multiple R-squared: 0.9679, Adjusted R-squared: 0.9647
## F-statistic: 301.3 on 1 and 10 DF, p-value: 8.537e-09</code></pre>
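<p>The <code>subset =</code> argument and the <code>subset()</code> function lead to the same fit. Here is a minimal check on the built-in <code>mtcars</code> data (chosen only so the snippet runs without the Gapminder file); the same comparison could be made with <code>gDat</code> and Colombia.</p>
<pre class="r"><code>## restrict via the subset = argument of lm() ...
fitArg <- lm(mpg ~ wt, data = mtcars, subset = cyl == 4)

## ... or by handing lm() an already-subsetted data.frame
fitFun <- lm(mpg ~ wt, data = subset(mtcars, cyl == 4))

all.equal(coef(fitArg), coef(fitFun))
## [1] TRUE</code></pre>
<p>The first form keeps the full data.frame in play and records the restriction in the model call itself, which is the style recommended in the review below.</p>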
</div>
<div id="review-of-data.frames-and-the-best-ways-to-exploit-them" class="section level3">
<h3>Review of data.frames and the best ways to exploit them</h3>
<p>Use data.frames!!!</p>
<p>The most modern, slick way to work with data.frames is with <code>dplyr</code>. Two later tutorials introduce this exciting new (2014) package:</p>
<ul>
<li><a href="block009_dplyr-intro.html">Introduction to dplyr</a></li>
<li><a href="block010_dplyr-end-single-table.html"><code>dplyr</code> functions for a single dataset</a></li>
</ul>
<p>Work within your data.frames by passing them to the <code>data =</code> argument of functions that offer that. If you need to restrict operations, use the <code>subset =</code> argument. Do computations or make figures <em>in situ</em> – don’t create little copies and excerpts of your data. This will leave a cleaner workspace and cleaner code.</p>
<p>This workstyle leaves behind code that is also fairly self-documenting, e.g.,</p>
<pre class="r"><code>lm(lifeExp ~ year, gDat, subset = country == "Colombia")
plot(lifeExp ~ year, gDat, subset = country == "Colombia")</code></pre>
<p>The availability and handling of <code>data =</code> and <code>subset =</code> arguments is broad enough – though sadly not universal – that sometimes you can even copy and paste these argument specifications, for example, from an exploratory plotting command into a model-fitting command. Consistent use of this convention also makes you faster at writing and reading such code.</p>
<p>Two important practices:</p>
<ul>
<li>give variables short informative names (<code>lifeExp</code> versus “X5”)</li>
<li>refer to variables by name, not by column number</li>
</ul>
<p>This will produce code that is self-documenting and more robust. Variable names often propagate to downstream outputs like figures and numerical tables and therefore good names have a positive multiplier effect throughout an analysis.</p>
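<p>A small illustration of the robustness claim, using invented numbers. After the columns are reordered, the reference by name still finds life expectancy, while the reference by column number quietly picks up a different variable.</p>
<pre class="r"><code>## invented toy data
toy <- data.frame(year = c(1952, 1957), lifeExp = c(50.1, 52.4))

## the same data with the columns in a different order
reordered <- toy[, c("lifeExp", "year")]

## by name: same answer before and after reordering
mean(toy$lifeExp) == mean(reordered$lifeExp)
## [1] TRUE

## by position: column 2 is now year, so the answer changes
mean(toy[[2]]) == mean(reordered[[2]])
## [1] FALSE</code></pre>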
<p>If a function doesn’t have a <code>data =</code> argument where you can provide a data.frame, you can fake it with <code>with()</code>. <code>with()</code> helps you avoid the creation of temporary, confusing little partial copies of your data. Use it – possibly in combination with <code>subset()</code> – to do specific computations without creating all the intermediate temporary objects you have no lasting interest in. <code>with()</code> is also useful if you are tempted to use <code>attach()</code> in order to save some typing. <strong>Never ever use <code>attach()</code>. It is evil.</strong> If you’ve never heard of it, consider yourself lucky.</p>
<p>Example: How would you compute the correlation of life expectancy and GDP per capita for the country of Colombia? The <code>cor()</code> function sadly does not offer the usual <code>data =</code> and <code>subset =</code> arguments. Here’s a nice way to combine <code>with()</code> and <code>subset()</code> to accomplish this without unnecessary object creation and with fairly readable code.</p>
<pre class="r"><code>with(subset(gDat, subset = country == "Colombia"),
cor(lifeExp, gdpPercap))
## [1] 0.9514699</code></pre>
</div>
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>