---
title: "See from the Sky: Examining Cloud Detection Algorithms on the Arctic MISR Data"
author: "Yicheng Shen ([email protected]) & Yunhong Bao ([email protected])"
date: "`r format(Sys.time(), '%d %B, %Y')`"
header-includes:
- \setlength{\parskip}{0em}
- \setlength{\parindent}{2em}
- \usepackage{indentfirst}
- \usepackage{float}
output:
pdf_document:
extra_dependencies: ["float"]
number_sections: true
bibliography: cloud.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = F, cache = F, warning = F, message = F)
library(mosaic)
library(caret)
library(gridExtra)
library(kableExtra)
library(GGally)
library(ggcorrplot)
library(corrplot)
library(MASS)
library(e1071)
library(class)
library(tree)
library(MLmetrics)
library(randomForest)
library(gbm)
library(xgboost)
library(pROC)
ggplot2::theme_set(ggplot2::theme_bw())
knitr::opts_chunk$set(fig.align = 'center')
knitr::opts_chunk$set(fig.pos = "H")
image_1 <- read.table("image_data/imagem1.txt")
image_2 <- read.table("image_data/imagem2.txt")
image_3 <- read.table("image_data/imagem3.txt")
var_name <- c("Y_coord", "X_coord", "Cloud", "NDAI", "SD", "CORR", "DF","CF","BF","AF","AN")
colnames(image_1) = colnames(image_2) = colnames(image_3) <- var_name
image_1$Cloud <- factor(image_1$Cloud)
image_2$Cloud <- factor(image_2$Cloud)
image_3$Cloud <- factor(image_3$Cloud)
image_1$logSD <- log(image_1$SD)
image_2$logSD <- log(image_2$SD)
image_3$logSD <- log(image_3$SD)
Image_1 <- image_1 %>% filter(Cloud != "0") %>% mutate(Cloud = factor(Cloud))
Image_2 <- image_2 %>% filter(Cloud != "0") %>% mutate(Cloud = factor(Cloud))
Image_3 <- image_3 %>% filter(Cloud != "0") %>% mutate(Cloud = factor(Cloud))
All_Image <- rbind(Image_1, Image_2, Image_3)
load("Image_all_Cut.RData")
load("Image_all_KM.RData")
Image_Train_Cut <- Image_all_Cut %>% filter(Type != "Test")
Image_Test_Cut <- Image_all_Cut %>% filter(Type == "Test")
Image_Train_KM <- Image_all_KM %>% filter(Type != "Test")
Image_Test_KM <- Image_all_KM %>% filter(Type == "Test")
```
\section{Data Collection and Exploration}
\subsection{Background \& Motivation}
<!-- (a) Write a half-page summary of the paper, including at least the purpose of the study, the data, the collection method, its conclusions and potential impact. -->
With the global climate becoming increasingly extreme, scientists are turning to new technologies to better understand the environment, especially in the Arctic region.
Detecting clouds in satellite images has become an important task, as cloud coverage is closely related to surface air temperatures and atmospheric carbon dioxide levels. Yet it remains a challenging problem because clouds can look very similar to the snow- and ice-covered surfaces beneath them.
In this study, we examine various classification methods, build reliable models that detect the presence of clouds in Arctic satellite images from the available features, and evaluate our models' performance.
The data come from a study by @yu2008. The researchers collected them with the cameras of the Multiangle Imaging SpectroRadiometer (MISR) launched by NASA. The data are in the form of image pixels, with each MISR pixel covering a 275 m by 275 m region on the ground.
Since standard cloud-classification frameworks were not readily applicable, their goal was also to build operational cloud detection algorithms that can efficiently process the massive MISR data set one data unit at a time, without requiring human intervention or expert labeling.
@yu2008 proposed two algorithms: an enhanced linear correlation matching (ELCM) algorithm based on thresholding three features, with thresholds either fixed or data-adaptive, and an ELCM algorithm combined with Fisher's quadratic discriminant analysis (ELCM-QDA).
Their results suggest that both proposed algorithms are computationally efficient enough for operational processing of the massive MISR data sets, and that the accuracy and coverage of ELCM are better and more informative than those of the conventional MISR operational algorithms. These findings provide important implications and foundations for further analysis of MISR data.
\subsection{Data Description}
<!-- (b) Summarize the data, i.e., % of pixels for the different classes. Plot well-labeled beautiful maps using x, y coordinates the expert labels with color of the region based on the expert labels. Do you observe some trend/pattern? Is an i.i.d. assumption for the samples justified for this dataset? -->
```{r show map, fig.cap="\\label{fig:map3}Maps of the three images with expert labels. White represents high-confidence cloud, gray represents high-confidence clear, and the dark background corresponds to unlabeled pixels.",fig.width = 12, fig.height = 4, out.width="85%"}
# Palette_1 <- c("gray","black", "white")
#
# map_1 <- image_1 %>%
# ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Cloud), size = 0.5) +
# theme_map() + theme(legend.position = "none") +
# labs(title = "Image 1", x = "X Coordinate", y = "Y Coordinate") +
# scale_colour_manual(values = Palette_1)
#
# map_2 <- image_2 %>%
# ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Cloud), size = 0.5) +
# theme_map() + theme(legend.position = "none") +
# labs(title = "Image 2", x = "X Coordinate", y = "Y Coordinate") +
# scale_colour_manual(values = Palette_1)
#
# map_3 <- image_3 %>%
# ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Cloud), size = 0.5) +
# theme_map() + theme(legend.position = "none") +
# labs(title = "Image 3", x = "X Coordinate", y = "Y Coordinate") +
# scale_colour_manual(values = Palette_1)
#
# grid.arrange(arrangeGrob(map_1 + ggeasy::easy_center_title() + theme(legend.position="none"),
# map_2 + ggeasy::easy_center_title() + theme(legend.position="none"),
# map_3 + ggeasy::easy_center_title() + theme(legend.position="none"),
# nrow = 1))
Palette_2 <- c("gray", "white")
map1 <- image_1 %>%
filter(Cloud != 0) %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Cloud), size = 0.5) +
theme_dark() + labs(title = "Image 1", x = "X Coordinate", y = "Y Coordinate") +
scale_colour_manual(values = Palette_2)
map2 <- image_2 %>%
filter(Cloud != 0) %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Cloud), size = 0.5) +
theme_dark() + labs(title = "Image 2", x = "X Coordinate", y = "Y Coordinate") +
scale_colour_manual(values = Palette_2)
map3 <- image_3 %>%
filter(Cloud != 0) %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Cloud), size = 0.5) +
theme_dark() + labs(title = "Image 3", x = "X Coordinate", y = "Y Coordinate") +
scale_colour_manual(values = Palette_2)
map_legend <- lemon::g_legend(map1 + guides(colour = guide_legend(nrow = 1)))
grid.arrange(arrangeGrob(map1 + ggeasy::easy_center_title() + theme(legend.position="none"),
map2 + ggeasy::easy_center_title() + theme(legend.position="none"),
map3 + ggeasy::easy_center_title() + theme(legend.position="none"),
nrow = 1),
map_legend, nrow = 2, heights = c(10, 1))
```
In this study, we focus primarily on three MISR images, containing 115110, 115229, and 115217 pixels respectively. However, not all pixels carry a confident expert classification. As shown in Figure \ref{fig:map3}, considerable portions of the images are unlabeled. A noticeable pattern is that experts tend to leave unlabeled the areas surrounding what they recognize as clouds, which is understandable: the border regions between cloudy and clear surfaces are the most difficult to determine.
We can also observe that labeled clouds tend to cluster in chunks. Adjacent pixels' labels are therefore not independent and instead appear strongly positively correlated. For example, if all of a pixel's neighbors are labeled as cloud, that pixel is very likely to be labeled as part of the cloud as well.
In Table 1, we present the percentages of expert labels in each image.
Since we have no confident expert opinion on unlabeled pixels, they are viewed as missing/unknown values.
After removing unlabeled pixels, we have 82148, 70917, and 54996 pixels in each image, with available information (labels and features).
```{r}
t1 <- image_1 %>% summarise(`Cloud Labels` = mean(Cloud == "1"), `Clear Labels` = mean(Cloud == "-1"), `Unknown` = mean(Cloud == "0")) %>% mutate(Image = "1")
t2 <- image_2 %>% summarise(`Cloud Labels` = mean(Cloud == "1"), `Clear Labels` = mean(Cloud == "-1"), `Unknown` = mean(Cloud == "0")) %>% mutate(Image = "2")
t3 <- image_3 %>% summarise(`Cloud Labels` = mean(Cloud == "1"), `Clear Labels` = mean(Cloud == "-1"), `Unknown` = mean(Cloud == "0")) %>% mutate(Image = "3")
knitr::kable(rbind(t1, t2, t3),
caption = "Proportions of Cloudy and Clear Surfaces by Expert Labeling",
booktabs = T ) %>%
kable_styling(latex_options = "HOLD_position", position = "center", font_size = 8)
```
Specifically, each pixel has eight features. The first three are physically useful features for characterizing the scattering properties of ice- and snow-covered surfaces: the correlation (`CORR`) of MISR images of the same scene from different MISR viewing directions, the standard deviation (`SD`) of MISR nadir camera pixel values across a scene, and a normalized difference angular index (`NDAI`) that characterizes how a scene changes with the MISR view direction.
The remaining five features are the radiances recorded at five view zenith angles: $70.5^{\circ}$ (`DF`), $60.0^{\circ}$ (`CF`), $45.6^{\circ}$ (`BF`), and $26.1^{\circ}$ (`AF`) in the forward direction, and $0.0^{\circ}$ (`AN`) in the nadir direction.
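For intuition, `NDAI` contrasts the forward-viewing and nadir radiances; as an illustration of the construction in @yu2008 (stated here only schematically, not as our own derivation), it takes roughly the form
\begin{equation}
\text{NDAI} \approx \frac{\bar{X}_{\text{forward}} - X_{\text{AN}}}{\bar{X}_{\text{forward}} + X_{\text{AN}}},
\end{equation}
where $\bar{X}_{\text{forward}}$ averages the forward-viewing radiances and $X_{\text{AN}}$ is the nadir radiance. This is consistent with the higher `NDAI` values we observe for cloudy pixels in the EDA below.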
\subsection{Exploratory Data Analysis}
<!-- (c) Perform a visual and quantitative EDA of the dataset, e.g., summarizing (i) pair- wise relationship between the features themselves and (ii) the relationship between the expert labels with the individual features. Do you notice differences between the two classes (cloud, no cloud) based on the radiance or other features (CORR, NDAI, SD)? -->
Conducting an EDA on the available features gives a preliminary picture of the relationships among the potential predictors, as well as between the predictors and the response. First, Figure \ref{fig:corrplot} shows that most features are positively correlated with one another. The correlation is particularly strong among the five radiance angles; for example, `BF`, `AF`, and `AN` are very strongly correlated. We also see positive correlations between `NDAI`, `CORR`, and `SD`.
However, the correlations between the first three features (`NDAI`, `CORR`, and `SD`) and the five radiance angles (`DF`, `CF`, `BF`, `AF`, `AN`) are mostly negative.
```{r pairwise covariate, fig.cap="\\label{fig:corrplot} Pair-wise correlations between the eight available features.", out.width="45%"}
All_Image |>
dplyr::select(-Y_coord, -X_coord, -Cloud) |>
cor() |>
ggcorrplot(hc.order = TRUE)
# All_Image %>%
# dplyr::select(-Y_coord, -X_coord, -Cloud) %>%
# cor(use="pairwise.complete.obs") %>%
# ggcorrplot(show.diag = F, type="lower", lab=TRUE, lab_size=2)
```
Figure \ref{fig:hist_cov} shows that pixels labeled as cloud tend to have higher `NDAI`, `CORR`, and `SD` than pixels labeled as clear. This distinct pattern provides strong support for using these features as predictors in our classification models. It also seems reasonable to log-transform `SD` so that it is on a scale similar to the other features.
```{r hist covariate, fig.cap="\\label{fig:hist_cov}Density distributions of three features that describe cloud and clear pixels.", fig.width=9, fig.height=3.3, out.width="92%"}
hist_1 <- All_Image %>%
ggplot(aes(x = NDAI )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
hist_2 <- All_Image %>%
ggplot(aes(x = CORR )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
hist_3 <- All_Image %>%
ggplot(aes(x = SD )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
hist_4 <- All_Image %>%
ggplot(aes(x = log(SD) )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
grid.arrange(hist_1 + ggeasy::easy_center_title(),
hist_2 + ggeasy::easy_center_title(),
hist_3 + ggeasy::easy_center_title(),
hist_4 + ggeasy::easy_center_title(),
nrow = 2)
```
For the other features, Figure \ref{fig:hist_cov2} shows that the density distributions of the radiance values are quite consistent across the five angles. The distributions for cloud pixels are wider and right-skewed, whereas those for cloud-free pixels are concentrated at higher values with a bimodal shape. The two classes are not as well separated on these features as on the first three.
```{r hist covariate2, fig.cap="\\label{fig:hist_cov2}Density distributions of radiance angles from cloud and clear pixels.", fig.width=10, fig.height=3}
hist_1 <- All_Image %>%
ggplot(aes(x = DF )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
hist_2 <- All_Image %>%
ggplot(aes(x = CF )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
hist_3 <- All_Image %>%
ggplot(aes(x = BF )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
hist_4 <- All_Image %>%
ggplot(aes(x = AF )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
hist_5 <- All_Image %>%
ggplot(aes(x = AN )) +
geom_density(aes(fill = Cloud, color = Cloud), alpha = 0.7)
grid.arrange(hist_1 + ggeasy::easy_center_title(),
hist_2 + ggeasy::easy_center_title(),
hist_3 + ggeasy::easy_center_title(),
hist_4 + ggeasy::easy_center_title(),
hist_5 + ggeasy::easy_center_title(),
nrow = 2)
```
\section{Preparation}
<!-- (a) (Data Split) Split the entire data (imagem1.txt, imagem2.txt, imagem3.txt) into three sets: training, validation and test. Think carefully about how to split the data. Suggest at least two non-trivial different ways of splitting the data which takes into account that the data is not i.i.d. -->
\subsection{Data Splitting}
Before building models, it is important to split the data into training, validation, and testing sets so that we can make the best use of the data and evaluate our models fairly.
The key idea behind our data splitting is to account for the fact that this data set is not i.i.d. We therefore propose the following two ways of dividing the data into blocks.
**Method 1. Horizontal Cuts**: The first method cuts each image horizontally, so that every resulting block contains a reasonable portion of both cloudy and clear surfaces.
Each image is cut into five blocks by evenly dividing the Y coordinates (shown in Figure \ref{fig:cut_split}); three blocks are used as training data, and the remaining two blocks are used for validation and testing, respectively.
This method splits the data into 58.59% training, 19.03% validation, and 22.38% testing data (roughly 3:1:1).
```{r split 1, fig.cap="\\label{fig:cut_split}The first data splitting method is to divide each image horizontally.", fig.width=12, fig.height=4, out.width="90%"}
seperate_fold <- function(train_data) {
fold_1 <- train_data %>% filter( Y_coord < 383*0.2 )
fold_2 <- train_data %>% filter( 383*0.2 <= Y_coord, Y_coord < 383*0.4 )
fold_3 <- train_data %>% filter( 383*0.4 <= Y_coord, Y_coord < 383*0.6 )
fold_4 <- train_data %>% filter( 383*0.6 <= Y_coord, Y_coord < 383*0.8 )
fold_5 <- train_data %>% filter( 383*0.8 <= Y_coord )
return( rbind(fold_1 %>% mutate(Type = "Train"),
fold_2 %>% mutate(Type = "Train"),
fold_3 %>% mutate(Type = "Train"),
fold_4 %>% mutate(Type = "Test"),
fold_5 %>% mutate(Type = "Validation") ) )
}
Cut_1 <- seperate_fold(Image_1)
Cut_2 <- seperate_fold(Image_2)
Cut_3 <- seperate_fold(Image_3)
Cut_1_Map <- Cut_1 %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Type), size = 0.5) +
labs(title = "Image 1", x = "X Coordinate", y = "Y Coordinate")
Cut_2_Map <- Cut_2 %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Type), size = 0.5) +
labs(title = "Image 2", x = "X Coordinate", y = "Y Coordinate")
Cut_3_Map <- Cut_3 %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Type), size = 0.5) +
labs(title = "Image 3", x = "X Coordinate", y = "Y Coordinate") + theme(legend.position="none")
grid.arrange(arrangeGrob(Cut_1_Map + ggeasy::easy_center_title() + theme(legend.position="none"),
Cut_2_Map + ggeasy::easy_center_title() + theme(legend.position="none"),
Cut_3_Map + ggeasy::easy_center_title() + theme(legend.position="none"),
nrow = 1),
lemon::g_legend(Cut_1_Map + guides(colour = guide_legend(nrow = 1))), nrow = 2, heights = c(10, 1))
Image_all_Cut <- rbind(Cut_1, Cut_2, Cut_3)
```
**Method 2. K-means Clusters**: The second blocked splitting method uses the K-means algorithm. With a cluster size of five, we divide each image's pixels into five distinct groups according to their X-Y coordinates. As shown in Figure \ref{fig:kmean_split}, three of these clusters are used for training, one for validation, and the last one for testing. The K-means method splits the data into 60.72% training, 20.04% validation, and 19.24% testing data.
```{r split 2, fig.cap="\\label{fig:kmean_split}The second data splitting method divides each image into blocks using the K-means algorithm.", fig.width=12, fig.height=4, out.width="90%"}
set.seed(8848)
KM_1 <- kmeans(Image_1[,1:2], centers = 5, nstart = 25)
KM_2 <- kmeans(Image_2[,1:2], centers = 5, nstart = 25)
KM_3 <- kmeans(Image_3[,1:2], centers = 5, nstart = 25)
Image_all_KM <- rbind(Image_1 %>% mutate(image = "1", cluster = KM_1$cluster),
Image_2 %>% mutate(image = "2", cluster = KM_2$cluster),
Image_3 %>% mutate(image = "3", cluster = KM_3$cluster))
Image_all_KM <- Image_all_KM %>%
mutate(Type = ifelse(cluster == "2", "Test", "Train")) %>%
mutate(Type = ifelse(cluster == "5", "Validation", Type))
KM_1_Map <- Image_all_KM %>% filter(image == "1") %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Type), size = 0.5) +
theme_bw() + labs(title = "Image 1", x = "X Coordinate", y = "Y Coordinate")
KM_2_Map <- Image_all_KM %>% filter(image == "2") %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Type), size = 0.5) +
theme_bw() + labs(title = "Image 2", x = "X Coordinate", y = "Y Coordinate")
KM_3_Map <-Image_all_KM %>% filter(image == "3") %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = Type), size = 0.5) +
theme_bw() + labs(title = "Image 3", x = "X Coordinate", y = "Y Coordinate")
grid.arrange(arrangeGrob(KM_1_Map + ggeasy::easy_center_title() + theme(legend.position="none"),
KM_2_Map + ggeasy::easy_center_title() + theme(legend.position="none"),
KM_3_Map + ggeasy::easy_center_title() + theme(legend.position="none"),
nrow = 1),
lemon::g_legend(KM_1_Map + guides(colour = guide_legend(nrow = 1))), nrow = 2, heights = c(10, 1))
```
The table below reports the proportion of cloud pixels in each set under the two splitting methods. Both splits appear reasonable in the sense that every subset contains both cloud and clear pixels.
```{r, eval = T}
# prop.table(table(Image_all_Cut$Type))
# prop.table(table(Image_all_KM$Type))
knitr::kable(
cbind(
Image_all_Cut %>%
group_by(Type) %>%
dplyr::summarise(`Cloud Prop` = mean(Cloud == "1")) %>%
mutate(Method = "Horizontal Cuts"),
Image_all_KM %>%
group_by(Type) %>%
dplyr::summarise(`Cloud Prop` = mean(Cloud == "1")) %>%
mutate(Method = "K-means")
)[,c(3,1,2,6,4,5)],
caption = "Proportions of cloud pixels in each set",
booktabs = T) %>%
kable_styling(latex_options = "HOLD_position",
position = "center",
font_size = 8)
```
\subsection{Baseline Accuracy}
<!-- Report the accuracy of a trivial classifier which sets all labels to -1 (cloud-free) on the validation set and on the test set. In what scenarios will such a classifier have high average accuracy? Hint: Such a step provides a baseline to ensure that the classification problems at hand is not trivial. -->
We can examine the accuracy of a trivial classifier that sets all labels to -1 on the validation and test sets, as shown in the table below. This classifier assumes that the image contains no cloud pixels at all. Logically, the accuracy of this trivial classifier depends entirely on the percentage of actually cloud-free pixels in the data: if an image is mostly cloud-free, labeling every pixel as -1 would easily achieve high accuracy.
```{r accuracy of trivial predictor}
cut_v <- Image_all_Cut %>%
filter(Type == "Validation")%>%
mutate(Predict = "-1") %>%
summarise(`Data Type` = "Validation", Accuracy = mean(Predict == Cloud)) %>% mutate(Method = "Horizontal Cuts")
cut_t <- Image_all_Cut %>%
filter(Type == "Test")%>%
mutate(Predict = "-1") %>%
summarise(`Data Type` = "Test", Accuracy = mean(Predict == Cloud)) %>% mutate(Method = "Horizontal Cuts")
km_v <- Image_all_KM %>%
filter(Type == "Validation")%>%
mutate(Predict = "-1") %>%
summarise(`Data Type` = "Validation", Accuracy = mean(Predict == Cloud)) %>% mutate(Method = "K-means")
km_t <- Image_all_KM %>%
filter(Type == "Test")%>%
mutate(Predict = "-1") %>%
summarise(`Data Type` = "Test", Accuracy = mean(Predict == Cloud)) %>% mutate(Method = "K-means")
knitr::kable(rbind(cut_v, cut_t, km_v, km_t)[,c(3,1,2)],
caption = "Accuracy of a trivial classifier that sets all labels to -1",
booktabs = T ) %>%
kable_styling(latex_options = "HOLD_position", position = "center", font_size = 8)
```
The table shows that labeling all points as -1 achieves accuracies of only 66.15% and 67.04% on the two test sets, so achieving high accuracy in this classification problem is not trivially easy. This also sets a baseline of reference for our subsequent classification methods.
\subsection{First-Order Importance}
<!-- (c) (First order importance) Assuming the expert labels as the truth, and without using fancy classification methods, suggest three of the “best” features, using quantitative and visual justification. Define your “best” feature criteria clearly. Only the relevant plots are necessary. Be sure to give this careful consideration, as it relates to subsequent problems. -->
We consider `NDAI`, `CORR`, and `logSD` (note that we apply a log transformation to `SD`, as justified in Section 1.3) to be the three most important features for classifying pixels. First, the EDA in Figures \ref{fig:hist_cov} and \ref{fig:hist_cov2} shows that the density distributions of these three variables are more distinct between cloud and non-cloud pixels than those of the angular radiance variables.
We also examine the pairwise correlation between the response, `Cloud`, and each of the eight features. Looking at the bottom rows of Figure \ref{fig:corrplot_cloud}, we notice that the correlations of `Cloud` with `NDAI`, `CORR`, and `logSD` are much stronger than its correlations with the other five features.
```{r pairwise cloud, fig.cap="\\label{fig:corrplot_cloud} Pair-wise correlations between the response (Cloud) and the eight features.", out.width="50%"}
All_Image$logSD <- log(All_Image$SD)
model.matrix(~0+., data=All_Image %>%
dplyr::select(-Y_coord, -X_coord) %>%
dplyr::select(Cloud, NDAI, CORR, logSD, DF, CF, BF, AF, AN))%>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag = F, type="lower", lab=TRUE, lab_size=2)
```
We also apply PCA to the eight features, which suggests that the first three PCs explain over 95% of the total variance. Within these first three PCs, `NDAI`, `CORR`, and `logSD` again consistently carry large loadings. Thus, we tentatively suggest these three as the best features before any model-based justification; a sketch of this check is given below.
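This is a minimal sketch of the PCA check, assuming the combined labeled data `All_Image` with `logSD` already added (as in our setup chunk); it is not evaluated when knitting.
```{r pca sketch, eval = FALSE}
# PCA on the eight candidate features (log-transformed SD in place of SD);
# scale. = TRUE puts the features on comparable scales before rotation.
pca_features <- All_Image %>%
  dplyr::select(NDAI, CORR, logSD, DF, CF, BF, AF, AN)
pca_fit <- prcomp(pca_features, center = TRUE, scale. = TRUE)

# Cumulative variance explained by the first three PCs, and their loadings,
# used to judge which features drive most of the variation.
summary(pca_fit)$importance["Cumulative Proportion", 1:3]
round(pca_fit$rotation[, 1:3], 2)
```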
\subsection{Cross Validation Method}
<!-- (d) Write a generic cross validation (CV) function CVmaster in R that takes a generic classifier, training features, training labels, number of folds K and a loss function (at least classification accuracy should be there) as inputs and outputs the K-fold CV loss on the training set. Please remember to put it in your github folder in Section 5. -->
After splitting the data into training and test sets, we use various classification methods to train models that classify cloud and non-cloud pixels from the available features. To make the best use of the training data, we employ cross validation to compare models.
As shown in our code appendix and GitHub repository, our `CVmaster` function takes a generic classifier, training features, training labels, the number of folds K, and a loss function as inputs, performs K-fold cross validation (CV), and outputs the K-fold CV loss on the training data. A condensed sketch of its interface is given below.
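The sketch below only illustrates the signature and the K-fold loop with random folds; the actual `CVmaster` in our repository is the version used for the reported results and also supports the blocked folds from Section 2.1.
```{r cvmaster sketch, eval = FALSE}
# Illustrative sketch of the CVmaster interface (simplified).
CVmaster <- function(classifier, features, labels, K = 10,
                     loss = function(y, yhat) mean(y != yhat)) {
  folds <- sample(rep(1:K, length.out = nrow(features)))
  fold_loss <- numeric(K)
  for (k in 1:K) {
    train_x <- features[folds != k, , drop = FALSE]
    train_y <- labels[folds != k]
    val_x   <- features[folds == k, , drop = FALSE]
    val_y   <- labels[folds == k]
    # `classifier` is any function(train_x, train_y, val_x) that returns
    # predicted labels for val_x.
    preds <- classifier(train_x, train_y, val_x)
    fold_loss[k] <- loss(val_y, preds)
  }
  list(fold_loss = fold_loss, cv_loss = mean(fold_loss))
}

# Example wrapper: LDA on the eight features.
lda_classifier <- function(train_x, train_y, val_x) {
  fit <- MASS::lda(x = train_x, grouping = train_y)
  predict(fit, val_x)$class
}
```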
\section{Modeling}
<!--(a) Try several classification methods and assess their fit using cross-validation (CV). Provide a commentary on the assumptions for the methods you tried and if they are satisfied in this case. Since CV does not have a validation set, you can merge your training and validation set to fit your CV model. Report the accuracies across folds (and not just the average across folds) and the test accuracy. CV-results for both the ways of creating folds (as answered in part 2(a)) should be reported. Provide a brief commentary on the results. Make sure you honestly mention all the classification methods you have tried. -->
In this section, different machine learning algorithms, including logistic regression, LDA, QDA, Naive Bayes, and gradient boosting, are used to construct classification models. Hyperparameters are selected based on cross-validation accuracies. Performance is assessed through both CV error and prediction accuracy on the test data. Model fits are also compared using additional performance metrics, including the area under the ROC curve (AUC), precision, and F1 scores.
\subsection{Model Fitting}
Before fitting the chosen classification algorithms, we first assess their model assumptions and how well the data satisfy them. Logistic regression assumes linearity in the log odds and hence a linear decision boundary. Although such a strong assumption is not satisfied by this data set, we can adjust the classification threshold and still achieve reasonable accuracy. LDA assumes the predictors in each class follow normal distributions with a common covariance matrix, $\Sigma$, but different means, $\mu_j$, while QDA allows distinct class covariance matrices $\Sigma_j$. From Figures \ref{fig:hist_cov} and \ref{fig:hist_cov2}, the covariate distributions are clearly far from normal, so the normality assumptions of LDA and QDA are not satisfied here. Naive Bayes assumes that, conditional on the class label, the predictors are independent, each following its own (here Gaussian) distribution; this assumption is also unsatisfied based on Figures \ref{fig:hist_cov} and \ref{fig:hist_cov2}. Tree-based models such as boosting trees, on the other hand, impose no distributional assumptions and should be able to provide good results.
After assessing these assumptions, we apply the different classification methods to the data. For the boosting trees algorithm, hyperparameters including the maximum depth of each weak learner, the number of weak learners, and the learning rate are tuned through K-fold cross validation. For algorithms that return classification probabilities, we determine the decision threshold using the Youden statistic, which maximizes the sum of specificity and sensitivity. Using our `CVmaster` function and selecting the hyperparameters with minimal cross-validation error, we fit our final models on the entire training data and compute the respective test accuracies; a sketch of the threshold-selection step is shown below, followed by a table of both CV and test accuracies.
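This is a minimal sketch of the threshold-selection step, mirroring the logistic-regression and `pROC` calls used in the ROC chunk of Section 3.2; it is not evaluated when knitting.
```{r youden sketch, eval = FALSE}
library(pROC)
# Fit on the training data, then pick the probability cut-off that
# maximizes Youden's J = sensitivity + specificity - 1.
glm_fit <- glm(Cloud ~ NDAI + logSD + CORR + DF + CF + BF + AF + AN,
               data = Image_Train_Cut, family = "binomial")
train_prob <- predict(glm_fit, Image_Train_Cut, type = "response")
roc_obj <- roc(Image_Train_Cut$Cloud, train_prob)
best_cut <- coords(roc_obj, "best", best.method = "youden",
                   transpose = FALSE)$threshold

# Apply the tuned threshold to held-out data.
test_prob <- predict(glm_fit, Image_Test_Cut, type = "response")
test_pred <- ifelse(test_prob > best_cut, 1, -1)
```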
```{r}
# all_result <- read.csv("all_result.csv") %>% mutate(Cv.mean = CV.Average, Test = Test.Accuracy) %>%
# dplyr::select(-CV.Average, -Test.Accuracy)
all_result <- read.csv("all_result.csv")
knitr::kable(all_result,
caption = "10-fold CV Results and Test Accuracy based on Two ways of Data Splitting",
booktabs = T,linesep="") %>%
kableExtra::kable_styling(latex_options = "HOLD_position", position = "center", font_size = 6)
```
As displayed above, the boosting trees algorithm achieves the best CV and test accuracies. This result aligns with our expectation, since boosting trees make the weakest assumptions and can therefore fit the data well. LDA, QDA, and Naive Bayes, on the other hand, have lower accuracies because their model assumptions are not satisfied by the data. Logistic regression performs slightly better, which is explainable given that its classification threshold is adjusted using the Youden statistic.
\subsection{Model Comparison with ROC Curves}
With the final models constructed, we compare model fits using ROC curves and AUC values, shown in Figure \ref{fig:ROCs}. The decision thresholds (the cut-off values applied to the predicted probabilities) for logistic regression and boosting trees are also marked on their curves.
```{r}
load("Image_all_Cut.RData")
load("Image_all_KM.RData")
Image_Train_Cut <- Image_all_Cut %>% filter(Type != "Test")
Image_Test_Cut <- Image_all_Cut %>% filter(Type == "Test")
Image_Train_KM <- Image_all_KM %>% filter(Type != "Test")
Image_Test_KM <- Image_all_KM %>% filter(Type == "Test")
Image_Train_Cut$Cloud01 <- factor(ifelse(Image_Train_Cut$Cloud == "-1", "0", Image_Train_Cut$Cloud))
levels(Image_Train_Cut$Cloud01) <- c("0", "1")
Image_Test_Cut$Cloud01 <- factor(ifelse(Image_Test_Cut$Cloud == "-1", "0", Image_Test_Cut$Cloud))
levels(Image_Test_Cut$Cloud01) <- c("0", "1")
Image_Train_KM$Cloud01 <- factor(ifelse(Image_Train_KM$Cloud == "-1", "0", Image_Train_KM$Cloud))
levels(Image_Train_KM$Cloud01) <- c("0", "1")
Image_Test_KM$Cloud01 <- factor(ifelse(Image_Test_KM$Cloud == "-1", "0", Image_Test_KM$Cloud))
levels(Image_Test_KM$Cloud01) <- c("0", "1")
```
```{r test ROC, fig.cap="\\label{fig:ROCs} ROC curves on the test data (the first row shows models trained on the horizontal-cut split; the second row shows models trained on the K-means split)", message = F, fig.height=6,fig.width=15, out.width="101%"}
load("final_boost_All.RData")
load("final_boost_KM.RData")
par(mfrow = c(2, 5))
# ROC for GLM cut
glm_result = glm(Cloud ~ NDAI + logSD + CORR + DF + CF + BF + AF + AN , data=Image_Train_Cut,
family = 'binomial')
GLMpreds <- predict(glm_result,Image_Train_Cut,type = "response")
t = pROC::coords(roc(Image_Train_Cut$Cloud, GLMpreds), "best", transpose = FALSE)
GLMpreds <- predict(glm_result,Image_Test_Cut,type = "response")
invisible(roc(Image_Test_Cut$Cloud, GLMpreds, print.auc = T, plot = T, print.thres = t[[1]], main = "Logistic Regression"))
glmcutF1 = F1_Score(Image_Test_Cut$Cloud, ifelse(GLMpreds>t[[1]],1,-1))
cm <- confusionMatrix(as.factor(ifelse(GLMpreds>t[[1]],1,-1)), reference = as.factor(Image_Test_Cut$Cloud))
glmcutrec = cm[["byClass"]]["Precision"]
# ROC for LDA cut
lda_result <- lda(Cloud ~ NDAI + CORR + logSD + DF + CF + BF + AF + AN,
data = Image_Train_Cut)
LDApreds <- predict(lda_result, Image_Test_Cut)$class
LDApreds = ifelse(LDApreds==1,1,-1)
invisible(roc(Image_Test_Cut$Cloud, LDApreds, print.auc = T, plot = T, main = "LDA"))
LDAcutF1 = F1_Score(Image_Test_Cut$Cloud, LDApreds)
cm <- confusionMatrix(as.factor(LDApreds), reference = as.factor(Image_Test_Cut$Cloud))
LDAcutrec = cm[["byClass"]]["Precision"]
# ROC for QDA cut
qda_result <- qda(Cloud ~ NDAI + CORR + logSD + DF + CF + BF + AF + AN,
data = Image_Train_Cut)
QDApreds <- predict(qda_result, Image_Test_Cut)$class
QDApreds = ifelse(QDApreds==1,1,-1)
invisible(roc(Image_Test_Cut$Cloud, QDApreds, print.auc = T, plot = T, main = "QDA"))
QDAcutF1 = F1_Score(Image_Test_Cut$Cloud, QDApreds)
cm <- confusionMatrix(as.factor(QDApreds), reference = as.factor(Image_Test_Cut$Cloud))
QDAcutrec = cm[["byClass"]]["Precision"]
# ROC for NB cut
NB_result <- naiveBayes(Cloud ~ NDAI + CORR + logSD + DF + CF + BF + AF + AN,
data = Image_Train_Cut)
NBpreds <- predict(NB_result, Image_Test_Cut)
NBpreds = ifelse(NBpreds==1,1,-1)
invisible(roc(Image_Test_Cut$Cloud, NBpreds, print.auc = T, plot = T, main = "Naive Bayes"))
NBcutF1 = F1_Score(Image_Test_Cut$Cloud, NBpreds)
cm <- confusionMatrix(as.factor(NBpreds), reference = as.factor(Image_Test_Cut$Cloud))
NBcutrec = cm[["byClass"]]["Precision"]
# ROC for Boosting Trees
# boost_model_Cut <- invisible(xgboost(
# data = as.matrix(Image_Train_Cut[, 4:11]),
# label = as.matrix(Image_Train_Cut$Cloud01),
# max.depth = 4,
# eta = 0.05,
# nthread = parallel::detectCores(),
# nrounds = 3000,
# objective = "binary:logistic",
# verbose=0
# ))
pred <- predict(final_boost_All, as.matrix(Image_Train_Cut[, 4:11]))
t = pROC::coords(roc(Image_Train_Cut$Cloud01, pred), "best", transpose = FALSE)
pred <- predict(final_boost_All, as.matrix(Image_Test_Cut[, 4:11]))
invisible(roc(Image_Test_Cut$Cloud01, pred, print.auc = T, plot = T, print.thres = t[[1]], main = "Boosting Trees"))
xgbcutF1 = F1_Score(Image_Test_Cut$Cloud01, ifelse(pred>t[[1]],1,0))
cm <- confusionMatrix(as.factor(ifelse(pred>t[[1]],1,0)), reference = as.factor(Image_Test_Cut$Cloud01))
xgbcutrec = cm[["byClass"]]["Precision"]
Cut_F1=c(glmcutF1,LDAcutF1,QDAcutF1,NBcutF1,xgbcutF1)
Cut_pr=c(glmcutrec,LDAcutrec,QDAcutrec,NBcutrec,xgbcutrec)
names(Cut_F1)=c("Logistic Regression","LDA","QDA","Naive Bayes","Boosting Trees")
names(Cut_pr)=c("Logistic Regression","LDA","QDA","Naive Bayes","Boosting Trees")
# KM
glm_result = glm(Cloud ~ NDAI + logSD + CORR + DF + CF + BF + AF + AN , data=Image_Train_KM,
family = 'binomial')
preds <- predict(glm_result,Image_Train_KM, type = "response")
t = pROC::coords(roc(Image_Train_KM$Cloud, preds), "best", transpose = FALSE)
GLMpreds <- predict(glm_result,Image_Test_KM,type = "response")
invisible(roc(Image_Test_KM$Cloud, GLMpreds, print.auc = T, plot = T, print.thres = t[[1]], main = "Logistic Regression"))
glmKMF1 = F1_Score(Image_Test_KM$Cloud, ifelse(GLMpreds>t[[1]],1,-1))
cm <- confusionMatrix(as.factor(ifelse(GLMpreds>t[[1]],1,-1)), reference = as.factor(Image_Test_KM$Cloud))
glmKMrec = cm[["byClass"]]["Precision"]
# ROC for LDA cut
lda_result <- lda(Cloud ~ NDAI + CORR + logSD + DF + CF + BF + AF + AN,
data = Image_Train_KM)
LDApreds <- predict(lda_result, Image_Test_KM)$class
LDApreds = ifelse(LDApreds==1,1,-1)
invisible(roc(Image_Test_KM$Cloud, LDApreds, print.auc = T, plot = T, main = "LDA"))
LDAKMF1 = F1_Score(Image_Test_KM$Cloud, LDApreds)
cm <- confusionMatrix(as.factor(LDApreds), reference = as.factor(Image_Test_KM$Cloud))
LDAKMrec = cm[["byClass"]]["Precision"]
# ROC for QDA cut
qda_result <- qda(Cloud ~ NDAI + CORR + logSD + DF + CF + BF + AF + AN,
data = Image_Train_KM)
QDApreds <- predict(qda_result, Image_Test_KM)$class
QDApreds = ifelse(QDApreds==1,1,-1)
invisible(roc(Image_Test_KM$Cloud, QDApreds, print.auc = T, plot = T, main = "QDA"))
QDAKMF1 = F1_Score(Image_Test_KM$Cloud, QDApreds)
cm <- confusionMatrix(as.factor(QDApreds), reference = as.factor(Image_Test_KM$Cloud))
QDAKMrec = cm[["byClass"]]["Precision"]
# ROC for NB cut
NB_result <- naiveBayes(Cloud ~ NDAI + CORR + logSD + DF + CF + BF + AF + AN,
data = Image_Train_KM)
NBpreds <- predict(NB_result, Image_Test_KM)
NBpreds = ifelse(NBpreds==1,1,-1)
invisible(roc(Image_Test_KM$Cloud, NBpreds, print.auc = T, plot = T, main = "Naive Bayes"))
NBKMF1 = F1_Score(Image_Test_KM$Cloud, NBpreds)
cm <- confusionMatrix(as.factor(NBpreds), reference = as.factor(Image_Test_KM$Cloud))
NBKMrec = cm[["byClass"]]["Precision"]
# ROC for Boosting Trees
# boost_model_KM <- invisible(xgboost(
# data = as.matrix(Image_Train_KM[, 4:11]),
# label = as.matrix(Image_Train_KM$Cloud01),
# max.depth = 4,
# eta = 0.05,
# nthread = parallel::detectCores(),
# nrounds = 3000,
# objective = "binary:logistic",
# verbose=0
# ))
pred <- predict(final_boost_KM, as.matrix(Image_Train_KM[, 4:11]))
t = pROC::coords(roc(Image_Train_KM$Cloud01, pred), "best", transpose = FALSE)
pred <- predict(final_boost_KM, as.matrix(Image_Test_KM[, 4:11]))
invisible(roc(Image_Test_KM$Cloud01, pred, print.auc = T, plot = T, print.thres = t[[1]], main = "Boosting Trees"))
xgbKMF1 = F1_Score(Image_Test_KM$Cloud01, ifelse(pred>t[[1]],1,0))
cm <- confusionMatrix(as.factor(ifelse(pred>t[[1]],1,0)), reference = as.factor(Image_Test_KM$Cloud01))
xgbKMrec = cm[["byClass"]]["Precision"]
KM_F1=c(glmKMF1,LDAKMF1,QDAKMF1,NBKMF1,xgbKMF1)
KM_pr=c(glmKMrec,LDAKMrec,QDAKMrec,NBKMrec,xgbKMrec)
names(KM_F1)=c("Logistic Regression","LDA","QDA","Naive Bayes","Boosting Trees")
names(KM_pr)=c("Logistic Regression","LDA","QDA","Naive Bayes","Boosting Trees")
```
The figure shows that the boosting trees algorithm has the best ROC curves, with AUC values of 0.989 and 0.991. Its classification threshold is 0.472, slightly below 0.5. LDA and QDA have similar ROC curves, with AUC values of 0.838 and 0.853, worse than boosting because their normality assumptions are not satisfied. Naive Bayes performs worst, owing to its strong model assumptions. The ROC curve of the logistic regression model changes noticeably between the two data splits, indicating that its sensitivity to the splitting method could lead to instability. Overall, boosting has the best AUC and is the most reliable model by this metric.
\subsection{Comparison of Model Fit}
Although accuracy is often one of the most important metrics for classification models, other performance metrics provide valuable information about the overall model fit. In this section, we compare model performance based on precision and F1 scores.
```{r}
Def_Table <- data.frame(`True Non-Cloud` = c("True Negative", "False Negative"), `True Cloud` = c("False Positive", "True Positive"))
row.names(Def_Table) <- c("Predicted.No-Cloud","Predicted.Cloud")
knitr::kable(Def_Table,
caption = "Definition of four scenarios in a confusion matrix.",
booktabs = T,linesep="") %>%
kableExtra::kable_styling(latex_options = "HOLD_position", position = "center", font_size = 6)
```
Precision measures the proportion of true positives among all positive predictions and is calculated as
\begin{equation}
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}},
\end{equation}
where TP and FP denote the numbers of true positives and false positives defined in the confusion matrix above. A higher precision reflects a model's ability to avoid false positives, i.e., classifying a non-cloud pixel as cloudy.
The F1 score, on the other hand, combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is defined as
\begin{equation}
\text{F1} = \frac{\text{TP}}{\text{TP} + \frac{1}{2}(\text{FP} + \text{FN})},
\end{equation}
where FN is the number of false negatives. A high F1 score indicates that the model identifies the positive class well while keeping both types of error low. A small sketch of how both metrics can be computed from counts is given below, and the precision and F1 scores of our final models are plotted in Figure \ref{fig:f1_score} in descending order.
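A minimal helper computing both metrics from predicted and true labels, assuming cloud ("1") is the positive class; in the figure below we actually rely on `MLmetrics::F1_Score` and `caret::confusionMatrix`, so this is only an illustration and is not evaluated when knitting.
```{r metric sketch, eval = FALSE}
# Precision and F1 from predicted vs. true labels, treating cloud ("1")
# as the positive class.
precision_f1 <- function(truth, pred, positive = "1") {
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  c(Precision = tp / (tp + fp),
    F1        = tp / (tp + 0.5 * (fp + fn)))
}
```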
```{r F1 score, fig.cap="\\label{fig:f1_score} The F1 scores and precisions of five classification methods.", out.width="90%",}
picCut_F1=data.frame(names = names(Cut_F1), value = Cut_F1) %>% arrange(desc(value)) %>%
ggplot(aes(x = reorder(names, -value),y = value))+geom_bar(stat="identity",width=0.5) + labs(title="Model F1 Score of Horizontal Cut Data", x = "", y = "F1 Score") + coord_cartesian(ylim = c(0.85,1))+theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5, size = 5),plot.title=element_text(size=10))
picCut_pr=data.frame(names = names(Cut_pr), value = Cut_pr) %>% arrange(desc(value)) %>%
ggplot(aes(x = reorder(names, -value),y = value))+geom_bar(stat="identity",width=0.5) + labs(title="Model Precision of Horizontal Cut Data", x = "", y = "Precision") + coord_cartesian(ylim = c(0.8,1))+theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5, size = 5),plot.title=element_text(size=10))
picKM_F1=data.frame(names = names(KM_F1), value = KM_F1) %>% arrange(desc(value)) %>%
ggplot(aes(x = reorder(names, -value),y = value))+geom_bar(stat="identity",width=0.5) + labs(title="Model F1 Score of K-means Data", x = "", y = "F1 Score") + coord_cartesian(ylim = c(0.85,1))+theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5, size = 5),plot.title=element_text(size=10))
picKM_pr=data.frame(names = names(KM_pr), value = KM_pr) %>% arrange(desc(value)) %>%
ggplot(aes(x = reorder(names, -value),y = value))+geom_bar(stat="identity",width=0.5) + labs(title="Model Precision of K-means Data",x = "", y = "Precision") + coord_cartesian(ylim = c(0.85,1))+theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5, size = 5),plot.title=element_text(size=10))
grid.arrange(arrangeGrob(picCut_F1 + ggeasy::easy_center_title() + theme(legend.position="none"),
picCut_pr + ggeasy::easy_center_title() + theme(legend.position="none"),
picKM_F1 + ggeasy::easy_center_title() + theme(legend.position="none"),
picKM_pr + ggeasy::easy_center_title() + theme(legend.position="none"),
nrow = 2))
```
The boosting algorithm has the highest F1 score and precision in all settings, indicating that it is the best model for this classification problem. Naive Bayes, on the other hand, performs the worst. An interesting observation is that QDA and LDA have better F1 scores than logistic regression on the horizontal-cut data, which demonstrates that accuracy alone can be insufficient for judging model fit. Combining the results from the previous sections, we conclude that boosting is the best classification model for this particular problem.
\section{Diagnostics}
\subsection{Examining Boosting Trees}
The previous model-comparison section determined that boosting trees is the best model for this classification problem. In this section, we discuss the hyperparameter-selection process and the convergence behavior of boosting in more detail.
There are three major hyperparameters in our boosting trees model: the depth of the weak learners, the learning rate, and the number of weak learners. These hyperparameters are selected through 10-fold cross validation, with the learning rate varied from 0.01 to 0.1, the depth from 2 to 4, and the number of trees from 500 to 2500. The CV results suggest that a learning rate of 0.05 with 2000 trees, each with a maximum depth of 4, produces the best model. With the best parameters in hand, we further assess model convergence by varying the number of trees and comparing the corresponding training and test errors. In Figure \ref{fig:converge_plot}, the number of trees is varied from 1 to 10000 and the misclassification errors are plotted.
```{r, eval=F}
set.seed(4577)
n = seq(1, 6001, 2000)
train_accuracy_Cut = rep(NA, length(n))
test_accuracy_Cut = rep(NA, length(n))
for (i in seq_along(n)) {
boost_model <- xgboost(
data = as.matrix(Image_Train_Cut[, 4:11]),
label = as.matrix(Image_Train_Cut$Cloud01),
max.depth = 2,
eta = 0.06,
nthread = parallel::detectCores(),
nrounds = n[i],
objective = "binary:logistic",
colsample_bytree = 0.6,
verbose = 0
)
train_pred = predict(boost_model, as.matrix(Image_Train_Cut[, 4:11]))
t = pROC::coords(roc(Image_Train_Cut$Cloud01, train_pred),
"best",
transpose = FALSE)
train_label = ifelse(train_pred > t[[1]], 1, 0)
train_accuracy_Cut[i] = mean(train_label == Image_Train_Cut$Cloud01)
test_pred <- predict(boost_model, as.matrix(Image_Test_Cut[, 4:11]))
test_label = ifelse(test_pred > t[[1]], 1, 0)
test_accuracy_Cut[i] = mean(test_label == Image_Test_Cut$Cloud01)
print(i)
}
plot(n, 1 - train_accuracy_Cut, type = "o")
plot(n, 1 - test_accuracy_Cut, type = "o")
save(train_accuracy_Cut, file = "train_accuracy_Cut.RData")
save(test_accuracy_Cut, file = "test_accuracy_Cut.RData")
```
```{r,message=F,warning=F,cache=T, eval=F}
set.seed(123456)
n = seq(1, 5001, 200)
test_accuracy_KM = rep(NA, length(n))
train_accuracy_KM = rep(NA, length(n))
for (i in seq_along(n)) {
boost_model <- xgboost(
data = as.matrix(Image_Train_KM[, 4:11]),
label = as.matrix(Image_Train_KM$Cloud01),
max.depth = 2,
eta = 0.06,
nthread = parallel::detectCores(),
nrounds = n[i],
objective = "binary:logistic",
colsample_bytree = 0.6,
verbose = 0
)
train_pred = predict(boost_model, as.matrix(Image_Train_KM[, 4:11]))
t = pROC::coords(roc(Image_Train_KM$Cloud01, train_pred),
"best",
transpose = FALSE)
train_label = ifelse(train_pred > t[[1]], 1, 0)
train_accuracy_KM[i] = mean(train_label == Image_Train_KM$Cloud01)
test_pred <- predict(boost_model, as.matrix(Image_Test_KM[, 4:11]))
test_label = ifelse(test_pred > t[[1]], 1, 0)
test_accuracy_KM[i] = mean(test_label == Image_Test_KM$Cloud01)
print(i)
}
plot(n, 1 - train_accuracy_KM, type = "o")
plot(n, 1 - test_accuracy_KM, type = "o")
save(train_accuracy_KM, file = "train_accuracy_KM2.RData")
save(test_accuracy_KM, file = "test_accuracy_KM2.RData")
load("train_accuracy_KM2.RData")
load("test_accuracy_KM2.RData")
plot(seq(1, 5001, 200), 1 - train_accuracy_KM, type = "l")
plot(seq(1, 5001, 200), 1 - test_accuracy_KM, type = "l")
```
```{r, eval = F}
set.seed(123456)
n = seq(1001, 3001, 100)
test_accuracy_KM = rep(NA, length(n))
train_accuracy_KM = rep(NA, length(n))
for (i in seq_along(n)) {
boost_model <- xgboost(
data = as.matrix(Image_Train_KM[, 4:11]),
label = as.matrix(Image_Train_KM$Cloud01),
max.depth = 2,
eta = 0.06,
nthread = parallel::detectCores(),
nrounds = n[i],
objective = "binary:logistic",
colsample_bytree = 0.6,
verbose = 0
)
train_pred = predict(boost_model, as.matrix(Image_Train_KM[, 4:11]))
t = pROC::coords(roc(Image_Train_KM$Cloud01, train_pred),
"best",
transpose = FALSE)
train_label = ifelse(train_pred > t[[1]], 1, 0)
train_accuracy_KM[i] = mean(train_label == Image_Train_KM$Cloud01)
test_pred <- predict(boost_model, as.matrix(Image_Test_KM[, 4:11]))
test_label = ifelse(test_pred > t[[1]], 1, 0)
test_accuracy_KM[i] = mean(test_label == Image_Test_KM$Cloud01)
print(i)
}
plot(n, 1 - train_accuracy_KM, type = "o")
plot(n, 1 - test_accuracy_KM, type = "o")
```
```{r, eval = F}
set.seed(8848)
boost_modelKM <- xgboost(
data = as.matrix(Image_Train_KM[, 4:11]),
label = as.matrix(Image_Train_KM$Cloud01),
max.depth = 2,
eta = 0.06,
nthread = 4,
nrounds = 2000,
objective = "binary:logistic",
colsample_bytree=0.6,
verbose=0
)
train_pred = predict(boost_modelKM, as.matrix(Image_Train_KM[, 4:11]))
t = pROC::coords(roc(Image_Train_KM$Cloud01, train_pred),
"best",
transpose = FALSE)
train_label = ifelse(train_pred > t[[1]], 1, 0)
mean(train_label == Image_Train_KM$Cloud01)
test_pred <- predict(boost_modelKM, as.matrix(Image_Test_KM[, 4:11]))
test_label = ifelse(test_pred > t[[1]], 1, 0)
mean(test_label == Image_Test_KM$Cloud01)
```
```{r converge plot, fig.cap="\\label{fig:converge_plot} Misclassification errors vs. number of iterations", fig.width = 12, fig.height = 4, out.width="88%"}
par(mfrow = c(1, 2))
load("train_accuracy_KM.RData")
load("test_accuracy_KM.RData")
plot(seq(1, 10001, 1000), 1 - train_accuracy_KM, type = "l", ylab = "Train MisClassification Error", xlab = "Number of Iterations")
plot(seq(1, 10001, 1000), 1 - test_accuracy_KM, type = "l", ylab = "Test MisClassification Error", xlab = "Number of Iterations")
```
Distinct patterns can be observed in the two panels above. The training error keeps decreasing as the number of weak learners increases: each new weak learner in gradient boosting is fit to correct the errors of the current ensemble, so adding more learners keeps driving the training error down. The test error, on the other hand, starts to increase once the number of weak learners exceeds about 2000. This matches our intuition that overfitting sets in as model complexity grows, which shows up as the rising test error in the right panel. Together, the two plots indicate that the chosen number of trees is close to the optimum, ensuring convergence while avoiding overfitting.
\subsection{Patterns of Misclassification}
In this section, we conduct a detailed analysis of misclassifications. Figure \ref{fig:pred_plot} is a plot of model prediction on all the data points, with misclassified points plotted in yellow.
```{r prediction plot, fig.cap="\\label{fig:pred_plot} Prediction of cloud pixels in each image using a trained boosting trees model, with those misclassified points highlighted in yellow.", fig.width = 12, fig.height = 4, out.width="85%", eval = T, message = F}
# prediction plot
Image_all_KM$Cloud01 <- factor(ifelse(Image_all_KM$Cloud == "-1", "0", Image_all_KM$Cloud))
levels(Image_all_KM$Cloud01) <- c("0", "1")
# final_boost_All <- xgboost(
# data = as.matrix(Image_Train_Cut[, 4:11]),
# label = as.matrix(Image_Train_Cut$Cloud01),
# max.depth = 4,
# eta = 0.05,
# nrounds = 3000,
# verbose = F, nthread = parallel::detectCores(),
# objective = "binary:logistic",
# colsample_bytree = 0.6,
# )
# cv=xgb.cv(
# params = list(eta=c(0.1,0.05,0.03),max_depth=c(2,3,4)),
# data = as.matrix(Image_Train_KM[, 4:11]),
# label = as.matrix(Image_Train_KM$Cloud01),
# nrounds=1000,
# 5,
# missing = NA,
# prediction = FALSE,
# showsd = TRUE,
# obj = NULL,
# feval = NULL,
# stratified = TRUE,
# folds = NULL,
# )
pred <- predict(final_boost_All, as.matrix(Image_Train_Cut[, 4:11]))
t <- invisible(pROC::coords(roc(Image_Train_Cut$Cloud01, pred), "best", transpose = FALSE))
all_pred <- predict(final_boost_All, as.matrix(Image_all_KM[, 4:11]))
Image_all_KM$all_prediction <- as.factor(as.numeric(all_pred > t[[1]]))
# draw plots
Palette_2 <- c("gray", "white")
# map1miss = Image_all_KM %>%
# filter(image == "1") %>%
# filter(Cloud01!=all_prediction) %>%
# dplyr::select(X_coord,Y_coord)
map1 <- Image_all_KM %>%
filter(image == "1") %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = all_prediction), size = 0.5) +
geom_point(data=filter(Image_all_KM %>% filter(image == "1"), Cloud01!=all_prediction),
aes(x = X_coord, y = Y_coord), color="yellow",size = 0.1, alpha = 0.3) +
theme_dark() + labs(title = "Image 1", x = "X Coordinate", y = "Y Coordinate", color = "Prediction") +
scale_colour_manual(values = Palette_2)
# mean(Image_all_KM %>% filter(image == "1")%>%dplyr::select(Cloud01)==Image_all_KM %>% filter(image == "1")%>%dplyr::select(all_prediction))
map2 <- Image_all_KM %>%
filter(image == "2") %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = all_prediction), size = 0.5) +
geom_point(data=filter(Image_all_KM %>% filter(image == "2"), Cloud01!=all_prediction),
aes(x = X_coord, y = Y_coord), color="yellow",size = 0.1, alpha = 0.3) +
theme_dark() + labs(title = "Image 2", x = "X Coordinate", y = "Y Coordinate") +
scale_colour_manual(values = Palette_2)
# mean(Image_all_KM %>% filter(image == "2")%>%dplyr::select(Cloud01)==Image_all_KM %>% filter(image == "2")%>%dplyr::select(all_prediction))
map3 <- Image_all_KM %>%
filter(image == "3") %>%
ggplot() + geom_point(aes(x = X_coord, y = Y_coord, color = all_prediction), size = 0.5) +
geom_point(data=filter(Image_all_KM %>% filter(image == "3"), Cloud01!=all_prediction),
aes(x = X_coord, y = Y_coord), color="yellow",size = 0.1, alpha = 0.3) +
theme_dark() + labs(title = "Image 3", x = "X Coordinate", y = "Y Coordinate") +
scale_colour_manual(values = Palette_2)
# mean(Image_all_KM %>% filter(image == "3")%>%dplyr::select(Cloud01)==Image_all_KM %>% filter(image == "3")%>%dplyr::select(all_prediction))
map_legend <- lemon::g_legend(map1 + guides(colour = guide_legend(nrow = 1)))
grid.arrange(arrangeGrob(map1 + ggeasy::easy_center_title() + theme(legend.position="none"),
map2 + ggeasy::easy_center_title() + theme(legend.position="none"),
map3 + ggeasy::easy_center_title() + theme(legend.position="none"),
nrow = 1),
map_legend, nrow = 2, heights = c(10, 1))
```
Most misclassifications occur near the cloud boundaries, where between-class differences are less observable. Comparing this result with Figures \ref{fig:image_1}, \ref{fig:image_2}, and \ref{fig:image_3} in the Appendix, the areas with frequent misclassification errors are indeed hard to differentiate based on the covariate values. The confusion matrix of the proposed model is presented below (taking the expert labels as the truth).
```{r}
conf.mat <- confusionMatrix(Image_all_KM$all_prediction, Image_all_KM$Cloud01)
row.names(conf.mat$table) <- c("Predicted No Cloud","Predicted Cloud")
knitr::kable(conf.mat$table,
caption = "Confusion Matrix of Boosting Trees Predictions",
col.names = c("True Non-Cloud","True Cloud"),
booktabs = T ) %>%
kableExtra::kable_styling(latex_options = "HOLD_position", position = "center")
```
A large proportion of the misclassifications (4699 out of 7412) occur when the model classifies clear-surface pixels as cloud (false positives). To investigate the reason behind this, we begin by analyzing the most relevant covariates in the model.
```{r importance plot, fig.cap="\\label{fig:importance_plot}The feature importance plot in boosting trees.", fig.width=12, fig.height=3.5}
par(mfrow = c(1, 2))
# final_boost_Cut <- xgboost(
# data = as.matrix(Image_Train_Cut[, 4:11]),
# label = as.matrix(Image_Train_Cut$Cloud01),
# max.depth = 2,
# eta = 0.06,
# nrounds = 2000,
# verbose = F, nthread = parallel::detectCores(),
# objective = "binary:logistic",
# colsample_bytree=0.6,
# )
xgb.plot.importance(xgb.importance(colnames(as.matrix(Image_Train_Cut[, 4:11])), model = final_boost_All), main = "Boosting trees on horizontal cut data", cex.main = 0.75, cex.lab = 0.5, cex.axis = 0.5)
# final_boost_KM <- xgboost(
# data = as.matrix(Image_Train_KM[, 4:11]),
# label = as.matrix(Image_Train_KM$Cloud01),
# max.depth = 2,
# eta = 0.06,
# nrounds = 3000,
# verbose = F, nthread = parallel::detectCores(),
# objective = "binary:logistic",
# colsample_bytree=0.6,
# )
xgb.plot.importance(xgb.importance(colnames(as.matrix(Image_Train_KM[, 4:11])), model = final_boost_KM), main = "Boosting trees on K-means data", cex.main = 0.75, cex.lab = 0.5, cex.axis = 0.5)
```
Figure \ref{fig:importance_plot} shows the importance plots of our boosting models. In both scenarios, `NDAI`, `logSD`, and `CORR` are the most influential covariates. Figure \ref{fig:tree_vis} in the Appendix visualizes the first weak learner in the boosting model: the first split on `NDAI` alone reduces the total loss by more than 60 percent, indicating that `NDAI` is the most important predictor, while `logSD` and `CORR` further reduce the loss by a noticeable amount. After identifying the most relevant covariates, we plot their densities for correctly and incorrectly classified points.
```{r correct plot, fig.cap="\\label{fig:correct_plot}The feature density plot of correctly and wrongly predicted pixels.", fig.height=3, fig.width=12}
C1 <- Image_all_KM %>% mutate(Correctness = ifelse(all_prediction == Cloud01, "Correct Prediction", "Wrong Prediction")) %>%
ggplot(aes(x = NDAI )) +
geom_density(aes(fill = Correctness, color = Correctness), alpha = 0.7)
C2 <- Image_all_KM %>% mutate(Correctness = ifelse(all_prediction == Cloud01, "Correct Prediction", "Wrong Prediction")) %>%
ggplot(aes(x = CORR )) +
geom_density(aes(fill = Correctness, color = Correctness), alpha = 0.7)
C3 <- Image_all_KM %>% mutate(Correctness = ifelse(all_prediction == Cloud01, "Correct Prediction", "Wrong Prediction")) %>%
ggplot(aes(x = logSD )) +
geom_density(aes(fill = Correctness, color = Correctness), alpha = 0.7)
C_legend <- lemon::g_legend(C1 + guides(colour = guide_legend(nrow = 1)))
grid.arrange(arrangeGrob(C1 + ggeasy::easy_center_title() + theme(legend.position="none"),
C2 + ggeasy::easy_center_title() + theme(legend.position="none"),
C3 + ggeasy::easy_center_title() + theme(legend.position="none"),
nrow = 1),
C_legend, nrow = 2, heights = c(10, 3))
# Image_all_KM %>% mutate(Correctness = ifelse(all_prediction == Cloud01, "Correct Prediction", "Wrong Prediction")) %>%
# filter(Correctness =="Wrong Prediction" ) %>%
# ggplot(aes(x = NDAI )) +
# geom_density(aes(fill = Cloud01, color = Cloud01), alpha = 0.7)
```
Figure \ref{fig:correct_plot} shows the covariate densities, with correctly classified points in red and misclassified points in blue. Misclassification mainly occurs in the upper range of `NDAI` and `logSD`, while being roughly evenly distributed with respect to `CORR`. Such a pattern suggests that a model relying even more heavily on `NDAI` and `logSD` might be preferable. Based on this result, we propose several ways to further improve the classifier.
\subsection{Model Improvements}
Despite the high accuracy of the model, there are three potential improvements that could further strengthen the proposed classifier. The first is dimension reduction. As seen in the EDA plots, the angular covariates are strongly correlated; a dimension reduction could remove this redundant, correlated information and further strengthen the dominance of the most relevant predictors. However, dimension-reduction techniques such as PCA would complicate the interpretability of the model.
Another potential improvement is to add $L_1$ or $L_2$ regularization to reduce model complexity. A simpler model potentially reduces variance and avoids over-fitting. Nevertheless, it would require additional tuning of the regularization strength, and $L_1$ regularization can drop correlated variables somewhat arbitrarily. A sketch of how such penalties could be added to our boosting model is given below.
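This is a minimal sketch, assuming the same training matrices as in Section 3; `alpha` and `lambda` are xgboost's L1 and L2 penalty weights on the leaf scores, and the specific values here are placeholders that would still need to be tuned (for example through `CVmaster` or `xgb.cv`). It is not evaluated when knitting.
```{r regularized boosting sketch, eval = FALSE}
# Boosting trees with explicit L1 (alpha) and L2 (lambda) regularization.
boost_reg <- xgboost(
  data = as.matrix(Image_Train_KM[, 4:11]),
  label = as.matrix(Image_Train_KM$Cloud01),
  max.depth = 4,
  eta = 0.05,
  nrounds = 2000,
  alpha = 1,    # L1 penalty on leaf weights
  lambda = 2,   # L2 penalty on leaf weights
  objective = "binary:logistic",
  verbose = 0
)
```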