Hierarchical optimal reconciliation with random forest forecasts in R

I have some doubts about how to use the hts package to perform optimal reconciliation of my forecasts.
The goal is to forecast the wind energy production of some plants that are located in a macrozone.
For each macrozone {A; B; C} I have the total energy input hour by hour. Each macrozone is divided into multiple microzones {A1; A2; A3 [...]; B4; B5; B6 [...]; C7; C8; C9 [...]}, and I know the total energy input of these microzones (which account for about 95% of the total energy input of the macrozone).
I have already generated forecasts for each microzone and for the total of macrozone A with a random forest algorithm, since I use some weather variables as regressors.
I would like to figure out how to apply hierarchical optimal reconciliation to these forecasts without having to use the "forecast" function.
I report a one-day excerpt of both the actual and estimated measurements from macrozone A.
Table 1 and Table 2 contain the actual measurements to be tested, Table 3 the predicted values for the total of macrozone A, and Table 4 the predicted values for the microzones from the random forest models.
How can I apply the method of hierarchical optimal reconciliation in R to this data?
Thanks for your attention!
# Table 1
Test_MacrozoneA<-data.frame(Macrozone="A",
Date=as.Date("2021/05/15"),
Hour=c(seq(1,24,1)),
Measure=c(8.92,6.50,7.06,5.89,7.34,10.65,15.03,16.21,17.55,21.00,19.90,14.88,11.74,10.62,10.08,9.01,7.16,5.12,5.40,8.97,12.43,15.67,22.09,24.30)
)
# Table 2
Test_MicrozoneA<-data.frame(Date=as.Date("2021/05/15"),
Hour=c(seq(1,24,1)),
A1=c(1.32,1.04,1.44,1.45,3.11,4.84,9.97,11.52,11.99,14.54,13.93,7.87,6.13,6.20,5.84,5.34,4.37,2.82,3.12,6.02,8.20,12.58,19.67,21.32),
A2=c(1.90,0.79,0.92,0.50,0.43,1.18,1.14,1.01,0.33,0.12,0.79,2.18,1.07,0.36,0.23,0.20,0.09,0.06,0.06,0.06,0.11,0.12,0.08,0.02),
A3=c(1.78,1.47,1.48,1.34,1.27,1.59,0.84,0.85,1.05,1.47,1.17,0.68,0.86,1.21,1.28,1.33,1.38,1.40,1.39,1.59,1.61,1.67,1.38,1.75),
A4=c(2.54,2.09,2.03,1.28,1.13,1.15,0.75,0.89,1.35,1.81,1.82,1.65,1.63,1.27,0.97,0.92,0.78,0.53,0.60,1.09,1.70,0.69,0.14,0.20),
A5=c(0.58,0.03,0.00,0.23,0.04,0.71,1.95,1.64,2.45,2.87,2.00,2.33,1.77,1.30,1.53,1.04,0.35,0.12,0.01,0.00,0.58,0.40,0.56,0.67)
)
# Table 3
Pred_MacrozoneA<-data.frame(Macrozone="A",
Date=as.Date("2021/05/15"),
Hour=c(seq(1,24,1)),
Measure=c(5.21,6.62,4.66,4.92,6.38,7.13,8.14,8.90,12.09,15.29,19.62,17.51,21.00,20.72,17.55,15.83,15.94,14.45,10.61,7.09,5.37,7.01,11.98,16.51)
)
# Table 4
Pred_MicrozoneA<-data.frame(Date=as.Date("2021/05/15"),
Hour=c(seq(1,24,1)),
A1=c(1.84,1.61,1.26,1.16,1.41,2.47,4.54,4.34,4.14,10.97,11.74,10.73,11.61,11.37,9.50,5.73,5.75,6.12,5.64,2.34,3.70,3.06,7.23,10.53),
A2=c(0.79,1.95,1.92,2.11,2.65,2.66,1.72,0.84,0.91,1.32,1.90,2.31,2.49,2.23,1.70,1.43,1.65,1.70,1.02,0.33,0.18,0.17,0.30,0.33),
A3=c(0.46,0.43,0.38,0.22,0.23,0.56,1.28,0.83,0.75,0.77,0.89,1.02,1.05,1.29,1.43,1.48,1.47,1.41,1.34,1.36,1.42,1.43,1.23,1.76),
A4=c(2.00,0.95,0.63,0.88,0.69,0.33,1.30,1.08,1.59,2.57,2.59,2.14,1.95,1.90,1.75,2.31,2.40,2.20,2.20,1.19,0.89,0.88,0.92,0.95),
A5=c(0.67,0.74,0.55,0.38,0.23,0.30,0.30,0.55,0.58,0.97,0.80,0.81,1.24,1.83,1.79,1.55,1.13,0.56,0.36,0.12,0.16,0.20,0.25,0.30)
)
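Since the base forecasts are already plain vectors, one option is to do the reconciliation by hand. Below is a minimal sketch of OLS optimal reconciliation (the identity-weights special case), assuming a two-level hierarchy with the macrozone total on top and A1-A5 at the bottom; the object names S, P, yhat and y_tilde are mine, not part of hts:
# Summing matrix: 6 series (total + 5 microzones) in terms of the 5 bottom series
S <- rbind(Total = rep(1, 5), diag(5))
rownames(S)[2:6] <- paste0("A", 1:5)
# Base forecasts as a 24 x 6 matrix, columns ordered as the rows of S
yhat <- cbind(Total = Pred_MacrozoneA$Measure,
              as.matrix(Pred_MicrozoneA[, paste0("A", 1:5)]))
# OLS reconciliation: y_tilde = S (S'S)^{-1} S' yhat
P <- solve(t(S) %*% S) %*% t(S)
y_tilde <- t(S %*% P %*% t(yhat))
# Coherence check: the reconciled total now equals the sum of the microzones
all.equal(y_tilde[, "Total"], rowSums(y_tilde[, -1]))  # TRUE
The hts package wraps the same computation in combinef(), which takes a plain matrix of base forecasts (total first, then the bottom level) rather than forecast objects, so it should fit this workflow; here nodes = list(5) describes one parent with five children (check ?combinef for the exact column ordering it expects):
library(hts)
y_tilde_hts <- combinef(yhat, nodes = list(5), keep = "all")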

Related

Fixing Tukey multicomparison

Hi, I was running a Tukey multiple-comparison test for the data below (data and code attached), after confirming via two-way ANOVA (time and growth condition) that the data were normal, had equal variances, and showed significant interactions. The results in R and the final bar chart (1) were included as well. As you can see, the visualization could be improved and needs to be tidied up because of the redundant letters. I was advised to redo the same Tukey test but add some code to assign the samples at time point 0 h as the reference/control (somewhat like a Dunnett test with a single control). I couldn't really find any useful information on this online, so I would appreciate any help/suggestions!
data.frame(Exp1)
id growth_condition time fv fq npq in_situ rd
1 1 Control 0 0.81 0.56 0.72 0.797 1.000
2 2 Control 0 0.81 0.58 0.78 0.788 1.000
3 3 Control 0 0.80 0.59 0.76 0.793 1.000
4 4 High light+Chilled 0 0.82 0.57 0.85 0.799 1.000
5 5 High light+Chilled 0 0.81 0.59 0.75 0.796 1.000
6 6 High light+Chilled 0 0.81 0.56 0.69 0.782 1.000
7 7 Control 0.5 0.81 0.53 1.08 0.759 1.279
8 8 Control 0.5 0.81 0.56 0.72 0.759 0.668
9 9 Control 0.5 0.79 0.50 1.04 0.771 0.877
10 10 High light+Chilled 0.5 0.70 0.46 1.04 0.540 0.487
11 11 High light+Chilled 0.5 0.60 0.43 0.69 0.652 1.341
12 12 High light+Chilled 0.5 0.73 0.46 1.19 0.606 0.904
13 13 Control 8 0.82 0.52 1.20 0.753 0.958
14 14 Control 8 0.81 0.55 1.09 0.759 0.642
15 15 Control 8 0.80 0.55 1.07 0.747 0.612
16 16 High light+Chilled 8 0.44 0.28 0.58 0.230 0.471
17 17 High light+Chilled 8 0.35 0.21 0.45 0.237 0.777
18 18 High light+Chilled 8 0.54 0.35 0.68 0.186 0.342
19 19 Control 24 0.81 0.49 1.17 0.762 0.915
20 20 Control 24 0.82 0.67 1.25 0.749 0.876
21 21 Control 24 0.82 0.48 1.18 0.756 0.836
22 22 High light+Chilled 24 0.40 0.25 0.45 0.089 0.392
23 23 High light+Chilled 24 0.43 0.27 0.51 0.106 0.627
24 24 High light+Chilled 24 0.34 0.21 0.37 0.140 0.258
25 25 Control 48 0.81 0.48 1.05 0.773 0.662
26 26 Control 48 0.80 0.45 1.14 0.785 0.914
27 27 Control 48 0.82 0.47 1.09 0.792 0.912
28 28 High light+Chilled 48 0.73 0.45 0.90 0.750 0.800
29 29 High light+Chilled 48 0.70 0.51 0.79 0.626 1.305
30 30 High light+Chilled 48 0.66 0.43 0.74 0.655 0.579
Code:
# multcompLetters4() comes from the multcompView package
library(multcompView)
# growth_condition and time must be factors for TukeyHSD on their interaction
res.Exp8 <- aov(npq ~ growth_condition * time, data = Exp1)
summary(res.Exp8)
t8 <- TukeyHSD(res.Exp8)
plot(t8)
multcompLetters4(res.Exp8, t8)
Results:
$`growth_condition:time`
Control:24 Control:8 Control:48 High light+Chilled:0.5 Control:0.5 High light+Chilled:48
"a" "ab" "abc" "abc" "abc" "bcd"
High light+Chilled:0 Control:0 High light+Chilled:8 High light+Chilled:24
"cde" "cde" "de" "e"

Problem with imputing data using Package Mice

Here's a small sample of my data :
> sample_n(k,20)
A B C D E
1 1.05 2.02 8.27 0.76 1.02
2 1.2 2.28 19.56 0.62 <NA>
3 1.2 2.31 3.45 0.65 1.22
4 <NA> 2.44 6.76 0.68 1.82
5 <NA> 2.24 6.99 0.59 1.37
6 0.87 1.71 3.32 0.64 1.87
7 <NA> 1.77 3.4 0.6 2.13
8 <NA> 2.17 4.13 0.81 1.19
9 <NA> 1.96 4.39 <NA> 1.66
10 1.15 2.28 14.73 0.73 1.57
11 <NA> 1.76 <NA> 0.79 2.66
12 <NA> 1.97 9 0.81 1.38
13 <NA> 2.18 9.32 0.78 0.9
14 <NA> 1.93 2.3 0.78 1.62
15 1.02 2.05 2.81 0.78 1.24
16 0.94 1.77 1.69 0.73 1.83
17 1.17 2.21 14.79 0.66 1.34
18 1.11 2.18 9.41 <NA> 1.32
19 1.35 2.51 20.44 0.76 0.73
20 <NA> 2.37 <NA> 0.74 1.41
I'm trying to impute the missing data using the mice package:
new_df = mice(df, method = "cart")
I get the following error:
Error in edit.setup(data, setup, ...) :
`mice` detected constant and/or collinear variables. No predictors were left after their removal.
How can I fix this?
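One common cause, given that the numeric-looking columns print with <NA>: they were read as character or factor, so once mice drops them no predictors remain. A hedged sketch of a fix along those lines (the type conversion is an assumption about your data; quickpred() is a real mice helper, and I use df as in the mice call above):
library(mice)
# Convert factor/character columns to numeric (assumption: the <NA> display
# means these columns were not read as numeric)
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
# Predictor matrix mice would use; all-zero rows flag unusable variables
pred <- quickpred(df)
imp <- mice(df, method = "cart", predictorMatrix = pred, seed = 1)  # seed is arbitrary
new_df <- complete(imp)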

Is there an R function I could use to create a clustered column chart from a dataset imported in CSV format, using ggplot2?

I want to plot a stacked column chart using ggplot2 with R1, R2, R3 as the y variables while the variety names remain on the x axis.
I tried it in Excel and it worked, but I decided to import the dataset in CSV format into R for a more polished result, as this is part of my final-year project.
varieties R1 R2 R3 Relative.yield SD
1 bd 0.40 2.65 1.45 1.50 1.13
2 bdj1 4.60 NA 2.80 3.70 1.27
3 bdj2 2.40 1.90 0.50 1.60 0.98
4 bdj3 2.40 1.65 5.20 3.08 1.87
5 challenge 2.10 5.15 1.35 2.87 2.01
6 doris 4.20 2.50 2.55 3.08 0.97
7 fel 0.80 2.40 0.75 1.32 0.94
8 fel2 NA 0.70 1.90 1.30 0.85
9 felbv 0.10 2.95 2.05 1.70 1.46
10 felnn 1.50 4.05 1.25 2.27 1.55
11 lad1 0.55 2.20 0.20 0.98 1.07
12 lad2 0.50 NA 0.50 0.50 0.00
13 lad3 1.10 3.90 1.00 2.00 1.65
14 lad4 1.50 1.65 0.50 1.22 0.63
15 molete1 2.60 1.80 2.75 2.38 0.51
16 molete2 1.70 4.70 4.20 3.53 1.61
17 mother's delight 0.10 4.00 1.90 2.00 1.95
18 ojaoba1a 1.90 3.45 2.75 2.70 0.78
19 ojaoba1b 4.20 2.75 4.30 3.75 0.87
20 ojoo 2.80 NA 3.60 3.20 0.57
21 omini 0.20 0.30 0.25 0.25 0.05
22 papa1 2.20 6.40 3.55 4.05 2.14
23 pk5 1.00 2.75 1.10 1.62 0.98
24 pk6 2.30 1.30 3.10 2.23 0.90
25 sango1a 0.40 0.90 1.55 0.95 0.58
26 sango1b 2.60 5.10 3.15 3.62 1.31
27 sango2a 0.50 0.55 0.75 0.60 0.13
28 sango2b 2.95 NA 2.60 2.78 0.25
29 usman 0.60 3.50 1.20 1.77 1.53
30 yau 0.05 0.85 0.20 0.37 0.43
> barplot(yield$R1)
> barplot(yield$Relative.yield)
> barplot(yield$Relative.yield, names.arg = varieties)
Error in barplot.default(yield$Relative.yield, names.arg = varieties) :
object 'varieties' not found
> ggplot(data = yield, mapping = aes(x = varieties, y = yield[,2:4])) + geom_()
Error in geom_() : could not find function "geom_"
> ggplot(data = yield, mapping = aes(x = varieties, y = yield[,2:4])) + geom()
Error in geom() : could not find function "geom"
You should put it in long format first; tidyr::gather provides this functionality:
library(tidyverse)
gather(df[1:4], R, value, R1:R3) %>%
  ggplot(aes(varieties, value, fill = R)) + geom_col()
#> Warning: Removed 5 rows containing missing values (position_stack).
data
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
" varieties R1 R2 R3 Relative.yield SD
1 bd 0.40 2.65 1.45 1.50 1.13
2 bdj1 4.60 NA 2.80 3.70 1.27
3 bdj2 2.40 1.90 0.50 1.60 0.98
4 bdj3 2.40 1.65 5.20 3.08 1.87
5 challenge 2.10 5.15 1.35 2.87 2.01
6 doris 4.20 2.50 2.55 3.08 0.97
7 fel 0.80 2.40 0.75 1.32 0.94
8 fel2 NA 0.70 1.90 1.30 0.85
9 felbv 0.10 2.95 2.05 1.70 1.46
10 felnn 1.50 4.05 1.25 2.27 1.55
11 lad1 0.55 2.20 0.20 0.98 1.07
12 lad2 0.50 NA 0.50 0.50 0.00
13 lad3 1.10 3.90 1.00 2.00 1.65
14 lad4 1.50 1.65 0.50 1.22 0.63
15 molete1 2.60 1.80 2.75 2.38 0.51
16 molete2 1.70 4.70 4.20 3.53 1.61
17 'mother\\'s delight' 0.10 4.00 1.90 2.00 1.95
18 ojaoba1a 1.90 3.45 2.75 2.70 0.78
19 ojaoba1b 4.20 2.75 4.30 3.75 0.87
20 ojoo 2.80 NA 3.60 3.20 0.57
21 omini 0.20 0.30 0.25 0.25 0.05
22 papa1 2.20 6.40 3.55 4.05 2.14
23 pk5 1.00 2.75 1.10 1.62 0.98
24 pk6 2.30 1.30 3.10 2.23 0.90
25 sango1a 0.40 0.90 1.55 0.95 0.58
26 sango1b 2.60 5.10 3.15 3.62 1.31
27 sango2a 0.50 0.55 0.75 0.60 0.13
28 sango2b 2.95 NA 2.60 2.78 0.25
29 usman 0.60 3.50 1.20 1.77 1.53
30 yau 0.05 0.85 0.20 0.37 0.43"
)
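Note that gather() is superseded in current tidyr; pivot_longer() does the same job (a sketch using the same df as above):
library(tidyverse)
df %>%
  pivot_longer(R1:R3, names_to = "R", values_to = "value") %>%
  ggplot(aes(varieties, value, fill = R)) +
  geom_col()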

How to do a Shapiro test for multiple columns in a data.frame? And avoid 2 errors: values are identical, and missing value where TRUE/FALSE needed

I have a dataframe like this:
head(Betula, 10)
year start Start_DayOfYear end End_DayOfYear duration DateMax Max_DayOfYear BetulaPollenMax SPI Jan.NAO Jan.AO
1 1997 <NA> NA <NA> NA NA <NA> NA NA NA -0.49 -0.46
2 1998 <NA> 143 <NA> 184 41 <NA> 146 42 361 0.39 -2.08
3 1999 <NA> 148 <NA> 188 40 <NA> 158 32 149 0.77 0.11
4 2000 <NA> 135 <NA> 197 62 <NA> 156 173 917 0.60 1.27
5 2001 <NA> 143 <NA> 175 32 <NA> 154 113 457 0.25 -0.96
Jan.SO Feb.NAO Feb.AO Feb.SO Mar.NAO Mar.AO Mar.SO Apr.NAO Apr.AO Apr.SO DecJanFebMarApr.NAO DecJanFebMar.NAO
1 0.5 1.70 1.89 1.7 1.46 1.09 -0.4 -1.02 0.32 -0.6 0.14 0.43
2 -2.7 -0.11 -0.18 -2.0 0.87 -0.25 -2.4 -0.68 -0.04 -1.4 0.27 0.51
3 1.8 0.29 0.48 1.0 0.23 -1.49 1.3 -0.95 0.28 1.4 0.39 0.73
4 0.7 1.70 1.08 1.7 0.77 -0.45 1.3 -0.03 -0.28 1.2 0.49 0.62
5 1.0 0.45 -0.62 1.7 -1.26 -1.69 0.9 0.00 0.91 0.2 -0.28 -0.35
DecJanFeb.NAO DecJan.NAO JanFebMarApr.NAO JanFebMar.NAO JanFeb.NAO FebMarApr.NAO FebMar.NAO MarApr.NAO
1 0.08 -0.73 0.41 0.89 0.61 0.71 1.58 0.22
2 0.38 0.63 0.12 0.38 0.14 0.03 0.38 0.10
3 0.89 1.19 0.09 0.43 0.53 -0.14 0.26 -0.36
4 0.57 0.01 0.76 1.02 1.15 0.81 1.24 0.37
5 -0.04 -0.29 -0.14 -0.19 0.35 -0.27 -0.41 -0.63
DecJanFebMarApr.AO DecJanFebMar.AO DecJanFeb.AO DecJan.AO JanFebMarApr.AO JanFebMar.AO JanFeb.AO FebMarApr.AO
1 0.55 0.61 0.45 -0.27 0.71 0.84 0.72 1.10
2 -0.24 -0.29 -0.30 -0.37 -0.64 -0.84 -1.13 -0.16
3 0.08 0.04 0.54 0.58 -0.16 -0.30 0.30 -0.24
4 -0.15 -0.11 0.00 -0.54 0.41 0.63 1.18 0.12
5 -0.74 -1.15 -0.97 -1.14 -0.59 -1.09 -0.79 -0.47
FebMar.AO MarApr.AO DecJanFebMarApr.SO DecJanFebMar.SO DecJanFeb.SO DecJan.SO JanFebMarApr.SO JanFebMar.SO
1 1.49 0.71 0.04 0.20 0.40 -0.25 0.30 0.60
2 -0.22 -0.15 -1.42 -1.43 -1.10 -0.65 -2.13 -2.37
3 -0.51 -0.61 1.38 1.38 1.40 1.60 1.38 1.37
4 0.32 -0.37 1.14 1.13 1.07 0.75 1.23 1.23
5 -1.16 -0.39 0.60 0.70 0.63 0.10 0.95 1.20
JanFeb.SO FebMarApr.SO FebMar.SO MarApr.SO TmaxAprI TminAprI TmeanAprI RainfallAprI HumidityAprI SunshineAprI
1 1.10 0.23 0.65 -0.50 3.27 -3.86 -0.44 0.82 76.3 3.45
2 -2.35 -1.93 -2.20 -1.90 4.52 -3.28 -0.15 0.12 73.5 7.12
3 1.40 1.23 1.15 1.35 4.11 -3.86 -0.34 1.32 78.4 4.85
4 1.20 1.40 1.50 1.25 6.11 -1.31 1.93 0.80 71.9 4.20
5 1.35 0.93 1.30 0.55 1.46 -2.37 -1.04 2.83 84.4 1.21
CloudAprI WindAprI SeeLevelPressureAprI TmaxAprII TminAprII TmeanAprII RainfallAprII HumidityAprII
1 6.30 5.26 1008.63 12.12 2.11 6.17 0.23 76.5
2 3.93 3.86 1022.39 5.57 -0.44 1.82 0.83 77.9
3 5.02 3.23 1007.09 0.20 -6.36 -3.23 2.63 82.5
4 6.15 5.13 1012.21 2.74 -4.88 -2.35 0.34 76.0
5 7.50 3.90 1009.50 6.75 -3.22 1.16 0.32 71.5
SunshineAprII CloudAprII WindAprII SeeLevelPressureAprII TmaxAprIII TminAprIII TmeanAprIII RainfallAprIII
1 3.12 6.53 5.19 1024.31 7.35 0.33 3.37 0.33
2 2.41 6.85 3.70 1012.01 6.34 0.76 2.69 2.01
3 4.99 5.87 6.23 1019.66 8.65 0.73 4.23 0.70
4 6.63 5.17 5.84 1022.62 5.84 -1.81 2.02 0.00
5 6.11 4.82 3.92 1018.81 8.47 1.02 4.17 1.09
HumidityAprIII SunshineAprIII CloudAprIII WindAprIII SeeLevelPressureAprIII TmaxDecI TminDecI TmeanDecI
1 75.0 3.73 6.40 4.08 1009.91 -0.90 -5.88 -3.67
2 83.5 1.52 7.31 4.66 1008.33 5.33 0.01 2.46
3 73.4 6.62 5.12 3.16 1017.01 -0.24 -6.93 -3.64
4 69.0 8.80 4.80 4.99 1021.18 4.67 1.86 2.79
5 72.7 5.33 5.41 4.27 1005.48 3.69 -1.43 1.65
RainfallDecI HumidityDecI SunshineDecI CloudDecI WindDecI SeeLevelPressureDecI TmaxDecII TminDecII TmeanDecII
1 0.12 77.3 0.22 5.08 3.49 1003.15 7.99 0.77 4.10
2 1.10 73.5 0.04 6.29 5.21 999.94 0.24 -4.74 -2.67
3 2.41 82.3 0.00 6.70 4.92 998.64 1.22 -5.90 -2.05
4 3.13 88.1 0.00 7.97 4.00 997.82 2.76 -3.89 -0.54
5 1.60 79.1 0.07 5.44 5.76 996.35 10.82 4.36 6.90
RainfallDecII HumidityDecII SunshineDecII CloudDecII WindDecII SeeLevelPressureDecII TmaxDecIII TminDecIII
1 1.90 71.3 0 4.96 5.55 1007.16 4.78 -2.12
2 4.34 82.2 0 7.03 6.06 998.02 2.07 -4.60
3 1.94 78.6 0 6.53 5.82 1008.33 2.09 -2.48
4 1.45 77.2 0 6.57 5.26 1005.11 -1.49 -8.37
5 1.15 66.6 0 5.74 5.47 1030.02 1.40 -7.34
TmeanDecIII RainfallDecIII HumidityDecIII SunshineDecIII CloudDecIII WindDecIII SeeLevelPressureDecIII TmaxFebI
1 1.15 3.96 82.36 0 6.01 4.02 991.60 -0.23
2 -0.51 4.10 81.18 0 6.67 3.91 986.52 0.79
3 -0.61 1.97 81.27 0 6.21 5.53 982.13 2.19
4 -5.28 1.26 79.64 0 6.11 4.22 1019.63 3.27
5 -3.45 1.19 82.18 0 6.20 4.77 1015.53 2.42
TminFebI TmeanFebI RainfallFebI HumidityFebI SunshineFebI CloudFebI WindFebI SeeLevelPressureFebI TmaxFebII
1 -6.67 -3.57 0.84 84.3 1.11 6.81 5.35 990.51 2.97
2 -7.79 -4.49 2.31 72.2 1.88 4.73 4.53 990.39 3.31
3 -4.14 -1.77 0.42 73.3 1.29 6.02 5.57 1007.67 1.55
4 -2.48 0.04 2.28 77.0 0.46 6.84 4.29 982.97 -1.24
5 -3.52 -0.74 1.98 81.5 0.76 5.78 4.93 1008.29 6.71
TminFebII TmeanFebII RainfallFebII HumidityFebII SunshineFebII CloudFebII WindFebII SeeLevelPressureFebII
1 -2.31 -0.10 1.44 82.2 1.07 6.45 4.42 980.59
2 -4.85 -0.99 3.84 75.0 2.54 5.91 5.05 999.98
3 -5.76 -2.44 2.89 75.3 0.40 6.95 5.82 990.44
4 -8.47 -4.65 3.33 83.1 0.63 6.55 4.95 1000.10
5 -0.25 3.01 1.38 66.1 1.16 6.18 6.28 1001.46
TmaxFebIII TminFebIII TmeanFebIII RainfallFebIII HumidityFebIII SunshineFebIII CloudFebIII WindFebIII
1 0.05 -6.01 -3.35 4.60 83.50 1.29 6.58 4.71
2 -0.45 -7.43 -4.51 2.93 78.38 1.00 6.91 5.99
3 2.13 -4.51 -1.21 2.90 79.38 2.51 5.76 5.46
4 0.59 -3.79 -1.92 5.94 88.33 1.40 6.86 6.70
5 -2.68 -7.23 -5.05 1.39 83.88 1.13 7.41 5.69
SeeLevelPressureFebIII TmaxJanI TminJanI TmeanJanI RainfallJanI HumidityJanI SunshineJanI CloudJanI WindJanI
1 980.25 0.38 -5.57 -3.36 0.01 82.9 0.27 3.45 2.97
2 997.71 4.29 -0.03 2.08 3.70 82.9 0.00 7.39 5.01
3 988.45 1.02 -4.47 -1.87 2.22 82.3 0.00 6.94 4.29
4 987.21 0.04 -6.28 -3.03 4.99 85.8 0.00 5.84 4.75
5 1023.84 -0.33 -5.11 -3.17 0.66 81.2 0.00 7.08 3.88
SeeLevelPressureJanI TmaxJanII TminJanII TmeanJanII RainfallJanII HumidityJanII SunshineJanII CloudJanII
1 1023.71 0.09 -6.48 -2.50 4.29 86.5 0.01 7.23
2 984.57 -0.34 -6.49 -3.61 2.74 80.2 0.23 6.99
3 1004.06 0.32 -5.59 -3.03 5.28 83.3 0.00 6.68
4 983.42 8.38 1.46 4.97 0.64 69.3 0.10 6.13
5 1010.31 7.35 3.00 5.09 1.27 66.3 0.03 6.19
WindJanII SeeLevelPressureJanII TmaxJanIII TminJanIII TmeanJanIII RainfallJanIII HumidityJanIII SunshineJanIII
1 5.42 998.88 5.66 -2.39 1.97 1.03 74.27 0.65
2 6.38 1011.44 3.84 -3.32 -0.37 0.70 73.55 0.55
3 6.24 980.15 4.33 -5.19 -0.59 2.23 76.64 0.69
4 6.44 1019.41 4.09 -2.67 0.05 2.18 71.73 0.42
5 6.74 1006.10 4.43 -0.86 1.58 1.91 80.09 0.20
CloudJanIII WindJanIII SeeLevelPressureJanIII TmaxMarI TminMarI TmeanMarI RainfallMarI HumidityMarI
1 6.47 7.59 1004.59 2.83 -3.60 -0.72 2.14 79.9
2 5.25 4.72 1019.95 -5.31 -12.52 -9.52 2.28 72.6
3 5.34 4.65 1001.66 -0.70 -6.67 -4.47 1.39 81.0
4 5.85 4.83 1007.23 0.10 -7.91 -3.98 2.36 80.2
5 6.53 3.63 992.53 -0.38 -4.59 -2.27 3.00 86.4
SunshineMarI CloudMarI WindMarI SeeLevelPressureMarI TmaxMarII TminMarII TmeanMarII RainfallMarII HumidityMarII
1 0.85 6.77 6.64 986.96 -1.48 -8.43 -5.58 1.09 81.0
2 2.92 5.91 4.68 1013.17 6.53 -1.81 2.56 0.43 65.5
3 2.40 5.71 4.02 1014.62 0.53 -5.17 -2.90 5.20 82.8
4 0.91 7.02 5.87 1006.64 5.32 -0.94 1.23 1.11 74.4
5 0.19 7.82 4.49 999.35 1.60 -4.29 -1.89 0.95 79.3
SunshineMarII CloudMarII WindMarII SeeLevelPressureMarII TmaxMarIII TminMarIII TmeanMarIII RainfallMarIII
1 2.12 5.51 3.93 1021.57 3.88 -1.95 0.55 1.42
2 2.25 6.29 6.11 1008.31 3.95 -2.46 -0.15 1.30
3 1.00 6.61 5.77 1006.63 -0.68 -6.60 -4.07 0.70
4 2.16 6.61 6.45 1003.23 5.49 -0.68 1.65 1.58
5 4.07 5.21 3.14 1017.24 -0.66 -7.21 -4.00 1.37
HumidityMarIII SunshineMarIII CloudMarIII WindMarIII SeeLevelPressureMarIII
1 80.45 2.80 6.13 4.03 995.31
2 72.09 3.98 5.99 5.14 1000.32
3 78.73 2.34 6.46 3.81 1005.67
4 74.64 2.85 6.54 6.34 1013.45
5 79.45 4.71 5.65 4.95 1010.47
[ reached 'max' / getOption("max.print") -- omitted 5 rows ]
And I would like to do the normality test for all columns at once. I tried
apply(x, shapiro.test)
Betula_shapiro <- apply(Betula, shapiro.test)
Error in FUN(X[[i]], ...) : is.numeric(x) is not TRUE
and it didn't work. I also tried this:
Betula <- apply(Betula[which(sapply(Betula, is.numeric))], 2, shapiro.test)
Error in FUN(newX[, i], ...) : all 'x' values are identical
f<-function(x){if(diff(range(x))==0)list()else shapiro.test(x)}
Betula <- apply(Betula[which(sapply(Betula, is.numeric))], 2, f)
Error in if (diff(range(x)) == 0) list() else shapiro.test(x) :
missing value where TRUE/FALSE needed
So I did:
Betula_numerics_only <- Betula[which(sapply(Betula, is.numeric))]
Selecting columns with at least 3 non-missing values and applying shapiro.test to them:
Betula_numerics_only_filled_columns <- Betula_numerics_only[which(apply(Betula_numerics_only, 2, function(f) sum(!is.na(f))>=3 ))]
Betula_shapiro<-apply(Betula_numerics_only_filled_columns, 2, shapiro.test)
Error in FUN(newX[, i], ...) : all 'x' values are identical
Could you please help me with this problem?
Since I was talking about readability in my comment, I felt I should provide something more readable as an answer.
Let's make some dummy data:
data_test <- data.frame(matrix(rnorm(100, 10, 1), ncol = 5, byrow = T), stringsAsFactors = F)
Let's apply shapiro.test to each column:
apply(data_test, 2, shapiro.test)
In case there are non-numeric columns:
Let's add a dummy character column for testing purposes
data_test$non_numeric <- sample(c("hello", "hi", "good morning"), NROW(data_test), replace = T)
and try to apply the test again
apply(data_test, 2, shapiro.test)
which results in:
> apply(data_test, 2, shapiro.test)
Error: is.numeric(x) is not TRUE
To solve this we select only the numeric columns using sapply:
data_test[which(sapply(data_test, is.numeric))]
and combine it with the apply:
apply(data_test[which(sapply(data_test, is.numeric))], 2, shapiro.test)
Removing columns that are all NA:
data_test_numerics_only <- data_test[which(sapply(data_test, is.numeric))]
Selecting columns with at least 3 non-missing values and applying shapiro.test to them:
data_test_numerics_only_filled_columns <- data_test_numerics_only[which(apply(data_test_numerics_only, 2, function(f) sum(!is.na(f)) >= 3))]
apply(data_test_numerics_only_filled_columns, 2, shapiro.test)
We will get this running; let's try once more :)
Remove non-numeric columns:
Betula_numerics <- Betula[which(sapply(Betula, is.numeric))]
Remove columns with fewer than 3 non-missing values:
Betula_numerics_filled <- Betula_numerics[which(apply(Betula_numerics, 2, function(f) sum(!is.na(f)) >= 3))]
Remove columns with zero variance:
Betula_numerics_filled_not_constant <- Betula_numerics_filled[apply(Betula_numerics_filled, 2, function(f) var(f, na.rm = TRUE) != 0)]
shapiro.test and hope for the best :)
apply(Betula_numerics_filled_not_constant, 2, shapiro.test)
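A more compact equivalent of those three filters in one pass (a sketch, assuming the same Betula frame; shapiro.test itself tolerates NAs as long as 3 to 5000 non-missing values remain):
# Keep columns that are numeric, have >= 3 non-missing values, and are not constant
ok <- vapply(Betula, function(x)
  is.numeric(x) && sum(!is.na(x)) >= 3 && var(x, na.rm = TRUE) > 0,
  logical(1))
lapply(Betula[ok], shapiro.test)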

Extract the column number which has the last data point from a table

I have the following inverted data frame
z
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
14 -8.70 0.28 18.66 4.81 -34.33 40.39 3.09 7.89 49.41
13 -6.10 9.51 -1.09 -0.01 7.89 -7.37 -0.61 -9.79 31.75 40.67 5.41 -10.53
12 -5.21 7.49 -7.92 3.54 11.19 -6.66 23.64 13.21 9.64 14.44 59.95 -20.96
11 -12.68 11.04 -11.10 -6.18 -5.61 8.93 94.99 30.15 14.37 31.08 -9.02 -14.77
10 5.07 -2.04 22.77 12.05 0.38 -3.28 -2.73 11.26 5.30 4.61 13.80 3.68
9 -0.82 0.86 3.18 1.06 6.47 1.57 2.25 -9.34 5.27 7.25 2.85 0.42
8 10.48 1.17 10.97 -0.13 0.32 -5.89 -2.26 -7.28 -1.39 3.35 14.81 3.40
7 -5.22 3.09 -7.75 -3.41 -0.09 12.37 -17.38 1.41 8.57 10.48 -1.20 7.45
6 13.85 7.22 3.14 -2.92 -7.12 0.45 3.51 -2.30 7.07 -2.83 -2.27 -1.52
5 -0.57 0.58 -2.59 3.29 -6.07 0.37 1.32 -0.58 4.07 -4.85 -0.48 1.66
4 0.46 -0.41 3.01 0.60 2.20 -2.39 0.22 3.99 5.50 16.07 -4.51 0.50
3 1.28 5.10 -3.61 5.02 3.04 -4.05 -2.64 1.88 -2.44 3.27 -2.71 2.02
2 -1.28 0.99 2.38 0.16 1.03 10.93 5.07 0.26 0.84 -0.05 -0.88 -3.71
1 2.33 -1.71 -0.41 -0.58 -2.19 1.26 1.88 -4.03 0.54 0.34 0.22 -0.50
I would like to find out which column holds the last data point (in this example -0.50) and extract that column's name, in this case Dec, as a number (12), without using the -0.50 value itself. The expressions below didn't work:
which( colnames(z)==-0.50)
integer(0)
which( colnames(z)==z[length(z)])
integer(0)
Second example
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
18 -12.97 9.96 8.14
17 1.50 3.27 7.38 -1.63 8.53 2.97 1.51 10.99 4.51 -5.70 1.15 9.50
16 -1.38 3.61 -3.98 10.51 -8.39 5.29 -2.01 -3.47 -0.17 -6.20 13.93 9.04
15 -3.96 1.72 -3.28 2.06 -0.26 -1.27 -4.58 3.23 -7.76 2.09 7.33 16.81
14 4.38 0.56 7.09 -5.31 -2.61 -2.66 0.66 0.56 4.64 13.75 -7.10 -5.15
13 -10.13 -6.04 12.62 -3.76 -3.96 7.95 4.71 6.04 7.63 -7.96 -0.69 14.16
12 5.95 11.95 -10.80 2.45 10.19 -5.20 -0.68 0.62 0.26 4.72 -2.48 10.27
11 2.72 11.56 -0.80 -8.62 0.28 -2.96 1.33 3.09 5.14 4.03 6.37 -0.19
10 -5.38 6.58 4.64 -4.21 6.62 3.13 -1.85 7.63 -6.17 -2.95 7.32 -4.37
9 4.20 -2.58 4.01 5.66 -2.94 -1.17 -0.47 4.54 -1.10 1.48 3.24 2.14
8 3.86 -5.93 -3.95 6.46 5.05 1.91 -1.18 -0.88 6.99 2.52 2.42 0.24
7 3.85 7.95 -0.66 -0.99 1.99 5.06 -4.63 -3.00 -0.41 3.73 4.97 2.10
6 0.99 -0.21 -1.64 -3.01 -2.03 -1.26 -1.52 0.32 2.85 -1.59 5.12 -2.45
5 -2.64 2.33 4.91 1.75 -1.01 1.47 -2.78 4.78 0.94 2.51 -2.01 3.75
4 0.08 1.51 0.25 3.00 -2.16 -2.51 4.59 1.43 0.16 -2.59 0.97 1.65
3 0.63 -0.83 -0.68 0.12 -0.22 -3.17 4.41 -1.29 -2.18 -2.54 1.00 1.36
2 2.51 0.17 2.66 3.41 -2.40 -1.77 -0.63 -3.80 3.47 3.20 2.20 0.37
1 -2.37
The last point is Jan, -2.37.
Thanks
My answer is based on @BrodieG's.
You could try nchar to test for "empty cells" (this returns the position of the last empty cell, in column-major order):
tail(which(nchar(as.matrix(z)) == 0, arr.ind = TRUE), 1)
# Index of the last non-NA value; t() makes which() scan each row left to right
col <- max(which(!is.na(t(as.matrix(z))))) %% ncol(z)
if (!col) col <- ncol(z)  # a remainder of 0 means the last column
names(z)[[col]]
# [1] "Dec"
This assumes "empty" values are NA, and that z is a data frame. I tested this by removing some values from the end, and it worked as well.
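Wrapped into a small reusable function (a sketch under the same NA assumption), which also returns 1 / "Jan" on the second example:
last_col <- function(z) {
  # Last non-NA value when scanning each row left to right
  idx <- max(which(!is.na(t(as.matrix(z)))))
  col <- idx %% ncol(z)
  if (col == 0) col <- ncol(z)
  list(number = col, name = names(z)[[col]])
}
last_col(z)  # number 12, name "Dec" for the first example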
