How to predict with a regression model with many missing values? - r

I want to build a regression model with a dummy (binary) variable as the dependent variable.
I'm using the glm function, but I can't get predictions from it. I don't want to exclude the missing values. What is the best way to make good predictions when the data set has many missing values?
n$status <-as.factor(n$status)
set.seed(900)
training.samples <- n$status %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- n[training.samples, ]
test.data <- n[-training.samples, ]
model=glm(status ~.,data = train.data,family = binomial(link = "logit"))
m <- data.frame(x1=mean(n$x1,na.rm=T),x2=mean(n$x2,na.rm=T))
m$predictprob <- predict(model, newdata=m, type="response")
Error in eval(predvars, data, env) : object 'x1' not found
This error appears when I try to predict. I think it must be because of the missing values.
str(n)
'data.frame': 4371 obs. of 8 variables:
$ status: Factor w/ 2 levels "Active","Inactive": 1 1 1 1 1 1 1 1 1 1 ...
$ x1 : num 12.2 12.4 13.1 10.9 22.7 ...
$ x2 : num 4.27 2.17 5.91 5.81 7.44 ...
$ x3 : num 8.3 7.71 12.41 9.34 19.57 ...
$ x4 : num 2.91 1.34 5.61 4.99 6.43 ...
$ x5 : num 4.51 1.83 9.11 10.68 14.23 ...
$ x6 : num 3.7 4.94 12.27 11.29 15.13 ...
$ x7 : num 2.22 3.4 1.12 0.84 1.11 4.07 8.15 0.79 8.16 8.86 ...
dput(train.data[1:10,])
structure(list(Status = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Active", "Inactive"), class = "factor"),
x1 = c(12.17, 12.41, 13.07, 10.88, 22.66, 43.54, 64.75,
255.43, 10.05, 1.84), x2 = c(4.27, 2.17, 5.91, 5.81, 7.44,
17.17, 22.51, 9.29, 0.78, 0.42), x3 = c(8.3, 7.71, 12.41,
9.34, 19.57, 33.7, 48.1, 252.75, 6.89, 2.24), x4 = c(2.91,
1.34, 5.61, 4.99, 6.43, 13.29, 16.72, 9.19, 0.53, 0.51),
x5 = c(4.51, 1.83, 9.11, 10.68, 14.23, 8.99, 7.94, 19.73,
1.09, 0.2), x6 = c(3.7, 4.94, 12.27, 11.29, 15.13, 9.07,
7.94, 21.21, 0.96, 0.02), x7 = c(2.22, 3.4, 1.12, 0.84,
1.11, 4.07, 8.15, 0.79, 8.16, 8.86)), row.names = c(NA, 10L), class =
"data.frame")
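One thing that can be checked independently of the missing values: predict() with newdata needs a column for every predictor in the formula, and status ~ . uses all of x1 to x7, not only x1 and x2. The sketch below illustrates this on synthetic data (the data frame toy and its three predictors are made up for illustration, and this is not necessarily the cause of the exact error message above). It also shows that glm() drops incomplete rows under the default na.action.

```r
# Synthetic stand-in for the poster's data: a two-level factor and three
# numeric predictors with some missing values (all names are illustrative).
set.seed(1)
n_obs <- 200
toy <- data.frame(
  x1 = rnorm(n_obs, 10, 3),
  x2 = rnorm(n_obs, 5, 2),
  x3 = rnorm(n_obs, 8, 4)
)
toy$status <- factor(ifelse(toy$x1 + rnorm(n_obs, 0, 3) > 10, "Inactive", "Active"))
toy$x2[sample(n_obs, 20)] <- NA
toy$x3[sample(n_obs, 20)] <- NA

# glm() drops incomplete rows under the default na.action (na.omit)
fit <- glm(status ~ ., data = toy, family = binomial(link = "logit"))

# newdata needs one column per predictor in the formula, not only x1 and x2
predictors <- setdiff(names(toy), "status")
new_obs <- as.data.frame(as.list(colMeans(toy[predictors], na.rm = TRUE)))

new_obs$predictprob <- predict(fit, newdata = new_obs, type = "response")
new_obs
```

Comparing nobs(fit) with nrow(toy) shows how many rows were actually used for fitting. If too many rows are lost, imputation is a possible alternative to dropping them.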

Related

Two-way ANOVA in R. How to check all the dataframe variables automatically?

I have a dataframe with 33 variables and 1 dependent variable. I need to perform a two-way ANOVA test to see their impact.
For now I have to type the variables manually:
two.way <- aov(`Yield t/ha` ~
TypeP*PreviousCulture *
T1may*T2may*T3may*T1june*
T2june*T3june*T1july*T2july*
T3july*T1aug*T2aug*T3aug*
T1sept*T2sept*T3sept*
P1may*P2may*P3may*P1june*
P2june*P3june*P1july*P2july*
P3july*P1aug*P2aug*P3aug*
P1sept*P2sept*P3sept,
data = KemData)
summary(two.way)
Is there another way to pass these variables to the aov() function?
A sample of data:
> dput(head(KemData, 6))
structure(list(TypeP = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Combined deep",
"Deep moldboard"), class = "factor"), PreviousCulture = structure(c(1L,
3L, 2L, 1L, 3L, 2L), .Label = c("Pure steam", "Sideral steam (melilot)",
"Sideral steam (rapeseed)"), class = "factor"), `Yield t/ha` = c(1.53,
1.33, 1.46, 0.5, 0.66, 0.58), T1may = c(9.55, 9.55, 9.55, 11.04,
11.04, 11.04), T2may = c(5.92, 5.92, 5.92, 6.89, 6.89, 6.89),
T3may = c(9.26, 9.26, 9.26, 7.61, 7.61, 7.61), T1june = c(11.43,
11.43, 11.43, 8.02, 8.02, 8.02), T2june = c(16.37, 16.37,
16.37, 18.28, 18.28, 18.28), T3june = c(15.89, 15.89, 15.89,
22.34, 22.34, 22.34), T1july = c(16.01, 16.01, 16.01, 21.1,
21.1, 21.1), T2july = c(20.02, 20.02, 20.02, 20.85, 20.85,
20.85), T3july = c(19.02, 19.02, 19.02, 18, 18, 18), T1aug = c(18.57,
18.57, 18.57, 17.32, 17.32, 17.32), T2aug = c(16.53, 16.53,
16.53, 20.82, 20.82, 20.82), T3aug = c(15.36, 15.36, 15.36,
13.64, 13.64, 13.64), T1sept = c(12.46, 12.46, 12.46, 10.45,
10.45, 10.45), T2sept = c(6.89, 6.89, 6.89, 7.33, 7.33, 7.33
), T3sept = c(6.64, 6.64, 6.64, 5.98, 5.98, 5.98), P1may = c(1.69,
1.69, 1.69, 0.06, 0.06, 0.06), P2may = c(2.44, 2.44, 2.44,
2.8, 2.8, 2.8), P3may = c(2.04, 2.04, 2.04, 3.94, 3.94, 3.94
), P1june = c(1, 1, 1, 2.23, 2.23, 2.23), P2june = c(1.73,
1.73, 1.73, 0.87, 0.87, 0.87), P3june = c(1.34, 1.34, 1.34,
0.31, 0.31, 0.31), P1july = c(5.65, 5.65, 5.65, 0.44, 0.44,
0.44), P2july = c(0.18, 0.18, 0.18, 2.18, 2.18, 2.18), P3july = c(6.7,
6.7, 6.7, 3.57, 3.57, 3.57), P1aug = c(3.38, 3.38, 3.38,
0.62, 0.62, 0.62), P2aug = c(7.65, 7.65, 7.65, 1.26, 1.26,
1.26), P3aug = c(2.73, 2.73, 2.73, 4.5, 4.5, 4.5), P1sept = c(0.31,
0.31, 0.31, 1.44, 1.44, 1.44), P2sept = c(2.94, 2.94, 2.94,
3.13, 3.13, 3.13), P3sept = c(1.65, 1.65, 1.65, 0.64, 0.64,
0.64)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
TypeP PreviousCulture `Yield t/ha` T1may T2may T3may T1june T2june T3june T1july T2july T3july T1aug T2aug T3aug T1sept T2sept T3sept P1may P2may P3may P1june P2june P3june
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Deep mo~ Pure steam 1.53 9.55 5.92 9.26 11.4 16.4 15.9 16.0 20.0 19.0 18.6 16.5 15.4 12.5 6.89 6.64 1.69 2.44 2.04 1 1.73 1.34
2 Deep mo~ Sideral steam (ra~ 1.33 9.55 5.92 9.26 11.4 16.4 15.9 16.0 20.0 19.0 18.6 16.5 15.4 12.5 6.89 6.64 1.69 2.44 2.04 1 1.73 1.34
3 Deep mo~ Sideral steam (me~ 1.46 9.55 5.92 9.26 11.4 16.4 15.9 16.0 20.0 19.0 18.6 16.5 15.4 12.5 6.89 6.64 1.69 2.44 2.04 1 1.73 1.34
4 Combine~ Pure steam 0.5 11.0 6.89 7.61 8.02 18.3 22.3 21.1 20.8 18 17.3 20.8 13.6 10.4 7.33 5.98 0.06 2.8 3.94 2.23 0.87 0.31
5 Combine~ Sideral steam (ra~ 0.66 11.0 6.89 7.61 8.02 18.3 22.3 21.1 20.8 18 17.3 20.8 13.6 10.4 7.33 5.98 0.06 2.8 3.94 2.23 0.87 0.31
6 Combine~ Sideral steam (me~ 0.58 11.0 6.89 7.61 8.02 18.3 22.3 21.1 20.8 18 17.3 20.8 13.6 10.4 7.33 5.98 0.06 2.8 3.94 2.23 0.87 0.31
You can build the formula with paste0() inside a loop.
Get the variable names, and exclude the dependent one:
var.names = colnames(KemData)
var.names = var.names[var.names != "Yield t/ha"]
Now the loop:
formula = "`Yield t/ha` ~ "
for(i in var.names){
formula = paste0(formula, "`", i, "`", " * ")}
# Drop the trailing " * "
formula = substr(formula, 1, nchar(formula)-3)
# Convert the string to a formula and fit the model
two.way <- aov(as.formula(formula), data = KemData)
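The same string can also be assembled without a loop by using the collapse argument of paste0(). Below is a minimal sketch on a toy data frame (toy and its columns A, B and C are hypothetical stand-ins for KemData); the response name is backtick-quoted because it contains a space and a slash.

```r
# Toy data with a non-syntactic response name, standing in for KemData
set.seed(1)
toy <- data.frame(
  `Yield t/ha` = rnorm(16, mean = 1, sd = 0.4),
  A = gl(2, 8),
  B = gl(2, 4, 16),
  C = gl(2, 2, 16),
  check.names = FALSE
)

response   <- "Yield t/ha"
predictors <- setdiff(colnames(toy), response)

# Backtick-quote every name, then cross the predictors with " * "
rhs <- paste0("`", predictors, "`", collapse = " * ")
f   <- as.formula(paste0("`", response, "` ~ ", rhs))
f

fit <- aov(f, data = toy)
summary(fit)
```

Keep in mind that crossing every column with * expands to all main effects and all interactions, so with 33 predictors the model is unlikely to be estimable. Using collapse = " + " instead gives a main-effects-only formula.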

Adding a prefix to each column name in each dataframe in a list of dataframes R

I want to add a prefix to each column (except the first) in each dataframe in a list of dataframes. I have been taking the approach that I would use for a single dataframe and trying both lapply and Map, without success.
I also want to change the first column of each dataframe by adding the name of the dataframe as a prefix to the existing name.
A snippet of my list of dataframes
l1 <- list(Fe = structure(list(Determination_No = 1:6, `2` = c(55.94,
55.7, 56.59, 56.5, 55.98, 55.93), `3` = c(56.83, 56.54, 56.18,
56.5, 56.51, 56.34), `4` = c(56.39, 56.43, 56.53, 56.31, 56.47,
56.35), `5` = c(56.32, 56.29, 56.31, 56.32, 56.39, 56.32), `7` = c(56.48,
56.4, 56.54, 56.43, 56.73, 56.62), `8` = c(56.382, 56.258, 56.442,
56.258, 56.532, 56.264), `10` = c(56.3, 56.5, 56.2, 56.5, 56.7,
56.5), `12` = c(56.11, 56.46, 56.1, 56.35, 56.36, 56.37)), class = "data.frame", row.names = c(NA,
-6L)), SiO2 = structure(list(Determination_No = 1:6, `2` = c(7.63,
7.65, 7.73, 7.67, 7.67, 7.67), `3` = c(7.84, 7.69, 7.59, 7.77,
7.74, 7.64), `4` = c(7.67, 7.74, 7.62, 7.81, 7.66, 7.8), `5` = c(7.91,
7.84, 7.96, 7.87, 7.84, 7.92), `7` = c(7.77, 7.83, 7.76, 7.78,
7.65, 7.74), `8` = c(7.936, 7.685, 7.863, 7.838, 7.828, 7.767
), `10` = c(7.872684992, 7.851291827, 7.872684992, 7.722932832,
7.680146501, 7.615967003), `12` = c(7.64, 7.71, 7.71, 7.65, 7.82,
7.68)), class = "data.frame", row.names = c(NA, -6L)), Al2O3 = structure(list(
Determination_No = 1:6, `2` = c(2.01, 2.02, 2.03, 2.01, 2.02,
2), `3` = c(2.01, 2.01, 2, 2.02, 2.02, 2.03), `4` = c(2,
2.03, 1.99, 2.01, 2.01, 2.01), `5` = c(2.02, 2.02, 2.05,
2.03, 2.02, 2.03), `7` = c(1.88, 1.9, 1.89, 1.88, 1.88, 1.87
), `8` = c(2.053, 2.044, 2.041, 2.038, 2.008, 2.02), `10` = c(2.002830415,
2.021725042, 2.021725042, 1.983935789, 2.002830415, 2.021725042
), `12` = c(2.09, 2.05, 1.96, 2.09, 2.06, 2.02)), class = "data.frame", row.names = c(NA,
-6L)))
I have tried the following
colnames(l1[-1]) <- lapply(l1[-1],paste0("Lab-",colnames(l1[-1])))
colnames(l1[-1]) <- Map(paste("Lab",colnames(l1[-1]),sep=" "),l1[-1])
With either attempt I get the following error message
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'Lab-' of mode 'function' was not found
I am not sure what the issue is.
Thanks
In tidyverse, we can use imap with rename_with :
library(dplyr)
library(purrr)
imap(l1, ~.x %>%
rename_with(function(x) c(paste(.y, x[1], sep = '_'), paste0('lab_', x[-1]))))
#$Fe
# Fe_Determination_No lab_2 lab_3 lab_4 lab_5 lab_7 lab_8 lab_10 lab_12
#1 1 55.94 56.83 56.39 56.32 56.48 56.382 56.3 56.11
#2 2 55.70 56.54 56.43 56.29 56.40 56.258 56.5 56.46
#3 3 56.59 56.18 56.53 56.31 56.54 56.442 56.2 56.10
#4 4 56.50 56.50 56.31 56.32 56.43 56.258 56.5 56.35
#5 5 55.98 56.51 56.47 56.39 56.73 56.532 56.7 56.36
#6 6 55.93 56.34 56.35 56.32 56.62 56.264 56.5 56.37
#$SiO2
# SiO2_Determination_No lab_2 lab_3 lab_4 lab_5 lab_7 lab_8 lab_10 lab_12
#1 1 7.63 7.84 7.67 7.91 7.77 7.936 7.872685 7.64
#2 2 7.65 7.69 7.74 7.84 7.83 7.685 7.851292 7.71
#3 3 7.73 7.59 7.62 7.96 7.76 7.863 7.872685 7.71
#4 4 7.67 7.77 7.81 7.87 7.78 7.838 7.722933 7.65
#5 5 7.67 7.74 7.66 7.84 7.65 7.828 7.680147 7.82
#6 6 7.67 7.64 7.80 7.92 7.74 7.767 7.615967 7.68
#$Al2O3
# Al2O3_Determination_No lab_2 lab_3 lab_4 lab_5 lab_7 lab_8 lab_10 lab_12
#1 1 2.01 2.01 2.00 2.02 1.88 2.053 2.002830 2.09
#2 2 2.02 2.01 2.03 2.02 1.90 2.044 2.021725 2.05
#3 3 2.03 2.00 1.99 2.05 1.89 2.041 2.021725 1.96
#4 4 2.01 2.02 2.01 2.03 1.88 2.038 1.983936 2.09
#5 5 2.02 2.02 2.01 2.02 1.88 2.008 2.002830 2.06
#6 6 2.00 2.03 2.01 2.03 1.87 2.020 2.021725 2.02
Or in base R with Map :
Map(function(x, y) {
names(x) <- c(paste(y, names(x)[1], sep = '_'), paste0('lab_', names(x[-1])))
x
}, l1, names(l1))
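As for why the original attempts failed: the second argument of lapply() (and the first argument of Map()) must be a function, but paste0("Lab-", ...) is evaluated immediately and returns the string "Lab-", which R then tries to look up as a function name. A small sketch of the lapply form with an explicit function, on a cut-down version of the list (toy is a hypothetical two-column excerpt of l1):

```r
# Two small data frames standing in for the poster's list `l1`
toy <- list(
  Fe   = data.frame(Determination_No = 1:2, `2` = c(55.94, 55.70),
                    `3` = c(56.83, 56.54), check.names = FALSE),
  SiO2 = data.frame(Determination_No = 1:2, `2` = c(7.63, 7.65),
                    `3` = c(7.84, 7.69), check.names = FALSE)
)

# The second argument of lapply() is a function applied to each data frame
prefixed <- lapply(toy, function(df) {
  names(df)[-1] <- paste0("Lab-", names(df)[-1])
  df
})

# The first column needs the list names too, so iterate over them
for (nm in names(prefixed)) {
  names(prefixed[[nm]])[1] <- paste(nm, names(prefixed[[nm]])[1], sep = "_")
}

lapply(prefixed, names)
```

lapply() alone does not see the list names, which is why the answers above use imap or Map when the data frame name has to go into the first column.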

Kruskal-wallis test in R gives an error: Error in model.frame.default: variable lengths differ

I am trying to run Kruskal wallis tests for multiple columns in my example dataframe (df) in R, but I am stuck with the following error:
Error in model.frame.default(formula = as.numeric(x) ~ as.factor(Groups), :
variable lengths differ (found for 'as.factor(Groups)')
Here is my example dataframe (df):
Groups Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 Gene9 Gene10
Group1 120.67 69.33 1.24 2.31 0.39 6.57 2.49 383.84 415.23 NA
Group1 157 110.67 0.4 0.84 0.28 2.62 2.11 245.42 325.23 NA
Group1 113.5 66.75 1.07 4.53 0.33 2.37 2.35 421.25 352.03 73.51
Group1 131 79.67 1.13 5.03 0.72 3.36 2.24 305.32 432.81 71.11
Group1 120 79.67 0.91 3.84 0.74 3.77 1.92 298.91 382.43 66.49
Group2 125.67 83.67 2.07 1.73 0.38 3.89 2.09 233.81 377.21 72.1
Group2 103.33 68.67 1.01 4.89 0.3 4.5 1.75 231.5 381.73 53
Group2 121.33 74.67 0.54 2.39 3.95 3.7 2.46 310.66 355.97 143.61
Group2 136 83.67 1.6 1.75 0.32 5.17 2.36 410.21 389.62 170.34
Group2 143.67 71.33 0.56 1.22 0.26 4.48 2.62 294.01 491.57 96.72
Group2 134.67 69.67 0.85 1.77 0.45 3.58 2.44 236.61 441.32 69.06
Group2 158.33 98.33 0.87 3.69 0.51 2.53 2.6 257.66 396.96 41.94
Group2 147.33 88.33 NA NA NA NA NA NA NA NA
Group2 95.67 59 1.39 0.56 0.31 2.49 2.09 395.38 420.28 64.83
Group3 135 82 13.31 24.05 1.21 3.83 2.83 313.71 327.84 66.8
Group3 124.67 78 1.12 2 0.71 3.77 2.42 334.36 358.9 131.35
Group3 152 98.33 1.11 1.54 0.35 2.11 2.21 297.68 433.48 117.18
Group3 135.33 73.67 0.13 2.99 0.3 2.4 1.86 296.82 415.13 112.97
Group3 135.33 87 0.91 3.73 0.65 2.92 1.85 335.31 412.16 103.18
Group4 124.67 77.67 0.28 0.81 0.49 2.62 1.96 251.49 468.19 80.27
Group4 125.67 72.33 1.01 1.82 0.35 3.65 1.62 335.18 264.74 145.15
Group4 169 105 0.6 3.12 0.29 3.9 2.22 311.01 459.85 82.89
Group4 123.67 76.33 0.65 1.78 0.47 2.77 1.57 253.56 283.38 59.07
Group5 132.67 76.33 2.94 17.01 0.27 3.99 2.55 354.78 493.02 145.36
Group5 NA NA 1.34 1.42 0.4 4.21 2.02 243.26 345.2 43.91
Group5 144.33 75 NA NA 0.55 3.26 2.85 312.16 419.86 55.71
Group5 136.25 78.25 NA 1.32 0.65 3.63 1.52 267.13 256.18 53.49
Group5 123.67 69.33 1.81 1.52 0.67 3.89 2 303.89 346.57 112.16
Group5 116.67 66.33 0.7 1.68 0.27 3.55 2.16 284.96 407.04 102.97
Group5 136.67 76 2.68 4.3 0.33 7.36 2.26 237.28 423.29 88.65
Group6 122 63.33 0.87 4.2 0.17 3.92 2.11 159.04 300.24 60.13
Group6 130.67 82.67 0.8 1.85 1 5.26 2.46 388.61 558.51 66.76
Group6 136.33 70.33 0.54 2.26 0.35 NA NA 388.81 551.69 113.39
Group6 127.33 73 1.32 2.19 0.99 4.42 2.59 378.57 501.12 85.56
Group7 186.67 89.67 0.79 1.77 0.53 5.22 2.73 269.87 490.25 77.74
Group7 203 93 5.63 22.08 0.82 6.97 2.92 341.87 611.33 92.7
Group7 127 72.67 0.55 1.07 0.38 3.2 1.69 310.9 410.19 65.62
Group7 142 79.67 1.61 1.35 3.24 3.73 2.08 304.52 495.79 60.15
Here is my code:
kw.tests <- lapply(
data[, -1],
function(x) { kruskal.test(as.numeric(x) ~ as.factor(Groups), data = data_test, na.action=na.omit) }
)
Error in model.frame.default(formula = as.numeric(x) ~ as.factor(Groups), :
variable lengths differ (found for 'as.factor(Groups)')
This code runs perfectly when I run the test for each gene individually, for example, for Gene1:
kruskal.test(Gene1 ~ as.factor(Groups), data = data_test, na.action=na.omit)
Kruskal-Wallis rank sum test
data: Gene1 by as.factor(Groups)
Kruskal-Wallis chi-squared = 5.6607, df = 6, p-value = 0.4622
However, it gives me this error when I use lapply or even a for loop. I have already googled this error several times, but none of the answers I found are helping me.
I have read that it could be due to the NAs in the file. However, I cannot avoid NAs as my dataframe is much larger than this. Also, the test runs perfectly for each gene separately without lapply or loops, even though there are NAs.
The length of the 'Groups' variable is the same as that of all other variables, so this is not the issue either.
Here is a snippet of my data:
> dput(data_test)
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L), .Label = c("Group1",
"Group2", "Group3", "Group4", "Group5", "Group6", "Group7"), class = "factor"),
Gene1 = c(120.67, 157, 113.5, 131, 120, 125.67, 103.33, 121.33,
136, 143.67, 134.67, 158.33, 147.33, 95.67, 135, 124.67,
152, 135.33, 135.33, 124.67, 125.67, 169, 123.67, 132.67,
NA, 144.33, 136.25, 123.67, 116.67, 136.67, 122, 130.67,
136.33, 127.33, 186.67, 203, 127, 142), Gene2 = c(69.33,
110.67, 66.75, 79.67, 79.67, 83.67, 68.67, 74.67, 83.67,
71.33, 69.67, 98.33, 88.33, 59, 82, 78, 98.33, 73.67, 87,
77.67, 72.33, 105, 76.33, 76.33, NA, 75, 78.25, 69.33, 66.33,
76, 63.33, 82.67, 70.33, 73, 89.67, 93, 72.67, 79.67), Gene3 = c(1.24,
0.4, 1.07, 1.13, 0.91, 2.07, 1.01, 0.54, 1.6, 0.56, 0.85,
0.87, NA, 1.39, 13.31, 1.12, 1.11, 0.13, 0.91, 0.28, 1.01,
0.6, 0.65, 2.94, 1.34, NA, NA, 1.81, 0.7, 2.68, 0.87, 0.8,
0.54, 1.32, 0.79, 5.63, 0.55, 1.61), Gene4 = c(2.31, 0.84,
4.53, 5.03, 3.84, 1.73, 4.89, 2.39, 1.75, 1.22, 1.77, 3.69,
NA, 0.56, 24.05, 2, 1.54, 2.99, 3.73, 0.81, 1.82, 3.12, 1.78,
17.01, 1.42, NA, 1.32, 1.52, 1.68, 4.3, 4.2, 1.85, 2.26,
2.19, 1.77, 22.08, 1.07, 1.35), Gene5 = c(0.39, 0.28, 0.33,
0.72, 0.74, 0.38, 0.3, 3.95, 0.32, 0.26, 0.45, 0.51, NA,
0.31, 1.21, 0.71, 0.35, 0.3, 0.65, 0.49, 0.35, 0.29, 0.47,
0.27, 0.4, 0.55, 0.65, 0.67, 0.27, 0.33, 0.17, 1, 0.35, 0.99,
0.53, 0.82, 0.38, 3.24), Gene6 = c(6.57, 2.62, 2.37, 3.36,
3.77, 3.89, 4.5, 3.7, 5.17, 4.48, 3.58, 2.53, NA, 2.49, 3.83,
3.77, 2.11, 2.4, 2.92, 2.62, 3.65, 3.9, 2.77, 3.99, 4.21,
3.26, 3.63, 3.89, 3.55, 7.36, 3.92, 5.26, NA, 4.42, 5.22,
6.97, 3.2, 3.73), Gene7 = c(2.49, 2.11, 2.35, 2.24, 1.92,
2.09, 1.75, 2.46, 2.36, 2.62, 2.44, 2.6, NA, 2.09, 2.83,
2.42, 2.21, 1.86, 1.85, 1.96, 1.62, 2.22, 1.57, 2.55, 2.02,
2.85, 1.52, 2, 2.16, 2.26, 2.11, 2.46, NA, 2.59, 2.73, 2.92,
1.69, 2.08), Gene8 = c(383.84, 245.42, 421.25, 305.32, 298.91,
233.81, 231.5, 310.66, 410.21, 294.01, 236.61, 257.66, NA,
395.38, 313.71, 334.36, 297.68, 296.82, 335.31, 251.49, 335.18,
311.01, 253.56, 354.78, 243.26, 312.16, 267.13, 303.89, 284.96,
237.28, 159.04, 388.61, 388.81, 378.57, 269.87, 341.87, 310.9,
304.52), Gene9 = c(415.23, 325.23, 352.03, 432.81, 382.43,
377.21, 381.73, 355.97, 389.62, 491.57, 441.32, 396.96, NA,
420.28, 327.84, 358.9, 433.48, 415.13, 412.16, 468.19, 264.74,
459.85, 283.38, 493.02, 345.2, 419.86, 256.18, 346.57, 407.04,
423.29, 300.24, 558.51, 551.69, 501.12, 490.25, 611.33, 410.19,
495.79), Gene10 = c(NA, NA, 73.51, 71.11, 66.49, 72.1, 53,
143.61, 170.34, 96.72, 69.06, 41.94, NA, 64.83, 66.8, 131.35,
117.18, 112.97, 103.18, 80.27, 145.15, 82.89, 59.07, 145.36,
43.91, 55.71, 53.49, 112.16, 102.97, 88.65, 60.13, 66.76,
113.39, 85.56, 77.74, 92.7, 65.62, 60.15)), class = "data.frame", row.names = c(NA,
-38L))
Any further help appreciated.
Thanking you.
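You used the wrong dataset name in your lapply call: it iterates over the columns of data, while the formula looks up Groups in data_test, and the two apparently have different numbers of rows. Using data_test in both places works for me:
apply(data_test[,-1],2,function(x){kruskal.test(as.numeric(x)~as.factor(data_test$Groups))})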
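Another way to avoid this kind of mismatch is to loop over the column names and build one formula per gene, so that the response and the grouping variable always come from the same data frame. A sketch on synthetic data (toy and its values are made up; only the column naming follows data_test):

```r
# Small synthetic stand-in for `data_test` (values are illustrative only)
set.seed(42)
toy <- data.frame(
  Groups = factor(rep(c("Group1", "Group2", "Group3"), each = 6)),
  Gene1  = rnorm(18, 130, 15),
  Gene2  = rnorm(18, 80, 10)
)
toy$Gene1[c(2, 9)] <- NA
toy$Gene2[15] <- NA

genes <- setdiff(names(toy), "Groups")

# Build `GeneX ~ Groups` per column, so response and grouping variable are
# both looked up in the same data frame and always have matching lengths
kw.tests <- lapply(genes, function(g) {
  f <- reformulate("Groups", response = g)
  kruskal.test(f, data = toy)
})
names(kw.tests) <- genes

# Collect the p-values in one named vector
sapply(kw.tests, function(res) res$p.value)
```

Rows with NA in the tested gene are dropped per test under the default na.action, so the other genes still use all of their available rows.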

How to make a graph to represent mean, standard deviation, maximum and minimum values?

I want to make a graph with monthly data: mean, standard deviation and maximum and minimum values.
The x axis would show the months. I would like to represent the means with dots, squares and crosses, for example, and the standard deviations as vertical lines through the mean symbols.
In addition, I would like to represent the maximum and minimum values by an area.
I would like to plot 3 different periods: 1961-1990, 1990-2010 and 1961-2010.
Is it possible?
Some data:
Mês;Mean1;Std1;Min1;Max1;Mean2;Std2;Min2;Max2;Mean3;Std3;Min3;Max3
Jan;25.45;2.04;13.05;27.50;25.83;1.94;14.01;27.85;25.54;2.03;13.24;27.58
Feb;25.74;2.09;13.02;27.85;26.16;2.01;13.95;28.16;25.92;2.04;13.58;27.99
Mar;25.01;2.13;12.12;27.27;25.35;2.14;12.41;27.67;25.16;2.07;12.68;27.45
Apr;23.16;2.19;9.89;25.48;23.81;2.35;9.62;26.35;23.51;2.17;10.46;25.90
May;21.17;2.21;7.99;23.59;21.31;2.29;7.54;23.88;21.18;2.23;7.84;23.67
Jun;19.88;2.26;6.37;22.34;20.15;2.25;6.65;22.65;20.00;2.26;6.42;22.47
Jul;19.41;2.27;5.78;21.79;19.96;2.10;7.34;22.25;19.60;2.22;6.24;22.02
Aug;20.39;2.10;7.73;22.64;20.75;2.03;8.56;23.00;20.53;2.09;7.93;22.80
Sep;21.08;1.96;9.26;23.29;21.66;1.58;12.21;23.53;21.33;1.91;9.84;23.53
Oct;22.19;1.81;11.33;24.32;23.17;1.62;13.40;25.00;22.60;1.79;11.92;24.76
Nov;23.42;1.90;11.94;25.52;23.89;1.64;13.96;25.68;23.60;1.82;12.63;25.67
Dec;24.39;1.98;12.39;26.39;25.17;1.99;13.07;27.54;24.67;1.94;12.93;26.73
Short answer:
Yes that is indeed possible.
Long answer:
Here is how you could do it:
Assuming that the data you posted is in some data.frame called df:
head(df)
Mês Mean1 Std1 Min1 Max1 Mean2 Std2 Min2 Max2 Mean3 Std3 Min3 Max3
1 Jan 25.45 2.04 13.05 27.50 25.83 1.94 14.01 27.85 25.54 2.03 13.24 27.58
2 Feb 25.74 2.09 13.02 27.85 26.16 2.01 13.95 28.16 25.92 2.04 13.58 27.99
3 Mar 25.01 2.13 12.12 27.27 25.35 2.14 12.41 27.67 25.16 2.07 12.68 27.45
4 Apr 23.16 2.19 9.89 25.48 23.81 2.35 9.62 26.35 23.51 2.17 10.46 25.90
5 May 21.17 2.21 7.99 23.59 21.31 2.29 7.54 23.88 21.18 2.23 7.84 23.67
6 Jun 19.88 2.26 6.37 22.34 20.15 2.25 6.65 22.65 20.00 2.26 6.42 22.47
You first want to convert it from a wide format to a long format, meaning that we want each observation to have its own row. Perhaps someone with more tidyverse experience can do this in a more elegant way, but this is how I would do that:
# First we melt the dataframe
df2 <- reshape2::melt(df, id.vars = "Mês")
# Then we get a grouping variable from the column "variable"
df2$variable <- as.character(df2$variable)
df2$group <- substr(df2$variable, nchar(df2$variable), nchar(df2$variable))
# And we remove the trailing number from the variable
df2$variable <- substr(df2$variable, 1, nchar(df2$variable) - 1)
This is what the data will look like at this point:
head(df2)
Mês variable value group
1 Jan Mean 25.45 1
2 Feb Mean 25.74 1
3 Mar Mean 25.01 1
4 Apr Mean 23.16 1
5 May Mean 21.17 1
6 Jun Mean 19.88 1
We still need the means, standard deviations, minima and maxima to be on the same row, so we are going to un-melt (cast) the data by each group:
# First we split by group
df2 <- split(df2, df2$group)
# Then, we loop over the data and cast the data
df2 <- lapply(seq(df2), function(i){
dat <- df2[[i]]
cbind(reshape2::dcast(dat, Mês ~ variable), group = i)
})
# And finally combine the data.frame back together
df2 <- do.call(rbind, df2)
Now the data should look like this:
head(df2)
Mês Max Mean Min Std group
1 Jan 27.50 25.45 13.05 2.04 1
2 Feb 27.85 25.74 13.02 2.09 1
3 Mar 27.27 25.01 12.12 2.13 1
4 Apr 25.48 23.16 9.89 2.19 1
5 May 23.59 21.17 7.99 2.21 1
6 Jun 22.34 19.88 6.37 2.26 1
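For what it's worth, the melt / split / dcast steps can probably be condensed into a single call to base R's reshape(). The sketch below uses only the first two months and two periods of the posted data, and renames the id column to Month (an assumption made to keep the example ASCII-only; with the real data it would be Mês).

```r
# First two months and two periods of the data posted above; the id column
# is called `Month` here (standing in for `Mês`) to keep the sketch ASCII-only
wide <- data.frame(
  Month = c("Jan", "Feb"),
  Mean1 = c(25.45, 25.74), Std1 = c(2.04, 2.09),
  Min1  = c(13.05, 13.02), Max1 = c(27.50, 27.85),
  Mean2 = c(25.83, 26.16), Std2 = c(1.94, 2.01),
  Min2  = c(14.01, 13.95), Max2 = c(27.85, 28.16)
)

# sep = "" splits each name just before its trailing digit, so "Mean1"
# becomes variable "Mean" measured at group 1
long <- reshape(
  wide,
  direction = "long",
  varying   = names(wide)[-1],
  sep       = "",
  idvar     = "Month",
  timevar   = "group"
)
rownames(long) <- NULL
long
```

The result has one row per month and group with Mean, Std, Min and Max side by side, which is the layout the plotting code expects.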
Data in this format is the easiest to plot with. We'll do that as follows:
library(ggplot2)
# First we define all shared aesthetics in the main 'ggplot'-call:
ggplot(df2, aes(x = Mês,
group = as.factor(group),
colour = as.factor(group))) +
# Then as lowest layer, we want that area spanning 'Min' to 'Max'
geom_ribbon(aes(ymin = Min,
ymax = Max,
fill = as.factor(group)), alpha = 0.1) +
# Then we want our means displayed as points
geom_point(aes(y = Mean, shape = as.factor(group))) +
# The standard deviation as line segments with an arrowhead
geom_segment(aes(xend = Mês,
y = Mean - Std,
yend = Mean + Std),
arrow = arrow(angle = 90, ends = "both", length = unit(2, "mm"))) +
# Finally we tell that our point shapes should be dots, squares and crosses
scale_shape_manual(values = c(16, 15, 4))
And in my hands this yielded the following:
Now, as a last tip: if you want more people to help you or get help quicker, it is easiest to give them some data to play around with that they can copy-paste directly in R:
dput(df)
structure(list(Mês = structure(1:12, .Label = c("Jan", "Feb",
"Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov",
"Dec"), class = "factor"), Mean1 = c(25.45, 25.74, 25.01, 23.16,
21.17, 19.88, 19.41, 20.39, 21.08, 22.19, 23.42, 24.39), Std1 = c(2.04,
2.09, 2.13, 2.19, 2.21, 2.26, 2.27, 2.1, 1.96, 1.81, 1.9, 1.98
), Min1 = c(13.05, 13.02, 12.12, 9.89, 7.99, 6.37, 5.78, 7.73,
9.26, 11.33, 11.94, 12.39), Max1 = c(27.5, 27.85, 27.27, 25.48,
23.59, 22.34, 21.79, 22.64, 23.29, 24.32, 25.52, 26.39), Mean2 = c(25.83,
26.16, 25.35, 23.81, 21.31, 20.15, 19.96, 20.75, 21.66, 23.17,
23.89, 25.17), Std2 = c(1.94, 2.01, 2.14, 2.35, 2.29, 2.25, 2.1,
2.03, 1.58, 1.62, 1.64, 1.99), Min2 = c(14.01, 13.95, 12.41,
9.62, 7.54, 6.65, 7.34, 8.56, 12.21, 13.4, 13.96, 13.07), Max2 = c(27.85,
28.16, 27.67, 26.35, 23.88, 22.65, 22.25, 23, 23.53, 25, 25.68,
27.54), Mean3 = c(25.54, 25.92, 25.16, 23.51, 21.18, 20, 19.6,
20.53, 21.33, 22.6, 23.6, 24.67), Std3 = c(2.03, 2.04, 2.07,
2.17, 2.23, 2.26, 2.22, 2.09, 1.91, 1.79, 1.82, 1.94), Min3 = c(13.24,
13.58, 12.68, 10.46, 7.84, 6.42, 6.24, 7.93, 9.84, 11.92, 12.63,
12.93), Max3 = c(27.58, 27.99, 27.45, 25.9, 23.67, 22.47, 22.02,
22.8, 23.53, 24.76, 25.67, 26.73)), row.names = c(NA, -12L), class = "data.frame")

Undefined columns data frame error

I would like to create a scatter plot of two variables (Disk and band). For that I am using the ggscatter function from the ggpubr package. Every time I call ggscatter I get the following error:
Error in `[.data.frame`(data, , x) : undefined columns selected
Here is my code
install.packages("ggpubr")
library("ggpubr")
my_data <- All_Data_Summer_17_
head(my_data, 6)
ggscatter(my_data, x = "band", y = "Disk",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Band", ylab = "Disk (cm)")
Output of str(my_data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 24 obs. of 22 variables:
$ Sample ID : chr "NP-A-1" "NP-A-2" "NP-A-3" "NP-A-4" ...
$ Lat : num 36.6 36.6 36.6 36.6 36.6 ...
$ Lon : num -95 -95 -95 -95 -95 ...
$ Temp : num 29.1 30.5 30.6 30.7 31 ...
$ SpCond : num 0.077 0.081 0.082 0.086 0.088 0.09 0.084 0.09 0.084 0.085 ...
$ Cond : int 83 90 90 95 98 99 93 99 93 96 ...
$ Resist : num 12107 11116 11066 10537 10248 ...
$ TDS : num 0.05 0.053 0.053 0.056 0.057 0.058 0.055 0.058 0.055 0.055 ...
$ Sal : num 0.03 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 ...
$ pH : num 8.87 9.41 9.56 9.77 9.61 9.38 9.89 9.67 9.89 9.85 ...
$ Chl : num 62.1 40.1 3.7 1.4 4.2 5.6 41.5 17.8 4.5 7.7 ...
$ ODO : num 5.69 8.76 8.28 8.35 8.75 ...
$ TSS : num 1.111 0.667 2.556 3.333 0.778 ...
$ TP : num 0 1.03 0.01 -0.02 -0.01 -0.03 0.01 -0.01 -0.03 0.01 ...
$ TN : num 0.2 0.3 1.9 0.3 1.1 0.5 1.6 0.9 0.5 0.7 ...
$ NO3-N : num 0.43 0.18 0.71 0.36 0.25 0.42 0.26 0.17 0.24 0.19 ...
$ NH3-N : num 0.3 0.2 -0.3 -0.1 -0.4 -0.3 -0.3 -0.3 -0.2 -0.1 ...
$ Chloro-a : num 8.23 7.19 15.37 12.6 14.22 ...
$ Disk: num 55.5 68 50 50.5 69 65 65 67.7 70 66 ...
$ band : num 0.000093 0.000096 0.000103 0.000152 0.000088 0.000089 0.000096 0.000097 0.000092 0.000101 ...
$ Green Band : num 0.000163 0.000169 0.000154 0.000276 0.00016 0.00013 0.00015 0.000175 0.000171 0.000163 ...
$ Red Band : num 0.00012 0.000145 0.000126 0.000246 0.000117 0.000095 0.000116 0.00011 0.000108 0.000126 ...
Output dput(my_data)
dput(my_data)
structure(list(`Sample ID` = c("NP-A-1", "NP-A-2", "NP-A-3",
"NP-A-4", "NP-A-5", "NP-A-6", "NP-A-7", "NP-A-8", "NP-A-9", "NP-A-10",
"NP-A-11", "NP-A-12", "NP-A-13", "NP-A-14", "NP-A-15", "NP-A-16",
"NP-A-17", "NP-B-1", "NP-B-2", "NP-B-3", "NP-B-4", "NP-B-5",
"NP-B-6", "NP-B-7"), Lat = c(36.568738, 36.569005, 36.569258,
36.569554, 36.569585, 36.569382, 36.56928, 36.568647, 36.568809,
36.569124, 36.569425, 36.569331, 36.56919, 36.569071, 36.568888,
36.568633, 36.568869, 36.568651, 36.568932, 36.56946, 36.569893,
36.570058, 36.569811, 36.56988), Lon = c(-94.96671, -94.966703,
-94.966604, -94.966647, -94.96698, -94.966928, -94.966923, -94.967296,
-94.9677, -94.967761, -94.967911, -94.968069, -94.967358, -94.968107,
-94.968018, -94.968049, -94.968293, -94.968723, -94.968833, -94.968396,
-94.968101, -94.967793, -94.967141, -94.96663), Temp = c(29.12,
30.49, 30.6, 30.71, 30.97, 30.83, 30.82, 30.64, 30.42, 31.62,
31.96, 31.16, 31.16, 32.88, 32.03, 31, 32.41, 31.79, 31.93, 32.17,
32.16, 32.55, 32.61, 32.83), SpCond = c(0.077, 0.081, 0.082,
0.086, 0.088, 0.09, 0.084, 0.09, 0.084, 0.085, 0.08, 0.079, 0.083,
0.079, 0.086, 0.094, 0.078, 0.183, 0.183, 0.183, 0.183, 0.183,
0.183, 0.183), Cond = c(83L, 90L, 90L, 95L, 98L, 99L, 93L, 99L,
93L, 96L, 91L, 88L, 93L, 90L, 97L, 105L, 89L, 206L, 207L, 208L,
208L, 209L, 210L, 210L), Resist = c(12107.2, 11115.7, 11066.2,
10537.1, 10247.7, 10051, 10700.4, 10076.5, 10753.3, 10434.4,
11023, 11304, 10741.8, 11058.1, 10270.4, 9536.35, 11269.8, 4845.53,
4834.38, 4815.44, 4814.59, 4787.82, 4770.86, 4755.86), TDS = c(0.05,
0.053, 0.053, 0.056, 0.057, 0.058, 0.055, 0.058, 0.055, 0.055,
0.052, 0.051, 0.054, 0.051, 0.056, 0.061, 0.051, 0.119, 0.119,
0.119, 0.119, 0.119, 0.119, 0.119), Sal = c(0.03, 0.04, 0.04,
0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04,
0.04, 0.04, 0.03, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08),
pH = c(8.87, 9.41, 9.56, 9.77, 9.61, 9.38, 9.89, 9.67, 9.89,
9.85, 9.46, 9.42, 9.75, 9.19, 10.02, 8.83, 9.65, 7.89, 8.14,
8.21, 8.22, 8.4, 8.21, 8.18), Chl = c(62.1, 40.1, 3.7, 1.4,
4.2, 5.6, 41.5, 17.8, 4.5, 7.7, 8.2, 7.7, 120.3, 3.1, 7.8,
3.6, 3.2, 9.8, 7.6, 6, 10, 8.1, 6.3, 4.3), ODO = c(5.69,
8.76, 8.28, 8.35, 8.75, 8.59, 10.1, 10.06, 9.14, 10.32, 9.1,
8.41, 8.03, 9.63, 9.77, 8.91, 10.16, 7.17, 7.31, 7.41, 7.49,
7.75, 6.98, 7.09), TSS = c(1.1111, 0.6667, 2.5556, 3.3333,
0.7778, -27.3333, 2.1111, -0.3333, 1.2222, -32.6667, -0.2222,
2.3333, -0.2222, 1.1111, 1.4444, 2.6667, 0.1111, 6.3333,
7, 5, 5.4444, 6.4444, 3, 2.7778), TP = c(0, 1.03, 0.01, -0.02,
-0.01, -0.03, 0.01, -0.01, -0.03, 0.01, 0.04, -0.01, -0.03,
0, 0.01, 0.03, 0.04, 0.2, -0.01, 0, -0.03, 0.04, 0.01, -0.01
), TN = c(0.2, 0.3, 1.9, 0.3, 1.1, 0.5, 1.6, 0.9, 0.5, 0.7,
0.6, 1, 0.8, 0.1, 0.4, 1.6, 0.6, 0.8, 0.6, 0.5, 0.9, 1.2,
0.3, 0.6), `NO3-N` = c(0.43, 0.18, 0.71, 0.36, 0.25, 0.42,
0.26, 0.17, 0.24, 0.19, 0.17, 0.41, 0.6, 0.23, 0.3, 0.26,
0.22, 0.32, 0.63, 0.36, 0.24, 0.33, 0.55, 0.36), `NH3-N` = c(0.3,
0.2, -0.3, -0.1, -0.4, -0.3, -0.3, -0.3, -0.2, -0.1, 0.1,
-0.2, 0.2, -0.1, -0.3, -0.1, 0.1, -0.5, 0.2, 0.5, -0.3, 0.2,
-0.4, -0.1), `Chloro-a` = c(8.23, 7.19, 15.37, 12.6, 14.22,
4.56, 7.2, 8.61, 6.31, 8.74, 5.59, 10.92, 5.24, 4.26, 5.48,
6.26, 4.75, 11.45, 10.39, 11.79, 9.59, 9.82, 7.97, 7.92),
`Disk` = c(55.5, 68, 50, 50.5, 69, 65, 65, 67.7, 70,
66, 69, 67, 69, 62, 60, 62, 66, 50, 52, 50, 40, 57, 57, 62
), `band` = c(9.3e-05, 9.6e-05, 0.000103, 0.000152,
8.8e-05, 8.9e-05, 9.6e-05, 9.7e-05, 9.2e-05, 0.000101, 0.000102,
9.6e-05, 0.000106, 8.7e-05, 9.1e-05, 0.000126, 0.000107,
0.000139, 0.000139, 0.000135, 0.000174, 0.000144, 0.000137,
0.000134), `Green Band` = c(0.000163, 0.000169, 0.000154,
0.000276, 0.00016, 0.00013, 0.00015, 0.000175, 0.000171,
0.000163, 0.000177, 0.000188, 0.000131, 0.000162, 0.000166,
0.000233, 0.000204, 0.000265, 0.00023, 0.000254, 0.000325,
0.000262, 0.000263, 0.00028), `Red Band` = c(0.00012, 0.000145,
0.000126, 0.000246, 0.000117, 9.5e-05, 0.000116, 0.00011,
0.000108, 0.000126, 0.000128, 0.000133, 9.3e-05, 0.000114,
0.000113, 0.000176, 0.000136, 0.000215, 0.000198, 0.00019,
0.000218, 0.00021, 0.000205, 0.000223)), .Names = c("Sample ID",
"Lat", "Lon", "Temp", "SpCond", "Cond", "Resist", "TDS", "Sal",
"pH", "Chl", "ODO", "TSS", "TP", "TN", "NO3-N", "NH3-N", "Chloro-a",
"Disk", "band", "Green Band", "Red Band"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -24L), spec = structure(list(
cols = structure(list(`Sample ID` = structure(list(), class = c("collector_character",
"collector")), Lat = structure(list(), class = c("collector_double",
"collector")), Lon = structure(list(), class = c("collector_double",
"collector")), Temp = structure(list(), class = c("collector_double",
"collector")), SpCond = structure(list(), class = c("collector_double",
"collector")), Cond = structure(list(), class = c("collector_integer",
"collector")), Resist = structure(list(), class = c("collector_double",
"collector")), TDS = structure(list(), class = c("collector_double",
"collector")), Sal = structure(list(), class = c("collector_double",
"collector")), pH = structure(list(), class = c("collector_double",
"collector")), Chl = structure(list(), class = c("collector_double",
"collector")), ODO = structure(list(), class = c("collector_double",
"collector")), TSS = structure(list(), class = c("collector_double",
"collector")), TP = structure(list(), class = c("collector_double",
"collector")), TN = structure(list(), class = c("collector_double",
"collector")), `NO3-N` = structure(list(), class = c("collector_double",
"collector")), `NH3-N` = structure(list(), class = c("collector_double",
"collector")), `Chloro-a` = structure(list(), class = c("collector_double",
"collector")), `Disk` = structure(list(), class = c("collector_double",
"collector")), `band` = structure(list(), class = c("collector_double",
"collector")), `Green Band` = structure(list(), class = c("collector_double",
"collector")), `Red Band` = structure(list(), class = c("collector_double",
"collector"))), .Names = c("Sample ID", "Lat", "Lon", "Temp",
"SpCond", "Cond", "Resist", "TDS", "Sal", "pH", "Chl", "ODO",
"TSS", "TP", "TN", "NO3-N", "NH3-N", "Chloro-a", "Disk",
"band", "Green Band", "Red Band")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
OK, the easy answer is to handle the correlation coefficient first and the confidence interval second: the call below sets cor.coef = FALSE while keeping conf.int = TRUE.
Perhaps you could also report the bug to the ggpubr maintainer.
ggscatter(my_data, x = "band",
y = "Disk",
add = "reg.line",
cor.coef = FALSE,
cor.method = "pearson",
conf.int = TRUE,
xlab = "Band",
ylab = "Disk (cm)")
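With cor.coef = FALSE the plot no longer prints the coefficient, so one option is to compute it separately. A sketch using base R's cor.test(), with the Disk and band values copied from the dput output in the question:

```r
# `Disk` and `band` columns copied from the dput(my_data) output above
Disk <- c(55.5, 68, 50, 50.5, 69, 65, 65, 67.7, 70, 66, 69, 67, 69, 62,
          60, 62, 66, 50, 52, 50, 40, 57, 57, 62)
band <- c(9.3e-05, 9.6e-05, 0.000103, 0.000152, 8.8e-05, 8.9e-05, 9.6e-05,
          9.7e-05, 9.2e-05, 0.000101, 0.000102, 9.6e-05, 0.000106, 8.7e-05,
          9.1e-05, 0.000126, 0.000107, 0.000139, 0.000139, 0.000135,
          0.000174, 0.000144, 0.000137, 0.000134)

# Pearson correlation with its 95% confidence interval and p-value
ct <- cor.test(band, Disk, method = "pearson")
ct$estimate
ct$conf.int
ct$p.value
```

With the full data frame loaded, the equivalent call would be cor.test(my_data$band, my_data$Disk, method = "pearson").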
