regression: Error in eval(predvars, data, env) : object 'volt' not found - r

Trying to run an OLS regression model in R.
data = read.csv("C:/.../VOLATILITY.csv")
head(data)
volt LfquantBS HfquantBS LfbankVOL HfbankMM HfnonbankMM HfindMM
1 18.23 3.7 9.2 3.2 2.6 35.3 7.9
2 16.09 4.1 11.4 3.2 2.7 35.3 8.2
3 16.79 4.1 11.4 3.2 2.7 35.3 8.2
4 17.01 4.1 11.4 3.2 2.7 35.3 8.2
5 16.09 4.1 11.4 3.2 2.7 35.3 8.2
6 19.66 6.2 10.5 4.2 1.8 30.7 8.6
model <- lm(volt ~ lfquantBS + HfquantBs + LfbankVOL + HfbankMM + HfnonbankMM
+ HfindMM)
Error in eval(predvars, data, env) : object 'volt' not found
Have done this before without any problem. Any help appreciated.

lm() needs the data argument here, because the columns volt, LfquantBS, etc. exist only inside the data.frame object named 'data', not as objects in the calling environment. In addition, case matters: the formula uses lfquantBS and HfquantBs, while the columns in the dataset are named LfquantBS and HfquantBS.
lm(volt ~ LfquantBS + HfquantBS + LfbankVOL + HfbankMM +
HfnonbankMM + HfindMM, data = data)
-output
Call:
lm(formula = volt ~ LfquantBS + HfquantBS + LfbankVOL + HfbankMM +
HfnonbankMM + HfindMM, data = data)
Coefficients:
(Intercept) LfquantBS HfquantBS LfbankVOL HfbankMM HfnonbankMM HfindMM
23.2866 1.0846 -0.9858 NA NA NA NA
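A quick way to catch this kind of name mismatch before fitting is to compare the variables used in the formula with the column names of the data frame. This is only a sketch: the object names f and missing_vars are made up for illustration, and the data frame holds just three of the columns from head(data).

```r
# Three of the columns from the question, values taken from head(data)
data <- data.frame(
  volt      = c(18.23, 16.09, 16.79, 17.01, 16.09, 19.66),
  LfquantBS = c(3.7, 4.1, 4.1, 4.1, 4.1, 6.2),
  HfquantBS = c(9.2, 11.4, 11.4, 11.4, 11.4, 10.5)
)

# Formula containing the misspellings from the question
f <- volt ~ lfquantBS + HfquantBs

# Variables named in the formula that are not columns of the data frame
missing_vars <- setdiff(all.vars(f), names(data))
missing_vars
#> [1] "lfquantBS" "HfquantBs"
```

An empty result would mean every variable in the formula can be found in the data frame.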
Regarding the comment "Have done this before without any problem": it is possible that the OP ran attach(data) in the past, which puts those columns on the search path, or created them as standalone vector objects first, before constructing the data.frame.
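To illustrate why the same kind of call can work without a data argument, here is a minimal sketch. The names volt_vec, lf_vec and d are hypothetical; the values come from head(data) in the question. When the variables exist as standalone vectors, lm() finds them in the formula's environment; when they exist only as columns, data = is required.

```r
# Standalone vectors in the calling environment (hypothetical names)
volt_vec <- c(18.23, 16.09, 16.79, 17.01, 16.09, 19.66)
lf_vec   <- c(3.7, 4.1, 4.1, 4.1, 4.1, 6.2)

# No data argument: the names are resolved in the formula's environment
fit_env <- lm(volt_vec ~ lf_vec)

# Same model with the variables stored only as columns of a data frame
d <- data.frame(volt = volt_vec, Lf = lf_vec)
fit_df <- lm(volt ~ Lf, data = d)

# Both fits give the same estimates
same_fit <- isTRUE(all.equal(unname(coef(fit_env)), unname(coef(fit_df))))
same_fit
```

Passing data = explicitly is usually the safer habit, since it does not depend on what happens to be on the search path.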
data
data <- structure(list(volt = c(18.23, 16.09, 16.79, 17.01, 16.09, 19.66
), LfquantBS = c(3.7, 4.1, 4.1, 4.1, 4.1, 6.2), HfquantBS = c(9.2,
11.4, 11.4, 11.4, 11.4, 10.5), LfbankVOL = c(3.2, 3.2, 3.2, 3.2,
3.2, 4.2), HfbankMM = c(2.6, 2.7, 2.7, 2.7, 2.7, 1.8), HfnonbankMM = c(35.3,
35.3, 35.3, 35.3, 35.3, 30.7), HfindMM = c(7.9, 8.2, 8.2, 8.2,
8.2, 8.6)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))

Related

Error when Running cor.test on few columns with apply in R

I want to calculate the correlation between 'y' column and each column in 'col_df' dataframe.
For each calculation I want to save only the columns name with significant p_value (p_value<0.05).
y is a vector 64X1 of 0 and 1.
Example of the col_df- 60X12000
a b c d e
7.6 4.9 8.9 6.0 4.2
25.0 6.5 4.6 13.2 3.0
col_df <- as.matrix(df)
test <- col_df[, apply(col_df, MARGIN = 2, FUN = function(x)
(cor.test(y, col_df[,x], method = "pearson")$p.value <0.05))]
This is the error:
Error in col_df[, x] : subscript out of bounds
Is this the way to do that?
This is a working solution:
df <- structure(list(a = c(7.6, 7.6, 25, 25, 25, 25, 7.6, 7.6, 7.6, 25),
b = c(4.9, 4.9, 6.5, 6.5, 4.9, 6.5, 4.9, 4.9, 6.5, 6.5),
c = c(8.9, 4.6, 8.9, 8.9, 8.9, 4.6, 4.6, 8.9, 8.9, 4.6),
d = c(13.2, 13.2, 6, 6, 6, 6, 6, 13.2, 13.2, 13.2),
e = c(3, 4.2, 3, 4.2, 3, 3, 3, 4.2, 4.2, 4.2)),
class = "data.frame", row.names = c(NA, -10L))
y <- c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L)
test <- df[, apply(df, MARGIN = 2, FUN = function(x)
(cor.test(y, x, method = "pearson")$p.value < 0.05))]
test
#> a b
#> 1 7.6 4.9
#> 2 7.6 4.9
#> 3 25.0 6.5
#> 4 25.0 6.5
#> 5 25.0 4.9
#> 6 25.0 6.5
#> 7 7.6 4.9
#> 8 7.6 4.9
#> 9 7.6 6.5
#> 10 25.0 6.5
The difference from your solution is that apply() passes each column to the function as x,
not a column index. Hence, all you have to do is replace col_df[,x] in your solution with
just x.
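To make the point about x concrete, here is a small sketch on a toy matrix (not the asker's data; the names m, by_column and by_index are made up). It contrasts iterating over columns with iterating over column indices.

```r
# Small matrix with named columns (toy values)
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# apply() over MARGIN = 2 hands each column's values to the function as x
by_column <- apply(m, 2, function(x) sum(x))

# Indexing with m[, j] only makes sense when looping over column positions
by_index <- sapply(seq_len(ncol(m)), function(j) sum(m[, j]))

by_column
by_index
```

Both give the same sums; using the column values as an index, as in col_df[, x], is what triggers the "subscript out of bounds" error.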
You can simplify it a little with sapply(). I also recommend not to put everything into
a single line. It is hard to read and harder to debug.
Columns <- sapply(df, FUN = function(x) (cor.test(y, x, method = "pearson")$p.value < 0.05))
test <- df[, Columns]
test
#> a b
#> 1 7.6 4.9
#> 2 7.6 4.9
#> 3 25.0 6.5
#> 4 25.0 6.5
#> 5 25.0 4.9
#> 6 25.0 6.5
#> 7 7.6 4.9
#> 8 7.6 4.9
#> 9 7.6 6.5
#> 10 25.0 6.5
Created on 2020-07-22 by the reprex package (v0.3.0)
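Since the question asks for the names of the significant columns rather than the columns themselves, a variant that keeps the p-values may also be useful. This is a sketch using the same example data; p_values and sig_names are names chosen here for illustration.

```r
df <- data.frame(
  a = c(7.6, 7.6, 25, 25, 25, 25, 7.6, 7.6, 7.6, 25),
  b = c(4.9, 4.9, 6.5, 6.5, 4.9, 6.5, 4.9, 4.9, 6.5, 6.5),
  c = c(8.9, 4.6, 8.9, 8.9, 8.9, 4.6, 4.6, 8.9, 8.9, 4.6),
  d = c(13.2, 13.2, 6, 6, 6, 6, 6, 13.2, 13.2, 13.2),
  e = c(3, 4.2, 3, 4.2, 3, 3, 3, 4.2, 4.2, 4.2)
)
y <- c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L)

# One p-value per column, returned as a named numeric vector
p_values <- vapply(df, function(col) cor.test(y, col, method = "pearson")$p.value,
                   numeric(1))

# Names of the columns whose p-value is below 0.05
sig_names <- names(p_values)[p_values < 0.05]
sig_names
```

With 12000 columns you may also want to think about multiple-testing correction, e.g. p.adjust(p_values), before applying the 0.05 cut-off.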

Carrying out an ANOVA test from a table of data with multiple columns

Is there a way in R to carry out an ANOVA test from a table of data that looks as follows:
Trees Avg_number_1m Avg_number_2m Avg_number_3m Avg_number_4m
1 Tree_1 15.2 15.0 15.2 12.0
2 Tree_2 16.2 15.4 14.2 15.4
3 Tree_3 14.4 9.2 3.2 1.6
4 Tree_4 14.6 5.6 10.4 9.2
5 Tree_5 15.2 13.0 7.4 3.0
6 Tree_6 14.0 12.0 13.0 11.2
7 Tree_7 13.8 7.8 7.2 2.0
8 Tree_8 10.8 5.8 4.4 2.4
9 Tree_9 12.4 9.6 6.8 2.6
10 Tree_10 15.6 11.0 7.2 1.8
11 Tree_11 7.6 7.4 9.0 1.8
12 Tree_12 13.8 7.8 7.2 2.0
13 Tree_13 10.8 5.8 4.4 1.6
14 Tree_14 15.2 15.0 15.2 12.0
15 Tree_15 16.2 15.4 14.2 15.0
16 Tree_16 12.4 9.2 3.2 1.6
17 Tree_17 14.6 5.6 10.4 9.2
18 Tree_18 15.2 13.0 7.4 3.0
19 Tree_19 14.0 14.4 13.2 13.8
20 Tree_20 11.0 5.2 4.4 0.8
I've tried to find tutorials on how to do this but the fact that the aov command requires one x and one y variable has been throwing me off. Any help is much appreciated.
So this is your data:
x = structure(list(Trees = structure(c(1L, 12L, 14L, 15L, 16L, 17L,
18L, 19L, 20L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L), .Label = c("Tree_1",
"Tree_10", "Tree_11", "Tree_12", "Tree_13", "Tree_14", "Tree_15",
"Tree_16", "Tree_17", "Tree_18", "Tree_19", "Tree_2", "Tree_20",
"Tree_3", "Tree_4", "Tree_5", "Tree_6", "Tree_7", "Tree_8", "Tree_9"
), class = "factor"), Avg_number_1m = c(15.2, 16.2, 14.4, 14.6,
15.2, 14, 13.8, 10.8, 12.4, 15.6, 7.6, 13.8, 10.8, 15.2, 16.2,
12.4, 14.6, 15.2, 14, 11), Avg_number_2m = c(15, 15.4, 9.2, 5.6,
13, 12, 7.8, 5.8, 9.6, 11, 7.4, 7.8, 5.8, 15, 15.4, 9.2, 5.6,
13, 14.4, 5.2), Avg_number_3m = c(15.2, 14.2, 3.2, 10.4, 7.4,
13, 7.2, 4.4, 6.8, 7.2, 9, 7.2, 4.4, 15.2, 14.2, 3.2, 10.4, 7.4,
13.2, 4.4), Avg_number_4m = c(12, 15.4, 1.6, 9.2, 3, 11.2, 2,
2.4, 2.6, 1.8, 1.8, 2, 1.6, 12, 15, 1.6, 9.2, 3, 13.8, 0.8)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"))
We can very quickly visualize your data using boxplot, and it shows that there are fewer spines at greater heights:
So we load a few libraries to get the data in the correct shape:
library(ggplot2)
library(tidyr)
# first we make it a "long" format
df = pivot_longer(x,-Trees,names_to="Height_levels")
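If you prefer not to load tidyr, roughly the same "long" shape can be built in base R with stack(). This is a sketch on a three-tree, two-height subset of the data; x_small, stacked and long are names made up for the example.

```r
# Subset of the table from the question (first three trees, two heights)
x_small <- data.frame(
  Trees = c("Tree_1", "Tree_2", "Tree_3"),
  Avg_number_1m = c(15.2, 16.2, 14.4),
  Avg_number_2m = c(15.0, 15.4, 9.2)
)

# stack() puts all measurement columns into one 'values' column,
# with the originating column name in 'ind'
stacked <- stack(x_small[-1])

long <- data.frame(
  Trees = rep(x_small$Trees, times = ncol(x_small) - 1),
  Height_levels = as.character(stacked$ind),
  value = stacked$values
)
long
```

Each row of long is now one (tree, height) observation, which is the layout aov() expects.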
Now we visualize how each individual tree looks:
ggplot(df,aes(x=Height_levels,y=value,col=Trees)) + geom_point() +
geom_line(aes(group=Trees)) + theme(legend.position="top")
This tells us two things: we need to adjust for the tree, and then test whether there are differences between the heights. The most straightforward way is to use an ANOVA:
aovfit = aov(value ~ Trees + Height_levels,data=df)
summary(aovfit)
              Df Sum Sq Mean Sq F value   Pr(>F)
Trees         19  877.9   46.20   7.692 8.98e-10 ***
Height_levels  3  588.9  196.31  32.682 2.02e-12 ***
Residuals     57  342.4    6.01
And post-hoc with Tukey:
posthoc = TukeyHSD(aovfit)
posthoc$Height_levels
                             diff      lwr        upr        p adj
Avg_number_2m-Avg_number_1m -3.49 -5.54109 -1.4389103 1.930647e-04
Avg_number_3m-Avg_number_1m -4.77 -6.82109 -2.7189103 4.752523e-07
Avg_number_4m-Avg_number_1m -7.55 -9.60109 -5.4989103 1.182687e-11
Avg_number_3m-Avg_number_2m -1.28 -3.33109  0.7710897 3.586375e-01
Avg_number_4m-Avg_number_2m -4.06 -6.11109 -2.0089103 1.429319e-05
Avg_number_4m-Avg_number_3m -2.78 -4.83109 -0.7289103 3.779450e-03
If you would like, you can also fit a linear model, where the height is a continuous variable, and test it with an anova:
df$Height = as.numeric(gsub("[^0-9]","",as.character(df$Height_levels)))
aov_continuous = aov(value ~ Trees + Height,data=df)
summary(aov_continuous)
            Df Sum Sq Mean Sq F value   Pr(>F)
Trees       19  877.9    46.2   7.601 7.74e-10 ***
Height       1  572.6   572.6  94.199 7.78e-14 ***
Residuals   59  358.7     6.1
The coefficients tell you how many fewer spines you get on average by going up 1 m. In this case, the Height coefficient is about -2.39:
aov_continuous$coefficients
[...]
Height
-2.393000e+00
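The same idea can be tried on a small subset without any extra packages. The sketch below uses only the first three trees from the table, so the slope it gives differs from the full-data value above; long, fit and height_slope are names chosen for illustration, and the gsub() step assumes the only digits in the column names are the height in metres.

```r
# First three trees from the question, already in long format
long <- data.frame(
  Trees = rep(c("Tree_1", "Tree_2", "Tree_3"), each = 4),
  Height_levels = rep(paste0("Avg_number_", 1:4, "m"), times = 3),
  value = c(15.2, 15.0, 15.2, 12.0,
            16.2, 15.4, 14.2, 15.4,
            14.4,  9.2,  3.2,  1.6)
)

# Pull the numeric height out of names such as "Avg_number_3m"
long$Height <- as.numeric(gsub("[^0-9]", "", long$Height_levels))

# Tree as a blocking factor, height as a continuous predictor
fit <- lm(value ~ Trees + Height, data = long)

# Estimated change in the response per extra metre of height
height_slope <- unname(coef(fit)["Height"])
height_slope
```

A negative slope here means fewer spines higher up the tree, in line with the full analysis.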

Split Data Frame Into N Data Frames Based On Column Names

I have a large data frame (thousands of columns) in which a few columns have duplicate column names. There are also sets of columns whose names share one part but differ in another part.
Using R and the above two properties, I want to split all such columns into different data frames for further analysis. To achieve this I want to run the following dynamic logic on the data frame:
First: find and cbind() columns with duplicate column names into different data frames. If 10 columns have the same column name, they form one data frame, and another 5 with the same column name form another data frame.
Second: find and cbind() columns into different data frames if the part of the column name before - matches that of another column, while the part of the column name after - does not match.
Below is the sample input data (the real data is too big, but follows exactly the same pattern), for which the first two columns will form a single data frame based on the above example. There will be another data frame that contains the columns from the third to the last one.
I tried split(), but that hasn't worked out so far. Any suggestions on how I can do this?
Sample Input Data
structure(list(`A-DIODE` = c(1.2, 0.4), `A-DIODE` = c(1.3, 0.6
), `B-DIODE` = c(1.4, 0.8), `B-ACC1` = c(1.5, 1), `B-ACC2` = c(1.6,
1.2), `B-ANA0` = c(1.7, 1.4), `B-ANA1` = c(1.8, 1.6), `B-BRICKID` = c(1.9,
1.8), `B-CC0` = c(2L, 2L), `B-CC1` = c(2.1, 2.2), `B-DIGDN` = c(2.2,
2.4), `B-DIGDP` = c(2.3, 2.6), `B-DN1` = c(2.4, 2.8), `B-DN2` = c(2.5,
3), `B-DP1` = c(2.6, 3.2), `B-DP2` = c(2.7, 3.4), `B-SCL` = c(2.8,
3.6), `B-SDA` = c(2.9, 3.8), `B-USB0DN` = 3:4, `B-USB0DP` = c(3.1,
4.2), `B-USB1DN` = c(3.2, 4.4), `B-USB1DP` = c(3.3, 4.6), `B-ACC1` = c(3.4,
4.8), `B-ACC2` = c(3.5, 5), `B-ANA0` = c(3.6, 5.2), `B-ANA1` = c(3.7,
5.4), `B-BRICKID` = c(3.8, 5.6), `B-CC0` = c(3.9, 5.8), `B-CC1` = c(4L,
6L), `B-DIGDN` = c(4.1, 6.2), `B-DIGDP` = c(4.2, 6.4), `B-DN1` = c(4.3,
6.6), `B-DN2` = c(4.4, 6.8), `B-DP1` = c(4.5, 7), `B-DP2` = c(4.6,
7.2), `B-SCL` = c(4.7, 7.4), `B-SDA` = c(4.8, 7.6), `B-USB0DN` = c(4.9,
7.8), `B-USB0DP` = c(5L, 8L), `B-USB1DN` = c(5.1, 8.2), `B-USB1DP` = c(5.2,
8.4), `B-NA` = c(5.3, 8.6), `B-ACC2PWRLKG_0v4` = c(5.4, 8.8),
`B-ACC2PWRLKG_0v4` = c(5.5, 9), `B-P_IN_Leak` = c(5.6, 9.2
)), class = "data.frame", row.names = c(NA, -2L))
Output Based On Logic Discussed Above
Data Frame 1
A-DIODE A-DIODE
1.2 1.3
0.4 0.6
Data Frame 2
B-DIODE B-ACC1 B-ACC2 B-ANA0 B-ANA1 B-BRICKID B-CC0 B-CC1 B-DIGDN B-DIGDP B-DN1 B-DN2 B-DP1 B-DP2 B-SCL B-SDA B-USB0DN B-USB0DP
1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1
0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2
B-USB1DN B-USB1DP B-ACC1.1 B-ACC2.1 B-ANA0.1 B-ANA1.1 B-BRICKID.1 B-CC0.1 B-CC1.1 B-DIGDN.1 B-DIGDP.1 B-DN1.1 B-DN2.1 B-DP1.1
3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5
4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2 6.4 6.6 6.8 7.0
B-DP2.1 B-SCL.1 B-SDA.1 B-USB0DN.1 B-USB0DP.1 B-USB1DN.1 B-USB1DP.1 B-NA B-ACC2PWRLKG_0v4 B-ACC2PWRLKG_0v4.1 B-P_IN_Leak
4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6
7.2 7.4 7.6 7.8 8 8.2 8.4 8.6 8.8 9.0 9.2
We can use split.default on the substring of names of the dataset
split.default(df1, sub("-.*", "", names(df1)))
Or if we know there would be only one character before -
split.default(df1, substr(names(df1), 1, 1))
#$A
# A-DIODE A-DIODE.1
#1 1.2 1.3
#2 0.4 0.6
#$B
# B-DIODE B-ACC1 B-ACC2 B-ANA0 B-ANA1 B-BRICKID B-CC0 B-CC1 B-DIGDN B-DIGDP B-DN1 B-DN2 B-DP1 B-DP2 B-SCL B-SDA B-USB0DN B-USB0DP
#1 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1
#2 0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2
# B-USB1DN B-USB1DP B-ACC1.1 B-ACC2.1 B-ANA0.1 B-ANA1.1 B-BRICKID.1 B-CC0.1 B-CC1.1 B-DIGDN.1 B-DIGDP.1 B-DN1.1 B-DN2.1 B-DP1.1 B-DP2.1
#1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6
#2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2 6.4 6.6 6.8 7.0 7.2
# B-SCL.1 B-SDA.1 B-USB0DN.1 B-USB0DP.1 B-USB1DN.1 B-USB1DP.1 B-NA B-ACC2PWRLKG_0v4 B-ACC2PWRLKG_0v4.1 B-P_IN_Leak
#1 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6
#2 7.4 7.6 7.8 8 8.2 8.4 8.6 8.8 9.0 9.2
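Here is a self-contained sketch of both groupings on a four-column toy data frame (the values and the names prefix_groups, by_name and dup_groups are made up for illustration). Grouping by the full name covers the "exact duplicate" case from the question, grouping by the part before - covers the prefix case.

```r
# Toy data frame with duplicated and prefix-sharing column names
df1 <- data.frame(1:2, 3:4, 5:6, 7:8)
names(df1) <- c("A-DIODE", "A-DIODE", "B-DIODE", "B-ACC1")

# Group columns by the part of the name before the first "-"
prefix_groups <- split.default(df1, sub("-.*", "", names(df1)))
names(prefix_groups)
#> [1] "A" "B"

# Group columns by their full name and keep only names that occur more than once
by_name <- split.default(df1, names(df1))
dup_groups <- by_name[vapply(by_name, ncol, integer(1)) > 1]
names(dup_groups)
#> [1] "A-DIODE"
```

Each element of the resulting lists is a data frame, so it can be passed on to lapply() for the further analysis.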

R: Extracting the highest numeric value from each character value in a column

I have a character field in a dataframe that contains numbers e.g. (0.5,3.5,7.8,2.4).
For every record I am trying to extract the largest value from the string and put it in a new column.
e.g.
x csi
1 0.5, 6.7, 2.3
2 9.5, 2.6, 1.1
3 0.7, 2.3, 5.1
4 4.1, 2.7, 4.7
The desired output would be:
x csi csi_max
1 0.5, 6.7, 2.3 6.7
2 9.5, 2.6, 1.1 9.5
3 0.7, 2.3, 5.1 5.1
4 4.1, 2.7, 4.7 4.7
I have had various attempts ...with my latest attempt being the following - which provides the maximum csi score from the entire column rather than from the individual row's csi numbers...
library(stringr)
numextract <- function(string){
str_extract(string, "\\-*\\d+\\.*\\d*")
}
df$max_csi <- max(numextract(df$csi))
Thank you
We can use tidyverse
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(csi, convert = TRUE) %>% # convert = TRUE so max() compares numbers, not strings
group_by(x) %>%
summarise(csi_max = max(csi)) %>%
left_join(df1, .)
# x csi csi_max
#1 1 0.5, 6.7, 2.3 6.7
#2 2 9.5, 2.6, 1.1 9.5
#3 3 0.7, 2.3, 5.1 5.1
#4 4 4.1, 2.7, 4.7 4.7
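One thing to watch when taking max() of split strings: if the pieces are still character, max() compares them as text, which can give a different answer from the numeric maximum. A tiny sketch with made-up values:

```r
# Made-up values, still stored as character after splitting
pieces <- c("10.2", "9.5", "2.3")

text_max <- max(pieces)               # compared character by character
text_max
#> [1] "9.5"

number_max <- max(as.numeric(pieces)) # compared as numbers
number_max
#> [1] 10.2
```

With single-digit values like those in the question the two happen to agree, but converting to numeric first avoids the surprise.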
Or this can be done with pmax from base R after separating the 'csi' column into a data.frame with read.table
df1$csi_max <- do.call(pmax, read.table(text=df1$csi, sep=","))
Hope this helps!
df$csi_max <- sapply(df$csi, function(x) max(as.numeric(unlist(strsplit(as.character(x), split=",")))))
Output is:
x csi csi_max
1 1 0.5, 6.7, 2.3 6.7
2 2 9.5, 2.6, 1.1 9.5
3 3 0.7, 2.3, 5.1 5.1
4 4 4.1, 2.7, 4.7 4.7
#sample data
> dput(df)
structure(list(x = 1:4, csi = structure(c(1L, 4L, 2L, 3L), .Label = c("0.5, 6.7, 2.3",
"0.7, 2.3, 5.1", "4.1, 2.7, 4.7", "9.5, 2.6, 1.1"), class = "factor")), .Names = c("x",
"csi"), class = "data.frame", row.names = c(NA, -4L))
Edit:
As suggested by @RichScriven, the more efficient way could be
df$csi_max <- sapply(strsplit(as.character(df$csi), ","), function(x) max(as.numeric(x)))
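For completeness, here is a sketch of why the attempt in the question returns one value for the whole column: max() is called once on everything, instead of once per row. The names nums and overall_max are made up for the example; the data is the sample from the question stored as character.

```r
df <- data.frame(
  x = 1:4,
  csi = c("0.5, 6.7, 2.3", "9.5, 2.6, 1.1", "0.7, 2.3, 5.1", "4.1, 2.7, 4.7"),
  stringsAsFactors = FALSE
)

# One numeric vector per row
nums <- lapply(strsplit(df$csi, ","), as.numeric)

# Calling max() on everything at once gives a single value for the column
overall_max <- max(unlist(nums))
overall_max
#> [1] 9.5

# Calling max() on each element gives one value per row
df$csi_max <- vapply(nums, max, numeric(1))
df$csi_max
#> [1] 6.7 9.5 5.1 4.7
```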
A solution using the splitstackshape package.
library(splitstackshape)
dat$csi_max <- apply(cSplit(dat, "csi")[, -1], 1, max)
dat
# x csi csi_max
# 1 1 0.5, 6.7, 2.3 6.7
# 2 2 9.5, 2.6, 1.1 9.5
# 3 3 0.7, 2.3, 5.1 5.1
# 4 4 4.1, 2.7, 4.7 4.7
DATA
dat <- read.table(text = "x csi
1 '0.5, 6.7, 2.3'
2 '9.5, 2.6, 1.1'
3 '0.7, 2.3, 5.1'
4 '4.1, 2.7, 4.7'",
header = TRUE, stringsAsFactors = FALSE)

multiple x axis in R

I have a sample data set
ID Depth Salinity Temperature Time fluorescence
1 0 1.3 29.2 13:44:23 152
2 3.1 1.4 29.2 13:44:26 175
3 3.5 2 29.2 13:44:30 149
4 4.3 2.6 29.2 13:44:34 192
5 7.5 2.9 29.4 13:44:37 174
6 8.2 2.1 29.1 13:44:41 154
7 10 2.6 29.1 13:44:44 147
8 9.1 2.6 29.1 13:44:48 150
9 7.3 2.7 28.9 13:44:52 147
10 5.2 3.2 29.0 13:44:55 180
11 4.5 2 29.0 13:44:59 167
12 3.3 2.3 29.1 13:45:03 154
13 2.5 1.8 29.1 13:45:06 106
14 0 1.5 29.1 13:45:10 136
I want two profiles, an Up and a Down profile (i.e. from depth 0-10 and 10-0), in the same plot. I used the code below to generate a plot
meltdf <- mutate(meltdf, trend = c(rep("UP",7), rep("DOWN",7)))
p <- ggplot(meltdf, aes(x = Temperature, y = Depth, color = trend)) +
geom_line()
p
I get the plot with this. However, what I want is Depth on the y axis and Salinity, Temperature and fluorescence on multiple x axes in the same graph. As they have varying ranges, I don't know how I should set it up.
Also, the data I have is quite big, and when I plot it I don't get a smooth curve (pic R plot) in my result. Is there a way to avoid those spikes?
You might be looking for something like this
Your data
df <- structure(list(ID = 1:14, Depth = c(0, 3.1, 3.5, 4.3, 7.5, 8.2,
10, 9.1, 7.3, 5.2, 4.5, 3.3, 2.5, 0), Salinity = c(1.3, 1.4,
2, 2.6, 2.9, 2.1, 2.6, 2.6, 2.7, 3.2, 2, 2.3, 1.8, 1.5), Temperature = c(29.2,
29.2, 29.2, 29.2, 29.4, 29.1, 29.1, 29.1, 28.9, 29, 29, 29.1,
29.1, 29.1), Time = c("13:44:23", "13:44:26", "13:44:30", "13:44:34",
"13:44:37", "13:44:41", "13:44:44", "13:44:48", "13:44:52", "13:44:55",
"13:44:59", "13:45:03", "13:45:06", "13:45:10"), fluorescence = c(152L,
175L, 149L, 192L, 174L, 154L, 147L, 150L, 147L, 180L, 167L, 154L,
106L, 136L)), .Names = c("ID", "Depth", "Salinity", "Temperature",
"Time", "fluorescence"), row.names = c(NA, -14L), class = c("data.table",
"data.frame"))
library(tidyverse)
meltdf <- mutate(df, trend = c(rep("UP",7), rep("DOWN",7)))
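Since the real data is larger than this sample, hard-coding rep("UP",7) may not carry over. A possible sketch that derives the same labels from the position of the maximum depth is shown below; it assumes a single cast with exactly one turning point, and turn and trend are names chosen for illustration.

```r
# Depth column from the sample data
Depth <- c(0, 3.1, 3.5, 4.3, 7.5, 8.2, 10, 9.1, 7.3, 5.2, 4.5, 3.3, 2.5, 0)

# Row at which the maximum depth is reached
turn <- which.max(Depth)

# Rows up to and including the turning point are "UP", the rest "DOWN"
trend <- ifelse(seq_along(Depth) <= turn, "UP", "DOWN")
trend
```

For the sample data this reproduces the seven "UP" and seven "DOWN" labels used above.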
solution
Starting with meltdf, gather relevant x-axis variables
moremelt <- meltdf %>%
gather(key, value, Salinity, Temperature, fluorescence)
ggplot with facet_wrap using options nrow=3 and scales="free"
ggplot(moremelt, aes(x = value, y = Depth, color = interaction(trend,key), label=key)) +
geom_line(lwd=2) +
scale_colour_manual(values=c("orange","red","blue","cyan","black","grey")) +
facet_wrap(~key, nrow=3, scales="free")
