I have a set of data containing species names and numbers (spp_data) and I am trying to test how the species are influenced by different parameters such as pH, conductivity, as well as the Sewer position (Upstream/Downstream) (env_data1).
When I'm trying to run the lm() I get the following error:
lm1 <- lm(specnumber ~ Sewer + pH + Conductivity, data=spp_data,env_data1)
Error in eval(predvars, data, env) : object 'Sewer' not found
Is it because the column Sewer is non-numeric?
I also tried to exclude that column and run the lm() but it did not work.
species data
summary(spp_data)
Pisidium G_pulex C_pseudo A_aquatic V_pisc
Min. :0.000 Min. : 0.00 Min. : 0.000 Min. :0.0000 Min. :0.00000
1st Qu.:0.000 1st Qu.: 3.00 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.000 Median : 8.00 Median : 3.000 Median :0.0000 Median :0.00000
Mean :1.429 Mean :16.86 Mean : 4.476 Mean :0.5714 Mean :0.04762
3rd Qu.:2.000 3rd Qu.:20.00 3rd Qu.:10.000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :7.000 Max. :68.00 Max. :16.000 Max. :4.0000 Max. :1.00000
Taeniopt Rhyacoph Hydropsy Lepidost Glossos
Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 0.00
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 0.00
Median :0.00000 Median :0.0000 Median :0.000 Median :0.000 Median : 0.00
Mean :0.09524 Mean :0.2381 Mean :1.286 Mean :1.238 Mean : 1.81
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 1.00
Max. :2.00000 Max. :2.0000 Max. :5.000 Max. :7.000 Max. :14.00
Agapetus Hydroptil Limneph S_person Tipula
Min. : 0.0000 Min. :0.00000 Min. :0.000 Min. :0.00000 Min. :0
1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0
Median : 0.0000 Median :0.00000 Median :0.000 Median :0.00000 Median :0
Mean : 0.5714 Mean :0.04762 Mean :0.381 Mean :0.09524 Mean :0
3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0
Max. :12.0000 Max. :1.00000 Max. :2.000 Max. :2.00000 Max. :0
Culicida Ceratopo Simuliid Chrinomi Chrnomus
Min. :0.0000 Min. : 0 Min. : 0.0000 Min. : 0.000 Min. : 0.000
1st Qu.:0.0000 1st Qu.: 0 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.000
Median :0.0000 Median : 1 Median : 0.0000 Median : 2.000 Median : 3.000
Mean :0.5714 Mean : 7 Mean : 0.5238 Mean : 7.286 Mean : 6.095
3rd Qu.:0.0000 3rd Qu.: 8 3rd Qu.: 0.0000 3rd Qu.: 8.000 3rd Qu.: 6.000
Max. :5.0000 Max. :31 Max. :10.0000 Max. :67.000 Max. :41.000
environmental data
summary(env_data)
Sample Sewer pH Conductivity
Length:21 Length:21 Min. :7.780 Length:21
Class :character Class :character 1st Qu.:7.850 Class :character
Mode :character Mode :character Median :8.100 Mode :character
Mean :8.044
3rd Qu.:8.270
Max. :8.280
Depth %rock %mud %sand,,
Min. : 7.00 Min. :10.00 Min. : 0 Length:21
1st Qu.: 8.00 1st Qu.:10.00 1st Qu.:20 Class :character
Median :11.00 Median :70.00 Median :30 Mode :character
Mean :17.14 Mean :57.14 Mean :40
3rd Qu.:28.00 3rd Qu.:80.00 3rd Qu.:90
Max. :40.00 Max. :90.00 Max. :90
Assuming that the rows of your spp_data match the rows of your environmental data ... I think if you do
lm1 <- lm(as.matrix(spp_data) ~ Sewer + pH + Conductivity,
data=env_data1)
you will get the results of running 44 separate linear models, one for each species. (Be careful: with 44 regressions and only 21 observations, you may need to do some multiple comparisons corrections to avoid overstating your conclusions.)
There are R packages for more sophisticated multi-species analyses such as mvabund or gllvm, but they might not apply to a data set this size ...
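As a rough illustration of the matrix-response approach above, here is a minimal sketch on simulated data. The data frames env and spp, the species names sp_a/sp_b/sp_c, and all values are made up for the example. It also assumes that Conductivity holds numbers stored as text (the posted summary shows it as class character), so it is converted first; otherwise lm() would treat it as a categorical variable.

```r
set.seed(1)
n <- 21

# Hypothetical environmental data; Conductivity is deliberately stored as text
env <- data.frame(
  Sewer = rep(c("Upstream", "Downstream"), length.out = n),
  pH = runif(n, 7.7, 8.3),
  Conductivity = as.character(round(runif(n, 300, 600))),
  stringsAsFactors = FALSE
)
env$Conductivity <- as.numeric(env$Conductivity)  # convert text to numbers

# Hypothetical species counts, one column per species, rows matching env
spp <- data.frame(sp_a = rpois(n, 5), sp_b = rpois(n, 2), sp_c = rpois(n, 8))

# A matrix response fits one linear model per column (species)
fit <- lm(as.matrix(spp) ~ Sewer + pH + Conductivity, data = env)
coef_dim <- dim(coef(fit))  # rows = model terms, columns = species

# Collect the p-value of the Sewer term for each species, then adjust
sums <- summary(fit)
p_sewer <- sapply(sums, function(s) s$coefficients["SewerUpstream", "Pr(>|t|)"])
p_adj <- p.adjust(p_sewer, method = "holm")
print(p_adj)
```

Holm is only one of the adjustment methods p.adjust() offers; which correction is appropriate depends on the analysis.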
I am running some calculations in R:
df1 <- data.frame( data=mydata6$Date.created, mydata6[,-1] + mydataADDtasks[,-1])
The code runs without any errors. When I write
View(df1)
I see a table with 5 observations of 16 variables.
But when I check
summarise (df1)
data frame with 0 columns and 1 row
Obviously I cannot do any calculations with the dataset. What should I do? What is wrong?
This is normal behavior for the dplyr::summarise function. You supplied no summary expressions and no grouping, so it returns a data frame with one row and zero columns. You could use group_by together with a summary expression to get a non-empty data frame.
mtcars |>
dplyr::group_by(mpg) |>
dplyr::summarise(sum_cyl = sum(cyl))
There's probably nothing wrong with your dataframe!
You could also get an output from no grouping variables, but you still would need to supply a function to get an output:
dplyr::summarise(mtcars, sum_cyl = sum(cyl))
Maybe what you are after is summary?
> summary(mtcars)
mpg cyl disp hp drat wt
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424
qsec vs am gear carb
Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :17.71 Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
I want to see summary output for a few columns of iris (built-in dataset) inside a loop, using the construct below. I saw elsewhere that mget might be a solution, but I guess it's not. Can someone help with an up-to-date and effective way to run it?
l <- c("Sepal.Length","Sepal.Width")
for(i in l){
print(summary( mget(paste0("iris$",l))))
}
I get an error when running the above:
Error: value for ‘iris$Sepal.Length’ not found
Q2: How would this work for different data frames?
l <- c("iris","mtcars")
for(i in l){
print(summary( mget(l)))
}
Since you have the column names in a vector, you don't need to get them. Just use each name directly as an index with [[ to extract the column.
Base R:
sapply(l, function(x) summary(iris[[x]]))
Sepal.Length Sepal.Width
Min. 4.300000 2.000000
1st Qu. 5.100000 2.800000
Median 5.800000 3.000000
Mean 5.843333 3.057333
3rd Qu. 6.400000 3.300000
Max. 7.900000 4.400000
Q2:
In this case because you need to get the value of an object, you literally need the get() function.
sapply(c("iris","mtcars"), function(x) summary(get(x)))
$iris
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
$mtcars
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
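For readers who want to keep the explicit for loop from the question, here is a small sketch of both cases. The variable names cols, col_summaries, dfs and df_summaries are chosen for the example. It relies on the fact that mget() looks up objects by name, so a string such as "iris$Sepal.Length" (an expression, not an object name) cannot be found, while "iris" and "mtcars" can.

```r
# Q1: loop over column names and index the data frame with [[
cols <- c("Sepal.Length", "Sepal.Width")
col_summaries <- list()
for (col in cols) {
  col_summaries[[col]] <- summary(iris[[col]])
  print(col_summaries[[col]])
}

# Q2: mget() returns a named list of objects looked up by name;
# inherits = TRUE lets the lookup reach the attached datasets package
dfs <- mget(c("iris", "mtcars"), inherits = TRUE)
df_summaries <- lapply(dfs, summary)
print(names(df_summaries))
```

Using lapply() here keeps the result a named list, which is also what sapply() falls back to when the per-data-frame summaries cannot be simplified.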
I can use the tapply function to make basic operations (e.g. using mtcars data, calculate mean weight by number of cylinders).
library(data.table)
mtcars <- data.table(mtcars)
tapply(X = mtcars[,wt],
INDEX = mtcars[,cyl],
mean)
However, I do not know how to perform more complex operations. E.g. Correlation between weight and qsec variables by number of cylinders.
I tried something like the following but it does not work.
tapply(X = mtcars[,.(wt, qsec)],
INDEX = mtcars[,cyl],
cor.test(mtcars[,wt], mtcars[,qsec]))
Error in match.fun(FUN) : 'cor.test(mtcars[, wt], mtcars[, qsec])' is not a function, character or symbol
tapply(X = rownames(mtcars[,.(wt,qsec,cyl)]),
INDEX = mtcars[,cyl],
function(r) cor.test(mtcars[r, 1],
mtcars[r, 2])
Any idea how to do this efficiently with a t/apply function?
In my mind, a tapply variant for data.table should take FUNs that operate on indexed subsets of the data.table. I have defined a dt_tapply as I imagine it should behave. It seems practical enough.
library(data.table)
data(mtcars)
mtcars = data.table(mtcars)
#iterate over table with index, like tapply just for table rows
dt_tapply = function(dx,INDEX,FUN=NULL,...) {
lapply(sort(unique(INDEX)),function(i){
do.call(FUN,c(list(dx[INDEX==i,]),list(...)))
})
}
dt_tapply(mtcars,mtcars$cyl,summary)
#some custom made function computing stuff from multiple columns giving some blob output
compute_cor_wtqsec = function(dx) {
cor(dx$wt,dx$qsec)
}
#dt_tapply that function
dt_tapply(mtcars,mtcars$cyl,compute_cor_wtqsec)
[[1]]
mpg cyl disp hp drat wt qsec
Min. :21.40 Min. :4 Min. : 71.10 Min. : 52.00 Min. :3.690 Min. :1.513 Min. :16.70
1st Qu.:22.80 1st Qu.:4 1st Qu.: 78.85 1st Qu.: 65.50 1st Qu.:3.810 1st Qu.:1.885 1st Qu.:18.56
Median :26.00 Median :4 Median :108.00 Median : 91.00 Median :4.080 Median :2.200 Median :18.90
Mean :26.66 Mean :4 Mean :105.14 Mean : 82.64 Mean :4.071 Mean :2.286 Mean :19.14
3rd Qu.:30.40 3rd Qu.:4 3rd Qu.:120.65 3rd Qu.: 96.00 3rd Qu.:4.165 3rd Qu.:2.623 3rd Qu.:19.95
Max. :33.90 Max. :4 Max. :146.70 Max. :113.00 Max. :4.930 Max. :3.190 Max. :22.90
vs am gear carb
Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:1.0000 1st Qu.:0.5000 1st Qu.:4.000 1st Qu.:1.000
Median :1.0000 Median :1.0000 Median :4.000 Median :2.000
Mean :0.9091 Mean :0.7273 Mean :4.091 Mean :1.545
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:2.000
Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :2.000
[[2]]
mpg cyl disp hp drat wt qsec
Min. :17.80 Min. :6 Min. :145.0 Min. :105.0 Min. :2.760 Min. :2.620 Min. :15.50
1st Qu.:18.65 1st Qu.:6 1st Qu.:160.0 1st Qu.:110.0 1st Qu.:3.350 1st Qu.:2.822 1st Qu.:16.74
Median :19.70 Median :6 Median :167.6 Median :110.0 Median :3.900 Median :3.215 Median :18.30
Mean :19.74 Mean :6 Mean :183.3 Mean :122.3 Mean :3.586 Mean :3.117 Mean :17.98
3rd Qu.:21.00 3rd Qu.:6 3rd Qu.:196.3 3rd Qu.:123.0 3rd Qu.:3.910 3rd Qu.:3.440 3rd Qu.:19.17
Max. :21.40 Max. :6 Max. :258.0 Max. :175.0 Max. :3.920 Max. :3.460 Max. :20.22
vs am gear carb
Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.500 1st Qu.:2.500
Median :1.0000 Median :0.0000 Median :4.000 Median :4.000
Mean :0.5714 Mean :0.4286 Mean :3.857 Mean :3.429
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :6.000
[[3]]
mpg cyl disp hp drat wt qsec
Min. :10.40 Min. :8 Min. :275.8 Min. :150.0 Min. :2.760 Min. :3.170 Min. :14.50
1st Qu.:14.40 1st Qu.:8 1st Qu.:301.8 1st Qu.:176.2 1st Qu.:3.070 1st Qu.:3.533 1st Qu.:16.10
Median :15.20 Median :8 Median :350.5 Median :192.5 Median :3.115 Median :3.755 Median :17.18
Mean :15.10 Mean :8 Mean :353.1 Mean :209.2 Mean :3.229 Mean :3.999 Mean :16.77
3rd Qu.:16.25 3rd Qu.:8 3rd Qu.:390.0 3rd Qu.:241.2 3rd Qu.:3.225 3rd Qu.:4.014 3rd Qu.:17.55
Max. :19.20 Max. :8 Max. :472.0 Max. :335.0 Max. :4.220 Max. :5.424 Max. :18.00
vs am gear carb
Min. :0 Min. :0.0000 Min. :3.000 Min. :2.00
1st Qu.:0 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.25
Median :0 Median :0.0000 Median :3.000 Median :3.50
Mean :0 Mean :0.1429 Mean :3.286 Mean :3.50
3rd Qu.:0 3rd Qu.:0.0000 3rd Qu.:3.000 3rd Qu.:4.00
Max. :0 Max. :1.0000 Max. :5.000 Max. :8.00
[[1]]
[1] 0.6380214
[[2]]
[1] 0.8659614
[[3]]
[1] 0.5365487
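The same split-then-apply idea can also be sketched in base R without a helper function, using split() on the plain mtcars data frame. This is an alternative illustration, not the approach of the answer above; the names groups, cor_by_cyl, tests_by_cyl and p_values are invented for the example.

```r
# Split the data frame into one piece per cylinder count
groups <- split(mtcars, mtcars$cyl)

# Correlation between wt and qsec within each group (named by cyl value)
cor_by_cyl <- sapply(groups, function(d) cor(d$wt, d$qsec))
print(cor_by_cyl)

# Full cor.test() objects per group, then pull out the p-values
tests_by_cyl <- lapply(groups, function(d) cor.test(d$wt, d$qsec))
p_values <- sapply(tests_by_cyl, function(t) t$p.value)
print(p_values)
```

With data.table loaded and mtcars converted as above, the grouped form mtcars[, .(r = cor(wt, qsec)), by = cyl] expresses the same computation.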
I am working on a Coursera Machine Learning project. The goal is to build a predictive model for the following dataset.
> summary(training)
roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x
Min. :-28.90 Min. :-55.8000 Min. :-180.00 Min. : 0.00 Min. :-1.040000
1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00 1st Qu.:-0.030000
Median :113.00 Median : 5.2800 Median : -13.00 Median :17.00 Median : 0.030000
Mean : 64.41 Mean : 0.3053 Mean : -11.21 Mean :11.31 Mean :-0.005592
3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00 3rd Qu.: 0.110000
Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00 Max. : 2.220000
gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
Min. :-0.64000 Min. :-1.4600 Min. :-120.000 Min. :-69.00 Min. :-275.00 Min. :-52.0
1st Qu.: 0.00000 1st Qu.:-0.2000 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00 1st Qu.: 9.0
Median : 0.02000 Median :-0.1000 Median : -15.000 Median : 35.00 Median :-152.00 Median : 35.0
Mean : 0.03959 Mean :-0.1305 Mean : -5.595 Mean : 30.15 Mean : -72.59 Mean : 55.6
3rd Qu.: 0.11000 3rd Qu.:-0.0200 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00 3rd Qu.: 59.0
Max. : 0.64000 Max. : 1.6200 Max. : 85.000 Max. :164.00 Max. : 105.00 Max. :485.0
magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
Min. :354.0 Min. :-623.0 Min. :-180.00 Min. :-88.800 Min. :-180.0000 Min. : 1.00
1st Qu.:581.0 1st Qu.:-375.0 1st Qu.: -31.77 1st Qu.:-25.900 1st Qu.: -43.1000 1st Qu.:17.00
Median :601.0 Median :-320.0 Median : 0.00 Median : 0.000 Median : 0.0000 Median :27.00
Mean :593.7 Mean :-345.5 Mean : 17.83 Mean : -4.612 Mean : -0.6188 Mean :25.51
3rd Qu.:610.0 3rd Qu.:-306.0 3rd Qu.: 77.30 3rd Qu.: 11.200 3rd Qu.: 45.8750 3rd Qu.:33.00
Max. :673.0 Max. : 293.0 Max. : 180.00 Max. : 88.500 Max. : 180.0000 Max. :66.00
gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y
Min. :-6.37000 Min. :-3.4400 Min. :-2.3300 Min. :-404.00 Min. :-318.0
1st Qu.:-1.33000 1st Qu.:-0.8000 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.0
Median : 0.08000 Median :-0.2400 Median : 0.2300 Median : -44.00 Median : 14.0
Mean : 0.04277 Mean :-0.2571 Mean : 0.2695 Mean : -60.24 Mean : 32.6
3rd Qu.: 1.57000 3rd Qu.: 0.1400 3rd Qu.: 0.7200 3rd Qu.: 84.00 3rd Qu.: 139.0
Max. : 4.87000 Max. : 2.8400 Max. : 3.0200 Max. : 437.00 Max. : 308.0
accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
Min. :-636.00 Min. :-584.0 Min. :-392.0 Min. :-597.0 Min. :-153.71 Min. :-149.59
1st Qu.:-143.00 1st Qu.:-300.0 1st Qu.: -9.0 1st Qu.: 131.2 1st Qu.: -18.49 1st Qu.: -40.89
Median : -47.00 Median : 289.0 Median : 202.0 Median : 444.0 Median : 48.17 Median : -20.96
Mean : -71.25 Mean : 191.7 Mean : 156.6 Mean : 306.5 Mean : 23.84 Mean : -10.78
3rd Qu.: 23.00 3rd Qu.: 637.0 3rd Qu.: 323.0 3rd Qu.: 545.0 3rd Qu.: 67.61 3rd Qu.: 17.50
Max. : 292.00 Max. : 782.0 Max. : 583.0 Max. : 694.0 Max. : 153.55 Max. : 149.40
yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
Min. :-150.871 Min. : 0.00 Min. :-204.0000 Min. :-2.10000 Min. : -2.380
1st Qu.: -77.644 1st Qu.: 4.00 1st Qu.: -0.0300 1st Qu.:-0.14000 1st Qu.: -0.310
Median : -3.324 Median :10.00 Median : 0.1300 Median : 0.03000 Median : -0.130
Mean : 1.674 Mean :13.72 Mean : 0.1611 Mean : 0.04606 Mean : -0.129
3rd Qu.: 79.643 3rd Qu.:19.00 3rd Qu.: 0.3500 3rd Qu.: 0.21000 3rd Qu.: 0.030
Max. : 154.952 Max. :58.00 Max. : 2.2200 Max. :52.00000 Max. :317.000
accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
Min. :-419.00 Min. :-189.00 Min. :-334.00 Min. :-643.0 Min. :-3600
1st Qu.: -50.00 1st Qu.: -8.00 1st Qu.:-142.00 1st Qu.:-535.0 1st Qu.: 231
Median : -8.00 Median : 41.50 Median : -1.00 Median :-479.0 Median : 311
Mean : -28.62 Mean : 52.63 Mean : -38.32 Mean :-328.5 Mean : 221
3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 38.00 3rd Qu.:-304.0 3rd Qu.: 390
Max. : 235.00 Max. : 315.00 Max. : 318.00 Max. : 592.0 Max. : 633
magnet_dumbbell_z roll_forearm pitch_forearm yaw_forearm total_accel_forearm
Min. :-262.00 Min. :-180.0000 Min. :-72.50 Min. :-180.00 Min. : 0.00
1st Qu.: -45.00 1st Qu.: -0.7375 1st Qu.: 0.00 1st Qu.: -68.60 1st Qu.: 29.00
Median : 13.00 Median : 21.7000 Median : 9.24 Median : 0.00 Median : 36.00
Mean : 46.05 Mean : 33.8265 Mean : 10.71 Mean : 19.21 Mean : 34.72
3rd Qu.: 95.00 3rd Qu.: 140.0000 3rd Qu.: 28.40 3rd Qu.: 110.00 3rd Qu.: 41.00
Max. : 452.00 Max. : 180.0000 Max. : 89.80 Max. : 180.00 Max. :108.00
gyros_forearm_x gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
Min. :-22.000 Min. : -7.02000 Min. : -8.0900 Min. :-498.00 Min. :-632.0
1st Qu.: -0.220 1st Qu.: -1.46000 1st Qu.: -0.1800 1st Qu.:-178.00 1st Qu.: 57.0
Median : 0.050 Median : 0.03000 Median : 0.0800 Median : -57.00 Median : 201.0
Mean : 0.158 Mean : 0.07517 Mean : 0.1512 Mean : -61.65 Mean : 163.7
3rd Qu.: 0.560 3rd Qu.: 1.62000 3rd Qu.: 0.4900 3rd Qu.: 76.00 3rd Qu.: 312.0
Max. : 3.970 Max. :311.00000 Max. :231.0000 Max. : 477.00 Max. : 923.0
accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
Min. :-446.00 Min. :-1280.0 Min. :-896.0 Min. :-973.0 A:5580
1st Qu.:-182.00 1st Qu.: -616.0 1st Qu.: 2.0 1st Qu.: 191.0 B:3797
Median : -39.00 Median : -378.0 Median : 591.0 Median : 511.0 C:3422
Mean : -55.29 Mean : -312.6 Mean : 380.1 Mean : 393.6 D:3216
3rd Qu.: 26.00 3rd Qu.: -73.0 3rd Qu.: 737.0 3rd Qu.: 653.0 E:3607
Max. : 291.00 Max. : 672.0 Max. :1480.0 Max. :1090.0
For training the model, I did the following:
trainCtrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfModel <- train(classe ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = training, prox = TRUE)
The model worked. However, I was annoyed by a warning message, repeated up to 20 times: invalid mtry: reset to within valid range. A few Google searches did not turn up anything useful. In case it matters, there were no NA values in the dataset; they were removed in a prior step.
I also timed the call with system.time(); it took nearly two hours (about 7,044 seconds elapsed).
> system.time(train(classe ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = training, prox = TRUE))
user system elapsed
6478.113 302.281 7044.483
If you can help explain what this warning message means and why it appears, that would be super. I would also love to hear any comments on the long processing time.
Thank you!
The caret rf method uses the randomForest function from the randomForest package. If you set the mtry argument of randomForest to a value greater than the number of predictor variables, you'll get the warning you posted (for example, try rf = randomForest(mpg ~ ., mtry=15, data=mtcars)). The model still runs, but randomForest sets mtry to a lower, valid value.
The question is, why is train (or one of the functions it calls) feeding randomForest an mtry value that's too large? I'm not sure, but here's a guess: Setting preProcess="pca" reduces the number of features being fed to randomForest (relative to the number of features in the raw data), because the least important principal components are discarded to reduce the dimensionality of the feature set. However, when doing cross-validation, it's possible that train nevertheless sets the maximum mtry value for randomForest based on the larger number of features in the raw data, rather than based on the pre-processed data set that's actually fed to randomForest. Circumstantial evidence for this is that the warning goes away if you remove the preProcess="pca" argument, but I didn't check any further than that.
Reproducible code showing that the warning goes away without pca:
library(caret)
trainCtrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
# With PCA pre-processing: the "invalid mtry" warnings appear
rfModel <- train(mpg ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = mtcars, prox = TRUE)
# Without PCA pre-processing: the warnings go away
rfModel <- train(mpg ~., method = "rf", trControl = trainCtrl, data = mtcars, prox = TRUE)
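To make the guess above more concrete, here is a rough base R sketch of how PCA can shrink the number of features. It does not call caret; it assumes that caret's PCA pre-processing centers and scales the predictors and keeps enough components to explain 95% of the variance, and it mimics that with prcomp(). The names predictors, cum_var, n_raw and n_kept are invented for the example.

```r
# Predictors of the mtcars example (everything except the outcome mpg)
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]

# PCA on centered and scaled predictors
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the leading components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

n_raw <- ncol(predictors)             # features before PCA
n_kept <- which(cum_var >= 0.95)[1]   # components needed for 95% of variance

cat("raw predictors:", n_raw, "\n")
cat("components kept at the assumed 95% threshold:", n_kept, "\n")
```

If mtry candidates were chosen from the raw predictor count while randomForest only receives the retained components, some candidates could exceed the number of available features, which would be consistent with the warning.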
Just self-learning R at the moment and have gotten a little stuck. I have a dataset and I want to summarize (find mean, max, etc) but only selecting those cases that have a particular value on a certain variable.
Alternatively, I guess the same outcome could be done by summarizing only certain rows in the dataset (ie summarize only rows 1 thru 20).
Could someone lend a helping hand? Thanks so much
mydata<-mtcars
a. Find summary for rows 1 to 20
summary(mydata[1:20,])
mpg cyl disp hp drat wt qsec vs am
Min. :10.40 Min. :4.0 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.615 Min. :15.84 Min. :0.0 Min. :0.0
1st Qu.:16.10 1st Qu.:4.0 1st Qu.:145.2 1st Qu.: 94.5 1st Qu.:3.070 1st Qu.:2.811 1st Qu.:17.41 1st Qu.:0.0 1st Qu.:0.0
Median :18.95 Median :6.0 Median :196.3 Median :116.5 Median :3.460 Median :3.440 Median :18.15 Median :0.5 Median :0.0
Mean :20.13 Mean :6.2 Mean :233.9 Mean :136.2 Mean :3.545 Mean :3.398 Mean :18.44 Mean :0.5 Mean :0.3
3rd Qu.:22.80 3rd Qu.:8.0 3rd Qu.:296.9 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.743 3rd Qu.:19.45 3rd Qu.:1.0 3rd Qu.:1.0
Max. :33.90 Max. :8.0 Max. :472.0 Max. :245.0 Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0 Max. :1.0
gear carb
Min. :3.0 Min. :1.00
1st Qu.:3.0 1st Qu.:1.75
Median :3.5 Median :3.00
Mean :3.5 Mean :2.70
3rd Qu.:4.0 3rd Qu.:4.00
Max. :4.0 Max. :4.00
b. Find summary when value of cyl=4
summary(mydata[mydata$cyl==4,])
mpg cyl disp hp drat wt qsec vs am
Min. :21.40 Min. :4 Min. : 71.10 Min. : 52.00 Min. :3.690 Min. :1.513 Min. :16.70 Min. :0.0000 Min. :0.0000
1st Qu.:22.80 1st Qu.:4 1st Qu.: 78.85 1st Qu.: 65.50 1st Qu.:3.810 1st Qu.:1.885 1st Qu.:18.56 1st Qu.:1.0000 1st Qu.:0.5000
Median :26.00 Median :4 Median :108.00 Median : 91.00 Median :4.080 Median :2.200 Median :18.90 Median :1.0000 Median :1.0000
Mean :26.66 Mean :4 Mean :105.14 Mean : 82.64 Mean :4.071 Mean :2.286 Mean :19.14 Mean :0.9091 Mean :0.7273
3rd Qu.:30.40 3rd Qu.:4 3rd Qu.:120.65 3rd Qu.: 96.00 3rd Qu.:4.165 3rd Qu.:2.623 3rd Qu.:19.95 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :33.90 Max. :4 Max. :146.70 Max. :113.00 Max. :4.930 Max. :3.190 Max. :22.90 Max. :1.0000 Max. :1.0000
gear carb
Min. :3.000 Min. :1.000
1st Qu.:4.000 1st Qu.:1.000
Median :4.000 Median :2.000
Mean :4.091 Mean :1.545
3rd Qu.:4.000 3rd Qu.:2.000
Max. :5.000 Max. :2.000
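The same two selections can also be written with subset() and head(), which some people find easier to read than bracket indexing. This is a small sketch on mtcars; the names four_cyl, first_20 and mean_mpg_4 are made up for the example.

```r
# b. Rows where cyl equals 4, keeping only two columns of interest
four_cyl <- subset(mtcars, cyl == 4, select = c(mpg, wt))
print(summary(four_cyl))

# A single statistic for the selected cases
mean_mpg_4 <- mean(four_cyl$mpg)
print(mean_mpg_4)

# a. The first 20 rows
first_20 <- head(mtcars, 20)
print(summary(first_20$mpg))
```

subset() evaluates the condition inside the data frame, so column names can be used without the mydata$ prefix.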