This is a very simplified version of what I'm trying to do. In brief, I create some matrices that I want to perform the same action on, like in a loop. In this example here I want to print the summary for each matrix, but I don't know how to refer to the matrices in a for loop. Any help is much appreciated:
for (i in 1:3){
x <- paste0('df', i)
assign(x, matrix(sample(1:10, 15, replace = TRUE), ncol = 3))
print(summary(eval(x)))
}
Returns (it is evaluating 'df3' as a string):
Length Class Mode
1 character character
Length Class Mode
1 character character
Length Class Mode
1 character character
How do I get it to return the following?
V1 V2 V3
Min. : 1.0 Min. :3.0 Min. : 5
1st Qu.: 5.0 1st Qu.:3.0 1st Qu.: 5
Median : 6.0 Median :4.0 Median : 7
Mean : 5.6 Mean :5.2 Mean : 7
3rd Qu.: 6.0 3rd Qu.:7.0 3rd Qu.: 8
Max. :10.0 Max. :9.0 Max. :10
V1 V2 V3
Min. :2 Min. :1.0 Min. : 4.0
1st Qu.:4 1st Qu.:3.0 1st Qu.: 4.0
Median :7 Median :3.0 Median : 6.0
Mean :6 Mean :3.4 Mean : 6.6
3rd Qu.:8 3rd Qu.:4.0 3rd Qu.: 9.0
Max. :9 Max. :6.0 Max. :10.0
V1 V2 V3
Min. :1.0 Min. : 5.0 Min. :1.0
1st Qu.:2.0 1st Qu.: 6.0 1st Qu.:2.0
Median :6.0 Median : 6.0 Median :3.0
Mean :5.2 Mean : 6.8 Mean :2.4
3rd Qu.:8.0 3rd Qu.: 7.0 3rd Qu.:3.0
Max. :9.0 Max. :10.0 Max. :3.0
Don’t use distinct variables and paste their names – put your objects into a list:
x = Map(function (i) matrix(sample(1:10, 15, replace = TRUE), ncol = 3), 1 : 3)
Then performing a common operation on them is trivial as well:
Map(summary, x)
Map applies a function to each element of a list. It works similarly to lapply and is a thin wrapper around mapply that does not simplify the result.
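For comparison, here is a minimal sketch of the same list-based idea written with lapply. The names mats and summaries, the df1..df3 labels, and the seed are illustrative assumptions, not part of the original answer:

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Build three 5 x 3 matrices and keep them together in one list.
mats <- lapply(1:3, function(i) matrix(sample(1:10, 15, replace = TRUE), ncol = 3))
names(mats) <- paste0("df", 1:3)  # optional labels, mirroring the question's names

# Apply the same operation to every matrix.
summaries <- lapply(mats, summary)

# Individual results stay reachable by name.
print(summaries[["df3"]])
```

Naming the list elements keeps the convenience of the df1, df2, df3 labels without creating separate variables.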
I think you can use eval(as.name("df3")) or get("df3"); inside the loop that would be get(x).
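If you do want to keep the loop from the question, a sketch of the fix might look like this. The reason eval(x) fails is that x holds a character string, and evaluating a string constant just returns the string; get() instead looks up the object with that name:

```r
for (i in 1:3) {
  x <- paste0("df", i)
  assign(x, matrix(sample(1:10, 15, replace = TRUE), ncol = 3))
  # get() retrieves the object whose name is stored in x.
  print(summary(get(x)))
}

# eval() on a string returns the string; eval() on a symbol returns the object.
identical(eval("df1"), "df1")
identical(eval(as.name("df1")), df1)
```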
I have voter and party-data from several datasets that I further separated into different dataframes and lists to make it comparable. I could just use the summary command on each of them individually then compare manually, but I was wondering whether there was a way to get them all together and into one table?
Here's a sample of what I have:
> summary(eco$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3 4 4 4 4 5
> summary(ecovoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.000 4.000 3.744 5.000 10.000 26
> summary(lef$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 3.000 3.692 4.000 7.000
> summary(lefvoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 2.000 3.000 3.612 5.000 10.000 332
> summary(soc$rilenew)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 4.000 4.000 4.143 5.000 6.000
> summary(socvoters)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.000 4.000 3.674 5.000 10.000 346
Is there a way I can summarize these lists (ecovoters, lefvoters, socvoters etc) and the dataframe variables (eco$rilenew, lef$rilenew, soc$rilenew etc) together and have them in one table?
You could put everything into a list and summarize with a small custom function.
L <- list(eco$rilenew, ecovoters, lef$rilenew,
lefvoters, soc$rilenew, socvoters)
t(sapply(L, function(x) {
s <- summary(x)
length(s) <- 7
names(s)[7] <- "NA's"
s[7] <- ifelse(!any(is.na(x)), 0, s[7])
return(s)
}))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
[1,] 0.9820673 3.3320662 3.958665 3.949512 4.625109 7.229069 0
[2,] -4.8259384 0.5028293 3.220546 3.301452 6.229384 9.585749 26
[3,] -0.3717391 2.3280366 3.009360 3.013908 3.702156 6.584659 0
[4,] -2.6569493 1.6674330 3.069440 3.015325 4.281100 8.808432 332
[5,] -2.3625651 2.4964361 3.886673 3.912009 5.327401 10.349040 0
[6,] -2.4719404 1.3635785 2.790523 2.854812 4.154936 8.491347 346
Data
set.seed(42)
eco <- data.frame(rilenew=rnorm(800, 4, 1))
ecovoters <- rnorm(75, 4, 4)
ecovoters[sample(length(ecovoters), 26)] <- NA
lef <- data.frame(rilenew=rnorm(900, 3, 1))
lefvoters <- rnorm(700, 3, 2)
lefvoters[sample(length(lefvoters), 332)] <- NA
soc <- data.frame(rilenew=rnorm(900, 4, 2))
socvoters <- rnorm(700, 3, 2)
socvoters[sample(length(socvoters), 346)] <- NA
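A variation on the same list idea, sketched under a few assumptions: summarise_one is a hypothetical helper (not from the answer above) that computes the NA count directly instead of padding the summary() output, and the toy data below only stands in for the real objects. Naming the list elements also labels the rows of the resulting table:

```r
# Hypothetical helper: six summary statistics plus an explicit NA count.
summarise_one <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  c(Min = q[1], Q1 = q[2], Median = q[3], Mean = mean(x, na.rm = TRUE),
    Q3 = q[4], Max = q[5], NAs = sum(is.na(x)))
}

set.seed(42)  # assumption: toy data in place of eco$rilenew, ecovoters, ...
L <- list(eco = rnorm(20, 4, 1),
          ecovoters = c(rnorm(15, 4, 4), NA, NA))

# One row per list element, one column per statistic.
tab <- do.call(rbind, lapply(L, summarise_one))
print(tab)
```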
You can use map from purrr (part of the tidyverse) to get a list of summaries. If you want the result as a data frame, plyr::ldply can convert the list:
ll = map(L, summary)
ll
plyr::ldply(ll, rbind)
> ll = map(L, summary)
> ll
[[1]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9821 3.3321 3.9587 3.9495 4.6251 7.2291
[[2]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-4.331 1.347 3.726 3.793 6.653 16.845 26
[[3]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3717 2.3360 3.0125 3.0174 3.7022 6.5847
[[4]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-2.657 1.795 3.039 3.013 4.395 9.942 332
[[5]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.363 2.503 3.909 3.920 5.327 10.349
[[6]]
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-3.278 1.449 2.732 2.761 4.062 8.171 346
> plyr::ldply(ll, rbind)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 0.9820673 3.332066 3.958665 3.949512 4.625109 7.229069 NA
2 -4.3312551 1.346532 3.725708 3.793431 6.652917 16.844796 26
3 -0.3717391 2.335959 3.012507 3.017438 3.702156 6.584659 NA
4 -2.6569493 1.795307 3.038905 3.012928 4.395338 9.941819 332
5 -2.3625651 2.503324 3.908727 3.920050 5.327401 10.349040 NA
6 -3.2779863 1.448814 2.732515 2.760569 4.061854 8.170793 346
I am working on a Coursera Machine Learning project. The goal is to build a predictive model for the following dataset.
> summary(training)
roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x
Min. :-28.90 Min. :-55.8000 Min. :-180.00 Min. : 0.00 Min. :-1.040000
1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00 1st Qu.:-0.030000
Median :113.00 Median : 5.2800 Median : -13.00 Median :17.00 Median : 0.030000
Mean : 64.41 Mean : 0.3053 Mean : -11.21 Mean :11.31 Mean :-0.005592
3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00 3rd Qu.: 0.110000
Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00 Max. : 2.220000
gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
Min. :-0.64000 Min. :-1.4600 Min. :-120.000 Min. :-69.00 Min. :-275.00 Min. :-52.0
1st Qu.: 0.00000 1st Qu.:-0.2000 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00 1st Qu.: 9.0
Median : 0.02000 Median :-0.1000 Median : -15.000 Median : 35.00 Median :-152.00 Median : 35.0
Mean : 0.03959 Mean :-0.1305 Mean : -5.595 Mean : 30.15 Mean : -72.59 Mean : 55.6
3rd Qu.: 0.11000 3rd Qu.:-0.0200 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00 3rd Qu.: 59.0
Max. : 0.64000 Max. : 1.6200 Max. : 85.000 Max. :164.00 Max. : 105.00 Max. :485.0
magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
Min. :354.0 Min. :-623.0 Min. :-180.00 Min. :-88.800 Min. :-180.0000 Min. : 1.00
1st Qu.:581.0 1st Qu.:-375.0 1st Qu.: -31.77 1st Qu.:-25.900 1st Qu.: -43.1000 1st Qu.:17.00
Median :601.0 Median :-320.0 Median : 0.00 Median : 0.000 Median : 0.0000 Median :27.00
Mean :593.7 Mean :-345.5 Mean : 17.83 Mean : -4.612 Mean : -0.6188 Mean :25.51
3rd Qu.:610.0 3rd Qu.:-306.0 3rd Qu.: 77.30 3rd Qu.: 11.200 3rd Qu.: 45.8750 3rd Qu.:33.00
Max. :673.0 Max. : 293.0 Max. : 180.00 Max. : 88.500 Max. : 180.0000 Max. :66.00
gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y
Min. :-6.37000 Min. :-3.4400 Min. :-2.3300 Min. :-404.00 Min. :-318.0
1st Qu.:-1.33000 1st Qu.:-0.8000 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.0
Median : 0.08000 Median :-0.2400 Median : 0.2300 Median : -44.00 Median : 14.0
Mean : 0.04277 Mean :-0.2571 Mean : 0.2695 Mean : -60.24 Mean : 32.6
3rd Qu.: 1.57000 3rd Qu.: 0.1400 3rd Qu.: 0.7200 3rd Qu.: 84.00 3rd Qu.: 139.0
Max. : 4.87000 Max. : 2.8400 Max. : 3.0200 Max. : 437.00 Max. : 308.0
accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
Min. :-636.00 Min. :-584.0 Min. :-392.0 Min. :-597.0 Min. :-153.71 Min. :-149.59
1st Qu.:-143.00 1st Qu.:-300.0 1st Qu.: -9.0 1st Qu.: 131.2 1st Qu.: -18.49 1st Qu.: -40.89
Median : -47.00 Median : 289.0 Median : 202.0 Median : 444.0 Median : 48.17 Median : -20.96
Mean : -71.25 Mean : 191.7 Mean : 156.6 Mean : 306.5 Mean : 23.84 Mean : -10.78
3rd Qu.: 23.00 3rd Qu.: 637.0 3rd Qu.: 323.0 3rd Qu.: 545.0 3rd Qu.: 67.61 3rd Qu.: 17.50
Max. : 292.00 Max. : 782.0 Max. : 583.0 Max. : 694.0 Max. : 153.55 Max. : 149.40
yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
Min. :-150.871 Min. : 0.00 Min. :-204.0000 Min. :-2.10000 Min. : -2.380
1st Qu.: -77.644 1st Qu.: 4.00 1st Qu.: -0.0300 1st Qu.:-0.14000 1st Qu.: -0.310
Median : -3.324 Median :10.00 Median : 0.1300 Median : 0.03000 Median : -0.130
Mean : 1.674 Mean :13.72 Mean : 0.1611 Mean : 0.04606 Mean : -0.129
3rd Qu.: 79.643 3rd Qu.:19.00 3rd Qu.: 0.3500 3rd Qu.: 0.21000 3rd Qu.: 0.030
Max. : 154.952 Max. :58.00 Max. : 2.2200 Max. :52.00000 Max. :317.000
accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
Min. :-419.00 Min. :-189.00 Min. :-334.00 Min. :-643.0 Min. :-3600
1st Qu.: -50.00 1st Qu.: -8.00 1st Qu.:-142.00 1st Qu.:-535.0 1st Qu.: 231
Median : -8.00 Median : 41.50 Median : -1.00 Median :-479.0 Median : 311
Mean : -28.62 Mean : 52.63 Mean : -38.32 Mean :-328.5 Mean : 221
3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 38.00 3rd Qu.:-304.0 3rd Qu.: 390
Max. : 235.00 Max. : 315.00 Max. : 318.00 Max. : 592.0 Max. : 633
magnet_dumbbell_z roll_forearm pitch_forearm yaw_forearm total_accel_forearm
Min. :-262.00 Min. :-180.0000 Min. :-72.50 Min. :-180.00 Min. : 0.00
1st Qu.: -45.00 1st Qu.: -0.7375 1st Qu.: 0.00 1st Qu.: -68.60 1st Qu.: 29.00
Median : 13.00 Median : 21.7000 Median : 9.24 Median : 0.00 Median : 36.00
Mean : 46.05 Mean : 33.8265 Mean : 10.71 Mean : 19.21 Mean : 34.72
3rd Qu.: 95.00 3rd Qu.: 140.0000 3rd Qu.: 28.40 3rd Qu.: 110.00 3rd Qu.: 41.00
Max. : 452.00 Max. : 180.0000 Max. : 89.80 Max. : 180.00 Max. :108.00
gyros_forearm_x gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
Min. :-22.000 Min. : -7.02000 Min. : -8.0900 Min. :-498.00 Min. :-632.0
1st Qu.: -0.220 1st Qu.: -1.46000 1st Qu.: -0.1800 1st Qu.:-178.00 1st Qu.: 57.0
Median : 0.050 Median : 0.03000 Median : 0.0800 Median : -57.00 Median : 201.0
Mean : 0.158 Mean : 0.07517 Mean : 0.1512 Mean : -61.65 Mean : 163.7
3rd Qu.: 0.560 3rd Qu.: 1.62000 3rd Qu.: 0.4900 3rd Qu.: 76.00 3rd Qu.: 312.0
Max. : 3.970 Max. :311.00000 Max. :231.0000 Max. : 477.00 Max. : 923.0
accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
Min. :-446.00 Min. :-1280.0 Min. :-896.0 Min. :-973.0 A:5580
1st Qu.:-182.00 1st Qu.: -616.0 1st Qu.: 2.0 1st Qu.: 191.0 B:3797
Median : -39.00 Median : -378.0 Median : 591.0 Median : 511.0 C:3422
Mean : -55.29 Mean : -312.6 Mean : 380.1 Mean : 393.6 D:3216
3rd Qu.: 26.00 3rd Qu.: -73.0 3rd Qu.: 737.0 3rd Qu.: 653.0 E:3607
Max. : 291.00 Max. : 672.0 Max. :1480.0 Max. :1090.0
For training the model, I did the following:
trainCtrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfModel <- train(classe ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = training, prox = TRUE)
The model worked. However, I was rather annoyed by a warning message, repeated up to 20 times: invalid mtry: reset to within valid range. A few Google searches did not turn up anything useful. Also, in case it matters, there were no NA values in the dataset; they were removed in a prior step.
I also timed the call with system.time(); it took far longer than an hour (roughly two hours elapsed):
> system.time(train(classe ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = training, prox = TRUE))
user system elapsed
6478.113 302.281 7044.483
If you can help me understand what this warning means and why it appears, that would be super. I would also love to hear any comments on the long processing time.
Thank you!
The caret rf method uses the randomForest function from the randomForest package. If you set the mtry argument of randomForest to a value greater than the number of predictor variables, you'll get the warning you posted (for example, try rf = randomForest(mpg ~ ., mtry=15, data=mtcars)). The model still runs, but randomForest sets mtry to a lower, valid value.
The question is, why is train (or one of the functions it calls) feeding randomForest an mtry value that's too large? I'm not sure, but here's a guess: Setting preProcess="pca" reduces the number of features being fed to randomForest (relative to the number of features in the raw data), because the least important principal components are discarded to reduce the dimensionality of the feature set. However, when doing cross-validation, it's possible that train nevertheless sets the maximum mtry value for randomForest based on the larger number of features in the raw data, rather than based on the pre-processed data set that's actually fed to randomForest. Circumstantial evidence for this is that the warning goes away if you remove the preProcess="pca" argument, but I didn't check any further than that.
Reproducible code showing that the warning goes away without pca:
library(caret)
trainCtrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
# With PCA pre-processing: the mtry warning appears
rfModel <- train(mpg ~., method = "rf", trControl = trainCtrl, preProcess = "pca", data = mtcars, prox = TRUE)
# Without PCA pre-processing: no warning
rfModel <- train(mpg ~., method = "rf", trControl = trainCtrl, data = mtcars, prox = TRUE)
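To illustrate the guess above that PCA leaves fewer features than the raw data has, here is a rough base-R sketch using prcomp rather than caret itself. The 0.95 cutoff is an assumption meant to mirror what I understand to be the default variance threshold of caret's PCA pre-processing; this does not reproduce what train does internally:

```r
# Raw predictors for the mtcars example (everything except the outcome mpg).
predictors <- mtcars[, names(mtcars) != "mpg"]

# Centre, scale and rotate, as a PCA pre-processing step typically would.
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the leading components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Assumed threshold: keep just enough components to reach 95% of the variance.
n_keep <- which(cum_var >= 0.95)[1]

cat("raw predictors:", ncol(predictors), " components kept:", n_keep, "\n")
```

If the candidate mtry values were derived from the raw predictor count, any value above the number of retained components would be out of range for randomForest, which would be consistent with the warning.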
Is there a way to return the value of the 3rd Qu. that comes up when you do the summary of a vector?
For example:
summary(data$attribute)
Returns:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0002012 0.0218800 0.0454300 0.0707100 0.0961500 0.4845000
You can also use quantile and specify the probability to be 0.75:
quantile(1:10, probs = 0.75)
# 75%
#7.75
If you want to remove the name attribute:
quantile(1:10, probs = 0.75, names = FALSE)
#7.75
You can access elements of the summary by index:
summary(1:10)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 3.25 5.50 5.50 7.75 10.00
summary(1:10)[5]
# 3rd Qu.
# 7.75
Or by name:
summary(1:10)["3rd Qu."]
# 3rd Qu.
# 7.75
We can use unname() to drop names:
unname(summary(1:10)[5])
# [1] 7.75
Using summary(var) gives me the following output:
PAY_BACK_ORG
Min. : -16.40
1st Qu.: 0.00
Median : 26.40
Mean : 34.37
3rd Qu.: 53.60
Max. :4033.40
I want it as a dataframe which will look like this:
Min -16.40
1st Qu 0.00
Median 26.40
Mean 34.37
3rd Qu 53.60
Max 4033.40
How can I get it into that form?
Like this?
var <- rnorm(100)
x <- summary(var)
data.frame(x=matrix(x),row.names=names(x))
## x
## Min. -2.68300
## 1st Qu. -0.70930
## Median -0.09732
## Mean -0.00809
## 3rd Qu. 0.71550
## Max. 2.58100
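A possible variant, in case you would rather have the statistic names as a regular column than as row names. The column names statistic and value are my own choice, and the seed is only there for reproducibility:

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible
var <- rnorm(100)
s <- summary(var)

# as.vector() drops the names and the summary class, leaving plain numbers.
df <- data.frame(statistic = names(s), value = as.vector(s))
print(df)
```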
I have a binary file of size 360 x 720 covering the globe. I wrote the code below to read the file and extract an area from it. When I use summary on the whole file, I get:
summary(a, na.rm=FALSE)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 1.00 3.00 4.15 7.00 20.00 200083
But when I use summary on the region (b) that I extracted, I get many columns (V1, V2, ...). That is not what I want: I should get one line, as for a, not one summary per column.
Here is the code:
X <- c(200:300)
Y <- c(150:190)
conne <- file("C:\\initial-WTD.bin", "rb")
a=readBin(conne, numeric(), size=4, n=360*720, signed=TRUE)
a[a == -9999] <- NA
y <- matrix(data=a,ncol=360,nrow=720)
image(t(t(y[X,Y])),ylim=c(1,0))
b = y[X,Y]
summary(b,na.rm=FALSE)
V1 V2 V3 V4 V5 V6 V7
Min. : NA Min. : NA Min. : NA Min. : NA Min. : 8 Min. : NA Min. :
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.:11 1st Qu.: NA 1st Qu.:
Median : NA Median : NA Median : NA Median : NA Median :14 Median : NA Median
Mean :NaN Mean :NaN Mean :NaN Mean :NaN Mean :14 Mean :NaN Mean
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:17 3rd Qu.: NA 3rd
Max. : NA Max. : NA Max. : NA Max. : NA Max. :20 Max. : NA Max.
NA's :101 NA's :101 NA's :101 NA's :101 NA's :99 NA's :101 NA's :
The problem is not in your indexing of a matrix, but some place prior to accessing it:
a <- matrix(1:100, 10, 10)
summary( a[1:3,1:3] )
V1 V2 V3
Min. :1.0 Min. :11.0 Min. :21.0
1st Qu.:1.5 1st Qu.:11.5 1st Qu.:21.5
Median :2.0 Median :12.0 Median :22.0
Mean :2.0 Mean :12.0 Mean :22.0
3rd Qu.:2.5 3rd Qu.:12.5 3rd Qu.:22.5
Max. :3.0 Max. :13.0 Max. :23.0
You managed to hit a few non-NA values (apparently only 2) but why are you doing this with such sparse data? I scaled this up to 100 columns (out of 1000) and still got the expected results.
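As for why a and b print differently: a is a plain vector, whereas b is a matrix, and summary() on a matrix reports one summary per column. A small sketch with made-up data (not the original file) shows how flattening the matrix first gives the single-line form:

```r
# Toy stand-in for the real grid, with a few missing values.
y <- matrix(1:100, nrow = 10, ncol = 10)
y[1:3, 1] <- NA

b <- y[1:5, 1:4]      # extracted region: still a matrix

summary(b)            # one summary per column (V1, V2, ...)

# Flatten to a vector to get a single summary for the whole region.
s <- summary(as.vector(b))
print(s)
```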