Multiclass classification in H2O randomForest - r

I'm trying to use H2O randomForest for multiclass classification in R, but when I run the code, the randomForest always comes out as a regression model, despite the target variable being a factor. I am trying to predict 'Gradient', a factor with 5 levels, from one other factor, 'Period', with 4 levels, and 20 numerical predictors.
Any help would be appreciated. Code below....
>str(df)
Class 'H2OFrame' <environment: 0x000001f6b361abe0>
- attr(*, "op")= chr ":="
- attr(*, "eval")= logi TRUE
- attr(*, "id")= chr "RTMP_sid_aecc_35"
- attr(*, "nrow")= int 63878
- attr(*, "ncol")= int 22
- attr(*, "types")=List of 22
- attr(*, "data")='data.frame': 10 obs. of 22 variables:
..$ Gradient: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1
..$ Period : Factor w/ 4 levels "Dawn","Day","Dusk",..: 2 2 2 2 2 2 2 2 2 2
..$ AC1 : num 1792 1793 1790 1790 1797 ...
..$ AC2 : num 316 316 318 317 324 ...
..$ AC3 : num 972 972 974 975 979 ...
etc for remaining numerical predictors.
>splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234)
>train <- h2o.assign(splits[[1]], "train.hex")
>valid <- h2o.assign(splits[[2]], "valid.hex")
>test <- h2o.assign(splits[[3]], "test.hex")
>str(train)
Class 'H2OFrame' <environment: 0x000002266fac7d40>
- attr(*, "op")= chr "assign"
- attr(*, "id")= chr "train.hex"
- attr(*, "nrow")= int 38259
- attr(*, "ncol")= int 22
- attr(*, "types")=List of 22
- attr(*, "data")='data.frame': 10 obs. of 22 variables:
..$ Gradient: Factor w/ 5 levels "LB","LU","PB",..: 1 1 1 1 1 1 1 1 1 1
..$ Period : Factor w/ 4 levels "Dawn","Day","Dusk",..: 2 2 2 2 2 2 2 2 2 2
..$ AC1 : num 1793 1797 1796 1805 1803 ...
..$ AC2 : num 316 324 322 322 323 ...
..$ AC3 : num 972 979 979 988 986 ...
..$ AC4 : num 663 662 664 673 670 ...
..$ AC5 : num 828 825 824 824 825 ...
..$ AD1 : num 1.22 1.42 1.73 2.25 1.99 ...
..$ AD2 : num 1.1 1.27 1.35 1.38 1.38 ...
..$ AD3 : num 1.22 1.42 1.72 2.24 1.99 ...
..$ AD4 : num 1.87 1.53 2.07 2.03 1.78 ...
..$ AD5 : num 2.33 2.33 2.33 2.33 2.33 ...
..$ AE1 : num 0.877 0.849 0.794 0.636 0.72 ...
..$ AE2 : num 0.3687 0.2332 0.1369 0.0433 0.0546 ...
..$ AE3 : num 0.774 0.723 0.624 0.335 0.487 ...
..$ AE4 : num 0.574 0.697 0.44 0.477 0.605 ...
..$ AE5 : num 0.542 0.542 0.554 0.543 0.542 ...
..$ BI1 : num 53 71.9 64 75.4 74.6 ...
..$ BI2 : num 6.51 5.88 4.54 2.3 2.34 ...
..$ BI3 : num 22.2 26 21.5 27.9 28 ...
..$ BI4 : num 7.86 9.58 8.59 12.17 12.5 ...
..$ BI5 : num 11.3 17.9 16.4 18.1 17.5 ...
> train[1:5,] ## rows 1-5, all columns
Gradient Period AC1 AC2 AC3 AC4 AC5 AD1 AD2 AD3 AD4 AD5 AE1 AE2 AE3 AE4 AE5
1 LB Day 1792.97 316.4038 972.4288 663.2612 827.6400 1.217491 1.104860 1.217491 1.866627 2.332115 0.876794 0.368712 0.774123 0.574168 0.541993
2 LB Day 1796.78 324.3562 979.2218 662.2341 824.6436 1.421910 1.274373 1.421910 1.526506 2.331810 0.848660 0.233177 0.722544 0.696906 0.542409
3 LB Day 1796.09 321.9081 978.7464 664.1776 824.4437 1.726798 1.345030 1.721740 2.066543 2.326278 0.794230 0.136892 0.624107 0.440458 0.553766
4 LB Day 1805.14 322.0390 987.9472 673.2841 824.3146 2.248474 1.381644 2.239061 2.028538 2.331881 0.636007 0.043267 0.334964 0.477149 0.542572
5 LB Day 1803.15 323.1540 985.6376 669.7603 824.6003 1.992025 1.380468 1.992004 1.782532 2.331971 0.720153 0.054578 0.486951 0.604876 0.542420
BI1 BI2 BI3 BI4 BI5
1 53.03567 6.506536 22.23446 7.862767 11.32708
2 71.94775 5.879407 26.04130 9.579798 17.94337
3 63.98763 4.535041 21.50727 8.590985 16.38780
4 75.38319 2.301110 27.89600 12.165991 18.06316
5 74.60517 2.342853 28.02568 12.499122 17.52902
rf1 <- h2o.randomForest(
training_frame = train,
validation_frame = valid,
x=2:22,
y=1,
ntrees = 200,
stopping_rounds = 2,
score_each_iteration = TRUE,
seed = 1000000)
>perf <- h2o.performance(rf1, valid)
>h2o.mcc(perf)
Error in h2o.metric(object, thresholds, "absolute_mcc") :
No absolute_mcc for H2OMultinomialMetrics
h2o.accuracy(perf)
Error in h2o.metric(object, thresholds, "accuracy") :
No accuracy for H2OMultinomialMetrics
and an excerpt from the model summary:
H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.2499334
RMSE: (Extract with `h2o.rmse`) 0.4999334
Logloss: (Extract with `h2o.logloss`) 0.9987891
Mean Per-Class Error: 0.2941914
R^2: (Extract with `h2o.r2`) 0.8683096

mcc is specifically for binary classifiers; your factor has more than 2 levels.
You can tell that you have successfully fitted a multinomial classifier, rather than a regression, because the error message says "No absolute_mcc for H2OMultinomialMetrics".
According to the docs, h2o.accuracy() and h2o.logloss() are available for multinomial models.
UPDATE: ...well, the docs say h2o.accuracy() is available, but a quick check on the iris dataset gives me the same error you see; it must be related to that warning in the docs (which I didn't understand).
Anyway, h2o.confusionMatrix(rf1) is likely to be more useful; the overall error shown in the bottom right is 1 - accuracy. You can also use h2o.confusionMatrix(rf1, valid=TRUE) and h2o.confusionMatrix(rf1, test).
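To make the "bottom right is 1 - accuracy" relationship concrete, here is a minimal base R sketch that does not need an H2O cluster. The labels in `actual` and `predicted` are made up for illustration (only the level names are borrowed from the question); it shows how accuracy and the mean per-class error fall out of a multiclass confusion matrix.

```r
# Made-up multiclass labels, for illustration only (not the asker's data)
actual    <- factor(c("LB", "LB", "LU", "PB", "PB", "PB"),
                    levels = c("LB", "LU", "PB"))
predicted <- factor(c("LB", "LU", "LU", "PB", "PB", "LB"),
                    levels = levels(actual))

# Rows are the true classes, columns are the predicted classes
cm <- table(actual, predicted)

# Overall accuracy: correctly classified cases (the diagonal) over all cases
accuracy      <- sum(diag(cm)) / sum(cm)
overall_error <- 1 - accuracy

# Per-class error: share of each true class that was misclassified
per_class_error      <- 1 - diag(cm) / rowSums(cm)
mean_per_class_error <- mean(per_class_error)

print(cm)
print(accuracy)
print(mean_per_class_error)
```

The same arithmetic applies to the matrix printed by h2o.confusionMatrix(), which already reports the per-class and overall error for you.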

Related

'Incorrect number of dimensions' when running Zelig 'arima' on imputed data

I'm getting an error when I try to run an arima model with the zelig package. I'm using MI data with 20 imputations that were created with Amelia. Here is a short summary of my id and response variables:
$ imp20:'data.frame': 442 obs. of 50 variables:
..$ region : Factor w/ 4 levels "Central Africa",..: 3 3 3 3 3 3 3 3 3 3 ...
..$ subregionid : Factor w/ 4 levels "FC","FE","FS",..: 3 3 3 3 3 3 3 3 3 3 ...
..$ country : Factor w/ 34 levels "Angola","Benin",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ ISO2 : Factor w/ 34 levels "AO","BF","BJ",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ ISO3 : Factor w/ 34 levels "AGO","BEN","BFA",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ year : num [1:442] 2002 2003 2004 2005 2006 ...
..$ cap.lat : num [1:442] -8.5 -8.5 -8.5 -8.5 -8.5 -8.5 -8.5 -8.5 -8.5 -8.5 ...
..$ cap.long : num [1:442] 13.2 13.2 13.2 13.2 13.2 ...
..$ NGDP_RPCH : num [1:442] 14.53 5.25 10.88 18.26 20.73 ...
..$ NGDPD : num [1:442] 3.18 3.31 3.38 3.44 3.48 ...
..$ NGDPDPC : num [1:442] 2.68 2.69 2.72 2.75 2.78 ...
..$ NGSD_NGDP : num [1:442] 10.62 7.77 12.63 26.98 40.94
...
..$ PIKE.regional : num [1:442] 0.225 0.295 0.287 0.358 0.357 ...
..$ Definite.Probable : num [1:442] 36 36 36 36 36.1 ...
..$ Elephant.range : num [1:442] 406006 433613 511662 456046 459418 ...
..$ Change.by.year : num [1:442] 0.000463 0.000463 0.000463 0.000463 0.000463 ...
..$ Diff.from.expected : num [1:442] -0.0415 -0.0415 -0.0415 -0.0415 -0.0415 ...
Diff.from.expected is my response variable. And here is the code that I've run along with the error I'm getting.
z1 <- zarima$new()
> z1$zelig(Diff.from.expected~GNI, order=c(1,0,1), model="arima",
+ data = a.coVarsTrans.more, ts="year", cs="country")
Error in data[, cs] : incorrect number of dimensions
So it appears to me that there is an issue with the cs='country' call, but I'm not sure what the issue is. I'm planning to add more independent variables, but want to make sure that a basic model works first, which clearly it doesn't.
Here is the link to my saved Amelia .Rdata file.
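One possible cause, offered as an assumption rather than a confirmed diagnosis: the str() output above shows `$ imp20:'data.frame'`, which suggests the object holds a list of imputed data frames. Indexing a plain list with `[, cs]` produces exactly this error in base R. The sketch below uses a made-up list called `imputations` to reproduce the message; it does not call Zelig or Amelia.

```r
# Hypothetical stand-in for a list of imputed data sets (names are made up)
imputations <- list(
  imp1 = data.frame(country = c("Angola", "Benin"), year = c(2002, 2003)),
  imp2 = data.frame(country = c("Angola", "Benin"), year = c(2002, 2003))
)

# Two-dimensional indexing on a list fails, because a list has no dim attribute
err <- tryCatch(imputations[, "country"],
                error = function(e) conditionMessage(e))
print(err)

# The same indexing works on a single data frame taken from the list
one_country_column <- imputations$imp1[, "country"]
print(one_country_column)
```

If this is what is happening, checking class(a.coVarsTrans.more) before the zelig call would show whether the object is a data frame or a list.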

How to standardize a data frame which contains both numeric and factor variables

My data frame, my.data, contains both numeric and factor variables. I want to standardise just the numeric variables in this data frame.
> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Could the standardising work by doing this? I want to standardise the columns 8,9,10,11 and 12 but I think I have the wrong code.
mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))
Thanks in advance
Here is one option to standardize
mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x)
You can use the dplyr package to do this:
mydata2%>%mutate_if(is.numeric,scale)
Here are some options to consider, although it is answered late:
# Working environment and Memory management
rm(list = ls(all.names = TRUE))
gc()
memory.limit(size = 64935)
# Set working directory
setwd("path")
# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
"Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
"Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
"Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
"Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
"Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91))
Let us check the structure of df:
str(df)
'data.frame': 10 obs. of 6 variables:
$ Age : num 21 19 25 34 45 63 39 28 50 39
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num 2138 1516 2213 2500 2660 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num 172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num 60 70 88 48 71 51 65 44 53 91
We see that Age, Salary, Height and Weight are numeric and Name and Gender are categorical (factor variables).
Let us scale just the numeric variables using only base R:
1) Option: (slight modification of what akrun has proposed here)
start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
(x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1
Time difference of 0.02717805 secs
str(df1)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
2) Option: (akrun's approach)
start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2
Time difference of 0.02599907 secs
str(df2)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
3) Option:
start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3
str(df3)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
4) Option (using tidyverse and invoking dplyr):
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4
Time difference of 0.012043 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
Which option to choose depends on the output structure you need and on speed. Note that options 3 and 4 store each scaled variable as a one-column matrix. If your data is unbalanced and you want to balance it before classification, this matrix structure of the scaled numeric variables (Age, Salary, Height and Weight) will cause problems. For example:
str(df4$Age)
num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
- attr(*, "scaled:center")= num 36.3
- attr(*, "scaled:scale")= num 13.8
The ROSE package (which balances data), for example, only accepts int, factor and num columns, so it will throw an error.
To avoid this issue, the scaled numeric variables can be saved as plain vectors instead of one-column matrices:
library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, ~scale (.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4
with
Time difference of 0.01400399 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
$ Age : num -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
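The same "scale, then drop the matrix structure" idea can also be written in base R without dplyr. This is a minimal sketch on a small made-up data frame (`toy`, `num_cols` and `scaled` are illustrative names, not from the answers above):

```r
# Small made-up data frame with numeric and factor columns
toy <- data.frame(
  Age    = c(21, 19, 25, 34),
  Gender = factor(c("Female", "Female", "Male", "Female")),
  Height = c(172, 166, 191, 169)
)

# Logical index of the numeric columns
num_cols <- vapply(toy, is.numeric, logical(1))

# Scale only the numeric columns; as.vector() drops the one-column matrix
# structure and the "scaled:center" / "scaled:scale" attributes
scaled <- toy
scaled[num_cols] <- lapply(toy[num_cols], function(x) as.vector(scale(x)))

str(scaled)
```

Because the factor columns are never touched, this also works when the numeric columns are scattered across the data frame.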

convert list with unequal vector length to data frame by strata in R

I have the output from a coxph function, which is estimated by strata. I would like to transform this output from a list into a data frame. The code I ran for coxph is below:
k <- coxph(Surv(cum.goodp, dlq.next) ~ rpc.length + cluster(itemcode) + strata(sector), data = nr.sample)
m <- summary(survfit(k))
There are twenty different strata used to estimate the coxph. Here is the structure of the list
List of 16
$ n : int [1:20] 870 843 2278 603 6687 8618 15155 920 2598 654 ...
$ time : num [1:870] 1 2 3 4 5 6 7 8 9 10 ...
$ n.risk : num [1:870] 870 592 448 361 320 286 232 214 196 186 ...
$ n.event : num [1:870] 246 126 77 34 33 25 18 18 8 6 ...
$ n.censor : num [1:870] 32 18 10 7 1 29 0 0 2 0 ...
$ strata : Factor w/ 20 levels "sector=11","sector=21",..: 1 1 1 1 1 1 1 1 1 1 ...
$ surv : num [1:870] 0.725 0.571 0.471 0.425 0.379 ...
$ type : chr "right"
$ cumhaz : num [1:870] 0.322 0.561 0.754 0.856 0.971 ...
$ std.err : num [1:870] 0.015 0.017 0.0174 0.0174 0.0173 ...
$ upper : num [1:870] 0.755 0.605 0.506 0.46 0.414 ...
$ lower : num [1:870] 0.696 0.538 0.438 0.392 0.347 ...
$ conf.type: chr "log"
$ conf.int : num 0.95
$ call : language survfit(formula = k)
$ table : num [1:20, 1:7] 870 843 2278 603 6687 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20] "sector=11" "sector=21" "sector=22" "sector=23" ...
.. ..$ : chr [1:7] "records" "n.max" "n.start" "events" ...
- attr(*, "class")= chr "summary.survfit"
I have done this before, but without strata. When I did not have strata I used the following approach:
col <- lapply(c(1 : 7), function(x) m[x])
tbl <- do.call(data.frame, col)
However, when I try that approach here, I get the familiar error:
cannot coerce class "c("survfit.cox", "survfit")" to a data.frame
All columns have the same name, but they are of different length. If possible, I would like to add a column to the final data frame that contains the particular strata that the results are for. Is there a way to do this? It doesn't have to be in base R. Any help would be much appreciated. Thanks so much.
This problem can be solved with the tidy function in the broom package. For the example above, the code is:
library(broom)
n <- survfit(k)
df <- tidy(n)
The tidy function produces a data frame with a "strata" column identifying the stratum each row belongs to. It does not, however, provide the median and mean survival times, though they can be estimated from the data frame df if one were so inclined. If the survfit object has multiple strata, glance() cannot provide the median or mean either.
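If adding broom is not an option, a similar table can be assembled by hand from the equal-length components of the summary object. The sketch below is an assumption-laden illustration: it uses the `lung` data set that ships with the survival package instead of the asker's data, and `fit`, `s`, `cols` and `tbl` are made-up names. Components such as `n`, `table` and `call` are left out because their lengths differ.

```r
library(survival)

# Stratified Cox model on the lung data shipped with survival (illustrative)
fit <- coxph(Surv(time, status) ~ age + strata(sex), data = lung)
s   <- summary(survfit(fit))

# Keep only the per-time-point components, which all share the same length
cols <- c("time", "n.risk", "n.event", "surv", "lower", "upper", "strata")
tbl  <- as.data.frame(unclass(s)[cols])

head(tbl)
table(tbl$strata)
```

The `strata` column then labels every row with the stratum it came from, which is the extra column asked for in the question.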

r quantregForest() error: NA's produced by integer overflow lead to an invalid argument in the rep() function

I am trying to use the quantregForest() function from the quantregForest package (which is built on the randomForest package.)
I tried to train the model using:
qrf_model <- quantregForest(x=Xtrain, y=Ytrain, importance=TRUE, ntree=10)
and I get the following error message (even after reducing the number of trees from 100 to 10):
Error in rep(0, nobs * nobs * npred) : invalid 'times' argument
plus a warning:
In nobs * nobs * npred : NAs produced by integer overflow
The data frame Xtrain has 38 numeric variables, and it looks like this:
> str(Xtrain)
'data.frame': 31132 obs. of 38 variables:
$ X1 : num 301306 6431 2293 1264 32477 ...
$ X2 : num 173.2 143.5 43.4 180.6 1006.2 ...
$ X3 : num 0.1598 0.1615 0.1336 0.0953 0.1988 ...
$ X4 : num 0.662 0.25 0.71 0.709 0.671 ...
$ X5 : num 0.05873 0.0142 0 0.00154 0.09517 ...
$ X6 : num 0.01598 0 0.0023 0.00154 0.01634 ...
$ X7 : num 0.07984 0.03001 0.00845 0.04304 0.09326 ...
$ X8 : num 0.92 0.97 0.992 0.957 0.907 ...
$ X9 : num 105208 1842 830 504 11553 ...
$ X10: num 69974 1212 611 352 7080 ...
$ X11: num 0.505 0.422 0.55 0.553 0.474 ...
$ X12: num 0.488 0.401 0.536 0.541 0.45 ...
$ X13: num 0.333 0.419 0.257 0.282 0.359 ...
$ X14: num 0.187 0.234 0.172 0.207 0.234 ...
$ X15: num 0.369 0.216 0.483 0.412 0.357 ...
$ X16: num 0.0765 0.1205 0.0262 0.054 0.0624 ...
$ X17: num 2954 77 12 10 739 ...
$ X18: num 2770 43 9 21 433 119 177 122 20 17 ...
$ X19: num 3167 72 49 25 622 ...
$ X20: num 3541 57 14 24 656 ...
$ X21: num 3361 82 0 33 514 ...
$ X22: num 3929 27 10 48 682 ...
$ X23: num 3695 73 61 15 643 ...
$ X24: num 4781 52 5 14 680 ...
$ X25: num 3679 103 5 23 404 ...
$ X26: num 7716 120 55 40 895 ...
$ X27: num 11043 195 72 48 1280 ...
$ X28: num 16080 332 160 83 1684 ...
$ X29: num 12312 125 124 62 1015 ...
$ X30: num 8218 99 36 22 577 ...
$ X31: num 9957 223 146 26 532 ...
$ X32: num 0.751 0.444 0.621 0.527 0.682 ...
$ X33: num 0.01873 0 0 0.00317 0.02112 ...
$ X34: num 0.563 0.372 0.571 0.626 0.323 ...
$ X35: num 0.366 0.39 0.156 0.248 0.549 ...
$ X36: num 0.435 0.643 0.374 0.505 0.36 ...
$ X37: num 0.526 0.31 0.577 0.441 0.591 ...
$ X38: num 0.00163 0 0 0 0.00155 0.00103 0 0 0 0 ...
And the response variable Ytrain looks like this:
> str(Ytrain)
num [1:31132] 2605 56 8 16 214 ...
I checked that neither Xtrain nor Ytrain contains any NAs with:
> sum(is.na(Xtrain))
[1] 0
> sum(is.na(Ytrain))
[1] 0
I am assuming that the error message about the invalid "times" argument in rep(0, nobs * nobs * npred) comes from the NA value assigned to the product nobs * nobs * npred due to an integer overflow.
What I do not understand is where the integer overflow comes from. None of my variables are of the integer class so what am I missing?
I examined the source code for the quantregForest() function and the source code for the method predict.imp called by the quantregForest() function.
I found that nobs stands for the number of observations; in the case above, nobs = length(Ytrain) = 31132. The variable npred stands for the number of predictors and is given by npred = ncol(Xtrain) = 38. Both npred and nobs are of class integer, and
nobs*nobs*npred = 31132*31132*38 = 36829654112.
And herein lies the root cause of the error, since:
nobs*nobs*npred = 36829654112 > 2147483647,
where 2147483647 is the maximal integer value in R. Hence the integer overflow warning and the replacement of the product nobs*nobs*npred with an NA.
The bottom line is that, to avoid the error, I will have to either use considerably fewer observations when training the model or set importance=FALSE in the quantregForest() call. The computations required to find variable importance are very memory intensive, even when using fewer than 10000 observations.
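The overflow can be reproduced in isolation with the sizes from the question. This is a small base R sketch (the variable names `prod_int`, `prod_dbl` and `gib_needed` are illustrative); it only mimics the product inside rep(0, nobs * nobs * npred) and does not call quantregForest.

```r
# Sizes taken from the question, stored as integers like length() and ncol() return
nobs  <- 31132L
npred <- 38L

# Integer arithmetic overflows and yields NA (with a warning, suppressed here)
prod_int <- suppressWarnings(nobs * nobs * npred)
print(prod_int)

# The same product in double precision is representable
prod_dbl <- as.numeric(nobs) * as.numeric(nobs) * as.numeric(npred)
print(prod_dbl)
print(prod_dbl > .Machine$integer.max)

# Rough size of a double vector of that length, in GiB (8 bytes per element)
gib_needed <- prod_dbl * 8 / 2^30
print(gib_needed)
```

Even if the product did not overflow, a vector of that length would need far more memory than a typical machine has, which is consistent with the advice to reduce the number of observations or skip the importance computation.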

Subsetting by rows to do a correlation

I created a data frame from another dataset with 332 IDs. I split the data frame by ID and would like to count the rows for each ID and then run a correlation function. Can someone tell me how to count the rows of each ID so that I can run a correlation on these individual groups?
jlhoward, your suggestion to add the "table(dat1$ID)" command worked. My other problem is that the function will not stop running:
corr<-function(directory,threshold=)
####### file location path#####
for(i in 1:332){dat<-rbind(dat,read.csv(specdata1[i]))
dat1<-dat[complete.cases(dat),]
dat2<-(split(dat1,dat1$ID))
list(dat2)
dat3<-table(dat1$ID)
for (i in dat1>=threshold){
x<-dat1$sulfate
y<-dat1$nitrate
correlations<-cor(x,y,use="pairwise.complete.obs",method="pearson")
corrs_output<-c(corrs_output,correlations)
}
I'm trying to correlate "sulfate" and "nitrate" for each monitor ID that meets a threshold. I created a list that has all the complete cases per monitor ID. I need the function to compute the correlation between "sulfate" and "nitrate" for every per-ID set whose number of complete cases is >= the threshold argument of the function. Below are the head and tail of the structure of the data.frame/list for each data set within the main data set "specdata1".
Head of the entire data.frame/list of specdata1 complete cases for correlation:
head(str(dat2,1))
List of 323
$ 1 :'data.frame': 117 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 279 285 291 297 303 315 321 327 333 339 ...
..$ sulfate: num [1:117] 7.21 5.99 4.68 3.47 2.42 1.43 2.76 3.41 1.3 3.15 ...
..$ nitrate: num [1:117] 0.651 0.428 1.04 0.363 0.507 0.474 0.425 0.964 0.491 0.669 ...
..$ ID : int [1:117] 1 1 1 1 1 1 1 1 1 1 ...
tail of entire data.frame/list for all complete cases of specdata1
tail(str(dat2,1))
$ 99 :'data.frame': 479 obs. of 4 variables:
..$ Date : Factor w/ 4018 levels "2003-01-01","2003-01-02",..: 1774 1780 1786 1804 1810 1816 1822 1840 1852 1858 ...
..$ sulfate: num [1:479] 1.51 8.2 1.48 4.75 3.47 1.19 1.77 2.27 2.06 2.11 ...
..$ nitrate: num [1:479] 0.725 1.64 1.01 6.81 0.751 1.69 2.08 0.996 0.817 0.488 ...
..$ ID : int [1:479] 99 99 99 99 99 99 99 99 99 99 ...
[list output truncated]
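A minimal sketch of the per-ID correlation described above, under stated assumptions: it takes an already-loaded data frame with `ID`, `sulfate` and `nitrate` columns instead of reading the CSV files, and `corr_by_id` and `dat` are made-up names. It splits the complete cases by ID once, keeps the groups with at least `threshold` complete rows, and computes one correlation per kept group, so there is no nested loop that could run indefinitely.

```r
# Correlate sulfate and nitrate within each ID that has at least `threshold`
# complete rows. `dat` is assumed to hold columns ID, sulfate and nitrate.
corr_by_id <- function(dat, threshold = 0) {
  complete <- dat[complete.cases(dat), ]
  groups   <- split(complete, complete$ID)

  # Number of complete rows per ID (same counts as table(complete$ID))
  counts <- vapply(groups, nrow, integer(1))
  keep   <- counts >= threshold

  # One Pearson correlation per kept group, named by ID
  vapply(groups[keep],
         function(g) cor(g$sulfate, g$nitrate, method = "pearson"),
         numeric(1))
}

# Made-up data: ID 1 has five complete rows, ID 2 has only one
dat <- data.frame(
  ID      = c(rep(1L, 5), rep(2L, 3)),
  sulfate = c(1, 2, 3, 4, 5, 1, NA, 3),
  nitrate = c(2, 4, 6, 8, 10, 5, 6, NA)
)

result <- corr_by_id(dat, threshold = 3)
print(result)
```

With the real data, the combined data frame would be built once (for example by reading all files and binding them) before calling the function, rather than inside the loop.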

Resources