How to sample rows in the randomForest package

I have a dataset with 1 million rows and 100 columns. randomForest is quite slow for data this big, so I would like to train each tree on a subset of, say, 50,000 rows.
How do I achieve this with the randomForest function? Do I have to hack something together manually? I am not able to find any instructions on this in the vignette.

Do you mean that the sample for each tree should be different?
To start with, I would consider sampling before calling randomForest. Indeed, taking a different sample for each tree could have an impact on the final result, and the importance matrix would probably be partly biased.
You can achieve it like this:
numrow <- nrow(data)
subset <- sample(numrow, 50000)  # draw 50,000 row indices at random
learn <- data[subset, ]          # training subset
test <- data[-subset, ]          # remaining rows held out
model_rf <- randomForest(formula = [...], data = learn, importance = TRUE)
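Incidentally, if the goal really is a different 50,000-row sample for each tree, randomForest supports this directly through its sampsize argument, so no manual hack is needed. A minimal sketch, assuming a data frame learn with a response column y (stand-in names):
library(randomForest)
# sampsize draws a fresh sample of that many rows for every tree
model_rf <- randomForest(y ~ ., data = learn,
                         sampsize = 50000, importance = TRUE)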

Related

Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data, so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal).
From what I've read online, Amelia is the best option for panel data, so I used it like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country",
                      polytime = 1, intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
  stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
  imps <- amelia_data$imputations[[which_imp]]
  as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
library(rpart)
library(rpart.plot)

set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75 * nrow(tree_data), replace = FALSE)
I_train <- tree_data[I_index, ]
I_test <- tree_data[-I_index, ]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets to be combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that lets you quantify the uncertainty your analysis inherits from the missing data.
All the datasets created by multiple imputation are plausible imputations; because of that uncertainty, you don't know which one is correct.
You would therefore use multiple imputation the following way:
1. Create your m imputed datasets.
2. Build your trees on each imputed dataset separately.
3. Do your analysis on each tree separately.
4. In your final paper, you can then state how much uncertainty is caused by the missing values/imputation.
This means you get, e.g., 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. And if you get completely different results for each imputed dataset, you know there is too much uncertainty caused by the missing values to give reliable results.
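A minimal sketch of steps 2 and 3 in R, reusing the amelia2df() helper from the question (the prediction grid assumes managerial_value is a proportion, which is a guess):
library(rpart)
# one tree per imputed dataset
fits <- lapply(1:5, function(m) {
  imp <- amelia2df(amelia_data, which_imp = m)
  rpart(total_fatal ~ managerial_value, data = imp)
})
# compare predictions across imputations on a common grid
grid <- data.frame(managerial_value = seq(0, 1, by = 0.1))
preds <- sapply(fits, predict, newdata = grid)
apply(preds, 1, range)  # spread across imputations = imputation uncertainty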

Using bootstrapping to compare full and sample datasets

This is a fairly complicated situation, so I'll try to succinctly explain but feel free to ask for clarification.
I have several datasets of biological data that vary significantly in sample size (e.g., 253–1,221 observations per dataset). I need to estimate individual breeding parameters and compare them (for a different analysis), but because of the large differences in sample size, I took a subset of data from each dataset so the sample sizes were equal for each comparison. For example, the smallest dataset had 253 observations, so for all the others I used the following code
AT_EABL_subset <- Atlantic_EABL[sample(1:nrow(Atlantic_EABL), 253,replace=FALSE),]
to take a subset of 253 observations from the full dataset (in this case Atlantic_EABL originally had 1,221 observations).
It has now been suggested that I use bootstrapping to check whether the parameter estimates from my subsets are similar to the full-dataset estimates. I'm looking for code that will run, say, 200 iterations of the above subsetting and average the coefficients so I can compare them to the coefficients from my model on the full dataset. I found a site that uses the sample function to achieve this (https://towardsdatascience.com/bootstrap-regression-in-r-98bfe4ff5007), but when I get to this portion of the code
sample_coef_intercept <-
  c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
  c(sample_coef_x1, model_bootstrap$coefficients[2])
}
I get
Error: $ operator not defined for this S4 class
Below is the code I'm using. I don't know if I'm getting the above error because of the type of model I'm running (glmer vs. lm used in the link), or if there's a different function that will give me the data I need. Any advice is greatly appreciated.
sample_coef_intercept <- NULL
sample_coef_x1 <- NULL

for (i in 1:2) {
  boot.sample = AT_EABL_subset[sample(1:nrow(AT_EABL_subset),
                                      nrow(AT_EABL_subset),
                                      replace = FALSE), ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST,
                                 CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
                             as.factor(YEAR) + (1 | LatLong),
                           binomial, data = boot.sample)
}
sample_coef_intercept <-
  c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
  c(sample_coef_x1, model_bootstrap$coefficients[2])
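For reference, the error occurs because glmer() returns an S4 merMod object, which has no $coefficients; the fixed effects are extracted with fixef(). A sketch of one way the loop could be repaired (the coefficient collection has to happen inside the loop, and a bootstrap resamples rows with replacement):
library(lme4)
sample_coef_intercept <- numeric(0)
sample_coef_x1 <- numeric(0)
for (i in 1:200) {
  # bootstrap sample: same size as the data, drawn WITH replacement
  boot.sample <- AT_EABL_subset[sample(nrow(AT_EABL_subset),
                                       nrow(AT_EABL_subset),
                                       replace = TRUE), ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST,
                                 CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
                             as.factor(YEAR) + (1 | LatLong),
                           family = binomial, data = boot.sample)
  fe <- fixef(model_bootstrap)  # fixed-effects accessor for merMod objects
  sample_coef_intercept <- c(sample_coef_intercept, fe[1])
  sample_coef_x1 <- c(sample_coef_x1, fe[2])
}
mean(sample_coef_intercept)  # compare against the full-data model's intercept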

How to convert random forest prediction probabilities to a single classified response?

I have many large random forest classification models (~60 min run time each) that are used to predict over a raster using the type="prob" option. I am happy with the raster output (probabilities for each of x classes as a raster stack). However, I would like a simple way to convert these probabilities (a raster stack with x layers, where x is the number of classes) to a single-layer classification (i.e. winners only, no probabilities). This would be the equivalent of type="response".
Here is a simple example (which is not a raster, but still applies):
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.prob <- predict(iris.rf, type="prob")
iris.resp <- predict(iris.rf, type="response")
What is the most efficient way to use the iris.prob object to get the equivalent of the iris.resp output without rerunning the random forests (which, in my case with many large rasters, would take many hours)?
Thanks in advance
If you are trying to determine the max of multiple columns, in the same general format as iris.prob, I would find the max of each row and return the corresponding column name:
colnames(iris.prob)[max.col(iris.prob, ties.method = "first")]
I got the exact usage from another thread, so if this isn't working for you, you might try the other answer.
iris.prob contains the classification probabilities, i.e. the probability that each observation falls in each category, so you just need to extract the column name of the maximum value in each row. E.g.:
iris.resp2 <- colnames(iris.prob)[apply(iris.prob, 1, which.max)]
iris.resp2 == as.character(iris.resp) should return TRUE every time.
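For the raster case in the original question, the same argmax idea applies across layers; with the raster package, which.max() on the probability stack should yield a single layer of winning class indices. A sketch, assuming prob_stack is the type="prob" raster stack:
library(raster)
# each cell gets the index of the layer (class) with the highest probability
class_map <- which.max(prob_stack)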

Data Standardisation for Neural Network in R

I have built a multilayer perceptron neural network in SPSS 22. I tried the same using the "neuralnet" package in R, but the results are not desirable.
SPSS standardises the data before training, and I am wondering:
Does the "neuralnet" package perform any sort of standardisation? I could not find this in its guide.
According to the SPSS guide here, standardisation is done as follows:
Subtract the mean and divide by the standard deviation: (x − mean)/s.
Is there a function that can do this efficiently in R? Since the method is quite simple I could implement the scaling myself, but that might not be efficient, since the number of data elements and records is very large.
Or should I use another neural network package, like "monmlp", that standardises the data automatically?
Many thanks
This might be useful if you need to standardize multiple columns in a data frame (call it foo):
# Index of columns to standardize
cols <- c(1,2,3,4)
# Standardize
library(plyr)
standardize <- function(x) as.numeric((x - mean(x)) / sd(x))
foo[cols] <- plyr::colwise(standardize)(foo[cols])
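Base R can do the same without plyr: scale() centers and scales each column by its mean and standard deviation, which matches the SPSS formula above.
# equivalent in base R: column-wise (x - mean(x)) / sd(x)
foo[cols] <- scale(foo[cols])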

R "Pool" support vector machines for subsetting data sets

Facts for the SVM:
positive data set: 20 samples, 5 factors
negative data set: 10,000 samples, 5 factors
package: e1071 or kernlab
My test dataset would be around 15,000 samples.
To control this imbalance I tried to use class weights in e1071, as suggested in previous questions, but I cannot see any difference even when overweighting one class extremely.
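For reference, class weights in e1071 are passed to svm() as a named vector via its class.weights argument; a sketch, assuming a factor column Class with levels "pos" and "neg" in a combined data frame dat:
library(e1071)
# weight the rare positive class up by the inverse class frequency
wts <- c(pos = 10000 / 20, neg = 1)
svm_w <- svm(Class ~ ., data = dat, class.weights = wts, cross = 10)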
Now I was thinking of randomly splitting my negative data set into 100 negative sub-datasets, like this:
cost <- vector("numeric", 1)
gamma <- vector("numeric", 1)
accuracy <- vector("numeric", 1)
Function definition
split_data <- function(x, repeats) {
  for (i in 1:repeats) {
    random_data <- x[sample(1:nrow(x), 100), ]  # draw 100 negative samples
    dat <- rbind(data_pos, random_data)         # combine with the positive set
    svm <- svm(Class ~ ., data = dat, cross = 10)
    cost[i] <- svm$cost
    gamma[i] <- svm$gamma
    accuracy[i] <- svm$tot.accuracy
    print(summary(svm))
  }
  return(matrix(c(cost, gamma, accuracy), ncol = 3))
}
But I'm not sure what to do now with ... :D It always seems to select the same support vectors in my positive data set. There should be a smarter strategy; I have read about a few, but is it possible to realise them in R with any package?
Edit:
I would like to find an approach for dealing with highly imbalanced datasets, and I would like to do it this way: split my negative data set (the resampled test data set) into portions equal in size to my positive dataset. However, I would then somehow like to get the overall accuracy and sensitivity.
My question in particular is: how can I manage this in R in a nice way?
Thanks a lot
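For what it's worth, a minimal sketch of how the matrix returned by split_data() above could be summarised across the 100 negative subsets (data_neg is a stand-in name for the full negative dataset):
results <- split_data(data_neg, repeats = 100)
colnames(results) <- c("cost", "gamma", "accuracy")
colMeans(results)  # average tuned parameters and cross-validated accuracy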
