R "Pool" support vector machines for subsetting data sets - r

Facts for the SVM:
positive data set: 20 samples, 5 factors
negative data set: 10000 samples, 5 factors
package: e1071 or kernel
My test data set would be something like 15000 samples.
To control this imbalance I tried to use the class weights in e1071, as suggested in previous questions. But I cannot see any difference, even when overweighting one class extremely.
Now I was thinking of randomly splitting my negative data set into 100 negative subsets, like this:
cost <- vector("numeric", length(1))
gamma <- vector("numeric", length(1))
accuracy <- vector("numeric", length(1))
Function definition
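library(e1071)
# Fit one SVM per random negative subset.
# data_pos, cost, gamma and accuracy must already exist in the calling environment.
split_data <- function(x, repeats) {
  for (i in 1:repeats) {
    # draw 100 negative samples at random and combine them with the positives
    random_data <- x[sample(1:nrow(x), 100), ]
    dat <- rbind(data_pos, random_data)
    # named "model" so the fitted object does not shadow the svm() function
    model <- svm(Class ~ ., data = dat, cross = 10)
    cost[i] <- model$cost
    gamma[i] <- model$gamma
    accuracy[i] <- model$tot.accuracy
    print(summary(model))
  }
  return(matrix(c(cost, gamma, accuracy), ncol = 3))
}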
But I'm not sure what to do now with ... :D It seems to always pick the same support vectors in my positive data set. There should be a smarter strategy; I have read about some, but is it possible to implement them in R with any package?
Edit:
I would like to find an approach for dealing with highly imbalanced data sets, and I would like to do it this way: split my negative data set (resampled test data set) into portions equal in size to my positive data set. However, I would then somehow like to get the overall accuracy and sensitivity.
My question in particular is: how can I manage this in R in a nice way?
Thanks a lot
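One way to read the idea in the edit is as an undersampling ensemble: cut the negatives into chunks the size of the positive set, fit one SVM per chunk, and let the models vote on the test set. The following is only a minimal sketch under that reading, not a recommended recipe. The simulated data, the helper make_class, and the names test_set, chunks and votes are assumptions made for illustration; only svm() and predict() come from e1071.

```r
library(e1071)

# Simulated stand-in data (assumption): 5 numeric features per sample
set.seed(42)
make_class <- function(n, shift, label) {
  d <- as.data.frame(matrix(rnorm(n * 5, mean = shift), ncol = 5))
  d$Class <- factor(label, levels = c("neg", "pos"))
  d
}
data_pos <- make_class(20, 1.5, "pos")
data_neg <- make_class(1000, 0, "neg")
test_set <- rbind(make_class(10, 1.5, "pos"), make_class(500, 0, "neg"))

# Randomly assign every negative sample to one of the disjoint chunks
n_chunks <- floor(nrow(data_neg) / nrow(data_pos))
chunk_id <- sample(rep(seq_len(n_chunks), length.out = nrow(data_neg)))
chunks <- split(data_neg, chunk_id)

# One balanced SVM per chunk; each column of votes is one model's "pos" calls
votes <- sapply(chunks, function(neg) {
  model <- svm(Class ~ ., data = rbind(data_pos, neg))
  as.character(predict(model, newdata = test_set)) == "pos"
})

# Majority vote across models, then overall accuracy and sensitivity
pred <- factor(ifelse(rowMeans(votes) > 0.5, "pos", "neg"),
               levels = c("neg", "pos"))
tab <- table(predicted = pred, actual = test_set$Class)
accuracy <- sum(diag(tab)) / sum(tab)
sensitivity <- tab["pos", "pos"] / sum(tab[, "pos"])
```

Because every model sees all positives, the positive support vectors will tend to repeat across models; the vote is what combines the different negative chunks.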

Related

Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal)
From what I've read online, Amelia is the best option for panel data so I used that like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country", polytime = 1,
intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
imps <- amelia_data$imputations[[which_imp]]
as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
library(rpart)
library(rpart.plot)
set.seed(300)
tree_data <- total_amelia
# 75/25 train/test split
I_index <- sample(1:nrow(tree_data), size = 0.75*nrow(tree_data), replace=FALSE)
I_train <- tree_data[I_index,]
I_test <- tree_data[-I_index,]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets to be combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you should set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that lets you quantify the uncertainty that the missing data adds to your analysis.
All the datasets created by multiple imputation are plausible imputations; because of the uncertainty, you don't know which one is correct.
You would therefore use multiple imputation the following way:
1. Create your m imputed datasets
2. Build your trees on each imputed dataset separately
3. Do your analysis on each tree separately
4. In your final paper, you can now state how much uncertainty is caused by the missing values/imputation
This means you get e.g. 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. Or, if you get completely different results for each imputed dataset, you know there is too much uncertainty caused by the missing values to give reliable results.
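The per-dataset workflow described above could be sketched roughly as follows. The list imps is a simulated stand-in (an assumption) for amelia_data$imputations, and make_imp and new_data are hypothetical names used only for this illustration; with real data you would pass amelia_data$imputations to lapply() directly.

```r
library(rpart)

# Simulated stand-in for amelia_data$imputations: a list of m = 5 data frames
set.seed(300)
make_imp <- function() {
  managerial_value <- runif(200, 0, 60)
  data.frame(managerial_value,
             total_fatal = 50 - 0.5 * managerial_value + rnorm(200, sd = 5))
}
imps <- replicate(5, make_imp(), simplify = FALSE)

# Build one tree per imputed dataset instead of one tree on the stacked data
trees <- lapply(imps, function(d) {
  rpart(total_fatal ~ managerial_value, data = d)
})

# Compare the trees at a few illustrative predictor values
new_data <- data.frame(managerial_value = c(10, 30, 50))
preds <- sapply(trees, predict, newdata = new_data)  # rows: values, cols: trees

# Spread of the predictions across imputations, one column per value
pred_range <- apply(preds, 1, range)
```

A wide range in pred_range for some predictor value would indicate that the missing data contributes a lot of uncertainty there.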

Using bootstrapping to compare full and sample datasets

This is a fairly complicated situation, so I'll try to succinctly explain but feel free to ask for clarification.
I have several datasets of biological data that vary significantly in sample size (e.g., 253-1221 observations/dataset). I need to estimate individual breeding parameters and compare them (for a different analysis), but because of the large sample size differences, I took a sub-set of data from each dataset so the sample sizes were equal for each comparison. For example, the smallest dataset had 253 observations, so for all the others I used the following code
AT_EABL_subset <- Atlantic_EABL[sample(1:nrow(Atlantic_EABL), 253,replace=FALSE),]
to take a subset of 253 observations from the full dataset (in this case AT_EABL originally had 1,221 observations).
It's now suggested that I use bootstrapping to check if the parameter estimates from my subsets are similar to the full dataset estimates. I'm looking for code that will run, say, 200 iterations of the above subset data and calculate the average of the coefficients so I can compare them to the coefficients from my model with the full dataset. I found a site that uses the sample function to achieve this (https://towardsdatascience.com/bootstrap-regression-in-r-98bfe4ff5007), but when I get to this portion of the code
c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
c(sample_coef_x1, model_bootstrap$coefficients[2])
}
I get
Error: $ operator not defined for this S4 class
Below is the code I'm using. I don't know if I'm getting the above error because of the type of model I'm running (glmer vs. lm used in the link), or if there's a different function that will give me the data I need. Any advice is greatly appreciated.
sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:2) {
boot.sample = AT_EABL_subset[sample(1:nrow(AT_EABL_subset), nrow(AT_EABL_subset), replace = FALSE), ]
model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST,CLUTCH_SIZE_HOST_ATLEAST-YOUNG_HOST_TOTAL_ATLEAST)~as.factor(YEAR)+(1|LatLong),binomial,data=boot.sample)}
sample_coef_intercept <-
c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
c(sample_coef_x1, model_bootstrap$coefficients[2])
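A note on the error: glmer() returns an S4 object, so `$coefficients` is not available; lme4 provides fixef() to extract the fixed-effect estimates. Two other details in the loop above are worth checking: the closing brace comes before the coefficients are collected, and drawing nrow() rows with replace = FALSE only reorders the subset, so every iteration fits the same data. Below is a minimal sketch of repeated subsampling, assuming the cbpp example data shipped with lme4 as a stand-in for the poster's data; subsample_fixef, n_sub and n_iter are hypothetical names.

```r
library(lme4)

# Stand-in model on lme4's cbpp data (assumption, not the poster's model)
form <- cbind(incidence, size - incidence) ~ period + (1 | herd)

# Refit the model on repeated random subsets and collect the fixed effects
subsample_fixef <- function(data, n_sub, n_iter) {
  fits <- lapply(seq_len(n_iter), function(i) {
    # a fresh subset of the FULL data on every iteration
    rows <- sample(nrow(data), n_sub, replace = FALSE)
    fit <- glmer(form, family = binomial, data = data[rows, ])
    fixef(fit)  # use fixef(), not $coefficients, on the S4 model object
  })
  do.call(rbind, fits)  # one row per iteration, one column per coefficient
}

set.seed(1)
coefs <- subsample_fixef(cbpp, n_sub = 50, n_iter = 5)
mean_coefs <- colMeans(coefs)  # average estimate across subsets
full_coefs <- fixef(glmer(form, family = binomial, data = cbpp))
```

Comparing mean_coefs with full_coefs gives the kind of side-by-side check the question asks for; in practice n_iter would be closer to the 200 iterations mentioned.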

Low-pass filtering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a Google search and came up with a dazzling range of packages. Some apply to 1D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R? ); some apply to images. However, I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go this way as they typically involve (i) installation of more or less complex/heavy solutions (imageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r<-(0:360)/360*(2*pi)
x<-cos(r)
y<-sin(r)
z<-outer(x,y,"*")
noise<-0.3*matrix(runif(length(x)*length(y)),nrow=length(x))
zz<-z+noise
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience on signal processing !
Thanks.
One option is to use a non-linear prediction method and take the fitted values from the model.
For example, a polynomial regression can approximate the original data (the purple one).
Following the same logic, you can do the same for all columns of the zz matrix:
# collect one column of fitted values per column of zz
predictions <- matrix(, nrow = nrow(zz), ncol = 0)
for (i in 1:ncol(zz)) {
  pred <- as.matrix(fitted(lm(zz[, i] ~ poly(1:nrow(zz), 2, raw = TRUE))))
  predictions <- cbind(predictions, pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree for a better fit across the columns. Or you could use other, more powerful non-linear prediction methods (maybe SVM, ANN etc.) to get a more accurate model.
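As a different technique from the regression above, a plain moving-average (box) filter is a simple low-pass filter that needs only base R. The sketch below is an illustration under assumptions, not the answer's method: box_filter and its window width k are hypothetical names, and the filter smooths down each column and then along each row with stats::filter(), leaving NA in the (k - 1) / 2 border cells on each side.

```r
# Separable moving-average (box) low-pass filter for a numeric matrix.
# k is the window width and must be odd.
box_filter <- function(m, k = 5) {
  stopifnot(is.matrix(m), k %% 2 == 1)
  w <- rep(1 / k, k)
  smooth1d <- function(v) as.numeric(stats::filter(v, w, sides = 2))
  cols <- apply(m, 2, smooth1d)   # smooth down each column
  t(apply(cols, 1, smooth1d))     # then along each row (apply returns it transposed)
}

# Small demonstration on data shaped like the question's example
set.seed(1)
r <- (0:60) / 60 * (2 * pi)
z <- outer(cos(r), sin(r), "*")
zz <- z + 0.3 * matrix(runif(length(z)), nrow = nrow(z))
smoothed <- box_filter(zz, k = 5)
```

With the question's own matrix, image(box_filter(zz, 15)) would show the effect; a larger k smooths more but blurs real structure and widens the NA border.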

How to convert random forest prediction probabilities to a single classified response?

I have many large random forest classification models (~60min run time each) that are used for prediction of a raster using the type="prob" option. I am happy with the raster output (probabilities for each of x classes as a raster stack). However, I would like a simple way to convert these probabilities (a raster stack with x layers, where x is the number of classes) to a simple one-layer classification (i.e. winners only, no probabilities). This would be the equivalent of type="response".
Here is a simple example (which is not a raster, but still applies):
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.prob <- predict(iris.rf, type="prob")
iris.resp <- predict(iris.rf, type="response")
What is the most efficient way to use the iris.prob object to get the equivalent output of iris.resp without rerunning the random forests (which, in my case with many large rasters, would take too many hours)?
Thanks in advance
If you are trying to determine the max of multiple columns, with the same general format as iris.prob, I would find the max of each row and return the column name.
colnames(iris.prob)[max.col(iris.prob,ties.method="first")]
I got the exact usage from another thread, so if this isn't working you might try another response.
iris.prob should contain the classification result: the probability that each observation belongs to each category. So you just need to extract the column name of the maximum value in each row.
E.g.:
iris.resp2 = colnames(iris.prob)[apply(iris.prob,1,which.max)]
iris.resp2 == as.character(iris.resp) should return TRUE for every row, except possibly where two classes tie for the maximum.
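To make the row-wise maximum concrete, here is a small worked example in base R. The matrix probs is made up for illustration (an assumption), with the same layout as iris.prob; the third row contains a tie to show what ties.method = "first" does.

```r
# Hypothetical class probabilities, one row per observation
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6,
                  0.4, 0.4, 0.2),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("setosa", "versicolor", "virginica")))

# Column index of the largest probability in each row; ties go to the first column
winner_idx <- max.col(probs, ties.method = "first")

# Convert to a factor that keeps all class levels, like type = "response"
winner <- factor(colnames(probs)[winner_idx], levels = colnames(probs))
winner
# [1] setosa    virginica setosa
# Levels: setosa versicolor virginica
```

Keeping levels = colnames(probs) means classes that never win still appear as factor levels, which matters when tabulating or comparing against a reference.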

How to sample rows in the randomForest package

I have a dataset with 1 million rows and 100 columns. randomForest is quite slow for data this big, so I would like to train each tree on a subset of, say, 50000 rows.
How do I achieve this with the randomForest function? Do I have to hack something together manually? I am not able to find any instruction on this in the vignette.
Do you mean that the sample for each tree should be different?
To start with, I would consider sampling before calling randomForest. Indeed, taking a different sample for each tree could have an impact on the final result, and the importance matrix would probably be partly biased.
You can achieve this as follows:
numrow <- nrow(data)
subset <- sample(numrow, 50000)  # row indices of the training sample
learn <- data[subset, ]
test <- data[-subset, ]
model_rf <- randomForest(formula=[...], data=learn, importance=TRUE)
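If a different sample per tree is what is wanted, randomForest() also takes the arguments sampsize (number of rows drawn for each tree) and replace (whether they are drawn with replacement). The following is a minimal sketch, assuming iris as a stand-in for the large data set and 50 rows per tree in place of 50000.

```r
library(randomForest)

set.seed(1)
# Each of the 100 trees is grown on its own random draw of 50 rows
model_rf <- randomForest(Species ~ ., data = iris,
                         ntree = 100,
                         sampsize = 50,    # rows drawn per tree
                         replace = FALSE,  # draw without replacement
                         importance = TRUE)
```

Rows left out of a tree's draw count as out-of-bag for that tree, so the OOB error estimate is still reported without a separate test split.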
