Text Classification using Random Forest in R

Hi, I'm new to R and want to perform text message classification. The data contains 2 columns: "type" (spam or ham) and "message" (character). I have cleaned the data and converted it into a document-term matrix:
data_dtm <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(2, Inf))))
Now I want to use random forest classification:
sms_classifier <- randomForest(x= as.matrix(sms_dtm_train), y= train_data$type, ntree= 10)
sms_dtm_train is the document-term matrix of the training data.
This code does not work. What is the problem?
This is the error message I am getting:
Error in randomForest.default(x = as.matrix(sms_dtm_train), y = train_data$type, :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion
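The warning "NAs introduced by coercion" suggests that as.matrix(sms_dtm_train) is not numeric by the time it reaches randomForest. One possible cause, which is only an assumption because the cleaning steps are not shown, is that the counts were recoded to strings such as "Yes"/"No". The sketch below uses a made-up toy matrix (x_chr and the other names are hypothetical) to show how to detect that and recode to 0/1:

```r
# Toy stand-in for as.matrix(sms_dtm_train): a character matrix, as you would
# get if term counts had been recoded to "Yes"/"No" (assumption, not confirmed).
x_chr <- matrix(c("Yes", "No", "No", "Yes", "No", "No"),
                nrow = 3,
                dimnames = list(NULL, c("free", "call")))

# A quick check before fitting: FALSE here means the model will have to coerce
is_num <- is.numeric(x_chr)

# Coercing the strings to double turns every cell into NA,
# which is what the warning in the question describes
x_bad <- suppressWarnings(matrix(as.numeric(x_chr), nrow = nrow(x_chr)))
has_na <- anyNA(x_bad)

# Recode to 0/1 instead; multiplying a logical matrix by 1 keeps the dimensions
x_num <- (x_chr == "Yes") * 1
```

If the matrix turns out to be numeric already, the NAs must come from somewhere else, so it is worth running anyNA() on it as well. Note also that randomForest only does classification when y is a factor, so train_data$type may need as.factor().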

Related

Reproducing predict function in the svm method in R

I want to reproduce the predict function in R. I found a very nice example in "How to reproduce predict.svm in R?", but it does not work on my data.
The difference is that I have four classes.
I received the error "Error in x - x1 : non-numeric argument to binary operator". Following advice from MrFlick, I added as.numeric to all values (this changed the error, so I checked my original data table, and there were a few non-numeric values).
Now I get another error: "Error in f(y, m) : dims [product 3] do not match the length of object [6]"
My data set is big, so I prepared some values to show my problem.
library(e1071)
Cval =100
GammaVal=0.1
sp1a<-as.numeric(c("2.58","0","0","10.85","20.1","0","0","0","0","0","76.03","0","0","28.79","0","2.76","0","0","23.99","0"))
sp1b<-as.numeric(c("135.32","133.82","134.24","132.84","135.11","133.55","132.99","130.25","133.19","132.42","135.8","133.99","133.33","135.52","134.67","134.79","134.32","133.9","135.36","133.14"))
sp1c<-as.numeric(sp1b)/2.3
sp1d<-as.numeric(sp1b)-3.5
sp1e<-as.numeric(sp1a)+1.3
sp1f<-as.numeric(sp1a)*2
data<-data.frame(cbind(sp1a,sp1b,sp1c,sp1d,sp1e,sp1f,class=c(rep(1,4),rep(2,5),rep(3,5),rep(4,6))))
svm_mod = svm(class~.,type="C-classification",data=data,cost = Cval, gamma = GammaVal,cross=10)
summary(svm_mod)
svm_train_pred = predict(svm_mod, data)
self_check_svm_out = cbind(data,svm_train_pred)
tab <- table(pred = svm_train_pred, true = data[,7])
## my predict functions
k<-function(x,x1,gamma){
return(exp(-gamma*sum((x-x1)^2)))
}
f<-function(x,m){
return(t(m$coefs) %*% as.matrix(apply(m$SV,1,k,x,m$gamma)) - m$rho)
}
my.predict<-function(m,x){
apply(x,1,function(y) sign(f(y,m)))
}
table(my.predict(svm_mod,data[,1:4]),predict(svm_mod,data[,1:4]))
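The dimension error is consistent with how a multi-class SVM is put together: with four classes there is one binary classifier per pair of classes, so there are choose(4, 2) = 6 offsets in m$rho, while m$coefs has 3 columns. The two-class formula in f() therefore ends up subtracting 6 values from 3. Each pairwise decision value has to be computed separately, and the class is then chosen by voting. The sketch below covers only that voting step, with made-up decision values; ovo_vote is a hypothetical helper, and the mapping from m$coefs to each pair is not shown here.

```r
# Hypothetical helper: majority vote over pairwise (one-vs-one) decision values.
# `dec` holds one value per class pair, in the order produced by combn();
# a positive value is counted as a vote for the first class of the pair.
ovo_vote <- function(dec, classes) {
  pairs <- combn(classes, 2)            # 2 x choose(k, 2) matrix of class pairs
  stopifnot(length(dec) == ncol(pairs))
  winners <- ifelse(dec > 0, pairs[1, ], pairs[2, ])
  votes <- table(factor(winners, levels = classes))
  classes[which.max(votes)]             # ties go to the first class in `classes`
}

classes <- 1:4
n_pairs <- choose(length(classes), 2)   # number of pairwise classifiers

# Made-up decision values for the pairs 1/2, 1/3, 1/4, 2/3, 2/4, 3/4
dec <- c(-0.8, 0.3, 0.5, 1.2, 0.9, -0.1)
predicted <- ovo_vote(dec, classes)
```

Two other things are worth checking in the posted code: svm() scales the predictors by default, so the kernel has to be evaluated on scaled inputs, and the model was fitted on six predictors while my.predict is called with data[,1:4].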

How to write custom predict function for classification model in R?

I am trying to use the flashlight package with the h2o package. An example of doing this on a regression model can be found here. However, I am trying to make it work for a classification model, and to achieve this I was following the example given in the link. flashlight will work with h2o if you provide your own custom predict function. However, the predict function in the example below does not work for classification.
Here is the code I'm using:
library(flashlight)
library(h2o)
h2o.init()
h2o.no_progress()
iris_hf <- as.h2o(iris)
iris_dl <- h2o.deeplearning(x = 1:4, y = "Species", training_frame = iris_hf, seed=123456)
pred_fun <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))))
fl_NN <- flashlight(model = iris_dl, data = iris, y = "Species", label = "NN",
predict_function = pred_fun)
But when I try to check the importance or interactions, I get an error. For example:
light_interaction(fl_NN, type = "H",
pairwise = TRUE)
Throws back the error:
Error: Assigned data predict(x, data = X[, cols, drop = FALSE]) must
be compatible with existing data. Existing data has 22500 rows.
Assigned data has 90000 rows. ℹ Only vectors of size 1 are recycled.
I need to change the predict function somehow to make it work, but I have had no success yet. Any suggestions as to how I could change the predict function?
EDIT: I found a custom predict function that works with the light_interaction function:
pred_fun <- function(mod, X) as.vector(unlist(h2o.predict(mod, as.h2o(X))[,2]))
Here the prediction is indexed for one specific category. However, this doesn't work for calculating the importance. For example:
light_importance(fl_NN)
Gives these warnings:
Warning messages:
1: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
2: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
3: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
4: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
5: In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
So I'm still trying to figure this out.
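Both messages point at the shape and type of what the predict function returns. The 90000 vs 22500 rows are consistent with a classification prediction that has four columns (the predicted label plus one probability per class), all of which get unlisted into one long vector. The sketch below uses a made-up data frame in place of the h2o output to show the difference; preds, p_setosa and y_setosa are hypothetical names, and the column layout is an assumption based on the error message.

```r
# Made-up stand-in for as.data.frame(h2o.predict(mod, as.h2o(X))) on 5 rows
preds <- data.frame(
  predict    = c("setosa", "setosa", "versicolor", "virginica", "versicolor"),
  setosa     = c(0.90, 0.80, 0.10, 0.05, 0.20),
  versicolor = c(0.07, 0.15, 0.70, 0.15, 0.60),
  virginica  = c(0.03, 0.05, 0.20, 0.80, 0.20)
)

# Unlisting every column returns 4 values per row: the length mismatch
n_all <- length(unlist(preds))

# Keeping a single class probability returns one number per row
p_setosa <- preds[["setosa"]]

# A numeric 0/1 response that matches that probability, so that
# "actual - predicted" is defined (it is not for a factor like Species)
y_setosa <- as.numeric(iris$Species == "setosa")
```

The Ops.factor warnings come from subtracting predictions from the factor Species. Adding a numeric 0/1 column like y_setosa to the data and using it as the response, together with the single-probability predict function from the edit, would be one way around that, though I have not verified it against flashlight itself.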

Error in seq.default(from = min(k), to = max(k), length = nBreaks + 1) : 'from' must be a finite number. WISH-R package

I have a list of pre-filtered genomic regions (based on a previous GWAS and some enrichment analysis performed on GSEA) and I am looking for interesting gene-gene interactions.
I have a binary phenotype, so I used glm=T in the model.
I followed the WISH-R guide in detail - https://github.com/QSG-Group/WISH - and generated the correlations matrix without issues.
I am now struggling to use the generate.modules function, so I am writing here for some help.
I have tried several times to run generate.modules(correlations,values="Coefficients",thread=2)
Before that, I also ran the following, as suggested:
correlations$Coefficients[(is.na(correlations$Coefficients))]<-0
correlations$Pvalues[(is.na(correlations$Pvalues))]<-1
This is my R code:
library(WISH)
library(data.table)
ped <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/all_snp_tf_recoded.ped", data.table=F)
tped <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/all_snp_tf_recoded.tped", data.table=F)
pval <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/ALL_SNP_TF_p.txt", data.table=F)
id <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/ALL_SNP_TF_id.txt", data.table=F)
genotype <-generate.genotype(ped,tped,snp.id=id, pvalue=0.005,id.select=NULL,gwas.p=pval,major.freq=0.95,fast.read=T)
LD_genotype<-LD_blocks(genotype)
genotype <- LD_genotype$genotype
pheno<-fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/pheno.txt",data.table=F)
pheno<-ifelse(pheno=="1","0","1")
pheno<-as.numeric(pheno)
correlations<-epistatic.correlation(pheno, genotype,threads = 2 ,test=F,glm=T)
genome.interaction(tped,correlations,quantile = 0.9)
correlations$Coefficients[(is.na(correlations$Coefficients))]<-0
correlations$Pvalues[(is.na(correlations$Pvalues))]<-1
generate.modules(correlations,values="Coefficients",thread=2)
I get the following error:
Error in seq.default(from = min(k), to = max(k), length = nBreaks + 1) :
'from' must be a finite number.
Do you have any hints for debugging this error?
What is the main issue here?
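seq() complains about 'from' when min() of its input is not finite, so a first debugging step could be to look for non-finite values that survive the is.na() replacement. is.na() catches NA and NaN but not Inf or -Inf. The sketch below uses a toy matrix in place of correlations$Coefficients; whether generate.modules actually fails on infinite coefficients, or on an empty set of values, is an assumption that cannot be confirmed from the post.

```r
# Toy stand-in for correlations$Coefficients
coefs <- matrix(c(0.2, NA, Inf, -0.4, NaN, 0.1, -Inf, 0.3, 0.0), nrow = 3)

# is.na() flags NA and NaN only
n_na <- sum(is.na(coefs))

# is.finite() is FALSE for NA, NaN, Inf and -Inf alike
n_nonfinite <- sum(!is.finite(coefs))

# Replace every non-finite entry, then confirm the range is usable by seq()
cleaned <- coefs
cleaned[!is.finite(cleaned)] <- 0
rng <- range(cleaned)
```

Running sum(!is.finite(correlations$Coefficients)) and range(correlations$Coefficients) on the real object should show quickly whether this is the cause. Keep in mind that min() also returns Inf when it is given an empty vector, so a filter inside the function that leaves no values would produce the same message.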

Error in xj[i] : invalid subscript type 'list' error in R random forest

I am using the airbnb dataset. After cleaning it, I tried to fit a random forest (I had already fitted a tree and a pruned tree, and they worked). I don't have a lot of experience, but here is my code:
split_index <- createDataPartition(airbnbcleanedfinal$logprice, p = 0.8, list = F)
#Use index to split data
training <- airbnbcleanedfinal[split_index,]
training1 <- airbnbcleanedfinal[sample(nrow(airbnbcleanedfinal),100000,replace=TRUE),]
features_test <- airbnbcleanedfinal[-split_index, !(colnames(airbnbcleanedfinal) %in% c('logprice'))]
target_test <- airbnbcleanedfinal[-split_index, 'logprice']
library(randomForest)
rf_train <- randomForest(logprice ~ ., data = airbnbcleanedfinal,
subset=training,
mtry = 5)
But I always get the same error message :
Error in xj[i] : invalid subscript type 'list'
I also tried deleting subset=training and passing data=training directly, but that makes R run forever. I also tried using training1, which I created for that purpose, but still got the same error message.
I tried unlist(training) but it did not work. My data is huge (85k rows, 15 variables) too, so maybe that is the problem? How can I force training to be a list?
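The subset argument of a formula interface expects row indices or a logical vector, not a data frame, which is why passing training produces "invalid subscript type 'list'". The sketch below uses lm() on made-up data as a stand-in for randomForest(), on the assumption that both handle subset through the same model-frame mechanism; df, idx and the fit_* names are hypothetical.

```r
set.seed(42)
df <- data.frame(x = rnorm(50))
df$y <- 2 * df$x + rnorm(50)

idx <- sample(nrow(df), 40)      # row indices, playing the role of split_index
training <- df[idx, ]            # data frame of training rows

# subset takes the row indices
fit_subset <- lm(y ~ x, data = df, subset = idx)

# Equivalent alternative: pass the training rows as data and drop subset
fit_data <- lm(y ~ x, data = training)

# Passing the data frame itself as subset fails; capture the error to inspect it
res <- tryCatch(lm(y ~ x, data = df, subset = training),
                error = function(e) e)
```

In the original call that would mean subset = split_index, or data = training without subset. If data = training seems to run forever on 85k rows, trying a smaller ntree first can show whether it is merely slow rather than stuck.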

Predict() with regsubsets

I'm trying to replicate the results from An Introduction to Statistical Learning with Applications in R, specifically the Lab in section 6.5.3. I have followed the code in the lab exactly:
library("ISLR")
library("leaps")
set.seed(1)
train = sample(c(TRUE, FALSE), nrow(Hitters), rep = TRUE)
test = (!train)
regfit.best = regsubsets(Salary ~., data = Hitters[train,], nvmax= 19)
test.mat = model.matrix(Salary~., data = Hitters[test,])
val.errors = rep(NA, 19)
for (i in 1:19){
coefi= coef(regfit.best, id = i)
pred=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((Hitters$Salary[test]-pred)^2)
}
When I run this I still get the following error:
Warning message:
In Hitters$Salary[test] - pred :
longer object length is not a multiple of shorter object length
Error in mean((Hitters$Salary[test] - pred)^2) :
error in evaluating the argument 'x' in selecting a method for function 'mean': Error: dims [product 121] do not match the length of object [148]
And val.errors is a vector of 19 NAs.
I'm still relatively new to R and to the validation approach, so I'm not sure exactly why the dimensions of these are different.
The issue was that I had not carried over a step from the previous subsection, which removes the incomplete entries.
You need to remove rows with missing data. Run "Hitters = na.omit(Hitters)" at the beginning.
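The mismatch (121 vs 148) arises because model.matrix() silently drops rows where the response is missing, while Hitters$Salary[test] keeps them. A minimal sketch with a made-up data frame (df and the other names are hypothetical) reproduces the effect in base R:

```r
# Toy data with missing responses, standing in for Hitters
df <- data.frame(y = c(10, NA, 30, 40, NA, 60),
                 x = c(1, 2, 3, 4, 5, 6))

# model.matrix() builds a model frame first, which omits rows with NA
mm <- model.matrix(y ~ x, data = df)
n_matrix <- nrow(mm)
n_response <- length(df$y)       # still counts the NA rows

# After na.omit() both sides have the same number of rows
df_complete <- na.omit(df)
mm2 <- model.matrix(y ~ x, data = df_complete)
same_length <- nrow(mm2) == length(df_complete$y)
```

With the NA rows removed up front, the test matrix and the response vector line up, so the subtraction inside the loop works.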
