Predict() with regsubsets - r

I'm trying to replicate the results from An Introduction to Statistical Learning with Applications in R, specifically the lab in section 6.5.3. I have followed the code in the lab exactly:
library("ISLR")
library("leaps")
set.seed(1)
train = sample(c(TRUE, FALSE), nrow(Hitters), rep = TRUE)
test = (!train)
regfit.best = regsubsets(Salary ~., data = Hitters[train,], nvmax= 19)
test.mat = model.matrix(Salary~., data = Hitters[test,])
val.errors = rep(NA, 19)
for (i in 1:19){
  coefi = coef(regfit.best, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((Hitters$Salary[test] - pred)^2)
}
When I run this I still get the following error:
Warning message:
In Hitters$Salary[test] - pred :
longer object length is not a multiple of shorter object length
Error in mean((Hitters$Salary[test] - pred)^2) :
error in evaluating the argument 'x' in selecting a method for function 'mean': Error: dims [product 121] do not match the length of object [148]
And val.errors is a vector of 19 NAs.
I'm still relatively new to R and to the validation approach, so I'm not sure exactly why the dimensions of these are different.

It turned out that I had not carried over a step from the previous subsection of the lab, which removes the entries that are incomplete.

You need to remove rows with missing data. Run "Hitters = na.omit(Hitters)" at the beginning.
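To see why the lengths differ: model.matrix() builds a model frame first, and the default na.action drops any row with a missing value in a variable used by the formula, while Hitters$Salary[test] still contains the NA entries. A minimal sketch with a toy data frame (the values are made up and only stand in for Hitters):

```r
# Toy stand-in for Hitters: one missing response value (made-up data)
toy <- data.frame(Salary = c(100, NA, 300, 400),
                  Hits   = c(10, 20, 30, 40))

# model.matrix() silently drops the row with the NA (default na.action is na.omit)
mm <- model.matrix(Salary ~ ., data = toy)
n_matrix_rows <- nrow(mm)            # rows that survive
n_response    <- length(toy$Salary)  # still counts the NA entry

# Removing incomplete rows first keeps the two in step
toy_clean <- na.omit(toy)
mm_clean  <- model.matrix(Salary ~ ., data = toy_clean)
```

In the lab code this is the same mismatch as 121 rows in test.mat against 148 values in Hitters$Salary[test].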

Related

SMOTE target variable not found in R

Why is it when I run the smote function in R, an error appears saying that my target variable is not found? I am using the smotefamily package to run this smote function.
tv_smote <- SMOTE(tv_smote, Churn, K = 5, dup_size = 0)
Error in table(target) : object 'Churn' not found
Generally, you should include the minimum code and data needed to reproduce the problem. This saves time and gives you more chance of getting an answer. However, try this...
tv_smote <- SMOTE(tv_smote, tv_smote$Churn, K = 5, dup_size = 0)
I can get the same error by doing this:
library(smotefamily)
df <- data.frame(x = 1:8, y = 8:1)
df_smote <- SMOTE(df, y, K = 3, dup_size = 0)
It appears to me that SMOTE doesn't know y is a column name. In the documentation, ?SMOTE, it says that the target argument is "A vector of a target class attribute corresponding to a dataset X." My interpretation of this is that you need to supply a vector, not a name. Changing it to df_smote <- SMOTE(df, df$y, K = 3, dup_size = 0) gets past that part.
I am not familiar with the SMOTE function, but from testing it would appear that SMOTE cannot accept dataframes with any factor type columns. I get Error in get.knnx(data, query, k, algorithm) : Data non-numeric when using as.factor.
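Putting the two points together, one way to prepare the inputs is to pass the target as a vector and keep only numeric columns as features. This is a sketch with made-up data; the column names tenure, charges and plan are hypothetical, and the SMOTE call is left as a comment in the same shape as the one above.

```r
# Made-up data: two numeric features, one factor column, and the target
tv <- data.frame(
  tenure  = c(1, 3, 5, 7, 9, 11, 13, 15),
  charges = c(20, 25, 30, 35, 40, 45, 50, 55),
  plan    = factor(c("a", "b", "a", "b", "a", "b", "a", "b")),
  Churn   = c("yes", "yes", "yes", "no", "no", "no", "no", "no")
)

# Keep only numeric columns as features (this drops the factor and the target)
is_num <- vapply(tv, is.numeric, logical(1))
X <- tv[, is_num, drop = FALSE]

# The target must be a vector, not a bare column name
target <- tv$Churn

# With real data, the call would then look like:
# library(smotefamily)
# tv_smote <- SMOTE(X, target, K = 5, dup_size = 0)
```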

Reproducing predict function in the svm method in R

I want to reproduce the predict function for an SVM in R. I found a very nice example in "How to reproduce predict.svm in R?", but it does not work on my data.
The difference is that I have four classes.
I received the error "Error in x - x1 : non-numeric argument to binary operator". Following advice from MrFlick, I added as.numeric to all values. This changed the error, so I checked my original data table and found a few non-numeric values.
Right now, I have another error: "Error in f(y, m) : dims [product 3] do not match the length of object [6]"
My data set is large, so I prepared some values to show the problem.
library(e1071)
Cval =100
GammaVal=0.1
sp1a<-as.numeric(c("2.58","0","0","10.85","20.1","0","0","0","0","0","76.03","0","0","28.79","0","2.76","0","0","23.99","0"))
sp1b<-as.numeric(c("135.32","133.82","134.24","132.84","135.11","133.55","132.99","130.25","133.19","132.42","135.8","133.99","133.33","135.52","134.67","134.79","134.32","133.9","135.36","133.14"))
sp1c<-as.numeric(sp1b)/2.3
sp1d<-as.numeric(sp1b)-3.5
sp1e<-as.numeric(sp1a)+1.3
sp1f<-as.numeric(sp1a)*2
data<-data.frame(cbind(sp1a,sp1b,sp1c,sp1d,sp1e,sp1f,class=c(rep(1,4),rep(2,5),rep(3,5),rep(4,6))))
svm_mod = svm(class~.,type="C-classification",data=data,cost = Cval, gamma = GammaVal,cross=10)
summary(svm_mod)
svm_train_pred = predict(svm_mod, data)
self_check_svm_out = cbind(data,svm_train_pred)
tab <- table(pred = svm_train_pred, true = data[,7])
## my predict functions
k <- function(x, x1, gamma) {
  return(exp(-gamma * sum((x - x1)^2)))
}
f <- function(x, m) {
  return(t(m$coefs) %*% as.matrix(apply(m$SV, 1, k, x, m$gamma)) - m$rho)
}
my.predict <- function(m, x) {
  apply(x, 1, function(y) sign(f(y, m)))
}
table(my.predict(svm_mod,data[,1:4]),predict(svm_mod,data[,1:4]))
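One possible source of the second error, offered as a sketch rather than a confirmed diagnosis: the f() above is written for two classes, where m$coefs has a single column and m$rho a single value. With four classes, a one-vs-one SVM stores one coefficient column per class minus one (3 here) and one rho per pair of classes (6 here), so subtracting m$rho from a 3 x 1 product fails. The numbers below are placeholders, not values from the fitted model; only the shapes matter.

```r
n_classes   <- 4
n_coef_cols <- n_classes - 1         # columns expected in m$coefs
n_pairs     <- choose(n_classes, 2)  # pairwise classifiers, length expected for m$rho

# Placeholder standing in for t(m$coefs) %*% kernel_values (3 x 1)
score <- matrix(c(0.2, -0.1, 0.4), nrow = n_coef_cols)
# Placeholder standing in for m$rho (length 6)
rho <- rep(0, n_pairs)

# Subtracting a longer vector from a matrix raises the same kind of error
msg <- tryCatch({ score - rho; "no error" },
                error = function(e) conditionMessage(e))
print(msg)
```

If that is the cause, the decision value has to be computed per pair of classes and the class chosen by voting. predict(svm_mod, data, decision.values = TRUE) returns e1071's own pairwise values, which can be used for comparison. Note also that my.predict is called on data[,1:4] while the model was trained on six predictors.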

Error in seq.default(from = min(k), to = max(k), length = nBreaks + 1) : 'from' must be a finite number. WISH-R package

I have a list of pre-filtered genomic regions (based on previous GWAS and some enrichment analysis performed on GSEA) and I am looking for interesting gene-gene interactions.
I have a binary phenotype, so I have used glm=T in the model.
I have followed in detail the WISH-R guide - https://github.com/QSG-Group/WISH - and generated the correlations matrix without issues.
I am now struggling to use the generate.modules function, so I am writing here for some help.
I have tried several times to run generate.modules(correlations,values="Coefficients",thread=2)
Before that I also ran, as suggested:
correlations$Coefficients[(is.na(correlations$Coefficients))]<-0
correlations$Pvalues[(is.na(correlations$Pvalues))]<-1
This is my R code:
library(WISH)
library(data.table)
ped <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/all_snp_tf_recoded.ped", data.table=F)
tped <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/all_snp_tf_recoded.tped", data.table=F)
pval <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/ALL_SNP_TF_p.txt", data.table=F)
id <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/ALL_SNP_TF_id.txt", data.table=F)
genotype <-generate.genotype(ped,tped,snp.id=id, pvalue=0.005,id.select=NULL,gwas.p=pval,major.freq=0.95,fast.read=T)
LD_genotype<-LD_blocks(genotype)
genotype <- LD_genotype$genotype
pheno<-fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/pheno.txt",data.table=F)
pheno<-ifelse(pheno=="1","0","1")
pheno<-as.numeric(pheno)
correlations<-epistatic.correlation(pheno, genotype,threads = 2 ,test=F,glm=T)
genome.interaction(tped,correlations,quantile = 0.9)
correlations$Coefficients[(is.na(correlations$Coefficients))]<-0
correlations$Pvalues[(is.na(correlations$Pvalues))]<-1
generate.modules(correlations,values="Coefficients",thread=2)
I get the following error:
Error in seq.default(from = min(k), to = max(k), length = nBreaks + 1) :
'from' must be a finite number.
Do you have some hints to debug this error here?
What is the main issue here?
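One thing that can be checked before calling generate.modules, as a hedged suggestion: is.na() catches NA and NaN but not Inf or -Inf, and seq() refuses a non-finite 'from'. Whether that is what happens inside generate.modules is an assumption; the matrix below is made up and only stands in for correlations$Coefficients.

```r
# Made-up matrix standing in for correlations$Coefficients
coefs <- matrix(c(0.1, NA, Inf, -0.3, NaN, 0.5, -Inf, 0.2, 0.0), nrow = 3)

# The replacement used in the question only handles NA and NaN
coefs[is.na(coefs)] <- 0
n_nonfinite  <- sum(!is.finite(coefs))  # Inf and -Inf are still there
range_before <- range(coefs)            # bounds are not finite

# Hypothetical clean-up: treat any remaining non-finite value like a missing one
coefs[!is.finite(coefs)] <- 0
range_after <- range(coefs)
```

An empty selection gives the same symptom, since min() of a zero-length vector is Inf, so it is also worth checking that the matrix still has rows and columns after filtering.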

how to debug errors like: "dim(x) must have a positive length" with caret

I'm running a predict over a fit similar to what is found in the caret guide:
Caret Measuring Performance
predictions <- predict(caretfit, testing, type = "prob")
But I get the error:
Error in apply(x, 1, paste, collapse = ",") :
dim(X) must have a positive length
I would like to know 1) the general way to diagnose these errors that are the result of bad inputs into functions like this or 2) why my code is failing.
1)
So, looking at the error, it has something to do with 'X'. Which argument is X? Obviously the first one in 'apply', but which argument in predict is eventually passed to apply? Looking at traceback():
10: stop("dim(X) must have a positive length")
9: apply(x, 1, paste, collapse = ",")
8: paste(apply(x, 1, paste, collapse = ","), collapse = "\n")
7: makeDataFile(x = newdata, y = NULL)
6: predict.C5.0(modelFit, newdata, type = "prob")
5: predict(modelFit, newdata, type = "prob") at C5.0.R#59
4: method$prob(modelFit = modelFit, newdata = newdata, submodels = param)
3: probFunction(method = object$modelInfo, modelFit = object$finalModel,
newdata = newdata, preProc = object$preProcess)
2: predict.train(caretfit, testing, type = "prob")
1: predict(caretfit, testing, type = "prob")
Now, this problem would be easy to solve if I could follow the code through and understand the problem as opposed to these general errors. I can trace the code using this traceback to the code at C5.0.R#59. (It looks like there's no way to get line numbers on every trace?) I can follow this code as far as this line 59 and then (I think) the predict function on line 44:
Github Caret C5.0 source
But after this I'm not sure where the logic flows. I don't see 'makeDataFile' anywhere in the caret source or, if it's in another package, how it got there. I've also tried RStudio debugging, debug() and browser(). None provide the stack trace I would expect from other languages. Any suggestions on how to follow the code when you don't know what an error message means?
2) As for my particular inputs, 'caretfit' is simply the result of a caret fit and the testing data is 3million rows by 59 columns:
fitcontrol <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 1,
                           classProbs = TRUE,
                           summaryFunction = custom.summary,
                           allowParallel = TRUE)
fml <- as.formula(paste("OUTVAR ~", paste(colnames(training[, 1:(ncol(training) - 2)]), collapse = "+")))
caretfit <- train(fml,
                  data = training[1:200000, ],
                  method = "C5.0",
                  trControl = fitcontrol,
                  verbose = FALSE,
                  na.action = na.pass)
1 Debugging Procedure
You can pinpoint the problem using a couple of functions.
Although there still doesn't seem to be any way to get a full stack trace with line numbers in code (Boo!), you can take the function names you do get from the traceback and use getAnywhere() to search for the function you are looking for. So for example, you can do:
getAnywhere(makeDataFile)
to see the location and source. (This also works well on Windows, where libraries are often bundled up as binaries.) Then you have to use the source or GitHub to find the specific line numbers or to trace through the logic of the code.
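As a small illustration with a function that ships with R (print.lm from the stats package is used here only as a stand-in for makeDataFile):

```r
# Look up a function by name, including ones that are not exported
hit <- getAnywhere("print.lm")
hit$where                # where it was found, e.g. the stats namespace

# The three-colon operator reaches into a namespace, exported or not
fn <- stats:::print.lm
head(deparse(fn), 3)     # first lines of its source
```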
In my particular problem if I run:
newdata <- testing
caseString <- C50:::makeDataFile(x = newdata, y = NULL)
(Note the three ":".) I can see that this step completes at this level, so it appears that something is happening to my dataset along the way.
So, using getAnywhere() and GitHub over and over through my traceback, I can find the line numbers manually (Boo!):
in caret/R/predict.train.R, predict.train (defined on line 108) calls probFunction on line 153
in caret/R/probFunction, probFunction (defined on line 3) calls the method$prob function, which is a stored function in the fit object, caretfit$modelInfo$prob, and can be inspected by entering this into the console. This is the same function found in caret/models/files/C5.0.R on line 58, which calls 'predict' on line 59
something in caret knows to use C50/R/predict.C5.0.R, which you can see by searching with getAnywhere()
this function runs makeDataFile on line 25 (part of the C50 package)
which calls paste, which calls apply, which dies with stop
2 Particular Problem with caret's predict
As for my problem, I kept inspecting the code and feeding in inputs at different levels, and each step would complete successfully. What happens is that my dataset is modified in predict.train.R, which causes it to fail. It turns out that I wasn't including the 'na.action' argument in the predict call; for my tree-based model, training used 'na.pass'. If I include this argument:
prediction <- predict(caretfit, testing, type = "prob", na.action = na.pass)
it works as expected. Line 126 of predict.train uses this argument to decide whether to include incomplete cases in the prediction. My data has no complete cases, so it failed, complaining that it needed a matrix of some positive length.
How one is supposed to know that this apply error is due to a missing na.action argument is not obvious at all, hence the need for a good debugging procedure. If anyone knows of other ways to debug (keeping in mind that on Windows, stepping through library source in RStudio doesn't work very well), please answer or comment.
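A quick pre-check along these lines would have pointed at the cause sooner. This is a sketch with a made-up data frame standing in for testing; it only shows how na.omit and na.pass differ when no row is complete.

```r
# Made-up stand-in for the testing set: every row has at least one NA
testing_toy <- data.frame(a = c(1, NA, 3), b = c(NA, 2, NA))

n_complete <- sum(complete.cases(testing_toy))  # rows with no missing values
kept_omit  <- nrow(na.omit(testing_toy))        # what an na.omit-style filter leaves
kept_pass  <- nrow(na.pass(testing_toy))        # na.pass keeps every row
```

If n_complete is 0, anything downstream that filters on complete cases receives no rows at all.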

KNN in R: 'train and class have different lengths'?

Here is my code:
train_points <- read.table("kaggle_train_points.txt", sep="\t")
train_labels <- read.table("kaggle_train_labels.txt", sep="\t")
test_points <- read.table("kaggle_test_points.txt", sep="\t")
#uses package 'class'
library(class)
knn(train_points, test_points, train_labels, k = 5);
dim(train_points) is 42000 x 784
dim(train_labels) is 42000 x 1
I don't see the issue, but I'm getting the error :
Error in knn(train_points, test_points, train_labels, k = 5) :
'train' and 'class' have different lengths.
What's the problem?
Without access to the data, it's really hard to help. However, I suspect that train_labels should be a vector. So try
cl = train_labels[,1]
knn(train_points, test_points, cl, k = 5)
Also double check:
dim(train_points)
dim(test_points)
length(cl)
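To illustrate why the vector matters: the length of a data frame is its number of columns, so a one-column data frame of labels has length 1 rather than 42000. A sketch using the built-in iris data in place of the Kaggle files (the split is arbitrary):

```r
library(class)

set.seed(1)
idx <- sample(nrow(iris), 100)
train_points <- iris[idx, 1:4]
test_points  <- iris[-idx, 1:4]
train_labels <- iris[idx, 5, drop = FALSE]  # one-column data frame, like read.table output

length(train_labels)     # number of columns, not number of rows
cl <- train_labels[, 1]  # plain factor vector
length(cl)               # one label per training row

pred <- knn(train_points, test_points, cl, k = 5)
```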
I had the same issue when applying knn to the Wisconsin breast cancer diagnosis dataset. The cause was that the cl argument needs to be a factor vector. My mistake was to write cl = labels; I thought this was the vector to be predicted, but it was in fact a one-column data frame. The solution was to use the following syntax: knn(train, test, cl = labels$diagnosis, k = 21), where diagnosis is the header of the single column in the data frame labels. It worked well.
Hope this helps!
I have recently encountered a very similar issue.
I wanted to give only a single column as a predictor. When selecting a single column like this, you have to remember the drop argument and set it to FALSE. The knn() function accepts only matrices or data frames as the train and test arguments, not vectors.
knn(train = trainSet[, 2, drop = FALSE], test = testSet[, 2, drop = FALSE], cl = trainSet$Direction, k = 5)
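The effect of drop is easy to check in base R. A sketch with a made-up data frame (trainSet_toy is hypothetical):

```r
trainSet_toy <- data.frame(Direction = c("Up", "Down", "Up"),
                           Lag1 = c(0.4, -0.2, 0.1))

as_vector <- trainSet_toy[, 2]                # dimensions dropped: plain numeric vector
as_frame  <- trainSet_toy[, 2, drop = FALSE]  # still a data frame with one column

is.null(dim(as_vector))   # TRUE
dim(as_frame)             # 3 1
```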
Try converting the data into a data frame using as.data.frame(). I was having the same problem, and afterwards it worked fine:
train_pointsdf <- as.data.frame(train_points)
train_labelsdf <- as.data.frame(train_labels)
test_pointsdf <- as.data.frame(test_points)
Simply set drop = TRUE when extracting cl from the data frame. This drops the extra dimension, so a single column comes back as a vector:
cl = train_labels[,1, drop = TRUE]
knn(train_points, test_points, cl, k = 5)
I had a similar error when reading the data into a tibble with read_csv; when I switched to read.csv, the code worked.
I followed the code as given in the book, but it throws an error because of mismatched lengths (one object is a data frame, the other is the vector returned by knn). Nothing here worked exactly as written, but the answers gave me the idea that vectors were needed for the comparison.
This throws an error:
gmodels::CrossTable(x = wbcd_test_labels, # actuals
y = wbcd_test_pred, # predicted
prop.chisq = FALSE)
The following works :
gmodels::CrossTable(x = wbcd_test_labels$diagnosis, # actuals
y = wbcd_test_pred, # predicted
prop.chisq = FALSE)
Using $ for x makes it a vector, so the lengths match.
Additionally, when running knn, the cl parameter should also be a vector. Save the labels in a vector, or use labelDF$Class_label, otherwise there will be a length mismatch:
wbcd_test_pred <- knn(train = wbcd_train,
test = wbcd_test,
cl =wbcd_train_labels$diagnosis, #note this
k = 21)
Hope this helps beginners like me.
Uninstall previous versions of R and install a version later than 4.0. It will work.
