Since the model I am using (from the fastNaiveBayes package) is not among the methods built into the caret package, I am trying to perform k-fold cross-validation in R without using caret. Does anyone have a solution for this?
Edit:
Here is my code so far, based on what I have learned about doing cross-validation without caret. I am quite certain something is wrong here.
library(fastNaiveBayes)

k <- 10
outs <- NULL
proportion <- 0.8

for (i in 1:k) {
  # Draw a fresh random 80/20 split on every iteration
  split <- sample(1:nrow(data), round(proportion * nrow(data)))
  traindata <- data[split, ]
  testdata <- data[-split, ]

  # Keep columns 1:14 as predictors; the response is the Label column
  y <- traindata$Label
  x <- traindata[, -(15:ncol(traindata))]

  model <- fnb.train(x, y = y, priors = NULL, laplace = 0,
                     distribution = fnb.detect_distribution(x, nrows = nrow(x)))

  test1 <- testdata[, -(15:ncol(testdata))]
  pred <- predict(model, newdata = test1)

  cm <- table(testdata$Label, pred)
  print(confusionMatrix(cm))  # confusionMatrix() comes from the caret package
}
It gave me 10 different results, and I don't think that is how cross-validation is supposed to work. I'm an entry-level R learner and would greatly appreciate any enlightenment.
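For comparison, here is a minimal sketch of what an actual k-fold split could look like with fnb.train. It reuses the object names and column selection from the question (data, Label, predictors kept by dropping columns 15 onward), so those are assumptions that may need adjusting; the key difference is that each row is assigned to exactly one fold and is held out exactly once.
library(fastNaiveBayes)

k <- 10
# Assign every row to exactly one of k folds
folds <- sample(rep(1:k, length.out = nrow(data)))

accuracy <- numeric(k)
for (i in 1:k) {
  traindata <- data[folds != i, ]
  testdata  <- data[folds == i, ]

  x <- traindata[, -(15:ncol(traindata))]  # same column selection as in the question
  y <- traindata$Label

  model <- fnb.train(x, y = y,
                     distribution = fnb.detect_distribution(x, nrows = nrow(x)))

  pred <- predict(model, newdata = testdata[, -(15:ncol(testdata))])
  accuracy[i] <- mean(pred == testdata$Label)
}

mean(accuracy)  # average performance over the k held-out folds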
Using the dlm package in R, I fit a dynamic linear model to a time series data set consisting of 20 observations. I then use the dlmForecast function to predict future values (which I can validate against the genuine data for that period).
I use the following code to create a prediction interval:
ciTheory <- (outer(sapply(fut1$Q, FUN=function(x) sqrt(diag(x))), qnorm(c(0.05,0.95))) +
as.vector(t(fut1$f)))
However, my data does not follow a normal distribution, and I wondered whether it would be possible to adapt the qnorm call for other distributions. I have tried qt, but I am unable to apply qgamma.
Does anyone know how to go about this?
Below is a reproduced version of my code:
library(dlm)

data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770,
          20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)

# Local level model; the two parameters are log(dV) and log(dW)
mod.build <- function(par) {
  dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}

# Returns the maximum likelihood estimates of the variance parameters
mle <- dlmMLE(data, rep(0, 2), mod.build)
if (mle$convergence == 0) print("converged") else print("did not converge")

# Rebuild the model at the fitted parameter values, filter, and forecast
mod1 <- mod.build(mle$par)
mod1Filt <- dlmFilter(data, mod1)
fut1 <- dlmForecast(mod1Filt, nAhead = 7)
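One possible direction (a sketch of a general idea only, not something taken from the dlm documentation) is to moment-match a gamma distribution to each step's forecast mean and variance and use its quantiles in place of the normal ones. This assumes the series is strictly positive and uses the same fut1$f and fut1$Q quantities as the ciTheory code above:
# Forecast means and variances at each horizon
fmean <- as.vector(t(fut1$f))
fvar  <- sapply(fut1$Q, function(x) diag(x))

# Method of moments for the gamma: mean = shape/rate, variance = shape/rate^2
shape <- fmean^2 / fvar
rate  <- fmean / fvar

ciGamma <- cbind(lower = qgamma(0.05, shape = shape, rate = rate),
                 upper = qgamma(0.95, shape = shape, rate = rate))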
Cheers
Good Afternoon.
I wanted a sanity check after doing some research on k-fold cross-validation. I will describe my understanding and then give an example of how I would execute it in R.
I would really appreciate any help if I'm thinking about this incorrectly, or if my code does not reflect my thought process / the correct procedure. Take a basic predictive modeling scenario with a continuous response variable:
Have a population dataset (xDF)
I want to split the dataset into k = 10 separate parts, train a model on 9 of them (combined), and then validate on the remaining validation set
I then want to loop through each validation set to observe how the model performs on segments of the data it was not trained on
Model performance measures (RMSE for this example) that are similar across all k validation sets indicate that the model generalizes well
R Code:
#Declaring randomly sampled validation indices
ind <- sample(seq_len(nrow(xDF)), size = nrow(xDF))
n <- (nrow(xDF)/10)
nr <- nrow(xDF)
validation_ind <- split(ind, rep(1:ceiling(nr/n), each=n, length.out=nr))
#Looping through validation sets to obtain a model performance measure for each set
library(randomForest)
# RMSE() is assumed to come from the caret package; rsq() is defined below
RMSEsF <- double(10)
RMSEsFT <- double(10)
R2F <- double(10)
R2FT <- double(10)
rsq <- function(x, y) cor(x, y)^2

for (i in 1:10) {
  validate <- as.data.frame(xDF[unlist(validation_ind[i]), ])
  train <- as.data.frame(xDF[unlist(validation_ind[-i]), ])

  rf_train <- randomForest(y ~ ., data = train, mtry = 3)
  predictions_rf <- predict(rf_train, validate)
  predictions_rft <- predict(rf_train, train)

  RMSEsF[i] <- RMSE(predictions_rf, validate$y)
  RMSEsFT[i] <- RMSE(predictions_rft, train$y)
  R2F[i] <- rsq(predictions_rf, validate$y)
  R2FT[i] <- rsq(predictions_rft, train$y)
  print(".")
}
RMSEsF
RMSEsFT
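To judge the "similar results across folds" point above, a minimal summary of the fold-level measures (using only the vectors filled in the loop) might look like this:
# Spread of the out-of-fold RMSEs: a small standard deviation relative to the
# mean suggests the model performs consistently across the held-out folds
mean(RMSEsF); sd(RMSEsF)
mean(R2F); sd(R2F)

# Comparing against the in-fold (training-set) values hints at over-fitting
mean(RMSEsFT); mean(R2FT)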
Am I going about this correctly?
Many thanks in advance.
I am working on sentiment analysis in R. I have already built a model with naive Bayes, but I want to try another one, XGBoost. I ran into a problem when trying to build the XGBoost model because I don't know what to do with my document-term matrix. Can anyone give me a solution?
I've tried converting the document-term matrix to a data frame, but that doesn't seem to work.
The code below shows how my current train and test data are built:
library(tm)
dtm.tf <- VCorpus(VectorSource(results$text)) %>%
DocumentTermMatrix()
#split 80:20
all.data <- dtm.tf
train.data <- dtm.tf[1:312,]
test.data <- dtm.tf[313:390,]
And I have an XGBoost template that was written for another data set:
# install.packages('xgboost')
library(xgboost)
classifier = xgboost(data = as.matrix(training_set[-11]),
                     label = training_set$Exited, nrounds = 10)
# Predicting the Test set results
y_pred = predict(classifier, newdata = as.matrix(test_set[-11]))
y_pred = (y_pred >= 0.5)
# Making the Confusion Matrix
cm = table(test_set[, 11], y_pred)
I want to use the XGBoost template above to build my model with my current train and test data. What do I have to do?
You need to transform the document-term matrix into a sparse matrix. In your case that can be done via the sparseMatrix() function from the Matrix package (shipped with R):
sparse_matrix_tf <- Matrix::sparseMatrix(i=dtm.tf$i, j=dtm.tf$j, x=dtm.tf$v,
dims=c(dtm.tf$nrow, dtm.tf$ncol))
Then you can feed this to xgboost. Note that the label argument should be a numeric vector with one outcome value per document (for example your 0/1 sentiment classes), not the document names; train_labels below is a placeholder for such a vector:
classifier = xgboost(data = sparse_matrix_tf,
                     label = train_labels,  # placeholder: numeric outcome per document
                     nrounds = 10)
Complete reproducible example below. I leave the splitting into 80 / 20 to you.
library(tm)
library(xgboost)
data("crude")
crude <- as.VCorpus(crude)
dtm.tf <- DocumentTermMatrix(crude)
sparse_matrix_tf <- Matrix::sparseMatrix(i=dtm.tf$i, j=dtm.tf$j, x=dtm.tf$v,
dims=c(dtm.tf$nrow, dtm.tf$ncol))
# The crude corpus has no sentiment labels, so a random 0/1 vector stands in here;
# replace it with your real document-level labels
labels <- sample(0:1, nrow(dtm.tf), replace = TRUE)
classifier = xgboost(data = sparse_matrix_tf,
                     label = labels,
                     nrounds = 10)
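For the 80/20 split on your own data, a minimal sketch (assuming your dtm.tf has 390 documents as in the question and labels is a numeric outcome vector of length 390, as above) could look like this:
# Row-wise split of the sparse matrix and labels into train and test sets
train_x <- sparse_matrix_tf[1:312, ]
test_x  <- sparse_matrix_tf[313:390, ]
train_y <- labels[1:312]
test_y  <- labels[313:390]

classifier <- xgboost(data = train_x, label = train_y, nrounds = 10)

# Predict and threshold as in the template
y_pred <- predict(classifier, newdata = test_x) >= 0.5
table(test_y, y_pred)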
I used the hts package in R to fit an HTS model on train data, used the "arima" option to forecast, and computed the accuracy on the holdout/test data.
Here is my code:
library(hts)
data<-read.csv("C:/TS.csv")
ts_train <- ts(data[,-1],frequency=12, start=c(2000,1))
hts_train <- hts(ts_train, nodes=list(2, c(4, 2)))
data.test<-read.csv("C:/TStest.csv")
ts_test <- ts(data.test[,-1],frequency=12, start=c(2003,1))
hts_test <- hts(ts_test, nodes=list(2, c(4, 2)))
forecast <- forecast(hts_train, h=15, method="bu", fmethod="arima", keep.fitted = TRUE, keep.resid = TRUE)
accuracy<-accuracy.gts(forecast, hts_test)
Now, let's suppose I'm happy with the accuracy on the holdout sample and I'd like to lump the test data back with the train data and re-forecast using the full set.
I tried using this code:
data.full<-read.csv("C:/TS_full.csv")
ts_full <- ts(data.full[,-1],frequency=12, start=c(2000,1))
hts_full <- hts(ts_full, nodes=list(2, c(4, 2)))
forecast.full <- forecast(hts_full, h=15, method="bu", fmethod="arima", keep.fitted = TRUE, keep.resid = TRUE)
Now, I'm not sure this is really the right way to do it, as I don't know whether the ARIMA models that were fitted to my train data are the same ARIMA models that I'm now using to forecast the full data set (I presume fmethod="arima" uses auto.arima). I'd like them to remain the same models; otherwise the models evaluated by my out-of-sample accuracy measures are different from the models I used for the final forecast.
I see there is a FUN argument that represents "a user-defined function that returns an object which can be passed to the forecast function". Perhaps that argument can be used in the last line of my code somehow to make sure the models I fit on the train data are used to forecast the full data set?
Any suggestions on what sort of R code would help would be much appreciated.
The functions are not set up to do that. However, it is not too difficult to do what you want. Here is some sample code:
library(hts)
data <- htseg2
# Split data into training and test sets
hts_train <- window(data, end=2004)
hts_test <- window(data, start=2005)
# Fit models and compute forecasts on all nodes using training data
train <- aggts(hts_train)
fmodels <- list()
fc <- matrix(0, ncol = ncol(train), nrow = 3)
for (i in 1:ncol(train)) {
  fmodels[[i]] <- auto.arima(train[, i])
  fc[, i] <- forecast(fmodels[[i]], h = 3)$mean
}
forecast <- combinef(fc, nodes = data$nodes)
accuracy <- accuracy.gts(forecast, hts_test)

# Forecast on full data set without re-estimating parameters
full <- aggts(data)
fcfull <- matrix(0, ncol = ncol(full), nrow = 15)
for (i in 1:ncol(full)) {
  fcfull[, i] <- forecast(Arima(full[, i], model = fmodels[[i]]), h = 15)$mean
}
forecast.full <- combinef(fcfull, nodes = data$nodes)

# Forecast on full data set with same models but re-estimated parameters
full <- aggts(data)
fcfull <- matrix(0, ncol = ncol(full), nrow = 15)
for (i in 1:ncol(full)) {
  fcfull[, i] <- forecast(Arima(full[, i],
                                order = fmodels[[i]]$arma[c(1, 6, 2)],
                                seasonal = fmodels[[i]]$arma[c(3, 7, 4)]),
                          h = 15)$mean
}
forecast.full <- combinef(fcfull, nodes = data$nodes)
I know that in MATLAB this is really easy ('-v 10').
But I need to do it in R. I did find one comment saying that adding cross = 10 as a parameter would do it, but this is not confirmed in the help file, so I am sceptical about it.
svm(Outcome ~. , data= source, cost = 100, gamma =1, cross=10)
Any examples of a successful SVM script in R would also be appreciated, as I am still running into some dead ends.
Edit: I forgot to mention outside of the tags that I use the libsvm package for this.
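For what it's worth, a minimal check of the cross = 10 result (assuming the svm() call comes from the e1071 package, which wraps libsvm) might look like this; the cross-validation results are stored on the fitted object:
library(e1071)

# 10-fold cross-validation is requested through the cross argument
model <- svm(Outcome ~ ., data = source, cost = 100, gamma = 1, cross = 10)

summary(model)      # prints the total and per-fold cross-validation accuracies
model$accuracies    # per-fold accuracies (for classification)
model$tot.accuracy  # overall cross-validation accuracy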
I am also trying to perform 10-fold cross-validation. I think that using tune() is not the right way to do it, since that function is used to optimize the parameters, not to train and test the model.
I have the following code to perform Leave-One-Out cross-validation. Suppose that dataset is a data.frame with your data. In each LOO step the observed vs. predicted table is accumulated, so that at the end result contains the global observed vs. predicted matrix.
#LOOValidation
library(e1071)
result <- 0  # accumulates the observed vs. predicted table across all steps
for (i in 1:nrow(dataset)) {
  fit <- svm(classes ~ ., data = dataset[-i, ], type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, dataset[i, ])
  result <- result + table(true = dataset[i, ]$classes, pred = pred)
}
classAgreement(result)
So in order to perform 10-fold cross-validation, I guess we should manually partition the dataset and use the folds to train and test the model.
results <- NULL  # collects the per-fold observed vs. predicted tables
for (i in 1:10) {
  # getFoldTrainSet() and getFoldTestSet() are placeholders for your own fold-splitting helpers
  train <- getFoldTrainSet(dataset, i)
  test <- getFoldTestSet(dataset, i)
  fit <- svm(classes ~ ., train, type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, test)
  results <- c(results, table(true = test$classes, pred = pred))
}
# compute mean accuracies and kappas using results, which stores the result of each fold
I hope this helps you.
Here is a simple way to create 10 test and training folds using no packages:
#Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]

#Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)

#Perform 10-fold cross-validation
for (i in 1:10) {
  #Segment your data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  #Use the test and train data however you desire...
}
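As a sketch only (assuming the e1071 package and a data frame yourData with an Outcome column, as in the question), the body of that loop could fit and evaluate an SVM on each fold like this:
library(e1071)

accuracies <- numeric(10)
for (i in 1:10) {
  testIndexes <- which(folds == i)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]

  fit <- svm(Outcome ~ ., data = trainData, cost = 100, gamma = 1)
  pred <- predict(fit, testData)
  accuracies[i] <- mean(pred == testData$Outcome)
}
mean(accuracies)  # average 10-fold accuracy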
Here is my generic code to run k-fold cross-validation, using cvsegments() from the pls package to generate the index folds.
# k-fold cross-validation
library(e1071)  # for svm() and classAgreement()
library(pls)    # for cvsegments()

set.seed(1)
k <- 80
result <- 0

folds <- cvsegments(nrow(imDF), k)
for (fold in 1:k) {
  currentFold <- folds[[fold]]
  fit <- svm(classes ~ ., data = imDF[-currentFold, ], type = 'C-classification', kernel = 'linear')
  pred <- predict(fit, imDF[currentFold, ])
  result <- result + table(true = imDF[currentFold, ]$classes, pred = pred)
}
classAgreement(result)
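classAgreement() (from e1071) returns agreement measures such as the proportion on the diagonal and kappa for the pooled table; if all you want is the overall accuracy, it can also be read off directly:
# Overall accuracy from the pooled observed vs. predicted table
sum(diag(result)) / sum(result)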