R training and test sets

I need some help with a train and test set that I am creating in R. The goal of the code is to break a data set into k folds, with fold number i serving as the test set, and then return the training and test sets. We assume that k will be 5 or 10.
This is what I have so far.
create_sets <- function(df, k, i)
{
  n <- dim(df)[1]
  # fold size
  size <- n/k
  # beginning of test set
  test_start <- (size*i) - (size) + 1
  # end of test set
  test_end <- size*i
  indices <- df(test_start, test_end)
  train <- df[indices,]
  test <- df[-indices,]
  return(list(train=train, test=test))
}
df is just a random data frame of x and y. That is:
x<-c(1,6,7,4,3,5,7,8,9,8,7,6,5,4,3,4,5,3,2,1)
y<-c(3,5,6,7,5,4,3,5,7,8,9,0,2,3,4,5,6,7,5,6)
df<-data.frame(x,y)
When I run the code I am returning an error
Error in df(test_start, test_end) :
argument "df2" is missing, with no default

This is how I would approach it:
n <- nrow(df)
k <- 5
set.seed(10272015)
s <- sample(1:k, n, replace=TRUE)
result <- rep(NA, k)
for (i in 1:k) {
  train <- df[s != i, ]
  test <- df[s == i, ]
  # fit model
  # evaluate
  # result[i] <- evalscore
}
mean(result)

I think you just need an index for the different subsets, like this:
k <- 5
folds <- sample(rep(1:k,length=nrow(df)))
Then you can get any one of the k subsets (take 1, for example):
df[folds==1,]
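As for the error in the original function: df(test_start, test_end) calls R's built-in F-distribution density function df(), which expects x, df1 and df2, hence "argument "df2" is missing". What was intended is an index vector, test_start:test_end. A minimal corrected sketch (assuming nrow(df) is divisible by k, and with fold i used as the test set as described, which also means swapping the train/test subscripts from the original):
create_sets <- function(df, k, i) {
  n <- nrow(df)
  size <- n / k                      # fold size
  test_start <- (i - 1) * size + 1   # first row of the test fold
  test_end <- i * size               # last row of the test fold
  indices <- test_start:test_end     # row indices of the test fold
  list(train = df[-indices, ], test = df[indices, ])
}
sets <- create_sets(df, k = 5, i = 2)
str(sets)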

Related

List giving empty values up to penultimate cell

Following is the code I am trying to run. The main objective is to run the model for different k values and then calculate the accuracies in order to choose the best k value.
So I thought of using a for loop where every model result and the respective accuracy is stored in lists, and then returned with the respective k values.
But with the following code, the list has no values for 1:29 and only has predicted values for 30.
k = 1:30
for(l in k){
pre[[l]] = knn(train_dataset,test_dataset,cl = labels_train, k = l)
}
Output: (screenshot omitted)
Can someone help me out with this? Why does the list come out like that, and what should be done in order to get the correct result?
Here is a solution, with the models fit using the code in tacoman's comment.
library(class)
set.seed(1) # Make the results reproducible
knn_list <- lapply(1:30, function(l){
knn(train_dataset, test_dataset, cl = labels_train, k = l)
})
ok <- sapply(knn_list, '==', labels_test)
acc <- colMeans(ok)
which(acc == max(acc))
plot(acc, type = "b")
The for loop in the question can also be run, as long as the results list is created beforehand. The results are identical.
set.seed(1) # Make the results reproducible
k <- 1:30
pre <- vector("list", length = 30)
for(l in k){
pre[[l]] <- knn(train_dataset, test_dataset, cl = labels_train, k = l)
}
identical(pre, knn_list)
#[1] TRUE
Example data
set.seed(2021)
n <- nrow(iris)
i <- sample(n, 0.7*n)
train_dataset <- iris[i, -5]
test_dataset <- iris[-i, -5]
labels_train <- iris[i, 5]
labels_test <- iris[-i, 5]

How to create condition for warnings in R

I am trying to create a "for" loop where each of 100 trials has a set of parameters, each randomly chosen from probability distributions. From there, a model will take in these parameters and spit out an output. The input and output will be stored in a matrix, with each row representing a successful run through. Eventually, this matrix will be converted into a dataframe. I am displaying a sample run through for one case of the for loop below:
#matrix M will have 100 rows for each trial, and 4 columns
#columns will be a val, b val, c val and output
M <- matrix(0, nrow=100, ncol=4)
for (i in 1:100){
  # random values for a, b, c for 1st trial
  a <- runif(1)
  b <- runif(1)
  c <- runif(1)
  v <- c(a, b, c)
  # some model
  output[i] <- v[1]*v[2]/v[3]
  M[i,4] <- output[i]
  # don't know how to populate first 3 columns with all diff values of a, b, c
}
I know this code will not work, but that's my first question: how do I get the a, b, and c values to regenerate from trial to trial so I can have new outputs for each trial? From there, I am pretty sure I know how to store them in the matrix.
My last question is about warning messages. If I have a warning message because my output did not generate for some trial (no problems with this one, but if I had to divide by 0 or something)... how could I just tell the program to skip that trial and keep going until we get to 100 working trials?
Please comment if I should edit or clarify something above. Thanks in advance.
To answer your first question, you can first generate parameter vectors and then apply your function to each parameter set.
ntrials <- 100
M <- matrix(0, nrow=ntrials, ncol=4)
## Generate parameter vectors
M[,1] <- runif(ntrials)
M[,2] <- runif(ntrials)
M[,3] <- runif(ntrials)
## Example model function
run_mod <- function(a, b, c) {
  return(a + b + c)
}
## Create output
M[, 4] <- run_mod(a = M[, 1], b = M[, 2], c = M[, 3])
To address your second question, you could use a while statement to continue generating parameter sets and trying to obtain valid model results until you have enough valid results. Your model function will need a way to handle errors or warnings that could occur, such as tryCatch().
## Example model function with error handling
run_mod <- function(a, b, c) {
  # Return the value of tryCatch() itself: a numeric result on success,
  # or a character string if an error/warning occurred
  tryCatch(
    a + b + c,
    error = function(e) "Error",
    warning = function(w) "Warning"
  )
}
i <- 0
while(i < ntrials) {
  ## Generate a single set of parameters
  a <- runif(1)
  b <- runif(1)
  c <- runif(1)
  ## Example error
  if(floor(100*a) %% 2 == 0) {
    a <- "Bad parameter"
  }
  ## Try running your model
  output <- run_mod(a, b, c)
  ## If successful, advance the counter and save the trial; otherwise retry
  if(!is.character(output)) {
    i <- i + 1
    M[i, 1] <- a
    M[i, 2] <- b
    M[i, 3] <- c
    M[i, 4] <- output
  }
}
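If you then want the data frame mentioned in the question, a small follow-up (the column names here are my own choice):
results <- as.data.frame(M)
colnames(results) <- c("a", "b", "c", "output")
head(results)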

rmse function issue in R

I have R code with a nested for loop, inside which I use the rmse() function from the Metrics package. The rmse() call works on its own, but inside my nested loop it does not.
Here is what I want to do in R:
1. I generate a time series data set of length 50.
2. I slice the same time series into chunks of each of the sizes 2, 3, ..., 48, 49, giving me 48 different chunked series from step 1.
3. I divide each of the 48 series into train and test sets so I can use the rmse() function from the Metrics package to get the Root Mean Squared Error (RMSE) for the 48 subseries formed in step 2.
4. The RMSE for each series is then tabulated according to its chunk size.
5. I obtain the best ARIMA model for each of the 48 different series.
My R code
# simulate arima(1,0,0)
library(forecast)
library(Metrics)
n <- 50
phi <- 0.5
set.seed(1)
wn <- rnorm(n, mean=0, sd=1)
ar1 <- sqrt((wn[1])^2/(1-phi^2))
for(i in 2:n){
ar1[i] <- ar1[i - 1] * phi + wn[i]
}
ts <- ar1
t<-length(ts)# the length of the time series
li <- seq(n-2)+1 # vector of block sizes(i.e to be between 1 and n exclusively)
RMSEblk<-matrix(nrow = 1, ncol = length(li))#vector to store block means
colnames(RMSEblk)<-li
for (b in 1:length(li)){
l<- li[b]# block size
m <- ceiling(t / l) # number of blocks
blk<-split(ts, rep(1:m, each=l, length.out = t)) # divides the series into blocks
singleblock <- vector() #initialize vector to receive result from for loop
for(i in 1:10){
res<-sample(blk, replace=T, 100) # resamples the blocks
res.unlist<-unlist(res, use.names = F) # unlist the bootstrap series
# Split the series into train and test set
train <- head(res.unlist, round(length(res.unlist) * 0.6))
h <- length(res.unlist) - length(train)
test <- tail(res.unlist, h)
# Forecast for train set
model <- auto.arima(train)
future <- forecast(test, model=model,h=h)
nfuture <- as.numeric(out$mean) # makes the `future` object a vector
# use the `rmse` function from `Metrics` package
RMSE <- rmse(test, nn)
singleblock[i] <- RMSE # Assign RMSE value to final result vector element i
}
#singleblock
RMSEblk[b]<-mean(singleblock) #store into matrix
}
RMSEblk
The error I got
#Error in rmse(test, nn): unused argument (nn)
#Traceback:
But when I wrote
library(forecast)
train <- head(ar1, round(length(ar1) * 0.6))
h <- length(ar1) - length(train)
test <- tail(ar1, h)
model <- auto.arima(train)
#forecast <- predict(model, h)
out <- forecast(test, model=model,h=h)
nn <- as.numeric(out$mean)
rmse(test, nn)
It did work
Please point out what I am missing.
I am able to run your code after making two very small corrections in your for loop. See the two commented lines:
for (b in 1:length(li)){
  l <- li[b]
  m <- ceiling(t / l)
  blk <- split(ts, rep(1:m, each=l, length.out = t))
  singleblock <- vector()
  for(i in 1:10){
    res <- sample(blk, replace=T, 100)
    res.unlist <- unlist(res, use.names = F)
    train <- head(res.unlist, round(length(res.unlist) * 0.6))
    h <- length(res.unlist) - length(train)
    test <- tail(res.unlist, h)
    model <- auto.arima(train)
    future <- forecast(test, model=model, h=h)
    nfuture <- as.numeric(future$mean) # EDITED: `future` instead of `out`
    RMSE <- rmse(test, nfuture) # EDITED: `nfuture` instead of `nn`
    singleblock[i] <- RMSE
  }
  RMSEblk[b] <- mean(singleblock)
}
It is possible that these typos did not result in errors because nn and out were defined in the global environment while you ran the for loop. A good debugging tip is to restart R and try to reproduce the problem.
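As a tiny illustration of how a stale global object can mask such a typo (hypothetical values, not from the question):
library(Metrics)
actual <- c(1, 2, 3)
nn <- c(1.1, 2.2, 2.9)  # leftover object from an earlier interactive run
rmse(actual, nn)        # a later typo referring to `nn` still "works" silently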
Your code does not define nn; the other code that works does define nn. To start your code with a clean slate, use this as the first executable line:
rm(list=ls())

Question regarding k fold cross validation for KNN using R

I am trying to fit 5-fold cross-validation for several values of k. I used the OJ data set in the ISLR package.
My code so far is as follows:
library(ISLR)
library(class)
ks=c(1:5)
err.rate.test <- numeric(length = 5)
folds <- cut(seq(1,nrow(OJ)),breaks=5,labels=FALSE)
for (j in seq(along = ks)) {
  set.seed(123)
  cv.knn <- sapply(1:5, FUN = function(i) {
    testID <- which(folds == i, arr.ind = TRUE)
    test.X <- OJ[testID, 3]
    test.Y <- OJ[testID, 1]
    train.X <- OJ[-testID, 3]
    train.Y <- OJ[-testID, 1]
    knn.test <- knn(data.frame(train.X), data.frame(test.X), train.Y, k = ks[j])
    cv.test.est <- mean(knn.test != test.Y)
    return(cv.test.est)
  })
  err.rate.test[j] <- mean(cv.knn)
}
err.rate.test
[1] 0.3757009 0.3757009 0.3757009 0.3757009 0.3757009
The code doesn't give any errors, but for some reason my test error rate is the same for every value of k. This seems weird to me, so I assume there is something wrong with my code.
Can anyone help me figure that out?
Thank you
Remove set.seed(123); that is what causes the repeated error rates.
set.seed() is used for reproducibility: it makes the random choices that go into fitting the knn model the same on every execution. Because the seed is reset to the same value inside the loop, every value of k sees the same random stream, resulting in the same predictions and therefore the same error rates.
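If you still want the run to be reproducible, one option (my suggestion, not part of the answer above) is to set the seed once before the loop rather than inside it, so the random stream is not reset to the same state for every k:
set.seed(123)  # set once, before the loop
for (j in seq(along = ks)) {
  # ... same body as above, but without the set.seed(123) call inside the loop
}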

Predicting chunks with M models in R

I have a dataset (HEART). I split it into chunks. I would like to predict each chunk with its (M=3) previous models. In this case, I would like to predict chunk number 10 with models 7, 8, 9; chunk 9 with models 6, 7, 8; ...; chunk 4 with models 1, 2, 3.
Here is my code:
library(caret)
dat1 <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"), header = FALSE,sep = ",")
colnames(dat1) <- c(LETTERS[1:(ncol (dat1)-1)],"CLA")
dat1$CLA<-as.factor (dat1$CLA)
chunk <- 30
n <- nrow(dat1)
r <- rep(1:floor(n/chunk),each=chunk)[1:n]
d <- split(dat1,r)
N<-floor(n/chunk)
cart.models <- list()
for(i in 1:N){cart.models[[i]]<-rpart(CLA~ ., data = d[[i]]) }
for (i in (1+M):N) {
  k = 0
  for (j in (i-M):(i-1)) {
    k = k + 1
    d[[i]][,(ncol(d[[i]])+k)] <- (predict(cart.models[[j]], d[[i]][,c(-14)], type = "class"))
  }
}
I get the following Error:
Error in `[<-.data.frame`(`*tmp*`, , (ncol(d[[i]]) + k), value = c(1L, :
new columns would leave holes after existing columns
Your question is a bit puzzling: you load caret without using any functions from it. The objective looks like a time-series analysis, but instead of building on one chunk and predicting on the one that comes after it, you have a more complex requirement, so createTimeSlices from caret won't do the trick.
You could create custom folds in caret with the index and indexOut arguments of trainControl, but that would ultimately lead to the creation of more models (21, to be exact) than the presented objective requires (9). So I do believe loops are an appropriate way:
create the models:
library(rpart)
N <- 9
cart.models <- list()
for(i in 1:N){
cart.models[[i]] <- rpart(CLA~ ., data = d[[i]])
}
N can be 9 since a model for chunk 10 will not be needed later on.
create a matrix to store the values:
cart.predictions <- matrix(nrow = chunk, ncol = length(4:10)*3)
It should have the same number of rows as there are predictions per chunk (so 30), and as many columns as there are prediction sets (three models for each of chunks 4:10).
k <- 0 # as a counter
for (j in 4:10) { # prediction on chunks 4:10
  p <- j - 3
  pred <- list()
  for(i in p:(p+2)) { # using models (chunk - 3) : (chunk - 1)
    k <- k + 1
    predi <- predict(cart.models[[i]], d[[j]], type = "class")
    cart.predictions[,k] <- predi
  }
}
This creates a numeric matrix of predictions. By default, when R converts factors to numeric it uses their integer codes: 1 for the first level, 2 for the second, etc. So to get back the original levels (0:4) you can just:
cart.predictions <- as.data.frame(cart.predictions - 1)
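A quick toy illustration of that coercion (not from the original answer): CLA has the levels 0:4, so the integer codes are the levels plus one, which is why subtracting 1 recovers them.
f <- factor(0:4)
as.numeric(f)
# [1] 1 2 3 4 5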
to create the column names:
names <- expand.grid(3:1, 4:10)
names$Var1 <- with(names, Var2 - Var1)
colnames(cart.predictions) <- make.names(paste0(names$Var1,"_", names$Var2))
Let's check that it is correct. The prediction from model 5 on chunk 6, converted to numeric,
as.numeric(as.character(predict(cart.models[[5]], d[[6]], type = "class")))
should be equal to
cart.predictions[["X5_6"]] #that's how the names were designed
all.equal(as.numeric(as.character(predict(cart.models[[5]], d[[6]], type = "class"))),
cart.predictions[["X5_6"]])
#output
TRUE
or you can create a character matrix in the first place:
cart.predictions <- matrix(data = NA_character_, nrow = chunk, ncol = length(4:10)*3)
k <- 0 # as a counter
for (j in 4:10) {
  p <- j - 3
  pred <- list()
  for(i in p:(p+2)) {
    k <- k + 1
    predi <- predict(cart.models[[i]], d[[j]], type = "class")
    cart.predictions[,k] <- predi
  }
}
cart.predictions <- as.data.frame(cart.predictions)
This should be the preferred method if the classes are actual "names" (character labels) rather than numbers.
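If you also want column names on the character version, the same naming step can be reused (assuming the names data frame from the numeric version above has been built):
colnames(cart.predictions) <- make.names(paste0(names$Var1, "_", names$Var2))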
