I have dataset (HEART). I split it into chunks. I would like to predict each chunk with his (M=3) previous models. In this case, I would like to predict chunk number 10 - with models 7,8,9. chunk 9 - with models 6,7,8... chunk 4 - with models 1,2,3.
Here is my code:
library(caret)
dat1 <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"), header = FALSE,sep = ",")
colnames(dat1) <- c(LETTERS[1:(ncol (dat1)-1)],"CLA")
dat1$CLA<-as.factor (dat1$CLA)
chunk <- 30
n <- nrow(dat1)
r <- rep(1:floor(n/chunk),each=chunk)[1:n]
d <- split(dat1,r)
N<-floor(n/chunk)
cart.models <- list()
for(i in 1:N){cart.models[[i]]<-rpart(CLA~ ., data = d[[i]]) }
for (i in (1+M):N) { k=0
for (j in (i-M):(i-1)) {
k=k+1
d[[i]][,(ncol(d[[i]])+k)]<-(predict(cart.models[[j]], d[[i]][,c(-14)], type = "class") )
}
}
I get the following Error:
Error in `[<-.data.frame`(`*tmp*`, , (ncol(d[[i]]) + k), value = c(1L, :
new columns would leave holes after existing columns
Your question is a bit puzzling, you load caret without using any functions from it. The objective seems like a time series analyses but instead of building on one chunk and predicting on the one that comes after it, you have a more complex desire, so createTimeSlices from caret won't do the trick.
You could create custom folds in caret with index and indexOut arguments in trainControl but that would ultimately lead to the creation of more models (21 to be exact) than is required for the presented objective (9). So I do believe loops are an appropriate way:
create the models:
library(rpart)
N <- 9
cart.models <- list()
for(i in 1:N){
cart.models[[i]] <- rpart(CLA~ ., data = d[[i]])
}
N can be 9 since 10 will not be utilized later on.
create a matrix to store the values:
cart.predictions <- matrix(nrow = chunk, ncol = length(4:10)*3)
it should have the same number of rows as there are predictions in each chunk (so 30) and it should have as many columns are there are predictions (three models for 4:10 chunks).
k <- 0 #as a counter
for (j in 4:10) { #prediction on chunks 4:10
p <- j-3
pred <- list()
for(i in p : (p+2)) { #using models (chink - 3) : (chunk - 1)
k = k + 1
predi <- predict(cart.models[[i]], d[[j]], type = "class")
cart.predictions[,k] <- predi
}
}
this creates a numeric matrix for predictions. By default when R converts factors to numeric it gives them numbers: 1 to the first level, 2 to the second etc - so to get the levels (0:4) you can just:
cart.predictions <- as.data.frame(cart.predictions - 1)
to create the column names:
names <- expand.grid(3:1, 4:10)
names$Var1 <- with(names, Var2 - Var1)
colnames(cart.predictions) <- make.names(paste0(names$Var1,"_", names$Var2))
lets check if it correct:
prediction from model 5 on chunk 6 converted to numeric
as.numeric(as.character(predict(cart.models[[5]], d[[6]], type = "class")))
should be equal to
cart.predictions[["X5_6"]] #that's how the names were designed
all.equal(as.numeric(as.character(predict(cart.models[[5]], d[[6]], type = "class"))),
cart.predictions[["X5_6"]])
#output
TRUE
or you can create a character matrix in the first place:
cart.predictions <- matrix(data = NA_character_, nrow = chunk, ncol = length(4:10)*3)
k <- 0 #as a counter
for (j in 4:10) {
p <- j-3
pred <- list()
for(i in p : (p+2)) {
k = k + 1
predi <- predict(cart.models[[i]], d[[j]], type = "class")
cart.predictions[,k] <- predi
}
}
cart.predictions <- as.data.frame(cart.predictions)
This should be the preferred method if the classes are certain "names".
Related
I am trying to run a forecasting function hw() in R with below parameters:
hw(TS, initial = 'simple', alpha = a,beta= b ,gamma=g,h=future_num)
where alpha, beta, and gamma vary from 0 to 1 with an increment of 0.1.
I want to run this function for all the combination of alpha, beta, and gamma on one go and so have written below code:
volume = 'UpdatedVol1'
TS <- ts(Data_summary[volume],start = c(sy,sm),frequency = freq)
model_out = data.frame()
for(i in 1:length(alpha)){
for(j in 1:length(beta)){
for(k in 1:length(gama)){
### Model Output ###
#alpha = 0.2
fore_hw <- hw(TS, initial = 'simple', alpha =
alpha[i],beta=beta[j],gamma=gama[k],h=future_num)
hw <- data.frame(ForecastVol =as.numeric(round(fore_hw$mean)))
hw$ModelName <- "Ex HW"
hw$Alpha <- alpha[i]
hw$Beta <- beta[j]
hw$Gama <- gama[k]
model_out <- rbind(model_out,hw)
}
}
}
where Data_summary is data frame with 150 rows, freq = 7 and future_num = 300
Is there any possibility to run the hw() function with all alpha, beta, and gamma values at once? or any option to optimise this code as it is taking more than 8 mins to run this code
rbind() inside a loop is notoriously inefficient. See The Second Circle of R Hell in the R Inferno for a detailed explanation.
A quick fix is to put all the model results in a list and then use rbind once at the end:
# create a list and counter
results <- list()
counter <- 1
volume = 'UpdatedVol1'
TS <- ts(Data_summary[volume],start = c(sy,sm),frequency = freq)
model_out = data.frame() ## this seems to be unused??
for(i in 1:length(alpha)){
for(j in 1:length(beta)){
for(k in 1:length(gama)){
### Model Output ###
#alpha = 0.2
fore_hw <- hw(TS, initial = 'simple', alpha =
alpha[i],beta=beta[j],gamma=gama[k],h=future_num)
hw <- data.frame(ForecastVol =as.numeric(round(fore_hw$mean)))
hw$ModelName <- "Ex HW"
hw$Alpha <- alpha[i]
hw$Beta <- beta[j]
hw$Gama <- gama[k]
# add to the list and increment the counter
results[[counter]] <- hw
counter <- counter + 1
}
}
}
results <- do.call(rbind, results) # one rbind at the end
# (you can use `dplyr::bind_rows` or `data.table::rbindlist` to do this faster)
You could certainly do better than this too, but this will be a vast improvement with minimal changes.
following is the code i am trying to run.The main objective is to run the model for different K values then after calculate the accuracies in order to choose the best K value.
so i thought of using for loop where every model.Result and the respective accuracy is stored in lists.,then after is sent out with respective k values..
but the thing is for the following code...the list isnt having any values from 1:29 and there is predicted values for 30..
k = 1:30
for(l in k){
pre[[l]] = knn(train_dataset,test_dataset,cl = labels_train, k = l)
}
output :
enter image description here
can someone help me out with this....like why the list is coming like that and what should be done in order to get the correct result..and why so..?
Here is a solution, with the models fit using the code in tacoman's comment.
library(class)
set.seed(1) # Make the results reproducible
knn_list <- lapply(1:30, function(l){
knn(train_dataset, test_dataset, cl = labels_train, k = l)
})
ok <- sapply(knn_list, '==', labels_test)
acc <- colMeans(ok)
which(acc == max(acc))
plot(acc, type = "b")
The for loop in the question can also be run, as long as the results list is created beforehand. The results are identical.
set.seed(1) # Make the results reproducible
k <- 1:30
pre <- vector("list", length = 30)
for(l in k){
pre[[l]] <- knn(train_dataset, test_dataset, cl = labels_train, k = l)
}
identical(pre, knn_list)
#[1] TRUE
Example data
set.seed(2021)
n <- nrow(iris)
i <- sample(n, 0.7*n)
train_dataset <- iris[i, -5]
test_dataset <- iris[-i, -5]
labels_train <- iris[i, 5]
labels_test <- iris[-i, 5]
I would like to perform a Sobol sensitivity analysis in R
The package "sensitivity" should allow me to do so, but I don't understand how to generate the sampling matrixes (X1, X2). I have a model that runs outside of R. I have 6 parameters with uniform distribution.
In my text book: N = (2k+2)*M ; M = 2^b ; b=[8,12] (New sampling method : Wu et al. 2012)
I had the feeling that I should create two sampling matrix and feed the two to the sobol function X1_{M,k} X2_{M,k}.
The dimension of final sampling matrix x$X is then (k+2)*M. because:
X <- rbind(X1, X2)
for (i in 1:k) {
Xb <- X1
Xb[, i] <- X2[, i]
X <- rbind(X, Xb)
}
How should I conduct my sampling to get the right number of runs as (2*k+2)*M ?
This script is for the old method but does someone know if the new method is already implemented yet in the sensitivity package? Feel free to comment this procedure
name = c("a" , "b" , "c" , "d" , "e", "f")
vals <- list(list(var="a",dist="unif",params=list(min=0.1,max=1.5)),
list(var="b",dist="unif",params=list(min=-0.3,max=0.4)),
list(var="c",dist="unif",params=list(min=-0.3,max=0.3)),
list(var="d",dist="unif",params=list(min=0,max=0.5)),
list(var="e",dist="unif",params=list(min=2.4E-5,max=2.4E-3)),
list(var="f",dist="unif",params=list(min=3E-5,max=3E-3)))
k = 6
b = 8
M = 2^b
n <- 2*M
X1 <- makeMCSample(n,vals, p = 1)
X2 <- makeMCSample(n,vals, p = 2)
x <- sobol2007(model = NULL, X1, X2, nboot = 200)
if I understand correctly, I should provide a y for each x$X sampling combination
then I can use the function "tell" which will generate the Sobol' first-order indices as well as the total indices
tell(x,y)
ggplot(x)
Supplemental R function SobolR
makeMCSample <- function(n, vals) {
# Packages to generate quasi-random sequences
# and rearrange the data
require(randtoolbox)
require(plyr)
# Generate a Sobol' sequence
if (p == 2){ sob <- sobol(n, length(vals), seed = 4321, scrambling = 1)
}else{sob <- sobol(n, length(vals), seed = 1234, scrambling = 1)}
# Fill a matrix with the values
# inverted from uniform values to
# distributions of choice
samp <- matrix(rep(0,n*(length(vals)+1)), nrow=n)
samp[,1] <- 1:n
for (i in 1:length(vals)) {
# i=1
l <- vals[[i]]
dist <- l$dist
params <- l$params
fname <- paste("q",dist,sep="")
samp[,i+1] <- do.call(fname,c(list(p=sob[,i]),params))
}
# Convert matrix to data frame and add labels
samp <- as.data.frame(samp)
names(samp) <- c("n",laply(vals, function(l) l$var))
return(samp)
}
ref: Qiong-Li Wu, Paul-Henry Cournède, Amélie Mathieu, 2012, Efficient computational method for global sensitivity analysis and its application to tree growth modelling
I have an R code that contains some nested bracket for loop within which I used rmse() function from Metrics package. I tried it without the function and it worked, but inside my nested R code it does not.
Here is what I desire to do with R
I have generated a 50-time series dataset.
I lice the same time series dataset into chunks of the following sizes: 2,3,...,48,49 making me have 48 different time series formed from step 1 above.
I divided each 48-time series dataset into train and test sets so I can use rmse function in Metrics package to get the Root Mean Squared Error (RMSE) for the 48 subseries formed in step 2.
The RMSE for each series is then tabulated according to their chunk sizes
I obtained the best ARIMA model for each 48 different time series data set.
My R code
# simulate arima(1,0,0)
library(forecast)
library(Metrics)
n <- 50
phi <- 0.5
set.seed(1)
wn <- rnorm(n, mean=0, sd=1)
ar1 <- sqrt((wn[1])^2/(1-phi^2))
for(i in 2:n){
ar1[i] <- ar1[i - 1] * phi + wn[i]
}
ts <- ar1
t<-length(ts)# the length of the time series
li <- seq(n-2)+1 # vector of block sizes(i.e to be between 1 and n exclusively)
RMSEblk<-matrix(nrow = 1, ncol = length(li))#vector to store block means
colnames(RMSEblk)<-li
for (b in 1:length(li)){
l<- li[b]# block size
m <- ceiling(t / l) # number of blocks
blk<-split(ts, rep(1:m, each=l, length.out = t)) # divides the series into blocks
singleblock <- vector() #initialize vector to receive result from for loop
for(i in 1:10){
res<-sample(blk, replace=T, 100) # resamples the blocks
res.unlist<-unlist(res, use.names = F) # unlist the bootstrap series
# Split the series into train and test set
train <- head(res.unlist, round(length(res.unlist) * 0.6))
h <- length(res.unlist) - length(train)
test <- tail(res.unlist, h)
# Forecast for train set
model <- auto.arima(train)
future <- forecast(test, model=model,h=h)
nfuture <- as.numeric(out$mean) # makes the `future` object a vector
# use the `rmse` function from `Metrics` package
RMSE <- rmse(test, nn)
singleblock[i] <- RMSE # Assign RMSE value to final result vector element i
}
#singleblock
RMSEblk[b]<-mean(singleblock) #store into matrix
}
RMSEblk
The error I got
#Error in rmse(test, nn): unused argument (nn)
#Traceback:
But when I wrote
library(forecast)
train <- head(ar1, round(length(ar1) * 0.6))
h <- length(ar1) - length(train)
test <- tail(ar1, h)
model <- auto.arima(train)
#forecast <- predict(model, h)
out <- forecast(test, model=model,h=h)
nn <- as.numeric(out$mean)
rmse(test, nn)
It did work
Please point out what I am missing?
I am able to run your code after making two very small corrections in your for loop. See the two commented lines:
for (b in 1:length(li)){
l<- li[b]
m <- ceiling(t / l)
blk<-split(ts, rep(1:m, each=l, length.out = t))
singleblock <- vector()
for(i in 1:10){
res<-sample(blk, replace=T, 100)
res.unlist<-unlist(res, use.names = F)
train <- head(res.unlist, round(length(res.unlist) * 0.6))
h <- length(res.unlist) - length(train)
test <- tail(res.unlist, h)
model <- auto.arima(train)
future <- forecast(test, model=model,h=h)
nfuture <- as.numeric(future$mean) # EDITED: `future` instead of `out`
RMSE <- rmse(test, nfuture) # EDITED: `nfuture` instead of `nn`
singleblock[i] <- RMSEi
}
RMSEblk[b]<-mean(singleblock)
}
It is possible that these typos did not result in errors because nn and out were defined in the global environment while you ran the for loop. A good debugging tip is to restart R and try to reproduce the problem.
Your code does not define nn. Other code that works has nn. To start code with clean slate use this line as first executable line:
rm(list=ls())
So I needed some help with a train and test set that I am creating in R. The goal of the code is to break a data set into a certain amount k, and the number of folds the test set will be i. It will then return the training and test sets. We assume that k will be 5 or 10.
This is what I have so far.
create_sets<-function(df,k,i)
{
n<-dim(df)[1]
#fold size
size<-n/k
#beggining of test set
test_start<-(size*i)-(size)+1
#end of test set
test_end<-size*i
indices<-df(test_start,test_end)
train<-df[indices,]
test<-df[-indices,]
return (list(train=train,test=test))
}
df is just a random data frame of x and y. That is:
x<-c(1,6,7,4,3,5,7,8,9,8,7,6,5,4,3,4,5,3,2,1)
y<-c(3,5,6,7,5,4,3,5,7,8,9,0,2,3,4,5,6,7,5,6)
df<-data.frame(x,y)
When I run the code I am returning an error
Error in df(test_start, test_end) :
argument "df2" is missing, with no default
This is how I would approach it:
n <- nrow(df)
k <- 5
set.seed(10272015)
s <- sample(1:k, n, replace=TRUE)
result <- rep(NA, k)
for (i in 1:k) {
train <- df[s!=i, ]
test <- df[s==i,]
# fit model
# evaluate
# result[i] <- evalscore
}
mean(result)
I think you just need an index for different subsets,like this:
k <- 5
folds <- sample(rep(1:k,length=nrow(df)))
then, you can get any one of k subsets by(take 1 for example):
df[folds==1,]