I am trying to create 10 folds of my data. What I want is a data structure of length 10 (the number of folds) in which each element holds an object with two elements: the training set and the test set for that fold. This is my R code.
I wanted to access, for example, the training set at fold 8 with View(data_pairs[[8]]$training_set), but it did not work. Any help would be appreciated :)
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(data), replace = TRUE)
data_pairs <- list()
for (j in i) {
  test_ind <- which(folds == j, arr.ind = TRUE)
  test <- data[test_ind, ]
  train <- data[-test_ind, ]
  data_pair <- list(training_set = list(train), test_set = list(test))
  data_pairs <- append(x = data_pairs, values = data_pair)
}
You were very close; you just needed to wrap values in a list call. Without it, append() splices the two named elements of each data_pair into data_pairs one by one, so training sets and test sets end up as separate, alternating elements instead of staying paired.
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(mtcars), replace = TRUE)
data_pairs <- list()
for (j in i) {
  test_ind <- which(folds == j, arr.ind = TRUE)
  test <- mtcars[test_ind, ]
  train <- mtcars[-test_ind, ]
  data_pair <- list(training_set = train, test_set = test)
  data_pairs <- append(x = data_pairs, values = list(data_pair))
  # data_pairs <- c(data_pairs, list(data_pair))
}
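With that change, each element of data_pairs is one complete train/test pair, so the access pattern from the question now works:
# fold 8's training and test sets are plain data frames
View(data_pairs[[8]]$training_set)
str(data_pairs[[8]]$test_set)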
If your data is big, I would suggest you read these two posts on more efficient ways to grow a list:
Append an object to a list in R in amortized constant time, O(1)?
Here we go again: append an element to a list in R
I would also like to point out that you are not creating "folds" of your data. You are attempting 10-fold cross-validation, which means your data should be separated into 10 roughly equal-sized chunks, whereas sampling fold labels with replacement (as above) yields folds of uneven size. You then create 10 train/test data sets, using each fold as the test data and the rest for training; a sketch follows below.
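A minimal sketch of building roughly equal-sized folds, shuffling a repeated label vector instead of sampling labels with replacement:
k <- 10
# each fold label appears floor(nrow(data)/k) or ceiling(nrow(data)/k) times,
# so fold sizes differ by at most one row
fold_id <- sample(rep(1:k, length.out = nrow(data)))
table(fold_id) # inspect fold sizes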
It seems like the package modelr could help you here.
In particular I would point you to:
https://modelr.tidyverse.org/reference/resample_partition.html
library(modelr)
ex <- resample_partition(mtcars, c(test = 0.3, train = 0.7))
mod <- lm(mpg ~ wt, data = ex$train)
rmse(mod, ex$test)
#> [1] 3.229756
rmse(mod, ex$train)
#> [1] 2.88216
Alternatively, producing a dataset of these partitions can be done with:
crossv_mc(data, n, test = 0.2, id = ".id")
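For instance, a quick sketch against mtcars (crossv_mc returns a data frame whose train and test columns hold resample objects):
library(modelr)
cv <- crossv_mc(mtcars, n = 10, test = 0.2)
# fit on the first training resample and score on its paired test resample
mod1 <- lm(mpg ~ wt, data = as.data.frame(cv$train[[1]]))
rmse(mod1, cv$test[[1]])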
I need to repeat the sampling procedure in the loop below 1000 times using a second loop.
This is the simplified code I produced for reproducibility; it is the inner loop.
## Number of iterations
N <- 8
## Store data from inner loop in vectors
PMSE <- rep(1, N)
PolynomialDegree <- rep(1, N)
for (I in 1:N) {
  PolynomialDegree[I] <- I
  PMSE[I] <- I * rnorm(1)
}
Now, using a second, outer loop, I want to repeat this "sampling procedure" 1000 times and store the data from all those vectors in a single data frame. I'm struggling to write the outer loop and was hoping for some assistance.
This is my attempt, with non-reproducible code; I hope it is clear what I am attempting to do.
## Set number of iterations
N <- 8
M <- 1000
## Store data
OUTPUT <- rep(1, M)
## Outer loop starts
for (J in 1:M) {
  PMSE <- rep(1, N)
  PolynomialDegree <- rep(1, N)
  sample <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
  training <- tempraindata[sample, ]
  testing <- tempraindata[-sample, ]
  ## Inner loop starts
  for (I in 1:N) {
    ## Set up linear model with polynomial of degree I; x = Year, y = tem
    mymodel <- lm(tem ~ poly(Year, degree = I), data = training)
    ## Fit model on testing set and save predictions
    predictions <- predict(mymodel, newdata = testing, raw = FALSE)
    ## Define and store PMSE
    PMSE[I] <- (1/(nrow(tempraindata) - nrow(training)))*(sum(testing$tem - predictions))^2
    PolynomialDegree[I] <- I
  } ## End of inner loop
  OUTPUT[J] <- ## THIS IS WHERE I WANT TO SAVE THE DATA
} ## End outer loop
I want to store all the data inside OUTPUT and make it a data frame; if done correctly, it should contain 8000 values of PMSE and 8000 values of PolynomialDegree.
Avoid the bookkeeping of initializing vectors and then assigning elements by index. Consider sapply (or vapply) over both iteration ranges to build the 8,000 PMSE calculations as a 1000 x 8 matrix: every column a polynomial degree and every row a training/testing run.
## Set number of iterations
N <- 8
M <- 1000

## Defined method to generalize the process: one PMSE value for run m and degree n
calc_PMSE <- function(m, n) {
  ## Randomly build training/testing sets
  set.seed(m + n) # TO REPRODUCE RANDOM SAMPLES
  samp <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
  training <- tempraindata[samp, ]
  testing <- tempraindata[-samp, ]
  ## Set up linear model with polynomial of degree n; x = Year, y = tem
  mymodel <- lm(tem ~ poly(Year, degree = n), data = training)
  ## Fit model on testing set and save predictions
  predictions <- predict(mymodel, newdata = testing)
  ## Return single PMSE value
  (
    (1/(nrow(tempraindata) - nrow(training))) *
      (sum(testing$tem - predictions))^2
  )
}

# RETURN (1000 X 8) MATRIX: ONE ROW PER RUN, ONE COLUMN PER DEGREE
PMSE_matrix <- sapply(1:N, function(n) sapply(1:M, calc_PMSE, n = n))
# type-checked equivalent
PMSE_matrix <- vapply(1:N, function(n) vapply(1:M, calc_PMSE, numeric(1), n = n), numeric(M))
Should you need an 8,000-row data frame of two columns, consider reshape to long format:
long_df <- reshape(
  data.frame(PMSE_matrix),
  varying = 1:8,
  timevar = "PolynomialDegree",
  v.names = "PMSE",
  ids = NULL,
  new.row.names = 1:8000,
  direction = "long"
)
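Equivalently, since as.vector() unrolls a matrix column by column (all M runs for degree 1, then degree 2, and so on), the long data frame can be built directly; a minimal sketch assuming the PMSE_matrix above:
long_df2 <- data.frame(
  PolynomialDegree = rep(1:N, each = M),
  PMSE = as.vector(PMSE_matrix)
)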
I would like to create replicate predictions for one integer independent variable (iv1), given some model and a data frame called training. This is my current approach. I appreciate this is not self-contained, but hopefully it is self-explanatory:
number_of_samples <- 10
results <- NULL
for (row in 1:nrow(training)) {
  fake_iv1_values <- sample(1:100, number_of_samples)
  case <- training[row, ]
  for (iv1 in fake_iv1_values) {
    case$iv1 <- iv1
    case$prediction <- predict(some_model, newdata = case)
    results <- rbind(results, case)
  }
}
Using loops is very slow. I wonder if this could be sped up? Thanks!
Try this.
Reproducible fake data and model:
# create fake data
n_row <- 100
n_xs <- 100
training <- data.frame(y = rnorm(n_row), iv1 = rnorm(n_row))
training[, paste0("x",1:n_xs)] <- replicate(n_xs, list(rnorm(n_row)))
# example model
some_model <- lm(y~., training)
Rewritten code:
number_of_samples <- 10
results <- NULL
# vector of several fake_iv1_values vectors
fake_iv1_values <- as.numeric(replicate(nrow(training), sample(1:100, number_of_samples)))
# replicate each row of the original dataframe
results <- training[rep(seq_len(nrow(training)), each = number_of_samples), ]
# add fake values to the replicated dataframe
results$iv1 <- fake_iv1_values
# get predictions
results$prediction <- predict(some_model, newdata = results)
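A quick sanity check of the vectorized version (a sketch against the fake data above):
# every original row should appear number_of_samples times, each with its own iv1 draw
nrow(results) == nrow(training) * number_of_samples # TRUE
head(results[, c("iv1", "prediction")])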
I am new to R and programming. I want to store the ker, cValues, and accuracyValues produced by the loop below in a data frame. I am not able to achieve this; the data frame saves only the last value, not all of them.
Can you please help me with this?
# Define a vector which has different kernel methods
kerna <- c("rbfdot","polydot","vanilladot","tanhdot","laplacedot",
           "besseldot","anovadot","splinedot")
# Define a for loop to calculate accuracy for different values of C and kernel
for (ker in kerna) {
  cValues <- c()
  accuracyValues <- c()
  for (c in 1:100) {
    model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
                  data = credit_card_data,
                  type = "C-svc",
                  kernel = ker,
                  C = c,
                  scaled = TRUE)
    pred <- predict(model, credit_card_data[,1:10])
    #pred
    accuracy <- sum(pred == credit_card_data$V11)/nrow(credit_card_data)
    cValues[c] <- c
    accuracyValues[c] <- accuracy
  }
  for (i in 1:100) {
    print(paste("kernel:", ker, "c=", cValues[i], "accuracy=", accuracyValues[i]))
  }
}
Starting from your base code, set up the structure of the output data frame. Then loop through and fill in the accuracy values on each iteration. This method also "flattens" the nested loop and gets rid of your c variable, which conflicts with the built-in c() function.
kerna <- c("rbfdot","polydot","vanilladot","tanhdot","laplacedot",
"besseldot","anovadot","splinedot")
# Create dataframe to store output data
df <- data.frame(kerna = rep(kerna, each = 100),
cValues = rep(1:100, times = length(kerna)),
accuracyValues = NA,
stringsAsFactors = F)
# Define a for loop to calculate accuracy for different values of C and kernel
for (i in 1:nrow(df)){
ker <- df$kerna[i]
j <- df$cValues[i]
model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
data = credit_card_data,
type ="C-svc",
kernel = ker,
C=j,
scaled =TRUE)
pred <- predict(model,credit_card_data[,1:10])
accuracy <- sum(pred== credit_card_data$V11)/nrow(credit_card_data)
# Insert accuracy into df$accuracyValues
df$accuracyValues[i] <- accuracy;
}
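With everything in one data frame, summaries are straightforward; for example, a sketch of pulling the best C per kernel from the filled df:
# sort by kernel, then by descending accuracy, and keep the top row per kernel
ord <- df[order(df$kerna, -df$accuracyValues), ]
best_per_kernel <- ord[!duplicated(ord$kerna), ]
best_per_kernel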
Consider Map to build a list of data frames from each pairing of ker and cValues (1:100) generated by expand.grid, then row-bind all elements together.
k_c_pairs_df <- expand.grid(kerna = kerna, c_value = 1:100, stringsAsFactors = FALSE)

model_fct <- function(ker, c) {
  model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
                data = credit_card_data,
                type = "C-svc",
                kernel = ker,
                C = c,
                scaled = TRUE)
  pred <- predict(model, credit_card_data[,1:10])
  accuracy <- sum(pred == credit_card_data$V11)/nrow(credit_card_data)
  print(paste("kernel:", ker, "c=", c, "accuracy=", accuracy))
  return(data.frame(kernel = ker, cValues = c, accuracyValues = accuracy))
}

df_list <- Map(model_fct, k_c_pairs_df$kerna, k_c_pairs_df$c_value)
final_df <- do.call(rbind, df_list)
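final_df can then be inspected like any data frame, e.g.:
head(final_df)
# the single best kernel/C combination found across all runs
final_df[which.max(final_df$accuracyValues), ]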
I've got a big database which I've split up by year, and I've created a train and test set for each respective year.
# split the dataset into a list of datasets
Y <- split(dat_all, dat_all$year)

# create a train and test dataset for all years
# takes an element of Y as input
create_sets <- function(x) {
  train_set <- sample(2, nrow(x), replace = TRUE, prob = c(0.7, 0.3))
  train <- x[train_set == 1, ]
  test <- x[train_set == 2, ]
  list(train = train, test = test)
}
Ylist <- lapply(Y, create_sets)
To call each item out, you use Ylist$'2016'$train.
I've made an accuracy-ratio function which I can run each list element through individually, but I am looking for a way to do it all in one go to save massive amounts of code (there are 16 years of data).
Below is how I currently create an accuracy ratio for one year.
val_train<-Ylist$'2016'$train
val_train$pred<-predict(modf,newdata=Ylist$'2016'$train)
val_train$probs<-exp(val_train$pred)/(1+exp(val_train$pred))
x<-data.frame(rcorr.cens(val_train$probs, val_train$default_flag))
train_AR<-x[2,1]
train_AR
modfull <-ModFit(test)
val_test<-test
val_test$pred<-predict(modf,newdata=test)
val_test$probs<-exp(val_test$pred)/(1+exp(val_test$pred))
x<-data.frame(rcorr.cens(val_test$probs, val_test$default_flag))
test_AR<-x[2,1]
test_AR
AR_Logistic1<-c(train_AR,test_AR,)
AR_Logistic2<-c(train_AR,test_AR) #just in to see if table works
AccuracyRatio<-rbind(AR_Logistic1,AR_Logistic2)
colnames(AccuracyRatio)<-c("Train","Test","All")
AccuracyRatio
Just to clarify, I'm trying to run my whole list through my accuracy-ratio function and then output the AR for each year's train and test sets.
Any help is greatly appreciated.
By wrapping the AR calculations in a function and mapping it over the list, you can summarise the output as below.
Without sample data I could not test it, but let us know if you face any errors.
fn_Calc_AR <- function(yearDat, yearName = NA_character_) {
  # yearDat <== e.g. Ylist$'2016'
  trainDat <- yearDat$train
  testDat <- yearDat$test

  val_train <- trainDat
  val_train$pred <- predict(modf, newdata = trainDat)
  val_train$probs <- exp(val_train$pred)/(1 + exp(val_train$pred))
  x <- data.frame(rcorr.cens(val_train$probs, val_train$default_flag))
  train_AR <- x[2, 1]

  modfull <- ModFit(testDat)
  val_test <- testDat
  val_test$pred <- predict(modf, newdata = testDat)
  val_test$probs <- exp(val_test$pred)/(1 + exp(val_test$pred))
  x <- data.frame(rcorr.cens(val_test$probs, val_test$default_flag))
  test_AR <- x[2, 1]

  AR_Logistic1 <- c(train_AR, test_AR) # removed extraneous comma, previous input c(train_AR,test_AR,)
  AR_Logistic2 <- c(train_AR, test_AR) # just in to see if table works
  AccuracyRatio <- rbind(AR_Logistic1, AR_Logistic2)
  colnames(AccuracyRatio) <- c("Train", "Test") # two values per row, so two column names
  # attach the year label supplied by the caller
  AR_DF <- data.frame(year = yearName, AccuracyRatio, stringsAsFactors = FALSE)
  return(AR_DF)
}
Summarise output (Map passes each year's data together with its name, so the year label survives):
AR_Summary <- do.call(rbind, Map(fn_Calc_AR, Ylist, names(Ylist)))
Aggregate dataset (the combined train/test pair is itself a single input for fn_Calc_AR):
aggregateTrain <- do.call(rbind, lapply(Ylist, function(x) x$train))
aggregateTest <- do.call(rbind, lapply(Ylist, function(x) x$test))
aggregateList <- list(train = aggregateTrain, test = aggregateTest)
AR_AggregateSummary <- fn_Calc_AR(aggregateList, yearName = "all")
A critical difference (CD) plot for comparing classifiers over multiple data sets (Demšar 2006) can be generated with the mlr package like this:
# THIS WORKS
library(mlr)
lrns = list(makeLearner("classif.knn"), makeLearner("classif.svm"))
tasks = list(iris.task, sonar.task)
rdesc = makeResampleDesc("CV", iters = 2L)
meas = list(acc)
bmr = benchmark(lrns, tasks, rdesc, measures = meas)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
This requires the evaluation results to reside in a rather complex BenchmarkResult object, although the data is basically a matrix (where M[i, j] holds the score of classifier i for data set j).
I have previously generated such data in a Python workflow and imported it into R as a data.frame (as there seems to be no Python package for such plots).
How can I generate a CD plot from this data?
I thought about creating a BenchmarkResult from the data.frame, but didn't know where to start:
# THIS DOES NOT WORK
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets, each
results = data.frame(replicate(5, runif(30, 0, 1)))
# This is the functionality I'm looking for
bmr = benchmarkResultFromDataFrame(results)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
I finally managed to create the plot. It is necessary to set only a handful of the BenchmarkResult's attributes:
learners, with an id and short.name for each classifier
measures
results, with an aggr entry for each dataset/classifier combination
The code may then look like this (a smaller example with 5 datasets):
library(mlr)

# Here I would import results from my experiments instead of using random data,
# e.g. scores for 5 classifiers and 30 data sets each
results <- data.frame(replicate(5, runif(30, 0, 1)))
clf <- c('clf1', 'clf2', 'clf3', 'clf4', 'clf5')
clf.short.name <- c('c1', 'c2', 'c3', 'c4', 'c5')
dataset <- c('dataset1', 'dataset2', 'dataset3', 'dataset4', 'dataset5')
score <- list(acc)

# Setting up the learners: id, short.name
bmr <- list()
for (i in 1:5) {
  bmr$learners[[clf[i]]]$id <- clf[i]
  bmr$learners[[clf[i]]]$short.name <- clf.short.name[i]
}

# Setting up the measures
bmr$measures <- score

# Setting up the results
for (i in 1:5) {
  bmr$results$`dataset1`[[clf[i]]]$aggr <- list('acc.test.mean' = results[1, i])
}
for (i in 1:5) {
  bmr$results$`dataset2`[[clf[i]]]$aggr <- list('acc.test.mean' = results[2, i])
}
for (i in 1:5) {
  bmr$results$`dataset3`[[clf[i]]]$aggr <- list('acc.test.mean' = results[3, i])
}
for (i in 1:5) {
  bmr$results$`dataset4`[[clf[i]]]$aggr <- list('acc.test.mean' = results[4, i])
}
for (i in 1:5) {
  bmr$results$`dataset5`[[clf[i]]]$aggr <- list('acc.test.mean' = results[5, i])
}

# Set BenchmarkResult class
class(bmr) <- "BenchmarkResult"

# Statistics and plot
cd <- generateCritDifferencesData(bmr)
plotCritDifferences(cd)
Anyone who could teach me better R to avoid these for loops and code duplication would still be very welcome!
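For completeness, one way to collapse the five duplicated result loops into a single nested loop (an untested sketch over the same bmr structure built above, run in place of those loops, before the class(bmr) assignment):
# fill bmr$results for every dataset/classifier pair in one pass
for (d in seq_along(dataset)) {
  for (i in seq_along(clf)) {
    bmr$results[[dataset[d]]][[clf[i]]]$aggr <- list('acc.test.mean' = results[d, i])
  }
}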