How can I automatically split a matrix in R for 5-fold cross-validation?
I actually want to generate the 5 sets of (test_matrix_indices, train_matrix_indices).
I suppose you want the matrix rows to be the cases to split. Then all you need is sample and split :
X <- matrix(rnorm(1000),ncol=5)
id <- sample(1:5,nrow(X),replace=TRUE)
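# split the row indices by id, then subset; split(X, id) would treat the matrix as a plain vector
ListX <- lapply(split(seq_len(nrow(X)), id), function(i) X[i, , drop = FALSE]) # gives you a list with the 5 matrices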
X[id==2,] # gives you the second matrix
I'd work with the list, as it allows you to do something like :
names(ListX) <- c("Train1","Train2","Train3","Test1","Test2")
mean(ListX$Train3)
which makes for code that's easier to read, and keeps you from creating tons of matrices in your workspace. You're bound to mess up if you put the matrices individually in your workspace. Use lists!
In case you want the test matrix to be smaller or larger than the other ones, use the prob argument of sample :
id <- sample(1:5,nrow(X),replace=TRUE,prob=c(0.15,0.15,0.15,0.15,0.3))
gives you a fifth matrix that is, on average, double the size of the others. The weights need not sum to 1; sample rescales them.
In case you want to determine the exact number of cases, sample and prob aren't the best options. You could use a trick like :
indices <- rep(1:5,c(100,20,20,20,40))
id <- sample(indices)
to get matrices with respectively 100, 20, ... and 40 cases.
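To check that the shuffled labels really give the requested sizes, and to turn them into row-index sets, something like the following sketch could be used. The names sizes, folds and fold1 are only illustrative, and the seed is arbitrary.

```r
set.seed(1)                                  # arbitrary seed, only for reproducibility
X <- matrix(rnorm(1000), ncol = 5)           # 200 rows, as in the example above
sizes <- c(100, 20, 20, 20, 40)              # exact group sizes; must sum to nrow(X)
stopifnot(sum(sizes) == nrow(X))

id <- sample(rep(seq_along(sizes), sizes))   # shuffled group labels, one per row
folds <- split(seq_len(nrow(X)), id)         # list of row indices per group

lengths(folds)                               # group sizes, in label order
fold1 <- X[folds[["1"]], , drop = FALSE]     # rows belonging to the first group
```

Keeping the row indices rather than the matrices makes it easy to subset a response vector with the same folds later.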
f_K_fold <- function(Nobs,K=5){
rs <- runif(Nobs)
id <- seq(Nobs)[order(rs)]
k <- as.integer(Nobs*seq(1,K-1)/K)
k <- matrix(c(0,rep(k,each=2),Nobs),ncol=2,byrow=TRUE)
k[,1] <- k[,1]+1
l <- lapply(seq.int(K),function(x,k,d)
list(train=d[!(seq(d) %in% seq(k[x,1],k[x,2]))],
test=d[seq(k[x,1],k[x,2])]),k=k,d=id)
return(l)
}
Solution without split:
set.seed(7402313)
X <- matrix(rnorm(999), ncol=3)
k <- 5 # number of folds
# Generating random indices
id <- sample(rep(seq_len(k), length.out=nrow(X)))
table(id)
# 1 2 3 4 5
# 67 67 67 66 66
# lapply over them:
indices <- lapply(seq_len(k), function(a) list(
test_matrix_indices = which(id==a),
train_matrix_indices = which(id!=a)
))
str(indices)
# List of 5
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 12 13 14 17 18 20 23 28 41 45 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 5 6 7 8 9 10 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 4 19 31 36 47 53 58 67 83 89 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 5 6 7 8 9 10 11 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 5 8 9 30 32 35 37 56 59 60 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 6 7 10 11 12 13 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 1 2 3 6 21 24 27 29 33 34 ...
# ..$ train_matrix_indices: int [1:267] 4 5 7 8 9 10 11 12 13 14 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 7 10 11 15 16 22 25 26 40 42 ...
# ..$ train_matrix_indices: int [1:267] 1 2 3 4 5 6 8 9 12 13 ...
But you could return matrices too:
matrices <- lapply(seq_len(k), function(a) list(
test_matrix = X[id==a, ],
train_matrix = X[id!=a, ]
))
str(matrices)
# List of 5
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] -1.0132 -1.3657 -0.3495 0.6664 0.0762 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.484 0.418 -0.622 0.996 0.414 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.682 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.682 0.812 -1.111 -0.467 0.37 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] -0.65 0.797 0.689 0.186 -1.398 ...
# ..$ train_matrix: num [1:267, 1:3] 0.484 0.682 0.473 0.812 -1.111 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] 0.473 0.212 -2.175 -0.746 1.707 ...
# ..$ train_matrix: num [1:267, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
Then you could use lapply to get results:
lapply(matrices, function(x) {
m <- build_model(x$train_matrix)
performance(m, x$test_matrix)
})
Edit: compare to Wojciech's solution:
f_K_fold <- function(Nobs, K=5){
id <- sample(rep(seq.int(K), length.out=Nobs))
l <- lapply(seq.int(K), function(x) list(
train = which(x!=id),
test = which(x==id)
))
return(l)
}
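A quick sanity check for a function of this kind, as a sketch: every observation should land in exactly one test set, and train and test should never overlap within a fold. The function is repeated in compact form so the snippet runs on its own; the seed and the 23 observations are arbitrary.

```r
f_K_fold <- function(Nobs, K = 5) {
  id <- sample(rep(seq.int(K), length.out = Nobs))
  lapply(seq.int(K), function(x) list(train = which(x != id), test = which(x == id)))
}

set.seed(42)                                   # arbitrary seed
folds <- f_K_fold(23, K = 5)

# TRUE when the test sets together cover 1..23 with no repeats
all_test <- unlist(lapply(folds, `[[`, "test"))
covers_all <- identical(sort(all_test), seq_len(23))

# TRUE when no fold shares an index between its train and test sets
disjoint <- all(vapply(folds, function(f) length(intersect(f$train, f$test)) == 0,
                       logical(1)))
c(covers_all = covers_all, disjoint = disjoint)
```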
Edit : Thanks for your answers.
I have found the following solution (http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Validation_Croisee_Suite.pdf) :
n <- nrow(mydata)
K <- 5
size <- n %/% K
set.seed(5)
rdm <- runif(n)
ranked <- rank(rdm)
block <- (ranked-1) %/% size+1
block <- as.factor(block)
Then I use :
for (k in 1:K) {
matrix_train <- mydata[block!=k,]
matrix_test <- mydata[block==k,]
# [Algorithm sequence]
}
in order to generate the adequate sets for each iteration.
However, when n is not a multiple of K, this solution puts the n %% K leftover individuals in an extra block, so they are never used for testing. I do not recommend it.
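One possible way around the omitted individuals, sketched here under the assumption that block sizes may differ by one, is to cycle the ranks through the K blocks with a modulo instead of an integer division, so that every individual gets a block between 1 and K. Here n is set by hand and stands in for nrow(mydata).

```r
n <- 23                            # illustrative value; stands in for nrow(mydata)
K <- 5
set.seed(5)
ranked <- rank(runif(n))           # random permutation of 1..n
block <- (ranked - 1) %% K + 1     # ranks cycle through 1..K, so nobody is left over
table(block)                       # block sizes differ by at most one
```

The loop over k in 1:K can then be used unchanged.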
The code below does the trick without creating separate data frames or matrices for every fold; all you need is an integer vector, id, that stores the shuffled fold label of each row.
X <- read.csv('data.csv')
k <- 5 # number of folds
# fold labels 1..k, as evenly sized as possible even if nrow(X) is not a multiple of k
indices <- rep(seq_len(k), length.out = nrow(X))
id <- sample(indices, replace = FALSE) # random draws without replacement
log_models <- new.env(hash=TRUE, parent=emptyenv())
for (i in 1:k){
train <- X[id != i,]
test <- X[id == i,]
# run algorithm, e.g. logistic regression
log_models[[as.character(i)]] <- glm(outcome~., family="binomial", data=train)
}
The sperrorest package provides this ability. You can choose between a random split (partition.cv()), a spatial split (partition.kmeans()), or a split based on factor levels (partition.factor.cv()). The latter is currently only available in the GitHub version.
Example:
library(sperrorest)
data(ecuador)
## non-spatial cross-validation:
resamp <- partition.cv(ecuador, nfold = 5, repetition = 1:1)
# first repetition, second fold, test set indices:
idx <- resamp[['1']][[2]]$test
# test sample used in this particular repetition and fold:
ecuador[idx , ]
If you have a spatial data set (with coords), you can also visualize your generated folds:
# this may take some time...
plot(resamp, ecuador)
Cross-validation can then be performed using sperrorest() (sequential) or parsperrorest() (parallel).
Related
So briefly and without further ado - is it possible to retrieve only a name of element in list and use it as a main title of plot?
Let me explain - example:
Let's create a random df:
a <- c(1,2,3,4)
b <- runif(4)
c <- runif(4)
d <- runif(4)
e <- runif(4)
f <- runif(4)
df <- data.frame(a,b,c,d,e,f)
head(df)
a b c d e f
1 1 0.9694204 0.9869154 0.5386678 0.39331278 0.15054698
2 2 0.8949330 0.9910894 0.1009689 0.03632476 0.15523628
3 3 0.4930752 0.7179144 0.6957262 0.36579883 0.32006026
4 4 0.4850141 0.5539939 0.3196953 0.14348259 0.05292068
Then I want to create a list of data frames (based on the one above) with specific columns to make plots. In other words, I'd like to make plots where the first column of df (a) is the x axis and columns b, c, d, e and f each supply the values on the y axis. Yes, there will be 5 plots - that's the point!
So my idea was to write a simple function that creates a list of data frames based on the one created above:
my_fun <- function(x){
a <- df[1]
b <- x
aname <- "x_label"
bname <- "y_label"
df <- data.frame(a,b)
names(df) <- c(aname,bname)
return(df)
}
Run it for all (specified) columns:
df_s <- apply(df[,2:6], 2, function(x) my_fun(x))
So I have now:
class(df_s)
[1] "list"
str(df_s)
List of 5
$ b:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.969 0.895 0.493 0.485
$ c:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.987 0.991 0.718 0.554
$ d:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.539 0.101 0.696 0.32
$ e:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.3933 0.0363 0.3658 0.1435
$ f:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.1505 0.1552 0.3201 0.0529
That's what I wanted, but here's the question. I'd like to create a plot for every df in my list... As a result I want 5 plots with main titles b, c, d, e, f respectively. The axis labels are the same for every plot; only the title differs. So I tried:
lapply(df_s, function(x) plot(x[2] ~ x[1], data = x, main = ???))
What should go in place of the question marks? I tried main = names(df_s)[x], but it didn't work...
I think the following works. However, I think it might be best to use ggplot2 instead of the plot function (unless you are saving the plots inside lapply).
lapply(seq_along(df_s), function(x)
plot(df_s[[x]][,2] ~ df_s[[x]][,1],
xlab = names(df_s[[x]])[1],
ylab = names(df_s[[x]])[2],
main = names(df_s[x])))
With ggplot2
library(ggplot2)
plot_lst <- lapply(seq_along(df_s), function(i) {
ggplot(df_s[[i]], aes(x=x_label, y=y_label)) +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(names(df_s)[i]) })
I have a list of 100 items.
I want to split it after each 10th item in Code 1.
Code 2 is about a list of two former lists and splitting it to 20 lists of 10 items each.
Code 1
Expected output: ten lists of 10 items.
A <- 100
a <- rnorm(A) # [1:100]
n <- 10
str(a)
# Not resulting in equal size of chunks with vectors so reject
# http://stackoverflow.com/a/3321659/54964
#d <- split(d, ceiling(seq_along(d)/(length(d)/n)))
# Works for vectors but not with lists
# http://stackoverflow.com/a/16275428/54964
#d <- function(d,n) split(d, cut(seq_along(d), n, labels = FALSE))
str(d)
Test code 2
Input: a list of two lists
aa <- list(a, rnorm(a))
Expected output: 20 lists of 10 item size
Testing Loki's answer
segmentLists <- function(A, segmentSize) {
res <- lapply(A, function(x) split(unlist(x), cut(seq_along(unlist(x)), segmentSize, labels = F)))
#print(res)
res <- unlist(res, recursive = F)
}
segmentLists(aa, 10)
Output: loop going on, never stopping
OS: Debian 8.5
R: 3.3.1
You can use lapply.
aa <- list(a, rnorm(a))
aa
n <- 10
x <- lapply(aa, function(x) split(unlist(x), cut(seq_along(unlist(x)), n, labels = F)))
y <- unlist(x, recursive = F)
str(y)
# List of 20
# $ 1 : num [1:10] 1.0895 -0.0477 0.225 -0.6308 -0.1558 ...
# $ 2 : num [1:10] -0.469 -0.381 0.709 -0.798 1.183 ...
# $ 3 : num [1:10] 0.757 -1.128 -1.394 -0.712 0.494 ...
# $ 4 : num [1:10] 1.135 0.324 0.75 -0.83 0.794 ...
# $ 5 : num [1:10] -0.786 -0.068 -0.179 0.354 -0.597 ...
# $ 6 : num [1:10] -0.115 0.164 -0.365 -1.827 -2.036 ...
...
length(y)
# [1] 20
To remove the names of the list elements in y ($ 1, $ 2 etc.), you can use unname():
str(unname(y))
# List of 20
# $ : num [1:10] 1.0895 -0.0477 0.225 -0.6308 -0.1558 ...
# $ : num [1:10] -0.469 -0.381 0.709 -0.798 1.183 ...
# $ : num [1:10] 0.757 -1.128 -1.394 -0.712 0.494 ...
# $ : num [1:10] 1.135 0.324 0.75 -0.83 0.794 ...
# $ : num [1:10] -0.786 -0.068 -0.179 0.354 -0.597 ...
...
Using a function, you have to return res at the end of the function.
segmentLists <- function(A, segmentSize)
{
res <- lapply(A, function(x) split(unlist(x), cut(seq_along(unlist(x)), segmentSize, labels = F)))
#print(res)
res <- unlist(res, recursive = F)
res <- unname(res)
res
}
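Note that cut(seq_along(x), n) treats n as the number of chunks, which only coincides with the chunk size here because 100 / 10 = 10. If a fixed chunk size is what is wanted, a sketch along the following lines may be closer; the helper name chunk_by_size is hypothetical and the input is made up.

```r
# Hypothetical helper: cut a vector or list into consecutive chunks of `size` items
chunk_by_size <- function(x, size) {
  unname(split(x, ceiling(seq_along(x) / size)))
}

aa <- list(rnorm(100), rnorm(100))     # two inputs of 100 values each
chunks <- unlist(lapply(aa, chunk_by_size, size = 10), recursive = FALSE)
length(chunks)                         # number of chunks
lengths(chunks)                        # size of each chunk
```

If the length is not a multiple of size, the last chunk of each input is simply shorter.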
Here is my example data frame. It includes a list column named "dta", which holds the n values I want to keep for each of my scenarios:
set.seed(777)
df <- data.frame(theo = numeric(),
size = numeric(),
dta = I(list()))
df[ 1: 5,"theo"] <- qlnorm(0.1, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 6:10,"theo"] <- qlnorm(0.2, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 1: 5,"size"] <- 10
df[ 6:10,"size"] <- 20
for(i in 1:10){
df$dta[i] <- list(rlnorm(df$size[i], meanlog = 0, sdlog = 1))
}
df
str(df)
This should give a df like:
theo size dta
1 0.2776062 10 1.631967....
2 0.2776062 10 0.737667....
3 0.2776062 10 0.131252....
4 0.2776062 10 1.937334....
5 0.2776062 10 0.739868....
6 0.4310112 20 4.631176....
7 0.4310112 20 2.610180....
8 0.4310112 20 0.175918....
9 0.4310112 20 3.501670....
10 0.4310112 20 0.588178....
or:
'data.frame': 10 obs. of 4 variables:
$ theo: num 0.278 0.278 0.278 0.278 0.278 ...
$ size: num 10 10 10 10 10 20 20 20 20 20
$ dta :List of 10
..$ : num 1.632 0.671 1.667 0.671 5.148 ...
..$ : num 0.738 1.056 0.152 0.967 10.089 ...
..$ : num 0.131 1.256 0.457 3.574 4.211 ...
..$ : num 1.937 2.359 3.496 0.297 4.587 ...
..$ : num 0.74 0.66 0.481 0.434 1.874 ...
..$ : num 4.631 0.298 10.28 0.933 1.286 ...
..$ : num 2.61 0.472 0.251 1.61 0.303 ...
..$ : num 0.176 0.566 2.156 0.407 3.52 ...
..$ : num 3.502 1.748 1.283 0.648 1.359 ...
..$ : num 0.588 0.392 2.447 1.926 0.86 ...
..- attr(*, "class")= chr "AsIs"
Now, I want to subset that list in such a way that:
for each list, each value is compared with the fixed value "theo" stored in the dataframe
when that value is below or equal to "theo", then recode that value NA
Here is working code that gives me exactly what I want:
df$dta2 <- df$dta
for(i in 1:10){
df$dta2[[i]] [ df$dta2[[i]] <= df$theo[i] ] <- NA
}
However, I was wondering if there is a way to get the same result with a single line of code and no "for loop", i.e. a conditional replacement of values contained in a list that is nested in a data frame.
We can use Map
df$dta3 <- Map(function(x,y) replace(x, x<=y, NA), df$dta, df$theo)
all.equal(df$dta2, df$dta3, check.attributes=FALSE)
#[1] TRUE
I have a function which contains a loop over two lists and builds up some calculated data. I would like to return these data as a list of lists, indexed by some value, but I'm getting the assignment wrong.
A minimal example of what I'm trying to do, and where I'm going wrong, would be:
mybiglist <- list()
for(i in 1:5){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
name <- paste('item:',i,sep='')
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[name]] <- append(mybiglist, tmp)
}
If you run this and look at the output mybiglist, you will see that something is going very wrong in the way each item is being named.
Any ideas on how I might achieve what I actually want?
Thanks
ps. I know that in R there is a sense in which one has failed if one has to resort to loops, but in this case I do feel justified ;-)
It works if you don't use the append command:
mybiglist <- list()
for(i in 1:5){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
name <- paste('item:',i,sep='')
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[name]] <- tmp
}
# List of 5
# $ item:1:List of 3
# ..$ uniform : num [1:10] 0.737 0.987 0.577 0.814 0.452 ...
# ..$ normal : num [1:16] -0.403 -0.104 2.147 0.32 1.713 ...
# ..$ binomial: num [1:8] 0 0 0 0 1 0 0 1
# $ item:2:List of 3
# ..$ uniform : num [1:10] 0.61 0.62 0.49 0.217 0.862 ...
# ..$ normal : num [1:16] 0.945 -0.154 -0.5 -0.729 -0.547 ...
# ..$ binomial: num [1:8] 1 2 2 0 2 1 0 2
# $ item:3:List of 3
# ..$ uniform : num [1:10] 0.66 0.094 0.432 0.634 0.949 ...
# ..$ normal : num [1:16] -0.607 0.274 -1.455 0.828 -0.73 ...
# ..$ binomial: num [1:8] 2 2 3 1 1 1 2 0
# $ item:4:List of 3
# ..$ uniform : num [1:10] 0.455 0.442 0.149 0.745 0.24 ...
# ..$ normal : num [1:16] 0.0994 -0.5332 -0.8131 -1.1847 -0.8032 ...
# ..$ binomial: num [1:8] 2 3 1 1 2 2 2 1
# $ item:5:List of 3
# ..$ uniform : num [1:10] 0.816 0.279 0.583 0.179 0.321 ...
# ..$ normal : num [1:16] -0.036 1.137 0.178 0.29 1.266 ...
# ..$ binomial: num [1:8] 3 4 3 4 4 2 2 3
Change
mybiglist[[name]] <- append(mybiglist, tmp)
to
mybiglist[[name]] <- tmp
To show that an explicit for loop is not required:
unif_norm <- replicate(5, list(uniform = runif(10),
normal = rnorm(16)), simplify=FALSE)
# n = 8 draws with size = 5, matching rbinom(8, 5, i/10) in the loop version
binomials <- lapply(seq_len(5)/10, function(prob) {
list(binomial = rbinom(n = 8, size = 5, prob = prob))})
biglist <- setNames(mapply(c, unif_norm, binomials, SIMPLIFY = FALSE),
paste0('item:',seq_along(unif_norm)))
In general, if you go down the for loop path, it is better to preallocate the list. This is more memory efficient.
mybiglist <- vector('list', 5)
names(mybiglist) <- paste0('item:', seq_along(mybiglist))
for(i in seq_along(mybiglist)){
a <- runif(10)
b <- rnorm(16)
c <- rbinom(8, 5, i/10)
tmp <- list(uniform=a, normal=b, binomial=c)
mybiglist[[i]] <- tmp
}
In some packages I see special types of objects. For example, I get the following message when I try to print a dataset from a package.
multitrait
This is an object of class "cross".
It is too complex to print, so we provide just this summary.
RI strains via selfing
No. individuals: 162
......................and other summary information
is (multitrait)
[1] "riself"
I wonder how we can create such an object. Are they special lists of data frames, matrices or vectors?
X <- c("A", "B", "C")
Y <- data.frame (A = 1:10, B = 21:30, C = 31:40)
myeq <- c("Y ~ X1 + Y1")
K <- 100
A = 1:20
B = 21:40
J <- as.matrix(A,B )
myl1 <- list(J, K)
Now my complex object:
mycomplexobject <- list(X, Y, myeq, K, J, myl1)
mycomplexobject
str(mycomplexobject)
List of 6
$ : chr [1:3] "A" "B" "C"
$ :'data.frame': 10 obs. of 3 variables:
..$ A: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ B: int [1:10] 21 22 23 24 25 26 27 28 29 30
..$ C: int [1:10] 31 32 33 34 35 36 37 38 39 40
$ : chr "Y ~ X1 + Y1"
$ : num 100
$ : int [1:20, 1] 1 2 3 4 5 6 7 8 9 10 ...
$ :List of 2
..$ : int [1:20, 1] 1 2 3 4 5 6 7 8 9 10 ...
..$ : num 100
is(mycomplexobject)
[1] "list" "vector"
Is there a way to make a special object that, instead of printing the whole list, shows a message like "it is too complex to print" and provides a summary?
Just set the class of your object and provide a print method.
class(mycomplexobject) <- c("too_complex", class(mycomplexobject))
print.too_complex <- function(x, ...) {
cat("Complex object of length", length(x), "\n")
invisible(x) # print methods conventionally return their argument invisibly
}
mycomplexobject
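As a slightly fuller sketch of the same idea, the class can be attached in a small constructor and the print method can list what the object contains. The constructor name new_too_complex and the wording of the summary are made up for illustration; this is not how any particular package does it.

```r
# Hypothetical constructor: wrap the components in a list and tag it with an S3 class
new_too_complex <- function(...) {
  structure(list(...), class = c("too_complex", "list"))
}

# print method: show a short summary instead of the full contents
print.too_complex <- function(x, ...) {
  cat("Complex object of length", length(x), "\n")
  cat("Component classes:",
      vapply(unclass(x), function(el) class(el)[1], character(1)), "\n")
  invisible(x)
}

obj <- new_too_complex(letters[1:3], data.frame(A = 1:10), 100)
obj                      # auto-printing dispatches to print.too_complex
str(unclass(obj))        # the full structure is still available on request
```

Because the object is still a list underneath, obj[[2]] and similar indexing keep working; only the printing changes.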