I have a data frame called gfe_obj with structure as follows:
And I want to write an extract function such that when I run the code below, I get the corresponding output:
Currently, I have:
str(gfe_obj)
'[.gfe_obj' <- function(x,i) {
class(x) <- "gfe"
as.gfe_obj(x[i])
}
sub_gfe_obj <- gfe_obj[1:3]
str(sub_gfe_obj)
But when I run this code, I get Error in as.gfe_obj(x[i]) : could not find function "as.gfe_obj".
I referenced the method to write from here : How to implement extracting/subsetting ([, [<-, [[, [[<-) functions for custom S3 classes?
Thank you for your help.
I'm not sure what the exact structure of your gfe class is supposed to be, but assuming it is a list consisting of two objects (a 3D array called frames and a data frame called info with the same number of rows as the third dimension of frames, then your S3 method would be:
`[.gfe`<- function(x, i) {
x$frames <- x$frames[,,i]
x$info <- x$info[i,]
x
}
To test this, I need a mock class constructor and some dummy data:
gfe <- function(frames, info) {
structure(list(frames = frames, info = info), class = "gfe")
}
gfe_obj <- gfe(frames = array(1:90, dim = c(3, 3, 10)),
info = data.frame(x = 1:10, y = letters[1:10]))
str(gfe_obj)
#> List of 2
#> $ frames: int [1:3, 1:3, 1:10] 1 2 3 4 5 6 7 8 9 10 ...
#> $ info :'data.frame': 10 obs. of 2 variables:
#> ..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
#> ..$ y: chr [1:10] "a" "b" "c" "d" ...
#> - attr(*, "class")= chr "gfe"
Now we can see the extractor method works as expected:
sub_gfe_obj <- gfe_obj[2:3]
str(sub_gfe_obj)
#> List of 2
#> $ frames: int [1:3, 1:3, 1:2] 10 11 12 13 14 15 16 17 18 19 ...
#> $ info :'data.frame': 2 obs. of 2 variables:
#> ..$ x: int [1:2] 2 3
#> ..$ y: chr [1:2] "b" "c"
#> - attr(*, "class")= chr "gfe"
Created on 2022-09-25 with reprex v2.0.2
My data has 1,000 entries and here is the str of the first 2 elements:
> str(my_boots[1:2])
List of 2
$ :List of 4
..$ result : Named num [1:10] 0.118 0.948 4.317 1.226 1.028 ...
.. ..- attr(*, "names")= chr [1:10] "(Intercept)" "pvi2" "freqchal" "sexexp" ...
..$ output : chr "list()"
..$ warnings: chr(0)
..$ messages: chr(0)
$ :List of 4
..$ result : Named num [1:10] 0.202 0.995 2.512 1.057 0.5 ...
.. ..- attr(*, "names")= chr [1:10] "(Intercept)" "pvi2" "freqchal" "sexexp" ...
..$ output : chr "list()"
..$ warnings: chr(0)
..$ messages: chr(0)
The fields of interest are $result and $warnings; I want to return a tibble with the columns based on the names within the named list result where warning == "" (where no warning).
I'm new to purrr but I was able to get most of the way there using map_dfr(my_boots[1:2],"result") - this returns a tibble with the column names from the named numbers list but I would like to only return the ones where the entry under warnings is blank.
I wasn't sure how to create this structure manually but was able to create a single element of my_boots:
test <- list(
list("warnings" = c("blah")),
list("result" = c("alpha" = 1.1, "beta" = 2.1, "theta" =3.1, "blah" = 4.1))
)
Also: I'm using the tidyverse - thank you.
Starting with some dummy data.
library(tidyverse)
l <- list(
list(
result = 1:10,
warnings = character(0)
),
list(
result = 2:20,
warnings = "warn"
),
list(
result = 3:30,
warnings = character(0)
),
list(
result = 4:40,
warnings = "warn"
)
)
Use keep to keep only elements without warnings. map("result") pulls the result element out of each list.
l %>%
keep(~is_empty(.$warnings)) %>%
map("result")
#> [[1]]
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> [[2]]
#> [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [22] 24 25 26 27 28 29 30
I have a large list of lists. There are 46 lists in "output". Each list is a tibble with differing number of rows and columns. My immediate goal is to subset a specific column from each list.
This is str(output) of the first two lists to give you an idea of the data.
> str(output)
List of 46
$ Brain :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6108 obs. of 8 variables:
..$ p_val : chr [1:6108] "0" "1.60383253411205E-274" "0" "0" ...
..$ avg_diff : num [1:6108] 1.71 1.7 1.68 1.6 1.58 ...
..$ pct.1 : num [1:6108] 0.998 0.808 0.879 0.885 0.923 0.905 0.951 0.957 0.619 0.985 ...
..$ pct.2 : num [1:6108] 0.677 0.227 0.273 0.323 0.36 0.384 0.401 0.444 0.152 0.539 ...
..$ cluster : num [1:6108] 1 1 1 1 1 1 1 1 1 1 ...
..$ gene : chr [1:6108] "Plp1" "Mal" "Ermn" "Stmn4" ...
..$ X__1 : logi [1:6108] NA NA NA NA NA NA ...
..$ Cell Type: chr [1:6108] "Myelinating oligodendrocyte" NA NA NA ...
$ Bladder :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4656 obs. of 8 variables:
..$ p_val : num [1:4656] 0.00 1.17e-233 2.85e-276 0.00 0.00 ...
..$ avg_diff : num [1:4656] 2.41 2.23 2.04 2.01 1.98 ...
..$ pct.1 : num [1:4656] 0.833 0.612 0.855 0.987 1 0.951 0.711 0.544 0.683 0.516 ...
..$ pct.2 : num [1:4656] 0.074 0.048 0.191 0.373 0.906 0.217 0.105 0.044 0.177 0.106 ...
..$ cluster : num [1:4656] 1 1 1 1 1 1 1 1 1 1 ...
..$ gene : chr [1:4656] "Dpt" "Gas1" "Cxcl12" "Lum" ...
..$ X__1 : logi [1:4656] NA NA NA NA NA NA ...
..$ Cell Type: chr [1:4656] "Stromal cell_Dpt high" NA NA NA ...
Since I have a large number of lists that make up the list, I have been trying to create an iterative code to perform tasks. This hasn't been successful.
I can achieve this manually, or list by list, but I haven't been successful in finding an iterative way of doing this.
x <- data.frame(output$Brain, stringsAsFactors = FALSE)
tmp.list <- x$Cell.Type
tmp.output <- purrr::discard(tmp.list, is.na)
x <- subset(x, Cell.Type %in% tmp.output)
This gives me the output that I want, which are the rows in the column "Cell.Type" with non-NA values.
I got as far as the code below to get the 8th column of each list, which is the "Cell.Type" column.
lapply(output, "[", , 8))
But here I found that the naming and positioning of the "Cell.Type" column in each list is not consistent. This means I cannot use the lapply function to subset the 8th columns, as some lists have this on for example the 9th column.
I tried the code below, but it does not work and gets an error.
lapply(output, "[", , c('Cell.Type', 'celltyppe'))
#Error: Column `celltyppe` not found
#Call `rlang::last_error()` to see a backtrace
Essentially, from my "output" list, I want to subset either columns "Cell.Type" or "celltyppe" from each of the 46 lists to create a new list with 46 lists of just a single column of values. Then I want to drop all rows with NA.
I would like to perform this using some sort of loop.
At the moment I have not had much success. Lapply seems to be able to extract columns through lists iterately, and I am having difficultly trying to subset names columns.
Once I can do this, I then want to create a loop that can subset only rows without NA.
FINAL CODE
This is the final code I have used to create exactly what I had hoped for. The first line of the code specifies the loop to go through each list of the large list. The second line of code selects columns of each list that contains "ell" in its name (Cell type, Cell Type, or celltyppe). The last removes any rows with "na".
purrr::map(output, ~ .x %>%
dplyr::select(matches("ell")) %>%
na.omit)
We can use anonymous function call
lapply(output, function(x) na.omit(x[grep("(?i)Cell\\.?(?i)Typp?e", names(x))]))
#[[1]]
# Cell.Type
#1 1
#2 2
#3 3
#4 4
#5 5
#[[2]]
# celltyppe
#1 7
#2 8
#3 9
#4 10
#5 11
Also with purrr
library(tidyverse)
map(output, ~ .x %>%
select(matches("(?i)Cell\\.?(?i)Typp?e") %>%
na.omit))
data
output <- list(data.frame(Cell.Type = 1:5, col1 = 6:10, col2 = 11:15),
data.frame(coln = 1:5, celltyppe = 7:11))
Hi if I have a list of list like so,
List of 5
$ 1:List of 2
..$ a: Named num [1:36] 3.29 3.25 3.36 3.26 3.34 ...
.. ..- attr(*, "names")= chr [1:36] "V1" "V2" "V3" "V4" ...
..$ b: Named num [1:36] 0.659 0.65 0.672 0.652 0.669 ...
say its called l, is there a way I can extract all the 'a' element of the list of list? Currently I can extract a single 'a' element as such, l[[5]] [['sr']] but when I try something like
l[[1:5]] [['sr']] or l[[1,5]] [['sr']] it fails. Help would greatly appreciated thanks!
Is this what you want?
l <- list(list(a=1:3, b=1:3),
list(a=3:1, b=3:1))
lapply(l, function(x) x[["a"]])
[[1]]
[1] 1 2 3
[[2]]
[1] 3 2 1
How to split automatically a matrix using R for 5-fold cross-validation?
I actually want to generate the 5 sets of (test_matrix_indices, train matrix_indices).
I suppose you want the matrix rows to be the cases to split. Then all you need is sample and split :
X <- matrix(rnorm(1000),ncol=5)
id <- sample(1:5,nrow(X),replace=TRUE)
ListX <- split(x,id) # gives you a list with the 5 matrices
X[id==2,] # gives you the second matrix
I'd work with the list, as it allows you to do something like :
names(ListX) <- c("Train1","Train2","Train3","Test1","Test2")
mean(ListX$Train3)
which makes for code that's easier to read, and keeps you from creating tons of matrices in your workspace. You're bound to mess up if you put the matrices individually in your workspace. Use lists!
In case you want the test matrix to be smaller or larger than the other ones, use the prob argument of sample :
id <- sample(1:5,nrow(X),replace=TRUE,prob=c(0.15,0.15,0.15,0.15,0.3))
gives you a test matrix that's double the size of the train matrices.
In case you want to determine the exact number of cases, sample and prob aren't the best options. You could use a trick like :
indices <- rep(1:5,c(100,20,20,20,40))
id <- sample(indices)
to get matrices with respectively 100, 20, ... and 40 cases.
f_K_fold <- function(Nobs,K=5){
rs <- runif(Nobs)
id <- seq(Nobs)[order(rs)]
k <- as.integer(Nobs*seq(1,K-1)/K)
k <- matrix(c(0,rep(k,each=2),Nobs),ncol=2,byrow=TRUE)
k[,1] <- k[,1]+1
l <- lapply(seq.int(K),function(x,k,d)
list(train=d[!(seq(d) %in% seq(k[x,1],k[x,2]))],
test=d[seq(k[x,1],k[x,2])]),k=k,d=id)
return(l)
}
Solution without split:
set.seed(7402313)
X <- matrix(rnorm(999), ncol=3)
k <- 5 # number of folds
# Generating random indices
id <- sample(rep(seq_len(k), length.out=nrow(X)))
table(id)
# 1 2 3 4 5
# 67 67 67 66 66
# lapply over them:
indicies <- lapply(seq_len(k), function(a) list(
test_matrix_indices = which(id==a),
train_matrix_indices = which(id!=a)
))
str(indicies)
# List of 5
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 12 13 14 17 18 20 23 28 41 45 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 5 6 7 8 9 10 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 4 19 31 36 47 53 58 67 83 89 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 5 6 7 8 9 10 11 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 5 8 9 30 32 35 37 56 59 60 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 6 7 10 11 12 13 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 1 2 3 6 21 24 27 29 33 34 ...
# ..$ train_matrix_indices: int [1:267] 4 5 7 8 9 10 11 12 13 14 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 7 10 11 15 16 22 25 26 40 42 ...
# ..$ train_matrix_indices: int [1:267] 1 2 3 4 5 6 8 9 12 13 ...
But you could return matrices too:
matrices <- lapply(seq_len(k), function(a) list(
test_matrix = X[id==a, ],
train_matrix = X[id!=a, ]
))
str(matrices)
List of 5
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] -1.0132 -1.3657 -0.3495 0.6664 0.0762 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.484 0.418 -0.622 0.996 0.414 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.682 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.682 0.812 -1.111 -0.467 0.37 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] -0.65 0.797 0.689 0.186 -1.398 ...
# ..$ train_matrix: num [1:267, 1:3] 0.484 0.682 0.473 0.812 -1.111 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] 0.473 0.212 -2.175 -0.746 1.707 ...
# ..$ train_matrix: num [1:267, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
Then you could use lapply to get results:
lapply(matrices, function(x) {
m <- build_model(x$train_matrix)
performance(m, x$test_matrix)
})
Edit: compare to Wojciech's solution:
f_K_fold <- function(Nobs, K=5){
id <- sample(rep(seq.int(K), length.out=Nobs))
l <- lapply(seq.int(K), function(x) list(
train = which(x!=id),
test = which(x==id)
))
return(l)
}
Edit : Thanks for your answers.
I have found the following solution (http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Validation_Croisee_Suite.pdf) :
n <- nrow(mydata)
K <- 5
size <- n %/% K
set.seed(5)
rdm <- runif(n)
ranked <- rank(rdm)
block <- (ranked-1) %/% size+1
block <- as.factor(block)
Then I use :
for (k in 1:K) {
matrix_train<-matrix[block!=k,]
matrix_test<-matrix[block==k,]
[Algorithm sequence]
}
in order to generate the adequate sets for each iterations.
However this solution can omit one individual for tests. I do not recommend it.
Below does the trick without having to create separate data.frames/matrices, all you need to do is to keep an integer sequnce, id that stores the shuffled indices for each fold.
X <- read.csv('data.csv')
k = 5 # number of folds
fold_size <-nrow(X)/k
indices <- rep(1:k,rep(fold_size,k))
id <- sample(indices, replace = FALSE) # random draws without replacement
log_models <- new.env(hash=T, parent=emptyenv())
for (i in 1:k){
train <- X[id != i,]
test <- X[id == i,]
# run algorithm, e.g. logistic regression
log_models[[as.character(i)]] <- glm(outcome~., family="binomial", data=train)
}
The sperrorest package provides this ability. You can choose between a random split (partition.cv()), a spatial split (partition.kmeans()), or a split based on factor levels (partition.factor.cv()). The latter is currently only available in the Github version.
Example:
library(sperrorest)
data(ecuador)
## non-spatial cross-validation:
resamp <- partition.cv(ecuador, nfold = 5, repetition = 1:1)
# first repetition, second fold, test set indices:
idx <- resamp[['1']][[2]]$test
# test sample used in this particular repetition and fold:
ecuador[idx , ]
If you have a spatial data set (with coords), you can also visualize your generated folds
# this may take some time...
plot(resamp, ecuador)
Cross-validation can then be performed using sperrorest() (sequential) or parsperrorest() (parallel).