Organizing the files under the group name - r

My data are a list of files such as:
G085_1.csv, G085_2.csv, G085_3.csv, .. G100_1.csv, G100_2.csv, .. G173_1.csv, G173_2.csv, G173_3.csv
where G stands for the group, followed by the identifier of each group member (1, 2, or 3). Notably, some groups do not have all three members.
What I'm trying to do is create a loop that runs the following code (shown for one group) for all groups.
i1 <- fread("sample/G085_1.csv")
i2 <- fread("sample/G085_2.csv")
i3 <- fread("sample/G085_3.csv")
What I have been doing is:
Groups <- c()
for(g in 85:173){
Groups[g] <- ifelse(g<100,
paste0("G0", g),
paste0("G", g))
}
Members <- c("i1", "i2", "i3")
for(g in 1:length(Groups)){
for(m in 1:3) {
filename<- paste0("i",m)
wd <- paste0("sample/", Groups[g],"_",
m, ".csv")
ifelse(file.exists(wd),assign(filename,fread(wd)),
function(){})
}
assign(Groups[g],...
)
}
The place where I'm stuck is the last part (assign(Groups[g]...). I'm not sure how to collect the i1, i2, i3 data frames of each group under that group's name. Is there a better way than using the assign function here?

This code does not assign to i1, i2, i3 exactly, but it gives you a list whose items are named after the groups. Each group item is itself a list of three data frames read from the files. If a file does not exist, the corresponding item is NULL.
Using the foreach approach
library(foreach)
library(data.table) # for fread()
list_data <- foreach(g = Groups, .final = function(x) setNames(x, Groups)) %do% {
  foreach(m = 1:3) %do% {
    # g already holds the group name, so it is used directly in the path
    wd <- paste0("sample/", g, "_", m, ".csv")
    # ifelse() is vectorised and cannot return a data frame or NULL; use if/else
    if (file.exists(wd)) fread(wd) else NULL
  }
}
Using purrr map
library(purrr)
library(data.table) # for fread()
item_index <- c(1:3)
all_group_data <- map(.x = Groups, .f = function(g) {
list_files <- paste0("sample/", g,"_", item_index, ".csv")
group_data <- map(.x = list_files, .f = function(file) {
if (file.exists(file)) {
data <- fread(file)
} else {
data <- NULL
}
data
})
group_data
})
names(all_group_data) <- Groups
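A possible variation on the approaches above, sketched in base R: instead of generating every candidate file name and testing whether it exists, list the files that are actually present and split them by group prefix. The temporary directory, the demo files, and the names `demo_dir`, `group`, `member`, and `by_group` are assumptions made for illustration, and `read.csv` stands in for `fread` so the sketch has no package dependency.

```r
# Create a throwaway directory with a few demo files (G100 has one member only)
demo_dir <- tempfile("sample_")
dir.create(demo_dir)
for (f in c("G085_1.csv", "G085_2.csv", "G100_1.csv")) {
  write.csv(data.frame(value = 1:3), file.path(demo_dir, f), row.names = FALSE)
}

# List only the files that exist, then derive group and member from the names
files  <- list.files(demo_dir, pattern = "^G[0-9]{3}_[1-3][.]csv$", full.names = TRUE)
group  <- sub("_[1-3][.]csv$", "", basename(files))                        # e.g. "G085"
member <- paste0("i", sub("^.*_([1-3])[.]csv$", "\\1", basename(files)))   # e.g. "i1"

# One list entry per group, each holding that group's data frames by member
by_group <- lapply(split(setNames(files, member), group),
                   function(paths) lapply(paths, read.csv))

names(by_group)        # group names found on disk
names(by_group$G085)   # members present for that group
```

With this layout a missing member simply has no entry, so there are no NULL placeholders to skip later.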

Related

Loop-generated list of data frames not being joined by rbind properly

I have a table with samples of data named Sample_1, Sample_2, etc. I take user input as a string for which samples are wanted (Sample_1,Sample_3,Sample_5). Then after parsing the string, I have a for-loop which I pass each sample name to and the program filters the original dataset for the name and creates a DF with calculations. I then append the DF to a list after each iteration of the loop and at the end, I rbind the list for a complete DF.
sampleloop <- function(samplenames) {
data <- unlist(strsplit(samplenames, ","))
temp = list()
for(inc in 1:length(data)) {
df <- CT[CT[["Sample_Name"]] == data[inc],]
........
tempdf = goitemp
temp[inc] <- tempdf
}
newdf <- do.call(rbind.data.frame, temp)
}
The inner function on its own produces the correct wanted output. However, with the loop the function produces the following wrong DF if the input is "Sample_3,Sample_9":
I'm wondering if it has something to do with the rbind?
The issue seems to be the use of [ instead of [[ to access and assign to the list element.
sampleloop <- function(samplenames) {
data <- unlist(strsplit(samplenames, ","))
temp <- vector('list', length(data))
for(inc in seq_along(data)) {
df <- CT[CT[["Sample_Name"]] == data[inc],]
........
tempdf <- goitemp
temp[[inc]] <- tempdf
}
newdf <- do.call(rbind.data.frame, temp)
return(newdf)
}
The difference can be noted with the reproducible example below
lst1 <- vector('list', 5)
lst2 <- vector('list', 5)
for(i in 1:5) {
lst1[i] <- data.frame(col1 = 1:5, col2 = 6:10)
lst2[[i]] <- data.frame(col1 = 1:5, col2 = 6:10)
}
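To make the difference visible, the example above can be extended with a few checks. This is only a sketch on a shorter list. The `suppressWarnings()` call is an addition that hides the length-mismatch warning R raises for the `[` assignment.

```r
lst1 <- vector("list", 2)
lst2 <- vector("list", 2)
for (i in 1:2) {
  # `[<-` expects one list element per slot; the two-column data frame does not
  # fit, so R warns and stores only the first column
  suppressWarnings(lst1[i] <- data.frame(col1 = 1:5, col2 = 6:10))
  # `[[<-` stores the whole data frame as a single element
  lst2[[i]] <- data.frame(col1 = 1:5, col2 = 6:10)
}

is.data.frame(lst1[[1]])   # FALSE: only the integer vector 1:5 was stored
is.data.frame(lst2[[1]])   # TRUE
dim(do.call(rbind, lst2))  # 10 rows, 2 columns
```

Because `lst1` holds bare vectors rather than data frames, binding it row-wise cannot give back the original columns.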

R loop to create data frames with 2 counters

What I want is to create 60 data frames with 500 rows in each. I tried the below code and, while I get no errors, I am not getting the data frames. However, when I do a View on the as.data.frame, I get the view, but no data frame in my environment. I've been trying for three days with various versions of this code:
getDS <- function(x){
for(i in 1:3){
for(j in 1:30000){
ID_i <- data.table(x$ID[j: (j+500)])
}
}
as.data.frame(ID_i)
}
getDS(DATASETNAME)
We can use outer (on a small example)
out1 <- c(outer(1:3, 1:3, Vectorize(function(i, j) list(x$ID[j:(j + 5)]))))
lapply(out1, as.data.table)
--
The issue in the OP's function is that ID_i is overwritten on every iteration of the loop, i.e. earlier results are not stored. To keep them, we can initialize a list and store each result in it.
getDS <- function(x) {
ID_i <- vector('list', 3)
for(i in 1:3) {
for(j in 1:3) {
ID_i[[i]][[j]] <- data.table(x$ID[j:(j + 5)])
}
}
ID_i
}
do.call(c, getDS(x))
data
x <- data.table(ID = 1:50)
I'm not sure the description matches the code, so I'm a little unsure what the desired result is. That said, it is usually not helpful to split a data.table because the built-in by-processing makes it unnecessary. If for some reason you do want to split into a list of data.tables you might consider something along the lines of
getDS <- function(x, n=5, size = nrow(x)/n, column = "ID", reps = 3) {
x <- x[1:(n*size), ..column]
index <- rep(1:n, each = size)
replicate(reps, split(x, index),
simplify = FALSE)
}
getDS(data.table(ID = 1:20), n = 5)
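As a rough illustration of the by-processing mentioned above, the sketch below summarises consecutive blocks of rows in a single grouped call instead of splitting the table. The column name `chunk`, the block size of 5, and the summary columns are assumptions chosen for the example.

```r
library(data.table)

x <- data.table(ID = 1:20)
# Label consecutive blocks of 5 rows (the block size is an arbitrary choice)
x[, chunk := rep(1:4, each = 5)]
# Summarise each block in one grouped call, without creating separate tables
res <- x[, .(n = .N, total = sum(ID)), by = chunk]
res
```

Any per-block computation that would have been run on the split tables can go inside the `.( )` expression instead.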

Efficient sampling from nested lists

I have a list of lists containing data.frames, from which I want to select only a few rows. I can achieve this in a for-loop, where I create a sequence based on the number of rows and select only the row indices in that sequence.
But with more deeply nested lists it no longer works. I am also sure that there is a better way of doing this without a loop.
What would be an efficient and generic approach to sampling from nested lists that vary in their dimensions and contain data.frames or matrices?
## Dummy Data
n1=100;n2=300;n3=100
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
## Code to optimize
FiltRef <- list()
filterBy = 10
for (r in 1:length(crdOrig)) {
tmp <- do.call(rbind, crdOrig[[r]])
filterInd <- seq(1,nrow(tmp), by = filterBy)
FiltRef[[r]] <- tmp[filterInd,]
}
crdResult <- do.call(rbind, FiltRef)
# Plotting
crdOrigPl <- do.call(rbind, unlist(crdOrig, recursive = F))
plot(crdOrigPl[,1], crdOrigPl[,2], col="red", pch=20)
points(crdResult[,1], crdResult[,2], col="green", pch=20)
The code above works also if a list contains several data.frames (data below).
## Dummy Data (Multiple DF)
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)),
data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
But if a list contains multiple lists, it throws an error trying to bind the result (FiltRef) together.
The result can be a data.frame with 2 columns (x,y) - like crdResult or a one dimensional list like FiltRef (from the first example)
## Dummy Data (Multiple Lists)
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
+1 and thank you all for your brilliant answers! They all work and there is a lot to learn from each one of them. I will give this one to @Gwang-Jin Kim as his solution is the most flexible and extensive, although they all deserve to be checked!
Preparation and implementation of flatten
Well, there are many other answers which are in principle the same.
Meanwhile, I implemented the flattening of nested lists for fun.
Since I think in Lisp,
I first implemented car and cdr from Lisp.
car <- function(l) {
if(is.list(l)) {
if (null(l)) {
list()
} else {
l[[1]]
}
} else {
stop("Not a list.")
}
}
cdr <- function(l) {
if (is.list(l)) {
if (null(l) || length(l) == 1) {
list()
} else {
l[2:length(l)]
}
} else {
stop("Not a list.")
}
}
Some predicate functions:
null <- function(l) length(l) == 0
# this is Lisp's `null` checking whether list is empty (`length(l) == 0`)
# R's `is.null()` checks for the value NULL and not `length(obj) == 0`
# upon @Martin Morgan's comment removed other predicate functions
# thank you @Martin Morgan!
# instead using `is.data.frame()` and `is.list()`, since they are
# not only already there but also safer.
Which are necessary to build flatten (for data frame lists)
flatten <- function(nested.list.construct) {
# Implemented Lisp's flatten tail call recursively. (`..flatten()`)
# Instead of (atom l) (is.df l).
..flatten <- function(l, acc.l) {
if (null(l)) {
acc.l
} else if (is.data.frame(l)) { # originally one checks here for is.atom(l)
acc.l[[length(acc.l) + 1]] <- l
acc.l # kind of (list* l acc.l)
} else {
..flatten(car(l), ..flatten(cdr(l), acc.l))
}
}
..flatten(nested.list.construct, list())
}
# an atom is in the widest sense a non-list object
After this, the actual function is defined using a sampling function.
Defining sampling function
# helper function
nrow <- function(df) dim(df)[1L]
# sampling function
sample.one.nth.of.rows <- function(df, fraction = 1/10) {
# Randomly selects a fraction of the rows of a data frame
nr <- nrow(df)
df[sample(nr, fraction * nr), , drop = FALSE]
}
The actual collector function (from nested data-frame-lists)
collect.df.samples <- function(df.list.construct, fraction = 1/10) {
do.call(rbind,
lapply(flatten(df.list.construct),
function(df) sample.one.nth.of.rows(df, fraction)
)
)
}
# thanks for the improvement with `do.call(rbind, [list])` @Ryan!
# and the hint that `require(data.table)`
# `data.table::rbindlist([list])` would be even faster.
collect.df.samples first flattens the nested list construct of data frames, df.list.construct, into a flat list of data frames. It then applies sample.one.nth.of.rows to each element of the list (lapply), producing a list of sampled data frames, each containing the given fraction (here 1/10) of the rows of the original data frame. These sampled data frames are combined with rbind, and the resulting data frame, which consists of the sampled rows of every data frame, is returned.
Testing on example
## Dummy Data (Multiple Lists)
n1=100;n2=300;n3=100
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
collect.df.samples(crdOrig, fraction = 1/10)
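The flatten, sample, and bind steps described above can also be sketched more compactly in plain recursive R. This is not the Lisp-style implementation from the answer. The names `flatten_dfs`, `every_nth`, `make_df`, and `nested` are hypothetical, and the sketch assumes every leaf of the nested list is a data frame. A deterministic every-nth-row sampler is used so the row count is predictable.

```r
# Recursively collect every data frame in a nested list into one flat list
# (assumes all leaves are data frames; other leaf types are not handled)
flatten_dfs <- function(x) {
  if (is.data.frame(x)) return(list(x))
  do.call(c, lapply(x, flatten_dfs))
}

# Deterministic stand-in for the random sampler: keep every nth row
every_nth <- function(df, nth = 10) df[seq(1, nrow(df), by = nth), , drop = FALSE]

make_df <- function(n) data.frame(x = runif(n, 10, 20), y = runif(n, 40, 60))
nested <- list(
  list(list(make_df(100)), list(make_df(100))),
  list(make_df(300)),
  list(make_df(100))
)

flat <- flatten_dfs(nested)
res  <- do.call(rbind, lapply(flat, every_nth))
length(flat)  # number of data frames found
nrow(res)     # rows kept across all data frames
```

Unlike the accumulator-based version, this sketch keeps the data frames in their original order.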
Refactoring for later modifications
The collect.df.samples function can be rewritten as:
# sampler function
sample.10th.fraction <- function(df) sample.one.nth.of.rows(df, fraction = 1/10)
# refactored:
collect.df.samples <-
function(df.list.construct,
df.sampler.fun = sample.10th.fraction) {
do.call(rbind,
lapply(flatten(df.list.construct), df.sampler.fun))
}
This makes the sampler function replaceable.
(And if not: by changing the fraction parameter, one can increase or reduce the number of rows collected from each data frame.)
With this definition the sampler function is easily exchangeable.
To choose every nth (e.g. every 10th) row of a data frame instead of sampling randomly,
you could, for example, use a sampler function whose body is:
df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
and pass it as df.sampler.fun to collect.df.samples. This function is then applied to every data frame in the nested df list object, and the results are collected into one data frame.
every.10th.rows <- function(df, nth = 10) {
df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
}
a.10th.of.all.rows <- function(df, fraction = 1/10) {
sample.one.nth.of.rows(df, fraction)
}
collect.df.samples(crdOrig, a.10th.of.all.rows)
collect.df.samples(crdOrig, every.10th.rows)
I would just flatten the whole darn thing and work on a clean list.
library(rlist)
out <- list.flatten(y) # y is the nested list of data frames (crdOrig in the question)
# prepare a vector for which columns belong together
vc <- rep(1:(length(out)/2), each = 2)
vc <- split(1:length(vc), vc)
# prepare the final list
ll <- vector("list", length(unique(vc)))
for (i in 1:length(vc)) {
ll[[i]] <- as.data.frame(out[vc[[i]]])
}
result <- lapply(ll, FUN = function(x) {
x[sample(1:nrow(x), size = 10, replace = FALSE), ]
})
do.call(rbind, result)
x y
98 10.32912 52.87113
52 16.42912 46.07026
92 18.85397 46.26403
90 12.04884 57.79290
23 18.20997 40.57904
27 18.98340 52.55919
...
Here's an answer in base R, borrowing a custom "rapply" function from the question "rapply to nested list of data frames in R".
df_samples<-list()
i <- 0 # incremented before each store, so the first sample lands in slot 1
f<-function(x) {
i<<-i+1
df_samples[[i]]<<-x[sample(rownames(x),10),]
}
recurse <- function (L, f) {
if (inherits(L, "data.frame")) {
f(L) }
else lapply(L, recurse, f)
}
recurse(crdOrig, f)
res<-do.call("rbind", df_samples)
I too would flatten the list-of-lists into a standard representation (and do all analysis on the flattened representation, not just the subseting), but keep track of relevant indexing information, e.g.,
flatten_recursive = function(x) {
i <- 0L
.f = function(x, depth) {
if (is.data.frame(x)) {
i <<- i + 1L
cbind(i, depth, x)
} else {
x = lapply(x, .f, depth + 1L)
do.call(rbind, x)
}
}
.f(x, 0L)
}
The internal function .f() visits each element of a list. If the element is a data.frame, it adds a unique identifier to index it. If it's a list, it calls itself on each element of the list (incrementing a depth counter in case this is useful; one could also add a 'group' counter) and then row-binds the elements. I use an internal function so that I can have a variable i to increment across function calls. The end result is a single data frame with an index that refers back to the original data frames.
> tbl <- flatten_recursive(crdOrig) %>% as_tibble()
> tbl %>% group_by(i, depth) %>% summarize(n())
# A tibble: 4 x 3
# Groups: i [?]
i depth `n()`
<int> <int> <int>
1 1 3 100
2 2 3 100
3 3 2 300
4 4 2 100
> tbl %>% group_by(i) %>% slice(seq(1, n(), by = 10)) %>% summarize(n())
# A tibble: 4 x 2
i `n()`
<int> <int>
1 1 10
2 2 10
3 3 30
4 4 10
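The same every-10th-row selection can be sketched in base R without dplyr. The helper is restated so the block runs on its own. The names `make_df`, `nested`, `pos`, and `picked` are assumptions for illustration, and the input is a smaller nested list than the one in the question.

```r
flatten_recursive <- function(x) {
  i <- 0L
  .f <- function(x, depth) {
    if (is.data.frame(x)) {
      i <<- i + 1L            # running id of the data frame being visited
      cbind(i, depth, x)
    } else {
      do.call(rbind, lapply(x, .f, depth + 1L))
    }
  }
  .f(x, 0L)
}

make_df <- function(n) data.frame(x = runif(n, 10, 20), y = runif(n, 40, 60))
nested <- list(list(list(make_df(20)), list(make_df(20))), list(make_df(30)))

tbl <- flatten_recursive(nested)
# Position of each row within its original data frame
pos <- ave(seq_len(nrow(tbl)), tbl$i, FUN = seq_along)
# Keep rows 1, 11, 21, ... of every original data frame
picked <- tbl[(pos - 1) %% 10 == 0, ]
table(picked$i)  # rows kept per original data frame
```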
The overall pattern of .f() can be adjusted for additional data types, e.g., (some details omitted)
.f <- function(x) {
if (is.data.frame(x)) {
x
} else if (is.matrix(x)) {
x <- as.data.frame(x)
setNames(x, c("x", "y"))
} else {
do.call(rbind, lapply(x, .f))
}
}
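Consider a recursive call that checks whether the first item is a data.frame or a list.
stack_process <- function(lst) {
  # inherits() is safe even when class() returns more than one value
  if (inherits(lst[[1]], "data.frame")) {
    # bind all sibling data frames (assumed to share columns), not just the first
    tmp <- do.call(rbind, lst)
  } else if (is.list(lst[[1]])) {
    inner <- lapply(lst, stack_process)
    tmp <- do.call(rbind, inner)
  }
  return(tmp)
}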
new_crdOrig <- lapply(crdOrig, function(x) {
df <- stack_process(x)
filterInd <- seq(1, nrow(df), by = filterBy)
return(df[filterInd,])
})
final_df <- do.call(rbind, new_crdOrig)

R: Passing the index of the corresponding data frame (from a list of data frames) in a function within lapply

I have a list of 9 data frames, list_dataframes, read from files, and a function func_modification to modify them. I would like to pass as pos the index of the corresponding data frame in the list, so that the rows of each data frame get their respective dmv and method names. How can I do that?
dmv <- c(rep("MC", 3), rep("MSM", 3), rep("Random", 3))
method <- rep(c("COM-0.5", "IDT", "LB"), 3)
func_modification <- function(d, pos) {
d[,1] <- d[,1]/3600
d[,3] <- NA
d[,3] <- dmv[pos]
d[,4] <- method[pos]
}
list_df <- list()
list_df <- lapply(list_dataframes, func_modification, pos=3) # works
list_df <- lapply(list_dataframes, func_modification, pos=1:9) # throws an error
You can try Map to change the corresponding dataframes with each element of 'pos'
Map(func_modification, list_dataframes, pos= 1:3)
Or using lapply
lapply(seq_along(list_dataframes), function(i)
func_modification(list_dataframes[[i]], pos=i))
where func_modification is
func_modification <- function(d, pos) {
d[,1] <- d[,1]/3600
d[,3] <- NA # not sure if this is needed
d[,3] <- dmv[pos]
d[,4] <- method[pos]
d #return the data
}
data
set.seed(24)
list_dataframes <- lapply(1:3, function(i)
as.data.frame(matrix(sample(1:10, 5*20, replace=TRUE), ncol=5)))
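The reason `pos = 1:9` fails with lapply is that the whole vector is handed to every call, whereas Map pairs the i-th data frame with the i-th position. A minimal sketch of that pairing follows. The helper `tag_frame`, the toy data in `frames`, and the shortened `dmv` and `method` vectors are assumptions for illustration, not the asker's data.

```r
dmv    <- c("MC", "MSM", "Random")
method <- c("COM-0.5", "IDT", "LB")

# Hypothetical helper: label every row of d with the names found at index pos
tag_frame <- function(d, pos) {
  d$dmv    <- dmv[pos]
  d$method <- method[pos]
  d  # return the modified data frame
}

frames <- list(data.frame(t = c(3600, 7200)),
               data.frame(t = 1800),
               data.frame(t = 5400))

# Map() walks both arguments in parallel: frames[[1]] with 1, frames[[2]] with 2, ...
tagged   <- Map(tag_frame, frames, seq_along(frames))
combined <- do.call(rbind, tagged)
combined
```

Using `seq_along(frames)` rather than a hard-coded range keeps the positions in step with the list length.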

add row to a frame in a for loop

I have some code that creates a data frame with 2 columns. I want to write data from a for loop to this data frame. How do I do that?
df<-data.frame(id = numeric(), nobs = numeric())
setwd(directory)
files <-list.files(directory)
files <-files[id]
for (i in files) {
#print(i)
file <- read.csv(i)
x <- nrow(file)
num = as.numeric(gsub(".csv","",i))
y <- sprintf("%i %i", num, x)
#print(y)
df <- rbind(df,num,x)
}
To add rows to a data.frame in a loop, you can adapt your code along these lines:
df<-data.frame(id = numeric(), nobs = numeric())
for (i in 1:1000) {
df[i,] <- c(runif(1),runif(1))
}
However, if you know the number of rows needed then preallocating memory is strongly recommended:
files <- 1:1000
df<-data.frame(id = numeric(length(files)), nobs = numeric(length(files)))
for (i in 1:length(files)) {
df[i,] <- c(runif(1),runif(1))
}
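Another common pattern, sketched here under assumptions, is to build one small data frame per file and bind them all once at the end, which avoids growing the data frame inside the loop. The temporary directory, the demo files `1.csv` and `2.csv`, and the name `rows` are made up for the example. The file-name-to-id conversion follows the question's convention of numeric file names.

```r
# Throwaway directory with two demo files named by numeric id
demo_dir <- tempfile("obs_")
dir.create(demo_dir)
write.csv(data.frame(a = 1:3), file.path(demo_dir, "1.csv"), row.names = FALSE)
write.csv(data.frame(a = 1:5), file.path(demo_dir, "2.csv"), row.names = FALSE)

files <- list.files(demo_dir, pattern = "[.]csv$", full.names = TRUE)

# One single-row data frame per file: the id from the name, nobs from the row count
rows <- lapply(files, function(f) {
  data.frame(id   = as.numeric(sub("[.]csv$", "", basename(f))),
             nobs = nrow(read.csv(f)))
})

# Bind once at the end instead of calling rbind inside the loop
df <- do.call(rbind, rows)
df
```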

Resources