How to sequentially concatenate every nth:(nth+j) object in a list of objects - r

I wish to concatenate every nth:nth(+jth) object in a list of objects I have. More specifically, I would like every two objects to be concatenated.
A small sample of the list in question is below.
list("SRR1772151_1.fastq", "SRR1772151_2.fastq", "SRR1772152_1.fastq",
"SRR1772152_2.fastq", "SRR1772153_1.fastq", "SRR1772153_2.fastq")
I would like to make a new list from this which looks closer to this.
list(c("SRR1772151_1.fastq", "SRR1772151_2.fastq"), c("SRR1772152_1.fastq",
"SRR1772152_2.fastq"), c("SRR1772153_1.fastq", "SRR1772153_2.fastq"
))
I have made the following attempt at doing this but my for loop has been unsuccessful.
for (i in seq(1,36, 2)) {
for (j in 1:18) {
unlist(List1[i:i+1]) -> List2[[j]]
}
}
Any help or advice would be very appreciated.

You could divide this into two problems -- split the list, e.g.,
elts = split(lst, 1:2)
and concatenate the elements
Map(c, elts[[1]], elts[[2]])
But I think it's better to follow 'tidy' data practices and to create a single vector with a grouping factor
df = data.frame(fastq = unlist(lst), grp = 1:2, stringsAsFactors = FALSE)
or, more descriptively,
df = data.frame(
fastq = unlist(lst),
sample = factor(sub("_[12]\\.fastq$", "", unlist(lst))),
stringsAsFactors = FALSE
)
It's better to work with tidy data because one can accomplish more knowing less, for instance notice that when working with lists you have to learn about split() and Map() and c(), whereas working with vectors and data.frames you don't!
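To make the two approaches above concrete, here is a minimal sketch run on the sample list from the question. It assumes the list is called lst; the names pairs, files, and pairs2 are introduced only for illustration, and the regular expression assumes every file name ends in _1.fastq or _2.fastq.

```r
# Sample input from the question (called `lst` here, as in the answer above)
lst <- list("SRR1772151_1.fastq", "SRR1772151_2.fastq", "SRR1772152_1.fastq",
            "SRR1772152_2.fastq", "SRR1772153_1.fastq", "SRR1772153_2.fastq")

# Approach 1: split into odd- and even-positioned elements, then pair them up
elts <- split(lst, 1:2)
pairs <- unname(Map(c, elts[[1]], elts[[2]]))

# Approach 2: one row per file, with the sample as a grouping factor
files <- unlist(lst)
df <- data.frame(
  fastq = files,
  sample = factor(sub("_[12]\\.fastq$", "", files)),
  stringsAsFactors = FALSE
)

# If a list of pairs is still needed, it can be recovered from the tidy form
pairs2 <- split(df$fastq, df$sample)
```

Both pairs and pairs2 hold one character vector of length two per sample; pairs2 is additionally named by sample.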

Here is one other attempt using dataframes. The output is a list.
library(tidyverse)
data.frame(X1 = unlist(my_list), stringsAsFactors = FALSE) %>%
group_by(str_sub(X1,1,10)) %>% # assuming first 10 characters forms the string
summarise(list_value=list(X1)) %>%
pull(list_value)

For the general case, you can create a vector of consecutive groups of size j with:
ceiling(seq_along(x) / j)
… and then use tapply() to concatenate all elements in those groups. Unlike using Map(), this will also work if the chunk size does not evenly divide the length of the list.
x <- list("SRR1772151_1.fastq", "SRR1772151_2.fastq", "SRR1772152_1.fastq",
"SRR1772152_2.fastq", "SRR1772153_1.fastq", "SRR1772153_2.fastq")
tapply(x, ceiling(seq_along(x) / 2), unlist)
#> $`1`
#> [1] "SRR1772151_1.fastq" "SRR1772151_2.fastq"
#>
#> $`2`
#> [1] "SRR1772152_1.fastq" "SRR1772152_2.fastq"
#>
#> $`3`
#> [1] "SRR1772153_1.fastq" "SRR1772153_2.fastq"
tapply(x, ceiling(seq_along(x) / 4), unlist)
#> $`1`
#> [1] "SRR1772151_1.fastq" "SRR1772151_2.fastq" "SRR1772152_1.fastq"
#> [4] "SRR1772152_2.fastq"
#>
#> $`2`
#> [1] "SRR1772153_1.fastq" "SRR1772153_2.fastq"
Created on 2019-06-12 by the reprex package (v0.2.1)
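Note that tapply() returns a one-dimensional array of mode list rather than a plain list. If a plain list is preferred, the same grouping vector can be passed to split() instead. The helper name chunk below is hypothetical and only wraps the idea for illustration:

```r
# Hypothetical helper: concatenate consecutive groups of size j into vectors
chunk <- function(x, j) {
  groups <- ceiling(seq_along(x) / j)   # 1, 1, ..., 2, 2, ...
  lapply(split(x, groups), unlist)      # one vector per group
}

x <- list("SRR1772151_1.fastq", "SRR1772151_2.fastq", "SRR1772152_1.fastq",
          "SRR1772152_2.fastq", "SRR1772153_1.fastq", "SRR1772153_2.fastq")

by_two  <- chunk(x, 2)   # three groups of two
by_four <- chunk(x, 4)   # one group of four, one group of two
```

As with the tapply() version, the last group is simply shorter when j does not evenly divide the length of the list.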

Related

Why won't R recognize data frame column names within lists?

HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
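The reason get("df$colA") fails is that get() looks for an object whose name is literally "df$colA"; it does not evaluate the string as code. Storing only the element or column name and extracting with [[ avoids the problem. A minimal sketch, in which lookup, var_lengths, and col_length are made-up names and the data are stand-ins for the asker's variables:

```r
# Stand-ins for the asker's data
x  <- list(a = 1:10, b = rnorm(20L))   # variables of different lengths
df <- data.frame(colA = rep(NA, 6))    # data frame with one column

# Keep only names in the lookup list, not expressions such as "df$colA"
lookup <- list(vars = c("a", "b"), cols = "colA")

# Extract by name with [[ instead of get()
var_lengths <- vapply(lookup$vars, function(nm) length(x[[nm]]), integer(1))
col_length  <- length(df[[lookup$cols[1]]])

# No object is literally named "df$colA", which is why get() cannot find it
found <- exists("df$colA")
```

Iterating over lookup$vars this way scales to many variables without any call to get() or eval(parse()).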

R write elements of nested list to csv

I have a list of lists which, in turn, have multiple lists in them due to the structure of some JSON files. Every list has the same number (i.e., 48 lists of 1 list, of 1 list, of 1 list, of 2 lists [where I need the first of the last two]). Now, the issue is, I need to retrieve deeply nested data from all of these lists.
For a reproducible example.
The list structure is roughly as follows (maybe one more level):
list1 = list(speech1 = 1, speech2 = 2)
list2 = list(list1, randomvariable="rando")
list3 = list(list2) #container
list4 = list(list3, name="name", stage="stage")
list5 = list(list4) #container
list6 = list(list5, date="date")
listmain1 = list(list6)
listmain2 = list(list6)
listmain3 = list(listmain1, listmain2)
The structure should look like so:
[[1]]
[[1]][[1]]
[[1]][[1]][[1]]
[[1]][[1]][[1]][[1]]
[[1]][[1]][[1]][[1]][[1]]
[[1]][[1]][[1]][[1]][[1]][[1]]
[[1]][[1]][[1]][[1]][[1]][[1]][[1]]
[[1]][[1]][[1]][[1]][[1]][[1]][[1]]$speech1
[1] 1
[[1]][[1]][[1]][[1]][[1]][[1]][[1]]$speech2
[1] 2
[[1]][[1]][[1]][[1]][[1]][[1]]$randomvariable
[1] "rando"
[[1]][[1]][[1]][[1]]$name
[1] "name"
[[1]][[1]][[1]][[1]]$stage
[1] "stage"
[[1]][[1]]$date
[1] "date"
[[2]]
[[2]][[1]]
[[2]][[1]][[1]]
[[2]][[1]][[1]][[1]]
[[2]][[1]][[1]][[1]][[1]]
[[2]][[1]][[1]][[1]][[1]][[1]]
[[2]][[1]][[1]][[1]][[1]][[1]][[1]]
[[2]][[1]][[1]][[1]][[1]][[1]][[1]]$speech1
[1] 1
[[2]][[1]][[1]][[1]][[1]][[1]][[1]]$speech2
[1] 2
[[2]][[1]][[1]][[1]][[1]][[1]]$randomvariable
[1] "rando"
[[2]][[1]][[1]][[1]]$name
[1] "name"
[[2]][[1]][[1]][[1]]$stage
[1] "stage"
[[2]][[1]]$date
[1] "date"
The end result would look like this:
date name speech1 speech2
1
2
I want to make columns out of the variables which I need and rows out of the lists that I extract them from. In the above example, I would need to retrieve variables speech1, speech2, name, and date from all of the main lists and convert to a simpler dataframe. I'm not quite sure the fastest way to do this and have been knocking my head over with lapply() and purrr for the last couple of days. Ideally, I want to treat the lists as rowIDs with flattened variables in the columns -- but that has also been tricky. Any help is appreciated.
By looping over each list, flattening it, and extracting the values, this can be achieved quickly with base R:
# Your data
list1 = list(speech1 = 1, speech2 = 2)
list2 = list(list1, randomvariable="rando")
list3 = list(list2) #container
list4 = list(list3, name="name", stage="stage")
list5 = list(list4) #container
list6 = list(list5, date="date")
listmain1 = list(list6)
listmain2 = list(list6)
listmain3 = list(listmain1, listmain2)
# Loop over each list inside listmain3
flatten_list <- lapply(listmain3, function(x) {
# Flatten the list and extract the values that
# you're interested in
unlist(x)[c("date", "name", "speech1", "speech2")]
})
# bind each separate list into a data frame
as.data.frame(do.call(rbind, flatten_list))
#> date name speech1 speech2
#> 1 date name 1 2
#> 2 date name 1 2
Unless you want to map the row names to particular values from each list, the row names will follow the order of the lists. That is, if you run this on 48 nested lists, the row names will run from 1 to 48, so there is no need to use the row.names argument.
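One thing to keep in mind with this approach is that unlist() coerces every value to a common type, so the numeric speech values end up as character strings. A minimal sketch of restoring the column types afterwards with type.convert(); the names wanted, rows, and out are introduced here only for illustration:

```r
# Rebuild the example structure from the question
list1 <- list(speech1 = 1, speech2 = 2)
list2 <- list(list1, randomvariable = "rando")
list4 <- list(list(list2), name = "name", stage = "stage")
list6 <- list(list(list4), date = "date")
listmain3 <- list(list(list6), list(list6))

wanted <- c("date", "name", "speech1", "speech2")

# Flatten each top-level list and keep only the fields of interest
rows <- lapply(listmain3, function(x) unlist(x)[wanted])
out  <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# unlist() turned everything into character; convert columns back where possible
out[] <- lapply(out, type.convert, as.is = TRUE)
```

After the conversion, speech1 and speech2 are numeric again, while date and name stay character.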

R use of lapply() to populate and name one column in list of dataframes

After searching for some time, I cannot find a smooth R-esque solution.
I have a list of vectors that I want to convert to dataframes and add a column with the names of the vectors. I can't do this with cbind() and melt() to a single dataframe b/c there are vectors with different number of rows.
Basic example would be:
list<-list(a=c(1,2,3),b=c(4,5,6,7))
var<-"group"
What I have come up with and works is:
list<-lapply(list, function(x) data.frame(num=x,grp=""))
for (j in 1:length(list)){
list[[j]][,2]<-names(list[j])
names(list[[j]])[2]<-var
}
But I am trying to better use lapply() and have cleaner coding practices. Right now I rely so heavily on for and if statements, which a lot of the base functions do already and much more efficiently than I can code at this point.
The pseudo code I would like is something like:
list<-lapply(list, function(x) data.frame(num=x,get(var)=names(x))
Is there a clean way to get this done?
Second closely related question, if I already have a list of dataframes, why is it so hard to reassign column values and names using lapply()?
So using something like:
list<-list(a=data.frame(num=c(1,2,3),grp=""),b=data.frame(num=c(4,5,6,7),grp=""))
var<-"group"
#pseudo code
list<-lapply(list, function(x) x[,2]<-names(x)) #populate second col with name of df[x]
list<-lapply(list, function(x) names[[x]][2]<-var) #set 2nd col name to 'var'
The first line of pseudo code throws an error about matching row lengths. Why does lapply() not just loop over and repeat names(x) like the same function on a single dataframe does in a for loop?
For the second line, as I understand it I can use setNames() to reassign all the column names, but how do I make this work for just one of the col names?
Many thanks for any ideas or pointing to other threads that cover this and helping me understand the behavior of lapply() in this context.
A full R base approach without using loops
> l<-list(a=c(1,2,3),b=c(4,5,6,7))
> data.frame(grp=rep(names(l), lengths(l)), num=unlist(l), row.names = NULL)
grp num
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
Related to your first/main question you can use the function enframe from package tibble for this purpose
library(tibble)
library(tidyr)
library(dplyr)
l<-list(a=c(1,2,3),b=c(4,5,6,7))
l %>%
enframe(name = "group", value="value") %>%
unnest(value) %>%
group_split(group)
Try this:
library(dplyr)
mylist <- list(a = c(1,2,3), b = c(4,5,6,7))
bind_rows(lapply(names(mylist), function(x) tibble(grp = x, num = mylist[[x]])))
# A tibble: 7 x 2
grp num
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
This is essentially a lapply-based solution where you iterate over the names of your list, and not the individual list elements themselves. If you prefer to do everything in base R, note that the above is equivalent to
do.call(rbind, lapply(names(mylist), function(x) data.frame(grp = x, num = mylist[[x]], stringsAsFactors = FALSE)))
Having said that, tibbles, as a modern implementation of data frames, are preferred, as is bind_rows over the do.call(rbind, ...) construct.
As to the second question, note the following:
lapply(mylist, function(x) str(x))
num [1:3] 1 2 3
num [1:4] 4 5 6 7
....
lapply(mylist, function(x) names(x))
$a
NULL
$b
NULL
What you see here is that the function inside of lapply gets the elements of mylist. In this case, it gets to work with the numeric vector. This does not have any name as far as the function that is called inside lapply is concerned. To highlight this, consider the following:
names(c(1,2,3))
NULL
Which is the same: the vector c(1,2,3) does not have a name attribute.
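The same reasoning applies to a list of data frames: inside the function, names(x) returns the column names of the data frame, not the name of the list element, and whatever the function returns becomes the new element, so the modified data frame has to be returned explicitly. One way to pass the element name along is Map(). A minimal sketch, where dfs is a made-up name for the list of data frames and var is taken from the question:

```r
dfs <- list(a = data.frame(num = c(1, 2, 3)),
            b = data.frame(num = c(4, 5, 6, 7)))
var <- "group"

# Map() walks over the data frames and their names in parallel
dfs <- Map(function(d, nm) {
  d[[var]] <- nm   # a single value is recycled to all rows of d
  d                # return the modified data frame, not the assigned value
}, dfs, names(dfs))
```

Because the column is created under the name stored in var, no separate renaming step is needed.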

rbind all data frames with common names based on list using lapply

I have several data frames named as such:
orange_ABC
orange_BCD
apple_ABC
apple_BCD
grape_ABC
grape_BCD
I need to rbind those that have the first part of their name in common (orange, apple, grape), and name the new data frames as such. I'm accessing the names from a list of data frames names(fruitlist) (from which I made the aforementioned data frames) and have tried using lapply with function(x) with no luck. I'm somewhat new to R, so think I'm making a simple mistake when it comes to dynamically naming the new data frame...
lapply(names(fruitlist),
function(x){
frame_nm <- toString((names(fruitlist[x])))
frame_nm <- do.call(rbind, mget(ls(pattern=paste0((names(splitlist[x])),"*"))))
})
I've tried the standalone line on one type of "fruit" and it seems to work:
test_DF <- do.call(rbind, mget(ls(pattern="apple*")))
EDIT: I realize I forgot to mention that the example list of 6 data frames were created dynamically, so I can't simply generate a list of them. However, I do have a list of the "fruits", and all the possible ends of the new data frame names are known ("_ABC" and "_BCD").
As suspected, the proposed way of assigning values to objects does not work. Moreover, care has to be taken when using ls() and mget() for listing and accessing named objects within a function, because they do not automatically ascend to parent environments and only "see" variables in the local scope unless told otherwise. This applies to R version 3.4; older versions may behave differently.
Creating named objects.
In order to create new objects in the global environment, use assign() (already suggested in Luke C's answer):
> assign("foo", "some text")
> foo
[1] "some text"
Placing code inside a function induces a local scope. Explicitly specifying the global environment allows setting global variables:
> set_foo <- function (x) { assign("foo", x, envir=globalenv()) }
> set_foo("other text")
> foo
[1] "other text"
Note that omitting the envir argument would leave the global environment unaffected.
Use of ls()/mget() within a local function.
By default, this only lists names from the current (local) environment of that function, which only sees the argument x in the example code given in the question. Similar to above, a quick fix is to specify the global environment explicitly by adding the argument envir=globalenv(). The same applies for mget().
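The difference in scope can be seen in a small sketch. The function names find_local and find_global are hypothetical, and the data frame is created in the global environment explicitly so the example behaves the same wherever it is run:

```r
# Put one example data frame into the global environment
assign("apple_ABC", data.frame(a = 1:4, b = 11:14), envir = globalenv())

# ls() without envir only sees the function's own variables (here just x)
find_local <- function(x) ls(pattern = "^apple_")

# With envir = globalenv() it lists the global objects instead
find_global <- function(x) ls(pattern = "^apple_", envir = globalenv())

local_names  <- find_local("apple")
global_names <- find_global("apple")
```

Here local_names is empty, while global_names contains "apple_ABC".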
Since no MWE was provided, I am taking the liberty of adapting the "fake data" example code provided in Luke C's answer.
# Populate environment
namelist <- paste(fruit = rep(c("orange", "apple", "grape"), 2),
nums = rep(c("_ABC", "_BCD"), each = 3), sep = "")
for(x in namelist)
assign(x, data.frame(a = 1:4, b = 11:14))
# The following re-generates the list of fruits used above
grouplist <- unique(unlist(lapply(strsplit(namelist, "_"), function (x) { x[[1]] })))
# Group and rbind by prefix, suppressing output
invisible(lapply(grouplist,
function(x) {
grouped <- do.call(rbind,
mget(ls(pattern=paste0("^", x, "_"), envir=globalenv()),
envir=globalenv()))
assign(x, grouped, envir=globalenv())
}))
If your fruitlist is a named list of data frames, maybe this will suit.
First, get the like names into their own list:
fruit.groups <- split(names(fruitlist),
sapply(strsplit(names(fruitlist), split = "_"), "[[", 1))
> fruit.groups
$apple
[1] "apple_ABC" "apple_BCD"
$grape
[1] "grape_ABC" "grape_BCD"
$orange
[1] "orange_ABC" "orange_BCD"
Then, use lapply to rbind by group:
fdf <- lapply(fruit.groups, function(x){
out <- do.call(rbind, fruitlist[x])
out$from <- gsub("(\\..*)", "", rownames(out))
rownames(out) <- NULL
return(out)
})
> fdf$apple
a b from
1 1 11 apple_ABC
2 2 12 apple_ABC
3 3 13 apple_ABC
4 4 14 apple_ABC
5 1 11 apple_BCD
6 2 12 apple_BCD
7 3 13 apple_BCD
8 4 14 apple_BCD
Fake data:
namelist <- paste(fruit = rep(c("orange", "apple", "grape"), 2),
nums = rep(c("_ABC", "_BCD"), each = 3), sep = "")
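library(plyr) # provides llply()
fruitlist <- llply(namelist, function(x){
assign(as.character(x), data.frame(a = 1:4, b = 11:14))
})
names(fruitlist) <- namelist # make sure the list carries the data frame names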
EDIT:
From the edits to your question above:
If you have the fruits and suffixes, use expand.grid to get all possible combinations (assuming that all combinations will refer to the dynamically generated data frames).
fruits <- c("orange", "apple", "grape")
suffixes <- c("_ABC", "_BCD")
fullnames <- apply(expand.grid(fruits, suffixes), 1, paste, collapse = "")
Using that list of names, use mget to generate a list of the present dataframes.
new_fruit_df_list <- mget(fullnames)
Then, the code from above should work, modified here to reflect the name changes:
fruit.groups <- split(names(new_fruit_df_list),
sapply(strsplit(names(new_fruit_df_list), split = "_"), "[[", 1))
fdf <- lapply(fruit.groups, function(x){
out <- do.call(rbind, new_fruit_df_list[x])
out$from <- gsub("(\\..*)", "", rownames(out))
rownames(out) <- NULL
return(out)
})
Have a look at the head of each, with the added column (remove if you don't want it) showing the name of that row's original data frame.
> lapply(fdf, head, 2)
$apple
a b from
1 1 11 apple_ABC
2 2 12 apple_ABC
$grape
a b from
1 1 11 grape_ABC
2 2 12 grape_ABC
$orange
a b from
1 1 11 orange_ABC
2 2 12 orange_ABC
Give this a try:
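# Run at top level, where the data frames live in the global environment
file_groups <- ls()[grep(".*_.*", ls())]
file_groups <- unique(gsub("(.*)_.*", "\\1", file_groups)) # one entry per prefix
df_list <- lapply(file_groups,
function(x){ do.call(rbind, mget(ls(pattern = paste0("^", x, "_"), envir = globalenv()),
envir = globalenv()))})
names(df_list) <- file_groups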

Appending list to data frame in R

I have created an empty data frame in R with two columns:
d<-data.frame(id=c(), numobs=c())
I would like to append to this data frame (in a loop) a list, d2, which prints as:
[1] 1 100
I tried using rbind:
d<-rbind(d, d2)
and merge:
d<-merge(d, d2)
And I even tried just making a list of lists and then converting it to a data frame, and then giving that data frame names:
d<-rbind(dlist1, dlist2)
dframe<-data.frame(d)
names(dframe)<-c("id","numobs")
But none of these seem to meet the standards of a routine checker (this is for a class), which gives the error:
Error: all(names(cc) %in% c("id", "nobs")) is not TRUE
Even though it works fine in my workspace.
This is frustrating since the error does not reveal where the error is occurring.
Can anyone help me to either merge 2 data frames or append a data frame with a list?
I think you are confusing the purposes of rbind and merge. rbind appends data frames, named lists, or both vertically, while merge combines data frames horizontally.
You also seem to be confusing vectors and lists. In R, a list can hold a different data type in each element, while a vector must have all elements of the same type. Both lists and vectors are one-dimensional. When you use rbind, you want to append a named list, not a named or unnamed vector.
Unnamed Vectors and Lists
The way you define a vector is with the c() function. The way you define an unnamed list is with the list() function, like so:
vec1 = c(1, 10)
# > vec1
# [1] 1 10
list1 = list(1, 10)
# > list1
# [[1]]
# [1] 1
#
# [[2]]
# [1] 10
Notice that both vec1 and list1 have two elements, but list1 is storing the two numbers as two separate vectors (element [[1]] the vector c(1) and [[2]] the vector c(10))
Named Vectors and Lists
You can also create named vectors and lists. You do this by:
vec2 = c(id = 1, numobs = 10)
# > vec2
# id numobs
# 1 10
list2 = list(id = 1, numobs = 10)
# > list2
# $id
# [1] 1
#
# $numobs
# [1] 10
Same data structure for both, but the elements are named.
Dataframes as Lists
Notice that list2 has a $ in front of each element name. This might give you a clue that data frames are actually lists in which each column is an element of the list, since df$column is often used to extract a column from a data frame. This makes sense, since both lists and data frames can hold different data types, unlike vectors.
The rbind function
When your first element is a dataframe, rbind requires that what you are appending has the same names as the columns of the dataframe. Now, a named vector would not work, because the elements of a vector are not treated as columns of a dataframe, whereas a named list matches elements with columns if the names are the same:
To demonstrate:
d<-data.frame(id=c(), numobs=c())
rbind(d, c(1, 10))
# X1 X10
# 1 1 10
rbind(d, c(id = 1, numobs = 10))
# X1 X10
# 1 1 10
rbind(d, list(1, 10))
# X1 X10
# 1 1 10
rbind(d, list(id = 1, numobs = 10))
# id numobs
# 1 1 10
Knowing the above, it is obvious that you can most certainly also rbind two dataframes with column names that match:
df2 = data.frame(id = 1, numobs = 10)
rbind(d, df2)
# id numobs
# 1 1 10
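Since the question mentions building the data frame in a loop, one common pattern is to collect one-row data frames in a list and bind them once at the end, which keeps the column names consistent throughout. A minimal sketch, with made-up values standing in for the real id and count pairs:

```r
# Pre-allocate a list with one slot per iteration
rows <- vector("list", 3)

for (i in seq_along(rows)) {
  # Each iteration produces a one-row data frame with matching column names
  rows[[i]] <- data.frame(id = i, numobs = i * 100)
}

# Bind all rows in a single call instead of growing d inside the loop
d <- do.call(rbind, rows)
```

The column names set inside data.frame() are the ones the final data frame carries, so they can be matched to whatever names a checker expects.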
For starters, the routine checker appears to be looking for columns labeled "id" and "nobs". If that doesn't match your file output, you'll get that error.
I'm taking what is probably the same class and had the same error; correcting my column names made that go away (I'd labeled the 2nd one "nob" not "nobs"!) Now I've gotten the routine checker to complete correctly, or so it seems... but it outputs three data files, and the first and last files are correct but the second one yields "Sorry, that is incorrect." No further feedback. Maddening!
No point posting my code here as it runs fine locally with all the course examples, and it's kinda hard to debug when you don't know what the script is asking for. Sigh.
That d2 object is being printed as an atomic vector would be. Maybe if you showed us either dput(d2) or str(d2) you would have a better understanding of R lists. Furthermore, that first bit of code does not produce a two-column data frame, either.
> d<-data.frame(id=1, numobs=1)[0, ] # 2-cl dataframe with 0 rows
> dput(d)
structure(list(id = numeric(0), numobs = numeric(0)), .Names = c("id",
"numobs"), row.names = integer(0), class = "data.frame")
> d2 <- list(id="fifty three", numobs=6) # names that match names(d)
> rbind(d,d2)
id numobs
2 fifty three 6

Resources