elegant way to use rbind() on multiple dataframes with similar names? - r

Currently, I have multiple dataframes that share a name stem and are numbered in running order (foo1, foo2, foo3, foo4, foo5... etc). I am trying to create a large dataframe containing all the rows of the above dataframes with rbind(). Is there an elegant way to do it which would be the equivalent of rbind(foo1, foo2, foo3, foo4, foo5...)?
I have tried do.call(rbind, paste0("foo",i)) where i=c(1,2,3...) to no avail.
There is a solution mentioned here, which is:
do.matrix <- do.call(rbind, lapply( paste0("variable", 1:10) , get) )
However, the answer mysteriously says "That is the wrong way to handle related items. Better to use a list or dataframe, but you will probably find out why in due course."
Why would that be the wrong way to do this, and what would be the "right" way?
Thanks.

Always try to rigorously capture relations between related instances of data, or related data and methods, or related methods. This generally helps ease aggregate manipulation such as your rbind requirement.
For your case, you should have defined your related data.frames as a single list from the beginning:
foo <- list(data.frame(...), data.frame(...), ... );
And then your requirement could be satisfied thusly:
do.call(rbind, foo );
If it's too late for that, then the solution involving repeated calls to get(), as described in the article to which you linked, can do the job.
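A minimal sketch of the "too late" case, assuming three small data.frames named foo1 to foo3 already exist (the data below is made up for illustration). The attempt do.call(rbind, paste0("foo", i)) fails because it hands rbind the name strings rather than the objects. mget() looks the names up and returns the objects as one named list, which can then be bound:

```r
# Hypothetical example data; the names foo1..foo3 mirror the question.
foo1 <- data.frame(id = 1:2, value = c("a", "b"))
foo2 <- data.frame(id = 3:4, value = c("c", "d"))
foo3 <- data.frame(id = 5:6, value = c("e", "f"))

# Look the objects up by name in the current environment and
# collect them into a single named list.
foo <- mget(paste0("foo", 1:3), envir = environment())

# Bind all rows; every data.frame must have the same columns.
combined <- do.call(rbind, foo)
rownames(combined) <- NULL  # drop the "foo1.1"-style row names rbind generates
```

Once the data lives in the list foo, the numbered variables are no longer needed, and later steps (lapply, do.call) can work on the list directly.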

Related

How to process a dataframe row by row, passing the columns as args to a function, *as a single call to function*

Still quite new to R. It's quite possible my question is due to gaps in my thinking about this problem, but after a few hours of googling, I'm still stuck.
The problem:
I have a dataframe(tibble) that contains 6 rows, and 3 columns.
The columns are Filename, Metadata1, Metadata2.
I want to call a function for each row, as follows:
function(Filename, Metadata1, Metadata2).
In other languages, this would be a simple for loop, but I am completely stuck how to do this in R, both looking at base, and tidyverse ways to do this. All the answers I've come across are variations of calling the function on every element in the dataframe or matrix, whereas I want to effectively pass the whole row to the function, as individual args.
It's probably blindingly obvious, but I would really appreciate some guidance.
EDIT:
I ran across mapply, and it seems to do the job I need, but I have no idea if this is the only or best method. This is what I'm working with currently:
testfunc <- function(a,b,c){
str(a)
str(b)
str(c)
}
discard <- mapply(testfunc, a=files_sorted$file, b=files_sorted$AppID, c=files_sorted$server)
Moments after I posted the last edit to my question, I hit the exact issue that @mrflick mentioned, where my function was not vectorised.
In the end, I did end up using a for loop, this is what I settled on:
overall_data <- tibble()
for(a in transpose(files_sorted)){
df <- processFile(file=a[1]$list_files, srv=a[2]$server, tap=a[3]$AppID )
#view(df)
overall_data <- bind_rows(overall_data, df)
}
files_sorted:
I'm sure I'll learn better ways to tackle this in future, but leaving this here
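For reference, a base-R sketch of the row-by-row call without an explicit loop. It assumes a hypothetical process_file() that returns a data.frame; the stand-in function and the example rows below are invented for illustration and are not the original processFile. Map() calls the function once per row, passing each column as a named argument, and do.call(rbind, ...) stacks the results:

```r
# Hypothetical stand-in for the table of files; column names follow the mapply call above.
files_sorted <- data.frame(
  file   = c("a.csv", "b.csv"),
  AppID  = c("app1", "app2"),
  server = c("srv1", "srv2"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for processFile(): returns one data.frame per call.
process_file <- function(file, tap, srv) {
  data.frame(file = file, tap = tap, srv = srv, stringsAsFactors = FALSE)
}

# One call per row; the i-th call receives the i-th element of each column.
results <- Map(process_file,
               file = files_sorted$file,
               tap  = files_sorted$AppID,
               srv  = files_sorted$server)

# Stack the per-row results into one data.frame.
overall_data <- do.call(rbind, results)
rownames(overall_data) <- NULL
```

Unlike growing overall_data inside a loop, this binds everything once at the end.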

R approach for iterative querying

This is a question about general approach in R. I'm trying to find my way into the R language, but the data types and loop approaches (apply, sapply, etc.) are a bit unclear to me.
What is my target:
Query data from API with parameters from a config list with multiple parameters. Return the data as aggregated data.frame.
First I want to define a list of multiple vectors (columns)
site segment id
google.com Googleuser 123
bing.com Binguser 456
How do I manage such a list of value groups (row by row)? data.frames are column focused, and you can't write a data.frame row by row in an R script. So the only way I found to define this initial config table is a csv, which is really an approach I try to avoid, but I can't find a way to make it more elegant.
Now I want to query my data, lets say with this function:
query.data <- function(site, segment, id){
config <- define_request(site, segment, id)
result <- query_api(config)
return(result)
}
This will give me a data.frame as a result, this means every time I query data the same columns are used. So my result should be one big data.frame, not a list of similar data.frames.
Now sapply allows one parameter list plus multiple static parameters. mapply works, but it returns my data in some strange structure that I can't handle or even identify.
In principle the list of data.frames is ok, the data is correct, but it feels cumbersome to me.
Which core concepts of R have I not understood yet? What would be the approach?
If you have a lapply/sapply solution that is returning a list of dataframes with identical columns, you can easily get a single large dataframe with do.call(). do.call() inputs each item of a list as arguments into another function, allowing you to do things such as
big.df <- do.call(rbind, list.of.dfs)
Which would append the component dataframes into a single large dataframe.
In general do.call(rbind,something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something will be some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/dataframe/etc.
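A sketch of the whole flow under stated assumptions: the config table is written row by row inline with read.table(text = ...) instead of a csv file, and query_data() below is a hypothetical stub standing in for the real API call (define_request and query_api from the question are not reproduced). Map() runs one query per config row and do.call(rbind, ...) merges the results:

```r
# Config table typed row by row; read.table parses the inline text.
config <- read.table(text = "
site        segment     id
google.com  Googleuser  123
bing.com    Binguser    456
", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical stub for the real query: returns a data.frame with fixed columns.
query_data <- function(site, segment, id) {
  data.frame(site = site, segment = segment, id = id,
             stringsAsFactors = FALSE)
}

# One query per config row, giving a list of data.frames.
list.of.dfs <- Map(query_data, config$site, config$segment, config$id)

# Merge the list into one big data.frame.
big.df <- do.call(rbind, list.of.dfs)
rownames(big.df) <- NULL
```

Map() is mapply() with simplification turned off, which avoids the hard-to-read structure that plain mapply can return when each call yields a data.frame.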

Joining list of data frames in R

I have this example data
list_1<-list(data.frame(c(1:10)),data.frame(c(11:20)))
list_2<-list(data.frame(c(21:30)),data.frame(c(31:40)))
And I need to join them together to get structure like
list_3<-list(data.frame(c(1:10)),data.frame(c(11:20)),data.frame(c(21:30)),data.frame(c(31:40)))
That is, I need to create one new flat list of data frames, because when I use
list_3<-list(list_1,list_2)
then the first frame in list_1 ends up at list_3[[1]][[1]], which is a problem for me. I need to access this frame as list_3[[1]].
Is there a straightforward way to achieve this?
I have tried some plyr functions like join and join_all, but I still cannot get this done.
Moving some comments to the correct place (answers), the two most common solutions would be:
c(list_1, list_2)
or
append(list_1, list_2)
Since you had already tried:
list(list_1, list_2)
and found that this had created a nested list, you can also unlist the nested list with the argument recursive = FALSE.
unlist(list(list_1, list_2), recursive = FALSE)
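A quick check of the difference, using the lists from the question: list() nests the two inputs, while c() and append() produce one flat list of four data frames.

```r
list_1 <- list(data.frame(c(1:10)), data.frame(c(11:20)))
list_2 <- list(data.frame(c(21:30)), data.frame(c(31:40)))

nested <- list(list_1, list_2)    # length 2: a list of two lists
flat   <- c(list_1, list_2)       # length 4: a list of four data frames
flat_2 <- append(list_1, list_2)  # same result as c() here
flat_3 <- unlist(nested, recursive = FALSE)  # removes one level of nesting

# The first data frame is now directly reachable as flat[[1]].
first_frame <- flat[[1]]
```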

Calculate e.g. a mean in a list with multi-column data.frames

I have a list of several data.frames. Each data.frame has several columns.
By using
mean(mylist$first_dataframe$a)
I can get the mean for a in this one data.frame.
However, I do not know how to calculate it over all the data.frames stored in my list, or only over specific data.frames.
I could use a loop but I was told that
apply() and its variations are better
I tried using several solutions I found via search but somehow it just doesn't work.
I assume I need to use
unlist()
Could you provide an example of how to calculate e.g. a mean for a data structure like mine.
A list with several data.frames containing several columns.
Update:
I'm sorry for the confusion. I wanted the grand mean for a specific column in all dataframes.
Thanks to Thomas for providing a working solution for calculating a grand mean for a specific column in all dataframes, and to psychometriko for providing a useful solution for calculating means over all columns in all dataframes (even when non-numeric data is involved).
Thanks!
Is this what you are looking for?
set.seed(42)
mylist <- list(a=data.frame(foo=rnorm(10),
bar=rnorm(10)),
b=data.frame(foo=rnorm(10),
bar=rnorm(10)),
c=data.frame(foo=rnorm(10),
bar=rnorm(10)))
sapply(do.call("rbind",mylist),mean)
foo bar
0.1163340 -0.1696556
Note: do.call("rbind",mylist) returns something similar to what you referred to above with the unlist function, and then sapply, as referred to by Roland in his answer, just calls the function mean on each component (column) of the data.frame that results from the above do.call function.
Edit: In response to the question of how to deal with non-numeric data.frame components, the below solution admittedly isn't very elegant and I'm sure better ones exist, but here's the first thing I was able to think of:
set.seed(42)
mylist <- list(a=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)),
b=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)),
c=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)))
sapply(do.call("rbind",mylist),function(x) {
if (is.numeric(x)) mean(x)
})
$rand
[1] -0.02470602
$lets
NULL
This basically just creates a custom function that first tests whether each component is numeric and, if it is, returns the mean. If it isn't, it skips it.
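One possible alternative, sketched with small made-up data: drop the non-numeric columns first with Filter(), then take the column means with vapply(). This returns a plain named numeric vector instead of a list containing NULL entries.

```r
# Hypothetical data with one numeric and one character column per frame.
mylist <- list(
  a = data.frame(rand = c(1, 2, 3), lets = c("A", "B", "C")),
  b = data.frame(rand = c(4, 5, 6), lets = c("D", "E", "F"))
)

combined <- do.call("rbind", mylist)

# Keep only the numeric columns of the combined data.frame.
numeric_cols <- Filter(is.numeric, combined)

# Mean of each remaining column, as a named numeric vector.
col_means <- vapply(numeric_cols, mean, numeric(1))
```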
The whole do.call('rbind', List) thing can be quite slow and prone to mishaps. If there is only one column you need the mean for, the best way is:
mean(sapply(mylist, function(X) X$rand))
It's about 10x faster than the do.call method.
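A caveat worth sketching: the sapply() call above relies on every data.frame having the same number of rows, so that sapply() can simplify to a matrix. With unequal row counts it returns a list, which mean() does not handle. Pooling the values with unlist(lapply(...)) works in both cases. The data below is made up for illustration, and the column name rand follows the example above.

```r
# Hypothetical frames with different numbers of rows.
mylist <- list(
  a = data.frame(rand = c(1, 2, 3)),
  b = data.frame(rand = c(4, 5))
)

# Grand mean: pool all values of the column, then average once.
grand_mean <- mean(unlist(lapply(mylist, function(d) d$rand)))

# For comparison: the mean of the per-frame means weights each frame
# equally, so it differs when the frames have different sizes.
mean_of_means <- mean(vapply(mylist, function(d) mean(d$rand), numeric(1)))
```

Here grand_mean is 3 (the mean of 1 to 5), while mean_of_means is 3.25 (the mean of 2 and 4.5).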
