Creating a data frame by applying a function to each element of a vector and combining the results - r

I am working on a project where we frequently work with a list of usernames. We also have a function to take a username and return a dataframe with that user's data. E.g.
users = c("bob", "john", "michael")
get_data_for_user = function(user)
{
data.frame(user=user, data=sample(10))
}
We often:
Iterate over each element of users
Call get_data_for_user to get their data
rbind the results into a single dataerame
I am currently doing this in a purely imperative way:
ret = get_data_for_user(users[1])
for (i in 2:length(users))
{
ret = rbind(ret, get_data_for_user(users[i]))
}
This works, but my impression is that all the cool kids are now using libraries like purrr to do this in a single line. I am fairly new to purrr, and the closest I can see is using map_df to convert the vector of usernames to a vector of dataframes. I.e.
dfs = map_df(users, get_data_for_user)
That is, it seems like I would still be on the hook for writing a loop to do the rbind.
I'd like to clarify whether my solution (which works) is currently considered best practice in R / amongst users of the tidyverse.
Thanks.

That looks right to me - map_df handles the rbind internally (you'll need {dplyr} in addition to {purrr}).
FWIW, purrr::map_dfr() will do the same thing, but the function name is a bit more explicit, noting that it will be binding rows; purrr::map_dfc() binds columns.

I would suggest a slight adjustment:
dfs = map_dfr(users, get_data_for_user)
map_dfr() explicitely states that you want to do a row bind. And I would be inclined to call this best practice when working with purrr.

For the sake of completeness, here are some additional approaches:
using built-in functions
Reduce(rbind, lapply(users, get_data_for_user))
using data.table approach
library(data.table)
rbindlist(lapply(users, get_data_for_user))

Related

R approach for iterative querying

This is a question of a general approach in R, I'm trying to find a way into R language but the data types and loop approaches (apply, sapply, etc) are a bit unclear to me.
What is my target:
Query data from API with parameters from a config list with multiple parameters. Return the data as aggregated data.frame.
First I want to define a list of multiple vectors (colums)
site segment id
google.com Googleuser 123
bing.com Binguser 456
How to manage such a list of value groups (row by row)? data.frames are column focused, you cant write a data.frame row by row in an R script. So the only way I found to define this initial config table is a csv, which is really an approach I try to avoid, but I can't find a way to make it more elegant.
Now I want to query my data, lets say with this function:
query.data <- function(site, segment, id){
config <- define_request(site, segment, id)
result <- query_api(config)
return result
}
This will give me a data.frame as a result, this means every time I query data the same columns are used. So my result should be one big data.frame, not a list of similar data.frames.
Now sapply allows to use one parameter-list and multiple static parameters. The mapply works, but it will give me my data in some crazy output I cant handle or even understand exactly what it is.
In principle the list of data.frames is ok, the data is correct, but it feels cumbersome to me.
What core concepts of R I did not understand yet? What would be the approach?
If you have a lapply/sapply solution that is returning a list of dataframes with identical columns, you can easily get a single large dataframe with do.call(). do.call() inputs each item of a list as arguments into another function, allowing you to do things such as
big.df <- do.call(rbind, list.of.dfs)
Which would append the component dataframes into a single large dataframe.
In general do.call(rbind,something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something will be some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/dataframe/etc.

elegant way to use rbind() on multiple dataframes with similar names?

Currently, i have multiple dataframes with the same name and in running order (foo1, foo2, foo3, foo4, foo5... etc). I am trying to create a large dataframe containing all the rows of the above dataframes with rbind(). Is there an elegant way to do it which would be the equivalent of rbind(foo1, foo2, foo3, foo4, foo5...)?
I have tried do.call(rbind, paste0("foo",i)) where i=c(1,2,3...) to no avail.
There is a solution mentioned here, which is:
do.matrix <- do.call(rbind, lapply( paste0("variable", 1:10) , get) )
However, the answer mysteriously says "That is the wrong way to handle related items. Better to use a list or dataframe, but you will probably find out why in due course."
Why would that be the wrong way to do this, and what would be the "right" way?
Thanks.
Always try to rigorously capture relations between related instances of data, or related data and methods, or related methods. This generally helps ease aggregate manipulation such as your rbind requirement.
For your case, you should have defined your related data.frames as a single list from the beginning:
foo <- list(data.frame(...), data.frame(...), ... );
And then your requirement could be satisfied thusly:
do.call(rbind, foo );
If it's too late for that, then the solution involving repeated calls to get(), as described in the article to which you linked, can do the job.

Convert data frame to list

I am trying to go from a data frame to a list structure in R (and I know technically a data frame is a list). I have a data frame containing reference chemicals and their mechanisms different targets. For example, estrogen is an estrogen receptor agonist. What I would like is to transform the data frame to a list, because I am tired of typing out something like:
refchem$chemical_id[refchem$target=="AR" & refchem$mechanism=="Agonist"]
every time I need to access the list of specific reference chemicals. I would much rather access the chemicals by:
refchem$AR$Agonist
I am looking for a general answer, even though I have given a simplified example, because not all targets have all mechanisms.
This is really easy to accomplish with a loop:
example <- data.frame(target=rep(c("t1","t2","t3"),each=20),
mechan=rep(c("m1","m2"),each=10,3),
chems=paste0("chem",1:60))
oneoption <- list()
for(target in unique(example$target)){
oneoption[[target]] <- list()
for(mech in unique(example$mechan)){
oneoption[[target]][[mech]] <- as.character(example$chems[ example$target==target & example$mechan==mech ])
}
}
I am just wondering if there is a more clever way to do it. I tried playing around with lapply and did not make any progress.
Using split:
split(refchem, list(refchem$target, refchem$mechanism))
Should do the trick.
The new way to access would be refchem$AR.Agonist
If you make a keyed data.table instead, ...
you'll still have all the data in one data.frame (instead of a possibly-nested list of many);
you may find iterating over these subsets nicer; and
the syntax is pretty clean:
To access a subset:
DT[.('AR','Agonist')]
To do something for each group, that will be rbinded together in the result:
DT[,{do stuff},by=key(DT)]
Similar to aggregate(), any list of vectors of the correct length can go into the by, not just the key.
Finally, DT came from...
require(data.table)
DT <- data.table(refchem,key=c('target','mechanism'))
You can also use a plyr function:
library(plyr)
dlply(example, .(target, mechan))
It has the added advantage of using a function to process the data, if needed (there's an implicit identity in the above).

Saving many subsets as dataframes using "for"-loops

this question might be very simple, but I do not find a good way to solve it:
I have a dataset with many subgroups which need to be analysed all-together and on their own. Therefore, I want to use subsets for the groups and use them for the later analysis. As well, the defintion of the subsets as the analysis should be partly done with loops in order to save space and to ensure that the same analysis has been done with all subgroups.
Here is an example of my code using an example dataframe from the boot package:
data(aids)
qlist <- c("1","2","3","4")
for (i in length(qlist)) {
paste("aids.sub.",qlist[i],sep="") <- subset(aids, quarter==qlist[i])
}
The variable which contains the subgroups in my dataset is stored as a string, therefore I added the qlist part which would be not required otherwise.
Make a list of the subsets with lapply:
lapply(qlist, function(x) subset(aids, quarter==x))
Equivalently, avoiding the subset():
lapply(qlist, function(x) aids[aids$quarter==x,])
It is likely the case that using a list will make the subsequent code easier to write and understand. You can subset the list to get a single data frame (just as you can use one of the subsets, as created below). But you can also iterate over it (using for or lapply) without having to construct variable names.
To do the job as you are asking, use assign:
for (i in qlist) {
assign(paste("aids.sub.",i,sep=""), subset(aids, quarter==i))
}
Note the removal of the length() function, and that this is iterating directly over qlist.

returning different data frames in a function - R

Is it possible to return 4 different data frames from one function?
Scenario:
I am trying to read a file, parse it, and return some parts of the file.
My function looks something like this:
parseFile <- function(file){
carFile <- read.table(file, header=TRUE, sep="\t")
carNames <- carFile[1,]
carYear <- colnames(carFile)
return(list(carFile,carNames,carYear))
}
I don't want to have to use list(carFile,carNames,carYear). Is there a way return the 3 data frames without returning them in a list first?
R does not support multiple return values. You want to do something like:
foo = function(x,y){return(x+y,x-y)}
plus,minus = foo(10,4)
yeah? Well, you can't. You get an error that R cannot return multiple values.
You've already found the solution - put them in a list and then get the data frames from the list. This is efficient - there is no conversion or copying of the data frames from one block of memory to another.
This is also logical, the return from a function should conceptually be a single entity with some meaning that is transferred to whatever function is calling it. This meaning is also better conveyed if you name the returned values of the list.
You could use a technique to create multiple objects in the calling environment, but when you do that, kittens die.
Note in your example carYear isn't a data frame - its a character vector of column names.
There are other ways you could do that, if you really really want, in R.
assign('carFile',carFile,envir=parent.frame())
If you use that, then carFile will be created in the calling environment. As Spacedman indicated you can only return one thing from your function and the clean solution is to go for the list.
In addition, my personal opinion is that if you find yourself in such a situation, where you feel like you need to return multiple dataframes with one function, or do something that no one has ever done before, you should really revisit your approach. In most cases you could find a cleaner solution with an additional function perhaps, or with the recommended (i.e. list).
In other words the
envir=parent.frame()
will do the job, but as SpacedMan mentioned
when you do that, kittens die
The zeallot package does what you need in a similar that Python can unpack variables from a function. Reproducible example below.
parseFile <- function(){
carMPG <- mtcars$mpg
carName <- rownames(mtcars)
carCYL <- mtcars$cyl
return(list(carMPG,carName,carCYL))
}
library(zeallot)
c(myFile, myName, myYear) %<-% parseFile()

Resources