R: combining data-frames row-wise when some variables are lists/dfs?

Is there a function in R that would let me combine/concatenate data-frames when some variables are either lists or data frames themselves? I've tried rbind(), rbindlist(), rbind.data.frame() and bind_rows(), and they all throw errors, e.g. "duplicate 'row.names' are not allowed" or "Argument 4 can't be a list containing data frames".
After looking into it a bit, it seems that none of those functions support nested data-frames. Is there a function that would work for me? Or is there something (other than a for-loop that adds row by row) that I could do?
As a bit of background, I'm making API calls to a database and can get only 40 results at a time, so I am looping through those via multiple calls, and I want to combine the results without any loss of information. I am using jsonlite::fromJSON to convert to a df: could I/should I combine the info in JSON format first and then convert to a df?
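For what it's worth, jsonlite ships a function aimed at exactly this paging scenario: rbind_pages() binds a list of data frames that may contain nested data-frame columns, which plain rbind() rejects. A minimal sketch with simulated pages (the real API call is not shown in the question):

```r
library(jsonlite)

# Simulated pages: each "page" is a data frame with a nested
# data-frame column, the structure that makes rbind() fail.
page1 <- data.frame(id = 1:2)
page1$details <- data.frame(score = c(10, 20))
page2 <- data.frame(id = 3:4)
page2$details <- data.frame(score = c(30, 40))

# rbind_pages() was written for paginated fromJSON() results and
# handles nested data-frame columns.
combined <- rbind_pages(list(page1, page2))
nrow(combined)  # 4
```

In the real loop, each page would come from jsonlite::fromJSON on one 40-result API response, collected into a list and combined once at the end.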

Related

Looping through R dataframe columns

I want to loop over a dataframe's columns and use them for something else (in this case, performing a chi-squared test on all my features).
for(i in 1:ncol(wdbc)){
  chisq.test(wdbc$diagnosis, wdbc[,i])
}
I've tried referring to the features in all kinds of ways, for example:
chisq.test(wdbc$diagnosis,wdbc[i]) ##looping through colnames(wdbc)
or
chisq.test(wdbc$diagnosis,wdbc$i) ##looping through colnames(wdbc)
but can't seem to solve the problem.
wdbc[i] will return a dataframe (rather than a vector), and wdbc$i doesn't work to loop through column names.
wdbc[,i] should work if wdbc is actually a dataframe. However, I've encountered an error in this type of situation before when my dataframe is not actually a dataframe but a tibble. The issue is that wdbc[,i] will still be a tibble rather than a vector. Try converting it to a dataframe with as.data.frame(wdbc).
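To make that concrete, here is a minimal runnable sketch using made-up stand-in data (the real wdbc dataset isn't shown in the question); lapply collects the test results in a list instead of discarding them as the for-loop does:

```r
# Hypothetical stand-in for wdbc, with a diagnosis column and two features.
wdbc <- data.frame(
  diagnosis = factor(c("M", "B", "M", "B", "M", "B")),
  feat1     = factor(c("a", "a", "b", "b", "a", "b")),
  feat2     = factor(c("x", "y", "x", "y", "y", "x"))
)

# Guard against tibbles: as.data.frame() ensures wdbc[, i] drops
# to a vector rather than staying a one-column tibble.
wdbc <- as.data.frame(wdbc)

# Test diagnosis against every other column; results land in a list.
results <- lapply(2:ncol(wdbc), function(i) {
  chisq.test(wdbc$diagnosis, wdbc[, i])
})
length(results)  # 2
```

With so few rows chisq.test will warn that the approximation may be incorrect; that is expected for this toy data.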

How to use blank df and lists to speed up processing time in R

I know that creating a blank dataframe or list prior to populating it is a good thing to speed up processing time in R, but I'm having trouble executing it. Generally, all I would like to do is create a blank list of dataframes in which a map function fills in after completing some filtering. Below I'll recreate a simplified example to help explain what I'm trying to accomplish.
library(tidyverse)
library(purrr)
library(dplyr)
The code to create the lists of dataframes below is too much to show, but essentially I have a list of 192 dataframes that each contain the same type of information, though the data in each dataframe varies depending on which list element it is.
"ListofDF1" is a list of 192 dataframes, each containing 468 rows and 27 columns of data. This list of dfs is created using a series of map functions.
Next, I have a pmap function that performs many tasks. Too many to show here, but below I'll generally show what I'm trying to accomplish.
return <-
  pmap(inputs, function("variables that are contained in inputs dataframe") {
    ListofDF2 <- map(ListofDF1, ~filter(., "series of filters")) %>%
      map(., ~data.frame(.) %>%
            select(column1))
  })
To summarize, inside the pmap function, a map function is performed on ListofDF1 (192 times because ListofDF1 contains 192 dataframes) to filter various metrics in the ListofDF1. The result is ListofDF2, which is a list of 192 dataframes. Note that each dataframe within the list of dfs contains only 1 column (due to select(column1)). But the number of rows in each dataframe are NOT consistent as they are dependent on the filtering that occurs.
I would like to try to improve the speed of my overall pmap function because it is cycling through several thousand times and I believe that creating a blank ListofDF2 may help.
Therefore, does anyone have any suggestions on how to create the blank ListofDF2 list of dataframes and then populate it using the filtering map function? To clarify, my existing code works just fine. I am just trying to improve speed and therefore efficiency.
Additionally, I would also like to create a blank "return" list for the pmap function to populate. But one step at a time.
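A blank list of a known length can be preallocated with vector("list", n). One caveat: purrr's map() and pmap() already build and return their own lists internally, so preallocation mainly pays off when you fill a list element-by-element inside an explicit for-loop. A sketch, assuming n = 192 as described above:

```r
n <- 192

# Preallocated blank lists; each element starts as NULL.
ListofDF2 <- vector("list", n)   # to hold the filtered dataframes
ret       <- vector("list", n)   # blank "return" list for later

# Filling element-by-element avoids growing the list on each iteration.
for (i in seq_len(n)) {
  # Placeholder filler; the real code would assign the filtered df here.
  ListofDF2[[i]] <- data.frame(column1 = numeric(0))
}
length(ListofDF2)  # 192
```

If the existing code already uses map() rather than a growing for-loop, the speedup from preallocation is likely to be small, and profiling the filter step itself may be more productive.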

R approach for iterative querying

This is a question about a general approach in R. I'm trying to find my way into the R language, but the data types and loop approaches (apply, sapply, etc.) are still a bit unclear to me.
What is my target:
Query data from API with parameters from a config list with multiple parameters. Return the data as aggregated data.frame.
First I want to define a list of multiple vectors (columns):
site        segment     id
google.com  Googleuser  123
bing.com    Binguser    456
How do I manage such a list of value groups (row by row)? data.frames are column-focused; you can't write a data.frame row by row in an R script. So the only way I found to define this initial config table is a csv, which is an approach I'd really like to avoid, but I can't find anything more elegant.
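A config table like this can in fact live in the script itself; a sketch with the two rows above (column-wise in base R, or row-wise with tibble::tribble if that package is acceptable):

```r
# Base R: columns defined as parallel vectors.
config <- data.frame(
  site    = c("google.com", "bing.com"),
  segment = c("Googleuser", "Binguser"),
  id      = c(123, 456),
  stringsAsFactors = FALSE
)

# Row-by-row alternative, closer to how the table reads:
# config <- tibble::tribble(
#   ~site,        ~segment,     ~id,
#   "google.com", "Googleuser", 123,
#   "bing.com",   "Binguser",   456
# )
nrow(config)  # 2
```

Either way the result is an ordinary data frame whose rows can drive the query loop.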
Now I want to query my data, lets say with this function:
query.data <- function(site, segment, id){
  config <- define_request(site, segment, id)
  result <- query_api(config)
  return(result)
}
This will give me a data.frame as a result, which means every time I query data the same columns are used. So my result should be one big data.frame, not a list of similar data.frames.
Now, sapply lets me use one parameter list plus multiple static parameters. mapply works, but it gives me my data in some crazy output I can't handle or even understand exactly what it is.
In principle the list of data.frames is ok, the data is correct, but it feels cumbersome to me.
What core concepts of R I did not understand yet? What would be the approach?
If you have a lapply/sapply solution that is returning a list of dataframes with identical columns, you can easily get a single large dataframe with do.call(). do.call() inputs each item of a list as arguments into another function, allowing you to do things such as
big.df <- do.call(rbind, list.of.dfs)
which appends the component dataframes into a single large dataframe.
In general do.call(rbind,something) is a good trick to keep in your back pocket when working with R, since often the most efficient way to do something will be some kind of apply function that leaves you with a list of elements when you really want a single matrix/vector/dataframe/etc.
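Putting the pieces together for the querying question above, a runnable sketch (query.data here is a stand-in for the real API call, and define_request/query_api are replaced by a dummy body):

```r
# Stand-in for the real API function: returns a one-row data frame
# with the same columns every time, as the question describes.
query.data <- function(site, segment, id) {
  data.frame(site = site, segment = segment, id = id, hits = nchar(site))
}

config <- data.frame(
  site    = c("google.com", "bing.com"),
  segment = c("Googleuser", "Binguser"),
  id      = c(123, 456)
)

# Map() is mapply() with SIMPLIFY = FALSE: it keeps the results as a
# plain list of data frames instead of collapsing them into a matrix,
# which is the "crazy output" mapply produces by default.
list.of.dfs <- Map(query.data, config$site, config$segment, config$id)
big.df <- do.call(rbind, list.of.dfs)
nrow(big.df)  # 2
```

The mapply confusion usually comes from its default SIMPLIFY = TRUE; using Map (or mapply(..., SIMPLIFY = FALSE)) followed by do.call(rbind, ...) gives the single big data.frame directly.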

Passing undetermined number of arguments in R to the order() function

I've gathered that the order function in R can be used to sort rows of a data frame/matrix by one or more columns of that object. The columns are passed as separate arguments to order, and order can handle a variable number of arguments.
I would like to sort a data frame by all its columns, but I don't know the names or the number of columns in the data frame beforehand. In Python, one can unpack a list of objects as the arguments to a function (e.g. zip(*mylist) is zip(mylist[0], mylist[1], etc...)). Is there a similar way to do so in R? It would be nice to "unpack" the columns of a matrix when I call order.
Is there another way in R to sort by multiple columns besides passing an arbitrary number of parameters?
more thoughts:
It seems like I cannot just package multiple unnamed items into a single object to pass to order. Nor can I think of a way to use a for loop, apply, or do.call to make arbitrary numbers of objects. There's something here: http://r.789695.n4.nabble.com/custom-sort-td888802.html.
Or... should I write a for loop to call order on each column, starting with the least priority one and ending with the column that would've been the first argument to order, reordering the rows each time and making sure that order sorts stably?
Thanks.
In Python, calling fun(*args, **kwargs) specifies the list of positional arguments (*args) and the arguments to be matched by name (**kwargs).
A similar call in R is do.call(fun, arglist). Unlike Python, you can't mix regular and special arguments (e.g. fun(a=1, *args)), and the second argument to do.call can have elements that are matched by name or position (e.g. do.call(fun, list(2, x=3))).
To complete the example: since data.frames inherit from lists, you can simply call do.call(order, df) to order on all the columns sequentially (as long as none of the names of the fields in your data.frame match order's formal arguments 'na.last' and 'decreasing').
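A small runnable illustration of that idea, with a hypothetical two-column data frame:

```r
df <- data.frame(a = c(2, 1, 2), b = c("x", "y", "w"))

# do.call() unpacks the data frame's columns as separate arguments to
# order(), the R analogue of Python's order(*columns).
sorted <- df[do.call(order, df), ]
sorted$a  # 1 2 2  (ties on a are broken by b)
```

Rows with equal values in the first column are ordered by the second, and so on through all the columns.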

How to order a matrix by all columns

Ok, I'm stuck in a dumbness loop. I've read through the helpful ideas at How to sort a dataframe by column(s)?, but need one more hint. I'd like a function that takes a matrix with an arbitrary number of columns, and sorts by all columns in sequence. E.g., for a matrix foo with N columns,
does the equivalent of foo[order(foo[,1],foo[,2],...foo[,N]),] . I am happy to use a with or by construction, and if necessary define the colnames of my matrix, but I can't figure out how to automate the collection of arguments to order (or to with) .
Or, I should say, I could build the entire bloody string with paste and then call it, but I'm sure there's a more straightforward way.
The most elegant (for certain values of "elegant") way would be to turn it into a data frame, and use do.call:
foo[do.call(order, as.data.frame(foo)), ]
This works because a data frame is just a list of variables with some associated attributes, and can be passed to functions expecting a list.
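A quick check with a small example matrix (values chosen so the second column must break a tie in the first):

```r
foo <- matrix(c(2, 1, 2,
                3, 5, 1), ncol = 2)

# as.data.frame() turns the matrix columns into list elements,
# which do.call() then unpacks as order()'s arguments.
sorted <- foo[do.call(order, as.data.frame(foo)), ]
sorted[, 1]  # 1 2 2
```

The tied first-column values (both 2) end up ordered by the second column, confirming the columns are used in sequence.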
