Is there a way to bind two data.tables by reference? - r

When I have to bind two data.tables (or data.frames), I use:
data.full <- rbindlist(list(data.v1, data.v2), fill = TRUE)
rm(data.v1)
rm(data.v2)
gc()
This works, but the data is duplicated in memory for a moment, and this can be a huge problem if the data is too big. Is there a way to do this without copying the data?
I would strongly prefer a solution that does not involve binding the data in external software, such as an SQL database. A perfect solution would be something like data.table::setorder(data, id), which replaces data <- data[order(id)].

Related

How do I bind two disk frames together?

I have two disk frames, each holding about 20GB worth of files.
They are too big to merge as data tables because the process requires more memory than I have available. I tried using this code: output <- rbindlist(list(df1, df2))
The wrinkle is that I'd also like to run unique, since there might be duplicates in my data.
Can I use the same code with rbindlist on two disk frames?
Yeah. You just do rbindlist.disk.frame(list(df1, df2))
I need to implement bind_rows at some point too!

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, the my_values object contains over 500K entries (which is why I don't provide actual data here). This is clearly inefficient, because the values are basically put into one IN clause, and the query essentially hangs. Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other columns, you can use a range filter instead: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db command per value using purrr::map and then use dplyr::bind_rows() at the end. This way you'll have nicely piped code that adapts if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
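A minimal base R sketch of the batch idea, under stated assumptions: filter_in_batches and fetch_from_local are hypothetical names, my_table is a local data frame standing in for the database table, and the batch size of 1000 is only a starting point (Oracle rejects IN lists with more than 1000 literal values, error ORA-01795).

```r
# Hypothetical helper: split the keys into batches, fetch each batch, stack the results.
filter_in_batches <- function(values, fetch_batch, batch_size = 1000L) {
  # Batch number for every key: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_along(values) / batch_size)
  batches <- split(values, batch_id)
  results <- lapply(batches, fetch_batch)
  # unname() keeps rbind from prefixing row names with the batch number
  do.call(rbind, unname(results))
}

# Stand-in for the database table (made-up data)
my_table <- data.frame(FIELD = 1:10000, VALUE = sqrt(1:10000))

# Stand-in for one query per batch; a real version would hit the database here
fetch_from_local <- function(batch) {
  my_table[my_table$FIELD %in% batch, , drop = FALSE]
}

my_values <- data.frame(FIELD = seq(2L, 5000L, by = 2L))
result <- filter_in_batches(my_values$FIELD, fetch_from_local, batch_size = 1000L)
```

With dbplyr, the fetch function could presumably be something like function(batch) con %>% tbl('MY_TABLE') %>% filter(FIELD %in% batch) %>% collect(), though that is untested against the database in the question.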

How to use daply (from plyr) on 2billion rows using less memory

Does anyone know how to apply the following function, which converts a 3-column table into a matrix, to a file that has 2 billion rows (with less than 10GB of memory)?
Here x is the 1st, y the 2nd and z the 3rd column.
library(plyr)
daply(a, .(x, y), function(x) x$z)
If you cannot load all the tuples at once:
I know this is not the answer you are looking for: use SQLite.
The problem with R is that it must hold the entire data frame in memory at once. If you don't have enough memory, it simply can't continue.
SQLite is way smarter than R at computing aggregates. Perhaps its most important feature is that it manages the available memory itself and, where it can, avoids reading all the elements at once. See this for details on how to do it.
http://www.r-bloggers.com/using-sqlite-in-r/
If SQLite does not support the aggregate you want, you can create it yourself (see user defined functions in SQLite).
Alternatively, you can try to partition your data (outside R), so you can aggregate in stages. But that will still require some sort of program that can read and process the files in less than the available memory. Unix/MacOS/Linux sort is one of those utilities that can deal with more-than-available-memory data. It might be useful.
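To illustrate what working in stages can look like, here is a rough base R sketch that reads a file in chunks through an open connection and fills the result matrix as it goes. It is not the SQLite route recommended above, and it rests on assumptions that are not in the question: the file is whitespace separated without a header, the x and y levels are known beforehand, each (x, y) pair occurs once, and the final matrix itself fits in memory. The toy file and the chunk size are made up.

```r
# Toy file standing in for the large one (assumed layout: x y z, no header)
path <- tempfile(fileext = ".txt")
toy <- expand.grid(x = c("a", "b", "c"), y = c("p", "q"), stringsAsFactors = FALSE)
toy$z <- seq_len(nrow(toy))
write.table(toy, path, row.names = FALSE, col.names = FALSE, quote = FALSE)

# Assumed to be known in advance
x_levels <- c("a", "b", "c")
y_levels <- c("p", "q")
m <- matrix(NA_real_, nrow = length(x_levels), ncol = length(y_levels),
            dimnames = list(x_levels, y_levels))

con <- file(path, open = "r")
chunk_size <- 4L  # tiny for the demo; use a much larger value on real data
repeat {
  # read.table() continues where the previous call stopped and raises an
  # error once no lines are left; any error is treated as end of input here
  chunk <- tryCatch(
    read.table(con, nrows = chunk_size, col.names = c("x", "y", "z"),
               colClasses = c("character", "character", "numeric")),
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break
  # A two-column character matrix indexes the matrix by its dimnames
  m[cbind(chunk$x, chunk$y)] <- chunk$z
}
close(con)
```

Only one chunk plus the result matrix is in memory at any time, so the chunk size can be tuned to the memory available.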

File organization - how to handle different combinations of filters on one data.frame efficiently?

I currently do a lot of descriptive analysis in R. I always work with a data.table like df
net <- seq(1,20,by=2)
gross <- seq(2,20,by=2)
color <- c("green", "blue", "white")
height <- c(170,172,180,188)
library(data.table)
df <- data.table(net,gross,color,height)
In order to obtain results, I do apply a lot of filters.
Sometimes I use one filter, sometimes I use a combination of multiple filters, e.g.:
df[color=="green" & height>175]
In my real data.table, I have 7 columns and all kind of filter-combinations.
Since I always address the same data.table, I'd like to find the most efficient way to filter the data.
So far, my files are organized like this (bottom-up):
execution level: multiple R scripts, each with a very specific job (no interaction between them), that calculate results and write them to an Excel file using XLConnect
source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is necessary in case I add/remove files on the execution level.
filter files: read the data.table and apply one or multiple filters, as shown above with df_green_high. By filtering, filter files create a new data.table and source the "source file" with this new filtered table.
My current challenge is that I have too many filter files. With 7 variables, there is such a large number of filter combinations that I'll get lost sooner or later.
How can I do my analysis more efficiently (i.e. reduce the number of "filter files")?
How can I conveniently name the exported files according to the filters used?
I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case, I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On Stack Overflow, I also read about creating a package, but I am not sure if this is reasonable.
I usually do it like this:
create a list called, say, "my_case_list"
filter the data and do the computation on the filtered data
add a column called "case" to each filtered dataset. Fill this column with a descriptive string, e.g. 'case 1: color=="green" & height>175'
put this data to my_case_list
convert list to data.frame like object
export results to sql server
import results from sql server to Excel Pivot table
make sense of results
Automate the process as much as possible.
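The first few steps above could be sketched with data.table roughly like this. The toy table, the case names and the summary columns are made up for illustration, and keeping the filters as quoted expressions in a named list is just one possible design.

```r
library(data.table)

# Toy table with equal-length columns (values invented for the example)
df <- data.table(
  net    = c(1, 3, 5, 7, 9, 11),
  gross  = c(2, 4, 6, 8, 10, 12),
  color  = c("green", "blue", "white", "green", "blue", "green"),
  height = c(170, 172, 180, 188, 176, 181)
)

# One entry per case: a short name plus the unevaluated filter expression
cases <- list(
  green_tall = quote(color == "green" & height > 175),
  blue       = quote(color == "blue"),
  all_tall   = quote(height > 175)
)

my_case_list <- lapply(names(cases), function(case_name) {
  filtered <- df[eval(cases[[case_name]])]             # apply the stored filter
  res <- filtered[, .(n = .N, mean_net = mean(net))]   # placeholder computation
  res[, case := case_name]                             # tag the result with its case
  res
})

# One table with a "case" column, ready for export
results <- rbindlist(my_case_list)

# The case names can double as file name stems (hypothetical naming scheme)
out_files <- paste0("results_", names(cases), ".xlsx")
```

Adding a new filter combination then means adding one line to cases instead of creating another filter file.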

Convert data frame to list

I am trying to go from a data frame to a list structure in R (and I know that technically a data frame is a list). I have a data frame containing reference chemicals and their mechanisms for different targets. For example, estrogen is an estrogen receptor agonist. What I would like is to transform the data frame to a list, because I am tired of typing out something like:
refchem$chemical_id[refchem$target=="AR" & refchem$mechanism=="Agonist"]
every time I need to access the list of specific reference chemicals. I would much rather access the chemicals by:
refchem$AR$Agonist
I am looking for a general answer, even though I have given a simplified example, because not all targets have all mechanisms.
This is really easy to accomplish with a loop:
example <- data.frame(target=rep(c("t1","t2","t3"),each=20),
                      mechan=rep(c("m1","m2"),each=10,3),
                      chems=paste0("chem",1:60))
oneoption <- list()
for(target in unique(example$target)){
  oneoption[[target]] <- list()
  for(mech in unique(example$mechan)){
    oneoption[[target]][[mech]] <- as.character(example$chems[ example$target==target & example$mechan==mech ])
  }
}
I am just wondering if there is a more clever way to do it. I tried playing around with lapply and did not make any progress.
Using split:
split(refchem, list(refchem$target, refchem$mechanism))
Should do the trick.
If the result is assigned back to refchem, the new way to access a subset would be refchem$AR.Agonist
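If the nested refchem$AR$Agonist form from the question is wanted, split can be applied twice. A small sketch on made-up data (the chemical ids are invented, and refchem_list is just an illustrative name):

```r
# Made-up reference chemicals; note that ER has no Antagonist
refchem <- data.frame(
  chemical_id = c("chem1", "chem2", "chem3", "chem4", "chem5"),
  target      = c("AR", "AR", "AR", "ER", "ER"),
  mechanism   = c("Agonist", "Agonist", "Antagonist", "Agonist", "Agonist"),
  stringsAsFactors = FALSE
)

# Outer split by target, inner split of the ids by mechanism
refchem_list <- lapply(
  split(refchem, refchem$target),
  function(d) split(d$chemical_id, d$mechanism)
)

ar_agonists <- refchem_list$AR$Agonist

# Flat variant with names like "AR.Agonist"; drop = TRUE omits empty combinations
refchem_flat <- split(refchem$chemical_id,
                      list(refchem$target, refchem$mechanism),
                      drop = TRUE)
```

Combinations that do not occur, such as refchem_list$ER$Antagonist here, simply come back as NULL.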
If you make a keyed data.table instead, ...
you'll still have all the data in one data.frame (instead of a possibly-nested list of many);
you may find iterating over these subsets nicer; and
the syntax is pretty clean:
To access a subset:
DT[.('AR','Agonist')]
To do something for each group, that will be rbinded together in the result:
DT[,{do stuff},by=key(DT)]
Similar to aggregate(), any list of vectors of the correct length can go into the by, not just the key.
Finally, DT came from...
require(data.table)
DT <- data.table(refchem,key=c('target','mechanism'))
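Put together on a small made-up refchem (the ids are invented and .N is only a placeholder for "do stuff"), the keyed approach might look like this:

```r
library(data.table)

# Made-up reference chemicals
refchem <- data.frame(
  chemical_id = c("chem1", "chem2", "chem3", "chem4", "chem5"),
  target      = c("AR", "AR", "AR", "ER", "ER"),
  mechanism   = c("Agonist", "Agonist", "Antagonist", "Agonist", "Agonist"),
  stringsAsFactors = FALSE
)

# Keying sorts the table by target, then mechanism
DT <- data.table(refchem, key = c("target", "mechanism"))

# Subset by key values, returning only the ids
ar_agonists <- DT[.("AR", "Agonist"), chemical_id]

# One result row per target/mechanism combination present in the data
counts <- DT[, .(n_chems = .N), by = key(DT)]
```

The whole table stays in one object, and only combinations that actually occur appear in counts.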
You can also use a plyr function:
library(plyr)
dlply(example, .(target, mechan))
It has the added advantage of using a function to process the data, if needed (there's an implicit identity in the above).

Resources