How to efficiently merge these data.tables - r

I want to create a certain data.table to be able to check for missing data.
Missing data in this case does not mean there will be an NA; the entire row is simply left out. So I need to be able to see, for a certain time-dependent column, which values are missing for which level of another column. It is also important to know whether many missing values occur together or whether they are spread across the dataset.
So I have a 6,000,000x5 data.table (call it TableA) containing the time-dependent variable, an ID for the level, and the value N which I would like to add to my final table.
I have another table (TableB) which is 207x2. It couples the IDs for the factor to the columns in TableC.
TableC is 1,500,000x207, of which each of the 207 columns corresponds to an ID according to TableB, and the rows correspond to the time-dependent variable in TableA.
These tables are large, and although I recently acquired extra RAM (now totalling 8GB), my computer keeps swapping TableC out; for each write it has to be brought back in, and it gets swapped out again afterwards. This swapping is what is consuming all my time: about 1.6 seconds per row of TableA, and as TableA has 6,000,000 rows, this operation would take more than 100 days running non-stop.
Currently I am using a for-loop to loop over the rows of TableA. With no operation in the body, this for-loop finishes almost instantly. I made a one-line command that looks up the correct column and row number for TableC from TableA and TableB and writes the value from TableA into TableC.
I broke up this one-liner to do a system.time analysis, and each step takes about 0 seconds except writing to the big TableC.
This showed that writing the value to the table is the most time-consuming step; looking at my memory use, I can see a huge chunk appearing whenever a write happens, and it disappears as soon as the write is finished.
TableA <- data.table("Id"=round(runif(200, 1, 100)), "TimeCounter"=round(runif(200, 1, 50)), "N"=round(rnorm(200, 1, 0.5)))
TableB <- data.table("Id"=c(1:100),"realID"=c(100:1))
TSM <- matrix(0,ncol=nrow(TableB), nrow=50)
TableC <- as.data.table(TSM)
rm(TSM)
for (row in 1:nrow(TableA))
{
  TableCcol <- TableB[realID==TableA[row,Id],Id]
  TableCrow <- (TableA[row,TimeCounter])
  val <- TableA[row,N]
  TableC[TableCrow,TableCcol] <- val
}
Can anyone advise me on how to make this operation faster, by preventing the memory swap at the last step in the for-loop?
Edit: On the advice of @Arun I took some time to develop some dummy data to test on. It is now included in the code given above.
I did not include the desired results because the dummy data is random and the routine does work; it's the speed that is the problem.

Not entirely sure about the results, but give the dplyr/tidyr packages a shot, as they seem to be more memory efficient than for loops.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
TableC <- TableC %>% gather(tableC_id, value, 1:207)
This turns TableC from 1,500,000x207 into a long-format 310,500,000x2 table with 'tableC_id' and 'value' columns.
TableD <- TableA %>%
  left_join(TableB, by = c("LevelID" = "TableB_ID")) %>%
  left_join(TableC, by = c("TableB_value" = "TableC_id"))
These are a couple of packages I've been using of late, and they seem to be very efficient; but the data.table package is built specifically for managing large tables, so there could be useful functions there. I'd also take a look at sqldf, which allows you to query your data.frames via SQL commands.

Rethinking my problem I came to a solution which works much faster.
The thing is that it does not follow from the question posed above, because I already did a couple of steps to come to the situation described in my question.
Enter TableX, from which I aggregated TableA. TableX contains Ids and TimeCounters and much more; that's why I thought it would be best to create a smaller table containing only the information I needed.
TableX also contains only the relevant times, while in my question I was using a complete time series starting from the beginning of time (01-01-1970 ;)). It was much smarter to use the levels in my TimeCounter column to build TableC.
Also, I was forcing myself to set values individually, while merging is a lot faster in data.table. So my advice is: whenever you need to set a lot of values, try to find a way to merge instead of copying them individually.
Solution:
# Create a table with time on the row dimension by just using the TimeCounters we find in our original data.
TableC <- data.table(TimeCounter=as.numeric(levels(factor(TableX[,TimeCounter]))))
setkey(TableC,TimeCounter) # important to set the correct key for merge.
# Loop over all unique Ids (maybe this can be reworked into something *apply()ish)
for (i in levels(factor(TableX[,Id])))
{
  # Count how many samples we have for this Id and TimeCounter
  TableD <- TableX[Id==i,.N,by=TimeCounter]
  setkey(TableD,TimeCounter) # set key for merge
  # Merge with Id on the column dimension
  TableC[TableD,paste("somechars",i,sep=""):=N]
}
There could be steps missing in the TimeCounter, so I still have to check for gaps in TableC and insert the rows that were missing for all Ids (see the sketch below). Then I can finally check where my data gaps are and how big they are.
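To check for those gaps, here is a minimal sketch of one way to do it, assuming TimeCounter should run in steps of 1 from its minimum to its maximum (the step size and the helper table FullTime are my own assumptions):
# Build the complete TimeCounter sequence and join TableC onto it,
# so missing TimeCounters show up as rows full of NAs
FullTime <- data.table(TimeCounter = seq(min(TableC$TimeCounter), max(TableC$TimeCounter), by = 1))
setkey(FullTime, TimeCounter)
TableC <- TableC[FullTime] # keyed join on TimeCounter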

Related

Writing Apache Arrow dataset in batches in R

I'm wondering what the correct approach is to creating an Apache Arrow multi-file dataset as described here in batches. The tutorial explains quite well how to write a new partitioned dataset from data in memory, but is it possible to do this in batches?
My current approach is to simply write the datasets individually, but to the same directory. This appears to be working, but I have to imagine this causes issues with the metadata that powers the feature. Essentially, my logic is as follows (pseudocode):
library(dplyr) # for group_by; write_dataset uses the groups as the partitioning

data_ids <- c(123, 234, 345, 456, 567)
# write data in batches
for (id in data_ids) {
  ## assume this is some complicated computation that returns 1,000,000 records
  df <- data_load_helper(id)
  df <- group_by(df, col_1, col_2, col_3)
  arrow::write_dataset(df, "arrow_dataset/", format = "arrow")
}
# read in data
dat <- arrow::open_dataset("arrow_dataset/", format="arrow", partitioning=c("col_1", "col_2", "col_3"))
# check some data
dat %>%
  filter(col_1 == 123) %>%
  collect()
What is the correct way of doing this? Or is my approach correct? Loading all of the data into one object and then writing it at once is not viable, and certain chunks of the data will update at different periods over time.
TL;DR: Your solution looks pretty reasonable.
There may be one or two issues you run into. First, if your batches do not all have identical schemas then you will need to make sure to pass in unify_schemas=TRUE when you are opening the dataset for reading. This could also become costly and you may want to just save the unified schema off on its own.
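For example (a sketch, reusing the path and format from the question):
# Sketch: reconcile differing schemas across the batch files when opening
dat <- arrow::open_dataset("arrow_dataset/", format = "arrow", unify_schemas = TRUE)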
certain chunks of the data will update at different periods over time.
If by "update" you mean "add more data" then you may need to supply a basename_template. Otherwise every call to write_dataset will try and create part-0.arrow and they will overwrite each other. A common practice to work around this is to include some kind of UUID in the basename_template.
If by "update" you mean "replace existing data" then things will be a little trickier. If you want to replace entire partitions worth of data you can use existing_data_behavior="delete_matching". If you want to replace matching rows I'm not sure there is a great solution at the moment.
This approach could also lead to small batches, depending on how much data is in each group in each data_id. For example, if you have 1,000 data ids and each data id has 1 million records spread across 1,000 combinations of col_1/col_2/col_3, then you will end up with 1 million files, each with 1,000 rows. This won't perform well. Ideally you'd want to end up with 1,000 files, each with 1,000,000 rows. You could perhaps address this with some kind of occasional compaction step.

Am I using the most efficient (or right) R instructions?

First question, so I'll try to go straight to the point.
I'm currently working with tables, and I chose R because it has no hard limit on data frame sizes and can perform several operations over the data within the tables. I am happy with that, as I can manipulate the data at will; merges, concatenations and row and column manipulation work fine. But I recently had to run a loop at about 0.00001 sec/instruction over a 6-million-row table and it took over an hour.
Maybe the approach in R was wrong to begin with, and I have tried to look for the most efficient ways to run some operations (using list assignments instead of c(list, new_element)). But since, as far as I can tell, this is not something you can optimize with some sort of algorithm like graphs or heaps (it is just tables; you have to iterate through them all), I was wondering if there might be other instructions or basic ways of working with tables that I don't know about (assign, extract, ...) that take less time, or some configuration of RStudio that improves performance.
This is the loop, just so if it helps to understand the question:
my_list <- vector("list", nrow(table[,"Date_of_count"]))
for(i in 1:nrow(table[,"Date_of_count"])){
  my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"] %>% pull(1), "%Y-%m-%d")), format = "%Y-%m-%d")
}
As mentioned, the table has over 6 million rows and 25 variables. I want to fill the list and then append it to the table as a column once finished.
Please let me know if it lacks specificity or concretion, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a dataframe and followed the points above.
df <- as.data.frame(table)
In this case, doing so converted the dates directly to character, so I did not have to apply any further conversions.
New execution time over 6 Mill rows: 25.25 sec.
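For reference, a vectorized version along these lines might look like the sketch below (assuming Date_of_count holds "%Y-%m-%d" strings, as in the original loop):
# Sketch: convert the whole column at once instead of row by row
df <- as.data.frame(table)
df$Date_of_count <- format(as.POSIXct(df$Date_of_count, format = "%Y-%m-%d"), "%Y-%m-%d")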

Referencing last used row in a data frame

I couldn't find the answer in any previously asked questions, but I believe this is an easy one.
I have the below two lines of code, which take in data from Excel in a specific range (using readxl for this). The range itself only goes through row 2589 in the Excel document, but it will update dynamically (it's a time series), and to ensure I capture the new observations (rows) as they're added, I've included rows up to 10000 in the read_excel range argument.
In the end, I'd like to run charts on this data, but a key part of this is identifying the last used row, without manually updating the code row for the latest date. I've tried using nrow but to no avail.
Raw_Index_History <- read_excel("RData.xlsx", range = "ReturnsA6:P10000", col_names = TRUE)
Raw_Index_History <- Raw_Index_History[nrow(Raw_Index_History),]
Does anybody have any thoughts or advice? Thanks very much.
It would be easier to answer your question if you included an example.
Not knowing what your data looks like, answers are likely going to be a bit vague.
Does your data contain NAs? If not, it should be straightforward to remove the empty rows with
na.omit(Raw_Index_History)
It appears you also have control over the excel spreadsheet. So in case your data does contain NAs you could have some default value in your empty rows that will get overwritten as soon as a new data point is recorded. This will allow you to filter your dataframe accordingly.
Raw_Index_History[!grepl("place_holder", Raw_Index_History$column_with_placeholder),]
If you expect data in the spreadsheet to grow, you can specify only the columns to include, instead of a defined boundary.
Something like this ...
Raw_Index_History <- read_excel("RData.xlsx",
                                sheet = 1,
                                range = cell_cols("A:P"), # only columns, no rows
                                col_names = TRUE)
Every time you run the code, R will pull in the data from columns A:P up to the last populated row.
This will be a more elegant approach for your use case. (Consider what you'd do when your data crosses 10000 rows in the future.)
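If trailing blank rows do sneak in anyway, a small follow-up sketch (column-agnostic; it just drops rows that are entirely NA and takes the last remaining one):
# Sketch: drop all-NA rows, then grab the last populated observation
Raw_Index_History <- Raw_Index_History[rowSums(!is.na(Raw_Index_History)) > 0, ]
latest <- Raw_Index_History[nrow(Raw_Index_History), ]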

Removing lines in data.table and spiking memory usage

I have a data.table of a decent size: 89M rows, 3.7GB. Keys are in place, so everything is set up properly. However, I am experiencing a problem when I remove rows based on a column's value: the memory usage just goes through the roof!
Just for the record I have read the other posts here about this, but they don't really help much. Also, I am using RStudio which I am pretty sure is not ideal but it helps while experimenting, however I notice the same behaviour in the R console. I am using Windows.
Let me post an example (taken from a similar question regarding removal of rows) of creating a very big data.table approx 1e6x100
rm(list=ls(all=TRUE)) #Clean stuff
gc(reset=TRUE) #Call gc (not really helping but whatever..)
dimension=1e6 #let's say a million
DT = data.table(col1 = 1:dimension)
cols = paste0('col', 2:100) #let these be conditions as columns
for (col in cols){ DT[, col := 1:dimension, with = F] }
DT.m<-melt(DT,id=c('col1','col2','col3'))
Ok so now we have a data.table with 97M rows, approx 1.8Gb. This is our starting point.
Let's remove all rows where the value column (after the melt) is e.g. 4
DT.m<-DT.m[value!=4]
The last line takes a huge amount of memory! Prior to executing this line, the memory usage on my PC is approx. 4.3GB, and just after the line is executed it goes to 6.9GB!
This is the correct way to remove the lines, right? (just checking). Has anyone come across this behaviour before?
I thought of looping for all parameters and keeping the rows I am interested in, in another data.table but somehow I doubt that this is a proper way of working.
I am looking forward to your help.
Thanks
Nikos
Update: With this commit, the logical vector is replaced by row indices to save memory (Read the post below for more info). Fixed in 1.9.5.
Doing sum(DT.m$value == 4L) gives me 97. That is, you're removing a total of 97 rows from 97 million. This in turn implies that the subset operation would return a ~1.8GB data set as well.
Your memory usage was 4.3GB to begin with
The condition you provide, value != 4, takes the space of a logical vector of size 97 million =~ 360MB.
data.table computes which() on that logical vector to fetch the indices = almost all the rows = another ~360MB.
The data that's being subset has to be allocated elsewhere first, and that's ~1.8GB.
Total comes to 4.3+1.8+0.72 =~ 6.8GB
And garbage collection hasn't happened yet. If you now do gc(), the memory corresponding to old DT.m should be released.
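In other words, something like:
DT.m <- DT.m[value != 4]
gc() # the memory held by the old copy of DT.m can now be reclaimed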
The only place where I can see we can save space is by replacing the logical vector with the integer indices directly (rather than storing the integer indices in another vector), to save the extra ~360MB of space.
Usually which() results in a much smaller (negligible) vector of indices, and the subset is therefore faster; that is the reason for using which(). But in this case, you remove only 97 rows.
But good to know that we can save a bit of memory. Could you please file an issue here?
Removing rows by reference, #635, when implemented, should both be fast and memory efficient.

What is the purpose of setting a key in data.table?

I am using data.table and there are many functions which require me to set a key (e.g. X[Y]). As such, I wish to understand what a key does in order to properly set keys in my data tables.
One source I read was ?setkey.
setkey() sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference. No copy is made at all, other than temporary working memory as large as one column.
My takeaway here is that a key would "sort" the data.table, resulting in a very similar effect to order(). However, it doesn't explain the purpose of having a key.
The data.table FAQ 3.2 and 3.3 explains:
3.2 I don't have a key on a large table, but grouping is still really quick. Why is that?
data.table uses radix sorting. This is significantly faster than other
sort algorithms. Radix is specifically for integers only, see
?base::sort.list(x,method="radix"). This is also one reason why
setkey() is quick. When no key is set, or we group in a different order
from that of the key, we call it an ad hoc by.
3.3 Why is grouping by columns in the key faster than an ad hoc by?
Because each group is contiguous in RAM, thereby minimising page
fetches, and memory can be copied in bulk (memcpy in C) rather than
looping in C.
From here, I guess that setting a key somehow allows R to use "radix sorting" over other algorithms, and that's why it is faster.
The 10 minute quick start guide also has a guide on keys.
Keys
Let's start by considering data.frame, specifically rownames (or in
English, row names). That is, the multiple names belonging to a single
row. The multiple names belonging to the single row? That is not what
we are used to in a data.frame. We know that each row has at most one
name. A person has at least two names, a first name and a second name.
That is useful to organise a telephone directory, for example, which
is sorted by surname, then first name. However, each row in a
data.frame can only have one name.
A key consists of one or more
columns of rownames, which may be integer, factor, character or some
other class, not simply character. Furthermore, the rows are sorted by
the key. Therefore, a data.table can have at most one key, because it
cannot be sorted in more than one way.
Uniqueness is not enforced,
i.e., duplicate key values are allowed. Since the rows are sorted by
the key, any duplicates in the key will appear consecutively.
The telephone directory was helpful in understanding what a key is, but it seems that a key is no different from having a factor column. Furthermore, it does not explain why a key is needed (especially to use certain functions) or how to choose the column to set as the key. Also, in a data.table with time as a column, it seems that setting any other column as the key would probably mess up the time column too, which makes it even more confusing, as I do not know whether I am allowed to set any other column as the key. Can someone enlighten me please?
In addition to this answer, please refer to the vignettes Secondary indices and auto indexing and Keys and fast binary search based subset as well.
This issue highlights the other vignettes that we plan to write.
I've updated this answer again (Feb 2016) in light of the new on= feature that allows ad-hoc joins as well. See history for earlier (outdated) answers.
What exactly does setkey(DT, a, b) do?
It does two things:
reorders the rows of the data.table DT by the column(s) provided (a, b) by reference, always in increasing order.
marks those columns as key columns by setting an attribute called sorted to DT.
The reordering is both fast (due to data.table's internal radix sorting) and memory efficient (only one extra column of type double is allocated).
When is setkey() required?
For grouping operations, setkey() was never an absolute requirement. That is, we can perform a cold-by or adhoc-by.
## "cold" by
require(data.table)
DT <- data.table(x=rep(1:5, each=2), y=1:10)
DT[, mean(y), by=x] # no key is set, order of groups preserved in result
However, prior to v1.9.6, joins of the form x[i] required key to be set on x. With the new on= argument from v1.9.6+, this is not true anymore, and setting keys is therefore not an absolute requirement here as well.
## joins using < v1.9.6
setkey(X, a) # absolutely required
setkey(Y, a) # not absolutely required as long as 'a' is the first column
X[Y]
## joins using v1.9.6+
X[Y, on="a"]
# or if the column names are x_a and y_a respectively
X[Y, on=c("x_a" = "y_a")]
Note that the on= argument can be explicitly specified even for keyed joins as well.
The only operation that absolutely requires the key to be set is the foverlaps() function. But we are working on some more features which, when done, would remove this requirement.
So what's the reason for implementing the on= argument?
There are quite a few reasons.
It allows us to clearly identify the operation as one involving two data.tables. Just doing X[Y] does not convey this as well, although it could be clear if the variables are named appropriately.
It also allows us to understand, just by looking at that line of code, the columns on which the join/subset is being performed (without having to trace back to the corresponding setkey() line).
In operations where columns are added or updated by reference, on= operations are much more performant, as they don't need the entire data.table to be reordered just to add/update column(s). For example,
## compare
setkey(X, a, b) # why physically reorder X to just add/update a column?
X[Y, col := i.val]
## to
X[Y, col := i.val, on=c("a", "b")]
In the second case, we did not have to reorder. It's not computing the order that is time consuming, but physically reordering the data.table in RAM; by avoiding that, we retain the original order, and the operation is still performant.
Even otherwise, unless you're performing joins repetitively, there should be no noticeable performance difference between keyed and ad hoc joins.
This leads to the question, what advantage does keying a data.table have anymore?
Is there an advantage to keying a data.table?
Keying a data.table physically reorders it based on those column(s) in RAM. Computing the order is not usually the time-consuming part, rather the reordering itself. However, once the data is sorted in RAM, the rows belonging to the same group are all contiguous, which is very cache efficient. It's the sortedness that speeds up operations on keyed data.tables.
It is therefore essential to figure out whether the time spent reordering the entire data.table is worth it for a cache-efficient join/aggregation. Usually, unless there are repetitive grouping/join operations being performed on the same keyed data.table, there should not be a noticeable difference.
In most cases therefore, there shouldn't be a need to set keys anymore. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.
Question: What do you think would be the performance like in comparison to a keyed join, if you use setorder() to reorder the data.table and use on=? If you've followed thus far, you should be able to figure it out :-).
A key is basically an index into a dataset, which allows for very fast and efficient sort, filter, and join operations. These are probably the best reasons to use data tables instead of data frames (the syntax for using data tables is also much more user friendly, but that has nothing to do with keys).
If you don't understand indexes, consider this: a phone book is "indexed" by name. So if I want to look up someone's phone number, it's pretty straightforward. But suppose I want to search by phone number (e.g., look up who has a particular phone number)? Unless I can "re-index" the phone book by phone number, it will take a very long time.
Consider the following example: suppose I have a table, ZIP, of all the zip codes in the US (>33,000) along with associated information (city, state, population, median income, etc.). If I want to look up the information for a specific zip code, the search (filter) is about 1000 times faster if I setkey(ZIP, zipcode) first.
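A minimal sketch of that keyed lookup (the table and column names follow the example above; the actual zip code value is purely illustrative):
setkey(ZIP, zipcode)
ZIP[.("90210")] # binary search on the key instead of a full vector scan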
Another benefit has to do with joins. Suppose I have a list of people and their zip codes in a data table (call it "PPL"), and I want to append information from the ZIP table (e.g. city, state, and so on). The following code will do it:
setkey(ZIP, zipcode)
setkey(PPL, zipcode)
full.info <- PPL[ZIP, nomatch = FALSE]
This is a "join" in the sense that I'm joining the information from 2 tables based in a common field (zipcode). Joins like this on very large tables are extremely slow with data frames, and extremely fast with data tables. In a real-life example I had to do more than 20,000 joins like this on a full table of zip codes. With data tables the script took about 20 min. to run. I didn't even try it with data frames because it would have taken more than 2 weeks.
IMHO you should not just read but study the FAQ and Intro material. It's easier to grasp if you have an actual problem to apply this to.
[Response to #Frank's comment]
Re: sorting vs. indexing - Based on the answer to this question, it appears that setkey(...) does in fact rearrange the rows in the table (i.e., a physical sort), and does not create an index in the database sense. This has some practical implications: for one thing, if you set the key in a table with setkey(...) and then change any of the values in the key column, data.table merely declares the table to be no longer sorted (by turning off the sorted attribute); it does not dynamically re-index to maintain the proper sort order (as would happen in a database). Also, "removing the key" using setkey(DT, NULL) does not restore the table to its original, unsorted order.
Re: filter vs. join - the practical difference is that filtering extracts a subset from a single dataset, whereas join combines data from two datasets based on a common field. There are many different kinds of join (inner, outer, left). The example above is an inner join (only records with keys common to both tables are returned), and this does have many similarities to filtering.
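A small sketch of how nomatch controls that distinction in data.table joins, using the same hypothetical PPL/ZIP tables as above:
PPL[ZIP, nomatch = 0] # inner join: keep only zip codes that also occur in PPL
PPL[ZIP]              # default nomatch = NA keeps every zip code, filling NAs where no person matches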
