Writing Apache Arrow dataset in batches in R

Writing Apache Arrow dataset in batches in R - r

I'm wondering what the correct approach is to creating an Apache Arrow multi-file dataset as described here in batches. The tutorial explains quite well how to write a new partitioned dataset from data in memory, but is it possible to do this in batches?
My current approach is to simply write the datasets individually, but to the same directory. This appears to be working, but I have to imagine this causes issues with the metadata that powers the feature. Essentially, my logic is as follows (pseudocode):
data_ids <- c(123, 234, 345, 456, 567)
# write data in batches
for (id in data_ids) {
## assume this is some complicated computation that returns 1,000,000 records
df <- data_load_helper(id)
df <- group_by(df, col_1, col_2, col_3)
arrow::write_dataset(df, "arrow_dataset/", format = 'arrow')
}
# read in data
dat <- arrow::open_dataset("arrow_dataset/", format="arrow", partitioning=c("col_1", "col_2", "col_3"))
# check some data
dat %>%
filter(col_1 == 123) %>%
collect()
What is the correct way of doing this? Or is my approach correct? Loading all of the data into one object and then writing it at once is not viable, and certain chunks of the data will update at different periods over time.

TL;DR: Your solution looks pretty reasonable.
There may be one or two issues you run into. First, if your batches do not all have identical schemas then you will need to make sure to pass in unify_schemas=TRUE when you are opening the dataset for reading. This could also become costly and you may want to just save the unified schema off on its own.
certain chunks of the data will update at different periods over time.
If by "update" you mean "add more data" then you may need to supply a basename_template. Otherwise every call to write_dataset will try and create part-0.arrow and they will overwrite each other. A common practice to work around this is to include some kind of UUID in the basename_template.
If by "update" you mean "replace existing data" then things will be a little trickier. If you want to replace entire partitions worth of data you can use existing_data_behavior="delete_matching". If you want to replace matching rows I'm not sure there is a great solution at the moment.
This approach could also lead to small batches, depending on how much data is in each group in each data_id. For example, if you have 100,000 data ids and each data id has 1 million records spread across 1,000 combinations of col_1/col_2/col_3 then you will end up with 1 million files, each with 1,000 rows. This won't perform well. Ideally you'd want to end up with 1,000 files, each with 1,000,000 rows. You could perhaps address this with some kind of occasional compaction step.

Related

R Updating A Column In a Large Dataframe

I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading in and writing out the csv file, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood that you are accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. If we assume the API call itself takes no time, then this means this program will need to run for 90 hours because of the sleep time alone. My goal is to make it so that I can easily run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great to get whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to rewrite the new dataframe and replace the old Games.csv every 1000 or say API calls. See the commented code below.
My overall question is, when I need to update a column in csv that is large, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)
df <- read.csv <- "Games.csv"
for(i in 1:nrow(df)){
data <- read_json(df$urls[i])
if(data$analysisLogExists == TRUE){
df$Analyzed[i] <- 1
}
if(data$analysisLogExists==FALSE){
df$Analyzed[i] = 0
}
Sys.sleep(.25)
##This won't work because the second time I run it then I'll just reread the original lines
##if i try to account for this by subsetting only the the columns that haven't been updated,
##then this still doesn't work because then the write command below will not be writing the whole dataset to the csv
if(i%%1000){
write.csv(df,"Games.csv",row.names = F)
}
}

Am I using the most efficient (or right) R instructions?

first question, I'll try to go straight to the point.
I'm currently working with tables and I've chosen R because it has no limit with dataframe sizes and can perform several operations over the data within the tables. I am happy with that, as I can manipulate it at my will, merges, concats and row and column manipulation works fine; but I recently had to run a loop with 0.00001 sec/instruction over a 6 Mill table row and it took over an hour.
Maybe the approach of R was wrong to begin with, and I've tried to look for the most efficient ways to run some operations (using list assignments instead of c(list,new_element)) but, since as far as I can tell, this is not something that you can optimize with some sort of algorithm like graphs or heaps (is just tables, you have to iterate through it all) I was wondering if there might be some other instructions or other basic ways to work with tables that I don't know (assign, extract...) that take less time, or configuration over RStudio to improve performance.
This is the loop, just so if it helps to understand the question:
my_list <- vector("list",nrow(table[,"Date_of_count"]))
for(i in 1:nrow(table[,"Date_of_count"])){
my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"]%>%pull(1),"%Y-%m-%d")),format = "%Y-%m-%d")
}
The table, as aforementioned, has over 6 Mill rows and 25 variables. I want the list to be filled to append it to the table as a column once finished.
Please let me know if it lacks specificity or concretion, or if it just does not belong here.

In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, realized, had some tibbles inside) into a dataframe and followed the aforementioned keys.
df <- as.data.frame(table)
In this case, by doing this the dates were converted directly to character so I did not have to apply any more conversions.
New execution time over 6 Mill rows: 25.25 sec.

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient when they will basically be put in an IN statement (It essentially hangs). Normally if I was writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?

Note sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive or the list of values can be inferred by some other columns, you can seek help from the between filter: filter(between(FIELD, a, b))?
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.

Looking at your restrictions, I would approach it similar to how Polor Beer suggested, but I would send one db command per value using purrr::map and then use dplyr::bindrows() at the end. This way you'll have a nice piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, not sure of any other solutions.

File organization - how to handle different combinations of filters on one data.frame efficiently?

I currently do a lot of descriptive analysis in R. I always work with a data.table like df
net <- seq(1,20,by=2)
gross <- seq(2,20,by=2)
color <- c("green", "blue", "white")
height <- c(170,172,180,188)
library(data.table)
df <- data.table(net,gross,color,height)
In order to obtain results, I do apply a lot of filters.
Sometimes I use one filter, sometimes I use a combination of multiple filters, e.g.:
df[color=="green" & height>175]
In my real data.table, I have 7 columns and all kind of filter-combinations.
Since I always address the same data.table, I'd like to find the most efficient way to filter the data.
So far, my files are organized like this (bottom-up):
execution level: multiple R-scripts with a very specific job (no interaction between them) that calculate and write the results to an excel file using XL Connect
source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is necessary in case I add/remove files on the execution level.
filter files: read the data.table and apply one or multiple filters, as shown above with df_green_high. By filtering, filter files create a
new data.table and source the "source file" with this new filtered table.
I am currently challenged, since I have too many filter files. Having 7 variables, there is such a large number of combinations of filter, so I'll get lost sooner or later.
How can I do my analysis more efficient (reduce the number of "filter files"?)
How can I conveniently name the exported files according to the filters used?
I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case, I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On SOF, I also read about creating a package, but I am not sure if this reasonable.

I usually do it like this:
create a list called say "my_case_list"
filter data, do computation on the filtered data
add a column called "case" to each filtered dataset. Fill this column with some string i.e. "case 1: color=="green" & height>175"
put this data to my_case_list
convert list to data.frame like object
export results to sql server
import results from sql server to Excel Pivot table
make sense of results
Automate the process as much as possible.

How to efficiently merge these data.tables

I want to create a certain data.table to be able to check for missing data.
Missing data in this case does not mean there will be an NA, but the entire row will just be left out. So I need to be able to see of a certain time dependent column which values are missing for which level from another column. Also important is if there are a lot of missing values together or if they are spread across the dataset.
So I have this 6.000.000x5 data.table (Call it TableA) containing the time dependent variable, an ID for the level and the value N which I would like to add to my final table.
I have another table (TableB) which is 207x2. This couples the ID's for the factor to the columns in TableC.
TableC is 1.500.000x207 of which each of the 207 columns correspond to an ID according to TableB and the rows correspond to the time dependent variable in TableA.
These tables are large and although I recently acquired extra RAM (totalling now to 8GB) my computer keeps swapping away TableC and for each write it has to be called back, and gets swapped away again after. This swapping is what is consuming all my time. About 1.6 seconds per row of TableA and as TableA has 6.000.000 rows this operation would take more than a 100 days running non stop..
Currently I am using a for-loop to loop over the rows of TableA. Doing no operation this for-loop loops almost instantly. I made a one-line command looking up the correct column and row number for TableC in TableA and TableB and writing the value from TableA to TableC.
I broke up this one-liner to do a system.time analysis and each step takes about 0 seconds except writing to the big TableC.
This showed that writing the value to the table was the most time consuming and looking at my memory use I can see a huge chunk appearing whenever a write happens and it disappears as soon as it is finished.
TableA <- data.table("Id"=round(runif(200, 1, 100)), "TimeCounter"=round(runif(200, 1, 50)), "N"=round(rnorm(200, 1, 0.5)))
TableB <- data.table("Id"=c(1:100),"realID"=c(100:1))
TSM <- matrix(0,ncol=nrow(TableB), nrow=50)
TableC <- as.data.table(TSM)
rm(TSM)
for (row in 1:nrow(TableA))
{
TableCcol <- TableB[realID==TableA[row,Id],Id]
TableCrow <- (TableA[row,TimeCounter])
val <- TableA[row,N]
TableC[TableCrow,TableCcol] <- val
}
Can anyone advise me on how to make this operation faster, by preventing the memory swap at the last step in the for-loop?
Edit: On the advice of #Arun I took some time to develop some dummy data to test on. It is now included in the code given above.
I did not include wanted results because the dummy data is random and the routine does work. It's the speed that is the problem.

Not entirely sure about the results, but give it a shot with the dplyr/tidyr packages for, as they seem to be more memory efficient than for loops.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
TableC <- TableC %>% gather(tableC_id, value, 1:207)
This turns TableC from 1,500,000x207 to a long format 310,500,000x2 table with 'tableC_id' and 'tableC_value' columns.
TableD <- TableA %>%
left_join(TableB, c("LevelID" = "TableB_ID")) %>%
left_join(TableC, c("TableB_value" = "TableC_id")
This is a couple of packages I've been using of late, and they seem to be very efficient, but the data.table package is used specifically for management of large tables so there could be useful functions there. I'd also take a look at sqldf which allows you to query your data.frames via SQL commands.

Rethinking my problem I came to a solution which works much faster.
The thing is that it does not follow from the question posed above, because I already did a couple of steps to come to the situation described in my question.
Enter TableX from which I aggregated TableA. TableX contains Id's and TimeCounters and much more, that's why I thought it would be best to create a smaller table containing only the information I needed.
TableX also contains the relevant times while in my question I am using a complete time series from the beginning of time (01-01-1970 ;) ). It was way smarter to use the levels in my TimeCounter column to build my TableC.
Also I forced myself to set values individually while merging is a lot faster in data.table. So my advice is: whenever you need to set a lot of values try finding a way to merge instead of just copying them individually.
Solution:
# Create a table with time on the row dimension by just using the TimeCounters we find in our original data.
TableC <- data.table(TimeCounter=as.numeric(levels(factor(TableX[,TimeCounter]))))
setkey(TableC,TimeCounter) # important to set the correct key for merge.
# Loop over all unique Id's (maybe this can be reworked into something *apply()ish)
for (i in levels(factor(TableX[,Id])))
{
# Count how much samples we have for Id and TimeCounter
TableD <- TableX[Id==i,.N,by=TimeCounter]
setkey(TableD,TimeCounter) # set key for merge
# Merge with Id on the column dimension
TableC[TableD,paste("somechars",i,sep=""):=N]
}
There could be steps missing in the TimeCounter so now I have to check for gaps in TableC and insert rows which were missing for all Id's. Then I can finally check where and how big my data gaps are.