multiple sum().compute() in dask on a very large set of data - bigdata

I have a dask dataframe with 100 partitions (aggregating 100 JSON files, around 45 GB in total). I want to calculate a number of metrics with .sum().compute() on around 15-20 columns, and each .compute() takes a long time. Is there a better way to run all of these sums in parallel?

Yes, there is a better way!
Simply call .sum() on each thing you want; this produces lazy descriptions of the work to be done. Then pass all of them to dask.compute(), which evaluates them in one go, sharing any intermediate values where possible.
dask.compute(df.a.sum(), df.b.sum(), df.c.sum(), df.d.sum())
Alternatively, you can probably just select the columns you want (df[[col1, col2, ...]]) and then do a single .sum().compute():
df[['a', 'b', 'c', 'd']].sum().compute()

Related

Writing Apache Arrow dataset in batches in R

I'm wondering what the correct approach is to creating an Apache Arrow multi-file dataset, as described here, in batches. The tutorial explains quite well how to write a new partitioned dataset from data in memory, but is it possible to do this in batches?
My current approach is to simply write the datasets individually, but to the same directory. This appears to be working, but I have to imagine this causes issues with the metadata that powers the feature. Essentially, my logic is as follows (pseudocode):
data_ids <- c(123, 234, 345, 456, 567)

# write data in batches
for (id in data_ids) {
  ## assume this is some complicated computation that returns 1,000,000 records
  df <- data_load_helper(id)
  df <- group_by(df, col_1, col_2, col_3)
  arrow::write_dataset(df, "arrow_dataset/", format = "arrow")
}
# read in data
dat <- arrow::open_dataset("arrow_dataset/", format = "arrow", partitioning = c("col_1", "col_2", "col_3"))

# check some data
dat %>%
  filter(col_1 == 123) %>%
  collect()
What is the correct way of doing this? Or is my approach correct? Loading all of the data into one object and then writing it at once is not viable, and certain chunks of the data will update at different periods over time.
TL;DR: Your solution looks pretty reasonable.
There may be one or two issues you run into. First, if your batches do not all have identical schemas, then you will need to pass unify_schemas=TRUE when you open the dataset for reading. Doing that on every read could become costly, so you may want to just save the unified schema off on its own.
"certain chunks of the data will update at different periods over time."
If by "update" you mean "add more data", then you may need to supply a basename_template. Otherwise every call to write_dataset will try to create part-0.arrow and the calls will overwrite each other. A common workaround is to include some kind of UUID in the basename_template.
If by "update" you mean "replace existing data" then things will be a little trickier. If you want to replace entire partitions worth of data you can use existing_data_behavior="delete_matching". If you want to replace matching rows I'm not sure there is a great solution at the moment.
This approach could also lead to small batches, depending on how much data is in each group in each data_id. For example, if you have 100,000 data ids and each data id has 1 million records spread across 1,000 combinations of col_1/col_2/col_3 then you will end up with 1 million files, each with 1,000 rows. This won't perform well. Ideally you'd want to end up with 1,000 files, each with 1,000,000 rows. You could perhaps address this with some kind of occasional compaction step.
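For concreteness, here is a minimal sketch of how the batched writes discussed above could look, reusing the hypothetical data_load_helper() and column names from the question; the UUID-based basename_template is just one way to keep batches from colliding, and existing_data_behavior is shown only as an option:
library(arrow)
library(dplyr)
library(uuid)  # assumed here only for generating unique file-name templates

for (id in data_ids) {
  df <- data_load_helper(id)               # hypothetical loader from the question
  df <- group_by(df, col_1, col_2, col_3)  # groups become the partitioning columns

  write_dataset(
    df, "arrow_dataset/", format = "arrow",
    # a unique template so each batch appends new files instead of every
    # batch trying to write its own part-0.arrow
    basename_template = paste0("batch-", UUIDgenerate(), "-{i}.arrow")
    # to replace whole partitions rather than append, one could instead set
    # existing_data_behavior = "delete_matching"
  )
}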

How to sort .csv files in R

I have one .csv file which I have imported into R. It contains a column with locations; some locations are repeated depending on how many times that location has been surveyed. I have another column with the total number of plastic items.
I would like to add together the number of plastic items for locations that appear more than once, and create one new column with the total number of plastic items and another column with the number of times the location appeared.
I am unsure how to do this, any help will be much appreciated.
Using dplyr:
data %>%
  group_by(location) %>%
  mutate(TOTlocation = n(), TOTitems = sum(items))
And here's a base solution that does pretty much the same thing:
data[c("TOTloc","TOTitem")]<-t(sapply(data$location, function(x)
c(TOTloc=sum(data$location==x),
TOTitem=sum(data$items[data$location==x]))))
Note that in neither case do you need to sort anything. In dplyr, group_by makes each action apply only to the part of the data set that belongs to a group determined by the contents of a certain column. In my base solution, I loop over the locations with sapply and recalculate TOTloc and TOTitem for every row, which may not be very efficient. A better solution would probably use split (a rough sketch follows below), but for some reason I couldn't make it work with my made-up dataset, so maybe someone else can suggest how best to do that.
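For reference, here is one way the split-based idea could look. This is not part of the original answer; it assumes the same data, location, and items names as above:
# summarise each location once, then map the totals back onto every row
per_loc <- t(sapply(split(data$items, data$location),
                    function(v) c(TOTloc = length(v), TOTitem = sum(v))))
data[c("TOTloc", "TOTitem")] <- per_loc[as.character(data$location), ]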

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table, which I wish to filter on. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, the my_values object contains over 500K entries (hence why I don't provide actual data here). Putting them all into an IN statement is clearly not efficient; the query essentially hangs. Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause, but in this instance I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive or the list of values can be inferred from some other column, you can use a range filter instead: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, run a query for each batch, then combine the results (a sketch follows below). This may take a while, but it should be stable and worth the wait.
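A rough sketch of that divide-and-conquer approach, assuming the con, MY_TABLE, and my_values objects from the question and an arbitrary batch size of 1,000:
library(dplyr)
library(purrr)

# split the ~500K values into batches of 1,000
batches <- split(my_values$FIELD, ceiling(seq_along(my_values$FIELD) / 1000))

# run one query per batch and row-bind the collected results
result <- map_dfr(batches, function(vals) {
  con %>%
    tbl("MY_TABLE") %>%
    filter(FIELD %in% vals) %>%
    collect()
})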
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db query per value using purrr::map() and then combine the results with dplyr::bind_rows() at the end. This way you'll have nicely piped code that adapts if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.

SparkR get distinct values from Dataframe fast

I'm trying to get the distinct values from a SparkDataFrame using the statement below.
distVals <- collect(distinct(select(dataframeName, 'Column_name')))
Executing this statement takes around 30-40 minutes. Is there a better way to do this?
Also, there is not much time difference between collecting the full dataframe and collecting only the distinct values. So why is it suggested not to collect the entire dataset? Is it only because of the data size?
Since I have to get different kinds of filtered data, I'm looking for collecting the results faster.

How to use daply (from plyr) on 2 billion rows using less memory

Does anyone know how one could apply the following function, which converts a 3-column table into a matrix, to a file that has 2 billion rows (with less than 10 GB of memory)?
Here x is the 1st, y the 2nd, and z the 3rd column.
library(plyr)
daply(a, .(x, y), function(x) x$z)
If you cannot load all the tuples at once:
I know this is not the answer you are looking for: use SQLite.
The problem with R is that it must load the entire data frame at once. If you don't have enough memory, it simply can't continue.
SQLite is way smarter than R at doing aggregates. Perhaps its most important feature is that it optimizes the available memory and, when it can, it does not need to read all the elements at once. See this post for details on how to do it:
http://www.r-bloggers.com/using-sqlite-in-r/
If SQLite does not support the aggregate you want, you can create it yourself (see user defined functions in SQLite).
Alternatively, you can try to partition your data (outside R) so you can aggregate it in stages. But that still requires some sort of program that can read and process the files in less than the available memory. Unix/macOS/Linux sort is one of those utilities that can deal with larger-than-memory data, and it might be useful.
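As a rough illustration of the SQLite route (this is only a sketch, not the approach from the linked post; the file name, column names, and chunk size are all assumptions):
library(DBI)
library(RSQLite)
library(readr)

con <- dbConnect(SQLite(), "tuples.sqlite")

# stream the 3-column file into SQLite in 1M-row chunks so the whole
# table never has to fit in memory ("big_file.csv" is assumed to have no header)
read_csv_chunked(
  "big_file.csv",
  SideEffectChunkCallback$new(function(chunk, pos) {
    chunk <- as.data.frame(chunk)
    if (!dbExistsTable(con, "tuples")) {
      dbWriteTable(con, "tuples", chunk)
    } else {
      dbWriteTable(con, "tuples", chunk, append = TRUE)
    }
  }),
  chunk_size = 1e6,
  col_names = c("x", "y", "z")
)

# let SQLite do the per-(x, y) aggregation instead of daply();
# the exact aggregate depends on what the original call was meant to produce
res <- dbGetQuery(con, "SELECT x, y, SUM(z) AS z FROM tuples GROUP BY x, y")

# reshape the long result to an x-by-y matrix, if that fits in memory
m <- xtabs(z ~ x + y, data = res)

dbDisconnect(con)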
