dplyr: colSums on sub-grouped (group_by) data frames: elegantly

I have a very large data frame (265,874 x 30) with three sensible grouping variables: an age category (1-6), date (5,479 distinct values), and geographic locality (4 total). Each record consists of a choice from each of these, plus 27 count variables. I want to group by each of the grouping variables, then take colSums on the resulting sub-grouped 27 count variables. I've been trying to use dplyr (v0.2) to do it, because doing it manually ends up creating a lot of redundant objects (or resorting to a loop over the grouping combinations, for lack of a more elegant solution).
Example code:
# Toy data: 200 records with a date, locality, and age category, plus 10 count columns
countData <- sample(0:10, 2000, replace = TRUE)
dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
locality <- sample(1:2, 200, replace = TRUE)
ageCat <- sample(1:2, 200, replace = TRUE)
sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))
then what I'd like to do is ...
library("dplyr")
sampleDF %.% group_by(locality, ageCat, dates) %.% do(colSums(.[, -(1:3)]))
but this doesn't quite work, since the result of colSums() isn't a data frame. If I cast it, it works:
sampleDF %.% group_by(locality, ageCat, dates) %.% do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))
but the final do(...) bit seems very clunky.
Any thoughts on how to do this more elegantly or effectively? I guess the question comes down to: how best to use the do() function and the . operator to summarize a data frame via colSums.
Note: the do(.) idiom is only available in dplyr 0.2, so you need to grab it from GitHub (link) rather than from CRAN.
Edit: results from suggestions
Three solutions:
My suggestion in the post: 146.765 seconds elapsed.
@joran's suggestion below: 6.902 seconds.
@eddi's suggestion in the comments, using data.table: 6.715 seconds.
I didn't bother to replicate the runs, just used system.time() to get a rough gauge. From the looks of it, dplyr and data.table perform approximately the same on my data set, and both are significantly faster, when used properly, than the hack solution I came up with yesterday.
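For reference, the timings above came from wrapping each call in system.time(); a rough sketch of that comparison on the toy data, using the dplyr-0.2-era summarise_each() call from joran's answer below and my guess at eddi's data.table approach (the comment itself isn't reproduced here):

library(dplyr)
library(data.table)

# dplyr: group by the three keys and sum every remaining column
system.time(
  sampleDF %>% group_by(locality, ageCat, dates) %>% summarise_each(funs(sum))
)

# data.table: the same grouping and per-column sums
sampleDT <- as.data.table(sampleDF)
system.time(
  sampleDT[, lapply(.SD, sum), by = .(locality, ageCat, dates)]
)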

Unless I'm missing something, this seems like a job for summarise_each (a sort of analogue of plyr's colwise):
sampleDF %.% group_by(locality, ageCat, dates) %.% summarise_each(funs(sum))
The grouping columns are not included in the summarising functions by default, and you can apply the functions to only a subset of the columns using the same technique as with select.
(summarise_each is in version 0.2 of dplyr but not in 0.1.3, as far as I know.)

The method summarise_each mentioned in joran's answer from 2014 has been deprecated.
Instead, please use summarize_all() or summarize_at().
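Applied to the original example, a minimal sketch of those forms might look like this (assuming the count columns are named X1-X10, as data.frame() produces above):

library(dplyr)

# summarise_all(): apply sum to every non-grouping column
sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise_all(sum)

# summarise_at(): the same, restricted to an explicit column selection
sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise_at(vars(starts_with("X")), sum)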

The methods summarize_all and summarize_at mentioned in Hack-R's answer from 2018 have been superseded.
Instead, please use summarize()/summarise() combined with across().
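A minimal sketch of the across() form for the same example (assuming dplyr >= 1.0; in a grouped summarise, across(everything(), ...) skips the grouping columns automatically):

library(dplyr)

sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise(across(everything(), sum), .groups = "drop")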

Related

Using a key to clean data in two corresponding columns

I have a large data frame (6 million rows, 20 columns) where data in one column corresponds to data in another column. I created a key that I now want to use to fix rows that have the wrong value. As a small example:
key = data.frame(animal = c('dog', 'cat', 'bird'),
                 sound = c('bark', 'meow', 'chirp'))
The data frame looks like this (minus the other columns of data):
df = data.frame(id = c(1, 2, 3, 4),
                animal = c('dog', 'cat', 'bird', 'cat'),
                sound = c('meow', 'bark', 'chirp', 'chirp'))
I swear I have done this before but can't remember my solution. Any ideas?
Using dplyr. If you want to fix sound according to animal,
library(dplyr)
df <- df %>%
  mutate(sound = sapply(animal, function(x) {key %>% filter(animal == x) %>% pull(sound)}))
should do the trick. If you want to fix animal according to sound:
df <- df %>%
  mutate(animal = sapply(sound, function(x) {key %>% filter(sound == x) %>% pull(animal)}))
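A base R named-vector lookup is another common pattern for this kind of key (a sketch, not part of the original answer; it assumes each animal appears exactly once in key):

# Build a named lookup vector from the key, then index it by animal
lookup <- setNames(as.character(key$sound), as.character(key$animal))
df$sound <- unname(lookup[as.character(df$animal)])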
I'm not sure about relative efficiency, but it's simpler to replace the partially incorrect column completely. It may not even cost you very much time (since you have to look up values anyway to determine that an animal/sound pair is mismatched).
library(tidyverse)
df %>% select(-sound) %>% full_join(key, by = "animal")
For 6 million rows, you may be better off using data.table. If you convert df and key to data tables (as.data.table()) that will take some up-front computational time but may speed up subsequent operations; you can use tidyverse operations on data.table objects without doing any further modifications, but native data.table operations might be faster:
library(data.table)
dft <- as.data.table(df)
k <- as.data.table(key)
merge(dft[,-"sound"], k, by = "animal")
I haven't bothered to do any benchmarking (would need much larger examples to be able to measure any differences).

Is there a more efficient way than dplyr to obtain the variance of lots of columns?

I have a data.frame that is >250,000 columns and 200 rows, so around 50 million individual values. I am trying to get a breakdown of the variance of the columns in order to select the columns with the most variance.
I am using dplyr as follows:
df %>% summarise_if(is.numeric, var)
It has been running on my iMac with 16 GB of RAM for about 8 hours now.
Is there a way to allocate more resources to the call, or a more efficient way to summarise the variance across columns?
I bet that selecting the columns first, then calculating the variance, will be a lot faster:
df <- as.data.frame(matrix(runif(5e7), nrow = 200, ncol = 250000))
df_subset <- df[,sapply(df, is.numeric)]
sapply(df_subset, var)
The code above runs on my machine in about a second, and that's calculating the variance on every single column because they're all numeric in my example.
You may try data.table, which is usually faster.
library(data.table)
cols <- names(Filter(is.numeric, df))
setDT(df)
df[, lapply(.SD, var), .SDcols = cols]
Another approach you can try is getting the data in long format.
library(dplyr)
library(tidyr)
df %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarise(var_value = var(value))
but I agree with @Daniel V that it is worth checking the data, as 8 hours is far too long for this calculation.
Very wide data.frames are quite inefficient. I think converting to a matrix and using matrixStats::colVars() would be the fastest.
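A minimal sketch of that matrix route (assuming every column really is numeric; otherwise subset the numeric columns first):

library(matrixStats)

m <- as.matrix(df)   # very wide data.frames are slow to work with; a plain matrix is not
v <- colVars(m)
names(v) <- colnames(m)

# e.g. keep the 1,000 highest-variance columns
top_cols <- names(sort(v, decreasing = TRUE))[1:1000]
df_top <- df[, top_cols]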

How to create a statistical summary of clustering results for different groups of variables in R

I am wondering if there is a package or a fast way to generate a statistical summary table for the results of clustering.
I imagine I can choose the variables of interest, group by cluster number, and then calculate the mean, max, etc. I am looking for a fast way to do it. Is there any package I can use?
Thanks
The fastest and easiest way depends on the exact results you want. The easiest approach is probably summary() in base R; the more versatile one is the dplyr package with its group_by() and summarize() functions. For specific types of data, other packages may provide a more practical summary.
An example:
DF <- data.frame(groups = sample(LETTERS, 20, replace = TRUE),
                 var = runif(20))
summary(DF)

library(dplyr)
DF %>%
  group_by(groups) %>%
  summarize(mean_by_group = mean(var),
            number = n())
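For the clustering case specifically, a hedged sketch along the same lines: attach the cluster labels from, say, kmeans() to the data and summarise the variables of interest per cluster (iris is used here purely as a stand-in data set):

library(dplyr)

km <- kmeans(scale(iris[, 1:4]), centers = 3)

iris %>%
  mutate(cluster = km$cluster) %>%
  group_by(cluster) %>%
  summarise(across(Sepal.Length:Petal.Width, list(mean = mean, max = max)),
            n = n())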

Vector addition with vector indexing

This may well have an answer elsewhere but I'm having trouble formulating the words of the question to find what I need.
I have two dataframes, A and B, with A having many more rows than B. I want to look up a value from B based on a column of A, and add it to another column of A. Something like:
A$ColumnToAdd + B[ColumnToMatch == A$ColumnToMatch,]$ColumnToAdd
But I get, with a load of NAs:
Warning in `==.default`: longer object length is not a multiple of shorter object length
I could do it with a messy for-loop but I'm looking for something faster & elegant.
Thanks
If I understood your question correctly, you're looking for a merge or a join, as suggested in the comments.
Here's a simple example for both using dummy data that should fit what you described.
library(tidyverse)
# Some dummy data
ColumnToAdd <- c(1,1,1,1,1,1,1,1)
ColumnToMatch <- c('a','b','b','b','c','a','c','d')
A <- data.frame(ColumnToAdd, ColumnToMatch)
ColumnToAdd <- c(1,2,3,4)
ColumnToMatch <- c('a','b','c','d')
B <- data.frame(ColumnToAdd, ColumnToMatch)
# Example using merge
A %>%
  merge(B, by = c("ColumnToMatch")) %>%
  mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
# Example using join
A %>%
  inner_join(B, by = c("ColumnToMatch")) %>%
  mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
The advantages of the dplyr joins over merge are:
rows are kept in their existing order
they are much faster
they tell you which keys you're merging by (if you don't supply them)
they also work with database tables.
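If every row of A should be kept even where there is no match in B (an assumption about the desired behaviour, not something stated in the question), a left_join() variant of the same idea would be:

A %>%
  left_join(B, by = "ColumnToMatch") %>%
  mutate(sum = ColumnToAdd.x + ColumnToAdd.y)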

Working with temporary columns (created on-the-fly) more efficiently in a dataframe

Consider the following dataframe:
df <- data.frame(replicate(5,sample(1:10, 10, rep=TRUE)))
If I want to divide each row by its sum (to make a probability distribution), I need to do something like this:
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
This really feels inefficient:
Create an rs column
Divide each value by its corresponding row sum from rowSums()
Remove the temporarily created column to clean up the original dataframe.
When working with existing columns, it feels much more natural:
df %>% summarise_each(funs(weighted.mean(., X1)), -X1)
Using dplyr, would there be a better way to work with temporary (created on-the-fly) columns than having to add and remove them after processing?
I'm also interested in how data.table would handle such a task.
As I mentioned in a comment above, I don't think it makes sense to keep that data in either a data.frame or a data.table, but if you must, the following will do it without converting to a matrix and illustrates how to create a temporary variable in the data.table j-expression:
dt = as.data.table(df)
# Compute the row sums once inside j, then divide every column by them
dt[, names(dt) := {sums = Reduce(`+`, .SD); lapply(.SD, '/', sums)}]
Why not consider base R as well:
as.data.frame(as.matrix(df)/rowSums(df))
Or just with your data.frame:
df/rowSums(df)
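Base R's sweep() covers the same row-normalisation and generalises to other row statistics (a sketch, not from the original answers):

# Divide each row by its row sum; equivalent to df / rowSums(df)
prob_df <- sweep(df, 1, rowSums(df), FUN = "/")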
