I am trying to create a matrix of donors and recipients, populated with the sum of donations for each pair, keeping NAs where no donation exists.
It works well for small datasets (see toy example below), but when I switch to national datasets (3m entries) several problems emerge: besides being painfully slow, the creation of the fill df consumes all the memory of the (super)computer and I get the error "Error: cannot allocate vector of size 1529.0 Gb".
How should I tackle the problem?
Thanks a lot!
library(dplyr)
library(tidyr)
library(bigmemory)
candidate_id <- c("cand_1","cand_1","cand_1","cand_2","cand_3")
donor_id <- c("don_1","don_1","don_2","don_2","don_3")
donation <- c(1,2,3.5,4,10)
df = data.frame(candidate_id,donor_id,donation)
colnames(df) <- c("candidate_id","donor_id","donation")
fill <- df %>%
  group_by(df$candidate_id, df$donor_id) %>%
  summarise(tot_donation = sum(as.numeric(donation))) %>%
  complete(df$candidate_id, df$donor_id)
fill <- unique(fill[ ,1:3])
colnames(fill) <- c("candidate_id","donor_id","tot_donation")
nrow = length(unique(df$candidate_id))
ncol = length(unique(df$donor_id))
row_names = unique(fill$candidate_id)
col_names = unique(fill$donor_id)
x <- big.matrix(nrow, ncol, init=NA,dimnames=list(row_names,col_names))
for (i in 1:nrow){
  for (j in 1:ncol){
    x[i,j] <- fill[which(fill$candidate_id == row_names[i] &
                         fill$donor_id == col_names[j]), 3]
  }
}
I see you're using unique because your output has duplicated values. The duplication comes from referring to the columns as df$candidate_id inside the verbs; based on this question, you should try the following with bare column names to avoid it:
fill <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup() %>%
  complete(candidate_id, donor_id)
Can you then try to create your desired output?
I think unique can be very resource-heavy,
so try to avoid calling it.
The tidyr version of what Benjamin suggested should be:
spread(fill, donor_id, tot_donation)
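On tidyr >= 1.0, spread() is superseded by pivot_wider(); assuming the same fill as above, the equivalent call would be:
pivot_wider(fill, names_from = donor_id, values_from = tot_donation)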
EDIT: By the way, since you tagged the question with sparse-matrix,
you could indeed use sparsity to your advantage:
library(Matrix)
library(dplyr)
df <- data.frame(
  candidate_id = c("cand_1","cand_1","cand_1","cand_2","cand_3"),
  donor_id = c("don_1","don_1","don_2","don_2","don_3"),
  donation = c(1,2,3.5,4,10)
)
summ <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup()
num_candidates <- nlevels(df$candidate_id)
num_donors <- nlevels(df$donor_id)
smat <- Matrix(0, num_candidates, num_donors, sparse = TRUE, dimnames = list(
  levels(df$candidate_id),
  levels(df$donor_id)
))
indices <- summ %>%
  select(candidate_id, donor_id) %>%
  mutate_all(unclass) %>%
  as.matrix()
smat[indices] <- summ$tot_donation
smat
3 x 3 sparse Matrix of class "dgCMatrix"
don_1 don_2 don_3
cand_1 3 3.5 .
cand_2 . 4.0 .
cand_3 . . 10
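A caveat if you run this on R >= 4.0: data.frame() no longer converts strings to factors by default, so nlevels() returns 0 and levels() returns NULL here. In that case, convert the ID columns first, e.g.:
df[c("candidate_id", "donor_id")] <- lapply(df[c("candidate_id", "donor_id")], factor)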
You might try
library(reshape2)
dcast(fill, candidate_id ~ donor_id,
      value.var = "tot_donation",
      fun.aggregate = sum)
I don't know if it will avoid the memory issue, but it will likely be much faster than a double for loop.
I have to run to a meeting, but part of me wonders if there is a way to do this with outer.
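For what it's worth, a rough, untested sketch of the outer() idea, reusing the de-duplicated fill plus row_names and col_names from the question:
# named lookup vector: "candidate donor" -> total donation
lookup <- setNames(fill$tot_donation, paste(fill$candidate_id, fill$donor_id))
# outer() builds the full cross product in one vectorized call
x <- outer(row_names, col_names, function(r, c) lookup[paste(r, c)])
dimnames(x) <- list(row_names, col_names)
Pairs with no recorded donation come back as NA automatically, since indexing a named vector with an unknown name returns NA.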
I have a (new) question related to expss tables. I wrote a very simple UDF (that relies on a few expss functions), as follows:
library(expss)
z_indices <- function(x, m_global, std_global, weight=NULL){
  if(is.null(weight)) weight = rep(1, length(x))
  z <- (w_mean(x, weight) - m_global)/std_global
  indices <- 100 + (z*100)
  return(indices)
}
Reproducible example, based on the infert dataset (plus a vector of arbitrary weights):
data(infert)
infert$w <- as.vector(x=rep(2, times=nrow(infert)), mode='numeric')
infert %>%
  tab_cells(age, parity) %>%
  tab_cols(total(), education, case %nest% list(total(), education)) %>%
  tab_weight(w) %>%
  tab_stat_valid_n(label="N") %>%
  tab_stat_mean(label="Mean") %>%
  tab_stat_fun(label="Z", function(x, m_global, std_global, weight=NULL){
    z_indices(x, m_global=w_mean(infert$age, infert$w), std_global=w_sd(infert$age, infert$w))
  }) %>%
  tab_pivot(stat_position="inside_columns")
The table is computed and the output for the first row (age) is almost as expected.
Then things go wrong for the second row, since both arguments of z_indices explicitly refer to infert$age, where infert$parity is expected.
My question: is there a way to dynamically pass the variables of tab_cells as function arguments within tab_stat_fun, so they match the variable being processed? I guess this has to happen inside the function declaration, but I have no clue how to proceed...
Thanks!
EDIT April 28th 2020:
The answer from @Gregory Demin works great in the scope of the infert dataset, although for better scalability to larger dataframes I wrote the following loop:
var_df <- data.frame("age"=infert$age, "parity"=infert$parity)
tabZ <- infert
for(each in names(var_df)){
  tabZ <- tabZ %>%
    tab_cells(var_df[each]) %>%
    tab_cols(total(), education) %>%
    tab_weight(w) %>%
    tab_stat_valid_n(label="N") %>%
    tab_stat_mean(label="Mean") %>%
    tab_stat_fun(label="Z", function(x, m_global, std_global, weight=NULL){
      z_indices(x, m_global=w_mean(var_df[each], infert$w), std_global=w_sd(var_df[each], infert$w))
    })
}
tabZ <- tabZ %>% tab_pivot()
Hope this inspires other expss users in the future!
There is no universal solution for this case. The function in tab_stat_fun is always calculated inside each cell, so you can't get global values in it.
However, in your case we can calculate the z-index before summarizing. It's a less flexible solution, but it works:
# function for weighted z-score
w_z_index = function(x, weight = NULL){
  if(is.null(weight)) weight = rep(1, length(x))
  z <- (x - w_mean(x, weight))/w_sd(x, weight)
  indices <- 100 + (z*100)
  return(indices)
}
data(infert)
infert$w <- rep(2, times=nrow(infert))
infert %>%
  tab_cells(age, parity) %>%
  tab_cols(total(), education, case %nest% list(total(), education)) %>%
  tab_weight(w) %>%
  tab_stat_valid_n(label="N") %>%
  tab_stat_mean(label="Mean") %>%
  # here we get z-index instead of original variables
  tab_cells(age = w_z_index(age, w), parity = w_z_index(parity, w)) %>%
  tab_stat_mean(label="Z") %>%
  tab_pivot(stat_position="inside_columns")
UPDATE.
A little more scalable approach:
w_z_index = function(x, weight = NULL){
  if(is.null(weight)) weight = rep(1, length(x))
  z <- (x - w_mean(x, weight))/w_sd(x, weight)
  indices <- 100 + (z*100)
  return(indices)
}
w_z_index_df = function(df, weight = NULL){
  df[] = lapply(df, w_z_index, weight = weight)
  df
}
data(infert)
infert$w <- rep(2, times=nrow(infert))
infert %>%
  tab_cells(age, parity) %>%
  tab_cols(total(), education, case %nest% list(total(), education)) %>%
  tab_weight(w) %>%
  tab_stat_valid_n(label="N") %>%
  tab_stat_mean(label="Mean") %>%
  # here we get z-index instead of original variables
  # we process a lot of variables at once
  tab_cells(w_z_index_df(data.frame(age, parity))) %>%
  tab_stat_mean(label="Z") %>%
  tab_pivot(stat_position="inside_columns")
I want to turn a table into a data frame with three columns: 1. the zip code, 2. outcome "0", and 3. outcome "1". But as.data.frame.matrix turns the zip code into row names and makes it unusable.
I tried to add a fourth column with imaginary IDs (1:100) so R would use those as row names, but R tells me that "all arguments must be the same length" - which they are!
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
abs <- table(df$zip, df$outcome)
abs <- as.data.frame.matrix(abs)
Does someone have a nice and slick idea? Thanks in advance!
Edit:
With
abs <- as.matrix(as.data.frame(abs))
I get something close to what I want, but the outcomes end up together in one column. How can I separate them again, so it looks like the table?
You can get to your desired result easier with dplyr and tidyr:
library(dplyr)
library(tidyr)
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
df <- df %>%
  group_by(zip, outcome) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  spread(outcome, freq)
You are supplying only 100 values to a data frame that has 101 rows: sample(100:200, ...) draws from 101 distinct zip codes, so the table has 101 rows.
> nrow(abs)
[1] 101
so this would work
abs$new_col <- 1:101
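If the goal is just to make the zip codes usable again, you can also copy the row names into a proper column (a small addition to the answer, using the abs from the question):
abs$zip <- rownames(abs)
or, equivalently, tibble::rownames_to_column(abs, "zip").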
I think you want this:
abs2 <- as.data.frame(abs) %>% select(2,3,1)
I have a data set with hundreds of columns, and I want to keep the top 20 columns with the highest average (it could be another aggregation, like sum or SD).
How can I do this efficiently?
One way I can think of is to create a vector of the averages of all columns, sort it in descending order, keep the top n values, and then use it to subset my data set.
I am looking for a more elegant way, something that can also be part of a dplyr %>% pipe flow.
The code below creates a dummy dataset; I would also appreciate suggestions for more elegant ways to create it.
# initialize data set
set.seed(101)
df <- as.data.frame(matrix(round(runif(25, 2, 5), 0), nrow = 5, ncol = 5))
# add more columns
for (i in 1:5){
  set.seed(101)
  df_stage <- as.data.frame(matrix(
    round(runif(25, 5*i, 10*i), 0), nrow = 5, ncol = 5
  ))
  # paste0 avoids column names with spaces like "v 10"
  colnames(df_stage) <- paste0("v", (10*i):(10*i+4))
  df <- cbind(df, df_stage)
}
Another tidyverse approach with a bit of reshaping:
library(tidyverse)
n = 3
df %>%
  summarise_all(mean) %>%
  gather() %>%
  top_n(n, value) %>%
  pull(key) %>%
  df[.]
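On current tidyverse releases, gather() and top_n() are superseded; a sketch of the same idea with their replacements (assuming the same df and n as above):
df %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(everything()) %>%
  slice_max(value, n = n) %>%
  pull(name) %>%
  df[.]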
We can do this with
library(dplyr)
n <- 3
df %>%
  summarise_all(mean) %>%
  unlist %>%
  order(., decreasing = TRUE) %>%
  head(n) %>%
  df[.]
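For comparison, the same idea in one line of base R (assuming all columns are numeric, as in the dummy data):
df[order(colMeans(df), decreasing = TRUE)[1:n]]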
I am trying to split an ordered data frame into 10 equal buckets. The following works but it introduces an X1., X2., X3. ... prefix to each bucket, which prevents me from iterating over the buckets to sum them.
num_dfs <- 10
buckets <- split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
This produces a list of 10 data frames whose last element, buckets$`10`, looks like:
$`10`
predicted_duration actual_duration
177188 23.7402944 6
466561 23.7402663 12
479556 23.7401721 5
147585 23.7401666 48
Here's the crude code I am using to try to sum the groups.
for (i in 1:10){
  p <- sum(as.data.frame(df[i], row.names=NULL)$X1.actual_duration) # X1., X2., ...
  print(paste(i, "=", p))
}
How do I remove the Xn. grouping prefix or programmatically reference it using the index i?
Here's a similar reproducible example:
df <- data.frame(actual_duration=sample(100))
num_dfs <- 10
df_grouped <- as.data.frame(split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs))))
for (i in 1:10){
  p <- sum(df[i]$actual_duration) # does not work because of the postfix .1, .2, ... that R added
  print(paste(p))
}
I'm not entirely clear on what your issue is, but if you are just trying to get the sum by group couldn't you use
library(tidyverse)
df <- data.frame(actual_duration=sample(100))
df %>%
  arrange(actual_duration) %>%
  mutate(group = rep(1:10, each = 10)) %>%
  group_by(group) %>%
  summarise(sums = sum(actual_duration))
Alternatively, if you want to keep the list format:
df %>%
  arrange(actual_duration) %>%
  mutate(group = factor(rep(1:10, each = 10))) %>%
  split(., .$group) %>%
  map(., function(x) sum(x$actual_duration))
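And to answer the prefix question directly: the X1. names appear because as.data.frame() flattens the list returned by split() into one wide data frame. If you keep the result as a list, you can index it with [[i]] and no prefix is involved (a sketch using df and num_dfs from the question):
buckets <- split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
for (i in seq_along(buckets)){
  print(paste(i, "=", sum(buckets[[i]]$actual_duration)))
}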
I'm writing a function to remove duplicate observations of undirected relationships between firms when both parties report the relationship. For my ~1.3 million observation dataset, the function collapse_undirected below results in either R crashing during the mutate for hash, or the error
"translateCharUTF8 must be called on CHARSXP"
during the mutate for dup.
The goal of this function is to uniquely identify each pair of related firms by ordering and concatenating their IDs, and then dropping duplicate hashes reported in the same time period.
The data set I'm using is licensed, so I can't provide it, but the "translateCharUTF8" error is reproducible with the randomly generated data I've included below. It occurs more frequently with larger sets; I'd say it's common at around 9000 observations. I've also included a slow version of the function that runs without incident, which further leads me to believe that the problem occurs in the first mutate.
The function with the error:
collapse_undirected <- function(data, dir){
  out <- data %>% filter(REL_TYPE != dir)
  obs <- data %>% filter(REL_TYPE == dir) %>%
    group_by(SOURCE, TARGET) %>%
    # order-independent pair ID: smaller ID first, then larger
    mutate(hash = paste(min(SOURCE, TARGET), max(SOURCE, TARGET))) %>%
    group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    select(-hash, -dup)
  bind_rows(out, obs)
}
The slow workaround:
jank_undir <- function(data, dir){
  obs <- data %>% filter(REL_TYPE == dir)
  out <- data %>% filter(REL_TYPE != dir)
  obs$hash <- NA
  for(i in 1:nrow(obs)){
    obs$hash[i] <- paste(min(obs$SOURCE[i], obs$TARGET[i]),
                         max(obs$SOURCE[i], obs$TARGET[i]))
  }
  obs %>% group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    select(-hash, -dup) %>%
    bind_rows(out)
}
Here's a convenience function to randomly generate test data:
reroll <- function(n){
  test_data <- data_frame(1:n)
  test_data$SOURCE <- as.character(sample(1:27000, size = n, replace = TRUE))
  test_data$TARGET <- as.character(sample(1:27000, size = n, replace = TRUE))
  test_data$REL_TYPE <- "DUMMY"
  test_data$START <- sample(1:2870, size = n, replace = TRUE)
  test_data$END <- sample(1:2781, size = n, replace = TRUE)
  test_data
}
And, varying with the random draw, this should demonstrate the error:
library(dplyr)
test_data <- reroll(9000)
test_cleaned <- test_data %>% jank_undir("DUMMY")
test_cleaned <- test_data %>% collapse_undirected("DUMMY")
I'd greatly appreciate any insight into why this is happening. The slow version is fast enough for now, but I anticipate needing to use it on larger datasets. The R crashes occurred in both my Windows and Linux R sessions with the main dataset, but seem less frequent on Linux. My dplyr version is 0.7.2.
Thank you,
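One untested idea that might sidestep the error: pmin() and pmax() work elementwise on character vectors, so the hash can be built without grouping by SOURCE and TARGET at all, avoiding the grouped mutate where the crash appears. A sketch, taking the same inputs as collapse_undirected above:
collapse_undirected_v <- function(data, dir){
  out <- data %>% filter(REL_TYPE != dir)
  obs <- data %>% filter(REL_TYPE == dir) %>%
    # vectorized order-independent pair ID, no group_by needed
    mutate(hash = paste(pmin(SOURCE, TARGET), pmax(SOURCE, TARGET))) %>%
    group_by(START, END) %>%
    filter(!duplicated(hash)) %>%
    ungroup() %>%
    select(-hash)
  bind_rows(out, obs)
}
It should also be much faster than the row-by-row loop, since the hash is computed in a single vectorized pass.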