dplyr: R crash/errors with group_by, mutate

I'm writing a function to remove duplicate observations of undirected relationships between firms when both parties report the relationship. For my ~1.3 million observation dataset, the function collapse_undirected below results either in R crashing during the mutate for hash or in the error
"translateCharUTF8 must be called on CHARSXP"
during the mutate for dup.
The goal of this function is to uniquely identify each pair of related firms by ordering and concatenating their IDs, and then dropping duplicate hashes reported in the same time period.
The data set I'm using is licensed so I can't provide it, but the "translateCharUTF8" error is reproducible with the randomly generated data I've included below. It occurs more frequently with larger sets; I'd say it's common at around 9000 observations. I've also included a slow version of the function that runs without incident, which further leads me to believe that the problem occurs in the first mutate.
The function with the error:
collapse_undirected <- function(data, dir){
  out <- data %>% filter(REL_TYPE != dir)
  obs <- data %>% filter(REL_TYPE == dir) %>%
    group_by(SOURCE, TARGET) %>%
    mutate(hash = paste(min(SOURCE, TARGET), max(SOURCE, TARGET))) %>%
    group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    select(-hash, -dup)
  bind_rows(out, obs)
}
The slow workaround:
jank_undir <- function(data, dir){
  obs <- data %>% filter(REL_TYPE == dir)
  out <- data %>% filter(REL_TYPE != dir)
  obs$hash <- NA
  for(i in 1:nrow(obs)){
    obs$hash[i] <- paste(min(obs$SOURCE[i], obs$TARGET[i]),
                         max(obs$SOURCE[i], obs$TARGET[i]))
  }
  obs %>% group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    select(-hash, -dup) %>%
    bind_rows(out)
}
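For what it's worth, here is an untested sketch of a vectorised version of the hashing step: pmin() and pmax() work element-wise, so the group_by(SOURCE, TARGET) step isn't needed at all:
collapse_undirected_vec <- function(data, dir){
  # sketch only: same idea as collapse_undirected, but the hash is built
  # row-wise with pmin()/pmax() instead of min()/max() inside a grouped mutate
  out <- data %>% filter(REL_TYPE != dir)
  obs <- data %>% filter(REL_TYPE == dir) %>%
    mutate(hash = paste(pmin(SOURCE, TARGET), pmax(SOURCE, TARGET))) %>%
    group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    ungroup() %>%
    select(-hash, -dup)
  bind_rows(out, obs)
}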
Here's a convenience function to randomly generate test data:
reroll <- function(n){
  test_data <- data_frame(1:n)
  test_data$SOURCE <- as.character(sample(1:27000, size = n, replace = TRUE))
  test_data$TARGET <- as.character(sample(1:27000, size = n, replace = TRUE))
  test_data$REL_TYPE <- "DUMMY"
  test_data$START <- sample(1:2870, size = n, replace = TRUE)
  test_data$END <- sample(1:2781, size = n, replace = TRUE)
  test_data
}
And, varying with the random draw, this should demonstrate the error:
library(dplyr)
test_data <- reroll(9000)
test_cleaned <- test_data %>% jank_undir("DUMMY")
test_cleaned <- test_data %>% collapse_undirected("DUMMY")
I'd greatly appreciate any insight into why this is happening. The slow version is fast enough for now, but I anticipate needing to use it on larger datasets. The R crashes occurred in both my Windows and Linux R sessions with the main dataset, though they seem less frequent on the Linux version. My dplyr version is 0.7.2.
Thank you,

Related

My R functions intended to randomly assign participants to groups and simulate differences in means are not running through a pipe

I've created a random dataset using set.seed(1101) (RStudio, R version 4.0):
set.seed(1101)
library(tidyverse)
dat <- readr::read_csv("..data/datafile.csv")
My intention was to simulate randomly assigning participants to two groups of equal size and calculate the differences between their group mean scores, then create a distribution of these scores.
I have created functions to do these operations.
Group assignment function (A or B), labelled permute ('group' is the name of the variable in the dataset):
permute <- function(x) {
  x %>%
    mutate(group = sample(group, replace = TRUE))
}
A function called calc_diff that calculates the difference in mean scores between the two groups from a tibble:
calc_diff <- function(x) {
  x %>%
    group_by(group) %>%
    summarise(m = mean(Y)) %>%
    spread(group, m) %>%
    mutate(diff = A - B) %>%
    pull(diff)
}
When I run these functions individually in the console, they simulate differences just fine for the data object.
permute(x = dat) ## original data tibble
calc_diff(x = dat)
So I created a pipe to join these functions and store the output in a variable called dat_sim:
dat_sim <- dat %>%
  permute() %>%   # group assignment
  calc_diff()     # mean differences calc
Problem is that when I try to replicate that 1000 times through a pipe I get all instances of the same score, not randomly varied outputs. The code I have used is below.
dat_sim1000 <- rep(dat2 %>%
                     permute() %>%
                     calc_diff(), 1000)
# put replications at the end as for some reason it just replicated the number 1000 when I put it at the start
Thus, for some reason, the replicate function is not re-running the functions 1000 times to get different results, but running them once and storing that output 1000 times. What do I need to do to re-run the functions 1000 times and store each output?
Your problem is not your functions but the way you are trying to repeat the process. rep just takes its first argument and repeats it.
Try the following:
dat_sim1000 <- map(1:1000,
                   ~ dat2 %>%
                     permute() %>%
                     calc_diff())
You could also use lapply or a loop instead of map. map simply re-evaluates the expression after the ~ once for each value in the vector 1:1000.
EDIT
An example that should work:
set.seed(1101)
library(tidyverse)

dat <- tibble(group = sample(LETTERS[1:2], 50, TRUE),
              Y = rnorm(50))

permute <- function(x) {
  x %>%
    mutate(group = sample(group, replace = TRUE))
}

calc_diff <- function(x) {
  x %>%
    group_by(group) %>%
    summarise(m = mean(Y)) %>%
    spread(group, m) %>%
    mutate(diff = A - B) %>%
    pull(diff)
}

dat_sim1000 <- map(1:1000,
                   ~ dat %>%
                     permute() %>%
                     calc_diff())
summary(unlist(dat_sim1000))
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#> -0.740541 -0.188444 -0.004183 -0.003256  0.183467  0.821315
You may be confusing rep and replicate: rep just repeats its first argument. replicate, by contrast, repeatedly evaluates a given expression.
Note that the order of arguments is inverted (because who needs consistent APIs?):
dat_sim1000 = replicate(
  1000L,
  dat2 %>% permute() %>% calc_diff()
)
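Since replicate() simplifies its result to a plain vector by default, summary() can be called on it directly; pass simplify = FALSE if you prefer a list like map() returns (a small usage note, reusing the same example objects as above):
summary(dat_sim1000)   # replicate() returns a numeric vector by default

dat_sim_list <- replicate(
  1000L,
  dat2 %>% permute() %>% calc_diff(),
  simplify = FALSE     # returns a list instead, like map()
)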

Is there a way to loop through different levels of a factor for anomaly detection

I am using the 'anomalize' package for anomaly detection. My data consists of three columns, the date, an agent (this is where the different levels come from), and the number of schedules that agent had on a particular day. I can run the anomaly detection just fine when I remove the 'agent' column and sum the number of consults by day using this code:
df <- scheds %>%
  group_by(date) %>%
  summarise(
    new_scheds = sum(new_scheds)
  )

df_ts <- df %>%
  rownames_to_column() %>%
  as_tibble() %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
  select(-one_of('rowname'))
df_ts <- df_ts[order(df_ts$date),]
########## TS Decomp ###############
df_ts %>%
  time_decompose(new_scheds, method = "stl", frequency = 5, trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()
But I cannot figure out how I would do this same type of thing for each agent individually without manually typing everything out and using filter(). I have tried the following loop with no luck:
agents <- levels(ts_agents$agent)
results <- matrix(NA, length(agents))

for(i in 1:length(agents)){
  ts_agents %>%
    time_decompose(new_scheds)[i] %>%
    anomalize(remainder)[i] %>%
    time_recompose()[i] %>%
    plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)[i]
}
but I get the following error:
'Error in time_decompose(new_scheds) : object 'new_scheds' not found'
Any tips or pointers would be greatly appreciated!
The reason for your error is that the pipe operator %>% doesn't do what you expect when you subset the result on the right-hand side: the [i] means time_decompose(new_scheds) gets evaluated on its own rather than receiving the piped data, which is why new_scheds isn't found.
If you enclose the call in braces and use . to refer to the input, you will avoid this error:
for(i in 1:length(agents)){
  ts_agents %>% {
    time_decompose(., new_scheds)[i]
  } %>% {
    ...
This fixes the immediate problem of the error but I'm not sure how well the subsetting will work. It may be that you need filter() in the loop, or even group_by(df, agent) without any loop at all. (If you provide a full reproducible example including data, it will be easier to help).
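For example, a filter()-based version of the loop could look roughly like this (untested without your data; it assumes ts_agents has the columns date, agent and new_scheds, and reuses the decomposition settings from your working example):
library(dplyr)
library(anomalize)

agents <- unique(ts_agents$agent)

for (i in seq_along(agents)) {
  p <- ts_agents %>%
    filter(agent == agents[i]) %>%
    time_decompose(new_scheds, method = "stl", frequency = 5, trend = "auto") %>%
    anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
    time_recompose() %>%
    plot_anomalies(time_recomposed = TRUE, alpha_dots = 0.5) +
    ggplot2::ggtitle(agents[i])   # label each plot with its agent
  print(p)
}

# or, since anomalize's verbs respect dplyr groups, possibly without a loop at all:
ts_agents %>%
  group_by(agent) %>%
  time_decompose(new_scheds, method = "stl", frequency = 5, trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)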

Dataframe is too big for supercomputer

I am trying to create a matrix of donors and recipients, populated with the sum of donations for each pair, keeping any resulting NAs.
It works well for small datasets (see the toy example below), but when I switch to national datasets (3m entries) several problems emerge: besides being painfully slow, the creation of the fill data frame consumes all the memory of the (super)computer and I get the error "Error: cannot allocate vector of size 1529.0 Gb".
How should I tackle the problem?
Thanks a lot!
library(dplyr)
library(tidyr)
library(bigmemory)

candidate_id <- c("cand_1","cand_1","cand_1","cand_2","cand_3")
donor_id <- c("don_1","don_1","don_2","don_2","don_3")
donation <- c(1,2,3.5,4,10)

df = data.frame(candidate_id, donor_id, donation)
colnames(df) <- c("candidate_id","donor_id","donation")

fill <- df %>%
  group_by(df$candidate_id, df$donor_id) %>%
  summarise(tot_donation = sum(as.numeric(donation))) %>%
  complete(df$candidate_id, df$donor_id)
fill <- unique(fill[ ,1:3])
colnames(fill) <- c("candidate_id","donor_id","tot_donation")

nrow = length(unique(df$candidate_id))
ncol = length(unique(df$donor_id))
row_names = unique(fill$candidate_id)
col_names = unique(fill$donor_id)

x <- big.matrix(nrow, ncol, init = NA, dimnames = list(row_names, col_names))

for (i in 1:nrow){
  for (j in 1:ncol){
    x[i,j] <- fill[which(fill$candidate_id == row_names[i] &
                         fill$donor_id == col_names[j]), 3]
  }
}
I see you're using unique because your output has duplicated values. Based on this question, you should try the following in order to avoid duplication:
fill <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup %>%
  complete(candidate_id, donor_id)
Can you then try to create your desired output? I think unique can be very resource-heavy, so try to avoid calling it.
The tidyr version of what Benjamin suggested should be:
spread(fill, donor_id, tot_donation)
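Putting that together with the corrected summarise above (a sketch; spread() fills the missing candidate/donor pairs with NA, so complete() isn't strictly needed for this step):
wide <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup() %>%
  spread(donor_id, tot_donation)   # one row per candidate, one column per donor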
EDIT: By the way, since you tagged the question with sparse-matrix, you could indeed use sparsity to your advantage:
library(Matrix)
library(dplyr)

df <- data.frame(
  candidate_id = c("cand_1","cand_1","cand_1","cand_2","cand_3"),
  donor_id = c("don_1","don_1","don_2","don_2","don_3"),
  donation = c(1,2,3.5,4,10)
)

summ <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup

num_candidates <- nlevels(df$candidate_id)
num_donors <- nlevels(df$donor_id)

smat <- Matrix(0, num_candidates, num_donors, sparse = TRUE, dimnames = list(
  levels(df$candidate_id),
  levels(df$donor_id)
))

indices <- summ %>%
  select(candidate_id, donor_id) %>%
  mutate_all(unclass) %>%
  as.matrix

smat[indices] <- summ$tot_donation
smat
3 x 3 sparse Matrix of class "dgCMatrix"
       don_1 don_2 don_3
cand_1     3   3.5     .
cand_2     .   4.0     .
cand_3     .     .    10
You might try
library(reshape2)

dcast(fill, candidate_id ~ donor_id,
      value.var = "tot_donation",
      fun.aggregate = sum)
I don't know if it will avoid the memory issue, but it will likely be much faster than a double for loop.
I have to run to a meeting, but part of me wonders if there is a way to do this with outer.
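In a similar base-R spirit, a cross-tabulation with xtabs is another sketch worth trying: it sums the donations for each candidate/donor pair directly from df, though missing pairs come out as 0 rather than NA:
# dense cross-tabulation; missing candidate/donor pairs become 0 rather than NA
x <- xtabs(donation ~ candidate_id + donor_id, data = df)
x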

Efficiently create data.frames for a changing number of input csv files with identical 'tidy' format and size

I can't figure out how to:
1. efficiently create, with rbind or another way, a data.frame compiling csv-derived data.frames, whose number varies for different projects; or, similarly,
2. efficiently create a data.frame of the difference between a csv-derived "baseline scenario"'s values and those of the rest of the csv-based alternative scenarios.
The csvs are timeseries of hydrologic model output, already in long, 'tidy' format and they're identical in format, size, and order -- there's just different numbers of them for different projects. There's always at least two, a baseline and an alternative, but there's usually quite a few. Eg, Project A might have four csvs/scenarios and Project B might have thirty csvs/scenarios.
I'm hoping to have one code template that will efficiently accommodate projects with any number of scenarios. Without an efficient way, I need to add or delete quite a few lines to match the number of scenarios I have, on a sub-daily basis, so it's a time-consuming step I'd like to avoid. After df and df_diff are created, both are used for later summaries and plots.
I'll manually enter the names of the scenarios as they always differ, eg:
library(dplyr)
scenarios <- c("baseline", "alt1", "alt1b", "no dam")
length(scenarios) will always match the number of CSVs I have for a given project.
Read in the csvs (one csv for each scenario) and keep them unmodified for later, separate processing:
# In my case these csv#s are from a separate file's list of csvs,
# e.g. csv1 <- read.csv("baseline.csv")
#      csv2 <- read.csv("alt1.csv"), etc. - all tidy monthly timeseries of many variables
# For reproducibility, simplifying:
csv1 <- data.frame("variable" = "x", "value" = 13)    # baseline scenario
csv2 <- data.frame("variable" = "x", "value" = 5)     # "alternative 1"
csv3 <- data.frame("variable" = "x", "value" = 109)   # "alternative 1b"
csv4 <- data.frame("variable" = "x", "value" = 11)    # "dam removal"
# csv5 <- data.frame("variable" = "x", "value" = 2.5) # "100 extra flow for salmon sep-dec"
# ...
# csv30 <- data.frame("variable" = "x", "value" = 41) # "alternative H3"
Copy the csvs and connect data to scenario:
baseline <- csv1 %>% mutate(scenario = as.factor(paste0(scenarios[1])))
scen2 <- csv2 %>% mutate(scenario = as.factor(paste0(scenarios[2])))
scen3 <- csv3 %>% mutate(scenario = as.factor(paste0(scenarios[3])))
scen4 <- csv4 %>% mutate(scenario = as.factor(paste0(scenarios[4])))
df <- rbind(baseline, scen2, scen3, scen4) #data.frame #1 I'm looking for.
#eg, if csv1-csv30 were included, how to compile in df efficiently, w/o needing the "scen" lines?
There are 4 scenarios in this case, so df$scenario has 4 levels.
Now for the second "difference" data.frame:
bslnevals <- baseline %>% select(value)
scen2vals <- scen2 %>% select(value)
scen3vals <- scen3 %>% select(value)
scen4vals <- scen4 %>% select(value)

scen2diff <- (scen2vals - bslnevals) %>%
  transmute(value_diff = value,
            scenario_diff = as.factor(paste0(scenarios[2], " - baseline"))) %>%
  data.frame(scen2) %>%
  select(-value, -scenario)

scen3diff <- (scen3vals - bslnevals) %>%
  transmute(value_diff = value,
            scenario_diff = as.factor(paste0(scenarios[3], " - baseline"))) %>%
  data.frame(scen3) %>%
  select(-value, -scenario)

scen4diff <- (scen4vals - bslnevals) %>%
  transmute(value_diff = value,
            scenario_diff = as.factor(paste0(scenarios[4], " - baseline"))) %>%
  data.frame(scen4) %>%
  select(-value, -scenario)

df_diff <- rbind(scen2diff, scen3diff, scen4diff) # data.frame #2 I'm looking for.
# same as above, if csv1 - csv30 were included, how to compile df_diff efficiently, w/o
# needing the "scen#vals" and "scen#diff" lines?

rm(baseline, scen2, scen3, scen4)               # declutter - now unneeded (but csv1, csv2, etc. orig csv#s needed later)
rm(bslnevals, scen2vals, scen3vals, scen4vals)  # unneeded
rm(scen2diff, scen3diff, scen4diff)             # unneeded
With 4 scenarios, there are 3 differences from the baseline so df_diff$scenario has 3 levels.
So, whether I have 4 csvs (1 baseline, 3 alternatives) or 30 csvs (1 baseline, 29 alternatives), I tried to write functions and for loops that would assign the scen2, scen3, ... scen28 and scen2diff, scen3diff, ... scen28diff variables dynamically, but I failed. I'm looking for a way that works and that doesn't need much modification when applied to a project with any number of scenarios: just a clean way to create df and df_diff for however many scenarios (i.e. csvs) happen to be given to me or another user for a given project.
Any help is greatly appreciated.
I can't test with your case but this may be a good starting point for refactoring your code. I use case_when to generate rules to map the name of the CSV file to a scenario. I subtract the baseline value from the value in each scenario.
library(dplyr)
library(readr)
library(purrr)
library(tidyr)

baseline_df <- read_csv("baseline.csv") %>%
  mutate(id = row_number())

# list all csv files (in current directory), then read them all, and row-bind them.
# use case_when to apply rules to change filenames to "scenarios" (grepl to check presence of string)
# join with baseline df (by scenario row number) for easy subtracting.
# calculate the difference values.
# remove baseline-baseline rows (diff is 0)
diff_df <- list.files(path = getwd(), pattern = "*.csv", full.names = TRUE) %>%
  tibble(filename = .) %>%
  mutate(data = map(filename, read_csv)) %>%
  unnest() %>%
  mutate(scenario = case_when(
    grepl("baseline", filename) ~ "baseline",
    grepl("alternative1", filename) ~ "alt1",
    grepl("alternative2", filename) ~ "alt2",
    grepl("dam_removal", filename) ~ "no dam",
    TRUE ~ "other"
  )) %>%
  group_by(scenario) %>%
  mutate(id = row_number()) %>%
  left_join(baseline_df, by = "id", suffix = c("_new", "_baseline")) %>%
  mutate(Value_diff = Value_new - Value_baseline) %>%
  filter(scenario != "baseline")
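If the csvs have already been read into R in the same order as scenarios, a shorter variant (a sketch using a hypothetical csv_list object) binds them with an id column and computes the differences by position, assuming every csv has an identical row order:
library(dplyr)
library(purrr)

# hypothetical: csv_list holds the already-read data frames, named by scenario
csv_list <- set_names(list(csv1, csv2, csv3, csv4), scenarios)

# data.frame #1: stack everything, tagging each row with its scenario
df <- bind_rows(csv_list, .id = "scenario") %>%
  mutate(scenario = factor(scenario, levels = scenarios))

# data.frame #2: subtract the baseline values row by row
baseline_vals <- csv_list[[1]]$value
df_diff <- imap_dfr(csv_list[-1], function(d, nm) {
  d %>%
    mutate(value_diff = value - baseline_vals,
           scenario_diff = factor(paste(nm, "- baseline"))) %>%
    select(-value)
})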

Generating a unique ID column for large dataset with the RecordLinkage package

I am trying to generate a unique ID column using the RecordLinkage package. I have successfully done so when working with smaller datasets (<= 1,000,000), but have not been able to reproduce this result for larger datasets (> 1,000,000) that use different (but similar) functions in the package. I am given multiple identifier variables for which I want to generate a unique ID despite the fact that there may be some errors (near matches) or duplicates in the records.
Given some data frame of identifiers:
data(RLdata500)
df_identifiers <- RLdata500
This is the code for the smaller datasets (which works):
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- compare.dedup(df_identifiers)
p <- epiWeights(rpairs)
classify <- epiClassify(p, 0.3)
matches <- getPairs(object = classify, show = "links", single.rows = TRUE)

# this code writes an "ID" column that is the same for similar identifiers
classify <- matches %>% arrange(ID.1) %>% filter(!duplicated(ID.2))
df_identifiers$ID_prior <- df_identifiers$ID

# merge matching information with the original data
df_identifiers <- left_join(df_identifiers, matches %>% select(ID.1, ID.2), by = c("ID" = "ID.2"))

# replace matches in ID with the thing they match with from ID.1
df_identifiers$ID <- ifelse(is.na(df_identifiers$ID.1), df_identifiers$ID, df_identifiers$ID.1)
This approach is discussed here. But this code does not seem to be extensible when applied towards larger datasets when using other functions. For example, the big data equivalent of compare.dedup is RLBigDataDedup, whose RLBigData class support similar functions such as epiWeights, epiClassify, getPairs, etc. Replacing compare.dedup with RLBigDataDedup does not work in this situation.
Consider the following attempt for large datasets:
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- RLBigDataDedup(df_identifiers)
p=epiWeights(rpairs)
( . . . )
Here, the remaining code is almost identical to that of the first. Although epiWeights and epiClassify work on the RLBigData class as expected, getPairs does not: it does not use the show = "links" argument. Because of this, none of the subsequent code works.
Is there a different approach that needs to be taken to generate a column of unique IDs when working with larger datasets in the RLBigData class, or is this just a limitation?
First, import the following libraries:
library(RecordLinkage)
library(dplyr)
library(magrittr)
Consider these example datasets from the RecordLinkage package:
data(RLdata500)
data(RLdata10000)
Assume we care about these matching variables and threshold:
matching_variables <- c("fname_c1", "lname_c1", "by", "bm", "bd")
threshold <- 0.5
The record linkage for SMALL datasets is as follows:
RLdata <- RLdata500
df_names <- data.frame(RLdata[, matching_variables])

df_names %>%
  compare.dedup() %>%
  epiWeights() %>%
  epiClassify(threshold) %>%
  getPairs(show = "links", single.rows = TRUE) -> matching_data
Here, the following SMALL data manipulation may be applied to append the appropriate IDs to the given dataset (same code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id1, id2) %>%
                         arrange(id1) %>%
                         filter(!duplicated(id2)),
                       by = c("ID" = "id2")) %>%
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1)

RLdata$ID <- RLdata_ID$ID
The equivalent code for LARGE datasets is as follows:
RLdata <- RLdata10000
df_names <- data.frame(RLdata[, matching_variables])

df_names %>%
  RLBigDataDedup() %>%
  epiWeights() %>%
  epiClassify(threshold) %>%
  getPairs(filter.link = "link", single.rows = TRUE) -> matching_data
Here, the following LARGE data manipulation may be applied to append the appropriate IDs to the given dataset (similar to code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
                       select(matching_data, id.1, id.2) %>%
                         arrange(id.1) %>%
                         filter(!duplicated(id.2)),
                       by = c("ID" = "id.2")) %>%
  mutate(ID = ifelse(is.na(id.1), ID, id.1)) %>%
  select(-id.1)

RLdata$ID <- RLdata_ID$ID
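As a quick sanity check (assuming the example above has run), comparing the number of distinct IDs with the number of rows shows how many near-duplicate records were collapsed into a shared ID:
nrow(RLdata)               # total records
length(unique(RLdata$ID))  # distinct entities after deduplication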
