I have a piece of code in R. Every time I run it on a cluster, I get an answer where the order of the columns is different (it seems to be fine on my laptop). If I reorder the columns so they match, the answers are identical; the only problem is the column order.
NNs_loc_year <- Reduce(cbind,
                       split(NNs_loc_year,
                             rep(1:n_neighbors, each=(nrow(NNs_loc_year)/n_neighbors)))) %>%
  data.table()
# rename columns
NN_dist <- NN_dist %>% data.table()
names(NN_dist) <- paste0("NN_", c(1:n_neighbors))
names(NNs_loc_year) <- paste0(names(NNs_loc_year), paste0("_NN_", rep(1:n_neighbors, each=2)))
NN_chi <- pchi(as.vector(NN_list$nn.dist), PCs)
NN_sigma <- qchi(NN_chi, 1)
NN_sigma_df = Reduce(cbind,
                     split(NN_sigma,
                           rep(1:n_neighbors, each=(length(NN_sigma)/n_neighbors)))) %>%
  data.table()
names(NN_sigma_df) <- paste0("sigma_NN_", c(1:n_neighbors))
NN_dist_tb = rbind(NN_dist_tb, NN_dist)
NNs_loc_year_tb = rbind(NNs_loc_year_tb, NNs_loc_year)
NN_sigma_tb = rbind(NN_sigma_tb, NN_sigma_df)}
I have a collection of ten dataframes (df.a, df.b, and so on) and the following concept in R
df.x.new = df.x %>% do this %>% do that...
I wondered if there is an elegant way to interchange the df.x variable of my single code line above iteratively with my dfs one by another, to get ten new dfs as an output.
Meaning something like this:
#place your elegant code here
df.a.new = df.a %>% do this %>% do that
df.b.new = df.b %>% do this %>% do that
#and so on
Edit:
#this should serve as a minimal reproducible code
df.a = c(1,2,3)
df.b = c(4,5,6)
df.c = c(7,8,9)
df.a.new = df.a %>% left_join(df.b)
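One way to approach this, sketched under the assumption that the objects really are data frames and can be collected into a named list. The `process` function and the column `x` are hypothetical placeholders for the real "do this, do that" pipeline:

```r
# Hypothetical stand-ins for df.a, df.b, df.c (assumed to be data frames)
dfs <- list(
  a = data.frame(x = c(1, 2, 3)),
  b = data.frame(x = c(4, 5, 6)),
  c = data.frame(x = c(7, 8, 9))
)

# Hypothetical pipeline step; replace the body with the real transformations
process <- function(d) {
  d$x2 <- d$x * 2
  d
}

# Apply the same step to every data frame; the result is again a named list
dfs_new <- lapply(dfs, process)
names(dfs_new) <- paste0(names(dfs_new), ".new")

dfs_new$b.new
```

Keeping the results in one list (rather than ten separate variables) usually makes later steps easier, because the same `lapply` pattern can be reused.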
I have a tibble with a ton of data in it, but most importantly, I have a column that references a row in a lookup table by number (ex. 1,2,3 etc).
library(tibble)
df <- tibble(ref = c(1,1,1,2,5),
             data = c(33,34,35,35,32))
lkup <- tibble(CurveID = c(1,2,3,4,5),
               Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
               Intercept = c(40,38,40,38,36),
               Min = c(25,25,21,21,18),
               Max = c(36,36,38,37,32))
I need to do a calculation for each row in the original tibble based on the information in the referenced row in the lookup table.
df$result <- df$data - lkup$Intercept[lkup$CurveID == df$ref]/lkup$Slope[lkup$CurveID == df$ref]
The idea is to access the slope or intercept (etc) value from the correct row of the lookup table based on the number in the data table, and to do this for each data point in the column. But I keep getting an error telling me my data isn't compatible, and that my objects need to be of the same length.
You could also do it with match()
df$result <- df$data - lkup$Intercept[match(df$ref, lkup$CurveID)]/lkup$Slope[match(df$ref, lkup$CurveID)]
df$result
# [1] 43.52632 44.52632 45.52632 45.85714 42.90909
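To see why match() works where == did not, here is a small base R sketch using the same values as plain vectors (the names `ref` and `curve_id` are just for illustration). `==` compares the two vectors position by position, whereas match() returns, for each reference, the position of the matching ID:

```r
ref      <- c(1, 1, 1, 2, 5)   # df$ref
curve_id <- c(1, 2, 3, 4, 5)   # lkup$CurveID

# Element-wise comparison: one TRUE/FALSE per position, not a lookup
elementwise <- curve_id == ref

# Lookup: for each value of ref, the row of the lookup table it refers to
positions <- match(ref, curve_id)

elementwise
positions
```

Indexing with `elementwise` keeps only the positions that happen to be TRUE, so the result is shorter than `df$data`; indexing with `positions` always returns one value per row of `df`.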
You could use the dplyr package to join the tibbles together. If the ref column and CurveID column have the same name then left_join will combine the two tibbles by the matching rows.
library(dplyr)
df <- tibble(CurveID = c(1,1,1,2,5),
data = c(33,34,35,35,32))
lkup <- tibble(CurveID = c(1,2,3,4,5),
Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
Intercept = c(40,38,40,38,36),
Min = c(25,25,21,21,18),
Max = c(36,36,38,37,32))
df <- df %>% left_join(lkup, by = "CurveID")
Then do the calculation on each row
df <- df %>% mutate(result = data - (Intercept/Slope)) %>%
select(CurveID, data, result)
For completeness' sake, here's one way to literally do what OP was trying:
library(slider)
df %>%
mutate(result = slide_dbl(ref, ~ slice(lkup, .x)$Intercept /
slice(lkup, .x)$Slope))
though since slice goes by row number, this relies on CurveID equalling the row number (we make no reference to CurveID at all). You can write it differently with filter but it ends up being more code.
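As a rough illustration of looking rows up by CurveID value rather than by row number, here is a base R sketch (not the filter version mentioned above, just the same idea without extra packages). The lookup table is deliberately stored in reverse order so that row number and CurveID no longer coincide:

```r
df <- data.frame(ref = c(1, 1, 1, 2, 5),
                 data = c(33, 34, 35, 35, 32))

# Same lookup values as in the question, but with the rows reversed
lkup <- data.frame(CurveID   = c(5, 4, 3, 2, 1),
                   Slope     = c(-3.3, -3.3, -3.1, -3.5, -3.8),
                   Intercept = c(36, 38, 40, 38, 40))

# For each reference, pick the lookup row whose CurveID matches it
ratio <- vapply(df$ref, function(id) {
  row <- lkup[lkup$CurveID == id, ]
  row$Intercept / row$Slope
}, numeric(1))

df$result <- df$data - ratio
df$result
```

This assumes every CurveID appears exactly once in the lookup table; a missing or duplicated ID would make `vapply` fail because the function would no longer return a single number.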
I am trying to read a set of tab separated files into a matrix or data.frame. For each file I need to extract one column and then concatenate all the columns into a single matrix keeping both column and row names.
I am using tidyverse (and I am terrible at that). I successfully get column names but I miss row names at the very last stage of processing.
library("purrr")
library("tibble")
samples <- c("a","b","c","d")
a <- samples %>%
purrr::map_chr(~ file.path(getwd(), TARGET_FOLDER, paste(., "tsv", sep = "."))) %>%
purrr::map(safely(~ read.table(., row.names = 1, skip = 4))) %>%
purrr::set_names(samples) %>%
purrr::transpose()
is_ok <- a$error %>% purrr::map_lgl(is_null)
x <- a$result[is_ok] %>%
purrr::map(~ {
v <- .[,1]
names(v) <- rownames(.)
v
}) %>% as_tibble(rownames = NA)
The x data.frame has correct colnames but lacks rownames. All the elements of the a$result list have the same rownames in exactly the same order. I am aware of tricks like rownames(x) <- rownames(a$result[[1]]) but I am looking for a more consistent solution.
It turned out that the solution was easier than expected. Using as.data.frame instead of the final as_tibble solved it.
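A minimal base R sketch of that behaviour, using a small made-up list of named vectors in place of the files (the names `cols`, `g1` and `g2` are hypothetical):

```r
# Each element stands in for one column extracted from one file,
# with the row names carried as the names of the vector
cols <- list(a = c(g1 = 1, g2 = 2),
             b = c(g1 = 3, g2 = 4))

x <- as.data.frame(cols)

rownames(x)
colnames(x)
```

Here the row names are taken from the names of the vectors, which is why no separate rownames(x) <- ... step is needed.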
I'm writing a function to remove duplicate observations of undirected relationships between firms when both parties report the relationship. For my ~1.3 million observation dataset, the function collapse_undirected below results either in R crashing during the mutate for hash or the error:
"translateCharUTF8 must be called on CHARSXP"
during the mutate for dup.
The goal of this function is to uniquely identify each pair of related firms by ordering and concatenating their IDs, and then dropping duplicate hashes reported in the same time period.
The data set I'm using is licensed so I can't provide it, but the "translateCharUTF8" error is reproducible with the randomly generated data I've included below. It occurs more frequently with larger sets; I'd say it's common at around 9000 observations. I've also included a slow version of the function that runs without incident, which further leads me to believe that the problem occurs in the first mutate.
The function with the error:
collapse_undirected <- function(data, dir){
  out <- data %>% filter(REL_TYPE != dir)
  obs <- data %>% filter(REL_TYPE == dir) %>%
    group_by(SOURCE, TARGET) %>%
    mutate(hash = paste(min(SOURCE, TARGET),max(SOURCE, TARGET))) %>%
    group_by(START, END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    select(-hash,-dup)
  bind_rows(out,obs)
}
The slow workaround:
jank_undir <- function(data, dir){
  obs <- data %>% filter(REL_TYPE == dir)
  out <- data %>% filter(REL_TYPE != dir)
  obs$hash <- NA
  for(i in 1:nrow(obs)){
    obs$hash[i] <- paste(min(obs$SOURCE[i], obs$TARGET[i]),
                         max(obs$SOURCE[i], obs$TARGET[i]))
  }
  obs %>% group_by(START,END) %>%
    mutate(dup = duplicated(hash)) %>%
    filter(!dup) %>%
    select(-hash,-dup) %>%
    bind_rows(out)
}
Here's a convenience function to randomly generate test data:
reroll <- function(n){
test_data <- data_frame(1:n)
test_data$SOURCE <- as.character(sample(1:27000, size = n, replace = TRUE))
test_data$TARGET <- as.character(sample(1:27000, size = n, replace = TRUE))
test_data$REL_TYPE <- "DUMMY"
test_data$START <- sample(1:2870, size = n, replace = TRUE)
test_data$END <- sample(1:2781, size = n, replace = TRUE)
test_data
}
And, varying with the random draw, this should demonstrate the error:
library(dplyr)
test_data <- reroll(9000)
test_cleaned <- test_data %>% jank_undir("DUMMY")
test_cleaned <- test_data %>% collapse_undirected("DUMMY")
I'd greatly appreciate any insight into why this is happening. The slow version is fast enough for now, but I anticipate needing to use it for larger datasets. The R crashes occurred in both my Windows and Linux R sessions with the main dataset, but seem less frequent on Linux. My dplyr version is 0.7.2.
Thank you,
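This does not explain the error itself, but as a possible faster alternative to the row-by-row loop, the hash could be built without grouping by using the vectorised pmin() and pmax(). The sketch below uses three made-up ID pairs as plain character vectors (the names `src` and `tgt` are hypothetical stand-ins for the SOURCE and TARGET columns):

```r
src <- c("10", "7", "3")
tgt <- c("7", "10", "3")

# pmin()/pmax() work element-wise, so each pair is ordered in one call
# (comparison is by string order, as in the original min()/max() on characters)
hash <- paste(pmin(src, tgt), pmax(src, tgt))

# The first two rows describe the same undirected pair
dup <- duplicated(hash)

hash
dup
```

Inside the pipeline this would amount to a single ungrouped mutate for the hash, followed by the existing grouping on START and END.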
Using a user-defined function, I have to join the lower and upper bounds of confidence intervals (named CIlow and CIhigh) for a selected number of columns from a data frame. The data frame has CIlow and CIhigh for a number of groups (named a, b and c) and for a number of rows (in this example just two). See below what the data frame looks like.
dataframe<-data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),
CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
I would like to have a joined column for each group in a selected number of groups (e.g. a, b) among the existing ones (a, b and c).
Thus, the expected output should be the following:
output<-data.frame(CI_a=c("(1.1,1.3)","(1.2,1.4)"),
CI_b=c("(2.1,2.3)","(2.2,2.4)"))
To build my own user-defined function I tried the following code:
f<-function(df,gr){
enquo_gr<-enquo(gr)
r<-df%>%
dplyr::mutate(UQ(paste("CI",quo_name(gr),sep="_")):=
sprintf("(%s,%s)",
paste("CIlow",UQ(enquo_gr),sep="_"),
paste("CIhigh",UQ(enquo_gr),sep="_")))%>%
dplyr::select(paste("CI",UQ(enquo_gr),sep="_"))
return(r)
}
However when using the above mentioned function in this way
library(dplyr)
group<-c("a","b")
dataframe<-data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
f(df=dataframe,gr=group)
I get the following error message:
Error: expr must quote a symbol, scalar, or call
How could I solve this issue?
PS1: This question is similar to a previous one. However, this question goes one step further because it requires selecting the columns to be merged.
PS2: I would appreciate code suggestions following the approach of this question.
If we are passing quoted strings, then use sym (for more than one element - syms which return a list)
f <- function(df, gr){
sl <- rlang::syms(paste("CIlow", gr, sep="_"))
sh <- rlang::syms(paste("CIhigh", gr, sep="_"))
nmN <- paste("CI", gr, sep= "_")
df %>%
dplyr::mutate(!!(nmN[1]) := sprintf("(%s,%s)",
!!(sl[[1]]), !!(sh[[1]])),
!!(nmN[2]) := sprintf("(%s,%s)",
!!(sl[[2]]), !!(sh[[2]]))) %>%
dplyr::select(paste("CI", gr, sep="_"))
}
group <- c("a","b")
f(dataframe, group)
# CI_a CI_b
#1 (1.1,1.3) (2.1,2.3)
#2 (1.2,1.4) (2.2,2.4)
I would probably have answered differently based on the question alone, but after examining your answer I prepared the code below. It uses the lapply trick from the question "dplyr::unite across column patterns". I am not sure whether dplyr/tidyr is the best option here; a simple for loop might be simpler.
output <- data.frame(CI_a=c("(1.1,1.3)","(1.2,1.4)"),
CI_b=c("(2.1,2.3)","(2.2,2.4)"),
stringsAsFactors = F)
dataframe <- data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),
CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
tricky <- function(input_data, group_ids){
# convert columns to character
input_data <- input_data %>%
mutate_each(funs(as.character(.)))
# unite selected groups
output <- group_ids %>%
lapply(function(group_id) {unite_(input_data,
paste0("CI_", group_id),
paste0(c("CIlow_", "CIhigh_"), group_id),
sep = ',') %>% select_(paste0("CI_", group_id))}) %>%
bind_cols() %>%
mutate_each(funs(paste0("(", ., ")")))
return(output)
}
identical(tricky(dataframe, list("a", "b")), output)
I found a solution to my issue myself. The code below works:
output<-data.frame(CI_a=c("(1.1,1.3)","(1.2,1.4)"), CI_b=c("(2.1,2.3)","(2.2,2.4)"))
dataframe<-data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),
CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
f <- function(df, gr){
  # build the column symbols and new names locally (no global assignment needed)
  sl <- rlang::syms(paste("CIlow", gr, sep="_"))
  sh <- rlang::syms(paste("CIhigh", gr, sep="_"))
  nmN <- paste("CI", gr, sep= "_")
  r <- df
  # add one joined column per requested group
  for(i in seq_along(gr)){
    r <- dplyr::mutate(r, !!(nmN[i]) := sprintf("(%s,%s)", !!(sl[[i]]), !!(sh[[i]])))
  }
  r <- dplyr::select(r, dplyr::all_of(nmN))
  return(r)
}
group <- c("a","b")
x<-f(df=dataframe, gr=group)
The code works for an undefined number of elements in group. Thus, it works for c("a","b"), for c("a") or c("a","b","c").
I know loops are not recommended. Any better solution is appreciated.
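One possible loop-free variant, sketched in base R only. It drops the tidy evaluation approach entirely and builds each joined column with lapply(); the function name `join_ci` is hypothetical:

```r
dataframe <- data.frame(CIlow_a = c(1.1, 1.2), CIlow_b = c(2.1, 2.2), CIlow_c = c(3.1, 3.2),
                        CIhigh_a = c(1.3, 1.4), CIhigh_b = c(2.3, 2.4), CIhigh_c = c(3.3, 3.4))

join_ci <- function(df, gr) {
  # One character vector "(low,high)" per requested group
  cols <- lapply(gr, function(g) {
    sprintf("(%s,%s)", df[[paste0("CIlow_", g)]], df[[paste0("CIhigh_", g)]])
  })
  names(cols) <- paste0("CI_", gr)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

res <- join_ci(dataframe, c("a", "b"))
res
```

Like the loop version, this works for any subset of the groups, for example c("a") or c("a","b","c").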