dplyr join two tables within a function where one variable name is an argument to the function - r

I am trying to join two tables using dplyr within a function, where one of the variable names is defined by an argument to the function. In other dplyr functions, there is usually a version available for non-standard evaluation, e.g. select & select_, rename and rename_, etc, but not for the _join family. I found this answer, but I cannot get it to work in my code below:
df1 <- data.frame(gender = rep(c('M', 'F'), 5), var1 = letters[1:10])
new_join <- function(df, sexvar){
df2 <- data.frame(sex = rep(c('M', 'F'), 10), var2 = letters[20:1])
# initial attempt using usual dplyr behaviour:
# left_join(df, df2, by = c(sexvar = 'sex'))
# attempt using NSE:
# left_join(df, df2,
# by = c(eval(substitute(var), list(var = as.name(sexvar)))) = 'sex'))
# attempt using setNames:
# left_join(df, df2, by = setNames(sexvar, 'sex'))
}
new_join(df1, 'gender')
The first and second attempt give the error
Error: 'sexvar' column not found in rhs, cannot join
while the last attempt gives the error
Error: 'gender' column not found in lhs, cannot join,
which at least shows it knows I want the column gender, but somehow doesn't see it as a column heading.
Can anyone point out where I am going wrong?

Try:
library(dplyr)
df1 <- data.frame(gender = rep(c('M', 'F'), 5), var1 = letters[1:10])
new_join <- function(df, sexvar){
df2 <- data.frame(sex = rep(c('M', 'F'), 10), var2 = letters[20:1])
join_vars <- c('sex')
names(join_vars) <- sexvar
left_join(df, df2, by = join_vars)
}
new_join(df1, 'gender')
I'm sure there's a more elegant way of getting this to work using lazy evaluation, etc., but this should get you up-and-running in the meantime.

A oneliner in your block can look like this (which is similar to your last attempt)
left_join(df, df2, by = structure("sex", names = sexvar))
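It is also possible to extend this to two variables. The names of the vector are the columns of the left table (df) and the values are the columns of the right table (df2):
left_join(df, df2, by = structure(sexvarDF2, names = sexvarDF1))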
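The named-vector approach from the answers above can be wrapped into a small helper. This is only a minimal sketch: join_on, x_key and y_key are hypothetical names chosen for illustration, and the two small data frames are made up.

```r
library(dplyr)

# Hypothetical helper: join x to y when both key column names arrive as strings.
join_on <- function(x, y, x_key, y_key) {
  # In `by`, the names refer to columns of x (left side)
  # and the values refer to columns of y (right side).
  left_join(x, y, by = setNames(y_key, x_key))
}

df1 <- data.frame(gender = c("M", "F", "M"), var1 = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(sex = c("M", "F"), var2 = c("x", "y"),
                  stringsAsFactors = FALSE)

res <- join_on(df1, df2, "gender", "sex")
```

The key column keeps the name it has in the left table, so res has the columns gender, var1 and var2.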

Related

Filter rows in dataset for distinct words in r

Goal: To filter rows in a dataset so that only distinct words remain. At the moment, I have used inner_join to retain the rows present in both datasets, which has left duplicate rows in the result.
Attempt 1: I have tried to use distinct to retain only those rows which are unique, but this has not worked. I may be using it incorrectly.
This is my code so far; output attached in png format:
# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are
warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2
coll2 <- dplyr::semi_join(coll, warriner, by = "word") # join all rows in a that have a match in b
# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches
You can try :
library(dplyr)
warriner2 <- inner_join(warriner, coll, by = "word") %>%
distinct(word, .keep_all = TRUE)
To clarify Ronak's answer further, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep distinct rows if that's what you want. Your error might very well have occurred because you performed two operations and assigned the result to the same name both times (warriner2).
library(dplyr)
# Here's a couple sample tibbles
name <- c("cat", "dog", "parakeet")
df1 <- tibble(
x = sample(5, 99, replace = TRUE),
y = sample(5, 99, replace = TRUE),
name = rep(name, times = 33))
df2 <- tibble(
x = sample(5, 99, replace = TRUE),
y = sample(5, 99, replace = TRUE),
name = rep(name, times = 33))
# It's much less confusing if you do this in one pipe
p <- df1 %>%
inner_join(df2, by = "name") %>%
distinct()
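The difference between the two uses of distinct() can be seen on a tiny example. This is a sketch with made-up data: the valence and freq columns and their values are assumptions, only the table names warriner and coll come from the question.

```r
library(dplyr)

# Made-up stand-ins for the two tables in the question
warriner <- tibble(word = c("happy", "sad"), valence = c(8.5, 2.1))
coll <- tibble(word = c("happy", "happy", "sad"), freq = c(3, 5, 2))

joined <- inner_join(warriner, coll, by = "word")

# distinct() with no columns drops only rows that are identical in every column
all_cols <- distinct(joined)

# distinct(word, .keep_all = TRUE) keeps the first row for each word
by_word <- distinct(joined, word, .keep_all = TRUE)
```

Here all_cols still has one row per match because freq differs between the two "happy" rows, while by_word has one row per word.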

How to create column names from character vector when using data.table

I have a data.table like so:
dt = data.table(id_1 = c(rep(1:3, 5)), id_2 = sort(rep(c('A', 'B', 'C'), 5)), value_1 = rnorm(15, 1, 1), value_2 = rpois(15, 1))
I would like to create a function which groups the table by some columns specified by a function parameter and performs an action (let's say sum) on several other columns specified by another parameter. Finally, I'd like to specify names for the new columns as another function parameter. My problem is: I don't really know how to create names from a character vector when I am not using assignment by reference (:=).
The following two approaches achieve exactly what i want to do, i just don't like the way:
Approach one: use the assignment by reference and then choose only one record per group (and forget original columns)
dt_aggregator_1 <- function(data,
group_cols = c('id_1', 'id_2'),
new_names = c('sum_value_1', 'sum_value_2'),
value_cols = c('value_1', 'value_2')){
data_out = copy(data) # copy first, otherwise := also modifies the caller's table by reference
data_out[,(new_names) := lapply(.SD, function(x){sum(x)}),by = group_cols, .SDcols = value_cols]
data_out[,lapply(.SD, max), by = group_cols, .SDcols = new_names]
}
Approach 2: rename columns after grouping. I assume this is way better approach.
dt_aggregator_2 <- function(data,
group_cols = c('id_1', 'id_2'),
new_names = c('sum_value_1', 'sum_value_2'),
value_cols = c('value_1', 'value_2')){
data_out = data[,lapply(.SD, function(x){sum(x)}),by = group_cols, .SDcols = value_cols]
setnames(data_out, value_cols, new_names)
data_out[]
}
My question is whether, in approach number 2, I can somehow set the names while performing the grouping operation, so that I would reduce it to one line of code instead of 2 :)
You can try with the dplyr library:
library(dplyr)
dt1 <- dt %>% group_by(id_1,id_2) %>%
summarise(
sum_value_1 = sum(value_1),
sum_value_2 = sum(value_2)
)
dt1
You can include setNames in the same line and make this a one-liner.
dt_aggregator_2 <- function(data,
group_cols = c('id_1', 'id_2'),
new_names = c('sum_value_1', 'sum_value_2'),
value_cols = c('value_1', 'value_2')){
data[,setNames(lapply(.SD, sum), new_names),by = group_cols, .SDcols = value_cols]
}
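A small check of the setNames idea on made-up data might look like the following. It is only a sketch: aggregate_sum and the three-row table are hypothetical and exist just to make the result easy to verify by hand.

```r
library(data.table)

# Made-up table, small enough to check the sums by hand
dt <- data.table(id = c(1, 1, 2),
                 value_1 = c(1, 2, 3),
                 value_2 = c(10, 20, 30))

# Hypothetical wrapper: sum value_cols per group and name the results in one step
aggregate_sum <- function(data, group_cols, value_cols, new_names) {
  data[, setNames(lapply(.SD, sum), new_names),
       by = group_cols, .SDcols = value_cols]
}

res <- aggregate_sum(dt, "id",
                     c("value_1", "value_2"),
                     c("sum_value_1", "sum_value_2"))
```

Because no := is involved, the input table dt is left unchanged.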

Puzzling behavior of arrange_at with .funs

I don't understand what dplyr::arrange_at is doing when passed a .funs parameter.
For example suppose we create a data frame Z:
library(dplyr)
Z <- expand.grid(A = c(1:2, NA), B = c(1:2, NA))
and suppose we want to sort it by A (with NA first) and then by B. Then we might try target or current below:
all_equal(
target = Z %>% arrange(!is.na(A), A, B),
current = Z %>% arrange_at(.vars = c("A", "A", "B"),
.funs = list(function(x)!is.na(x), identity, identity)),
ignore_row_order = FALSE)
which returns "Same row values, but different order". The first version (target) is what I expected, but the second (current) is puzzling. What I expected is that each function in .funs would be applied to the corresponding column in .var and then it would be sorted much like arrange().
Ultimately I want to sort in a very dynamic way and hence want the full power of arrange_at.
Update
As @akrun says in a comment, the _at family of dplyr functions creates the Cartesian product of all .vars and all .funs. Therefore what I need is an arrange_parallel_at function that expects .vars and .funs to have the same length and where the kth function is evaluated on the column whose name is the kth entry in .vars (and only that column). Then all these columns in the same order become the argument to arrange.
Below is an answer to my own question. While it works (and in particular does what I asked in the Update) it's almost certainly sub-optimal since I suspect there are better solutions based on tidy eval.
Therefore I'm reluctant to accept it.
library(tidyverse)
arrange_parallel_at <- function(.data, .vars, .funs) {
stopifnot(length(.vars) == length(.funs), is.character(.vars), is.list(.funs))
tmp_cols <- paste0('.tmp', seq_along(.vars))
for (i in seq_along(.vars)) {
.data[[tmp_cols[i]]] <- .funs[[i]](.data[[.vars[i]]])
}
.data <- arrange_at(.data, tmp_cols)
.data[tmp_cols] <- NULL
.data
}
Below is some test code.
tibble(A = c(1:2, NA)) %>%
crossing(B = c(1:2, NA)) ->
Z
na_first <- function(x) !is.na(x)
all_equal(
Z %>% arrange(!is.na(A), A, !is.na(B), desc(B)),
Z %>% arrange_parallel_at( c( 'A', 'A', 'B', 'B'),
list(na_first, identity, na_first, desc)),
ignore_row_order = FALSE) # returns TRUE
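For comparison, the same "kth function on the kth column" idea can be sketched without temporary columns using base R's order(). This is not a tidy eval solution and not part of dplyr: arrange_parallel is a hypothetical name, and dplyr helpers such as desc are deliberately left out.

```r
# Hypothetical base R variant: apply funs[[k]] to column vars[k],
# then sort by the resulting keys in the order given.
arrange_parallel <- function(df, vars, funs) {
  stopifnot(length(vars) == length(funs), is.character(vars), is.list(funs))
  keys <- Map(function(v, f) f(df[[v]]), vars, funs)
  # unname() so the keys are passed to order() positionally
  df[do.call(order, unname(keys)), , drop = FALSE]
}

Z <- expand.grid(A = c(1:2, NA), B = c(1:2, NA))

# NA in A first, then by A, then by B
res <- arrange_parallel(Z, c("A", "A", "B"),
                        list(function(x) !is.na(x), identity, identity))
```

The logical key !is.na(A) sorts FALSE before TRUE, which is what puts the rows with a missing A first.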

Unknown result of select command

I have multiple .csv files (mydata_1, mydata_2, ...) with the same number of columns and the same column names (but different numbers of rows, if that helps in finding an answer). After reading them into my environment they have the class data.frame. I put them all in a list and now want to select specific columns by name from all of them, so that each variable keeps its name but contains only the chosen columns.
mydata_1 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
mydata_2 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
colnames(mydata_1) = c(paste0("X","1":"7"))
colnames(mydata_2) = c(paste0("X","1":"7"))
df1 = as.data.frame(mydata_1)
df2 = as.data.frame(mydata_2)
all_data = c(df1, df2)
class(all_data)
class(df1)
for (i in all_data){
i = select(i,"X3":"X5")
}
My for loop should output the data.frames df1 and df2 with just three columns (instead of the prior seven), but when running the code an error message regarding the select command appears.
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "c('integer', 'numeric')"
How can I get a working output of my new dfs?
The first issue here is that you are trying to create a list using c(df1, df2), while you have to use list(df1, df2).
Data
library(dplyr)
library(purrr)
mydata_1 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
mydata_2 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
colnames(mydata_1) = c(paste0("X","1":"7"))
colnames(mydata_2) = c(paste0("X","1":"7"))
df1 = as.data.frame(mydata_1)
df2 = as.data.frame(mydata_2)
all_data = list(df1 = df1, df2 = df2)
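The difference between c() and list() here can be checked directly. This is a small sketch with made-up two-column data frames; the names flattened and nested are only for illustration.

```r
# Two small made-up data frames
df1 <- data.frame(X1 = 1:2, X2 = 3:4)
df2 <- data.frame(X1 = 5:6, X2 = 7:8)

# c() concatenates the columns of both data frames into one flat list of vectors
flattened <- c(df1, df2)

# list() keeps each data frame intact as one element
nested <- list(df1 = df1, df2 = df2)

length(flattened)   # one element per column
class(nested[[1]])  # still a data.frame
```

Looping over flattened therefore hands select() a plain integer vector, which matches the error message in the question.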
The second problem is within your loop. Assigning to the loop variable i does not modify all_data, so in this approach you have to create an empty list before running the loop and then store the result in it in each iteration.
all_data2 <- list()
for(i in seq_along(all_data)) {
all_data2[[i]] <- all_data[[i]] %>% select(X3, X4, X5)
}
Alternatively, try map from purrr, which is part of the tidyverse and leads to cleaner code with the same result.
# Down here the `.x` is replaced by each element of the list all_data
# in each iteration, ending with a list of two data frames
all_data2 = map(all_data, ~.x %>%
select(X3, X4, X5))
Consider base R's subset with the select argument for contiguous column selection, wrapped in an lapply call. Unlike a for loop, lapply does not require the bookkeeping to reassign each element back into a list:
all_data <- list(df1 = df1, df2 = df2)
all_data_sub <- lapply(all_data, function(df) subset(df, select=X3:X5))

dplyr join by exclusion?

When using the various join functions from dplyr you can either join all variables with the same name (by default) or specify those ones using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join them by 999 of them, leaving one out. I don't want to do by = c("a1" = "b1", ...,"a999" = "b999"). Is there a way to join by excluding the one variable that is not used?
Ok, using this example from one answer:
set.seed(24)
df1 <- data_frame(alala= LETTERS[1:3], skks= letters[1:3], sskjs=
letters[1:3], val = rnorm(3))
df2 <- data_frame(alala= LETTERS[1:3], skks= letters[1:3], sskjs=
letters[1:3], val = rnorm(3))
I want to join them using all variables excluding val. I'm looking for a more general solution. Assume there are 1000 variables and I only remember the name of the one that I want to exclude from the join, without knowing the index of that variable. How can I perform the join while only knowing the variable names to exclude? I understand I can find the column index first, but is there a simple way to add exclusions in by =?
We create a named vector to do this
library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))
Note that the 'grps' vector is created with paste because the OP's post suggested a pattern. If there is no pattern, but we know the column that should not be used as a join key:
nogroupColumn <- "someColumn"
# names are the key columns of df1 (left side), values the matching columns of df2
grps <- setNames(setdiff(names(df2), nogroupColumn),
setdiff(names(df1), nogroupColumn))
inner_join(df1, df2, by = grps)
Using a reproducible example
set.seed(24)
df1 <- data_frame(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- data_frame(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))
inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
# a1 a2 val valn
# <chr> <chr> <dbl> <dbl>
#1 C c 0.420 -0.584
To exclude one or more fields, you need to identify the indices of the columns you want. Here's one way:
which(!names(df1) %in% "sskjs" ) #<this excludes the column "sskjs"
[1] 1 2 4 #<and shows only the desired index columns
Use unite (from tidyr) to create a join_id in each data frame, and join by it.
df1 <- df1 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
df2 <- df2 %>%
unite(join_id, which(!names(.) %in% "sskjs"), remove = F)
left_join(df1, df2, by = "join_id" )
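When the key columns have the same names in both tables, as in the question's example, the exclusion can be sketched by name alone with intersect and setdiff. join_excluding is a hypothetical helper, and the val values below are made up so the result is easy to check.

```r
library(dplyr)

# Hypothetical helper: join on every column the two tables share,
# except the ones named in `exclude`.
join_excluding <- function(x, y, exclude) {
  keys <- setdiff(intersect(names(x), names(y)), exclude)
  inner_join(x, y, by = keys)
}

df1 <- tibble(alala = LETTERS[1:3], skks = letters[1:3], val = c(1, 2, 3))
df2 <- tibble(alala = LETTERS[1:3], skks = letters[1:3], val = c(4, 5, 6))

res <- join_excluding(df1, df2, "val")
```

The excluded column appears twice in the result, with dplyr's default suffixes (val.x from df1 and val.y from df2).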
