I want to automatically add a new dataset identifier variable when using full_join() in R.
df1 <- tribble(~ID, ~x,
"A", 1,
"B", 2,
"C", 3)
df2 <- tribble(~ID, ~y,
"D", 4,
"E", 5,
"F", 6)
combined <- df1 %>% dplyr::full_join(df2)
I know from ?full_join that it joins all rows from df1 followed by df2. But, I couldn't find an option to create an index variable automatically.
Currently, I'm adding an extra variable in df1 first
df1 <- tribble(~ID, ~x, ~dataset,
"A", 1, 1,
"B", 2, 1,
"C", 3, 1)
and following it up with df1 %>% dplyr::full_join(df2) %>% dplyr::mutate(dataset = replace_na(dataset, 2))
Any suggestions to do it in a better way?
I'm not sure whether this is more efficient than your approach, but if the two data frames never share any columns other than ID, you could try:
df1 %>%
full_join(df2) %>%
mutate(dataset = as.numeric(is.na(x))+1)
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
But to be safe, it might be better to define the dataset index in each data frame beforehand:
df1 %>%
mutate(dataset = 1) %>%
full_join(df2 %>% mutate(dataset = 2))
ID x y dataset
<chr> <dbl> <dbl> <dbl>
1 A 1 NA 1
2 B 2 NA 1
3 C 3 NA 1
4 D NA 4 2
5 E NA 5 2
6 F NA 6 2
New data
df1 <- tribble(~ID, ~x,~y,
"A", 1,1,
"B", 2,1,
"C", 3,1)
df2 <- tribble(~ID, ~x,~y,
"D", 4,1,
"E", 5,1,
"F", 6,1)
full_join(df1, df2)
ID x y
<chr> <dbl> <dbl>
1 A 1 1
2 B 2 1
3 C 3 1
4 D 4 1
5 E 5 1
6 F 6 1
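With the new data both frames share every column, so nothing is NA after the join and the is.na(x) trick has nothing to detect. A minimal sketch, assuming the same df1 and df2 as in the new data above, comparing the two approaches:

```r
library(dplyr)
library(tibble)

df1 <- tribble(~ID, ~x, ~y,
               "A", 1, 1,
               "B", 2, 1,
               "C", 3, 1)
df2 <- tribble(~ID, ~x, ~y,
               "D", 4, 1,
               "E", 5, 1,
               "F", 6, 1)

# NA-based detection: x is never NA here, so every row is labelled 1
by_na <- full_join(df1, df2, by = c("ID", "x", "y")) %>%
  mutate(dataset = as.numeric(is.na(x)) + 1)

# Tagging beforehand: the label travels with each row through the join
tagged <- df1 %>%
  mutate(dataset = 1) %>%
  full_join(df2 %>% mutate(dataset = 2),
            by = c("ID", "x", "y", "dataset"))
```

Here by_na$dataset is 1 for all six rows, while tagged$dataset still separates the rows of df1 from those of df2.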
Instead of a "join", maybe try bind_rows from dplyr:
library(dplyr)
bind_rows(df1, df2, .id = "dataset")
This will bind rows, and the missing columns are filled in with NA. In addition, you can specify an ".id" argument with an identifier. If you provide a list of dataframes, the labels are taken from names in the list. If not, a numeric sequence is used (as seen below).
Output
dataset ID x y
<chr> <chr> <dbl> <dbl>
1 1 A 1 NA
2 1 B 2 NA
3 1 C 3 NA
4 2 D NA 4
5 2 E NA 5
6 2 F NA 6
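To get descriptive labels instead of the numeric sequence, the data frames can be passed as a named list. A small sketch using the df1 and df2 from the question; the names "first" and "second" are placeholders chosen for illustration:

```r
library(dplyr)
library(tibble)

df1 <- tribble(~ID, ~x,
               "A", 1,
               "B", 2,
               "C", 3)
df2 <- tribble(~ID, ~y,
               "D", 4,
               "E", 5,
               "F", 6)

# The list names ("first", "second") become the values of the .id column
combined <- bind_rows(list(first = df1, second = df2), .id = "dataset")
```

The dataset column then holds "first" for the rows of df1 and "second" for the rows of df2.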
I want to fill in the columns of df that contain NA using the imputed values in df2, to get df3.
I can do it with left_join and coalesce, but I think this method doesn't generalize well. Is there a better way?
library(tidyverse)
df <- tibble(c = c("a", "a", "a", "b", "b", "b"),
d = c(1, 2, 3, 1, 2, 3),
x = c(1, NA, 3, 4, 5,6),
y = c(1, 2, NA, 4, 5, 6),
z = c(1, 2, 7, 4, 5, 6))
# I want to replace NA in df by df2
df2 <- tibble(c = c("a", "a", "a"),
d = c(1, 2, 3),
x = c(1, 2, 3),
y = c(1, 2, 2))
# to get
df3 <- tibble(c = c("a", "a", "a", "b", "b", "b"),
d = c(1, 2, 3, 1, 2, 3),
x = c(1, 2, 3, 4, 5, 6),
y = c(1, 2, 2, 4, 5, 6),
z = c(1, 2, 7, 4, 5, 6))
# is there a better solution than coalesce?
df3 <- df %>% left_join(df2, by = c("c", "d")) %>%
mutate(x = coalesce(x.x, x.y),
y = coalesce(y.x, y.y)) %>%
select(-x.x, -x.y, -y.x, -y.y)
Created on 2021-06-17 by the reprex package (v2.0.0)
Here's a custom function that coalesces all .x and .y columns, optionally renaming and removing columns.
#' Coalesce all columns duplicated in a previous join.
#'
#' Find all columns resulting from duplicate names after a join
#' operation (e.g., `dplyr::*_join` or `base::merge`), then coalesce
#' them pairwise.
#'
#' @param x data.frame
#' @param suffix character, length 2, the same string suffixes
#' appended to column names of duplicate columns; should be the same
#' as provided to `dplyr::*_join(., suffix=)` or `base::merge(.,
#' suffixes=)`
#' @param clean logical, whether to remove the suffixes from the LHS
#' columns and remove the RHS columns
#' @param strict logical, whether to enforce same-classes in the LHS
#' (".x") and RHS (".y") columns; while it is safer to set this to
#' true (default), sometimes the conversion of classes might be
#' acceptable, for instance, if one '.x' column is 'numeric' and its
#' corresponding '.y' column is 'integer', then relaxing the class
#' requirement might be acceptable
#' @return 'x', coalesced, optionally cleaned
#' @export
coalesce_all <- function(x, suffix = c(".x", ".y"),
clean = FALSE, strict = TRUE) {
nms <- colnames(x)
Xs <- endsWith(nms, suffix[1])
Ys <- endsWith(nms, suffix[2])
# x[Xs] <- Map(dplyr::coalesce, x[Xs], x[Ys])
# x[Xs] <- Map(data.table::fcoalesce, x[Xs], x[Ys])
x[Xs] <- Map(function(dotx, doty) {
if (strict) stopifnot(identical(class(dotx), class(doty)))
isna <- is.na(dotx)
replace(dotx, isna, doty[isna])
} , x[Xs], x[Ys])
if (clean) {
names(x)[Xs] <- gsub(glob2rx(paste0("*", suffix[1]), trim.head = TRUE), "", nms[Xs])
x[Ys] <- NULL
}
x
}
In action:
df %>%
left_join(df2, by = c("c", "d")) %>%
coalesce_all()
# # A tibble: 6 x 7
# c d x.x y.x z x.y y.y
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1 1 1
# 2 a 2 2 2 2 2 2
# 3 a 3 3 2 7 3 2
# 4 b 1 4 4 4 NA NA
# 5 b 2 5 5 5 NA NA
# 6 b 3 6 6 6 NA NA
df %>%
left_join(df2, by = c("c", "d")) %>%
coalesce_all(clean = TRUE)
# # A tibble: 6 x 5
# c d x y z
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1
# 2 a 2 2 2 2
# 3 a 3 3 2 7
# 4 b 1 4 4 4
# 5 b 2 5 5 5
# 6 b 3 6 6 6
I included two coalesce functions as alternatives to the base-R within the Map. One advantage is the strict argument: dplyr::coalesce will silently allow integer and numeric to be coalesced, while data.table::fcoalesce does not. If that is desirable, use what you prefer. (Another advantage is that both of the non-base coalesce functions accept an arbitrary number of columns to coalesce, which is not required in this implementation.)
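The base-R step inside the Map can be looked at in isolation. A minimal sketch on plain vectors; coalesce_pair is a hypothetical helper name used only here, and the input vectors are made up for illustration:

```r
# Fill the NAs of dotx with the values of doty at the same positions
coalesce_pair <- function(dotx, doty, strict = TRUE) {
  if (strict) stopifnot(identical(class(dotx), class(doty)))
  isna <- is.na(dotx)
  replace(dotx, isna, doty[isna])
}

# Same class on both sides: the NA is filled from the second vector
filled <- coalesce_pair(c(1, NA, 3), c(1, 2, 3))

# integer vs numeric: strict = TRUE stops with an error ...
strict_error <- tryCatch(coalesce_pair(c(1L, NA), c(1, 2.5)),
                         error = function(e) TRUE)

# ... while strict = FALSE lets replace() coerce the result to numeric
relaxed <- coalesce_pair(c(1L, NA), c(1, 2.5), strict = FALSE)
```

With strict = FALSE the integer column quietly becomes numeric, which is the kind of class conversion the strict argument is meant to guard against.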
You can mutate all the columns at once with across(), using the .names and .keep arguments, like this:
library(dplyr, warn.conflicts = FALSE)
df %>% left_join(df2, by = c("c", "d")) %>%
  # coalesce each *.x column with its *.y partner, then strip the suffix;
  # "[.]x$" matches only a literal ".x" at the end of the column name
  mutate(across(ends_with('.x'), ~ coalesce(., get(sub('[.]x$', '.y', cur_column()))),
                .names = '{sub("[.]x$", "", .col)}'), .keep = 'unused')
#> # A tibble: 6 x 5
#> c d z x y
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 1 1 1
#> 2 a 2 2 2 2
#> 3 a 3 7 3 2
#> 4 b 1 4 4 4
#> 5 b 2 5 5 5
#> 6 b 3 6 6 6
Created on 2021-06-17 by the reprex package (v2.0.0)
I tried another method: filter on c, drop every column of df that contains NA, join the result with df2, and then bind it back to the rows of df that were filtered out.
df3 <- df %>% filter(c == "a") %>% select_if(~ !any(is.na(.))) %>%
left_join(df2, by = c("c", "d"))
df3 <- bind_rows(df %>% filter(!c == "a"), df3) %>% arrange(c,d)
df3
#> # A tibble: 6 x 5
#> c d x y z
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 1 1 1
#> 2 a 2 2 2 2
#> 3 a 3 3 2 7
#> 4 b 1 4 4 4
#> 5 b 2 5 5 5
#> 6 b 3 6 6 6
Created on 2021-06-17 by the reprex package (v2.0.0)
We can use {powerjoin}
library(powerjoin)
power_left_join(df, df2, by = c("c", "d"), conflict = coalesce_xy)
#> # A tibble: 6 × 5
#> c d z x y
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 1 1 1
#> 2 a 2 2 2 2
#> 3 a 3 7 3 2
#> 4 b 1 4 4 4
#> 5 b 2 5 5 5
#> 6 b 3 6 6 6
I am new to R and have a simple 'how to' question, specifically, what is the best way to calculate Group and overall percentages on data frame columns? My data looks like this:
# A tibble: 13 x 3
group resp id
<chr> <dbl> <chr>
1 A 1 ssa
2 A 1 das
3 A NA fdsf
4 B NA gfd
5 B 1 dfg
6 B 1 dg
7 C 1 gdf
8 C NA gdf
9 C NA hfg
10 D 1 hfg
11 D 1 trw
12 D 1 jyt
13 D NA ghj
The test data is:
test <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "D"), resp = c(1, 1, NA, NA, 1, 1, 1, NA,
NA, 1, 1, 1, NA), id = c("ssa", "das", "fdsf", "gfd", "dfg",
"dg", "gdf", "gdf", "hfg", "hfg", "trw", "jyt", "ghj")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L))
I managed to do the group percentages by doing the following (which seems overcomplicated):
a <- test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE))
b <- test %>%
group_by(group) %>%
summarise(all = n_distinct(id, na.rm = TRUE))
result <- a %>%
left_join(b) %>%
mutate(a,resp_rate = round(no_resp/all*100))
this gives me:
# A tibble: 4 x 4
group no_resp all resp_rate
<chr> <dbl> <int> <dbl>
1 A 2 3 67
2 B 2 3 67
3 C 1 2 50
4 D 3 4 75
which is fine, but I wondered how I could make this simpler? Also, how would I do an overall percentage? E.g. an overall distinct count of resp/distinct count of id, without grouping.
Many thanks
You can add multiple statements in summarise so you don't have to create temporary objects a and b. To calculate overall percentage you can divide the number by the sum of the column.
library(dplyr)
test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE),
all = n_distinct(id),
resp_rate = round(no_resp/all*100)) %>%
mutate(no_resp_perc = no_resp/sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <int> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5
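The question also asks for an overall rate without grouping. One reading of that, sketched here as an assumption, is the same summarise() call with group_by() left out; the data is rebuilt with tibble() so the snippet runs on its own:

```r
library(dplyr)
library(tibble)

test <- tibble(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  resp  = c(1, 1, NA, NA, 1, 1, 1, NA, NA, 1, 1, 1, NA),
  id    = c("ssa", "das", "fdsf", "gfd", "dfg", "dg", "gdf", "gdf",
            "hfg", "hfg", "trw", "jyt", "ghj")
)

# No group_by(): responses and distinct ids are counted over the whole table
overall <- test %>%
  summarise(no_resp   = sum(resp, na.rm = TRUE),
            all       = n_distinct(id),
            resp_rate = round(no_resp / all * 100))
```

Note that the overall distinct count of id (11) is lower than the sum of the per-group counts (12), because the id "hfg" appears in both group C and group D.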
Using base R we may apply tapply and table functions.
res <- transform(with(test, data.frame(no_resp=tapply(resp, group, sum, na.rm=TRUE),
all=colSums(table(id, group) > 0))),
resp_rate=round(no_resp/all*100),
overall_perc=prop.table(no_resp)*100
)
res
# no_resp all resp_rate overall_perc
# A 2 3 67 25.0
# B 2 3 67 25.0
# C 1 2 50 12.5
# D 3 4 75 37.5
I have data coming from several people and some thought they should use a column to store names and others thought they should use the same column for values.
In the following dataframe I would like to separate remark into a numeric column and a character column:
df2 <- data.frame(group = c("a", "b", "c", "d", "e", "f"), var1 = c(3, 1, 2, 4, 3, 2),
remark = c( 2, "abc", 8, 2, "cfd", "afd"))
I'm not even sure where to start, but I'd like to be able to %>% it in.
Here is a solution using dplyr 1.0.0's across():
library(dplyr)
df2 <- data.frame(group = c("a", "b", "c", "d", "e", "f"),
var1 = c(3, 1, 2, 4, 3, 2),
remark = c( 2, "abc", 8, 2, "cfd", "afd"))
df2 %>%
mutate(across(remark, list(num = ~ as.numeric(.x),
chr = ~ ifelse(is.na(as.numeric(.x)), .x, NA))))
#> Warning in fn(col, ...): NAs introduced by coercion
#> Warning in ifelse(is.na(as.numeric(.x)), .x, NA): NAs introduced by coercion
#> group var1 remark remark_num remark_chr
#> 1 a 3 2 2 <NA>
#> 2 b 1 abc NA abc
#> 3 c 2 8 8 <NA>
#> 4 d 4 2 2 <NA>
#> 5 e 3 cfd NA cfd
#> 6 f 2 afd NA afd
Created on 2020-06-12 by the reprex package (v0.3.0)
We could do the following
df2$remark.num <- as.numeric(as.character(df2$remark))
df2$remark.char <- sub("\\d*","",df2$remark)
df2 <- df2[,-3]
output
> df2
  group var1 remark.num remark.char
1     a    3          2
2     b    1         NA         abc
3     c    2          8
4     d    4          2
5     e    3         NA         cfd
6     f    2         NA         afd
If we use as.numeric() to coerce the remark column to numbers, R will have trouble with the ones that can't be converted. We could exploit that like this:
df2$remark.num <- df2$remark %>% paste %>% as.numeric
df2$remark.alpha <- NA
df2$remark.alpha[is.na(df2$remark.num)] <- paste(df2$remark)[is.na(df2$remark.num)]
Output:
> df2
group var1 remark remark.num remark.alpha
1 a 3 2 2 <NA>
2 b 1 abc NA abc
3 c 2 8 8 <NA>
4 d 4 2 2 <NA>
5 e 3 cfd NA cfd
6 f 2 afd NA afd
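The coercion behaviour these answers rely on can be seen on a short vector. A minimal sketch with made-up values; suppressWarnings() is used here only to silence the expected "NAs introduced by coercion" warning:

```r
remark <- c("2", "abc", "8")

# Strings that are not numbers become NA (normally with a warning)
num <- suppressWarnings(as.numeric(remark))

# Keep the original text only where the numeric conversion failed
chr <- ifelse(is.na(num), remark, NA_character_)
```

One caveat of this approach: a value that is already missing in remark ends up NA in both new columns.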
I have the following data.
> dat
# A tibble: 12 x 2
id name
<chr> <chr>
1 1 a
2 1 b
3 1 a
4 2 a
5 2 b
6 2 c
7 2 b
8 3 a
9 3 b
10 3 c
11 3 d
12 3 d
I would like to filter only by the following list
set <- NULL
set$names <- c("a","b","c")
The ids selected are those that contain exactly the names in the list.
So the result would be only the 2s selected as follows:
> dat
# A tibble: 4 x 2
id name
<chr> <chr>
4 2 a
5 2 b
6 2 c
7 2 b
Here is the data for easy replication:
dat <- tribble(
~id, ~name,
1, "a",
1, "b",
1, "a",
2, "a",
2, "b",
2, "c",
2, "b",
3, "a",
3, "b",
3, "c",
3, "d",
3, "d"
)
I would like to have the following result.
How about:
group_by(dat, id) %>% filter(setequal(name, set$names))
This filters out all groups where the name column and set$names do not contain the same elements, but allows duplicates.
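To see why duplicates do not matter, here is setequal() applied to the name vectors of the three ids from the question, written out by hand as plain vectors:

```r
wanted <- c("a", "b", "c")

names_id1 <- c("a", "b", "a")            # "c" is missing
names_id2 <- c("a", "b", "c", "b")       # same elements, "b" repeated
names_id3 <- c("a", "b", "c", "d", "d")  # extra element "d"

# setequal() compares the sets of distinct elements, ignoring repeats
matches <- c(id1 = setequal(names_id1, wanted),
             id2 = setequal(names_id2, wanted),
             id3 = setequal(names_id3, wanted))
```

Only id 2 matches, which is why the grouped filter() keeps all four of its rows, including the repeated "b".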
I am not sure this is what you want, but:
dat %>%
group_by(id) %>%
filter(all(set$names %in% name) & all(name %in% set$names))
# A tibble: 4 x 2
id name
<dbl> <chr>
1 2 a
2 2 b
3 2 c
4 2 b
I am trying to remove rows that have offsetting values.
library(dplyr)
a <- c(1, 1, 1, 1, 2, 2, 2, 2,2,2)
b <- c("a", "b", "b", "b", "c", "c","c", "d", "d", "d")
d <- c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30)
o <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
df <- tibble(ID = a, SEQ = b, VALUE = d, OTHER = o)
Generates this ordered table that is grouped by ID and SEQ.
> df
# A tibble: 10 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 B
3 1 b -10 C
4 1 b 10 D
5 2 c 20 E
6 2 c -20 F
7 2 c 20 G
8 2 d 30 H
9 2 d -30 I
10 2 d 30 J
I want to drop the row pairs (2,3), (5,6), (8,9) because VALUE negates the VALUE in the matching previous row.
I want the resulting table to be
> df2
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 D
3 2 c 20 G
4 2 d 30 J
I know that I can't use group_by %>% summarize, because I need to keep the value that is in OTHER. I've looked at the dplyr::lag() function but I don't see how that can help. I believe that I could loop through the table with some type of for each loop and generate a logical vector that can be used to drop the rows, but I was hoping for a more elegant solution.
What about:
vec <- cbind(
c(head(df$VALUE,-1) + df$VALUE[-1], 9999) ,
df$VALUE + c(9999, head(df$VALUE,-1))
)
vec <- apply(vec,1,prod)
vec <- vec!=0
df[vec,]
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
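The idea is to add your VALUE field to a copy of itself shifted by one row, once towards the next row and once towards the previous row. When either sum is 0, you remove the line. Note that the output above shows 50, 60 and 70 for rows D, G and J, so it evidently comes from a variant of the question's data; with the original values, rows D, G and J would be dropped as well, because each of them also sums to zero with the row before it.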
Here's another solution with dplyr. Not sure about the edge case you mentioned in the comments, but feel free to test it with my solution:
library(dplyr)
df %>%
group_by(ID, SEQ) %>%
mutate(diff = VALUE + lag(VALUE),
diff2 = VALUE + lead(VALUE)) %>%
mutate(across(c(diff, diff2), ~ coalesce(.x, 1))) %>%
filter((diff != 0 & diff2 != 0)) %>%
select(-diff, -diff2)
Result:
# A tibble: 4 x 4
# Groups: ID, SEQ [4]
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
Note:
This solution first creates two helper columns within each ID/SEQ group: diff adds the previous VALUE (lag) to each VALUE, and diff2 adds the next VALUE (lead). Only rows that belong to an offsetting pair have a zero in either diff or diff2, so filtering those rows out gives the desired output.
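The lag/lead arithmetic can be traced on a single group. A minimal sketch, assuming one hypothetical group with the values 10, -10, 50 (the names sum_prev, sum_next and keep are illustrative):

```r
library(dplyr)

VALUE <- c(10, -10, 50)

# Sum with the previous and the next value; the group edges give NA,
# which coalesce() turns into a harmless non-zero placeholder (1)
sum_prev <- coalesce(VALUE + dplyr::lag(VALUE), 1)
sum_next <- coalesce(VALUE + dplyr::lead(VALUE), 1)

# A row survives only if it cancels with neither neighbour
keep <- sum_prev != 0 & sum_next != 0
VALUE[keep]
```

The first two values cancel each other and are dropped, leaving only 50. A value that merely follows a cancelling pair is kept only if it does not itself sum to zero with the row before it.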