Replacing multiple columns from a different dataframe using dplyr

I have two dataframes, one of which contains a subset of IDs and columns of the other (but has different values).
ds1 <- data.frame(id = c(1:4),
d1 = "A",
d2 = "B",
d3 = "C")
ds2 <- data.frame(id = c(1,2),
d1 = "W",
d2 = "X")
I am hoping to use dplyr on ds1 to find the shared columns and replace their values with those found in ds2, matching on id. I can mutate them one at a time like this:
ds1 %>%
mutate(d1 = ifelse(id %in% ds2$id, ds2$d1[ds2$id==id],d1),
d2 = ifelse(id %in% ds2$id, ds2$d2[ds2$id==id],d2))
In my real situation, however, I need to do this 47 times. With the flexibility of across(), I feel there must be a better way. I am open to non-dplyr solutions as well.

You can do this with dplyr and stringr (it can also be done without stringr):
library(tidyverse)
ds1 %>% left_join(ds2, by = 'id') %>%
# anchor the suffix patterns ('\\.y$') so that '.' is not treated as a regex wildcard
mutate(across(ends_with('.y'), ~ coalesce(., get(str_replace(cur_column(), '\\.y$', '.x'))))) %>%
select(!ends_with('.x')) %>%
rename_with(~str_remove(., '\\.y$'), ends_with('.y'))
#> id d3 d1 d2
#> 1 1 C W X
#> 2 2 C W X
#> 3 3 C A B
#> 4 4 C A B
Created on 2021-05-10 by the reprex package (v2.0.0)

Using rows_update:
library(tidyverse)
ds1 <- data.frame(id = c(1:4),
d1 = "A",
d2 = "B",
d3 = "C")
ds2 <- data.frame(id = c(1,2),
d1 = "W",
d2 = "X")
rows_update(x = ds1, y = ds2, by = "id")
#> id d1 d2 d3
#> 1 1 W X C
#> 2 2 W X C
#> 3 3 A B C
#> 4 4 A B C
Created on 2021-05-11 by the reprex package (v2.0.0)
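One caveat: by default rows_update() raises an error when y contains keys that are missing from x. Below is a minimal sketch of one way around that. It assumes a hypothetical extra id (9) in ds2 that does not occur in ds1 (this id is not in the original question) and drops the unmatched rows with semi_join() before updating:

```r
library(dplyr)

ds1 <- data.frame(id = 1:4, d1 = "A", d2 = "B", d3 = "C",
                  stringsAsFactors = FALSE)
# Hypothetical variant of ds2: id 9 has no counterpart in ds1
ds2 <- data.frame(id = c(1, 2, 9), d1 = "W", d2 = "X",
                  stringsAsFactors = FALSE)

# Keep only the rows of ds2 whose id exists in ds1, then update
ds2_matched <- semi_join(ds2, ds1, by = "id")
result <- rows_update(ds1, ds2_matched, by = "id")
```

Filtering first keeps the call independent of any version-specific options for handling unmatched keys.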

This is somewhat similar to the answer posted by my dear friend @AnilGoyal. It is a little more verbose than your approach, but you can use it for larger data sets:
library(dplyr)
library(stringr)
ds1 %>%
left_join(ds2, by = "id") %>%
mutate(across(ends_with(".x"), ~ ifelse(!is.na(get(str_replace(cur_column(), "\\.x$", ".y"))),
get(str_replace(cur_column(), "\\.x$", ".y")),
.x))) %>%
select(!ends_with(".y")) %>%
rename_with(~ str_remove(., "\\.x$"), ends_with(".x"))
id d1 d2 d3
1 1 W X C
2 2 W X C
3 3 A B C
4 4 A B C
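Since the question is also open to non-dplyr solutions, here is a minimal base R sketch of the same idea: find the shared columns with intersect(), locate the matching rows with match(), and assign in one step. The names shared, pos, found and result are only illustrative.

```r
ds1 <- data.frame(id = 1:4, d1 = "A", d2 = "B", d3 = "C",
                  stringsAsFactors = FALSE)
ds2 <- data.frame(id = c(1, 2), d1 = "W", d2 = "X",
                  stringsAsFactors = FALSE)

# Columns present in both data frames, excluding the key
shared <- setdiff(intersect(names(ds1), names(ds2)), "id")

# Row positions in ds1 for each id in ds2 (NA where ds1 has no such id)
pos <- match(ds2$id, ds1$id)
found <- !is.na(pos)

# Overwrite the shared columns for the matched rows only
result <- ds1
result[pos[found], shared] <- ds2[found, shared]
```

This scales to any number of shared columns without naming them, and it keeps the column order of ds1.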

Related

Pivot to wide and keep all columns

I have a dataset like this:
df <- data.frame(A=c(1,1,1,2,2,2), B=c(3,3,3,6,6, 6), C=c(2,3,9,12,2, 6), D=c("a1", "a2", "a3", "a1", "a2", "a3"))
and I want a dataset like this:
df2 <- data.frame(a1=c(2,12), a2=c(3, 2), a3=c(9, 6), B=c(3,6))
I tried this code, but it doesn't work:
df_new <- df %>%
mutate(B = if_else(B == 1, "A", "B")) %>%
group_by(B) %>%
mutate(var = paste0("V",row_number())) %>%
pivot_wider(id_cols = B, names_from = var, values_from = A) %>%
rename(row_name = B)
How can I solve this?
You can use pivot_wider. To keep the column "B", use unused_fn with a summarizing function (here, mean, but it could also be first, min, max...).
library(tidyr)
df %>%
pivot_wider(id_cols = A, names_from = D, values_from = C, unused_fn = mean)
A a1 a2 a3 B
1 1 2 3 9 3
2 2 12 2 6 6
data.table provides a nice function, dcast (adapted from reshape2), to make this happen:
library(data.table)
dcast(df, A + B ~ D, value.var = "C")
A B a1 a2 a3
1 1 3 2 3 9
2 2 6 12 2 6
Read this vignette if interested
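If B is constant within each value of A, as it is in the example data, another option is to treat both A and B as identifier columns, so no summarizing function is needed. A minimal sketch under that assumption (the name wide is only illustrative):

```r
library(tidyr)

df <- data.frame(A = c(1, 1, 1, 2, 2, 2), B = c(3, 3, 3, 6, 6, 6),
                 C = c(2, 3, 9, 12, 2, 6),
                 D = c("a1", "a2", "a3", "a1", "a2", "a3"))

# A and B together identify each output row, so B is carried over unchanged
wide <- pivot_wider(df, id_cols = c(A, B), names_from = D, values_from = C)
```

If B varied within a value of A, this would produce one row per A/B combination instead of one row per A, so the unused_fn approach is the better fit in that case.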

Collapsing Columns in R using tidyverse with mutate, replace, and unite. Writing a function to reuse?

Data:
ID B C
1 NA x
2 x NA
3 x x
Results:
ID Unified
1 C
2 B
3 B_C
I'm trying to combine columns B and C using mutate and unite, but how would I scale this up so that I can reuse it for many columns (think 100+) instead of having to write out the variables each time? Or is there a built-in function that already does this?
My current solution is this:
library(tidyverse)
Data %>%
mutate(B = replace(B, B == 'x', 'B'), C = replace(C, C == 'x', 'C')) %>%
unite("Unified", B:C, na.rm = TRUE, remove= TRUE)
We may use across to loop over the columns and replace each value equal to 'x' with the column name (cur_column()):
library(dplyr)
library(tidyr)
Data %>%
mutate(across(B:C, ~ replace(., .== 'x', cur_column()))) %>%
unite(Unified, B:C, na.rm = TRUE, remove = TRUE)
-output
ID Unified
1 1 C
2 2 B
3 3 B_C
data
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA,
"x")), class = "data.frame", row.names = c(NA, -3L))
Here are a couple of options.
Using dplyr -
library(dplyr)
cols <- names(Data)[-1]
Data %>%
rowwise() %>%
mutate(Unified = paste0(cols[!is.na(c_across(B:C))], collapse = '_')) %>%
ungroup -> Data
Data
# ID B C Unified
# <int> <chr> <chr> <chr>
#1 1 NA x C
#2 2 x NA B
#3 3 x x B_C
Base R
Data$Unified <- apply(Data[cols], 1, function(x)
paste0(cols[!is.na(x)], collapse = '_'))
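To address the "function to reuse" part of the question, the base R approach can be wrapped in a small helper. This is only a sketch: unify_cols and its arguments (cols, marker, sep) are hypothetical names, and it assumes that a cell counts as set when it equals the marker "x".

```r
Data <- data.frame(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA, "x"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: for each row, join the names of the columns
# whose value equals `marker`
unify_cols <- function(data, cols, marker = "x", sep = "_") {
  # TRUE where the cell is non-missing and equal to the marker
  hits <- !is.na(data[cols]) & data[cols] == marker
  unname(apply(hits, 1, function(row) paste(cols[row], collapse = sep)))
}

Data$Unified <- unify_cols(Data, cols = c("B", "C"))
```

For 100+ columns, cols can be built programmatically, for example with setdiff(names(Data), "ID").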

Separate rows by matching two columns in similar pattern

I have data like this:
df1 <- data.frame(A = c("P,Q","X,Y"), B = c("P1,Q1",""), C = c("P2,Q2","X2,Y2"))
I am looking for output like this:
output <- data.frame(A = c("P","Q","X","Y"), B = c("P1","Q1","",""), C = c("P2","Q2","X2","Y2"))
I tried using separate_rows as shown below, but it does not line up the strings separated by commas across columns.
separate_rows(df1, A, sep=",") %>%
separate_rows(B) %>%
separate_rows(C)
I like the splitstackshape package for such operations:
library(splitstackshape)
cSplit(df1, splitCols = names(df1), sep = ',', direction = 'long')
# A B C
#1: P P1 P2
#2: Q Q1 Q2
You simply have to do:
library(tidyr)
separate_rows(df1, A, B, C, convert = TRUE)
Output :
A B C
1 P P1 P2
2 Q Q1 Q2
Edit: if you have NA values and empty strings:
data:
df1 <- data.frame(A = c("P,Q","X,Y"), B = c("P1,Q1",""), C =
c("P2,Q2","X2,Y2"))
Code:
df1 <- data.frame(lapply(df1, as.character), stringsAsFactors=FALSE)
df1[df1 == ""] <- "0,0"
df1 <- separate_rows(df1, A, B, C, convert = TRUE)
df1[df1 == "0"] <- ""
Output :
A B C
1 P P1 P2
2 Q Q1 Q2
3 X X2
4 Y Y2
An option using base R with strsplit
data.frame(lapply(df1, function(x) strsplit(as.character(x), ",")[[1]]))
# A B C
#1 P P1 P2
#2 Q Q1 Q2
Or with scan
data.frame(lapply(df1, function(x)
scan(text = as.character(x), what = "", sep=",", quiet = TRUE)))
As suggested by Gainz's answer, separate_rows(df1, A, B, C, convert = T) works really well.
However, if you do have blank cells in the data frame, it becomes harder to use, because it raises an error about the columns not all having the same number of rows.
I suggest relying on a column that you know has no blank values. Let's assume it is column A.
I would first convert the data frame to a tibble and all factor columns to character columns. Then I would replace the blank cells with a string containing the correct number of commas. After that, separate_rows() should work correctly.
Then the code will look as follows:
library(tidyverse)
df1_tibble <- df1 %>%
as_tibble() %>%
mutate_if(is.factor, as.character)
df1_clean <- df1_tibble %>%
mutate(count = str_count(A, ",") + 1) %>%
mutate(temp_str = map_chr(count, ~ rep("", .x) %>% paste0(collapse = ","))) %>%
# a formula lambda replaces the deprecated funs() helper
mutate_at(vars(B, C), ~ ifelse(str_length(.) == 0, temp_str, .)) %>%
select(A, B, C)
df1_clean
#> # A tibble: 2 x 3
#> A B C
#> <chr> <chr> <chr>
#> 1 P,Q P1,Q1 P2,Q2
#> 2 X,Y , X2,Y2
df1_clean %>% separate_rows(A, B, C)
#> # A tibble: 4 x 3
#> A B C
#> <chr> <chr> <chr>
#> 1 P P1 P2
#> 2 Q Q1 Q2
#> 3 X "" X2
#> 4 Y "" Y2
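The padding idea described above can also be sketched in base R without picking a reference column: split every cell, work out how many rows each input row needs, and pad the short cells with empty strings. The helper name split_rows is hypothetical, and the sketch assumes a literal separator rather than a regular expression.

```r
df1 <- data.frame(A = c("P,Q", "X,Y"), B = c("P1,Q1", ""),
                  C = c("P2,Q2", "X2,Y2"), stringsAsFactors = FALSE)

# Hypothetical helper: split all columns on `sep`, padding blanks with ""
split_rows <- function(df, sep = ",") {
  # Split every cell into its separated pieces
  pieces <- lapply(df, function(col) strsplit(as.character(col), sep, fixed = TRUE))
  # Number of output rows needed for each input row (at least one)
  n <- pmax(do.call(pmax, unname(lapply(pieces, lengths))), 1L)
  # Pad short cells with "" so that all columns line up
  padded <- lapply(pieces, function(p) {
    unlist(Map(function(x, k) c(x, rep("", k - length(x))), p, n))
  })
  data.frame(padded, stringsAsFactors = FALSE)
}

res <- split_rows(df1)
```

Cells that have fewer pieces than the longest cell in their row are padded at the end, which matches the desired output in the question when a cell is entirely blank.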

Count occurrence of a categorical variable, when grouping and summarising by a different variable in R

I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of col a, grouped by variable c. And I would like to have a column with the counts of the occurrence of b types within each group c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the column counts_b in the example df2.
Giulia
Here's a way using a little table magic:
df %>%
group_by(c) %>%
summarise(a_mean = mean(a),
b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
Here is another solution using reshape2. The output format may be more convenient to work with: each value of b gets its own column holding the number of occurrences.
library(reshape2)
library(dplyr)
out1 <- dcast(df, c ~ b, value.var="c", fun.aggregate=length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by = "c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333
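Neither output above matches the exact "2 u" / "1 u, 2 r" format requested in the question. A minimal base R sketch that does, assuming the labels should appear in order of first occurrence and that zero counts should be dropped (count_label and res are hypothetical names):

```r
df <- data.frame(a = c(10, 20, 20, 20, 30),
                 b = c("u", "u", "u", "r", "r"),
                 c = c("a", "a", "b", "b", "b"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: "count value" pairs in order of first appearance
count_label <- function(x) {
  x <- as.character(x)
  tab <- table(factor(x, levels = unique(x)))
  tab <- tab[tab > 0]
  paste(tab, names(tab), collapse = ", ")
}

# tapply() orders its groups by the sorted values of df$c
res <- data.frame(
  c = sort(unique(df$c)),
  a_m = as.vector(tapply(df$a, df$c, mean)),
  counts_b = as.vector(tapply(df$b, df$c, count_label)),
  stringsAsFactors = FALSE
)
```

The same count_label() helper could be called inside summarise() after group_by(c) if a dplyr pipeline is preferred.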

Remove exact rows and frequency of rows of a data.frame that are in another data.frame in r

Consider the following two data.frames:
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])
I would like to remove the exact rows of a1 that are in a2 so that the result should be:
A B
4 d
5 e
4 d
2 b
Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?
The idea is to add a counter for duplicates to each table, so that you get a unique match for each occurrence of a row. data.table is nice because it makes counting the duplicates easy (with .N), and it also provides the necessary function (fsetdiff) for set operations.
library(data.table)
a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])
# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]
# setdiff gets the exception
# "all = T" allows duplicate rows to be returned
fsetdiff(a1, a2, all = TRUE)
# A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.
library(dplyr)
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)
## Make temp variables to join on then delete later.
# Create a row number
a1_tmp <-
a1 %>%
group_by(A, B) %>%
mutate(tmp_id = row_number()) %>%
ungroup()
# Create a count
a2_tmp <-
a2 %>%
group_by(A, B) %>%
summarise(count = n()) %>%
ungroup()
## Keep rows that have no entry in a2, or whose id exceeds the count (i.e. the a2 entries are used up).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
ungroup() %>% filter(is.na(count) | tmp_id > count) %>%
select(-tmp_id, -count)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
EDIT
Here is a similar solution that is a little shorter. This does the following: (1) add a column for row number to join both data.frame items (2) a temporary column in a2 (2nd data.frame) that will show up as null in the join to a1 (i.e. indicates it's unique to a1).
library(dplyr)
left_join(a1 %>% group_by(A,B) %>% mutate(rn = row_number()) %>% ungroup(),
a2 %>% group_by(A,B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
by = c('A', 'B', 'rn')) %>%
filter(is.na(tmpcol)) %>%
select(-tmpcol, -rn)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
I think this solution is a little simpler (perhaps very little) than the first.
I guess this is similar to DWal's solution but in base R
a1_temp = Reduce(paste, a1)
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))
a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))
a1[!a1_temp %in% a2_temp,]
# A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
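The base R approach above can be wrapped in a function that works for any number of columns. This is a sketch under the assumption that both data frames have the same columns in the same order; the names remove_matched and row_key are hypothetical.

```r
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)],
                 stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3, 2), B = letters[c(1:3, 2)],
                 stringsAsFactors = FALSE)

# Hypothetical helper: drop one row of x for every identical row in y
remove_matched <- function(x, y) {
  row_key <- function(d) {
    # One string per row, built from all columns
    k <- do.call(paste, c(unname(d), sep = "\r"))
    # Append the occurrence number so repeated rows get distinct keys
    paste(k, ave(seq_along(k), k, FUN = seq_along), sep = "\r")
  }
  x[!(row_key(x) %in% row_key(y)), , drop = FALSE]
}

res <- remove_matched(a1, a2)
```

Because each key carries its occurrence number, the third 2 b row in a1 survives: a2 only contains two of them.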
Here's another solution with dplyr:
library(dplyr)
a1 %>%
arrange(A) %>%
group_by(A) %>%
filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))
Result:
# A tibble: 4 x 2
# Groups: A [3]
A B
<dbl> <fctr>
1 2 b
2 4 d
3 4 d
4 5 e
This way of filtering avoids creating extra unwanted columns that you have to later remove in the final output. This method also sorts the output. Not sure if it's what you want.
