R: get column names for changed rows

I have two dataframes, old and new, in R. Is there a way to add a column (called changed) to the new dataframe that lists the column names (in this case, separated with a ";") where the values differ between the two dataframes? I am also trying to use this in a function where the column names that I am comparing are contained in other variables (x1, x2, x3). Ideally, I would only refer to x1, x2, x3 instead of the actual column names, but I can make do if this isn't possible. A tidy solution is preferable.
old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"))
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"))
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"
#Output, adding a new column changed to new dataframe
var1 var2 changed
1 1 A NA
2 4 B var1
3 3 D var2
4 6 Z var1; var2

A tidyverse way -
library(dplyr)
library(tidyr)
library(purrr) # map2_df() comes from purrr
cols <- names(new)
bind_cols(new, map2_df(old, new, `!=`) %>%
rowwise() %>%
transmute(changed = {
x <- c_across()
if(any(x)) paste0(cols[x], collapse = ';') else NA
}))
# var1 var2 changed
#1 1 A <NA>
#2 4 B var1
#3 3 D var2
#4 6 Z var1;var2
The same logic can be implemented in base R as well -
new$changed <- apply(mapply(`!=`, old, new), 1, function(x)
if(any(x)) paste0(cols[x], collapse = ';') else NA)

Here is a base R approach.
new$changed <- apply(old != new, 1L, \(r, nms) toString(nms[which(r)]), colnames(old))
Output
var1 var2 changed
1 1 A
2 4 B var1
3 3 D var2
4 6 Z var1, var2
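Neither answer above wraps the logic in a function that takes the column names from variables, as the question asks. A minimal base R sketch of that idea follows; the helper name add_changed and its arguments are hypothetical, not part of either answer, and it assumes old and new have the same number of rows in the same order.

```r
# Hypothetical helper: compares the columns named in `cols` and writes the
# names of the differing ones into a new column named by `out`.
add_changed <- function(old, new, cols, out = "changed", sep = ";") {
  # Logical matrix: TRUE where a value differs between old and new
  differs <- as.matrix(old[cols] != new[cols])
  new[[out]] <- unname(apply(differs, 1, function(r) {
    hit <- cols[which(r)]  # which() drops NA comparisons
    if (length(hit) > 0) paste(hit, collapse = sep) else NA_character_
  }))
  new
}

old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"),
                  stringsAsFactors = FALSE)
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"),
                  stringsAsFactors = FALSE)
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"

# Only x1, x2 and x3 are referenced, never the literal column names
res <- add_changed(old, new, cols = c(x1, x2), out = x3)
res
```

Passing the names as character vectors avoids tidy evaluation entirely, which may be simpler than a dplyr version when the function is called programmatically.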

Related

Find unique entries in otherwise identical rows

I am currently trying to find a way to find unique column values in otherwise duplicate rows in a dataset.
My dataset has the following properties:
The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
Each row should represent one individual, meaning the values in the ID column should all be unique (and not repeated).
Some rows are duplicated, with repeated entries in the ID column and seemingly identical response item values (x1 - xn). However, the dataset is too large to get a full overview over all variables.
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.
I want to find a way to identify which entries are unique in these otherwise duplicate rows.
Example:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
# with duplicates removed
distinct() %>%
# filtered for columns only containing duplicates in the ID column
janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct() will remove fully duplicate rows (ID = 2), but not rows that are different in some columns (rows where ID = 1 and 3, and columns x2, x3 and x5).
What I want is an overview of which columns are not duplicated for each ID:
df %>%
distinct() %>%
janitor::get_dupes(ID) %>%
# Here I want a way to find columns with non-identical entries:
find_nomatch()
ID x2 x3 x5
1 7 x
1 10 p
3 c 9 z
3 d 11 q
A data.table alternative. Coerce data frame to a data.table (setDT). Melt data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast).
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
A bit more simple than yours I think:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
distinct() %>%
janitor::get_dupes(ID)
d %>%
group_by(ID) %>%
# Check for each id which row elements differ from those of the first row
group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y))%>%
do.call(what = cbind) %>% # Bind results for all ids
apply(1, any) %>% # return true if there are differences anywhere
c(T, .) %>% # Keep id column
`[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)
Edit
d %>%
group_by(ID) %>%
# Check for each id which row elements differ from those of the first row
group_map(\(.x, .id) apply(.x, 1, \(.y) !Vectorize(identical)(unlist(.x[1, ]), .y))) %>%
do.call(what = cbind) %>% # Bind results for all ids
apply(1, any) %>% # return true if there are differences anywhere
c(T, .) %>% # Keep id column
`[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-19 by the reprex package (v2.0.1)
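The same result can probably be reached without transposing or binding matrices. The sketch below is a base R take on the find_nomatch() placeholder from the question; the function body is an assumption about what the asker wants (keep the ID column plus every column that varies within at least one duplicated ID), not code from any answer.

```r
df <- data.frame(
  ID = rep(1:3, each = 2),
  x1 = rep(4:6, each = 2),
  x2 = c("a", "a", "b", "b", "c", "d"),
  x3 = c(7, 10, 8, 8, 9, 11),
  x4 = rep(letters[4:6], each = 2),
  x5 = c("x", "p", "y", "y", "z", "q"),
  x6 = rep(letters[7:9], each = 2),
  stringsAsFactors = FALSE
)

# Hypothetical implementation of find_nomatch()
find_nomatch <- function(data, id = "ID") {
  data <- unique(data)                              # drop fully identical rows
  dup_ids <- data[[id]][duplicated(data[[id]])]     # IDs that still repeat
  data <- data[data[[id]] %in% dup_ids, , drop = FALSE]
  others <- setdiff(names(data), id)
  # A column "varies" if any ID group holds more than one distinct value
  varies <- vapply(others, function(col) {
    any(tapply(data[[col]], data[[id]], function(v) length(unique(v)) > 1))
  }, logical(1))
  data[, c(id, others[varies]), drop = FALSE]
}

res <- find_nomatch(df)
res
```

Like the dplyr answer, this keeps x2 for ID 1 even though it only differs for ID 3, because a column is selected as a whole.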
I have been working on this issue for some time and I found a solution, though it took more steps than I would have thought necessary. I can only presume there's a more elegant solution out there. Anyway, this should work:
df <- df %>%
distinct() %>%
janitor::get_dupes(ID)
# Make vector of unique values from the duplicated ID values
l <- distinct(df, ID) %>% unlist()
# Lapply on each ID
df <- lapply(
l,
function(x) {
# Filter rows for the duplicated ID
dplyr::filter(df, ID == x) %>%
# Transpose dataframe (converts it into a matrix)
t() %>%
# Convert back to data frame
as.data.frame() %>%
# Filter columns that are not identical
dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
# Transpose back
t() %>%
# Convert back to data frame
as.data.frame()
}
) %>%
# Bind the dataframes in the list together
bind_rows() %>%
# Finally the columns are moved back in ascending order
relocate(x2, .before = x3)
#Remove row names (not necessary)
row.names(df) <- NULL
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
Feel free to comment
If you just want to keep the first instance of each identifier:
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = rep(letters[1:3], each = 2),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2)
)
df %>%
distinct(ID, .keep_all = TRUE)
Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f

Replace multiple values in a dataframe with NA based on conditions given in another dataframe in R

Using R, I'd like to replace numeric values in a data frame with NA's based on conditions contained in another data frame:
# An example data frame with numeric values I want to
# change to `NA` based on values given in another data frame.
df1 <- tibble::tribble(
~col_1, ~col_2, ~col_3, ~group, ~subgroup,
1, 3, 5, 'A', 'p',
6, 8, 5, 'A', 'q',
5, 3, 3, 'B', 'p',
1, 7, 7, 'B', 'q'
)
# A second data frame containing conditions
# to be used for subsetting the first data frame.
df2 <- tibble::tribble(
~group, ~subgroup, ~cols,
'A', 'q', 'col_1',
'A', 'q', 'col_3',
'B', 'p', 'col_2',
'B', 'p', 'col_3'
)
# My problematic approach to subsetting df1 and replacing
# values with `NA` based on the conditions given in df2.
df1[df1$group %in% unique(df2$group) &
df1$subgroup %in% unique(df2$subgroup),
unique(df2$cols)] <- NA
# The incorrect result of my approach.
print(df1)
# A tibble: 4 × 5
col_1 col_2 col_3 group subgroup
<dbl> <dbl> <dbl> <chr> <chr>
1 NA NA NA A p
2 NA NA NA A q
3 NA NA NA B p
4 NA NA NA B q
Created on 2021-09-20 by the reprex package (v2.0.1)
My strategy was to subset df1 using the TRUE indices where df1 observations matched df2 and use ]<-NA to replace those observations with NA's. However, my approach selected all the observations rather than the desired result of only replacing the observations indicated in df2.
How can I do this functionally/programmatically without manual replacement? This example dataset is small enough to use ]<- methods for each value I'd like to replace, but I'd like to do this functionally on a larger, more complicated dataset.
Solutions and caveats: both solutions offered by @Ronak and @akrun worked for the example data set in this question. However, after discovering rare cases of duplicated subgroup and group values in my real data set I found that only the solution by @akrun worked. Below I have added another example that recreates the rare cases I observed in my real data and added a modification of the solution by @Ronak that makes it work for these duplications.
# Unique numeric observations were added
# in rows 1 and 2 with group and subgroup
# values that are duplicated with existing
# group and subgroup values.
df1 <- tibble::tribble(
~col_1, ~col_2, ~col_3, ~group, ~subgroup,
7, 4, 9, "A", "p",
1, 3, 5, "A", "p",
6, 8, 5, "A", "q",
5, 3, 3, "B", "p",
1, 7, 7, "B", "q"
)
# Conditions were added in rows 1 and 2
# to indicate which values to replace
# in df1 with NA.
df2 <- tibble::tribble(
~group, ~subgroup, ~cols,
"A", "p", "col_1",
"A", "p", "col_2",
"A", "q", "col_1",
"A", "q", "col_3",
"B", "p", "col_2",
"B", "p", "col_3"
)
# Modifications of @Ronak's solution
df1 <- as.data.frame(df1)
df2 <- as.data.frame(df2)
key1 <- lapply(
setNames(names(df1)[grep("col_\\d", x = names(df1))], 1:3),
function(x) {
paste(x, df1$group, df1$subgroup)
}
)
key2 <- with(df2, paste(cols, group, subgroup))
indices <- lapply(
key1,
function(x) {
which(x %in% key2)
}
)
indices <- indices[sapply(indices, function(x) length(x) > 0)]
selection <- lapply(
1:length(indices),
function(x) {
cbind(indices[[x]], as.numeric(names(indices)[x]))
}
)
selection <- do.call(rbind, selection)
df1[selection] <- NA
df1
# col_1 col_2 col_3 group subgroup
# 1 NA NA 9 A p
# 2 NA NA 5 A p
# 3 NA 8 NA A q
# 4 5 NA NA B p
# 5 1 7 7 B q
Here is one method: loop across the columns that start with 'col' in the first dataset ('df1'), create a single string vector by pasting 'group', 'subgroup' and the corresponding column name (cur_column()), and check whether those elements are %in% the pasted rows of 'df2' to create a logical vector. Use that vector in replace to set the matching elements to NA.
library(dplyr)
library(stringr)
df1 <- df1 %>%
mutate(across(starts_with('col'),
~ replace(., str_c(group, subgroup, cur_column()) %in%
# do.call() replaces purrr::invoke(), which is deprecated in recent purrr
do.call(str_c, c(df2, sep = '')), NA) ))
Output
df1
# A tibble: 4 x 5
col_1 col_2 col_3 group subgroup
<dbl> <dbl> <dbl> <chr> <chr>
1 1 3 5 A p
2 NA 8 NA A q
3 5 NA NA B p
4 1 7 7 B q
A base R option creating a matrix of row/column index to replace values in df1.
key1 <- with(df1, paste(group, subgroup))
key2 <- with(df2, paste(group, subgroup))
df1 <- data.frame(df1)
df1[cbind(match(key2, key1), match(df2$cols, names(df1)))] <- NA
df1
# col_1 col_2 col_3 group subgroup
#1 1 3 5 A p
#2 NA 8 NA A q
#3 5 NA NA B p
#4 1 7 7 B q
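Because match() returns only the first matching row, the matrix-index approach misses repeated group/subgroup pairs, which is the caveat raised in the question. A simple sketch that handles such duplicates is to walk through the rows of df2 and blank out every matching row of df1. It uses the second example data from the question, rebuilt here as plain data frames so the block runs on its own.

```r
df1 <- data.frame(
  col_1 = c(7, 1, 6, 5, 1),
  col_2 = c(4, 3, 8, 3, 7),
  col_3 = c(9, 5, 5, 3, 7),
  group = c("A", "A", "A", "B", "B"),
  subgroup = c("p", "p", "q", "p", "q"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  group = c("A", "A", "A", "A", "B", "B"),
  subgroup = c("p", "p", "q", "q", "p", "p"),
  cols = c("col_1", "col_2", "col_1", "col_3", "col_2", "col_3"),
  stringsAsFactors = FALSE
)

# One pass per condition: every row of df1 with the same group and
# subgroup gets NA in the column named by that condition.
for (i in seq_len(nrow(df2))) {
  hit <- df1$group == df2$group[i] & df1$subgroup == df2$subgroup[i]
  df1[hit, df2$cols[i]] <- NA
}
df1
```

The loop runs once per row of df2, so it should stay cheap as long as the list of conditions is much shorter than the data.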

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this I'm missing here, but I have two data frames, where:
1. The merging should happen in the first data frame, hence left_join.
2. I not only want to add columns, but also update existing columns in the first data frame, more specifically: replace NA's in the first data frame by values in the second data frame.
3. The second data frame contains more rows than the first one.
Condition #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w
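For comparison, the join-then-coalesce idea can also be sketched in base R, assuming the same x and y as in the question. merge() with all.x = TRUE plays the role of left_join, and ifelse() stands in for coalesce on the one shared column.

```r
x <- data.frame(id = c(1, 2, 3),
                a = c("A", "B", NA),
                stringsAsFactors = FALSE)
y <- data.frame(id = c(1, 2, 3, 4),
                a = c("A", "B", "C", "D"),
                q = c("u", "v", "w", "x"),
                stringsAsFactors = FALSE)

# Left join: keep all rows of x; the shared column becomes a.x and a.y
m <- merge(x, y, by = "id", all.x = TRUE)

# Fill NAs in x's column with the value from y
m$a <- ifelse(is.na(m$a.x), m$a.y, m$a.x)

res <- m[c("id", "a", "q")]
res
```

With more than one shared column the ifelse() step would have to be repeated per column, which is where the dplyr version is likely to be more convenient.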

Data manipulation using r purrr

I have two datasets dat1 and dat2. I would like to pull out rows from dat1 which match the pairs of variables from dat2. var6 can be matched in any of var1, var2, var3, and var4. var7 must be matched with var5.
I would like to come up with a solution using the map functions from the purrr package in tidyverse but I'm not sure where to start. Thank you for any help!
dat1 <- data.frame(id = c(1:9),
var1 = c("x","x","x","y","y","y","z","z","z"),
var2 = c("c","c","c","d","d","d","e","e","e"),
var3 = c("f","f","f","g","g","g","h","h","h"),
var4 = c("i","i","i","j","j","j","k","k","k"),
var5 = c("aa","aa","aa","aa","aa","aa","bb","bb","bb"), stringsAsFactors = FALSE)
dat2 <- data.frame(var6 = c("c", "d", "l", "m", "n"),
var7 = c("aa", "bb", "aa", "aa","aa"), stringsAsFactors = FALSE)
In this example the result would pull out rows 1, 2, and 3 from dat1 as "c" is matched in var2 and "aa" is matched in var5.
If we need an elementwise comparison, loop through columns 2 to 5 of 'dat1' with lapply and compare each column elementwise with 'var6' of 'dat2' using outer, while making the same comparison between 'var5' of 'dat1' and 'var7' of 'dat2'. Check whether both are TRUE (&), then take the row-wise sum (rowSums) to collapse each matrix into a single logical vector, and Reduce the list of vectors into a single vector with |, i.e. check whether any of the vectors is TRUE for a given row. The result ('i1') is used for subsetting the rows.
i1 <- Reduce(`|`, lapply(dat1[2:5], function(x)
rowSums(outer(x, dat2$var6, `==`) & outer(dat1$var5, dat2$var7, `==`)) > 0 ))
dat1[i1,]
# id var1 var2 var3 var4 var5
#1 1 x c f i aa
#2 2 x c f i aa
#3 3 x c f i aa
Or using map
library(purrr)
library(dplyr)
map(dat1[2:5], ~ outer(.x, dat2$var6, `==`) &
outer(dat1$var5, dat2$var7, `==`)) %>%
reduce(`+`) %>%
rowSums %>%
as.logical %>%
magrittr::extract(dat1, ., )
# id var1 var2 var3 var4 var5
#1 1 x c f i aa
#2 2 x c f i aa
#3 3 x c f i aa
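A lighter alternative to the outer() matrices, offered only as a sketch, is to build a pasted key for each (var6, var7) pair and test each candidate column against it. It assumes the values contain no spaces, since a space is used as the separator in the keys.

```r
dat1 <- data.frame(id = c(1:9),
                   var1 = c("x","x","x","y","y","y","z","z","z"),
                   var2 = c("c","c","c","d","d","d","e","e","e"),
                   var3 = c("f","f","f","g","g","g","h","h","h"),
                   var4 = c("i","i","i","j","j","j","k","k","k"),
                   var5 = c("aa","aa","aa","aa","aa","aa","bb","bb","bb"),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(var6 = c("c", "d", "l", "m", "n"),
                   var7 = c("aa", "bb", "aa", "aa", "aa"),
                   stringsAsFactors = FALSE)

# One key per allowed (var6, var7) pair
key2 <- paste(dat2$var6, dat2$var7)

# For each of var1..var4, TRUE where (value, var5) is an allowed pair
hits <- sapply(dat1[c("var1", "var2", "var3", "var4")],
               function(col) paste(col, dat1$var5) %in% key2)

# Keep rows where at least one column matched
res <- dat1[rowSums(hits) > 0, ]
res
```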

cumsum when current obs equals next obs for same variable (column)

I want to add a column to a dataframe that makes a cumulated sum of another variable if yet another variable is equal for two rows. For example:
Row Var1 Var2 CumVal
1 A 2 2
2 A 4 6
3 B 5 5
So I want CumVal to cumulate/sum the Var2 column if the Var1 obs for row 2 equals the Var1 obs for row 1. In other words, if it is equal to the obs before.
If the cumsum is based on the Var1 as a grouping variable
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(CumVal=cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
Update
If it is based on whether adjacent elements are not equal
transform(df, CumVal= ave(Var2, cumsum(c(TRUE,Var1[-1]!=
Var1[-nrow(df)])), FUN=cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach
df %>%
group_by(indx= cumsum(c(TRUE,(lag(Var1)!=Var1)[-1]))) %>%
mutate(CumVal=cumsum(Var2)) %>%
ungroup() %>%
select(-indx)
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects runs of identical successive values in a vector and describes them in a compact way. E.g. let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can see that two lengths differ from 1: the lengths 4 and 2, corresponding to the values 2 and 0)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumsum on a new vector y <- 1:10, but only for repeated values of x (which will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep( seq_along(z), z) )
We can then use by or tapply (or something else) to achieve the desired output:
cumval <- unlist(tapply(y, f, cumsum))
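Applied to the data frame from the earlier answer, the rle idea might look like the sketch below. The name run_id is an illustrative choice, and ave() is used instead of tapply() so that the result comes back in the original row order.

```r
df <- data.frame(Row = 1:4,
                 Var1 = c("A", "A", "B", "A"),
                 Var2 = c(2L, 4L, 5L, 6L),
                 stringsAsFactors = FALSE)

# Label each run of identical consecutive Var1 values: 1, 1, 2, 3
runs <- rle(df$Var1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Cumulative sum of Var2 within each run
df$CumVal <- ave(df$Var2, run_id, FUN = cumsum)
df
```

Note that rle() needs a plain vector, so a factor column would have to be converted with as.character() first.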
