My data:
var1 <- c(1, 2, 3, 4, 5, 28, 6)
var2 <- c(2, 1, 10, 11, 6, 78, 5)
var3 <- c(100, 101, 102, 0, 0, 0, 0)
dataset <- data.frame(var1, var2, var3)
dataset
My result:
var1 var2 var3
1 2 100
2 1 101
3 10 102
4 11 0
5 6 0
28 78 0
6 5 0
I have two combinations of duplicated values across the var1 and var2 columns (in any order):
first one:
var1 var2 var3
1 2 100
2 1 101
second one:
var1 var2 var3
5 6 0
6 5 0
Expected result:
keeping the first observation of each duplicated combination of values in multiple columns (var1 and var2):
var1 var2 var3
1 2 100
3 10 102
4 11 0
5 6 0
28 78 0
full dataset csv
We can use duplicated on the sorted elements of each row of the first two columns to get the expected output
dataset[!duplicated(t(apply(dataset[1:2], 1, sort))),]
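To see why this works, it can help to inspect the intermediate key matrix that duplicated() operates on; here is a self-contained sketch with the question's data:

```r
# The question's data
dataset <- data.frame(var1 = c(1, 2, 3, 4, 5, 28, 6),
                      var2 = c(2, 1, 10, 11, 6, 78, 5),
                      var3 = c(100, 101, 102, 0, 0, 0, 0))

# Sorting each row of the first two columns makes (1, 2) and (2, 1)
# identical keys, so duplicated() can flag the repeated pairs rowwise
keys <- t(apply(dataset[1:2], 1, sort))
dataset[!duplicated(keys), ]
```

duplicated(keys) is TRUE for rows 2 and 7, leaving the five expected rows.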
Or another option is to apply duplicated on pmin and pmax
library(data.table)
setDT(dataset)[!duplicated(dataset[, .(var1 = pmin(var1, var2), var2 = pmax(var1, var2))])]
Update
Based on the OP's full dataset
df1 <- na.omit(read.csv(file.choose(), row.names = 1))
out <- df1[!duplicated(t(apply(df1[1:2], 1, sort))),]
dim(out)
#[1] 113 3
out2 <- setDT(df1)[!duplicated(df1[, .(from = pmin(from, to), to = pmax(from, to))])]
dim(out2)
#[1] 113 3
I have a df
df <- data.frame(ID = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'),
                 var1 = c(1, 1, 3, 4, 5, 5, 7, 8),
                 var2 = c(1, 1, 0, 0, 1, 1, 0, 0),
                 var3 = c(50, 50, 30, 47, 33, 33, 70, 46))
Where columns var1 - var3 are numeric inputs to modelling software. To save on computing time, I would like to simulate only the unique combinations of var1 - var3 in the modelling software, then join the results back to the main dataframe with a left join.
I need to add a second identifier to each row to show that it is the same as another row in terms of var1-var3. The output would be like:
ID var1 var2 var3 ID2
1 a 1 1 50 ab
2 b 1 1 50 ab
3 c 3 0 30 c
4 d 4 0 47 d
5 e 5 1 33 ef
6 f 5 1 33 ef
7 g 7 0 70 g
8 h 8 0 46 h
Then I can subset the unique rows of var1 - var3 and ID2, simulate them in the software, and join the results back to the main df using the new ID2.
With paste:
library(dplyr) #1.1.0
df %>%
  mutate(ID2 = paste(unique(ID), collapse = ""),
         .by = c(var1, var2, var3))
# ID var1 var2 var3 ID2
# 1 a 1 1 50 ab
# 2 b 1 1 50 ab
# 3 c 3 0 30 c
# 4 d 4 0 47 d
# 5 e 5 1 33 ef
# 6 f 5 1 33 ef
# 7 g 7 0 70 g
# 8 h 8 0 46 h
Note that the .by argument is a new feature of dplyr 1.1.0. You can still use group_by and ungroup with earlier versions and/or if you have a more complex pipeline.
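To sketch the rest of the workflow the question describes (simulate the unique combinations once, then join the results back), here is a minimal, self-contained example; simulate_model() is a made-up stand-in for the external modelling software:

```r
library(dplyr)  # >= 1.1.0 for the .by argument

df <- data.frame(ID = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'),
                 var1 = c(1, 1, 3, 4, 5, 5, 7, 8),
                 var2 = c(1, 1, 0, 0, 1, 1, 0, 0),
                 var3 = c(50, 50, 30, 47, 33, 33, 70, 46))

# Hypothetical stand-in for the external modelling software
simulate_model <- function(d) mutate(d, result = var1 + var2 + var3)

df_id2 <- df %>%
  mutate(ID2 = paste(unique(ID), collapse = ""), .by = c(var1, var2, var3))

# Simulate each unique input combination only once...
sims <- df_id2 %>%
  distinct(ID2, var1, var2, var3) %>%
  simulate_model()

# ...then attach the results to every original row via ID2
out <- left_join(df_id2, select(sims, ID2, result), by = "ID2")
```

Only six rows go through simulate_model() instead of eight, and rows sharing an ID2 (a/b, e/f) receive identical results.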
I have a dataset (df1) with about 40 columns, including an ID variable whose values can appear in multiple observations over thousands of rows. I have another dataset (df2) with only about 4 columns and a few rows of data. The column names in df2 are found in df1, and its ID values match some of the observations in df1. I want to replace values in df1 with those of df2 whenever the ID value from df1 matches that of df2.
Here is an example:
(I am omitting most of df1's 40 columns for simplicity)
df1 <- data.frame(ID = c('a', 'b', 'a', 'd', 'e', 'd', 'f'),
                  var1 = c(40, 22, 12, 4, 0, 2, 1),
                  var2 = c(75, 55, 65, 15, 0, 2, 1),
                  var3 = c(9, 18, 81, 3, 0, 2, 1),
                  var4 = c(1, 11, 21, 61, 0, 2, 1),
                  var5 = c(-1, -2, -3, -4, 0, 2, 1),
                  var6 = c(0, 1, 0, 1, 0, 2, 1))
df2 <- data.frame(ID = c('a', 'd', 'f'),
                  var2 = c("fish", "pig", "cow"),
                  var4 = c("pencil", "pen", "eraser"),
                  var5 = c("lamp", "rug", "couch"))
I would like the resulting df:
ID var1 var2 var3 var4 var5 var6
1 a 40 fish 9 pencil lamp 0
2 b 22 55 18 11 -2 1
3 a 12 fish 81 pencil lamp 0
4 d 4 pig 3 pen rug 1
5 e 0 0 0 0 0 0
6 d 2 pig 2 pen rug 2
7 f 1 cow 1 eraser couch 1
I think there is a tidyverse solution using mutate with across and case_when, but I cannot figure out how to do this. Any help would be appreciated.
An option is to loop across the column names of 'df2' within df1, match on 'ID', and coalesce with the original column values:
library(dplyr)
df1 %>%
  mutate(across(any_of(names(df2)[-1]),
                ~ coalesce(df2[[cur_column()]][match(ID, df2$ID)], as.character(.x))))
Output:
ID var1 var2 var3 var4 var5 var6
1 a 40 fish 9 pencil lamp 0
2 b 22 55 18 11 -2 1
3 a 12 fish 81 pencil lamp 0
4 d 4 pig 3 pen rug 1
5 e 0 0 0 0 0 0
6 d 2 pig 2 pen rug 2
7 f 1 cow 1 eraser couch 1
library(tidyverse)
df1 %>%
  mutate(row = row_number(), .before = 1) %>%        # add row number
  pivot_longer(-c(ID, row)) %>%                      # reshape long
  mutate(value = as.character(value)) %>%            # numbers as text, like df2
  left_join(df2 %>%                                  # join to long version of df2
              pivot_longer(-ID),
            by = c("ID", "name")) %>%
  mutate(new_val = coalesce(value.y, value.x)) %>%   # preferentially use the df2 value
  select(-value.x, -value.y) %>%
  pivot_wider(names_from = name, values_from = new_val)  # reshape wide again
Result
# A tibble: 7 × 8
row ID var1 var2 var3 var4 var5 var6
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 a 40 fish 9 pencil lamp 0
2 2 b 22 55 18 11 -2 1
3 3 a 12 fish 81 pencil lamp 0
4 4 d 4 pig 3 pen rug 1
5 5 e 0 0 0 0 0 0
6 6 d 2 pig 2 pen rug 2
7 7 f 1 cow 1 eraser couch 1
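For completeness, the same match-and-coalesce idea can also be written in base R without dplyr; this is just a sketch over the question's data:

```r
# The question's data
df1 <- data.frame(ID = c('a', 'b', 'a', 'd', 'e', 'd', 'f'),
                  var1 = c(40, 22, 12, 4, 0, 2, 1),
                  var2 = c(75, 55, 65, 15, 0, 2, 1),
                  var3 = c(9, 18, 81, 3, 0, 2, 1),
                  var4 = c(1, 11, 21, 61, 0, 2, 1),
                  var5 = c(-1, -2, -3, -4, 0, 2, 1),
                  var6 = c(0, 1, 0, 1, 0, 2, 1))
df2 <- data.frame(ID = c('a', 'd', 'f'),
                  var2 = c("fish", "pig", "cow"),
                  var4 = c("pencil", "pen", "eraser"),
                  var5 = c("lamp", "rug", "couch"))

idx <- match(df1$ID, df2$ID)   # row of df2 for each df1 row (NA if no match)
for (col in setdiff(names(df2), "ID")) {
  repl <- df2[[col]][idx]      # replacement values, NA where ID is unmatched
  # keep the original value (coerced to character, like df2) where unmatched
  df1[[col]] <- ifelse(is.na(repl), as.character(df1[[col]]), repl)
}
df1
```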
I am currently working on data structured like this:
library(tibble)
df <- tibble(
  id = c("1", "2", "3", "4", "5"),
  var1 = c(2, NA, 3, 1, 2),
  var2 = c(1, 2, NA, NA, 2),
  var3 = c(5, 8, 6, NA, NA),
  var4 = c(11, 22, 33, 44, 55)
)
> df
# A tibble: 5 × 5
id var1 var2 var3 var4
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 5 11
2 2 NA 2 8 22
3 3 3 NA 6 33
4 4 1 NA NA 44
5 5 2 2 NA 55
I need to compute the unstandardised mean difference between each pair of valid values across var1, var2, and var3, for each row taken individually.
I would like to get a resulting variable in the tibble with the mean difference between any two variables (out of the 3 I listed before).
If I were to do it by hand for the first row, I would calculate the differences first
2 - 1 = 1
1 - 5 = -4
2 - 5 = -3
then take the absolute value, as I am interested in the distances only
|1| = 1
|-4| = 4
|-3| = 3
and then compute an average of the differences
(1 + 4 + 3) / 3 = 2.67
An important exception: if one or more NAs are present, they shouldn't be considered, neither in the differences nor in the average. E.g. in the 2nd row, I'd need the result to be 6, not NA.
With two NAs, I would expect the average difference to be 0, though NA would also be acceptable.
What I tried so far didn't work, as it does not sum by row:
df %>%
  mutate(meandiff = sum(
    abs(sum(var1, -var2, na.rm = TRUE)),
    abs(sum(var2, -var3, na.rm = TRUE)),
    abs(sum(var1, -var3, na.rm = TRUE)),
    na.rm = TRUE
  ) / 3)
I was thinking of using the function rowsum(), but I need the pairwise differences, not all three variables at the same time.
Would you be able to help me find out a way to compute it in R?
Thank you!
Something like this?
func <- function(...) {
  dots <- na.omit(c(...))  # ignore NAs entirely
  # Close the values into a cycle so diff() returns each consecutive
  # difference, including last-to-first, then average the distances
  sum(abs(diff(c(dots, dots[1]))), na.rm = TRUE) / length(dots)
}
df %>%
  mutate(meandiff = mapply(func, var1, var2, var3))
# # A tibble: 5 x 6
# id var1 var2 var3 var4 meandiff
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2 1 5 11 2.67
# 2 2 NA 2 8 22 6
# 3 3 3 NA 6 33 3
# 4 4 1 NA NA 44 0
# 5 5 2 2 NA 55 0
(This calculates var3 - var1 for the third mid-sum value instead of your var1 - var3, but since you use abs it should not matter.)
If I understand you correctly, you want something like
df %>%
  mutate(
    meandiff = rowSums(
      cbind(
        abs(var1 - var2),
        abs(var2 - var3),
        abs(var1 - var3)
      ), na.rm = TRUE) / 3
  )
By the way, if you remove the NAs before averaging, will dividing by a fixed 3 still be appropriate?
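Picking up that caveat, here is a sketch that divides by the actual number of valid pairs in each row, using combn() over the non-NA values (rows with fewer than two valid values get 0, as the question requests):

```r
library(dplyr)
library(tibble)

# The question's data
df <- tibble(
  id = c("1", "2", "3", "4", "5"),
  var1 = c(2, NA, 3, 1, 2),
  var2 = c(1, 2, NA, NA, 2),
  var3 = c(5, 8, 6, NA, NA),
  var4 = c(11, 22, 33, 44, 55)
)

out <- df %>%
  rowwise() %>%
  mutate(meandiff = {
    v <- na.omit(c(var1, var2, var3))
    # combn(v, 2, diff) returns one difference per pair of valid values
    if (length(v) < 2) 0 else mean(abs(combn(v, 2, diff)))
  }) %>%
  ungroup()
```

This gives 2.67, 6, 3, 0, 0, matching the hand calculation and the NA rules described in the question.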
If you have a complete data frame (no missing values), it is easy to multiply values based on a logical condition:
df = data.frame(
  var1 = c(1, 2, 3, 4, 5),
  var2 = c(1, 2, 3, 2, 1),
  var3 = c(5, 4, 3, 4, 5)
)
> df
var1 var2 var3
1 1 1 5
2 2 2 4
3 3 3 3
4 4 2 4
5 5 1 5
> df[df > 2] <- df[df > 2] * 10
> df
var1 var2 var3
1 1 1 50
2 2 2 40
3 30 30 30
4 40 2 40
5 50 1 50
However, if you have NA values in the data frame, the operation fails:
> df_na = data.frame(
var1 = c(NA, 2, 3, 4, 5),
var2 = c(1, 2, 3, 1, NA),
var3 = c(5, NA, 3, 4, 5)
)
> df_na
var1 var2 var3
1 NA 1 5
2 2 2 NA
3 3 3 3
4 4 1 4
5 5 NA 5
> df_na[df_na > 2] <- df_na[df_na > 2] * 10
Error in `[<-.data.frame`(`*tmp*`, df_na > 2, value = c(NA, 30, 40, 50, :
'value' is the wrong length
I tried, for example, some na.omit() tactics but could not make it work. I also could not find an appropriate question on Stack Overflow.
So, how should I do it?
You can add !is.na() as an additional condition in the logical index you subset by:
df_na[df_na > 2 & !is.na(df_na)] <- df_na[df_na > 2 & !is.na(df_na)] * 10
# > df_na
# var1 var2 var3
# 1 NA 1 50
# 2 2 2 NA
# 3 30 30 30
# 4 40 1 40
# 5 50 NA 50
Alternatively, a dplyr / tidyverse solution would be:
library(dplyr)
df_na %>%
  mutate(across(everything(), ~ ifelse(!is.na(.x) & .x > 2, .x * 10, .x)))
Added based on OP comment:
If you want to subset by values based on the %in% operator, opt for the dplyr solution (the %in% operator won't work the same way here as explained in this post):
df_na %>%
  mutate(across(everything(), ~ ifelse(!is.na(.x) & .x %in% c(3, 4), .x * 10, .x)))
# var1 var2 var3
# 1 NA 1 5
# 2 2 2 NA
# 3 30 30 30
# 4 40 1 40
# 5 5 NA 5
This approach generally lends itself to more complex manipulation tasks. You may, for instance, also define additional conditions with the help of dplyr::case_when() instead of the one-alternative ifelse.
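As a sketch of that case_when() variant, using the same data and the same %in% rule as above (TRUE ~ .x is the catch-all branch, which works across dplyr versions):

```r
library(dplyr)

# The question's data
df_na <- data.frame(var1 = c(NA, 2, 3, 4, 5),
                    var2 = c(1, 2, 3, 1, NA),
                    var3 = c(5, NA, 3, 4, 5))

out <- df_na %>%
  mutate(across(everything(), ~ case_when(
    is.na(.x)       ~ .x,        # leave missing values untouched
    .x %in% c(3, 4) ~ .x * 10,   # the %in% rule from above
    TRUE            ~ .x         # everything else unchanged
  )))
```

Extra conditions can be added as further `condition ~ value` branches without nesting ifelse calls.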
Does this work? Using base R:
df_na[] <- lapply(df_na, function(x) ifelse(!is.na(x) & x > 2, x * 10, x))
df_na
var1 var2 var3
1 NA 1 50
2 2 2 NA
3 30 30 30
4 40 1 40
5 50 NA 50
The problem is not with the multiplication; it is with the array indexing (df_na > 2 returns NAs).
You can combine the two lines below into one if you like:
inds <- which(df_na > 2, arr.ind = TRUE)
df_na[inds] <- df_na[inds] * 10
I would like to duplicate a certain row based on information in a data frame. Prefer a tidyverse solution. I'd like to accomplish this without explicitly calling the original data frame in a function.
Here's a toy example.
df <- data.frame(var1 = c("A", "A", "A", "B", "B"),
                 var2 = c(1, 2, 3, 4, 5),
                 val = c(21, 31, 54, 65, 76))
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
All the solutions I've found so far require the user to input the desired row index. I'd like to find a way of doing it programmatically. In this case, I would like to duplicate the row where var1 is "A" with the highest value of var2 for "A" and append to the original data frame. The expected output is
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
6 A 3 54
A variation using dplyr: find the max by group, filter for var1, and append.
library(dplyr)
df %>%
  group_by(var1) %>%
  filter(var2 == max(var2),
         var1 == "A") %>%
  bind_rows(df, .)
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
6 A 3 54
You could select the row that you want to duplicate and add it to the original dataframe:
library(dplyr)
var1_variable <- 'A'
df %>%
  filter(var1 == var1_variable) %>%
  slice_max(var2, n = 1) %>%
  # For dplyr < 1.0.0:
  # slice(which.max(var2)) %>%
  bind_rows(df, .)
# var1 var2 val
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 B 4 65
#5 B 5 76
#6 A 3 54
In base R, that can be done as:
df1 <- subset(df, var1 == var1_variable)
rbind(df, df1[which.max(df1$var2), ])
Following this post, we can save the previous work in a temporary variable and then bind rows, so that we don't break the chain and don't have to reference the original dataframe df.
df %>%
  # Previous list of commands
  {
    {. -> temp} %>%
      filter(var1 == var1_variable) %>%
      slice_max(var2, n = 1) %>%
      bind_rows(temp)
  }
In base R, you can use rbind and subset to append the row(s) where var1 == "A" with the highest value of var2 to the original data frame.
rbind(x, subset(x[x$var1 == "A",], var2 == max(var2)))
# var1 var2 val
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 B 4 65
#5 B 5 76
#31 A 3 54
Data:
x <- data.frame(var1 = c("A", "A", "A", "B", "B"),
                var2 = c(1, 2, 3, 4, 5),
                val = c(21, 31, 54, 65, 76))
An option with uncount
library(dplyr)
library(tidyr)
df %>%
  uncount(replace(rep(1, n()), match(max(val[var1 == 'A']), val), 2)) %>%
  as_tibble()
# A tibble: 6 x 3
# var1 var2 val
# <chr> <dbl> <dbl>
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 A 3 54
#5 B 4 65
#6 B 5 76