Identify duplicates and make column with common id [duplicate] - r

This question already has answers here:
Concatenate strings by group with dplyr [duplicate]
(4 answers)
Closed 20 days ago.
I have a df
df <- data.frame(ID = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'),
                 var1 = c(1, 1, 3, 4, 5, 5, 7, 8),
                 var2 = c(1, 1, 0, 0, 1, 1, 0, 0),
                 var3 = c(50, 50, 30, 47, 33, 33, 70, 46))
Columns var1-var3 are numerical inputs into modelling software. To save on computing time, I would like to simulate only the unique combinations of var1-var3 in the modelling software, then join the results back to the main data frame using a left join.
I need to add a second identifier to each row to show that it is the same as another row in terms of var1-var3. The output would be like:
ID var1 var2 var3 ID2
1 a 1 1 50 ab
2 b 1 1 50 ab
3 c 3 0 30 c
4 d 4 0 47 d
5 e 5 1 33 ef
6 f 5 1 33 ef
7 g 7 0 70 g
8 h 8 0 46 h
Then I can subset the unique rows of var1-var3 plus ID2, simulate them in the software, and join the results back to the main df using the new ID2, as sketched below.
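A minimal sketch of that workflow, assuming ID2 has already been added (the sim_result column and the var1 + var3 stand-in are hypothetical placeholders for the modelling software's output):
library(dplyr)
# simulate only the unique input combinations ...
unique_runs <- df %>% distinct(ID2, var1, var2, var3)
# stand-in for the modelling software: one hypothetical result per unique run
results <- unique_runs %>% mutate(sim_result = var1 + var3)
# ... then join the results back onto every row by ID2
df_joined <- left_join(df, select(results, ID2, sim_result), by = "ID2")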

With paste:
library(dplyr) # 1.1.0
df %>%
  mutate(ID2 = paste(unique(ID), collapse = ""),
         .by = c(var1, var2, var3))
# ID var1 var2 var3 ID2
# 1 a 1 1 50 ab
# 2 b 1 1 50 ab
# 3 c 3 0 30 c
# 4 d 4 0 47 d
# 5 e 5 1 33 ef
# 6 f 5 1 33 ef
# 7 g 7 0 70 g
# 8 h 8 0 46 h
Note that the .by argument is a new feature of dplyr 1.1.0. With earlier versions, or in a more complex pipeline, you can still use group_by and ungroup, as sketched below.
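A minimal equivalent for older dplyr versions:
df %>%
  group_by(var1, var2, var3) %>%
  mutate(ID2 = paste(unique(ID), collapse = "")) %>%
  ungroup()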

Related

replace values across columns in a dataframe when index variable matches to another dataframe in r

I have a dataset (df1) with about 40 columns, including an ID variable whose values can repeat across thousands of rows of observations. I have another dataset (df2) with only about 4 columns and a few rows of data. The column names in df2 are found in df1, and the ID variable matches some of the observations in df1. I want to replace values in df1 with those of df2 whenever the ID value in df1 matches one in df2.
Here is an example:
(I am omitting most of df1's 40 columns for simplicity.)
df1 <- data.frame(ID = c('a', 'b', 'a', 'd', 'e', 'd', 'f'),
                  var1 = c(40, 22, 12, 4, 0, 2, 1),
                  var2 = c(75, 55, 65, 15, 0, 2, 1),
                  var3 = c(9, 18, 81, 3, 0, 2, 1),
                  var4 = c(1, 11, 21, 61, 0, 2, 1),
                  var5 = c(-1, -2, -3, -4, 0, 2, 1),
                  var6 = c(0, 1, 0, 1, 0, 2, 1))
df2 <- data.frame(ID = c('a', 'd', 'f'),
                  var2 = c("fish", "pig", "cow"),
                  var4 = c("pencil", "pen", "eraser"),
                  var5 = c("lamp", "rug", "couch"))
I would like the resulting df:
ID var1 var2 var3 var4 var5 var6
1 a 40 fish 9 pencil lamp 0
2 b 22 55 18 11 -2 1
3 a 12 fish 81 pencil lamp 0
4 d 4 pig 3 pen rug 1
5 e 0 0 0 0 0 0
6 d 2 pig 2 pen rug 2
7 f 1 cow 1 eraser couch 1
I think there is a tidyverse solution using mutate across and case_when but I cannot figure out how to do this. Any help would be appreciated.
One option is to loop across the column names of df2 that also appear in df1, match on ID, and coalesce with the original column values:
library(dplyr)
df1 %>%
  mutate(across(any_of(names(df2)[-1]),
                # look up this column's df2 value by ID; fall back to the original
                ~ coalesce(df2[[cur_column()]][match(ID, df2$ID)],
                           as.character(.x))))
-output
ID var1 var2 var3 var4 var5 var6
1 a 40 fish 9 pencil lamp 0
2 b 22 55 18 11 -2 1
3 a 12 fish 81 pencil lamp 0
4 d 4 pig 3 pen rug 1
5 e 0 0 0 0 0 0
6 d 2 pig 2 pen rug 2
7 f 1 cow 1 eraser couch 1
Another option is to reshape both data frames to long form, join them, coalesce, and reshape back to wide:
library(tidyverse)
df1 %>%
  mutate(row = row_number(), .before = 1) %>%      # add row number
  pivot_longer(-c(ID, row)) %>%                    # reshape long
  mutate(value = as.character(value)) %>%          # numbers as text, like df2
  left_join(df2 %>%                                # join to long version of df2
              pivot_longer(-ID),
            by = c("ID", "name")) %>%
  mutate(new_val = coalesce(value.y, value.x)) %>% # preferentially use df2 value
  select(-value.x, -value.y) %>%
  pivot_wider(names_from = name, values_from = new_val) # reshape wide again
Result
# A tibble: 7 × 8
row ID var1 var2 var3 var4 var5 var6
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 a 40 fish 9 pencil lamp 0
2 2 b 22 55 18 11 -2 1
3 3 a 12 fish 81 pencil lamp 0
4 4 d 4 pig 3 pen rug 1
5 5 e 0 0 0 0 0 0
6 6 d 2 pig 2 pen rug 2
7 7 f 1 cow 1 eraser couch 1
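For comparison, a rough base R sketch of the same per-column replacement (like the answers above, it coerces the affected columns to character):
idx <- match(df1$ID, df2$ID) # row of df2 matching each df1 row, or NA
for (col in setdiff(names(df2), "ID")) {
  df1[[col]] <- ifelse(is.na(idx), as.character(df1[[col]]), df2[[col]][idx])
}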

using R: drop rows efficiently based on different conditions

Considering this sample
df <- data.frame(v0 = c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5),
                 v1 = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'b', 'b', 'a', 'a'),
                 v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5))
For a large data frame: if any v0 > 4 within a group of v1, drop all rows of that group.
So here the result should be a data frame dropping all rows with v1 == 'a', since v0 values of 5 exist for 'a'.
df_ExpectedResult <- data.frame(v0 = c(1, 2, 0, 1, 2, 2),
                                v1 = c('b', 'b', 'c', 'c', 'b', 'b'),
                                v2 = c(1, 8, 5, 10, 3, 3))
Also, I would like to have a new dataframe keeping the dropped groups.
df_Dropped <- data.frame(v1 = 'a')
How would you do this efficiently for a huge dataset? I am using a simple for loop and if statement, but it takes too long to do the manipulation.
An option with dplyr
library(dplyr)
df %>%
  group_by(v1) %>%
  filter(sum(v0 > 4) < 1) %>%
  ungroup
-output
# A tibble: 6 x 3
# v0 v1 v2
# <dbl> <chr> <dbl>
#1 1 b 1
#2 2 b 8
#3 0 c 5
#4 1 c 10
#5 2 b 3
#6 2 b 3
A base R option using subset + ave; ave(v0 > 4, v1, FUN = any) checks whether any v0 exceeds 4 within each v1 group and broadcasts the result to every row, so negating it keeps only the untouched groups:
subset(df, !ave(v0 > 4, v1, FUN = any))
gives
v0 v1 v2
4 1 b 1
5 2 b 8
6 0 c 5
7 1 c 10
8 2 b 3
9 2 b 3
It's two operations, but what about this (as a bonus, drop_groups gives you the dropped groups for df_Dropped):
drop_groups <- df %>% filter(v0 > 4) %>% distinct(v1) %>% pull(v1)
df_result <- df %>% filter(!(v1 %in% drop_groups))
df_result
# v0 v1 v2
# 1 1 b 1
# 2 2 b 8
# 3 0 c 5
# 4 1 c 10
# 5 2 b 3
# 6 2 b 3
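For very large data, a data.table sketch along the same lines, which also keeps the dropped groups that the question asked for:
library(data.table)
setDT(df)
drop_ids <- unique(df[v0 > 4, v1]) # groups containing any v0 > 4
df_Dropped <- data.frame(v1 = drop_ids)
df_result <- df[!v1 %in% drop_ids]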

Repeat/duplicate specific row of data frame and append

I would like to duplicate a certain row based on information in a data frame. Prefer a tidyverse solution. I'd like to accomplish this without explicitly calling the original data frame in a function.
Here's a toy example.
df <- data.frame(var1 = c("A", "A", "A", "B", "B"),
                 var2 = c(1, 2, 3, 4, 5),
                 val = c(21, 31, 54, 65, 76))
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
All the solutions I've found so far require the user to input the desired row index. I'd like to find a way of doing it programmatically. In this case, I would like to duplicate the row where var1 is "A" with the highest value of var2 for "A" and append to the original data frame. The expected output is
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
6 A 3 54
A variation using dplyr. Find the max by group, filter for var1 and append.
library(dplyr)
df %>%
  group_by(var1) %>%
  filter(var2 == max(var2),
         var1 == "A") %>%
  bind_rows(df, .)
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
6 A 3 54
You could select the row that you want to duplicate and add it to original dataframe :
library(dplyr)
var1_variable <- 'A'
df %>%
  filter(var1 == var1_variable) %>%
  slice_max(var2, n = 1) %>%
  # for dplyr < 1.0.0 use: slice(which.max(var2)) %>%
  bind_rows(df, .)
# var1 var2 val
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 B 4 65
#5 B 5 76
#6 A 3 54
In base R, that can be done as :
df1 <- subset(df, var1 == var1_variable)
rbind(df, df1[which.max(df1$var2), ])
From this post we can save the previous work in a temporary variable and then bind rows so that we don't break the chain and don't bind the original dataframe df.
df %>%
  # previous list of commands
  {
    {. -> temp} %>%
      filter(var1 == var1_variable) %>%
      slice_max(var2, n = 1) %>%
      bind_rows(temp)
  }
In base you can use rbind and subset to append the row(s) where var1 == "A" with the highest value of var2 to the original data frame.
rbind(x, subset(x[x$var1 == "A",], var2 == max(var2)))
# var1 var2 val
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 B 4 65
#5 B 5 76
#31 A 3 54
Data:
x <- data.frame(var1 = c("A", "A", "A", "B", "B"),
                var2 = c(1, 2, 3, 4, 5),
                val = c(21, 31, 54, 65, 76))
An option with uncount
library(dplyr)
library(tidyr)
df %>%
  uncount(replace(rep(1, n()), match(max(val[var1 == 'A']), val), 2)) %>%
  as_tibble
# A tibble: 6 x 3
# var1 var2 val
# <chr> <dbl> <dbl>
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 A 3 54
#5 B 4 65
#6 B 5 76
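To make the row selection fully programmatic, the slice_max approach could be wrapped in a small helper; a sketch, where append_max_row is a hypothetical name and slice_max requires dplyr >= 1.0.0:
library(dplyr)
# Hypothetical helper: append the row with the largest `order_col`
# among rows where `group_col` equals `group_value`, without naming
# the data frame twice.
append_max_row <- function(data, group_col, order_col, group_value) {
  data %>%
    filter({{ group_col }} == group_value) %>%
    slice_max({{ order_col }}, n = 1, with_ties = FALSE) %>%
    bind_rows(data, .)
}
df %>% append_max_row(var1, var2, "A")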

How to extract information from a dataframe name and create a column based on it

Here's some mock data that represents the data I have:
pend4P_17k <- data.frame(x = c(1, 2, 3, 4, 5),
                         var1 = c('a', 'b', 'c', 'd', 'e'),
                         var2 = c(1, 1, 0, 0, 1))
pend5P_17k <- data.frame(x = c(1, 2, 3, 4, 5),
                         var1 = c('a', 'b', 'c', 'd', 'e'),
                         var2 = c(1, 1, 0, 0, 1))
I need to add a column to each data frame that represents the first letter/number code within the dataframe name, so for each dataframe I've been doing the following:
pend4P_17k$Pendant_ID <- "4P"
pend5P_17k$Pendant_ID <- "5P"
However, I have many dataframes to apply this to, so I'd like to create a function that can pull the information out of the dataframe name and apply it to a new column. I have attempted to use regular expressions and pattern matching to create a function, but with no luck (I'm very new to regular expressions).
Using R version 3.5.1, Mac OS X 10.13.6
This seems like a pretty bad idea. It's better to keep your data frames in a list rather than strewn about the global environment. However, if you're insistent it is possible:
add_name_cols <- function() {
  my_global <- ls(envir = globalenv())
  for (i in my_global) {
    # inherits() avoids an error when an object has more than one class
    if (inherits(get(i), "data.frame") && grepl("pend", i)) {
      df <- get(i)
      df$Pendant_ID <- gsub("^pend(.{2})_.*$", "\\1", i)
      assign(i, df, envir = globalenv())
    }
  }
}
add_name_cols()
pend4P_17k
#> x var1 var2 Pendant_ID
#> 1 1 a 1 4P
#> 2 2 b 1 4P
#> 3 3 c 0 4P
#> 4 4 d 0 4P
#> 5 5 e 1 4P
pend5P_17k
#> x var1 var2 Pendant_ID
#> 1 1 a 1 5P
#> 2 2 b 1 5P
#> 3 3 c 0 5P
#> 4 4 d 0 5P
#> 5 5 e 1 5P
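To illustrate the list-based alternative recommended above, a minimal sketch (the regex mirrors the one in add_name_cols):
dfs <- list(pend4P_17k = pend4P_17k, pend5P_17k = pend5P_17k)
dfs <- Map(function(d, nm) {
  d$Pendant_ID <- gsub("^pend(.{2})_.*$", "\\1", nm)
  d
}, dfs, names(dfs))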
This will do the trick:
require(dplyr)
f <- function(begin, end) {
  ids <- seq(begin, end)
  listdf <- lapply(ids, function(x) eval(parse(text = paste0("pend", x, "P_17k"))))
  names(listdf) <- paste0("pend", ids, "P_17k")
  for (i in seq_along(listdf)) {
    # build the ID from ids[i] so the labelling works for any begin/end
    listdf[[i]] <- listdf[[i]] %>% mutate(Pendant_ID = paste0(ids[i], "P"))
  }
  list2env(listdf, .GlobalEnv)
}
Gives the desired output:
> f(4,5)
<environment: R_GlobalEnv>
> pend4P_17k
x var1 var2 Pendant_ID
1 1 a 1 4P
2 2 b 1 4P
3 3 c 0 4P
4 4 d 0 4P
5 5 e 1 4P
> pend5P_17k
x var1 var2 Pendant_ID
1 1 a 1 5P
2 2 b 1 5P
3 3 c 0 5P
4 4 d 0 5P
5 5 e 1 5P
Using mget and rbindlist:
library(data.table)
m1 <- mtcars[1:2, 1:3]
m2 <- mtcars[3:4, 1:3]
rbindlist(mget(ls(pattern = "^m")), idcol = "myDF")
# myDF mpg cyl disp
# 1: m1 21.0 6 160
# 2: m1 21.0 6 160
# 3: m2 22.8 4 108
# 4: m2 21.4 6 258
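Applied to the pend* frames from the question, the same idea might look like this sketch (the source and Pendant_ID column names are choices, not requirements):
library(data.table)
combined <- rbindlist(mget(ls(pattern = "^pend.*_17k$")), idcol = "source")
combined[, Pendant_ID := gsub("^pend(.{2})_.*$", "\\1", source)]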

keeping first observation of each duplicated combination of values across multiple columns

My data :
var1 <- c(1, 2, 3, 4, 5, 28, 6)
var2 <- c(2, 1, 10, 11, 6, 78, 5)
var3 <- c(100, 101, 102, 0, 0, 0, 0)
dataset <- data.frame(var1, var2, var3)
dataset
which gives:
var1 var2 var3
1 2 100
2 1 101
3 10 102
4 11 0
5 6 0
28 78 0
6 5 0
I have two combinations of duplicated values across the var1 and var2 columns (in any order):
first one:
var1 var2 var3
1 2 100
2 1 101
second one:
var1 var2 var3
5 6 0
6 5 0
Expected result :
keeping the first observation of each duplicated combination of values across multiple columns (var1 and var2):
var1 var2 var3
1 2 100
3 10 102
4 11 0
5 6 0
28 78 0
full dataset csv
We can use duplicated on the row-wise sorted values of the first two columns (so that, e.g., (1, 2) and (2, 1) compare equal) to get the expected output
dataset[!duplicated(t(apply(dataset[1:2], 1, sort))),]
Or another option is to apply duplicated on pmin and pmax
library(data.table)
setDT(dataset)[!duplicated(dataset[, .(var1 = pmin(var1, var2), var2 = pmax(var1, var2))])]
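The same pmin/pmax idea also works in base R, as a sketch:
key <- paste(pmin(dataset$var1, dataset$var2), pmax(dataset$var1, dataset$var2))
dataset[!duplicated(key), ]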
Update
Based on the OP's full dataset
df1 <- na.omit(read.csv(file.choose(), row.names = 1))
out <- df1[!duplicated(t(apply(df1[1:2], 1, sort))),]
dim(out)
#[1] 113 3
out2 <- setDT(df1)[!duplicated(df1[, .(from = pmin(from, to), to = pmax(from, to))])]
dim(out2)
#[1] 113 3
