R dataframe Removing duplicates / choosing which duplicate to remove


I have a dataframe that has duplicates based on their identifying ID, but some of the columns are different. I'd like to keep the rows (or the duplicates) that have the extra bit of info. The structure of the df is as such.
id <- c("3235453", "3235453", "21354315", "21354315", "2121421")
Plan_name<- c("angers", "strasbourg", "Benzema", "angers", "montpellier")
service_line<- c("", "AMRS", "", "Therapy", "")
treatment<-c("", "MH", "", "MH", "")
df <- data.frame (id, Plan_name, treatment, service_line)
As you can see, the ID row has duplicates, but I'd like to keep the second duplicate where there is more info in treatment and service_line.
I have tried using
df[duplicated(df[,c(1,3)]),]
but it doesn't work as an empty df is returned. Any suggestions?

Maybe you want something like this:
First we replace all blanks with NA, then arrange by treatment within each id (arrange() sorts NA last), and finally slice() the first row from each group:
library(dplyr)
df %>%
mutate(across(-c(id, Plan_name), ~ifelse(.=="", NA, .))) %>%
group_by(id) %>%
arrange(treatment, .by_group = TRUE) %>%
slice(1)
id Plan_name treatment service_line
<chr> <chr> <chr> <chr>
1 2121421 montpellier NA NA
2 21354315 angers MH Therapy
3 3235453 strasbourg MH AMRS

Try with
library(dplyr)
df %>%
filter(if_all(treatment:service_line, ~ .x != ""))
-output
id Plan_name treatment service_line
1 3235453 strasbourg MH AMRS
2 21354315 angers MH Therapy
If we also need to keep ids that are not duplicated, even when their fields are blank:
df %>%
group_by(id) %>%
filter(n() == 1|if_all(treatment:service_line, ~ .x != "")) %>%
ungroup
-output
# A tibble: 3 × 4
id Plan_name treatment service_line
<chr> <chr> <chr> <chr>
1 3235453 strasbourg "MH" "AMRS"
2 21354315 angers "MH" "Therapy"
3 2121421 montpellier "" ""
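For comparison, here is a minimal base R sketch of the same idea. It also shows why the attempt in the question returned an empty data frame: duplicated(df[, c(1, 3)]) looks for repeated (id, treatment) pairs, and every such pair in this data is unique. The helper names n_info, ranked and result are introduced only for this illustration, and the approach assumes that "more info" means "more non-blank fields".

```r
id <- c("3235453", "3235453", "21354315", "21354315", "2121421")
Plan_name <- c("angers", "strasbourg", "Benzema", "angers", "montpellier")
service_line <- c("", "AMRS", "", "Therapy", "")
treatment <- c("", "MH", "", "MH", "")
df <- data.frame(id, Plan_name, treatment, service_line)

# No (id, treatment) pair repeats, so the original attempt selects nothing
nrow(df[duplicated(df[, c(1, 3)]), ])

# Count the non-blank info fields in each row
n_info <- rowSums(df[, c("treatment", "service_line")] != "")

# Put the fullest rows first, then keep the first row seen for each id
ranked <- df[order(-n_info), ]
result <- ranked[!duplicated(ranked$id), ]
result
```

Because the fuller rows are sorted to the top, duplicated() marks the emptier copies as the duplicates, and ids that occur only once are kept regardless of their content.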

Related

Extract values of another dataframe if the value of one column is a partial match in R

Sorry I didn't clarify my question.
My aim is: if dt$id partially matches df$id, extract df$score and add it to dt as a new column.
I have a dataframe like this:
df <- tibble(
score = c(2587,002,885,901,2587,3371,3372,002),
id = c("AR01.0","AR01.1","AR01.12","ERS02.00","ERS02.01","ERS02.02","QR01","QR01.03"))
And I have another dataframe like
dt <- tibble(
id = c("AR01","QR01","KVC"),
city = c("AM", "Bis","CHB"))
I want to add a new column "score" and get output like this:
id     city  score
AR01   AM    2587/2/885
ERS02  Bis   901/3371
KVC    CHB   NA
or
id     city  score  score2  score3
AR01   AM    2587   2       885
ERS02  Bis   901    3371    NA
KVC    CHB   NA     NA      NA
I tried to do this with ifelse but always got an error.
Can anyone provide ideas? Thank you.

A simple left_join (after mutating the id values in df) is all that is required:
library(dplyr)
library(stringr)
left_join(df %>% mutate(id = str_extract(id, "[\\w]+")), dt, by = "id") %>%
group_by(id) %>%
summarise(across(city,first),
score = paste(score, collapse = "/"))
# A tibble: 3 × 3
id city score
<chr> <chr> <chr>
1 AR01 AM 2587/2/885
2 ERS02 NA 901/2587/3371
3 QR01 Bis 3372/2
For the second solution you can use separate:
library(dplyr)
library(stringr)
library(tidyr)
left_join(df %>% mutate(id = str_extract(id, "[\\w]+")), dt, by = "id") %>%
group_by(id) %>%
summarise(across(city,first),
score = paste(score, collapse = "/")) %>%
separate(score,
into = paste("score", 1:3),
sep = "/" )
# A tibble: 3 × 5
id city `score 1` `score 2` `score 3`
<chr> <chr> <chr> <chr> <chr>
1 AR01 AM 2587 2 885
2 ERS02 NA 901 2587 3371
3 QR01 Bis 3372 2 NA

You could create groups by extracting everything before the . with sub, group_by that prefix, merge the scores in each group with paste (separated by /), and then right_join the result to dt by id, like this:
library(tibble)
df <- tibble(
score = c(2587,002,885,901,2587,3371,3372,002),
id = c("AR01.0","AR01.1","AR01.12","ERS02.00","ERS02.01","ERS02.02","QR01","QR01.03"))
dt <- tibble(
id = c("AR01","QR01","KVC"),
city = c("AM", "Bis","CHB"))
library(dplyr)
df %>%
mutate(id = sub('\\..*', "", id)) %>%
group_by(id) %>%
mutate(score = paste(score, collapse = '/')) %>%
distinct(id, .keep_all = TRUE) %>%
ungroup() %>%
right_join(., dt, by = 'id')
#> # A tibble: 3 × 3
#> score id city
#> <chr> <chr> <chr>
#> 1 2587/2/885 AR01 AM
#> 2 3372/2 QR01 Bis
#> 3 <NA> KVC CHB
Created on 2022-10-01 with reprex v2.0.2
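The same prefix-then-collapse idea can be sketched in base R without any packages. This is only an illustration under the assumption that the part of id before the first dot is the join key; the name collapsed is introduced here and is not part of the answers above.

```r
df <- data.frame(
  score = c(2587, 2, 885, 901, 2587, 3371, 3372, 2),
  id = c("AR01.0", "AR01.1", "AR01.12", "ERS02.00", "ERS02.01", "ERS02.02", "QR01", "QR01.03"))
dt <- data.frame(
  id = c("AR01", "QR01", "KVC"),
  city = c("AM", "Bis", "CHB"))

# Strip the first dot and everything after it to get the join key
df$id <- sub("\\..*", "", df$id)

# Collapse all scores that share a key into one "/"-separated string
collapsed <- aggregate(score ~ id, data = df,
                       FUN = function(x) paste(x, collapse = "/"))

# Keep every row of dt; ids with no match in df get NA for score
result <- merge(dt, collapsed, by = "id", all.x = TRUE)
result
```

merge() sorts the result by the key, so the rows come back in the order AR01, KVC, QR01 rather than in the original order of dt.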

Sampling by Group in R with no replacement but the final result cannot contain any repeats as well

I am trying to construct a control group. ID_1 is the original participant, ID_2 is the control. For simplicity's sake, they are matched by sex and age. I received a dataframe that looks like this:
ID_1 <- c(1,1,1,2,2,3,3,4,4,4)
Sex <- c("M","M","M","F","F","M","M","F","F","F")
Age <- c(23,23,23,35,35,44,44,35,35,35)
ID_2 <- c(321,322,323,630,631,502,503,630,631,632)
df <- data.frame(ID_1, Sex, Age, ID_2)
So I have several matches for each ID_1 and I want to sample within each group to get just one. I got that with:
library(dplyr)
random_ID_2 <- df %>% group_by(ID_1) %>% sample_n(size = 1, replace = F)
The problem is that I do not want any repeats of ID_2. By random chance, I could end up pairing ID_1 = 2 and ID_1 = 4 with the same control, ID_2 = 630.
How can I make sure this does not happen?
Thanks in advance.

If you can use a data.table solution:
library(data.table)
dt <- setnames(
unique(
setorder(
setDT(copy(df))[, idx := 1:.N, by = ID_1], # add an index column for each ID_1 group
idx, ID_1) # sort by idx, ID_1
# for each Sex/Age group, sample unique values of ID_2 without replacement (pad with NA)
[, ID_3 := c(sample(unique(ID_2)), rep(NA, .N - uniqueN(ID_2))), by = c("Sex", "Age")],
by = "ID_1") # get the first row for each ID_1 group
[, c(1:3, 6)], "ID_3", "ID_2") # remove helper columns and rename "ID_3" to "ID_2"

Here is one potential option that samples and if there is a duplicate it will resample:
# handles case where no samples left
my_sample <- function(x, ...){
if (length(x) == 0L) return(NA) else sample(x, ...)
}
df %>%
group_by(ID_1) %>%
slice_sample(n = 1) %>%
ungroup() %>%
mutate(resample = duplicated(ID_2)) %>%
rowwise() %>%
mutate(ID_2 = if (resample) my_sample(df[df$ID_1 == ID_1 & df$ID_2 != ID_2, "ID_2"], 1) else ID_2) %>%
ungroup() %>%
select(-resample)
One thing to note is that rows further down your data frame with a duplicate ID_2 are sampled conditionally: their match depends on the matches already made for the rows above them.
Output
set.seed(17) is a case where the same ID_2 is sampled:
df %>%
group_by(ID_1) %>%
slice_sample(n = 1)
ID_1 Sex Age ID_2
<dbl> <chr> <dbl> <dbl>
1 1 M 23 322
2 2 F 35 631
3 3 M 44 502
4 4 F 35 631
And to test the above code:
set.seed(17)
df %>%
group_by(ID_1) %>%
slice_sample(n = 1) %>%
ungroup() %>%
mutate(resample = duplicated(ID_2)) %>%
rowwise() %>%
mutate(ID_2 = if (resample) my_sample(df[df$ID_1 == ID_1 & df$ID_2 != ID_2, "ID_2"], 1) else ID_2) %>%
ungroup() %>%
select(-resample)
ID_1 Sex Age ID_2
<dbl> <chr> <dbl> <dbl>
1 1 M 23 322
2 2 F 35 631
3 3 M 44 502
4 4 F 35 632
Again, to emphasize my point above, ID_1 == 4 is sampled conditionally: we allow ID_1 == 2 to remain matched to ID_2 == 631 and change the match for ID_1 == 4.
How it works
Sample your data as you normally would.
Then we check for duplicates in ID_2. Note: duplicated returns TRUE for all subsequent duplicated IDs (the first occurrence is FALSE).
If a row needs to be resampled then we subset and sample from the original data frame with the line mutate(ID_2 = if ...)
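A different way to get the same guarantee is to assign controls one participant at a time and remove every control that is already taken from the pool. The sketch below is a greedy illustration in base R, not the method used in the answers above; the names assigned and pool are introduced here. It assumes that leaving a participant unmatched (NA) is acceptable when the pool runs dry, and a greedy pass is not guaranteed to find a complete matching for every data set even when one exists.

```r
ID_1 <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4)
Sex <- c("M", "M", "M", "F", "F", "M", "M", "F", "F", "F")
Age <- c(23, 23, 23, 35, 35, 44, 44, 35, 35, 35)
ID_2 <- c(321, 322, 323, 630, 631, 502, 503, 630, 631, 632)
df <- data.frame(ID_1, Sex, Age, ID_2)

set.seed(17)
ids <- unique(df$ID_1)
# One slot per participant, NA until a control is assigned
assigned <- setNames(rep(NA_real_, length(ids)), ids)

for (i in ids) {
  # Candidates for this participant, minus controls already used
  pool <- setdiff(df$ID_2[df$ID_1 == i], assigned)
  if (length(pool) > 0) {
    # sample.int avoids the sample(x) pitfall when pool has length 1
    assigned[as.character(i)] <- pool[sample.int(length(pool), 1)]
  }
}
assigned
```

Like the resampling answer, this is conditional sampling: participants processed later choose from whatever the earlier ones left behind, so shuffling ids first is one way to avoid always favouring the same participants.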

How to use slice in dplyr to keep the rows with NA values in R

I have the following dataset, and I want to know the min word for each group, and if there is no min word (it is NA), I still want to display it
df=data.frame(
key=c("A","A","B","B","C"),
word=c(1,2,3,5,NA))
df%>%group_by(key)%>%slice(which.min(word))
This drops the row key=C, word=NA, which I want to keep. The desired output is:
df_out=data.frame(
key=c("A","B","C"),
word=c(1,3,NA))

We can create a logical condition with is.na in filter and return the NA rows as well after doing the grouping by 'key'
library(dplyr)
df %>%
group_by(key) %>%
filter(word == min(word)|is.na(word))
Or using slice. We don't need any if/else condition
df %>%
group_by(key) %>%
slice(which(word ==min(word)|is.na(word)))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Or more compactly
df %>%
group_by(key) %>%
slice(match(min(word), word))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
NOTE: Using match returns the index of the first match.
which.min removes the NA
which.min(c(NA, 1, 3))
#[1] 2
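A short base R check makes the difference concrete for a group that contains only NA, like key C in the question. It uses only which.min, min and match; the variable all_na is introduced just for this illustration.

```r
# A group whose only value is NA
all_na <- NA_real_

# which.min ignores NA, so nothing is left and it returns integer(0);
# slice(integer(0)) then drops the whole group
length(which.min(all_na))

# min() returns NA here, and match() treats NA as matching NA,
# so the first row of the group is selected
match(min(all_na), all_na)
```

This is why the match-based version keeps key C while the which.min version loses it.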

We can check the condition with if: if every word in a group is NA we return the first row, otherwise we return the row with the minimum.
library(dplyr)
df %>%
group_by(key)%>%
slice(if(all(is.na(word))) 1L else which.min(word))
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Another option is to arrange the data by word and select the 1st row in each group.
df %>% arrange(key, word) %>% group_by(key) %>% slice(1L)

You can create a modified slice function with the tidyverse package that returns rows of NA for NA indices:
library(tidyverse)
slice_uneven = function(.data, .idx) {
.data_ = .data %>% add_row() # Add an extra row
.idx_ = .idx %>% c(NA) %>% replace_na(nrow(.data_)) # Replace NA with index of the extra row
.data_[.idx_,] %>% head(-1) %>% remove_rownames() %>% return() # Subset, remove extra row, and reset rownames before returning data
}
slice_uneven(cars, c(1, 2, 3, NA, NA, 3, 2))

You can also arrange by word and use distinct from dplyr to get the desired output.
library(dplyr)
df %>%
arrange(word) %>%
distinct(key, .keep_all = TRUE)
# key word
#1 A 1
#2 B 3
#3 C NA

Looping and concatenating based on a condition in R

I'm new to R and still struggling with loops.
I'm trying to create a loop where, based on a condition (variable_4 == 1), it will concatenate the content of variable_5, separated by comma.
data1 <- data.frame(
ID = c(123:127),
agent_1 = c('James', 'Lucas','Yousef', 'Kyle', 'Marisa'),
agent_2 = c('Sophie', 'Danielle', 'Noah', 'Alex', 'Marcus'),
agent_3 = c('Justine', 'Adrienne', 'Olivia', 'Janice', 'Josephine'),
Flag_1 = c(1,0,1,0,1),
Flag_2 = c(0,1,0,0,1),
Flag_3 = c(1,0,1,0,1)
)
data1$new_var<- ""
for(i in 2:10){
variable_4 <- paste0("flag_", i)
variable_5 <- paste0("agent_", i)
data1 <- data1 %>%
mutate(!! new_var = case_when(variable_4 == 1,paste(new_var, variable_5, sep=",")))
}
I created new_var in a previous step because the code was giving me an error that the variable was not found. Ideally, the loop will accumulate the contents of variable_5 only if variable_4 is equal to 1, and the result will be one big string, separated by commas.
The loop will paste in the new var only the name of the agents which the flags are = 1. If Flag_1=1, then paste the name of the agent in the new_var, if not, ignore. If flag_2 =1, then concatenate the name of the agent in the new var, separating by comma, if not, then ignore...

You shouldn't need to use a loop for this. The data is in wide format which makes it harder, but if we convert to long format, we can easily find a vectorized solution rather than using a loop.
The pivot_longer function is useful here which requires tidyr version >= 1.0.0.
library(tidyr)
library(dplyr)
pivot_longer(data1,
cols = -ID,
names_to = c(".value", "group"),
names_sep = "_") %>%
group_by(ID) %>%
mutate(new_var = paste0(agent[Flag==1], collapse = ',')) %>%
pivot_wider(names_from = c("group"),
values_from = c('agent', 'Flag'),
names_sep = '_') %>%
ungroup() %>%
select(ID, starts_with('agent'), starts_with('Flag'), new_var)
## A tibble: 5 x 8
# ID agent_1 agent_2 agent_3 Flag_1 Flag_2 Flag_3 new_var
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 123 James Sophie Justine 1 0 1 James,Justine
#2 124 Lucas Danielle Adrienne 0 1 0 Danielle
#3 125 Yousef Noah Olivia 1 0 1 Yousef,Olivia
#4 126 Kyle Alex Janice 0 0 0 ""
#5 127 Marisa Marcus Josephine 1 1 1 Marisa,Marcus,Josephine
Details:
pivot_longer puts our data into a more natural format where each row represents one observation of the variables agent and flag, rather than several:
pivot_longer(data1,
cols = -ID,
names_to = c(".value", "group"),
names_sep = "_")
## A tibble: 15 x 4
# ID group agent Flag
# <int> <chr> <chr> <chr>
# 1 123 1 James 1
# 2 123 2 Sophie 0
# 3 123 3 Justine 1
# 4 124 1 Lucas 0
# 5 124 2 Danielle 1
# 6 124 3 Adrienne 0
# ...
For each ID, we can then paste together the agents which have flag values of 1. This is easy now that our variables are contained in single columns.
Lastly, we revert back to the wide format with pivot_wider. We also ungroup the data we previously grouped, and re-order the columns to the desired format.
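If reshaping feels like too much for this task, the same row-wise concatenation can be sketched in base R by treating the agent and flag columns as two matrices. This is an alternative illustration, not the method described above; it assumes the agent_N and Flag_N columns line up by their numeric suffix, and the names agents and flags are introduced here.

```r
data1 <- data.frame(
  ID = c(123:127),
  agent_1 = c('James', 'Lucas', 'Yousef', 'Kyle', 'Marisa'),
  agent_2 = c('Sophie', 'Danielle', 'Noah', 'Alex', 'Marcus'),
  agent_3 = c('Justine', 'Adrienne', 'Olivia', 'Janice', 'Josephine'),
  Flag_1 = c(1, 0, 1, 0, 1),
  Flag_2 = c(0, 1, 0, 0, 1),
  Flag_3 = c(1, 0, 1, 0, 1)
)

# Agent names as a character matrix, flags as a logical matrix
agents <- as.matrix(data1[, paste0("agent_", 1:3)])
flags <- as.matrix(data1[, paste0("Flag_", 1:3)]) == 1

# For each row, keep the agents whose flag is 1 and join them with commas
data1$new_var <- vapply(
  seq_len(nrow(data1)),
  function(i) paste(agents[i, flags[i, ]], collapse = ","),
  character(1)
)
data1$new_var
```

Rows with no flag set produce an empty string, matching the "" shown for ID 126 in the output above.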

There are a few different ways to do this in base R, the tidyverse, or a combination of both. If you stick to the tidyverse, consider the following.
I have used mtcars in place of your dataframe.
#load dplyr or tidyverse
library(tidyverse)
# create data as mtcars
df <- mtcars
# create two new columns flag and agent as rownumbers
df <- df %>%
mutate(flag = paste0("flag", row_number())) %>%
mutate(agent = paste0("agent", row_number()))
# using ifelse in a mutate statement
df2 <- df %>%
mutate(new_column = ifelse(flag == "flag1", yes = paste0(agent, " this is a new variable"), no = flag))
print(df2)
An ifelse statement is probably more appropriate if you have one case, but if you have many, use case_when instead.

Multi-Field Text to Columns

I have a dataframe with 13 columns that has data separated by a "^" sign. What I'm trying to come up with is some code that would read each column and parse out the data in between the "^" into its own column.
I can do this on a single column but performing the function I want on each column has proved tricky.
This is easy to do on a single column of data.
#df = original dataset
#split first column based on '^' symbol -output is a list
df2 <-strsplit(as.character(df$`Col1`),"\\^")
#turn list into df again
df3 <-as.data.frame(do.call(rbind,df2),stringsAsFactors = F)
This gives me one dataframe with the text-to-columns output of 1 column. The problem is I have 12 other columns.
Original df example:
col1 col2 col3
baby^monkey cow^pig^sheep tree^root^grass^man
Desired Output:
Col1_1 Col1_2 Col2_1 Col2_2 Col2_3 Col3_1 Col3_2 Col3_3 Col3_4
baby monkey cow pig sheep tree root grass man

With a few functions from dplyr and tidyr, you can reshape the data into a long format, separate the strings by ^ into individual rows, make row numbers along the column groups, and spread back into wide shape.
library(tidyr)
library(dplyr)
df <- read.table(text = "col1 col2 col3
baby^monkey cow^pig^sheep tree^root^grass^man",
header = T, stringsAsFactors = F)
df %>%
gather(key, value) %>%
separate_rows(value, sep = "\\^") %>%
group_by(key) %>%
mutate(row = row_number()) %>%
unite(key, key, row) %>%
spread(key, value)
#> # A tibble: 1 x 9
#> col1_1 col1_2 col2_1 col2_2 col2_3 col3_1 col3_2 col3_3 col3_4
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 baby monkey cow pig sheep tree root grass man
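The single-column strsplit approach from the question can also be applied to every column in turn with lapply. The base R sketch below is an illustration only: it assumes every row within a given column splits into the same number of pieces (otherwise rbind would recycle values), and the names pieces and result are introduced here.

```r
df <- data.frame(
  col1 = "baby^monkey",
  col2 = "cow^pig^sheep",
  col3 = "tree^root^grass^man",
  stringsAsFactors = FALSE
)

# Split each column on the literal "^" and name the pieces <column>_<position>
pieces <- lapply(names(df), function(nm) {
  m <- do.call(rbind, strsplit(as.character(df[[nm]]), "^", fixed = TRUE))
  colnames(m) <- paste(nm, seq_len(ncol(m)), sep = "_")
  m
})

# Bind the per-column matrices side by side into one data frame
result <- as.data.frame(do.call(cbind, pieces), stringsAsFactors = FALSE)
result
```

Using fixed = TRUE treats "^" as a plain character, so no regex escaping is needed.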
