Find duplicates by id and restructure the dataset in R [duplicate] - r

I have a dataset with columns: id, names. There can be one id but multiple names, so I am getting duplicate id-rows at times:
id names
id1 name1
id1 name2
id1 name3
id2 name4
id2 name5
I need to restructure such a data.frame in R, so that all rows would have unique ids, and if there are multiple names, they all should be written into the names column as comma separated values like that:
id names
id1 name1, name2, name3
id2 name4, name5
I tried grouped <- table %>% group_by(names) but it did not work.
How could I achieve that in R?

Using data.table:
df <- read.table(header=T, text="id names
id1 name1
id1 name2
id1 name3
id2 name4
id2 name5")
library(data.table)
setDT(df)
df[, names := as.character(names)]
df[, names := paste0(names, collapse = ", "), by = id]
df <- unique(df)
Output:
df
id names
1: id1 name1, name2, name3
2: id2 name4, name5
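If adding a package is not an option, the same collapse can be sketched in base R with aggregate(). This is an alternative to the data.table answer above, not a restatement of it; the result name `collapsed` is only illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(
  id    = c("id1", "id1", "id1", "id2", "id2"),
  names = c("name1", "name2", "name3", "name4", "name5"),
  stringsAsFactors = FALSE
)

# One row per id; the names of each id are pasted into one string
collapsed <- aggregate(
  names ~ id,
  data = df,
  FUN  = function(x) paste(x, collapse = ", ")
)
collapsed
```

Within data.table, the two-step update plus unique() could likely be shortened to a single grouped summary of the form df[, .(names = paste(names, collapse = ", ")), by = id].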

Related

How to subset data frame with another data frame's columns and give name to subsets R

I have one data frame (Table_A) with 3.4 million rows and 33 columns. I have another data frame with 384 rows and 3 columns (Table_B). (This is for one participant, I should have 40 at the end)
Table_A
Col1
100
143
178
245
265
Table_B
start
stop
name
101
144
Name1
154
254
Name2
What I want to do is subset Table A by Col1, by start and stop columns in Table B and give each subset row a name. To return
Table_A adapted
Col1
name
143
Name1
178
Name2
245
Name2
I have tried
df_sub <- subset(Table_A, (Col1 >= (Table_B$start)) & (Col1 <= (Table_B$stop)))
names <- Table_B$name[(Table_A$Col1 >= (Table_B$start)) & (Table_A$Col1 <= (Table_B$stop))]
df_out <- cbind(df_sub, names)
However, df_sub seems to return only one or two rows per subset, when there should be ~187 in one half (192 subsets) and ~375 in the other half, whereas names returns more than 2 million rows.
I tried
(Table_A$Col1 >= (Table_B$start)) & (Table_A$Col1 <= (Table_B$stop))
and this returns a vector of FALSE up to element 384 and NA after that.
This is a job for a data.table non-equi join:
library(data.table)
# if tablea and tableb are not already data.table
setDT(tablea);setDT(tableb)
# non-equi join
tablea[tableb, name := i.name, on = .(Col1 >= start, Col1 <= stop)]
# Col1 name
# 1: 100 <NA>
# 2: 143 Name1
# 3: 178 Name2
# 4: 245 Name2
# 5: 265 <NA>
Sample data:
tablea <- fread("Col1
100
143
178
245
265")
tableb <- fread("start stop name
101 144 Name1
154 254 Name2")
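The update join keeps every row of tablea and leaves NA where no range matches, while the question asks only for the rows that fall inside a range. A minimal sketch of that last step, assuming the same sample tables (the name `matched` is made up for illustration):

```r
library(data.table)

# Same sample data as above, built directly instead of via fread()
tablea <- data.table(Col1 = c(100, 143, 178, 245, 265))
tableb <- data.table(
  start = c(101, 154),
  stop  = c(144, 254),
  name  = c("Name1", "Name2")
)

# Non-equi update join: add 'name' to tablea by reference
tablea[tableb, name := i.name, on = .(Col1 >= start, Col1 <= stop)]

# Keep only the rows that fell inside some range
matched <- tablea[!is.na(name)]
matched
```

Filtering after the update join avoids the column renaming that a direct non-equi join result introduces.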
Example using a for loop.
Say we have the following dataframes
#dt, corresponds to table a
dt <- structure(list(col1 = c(100, 143, 178, 245, 265)), class = "data.frame", row.names = c(NA,
-5L))
#dt2, corresponds to table b
dt2 <- structure(list(start = c(101, 154), stop = c(144, 254), name = c("Name1",
"Name2")), class = "data.frame", row.names = c(NA, -2L))
We could loop through the rows of table B and add the names to table A if col1 falls within the range. Using data.table:
setDT(dt) #table A to data.table
setDT(dt2) #table B to data.table
#add column 'name'
dt[, name := ""]
#loop through table B
for(i in seq_len(nrow(dt2))){
dt[col1 >= dt2[i,]$start & col1 <= dt2[i,]$stop, name := paste0(name, dt2[i,]$name)]
}
Output
dt
col1 name
1: 100
2: 143 Name1
3: 178 Name2
4: 245 Name2
5: 265
Note that I used paste0 when adding the name. Otherwise, names would be overwritten inside the loop if the ranges in table B overlap. As written, if a value falls within two ranges in table B, the name column gets both names pasted one after the other.
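To see the effect of paste0 with overlapping ranges, here is a small base R sketch of the same loop. The overlapping ranges below are made up for illustration and are not the ranges from the question.

```r
# Table A with three values; table B with two deliberately overlapping ranges
dt <- data.frame(col1 = c(100, 143, 178))
ranges <- data.frame(
  start = c(101, 140),
  stop  = c(150, 180),
  name  = c("Name1", "Name2"),
  stringsAsFactors = FALSE
)

dt$name <- ""
for (i in seq_len(nrow(ranges))) {
  # rows of dt whose col1 lies inside the i-th range
  hit <- dt$col1 >= ranges$start[i] & dt$col1 <= ranges$stop[i]
  # append rather than overwrite, so earlier matches are kept
  dt$name[hit] <- paste0(dt$name[hit], ranges$name[i])
}
dt
```

Here 143 lies in both ranges and ends up as "Name1Name2", 178 lies only in the second range, and 100 matches nothing and keeps the empty string.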

Replace values in one column by taking values from another column

After asking one question this morning, I would now like to ask about another way to do the replacement, since I am waiting for my teacher to confirm the species name.
I have a dataframe like this (the real df is the result of removing duplicated rows):
df <- data.frame(name1 = c("a" , "b", "c", "a"),
name2 = c("x", NA, NA, NA),
name3 = c(NA, "b1", "c1", NA),
name4 = c("x", "b1", "c1", "a"))
name1 name2 name3 name4
1 a x <NA> x
2 b <NA> b1 b1
3 c <NA> c1 c1
4 a <NA> <NA> a
Can we replace a with x by checking whether the value in the name4 column matches the value in the name1 column?
I do not want to assign x directly here, since my data is likely to contain many cases like this. Any suggestions, please? (Base R is also fine for me, since I would love to learn more.)
Desired output
name1 name2 name3 name4
1 a x <NA> x
2 b <NA> b1 b1
3 c <NA> c1 c1
4 a <NA> <NA> x
My explanation for the table and my expectation:
I have 3 columns, name1, name2 and name3 (after removing duplicated rows). The name4 column is the final column and holds the value I want from the 3 previous columns. The value in name2 is my first priority, then the value in name3.
In my fourth row, since name2 is NA, I took "a" from the name1 column. I am wondering whether I can replace a with x without assigning x directly, i.e. if the value in name4 (a) equals the value in name1 (a), then a in name4 is replaced by the x found in name2 or name4.
Your criteria for defining name4, as I understand them, are:
1. Use name2 from the same row if available.
2. Otherwise, use name3 from the same row if available.
3. Otherwise, leave it missing (for now).
4. Fill missing name4 values with name4 values from previous rows that share the same name1 value.
If you want a tidyverse-based solution:
library(dplyr)
library(tidyr)
df <- data.frame(name1 = c("a" , "b", "c", "a"),
name2 = c("x", NA, NA, NA),
name3 = c(NA, "b1", "c1", NA))
result <- df %>%
mutate(name4 = case_when(
#!is.na(name4) ~ name4, # when name4 is not missing, use it? If you like...
!is.na(name2) ~ name2, # when name2 is not missing, use it
!is.na(name3) ~ name3, # when name3 is not missing, use it
TRUE ~ NA_character_ # leave a NA for now otherwise
)) %>%
group_by(name1) %>%
fill(name4, .direction = c("down")) %>% # Fill each group looking at the previous non-missing row.
ungroup()
Returns:
# A tibble: 4 × 4
name1 name2 name3 name4
<chr> <chr> <chr> <chr>
1 a x NA x
2 b NA b1 b1
3 c NA c1 c1
4 a NA NA x
Note that fill can fill in several directions, you could use "downup" if you want to first fill from top to bottom and then bottom to top.
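Since the question mentions that base R is fine too, the same idea can be sketched without packages. One assumption differs from fill(.direction = "down"): this version fills a missing name4 with the first non-missing value anywhere in its name1 group, not strictly from an earlier row. The helper `first_known` is a hypothetical name.

```r
df <- data.frame(
  name1 = c("a", "b", "c", "a"),
  name2 = c("x", NA, NA, NA),
  name3 = c(NA, "b1", "c1", NA),
  stringsAsFactors = FALSE
)

# Steps 1-3: prefer name2, then name3, otherwise NA
df$name4 <- ifelse(!is.na(df$name2), df$name2, df$name3)

# Hypothetical helper: first non-missing value of a vector, or NA
first_known <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) known[1] else NA_character_
}

# Step 4: per name1 group, compute the fill value for every row
fill_val <- ave(df$name4, df$name1,
                FUN = function(x) rep(first_known(x), length(x)))

# Only rows that are still missing take the group's fill value
df$name4 <- ifelse(is.na(df$name4), fill_val, df$name4)
df
```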
You can group by name1 and if name1 and name4 are equal replace the name4 value with 1st non-NA value available.
library(dplyr)
df %>%
group_by(name1) %>%
mutate(name4 = ifelse(name1 == name4, na.omit(unlist(cur_data()))[1], name4)) %>%
ungroup
# name1 name2 name3 name4
# <chr> <chr> <chr> <chr>
#1 a x NA x
#2 b NA b1 b1
#3 c NA c1 c1
#4 a NA NA x
You can do it like this:
df[which(df$name1==df$name4), "name4"] <- "x"
Basically this means subsetting your dataframe selecting rows, in which name1 == name4, and name4 column, then changing these values to "x"
Base R ifelse solution:
df$name4 <- ifelse(df$name1 == df$name4, "x", df$name4)
Based on your update, using dplyr's first:
library(dplyr)
df$name4 <- ifelse(df$name1 == df$name4, first(df$name4), df$name4)
This does the following:
Checks whether name1 is equal to name4.
If it is, replaces the value of name4 with the first value occurring in name4.
Result:
name1 name2 name3 name4
1 a x <NA> x
2 b <NA> b1 b1
3 c <NA> c1 c1
4 a <NA> <NA> x

Filter data.table inside function with varying filter arguments and vectorised fifelse()

Task: I have a large numeric data.table (columns 1:3 are character required for subsetting) inside a user defined function to perform analyses on the data.
Prior to downstream analysis within the function, I'd like to use the function's parameters to pass in values that filter the data.table, or, if no values are supplied for a given column filter, to use all rows.
However, difficulty arises as not all column filters will be used all of the time.
Currently, my approach has been to use separate blocks of if else statements to filter to get a list of IDs for each column, which I then intersect, before using this to subset the data.table. The example included states 3 columns, but there are more in the actual data which makes this approach clunky and inefficient (although it works).
Query: Is there a method to filter data.table directly, given that it is unknown which filtered columns will be used? Or, is there a way to vectorise the process with ifelse() or fifelse()?
I'd even take help with embedding it in a for loop, but for that to work I'd need to dynamically create variable names to store each ID list.
I'd like to keep the solution to using data.table and base functions. Also, the function's parameter names can be changed to be the same as the column names of the data.table if that makes coding and readability easier.
I appreciate any help offered.
Data:
# Install data.table package if not installed and load
if (!require("data.table")) {
install.packages("data.table")
library(data.table)
}
# Data (example)
head(DT, n=2)
#> ID info1 info2 name1 name2 name3 name4
#> <char> <char> <char> <dbl> <dbl> <dbl> <dbl>
#> 1: A100 StuffA StuffX 0.1460 NA -0.019 0.2102
#> 2: A101 StuffA StuffY 0.0987 -1.307 -0.174 NA
Current method:
# Function (example)
myfunc <- function(ID_filter = "", info1_filter = "", info2_filter = "") {
# Get the IDs that pass each column filter (all IDs if the filter is unset)
if (ID_filter != "") {
ID_list <- DT[ID %in% ID_filter][["ID"]]
} else {
ID_list <- DT[["ID"]]
}
if (info1_filter != "") {
info1_list <- DT[info1 %in% info1_filter][["ID"]]
} else {
info1_list <- DT[["ID"]]
}
if (info2_filter != "") {
info2_list <- DT[info2 %in% info2_filter][["ID"]]
} else {
info2_list <- DT[["ID"]]
}
# Get overall filter list: IDs that pass every filter
filters <- Reduce(intersect, list(ID_list, info1_list, info2_list))
# Subset data.table and return the result
DT[ID %in% filters]
}
The function could be simplified with
myfunc <- function(DT, ID_filter = "", info1_filter = "", info2_filter = "") {
lst1 <- Filter(function(x) all(x != ""),
dplyr::lst(ID_filter, info1_filter, info2_filter))
if(length(lst1) > 0 ) {
pat <- paste(sub("_filter", "", names(lst1)), collapse="|")
i1 <- DT[, Reduce(`&`, Map(`%in%`, .SD, lst1)), .SDcols = patterns(pat)]
DT[i1]
} else DT
}
Testing:
setDT(DT)
myfunc(DT, ID_filter = "A100")
# ID info1 info2 name1 name2 name3 name4
#1: A100 StuffA StuffX 0.146 NA -0.019 0.2102
myfunc(DT, ID_filter = "A100", info1 = "StuffA")
# ID info1 info2 name1 name2 name3 name4
#1: A100 StuffA StuffX 0.146 NA -0.019 0.2102
myfunc(DT, ID_filter = "A100", info1 = "StuffA", info2 = "StuffX")
# ID info1 info2 name1 name2 name3 name4
#1: A100 StuffA StuffX 0.146 NA -0.019 0.2102
myfunc(DT, ID_filter = "A100", info1 = "StuffA", info2 = "StuffY")
#Empty data.table (0 rows and 7 cols): ID,info1,info2,name1,name2,name3...
myfunc(DT) # if all the parameters are "", return the full data
# ID info1 info2 name1 name2 name3 name4
#1: A100 StuffA StuffX 0.1460 NA -0.019 0.2102
#2: A101 StuffA StuffY 0.0987 -1.307 -0.174 NA
data
DT <- structure(list(ID = c("A100", "A101"), info1 = c("StuffA", "StuffA"
), info2 = c("StuffX", "StuffY"), name1 = c(0.146, 0.0987), name2 = c(NA,
-1.307), name3 = c(-0.019, -0.174), name4 = c(0.2102, NA)),
class = "data.frame", row.names = c(NA,
-2L))
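Because the question asks for data.table and base functions only, here is a hedged sketch of the same idea without dplyr::lst. It assumes the filter arguments are named after the columns (which the question allows), and `filter_rows` is a made-up name. It relies only on `[[` and logical row indexing, so it is shown on a plain data.frame.

```r
# Hypothetical helper: filters are passed as column = allowed values;
# filters left as "" (or NULL) are ignored
filter_rows <- function(DT, ...) {
  filters <- list(...)
  filters <- Filter(function(x) !is.null(x) && !identical(x, ""), filters)
  if (length(filters) == 0L) return(DT)  # nothing to filter on
  # one logical vector per active filter, combined with AND
  keep <- Reduce(`&`, Map(function(col, vals) DT[[col]] %in% vals,
                          names(filters), filters))
  DT[keep, ]
}

DT <- data.frame(
  ID    = c("A100", "A101"),
  info1 = c("StuffA", "StuffA"),
  info2 = c("StuffX", "StuffY"),
  name1 = c(0.146, 0.0987),
  stringsAsFactors = FALSE
)

filter_rows(DT, ID = "A100", info1 = "StuffA")  # one matching row
filter_rows(DT, ID = "A100", info2 = "StuffY")  # no matching rows
filter_rows(DT)                                 # all rows
```

Naming the filters after the columns removes the need to strip a "_filter" suffix and match columns by pattern.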

Remove rows found in more than 3 groups

I have a dataframe, and I am trying to remove the rows whose value is present in >= 3 groups. In my example below, bike is the common value across 3 groups, and I need to remove it. Please help me achieve this.
df <- data.frame(a = c("name1","name1","name1","name2","name2","name2","name3"), b=c("car","bike","bus","train","bike","tour","bike"))
df
a b
name1 car
name1 bike
name1 bus
name2 train
name2 bike
name2 tour
name3 bike
Expected Output:
a b
name1 car
name1 bus
name2 train
name2 tour
You can use dplyr::n_distinct:
library(dplyr)
n_gr <- 3
cn <- df %>% group_by(b) %>% summarise(na = n_distinct(a)) %>%
filter(na >= n_gr) %>% pull(b)
df <- df %>% filter(!(b %in% cn))
Output
a b
1 name1 car
2 name1 bus
3 name2 train
4 name2 tour
In base R you could do this...
df[ave(as.numeric(as.factor(df$a)), #convert a to numbers (factor levels) (required by ave)
df$b, #group by b
FUN=length) < 3, ] #return whether no of a's per b is less than 3
a b
1 name1 car
3 name1 bus
4 name2 train
6 name2 tour
Using data.table:
library(data.table)
setDT(df)[, count := .N, by = b] ## convert df to data.table & create a column to count groups
df <- df[!(count >= 3), ] ## delete rows that have count equal to 3 or more than 3
df[, count := NULL] ## delete the column created
df
a b
1: name1 car
2: name1 bus
3: name2 train
4: name2 tour
Using Base R:
df <- data.frame(a = c("name1","name1","name1","name2","name2","name2","name3"), b=c("car","bike","bus","train","bike","tour","bike"))
df
lst <- table(df$b)
df[!(df$b %in% names(lst)[lst >= 3]), ]
# a b
# 1 name1 car
# 3 name1 bus
# 4 name2 train
# 6 name2 tour
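The row-count approaches above give the right answer here because each value of b appears at most once per group. If a value could repeat within a group, counting distinct groups is safer. A base R sketch under that assumption (the names `groups_per_value` and `common` are illustrative):

```r
df <- data.frame(
  a = c("name1", "name1", "name1", "name2", "name2", "name2", "name3"),
  b = c("car", "bike", "bus", "train", "bike", "tour", "bike"),
  stringsAsFactors = FALSE
)

# For each value of b, count how many distinct groups (a) contain it
groups_per_value <- tapply(df$a, df$b, function(x) length(unique(x)))

# Values of b that occur in 3 or more groups
common <- names(groups_per_value)[groups_per_value >= 3]

# Drop the rows holding those values
result <- df[!(df$b %in% common), ]
result
```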

R - Combining duplicate rows within dataframe in R :

I have a dataframe as below. Please note that COL1 has duplicate entries:
COL1 COL2 COL3
10 hai 2
10 hai 3
10 pal 1
I want the output to look as shown below, i.e. COL1 should contain only the unique entry (10), COL2 should contain the merged entries without duplicates (hai pal), and COL3 should contain the sum of the entries (2+3+1=6).
OUTPUT:
COL1 COL2 COL3
10 hai pal 6
Perhaps we need to aggregate by group. Convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'COL1', paste the unique elements of 'COL2' together, and take the sum of 'COL3'.
library(data.table)
setDT(df1)[,.(COL2 = paste(unique(COL2), collapse=" "), COL3= sum(COL3)) , by = COL1]
# COL1 COL2 COL3
#1: 10 hai pal 6
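For comparison, a base R sketch of the same aggregation, with the example data rebuilt so it can be run as-is. This is an alternative to the data.table call, and the intermediate names `col2`, `col3` and `out` are made up.

```r
df1 <- data.frame(
  COL1 = c(10, 10, 10),
  COL2 = c("hai", "hai", "pal"),
  COL3 = c(2, 3, 1),
  stringsAsFactors = FALSE
)

# Unique COL2 entries per COL1, pasted together with a space
col2 <- aggregate(COL2 ~ COL1, data = df1,
                  FUN = function(x) paste(unique(x), collapse = " "))

# Sum of COL3 per COL1
col3 <- aggregate(COL3 ~ COL1, data = df1, FUN = sum)

# Combine the two summaries on the grouping column
out <- merge(col2, col3, by = "COL1")
out
```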
