I have two dataframes organised like this.
df1 <- data.frame(lastname = c("Miller", "Smith", "Grey"),
firstname = c("John", "Jane", "Hans")
)
df2 <- data.frame(lastname =c("Smith", "Grey"),
firstname = c("Jane", "Hans")
)
df2 is not necessarily a subset of df1. Duplicated entries are also possible.
My goal is to keep a copy of df1 that contains only the entries occurring in both data frames. Alternatively, I would like to end up with df1 plus a new variable indicating whether the name is also an element of df2.
Can someone suggest a way to do this? A {dplyr} attempt is totally fine.
Desired output for this particular simple case:
res <- data.frame(lastname = c("Smith", "Grey"),
firstname = c("Jane", "Hans")
)
To cover the "alternatively" part of the question as well, here is an approach with left_join. It adds a grouping variable grp to each data frame to distinguish the two sets.
library(dplyr)
left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffix=c("_A", "_B"))
lastname firstname grp_A grp_B
1 Miller John A <NA>
2 Smith Jane A B
3 Grey Hans A B
or with base R merge
merge(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffixes=c("_A", "_B"), all=T)
lastname firstname grp_A grp_B
1 Grey Hans A B
2 Miller John A <NA>
3 Smith Jane A B
To remove the NA rows and combine the grp columns:
na.omit(left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffix=c("_A", "_B"))) %>%
summarize(lastname, firstname,
grp = list(across(starts_with("grp"), ~ unique(.x))))
lastname firstname grp
1 Smith Jane A, B
2 Grey Hans A, B
The other part is simply
merge(df1, df2)
lastname firstname
1 Grey Hans
2 Smith Jane
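Because the question mentions that duplicated entries are possible, a join can multiply rows when a name occurs more than once in df2. A minimal base R sketch that avoids this by comparing one key per row is shown here; the names key1, key2, flagged, and in_df2 are made up for illustration, and the separator is assumed not to occur in any name.

```r
df1 <- data.frame(lastname = c("Miller", "Smith", "Grey"),
                  firstname = c("John", "Jane", "Hans"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(lastname = c("Smith", "Grey"),
                  firstname = c("Jane", "Hans"),
                  stringsAsFactors = FALSE)

# One key per row; "\r" is an arbitrary separator assumed absent from the names
key1 <- paste(df1$lastname, df1$firstname, sep = "\r")
key2 <- paste(df2$lastname, df2$firstname, sep = "\r")

# Variant 1: keep all of df1 and flag the rows that also occur in df2
flagged <- transform(df1, in_df2 = key1 %in% key2)

# Variant 2: keep only the rows of df1 that also occur in df2
res <- df1[key1 %in% key2, ]
```

Unlike merge(df1, df2), this keeps the row order of df1 and never adds rows, whatever duplicates df2 contains.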
Problem
I have some data. I would like to flag the same instance (e.g. a person, company, machine, whatever) in my data by a unique ID. The data actually has some IDs but they are either not always present or one instance has different IDs.
What I am trying to achieve is to use these IDs along with individual information to find the same instance and assign a unique ID to it.
I found a solution, but it is highly inefficient. I would appreciate either tips to improve the performance of my code or, probably more promising, another approach.
Code
Example Data
dt1 <- data.table(id1 = c(1, 1, 2, 3, 4),
id2 = c("A", "B", "A", "C", "D"),
surname = "Smith",
firstname = c("John", "John", "Joe", "Joe", "Jack"))
dt1
#> id1 id2 surname firstname
#> 1: 1 A Smith John
#> 2: 1 B Smith John
#> 3: 2 A Smith Joe
#> 4: 3 C Smith Joe
#> 5: 4 D Smith Jack
Current Solution
find_grp <- function(dt,
by) {
# keep necessary variables only
dtx <- copy(dt)[, .SD, .SDcols = c(unique(unlist(by)))]
# unique data.table to improve performance
dtx <- unique(dtx)
# assign a row id column
dtx[, ID := .I]
# for every row and every by group, find all rows that match each row
# on at least one condition
res <- lapply(X = dtx$ID,
FUN = function(i){
unique(unlist(lapply(X = by,
FUN = function(by_sub) {
merge(dtx[ID == i, ..by_sub],
dtx,
by = by_sub,
all = FALSE)$ID
}
)))
})
res
print("merge done")
# keep all unique matching rows
l <- unique(res)
# combine matching rows together, if there is at least one overlap between
# two groups.
# repeat until all row-groups are completely disjoint from one another
repeat{
l1 <- l
iterator <- seq_len(length(l1))
for (i in iterator) {
for (ii in iterator[-i]) {
# is there any overlap between both row-groups
if (length(intersect(l1[[i]], l1[[ii]])) > 0) {
l1[[i]] <- sort(union(l1[[i]], l1[[ii]]))
}
}
}
if (isTRUE(all.equal(l1, l))) {
break
} else {
l <- unique(l1)
}
}
print("repeat done")
# use result to assign a groupId to the helper data.table
Map(l,
seq_along(l),
f = function(ll, grp) dtx[ID %in% ll, ID_GRP := grp])
# remove helper Id
dtx[, ID := NULL]
# assign the groupId to the original data.table
dt_out <- copy(dt)[dtx,
on = unique(unlist(by)),
ID_GRP := ID_GRP]
return(dt_out[])
}
Result
find_grp(dt1, by = list("id1",
"id2"
, c("surname", "firstname"))
)
#> [1] "merge done"
#> [1] "repeat done"
#> id1 id2 surname firstname ID_GRP
#> 1: 1 A Smith John 1
#> 2: 1 B Smith John 1
#> 3: 2 A Smith Joe 1
#> 4: 3 C Smith Joe 1
#> 5: 4 D Smith Jack 2
As you can see, ID_GRP is identified because
the first two rows share id1
since id2 for id1 = 1 contains A, row 3 with id2 = A belongs to the same group.
finally, all Joe Smith rows belong to the same group as well because that's the name in row 3
so on and so forth
only row 5 is completely unrelated
{data.table} solutions are preferred
This might help you, though I'm not sure I've completely understood your question. I've written a function, gen_grp(), that takes a data table d and a vector of variables v. It steps through each unique id1 and replaces id1 values if matches of certain types are found.
gen_grp <- function(d,v) {
for(id in unique(d$id1)) {
d[id2 %in% d[id1==id,id2], id1:=id]
k=unique(d[id1==id, ..v])[,t:=id]
d = k[d, on=v][!is.na(t), id1:=t][, t:=NULL]
}
d[, grp:=rleid(id1)]
return(d[])
}
Usage:
gen_grp(dt1,c("surname","firstname"))
Output:
surname firstname id1 id2 grp
<char> <char> <num> <char> <int>
1: Smith John 1 A 1
2: Smith John 1 B 1
3: Smith Joe 1 A 1
4: Smith Joe 1 C 1
5: Smith Jack 4 D 2
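The grouping rule in the question (rows belong together if they are linked, directly or through other rows, by any one of the by conditions) can also be read as finding connected components. Here is a rough base R sketch using a simple union-find; group_rows() is a hypothetical helper, not part of data.table, and it assumes there are no missing values in the key columns (NA values would be pasted as "NA" and treated as equal).

```r
dt1 <- data.frame(id1 = c(1, 1, 2, 3, 4),
                  id2 = c("A", "B", "A", "C", "D"),
                  surname = "Smith",
                  firstname = c("John", "John", "Joe", "Joe", "Jack"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: returns one group number per row of d
group_rows <- function(d, by) {
  d <- as.data.frame(d)               # also accepts a data.table
  parent <- seq_len(nrow(d))          # every row starts as its own root
  find <- function(i) {               # follow parent links up to the root
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (cols in by) {
    # rows with equal values in `cols` get the same key
    key <- do.call(paste, c(d[cols], sep = "\r"))
    first <- match(key, key)          # first row carrying each key
    for (i in seq_along(first)) {
      a <- find(i)
      b <- find(first[i])
      if (a != b) parent[max(a, b)] <- min(a, b)   # union the two sets
    }
  }
  roots <- vapply(seq_len(nrow(d)), find, integer(1))
  match(roots, unique(roots))         # renumber roots as 1, 2, ...
}

dt1$ID_GRP <- group_rows(dt1, by = list("id1", "id2", c("surname", "firstname")))
```

Each by condition is handled in a single pass over the rows, so this avoids the row-by-row merges and the repeated pairwise comparison of groups.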
I have a joining problem that I'm struggling with in that the join IDs I want to use for separate dataframes are spread out across three possible ID columns. I'd like to be able to join if at least one join ID matches. I know the _join and merge functions accept a vector of column names but is it possible to make this work conditionally?
For example, if I have the following two data frames:
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
id1 = c("abc", "", "bcd"),
id2 = c("", "", "xyz"),
id3 = c("def", "fgh", ""), stringsAsFactors = F)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
id1 = c("abc", "", ""),
id2 = c("", "xyz", "zzz"),
id3 = c("", "", ""), stringsAsFactors = F)
> df_A
dta id1 id2 id3
1 FOO abc def
2 BAR fgh
3 GOO bcd xyz
> df_B
dta id1 id2 id3
1 FUU abc
2 PAR xyz
3 KOO zzz
I hope to end up with something like this:
dta.x dta.y id1 id2 id3
1 FOO FUU abc "" def [matched on id1]
2 BAR "" "" "" fgh [unmatched]
3 GOO PAR bcd xyz "" [matched on id2]
4 KOO "" "" zzz "" [unmatched]
So that unmatched dta.x and dta.y values are retained, but where there is a match (rows 1 and 3 above) both are joined in the new table. I have a sense that neither _join, merge, nor match will work as is and that I'd need to write a function, but I'm not sure where to start. Any help or ideas appreciated. Thank you
Basically, what you want to do is join by corresponding IDs. To do that, you can convert the original id columns into two columns, id_column and id_value. Because you don't want to join on "", I dropped those rows.
library(tidyverse)
df_A_long <- df_A %>%
pivot_longer(
cols = -dta,
names_to = "id_column",
values_to = "id_value"
) %>%
dplyr::filter(id_value != "")
df_B_long <- df_B %>%
pivot_longer(
cols = -dta,
names_to = "id_column",
values_to = "id_value"
) %>%
dplyr::filter(id_value != "")
We always use id_column and id_value to join A & B.
> df_B_long
# A tibble: 3 x 3
dta id_column id_value
<chr> <chr> <chr>
1 FUU id1 abc
2 PAR id2 xyz
3 KOO id2 zzz
The joining part is clear, but to create your desired output, we need to do some data wrangling to make it look identical.
df_joined <- df_A_long %>%
# join using id_column and id_value
full_join(df_B_long, by = c("id_column","id_value"),suffix = c("1","2")) %>%
# pivot back to wide format
pivot_wider(
id_cols = c(dta1,dta2),
names_from = id_column,
values_from = id_value
) %>%
# if dta1 is missing, then in the same row, move value from dta2 to dta1
mutate(
dta1_has_value = !is.na(dta1), # helper column
dta1 = ifelse(dta1_has_value,dta1,dta2),
dta2 = ifelse(!dta1_has_value & !is.na(dta2),NA,dta2)
) %>%
select(-dta1_has_value) %>%
group_by(dta1) %>%
# condense multiple rows into one row
summarise_all(
~ifelse(all(is.na(.x)),"",.x[!is.na(.x)])
) %>%
# reorder columns (use "." here, since df_joined does not exist yet at this point)
{
.[sort(colnames(.))]
}
Result:
> df_joined
# A tibble: 4 x 5
dta1 dta2 id1 id2 id3
<chr> <chr> <chr> <chr> <chr>
1 BAR "" "" "" fgh
2 FOO FUU abc "" def
3 GOO PAR bcd xyz ""
4 KOO "" "" zzz ""
library(sqldf)
one <-
sqldf('
select a.*
, b.dta as dta_b
from df_A a
left join df_B b
on a.id1 <> ""
and (
a.id1 = b.id1
or a.id2 = b.id2)
')
two <-
sqldf('
select b.*
from df_B b
left join one
on b.dta = one.dta
or b.dta = one.dta_b
where one.dta is null
')
dplyr::bind_rows(one, two)
# dta id1 id2 id3 dta_b
# 1 FOO abc def FUU
# 2 BAR fgh <NA>
# 3 GOO bcd xyz PAR
# 4 KOO zzz <NA>
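For comparison, the "match on at least one ID" idea can be sketched in base R with match(), checking one ID column at a time. match_any() is a hypothetical helper; it assumes "" means "no ID" and keeps only the first matching row of df_B for each row of df_A.

```r
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)

# Hypothetical helper: for each row of a, the index of the first row of b
# that shares a non-empty value in any of the columns in cols (NA if none)
match_any <- function(a, b, cols) {
  hit <- rep(NA_integer_, nrow(a))
  for (col in cols) {
    m <- match(a[[col]], b[[col]])
    m[a[[col]] == ""] <- NA           # empty strings are not real IDs
    hit <- ifelse(is.na(hit), m, hit) # keep the earliest column that matched
  }
  hit
}

hit <- match_any(df_A, df_B, c("id1", "id2", "id3"))
matched <- cbind(df_A, dta_b = df_B$dta[hit], stringsAsFactors = FALSE)
unmatched_B <- df_B[setdiff(seq_len(nrow(df_B)), hit), ]
```

Here matched holds every row of df_A with its partner from df_B (or NA), and unmatched_B holds the rows of df_B that were never used; the two can then be stacked as in the sqldf answer.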
I have a data set in chronological order which I have imported to R using:
mydata <- read.csv(file="test.csv",stringsAsFactors=FALSE)
Two of the columns in the data set are 'winner' and 'loser'. Each row in the data is a tennis match.
What I am looking to do is to add two columns which give me a cumulative count of the total matches the player in the 'winner' column has played up to and including the match on that row. And the same count for the 'loser' in that row.
So for example it would look like this:
winner loser winner_matches loser_matches
tom andy 1 1
andy greg 2 1
greg tom 2 2
I hope that makes sense.
I have tried using the following code but can't get it to work across both columns:
ave(mydata$winner_name==mydata$winner_name, mydata$winner_name, FUN=cumsum)
So the data below is the first 10 rows of around 20,000.
1) base Define a function which counts matches up to the ith row for the indicated player and then apply it for the winner and loser matches separately. No packages are used:
count_matches <- function(i, player) {
with(DF[1:i, ], sum(winner == player | loser == player))
}
n <- nrow(DF)
transform(DF, winner_matches = mapply(count_matches, 1:n, winner),
loser_matches = mapply(count_matches, 1:n, loser))
giving:
winner loser winner_matches loser_matches
1 tom andy 1 1
2 andy greg 2 1
3 greg tom 2 2
2) sqldf A different solution can be obtained using sqldf upon realizing that this problem can be solved with a self-join on a complex condition like this:
library(sqldf)
sqldf("select a.winner,
a.loser,
sum(a.winner = b.winner or a.winner = b.loser) winner_matches,
sum(a.loser = b.winner or a.loser = b.loser) loser_matches
from DF a join DF b on a.rowid >= b.rowid
group by a.rowid")
giving:
winner loser winner_matches loser_matches
1 tom andy 1 1
2 andy greg 2 1
3 greg tom 2 2
Note: The input used, in reproducible form, is:
Lines <- "winner loser
tom andy
andy greg
greg tom"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
We can get the number of times each player has won or lost so far with the data.table package:
library(data.table)
setDT(dat)[, winner_matches_won := seq_len(.N), by=(winner)]
setDT(dat)[, loser_matches_lost := seq_len(.N), by=(loser)]
dat
# winner loser winner_matches_won loser_matches_lost
# 1: tom andy 1 1
# 2: andy greg 1 1
# 3: greg tom 1 1
# 4: greg tom 2 2
# 5: tom greg 2 2
Data:
dat <- structure(list(winner = structure(c(3L, 1L, 2L, 2L, 3L), .Label = c("andy",
"greg", "tom"), class = "factor"), loser = structure(c(1L, 2L,
3L, 3L, 2L), .Label = c("andy", "greg", "tom"), class = "factor")), .Names = c("winner",
"loser"), class = "data.frame", row.names = c(NA, -5L))
You're really close to getting ave to work. The cumsum function doesn't know how to handle text so I created a dummy column that's equal to 1 for each row. That gives cumsum something to count.
Here's a sample dataframe.
mydata <-
data.frame(
winner = c("tom", "andy", "greg", "tom", "gary"),
loser = c("andy", "greg", "tom", "gary", "tom"),
stringsAsFactors = FALSE
)
And here's the code to add the two new columns.
library(tidyverse)
mydata <- mutate(mydata, one = 1) # Add dummy column
# Use ave() to calculate both the wins and losses
mydata$winner_matches <- ave(x = mydata$one, mydata$winner, FUN = cumsum)
mydata$loser_matches <- ave(x = mydata$one, mydata$loser, FUN = cumsum)
mydata <- select(mydata, -one) # Remove dummy column
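The ave() calls above count wins and losses separately. To count every match a player has appeared in, in either column, one option is to interleave both columns in match order and take a running count per player. This is only a sketch on the three-row example from the question; it assumes a player never appears as both winner and loser of the same row.

```r
DF <- data.frame(winner = c("tom", "andy", "greg"),
                 loser = c("andy", "greg", "tom"),
                 stringsAsFactors = FALSE)
n <- nrow(DF)

# Interleave in match order: winner 1, loser 1, winner 2, loser 2, ...
players <- as.vector(t(as.matrix(DF[c("winner", "loser")])))

# Running number of appearances for each player
counts <- ave(seq_along(players), players, FUN = seq_along)

# Odd positions belong to winners, even positions to losers
DF$winner_matches <- counts[seq(1, 2 * n, by = 2)]
DF$loser_matches <- counts[seq(2, 2 * n, by = 2)]
```

This makes a single pass over the data, which matters with around 20,000 rows, where recounting from the first row for every match becomes slow.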
I am trying to update a dataframe based on a certain condition. Here is the sample dataframe.
fname mname lname
1 RONALD D VALE
2 RONALD VALE
3 JACK A SMITH
4 JACK B SMITH
5 JACK SMITH
I would like to update the middle names column if the first and last names match. In this example, I would expect the following output.
fname mname lname
1 RONALD D VALE
2 RONALD D VALE
3 JACK A SMITH
4 JACK B SMITH
5 JACK SMITH
I also do not want to update the table if there are two different middle initials. There are some missing values in the data. So the main aim is to identify and merge multiple entries which are possibly similar. At the same time, we do not want to introduce erroneous data into the table.
A tidyverse solution:
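library(dplyr)
# assumes blank middle names are coded as NA rather than ""
df %>%
  group_by(fname, lname) %>%
  # number of distinct non-missing middle names per person
  mutate(mname_count = n_distinct(mname, na.rm = TRUE)) %>%
  # fill in only when exactly one middle name is known
  mutate(mname = ifelse(mname_count == 1, unique(na.omit(mname)), mname)) %>%
  ungroup() %>%
  select(-mname_count)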
An ugly base R solution (assuming you changed your "" to NA):
unic<-unique(lolz[,c("fname","lname")])
for (i in 1:nrow(unic)){
# rows belonging to this first/last name combination
rows<-lolz[,"fname"]==unic[i,1] & lolz[,"lname"]==unic[i,2]
lelz<-lolz[rows,"mname"]
# fill only when exactly one middle name is known for this person
if (sum(!is.na(lelz))==1){
lelz[is.na(lelz)] <- lelz[!is.na(lelz)]
lolz[rows,"mname"]<-lelz
}
}
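A loop-free base R variant of the same rule can be sketched with ave(). fill_one() is a hypothetical helper, and this version assumes missing middle names are stored as "" (as in the question's data) rather than NA.

```r
df1 <- data.frame(fname = c("RONALD", "RONALD", "JACK", "JACK", "JACK"),
                  mname = c("D", "", "A", "B", ""),
                  lname = c("VALE", "VALE", "SMITH", "SMITH", "SMITH"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: fill blanks only if the group has exactly one
# distinct non-empty middle name; otherwise return the group unchanged
fill_one <- function(m) {
  known <- unique(m[nzchar(m)])
  if (length(known) == 1) rep(known, length(m)) else m
}

# Apply the helper within each fname/lname combination
df1$mname <- ave(df1$mname, df1$fname, df1$lname, FUN = fill_one)
```

RONALD VALE has a single known initial, so the blank is filled with D; JACK SMITH has two different initials, so that group is left untouched.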
We can use data.table
library(data.table)
setDT(df1)[, mname := if(uniqueN(mname[nzchar(mname)])==1)
mname[nzchar(mname)] else mname, .(fname, lname)]
df1
# fname mname lname
#1: RONALD D VALE
#2: RONALD D VALE
#3: JACK A SMITH
#4: JACK B SMITH
#5: JACK SMITH
data
df1 <- structure(list(fname = c("RONALD", "RONALD", "JACK", "JACK",
"JACK"), mname = c("D", "", "A", "B", ""), lname = c("VALE",
"VALE", "SMITH", "SMITH", "SMITH")), .Names = c("fname", "mname",
"lname"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5"))
I have a dataframe as below. I want to split the last column into two. The split should happen only at the first ":"; any later colons don't matter.
In the new dataframe there will be 4 columns. The 3rd column will be (a, b, d), while the 4th column will be (1, 2:3, 3:4:4).
Any suggestions? The 4th line of my code doesn't work :(. I am okay with a completely new solution or with corrections to line 4.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
as.data.frame(do.call(rbind, strsplit(df,":")))
--------------------update1
The solutions below work well, but I need a modified solution, as I just realized that some of the cells in column 3 won't have ":". In that case I want the text in that cell to appear only in the 1st column after splitting.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b", "d: 3:4:4"))
You could use cSplit. On your updated data frame,
library(splitstackshape)
cSplit(df, "originalColumn", sep = ":{1}")
# employee salary originalColumn_1 originalColumn_2
# 1: John Doe 3 a 1
# 2: Peter Gynn 2 b NA
# 3: Jolie Hope 1 d 3:4:4
And on your original data frame,
df1 <- data.frame(employee, salary,
originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
cSplit(df1, "originalColumn", sep = ":{1}")
# employee salary originalColumn_1 originalColumn_2
# 1: John Doe 3 a 1
# 2: Peter Gynn 2 b 2:3
# 3: Jolie Hope 1 d 3:4:4
Note: I'm using splitstackshape version 1.4.2. I believe the sep argument has been changed from version 1.4.0
You could use extract from tidyr to split the originalColumn in to two columns. In the below code, I am creating 3 columns and removing one of the unwanted columns from the result.
library(tidyr)
pat <- "([^ :])( ?:|: ?|)(.*)"
extract(df, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
Using the updated df, (for better identification - df1)
extract(df1, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b
#3 Jolie Hope 1 d 3:4:4
Or without creating a new column in df
extract(df, originalColumn, c("Col1", "Col2"), "(.)[ :](.*)") %>%
mutate(Col2= gsub("^\\:", "", Col2))
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
Based on the pattern in df, the code below also works. The first group, (.), captures the single character at the beginning of the string, which becomes Col1. Then .{2} discards the two characters that follow, and the rest, captured by (.*), forms Col2.
extract(df, originalColumn, c("Col1", "Col2"), "(.).{2}(.*)")
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
or using strsplit
as.data.frame(do.call(rbind, strsplit(as.character(df$originalColumn), " :|: ")))
# V1 V2
#1 a 1
#2 b 2:3
#3 d 3:4:4
For df1, here is a solution using strsplit
lst <- strsplit(as.character(df1$originalColumn), " :|: ")
as.data.frame(do.call(rbind,lapply(lst,
`length<-`, max(sapply(lst, length)))) )
# V1 V2
#1 a 1
#2 b <NA>
#3 d 3:4:4
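You were close, here's a solution:
library(stringr)
# str_split_fixed() already returns a matrix; n = 2 splits on the first ":" only
parts <- str_split_fixed(df$originalColumn, ":", n = 2)
df$Col1 <- str_trim(parts[, 1])
df$Col2 <- str_trim(parts[, 2])
df$originalColumn <- NULL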
employee salary Col1 Col2
1 John Doe 3 a 1
2 Peter Gynn 2 b 2:3
3 Jolie Hope 1 d 3:4:4
Notes:
stringr::str_split() and str_split_fixed() are more convenient than base::strsplit() here because you don't have to call as.character() first, and their n = 2 argument limits the split to the first ':'.
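If you would rather avoid extra packages, splitting on the first ":" only can also be sketched in base R with regexpr() and substr(). The vector x stands in for the updated originalColumn, and rows without a ":" are assumed to get NA in the second column.

```r
x <- c("a :1", "b", "d: 3:4:4")

# Position of the first ":" in each string, or -1 if there is none
pos <- as.integer(regexpr(":", x, fixed = TRUE))
has_sep <- pos > 0

# Text before the first ":" (or the whole string when there is no ":")
col1 <- trimws(ifelse(has_sep, substr(x, 1, pos - 1), x))
# Text after the first ":" (NA when there is no ":")
col2 <- trimws(ifelse(has_sep, substr(x, pos + 1, nchar(x)), NA_character_))

out <- data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
```

Because only the first match position is used, later colons such as those in "3:4:4" stay in the second column.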