Find Duplicates in R based on multiple characters

I can't seem to remember how to code this properly in R.
I want to remove duplicates within a CSV file based on multiple columns: first name and last name, which are stored in separate columns.
I can write: file[!duplicated(file$First.Name), ]
but that only looks at the first name; I want it to look at the last name at the same time.
If this is my starting file:
Steve Jones
Eric Brown
Sally Edwards
Steve Jones
Eric Davis
I want the output to be
Steve Jones
Eric Brown
Sally Edwards
Eric Davis
Only rows where both the first and last name match should be removed.

You can use
file[!duplicated(file[c("First.Name", "Last.Name")]), ]
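As a minimal sketch of how this one-liner behaves on the question's data (the data frame below is rebuilt by hand, and the intermediate names dup and deduped are only illustrative): duplicated() on a two-column data frame flags a row only when the whole (First.Name, Last.Name) pair has already appeared.

```r
# The question's data, rebuilt by hand for illustration
file <- data.frame(
  First.Name = c("Steve", "Eric", "Sally", "Steve", "Eric"),
  Last.Name  = c("Jones", "Brown", "Edwards", "Jones", "Davis"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose name pair occurred in an earlier row
dup <- duplicated(file[c("First.Name", "Last.Name")])

# Keep only the first occurrence of each pair
deduped <- file[!dup, ]
deduped
```

"Eric Brown" and "Eric Davis" both survive because only the full pair is compared, not the first name alone.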

Here is a data.table solution, which may perform better on large data (assuming first name and last name are stored in separate columns):
> df <- read.table(text = 'Steve Jones
+ Eric Brown
+ Sally Edwards
+ Steve Jones
+ Eric Davis')
> colnames(df) <- c("First.Name","Last.Name")
> df
First.Name Last.Name
1 Steve Jones
2 Eric Brown
3 Sally Edwards
4 Steve Jones
5 Eric Davis
Here is where the data.table-specific code begins:
> library(data.table)
> dt <- setDT(df)
> unique(dt,by=c('First.Name','Last.Name'))
First.Name Last.Name
1: Steve Jones
2: Eric Brown
3: Sally Edwards
4: Eric Davis

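If there is a single column holding the full name, use sub to remove the first word (i.e. the first name) and the space that follows it, then build the logical vector with !duplicated(...) on the result and use it to subset the rows of the dataset. Note that this compares the last names only.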
df1[!duplicated(sub("\\w+\\s+", "", df1$Col1)),,drop=FALSE]
# Col1
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
If it is based on two columns and the dataset has only those two columns, call duplicated directly on the dataset to get the logical vector, negate it, and subset the rows.
df1[!duplicated(df1), , drop=FALSE]
# first.name second.name
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis

Try:
File[!duplicated(paste(File$First.Name, File$Last.Name)), ]

Related

Using the %in% function for multiple columns to [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Find complement of a data frame (anti - join)
(7 answers)
Closed 1 year ago.
I am trying to use the %in% function to match the observations of one of my datasets to the observations of another dataset. Essentially, I would like to make two new datasets, one that contains the observations of the second dataset, and another which contains all other observations. Here is an example dataset:
Df
Last.Name First.Name Group
Williams Bob A
Williams Dan C
Miller Bob A
Smith Dan C
Williams Rick A
Smart Jeff C
Miller Bob A
Smith Dan C
Jones Bob A
Williams Buddy C
Miller Bob A
Hends Dan C
Williams Rick A
Smart Jeff C
Millers Bob A
Smith Danny C
Here is the dataset whose observations I am trying to match:
dfMatch
LastName FirstName
Williams Bob
Williams Buddy
Miller Bob
Smith Dan
Williams Rick
Smart Jeff
Miller Bob
Smith Dan
I tried various versions of the following code:
newdf<-Df[ Df$Last.Name %in% DfMatch$LastName & Df$First.Name %in% DfMatch$FirstName,]
and
newdf<-Df[ which(Df$Last.Name %in% DfMatch$LastName & Df$First.Name %in% DfMatch$FirstName),]
To get this new dataset:
newDF
Last.Name First.Name Group
Williams Bob A
Miller Bob A
Smith Dan C
Williams Rick A
Smart Jeff C
Miller Bob A
Smith Dan C
Williams Buddy C
However, this does not work.
I would also like to use similar code to build a dataset which includes all observations not listed in the dfMatch set, such as:
DfNoMatch
Last.Name First.Name Group
Williams Dan C
Jones Bob A
Miller Bob A
Hends Dan C
Williams Rick A
Smart Jeff C
Millers Bob A
Smith Danny C
By using code similar to:
DfNoMatch<-Df[ !Df$Last.Name %in% DfMatch$LastName & Df$First.Name %in% DfMatch$FirstName,]
and
DfNoMatch<-Df[! which(Df$Last.Name %in% DfMatch$LastName & Df$First.Name %in% DfMatch$FirstName),]
Thank you in advance and any help is greatly appreciated!

To actually pair up the observations, use the match function. The %in% operator only tells you that there is a match; it does not tell you which row matched which.
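A minimal sketch of that idea, under some assumptions: the toy data below is a shortened, hand-made version of the question's tables, and the "|" separator assumes no name contains that character. Testing each column with %in% separately checks last and first names independently, so "Williams Dan" would pass because "Williams" and "Dan" each occur somewhere in dfMatch. Pasting the two columns into one key makes match() compare them as a pair.

```r
# Shortened, hand-made versions of the question's tables
Df <- data.frame(
  Last.Name  = c("Williams", "Williams", "Miller", "Smith", "Jones"),
  First.Name = c("Bob", "Dan", "Bob", "Dan", "Bob"),
  Group      = c("A", "C", "A", "C", "A"),
  stringsAsFactors = FALSE
)
dfMatch <- data.frame(
  LastName  = c("Williams", "Miller", "Smith"),
  FirstName = c("Bob", "Bob", "Dan"),
  stringsAsFactors = FALSE
)

# One key per row, so last and first name are compared together
# (assumes "|" never appears inside a name)
key_df    <- paste(Df$Last.Name, Df$First.Name, sep = "|")
key_match <- paste(dfMatch$LastName, dfMatch$FirstName, sep = "|")

# Row of dfMatch that each row of Df pairs with, NA when there is none
idx <- match(key_df, key_match)

newdf     <- Df[!is.na(idx), ]  # rows found in dfMatch
DfNoMatch <- Df[is.na(idx), ]   # all other rows
```

Here "Williams Dan" and "Jones Bob" end up in DfNoMatch, which the column-by-column %in% test would not achieve for "Williams Dan".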

Get a new column with only the first name in R

I want to create a column to have only the first name of the people in the dataset. In this case, I just want to get a column with value John, David, Carey, and David and NA values for those who are either non-human or don't have one. However, I am facing two difficulties.
The first is that I need to filter out all rows written entirely in capital letters, because they're not PEOPLE; they're ENTITIES.
The second is that I need to extract the word right after the comma, as that is the first name.
So I am just wondering what's the best approach to get a new column for the first name of the people.
Reproducible dataset:
structure(list(company_number = c("04200766", "04200766", "04200766",
"04200766", "04200766", "04200766"), directors = c("THOMAS, John Anthony",
"THOMAS, David Huw", "BRIGHTON SECRETARY LIMITED", "THOMAS, Carey Rosaline",
"THOMAS, David Huw", "BRIGHTON DIRECTOR LIMITED")), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))

We can do this.
First, take the first word after the comma:
df$names <- sub(".*?, (.*?) .*","\\1",df$directors)
Then set any string that still has more than one word (i.e. one that had no comma to match) to <NA>:
df$names <- ifelse(sapply(strsplit(df$names, " "), length)>1,NA,df$names)
Output:
> df
company_number directors names
1 04200766 THOMAS, John Anthony John
2 04200766 THOMAS, David Huw David
3 04200766 BRIGHTON SECRETARY LIMITED <NA>
4 04200766 THOMAS, Carey Rosaline Carey
5 04200766 THOMAS, David Huw David
6 04200766 BRIGHTON DIRECTOR LIMITED <NA>
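The two steps above can also be collapsed into one by testing for the comma first. This is only a sketch in base R, assuming (as in the sample data) that every person is written as "SURNAME, First ..." and that entities never contain a comma; the vector names directors, has_comma and first_name are illustrative.

```r
# The directors column from the question, as a plain character vector
directors <- c("THOMAS, John Anthony", "THOMAS, David Huw",
               "BRIGHTON SECRETARY LIMITED", "THOMAS, Carey Rosaline",
               "THOMAS, David Huw", "BRIGHTON DIRECTOR LIMITED")

# Assumption: only people have a comma in their entry
has_comma <- grepl(",", directors, fixed = TRUE)

# Capture the first run of non-space characters after the comma;
# entries without a comma become NA
first_name <- ifelse(
  has_comma,
  sub("^[^,]*, *([^ ]+).*$", "\\1", directors),
  NA_character_
)
first_name
```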

Using str_extract:
library(dplyr)
library(stringr)
df %>% mutate(people = str_extract(directors, '(?<=,\\s)\\w+'))
# company_number directors people
#1: 04200766 THOMAS, John Anthony John
#2: 04200766 THOMAS, David Huw David
#3: 04200766 BRIGHTON SECRETARY LIMITED <NA>
#4: 04200766 THOMAS, Carey Rosaline Carey
#5: 04200766 THOMAS, David Huw David
#6: 04200766 BRIGHTON DIRECTOR LIMITED <NA>

Multiple criteria lookup in R

I have data like this:
ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats
And I want to add columns indicating teams A and B by matching the row ID of that row in the ID column, and by matching one of the names in one of the "a" columns of that row in the "Name" column (for Team A), and doing the same for Team B using one of the names in one of the "b" columns of that row:
ID 1a 2a 3a 1b 2b 3b Name Team Team A Team B
cb128c James John Bill Jeremy Ed Simon Simon Wolves Tigers Wolves
cb128c John James Randy Simon David Ben John Tigers Tigers Wolves
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows Wildcats Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats Wildcats Sparrows
In row 1, we know Team A is Tigers because we match the ID of row 1, cb128c, in the ID column, and one of the "a" names of row 1 (either James, John or Bill) in the Name column. In this case, Row 2 has that ID, cb128c, and has "John" in the Name column. The Team in row 2 is "Tigers." Therefore, Row 1's Team A is Tigers. Team B is the Wolves because we match row 1's ID, still cb128c, and one of the "b" names in row 1 (either Jeremy, Ed or Simon) in the Name column. In this case, row 1 itself has the data we're looking for since one of the "b" names appears in the "Name" column of that row (Simon). The "Team" listed in each row will always either be the Team A or the Team B for that row.
Further down, we know Team A for row 3 is Wildcats because we match row 3's ID, ko351u and one of row 3's "a" names (either Adam, Alex or Jacob) in the "Name" column. Row 4 has that ID and "Adam" in the Name column. So the Team in Row 4 is Team A for Row 3.
Also notice that David switched teams in Row 3. In Row 2, David was on Simon's team, which we know is the Wolves (as explained above), but when we match Row 3's ID and one of Row 3's "b" names (Bob, Oscar or David), we get the Sparrows (like Row 1, one of the "b" names appears in the name column of that same row, so the Team B is the Team listed in that row).
How can I get this done in R?

df = read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats", header = TRUE)
# convert to character
df[] = lapply(df, as.character)
library(tidyr)
library(dplyr)
The following code 1. gathers to long format, 2. creates "Team_A" and "Team_B" out of the a or b suffix, 3. matches names to fill in the A/B Team Name, 4. removes missing values (no match), 5. gets rid of unnecessary columns, 6. converts back to wide format, 7. joins the A and B teams to the original data.
I'd encourage you to step through the code line by line to understand what's going on. I'll leave reordering the columns to you.
result = gather(df, key = "key", value = "value", starts_with("X")) %>%
  mutate(ab = paste0("Team_", toupper(substr(key, start = nchar(key), stop = nchar(key)))),
         team = ifelse(Name == value, Team, NA)) %>%
  filter(!is.na(team)) %>%
  select(ID, ab, team) %>%
  spread(key = ab, value = team) %>%
  right_join(df)
result
# ID Team_A Team_B X1a X2a X3a X1b X2b X3b Name Team
# 1 cb128c Tigers Wolves James John Bill Jeremy Ed Simon Simon Wolves
# 2 cb128c Tigers Wolves John James Randy Simon David Ben John Tigers
# 3 ko351u Wildcats Sparrows Adam Alex Jacob Bob Oscar David Oscar Sparrows
# 4 ko351u Wildcats Sparrows Adam Matt Sam Fred Frank Harry Adam Wildcats
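For readers who prefer to avoid the reshaping step, the same logic can be sketched in base R. This is not the answer's code, only an illustration under the question's stated assumption that the Team in each row is always either Team A or Team B for that row. The helper names a_cols, in_a, side and team_of are hypothetical.

```r
df <- read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats",
  header = TRUE, stringsAsFactors = FALSE)

# read.table prefixes names that start with a digit with "X"
a_cols <- c("X1a", "X2a", "X3a")

# Does the row's Name appear among that row's "a" players?
in_a <- vapply(seq_len(nrow(df)),
               function(i) df$Name[i] %in% unlist(df[i, a_cols]),
               logical(1))

# Assumption from the question: if not on side A, the Name is on side B
side <- ifelse(in_a, "A", "B")

# For a given side, find the row with the same ID that carries that side's team
team_of <- function(s) {
  df$Team[match(paste(df$ID, s), paste(df$ID, side))]
}

df$Team_A <- team_of("A")
df$Team_B <- team_of("B")
df[c("ID", "Name", "Team", "Team_A", "Team_B")]
```

Like the answer above, this assigns teams per ID, so each ID is assumed to have one row per side.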

separate different combinations of names to first and last using dplyr, tidyr, and regex

Sample data frame:
name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael")
df <- data.frame(name)
df
name
1 Smith John Michael
2 Smith, John Michael
3 Smith John, Michael
4 Smith-John Michael
5 Smith-John, Michael
I need to achieve the following desired output:
name first.name last.name
1 Smith John Michael John Smith
2 Smith, John Michael John Smith
3 Smith John, Michael Michael Smith John
4 Smith-John Michael Michael Smith-John
5 Smith-John, Michael Michael Smith-John
The rules are: if there is a comma in the string, then anything before it is the last name, and the first word following the comma is the first name. If there is no comma in the string, the first word is the last name and the second word is the first name. Hyphenated words count as one word. I would rather achieve this with dplyr and regex, but I'll take any solution. Thanks for the help.

You can achieve your desired result using strsplit switching between splitting by "," or " " based on whether there is a comma or not in name. Here, we define two functions to make the presentation clearer. You can just as well inline the code within the functions.
get.last.name <- function(name) {
  # split on "," when a comma is present, otherwise on " "
  parts <- ifelse(grepl(",", name), strsplit(name, ","), strsplit(name, " "))
  sapply(parts, `[[`, 1)
}
The result of strsplit is a list. The sapply(..., `[[`, 1) call loops through this list and extracts the first element from each list element, which is the last name, returning a character vector.
get.first.name <- function(name) {
  parts <- ifelse(grepl(",", name), strsplit(name, ","), strsplit(name, " "))
  d <- sapply(parts, `[[`, 2)
  # drop a leading space, then keep only the first remaining word
  sapply(strsplit(sub("^ ", "", d), " "), `[[`, 1)
}
This function is similar, except that we extract the second element from each list element returned by strsplit, which contains the first name. We then remove any leading space using sub, and we split again on " " to extract the first element from each list element returned by that strsplit as the first name.
Putting it all together with dplyr:
library(dplyr)
res <- df %>% mutate(first.name=get.first.name(name),
last.name=get.last.name(name))
The result is as expected:
print(res)
## name first.name last.name
## 1 Smith John Michael John Smith
## 2 Smith, John Michael John Smith
## 3 Smith John, Michael Michael Smith John
## 4 Smith-John Michael Michael Smith-John
## 5 Smith-John, Michael Michael Smith-John
Data:
df <- structure(list(name = c("Smith John Michael", "Smith, John Michael",
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael"
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame")
## name
##1 Smith John Michael
##2 Smith, John Michael
##3 Smith John, Michael
##4 Smith-John Michael
##5 Smith-John, Michael

I am not sure if this is any better than aichao's answer, but I gave it a shot anyway. It gives the right output.
library(dplyr)
library(tidyr)
# rows with a comma: last name is everything before the comma
df1 <- df %>%
  filter(grepl(",",name)) %>%
  separate(name, c("last.name","first.middle.name"), sep = "\\,", remove=F) %>%
  mutate(first.middle.name = trimws(first.middle.name)) %>%
  separate(first.middle.name, c("first.name","middle.name"), sep="\\ ",remove=T) %>%
  select(-middle.name)
# rows without a comma: first word is the last name, second word the first name
df2 <- df %>%
  filter(!grepl(",",name)) %>%
  separate(name, c("last.name","first.name"), sep = "\\ ", remove=F)
df<-rbind(df1,df2)

Regrouping data based on an indicator value

I have a dataframe with two columns as shown below,
Name Indicator
DeAngelo Williams 1
Marcus Brown 1
Elaine Nelson 2
Steve Olson 3
Jennifer Carter 1
Michael Johnson 2
Angela Brawley 3
Dax Shepard 4
What I am trying to do is combine all the names from a row where the Indicator column value is 1 until the next value of 1 is encountered. The final output should look like this:
Name
-------
DeAngelo Williams
Marcus Brown,Elaine Nelson,Steve Olson
Jennifer Carter, Michael Johnson, Angela Brawley, Dax Shepard
I am unable to think of a solution for this issue so any assistance on accomplishing this is much appreciated.

We can use aggregate from base R to do this. As @thelatemail mentioned in the comments, create a grouping variable by taking the cumulative sum of the logical vector Indicator==1. Then, using the formula method, we paste the elements of 'Name' together within each group.
aggregate(Name~cbind(Group=cumsum(Indicator==1)), df1, FUN=toString)[2]
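To make the grouping step concrete, here is a small sketch that rebuilds the question's data by hand and shows the intermediate group vector. It uses tapply instead of aggregate purely to keep the illustration short; the names grp and combined are illustrative.

```r
# The question's data, rebuilt by hand
df1 <- data.frame(
  Name = c("DeAngelo Williams", "Marcus Brown", "Elaine Nelson", "Steve Olson",
           "Jennifer Carter", "Michael Johnson", "Angela Brawley", "Dax Shepard"),
  Indicator = c(1, 1, 2, 3, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Each 1 in Indicator starts a new group, so the running count of 1s
# labels the group every row belongs to
grp <- cumsum(df1$Indicator == 1)
grp
# 1 2 2 2 3 3 3 3

# Paste the names within each group together, separated by ", "
combined <- tapply(df1$Name, grp, toString)
combined
```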
