Get a new column with only the first name in R - r

I want to create a column that holds only the first name of each person in the dataset. In this case, I want a column with the values John, David, Carey, and David, and NA for rows that are either not people or don't have a first name. However, I am facing two difficulties.
The first is that I need to filter out all rows written entirely in capital letters, because they are not people; they are entities.
The second is that I need to extract the word right after the comma, as that is the first name.
So I am just wondering what's the best approach to get a new column for the first name of the people.
reproducible dataset
structure(list(company_number = c("04200766", "04200766", "04200766",
"04200766", "04200766", "04200766"), directors = c("THOMAS, John Anthony",
"THOMAS, David Huw", "BRIGHTON SECRETARY LIMITED", "THOMAS, Carey Rosaline",
"THOMAS, David Huw", "BRIGHTON DIRECTOR LIMITED")), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))

We can do this in two steps.
First, take the first word after the comma:
df$names <- sub(".*?, (.*?) .*","\\1",df$directors)
Then set any result that still contains more than one word (that is, any row where the pattern did not match) to <NA>:
df$names <- ifelse(sapply(strsplit(df$names, " "), length)>1,NA,df$names)
output:
> df
company_number directors names
1 04200766 THOMAS, John Anthony John
2 04200766 THOMAS, David Huw David
3 04200766 BRIGHTON SECRETARY LIMITED <NA>
4 04200766 THOMAS, Carey Rosaline Carey
5 04200766 THOMAS, David Huw David
6 04200766 BRIGHTON DIRECTOR LIMITED <NA>
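As a variation on the two steps above, the match can be pulled out directly so that rows without a comma become NA in one pass. This is a minimal base R sketch, not the answer's own code. The vector `directors` and the extra element "THOMAS, John" (a name with no middle name, which the sub() pattern above would not match) are assumptions added for illustration.

```r
# Illustrative input; the last element is a made-up case with no middle name
directors <- c("THOMAS, John Anthony", "THOMAS, David Huw",
               "BRIGHTON SECRETARY LIMITED", "THOMAS, Carey Rosaline",
               "THOMAS, John")

# Locate the first run of word characters that follows ", " (-1 means no match)
m <- regexpr("(?<=, )\\w+", directors, perl = TRUE)

# Start from NA everywhere, then fill in only the rows that matched
first_name <- rep(NA_character_, length(directors))
first_name[m != -1L] <- regmatches(directors, m)

first_name
```

Because the NA values come from the missing comma rather than from a word count, a person listed with only a surname and a first name is still picked up.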

Using str_extract :
library(dplyr)
library(stringr)
df %>% mutate(people = str_extract(directors, '(?<=,\\s)\\w+'))
# company_number directors people
#1: 04200766 THOMAS, John Anthony John
#2: 04200766 THOMAS, David Huw David
#3: 04200766 BRIGHTON SECRETARY LIMITED <NA>
#4: 04200766 THOMAS, Carey Rosaline Carey
#5: 04200766 THOMAS, David Huw David
#6: 04200766 BRIGHTON DIRECTOR LIMITED <NA>
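The first difficulty in the question, telling entities from people, can also be checked explicitly instead of relying on the missing comma. A small sketch, assuming (as in the sample data) that entity names are written entirely in capitals; `is_entity` and `people` are names introduced here for illustration.

```r
directors <- c("THOMAS, John Anthony", "BRIGHTON SECRETARY LIMITED",
               "THOMAS, Carey Rosaline", "BRIGHTON DIRECTOR LIMITED")

# A string that equals its own upper-cased form contains no lower-case letters
is_entity <- directors == toupper(directors)

# Keep only the rows that look like people
people <- directors[!is_entity]
people
```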

Related

Extracting best guess first and last names from a string

I have a set of names that looks as such:
names <- structure(list(name = c('Michael Smith ♕',
'Scott Lewis - Realtor',
'Erin Hopkins Ŧ',
'Katie Parsons | Denver',
'Madison Hollins Taylor',
'Kevin D. Williams',
'|Ryan Farmer|',
'l a u r e n t h o m a s',
'Dave Goodwin💦',
'Candice Harper Makeup Artist',
'dani longfeld // millenialmodels',
'Madison Jantzen | DALLAS, TX',
'Rachel Wallace Perkins',
'Kayla Wright Photography',
'Scott Green Jr.')), class = "data.frame", row.names = c(NA, -15L))
In addition to getting first and last name extracted from each of these, for ones like Rachel Wallace Perkins and Madison Hollins Taylor, I'd like to create one to multiple extracts since we don't really know which is their true last name. The final output would look something like this:
names_revised <- structure(list(name = c('Michael Smith',
'Scott Lewis',
'Erin Hopkins',
'Katie Parsons',
'Madison Hollins',
'Madison Taylor',
'Kevin Williams',
'Ryan Farmer',
'Lauren Thomas',
'Dave Goodwin',
'Candice Harper',
'Dani Longfeld',
'Madison Jantzen',
'Rachel Wallace',
'Rachel Perkins',
'Kayla Wright',
'Scott Green')), class = "data.frame", row.names = c(NA, -17L))
Based on some previous answers, I attempted to do (using the tidyr package):
names_extract <- tidyr::extract(names, name, c("FirstName", "LastName"), "([^ ]+) (.*)")
But that doesn't seem to do the trick, as the output it produces looks as such:
FirstName LastName
1 Michael Smith ♕
2 Scott Lewis - Realtor
3 Erin Hopkins Ŧ
4 Katie Parsons | Denver
5 Madison Hollins Taylor
6 Kevin D. Williams
7 |Ryan Farmer|
8 l a u r e n t h o m a s
9 Dave Goodwin💦
10 Candice Harper Makeup Artist
11 dani longfeld // millenialmodels
12 Madison Jantzen | DALLAS, TX
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr.
I know there are a ton of little edge cases that make this difficult, but overall, what would be the best approach for handling this that would capture the most results I'm trying for?
This fixes most of the rows.
library(dplyr)
library(tidyr)
names %>%
mutate(name2 = sub("^[[:punct:]]", "", name) %>%
sub(" \\w[.] ", " ", .) %>%
sub("[[:punct:]]+ *[^[:punct:]]*$", "", .) %>%
sub("\\W+[[:upper:]]+$", "", .) %>%
trimws) %>%
separate(name2, c("First", "Last"), extra = "merge")
giving:
name First Last
1 Michael Smith ♕ Michael Smith
2 Scott Lewis - Realtor Scott Lewis
3 Erin Hopkins Ŧ Erin Hopkins
4 Katie Parsons | Denver Katie Parsons
5 Madison Hollins Taylor Madison Hollins Taylor
6 Kevin D. Williams Kevin Williams
7 |Ryan Farmer| Ryan Farmer
8 l a u r e n t h o m a s l a u r e n t h o m a s
9 Dave Goodwin💦 Dave Goodwin
10 Candice Harper Makeup Artist Candice Harper Makeup Artist
11 dani longfeld // millenialmodels dani longfeld
12 Madison Jantzen | DALLAS, TX Madison Jantzen
13 Rachel Wallace Perkins Rachel Wallace Perkins
14 Kayla Wright Photography Kayla Wright Photography
15 Scott Green Jr. Scott Green Jr
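The cleaned names still contain three-word entries such as "Madison Hollins Taylor", for which the question asks for one row per possible last name. One possible approach, sketched here in base R under the assumption that the names have already been cleaned, is to pair the first word with each later word. The helper `expand_name` is hypothetical, and it would also emit spurious pairs for rows such as "Kayla Wright Photography", so some filtering would still be needed.

```r
# Hypothetical helper: pair the first word with every later word
expand_name <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")[[1]]
  if (length(parts) < 2) return(x)   # nothing to pair with
  paste(parts[1], parts[-1])
}

cleaned <- c("Michael Smith", "Madison Hollins Taylor", "Rachel Wallace Perkins")

# Each input contributes one candidate per word after the first
candidates <- unlist(lapply(cleaned, expand_name))
candidates
```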
Here's a first go at cleaning the data - (much) more will be needed to obtain perfect data:
library(dplyr)
library(stringr)
names %>%
mutate(name = str_extract(name, "[\\w\\s.]+\\w"))
name
1 Michael Smith
2 Scott Lewis
3 Erin Hopkins
4 Katie Parsons
5 Madison Hollins Taylor
6 Kevin D. Williams
7 Ryan Farmer
8 l a u r e n t h o m a s
9 Dave Goodwin
10 Candice Harper Makeup Artist
11 dani longfeld
12 Madison Jantzen
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr
Here we use str_extract, which extracts only the first match in each string. That is convenient because most of the characters you want to remove sit at the right-hand end. The character class [\\w\\s.]+ matches one or more alphanumeric characters, whitespace characters, or dots. It is followed by \\w, i.e., a single alphanumeric character, so that the extracted part never ends in whitespace or a dot. As said, that's just a first go, but the data is already much tidier.
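To see how the pattern behaves, here is a sketch of the same idea in base R with regexpr() and regmatches(). The input is restricted to plain ASCII examples from the question, because how \w treats non-ASCII characters can differ between regex engines; the variable names `x`, `m`, and `cleaned` are assumptions.

```r
x <- c("Scott Lewis - Realtor", "|Ryan Farmer|",
       "Katie Parsons | Denver", "Kevin D. Williams")

# First run of word characters, whitespace, or dots that ends in a word character
m <- regexpr("[\\w\\s.]+\\w", x, perl = TRUE)

# Every element matches here, so regmatches() returns one string per input
cleaned <- regmatches(x, m)
cleaned
```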

Multiple criteria lookup in R

I have data like this:
ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats
And I want to add columns indicating teams A and B by matching the row ID of that row in the ID column, and by matching one of the names in one of the "a" columns of that row in the "Name" column (for Team A), and doing the same for Team B using one of the names in one of the "b" columns of that row:
ID 1a 2a 3a 1b 2b 3b Name Team Team A Team B
cb128c James John Bill Jeremy Ed Simon Simon Wolves Tigers Wolves
cb128c John James Randy Simon David Ben John Tigers Tigers Wolves
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows Wildcats Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats Wildcats Sparrows
In row 1, we know Team A is Tigers because we match the ID of row 1, cb128c, in the ID column, and one of the "a" names of row 1 (either James, John or Bill) in the Name column. In this case, Row 2 has that ID, cb128c, and has "John" in the Name column. The Team in row 2 is "Tigers." Therefore, Row 1's Team A is Tigers. Team B is the Wolves because we match row 1's ID, still cb128c, and one of the "b" names in row 1 (either Jeremy, Ed or Simon) in the Name column. In this case, row 1 itself has the data we're looking for since one of the "b" names appears in the "Name" column of that row (Simon). The "Team" listed in each row will always either be the Team A or the Team B for that row.
Further down, we know Team A for row 3 is Wildcats because we match row 3's ID, ko351u and one of row 3's "a" names (either Adam, Alex or Jacob) in the "Name" column. Row 4 has that ID and "Adam" in the Name column. So the Team in Row 4 is Team A for Row 3.
Also notice that David switched teams in Row 3. In Row 2, David was on Simon's team, which we know is the Wolves (as explained above), but when we match Row 3's ID and one of Row 3's "b" names (Bob, Oscar or David), we get the Sparrows (like Row 1, one of the "b" names appears in the name column of that same row, so the Team B is the Team listed in that row).
How can I get this done in R?
df = read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats", header = TRUE)
# convert to character
df[] = lapply(df, as.character)
library(tidyr)
library(dplyr)
The following code 1. gathers to long format, 2. creates "Team_A" and "Team_B" out of the a or b suffix, 3. matches names to fill in the A/B Team Name, 4. removes missing values (no match), 5. gets rid of unnecessary columns, 6. converts back to wide format, 7. joins the A and B teams to the original data.
I'd encourage you to step through the code line by line to understand what's going on. I'll leave reordering the columns to you.
result = gather(df, key = "key", value = "value", starts_with("X")) %>%
mutate(ab = paste0("Team_", toupper(substr(key, start = nchar(key), stop = nchar(key)))),
team = ifelse(Name == value, Team, NA)) %>%
filter(!is.na(team)) %>%
select(ID, ab, team) %>%
spread(key = ab, value = team) %>%
right_join(df)
result
# ID Team_A Team_B X1a X2a X3a X1b X2b X3b Name Team
# 1 cb128c Tigers Wolves James John Bill Jeremy Ed Simon Simon Wolves
# 2 cb128c Tigers Wolves John James Randy Simon David Ben John Tigers
# 3 ko351u Wildcats Sparrows Adam Alex Jacob Bob Oscar David Oscar Sparrows
# 4 ko351u Wildcats Sparrows Adam Matt Sam Fred Frank Harry Adam Wildcats
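For comparison, the same per-ID lookup can be sketched in base R without reshaping. This is an illustration under the same assumption the code above makes, namely that each ID has one row whose Name sits in an "a" column and one whose Name sits in a "b" column; `name_cols`, `hit_col`, `is_a`, and `is_b` are names introduced here.

```r
df <- read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats",
                 header = TRUE, stringsAsFactors = FALSE)

# read.table() turns 1a, 2a, ... into the syntactic names X1a, X2a, ...
name_cols <- grep("^X[0-9]+[ab]$", names(df), value = TRUE)

# For each row, find the name column (if any) that holds that row's Name
hit_col <- vapply(seq_len(nrow(df)), function(i) {
  m <- name_cols[unlist(df[i, name_cols]) == df$Name[i]]
  if (length(m) > 0) m[1] else NA_character_
}, character(1))

# Rows whose Name sits in an "a" column define Team A for their ID; likewise "b"
is_a <- grepl("a$", hit_col)
is_b <- grepl("b$", hit_col)
df$Team_A <- df$Team[is_a][match(df$ID, df$ID[is_a])]
df$Team_B <- df$Team[is_b][match(df$ID, df$ID[is_b])]

df[, c("ID", "Name", "Team", "Team_A", "Team_B")]
```

match() returns the first matching row per ID, so if an ID had several candidate rows for one side, only the first would be used.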

How to Restructure R Data Frame in R [duplicate]

I have data in this format:
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
and I would like it in this format:
boss employee
1 wil james
2 wil andy
3 james dean
4 james bert
5 billy herb
6 billy collin
7 tony mike
8 tony david
I have searched the forums, but I have not yet found anything that helps. I have tried using dplyr and some others, but I am still pretty new to R.
If this question has been answered and you could give me a link that would be greatly appreciated.
Thanks,
Wil
Here is a solution that uses tidyr. Specifically, the gather function is used to combine the two employee columns. This also generates a column based on the column headers (employee1 and employee2), which is called key. We remove that with select from dplyr.
library(tidyr)
library(dplyr)
df <- read.table(
text = "boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david",
header = TRUE,
stringsAsFactors = FALSE
)
df2 <- df %>%
gather(key, employee, -boss) %>%
select(-key)
> df2
boss employee
1 wil james
2 james dean
3 billy herb
4 tony mike
5 wil andy
6 james bert
7 billy collin
8 tony david
I would be shocked if there isn't a slicker, base solution but this should work for you.
Using base R:
df1 <- df[, 1:2]
df2 <- df[, c(1, 3)]
names(df1)[2] <- names(df2)[2] <- "employee"
rbind(df1, df2)
# boss employee
# 1 wil james
# 2 james dean
# 3 billy herb
# 4 tony mike
# 11 wil andy
# 21 james bert
# 31 billy collin
# 41 tony david
Using dplyr:
df %>%
select(boss, employee1) %>%
rename(employee = employee1) %>%
bind_rows(df %>%
select(boss, employee2) %>%
rename(employee = employee2))
# boss employee
# 1 wil james
# 2 james dean
# 3 billy herb
# 4 tony mike
# 5 wil andy
# 6 james bert
# 7 billy collin
# 8 tony david
Data:
df <- read.table(text = "
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
", header = TRUE, stringsAsFactors = FALSE)
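The outputs above list all employee1 values first and then all employee2 values, whereas the requested format keeps each boss's employees together. A base R sketch that produces that interleaved order, assuming every employee column name starts with "employee"; `emp_cols` and `long` are names introduced here.

```r
df <- read.table(text = "
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
", header = TRUE, stringsAsFactors = FALSE)

emp_cols <- grep("^employee", names(df), value = TRUE)

long <- data.frame(
  # Repeat each boss once per employee column
  boss = rep(df$boss, each = length(emp_cols)),
  # t() puts each boss's employees in one column, so as.vector() reads them boss by boss
  employee = as.vector(t(df[, emp_cols])),
  stringsAsFactors = FALSE
)
long
```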

separate different combinations of names to first and last using dplyr, tidyr, and regex

Sample data frame:
name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael")
df <- data.frame(name)
df
name
1 Smith John Michael
2 Smith, John Michael
3 Smith John, Michael
4 Smith-John Michael
5 Smith-John, Michael
I need to achieve the following desired output:
name first.name last.name
1 Smith John Michael John Smith
2 Smith, John Michael John Smith
3 Smith John, Michael Michael Smith John
4 Smith-John Michael Michael Smith-John
5 Smith-John, Michael Michael Smith-John
The rules are: if there is a comma in the string, then anything before it is the last name, and the first word following the comma is the first name. If there is no comma in the string, the first word is the last name and the second word is the first name. Hyphenated words count as one word. I would rather achieve this with dplyr and regex, but I'll take any solution. Thanks for the help.
You can achieve your desired result using strsplit switching between splitting by "," or " " based on whether there is a comma or not in name. Here, we define two functions to make the presentation clearer. You can just as well inline the code within the functions.
get.last.name <- function(name) {
lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,1)
}
The result of strsplit is a list. The lapply(...,'[[',1) loops through this list and extracts the first element from each list element, which is the last name.
get.first.name <- function(name) {
d <- lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,2)
lapply(strsplit(gsub("^ ","",d), " "),`[[`,1)
}
This function is similar except we extract the second element from each list element returned by strsplit, which contains the first name. We then remove any starting spaces using gsub, and we split again with " " to extract the first element from each list element returned by that strsplit as the first name.
Putting it all together with dplyr:
library(dplyr)
res <- df %>% mutate(first.name=get.first.name(name),
last.name=get.last.name(name))
The result is as expected:
print(res)
## name first.name last.name
## 1 Smith John Michael John Smith
## 2 Smith, John Michael John Smith
## 3 Smith John, Michael Michael Smith John
## 4 Smith-John Michael Michael Smith-John
## 5 Smith-John, Michael Michael Smith-John
Data:
df <- structure(list(name = c("Smith John Michael", "Smith, John Michael",
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael"
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame")
## name
##1 Smith John Michael
##2 Smith, John Michael
##3 Smith John, Michael
##4 Smith-John Michael
##5 Smith-John, Michael
I am not sure if this is any better than aichao's answer, but I gave it a shot anyway. It gives the right output.
library(dplyr)
library(tidyr)
df1 <- df %>%
filter(grepl(",",name)) %>%
separate(name, c("last.name","first.middle.name"), sep = "\\,", remove=F) %>%
mutate(first.middle.name = trimws(first.middle.name)) %>%
separate(first.middle.name, c("first.name","middle.name"), sep="\\ ",remove=T) %>%
select(-middle.name)
df2 <- df %>%
filter(!grepl(",",name)) %>%
separate(name, c("last.name","first.name"), sep = "\\ ", remove=F)
df<-rbind(df1,df2)
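The rules from the question can also be written as a few vectorised sub() calls, which keeps the original row order and returns plain character vectors. This is a sketch in base R rather than either answer's method; `has_comma` and `rest` are names introduced here.

```r
name <- c("Smith John Michael", "Smith, John Michael", "Smith John, Michael",
          "Smith-John Michael", "Smith-John, Michael")

has_comma <- grepl(",", name, fixed = TRUE)

# Last name: everything before the comma, or else the first word
last.name <- ifelse(has_comma, sub(",.*$", "", name), sub(" .*$", "", name))

# Remainder after the last name, with surrounding whitespace removed
rest <- trimws(ifelse(has_comma, sub("^[^,]*,", "", name), sub("^[^ ]+", "", name)))

# First name: first word of the remainder
first.name <- sub(" .*$", "", rest)

data.frame(name, first.name, last.name)
```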

Find Duplicates in R based on multiple characters

I can't seem to remember how to code this properly in R.
I want to remove duplicates within a CSV file based on multiple entries: the first name and the last name, which are stored in separate columns.
I can code: file[duplicated(file$First.Name), ]
but that only looks at the first name; I want it to look at the last name simultaneously.
If this is my starting file:
Steve Jones
Eric Brown
Sally Edwards
Steve Jones
Eric Davis
I want the output to be
Steve Jones
Eric Brown
Sally Edwards
Eric Davis
That is, a row should be removed only when both the first name and the last name match an earlier row.
You can use
file[!duplicated(file[c("First.Name", "Last.Name")]), ]
Here is the solution for better performance (using data.table assuming First Name and Last Name are stored in separate columns):
> df <- read.table(text = 'Steve Jones
+ Eric Brown
+ Sally Edwards
+ Steve Jones
+ Eric Davis')
> colnames(df) <- c("First.Name","Last.Name")
> df
First.Name Last.Name
1 Steve Jones
2 Eric Brown
3 Sally Edwards
4 Steve Jones
5 Eric Davis
Here is where the data.table-specific code begins:
> library(data.table)
> dt <- setDT(df)
> unique(dt,by=c('First.Name','Last.Name'))
First.Name Last.Name
1: Steve Jones
2: Eric Brown
3: Sally Edwards
4: Eric Davis
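If there is a single column, use sub to remove the first word (i.e. the first name) and the whitespace after it, then build the logical vector with !duplicated(...) on the result to subset the rows of the dataset. Note that this compares only what is left after the first name is removed, i.e. the last name.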
df1[!duplicated(sub("\\w+\\s+", "", df1$Col1)),,drop=FALSE]
# Col1
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
If it is based on two columns and the dataset has only those two columns, call duplicated directly on the dataset to get the logical vector, negate it, and subset the rows.
df1[!duplicated(df1), , drop=FALSE]
# first.name second.name
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
Try:
file[!duplicated(paste(file$First.Name, file$Last.Name)), ]
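To check the two-column approach end to end, here is a self-contained sketch that rebuilds the sample data from the question. The data frame is typed in by hand rather than read from a CSV, and `is_dup` and `deduped` are names introduced here.

```r
file <- data.frame(
  First.Name = c("Steve", "Eric", "Sally", "Steve", "Eric"),
  Last.Name  = c("Jones", "Brown", "Edwards", "Jones", "Davis"),
  stringsAsFactors = FALSE
)

# duplicated() on a data frame compares whole rows of the selected columns
is_dup <- duplicated(file[c("First.Name", "Last.Name")])

# Drop only the rows that repeat an earlier first/last name pair
deduped <- file[!is_dup, ]
deduped
```

Only the second "Steve Jones" is flagged; "Eric Brown" and "Eric Davis" share a first name but differ in last name, so both are kept.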
