splitting surnames from fullnames

I've used this:
String <- unlist(str_split(Invname,"[ ]",n=2))
To split the names that I have into Surnames and First Names, since the surnames come first. But I cannot figure out how to reassign the split Invname into two lists, so that I can use only the surnames for the rest of my project. Right now I have this:
[471] "KRUEGER" "MARCUS"
And I would like to have the left side only assigned to a new variable, so that I can work further with mining the surnames for information.

Using the data in nate.edwinton's answer, there is no need to unlist.
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
String <- stringr::str_split(Invnames, "[ ]", n = 2)
Surnames <- sapply(String, '[', 1)
Firstnames <- sapply(String, '[', 2)
data.frame(Surnames, Firstnames)
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson

As mentioned in the comments, it would be easier to help if you provided some data. Anyway, here is a possible solution:
Assuming that Invnames is a vector in which every entry consists of exactly one surname followed by exactly one first name, you could do the following:
# data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
# extraction
String <- unlist(stringr::str_split(Invnames,"[ ]",n=2))
# saving first and last names
lastNames <- String[seq(1,length(String),2)]
firstNames <- String[seq(2,length(String),2)]
# yields
> cbind(lastNames,firstNames)
lastNames firstNames
[1,] "Krueger" "Markus"
[2,] "Doe" "John"
[3,] "Tatum" "Jayson"
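If some entries carry more than one given name, the alternating-index approach above no longer lines up. A minimal base R sketch that splits at the first space only; the fourth sample entry is a hypothetical addition for illustration and is not from the question:

```r
# Sample data; "Smith Anna Maria" is a hypothetical entry with two given names
Invnames <- c("Krueger Markus", "Doe John", "Tatum Jayson", "Smith Anna Maria")

# Position of the first space in each entry (-1 if there is none)
pos <- as.integer(regexpr(" ", Invnames, fixed = TRUE))
stopifnot(all(pos > 0))  # assumption: every entry contains at least one space

# Everything before the first space is the surname, the rest is the first name(s)
Surnames   <- substr(Invnames, 1, pos - 1)
Firstnames <- substr(Invnames, pos + 1, nchar(Invnames))

data.frame(Surnames, Firstnames)
```

Surnames can then be used on its own for the rest of the project.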

Here is some sample data and a suggested solution. Data modified from @Rui Barradas' answer:
Invnames <- c("Krueger.$Markus","Doe.John","Tatum.Jayson")
# split on runs of non-word characters and keep the first piece (the surname)
sapply(strsplit(Invnames,"\\W+"),"[",1)

Again using data from an earlier answer, this time with separate() from tidyr (loaded as part of the tidyverse)
library(tidyverse)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
Invnames <- data.frame(Invnames)
Invnames %>%
separate(Invnames, c('Surname', 'FirstName'), sep=" ")
Surname FirstName
1 Krueger Markus
2 Doe John
3 Tatum Jayson

With base R, we can make use of read.table/read.csv to separate the string into columns
read.table(text = Invnames, header = FALSE, col.names = c("Surnames", "Firstnames"))
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")

If only names were so straightforward! If the strings have few complications, then yes, the other answers are good options. In my experience, name lists contain hyphenated names (both "first" and "last"), middle names, titles and abbreviations (Dr., Mr, Md), and many other variants. I first try to clean the strings before any splitting.
Here is just one idea using dplyr (explicit code provided for clarity):
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson", "Taylor - Cline Jeff", "Davis - Freud Melvin- John")
library(dplyr)
df <- data.frame(Invnames = Invnames) %>%
mutate(Invnames2 = gsub("- ","-",Invnames)) %>%
mutate(Invnames2 = gsub(" -","-",Invnames2)) %>%
mutate(surname = gsub(" .*", "", Invnames2))
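The same cleaning idea can be sketched in base R without the pipe. This is only an illustration under the assumption that stray spaces around hyphens are the only problem; titles and middle names would need further rules:

```r
# Sample data taken from the answer above
Invnames <- c("Krueger Markus", "Doe John", "Tatum Jayson",
              "Taylor - Cline Jeff", "Davis - Freud Melvin- John")

# Remove any whitespace on either side of a hyphen in one pass
cleaned <- gsub("[[:space:]]*-[[:space:]]*", "-", Invnames)

# Collapse repeated whitespace and trim the ends
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))

# The surname is everything before the first space
surname <- sub(" .*", "", cleaned)
surname
```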

Related

How do I extract certain items from unstructured text?

I have an extremely unstructured data frame (df) in R, which includes a text column.
An example of the df$text looks like this
John Smith 3.8 GPA johnsmith@gmail.com, https://link.com
I am trying to extract the GPA out of the field and save to a new column called df$GPA but am unable to get it to work.
I have tried:
df$gpa <- sub('[0-9].[0-9] GPA',"\\1", df$text)
But that returns the whole block of text.
I am also trying to extract the URL but am unsure how to do that. Does anybody have any suggestions?

Here's a solution using a positive lookahead and str_extract from the stringr package (the whitespace sits inside the lookahead so that it is not part of the extracted value):
df$GPA <- str_extract(df$text, "\\d+\\.\\d+(?=\\sGPA)")
An alternative using sub with a backreference would be this:
df$GPA <- sub(".*(\\d+\\.\\d+).*", "\\1", df$text)
Result:
df
text GPA
1 John Smith 3.8 GPA johnsmith@gmail.com, https://link.com 3.8
Data:
df <- data.frame(text = "John Smith 3.8 GPA johnsmith@gmail.com, https://link.com")

We can use a regex lookaround to extract the numeric part
library(stringr)
df$GPA <- str_extract(df$text, "[0-9.]+(?=\\s*GPA)")
df$GPA
#[1] "3.8"
Or in base R with regmatches/regexpr
regmatches(df$text, regexpr("[0-9.]+(?=\\s*GPA)", df$text, perl = TRUE))
data
df <- data.frame(text = 'John Smith 3.8 GPA johnsmith@gmail.com, https://link.com', stringsAsFactors = FALSE)
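The question also asks about the URL, which the answers above leave open. A minimal base R sketch, assuming a URL starts with http:// or https:// and ends at the next whitespace or comma; the sample rows are hypothetical, and rows without a URL get NA:

```r
# Hypothetical sample data: one row with a URL, one without
df <- data.frame(text = c("John Smith 3.8 GPA, https://link.com",
                          "Jane Doe 3.5 GPA"),
                 stringsAsFactors = FALSE)

# Locate the first URL-like match in each string (-1 means no match)
m <- regexpr("https?://[^[:space:],]+", df$text)
hit <- as.integer(m) > 0

# regmatches() returns only the matched strings, so fill the matching rows only
df$url <- NA_character_
df$url[hit] <- regmatches(df$text, m)
df$url
```

Pre-filling the column with NA keeps the rows aligned, because regmatches silently drops elements that have no match.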

Exact match from list of words from a text in R

I have a list of words, and I am looking for the ones that occur in the text.
The problem is that the last column always says "Found", because the search matches the patterns anywhere inside a word. I am looking for exact matches of the entries in words, not partial matches. For the first three records the result should be "Not Found".
Please guide me on where I am going wrong.
col_1 <- c(1,2,3,4,5)
col_2 <- c("work instruction change",
"technology npi inspections",
" functional locations",
"Construction has started",
" there is going to be constn coon")
df <- as.data.frame(cbind(col_1,col_2))
df$col_2 <- tolower(df$col_2)
words <- c("const","constn","constrction","construc",
"construct","construction","constructs","consttntype","constypes","ct","ct#",
"ct2"
)
pattern_words <- paste(words, collapse = "|")
df$result<- ifelse(str_detect(df$col_2, regex(pattern_words)),"Found","Not Found")

Use word boundaries around the words.
library(stringr)
pattern_words <- paste0('\\b', words, '\\b', collapse = "|")
df$result <- c('Not Found', 'Found')[str_detect(df$col_2, pattern_words) + 1]
#OR with `ifelse`
#df$result <- ifelse(str_detect(df$col_2, pattern_words), "Found", "Not Found")
df
# col_1 col_2 result
#1 1 work instruction change Not Found
#2 2 technology npi inspections Not Found
#3 3 functional locations Not Found
#4 4 construction has started Found
#5 5 there is going to be constn coon Found
You can also use grepl here to keep it in base R :
grepl(pattern_words, df$col_2)
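An alternative sketch that avoids regular expressions altogether: split each text into whitespace-separated tokens and test membership with %in%. This also handles entries such as ct#, where a trailing \\b only matches if a word character follows the #. The word list is shortened and the last text row is a hypothetical addition; the sketch assumes tokens carry no attached punctuation:

```r
# Shortened word list and sample texts (the last text is hypothetical)
words <- c("const", "constn", "construction", "ct", "ct#")
text <- c("work instruction change",
          "technology npi inspections",
          "construction has started",
          " there is going to be constn coon",
          "uses ct# here")

# Split each text on whitespace after trimming the ends
tokens <- strsplit(trimws(text), "[[:space:]]+")

# A row is a hit if any of its tokens equals one of the words exactly
found <- vapply(tokens, function(tok) any(tok %in% words), logical(1))
result <- ifelse(found, "Found", "Not Found")
result
```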

Find out single word names

I have a Name column and names are like this:
Preety ..
Sudalai Rajkumar S.
Parvathy M. S.
Navaraj Ranjan Arthur
I want to get which of these are single-word names, like in this case Preety.
I have tried eliminating the "." and " " and counting the length and using the difference of this length and the original string length.
But it's not giving me the desired output. Please help.
NBData3$namewodot <- gsub(" .","",NBData3$Client.Name)
NBData3$namewoblank <- gsub(" ","",NBData3$namewodot)
wordlength <- NBData3$namelengthchar-nchar(as.character(NBData3$namewoblank))

You could use str_count from stringr to check for one-word names, first removing the dots from the names with gsub. The comparison already returns TRUE or FALSE, so no ifelse() is needed.
library(stringr)
NBData3$namewodot <- gsub("\\.", "", NBData3$Client.Name)
NBData3$oneword <- str_count(NBData3$namewodot, '\\w+') == 1
# Client.Name namewodot oneword
# 1 Preety .. Preety TRUE
# 2 Sudalai Rajkumar S. Sudalai Rajkumar S FALSE
# 3 Parvathy M. S. Parvathy M S FALSE
# 4 Navaraj Ranjan Arthur Navaraj Ranjan Arthur FALSE
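For readers without stringr, the same word count can be sketched in base R with gregexpr and regmatches. Counting runs of letters makes the separate dot-removal step unnecessary; the vector name names_vec is a placeholder for the name column:

```r
# Sample data from the question
names_vec <- c("Preety ..", "Sudalai Rajkumar S.",
               "Parvathy M. S.", "Navaraj Ranjan Arthur")

# Count runs of letters per name; regmatches() yields character(0) when none match
n_words <- lengths(regmatches(names_vec, gregexpr("[[:alpha:]]+", names_vec)))

# Keep the names that consist of exactly one word
one_word <- n_words == 1
names_vec[one_word]
```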

This seems to work for your example
names = c("Preety ..",
"Sudalai Rajkumar S." ,
"Parvathy M. S.",
"Navaraj Ranjan Arthur")
names[sapply(strsplit(gsub(".","",names,fixed=T)," ",fixed=T),function(x) length(x) == 1)]
[1] "Preety .."

This may be a bit roundabout, but here is a text mining approach. There are definitely more streamlined ways, but I thought some of the concepts in here might also be useful.
# define the data frame
df <- data.frame(Name = c("Preety ..",
"Sudalai Rajkumar S.",
"Parvathy M. S.",
"Navaraj Ranjan Arthur"),
stringsAsFactors = FALSE)
library(tidyverse)
library(tidytext)
# break each name out by words. remove all the periods
df_token <- df %>%
rowid_to_column(var = "name_id") %>%
mutate(Name = str_remove_all(Name, pattern = "\\.")) %>%
unnest_tokens(name_split, Name, to_lower = FALSE)
# find the lines with only one word
df_token %>%
group_by(name_id) %>%
summarize(count = n()) %>%
filter(count == 1) %>%
left_join(df_token) %>%
pull(name_split)
[1] "Preety"

In base R you could use grep:
grep("^\\S+$", gsub("\\W+$", "", names), value=TRUE)
[1] "Preety"
If you need the names as originally given, then you will just use [:
names[grep("^\\S+$", gsub("\\W+$", "", names))]
[1] "Preety .."

Getting a split out of a string into a new column

I'm working on a data.frame, trying to extract the part of a string between , and . and put it into a new column. I would like to use dplyr.
library(dplyr)
name <- c("Cumings, Mrs. John Bradley","Heikkinen, Miss. Laina","Moran, Mr. James","Allen, Mr. William Henry","Futrelle, Mrs. Jacques Heath (Lily May Peel)")
sex <- c("female","female","male","male","female")
age <- c(22,23,24,37,42)
data <- data.frame(name,sex,age)
So I want to extract Mrs, Miss, Mr and so on into their own column.
data %>%
mutate(title = strsplit(name, split = "[,.]")) %>%
select(name,title)

Delete everything outside the span between , and . using gsub(".*, |\\..*", "", name):
library(dplyr)
data %>% mutate(title = gsub(".*, |\\..*", "", name))
gsub(".*, ", "", name): deletes everything before the comma, the comma itself, and the space after it.
gsub("\\..*", "", name): deletes the . and everything after it.
| combines the two patterns into a single gsub call.

str_extract will retrieve the first instance within each string:
library(dplyr)
library(stringr)
data <- data.frame(name,sex,age) %>%
mutate(title = str_extract(name, ",.+\\."),
title = str_replace_all(title, "([[:punct:]]| )", ""))
A slightly more efficient solution:
data %>%
mutate(title = str_trim(str_extract(name, regex("(?<=,).*?(?=\\.)"))))
The (?<=,) says to look after a comma, the (?=\\.) says to look before the period, and the .*? says to grab everything in between. The str_trim removes the leading and trailing whitespace.

I don't have an answer to the dplyr problem.
I just wanted to mention that splitting the salutation from a name this way will probably run into multiple errors with real-world data.
A better (but still error-prone) way is to create a lookup table of common salutations and match against it with a regex.
The advantage over splitting is that if the regex finds no hit, the field stays empty (NA) and can easily be fixed manually, instead of creating inconsistent data in the first step.
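The lookup-table idea could be sketched in base R as follows. The titles vector is a hypothetical, deliberately short table, and the last name is a made-up entry without a salutation to show the NA case:

```r
# Sample names; "Smith, Jane" is a hypothetical entry without a salutation
name <- c("Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina",
          "Moran, Mr. James", "Smith, Jane")

# Hypothetical lookup table of known salutations
titles <- c("Mrs", "Miss", "Mr", "Dr")

# One alternation built from the table; the salutation must be followed by a period
pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")
m <- regexpr(pattern, name, perl = TRUE)
hit <- as.integer(m) > 0

# Names without a known salutation stay NA instead of getting a wrong value
title <- rep(NA_character_, length(name))
title[hit] <- sub("\\.$", "", regmatches(name, m))
title
```

Unknown salutations then show up as NA and can be reviewed by hand or added to the table.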

Without using any external package
data$title <- with(data, sub("^[^,]+,\\s*(\\S+).*", "\\1", name))
data$title
#[1] "Mrs." "Miss." "Mr." "Mr." "Mrs."

Similar to @Benjamin's answer (Base R's equivalent to str_extract_all), here's how to do it using regmatches + gregexpr + positive lookahead:
library(dplyr)
data %>%
mutate(title = regmatches(data$name, gregexpr("\\b[[:alpha:]]+(?=[.])",
data$name, perl = TRUE))) %>%
select(name,title)
Result:
name title
1 Cumings, Mrs. John Bradley Mrs
2 Heikkinen, Miss. Laina Miss
3 Moran, Mr. James Mr
4 Allen, Mr. William Henry Mr
5 Futrelle, Mrs. Jacques Heath (Lily May Peel) Mrs
\\b matches a "word boundary", which in this case is the position between the space and the first letter of the title. perl = TRUE is needed to use the positive lookahead (?=[.]), which essentially says "only if the pattern is followed by a ."

guess something like this: data %>% mutate(title = gsub(".*, |\\..*", "", name))

Gsub apostrophe in data frame R

I need to remove all apostrophes from my data frame but as soon as I use....
textDataL <- gsub("'","",textDataL)
The data frame gets ruined and the new data frame only contains values and NAs, when I am only looking to remove any apostrophes from any text that might be in there? Am I missing something obvious with apostrophes and data frames?

To keep the structure intact:
dat1 <- data.frame(Col1= c("a woman's hat", "the boss's wife", "Mrs. Chang's house", "Mr Cool"),
Col2= c("the class's hours", "Mr. Jones' golf clubs", "the canvas's size", "Texas' weather"),
stringsAsFactors=F)
I would use
dat1[] <- lapply(dat1, gsub, pattern="'", replacement="")
or
library(stringr)
dat1[] <- lapply(dat1, str_replace_all, "'","")
dat1
# Col1 Col2
# 1 a womans hat the classs hours
# 2 the bosss wife Mr. Jones golf clubs
# 3 Mrs. Changs house the canvass size
# 4 Mr Cool Texas weather

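You don't want to apply gsub directly to a data frame, but column-wise instead. Note that apply returns a character matrix, so assign the result back into the data frame to keep its structure, e.g.
textDataL[] <- apply(textDataL, 2, gsub, pattern = "'", replacement = "")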
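When the data frame also has non-text columns, converting everything to character may be unwanted. A sketch that touches only the character columns, using a hypothetical two-column data frame; gsub coerces its input with as.character, which is why passing the whole data frame loses the column structure:

```r
# Hypothetical data frame with one integer and one character column
dat <- data.frame(id = 1:2,
                  txt = c("a woman's hat", "Texas' weather"),
                  stringsAsFactors = FALSE)

# Identify the character columns
is_chr <- vapply(dat, is.character, logical(1))

# Remove apostrophes in those columns only; fixed = TRUE treats ' literally
dat[is_chr] <- lapply(dat[is_chr], gsub, pattern = "'", replacement = "", fixed = TRUE)
dat
```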
