I'm working on a data.frame, trying to extract the part of a string between "," and "." and put it into a new column. I would like to use dplyr.
library(dplyr)
name <- c("Cumings, Mrs. John Bradley","Heikkinen, Miss. Laina","Moran, Mr. James","Allen, Mr. William Henry","Futrelle, Mrs. Jacques Heath (Lily May Peel)")
sex <- c("female","female","male","male","female")
age <- c(22,23,24,37,42)
data <- data.frame(name,sex,age)
So I want to extract Mrs, Miss, Mr and so on into their own column.
data %>%
mutate(title = strsplit(name, split = "[,.]")) %>%
select(name,title)
Delete everything outside the "," and "." using gsub(".*, |\\..*", "", name):
library(dplyr)
data %>% mutate(title = gsub(".*, |\\..*", "", name))
gsub(".*, ", "", name): deletes everything before ,, , itself and space after.
gsub("\\..*", "", name): deletes . and everything after it.
| combines two gsub patterns.
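To see what each alternative does, here is a quick illustration on one of the sample names:
x <- "Cumings, Mrs. John Bradley"
gsub(".*, ", "", x)       # "Mrs. John Bradley" - drops everything up to and including ", "
gsub("\\..*", "", x)      # "Cumings, Mrs"      - drops "." and everything after it
gsub(".*, |\\..*", "", x) # "Mrs"               - both alternatives combined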
str_extract will retrieve the first instance within each string:
library(dplyr)
library(stringr)
data <- data.frame(name,sex,age) %>%
mutate(title = str_extract(name, ",.+\\."),
title = str_replace_all(title, "([[:punct:]]| )", ""))
A slightly more efficient solution:
data %>%
mutate(title = str_trim(str_extract(name, regex("(?<=,).*?(?=\\.)"))))
The (?<=,) is a lookbehind that says to match only after a comma, the (?=\\.) is a lookahead that says to match only before the period, and the .*? grabs everything in between, non-greedily. The str_trim removes the leading and trailing whitespace.
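For example, on a single name (a standalone check of the same pattern):
library(stringr)
str_extract("Moran, Mr. James", regex("(?<=,).*?(?=\\.)"))           # " Mr"
str_trim(str_extract("Moran, Mr. James", regex("(?<=,).*?(?=\\.)")))  # "Mr"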
I don't have an answer for the dplyr part of the problem.
I just wanted to mention that splitting the salutation out of a name this way will probably run into multiple errors with real-world data.
A better (but still error-prone) way to do this is to create a lookup table of common salutations and match it with a regex.
The advantage over splitting the data is that if the regex finds no hit, the result simply stays empty (NA) and can easily be fixed manually, instead of creating inconsistent data in the first step.
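A minimal sketch of that lookup-table idea, assuming a hand-picked set of salutations (the vector here is only an example; extend it to fit your data):
library(dplyr)
library(stringr)
salutations <- c("Mr", "Mrs", "Miss", "Ms", "Dr", "Master", "Rev")  # assumed list, not exhaustive
pattern <- paste0("\\b(", paste(salutations, collapse = "|"), ")\\.")
data %>%
mutate(title = str_extract(name, pattern),   # NA when no salutation matches
title = str_remove(title, "\\."))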
Without using any external package
# ^[^,]+, matches the surname up to and including the comma, \\s* any spaces after it,
# and (\\S+) captures the next non-space token (the title); "\\1" keeps only that capture
data$title <- with(data, sub("^[^,]+,\\s*(\\S+).*", "\\1", name))
data$title
#[1] "Mrs." "Miss." "Mr." "Mr." "Mrs."
Similar to @Benjamin's answer (the base R equivalent of str_extract_all), here's how to do it using regmatches + gregexpr + a positive lookahead:
library(dplyr)
data %>%
mutate(title = regmatches(data$name, gregexpr("\\b[[:alpha:]]+(?=[.])",
data$name, perl = TRUE))) %>%
select(name,title)
Result:
name title
1 Cumings, Mrs. John Bradley Mrs
2 Heikkinen, Miss. Laina Miss
3 Moran, Mr. James Mr
4 Allen, Mr. William Henry Mr
5 Futrelle, Mrs. Jacques Heath (Lily May Peel) Mrs
\\b matches a "word boundary", which in this case is the zero-width position between the space and the first letter of the title. perl = TRUE is needed to use the positive lookahead (?=[.]), which essentially says "only match if the pattern is followed by a ."
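For example, the same pattern applied to a single string outside the pipe:
regmatches("Allen, Mr. William Henry",
gregexpr("\\b[[:alpha:]]+(?=[.])", "Allen, Mr. William Henry", perl = TRUE))
#[[1]]
#[1] "Mr"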
I guess something like this: data %>% mutate(title = gsub(".*, |\\..*", "", name))
Related
I have some strings of text (example below). As you can see each string was split at a period or question mark.
[1]"I am a Mr."
[2]"asking for help."
[3]"Can you help?"
[4]"Thank you ms."
[5]"or mr."
I want to collapse strings where the string ends with an abbreviation like mr. or mrs., so the end result would be the desired output below.
[1]"I am a Mr. asking for help."
[2]"Can you help?"
[3]"Thank you ms. or mr."
I already created a vector (called abbr) containing all my abbreviations in the following format:
> abbr
[1] "Mr|Mrs|Ms|Dr|Ave|Blvd|Rd|Mt|Capt|Maj"
but I can't figure out how to use it in a paste function to collapse. I have also tried using gsub (it didn't work) to replace the newline that follows an abbreviation ending in a period with a space, like this:
lines<-gsub('(?<=abbr\\.\\n)(?=[A-Z])', ' ', lines, perl=FALSE)
We can use tapply to collapse the strings and grepl to create the groups to collapse over.
x <- c("I am a Mr.", "asking for help.","Can you help?","Thank you ms.", "or Mr.")
#Include all the abbreviations with proper cases
#Note that "." has a special meaning in regex so you need to escape it.
abbr <- 'Mr\\.|Mrs\\.|Ms\\.|Dr\\.|mr\\.|ms\\.'
unname(tapply(x, c(0, head(cumsum(!grepl(abbr, x)), -1)), paste, collapse = " "))
#[1] "I am a Mr. asking for help." "Can you help?" "Thank you ms. or mr."
I've used this:
String <- unlist(str_split(Invname,"[ ]",n=2))
To split the names that I have into Surnames and First Names, since the surnames come first. But I cannot figure out how to reassign the split Invname into two lists, so that I can use only the surnames for the rest of my project. Right now I have this:
" [471] "KRUEGER" "MARCUS" "
And I would like to have the left side only assigned to a new variable, so that I can work further with mining the surnames for information.
Using the data in nate.edwinton's answer, there is no need to unlist.
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
String <- stringr::str_split(Invnames, "[ ]", n = 2)
Surnames <- sapply(String, '[', 1)
Firstnames <- sapply(String, '[', 2)
data.frame(Surnames, Firstnames)
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
As mentioned in the comments, it would be easier to help if you provided some data. Anyway, here might be a solution:
Assuming that Invnames is a vector where for every first name there is (exactly) one last name, you could do the following:
# data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
# extraction
String <- unlist(stringr::str_split(Invnames,"[ ]",n=2))
# saving first and last names
lastNames <- String[seq(1,length(String),2)]
firstNames <- String[seq(2,length(String),2)]
# yields
> cbind(lastNames,firstNames)
lastNames firstNames
[1,] "Krueger" "Markus"
[2,] "Doe" "John"
[3,] "Tatum" "Jayson"
Here is some sample data and a suggested solution. Data modified from @Rui Barradas' answer:
Invnames <- c("Krueger.$Markus","Doe.John","Tatum.Jayson")
# split on one or more non-word characters and keep the first piece (the surname)
sapply(strsplit(Invnames, "\\W+"), "[", 1)
#[1] "Krueger" "Doe"     "Tatum"
Again using data from an earlier answer, this time with tidyr::separate via the tidyverse:
library(tidyverse)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
Invnames <- data.frame(Invnames)
Invnames %>%
separate(Invnames, c('Surname', 'FirstName'), sep=" ")
Surname FirstName
1 Krueger Markus
2 Doe John
3 Tatum Jayson
With base R, we can make use of read.table/read.csv to separate the string into columns
read.table(text = Invnames, header = FALSE, col.names = c("Surnames", "Firstnames"))
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
If only names were so straightforward! If there are few complications between strings then yes, the answers below are good options. In my experience with name lists we get hyphenated names (in both "first" and "last" names), middle names, titles and shortened name forms (Dr., Mr, Md), and many other variants. I first try to clean the strings before any splitting.
Here is just one idea using dplyr (explicit code provided for clarity)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson", "Taylor - Cline Jeff", "Davis - Freud Melvin- John")
df <- as.data.frame(Invnames, Invnames = Invnames) %>%
mutate(Invnames2 = gsub("- ","-",Invnames)) %>%
mutate(Invnames2 = gsub(" -","-",Invnames2)) %>%
mutate(surname = gsub(" .*", "", Invnames2))
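For reference, the surname column this yields keeps the hyphenated surnames intact:
df$surname
#[1] "Krueger"      "Doe"          "Tatum"        "Taylor-Cline" "Davis-Freud"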
I've been searching for hours. This should be very easy but I don't see how :(
I have a dataframe called ds that contains a column structured like:
name
"Doe, Mr. John"
"Worth, Miss. Jane"
I want to extract the middle word and put it into a new column.
#This is how I'm doing it now
ds$title <- NA
mr <- grep(", Mr. ", ds$name)
miss <- grep(", Miss. ", ds$name)
ds$title[mr] <- ", Mr. "
ds$title[miss] <- ", Miss. "
I'm trying to generalize this with regex so that it'll take any middle word matching the pattern of "comma space word period space"
This is my best guess but it only removes the pattern:
gsub(", .+\\.+ ", "", ds$name)
How do I keep the pattern and remove the rest?
Thank you!
You can use a capture group. Basically, you match the whole pattern, use a capture group to match the part you want to keep, and replace the whole match with the capture group:
# I often specify perl = TRUE, though it isn't necessary here
(ds$title <- gsub(".+(, .+\\.+ ).+", "\\1", ds$name, perl = TRUE))
#[1] ", Mr. " ", Miss. "
The capture group is what's in the parentheses ((, .+\\.+ )), and you refer back to it with \\1. If you had a second capture group, you'd refer to it as \\2.
Note that if you want to catch comma, space, word, period, space, then you could modify the capture group to (, .+\\. ). You only need to match one period, not one or more.
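For reference, the single-period variant from that note gives the same result here:
gsub(".+(, .+\\. ).+", "\\1", ds$name)
#[1] ", Mr. "   ", Miss. "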
A straightforward stringi alternative that does not use capture groups is stri_extract_first_regex (in this case stri_extract_last_regex or stri_extract_all_regex would work just as well):
library(stringi)
ds$title <- stri_extract_first_regex(ds$name, ", .+\\. ")
#[1] ", Mr. " ", Miss. "
and as thelatemail pointed out in a comment you can do a similar thing with base R, too, but it's a little harder to remember how to use the regmatches and regexpr functions:
regmatches(ds$name, regexpr(", .+\\. ", ds$name))
#[1] ", Mr. " ", Miss. "
Matched capture groups are your BFF:
library(stringi)
library(purrr)
ds <- data.frame(name=c("Doe, Mr. John", "Worth, Miss. Jane"), stringsAsFactors=FALSE)
nonsp <- "[[:alnum:][:punct:]]+"
sp <- "[[:blank:]]+"
stri_match_all_regex(ds$name, nonsp %s+% sp %s+% "(" %s+% nonsp %s+% ")" %s+% sp %s+% nonsp) %>%
map_chr(2)
## [1] "Mr." "Miss."
For your "add column to a data frame" needs:
library(stringi)
library(dplyr)
library(purrr)
ds <- data.frame(name=c("Doe, Mr. John", "Worth, Miss. Jane"), stringsAsFactors=FALSE)
nonsp <- "[[:alnum:][:punct:]]+"
sp <- "[[:blank:]]+"
mutate(ds, title=stri_match_all_regex(ds$name, nonsp %s+% sp %s+% "(" %s+% nonsp %s+% ")" %s+% sp %s+% nonsp) %>% map_chr(2))
## name title
## 1 Doe, Mr. John Mr.
## 2 Worth, Miss. Jane Miss.
I'm rearranging Simpsons names with R to follow a first name, last name format, but there are large spaces between the names. Is it possible to remove the spaces outside the quoted names?
library(stringr)
simpsons <- c("Moe Syzlak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy", "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
splitname <- sapply(sapply(str_split(simpsons, ","), str_trim), rev)
for (i in 1:length(splitname)) {
splitname[i] <- paste(unlist(splitname[i]), collapse = " ")
}
splitname <- unlist(splitname)
If we need to rearrange to first name followed by last name, we can use sub. We capture one or more characters that are not a , in a group, followed by , and zero or more spaces (\\s*), then capture one or more characters that are not a , as the 2nd group, and in the replacement reverse the backreferences to get the output.
sub("([^,]+),\\s*([^,]+)", "\\2 \\1", simpsons)
#[1] "Moe Syzlak" "C. Montgomery Burns" "Rev. Timothy Lovejoy" "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
I need to remove all apostrophes from my data frame but as soon as I use....
textDataL <- gsub("'","",textDataL)
The data frame gets ruined and the new object only contains values and NAs, when all I want is to remove any apostrophes from whatever text is in there. Am I missing something obvious about apostrophes and data frames?
To keep the structure intact:
dat1 <- data.frame(Col1= c("a woman's hat", "the boss's wife", "Mrs. Chang's house", "Mr Cool"),
Col2= c("the class's hours", "Mr. Jones' golf clubs", "the canvas's size", "Texas' weather"),
stringsAsFactors=F)
I would use
dat1[] <- lapply(dat1, gsub, pattern="'", replacement="")
or
library(stringr)
dat1[] <- lapply(dat1, str_replace_all, "'","")
dat1
# Col1 Col2
# 1 a womans hat the classs hours
# 2 the bosss wife Mr. Jones golf clubs
# 3 Mrs. Changs house the canvass size
# 4 Mr Cool Texas weather
You don't want to apply gsub directly to a data frame (that coerces the whole data frame to a character vector, which is what mangles it), but column-wise instead, e.g.
apply(textDataL, 2, gsub, pattern = "'", replacement = "")
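Note that apply() returns a character matrix here; if you need a data frame back, you can wrap the result (a small sketch using the dat1 example from the answer above):
cleaned <- as.data.frame(apply(dat1, 2, gsub, pattern = "'", replacement = ""),
stringsAsFactors = FALSE)
str(cleaned)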