Extract last word in string before the first comma - r

I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to split into first name, last name and titles. In this data, the titles (if there are any) always start after the first comma.
I am trying to split the list into:
FirstName  LastName  Titles
Mark       Owens     M.D.,M.P.H
Lara       Kraft     -
Dale       Good      C.P.A
Thanks in advance.
Here is my sample code:
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )
You can see that with this code, Mr. Owens is not behaving: his title is taken from after the last comma instead of the first, and his last name is taken from the P in M.P.H. As you can tell, I referred to "Extract last word in string in R", "Extract 2nd to last word in string" and "Extract last word in a string after comma if there are multiple words else the first word".

You were off to a good start, so you can pick up from there. The firstnames variable is fine as written. For lastnames, the inner sub() strips everything from the first comma onward, so the last name is simply the final word of what remains. For titles there are two steps: first remove everything up to and including the first comma, then replace the entries that had no comma (and therefore came back unchanged) with a hyphen.
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
titles <- sub(".*?,\\s*", "", namelist)  # also drop the space after the first comma
titles <- ifelse(titles == namelist, "-", titles)
names <- data.frame(firstnames , lastnames, titles )
  firstnames lastnames       titles
1       Mark     Owens M.D., M.P.H.
2       Dale      Good        C.P.A
3       Lara     Kraft            -
4     Roland      Bass          III
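The steps above can also be wrapped in one helper. This is only a sketch: the name split_names is made up for illustration, and it assumes every entry has the form "First ... Last" optionally followed by ", titles".

```r
# Hypothetical helper (not from the original answer): split each entry into
# first name, last name and titles, using the first comma as the divider.
split_names <- function(x) {
  before_comma <- sub(",.*$", "", x)                 # name part only
  first <- sub("^(\\w+).*$", "\\1", before_comma)    # first word
  last  <- sub("^.*?(\\w+)\\W*$", "\\1", before_comma, perl = TRUE)  # last word
  has_title <- grepl(",", x, fixed = TRUE)
  # drop everything up to the first comma (and the space after it);
  # entries without a comma get "-"
  titles <- ifelse(has_title, sub("^[^,]*,\\s*", "", x), "-")
  data.frame(firstnames = first, lastnames = last, titles = titles,
             stringsAsFactors = FALSE)
}

namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A",
              "Lara T. Kraft", "Roland G. Bass, III")
res <- split_names(namelist)
print(res)
```

Testing for a comma with grepl() avoids comparing the result against the input, which makes the "no title" case a little more explicit.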

This should do the trick, at least on the test data:
x <- strsplit(namelist, split = ",")
# strip the space left behind after each comma
x <- rapply(x, function(s) sub("^ ", "", s), how = "replace")
names <- sapply(x, function(y) y[[1]])
# everything after the first comma is a title; "" when there is none
titles <- sapply(x, function(y) if (length(y) > 1) paste(y[-1], collapse = ",") else "")
names <- strsplit(names, split = " ")
firstnames <- sapply(names, function(y) y[[1]])
# take the last word rather than y[[3]], so names without a middle initial also work
lastnames <- sapply(names, function(y) y[[length(y)]])
names <- data.frame(firstnames, lastnames, titles)
names
In cases like this, where the strings always have the same structure, it is easier to use functions like strsplit() to extract the desired parts.

Related

Choose a pattern which will select only WHOLE words which start with an r, s, or t regardless of case

I don't know what to put for ptrn
Choose a pattern which will select only WHOLE words which start with an r, s, or t regardless of case.
ptrn <- "" # EDIT THIS LINE
reg <- gregexpr(ptrn, plath) # DO NOT EDIT THIS LINE
(rst_words <- Reduce("c",regmatches(x = plath, m = reg))) # DO NOT EDIT THIS LINE
Try:
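ptrn <- "\\b[rstRST]\\w+"
\\b is a word boundary, [rstRST] matches one letter from the set at the start of the word, and \\w+ matches the remaining word characters. Because \\w+ needs at least one character, a one-letter word is skipped; use \\w* instead if those should match too.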
See the regex working at Regex101
You did not share an example, but you can try grep() after splitting the string into words.
x <- "Random text as an example reading where it ended"
grep("^[RST]",strsplit(x, " ")[[1]], value = TRUE, ignore.case = TRUE)
#[1] "Random" "text" "reading"
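To see the word-boundary pattern run inside the exercise's own scaffold, here is a small sketch. The question never shows plath, so the sentence below is an invented stand-in.

```r
# `plath` is an assumption: the original text is not shown in the question.
plath <- "Roses are red, the sky is Tall; stars rise, so they say."

ptrn <- "\\b[rstRST]\\w+"
reg <- gregexpr(ptrn, plath)
rst_words <- Reduce("c", regmatches(x = plath, m = reg))
print(rst_words)

# Same idea with a lower-case class and ignore.case = TRUE
reg_ic <- gregexpr("\\b[rst]\\w+", plath, ignore.case = TRUE)
rst_words_ic <- Reduce("c", regmatches(x = plath, m = reg_ic))
```

Words such as "are" and "is" are skipped because their r or s is not at a word boundary.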

How to develop a function that accepts a character vector corresponding to a column of a data frame?

This is my current dataset called details.
> details$names<- c("James Johnson","Michael Jones","Robert Miller","Christopher Smith","Richard Nolan","Constantine Wilson","Mountabatteen Keizman")
I want to extract the part of names considering these 2 aspects:
1) Starting from the left, extract all characters until a space or a hyphen (or minus sign) is reached.
2) Extract no more than ten characters.
I tried to do this by using this code:
> abrevStrings<- function(details$names)
{
gsub("([a-z])([A-Z])","([a-z])([A-Z])<= 10",details$names)
}
But I didn't get the output I wanted.
My desired output can be seen below:
James
Michael
Robert
Christophe
Richard
Constantin
Mountabatt
One way is to use sub() and substr(): remove everything from the first whitespace or hyphen onward, then keep only the first 10 characters.
abrevStrings <- function(x) {
substr(sub("\\s+.*|-.*", "", x), 1, 10)
}
abrevStrings(details$names)
#[1] "James" "Michael" "Robert" "Christophe" "Richard"
# "Constantin" "Mountabatt"
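The sample data contains no hyphenated names, so the following sketch checks that case with a few invented inputs (the vector extra is not from the question). It also shows a one-step alternative with regexpr() and regmatches(), offered as an assumption-level variant rather than the answer's method.

```r
# The function from the answer, repeated so the example is self-contained
abrevStrings <- function(x) {
  substr(sub("\\s+.*|-.*", "", x), 1, 10)
}

# Invented test inputs: a hyphenated name, a long name and a single word
extra <- c("Anne-Marie Smith", "Mountabatteen Keizman", "Cher")
two_step <- abrevStrings(extra)
print(two_step)

# One-step variant: match up to 10 leading characters that are
# neither a space nor a hyphen
one_step <- regmatches(extra, regexpr("^[^ -]{1,10}", extra))
print(one_step)
```

Note that regmatches() silently drops elements with no match (for example an empty string), so the two-step version is the safer choice when the result must stay the same length as the input.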
Or another option is to split the strings on whitespace or hyphen and take the substring of the first part of the string.
sapply(strsplit(details$names, "\\s+|-"), function(x) substr(x[1], 1, 10))
data
details <- data.frame(names = c("James Johnson","Michael Jones","Robert Miller",
"Christopher Smith","Richard Nolan","Constantine Wilson",
"Mountabatteen Keizman"), stringsAsFactors = FALSE)

Limiting word count in a character column in R and saving extra words in another variable [duplicate]

I have a string in R as
x <- "The length of the word is going to be of nice use to me"
I want the first 10 words of the above specified string.
Also for example I have a CSV file where the format looks like this :-
Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston
I want to get only the first 10 words from the column 'Keyword' for each row and write it onto a CSV file.
Please help me in this regards.
Regular expression (regex) answer using \w (word character) and its negation \W:
gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
^ Beginning of the string (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by non-word characters
  (\\w+\\W+){9} A word followed by non-word characters, 9 times
    \\w+ One or more word characters (i.e. a word)
    \\W+ One or more non-word characters (e.g. a space)
    {9} Nine repetitions
  \\w+ The tenth word
.* Anything else, including any following words
$ End of the string (zero-width)
\\1 When the pattern matches, replace the match with the first captured group (the 10 words)
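The breakdown above generalizes to any word count by building the pattern with sprintf(). This is a sketch only; the function name first_n_words is made up for illustration.

```r
# Hypothetical wrapper around the regex approach: keep the first n words.
first_n_words <- function(x, n = 10) {
  # (n - 1) repetitions of "word + separator", followed by the n-th word
  pattern <- sprintf("^((\\w+\\W+){%d}\\w+).*$", n - 1)
  sub(pattern, "\\1", x)
}

x <- "The length of the word is going to be of nice use to me"
print(first_n_words(x))        # first ten words
print(first_n_words(x, 3))     # first three words

# A string with fewer than n words does not match and is returned unchanged
short <- first_n_words("only three words")
```

Keep in mind that \w does not include apostrophes or hyphens, so a token like "don't" counts as two words under this definition.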
How about using the word function from Hadley Wickham's stringr package?
library(stringr)
word(string = x, start = 1, end = 10, sep = fixed(" "))
Here is a small function that splits the string into words, keeps the first ten and pastes them back together.
string_fun <- function(x) {
  # head() avoids the NA padding that [1:10] produces for strings with fewer than ten words
  ul <- head(unlist(strsplit(x, split = "\\s+")), 10)
  paste(ul, collapse = " ")
}
string_fun(x)
df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
df <- as.data.frame(df)
Using apply (the second column does not affect the result here, because each keyword already has more than ten words):
df$Keyword <- apply(df[,1:2], 1, string_fun)
EDIT
Probably this is a more general way to use the function.
df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))
print(df)
# Keyword City.Column.Header.
# 1 The length of the string should not be more than New York
# 2 The Keyword should be of specific length is or are Los Angeles
# 3 This is an experimental basis program string is or Seattle
# 4 Please help me with getting only the first ten Boston
x <- "The length of the word is going to be of nice use to me"
# strsplit() returns a list, so take its first element before calling head()
head(strsplit(x, split = " ")[[1]], 10)
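Since the question also asks to keep the extra words in another variable and to write the result to a CSV file, here is a base R sketch of that step. The column name Extra, the two-row sample data and the temporary output file are all assumptions made for illustration.

```r
# Small stand-in for the CSV contents (invented sample rows)
df <- data.frame(
  Keyword = c("The length of the string should not be more than 10 is or are in",
              "Please help me"),
  City = c("New York", "Boston"),
  stringsAsFactors = FALSE
)

# Split every keyword into words once, then build both columns from it
words <- strsplit(df$Keyword, "\\s+")
df$Keyword <- vapply(words, function(w) paste(head(w, 10), collapse = " "),
                     character(1))
# tail(w, -10) drops the first ten words; it is empty for short strings
df$Extra <- vapply(words, function(w) paste(tail(w, -10), collapse = " "),
                   character(1))

# Write the result; tempfile() is used here only to keep the example tidy
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
print(df)
```

For real data, replace the data.frame() call with read.csv() on your file and point write.csv() at the desired output path.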

R list within matrix to dataframe conversion

R struggles. I am using the following to extract quotations from text, with multiple results on a large dataset. I am trying to get the output as a character string within a data frame, so I can easily share it as a CSV with others.
Sample data:
normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)
Using the following to extract quotations and a buffer of characters:
result <-function(testdata) {
str_extract_all(testdata, '[^\"]?{15}"[^\"]+"[^\"]?{15}')
}
extract <- sapply(testdata, FUN=result)
The extract is a list within a matrix. However, I want the extract to be a character string that I can later merge to a dataframe as a column. How do I convert this?
Code
normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)
# extract quotations
gsub(pattern = "[^\"]*((?:\"[^\"]*\")|$)", replacement = "\\1 ", x = testdata)
Output
[1] "\"I am a test,\" "
[2] "\"Would never happen.\" "
[3] "\"quote\" "
[4] "\"I said this,\" "
[5] "\"No,\" \"I do not like green eggs and ham.\" "
Explanation
pattern = "[^\"]" matches any single character except a double quote
pattern = "[^\"]*" matches zero or more characters that are not double quotes
pattern = "\"[^\"]*\"" matches a double quote, then zero or more characters that are not double quotes, then another double quote (i.e. a quotation)
pattern = "(?:\"[^\"]*\")" matches a quotation but does not capture it
pattern = "((?:\"[^\"]*\")|$)" matches a quotation or the end of the string and captures it. Note that this is the first group we capture
replacement = "\\1 " replaces each match with the first captured group followed by a space
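To get from a list of matches to a single character column, one option is to collapse each list element with paste(). The sketch below uses base R's gregexpr() and regmatches() instead of str_extract_all(), and the " | " separator and column names are arbitrary choices; a list returned by str_extract_all() could be collapsed the same way.

```r
testdata <- c('He said, "I am a test," very quickly.',
              'This is a long quote, which we said, "Would never happen."',
              'A "quote" yo',
              '"I said this," he said quickly',
              'When asked, "No," said Sam "I do not like green eggs and ham."')

# One list element per input string, each holding that string's quotations
quotes <- regmatches(testdata, gregexpr('"[^"]*"', testdata))

# Collapse every element to a single string so it fits in one column
quote_col <- vapply(quotes, paste, character(1), collapse = " | ")

out <- data.frame(text = testdata, quotes = quote_col,
                  stringsAsFactors = FALSE)
print(out$quotes)
```

Because quote_col is a plain character vector of the same length as the input, it can be added to an existing data frame or written out with write.csv() directly.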
