I have a list of names and would like to extract the last name of each individual. The complication is that some of the entries have middle names, some have nicknames, etc. Here's my example, building off of this question, but changing the formatting to reflect my situation:
df <- c("bob smith","mary ann d. jane","jose chung","michael mike marx","charlie m. ivan")
To get the first names, I use the following:
firstnames <- sapply(strsplit(df, " "), '[',1)
Is there any way to get the element in "final" position, however? Thanks in advance.
> lastnames <- sapply(strsplit(df, " "), tail, 1)
>
> lastnames
[1] "smith" "jane" "chung" "marx" "ivan"
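As a rough sketch of two equivalent variants (not part of the answer above): vapply states the expected result type explicitly, and sub with a greedy pattern avoids splitting altogether. The names parts, lastnames_v and lastnames_rx are made up for this illustration, and both variants assume the words are separated by whitespace.

```r
# same example vector as in the question
df <- c("bob smith", "mary ann d. jane", "jose chung", "michael mike marx", "charlie m. ivan")

# variant 1: split on spaces and take the last piece of each name
parts <- strsplit(df, " ", fixed = TRUE)
lastnames_v <- vapply(parts, function(p) p[length(p)], character(1))

# variant 2: delete everything up to and including the last whitespace
lastnames_rx <- sub(".*\\s", "", df)

lastnames_v
# [1] "smith" "jane"  "chung" "marx"  "ivan"
```

vapply raises an error if a piece is not a single string, which can catch unexpected input earlier than sapply would.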
This is my current dataset called details.
> details$names<- c("James Johnson","Michael Jones","Robert Miller","Christopher Smith","Richard Nolan","Constantine Wilson","Mountabatteen Keizman")
I want to extract the part of names considering these 2 aspects:
1) Starting from the left, extract all characters until a space or a hyphen (or minus sign) is reached.
2) Extract no more than ten characters.
I tried to do this by using this code:
> abrevStrings<- function(details$names)
{
gsub("([a-z])([A-Z])","([a-z])([A-Z])<= 10",details$names)
}
But I didn't get the output I wanted.
My desired output can be seen below:
James
Michael
Robert
Christophe
Richard
Constantin
Mountabatt
One way would be to use sub and substr: remove everything after the first whitespace or hyphen, then keep only the first 10 characters.
abrevStrings <- function(x) {
substr(sub("\\s+.*|-.*", "", x), 1, 10)
}
abrevStrings(details$names)
#[1] "James" "Michael" "Robert" "Christophe" "Richard"
# "Constantin" "Mountabatt"
Or another option is to split the strings on whitespace or hyphen and take the substring of the first part of the string.
sapply(strsplit(details$names, "\\s+|-"), function(x) substr(x[1], 1, 10))
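Both rules can also be folded into a single pattern. This is only a sketch, assuming every name starts with at least one character that is neither whitespace nor a hyphen (regmatches silently drops elements that have no match). The vector x, including the hyphenated name, and the name first10 are illustrative and not taken from the question.

```r
# illustrative input; the last entry is a made-up hyphenated name
x <- c("James Johnson", "Christopher Smith", "Mountabatteen Keizman", "Jean-Luc Picard")

# up to 10 leading characters that are neither whitespace nor a hyphen
first10 <- regmatches(x, regexpr("^[^[:space:]-]{1,10}", x))
first10
# [1] "James"      "Christophe" "Mountabatt" "Jean"

# cross-check against the sub/substr approach
via_sub <- substr(sub("\\s+.*|-.*", "", x), 1, 10)
```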
data
details <- data.frame(names = c("James Johnson","Michael Jones","Robert Miller",
"Christopher Smith","Richard Nolan","Constantine Wilson",
"Mountabatteen Keizman"), stringsAsFactors = FALSE)
I have a character vector in which I'd like to match a specific string, collapse the element containing that match with only the next element of the vector, and continue this way until the vector ends. Here is one example:
'"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'
Combining each element that ends in a : with only the element that follows it would be ideal, but I've struggled with the paste function because it collapses the entire vector into one element, which is not the targeted result I'm looking for.
Here's an example of what I'd like a portion of the revised output to look like:
"Inception Share Price:$15.00"
Here is something that might help:
First split using strsplit, then bind elements that belong together
# split the string
vec <- unlist(strsplit(string, '(?=\")(?=\")', perl = TRUE))
vec <- vec[! vec %in% c(' ', '\"')]
# this is what vec looks like now
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA"
# [6] "NAV Ticker:"
#
# now paste the elements
ind <- grepl(':.+',vec)
tmp <- vec[!ind]
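# build a new vector (complete elements first): assigning the shorter pasted
# vector into vec[!ind] would silently recycle it and duplicate entries
vec <- c(vec[ind], paste0(tmp[seq(1,length(tmp),2)], tmp[seq(2,length(tmp),2)]))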
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:Tax-Free Income-Pennsylvania" "Ticker:MPA" "NAV Ticker:XMPAX"
# [5] "Average Daily Volume (shares):26,000" "Average Daily Volume (USD):$0.335M"
with the data
string = "\"FundSponsor:Blackrock Advisors\" \"Category:\" \"Tax-Free Income-Pennsylvania\" \"Ticker:\" \"MPA\" \"NAV Ticker:\" \"XMPAX\" \"Average Daily Volume (shares):\" \"26,000\" \"Average Daily Volume (USD):\" \"$0.335M\" \"Inception Date:\" \"10/30/1992\" \"Inception Share Price:\" \"$15.00\" \"Inception NAV:\" \"$14.18\" \"Tender Offer:\" \"No\" \"Term:\" \"No\""
Explanation
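The regex (?=\")(?=\") is built from lookaheads. The syntax (?=*something*) matches a position that is immediately followed by *something* without consuming any characters, so the pattern matches the empty position in front of every \" (writing the lookahead twice is redundant; a single (?=\") matches the same positions). When such a zero-length match sits at the start of the remaining text, strsplit splits off a single character, so the string ends up cut both before and after each \".
The strsplit(...) above therefore also creates elements of the form \" and ' ' ('\"Category:\" \"...' becomes the vector '\"';'Category:';'\"';' ';'...'). So by using ! vec %in% c(...) we remove those unwanted elements.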
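To see the splitting step on a small input, here is a minimal sketch with a single lookahead. The string s and the names pieces and kept are made up for the illustration.

```r
# a tiny input in the same shape as the real data: "a:" "b"
s <- "\"a:\" \"b\""

# split at every position that is followed by a double quote
pieces <- strsplit(s, "(?=\")", perl = TRUE)[[1]]

# the quotes and the separating space come out as elements of their own,
# so they are filtered out afterwards
kept <- pieces[!pieces %in% c(" ", "\"")]
kept
# [1] "a:" "b"
```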
Addendum
If the data contain keys of the form "string:" whose value is just " ", remove the line vec <- vec[! vec %in% c(' ', '\"')] from the code above and add the following lines instead:
vec <- vec[seq(2L, length(vec), 4L)]
vec[vec == ' '] <- NA_character_
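The pairing step itself (joining each element that ends in a colon with the element right after it) can also be sketched so that the original order is kept and complete "key:value" elements may appear anywhere. This assumes the vector is already split and cleaned, that no key is the last element, and that no two keys are adjacent; the names is_key, is_val and out are illustrative.

```r
vec <- c("FundSponsor:Blackrock Advisors", "Category:", "Tax-Free Income-Pennsylvania",
         "Ticker:", "MPA", "Inception Share Price:", "$15.00")

is_key <- grepl(":$", vec)              # elements ending in ":" still need a value
is_val <- c(FALSE, head(is_key, -1))    # the element directly after each key

out <- vec
out[is_key] <- paste0(vec[is_key], vec[is_val])  # glue each key to its value
out <- out[!is_val]                              # drop the now-merged values
out
# [1] "FundSponsor:Blackrock Advisors"        "Category:Tax-Free Income-Pennsylvania"
# [3] "Ticker:MPA"                            "Inception Share Price:$15.00"
```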
I am not sure if you want the outcome to be one single key: value format or if you just want to clean that long string and have it in the following format key1: value1 key2: value2 key3: value3. If this is the case, you can achieve it via the following code:
char = '"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'
char_tidy = gsub('\\" \\"', " ", char)
# output is below
> char_tidy
[1] "\"FundSponsor:Blackrock Advisors Category: Tax-Free Income-Pennsylvania Ticker: MPA NAV Ticker: XMPAX Average Daily Volume (shares): 26,000 Average Daily Volume (USD): $0.335M Inception Date: 10/30/1992 Inception Share Price: $15.00 Inception NAV: $14.18 Tender Offer: No Term: No\""
I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to sort to first name, last names and titles. With this data, titles always start after the first comma, if there is a title.
I am trying to sort the list into:
FirstName   LastName   Titles
Mark        Owens      M.D.,M.P.H
Lara        Kraft      -
Dale        Good       C.P.A
Thanks in advance.
Here is my sample code:
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )
You can see that with this code, Mr. Owens is not behaving. His title starts after the last comma, and the last name begins from P. You can tell that I referred to Extract last word in string in R, Extract 2nd to last word in string and Extract last word in a string after comma if there are multiple words else the first word
You were off to a good start so you should pick up from there. The firstnames variable was good as written. For lastnames I used a modified name list. Inside of the sub function is another that eliminates everything after the first comma. The last name will then be the final word in the string. For titles there is a two-step process of first eliminating everything before the first comma, then replacing non-matched strings with a hyphen -.
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
titles <- sub(".*?,", "", namelist)
titles <- ifelse(titles == namelist, "-", titles)
names <- data.frame(firstnames , lastnames, titles )
  firstnames lastnames       titles
1       Mark     Owens M.D., M.P.H.
2       Dale      Good        C.P.A
3       Lara     Kraft            -
4     Roland      Bass          III
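The same idea (everything after the first comma is the title, the last word before it is the last name) can also be sketched without nested regular expressions by locating the first comma directly. This is only an illustration under the assumption stated in the question that titles always start after the first comma; the names comma, has_title, name_part, words and result are made up here.

```r
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A",
              "Lara T. Kraft", "Roland G. Bass, III")

# position of the first comma in each entry, -1 if there is none
comma <- as.integer(regexpr(",", namelist, fixed = TRUE))
has_title <- comma > 0

# text before the first comma is the name, text after it is the title
name_part <- ifelse(has_title, substr(namelist, 1, comma - 1), namelist)
titles <- ifelse(has_title,
                 trimws(substr(namelist, comma + 1, nchar(namelist))),
                 "-")

# first and last word of the name part
words <- strsplit(name_part, " +")
firstnames <- vapply(words, function(w) w[1], character(1))
lastnames <- vapply(words, function(w) w[length(w)], character(1))

result <- data.frame(firstnames, lastnames, titles, stringsAsFactors = FALSE)
```

trimws removes the space left behind after the comma, so the titles come out as "M.D., M.P.H.", "C.P.A", "-" and "III".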
This should do the trick, at least on test data:
x=strsplit(namelist,split = ",")
x=rapply(object = x,function(x) gsub(pattern = "^ ",replacement = "",x = x),how="replace")
names=sapply(x,function(y) y[[1]])
titles=sapply(x,function(y) if(length(unlist(y))>1){
paste(na.omit(unlist(y)[2:length(unlist(y))]),collapse = ",")
}else{""})
names=strsplit(names,split=" ")
firstnames=sapply(names,function(y) y[[1]])
lastnames=sapply(names,function(y) y[[3]])
names <- data.frame(firstnames, lastnames, titles )
names
In cases like this, when the structure of the strings is always the same, it is easier to use functions like strsplit() to extract the desired parts.
I have a list of 120777 records which contains names of people. I want to store an array of name parts for each record in the dataset. I tried this in R.
my_list$name_parts<- strsplit(my_list$name, " ")
I get my_list$name_parts as a list of 120777 items. When I try querying the number of words in each name using length(my_list$name_parts), I get 120777 for all.
Let's use this simple example:
my_list <- list()
my_list$name <- c("toto t. tutu", "foo bar")
To get the number of words, you can do this:
lapply(strsplit(my_list$name," "), length)
which gives in the simple example above:
[[1]]
[1] 3
[[2]]
[1] 2
To avoid getting a list, you can even do:
unlist(lapply(strsplit(my_list$name," "), length))
[1] 3 2
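A shorter sketch of the same count, assuming R 3.2.0 or later where base R provides lengths(): length() reports how many elements the list has, while lengths() reports the length of each element. The names n_names and n_words are illustrative.

```r
my_list <- list()
my_list$name <- c("toto t. tutu", "foo bar")
my_list$name_parts <- strsplit(my_list$name, " ")

n_names <- length(my_list$name_parts)   # number of list elements (names)
n_words <- lengths(my_list$name_parts)  # number of words in each name
n_words
# [1] 3 2
```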
I have fullname data that I have used strsplit() to get each element of the name.
# Dataframe with a `names` column (complete names)
df <- data.frame(
names =
c("Adam, R, Goldberg, MALS, MBA",
"Adam, R, Goldberg, MEd",
"Adam, S, Metsch, MBA",
"Alan, Haas, MSW",
"Alexandra, Dumas, Rhodes, MA",
"Alexandra, Ruttenberg, PhD, MBA"),
stringsAsFactors=FALSE)
# Add a column with the split names (it is actually a list)
df$splitnames <- strsplit(df$names, ', ')
I also have a list of degrees below
degrees<-c("EdS","DEd","MEd","JD","MS","MA","PhD","MSPH","MSW","MSSA","MBA",
"MALS","Esq","MSEd","MFA","MPA","EdM","BSEd")
I would like to get the intersection for each name and respective degrees.
I'm not sure how to flatten the name list so I can compare the two vectors using intersect. When I tried unlist(df$splitname,recursive=F) it returned each element separately. Any help is appreciated.
Try
df$intersect <- lapply(X=df$splitnames, FUN=intersect, y=degrees)
That will give you a list with the intersection of each element of df$splitnames and degrees (e.g. intersect(df$splitnames[[1]], degrees)). If you want it as a vector:
sapply(X=df$intersect, FUN=paste, collapse=', ')
I assume you need it as a vector, since possibly the complete names came from one (for instance, from a dataframe), but strsplit outputs a list.
Does that work? If not, please try to clarify your intention.
Good luck!
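For reference, a compact sketch of the per-row intersection that goes straight to character columns. It uses a shortened copy of the data and of the degrees vector, and the column names degrees_found and person are invented for this illustration.

```r
df <- data.frame(
  names = c("Adam, R, Goldberg, MALS, MBA", "Alan, Haas, MSW",
            "Alexandra, Dumas, Rhodes, MA"),
  stringsAsFactors = FALSE)
degrees <- c("MA", "MBA", "MALS", "MSW", "MEd", "PhD")  # shortened list

splitnames <- strsplit(df$names, ", ", fixed = TRUE)

# degrees found in each row, collapsed to one string per row
df$degrees_found <- vapply(splitnames,
                           function(p) paste(intersect(p, degrees), collapse = ", "),
                           character(1))
# the remaining parts are the person's name
df$person <- vapply(splitnames,
                    function(p) paste(setdiff(p, degrees), collapse = " "),
                    character(1))
```

Note that setdiff and intersect drop duplicates, so a name part repeated within one row would appear only once.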
To flatten the whole list into a single vector, you can use unlist:
hh <- unlist(df$splitnames)
intersect(hh,degrees)
For example (note the trailing space in "MBA ", which prevents it from matching):
ll <- list(c("Adam" , "R" , "Goldberg" ,"MALS" , "MBA "),
c("Adam" , "R" , "Goldberg", "MEd" ))
hh <- unlist(ll)
intersect(hh,degrees)
[1] "MALS" "MEd"
or, equivalently here:
hh[hh %in% degrees]
[1] "MALS" "MEd"
To get the differences you can use
setdiff(hh,degrees)
[1] "Adam" "R" "Goldberg" "MBA "
...