I need to extract the first word (German) from the following text string
substr(details[1],0,50)%>%
+ gsub("[^a-z/A-Z/,/ ]","" ,.)%>%
+ gsub("A-Z.*" , "", .)
[1] " , German, European, Central European"
I have tried many combinations with gsub but can't extract it.
Thank you very much
Assuming your string is s <- " , German, European, Central European", maybe you can use the following code to get the word German:
w <- gsub("\\s+,\\s+([[:alpha:]]+),.*","\\1",s)
or
w <- trimws(unlist(strsplit(s,split = ","))[2])
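If the position of the empty leading field is not guaranteed, the strsplit idea can be generalised a little. The following is only a sketch: first_field is a hypothetical helper name, and it assumes the fields are always separated by commas.

```r
# Hypothetical helper (not from the answers above): return the first
# non-empty comma-separated field of a single string.
first_field <- function(s) {
  parts <- trimws(strsplit(s, ",", fixed = TRUE)[[1]])  # split on commas, strip spaces
  parts[nzchar(parts)][1]                               # drop empty fields, keep the first
}

s <- " , German, European, Central European"
w <- first_field(s)
w
```

If every field is empty, this returns NA rather than an empty string.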
There are a lot of people asking how to remove accents from data, but I'm looking for how to remove the accented characters entirely. They're retained using [[:alnum:]] and [[A-Za-z]]. What would I have to do to get rid of them?
Thanks
You can do in base R:
gsub("[^A-Za-z ]", "", "à la volée d'où est-il")
[1] " la vole do estil"
This excludes everything that is not a letter or a space. If you also want to keep punctuation, have a look at [:punct:].
Without using regex, you could define a set of letters you would accept, which is available in base R as the letters and LETTERS vectors.
characters_to_keep <- c(letters, " ")
accentstring <- "éqodio diq ozàoih"
result <- unlist(strsplit(accentstring,""))
result <- result[result %in% characters_to_keep]
result <- paste0(result, collapse="")
> result
[1] "qodio diq ozoih"
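The same allow-list idea can be wrapped so that it works on a whole character vector and keeps capital letters too. This is a sketch only: keep_ascii_letters is a hypothetical name, and the accented characters are written as \u escapes purely so that the example does not depend on the file encoding.

```r
# Hypothetical helper: keep only characters found in an allow-list,
# applied to every element of a character vector.
keep_ascii_letters <- function(x, keep = c(letters, LETTERS, " ")) {
  vapply(
    strsplit(x, ""),                                       # one vector of single characters per string
    function(chars) paste(chars[chars %in% keep], collapse = ""),
    character(1)
  )
}

accented <- c("\u00e9qodio diq oz\u00e0oih", "\u00c0 la Vol\u00e9e")
cleaned <- keep_ascii_letters(accented)
cleaned
```

Adding further characters (digits, punctuation) is then just a matter of extending the keep argument.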
I have a large data frame in R with a column "FullName" holding a text string made up of two words (a binomial scientific name), followed by author name(s) and initials. Both parts have been corrupted (presumably UTF translation issues). This means that in the binomials any leading "x" (indicating hybrids) has been replaced with "?". Unfortunately any non-standard characters in the author names have also been replaced with "?", so I cannot just replace all "?" with "x".
I simply want to replace any leading "?" in the first two words with "x" (I will then have to manually compose a list of corrected author names to replace the corrupted ones, unless anyone has a bright idea on that!).
Example chunk of df:
df.corrupt <- data.frame(Bing = 1:6, FullName = c("?Anthematricaria dominii Rohlena", "?Anthemimatricaria inolens P.Fourn.", "?Anthemimatricaria maleolens P.Fourn.", "Achillea ?albinea Bjel?i? & K.Mal?", "Achillea carpatica B?ocki ex Dubovik", "Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
I've tried to shoehorn it into regex but can't get close. Any help appreciated!
As I understand it, you want to replace ? only if it occurs in word-initial position in either the first or the second word. If that's correct, this should work:
Data: (I've changed a few chars)
df.corrupt <- data.frame(Bing = 1:6,
FullName = c("?Anthematricaria dominii ?Rohlena",
"?Anthemimatricaria inolens P.Fourn.",
"?Anthemimatricaria maleolens ?P.Fourn.",
"Achillea ?albinea Bjel?i? & K.Mal?",
"Achillea carpatica B?ocki ex Dubovik",
"Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
Solution:
library(stringr)
str_replace_all(df.corrupt$FullName, "^\\?|(?<=^(\\?)?\\b\\w{1,100}\\b\\s)\\?", "x")
[1] "xAnthematricaria dominii ?Rohlena" "xAnthemimatricaria inolens P.Fourn."
[3] "xAnthemimatricaria maleolens ?P.Fourn." "Achillea xalbinea Bjel?i? & K.Mal?"
[5] "Achillea carpatica B?ocki ex Dubovik" "Floscaldasia azorelloides Sklen ? & H.Rob."
This stringr solution puts x where ? occurs right at the start of the string (^) or (|), using a positive lookbehind (i.e., a non-consuming assertion, not a capturing group), where it follows a whitespace char (\\s), which in turn follows a word boundary (\\b) after 1 to 100 \\w chars, which follow a word boundary, which finally follows an optional ? at the start of the string.
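We can check for a ? that follows a space or sits at the start of the string and replace it with 'x'. Note that this also replaces any ? that follows a space in the author names (e.g. "Sklen ? &" becomes "Sklen x &"), so it only fits if that is acceptable: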
trimws(gsub("(^|\\s)\\?", " x", df.corrupt$FullName))
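For a base R variant that touches only the first two words, two anchored sub() calls may be enough. This is a minimal sketch, assuming the first two words are separated by ordinary whitespace; fix_hybrid_marks is a hypothetical name.

```r
# Hypothetical helper: replace a leading "?" with "x" in the first and
# second word only, leaving "?" in the author names untouched.
fix_hybrid_marks <- function(x) {
  x <- sub("^\\?", "x", x)            # "?" opening the first word
  sub("^(\\S+\\s+)\\?", "\\1x", x)    # "?" opening the second word
}

full_name <- c("?Anthematricaria dominii ?Rohlena",
               "Achillea ?albinea Bjel?i? & K.Mal?",
               "Floscaldasia azorelloides Sklen ? & H.Rob.")
fixed <- fix_hybrid_marks(full_name)
fixed
```

Because both patterns are anchored with ^, a ? further along the string (such as in "Sklen ? &") is never reached.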
I have a dataframe df with a place field containing strings that look like this:
countryName0 / provinceName0 / countyName0 / cityName0
countryName1 / provinceName1
Using this code I can pull out the finest resolution place identifier:
df$shortplace <- trimws(basename(df$place))
or:
df$shortplace <- gsub(".*/ ", "", df$place)
e.g.
cityName0
provinceName1
I can then use ggmap library to extract geocodes for cityName0 and provinceName1:
df$geo <- geocode(df$shortplace)
Result looks like this:
geo.lat geo.long
-33.789 147.909
-29.333 133.819
Unfortunately, some city names are not unique, e.g. Perth is the capital of Western Australia, a town in Tasmania, and a city in Scotland. What I need to do is extract not just the place identifier after the last "/" but everything after the second-last "/" (and replace the "/" with a " ") to provide more information for the geocode() function. How do I scan to the second-last "/" and extract the two finest-resolution place names? E.g.
shortplace
countyName0 cityName0
countryName1 provinceName1
There are other ways, but strsplit() seems the most straightforward to me here. Give this a try:
x = "countryName0 / provinceName0 / countyName0 / cityName0"
x_split = strsplit(x, " / ")[[1]] # Somewhat confusingly, result of strsplit() is a list; [[1]] pulls out the one and only entry here
n_terms = length(x_split)
result = paste(x_split[n_terms - 1], x_split[n_terms], sep = ", ")
result
# [1] "countyName0, cityName0"
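The snippet above handles one string; to apply the same idea to a whole column, something like the following might work. It is a sketch under assumptions: place stands in for df$place, last_two is a hypothetical helper, and the parts are joined with a space as in the question's desired output.

```r
# Stand-in for df$place from the question.
place <- c("countryName0 / provinceName0 / countyName0 / cityName0",
           "countryName1 / provinceName1")

# Hypothetical helper: keep the last two "/"-separated parts of one string.
last_two <- function(p) {
  parts <- trimws(strsplit(p, "/", fixed = TRUE)[[1]])  # split and strip spaces
  paste(utils::tail(parts, 2), collapse = " ")          # last two parts (or fewer)
}

shortplace <- vapply(place, last_two, character(1), USE.NAMES = FALSE)
shortplace
```

Strings with a single part are returned unchanged, since tail() simply returns what is there.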
One option is to use sub to match a run of alphanumeric characters, followed by one or more spaces (\\s+), a /, and one or more spaces again, then another run of alphanumeric characters up to the end of the string ($). Both runs are captured as groups and the match is replaced with their backreferences (\\1 \\2).
df$shortplace <- sub(".*\\b([[:alnum:]]+)\\s+\\/\\s+([[:alnum:]]+)$", "\\1 \\2", df$place)
df$shortplace
#[1] "countyName0 cityName0" "countryName1 provinceName1"
This worked for me in the end:
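# Keep only the last two "/"-separated parts (note the doubled backslash in "\\1")
df$shortplace <- trimws(sub("^.*/([^/]*/[^/]*)$", "\\1", df$place))
# Turn the remaining separator into a comma, working on shortplace rather than place
df$shortplace <- gsub(" / ", ", ", df$shortplace, fixed = TRUE)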
Not super elegant but it does the job.
I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to split into first names, last names and titles. In this data, titles always start after the first comma, if there is a title.
I am trying to get the list into this form:
FirstName  LastName  Titles
Mark       Owens     M.D.,M.P.H
Lara       Kraft     -
Dale       Good      C.P.A
Thanks in advance.
Here is my sample code:
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )
You can see that with this code Mr. Owens is not behaving: his title is taken from after the last comma, and his last name comes out as "P". As you can tell, I referred to "Extract last word in string in R", "Extract 2nd to last word in string" and "Extract last word in a string after comma if there are multiple words else the first word".
You were off to a good start so you should pick up from there. The firstnames variable was good as written. For lastnames I used a modified name list. Inside of the sub function is another that eliminates everything after the first comma. The last name will then be the final word in the string. For titles there is a two-step process of first eliminating everything before the first comma, then replacing non-matched strings with a hyphen -.
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
titles <- sub(".*?,\\s*", "", namelist)  # drop everything up to the first comma and the spaces after it
titles <- ifelse(titles == namelist, "-", titles)
names <- data.frame(firstnames , lastnames, titles )
  firstnames lastnames       titles
1       Mark     Owens M.D., M.P.H.
2       Dale      Good        C.P.A
3       Lara     Kraft            -
4     Roland      Bass          III
This should do the trick, at least on test data:
x <- strsplit(namelist, split = ",")
x <- lapply(x, trimws)                                    # strip spaces around each piece
names <- sapply(x, function(y) y[[1]])                    # the part before the first comma
titles <- sapply(x, function(y) if (length(y) > 1) paste(y[-1], collapse = ",") else "")
names <- strsplit(names, split = " ")
firstnames <- sapply(names, function(y) y[[1]])
lastnames <- sapply(names, function(y) y[[length(y)]])    # last word, even without a middle initial
names <- data.frame(firstnames, lastnames, titles)
names
In cases like this, where the strings always have the same structure, it is easier to use functions like strsplit() to extract the desired parts.
R struggles. I am using the following to extract quotations from text, with multiple results per string on a large dataset. I am trying to get the output as a character string within a data frame, so I can easily share it as a CSV with others.
Sample data:
normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)
Using the following to extract quotations and a buffer of characters:
library(stringr)  # provides str_extract_all()
result <- function(testdata) {
str_extract_all(testdata, '[^\"]?{15}"[^\"]+"[^\"]?{15}')
}
extract <- sapply(testdata, FUN=result)
The extract is a list within a matrix. However, I want the extract to be a character string that I can later merge to a dataframe as a column. How do I convert this?
Code
normalCase <- 'He said, "I am a test," very quickly.'
endCase <- 'This is a long quote, which we said, "Would never happen."'
shortCase <- 'A "quote" yo';
beginningCase <- '"I said this," he said quickly';
multipleCase <- 'When asked, "No," said Sam "I do not like green eggs and ham."'
testdata = c(normalCase,endCase,shortCase,beginningCase,multipleCase)
# extract quotations
gsub(pattern = "[^\"]*((?:\"[^\"]*\")|$)", replacement = "\\1 ", x = testdata)
Output
[1] "\"I am a test,\" "
[2] "\"Would never happen.\" "
[3] "\"quote\" "
[4] "\"I said this,\" "
[5] "\"No,\" \"I do not like green eggs and ham.\" "
Explanation
pattern = "[^\"]" matches any character except a double quote
pattern = "[^\"]*" matches any character except a double quote, 0 or more times
pattern = "\"[^\"]*\"" matches a double quote, then any character except a double quote 0 or more times, then another double quote (i.e. a quotation)
pattern = "(?:\"[^\"]*\")" matches a quotation, but won't capture it
pattern = "((?:\"[^\"]*\")|$)" matches a quotation or the end of the string, and captures it. Note that this is the first group we capture
replacement = "\\1 " replaces each match with the first captured group followed by a space
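To get one character string per input that can sit in a data frame column, the matches can also be collected and collapsed explicitly. This is a base R sketch, not the approach used above: it uses regmatches() with gregexpr() instead of gsub(), drops the surrounding character buffer from the question, and the column names text and quotes are made up for illustration.

```r
testdata <- c('He said, "I am a test," very quickly.',
              'This is a long quote, which we said, "Would never happen."',
              'A "quote" yo',
              '"I said this," he said quickly',
              'When asked, "No," said Sam "I do not like green eggs and ham."')

# One character vector of quotations per input string (a list).
quotes <- regmatches(testdata, gregexpr('"[^"]*"', testdata))

# Collapse each list element into a single string, separated by spaces.
quote_strings <- vapply(quotes, paste, character(1), collapse = " ")

# Illustrative data frame; the column names are assumptions.
out <- data.frame(text = testdata, quotes = quote_strings,
                  stringsAsFactors = FALSE)
out$quotes
```

The resulting data frame has plain character columns, so it can be passed to write.csv() directly.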