Say I have the following list of full scientific names of plant species inside my dataset:
FullSpeciesNames <- c("Aronia melanocarpa (Michx.) Elliott", "Cotoneaster divaricatus Rehder & E. H. Wilson","Rosa canina L.","Ranunculus montanus Willd.")
I want to obtain a list of simplified names, i.e. just the first two elements of a given name, namely:
SimpleSpeciesNames<- c("Aronia melanocarpa", "Cotoneaster divaricatus", "Rosa canina", "Ranunculus montanus")
How can this be done in R?
We can use sub to match two repetitions of a word (\\w+) followed by one or more whitespace characters (\\s+), captured together as a group, and then the rest of the characters (.*). In the replacement, use the backreference to the captured group (\\1). Because the group includes the whitespace after the second word, trimws is used to strip it.
trimws(sub("^((\\w+\\s+){2}).*", "\\1", FullSpeciesNames))
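One caveat: \\w does not match hyphens, so a hyphenated epithet would stop the pattern from matching and the name would come back unchanged. A minimal sketch of a variant that uses \\S (any non-whitespace character) instead, which also makes trimws unnecessary. The vector x and the choice of \\S are assumptions for illustration, not part of the original answer:

```r
# Illustrative input: one name from the question, one hyphenated epithet,
# and one name that is already in its simple two-word form.
x <- c("Aronia melanocarpa (Michx.) Elliott",
       "Arctostaphylos uva-ursi (L.) Spreng.",
       "Rosa canina")

# Capture "non-space run, whitespace, non-space run" from the start of the
# string and drop everything after it.
simple <- sub("^(\\S+\\s+\\S+).*$", "\\1", x)
print(simple)
```

Since the captured group ends on the second word rather than on the whitespace after it, nothing needs trimming afterwards.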
An alternative that is more complicated in function use, but does not require regular expressions is
substring(FullSpeciesNames,
1, sapply(gregexpr(" ", FullSpeciesNames, fixed=TRUE), "[[", 2) - 1)
[1] "Aronia melanocarpa" "Cotoneaster divaricatus" "Rosa canina" "Ranunculus montanus"
gregexpr can be used to find the positions of certain characters in a string (it can also look for patterns with regular expressions). Here we are looking for spaces. It returns a list of the positions for each string in the character vector. sapply is used to extract the position of the second space. The vector of these positions (minus one) is fed to substring, which runs through the initial vector and takes the substrings starting from the first character to the indicated position.
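The steps described above can be spelled out one at a time. This is only a sketch: the intermediate variable names are made up, and the NA handling for names with fewer than two spaces is an added assumption that the original one-liner does not cover:

```r
FullSpeciesNames <- c("Aronia melanocarpa (Michx.) Elliott",
                      "Cotoneaster divaricatus Rehder & E. H. Wilson",
                      "Rosa canina L.",
                      "Ranunculus montanus Willd.",
                      "Rosa canina")  # extra entry: no author part at all

# Step 1: positions of every literal space, one integer vector per string.
space_pos <- gregexpr(" ", FullSpeciesNames, fixed = TRUE)

# Step 2: position of the second space (NA when there is none).
second_space <- vapply(space_pos, function(p) p[2], integer(1))

# Step 3: cut just before the second space, or keep the whole string
# when no second space exists.
end_pos <- ifelse(is.na(second_space), nchar(FullSpeciesNames), second_space - 1L)
SimpleSpeciesNames <- substring(FullSpeciesNames, 1, end_pos)
print(SimpleSpeciesNames)
```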
I have a field in a data frame that is formatted as last name, comma, space, first name, space, middle name, and sometimes without a middle name. I need to remove the middle names from the full names when they are present, as well as all spaces. I couldn't figure out how. My guess is that it will involve regular expressions. It would be nice if you could provide explanations for the answer. Below is an example:
names <- c("Casillas, Kyron Jamar", "Knoll, Keyana","McDonnell, Messiah Abdul")
names
Expected output will be,
names_n <- c("Casillas,Kyron", "Knoll,Keyana","McDonnell,Messiah")
names_n
Thanks!
You can use this:
sub("([^,]+,)\\s*(\\w+).*", "\\1\\2", names)
[1] "Casillas,Kyron" "Knoll,Keyana" "McDonnell,Messiah"
Here we divide the string into two capturing groups and use backreferences to recollect their content:
([^,]+,): the 1st capture group, which captures any sequence of characters that are not a comma, followed by a comma
\\s*: this matches the whitespace after the comma, outside of any group, so it is dropped
(\\w+): the 2nd capture group, which captures the alphanumeric string right after that whitespace, i.e. the first name
.*: this matches whatever follows, i.e. the middle name when there is one
\\1\\2 in the replacement argument recollects the contents of the two capture groups only, thereby removing whatever is not captured. If you wish to separate the surname from the first name not only by a comma but also by a whitespace, just squeeze one whitespace between the two backreferences, thus: \\1 \\2
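We may capture the second word (\\w+), i.e. the first name, together with the middle name that follows it, and replace the match with the backreference (\\1) of the captured word. The outer sub then removes the space that is still left after the comma when there is no middle name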
sub("\\s+", "", sub("\\s+(\\w+)\\s+\\w+$", "\\1", names))
-output
[1] "Casillas,Kyron" "Knoll,Keyana" "McDonnell,Messiah"
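For comparison, the same result can be reached by splitting instead of substituting. This is a rough sketch under the assumption that every entry contains exactly one comma and at least one given name; the variable names are made up for illustration:

```r
full_names <- c("Casillas, Kyron Jamar", "Knoll, Keyana", "McDonnell, Messiah Abdul")

# Split each entry at the comma (plus any whitespace after it).
pieces <- strsplit(full_names, ",\\s*")

# The surname is the part before the comma.
surname <- vapply(pieces, function(p) p[1], character(1))

# The first name is the first whitespace-separated word after the comma.
first_name <- vapply(pieces, function(p) strsplit(p[2], "\\s+")[[1]][1], character(1))

names_n <- paste0(surname, ",", first_name)
print(names_n)
```

Because the split happens at the comma first, surnames that themselves contain spaces are kept whole.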
I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consecutively is that sometimes, after removing one ending, the other ending appears, and I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could do:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.
To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), the (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds do not consume text: the text they match does not land in the match value, so there is no need to restore it. The replacement is an empty string.
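To see that both variants strip the ending only once per word, here is a small sketch on a made-up vector (the words, including the artificial "abceses", are assumptions chosen for illustration). It also shows what the asker wanted to avoid, namely two consecutive passes:

```r
text <- c("processes", "cases", "faces", "buses", "homes", "abceses")

# Single pass with a capturing group and a backreference.
via_group <- gsub("([sc])es$", "\\1", text)

# Single pass with a lookbehind (needs the PCRE engine).
via_lookbehind <- gsub("(?<=[sc])es$", "", text, perl = TRUE)

# Two consecutive passes: "abceses" loses an ending twice.
two_passes <- sub("ces$", "c", sub("ses$", "s", text))

print(via_group)
print(two_passes)
```

In the single-pass versions "abceses" becomes "abces", whereas the two-pass version reduces it further to "abc".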
Strings ending with "ces" and "ses" follow the same pattern, i.e. "*es$"
If I understand it correctly, then you don't need two patterns.
Example:
x = c("ces", "ses", "mes")
gsub(pattern = "([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
Hope it helps.
M
Let's say I want a Regex expression that will only match numbers between 18 and 31. What is the right way to do this?
I have a set of strings that look like this:
"quiz.18.player.total_score"
"quiz.19.player.total_score"
"quiz.20.player.total_score"
"quiz.21.player.total_score"
I am trying to match only the strings that contain the numbers 18-31, and am currently trying something like this
(quiz.)[1-3]{1}[1-9]{1}.player.total_score
This obviously won't work because it will actually match all numbers between 11-39. What is the right way to do this?
Regex: 1[89]|2\d|3[01]
For matching add additional text and escape the dots:
quiz\.(?:1[89]|2\d|3[01])\.player\.total_score
Details:
(?:) non-capturing group
[] match a single character present in the list
| or
\d matches a digit (equal to [0-9])
\. dot
. matches any character
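Since the question is about R, note that each backslash has to be doubled inside an R string literal. A minimal sketch of how the pattern might be applied with grepl; the test vector s, the ^ and $ anchors, and the use of perl = TRUE are assumptions added for illustration:

```r
# Made-up test data covering numbers on both sides of the 18-31 range.
s <- paste0("quiz.", 15:35, ".player.total_score")

# Same regex as above, with backslashes doubled for the R string and
# anchored so the whole string has to match.
pattern <- "^quiz\\.(?:1[89]|2\\d|3[01])\\.player\\.total_score$"

matched <- s[grepl(pattern, s, perl = TRUE)]
print(matched)
```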
1) If s is the character vector, read the fields into a data frame, pick off the second field and check whether it is in the desired range. Use the resulting logical vector to get those elements from s. This uses no regular expressions and only base R.
digits <- read.table(text = s, sep = ".")$V2
s[digits %in% 18:31]
2) Another approach based on the pattern "\\D" matching any non-digit is to remove all such characters and then check if what is left is in the desired range:
digits <- gsub("\\D", "", s)
s[digits %in% 18:31]
2a) In R 3.6.0 and later we could alternatively use the whitespace argument of trimws like this:
digits <- trimws(s, whitespace = "\\D")
s[digits %in% 18:31]
3) Another alternative is to simply construct the boundary strings and compare s to them. This will work only if all the number parts in s are exactly the same number of digits (which for the sample shown in the question is the case).
ok <- s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"
s[ok]
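As a quick sanity check, approaches 1), 2) and 3) can be run side by side on the same input. The vector s below is made up for illustration and deliberately includes values just outside the range:

```r
s <- paste0("quiz.", c(17, 18, 25, 31, 32), ".player.total_score")

# 1) split on "." and test the second field
by_field <- s[read.table(text = s, sep = ".")$V2 %in% 18:31]

# 2) strip every non-digit and test what is left
by_digits <- s[gsub("\\D", "", s) %in% 18:31]

# 3) compare against boundary strings (valid here because all numbers
#    have exactly two digits)
by_bounds <- s[s >= "quiz.18.player.total_score" & s <= "quiz.31.player.total_score"]

print(by_field)
```

All three should select the entries for 18, 25 and 31 only.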
This is done using character ranges and alternations. For your range
3[10]|[2][0-9]|1[8-9]
I'm trying to split a dataframe with "," separators. However, some parts of the strings have the pattern [0-9][,][0-9]{2}, and I'd like to substitute only the comma inside, not the whole pattern, in order to preserve the numerical inputs.
I tried to solve this with stringr, but got stuck on the following error pattern:
library(stringr)
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'
str_replace_all(string, "[0-9][,][0-9]{2}", "[0-9][;][0-9]{2}")
[1] "\"name: John\",\"age: 27\",\"height: [0-9][;][0-9]{2}\", \"weight: 7[0-9][;][0-9]{2}\""
I know it can be done with substitution by position, but the string is too big.
I'd appreciate any help. Thanks in advance.
You need to use capturing groups around the parts of the pattern you need to keep and, in the replacement pattern, refer to those submatches with backreferences:
> str_replace_all(string, "([0-9]),([0-9]{2})", "\\1;\\2")
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Or the same regex can be used with gsub:
> gsub("([0-9]),([0-9]{2})", "\\1;\\2", string)
[1] "\"name: John\",\"age: 27\",\"height: 1;73\", \"weight: 78;30\""
Details:
([0-9]) - capturing group 1, whose value is referred to using \\1 in the replacement pattern, matching a single digit
, - a comma
([0-9]{2}) - capturing group 2, whose value is referred to using \\2 in the replacement pattern, matching 2 digits.
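Once the decimal commas are protected, the string can be split on the remaining commas, which was the asker's goal. The sketch below uses base R only and, as an alternative to capture groups, lookarounds that match just the comma itself; the variable names and the final quote-stripping step are assumptions for illustration:

```r
string <- '"name: John","age: 27","height: 1,73", "weight: 78,30"'

# Replace only a comma that sits between a digit and two digits.
# Lookarounds require perl = TRUE.
protected <- gsub("(?<=[0-9]),(?=[0-9]{2})", ";", string, perl = TRUE)

# Split on the remaining commas (and any whitespace after them).
fields <- strsplit(protected, ",\\s*")[[1]]

# Drop the literal double quotes around each field.
fields <- gsub('"', "", fields, fixed = TRUE)
print(fields)
```

With lookarounds nothing besides the comma is consumed, so no backreferences are needed in the replacement.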
Here are three character vectors:
[1] "Session_1/Focal_1_P1/240915_P1_S1_F1.csv"
[2] "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv"
[3] "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv"
I'm trying to extract the strings P1, PA10 and DA100, respectively, in a standardised manner (as I have several hundred other strings from which I want to extract this).
I know I need to use regex but I'm fairly new to it and not exactly sure which one.
I can see that the commonalities are 6 numbers (\d\d\d\d\d\d) followed by an _ and then what I want, followed by another _.
How do I extract what I want? I believe with grep but am not 100% on the regular expression I need.
We can use gsub. We match zero or more characters (.*) followed by a forward slash (\\/), followed by one or more numbers and an underscore (\\d+_), or (|) two instances of an underscore followed by one or more characters that are not an underscore ((_[^_]+){2}) and replace it with blank ("").
gsub(".*\\/\\d+_|(_[^_]+){2}", "", v1)
#[1] "P1" "PA10" "DA100"
Or we extract the basename of the vector, match one or more numbers followed by an underscore (\\d+_), followed by one or more characters that are not an underscore (([^_]+)) as a capture group, followed by the remaining characters until the end of the string, and replace it with the backreference (\\1) for the captured group.
sub("\\d+_([^_]+).*", "\\1", basename(v1))
#[1] "P1" "PA10" "DA100"
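Because the wanted part is always the second underscore-separated field of the file name, a regex-free sketch is also possible. This assumes every file name follows the date_ID_session_focal.csv layout seen in the three examples; the name ids is made up:

```r
v1 <- c("Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
        "Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
        "Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")

# Keep only the file name, split it at literal underscores,
# and take the second piece of each.
ids <- vapply(strsplit(basename(v1), "_", fixed = TRUE),
              function(p) p[2], character(1))
print(ids)
```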
data
v1 <- c( "Session_1/Focal_1_P1/240915_P1_S1_F1.csv",
"Session_2/Focal_1_PA10/250915_PA10_S2_F1.csv",
"Session_3/Focal_1_DA100/260915_DA100_S3_F1.csv")