R retrieving strings with sub: Why does this not work?

I would like to extract parts of strings. The string is:
> (x <- 'ab/cd efgh "xyz xyz"')
[1] "ab/cd efgh \"xyz xyz\""
Now, I would like first to extract the first part:
> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"
But I don't succeed in extracting the second part:
> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""
What is wrong with this code?
Thanks for help.

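Your last snippet does not work because it puts the whole match back into the result: (\"[A-Za-z ]+\")$ matches a ", one or more letters or spaces, and a closing " at the end of the string, captures all of that into Group 1, and the \1 in the replacement reinserts exactly that text. Nothing outside the match is touched, so the string comes back unchanged.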
You may actually get the last part inside quotes by removing all chars other than " at the start of the string:
x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)
sub replaces only the first match. Here the pattern matches the start of the string (^) followed by one or more chars other than " (the negated character class [^"]+), so everything before the first quote is removed.
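To make the difference concrete, here is a minimal base R sketch that runs the question's attempt and the fix side by side. The variable names unchanged and quoted are only illustrative.

```r
x <- 'ab/cd efgh "xyz xyz"'

# Replacing the match with its own capture group changes nothing.
unchanged <- sub('("[A-Za-z ]+")$', "\\1", x, perl = TRUE)
identical(unchanged, x)
# [1] TRUE

# Removing everything before the first double quote keeps the quoted part.
quoted <- sub('^[^"]+', "", x)
quoted
# [1] "\"xyz xyz\""
```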

To get this to work with sub, you have to match the whole string. The help file says
For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
So to get this to work with your regex, prepend the sometimes risky catch-all ".*" so that the pattern consumes the whole string and only the captured group is put back:
sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
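One consequence of the help-file passage is that sub hands back non-matching input untouched, which can hide a failed match. The sketch below shows that, and then an alternative that is not from the answers above: regmatches with regexpr, which returns only the matched text. The names pattern, no_match, quoted and inner are illustrative.

```r
x <- 'ab/cd efgh "xyz xyz"'
pattern <- '"[A-Za-z ]+"$'

# sub() returns non-matching input as is, so a missing quote goes unnoticed.
no_match <- sub(paste0(".*(", pattern, ")"), "\\1", "no quotes here")
no_match
# [1] "no quotes here"

# regmatches() + regexpr() return only the matched text.
quoted <- regmatches(x, regexpr(pattern, x))
quoted
# [1] "\"xyz xyz\""

# Drop the surrounding quotes if only the inner text is wanted.
inner <- gsub('"', "", quoted, fixed = TRUE)
inner
# [1] "xyz xyz"
```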

Related

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
Example with two inputs:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
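The "as many as possible" part is what makes the pattern stop at the last pipe. A small sketch contrasting the greedy .* with a lazy .*? (using perl = TRUE for the lazy form); the names after_last and after_first are illustrative.

```r
x <- "a|b|c|BLCU142-09|Apodemia_mejicanus"

# Greedy .* grabs as much as it can, so the match ends at the last pipe.
after_last <- sub(".*\\|", "", x)
after_last
# [1] "Apodemia_mejicanus"

# Lazy .*? grabs as little as it can, so the match ends at the first pipe.
after_first <- sub(".*?\\|", "", x, perl = TRUE)
after_first
# [1] "b|c|BLCU142-09|Apodemia_mejicanus"
```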
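We can match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a |, and use str_remove to remove that substring. Note that this removes text only up to the first |, which is sufficient for the example string: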
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
The character class in the OP's str_replace_all also contains A-Z, so it matches the upper-case A of Apodemia as well and replaces it with an empty string ("").
Data used above:
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
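I would keep it simple (my_string is the input string; this cuts after the first |, which is enough when the string contains only one):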
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
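Since regexpr reports only the first match, the fixed-string approach needs a small change when a string has several pipes and the text after the last one is wanted. A possible sketch, assuming gregexpr is acceptable; the names first_pipe, all_pipes, after_first and after_last are illustrative.

```r
s <- "a|b|c|BLCU142-09|Apodemia_mejicanus"

# regexpr() reports the first pipe only.
first_pipe <- regexpr("|", s, fixed = TRUE)
after_first <- substring(s, first_pipe + 1L)
after_first
# [1] "b|c|BLCU142-09|Apodemia_mejicanus"

# gregexpr() reports every pipe; the largest position is the last one.
all_pipes <- gregexpr("|", s, fixed = TRUE)[[1]]
after_last <- substring(s, max(all_pipes) + 1L)
after_last
# [1] "Apodemia_mejicanus"
```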

Count with how many spaces a string starts

I want to know with how many spaces a string starts. Here are some examples:
string.1 <- "    starts with 4 spaces"
string.2 <- "  starts with only 2 spaces"
My attempt was the following but this leads to 1 in both cases and I understand why this is the case.
stringr::str_count(string.1, "^ ")
stringr::str_count(string.2, "^ ")
I'd prefer if there was a solution completely like this but with another regex.
The "^ " pattern (a caret followed by one space) matches a single space at the start of the string, which is why both test cases return 1.
To match consecutive spaces at the start of the string, you may use
stringr::str_count(string.1, "\\G ")
Or, to count any whitespaces,
stringr::str_count(string.1, "\\G\\s")
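The "\\G " pattern matches a space at the start of the string and then each space that immediately follows the previous successful match, because the \G anchor only matches at the start of the string or at the position where the previous match ended.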
Another approach: count the length of ^\s+ matches (1 or more whitespace chars at the start of the string):
strings <- c("    starts with 4 spaces", "  starts with only 2 spaces")
matches <- regmatches(strings, regexpr("^\\s+", strings))
sapply(matches, nchar)
# => 4 2
One approach might be to take the nchar of the input string, with all content from the first non whitespace character until the end stripped.
string.1 <- "    starts with 4 spaces"
nchar(sub("\\S.*$", "", string.1))
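When the input is a vector and some elements have no leading whitespace, the two base R approaches behave differently. The sketch below adds a third, hypothetical test string to show this; the names matches and leading are illustrative.

```r
strings <- c("    starts with 4 spaces",
             "  starts with only 2 spaces",
             "no leading space")

# regmatches() silently drops elements without a match.
matches <- regmatches(strings, regexpr("^\\s+", strings))
length(matches)
# [1] 2

# Comparing lengths before and after stripping keeps one count per element.
leading <- nchar(strings) - nchar(sub("^\\s+", "", strings))
leading
# [1] 4 2 0
```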

R Regex: removing only the immediate following character after >

I have the following string in R:
string1 = "A((..A>B)A"
I would like to remove all punctuation, and the letter immediately after >, i.e. >B
Here is the output I desire:
output = "AAA"
I tried using gsub() as follows:
output = gsub("[[:punct:]]","", string1)
But this gives AABA, which keeps the immediately following character.
This works by combining your [[:punct:]] pattern with a lookbehind alternative that matches the character right after >:
gsub('(?<=>).|[[:punct:]]', '', "A((..A>B)A", perl=TRUE)
## [1] "AAA"
A slightly less complex regex without the use of perl seems to work for this example as well:
gsub("[[:punct:]]|>(.)", "", "A((..A>B)A")
[1] "AAA"
You say
remove all punctuation, and the letter immediately after >
Punctuation is matched with [[:punct:]] and a letter can be matched with [[:alpha:]], thus, you may use a TRE regex with gsub:
string1 = "A((..A>B)A"
gsub(">[[:alpha:]]|[[:punct:]]", "", string1)
# => [1] "AAA"
Note that > is also a char matched with [[:punct:]], thus, you do not need any lookarounds here, just remove it with a letter after it.
Pattern details:
>[[:alpha:]] - a > and any letter
| - or
[[:punct:]] - a punctuation or symbol.
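The order of the two alternatives matters with a PCRE-style engine, which tries alternatives left to right at each position. A short sketch using perl = TRUE to show this; the names letter_first and punct_first are illustrative.

```r
string1 <- "A((..A>B)A"

# Two-character alternative first: ">B" is removed as a unit.
letter_first <- gsub(">[[:alpha:]]|[[:punct:]]", "", string1, perl = TRUE)
letter_first
# [1] "AAA"

# Single-character alternative first: ">" is removed alone and "B" stays.
punct_first <- gsub("[[:punct:]]|>[[:alpha:]]", "", string1, perl = TRUE)
punct_first
# [1] "AABA"
```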

In R: grab all alnum characters before the first punctuation

I have a vector s of strings (or NAs), and would like to get a vector of the same length containing everything before the first occurrence of punctuation (.).
s <- c("ABC1.2", "22A.2", NA)
I would like a result like:
[1] "ABC1" "22A" NA
You can remove all symbols (incl. a newline) from the first dot with the following Perl-like regex:
s <- c("ABC1.2", "22A.2", NA)
gsub("[.][\\s\\S]*$", "", s, perl=TRUE)
## => [1] "ABC1" "22A" NA
The regex matches
[.] - a literal dot
[\\s\\S]* - any symbols incl. a newline
$ - end of string.
All matches are replaced with the empty string "". As the regex engine analyzes the string from left to right, the first dot is matched with [.], and the greedy * quantifier applied to [\\s\\S] then matches everything up to the end of the string.
If there are no newlines, a simpler regex will do: [.].*$:
gsub("[.].*$", "", s)
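The newline caveat can be checked directly under perl = TRUE. The sketch below uses a hypothetical second element containing a line break and compares [\s\S], a plain dot, and the (?s) inline flag, which lets the dot match newlines too; the names any_char, dot_only and dotall are illustrative.

```r
s <- c("ABC1.2", "22A.2\nmore", NA)

# [\s\S] matches any character, including a newline.
any_char <- gsub("[.][\\s\\S]*$", "", s, perl = TRUE)
any_char
# [1] "ABC1" "22A"  NA

# With perl = TRUE a plain dot stops at the newline, so element 2 is unchanged.
dot_only <- gsub("[.].*$", "", s, perl = TRUE)
identical(dot_only[2], s[2])
# [1] TRUE

# The (?s) flag makes the dot match newlines as well.
dotall <- gsub("(?s)[.].*$", "", s, perl = TRUE)
identical(dotall, any_char)
# [1] TRUE
```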

string manipulation to remove the name of files

I have a list of strings
/temp/123/afedcgid/abc.csv
/temp/123/4388dkfa/abc1.csv
/temp/123/4388dkfa/ab1.csv
I want to remove the file name from the strings.
The desired results are:
/temp/123/afedcgid
/temp/123/4388dkfa
/temp/123/4388dkfa
How can I do it? Thanks.
You could try the below,
sub("/[^/]*$", "", x)
It removes all the chars from the last / symbol to the end of the string.
OR
> x <- "/temp/123/afedcgid/abc.csv"
> sub("(.*)/.*", "\\1", x)
[1] "/temp/123/afedcgid"
Here (.*) captures all the chars from the start up to the last / symbol (excluding the /). The remaining chars are then matched by /.*. Replacing the whole match with the contents of group 1 gives you the desired output.
Example:
> x <- "/temp/123/afedcgid/abc.csv"
> sub("/[^/]*$", "", x)
[1] "/temp/123/afedcgid"
OR
regmatches(x, gregexpr(".+(?=/)", x, perl=TRUE))
Use this regex to match the part you want to remove:
\/\w+\.\w+$
files <- c("/temp/123/afedcgid/abc.csv",
"/temp/123/4388dkfa/abc1.csv", "/temp/123/4388dkfa/ab1.csv")
sub("\\/\\w+\\.\\w+$", "", files)
As you may know, backslashes in R string literals must be doubled (\\) so that the regex engine receives a single backslash.
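As a cross-check that is not part of the answers above, base R also has dirname(), which returns the directory part of a path without any regex. A minimal sketch comparing it with the sub approach on the question's data; the names by_regex and by_dirname are illustrative.

```r
files <- c("/temp/123/afedcgid/abc.csv",
           "/temp/123/4388dkfa/abc1.csv",
           "/temp/123/4388dkfa/ab1.csv")

# Regex approach from the answers: drop the last "/" and what follows.
by_regex <- sub("/[^/]*$", "", files)
by_regex
# [1] "/temp/123/afedcgid" "/temp/123/4388dkfa" "/temp/123/4388dkfa"

# Base R alternative: dirname() returns the directory part of each path.
by_dirname <- dirname(files)
```

On these inputs both approaches should give the same three directories; the regex version remains useful when the strings are not real file paths.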
