Remove everything after space in string - r

I would like to remove everything after a space in a string.
For example:
"my string is sad"
should return
"my"
I've been trying to figure out how to do this using sub/gsub but have been unsuccessful so far.

You may use a regex like
sub(" .*", "", x)
See the regex demo.
Here, sub performs a single search-and-replace operation. The pattern first matches the first space in the string (the regex engine searches from left to right), and then .* matches any zero or more characters, as many as possible, up to the end of the string. In the TRE regex flavor, . also matches line break characters; beware that this is not the case with perl=TRUE.
Some variations:
sub("[[:space:]].*", "", x) # \s or [[:space:]] will match more whitespace chars
sub("(*UCP)(?s)\\s.*", "", x, perl=TRUE) # PCRE Unicode-aware regex
stringr::str_replace(x, "(?s) .*", "") # (?s) will force . to match any chars
See the online R demo.
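As a rough illustration of the difference between the literal-space pattern and the [[:space:]] variation, here is a minimal sketch. The sample vector x is made up for this example and is not from the question.

```r
# Hypothetical sample input: a plain sentence, a string without any space,
# a string whose first whitespace is a tab, and an empty string.
x <- c("my string is sad", "single", "tab\tseparated words", "")

# Literal space: only " " counts as the cut point.
first_by_space <- sub(" .*", "", x)

# POSIX class: any whitespace character (space, tab, newline, ...) counts.
first_by_ws <- sub("[[:space:]].*", "", x)

print(first_by_space)
print(first_by_ws)
```

Strings without a match are returned unchanged, so no special handling should be needed for one-word inputs.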

strsplit("my string is sad"," ")[[1]][1]

or, replace everything after the first space with nothing:
gsub(' [A-Za-z ]*', '' , 'my string is sad')
And with numbers:
gsub('([0-9]+) .*', '\\1', c('c123123123 0320.1'))

If you want to do it with a regex:
gsub('([A-Za-z]+) .*', '\\1', 'my string is sad')

Stringr is your friend.
library(stringr)
word("my string is sad", 1)

Related

Delete string parts within delimiter

I have a string as"dfgdf" sa"2323":
a <- "as\"dfgdf\" sa\"2323\""
The delimiter (same for the start and the end) here is ". So what I want is to get a string where everything within the delimiters is deleted, but not the delimiters themselves. So the end result string should look like as"" sa""
You could match " and forget what is matched using \K
Then use a negated character class matching any char except " or a whitespace character and use lookarounds to assert " to the right.
Use perl=TRUE to enable Perl-like regular expressions.
a <- "as\"dfgdf\" sa\"2323\""
gsub('"\\K[^"\\s]+(?=")', "", a, perl=TRUE)
Output
[1] "as\"\" sa\"\""
Here is another base R option using paste0 + strsplit
s <- paste0(paste0(unlist(strsplit(a, '"\\w+"')), '""'), collapse = "")
which gives
> s
[1] "as\"\" sa\"\""
> cat(s)
as"" sa""
Here is one option with regex lookarounds: the pattern matches a word (\\w+) that is preceded by a double quote and followed by one, and the match is replaced with an empty string ("")
cat(gsub('(?<=")\\w+(?=")', "", a, perl = TRUE), "\n")
#as"" sa""
Or without regex lookaround
cat(gsub('"\\w+"', '""', a), "\n")
#as"" sa""
I also found a way with stringr library:
library(stringr)
a <- "as\"dfgdf\" sa\"2323\""
result <- str_replace_all(a, "\".*?\"", "\"\"")
cat(result)

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. See the regex demo. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
See the R demo online:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
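The greedy .* is what makes this stop at the last pipe. As a minimal sketch of the contrast (the lazy and negated-class patterns are added here for comparison and are not part of the answer above):

```r
x <- "a|b|c|BLCU142-09|Apodemia_mejicanus"

# Greedy: .* grabs as much as possible, so the match ends at the LAST pipe.
after_last <- sub(".*\\|", "", x)

# Lazy (PCRE): .*? grabs as little as possible, so the match ends at the FIRST pipe.
after_first_lazy <- sub(".*?\\|", "", x, perl = TRUE)

# Negated class: same effect as the lazy version, without a lazy quantifier.
after_first_class <- sub("^[^|]*\\|", "", x)

print(after_last)
print(after_first_lazy)
print(after_first_class)
```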
We can match one or more characters that are not a | ([^|]+) from the start (^) of the string followed by | in str_remove to remove that substring
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
The pattern in the OP's str_replace_all includes [A-Z], so it also matches the uppercase A in Apodemia and replaces it with an empty string ("")
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple:
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
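Note that regexpr returns the position of the first pipe. If a string can contain several pipes and the text after the last one is wanted, one possible sketch uses gregexpr instead. The input my_string below is a made-up example with extra pipes.

```r
# Hypothetical input with more than one pipe.
my_string <- "a|b|BLCU142-09|Apodemia_mejicanus"

# regexpr: position of the first pipe only.
first_pipe <- regexpr("|", my_string, fixed = TRUE)
after_first <- substring(my_string, first_pipe + 1L)

# gregexpr: positions of all pipes; take the largest one.
all_pipes <- gregexpr("|", my_string, fixed = TRUE)[[1]]
last_pipe <- max(all_pipes)
after_last <- substring(my_string, last_pipe + 1L)

print(after_first)
print(after_last)
```

When there is no pipe at all, both functions return -1, so substring starts at position 0 and the whole string comes back unchanged.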

Using shorthand character classes inside character classes in R regex

I have defined
vec <- "5f 110y, Fast"
and
gsub("[\\s0-9a-z]+,", "", vec)
gives "5f Fast"
I would have expected it to give "Fast" since everything before the comma should get matched by the regex.
Can anyone explain to me why this is not the case?
You should keep in mind that, in TRE regex patterns, you cannot use regex escapes like \s, \d, \w inside bracket expressions.
So, the regex in your case, "[\\s0-9a-z]+,", matches 1 or more \, s, digits and lowercase ASCII letters, and then a single ,.
You may use POSIX character classes instead, like [:space:] (any whitespaces) or [:blank:] (horizontal whitespaces):
> gsub("[[:space:]0-9a-z]+,", "", vec)
[1] " Fast"
Or, use a PCRE regex with \s and perl=TRUE argument:
> gsub("[\\s0-9a-z]+,", "", vec, perl=TRUE)
[1] " Fast"
To make \s match all Unicode whitespaces, add (*UCP) PCRE verb at the pattern start: gsub("(*UCP)[\\s0-9a-z]+,", "", vec, perl=TRUE).
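The three calls discussed above can be put side by side in a short sketch. The variable names are made up for this illustration.

```r
vec <- "5f 110y, Fast"

# Default TRE engine: inside [...] the \s is not a whitespace escape,
# so the space in "5f 110y" breaks the match and "5f" survives.
tre_escape <- gsub("[\\s0-9a-z]+,", "", vec)

# Default TRE engine with a POSIX class: whitespace is matched as intended.
tre_posix <- gsub("[[:space:]0-9a-z]+,", "", vec)

# PCRE engine: \s is understood inside the bracket expression.
pcre_escape <- gsub("[\\s0-9a-z]+,", "", vec, perl = TRUE)

print(tre_escape)
print(tre_posix)
print(pcre_escape)
```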
Could you please try the following and let me know if this helps you.
vec <- c("5f 110y, Fast")
gsub(".*,","",vec)
OR
gsub("[[:alnum:]]+ [[:alnum:]]+,","",vec)
A tidyverse solution would be to use str_replace with your original regex:
library(stringr)
str_replace(vec, "[\\s0-9a-z]+,", "")
Try a different regex:
gsub("[[:blank:][:digit:][:lower:]]+,", "", vec)
#[1] " Fast"
Or, to remove the space after the comma,
gsub("[[:blank:][:digit:][:lower:]]+, ", "", vec)
#[1] "Fast"

R/ Regex: Remove an immediate character in front of a pattern along with the pattern

I have this string:
cd/etc/init[BKSP][BKSP]it.d[ENTER]
I want the end result to be like this :
cd/etc/init.d[ENTER]
It would remove all the [BKSP] substrings along with an immediate character in front of it.
I have this sub function:
sub("(.?\\[BKSP\\]+)+", "", string, perl = TRUE)
But getting: cd/etc/iniit.d[ENTER] instead.
Any help would be greatly appreciated! Thanks!
You may use
gsub("(?s).(?R)?\\[BKSP]", "", string, perl=TRUE)
See the regex demo
Details
(?s) - turns on the DOTALL modifier
. - matches any char
(?R)? - matches 1 or 0 occurrences of the whole pattern (recurses the whole pattern)
\\[BKSP] - a literal substring [BKSP].
R demo:
string <- c("cd/etc/init[BKSP][BKSP]it.d[ENTER]", "abcd[BKSP]e")
gsub("(?s).(?R)?\\[BKSP]", "", string, perl=TRUE)
## => [1] "cd/etc/init.d[ENTER]" "abce"
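If the recursive pattern feels hard to read, the same effect can probably be reached with a loop that removes one "character plus [BKSP]" pair at a time until nothing changes. This is only a sketch of an alternative, not the approach from the answer, and apply_backspaces is a hypothetical helper name.

```r
# Hypothetical helper: repeatedly delete the leftmost "<any char>[BKSP]" pair.
apply_backspaces <- function(strings) {
  vapply(strings, function(s) {
    repeat {
      # (?s) lets . match line breaks too, as in the recursive pattern.
      reduced <- sub("(?s).\\[BKSP\\]", "", s, perl = TRUE)
      if (identical(reduced, s)) return(s)  # nothing left to remove
      s <- reduced
    }
  }, character(1), USE.NAMES = FALSE)
}

string <- c("cd/etc/init[BKSP][BKSP]it.d[ENTER]", "abcd[BKSP]e")
result <- apply_backspaces(string)
print(result)
```

Each pass handles one backspace, so consecutive [BKSP] tokens are resolved from the inside out, much like typing them would.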
You could use
test <- "cd/etc/init[BKSP][BKSP]it.d[ENTER]"
pattern <- "\\[BKSP\\]\\w*"
gsub(pattern, "", test)
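Which yields the expected result for this input (note that the pattern removes the word characters that follow a [BKSP] run rather than the character in front of each [BKSP], so it is not a general solution)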
[1] "cd/etc/init.d[ENTER]"

Removing the second "|" on the last position

Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: The individual strings can contain even more entries of "sp|" and "orf"
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" for each "sp|" and "orf|" entry. It seems that "|" is a special challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries vary a lot. The only things they have in common are "orf|" or "sp|" at the beginning and "|" in the last position. I tried different things with gsub() but also with the stringr package or regexpr() or [:punct:], but nothing really worked. Maybe it was just the wrong combination.
We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we split by ',', remove the last character with substr, and paste the list elements together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))
An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | at the end of the string and removes it. The outer gsub looks for |, and replaces it with ,.
You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
See the online R demo. No lookarounds are really necessary, as you see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referred to with \1 in the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).
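As a quick sanity check, the approaches from this thread can be compared on the question's data. A minimal sketch (the result variable names are made up here):

```r
a <- c("sp|Q9Y6W5|", "sp|Q9HB90|,sp|Q9NQL2|", "orf|NCBIAAYI_c_1_1023|",
       "orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
       "orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")

# PCRE lookahead: drop a | that is followed by a comma or the end of the string.
with_lookahead <- gsub("[|](?=(,|$))", "", a, perl = TRUE)

# TRE capturing group: match the | plus what follows, then restore what follows.
with_group <- gsub("\\|(,|$)", "\\1", a)

# Two nested calls: trailing | first, then every "|," pair.
with_two_calls <- gsub("\\|,", ",", gsub("\\|$", "", a))

print(with_group)
stopifnot(identical(with_lookahead, with_group),
          identical(with_group, with_two_calls))
```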

Resources