Delete string parts within delimiter - r

I have a string as"dfgdf" sa"2323":
a <- "as\"dfgdf\" sa\"2323\""
The delimiter (the same for the start and the end) is ". I want a string where everything between the delimiters is deleted, but not the delimiters themselves. The end result should look like as"" sa""

You could match " and drop it from the reported match using \K.
Then use a negated character class matching any character except " or whitespace, followed by a positive lookahead asserting a " to the right.
Use perl=TRUE to enable Perl-compatible regular expressions, which \K and lookarounds require.
a <- "as\"dfgdf\" sa\"2323\""
gsub('"\\K[^"\\s]+(?=")', "", a, perl=TRUE)
Output
[1] "as\"\" sa\"\""

Here is another base R option using paste0 + strsplit
s <- paste0(paste0(unlist(strsplit(a, '"\\w+"')), '""'), collapse = "")
which gives
> s
[1] "as\"\" sa\"\""
> cat(s)
as"" sa""

Here is one option with regex lookarounds: match a word (\\w+) that follows a double quote and precedes another one, and replace it with an empty string ("")
cat(gsub('(?<=")\\w+(?=")', "", a, perl = TRUE), "\n")
#as"" sa""
Or without regex lookaround
cat(gsub('"\\w+"', '""', a), "\n")
#as"" sa""

I also found a way with stringr library:
library(stringr)
a <- "as\"dfgdf\" sa\"2323\""
result <- str_replace_all(a, "\".*?\"", "\"\"")
cat(result)
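The answers above differ in what they allow between the quotes: \\w+ and [^"\\s]+ only match non-empty runs without whitespace, while .*? (or [^"]*) matches anything up to the next quote. A minimal base R sketch of that difference, where empty_quoted and the second test string b are made up for illustration:

```r
# Hypothetical helper: empty every double-quoted segment, keeping the quotes.
# "[^\"]*" allows any content between the quotes, including spaces or nothing.
empty_quoted <- function(x) gsub('"[^"]*"', '""', x)

a <- "as\"dfgdf\" sa\"2323\""
b <- "x\"two words\" y\"\""   # assumed extra input with a space inside the quotes

res_a <- empty_quoted(a)
res_b <- empty_quoted(b)
res_w <- gsub('"\\w+"', '""', b)  # \\w+ cannot cross the space, so b is unchanged

cat(res_a, res_b, res_w, sep = "\n")
```

If quoted segments never contain whitespace or punctuation, the \\w+ variants behave the same; the broader class only matters for inputs like b.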

Related

capitalize the first letter of two words separated by underscore using stringr

I have a string like word_string. What I want is Word_String. If I use the function str_to_title from stringr, what I get is Word_string. It does not capitalize the second word.
Does anyone know any elegant way to achieve that with stringr? Thanks!
Here is a base R option using gsub:
input <- "word_string"
output <- gsub("(?<=^|_)([a-z])", "\\U\\1", input, perl=TRUE)
output
[1] "Word_String"
The regex pattern matches and captures any lowercase letter [a-z] that is preceded by either the start of the string (i.e. it is the first letter) or an underscore. We then replace it with the uppercase version of that single letter. Note that the \U modifier for changing to uppercase is a Perl extension, so we must use gsub in Perl mode (perl=TRUE).
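The same idea can be written without a lookbehind by capturing the prefix (start of string or underscore) and putting it back in the replacement. This is a sketch of an equivalent form, not the answerer's code; the inputs vector is made up for illustration:

```r
# Capture the prefix in group 1 and the letter in group 2,
# then re-emit the prefix and uppercase only the letter.
inputs <- c("word_string", "one_two_three")
capitalized <- gsub("(^|_)([a-z])", "\\1\\U\\2", inputs, perl = TRUE)
print(capitalized)
# [1] "Word_String"   "One_Two_Three"
```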
Can also use to_any_case from snakecase
library(snakecase)
to_any_case(str1, "title", sep_out = "_")
#[1] "Word_String"
data
str1 <- "word_string"
This is admittedly more complicated than necessary, but here is another base R possibility:
test <- "word_string"
paste0(unlist(lapply(strsplit(test, "_"),function(x)
paste0(toupper(substring(x,1,1)),
substring(x,2,nchar(x))))),collapse="_")
[1] "Word_String"
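The split-and-paste approach can be wrapped in a small vectorized function. A sketch, assuming the separator is a literal string; capitalize_parts is a hypothetical name:

```r
# Hypothetical helper: capitalize the first letter of every sep-delimited part.
capitalize_parts <- function(x, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE), function(parts) {
    # Uppercase the first character of each part, keep the rest unchanged.
    paste0(toupper(substring(parts, 1, 1)), substring(parts, 2), collapse = sep)
  }, character(1))
}

out <- capitalize_parts(c("word_string", "a_b_c"))
print(out)
# [1] "Word_String" "A_B_C"
```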
You could first use gsub to replace "_" by " " and apply the str_to_title function
Then use gsub again to change it back to your format
x <- str_to_title(gsub("_"," ","word_string"))
gsub(" ","_",x)

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could bundle the \s and the underscore only in a character class and use quantifier to repeat that 1+ times.
^[\s_]+
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or, as @Tim Biegeleisen points out in the comments, since the anchored pattern can match at most once, you could use sub instead:
txt <- sub("^[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class:
txt <- sub("^[[:space:]_]+", "", txt)
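The ^ anchor is what restricts the replacement to the start of the string. A short sketch of the difference, where strip_leading and the second input are made up for illustration:

```r
# Hypothetical helper: remove whitespace and underscores only at the start.
strip_leading <- function(x) sub("^[[:space:]_]+", "", x)

with_prefix    <- strip_leading(" _ 9PM 8-Oct-2014_0.335kwh ")
without_prefix <- strip_leading("9PM 8-Oct")          # nothing to strip
unanchored     <- sub("[[:space:]_]+", "", "9PM 8-Oct")  # removes the first interior space

print(c(with_prefix, without_prefix, unanchored))
# [1] "9PM 8-Oct-2014_0.335kwh " "9PM 8-Oct"                "9PM8-Oct"
```

Without the anchor, sub still removes a leading run when one exists, but it falls through to the first interior match when the string has no such prefix.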
The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start (anchored with ^).
str_remove(s, "^[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end (anchored with ^ and $).
str_remove_all(s, "^[\\s_]+|[\\s_]+$")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (spaces, tabs, newlines) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` trims both ends and collapses repeated whitespace inside the string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see The fourth bird's answer).
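A base R sketch of the "both ends" case, assuming POSIX character classes so that perl=TRUE is not needed; strip_both and the second input are made up for illustration:

```r
# Hypothetical helper: strip whitespace and underscores from both ends only.
# The two alternatives are anchored, so interior characters are left alone.
strip_both <- function(x) gsub("^[[:space:]_]+|[[:space:]_]+$", "", x)

r1 <- strip_both(" \t_blah_ ")
r2 <- strip_both(" _a b_c_ ")   # interior space and underscore survive

print(c(r1, r2))
# [1] "blah"  "a b_c"
```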
If trimming whitespace on both sides is acceptable, you can use str_trim from stringr:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or trimws in base R:
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

R/ Regex: Remove an immediate character in front of a pattern along with the pattern

I have this string:
cd/etc/init[BKSP][BKSP]it.d[ENTER]
I want the end result to be like this :
cd/etc/init.d[ENTER]
It would remove all the [BKSP] substrings along with an immediate character in front of it.
I have this sub function:
sub("(.?\\[BKSP\\]+)+", "", string, perl = TRUE)
But getting: cd/etc/iniit.d[ENTER] instead.
Any help would be greatly appreciated! Thanks!
You may use
gsub("(?s).(?R)?\\[BKSP]", "", string, perl=TRUE)
Details
(?s) - turns on the DOTALL modifier
. - matches any char
(?R)? - matches 1 or 0 occurrences of the whole pattern (recurses the whole pattern)
\\[BKSP] - a literal substring [BKSP].
R demo:
string <- c("cd/etc/init[BKSP][BKSP]it.d[ENTER]", "abcd[BKSP]e")
gsub("(?s).(?R)?\\[BKSP]", "", string, perl=TRUE)
## => [1] "cd/etc/init.d[ENTER]" "abce"
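If the recursive pattern feels opaque, the same effect can be sketched as a loop that consumes one [BKSP] at a time together with the character before it. This is an alternative illustration, not the answerer's method; apply_backspaces and the third input are made up:

```r
# Hypothetical helper: process backspace tokens from left to right.
# Each pass removes the first "[BKSP]" plus the single character before it
# (or just the token when it sits at the very start of the string).
apply_backspaces <- function(x) {
  while (any(grepl("[BKSP]", x, fixed = TRUE))) {
    x <- sub("(?s).?\\[BKSP\\]", "", x, perl = TRUE)
  }
  x
}

cleaned <- apply_backspaces(c("cd/etc/init[BKSP][BKSP]it.d[ENTER]",
                              "abcd[BKSP]e",
                              "[BKSP]x"))
print(cleaned)
# [1] "cd/etc/init.d[ENTER]" "abce"                 "x"
```

Because the leftmost token is always handled first, the character in front of it can never belong to another token, which is why a plain .? is enough here.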
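You could use the following. Note that it removes the word characters after each [BKSP] rather than the character before it, so it gives the expected result here only because the retyped text ("it") equals the deleted text: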
test <- "cd/etc/init[BKSP][BKSP]it.d[ENTER]"
pattern <- "\\[BKSP\\]\\w*"
gsub(pattern, "", test)
Which yields
[1] "cd/etc/init.d[ENTER]"

A regex to remove all words which contains number in R

I want to write a regex in R to remove all words of a string containing numbers.
For example:
first_text = "a2c if3 clean 001mn10 string asw21"
second_text = "clean string"
Try with gsub
trimws(gsub("\\w*[0-9]+\\w*\\s*", "", first_text))
#[1] "clean string"
It is easier to select words with no numbers than to select and delete words with numbers:
> library(stringr)
> str1 <- "a2c if3 clean 001mn10 string asw21"
> paste(unlist(str_extract_all(str1, "(\\b[^\\s\\d]+\\b)")), collapse = " ")
[1] "clean string"
Note:
Backslashes have to be escaped in R to work properly, hence double backslashes
\b is word boundary
\s is white space
\d is digit character
a caret (^) at the start of square brackets negates the class: find characters that do not match ...
"+" after the character group inside [] means "1 or more" occurrences of those (non white space and non digit) characters
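The same pattern can be tried in base R without stringr. A sketch using regmatches and gregexpr, with perl=TRUE so that \s and \d are understood inside the character class:

```r
str1 <- "a2c if3 clean 001mn10 string asw21"

# \b on both sides forces the run of non-space, non-digit characters
# to span a whole word, so fragments such as "mn" in "001mn10" are rejected.
kept <- regmatches(str1, gregexpr("\\b[^\\s\\d]+\\b", str1, perl = TRUE))[[1]]
joined <- paste(kept, collapse = " ")
print(joined)
# [1] "clean string"
```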
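Just another alternative using gsub (the trailing \\s* also consumes the whitespace after each removed word, so no double spaces are left behind):
trimws(gsub("[^\\s]*[0-9][^\\s]*\\s*", "", first_text, perl=TRUE))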
#[1] "clean string"
A bit longer than some of the answers but very tractable is to first convert the string to a vector of words, then check word by word if there are any numbers and use standard R subsetting.
first_text_vec <- strsplit(first_text, " ")[[1]]
first_text_vec
[1] "a2c" "if3" "clean" "001mn10" "string" "asw21"
paste(first_text_vec[!grepl("[0-9]", first_text_vec)], collapse = " ")
[1] "clean string"
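The word-by-word approach can be wrapped in a function that also tolerates repeated or leading whitespace. A sketch; drop_words_with_digits and the second input are made up for illustration:

```r
# Hypothetical helper: keep only the words that contain no digit.
drop_words_with_digits <- function(x) {
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]            # drop the empty piece from leading whitespace
  paste(words[!grepl("[0-9]", words)], collapse = " ")
}

d1 <- drop_words_with_digits("a2c if3 clean 001mn10 string asw21")
d2 <- drop_words_with_digits("  a1   keep  b2 ")
print(c(d1, d2))
# [1] "clean string" "keep"
```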

Remove everything after space in string

I would like to remove everything after a space in a string.
For example:
"my string is sad"
should return
"my"
I've been trying to figure out how to do this using sub/gsub but have been unsuccessful so far.
You may use a regex like
sub(" .*", "", x)
Here, sub performs a single search-and-replace operation. The leading space in the pattern matches the first space in the string (the regex engine scans the string from left to right), and .* then matches zero or more characters, as many as possible, up to the end of the string. In the default TRE regex flavor, . also matches line break characters; beware that with perl=TRUE it does not, unless (?s) is used.
Some variations:
sub("[[:space:]].*", "", x) # \s or [[:space:]] will match more whitespace chars
sub("(*UCP)(?s)\\s.*", "", x, perl=TRUE) # PCRE Unicode-aware regex
stringr::str_replace(x, "(?s) .*", "") # (?s) will force . to match any chars
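The line break caveat can be checked with a string that contains a newline after the first space. A minimal sketch; the input x is made up for illustration:

```r
x <- "my string\nis sad"

tre_default <- sub(" .*", "", x)                   # TRE: . also matches the newline
pcre_plain  <- sub(" .*", "", x, perl = TRUE)       # PCRE: . stops at the newline
pcre_dotall <- sub("(?s) .*", "", x, perl = TRUE)   # (?s) lets . match the newline again

print(c(tre_default, pcre_plain, pcre_dotall))
# [1] "my"          "my\nis sad"  "my"
```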
strsplit("my string is sad"," ")[[1]][1]
Or replace everything after the first space with nothing:
gsub(' [A-Za-z ]*', '', 'my string is sad')
And with numbers:
gsub('([0-9]+) .*', '\\1', c('c123123123 0320.1'))
If you want to do it with a regex and a capture group:
gsub('([A-Za-z]+) .*', '\\1', 'my string is sad')
Stringr is your friend.
library(stringr)
word("my string is sad", 1)

Resources