Replacing white space with one single backslash

I want to replace a white space with ONE backslash and a whitespace like this:
"foo bar" --> "foo\ bar"
I found how to replace with multiple backslashes but wasn't able to adapt it to a single backslash.
I tried this so far:
x <- "foo bar"
gsub(" ", "\\ ", x)
# [1] "foo bar"
gsub(" ", "\\\ ", x)
# [1] "foo bar"
gsub(" ", "\\\\ ", x)
# [1] "foo\\ bar"
However, none of these outcomes meets my needs. I need the replacement to dynamically create file paths which contain folders with names like
/some/path/foo bar/foobar.txt.
To use them in shell commands via system(), white spaces have to be escaped with a \, giving
/some/path/foo\ bar/foobar.txt.
Do you know how to solve this one?

Your problem is a confusion between the content of a string and its representation. When you print a string in the ordinary way in R, you will never see a single backslash (unless it denotes a special character, e.g. print("y\n")). If you use cat() instead, you'll see only a single backslash.
x <- "foo bar"
y <- gsub(" ", "\\\\ ", x)
print(y)
## [1] "foo\\ bar"
cat(y,"\n") ## string followed by a newline
## foo\ bar
There are 8 characters in the string; 6 letters, one space, and the backslash.
nchar(y) ## 8
For comparison, consider \n (newline character).
z <- gsub(" ", "\n ", x)
print(z)
## [1] "foo\n bar"
cat(z,"\n")
## foo
## bar
nchar(z) ## 8
If you're constructing file paths, it might be easier to use forward slashes instead - forward slashes work as file separators in R on all operating systems (even Windows). Or check out file.path(). (Without knowing exactly what you're trying to do, I can't say more.)
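If the goal is to pass such paths to system(), one option that avoids hand-written escaping is base R's shQuote(), which quotes a string for the shell. The following is only a minimal sketch, assuming a Bourne-style shell (hence type = "sh"); the directory and file names are the ones from the question, and the ls command is purely illustrative.

```r
# Build the path from its components; file.path() joins them with "/".
dir_name <- "foo bar"
path <- file.path("/some/path", dir_name, "foobar.txt")

# Quote the whole path for a Bourne-style shell instead of escaping each space.
quoted <- shQuote(path, type = "sh")
cat(quoted, "\n")
## '/some/path/foo bar/foobar.txt'

# The quoted path can then be pasted into a command string for system().
# (The command itself is an assumption made for illustration.)
cmd <- paste("ls -l", quoted)
```

Quoting the whole argument sidesteps the question of how many backslashes are needed, because no backslash is involved at all.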

To replace a space with one backslash and a space, you do not even need a regular expression: use your first attempt, gsub(" ", "\\ ", x), with fixed=TRUE:
> x <- "foo bar"
> res <- gsub(" ", "\\ ", x, fixed=TRUE)
> cat(res, "\n")
foo\ bar
See an online R demo
The cat function displays the "real", literal backslashes.
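To confirm that the result really contains a single backslash, it can help to inspect the string without relying on print(). A small sketch using only base R (the variable names res_fixed and res_regex are illustrative):

```r
x <- "foo bar"

# Literal replacement: with fixed = TRUE the replacement "\\ " is one backslash plus a space.
res_fixed <- gsub(" ", "\\ ", x, fixed = TRUE)

# Regex replacement: the backslash must be escaped once more for the regex engine.
res_regex <- gsub(" ", "\\\\ ", x)

identical(res_fixed, res_regex)  # TRUE: both approaches build the same string
nchar(res_fixed)                 # 8: counts characters, not the printed escapes
writeLines(res_fixed)            # shows the string content, like cat()
```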

Related

regex replace parts/groups of a string in R

Trying to postprocess the LaTeX (pdf_book output) of a bookdown document to collapse biblatex citations to be able to sort them chronologically using \usepackage[sortcites]{biblatex} later on. Thus, I need to find }{ after \\autocites and replace it with ,. I am experimenting with gsub() but can't find the correct incantation.
# example input
testcase <- "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"
# desired output
"text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
A simple approach was to replace all }{
> gsub('\\}\\{', ',', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep,separate}"
But this also collapses {keep}{separate}.
I then tried to replace }{ only within a 'word' (a string of characters without whitespace) starting with \\autocites by using different groups, and failed bitterly:
> gsub('(\\\\autocites)([^ \f\n\r\t\v}{}]+)((\\}\\{})+)', '\\1\\2\\3', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"
Addendum:
The actual document contains more lines/elements than the testcase above. Not all elements contain \\autocites and in rare cases one element has more than one \\autocites. I didn't originally think this was relevant. A more realistic testcase:
testcase2 <- c("some text",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

A single gsub call is enough:
gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase, perl=TRUE)
## => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
See the regex demo. Here, (?:\G(?!^)|\\autocites) matches the end of the previous match or \autocites string, then it matches any 0 or more non-whitespace chars, but as few as possible, then \K discards the text from the current match buffer and consumes the }{ substring that is eventually replaced with a comma.
There is also a very readable solution that combines one regex replacement with one fixed-text replacement, using stringr::str_replace_all:
library(stringr)
str_replace_all(testcase, "\\\\autocites\\S+", function(x) gsub("}{", ",", x, fixed=TRUE))
# => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
Here, \\autocites\S+ matches \autocites and then 1+ non-whitespace chars, and gsub("}{", ",", x, fixed=TRUE) replaces (very fast) each }{ with , in the matched text.
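The same two-step idea can be sketched in base R with gregexpr() and regmatches<-, applied here to the testcase2 vector from the question's addendum. This is an illustrative variant rather than part of the original answer; m and result are names chosen for the example.

```r
testcase2 <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

# Locate every \autocites "word": the command plus the non-whitespace chars after it.
m <- gregexpr("\\\\autocites\\S+", testcase2, perl = TRUE)

# Within each located word, replace the literal text "}{" with ",".
result <- testcase2
regmatches(result, m) <- lapply(regmatches(result, m),
                                function(x) gsub("}{", ",", x, fixed = TRUE))
result
```

Elements without any \autocites are left untouched, and an element with several \autocites blocks has each block collapsed separately, while {keep}{separate} stays as it is.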

Not the prettiest solution, but it works. This repeatedly replaces }{ with , but only if it follows autocites with no intervening blanks.
while(length(grep('(autocites\\S*)\\}\\{', testcase, perl=TRUE))) {
    testcase = sub('(autocites\\S*)\\}\\{', '\\1,', testcase, perl=TRUE)
}
testcase
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

I'll make the input string slightly bigger to make the algorithm more clear.
str <- "
text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}
text \\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990} text {keep}{separate}
"
First we extract all the citation blocks, replace "}{" with "," in them, and then put them back into the string.
library(stringr)
# pattern for matching citation blocks
pattern <- "\\\\autocites(\\[[^\\[\\]]*\\])*(\\{[[:alnum:]]*\\})+"
cit <- str_extract_all(str, pattern)[[1]]
cit
#> [1] "\\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990}"
Replace in citation blocks:
newcit <- str_replace_all(cit, "\\}\\{", ",")
newcit
#> [1] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"
Break the original string at the places where a citation block was found:
strspl <- str_split(str, pattern)[[1]]
strspl
#> [1] "\ntext " " text {keep}{separate}\ntext " " text {keep}{separate}\n"
Insert modified citation blocks:
combined <- character(length(strspl) + length(newcit))
combined[c(TRUE, FALSE)] <- strspl
combined[c(FALSE, TRUE)] <- newcit
combined
#> [1] "\ntext "
#> [2] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [3] " text {keep}{separate}\ntext "
#> [4] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"
#> [5] " text {keep}{separate}\n"
Paste it together to finalize:
newstr <- paste(combined, collapse = "")
newstr
#> [1] "\ntext \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}\ntext \\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990} text {keep}{separate}\n"
I suspect there could be a more elegant fully-regex solution based on the same idea, but I wasn't able to find one.

I found an incantation that works. It's not pretty:
gsub("\\\\autocites[^ ]*",
     gsub("\\}\\{", ",",
          gsub(".*(\\\\autocites[^ ]*).*", "\\\\\\1", testcase) # all those extra backslashes are there because R is ridiculous.
     ),
     testcase)
I broke it into lines to hopefully make it a little more intelligible. Basically, the innermost gsub extracts just the autocites (anything that follows \\autocites up to the first space), then the middle gsub replaces the }{s with commas, and the outermost gsub substitutes the result of the middle one for the pattern extracted in the innermost one.
This will only work with a single autocites in a string, of course.
Also, fortune(365).

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?

You could put \s and the underscore together in a character class and use a quantifier to repeat it one or more times.
^[\s_]+
Regex demo
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or, as @Tim Biegeleisen points out in the comments, because only the first occurrence is being replaced, you could use sub instead (keep the ^ anchor so that only a leading run can match):
txt <- sub("^[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class:
txt <- sub("^[[:space:]_]+", "", txt)
More info about perl=TRUE and regular expressions used in R
R demo
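A short sketch of why the ^ anchor matters when only the beginning of the string should be cleaned. The helper name trim_leading and the extra test strings are made up for illustration:

```r
# Hypothetical helper: strip leading whitespace and underscores only.
trim_leading <- function(x) sub("^[[:space:]_]+", "", x)

txt <- c(" 9PM 8-Oct-2014_0.335kwh ", "_ _x_y ", "a_b")
trimmed <- trim_leading(txt)
trimmed
## "9PM 8-Oct-2014_0.335kwh " "x_y " "a_b"

# Without the anchor, the first run anywhere in the string is removed,
# which changes strings that do not start with whitespace or an underscore.
unanchored <- sub("[[:space:]_]+", "", "a_b")
unanchored
## "ab"
```

Trailing whitespace and inner underscores are deliberately left alone, matching the output requested in the question.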

The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start (^ anchors the match there).
str_remove(s, "^[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end (anchored with ^ and $).
str_remove_all(s, "^[\\s_]+|[\\s_]+$")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (spaces, tabs, newlines) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same character class [\\s_]+ will also work in base R's sub or gsub, with some minor modifications (such as perl=TRUE), if that's your jam (see The fourth bird's answer).

You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or trimws() in base R:
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

Remove punctuation but keep hyphenated phrases in R text cleaning

Is there any effective way to remove punctuation in text but keeping hyphenated expressions, such as "accident-prone"?
I used the following function to clean my text
clean.text = function(x)
{
    # remove rt
    x = gsub("rt ", "", x)
    # remove at
    x = gsub("@\\w+", "", x)
    # remove punctuation (this also removes hyphens)
    x = gsub("[[:punct:]]", "", x)
    # remove digits
    x = gsub("[[:digit:]]", "", x)
    # remove http
    x = gsub("http\\w+", "", x)
    # remove runs of spaces/tabs, then a leading or trailing space
    x = gsub("[ |\t]{2,}", "", x)
    x = gsub("^ ", "", x)
    x = gsub(" $", "", x)
    x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
    #return(x)
}
and applied it to a hyphenated expression, which returned
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"
while my desired output is
"accident-prone"
I have referenced this thread but didn't find it worked on my situation. There must be some regex things that I haven't figured out. It will be really appreciated if someone could enlighten me on this.

Putting my two cents in, you could use (*SKIP)(*FAIL) with perl = TRUE and remove any non-word characters:
data <- c("my-test of #$%^&*", "accident-prone")
(gsub("(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+", "", data, perl = TRUE))
Resulting in
[1] "my-test of" "accident-prone"
See a demo on regex101.com.
Here the idea is to match what you want to keep
(?<![^\\w])[- ](?=\\w)
# a whitespace or a dash between two word characters
# or at the very beginning of the string
let these fail with (*SKIP)(*FAIL) and put what you want to be removed on the right side of the alternation, in this case
\W+
effectively removing any non-word-characters not between word characters.
You'd need to provide more examples for testing though.

The [:punct:] set of characters includes the dash, and you are removing it. You could make an alternate character class that omits the dash. You do need to pay special attention to the placement of the square brackets and escape the double quote and the backslash:
(test <- gsub("[]!\"#$%&'()*+,./:;<=>?@[\\^_`{|}~]", "", "my-test of #$%^&*") )
[1] "my-test of "
The ?regex (help page) advises against using ranges. I investigated whether there might be any simplification using my local ASCII sequence of punctuation, but it quickly became obvious that was not the way to go for other reasons. There were 5 separate ranges, and the "]" was in the middle of one of them so there would have been 7 ranges to handle in addition to the "]" which needs to come first.
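As an alternative to spelling out the class by hand, a negative lookahead can exclude the dash from [[:punct:]]. This is only a sketch of that idea, not taken from the answers above; it assumes perl = TRUE (lookaheads are not available in R's default regex engine), and the third test string is invented for illustration.

```r
# Remove all punctuation except the hyphen: the negative lookahead (?!-)
# rejects a hyphen before [[:punct:]] gets a chance to match it.
data <- c("my-test of #$%^&*", "accident-prone", "well-known, accident-prone area!")
cleaned <- gsub("(?!-)[[:punct:]]", "", data, perl = TRUE)
cleaned
## "my-test of " "accident-prone" "well-known accident-prone area"
```

Note that this keeps every hyphen, including leading or trailing ones, so it may need tightening for text that uses dashes as punctuation.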

Move location of special character

I have an entire vector of strings in which the only special symbol is "-".
To be clear, a sample string looks like 23 C-Exam
I'd like to change it to 23-C Exam
I essentially want R to find the location of "-" and move it 2 spaces back.
I feel this is a really simple task, although I can't figure out how.
Assume that whenever R finds "-" , two spaces back is whitespace just like the example above.

regex attempt:
x <- c("23 C-Exam","45 D-Exam")
#[1] "23 C-Exam" "45 D-Exam"
sub(".(.)-", "-\\1 ", x)
#[1] "23-C Exam" "45-D Exam"
Find any character ., followed by a captured character (.), followed by a literal dash -.
Replace them with a literal dash -, the saved character \\1, and a space in place of the original dash.

There is probably a sleek way of doing this with regular expressions, but one approach is to simply splice together the various pieces of the desired output. First, I find the index in the string containing the -, and then I use substr() to piece together the output.
x <- "23 C-Exam"
pos <- regexpr("-", x)
x <- paste0(substr(x, 1, pos-3),
            "-",
            substr(x, pos-1, pos-1),
            " ",
            substr(x, pos+1, nchar(x)))
> x
[1] "23-C Exam"

We can also use chartr
chartr(" -", "- ", x)
#[1] "23-C Exam" "45-D Exam"
data
x <- c("23 C-Exam","45 D-Exam")
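One caveat worth illustrating: chartr() swaps every space and every dash in the string, whereas the sub() approach only touches the characters around the first dash. The two agree on the sample data but differ as soon as a string contains more spaces. The input x2 below is a made-up example:

```r
x2 <- "23 C-Exam Part 2"

# chartr() translates character by character: all spaces become dashes and vice versa.
swapped <- chartr(" -", "- ", x2)
swapped
## "23-C Exam-Part-2"

# sub() rewrites only the first "<char><char>-" occurrence.
moved <- sub(".(.)-", "-\\1 ", x2)
moved
## "23-C Exam Part 2"
```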

remove all line breaks (enter symbols) from the string using R

How to remove all line breaks (enter symbols) from the string?
my_string <- "foo\nbar\rbaz\r\nquux"
I've tried gsub("\n", "", my_string), but it doesn't work, because a newline (\n) and a carriage return (\r) are different characters.

You need to strip \r and \n to remove carriage returns and new lines.
x <- "foo\nbar\rbaz\r\nquux"
gsub("[\r\n]", "", x)
## [1] "foobarbazquux"
Or
library(stringr)
str_replace_all(x, "[\r\n]" , "")
## [1] "foobarbazquux"

I just wanted to note here that if you want to insert spaces where you found newlines the best option is to use the following:
gsub("\r?\n|\r", " ", x)
which will insert only one space regardless of whether the text contains \r\n, \n or \r.
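The difference from the character-class pattern shows up on Windows-style \r\n line endings. A quick comparison on the string from the question (the variable names one_space and per_char are illustrative):

```r
x <- "foo\nbar\rbaz\r\nquux"

# "\r\n" is matched as a single unit, so it yields one space.
one_space <- gsub("\r?\n|\r", " ", x)
one_space
## "foo bar baz quux"

# The character class replaces "\r" and "\n" separately, so "\r\n" yields two spaces.
per_char <- gsub("[\r\n]", " ", x)
per_char
## "foo bar baz  quux"
```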

Have had success with:
gsub("\\\n", "", x)

With stringr::str_remove_all
library(stringr)
str_remove_all(my_string, "[\r\n]")
# [1] "foobarbazquux"
