Remove multiple instances with a regex expression, but not the text in between instances [duplicate] - r

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 1 year ago.
In long passages using bookdown, I have inserted numerous images. Having combined the passages into a single character string (in a data frame) I want to remove the markdown text associated with inserting images, but not any text in between those inserted images. Here is a toy example.
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"
library(stringr)
str_remove_all(string = text.string, pattern = "!\\[.+\\)")
[1] "writing more writing"
The regular expression doesn't stop at the first closing parenthesis; it continues to the last one and deletes the "writing to keep" in between.
I tried to apply the answers to "String manipulation in R: remove specific pattern in multiple places without removing text in between instances of the pattern", which use gsubfn and gsub, but was unable to get the solutions to work.
Please point me in the right direction to solve this problem of a regex removal of designated strings, but not the characters in between the strings. I would prefer a stringr solution, but whatever works. Thank you

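You can use the following regex:
"!\\[[^\\)]+\\)"
Alternatively, you can use this:
"!\\[.*?\\)"
The first uses a negated character class and the second a lazy quantifier. Either way, the match stops at the first closing parenthesis instead of running greedily to the last one, which is the key to your question.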
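As a minimal sketch of the difference, the snippet below runs the greedy pattern from the question next to the two alternatives. It uses base R's gsub with perl = TRUE rather than stringr so that it runs without extra packages; the variable names (greedy, lazy, negated, tidy) and the final whitespace clean-up are illustrative assumptions, not part of the original answers.

```r
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"

# Greedy: .+ runs to the LAST ")" and swallows the text in between.
greedy <- gsub("!\\[.+\\)", "", text.string, perl = TRUE)

# Lazy: .*? stops at the FIRST ")" after each "![".
lazy <- gsub("!\\[.*?\\)", "", text.string, perl = TRUE)

# Negated class: [^)]+ cannot cross a ")", so it also stops at the first one.
negated <- gsub("!\\[[^)]+\\)", "", text.string, perl = TRUE)

# Removing the image markup leaves doubled spaces; collapse them (optional step).
tidy <- gsub(" +", " ", lazy)

print(greedy)
print(tidy)
```

The same patterns should work unchanged as the pattern argument of str_remove_all if you prefer stringr.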

I think you could use the following solution too:
gsub("!\\[[^][]*\\]\\([^()]*\\)", "", text.string)
[1] "writing writing to keep more writing"

Related

Replace latex with r strings using gsub [duplicate]

This question already has an answer here:
"'\w' is an unrecognized escape" in grep
(1 answer)
Closed 1 year ago.
I would like to find and replace tabular instances by tabularx. I tried gsub, but it seems to lead me into a world of escaping pain. Following other questions and answers, I found fixed=TRUE, which is the best I have so far. The code snippet below almost works, but \B is unrecognized. If I escape it twice I get \BEGIN as output!
texText <- '\begin{tabular}{rl}\begin{tabular}{rll}'
texText <- gsub("\begin{tabular}{rl}", "\BEGIN{tabular}{rll}", texText, fixed=TRUE)
I'm using BEGIN as my test to see what is happening. This is before I get to tackling the question of what goes on in the brackets {rl} {ll} {rrl} etc. Ideally I'm looking for a regex that would output:
\begin{tabularx}{rX}\begin{tabularx}{rlX}
That is, the final column is replaced by X.
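Try using proper escaping:
texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"
output <- gsub("\\\\begin\\{tabular\\}", "\\\\begin{tabularx}", texText)
output
[1] "\\begin{tabularx}{rl}\\begin{tabularx}{rll}"
In an R string literal, a literal backslash is written as two backslashes. In a regex pattern (and in the replacement) that backslash has to be escaped once more, which is why four backslashes appear in the source. Regex metacharacters such as { and } are escaped with two backslashes in the pattern.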
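The question also asks for the final column of the column specification to become X. One possible sketch, assuming the column specification consists only of lowercase letters such as r and l (no |, p{...} or similar), is to capture everything but the last letter and put X after it:

```r
texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"

# ([a-z]*) captures all column letters except the last one;
# the trailing [a-z] is the final column, which is replaced by X.
output <- gsub("\\\\begin\\{tabular\\}\\{([a-z]*)[a-z]\\}",
               "\\\\begin{tabularx}{\\1X}",
               texText)

# writeLines shows the actual characters, with single backslashes.
writeLines(output)
```

Note that tabularx also expects a width argument before the column specification, which this sketch does not add.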

How to use regex to match upto third forward slash in R using gsub? [duplicate]

This question already has answers here:
How to Select everything up to and including the 3rd slash (RegExp)?
(2 answers)
Extract a regular expression match
(12 answers)
Closed 2 years ago.
This question is specifically about how R handles regex. I would like a regex to use with gsub that keeps only the text before the 3rd forward slash.
Here are some string examples:
/google.com/images/video
/msn.com/bing/chat
/bbc.com/video
I would like to obtain the following strings only:
/google.com/images
/msn.com/bing
/bbc.com/video
So it is not keeping the information after the 3rd forward slash.
I cannot seem to get any regex working along with using gsub to solve this!
The closest I have got is:
gsub(pattern = "/[A-Za-z0-9_.-]/[A-Za-z0-9_.-]*$", replacement = "", x = the_data_above )
I think R has some issues regarding forward slashes and escaping them.
Starting at the beginning of the string, match two groups, each consisting of a slash followed by any non-slash characters, then match whatever follows, and replace the whole match with just those two groups.
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")
sub("^((/[^/]*){2}).*", "\\1", paths)
## [1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
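If the number of leading segments to keep might change, the same idea can be wrapped in a small helper. This is only a sketch; keep_segments and its argument n are hypothetical names introduced here for illustration.

```r
# Keep the first n "/segment" groups of each path and drop the rest.
# keep_segments is a hypothetical helper name, not from the original answer.
keep_segments <- function(paths, n = 2) {
  pattern <- sprintf("^((/[^/]*){%d}).*", n)
  sub(pattern, "\\1", paths)
}

paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")
two <- keep_segments(paths, 2)
one <- keep_segments(paths, 1)
print(two)
print(one)
```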
You can take advantage of lazy (vs greedy) matching by adding the ? after the quantifier (+ in this case) within your capture group:
gsub("(/.+?/.+?)/.*", "\\1", text)
[1] "/google.com/images" "/msn.com/bing" "/bbc.com/video"
Data:
text <- c("/google.com/images/video",
"/msn.com/bing/chat",
"/bbc.com/video")
Try this out:
^\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+
As seen here: https://regex101.com/r/9ZYppe/1
Your problem arises from the fact that [A-Za-z0-9_.-] without a quantifier matches exactly one character. You need the + operator to specify that there can be several of them. Also, the $ at the end is unnecessary once ^ anchors the match to the start of the string. Note that this pattern matches the part you want to keep, so it is meant for extracting the match rather than for removing text with gsub.
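Since that pattern describes the text to keep, one way to apply it in base R is to extract the match with regexpr and regmatches instead of gsub. A minimal sketch, assuming every input path matches the pattern (regmatches silently drops elements that do not match):

```r
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")

# Forward slashes need no escaping in R regex, so \/ can be written as /.
pattern <- "^/[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+"

# regexpr finds the first match in each string; regmatches extracts it.
kept <- regmatches(paths, regexpr(pattern, paths))
print(kept)
```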

How to automatically handle strings/paths with backslashes? [duplicate]

This question already has answers here:
How to escape backslashes in R string
(3 answers)
Efficiently convert backslash to forward slash in R
(11 answers)
Closed 3 years ago.
I often want to read in csv files and I get the path by using shift + right click and then clicking "copy path".
I paste this path into my code. See an example below:
read_csv("C:\Users\me\data\file.csv")
Obviously this doesn't work because of the backslashes. My current solution is to escape each one, so that my code looks like this:
read_csv("C:\\Users\\me\\data\\file.csv")
It works, but it's annoying and occasionally I'll get errors because I missed one of the backslashes.
I wanted to create a function that automatically adds the extra backslashes
fix_path <- function(string) str_replace(string, "\\\\", "\\\\\\\\")
but R won't recognize the string in the first place until the backslashes are taken care of.
Is there another way to deal with this? Python has the option of adding an "r" before strings to note that the backslashes should be treated just as regular backslashes, is there anything similar in R? To be clear, I know that I can escape the backslashes, but I am looking for a way to do it automatically.
You can use this hack. Suppose you have copied your path as described; then you can run
scan("clipboard", "character", quiet = TRUE)
scan reads the text from the clipboard and takes care of the backslashes. Then copy what scan returns, which is printed with the backslashes already escaped, and paste that into your code.
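On the question of a Python-style "r" prefix: R 4.0.0 and later support raw string literals of the form r"(...)", in which backslashes are taken literally. A short sketch, assuming R >= 4.0.0 and using the example path from the question (the file itself is not read here):

```r
# Raw string literal: no need to double the backslashes (requires R >= 4.0.0).
path <- r"(C:\Users\me\data\file.csv)"

# It is the same string as the manually escaped version.
same <- identical(path, "C:\\Users\\me\\data\\file.csv")

# Optionally convert to forward slashes, which R also accepts on Windows.
fwd <- gsub("\\", "/", path, fixed = TRUE)
writeLines(fwd)

# The path could then be passed on, e.g. read_csv(path), if the file exists.
```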

How to put \' in my string using paste0 function [duplicate]

This question already has answers here:
How to escape backslashes in R string
(3 answers)
Closed 5 years ago.
I have an array:
t <- c("IMCR01","IMFA02","IMFA03")
I want to make it look like this:
"\'IMCR01\'","\'IMFA02\'","\'IMFA03\'"
I tried different ways like:
paste0("\'",t,"\'")
paste0("\\'",t,"\\'")
paste0("\\\\'",t,"\\\\'")
But none of them is correct. Any other functions are OK as well.
Actually your second attempt is correct:
paste0("\\'",t,"\\'")
If you want to tell paste0 to use a literal backslash, you need to escape it once (but not twice, as you would within a regex pattern). This would output the following to the console in R:
[1] "\\'IMCR01\\'" "\\'IMFA02\\'" "\\'IMFA03\\'"
The trick here is that R escapes the backslash again when it prints the string to the console. If you were instead to write the result to a text file, you would see only a single backslash, as you wanted:
write(paste0("\\'",t,"\\'"), file = "/path/to/your/file.txt")
But why does R escape the backslash when writing to its own console? print() shows a string the way it would be typed in R source code, so a real newline appears as \n and a real backslash as \\, which keeps the two distinguishable. Hence the escaping in the console output.
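To check what the strings really contain without writing a file, you can compare print() with writeLines() and count the characters. A small sketch; the names ids and quoted are illustrative (the vector is renamed from t to avoid masking base::t):

```r
ids <- c("IMCR01", "IMFA02", "IMFA03")

# "\\'" in source is two characters in the string: a backslash and a quote.
quoted <- paste0("\\'", ids, "\\'")

print(quoted)       # escaped representation, backslashes shown doubled
writeLines(quoted)  # actual characters, one backslash before each quote

# Each result has 6 letters/digits plus 2 backslashes plus 2 quotes.
lens <- nchar(quoted)
print(lens)
```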

Removing different words form a document using R console

I have managed to retrieve a text file, but I want to remove various words from it. I have looked at read.table and have no clue how to use it to help me remove certain words. I have 300 words, and these are some of them. How can I remove all these words using the R console? I have two files: one is sk.text, which is a whole document, and the other is bash.txt, which contains just words. I want to remove all the words in sk.text that match the words given in bash.text.
with
within
without
work
worked
working
works
would
A simple way would be to use
gsub(paste0('\\b',
YOURVECTOROFWORDSTOREMOVE,
'\\b', collapse = '|'),'',YOURSTRING)
which removes every occurrence of the words in the vector. Here \\b matches a word boundary, so only whole words are removed (for example, "work" is not removed from inside "working"). The replacement is an empty string, so the spaces around a removed word remain.
But you might want to look at the tm package and work with a corpus object if you have many files like this. There you can remove the words you like simply with
tm_map(YOURCORPUS, removeWords, YOURVECTOROFWORDSTOREMOVE)
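A worked base R sketch of the gsub approach is shown here with in-memory data so that it runs as is. The sample sentence and the names stop_words, doc and cleaned are assumptions for illustration; with the real files you would presumably fill them with readLines() on the word list and the document.

```r
# In practice (file names taken from the question, untested assumption):
# stop_words <- readLines("bash.txt"); doc <- readLines("sk.text")
stop_words <- c("with", "within", "without", "work",
                "worked", "working", "works", "would")
doc <- "She worked without rest and would not stop working within the hour"

# One alternation wrapped in word boundaries: \b(with|within|...)\b
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Remove the words, then collapse the leftover runs of spaces.
cleaned <- gsub(pattern, "", doc, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

If the word list may contain regex metacharacters, they would need escaping before being pasted into the pattern.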
