How to remove characters between space and specific character in R

I have a question similar to this one but instead of having two specific characters to look between, I want to get the text between a space and a specific character. In my example, I have this string:
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
but if I were to do something like this: str_remove_all(myString, " .*jpg") I end up with
[1] "This"
I know what's happening: R finds the first space in the string and removes everything from there through the last "jpg". What I want is for the match to start at the space immediately before each ".jpg" file name. The final result I am hoping for looks like this:
[1] "This is my string I scraped from the web. I want to remove all instances of a picture. The text continues here."
NOTE: I know that a solution may arise which does what I want, but ends up putting two periods next to each other. I do not mind a solution like that because later in my analysis I am removing punctuation.

You can use
str_remove_all(myString, "\\S*\\.jpg")
Or, if you also want to remove optional whitespace before the "word":
str_remove_all(myString, "\\s*\\S*\\.jpg")
Details:
\s* - zero or more whitespaces
\S* - zero or more non-whitespaces
\.jpg - .jpg substring.
To make it case insensitive, add (?i) at the start of the pattern: "(?i)\\s*\\S*\\.jpg".
If you need to make sure there is no word char after jpg, add a word boundary: "(?i)\\s*\\S*\\.jpg\\b"
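As a quick check, the last pattern can be run against the string from the question. This is a minimal sketch that uses base R's gsub() with perl = TRUE in place of stringr::str_remove_all(), an assumption made only so that it runs without extra packages; the variable name cleaned is illustrative.

```r
# String from the question
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"

# Base-R counterpart of str_remove_all(): replace every match with "".
# (?i) makes the match case-insensitive, \\b requires a word boundary after "jpg".
cleaned <- gsub("(?i)\\s*\\S*\\.jpg\\b", "", myString, perl = TRUE)

print(cleaned)
# [1] "This is my string I scraped from the web. I want to remove all instances of a picture.. The text continues here."
```

The doubled period after "picture" is the side effect the question already anticipates: the period that followed the first file name stays in place.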

Related

Remove first occurrence of special characters until the first word or word character in R using regex

For my project I am looking into removing parts of text based on the pattern of special characters. I have a long .txt file that has the below structure:
mycharobj=c("---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again.")
String continues following the above pattern.
My target is to remove the parts that start with a run of - characters and end with - [number], such as:
"-----------------------More text is here - [3548]"
"-----------More text is here - [408]"
I am planning to use the code below to remove these parts (it will be looped in the future):
library(stringr)
library(qdapRegex)
temp=unlist(regmatches(mycharobj, gregexpr("[[:digit:]]+", mycharobj)))
mycharobj=rm_between(mycharobj, "-", paste(temp[1],"]", sep=""))
but for this to work, I need a regular expression that will remove the first occurrence of "-----------" in the text up to the first word or word character. If a string starts with text (a word or word characters), the expression needs to ignore that text and find the first occurrence of "-----------" for my potential loop to work.
I was wondering if this can be done with regular expressions? Any help is appreciated. I have a very computationally demanding solution for this; split the string based on the special character "-" and then identify the parts of the text that I need through a set of conditionals. But due to the fact that it takes a lot more of the processing time, this solution is not very scalable for processing a large number of such .txt files.
You can use
gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl=TRUE)
Details:
-{9,} - nine or more - chars
(?:(?!-{9}).)*? - any one char, other than a line break char, zero or more but as few as possible occurrences, that does not start a nine hyphen char sequence
- \[ - a - [ string
\d+ - one or more digits
] - a ] char.
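The behaviour described above can be checked with a small sketch. The input is rebuilt with explicit hyphen counts (9, 9, 5, 11, 9), which is an assumption read off the sample in the question; dashes(), result and expected are illustrative names.

```r
# Rebuild a sample like the question's with explicit hyphen counts
# (the counts are an assumption based on the question's example)
dashes <- function(n) strrep("-", n)
mycharobj <- paste0(
  dashes(9),  "Some text is here.",
  dashes(9),  "More text is here - [3548]",
  dashes(5),  " Even more text is here.",
  dashes(11), "More text is here - [408]",
  dashes(9),  " Even more text is here again."
)

# perl = TRUE is required because the pattern uses a (?!...) lookahead
result <- gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl = TRUE)

# Only the two "More text is here - [number]" parts should be gone
expected <- paste0(
  dashes(9), "Some text is here.",
  dashes(5), " Even more text is here.",
  dashes(9), " Even more text is here again."
)
print(identical(result, expected))
# [1] TRUE
```

The leading "---------Some text is here." survives because the lookahead stops the match from running across the next nine-hyphen run, so no "- [number]" ending can be reached from there.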

Replace multiple spaces in string, but leave singles spaces be

I am reading a PDF file using R. I would like to transform the given text in such a way that whenever multiple spaces are detected, I replace them with some value (for example "_"). I've come across questions where all runs of one or more spaces can be replaced using "\\s+" (Merge Multiple spaces to single space; remove trailing/leading spaces), but this will not work for me. I have a string that looks something like this:
"[1]This is the first address      This is the second one
[2]This is the third one      
[3]This is the fourth one      This is the fifth"
When I apply the answers I found; replacing all spaces of 1 or more with a single space, I will not be able to recognise separate addresses anymore, because it would look like this;
gsub("\\s+", " ", str_trim(PDF))
"[1]This is the first address This is the second one
[2]This is the third one
[3]This is the fourth one This is the fifth"
So what I am looking for is something like this
"[1]This is the first address_This is the second one
[2]This is the third one_
[3]This is the fourth one_This is the fifth"
However if I rewrite the code used in the example, I get the following
gsub("\\s+", "_", str_trim(PDF))
"[1]This_is_the_first_address_This_is_the_second_one
[2]This_is_the_third_one_
[3]This_is_the_fourth_one_This_is_the_fifth"
Would anyone know a workaround for this? Any help will be greatly appreciated.
Whenever I come across string and regex problems I like to refer to the stringr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
On the second page you can see a section titled "Quantifiers", which tells us how to solve this:
library(tidyverse)
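s <- "This is the first address      This is the second one"
str_replace_all(s, "\\s{2,}", "_")
(I am loading the complete tidyverse instead of just stringr here due to force of habit).
Any run of 2 or more whitespace characters will now be replaced with _, while single spaces are left alone. Use str_replace_all() rather than str_replace(), because str_replace() only replaces the first match.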
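For the multi-line text in the question, one detail matters: \s also matches line breaks, so "\\s{2,}" would swallow a run of trailing spaces together with the newline that follows it. A minimal base-R sketch that sidesteps this by matching literal spaces only is shown here; the number of spaces in the sample and the names pdf_text and marked are assumptions for illustration.

```r
# Sample with runs of spaces between (and after) addresses; the exact number
# of spaces is an assumption, any run of two or more behaves the same way
pdf_text <- "[1]This is the first address    This is the second one\n[2]This is the third one    \n[3]This is the fourth one    This is the fifth"

# " {2,}" matches two or more literal spaces, so single spaces and "\n" survive
marked <- gsub(" {2,}", "_", pdf_text)

cat(marked, sep = "\n")
# [1]This is the first address_This is the second one
# [2]This is the third one_
# [3]This is the fourth one_This is the fifth
```

If the text may also contain tabs, "[ \\t]{2,}" is a possible variant that still leaves line breaks untouched.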

Split string in R, better to use regular expressions

I'd like to split a text string in R,
for example: "cell_70001.ERP123.138_D11_62.5Y_45880"
But in the end I want ERP123.138_D11_62.5Y_45880.
That is to say, cut the string at the first punctuation character and keep the part after it.
I really don’t understand regular expressions, and I’m quite anxious about this. I hope someone can help me. Thank you.
Since your aim is to cut at the first punctuation character, and considering that _ counts as a word character, we could do:
sub(".*?\\W","", "cell_70001.ERP123.138_D11_62.5Y_45880")
[1] "ERP123.138_D11_62.5Y_45880"
That is to say, delete everything up to and including the first non-word character.
You could also do it as:
sub("\\w+\\W","", "cell_70001.ERP123.138_D11_62.5Y_45880")
[1] "ERP123.138_D11_62.5Y_45880"
This deletes the leading run of word characters together with the first non-word character that follows it.
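A small sketch comparing the two calls on the sample string; the variable names x, a and b are illustrative.

```r
x <- "cell_70001.ERP123.138_D11_62.5Y_45880"

# Lazy version: drop the shortest prefix that ends in a non-word character
a <- sub(".*?\\W", "", x)

# Greedy version: drop the leading run of word characters plus one non-word character
b <- sub("\\w+\\W", "", x)

print(a)
# [1] "ERP123.138_D11_62.5Y_45880"
print(identical(a, b))
# [1] TRUE
```

The two patterns agree here because the string starts with word characters. They can differ when the input begins with punctuation, since "\\w+\\W" needs at least one word character before the non-word character it removes.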

How to prevent code from detecting and pulling patterns within words (Example: I want 'one' detected but not 'one' in the word al'one')?

I have this code that is meant to add highlights to some numbers in a text stored in "lines"
stringr::str_replace_all(lines, nums, function(x) {paste0("<<", x, ">>")})
where nums is the following pattern being detected
nums <- '(Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+\\s?(Hundred|Thousand|Million|Billion|Trillion)?'
The problem I'm having is that the line of code above also leads to numbers embedded in words also being detected. In the following text this happens:
Get <<ten>> eggs. That is what is writ<<ten>>. I am <<one>> and al<<one>>.
when it should be:
Get <<ten>> eggs. That is what is written. I am <<one>> and alone.
I don't want to remove the question mark after the \s because I want to detect both numbers like "One" followed by no space and "One Hundred" which has a space in between.
Does anyone know how to do this?
Surround (Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+ with \b (written as \\b inside an R string).
\b matches word boundaries, so this expression will never match inside a word.
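One way this could look in practice is sketched below with base R's gsub() instead of stringr. Several things here are assumptions rather than part of the question: "Ten" is added to the word list so that the "ten"/"written" example applies, matching is made case-insensitive, the optional space is moved inside the optional magnitude group so that no trailing space is highlighted, and the names units, mags and highlighted are illustrative.

```r
# Number words and magnitudes from the question; "Ten" is added here only so
# the sketch covers the question's "ten"/"written" example (an assumption)
units <- "Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine|Ten"
mags  <- "Hundred|Thousand|Million|Billion|Trillion"

# \\b on both sides keeps the match from starting or ending inside a word
nums <- paste0("\\b((?:", units, ")+(?:\\s?(?:", mags, "))?)\\b")

lines <- "Get ten eggs. That is what is written. I am one and alone. One Hundred more."

# \\1 in the replacement refers to the whole highlighted number
highlighted <- gsub(nums, "<<\\1>>", lines, ignore.case = TRUE, perl = TRUE)
print(highlighted)
# [1] "Get <<ten>> eggs. That is what is written. I am <<one>> and alone. <<One Hundred>> more."
```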

Regular Expression to remove contents in string

I have a string as below:
4s: and in this <em>new</em>, 5s: <em>year</em> everybody try to make our planet clean and polution free.
Desired result:
4s: and in this <em>new</em>, <em>year</em> everybody try to make our planet clean and polution free.
What I want is this: if the string has two <em> tags, and the gap between these two <em> tags is just one word, and that word has the format ns: (where n is a number zero to four digits long), then I want to remove ns: from the string, while keeping any punctuation marks ('?', '.', ',') between the two <em> tags as they are.
I would also like to note that the input string may or may not have punctuation marks between these two <em> tags.
My regular expression as below
Regex.Replace(txtHighlight, @"</em>.(\s*)(\d*)s:(\s*).<em", "</em> <em");
Hope it is clear to my requirement.
How can I do this using regular expressions?
Not really sure what you need, but how about:
Regex.Replace(txtHighlight, @"</em>(.)\s*\d+s:\s*(.)<em", "</em>$1$2<em");
If you just want to take out the 4s 5s bit you could do something like this:
Regex.Replace(txtHighlight, @"\s\d+s:", "");
This will match a whitespace character followed by one or more digits, the letter s, and a colon.
If that's not what you're after, my apologies. I hope it might help :)
