How to remove characters between space and specific character in R


I have a question similar to this one but instead of having two specific characters to look between, I want to get the text between a space and a specific character. In my example, I have this string:
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
but if I do something like this: str_remove_all(myString, " .*jpg") I end up with
[1] "This"
I know what's happening: the regex starts at the first space in the string and removes everything from there through ".jpg", but I want it to start at the space immediately before each ".jpg". The final result I hope for looks like this:
[1] "This is my string I scraped from the web. I want to remove all instances of a picture. the text continues here."
NOTE: I know that a solution may arise which does what I want, but ends up putting two periods next to each other. I do not mind a solution like that because later in my analysis I am removing punctuation.

You can use
str_remove_all(myString, "\\S*\\.jpg")
Or, if you also want to remove optional whitespace before the "word":
str_remove_all(myString, "\\s*\\S*\\.jpg")
Details:
\s* - zero or more whitespaces
\S* - zero or more non-whitespaces
\.jpg - .jpg substring.
To make it case insensitive, add (?i) at the start of the pattern: "(?i)\\s*\\S*\\.jpg".
If you need to make sure there is no word char after jpg, add a word boundary: "(?i)\\s*\\S*\\.jpg\\b"
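As a quick check of the patterns above, here is a minimal sketch. It uses base R's gsub() with perl = TRUE as a stand-in for str_remove_all(), and the sample string x is made up for illustration (it is not the original scraped text).

```r
# Hypothetical sample text, shortened for illustration.
x <- "See a picture. picture-file.jpg. More text. picture-file2.JPG"

# Remove only the file name: a run of non-whitespace ending in ".jpg".
# This pattern is case sensitive, so ".JPG" is left alone.
no_name <- gsub("\\S*\\.jpg", "", x, perl = TRUE)

# Also remove the whitespace before the name, ignore case with (?i),
# and require a word boundary after "jpg".
cleaned <- gsub("(?i)\\s*\\S*\\.jpg\\b", "", x, perl = TRUE)

print(no_name)
print(cleaned)
```

The second call leaves two periods next to each other, which the question says is acceptable because punctuation is removed later.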

Related

Remove first occurrence of special characters until the first word or word character in R using regex

For my project I am looking into removing parts of text based on the pattern of special characters. I have a long .txt file that has the below structure:
mycharobj=c("---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again.")
String continues following the above pattern.
My target is to remove parts that start with - and end with - [number], such as:
"-----------------------More text is here - [3548]"
"-----------More text is here - [408]"
I am planning to use the code below to remove these parts (it will be put in a loop later):
library(stringr)
library(qdapRegex)
temp=unlist(regmatches(mycharobj, gregexpr("[[:digit:]]+", mycharobj)))
mycharobj=rm_between(mycharobj, "-", paste(temp[1],"]", sep=""))
but for this to work, I need a regular expression that will remove the first occurrence of "-----------" in the text up to the first word or word character. If a string starts with text (a word or word characters), the expression needs to skip it and find the first occurrence of "-----------" for my potential loop to work.
Can this be done with regular expressions? Any help is appreciated. I have a very computationally demanding solution for this: split the string on the special character "-" and then identify the parts of the text that I need through a set of conditionals. But because it takes much more processing time, this solution does not scale to a large number of such .txt files.

You can use
gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl=TRUE)
See the regex demo.
Details:
-{9,} - nine or more - chars
(?:(?!-{9}).)*? - any one char, other than a line break char, zero or more but as few as possible occurrences, that does not start a nine hyphen char sequence
- \[ - a - [ string
\d+ - one or more digits
] - a ] char.
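To see the tempered pattern above in action, here is a small sketch. The input x is a made-up, shortened version of the question's string, and the hyphen runs are built with strrep() so their lengths are explicit.

```r
# Hypothetical input: hyphen runs of length 9, 9, 5, 11 and 9.
dash9 <- strrep("-", 9)
x <- paste0(
  dash9, "Intro.",
  dash9, "Header - [12]",
  "----- Body.",
  strrep("-", 11), "Header - [3]",
  dash9, " End."
)

# Same pattern as in the answer: a run of 9+ hyphens, then as few chars
# as possible (never starting a new 9-hyphen run), then "- [digits]".
result <- gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", x, perl = TRUE)

# Only the two "Header - [n]" parts should be gone.
expected <- paste0(dash9, "Intro.", "----- Body.", dash9, " End.")
print(result)
print(identical(result, expected))
```

The first hyphen run is kept because the text after it reaches another nine-hyphen run before any "- [number]" is found, which is exactly what the negative lookahead (?!-{9}) guards against.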

Replace multiple spaces in string, but leave singles spaces be

I am reading a PDF file using R. I would like to transform the text so that whenever multiple consecutive spaces are detected, they are replaced by some value (for example "_"). I've come across questions where every run of one or more spaces is replaced using "\\s+" (Merge Multiple spaces to single space; remove trailing/leading spaces), but this will not work for me. I have a string that looks something like this:
"[1]This is the first address     This is the second one
[2]This is the third one     
[3]This is the fourth one     This is the fifth"
When I apply the answers I found; replacing all spaces of 1 or more with a single space, I will not be able to recognise separate addresses anymore, because it would look like this;
gsub("\\s+", " ", str_trim(PDF))
"[1]This is the first address This is the second one
[2]This is the third one
[3]This is the fourth one This is the fifth"
So what I am looking for is something like this
"[1]This is the first address_This is the second one
[2]This is the third one_
[3]This is the fourth one_This is the fifth"
However if I rewrite the code used in the example, I get the following
gsub("\\s+", "_", str_trim(PDF))
"[1]This_is_the_first_address_This_is_the_second_one
[2]This_is_the_third_one_
[3]This_is_the_fourth_one_This_is_the_fifth"
Would anyone know a workaround for this? Any help will be greatly appreciated.

Whenever I come across string and regex problems I like to refer to the stringr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
On the second page you can see a section titled "Quantifiers", which tells us how to solve this:
library(tidyverse)
s <- "This is the first address     This is the second one"
str_replace_all(s, "\\s{2,}", "_")
(I am loading the complete tidyverse instead of just stringr here due to force of habit.)
Any run of 2 or more whitespace characters will now be replaced with _. str_replace_all() is used so that every such run in the string is replaced, not only the first.
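One detail worth checking on multi-line text: \s also matches line breaks, so trailing spaces followed by a newline count as a single run. The sketch below uses base R's gsub() and made-up sample strings (the gap width of four spaces is an assumption) to compare "\\s{2,}" with a pattern that matches literal spaces only.

```r
# Build the gaps with strrep() so the number of spaces is explicit.
gap <- strrep(" ", 4)

# Single line: two addresses separated by a run of spaces.
s <- paste0("This is the first address", gap, "This is the second one")
one_line <- gsub(" {2,}", "_", s)

# Two lines, the first one ending in trailing spaces.
x <- paste0("[2]This is the third one", gap, "\n",
            "[3]This is the fourth one", gap, "This is the fifth")

# Literal spaces only: the line break survives.
spaces_only <- gsub(" {2,}", "_", x)

# Any whitespace: the trailing spaces and the newline are replaced together.
any_space <- gsub("\\s{2,}", "_", x, perl = TRUE)

print(one_line)
cat(spaces_only, "\n")
cat(any_space, "\n")
```

If the line structure of the PDF text matters, the space-only pattern is probably the safer choice.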

Split string in R, better to use regular expressions

I'd like to split a text string in R,
for example: "cell_70001.ERP123.138_D11_62.5Y_45880"
But I want to end up with ERP123.138_D11_62.5Y_45880.
That is to say, cut the string where the first punctuation character occurs and keep the part after it.
I really don't understand regular expressions, and I'm in a hurry. I hope someone can help me. Thank you.

Since your aim is to cut at the first punctuation character, and considering that _ counts as a word character, we could do:
sub(".*?\\W","", "cell_70001.ERP123.138_D11_62.5Y_45880")
[1] "ERP123.138_D11_62.5Y_45880"
That is, delete everything up to and including the first non-word character.
You could also do it as:
sub("\\w+\\W","", "cell_70001.ERP123.138_D11_62.5Y_45880")
[1] "ERP123.138_D11_62.5Y_45880"
This deletes the leading run of word characters and the first non-word character that follows it.
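The lazy quantifier matters here. As a small sketch (perl = TRUE is added only to make the regex flavour explicit, and the ^ anchor in the second call is my addition), compare the two approaches above with a greedy version of the first pattern:

```r
x <- "cell_70001.ERP123.138_D11_62.5Y_45880"

# Lazy: stop at the FIRST non-word character (the "." after "cell_70001").
lazy <- sub(".*?\\W", "", x, perl = TRUE)

# Anchored variant: leading word characters, then one non-word character.
anchored <- sub("^\\w+\\W", "", x, perl = TRUE)

# Greedy: runs on to the LAST non-word character (the "." in "62.5Y").
greedy <- sub(".*\\W", "", x, perl = TRUE)

print(lazy)
print(anchored)
print(greedy)
```

The first two calls give the wanted result, while the greedy one keeps only the text after the final period.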

How to prevent code from detecting and pulling patterns within words (Example: I want 'one' detected but not 'one' in the word al'one')?

I have this code that is meant to add highlights to some numbers in a text stored in "lines"
stringr::str_replace_all(lines, nums, function(x) {paste0("<<", x, ">>")})
where nums is the following pattern being detected
nums <- '(Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+\\s?(Hundred|Thousand|Million|Billion|Trillion)?'
The problem I'm having is that the line of code above also leads to numbers embedded in words also being detected. In the following text this happens:
Get <<ten>> eggs. That is what is writ<<ten>>. I am <<one>> and al<<one>>.
when it should be:
Get <<ten>> eggs. That is what is written. I am <<one>> and alone.
I don't want to remove the question mark after the \s because I want to detect both numbers like "One" followed by no space and "One Hundred" which has a space in between.
Does anyone know how to do this?

Surround (Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+ with \b.
\b matches word boundaries, so this expression will never match inside a word.
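A minimal sketch of that idea, using base R's gsub() with perl = TRUE instead of str_replace_all(). The (?i) flag, the required space before the scale word, and the sample text are assumptions made for illustration, not part of the original pattern.

```r
# Word lists in lower case; (?i) below makes matching case insensitive.
units  <- "zero|one|two|three|four|five|six|seven|eight|nine"
scales <- "hundred|thousand|million|billion|trillion"

# \\b on both sides of the unit group keeps it from matching inside words.
# The whole match is captured so it can be reused as \\1 in the replacement.
nums <- paste0("(?i)\\b((?:", units, ")+\\b(?:\\s(?:", scales, ")\\b)?)")

# Hypothetical sample text with "one" hidden inside "alone" and "stones".
txt <- "I am one and alone. Two Thousand stones."

highlighted <- gsub(nums, "<<\\1>>", txt, perl = TRUE)
print(highlighted)
```

Making the scale part an optional group that starts with a space means a lone "one" is highlighted without a trailing space, while "Two Thousand" is still highlighted as a unit.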

Regular Expression to remove contents in string

I have a string as below:
4s: and in this <em>new</em>, 5s: <em>year</em> everybody try to make our planet clean and polution free.
Replace string:
4s: and in this <em>new</em>, <em>year</em> everybody try to make our planet clean and polution free.
What I want: if the string has two <em> tags, and the gap between them is just one word of the form ns: (where n is a number 0 to 4 characters long), then I want to remove ns: from the string, while keeping any punctuation marks ('?', '.', ',') between the two <em> tags as they are.
Note also that the input string may or may not have punctuation marks between the two <em> tags.
My regular expression is below:
Regex.Replace(txtHighlight, @"</em>.(\s*)(\d*)s:(\s*).<em", "</em> <em");
I hope my requirement is clear.
How can I do this using regular expressions?

Not really sure what you need, but how about:
Regex.Replace(txtHighlight, @"</em>(.)\s*\d+s:\s*(.)<em", "</em>$1$2<em");

If you just want to take out bits like 5s: you could do something like this:
Regex.Replace(txtHighlight, @"\s\d+s:", "");
This will match a whitespace character followed by one or more digits, then s and a colon.
If that's not what you're after, my apologies. I hope it might help :)
