Remove unwanted text from string - r

I have a string "yada yada.useful text here. googletag.cmd.push(function() { googletag.display('div-gpt-ad-447281037690072557-2'); });useful text here. yada yada". I want to remove the string "googletag.cmd.push(function() { googletag.display('div-gpt-ad-447281037690072557-2'); });" but I can't.
I tried selecting the unwanted string using "^(google)});" and "^google});", to no avail. Even "^google" or "^google*" does not match anything, but "google" works fine. I used the gsub and str_remove functions, but my pattern doesn't work.
How do I remove the unwanted string? I read up on regex, and adding ^ to a pattern stops my code from working. What did I miss?

This should do it.
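library(stringr)
x <- "yada yada.useful text here. googletag.cmd.push(function() { googletag.display('div-gpt-ad-447281037690072557-2'); });useful text here. yada yada"
x %>% str_remove("googletag.*\\}\\);")
Explanation
The regex looks for "googletag" (where your unwanted string starts)
.* means any number of characters (greedy, so it runs up to the last possible match of what follows)
\\}\\); until we find }); (the trailing ; is part of the pattern so that it is not left behind)
The double backslashes are needed because the pattern is written as an R string literal, where the backslash itself must be escaped. As a bare regex the same pattern would use single backslashes: googletag.*\}\);
As for what you missed: ^ anchors the match to the start of the string. Your string starts with "yada", so "^google" can never match, while plain "google" matches anywhere in the string.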
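To see the anchoring behaviour for yourself, here is a minimal base-R sketch (no stringr required; the variable name cleaned is only illustrative). It also uses a lazy .*? so that the match stops at the first }); after "googletag" instead of the last one, which is an assumption about what you want if the snippet occurs more than once.

```r
x <- "yada yada.useful text here. googletag.cmd.push(function() { googletag.display('div-gpt-ad-447281037690072557-2'); });useful text here. yada yada"

# "^" anchors the match at the start of the string, which begins with "yada"
grepl("^google", x)   # FALSE
grepl("google", x)    # TRUE

# Lazy ".*?" stops at the first "});" after "googletag"
# (perl = TRUE so the lazy quantifier is handled by PCRE)
cleaned <- sub("googletag.*?\\}\\);", "", x, perl = TRUE)
print(cleaned)
```

Use gsub instead of sub if the unwanted snippet can appear several times in the same string.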

Related

In R, how do you remove links from a dataframe or a string?

I'd like to delete all of the links in the dataframe's "messages" column. How can I get rid of links like these?
I also want to remove any words in the text that begin with http: or https.
x<-c("Deneme http://www.example.com/x2ABf merhaba Osman https://www.example.edu OZhann www.example.org/xlsEr45?a karalama")
x<-gsub('?(f|ht)(tp)(s?)(://)(((\\w+\\S+|\\W+\\S+)))(*((/)\\w+))',"",x,ignore.case=T)
x
[1] "Deneme merhaba Osman https://www.example.edu OZhann www.example.org/xlser45a karalama"
Expected output:
Deneme merhaba Osman OZhann karalama
The solution is actually quite simple:
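gsub("(http|www)\\S+", "", x)
[1] "Deneme  merhaba Osman  OZhann  karalama"
It builds on \\S (upper-case), a shorthand character class that matches any non-whitespace character, i.e. anything not matched by \\s (lower-case), which covers spaces, tabs, and newlines. The pattern therefore matches from http or www up to the next whitespace character. The spaces on either side of each removed link stay in place, so the result contains doubled spaces.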
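Removing a link leaves the spaces on both sides of it behind. A minimal sketch of one way to get to the expected output, assuming it is acceptable to collapse all runs of whitespace to a single space (the names no_links and tidy are just for illustration):

```r
x <- c("Deneme http://www.example.com/x2ABf merhaba Osman https://www.example.edu OZhann www.example.org/xlsEr45?a karalama")

# Remove every run of non-whitespace characters that starts with "http" or "www"
no_links <- gsub("(http|www)\\S+", "", x)

# The removed links leave doubled spaces behind; squeeze and trim them
tidy <- trimws(gsub("\\s+", " ", no_links))
print(tidy)
```

Note that the pattern is not anchored to a word boundary, so it would also cut into a word that merely contains "http" or "www".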
This would be a solution:
library(tidyverse)
tibble::tibble(URL = c("https:\\example.com",
                       "http:\\www.example.com",
                       "example.net",
                       "example.com",
                       "no_url_1")) %>%
  dplyr::filter(!stringr::str_detect(URL, "\\.net|http|\\.com"))
Surprisingly, all of the suggestions were successful. With your assistance, I created the code below. Thank you a lot.
x<-gsub('?(f|ht)(tp)(s?)(://)(((\\w+\\S+|\\W+\\S+)))((*((/)\\w+))|*(\\w+))',"",x,ignore.case=T)
x<-sub('.www.*\\.\\w+\\S+','', x)
x<-sub('.ttp.*\\.\\w+\\S+','', x)
Note: The order of the code is important.

combining strings to one string in r

I'm trying to combine some strings into one. In the end, this string should be generated:
//*[#id="coll276"]
The inner part of the string comes from a variable: tag <- 'coll276'
I already used the paste() function like this:
paste('//*[#id="',tag,'"]', sep = "")
But my result looks like this: //*[#id=\"coll276\"]
I don't know why R is putting \ into my string. How can I fix this?
Thanks a lot!
tldr: Don't worry about them, they're not really there. It's just something added by print
Those \ are escape characters that tell R to ignore the special properties of the characters that follow them. Look at the output of your paste function:
paste('//*[#id="',tag,'"]', sep = "")
[1] "//*[#id=\"coll276\"]"
You'll see that the output, since it is a string, is enclosed in double quotes "". Without escaping, the double quotes inside your string would appear to break it up into two strings with bare code in the middle:
"//*[#id=" coll276 "]"
To prevent this, R "escapes" the quotes inside the string when it prints it. This is just a visual effect. If you write your string to a file, you'll see that those escaping \ aren't actually there:
write(paste('//*[#id="',tag,'"]', sep = ""), 'out.txt')
This is what is in the file:
//*[#id="coll276"]
You can use cat to print the exact value of the string to the console (Thanks @LukeC):
cat(paste('//*[#id="',tag,'"]', sep = ""))
//*[#id="coll276"]
Or use single quotes (if possible):
paste('//*[#id=\'',tag,'\']', sep = "")
[1] "//*[#id='coll276']"
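If you want to convince yourself that the backslashes are only part of the printed representation, a small sketch like the following may help. It uses paste0, which is base R shorthand for paste(..., sep = ""); the names n_chars and has_backslash are illustrative.

```r
tag <- "coll276"
s <- paste0('//*[#id="', tag, '"]')

print(s)       # the console shows the inner quotes escaped with \
cat(s, "\n")   # cat writes the raw characters, without the \

# The string is 18 characters long, so no backslashes are stored in it
n_chars <- nchar(s)

# Searching for a literal backslash finds nothing
has_backslash <- grepl("\\", s, fixed = TRUE)
```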

Removing punctuations from text using R

I need to remove punctuation from some text. I am using the tm package, but there is a catch.
For example, the text is something like this:
data <- "I am a, new comer","to r,"please help","me:out","here"
now when I run
library(tm)
data<-removePunctuation(data)
in my code, the result is :
I am a new comerto rplease helpmeouthere
but what I expect is:
I am a new comer to r please help me out here
Here's how I take your question, and an answer that is very close to @David Arenburg's in the comment above.
data <- '"I am a, new comer","to r,"please help","me:out","here"'
gsub('[[:punct:] ]+',' ',data)
[1] " I am a new comer to r please help me out here "
The extra space after [:punct:], inside the bracket expression, makes spaces part of the matched set, and the + matches one or more consecutive characters from that set. Each run of punctuation and spaces is therefore replaced by a single space. This has the side effect, desirable in some cases, of shortening any sequence of spaces to a single space.
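The result above still starts and ends with a space, because the string begins and ends with a quote character. A minimal sketch of trimming those, assuming the leading and trailing spaces are unwanted (spaced and clean are illustrative names):

```r
data <- '"I am a, new comer","to r,"please help","me:out","here"'

# Replace each run of punctuation and/or spaces with a single space
spaced <- gsub("[[:punct:] ]+", " ", data)

# Drop the space left at each end by the outer quote characters
clean <- trimws(spaced)
print(clean)
```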
If you had something like
string <- "hello,you"
> string
[1] "hello,you"
You could do this:
> gsub(",", "", string)
[1] "helloyou"
gsub returns a copy of string in which every "," is replaced with "". The variable itself is unchanged unless you assign the result back to it.

What is the regular expression for "No quotes in a string"?

I am trying to write a regular expression that doesn't allow single or double quotes in a string (it could be a single-line or a multiline string). Based on my last question, I wrote ^(?:(?!"|').)*$, but it is not working. I'd really appreciate any help.
Just use a character class that excludes quotes:
^[^'"]*$
(Within the [] character class specifier, the ^ prefix inverts the specification, so [^'"] means any character that isn't a ' or ".)
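The snippets in this thread are C#, but the pattern itself is not tied to a language. As a rough check, here is the same character-class pattern in R (the helper name no_quotes is hypothetical). Because a negated class also matches newlines, multiline strings are covered without any extra option.

```r
# TRUE when the string contains neither a single nor a double quote
no_quotes <- function(s) grepl("^[^'\"]*$", s, perl = TRUE)

no_quotes("My string without quotes")   # TRUE
no_quotes("foo'baa")                    # FALSE
no_quotes("line one\nline two")         # TRUE
no_quotes("say \"hi\"")                 # FALSE
```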
Just use a regex that matches for quotes, and then negate the match result:
var regex = new Regex("\"|'");
bool noQuotes = !regex.IsMatch("My string without quotes");
Try this:
string myStr = "foo'baa";
bool HasQuotes = myStr.Contains("'") || myStr.Contains("\""); // probably the faster option
bool HasQuotes2 = Regex.IsMatch(myStr, "['\"]");
if (!HasQuotes)
{
    // the string contains no quotes
}
The regular expression below allows alphanumerics and a fixed set of special characters, but no quotes (' or "):
@"^[a-zA-Z-0-9~+:;,/#&_#*%$!()\[\] ]*$"
You can use it like this:
[RegularExpression(@"^[a-zA-Z-0-9~+:;,/#&_#*%$!()\[\] ]*$", ErrorMessage = "Should not allow quotes")]
Note that [ and ] must be escaped with a backslash (\[ and \]) inside the character class.

Regular expression to convert substring to link

I need a regular expression to convert a string to a link. I wrote something, but it doesn't work in ASP.NET. I couldn't solve it, and I am new to regular expressions. This function should convert (bkz: string) to (bkz: show.aspx?td=string)
Dim pattern As String = "<bkz[a-z0-9$-$&-&.-.ö-öı-ış-şç-çğ-ğü-ü\s]+)>"
Dim regex As New Regex(pattern, RegexOptions.IgnoreCase)
str = regex.Replace(str, "<font color=""#CC0000"">$1</font>")
Generic remarks on your code: besides the missing opening parenthesis, you do redundant things: $-$ isn't incorrect but can be simplified to just $. The same goes for the accented characters.
Everybody will tell you that the font tag is deprecated even in plain HTML: prefer span with a style attribute.
And from your question and the example in the reply, I think the expression could be something like:
\(bkz: ([a-z0-9$&.öışçğü\s]+)\)
the replace string would look like:
(bkz: <span style=""color: #C00"">$1</span>)
BUT the first $1 must be actually URL encoded.
Your regexp is in trouble because of a ')' without '('
Would:
<bkz:\s+((?:.(?!>))+?.)>
work better ?
The first group would capture what you are after.
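The capture-group idea is the same in any regex engine. As a rough sketch in R rather than VB.NET, the following turns (bkz: word) into a link by reusing the first group in the replacement (written \\1 in R, $1 in .NET). The anchor markup and the input text are assumptions based on the question's description, the accented letters are left out of the class to keep the example ASCII-only, and URL encoding of the captured text is not handled.

```r
text <- "see (bkz: here) for details"

# Group 1 captures the term between "(bkz: " and ")"
pattern <- "\\(bkz: ([a-z0-9$&.\\s]+)\\)"

# \\1 inserts the captured term into both the URL and the link text
linked <- gsub(pattern, '(bkz: <a href="show.aspx?td=\\1">\\1</a>)', text,
               ignore.case = TRUE, perl = TRUE)
print(linked)
```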
Thanks Vonc. Now it doesn't raise an error, but when I assign str to Label.Text, I can't see the link either. For example, after I bind str to my label, it should look like this in view-source:
<span id="Label1">(bkz: here)</span>
But now this is what appears in view-source:
<span id="Label1">(bkz: here)</span>
