I've been grappling with a regex for the following string:
"Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
I am unable to remove the entire \x...\x pattern from the above string.
I'm also unable to remove the https URL from it.
My regex calls are:
gsub('http.* *', '', twts_array)
gsub("\\x.*\\x..","",twts_array)
My output is:
"Just beautiful let’s see how the next few days go \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... httpstcohUradDaNVX"
My expected output is:
Just beautiful, let’s see how the next few days go. Long term buying opportunities could be around the corner
P.S.: As you can see, neither problem got solved. I also wrote "dot" in place of "." in https://t dot co/hUradDaNVX, as StackOverflow does not allow me to post shortened URLs. Can someone help me tackle this problem?
On Linux you can do the following:
twts_array <- "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
twts_array_str <- enc2utf8(twts_array)
twts_array_str <- gsub('<..>', '', twts_array_str)
twts_array_str <- gsub('http.*', '', twts_array_str)
twts_array_str
# "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner ... "
enc2utf8 will convert any unknown Unicode sequences to the <..> format. The first gsub then removes those sequences, and the second gsub removes the URL.
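Note that 'http.*' removes everything from the first "http" to the end of the string, and the result above still contains the "\n\n" that the expected output does not. A minimal sketch of a narrower variant, assuming the URL contains no spaces and using a made-up ASCII-only string with an example.com URL (it does not deal with the \x.. bytes at all):

```r
# Hypothetical, ASCII-only stand-in for the tweet; the URL is invented.
tweet <- "Just beautiful. \n\nLong term buying opportunities ... https://example.com/hUradDaNVX"

# Remove only the URL itself: "http" or "https", "://", then non-space characters.
no_url <- gsub("https?://\\S+", "", tweet)

# Collapse runs of whitespace (including the newlines) into single spaces
# and drop leading/trailing whitespace.
clean <- trimws(gsub("\\s+", " ", no_url))
clean
# [1] "Just beautiful. Long term buying opportunities ..."
```

Any text that follows the URL survives this version, which matters when the link is not the last thing in the tweet.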
I'm trying to grep strings that end in a dash in R, but having trouble. I've worked out how to grep strings ending in any punctuation mark, maybe not the best way but this worked:
grep("\\#[[:print:]]+[[:punct:]]$",c)
I can't for the life of me work out how to grep strings that end specifically in a dash,
for example these strings:
- # (piano) - not this.
- # hello hello - not this either.
I'd like to sub all the stuff between the dashes (and including the dashes) with nothing "" and leave the text to the right of the second dash, which ends in a full stop. So, I would like the output to be (for example, based on the example above):
not this.
and
not this either.
Any help would be appreciated.
Thank you!
Maro
UPDATE:
Hi again everyone,
I'm just updating my original question again:
So what I had in my original data was these three examples (I tried to simplify in my original post above, but I think it might be helpful for you all to see what I was actually dealing with):
1. - # (Piano) - no, and neither can you.
2. - # (Piano) - uh-huh.
3. - # Many dreams ago - Try it again.
(numbers 1-3 are for the purposes of making things clearer, they are not part of the strings)
I was trying to find a way to delete all the stuff between and including the two dashes, and leave all the stuff after the second dash, so I wanted my output to be:
no, and neither can you.
uh-huh.
Try it again.
I ended up using this:
gsub("-[[:blank:]]#[[:blank:]]\\(?[A-Z][a-z]*\\)?[[:blank:]]-", "", c)
which helped me get 1. and 2. in one go. But this didn't help with 3 - I thought by including the question mark after the open and close bracket (which I thought meant 'optional') this would help me get all three targets, but for some reason it didn't. To then get 3, I just ended up targeting that specific string i.e. - # Many dreams ago -, by using:
gsub(("- # Many dreams ago -"), "", c)
I'm new to this, so not the best solution I'm sure.
In my original post (this has been edited a couple of times) I included square brackets around the three strings, which explains some of the answers I originally received from members of the community. Apologies for the confusion!
Thanks everyone - if there's anything that doesn't make sense, please let me know, and I'll try to clarify.
Maro
If you want to stay in between the square brackets you can start the match at #, then use a negated character class [^][]* matching optional chars other than an opening or closing square bracket, and match the last -
Replace the match with an empty string.
c <- "[- # (piano) - not this.]"
sub("#[^][]*-", "", c)
Output
[1] "[- not this.]"
For a more specific match of that string format, you can match the whole line including the square brackets, the # and the string ending on a full stop, and capture what you want to keep.
In the replacement use the capture group value.
c <- c("[- # (piano) - not this.]", "[- # hello hello - not this either.]")
sub("\\[[^][#]*#[^][]*-\\s*([^][]*\\.)]", "\\1", c)
Output
[1] "not this." "not this either."
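For the unbracketed strings in the question's update, the same idea can be sketched as follows. The pattern in the update misses the third string because [A-Z][a-z]* matches only a single capitalized word, while "Many dreams ago" contains spaces. This sketch assumes there is no dash between the # and the second dash; the vector name x is just for illustration:

```r
x <- c("- # (Piano) - no, and neither can you.",
       "- # (Piano) - uh-huh.",
       "- # Many dreams ago - Try it again.")

# ^-\\s*#   the leading dash, optional whitespace and the #
# [^-]*     anything that is not a dash (the part to drop)
# -\\s*     the second dash and the whitespace after it
cleaned <- sub("^-\\s*#[^-]*-\\s*", "", x)
cleaned
# [1] "no, and neither can you." "uh-huh."                  "Try it again."
```

Using sub rather than gsub, together with the ^ anchor, keeps later dashes such as the one in "uh-huh." untouched.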
Alright so I have minimal experience with RStudio, I've been googling this for hours now and I'm fed up-- I don't care about the pride of figuring it out on my own anymore, I just want it done. I want to do some stuff with Canterbury Tales-- the Middle English version on Gutenberg.
Downloaded the plaintext, trimmed out the meta data, etc but it's chock-full of "helpful" footnotes and I can't figure out how to cut them out. EX:
"And shortly, whan the sonne was to reste,
So hadde I spoken with hem everichon,
That I was of hir felawshipe anon,
And made forward erly for to ryse,
To take our wey, ther as I yow devyse.
19. Hn. Bifel; E. Bifil. 23. E. were; _rest_ was. 24. E. Hn.
compaignye. 26, 32. E. felaweshipe. Hl. pilgryms; E. pilgrimes.
34. E. oure
But natheles, whyl I have tyme and space,..."
I at least have the vague notion that this is a grep/regex puzzle. Looking at the text in TextEdit, each bundle of footnotes is indented by 4 spaces, and the next verse starts with a capitalized word indented by (edit: 4 spaces as well).
So I tried downloading the package qdap and using the rm_between function to specify removal of text between four spaces and a number; and two spaces and a capital letter (" [0-9]"," "[A-Z]") to no avail.
I mean, this isn't nearly as simple as "make the text lowercase and remove all the numbers dur-hur" which all the tutorials are so helpfully offering. But I'm assuming this is a rather common thing that people have to do when dealing with big texts. Can anyone help me? Or do I have to go into textedit and just manually delete all the footnotes?
EDIT: I restarted the workspace today and all I have is a scan of the file, each line stored in a character vector, with the Gutenberg metadata trimmed out:
text <- scan("thefilepath.txt", what = "character", sep = "\n")
start <- which(text == "GROUP A. THE PROLOGUE.")
end <- which(text == "God bringe us to the Ioye . that ever schal be!")
cant.lines.v <- text[start:end]
And that's it so far. Eventually I will
cant.v<- paste(cant.lines.v, collapse=" ")
And then strsplit and unlist into a vector of individual words-- but I'm assuming, to get rid of the footnotes, I need to gsub and replace with blank space, and that will be easier with each separate line? I just don't know how to encode the pattern I need to cut. I believe it is 4 spaces followed by a number, then continuing on until you get to 4 spaces followed by a capitalized word and a second word w/o numbers and special characters and punctuation.
I hope that I'm providing enough information, I'm not well-versed in this but I am looking to become so...thanks in advance.
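A minimal sketch of the line-by-line idea described above, under stated assumptions: a footnote block starts at a line whose first non-space text is a number followed by a full stop, and it ends at the next line that opens with a capitalized word followed by a space and another letter. The function name drop_footnotes is made up for illustration, and the heuristic would need checking against the real file (for instance, a verse line such as "But, sires" right after a footnote block would be missed):

```r
drop_footnotes <- function(lines) {
  # A footnote block opens with e.g. "    19. Hn. Bifel; ..."
  starts_note  <- grepl("^\\s*[0-9]+\\.", lines)
  # A verse line opens with a capitalized word, a space and a letter,
  # e.g. "    But natheles, ..."; "E. Hn." does not qualify because of the dot.
  starts_verse <- grepl("^\\s*[A-Z][A-Za-z]* [A-Za-z]", lines)

  in_note <- FALSE
  keep <- logical(length(lines))
  for (i in seq_along(lines)) {
    if (starts_note[i]) {
      in_note <- TRUE
    } else if (in_note && starts_verse[i]) {
      in_note <- FALSE
    }
    keep[i] <- !in_note
  }
  lines[keep]
}

sample_lines <- c(
  "    To take our wey, ther as I yow devyse.",
  "    19. Hn. Bifel; E. Bifil. 23. E. were; _rest_ was. 24. E. Hn.",
  "    compaignye. 26, 32. E. felaweshipe. Hl. pilgryms; E. pilgrimes.",
  "    34. E. oure",
  "    But natheles, whyl I have tyme and space,"
)
kept <- drop_footnotes(sample_lines)
kept
```

On the sample above this keeps only the two verse lines. Working on the vector of lines before paste(..., collapse = " ") is what makes the "starts with" tests possible.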
I have the following sentence
review <- c("1a. How long did it take for you to receive a personalized response to an internet or email inquiry made to THIS dealership?: Approx. It was very prompt however. 2f. Consideration of your time and responsiveness to your requests.: Were a little bit pushy but excellent otherwise 2g. Your satisfaction with the process of coming to an agreement on pricing.: Were willing to try to bring the price to a level that was acceptable to me. Please provide any additional comments regarding your recent sales experience.: Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)! ")
I want to remove everything before :
I tried the following code,
gsub("^[^:]+:","",review)
However, it only removed the first sentence ending with a colon.
Expected results:
Approx. It was very prompt however. Were a little bit pushy but excellent otherwise Were willing to try to bring the price to a level that was acceptable to me. Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)!
Any help or suggestions will be appreciated. Thank you.
If the sentences are not complex and have no abbreviations you may use
gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)
See the regex demo.
Note that you may further generalize it a bit by changing \\d+[a-zA-Z] to [0-9a-zA-Z]+ / [[:alnum:]]+ to match 1+ digits or letters.
Details
(?:\d+[a-zA-Z]\.)? - an optional sequence of
\d+ - 1+ digits
[a-zA-Z] - an ASCII letter
\. - a dot
[^.?!:]* - 0 or more chars other than ., ?, !, :
[?!.] - a ?, ! or .
: - a colon
\s* - 0+ whitespaces
R test:
> gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)
[1] "Approx. It was very prompt however. Were a little bit pushy but excellent otherwise Were willing to try to bring the price to a level that was acceptable to me.Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)! "
Extending to handle abbreviations
You may enumerate the exceptions if you add alternation:
gsub("(?:\\d+[a-zA-Z]\\.)?(?:i\\.?e\\.|[^.?!:])*[?!.]:\\s*", "", review)
                          ^^^^^^^^^^^^^^^^^^^^^^
Here, (?:i\.?e\.|[^.?!:])* matches 0 or more ie. or i.e. substrings or any chars other than ., ?, ! or :.
See this demo.
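To see what the alternation changes, here is a small sketch on a made-up string that contains "i.e." inside a question. The string x is invented for illustration, and perl = TRUE is my choice here so that the first alternative is tried before the character class; the calls above use the default engine:

```r
# Invented example: the first question contains the abbreviation "i.e."
x <- "1a. Did you, i.e. the buyer, wait long?: No. 2b. Other.: Fine"

# Without the exception, the dots in "i.e." stop the match from reaching
# back to the start of the question, so part of it is left behind.
basic <- gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", x, perl = TRUE)
basic
# [1] "1a. Did you, i.e.No. Fine"

# With the exception, "i.e." is consumed as a unit and the whole question goes.
extended <- gsub("(?:\\d+[a-zA-Z]\\.)?(?:i\\.?e\\.|[^.?!:])*[?!.]:\\s*", "", x, perl = TRUE)
extended
# [1] "No. Fine"
```

Further abbreviations would be added as more alternatives before the character class, e.g. another branch next to i\\.?e\\. for each one.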
I'm studying the recent hashtag #BalanceTonPorc in one of my classes. I'm trying to get all the occurrences of this hashtag appearing in tweets, but of course nobody uses the same format.
Some people use #BalanceTonPorc, some #balancetonporc, and so on and so forth.
Using gsub, I've so far done this:
df$hashtags <- gsub(".alance.on.orc", "BalanceTonPorc", df$hashtags)
Which does what I want, and all variations of this hashtag are stored under the same one. But there are A LOT of other variations. Some people used #BalanceTonPorc... or #BalanceTonPorc.
Is there a way to have a RegEx that says I want everything that contains .alance.on.orc with every character possible after the hashtag, except , (because it separates hashtags)? Here is a screenshot to illustrate what I mean.
I'm also having another issue, in my frequency table I have twice #BalanceTonPorc, so I guess R must consider them to be different. Can you spot the difference?
You may use [^,]* to match any char but ,, 0+ occurrences:
gsub(".alance.on.orc[^,]*", "BalanceTonPorc", df$hashtags)
Or, to exactly match balancetonporc,
gsub("balancetonporc[^,]*", "BalanceTonPorc", df$hashtags, ignore.case=TRUE)
See a regex demo and an R online test:
x <- c("#balancetonPorc#%$%#$%^","#balancetonporc#%$%, text")
gsub("balancetonporc[^,]*", "BalanceTonPorc", x, ignore.case=TRUE)
# => [1] "#BalanceTonPorc" "#BalanceTonPorc, text"
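As for the entry that shows up twice in the frequency table: one possible cause (an assumption, since the data is not shown) is stray whitespace around a hashtag, for example after the separating comma. A minimal sketch with invented tags that normalizes, splits on the comma and trims before counting:

```r
# Invented sample values standing in for df$hashtags
tags <- c("#balancetonporc...", " #BalanceTonPorc", "#MeToo, #balanceTonPorc.")

# Normalize every variant, including whatever follows it up to the next comma
clean <- gsub("balancetonporc[^,]*", "BalanceTonPorc", tags, ignore.case = TRUE)

# Split into individual hashtags and strip surrounding whitespace,
# so " #BalanceTonPorc" and "#BalanceTonPorc" are counted together
pieces <- trimws(unlist(strsplit(clean, ",")))
table(pieces)
```

If the duplicates remain after trimming, comparing the two values with charToRaw() would show whether they differ in some invisible character.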
I am scraping a very long forum thread, and I want to come up with a database that has columns containing the following info: date / full post text / quoted user / quoted text / clean text
The clean text should be each user's post, without the quotations if they are replying to anyone. If the post is not a reply, I would leave it as NA. The following is an invented post, with an invented user, to illustrate what I have managed to do so far:
post<-"Meow1 wrote: »\noday is gonna be the day that they're gonna throw it back to you?\nBy now you should've somehow Realized what you gotta do\n\n\nI don't believe that anybody Feels the way I do, about you now\nMeow1 wrote: »\nI'm sure you've heard it all before But you never really had a doubt\n\n\nBecause maybe, you're gonna be the one that saves me\nMeow1 wrote: »\nAnd after all, you're my wonderwall\n\n\nAnd all the lights that lead us there are blinding"
Then I try to pull out the quoted user (Meow1) and it works:
QuotedUser_1<-ifelse(grepl('wrote:', post), gsub('\\s*wrote.*$', '', post), NA)
QuotedUser_1
[1] "Meow1"
Then I wrote this code for pulling out the quoted text and the clean text:
Quotedtext_1<- ifelse(grepl('wrote:', post), gsub('^.*wrote\\s*|\\s*\\n\\n\\n.*$', '', post), NA)
It works when there is only one quoted text, but otherwise, it only gives the last quoted bit (in the example, 'And after all, you´re my wonderwall')
And same for the clean text, it only returns the last reply:
Clean_text<- sub('^.*\\n\\n\\n\\s*|\\s*wrote.*', '', post)
If anyone has a suggestion to improve the code, so that I can have a vector with all the quotations, and a vector with all the replies, I would be very grateful...
Cheers
Are you sure you cannot scrape the author and text information separately? Without a source it's difficult to know, but I guess they can be obtained with different CSS selectors, which would make it much easier to split the data.
If not, it might be helpful to look into str_locate_all (from the stringr package), which allows you to locate all occurrences of e.g. "wrote:" and split the string accordingly.
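A minimal sketch of the splitting idea, using base R's gregexpr and regmatches instead of str_locate_all so that it runs without extra packages. It assumes every block has the shape "user wrote: ...", newline, quoted text, three newlines, reply, and it uses a shortened, invented post rather than the one from the question:

```r
# Shortened, invented post with two quote/reply pairs
post <- "Meow1 wrote: \u00bb\nfirst quote\n\n\nfirst reply\nMeow1 wrote: \u00bb\nsecond quote\n\n\nsecond reply"

# One block per "<user> wrote:", running up to the next such header or the end
blocks <- regmatches(
  post,
  gregexpr("(?s)\\S+ wrote:.*?(?=\\n\\S+ wrote:|$)", post, perl = TRUE)
)[[1]]

# Pull the three parts out of each block
user   <- sub("(?s)^(\\S+) wrote:.*$", "\\1", blocks, perl = TRUE)
quoted <- sub("(?s)^\\S+ wrote:[^\\n]*\\n(.*?)\\n\\n\\n.*$", "\\1", blocks, perl = TRUE)
reply  <- sub("(?s)^.*?\\n\\n\\n", "", blocks, perl = TRUE)

data.frame(user, quoted, reply)
```

This gives one row per quotation instead of only the last one, because each block is processed on its own rather than running a greedy ^.* over the whole post.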