I extracted tweets from Twitter related to #TrumpCaved.
From these tweets I want to remove the emoticons, URLs, and all other special characters. One of the tweets is as follows:
#mitchellvii #AnnCoulter Hey all you #MAGA people, how did you like
watching #realDonaldTrump cave today?
… HTTP content [I cannot post the actual http link here]
I tried the following code, but it doesn't work for me.
I removed the URLs successfully, but when I then ran the next line to remove the emoticons, the emoticons were removed and the URLs came back. Can anyone help me remove all the unwanted characters from the text, especially the URLs and emoticons?
First I tried to remove the URLs using the gsub function:
Corpus = gsub("https.*","", tweets_text$Tweets)
Output: #mitchellvii #AnnCoulter Hey all you #MAGA people, how did you like watching #realDonaldTrump cave today? <U+0001F602><U+0001F923><U+0001F602><U+0001F923>…
Next I tried to remove the emoticons using the gsub function:
Corpus = gsub("[^[:alnum:]///' ]","", tweets_text$Tweets)
Output: mitchellvii AnnCoulter Hey all you MAGA people how did you like watching realDonaldTrump cave today https//tco/vmUCJvTnEO
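Note that the second gsub above runs on the original tweets_text$Tweets rather than on Corpus, so the first step's URL removal is discarded, which is why the URL text reappears. A minimal sketch that chains the two substitutions on the same object (the exact URL pattern here is an assumption; adjust as needed):

# Strip URLs first, then everything except letters, digits, apostrophes and
# spaces, feeding the result of step one into step two
Corpus <- gsub("https?://\\S+", "", tweets_text$Tweets)
Corpus <- gsub("[^[:alnum:]' ]", "", Corpus)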
I've been grappling with regex in the following string:
"Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
I am unable to remove the entire \x...\x pattern from the above string, and I am also unable to remove the https URL.
My regex expressions are:
gsub('http.* *', '', twts_array)
gsub("\\x.*\\x..","",twts_array)
My output is:
"Just beautiful let’s see how the next few days go \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... httpstcohUradDaNVX"
My expected output is:
Just beautiful, let’s see how the next few days go. Long term buying opportunities could be around the corner
P.S.: As you can see, neither of the problems got solved. I also wrote "dot" in place of the "." in https://t dot co/hUradDaNVX because StackOverflow does not allow me to post shortened URLs. Can someone help me tackle this problem?
On Linux you can do the following:
twts_array <- "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
twts_array_str <- enc2utf8(twts_array)                # invalid byte sequences become "<..>"-style escapes
twts_array_str <- gsub('<..>', '', twts_array_str)    # strip those escapes
twts_array_str <- gsub('http.*', '', twts_array_str)  # strip the URL
twts_array_str
# "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner ... "
enc2utf8 converts any invalid byte sequences to the <..> escape format; the first gsub then removes those escapes, and the second removes the URL.
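If you would rather not depend on how the invalid bytes are escaped, a portable sketch using base R's iconv (sub = "" silently drops any bytes that cannot be converted to UTF-8):

twts_clean <- iconv(twts_array, from = "UTF-8", to = "UTF-8", sub = "")  # drop invalid bytes
twts_clean <- gsub("http\\S*", "", twts_clean)                           # then drop the URL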
I am scraping a very long forum thread, and I want to come up with a database that has columns containing the following info: date / full post text / quoted user / quoted text / clean text
The clean text should be each user's post, without the quotations if they are replying to anyone. If the post is not a reply, I would leave it as NA. The following is an invented post, with an invented user, to illustrate what I have managed to do so far:
post<-"Meow1 wrote: »\noday is gonna be the day that they're gonna throw it back to you?\nBy now you should've somehow Realized what you gotta do\n\n\nI don't believe that anybody Feels the way I do, about you now\nMeow1 wrote: »\nI'm sure you've heard it all before But you never really had a doubt\n\n\nBecause maybe, you're gonna be the one that saves me\nMeow1 wrote: »\nAnd after all, you're my wonderwall\n\n\nAnd all the lights that lead us there are blinding"
Then I try to pull out the quoted user (Meow1) and it works:
QuotedUser_1<-ifelse(grepl('wrote:', post), gsub('\\s*wrote.*$', '', post), NA)
QuotedUser_1
[1] "Meow1"
Then I created this code for pulling out the quoted text and the clean text:
Quotedtext_1<- ifelse(grepl('wrote:', post), gsub('^.*wrote\\s*|\\s*\\n\\n\\n.*$', '', post), NA)
It works when there is only one quoted text, but otherwise it only gives the last quoted bit (in the example, 'And after all, you're my wonderwall').
The same goes for the clean text; it only returns the last reply:
Clean_text<- sub('^.*\\n\\n\\n\\s*|\\s*wrote.*', '', post)
If anyone has a suggestion to improve the code, so that I can get a vector with all the quotations and a vector with all the replies, I would be very grateful.
Cheers
Are you sure you cannot scrape the author and text information separately? Without a source it's difficult to know, but I would guess they can be obtained with different CSS selectors, making it much easier to split the data.
If not, it might be helpful to look into str_locate_all, which allows you to locate all occurrences of e.g. "wrote:" and split the string accordingly.
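For example, a minimal sketch with stringr, assuming every quote follows the "user wrote: »" marker and ends with a triple newline, as in the example post:

library(stringr)

# One capture group for the quoted user, one for the quoted text;
# (?s) lets . match across newlines
pat <- "(?s)(\\w+) wrote: »\\n(.*?)\\n\\n\\n"

hits <- str_match_all(post, pat)[[1]]
quoted_users <- hits[, 2]  # "Meow1" "Meow1" "Meow1"
quoted_text  <- hits[, 3]  # the three quoted passages

# Replace each quote block with a sentinel, then split on it to recover the replies
clean_text <- str_split(str_replace_all(post, pat, "\x01"), "\x01")[[1]]
clean_text <- str_trim(clean_text[clean_text != ""])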
I'm grabbing the following page and storing it in R with the following code:
gQuery <- getURL("https://www.google.com/#q=mcdimalds")  # getURL() is from the RCurl package
Within this, there's the following snippet of code
Showing results for</span> <a class="spell" href="/search?rlz=1C1CHZL_enUS743US743&q=mcdonalds&spell=1&sa=X&ved=0ahUKEwj9koqPx_TTAhUKLSYKHRWfDlYQvwUIIygA"><b><i>mcdonalds</i></b></a>
Everything other than "Showing results for" and the italics tags encasing the desired name is subject to change from query to query.
What I want to do is use regex to extract the mcdonalds that occurs here: <b><i>mcdonalds</i>, i.e. the second instance of mcdonalds. However, I'm not too sure how to write the regex to do so.
Any help accomplishing this would be greatly appreciated. As always, please let me know if any additional information should be added to clarify the question.
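A minimal sketch in base R, assuming the corrected spelling is the only text wrapped in a <b><i>…</i> pair (a proper HTML parser would be more robust than regex for this):

# regexec() returns the full match plus capture groups
m <- regexec("<b><i>([^<]+)</i>", gQuery)
corrected <- regmatches(gQuery, m)[[1]][2]
corrected
# [1] "mcdonalds"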
I have parts of links pertaining to baseball players in my character vector:
teamplayerlinks <- c(
"/players/i/iannech01.shtml",
"/players/l/lindad01.shtml",
"/players/c/canoro01.shtml"
)
I would like to isolate the letters/numbers after the 3rd / sign, and before the .shtml portion. I want my resulting string to read:
desiredlinks
# [1] "iannech01" "lindad01" "canoro01"
I assume this may be a job for sub, but after many trials and errors I'm having a very tough time learning the escape and character sequences. I know it can be done with two sub calls to remove the front and back portions, but I'd rather do it in one call that dynamically handles other links.
Thank you in advance to anyone who replies - I'm still learning R and trying to get better everyday.
You could try
gsub(".*/|\\..*$", "", teamplayerlinks)
# [1] "iannech01" "lindad01" "canoro01"
Here we have
.*/ remove everything up to and including the last /
| or
\\..*$ remove everything after the ., starting from the end of the string
By the way, these look a bit like player IDs given in the Lahman baseball data sets. If so, you can use the Lahman package in R and not have to scrape the web. It has numerous baseball data sets. It can be installed with install.packages("Lahman"). I also wrote a package retrosheet for downloading data sets from retrosheet.com. It's also on CRAN. Check it out!
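For example, a quick sketch (note that the main table is named Master in older versions of the Lahman package and People in newer ones):

install.packages("Lahman")
library(Lahman)
head(Master$playerID)  # IDs use the same "canoro01"-style format as the links above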
The basename function is useful here.
gsub("\\.shtml", "", basename(teamplayerlinks))
# [1] "iannech01" "lindad01" "canoro01"
This can also be done without regex:
tools::file_path_sans_ext(basename(teamplayerlinks))
#[1] "iannech01" "lindad01" "canoro01"
I am trying to retrieve the whole lyrics of a band from the web.
I have noticed that they build URLs using ".../firstletter/bandname/songname.html"
Here is an example.
http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html
I was thinking about creating a function that would read.csv the URLs.
That part was kind of easy, because I can get the song titles with a simple copy/paste and save them as .csv. Then I can pass each value of that vector to the function to construct the URL.
But I tried to read the first one just to see what it looks like and I found that there will be too much "cleaning the data" if my goal is to build a csv file with each lyric.
x <- read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))
I think my approach is not the best (or maybe I need a better data cleaning strategy)
The HTML page has a tell on where the lyrics begin:
Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.
Taking advantage of that, you can detect this string, and then read everything up to the end of the div:
m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
giveaway <- "Sorry about that."
# You can add the full line in case you think one of the lyrics might have this sentence in it.
start <- grep(giveaway, m) + 1 # Where the lyric starts
end <- grep("</div>", m[start:length(m)])[1] + start - 1
# Take the first </div> after the start of the lyric; the grep position is
# relative to the subset starting at `start`, so add start - 1 to convert it
# back to an index into m
lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n")
#This is just an example of how to clear the remaining tags and join the text.
And then:
> cat(lyrics) # cat() prints the line breaks
Ridin' down the highway
Goin' to a show
Stop in all the byways
Playin' rock 'n' roll
.
.
.
Well it's a long way
It's a long way, you should've told me
It's a long way, such a long way
Assuming that "cleaning the data" means you would be parsing through HTML tags, I recommend using a DOM scraping library that extracts only the lyrics text from the page; you can then save those lyrics to CSV, a database, or wherever. That way you wouldn't have to do any data cleaning. I don't know what programming language you're using, but a simple Google search will show you a lot of DOM querying and parsing libraries for any language.
Here is an example with PHP
http://simplehtmldom.sourceforge.net/manual.htm
$html = file_get_html('http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html');
// Find the div holding the lyrics: the sibling right after the ringtone div
$lyrics = $html->find('div.ringtone', 1)->next_sibling();
print($lyrics->innertext);
Now you have the lyrics; save them. (Code not tested.)
If you're using R, use the rvest library linked below. You will be able to query the DOM and extract the lyrics easily.
https://github.com/hadley/rvest
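For instance, a minimal sketch with rvest; the "div.ringtone + div" selector mirrors the next_sibling() trick in the PHP example and is an assumption about the page layout:

library(rvest)

page <- read_html("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")

# Select the div right after the ringtone block and pull out its text
lyrics <- html_text(html_node(page, "div.ringtone + div"), trim = TRUE)
cat(lyrics)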