R: Automatically 'translate' shortened URLs from Twitter data

I am looking for a way to automatically 'translate' the shortened URLs from Twitter to the original URL.
I scraped a couple of Twitter timelines using the following code:
library(twitteR)
library(purrr)
library(dplyr)

tweets <- userTimeline("exampleuser", n = 3200, includeRts = TRUE)
tweets_df <- tbl_df(map_df(tweets, as.data.frame))
Then I separated the shortened URLs from the rest of the tweet text, so that I have a separate column in my dataframe, which contains only the shortened URL.
Now I am looking for a way to automatically scrape all these URLs, which redirect to various websites, and get a new column with the original (i.e. unshortened) URL.
Does anyone have an idea how I can do this in R?
Thanks,
Manuel

You can use the httr package.
httr::HEAD("URL") follows the redirect and gives you a response whose first line (the url element of the response object) is the final, unshortened URL, so you only need minimal cleaning to pull it out.
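A minimal sketch of that, assuming your data frame is tweets_df and the shortened links sit in a column named short_url (both names are placeholders, adjust to your data):

library(httr)
library(purrr)

# Resolve one shortened URL to its final destination. HEAD() follows
# redirects by default, and the url element of the response holds the
# address that was ultimately reached.
expand_url <- function(short_url) {
  resp <- tryCatch(HEAD(short_url, timeout(5)), error = function(e) NULL)
  if (is.null(resp)) return(NA_character_)
  resp$url
}

# Hypothetical column names -- adjust short_url / long_url to your data.
tweets_df$long_url <- map_chr(tweets_df$short_url, expand_url)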

Related

API Webscrape OpenFDA with R

I am scraping OpenFDA (https://open.fda.gov/apis). I know my particular query has 6974 hits, organized into 100 hits per page (the API's max download). I am trying to use R (rvest, jsonlite, purrr, tidyverse, httr) to download all of this data.
I checked the website information with curl in terminal and downloaded a couple of sites to see a pattern.
I've tried a few lines of code and I can only get 100 entries to download. This code seems to work decently, but it will only pull 100 entries, i.e. one page. To skip the first 100, which I can pull down and merge later, here is the code that I have used:
library(httr)       # GET(), accept_json(), content()
library(jsonlite)   # fromJSON()
library(tidyverse)  # glimpse(), view()

url_json <- "https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=100&skip=6973"
raw_json <- httr::GET(url_json, accept_json())
data <- httr::content(raw_json, "text")
my_content_from_json <- jsonlite::fromJSON(data)
dplyr::glimpse(my_content_from_json)
dataframe1 <- my_content_from_json$results
view(dataframe1)
SOLUTION below in the responses. Thanks!
From the comments:
It looks like the API parameters skip and limit work better than the search_after parameter. They allow pulling down 1,000 entries at a time according to the documentation (open.fda.gov/apis/query-parameters). To pass these parameters in the query string, an example URL would be
https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=1000&skip=0
after which you can loop to get the remaining entries with skip=1000, skip=2000, etc. as you've done above.
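A sketch of that loop, reusing the placeholder API key and the grapefruit search from the question (6974 hits at 1,000 per request means skip values of 0 through 6000):

library(httr)
library(jsonlite)
library(dplyr)

url_template <- paste0(
  "https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY",
  "&search=grapefruit&limit=1000&skip=%s"
)

# One GET per skip value; each request returns up to 1,000 results.
pages <- lapply(seq(0, 6000, by = 1000), function(skip) {
  raw_json <- GET(sprintf(url_template, skip), accept_json())
  fromJSON(content(raw_json, "text"))$results
})

# Nested list-columns in the label data may need flattening before binding.
all_results <- bind_rows(pages)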

Web scraping using a url-list in R

I am trying to scrape some URLs from multiple websites I collected. I saved the already collected websites in a dataframe called meetings2017_2018. The problem is that the URLs don't look very similar to one another, except for the first part: https://amsterdam.raadsinformatie.nl. The second parts of the URLs are saved in the dataframe. Here are some examples:
/vergadering/458873/raadscommissie%20Algemene%20Zaken
/vergadering/458888/raadscommissie%20Wonen
/vergadering/458866/raadscommissie%20Jeugd%20en%20Cultuur
/vergadering/346691/raadscommissie%20Algemene%20Zaken
So the whole URL would be https://amsterdam.raadsinformatie.nl/vergadering/458873/raadscommissie%20Algemene%20Zaken
I managed to create a very simple function from which I can pull out the URLs from a single website (see below).
library(glue)
library(rvest)

web_scrape <- function(meeting) {
  url <- glue("https://amsterdam.raadsinformatie.nl{meeting}")
  read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href")
}
With this function I still need to insert every single URL from the dataframe I want to scrape. Since there are more than 140 URLs in the dataframe, this might take a while. As you can guess, I want to scrape all the URLs at once using the URL list in the dataframe. Does anybody know how I can do that?
You could map/iterate over the URLs saved in your meetings2017_2018 data frame.
Assuming your URLs are saved in a url column in your meetings2017_2018 data frame, a starting point would be:
library(dplyr)
library(purrr)

# create a vector of the URLs
urls <- pull(meetings2017_2018, url)

# map over the URLs and execute whatever code you want for every URL
map(urls, function(url) {
  your_code
})
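For instance, plugging in the web_scrape() function from the question (a sketch that assumes the partial URLs live in a column named url):

library(dplyr)
library(purrr)

urls <- pull(meetings2017_2018, url)

# Apply web_scrape() to every partial URL; the result is a list with one
# character vector of scraped hrefs per meeting page.
all_links <- map(urls, web_scrape)
names(all_links) <- urls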

R package (twitteR) returns short URL destination rather than URL text

I'm trying to pull the text of a URL from a Twitter feed (about 3,000 of them) via the twitteR package in R. Specifically, I want the longitude and latitude data contained in the URLs in this tweet:
https://twitter.com/PGANVACentralCh/status/885702041275969536
However, the twitteR package scrapes out the short form URL destination instead:
e.g.: https://t.co/Y0pGeSiVFJ
I could follow all 3,000 links individually and copy and paste their URLs and then transform them to longitude and latitude, but there has to be a simpler way?
Not that it matters for this particular problem, but I am getting the tweets via this code:
library(twitteR)
library(httr)
library(purrr)   # map_df()
library(dplyr)   # tbl_df()

poketweets <- userTimeline("PGANVACentralCh", n = 3200)
poketweets_df <- tbl_df(map_df(poketweets, as.data.frame))
write.csv(poketweets_df, "poketweets.csv")
You need to get hold of the entities.url.expanded_url value from the Tweet object. I do not believe that the status objects returned by twitteR support that (the status object fields are only a subset of the Tweet JSON values). Additionally, twitteR is now deprecated in favour of rtweet.
Using rtweet, you can modify your code:
library(rtweet)

poketweets <- get_timeline("PGANVACentralCh", n = 50)
head(poketweets)
You'll find there's a urls_expanded field in each Tweet dataframe that you can use.
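For example (a sketch; the column is named urls_expanded in older rtweet releases and urls_expanded_url in newer ones, so check names(poketweets)):

library(dplyr)

# Keep only the expanded-URL column(s); each row holds the full destination
# URL(s) embedded in that tweet.
expanded_links <- poketweets %>%
  select(contains("urls_expanded"))
head(expanded_links)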

URL contains page number but this number is unknown

I'm trying in R to get the list of tickers from every exchange covered by Quandl.
There are 2 ways:
1) For every exchange they provide a zipped CSV with all tickers. The URL looks like this (XXXXXXXXXXXXXXXXXXXX is the API key, YYY is the exchange code):
https://www.quandl.com/api/v3/databases/YYY/codes?api_key=XXXXXXXXXXXXXXXXXXXX
This looks pretty promising, but I was not able to read the file with read.table or e.g. fread. I don't know why. Is it because of the API key? read.table is supposed to read zip files with no problem.
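For what it's worth, neither read.table nor fread can open a remote zip archive directly; one workaround sketch (with the same placeholder key and exchange code) is to download it to a temporary file first and read the extracted CSV:

library(data.table)

zip_url <- "https://www.quandl.com/api/v3/databases/YYY/codes?api_key=XXXXXXXXXXXXXXXXXXXX"

tmp <- tempfile(fileext = ".zip")
download.file(zip_url, tmp, mode = "wb")  # "wb" keeps the archive intact on Windows

# unzip() returns the path(s) of the extracted file(s)
csv_path <- unzip(tmp, exdir = tempdir())
tickers  <- fread(csv_path[1])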
2) I was able to go further with the 2nd way. They provide URL to the csv of tickers. E.g.:
https://www.quandl.com/api/v3/datasets.csv?database_code=YYY&per_page=100&sort_by=id&page=1&api_key=XXXXXXXXXXXXXXXXXXXX
As you can see, the URL contains a page number. The problem is that they only mention below in the text that you need to run this URL many times (e.g. 56 times for LSE) in order to get the full list. I was able to do it like this:
library(data.table)

pages <- 1:100                  # "100" is just taken to be big enough
Source <- c("LSE", "FSE", ...)  # vector of exchange codes
QUANDL_API_KEY <- "XXXXXXXXXXXXXXXXXXXXXXXXXX"

urls <- sprintf(
  "https://www.quandl.com/api/v3/datasets.csv?database_code=%s&per_page=100&sort_by=id&page=%s&api_key=%s",
  Source, pages, QUANDL_API_KEY
)

TICKERS <- lapply(urls, FUN = fread, stringsAsFactors = FALSE)
TICKERS <- do.call(rbind, TICKERS)
The problem is I just put 100 pages, but when R tries to get a non-existing page (e.g. #57) it throws an error and does not go any further. I was trying to do something like iferror, but failed.
Could you please give some hints?
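The 'iferror' idea the question is reaching for is usually spelled tryCatch() in R; a minimal sketch, reusing the urls vector built above:

library(data.table)

# Wrap fread() so a failed request (e.g. a non-existing page) returns NULL
# instead of stopping the whole loop.
safe_fread <- function(u) {
  tryCatch(fread(u, stringsAsFactors = FALSE), error = function(e) NULL)
}

TICKERS <- lapply(urls, safe_fread)
TICKERS <- do.call(rbind, Filter(Negate(is.null), TICKERS))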

Scrape data from flash page using rvest

I am trying to scrape data from this page:
http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?
If I try to scrape the name of the players using the css selector and the usual rvest syntax:
library(rvest)

names <- read_html("http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?") %>%
  html_nodes(".scoring-player-name") %>%
  sapply(html_text)
everything goes well.
Unfortunately, if I try to scrape the statistics below (first serve pts won, ...) using the selector .stat-breakdown span, I am not able to retrieve any data.
I know rvest is generally not recommended for scraping dynamically created pages; however, I don't understand why some data are scraped and some are not.
I don't use rvest. If you follow the code below, you should end up with a single long string that you can transform into a dataframe using the separators : and ,.
The matchStatsData tag also contains more information than is displayed in the UI of the web page.
I can also try RSelenium, but I need to get my other PC, so I'll let you know if RSelenium works for me.
library(XML)
library(RCurl)
library(stringr)

url <- "http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?"
url2 <- getURL(url)
parsed <- htmlParse(url2)

# pull the match stats data out of the <script id="matchStatsData"> tag
step1 <- xpathSApply(parsed, "//script[@id='matchStatsData']", xmlValue)

# removing some unwanted characters
step2 <- str_replace_all(step1, "\r\n", "")
step3 <- str_replace_all(step2, "\t", "")
step4 <- str_replace_all(step3, "[[{}]\"]", "")
The output is then one long string containing all the match statistics.
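A rough parsing sketch, assuming the cleaned string (step4) ends up as a comma-separated list of key:value pairs; the exact field names depend on the page:

library(stringr)

# Split the cleaned string into "key:value" chunks, then into two columns.
pairs <- str_split(step4, ",")[[1]]
kv    <- str_split_fixed(pairs, ":", 2)

stats_df <- data.frame(key = kv[, 1], value = kv[, 2], stringsAsFactors = FALSE)
head(stats_df)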
