I am trying to extract Twitter data for a keyword using the following code:
library(twitteR)  # registerTwitterOAuth(), searchTwitter()
library(ROAuth)   # OAuthFactory
library(RCurl)    # ships the cacert.pem used for the SSL handshake

cred <- OAuthFactory$new(consumerKey='XXXX', consumerSecret='XXXX',
                         requestURL='https://api.twitter.com/oauth/request_token',
                         accessURL='https://api.twitter.com/oauth/access_token',
                         authURL='https://api.twitter.com/oauth/authorize')
cred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=Cwr7GgWIdjh9pZCmaJcLq6CG1zIqk4JsID8Q7v1s
When complete, record the PIN given to you and provide it here: 8387466
registerTwitterOAuth(cred)
search=searchTwitter('facebook',cainfo="cacert.pem",n=1000)
But even with n=1000, the function returns a list of only 99 tweets, when it should return more than that. I also tried the same function with a specific date range:
search=searchTwitter('facebook',cainfo="cacert.pem",n=1000,since='2013-01-01',until='2014-04-01')
But this call returns an empty list.
Can anyone help me out with the correct set of additional arguments so that I can extract data from a specific date range and without any restriction on the number of tweets? Does it have anything to do with the amount of data the API will return?
Thanks in advance
It looks like the Twitter API restricts the number of tweets returned per request; check the API documentation for the current limits. Keeping that restriction in mind, you can use the since and sinceID arguments of searchTwitter() within a loop, something like:
for (i in 1:20) {
  if (i == 1) {
    # first pass: everything since the given date
    search <- searchTwitter('facebook', cainfo = "cacert.pem", n = 2, since = '2014-04-15')
  } else {
    # later passes: only tweets newer than the most recent one already seen
    search <- searchTwitter('facebook', cainfo = "cacert.pem", n = 2, since = '2014-04-15',
                            sinceID = search[[1]]$id)
  }
  print(search)
  Sys.sleep(10)
}
You may need to adjust the Sys.sleep(10) pause if you hit the API rate limits.
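If you want to keep the tweets from each pass instead of just printing them, here is a minimal sketch that collects the batches and converts them with twitteR's twListToDF (the n and since values are placeholders, and searchTwitter and the OAuth setup are as above):
all_tweets <- list()
newest_id <- NULL
for (i in 1:20) {
  batch <- if (is.null(newest_id)) {
    searchTwitter('facebook', cainfo = "cacert.pem", n = 100, since = '2014-04-15')
  } else {
    searchTwitter('facebook', cainfo = "cacert.pem", n = 100, since = '2014-04-15',
                  sinceID = newest_id)
  }
  if (length(batch) == 0) break       # nothing newer yet
  all_tweets <- c(all_tweets, batch)
  newest_id <- batch[[1]]$id          # most recent tweet returned in this batch
  Sys.sleep(10)                       # back off between requests
}
tweets_df <- twListToDF(all_tweets)   # one row per tweet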
I'm having trouble accessing the Energy Information Administration's API through R (https://www.eia.gov/opendata/).
On my office computer, if I try the link in a browser it works, and the data shows up (the full url: https://api.eia.gov/series/?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json).
I am also successfully connected to Bloomberg's API through R, so R is able to access the network.
Since the API is working and not blocked by my company's firewall, and R is in fact able to connect to the Internet, I have no clue what's going wrong.
The script works fine on my home computer, but on my office computer it is unsuccessful. So I gather it is a network issue, but if somebody could point me in any direction as to what the problem might be, I would be grateful (my IT department couldn't help).
library(XML)
api.key = "e122a1411ca0ac941eb192ede51feebe"
series.id = "PET.MCREXUS1.M"
my.url = paste("http://api.eia.gov/series?series_id=", series.id,"&api_key=", api.key, "&out=xml", sep="")
doc = xmlParse(file=my.url, isURL=TRUE) # yields error
Error message:
Error: 1: No such file or directory
       2: failed to load external entity "http://api.eia.gov/series?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json"
I tried other methods, like read_xml() from the xml2 package, but that gives a "could not resolve host" error.
To get XML, you need to request XML output in your URL (&out=xml):
my.url = paste("http://api.eia.gov/series?series_id=", series.id,"&api_key=",
api.key, "&out=xml", sep="")
res <- httr::GET(my.url)
xml2::read_xml(res)
Or:
res <- httr::GET(my.url)
XML::xmlParse(res)
Otherwise, with the URL as in the post (i.e. &out=json):
res <- httr::GET(my.url)
jsonlite::fromJSON(httr::content(res,"text"))
Or, if you keep &out=xml in the URL, this:
xml2::read_xml(httr::content(res,"text"))
Please note that this answer simply provides a way to get the data; whether it is in the desired form is up to whoever is processing it.
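As a quick sanity check on whatever comes back, a sketch that only inspects the structure of the parsed JSON rather than assuming a particular layout:
res <- httr::GET(my.url)
httr::stop_for_status(res)                      # raise an error on HTTP failures
js <- jsonlite::fromJSON(httr::content(res, "text"))
str(js, max.level = 2)                          # look at the top-level structure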
If it does not have to be XML output, you can also use the new eia package. (Disclaimer: I'm the author.)
Using your example:
remotes::install_github("leonawicz/eia")
library(eia)
x <- eia_series("PET.MCREXUS1.M")
This assumes your key is set globally (e.g., in .Renviron or previously in your R session with eia_set_key). But you can also pass it directly to the function call above by adding key = "yourkeyhere".
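For example (a sketch; "yourkeyhere" is a placeholder for your actual API key):
library(eia)
eia_set_key("yourkeyhere")                        # once per session
x <- eia_series("PET.MCREXUS1.M")
# or pass the key explicitly
x <- eia_series("PET.MCREXUS1.M", key = "yourkeyhere")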
The result is a tidyverse-style data frame, one row per series ID, including a data list column that contains the data frame for each time series (which can be unnested with tidyr::unnest if desired).
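A sketch of that unnesting step (assuming the list column is named data, as described above):
library(tidyr)
x_long <- tidyr::unnest(x, cols = data)   # one row per observation
head(x_long)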
Alternatively, if you set the argument tidy = FALSE, it will return the list result of jsonlite::fromJSON without the "tidy" processing.
Finally, if you set tidy = NA, no processing is done at all and you get the original JSON string output for those who intend to pass the raw output to other canned code or software. The package does not provide XML output, however.
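A sketch of those two variants, using the same series as above:
x_list <- eia_series("PET.MCREXUS1.M", tidy = FALSE)  # list result from jsonlite::fromJSON
x_json <- eia_series("PET.MCREXUS1.M", tidy = NA)     # raw JSON string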
There are more comprehensive examples and vignettes at the eia package website I created.
I'm trying to fetch data from the Google Plus API but I only know how to search if I know the user_id.
Here's how I get the JSON using the RCurl library:
data <- getURL(paste0("https://www.googleapis.com/plus/v1/people/",
user_id,"/activities/public?maxResults=100&key=", api_key),
ssl.verifypeer = FALSE)
I have tried formatting the URL as in Google's documentation, like so:
data <- getURL(paste0("https://www.googleapis.com/plus/v1/activities/",
keyword,"?key=",api_key),ssl.verifypeer = FALSE)
but it doesn't work.
Is it even possible to search using a keyword from R or not? R isn't among the supported programming languages for the API, according to this link.
I figured out how to make it work.
The GET request should be formatted as:
data <- getURL(paste0("https://www.googleapis.com/plus/v1/activities?key=",api_key,"&query=",search_string),ssl.verifypeer = FALSE)
I would like to retrieve a list of tweets from Twitter for a given hashtag using package RJSONIO in R. I think I am pretty close to the solution, but I seem to miss one step.
My code reads as follows (in this example, I use #NBA as a hashtag):
library(httr)
library(RJSONIO)
# 1. Find OAuth settings for twitter:
# https://dev.twitter.com/docs/auth/oauth
oauth_endpoints("twitter")
# Replace key and secret below
myapp <- oauth_app("twitter",
key = "XXXXXXXXXXXXXXX",
secret = "YYYYYYYYYYYYYYYYY"
)
# 3. Get OAuth credentials
twitter_token <- oauth1.0_token(oauth_endpoints("twitter"), myapp)
# 4. Use API
req=GET("https://api.twitter.com/1.1/search/tweets.json?q=%23NBA&src=typd",
config(token = twitter_token))
req <- content(req, as = "text")
response=fromJSON(req)
How can I get the list of tweets from object 'response'?
Eventually, I would like to get something like:
searchTwitter("#NBA", n=5000, lang="en")
Thanks a lot in advance!
The response object should be a list of length two: statuses and search_metadata. So, for example, to get the text of the first tweet, try:
response$statuses[[1]]$text
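To pull every tweet's text (plus the author) out of that list, a sketch (the text and user$screen_name field names come from the Twitter v1.1 search response):
texts <- sapply(response$statuses, function(s) s$text)
users <- sapply(response$statuses, function(s) s$user$screen_name)
head(data.frame(user = users, text = texts, stringsAsFactors = FALSE))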
However, there are a couple of R packages designed to make just this kind of thing easier: Try streamR for the streaming API, and twitteR for the REST API. The latter has a searchTwitter function exactly as you describe.
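For example, with twitteR (a sketch; the access token and secret are placeholders you obtain from your app's page on the Twitter developer site):
library(twitteR)
setup_twitter_oauth("XXXXXXXXXXXXXXX", "YYYYYYYYYYYYYYYYY",
                    "ACCESS_TOKEN", "ACCESS_SECRET")
nba <- searchTwitter("#NBA", n = 5000, lang = "en")
nba_df <- twListToDF(nba)   # one row per tweet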
I'm a beginner at web scraping and not yet familiar with the nomenclature for the problems I'm trying to solve. Nevertheless, I've searched exhaustively for this specific problem and was unsuccessful in finding a solution. If it has already been answered somewhere else, I apologize in advance and thank you for your suggestions.
Getting to it. I'm trying to build a script with R that will:
1. Search for specific keywords in a newspaper website;
2. Give me the headlines, dates and contents for the number of results/pages that I desire.
I already know how to post the form for the search and scrape the results from the first page, but I've had no success so far in getting the content from the next pages. To be honest, I don't even know where to start (I've read about RCurl and so on, but it still hasn't made much sense to me).
Below is a partial sample of the code I've written so far (scraping only the headlines of the first page, to keep it simple).
library(RCurl)
library(XML)

curl <- getCurlHandle()
curlSetOpt(cookiefile = 'cookies.txt', curl = curl, followlocation = TRUE)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

search <- getForm("http://www.washingtonpost.com/newssearch/search.html",
                  .params = list(st = "Dilma Rousseff"),
                  .opts = curlOptions(followLocation = TRUE),
                  curl = curl)
results <- htmlParse(search)
results <- xmlRoot(results)
results <- getNodeSet(results, "//div[@class='pb-feed-headline']/h3")  # headline <h3> nodes
results <- unlist(lapply(results, xmlValue))
I understand that I could perform the search directly on the website, inspect the URL for references to the page number or to the number of the news article displayed on each page, and then use a loop to scrape each page.
But please bear in mind that, after I learn how to go from page 1 to pages 2, 3, and so on, I will try to develop my script to perform searches with different keywords on different websites, all at the same time, so the solution in the previous paragraph doesn't seem the best one to me so far.
If you have any other solution to suggest me, I will gladly embrace it. I hope I've managed to state my issue clearly so I can get a share of your ideas and maybe help others that are facing similar issues. I thank you all in advance.
Best regards
First, I'd recommend you use httr instead of RCurl - for most problems it's much easier to use.
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff"
)
)
stop_for_status(r)
content(r)
Second, if you look at the URL in your browser, you'll notice that clicking a page number modifies the startat query parameter:
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff",
startat = 10
)
)
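Putting those two pieces together, a sketch that walks the first few result pages by stepping startat (assuming the site pages in steps of 10):
library(httr)

pages <- lapply(seq(0, 40, by = 10), function(offset) {
  r <- GET("http://www.washingtonpost.com/newssearch/search.html",
           query = list(st = "Dilma Rousseff", startat = offset))
  stop_for_status(r)
  r
})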
Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:
# devtools::install_github("hadley/rvest")
library(rvest)
page <- html(r)
links <- page[sel(".pb-feed-headline a")]
links["href"]
html_text(links)
I highly recommend reading the selectorgadget tutorial and using it to figure out which CSS selectors you need.
I'm trying to pull tweets using the twitteR package, but I'm having an issue getting them through the searchTwitter function when I specify a geocode the way it is shown in the docs. Please see the code below:
#Oauth code (successful authentication)
keyword = "the"
statuses = searchTwitter(keyword, n=100, lang="en",sinceID = NULL, geocode="39.312957, -76.618119, 10km",retryOnRateLimit=10)
The code works perfectly when I leave out geocode="39.312957, -76.618119, 10km", but when I include it, I get the following:
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
100 tweets were requested but the API can only return 0
I thought maybe my formatting was wrong, but based on the twitteR CRAN page the string is in the right format (I also tried switching between km and mi).
Has anyone else experienced this, or does anyone know a better way to search for a specific geocode? Could the geocode functionality have been deprecated?
I'm looking for tweets from Baltimore, so if there is a better way to do this, I'm all ears. (By the way, I want to avoid pulling all tweets and then filtering them myself, because I think I would hit the data limit fairly quickly and miss out on what I'm looking for.)
Thanks!
I believe you need to remove the spaces in the geocode parameter:
statuses = searchTwitter(keyword, n=100, lang="en",sinceID = NULL, geocode="39.312957,-76.618119,10km",retryOnRateLimit=10)
FWIW, you can use the Twitter desktop client's "Develop" console to test out URLs before committing them to scripts.
I had the same issue. Your parameters are in the correct order, but you must avoid any whitespace within the geocode. Also, 10km might be too small a radius for the accuracy of the coordinates given; you might want to try 12mi.
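Combining both suggestions, a sketch (keyword is defined as in the question, and 12mi is just the wider radius suggested above):
statuses <- searchTwitter(keyword, n = 100, lang = "en",
                          geocode = "39.312957,-76.618119,12mi",
                          retryOnRateLimit = 10)
baltimore_df <- twListToDF(statuses)   # one row per tweet, for easier filtering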