I am using the twitteR package in R to extract tweets based on their IDs, but I am unable to do this for multiple tweet IDs without hitting either a rate limit or a 404 error.
This is because I am using showStatus(), which handles one tweet ID at a time.
I am looking for a function similar to getStatuses() that accepts multiple tweet IDs per request.
Is there an efficient way to perform this action?
I believe only 60 requests can be made in a 15-minute window using OAuth.
So, how do I ensure that:
1. Multiple tweet IDs are retrieved in a single request, and these requests are then repeated.
2. The rate limit is kept in check.
3. Errors are handled for tweets that are not found.
P.S.: This activity is not user-based.
Thanks
I have come across the same issue recently. For retrieving tweets in bulk, Twitter recommends using the lookup method provided by its API; that way you can get up to 100 tweets per request.
Unfortunately, this has not been implemented in the twitteR package yet, so I've tried to hack together a quick function (reusing lots of code from the twitteR package) to use that API method:
lookupStatus <- function(ids, ...) {
  lapply(ids, twitteR:::check_id)
  batches <- split(ids, ceiling(seq_along(ids) / 100))
  results <- lapply(batches, function(batch) {
    params <- parseIDs(batch)
    statuses <- twitteR:::twInterfaceObj$doAPICall(paste("statuses", "lookup", sep = "/"),
                                                   params = params, ...)
    twitteR:::import_statuses(statuses)
  })
  return(unlist(results))
}

parseIDs <- function(ids) {
  id_list <- list()
  if (length(ids) > 0) {
    id_list$id <- paste(ids, collapse = ",")
  }
  return(id_list)
}
Make sure that your vector of IDs is of class character (otherwise there can be some problems with very large IDs).
Use the function like this:
ids <- c("432656548536401920", "332526548546401821")
tweets <- lookupStatus(ids, retryOnRateLimit=100)
Setting a high retryOnRateLimit ensures you get all your tweets, even if your vector of IDs has more than 18,000 entries (100 IDs per request, 180 requests per 15-minute window).
As usual, you can turn the tweets into a data frame with twListToDF(tweets).
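For example, converting the result from above:

tweets_df <- twListToDF(tweets)  # one row per tweet
head(tweets_df)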
Related
I'm trying to pull data using the FTC's API in R, but the results are limited to only 50 rows. Please advise me on how to get more than 50 records from the API.
API details can be found at the following link: https://www.ftc.gov/developer/api/v0/endpoints/do-not-call-dnc-reported-calls-data-api
library(httr)      # for GET() and content()
library(jsonlite)  # for toJSON() and fromJSON()
library(dplyr)     # for the pipe operator

myapikey <- "my-api-key"
URL <- "https://api.ftc.gov/v0/dnc-complaints?api_key=my-api-key"

# Request complaints created on a given date
get.data <- GET(URL, query = list(api_key = myapikey, created_date = "2021-01-10"))
ftc.data <- content(get.data)

# Convert the "data" element of the parsed response into a flat data frame
jsoncars <- toJSON(ftc.data$data, pretty = TRUE)
ftc <- fromJSON(jsoncars, flatten = TRUE) %>% data.frame()
I was looking at the same thing. The page states: "Maximum number of records to include in JSON response. By default, the endpoint displays a maximum of 50 records per request – this is also the maximum value allowed. All non-empty responses include pagination metadata, which can be used to iterate over a large number of records, by sending a GET request for each page of records." So far I too could not figure out a workaround.
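In case it helps, here is a rough sketch of what iterating over that pagination metadata might look like, reusing myapikey from the question. The links[["next"]] field name is an assumption and needs to be checked against an actual response:

library(httr)

# Hypothetical pagination loop: keep following the next-page link reported in
# the response metadata until no further page is available. The exact location
# of that link (page$links[["next"]]) is assumed, not confirmed.
all_records <- list()
next_url <- "https://api.ftc.gov/v0/dnc-complaints?created_date=2021-01-10"

while (!is.null(next_url)) {
  resp <- GET(next_url, query = list(api_key = myapikey))
  page <- content(resp)
  all_records <- c(all_records, page$data)
  next_url <- page$links[["next"]]  # assumed field holding the next-page URL
}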
My goal is to obtain a time series, from week 1 of 1996 to week 46 of 2016, of legionellosis cases from this website supported by the Centers for Disease Control and Prevention (CDC) of the United States. A coworker attempted to scrape only the tables that contain legionellosis cases with the code below:
# install.packages('rvest')
library(rvest)

## Code to get all URLs
getUrls <- function(y1, y2, clist){
  root  = "https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year="
  root1 = "&mmwr_week="
  root2 = "&mmwr_table=2"
  root3 = "&request=Submit&mmwr_location="
  urls <- NULL
  for (year in y1:y2){
    for (week in 1:53){
      for (part in clist) {
        urls <- c(urls, paste(root, year, root1, week, root2, part, root3, sep = ""))
      }
    }
  }
  return(urls)
}
TabList<-c("A","B") ## can change to get not just 2 parts of the table but as many as needed.
WEB <- as.data.frame(getUrls(1996,2014,TabList)) # Only applies from 1996-2014. After 2014, the root url changes.
head(WEB)
# Example of how to extract data from a single webpage.
url <- 'https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year=1996&mmwr_week=20&mmwr_table=2A&request=Submit&mmwr_location='
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table, fill = TRUE)[[2]]
# Test if Legionellosis is in the table. Returns the column indices where the text is found.
# Can use this command to filter only the pages you need and select only those columns.
test <- grep("Leg", sb)
sb <- sb[, c(1, test)]
### This code only works if the table has three heading rows. Needs to be adapted to be more general for all tables.
#Get Column names
colnames(sb) <- paste(sb[2,], sb[3,], sep="_")
colnames(sb)[1] <- "Area"
sb <- sb[-c(1:3),]
#Remove commas from numbers so that you can then convert columns to numerical values. Only important if numbers above 1000
Dat <- sapply(sb, FUN = function(x)
  as.character(gsub(",", "", as.character(x), fixed = TRUE)))
Dat <- as.data.frame(Dat, stringsAsFactors = FALSE)
However, the code is not finished, and I thought it might be best to use the API instead, since the structure and layout of the tables on the web pages change. That way we wouldn't have to comb through the tables to figure out when the layout changes and adjust the web-scraping code accordingly. So I attempted to pull the data from the API.
I found two help documents from the CDC on accessing the data. One appears to cover data from 2014 onward using RSocrata (see here), while the other is more general and uses XML-formatted requests over HTTP (see here). The XML request over HTTP requires a database ID, which I could not find. I then stumbled onto RSocrata and decided to try that instead, but the code snippet provided, along with the token ID I set up, did not work.
install.packages("RSocrata")
library("RSocrata")
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au?$$app_token=tdWMkm9ddsLc6QKHvBP6aCiOA")
How can I fix this? My end goal is a table of legionellosis cases from 1996 to 2016 on a weekly basis by state.
I'd recommend checking out this issue thread in the RSocrata GitHub repo where they're discussing a similar issue with passing tokens into the RSocrata library.
In the meantime, you can actually leave off the $$app_token parameter, and as long as you're not flooding us with requests, it'll work just fine. There's a throttling limit you can sneak under without using an app token.
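In practice, that is just the same call without the token, for example (resource URL taken from the question):

library(RSocrata)

# Same dataset as in the question, read without the $$app_token parameter
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au")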
For a little project of my own, I'm trying to get the results from some races.
I can access the pages with the results and download the data from the table on each page. However, there are only 20 results per page. Luckily, the web addresses are built logically, so I can generate them and, in a loop, access these pages and download the data. The problem is that each category has a different number of racers and therefore a different number of pages, and I want to avoid manually checking how many racers there are in each category.
My first thought was to just generate a lot of links, making sure there are enough (based on the total number of racers) to get all the data.
library(XML)  # readHTMLTable() below comes from the XML package

nrs <- rep(seq(1, 5, 1), 2)
sex <- c("M", "M", "M", "M", "M", "F", "F", "F", "F", "F")
links <- NULL

# Loop to create 10 links: 5 for the male age group 18-24, 5 for the female age group 18-24.
# However, only 3 pages in the male age group contain a table.
for (i in 1:length(nrs)) {
  links[i] = paste("http://www.ironman.com/triathlon/events/americas/ironman/texas/results.aspx?p=",
                   nrs[i], "&race=texas&rd=20160514&sex=", sex[i], "&agegroup=18-24&loc=", sep = "")
}
resultlist <- list()  # create empty list to store results

for (i in 1:length(links)) {
  results = readHTMLTable(links[i],
                          as.data.frame = TRUE,
                          which = 1,
                          stringsAsFactors = FALSE,
                          header = TRUE)  # get data
  resultlist[[i]] <- results  # combine results in one big list
}

results = do.call(rbind, resultlist)  # combine results into a data frame
As you can see, readHTMLTable throws an error as soon as it encounters a page with no table, and the loop stops.
I thought of two possible solutions.
1) Somehow check whether each link exists. I tried url.exists from the RCurl package, but that doesn't work: it returns TRUE for every page, because the page does exist, it just doesn't contain a table (so for my purposes it's a false positive). I would need some code that checks whether the page contains a table, but I don't know how to go about that.
2) Suppress the error from readHTMLTable so the loop continues, but I'm not sure whether that's possible.
Any suggestions for these two methods, or any other suggestions?
I think method #2 is easier. I modified your code with tryCatch, one of R's built-in exception-handling mechanisms, and it works for me.
P.S. I would recommend using rvest for web scraping like this.
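For reference, a sketch of what that tryCatch modification might look like (not necessarily the answerer's exact code), reusing the links vector from the question; pages without a table simply yield NULL and are ignored when the results are bound together:

library(XML)

resultlist <- list()  # create empty list to store results

for (i in 1:length(links)) {
  results <- tryCatch(
    readHTMLTable(links[i],
                  as.data.frame = TRUE,
                  which = 1,
                  stringsAsFactors = FALSE,
                  header = TRUE),
    error = function(e) NULL  # a page without a table yields NULL instead of stopping the loop
  )
  if (!is.null(results)) {
    resultlist[[i]] <- results
  }
}

results <- do.call(rbind, resultlist)  # skipped pages leave NULL gaps, which rbind ignores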
I'm trying to grab the most recent tweets from multiple users. I have already registered my application and have the requisite keys and tokens.
I know that for a single user, the command is:
recent <- twListToDF(userTimeline("**twitterID**",n=15))
However, I'm unsure how to grab the tweets for multiple IDs and how to combine them into a data frame.
I tried:
targets <- c("a","b","c")
recent <- twListToDF(userTimeline("targets",n=15))
where a, b, and c are IDs, but I get the error message:
Error in twInterfaceObj$doAPICall(cmd, params, method, ...) :
Not Found (HTTP 404).
It doesn't seem to matter whether targets is surrounded by quotes or not. Is there a simple way to grab tweets from multiple IDs, or do I need to put them in a vector and iterate through it, etc.?
I figured it out, so I thought I'd share my solution to the problem.
Assuming I have the same list of screen names to pull Tweets from, called targets.
I put the output into a list, called output, with each entry corresponding to the tweets from the same index value in targets, i.e. output[[1]] contains all the tweets from targets[1], with one tweet per row.
num <- length(targets)
output <- vector("list", num)

for (i in 1:num){
  output[[i]] <- getTweets(targets[i])  # double brackets so each list element holds a full data frame
}
getTweets is a wrapper around twListToDF(userTimeline(handle, n=15)).
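In other words, the helper presumably looks something like this (a sketch; the exact definition wasn't posted):

# Pull the 15 most recent tweets for one handle and return them as a data frame
getTweets <- function(handle) {
  twListToDF(userTimeline(handle, n = 15))
}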
To put all the tweets into a single data frame, complete with other info:
masterFrame <- data.frame()

for (i in 1:num){
  tempFrame <- getTweets(targets[i])
  masterFrame <- rbind(masterFrame, tempFrame)
}
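An equivalent, more compact way to build the same combined data frame (using the getTweets helper above):

# Fetch tweets for every handle and bind them into one data frame in a single step
masterFrame <- do.call(rbind, lapply(targets, getTweets))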
Is there a method to extract tweets for a specific time span, rather than by number of tweets, using searchTwitter? I am able to fetch a certain number of tweets for Nordstrom but have not been successful in doing so for specific dates.
library('twitteR')
nord1<- searchTwitteR("#nordstrom", n= 1000) #works fine
nord2<- searchTwitteR('nordstrom', since = '2012-01-01', until = '2015-11-13')
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
25 tweets were requested but the API can only return 0
The Twitter search API only allows access to the most recent tweets (roughly the last 6-9 days). I was searching for much earlier dates, which is why I ran into this issue.
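So with searchTwitteR the since/until dates have to fall inside that recent window, for example (the dates below are placeholders; use dates from the last few days):

library(twitteR)

# since/until must lie within the window the standard search API actually covers (~the last week)
nord3 <- searchTwitteR('nordstrom', n = 100, since = '2015-11-10', until = '2015-11-13')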