I'm trying to pull data from the FTC's API in R, but the results are limited to 50 rows. How can I get more than 50 rows from the API?
You can find API details at the following link. https://www.ftc.gov/developer/api/v0/endpoints/do-not-call-dnc-reported-calls-data-api
library(httr)
library(jsonlite)
library(dplyr)  # provides %>%

myapikey <- "my-api-key"
# The key is passed through `query`, so it is not repeated in the URL.
URL <- "https://api.ftc.gov/v0/dnc-complaints"
get.data <- GET(URL, query=list(api_key=myapikey, created_date="2021-01-10"))
ftc.data <- content(get.data)
jsoncars <- toJSON(ftc.data$data, pretty=TRUE)
ftc <- fromJSON(jsoncars, flatten = TRUE) %>% data.frame()
I was looking at the same thing. The page says: "Maximum number of records to include in JSON response. By default, the endpoint displays a maximum of 50 records per request – this is also the maximum value allowed. All non-empty responses include pagination metadata, which can be used to iterate over a large number of records, by sending a GET request for each page of records." So far I have not been able to figure out a workaround either.
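The pagination described in that quote can be sketched as a loop that requests one page at a time. This is only a sketch under assumptions: it assumes the endpoint accepts `items_per_page` and `offset` query parameters and returns the records under `data`, which should be checked against the API page linked in the question. `fetch_all_pages()` and `ftc_fetch_page()` are hypothetical helper names, not part of httr or the FTC API.

```r
# Generic offset pagination: keep requesting pages until one comes back
# empty or shorter than the page size.
fetch_all_pages <- function(fetch_page, page_size = 50, max_pages = 1000) {
  pages <- list()
  offset <- 0
  for (i in seq_len(max_pages)) {
    records <- fetch_page(offset, page_size)   # list of records for this page
    if (length(records) == 0) break
    pages[[length(pages) + 1]] <- records
    if (length(records) < page_size) break     # last (partial) page
    offset <- offset + page_size
  }
  do.call(c, pages)                            # one flat list of records
}

# Hypothetical fetcher for the FTC endpoint. ASSUMPTION: the parameter names
# items_per_page and offset, and the `data` field, match the linked docs.
ftc_fetch_page <- function(offset, page_size) {
  resp <- httr::GET(
    "https://api.ftc.gov/v0/dnc-complaints",
    query = list(api_key = "my-api-key", created_date = "2021-01-10",
                 items_per_page = page_size, offset = offset)
  )
  httr::stop_for_status(resp)
  httr::content(resp)$data
}

# Not run here (needs network access and a real key):
# all_records <- fetch_all_pages(ftc_fetch_page)
```

Passing the fetcher in as an argument keeps the loop independent of the API, so it can be tried with a fake fetcher before spending real requests.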
I am scraping OpenFDA (https://open.fda.gov/apis). I know my particular query has 6974 hits, which are organized into 100 hits per page (the maximum download the API allows). I am trying to use R (rvest, jsonlite, purrr, tidyverse, httr) to download all of this data.
I checked the website information with curl in the terminal and downloaded a couple of pages to see a pattern.
I've tried a few lines of code and I can only get 100 entries to download. This code seems to work decently, but it will only pull 100 entries, that is, one page. To skip the first 100, which I can pull down and merge later, here is the code that I have used:
url_json <- "https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=100&skip=100"
raw_json <- httr::GET(url_json, httr::accept_json())
data <- httr::content(raw_json, "text")
my_content_from_json <- jsonlite::fromJSON(data)
dplyr::glimpse(my_content_from_json)
dataframe1 <- my_content_from_json$results
view(dataframe1)
SOLUTION below in the responses. Thanks!
From the comments:
It looks like the API parameters skip and limit work better than the search_after parameter. According to the documentation (open.fda.gov/apis/query-parameters), they allow pulling down up to 1,000 entries per request. To pass these parameters in the query string, an example URL would be
https://api.fda.gov/drug/label.json?api_key=YOULLHAVETOGETAKEY&search=grapefruit&limit=1000&skip=0
after which you can loop to get the remaining entries with skip=1000, skip=2000, etc. as you've done above.
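A minimal sketch of that loop, assuming the 6974 hits and the placeholder key and search term from the question. `skip_offsets()` and `build_urls()` are hypothetical helper names introduced here for illustration.

```r
# Offsets for the skip parameter: 0, limit, 2 * limit, ... below total_hits.
skip_offsets <- function(total_hits, limit = 1000) {
  seq(0, total_hits - 1, by = limit)
}

# Build one URL per page. The key and search term are the placeholders
# used in the question.
build_urls <- function(total_hits, limit = 1000,
                       api_key = "YOULLHAVETOGETAKEY", search = "grapefruit") {
  sprintf("https://api.fda.gov/drug/label.json?api_key=%s&search=%s&limit=%d&skip=%d",
          api_key, search, as.integer(limit),
          as.integer(skip_offsets(total_hits, limit)))
}

urls <- build_urls(6974)   # skip = 0, 1000, ..., 6000

# Download and combine (not run here, needs network access and a real key):
# pages <- lapply(urls, function(u) {
#   jsonlite::fromJSON(httr::content(httr::GET(u), "text"), flatten = TRUE)$results
# })
# all_results <- dplyr::bind_rows(pages)
```

With 6974 hits and limit=1000 this makes seven requests. The final bind_rows() step may need adjusting if the pages come back with differently typed or nested columns.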
I'm using this guide as an example to scrape the time that posts were published to Reddit.
It says to use the SelectorGadget tool to avoid having to learn other languages, so that's what I did.
Although the page on old.reddit.com shows 100 posts (so 100 different times should be recorded), my code extracts only 25 different time values. Here's what my code looks like:
library(rvest)
library(dplyr)  # for bind_rows()
url <- 'https://old.reddit.com/'
rawdata <- read_html(url)
rawtime <- html_nodes(rawdata, '.live-timestamp')
#".live-timestamp" was obtained using the Chrome extension "SelectorGadget"
# html_attrs() is rvest's own accessor, so xml2 does not need to be attached
finalresult <- bind_rows(lapply(html_attrs(rawtime), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))
Alternatively, you could use PRAW, a Python wrapper for the Reddit API, to get the information from Reddit. It is not a direct fix for your R code, but it might work.
https://praw.readthedocs.io/en/latest/
You can also ask in the subreddit r/redditdev.
You need to be logged in or use the ?limit=100 parameter in order to get 100 items in a listing.
See the API documentation for more information:
limit: the maximum number of items desired (default: 25, maximum: 100)
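For the scraping code in the question, that amounts to adding the parameter to the URL before calling read_html(). A small sketch follows; `with_limit()` is a hypothetical helper written for this example, not an rvest function.

```r
# Append a limit to a listing URL. Per the quoted API documentation the
# value must be between 1 and 100 (the default is 25).
with_limit <- function(url, limit = 100) {
  stopifnot(limit >= 1, limit <= 100)
  sep <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
  paste0(url, sep, "limit=", as.integer(limit))
}

url <- with_limit("https://old.reddit.com/")
# url is "https://old.reddit.com/?limit=100"

# Then scrape as in the question (not run here, needs network access):
# rawtime <- rvest::html_nodes(rvest::read_html(url), ".live-timestamp")
```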
I'm trying to scrape some data using this code.
require(XML)
tables <- readHTMLTable('http://fantasynba.movistarplus.es/basketball/reports/player_rankings.asp')
str(tables, max.level = 1)
df <- tables$searchResults
It works perfectly, but the problem is that it only gives me data for the first 188 observations, which correspond to the players whose position is "Base". Whenever I try to get data for "Pivot" or "Alero" players, it gives me the same info. Since the URL never changes, I don't know how to get this info.
My goal is to obtain a time series of legionellosis cases, from week 1 of 1996 to week 46 of 2016, from this website supported by the United States Centers for Disease Control and Prevention (CDC). A coworker attempted to scrape only the tables that contain legionellosis cases with the code below:
#install.packages('rvest')
library(rvest)
## Code to get all URLS
getUrls <- function(y1,y2,clist){
  root="https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year="
  root1="&mmwr_week="
  root2="&mmwr_table=2"
  root3="&request=Submit&mmwr_location="
  urls <- NULL
  for (year in y1:y2){
    for (week in 1:53){
      for (part in clist) {
        urls <- c(urls,(paste(root,year,root1,week,root2,part,root3,sep="")))
      }
    }
  }
  return(urls)
}
TabList<-c("A","B") ## can change to get not just 2 parts of the table but as many as needed.
WEB <- as.data.frame(getUrls(1996,2014,TabList)) # Only applies from 1996-2014. After 2014, the root url changes.
head(WEB)
#Example of how to extract data from a single webpage.
url <- 'https://wonder.cdc.gov/mmwr/mmwr_1995_2014.asp?mmwr_year=1996&mmwr_week=20&mmwr_table=2A&request=Submit&mmwr_location='
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table, fill = TRUE)[[2]]
#test if Legionellosis is in the table. Returns a vector showing the columns index if the text is found.
#Can use this command to filter only pages that you need and select only those columns.
test <- grep("Leg", sb)
sb <- sb[,c(1,test)]
### This code only works if you have 3 rows for headings. Need to adapt to be more general for all tables.
#Get Column names
colnames(sb) <- paste(sb[2,], sb[3,], sep="_")
colnames(sb)[1] <- "Area"
sb <- sb[-c(1:3),]
#Remove commas from numbers so that you can then convert columns to numerical values. Only important if numbers above 1000
Dat <- sapply(sb, FUN = function(x)
  as.character(gsub(",", "", as.character(x), fixed = TRUE)))
Dat<-as.data.frame(Dat, stringsAsFactors = FALSE)
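The comma removal above leaves every column as character. A minimal sketch of the follow-up conversion to numbers is below. It assumes the first column holds the area names and that non-numeric cells (for example a "-" placeholder) should become NA. `to_numeric_cols()` is a hypothetical helper and the example data frame contains made-up values, not CDC data.

```r
to_numeric_cols <- function(df, keep = 1) {
  num_cols <- setdiff(seq_along(df), keep)
  df[num_cols] <- lapply(df[num_cols], function(x) {
    # Strip thousands separators, then convert; non-numeric cells become NA.
    suppressWarnings(as.numeric(gsub(",", "", as.character(x), fixed = TRUE)))
  })
  df
}

# Made-up values, only to show the conversion.
example <- data.frame(Area = c("UNITED STATES", "NEW ENGLAND"),
                      Cum_1996 = c("1,234", "56"),
                      Cum_1995 = c("987", "-"),
                      stringsAsFactors = FALSE)
str(to_numeric_cols(example))
```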
However, the code is not finished, and I thought it might be best to use the API, since the structure and layout of the tables in the webpages change. This way we wouldn't have to comb through the tables to figure out when the layout changes and how to adjust the web scraping code accordingly. Thus I attempted to pull the data from the API.
I found two help documents from the CDC that provide the data. One appears to provide data from 2014 onward, which can be seen here using RSocrata, while the other appears to be more general and uses XML-format requests over HTTP, which can be seen here. The XML-format request over HTTP requires a database ID, which I could not find. Then I stumbled onto RSocrata and decided to try that instead. But the code snippet provided, along with the token ID I set up, did not work.
install.packages("RSocrata")
library("RSocrata")
df <- read.socrata("https://data.cdc.gov/resource/cmap-p7au?$$app_token=tdWMkm9ddsLc6QKHvBP6aCiOA")
How can I fix this? My end goal is a table of legionellosis cases from 1996 to 2016 on a weekly basis by state.
I'd recommend checking out this issue thread in the RSocrata GitHub repo where they're discussing a similar issue with passing tokens into the RSocrata library.
In the meantime, you can actually leave off the $$app_token parameter, and as long as you're not flooding us with requests, it'll work just fine. There's a throttling limit you can sneak under without using an app token.
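A sketch of both options: dropping the token from the URL, or passing it through the `app_token` argument of read.socrata() instead of the query string. `strip_app_token()` is a hypothetical helper written for this example, and "YOUR_TOKEN" is a placeholder. Whether the dataset URL then loads as-is has not been verified here.

```r
# Remove a "$$app_token=..." query parameter from a Socrata URL,
# then tidy up a dangling "?" or "&".
strip_app_token <- function(url) {
  url <- gsub("([?&])\\$\\$app_token=[^&]*&?", "\\1", url)
  sub("[?&]$", "", url)
}

url <- strip_app_token("https://data.cdc.gov/resource/cmap-p7au?$$app_token=YOUR_TOKEN")
# url is now "https://data.cdc.gov/resource/cmap-p7au"

# Not run here (needs network access and the RSocrata package):
# df <- RSocrata::read.socrata(url)                            # no token
# df <- RSocrata::read.socrata(url, app_token = "YOUR_TOKEN")  # token as an argument
```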
I am using the twitteR package in R to extract tweets based on their ids.
But I am unable to do this for multiple tweet ids without hitting either a rate limit or an error 404.
This is because I am using showStatus(), which takes one tweet id at a time.
I am looking for a function similar to getStatuses() that accepts multiple tweet ids per request.
Is there an efficient way to do this?
As far as I know, only 60 requests can be made in a 15-minute window using OAuth.
So, how do I ensure that:
1. Multiple tweet ids are retrieved in a single request, with the requests repeated as needed.
2. The rate limit is kept in check.
3. Errors for tweets that are not found are handled.
P.S.: This activity is not user based.
Thanks
I have come across the same issue recently. For retrieving tweets in bulk, Twitter recommends using the lookup-method provided by its API. That way you can get up to 100 tweets per request.
Unfortunately, this has not been implemented in the twitteR package yet; so I've tried to hack together a quick function (by re-using lots of code from the twitteR package) to use that API method:
lookupStatus <- function (ids, ...){
  lapply(ids, twitteR:::check_id)
  batches <- split(ids, ceiling(seq_along(ids)/100))
  results <- lapply(batches, function(batch) {
    params <- parseIDs(batch)
    statuses <- twitteR:::twInterfaceObj$doAPICall(paste("statuses", "lookup",
                                                         sep = "/"),
                                                   params = params, ...)
    twitteR:::import_statuses(statuses)
  })
  return(unlist(results))
}
parseIDs <- function(ids){
  id_list <- list()
  if (length(ids) > 0) {
    id_list$id <- paste(ids, collapse = ",")
  }
  return(id_list)
}
Make sure that your vector of ids is of class character (otherwise there can be some problems with very large IDs).
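A short illustration of why, as a sketch: R stores plain numbers as doubles, which cannot represent every 18-digit integer exactly, so two different IDs can collapse into the same value. The second ID below is made up by adding 1 to the first.

```r
# As numeric literals, two neighbouring 18-digit IDs are stored as the same double.
a <- 432656548536401920
b <- 432656548536401921
a == b            # TRUE: the difference is lost

# Kept as character strings, the two IDs stay distinct.
ids <- c("432656548536401920", "432656548536401921")
ids[1] == ids[2]  # FALSE
```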
Use the function like this:
ids <- c("432656548536401920", "332526548546401821")
tweets <- lookupStatus(ids, retryOnRateLimit=100)
Setting a high retryOnRateLimit ensures you get all your tweets, even if your vector of IDs has more than 18,000 entries (100 IDs per request, 180 requests per 15-minute window).
As usual, you can turn the tweets into a data frame with twListToDF(tweets).
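To make the batching and the 18,000 figure concrete, here is a small sketch using stand-in IDs. `requests_needed()` and `windows_needed()` are hypothetical helpers, and the limits of 100 IDs per request and 180 requests per window are the ones stated above.

```r
ids <- as.character(seq_len(250))          # stand-in IDs, not real tweets
batches <- split(ids, ceiling(seq_along(ids) / 100))
lengths(batches)                           # batch sizes: 100, 100, 50

# Requests and 15-minute windows needed for n IDs, using the limits above.
requests_needed <- function(n, per_request = 100) ceiling(n / per_request)
windows_needed  <- function(n, per_window = 180) ceiling(requests_needed(n) / per_window)

requests_needed(18000)   # 180 requests, exactly one window
windows_needed(18001)    # 2 windows, so the call has to wait and retry
```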