A continuation of... Extracting data from an API using R

I'm super new at this and working in R for my thesis. The code in this answer finally worked for me (Extracting data from an API using R), but I can't figure out how to add a loop to it. I keep getting the first page of the API when I need all 3360 pages.
Here's the code:
library(httr)
library(jsonlite)
r1 <- GET("http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s#soktraff")
r2 <- rawToChar(r1$content)
class(r2)
r3 <- fromJSON(r2)
r4 <- r3$dokumentlista$dokument
By the time I reach r4, it's already a data frame.
Please and thank you!
Edit: originally, I couldn't find a URL that included the page number as a parameter. Now I have one (below), but I still haven't been able to loop over it.
"http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p="

I think you can extract the url of the next page from r3 as follows:
next_url <- r3$dokumentlista$`#nasta_sida`
# you need to re-check this, but sometimes I get white space inside the url;
# you may not face this problem, but in any case this line of code solved the issue
next_url <- gsub(' ', '', next_url)
GET(next_url)
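For example, here is a minimal sketch of a loop that keeps following the next-page URL until it runs out. It assumes `#nasta_sida` is absent (NULL) on the last page; check a couple of pages to confirm that before running the whole thing:
all_pages <- list()
current_url <- "http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s"
i <- 1
while (!is.null(current_url)) {
  resp <- GET(current_url)
  parsed <- fromJSON(rawToChar(resp$content))
  all_pages[[i]] <- parsed$dokumentlista$dokument
  # follow the link to the next page, stripping any stray spaces
  current_url <- parsed$dokumentlista$`#nasta_sida`
  if (!is.null(current_url)) current_url <- gsub(' ', '', current_url)
  i <- i + 1
}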
Update
I tried the URL with the page number for the first 10 pages and it worked:
my_dfs <- lapply(1:10, function(i){
  # build the URL for page i
  my_url <- paste0("http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p=", i)
  r1 <- GET(my_url)
  r2 <- rawToChar(r1$content)
  r3 <- fromJSON(r2)
  r4 <- r3$dokumentlista$dokument
  return(r4)
})
Update 2:
The extracted data frames are complex (e.g. some columns are lists of data frames), so a simple rbind will not work here. You'll have to do some pre-processing before you stack the data together; something like this would work:
library(dplyr)  # for the pipe operator
my_dfs %>% lapply(function(df_0){
  # Do some stuff here with the data, and choose the variables you need
  # I chose the first 10 columns to check that I got 200 different observations
  df_0[1:10]
}) %>% do.call(rbind, .)

Related

R Text data scraper loop through dates

I'm doing a little project where the goal is to retrieve data in text format from a website (http://regsho.finra.org/regsho-Index.html).
The website kindly provides the data online, but it is split across several days, each under a different link.
I thought about looping through the dates and storing the data with the following code:
#Download the needed data
my_data <- c()
for (i in 01:13){
  my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
}
head(my_data)
The problem here is twofold. In this line:
for (i in 01:13){ # the dates on the website are 01, 02, 03, ... and the loop seems to omit the leading 0
I've used sprintf() so I can insert the loop variable into the URL string.
And in this line, my_data is overwritten on every iteration, so it always ends up holding only the last data downloaded:
my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
Could somebody reassure me that I'm going in the right direction? I'm starting to doubt myself here.
Any help would be greatly appreciated!
Thanks in advance
This should give you a leading 0 without using an extra package:
sprintf("%02d", i)
i.e.
sprintf("http://regsho.finra.org/CNMSshvol202005%02d.txt", i)
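The overwriting is a separate issue: my_data needs to accumulate the results instead of being reassigned each time. A minimal sketch, assuming all 13 files exist and share the same columns:
# read each day's file into a list, then stack the pieces into one data frame
daily <- lapply(1:13, function(i) {
  url <- sprintf("http://regsho.finra.org/CNMSshvol202005%02d.txt", i)
  read.delim(url, header = TRUE, sep = "|")
})
my_data <- do.call(rbind, daily)
head(my_data)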

Faster download for paginated nested JSON data from API in R?

I am downloading nested JSON data from the UN's SDG Indicators API, but using a loop over 50,006 pages is way too slow to ever complete. Is there a better way?
https://unstats.un.org/SDGAPI/swagger/#!/Indicator/V1SdgIndicatorDataGet
I'm working in RStudio on a Windows laptop. Getting to the json nested data and structuring into a dataframe was a hard-fought win, but dealing with the paginations has me stumped. No response from the UN statistics email.
Maybe an 'apply' would do it? I only need data from 2004, 2007, and 2011 - maybe I can filter, but I don't think that would help the fundamental issue.
I'm probably misunderstanding the API structure - I can't see how querying 50,006 pages individually can be functional for anyone. Thanks for any insight!
library(dplyr)
library(httr)
library(jsonlite)
#Get data from the first page, and initialize dataframe and data types
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data", flatten = TRUE)
#Get number of pages from the field of the first page
pages <- page1$totalPages
SDGdata<- data.frame()
for(j in 1:25){
  SDGdatarow <- rbind(page1$data[j,1:16])
  SDGdata <- rbind(SDGdata, SDGdatarow)
}
SDGdata[1] <- as.character(SDGdata[[1]])
SDGdata[2] <- as.character(SDGdata[[2]])
SDGdata[3] <- as.character(SDGdata[[3]])
#Loop through all the rest of the pages
baseurl <- ("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data")
for(i in 2:pages){
  mydata <- fromJSON(paste0(baseurl, "?page=", i), flatten=TRUE)
  message("Retrieving page ", i)
  for(j in 1:25){
    SDGdatarow <- rbind(mydata$data[j,1:16])
    rownames(SDGdatarow) <- as.numeric((i-1)*25+j)
    SDGdata <- rbind.data.frame(SDGdata, SDGdatarow)
  }
}
I do get the data I want, and in a nice dataframe, but inevitably the query has a connection issue after a few hundred pages, or my laptop shuts down etc. It's about 5 seconds a page. 5*50,006/3600 ~= 70 hours.
I think I (finally) figured out a workable solution: I can set the number of elements per page, which leaves a manageable number of pages to call. (I also filtered for just the three years I want, which reduces the data.) Through experimentation I found that about 1/10th of the elements download OK in a single request, so I set the page size to 1/10th of the total and loop over 10 pages. It takes about 20 minutes, but that's better than 70 hours, and it works without losing the connection.
#initiate the df
SDGdata<- data.frame()
# call to get the # elements with the years filter
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011", flatten = TRUE)
perpage <- ceiling(page1$totalElements/10)
ptm <- proc.time()
for(i in 1:10){
  SDGpage <- fromJSON(paste0("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011&pageSize=",perpage,"&page=",i), flatten = TRUE)
  message("Retrieving page ", i, " :", (proc.time() - ptm)[3], " seconds")
  SDGdata <- rbind(SDGdata, SDGpage$data[,1:16])
}
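If the connection still drops occasionally, one option (just a sketch, not part of the original answer) is to wrap each page request in tryCatch and retry a few times before giving up:
get_page <- function(i, tries = 3) {
  url <- paste0("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011&pageSize=", perpage, "&page=", i)
  for (attempt in 1:tries) {
    result <- tryCatch(fromJSON(url, flatten = TRUE), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(5)  # brief pause before retrying
  }
  stop("Page ", i, " failed after ", tries, " attempts")
}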

Mapping SIC to FamaFrench Industry Classification

I am working on a project where I have to map firms that have an SIC industry classification to the corresponding Fama-French industry classification. I have found that Ian Gow has graciously created a script to do this. The script is available from the following url: https://iangow.wordpress.com/2011/05/17/getting-fama-french-industry-data-into-r/
However, there is a glitch in either the script or the data set: for some reason it does not work with “Siccodes30.txt”. More specifically, it does not produce the correct result (mapping) for the lines related to “6726-6726 Unit inv trusts, closed-end” in “Siccodes30.txt”. I have been trying to figure out the source of the problem, but I have not been successful.
In the post below, I have included the original script (there is some room to make it more efficient) and I have added a few lines at the end to make it work with an online example.
Original script (I have removed comments to make the post shorter). Again, this is not my script (the original is at https://iangow.wordpress.com/2011/05/17/getting-fama-french-industry-data-into-r/):
url4FF <- paste("http://mba.tuck.dartmouth.edu",
"pages/faculty/ken.french/ftp",
"Industry_Definitions.zip", sep="/")
f <- tempfile()
download.file(url4FF, f)
fileList <- unzip(f,list=TRUE)
trim <- function(string) {
  ifelse(grepl("^\\s*$", string, perl=TRUE), "",
         gsub("^\\s*(.*?)\\s*$", "\\1", string, perl=TRUE))
}
extract_ff_ind_data <- function (file) {
  ff_ind <- as.vector(read.delim(unzip(f, files=file), header=FALSE,
                                 stringsAsFactors=FALSE))
  ind_num <- trim(substr(ff_ind[,1], 1, 10))
  for (i in 2:length(ind_num)) {
    if (ind_num[i] == "") ind_num[i] <- ind_num[i-1]
  }
  sic_detail <- trim(substr(ff_ind[,1], 11, 100))
  is.desc <- grepl("^\\D", sic_detail, perl=TRUE)
  regex.ind <- "^(\\d+)\\s+(\\w+).*$"
  ind_num <- gsub(regex.ind, "\\1", ind_num, perl=TRUE)
  ind_abbrev <- gsub(regex.ind, "\\2", ind_num[is.desc], perl=TRUE)
  ind_list <- data.frame(ind_num=ind_num[is.desc], ind_abbrev,
                         ind_desc=sic_detail[is.desc])
  regex.sic <- "^(\\d+)-(\\d+)\\s*(.*)$"
  ind_num <- ind_num[!is.desc]
  sic_detail <- sic_detail[!is.desc]
  sic_low <- as.integer(gsub(regex.sic, "\\1", sic_detail, perl=TRUE))
  sic_high <- as.integer(gsub(regex.sic, "\\2", sic_detail, perl=TRUE))
  sic_desc <- gsub(regex.sic, "\\3", sic_detail, perl=TRUE)
  sic_list <- data.frame(ind_num, sic_low, sic_high, sic_desc)
  return(merge(ind_list, sic_list, by="ind_num", all=TRUE))
}
FFID_30 <- extract_ff_ind_data("Siccodes30.txt")
I have added the following lines to allow testing the script:
library(gsheet)
url <-"https://docs.google.com/spreadsheets/d/1QRv8YmJv0pdhIVmkXMQC7GQuvXV21Kyjl9pVZsSPEAk/gid=1758600626"
companiesSIC <- read.csv(text=gsheet2text(url, format='csv'), stringsAsFactors=FALSE)
names(companiesSIC)
library(sqldf)
companiesFFID_30 <- sqldf("SELECT a.gvkey, a.SIC, b.ind_desc AS FF30,
                                  b.ind_num AS FFIndNUm30
                           FROM companiesSIC AS a
                           LEFT JOIN FFID_30 AS b
                           ON a.sic BETWEEN b.sic_low AND b.sic_high")
companiesFFID_30
The results on rows 141 and 142 are wrong: instead of an industry number they contain a string.
Thanks
PS: As I said, there is room to make the script shorter (e.g., you don't need a separate function to remove white space; you can use trimws), but to give credit to the original author I kept the script in its original form. Whoever solves the problem should feel free to update the rest of the script too.
There is nothing wrong with the script. The problem is in the formatting of the two lines (141 and 142) of the txt file.
I opened the text file with a text editor, deleted and re-typed the content of those two lines. When I re-ran the R script, the problem was gone.
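If you'd rather not edit the file by hand, here is a sketch of a programmatic clean-up. It assumes the offending lines contain non-standard whitespace (e.g. non-breaking spaces or tabs), which is only a guess at the cause; note that extract_ff_ind_data would then need to read the cleaned file directly (read.delim(file, ...)) instead of unzipping it again:
txt_path <- unzip(f, files = "Siccodes30.txt")
raw_lines <- readLines(txt_path, warn = FALSE)
clean_lines <- gsub("\u00a0|\t", " ", raw_lines)  # non-breaking spaces and tabs -> plain spaces
writeLines(clean_lines, txt_path)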

List of Data Frames to One Data Frame

Disclaimer: I know that this question has been asked before. The answer provided there worked for me in the past, but for some reason it has stopped working now.
I am pulling Marketing email statistics from the Mailchimp API. I have been doing this for the last half year or so. However, in the past 2 months, I believe the structure of what I pull has changed and thus, my code no longer works and I cannot figure out why. I believe it has something to do with the nested data frames within my list of data frames that I receive.
Here is an example of my code and the resulting list of data frames. I have removed sensitive information from my code and image:
library(httr)
library(jsonlite)
library(plyr)
#Opens-----------
opens1 <- GET("https://us4.api.mailchimp.com/3.0/reports/***ReportNumber***/sent-to?count=4000",authenticate('***My Company***', '***My-Password***'))
opens1 <- content(opens1,"text")
opens1 <- fromJSON(opens1)
Then I run opens1 <- ldply(opens1, data.frame), and I receive the following error:
Error in allocate_column(df[[var]], nrows, dfs, var) :
Data frame column 'merge_fields' not supported by rbind.fill
I tried using and looking up rbind.fill() and the other methods described in the linked answer at the top of my post, to no avail. What am I interpreting incorrectly about the merge_fields variable, or am I way off, and how do I correct it?
I'm just trying to get one data frame of all of the variables from the opens1 list.
Thanks for any and all help, and please, feel free to ask any clarification questions!
At a quick glance, this seems to work for me:
library(httr)
campaign_id <- "-------"
apikey = "------"
url <- sprintf("https://us1.api.mailchimp.com/3.0/reports/%s/sent-to", campaign_id)
opens <- GET(url, query = list(apikey = apikey, count = 4000L))
lst <- rjson::fromJSON(content(opens, "text"))
df <- dplyr::bind_rows(
  lapply(lst$sent_to, function(x)
    as.data.frame(t(unlist(x)), stringsAsFactors = FALSE)
  ))
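If you would rather stay with jsonlite, another sketch (assuming merge_fields is the nested data-frame column that trips up rbind.fill) is to parse with flatten = TRUE so nested data frames are spread into ordinary columns:
library(httr)
library(jsonlite)
resp <- GET("https://us4.api.mailchimp.com/3.0/reports/***ReportNumber***/sent-to?count=4000",
            authenticate('***My Company***', '***My-Password***'))
opens1 <- fromJSON(content(resp, "text"), flatten = TRUE)
# sent_to should now be a flat data frame, with merge_fields.* spread into columns
str(opens1$sent_to)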

Darksky api in R loop not working

I am a huge R fan, but it never seems to work out for me. I am trying to use an API to get weather data, but I cannot get the loop to work. I have all the codes in the right format, but when I import the file into R, the cells appear like
-33.86659241, 151.2081909, \"2014-10-01T02:00:00"\
and this is preventing me from running the code. Without a working loop, my fallback is a mail merge to create 5000 lines of code. Any help would be really appreciated.
tmp <- get_forecast_for(-33.86659241, 151.2081909, "2014-10-01T02:00:00", add_headers=TRUE)
fdf <- as.data.frame(tmp)
fdf$ID <- "R_3nNli1Hj2mlvFVo"
fd <- rbind(fd,fdf)
Here is the code with the loop:
df <- read.csv("~/Machine Learning/Darksky.csv", header=T,sep=",", fill = TRUE)
for(i in 1:length(df$DarkSky)){
  fdf <- get_forecast_for(df$LocationLatitude[i], df$LocationLongitude[i], df$DarkSky[i], add_headers=TRUE)
  fdf <- as.data.frame(fdf)
  fdf <- fdf[1:2,]
  fd <- rbind(fd,fdf)
}
I also wanted to rbind the retrieved data onto a data frame, but that does not work, and to cbind an identifier (the value in df$DarkSky[i]), but that does not work either.
CSV -
LocationLatitude LocationLongitude DarkSky
-33.86659241 151.2081909 "2014-10-01T02:00:00"
The get_forecast_for function takes three parameters: the latitude, the longitude, and the date, structured as above. I have the loop working for latitude and longitude, but the time/date part is not working.
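A minimal sketch of one possible fix, under a few assumptions: fd has to be initialised before the loop, the stray quote characters need to be stripped from the timestamps, and each call returns the same columns (dplyr::bind_rows would be more forgiving if they differ):
library(darksky)
df <- read.csv("~/Machine Learning/Darksky.csv", header = TRUE, sep = ",", fill = TRUE,
               stringsAsFactors = FALSE)
# strip any stray quote characters from the timestamp column
df$DarkSky <- gsub('"', '', df$DarkSky)
results <- vector("list", length(df$DarkSky))
for (i in seq_along(df$DarkSky)) {
  tmp <- get_forecast_for(df$LocationLatitude[i], df$LocationLongitude[i], df$DarkSky[i],
                          add_headers = TRUE)
  tmp <- as.data.frame(tmp)[1:2, ]
  tmp$ID <- df$DarkSky[i]   # carry the identifier along instead of cbind-ing afterwards
  results[[i]] <- tmp
}
fd <- do.call(rbind, results)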
