using for loops with string variables - r

I'm using R to download data from an API that uses a key. I've downloaded the data for AK into a data frame called officials, and I would like to download the data for the remaining states, using rbind to add each state to officials. But the format of the API call requires the state abbreviation without quotes: that is, stateId=AK, not stateId="AK". Is there a way to do this? I tried the code below and then realized my error in the GET command specifying stateId: the URL contains the literal text states[i] rather than the value AL.
states <- c("AL","AR","AZ","CA","CO","CT")
for (i in 1:length(states)) {
  temp_raw <- GET("http://api.votesmart.org/Officials.getByOfficeTypeState?key=xxx&officeTypeId=L&stateId=states[i]&o=JSON")
  my_content <- content(temp_raw, as = 'text')
  my_content2 <- fromJSON(my_content)
  temp_officials <- my_content2$candidate$candidate
  officials2022 <- rbind(officials2022, temp_officials)
}

Try this variation, which uses paste0() to combine the strings into the URL:
Also, notice the simplified way to perform a for loop over states, where each abbreviation is directly available as i.
Edit: forgot the GET
states <- c("AL","AR","AZ","CA","CO","CT")
for (i in states) {
  temp_raw <- GET(paste0("http://api.votesmart.org/Officials.getByOfficeTypeState?key=xxx&officeTypeId=L&stateId=", i, "&o=JSON"))
  ...
}
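For completeness, here is a sketch of the full loop with the body filled back in from the question's code (assuming officials2022 already holds the AK rows and that httr and jsonlite are loaded):
library(httr)
library(jsonlite)

states <- c("AL","AR","AZ","CA","CO","CT")
for (i in states) {
  # splice the unquoted state abbreviation into the URL
  temp_raw <- GET(paste0("http://api.votesmart.org/Officials.getByOfficeTypeState?key=xxx&officeTypeId=L&stateId=", i, "&o=JSON"))
  my_content <- content(temp_raw, as = 'text')
  my_content2 <- fromJSON(my_content)
  temp_officials <- my_content2$candidate$candidate
  # append this state's rows to the running data frame
  officials2022 <- rbind(officials2022, temp_officials)
}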

Related

Looping over rows of self-reported job titles to gather publicly available data

Suppose I have a data frame with one case of self-reported job titles (this is done in R):
x <- data.frame("job.title" = c("psychologist"))
I'd like to have this job title entered into a search engine on a website (this part I can do) in order to have data on these jobs pulled into a data frame (this part I can also do).
The following function does this for me:
onet.sum <- function(x) {
  obj1 <- as.list(ONETr::keySearch(x)) # enter self-reported job title into ONET's search engine
  job.title <- obj1[["title"]][1] # pull best-matching title
  soc.code <- obj1[["code"]][1] # pull best-matching title's SOC code
  obj4 <- as.data.frame(cbind(job.title, soc.code))
  return(obj4)
}
However, once I add a second job title in a second row...
x <- data.frame("job.title" = c("psychologist", "social worker"))
...I get this system error that I'm not sure how to diagnose.
Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Any advice?
UPDATE
So it turns out that there are two solutions that work, provided the job titles I pass contain no spaces:
Using lapply(), after making sure the job titles contain no spaces.
So this works:
final_data <- lapply(c("psychologist","socialworker"), onet.sum) %>%
  bind_rows()
...but this doesn't work:
final_data <- lapply(c("psychologist","social worker"), onet.sum) %>%
  bind_rows()
Using purrr's map_df(), which is more flexible:
result <- purrr::map_df(gsub('\\s', '', x$job.title), onet.sum)
You can try with an lapply statement:
result <- do.call(rbind, lapply(x$job.title, onet.sum))
Using purrr::map_df might be shorter.
result <- purrr::map_df(x$job.title, onet.sum)
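If you would rather keep multi-word titles intact, another option (an untested sketch, assuming the error comes from the raw space reaching the request URL that ONETr::keySearch builds) is to URL-encode the title instead of deleting the space:
# hypothetical workaround: percent-encode spaces before searching,
# assuming keySearch() splices the string straight into the URL
onet.sum.encoded <- function(x) onet.sum(utils::URLencode(x, reserved = TRUE))
result <- purrr::map_df(x$job.title, onet.sum.encoded)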

Nested Json with Different Attribute Names in R

I am playing with the Kaggle Star Trek Scripts dataset, but I am struggling with converting the JSON to a data frame in R. Ideally I would convert it into a long-form dataset with index columns for episode and character, with their lines on individual rows. I did find this answer; however, it is not in R.
Currently the JSON looks like the mock example below. Sorry it is not a full example, but it shows the structure. If you want, you can download the data yourself from here.
Mock Example
"ENT": {
"episode_0": {
"KLAANG": {
"\"Pungghap! Pung ghap!\"": {},
"\"DujDajHegh!\"": {}
}
},
"eipsode_1": {
"ARCHER": {
"\"Warpme!\"": {},
"\"Toboldly go!\"": {}
}
}
}
}
The issue I have is that the second-level keys, the episodes, are individually numbered, so my regular bag of tricks for flattening by attribute name is not working. I am unsure how to loop over a level rather than an attribute name.
What I would ideally want is a long-form dataset that looks like this:
Series  Episode    Character  Lines
ENT     episode_0  KLAANG     Pung ghap! Pung ghap!
ENT     episode_0  KLAANG     DujDaj Hegh!
ENT     episode_1  ARCHER     Warp me!
ENT     episode_1  ARCHER     To boldly go!
My current code looks like the below, which is what I would normally start with, but it is obviously not working, or at least not getting far enough.
your_df <- result[["ENT"]] %>%
  purrr::flatten() %>%
  map_if(is_list, as_tibble) %>%
  map_if(is_tibble, list) %>%
  bind_cols()
I have also tried using stack() and map_dfr(), but with no success. So I yet again come humbly to you, dear reader, for expertise. JSON is the bane of my existence. I struggle with applying other answers to my circumstances, so any advice or examples I can reverse-engineer and learn from are most appreciated.
Also happy to clarify or expand on anything if possible.
-Jake
So I was able to brute-force it thanks to an answer from Michael on a thread called How to flatten a list of lists?, so shout out to them.
The function allowed me to convert the JSON into a flat list whose names encode the nesting path.
flattenlist <- function(x) {
  morelists <- sapply(x, function(xprime) class(xprime)[1] == "list")
  out <- c(x[!morelists], unlist(x[morelists], recursive = FALSE))
  if (sum(morelists)) {
    Recall(out)
  } else {
    return(out)
  }
}
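A quick sanity check on a simplified version of the mock above (with lines stored as character values rather than empty JSON objects; note how the names join the levels with dots and pick up a numeric suffix for repeated keys):
mock <- list(ENT = list(episode_0 = list(KLAANG = list("Pung ghap! Pung ghap!",
                                                       "DujDaj Hegh!"))))
flattenlist(mock)
# $ENT.episode_0.KLAANG1
# [1] "Pung ghap! Pung ghap!"
# $ENT.episode_0.KLAANG2
# [1] "DujDaj Hegh!"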
Putting it all together, I ended up with the following solution, annotated for your entertainment.
library(jsonlite)
library(tidyverse)
library(dplyr)
library(data.table)
library(rjson)  # loaded last, so the fromJSON(file = ...) call below is rjson's
result <- fromJSON(file = "C:/Users/jacob/Downloads/all_series_lines.json")
# Mike's function to get to a list of lists
flattenlist <- function(x) {
  morelists <- sapply(x, function(xprime) class(xprime)[1] == "list")
  out <- c(x[!morelists], unlist(x[morelists], recursive = FALSE))
  if (sum(morelists)) {
    Recall(out)
  } else {
    return(out)
  }
}
# Mike's function applied
final <- as.data.frame(do.call("rbind", flattenlist(result)))
# Turn all the lists into a master data frame and ensure the index becomes
# a column I can separate later for context.
final <- cbind(Index_Name = rownames(final), final)
rownames(final) <- 1:nrow(final)
# The output makes the final elements of the JSON the variables of the data
# frame, so force it back into a long-form dataset.
final2 <- gather(final, "key", "value", -Index_Name)
# Separate each element of the index name into my three mapping variables:
# Series, Episode and Character. The original column names from above can be
# kept as a script-line id.
final2$Episode <- gsub(".*\\.(.*)\\..*", "\\1", final2$Index_Name)
final2$Series <- substr(final2$Index_Name, start = 1, stop = 3)
final2$Character <- sub(".*\\.", "", final2$Index_Name)  # fixed: was referencing the non-existent newColName
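As a quick check, here is what the three extractions return for a hypothetical index name of the shape the flattened names take:
idx <- "ENT.episode_0.KLAANG"
gsub(".*\\.(.*)\\..*", "\\1", idx)   # "episode_0" (middle segment)
substr(idx, start = 1, stop = 3)     # "ENT" (series prefix)
sub(".*\\.", "", idx)                # "KLAANG" (text after the last dot)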

Using R and an API to extract multiple stock market data

I have set up an API access key with a data provider of stock market data. With this key I am able to extract stock market data based on ticker code (e.g. AAPL: Apple, FB: Facebook, etc.).
I am able to extract stock data on an individual-ticker basis using R, but I want to write a piece of code that extracts data for multiple stock tickers and puts them all in one data frame (the structure is the same for all stocks). I am not sure how to create a loop that updates the data frame each time stock data is extracted. I get the message 'No encoding supplied: defaulting to UTF-8', which does not tell me much. A point in the right direction would be helpful.
I have the following code:
if (!require("httr")) {
install.packages("httr")
library(httr)
}
if (!require("jsonlite")) {
install.packages("jsonlite")
library(jsonlite)
}
stocks <- c("FB","AAPL") # Example stocks, actual stocks removed
len <- length(stocks)
url <- "URL" # Actual url removed
access_key <- "MY ACCESS KEY" # Actual access key removed
extraction <- lapply(stocks[1:len], function(i) {
  # i is the ticker itself here, so use it directly;
  # stocks[i] would index the vector by name and return NA
  call1 <- paste(url, "?access_key=", access_key, "&", "symbols", "=", i, sep = "")
  get_prices <- GET(call1)
  # supplying the encoding explicitly avoids the UTF-8 warning
  get_prices_text <- content(get_prices, "text", encoding = "UTF-8")
  get_prices_json <- fromJSON(get_prices_text, flatten = TRUE)
  get_prices_df <- as.data.frame(get_prices_json)
  return(get_prices_df)
})
file <- do.call(rbind, extraction)
I realised that this is not the most efficient way of doing this. A better way is to update the URL to include multiple stocks, rather than calling the API once per ticker with lapply. I am therefore closing the question.
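A sketch of that idea, assuming the provider accepts a comma-separated symbols parameter (check your API's documentation; url and access_key are the placeholders from above):
call1 <- paste0(url, "?access_key=", access_key,
                "&symbols=", paste(stocks, collapse = ","))
resp <- GET(call1)
file <- as.data.frame(fromJSON(content(resp, "text", encoding = "UTF-8"), flatten = TRUE))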

R - Loop API Call

I'm using R to successfully make API calls. For each individual call I need to alter one or two distinguishing IDs (in the case of the code below, activity_id and/or name_id). The code is working fine; however, I would now like to automate this process rather than manually changing the IDs for each call. I am wondering if there's a way to loop this using a data frame or list to store the relevant IDs.
I've searched across Stack Overflow; however, I've yet to find or execute an appropriate solution.
Any help would be appreciated.
Thanks,
JPC
library(httr)
library(jsonlite)
library(tidyr)

apiKey <- "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI"
example <- GET("url?activity_id=b5cb9359-f0e5-4939-9be3-fc95f8bc7d6b&name_id=f1e17fa6-c40c-4810-9c43-60939e2a9a99",
               add_headers(Authorization = paste("Bearer", apiKey)))
example <- content(example, "text")
example <- fromJSON(example, flatten = TRUE)
example <- unnest(example, data)
write.csv(example, "example.csv", row.names = FALSE)
We could write a function that dynamically generates the URL using sprintf, based on the activity_id and name_id passed in.
get_data <- function(activity_id, name_id) {
  url <- sprintf('url?activity_id=%s&name_id=%s', activity_id, name_id)
  example <- httr::GET(url, httr::add_headers(Authorization = paste("Bearer", apiKey)))
  example <- httr::content(example, "text")
  example <- jsonlite::fromJSON(example, flatten = TRUE)
  example <- tidyr::unnest(example, data)
  return(example)
}
and then call it using Map.
out <- Map(get_data, activity_vec, name_vec)
Here activity_vec and name_vec are vectors of the respective IDs. This will return a list of data frames in out, which can be combined into one data frame if needed before writing to csv, as sketched below.
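For instance (a minimal sketch, where out is the list returned by Map above):
combined <- do.call(rbind, out)
write.csv(combined, "example.csv", row.names = FALSE)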
If only name_id is changing we can do
get_data <- function(name_id) {
  url <- sprintf('url?activity_id=b5cb9359-f0e5-4939-9be3-fc95f8bc7d6b&name_id=%s', name_id)
  example <- httr::GET(url, httr::add_headers(Authorization = paste("Bearer", apiKey)))
  example <- httr::content(example, "text")
  example <- jsonlite::fromJSON(example, flatten = TRUE)
  example <- tidyr::unnest(example, data)
  return(example)
}
out <- lapply(name_vec, get_data)

Mapping SIC to FamaFrench Industry Classification

I am working on a project where I have to map firms that have an SIC industry classification to the corresponding Fama-French industry classification. I have found that Ian Gow has graciously created a script to do this. The script is available at the following URL: https://iangow.wordpress.com/2011/05/17/getting-fama-french-industry-data-into-r/
However, there is a glitch in the script or in the data set: it does not work correctly with “Siccodes30.txt”. More specifically, it does not produce the correct mapping for the lines related to “6726-6726 Unit inv trusts, closed-end” in “Siccodes30.txt”. I have been trying to figure out the source of the problem, but I have not been successful.
In the post below, I have included the original script (there is some room to make it more efficient), and I have added a few lines at the end to make it work with an online example.
Original script (I have removed comments to make the post shorter). Again, this is not my script; the original is at https://iangow.wordpress.com/2011/05/17/getting-fama-french-industry-data-into-r/
url4FF <- paste("http://mba.tuck.dartmouth.edu",
                "pages/faculty/ken.french/ftp",
                "Industry_Definitions.zip", sep = "/")
f <- tempfile()
download.file(url4FF, f)
fileList <- unzip(f, list = TRUE)
trim <- function(string) {
  ifelse(grepl("^\\s*$", string, perl = TRUE), "",
         gsub("^\\s*(.*?)\\s*$", "\\1", string, perl = TRUE))
}
extract_ff_ind_data <- function(file) {
  ff_ind <- as.vector(read.delim(unzip(f, files = file), header = FALSE,
                                 stringsAsFactors = FALSE))
  ind_num <- trim(substr(ff_ind[, 1], 1, 10))
  for (i in 2:length(ind_num)) {
    if (ind_num[i] == "") ind_num[i] <- ind_num[i - 1]
  }
  sic_detail <- trim(substr(ff_ind[, 1], 11, 100))
  is.desc <- grepl("^\\D", sic_detail, perl = TRUE)
  regex.ind <- "^(\\d+)\\s+(\\w+).*$"
  ind_num <- gsub(regex.ind, "\\1", ind_num, perl = TRUE)
  ind_abbrev <- gsub(regex.ind, "\\2", ind_num[is.desc], perl = TRUE)
  ind_list <- data.frame(ind_num = ind_num[is.desc], ind_abbrev,
                         ind_desc = sic_detail[is.desc])
  regex.sic <- "^(\\d+)-(\\d+)\\s*(.*)$"
  ind_num <- ind_num[!is.desc]
  sic_detail <- sic_detail[!is.desc]
  sic_low <- as.integer(gsub(regex.sic, "\\1", sic_detail, perl = TRUE))
  sic_high <- as.integer(gsub(regex.sic, "\\2", sic_detail, perl = TRUE))
  sic_desc <- gsub(regex.sic, "\\3", sic_detail, perl = TRUE)
  sic_list <- data.frame(ind_num, sic_low, sic_high, sic_desc)
  return(merge(ind_list, sic_list, by = "ind_num", all = TRUE))
}
FFID_30 <- extract_ff_ind_data("Siccodes30.txt")
FFID_30 <- extract_ff_ind_data("Siccodes30.txt")
I have added the following lines to allow testing the script:
library(gsheet)
url <-"https://docs.google.com/spreadsheets/d/1QRv8YmJv0pdhIVmkXMQC7GQuvXV21Kyjl9pVZsSPEAk/gid=1758600626"
companiesSIC <- read.csv(text=gsheet2text(url, format='csv'), stringsAsFactors=FALSE)
names(companiesSIC)
library(sqldf)
companiesFFID_30 <- sqldf("SELECT a.gvkey, a.SIC, b.ind_desc AS FF30,
                                  b.ind_num AS FFIndNum30
                           FROM companiesSIC AS a
                           LEFT JOIN FFID_30 AS b
                             ON a.sic BETWEEN b.sic_low AND b.sic_high")
companiesFFID_30
The results in rows 141 and 142 are wrong: instead of an industry number, they contain a string.
Thanks
PS: As I said, there is room to make the script shorter (e.g., you don't need a separate function to trim whitespace; you can use trimws), but to give credit to the original author, I kept the script in its original form. However, whoever solves the problem should consider updating the rest of the script too.
There is nothing wrong with the script. The problem is in the formatting of the two lines (141 and 142) of the txt file.
I opened the text file with a text editor, deleted and re-typed the content of those two lines. When I re-ran the R script, the problem was gone.
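If you hit something similar, a quick way to spot such formatting problems (a sketch; it assumes the extracted Siccodes30.txt is in the working directory) is to scan the raw lines for characters outside the printable ASCII range:
raw_lines <- readLines("Siccodes30.txt", warn = FALSE)
# print any line containing a non-printable or non-ASCII character
grep("[^ -~]", raw_lines, value = TRUE)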
