I wish to scale my working API query to query many IDs and to store the results in a nice rectangular data frame.
I need some help understanding how I can scale my code to take many input variables and how to store the returned values.
My working code is as follows:
pacman::p_load(tidyverse,httr,jsonlite,purrr)
path <- "https://npiregistry.cms.hhs.gov/api/?"
request <- httr::GET(url = path,
query = list(version = "2.0",
number = 1154328938))
response <- content(request, as = "text", encoding = "UTF-8")
df <- jsonlite::fromJSON(response, flatten = TRUE) %>%
data.frame()
providerData <- df %>%
select(results.number,
results.basic.name,
results.basic.gender,
results.basic.credential,
results.taxonomies) %>%
unnest_wider(results.taxonomies) %>%
rename(Provider_NPI = results.number,
Provider_Name = results.basic.name,
Provider_Gender = results.basic.gender,
Provider_Credentials = results.basic.credential,
Provider_Taxonomy = desc,
Provider_State = state) %>%
select(-code,-license,-primary)
I now wish to query the 4 IDs below and store the results in the same data format as the example above.
I have tried using lapply and building my own function, but I don't fully understand how to create objects that store the returned values.
My function looks as follows:
getNPI <- function(object) {
httr::GET(url = path,
query = list(version = "2.0",
number = object))
}
providerIDs <- c('1073666335',
'1841395357',
'1104023381',
'1477765634')
test <- lapply(providerIDs, getNPI)
I'm pretty certain I need some sort of object like a list or data frame to store the values returned by httr::GET, but this is where I am falling down. The other piece is how to pull the appropriate values from the returned objects and store them in a neat data frame.
Your help would be greatly appreciated.
You have to add the "cleaning" steps and return a data frame inside your getNPI function; then you can use do.call to combine all the data into a final data frame:
Example
getNPI <- function(object) {
request <- httr::GET(url = path,
query = list(version = "2.0",
number = object))
df <- content(request, as = "text", encoding = "UTF-8") %>%
jsonlite::fromJSON(. , flatten = TRUE) %>%
data.frame()
df %>%
select(results.number,
results.basic.name,
results.basic.gender,
results.basic.credential,
results.taxonomies) %>%
unnest_wider(results.taxonomies)
# Add more selection, mutations as needed
}
test <- lapply(providerIDs, getNPI)
# Use do.call with rbind to make the final df
final_df <- do.call("rbind",test)
Hope this helps.
NOTE: For rbind to work with do.call as expected, all the column names have to be the same.
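As a side note, dplyr::bind_rows (or purrr::map_dfr) pads columns that are missing in some results with NA instead of erroring, so it is a more forgiving alternative when the column names differ. A minimal sketch, assuming the same getNPI function and providerIDs vector as above:
# purrr::map_dfr() applies getNPI to each ID and row-binds the results in one step
final_df <- purrr::map_dfr(providerIDs, getNPI)
# Equivalent two-step version:
# test <- lapply(providerIDs, getNPI)
# final_df <- dplyr::bind_rows(test)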
I am new to web scraping. I am trying to scrape a table with the following code, but I am unable to get it. The source of the data is:
https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
urlYAnalysis <- paste(url, sep = "")
webpage <- readLines(urlYAnalysis)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])
I copied this approach from the link (Web scraping of key stats in Yahoo! Finance with R), where it is applied to Yahoo Finance data.
In my opinion, readHTMLTable(tableNodes[[12]]) should give table 12, but when I try tableNodes[[12]], it always gives me an error:
Error in do.call(data.frame, c(x, alis)) :
variable names are limited to 10000 bytes
Please suggest a way to extract the table and to combine the data from the other tabs as well (Fundamental, Technical and Performance).
This data is returned dynamically as JSON. In R (which behaves differently from Python requests), you get HTML back, from which you can extract a given page's results as JSON. A page includes the info for all the tabs and 50 records. The first page gives you the total record count, so you can calculate the total number of pages to loop over to get all the results. You can combine them into a final dataframe during a loop over the total number of pages, altering the pn param of the XHR POST body to the appropriate page number in each new POST request. There are two required headers.
It is probably a good idea to write a function that accepts a page number in its signature and returns a given page's JSON as a dataframe, then apply that via a tidyverse package to handle the looping and the combining of results into the final dataframe.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'X-Requested-With' = 'XMLHttpRequest'
)
data = list(
'country[]' = '6',
'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
'exchange[]' = '109',
'exchange[]' = '127',
'exchange[]' = '51',
'exchange[]' = '108',
'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
'order[col]' = 'eq_market_cap',
'order[dir]' = 'd'
)
r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers=headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()
page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])' )[1,2])
total_rows <- str_match(s, '"totalCount\":(\\d+),' )[1,2]%>%as.integer()
num_pages <- ceiling(total_rows/50)
Here is my current attempt at combining, on which I would welcome feedback. It keeps all of the returned columns, for all pages, and I have to handle missing columns and different orderings of columns, as well as one column being a data.frame. As the number of returned columns is far greater than those visible on the page, you could simply revise this to subset the returned columns with a mask containing just the columns present in the tabs (see the sketch after the code below).
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'X-Requested-With' = 'XMLHttpRequest'
)
data = list(
'country[]' = '6',
'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
'exchange[]' = '109',
'exchange[]' = '127',
'exchange[]' = '51',
'exchange[]' = '108',
'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
'order[col]' = 'eq_market_cap',
'order[dir]' = 'd'
)
get_data <- function(page_number){
data['pn'] = page_number
r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers=headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()
if(page_number==1){ return(s) }
else{return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])' )[1,2])))}
}
clean_df <- function(df){
interim <- df['viewData']
df_minus <- subset(df, select = -c(viewData))
df_clean <- cbind.data.frame(c(interim, df_minus))
return(df_clean)
}
initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])' )[1,2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),' )[1,2] %>% as.integer()
num_pages <- ceiling(total_rows/50)
dfs <- map(.x = 2:num_pages,
.f = ~clean_df(get_data(.)))
r <- rbindlist(c(list(df),dfs),use.names=TRUE, fill=TRUE)
write_csv(r, 'data.csv')
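As a sketch of the column-mask idea mentioned above (the names in tab_cols are assumptions; inspect names(r) to find the columns that actually back the visible tabs):
# Hypothetical mask of columns matching the on-page tabs; adjust after inspecting names(r)
tab_cols <- c("viewData.name", "viewData.symbol", "eq_market_cap", "last", "chg_percent")
r_subset <- as.data.frame(r) %>% select(any_of(tab_cols))  # any_of() keeps only the names that exist
write_csv(r_subset, 'data_subset.csv')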
I wanted to perform a loop to capture weather data from multiple stations using the code below:
library(rwunderground)
sample_df <- data.frame(airportid = c("K6A2",
"KAPA",
"KASD",
"KATL",
"KBKF",
"KBKF",
"KCCO",
"KDEN",
"KFFC",
"KFRG"),
stringsAsFactors = FALSE)
history_range(set_location(airport_code =sample_df$airportid), date_start = "20170815", date_end = "20170822",
limit = 10, no_api = FALSE, use_metric = FALSE, key = get_api_key(),
raw = FALSE, message = TRUE)
It won't work.
Currently, you are passing the entire vector (multiple character values) into the history_range call. Simply use lapply to iteratively pass the vector values and return a list of history_range() return objects. Below, a defined function is used to pass the parameter; extend the function as needed to perform other operations.
capture_weather_data <- function(airport_id) {
data <- history_range(set_location(airport_code=airport_id),
date_start = "20170815", date_end = "20170822",
limit = 10, no_api = FALSE, use_metric = FALSE, key = get_api_key(),
raw = FALSE, message = TRUE)
write.csv(data, paste0("/path/to/output/", airport_id, ".csv"))
return(data)
}
data_list <- lapply(sample_df$airportid, capture_weather_data)
Also, you can name each item in the list with the corresponding airport_id character value:
data_list <- setNames(data_list, sample_df$airportid)
data_list$K6A2 # 1st ITEM
data_list$KAPA # 2nd ITEM
data_list$KASD # 3rd ITEM
...
In fact, with sapply (the wrapper to lapply) you can generate the list and name each item in the same call, but the input vector must be a character type (not a factor):
data_list <- sapply(as.character(sample_df$airportid), capture_weather_data,
simplify=FALSE, USE.NAMES=TRUE)
names(data_list)
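If you also want one combined data frame rather than only per-station CSVs, the named list can be stacked with dplyr::bind_rows. A sketch, assuming the history_range() results are data frames with matching columns:
library(dplyr)
# .id = "airportid" adds the list names (the airport codes) as an identifying column
all_weather <- bind_rows(data_list, .id = "airportid")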
I think this history_range function that you brought up, from the rwunderground package as I understand it, requires a Weather Underground API key. I went to the site and even signed up for it, but the email validation process needed to get a key (https://www.wunderground.com/weather/api) doesn't seem to be working correctly at the moment.
Instead, I went to the CRAN mirror (https://github.com/cran/rwunderground/blob/master/R/history.R) and, from what I understand, the function accepts only one string as the set_location argument. The example provided in the documentation is:
history(set_location(airport_code = "SEA"), "20130101")
So what you should be doing as a "loop", instead, is:
sample_ids <- as.vector(sample_df$airportid)  # pull the column so the loop runs over the codes, not the data frame
for(i in 1:length(sample_ids)){
history_range(
set_location(airport_code = sample_ids[i]),
date_start = "20170815", date_end = "20170822",
limit = 10, no_api = FALSE, use_metric = FALSE,
key = get_api_key(),
raw = FALSE, message = TRUE)
}
# Note: the results are printed and discarded here; assign them into a list if you need to keep them.
If this doesn't work, let me know. (Ack, somebody also gave another answer to this question while I was typing this up.)