Scraping PHP dashboard with R (rvest) - r

I am trying to scrape Bangladesh COVID-19 data (number of tests, number of positive tests, positive rate) from the official website: http://103.247.238.92/webportal/pages/covid19.php
The website contains 3 drop-down menus to arrive at the data: Select Division; Select District; Select time frame for the data.
I have tried the following so far:
url <- "http://103.247.238.92/webportal/pages/covid19.php"
webpage <- read_html(url)
webpage has the following:
List of 2
$ node:<externalptr>
$ doc :<externalptr>
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
Since this did not help, I also tried the following based on this question:
a <- GET(url)
a <- content(a, as="text")
a <- gsub("^angular.callbacks._2\\(", "", a)
a <- gsub("\\);$", "", a)
df <- fromJSON(a, simplifyDataFrame = TRUE)
The above returns the following error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html> <!-- This is a
(right here) ------^
So I am really lost in terms of how I can even read the data - but upon looking at the source of the webpage, I know that the data is right there (visible in the Safari Web Inspector).
Any suggestions on how I can read this data?
Additionally, if someone could help with how I can go about selecting the different drop-down menu items, that would be really appreciated. The final goal is to collect data for each district in each division for the last 12 months.

tl;dr
The page makes additional requests to pick up that info. Those requests rely on combinations of ids: an id pulled from the value attribute of each option element in the Division dropdown, in tandem with an id pulled from the value attribute of each option element in the District dropdown.
You can make an initial request to get all the Division dropdown ids:
divisions <- options_df("#division option:nth-child(n+2)", "division")
nth-child(n+2) is used to exclude the initial 'select' option.
This returns a dataframe with the initial divisionIDs and friendly division names.
Those ids can then be used to retrieve the associated districtIDs (the options which become available in the second dropdown after making your selection in the first):
districts <- pmap_dfr(
  list(divisions$divisionID),
  ~ {
    df_districts <- districts_from_updated_session(.x, "district") %>%
      mutate(
        divisionID = .x
      )
    return(df_districts)
  }
)
This returns a dataframe mapping the divisionID to all the associated districtIDs, as well as the friendly district names:
By including the divisionID in both dataframes I can inner-join them:
div_district <- dplyr::inner_join(divisions, districts, by = "divisionID", copy = FALSE)
Up to this point I have been using a session object for the efficiency of TCP re-use. Unfortunately, I couldn't find anything in the documentation covering how to update an already open session so as to send a new POST request with a dynamic body argument. Instead, I leveraged furrr::future_map to gain some efficiency through parallel processing:
df <- div_district %>%
mutate(json = furrr::future_map(divisionID, .f = get_covid_data, districtID))
To get the final covid numbers, via get_covid_data(), I leverage some perhaps odd behaviour of the server: I can make a GET request passing divisionID and districtID within the body, regex out part of the jQuery DataTables scripting, string-clean that into valid json, and then parse it into a json object stored in the json column of the final dataframe.
Inside the json column:
R:
library(httr)
#> Warning: package 'httr' was built under R version 4.0.3
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.3
#> Warning: package 'forcats' was built under R version 4.0.3
library(jsonlite)
#> Warning: package 'jsonlite' was built under R version 4.0.3
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
library(furrr)
#> Warning: package 'furrr' was built under R version 4.0.3
#> Loading required package: future
#> Warning: package 'future' was built under R version 4.0.3
## to clean out everything before a run
# rm(list = ls(all = TRUE))
# invisible(lapply(paste0('package:', names(sessionInfo()$otherPkgs)), detach, character.only=TRUE, unload=TRUE)) # https://stackoverflow.com/a/39235076 #mmfrgmpds
#returns value:text for options e.g. divisions/districts (dropdown)
options_df <- function(css_selector, level) {
  nodes <- session %>% html_nodes(css_selector)
  options <- nodes %>% map_df(~ c(html_attr(., "value"), html_text(.)) %>%
    set_names(paste0(level, "ID"), level))
  return(options)
}
#returns districts associated with division
districts_from_updated_session <- function(division_id, level) {
  session <- jump_to(session, paste0("http://103.247.238.92/webportal/pages/ajaxDataDistrictDHIS2Dashboard.php?division_id=", division_id))
  return(options_df("#district option:nth-child(n+2)", level))
}
# returns json object housing latest 12 month covid numbers by divisionID + districtID pairing
get_covid_data <- function(divisionID, districtID) {
  headers <- c(
    "user-agent" = "Mozilla/5.0",
    "if-modified-since" = "Wed, 08 Jul 2020 00:00:00 GMT" # to mitigate for caching
  )
  data <- list("division" = divisionID, "district" = districtID, "period" = "LAST_12_MONTH", "Submit" = "Search")
  r <- httr::GET(url = "http://103.247.238.92/webportal/pages/covid19.php", httr::add_headers(.headers = headers), body = data)
  data <- stringr::str_match(content(r, "text"), "DataTable\\((\\[[\\s\\S]+\\])\\)")[1, 2] %>% # clean up extracted string so can be parsed as valid json
    gsub("role", '"role"', .) %>%
    gsub("'", '"', .) %>%
    gsub(",\\s+\\]", "]", .) %>%
    str_squish() %>%
    jsonlite::parse_json()
  return(data)
}
url <- "http://103.247.238.92/webportal/pages/covid19.php"
headers <- c("User-Agent" = "Mozilla/4.0", "Referer" = "http://103.247.238.92/webportal/pages/covid19.php")
session <- html_session(url, httr::add_headers(.headers = headers)) #for tcp re-use
divisions <- options_df("#division option:nth-child(n+2)", "division") #nth-child(n+2) to exclude initial 'select' option
districts <- pmap_dfr(
  list(divisions$divisionID),
  ~ {
    df <- districts_from_updated_session(.x, "district") %>%
      mutate(
        divisionID = .x
      )
    return(df)
  }
)
div_district <- dplyr::inner_join(divisions, districts, by = "divisionID", copy = FALSE)
no_cores <- future::availableCores() - 1
future::plan(future::multisession, workers = no_cores)
df <- div_district %>%
mutate(json = future_map(divisionID, .f = get_covid_data, districtID))
Created on 2021-03-04 by the reprex package (v0.3.0)
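Since the screenshot of the json column contents is not reproduced here, a quick way to inspect what the parsed DataTables payload looks like (assuming the run above completed) is:
# peek at the structure of the first parsed payload in the list-column
str(df$json[[1]], max.level = 2)
# or count how many parsed elements each row holds
purrr::map_int(df$json, length)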
Python:
import requests, re, ast
from bs4 import BeautifulSoup as bs
def options_dict(soup, css_selector):
    options = {i.text: i['value'] for i in soup.select(css_selector) if i['value']}
    return options

def covid_numbers(text):
    covid_data = p.findall(text)[0]
    covid_data = re.sub(r'\n\s+', '', covid_data.replace("role", "'role'"))
    covid_data = ast.literal_eval(covid_data)
    return covid_data
url = 'http://103.247.238.92/webportal/pages/covid19.php'
regions = {}
result = {}
p = re.compile(r'DataTable\((\[[\s\S]+\])\)')
with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://103.247.238.92/webportal/pages/covid19.php'}
    soup = bs(s.get(url).content, 'lxml')
    divisions = options_dict(soup, '#division option')
    for k, v in divisions.items():
        r = s.get(f'http://103.247.238.92/webportal/pages/ajaxDataDistrictDHIS2Dashboard.php?division_id={v}')
        soup = bs(r.content, 'lxml')
        districts = options_dict(soup, '#district option')
        regions[k] = districts
    s.headers = {'User-Agent': 'Mozilla/5.0', 'if-modified-since': 'Wed, 08 Jul 2020 22:27:07 GMT'}
    for k, v in divisions.items():
        result[k] = {}
        for k2, v2 in regions.items():
            data = {'division': k2, 'district': v2, 'period': 'LAST_12_MONTH', 'Submit': 'Search'}
            r = s.get('http://103.247.238.92/webportal/pages/covid19.php', data=data)
            result[k][k2] = covid_numbers(r.text)

Related

JSON Post R - Web scraping to a Dataframe in R

I'm trying to capture a data pull from the twelvedata.com webpage. I want to pull historical equity prices for several stocks with a long time frame.
The instructions are located here: https://twelvedata.com/docs#complex-data
This request requires a JSON POST. I'm struggling to make this work. This is what I have thus far.
url <- "https://api.twelvedata.com/complex_data?apikey=myapikey"
requestBody <- paste0('{"symbols" : "AAPL",
                        "intervals" : "1day",
                        "start_date" : "2021-05-01",
                        "methods" : "symbol"}')
res <- httr::POST(url = url,
                  body = requestBody,
                  encode = "json")
test_1 <- content(res, as="text") %>% fromJSON()
test_2 <- as.data.frame(rjson::fromJSON(test_1))
Any help would be greatly appreciated. Thank you for your time.
Perhaps use toJSON to create requestBody. Here is an example:
requestBody = toJSON(list(
  symbols = c("AAPL"),
  intervals = c("1day"),
  start_date = unbox("2021-05-01"),
  end_date = unbox("2021-12-31"),
  methods = "time_series"
))
res <- httr::POST(url = url, body = requestBody, encode = "json")
The following will then return this type of structure:
test_1 <- httr::content(res, as="text") %>% rjson::fromJSON()
do.call(rbind, lapply(test_1$data[[1]]$values,data.frame)) %>% head()
Output:
    datetime      open      high       low     close   volume
1 2021-12-30 179.47000 180.57001 178.09000 178.20000 59773000
2 2021-12-29 179.33000 180.63000 178.14000 179.38000 62348900
3 2021-12-28 180.16000 181.33000 178.53000 179.28999 79144300
4 2021-12-27 177.09000 180.42000 177.07001 180.33000 74919600
5 2021-12-23 175.85001 176.85001 175.27000 176.28000 68227500
6 2021-12-22 173.03999 175.86000 172.14999 175.64000 92004100
Update: Multiple Symbols.
If you have multiple symbols, requestBody should be updated like this:
requestBody = toJSON(list(
  symbols = c("AAPL", "MSFT"),
  intervals = c("1day"),
  start_date = unbox("2021-05-01"),
  end_date = unbox("2021-12-31"),
  methods = "time_series"
))
and you can extract each into its own data frame within a list, like this
lapply(test_1$data, function(s) {
  stock = s$meta$symbol
  do.call(rbind, lapply(s$values, data.frame)) %>% mutate(symbol = stock)
})
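If you would rather have a single combined data frame, the per-symbol list can be stacked afterwards; a small follow-on sketch, assuming the lapply() result above is saved to an object (here called stock_dfs):
stock_dfs <- lapply(test_1$data, function(s) {
  stock = s$meta$symbol
  do.call(rbind, lapply(s$values, data.frame)) %>% mutate(symbol = stock)
})
# stack the per-symbol data frames into one long data frame with a symbol column
all_symbols <- dplyr::bind_rows(stock_dfs)
head(all_symbols)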

RSelenium scraping table

I want to extract the data in the "Completed Games" table located here "https://www.chess.com/member/magnuscarlsen".
The code below gives me a list of size 0. The Selenium side of things seems to be working: a Firefox browser opens on my desktop and navigates to the page. Any help would be greatly appreciated. I'm at my wits' end!
rD <- rsDriver(browser="firefox", port=4442L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://www.chess.com/member/magnuscarlsen")
Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]
html <- read_html(html)
signal <- html %>%
html_nodes("table.table-component table-hover archived-games-table")
If you don't mind not having the accuracy figures (for which I believe there is no published basis for calculation), have a look at the public APIs from Chess.com. You do get all the moves info included.
In particular, see the implementations via the bigchess package. I amended examples from there below:
All games:
library(rjson)
library(bigchess)
user <- "magnuscarlsen"
json_file <- paste0("https://api.chess.com/pub/player/", user,"/games/archives")
json_data <- fromJSON(paste(readLines(json_file), collapse = ""))
result <- data.frame()
for (i in json_data$archives)
  result <- rbind(result, read.pgn(paste0(i, "/pgn")))
Single month:
library(bigchess)
df <- read.pgn("https://api.chess.com/pub/player/magnuscarlsen/games/2020/12/pgn")
print(df[df$Date == '2020.12.11', ])
Adding in your accuracies as requested. Most of the info on that page is actually available via the APIs:
library(bigchess)
#> Warning: package 'bigchess' was built under R version 4.0.3
library(purrr)
library(jsonlite)
#> Warning: package 'jsonlite' was built under R version 4.0.3
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
library(stringr)
try_again <- function(link) { # https://blog.r-hub.io/2020/04/07/retry-wheel/
  maxtry <- 5
  try <- 1
  resp <- read_json(link)
  while (try <= maxtry && is.null(resp$data)) {
    resp <- read_json(link)
    try <- try + 1
    Sys.sleep(try * .25)
  }
  return(resp)
}
url <- "https://api.chess.com/pub/player/magnuscarlsen/games/2020/12"
result <- data.frame()
result <- read.pgn(paste0(url, "/pgn"))
#> Warning in readLines(con): incomplete final line found on 'https://
#> api.chess.com/pub/player/magnuscarlsen/games/2020/12/pgn'
#> 2021-02-15 20:29:04, successfully imported 47 games
#> 2021-02-15 20:29:04, N moves computed
#> 2021-02-15 20:29:04, extract moves done
#> 2021-02-15 20:29:04, stat moves computed
result <- filter(result, result$Date == "2020.12.11")
data <- read_json(url)
mask <- map(data$games, ~ !is.na(str_match(.x$pgn, 'UTCDate\\s\\"2020\\.12\\.11')[, 1])) %>% unlist()
games <- data$games[mask]
games <- paste0("https://www.chess.com/callback/analysis/game/live/", map(games, ~ str_match(.x$url, "\\d+")[, 1]), "/all")
df <- map_df(games, ~ {
  json_data <- try_again(.x)
  tryCatch(
    data.frame(
      Url = .x,
      WhiteAccuracy = json_data$data$analysis$CAPS$white$all,
      BlackAccuracy = json_data$data$analysis$CAPS$black$all,
      stringsAsFactors = FALSE
    ),
    error = function(e) {
      data.frame(
        Url = .x,
        WhiteAccuracy = NA_integer_,
        BlackAccuracy = NA_integer_,
        stringsAsFactors = FALSE
      )
    }
  )
})
final <- cbind(result, df)
#> Error in .cbind.ts(list(...), .makeNamesTs(...), dframe = FALSE, union = TRUE): non-time series not of the correct length
Created on 2021-02-15 by the reprex package (v0.3.0)
Here is an approach that solves your problem easily, because the page itself has just one table: use rvest to get it out. Note that I used pipes because I prefer them; you can of course do without them.
library(RSelenium)
library(rvest)
rD <- rsDriver(browser="firefox", port=4443L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://www.chess.com/member/magnuscarlsen")
Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]
html <- read_html(html)
##required table
html %>% html_table() %>% .[[1]]

Web scraping: Unable to read the main table - r

I am new to web scraping. I am trying to scrape a table with the following code, but I am unable to get it. The source of the data is:
https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
urlYAnalysis <- paste(url, sep = "")
webpage <- readLines(urlYAnalysis)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])
I copied this approach from the linked question (Web scraping of key stats in Yahoo! Finance with R), where it is applied to Yahoo Finance data.
In my opinion it should be readHTMLTable(tableNodes[[12]]), i.e. table 12, but when I try tableNodes[[12]] it always gives me an error.
Error in do.call(data.frame, c(x, alis)) :
variable names are limited to 10000 bytes
Please suggest a way to extract the table and combine the data from the other tabs as well (Fundamental, Technical and Performance).
This data is returned dynamically as json. In R (which behaves differently from Python's requests here) you get html back, from which you can extract a given page's results as json. A page includes all of the tabs' info plus 50 records. The first page gives you the total record count, so you can calculate the total number of pages to loop over to get all results, combining them into a final dataframe during the loop; in each new POST request you alter the pn param of the XHR POST body to the appropriate page number. There are two required headers.
It is probably a good idea to write a function that accepts a page number and returns that page's json as a dataframe, then apply it via a tidyverse package to handle the looping and the combining of results into a final dataframe.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'X-Requested-With' = 'XMLHttpRequest'
)
data = list(
'country[]' = '6',
'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
'exchange[]' = '109',
'exchange[]' = '127',
'exchange[]' = '51',
'exchange[]' = '108',
'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
'order[col]' = 'eq_market_cap',
'order[dir]' = 'd'
)
r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers=headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()
page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])' )[1,2])
total_rows <- str_match(s, '"totalCount\":(\\d+),' )[1,2]%>%as.integer()
num_pages <- ceiling(total_rows/50)
Below is my current attempt at combining the pages, on which I would welcome feedback. It returns all of the columns, for all pages, and I have to handle missing columns and differing column ordering, as well as one column that is itself a data.frame. As the number of returned columns is far greater than those visible on the page, you could simply revise this to subset the returned columns with a mask containing just the columns present in the tabs (a small sketch of that follows the code below).
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'X-Requested-With' = 'XMLHttpRequest'
)
data = list(
'country[]' = '6',
'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
'exchange[]' = '109',
'exchange[]' = '127',
'exchange[]' = '51',
'exchange[]' = '108',
'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
'order[col]' = 'eq_market_cap',
'order[dir]' = 'd'
)
get_data <- function(page_number) {
  data['pn'] = page_number
  r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
  s <- r %>% read_html() %>% html_node('p') %>% html_text()
  if (page_number == 1) {
    return(s)
  } else {
    return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])))
  }
}

clean_df <- function(df) {
  interim <- df['viewData']
  df_minus <- subset(df, select = -c(viewData))
  df_clean <- cbind.data.frame(c(interim, df_minus))
  return(df_clean)
}
initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])' )[1,2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),' )[1,2] %>% as.integer()
num_pages <- ceiling(total_rows/50)
dfs <- map(.x = 2:num_pages,
.f = ~clean_df(get_data(.)))
r <- rbindlist(c(list(df),dfs),use.names=TRUE, fill=TRUE)
write_csv(r, 'data.csv')
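A minimal sketch of that column-mask idea, applied to the combined result r above. The names in wanted_cols are hypothetical placeholders (apart from eq_market_cap, which appears in the request body); substitute the actual field names behind the visible tabs:
# names here are placeholders; replace with the actual fields behind the visible tabs
wanted_cols <- c("name_trans", "last", "pair_change_percent", "eq_market_cap")
r_subset <- r[, intersect(wanted_cols, names(r)), with = FALSE] # r is a data.table from rbindlist()
write_csv(r_subset, 'data_subset.csv')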

Nothing returned when I try to scrape mlb.com transactions using rvest

I've been trying to scrape the mlb transactions page (http://mlb.mlb.com/mlb/transactions/index.jsp#month=5&year=2019) for the corresponding date and text of every given transaction, with no luck. Using rvest and SelectorGadget, I wrote a brief function which should give me the table displayed, all the way back from the first available data in 2001 to March 2019.
I just get this series of errors and nothing at all happens.
Here is my code to scrape the data from the given website.
library(tidyverse)
library(rvest)
# breaking the URL into the start and end for easy pasting to fit timespan
url_start = "http://mlb.mlb.com/mlb/transactions/index.jsp#month="
url_end = "&year="
# function which scrapes data
mlb_transactions = function(month, year) {
  url = paste0(url_start, month, url_end, year)
  payload = read_html(url) %>%
    html_nodes("td") %>%
    html_table() %>%
    as.data.frame()
  payload
}
# function run on appropriate dates
mlb_transactions(month = 1:12, year = 2001:2019)
Here are the errors I'm getting:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=19].
and here is the Traceback
12.
stop(structure(list(message = "Expecting a single string value: [type=character; extent=19].",
call = doc_parse_file(con, encoding = encoding, as_html = as_html,
options = options), cppstack = NULL), class = c("Rcpp::not_compatible",
"C++Error", "error", "condition")))
11.
doc_parse_file(con, encoding = encoding, as_html = as_html, options = options)
10.
read_xml.character(x, encoding = encoding, ..., as_html = TRUE,
options = options)
9.
read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
8.
withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
7.
suppressWarnings(read_xml(x, encoding = encoding, ..., as_html = TRUE,
options = options))
6.
read_html.default(url)
5.
read_html(url)
4.
eval(lhs, parent, parent)
3.
eval(lhs, parent, parent)
2.
read_html(url) %>% html_nodes("td") %>% html_table() %>% as.data.frame()
1.
mlb_transactions(month = 1:12, year = 2001:2019)
One final note: my plan is (though I don't yet know how to do this) to deal with the fact that in the transaction tables not every transaction has a date directly to its left, only an implied date span. Could I make it so that, once loaded, every empty date cell is filled with the value of the cell directly above it if that one is filled, running this as a sort of loop? Or is there a better way to load the dates from the very start?
Pseudo code (language agnostic):
There is an alternative url construct that returns json via a querystring. The querystring has a start and end date.
http://lookup-service-prod.mlb.com/json/named.transaction_all.bam?start_date=20010101&end_date=20031231&sport_code=%27mlb%27
From testing with Python (so R mileage may vary; I will look to add an R example later), you can issue requests for two years at a time and get a json response with the rows of data in it; this was the more reliable time frame.
You could construct this in a loop from 2001 to 2018 with a step of 2, i.e.
intervals of
['2001-2002', '2003-2004', '2005-2006', '2007-2008', '2009-2010', '2011-2012', '2013-2014', '2015-2016', '2017-2018']
Then parse the json response for the data of interest.
Example row within json:
{"trans_date_cd":"D","from_team_id":"","orig_asset":"Player","final_asset_type":"","player":"Rafael Roque","resolution_cd":"FIN","final_asset":"","name_display_first_last":"Rafael Roque","type_cd":"REL","name_sort":"ROQUE, RAFAEL","resolution_date":"2001-03-14T00:00:00","conditional_sw":"","team":"Milwaukee Brewers","type":"Released","name_display_last_first":"Roque, Rafael","transaction_id":"94126","trans_date":"2001-03-14T00:00:00","effective_date":"2001-03-14T00:00:00","player_id":"136305","orig_asset_type":"PL","from_team":"","team_id":"158","note":"Milwaukee Brewers released LHP Rafael Roque."}
Note:
Non-bulk use of the Materials is permitted but bulk usage requires prior consent.
Python example:
import requests
for year in range(2001, 2018, 2):
    r = requests.get('http://lookup-service-prod.mlb.com/json/named.transaction_all.bam?start_date={0}0101&end_date={1}1231&sport_code=%27mlb%27'.format(year, year + 1)).json()
    print(len(r['transaction_all']['queryResults']['row']))  # just to demonstrate response content
This
len(r['transaction_all']['queryResults']['row'])
gives the number of rows/transactions of data per request (2 year period)
This yields transaction counts of:
[163, 153, 277, 306, 16362, 19986, 20960, 23352, 24732]
Here is an R alternative, similar to @QHarr's solution. The following function, get_data, takes year as an argument and fetches the data using year and year+1 as the start and end dates:
get_data <- function(year) {
  root_url <- 'http://lookup-service-prod.mlb.com'
  params_dates <- sprintf('start_date=%s0101&end_date=%s1231', year, year + 1)
  params <- paste('/json/named.transaction_all.bam?&sport_code=%27mlb%27', params_dates, sep = '&')
  js <- jsonlite::fromJSON(paste0(root_url, params))
  return(js)
}
get_processed_data <- function (year) get_data(year=year)$transaction_all$queryResults$row
The output js is of class list and the data is stored in $transaction_all$queryResults$row.
Finally, the same loop as in the other solution, printing out the number of rows of the output:
for (year in seq(2001, 2018, 2)) print(nrow(get_data(year)$transaction_all$queryResults$row))
# [1] 163
# [1] 153
# [1] 277
# [1] 306
# [1] 16362
# [1] 19986
# [1] 20960
# [1] 23352
# [1] 24732
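As for the final note in the question about empty date cells: if you do end up parsing the HTML table rather than the json feed (where each row already carries its date), a fill-down is the usual trick. A minimal sketch, assuming the scraped table sits in a data frame called transactions with a Date column in which missing dates are blank strings:
library(dplyr)
library(tidyr)

transactions <- transactions %>%
  mutate(Date = na_if(trimws(Date), "")) %>% # treat blank cells as NA
  fill(Date, .direction = "down")            # carry the last seen date downward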

API Query for loop

I'm trying to pull some data from an API and throw it all into a single data frame. I'm trying to put a variable into the URL I'm pulling from and then loop it to pull data from 54 keys. Here's what I have so far, with notes.
library("jsonlite")
library("httr")
library("lubridate")
options(stringsAsFactors = FALSE)
url <- "http://api.kuroganehammer.com"
### This gets me a list of 58 observations, I want to use this list to
### pull data for each using an API
raw.characters <- GET(url = url, path = "api/characters")
## Convert the results from unicode to a JSON
text.raw.characters <- rawToChar(raw.characters$content)
## Convert the JSON into an R object. Check the class of the object after
## it's retrieved and reformat appropriately
characters <- fromJSON(text.raw.characters)
class(characters)
## This pulls data for an individual character. I want to get one of
## these for all 58 characters by looping this and replacing the 1 in the
## URL path for every number through 58.
raw.bayonetta <- GET(url = url, path = "api/characters/1/detailedmoves")
text.raw.bayonetta <- rawToChar(raw.bayonetta$content)
bayonetta <- fromJSON(text.raw.bayonetta)
## This is the function I tried to create, but I get a lexical error when
## I call it, and I have no idea how to loop it.
move.pull <- function(x) {
  char.x <- x
  raw.x <- GET(url = url, path = cat("api/characters/", char.x, "/detailedmoves", sep = ""))
  text.raw.x <- rawToChar(raw.x$content)
  char.moves.x <- fromJSON(text.raw.x)
  char.moves.x$id <- x
  return(char.moves.x)
}
The first part of this:
library(jsonlite)
library(httr)
library(lubridate)
library(tidyverse)
base_url <- "http://api.kuroganehammer.com"
res <- GET(url = base_url, path = "api/characters")
content(res, as="text", encoding="UTF-8") %>%
fromJSON(flatten=TRUE) %>%
as_tibble() -> chars
Gets you a data frame of the characters.
This:
pb <- progress_estimated(length(chars$id))
map_df(chars$id, ~ {
  pb$tick()$print()
  Sys.sleep(sample(seq(0.5, 2.5, 0.5), 1)) # be kind to the free API
  res <- GET(url = base_url, path = sprintf("api/characters/%s/detailedmoves", .x))
  content(res, as="text", encoding="UTF-8") %>%
    fromJSON(flatten=TRUE) %>%
    as_tibble()
}, .id = "id") -> moves
Gets you a data frame of all the "moves" and adds the "id" for the character. You get a progress bar for free, too.
You can then either left_join() as needed (a minimal sketch is below), or group and nest the moves data into a separate list column. If you want that to begin with, you can use map() instead of map_df().
Leave in the time pause code. It's a free API and you should likely increase the pause times to avoid DoS'ing their site.
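For the left_join() route, a minimal sketch. Note that .id = "id" above records the iteration index as a character string, so it is mapped back to the real character id before joining (this assumes the characters come back in the order they were requested):
library(dplyr)

# translate the positional .id back to the actual character id, then join on it
moves_joined <- moves %>%
  mutate(id = chars$id[as.integer(id)]) %>%
  left_join(chars, by = "id")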
