Extract table from - r

I would like to extract the following table using rvest from http://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp (for any date):
I tried the following but failed to produce any result:
library(rvest)
url <- "http://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp"
htmlSession <-html_session(url) ## create session
goForm <- html_form(htmlSession)[[2]] ## pull form from session
#filledGoForm <- set_values(goForm, value="04/26/2017") # This does not work
filledGoForm <- goForm
filledGoForm$fields[[1]]$value <- "04/26/2017"
htmlSession <- submit_form(htmlSession, filledGoForm)
> htmlSession <- submit_form(htmlSession, filledGoForm)
Submitting with ''
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Not Found (HTTP 404).
Any hints on how to do this highly appreciated.

That site uses many XHR requests to populate the tables. And, it establishes a server session with a hidden POST request which won't be replicated with html_session().
We'll need to add in httr for some help:
library(httr)
library(rvest)
The first thing we need to do is to just hit the site to get an initial qs_wid cookie into the implicit cookie jar curl/httr/rvest share:
init <- GET("http://finra-markets.morningstar.com/MarketData/Default.jsp")
Next, we need to mimic the hidden "login" that the web page does:
nxt <- POST(url = "http://finra-markets.morningstar.com/finralogin.jsp",
body = list(redirectPage = "/BondCenter/TRACEMarketAggregateStats.jsp"),
encode = "form")
That creates a session on the server back-end and places a few other cookies in our cookie jar.
Finally:
GET(
url = "http://finra-markets.morningstar.com/transferPage.jsp",
query = list(
`path`="http://muni-internal.morningstar.com/public/MarketBreadth/C",
`date`="04/24/2017",
`_`=as.numeric(Sys.time())
)
) -> res
makes the request. You can make a function out of all three steps (together) and parameterize that last GET.
Unfortunately, that returns a very broken HTML <table> that html_table() can't translate into a data frame automagically for you, but that shouldn't stop you:
content(res) %>%
html_nodes("td") %>%
html_text() %>%
matrix(ncol=4, byrow=TRUE) %>%
as_data_frame() %>%
mutate_all(as.numeric) %>%
rename(all_issues=V1, investment_grade=V2, high_yield=V3, convertible=V4) %>%
mutate(category = c("total_issues_traded", "advances", "declines", "unchanged", "high_52", "low_52", "dollar_volume"))
## # A tibble: 7 × 5
## all_issues investment_grade high_yield convertible category
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 7983 5602 2194 187 total_issues_traded
## 2 3025 1798 1100 127 advances
## 3 4448 3575 824 49 declines
## 4 124 42 75 7 unchanged
## 5 257 66 175 16 high_52
## 6 139 105 33 1 low_52
## 7 22601 16143 5742 715 dollar_volume
To get the other data tables, go to the Developer Tools option in your browser (switch to one that has it if yours doesn't … you're likely on Windows given that you're doing finance things and IE/Edge aren't very good browsers for introspection) and refresh the page to see the other requests that get made.

Related

curl error (Could not resolve host: NA) while scraping in a loop

While this code for scraping prices from a webshop has worked perfectly fine for me over the last months, today I just got the following error message:
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: NA
The code i use is as follows:
This part is for getting the full url's:
#Scrape Galaxus
vec_galaxus<-vector()
i=0
input_galaxus <- input %>%
filter(`Galaxus Artikel`!=0)
input_galaxus2<-paste0('https://www.galaxus.ch/',input_galaxus$`Galaxus Artikel`)
This is the scraping loop:
sess <- session(input_galaxus2[1]) #to start the session
for (j in input_galaxus2){
sess <- sess %>% session_jump_to(j) #jump to URL
i=i+1
try(vec_galaxus[i] <- read_html(sess) %>% #can read direct from sess
html_nodes('.sc-1aeovxo-1.gvrGle') %>%
html_text()%>%
str_extract("[0-9]+") %>%
as.integer())
Sys.sleep(runif(1, min=0.2, max=0.5))
}
where part of my input "input_galaxus2" looks like this:
c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734",
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274",
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276",
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373",
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626",
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785",
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
Does anybody know why suddenly this code gives me the above error message?
Thanks in advance for your responses!
If it were a different error, I'd think it was throttling, but this error does not really support that. However, to troubleshoot that (and you hitting too-many-hits limits on the server), try introducing a delay between pulls, perhaps a few seconds or a minute, just to see if that resolves things.
Here's a method that will allow to you repeat your code until all URLs are pulled without error. Note that this may also need the "delay" I suggested above in order to not anger the server admins on the remote end (or firewall or whatever).
Create a list in which we'll store the results. Run this code only once, all the remaining bullets in the list should be repeatable without consequence.
out <- vector("list", length(input_galaxus2))
Prep the session. This may be repeatable depending on if you have authentication or other attributes.
sess <- session(input_galaxus2[1]) #to start the session
Iterate over the empty elements of your URLs and query as needed. If you get any errors, feel free to wait a little bit and re-run this code. If a particular URL succeeded, it will not be re-attempted, so repeat as needed, eventually (assuming the failures are intermittent and all URLs are value) you will get all results.
I don't think you need read_html in this pipe, but I'm not testing for fear of "slashdotting" the website. The point of this answer is to suggest a mechanism that allows you to reattempt efficiently.
empties <- which(sapply(out, is.null))
for (i in empties) {
res <- tryCatch({
sess %>%
session_jump_to(input_galaxus2[i]) %>%
html_nodes('.sc-1aeovxo-1.gvrGle') %>%
html_text() %>%
str_extract("[0-9]+") %>%
as.integer()
}, error = function(e) e)
if (inherits(res, "error")) {
warning(sprintf("failed (%i, %s): %s", i, input_galaxus2[i], conditionMessage(e)))
# optional
Sys.sleep(3)
} else out[[i]] <- res
}
Note: this assumes that a NULL value means the previous attempt failed, was interrupted, or ... was not attempted. If NULL can be a valid and successful return value from your pull, then you should likely prefill out with some other "canary" value: choose something that you are more confident will "never" appear in real results, and change how you define empties above.
Using purrr::map instead of loop, without any Sys.sleep().
library(tidyverse)
library(rvest)
df <- tibble(
links = c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734",
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274",
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276",
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373",
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626",
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785",
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
)
get_prices <- function(link) {
link %>%
read_html() %>%
html_nodes(".sc-1aeovxo-1.gvrGle") %>%
html_text2() %>%
str_remove_all("–")
}
df %>%
mutate(price= map(links, get_prices) %>%
as.numeric)
# A tibble: 14 × 2
links price
<chr> <dbl>
1 "https://www.galaxus.ch/15758734" 17.8
2 "https://www.galaxus.ch/7362734" 500.
3 "https://www.galaxus.ch/12073455" 173
4 "https://www.galaxus.ch/20841274" 112
5 "https://www.galaxus.ch/20589944 " 25.4
6 "https://www.galaxus.ch/13595276" 313
7 "https://www.galaxus.ch/16255768" 40
8 "https://www.galaxus.ch/6296373" 62.9
9 "https://www.galaxus.ch/14513900" 539
10 "https://www.galaxus.ch/14465626" 466.
11 "https://www.galaxus.ch/10592707" 63.5
12 "https://www.galaxus.ch/19958785" NA
13 "https://www.galaxus.ch/9858343" 7.3
14 "https://www.galaxus.ch/14513913" 617

Can't extract the price variable from this website? (suspect span within a span in the codes)

I have tried several ways but it usually ends up with blank or N/A only.
library('rvest')
library(stringr)
url <- 'https://www.kimovil.com/en/compare-smartphones/f_min_dm+unveileddate.3,i_b+slug.samsung'
webpage <- read_html(url)
device_cost_html <- html_nodes(webpage,'.price')
device_cost <- html_text(device_cost_html)
device_cost <- as.numeric(device_cost)
This is not a static webpage that can be scraped with rvest. The span elements actually are empty on the requested html document. What happens in your web browser is that when the html document is loaded, the browser reads the javascript code on the page, which generates further requests to the server. These requests return the actual data in json format, which the javascript code then uses to populate the empty span elements.
The reason why this doesn't work in rvest is that it has no facility to run the javascript on the page. It just returns the original empty html spans.
However, all is not lost. Using the console in your browser's developer tools, you can find the url of the json that contains the data and just request that directly. In your case, this is surprisingly straightforward:
json <- httr::GET("https://www.kimovil.com/uploads/last_prefetch.json")
all_phones <- httr::content(json, "parsed")
df <- do.call(rbind, lapply(all_phones$smartphones, function(x) {
data.frame(name = x$full_name, price_usd = paste("$", x$usd))}))
head(df)
#> name price_usd
#> 1 Realme GT Neo 2 $ 489
#> 2 Google Pixel 6 $ 754
#> 3 OnePlus 9RT $ 577
#> 4 Google Pixel 6 Pro $ 1044
#> 5 Apple iPhone 13 Pro Max $ 1432
#> 6 Apple iPhone 13 Pro $ 31
Created on 2021-10-31 by the reprex package (v2.0.0)

How to access Youtube Data API v3 with R

I am trying to use R to retrieve data from the YouTube API v3 and there are few/no tutorials out there that show the basic process. I have figured out this much so far:
# Youtube API query
base_url <- "https://youtube.googleapis.com/youtube/v3/"
my_yt_search <- function(search_term, max_results = 20) {
my_api_url <- str_c(base_url, "search?part=snippet&", "maxResults=", max_results, "&", "q=", search_term, "&key=",
my_api_key, sep = "")
result <- GET(my_api_url)
return(result)
}
my_yt_search(search_term = "salmon")
But I am just getting some general meta-data and not the search results. Help?
PS. I know there is a package 'tuber' out there but I found it very unstable and I just need to perform simple searches so I prefer to code the requests myself.
Sadly there is no way to directly get the durations, you'll need to call the videos endpoint (with the part set to part=contentDetails) after doing the search if you want to get those infos, however you can pass as much as 50 ids in a single call thus we can save some time by pasting all the ids together.
library(httr)
library(jsonlite)
library(tidyverse)
my_yt_duration <- function(...){
my_api_url <- paste0(base_url, "videos?part=contentDetails", paste0("&id=", ..., collapse=""), "&key=",
my_api_key )
GET(my_api_url) -> resp
fromJSON(content(resp, "text"))$items %>% as_tibble %>% select(id, contentDetails) -> tb
tb$contentDetails$duration %>% tibble(id=tb$id, duration=.)
}
### getting the video IDs
my_yt_search(search_term = "salmon")->res
## Converting from JSON then selecting all the video ids
# fromJSON(content(res,as="text") )$items$id$videoId
my_yt_duration(fromJSON(content(res,as="text") )$items$id$videoId) -> tib.id.duration
# A tibble: 20 x 2
id duration
<chr> <chr>
1 -x2E7T3-r7k PT4M14S
2 b0ahREpQqsM PT3M35S
3 ROz8898B3dU PT14M17S
4 jD9VJ92xyzA PT5M42S
5 ACfeJuZuyxY PT3M1S
6 bSOd8r4wjec PT6M29S
7 522BBAsijU0 PT10M51S
8 1P55j9ub4es PT14M59S
9 da8JtU1YAyc PT3M4S
10 4MpYuaJsvRw PT8M27S
11 _NbbtnXkL-k PT2M53S
12 3q1JN_3s3gw PT6M17S
13 7A-4-S_k_rk PT9M37S
14 txKUTx5fNbg PT10M2S
15 TSSPDwAQLXs PT3M11S
16 NOHEZSVzpT8 PT7M51S
17 4rTMdQzsm6U PT17M24S
18 V9eeg8d9XEg PT10M35S
19 K4TWAvZPURg PT3M3S
20 rR9wq5uN_q8 PT4M53S

How do I use rvest to sort text into different columns?

I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs, which I'm using to scrape affiliation data. However, each time I try, it gives me the 404 error: Error in open.connection(x, "rb") : HTTP error 404
It must be an issue with my use of sapply because when I test it using an individual ID, it works. Here is the code I'm using:
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500")
df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html")
df$affiliation <- sapply(df$websites, function(x) try(x %>% read_html %>% html_nodes("#affiliation h3") %>% html_text()))
I actually need to do this for six columns of authors and there are NA values I'd like to skip so if anyone knows how to do that as well, I would be enormously grateful (but not a big deal if I not). Thank you in advance for your help!
EDIT: I have just discovered that the error is in the formula for the websites. Sometimes it should be df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html") and sometimes it should be df$websites <- paste0("https://ideas.repec.org/f/", df$author_reg, ".html")
Does anyone know how to get R to try both and give me the one that works?
You can have the two links and use try on bottom of them. I am assuming there is only 1 that would give a valid website. Otherwise we can always edit the code to take in everything that works:
library(rvest)
library(purrr)
df = data.frame(id=1:6)
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation <- sapply(df$author_reg, function(x){
links = c(paste0(http1, x, ".html"),paste0(http2, x, ".html"))
# here we try both links and store under attempt
attempts = links %>% map(function(i){
try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
})
# the good ones will have "character" class, the failed ones, try-error
gdlink = which(sapply(attempts,class) != "try-error")
if(length(gdlink)>0){
return(attempts[[gdlink[1]]])
}
else{
return("True 404 error")
}
})
Check the results:
df
id author_reg
1 1 paa6
2 2 paa2
3 3 paa1
4 4 paa8
5 5 pve266
6 6 pya500
affiliation
1 Statistisk SentralbyråGovernment of Norway
2 Department of EconomicsCollege of BusinessUniversity of Wyoming
3 (80%) Institutt for ØkonomiUniversitetet i Bergen, (20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen
4 Centraal Planbureau (CPB)Government of the Netherlands
5 Department of FinanceRotterdam School of Management (RSM Erasmus University)Erasmus Universiteit Rotterdam
6 Business SchoolSwinburne University of Technology

show the map using mapview include multibyte character in data.frame

I want to show the data using mapview package.
but include multibyte character, sometime cannot show the map.
What would be the best thing to show the map?
library(mapview)
data(atlStorms2005)
test1 <- test2 <- atlStorms2005
test1#data$test <- as.factor(c("日本語", "てすと"))
test2#data$test <- as.factor(c("日本語", "五十嵐"))
mapview(test1) # can show the map
mapview(test2) # cannot show
re.data.frame <- function(data, encoding = "UTF-8", fileEncoding="UTF-8"){
write.csv(data, file("tmp.csv", encoding = encoding), row.names = F, fileEncoding=fileEncoding)
tmp <- readr::read_csv("tmp.csv", col_types = cols())
return(tmp)
}
test2#data <- re.data.frame(test2#data)
mapview(test2) # can show
but,the popup in test colum character is corrupted text.
data is correct.
head(test2#data)
# A tibble: 6 × 4
Name MaxWind MinPress test
<chr> <int> <int> <chr>
1 ALPHA 45 998 日本語
2 ARLENE 60 989 五十嵐
3 BRET 35 1002 日本語
4 CINDY 65 991 五十嵐
5 DELTA 60 980 日本語
6 DENNIS 130 930 五十嵐
As of commit bc2c57f, this should have been fixed. Until the next CRAN release of mapview, simply use the development version (devtools::install_github("environmentalinformatics-marburg/mapview", ref = "develop")) to solve this issue.
In brief, this behavior was related to our Rcpp routines which run under the hood in order to ensure a computationally efficient creation of popup tables. Here, the user's native encoding was used instead of UTF-8 to create JSON output files, resulting in corrupted text output on some machines where UTF-8 was not the default.

Resources