How to check whether an XML node set is empty in R?

I'm writing a function that iterates over XML nodes in R; for this I've been looking for a predicate that tells me whether an XML node set is empty (something like isEmptyNodeSet).
In other words, a function that returns TRUE if a case like the following occurs:
library(magrittr)
library(rvest)
#> Loading required package: xml2
library(xml2)
"https://www.admin.ch/ch/d/gg/pc/ind2010.html" %>%
read_html() %>%
html_nodes("a.adminCHlink, div#spalteContentPlus h2 ~ ul") %>%
.[[1]] %>%
html_nodes("strong")
#> {xml_nodeset (0)}
Created on 2019-01-12 by the reprex package (v0.2.1)
Thanks so much in advance (and sorry if the answer is obvious, I'm an XML-rookie)!

Either define it yourself: is_empty <- function(x) length(x) == 0 (thanks @Chase).
Or use rlang::is_empty() or purrr::is_empty(), which do exactly the same thing.
The code then becomes:
library(magrittr)
library(rvest)
#> Loading required package: xml2
library(xml2)
"https://www.admin.ch/ch/d/gg/pc/ind2010.html" %>%
read_html() %>%
html_nodes("a.adminCHlink, div#spalteContentPlus h2 ~ ul") %>%
.[[1]] %>%
html_nodes("strong") %>%
rlang::is_empty()
#> [1] TRUE
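Inside the iterating function the question describes, the check can then act as a guard. A minimal sketch, continuing from the libraries loaded above; the has_no_strong() helper and the idea of skipping empty matches are my own assumptions about what the surrounding function does:
# hypothetical helper: TRUE when a node contains no <strong> children
has_no_strong <- function(node) {
  node %>%
    html_nodes("strong") %>%
    rlang::is_empty()
}

nodes <- "https://www.admin.ch/ch/d/gg/pc/ind2010.html" %>%
  read_html() %>%
  html_nodes("a.adminCHlink, div#spalteContentPlus h2 ~ ul")

for (i in seq_along(nodes)) {
  if (has_no_strong(nodes[[i]])) next  # skip nodes with an empty <strong> node set
  # ... process the nodes that do contain <strong> elements ...
}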

Related

Get maps location without clicking on it

I have a URL from Google which encodes a location as latitude and longitude, as shown below.
Currently the user has to click on the link to navigate to the map.
So I wanted to check whether we can have the location ready without clicking on it.
HTML('<a href="https://maps.google.com/?q=50.89091,14.86668">Google Maps</a>')
How about this:
library(rvest)
library(stringr)
# example anchor; the original href markup was stripped from the post, so the
# coordinates here are recovered from the output shown below
h <- read_html(htmltools::HTML('<a href="https://maps.google.com/?q=50.89091,14.86668">Google Maps</a>'))
h %>%
html_elements("a") %>%
html_attr("href") %>%
gsub(".*\\?q\\=(.*)$", "\\1", .) %>%
str_split(., ",", simplify=TRUE) %>%
as.numeric(.)
#> [1] 50.89091 14.86668
Created on 2022-12-29 by the reprex package (v2.0.1)
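If this needs to run over many such links, the same steps can be wrapped in a small helper that returns a named lat/lng pair. This is a sketch using the libraries loaded above; the extract_latlng() name and the named-vector output are my own choices, not from the answer:
# hypothetical helper: pull lat/lng out of an '<a href="...?q=lat,lng">' snippet
extract_latlng <- function(html_snippet) {
  coords <- read_html(html_snippet) %>%
    html_elements("a") %>%
    html_attr("href") %>%
    gsub(".*\\?q\\=(.*)$", "\\1", .) %>%
    str_split(., ",", simplify = TRUE) %>%
    as.numeric()
  c(lat = coords[1], lng = coords[2])
}
Applied to the anchor from the question, this should return c(lat = 50.89091, lng = 14.86668).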

How to find all Event IDs in an efficient way?

How could I crawl this database with rvest to identify all tournament IDs for each year? Currently, I'm just going from 1:max(event_id), which is really a drain on compute time.
https://www.worldloppet.com/results/
The results filter seems to be dynamic on the webpage, so the url doesn't change.
library(rvest)
library(stringr)

outlist <- list()
for (event_id in 2483:2570) {
  # update progress
  message('Retrieving Event ', event_id)
  race_url = paste0('https://www.worldloppet.com/browse/?id=', event_id)
  event_info = read_html(race_url) %>%
    html_nodes('h2') %>%
    .[1] %>%
    gsub('<br>', '<br> ', .) %>%
    gsub("<[^>]+>", "", .) %>%
    str_split(., ' ') %>%
    unlist()
  outlist <- c(outlist, list(c(event_id, event_info)))
  # pause between requests
  Sys.sleep(3)
}
You can extract all links containing the word browse from the HTML document:
library(tidyverse)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
read_html("https://www.worldloppet.com/results/") %>%
html_nodes("a") %>%
html_attr("href") %>%
as.character() %>%
keep(~ .x %>% str_detect("browse")) %>%
paste0("https://www.worldloppet.com",.)
#> [1] "https://www.worldloppet.com/browse/?id=2570"
#> [2] "https://www.worldloppet.com/browse/?id=1818"
#> [3] "https://www.worldloppet.com/browse/?id=1817"
#> [4] "https://www.worldloppet.com/browse/?id=2518"
#> [5] "https://www.worldloppet.com/browse/?id=2517"
Created on 2022-02-09 by the reprex package (v2.0.1)
The event IDs can be found in the links, which can be extracted using the html_attr function. From there we can use some regex to pull out the numbers. Here I anchor the pattern on id= to make sure the match really is an event ID, as I'm not sure whether you also want to include links like masters=9173.
library(rvest)
library(stringi)
url <- "https://www.worldloppet.com/results/"
page <- read_html(url)
string <- html_attr(html_elements(page, "a"), "href")
matches <- stri_extract_all_regex(string, "(?<=id=).*", simplify = T)
as.integer(matches[!is.na(matches)])
# first 5
[1] 2570 1818 1817 2518 2517
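Either way, once the valid IDs have been extracted, the loop from the question only needs to visit those pages instead of guessing a numeric range. A small sketch reusing the matches object from above; ids and race_urls are my own names:
# keep only the hits that really were id= links
ids <- as.integer(matches[!is.na(matches)])

# build the event URLs up front, then loop (or purrr::map) over these
# instead of 2483:2570, keeping the Sys.sleep() between requests
race_urls <- paste0("https://www.worldloppet.com/browse/?id=", ids)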

In R's {collapse} package, how to achieve what dplyr::distinct() or base::unique.data.frame() do?

I'm new to the {collapse} R package, trying to find how to do the same thing that dplyr::distinct() or base::unique.data.frame() do: get the unique combinations over several columns.
For example:
library(babynames)
library(dplyr, warn.conflicts = FALSE)
library(collapse, warn.conflicts = FALSE)
#> collapse 1.6.5, see ?`collapse-package` or ?`collapse-documentation`
#> Note: stats::D -> D.expression, D.call, D.name
via_distinct <- babynames %>% distinct(sex, name)
via_collapse <- babynames %>% collap(sex ~ name)
nrow(via_distinct)
#> [1] 107973
nrow(via_collapse)
#> [1] 97310
Created on 2021-08-19 by the reprex package (v2.0.0)
The row counts alone show that via_collapse isn't giving the same output as via_distinct. Clearly, I'm not using collap() correctly, or otherwise there should be a different way to use {collapse} tools for this task.
Any idea?
collapse has funique(). Note that collapse does not aim to be comprehensive; rather, it provides functions where it can offer a performance advantage, so in general you can't expect a drop-in replacement for every function.
via_distinct <- babynames %>% distinct(sex, name) %>% arrange(sex, name)
via_collapse <- babynames %>% slt(sex, name) %>% funique(sort = TRUE)
identical(via_distinct, via_collapse)
## [1] TRUE
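If you want to keep all the other columns as well (the equivalent of distinct(sex, name, .keep_all = TRUE)), funique()'s data.frame method should also accept a cols argument that determines uniqueness by a subset of columns; treat the exact argument as an assumption and check ?funique for your installed version:
# assumption: funique.data.frame(x, cols = ...) keeps the first row per
# sex/name combination while retaining all columns of babynames
via_collapse_keep_all <- babynames %>% funique(cols = c("sex", "name"))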

Scraping 'Artsy' using rvest

I am trying to get information from Artsy using the rvest package in R. I want the name of the painting, the year, the price, the place (name of gallery, auction, etc.), the name of the artist, and the materials used. The material information is provided on the detail page of each painting. The code I tried is provided below:
library(rvest)
library(dplyr)
library(tidyverse)
get_material = function (painting_link) {
painting_page = read_html (painting_link)
material = painting_page %>% html_nodes('h2+ .kPqROo') %>%
html_text() %>% paste(collapse = ",")
return(material)
}
for(page_result in 2:3) {
link = paste0 ("https://www.artsy.net/collect?page=", page_result, "&additional_gene_ids%5B0%5D=painting")
page = read_html(link)
painting_name_year = page %>% html_nodes("#main .kjRHrZ") %>% html_text()
painting_link = page %>% html_nodes('#main .kjRHrZ') %>% html_attr("<div color="black60" font-family="sans" class="Box-sc-15se88d-0 Text-sc-18gcpao-0 kjRHrZ">\n<i>") %>% paste("https://www.artsy.net", ., sep="/")
price = page %>% html_nodes('.ibabyz') %>% html_text()
place = page %>% html_nodes('hWKLzd') %>% html_text()
artist = page %>% html_nodes('.bQOCym .bQOCym') %>% html_text()
material = sapply(painting_link, FUN=get_material, USE.NAMES = FALSE)
}
artsy <- data.frame(painting_name_year, price, place, artist)
view(artsy)
The code for painting_link, place, and material is not working. Moreover, each observation is repeated three times. How can I fix this problem?
You can remove the loop. First generate the list of start URLs. Then, rather than scraping some info from the landing pages before visiting the individual listing pages, gather all the URLs of the individual listings up front.
You can then gain a little efficiency by working across more CPU cores, gathering the data you want from all the listings via a function call against each URL.
N.B. As this operation is I/O-bound, you would likely see better efficiency with an asynchronous method. If I can find a decent tutorial/reference on this, I may update this answer.
If the function returns a tibble of the desired info for each listing URL, you can generate the final data frame by calling future_map_dfr() over the listing links with that user-defined function.
library(purrr)
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.3
#> Warning: package 'forcats' was built under R version 4.0.3
library(jsonlite)
#> Warning: package 'jsonlite' was built under R version 4.0.3
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
library(furrr)
#> Warning: package 'furrr' was built under R version 4.0.3
#> Loading required package: future
#> Warning: package 'future' was built under R version 4.0.3
library(stringr)
get_art_links <- function(link) {
hrefs <- read_html(link) %>%
html_nodes("[href*=artwork][class]") %>%
html_attr("href") %>%
paste0("https://www.artsy.net", .)
return(hrefs)
}
get_listing_json <- function(page) {
data <- page %>%
html_node('[type="application/ld+json"]') %>%
html_text() %>%
jsonlite::parse_json()
return(data)
}
get_listing_info <- function(link) {
page <- read_html(link)
json <- get_listing_json(page)
artist <- json$brand$name
title <- page %>%
html_node('[data-test="artworkSidebar"] h2 > i') %>%
html_text()
production_date <- json$productionDate
material <- page %>%
html_node('[data-test="artworkSidebar"] h2 + div') %>%
html_text()
width <- json$width
height <- json$height
place <- stringr::str_match(json$description, "from (.*?),")[, 2]
price <- json$offers$price
currency <- json$offers$priceCurrency
availability <- str_replace(json$offers$availability, "https://schema.org/", "")
return(tibble(artist, title, production_date, material, width, height, place, price, currency, availability))
}
pages <- 2:3 %>% as.character()
urls <- sprintf("https://www.artsy.net/collect?page=%s&additional_gene_ids[0]=painting", pages)
links <- purrr::map(urls, get_art_links) %>%
unlist()
no_cores <- future::availableCores() - 1
future::plan(future::multisession, workers = no_cores)
results <- future_map_dfr(links, .f = get_listing_info)
Created on 2021-05-16 by the reprex package (v0.3.0)
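One optional hardening step (my addition, not part of the answer above): because each listing is fetched over the network, a single failing or restructured page would otherwise abort the whole run, so the user-defined function can be wrapped in purrr::possibly() before handing it to future_map_dfr():
# return an empty tibble instead of an error when a single listing fails
safe_listing_info <- purrr::possibly(get_listing_info, otherwise = tibble())

results <- future_map_dfr(links, .f = safe_listing_info)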

purrr: How to pluck a list element's contents (instead of the element)

In cases where I have a list of tibbles, I sometimes purrr::keep() multiple elements and then combine them using reduce() to end up with a tibble. However, when I purrr::pluck() or purrr::keep() only one element, doing a reduce doesn't make sense. What is the best way to get at the tibble, rather than the list element containing the tibble?
I've found that doing keep() %>% enframe() %>% unnest() works in some cases but not others, but it seems messy regardless.
Other attempts:
require(tidyverse)
#> Loading required package: tidyverse
mylist <- lst(cars, diamonds)
mylist %>% pluck("cars") %>% typeof
#> [1] "list"
mylist %>% pluck("cars") %>% as.tibble() %>% typeof
#> [1] "list"
mylist %>% pluck("cars")[1]
#> Error in .[pluck("cars"), 1]: incorrect number of dimensions
mylist %>% pluck("cars")[[1]]
#> Error in .[[pluck("cars"), 1]]: incorrect number of subscripts
mylist %>% pluck("cars") %>% unlist() %>% typeof
#> [1] "double"
mylist %>% pluck("cars") %>% unnest() %>% typeof
#> [1] "list"
mylist %>% pluck("cars") %>% flatten %>% typeof
#> [1] "list"
Created on 2018-03-20 by the reprex package (v0.2.0).
The goal is to get the tibble.
Nothing like doing a bunch of reprex to solve your own (stupid) question!
Here's the deal for anyone else:
1) All tibbles return "list" from typeof() (this is what threw me off).
2) keep() is for multiple elements, so to get to a tibble you'll have to do some kind of reduction (bind_rows(), left_join(), etc.).
3) pluck() is for single elements and returns the element's contents.
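For instance, a minimal sketch of points 2) and 3) (my own illustration, using the mylist from the question):
library(tidyverse)
mylist <- lst(cars, diamonds)

# pluck() targets a single element and returns its contents directly
mylist %>% pluck("cars") %>% class()
#> [1] "data.frame"

# keep() always returns a list, even when only one element survives the
# predicate, so a reduction (here bind_rows) is needed to get a data frame back
mylist %>% keep(~ nrow(.x) < 100) %>% reduce(bind_rows) %>% class()
#> [1] "data.frame"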
