Learning web scraping, unable to understand: html_nodes("table") %>% `[[`(6) - r

I am learning web scraping in R and have written the following code:
url <- "https://en.wikipedia.org/wiki/World_population"
library(rvest)
library(tidyr)
library(dplyr)
ten_most_df <- read_html(url)
ten_most_populous <- ten_most_df %>%
  html_nodes("table") %>%
  `[[`(6) %>%
  html_table()
In the above code, what does `[[`(6) represent?
I have referred to some documentation for this, where the following text is written, but I am not getting clarity on it:
"For vectors and matrices the [[ forms are rarely used, although they have some slight semantic
differences from the [ form (e.g. it drops any names or dimnames attribute, and that partial
matching is used for character indices)"
Could you please explain this? It would be very helpful. Thanks.

It's just one way of selecting the 6th element from the nodeset.
The code ten_most_df %>% html_nodes("table") returns an xml_nodeset object with 26 elements, corresponding to the 26 tables on the page. `[[`(6) then subsets that object and returns the 6th node.
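In R, `[[` is itself a function, so the pipe simply calls it with the nodeset as its first argument and 6 as its second. A quick check that the two forms are the same:
tables <- ten_most_df %>% html_nodes("table")
# `[[`(tables, 6) is the functional form of tables[[6]]
identical(tables %>% `[[`(6), tables[[6]])
#> [1] TRUE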
In fact there's a quicker way using only html_table, which returns the tables in a list:
ten_most_df %>%
  html_table() %>%
  .[[6]]
Personally I find this a little easier to read; the . represents the list and [[n]] is the standard way to access list element number n.
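The same dot-and-[[ pattern works on any list (the pipe here comes from the dplyr loaded above), for example:
lst <- list("a", "b", "c")
lst %>% .[[2]]
#> [1] "b"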

Related

Web scraping multiple nodes from the same level with the same names

I would like to extract the following data from four nodes, all at the same level and sharing the same class name.
I was able to extract the first of the four nodes, Property Amenities, using the Google Chrome SelectorGadget to identify the nodes.
library(rvest)
page0_url <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html")
result_amenities <- html_text(html_node(page0_url, "._1nAmDotd") %>% html_nodes("div"))
However, I cannot figure out how to write the code to extract the elements within the second object, named "Room Features". This is at the same node level and has the same class name as the one above. The same applies to the two objects that follow it, named "Room types" and "Good to know".
You need to query all of the nodes with the same class using the html_nodes() function, then parse each of those nodes individually.
For example:
library(rvest)
url <- "https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html"
page0_url <- read_html(url)
result_amenities <- html_text(html_nodes(page0_url, "._1nAmDotd") %>% html_nodes("div"))
# the section headings ("Property Amenities", "Room Features", ...)
names <- html_nodes(page0_url, "div._1mJdgpMJ") %>% html_text()
# one node per section; parse each section's divs individually
groupNodes <- html_nodes(page0_url, "._1nAmDotd")
outputlist <- lapply(groupNodes, function(node) {
  node %>% html_nodes("div") %>% html_text()
})
On the reference page there is no corresponding "_1nAmDotd" node for the "Good to Know" section, leading to an imbalance in the results.
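Given that mismatch, a cautious way to label the groups is to assume they appear in the same document order as the headings and to name only as many groups as were actually found:
# assumes the groups appear in the same order as the headings;
# only as many labels as groups are applied
names(outputlist) <- names[seq_along(outputlist)]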
Almost all of the desirable data (including everything you requested) is available via the page manifest, within a script tag, as that is where the page loads it from. You can extract that enormous amount of data with a regex, then write user-defined functions to pull out the desired info.
I initially parse the regex-matched group into a json object, all_data. I then look through that list of lists to find strings associated only with the data of interest. For example, starRating is associated with the location data you are interested in. get_target_list returns that list, and I then extract from it what I want. You can see that location_info holds the data related to hotel amenities (including room amenities), the star rating (hotel class), languages spoken, etc.
E.g. location_info$hotelAmenities$languagesSpoken or location_info$hotelAmenities$highlightedAmenities$roomFeatures, and so on.
N.B. As currently written, it is intended that search_string is unique to the desired list within the list of lists initially held in the json object. I wasn't sure whether the names of the named lists would remain constant, so I chose to dynamically retrieve the right list.
R:
library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)

# TRUE if the stringified list element contains the search string
is_target_list <- function(x, search_string) {
  str_detect(x %>% toString(), search_string)
}

# keep only the list elements that contain the search string
get_target_list <- function(data_list, search_string) {
  mask <- lapply(data_list, is_target_list, search_string) %>% unlist()
  subset(data_list, mask)
}

r <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html") %>%
  toString()

# pull the pageManifest object out of the script tag and parse it as JSON
all_data <- gsub("pageManifest:", '"pageManifest":', stringr::str_match(r, "(\\{pageManifest:.*);\\(")[, 2]) %>%
  jsonlite::parse_json()

data_list <- all_data$pageManifest$urqlCache

# target_info <- get_target_list(data_list, 'hotelAmenities')
location_info <- get_target_list(data_list, "starRating") %>%
  unname() %>%
  .[[1]] %>%
  {
    .$data$locations[[1]]$detail
  }
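A quick way to inspect what came back, using the fields described above (field names are as observed at the time of writing; the page structure may change):
# inspect the top level of the amenities data
str(location_info$hotelAmenities, max.level = 1)
# the fields mentioned above:
location_info$hotelAmenities$languagesSpoken
location_info$hotelAmenities$highlightedAmenities$roomFeatures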
Regex: the pattern used above is "(\\{pageManifest:.*);\\(", which captures everything from the opening of the pageManifest object up to the terminating ";(".

rvest: cannot extract node using html_nodes with xpath and a regular expression

I am doing some web scraping with rvest and stringr and am having a problem I could not yet find a solution for on Stack Overflow.
I want to extract a specific node that contains a combination of words and numbers across a large number of documents.
Because the information is at a different node in each of the documents, I want to locate it using the [contains(text(), '')] predicate from XPath and entering a regular expression '\\d{1,4}' there.
The following example illustrates where the code does not work or where I am doing it wrong:
library(tidyverse)
library(RSelenium)
library(rvest)
library(stringr)
library(lubridate)
library(rstudioapi)
rep_url <- "https://de.wikipedia.org/wiki/Eintracht_Braunschweig"
parsed <- read_html(rep_url)
I want to extract "15. Dezember 1895", but when I use an XPath combined with a regex, [contains(text(), '\\d{1,4}')], I get an empty character vector, where I should get two character strings. The code only works when I enter the concrete numbers.
number <- parsed %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/table[1]/tbody/tr[7]/td[contains(text(), '\\d{1,4}')]") %>%
  html_text()
number
#> character(0)
number <- parsed %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/table[1]/tbody/tr[7]/td[contains(text(), '1895')]") %>%
  html_text()
number
#> [1] "15. Dezember 1895\n"
What am I missing out here? What do I do wrong such that I cannot combine this regular expression with my xpath?
The above example illustrates the problem, and as there are probably other ways to extract the "15. Dezember 1895", I emphasise that it is important that I find a way to combine XPath with regex successfully.
Any advice is much appreciated, since I could not yet find a similar problem on Stack Overflow or other webpages.
Best regards
Mio
To use regular expressions in an XPath predicate you need matches() instead of contains(). However, matches() is an XPath 2.0 function, and rvest (via libxml2) supports only XPath 1.0, so it isn't available here.
Would something like this work instead?
library(rvest)
rep_url <- "https://de.wikipedia.org/wiki/Eintracht_Braunschweig"
parsed <- read_html(rep_url)
parsed %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/table[1]/tbody/tr[7]/td") %>%
  html_text() %>%
  grep('\\d{1,4}', ., value = TRUE)
#[1] "15. Dezember 1895\n"
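The same idea works with stringr, which also makes it easy to trim the trailing newline (a sketch; assumes stringr is loaded):
library(stringr)
parsed %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/table[1]/tbody/tr[7]/td") %>%
  html_text() %>%
  str_subset("\\d{1,4}") %>%
  str_trim()
#> [1] "15. Dezember 1895"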

Scraping tabulated data from ballotpedia.org with rvest

I'm trying to scrape tabulated data on previous US statewide election results, and I think ballotpedia.org is a good place to get this data from, as its URLs follow a consistent format for all states.
Here's the code I set up to test it:
library(dplyr)
library(rvest)
# STEP 1 - URL COMPONENTS TO SCRAPE FROM
senate_base_url <- "https://ballotpedia.org/United_States_Senate_elections_in_"
senate_state_urls <- gsub(" ", "_", state.name)
senate_year_urls <- c(",_2012", ",_2014", ",_2016")
# TEST
test_url <- paste0(senate_base_url, senate_state_urls[10], senate_year_urls[2])
This results in the following URL: https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014
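As an aside, to build the full set of URLs for every state/year combination, one option is (a sketch using the vectors defined above):
# every state crossed with every year; paste0 recycles the shorter vector
all_urls <- paste0(senate_base_url,
                   rep(senate_state_urls, each = length(senate_year_urls)),
                   senate_year_urls)
length(all_urls)  # 50 states x 3 years = 150 URLs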
Using the 'SelectorGadget' Chrome plugin, I selected the table containing the election result and tried parsing it into R as follows:
test_data <- read_html(test_url)
test_data <- test_data %>%
  html_node(xpath = '//*[@id="collapsibleTable0"]') %>%
  html_table()
However, I'm getting the following error:
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
Furthermore, the R object test_data yields a list with 2 empty elements.
Can anyone tell me what I'm doing wrong here? Is the html_table() function the wrong one? Using html_text() simply returns an NA character vector. Any help would be greatly appreciated, thanks very much :).
Your XPath statement doesn't match anything on the page, so the html_node function returns the xml_missing object seen in the error.
Here is a solution using the HTML tags instead: "look for a table tag within a center tag".
library(rvest)
test_data <- read_html(test_url)
tables <- test_data %>% html_nodes("center table") %>% html_table()
Or to retrieve the fully collapsed table, use the table tag with its class name (note that test_data must still hold the parsed page here, not the list of tables):
collapsedtable <- test_data %>% html_nodes("table.collapsible") %>%
  html_table(fill = TRUE)
This works for me:
library(httr)
library(XML)
r <- httr::GET("https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014")
XML::readHTMLTable(rawToChar(r$content))[[2]]
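For comparison, a roughly equivalent rvest version, assuming (as the [[2]] above does) that the target is the second table on the page:
library(rvest)
read_html("https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014") %>%
  html_table() %>%
  .[[2]]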

Calling a column from a csv file to extract its data

I imported the csv file that I want to use into R. Here, I am trying to call one of the columns from the csv file; this column, titled "URLs", holds a list of URLs. I then want my code to scrape data from each URL. In short, I want a more efficient way than listing all the URLs in a c() function, since I have about 200 links.
https://www.nytimes.com/2018/04/07/health/health-care-mergers-doctors.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/11/well/move/why-exercise-alone-may-not-be-the-key-to-weight-loss.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/07/health/antidepressants-withdrawal-prozac-cymbalta.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/well/why-you-should-get-the-new-shingles-vaccine.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/health/fda-essure-bayer-contraceptive-implant.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/health/hot-pepper-thunderclap-headaches.html?rref=collection%2Fsectioncollection%2Fhealth
The error appears when running this: article <- links %>% map(read_html).
It gives me this message:
(Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "factor")
Here is the code:
setwd("C:/Users/Majed/Desktop")
library(rvest)   # read_html, html_node(s), html_text
library(purrr)   # map, map_chr
d <- read.csv("NYT.csv")
d
links <- d$URLs
article <- links %>% map(read_html)
title <-
  article %>% map_chr(. %>% html_node("title") %>% html_text())
content <-
  article %>% map_chr(. %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = ""))
article_table <- data.frame("Title" = title, "Content" = content)
Pay attention to the meaning of your error message: read_html expects a character string, but you're giving it a factor. read.csv converts strings to factors unless you include the argument stringsAsFactors = FALSE (this was the default before R 4.0). read_csv from readr is a good alternative if you, like me, forget that you don't want strings automatically turned into factors.
I can't reproduce the problem without your data, but try converting the URLs to strings:
links <- as.character(d$URLs)
article <- links %>% map(read_html)
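With readr the conversion step is unnecessary, since read_csv never turns strings into factors (a sketch; assumes the same file and "URLs" column name):
library(readr)
library(purrr)
library(rvest)
d <- read_csv("NYT.csv")               # character columns stay character
article <- d$URLs %>% map(read_html)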

Web-scraping in R

I am practicing my web scraping code in R and I cannot get past one step, no matter what website I try.
For example,
https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music
My goal is to extract all 77 schools' names (Oxford to London Metropolitan).
So I tried...
library(rvest)
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
college <- read_html(url_college)
info <- html_nodes(college, css = '.league-table-institution-name')
info %>% html_nodes('.league-table-institution-name') %>% html_text()
From the browser developer tools (F12), I found that all the school names are under the class '.league-table-institution-name', and that's why I wrote that in html_nodes.
What have I done wrong?
You appear to be running html_nodes() twice: first on college, an xml_document (which is correct), and then again on info, the nodeset you already extracted. The nodes in info are themselves the '.league-table-institution-name' elements, so searching for that class inside them finds nothing.
Try this instead:
url_college %>%
  read_html() %>%
  html_nodes('.league-table-institution-name') %>%
  html_text()
and then you'll need an additional step to clean up the school names; this one was suggested (str_replace_all comes from stringr):
%>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")
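Putting both steps together into one runnable pipeline (a sketch; assumes stringr is installed):
library(rvest)
library(stringr)

url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
schools <- url_college %>%
  read_html() %>%
  html_nodes('.league-table-institution-name') %>%
  html_text() %>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")  # strip leading/trailing non-letters
head(schools)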
