html_attr "href" does not extract link - r

I want to download the file that is in the tab "Dossier" with the text "Modul 4" here:
https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier
First I want to get the link.
My code for that is the following:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(".gba-download__text") %>%
.[[4]] %>%
html_attr("href")
(I know the .[[4]] piece is not really good; this is not my full code.)
This leads to NA and I don't understand why.
Similar questions couldn't help here.

Allan already left a concise answer, but let me add another way. If you check the page source, you can see that the target sits inside .gba-download-list (there are actually two of them). So grab that part and walk down to the href attributes. Once you have the URLs, you can use grep() to pick out the link containing Modul4. I used unique() at the end to remove a duplicate.
read_html("https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier") %>%
html_nodes(".gba-download-list") %>%
html_nodes("a") %>%
html_attr("href") %>%
grep(pattern = "Modul4", value = TRUE) %>%
unique()
[1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"

It's easier to get to a specific node if you use XPath: select the span whose text contains 'Modul 4', then step up to its parent <a>, which is the element that actually carries the href:
library(rvest)

"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
  read_html() %>%
  html_nodes(xpath = "//span[contains(text(),'Modul 4')]/..") %>%
  .[[1]] %>%
  html_attr("href")
#> [1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"

I have another solution now and want to share it:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes("a.download-helper") %>%
html_attr("href") %>%
.[str_detect(., "Modul4")] %>%
unique

It is faster to use a CSS attribute selector with the contains operator (*=) to target the href by its substring. In addition, only a single node needs to be matched, so html_node() (singular) is enough:
library(rvest)

url <- "https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier"

link <- read_html(url) %>%
  html_node("[href*='Modul4']") %>%
  html_attr("href") %>%
  url_absolute(url)

Related

character(0) after scraping webpage with read_html

I'm trying to scrape "1,335,000" from the screenshot below (the number is at the bottom of the screenshot). I wrote the following code in R.
t2 <- read_html("https://fortune.com/company/amazon-com/fortune500/")

employee_number <- t2 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//*[contains(@class, 'info__value--2AHH7')]") %>%
  rvest::html_text()
However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?
As Dave2e pointed out, the page uses JavaScript, so rvest alone can't get at the rendered value.
url = "https://fortune.com/company/amazon-com/fortune500/"
#launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[#id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>%
html_text()
[1] "1,335,000"
Data is loaded dynamically from a script tag, so there is no need for the expense of a browser. You could either extract the entire JavaScript object within the script, pass it to jsonlite to handle as JSON and then pull out what you want, or, if you are just after the employee count, regex it out of the response text.
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)

page <- read_html('https://fortune.com/company/amazon-com/fortune500/')

data <- page %>%
  html_element('#preload') %>%
  html_text() %>%
  stringr::str_match("PRELOADED_STATE__ = (.*);") %>%
  .[, 2] %>%
  jsonlite::parse_json()

print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)

# shorter version
print(page %>% html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% .[, 2] %>% as.integer() %>% format(big.mark = ","))

How to select "href" of a web page of a specific "target"?

<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">
I just want to extract the "href" (for example the upper HTML tag) in order to concat it with the domain name of this website "https://kurier.at" and web scrape all articles on the home page.
I tried the following code
library(rvest)
library(lubridate)

kurier_wbpg <- read_html("https://kurier.at")

# I just want the "a" tags which come with the attribute "_self"
articleLinks <- kurier_wbpg %>%
  html_elements("a") %>%
  html_elements(css = "tag[attribute=_self]") %>%
  html_attr("href") %>%
  paste("https://kurier.at", ., sep = "")
When I execute the code up to the html_attr("href") part of the block above, the result I get is
character(0)
I think something is wrong with how I'm selecting the HTML element. Can anyone help me with this?
You need to narrow down your CSS to the second teaser block image, which you can do by using the naming conventions of the classes. You can use url_absolute() to add the domain.
library(rvest)
library(magrittr)

url <- 'https://kurier.at/'

result <- read_html(url) %>%
  html_element('.teasers-2 .image') %>%
  html_attr('href') %>%
  url_absolute(url)
Same principle to get all teasers:
results <- read_html(url) %>%
  html_elements('.teaser .image') %>%
  html_attr('href') %>%
  url_absolute(url)
Not sure if you want the bottom block of 5 included. If so, you can again use classes:
articles <- read_html(url) %>%
  html_elements('.teaser-title') %>%
  html_attr('href') %>%
  url_absolute(url)
It works with XPath:
library(rvest)

kurier_wbpg <- read_html("https://kurier.at")

articleLinks <- kurier_wbpg %>%
  html_elements("a") %>%
  html_elements(xpath = '//*[@target="_self"]') %>%
  html_attr('href') %>%
  paste0("https://kurier.at", .)

articleLinks
# [1] "https://kurier.at/plus"
# [2] "https://kurier.at/coronavirus"
# [3] "https://kurier.at/politik"
# [4] "https://kurier.at/politik/inland"
# [5] "https://kurier.at/politik/ausland"
#...
#...
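Since the goal is to scrape all the articles on the home page, here is a minimal sketch of the next step, assuming articleLinks holds the absolute URLs collected above; it only pulls each article's <title>, as a stand-in for whatever content you actually need:
# visit each article URL and extract the page title
titles <- vapply(
  articleLinks,
  function(u) read_html(u) %>% html_element("title") %>% html_text(),
  character(1)
)
head(titles)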

rvest: how to get the last page number

Trying to get the last page number:
library(rvest)

url <- "https://www.immobilienscout24.de/Suche/de/wohnung-kaufen"
page <- read_html(url)

last_page_number <- page %>%
  html_nodes("#pageSelection > select > option") %>%
  html_text() %>%
  length()
The result is empty for some reason.
I can access the pages by this url, for example to get page #3:
https://www.immobilienscout24.de/Suche/de/wohnung-kaufen?pagenumber=3
You are headed in the right direction, but I think you have the wrong CSS selectors. Try:
library(rvest)

url <- 'https://www.immobilienscout24.de/Suche/de/wohnung-kaufen'

url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  html_text() %>%
  tail(1L)
#[1] "1650"
An alternative:
url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  magrittr::extract2(length(.)) %>%
  html_text()
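Once you have the last page number, a minimal sketch for building every paged URL, assuming the pagenumber parameter works as shown in the question:
last_page <- url %>%
  read_html() %>%
  html_nodes('div.select-container select option') %>%
  html_text() %>%
  tail(1L) %>%
  as.integer()

# one URL per results page, using the pagenumber query parameter
page_urls <- paste0(url, "?pagenumber=", seq_len(last_page))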

Scraped table returns empty data frame

I'm trying to scrape two things. I want to extract the links from each individual school on a page with this code:
scraped_links <- read_html("https://www.scholenopdekaart.nl/middelbare-scholen/zoeken/") %>%
  html_nodes("a.school-naam") %>%
  html_attr("href") %>%
  html_table() %>%
  as.data.frame() %>%
  as.tbl()
Then I want to scrape the tables on these pages:
scraped_tables <- read_html("https://www.scholenopdekaart.nl/Middelbare-scholen/146/1086/Almere-College/Slaagpercentage") %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[1]/div[3]/div[3]/div[3]") %>%
  html_table() %>%
  as.data.frame() %>%
  as.tbl()
They both return empty data frames. I tried CSS selectors and multiple XPaths, but I can't get it to work. Hope someone can help me.
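One structural problem in the first snippet is that html_attr() returns a plain character vector of URLs, which cannot be piped into html_table(); html_table() needs a parsed document or table nodes. A minimal sketch of the intended two-step flow, assuming the pages serve static HTML (if the listing is rendered with JavaScript, rvest alone will still come back empty):
library(rvest)

# step 1: collect the school links as absolute URLs
school_links <- read_html("https://www.scholenopdekaart.nl/middelbare-scholen/zoeken/") %>%
  html_nodes("a.school-naam") %>%
  html_attr("href") %>%
  url_absolute("https://www.scholenopdekaart.nl/")

# step 2: visit each school page and parse its tables (one list of data frames per page)
school_tables <- lapply(school_links, function(u) read_html(u) %>% html_table())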

XML file not being extracted in R

I am working on a project that requires me to go through various pages of links and, within these links, find the XML file and parse it. I am having trouble extracting the XML file. There are two XML files within each link, and I am interested in the bigger one. How can I extract the XML file and find the one with the maximum size? I tried using the grep() function, but it constantly gives me an error.
sotu <- data.frame()

for (i in seq(1, 501, 100)) {
  securl <- paste0("https://www.sec.gov/cgi-bin/srch-edgar?text=abs-ee&start=", i, "&count=100&first=2016")
  main.page <- read_html(securl)
  urls <- main.page %>%
    html_nodes("div td:nth-child(2) a") %>%
    html_attr("href")
  baseurl <- "https://www.sec.gov"
  fulllink <- paste(baseurl, urls, sep = "")
  names <- main.page %>%
    html_nodes("div td:nth-child(2) a") %>%
    html_text()
  date <- main.page %>%
    html_nodes("td:nth-child(5)") %>%
    html_text()
  result <- data.frame(urls = fulllink, companyname = names, FilingDate = date, stringsAsFactors = FALSE)
  sotu <- rbind(sotu, result)
}

for (i in seq(nrow(sotu))) {
  getXML <- read_html(sotu$urls[1]) %>%
    grep("xml", getXML, ignore.case = FALSE)
}
Everything works except when I try to loop over every link and find the xml file, I keep getting an error. Is this not the right function?
With some help from dplyr we can do:
library(dplyr)
library(rvest)

sotu %>%
  rowwise() %>%
  do({
    read_html(.$urls) %>%
      html_table() %>%
      as.data.frame() %>%
      filter(grepl('.*\\.xml', Document)) %>%
      filter(Size == max(Size))
  })
or, as the type is always 'EX-102' at least in the example:
sotu %>%
  rowwise() %>%
  do({
    read_html(.$urls) %>%
      html_table() %>%
      as.data.frame() %>%
      filter(Type == 'EX-102')
  })
This also gets rid of the for loop, which is rarely a good idea in R.
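Since dplyr::do() is superseded in current dplyr, here is a purrr-based sketch of the same idea, assuming (as the answer above effectively does) that the document list is the first table on each filing index page:
library(purrr)
library(dplyr)
library(rvest)

xml_files <- map_dfr(sotu$urls, function(u) {
  read_html(u) %>%
    html_table() %>%
    .[[1]] %>%                          # document-list table on the filing index page
    filter(grepl('.*\\.xml', Document)) %>%
    filter(Size == max(Size))
})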
