Issue webscraping a website: not extracting anything - r

I am trying to extract data from the following website: 'https://2010-2014.kormany.hu/hu/hirek'. When I try to extract, for example, the articles' links from that website using the following code, I get nothing.
library(rvest)
library(dplyr)
library(XML)
url <- 'https://2015-2019.kormany.hu/hu/hirek'
links <- read_html(url) %>%
html_nodes("div") %>%
html_nodes(xpath = '//*[@class="article"]') %>%
html_nodes("h2") %>%
html_nodes("a") %>%
html_attr("href")
links
> character(0)
I don't even get anything if I run the following code:
links <- read_html(url) %>% html_nodes("div")
links
> character(0)
This is very strange since, when I inspect the website, it seems that I should be getting the list of URLs from the code I provided. According to the website's source, there are "div" nodes ('view-source:https://2015-2019.kormany.hu/hu/hirek'). Does anyone know what I could be doing wrong?

Today I re-tried my code and it works perfectly. I am not sure what was happening yesterday.
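One way to tell a transient server-side problem from a selector problem is to look at the raw HTTP response before parsing. A minimal diagnostic sketch, assuming the httr package is available (the User-Agent string is an arbitrary choice, not something the site is known to require):
library(rvest)
library(httr)
# Check that the server actually returned the page with HTTP 200 before
# troubleshooting selectors; a blocked or empty response can also end in
# an empty result.
resp <- GET("https://2015-2019.kormany.hu/hu/hirek", user_agent("Mozilla/5.0"))
status_code(resp)                                # anything other than 200 points at the request, not the XPath
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
length(html_nodes(page, "div"))                  # non-zero if real HTML came back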

Related

Obtaining "Character(0)" error when using rvest to get Google results headlines

Sorry if my question is simple or badly asked; I am very new to web scraping with R.
I am trying to scrape the headlines from a Google search. Sorry if this is exactly the same request as the one asked in the link below, but its answers do not work for me (they still return "character(0)").
Character(0) error when using rvest to webscrape Google search results
Here are the two scripts I tried, based on the answers provided in the link above:
#Script 1
library(rvest)
library(dplyr)
web1 <- read_html("https://www.google.at/search?q=munich+prices")
web1 %>%
html_nodes(xpath = '//div/div/div/a/div[not(div)]') %>%
html_text
#Script 2
library(rvest)
library(dplyr)
web1 <- read_html("https://www.google.at/search?q=munich+prices")
web1 %>%
html_nodes(xpath = '//div/div/div/a/h3/div[not(div)]') %>%
html_text
The two scripts still return "character(0)" for me.
Does anyone have an idea?
Thank you for your help.
Victor
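For what it's worth, Google often serves a stripped-down page to scripted clients, so what the browser inspector shows may not match what read_html receives. A hedged sketch that sends a browser-like User-Agent and falls back to a plain h3 selector; the selector is an assumption about Google's current markup, not a stable interface:
library(rvest)
library(httr)
# Sketch only: fetch the results page with an explicit User-Agent and look
# for <h3> headline nodes; Google's markup changes frequently.
resp <- GET("https://www.google.at/search?q=munich+prices",
user_agent("Mozilla/5.0"))
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
page %>%
html_nodes("h3") %>%
html_text()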

How to log scrape paths rvest used?

Background: Using rvest I'd like to scrape all details of all art pieces for the painter Paolo Uccello on wikiart.org. The endgame will look something like this:
> names(uccello_dt)
[1] "title"  "year"   "style"  "genre"  "media"  "imgSRC" "infoSRC"
Problem: When a scraping attempt doesn't go as planned, I get back character(0). This isn't helpful for me in understanding exactly what path the scrape took to get character(0). I'd like to have my scrape attempts output what path it specifically took so that I can better troubleshoot my failures.
What I've tried:
I use Firefox, so after each failed attempt I go back to the web inspector to make sure I am using the correct CSS selector / element tag. I've been keeping the rvest documentation by my side to better understand its functions. It's been trial and error that's taking much longer than I think it should. Here's a cleaned-up example of one of many failures:
library(tidyverse)
library(data.table)
library(rvest)
sample_url <-
read_html(
"https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)
imgSrc <-
sample_url %>%
html_nodes(".wiki-detailed-item-container") %>% html_nodes(".masonry-detailed-artwork-item") %>% html_nodes("aside") %>% html_nodes(".wiki-layout-artist-image-wrapper") %>% html_nodes("img") %>%
html_attr("src") %>%
as.character()
title <-
sample_url %>%
html_nodes(".masonry-detailed-artwork-title") %>%
html_text() %>%
as.character()
Thank you in advance.
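One way to get the path visibility asked for here is to wrap html_nodes() in a small helper that reports each selector and how many nodes it matched. The nodes_logged() helper below is hypothetical (not part of rvest); it is a sketch of the idea applied to the imgSrc chain above:
library(rvest)
# Hypothetical helper: apply a CSS selector and report how many nodes it
# matched, so the failing step in a long chain is visible immediately.
nodes_logged <- function(x, css) {
  out <- html_nodes(x, css)
  message(sprintf("selector '%s' matched %d node(s)", css, length(out)))
  out
}
imgSrc <- sample_url %>%
nodes_logged(".wiki-detailed-item-container") %>%
nodes_logged(".masonry-detailed-artwork-item") %>%
nodes_logged("aside") %>%
nodes_logged(".wiki-layout-artist-image-wrapper") %>%
nodes_logged("img") %>%
html_attr("src")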

Using R to scrape play-by-play data

I am currently trying to scrape the play-by-play entries from the following link:
https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4
I used SelectorGadget to determine CSS selectors and ended up with '//td'. However, when I attempt to scrape the data using this, html_nodes() returns an empty list and so the following code returns an error.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
read_html %>%
html_node(xpath='//td') %>%
html_table()
play_by_play
Does anybody know how to resolve this issue?
Thank you in advance!
I think you cannot get the table simply because there is no table in the website's source (see the source).
If there were any tables, you could get them with the following code.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
read_html %>%
html_table()
play_by_play
The data on the page you are loading is rendered with JavaScript, so when you use read_html you are not seeing what you want. If you view the page source, you will not see any table or td elements there.
What you can do is use another option such as RSelenium to get the rendered page source; if you want to use rvest afterwards, you can scrape from the source you get.
library(rvest)
library(RSelenium)
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
# Start a Selenium driver and load the page so the JavaScript can run
rD <- rsDriver()
remDr <- rD$client
remDr$navigate(url)
# Grab the rendered page source and parse it with rvest
play_by_play <- read_html(unlist(remDr$getPageSource()), encoding = "UTF-8") %>%
html_nodes("td")
remDr$close()
rm(remDr, rD)
gc()
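If the rendered page really does contain table elements (an assumption worth checking), the same source can be captured before closing the driver and handed to html_table():
# Follow-up sketch: run this before remDr$close(); html_table() only helps
# if the rendered page actually contains <table> elements.
src <- unlist(remDr$getPageSource())
rendered <- read_html(src, encoding = "UTF-8")
play_by_play_tables <- html_table(rendered)   # list of data frames, one per table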

Get links while web scraping Google in R

I am trying to get the links Google returns when doing a search, that is, all of the result links.
I have done this kind of scraping before, but in this case I do not understand why it doesn't work. I run the following lines:
library(rvest)
url<-"https://www.google.es/search?q=Ediciones+Peña+sl+telefono"
content_request<-read_html(url)
content_request %>%
html_nodes(".r") %>%
html_attr("href")
I have tried with other nodes and I obtain similar results:
content_request %>%
html_nodes(".LC20lb") %>%
html_attr("href")
Finally, I tried to get all the links on the web page, but there are some links that I cannot retrieve:
html_attr(html_nodes(content_request, "a"), "href")
Please, could you help me in this case? Thank you.
Here are two options for you to play around with.
#1)
url <- "https://www.google.es/search?q=Ediciones+Pe%C3%B1a+sl+telefono"
html <- paste(readLines(url), collapse="\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
#2)
library(xml2)
library(rvest)
URL <- "https://www.google.es/search?q=Ediciones+Pe%C3%B1a+sl+telefono"
pg <- read_html(URL)
head(html_attr(html_nodes(pg, "a"), "href"))
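As a possible follow-up, Google result anchors are often wrapped as "/url?q=<target>&...". If that is what comes back here, the targets can be unwrapped with a regular expression; the wrapping itself is an assumption about the returned markup:
# Sketch only: keep the hrefs that look like Google redirect links and pull
# out the target URL.
links <- html_attr(html_nodes(pg, "a"), "href")
wrapped <- links[grepl("^/url\\?q=", links)]
targets <- sub("^/url\\?q=([^&]+).*", "\\1", wrapped)
head(targets)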

How to scrape a table with rvest and xpath?

Using the following documentation, I have been trying to scrape a series of tables from marketwatch.com.
Here is the one represented by the code below; the link and XPath are already included in the code:
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation <- url %>%
html() %>%
html_nodes(xpath='//*[@id="maincontent"]/div[2]/div[1]') %>%
html_table()
valuation <- valuation[[1]]
I get the following error:
Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
Thanks in advance.
That website doesn't use an html table, so html_table() can't find anything. It actually uses div classes column and data lastcolumn.
So you can do something like
url <- "http://www.marketwatch.com/investing/stock/IRS/profile"
valuation_col <- url %>%
read_html() %>%
html_nodes(xpath='//*[#class="column"]')
valuation_data <- url %>%
read_html() %>%
html_nodes(xpath='//*[#class="data lastcolumn"]')
Or even
url %>%
read_html() %>%
html_nodes(xpath='//*[#class="section"]')
To get you most of the way there.
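If the two node sets line up one-to-one, they can be paired into a small data frame. A hedged sketch under that assumption (the class names on the page may have changed since this was written):
library(rvest)
# Sketch only: pair label nodes with value nodes, assuming they come back in
# the same order and in matching numbers.
pg <- read_html("http://www.marketwatch.com/investing/stock/IRS/profile")
labels <- pg %>% html_nodes(xpath = '//*[@class="column"]') %>% html_text(trim = TRUE)
values <- pg %>% html_nodes(xpath = '//*[@class="data lastcolumn"]') %>% html_text(trim = TRUE)
valuation <- data.frame(field = labels,
value = values[seq_along(labels)],
stringsAsFactors = FALSE)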
Please also read their terms of use - particularly 3.4.
