Get links while web scraping Google in R - r

I am trying to get the links Google returns when doing a search, that is, all the result links.
I have done this kind of scraping before, but in this case I do not understand why it doesn't work. I run the following lines:
library(rvest)
url<-"https://www.google.es/search?q=Ediciones+Peña+sl+telefono"
content_request<-read_html(url)
content_request %>%
  html_nodes(".r") %>%
  html_attr("href")
I have tried with other nodes and I obtain similar results:
content_request %>%
  html_nodes(".LC20lb") %>%
  html_attr("href")
Finally, I tried to get all the links on the web page, but there are some links that I cannot retrieve:
html_attr(html_nodes(content_request, "a"), "href")
Please, could you help me in this case? Thank you.

Here are two options for you to play around with.
#1)
url <- "https://www.google.es/search?q=Ediciones+Pe%C3%B1a+sl+telefono"
html <- paste(readLines(url), collapse="\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
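str_match_all() returns a list with one match matrix per input string; the full match sits in column 1 and the captured href value in column 2, so the links can be pulled out like this:
# the capture group (the href value) is the second column of the match matrix
links <- matched[[1]][, 2]
head(links)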
#2)
library(xml2)
library(rvest)
URL <- "https://www.google.es/search?q=Ediciones+Pe%C3%B1a+sl+telefono"
pg <- read_html(URL)
head(html_attr(html_nodes(pg, "a"), "href"))
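If you only want the organic result links, the HTML Google serves to non-browser clients usually wraps them in a /url?q= redirect. A sketch that filters for and strips that prefix (assuming that markup, which Google changes frequently):
library(rvest)
library(stringr)

pg <- read_html("https://www.google.es/search?q=Ediciones+Pe%C3%B1a+sl+telefono")
hrefs <- html_attr(html_nodes(pg, "a"), "href")

# keep only the /url?q=... redirect links, then strip the prefix
# and the trailing Google tracking parameters
links <- hrefs[str_detect(hrefs, "^/url\\?q=")]
links <- str_remove(links, "^/url\\?q=")
links <- str_remove(links, "&.*$")
head(links)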

Related

Obtaining "Character(0)" error when using rvest to get Google results headlines

Sorry if my question is simple or badly asked; I am very new to web scraping with R.
I am trying to scrape the headlines from a Google search. Sorry if this is exactly the same request as the one asked in the link below, but its answers do not work for me (they still return "character(0)").
Character(0) error when using rvest to webscrape Google search results
Here are the two scripts I tried, based on the answers provided in the link above:
#Script 1
library(rvest)
library(dplyr)
web1 <- read_html("https://www.google.at/search?q=munich+prices")
web1 %>%
  html_nodes(xpath = '//div/div/div/a/div[not(div)]') %>%
  html_text()
#Script 2
library(rvest)
library(dplyr)
web1 <- read_html("https://www.google.at/search?q=munich+prices")
web1 %>%
  html_nodes(xpath = '//div/div/div/a/h3/div[not(div)]') %>%
  html_text()
The two scripts still return "character(0)" for me.
Does anyone have an idea?
Thank you for your help.
Victor
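In both scripts, character(0) simply means the XPath matched no nodes: Google serves different, frequently changing markup to rvest than to a browser, and its generated class names and div structure are unreliable. A sketch that targets the more stable h3 headline tags instead:
library(rvest)

web1 <- read_html("https://www.google.at/search?q=munich+prices")

# headline text sits in h3 tags in the no-JavaScript HTML Google serves
web1 %>%
  html_nodes("h3") %>%
  html_text()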

Issue webscraping a website: not extracting anything

I am trying to extract data from the following website: 'https://2010-2014.kormany.hu/hu/hirek'. When I try to extract, for example, the articles' links from that website using the code below, I get nothing.
library(rvest)
library(dplyr)
library(XML)
url <- 'https://2015-2019.kormany.hu/hu/hirek'
links <- read_html(url) %>%
  html_nodes("div") %>%
  html_nodes(xpath = '//*[@class="article"]') %>%
  html_nodes("h2") %>%
  html_nodes("a") %>%
  html_attr("href")
links
> character(0)
I don't even get anything if I run the following code:
links <- read_html(url) %>% html_nodes("div")
links
> character(0)
This is very strange since, when I inspect the website, it seems that I should be getting the list of URLs from the code I provided. According to the website's source, there are "div" nodes ('view-source:https://2015-2019.kormany.hu/hu/hirek'). Does anyone know what I could be doing wrong?
Today I re-tried my code and it works perfectly. I am not sure what was happening yesterday.
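If a site returns nothing one day and works the next, it can help to fetch the page with httr first and inspect the HTTP status code before parsing; a sketch reusing the .article selector from the question:
library(httr)
library(rvest)

url <- 'https://2015-2019.kormany.hu/hu/hirek'
resp <- GET(url)
status_code(resp)  # anything other than 200 would explain an empty result

links <- content(resp) %>%
  html_nodes(xpath = '//*[@class="article"]//h2/a') %>%
  html_attr("href")
links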

R programming Web Scraping

I tried to scrape the webpage at the link below using the rvest package in R.
The link that I scraped is http://dk.farnell.com/c/office-computer-networking-products/prl/results
My code is:
library("xml2")
library("rvest")
url<-read_html("http://dk.farnell.com/c/office-computer-networking-products/prl/results")
tbls_ls <- url %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  gsub("^\\s\\n\\t+|\\s+$n+$t+$", "", .)
View(tbls_ls)
My requirement is to remove \\n and \\t from the result, and to add pagination so that I can scrape multiple pages.
I'm intrigued by these kinds of questions so I'll try to help you out. Be forewarned, I am not an expert with this stuff (or anything close to it). Anyway, I think it should be kind of like this...
library(rvest)
library(tidyverse)
urls <- "http://dk.farnell.com/c/office-computer-networking-products/prl/results/"
pag <- 1:5
# build one URL per page, then parse each page
read_urls <- paste0(urls, pag)
read_urls %>%
  map(read_html) -> p
Now, I didn't see any '\\n' or '\\t' patterns in the data sets. Nevertheless, if you want to look for a specific string, you can do it like this.
library(stringr)
str_which(urls, "your_string_here")
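For the \\n/\\t requirement itself, rather than gsub-ing over the whole list of tables, a sketch that parses the tables on each page collected in p and trims whitespace from every character column (trimws() strips leading and trailing \n and \t):
library(rvest)
library(purrr)

# parse every table on every page collected in p
tables <- map(p, ~ html_table(.x, fill = TRUE))

# trimws() strips leading/trailing whitespace, including \n and \t
clean_df <- function(df) {
  df[] <- lapply(df, function(col) if (is.character(col)) trimws(col) else col)
  df
}
cleaned <- map(tables, ~ map(.x, clean_df))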
The link below is very useful!
http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/webscrape.html

rvest only returns header from scraped table

The following only returns the headers from the desired table, which was scraped using rvest.
library(rvest)
url <-("https://www.baseball-reference.com/draft/?year_ID=2017&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
draft <- read_html(url)
draft_first_html <- html_nodes(draft, xpath = '//*[@id="div_draft_stats"]')
I've tried a few different xpaths with no luck. It should return 36 observations and 24 variables.
This works for me after correcting your URL:
draft <- read_html(url)
draft %>%
  html_node("#draft_stats") %>%
  html_table()
You were close to the answer. You just needed to correct the id to get the proper html node. Then using html_table() on that node will give you the data you want. My try at the solution:
library(rvest)
url <-("https://www.baseball-reference.com/draft/?year_ID=2017&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
draft <- read_html(url)
draft_first_html <- html_node(draft, xpath = '//*[@id="draft_stats"]')
draft_df <- html_table(draft_first_html)
A cleaner solution with less code would be:
library(rvest)
url <-("https://www.baseball-reference.com/draft/?year_ID=2017&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
draft_df <- read_html(url) %>%
  html_node(xpath = '//*[@id="draft_stats"]') %>%
  html_table()
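As a quick sanity check, the result should match the dimensions mentioned in the question:
dim(draft_df)  # the question expects 36 observations and 24 variables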
Hope it helped! I didn't check the terms and conditions of the webpage, but always be sure that you are respecting the terms before scraping :)
If there is anything that you don't understand about my solution, don't hesitate to leave a comment below!

Getting email address in web scraping through rvest

Hi, I am trying to get some information from this webpage through web scraping in R using the rvest package. I am getting the name and everything else, but I am unable to get the email address, i.e. info@brewhemia.co.uk. If I look at the output of read_html as text, I don't see the email address in the parsed HTML. Can anybody please help? I am new to web scraping, but I know R.
link <- 'https://food.list.co.uk/place/22191-brewhemia-edinburgh/'
page <- read_html(link)
name_html <- html_nodes(page, '.placeHeading')
business_name <- html_text(name_html)
tel_html <- html_nodes(page,'.value')
business_tel <- html_text(tel_html)
The email address is inside an 'a' HTML tag, but I am not able to extract it.
You need a JavaScript engine here to process the JS code. Luckily, R has V8.
Modify your code after installing the V8 package:
library(rvest)
library(V8)
link <- 'https://food.list.co.uk/place/22191-brewhemia-edinburgh/'
page <- read_html(link)
name_html <- html_nodes(page, '.placeHeading')
business_name <- html_text(name_html)
tel_html <- html_nodes(page,'.value')
business_tel <- html_text(tel_html)
emailjs <- page %>% html_nodes('li') %>% html_nodes('script') %>% html_text()
ct <- v8()
# strip the document.write() wrapper, evaluate the JS, then parse the HTML it emits
read_html(ct$eval(gsub('document.write', '', emailjs))) %>% html_text()
Output:
> read_html(ct$eval(gsub('document.write','',emailjs))) %>% html_text()
[1] "info#brewhemia.co.uk"
