I am trying to scrape data from a website where I first have to get a list of links from the main page and then follow each link to scrape its data. The only way I can think of to do this is with a loop.
For example:
library(rvest)
# pre-allocate the result vector
content <- rep(NA_character_, 10)
for (i in 1:10) {
  # read_html() replaces the deprecated html()
  page <- read_html(links[i])
  content[i] <- html_text(html_nodes(page, "tr:nth-child(1) td"))
}
Here, assume that links is a character vector of URLs.
This works, but it is very slow. Is there a way to speed it up?
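For what it's worth, the loop itself is rarely the slow part; the downloads are. If the pages are independent of each other, one option is to fetch them in parallel (a minimal sketch, assuming links and the selector from the question, and a Unix-alike system for mclapply()):
library(rvest)
library(parallel)

# download and parse one page, then pull the cells of its first table row
scrape_one <- function(u) {
  page <- read_html(u)
  html_text(html_nodes(page, "tr:nth-child(1) td"))
}

# mclapply() forks, so it only parallelises on Linux/macOS; on Windows use parLapply()
content <- mclapply(links, scrape_one, mc.cores = 4)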
I am pretty new to web scraping and I need to scrape the content of newspaper articles from a list of URLs pointing to different websites. I would like to obtain the actual text of each document, but I cannot find a way to automate the scraping procedure across links that belong to different websites.
In my case, the data are stored in "dublin", a data frame with (among other things) a "page" column holding the outlet name and, in its second column, the article links ("url").
So far, I have managed to scrape articles from the same website, so that I can rely on the same CSS paths I find with SelectorGadget to retrieve the texts. Here is the code I'm using to scrape content from documents on a single webpage, in this case the articles posted by The Irish Times:
library(xml2)
library(rvest)
library(dplyr)
dublin <- dublin %>%
  filter(page == "The Irish Times")

link <- pull(dublin, 2)   # second column holds the article URLs

articles <- list()
for (i in link) {
  page <- read_html(i)
  text <- page %>%
    html_elements(".body-paragraph") %>%
    html_text()
  articles[[i]] <- text
}
articles
It actually works. However, since the webpages vary from case to case, I was wondering whether there is any way to automate this procedure across all the elements of the "url" variable.
Here is an example of the links I scraped:
https://www.thesun.ie/news/10035498/dublin-docklands-history-augmented-reality-app/
https://lovindublin.com/lifestyle/dublins-history-comes-to-life-with-new-ar-app-that-lets-you-experience-it-first-hand
https://www.irishtimes.com/ireland/dublin/2023/01/11/phone-app-offering-augmented-reality-walking-tour-of-dublins-docklands-launched/
https://www.dublinlive.ie/whats-on/family-kids-news/new-augmented-reality-app-bring-25949045
https://lovindublin.com/news/campaigners-say-we-need-to-be-ambitious-about-potential-lido-for-georges-dock
Thank you in advance! Hope the material I provided is enough.
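One way to automate this across different outlets, sketched under the assumption that the link column of dublin is called url: since each site needs its own CSS path, fall back to collecting every <p> element on the page and skip any page that fails to load.
library(rvest)

# hypothetical helper: generic extraction with a <p> fallback and error handling
scrape_article <- function(u) {
  page <- tryCatch(read_html(u), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)
  page %>%
    html_elements("p") %>%
    html_text2() %>%
    paste(collapse = "\n")
}

articles <- lapply(dublin$url, scrape_article)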
I've posted about the same question before here but the other thread is dying and I'm getting desperate.
I'm trying to scrape a web page using rvest etc. Most of the stuff works, but now I need R to loop through a list of links, and all it gives me is NA.
This is my code:
library(xml2)   # read_xml()
library(rvest)  # html_nodes(), html_text()
site20min <- read_xml("https://api.20min.ch/rss/view/1")
urls <- site20min %>% html_nodes('link') %>% html_text()
I need the next line because the first two links the API gives me point straight back to the homepage:
urls <- urls[-c(1:2)]
If I print my links now it gives me a list of 109 links.
urls
Now this is my loop. I need it to give me the first link in urls so I can read_html() it.
I'm looking for something like: "https://beta.20min.ch/story/so-sieht-die-coronavirus-kampagne-des-bundes-aus-255254143692?legacy=true".
I use break so it shows me only the first link but all I get is NA.
for (i in i:length(urls)) {
  link <- urls[i]
  break
}
link
If I can get this far, I think I can handle the rest with rvest but I've tried for hours now and just ain't getting anywhere.
Thx for your help.
In your loop the index sequence is built from i itself (for(i in i:length(urls))), but i is undefined, or left over from an earlier run, when the loop starts, so the sequence does not start at 1. Can you try out
for (i in 1:length(urls)) {
  link <- urls[i]
  break
}
link
instead?
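With a valid index, link then holds the first article URL and can be read as usual (a minimal sketch, assuming the feed URLs serve ordinary HTML pages):
link <- urls[1]
article <- read_html(link)

# or fetch every article in the feed
pages <- lapply(urls, read_html)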
How can I scrape PDF documents from an HTML page?
I am using R, and so far I can only extract the text from the HTML. Here is an example of the website I am going to scrape:
https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx
When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is actually identifying the location of those PDF files.
library(XML)
library(RCurl)
url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
page <- getURL(url)
parsed <- htmlParse(page)
links <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
inds <- grep("\\.pdf$", links, ignore.case = TRUE)  # keep only links that end in .pdf
links <- links[inds]
links contains all the URLs to the PDF-files you are trying to download.
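Note that some of the extracted hrefs may be relative paths rather than absolute URLs; if so, prepend the site root before downloading (a small sketch, assuming the host from the URL above):
# make any relative href absolute by prefixing the host
rel <- !grepl("^https?://", links)
links[rel] <- paste0("https://www.bot.or.th", links[rel])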
Beware: many websites don't like it when you scrape their documents automatically, and you may get blocked.
With the links in place, you can loop through them, downloading the files one by one and saving them in your working directory under the names in destination. I decided to derive reasonable document names for your PDFs from the links, extracting the final piece after the last / in each URL:
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
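For what it's worth, base R's basename() does the same extraction in a single step:
# equivalent: basename() keeps everything after the final "/"
destination <- basename(links)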
To avoid overloading the website's servers, it is considered friendly to pause your scraping every once in a while, so I use Sys.sleep() to wait a random time between 1 and 5 seconds after each download:
for (i in seq_along(links)) {
  # mode = "wb" keeps the PDFs intact when downloading on Windows
  download.file(links[i], destfile = destination[i], mode = "wb")
  Sys.sleep(runif(1, 1, 5))
}
I want to scrape the reviews of a room from an Airbnb web page, for example this one: https://www.airbnb.com/rooms/8400275
And this is my code for the task. I used the rvest package and SelectorGadget:
library(rvest)

x <- read_html('https://www.airbnb.com/rooms/8400275')
x_1 <- x %>% html_node('#reviews p') %>% html_text() %>% as.character()
Can you help me fix this? Is it possible to do it with the rvest package? (I am not familiar with xpathSApply.)
I assume that you want to extract the comments themselves. Looking at the HTML file, that is not an easy task, since the reviews are embedded inside a script node. So, what I tried was this:
1. Read the HTML. Here I use a connection and readLines() to read it as character vectors.
2. Select the line that contains the review information.
3. Use stringr's str_extract_all() to extract the comments.
For the first two steps, we could also use the rvest or XML package to select the appropriate node.
url <- "https://www.airbnb.com/rooms/8400275"
con <- file (url)
raw <- readLines (con)
close (con)
comment.regex <- "\"comments\":\".*?\""
comment.line <- raw[grepl(comment.regex, raw)]
require(stringr)
comment <- str_extract_all(comment.line, comment.regex)
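str_extract_all() returns a list whose elements still carry the surrounding "comments":"..." wrapper; a short follow-up sketch (the clean-up regex is my own addition) to get the bare text:
# flatten the matches and strip the leading "comments":" and the trailing quote
comments <- unlist(comment)
comments <- gsub('^"comments":"|"$', "", comments)
head(comments)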
I am trying to scrape data from a website which lists the ratings of multiple products. Say a product category has 800 brands; with 10 brands per page, I would need to scrape data from 80 pages. For example, here is the data for baby care. There are 24 pages' worth of brands that I need: http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D1%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D
The page number (the 1 after page%3D in the URL) is the only thing that changes as we move from page to page, so I thought it would be straightforward to write a loop in R. But what I find is that as I move to page 2, the page does not reload; instead, just the results are updated after about 5 seconds. R does not wait for those 5 seconds, and so I ended up with the data from the first page 26 times.
I also tried entering the page-2 URL directly and ran my code without a loop. Same story: I got the page-1 results. I am sure I can't be the only one facing this. Any help is appreciated. I have attached the code.
Thanks a million. And I hope my question was clear enough.
# build the URL for each results page
library(XML)      # htmlParse(), xmlGetAttr(), xmlValue()
library(selectr)  # querySelector(), querySelectorAll()

N <- matrix(NA, 26, 15)
R <- matrix(NA, 26, 60)

for (n in 1:26) {
  url <- paste0("http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D",
                n,
                "%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D")
  raw.data <- readLines(url)
  Parse <- htmlParse(raw.data)

  A <- querySelector(Parse, "div.results-container")
  Name <- querySelectorAll(A, "div.reviews>a")
  Ratings <- querySelectorAll(A, "div.value")

  N[n, ] <- sapply(Name, function(x) xmlGetAttr(x, "href"))
  R[n, ] <- sapply(Ratings, xmlValue)
}
Referring to the HTML source reveals that the URLs you want can be simplified to this structure:
http://www.goodguide.com/products?category_id=152775-baby-care&page=2&sort_order=DESC.
The content of these URLs is retrieved by R as expected.
Note that you can also go straight to:
u <- sprintf('http://www.goodguide.com/products?category_id=152775-baby-care&page=%s&sort_order=DESC', n)
Parse <- htmlParse(u)
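Putting it together, a minimal sketch of the loop over the simplified URL (reusing the selectors from the question; the 26 pages and the 15/60 column counts are carried over as assumptions):
library(XML)
library(selectr)

N <- matrix(NA, 26, 15)
R <- matrix(NA, 26, 60)

for (n in 1:26) {
  # plain query-string URL, so the server returns the requested page directly
  u <- sprintf('http://www.goodguide.com/products?category_id=152775-baby-care&page=%s&sort_order=DESC', n)
  Parse <- htmlParse(u)

  A <- querySelector(Parse, "div.results-container")
  N[n, ] <- sapply(querySelectorAll(A, "div.reviews>a"), xmlGetAttr, "href")
  R[n, ] <- sapply(querySelectorAll(A, "div.value"), xmlValue)

  Sys.sleep(1)  # small pause between requests
}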