I am pretty new to web scraping and I need to scrape the content of newspaper articles from a list of URLs that point to different websites. I would like to obtain the actual text of each document, but I cannot find a way to automate the scraping across links that belong to different websites.
In my case, the data are stored in "dublin", a data frame that includes a "page" column (the source website) and a "url" column (the link to each article).
So far, I have managed to scrape articles that come from the same website, so that I can rely on the same CSS selectors, which I find with SelectorGadget, to retrieve the text. Here is the code I'm using to scrape documents from a single website, in this case those published by The Irish Times:
library(xml2)
library(rvest)
library(dplyr)

# Keep only the articles published by The Irish Times
dublin <- dublin %>%
  filter(page == "The Irish Times")

# The second column holds the article URLs
link <- pull(dublin, 2)

# Scrape the paragraphs of each article, keyed by URL
articles <- list()
for (i in link) {
  page <- read_html(i)
  text <- page %>%
    html_elements(".body-paragraph") %>%
    html_text()
  articles[[i]] <- text
}
articles
It actually works. However, since webpages vary case by case, I was wondering whether there is any way to automate this procedure through all the elements of the "url" variable.
Here is an example of the links I scraped:
https://www.thesun.ie/news/10035498/dublin-docklands-history-augmented-reality-app/
https://lovindublin.com/lifestyle/dublins-history-comes-to-life-with-new-ar-app-that-lets-you-experience-it-first-hand
https://www.irishtimes.com/ireland/dublin/2023/01/11/phone-app-offering-augmented-reality-walking-tour-of-dublins-docklands-launched/
https://www.dublinlive.ie/whats-on/family-kids-news/new-augmented-reality-app-bring-25949045
https://lovindublin.com/news/campaigners-say-we-need-to-be-ambitious-about-potential-lido-for-georges-dock
Thank you in advance! Hope the material I provided is enough.
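One possible way to generalise the loop above is to keep a lookup table from each site's hostname to the CSS selector that works for it, and to pick the selector per URL. The sketch below assumes that approach. Only the `.body-paragraph` selector for The Irish Times comes from the question; the other selectors are hypothetical placeholders that would need checking with SelectorGadget, and `host_of()`, `selector_for()`, and `scrape_article()` are illustrative names.

```r
# Lookup table: hostname -> CSS selector for the article body.
selectors <- c(
  "irishtimes.com"  = ".body-paragraph",     # from the question
  "thesun.ie"       = ".article__content p", # hypothetical placeholder
  "lovindublin.com" = ".article-body p",     # hypothetical placeholder
  "dublinlive.ie"   = ".article-body p"      # hypothetical placeholder
)

# Extract the hostname from a URL, dropping a leading "www.".
host_of <- function(url) {
  sub("^https?://(www\\.)?([^/]+).*$", "\\2", url)
}

# Return the selector for a URL, falling back to all <p> elements
# when the site is not in the lookup table.
selector_for <- function(url, fallback = "p") {
  sel <- unname(selectors[host_of(url)])
  if (is.na(sel)) fallback else sel
}

# Download one article and return its paragraphs
# (needs the rvest package and network access).
scrape_article <- function(url) {
  page <- rvest::read_html(url)
  rvest::html_text2(rvest::html_elements(page, selector_for(url)))
}

# Usage (not run here):
# articles <- lapply(setNames(link, link), scrape_article)
```

The fallback to plain `p` elements is a design choice for unknown sites: it tends to pick up navigation and footer text as well, so results for those sites would need cleaning.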
Related
I am trying to scrape specific portions of HTML-based journal articles. For example, if I only wanted to scrape the "Statistical analyses" section of an article in a Frontiers publication, how could I do that? Since the number of paragraphs and the location of the section change from article to article, SelectorGadget isn't helping.
https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full
I've tried using rvest with html_nodes and xpath, but I'm not having any luck. The best I can do is begin scraping at the section I want, but can't get it to stop after. Any suggestions?
library(rvest)

example_page <- "https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full"

# Selects every <p> sibling after the heading, so it runs past the end of the section
example_stats_section <- read_html(example_page) %>%
  html_nodes(xpath = "//h3[contains(., 'Statistical Analyses')]/following-sibling::p") %>%
  html_text()
Since a "Results" section follows each "Statistical Analyses" section, keep only the paragraphs that still have the "Results" heading somewhere after them. Try
//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.="Results"]]
to get just the required section.
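To see why the extra predicate stops the selection at the right place, here is a minimal sketch that runs the same XPath against a toy HTML string. The markup is an assumption made for illustration and is not the real Frontiers page structure.

```r
library(xml2)

# Toy stand-in for an article page (not the real Frontiers markup).
html <- paste0(
  "<div>",
  "<h2>Methods</h2>",
  "<h3>Statistical Analyses</h3><p>A</p><p>B</p>",
  "<h2>Results</h2><p>C</p>",
  "</div>"
)
doc <- read_html(html)

# Sibling paragraphs after the h3 that still have a "Results" h2 after them.
xp <- "//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.='Results']]"
stats_paragraphs <- xml_text(xml_find_all(doc, xp))
print(stats_paragraphs)
```

Paragraph "C" is a sibling of the heading too, but no "Results" heading follows it, so the predicate drops it.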
I am trying to use web scraping to generate a play by play dataset from ESPN. I have figured out most of it, but have been unable to tell which team the event is for, as this is only encoded on ESPN in the form of an image. The best way I have come up with to solve this problem is to get the URL of the logo for each entry and compare it to the URL of the logo for each team at the top of the page. However, I have been unable to figure out how to get an attribute such as the url from the image.
I am running this in R with the rvest package. The URL I am scraping is https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906 and I find the selectors with the SelectorGadget Chrome extension. I have also tried matching the player's name against the box score, which lists all of the players, but each team has a player with the last name Jones, so I would prefer to get the team from the image, as this will always be right.
library(rvest)
url <- "https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906"
webpage <- read_html(url)
# have been able to successfully scrape game_details and score
game_details_html <- html_nodes(webpage,'.game-details')
game_details <- html_text(game_details_html) %>% as.character()
score_html <- html_nodes(webpage,'.combined-score')
score <- html_text(score_html)
# have not been able to scrape image
ImgNode <- html_nodes(webpage, css = "#gp-quarter-1 .team-logo")
link <- html_attr(ImgNode, "src")
For each event, I want it to be labeled "Duke" or "Wake Forest".
Is there a way to generate the URL for each image? Any help would be greatly appreciated.
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"
Your code returns these two URLs.
The one containing 500/150 is Duke and the one containing 500/154 is Wake Forest. You can create a simple reference data frame with these and then join it to your scraped links.
# Scraped logo URLs, one row per event
link_df <- as.data.frame(link)

# Reference table: logo URL -> team name
link_ref_df <- data.frame(
  link = c("https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100",
           "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"),
  team_name = c("Duke", "Wake Forest"))

# Left join so every event keeps its row
link_merged <- merge(link_df,
                     link_ref_df,
                     by = 'link',
                     all.x = TRUE)
This is not scalable if you're doing hundreds of these with other teams, but works for this specific option.
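A somewhat more reusable variant, sketched here as an assumption and not as part of the original answer, is to pull the numeric team id out of each logo URL and look the name up by id. The third URL below is a repeat added only to show several events; `team_names` is a hypothetical lookup that you would extend with more ids.

```r
# Logo URLs of the kind returned by html_attr(ImgNode, "src").
link <- c(
  "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100",
  "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100",
  "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
)

# Pull out the numeric id that sits just before ".png".
team_id <- sub("^.*/([0-9]+)\\.png.*$", "\\1", link)

# Id -> team name lookup (hypothetical; extend with more ids as needed).
team_names <- c("150" = "Duke", "154" = "Wake Forest")

# Indexing by name gives one label per event, in the original event order.
team <- unname(team_names[team_id])
print(team)
```

Unlike `merge()`, which may reorder rows, indexing a named vector keeps the events in the order they were scraped.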
How can I scrape the PDF documents linked from an HTML page?
I am using R and so far I can only extract the text from the HTML. An example of the website that I want to scrape is as follows.
https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx
When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.
library(XML)
library(RCurl)

url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
page <- getURL(url)
parsed <- htmlParse(page)

# Collect the href of every anchor that has one
links <- xpathSApply(parsed, path = "//a[@href]", xmlGetAttr, "href")

# Keep only the links that end in ".pdf"
inds <- grep("\\.pdf$", links, ignore.case = TRUE)
links <- links[inds]
links now contains all the URLs of the PDF files you are trying to download.
Beware: many websites don't like it when you scrape their documents automatically, and you may get blocked.
With the links in place, you can loop through them, download the files one by one, and save them in your working directory under the names stored in destination. I decided to derive reasonable document names for your PDFs from the links, by extracting the final piece after the last / in each URL:
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
To avoid overloading the website's servers, it is considered polite to pause between requests, so I use `Sys.sleep()` to wait for a random time between 1 and 5 seconds after each download:
for (i in seq_along(links)) {
  # mode = "wb" keeps binary files such as PDFs intact on Windows
  download.file(links[i], destfile = destination[i], mode = "wb")
  Sys.sleep(runif(1, 1, 5))
}
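The filtering and naming steps can be tried without any network access. The sketch below uses hypothetical example.org links in place of the scraped `links` vector, and swaps in `basename()` plus `URLdecode()` as an alternative to the regular expression above, so that percent-escapes in file names are decoded.

```r
# Hypothetical hrefs, standing in for the scraped `links` vector.
links <- c(
  "https://example.org/reports/Report_2019.pdf",
  "https://example.org/about.aspx",
  "https://example.org/reports/Summary%20Q1.PDF"
)

# Keep only links whose path ends in ".pdf", ignoring case.
pdf_links <- links[grepl("\\.pdf$", links, ignore.case = TRUE)]

# File names: last path component, with percent-escapes decoded.
destination <- vapply(basename(pdf_links), URLdecode, character(1),
                      USE.NAMES = FALSE)
print(destination)
```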
I'm learning how to scrape information from websites using httr and XML in R. I'm getting it to work just fine for websites with just a few tables, but can't figure it out for websites with several tables. Using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm
library(httr)
library(XML)

# To get just the boxscore by quarter, which is the first table:
URL <- "https://www.pro-football-reference.com/boxscores/201609080den.htm"
response <- GET(URL)
SnapTable <- readHTMLTable(rawToChar(response$content), stringsAsFactors = FALSE)[[1]]

# Return the number of tables:
AllTables <- readHTMLTable(rawToChar(response$content), stringsAsFactors = FALSE)
length(AllTables)
[1] 2
So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.
Is my inability to get the other tables a matter of the website's setup or incorrect code?
For web scraping in R, make heavy use of the rvest package.
Fetching the HTML is the easy part. rvest selects elements with CSS selectors (or XPath), and SelectorGadget helps you find a styling pattern for a particular table that is, hopefully, unique. That way you extract exactly the tables you are looking for, not whichever ones happen to be returned.
To get you started, here is an example; read the rvest vignette for more detailed information.
# install.packages("rvest")
library(rvest)
library(magrittr)

# Store web url
fb_url <- "https://www.pro-football-reference.com/boxscores/201609080den.htm"

# Read the page and extract the linescore table by XPath
linescore <- fb_url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
  html_table()
Hope this helps.
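On the question of setup versus code: a commonly reported cause of this symptom is that a site ships some of its tables wrapped in HTML comments, which a parser does not treat as tables. Whether that applies to this page is an assumption here, not something confirmed in this thread. The sketch below shows the general technique on a toy HTML string using xml2; the markup and ids are made up.

```r
library(xml2)

# Toy page: one visible table and one table wrapped in an HTML comment.
html <- paste0(
  "<html><body>",
  "<table id='linescore'><tr><td>1</td></tr></table>",
  "<!-- <table id='officials'><tr><td>Referee</td></tr></table> -->",
  "</body></html>"
)
doc <- read_html(html)

# Tables the parser sees directly.
visible_ids <- xml_attr(xml_find_all(doc, "//table"), "id")

# Re-parse the text of every comment and look for tables inside it.
comment_text <- xml_text(xml_find_all(doc, "//comment()"))
hidden_doc <- read_html(paste0("<div>", paste(comment_text, collapse = ""), "</div>"))
hidden_ids <- xml_attr(xml_find_all(hidden_doc, "//table"), "id")

print(visible_ids)
print(hidden_ids)
```

If the real page behaves this way, the re-parsed comment document could be passed to `rvest::html_table()` in the same way as the visible one.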
I am trying to scrape data from a website which lists the ratings of multiple products. So, let's say a product has 800 brands. With 10 brands per page, I will need to scrape data from 80 pages. E.g., here is the data for baby care. There are 24 pages worth of brands that I need - http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D1%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D
The page number (the 1 in page%3D1) is the only thing that changes in this URL as we move from page to page, so I thought it would be straightforward to write a loop in R. But what I find is that when I move to page 2, the page does not load again. Instead, just the results are updated after about 5 seconds. R does not wait for those 5 seconds, and so I got the data from the first page 26 times.
I also tried entering the page 2 url directly and ran my code without a loop. Same story- I got page 1 results. I am sure I can't be the only one facing this. Any help is appreciated. I have attached the code.
Thanks a million. And I hope my question was clear enough.
library(XML)
library(selectr)

# One row per page: product links (N) and ratings (R)
N <- matrix(NA, 26, 15)
R <- matrix(NA, 26, 60)
for (n in 1:26) {
  # build the URL for page n (paste0 avoids inserting spaces around n)
  url <- paste0("http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D", n, "%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D")
  raw.data <- readLines(url)
  Parse <- htmlParse(raw.data)
  # container holding the results
  A <- querySelector(Parse, "div.results-container")
  # product links and ratings inside the container
  Name <- querySelectorAll(A, "div.reviews>a")
  Ratings <- querySelectorAll(A, "div.value")
  N[n, ] <- sapply(Name, function(x) xmlGetAttr(x, "href"))
  R[n, ] <- sapply(Ratings, xmlValue)
}
Referring to the html source reveals that the urls you want can be simplified to this structure:
http://www.goodguide.com/products?category_id=152775-baby-care&page=2&sort_order=DESC.
The content of these urls is retrieved by R as expected.
Note that you can also go straight to:
u <- sprintf('http://www.goodguide.com/products?category_id=152775-baby-care&page=%s&sort_order=DESC', n)
Parse <- htmlParse(u)
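Because `sprintf()` is vectorised, the URLs for all pages can be built in one step before any downloading starts. A minimal sketch follows; the page count of 26 is taken from the loop in the question, and `template` and `page_urls` are illustrative names.

```r
# Template from the simplified URL above; %d is replaced by the page number.
template <- "http://www.goodguide.com/products?category_id=152775-baby-care&page=%d&sort_order=DESC"

# One URL per page, 1 through 26.
page_urls <- sprintf(template, 1:26)
print(page_urls[2])
```

Each element of `page_urls` can then be passed to `htmlParse()` inside the loop, in place of the URL with the `#!` fragment.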