Scrape different levels of web pages in R

I am doing research on World Bank (WB) projects in developing countries.
To do so, I am scraping their website to collect the data I am interested in.
The structure of the site I want to scrape is the following:
1. List of countries: all countries in which the WB has developed projects.
1.1. Clicking on a single country in 1. gives that country's project list, which spans many webpages; every country has a number of pages dedicated to its projects.
1.1.1. Clicking on a single project in 1.1. gives, among other things, the project's overview, which is what I am interested in.
In other words, my problem is to find a way to create a dataframe that includes all the countries, the complete list of projects for each country, and the overview of each project.
So far, this is the code that I have (unsuccessfully) written:
library(rvest)
library(tidyverse)

WB_links <- "http://projects.worldbank.org/country?lang=en&page=projects"

WB_proj <- function(x) {
  Sys.sleep(5)
  url <- sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", x)
  html <- read_html(url)
  tibble(title       = html_nodes(html, ".grid_20") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".grid_20") %>% html_attr("href"))
}

WB_scrape <- map_df(1:5, WB_proj) %>%
  mutate(study_description =
           map(project_url,
               ~ read_html(sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", .x)) %>%
                   html_node() %>%
                   html_text()))
Any suggestions?
Note: I am sorry if this question seems trivial, but I am quite a newbie in R and I haven't found help on this by looking around (though I could have missed something, of course).
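For clarity, here is a rough sketch of the two-level pattern I am trying to implement. The CSS selectors (".project-title a", ".project-overview"), the &page= parameter, and the "XX" country code are guesses on my part, not something I have verified against the current World Bank site:

library(rvest)
library(purrr)
library(dplyr)

scrape_country_page <- function(country_code, page = 1) {
  # The "page" parameter and the selectors below are assumptions
  url <- sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s&page=%s",
                 country_code, page)
  html  <- read_html(url)
  links <- html_nodes(html, ".project-title a")   # selector is a guess
  tibble(country     = country_code,
         project     = html_text(links, trim = TRUE),
         project_url = html_attr(links, "href"))
}

scrape_overview <- function(project_url) {
  read_html(project_url) %>%
    html_node(".project-overview") %>%            # selector is a guess
    html_text(trim = TRUE)
}

# One placeholder country ("XX"), first page; wait between requests to be polite
projects <- scrape_country_page("XX", page = 1) %>%
  mutate(overview = map_chr(project_url, ~ { Sys.sleep(2); scrape_overview(.x) }))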

Related

R: scraping from links to multiple websites

I am pretty new to web scraping and I need to scrape newspaper article content from a list of URLs pointing to articles on different websites. I would like to obtain the actual text of each document; however, I cannot find a way to automate the scraping procedure across links that point to different websites.
In my case, the data are stored in "dublin", a dataframe with a "page" column holding the source name and a second column holding the article URL.
So far, I have managed to scrape articles from one website at a time, so that I can rely on a single CSS path (found with SelectorGadget) to retrieve the text. Here is the code I use to scrape content from documents on the same site, in this case those posted by The Irish Times:
library(xml2)
library(rvest)
library(dplyr)

dublin <- dublin %>%
  filter(page == "The Irish Times")   # keep only Irish Times rows

link <- pull(dublin, 2)               # column 2 holds the article URLs

articles <- list()
for (i in link) {
  page <- read_html(i)
  text <- page %>%
    html_elements(".body-paragraph") %>%
    html_text()
  articles[[i]] <- text
}
articles
It works. However, since the CSS structure varies from site to site, I was wondering whether there is any way to automate this procedure across all the elements of the "url" variable.
Here is an example of the links I scraped:
https://www.thesun.ie/news/10035498/dublin-docklands-history-augmented-reality-app/
https://lovindublin.com/lifestyle/dublins-history-comes-to-life-with-new-ar-app-that-lets-you-experience-it-first-hand
https://www.irishtimes.com/ireland/dublin/2023/01/11/phone-app-offering-augmented-reality-walking-tour-of-dublins-docklands-launched/
https://www.dublinlive.ie/whats-on/family-kids-news/new-augmented-reality-app-bring-25949045
https://lovindublin.com/news/campaigners-say-we-need-to-be-ambitious-about-potential-lido-for-georges-dock
Thank you in advance! I hope the material I provided is enough.
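One possible approach, sketched here under the assumption that a single CSS selector per domain is enough: keep a small lookup table from domain to selector and fall back to plain <p> tags for sites not in the table. All selectors other than .body-paragraph are unverified placeholders:

# Sketch: map each domain to a guessed selector; fall back to <p> for unknown sites.
library(rvest)
library(purrr)

selector_by_domain <- c(
  "irishtimes.com"  = ".body-paragraph",
  "thesun.ie"       = ".article__content p",   # unverified guess
  "lovindublin.com" = "article p",             # unverified guess
  "dublinlive.ie"   = ".article-body p"        # unverified guess
)

scrape_article <- function(url) {
  domain   <- sub("^https?://(www\\.)?([^/]+).*", "\\2", url)
  selector <- selector_by_domain[domain]
  if (is.na(selector)) selector <- "p"          # generic fallback
  read_html(url) %>%
    html_elements(selector) %>%
    html_text() %>%
    paste(collapse = "\n")
}

# Column 2 of "dublin" holds the article URLs, as in the loop above
articles <- map_chr(dublin[[2]], scrape_article)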

How do I extract certain html nodes using rvest?

I'm new to web-scraping so I may not be doing all the proper checks here. I'm attempting to scrape information from a url, however I'm not able to extract the nodes I need. See sample code below. In this example, I want to get the product name (Madone SLR 9 eTap Gen) which appears to be stored in the buying-zone__title class.
library(tidyverse)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
read_html(url) %>%
  html_nodes(".buying-zone__title") %>%
  html_text()
When I run the code above, I get {xml_nodeset (0)}. How can I fix this? I would also like to scrape the year, price, available colors and specs from that page. Any help will be appreciated.
There is a lot of dynamic content on that page, which you can confirm by disabling JavaScript in the browser or by comparing the rendered page against the page source.
You can view the page source with Ctrl+U, then use Ctrl+F to search for where the product name appears within the non-rendered content.
The title info you want is present in lots of places, and there are numerous ways to obtain it. I will offer a "does what it says on the tin" option, since the code gives clear indications of what is being selected.
I've also updated the syntax and reduced the number of imported external dependencies.
library(magrittr)
library(rvest)

url <- "https://www.trekbikes.com//us/en_US/bikes/road-bikes/performance-road-bikes/madone/madone-slr/madone-slr-9-etap-gen-7/p/37420"
page <- read_html(url)

# Product name is stored as an attribute rather than as rendered text
name  <- page %>% html_element("[product-name]") %>% html_attr("product-name")
# Spec tables are ordinary <table> elements with class "sprocket__table spec"
specs <- page %>% html_elements('[class="sprocket__table spec"]') %>% html_table()
# Price is read from the value attribute of the #gtm-product-display-price element
price <- page %>% html_element('#gtm-product-display-price') %>% html_attr('value') %>% as.numeric()
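As a small follow-up, and assuming the selectors above still match the live page, the pieces can be gathered for later use like this:

# name and price go into a one-row data frame; specs stays a list of
# data frames, one per spec table on the page
product <- data.frame(name = name, price = price)
product
specs[[1]]   # inspect the first spec table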

Web Scraping - Table Name

New to web scraping.
I am trying to scrape a site. I recently learnt how to get information from tables, but I want to know how to get the table name. (I believe "table name" might be the wrong term here, but bear with me.)
E.g. https://www.msc.com/che/about-us/our-fleet?page=1
MSC is a shipping firm, and I need to get the list of their fleet and information on each ship.
I have written the following code that will retrieve the table data for each ship.
df <- MSCwp[i, 1] %>%
  read_html() %>%
  html_table()
MSCwp is the list of URLs. This code gets me all the information I need about the ships listed on the webpage except their names.
Is there any way to retrieve the name along with the table?
E.g. df for the above-mentioned website will return 10 tables (corresponding to the ships on the webpage). df[1] will have information about the ship Agamemnon, but I am not sure how to retrieve the ship name along with the table.
You need to pull the names out from the main page.
library(rvest)
library(dplyr)
url <- "https://www.msc.com/che/about-us/our-fleet?page=1"
page <- read_html(url)
names <- page %>% html_elements("dd a") %>% html_text()
names
[1] "AGAMEMNON" "AGIOS DIMITRIOS" "ALABAMA" "ALLEGRO" "AMALTHEA" "AMERICA" "ANASTASIA"
[8] "ANTWERP TRADER" "ARCHIMIDIS" "ARIES"
In this case I am looking for the text in the "a" child node of the "dd" nodes.
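As a follow-up sketch, assuming the page returns one table per ship in the same order as the links above, the names can be attached to the list of tables so each one can be looked up by ship:

tables <- page %>% html_table()
names(tables) <- names        # assumes tables and names are in the same order
tables[["AGAMEMNON"]]         # look up a ship's table by its name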

Web scrape subtitles from opensubtitles.org in R

I'm new to web scraping, and I'm currently trying to download subtitle files for over 100,000 films for a research project. Each film has a unique IMDb ID (e.g., the ID for Inception is 1375666). I have a list in R containing the 102524 IDs, and I want to download the corresponding subtitles from opensubtitles.org.
Each film has its own page on the site, for example, Inception has:
https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-1375666
The link to download the subtitles is obtained by clicking on the first link in the table called "Movie name", which takes you to a new page, and then clicking the "Download" button on that page.
I'm using rvest to scrape the pages, and I've written this code:
for (i in 1:102524) {
  subtitle.url <- paste0("https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-", movie.ids[i])
  read_html(subtitle.url) %>%
    html_nodes(".head+ .expandable .bnone")
  # Not sure where to go from here
}
Any help on how to do this will be greatly appreciated.
EDIT: I know I'm asking something pretty complicated, but any pointers on where to start would be great.
Following the link and the download button, we can see that the actual subtitle file is downloaded from https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922 (for your example). I found this out by inspecting the Network tab in Mozilla's Developer Tools while doing a download.
We can download directly from that address using:
download.file('https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922',
              destfile = 'subtitle-6961922.zip')
The base URL (https://www.opensubtitles.org/en/download/vrf-108d030f/sub/) is fixed for all the downloads as far as I can see, so we only need the id.
The id can be found within the search page like this:
id <- read_html(subtitle.url) %>%
  html_node('.bnone') %>%
  html_attr('href') %>%
  stringr::str_extract('\\d+')
So, putting it all together:
search_url   <- 'https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-'
download_url <- 'https://www.opensubtitles.org/en/download/vrf-108d030f/sub/'

for (i in 1:102524) {
  subtitle.url <- paste0(search_url, movie.ids[i])
  id <- read_html(subtitle.url) %>%
    html_node('.bnone') %>%
    html_attr('href') %>%
    stringr::str_extract('\\d+')
  download.file(paste0(download_url, id),
                destfile = paste0('subtitle-', movie.ids[i], '.zip'))
  # Wait somewhere between 1 and 4 seconds before the next download,
  # as a courtesy to the site
  Sys.sleep(runif(1, 1, 4))
}
Keep in mind this will take a long time!
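With over 100,000 downloads, some requests will almost certainly fail partway through. A hedged variant of the loop above (same URLs and selectors, just wrapped for resilience) skips files that already exist and catches errors so one bad page does not stop the run:

# Assumes rvest is loaded and search_url / download_url / movie.ids are defined as above
for (i in 1:102524) {
  destfile <- paste0('subtitle-', movie.ids[i], '.zip')
  if (file.exists(destfile)) next     # resume-friendly: skip completed downloads
  tryCatch({
    subtitle.url <- paste0(search_url, movie.ids[i])
    id <- read_html(subtitle.url) %>%
      html_node('.bnone') %>%
      html_attr('href') %>%
      stringr::str_extract('\\d+')
    download.file(paste0(download_url, id), destfile = destfile)
  }, error = function(e) {
    message("Failed for IMDb id ", movie.ids[i], ": ", conditionMessage(e))
  })
  Sys.sleep(runif(1, 1, 4))
}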

How to write a loop to extract articles from website archive which links to numerous external sources?

I am trying to extract articles for a period of 200 days from the Time.mk archive, e.g. http://www.time.mk/week/2016/22. Each day has its top 10 headings, each of which links to all articles related to it (at the bottom of each heading, e.g. "320 поврзани вести", i.e. "320 related news"). Following this link leads to a list of all related articles.
This is what I've managed so far:
library(rvest)

url <- "http://www.time.mk/week/2016/22"
frontpage <- read_html(url) %>%
  html_nodes(".other_articles") %>%
  html_attr("href") %>%
  paste0()

mark <- "http://www.time.mk/"
frontpagelinks <- paste0(mark, frontpage)
With this, I get the primary links leading to the related news.
The following extracts all of the links to related news for the first heading, from which I then filter out only those links that I need.
final <- list()
final <- read_html(frontpagelinks[1]) %>%
  html_nodes("h1 a") %>%
  html_attr("href") %>%
  paste0()
My question is how I can instruct R, whether via a loop or some other option, to extract the links from all 10 headings in "frontpagelinks" at once. I tried a variety of options but nothing really worked.
Thanks!
EDIT
Parfait's response worked like a charm! Thank you so much.
However, I've run into an inexplicable issue after using that solution.
Whereas before, when I was going link by link, I could easily sort out the data for only those portals that I need via:
a1onJune = str_extract_all(dataframe, ".a1on.")
Which provided me with a clean output: [130] "a1on dot mk/wordpress/archives/618719"
with only the links I needed. Now, if I try to run the same code on the larger dataframe of all links, I inexplicably get many variants of this:
"\"alsat dot mk/News/255645\", \"a1on dot mk/wordpress/archives/620944\", , \"http://www dot libertas dot mk/sdsm-poradi-kriminalot-na-vmro-dpmne-makedonija-stana-slepo-tsrevo-na-balkanot/\",
As you can see, it returns my desired link, but also many others (I've edited most out for clarity's sake) that occur both before and after it.
Given that I'm using the same expression, I don't see why this would be happening.
Any thoughts?
Simply run lapply to return a list of links from each element of frontpagelinks:
linksList <- lapply(frontpagelinks, function(i) {
  read_html(i) %>%
    html_nodes("h1 a") %>%
    html_attr("href")
})
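Regarding the follow-up issue in the EDIT, a sketch of one way to keep only the a1on links: flatten the list into a plain character vector first and filter element by element, rather than running str_extract_all over the whole object (the "a1on" pattern here is taken from the desired output above):

# Flatten the per-heading lists into one character vector of links,
# then keep only those pointing at the a1on portal.
all_links  <- unlist(linksList)
a1on_links <- all_links[grepl("a1on", all_links, fixed = TRUE)]
a1on_links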
