R Web scraping from different URLs - r

I am web scraping a page at
http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
From this url, I have built up a dataframe through the following code:
dflist <- map(.x = 1:417, .f = function(x) {
Sys.sleep(5)
url <- ("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=")
read_html(url) %>%
html_nodes(".title a") %>%
html_text() %>%
as.data.frame()
}) %>% do.call(rbind, .)
I have repeated the same code in order to get all the data I was interested in and it seems to work perfectly, although is of course a little slow due to the Sys.sleep() thing.
My issue has raised once I have tried to scrape the single projects descriptions that should be included in the dataframe.
For instance, the first project description is at
http://catalog.ihsn.org/index.php/catalog/7118/study-description
the second project description is at
http://catalog.ihsn.org/index.php/catalog/6606/study-description
and so forth.
My problem is that I can't find a dynamic way to scrape all the projects' pages and insert them in the data frame, being the number in the URLs not progressive nor at the end of the link.
To make things clearer, this is the structure of the website I am scraping:
1.http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
1.1. http://catalog.ihsn.org/index.php/catalog/7118
1.1.a http://catalog.ihsn.org/index.php/catalog/7118/related_materials
1.1.b http://catalog.ihsn.org/index.php/catalog/7118/study-description
1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary
I have scraped successfully level 1. but cannot level 1.1.b. (study-description) , the one I am interested in, since the dynamic element of the URL (in this case: 7118) is not consistent in the website's above 6000 pages of that level.

You have to extract the deeper urls from the .title a and then scrape those as well. Here's a small example on how to do that using rvest and the tidyverse
library(tidyverse)
library(rvest)
scraper <- function(x) {
Sys.sleep(5)
url <- sprintf("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=%s&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=", x)
html <- read_html(url)
tibble(title = html_nodes(html, ".title a") %>% html_text(trim = TRUE),
project_url = html_nodes(html, ".title a") %>% html_attr("href"))
}
result <- map_df(1:2, scraper) %>%
mutate(study_description = map(project_url, ~read_html(sprintf("%s/study-description", .x)) %>% html_node(".xsl-block") %>% html_text()))
This isn't complete as to all the things you want to do, but should show you an approach.

Related

How scrape dynamic pages when there's only one URL

I am trying to scrape data from this webpage: https://www.premierleague.com/stats/top/players/saves however there are two pages of data i want to scrape. I have been able to scrape the first page of data with the code below:
remDr$navigate("https://www.premierleague.com/stats/top/players/saves")
epl <- read_html(remDr$getPageSource()[[1]])
rank <- epl %>% html_nodes(".statsTableContainer .rank") %>% html_text()
player <- epl %>% html_nodes(".playerName ") %>% html_text()
club <- epl %>% html_nodes(".statNameSecondary") %>% html_text()
stat <- epl %>% html_nodes(".statsTableContainer .text-centre") %>% html_text()
str(rank)
str(player)
str(club)
str(stat)
Saves <- data.frame(rank, player, club, stat)
I have been using the RSelenium pkg for the scraping. For the second page there isn't a different URL you have to click the arrow on the side. How do i scrape the second page when there's only an arrow to select?
I haven't been able to try anything as i'm not sure where to even start as i've not come accross this problem before.

How can i scrape the complete dataset from yahoo finance with rvest

Im trying to get the complete data set for bitcoin historical data from yahoo finance via web scraping, this is my first option code chunk:
library(rvest)
library(tidyverse)
crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url,css = "table")
cryp_table <- html_table(cryp_table,fill = T) %>%
as.data.frame()
I the link that i provide to read_html() a long period of time is already selected, however it just get the first 101 rows and the last row is the loading message that you get when you keep scrolling, this is my second shot but i get the same:
col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <-
col_page %>%
html_nodes(xpath = '//*[#id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>%
html_table(fill = T)
cryp_final <- cryp_table[[1]]
How can i get the whole dataset?
I think you can get the link of download, if you view the Network, you see the link of download, in this case:
"https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"
Well, this link looks like the url of the site, i.e., we can modify the url link to get the download link and read the csv. See the code:
library(stringr)
library(magrittr)
site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"
download_link <- site %>%
stringr::str_remove_all(".+(?<=quote/)|/history?|&frequency=1d") %>%
stringr::str_replace("filter", "events") %>%
stringr::str_c(base_download, .)
readr::read_csv(download_link)

Webscraping with R: How can I tidy my data?

I'm building a web scraper for some News websites in Switzerland. After some trial & error and a lot of help from StackOverflow (thx everyone!), I've gotten to a point where I can get text data from all articles.
#packages instalieren
install.packages("rvest")
install.packages("tidyverse")
install.packages("dplyr")
library(rvest)
library(stringr)
#seite einlesen
apisrf<- read_xml('https://www.srf.ch/news/bnf/rss/1646')
urls_srf <- apisrf %>% html_nodes('link') %>% html_text()
zeit_srf <- apisrf %>% html_nodes('pubDate') %>% html_text()
#data.frame basteln
dfsrf_titel_text <- data.frame(Text = character())
#scrape
for(i in 1:length(urls_srf)) {
link <- urls_srf[i]
artikel <- read_html(link)
#Informationen entnehmen
textsrf<- artikel %>% html_nodes('p') %>% html_text()
#In Dataframe strukturieren
dfsrf_text <- data.frame(Text = textsrf)
dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
running this gives me dfsrf_titel_text. (I'm going to combine it with the titles of the articles at some point but let that be my problem.)
however, now my data is pretty untidy and I can't really figure out how to clean it in a way so it works for me. Especially annoying is that the texts from the different articles are not really structured in that way but get a new line whenever there is a paragraph in the texts. I can't combine the paragraphs because all the texts have different lengths. (The first article, starting at point 3, is super long because it's a live ticker covering the corona crisis so don't get confused if you run my code.)
how can I get R to create a new row in my dataframe only if the text is from a new article (meaning from a new URL?
thx for your help!
can you provide a sample of your data? you can use the strsplit(string, pattern) function where the pattern you specify is something that only happens between articles. Perhaps the URL?
strsplit(dfsrf_text,"www.\\w+.ch")
That will split your text anytime a URL in the .ch domain is found. you can use this regular expression cheat sheet to help you identify the pattern that seperates your articles.
You should correct this while creating dataframe itself. Here I am binding this all the data for each article together using paste0 adding new line character between them (\n\n).
library(rvest)
for(i in 1:length(urls_srf)) {
link <- urls_srf[i]
artikel <- read_html(link)
#Informationen entnehmen
textsrf<- paste0(artikel %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
#In Dataframe strukturieren
dfsrf_text <- data.frame(Text = textsrf)
dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
However, growing data in a loop is highly inefficient and can slow the process terribly especially when you have large data to scrape like this. Try using sapply.
dfsrf_titel_text <- data.frame(text = sapply(urls_srf, function(x) {
paste0(read_html(x) %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
}))
So this will give you number of rows same as length of urls_srf .

Simple solution to scrape and loop with rvest, storing the results of the for-loop in a variable

I need to collect the links from 3 pages, each having 150 links, using R with rvest library. I used a for-loop to crawl through the pages. I know that it's a very basic question, which has been answered elsewhere:
R web scraping across multiple pages
Scrape and Loop with Rvest
I tried different versions of the following code. Most of them worked but returned only 50 instead of 150 links
library(rvest)
baseurl <- "https://www.ebay.co.uk/sch/i.html?_from=R40&_nkw=chain+and+sprocket&_sacat=0&_pgn="
n <- 1:3
nextpages <- paste0(baseurl, n)
for(i in nextpages){
html <- read_html(nextpages)
links <- html %>% html_nodes("a.vip") %>% html_attr("href")
}
The code is expected to return all the 150, instead of just 50.
You're overwriting the links variable in every iteration, so you would only end up with the last 50 links.
But you're looping using the 'i' variable, whereas your read_html() function uses the nextpages variable, which is actually a vector of 3 urls. You should be getting an error.
Try this:
links <- c()
for(i in nextpages){
html <- read_html(i)
links <- c(links, html %>% html_nodes("a.vip") %>% html_attr("href"))
}
We can use map instead of a for loop.
library(rvest)
library(purrr)
map(nextpages, . %>% read_html %>%
html_nodes("a.vip") %>%
html_attr("href")) %>% flatten_chr()
#[1] "https://www.ebay.co.uk/itm/Genuine-Honda-Chain-and-sprocket-set-Honda-Cub-C50-C70-C90-Heavy-Duty/254287014069?hash=item3b34afe8b5:g:wjEAAOSwqaBdH69W"
#[2] "https://www.ebay.co.uk/itm/DID-Heavy-Duty-Drive-Chain-And-JT-Sprocket-Kit-For-Honda-MSX125-Grom-2013-2019/223130604262?hash=item33f39ed2e6:g:QmwAAOSwdrpcAQ4c"
#.....
#...

Scraping only the NEWEST blog post with rvest in R

I'm using rvest to scrape the .txt files of a blog page, and I have a script that triggers every day, and scrapes the newest post. The base of that script is an lapply function that simply scrapes all of the posts, and I later sort out duplicates using Apache NiFi.
That's not an efficient way to sort duplicates, so I was wondering if there's a way to use the same script, and only scrape the newest posts?
The posts are labelled with numbers that count up, such as BLOG001, BLOG002, etc. I want to put a line of code that makes sure to scrape the newest posts (they may post several in any given day). How do I make sure that I only get BlOG002, and the next run only get BLOG003, and so on?
library(tidyverse)
library(rvest)
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href")
# Function
out <- Map(function(ln) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
writeBin(fun1$response$content)
return(fun1$response$content)
}, links)
Assuming that all of the links you want start with 'BLOG' as in your post, and you only want to download the one with the maximum number each time the code is run. You could try something like this to achieve that.
library(tidyverse)
library(rvest)
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href")
# Make sure only 'BLOG' links are checked
links <- links[substr(links, 1, 4) == 'BLOG']
# Get numeric value from link
blog.nums <- as.numeric(substr(links, 5, nchar(links)))
# Get the maximum link value
max.link <- links[which(blog.nums == max(blog.nums))]
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", max.link)),
config(ssl_verifypeer = FALSE))
writeBin(fun1$response$content)

Resources