Does anyone know whether I can scrape this site or this one with httr and rvest, or should I use Selenium or PhantomJS?
Both sites seem to use AJAX, and I can't seem to get through it.
Essentially, what I am after is the following:
library(rvest)
library(magrittr) # for the %>% pipe
# I want this to return the titles of the listings, but I get character(0)
"https://www.sahibinden.com/satilik" %>%
  read_html() %>%
  html_nodes(".searchResultsItem .classifiedTitle") %>%
  html_text()
# I want this to return the prices of the listings, but I get a 503 error
"https://www.hurriyetemlak.com/konut" %>%
  read_html() %>%
  html_nodes(".listing-item .list-view-price") %>%
  html_text()
Any ideas involving V8 or artificial sessions are welcome.
Purely curl solutions are welcome too; I'll try to translate them into httr later :)
Thanks
You will have to set cookies to make a successful request.
One should check whether the site (sahibinden) allows scraping:
robotstxt::paths_allowed(paths = "https://www.sahibinden.com/satilik", warn = FALSE) --> robotstxt does not seem to forbid it
if you reload the site after deleting its cookies in the browser, the site no longer allows access and reports unusual behaviour --> an indication of countermeasures against scraping
to be sure, one should read the terms of usage.
Therefore, I will share the "theoretical" code, but not the required cookie data, which is user-dependent anyway.
Full code would read:
library(xml2)
library(rvest)    # needed for html_nodes() and html_text()
library(httr)
library(magrittr)
library(DT)
url <- "https://www.sahibinden.com/satilik"
YOUR_COOKIE_DATA <- NULL
if(is.null(YOUR_COOKIE_DATA)){
stop("You did not set your cookie data.
Also please check if terms of usage allow the scraping.")
}
response <- url %>% GET(add_headers(.headers = c(Cookie = YOUR_COOKIE_DATA))) %>%
content(type = "text", encoding = "UTF-8")
xpathes <- data.frame(
XPath0 = 'td[2]',
XPath1 = 'td[3]/a[1]',
XPath2 = 'td/span[1]',
XPath3 = 'td/span[2]',
XPath4 = 'td[4]',
XPath5 = 'td[5]',
XPath6 = 'td[6]',
XPath7 = 'td[7]',
XPath8 = 'td[8]',
stringsAsFactors = FALSE # keep the XPaths as plain character strings (needed on R < 4.0)
)
nodes <- response %>% read_html %>% html_nodes(xpath =
"/html/body/div/div/form/div/div/table/tbody/tr"
)
output <- lapply(xpathes, function(xpath){
lapply(nodes, function(node) html_nodes(x = node, xpath = xpath) %>%
{ifelse(length(.), yes = html_text(.), no = NA)}) %>% unlist
})
output %>% data.frame %>% DT::datatable()
Concerning the right to scrape the website data, I try to follow: Should questions that violate API Terms of Service be flagged? Although, in this case, it is only a "potential" violation.
Reading cookies programmatically:
I am not sure it is possible to fully skip using the browser:
Why doesn't document.cookie show all the cookie for the site?
Selenium WebDriver manager().getCookies() returns 0 always
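If you do copy the cookie values manually out of the browser's developer tools, a minimal sketch of assembling them into the Cookie header used above could look like this (the cookie names and values are made up):
library(httr)
# hypothetical cookie names/values copied from the browser's developer tools
my_cookies <- c(vid = "XXXXXXXX", cf_clearance = "YYYYYYYY")
# collapse them into a single "name=value; name=value" Cookie header string
YOUR_COOKIE_DATA <- paste(names(my_cookies), my_cookies, sep = "=", collapse = "; ")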
I am doing a project where I need to download FAFSA completion data from this website: https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school
I am using rvest to webscrape that data, but when I try to use the function read_html on the link, it never reads in and eventually I have to stop execution. I can read in other websites, so I'm not sure if it is a website specific issue or if I'm doing something wrong. Here is my code so far:
library(rvest)
fafsa_link <- "https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school"
read_html(fafsa_link)
Any help would be greatly appreciated! Thank you!
A user-agent header is required. The download links are also given in a JSON file. You could regex the links out (or indeed parse them out properly); or, as I do, regex out one and then substitute the state code within it to get the other download URLs (given the URLs only vary in this respect).
library(magrittr)
library(httr)
library(stringr)
# request the page's JSON (a user-agent header is required)
data <- httr::GET('https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json', add_headers("User-Agent" = "Mozilla/5.0")) %>%
  content(as = "text")
# regex out the California download link, then swap in the state code for Massachusetts
ca <- data %>% stringr::str_match(': "(.*?CA\\.xls)"') %>% .[2] %>% paste0('https://studentaid.gov', .)
ma <- gsub('CA\\.xls', 'MA\\.xls', ca)
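If you need more than a couple of states, a hedged sketch of the same substitution idea applied over a vector of state codes (this assumes every state's file follows the same "XX.xls" naming pattern, which I haven't verified):
# hypothetical set of state codes; each URL is assumed to differ only in the "XX.xls" part
states <- c("CA", "MA", "NY", "TX")
urls <- vapply(states, function(st) gsub("CA\\.xls", paste0(st, ".xls"), ca), character(1))
# uncomment to download them all into the working directory
# Map(download.file, urls, basename(urls), mode = "wb")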
I am using the R programming language for NLP (natural language processing) analysis - for this, I need to "webscrape" publicly available information on the internet.
Recently, I learned how to "webscrape" a single PDF file from the website I am using:
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tibble)
#this is an example of a single pdf
url <- "https://www.canlii.org/en/ns/nswcat/doc/2013/2013canlii47876/2013canlii47876.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words <- article_words %>%
anti_join(stop_words, by = "word")
#this final command can take some time to run
article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)
#Sources: https://stackoverflow.com/questions/66979242/r-error-in-textrank-sentencesdata-article-sentences-terminology-article-w , https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/
The above code works fine if you want to manually access a single website and then "webscrape" it. Now, I want to try to automatically download 10 such articles at the same time, without manually visiting each page. For instance, suppose I want to download the first 10 PDFs from this website: https://www.canlii.org/en/#search/type=decision&text=dog%20toronto
I think I found the following website which discusses how to do something similar (I adapted the code for my example): https://towardsdatascience.com/scraping-downloading-and-storing-pdfs-in-r-367a0a6d9199
library(tidyverse)
library(rvest)
library(stringr)
page <- read_html("https://www.canlii.org/en/#search/type=decision&text=dog%20toronto")
raw_list <- page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.pdf") %>%
  str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>%
  map(read_html) %>%
  map(html_node, "#raw-url") %>%
  map(html_attr, "href") %>%
  str_c("https://www.canlii.org/en/#search/type=decision&text=dog", .) %>%
  walk2(., basename(.), download.file, mode = "wb")
But this produces the following error:
Error in .f(.x[[1L]], .y[[1L]], ...) : scheme not supported in URL 'NA'
Can someone please show me what I am doing wrong? Is it possible to download the first 10 PDF files that appear on this website and save them individually in R as "pdf1", "pdf2", ... "pdf9", "pdf10"?
Thanks
I see some people suggesting that you use RSelenium, which is a way to simulate browser actions so that the web server renders the page as if a human were visiting the site. From my experience it is almost never necessary to go down that route. The JavaScript part of the website is interacting with an API, and we can utilize that to circumvent the JavaScript part and get the raw JSON data directly. In Firefox (and I assume Chrome is similar in this regard) you can right-click on the website and select “Inspect Element (Q)”, go to the “Network” tab and click on reload. You'll see that each request the browser makes to the web server is listed after a few seconds or less. We are interested in the ones that have the “Type” json.
When you right-click on an entry you can select “Open in New Tab”. One of the requests that returns JSON has the following URL attached to it: https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1
Opening that URL in Firefox gets you to a GUI that lets you explore the JSON data structure, and you'll see that there is a “results” entry which contains the data for the first 25 results of your search. Each one has a “path” entry that leads to the page displaying the embedded PDF. It turns out that if you replace the “.html” part with “.pdf”, that path leads directly to the PDF file. The code below utilizes all of this information.
library(tidyverse) # tidyverse for the pipe and for `purrr::map*()` functions.
library(httr) # this should already be installed on your machine as `rvest` builds on it
library(pdftools)
#> Using poppler version 20.09.0
library(tidytext)
library(textrank)
base_url <- "https://www.canlii.org"
json_url_search_p1 <-
"https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1"
This downloads the json for page 1 / results 1 to 25
results_p1 <-
GET(json_url_search_p1, encode = "json") %>%
content()
For each result we extract the path only.
result_html_paths_p1 <-
map_chr(results_p1$results,
~ .$path)
We replace “.html” with “.pdf”, combine the base URL with the path to
generate the full URLs pointing to the PDFs. Last we pipe it into purrr::map()
and pdftools::pdf_text in order to extract the text from all 25 PDFs.
pdf_texts_p1 <-
gsub(".html$", ".pdf", result_html_paths_p1) %>%
paste0(base_url, .) %>%
map(pdf_text)
If you want to do this for more than just the first page you might want to
wrap the above code in a function that lets you switch out the “&page=”
parameter. You could also make the “&text=” parameter an argument of the
function in order to automatically scrape results for other searches.
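A minimal sketch of such a wrapper, assuming the ajaxSearch.do endpoint keeps accepting the same type, text and page query parameters as above:
get_pdf_texts <- function(text, page = 1, base_url = "https://www.canlii.org") {
  # build the ajaxSearch.do URL for the requested search text and page
  json_url <- paste0(base_url, "/en/search/ajaxSearch.do?type=decision",
                     "&text=", URLencode(text, reserved = TRUE),
                     "&page=", page)
  results <- GET(json_url) %>% content()
  # extract the result paths, swap ".html" for ".pdf" and read the PDFs
  html_paths <- map_chr(results$results, ~ .$path)
  gsub(".html$", ".pdf", html_paths) %>%
    paste0(base_url, .) %>%
    map(pdf_text)
}
# e.g. pdf_texts_p2 <- get_pdf_texts("dogs toronto", page = 2)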
For the remaining part of the task we can build on the code you already have.
We make it a function that can be applied to any article and apply that function
to each PDF text again using purrr::map().
extract_article_summary <-
function(article) {
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words <- article_words %>%
anti_join(stop_words, by = "word")
textrank_sentences(data = article_sentences, terminology = article_words)
}
This will now take a really long time!
article_summaries_p1 <-
map(pdf_texts_p1, extract_article_summary)
Alternatively you could use furrr::future_map() instead to utilize all the CPU
cores in your machine and speed up the process.
library(furrr) # make sure the package is installed first
plan(multisession)
article_summaries_p1 <-
future_map(pdf_texts_p1, extract_article_summary)
Disclaimer
The code in the answer above is for educational purposes only. As many websites do, this service restricts automated access to its contents. The robots.txt explicitly disallows the /search path from being accessed by bots. It is therefore recommended to get in contact with the site owner before downloading large amounts of data. CanLII offers API access on an individual-request basis; see the documentation here. This would be the correct and safest way to access their data.
I'm doing some scraping, but as I'm parsing approximately 4000 URLs, the website eventually detects my IP and blocks me every 20 iterations.
I've added a bunch of Sys.sleep(5) calls and a tryCatch so I'm not blocked too soon.
I use a VPN, but I have to manually disconnect and reconnect it every now and then to change my IP. That's not a suitable solution for a scraper that is supposed to run all night long.
I think rotating a proxy should do the job.
Here's my current code (part of it, at least):
library(rvest)
library(dplyr)
scraped_data = data.frame()
for (i in urlsuffixes$suffix)
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/", i)
page = read_html(doctolib_url)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
name,
address,
job_title,
stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}
Interesting question. I think the first thing to note is that, as mentioned on this Github issue, rvest and xml2 use httr for the connections. As such, I'm going to introduce httr into this answer.
Using a proxy with httr
The following code chunk shows how to use httr to query a url using a proxy and extract the html content.
page <- httr::content(
httr::GET(
url,
httr::use_proxy(ip, port, username, password)
)
)
If you are using IP authentication or don't need a username and password, you can simply exclude those values from the call.
In short, you can replace the page = read_html(doctolib_url) line with the code chunk above.
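For instance, with IP-authenticated proxies (the IP and port below are placeholders) the same call reduces to:
page <- httr::content(
  httr::GET(
    doctolib_url,
    httr::use_proxy("64.235.204.107", 8080)
  )
)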
Rotating the Proxies
One big problem with using proxies is getting reliable ones. For this, I'm just going to assume that you have a reliable source. Since you haven't indicated otherwise, I'm going to assume that your proxies are stored in the following reasonable format with object name proxies:
ip                port
64.235.204.107    8080
167.71.190.253    80
185.156.172.122   3128
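For example, such a proxies object could be built directly in R like this (the addresses are just the placeholders from the table, not working proxies):
proxies <- data.frame(
  ip   = c("64.235.204.107", "167.71.190.253", "185.156.172.122"),
  port = c(8080, 80, 3128),
  stringsAsFactors = FALSE
)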
With that format in mind, you could tweak the script chunk above to rotate proxies for every web request as follows:
library(dplyr)
library(httr)
library(rvest)
scraped_data = data.frame()
for (i in 1:length(urlsuffixes$suffix))
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/",
urlsuffixes$suffix[[i]])
# The number of urls is longer than the proxy list -- which proxy to use
# I know this isn't the greatest, but it works so whatever
proxy_id <- ifelse(i %% nrow(proxies) == 0, nrow(proxies), i %% nrow(proxies))
page <- httr::content(
httr::GET(
doctolib_url,
httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]])
)
)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
name,
address,
job_title,
stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}
This may not be enough
You might want to go a few steps further and add elements to the httr request, such as a user-agent header, etc. However, one of the big problems with a package like httr is that it can't render dynamic HTML content, such as JavaScript-rendered HTML, and any website that really cares about blocking scrapers is going to detect this. To overcome this problem there are tools such as headless Chrome that are meant to address exactly this kind of issue. Here's a package you might want to look into for headless Chrome in R (note: still in development).
Disclaimer
Obviously, I think this code will work but since there's no reproducible data to test with, it may not.
As already said by @Daniel-Molitor, headless Chrome gives stunning results.
Another cheap option in RStudio is looping over a list of proxies, although you have to start a new R process afterwards:
Sys.setenv(http_proxy=proxy)
.rs.restartR()
Sys.sleep(1) can even be omitted afterwards ;-)
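A minimal sketch of what that could look like, reusing the placeholder addresses from the answer above (each restart wipes the session, so the looping is effectively manual or driven by an outer script):
proxy_pool <- c("http://64.235.204.107:8080", "http://167.71.190.253:80") # placeholders
Sys.setenv(http_proxy = sample(proxy_pool, 1)) # point R's HTTP traffic at one proxy
.rs.restartR()                                 # RStudio-only session restart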
New to programming and trying to scrape data from the site below. When I run the code below, it returns an empty dataset or table. Any help or alternatives will be greatly appreciated.
library(rvest)
library(magrittr) # for the %>% pipe
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
tab <- url %>% read_html %>%
  html_node("dogruns_wrapper") %>%
  html_text()
View(tab)
I have tried with XPath and get the same result, and html_table() instead of html_text() returns an error: no applicable method for 'html_table' applied to an object of class "xml_missing".
As Mislav stated, the table is generated with JavaScript, so your best option is RSelenium.
In addition, if you want to get the table, you can get it with less code if you use html_table().
My try:
# Load packages
library(rvest) #Loading the rvest package
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the webpage
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# define url
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
# go to website
remDr$navigate(url)
# as it's being loaded with JavaScript and it has a slow load, add a sleep here
Sys.sleep(10) # increase as needed
# get the html object of the webpage
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# read the table in the html_obj
tab <- html_obj %>% html_table() %>% .[[1]]
Hope it helps! However, always check if webpages allow scraping before doing it!
Check Terms and conditions:
Except for the direct purpose of viewing, printing, accessing or
interacting with the Web Site for your own personal use or as
otherwise indicated on the Web Site or these Terms and Conditions, you
must not copy, reproduce, modify, communicate to the public, adapt,
transfer, distribute, download or store any of the contents of the Web
Site (including Race Information as described below), or incorporate
any part of the Web Site into another web site without GRV’s written
consent.
Okay, so I am stuck on what seems like it would be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name based on the entered URL. Here is an example of my code:
library(rvest)
url <- "http://www.morningstar.com/funds/xnas/fbalx/quote.html"
url %>%
  read_html() %>%
  html_node('r_title')
I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}
Suggestions?
Aaron
edit:
I also tried scraping via an XHR request, but I think my issue is not knowing which CSS selector or XPath to use to find the appropriate data.
XHR code:
library(httr)  # for GET() and content()
library(rvest) # for html_nodes() and html_text()
get.morningstar.Table1 <- function(Symbol.i, htmlnode){
  x <- tryCatch({
    res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
               query = list(
                 t = Symbol.i,
                 region = "usa",
                 culture = "en-US",
                 version = "RET",
                 test = "QuoteiFrame"
               ))
    content(res) %>%
      html_nodes(htmlnode) %>%
      html_text() %>%
      trimws()
  }, error = function(e) NA)
  return(x)
} #HTML Node in this case is a vkey
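A hedged usage sketch of the function above (the selector just grabs every element carrying a vkey attribute, since the specific keys the endpoint exposes aren't known here):
# hypothetical call: pull the text of every element with a vkey attribute for FBALX
get.morningstar.Table1("FBALX", "[vkey]")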
Still, the question remains: am I using the correct CSS selector/XPath to look this up? The XHR approach works great for requests that have a clear CSS selector.
OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.
I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:
library(rvest)
url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'
page <- read_html(url)
title <- page %>%
html_node('title') %>%
html_text()
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")
cleanTitle <- gsub(regex, '\\1', title)
As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:
mypage %>%
html_node('.myClass')
Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.
A final note: other sites contain the same info and are easier to scrape (like Yahoo Finance).
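For example, a hedged sketch of pulling the fund name from Yahoo Finance with the same rvest tools (the URL pattern and the h1 selector are assumptions about Yahoo's current markup, which changes over time):
library(rvest)
# hypothetical: Yahoo Finance quote pages carry the security name in the page's h1
"https://finance.yahoo.com/quote/FBALX" %>%
  read_html() %>%
  html_node("h1") %>%
  html_text()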