How can I check my session cookies and specify those cookies before making a subsequent web request?
I want to scrape a page, but I cannot work out how to send the cookies with my request.
I'm using the rvest library.
My code:
library(rvest)
WP <- html_session("http://www.wp.pl/")
headers <- httr::headers(WP)
cookies <- unlist(headers[names(headers) == "set-cookie"])
crumbs <- stringr::str_split_fixed(cookies, "; ", 4)
# method 1
stringr::str_split_fixed(crumbs[, 1], "=", 2)
# method 2
cookies(WP)
How do I set my cookies to do the web scraping?
Keep in mind that rvest is built on top of the httr library.
For some reason that I can't explain, this code didn't work until I restarted RStudio.
Here's some code that'll do the trick:
library(httr)
library(rvest)
httr::GET("http://www.wp.pl/",
set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d",
`__utma` = "29983421.138599299.1413649536.1413649536.1413649536.1",
`__utmb` = "29983421.5.10.1413649536",
`__utmc` = "29983421",
`__utmt` = "1",
`__utmz` = "29983421.1413649536.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)")) %>%
read_html() %>% # Sample rvest code
html_table(fill = TRUE) # Sample rvest code: returns a list of the page's tables
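To cover the first half of the question (checking which cookies a session has and re-sending them), one option, sketched here rather than taken from the original answer, is to read them off the response with httr::cookies() and feed them back in through set_cookies(.cookies = ...):
library(httr)
# fetch the page once and look at the cookies the server set
first <- GET("http://www.wp.pl/")
ck <- cookies(first)                      # data frame with 'name' and 'value' columns
ck[, c("name", "value")]
# turn them into a named character vector and reuse them on the next request
cookie_vec <- setNames(ck$value, ck$name)
second <- GET("http://www.wp.pl/", set_cookies(.cookies = cookie_vec))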
Related
Does anyone know whether I can scrape this site or this one with httr and rvest, or should I use Selenium or PhantomJS?
Both of the sites seem to be using AJAX, and I can't seem to get through it.
Essentially what I am after is the following:
# I want this to return the titles of the listings, but I get character(0)
"https://www.sahibinden.com/satilik" %>%
read_html() %>%
html_nodes(".searchResultsItem .classifiedTitle") %>%
html_text()
# I want this to return the prices of the listings, but I get 503
"https://www.hurriyetemlak.com/konut" %>%
read_html() %>%
html_nodes(".listing-item .list-view-price") %>%
html_text()
Any ideas with v8, or artificial sessions are welcome.
Also, any purely curl solutions are also welcome. I'll try to translate them into httr later :)
Thanks
You will have to set cookies to make a successful request.
One should check whether the site (sahibinden) allows scraping:
robotstxt::paths_allowed(paths = "https://www.sahibinden.com/satilik", warn = FALSE) suggests that robots.txt does not forbid it.
However, if you reload the site after deleting its cookies in the browser, it no longer allows access and reports unusual behaviour, which is an indication of counter-measures against scraping.
To be sure, one should read the terms of use.
Therefore, I will share the "theoretical" code, but not the required cookie data, which is user-dependent anyway.
Full code would read:
library(xml2)
library(httr)
library(magrittr)
library(DT)
url <- "https://www.sahibinden.com/satilik"
YOUR_COOKIE_DATA <- NULL
if(is.null(YOUR_COOKIE_DATA)){
stop("You did not set your cookie data.
Also please check if terms of usage allow the scraping.")
}
response <- url %>% GET(add_headers(.headers = c(Cookie = YOUR_COOKIE_DATA))) %>%
content(type = "text", encoding = "UTF-8")
xpathes <- data.frame(
XPath0 = 'td[2]',
XPath1 = 'td[3]/a[1]',
XPath2 = 'td/span[1]',
XPath3 = 'td/span[2]',
XPath4 = 'td[4]',
XPath5 = 'td[5]',
XPath6 = 'td[6]',
XPath7 = 'td[7]',
XPath8 = 'td[8]'
)
nodes <- response %>%
  read_html() %>%
  html_nodes(xpath = "/html/body/div/div/form/div/div/table/tbody/tr")
output <- lapply(xpathes, function(xpath){
lapply(nodes, function(node) html_nodes(x = node, xpath = xpath) %>%
{ifelse(length(.), yes = html_text(.), no = NA)}) %>% unlist
})
output %>% data.frame %>% DT::datatable()
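Since the question also welcomes purely curl solutions: a rough equivalent with the curl package sends the same Cookie header by hand. The cookie string below stands in for YOUR_COOKIE_DATA above; it is only a placeholder in the usual "name1=value1; name2=value2" format, not real data:
library(curl)
h <- new_handle()
handle_setheaders(h, Cookie = "name1=value1; name2=value2")   # placeholder cookie string
res <- curl_fetch_memory("https://www.sahibinden.com/satilik", handle = h)
page_html <- rawToChar(res$content)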
Concerning the right to scrape the website data, I try to follow: Should questions that violate API Terms of Service be flagged?, although in this case it is only a "potential violation".
Reading cookies programmatically:
I am not sure it is possible to fully skip using the browser:
Why doesn't document.cookie show all the cookie for the site?
Selenium WebDriver manager().getCookies() returns 0 always
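That said, if starting a real browser is acceptable, RSelenium can read the cookies programmatically once the page has loaded; a sketch, assuming a Selenium server is already running on port 4567 (as in the answer further below):
library(RSelenium)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
remDr$navigate("https://www.sahibinden.com/satilik")
Sys.sleep(5)                            # give the page time to load
ck <- remDr$getAllCookies()             # list of cookies, each with $name and $value
cookie_string <- paste(vapply(ck, function(x) paste0(x$name, "=", x$value), character(1)),
                       collapse = "; ")
remDr$close()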
I am new to programming and trying to scrape data from the site below. When I run the code below, it returns an empty dataset or table. Any help or alternatives will be greatly appreciated.
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
tab <- url %>% read_html %>%
html_node("dogruns_wrapper") %>%
html_text()
View(tab)
I have tried with an XPath and get the same result, and using html_table() instead of html_text() returns the error: no applicable method for 'html_table' applied to an object of class "xml_missing".
As Mislav stated, the table is generated with JavaScript, so your best option is RSelenium.
In addition, if you want to get the table, you can get it with less code if you use html_table().
My try:
# Load packages
library(rvest) #Loading the rvest package
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the webpage
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# define url
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"
# go to website
remDr$navigate(url)
# as it's being loaded with JavaScript and it has a slow load, add a sleep here
Sys.sleep(10) # increase as needed
# get the html object of the webpage
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# read the table in the html_obj
tab <- html_obj %>% html_table() %>% .[[1]]
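When you are done, it is worth closing the browser session (the Selenium server started via shell() has to be stopped separately):
remDr$close()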
Hope it helps! However, always check if webpages allow scraping before doing it!
Check Terms and conditions:
Except for the direct purpose of viewing, printing, accessing or
interacting with the Web Site for your own personal use or as
otherwise indicated on the Web Site or these Terms and Conditions, you
must not copy, reproduce, modify, communicate to the public, adapt,
transfer, distribute, download or store any of the contents of the Web
Site (including Race Information as described below), or incorporate
any part of the Web Site into another web site without GRV’s written
consent.
I would like to programmatically export the records available at this website. To do this manually, I would navigate to the page, click export, and choose the csv.
I tried copying the link from the export button, which (I believe) will only work as long as I have a cookie, so a wget or httr request returns the HTML page instead of the file.
I've found some help in an issue on the rvest GitHub repo, but, like the issue's author, I ultimately can't figure out how to save the cookie in an object and use it in a request.
Here is where I'm at:
library(httr)
library(rvest)
apoc <- html_session("https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx")
headers <- headers(apoc)
GET(url = "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=False&exportFormat=CSV&isExport=True",
add_headers(headers)) # how can I take the output from headers in httr and use it as an argument in GET from httr?
I have checked the robots.txt and this is permissible.
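As an aside on the inline question above: httr::add_headers() accepts a named character vector through its .headers argument, so the headers object could at least be passed along mechanically, for example:
# hypothetical: re-send the response headers as request headers;
# this answers the mechanics of the question, not whether the server will honour them
GET(url = "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=False&exportFormat=CSV&isExport=True",
    add_headers(.headers = unlist(headers)))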
You can get the __VIEWSTATE and __VIEWSTATEGENERATOR hidden fields from the page returned when you GET https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx, and then reuse those values in your subsequent POST query before GETting the csv.
options(stringsAsFactors=FALSE)
library(httr)
library(curl)
library(xml2)
url <- 'https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx'
#get the page and extract the hidden view-state fields
req <- GET(url)
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE","__VIEWSTATEGENERATOR")
viewheaders <- lapply(fields, function(x) {
xml_attr(xml_find_first(req_html, paste0(".//input[@id='",x,"']")), "value")
})
names(viewheaders) <- fields
#post request. you can get the list of form fields using tools like Fiddler
params <- c(viewheaders,
list(
"M$ctl19"="M$UpdatePanel|M$C$csfFilter$btnExport",
"M$C$csfFilter$ddlNameType"="Any",
"M$C$csfFilter$ddlField"="Elections",
"M$C$csfFilter$ddlReportYear"="2017",
"M$C$csfFilter$ddlStatus"="Default",
"M$C$csfFilter$ddlValue"=-1,
"M$C$csfFilter$btnExport"="Export"))
resp <- POST(url, body=params, encode="form")
print(resp$status_code)
resptext <- rawToChar(resp$content)
#writeLines(resptext, "apoc.html")
#get response i.e. download csv
url <- "https://aws.state.ak.us//ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
req <- GET(url, body=params)
read.csv(text=rawToChar(req$content))
You might need to play around with the inputs/code to get what you want precisely.
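If the export comes back as expected, a hypothetical follow-up could parse and save it (the file name is illustrative):
csv_data <- read.csv(text = rawToChar(req$content), stringsAsFactors = FALSE)
dim(csv_data)                                           # quick sanity check
write.csv(csv_data, "crforms_export_2017.csv", row.names = FALSE)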
Here is another similar solution using RCurl:
how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r
I am trying to extract URLs from the website below. The tricky thing here is that the website loads new pages automatically. I did not manage to get the XPath for scraping all URLs, including those on the newly loaded pages; I only manage to get the first 15 URLs (of more than 70). I assume the XPath in the last line (new_results ...) is missing some crucial element to account for the later pages as well. Any ideas? Thank you!
# load packages
library(rvest)
library(httr)
library(RCurl)
library(XML)
library(stringr)
library(xml2)
# aim: download all speeches stored at:
# https://sheikhmohammed.ae/en-us/Speeches
# first, create vector which stores all urls to each single speech
all_links <- character()
new_results <- "/en-us/Speeches"
signatures = system.file("CurlSSL", cainfo = "cacert.pem", package = "RCurl")
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
while(length(new_results) > 0){
new_results <- str_c("https://sheikhmohammed.ae", new_results)
results <- getURL(new_results, cainfo = signatures)
results_tree <- htmlParse(results)
all_links <- c(all_links, xpathSApply(results_tree, "//div[@class='speech-share-board']", xmlGetAttr, "data-url"))
new_results <- xpathSApply(results_tree, "//div[@class='speech-share-board']//after", xmlGetAttr, "data-url")}
# or, alternatively with phantomjs (also here, it loads only first 15 urls):
url <- "https://sheikhmohammed.ae/en-us/Speeches#"
# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# process it with phantomjs
write(readLines(pipe("phantomjs scrape.js", "r")), "scrape.html")
Running the JavaScript for lazy loading in RSelenium, or in Selenium from Python, would be the most elegant approach to the problem. As a less elegant but faster alternative, one can manually change the settings of the JSON query in the network panel of the Firefox developer tools so that it loads not just 15 but more (= all) speeches at once. This worked fine for me, and I was able to extract all the links from the JSON response.
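A rough sketch of that idea in R, assuming you first copy the URL of the JSON request from the network panel; the endpoint and parameter names below are placeholders, not the site's real ones:
library(httr)
library(jsonlite)
# placeholder: paste the real JSON request URL from the network panel here,
# with its page-size parameter increased so all speeches are returned at once
json_url <- "https://sheikhmohammed.ae/PLACEHOLDER/speeches?skip=0&take=500"
resp <- GET(json_url)
speeches <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
# the field holding the speech links depends on the real response structure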
I am in the process of writing a collection of freely-downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK data service. In addition to providing lots of statistics tutorials for these data sets, I also want to automate the download and importation of this survey data. In order to do that, I need to figure out how to programmatically log into this UK data service website.
I have tried lots of different configurations of RCurl and httr to log in, but I'm making a mistake somewhere and I'm stuck. I have tried inspecting the elements as outlined in this post, but the websites jump around too fast in the browser for me to understand what's going on.
This website does require a login and password, but I believe I'm making a mistake before I even get to the login page.
Here's how the website works:
The starting page should be: https://www.esds.ac.uk/secure/UKDSRegister_start.asp
This page will automatically re-direct your web browser to a long URL that starts with: https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah]
(1) For some reason, the SSL certificate does not work on this website. Here's the SO question I posted regarding this. The workaround I've used is simply ignoring the SSL:
library(httr)
set_config( config( ssl.verifypeer = 0L ) )
and then my first command on the starting website is:
z <- GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )
this gives me back a z$url that looks a lot like the https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah] page that my browser also re-directs to.
In the browser, then, you're supposed to type in "uk data archive" and click the continue button. When I do that, it re-directs me to the web page https://shib.data-archive.ac.uk/idp/Authn/UserPassword
I think this is where I'm stuck because I cannot figure out how to have cURL followlocation and land on this website. Note: no username/password has been entered yet.
When I use the httr GET command from the wayf.ukfederation.org.uk page like this:
y <- GET( z$url , query = list( combobox = "https://shib.data-archive.ac.uk/shibboleth-idp" ) )
the y$url string looks a lot like z$url (except it's got a combobox= on the end). Is there any way to get through to this uk data archive authentication page with RCurl or httr?
I can't tell if I'm just overlooking something or if I absolutely must use the SSL certificate described in my previous SO post or what?
(2) At the point I do make it through to that page, I believe the remainder of the code would just be:
values <- list( j_username = "your.username" ,
j_password = "your.password" )
POST( "https://shib.data-archive.ac.uk/idp/Authn/UserPassword" , body = values)
But I guess that page will have to wait...
The relevant data variables returned by the form are action and origin, not combobox. Give action the value "selection" and give origin the value of the relevant entry from the combobox:
y <- GET( z$url, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
> y$url
[1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
Edit
It looks as though the handle pool isn't keeping your session alive correctly, so you need to pass the handles explicitly rather than relying on the automatic pool. Also, for the POST command you need to set multipart=FALSE, as that is the default for HTML forms; the R function has a different default because it is mainly designed for uploading files. So:
y <- GET( handle=z$handle, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
POST(body=values,multipart=FALSE,handle=y$handle)
Response [https://www.esds.ac.uk/]
Status: 200
Content-type: text/html
...snipped...
<title>
Introduction to ESDS
</title>
<meta name="description" content="Introduction to the ESDS, home page" />
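Once the login succeeds, the same handle holds the session cookies, so later requests on the site can reuse it; for example (the path is a placeholder):
# hypothetical follow-up request on the authenticated handle
d <- GET("https://www.esds.ac.uk/", handle = y$handle)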
I think one way to address the "enter your organization" page goes like this:
library(tidyverse)
library(rvest)
library(stringr)
org <- "your_organization"
user <- "your_username"
password <- "your_password"
signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
handle_reset(signin)
# get to org page and enter org
p0 <- html_session(signin) %>%
follow_link("Login")
org_link <- html_nodes(p0, "option") %>%
str_subset(org) %>%
str_match('(?<=\\")[^"]*') %>%
as.character()
f0 <- html_form(p0) %>%
first() %>%
set_values(origin = org_link)
fake_submit_button <- list(name = "submit-btn",
type = "submit",
value = "Continue",
checked = NULL,
disabled = NULL,
readonly = NULL,
required = FALSE)
attr(fake_submit_button, "class") <- "btn-enabled"
f0[["fields"]][["submit"]] <- fake_submit_button
c0 <- cookies(p0)$value
names(c0) <- cookies(p0)$name
p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))
Unfortunately, that doesn't solve the whole problem: step (2) is harder than it looks. I've got more of what I think is a solution posted here: R: use rvest (or httr) to log in to a site requiring cookies. Hopefully someone will help us get the rest of the way.
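If the organisation step goes through, the next page should present the username/password form, which could in principle be filled the same way; a sketch, untested, assuming the j_username / j_password fields from the earlier attempt appear on that page:
f1 <- html_form(p1) %>%
  first() %>%
  set_values(j_username = user, j_password = password)
p2 <- submit_form(session = p1, form = f1)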