Rcrawler - How to crawl account/password protected sites?

I am trying to crawl and scrape a website's tables. I have an account with the website, and I found out that Rcrawler could help me get parts of the table based on specific keywords, etc. The problem is that the GitHub page makes no mention of how to crawl a site with account/password protection.
An example of signing in might look like this:
login <- list(username = "username", password = "password")
Do you have any idea if Rcrawler has this functionality? For example something like:
Rcrawler(Website = "http://www.glofile.com",
         list(username = "username", password = "password"),
         no_cores = 4, no_conn = 4,
         ExtractCSSPat = c(".entry-title", ".entry-content"),
         PatternsNames = c("Title", "Content"))
I'm confident my code above is wrong, but I hope it gives you an idea of what I want to do.

To crawl or scrape password-protected websites in R (more precisely, sites behind HTML-based authentication), you need a web driver to simulate a login session. Fortunately, this is possible since Rcrawler v0.1.9, which implements the PhantomJS web driver (a browser without a graphical interface).
In the following example we will try to log in to a blog website.
library(Rcrawler)
Download and install the web driver:
install_browser()
Run the browser session:
br <- run_browser()
If you get an error, disable your antivirus or allow the program in your system settings.
Run an automated login action; it returns a logged-in session if successful:
br <- LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
                   LoginCredentials = c('demo', 'rc#pass#r'),
                   cssLoginFields = c('#user_login', '#user_pass'),
                   cssLoginButton = '#wp-submit')
Finally, if you already know the private pages you want to scrape/download, use:
DATA <- ContentScraper(..., browser = br)
Or, simply crawl/scrape/download all pages:
Rcrawler(Website = "http://glofile.com/", no_cores = 1, no_conn = 1, LoggedSession = br, ...)
Don't use multiple parallel no_cores/no_conn, as many websites reject multiple simultaneous sessions from one user.
Stay legit and honor robots.txt by setting Obeyrobots = TRUE
You can also access the browser session functions, for example:
br$session$getUrl()
br$session$getTitle()
br$session$takeScreenshot(file = "image.png")
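For example, reusing the logged-in session br from above, a minimal sketch of scraping one private page and then closing the browser might look like this (the post URL below is an assumption; the CSS selectors are the ones from the question):
# Scrape a single page through the logged-in session
# (the URL below is illustrative; point it at a real private page)
DATA <- ContentScraper(Url = "http://glofile.com/some-private-post/",
                       CssPatterns = c(".entry-title", ".entry-content"),
                       PatternsName = c("Title", "Content"),
                       browser = br)
# Stop the phantomjs process when you are done
stop_browser(br)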

Related

Scraping login protected website with a challenge form?

I'm trying to do some web scraping from steamspy.com, specifically the total playtime hours for a certain game. That info is behind the login wall for the site, so I've been trying to figure out how to get R past it for html mining.
I tried this method for passing login credentials via POST() but it doesn't seem to work. I noticed that the login handler for that example used POST, whereas looking at the source code for steamspy it seems to use a challenge form and I wasn't sure how to proceed with R.
My attempt thus far looks like this:
library(httr)

handle <- handle("http://steamspy.com")
path <- "/login/"
login <- list(
  jschl_vc = "bc4e...",
  pass = "148..."
)
response <- POST(handle = handle, path = path, body = login)
I found the values for the jschl_vc and pass from inspecting the source code after I logged in. The code above doesn't work and gives me:
Error in curl::curl_fetch_memory(url, handle = handle) : Failure
when receiving data from the peer
probably since I'm trying to POST to a challenge form. Is there a way that I'm missing to proceed?
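The jschl_vc field is a strong hint that the login sits behind a JavaScript challenge (such as Cloudflare's), which a plain POST() cannot satisfy because the token is computed by JavaScript in the browser. One option, along the lines of the Rcrawler answer at the top of this page, is to drive a headless browser through the login; the login URL and CSS selectors in the sketch below are placeholders that would need to be replaced after inspecting steamspy's actual login form:
library(Rcrawler)
install_browser()        # one-time download of the phantomjs binary
br <- run_browser()
# The URL and CSS selectors below are hypothetical; inspect the real login form to find them
br <- LoginSession(Browser = br, LoginURL = "https://steamspy.com/login/",
                   LoginCredentials = c("your_username", "your_password"),
                   cssLoginFields = c("#username", "#password"),
                   cssLoginButton = "#login-button")
br$session$getUrl()      # verify the session ended up on a logged-in page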

How can I allow new R users to send information to a Google Form?

How can I allow new R users to send information to a Google Form? (RSelenium requires a bit of set up, at least for headless browsing, so it's not the best candidate IMO but I may be missing something that makes it the best choice).
I have some new R users I want to get responses from interactively and send to a secure location. I have chosen Google Forms to pass the information to, as it allows one-way sends of the info and doesn't give the user access to the spreadsheet that is created from the form.
Here's the URL of the form:
url <- "https://docs.google.com/forms/d/1tz2RPftOLRCQrGSvgJTRELrd9sdIrSZ_kxfoFdHiqD4/viewform"
To give context here's how I'm using R to interact with the user:
question <- function(message, opts = c("Yes", "No")){
  message(message)
  ans <- menu(opts)
  if (ans == "2") FALSE else TRUE
}
question("Was this information helpful?")
I want to then send that TRUE/FALSE to the Google form above. How can I send a response to the Google Form above from within R in a way that I can embed in code the user will interact with and doesn't require difficult set up by the user?
Add on R packages are fine if they accomplish the task.
You can send a POST query. Here is an example using the httr package:
library(httr)

send_response <- function(response){
  form_url <- "https://docs.google.com/forms/d/1tz2RPftOLRCQrGSvgJTRELrd9sdIrSZ_kxfoFdHiqD4/formResponse"
  POST(form_url,
       query = list(`entry.1651773982` = response))
}
Then you can call it:
send_response(question("Was this information helpful?"))
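If you want a quick sanity check that the submission reached the form endpoint, you can inspect the returned httr response object (a 200 status does not guarantee Google accepted the value, but it catches obvious failures):
resp <- send_response(question("Was this information helpful?"))
status_code(resp)    # 200 means the POST reached the formResponse endpoint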

How to change Google Accounts using library(googlesheets)?

I'm experiencing some problems. I use an R script successfully on one machine, but the same script on a different computer causes problems:
# Here I register the sheet
browser <- gs_title("Funnel Daily")
browser <- gs_edit_cells(browser, ws = "Classic Browser", input = ClassicBrowser, anchor = "A1",
                         byrow = FALSE, col_names = NULL, trim = FALSE, verbose = TRUE)
Auto-refreshing stale OAuth token.
Error in gs_lookup(., "sheet_title", verbose) :
"Funnel Daily" doesn't match sheet_title of any sheet returned by gs_ls() (which should reflect user's Google Sheets home screen).
> browser <- gs_title("Funnel Daily")
Error in gs_lookup(., "sheet_title", verbose) :
"Funnel Daily" doesn't match sheet_title of any sheet returned by gs_ls() (which should reflect user's Google Sheets home screen).`
If I run gs_ls() I get a message about another Google account which I also use frequently. So is there a way, maybe via a token, to differentiate between accounts, or how else can I solve this issue? In other words, how can I force googlesheets to access a specific account?
Currently I'm using the token of the account which corresponds to Funnel Daily. The only possibility I can think of that may have caused the issue is that the browser authentication was done with the account that does not include Funnel Daily; I just confused them.
I tried removing googlesheets as well as httr with all dependencies, but after loading library(googlesheets) and calling gs_user(), googlesheets still refers to the account that doesn't include the specific sheet.
Include your credentials and confirm the browser authentication via your Funnel Daily Google Account:
options(googlesheets.client_id = "",
        googlesheets.client_secret = "",
        googlesheets.httr_oauth_cache = FALSE)

gs_auth(token = NULL, new_user = FALSE,
        key = getOption("googlesheets.client_id"),
        secret = getOption("googlesheets.client_secret"),
        cache = getOption("googlesheets.httr_oauth_cache"), verbose = TRUE)
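Alternatively, assuming the stale token lives in the default .httr-oauth cache file in your working directory, you can wipe that cache and force a fresh browser authentication with the account that owns Funnel Daily:
# Remove the cached OAuth token (if present) and re-authenticate
if (file.exists(".httr-oauth")) file.remove(".httr-oauth")
gs_auth(new_user = TRUE)   # opens the browser so you can pick the correct Google account
gs_user()                  # confirm which account googlesheets is now using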

Staying Logged In Using Requests and Python

I am trying to log onto a website using Python and requests. I'm pretty sure I am logging on properly. The next part is that I go to a different page and try to download a file from it. However, in order to download the file you have to be logged in, and when I go to download it, the site redirects me to the login page saying I haven't logged in. I am stuck and don't know what to do! By the way, the website is grabcad.com; what I'm basically trying to do is press the "download all" button featured on a page such as
http://grabcad.com/library/apple-ipod-touch-5th-gen-1
import requests

payload = {'member[email]': 'username', 'member[password]': 'pass'}

with requests.Session() as s:
    rObject = s.post('http://www.grabcad.com/login', data=payload)
    cookies = rObject.cookies
    # downloadUrl is something I obtain earlier and I know it's correct;
    # it's the URL for when you press the downloadAll button
    rObject = s.get('http://www.grabcad.com' + downloadUrl, cookies=cookies)

path = 'C:\\User\\Desktop\\filename'
with open(path, 'wb') as f:
    for chunk in rObject.iter_content():
        f.write(chunk)
So I took an altogether different route to solve the problem: I simply used mechanize, which is an automated browser tool for Python.
# How to use mechanize to log in, specifically for grabcad
import mechanize

b = mechanize.Browser()
b.open('http://grabcad.com/login')
b.form = list(b.forms())[1]
control = b.form.find_control("member[email]")
control2 = b.form.find_control("member[password]")
control.value = 'username'
control2.value = 'pass'
b.submit()

# Download part
path = 'C:\\User\\Desktop\\filename'
b.retrieve('https://www.grabcad.com' + downloadUrl, path)
# downloadUrl is obtained earlier and is simply the URL for the download
How are you ensuring that you're logged in correctly? I would print out the HTML after sending that POST request from the session object and make sure it isn't a login page or an invalid-password page. Cookies are automatically persistent across requests made on the session object, so I believe the initial login isn't successful (http://docs.python-requests.org/en/latest/user/advanced/#session-objects).
Personally, I would use Selenium for this, though.
I have correctly logged into grabcad with the following code:
import requests
s = requests.session()
payload = {'member[email]': 'yourEmail', 'member[password]': 'yourPassword'}
p = s.post('https://grabcad.com/login', data=payload) # Ensure you're posting to HTTPS

Reading HTML tables in R if login and other previous actions are required

I am using the XML package to read HTML tables from websites.
Actually I'm trying to read a table from a local address, something like http://10.35.0.9:8080/....
To get this table I usually have to log in by typing a login and password.
Therefore, when I run:
library(XML)
acsi.url <- 'http://10.35.0.9:8080/...'
acsi.df <- readHTMLTable(acsi.url, header = T, stringsAsFactors = F)
acsi.df
I see acsi.df isn't my table but the login page.
How can I tell R to enter the login and password and log in before reading the table?
There is no general solution; you have to analyze the details of your login procedure. But the RCurl package and the following link should help:
Login to WordPress using RCurl
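As a rough sketch of the RCurl route (the login URL and the form field names log and pwd are assumptions; inspect your intranet's login page to find the real ones), the idea is to keep one cookie-enabled curl handle, POST the credentials through it, and then fetch the table page with the same handle:
library(RCurl)
library(XML)

# One curl handle with an in-memory cookie jar, reused for every request
curl <- getCurlHandle(cookiefile = "", followlocation = TRUE)

# Login URL and field names below are placeholders for your site's form
postForm("http://10.35.0.9:8080/login",
         log = "myusername",
         pwd = "mypassword",
         style = "POST",
         curl = curl)

# The handle now carries the session cookie, so the table page returns the real content
html <- getURL("http://10.35.0.9:8080/...", curl = curl)   # same elided URL as in the question
doc  <- htmlParse(html, asText = TRUE)
acsi.df <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)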
