Log in to a page in R with httr (moviepilot) - r

I'm trying to get started with web scraping. My goal is to retrieve the movies I have rated on moviepilot.de.
For this I need to access the following page: http://www.moviepilot.de/users/schlusie/rated/movies. Without authentication this is not possible.
I've read that the httr package can do this: create a handle with handle(), log in through it, and then request the desired page with the same handle so the session is reused. It should look something like this:
library(httr)
mp = handle("http://moviepilot.de")
# authentication step
GET(handle=mp, path="/users/schlusie/rated/movies")
This is the login-page: http://www.moviepilot.de/login
Can someone please give me any pointers?
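A minimal sketch of the handle-based flow described in the question, assuming the site uses an ordinary HTML login form. The form field names (username, password), the POST target /login, and the helper names rated_path and fetch_rated_movies are assumptions for illustration and are not taken from moviepilot.de; the real names have to be read from the login page's source.

```r
# Sketch only: field names and the POST target are assumptions.

# Build the path of the rated-movies page for a given profile name
rated_path <- function(profile) sprintf("/users/%s/rated/movies", profile)

fetch_rated_movies <- function(user, password, profile = "schlusie") {
  # One handle for all requests, so cookies are shared between them
  mp <- httr::handle("http://www.moviepilot.de")
  # Load the login page first so the handle picks up any session cookie
  httr::GET(handle = mp, path = "/login")
  # Submit the credentials as a URL-encoded form on the same handle
  login <- httr::POST(handle = mp, path = "/login",
                      body = list(username = user, password = password),
                      encode = "form")
  httr::stop_for_status(login)
  # Cookies set during login are sent again with this request
  httr::GET(handle = mp, path = rated_path(profile))
}
```

If the login form also contains hidden fields (for example a CSRF token), they would need to be read from the login page and added to the body as well.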

Related

Salesforce: Download Reports via URL in R

I'm trying to download the reports available in Salesforce via their URL, e.g.
http://YOURInstance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv
in R.
I have already tried accessing the report with httr's GET, but so far without success: R downloads HTML instead of the desired CSV file. I also tried to implement the approach suggested here:
https://salesforce.stackexchange.com/questions/47414/download-a-report-using-python
The RForcecom package allows interaction via the API, but I could not figure out how to implement the above solution in R.
General GET-Request:
GET("http://YOUR_Instance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv")
I expect the output to be in CSV format, but I receive the report data as HTML source code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3...
<html>
<head>
<meta HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
...
Has anyone encountered the same issue and can provide guidance? Any help is much appreciated. Thanks in advance!
UPDATE: R snippet (still not working):
library(RForcecom)
library(httr)
username='username'
password='password'
instanceURL <- "https://login.salesforce.com/"
session <- rforcecom.login(username, password, instanceURL)
sid=as.character(session['sessionID'])
url='http://YOURInstance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv'
getData=GET(url,add_headers('Content-Type'='application/json','Authorization'=paste0("Bearer ",sid),'X-PrettyPrint'='1'),set_cookies('sid'=sid))
Are you sure you have a valid report id? It doesn't look right (did you just obfuscate it for purposes of this post?). What is in that HTML you're getting, an error message? SF login screen?
What you're doing is effectively screen scraping. This is not a real API and it can break at any time; you should find or build something that properly uses the Salesforce Analytics API. You've been warned.
But if you're after a quick and dirty solution...
You need to pretend you're an authenticated user with a valid session ID. Add a cookie to your GET request.
How to get a valid session ID?
You'd have to log in to SF first (for example, use the SOAP API's login call; I also listed some REST API ideas here: https://stackoverflow.com/a/56034159/313628 ),
or display a user's session ID in an SF formula or Visualforce page and have the user copy-paste it into your app.
Once you have it, add a Cookie header to your GET with the value sid=<session id goes here>
Here's a raw request & response in SoapUI.
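In R, the cookie approach described above could look roughly like the following sketch. The helper names (report_csv_url, sid_cookie, download_report) are hypothetical, and it is an assumption that the session ID obtained through an API login is accepted as a browser cookie by your instance.

```r
# Sketch only: the session ID must come from a real login, e.g. the
# sessionID element returned by rforcecom.login().

# Build the export URL for a report on a given instance
report_csv_url <- function(instance, report_id) {
  sprintf("%s/%s?export=1&enc=UTF-8&xf=csv", sub("/+$", "", instance), report_id)
}

# Format the session ID as the value of a Cookie header
sid_cookie <- function(session_id) paste0("sid=", session_id)

# Request the report while presenting the session ID as a cookie
download_report <- function(instance, report_id, session_id) {
  resp <- httr::GET(report_csv_url(instance, report_id),
                    httr::add_headers(Cookie = sid_cookie(session_id)))
  httr::stop_for_status(resp)
  # Parse the body as CSV; if a login page came back instead,
  # the result will not be the report data
  utils::read.csv(text = httr::content(resp, as = "text", encoding = "UTF-8"),
                  stringsAsFactors = FALSE)
}
```

Checking the Content-Type of the response before parsing is a cheap way to tell a CSV export from an HTML login page.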
I recently struggled with the same issue. There's a magic parameter you need to add to the query: isdtp=p1
So if you try:
http://YOURInstance.my.salesforce.com/012389u13541?export=1&enc=UTF-8&xf=csv&isdtp=p1
it should return the file directly.
In your example, I don't think that you can use the rforcecom session with httr functions as you are trying.
Here is a slightly different way to solve the problem.
Rather than trying to retrieve a report that you already created in Salesforce, why not specify the report in SOQL and use the rforcecom.query function to execute the SOQL from R? That returns the data in a data frame and requires no further data wrangling in R to make it usable.
I use this technique often, and once you get used to the Salesforce API I think it's probably faster and more powerful for most use cases.
Here is a simple function that I use to return selected opportunity data for all opportunities in Salesforce.
getSFOpps <- function(session) {
#Construct SOQL Query
soql <- "SELECT Id,
Name,
AccountId,
Amount,
CurrencyIsoCode,
convertCurrency(Amount) usd_amount,
CloseDate,
CreatedDate,
Region__c,
IsClosed,
IsWon,
LastActivityDate,
LeadSource,
OwnerId,
Probability,
StageName,
Type,
IsDeleted
FROM Opportunity"
#Retrieve Opp information as a tibble (as_tibble() comes from the tibble package)
tibble::as_tibble(RForcecom::rforcecom.query(session, soql))
}
It requires that you pass in a valid session from rforcecom.login, but you seem to have that part working in your code above.
I hope this helps ...
As of v0.2.0, the {salesforcer} R package implements the Salesforce Reports and Dashboards REST API. You can execute and manage reports without needing to write functions from scratch to pull down report data. Below is an example of how to find a report in your Org and then retrieve its data. You can also just use the report Id, which appears in the URL bar when viewing the report in Salesforce.
# install.packages('salesforcer')
library(dplyr, warn.conflicts = FALSE)
library(salesforcer)
# Authenticate using username, password, and security token ...
sf_auth(username = "test@gmail.com",
        password = "{PASSWORD_HERE}",
        security_token = "{SECURITY_TOKEN_HERE}")
# ... or using OAuth 2.0 authentication
sf_auth()
# find a report in your org and run it
all_reports <- sf_query("SELECT Id, Name FROM Report")
this_report_id <- all_reports$Id[1]
results <- sf_run_report(this_report_id)
results

Scraping login protected website with a challenge form?

I'm trying to do some web scraping from steamspy.com, specifically the total playtime hours for a certain game. That info is behind the site's login wall, so I've been trying to figure out how to get R past it for HTML mining.
I tried this method for passing login credentials via POST(), but it doesn't seem to work. The login handler in that example used a plain POST, whereas the steamspy source code seems to use a challenge form, and I wasn't sure how to proceed in R.
My attempt thus far looks like this:
handle <- handle("http://steamspy.com")
path <- "/login/"
login <- list(
jschl_vc = "bc4e...",
pass = "148..."
)
response <- POST(handle = handle, path = path, body = login)
I found the values for jschl_vc and pass by inspecting the source code after I logged in. The code above doesn't work and gives me:
Error in curl::curl_fetch_memory(url, handle = handle) :
  Failure when receiving data from the peer
probably because I'm trying to POST to a challenge form. Is there a way to proceed that I'm missing?

Rfacebook Packages getpage() command only retrieving a few posts from Facebook Pages

I recently tried the Rfacebook package by pablobarbera, which works quite well. I am having one slight issue, for which I am sharing the code.
# Option 1: release version from CRAN
install.packages("Rfacebook")
# Option 2: development version from GitHub (repo given as "user/repo")
library(devtools)
install_github("pablobarbera/Rfacebook", subdir = "Rfacebook")
library(Rfacebook)
# token generated here: https://developers.facebook.com/tools/explorer
token <- "**********"
page <- getPage("DarazOnlineShopping", token, n = 1000)
The getPage command works, but it only retrieves 14 records from the Facebook page I used. In the example in pablobarbera's original post he retrieved all the posts from "Humans of New York", but when I tried the same command, Facebook asked me to reduce the number of posts, and I barely managed to get 20. This is the command Pablo Barberá used:
page <- getPage("humansofnewyork", token, n = 5000)
I thought Facebook was withholding the data because I was using a temporary access token, but I completed the whole Facebook OAuth process and got the same result.
Can somebody look into this and tell me why this is happening?
The getPage() command looks fine to me; I manually counted 14 posts (including photos) on the main page. It could be that Daraz Online Shopping has multiple pages and that the page name you are using only returns results from the main page, when (I assume) you want results from all of them.
getPage() also accepts page IDs. You might want to collect a list of IDs associated with Daraz Online Shopping, loop through them, call getPage() for each, and combine the outputs to get the results you need.
To find these IDs you could write a scraper (or search manually) that views each page's source and looks for the unique page ID. Searching for content="fb://page/?id= will highlight the location of the page ID in the source code.
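The two steps suggested above (pull the page ID out of the page source, then loop over the IDs) could be sketched as follows. The helper names extract_page_id and get_all_pages are hypothetical, and it is an assumption that the data frames returned by getPage() for different pages have the same columns and can be stacked with rbind().

```r
# Sketch only: helper names are hypothetical.

# Find the numeric page ID after "fb://page/?id=" in a page's HTML source
extract_page_id <- function(html) {
  html <- paste(html, collapse = "\n")  # accept readLines() output as well
  m <- regmatches(html, regexpr("fb://page/\\?id=[0-9]+", html))
  if (length(m) == 0) return(NA_character_)
  sub(".*id=", "", m)
}

# Call getPage() once per ID and stack the resulting data frames
get_all_pages <- function(page_ids, token, n = 1000) {
  posts <- lapply(page_ids, function(id) Rfacebook::getPage(id, token, n = n))
  do.call(rbind, posts)
}
```

Keeping the ID extraction separate from the API calls makes it easy to check the regular expression on saved page sources before spending any API requests.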

CRAN R 'httr'/'RCurl' packages - use cookies to load a page

I am a beginner in R (and in coding in general) and I am trying to load a bunch of prices from a website in Brazil:
http://www.muffatosupermercados.com.br/Home.aspx
When I open the page I am prompted with a form to choose a city; I want "CURITIBA".
Inspecting the cookies in Chrome I get this:
Name: CidadeSelecionada
Content: CidadeId=55298&NomeCidade=CURITIBA&FilialId=53
My code is to get the prices from this link:
"http://www.muffatosupermercados.com.br/CategoriaProduto.aspx?Page=1&c=2"
library(httr)
a1 <- "http://www.muffatosupermercados.com.br/CategoriaProduto.aspx?Page=1&c=2"
b2 <- GET(a1,set_cookies(.CidadeSelecionada = c(CidadeId=55298,NomeCidade="CURITIBA",FilialId=53)))
cookies(b2)
From this the only response I get is the session Id cookie:
$ASP.NET_SessionId
[1] "o5wlycpnjbfraislczix1dj4"
and when I try to load the page I only get the page behind the form, which is empty:
html <- content(b2,"text")
writeBin(html, "myfile.txt")
Does anyone have an idea how to solve this? I also tried using RCurl and posting the form data, with no luck...
Here is a link to another thread of mine trying to do this in a different way:
RCurl - submit a form and load a page
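One thing that may be worth trying: set_cookies() expects each cookie as a name = value pair, so the .CidadeSelecionada = c(...) argument in the code above does not produce the cookie Chrome shows. The sketch below instead sends the cookie as a raw Cookie header, with the value written exactly as in the browser. The helper names city_cookie and get_category_page are hypothetical, and whether the site accepts a hand-set cookie without first submitting the city form is an assumption.

```r
# Sketch only: helper names are hypothetical.

# Rebuild the cookie exactly as it appears in the browser:
# CidadeSelecionada=CidadeId=...&NomeCidade=...&FilialId=...
city_cookie <- function(city_id, city_name, branch_id) {
  sprintf("CidadeSelecionada=CidadeId=%s&NomeCidade=%s&FilialId=%s",
          city_id, city_name, branch_id)
}

# Request a page with the cookie attached and return the body as text
get_category_page <- function(url, cookie) {
  resp <- httr::GET(url, httr::add_headers(Cookie = cookie))
  httr::stop_for_status(resp)
  httr::content(resp, as = "text")
}
```

A call would then look like get_category_page(a1, city_cookie(55298, "CURITIBA", 53)), with a1 being the URL from the question.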

Reading HTML tables in R if login and other previous actions are required

I am using the XML package to read HTML tables from websites.
Actually I'm trying to read a table from a local address, something like http://10.35.0.9:8080/....
To get this table I usually have to log in by typing a login and password.
Therefore, when I run:
library(XML)
acsi.url <- 'http://10.35.0.9:8080/...'
acsi.df <- readHTMLTable(acsi.url, header = T, stringsAsFactors = F)
acsi.df
I see that acsi.df isn't my table but the login page.
How can I tell R to enter the login and password and log in before reading the table?
There is no general solution; you have to analyze the details of your login procedure, but the RCurl package and the following link should help:
Login to WordPress using RCurl
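As a rough illustration of the RCurl approach, here is a minimal sketch that logs in with a form POST and then reads the table using the same curl handle. The function name read_table_after_login and the form field names (login, password) are assumptions; the actual login URL and field names depend on the site and have to be taken from its login form.

```r
# Sketch only: field names and the function name are assumptions.
read_table_after_login <- function(login_url, table_url, user, password) {
  # An empty cookiefile switches on curl's in-memory cookie handling,
  # so the session cookie from the login is kept on this handle
  curl <- RCurl::getCurlHandle(cookiefile = "", followlocation = TRUE)
  # Submit the login form as a URL-encoded POST
  RCurl::postForm(login_url, login = user, password = password,
                  style = "POST", curl = curl)
  # Fetch the protected page with the same handle (and therefore the same cookies)
  html <- RCurl::getURL(table_url, curl = curl)
  # Parse the downloaded HTML and extract all tables as data frames
  doc <- XML::htmlParse(html, asText = TRUE)
  XML::readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
}
```

The result is a list of data frames, one per table on the page, so the table of interest still has to be picked out by position or name.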
