CRAN R 'httr'/'RCurl' packages - use cookies to load a page

I am a beginner in R coding (or in coding in general) and I am trying to load a bunch of prices from a website in Brazil.
http://www.muffatosupermercados.com.br/Home.aspx
When I open the page I am prompted with a form to choose a city, where I want "CURITIBA".
Inspecting the cookies in Chrome I see this:
Name: CidadeSelecionada
Content: CidadeId=55298&NomeCidade=CURITIBA&FilialId=53
My code tries to get the prices from this link:
"http://www.muffatosupermercados.com.br/CategoriaProduto.aspx?Page=1&c=2"
library(httr)
a1 <- "http://www.muffatosupermercados.com.br/CategoriaProduto.aspx?Page=1&c=2"
b2 <- GET(a1,set_cookies(.CidadeSelecionada = c(CidadeId=55298,NomeCidade="CURITIBA",FilialId=53)))
cookies(b2)
The only cookie I get back in the response is the session ID:
$ASP.NET_SessionId
[1] "o5wlycpnjbfraislczix1dj4"
and when I try to save the page I only get the page behind the form, which is empty:
html <- content(b2, "text")
writeLines(html, "myfile.txt")
Does anyone have an idea of how to solve this? I also tried using RCurl and posting the form data, with no luck...
There is a link to another thread of mine trying to do this in a different way:
RCurl - submit a form and load a page
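One thing worth trying (a sketch only, not tested against this site): ASP.NET stores the whole CidadeId=...&NomeCidade=... payload as a single cookie value, so it may need to be passed to set_cookies() as one pre-encoded string rather than a named vector:

```r
library(httr)

url <- "http://www.muffatosupermercados.com.br/CategoriaProduto.aspx?Page=1&c=2"

# The site keeps the whole city selection in one cookie, so send it as a
# single string; URLencode() escapes the "=" and "&" inside the value:
valor <- URLencode("CidadeId=55298&NomeCidade=CURITIBA&FilialId=53", reserved = TRUE)
b2 <- GET(url, set_cookies(CidadeSelecionada = valor))

html <- content(b2, "text")
writeLines(html, "myfile.txt")
```

If the server still returns the empty page, it may also require the ASP.NET_SessionId cookie from a prior visit to Home.aspx, in which case doing a first GET on the home page and reusing that session's cookies in the second request could help.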

Related

Web Scraping - Using Functions on a Secure Site (rvest)

I'm attempting to scrape a site that requires a form to be submitted before it returns results. I'm struggling to understand how form submission works, let alone the syntax and other details.
I have been looking at code posted by other people, many of whom use rvest or RSelenium. I can't seem to get my form to submit properly, and I'm not sure how to go about extracting the results into R once it does submit.
Now, I can't share the specific site that I'm working from, but I've found an analog:
https://gapines.org/eg/opac/advanced
For example, I might need to select "Books" under "Item Type," and "Braille" under "Item Form." Once the form is submitted, I would need to capture that results page.
Copying from other peoples' code, I have the following:
library(rvest)
url <- "https://gapines.org/eg/opac/advanced"
my_session <- html_session(url) # create a persistent session
unfilled_forms <- html_form(my_session)
login_form <- unfilled_forms[[2]] # select the form you need to fill
filled_form <- set_values(login_form,'fi:item_type'="Books",'fi:item_form'="Braille")
login_session <- submit_form(my_session,filled_form)
When I run the submit_form(), it says "Submitting with 'NULL'."
Once it is submitted, I also want to extract the results, but not sure how to even begin.
Thanks in advance.
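The "Submitting with 'NULL'" message is just rvest reporting that the form has no named submit button; it is usually harmless. Once submit_form() returns, the result is itself a session that html_nodes() can be applied to. A sketch (the CSS selector is a guess, and select inputs generally need the underlying `<option value="...">` code rather than the visible label, so check the page source):

```r
library(rvest)

url <- "https://gapines.org/eg/opac/advanced"
my_session <- html_session(url)
search_form <- html_form(my_session)[[2]]

# NOTE: "Books"/"Braille" may need to be replaced by the underlying
# <option value="..."> codes taken from the page source:
filled_form <- set_values(search_form,
                          `fi:item_type` = "Books",
                          `fi:item_form` = "Braille")

results <- submit_form(my_session, filled_form)

# The selector below is hypothetical -- inspect the results page to find
# the element that wraps each record title:
titles <- results %>%
  html_nodes(".record_title") %>%
  html_text(trim = TRUE)
```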

Setting cookies/submitting forms with rvest/httr in R: problems setting local store for web scraping homedepot.com

I am setting up an R script to scrape data from homedepot.com. It is going fine, except that I would like to scrape the stock levels for products, which requires setting the local store. I have tried a few ways to do this using rvest without success. How can I set the local store on homedepot.com?
I have found these related questions that have not led me to a solution:
(R language ) How to make a click on webpage using rvest or rcurl
Submit form with no submit button in rvest
How to properly set cookies to get URL content using httr
More info:
- the store location code seems to be stored in a cookie called THD-LOC-STORE, with a 4-digit store ID. I have been unsuccessful in setting this cookie:
library("rvest")
library("httr")
# try to set cookie in site with store ID:
session <- html_session("http://www.homedepot.com", set_cookies('THD-LOC-STORE'='2679'))
# if this worked, it would show the store name instead of "Select a Store":
storefinder <- session %>% read_html() %>% html_nodes(".headerStoreFinder") %>% html_text() %>% gsub("\\t","",.)
storefinder
cookies(session)
I also thought about using submit_form() in rvest, but the buttons to select a store are run by javascript and there are no SUBMIT buttons to choose.
Concerning your possible option "I also thought about using submit_form() in rvest, but the buttons to select a store are run by javascript and there are no SUBMIT buttons to choose": I posted an answer to the question "Submit form with no submit button in rvest" which might provide the solution for you.
In brief, you can inject a submit button into your version of the code and then submit that. Details of how to do that are in the linked post.
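For completeness, the workaround from that post looks roughly like this (a sketch; the list layout mimics the internal structure of a form field in the rvest versions of that era, so it may need adjusting for your version):

```r
library(rvest)

session <- html_session("http://www.homedepot.com")
store_form <- html_form(session)[[1]]  # pick whichever form sets the store

# Inject a fake submit field so submit_form() has something to "click":
fake_submit <- structure(
  list(name = "submit", type = "submit", value = "Submit",
       checked = NULL, disabled = FALSE, readonly = FALSE, required = FALSE),
  class = "input")
store_form$fields[["submit"]] <- fake_submit

result <- submit_form(session, store_form)
```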

Rfacebook Packages getpage() command only retrieving a few posts from Facebook Pages

I recently tried Rfacebook package by pablobarbera, which works quite well. I am having this slight issue, for which I am sharing the code.
install.packages("Rfacebook") # from CRAN
library(devtools)
install_github("Rfacebook", "pablobarbera", subdir = "Rfacebook")
library(Rfacebook)
# token generated here: https://developers.facebook.com/tools/explorer
token <- "**********"
page <- getPage("DarazOnlineShopping", token, n = 1000)
The getPage command works, but it only retrieves 14 records from the Facebook page I used in the command. In the example used by pablobarbera in the original post he retrieved all the posts from "Humans of New York", but when I tried the same command, Facebook asked me to reduce the number of posts, and I hardly managed to get 20 posts. This is the command used by Pablo Barbera:
page <- getPage("humansofnewyork", token, n = 5000)
I thought Facebook was not giving me the required data because I was using a temporary access token, but I completed the whole Facebook OAuth process and got the same result.
Can somebody look into this and tell me why this is happening?
The getPage() command looks fine to me; I manually counted 14 posts (including photos) on the main page. It could be that Daraz Online Shopping has multiple pages and that the page name you are using only returns results from the main page, when (I assume) you want results from all of them.
getPage() also accepts page IDs. You might want to collect a list of IDs associated with Daraz Online Shopping, loop through and call each of them and combine the outputs to get the results you need.
To find these IDs you could write a scraper (or search for them all manually) that views the page source and looks for the unique page ID. Searching for content="fb://page/?id= will highlight the location of the page ID in the source code.
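A sketch of that loop (the IDs below are placeholders; substitute the real page IDs you find in the source):

```r
library(Rfacebook)

# Hypothetical page IDs collected from the pages' source code:
page_ids <- c("111111111111111", "222222222222222")

# Call getPage() for each ID and stack the resulting data frames:
posts <- lapply(page_ids, function(id) getPage(id, token, n = 1000))
all_posts <- do.call(rbind, posts)
```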

login in to page in R httr moviepilot

I'm trying to teach myself web scraping. My goal is to get my personal rated movies from the moviepilot.de page.
For this I need to access the following page: http://www.moviepilot.de/users/schlusie/rated/movies. But without authentication it is not possible.
I've read that the httr package can do something like this: create a handle with handle(), log in through the homepage with your credentials, and then request the desired page. It should look like this:
library(httr)
mp = handle("http://moviepilot.de")
# authentication step
GET(handle=mp, path="/users/schlusie/rated/movies")
This is the login-page: http://www.moviepilot.de/login
Can someone please give me any pointers?
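A possible shape for the authentication step, using httr (the form-field names here are assumptions; inspect the `<input name="...">` attributes on http://www.moviepilot.de/login and adjust):

```r
library(httr)

mp <- handle("http://www.moviepilot.de")

# Field names are guesses; check the login form's actual input names:
login <- POST(handle = mp, path = "/login",
              body = list(`user[email]`    = "you@example.com",
                          `user[password]` = "your-password"),
              encode = "form")

# httr reuses the handle's cookie jar, so this request should now be
# authenticated:
rated <- GET(handle = mp, path = "/users/schlusie/rated/movies")
html <- content(rated, "text")
```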

Reading HTML tables in R if login and other previous actions are required

I am using XML package to read HTML tables from web sites.
Actually I'm trying to read a table from a local address, something like http://10.35.0.9:8080/....
To get this table I usually have to log in by typing a username and password.
Therefore, when I run:
library(XML)
acsi.url <- 'http://10.35.0.9:8080/...'
acsi.df <- readHTMLTable(acsi.url, header = T, stringsAsFactors = F)
acsi.df
I see acsi.df isn't my table but the login page.
How can I tell R to enter the login and password and log in before reading the table?
There is no general solution; you have to analyze the details of your login procedure, but the RCurl package and the following link should help:
Login to WordPress using RCurl
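As a rough illustration of the usual pattern with httr (the login path and field names are assumptions; read them off your login form): log in first, then feed the authenticated response body to readHTMLTable():

```r
library(httr)
library(XML)

# Hypothetical login endpoint and field names -- inspect the login page:
login <- POST("http://10.35.0.9:8080/login",
              body = list(username = "me", password = "secret"),
              encode = "form")

acsi.url <- "http://10.35.0.9:8080/..."  # same (elided) table URL as above
page <- GET(acsi.url)                    # cookies from the login are reused

acsi.df <- readHTMLTable(content(page, "text"),
                         header = TRUE, stringsAsFactors = FALSE)
```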
