Error 500 while webscraping with rvest - web-scraping

I am trying to webscrape with the code below, but am getting the following warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Internal Server Error (HTTP 500)
What am I doing wrong?
library(rvest)
sisben <- html_session("https://wssisbenconsulta.sisben.gov.co/dnp_sisbenconsulta/dnp_sisben_consulta.aspx")
form <- html_form(sisben)[[1]]
fillform <- set_values(form,"ddlTipoDocumento" = "Cédula de Ciudadanía", "tboxNumeroDocumento" = "1234")
sis <- submit_form(session=sisben, form=fillform)

What data exactly do you want to scrape? To me the code looks as if you only interact with the page (fill in a form and submit it), but I don't see any rvest code that actually scrapes data.
Regarding your error:
Looking at the HTML source, it appears you submit only the label of "Tipo de Documento", not the corresponding internal value (which is numeric):
<option value="-1">Seleccione...</option>
<option value="1">Cédula de Ciudadanía</option>
<option value="3">Cédula de Extranjería</option>
<option selected="selected" value="4">Registro Civil</option>
<option value="2">Tarjeta de Identidad</option>
I didn't receive an error using the option value as input:
fillform <- set_values(form,"ddlTipoDocumento" = "1", "tboxNumeroDocumento" = "1234")
sis <- submit_form(session=sisben, form=fillform)
leads to the output:
Submitting with 'tboxNumeroDocumento'
Maybe this is already what you are looking for?
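To make that fix less error-prone, a small lookup table built from the <option> list above can map the visible labels to the internal values the server expects. This is a hypothetical helper, not part of the answer's original code:

```r
# Map the visible labels to the internal option values taken from the
# <select> element shown above (hypothetical helper)
doc_types <- c("Cédula de Ciudadanía" = "1",
               "Cédula de Extranjería" = "3",
               "Registro Civil"        = "4",
               "Tarjeta de Identidad"  = "2")

# Then fill the form with the mapped value instead of the label:
# fillform <- set_values(form,
#                        "ddlTipoDocumento"    = doc_types[["Cédula de Ciudadanía"]],
#                        "tboxNumeroDocumento" = "1234")
doc_types[["Cédula de Ciudadanía"]]  # "1"
```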

Related

Filling and submit search with rvest in R

I am learning how to fill and submit forms with rvest in R, and I got stuck when trying to search for the ggplot tag on Stack Overflow. This is my code:
url<-"https://stackoverflow.com/questions"
(session<-html_session("https://stackoverflow.com/questions"))
(form<-html_form(session)[[2]])
(filled_form<-set_values(form, tagQuery = "ggplot"))
searched<-submit_form(session, filled_form)
I've got the error:
Submitting with '<unnamed>'
Error in parse_url(url) : length(url) == 1 is not TRUE
Following this question (rvest error on form submission) I tried several things to solve it, but I couldn't:
filled_form$fields[[13]]$name<-"submit"
filled_form$fields[[14]]$name<-"submit"
filled_form$fields[[13]]$type<-"button"
filled_form$fields[[14]]$type<-"button"
Any help, guys?
The search query is in html_form(session)[[1]]
Since there is no submit button in this form:
<form> 'search' (GET /search)
<input text> 'q':
this workaround seems to work:
<form> 'search' (GET /search)
<input text> 'q':
<input submit> '':
Giving the following code sequence :
library(rvest)
url<-"https://stackoverflow.com/questions"
(session<-html_session("https://stackoverflow.com/questions"))
(form<-html_form(session)[[1]])
fake_submit_button <- list(name     = NULL,
                           type     = "submit",
                           value    = NULL,
                           checked  = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"
form[["fields"]][["submit"]] <- fake_submit_button
(filled_form<-set_values(form, q = "ggplot"))
searched<-submit_form(session, filled_form)
The problem is that the reply has a captcha:
searched$url
[1] "https://stackoverflow.com/nocaptcha?s=7291e7e6-9b8b-4b5f-bd1c-0f6890c23573"
You won't be able to handle this with rvest, but after clicking manually on the captcha you get the query you're looking for:
https://stackoverflow.com/search?q=ggplot
Probably much easier to use my other answer with:
read_html(paste0('https://stackoverflow.com/search?tab=newest&q=',search))
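If you go that route, it is safer to percent-encode the search term first, since queries with spaces or special characters would otherwise break the URL. A sketch using base R's URLencode() (the search term here is just an example):

```r
# Build the search URL directly instead of submitting the form;
# URLencode(reserved = TRUE) percent-encodes spaces and other reserved characters
search <- "ggplot facet"
url <- paste0("https://stackoverflow.com/search?tab=newest&q=",
              URLencode(search, reserved = TRUE))
url  # "https://stackoverflow.com/search?tab=newest&q=ggplot%20facet"
# then: read_html(url) as above
```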

How to keep the original format of values after unlist(lapply(mydata, function(x) {x$getElementText()}))

I'm trying to keep the original values format. The data format is:
<option value="xxxxx ">xxxx </option>
<option value="yyyy ">yyyy </option>
<option value="zzzzzzz ">zzzzzzz </option>
...
But, I'm getting this after using
unlist(lapply(mydata, function(x) {x$getElementText()}))
head(mydata)
[1] "xxxxx" "yyyy" "zzzzzzz"
What I need:
head(mydata)
[1] "xxxxx " "yyyy " "zzzzzzz "
I appreciate any help
The getElementText method normalizes the text as it would appear in the browser. If you have a run of ordinary spaces in an HTML page, pretty much all browsers will render it as a single space and trim trailing whitespace. However, you can get the underlying value from
x$getElementAttribute('textContent')
which was found here
Or in your case if you want the value attribute from the option tag
x$getElementAttribute('value')
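The difference is easy to reproduce without a browser: the normalized text is essentially the raw string with surrounding whitespace stripped. A minimal illustration in base R (not RSelenium itself), using stand-in values from the question:

```r
# Raw values as they appear in the HTML source (note the trailing spaces),
# which is what getElementAttribute() preserves
raw_values <- c("xxxxx ", "yyyy ", "zzzzzzz ")

# getElementText() returns whitespace-normalized text, roughly equivalent to:
normalized <- trimws(raw_values)

normalized   # "xxxxx"  "yyyy"  "zzzzzzz"
```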

Setting cookies and reading html

I need to read a page's source code for a research project. I can read the full text when I use a browser, but in R part of it is hidden: the code is replaced by a message saying the content is only available to browsers that accept cookies.
Based on the question
How to properly set cookies to get URL content using httr
I am using the following code:
library(httr)
url<-"https://www.ogol.com.br/player_results.php?id=5637"
r <- GET(url, query = list(a = 1))
cookies(r)
response <- GET(url,
                set_cookies(`__cfduid`           = "dde27d084f28a84488910bf48f22f5fa01530024956",
                            `FORCE_SITE_VERSION` = "desktop",
                            `FORCE_MODALIDADE`   = "1",
                            `PHPSESSID`          = "uou4jukkosdaafidp26857k8t3"))
player_code<-content(x = response,as = "text", encoding = "ISO-8859-1")
But it also hides a part of the code and returns the message:
"Este conteúdo apenas está disponível para browsers que aceitam cookies" (put the message just to identify if your help has the same result :) )
It means: The content is available just for browsers that accept cookies.
Am I using the wrong cookie values, or is there something else I am missing? Thanks in advance.
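One likely problem with this approach is that hard-coded cookie values (especially the session id) expire. A sketch of building the set_cookies() arguments programmatically from the data frame that httr's cookies() returns; the stand-in data frame below makes the sketch self-contained, since cookies(r) needs a live response:

```r
# cookies(r) in httr returns a data frame with `name` and `value` columns;
# a stand-in data frame is used here in place of a live response
cookie_df <- data.frame(name  = c("FORCE_SITE_VERSION", "PHPSESSID"),
                        value = c("desktop", "uou4jukkosdaafidp26857k8t3"),
                        stringsAsFactors = FALSE)

# Turn it into a named list suitable for do.call(set_cookies, ...)
cookie_args <- setNames(as.list(cookie_df$value), cookie_df$name)

# With httr loaded and a fresh `r <- GET(url)`:
# response <- GET(url, do.call(set_cookies, cookie_args))
cookie_args$PHPSESSID
```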

RCurl postForm issues with fieldnames with special characters

How do I enclose a field name to postForm with RCurl when the form has fields like those below?
<input id="form:checkEstrato" type="checkbox" name="form:checkEstrato" checked="checked" />
<input id="form:checkArea" type="checkbox" name="form:checkArea" checked="checked" />
if I try something like
if(url.exists(url))
results <- postForm(url,
form:evento="35",
form:area = "10")
I get
> if(url.exists(url))
+ results <- postForm(url,
+ form:evento="35",
Error: unexpected '=' in:
" results <- postForm(url,
form:evento="
> form:area = "10")
Error: unexpected ')' in " form:area = "10")"
In fact it was simple, although now I have to work out why RCurl doesn't do what I want.
At least to avoid the above error, it was just a matter of enclosing the parameter names in quotes:
if(url.exists(url))
results <- postForm(url,
'form:evento'="35",
'form:area' = "10")
Now let's move on to understanding what is being sent to the server and why it's not working the way I expected.
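An alternative that sidesteps the quoting issue entirely is postForm()'s .params argument, which takes a named list; field names containing colons are ordinary quoted list names and never appear as bare R syntax. A sketch using the same field names as above:

```r
# Field names with colons are fine as quoted list names
params <- list("form:evento" = "35",
               "form:area"   = "10")

# With RCurl loaded:
# results <- postForm(url, .params = params)
names(params)
```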

Scrape the Data with Rcurl

I want to crawl some data from the following URL using RCurl and XML.
http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?&lang=
The date range is from 2000-06-05 to 2013-12-30, so there are more than 10,000 pages.
These are the elements on the page associated with the data:
<form name="report1_turnPageForm" method=post
action="http://datacenter.mep.gov.cn:80/.../air.../air_dairy.jsp..." style="display:none">
<input type=hidden name=reportParamsId value=122169>
<input type=hidden name=report1_currPage value="1">
<input type=hidden name=report1_cachedId value=53661>
</form>
The link also looks like this:
http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city&startdate=2013-12-15&enddate=2013-12-30&page=31
with startdate, enddate, and page parameters.
Then I began to crawl the web:
require(RCurl)
require(XML)
k = postForm("http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?&lang=")
k = iconv(k, 'gbk', 'utf-8')
k = htmlParse(k, asText = TRUE, encoding = 'utf-8')
Then... I don't know what to do next, and I'm not sure whether I'm on the right track.
I also tried this
k = sapply(getNodeSet(doc = k, path = "//font[@color='#0000FF' and @size='2']"),
           xmlValue)[1:24]
It doesn't work.
Could give some suggestions ? Thanks a lot!
Scrapy and beautifulsoup solutions are also strongly welcomed!
If XML is sufficient, maybe this would be a starting point:
require(XML)
url <- "http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city&startdate=2013-12-15&enddate=2013-12-30&page=%d"
pages <- 2
tabs <- vector("list", length=pages)
for (page in 1:pages) {
  doc <- htmlParse(paste(suppressWarnings(readLines(sprintf(url, page),
                                                    encoding = "UTF-8")),
                         collapse = "\n"))
  tabs[[page]] <- readHTMLTable(doc,
                                header = TRUE,
                                which = 4)  # readHTMLTable(doc)[["report1"]]
}
do.call(rbind.data.frame, tabs) # output
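The final do.call(rbind.data.frame, tabs) step can be checked in isolation with toy tables: each list element stands for one page's scraped table, and rbind stacks them into a single data frame. A minimal illustration with made-up values, assuming the per-page tables have compatible columns:

```r
# Stand-ins for two pages of scraped tables (made-up values)
tabs <- list(data.frame(city = c("Beijing", "Shanghai"), aqi = c(120, 85)),
             data.frame(city = "Guangzhou",              aqi = 60))

# Stack all pages into one data frame, as in the loop above
combined <- do.call(rbind.data.frame, tabs)
nrow(combined)  # 3
```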
