Characters different in RCurl/getURL than in browser - R

I am looking to extract foreign-language text from a website. The following code (hopefully self-contained) will demonstrate the problem:
require(RCurl)
require(XML)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent = "Chrome 39.0.2171.71 (64-bit)"
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', useragent = agent, followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008', maxredirs = as.integer(20), followlocation = TRUE, curl = curl)
work <- htmlTreeParse(html, useInternal = TRUE)
table <- xpathApply(work, "//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//font|//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//p", xmlValue) # this one captured some mess in 13
table[[2]]
The first bunch of characters in the console printout appear for me as ¸Ã\u0089Ã\u0092 iÃ\u0089{Ã\u0089xÃ\u0089 Ã\u008aºÃ\u0089Eònù®ú.
Note that if I go to the actual page (http://bit.ly/1AcE9Gs), view the page source, and find the second opening <font> tag (corresponding to the second list item in my table, or inspect the element near the first Hindi characters), what renders in the page source looks something like this: ¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É): which is what I want.
Does anyone know why this might occur, and/or how to fix it? Is it something to do with encodings in R, or RCurl? The characters already differ in the result of the initial getURL call, so the problem is not in passing the html text to xpathApply.
I am using Mac OS X 10.9.3, the Chrome browser (for viewing the actual page), and R 3.1.1.
If interested, see a related question on xpathApply here: R and xpathApply -- removing duplicates from nested html tags
Thanks!

Add encoding options to htmlParse and getURL:
require(RCurl)
require(XML)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Chrome 39.0.2171.71 (64-bit)"
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', useragent = agent, followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008',
               maxredirs = as.integer(20), followlocation = TRUE, curl = curl,
               .encoding = 'UTF-8')
work <- htmlParse(html, encoding = 'UTF-8')
table <- xpathApply(work, "//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//font|//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//p", xmlValue) # this one captured some mess in 13
> table[[2]]
[1] "¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):\r\nºÉ¦ÉÉ{ÉÊiÉ ¨É½þÉänùªÉ, {ɽþ±Éä ÊnùxÉ ¨ÉèÆ ¤ÉÉä±É\r\n®ú½þÉ lÉÉ iÉÉä ¨ÉèÆxÉä =iiÉ®ú {ÉÚ´ÉÒÇ ¦ÉÉ®úiÉ Eòä\r\n+ÉiÉÆEò´ÉÉnù {É®ú =ºÉ ÊnùxÉ nùÉä {ɽþ±ÉÖ+ÉäÆ EòÉ =±±ÉäJÉ\r\nÊEòªÉÉ lÉÉ* +ÉVÉ ¦ÉÒ ¨ÉèÆ, ÊVÉºÉ EòÉ®úhÉ ºÉä +ÉiÉÆEò´ÉÉnù\r\n{ÉènùÉ ½þÖ+É, =ºÉEòä Ê´É¹ÉªÉ ¨ÉäÆ lÉÉäc÷É ºÉÉ =±±ÉäJÉ\r\nEò°üÆMÉÉ*"

Here's an alternative implementation using rvest. Not only is the code simpler, but you don't have to do anything with the encoding; rvest figures it out for you.
library("rvest")
url <- "http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008"
search <- html(url)
search %>%
html_node("#ctl00_ContPlaceHolderMain_DataList1") %>%
html_nodes("font, p") %>%
html_text() %>%
.[[2]]
#> [1] "¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):\r\nºÉ¦ÉÉ{ÉÊiÉ ¨É½þÉänùªÉ, ...

Related

How to input a text file to an API in R

I want to use this API:
http(s)://lindat.mff.cuni.cz/services/morphodita/api/
with the method "tag". It will tag and lemmatize my text input. It has worked fine with a text string (see below), but I need to send an entire file to the API.
Just to show that a string as input works fine:
method <- "tag"
lemmatized_text <- RCurl::getForm(paste#
("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
.params = list(data = "Peter likes cakes. John likes lollypops.",#
output = "json", model = "english-morphium-wsj-140407-no_negation"), #
method = method)
This is the correct result:
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
\"acknowledgements\": [\n \"http://ufal.mff.cuni.cz
/morphodita#morphodita_acknowledgements\",\n \"http://ufal.mff.cuni.cz
/morphodita/users-manual#english-morphium-wsj_acknowledgements\"\n ],\n
\"result\": [[{\"token\":\"Peter\",\"lemma\":\"Peter\",\"tag\":\"NNP
\",\"space\":\" \"},{\"token\":\"likes\",\"lemma\":\"like\",\"tag\":\"VBZ
\",\"space\":\" \"},{\"token\":\"cakes\",\"lemma\":\"cake\",\"tag\":\"NNS
[truncated by me]
However, replacing the string with a vector whose elements correspond to the lines of a text file does not work, since the API requires a string as input. Only one vector element (by default the first) would be processed:
method <- "tag"
mydata <- c("cakes.", "lollypops")
lemmatized_text <- RCurl::getForm(paste("http://lindat.mff.cuni.cz
/services/morphodita/api/", method, sep = ""),
.params = list(data = mydata, output = "json",
model = "english-morphium-wsj-140407-no_negation"))
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
[truncated by me]
\"result\": [[{\"token\":\"cakes\",\"lemma\":\"cake\",\"tag\":\"NNS
\"},{\"token\":\".\",\"lemma\":\".\",\"tag\":\".\"}]]\n}\n"
This issue can be alleviated with sapply and a function calling the API on each element of the vector, but then each element of the resulting vector contains a separate JSON document. To parse it, though, I need the entire data to be one single JSON document.
Eventually I tried textConnection, but it returns erroneous output:
mydata <- c("cakes.", "lollypops")
mycon <- textConnection(mydata, encoding = "UTF-8")
lemmatized_text <- RCurl::getForm(paste#
("http://lindat.mff.cuni.cz/services/morphodita/api/", method,#
sep = ""), .params = list(data = mycon, output = "json",#
model = "english-morphium-wsj-140407-no_negation"))
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
\"acknowledgements\": [\n \"http://ufal.mff.cuni.cz
/morphodita#morphodita_acknowledgements\",\n \"http://ufal.mff.cuni.cz
/morphodita/users-manual#english-morphium-wsj_acknowledgements\"\n ],\n
\"result\": [[{\"token\":\"5\",\"lemma\":\"5\",\"tag\":\"CD\"}]]\n}\n"
attr(,"Content-Type")
I should probably also say that I have already tried to paste and collapse the vector into one single element, but that is very fragile. It works with dummy data, but not with larger files, and never with Czech files (although they are UTF-8 encoded). The API strictly requires UTF-8-encoded data. I therefore suspect encoding issues. I have tried this file:
mydata <- RCurl::getURI("https://ia902606.us.archive.org/4/items/maidmarian00966gut/maidm10.txt", .opts = list(.encoding = "UTF-8"))
and it said
Error: Bad Request
but when I only used a few lines, it suddenly worked. I also made a local copy of the file where I changed the newlines from Macintosh to Windows. Maybe this helped a bit, but it was definitely not sufficient.
Finally, I should add that I work on Windows 8 Professional, running R 3.2.4 (64-bit), with RStudio version 0.99.879.
I should have used RCurl::postForm instead of RCurl::getForm, with all other arguments remaining the same. Contrary to what I had wrongly believed, postForm is not only for writing files to the server. It also does not impose strict limits on the size of the data to be processed, since with postForm the data do not become part of the URL, unlike with getForm.
This is my convenience function (requires RCurl, stringi, stringr, magrittr):
process_w_morphodita <- function(method, data, output = "json",
                                 model = "czech-morfflex-pdt-161115",
                                 guesser = "yes", ...) {
  # for formally optional but very important argument-value pairs see the
  # MorphoDiTa REST API reference at
  # http://lindat.mff.cuni.cz/services/morphodita/api-reference.php
  pokus <- RCurl::postForm(
    paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
    .params = list(data = stringi::stri_enc_toutf8(data), output = output,
                   model = model, guesser = guesser, ...))
  if (output == "vertical") {
    # look for a literal backslash-t / backslash-n (written with four
    # backslashes in the pattern) and replace it with a real tab / newline
    # to get the vertical format in a text file
    pokus <- pokus %>%
      stringr::str_trim(side = "both") %>%
      stringr::str_conv("UTF-8") %>%
      stringr::str_replace_all(pattern = "\\\\t", replacement = "\t") %>%
      stringr::str_replace_all(pattern = "\\\\n", replacement = "\n")
  }
  return(pokus)
}
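For illustration, a minimal usage sketch (assuming RCurl, stringi, stringr and magrittr are installed and the MorphoDiTa service is reachable; the file name is a placeholder):
library(magrittr)
# read the whole file as one UTF-8 string instead of a vector of lines,
# then let postForm() send it in the request body
mytext <- paste(readLines("mytext.txt", encoding = "UTF-8"), collapse = "\n")
tagged <- process_w_morphodita("tag", data = mytext, output = "json",
                               model = "english-morphium-wsj-140407-no_negation")
cat(substr(tagged, 1, 200))  # peek at the beginning of the JSON response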

Fill a form in R without RSelenium

I need to fill in the month and year fields of this page:
http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3
To do this, I have programmed the following in RSelenium, and it works:
#library
library(RSelenium)
#browser parameters
mybrowser<-remoteDriver(browserName = "chrome")
mybrowser$open(silent = TRUE)
mybrowser$setTimeout(type = "page load", milliseconds =1000000)
mybrowser$setImplicitWaitTimeout(milliseconds = 1000000)
url<-paste("http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3",sep="")
#start navigation
mybrowser$navigate(url)
webElem$clickElement()
wxbox<-mybrowser$findElement(using="class","bordeInput2")
wxbox$sendKeysToElement(list("09"))
wxbox<-mybrowser$findElement(using="id","aa")
wxbox$sendKeysToElement(list("2016"))
wxbutton <- mybrowser$findElement('xpath', "//*[@id='fm']/div[2]/input")
wxbutton$clickElement()
However, I'd like to see a solution using rvest or RCurl; I've tried, and it does not work for me. If anyone can help me with that, I would appreciate it.
An attempt I made was
library(RCurl)
library(XML)
form <- postForm("Http://www.svs.cl/institucional/mercados/entidad.php?mercado=S&rut=99588060&grupo=&tipoentidad=CSVID&row=AABaHEAAaAAAB7uAAT&vig=VI&control=svs&pestania=3", Year = 2010, Month = 2)
doc <- htmlParse(form) pkids <- xpathSApply(doc, xmlAttrs)
pkids
data <- lapply(pkids)
tab <- readHTMLTable(data[[1]], which = 1)
First of all, thanks.
You can simply POST to the URL as follows:
require(rvest)
require(httr)
a <- POST("http://www.svs.cl/institucional/mercados/entidad.php",
          # body = what you fill in the form
          body = list(mm = 09, aa = 2016),
          # query = the long URL broken into parameters
          query = list(mercado = "S",
                       rut = "99588060",
                       grupo = "",
                       tipoentidad = "CSVID",
                       row = "AABaHEAAaAAAB7uAAT",
                       vig = "VI",
                       control = "svs",
                       pestania = "3"))
read_html(a) %>% html_nodes("dd") %>% html_text %>%
  setNames(c("Business name", "RUT"))
Which gives you:
Business name RUT
"ACE SEGUROS DE VIDA S.A." "99588060-1"

Web scraping a password-protected website, but there are errors

I am trying to scrape data from the member directory of a website (members.dublinchamber.ie). I have tried using 'rvest', but I got the data from the login page even after entering the login details. The code is as follows:
library(rvest)
url <- "http://members.dublinchamber.ie/login.aspx"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[2]]
filled_form <- set_values(pgform,
                          "Username" = "username",
                          "Password" = "password")
submit_form(pgsession, filled_form)
memberlist <- jump_to(pgsession, 'http://members.dublinchamber.ie/directory/profile.aspx?compid=50333')
page <- read_html(memberlist)
usernames <- html_nodes(x = page, css = 'css of required data')
data_usernames <- data.frame(html_text(usernames, trim = TRUE), stringsAsFactors = FALSE)
I also tried RCurl, and again I'm getting data from the login page. The RCurl code is as follows:
library(RCurl)
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('http://members.dublinchamber.ie/login.aspx', curl = curl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([^"]*)".*', '\\1', html))
params <- list(
'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$username'= 'username',
'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$password'= 'pass',
'ctl00$ContentPlaceHolder1$ExistingMembersLogin1$btnSubmit'= 'login',
'__VIEWSTATE' = viewstate
)
html = postForm('http://members.dublinchamber.ie/login.aspx', .params = params, curl = curl)
grep('Logout', html)
There are actually 3 URLs:
1) members.dublinchamber.ie/directory/default.aspx (has the names of all industries; you have to click on an industry)
2) members.dublinchamber.ie/directory/default.aspx?industryVal=AdvMarPubrel (AdvMarPubrel is just a small string that is generated when I click that industry)
3) members.dublinchamber.ie/directory/profile.aspx?compid=19399 (this has the profile information of the specific company I clicked on the previous page)
I want to scrape data that gives me the industry name, the list of companies in each industry, and their details, which are present as a table at the 3rd URL above.
I am new here and also to R and web scraping. Please don't mind if the question is lengthy or not that clear.
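One thing worth trying (a sketch only, not verified against this site) is to keep the session object that submit_form() returns and navigate from it, since in the rvest attempt above the return value of submit_form() is discarded:
# assumes pgsession and filled_form from the code above
logged_in <- submit_form(pgsession, filled_form)
memberlist <- jump_to(logged_in, 'http://members.dublinchamber.ie/directory/profile.aspx?compid=50333')
page <- read_html(memberlist)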

How do I extract data from Localytics using the httr package in R?

I am trying to extract data from Localytics using R. Here is the snippet of code I'm using:
library(httr)
localytics_url = 'https://api.localytics.com/v1/query'
r <- POST(url = localytics_url,
          body = list(
            app_id = app_id,
            metrics = c("users", "revenue"),
            dimensions = c("day", "birth_day"),
            conditions = list(
              day = c("between", "2015-02-01", "2015-04-01")
            )
          ),
          encode = "json",
          authenticate(key, secret),
          accept("application/json"),
          content_type("application/json"))
stop_for_status(r)
content(r)
But the output I get from content() is binary, not JSON. I'm confused. Furthermore, if I try to look at the object r, I see:
Response [https://api.localytics.com/v1/query]
Date: 2015-04-14 15:18
Status: 200
Content-Type: application/vnd.localytics.v1+hal+json;type=ResultSet; charset=utf-8
Size: 1.02 MB
<BINARY BODY>
I don't understand why it's a binary body or how to convert it back. Can anyone provide me any help/clues?
I've also tried this with RCurl, using the following code:
cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")
object <- getForm(uri = localytics_url, app_id = app_id, metrics = "customers",
                  dimensions = "day",
                  conditions = toJSON(list(day = c("between", "2015-01-01", "2015-04-09"))),
                  .opts = curlOptions(userpwd = sprintf("%s:%s", key, password)))
But that generates the error
Error in function (type, msg, asError = TRUE) :
SSL certificate problem: unable to get local issuer certificate
So I'm a bit stumped.
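One likely cause of that SSL error is that cainfo is defined but never passed to the request. Here is a sketch of wiring it through .opts (untested against the Localytics API; it assumes localytics_url, app_id, key and password are defined as above):
library(RCurl)
library(jsonlite)
cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")
object <- getForm(uri = localytics_url, app_id = app_id, metrics = "customers",
                  dimensions = "day",
                  conditions = toJSON(list(day = c("between", "2015-01-01", "2015-04-09"))),
                  # pass the CA bundle so curl can verify the certificate chain
                  .opts = curlOptions(cainfo = cainfo,
                                      userpwd = sprintf("%s:%s", key, password)))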
######## Added April 15, 2015
First, thanks to MrFlick for his help so far. I got it to work with
contents <- content(r, as = "text")
Thanks very much for your help. I (think I) had tried that before and then went on to try to extract it into an R data format using fromJSON, but I was using the rjson library, and the jsonlite package is what worked for me.
I appreciate your patience.
Here's a complete code sample showing how to get the data, then extract the results and view them as a table.
library(httr)
library(jsonlite)
response <- POST(url = 'https://api.localytics.com/v1/query',
                 body = list(
                   app_id = 'APP_ID',
                   metrics = 'sessions',
                   conditions = list(
                     day = c("between", format(Sys.Date() - 31, "%Y-%m-%d"), format(Sys.Date() - 1, "%Y-%m-%d"))
                   ),
                   dimensions = c('new_device', 'day')
                 ),
                 encode = "json",
                 authenticate('KEY', 'SECRET'),
                 accept("application/json"),
                 content_type("application/json"))
stop_for_status(response)
# Convert the content of the result to a string, which you can load with jsonlite
result <- paste(rawToChar(response$content), collapse = "")
# Useful to print your result in case you are getting any errors
print(result)
# Load your data with jsonlite
document <- fromJSON(result)
# The results tag contains the table of data you need
View(document$results)
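As an aside, the manual rawToChar() step can usually be replaced by letting httr decode the body itself. A small sketch using the same response object as above:
# content(as = "text") decodes the raw body using the charset from the response header
result_text <- content(response, as = "text", encoding = "UTF-8")
document <- jsonlite::fromJSON(result_text)
head(document$results)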

getForm - how to send special characters?

I have a small script written with RCurl which connects to the corpus of the Polish language and asks about a target word's frequency. However, this solution works only with standard characters. If I ask about a word with a Polish letter (e.g. "ę", "ą"), it returns no match. The output log suggests that the script is not properly transferring Polish characters in the URL address.
My script:
# slowo = word
wordCorpusChecker <- function(slowo, korpus = 2) {
  # this line helps me bypass the redirection page after calling for a specific word
  curl <- getCurlHandle(cookiefile = "", verbose = TRUE,
                        followlocation = TRUE, encoding = "utf-8")
  # standard call for submitting the html form
  getForm("http://korpus.pl/poliqarp/poliqarp.php",
          query = slowo, corpus = as.character(korpus), showMatch = "1",
          showContext = "3", leftContext = "5", rightContext = "5",
          wideContext = "50", hitsPerPage = "10",
          .opts = curlOptions(
            verbose = TRUE,
            followlocation = TRUE,
            encoding = "utf-8"
          ),
          curl = curl)
  # test2 holds the html of the page where I can find the information I'm interested in
  test1 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  test2 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  # "scraping" the frequency from the html page
  a <- regexpr("Found <em>", test2)[1] +
    as.integer(attributes(regexpr("Found <em>", test2)))
  b <- regexpr("</em> results<br />\n", test2)[1] - 1
  c <- a:b
  value <- substring(test2, c[1], c[length(c)])
  return(value)
}
# if you try this you will get a nice result for the frequency of "pies" (dog) in the Polish corpus
wordCorpusChecker("pies")
# if you try this you will get no match because of the special characters
wordCorpusChecker("kałuża")
# the log from verbose:
GET /poliqarp/poliqarp.php?query=ka%B3u%BFa&corpus=2&showMatch=1&showContext=3&leftContext=5&rightContext=5&wideContext=50&hitsPerPage=10
I've tried to specify the encoding option, but as the manual says, it refers to the result of a query. I'm experimenting with curlUnescape, but with no positive results. Kindly asking for advice.
One solution is to specify the UTF escape codes, for example:
> "ka\u0142u\u017Ca"
[1] "kałuża"
wordCorpusChecker("ka\u0142u\u017Ca")
[1] "55"
