I have a small script written with RCurl which connects to a corpus of the Polish language and asks for the frequency of a target word. However, this solution works only with standard ASCII characters. If I ask about a word containing Polish letters (e.g. "ę", "ą"), it returns no match. The verbose log suggests that the script is not transferring the Polish characters properly in the URL.
My script:
# slowo = word
wordCorpusChecker <- function(slowo, korpus = 2) {
  # this curl handle helps me bypass the redirection page after querying a specific word
  curl = getCurlHandle(cookiefile = "", verbose = TRUE,
                       followlocation = TRUE, encoding = "utf-8")
  # standard call submitting the HTML form
  getForm("http://korpus.pl/poliqarp/poliqarp.php",
          query = slowo, corpus = as.character(korpus), showMatch = "1",
          showContext = "3", leftContext = "5", rightContext = "5",
          wideContext = "50", hitsPerPage = "10",
          .opts = curlOptions(
            verbose = TRUE,
            followlocation = TRUE,
            encoding = "utf-8"
          ),
          curl = curl)
  # test2 holds the HTML of the page containing the information I'm interested in
  test1 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  test2 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  # "scraping" the frequency from the HTML page
  a <- regexpr("Found <em>", test2)[1] +
    as.integer(attributes(regexpr("Found <em>", test2)))
  b <- regexpr("</em> results<br />\n", test2)[1] - 1
  c <- a:b
  value <- substring(test2, c[1], c[length(c)])
  return(value)
}
# if you try this, you will get a nice result for the frequency of "pies" (dog) in the Polish corpus
wordCorpusChecker("pies")
# if you try this, you will get no match because of the special characters
wordCorpusChecker("kałuża")
#the log from `verbose`:
GET /poliqarp/poliqarp.php?query=ka%B3u%BFa&corpus=2&showMatch=1&showContext=3&leftContext=5&rightContext=5&wideContext=50&hitsPerPage=10
I've tried to specify the encoding option, but as the manual says, it refers to the encoding of the query result, not the request. I'm also experimenting with curlUnescape, but with no positive results. Kindly asking for advice.
One solution is to specify the characters with Unicode escapes, for example:
> "ka\u0142u\u017Ca"
[1] "kałuża"
wordCorpusChecker("ka\u0142u\u017Ca")
[1] "55"
I've been trying to extract multiple DNA-sequence alignments in R (4.0.3) by invoking the alignment REST API endpoint from Ensembl. A toy example is below:
library(httr)
library(jsonlite)
tmp_chr = "16"
tmp_seq_str = "87187517"
tmp_seq_end = "87187717"
server = "http://rest.ensembl.org"
ext = paste0("/alignment/region/homo_sapiens/", tmp_chr, ":", tmp_seq_str, "-",
tmp_seq_end, "?species_set_group=primates")
r = GET(paste(server, ext, sep = ""), content_type("application/json"))
json_object = fromJSON(toJSON(content(r)))[[1]]
The toJSON function works for some genomic locations but not for others, giving the error message below:
Error in toJSON(content(r)) : unable to convert R type 22 to JSON
I was wondering if I am doing something wrong or if this is an issue with jsonlite. Please let me know if you need any additional information to reproduce the error. Many thanks!
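One workaround I am considering (a sketch only; I am not sure it addresses the root cause): skip the toJSON round trip entirely by fetching the response body as text and parsing it directly with fromJSON, so that only the raw JSON from the server is ever converted:

library(httr)
library(jsonlite)

r <- GET(paste0(server, ext), content_type("application/json"))
stop_for_status(r)                                  # fail early on HTTP errors
txt <- content(r, as = "text", encoding = "UTF-8")  # raw JSON body as a single string
json_object <- fromJSON(txt)[[1]]                   # parse once, no toJSON round trip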
I want to use this API:
http(s)://lindat.mff.cuni.cz/services/morphodita/api/
with the method "tag". It will tag and lemmatize my text input. It has worked fine with a text string (see below), but I need to send an entire file to the API.
Just to show that string as input works fine:
method <- "tag"
lemmatized_text <- RCurl::getForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = "Peter likes cakes. John likes lollypops.",
                 output = "json", model = "english-morphium-wsj-140407-no_negation"),
  method = method)
This is the - correct - result:
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
\"acknowledgements\": [\n \"http://ufal.mff.cuni.cz
/morphodita#morphodita_acknowledgements\",\n \"http://ufal.mff.cuni.cz
/morphodita/users-manual#english-morphium-wsj_acknowledgements\"\n ],\n
\"result\": [[{\"token\":\"Peter\",\"lemma\":\"Peter\",\"tag\":\"NNP
\",\"space\":\" \"},{\"token\":\"likes\",\"lemma\":\"like\",\"tag\":\"VBZ
\",\"space\":\" \"},{\"token\":\"cakes\",\"lemma\":\"cake\",\"tag\":\"NNS
[truncated by me]
However, replacing the string with a vector whose elements correspond to the lines of a text file does not work, since the API requires a single string as input. Only one vector element, by default the first, gets processed:
method <- "tag"
mydata <- c("cakes.", "lollypops")
lemmatized_text <- RCurl::getForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = mydata, output = "json",
                 model = "english-morphium-wsj-140407-no_negation"))
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
[truncated by me]
\"result\": [[{\"token\":\"cakes\",\"lemma\":\"cake\",\"tag\":\"NNS
\"},{\"token\":\".\",\"lemma\":\".\",\"tag\":\".\"}]]\n}\n"
This issue can be alleviated with sapply and a function calling the API on each element of the vector (roughly as sketched below), but each element of the resulting vector then contains a separate JSON document. To parse it, I need the entire data to be one single JSON document, though.
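A minimal sketch of what I mean (illustrative only, reusing the toy vector from above; per_line is just a name I chose here):

method <- "tag"
mydata <- c("cakes.", "lollypops")
# one request per element: each element of per_line is a separate JSON document
per_line <- sapply(mydata, function(x)
  RCurl::getForm(
    paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
    .params = list(data = x, output = "json",
                   model = "english-morphium-wsj-140407-no_negation")))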
Eventually I tried textConnection, but it returns erroneous output:
mydata <- c("cakes.", "lollypops")
mycon <- textConnection(mydata, encoding = "UTF-8")
lemmatized_text <- RCurl::getForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = mycon, output = "json",
                 model = "english-morphium-wsj-140407-no_negation"))
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
\"acknowledgements\": [\n \"http://ufal.mff.cuni.cz
/morphodita#morphodita_acknowledgements\",\n \"http://ufal.mff.cuni.cz
/morphodita/users-manual#english-morphium-wsj_acknowledgements\"\n ],\n
\"result\": [[{\"token\":\"5\",\"lemma\":\"5\",\"tag\":\"CD\"}]]\n}\n"
attr(,"Content-Type")
I should probably also say that I have already tried to paste and collapse the vector into one single element, but that is very fragile. It works with dummy data, but not with larger files, and never with Czech files (although they are UTF-8 encoded). The API strictly requires UTF-8-encoded data, so I suspect encoding issues. I have tried this file:
mydata <- RCurl::getURI("https://ia902606.us.archive.org/4/items/maidmarian00966gut/maidm10.txt", .opts = list(.encoding = "UTF-8"))
and it said
Error: Bad Request
but when I only used a few lines, it suddenly worked. I also made a local copy of the file where I changed the newlines from Macintosh to Windows style. Maybe this helped a bit, but it was definitely not sufficient.
Finally, I should add that I work on Windows 8 Professional, running R 3.2.4 64-bit, with RStudio version 0.99.879.
I should have used RCurl::postForm instead of RCurl::getForm, with all other arguments remaining the same. The postForm function is not only for writing files to the server, as I had wrongly believed. It also does not impose strict limits on the size of the data to be processed, since with postForm the data do not become part of the URL, unlike with getForm.
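In its minimal form the fix looks like this (a sketch; here I collapse the toy vector into one string, as discussed above):

method <- "tag"
mydata <- c("cakes.", "lollypops")
lemmatized_text <- RCurl::postForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  # the data travel in the request body, not in the URL, so size is no longer a problem
  .params = list(data = paste(mydata, collapse = "\n"), output = "json",
                 model = "english-morphium-wsj-140407-no_negation"))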
This is my convenience function (requires RCurl, stringi, stringr, magrittr):
process_w_morphodita <- function(method, data, output = "json",
                                 model = "czech-morfflex-pdt-161115",
                                 guesser = "yes", ...) {
  # for formally optional but very important argument-value pairs see the
  # MorphoDiTa REST API reference at
  # http://lindat.mff.cuni.cz/services/morphodita/api-reference.php
  pokus <- RCurl::postForm(
    paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
    .params = list(data = stringi::stri_enc_toutf8(data), output = output,
                   model = model, guesser = guesser, ...))
  if (output == "vertical") {
    # the response contains literal "\t" and "\n" escape sequences; turn them
    # back into real tabs and newlines to get the vertical format in a text file
    pokus <- pokus %>%
      stringr::str_trim(side = "both") %>%
      stringr::str_conv("UTF-8") %>%
      stringr::str_replace_all(pattern = "\\\\t", replacement = "\t") %>%
      stringr::str_replace_all(pattern = "\\\\n", replacement = "\n")
  }
  return(pokus)
}
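A usage example (a sketch only: "myfile.txt" is a placeholder, and reading the whole file into one string is my own illustration):

# read a UTF-8 text file and send it to the API as one single string
mytext <- paste(readLines("myfile.txt", encoding = "UTF-8"), collapse = "\n")
tagged <- process_w_morphodita("tag", data = mytext, output = "json",
                               model = "english-morphium-wsj-140407-no_negation")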
I am trying to obtain data from a website and, thanks to a helper, I could get to the following script:
require(httr)
require(rvest)
res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do",
body = list(page = "advancedSearch",
AttachmentExist = "",
family = "",
placeOfPub = "",
genus = "Arctodupontia",
yearPublished = "",
species ="scleroclada",
author = "",
infraRank = "",
infraEpithet = "",
selectedLevel = "cont"),
encode = "form")
pg <- content(res, as="parsed")
lnks <- html_attr(html_node(pg,"td"), "href")
However, in some cases, like the example above, it does not retrieve the right link because, for some reason, html_attr does not find URLs ("href") within the node detected by html_node. So far, I have tried different CSS selectors, like "td", "a.onwardnav" and ".plantname", but none of them produce an object that html_attr can handle correctly.
Any hint?
You are really close to getting the answer you were expecting. If you would like to pull the links off of the desired page, then:
lnks <- html_attr(html_nodes(pg,"a"), "href")
will return the "href" attribute of every "a" tag on the page (a character vector, with NA for tags that lack one). Notice the command is html_nodes and not html_node; there are multiple "a" tags, thus the plural.
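If you specifically want the onward-navigation links, the same plural form should work with the "a.onwardnav" selector you already tried (a sketch, assuming those links carry that class on the results page):

lnks <- html_attr(html_nodes(pg, "a.onwardnav"), "href")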
If you are looking for the information from the table in the body of the page, then try this:
html_table(pg, fill=TRUE)
#or this
html_nodes(pg,"tr")
The second line will return a list of the 9 rows from the table which one could then parse to obtain the row names ("th") and/or row values ("td").
Hope this helps.
I am looking to extract foreign-language text from a website. The following code (hopefully self-contained) will demonstrate the problem:
require(RCurl)
require(XML)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Chrome 39.0.2171.71 (64-bit)"
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', useragent = agent, followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <- getURL('http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008', maxredirs = as.integer(20), followlocation = TRUE, curl = curl)
work <- htmlTreeParse(html, useInternal = TRUE)
table <- xpathApply(work, "//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//font | //table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//p", xmlValue) # this one captured some mess in 13
table[[2]]
The first bunch of characters in the console printout appears for me as ¸Ã\u0089Ã\u0092 iÃ\u0089{Ã\u0089xÃ\u0089 Ã\u008aºÃ\u0089Eònù®ú.
Note that if I go to the actual page (http://bit.ly/1AcE9Gs), and view the page source and find the second opening <font tag (corresponding to the second list item in my table, or inspect the element near the first Hindi characters) what renders in the page source looks something like this: ¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É): which is what I want.
Does anyone know why this might occur, and/or how to fix it? Is it something to do with encodings in R, or RCurl? The characters are already garbled in the result of the initial getURL call, so it is not a matter of passing from the HTML text to xpathApply.
I am using Mac OS X 10.9.3, the Chrome browser (for viewing the actual page), and R 3.1.1.
If interested, see a related question on xpathApply here: R and xpathApply -- removing duplicates from nested html tags
Thanks!
Add encoding options to htmlParse and getURL:
require(RCurl)
require(XML)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Chrome 39.0.2171.71 (64-bit)"
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', useragent = agent, followlocation = TRUE, autoreferer = TRUE, curl = curl)
html <-getURL('http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008'
, maxredirs = as.integer(20), followlocation = TRUE, curl = curl
, .encoding = 'UTF-8')
work <- htmlParse(html, encoding = 'UTF-8')
table <- xpathApply(work, "//table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//font | //table[@id = 'ctl00_ContPlaceHolderMain_DataList1']//p", xmlValue) # this one captured some mess in 13
> table[[2]]
[1] "¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):\r\nºÉ¦ÉÉ{ÉÊiÉ ¨É½þÉänùªÉ, {ɽþ±Éä ÊnùxÉ ¨ÉèÆ ¤ÉÉä±É\r\n®ú½þÉ lÉÉ iÉÉä ¨ÉèÆxÉä =iiÉ®ú {ÉÚ´ÉÒÇ ¦ÉÉ®úiÉ Eòä\r\n+ÉiÉÆEò´ÉÉnù {É®ú =ºÉ ÊnùxÉ nùÉä {ɽþ±ÉÖ+ÉäÆ EòÉ =±±ÉäJÉ\r\nÊEòªÉÉ lÉÉ* +ÉVÉ ¦ÉÒ ¨ÉèÆ, ÊVÉºÉ EòÉ®úhÉ ºÉä +ÉiÉÆEò´ÉÉnù\r\n{ÉènùÉ ½þÖ+É, =ºÉEòä Ê´É¹ÉªÉ ¨ÉäÆ lÉÉäc÷É ºÉÉ =±±ÉäJÉ\r\nEò°üÆMÉÉ*"
Here's an alternative implementation using rvest. Not only is the code simpler, but you don't have to do anything with the encoding, rvest figures it out for you.
library("rvest")
url <- "http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008"
search <- html(url)
search %>%
html_node("#ctl00_ContPlaceHolderMain_DataList1") %>%
html_nodes("font, p") %>%
html_text() %>%
.[[2]]
#> [1] "¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):\r\nºÉ¦ÉÉ{ÉÊiÉ ¨É½þÉänùªÉ, ...
How can I encode a URL such as this:
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI=1S/C21H30O9/c1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19/h5-8,14,16-19,22,25-28H,9-10H2,1-4H3/b6-5+,11-7-/t14-,16-,17+,18-,19+,21-/m1/s1&token=e4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
to make this
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InChI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1
in R?
I tried URLencode, but it does not work.
Thanks
It seems that you want to get rid of all but the first GET parameter in the URL and then percent-encode the associated data.
url <- "..."
library(stringi)
(addr <- stri_replace_all_regex(url, "\\?.*", ""))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES"
args <- stri_match_first_regex(url, "[?&](.*?)=([^&]+)")
(data <- stri_replace_all_regex(
stri_trans_general(args[,3], "[^a-zA-Z0-9\\-()]Any-Hex/XML"),
"&#x([0-9a-fA-F]{2});", "%$1"))
## [1] "InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
(addr <- stri_c(addr, "?", args[,2], "=", data))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
Here I made use of ICU's transliterator (via stri_trans_general). All characters except A-Z, a-z, 0-9, "(", ")", and "-" have been converted to a hexadecimal representation of the form &#xNN; (it seems that URLencode does not handle "," even with reserved = TRUE). Then each &#xNN; was converted to %NN with stri_replace_all_regex.
Here are two approaches:
1) gsubfn/URLencode If u is an R character string containing the URL, then try this. It passes everything after the ? to URLencode, replacing that part of the input with the output of the function. Note that "\\K" discards everything matched up to that point, so the ? itself does not get encoded:
library(gsubfn)
gsubfn("\\?\\K(.*)", ~ URLencode(x, TRUE), u, perl = TRUE)
It gives the following (which is not identical to the output in the question but may be sufficient):
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3dInchI%3d1S%2fC21H30O9%2fc1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2fh5-8,14,16-19,22,25-28H,9-10H2,1-4H3%2fb6-5+,11-7-%2ft14-,16-,17+,18-,19+,21-%2fm1%2fs1%26token%3de4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
2) gsubfn/curlEscape For a somewhat different output, continuing to use gsubfn, try:
library(RCurl)
gsubfn("\\?\\K(.*)", curlEscape, u, perl = TRUE)
giving:
http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3DInchI%3D1S%2FC21H30O9%2Fc1%2D11%285%2D6%2D21%2828%2912%282%298%2D13%2823%299%2D20%2821%2C3%294%297%2D15%2824%2930%2D19%2D18%2827%2917%2826%2916%2825%2914%2810%2D22%2929%2D19%2Fh5%2D8%2C14%2C16%2D19%2C22%2C25%2D28H%2C9%2D10H2%2C1%2D4H3%2Fb6%2D5%2B%2C11%2D7%2D%2Ft14%2D%2C16%2D%2C17%2B%2C18%2D%2C19%2B%2C21%2D%2Fm1%2Fs1%26token%3De4a6d6fb%2Dae07%2D4cf6%2Dbae8%2Dc0e6115bc681
ADDED curlEscape approach