How to input a text file to an API in R

I want to use this api:
http(s)://lindat.mff.cuni.cz/services/morphodita/api/
with the method "tag". It will tag and lemmatize my text input. It has worked fine with a text string (see below), but I need to send an entire file to the API.
Just to show that a string as input works fine:
method <- "tag"
lemmatized_text <- RCurl::getForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = "Peter likes cakes. John likes lollypops.",
                 output = "json", model = "english-morphium-wsj-140407-no_negation"),
  method = method)
This is the - correct - result:
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
\"acknowledgements\": [\n \"http://ufal.mff.cuni.cz
/morphodita#morphodita_acknowledgements\",\n \"http://ufal.mff.cuni.cz
/morphodita/users-manual#english-morphium-wsj_acknowledgements\"\n ],\n
\"result\": [[{\"token\":\"Peter\",\"lemma\":\"Peter\",\"tag\":\"NNP
\",\"space\":\" \"},{\"token\":\"likes\",\"lemma\":\"like\",\"tag\":\"VBZ
\",\"space\":\" \"},{\"token\":\"cakes\",\"lemma\":\"cake\",\"tag\":\"NNS
[truncated by me]
However, replacing the string with a character vector whose elements correspond to lines of a text file does not work, since the API requires a single string on input. Only one vector element, by default the first, gets processed:
method <- "tag"
mydata <- c("cakes.", "lollypops")
lemmatized_text <- RCurl::getForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = mydata, output = "json",
                 model = "english-morphium-wsj-140407-no_negation"))
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
[truncated by me]
\"result\": [[{\"token\":\"cakes\",\"lemma\":\"cake\",\"tag\":\"NNS
\"},{\"token\":\".\",\"lemma\":\".\",\"tag\":\".\"}]]\n}\n"
This issue can be worked around by calling the API on each element of the vector with sapply, but then each element of the resulting vector contains a separate JSON document. To parse the output, I need the entire data to be one single JSON document.
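For illustration, a minimal sketch of that per-element approach (mirroring the call above): the result has one JSON document per vector element.
method <- "tag"
mydata <- c("cakes.", "lollypops")
lemmatized_chunks <- sapply(mydata, function(x)
  RCurl::getForm(paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
                 .params = list(data = x, output = "json",
                                model = "english-morphium-wsj-140407-no_negation")))
# length(lemmatized_chunks) is 2: two separate JSON documents, not one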
Eventually I tried textConnection, but it returns erroneous output; the API apparently receives the connection's identifier (note the lone token "5" in the result below) rather than its contents:
mydata <- c("cakes.", "lollypops")
mycon <- textConnection(mydata, encoding = "UTF-8")
lemmatized_text <- RCurl::getForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = mycon, output = "json",
                 model = "english-morphium-wsj-140407-no_negation"))
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
\"acknowledgements\": [\n \"http://ufal.mff.cuni.cz
/morphodita#morphodita_acknowledgements\",\n \"http://ufal.mff.cuni.cz
/morphodita/users-manual#english-morphium-wsj_acknowledgements\"\n ],\n
\"result\": [[{\"token\":\"5\",\"lemma\":\"5\",\"tag\":\"CD\"}]]\n}\n"
attr(,"Content-Type")
I should probably also say that I have already tried pasting (collapsing) the vector into a single element, but that is very fragile: it works with dummy data, but not with larger files, and never with Czech files (although they are UTF-8 encoded). The API strictly requires UTF-8-encoded data, so I suspect encoding issues. I have tried this file:
mydata <- RCurl::getURI("https://ia902606.us.archive.org/4/items/maidmarian00966gut/maidm10.txt", .opts = list(.encoding = "UTF-8"))
and it said
Error: Bad Request
but when I only used a few lines, it suddenly worked. I also made a local copy of the file where I changed the newlines from Macintosh (CR) to Windows (CRLF). Maybe this helped a bit, but it was definitely not sufficient.
Finally, I should add that I work on Windows 8 Professional, running 64-bit R 3.2.4 with RStudio 0.99.879.

I should have used RCurl::postForm instead of RCurl::getForm, with all other arguments remaining the same. Contrary to what I had wrongly believed, postForm is not just for writing files to the server. And it does not impose strict limits on the size of the data to be processed, because with postForm the data are sent in the request body and do not become part of the URL, unlike with getForm.
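For illustration, a minimal sketch of the working approach (the file name is hypothetical): read the whole file into one UTF-8 string and post it in the request body.
method <- "tag"
mytext <- paste(readLines("myfile.txt", encoding = "UTF-8"), collapse = "\n")
lemmatized_text <- RCurl::postForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = mytext, output = "json",
                 model = "english-morphium-wsj-140407-no_negation"))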
This is my convenience function (requires RCurl, stringi, stringr, magrittr):
process_w_morphodita <- function(method, data, output = "json",
                                 model = "czech-morfflex-pdt-161115",
                                 guesser = "yes", ...) {
  # For formally optional but very important argument-value pairs, see the
  # MorphoDiTa REST API reference at
  # http://lindat.mff.cuni.cz/services/morphodita/api-reference.php
  pokus <- RCurl::postForm(
    paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
    .params = list(data = stringi::stri_enc_toutf8(data), output = output,
                   model = model, guesser = guesser, ...))
  if (output == "vertical") {
    # The patterns match a literal backslash followed by t or n (written with
    # four backslashes in the source) and replace the two-character escapes
    # with real tab and newline characters, to get the vertical format into a
    # text file.
    pokus <- pokus %>%
      stringr::str_trim(side = "both") %>%
      stringr::str_conv("UTF-8") %>%
      stringr::str_replace_all(pattern = "\\\\t", replacement = "\t") %>%
      stringr::str_replace_all(pattern = "\\\\n", replacement = "\n")
  }
  return(pokus)
}
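A hypothetical usage example, assuming mytext holds the contents of a file as a single string (as above):
lemmatized <- process_w_morphodita("tag", mytext, output = "json",
                                   model = "english-morphium-wsj-140407-no_negation")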

Related

toJSON conversion issue

I've been trying to extract multiple DNA-sequence alignments in R (4.0.3) by invoking the alignment REST API endpoint from Ensembl. A toy example is below:
library(httr)
library(jsonlite)
tmp_chr = "16"
tmp_seq_str = "87187517"
tmp_seq_end = "87187717"
server = "http://rest.ensembl.org"
ext = paste0("/alignment/region/homo_sapiens/", tmp_chr, ":", tmp_seq_str, "-",
tmp_seq_end, "?species_set_group=primates")
r = GET(paste(server, ext, sep = ""), content_type("application/json"))
json_object = fromJSON(toJSON(content(r)))[[1]]
The toJSON function works for some genomic locations, but not for others, giving the error message below:
Error in toJSON(content(r)) : unable to convert R type 22 to JSON
I was wondering if I am doing something wrong or if this is an issue with jsonlite. Please let me know if you need any additional info to reproduce the error. Many thanks!
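Update: R's internal type 22 is EXTPTRSXP (an external pointer), so content(r) presumably returns parsed objects containing something jsonlite cannot serialize. A possible workaround, sketched below on the same toy example and not verified for every region, is to parse the raw response text directly with fromJSON and skip the toJSON round trip:
r = GET(paste(server, ext, sep = ""), content_type("application/json"))
stop_for_status(r)
# parse the body as text and let fromJSON build the R objects directly
json_object = fromJSON(content(r, as = "text", encoding = "UTF-8"))[[1]]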

How to download and/or extract data stored in a 'raw' binary zip object within a response object in R?

I am unable to download or read a zip file from an API request using the httr package. Is there another package I can try that will allow me to download/read binary zip files stored within the response of a GET request in R?
I tried two ways:
1. I used GET to obtain an application/json response object (successful) and then used fromJSON on content(my_response, 'text') to extract the content. The output includes a column called 'zip', which the documentation states is a base64-encoded binary file. This column is currently a very long string of seemingly random letters, and I'm not sure how to convert it into the actual dataset.
2. I tried bypassing fromJSON, because I noticed there is a field of class 'raw' within the response object itself. This object is a sequence of bytes which I suspect is the binary representation of the dataset. I tried rawToChar(my_response$content) to convert the raw data to character, but this produces the same long character string as in #1.
I noticed that with approach #1, if I use base64_dec() to convert the long character string, I get the same type of output as the 'raw' field within the response object itself.
getzip1 <- GET(getzip1_link)
getzip1 # successful response, status 200
df <- fromJSON(content(getzip1, "text"))
df$status # "OK"
df$dataset$zip # <- this is the very long string of letters (eg. "I1NC5qc29uUEsBAhQDFA...")
# Method 1: try to convert from the 'zip' object in the output of fromJSON
try1 <- base64_dec(df$dataset$zip)
# looks similar to getzip1$content (i.e. this produces the sequence of bytes
# 50 4b 03 04 14 00, etc.; 50 4b is "PK", the ZIP file signature, so this is
# indeed the binary representation)
# Method 2: try to get data directly from raw object
class(getzip1$content) # <- 'raw' class object directly from GET request
try2 <- rawToChar(getzip1$content) # returns the same output as df$dataset$zip
I should be able to use either the raw 'content' object from the response or the long character string in the 'zip' object of the fromJSON output in order to view or download the dataset, but I don't know how to do this. Please help!
Welcome! Based on the documentation for the API, the response to the getDataset endpoint has this schema:
Dataset archive including meta information, the dataset itself is base64 encoded to allow for binary ZIP
transfers.
{
"status": "OK",
"dataset": {
"state_id": 5,
"session_id": 1624,
"session_name": "2019-2020 Regular Session",
"dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
"dataset_date": "2018-12-23",
"dataset_size": 317775,
"mime": "application\/zip",
"zip": "MIME 64 Encoded Document"
}
}
We can obtain the data in R with the following code:
library(httr)
library(jsonlite)
library(stringr)
library(maditr)
token <- "" # Your API key
session_id <- 1253L # Obtained from the getDatasetList endpoint
access_key <- "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile <- file.path("path", "to", "file.zip") # Modify
response <- str_c("https://api.legiscan.com/?key=",
token,
"&op=getDataset&id=",
session_id,
"&access_key=",
access_key) %>%
GET()
status_code(x = response) == 200 # Good
body <- content(x = response,
as = "text",
encoding = "utf8") %>%
fromJSON() # This contains some extra metadata
content(x = response,
as = "text",
encoding = "utf8") %>%
fromJSON() %>%
getElement(name = "dataset") %>%
getElement(name = "zip") %>%
base64_dec() %>%
writeBin(con = destfile)
unzip(zipfile = destfile)
unzip will extract the files, which in this case look like:
hash.md5 # Can be checked against the metadata
AL/2016-2016_1st_Special_Session/bill/*.json
AL/2016-2016_1st_Special_Session/people/*.json
AL/2016-2016_1st_Special_Session/vote/*.json
As always, wrap your code in functions and profit.
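For example, a minimal sketch of such a wrapper (the function name is illustrative):
get_legiscan_dataset <- function(token, session_id, access_key, destfile) {
  response <- GET(str_c("https://api.legiscan.com/?key=", token,
                        "&op=getDataset&id=", session_id,
                        "&access_key=", access_key))
  stopifnot(status_code(x = response) == 200)
  # decode the base64 'zip' field and write the raw bytes to disk
  content(x = response, as = "text", encoding = "utf8") %>%
    fromJSON() %>%
    getElement(name = "dataset") %>%
    getElement(name = "zip") %>%
    base64_dec() %>%
    writeBin(con = destfile)
  unzip(zipfile = destfile)
}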
PS: Here is how the code would look in Julia, for comparison.
using Base64, HTTP, JSON3, CodecZlib
token = "" # Your API key
session_id = 1253 # Obtained from the getDatasetList endpoint
access_key = "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
destfile = joinpath("path", "to", "file.zip") # Modify
response = string("https://api.legiscan.com/?",
join(["key=$token",
"op=getDataset",
"id=$session_id",
"access_key=$access_key"],
"&")) |>
HTTP.get
@assert response.status == 200
JSON3.read(response.body) |>
(content -> content.dataset.zip) |>
base64decode |>
(data -> write(destfile, data))
run(`unzip $destfile`)

How do I pass a numeric vector that is used as part of a function to purrr?

I have a function that I wrote that scrapes JSON from an API and saves the result to my computer. How would I pass a numeric vector to the function so that each individual JSON file is scraped and saved?
scrape_function <- function(period, api_key){
base_url <- "http://www.madeupurl.com/api/figures?"
params <-
list(
period = period,
response_format = "JSON",
api_key = api_key)
resp <- httr::GET(base_url, query = params)
# Save Response in JSON Format
out <- httr::content(resp, as = "text", encoding = "UTF-8")
# Read into JSON format
json <-
out %>%
jsonlite::prettify() %>%
jsonlite::fromJSON(simplifyDataFrame = TRUE, flatten = TRUE)
# Save Raw JSON Output
jsonlite::write_json(json, here::here("data-raw", "json", paste0("data-", period, ".json" )))
}
I want to run this function for a numeric vector of periods 1 through 28; the result will be the files as outlined in the function. I'm not sure which purrr function to use, as I've only used purrr for data frames via map_dfr.
period <- 1:28
The simplest way to loop over an integer vector is probably a for loop:
for (x in 1:28) {
scrape_function(x, api_key)
}
You could translate this into a base R lapply:
lapply(1:28, function(x) {scrape_function(x, api_key)})
Or into a purrr::map call, which allows the shorter lambda (~) function notation:
purrr::map(1:28, ~ scrape_function(.x, api_key))
Note that lapply and map will both produce the desired side-effect (of writing the JSON files) and a list as output. If you are only interested in the side-effects you might as well use walk.
purrr::walk(1:28, ~ scrape_function(.x, api_key))
walk not only produces the side-effects; it also returns, invisibly, the original object that was passed into it. You can see this when piping (%>%) the output into another function or when forcing printing by wrapping the call in parentheses. In our case this is the integer vector 1:28.
(purrr::walk(1:28, ~ scrape_function(.x, api_key)))

Hash the Local-Part of an E-mail Address in R to Obfuscate Values

I am trying to parse an e-mail address field into its local and domain parts, MD5-hash the local part, and then concatenate them back together. The goal here is to obfuscate the data in our development environment but still allow the field to be joined with other datasets by that field. I have this kind-of working, but I can't get the parsedlcl value to return correctly: I was expecting it to be a vector, but it returns as a single value.
Here is my code:
library(stringr)
localp <- gsub("@.*", "", dat$channels.email.address)
domainp <- gsub(".*@", "", dat$channels.email.address)
parsedlcl <- digest(localp, "md5", serialize = FALSE)
dat$channels.email.address <- str_c(parsedlcl, "@", domainp)
You need to loop digest over all the values in dat$channels.email.address; otherwise it will just generate a single value, as you are experiencing.
Your code would look like this:
library(stringr)
library(digest)
localp <- gsub("@.*", "", dat$channels.email.address)
domainp <- gsub(".*@", "", dat$channels.email.address)
parsedlcl <- character(length(localp)) # pre-allocate the result vector
for (i in seq_along(dat$channels.email.address)) {
  parsedlcl[i] <- digest(localp[i], "md5", serialize = FALSE)
}
dat$channels.email.address <- str_c(parsedlcl, "@", domainp)
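Alternatively, a loop-free sketch using vapply (digest itself is not vectorized; vapply just handles the iteration and guarantees one character result per element):
parsedlcl <- unname(vapply(localp, digest, character(1),
                           algo = "md5", serialize = FALSE))
dat$channels.email.address <- str_c(parsedlcl, "@", domainp)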

getForm - how to send special characters?

I have a small script written with RCurl which connects to a corpus of the Polish language and asks for a target word's frequency. However, this solution works only with standard characters. If I ask about a word with a Polish letter (e.g. "ę", "ą"), it returns no match. The verbose log suggests that the script does not transfer Polish characters properly in the URL.
My script:
#slowo = word;
wordCorpusChecker <- function(slowo, korpus = 2) {
# this line helps me bypass the redirection page after asking for a specific word
curl <- getCurlHandle(cookiefile = "", verbose = TRUE,
followlocation = TRUE, encoding = "utf-8")
# standard call for submitting the HTML form
getForm("http://korpus.pl/poliqarp/poliqarp.php",
query = slowo, corpus = as.character(korpus), showMatch = "1",
showContext = "3",leftContext = "5", rightContext = "5",
wideContext = "50", hitsPerPage = "10",
.opts = curlOptions(
verbose = T,
followlocation=TRUE,
encoding = "utf-8"
)
, curl = curl)
# test2 will contain the HTML of the page where I can find the information I'm interested in
test1 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
test2 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
#"scrapping" the frequency from html website
a<-regexpr("Found <em>", test2)[1]+
as.integer(attributes(regexpr("Found <em>", test2)))
b<-regexpr("</em> results<br />\n", test2)[1] - 1
c<-a:b
value<-substring(test2, c[1], c[length(c)])
return(value)
}
# if you try this, you will get a nice result for the frequency of "pies" (dog) in the Polish corpus
wordCorpusChecker("pies")
# if you try this, you will get no match because of the special characters
wordCorpusChecker("kałuża")
#the log from `verbose`:
GET /poliqarp/poliqarp.php?query=ka%B3u%BFa&corpus=2&showMatch=1&showContext=3&leftContext=5&rightContext=5&wideContext=50&hitsPerPage=10
I've tried to specify the encoding option, but as the manual says, it refers to the encoding of the query result. I have experimented with curlUnescape, but with no positive results. Any advice would be appreciated.
One solution is to specify the characters with Unicode escape sequences, for example:
> "ka\u0142u\u017Ca"
[1] "kałuża"
wordCorpusChecker("ka\u0142u\u017Ca")
[1] "55"
