R httr GET doesn't like time format

I want to download data from a website using an API. The URL looks like the following, but with the XXXs, YYYs and ZZZs replaced with real API keys and a device MAC address:
https://api.ecowitt.net/api/v3/device/history?application_key=XXXX&api_key=YYY&mac=ZZZ&start_date=2022-06-01 00:00:00&end_date=2022-09-27 00:00:00&cycle_type=30min&call_back=outdoor,indoor.humidity
When I put this URL into a web browser I get a page full of data, so the server seems happy with it. But when I store it as "url_complete" and run
response <- httr::GET(url_complete)
I get this error message:
Error in curl::curl_fetch_memory(url, handle = handle) : URL using
bad/illegal format or missing URL
With the start and end dates and times removed from the URL, the response status is 200:
response <- httr::GET(test_url)
content(response, "text")
[1] "{"code":40000,"msg":"start_date require","time":"1664309668","data":[]}"

The solution was to pull the date strings out of the URL and pass them via the query argument:
response <- httr::GET(url_complete_noDates,
                      add_headers(accept = 'application/json'),
                      query = list(start_date = "2022-06-01 00:00:00",
                                   end_date = "2022-09-27 00:00:00"))
What this does is percent-encode the special characters in the date/time strings, the spaces and colons:
start_date=2022-06-01%2000%3A00%3A00&end_date=2022-09-27%2000%3A00%3A00
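Equivalently, you can percent-encode the timestamps yourself with base R before pasting them into the URL; a minimal sketch, reusing url_complete_noDates from above:
# Percent-encode reserved characters (spaces become %20, colons %3A)
start_enc <- utils::URLencode("2022-06-01 00:00:00", reserved = TRUE)
end_enc <- utils::URLencode("2022-09-27 00:00:00", reserved = TRUE)
url_encoded <- paste0(url_complete_noDates,
                      "&start_date=", start_enc,
                      "&end_date=", end_enc)
response <- httr::GET(url_encoded)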

Related

fix xml file with invalid characters, Invalid xmlChar value 2 [9]

I am parsing xml-files from a webservice, and occasionally I encounter this error:
xml2:::read_xml.raw(rs$content) # where the object rs is the response from the webservice, obtained using the httr package
Error in read_xml.raw(x, encoding = encoding, ...) :
xmlParseCharRef: invalid xmlChar value 2 [9]
I downloaded thousands of XMLs and only a few are broken.
My questions are then:
How do I locate the characters in the response that cause the error?
And what is the general strategy for fixing an invalid XML caused by invalid xmlChars?
I have circumvented the problem by parsing the response as HTML, but I would rather fix the issue and parse it as XML.
Thanks!
I was able to figure it out by doing the following:
First, to peek inside the content of the httr response:
xml_broken <- readBin(rs$content, what = "character")
Then I was able to systematically delete data from the broken xml, until I finally found this string of text which caused the problem:
"" # from the context i could see that this should be parsed as the danish character 'æ'
From https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references I could see that this should in fact be encoded as
"&aelig;"
So finally the httr content can be parsed by doing:
library(magrittr)  # for the %>% pipe
rs$content %>%
  readBin(what = "character") %>%
  gsub(pattern = "&#2;", replacement = "&aelig;") %>%
  XML::xmlParse()
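To locate such characters without manually bisecting the file, you could scan the raw text for numeric character references that XML 1.0 forbids; a small sketch along those lines (not part of the original answer):
xml_text <- readBin(rs$content, what = "character")
# Extract every numeric character reference (&#N;) ...
refs <- regmatches(xml_text, gregexpr("&#[0-9]+;", xml_text))[[1]]
codes <- as.integer(gsub("[^0-9]", "", refs))
# ... and keep the illegal ones (control codes other than tab, LF, CR)
unique(refs[codes < 32 & !codes %in% c(9, 10, 13)])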

extract links of subsequent images in div#data-old-hires

With some help, I am able to extract the landing/main image of a URL. However, I would like to be able to extract the subsequent images as well:
require(rvest)
url <- "https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc=1"
webpage <- read_html(url)
r <- webpage %>%
html_nodes("#landingImage") %>%
html_attr("data-a-dynamic-image")
imglink <- strsplit(r, '"')[[1]][2]
print(imglink)
This gives the correct output for the main image. However, I would also like to extract the links shown when I roll over to the other images of the same product. Essentially, I would like the output to contain the following links:
https://images-na.ssl-images-amazon.com/images/I/81bF%2Ba21WLL._UY500_.jpg
https://images-na.ssl-images-amazon.com/images/I/81HVwttGJAL._UY500_.jpg
https://images-na.ssl-images-amazon.com/images/I/81Z1wxLn-uL._UY500_.jpg
https://images-na.ssl-images-amazon.com/images/I/91iKg%2BKqKML._UY500_.jpg
https://images-na.ssl-images-amazon.com/images/I/91zhpH7%2B8gL._UY500_.jpg
Many thanks
As requested, the Python script is at the bottom. To make this applicable across languages, the answer is in two parts: 1) a high-level pseudo-code description of the steps, which can be carried out in R, Python, and many other languages; 2) a Python example.
An R script to obtain the string is shown at the end (steps 1-3 of the process).
1) Process:
Obtain the html via GET request
Regex out a substring from one of the script tags which is in fact what jquery on the page uses to provide the image links from json
The regex pattern is
jQuery\.parseJSON\(\'(.*)\'\);
The explanation is:
Basically, the contained JSON object is captured starting at the { before "dataInJson" and ending before the characters ');. That extracts the JSON object as a string. The first capturing group (.*) gathers everything between the start and end strings (excluding both sides).
The first match is the only one wanted, so it must be extracted from the matches returned. It is then handled with a JSON-parsing library that can take a string and return a JSON object.
That JSON object is looped over by key, starting with colorImages (in Python the structure is a dictionary; R will be slightly different), to generate the colours of the product, which are in turn used to access the actual URLs themselves.
(Screenshots in the original answer illustrated the colorImages keys for the colours and the nested level holding the image URLs.)
2) Those steps shown in Python
import requests #library to handle xhr GET
import re #library to handle regex
import json
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'jQuery\.parseJSON\(\'(.*)\'\);')
data = p1.findall(r.text)[0]
json_source = json.loads(data)
for colour in json_source['colorImages']:
    for image in json_source['colorImages'][colour]:
        print(image['large'])
Output:
All the links for the product in all colours, large image links only (so the URLs appear slightly different and more numerous, but they are the same images).
R script to regex out required string and generate JSON:
library(rvest)
library(jsonlite)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
page %>%
  html_nodes(xpath = ".//script[contains(., 'colorImages')]") %>%
  html_text() %>% as.character %>%
  str_match("jQuery\\.parseJSON\\(\\'(.*)\\'\\);") -> res
json = fromJSON(res[,2][2])
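For completeness, the colour loop from the Python version might look like this in R (a sketch; it assumes jsonlite simplifies each colour entry to a data frame with a large column, which I have not verified):
for (colour in names(json$colorImages)) {
  # each colour holds a set of image entries; print their large urls
  print(json$colorImages[[colour]]$large)
}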
They've updated the page so now just use:
Python:
import requests #library to handle xhr GET
import re #library to handle regex
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'"large":"(.*?)"')
links = p1.findall(r.text)
print(links)
R:
library(rvest)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
res <- page %>%
html_nodes(xpath=".//script[contains(., 'var data')]")%>%
html_text() %>% as.character %>%
str_match_all(.,'"large":"(.*?)"')
print(res[[1]][,2])

Filter for Google Analytics API

I'm trying to run this query with rGoogleAnalytics but it's throwing the error
Error in ParseDataFeedJSON(GA.Data) :
code : 400 Reason : Invalid value 'ga:pagePath=~/companies/[0-9]{6,8};ga:pagePath!#reviews' for filters parameter
I'm trying to fetch pages matching the pattern /companies/ followed by 6-8 numbers and not containing reviews
query.list <- Init(start.date = "2016-01-01",
end.date = "2017-03-31",
dimensions = "ga:pagePath",
metrics = "ga:pageviews",
filters = "ga:pagePath=~\/companies\/[0-9]{6,9};ga:pagePath!#reviews",
max.results = 10000,
table.id = "ga:xxxxxx")
Thanks
It appears that the problem is with your use of {6,9}. Try URL-encoding that part of your regular expression: %7B6%2C9%7D.
Use the Query Explorer to play with your query until you find one that works with what you are trying to accomplish.
The documentation states that URL-reserved characters, such as &, must be URL-encoded.
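In R, that substitution could be made before building the query, for example (a sketch; not verified against rGoogleAnalytics):
# Percent-encode the quantifier: { is %7B, the comma is %2C, } is %7D
filters_raw <- "ga:pagePath=~/companies/[0-9]{6,9};ga:pagePath!#reviews"
filters_enc <- sub("{6,9}", "%7B6%2C9%7D", filters_raw, fixed = TRUE)
# then pass filters_enc as the filters argument to Init()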

Searching elasticsearch from R using URI string with #timestamp range

I am trying to modify a working URI search from R to retrieve data from Elasticsearch using a time range on the #timestamp ES field.
Here is the URI string:
myuri01 <- paste0('myserver01:9201/filebeat-*/_search?pretty=true&size=9000&sort:#timestamp&q=%2BSESTYPE:APPTYPE001%20%2Bhost:mytesthost%20%2B#timestamp>"2016-10-25T18:59:23.250Z"')
retset = getURL(myuri01 )
But I get no data and no errors. I also tried without double quotes around the timestamp, with the same result. If I remove the last timestamp part, then I do get the data as expected.
Any help would be appreciated.
Thanks
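One thing worth checking: in a URL, an unencoded # begins a fragment, so curl never sends anything after it to the server, which would explain getting no data and no error. A sketch of percent-encoding the reserved characters with base R (server, index and field names as in the question; untested against Elasticsearch):
library(RCurl)
# '#' becomes %23, '+' %2B, '>' %3E, '"' %22
ts_clause <- URLencode('+#timestamp>"2016-10-25T18:59:23.250Z"', reserved = TRUE)
myuri01 <- paste0('myserver01:9201/filebeat-*/_search?pretty=true&size=9000',
                  '&q=%2BSESTYPE:APPTYPE001%20%2Bhost:mytesthost%20', ts_clause)
retset <- getURL(myuri01)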

How to wait for webpage to load before reading lines in R?

I am using R to scrape some webpages. One of these pages is a redirect to a new page. When I use readLines with this page, like so:
test <- readLines('http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25')
I get the still-redirecting page instead of the final page http://zfin.org/ZDB-GENE-030131-9076. I want to use this redirection page because its URL contains input_name=anxa5b, which makes it easy to grab pages for different input names.
How can I get the HTML of the final page?
The redirection page: http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25
The final page: http://zfin.org/ZDB-GENE-030131-9076
I don't know how to wait until the redirection, but in the source code of the page before the redirection you can see (in a script tag) a JavaScript function replaceLocation which contains the path you are redirected to: replaceLocation(\"/ZDB-GENE-030131-9076\").
So I suggest you parse the code and extract this path.
Here is my solution:
library(RCurl)
library(XML)
url <- "http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25"
domain <- "http://zfin.org"
doc <- htmlParse(getURL(url, useragent='R'))
scripts <- xpathSApply(doc, "//script", xmlValue)
script <- scripts[which(lapply(lapply(scripts, grep, pattern = "replaceLocation\\([^url]"), length) > 0)]
# > script
# [1] "\n \n\t \n\t replaceLocation(\"/ZDB-GENE-030131-9076\")\n \n \n\t"
new.url <- paste0(domain, gsub('.*\\"(.*)\\".*', '\\1', script))
readLines(new.url)
xpathSApply(doc, "//script", xmlValue) to get all the scripts in the source code.
script <- scripts[which(lapply(lapply(scripts, grep, pattern = "replaceLocation\\([^url]"), length) > 0)] to get the script containing the function with the redirecting path.
("replaceLocation\\([^url]" You need to exclude "url" cause there is two replaceLocationfunctions, one with the object url and the other one with the evaluated object (a string))
And finaly gsub('.*\\"(.*)\\".*', '\\1', script) to get only what you need in the script, the argument of the function, the path.
Hope this help !
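Wrapped up as a helper, the same approach generalises to other input names; a sketch based on the answer's code (the function name resolve_zfin and its parameters are mine):
library(RCurl)
library(XML)

resolve_zfin <- function(input_name, domain = "http://zfin.org") {
  url <- paste0(domain, "/cgi-bin/webdriver?MIval=aa-markerselect.apg",
                "&marker_type=GENE&query_results=t",
                "&input_name=", input_name,
                "&compare=contains&WINSIZE=25")
  doc <- htmlParse(getURL(url, useragent = "R"))
  scripts <- xpathSApply(doc, "//script", xmlValue)
  # keep the script calling replaceLocation() with a literal path
  script <- scripts[grepl("replaceLocation\\([^url]", scripts)][1]
  paste0(domain, gsub('.*\\"(.*)\\".*', "\\1", script))
}

readLines(resolve_zfin("anxa5b"))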
