Log into a website to grab the data using RCurl (R)

I wanted to log in to a website using RCurl and grab data from the web (the data cannot be seen without logging in).
I wanted to export this (for example) "http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone" into R after logging in with RCurl. The problem is that I cannot log in using RCurl. I hadn't tried this before, so I mostly referred to http://www.omegahat.org/RCurl/philosophy.html.
So here's what I tried. (Here, 'me@gmail.com' is my user ID and '9999' is my password; I just made them up.)
library(RJSONIO)
library(rjson)
library(RCurl)
appannie <- getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/.json?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone", userpwd = "me@gmail.com:9999", verbose = TRUE)
But this gave me the message below:
About to connect() to www.appannie.com port 80 (#0)
* Trying 69.167.138.64... * connected
* Connected to www.appannie.com (69.167.138.64) port 80 (#0)
> GET /app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone HTTP/1.1
Host: www.appannie.com
Accept: */*
< HTTP/1.1 403 FORBIDDEN
< Server: nginx/1.1.19
< Date: Fri, 01 Mar 2013 23:41:32 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=10
< Vary: Accept-Encoding
< Vary: Cookie,Accept-Encoding
<
* Connection #0 to host www.appannie.com left intact
So I went back and read http://www.omegahat.org/RCurl/philosophy.html again, still didn't know what to do, and then tried the following after seeing a similar question on Stack Overflow.
getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone", .opts = list(userpwd = "me@gmail.com:9999"))
But this gives me the output below.
[1] ""
Can anyone give me a hint? (After a bunch of different trials, the website has started sending me warnings. =()

Most likely this is some sort of authentication issue, not anything you did wrong with RCurl.
You got through to the server, but either your login was incorrect or invalid, or the data is not available via the API.
http://en.wikipedia.org/wiki/HTTP_403
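If the site uses a form login rather than HTTP basic auth (which the 403 hints at), the usual RCurl pattern is to establish a cookie session first and then reuse the same handle for the data request. A minimal sketch, assuming a hypothetical login URL and form field names (username, password); inspect the real login page for the actual ones:

library(RCurl)

# Keep cookies across requests; an empty cookiefile switches the cookie engine on.
curl <- getCurlHandle(cookiefile = "", followlocation = TRUE, verbose = TRUE)

# Hypothetical login form: check the real page for the action URL and field names.
postForm("http://www.appannie.com/account/login/",
         username = "me@gmail.com", password = "9999",
         curl = curl, style = "POST")

# Reuse the same handle (and its session cookies) for the chart request.
chart <- getURL(paste0("http://www.appannie.com/app/ios/instagram/ranking/",
                       "history/chart_data/?s=2010-10-06&e=2012-06-04",
                       "&c=143441&f=ranks&d=iphone"),
                curl = curl)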

Related

A strange problem about size limit in http header

Context: I maintain a kind of web service server with a particular implementation: all the data sent by the web services is located in the HTTP header, meaning the response contains only an HTTP header (no body). The web service runs as a Windows service. The consumer is my PHP code, which invokes the web service via the cURL library. All of this has been in production for three years and works fine. I recently had to build a development environment.
I have the web service on a Windows 7 Pro machine, running as a Windows service.
I have my PHP consumer on another Windows 7 Pro machine (WAMP + cURL).
My PHP code invokes the web service and displays the raw response.
In this context the problem occurs: if the response contains more than 1215 characters, I get an empty response (but no error message).
I installed my PHP code (exactly the same) on a new Ubuntu Linux machine: I have the same problem.
I installed my PHP code (exactly the same) on a new CentOS Linux machine: I DON'T HAVE THE PROBLEM.
I have read a lot on the internet about size limits on HTTP headers, and I no longer think that is the cause of the problem.
I examined all the size-limit parameters in Apache, PHP, and cURL, but I didn't find anything relevant.
If someone has any information, all leads are welcome. Thanks.
Not an answer, but I want to say that using PHP 7.2.5 under mod_php with Apache 2.4.33, I am unable to reproduce your issue: I have no problem sending anything from 1 byte to 10,000 to even 100,000 bytes in headers.
Here is my producer.php:
<?php
// Emit a response whose headers carry a payload of the requested size (?s=N).
$size = (int)($_GET['s'] ?? 1);
header("X-size: {$size}");
// Put the whole payload into a single response header; the body stays empty.
$data = str_repeat("a", $size);
header("X-data: {$data}");
http_response_code(204); // 204 No Content: headers only, no body
Whether I hit http://127.0.0.1/producer.php?s=1 or http://127.0.0.1/producer.php?s=10000 or even http://127.0.0.1/producer.php?s=100000, the data is returned without issue. Can you reproduce the issue using my producer.php code?
By the way, interestingly, when I try 1 million bytes, I get this error from curl:
$ curl -I http://127.0.0.1/producer.php?s=1000000
HTTP/1.1 204 No Content
Date: Wed, 16 Jan 2019 20:11:25 GMT
Server: Apache/2.4.33 (Win32) OpenSSL/1.1.0h PHP/7.2.5
X-Powered-By: PHP/7.2.5
X-size: 1000000
curl: (27) Rejected 104960 bytes header (max is 102400)!
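(Note that this rejection comes from curl itself, not from Apache or PHP: as the error text says, the client refuses to buffer a single header block larger than 102400 bytes.)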
Hanshenrik,
I also used CURLOPT_VERBOSE as you suggested. Here are the two curl logs.
The only difference is the line
<* stopped the pause stream!> in the Ubuntu curl log.
cURL log from Ubuntu, which has the problem:
* Trying 192.168.1.205...
* TCP_NODELAY set
* Connected to 192.168.1.205 (192.168.1.205) port 8084 (#0)
> POST /datasnap/rest/TServerMethods/%22W_GetDashboard%22/ HTTP/1.1
Host: 192.168.1.205:8084
Accept-Encoding: gzip,deflate
Accept: application/json
Content-Type: text/xml; charset=utf-8
Pragma: dssession=146326.909376.656191
Content-Length: 15
* upload completely sent off: 15 out of 15 bytes
< HTTP/1.1 200 OK
< Connection: close
< Content-Encoding: deflate
< Content-Type: application/json
< Content-Length: 348
< Date: Thu, 17 Jan 2019 15:27:03 GMT
< Pragma: dssession=146326.909376.656191,dssessionexpires=3600000
<
* stopped the pause stream!
* Closing connection 0
cURL log from CentOS, which does NOT have the problem:
* About to connect() to 192.168.1.205 port 8084 (#1)
* Trying 192.168.1.205...
* Connected to 192.168.1.205 (192.168.1.205) port 8084 (#1)
> POST /datasnap/rest/TServerMethods/%22W_GetDashboard%22/ HTTP/1.1
Host: 192.168.1.205:8084
Accept-Encoding: gzip,deflate
Accept: application/json
Content-Type: text/xml; charset=utf-8
Pragma: dssession=3812.553164.889594
Content-Length: 15
* upload completely sent off: 15 out of 15 bytes
< HTTP/1.1 200 OK
< Connection: close
< Content-Encoding: deflate
< Content-Type: application/json
< Content-Length: 348
< Date: Thu, 17 Jan 2019 15:43:39 GMT
< Pragma: dssession=3812.553164.889594,dssessionexpires=3600000
<
* Closing connection 1

How do I see what Content Types are on offer (for HTTP Content Negotiation)?

What one gets back when resolving a DOI depends on content negotiation.
I was looking at https://citation.crosscite.org/docs.html#sec-3
and I see different services offer different Content Types.
For a particular URL I want to know all the content types it can give me.
Some of them might be more useful than any that I am aware of (i.e., I don't want to write a list of preferences in advance).
For example:
https://doi.org/10.5061/dryad.1r170
I thought maybe OPTIONS was the way to do it, but that gave back nothing interesting, only information about allowed request methods.
shell> curl -v -X OPTIONS http://doi.org/10.5061/dryad.1r170
* Hostname was NOT found in DNS cache
* Trying 2600:1f14:6cf:c01::d...
* Trying 54.191.229.235...
* Connected to doi.org (2600:1f14:6cf:c01::d) port 80 (#0)
> OPTIONS /10.5061/dryad.1r170 HTTP/1.1
> User-Agent: curl/7.38.0
> Host: doi.org
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Allow: GET, HEAD, POST, TRACE, OPTIONS
< Content-Length: 0
< Date: Mon, 29 Jan 2018 07:01:14 GMT
<
* Connection #0 to host doi.org left intact
I guess there is no such standard yet, but the Link header (https://www.w3.org/wiki/LinkHeader) could expose this information.
Personally, though, I wouldn't rely too much on it: a server could start sending a new content type and still NOT expose it via this header.
It might be useful to check the API's response headers regularly, by manual or automated means, for any changes.
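There is no standard way to enumerate every representation a server can produce, but for DOIs you can probe the media types documented at citation.crosscite.org one at a time. A rough sketch in R; the list of types is copied from those docs, not discovered from the server:

library(RCurl)

# Probe a few media types one at a time; this list comes from the
# crosscite docs, it is NOT advertised by the server itself.
types <- c("application/vnd.citationstyles.csl+json",
           "application/x-bibtex",
           "text/x-bibliography")

for (t in types) {
  body <- getURL("https://doi.org/10.5061/dryad.1r170",
                 httpheader = c(Accept = t),
                 followlocation = TRUE)
  cat(t, "->", nchar(body), "bytes\n")
}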

HTTP 400 Bad Request Error

I am trying to post an XML file (test1.xml) to a web service API and receive its output, but I get an error: HTTP/1.1 400 Bad Request. This is the code below.
myheader = c(Connection = "close",
             'Content-Type' = "application/xml",
             'Content-length' = nchar("test1.xml"))
data = getURL("http://abcd/efg/requests/",
userpwd="m12345:123456", httpauth = 1L,
postfields="test1.xml",
httpheader=myheader,
verbose=TRUE)
This is the output
* Hostname was NOT found in DNS cache
* Trying 123.456.789.123...
* Connected to rcftomdev1 (123.456.789.123) port 8086 (#0)
* Server auth using Basic with user 'm12345'
> POST /dart/requests/ HTTP/1.1
Authorization: Basic bTEzNDQ4M#$#$#$#$%5==
Host: abcd:8086
Accept: */*
Connection: close
Content-Type: application/xml
Content-length: 9
* upload completely sent off: 9 out of 9 bytes
< HTTP/1.1 400 Bad Request
< Server: Apache-Coyote/1.1
< Content-Type: text/html;charset=utf-8
< Content-Length: 968
< Date: Tue, 20 Jan 2015 07:50:05 GMT
< Connection: close
<
* Closing connection 0
Not sure where I am going wrong; I need help.
If that was indeed exactly what was sent in the Authorization: header for the given user+password, then something has gone seriously wrong, as your combination should have produced bTEyMzQ1OjEyMzQ1Ng==.
The other aspects of the request look fine, so only reading up on the server-side requirements for this server and API will give you the answer.
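One more thing stands out in the log: Content-length: 9 shows that postfields = "test1.xml" sent the literal nine-character file name, not the file's contents, and that alone could trigger a 400. A sketch of posting the actual file body (same hypothetical URL and credentials as in the question):

library(RCurl)

# Read the XML document itself; postfields = "test1.xml" only sent the name.
xml_body <- paste(readLines("test1.xml"), collapse = "\n")

myheader <- c(Connection = "close",
              'Content-Type' = "application/xml",
              'Content-Length' = nchar(xml_body))

data <- getURL("http://abcd/efg/requests/",
               userpwd = "m12345:123456", httpauth = 1L,
               postfields = xml_body,
               httpheader = myheader,
               verbose = TRUE)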

Web scraping Billboard Hot Airplay chart

I'm trying to get the Hot 100 Airplay chart from its official website using R.
<http://www.billboard.com/biz/charts/hot-100-airplay>
The problem is that I have to somehow log into the website with my ID and password. I've tried the example code that RCurl provides, but none of it actually works.
So right now, instead of getting the whole chart, I've only scraped the top four songs for each week. Can anyone offer a solution so I can scrape all the info?
Oh, and Billboard's API is officially closed, so I can't expect anything from them. This is what I tried:
appannie = getURL("http://www.billboard.com/biz/charts/2013-11-02/hot-100-airplay, userpwd = tayshin:passward", verbose = TRUE)
The output was the following:
About to connect() to www.billboard.com port 80 (#0)
Trying 93.184.216.229... * connected
Connected to www.billboard.com (93.184.216.229) port 80 (#0)
GET /biz/charts/2013-11-02/hot-100-airplay, userpwd = tayshin:passward HTTP/1.1
Host: www.billboard.com
Accept: */*
HTTP 1.0, assume close after body
HTTP/1.0 400 Bad Request
Connection: close
Date: Sun, 03 Nov 2013 06:52:23 GMT
Server: ECSF (sjc/4F95)
Closing connection #0
appannie
[1] ""
Also, this one doesn't work.
x = getURL("http://www.billboard.com/biz/charts/2013-11-02/hot-100-airplay", userpwd = "tayshin:password")
It outputs something, but the information is limited.
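For what it's worth, the first attempt has the same quoting slip as the RCurl question at the top of this page: the GET line in the log shows that ", userpwd = tayshin:passward" was sent as part of the request path because it sat inside the URL string. Passing credentials as an option at least produces a valid request, as in the sketch below; note, though, that billboard.com uses a form login, so basic auth alone probably won't unlock the full chart, and a cookie session like the one sketched earlier is the more likely route:

library(RCurl)

# Credentials belong in the userpwd option, not inside the URL string.
x <- getURL("http://www.billboard.com/biz/charts/2013-11-02/hot-100-airplay",
            userpwd = "tayshin:password", httpauth = 1L,
            verbose = TRUE)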

cURL receives empty body response from Nginx server

I am trying to fetch HTTP content with cURL, but I only get an empty body in the reply:
[root@www ~]# curl -v http://www.existingdomain.com/
* About to connect() to www.existingdomain.com port 80 (#0)
* Trying 95.211.256.257... connected
* Connected to www.existingdomain.com (95.211.256.257) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.21.0 (x86_64-redhat-linux-gnu) libcurl/7.21.0 NSS/3.12.8.0 zlib/1.2.5 libidn/1.18 libssh2/1.2.4
> Host: www.existingdomain.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/0.8.53
< Date: Sat, 28 May 2011 15:56:23 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Encoding
< X-Powered-By: PHP/5.3.3-0.dotdeb.1
<
* Connection #0 to host www.existingdomain.com left intact
* Closing connection #0
If I change the URL to another domain, like www.google.com, I get the content.
How is this possible? And how can I fetch the content?
The server is free to send the client whatever it likes, including nothing. While this is not exactly nice, there's little the client can do about it. You could
check the server logs to see if there is some problem that makes it so quiet (given that the server is under your control), or
try another client to see whether the server simply doesn't like talking to curl. You can then configure curl to mimic a regular web browser, if that helps.
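A quick way to test that second suggestion from R, in keeping with the RCurl examples above (the User-Agent string is only an example of a browser-like value):

library(RCurl)

# Some servers send an empty body to clients they don't recognize;
# mimic a regular browser's User-Agent and compare the result.
page <- getURL("http://www.existingdomain.com/",
               useragent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36",
               verbose = TRUE)
nchar(page)  # non-zero if the server now returns a body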
