Web scraping the Billboard Hot 100 Airplay chart in R

I'm trying to get the Hot 100 Airplay chart from Billboard's official website using R.
<http://www.billboard.com/biz/charts/hot-100-airplay>
The problem is that I have to somehow log into the website with my ID and password. I've tried the example code that RCurl provides, but none of it actually works.
So right now, instead of getting the whole chart, I've only scraped the top four songs for each week. Can anyone offer a solution so that I can scrape all the info?
Oh, and Billboard's API is officially closed, so I can't expect anything from them. This is what I tried:
appannie = getURL("http://www.billboard.com/biz/charts/2013-11-02/hot-100-airplay, userpwd = tayshin:passward", verbose = TRUE)
The output was the following:
About to connect() to www.billboard.com port 80 (#0)
Trying 93.184.216.229... * connected
Connected to www.billboard.com (93.184.216.229) port 80 (#0)
GET /biz/charts/2013-11-02/hot-100-airplay, userpwd = tayshin:passward HTTP/1.1
Host: www.billboard.com
Accept: */*
HTTP 1.0, assume close after body
HTTP/1.0 400 Bad Request
Connection: close
Date: Sun, 03 Nov 2013 06:52:23 GMT
Server: ECSF (sjc/4F95)
Closing connection #0
appannie
[1] ""
Also, this one doesn't work.
x = getURL("http://www.billboard.com/biz/charts/2013-11-02/hot-100-airplay", userpwd = "tayshin:password")
It outputs something, but the information is limited.
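For reference, this is roughly how I'm scraping the publicly visible part of the page at the moment; a minimal sketch, where the XPath class name is an assumption that would have to be checked against the actual page source:

library(RCurl)
library(XML)

# a minimal parsing sketch; the chart-row class below is a guess and
# must be verified against the real page markup
page  <- getURL("http://www.billboard.com/biz/charts/2013-11-02/hot-100-airplay")
doc   <- htmlParse(page, asText = TRUE)
songs <- xpathSApply(doc, "//h2[contains(@class, 'chart-row')]", xmlValue)
head(songs, 4)  # only the top entries are visible without logging in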

Related

How do I find out what Content Types are on offer (for HTTP Content Negotiation)?

What one gets back when resolving a DOI depends on content negotiation.
I was looking at https://citation.crosscite.org/docs.html#sec-3
and I see different services offer different Content Types.
For a particular URL I want to know all the content types it can give me.
Some of them might be more useful than any that I am aware of (i.e., I don't want to write a list of preferences in advance).
For example:
https://doi.org/10.5061/dryad.1r170
I thought maybe OPTIONS was the way to do it, but that gave back nothing interesting, only information about allowed request methods.
shell> curl -v -X OPTIONS http://doi.org/10.5061/dryad.1r170
* Hostname was NOT found in DNS cache
* Trying 2600:1f14:6cf:c01::d...
* Trying 54.191.229.235...
* Connected to doi.org (2600:1f14:6cf:c01::d) port 80 (#0)
> OPTIONS /10.5061/dryad.1r170 HTTP/1.1
> User-Agent: curl/7.38.0
> Host: doi.org
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Allow: GET, HEAD, POST, TRACE, OPTIONS
< Content-Length: 0
< Date: Mon, 29 Jan 2018 07:01:14 GMT
<
* Connection #0 to host doi.org left intact
I guess there is no such standard yet, but the Link header (https://www.w3.org/wiki/LinkHeader) could expose this information.
Personally, though, I wouldn't rely too much on it. For example, a server could start sending a new content type and still not expose it via this header.
It might be useful to check the API response headers frequently, via manual or automated means, for any changes.
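Since there is no way to enumerate the offered types, one pragmatic option is to probe with candidate Accept headers and record what comes back. A minimal sketch in R using the httr package; the candidate list is an assumption, taken from the crosscite documentation linked above:

library(httr)

# probe a DOI with a few candidate Accept headers; the list of types
# below is an assumption and can be extended
candidates <- c("application/vnd.citationstyles.csl+json",
                "application/x-bibtex",
                "text/x-bibliography")
for (type in candidates) {
  r <- GET("https://doi.org/10.5061/dryad.1r170", accept(type))
  cat(type, "->", status_code(r), headers(r)[["content-type"]], "\n")
}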

External links URL encoding leads to '%3F' and '%3D' on Nginx server

I have a problem with my server. I have four inbound links to different pages of my dynamic website which look something like this:
myurl.com/default/Site%3Fid%3D13
They should look like this:
myurl.com/default/Site?id=13
I know that %3F is an escape sequence for the ? sign and %3D is an escape sequence for the equals sign, but I get a 400 error when I use those links. What can I do about that?
The four links point to different pages, and I imagine over time there will be more links like that, so one fix for all would be perfect.
An exact same question was actually asked on nginx-ru mailing list about a year ago:
http://mailman.nginx.org/pipermail/nginx-ru/2013-February/050200.html
The most helpful response, by an Nginx, Inc. employee/developer, Valentin Bartenev (Валентин Бартенев):
http://mailman.nginx.org/pipermail/nginx-ru/2013-February/050209.html
Если запрос приходит в таком виде, то это уже не параметры, а имя запрошенного
файла. Другое дело, что location ищется по уже раскодированному адресу, о чем в
документации написано.
Translation:
If the request comes in such a form, then these are no longer the args, but the name of the requested file. Another thing is that, as documented, the location matching is performed against a normalised URI.
His suggested solution, adapted to the example from the question here at SO, would then be:
location /default/Site? {
    rewrite \?(.*)$ /default/Site?$1? last;
}

location = /default/Site {
    [...]
}
The following sample would redirect all wrong-looking requests (defined as having ? in the requested filename, encoded as %3F in the request) into less wrong-looking ones, regardless of URL.
(Please note that, as rightly advised elsewhere, you should not be getting these wrongly formed links in the first place, so use this as a last resort: only when you cannot correct the wrongly formed links otherwise, and you know that such requests are attempted by valid agents.)
server {
    listen [::]:80;
    server_name localhost;

    rewrite ^/([^?]*)\?(.*)$ /$1?$2? permanent;

    location / {
        return 200 "id is $arg_id\n";
    }
}
This is an example of how it works: when a wrong-looking request is encountered, a correction attempt is made with a 301 Moved Permanently response carrying a (presumably correct) Location response header, which makes the browser automatically re-issue the request to the newly provided location:
opti# curl -6v "http://localhost/default/Site%3Fid%3D13"
* About to connect() to localhost port 80 (#0)
* Trying ::1...
* connected
* Connected to localhost (::1) port 80 (#0)
> GET /default/Site%3Fid%3D13 HTTP/1.1
> User-Agent: curl/7.26.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.4.1
< Date: Wed, 15 Jan 2014 17:09:25 GMT
< Content-Type: text/html
< Content-Length: 184
< Location: http://localhost/default/Site?id=13
< Connection: keep-alive
<
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.4.1</center>
</body>
</html>
* Connection #0 to host localhost left intact
* Closing connection #0
Note that no correction attempts are made on proper-looking requests:
opti# curl -6v "http://localhost/default/Site?id=13"
* About to connect() to localhost port 80 (#0)
* Trying ::1...
* connected
* Connected to localhost (::1) port 80 (#0)
> GET /default/Site?id=13 HTTP/1.1
> User-Agent: curl/7.26.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/1.4.1
< Date: Wed, 15 Jan 2014 17:09:30 GMT
< Content-Type: application/octet-stream
< Content-Length: 9
< Connection: keep-alive
<
id is 13
* Connection #0 to host localhost left intact
* Closing connection #0
The URL is perfectly valid. The escaped characters it contains are just that, escaped, which is perfectly fine.
The point is that you can actually have a request name (in most cases corresponding to the filename on disk) that is literally Site?id=13, rather than Site with the rest as the query string.
I would consider it bad practice to have characters in a filename that make this necessary. However, in URL arguments it may very well be necessary.
Nevertheless, the request URL is valid and probably not what you want it to be, which consequently suggests that you should correct the error wherever anybody picked up the wrong URL in the first place.
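The round trip is easy to check, for example with base R's URL helpers:

# what the server actually sees: a literal filename, not a query string
URLdecode("Site%3Fid%3D13")
#> [1] "Site?id=13"
URLencode("Site?id=13", reserved = TRUE)
#> [1] "Site%3Fid%3D13"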
I do not really understand why you get a 400 error; you should rather get a 404. But that depends on your setup.
There are also cases, especially with nginx, mostly involving passing whole URLs and URL parts along multiple levels (for example reverse proxies, or matching regular expressions from the URL and using them as variables), where such an error may occur. But to verify this and fix it we would need to know more about your setup.

HTTP streaming / chunked responses on Heroku with Clojure

I'm making a Clojure web app that streams data to clients using chunked HTTP responses. This works great when I run it locally using foreman, but doesn't work properly when I deploy it to Heroku.
A minimal example exhibiting this behaviour can be found on my github here. The frontend (in resources/index.html) performs an AJAX GET request and prints the response chunks as they arrive. The server uses http-kit to send a new chunk to connected clients every second. By design, the HTTP request never completes.
When the same code is deployed to Heroku, the HTTP connection is closed by the server immediately after the first chunk is sent. It seems to be Heroku's routing mesh that is causing the disconnection.
This can also be seen by performing the GET request using curl:
$ curl -v http://arcane-headland-2284.herokuapp.com/stream
* About to connect() to arcane-headland-2284.herokuapp.com port 80 (#0)
* Trying 54.243.166.168...
* Adding handle: conn: 0x6c3be0
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x6c3be0) send_pipe: 1, recv_pipe: 0
* Connected to arcane-headland-2284.herokuapp.com (54.243.166.168) port 80 (#0)
> GET /stream HTTP/1.1
> User-Agent: curl/7.31.0
> Host: arcane-headland-2284.herokuapp.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=utf-8
< Date: Sat, 17 Aug 2013 16:57:24 GMT
* Server http-kit is not blacklisted
< Server: http-kit
< transfer-encoding: chunked
< Connection: keep-alive
<
* transfer closed with outstanding read data remaining
* Closing connection 0
curl: (18) transfer closed with outstanding read data remaining
The time is currently Sat Aug 17 16:57:24 UTC 2013 <-- this is the first chunk
Can anybody suggest why this is happening? HTTP streaming is supposed to be supported on Heroku's Cedar stack. The fact that the code runs correctly under foreman suggests it is something in Heroku's routing mesh causing it to break.
Live demo of the failing project: http://arcane-headland-2284.herokuapp.com/
This was due to a bug in http-kit which will be fixed shortly.
https://devcenter.heroku.com/articles/request-timeout may be relevant: "long-polling" requests like yours have to send data every 55 seconds or be terminated.

Log into a website to grab the data using RCurl

I want to log in to a website using RCurl and grab data from it (the data cannot be seen without logging in).
I want to export this (for example) "http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone" into R after logging in with RCurl. The issue is that I cannot log in using RCurl. I haven't tried this before, so I mostly referred to http://www.omegahat.org/RCurl/philosophy.html.
So here's what I tried. (Here, 'me#gmail.com' is my user ID and '9999' is my password; I just made them up.)
library(RJSONIO)
library(rjson)
library(RCurl)
appannie <- getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/.json?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone, userpwd = me#gmail.com:9999", verbose = TRUE)
But this gave me the message below:
About to connect() to www.appannie.com port 80 (#0)
* Trying 69.167.138.64... * connected
* Connected to www.appannie.com (69.167.138.64) port 80 (#0)
> GET /app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone HTTP/1.1
Host: www.appannie.com
Accept: */*
< HTTP/1.1 403 FORBIDDEN
< Server: nginx/1.1.19
< Date: Fri, 01 Mar 2013 23:41:32 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=10
< Vary: Accept-Encoding
< Vary: Cookie,Accept-Encoding
<
* Connection #0 to host www.appannie.com left intact
So I went back and read http://www.omegahat.org/RCurl/philosophy.html again, still didn't know what to do, and then tried this after seeing a similar question on Stack Overflow:
getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone", .opts = list(userpwd = "me#gmail.com:9999"))
But this gives me below output.
[1] ""
Can anyone give me a hint? (After a bunch of different tries, the website has started sending me warnings.)
This is most likely some sort of authentication issue, not anything you did wrong with RCurl.
You got through to the server, but either your login was incorrect, your account wasn't valid, or the data is not available via the API.
http://en.wikipedia.org/wiki/HTTP_403
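If the site uses a form-based login rather than HTTP basic auth (most websites do), the usual RCurl approach is to POST the login form with a cookie-enabled handle and then reuse that handle. A minimal sketch; the login URL and form field names below are assumptions and must be taken from the site's actual login page:

library(RCurl)

# cookie-enabled handle: cookiefile = "" turns on the cookie engine
curl <- getCurlHandle(cookiefile = "", followlocation = TRUE, verbose = TRUE)

# hypothetical login endpoint and field names; check the real form
postForm("https://www.appannie.com/account/login/",
         username = "me#gmail.com",
         password = "9999",
         style = "POST", curl = curl)

# the session cookie stored in the handle is reused for the data request
chart <- getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone",
                curl = curl)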

HTTP over TCP using Telnet/Hercules/raw sockets

I'm connecting to real-time data on a remote server as a client. I want to send the following to a server and keep the connection open. This is a 'push' protocol.
http://server.domain.com:80/protocol/dosomething.txt?POSTDATA=thePostData
I can call this in a browser and it's fine. However, if I try to use telnet directly in a windows command prompt, the prompt just exits.
GET protocol/dosomething.txt?POSTDATA=thePostData
The same is the case if I use Putty.exe and select Telnet as the protocol. I can't see a way to do this with Hercules at all, as I don't think the server will interpret the GET.
Is there any way I can do this?
Thanks.
You have to match the HTTP protocol (RFC 2616) to the letter if you want to use telnet. Try something like:
shell$ telnet www.google.com 80
Trying 173.194.43.50...
Connected to www.google.com (173.194.43.50).
Escape character is '^]'.
GET / HTTP/1.1
Host: www.google.com:80
Connection: close
HTTP/1.1 200 OK
Date: Tue, 11 Sep 2012 15:09:51 GMT
...
You need to type the following lines, including an empty line after the "Connection" line:
GET / HTTP/1.1
Host: www.google.com:80
Connection: close
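If typing the request interactively is awkward, the same exchange can be scripted over a raw socket. Here is a minimal sketch in R (www.google.com stands in for the real server; HTTP wants CRLF line endings, hence the sep argument):

con <- socketConnection("www.google.com", port = 80, open = "r+", blocking = TRUE)
writeLines(c("GET / HTTP/1.1",
             "Host: www.google.com:80",
             "Connection: close",
             ""),                      # blank line ends the request headers
           con, sep = "\r\n")
# read until the server closes the connection
while (length(line <- readLines(con, n = 1)) > 0) cat(line, "\n")
close(con)

For a push endpoint like the one in the question, you would drop the Connection: close header and keep reading; the connection then stays open for as long as the server keeps sending.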
