I have a very long list of radio streaming links in an Excel file. When I inspect most of these links through the Network tab of Chrome's developer tools, I can see response headers containing some information about the radio station: its name and the type of music it streams.
I want to know if there is any way to scrape this information (name, music type) automatically, please.
here is an example:
http://centova2.whsh4u.com:9007/stream?hash=1463479281647.mp3
Relevant headers:
icy-genre:Histoire - Culture - Musique
icy-name:Aquitaine Radio Diffusion
You didn't specify which tool you would like to use. In Python, to get the response headers you would do:
import requests
r = requests.get("https://www.google.co.uk")
print(r.headers)
Output:
{'Expires': '-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2018-01-12-10; expires=Sun, 11-Feb-2018 10:54:36 GMT; path=/; domain=.google.co.uk, NID=121=U2phVKlO3UmL_jlK02Qj6J5K_uo6SLe1-hSWZsIA0fjlB82hEDT7D_69JYk9NRnCTFfhpviKsB-wRgoQKEDHsq6q7Cf8IWynKWHopoYHPWa8IPNhBD9r5dLsweNm52jS; expires=Sat, 14-Jul-2018 10:54:36 GMT; path=/; domain=.google.co.uk; HttpOnly', 'Cache-Control': 'private, max-age=0', 'X-XSS-Protection': '1; mode=block', 'Alt-Svc': 'hq=":443"; ma=2592000; quic=51303431; quic=51303339; quic=51303338; quic=51303337; quic=51303335,quic=":443"; ma=2592000; v="41,39,38,37,35"', 'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Server': 'gws', 'Date': 'Fri, 12 Jan 2018 10:54:36 GMT', 'Content-Type': 'text/html; charset=ISO-8859-1'}
I imagine your data would be in there.
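For the radio links specifically, here is a minimal sketch, assuming the URLs have already been read out of the Excel file (for example with openpyxl or pandas) and using stream=True so the audio itself is not downloaded:
import requests

# Example URL taken from the question; in practice you would loop
# over every URL read from the Excel file.
url = "http://centova2.whsh4u.com:9007/stream?hash=1463479281647.mp3"

# stream=True returns as soon as the headers arrive,
# without pulling down the audio stream itself.
r = requests.get(url, stream=True, timeout=10)
name = r.headers.get("icy-name")    # e.g. "Aquitaine Radio Diffusion"
genre = r.headers.get("icy-genre")  # e.g. "Histoire - Culture - Musique"
r.close()  # drop the connection; only the headers were needed
print(name, genre)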
I am using ASP.NET Core 1.1.2 with OpenIdConnect to connect to a single sign-on server built on IdentityServer.
"Microsoft.AspNetCore.Authentication.OpenIdConnect": "1.1.2"
Right now I am having a weird issue that didn't start until yesterday.
The initial cookies that should be created before the redirect to the SSO server are not being created in browsers.
Using the developer console, I can see the Set-Cookie headers, but the cookies are not being stored.
Content-Length: 0
Date: Mon, 14 Jan 2019 18:50:10 GMT
Location: http://mysso.com/connect/authorize?client_id=8111797110116117109&redirect_uri=http%3A%...ZD7cNnuWSmAoGyk2kERmR4hemQKsP2OKNbABYvybQCrdCMggrggDuY-5ZXnCTFf3oG11cR4Eh5N3Uarh99MD1nvJZrO4WsWDO73OQrEjh-zK3AceJzjfB2GH0gKKw-51SpcUWNgSTbQe517
Server: Kestrel
Set-Cookie: .AspNetCore.Correlation.oidc.I3lU6aE3BFH_4uuJ6KlgbpFl6Dij_WC-nyhlbUfvAPI=N; expires=Mon, 14 Jan 2019 18:07:24 GMT; path=/; httponly
Set-Cookie: .AspNetCore.OpenIdConnect.Nonce.CfDJ8BWkCPQm5ElIof7iuryYpWDHYvyls6nYDr84XfQAIcLzg0ktLHIGOP7Tp_eqbvDOTdcQqnKIIogwMad9tWSy9v8BPnN8VUBucuz8qc9kv5Pkpe5aCg9oh6dgQD79a-w8Lc9haFm_tOEze1Wzna3XG7OzcGhw8kwyU5j3K_sK3Z7Y-u3cE_pey9DVbBzZkZStJXpoNjG_HWJHBjuqv7ADfCc91Oi83Ieuk7bBue8md1v2WqvSji3ziHkqyw9FKTV44Iw2Kg4o8Rf_3G-Q9ITNwr8=N; expires=Mon, 14 Jan 2019 18:07:24 GMT; path=/; httponly
X-Powered-By: ASP.NET
I checked whether the cookies were already expired when created, but they all had about 10 minutes left before expiry.
This issue is happening on all major browsers (Edge, Chrome, Firefox), and not just on my PC but on others too.
Configuration code
app.UseOpenIdConnectAuthentication(new OpenIdConnectOptions
{
Authority = Configuration["SSOConfig:ServerUrl"],
AuthenticationScheme = "oidc",
SignInScheme = AuthenticationScheme.Cookies,
RequireHttpsMetadata = false,
ClientId = Configuration["SSOConfig:ClientId"],
ClientSecret = Configuration["SSOConfig:ClientSecret"],
ResponseType = "code id_token",
Scope = { "openid", "offline_access" },
SaveTokens = false,
});
I tried adding a cookie manually, and that works:
HttpContextAccessor.HttpContext.Response.Cookies.Append("Test", "test");
It seems that, somehow, the cookies being set were already expired.
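That matches the headers quoted above: the Set-Cookie expiry (18:07:24) is earlier than the response's own Date (18:50:10), so browsers store the cookie and immediately evict it. A quick way to verify, as a hypothetical check in Python using the values quoted above:
from email.utils import parsedate_to_datetime

# Values copied from the failing response above
response_date = parsedate_to_datetime("Mon, 14 Jan 2019 18:50:10 GMT")
cookie_expiry = parsedate_to_datetime("Mon, 14 Jan 2019 18:07:24 GMT")

# Prints True: the cookie is already expired at the moment it is set
print(cookie_expiry < response_date)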
I had to upgrade the system to .NET Core 2. The difference I found between the two was in the Set-Cookie header: path=/signin-oidc; secure.
Date: Tue, 15 Jan 2019 10:53:04 GMT
Location: https://mysso.com/connect/authorize?client_id=8111797110243116117109&redirect_ur...
Server: Kestrel
Set-Cookie: .AspNetCore.Correlation.oidc.Iy3pTZ-akQm6BzLMCdBPLz1CAGTJ70QgQtjkY9Kvg1Y=N; expires=Tue, 15 Jan 2019 11:08:04 GMT; path=/signin-oidc; secure; httponly
Set-Cookie: .AspNetCore.OpenIdConnect.Nonce.CfDJ8Oct5aw6xUJOnpJ_-0Ep-nSLfWIXgaEiH7y-0IN9tx61lNrxFhgAzLvLlBQfOfBBegyRJrEsIZFi00iuUt90cJ_bMQI_1XTVr0SiBCAJ9wqR2682VrYe2IbjIrFuB9d-Mmu-ztw-O2Htzd8Z36ndD8zPsgSCY_RD6JYVRe4MTfFBQbDZRxMQ3rgB_ulvSZmshD7vB4gvgcsbLyiY2wVuKzVGEKgJxgq23nxzkNKkL-vHm6w_41D_rZI5_V9hDsfrShFuTViZNttAes1fmA2jMTQ=N; expires=Tue, 15 Jan 2019 11:08:04 GMT; path=/signin-oidc; secure; httponly
X-Powered-By: ASP.NET
I have a script which communicates between a NodeMCU and my server. It works well on my localhost and parses the response retrieved from my server when I send a GET request. The problem occurs when I upload it all to my website, where the transfer encoding is chunked: I am not able to retrieve the content, although the request is legitimate and correct. The code is written in Lua and runs on my NodeMCU device.
conn=net.createConnection(net.TCP, 0)
conn:on("connection",function(conn, payload)
conn:send("GET /mypath/node.php?id=1&update"..
" HTTP/1.1\r\n"..
"Host: www.mydomain.com\r\n"..
"Accept: */*\r\n"..
"User-Agent: Mozilla/4.0 (compatible; esp8266 Lua;)"..
"\r\n\r\n")
end)
conn:on("receive", function(conn, payload)
if string.find(payload, "UPDATE")~=nil then
node.restart()
end
conn:close()
conn = nil
end)
conn:connect(80,"www.mydomain.com")
Just to repeat: this GET request works, tested both manually and on localhost. The only problem is with the chunked content; I don't know how to parse it.
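For reference, the chunked format itself is simple to decode once the whole payload has been collected: each chunk is a hexadecimal length on its own line followed by that many bytes, and a zero-length chunk terminates the body. A minimal sketch in Python (illustrative only, not NodeMCU code):
def decode_chunked(body):
    # Each chunk is "<hex size>\r\n<data>\r\n";
    # a chunk of size 0 marks the end of the body.
    out = b""
    pos = 0
    while True:
        crlf = body.index(b"\r\n", pos)
        size = int(body[pos:crlf].split(b";")[0], 16)  # ";..." would be chunk extensions
        if size == 0:
            return out
        out += body[crlf + 2:crlf + 2 + size]
        pos = crlf + 2 + size + 2  # skip the data and its trailing \r\n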
Update: I managed to remove the chunked encoding by changing HTTP/1.1 to HTTP/1.0, but I still have a problem. Using this code:
conn:on("receive", function(conn, payload)
print(payload)
I get this response
HTTP/1.1 200 OK
Date: Tue, 09 Jan 2018 02:34:25 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: PHPSESSID=9m226vr20r4baa634bagk8k2k3; path=/
Connection: close
Content-Type: text/html; charset=utf-8
Update 2.
I have just created a file http.php containing only the text "php". I uploaded it both to localhost and to my domain, then accessed each from the NodeMCU in turn. The results were different.
This is the request
conn:send("GET /"..s.path.."/http.php"..
" HTTP/1.0\r\n"..
"Host: "..s.domain.."\r\n"..
"Accept: */*\r\n"..
"User-Agent: Mozilla/4.0 (compatible; esp8266 Lua;)"..
"\r\n\r\n")
end)
s.domain and s.path correspond to the different paths and domains on localhost and on my domain.
Result on domain
HTTP/1.1 200 OK
Date: Tue, 09 Jan 2018 03:09:28 GMT
Server: Apache
Connection: close
Content-Type: text/html; charset=UTF-8
Result on localhost
HTTP/1.1 200 OK
Date: Tue, 09 Jan 2018 03:08:48 GMT
Server: Apache/2.4.27 (Win64) PHP/7.0.23
X-Powered-By: PHP/7.0.23
Content-Length: 3
Connection: close
Content-Type: text/html; charset=UTF-8
php
As you can see, localhost shows the content "php", while the domain shows only the headers. When I request a file which does not exist, the domain does show me HTML code.
I'm using the following code to piece the payload together. I'm wondering, anyway, why your response from the server is missing the Content-Length header.
conn:on("receive", function(client, payload)
-- Inspired by https://github.com/marcoskirsch/nodemcu-httpserver/blob/master/httpserver.lua
-- Collect data packets until the size of HTTP body meets the Content-Length stated in header
if payload:find("Content%-Length:") or bBodyMissing then
if fullPayload then fullPayload = fullPayload .. payload else fullPayload = payload end
if (tonumber(string.match(fullPayload, "%d+", fullPayload:find("Content%-Length:")+16)) > #fullPayload:sub(fullPayload:find("\r\n\r\n", 1, true)+4, #fullPayload)) then
bBodyMissing = true
return
else
payload = fullPayload
fullPayload, bBodyMissing = nil
end
end
if (bBodyMissing == nil) then
local _, headerEnd = payload:find("\r\n\r\n")
local body = payload:sub(headerEnd + 1)
print (body)
end
end)
Whenever I open the URL https://scholar.google.com/citations?user=N7m4vIQAAAAJ&hl=en in a private window of Safari or Google Chrome, Google gives an error.
It happens only on the first request in private browsing mode.
Does anybody know why this happens only in this specific environment?
This has been happening for the past 3 days.
-- the error message (a screen capture was also attached)
Server Error
We're sorry but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue.
Please try again later.
--- added
The HTTP response headers include:
Cache-Control: no-cache, must-revalidate
Content-Encoding: gzip
Content-Type: text/html; charset=UTF-8
Date: Mon, 16 Nov 2015 19:35:39 GMT
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Pragma: no-cache
Server: citations
Set-Cookie: NID=73=eF98qod1NpYg7nb03RUToiSiacFgqNoZxQ4CuzqwGlQn53SoR7rHlzO0OExsmYkpRazROCQ3WqKoCsWKFPxp8dZr5pBra6nD1HPcxWUILl9gVAf5Q7GSQc3B0O3TP4gu; expires=Tue, 17-May-2016 19:35:39 GMT; path=/; domain=.google.com; HttpOnly
X-Firefox-Spdy: h2
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
p3p: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
x-content-type-options: nosniff
Fixed the issue by sending cookies with the requests. REF: PHP cURL how to add the User Agent value OR overcome the Servers blocking cURL requests?
I use PHP scripts to retrieve the pages and set some cookie options.
A code snippet is
$curl = curl_init($url);
$dir = dirname(__FILE__);
// One cookie file per visitor, keyed by client IP
$config['cookie_file'] = $dir . '/cookies/' . md5($_SERVER['REMOTE_ADDR']) . '.txt';
curl_setopt($curl, CURLOPT_COOKIEFILE, $config['cookie_file']); // send stored cookies
curl_setopt($curl, CURLOPT_COOKIEJAR, $config['cookie_file']);  // save received cookies
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($curl);
curl_close($curl);
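For comparison, the same idea, persisting cookies across requests, as a hypothetical sketch in Python with requests:
import requests

# A Session keeps cookies between requests, playing the same role as
# CURLOPT_COOKIEFILE/CURLOPT_COOKIEJAR in the PHP snippet above.
session = requests.Session()
url = "https://scholar.google.com/citations?user=N7m4vIQAAAAJ&hl=en"
first = session.get(url)   # the server sets its cookies here
second = session.get(url)  # ...and receives them back on the retry
print(second.status_code)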
I am trying to simply download a page with python.
http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770
If I check the response code from the server, I get 200:
import urllib2
url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
file_pointer = urllib2.urlopen(url)
print file_pointer.getcode()
However, if I check the URL, I get the redirect page:
file_pointer.geturl()
I have tried urllib, urllib2, requests, and mechanize, all separately, and cannot get any of them to work. I am obviously missing something, because other people in the office have code that works. SOS.
Also, here is more information provided by requests:
import requests
url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
proxy = { 'https': '200.35.152.93:1212'}
response = requests.get(url, proxies=proxy)
send: 'GET /CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770 HTTP/1.1\r\nHost: webapps.rrc.state.tx.us\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.7.0 CPython/2.7.10 Windows/7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date: Wed, 26 Aug 2015 19:33:12 GMT
header: Server: Apache/2.2.15 (Red Hat)
header: Location: http://www.rrc.state.tx.us/site-policies/railroad-commission-of-texas-site-policies/?method=cmplP4FormPdf&packetSummaryId=97770
header: Content-Length: 405
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
send: 'GET /site-policies/railroad-commission-of-texas-site-policies/?method=cmplP4FormPdf&packetSummaryId=97770 HTTP/1.1\r\nHost: www.rrc.state.tx.us\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.7.0 CPython/2.7.10 Windows/7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: private
header: Content-Type: text/html; charset=utf-8
header: server: one
header: Date: Wed, 26 Aug 2015 19:33:11 GMT
header: Content-Length: 41216
The problem is that this specific site looks at your User-Agent header, and since you're a Python client, it refuses to serve you the PDF and redirects you instead.
Therefore you need to mask your user agent.
Look at the following example:
import urllib2

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
req = urllib2.Request(url)
req.add_unredirected_header('User-Agent', 'Mozilla/5.0')
file_pointer = urllib2.urlopen(req)
print file_pointer.getcode()
print file_pointer.geturl()
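As a follow-up, the body can then be read and written to disk (the file name here is hypothetical):
# Save the response body; with the masked User-Agent the server
# returns the PDF directly instead of redirecting.
data = file_pointer.read()
with open('cmplP4Form_97770.pdf', 'wb') as f:
    f.write(data)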
Okay, so all one has to do with the requests module is to disable redirection. Here is my working code, which also uses a proxy server:
import requests
url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
proxy = { 'https': '200.35.152.93:1212'}
r = requests.get(url, proxies=proxy,allow_redirects=False)
print r.url
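As a quick sanity check on what came back (a hypothetical addition):
# A 200 status means the PDF was returned directly; a 302 means the
# server still tried to redirect, with the target in the Location header.
print r.status_code
print r.headers.get('Location')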
I'm making the following request to a local website running in IIS
var httpRequestMessage = new HttpRequestMessage();
httpRequestMessage.RequestUri = new Uri("http://localhost:8081/");
httpRequestMessage.Method = HttpMethod.Get;
var response = new HttpClient().SendAsync(httpRequestMessage).Result;
This produces the following response headers:
HTTP/1.1 200 OK
Accept-Ranges: bytes
Date: Mon, 03 Jun 2013 22:34:25 GMT
ETag: "50c7472eb342ce1:0"
Server: Microsoft-IIS/8.0
X-Powered-By: ASP.NET
An identical request made via Fiddler produces the following response headers (the differences being the extra Content-Type, Last-Modified, and Content-Length entries):
HTTP/1.1 200 OK
Content-Type: text/html
Last-Modified: Fri, 26 Apr 2013 19:20:58 GMT
Accept-Ranges: bytes
ETag: "50c7472eb342ce1:0"
Server: Microsoft-IIS/8.0
X-Powered-By: ASP.NET
Date: Mon, 03 Jun 2013 22:29:34 GMT
Content-Length: 10
Why is there a difference in response headers?
Am I using HttpClient correctly (aside from the fact I am calling Send synchronously)?
TL;DR;
To access all response headers you need to read both HttpResponseMessage.Headers and HttpResponseMessage.Content.Headers properties.
Long(er) answer:
This, basically:
var response = new HttpClient().GetAsync("http://uri/").Result;
var allHeaders = response.Headers.Union(response.Content.Headers);
foreach (var header in allHeaders)
{
// do stuff
}
I see two issues with this:
The Headers property is not appropriately named: it should really be SomeHeaders or AllHeadersExceptContentHeaders. (I mean, really, when you see a property named Headers, do you expect it to return all headers or some headers? I am pretty sure they are in violation of their own framework design guidelines on this one.)
The MSDN page does not mention at any point that this is a subset of all headers, or that developers should also inspect Content.Headers.