Determining the size of a webpage response using Scala - http

I have an assignment where I need to determine how much cache space will be required to store the contents of a webpage, and I have to do it all in Scala, which I'm in the process of learning. I know I can get the required information with a HTTP HEAD request, but from what I've read it seems I need an external library for that.
Is it possible to download the HTTP header without using an HTTP request and extract the required information using only Scala (no calls to Java code)?

If you need not use 3rd party libraries, then the solution might be to use Source.fromURL to get the page and then compute its size.
Hope this helps ;)

Without your restriction that only Scala may be used I would have said: use Async-Http-Client's AsyncHandler and stop as soon as onHeadersReceived has been called.
Without external libraries, you could try to mimic what a HTTP client is doing. Here's a sample telnet session:
$ telnet www.google.com 80
HEAD / Trying 173.194.40.20...
Connected to www.google.com.
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.google.com
HTTP/1.1 302 Found
Location: http://www.google.ch/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=c2b92507b9088226:FF=0:TM=1361870408:LM=1361870408:S=mbY_Qws86Z75gPAk; expires=Thu, 26-Feb-2015 09:20:08 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=dAFEWKT5vk9HWP1sTF6Oo49jv0sRV7_49ewSgD3fYRiTjHqlUasKl7Jz86SnJhtS-o9zU9raxwCLhdfvEwdwl9imRwONMBTDBKDXtJhFufLCnAoOKgDQetv0A5FTN3Da; expires=Wed, 28-Aug- 2013 09:20:08 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Date: Tue, 26 Feb 2013 09:20:08 GMT
Server: gws
Content-Length: 218
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
(What I typed was HEAD / HTTP/1.1, Host: www.google.com, and an additional return.)
You could try to use the JVM's Socket class to open a TCP connection to your server and send, as in the example above, the HEAD request yourself.

Related

HTTP GET request always returns 301

I'm trying to obtain the HTML dump of some RFC's from IETF website, via a simple GET request. However, it responds with status code 301. I'm making use of netcat to simulate the HTTP GET request with the following command :
$ printf 'GET /html/rfc3986 HTTP/1.1\r\nHost: tools.ietf.org\r\nConnection: close\r\n\r\n' | nc tools.ietf.org 80
The following reply is obtained as a result of the above command :
HTTP/1.1 301 Moved Permanently
Date: Wed, 09 Sep 2020 15:36:36 GMT
Server: Apache/2.2.22 (Debian)
Location: https://tools.ietf.org/html/rfc3986
Vary: Accept-Encoding
Content-Length: 323
Connection: close
Content-Type: text/html; charset=iso-8859-1
X-Pad: avoid browser bug
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved here.</p>
<hr>
<address>Apache/2.2.22 (Debian) Server at tools.ietf.org Port 80</address>
</body></html>
However, if I try to send a HTTP/1.0 based HEAD request to the Location value determined in the above reply, I get status 404 in reply. I made use of HEAD method just to check the status code of the reply.
Command :
printf 'HEAD https://tools.ietf.org/html/rfc3986 HTTP/1.0\r\n\r\n' | nc tools.ietf.org 80
Reply:
HTTP/1.1 404 Not Found
Date: Wed, 09 Sep 2020 16:32:18 GMT
Server: Apache/2.2.22 (Debian)
Vary: accept-language,accept-charset,Accept-Encoding
Accept-Ranges: bytes
Connection: close
Content-Type: text/html; charset=iso-8859-1
Content-Language: en
Expires: Wed, 09 Sep 2020 16:32:18 GMT
Is there a mistake in the way I'm making use of GET method to obtain the results?
You are sending a plain text request to port 80, so the URL you are trying is effectively http://tools.ietf.org/html/rfc3986
The response is telling you to instead request https://tools.ietf.org/html/rfc3986. That's not a different path on the same server, but a full URL.
The difference is that it begins https meaning you need to make a TLS-secured connection on port 443.
That's not going to be possible with a trivial use of netcat, so you're better off using an HTTP client like curl or wget

is it a bug to send response gzip-compressed to clients that doesn't specify Accept-Encoding: gzip?

is it a bug in the server if it sends content gzip-compressed to clients that did not specify Accept-Encoding: gzip ? is it breaking the http specs? or is it legal?
i'm curious because https://www.amazon.com always sends content gzip-compressed, regardless of the Accept-Encoding header, as a simple test to confirm:
$ curl https://www.amazon.com
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
$ curl https://www.amazon.com -I
HTTP/2 405
content-type: text/html; charset=UTF-8
server: Server
date: Sat, 03 Nov 2018 11:27:35 GMT
set-cookie: skin=noskin; path=/; domain=.amazon.com
strict-transport-security: max-age=47474747; includeSubDomains; preload
x-amz-id-1: 2M3HZHHA9J21D3MTHH4K
allow: POST, GET
vary: Accept-Encoding,User-Agent,X-Amazon-CDN-Cache
content-encoding: gzip
x-amz-rid: 2M3HZHHA9J21D3MTHH4K
x-frame-options: SAMEORIGIN
x-cache: Error from cloudfront
via: 1.1 1cc4305a3ce000ca199328864ca1c98e.cloudfront.net (CloudFront)
x-amz-cf-id: OKz61IdKmCBfC97pPg-zmDhQnJzK3THXL2iYwegU5EtDaRf6yjBGzw==
curl complains that it's recieving binary data here because it's not responding with HTML, but gzip-compressed html, which is binary data. to actually see the html, add the --compressed argument, which tells curl to add the header Accept-Encoding: gzip, deflate and automatically decompress the response.
A request without an Accept-Encoding header field implies that the user agent has no preferences regarding content-codings. Although this allows the server to use any content-coding in a response, it does not imply that the user agent will be able to correctly process all encodings.
-- https://greenbytes.de/tech/webdav/rfc7231.html#rfc.section.5.3.4.p.4

Is the version of HTTP either 1.0 or 1.1 defined by webserver? How works the HTTP protocol definition?

I have a quick question but in advance I've read the RFC 2616 Chapter 14.22 about Host and HTTP Header but I still not understand where in httpd.conf or configuration file of a webserver should be changed? Please correct me if I'm wrong.
Look at following two HTTP GET I did to an Apache. The first one is GET for HTTP 1.0 , the other one is GET for HTTP 1.1. See the output:
HTTP/1.0 200 OK
Date: Thu, 24 Oct 2013 03:46:22 GMT
Server: Apache/1.3.41 (Unix) mod_gzip/1.3.26.1a PHP/5.2.9 mod_throttle/3.1.2 mod_psoft_traffic/0.2 mod_ssl/2.8.31 OpenSSL/0.9.8b
Vary: *
Last-Modified: Fri, 10 Aug 2012 20:22:30 GMT
ETag: "17c815b-3b-50256d86"
Accept-Ranges: bytes
Content-Length: 59
Connection: close
Content-Type: text/html
<html>
<body>
<center>webli7</center>
</body>
</html>
HTTP/1.1 400 Bad Request
Date: Thu, 24 Oct 2013 04:04:40 GMT
Server: Apache/1.3.41 (Unix) mod_gzip/1.3.26.1a PHP/5.2.9 mod_throttle/3.1.2 mod_psoft_traffic/0.2 mod_ssl/2.8.31 OpenSSL/0.9.8b
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1
16e
The HTTP protocol version is decided dynamicaly, not through configuration files. The client send a request specifying the highest protocol version that its support. Then, the server must respond with either the version requested by the client, or any earlier version that it prefers.
Since Apache does support HTTP/1.1, it should therefore match exactly the version provided by the client.
There exist a flag that you may set in Apache's config to force Apache to use HTTP/1.0 in certain situations, even though the browser requested HTTP/1.1. This is used to fix bugs in HTTP/1.1 handling of some very old browser. Today, you should not need to play with this flag.
As for your error, I would suggest that you make sure that your GET does provide the Host: header. This header is required in HTTP/1.1, yet optional in HTTP/1.0, and having it missing would certainly result in a 400 error.

Why is Connection: keep-alive still being specified in http headers (isn't it deprecated)?

According to "HTTP: The Definitive Guide", using
Connection: keep-alive
to specify a persistent connection is deprecated in HTTP/1.1, since HTTP/1.1 specifies that connections are persistent by default and must be closed manually by sending
Connection: close
Thus, my simple assumption is that "Connection: keep-alive" shouldn't really be used anymore. However, it still seems alive and well. For example, keep-alive is being returned in the following query:
curl -I https://foursquare.com
HTTP/1.1 200 OK
Server: nginx/0.8.52
Date: Thu, 11 Aug 2011 21:15:45 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Expires: Thu, 11 Aug 2011 21:15:45 UTC
Set-Cookie: XSESSIONID=w19~kqtn4bpqmfq51p8qolstpk6ti;Path=/;Secure;HttpOnly
Set-Cookie: LOCATION=49.25::-123.13330078125::Hockeytown::CA;Path=/;Secure
Set-Cookie: bbhive=OQ32XATE0OQAEVCY0IVSWUDPQ1A2GT
Content-Length: 38815
Cache-Control: no-cache, private, no-store
Pragma: no-cache
My question is: Why is Connection: keep-alive still being specified in HTTP headers?
A corollary question is: Are there still (clients, servers, proxies, etc) that still only speak HTTP/1.0 and its variants, or are most such entities on HTTP/1.1 as of 2011?
Here are my working hypotheses:
1) HTTP/1.0 is no longer in use, b/c that was "many years" ago
2) Given (1), keep-alive shouldn't be used anymore, but is purely for vestigial reasons (that is, certain technologies haven't bothered to remove it, or keep it around as voodoo code, etc.)
If (1) is incorrect, and HTTP/1.0 is still in use, then sure it seems plausible to keep using keep-alive, despite follow-up questions on HTTP 1.0-1.1 interop.
Thanks in advance for any insights shared!
HTTP/1.0 have no headers like Connection, but there is many different implementation of HTTP/1.0 and HTTP/1.1.
so Connection: keep-alive is used 'Just in case'

Content-Length header with HEAD requests?

The http spec says about the HEAD request:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request.
Should the response to a HEAD request contain a Content-Length header? Should it be the value which would be returned on a GET request, even if there is no response body? Or should the Content-Length be 0?
To me it looks like the HTTP 1.1 RFC is pretty specific:
The Content-Length
entity-header field indicates the size of the entity-body, in decimal
number of OCTETs, sent to the recipient or, in the case of the HEAD
method, the size of the entity-body that would have been sent had
the request been a GET.
Section 14.13 of the HTTP/1.1 spec detailed the Content-Length header, and says this:
Applications SHOULD use this field to
indicate the transfer-length of the
message-body, unless this is
prohibited by the rules in section
4.4.
The word 'SHOULD' has a very specific meaning in RFCs:
SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
So, you may not always see a Content-Length. Typically you might not see it for any content which is dynamically generated, since that might be too expensive to service an exploratory HEAD request. For example, a HEAD request to Apache for a static file will have a Content-Length, but a request for a PHP script may not.
For example, try this very website...
telnet stackoverflow.com 80
HEAD / HTTP/1.0
Host:stackoverflow.com
HTTP/1.1 200 OK
Date: Mon, 11 Jan 2016 10:58:25 GMT
Content-Type: text/html; charset=utf-8
Connection: close
Set-Cookie: __cfduid=c2eb4742a1e02d89cab0402220736c0bd1452509905; expires=Tue, 10-Jan-17 10:58:25 GMT; path=/; domain=.stackoverflow.com; HttpOnly
Cache-Control: public, no-cache="Set-Cookie", max-age=36
Expires: Mon, 11 Jan 2016 10:59:02 GMT
Last-Modified: Mon, 11 Jan 2016 10:58:02 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
X-Request-Guid: 487e80bc-3783-4cfd-d883-a3bc84253234
Set-Cookie: prov=8dc24306-c067-45eb-bf5d-cffa855c2b03; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Server: cloudflare-nginx
CF-RAY: 26303c15f8e035a2-LHR
No content-length there.
Yes, the Content-Length of a HEAD response SHOULD, but not always does (see #Paul's answer) include the Content-Length value of a GET response:
Stack Overflow does:
> telnet stackoverflow.com 80
HEAD / HTTP/1.1
Host: stackoverflow.com
HTTP/1.1 200 OK
Cache-Control: public, max-age=60
Content-Length: 362245 <--------
Content-Type: text/html; charset=utf-8
Expires: Mon, 04 Oct 2010 11:51:49 GMT
Last-Modified: Mon, 04 Oct 2010 11:50:49 GMT
Vary: *
Date: Mon, 04 Oct 2010 11:50:49 GMT
Google doesn't:
> telnet www.google.com 80
HEAD / HTTP/1.1
Host: www.google.ie
HTTP/1.1 200 OK
Date: Mon, 04 Oct 2010 11:55:36 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Server: gws
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked
The HTTP-spec at W3C states:
If the new field values indicate that the cached entity differs from the current entity (as would be indicated by a change in Content-Length, ...
Which (to me) means it should hold the "correct" value as you would in a GET response.
Contra the accepted answer, section 4.3.2 of RFC 7231 states:
The server SHOULD send the same header fields in response to a HEAD request as it would have sent if the request had been a GET, except that the payload header fields (Section 3.3)
—which is to say, Content-Length, Content-Range, Trailer, and Transfer-Encoding—
MAY be omitted.
This is even weaker than the note on SHOULD in Paul Dixon's answer:
MAY This word, or the adjective "OPTIONAL", mean that an item is
truly optional. One vendor may choose to include the item because a
particular marketplace requires it or because the vendor feels that
it enhances the product while another vendor may omit the same item.
So the real answer is, you don't need to include Content-Length, but if you do, you should give the correct value.

Resources