Is it a bug to send a gzip-compressed response to clients that don't specify Accept-Encoding: gzip? - http

Is it a bug in the server if it sends content gzip-compressed to clients that did not specify Accept-Encoding: gzip? Does it break the HTTP specs, or is it legal?
I'm curious because https://www.amazon.com always sends content gzip-compressed, regardless of the Accept-Encoding header. As a simple test to confirm:
$ curl https://www.amazon.com
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
$ curl https://www.amazon.com -I
HTTP/2 405
content-type: text/html; charset=UTF-8
server: Server
date: Sat, 03 Nov 2018 11:27:35 GMT
set-cookie: skin=noskin; path=/; domain=.amazon.com
strict-transport-security: max-age=47474747; includeSubDomains; preload
x-amz-id-1: 2M3HZHHA9J21D3MTHH4K
allow: POST, GET
vary: Accept-Encoding,User-Agent,X-Amazon-CDN-Cache
content-encoding: gzip
x-amz-rid: 2M3HZHHA9J21D3MTHH4K
x-frame-options: SAMEORIGIN
x-cache: Error from cloudfront
via: 1.1 1cc4305a3ce000ca199328864ca1c98e.cloudfront.net (CloudFront)
x-amz-cf-id: OKz61IdKmCBfC97pPg-zmDhQnJzK3THXL2iYwegU5EtDaRf6yjBGzw==
curl complains that it's receiving binary data here because the server isn't responding with plain HTML but with gzip-compressed HTML, which is binary data. To actually see the HTML, add the --compressed argument, which tells curl to send the header Accept-Encoding: gzip, deflate and to automatically decompress the response.

A request without an Accept-Encoding header field implies that the user agent has no preferences regarding content-codings. Although this allows the server to use any content-coding in a response, it does not imply that the user agent will be able to correctly process all encodings.
-- https://greenbytes.de/tech/webdav/rfc7231.html#rfc.section.5.3.4.p.4
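In other words: when the request carries no Accept-Encoding header at all, the server is allowed to pick a content-coding, so the behaviour above is not a spec violation, though the server can't assume every client will cope with it. A defensive client therefore checks Content-Encoding on the response and decompresses when needed. A rough Scala sketch of that handling (the URL is the one from the question; amazon.com may reject clients without a browser User-Agent, so treat it as illustrative, with error handling omitted):

import java.io.InputStream
import java.net.URL
import java.util.zip.GZIPInputStream
import scala.io.Source

object FetchMaybeGzipped {
  def main(args: Array[String]): Unit = {
    // No Accept-Encoding header is sent, so the server is free to pick a coding.
    val conn = new URL("https://www.amazon.com/").openConnection()
    val raw: InputStream = conn.getInputStream
    // Wrap in GZIPInputStream only if the response declares gzip.
    val body: InputStream =
      if (Option(conn.getContentEncoding).exists(_.equalsIgnoreCase("gzip")))
        new GZIPInputStream(raw)
      else
        raw
    val html = Source.fromInputStream(body, "UTF-8").mkString
    println(html.take(200))
  }
}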

Related

Loadrunner - Getting message in HTTP response "HTML parsing not performed for Content-Type "application/xml""

The server I work with sends and receives content of type "application/xml".
In my init section I added the line below so that it is automatically added to all my request headers:
web_add_auto_header("Content-Type","application/xml");
When I run the script, the response headers show the correct content type:
351-byte response headers for "http://172.29.67.68/svc/bw/cti/monitor/event/bw_perfuser1000_60a439f7-599d-4fe1-baa6-598391312954" (RelFrameId=1, Internal ID=5)
HTTP/1.1 200 OK\r\n
Date: Mon, 11 Mar 2019 18:20:09 GMT\r\n
Content-Length: 681\r\n
Content-Type: application/xml\r\n
X-Frame-Options: SAMEORIGIN\r\n
Expires: Thu, 01 Jan 1970 00:00:00 GMT\r\n
Cache-Control: no-cache, private, must-revalidate, max-stale=0, post-check=0, pre-check=0, no-store\r\n
Pragma: no-cache\r\n
Keep-Alive: timeout=15, max=96\r\n
Connection: Keep-Alive\r\n
but I get this message:
HTML parsing not performed for Content-Type "application/xml" ("ParseHtmlContentType" Run-Time Setting is "TEXT").
To fix this issue, I need to add the line below before each request:
web_add_header("Content-Type","application/xml");
Can anyone please explain why I need to explicitly mention the content-type before each request although I used the web_add_auto_header() function?
In the HTTP protocol you need to specify the request header fields on each HTTP request. For details on HTTP headers, refer to the Wikipedia article on HTTP header fields.

Why does curl repeat headers in the output?

Options I used:
-I, --head
(HTTP/FTP/FILE) Fetch the HTTP-header only! HTTP-servers feature
the command HEAD which this uses to get nothing but the header
of a document. When used on an FTP or FILE file, curl displays
the file size and last modification time only.
-L, --location
(HTTP/HTTPS) If the server reports that the requested page has moved to a different location (indicated with a Location: header and a 3XX response code), this option will make curl redo the request on the new place. If used together with -i, --include or -I, --head, headers from all requested pages will be shown. When authentication is used, curl only sends its credentials to the initial host. If a redirect takes curl to a different host, it won't be able to intercept the user+password. See also --location-trusted on how to change this. You can limit the amount of redirects to follow by using the --max-redirs option.
When curl follows a redirect and the request is not a plain GET (for example POST or PUT), it will do the following request with a GET if the HTTP response was 301, 302, or 303. If the response code was any other 3xx code, curl will re-send the following request using the same unmodified method.
You can tell curl to not change the non-GET request method to GET after a 30x response by using the dedicated options for that: --post301, --post302 and --post303.
-v, --verbose
Be more verbose/talkative during the operation. Useful for debugging and seeing what's going on
"under the hood". A line starting with '>' means "header data" sent by curl, '<' means "header data"
received by curl that is hidden in normal cases, and a line starting with '*' means additional info
provided by curl.
Note that if you only want HTTP headers in the output, -i, --include might be the option you're
looking for.
If you think this option still doesn't give you enough details, consider using --trace or --trace-ascii instead.
This option overrides previous uses of --trace-ascii or --trace.
Use -s, --silent to make curl quiet.
Below is the output that I'm wondering about. In the response containing the redirect (301), all the headers are displayed twice, but only one of each duplicate pair has the < in front of it. How am I supposed to interpret that?
$ curl -ILv http://www.mail.com
* Rebuilt URL to: http://www.mail.com/
* Trying 74.208.122.4...
* Connected to www.mail.com (74.208.122.4) port 80 (#0)
> HEAD / HTTP/1.1
> Host: www.mail.com
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
HTTP/1.1 301 Moved Permanently
< Date: Sun, 28 May 2017 22:02:16 GMT
Date: Sun, 28 May 2017 22:02:16 GMT
< Server: Apache
Server: Apache
< Location: https://www.mail.com/
Location: https://www.mail.com/
< Vary: Accept-Encoding
Vary: Accept-Encoding
< Connection: close
Connection: close
< Content-Type: text/html; charset=iso-8859-1
Content-Type: text/html; charset=iso-8859-1
<
* Closing connection 0
* Issue another request to this URL: 'https://www.mail.com/'
* Trying 74.208.122.4...
* Connected to www.mail.com (74.208.122.4) port 443 (#1)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
* Server certificate: *.mail.com
* Server certificate: thawte SSL CA - G2
* Server certificate: thawte Primary Root CA
> HEAD / HTTP/1.1
> Host: www.mail.com
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Sun, 28 May 2017 22:02:16 GMT
Date: Sun, 28 May 2017 22:02:16 GMT
< Server: Apache
Server: Apache
< Vary: X-Forwarded-Proto,Host,Accept-Encoding
Vary: X-Forwarded-Proto,Host,Accept-Encoding
< Set-Cookie: cookieKID=kid%40autoref%40mail.com; Domain=.mail.com; Expires=Tue, 27-Jun-2017 22:02:16 GMT; Path=/
Set-Cookie: cookieKID=kid%40autoref%40mail.com; Domain=.mail.com; Expires=Tue, 27-Jun-2017 22:02:16 GMT; Path=/
< Set-Cookie: cookiePartner=kid%40autoref%40mail.com; Domain=.mail.com; Expires=Tue, 27-Jun-2017 22:02:16 GMT; Path=/
Set-Cookie: cookiePartner=kid%40autoref%40mail.com; Domain=.mail.com; Expires=Tue, 27-Jun-2017 22:02:16 GMT; Path=/
< Cache-Control: no-cache, no-store, must-revalidate
Cache-Control: no-cache, no-store, must-revalidate
< Pragma: no-cache
Pragma: no-cache
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Set-Cookie: JSESSIONID=F0BEF03C92839D69057FFB57C7FAA789; Path=/mailcom-webapp/; HttpOnly
Set-Cookie: JSESSIONID=F0BEF03C92839D69057FFB57C7FAA789; Path=/mailcom-webapp/; HttpOnly
< Content-Language: en-US
Content-Language: en-US
< Content-Length: 85237
Content-Length: 85237
< Connection: close
Connection: close
< Content-Type: text/html;charset=UTF-8
Content-Type: text/html;charset=UTF-8
<
* Closing connection 1
Best guess: with -v you tell curl to be verbose (send debug info) to stderr; with -I you tell curl to dump headers to stdout. Your shell, by default, combines stdout and stderr. Separate stdout and stderr and you'll avoid the confusion:
curl -ILv http://www.mail.com >stdout.log 2>stderr.log ; cat stdout.log
Use:
curl -ILv http://www.mail.com 2>&1 | grep '^[<>\*].*$'
When curl is called with the verbose flag, it sends the verbose output to stderr instead of stdout. The above command redirects stderr to stdout (2>&1), then pipes the combined output to grep and uses the regex above to return only the lines that begin with *, <, or >. All other lines (including the duplicates you were first concerned with) are filtered out.

Determining the size of a webpage response using Scala

I have an assignment where I need to determine how much cache space will be required to store the contents of a webpage, and I have to do it all in Scala, which I'm in the process of learning. I know I can get the required information with an HTTP HEAD request, but from what I've read it seems I need an external library for that.
Is it possible to download the HTTP headers and extract the required information using only Scala (no external libraries and no calls to Java code)?
If you'd rather not use third-party libraries, the solution might be to use Source.fromURL to get the page and then compute its size, for example:
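A minimal sketch of that idea, assuming the number of bytes in the downloaded body is an acceptable approximation of the cache space needed (it ignores headers and transfer overhead):

import scala.io.Source

object PageSize {
  def main(args: Array[String]): Unit = {
    val url = args.headOption.getOrElse("https://www.example.com/")
    // ISO-8859-1 maps every byte to exactly one character,
    // so the string length equals the body size in bytes.
    val source = Source.fromURL(url, "ISO-8859-1")
    try {
      val size = source.mkString.length
      println(s"$url -> $size bytes (body only)")
    } finally source.close()
  }
}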
Hope this helps ;)
Without your restriction that only Scala may be used I would have said: use Async-Http-Client's AsyncHandler and stop as soon as onHeadersReceived has been called.
Without external libraries, you could try to mimic what an HTTP client is doing. Here's a sample telnet session:
$ telnet www.google.com 80
HEAD / Trying 173.194.40.20...
Connected to www.google.com.
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.google.com
HTTP/1.1 302 Found
Location: http://www.google.ch/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=c2b92507b9088226:FF=0:TM=1361870408:LM=1361870408:S=mbY_Qws86Z75gPAk; expires=Thu, 26-Feb-2015 09:20:08 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=dAFEWKT5vk9HWP1sTF6Oo49jv0sRV7_49ewSgD3fYRiTjHqlUasKl7Jz86SnJhtS-o9zU9raxwCLhdfvEwdwl9imRwONMBTDBKDXtJhFufLCnAoOKgDQetv0A5FTN3Da; expires=Wed, 28-Aug-2013 09:20:08 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Date: Tue, 26 Feb 2013 09:20:08 GMT
Server: gws
Content-Length: 218
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
(What I typed was HEAD / HTTP/1.1, Host: www.google.com, and an additional return.)
You could try to use the JVM's Socket class to open a TCP connection to your server and send, as in the example above, the HEAD request yourself.
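A sketch of that socket-based approach in Scala, using only java.net.Socket (for HTTPS you would need an SSLSocket instead, and real code should handle errors and folded headers):

import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import java.net.Socket

object HeadRequest {
  def main(args: Array[String]): Unit = {
    val host = "www.google.com"
    val socket = new Socket(host, 80)
    val out = new PrintWriter(socket.getOutputStream)
    val in = new BufferedReader(new InputStreamReader(socket.getInputStream))

    // A minimal HEAD request, terminated by a blank line.
    out.print(s"HEAD / HTTP/1.1\r\nHost: $host\r\nConnection: close\r\n\r\n")
    out.flush()

    // Print header lines until the blank line that ends them.
    Iterator.continually(in.readLine())
      .takeWhile(line => line != null && line.nonEmpty)
      .foreach(println)

    socket.close()
  }
}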

HTTP 500 error in wget

Take a look at this page:
http://www.ptmytrade.com/product.asp?id=61363
It's loading fine (at least here). Now I would like to grab it with wget.
$ wget http://www.ptmytrade.com/product.asp?id=61363 --debug
DEBUG output created by Wget 1.12 on linux-gnu.
--2011-05-21 18:24:51-- http://www.ptmytrade.com/product.asp?id=61363
Resolving www.ptmytrade.com... 205.209.150.134
Caching www.ptmytrade.com => 205.209.150.134
Connecting to www.ptmytrade.com|205.209.150.134|:80... connected.
Created socket 3.
Releasing 0x0890e260 (new refcount 1).
---request begin---
GET /product.asp?id=61363 HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: www.ptmytrade.com
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 500 Internal Server Error
Connection: keep-alive
Date: Sat, 21 May 2011 16:24:56 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Content-Length: 471822
Content-Type: text/html
Set-Cookie: ASPSESSIONIDSCACCAQA=FOCCMJODFHHMOKNKPAIHJCIL; path=/
Cache-control: private
---response end---
500 Internal Server Error
Stored cookie www.ptmytrade.com -1 (ANY) / <session> <insecure> [expiry none] ASPSESSIONIDSCACCAQA FOCCMJODFHHMOKNKPAIHJCIL
Registered socket 3 for persistent reuse.
Disabling further reuse of socket 3.
Closed fd 3
2011-05-21 18:24:57 ERROR 500: Internal Server Error.
OK, so I check the headers when fetching the page using my browser (using Live HTTP Headers add-on):
http://www.ptmytrade.com/product.asp?id=61361
GET /product.asp?id=61361 HTTP/1.1
Host: www.ptmytrade.com
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:2.0) Gecko/20100101 Firefox/4.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: ASPSESSIONIDSCACBBRA=AMPBLLNDGMFLNPNCPEBPNNLB; ASPSESSIONIDSCACCAQA=FJNBMJODLHHJNDHPFBIEEPEM
HTTP/1.1 500 Internal Server Error
Date: Sat, 21 May 2011 16:20:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Content-Length: 471822
Content-Type: text/html
Cache-Control: private
----------------------------------------------------------
http://www.ptmytrade.com/images/index_117.jpg
GET /images/index_117.jpg HTTP/1.1
Host: www.ptmytrade.com
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:2.0) Gecko/20100101 Firefox/4.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://www.ptmytrade.com/product.asp?id=61361
Cookie: ASPSESSIONIDSCACBBRA=AMPBLLNDGMFLNPNCPEBPNNLB; ASPSESSIONIDSCACCAQA=FJNBMJODLHHJNDHPFBIEEPEM
HTTP/1.1 404 Not Found
Content-Length: 1635
Content-Type: text/html
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Date: Sat, 21 May 2011 16:20:48 GMT
I'm not sure what's going on here. The page displays just fine, but I'm getting the 500 error code in the header.
The problem was solved by using curl instead (which was also getting a 500, but fetched the page just fine), but I'm curious what's going on here.
I had this huge problem too. I don't know why, but the solution was to add a user agent:
wget -U "Opera 11.0" "http://your_link" -O out.csv
I found it in the answers to "Curl and wget return error 500 for helloworld.php on new install but browser is fine".
Using this option will fix the issue:
--content-on-error
If this is set to on, wget will not skip the content when the
server responds with a http status code that indicates error.
So the command looks like this:
wget --content-on-error "https://stackoverflow.com"
NOTE: It's important to put the URL inside double quotes; otherwise, wget will get stuck on "Redirecting output to 'wget-log'".
Or as stated in the comments and by OP, use curl instead.
But I should note that curl cannot download whole webpages (CSS, JS, images, etc.) because it does not parse HTML.
It's a bug in the webpage: the HTTP status is indeed, seemingly incorrectly, set to HTTP 500. Firefox/Firebug also confirms this. Basically, you're facing an HTTP 500 error page with "normal" content.
Report it to the site admin.
Try enclosing the URL in quotes so the shell doesn't interpret the special characters in it:
wget "http://www.ptmytrade.com/product.asp?id=61363"
instead of:
wget http://www.ptmytrade.com/product.asp?id=61363

HTTP protocol Content-Length

I am working on a simple download application. When requesting the following file, both Firefox and my application don't get the Content-Length field, but if I make the request using wget, the server does send the Content-Length field. I changed wget's user-agent string as a test, and it still got the Content-Length field.
Any ideas why this is happening?
wget request
---request begin---
GET /dc-13/video/2005_Defcon_V2-P_Zimmerman-Unveiling_My_Next_Big_Project.mp4 HTTP/1.0
User-Agent: test
Accept: */*
Host: media.defcon.org
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.0 200 OK
Server: lighttpd
Date: Sun, 05 Apr 2009 04:40:08 GMT
Last-Modified: Tue, 23 May 2006 22:18:19 GMT
Content-Type: video/mp4
Content-Length: 104223909
Connection: keep-alive
firefox request
GET /dc-13/video/2005_Defcon_V2-P_Zimmerman-Unveiling_My_Next_Big_Project.mp4 HTTP/1.1
Host: media.defcon.org
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.0.8) Gecko/2009032608 Firefox/3.0.8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://www.defcon.org/html/links/defcon-media-archives.html
Pragma: no-cache
Cache-Control: no-cache
HTTP/1.x 200 OK
Server: lighttpd
Date: Sun, 05 Apr 2009 05:20:12 GMT
Last-Modified: Tue, 23 May 2006 22:18:19 GMT
Content-Type: video/mp4
Transfer-Encoding: chunked
Update:
Is there a header I can send that will tell lighttpd not to use chunked encoding? My original problem is that I am using URLConnection to grab the file in my Java application, which automatically sends an HTTP 1.1 request.
I would like to know the size of the file so I can update my progress percentage.
GET /dc-13/video/2005_Defcon_V2-P_Zimmerman-Unveiling_My_Next_Big_Project.mp4 HTTP/1.1
Firefox is performing an HTTP 1.1 GET request. lighttpd understands that the client supports chunked transfer encoding and returns the content in chunks, with each chunk reporting its own length.
GET /dc-13/video/2005_Defcon_V2-P_Zimmerman-Unveiling_My_Next_Big_Project.mp4 HTTP/1.0
Wget, on the other hand, performs an HTTP 1.0 GET request. lighttpd, understanding that the client doesn't support HTTP 1.1 (and thus chunked transfer encoding), returns the content in one piece, with the length reported in the response header.
Looks like it's because of the chunked transfer encoding:
Transfer-Encoding: chunked
This will send the video down in chunks, each with its own size. This is defined in HTTP 1.1, which is what Firefox is using, while wget is using HTTP 1.0, which doesn't support chunked transfer encoding, so the server has to send the whole file at once.
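As an aside, this is how the missing length shows up in the java.net.URLConnection case from the question's update: with a chunked HTTP/1.1 response there is simply no Content-Length to read. A small Scala sketch (the URL is the one from the question; any server that answers with Transfer-Encoding: chunked behaves the same way):

import java.net.URL

object ContentLengthCheck {
  def main(args: Array[String]): Unit = {
    val url = new URL("http://media.defcon.org/dc-13/video/2005_Defcon_V2-P_Zimmerman-Unveiling_My_Next_Big_Project.mp4")
    val conn = url.openConnection()
    // getContentLengthLong returns -1 when the server sent no Content-Length,
    // e.g. because the response uses chunked transfer encoding.
    val length = conn.getContentLengthLong
    if (length >= 0)
      println(s"Content-Length: $length bytes")
    else
      println("No Content-Length header; the size can't be read from this response")
  }
}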
I was having the same problem and found a solution that works regardless of the HTTP version:
First, send a HEAD request to the server, which correctly responds with just the HTTP headers and no content. These headers include the wanted Content-Length of the file to download.
Then proceed with the GET request to download the file (the headers of the GET response fail to include Content-Length).
An Objective-C language example:
NSString *zipURL = @"http://1.bp.blogspot.com/_6-cw84gcURw/TRNb3PDWneI/AAAAAAAAAYM/YFCZP1foTiM/s1600/paragliding1.jpg";
NSURL *url = [NSURL URLWithString:zipURL];
// Configure the HTTP request for HEAD header fetch
NSMutableURLRequest *urlRequest = [NSMutableURLRequest requestWithURL:url];
urlRequest.HTTPMethod = @"HEAD"; // Default is "GET"
// Define response class
__autoreleasing NSHTTPURLResponse *response;
// Send HEAD request to server
NSData *contentsData = [NSURLConnection sendSynchronousRequest:urlRequest returningResponse:&response error:nil];
// Header response field
NSDictionary *headerDeserialized = response.allHeaderFields;
// The contents length
int contents_length = [(NSString*)headerDeserialized[@"Content-Length"] intValue];
//printf("HEAD Response header: %s\n",headerDeserialized.description.UTF8String);
printf("HEAD:\ncontentsData.length: %d\n",contentsData.length);
printf("contents_length = %d\n\n",contents_length);
urlRequest.HTTPMethod = @"GET";
// Send "GET" to download file
contentsData = [NSURLConnection sendSynchronousRequest:urlRequest returningResponse:&response error:nil];
// Header response field
headerDeserialized = response.allHeaderFields;
// The contents length
contents_length = [(NSString*)headerDeserialized[@"Content-Length"] intValue];
printf("GET Response header: %s\n",headerDeserialized.description.UTF8String);
printf("GET:\ncontentsData.length: %d\n",contentsData.length);
printf("contents_length = %d\n",contents_length);
return;
And the output:
HEAD:
contentsData.length: 0
contents_length = 146216
GET:
contentsData.length: 146216
contents_length = 146216
(Note: this example URL does correctly provide the Content-Length header in the GET response, but it illustrates the idea for cases where it doesn't.)
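For the java.net.URLConnection setup mentioned in the question, the same HEAD-then-GET idea could look roughly like this in Scala (a sketch, not production code; the URL is the one from the Objective-C example):

import java.net.{HttpURLConnection, URL}

object HeadThenGet {
  def main(args: Array[String]): Unit = {
    val url = new URL("http://1.bp.blogspot.com/_6-cw84gcURw/TRNb3PDWneI/AAAAAAAAAYM/YFCZP1foTiM/s1600/paragliding1.jpg")

    // 1. HEAD request: headers only, no body, but with Content-Length.
    val head = url.openConnection().asInstanceOf[HttpURLConnection]
    head.setRequestMethod("HEAD")
    val total = head.getContentLengthLong
    head.disconnect()
    println(s"Expected size: $total bytes")

    // 2. GET request: stream the body and report progress against the HEAD size.
    val get = url.openConnection().asInstanceOf[HttpURLConnection]
    val in = get.getInputStream
    val buffer = new Array[Byte](8192)
    var downloaded = 0L
    var read = in.read(buffer)
    while (read != -1) {
      downloaded += read
      if (total > 0) printf("%.1f%% downloaded\r", downloaded * 100.0 / total)
      read = in.read(buffer)
    }
    in.close()
    println(s"\nDownloaded $downloaded bytes")
  }
}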
