wget Fails to Download Website (ERROR 0: no description) - recursion

I'm trying to mirror the whole website at http://opposedforces.com/parts/impreza/en_g11/type_63/
Accessing it through a browser (Firefox, w3m) or Postman works fine and returns the HTML file.
Accessing through wget, cURL, the Python requests module and HTTrack all fail.
wget specifically fails with:
↪ wget --mirror -p --convert-links "http://opposedforces.com/parts/impreza/en_g11/type_63/"
--2021-02-03 20:48:29-- http://opposedforces.com/parts/impreza/en_g11/type_63/
Resolving opposedforces.com (opposedforces.com)... 138.201.30.59
Connecting to opposedforces.com (opposedforces.com)|138.201.30.59|:80... connected.
HTTP request sent, awaiting response... 0
2021-02-03 20:48:29 ERROR 0: (no description).
Converted links in 0 files in 0 seconds.
It seemingly returns no information. Originally I thought some JavaScript was generating the html, but I can't find any JS using Firefox developer tools, and I would assume Postman would not work in this case.
Any ideas how to get around this? Ideally I can use wget to download this and all sub-pages, but alternative solutions are also welcome.

This is one of those times when the website is completely and absolutely broken.
It is unfortunate that web browsers go to great lengths to support such broken web pages.
The problem is that the server sends a broken response. This is the response I see:
---response begin---
HTTP/1.1 000
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 44892
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=gxhoir45jpd43545iujdpiru; path=/; HttpOnly
X-Powered-By: ASP.NET
Date: Fri, 05 Feb 2021 09:26:26 GMT
See? It returns an HTTP/1.1 000 response, a status code that doesn't exist in the spec. Web browsers seem to just accept it as a 200 response and move on. Wget doesn't.
But you can get around it by using the --content-on-error option, which asks Wget to download the content irrespective of the response code.
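To illustrate why a strict client might balk at that status line, here is a minimal sketch in Python. The regular expression and the helper names are assumptions made for illustration; this is not Wget's actual parsing code.

```python
import re

# Status line shape: HTTP-version SP 3-digit-code [SP reason-phrase]
STATUS_LINE = re.compile(r"^HTTP/(\d)\.(\d) (\d{3})(?: (.*))?$")


def parse_status_line(line):
    """Return (version, code, reason) or raise ValueError on a malformed line."""
    match = STATUS_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an HTTP status line: {line!r}")
    major, minor, code, reason = match.groups()
    return f"{major}.{minor}", int(code), reason or ""


def is_defined_status_class(code):
    """HTTP defines five status classes, 1xx through 5xx."""
    return 100 <= code <= 599


version, code, reason = parse_status_line("HTTP/1.1 000")
print(version, code, is_defined_status_class(code))  # 1.1 0 False
```

The line is syntactically well formed (three digits), but code 0 falls outside every defined class, so a client has to decide for itself whether to treat it as success or failure.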

Related

Curl having problem to retrieve data after Nginx restart

My server was working fine until I restarted the server and now my program with cURL API stops working. After troubleshooting for a long time, I figured out what the problem is.
When I use this command:
curl -i https://server.my-site.com/checkConnection
Nginx returns error:
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 04 Jul 2019 17:14:40 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: keep-alive
Location: /checkConnection/
but if I use this command:
curl -i -L https://server.my-site.com/checkConnection
Then the server returns:
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Thu, 04 Jul 2019 17:14:40 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: keep-alive
Location: /checkConnection/
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 04 Jul 2019 17:14:40 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 2
Connection: keep-alive
X-Frame-Options: SAMEORIGIN
ok
And if I use a browser, then everything works. I have no clue where the error comes from or how to fix it.
Any help is appreciated!
This is what happens when the path maps to a directory. In theory, a URL like http://example.org/directory could map to a directory like /wherever/public_html/directory, and being a directory, show an index.html or similar file from there; however, that would cause surprising issues when you go to refer to other things like images in the same directory. <img src="picture.jpg"> would load http://example.org/picture.jpg rather than http://example.org/directory/picture.jpg since it's relative to the URL the browser is actually viewing. Because of this, HTTP servers generally issue a redirect to add a slash at the end, which then both loads the right page and at a URL where relative paths do what humans expect.
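The relative-path behaviour described above can be checked with Python's standard urllib.parse.urljoin, which resolves references the same way a browser does. This is only a sketch; the last call reuses the URL and Location value from the question.

```python
from urllib.parse import urljoin

without_slash = "http://example.org/directory"
with_slash = "http://example.org/directory/"

# Without the trailing slash, "directory" is treated like a file name
# and is replaced by the relative reference.
print(urljoin(without_slash, "picture.jpg"))  # http://example.org/picture.jpg

# With the trailing slash, the reference resolves inside the directory.
print(urljoin(with_slash, "picture.jpg"))  # http://example.org/directory/picture.jpg

# The Location header from the question, resolved against the request URL.
print(urljoin("https://server.my-site.com/checkConnection", "/checkConnection/"))
# https://server.my-site.com/checkConnection/
```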
Adding -L to your curl commandline causes it to follow the redirect, as browsers do, and you get the result you were expecting. Without -L, curl is a more naive http client and lets you do what you will with the information.
Maybe you have a rule for www.server.my-site.com, and the 301 is returned because the server is redirecting from server.my-site.com to the www site. You should share your configuration so it can be checked.
OK, I finally fixed it by adding an internal route to uwsgi. Everything is working fine now.

Getting 404 error if requesting a page through proxy, but 200 if connecting directly

I am developing an HTTP proxy in Java. I resend all the data from the client to the server without touching it, but for some URLs (for example this) the server returns a 404 error when I connect through my proxy.
The requested URL uses Varnish caching, so that might be the root of the problem. I cannot reconfigure it because it is not mine.
If I request that URL directly with a browser, the server returns 200 and the image is shown correctly.
I am stuck because I do not even know what to read or how to phrase a search query.
Thanks a lot.
Fix the Host: header of the re-issued request. The request going out from the proxy either has no Host header or it is broken (or only X-Host exists). Also take note that the proxy application will execute its own DNS lookup and that might yield a different IP address than your local computer (where you issued the original request).
This works:
> curl -s -D - -o /dev/null http://212.25.95.152/w/w-200/1902047-41.jpg -H "Host: msc.wcdn.co.il"
HTTP/1.1 200 OK
Content-Type: image/jpeg
Cache-Control: max-age = 315360000
magicmarker: 1
Content-Length: 27922
Accept-Ranges: bytes
Date: Sun, 05 Jul 2015 00:52:08 GMT
X-Varnish: 2508753650 2474246958
Age: 67952
Via: 1.1 varnish
Connection: keep-alive
X-Cache: HIT
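As a rough sketch of the fix, a proxy could normalise the Host header before re-issuing the request. The example is in Python rather than Java for brevity, and the function name set_host_header is hypothetical; it assumes the proxy has the raw request bytes and already knows the intended host name.

```python
def set_host_header(raw_request: bytes, host: str) -> bytes:
    """Return the request with exactly one Host header set to `host`."""
    head, _, body = raw_request.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    request_line, headers = lines[0], lines[1:]
    # Drop any existing Host header (header names are case-insensitive).
    # Headers such as X-Host are left untouched.
    kept = [h for h in headers if not h.lower().startswith(b"host:")]
    new_head = [request_line, b"Host: " + host.encode("ascii")] + kept
    return b"\r\n".join(new_head) + b"\r\n\r\n" + body


raw = (
    b"GET /w/w-200/1902047-41.jpg HTTP/1.1\r\n"
    b"X-Host: msc.wcdn.co.il\r\n"
    b"Accept: */*\r\n"
    b"\r\n"
)
print(set_host_header(raw, "msc.wcdn.co.il").decode("ascii"))
```

Name-based virtual hosting and caches such as Varnish typically select the site from the Host header, which is why a missing or wrong value can produce a 404 even though the IP address is correct.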

WGET 401 Unauthorized

I'm trying to use a batch file with WGET to download the public FCC file from here
http://wireless.fcc.gov/uls/data/complete/l_micro.zip
When I initially run the batch file with parameters
wget --server-response -owget.log http://wireless.fcc.gov/uls/data/complete/l_micro.zip
It fails with an HTTP 401 unauthorized error. I can retry at this point and it keeps failing. However I noticed if I open up IE, start a download and cancel when prompted to save, I can rerun the batch file and it executes perfectly!
Here is my detailed server response from the log
--2012-02-06 14:32:24-- http://wireless.fcc.gov/uls/data/complete/l_micro.zip
Resolving wireless.fcc.gov (wireless.fcc.gov)... 192.104.54.158
Connecting to wireless.fcc.gov (wireless.fcc.gov)|192.104.54.158|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 302 Found
Location: REMOVED - appears to have my IP
Cache-Control: no-cache
Pragma: no-cache
Content-Type: text/html; charset=utf-8
Connection: close
Content-Length: 513
Location: REMOVED [following]
--2012-02-06 14:32:24-- REMOVED
Resolving REMOVED... 192.168.2.11
Connecting to REMOVED|192.168.2.11|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 401 Unauthorized
Cache-Control: no-cache
Pragma: no-cache
WWW-Authenticate: NTLM
WWW-Authenticate: BASIC realm="AD_BCAAA"
Content-Type: text/html; charset=utf-8
Proxy-Connection: close
Set-Cookie: BCSI-CS-8ECFB6B4AA642EF0=2; Path=/
Connection: close
Content-Length: 575
Authorization failed.
Here is the log after doing my little IE procedure and getting it to work
--2012-02-08 15:52:43-- http://wireless.fcc.gov/uls/data/complete/l_micro.zip
Resolving wireless.fcc.gov (wireless.fcc.gov)... 192.104.54.158
Connecting to wireless.fcc.gov (wireless.fcc.gov)|192.104.54.158|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Server: Sun-Java-System-Web-Server/7.0
Date: Fri, 27 Jan 2012 18:37:51 GMT
Content-type: application/zip
Last-modified: Sun, 22 Jan 2012 11:18:09 GMT
Etag: "46fa95c-4f1bf071"
Accept-ranges: bytes
Content-length: 74426716
Connection: Keep-Alive
Age: 1045014
Length: 74426716 (71M) [application/zip]
Saving to: `l_micro.zip'
Any help is appreciated!
If the website simply has an .htpasswd setup, you can try:
wget --user=admin --ask-password https://www.yourwebsite.com/file.zip
I used --auth-no-challenge and this exact error was resolved.
You have a Blue Coat secure web gateway on your network, as evidenced by the line in the response:
Set-Cookie: BCSI-CS-8ECFB6B4AA642EF0=2; Path=/
It looks like it wants you to authenticate, presumably with your domain credentials. Try passing them with --http-user and --http-password (spelled --http-passwd in older Wget releases).
I had a similar issue with an XWiki-based site. After several attempts I found a combination that worked for me just fine:
wget --no-check-certificate --auth-no-challenge -k -nc -p -l 1 -r https://user:password@host.domain
I think the key was --auth-no-challenge
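For context, --auth-no-challenge is documented as making Wget send Basic credentials up front instead of waiting for a 401 challenge. The sketch below shows what such a preemptive header looks like; the helper name basic_auth_header and the user/password values are placeholders, and this is not Wget's own code.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build an HTTP Basic Authorization header: base64 of 'user:password'."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"


print(basic_auth_header("user", "password"))
# Authorization: Basic dXNlcjpwYXNzd29yZA==
```

Note that Basic credentials are only encoded, not encrypted, so sending them preemptively is best limited to HTTPS URLs.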
Try using this extension for firefox. It generates a wget or a curl command that can be copied and run from bash.
I came here trying to find out why wget was giving a 401 unauthorized message when on another system the problem did not occur.
After installing a later version of wget from source (binary was not available in my distro) it worked. I can't explain why, except that it must be some kind of bug so if none of the above fixes your problem, consider upgrading wget.
Try setting a user-agent string with wget, e.g.
--user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
It's entirely feasible for a site to reject requests from certain user agents, particularly if they look to be circumventing the "usual" routes to information (i.e. through webpages).
Although this doesn't explain your problem, it's a good idea anyway. Perhaps the site implements a mechanism whereby when you browse with a "known" browser (e.g. IE) it then caches your IP as "safe" then allows any user agent from your IP to download anything :)

Programmatically updating a VSTO Word Addin in IIS7.5

We have recently moved to a new web server (from IIS6 to IIS7.5) and I'm having some trouble updating our VSTO word addin.
Our app checks for updates manually when logging in and if a newer version has been found updates like this (let me know if there is a better way to do this - I've tried ApplicationDeployment.Update() but had no luck with it either!):
WebBrowser browser = new WebBrowser();
browser.Visible = false;
Uri setupLocation = new Uri("https://updatelocation.com/setup.exe");
browser.Url = setupLocation;
This used to launch the setup and update the app and when the user restarted word they would have the new version installed. Since the server move the update no longer happens. No exceptions are thrown. Browsing to the URL launches the updater as expected. What would I need to change to get this to work?
Note I have the following MIME types setup on the folder in IIS:
.application
application/x-ms-application
.manifest
application/x-ms-manifest
.deploy
application/octet-stream
.msu
application/octet-stream
.msp
application/octet-stream
.exe
application/octet-stream
Edit
OK, I've had a look in Fiddler and it's returning a body size of -1:
If I enter the same URL in IE you can see that the setup.exe is launched without problems.
This is what Fiddler displays in the raw view when accessing from Word:
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Last-Modified: Tue, 27 Sep 2011 15:07:42 GMT
Accept-Ranges: bytes
ETag: "9bd0c334277dcc1:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Mon, 14 Nov 2011 07:42:18 GMT
Content-Length: 735608
MZ [binary data] !This program cannot be run in DOS mode. [binary data]
*** FIDDLER: RawDisplay truncated at 128 characters. Right-click to disable truncation. ***
Have you tried a tool like (for instance) fiddler2 to see what http traffic is actually created?
Does the client make a server call? What does the server actually return?
Then:
Make the calls from within word (which isn't working)
Make the calls by hand (which is working)
Compare both the request and response packages from those calls to spot the differences

Determine supported HTTP version by the web server

Is there a way to check whether a web server supports HTTP 1.0 or 1.1? If so, how is this done?
You could issue a:
curl --head www.test.com
that will print out the HTTP version in the first line of the output...
e.g.
HTTP/1.1 200 OK
Content-Length: 28925
Content-Type: text/html
Last-Modified: Fri, 26 Jun 2009 16:08:04 GMT
Accept-Ranges: bytes
ETag: "a41944978f6c91:0"
Server: Microsoft-IIS/7.0
X-Powered-By: ASP.NET
Date: Fri, 31 Jul 2009 06:13:25 GMT
In Google Chrome you can see the protocol of each request like this:
open the developer tools with F12
go to the Network tab
right-click anywhere in the column headers (for example, Name) and select Protocol from the context menu to display it as a new column
the Protocol column then shows values such as h2 (HTTP/2) or http/1.1
This should work on any platform that includes a telnet client:
telnet <host> 80
Then you have to type one of the following blind:
HEAD / HTTP/1.0
or
GET /
and hit enter twice.
The first line returned should output the HTTP version supported:
telnet www.stackoverflow.com 80
HEAD / HTTP/1.0
HTTP/1.1 404 Not Found
Content-Length: 315
Content-Type: text/html; charset=us-ascii
Server: Microsoft-HTTPAPI/2.0
Date: Fri, 31 Jul 2009 15:15:15 GMT
Connection: close
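The telnet procedure above can also be scripted. The following Python sketch sends the same HEAD request over a plain socket and extracts the version token from the first line of the reply. The function names are illustrative, and it only covers plain HTTP on the given port, not HTTPS or HTTP/2.

```python
import socket


def version_from_status_line(status_line: str) -> str:
    """Return the protocol token, e.g. 'HTTP/1.1', from a status line."""
    token = status_line.strip().split(" ", 1)[0]
    if not token.startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {status_line!r}")
    return token


def probe_http_version(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Send a HEAD request as HTTP/1.0 and report the version in the reply."""
    request = f"HEAD / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        data = b""
        # Read until the first line is complete or the server closes.
        while b"\r\n" not in data:
            chunk = sock.recv(1024)
            if not chunk:
                break
            data += chunk
    first_line = data.split(b"\r\n", 1)[0].decode("iso-8859-1")
    return version_from_status_line(first_line)


print(version_from_status_line("HTTP/1.1 404 Not Found"))  # HTTP/1.1
```

Calling probe_http_version("www.stackoverflow.com") would perform the same exchange as the telnet session shown above; the result depends on the server at the time of the request.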
Read the release notes or the documentation of the web server to check that. For example, the Apache Tomcat documentation states that it supports HTTP 1.1.
Which webserver are you looking for?
Also are you asking if this can be checked programmatically?
In Google Chrome and Brave, you can use the developer tools (F12 or Command + Option + I). Open the Network tab, select the request, click the Headers tab, scroll down to "Response Headers", and click "view source". The HTTP version appears in the first line.
For example, a first line of HTTP/1.1 200 OK means the server is using HTTP/1.1. If that line is missing, the connection is HTTP/2, which uses a binary format and therefore has no readable source.
Alternatively, you can also use netcat so that you don't have to type it blindly as in telnet.
user@linux:~$ nc www.stackoverflow.com 80
HEAD / HTTP
HTTP/1.1 400 Bad Request
Connection: close
Content-Length: 0
user@linux:~$
$curl --head https://url:port -k
You get result something like...
HTTP/1.1 200 OK
blah....blah.
blah...blah..
$
So the first line shows the version it supports.

Resources