I am using both curl and wget to fetch this URL: http://opinionator.blogs.nytimes.com/2012/01/19/118675/
curl returns no output at all, while wget returns the entire HTML source.
Here are the two commands. Both use the same user agent, come from the same IP, and follow redirects, and the URL is exactly the same. curl returns after about 1 second, so I know it's not a timeout issue.
curl -L -s "http://opinionator.blogs.nytimes.com/2012/01/19/118675/" --max-redirs 10000 --location --connect-timeout 20 -m 20 -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" 2>&1
wget http://opinionator.blogs.nytimes.com/2012/01/19/118675/ --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
If the NY Times is cloaking and not returning the source to curl, what could be different in the headers curl is sending? I assumed that with the same user agent the two requests would look exactly the same. What other "footprints" should I check?
The way to solve this is to compare the two requests: run curl -v ... and wget -d ... to see the headers each tool sends and receives. The curl output shows that curl is redirected to a login page:
> GET /2012/01/19/118675/ HTTP/1.1
> User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
> Host: opinionator.blogs.nytimes.com
> Accept: */*
>
< HTTP/1.1 303 See Other
< Date: Wed, 08 Jan 2014 03:23:06 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://www.nytimes.com/glogin?URI=http://opinionator.blogs.nytimes.com/2012/01/19/118675/&OQ=_rQ3D0&OP=1b5c69eQ2FCinbCQ5DzLCaaaCvLgqCPhKP
< Content-Length: 0
< Content-Type: text/plain; charset=UTF-8
This is followed by a loop of redirections (which you must have noticed, because you have already set the --max-redirs flag).
wget follows the same sequence, except that it sends the cookie set by nytimes.com back with its subsequent request(s):
---request begin---
GET /2012/01/19/118675/?_r=0 HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: */*
Host: opinionator.blogs.nytimes.com
Connection: Keep-Alive
Cookie: NYT-S=0MhLY3awSMyxXDXrmvxADeHDiNOMaMEZFGdeFz9JchiAIUFL2BEX5FWcV.Ynx4rkFI
The request sent by curl never includes the cookie.
The easiest way I see to modify your curl command and obtain the desired resource is to add -c cookiefile. This turns on curl's cookie handling and stores received cookies in the otherwise unused temporary "cookie jar" file called "cookiefile", so that curl sends the needed cookie(s) with its subsequent requests.
For example, I added the flag -c x directly after "curl " and got the same output as from wget (except that wget writes it to a file and curl prints it to STDOUT).
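To make the cookie mechanism concrete, here is a minimal Python sketch rather than a reproduction of the nytimes.com behaviour. It assumes a toy local server (the hypothetical `CookieGateHandler`) that keeps redirecting until the client sends back the cookie it was given, and compares a client without a cookie jar to one with a jar. All names here are illustrative.

```python
import http.cookiejar
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Toy server: redirects until the client returns the cookie it was given."""

    def do_GET(self):
        if "session=ok" in self.headers.get("Cookie", ""):
            body = b"<html>article</html>"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No cookie yet: set one and bounce the client back to the same path.
            self.send_response(303)
            self.send_header("Set-Cookie", "session=ok; Path=/")
            self.send_header("Location", self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_server():
    """Start the toy server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), CookieGateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(url, with_cookies):
    """Fetch url, following redirects, with or without a cookie jar."""
    # An empty ProxyHandler ignores proxy environment variables for this local demo.
    handlers = [urllib.request.ProxyHandler({})]
    if with_cookies:
        jar = http.cookiejar.CookieJar()
        handlers.append(urllib.request.HTTPCookieProcessor(jar))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=5) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib gives up on a redirect loop and reports the last status code.
        return err.code, b""
```

Without the jar, the client never gets past the redirect loop; with it, the second request carries the cookie and the page comes back. The cookie jar plays the same role here as curl's -c option.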
In my case the cause was the https_proxy environment variable: curl needs the port to be set in the proxy URL. For example:
Does not work with curl:
https_proxy=http://proxyapp.net.com/
Works with curl:
https_proxy=http://proxyapp.net.com:80/
wget works with or without the port in the URL, but curl needs it. Without it, curl returns the error "(56) Proxy CONNECT aborted".
The verbose output of "curl -v" shows that curl uses port 1080 by default when no port is set in the proxy URL.
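A quick way to spot this situation before running curl is to check whether the proxy URL names its port. The following is a small sketch using only Python's standard library. The helper names `proxy_port_is_explicit` and `with_explicit_port` are made up for illustration, and the port to add (80 here) is whatever your proxy actually listens on.

```python
from urllib.parse import urlsplit


def proxy_port_is_explicit(proxy_url):
    """Return True if the proxy URL names its port explicitly."""
    return urlsplit(proxy_url).port is not None


def with_explicit_port(proxy_url, port):
    """Return proxy_url unchanged if it has a port, otherwise add ':port'."""
    parts = urlsplit(proxy_url)
    if parts.port is not None:
        return proxy_url
    return parts._replace(netloc=f"{parts.netloc}:{port}").geturl()


print(proxy_port_is_explicit("http://proxyapp.net.com/"))     # False
print(proxy_port_is_explicit("http://proxyapp.net.com:80/"))  # True
print(with_explicit_port("http://proxyapp.net.com/", 80))     # http://proxyapp.net.com:80/
```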
A user is able to upload a file. During the upload the file is scanned. If there is an issue with the file Symfony returns a Response(400) and the rest of the file is not uploaded, saving the user and the host time and bandwidth.
This is done via \Symfony\Component\HttpFoundation\Request::getContent(true)
$resource = $request->getContent(true);
The file is scanned a line at a time using:
fgets($resource);
The resource is also closed before the response is sent to the user:
fclose($resource);
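For readers who want the scan-and-abort flow in one place, here is a rough sketch of the same idea in Python (not the PHP/Symfony code from the question). The function name `scan_upload` and the `is_bad_line` callback are hypothetical, and the stream stands in for the request body resource.

```python
def scan_upload(stream, is_bad_line):
    """Read a binary stream line by line and stop at the first rejected line.

    Returns (status_code, bytes_read); the stream is always closed before
    returning, mirroring the fclose() before the response is sent.
    """
    bytes_read = 0
    try:
        # readline() returns b"" at end of stream, which ends the loop.
        for line in iter(stream.readline, b""):
            bytes_read += len(line)
            if is_bad_line(line):
                return 400, bytes_read  # reject early, rest of the body unread
        return 200, bytes_read
    finally:
        stream.close()
```

The point of the sketch is that the server decides on the response before the body has been fully read, which is the situation the clients below have to cope with.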
However there is unexpected and strange behaviour happening for some user clients.
For example wget:
wget -4 --no-check-certificate --method PUT --timeout=0 --header 'Authorization: Bearer xxx' --body-file='xxx' 'https://example.com/xxx' --content-on-error -d -O -
Response hangs:
---request begin---
PUT /xxx HTTP/1.1
User-Agent: Wget/1.20.3 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: xxx
Connection: Keep-Alive
Content-Length: 37767602
Authorization: Bearer xxx
---request end---
[writing BODY file xxx ...
It appears that wget does not understand that the upload does not need to be completed. Is this a header that PHP is failing to send, or a flag required in the wget command?
A similar command in curl works:
curl -k --location --request PUT 'https://example.com/xxx' \
--header 'Authorization: Bearer xxx' \
--data-binary '@/xxx'
Response
< Server: Apache/2.4.38 (Debian)
< Vary: Authorization
< X-Robots-Tag: noindex
< Transfer-Encoding: chunked
* HTTP error before end of send, stop sending
<
* Closing connection 0
* TLSv1.3 (OUT), TLS alert, close notify (256):
This request used to work but now gets a 403. I tried adding a user agent like in this answer but still no good: https://stackoverflow.com/a/38489588/2415706
This second answer further down says to find the referer header but I can't figure out where these response headers are: https://stackoverflow.com/a/56946001/2415706
import requests
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"
}
job_url = "https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"
job_response = requests.get(job_url, headers=headers, timeout=10)
print(job_response)
This is what I see under Request Headers for the first tab after refreshing the page but there's too much stuff. I assume I only need one of these lines.
:authority: www.ziprecruiter.com
:method: GET
:path: /Salaries/What-Is-the-Average-Programmer-Salary-by-State
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: max-age=0
cookie: __cfduid=dea4372c39465cfa2422e97f84dea45fb1620355067; zva=100000000%3Bvid%3AYJSn-w3tCu9yJwJx; ziprecruiter_browser=99.31.211.77_1620355067_495865399; SAFESAVE_TOKEN=1a7e5e90-60de-494d-9af5-6efdab7ade45; zglobalid=b96f3b99-1bed-4b7c-a36f-37f2d16c99f4.62fd155f2bee.6094a7fb; ziprecruiter_session=66052203cea2bf6afa7e45cae7d1b0fe; experian_campaign_visited=1
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"
sec-ch-ua-mobile: ?0
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36
EDIT: Looking at the other tabs, they send a referer header ("referer": "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State"), so I tried adding that, but it is still 403.
Using httpx package it seems to work with:
import httpx
url = 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State'
r = httpx.get(url)
print(r.text)
print(r.status_code)
print(r.http_version)
repl.it: https://replit.com/@bertrandmartel/ZipRecruiter
I may be wrong, but I think the server didn't like the TLS negotiation of the requests library. It's odd, since the httpx call above uses HTTP/1.1, while with curl it only works with HTTP/2 and TLS 1.3.
Using a curl binary built with http2 and with openssl supporting TLS1.3, the following works:
docker run --rm curlimages/curl:7.76.1 \
--http2 --tlsv1.3 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
returns:
301
The following commands are failing:
forcing http1.1 and enforcing TLS 1.3
docker run --rm curlimages/curl:7.76.1 \
--http1.1 --tlsv1.3 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
Output: 403
forcing http2 and enforcing TLS 1.2:
docker run --rm curlimages/curl:7.76.1 \
--http2 --tlsv1.2 'https://ziprecruiter.com/Salaries/What-Is-the-Average-Programmer-Salary-by-State' \
-H 'user-agent: Mozilla' \
-s -o /dev/null -w "%{http_code}"
Output: 403
My guess is that the server detects something in the TLS negotiation, but that the check is different when both TLS 1.3 and HTTP/2 are used.
Unfortunately, you can't test HTTP/2 with requests/urllib, since neither supports it.
I have a simple file on my web server, and when I request it in a browser, it loads without problems:
http://example.server/report.php
But when I request the file with wget from a Raspberry Pi, I get this:
$ wget -d --spider http://example.server/report.php
Setting --spider (spider) to 1
DEBUG output created by Wget 1.18 on linux-gnueabihf.
Reading HSTS entries from /home/pi/.wget-hsts
URI encoding = 'ANSI_X3.4-1968'
converted 'http://example.server/report.php' (ANSI_X3.4-1968) -> 'http://example.server/report.php' (UTF-8)
Converted file name 'report.php' (UTF-8) -> 'report.php' (ANSI_X3.4-1968)
Spider mode enabled. Check if remote file exists.
--2018-06-03 07:29:29-- http://example.server/report.php
Resolving example.server (example.server)... 49.132.206.71
Caching example.server => 49.132.206.71
Connecting to example.server (example.server)|49.132.206.71|:80... connected.
Created socket 3.
Releasing 0x00832548 (new refcount 1).
---request begin---
HEAD /report.php HTTP/1.1
User-Agent: Wget/1.18 (linux-gnueabihf)
Accept: */*
Accept-Encoding: identity
Host: example.server
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 406 Not Acceptable
Date: Fri, 15 Jun 2018 08:25:17 GMT
Server: Apache
Keep-Alive: timeout=3, max=200
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
---response end---
406 Not Acceptable
Registered socket 3 for persistent reuse.
URI content encoding = 'iso-8859-1'
Remote file does not exist -- broken link!!!
I read somewhere that it might be an encoding problem, so I tried
$ wget -d --spider --header="Accept-encoding: *" http://example.server/report.php
but that gives me the exact same error.
That's because the server you're connecting to only serves certain User-Agents.
Change the user agent and it works fine:
wget -d --user-agent="Mozilla/5.0 (Windows NT x.y; rv:10.0) Gecko/20100101 Firefox/10.0" http://example.server/report.php
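The same fix can be sketched from a script. This is a minimal Python example using the standard library, assuming the server really does filter on User-Agent as described. The helper name `build_request` is made up, and the user-agent string is simply the one from the wget command above.

```python
import urllib.request

# User-agent string copied from the wget command above.
BROWSER_UA = "Mozilla/5.0 (Windows NT x.y; rv:10.0) Gecko/20100101 Firefox/10.0"


def build_request(url, user_agent=BROWSER_UA, method="HEAD"):
    """Build a request that sends a browser-like User-Agent header.

    HEAD mirrors what wget --spider sends; pass method="GET" for the body.
    """
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method=method
    )


request = build_request("http://example.server/report.php")
print(request.get_method(), request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen(request) would then send it. If the server still answers 406, the filter is probably keyed on something other than the User-Agent.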
To test something, I want to run a simple web server that:
Will listen for HTTPS POST requests
Print the POST data received to STDOUT (along with other stuff, potentially, so it's fine if it just cats the whole HTTP request)
Is there a quick way to set something like this up? I've tried using OpenSSL's s_server, but it only seems to want to respond to GET requests.
Since s_server does not support POST requests, you should use socat instead of openssl s_server:
# socat -v OPENSSL-LISTEN:443,cert=mycert.pem,key=key.pem,verify=0,fork 'SYSTEM:/bin/echo HTTP/1.1 200 OK;/bin/echo;/bin/echo this-is-the-content-of-the-http-answer'
Here are essential parameters:
fork: to loop for many requests
-v: to display the POST data (and other stuff) to STDOUT
verify=0: do not ask for mutual authentication
Now, here is an example:
We use the following POST request:
% wget -O - --post-data=abcdef --no-check-certificate https://localhost/
[...]
this-is-the-content-of-the-http-answer
We see the following socat output:
# socat -v OPENSSL-LISTEN:443,cert=mycert.crt,key=key.pem,verify=0,fork 'SYSTEM:/bin/echo HTTP/1.1 200 OK;/bin/echo;/bin/echo this-is-the-content-of-the-http-answer'
> 2017/08/05 03:13:04.346890 length=212 from=0 to=211
POST / HTTP/1.1\r
User-Agent: Wget/1.19.1 (freebsd10.3)\r
Accept: */*\r
Accept-Encoding: identity\r
Host: localhost:443\r
Connection: Keep-Alive\r
Content-Type: application/x-www-form-urlencoded\r
Content-Length: 6\r
\r
< 2017/08/05 03:13:04.350299 length=16 from=0 to=15
HTTP/1.1 200 OK
> 2017/08/05 03:13:04.350516 length=6 from=212 to=217
abcdef< 2017/08/05 03:13:04.351549 length=1 from=16 to=16
< 2017/08/05 03:13:04.353019 length=39 from=17 to=55
this-is-the-content-of-the-http-answer
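If socat is not available, a similar test server can be sketched with Python's standard library. This is an alternative to the socat approach, not a description of what socat does internally. The names `PrintPostHandler` and `make_server` are illustrative, and the certificate and key paths are whatever files you already have (for example the mycert.pem and key.pem used above).

```python
import ssl
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


class PrintPostHandler(BaseHTTPRequestHandler):
    """Print every POST request (request line, headers, body) to STDOUT."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        sys.stdout.write(f"{self.requestline}\n{self.headers}")
        sys.stdout.write(body.decode("utf-8", errors="replace") + "\n")
        sys.stdout.flush()

        reply = b"this-is-the-content-of-the-http-answer\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # suppress the default access log on STDERR


def make_server(port, certfile=None, keyfile=None):
    """Create the server; wrap it in TLS when a certificate is given."""
    server = HTTPServer(("127.0.0.1", port), PrintPostHandler)
    if certfile:
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
        server.socket = context.wrap_socket(server.socket, server_side=True)
    return server
```

Something like make_server(8443, "mycert.pem", "key.pem").serve_forever() would start it; port 8443 is an arbitrary choice that avoids needing root. The wget --post-data command above can then be pointed at https://localhost:8443/.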
This topic has come up quite a few times in the community (forums, blog posts, etc.), and the conclusion is that this should be done by making a REST POST call to Share at the URL /service/modules/create-site
The reason is that some Surf-specific things, like the site dashboard, are created on the Share side.
However, I have been trying this approach from different angles all day, always ending up with an HTTP 200 in the response and no Share site created. Quite frustrating.
I'm running this on Alfresco Enterprise 4.2.3.3 (I suspect my problem is due to a recent change).
To strip this down to something that is easy to reproduce, I'm following Martin Bergljung's blog post on the subject (http://www.ixxus.com/blog201203creating-alfresco-share-sites-javascript/), starting with curl like this:
create a text file with login credentials (login.txt) with the following content (change to appropriate values):
username=admin&password=admin
create a text file with the json to create a site (site_data.json)
{"visibility" : "PUBLIC","title" : "My Test Site","shortName" : "mytestsite",
"description" : "My Test Site created from command line", "sitePreset" : "site-dashboard"}
Get the JSESSIONID by requesting a ticket:
curl -v -d @login.txt -H "Content-Type:application/x-www-form-urlencoded" http://localhost:8081/share/page/dologin
copy the resulting JSESSIONID value into the following curl call:
curl -v -d @site_data.json -H "Cookie:JSESSIONID=<insert your jsessionid>" -H "Content-Type:application/json" -H "Accept:application/json" http://localhost:8081/share/service/modules/create-site
output from curl:
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8081 (#0)
> POST /share/service/modules/create-site HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8081
> Cookie:JSESSIONID=5963B948684F562A278909AF466D2306
> Content-Type:application/json
> Accept:application/json
> Content-Length: 196
>
* upload completely sent off: 196 out of 196 bytes
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< Content-Language: en-US
< Content-Length: 0
< Date: Tue, 02 Dec 2014 13:57:02 GMT
<
* Connection #0 to host localhost left intact
The latter curl call results in an HTTP 200 as seen above, but logging in to Share reveals that no site has been created whatsoever :(
BTW. I have disabled the CSRF Token Filter.
UPDATE:
I have verified that the above approach works to create a site on Alfresco Enterprise 4.1.5
I have verified that it also fails on Alfresco Community 4.2.e
This is reported as a bug: https://issues.alfresco.com/jira/browse/MNT-11706
UPDATE: Since the question was not clear to a reader I have reformulated it now
UPDATE:
Following Dave Webster's answer, I have been trying again using the following steps, still with the CSRF token filter disabled:
Login:
curl -v -d @login.txt -H "Content-Type:application/x-www-form-urlencoded" http://localhost:8081/share/page/dologin
Response:
POST /share/page/dologin HTTP/1.1
User-Agent: curl/7.35.0
Host: localhost:8081
Accept: */*
Content-Type:application/x-www-form-urlencoded
Content-Length: 29
* upload completely sent off: 29 out of 29 bytes
< HTTP/1.1 302 Found
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< Set-Cookie: JSESSIONID=058A52486E4EB12F94D1F95302732616; Path=/share/; HttpOnly
< Set-Cookie: alfLogin=1417618589; Expires=Wed, 10-Dec-2014 14:56:29 GMT; Path=/share
< Set-Cookie: alfUsername3=admin; Expires=Wed, 10-Dec-2014 14:56:29 GMT; Path=/share
< Location: http://localhost:8081/share
< Content-Length: 0
< Date: Wed, 03 Dec 2014 14:56:29 GMT
I took the cookie values and inserted them into Dave's code (with the CSRF parts stripped out):
curl 'http://localhost:8081/share/service/modules/create-site' -H 'Cookie: JSESSIONID=058A52486E4EB12F94D1F95302732616; alfLogin=1417618589; alfUsername3=admin;' -H 'Origin: http://localhost:8081' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36' -H 'Content-Type: application/json' -H 'Accept: */*' -H 'Referer: http://localhost:8081/share/page/site/erik/dashboard' -H 'X-Requested-With: application/json' -H 'Connection: keep-alive' --data-binary $'{"visibility":"PUBLIC","title":"erik","shortName":"erik","description":"This site is auto generated","sitePreset":"site-dashboard"}' --compressed
Still no share site generated though, and still a HTTP 200 Response. No errors in the logs either. This is driving me nuts :(
New Update (It works!):
I have now found out that you need to "touch" a Share webscript after making the login call and before calling create-site with a POST. I do this by making a GET request in between. This seems to be needed to initialize the Share session.
This is the curl command I use to generate sites programmatically. I insert the JSESSIONID, LOGINCOOKIECONTENTS and CSRFTOKEN (twice) values manually, but getting them programmatically should work.
curl 'http://localhost:8081/share/service/modules/create-site' -H 'Cookie: JSESSIONID={JSESSIONID}; alfLogin={LOGINCOOKIECONTENTS}; alfUsername3=admin; Alfresco-CSRFToken={CSRFTOKEN};' -H 'Origin: http://localhost:8081' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36' -H 'Content-Type: application/json' -H 'Accept: */*' -H 'Referer: http://localhost:8081/share/page/site/auto-gen-0/dashboard' -H 'X-Requested-With: application/json' -H 'Connection: keep-alive' -H 'Alfresco-CSRFToken: {CSRFTOKEN}' --data-binary $'{"visibility":"PUBLIC","title":"auto-gen'$I'","shortName":"auto-gen-'$I'","description":"This site is auto generated","sitePreset":"site-dashboard"}' --compressed
The expected response is:
{
"success": true
}
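Putting the three steps together (log in, touch a Share webscript with a GET, then POST to create-site), a scripted version might look like the sketch below. It uses only Python's standard library and is untested against a real Alfresco instance. It assumes the CSRF token filter is disabled, as in the question. The function names are made up, and `touch_path` is left to the caller because the thread does not say which webscript was touched.

```python
import http.cookiejar
import json
import urllib.parse
import urllib.request


def build_site_payload(short_name, title, description=""):
    """Build the JSON body for create-site, using the fields from the question."""
    return json.dumps({
        "visibility": "PUBLIC",
        "title": title,
        "shortName": short_name,
        "description": description,
        "sitePreset": "site-dashboard",
    }).encode("utf-8")


def create_site(share_url, username, password, touch_path, payload):
    """Log in, touch a webscript, then POST to create-site.

    share_url is the Share base URL, e.g. http://localhost:8081/share.
    touch_path is a caller-chosen path under share_url (an assumption here).
    Returns the parsed JSON reply, or {} if the body is empty.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # 1. Log in; the jar captures JSESSIONID and the alf* cookies.
    login = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    opener.open(share_url + "/page/dologin", data=login, timeout=30).close()

    # 2. "Touch" a Share webscript with a GET to initialize the session.
    opener.open(share_url + touch_path, timeout=30).close()

    # 3. Create the site; the jar sends the session cookies automatically.
    request = urllib.request.Request(
        share_url + "/service/modules/create-site",
        data=payload,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with opener.open(request, timeout=30) as response:
        return json.loads(response.read() or b"{}")
```

Using a cookie jar avoids copying JSESSIONID by hand between calls. With the CSRF filter enabled, the Alfresco-CSRFToken cookie value would also have to be sent back as a header, as in the curl command above.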