I have a problem with my server. I have four inbound links to different pages of my dynamic website which look something like this:
myurl.com/default/Site%3Fid%3D13
They should look like this:
myurl.com/default/Site?id=13
I know that %3F is an escape sequence for the ? sign and %3D is an escape sequence for the equals sign, but I get a 400 error when I use those links. What can I do about that?
The four links point to different pages, and I imagine over time there will be more links like this, so one fix for all of them would be perfect.
The exact same question was actually asked on the nginx-ru mailing list about a year ago:
http://mailman.nginx.org/pipermail/nginx-ru/2013-February/050200.html
The most helpful response, by Nginx, Inc. employee/developer Валентин Бартенев:
http://mailman.nginx.org/pipermail/nginx-ru/2013-February/050209.html
Если запрос приходит в таком виде, то это уже не параметры, а имя запрошенного
файла. Другое дело, что location ищется по уже раскодированному адресу, о чем в
документации написано.
Translation:
If the request comes in such a form, then these are no longer the args, but the name of the requested file. Another thing is that, as documented, the location matching is performed against a normalised URI.
His suggested solution, adapted to the example from the question here, would then be:
location /default/Site? {
    rewrite \?(.*)$ /default/Site?$1? last;
}

location = /default/Site {
    [...]
}
The following sample redirects all wrong-looking requests (defined as having a ? in the requested filename, i.e. encoded as %3F in the request) into less wrong-looking ones, regardless of the URL.
(Please note that, as rightly advised elsewhere, you should not be getting these badly formed links in the first place, so use this only as a last resort, when you cannot correct the badly formed links otherwise and you know that such requests are attempted by valid agents.)
server {
    listen [::]:80;
    server_name localhost;

    rewrite ^/([^?]*)\?(.*)$ /$1?$2? permanent;

    location / {
        return 200 "id is $arg_id\n";
    }
}
Here is an example of how it works: when a wrong-looking request is encountered, a correction attempt is made with a 301 Moved Permanently response carrying a presumably correct Location response header, which makes the browser automatically re-issue the request to the newly provided location:
opti# curl -6v "http://localhost/default/Site%3Fid%3D13"
* About to connect() to localhost port 80 (#0)
* Trying ::1...
* connected
* Connected to localhost (::1) port 80 (#0)
> GET /default/Site%3Fid%3D13 HTTP/1.1
> User-Agent: curl/7.26.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.4.1
< Date: Wed, 15 Jan 2014 17:09:25 GMT
< Content-Type: text/html
< Content-Length: 184
< Location: http://localhost/default/Site?id=13
< Connection: keep-alive
<
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.4.1</center>
</body>
</html>
* Connection #0 to host localhost left intact
* Closing connection #0
Note that no correction attempts are made on proper-looking requests:
opti# curl -6v "http://localhost/default/Site?id=13"
* About to connect() to localhost port 80 (#0)
* Trying ::1...
* connected
* Connected to localhost (::1) port 80 (#0)
> GET /default/Site?id=13 HTTP/1.1
> User-Agent: curl/7.26.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/1.4.1
< Date: Wed, 15 Jan 2014 17:09:30 GMT
< Content-Type: application/octet-stream
< Content-Length: 9
< Connection: keep-alive
<
id is 13
* Connection #0 to host localhost left intact
* Closing connection #0
The URL is perfectly valid. The escaped characters it contains are just that: escaped, which is perfectly fine.
The point is that you can actually have a request name (in most cases corresponding to a filename on disk) that literally is Site?id=13, rather than Site with the rest as a query string.
I would consider it bad practice to use characters in a filename that make this escaping necessary. In URL arguments, however, it may very well be necessary.
Nevertheless, while the request URL is valid, it is probably not what you want it to be, which suggests that you should correct the error wherever the wrong URL was picked up in the first place.
I do not really understand why you get a 400 error; I would rather expect a 404. But that depends on your setup.
There are also cases, especially with nginx, that involve passing whole URLs and URL parts along multiple levels (for example reverse proxies, or matching regular expressions against the URL and using the captures as variables) where such an error may occur. But to verify and fix that, we would need to know more about your setup.
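One quick way to narrow this down (a sketch only, with myurl.com standing in for your real host) is to request both the encoded and the decoded form with curl and compare what comes back; the Server header and the body of the 400 or 404 page usually reveal whether nginx itself or the upstream application rejected the request.
# Request the broken inbound-link form and the intended form,
# then compare status codes, Server headers, and error bodies.
curl -v "http://myurl.com/default/Site%3Fid%3D13"
curl -v "http://myurl.com/default/Site?id=13"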
Related
What one gets back when resolving a DOI depends on content negotiation.
I was looking at https://citation.crosscite.org/docs.html#sec-3
and I see different services offer different Content Types.
For a particular URL I want to know all the content types it can give me.
Some of them might be more useful than the ones I am already aware of (i.e. I don't want to write a list of preferences in advance).
For example:
https://doi.org/10.5061/dryad.1r170
I thought maybe OPTIONS was the way to do it, but that gave back nothing interesting, only information about the allowed request methods.
shell> curl -v -X OPTIONS http://doi.org/10.5061/dryad.1r170
* Hostname was NOT found in DNS cache
* Trying 2600:1f14:6cf:c01::d...
* Trying 54.191.229.235...
* Connected to doi.org (2600:1f14:6cf:c01::d) port 80 (#0)
> OPTIONS /10.5061/dryad.1r170 HTTP/1.1
> User-Agent: curl/7.38.0
> Host: doi.org
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Allow: GET, HEAD, POST, TRACE, OPTIONS
< Content-Length: 0
< Date: Mon, 29 Jan 2018 07:01:14 GMT
<
* Connection #0 to host doi.org left intact
I guess there is no such standard yet, but the Link header (https://www.w3.org/wiki/LinkHeader) could expose this information.
Personally, though, I wouldn't rely too much on it. For example, a server could start sending a new content type and still NOT expose it via this header.
It might be useful to check the API response headers regularly, by manual or automated means, for any changes.
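Lacking a discovery mechanism, one workable approach is to probe the media types listed in the crosscite documentation one at a time and see which ones the resolver honours. A sketch with curl, assuming the resolver answers HEAD the same way as GET (if not, drop -I):
# -L follows the DOI handle redirects, -I asks for headers only, and the
# Accept header requests one candidate media type; the last Content-Type
# line printed is the negotiated one.
curl -sLI -H "Accept: application/vnd.citationstyles.csl+json" \
  "https://doi.org/10.5061/dryad.1r170" | grep -i '^content-type'
Repeat with other types from the docs (for example application/x-bibtex or text/x-bibliography) and note which requests come back with the type you asked for rather than a generic HTML landing page.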
I'm trying to write HTTP requests by hand to understand the correct header syntax. Here I'm trying to PUT some text to httpbin.org/put, and I expect the response body to contain the same text.
PUT /HTTP/1.1
Host: httpbin.org
Accept-Language: en-us
Connection: Keep-Alive
Content-type: text/plain
Content-Length: 12
Hello jerome
However, I'm getting the following 400 Bad Request response:
HTTP/1.1 400 Bad Request
Server: nginx
Date: Tue, 01 Mar 2016 12:34:02 GMT
Content-Type: text/html
Content-Length: 166
Connection: close
Response:
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx</center>
</body>
</html>
What syntax errors have I made?
NOTE: newlines are \r\n not \n in the request.
Apparently the correct syntax goes like this for PUT:
PUT /put HTTP/1.1\r\n
Content-Length: 11\r\n
Content-Type: text/plain\r\n
Host: httpbin.org\r\n\r\n
hello lala\n
I realise I didn't say much about how I connected to httpbin.org: it was via sockets in C, so the connection was already established before sending the headers and message.
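For comparison, roughly the same request can be made with curl, which fills in the Host and Content-Length headers automatically; this is a handy way to see what a well-formed request and response look like before reproducing them over a raw socket:
# PUT a short text/plain body to httpbin's /put handler and show the full exchange.
curl -v -X PUT -H "Content-Type: text/plain" -d "hello lala" http://httpbin.org/put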
You are missing the destination URL after the PUT verb; the first line must be:
PUT http://httpbin.org/ HTTP/1.1
This will probably also fail; you need one of their handler URLs so they know what to reply with:
PUT http://httpbin.org/put HTTP/1.1
The general form of the first line, or Request Line, in an HTTP request is as follows:
<method> <path component of URL, or absolute URL> HTTP/<Version>\r\n
Where, for your example, the method is PUT. Including an absolute URL (one starting with http:// or https://) is only necessary when connecting to a proxy, because the proxy will then attempt to retrieve that URL rather than serve a local resource (as found by the path component).
As presented, the only change you should have needed to make was ensuring there was a space between the / and HTTP/1.1. Otherwise, the path would be "/HTTP/1.1"... which would be a 404, if it weren't already a badly formed request. /HTTP/1.1 being interpreted as a path means the HTTP server that's parsing your request line doesn't find the protocol specifier (the HTTP/1.1 bit) before the terminating \r\n... and that's one example of how 400 response codes are born.
Hope that helps. Consult the HTTP/1.1 specification (RFC 2616), section 5.1, for more information and the official definitions.
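If you want to keep hand-writing raw requests, a sketch like the following sends the corrected request line over a plain TCP connection (assuming nc is installed and httpbin.org still answers plain HTTP on port 80):
# Note the space between the path (/put) and HTTP/1.1, which was missing in
# the original request line; Content-Length matches the 12-byte body.
printf 'PUT /put HTTP/1.1\r\nHost: httpbin.org\r\nContent-Type: text/plain\r\nContent-Length: 12\r\nConnection: close\r\n\r\nHello jerome' | nc httpbin.org 80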
Following config is working for me:
server {
    listen 80;
    root /app/web;
    index index.json;

    location / {
        return 409;
    }
}
If I hit the website, the 409 page is presented. However, the following is not working:
server {
    listen 80;
    root /app/web;
    index index.json;

    location / {
        return 409 "foobar";
    }
}
The page is unreachable. But according to the docs http://nginx.org/en/docs/http/ngx_http_rewrite_module.html#return
return 409 "foobar";
should work. Any ideas what's wrong? There is nothing in nginx/error.log.
The thing is, Nginx does exactly what you ask it to do. You can verify this by calling curl -v http://localhost (or whatever hostname you use). The result will look somewhat like this:
* Rebuilt URL to: http://localhost/
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 409 Conflict
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
< Date: Fri, 08 May 2015 19:43:12 GMT
< Content-Type: application/octet-stream
< Content-Length: 6
< Connection: keep-alive
<
* Connection #0 to host localhost left intact
foobar
As you can see, Nginx returns both 409 and foobar, as you ordered.
So the real question here is why your browser shows the nicely formatted error page when there is no custom text after the return code, but the gray "unreachable" one when such text is present.
And the answer is: because of the Content-Type header value.
The HTTP standard states that some response codes should or must come with the response body. To comply with the standard, Nginx does this: whenever you return a special response code without the required body, the web server sends its own hardcoded HTML response to the client. And a part of this response is the header Content-Type: text/html. This is why you see that pretty white error page, when you do return 409 without the text part — because of this header your browser knows that the returned data is HTML and it renders it as HTML.
On the other hand, when you do specify the text part, there is no need for Nginx to send its own version of the body. So it just sends back to the client your text, the response code and the value of Content-Type that matches the requested file (see /etc/nginx/mime.types).
When there is no file, like when you request a folder or a site root, the default MIME type is used instead. And this MIME type is application/octet-stream, which defines some binary data. Since most browsers have no idea how to render random binary data, they do the best they can, that is, they show their own hardcoded error pages.
And this is why you get what you get.
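You can see the difference directly in the response headers; a quick check against the two configurations above (localhost assumed):
# Run this against each configuration: with plain `return 409;` you should
# see text/html (nginx's own error page), with `return 409 "foobar";` the
# fallback application/octet-stream.
curl -sD - -o /dev/null http://localhost/ | grep -i '^content-type'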
Now, if you want to make your browser show your foobar, you need to send a suitable Content-Type, something like text/plain or text/html. Usually this could be done with add_header, but not in your case, because that directive works only with a limited list of response codes (200, 201, 204, 206, 301, 302, 303, 304, or 307).
The only other option I see is to rewrite your original request to something familiar to Nginx, so that it could use a value from /etc/nginx/mime.types for Content-Type:
server {
    listen 80;
    root /app/web;
    index index.json;

    location / {
        rewrite ^.*$ /index.html;
        return 409 "foobar";
    }
}
This might seem somewhat counter-intuitive, but it works.
EDIT:
It appears that the Content-Type can be set with the default_type directive. So you can (and should) use default_type text/plain; instead of the rewrite line.
Updating @ivan-tsirulev's answer:
By now you can set headers even for error status codes, by adding the always parameter to add_header.
location @custom_error_page {
    return 409 "foobar";
    add_header Content-Type text/plain always;
}
But if you also set default_type, the response will contain two Content-Type headers: the default one first, then the added one. Nevertheless, it works fine.
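To check exactly what reaches the client (the named location above only takes effect once it is wired up via error_page or try_files elsewhere in the config), include the response headers in curl's output:
# -i prints the response headers, so a duplicated Content-Type header is
# easy to spot right next to the "foobar" body.
curl -si http://localhost/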
I want to log in to a website using RCurl and grab data that cannot be seen without logging in.
I want to pull this URL (for example) into R after logging in with RCurl: "http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone". The issue is that I cannot log in using RCurl. I haven't tried this before, so I mostly referred to http://www.omegahat.org/RCurl/philosophy.html.
So here's what I tried (here, 'me#gmail.com' is my user ID and '9999' is my password; I just made them up):
library(RJSONIO)
library(rjson)
library(RCurl)
appannie <- getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/.json?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone, userpwd = me#gmail.com:9999", verbose = TRUE)
But this gave me the message below:
About to connect() to www.appannie.com port 80 (#0)
* Trying 69.167.138.64... * connected
* Connected to www.appannie.com (69.167.138.64) port 80 (#0)
> GET /app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone HTTP/1.1
Host: www.appannie.com
Accept: */*
< HTTP/1.1 403 FORBIDDEN
< Server: nginx/1.1.19
< Date: Fri, 01 Mar 2013 23:41:32 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=10
< Vary: Accept-Encoding
< Vary: Cookie,Accept-Encoding
<
* Connection #0 to host www.appannie.com left intact
So I went back and read http://www.omegahat.org/RCurl/philosophy.html again, still didn't know what to do, and then tried the following after seeing a similar question on Stack Overflow:
getURL("http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone", .opts=list(userpwd="me#gmail.com:9999"))
But this gives me the output below.
[1] ""
Can anyone give me a hint? (After a bunch of different attempts, the website has started sending me warnings.)
This is most likely some sort of authentication issue, not anything you did wrong with RCurl.
You got through to the server, but either your login was incorrect or invalid, or the data is not available via the API.
http://en.wikipedia.org/wiki/HTTP_403
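One thing worth checking before debugging the R side (a sketch only, reusing the placeholder credentials from the question): RCurl's userpwd option corresponds to curl's -u flag, i.e. HTTP Basic authentication. If the shell command below also returns 403, the site most likely expects a browser-style form login with session cookies, and no userpwd value will ever get you in.
# Try the same request with HTTP Basic credentials from the command line.
curl -v -u "me#gmail.com:9999" \
  "http://www.appannie.com/app/ios/instagram/ranking/history/chart_data/?s=2010-10-06&e=2012-06-04&c=143441&f=ranks&d=iphone"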
I know HTTP keep-alive is on by default in HTTP 1.1 but I want to find a way to confirm that it is actually working.
Does anyone know of a simple way to test this from a web browser (e.g. how to make sense of Wireshark)? I know I need to look for multiple HTTP requests over the same TCP connection, but I don't know how to confirm that in Wireshark or any other way.
Thanks!
As Ron Garrity said on ServerFault, you can use Curl like this:
curl -Iv http://www.aptivate.org 2>&1 | grep -i 'connection #0'
And it outputs these two lines if keep-alive is working:
* Connection #0 to host www.aptivate.org left intact
* Closing connection #0
And if keep-alive is not working, then it just outputs this line:
* Closing connection #0
If you're on Windows Vista or later, you can use Resource Monitor. Its Network tab lists all open TCP connections and the process that started each one. Open a browser with one tab, browse to your page, and check whether the connection stays open.
First, try to capture the traffic to the target website in Wireshark and limit it to what you need with a filter like:
tcp port 80 and host targetwebsite.com
Then load the page in a browser or fetch it with any tool you have. If the target web page refreshes itself or one of the values on it, leave it open until you have seen at least one refresh.
Now you have enough data, and you can stop the capture in Wireshark.
You should see dozens of records whose protocol is TCP or HTTP. For this quick check you will not need the TCP records, so let's remove them by applying another filter: at the top of the window there is a "filter" field; type http there, and Wireshark will hide all records except those with an HTTP protocol.
Now select a record and look at the next level of detail, in the second pane below the record list. To make sure you are looking at the right place: the first line there starts with "Frame XYZ", and the fourth line starts with "Transmission Control Protocol". Look for the port numbers after "Src Port:" and "Dst Port:". Depending on the record, one of these numbers belongs to the web server (typically 80) and the other one is the port on your end.
Now check a couple of different GET records (to know whether a record is a GET request, check the Info column). If the same port number on your end is used for several requests, those requests were made over an HTTP keep-alive connection.
Remember that most browsers open multiple connections even if the web server supports keep-alive, so DO NOT conclude your evaluation from finding just one different port.
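If you prefer the command line, tshark (Wireshark's command-line tool) can do the same check in one step; a sketch reusing the capture filter from above (you may need -i to pick the right interface):
# Print, for every HTTP request, the client-side source port plus the method
# and URI; the same source port appearing on several requests means those
# requests shared one keep-alive connection.
tshark -f "tcp port 80 and host targetwebsite.com" -Y "http.request" \
  -T fields -e tcp.srcport -e http.request.method -e http.request.uri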
The most accurate way is to curl the same URL multiple times.
curl -v http://weibo.com -o /dev/null http://weibo.com -o /dev/null
If the output contains Re-using existing connection, then the HTTP keep-alive feature is working. For example,
* TCP_NODELAY set
* Connected to weibo.com (180.149.138.251) port 80 (#0)
> GET / HTTP/1.1
> Host: weibo.com
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 301 Moved Permanently
< ...
< ...
<
{ [236 bytes data]
* Connection #0 to host weibo.com left intact
* Found bundle for host weibo.com: 0x56324121d9a0 [serially]
* Can not multiplex, even if we wanted to!
* Re-using existing connection! (#0) with host weibo.com
* Connected to weibo.com (180.149.138.251) port 80 (#0)
> GET / HTTP/1.1
> Host: weibo.com
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 301 Moved Permanently
< ...
< ...
<
{ [236 bytes data]
* Connection #0 to host weibo.com left intact
Another quick way is to test with ab (Apache Bench). But some HTTP servers, such as uwsgi, might not return the Connection: keep-alive header even when keep-alive is enabled, and in such cases ab does NOT send keep-alive requests. That means ab can only do "positive" detection of HTTP keep-alive.
ab -c 5 -n 50 -k https://www.google.com/
If the result shows
...
Complete requests: 50
Failed requests: 0
Keep-Alive requests: 50 # Pay attention to this line
Total transferred:
...
then HTTP keep-alive is enabled.