Recently I put some hidden links on a website in order to trap web crawlers (I used the CSS visibility: hidden style so that human users would not see or follow them).
Anyway, I found plenty of HTTP requests to those hidden links carrying User-Agent strings of ordinary browsers.
E.g.: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"
So now my questions are:
(1) Are these web crawlers? If not, what else could they be?
(2) Are they malicious?
(3) Is there a way to profile their behaviour?
I searched the web but couldn't find any valuable information. Can you please point me to some resources? Any help would be appreciated.
What you are seeing is an HTTP User-Agent string. These are not malicious in themselves; the string simply follows a standard pattern, for example Mozilla/<version> and so on. A browser is one kind of user agent. However, User-Agent strings can be spoofed by attackers, and this can be identified by looking for anomalies. You can read this paper.
The Hypertext Transfer Protocol (HTTP) identifies the client software originating the request, using a "User-Agent" header, even when the client is not operated by a user.
The answers to your questions, in order:
(1) They are not web crawlers as such; they are user agents, a term any web developer will recognise.
(2) Generally they aren't malicious, but they can be, as suggested above; have a look at the paper.
(3) I'm not sure what you mean by profiling their behaviour; the User-Agent strings themselves aren't malware.
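If by "profiling" you mean working out which clients actually followed your trap links, one simple option is to group the web server's access-log entries for the trap URL by IP address and User-Agent. Below is a minimal sketch in Python, assuming a combined-format access log; the log path and the trap path /hidden-trap/ are placeholders you would replace with your own values.

# Minimal sketch: count hits on the hidden trap URL per (IP, User-Agent) pair.
import re
from collections import Counter

LOG_FILE = "/var/log/nginx/access.log"   # placeholder: your server's access log
TRAP_PATH = "/hidden-trap/"              # placeholder: the hidden link's URL path

# Combined log format: ip - - [time] "METHOD path proto" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

hits = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = LINE_RE.match(line)
        if m and m.group("path").startswith(TRAP_PATH):
            hits[(m.group("ip"), m.group("agent"))] += 1

for (ip, agent), count in hits.most_common(20):
    print(f"{count:5d}  {ip:15s}  {agent}")

Clients that appear here despite the link being hidden from humans are the ones worth a closer look.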
I have a WordPress site that I manage. I recently received a Qualys vulnerability security scan (non-authenticated scan) that has a large number of "Path Based Vulnerability" findings. Almost all of the paths listed follow this format:
https://www.example.com/search/SomeString
https://www.example.com/search/1/feed/rss2
Some examples include:
https://www.example.com/search/errors
https://www.example.com/search/admin
https://www.example.com/search/bin
When I go to these URLs, I get an appropriate search page response stating, for example, "Search for Admin produced no results".
But, if I go to https://www.example.com/search/ without a string parameter, I get a 404 error (custom error page) stating the page could not be found. All this works like I would expect it to. No sensitive data/pages are being shown.
An example of the Qualys finding is:
150004 Path-Based Vulnerability
URL: https://www.example.com/search/1/feed/rss2/
Finding #: 8346060(130736429)
Severity: Confirmed Vulnerability - Level 2
Unique #: redacted
Group: Path Disclosure
Detection Date: 22 Mar 2021 18:16 GMT-0400
CWE: CWE-22
OWASP: A5 Broken Access Control
WASC: WASC-15 APPLICATION MISCONFIGURATION, WASC-16 DIRECTORY INDEXING, WASC-17 IMPROPER FILESYSTEM PERMISSIONS
CVSS V3 Base: 5.3
CVSS V3 Temporal: 5
CVSS V3 Attack Vector: Network
Details
Threat: A potentially sensitive file, directory, or directory listing was discovered on the Web server.
Impact: The contents of this file or directory may disclose sensitive information.
Solution: Verify that access to this file or directory is permitted. If necessary, remove it or apply access controls to it.
Detection Information
Parameter: No param has been required for detecting the information.
Authentication: In order to detect this vulnerability, no authentication has been required.
Access Path: Here is the path followed by the scanner to reach the exploitable URL:
https://www.example.com
https://www.example.com/?s=1
Payloads
#1 Request
GET https://www.example.com/search/tools/
Referer: https://www.example.com
Cookie: [removed in case it's sensitive]; caosLocalGa=[removed in case it's sensitive];
Host: https://www.example.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
Accept: */*
Based on the findings, this seems to be a false positive. But my CIO insists that I prove it as such. First, is there any documentation on this that might be helpful? Second, does anyone know of any updates to WP that could hide/remove these findings?
(I'd comment, but my rep isn't high enough.)
I can partially answer this, as I am fighting the same battle right now with a different web app. If you run the request in a browser with the developer tools open, I'll bet you'll see that the response code from the server is 200 even though it is actually doing a redirect.
The scanner sees a 200 response code and concludes that the request succeeded as-is, when it really didn't. You have to return a different response code (a redirect or error status) when doing a "silent" redirect.
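If you need hard evidence for your CIO, it can help to record exactly what the server returns for the flagged paths outside of any browser. Below is a minimal sketch in Python, assuming the third-party requests library; the URLs are placeholders taken from the report and should be replaced with the exact ones Qualys flagged.

# Minimal sketch: show the raw status code (no redirects followed) and the
# final status code (redirects followed) for each flagged URL.
import requests

urls = [
    "https://www.example.com/search/1/feed/rss2/",
    "https://www.example.com/search/tools/",
    "https://www.example.com/search/",
]

for url in urls:
    raw = requests.get(url, allow_redirects=False, timeout=10)
    final = requests.get(url, allow_redirects=True, timeout=10)
    print(url)
    print(f"  initial status: {raw.status_code}, Location: {raw.headers.get('Location')}")
    print(f"  final status after redirects: {final.status_code} ({final.url})")

If the initial response for these paths is a 200 that merely contains a "no results" search page, that output is exactly what the scanner is reacting to; returning a 404 or a proper redirect status for them is what the answer above suggests.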
I am exploring the Google PageSpeed Insights API, and in the response I see a field called:
{
...
lighthouse.userAgent:'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/84.0.4147.140 Safari/537.36'
...
}
Docs mention userAgent="The user agent that was used to run this LHR."
https://developers.google.com/speed/docs/insights/rest/v5/pagespeedapi/runpagespeed#LighthouseResultV5
What does that mean? How is this performance aggregated by running on all browsers?
PS: This is for Desktop version.
What does that mean?
This lets you know what browser was used to run the test.
This is useful if you believe there is an issue with Lighthouse (a bug in your report), because you can then test directly in the same browser that Lighthouse uses.
There is also the "environment" object, which contains the user agent Lighthouse presented itself as to the website being tested (it sent a header saying "treat me like this browser"): lighthouseResult.environment.networkUserAgent.
This is useful so you can check that your server isn't blocking requests for that user agent, and for searching your server logs to see which requests Lighthouse made.
See the Wikipedia page on user agents for more information.
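If you want to see both values for your own page, you can query the PageSpeed Insights v5 endpoint directly and read them out of the JSON. A minimal sketch in Python; the target URL is a placeholder, and the field names follow the lighthouseResult structure mentioned above.

# Minimal sketch: fetch a PageSpeed Insights (v5) result and print the browser
# that ran the test plus the User-Agent it presented to the site under test.
import json
import urllib.parse
import urllib.request

target = "https://www.example.com/"  # placeholder: the page you want tested
api = ("https://www.googleapis.com/pagespeedonline/v5/runPagespeed?"
       + urllib.parse.urlencode({"url": target, "strategy": "desktop"}))

with urllib.request.urlopen(api, timeout=60) as resp:
    data = json.load(resp)

lhr = data["lighthouseResult"]
print("Browser that ran the test:    ", lhr.get("userAgent"))
print("User-Agent sent to your site: ", lhr.get("environment", {}).get("networkUserAgent"))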
How is this performance aggregated by running on all browsers?
As for your second question, it doesn't quite make sense as phrased: the result is not aggregated across browsers; the test is run in the single browser identified above. The user agent has no impact on performance unless your server does something different for that particular User-Agent string, if that is what you mean.
While it is important that people use approved browsers downloaded from their legitimate sites, is there any way for the server to detect if someone is spoofing the browser (user agent)?
My question relates in particular to security. What if someone creates a browser (user agent) that does not respect certain contracts (for example, the same-origin policy for cookies) in order to exploit vulnerabilities there? This illegitimate browser could claim to be a genuine user agent by populating the User-Agent header with the standard values used by Firefox or Chrome.
Is there any way on the server side to detect whether the user is using a spoofed user agent, so that the server can take countermeasures if needed? Or is it entirely the responsibility of the individual to use approved browsers only (i.e. servers have no way to detect it)?
Browsers are just high-level user interfaces to HTTP. Prior to the introduction of various security mechanisms there was not much in place to prevent such attacks. Nowadays browsers (Chrome, Firefox, Edge) have restrictions and abide by certain rules/contracts in order to work properly.
One can spoof (send) anything with an HTTP client (a very lean "browser") such as curl:
curl -A "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0" http://blah.nonexistent.tld
(The default curl header is something like User-Agent: curl/7.16.3)
A server could restrict or detect uncommon User-Agents to prevent scraping or scanning, but the user agent is nothing but a "name" and could just be changed to a common one as done above.
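To make that concrete, here is a minimal sketch of the kind of naive server-side User-Agent filter this refers to; the blocked tokens and the port are illustrative assumptions. A client that simply sends a browser-like string, as with curl -A above, walks straight past it, which is the point.

# Minimal sketch: reject a few well-known tool User-Agents. Trivially bypassed
# by any client that sets a browser-like User-Agent string.
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED_TOKENS = ("curl", "wget", "python-requests")  # illustrative only

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "").lower()
        if any(token in ua for token in BLOCKED_TOKENS):
            self.send_error(403, "User-Agent not allowed")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()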
The security mechanisms (contracts) that have been added, such as the same-origin policy, Cross-Origin Resource Sharing (CORS) and HttpOnly cookies, are there to protect both the client (browser) and the server. They must be implemented by both sides in order to function properly (securely) and protect against an attack like the one you mention. If the client and the server don't properly honour the agreed contracts, cookies could be exfiltrated (although a modern browser is designed to fail fast and would still prevent this).
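For illustration, a minimal sketch of what those contracts look like from the server's side; the cookie value and the allowed origin are made-up placeholders. A compliant browser enforces these, while a client that ignores them only endangers its own users.

# Minimal sketch: response headers implementing the contracts mentioned above.
def security_headers() -> dict[str, str]:
    return {
        # HttpOnly keeps the session cookie out of reach of page JavaScript;
        # Secure restricts it to HTTPS; SameSite limits cross-site sending.
        "Set-Cookie": "session=abc123; HttpOnly; Secure; SameSite=Lax",
        # CORS: only scripts running on this origin may read cross-origin responses.
        "Access-Control-Allow-Origin": "https://app.example.com",
    }

if __name__ == "__main__":
    for name, value in security_headers().items():
        print(f"{name}: {value}")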
If you mean that you would create your own browser, set its User-Agent to Chrome's and ignore the contracts relied on by properly configured servers, then those servers might simply ignore you; and what user cookies would you steal from a "custom" browser that few people actually use?
Let's say that a page is just printing the value of the HTTP 'referer' header with no escaping. So the page is vulnerable to an XSS attack, i.e. an attacker can craft a GET request with a referer header containing something like <script>alert('xss');</script>.
But how can you actually use this to attack a target? How can the attacker make the target issue that specific request with that specific header?
This sounds like a standard reflected XSS attack.
In reflected XSS attacks, the attacker needs the victim to visit a site that is in some way under the attacker's control, even if that is just a forum where the attacker can post a link in the hope somebody will follow it.
In the case of a reflected XSS attack via the referer header, the attacker could redirect the user from the forum to a page on the attacker's domain, e.g.
http://evil.example.com/?<script>alert(123)</script>
This page in turn redirects to the following target page in a way that preserves referer.
http://victim.example.org/vulnerable_xss_page.php
Because this page shows the referer header without proper escaping, http://evil.example.com/?<script>alert(123)</script> is output within the HTML source, executing the alert. Note this works in Internet Explorer only.
Other browsers will automatically encode the URL, rendering %3cscript%3ealert%28123%29%3c/script%3e instead, which is safe.
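For concreteness, here is a minimal sketch of the kind of vulnerable page described in the question, one that writes the Referer header into its HTML without escaping; the handler and port are illustrative assumptions, not code from the original post.

# Minimal sketch: reflect the Referer header into HTML with no escaping (the bug).
from http.server import BaseHTTPRequestHandler, HTTPServer

class VulnerableHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        referer = self.headers.get("Referer", "")
        # BUG: attacker-controllable header concatenated straight into HTML.
        body = f"<p>You came from: {referer}</p>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), VulnerableHandler).serve_forever()

You can confirm the reflection outside any browser with curl -e "<script>alert(123)</script>" http://127.0.0.1:8000/ and inspect the returned HTML; the answer above is about getting a victim's browser to send such a Referer for you.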
I can think of a few different attacks; maybe there are more, which others will hopefully add. :)
If your XSS is just some header value reflected unencoded in the response, I would say it is less of a risk than stored XSS, though there are factors to consider. For example, if it is a header that the browser adds and that can be set in the browser (like the user agent), an attacker could get access to a client computer, change the user agent, and then let a normal user use the website, now with the attacker's JavaScript injected. Another example: a website may display the URL that referred you there (the referer); in this case the attacker only has to link to the vulnerable application from their carefully crafted URL. These are edge cases, though.
If it's stored, that's more straightforward. Consider an application that logs user access with all request headers, and let's suppose there is an internal application for admins that they use to inspect logs. If this log viewer application is web based and vulnerable, any javascript from any request header could be run in the admin context. Obviously this is just one example, it doesn't need to be blind of course.
Cache poisoning may also help with exploiting a header XSS.
Another thing I can think of is browser plugins. Flash is less prevalent now (thankfully), but with different versions of Flash you could set different request headers on your requests. What exactly you can and cannot set is a mess and very confusing across Flash plugin versions.
So there are several attacks, and it is necessary to treat all headers as user input and encode them accordingly.
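As a counterpart, here is a minimal sketch of the encoding that last sentence calls for, using Python's html.escape as one possible way to neutralise a reflected header value.

# Minimal sketch: treat request headers as untrusted input and HTML-escape them
# before reflecting them into a response.
import html

def render_referer_banner(referer_header: str) -> str:
    # html.escape turns < > & " ' into entities, so a payload such as
    # <script>alert(123)</script> is rendered as inert text.
    safe = html.escape(referer_header, quote=True)
    return f"<p>You came from: {safe}</p>"

if __name__ == "__main__":
    print(render_referer_banner("<script>alert(123)</script>"))
    # <p>You came from: &lt;script&gt;alert(123)&lt;/script&gt;</p>

If the value is also placed inside an attribute such as an href (as in the next answer), the attribute context and the URL scheme need to be validated as well.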
Exploiting XSS via the referer header is much like a traditional reflected XSS. The one additional point is that the attacker's website redirects to the victim website, so a referer header containing the required JavaScript is attached to the request for the victim site.
One essential point worth discussing here: why is this vulnerability said to be exploitable only in IE, and not in other browsers?
The traditional answer is that Chrome and Firefox automatically encode URL parameters, so the XSS is not possible. But the interesting thing is that we have any number of bypasses for traditional XSS filters, so why couldn't there be bypasses for this scenario too?
Yes, we can bypass it with the following payload, in the same way HTML validation is bypassed with a traditional payload:
http://evil.example.com/?alert(1)//cctixn1f
Here the response could be something like this:

The link on the referring page seems to be wrong or outdated.

(End of response. Here "referring" is rendered as a link whose href is built from the referer header.)
If the victim clicks the "referring" link, the alert will fire.
Bottom line: not only in IE; XSS can be possible even in browsers like Chrome and Firefox when the referer is used as part of an href attribute.
Is there any difference in performance when using Googlebot vs Mozilla as the CURLOPT_USERAGENT? I'm hypothesizing that some pages might output simpler HTML when Googlebot is the user agent, but I don't really know.
Setting a user agent won't make your cURL call faster. In fact, some servers deliberately block requests with suspicious User-Agent strings, so simply using the default user agent is fine.
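If you want to test that hypothesis for a particular site, you can compare response size and timing for the two User-Agent values. Below is a minimal sketch in Python rather than via CURLOPT_USERAGENT; the target URL is a placeholder and the Chrome string is just an example desktop value.

# Minimal sketch: compare status, size and timing of the same page fetched with
# a Googlebot User-Agent and with a regular desktop-browser User-Agent.
import time
import urllib.error
import urllib.request

TARGET = "https://www.example.com/"  # placeholder: the page you are fetching

AGENTS = {
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36",
}

for label, ua in AGENTS.items():
    req = urllib.request.Request(TARGET, headers={"User-Agent": ua})
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        status, body = err.code, err.read()
    elapsed = time.perf_counter() - start
    print(f"{label:10s} status={status} bytes={len(body)} time={elapsed:.2f}s")

If the two byte counts and timings come out essentially the same, the site does not treat "Googlebot" specially, and the user agent makes no difference to your cURL call's performance.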