I am exploring the Google PageSpeed Insights API, and in the response I see a field called:
{
...
lighthouse.userAgent:'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/84.0.4147.140 Safari/537.36'
...
}
The docs describe userAgent as "The user agent that was used to run this LHR."
https://developers.google.com/speed/docs/insights/rest/v5/pagespeedapi/runpagespeed#LighthouseResultV5
What does that mean? How is this performance aggregated by running on all browsers?
PS: This is for Desktop version.
What does that mean?
This lets you know what browser was used to run the test.
It is useful if you believe there is an issue with Lighthouse (a bug in your report), because you can then test directly in the same browser that Lighthouse used.
There is also the "environment" object, which records how Lighthouse presented itself to the website being tested (it sent a header saying "treat me like this browser"). See lighthouseResult.environment.networkUserAgent.
This is useful for checking that your server isn't blocking requests for that user agent.
It is also useful for searching your server logs to see which requests Lighthouse made.
See the Wikipedia page for user agent for more info on user agents.
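As a rough illustration of where the two values live, here is a minimal Python sketch that pulls both user agent fields out of a parsed response. The sample JSON is a trimmed, hypothetical response: the `userAgent` value is the one quoted in the question, the `networkUserAgent` value is a placeholder, and the helper name `extract_user_agents` is made up for this example.

```python
import json

# Trimmed, hypothetical PageSpeed Insights v5 response. Only the two
# user agent fields discussed above are kept; the networkUserAgent
# value is a placeholder, not real API output.
SAMPLE_RESPONSE = json.loads("""
{
  "lighthouseResult": {
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/84.0.4147.140 Safari/537.36",
    "environment": {
      "networkUserAgent": "example-network-user-agent"
    }
  }
}
""")


def extract_user_agents(response):
    """Return the browser that ran the test and the UA sent to the site."""
    lhr = response.get("lighthouseResult", {})
    return {
        "host": lhr.get("userAgent"),
        "network": lhr.get("environment", {}).get("networkUserAgent"),
    }


agents = extract_user_agents(SAMPLE_RESPONSE)
print(agents["host"])
print(agents["network"])
```

The "network" value is the string to search for in your server logs or to check against any blocking rules.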
How is this performance aggregated by running on all browsers?
Your second question doesn't quite make sense as asked, but the user agent has no impact on performance unless your server does something different for that user agent string, if that is what you mean.
While people should use approved browsers downloaded from their legitimate sites, is there any way for the server to detect whether someone is spoofing the browser (user agent)?
My question is specifically about security. What if someone creates a browser (user agent) that does not respect some contracts (for example, the same-origin policy for cookies) in order to exploit vulnerabilities? This illegitimate browser can claim to be a genuine user agent by populating the User-Agent header with the standard values used by Firefox or Chrome.
Is there any way on the server side to detect that the user is using a spoofed user agent, so that the server can take countermeasures if needed? Or is it entirely the responsibility of the individual to use approved browsers only (because servers have no way to detect it)?
Browsers are just high-level user interfaces to HTTP. Before the introduction of various security methods, there was not much in place to prevent such attacks. Nowadays browsers (Chrome, Firefox, Edge) have restrictions and abide by certain rules/contracts (in order to work properly).
One can spoof (send) anything with an HTTP client (a very lean "browser") such as curl.
curl -A "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0" http://blah.nonexistent.tld
(The default curl header is something like User-Agent: curl/7.16.3)
A server could restrict or detect uncommon User-Agents to prevent scraping or scanning, but the user agent is nothing but a "name" and could just be changed to a common one as done above.
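To make the point concrete, here is a small Python sketch of the same spoof as the curl command above, paired with a deliberately naive server-side check. The function `looks_like_common_browser` is a hypothetical filter invented for this illustration, not something a real server framework provides. The request is only built, never sent.

```python
import urllib.request

SPOOFED_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) "
    "Gecko/20100101 Firefox/59.0"
)


def looks_like_common_browser(user_agent):
    """Naive, hypothetical server-side check based only on the UA string."""
    return user_agent.startswith("Mozilla/5.0")


# Build (but do not send) a request carrying an arbitrary User-Agent.
request = urllib.request.Request(
    "http://blah.nonexistent.tld",
    headers={"User-Agent": SPOOFED_UA},
)

# urllib stores header names capitalized, hence "User-agent" here.
sent_ua = request.get_header("User-agent")

print(looks_like_common_browser("curl/7.16.3"))  # the default curl UA is rejected
print(looks_like_common_browser(sent_ua))        # the spoofed UA passes
```

Any check that looks only at the header value can be bypassed this way, which is why the string cannot serve as proof of which client is really on the other end.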
The security methods (contracts) that have been added, such as the Same-Origin Policy, Cross-Origin Resource Sharing (CORS), and HttpOnly cookies, are there to protect the client (browser) and the server. Both must implement them in order to function properly (securely) and protect against an attack like the one you describe. If the client and the server don't properly use the agreed contracts, cookies could be exfiltrated (a modern browser is designed to fail fast and would still prevent this).
If you mean creating your own browser, setting its User-Agent to Chrome's, and ignoring the contracts that properly configured servers rely on, then those servers might simply ignore you. And what user cookies would you steal from a "custom" browser that few people use?
I have a desktop application that uses CEF to display a built-in web page.
I have customized the User-Agent (Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) DesktopApp MyAppName/1.0 (MyApp release 1.0 stamp 99999) Safari/537.36), but Google Analytics only shows it as Safari 537.36.
Does GA support browsers outside the known universe of real browsers when it looks up the browser used? I would like this to show as MyApp rather than Safari or Chrome.
I just looked at my browser reports and unless "aaa", "ddd" and "this is a test ua" are actually existing browsers it would seem that GA also tracks unknown user agents.
More seriously, the Measurement Protocol (on top of which Google Analytics is built) allows for a user agent override parameter (&ua), which would make very little sense if you could only pass in known browser names (after all, this is meant to support, e.g., IoT devices, which might not even have a real user agent name).
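For illustration, a hit with a user agent override could be assembled roughly as follows. This is a sketch that assumes the Universal Analytics version of the Measurement Protocol and its `v`, `tid`, `cid`, `t`, `dp` and `ua` parameters. The tracking ID, client ID and page path are placeholders, and `build_pageview_payload` is a hypothetical helper name. Nothing is sent over the network here.

```python
from urllib.parse import urlencode, parse_qs


def build_pageview_payload(tracking_id, client_id, page, user_agent):
    """Encode a pageview hit whose user agent is overridden via `ua`."""
    params = {
        "v": "1",           # protocol version
        "tid": tracking_id,  # property ID (placeholder below)
        "cid": client_id,    # anonymous client ID (placeholder below)
        "t": "pageview",     # hit type
        "dp": page,          # document path
        "ua": user_agent,    # user agent override
    }
    return urlencode(params)


payload = build_pageview_payload("UA-XXXXX-Y", "555", "/home", "MyAppName/1.0")
print(payload)
```

Whether the browser report then shows the custom name still depends on how GA parses the string, so treat this as a way to experiment rather than a guaranteed fix.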
Recently I put some hidden links in a web site in order to trap web crawlers. (I used the CSS visibility: hidden style so that human users would not access them.)
Anyway, I found plenty of HTTP requests to the hidden links that carried browser user agent strings.
E.g : "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"
So now my problems are:
(1) Are these web crawlers? If not, what else could they be?
(2) Are they malicious?
(3) Is there a way to profile their behaviour?
I searched the web but couldn't find any valuable information. Could you point me to some resources? Any help would be appreciated.
This is an HTTP user agent string. User agent strings are not malicious in themselves; this one follows the usual pattern, for example Mozilla/<version> and so on. A browser is one example of a user agent. However, attackers can send such strings too, and that can be identified by looking for anomalies. You can read this paper.
The Hypertext Transfer Protocol (HTTP) identifies the client software originating the request, using a "User-Agent" header, even when the client is not operated by a user.
The answers to your questions are, in order:
(1) The string itself does not identify a web crawler; it is a user agent string, a common term among web developers.
(2) Generally they aren't malicious, but they can be; as I suggested, look at the paper.
(3) I don't understand what you mean by profiling behaviour; they aren't malware!
Is there any difference in performance when using Googlebot vs Mozilla as the CURLOPT_USERAGENT? I'm hypothesizing that some pages might output simpler HTML when Googlebot is the useragent, but I don't really know.
Setting a user agent won't make your cURL call faster. In fact, some servers block suspicious user agents, so just using the default user agent is fine.
Is it possible to obtain raw logs from Google Analytics? Is there any tool that can generate the raw logs from GA?
No you can't get the raw logs, but there's nothing stopping you from getting the exact same data logged to your own web server logs. Have a look at the Urchin code and borrow that, changing the following two lines to point to your web server instead.
var _ugifpath2="http://www.google-analytics.com/__utm.gif";
if (_udl.protocol=="https:") _ugifpath2="https://ssl.google-analytics.com/__utm.gif";
You'll want to create a __utm.gif file so that they don't show up in the logs as 404s.
Obviously you'll need to parse the variables out of the hits into your web server logs. The log line in Apache looks something like this. You'll have lots of "fun" parsing out all the various stuff you want from that, but everything Google Analytics gets from the basic JavaScript tagging comes in like this.
127.0.0.1 - - [02/Oct/2008:10:17:18 +1000] "GET /__utm.gif?utmwv=1.3&utmn=172543292&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r124&utmdt=My%20Web%20Page&utmhn=www.mydomain.com&utmhid=979599568&utmr=-&utmp=/urlgoeshere/&utmac=UA-1715941-2&utmcc=__utma%3D113887236.511203954.1220404968.1222846275.1222906638.33%3B%2B__utmz%3D113887236.1222393496.27.2.utmccn%3D(organic)%7Cutmcsr%3Dgoogle%7Cutmctr%3Dsapphire%2Btechnologies%2Bsite%253Arumble.net%7Cutmcmd%3Dorganic%3B%2B HTTP/1.0" 200 35 "http://www.mydomain.com/urlgoeshere/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/0.2.153.1 Safari/525.19"
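The parsing step can be sketched in Python along these lines. This assumes the Apache "combined" log format shown above; the sample line is a shortened copy of that log line, and the names `LOG_PATTERN` and `parse_utm_hit` are made up for this example.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Apache "combined" format: host, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_utm_hit(line):
    """Split one log line into fields plus a dict of decoded utm* values."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    query = urlsplit(hit["target"]).query
    # parse_qs returns lists; keep the first value of each utm* parameter.
    hit["utm"] = {
        key: values[0]
        for key, values in parse_qs(query).items()
        if key.startswith("utm")
    }
    return hit


# Shortened version of the log line above.
SAMPLE_LINE = (
    '127.0.0.1 - - [02/Oct/2008:10:17:18 +1000] '
    '"GET /__utm.gif?utmwv=1.3&utmsr=1280x1024&utmdt=My%20Web%20Page'
    '&utmp=/urlgoeshere/ HTTP/1.0" '
    '200 35 "http://www.mydomain.com/urlgoeshere/" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 '
    '(KHTML, like Gecko) Chrome/0.2.153.1 Safari/525.19"'
)

hit = parse_utm_hit(SAMPLE_LINE)
print(hit["utm"]["utmdt"])
print(hit["user_agent"])
```

The `utmcc` cookie value in a full hit is itself URL-encoded and would need a second round of splitting, which this sketch leaves out.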
No. But why don't you just use your webserver's logs? The value of GA is not in the data they collect, but the aggregation/analysis. That's why it's not called Google Raw Data.
Please have a look at this article, which explains a hack to get Google Analytics data.
http://blogoscoped.com/archive/2008-01-17-n73.html
Also, if you can wait for some time, the official Google Analytics blog says that they are working on a data export API, but it is currently in private beta.
http://analytics.blogspot.com/2008/10/more-enterprise-class-features-added-to.html
Not exactly the same as raw vs aggregated, but it seems that "unsampled" data is only available to Premium accounts:
"Unsampled Reports are only available in Premium accounts using the latest version of Google Analytics."
http://support.google.com/analytics/bin/answer.py?hl=en&answer=2601061
You can get the Analytics data, but it'll take a bit of hacking.
In any analytics report, click the 'email' button at the top of the screen. Set up the email to go to your address (or a new address on your server) and change the format to csv or xml.
Then you can use PHP (or another language) to check the email account, parse the email, and import the attachment into your system.
There's an article entitled 'Incoming mail and PHP' on evolt.org: http://evolt.org/incoming_mail_and_php
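As a rough sketch of the "parse the email and import the attachment" step in a language other than PHP, the following Python example reads CSV attachments from a raw message using the standard library. The report email is built locally as a stand-in; its subject, file name and column names are invented, and the real format of the emailed report is an assumption. Fetching the message from a mailbox is left out.

```python
import csv
import io
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_csv_rows(raw_email):
    """Return all rows from the text/csv attachments of a raw email."""
    message = message_from_bytes(raw_email, policy=policy.default)
    rows = []
    for part in message.iter_attachments():
        if part.get_content_type() == "text/csv":
            text = part.get_content()  # decoded to str for text parts
            rows.extend(csv.reader(io.StringIO(text)))
    return rows


# Hypothetical report email, built here only to exercise the parser.
report = EmailMessage()
report["Subject"] = "Analytics report"
report.set_content("Report attached.")
report.add_attachment(
    "Page,Pageviews\n/home,42\n",
    subtype="csv",
    filename="report.csv",
)

rows = extract_csv_rows(report.as_bytes())
print(rows)
```

From here the rows can be written to whatever store your own system uses.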
No, but there are paid services such as Mixpanel and KISSmetrics that have data export APIs. That is much easier than trying to build your own analytics service, but it costs money.