I've developed a standalone XULRunner app that I'm using as a site-specific browser. The web application it accesses filters browsers to determine whether the browser being used is optimal. I'd like to add my XULRunner app to the list of optimal browsers. I figured that to do this, I'll need to know the HTTP header information that accompanies requests sent by the XULRunner app. What information in the HTTP headers can I use to identify my XULRunner app? Something like the Gecko engine version, etc. I've been searching around, but no luck yet.
An application is usually identified by means of the User-Agent header. You can see it on the client side via the window.navigator.userAgent property; e.g. the header for Firefox 12 on Windows 7 is:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
The important parts here are Gecko/... (which identifies a Gecko-based browser) and rv:... (the Gecko version). The Firefox/12.0 part should be replaced by something like MyApp/1.2.3 in your case (the name and version number of your application).
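On the server side, the filtering logic can then look for that token. A minimal sketch in Python, assuming a hypothetical MyApp/ token and a minimum Gecko version of 12 (adjust both to your application):

import re

def is_optimal_browser(user_agent: str) -> bool:
    """Accept Gecko-based UAs that also carry the custom application token."""
    ua = user_agent or ""
    if "Gecko/" not in ua or "MyApp/" not in ua:
        return False
    # Optionally require a minimum Gecko version via the rv: token.
    match = re.search(r"rv:(\d+)", ua)
    return match is not None and int(match.group(1)) >= 12

# Example:
# is_optimal_browser("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 MyApp/1.2.3")  # True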
Related
I have come across a requirement to scrape product pages of some e-commerce websites at scale [25-50 TPS]. I am using Amazon product pages as the example here to explain the problems I am facing. The API I am using to perform this task is Apache's CloseableHttpClient [in Java].
The code itself works as expected. But while running the scrape at scale, I have come across quite a few captchas and 5xx error responses from Amazon [as an example]. This, I believe, is to be expected, as Amazon has its own anti-bot mechanisms.
In order to avoid being detected as a bot, I have tried to mimic the behaviour of various browsers [Chrome, Safari, Firefox, etc.]. I have considered the following parameters to avoid detection:
Proxies
Headers
Cookies
User-Agents
As an example, I have tried to mimic the behaviour of the Safari browser, as shown below:
GET /Gillette-Fusion-ProGlide-Refills-Cartridge/dp/B072M5FFZ5 HTTP/1.1
Cookie: session-id=147-2327656-3202211; session-id-time=2082787201l; ubid-main=131-1518884-4482131; csm-hit=tb:s-P678SHE1N3JR8AXN19XZ|1635514068722&t:1635514070717&adb:adblk_no; session-token=XDtHgEvrmW2+5c05J+Yo8vjk4yaVAf2+ZpVBtSe8c/1esqV/GGIbrYX8p8iyBqUEuVqzKM7iWhi0FQIh4rO+md/5fjJD4Wf7PRcLVMxCFzXlde/OVdPLHmi5Xq+/4Mgg2qM+t5eSjdyn9rdxv2SJ9vU6wgouHr0MYH3YRBOP/NnspXKQZBA9shS5JCdIwpc3; skin=noskin; i18n-prefs=USD; lc-main=en_US
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: www.amazon.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15
Accept-Language: en-us
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
I have used a couple of proxies [in the required region], used the headers in the same order and the same case as shown above, used a list of User-Agents with a behaviour similar to the example shared above, and a couple of pre-calculated cookies. In spite of this, my success rate stands at 30-35%.
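For reference, here is roughly the kind of request described above, sketched with Python's requests library instead of CloseableHttpClient. The proxy address and cookie values are placeholders, and requests does not reproduce an exact header order, so this only illustrates the header/UA/proxy setup rather than a working bypass:

import requests

# Placeholder proxy; a real run would plug in an actual proxy in the required region.
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

# Headers mirroring the Safari request shown above.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15",
    "Accept-Language": "en-us",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

resp = requests.get(
    "https://www.amazon.com/Gillette-Fusion-ProGlide-Refills-Cartridge/dp/B072M5FFZ5",
    headers=headers,
    proxies=proxies,
    cookies={"session-id": "placeholder", "i18n-prefs": "USD"},  # placeholder cookie values
    timeout=30,
)
print(resp.status_code, len(resp.text))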
My questions here are as follows:
How do I go about improving the success rate of my scrapes?
How important is it to add a cookie? The cookie being sent by Safari has a csm-hit entry. Is there a way to calculate this on an ad-hoc basis?
Am I missing other entities [apart from headers, user-agents and proxies] that would improve my success rate?
Any help would be appreciated.
Please forgive me if this question is too stupid.
We know that in the browser it is possible to go to Inspect -> Network -> XHR -> Headers and get Request Headers. It is then possible to add these Headers to the Scrapy request.
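For context, manually adding browser-copied headers to a Scrapy request looks roughly like this; the URL and header values are placeholders, and response.request.headers is the attribute discussed below:

import scrapy

class HeaderSpider(scrapy.Spider):
    name = "header_spider"

    def start_requests(self):
        # Headers copied by hand from the browser's Network tab (placeholders here).
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en",
            "Referer": "https://example.com/",
        }
        yield scrapy.Request("https://example.com/", headers=headers, callback=self.parse)

    def parse(self, response):
        # Only what Scrapy actually sent is visible here, which is why the
        # output shown below is much sparser than the browser's header list.
        self.logger.info(response.request.headers)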
However, is there a way to get these Request Headers automatically using the Scrapy request, rather than manually?
I tried to use: response.request.headers but this information is not enough:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}
We see a lot more request header information in the browser. How can I get this information?
Scrapy uses these headers to scrape the webpage. Sometimes, if a website needs some special keys in the headers (like an API), you'll notice that Scrapy won't be able to scrape the webpage.
However, there is a workaround: in a downloader middleware, you can integrate Selenium, so the requested webpage is downloaded by a Selenium-automated browser. You can then extract the complete headers, since Selenium drives an actual browser.
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver

## Start the browser (adjust the driver path for your machine)
driver = webdriver.Chrome("my/path/to/driver")

## Get the URL
driver.get("https://my.test.url.com")

## Print request headers
for request in driver.requests:
    print(request.url)                   # <--------------- Request url
    print(request.headers)               # <----------- Request headers
    if request.response:                 # <-- response can be None for failed requests
        print(request.response.headers)  # <-- Response headers
You can use the above code to get the request headers. This must be placed within a Scrapy downloader middleware so both can work together.
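A rough sketch of such a middleware, assuming selenium-wire is installed; the class name, driver path, and meta key are illustrative and not part of the original answer:

from scrapy.http import HtmlResponse
from seleniumwire import webdriver

class SeleniumWireMiddleware:
    """Downloads pages with a Selenium Wire browser so the full headers are captured."""

    def __init__(self):
        # Adjust the driver path for your machine.
        self.driver = webdriver.Chrome("my/path/to/driver")

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Expose the captured browser traffic so the spider can inspect its headers.
        request.meta["browser_requests"] = list(self.driver.requests)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

It would then be enabled via DOWNLOADER_MIDDLEWARES in settings.py.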
I'm accessing Windows authentication information in my ASP.NET MVC application using the following code.
WindowsIdentity identity = HttpContext.Current.Request.LogonUserIdentity;
The value of identity.Name is correctly the Windows login name.
When I inspect the HTTP request that is sent from the browser to the server, I see the following.
GET http://localhost:12010/administration HTTP/1.1
Authorization: Negotiate YIIJqgYGKwYBBQUCoIIJnjCCCZqgMDAuBgkqhkiC9xIBAgIGCSqGSIb3EgECAgYKKwYBBAGCNwICHgYKKwYBBAGCNwICCqKCCWQEgglgYIIJXAYJKoZIhvcSAQICAQBugglLMIIJR6ADAgEFoQMCAQ6iBwMFACAAAACjggfcYYIH2DCCB9SgAwIBBaEKGwhJVC5MT0NBTKIoMCagAwIBAqEfMB0bBEhUVFAbFWl0LWRsMzgyLWhraS5JVC5MT0NBTKOCB5UwggeRoAMCARKhAwIBIaKCB4MEggd
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-EN
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: localhost:12010
Cookie: ASP.NET_SessionId=pr0qmeomsr1rlb1ehp2sffd3
There is no Windows authentication information in the http request, but I can access it in my code.
How are the values of LogonUserIdentity properties passed from browser to server?
Note that this type of authentication is designed to be employed in intranet applications.
As you can see, the request parameters for my own test are:
Params: {ALL_HTTP=HTTP_CACHE_CONTROL%3amax-age%3d0%0d%0aHTTP_CONNECTION%3akeep-alive%0d%0aHTTP_ACCEPT%3atext%2fhtml%2capplication%2fxhtml%2bxml%2capplication%2fxml%3bq%3d0.9%2cimage%2fwebp%2c*%2f*%3bq%3d0.8%0d%0aHTTP_ACCEPT_ENCODING%3agzip%2c+deflate%2c+sdch%0d%0aHTTP_ACCEPT_LANGUAGE%3aen-US%2cen%3bq%3d0.8%2cfa%3bq%3d0.6%0d%0aHTTP_HOST%3alocalhost%3a54035%0d%0aHTTP_USER_AGENT%3aMozilla%2f5.0+(Windows+NT+6.3%3b+WOW64)+AppleWebKit%2f537.36+(KHTML%2c+like+Gecko)+Chrome%2f40.0.2214.94+Safari%2f537.36%0d%0a&ALL_RAW=Cache-Control%3a+max-age%3d0%0d%0aConnection%3a+keep-alive%0d%0aAccept%3a+text%2fhtml%2capplication%2fxhtml%2bxml%2capplication%2fxml%3bq%3d0.9%2cimage%2fwebp%2c*%2f*%3bq%3d0.8%0d%0aAccept-Encoding%3a+gzip%2c+deflate%2c+sdch%0d%0aAccept-Language%3a+en-US%2cen%3bq%3d0.8%2cfa%3bq%3d0.6%0d%0aHost%3a+localhost%3a54035%0d%0aUser-Agent%3a+Mozilla%2f5.0+(Windows+NT+6.3%3b+WOW64)+AppleWebKit%2f537.36+(KHTML%2c+like+Gecko)+Chrome%2f40.0.2214.94+Safari%2f537.36%0d%0a&APPL_MD_PATH=%2fLM%2fW3SVC%2f7%2fROOT&APPL_
PHYSICAL_PATH=c%3a%5cusers%5camirmehr%5cdocuments%5cvisual+studio+2013%5cProjects%5cWindows+Auth%5cWindows+Auth%5c&AUTH_TYPE=Negotiate&AUTH_USER=Amir%5camirmehr&AUTH_PASSWORD=&LOGON_USER=Amir%5camirmehr&REMOTE_USER=Amir%5camirmehr&CERT_COOKIE=&CERT_FLAGS=&CERT_ISSUER=&CERT_KEYSIZE=&CERT_SECRETKEYSIZE=&CERT_SERIALNUMBER=&CERT_SERVER_ISSUER=&CERT_SERVER_SUBJECT=&CERT_SUBJECT=&CONTENT_LENGTH=0&CONTENT_TYPE=&GATEWAY_INTERFACE=CGI%2f1.1&HTTPS=off&HTTPS_KEYSIZE=&HTTPS_SECRETKEYSIZE=&HTTPS_SERVER_ISSUER=&HTTPS_SERVER_SUBJECT=&INSTANCE_ID=7&INSTANCE_META_PATH=%2fLM%2fW3SVC%2f7&LOCAL_ADDR=%3a%3a1&PATH_INFO=%2f&PATH_TRANSLATED=c%3a%5cusers%5camirmehr%5cdocuments%5cvisual+studio+2013%5cProjects%5cWindows+Auth%5cWindows+Auth&QUERY_STRING=&REMOTE_ADDR=%3a%3a1&REMOTE_HOST=%3a%3a1&REMOTE_PORT=54427&REQUEST_METHOD=GET&SCRIPT_NAME=%2f&SERVER_NAME=localhost&SERVER_PORT=54035&SERVER_PORT_SECURE=0&SERVER_PROTOCOL=HTTP%2f1.1&SERVER_SOFTWARE=Microsoft-IIS%2f8.0&URL=%2f&HTTP_CACHE_CONTROL=max-age%3d0&HTTP_CONNECTION=keep-alive&HTTP_
ACCEPT=text%2fhtml%2capplication%2fxhtml%2bxml%2capplication%2fxml%3bq%3d0.9%2cimage%2fwebp%2c*%2f*%3bq%3d0.8&HTTP_ACCEPT_ENCODING=gzip%2c+deflate%2c+sdch&HTTP_ACCEPT_LANGUAGE=en-US%2cen%3bq%3d0.8%2cfa%3bq%3d0.6&HTTP_HOST=localhost%3a54035&HTTP_USER_AGENT=Mozilla%2f5.0+(Windows+NT+6.3%3b+WOW64)+AppleWebKit%2f537.36+(KHTML%2c+like+Gecko)+Chrome%2f40.0.2214.94+Safari%2f537.36}
And the server variables:
ServerVariables: {ALL_HTTP=HTTP_CACHE_CONTROL%3amax-age%3d0%0d%0aHTTP_CONNECTION%3akeep-alive%0d%0aHTTP_ACCEPT%3atext%2fhtml%2capplication%2fxhtml%2bxml%2capplication%2fxml%3bq%3d0.9%2cimage%2fwebp%2c*%2f*%3bq%3d0.8%0d%0aHTTP_ACCEPT_ENCODING%3agzip%2c+deflate%2c+sdch%0d%0aHTTP_ACCEPT_LANGUAGE%3aen-US%2cen%3bq%3d0.8%2cfa%3bq%3d0.6%0d%0aHTTP_HOST%3alocalhost%3a54035%0d%0aHTTP_USER_AGENT%3aMozilla%2f5.0+(Windows+NT+6.3%3b+WOW64)+AppleWebKit%2f537.36+(KHTML%2c+like+Gecko)+Chrome%2f40.0.2214.94+Safari%2f537.36%0d%0a&ALL_RAW=Cache-Control%3a+max-age%3d0%0d%0aConnection%3a+keep-alive%0d%0aAccept%3a+text%2fhtml%2capplication%2fxhtml%2bxml%2capplication%2fxml%3bq%3d0.9%2cimage%2fwebp%2c*%2f*%3bq%3d0.8%0d%0aAccept-Encoding%3a+gzip%2c+deflate%2c+sdch%0d%0aAccept-Language%3a+en-US%2cen%3bq%3d0.8%2cfa%3bq%3d0.6%0d%0aHost%3a+localhost%3a54035%0d%0aUser-Agent%3a+Mozilla%2f5.0+(Windows+NT+6.3%3b+WOW64)+AppleWebKit%2f537.36+(KHTML%2c+like+Gecko)+Chrome%2f40.0.2214.94+Safari%2f537.36%0d%0a&APPL_MD_PATH=%2fLM%2fW3SVC%2f7%2fR
OOT&APPL_PHYSICAL_PATH=c%3a%5cusers%5camirmehr%5cdocuments%5cvisual+studio+2013%5cProjects%5cWindows+Auth%5cWindows+Auth%5c&AUTH_TYPE=Negotiate&AUTH_USER=Amir%5camirmehr&AUTH_PASSWORD=&LOGON_USER=Amir%5camirmehr&REMOTE_USER=Amir%5camirmehr&CERT_COOKIE=&CERT_FLAGS=&CERT_ISSUER=&CERT_KEYSIZE=&CERT_SECRETKEYSIZE=&CERT_SERIALNUMBER=&CERT_SERVER_ISSUER=&CERT_SERVER_SUBJECT=&CERT_SUBJECT=&CONTENT_LENGTH=0&CONTENT_TYPE=&GATEWAY_INTERFACE=CGI%2f1.1&HTTPS=off&HTTPS_KEYSIZE=&HTTPS_SECRETKEYSIZE=&HTTPS_SERVER_ISSUER=&HTTPS_SERVER_SUBJECT=&INSTANCE_ID=7&INSTANCE_META_PATH=%2fLM%2fW3SVC%2f7&LOCAL_ADDR=%3a%3a1&PATH_INFO=%2f&PATH_TRANSLATED=c%3a%5cusers%5camirmehr%5cdocuments%5cvisual+studio+2013%5cProjects%5cWindows+Auth%5cWindows+Auth&QUERY_STRING=&REMOTE_ADDR=%3a%3a1&REMOTE_HOST=%3a%3a1&REMOTE_PORT=54427&REQUEST_METHOD=GET&SCRIPT_NAME=%2f&SERVER_NAME=localhost&SERVER_PORT=54035&SERVER_PORT_SECURE=0&SERVER_PROTOCOL=HTTP%2f1.1&SERVER_SOFTWARE=Microsoft-IIS%2f8.0&URL=%2f&HTTP_CACHE_CONTROL=max-age%3d0&HTTP_CONNECTION=keep-al
ive&HTTP_ACCEPT=text%2fhtml%2capplication%2fxhtml%2bxml%2capplication%2fxml%3bq%3d0.9%2cimage%2fwebp%2c*%2f*%3bq%3d0.8&HTTP_ACCEPT_ENCODING=gzip%2c+deflate%2c+sdch&HTTP_ACCEPT_LANGUAGE=en-US%2cen%3bq%3d0.8%2cfa%3bq%3d0.6&HTTP_HOST=localhost%3a54035&HTTP_USER_AGENT=Mozilla%2f5.0+(Windows+NT+6.3%3b+WOW64)+AppleWebKit%2f537.36+(KHTML%2c+like+Gecko)+Chrome%2f40.0.2214.94+Safari%2f537.36}
which contains my username, amirmehr. But in the browser request we don't see any user information, which suggests that IIS is reading the user data from the OS, since I'm running my project locally :). Here is the browser request:
Request URL:http://localhost:54035/
Request Headers
Provisional headers are shown
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer:http://localhost:54035/
User-Agent:Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36
After some research, I found out that the answer depends on the test environment (besides configuration).
For the local case, take a look at Unable to get windows authentication to work through local IIS.
And to understand how it works, check Windows Authentication for .net MVC 4 - how it works, how to test it. Both have useful things.
Also, there are configuration considerations, mentioned below:
Configuration (Microsoft link)
Configuring Windows Authentication in MVC4 with IIS Express.
For customization check ASP.NET MVC 4 Forms Authentication Customized :)
Depending on your WindowsAuthenticationMode, the browser may or may not authenticate with IIS. For example, if you allow Anonymous Authentication, the IUSR_MACHINENAME account is used to authenticate the user. If Basic is specified, the browser will authenticate with IIS (you will need to provide a user/password), and you will then see the authentication information in the requests. Then there is NTLM authentication. More information here.
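As a side note (not part of the original answer): the credentials do in fact travel in the request, inside the Authorization: Negotiate header shown in the question. Its value is a base64-encoded SPNEGO (Kerberos/NTLM) token rather than readable text, which a short Python check makes visible; the token below is a truncated copy of the one from the question:

import base64

# Truncated copy of the value after "Negotiate " in the captured Authorization header.
negotiate_token = "YIIJqgYGKwYBBQUCoIIJnjCCCZqgMDAu"

raw = base64.b64decode(negotiate_token)
# SPNEGO/GSS-API tokens are DER-encoded binary, so the first byte is 0x60
# (an ASN.1 application tag), not anything human-readable.
print(raw[:8].hex())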
So I set up an ejabberd XMPP server and use nginx as a proxy on an EC2 instance. The Strophe.js echobot example can connect from my Chrome browser. Here are the Request Headers:
Request URL:http://foo.eu-west-1.compute.amazonaws.com/http-bind
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Content-Length:115
Content-Type:application/xml
Host:foo.compute.amazonaws.com
Origin:foo.eu-west-1.compute.amazonaws.com
Referer:foo.eu-west-1.compute.amazonaws.com/examples/echobot.html
User-Agent:Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
Request Payload
<body rid='2692151172' xmlns='http://jabber.org/protocol/httpbind' sid='15f6dc6cbc7a7d69b5db531c2ebf7828e094f043'/>
Response Headers
Access-Control-Allow-Headers:Content-Type
Access-Control-Allow-Origin:*
Connection:keep-alive
Content-Length:286
Content-Type:text/xml; charset=utf-8
Date:Fri, 04 Jan 2013 11:14:14 GMT
Server:nginx/1.2.4
I ported the echobot example to work in Spotify as an app. However, Strophe cannot connect. Here is the Web Inspector network log:
Request URL:http://foo.eu-west-1.compute.amazonaws.com/http-bind
Request Headers
POST http://foo.eu-west-1.compute.amazonaws.com/http-bind HTTP/1.1
Origin: sp://chatify
User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.15 Safari/535.11
Content-Type: application/xml
Request Payload
<body rid='790399813' xmlns='http://jabber.org/protocol/httpbind' to='ec2-54-246-45-111.eu-west-1.compute.amazonaws.com' xml:lang='en' wait='60' hold='1' content='text/xml; charset=utf-8' ver='1.6' xmpp:version='1.0' xmlns:xmpp='urn:xmpp:xbosh'/>
Now, the RequiredPermissons entry in the manifest.json looks as follows:
"RequiredPermissons": [
"http://foo.eu-west-1.compute.amazonaws.com",
"foo.eu-west-1.compute.amazonaws.com",
"http://foo.eu-west-1.compute.amazonaws.com/http-bind",
"foo.eu-west-1.compute.amazonaws.com/http-bind"
]
I can load sources from foo.eu-west-1.compute.amazonaws.com, so I think the permissions work.
The Spotify app uses a different origin, "sp://chatify". I read that this can cause problems and that Access-Control-Allow-Origin: * should be added to the response headers. I did so in the nginx config, only to find out that it had been sent all along. You can see it in the Response Headers of the first request.
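(As an aside, one quick way to confirm what the server actually sends, assuming Python with the requests library, is to POST to the BOSH endpoint and print the CORS header; the hostname is the placeholder from above.)

import requests

resp = requests.post(
    "http://foo.eu-west-1.compute.amazonaws.com/http-bind",
    data="<body xmlns='http://jabber.org/protocol/httpbind'/>",
    headers={"Content-Type": "application/xml"},
    timeout=10,
)
print(resp.headers.get("Access-Control-Allow-Origin"))  # expect "*"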
Strophe.js itself, if logging is turned on in the example, says:
error 0 happened
So any suggestions?
The manifest seems to be right.
Access-Control-Allow-Origin: * seems right.
It works in a browser but not in Spotify.
Thx for your help!
Update:
If I open the echobot.html locally it still works. The Origin is then null.
Okay, I got it to work. And the solution is a classic. However, my debugging might help others.
At first I looked at the Web Inspector. It showed that all requests were canceled. Now, the inspector is not always right, as mentioned in many places. This time it was, though.
So Strophe.js could not establish a connection. To see if any custom Spotify app could establish a connection, I tried the Google Maps example from the tutorial app. It did not work with the native Linux version or with the Windows version under Wine. It did, however, work under Windows.
My next step was to see if my app could load any resources from the net. So I simply added an iframe. It did not load my page.
I took a close look at the manifest of the tutorial app and my app. I could not find any difference. So I just copied "RequiredPermissions" over and saw that I had a typo. Once that was fixed, the iframe and eventually Strophe.js worked.
It is funny: I looked for typos a few days ago and could not find any.
To sum up:
Make sure your server sets Access-Control-Allow-Origin: *
Set the Permissions. E.g.
"RequiredPermissions": [
"http://ec2-foo.eu-west-1.compute.amazonaws.com",
"ec2-foo.eu-west-1.compute.amazonaws.com",
"http://ec2-foo.eu-west-1.compute.amazonaws.com/http-bind",
"ec2-foo.eu-west-1.compute.amazonaws.com/http-bind"
]
For Filepicker.io we built "grab from url", but certain sites aren't happy when no User-Agent header is passed. I could just use a stock browser user agent as suggested in some other answers, but as a good web citizen I wanted to know if there is a more appropriate user-agent to set for a server requesting another server's data.
Depends on the language you wrote your server in. For example, Python's urllib sets a default value to User-agent: Python-urllib/2.1, but you can just as easily set it to something like User-agent: filepicker.io/<your-version-here> or something more language specific if you'd like.
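For instance, with Python 3's urllib the default can be overridden like this (the UA string and URL are placeholders):

from urllib.request import Request, urlopen

# Identify the service and version instead of the default "Python-urllib/x.y".
req = Request(
    "https://example.com/some/image.png",
    headers={"User-Agent": "filepicker.io/1.0 (+https://www.filepicker.io)"},
)
with urlopen(req) as resp:
    data = resp.read()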