I have come across a requirement to scrape product pages of some e-commerce websites at scale [25-50 TPS]. I am using the crawling of Amazon product pages as an example here to explain the problems I am facing. The API I am using to perform this task is Apache's CloseableHttpClient [in Java].
The code itself works as expected. But while running the scrape at scale, I have come across quite a few CAPTCHAs and 5xx error responses from Amazon [as an example]. I believe this is to be expected, as Amazon has its own anti-bot mechanisms.
In order to avoid being detected as a bot, I have tried to mimic the behaviour of various browsers [Chrome, Safari, Firefox, etc.]. I have considered the following parameters to avoid detection:
Proxies
Headers
Cookies
User-Agents
As an example, I have tried to mimic the behaviour of the Safari browser, as shown below:
GET /Gillette-Fusion-ProGlide-Refills-Cartridge/dp/B072M5FFZ5 HTTP/1.1
Cookie: session-id=147-2327656-3202211; session-id-time=2082787201l; ubid-main=131-1518884-4482131; csm-hit=tb:s-P678SHE1N3JR8AXN19XZ|1635514068722&t:1635514070717&adb:adblk_no; session-token=XDtHgEvrmW2+5c05J+Yo8vjk4yaVAf2+ZpVBtSe8c/1esqV/GGIbrYX8p8iyBqUEuVqzKM7iWhi0FQIh4rO+md/5fjJD4Wf7PRcLVMxCFzXlde/OVdPLHmi5Xq+/4Mgg2qM+t5eSjdyn9rdxv2SJ9vU6wgouHr0MYH3YRBOP/NnspXKQZBA9shS5JCdIwpc3; skin=noskin; i18n-prefs=USD; lc-main=en_US
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: www.amazon.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15
Accept-Language: en-us
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
I have used a couple of proxies [in the required region], used the headers in the same order and case as shown above, used a list of User-Agents matching the example shared above, and a couple of pre-calculated cookies. In spite of this, my success rate stands at 30-35%.
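For context, the request setup with CloseableHttpClient looks roughly like the sketch below; the proxy host/port and the cookie string are placeholders rather than the real values in use:

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ProductPageFetcher {
    public static void main(String[] args) throws Exception {
        // Placeholder proxy in the required region; rotated per request in practice.
        HttpHost proxy = new HttpHost("proxy.example.com", 8080);
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();

        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build()) {

            HttpGet get = new HttpGet(
                    "https://www.amazon.com/Gillette-Fusion-ProGlide-Refills-Cartridge/dp/B072M5FFZ5");

            // Headers set explicitly, in the same order and case as the Safari request above.
            get.setHeader("Cookie", "session-id=...; ubid-main=...");  // pre-calculated cookies (placeholder)
            get.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            get.setHeader("Host", "www.amazon.com");
            get.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                    + "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15");
            get.setHeader("Accept-Language", "en-us");
            // Note: HttpClient 4.x does not decode Brotli (br) responses out of the box.
            get.setHeader("Accept-Encoding", "gzip, deflate, br");
            get.setHeader("Connection", "keep-alive");

            try (CloseableHttpResponse response = client.execute(get)) {
                System.out.println("Status: " + response.getStatusLine().getStatusCode());
                String body = EntityUtils.toString(response.getEntity());
                System.out.println("Body length: " + body.length());
            }
        }
    }
}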
My questions here are as follows:
How do I go about improving the success rate of my scrapes?
How important is it to add a cookie? The cookie being sent by Safari has a csm-hit entry. Is there a way to calculate this on an ad-hoc basis?
Am I missing other entities [apart from headers, user-agents and proxies] that would improve my success rate?
Any help would be appreciated.
I'm new to Cache-Control header implementation and I need someone to point out any mistakes and/or misunderstandings I have about the cache-control effects on Firebase Cloud Functions.
My understanding & expectation on Cache Control over Firebase Functions
When the Cache-Control header has been successfully set using the Express response object (confirmed by checking Chrome's Network tab), regardless of whether it is on localhost or the production server, the Firebase HTTPS functions (not callable functions) should not be invoked again after the first reload until the cache has expired.
Am I right? But after a few rounds of testing, it seems like my cloud function on localhost still consistently gets invoked (confirmed by server console logging) regardless of the number of refreshes in my web browser. Below are my current HTTP headers:
**General:**
Request URL: http://localhost:5005/otk-web-solutions?id=B0Y0jp2x83WVYzWrpg5y
Request Method: GET
Status Code: 304 Not Modified
Remote Address: 127.0.0.1:5005
Referrer Policy: strict-origin-when-cross-origin
**Response Headers:**
Access-Control-Allow-Origin: *
cache-control: public, max-age=432000, s-maxage=432000
content-length: 9688
content-type: text/html; charset=utf-8
date: Mon, 05 Apr 2021 11:52:20 GMT
etag: W/"25d8-TxL0Q+ujhzDjys8IJ1mLigY7jT8"
vary: Origin, Accept-Encoding, Authorization, Cookie
x-powered-by: Express
**Request Headers:**
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,en-US;q=0.8,zh;q=0.7
Cache-Control: max-age=0
Connection: keep-alive
Cookie: _ga=GA1.1.816734993.1603107580; _gid=GA1.1.223745218.1617606982; __atuvc=20%7C12%2C15%7C13%2C23%7C14; __atuvs=606aec5f76521aab00a
DNT: 1
Host: localhost:5005
If-None-Match: W/"25d8-TxL0Q+ujhzDjys8IJ1mLigY7jT8"
sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36
**Query String Parameters:**
id: B0Y0jp2x83WVYzWrpg5y
From the Firebase documentation:
You can, though, configure caching behavior for dynamic content. For example, if a function generates new content only periodically, you can speed up your app by caching the generated content for at least a short period of time.
You can also potentially reduce function execution costs because the content is served from the CDN rather than via a triggered function.
Could it be that the cache-control header has no effect on localhost and only applies to the Firebase CDN, meaning the caching only works on the cloud CDN once we've deployed to the production server? Is there a right way to implement such a test to see the effectiveness of the cache-control header in helping to save Firebase Cloud Functions' execution costs?
Please advise, thanks a lot!
From my understanding of the documentation, and after checking your test, you will only be able to save costs if the content is served from the CDN.
Even if your request is met with a status code 304 Not Modified, it seems like you're still making the request and invoking the function if you're not using the Firebase Hosting CDN.
So, to test whether you can save costs by not invoking the function many times, you should set up Firebase Hosting and run that same test to see whether a response served from the CDN still invokes the function.
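As a rough way to run that test from the outside, here is a minimal sketch in plain Java (the Hosting URL is a placeholder); it sends the same request twice and prints the Age header, which shared caches typically add to responses they serve from cache. Cross-checking with the function's console logs should confirm whether the second response came from the CDN or triggered a fresh invocation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CacheProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder Firebase Hosting URL that rewrites to the HTTPS function.
        URI uri = URI.create("https://your-project.web.app/otk-web-solutions?id=B0Y0jp2x83WVYzWrpg5y");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();

        for (int i = 1; i <= 2; i++) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // A response served from a shared cache usually carries an Age header;
            // a response produced by a fresh function invocation usually does not.
            System.out.printf("Request %d: status=%d, age=%s%n",
                    i,
                    response.statusCode(),
                    response.headers().firstValue("age").orElse("(absent)"));
        }
    }
}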
Looking at the network panel in the developer tools of Google Chrome, I can read the HTTP request and response messages of each file in a web page and, in particular, I can read the start line and the headers with all their fields.
I know (and I hope this is right) that the start line of each HTTP message has a specific and rigorous structure (different for request and response messages, of course), and no element of the start line can be omitted.
Unlike the start line, the header of an HTTP message contains additional information, so, I guess, the header fields are optional or, at least, not as strictly required as the fields in the start line.
Considering all this, I'm wondering: who sets the header fields in an HTTP message? Or, in other words, how are the header fields of an HTTP message determined?
For example, I can actually see that the HTTP request message for a web page is this:
GET / HTTP/1.1
Host: www.corriere.it
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: it-IT,it;q=0.8,en-US;q=0.6,en;q=0.4,de;q=0.2
Cookie: rccsLocalPref=milano%7CMilano%7C015146; rcsLocalPref=milano%7CMilano; _chartbeat2=DVgclLD1BW8iBl8sAi.1422913713367.1430683372200.1111111111111111; rlId=8725ab22-cbfc-45f7-a737-7c788ad27371; __ric=5334%3ASat%20Jun%2006%202015%2014%3A13%3A31%20GMT+0200%20%28ora%20legale%20Europa%20occidentale%29%7C; optimizelyEndUserId=oeu1433680191192r0.8780217287130654; optimizelySegments=%7B%222207780387%22%3A%22gc%22%2C%222230660652%22%3A%22false%22%2C%222231370123%22%3A%22referral%22%7D; optimizelyBuckets=%7B%7D; __gads=ID=bbe86fc4200ddae2:T=1434976116:S=ALNI_MZnWxlEim1DkFzJn-vDIvTxMXSJ0g; fbm_203568503078644=base_domain=.corriere.it; apw_browser=3671792671815076067.; channel=Direct; apw_cache=1438466400.TgwTeVxF.1437740670.0.0.0...EgjHfb6VZ2K4uRK4LT619Zau06UsXnMdig-EXKOVhvw; ReadSpeakerSettings=enlarge=enlargeoff; _ga=GA1.2.1780902850.1422986273; __utma=226919106.1780902850.1422986273.1439110897.1439114180.19; __utmc=226919106; __utmz=226919106.1439114180.19.18.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); s_cm_COR=Googlewww.google.it; gvsC=New; rcsddfglr=1441375682.3.2.m0i10Mw-|z1h7I0wH.3671792671815076067..J3ouwyCkNXBCyau35GWCru0I1mfcA3hRLNURnDWREPs; cpmt_xa=5334,5364; utag_main=v_id:014ed4175b8e000f4d2bb480bdd10606d001706500bd0$_sn:74$_ss:1$_st:1439133960323$_pn:1%3Bexp-session$ses_id:1439132160323%3Bexp-session; testcookie=true; s_cc=true; s_nr=1439132160762-Repeat; SC_LNK_CR=%5B%5BB%5D%5D; s_sq=%5B%5BB%5D%5D; dtLatC=116p80.5p169.5p91.5p76.5p130.5p74p246.5p100p74.5p122.5; dtCookie=E4365758C13B82EE9C1C69A59B6F077E|Corriere|1|_default|1; dtPC=-; NSC_Wjq_Dpssjfsf_Dbdif=ffffffff091a1f8d45525d5f4f58455e445a4a423660; hz_amChecked=1
How are these header fields chosen? Who or what chose them? (The browser? Not me, of course...)
P.S.: I hope my question is clear; please forgive my bad English.
Websites are hosted on HTTP servers, and the response headers are set by the HTTP server that is hosting the web page. They are used to control how pages are shown, cached, and encoded.
The request headers are set by the web browser when it requests the pages from the server. This mutual communication follows the HTTP protocol.
Here is a list of all the possible header fields for a request message; the question is, why does the browser choose only some of them?
The browser doesn't include all possible request headers in every request because either:
They aren't applicable to the current request or
The default value is the desired value
For instance:
Accept tells the server that only certain data formats are acceptable in the response. If any kind of data is acceptable, then it can be omitted as the default is "everything".
Content-Length describes the length of the body of the request. A GET request doesn't have a body, so there is nothing to describe the length of.
Cookie contains a cookie set by the server (or JavaScript) on a previous request. If a cookie hasn't been set, then there isn't one to send back to the server.
and so on.
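To make it concrete that it is the client program (a browser or anything else) that picks the optional headers, here is a minimal sketch using Java's built-in HttpClient; it reuses the URL from the example above but sends only a couple of headers of its own choosing:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MinimalRequest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // The client program decides which optional headers to send.
        // Here only Accept and Accept-Language are set explicitly;
        // the Host header is derived automatically from the URI.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://www.corriere.it/"))
                .header("Accept", "text/html")
                .header("Accept-Language", "it-IT")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
    }
}

The server will still answer such a stripped-down request; the extra headers a browser sends (cookies, encodings, cache validators, and so on) just let the server tailor or compress the response.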
So I set up an ejabberd XMPP server and use nginx as a proxy on an EC2 instance. The Strophe.js echobot example can connect from my Chrome browser. Here are the Request Headers:
Request URL:http://foo.eu-west-1.compute.amazonaws.com/http-bind
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Content-Length:115
Content-Type:application/xml
Host:foo.compute.amazonaws.com
Origin:foo.eu-west-1.compute.amazonaws.com
Referer:foo.eu-west-1.compute.amazonaws.com/examples/echobot.html
User-Agent:Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
Request Payload
<body rid='2692151172' xmlns='http://jabber.org/protocol/httpbind' sid='15f6dc6cbc7a7d69b5db531c2ebf7828e094f043'/>
Response Headers
Access-Control-Allow-Headers:Content-Type
Access-Control-Allow-Origin:*
Connection:keep-alive
Content-Length:286
Content-Type:text/xml; charset=utf-8
Date:Fri, 04 Jan 2013 11:14:14 GMT
Server:nginx/1.2.4
I ported the echobot example to work in Spotify as an app. However, Strophe cannot connect. Here is the Web Inspector network log:
Request URL:http://foo.eu-west-1.compute.amazonaws.com/http-bind
Request Headers
POST http://foo.eu-west-1.compute.amazonaws.com/http-bind HTTP/1.1
Origin: sp://chatify
User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.15 Safari/535.11
Content-Type: application/xml
Request Payload
<body rid='790399813' xmlns='http://jabber.org/protocol/httpbind' to='ec2-54-246-45-111.eu-west-1.compute.amazonaws.com' xml:lang='en' wait='60' hold='1' content='text/xml; charset=utf-8' ver='1.6' xmpp:version='1.0' xmlns:xmpp='urn:xmpp:xbosh'/>
Now. The RequiredPermissons in the manifest.json look as follows:
"RequiredPermissons": [
"http://foo.eu-west-1.compute.amazonaws.com",
"foo.eu-west-1.compute.amazonaws.com",
"http://foo.eu-west-1.compute.amazonaws.com/http-bind",
"foo.eu-west-1.compute.amazonaws.com/http-bind"
]
I can load sources from foo.eu-west-1.compute.amazonaws.com, so I think the permissions work.
The Spotify app uses a different origin, "sp://chatify". I read that this can cause problems, and that Access-Control-Allow-Origin: * should be added to the response headers. I did so in the nginx config, just to find out that it had been sent all along. You can see it in the Response Headers of the first request.
Strophe.js itself, with logging turned on in the example, says:
error 0 happened
So any suggestions?
The manifest seems to be right
Access-Control-Allow-Origin: * seems right
It works in a browser but not in Spotify.
Thanks for your help!
Update:
If I open the echobot.html locally it still works. The Origin is then null.
Okay, I got it to work. And the solution is a classic. However, my debugging might help others.
At first I looked at the Web Inspector. It showed that all requests were canceled. No, the inspector is not always right, as mentioned in many places. This time it was, though.
So Strophe.js could not establish a connection. To see if any custom Spotify app could establish a connection, I tried the Google Maps example from the tutorial app. It did not work with the native Linux version or with the Windows version under Wine. It did, however, work under Windows.
My next step was to see if my app could load any resources from the net. So I simply added an iframe. It did not load my page.
I took a close look at the manifests of the tutorial app and my app. I could not find any difference. So I just copied "RequiredPermissions" over, only to see that I had a typo. Once that was fixed, the iframe and eventually Strophe.js worked.
It is funny: I looked for typos a few days ago and could not find any.
To sum up:
Make sure your server sets Access-Control-Allow-Origin: *
Set the Permissions. E.g.
"RequiredPermissions": [
"http://ec2-foo.eu-west-1.compute.amazonaws.com",
"ec2-foo.eu-west-1.compute.amazonaws.com",
"http://ec2-foo.eu-west-1.compute.amazonaws.com/http-bind",
"ec2-foo.eu-west-1.compute.amazonaws.com/http-bind"
]
I've developed a standalone XULRunner app that I'm using as a site-specific browser. The web application it accesses filters browsers to determine whether the browser being used is optimal. I'd like to add my XULRunner app to the list of optimal browsers. I figured that to do this, I'll need to know the HTTP header information that accompanies requests sent by the XULRunner app. What information in the HTTP headers can I use to identify my XULRunner app? Something like the Gecko engine version, etc. I've been searching around, but no luck yet.
An application is usually identified by means of the User-Agent header. You can see it on the client side by means of the window.navigator.userAgent property, e.g. the header for Firefox 12 on Windows 7 is:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
The important part here is Gecko/... (identifies a Gecko-based browser) and rv:... (Gecko version). The Firefox/12.0 part should be replaced by something like MyApp/1.2.3 in your case (name and version number of your application).
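For illustration only, here is a rough sketch of how a server-side browser filter could look for such a token in the User-Agent header; the MyApp/1.2.3 token and the class name are hypothetical placeholders, not something the web application actually uses:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserAgentCheck {
    // Matches a Gecko-based UA string that also carries a hypothetical "MyApp/x.y.z" token.
    private static final Pattern MY_APP =
            Pattern.compile("Gecko/\\S+.*\\bMyApp/(\\d+(?:\\.\\d+)*)");

    public static boolean isOptimalBrowser(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        Matcher m = MY_APP.matcher(userAgent);
        return m.find();
    }

    public static void main(String[] args) {
        String ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 MyApp/1.2.3";
        System.out.println(isOptimalBrowser(ua)); // true
    }
}

The regular expression just checks for the Gecko token plus the custom application token; a real filter would match whatever token you decide to append to your app's User-Agent string.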
I'm using a custom ashx handler to handle a file upload. When run locally, the file uploads fine.
When I use the same setup on the web server, I get an "Index out of range" error.
In firebug I see the binary contents of the file in the post data and the file name is also passed in the query string.
Any one seen this before?
I'm sure it's something minor, but it's driving me up the wall.
Update: Made some progress. I found out that I'm getting two different errors: one from FF/Chrome and one from IE. I'm focusing on FF now, just because Firebug makes debugging easier. Now I get the error "Could not find a part of the path 'C:\inetpub\wwwroot\'".
Update 2: Got this working in FF/Chrome. It turns out IE and FF/Chrome post the data differently.
Update 3: Here is the output of the network profiler in IE dev tool:
Request header:
Key Value
Request POST /Secured/UploadHandler.ashx? HTTP/1.1
Accept text/html, application/xhtml+xml, */*
Referer http://cms.webstreet.co.il/Secured/fileUpload.aspx
Accept-Language he-IL
User-Agent Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
Content-Type multipart/form-data; boundary=---------------------------7db13b13d1b12
Accept-Encoding gzip, deflate
Host cms.webstreet.co.il
Content-Length 262854
Connection Keep-Alive
Cache-Control no-cache
Request body:
-----------------------------7db13b13d1b12
Content-Disposition: form-data; name="qqfile"; filename="P-Art_Page_Digital.jpg"
Content-Type: image/jpeg
<Binary File Data Not Shown>
---------------------------7db13b13d1b12--
See the (big) list of comments and replies attached to the original question. Not sure why it works now, but Elad seems to have fixed his problem.
You have to specify the name attribute.
<input id="File1" name="file1" type="file" />
I am pretty sure file uploads CANNOT be done via Ajax; you need to use a regular form post.
Also make sure you have the enctype attribute set correctly on your form tag.