I'm trying to set cookies while scraping Amazon so that I don't get detected and look like an authentic user.
I'm trying to replicate the behaviour of the website. I've analyzed the headers, the request and response signatures, etc. The only things that change are the cookies, specifically csm-hit and visitCount. I've worked out the logic behind how visitCount gets updated, but not csm-hit.
Here's the csm-hit cookie.
tb:s-Y4SB9X78SYQB53MGCQWE|1551555477343&t:1551555479805&adb:adblk_no
It has the following format:
tb:s-ALPHANUMERIC|EPOCH_TIME&t:EPOCH_TIME&adb:adblk_no
This alphanumeric string (which looks Base64-encoded) keeps changing. A function called updateCsmHit runs when reloading or redirecting away from Amazon; it updates the csm-hit value, and that updated value is then reused the next time a request is made to the server. If the cookie is not already saved in the browser, csm-hit is not sent with the request, but it is saved the moment you step out of Amazon.
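For reference, here is how I'm splitting the observed value into its parts (a quick Python sketch; the variable names are just my own labels, since I don't know what the token actually encodes):

# Split the sample csm-hit value from above into its pieces.
csm_hit = "tb:s-Y4SB9X78SYQB53MGCQWE|1551555477343&t:1551555479805&adb:adblk_no"
tb_part, t_part, adb_part = csm_hit.split("&")
token, first_epoch_ms = tb_part[len("tb:"):].split("|")  # "s-Y4SB9X78SYQB53MGCQWE", "1551555477343"
second_epoch_ms = t_part[len("t:"):]                      # "1551555479805"
adblock_flag = adb_part[len("adb:"):]                     # "adblk_no"
print(token, first_epoch_ms, second_epoch_ms, adblock_flag)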
I've analyzed the complete source code using Chrome DevTools, but I'm unable to crack the logic that generates this alphanumeric value.
I want to understand how this value is generated so that I can use the same logic to replicate it. Can anyone help me work this out using Chrome DevTools?
Does anyone know the significance of csm-hit?
PS: Please don't advise me to use the same csm-hit every time. I want to know how it is being generated.
I have a website hosted on Firebase that totally went viral for a day. Since I wasn't expecting that, I didn't install any analytics tool. However, I would like to know the number of visits or downloads. The only metric I have available is the GB downloaded: 686.8 GB. But I am confused, because if I open the website with Chrome's console I get two different metrics about the size of the page: 319 KB transferred and 1.2 MB resources. Furthermore, not all of those resources are transferred from Firebase; some come from other CDNs, as you can see in the screenshots. What is the proper way of calculating the visits I had?
Transferred metric is how much bandwidth was used after compression was applied.
Resources metric is how much disk space those resources use before they are compressed (for transfer).
True analytics requires an understanding of what is out on the web. There are three classifications:
Humans, composed of flesh and blood, who overwhelmingly (though not exclusively) use web browsers.
Spiders (or search engines) that request pages with the notion that they obey robots.txt and will list your website in their search results for relevant queries.
Rejects (basically spammers and the unknowns), which include (though are far from limited to) content/email scrapers, brute-force password guessers, vulnerability scanners and POST spammers.
With this clarification in place, what you're asking in effect is, "How many human visitors am I receiving?" The easiest way to obtain that information is to:
Determine which user-agent requests are human (not easy; it's behavior-based).
Determine the length of time a single visit from a human should count as.
Assign human visitors a session.
I presume you understand what a cookie is and how it differs from a session cookie. Obviously, when you sign in to a website you are assigned a session; if that session cookie is not sent to the server on a page request, you will in effect be signed out. You can make session cookies last a long time; it comes down to factors such as convenience for the visitor and whether you count those sessions directly or use them in conjunction with something else.
Now your next thought is likely, "But how do I count downloads?" Thankfully you mention PHP on your website, so I can give you some code that should make sense to you. If you just link directly to the file, the best you could do is count clicks via a click event on the anchor element, and a download that gets canceled because it was a mistake (or for some other reason) makes that even more subjective than my suggestion. Granted, my suggestion can still be subjective (e.g. the visitor decides they don't actually want the download and cancels before completion), and whether they ever use the download is another aspect to consider. That being said, if you want the server to give you a download count, you'd want to do the following:
You may want to use Apache's rewrite module (or the equivalent for other HTTP servers) so that PHP handles the download.
You may need to ensure Apache has the proper handler for PHP (e.g. AddType application/x-httpd-php5 .exe .msi .dmg) so your server knows to let PHP run on the requested file.
You'll want to use PHP's file_exists() with an absolute file path on the server for the sake of security.
You'll want to ensure that you set the correct MIME type for the file via PHP's header(), as you should expect browsers to be terrible at guessing.
You absolutely need to use die() or exit() afterwards: if your code leaks even whitespace, the browser would interpret it as part of the file, likely causing corruption (Gecko/Firefox in particular has been bitten by this).
Here is the code for PHP itself:
// The last segment of the request path is the name of the file being downloaded.
$p = explode('/', strrev($_SERVER['REQUEST_URI']));
$file = strrev($p[0]);
$path_absolute = '/var/www/downloads/';  // absolute directory that holds the downloadable files
$mime = 'application/octet-stream';      // replace with the file's real MIME type
if (!file_exists($path_absolute.$file)) { header('HTTP/1.1 404 Not Found'); exit(); }
header('HTTP/1.1 200 OK');
header('Content-Type: '.$mime);
readfile($path_absolute.$file);          // stream the file to the client
die();
For counting downloads, if you want to get a little fancy, you could create a couple of database tables: one for the files (download_files) and a second for the requests (download_requests). Throw in basic SQL queries and you're collecting data. Record the client IP as IPv6 (see "Storing IPv6 Addresses in MySQL") and you'll be able to tell from a query how many unique downloads you have.
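As a rough sketch of what those two tables and the uniqueness query could look like (the column layout is my own guess, and sqlite3 is used purely for illustration; the answer itself implies PHP with MySQL):

import sqlite3

db = sqlite3.connect("downloads.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS download_files (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS download_requests (
    id           INTEGER PRIMARY KEY,
    file_id      INTEGER NOT NULL REFERENCES download_files(id),
    ip           BLOB NOT NULL,               -- 16-byte packed IPv6 (or IPv4-mapped) address
    requested_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")
# "How many unique downloads does each file have?" then becomes one query:
rows = db.execute(
    "SELECT file_id, COUNT(DISTINCT ip) FROM download_requests GROUP BY file_id"
).fetchall()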
Back to human visitors: it takes a very thorough study to understand the differences between humans and bots. Things like Captcha are garbage and are utterly annoying. You can get a rough start by requiring a cookie to be sent back on requests though not all bots are ludicrously stupid. I hope this at least gets you on the right path.
I've read online that you shouldn't update your database with GET requests, for the following reasons:
A GET request is supposed to be idempotent and safe
It violates the HTTP spec
A GET should only ever read data from the server/database
Let's say we have built a URL shortener service. When someone clicks on the link, or pastes it into the browser's address bar, it will be a GET request.
So, if I want to update the stats of a shortened link in my database every time it's clicked, how can I do this if GET requests are supposed to be idempotent?
The only way I can think of is issuing a PUT request from the server-side code that handles the GET request.
Is this good practice, or is there a better way to do it?
It seems as though you're mixing up a few things here.
While you shouldn't use GET requests to transfer sensitive data (it is shown in the request URL and most likely logged somewhere in between), there is nothing wrong with using them in your use case: you are only updating a variable server-side.
Just keep in mind that the request parameters are part of the URL when using GET requests, and you should be OK.
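For illustration, here is a minimal sketch of that pattern with Flask (Flask, the in-memory dict and the short code are my own choices for the example, not anything from the question): the GET handler records the click server-side and then issues the redirect, so no separate PUT from the client is needed.

from flask import Flask, abort, redirect

app = Flask(__name__)
# In a real service this would live in a database, not a dict.
links = {"abc123": {"target": "https://example.com/some/long/url", "clicks": 0}}

@app.route("/<code>", methods=["GET"])
def follow(code):
    link = links.get(code)
    if link is None:
        abort(404)
    link["clicks"] += 1                       # the server-side stats update
    return redirect(link["target"], code=302)

if __name__ == "__main__":
    app.run()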
I am trying to create a JMeter script for my project, which covers a simple login and logout flow. I recorded the script using the BlazeMeter Chrome extension and found that the login HTTP sampler uses the POST method, sending the data below in the request body.
{"username":"U2FsdGVkX1/NOa/KQxSdikbkXsT8M5bF4K2oSVx/zTBAvyymJ93co+10eUT0cZuJ","password":"U2FsdGVkX183cvN9kZZiv5HRtrV/Z0ZwB89YenArxtA="}
When I replay the script in JMeter, it throws this error:
{"not_auth":"Invalid credentials. Please try again"}
I tried giving the actual credentials, but that didn't work for me.
Can anyone please help me with this issue? This dynamic value is generated on the browser side, and I cannot find any corresponding dynamic value in a previous response.
Record your login flow once again and check username and password values.
If they are the same as in the first recording, the problem lives somewhere else, and you will need to figure out the differences between the requests your JMeter test sends and the requests sent by a real browser. I would suggest comparing both using a sniffer tool like Wireshark in order to detect and work around the differences.
If they are different each time, you will need to find out what algorithm is being used for encrypting the username and password. Most likely it is done by JavaScript directly in the browser, so you should be able to figure it out and either re-use or re-implement it in a JSR223 PreProcessor. For example, your username and password certainly look like Base64-encoded strings, so I tried to decode them using Groovy's built-in decoding methods, but unfortunately the values are encrypted with something else on top of the Base64 encoding.
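You can see this for yourself (shown here in Python purely for illustration; the same check is a one-liner in a JSR223/Groovy script): decoding the posted username value yields an OpenSSL-style "Salted__" header rather than readable text, which is what passphrase-based encryption such as CryptoJS AES typically produces.

import base64

value = "U2FsdGVkX1/NOa/KQxSdikbkXsT8M5bF4K2oSVx/zTBAvyymJ93co+10eUT0cZuJ"
raw = base64.b64decode(value)
print(raw[:8])    # b'Salted__' -> the OpenSSL "salted ciphertext" marker, not plain text
print(raw[8:16])  # if it is that format, these 8 bytes are the salt; the rest is ciphertext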
Check out the Groovy Is the New Black article for more information on using Groovy in JMeter scripts.
OK, I already know all the on-paper reasons why I should not use an HTTP GET when making a RESTful call that updates the state of something on the server (and thus possibly returns different data each time). And I know this is wrong for the following 'on paper' reasons:
HTTP GET calls should be idempotent
N > 0 calls should always GET the same data back
Violates HTTP spec
HTTP GET call is typically read-only
And I am sure there are more reasons. But I need a concrete simple example for justification other than "Well, that violates the HTTP Spec!". ...or at least I am hoping for one. I have also already read the following which are more along the lines of the list above: Does it violate the RESTful when I write stuff to the server on a GET call? &
HTTP POST with URL query parameters -- good idea or not?
For example, can someone justify the above and explain why it is wrong/bad practice/incorrect to use an HTTP GET, say, with the following RESTful call:
"MyRESTService/GetCurrentRecords?UpdateRecordID=5&AddToTotalAmount=10"
I know it's wrong, but hopefully it will help provide an example to answer my original question. So the above would update record ID 5 by adding 10 to its total amount and then return the updated records. I know a POST should be used, but let's say I did use a GET.
How exactly can this cause an actual problem? Other than all the violations in the bullet list above, how can using an HTTP GET to do the above cause a real issue? Too many times I end up in a scenario where I can justify things with "because the doc said so", but I need justification and a better understanding of this one.
Thanks!
The practical case where you will have a problem is that an HTTP GET is often retried by the HTTP implementation in the event of a failure, so in real life you can get situations where the same GET is received multiple times by the server. If your update is idempotent, there will be no problem, but if it's not idempotent (like adding some value to a running total, as your example does), then you could get multiple (undesired) updates.
HTTP POST is not automatically retried, so you would never have this problem.
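A tiny illustration of that difference (plain Python, no HTTP; the record and the numbers are made up for the example): repeating an idempotent update, as a retry would, changes nothing, while the additive update is applied twice.

record = {"total": 100}

def set_total(value):       # idempotent: repeating it does not change the outcome
    record["total"] = value

def add_to_total(amount):   # not idempotent: every repeat changes the result again
    record["total"] += amount

set_total(110)
set_total(110)              # an automatic retry of the same request
print(record["total"])      # 110 either way

record["total"] = 100
add_to_total(10)
add_to_total(10)            # the retried GET applies the +10 a second time
print(record["total"])      # 120 instead of the intended 110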
If some form of search engine spiders your site, it could change your data unintentionally.
This happened in the past with Google's Desktop Search, which caused people to lose data because they had implemented delete operations as GETs.
Here is an important reason GETs should be idempotent and not be used for updating state on the server: Cross-Site Request Forgery attacks. From the book Professional ASP.NET MVC 3:
Idempotent GETs
Big word, for sure — but it's a simple concept. If an operation is idempotent, it can be executed multiple times without changing the result. In general, a good rule of thumb is that you can prevent a whole class of CSRF attacks by only changing things in your DB or on your site by using POST. This means Registration, Logout, Login, and so forth. At the very least, this limits the confused deputy attacks somewhat.
There is one more problem: if the GET method is used, data is sent in the URL itself, and in the web server's logs that data gets saved on the server along with the request path. Now suppose someone has access to, or reads, those log files: your data (user IDs, passwords, keywords, tokens, etc.) is revealed. This is dangerous and has to be taken care of.
In the server's log file, headers and body are not logged, but the request path is. So with the POST method, where data is sent in the body rather than in the request path, your data remains safe.
I think that reading this resource: http://www.servicedesignpatterns.com/WebServiceAPIStyles could be helpful to you in telling the difference between a message API and a resource API.
I've been trying (unsuccessfully, I might add) to scrape a website created with the Microsoft stack (ASP.NET, C#, IIS) using Python and urllib/urllib2. I'm also using cookielib to manage cookies. After spending a long time profiling the website in Chrome and examining the headers, I've been unable to come up with a working solution to log in. Currently, in an attempt to get it to work at the most basic level, I've hard-coded the encoded URL string with all of the appropriate form data (even the ViewState, etc.). I'm also passing valid headers.
The response that I'm currently receiving reads:
29|pageRedirect||/?aspxerrorpath=/default.aspx|
I'm not sure how to interpret the above. Also, I've looked pretty extensively at the client-side code used in processing the login fields.
Here's how it works: you enter your username/password and hit a 'Login' button. Pressing the Enter key also simulates this button press. The input fields aren't in a form. Instead, there are a few onClick events on said Login button (most of which are just for aesthetics), but the one in question handles validation. It does some rudimentary checks before sending things off to the server side. Based on the web resources, it definitely appears to be using .NET AJAX.
When logging into this website normally, you request the domain as a POST with form data containing your username and password, among other things. Then there is some sort of URL rewrite or redirect that takes you to a content page at url.com/twitter. When attempting to access url.com/twitter directly, it redirects you to the main page.
I should note that I've decided to leave the URL in question out. I'm not doing anything malicious, just automating a very monotonous check once every reasonable increment of time (I'm familiar with compassionate screen scraping). However, it would be trivial to associate my StackOverflow account with that account in the event that it didn't make the domain owners happy.
My question is: I've been able to successfully log in and automate services in the past, none of which were .NET-based. Is there anything different that I should be doing, or maybe something I'm leaving out?
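For context, this is roughly the shape of what I'm doing (Python 2, to match urllib2/cookielib; the URL, headers and field names below are placeholders rather than the real ones):

import urllib
import urllib2
import cookielib

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [
    ("User-Agent", "Mozilla/5.0 (placeholder)"),
    ("X-Requested-With", "XMLHttpRequest"),
]

# Fetch the page first so the session cookie and __VIEWSTATE / __EVENTVALIDATION are current.
landing = opener.open("https://example.com/default.aspx").read()

form = {
    "__VIEWSTATE": "...",          # parsed out of `landing` in the real script
    "__EVENTVALIDATION": "...",
    "username": "me",
    "password": "secret",
}
response = opener.open("https://example.com/default.aspx", urllib.urlencode(form))
print(response.read())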
For anyone else that might be in a similar predicament in the future:
I'd just like to note that I've had a lot of success with a Greasemonkey-style user script in Chrome for all of my scraping and automation. I found it to be a lot easier than Python + urllib2 (at least for this particular case). The user scripts are written in 100% JavaScript.
When scraping a web application, I use either:
1) Wireshark, or
2) a logging proxy server (one that logs headers as well as payload).
I then compare what the real application does (in this case, how your browser interacts with the site) with the scraper's logs. Working through the differences will bring you to a working solution.
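If the scraper side of that comparison is a urllib2 script like the one in the question, a quick (if crude) way to capture exactly what it sends, without extra tooling, is to turn on the built-in handlers' debug output and diff that against what the browser or proxy shows (the URL here is a placeholder):

import urllib2

# debuglevel=1 makes urllib2 print the outgoing request line/headers and the
# response headers to stdout, which you can compare against the browser's traffic.
opener = urllib2.build_opener(
    urllib2.HTTPHandler(debuglevel=1),
    urllib2.HTTPSHandler(debuglevel=1),
)
opener.open("https://example.com/default.aspx")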