I am currently trying to scrape a website and stay logged in as I scrape. I am using Splash with Scrapy to scrape a site that relies on JavaScript. Unfortunately, from what I understand, Splash resets cookies on every SplashRequest. My question is: how do I keep my cookies from being reset?
After scraping the web myself for a solution, I know it has something to do with Lua scripts or cookie middleware, but I have no idea how to use them. If anyone could help, it would be great. All the sites that cover this are really unclear, so please be as clear as possible.
Yes, you can set cookies and return cookies in Lua scripts. If the login page and the scraping page use the same script, your script should look like this:
function main(splash)
    splash:init_cookies(splash.args.cookies)
    -- ... your script
    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end
If you use different scripts for login and scraping, you can return cookies from the login script and send them along with the next SplashRequest:
yield SplashRequest(url=url, callback=self.item_parse, endpoint='execute', args={
    'lua_source': self.scrape_script,
    'cookies': cookies,  # available in the script as splash.args.cookies
})
In scrape_script you then need to initialize those cookies with:
splash:init_cookies(splash.args.cookies)
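For context, here is a minimal sketch of how the two requests could be wired together in a spider. The Lua scripts, URLs, and callback names are illustrative, not taken from the question; the key points are that the login script returns splash:get_cookies(), the spider reads them from response.data, and then passes them back through args so the scraping script can call splash:init_cookies(splash.args.cookies).
import scrapy
from scrapy_splash import SplashRequest


class LoginThenScrapeSpider(scrapy.Spider):
    name = 'login_then_scrape'

    # Logs in, then hands the session cookies back to Scrapy.
    login_script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        -- fill in and submit the login form here
        return {cookies = splash:get_cookies()}
    end
    """

    # Re-installs the cookies before visiting the page to scrape.
    scrape_script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        return {html = splash:html(), cookies = splash:get_cookies()}
    end
    """

    def start_requests(self):
        yield SplashRequest('https://example.com/login',
                            callback=self.after_login,
                            endpoint='execute',
                            args={'lua_source': self.login_script})

    def after_login(self, response):
        # Whatever the Lua script returns is exposed as response.data.
        cookies = response.data['cookies']
        yield SplashRequest('https://example.com/items',
                            callback=self.item_parse,
                            endpoint='execute',
                            args={'lua_source': self.scrape_script,
                                  'cookies': cookies})

    def item_parse(self, response):
        # The page rendered by scrape_script is in response.data['html'].
        pass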
I have an application which needs to make sure the user has saved their data when they accidentally close the browser (or tab). I have added a window:beforeunload handler to show the confirm dialog and a window:unload handler to call an HTTP service if the user really did want to leave but save the changes. However, the call never gets executed on the server (unless I set a breakpoint in Chrome). I found a post using AngularJS (How to send an HTTP request onbeforeunload in AngularJS?) but cannot for the life of me figure out how to do this in Angular 5. Any help would be much appreciated.
I have figured it out with a bit of help from a different post (Angular 2 - Execute code when closing window).
So, to save my info from the window:unload handler, I use a synchronous POST like the following...
// Synchronous request (third argument of open() is false) so it finishes before the page unloads
let xhr = new XMLHttpRequest();
xhr.open("POST", url, false);
xhr.setRequestHeader('Content-Type', 'application/json');
xhr.send(JSON.stringify(data));
I'm writing some HTML which provides a preview function for a list of URLs. I want to use an iframe for this functionality.
The issue arises when some of the URLs are broken (returning a 500 error) or when a page involves some authentication process which the requesting user cannot satisfy. In these situations the iframe still tries to display the URL, but the content returned in the frame (a 500 error or an authentication error) is useless to the user.
Does the iframe have any built-in error handling for these scenarios, or is there some other way I can display a generic error page if something goes wrong while loading the iframe?
Thanks
AFAIK, there is no way to directly access the headers of a response to a request initiated by an iframe (or indeed, any request) in client-side script.
This is slightly convoluted, but I think it would work:
The iframe is initially loaded with a URL that refers to a script on your server, and you pass the actual URL as a GET parameter.
The server side script takes that URL, and sends a HEAD request to it (following 3xx redirects).
If the response code for the HEAD request is >= 200 and < 300, send some script back to the client which changes the iframe's src to the actual URL (you might be able to do something as simple as window.location.href = but I'm not sure without testing).
If the response code is >= 400, send a page that says "This page is not loading at the moment".
If you know PHP/have it available on your server, I can try and provide a code example.
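In the meantime, here is a rough sketch of that server-side gatekeeper, written in Python with Flask and the requests library rather than PHP (purely illustrative; the /preview route name and the shortcut of issuing an HTTP redirect instead of returning a script are my own choices):
from flask import Flask, redirect, request
import requests

app = Flask(__name__)

@app.route('/preview')
def preview():
    # The iframe points at /preview?url=<encoded target URL>
    target = request.args.get('url', '')
    try:
        # Probe the target with a HEAD request, following 3xx redirects
        head = requests.head(target, allow_redirects=True, timeout=5)
        ok = 200 <= head.status_code < 300
    except requests.RequestException:
        ok = False
    if ok:
        # Healthy page: send the iframe on to the real URL
        return redirect(target)
    # Broken or auth-protected page: show a generic message instead
    return "This page is not loading at the moment"
In a real deployment you would also want to validate the url parameter (e.g. against a whitelist of allowed hosts) so the endpoint cannot be abused as an open redirect.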
What does "Pending" mean under the status column in the "Network" tab of Google Chrome Developer window?
This happens when my page script issues a GET request whose response contains content-headers for downloading a CSV file:
Content-type: text/csv;
Content-Disposition: attachment; filename=myfile.csv
This works fine in FF and IE7, downloading a CSV file as expected and opening a file picker to save the file, but Chrome does nothing. I confirmed that the server responds to the request, so it appears that Chrome will not process the response.
Curiously, everything works as expected if I type the URL into Chrome's address bar and hit <enter>.
FYI: Chrome 10.0.648.204 on Windows XP
In my case, I found that the "pending" status was caused by the AdBlock extension. The image that I couldn't get to load had the word "ad" in the URL, so AdBlock kept it from loading.
Disabling AdBlock fixes this issue.
Renaming the file so that it doesn't contain "ad" in the URL also fixes it, and is obviously a better solution. Unless it's an advertisement, in which case you should leave it like that.
I also get this when using the HTTPS Everywhere plugin.
This plugin has a list of sites that are also available over HTTPS instead of HTTP, so I assume the original HTTP request is somehow cancelled before it is actually made.
So for example when I go to http://stackexchange.com, in the developer tools I first see a request with status (terminated). This request has only a few headers (GET, User-Agent, and Accept) and no response.
Then there is a request to https://stackexchange.com with full headers, etc.
So I assume this status is used for requests that are never actually sent.
I had some problems with pending requests for mp3 files.
I had a list of mp3 files and one player to play them. If I picked a file that had already been downloaded, Chrome would block the request and show "pending request" in the network tab of the developer tools.
All versions of Chrome seem to be affected.
Here is a solution I found:
player[0].setAttribute('src', 'video.webm?dummy=' + Date.now());
You just add a dummy query string to the end of each URL. This forces Chrome to download the file again.
Another example with the Popcorn player (using jQuery):
url = $(this).find('.url_song').attr('url');
pop = Popcorn.smart("#player_", url + '?i=' + Date.now());
This works for me because the resource is then not served from the cache. It should work the same way for .csv files.
I had the same issue on OS X Mavericks. It turned out that Sophos Anti-Virus was blocking certain requests; once I uninstalled it, the issue went away.
If you think it might be caused by an extension, one easy way to test this is to open Chrome with the --disable-extensions flag and see if that fixes the problem. If it doesn't, consider looking beyond the browser to see if any other application might be causing the problem, specifically security apps which can interfere with requests.
I had a similar issue with application/json AJAX calls. In FF/IE they were fine. In Chrome, in the developer Network window, the status was always (pending) because a different status code was being returned.
In my case I changed my JSON response to send an HttpStatusCode of 200, and then Chrome was fine and the status text changed to 200 OK.
For example, using ASP.NET Web API:
return new HttpResponseMessage(HttpStatusCode.OK) {
    Content = request.Content
};
A (pending) value in the Time column of the Network tab simply means the request is still in progress; as soon as the response arrives, it is replaced with the total elapsed time. (The screenshots in the original answer show the network call first in the pending state and then with the time it took once it completed.)
The fix, for me, was to add the following to the top of the PHP file which was being requested.
header("Cache-Control: no-cache,no-store");
Same problem with Chrome: I had the following code in my HTML page:
<body>
...
<script src="http://myserver/lib/load.js"></script>
...
</body>
But load.js always showed a pending status in the Network panel.
I found a workaround by loading load.js asynchronously:
<body>
...
<script>
  // Inject the script tag from a timeout so load.js is loaded asynchronously
  setTimeout(function(){
    var head, script;
    head = document.getElementsByTagName("head")[0];
    script = document.createElement("script");
    script.src = "http://myserver/lib/load.js";
    head.appendChild(script);
  }, 1);
</script>
...
</body>
Now it's working fine.
Encountered a similar issue recently.
My app is in Angular 11, and we have a form with some validators that use regexes to validate the data. One of the data elements had a special character that the regex wasn't handling, and it hung the entire browser. In fact, even though all network calls were successful with 200 OK, Chrome was not showing any response returned by the backend and was showing the requests in a pending state, with no console errors or anything. Fixing the regex resolved the issue.
After I found the issue, I googled more about it. Here is more explanation of what was going on:
https://javascript.info/regexp-catastrophic-backtracking
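For illustration, here is a tiny sketch of the failure mode the article describes (shown in Python, whose re engine backtracks in the same way as JavaScript's; the pattern and inputs are made up for the demo). The nested quantifier forces the engine to try an exponential number of ways to split the input once the final character fails to match, which is what makes the page appear hung.
import re
import time

# (a+)+$ is a classic catastrophically backtracking pattern: for N 'a's
# followed by a character that cannot match, the engine tries on the order
# of 2^N ways of splitting the 'a's between the inner and outer quantifier.
pattern = re.compile(r'(a+)+$')

for n in (16, 20, 24):
    subject = 'a' * n + '!'
    start = time.perf_counter()
    pattern.match(subject)  # always fails, but takes exponentially longer
    print(n, 'characters:', round(time.perf_counter() - start, 3), 'seconds')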
I came across this issue when I was debugging a local web application. The issue turned out to be AVG Antivirus and Firewall restrictions. I had to allow an exception through the firewall to get rid of the "Pending" status.
In my case, a simple restart of my browser (Chrome) made it work straight away afterwards, like magic!
A little bit of context: I happened to refresh my frontend web page and straight away went on to make a change to my API, which caused it to restart. During that time the frontend was making calls to the API, which ended up "pending" because the API was reloading. The browser at that point cached the pending state. To get out of it I could either set no-cache (which I didn't want to do) or simply restart the browser; I chose the restart.
A little background
I encountered this issue when requesting a URL in my Django project. The server is set up using the Apache HTTP web server, with basic auth for user authentication.
The URL I was accessing required no authentication, i.e. in my Apache config I had set Require all granted on the URL using the LocationMatch directive.
The issue
The URL I was trying to access returned a 200 status (in the Network tab in Chrome), but the static assets used for styling the requested webpage (CSS, JavaScript, font files, etc.) were not loading and showed a pending status.
Meanwhile, the page loaded partially and kept on loading. All this was happening with a basic-auth dialog showing in the browser, even though my URL was granted all access.
What worked for me
Interestingly, as soon as I entered my credentials and logged in, the requested page loaded all the static assets. This made it very clear to me that the static assets directory might NOT have the necessary access permissions.
Then I granted access to the static assets directory by updating my Apache config, and the requested URL and webpage loaded fine (200 status) without any basic-auth dialog or pending status.
In my case, there was a pending Chrome update that stopped pages from loading until I restarted the browser. Cheers
I encountered the same problem when requesting certain images from a page. I use JavaScript to set the src attribute of an img element, and if the network is poor, pending is displayed in the Network panel of the Chrome developer window. I think it's due to the poor network.
I'm debugging my webserver, and I'd like to manually send HEAD requests to some web pages. Is there a way to do this in Firefox? Some extension perhaps.
I want to use Firefox so that it can be part of a normal session (i.e. cookies set, logged in, etc.). So things like curl aren't perfect.
Another possibility is opening up Firebug (or making this into a Greasemonkey script) and using JavaScript to send your HEAD request.
var xmlhttp = new XMLHttpRequest();
xmlhttp.open("HEAD", "/test/this/page.php", true); // make an async HEAD request (must be a relative path to avoid cross-domain restrictions)
xmlhttp.onreadystatechange = function() {
    if (xmlhttp.readyState == 4) { // make sure the request is complete
        alert(xmlhttp.getAllResponseHeaders()); // display the headers
    }
};
xmlhttp.send(null); // send the request
XMLHttpRequests inherit the cookies and current session (authentication from .htaccess, etc.).
Ways to use this:
Use the javascript: URL method
Use the Firebug console (http://getfirebug.com/) to execute JavaScript on the page
Create a Greasemonkey script that executes HEAD requests and displays the result
Live HTTP Headers can send arbitrary HTTP requests using its replay function. Though it's a bit fiddly. And as it's a HEAD request, there'll be no output to see locally (it's normally displayed in the browser window).
First you need to open up the Live HTTP Headers (LHH) window, do your request from the browser using GET, then select that request in the LHH window and choose Replay.... Then, in the window that pops up, change GET to HEAD and fiddle with the headers if you like.
Pressing Replay will make the request.
This is a pretty old thread, but there is a Firefox plugin called "Poster" that does what you want.
There is another plugin I've used called "Rest Client" that is also good.
I don't know of any plugin, but this page might be of some use to you:
http://www.askapache.com/online-tools/http-headers-tool
I believe that you can send HEAD requests with Fiddler:
http://www.fiddler2.com/Fiddler2/version.asp
This seems to be a solution that works in Firefox as an add-on, called Modify Headers:
https://addons.mozilla.org/en-US/firefox/addon/967
Check out http-tool for Firefox:
https://addons.mozilla.org/en-US/firefox/addon/http-tool/
It is aimed at web developers who need to debug HTTP requests and responses, and can be extremely useful while developing a REST-based API.
Features:
* GET
* HEAD
* POST
* PUT
* DELETE
Add header(s) to request.
Add body content to request.
View header(s) in response.
View body content in response.
View status code of response.
View status text of response.