Scrapy Splash making infinite requests when loading a page - web-scraping

I am new to Scrapy and Splash. I want to extract some text from an online forum, but some of the pages never load completely. I tested in Splash and found that it kept looping, attempting to request something until the end of the wait time, so the page could not be completely loaded.
Screenshot of a page that fails to load: https://lihkg.com/thread/2463999/page/1
Screenshot of a page that loads successfully: https://lihkg.com/thread/2968429/page/1
I have tried extending the wait time to 30 s and even more, but it doesn't help.
Here's the script I run in Splash:
function main(splash, args)
    splash.private_mode_enabled = false
    splash.images_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(15))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
Thanks!
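For reference, a minimal sketch of how a script like this can be driven from a Scrapy spider via the scrapy-splash package (assuming its downloader middleware is enabled in settings.py; the spider name is made up, and resource_timeout is a standard Splash argument that aborts individual requests that hang, which may help when one resource loops forever):

import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
    splash.private_mode_enabled = false
    splash.images_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(15))
    return {html = splash:html()}
end
"""

class ForumSpider(scrapy.Spider):
    name = "forum"  # hypothetical spider name
    start_urls = ["https://lihkg.com/thread/2463999/page/1"]

    def start_requests(self):
        for url in self.start_urls:
            # endpoint="execute" runs the Lua script on the Splash server;
            # resource_timeout asks Splash to abort any single request that hangs.
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={"lua_source": LUA_SCRIPT, "timeout": 90, "resource_timeout": 10},
            )

    def parse(self, response):
        yield {"html": response.text}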

Related

How to solve "Browser errors were logged to the console"?

PageSpeed Insights is showing this error message for my WordPress website (MyBGMI.Com).
I can't fix this problem; to be very honest, I can't understand it.
Errors logged to the console indicate unresolved problems. They can come from network request failures and other browser concerns. Learn more
Source / Description:
TypeError: Cannot read properties of null (reading 'parentNode') at data:text/javascript;base64,dmFyIGRvd25sb2FkQnV0dG9uPWRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCJkb3dubG9hZCIpO3ZhciBjb3VudGVyPTQwO3ZhciBuZXdFbGVtZW50PWRvY3VtZW50LmNyZWF0ZUVsZW1lbnQoInAiKTtuZXdFbGVtZW50LmlubmVySFRNTD0iWW91IGNhbiBkb3dubG9hZCB0aGUgZmlsZSBpbiA0MCBzZWNvbmRzLiI7dmFyIGlkO2Rvd25sb2FkQnV0dG9uLnBhcmVudE5vZGUucmVwbGFjZUNoaWxkKG5ld0VsZW1lbnQsZG93bmxvYWRCdXR0b24pO2lkPXNldEludGVydmFsKGZ1bmN0aW9uKCl7Y291bnRlci0tO2lmKGNvdW50ZXI8MCl7bmV3RWxlbWVudC5wYXJlbnROb2RlLnJlcGxhY2VDaGlsZChkb3dubG9hZEJ1dHRvbixuZXdFbGVtZW50KTtjbGVhckludGVydmFsKGlkKX1lbHNle25ld0VsZW1lbnQuaW5uZXJIVE1MPSJKVVNUIFdBSVQgIitjb3VudGVyLnRvU3RyaW5nKCkrIiBTRUNPTkRTLiIrIllPVVIgQkdNSSAyLjMgRE9XTkxPQUQgTElOSyBJUyBHRU5FUkFUSU5HIn19LDEwMDAp:1:200
I tried to figure out the issue but did not understand anything. I just checked the page with the Chrome developer tools,
where I found two errors, but I can't understand how to fix them.
This is base64-encoded JavaScript (usually a bad sign when found on a WordPress site).
(you can decode it online here: https://www.base64decode.org/)
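Alternatively, you can decode it locally; a minimal sketch in Python (only the first statement of the payload is inlined here, so paste the full base64 string from the error to decode everything):

import base64

# First statement of the data: URI payload; replace with the full string.
payload = "dmFyIGRvd25sb2FkQnV0dG9uPWRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCJkb3dubG9hZCIpOw=="
print(base64.b64decode(payload).decode("utf-8"))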
Decoded it says:
var downloadButton = document.getElementById("download");
var counter = 40;
var newElement = document.createElement("p");
newElement.innerHTML = "You can download the file in 40 seconds.";
var id;
downloadButton.parentNode.replaceChild(newElement, downloadButton);
id = setInterval(function () {
    counter--;
    if (counter < 0) {
        newElement.parentNode.replaceChild(downloadButton, newElement);
        clearInterval(id);
    } else {
        newElement.innerHTML = "JUST WAIT " + counter.toString() + " SECONDS." + "YOUR BGMI 2.3 DOWNLOAD LINK IS GENERATING";
    }
}, 1000);
It appears that downloadButton is null (no element with the id "download" exists when the script runs), so reading downloadButton.parentNode throws the error above.
If this is your code, and a desired piece of your WordPress website, guard the DOM access: make sure downloadButton exists before calling downloadButton.parentNode.replaceChild(...), as in the updated snippet below. Otherwise, find where this script is coming from and remove it from your code base.
Update
Try this:
var downloadButton = document.getElementById("download");
var counter = 40;
var newElement = document.createElement("p");
newElement.innerHTML = "You can download the file in 40 seconds.";
var id;
// Only proceed if the #download button exists and is attached to the DOM.
if (downloadButton && downloadButton.parentNode) {
    downloadButton.parentNode.replaceChild(newElement, downloadButton);
    id = setInterval(function () {
        counter--;
        if (counter < 0) {
            newElement.parentNode.replaceChild(downloadButton, newElement);
            clearInterval(id);
        } else {
            newElement.innerHTML = "JUST WAIT " + counter.toString() + " SECONDS." + "YOUR BGMI 2.3 DOWNLOAD LINK IS GENERATING";
        }
    }, 1000);
}

Got a "has triggered a breakpoint" error when loading a URL

I use CefSharp (version 92.260) in my application. The application sometimes crashes when loading a URL, but not always.
Below is my code:
Task.Factory.StartNew(() =>
{
    var spinWait = new SpinWait();
    // Spin-wait until the browser finishes initializing, then load the URL.
    while (!_Browser.IsBrowserInitialized)
    {
        spinWait.SpinOnce();
    }
    _Browser.Load(url);
});
I'd suggest you upgrade to a supported version, 103 at the time of writing.
Improvements have been made; it is no longer necessary to wait for IsBrowserInitialized to be true before calling Load(url).
You can just call Load without the IsBrowserInitialized check.

Get the title from a URL in a server-side Meteor method when the page content is compressed (gzip)

I want to create a routine to get the title for a given URL (outside my domain) using a server-side Meteor method.
I'm using the following code:

var result = HTTP.get(url, {timeout: 30000});
if (result.statusCode == 200) {
    console.log('OK status code 200 on libGetTitleFromUrlAsync');
    // Decompress the body if the server returned it gzip-encoded.
    var content = (result.headers['content-encoding'] === 'gzip')
        ? inflateSync(new Buffer(result.content))
        : result.content;
    var start = content.toLowerCase().indexOf('<title>');
    var end = content.toLowerCase().indexOf('</title>');
    var title = content.substring(start + '<title>'.length, end);
}

I'm using the http and gb96:zlib packages. The first retrieves the URL content and the second decompresses the buffer if it's compressed.
I'm using this compressed page as a sample: http://firstround.com/review/this-is-what-impactful-engineering-leadership-looks-like/
inflateSync keeps raising an exception. When the target URL is not compressed, it works perfectly.
Any ideas?
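One thing worth checking, assuming gb96:zlib mirrors Node's zlib API: inflateSync expects a raw zlib/deflate stream, but a body sent with Content-Encoding: gzip carries a gzip wrapper and needs gunzipSync instead. Here is the same routine sketched in Python, where the distinction is the gzip module versus raw zlib:

import gzip
import urllib.request

# Sketch only: fetch a page that may be gzip-encoded and pull out its <title>.
req = urllib.request.Request(
    "http://firstround.com/review/this-is-what-impactful-engineering-leadership-looks-like/",
    headers={"Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        # gzip.decompress handles the gzip wrapper; a raw zlib.decompress(body)
        # would raise an "incorrect header check" error here.
        body = gzip.decompress(body)

html = body.decode("utf-8", errors="replace")
lower = html.lower()
start = lower.find("<title>")
end = lower.find("</title>")
print(html[start + len("<title>"):end] if start != -1 and end != -1 else None)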

Meteor Autorun() not running when accessing collection.property

Something strange is going on; when I run the following code, it works:
Deps.autorun(function () {
    var room = Rooms.findOne({'room_id': Session.get('room_id')});
    // var p = room.room_id;
    console.log('autorun');
});
However, if I uncomment the var p line, it (the whole block) stops running. What's happening?
Found this in the depths of the Meteor.js documentation: "If the initial run of an autorun throws an exception, the computation is automatically stopped and won't be rerun."
On the first run of the autorun, which happens when the page loads, the collection hasn't been populated yet, so findOne returns undefined and accessing room.room_id throws an exception; the autorun then immediately stops running again. Fixed by adding:
if (room) {
    console.log(room.room_id);
    ...
}

Unknown reason for Timeout on HTTP HEAD request

I'm using ASP.NET 3.5 to build a website. One area of the website shows 28 video thumbnail images, which are JPEGs hosted on another webserver. If one or more of these JPEGs do not exist, I want to display a locally hosted default image to the user rather than a broken image link in the browser.
The approach I have taken is that whenever the page is rendered, it performs an HTTP HEAD request for each of the images. If I get a 200 OK status code back, the image is good and I can write out <img src="http://media.server.com/media/123456789.jpg" />. If I get a 404 Not Found, I write out <img src="/images/defaultthumb.jpg" />.
Of course I don't want to do this every time for all requests, so I've implemented a list of cached image status objects stored at application level, so that each image is only checked once every 5 minutes across all users; but this doesn't really have any bearing on my issue.
This seems to work very well. My problem is that for specific images, the HTTP HEAD request fails with Request Timed Out.
I have set my timeout value very low, to only 200 ms, so that it doesn't delay the page rendering too much. This timeout seems to be fine for most of the images, and I've tried increasing it during debugging, but it makes no difference even at 10 s or more.
I write out a log file to see what's happening, and this is what I get (edited for clarity and anonymity):
14:24:56.799|DEBUG|[HTTP HEAD CHECK OK [http://media.server.com/adpm/505C3080-EB4F-6CAE-60F8-B97F77A43A47/videothumb.jpg]]
14:24:57.356|DEBUG|[HTTP HEAD CHECK OK [http://media.server.com/adpm/66E2C916-EEB1-21D9-E7CB-08307CEF0C10/videothumb.jpg]]
14:24:57.914|DEBUG|[HTTP HEAD CHECK OK [http://media.server.com/adpm/905C3D99-C530-46D1-6B2B-63812680A884/videothumb.jpg]]
...
14:24:58.470|DEBUG|[HTTP HEAD CHECK OK [http://media.server.com/adpm/1CE0B04D-114A-911F-3833-D9E66FDF671F/videothumb.jpg]]
14:24:59.027|DEBUG|[HTTP HEAD CHECK OK [http://media.server.com/adpm/C3D7B5D7-85F2-BF12-E32E-368C1CB45F93/videothumb.jpg]]
14:25:11.852|ERROR|[HTTP HEAD CHECK ERROR [http://media.server.com/adpm/BED71AD0-2FA5-EA54-0B03-03D139E9242E/videothumb.jpg]] The operation has timed out
Source: System
Target Site: System.Net.WebResponse GetResponse()
Stack Trace: at System.Net.HttpWebRequest.GetResponse()
at MyProject.ApplicationCacheManager.ImageExists(String ImageURL, Boolean UseCache) in d:\Development\MyProject\trunk\src\Web\App_Code\Common\ApplicationCacheManager.cs:line 62
14:25:12.565|ERROR|[HTTP HEAD CHECK ERROR [http://media.server.com/adpm/92399E61-81A6-E7B3-4562-21793D193528/videothumb.jpg]] The operation has timed out
Source: System
Target Site: System.Net.WebResponse GetResponse()
Stack Trace: at System.Net.HttpWebRequest.GetResponse()
at MyProject.ApplicationCacheManager.ImageExists(String ImageURL, Boolean UseCache) in d:\Development\MyProject\trunk\src\Web\App_Code\Common\ApplicationCacheManager.cs:line 62
14:25:13.282|ERROR|[HTTP HEAD CHECK ERROR [http://media.server.com/adpm/7728C3B6-69C8-EFAA-FC9F-DAE70E1439F9/videothumb.jpg]] The operation has timed out
Source: System
Target Site: System.Net.WebResponse GetResponse()
Stack Trace: at System.Net.HttpWebRequest.GetResponse()
at MyProject.ApplicationCacheManager.ImageExists(String ImageURL, Boolean UseCache) in d:\Development\MyProject\trunk\src\Web\App_Code\Common\ApplicationCacheManager.cs:line 62
As you can see, the first 25 HEAD requests work, and the final 3 do not. It's always the last three.
If I paste one of the failed HEAD request URLs into a web browser: http://media.server.com/adpm/BED71AD0-2FA5-EA54-0B03-03D139E9242E/videothumb.jpg, it loads the image with no problems.
To try to work out what is happening here, I used Wireshark to capture all of the HTTP requests that are sent to the webserver hosting the images. For the log example I've given, I can see 25 HEAD requests for the 25 that were successful, but the 3 that failed do NOT appear in the Wireshark trace.
Other than the images having different visual content, there is no difference from one image to the next.
To eliminate any problems with the URL itself (even though it works in a browser) I changed the order by switching one of the first images with one of the last failed three. When I do this, the problem goes away for the one that used to fail, and starts failing for the one that was bumped down to the end of the list.
So I think I can deduce from the above that when more than 25 HEAD requests occur in quick succession, subsequent HEAD requests fail regardless of the specific URL. I also know that the issue is on the IIS server rather than the remote image hosting server, due to the lack of requests in the Wireshark trace beyond the first 25.
The code snippet I'm using to perform the HEAD requests is shown below. Can anyone give me any suggestions as to what might be the problem? I've tried various combinations of request header values, but none of them seem to make any difference. My gut feeling is that there is some IIS setting somewhere that limits the number of concurrent HttpWebRequests to 25 within any one request to an ASP.NET page.
try {
    HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create(ImageURL);
    hwr.Method = "HEAD";
    hwr.KeepAlive = false;
    hwr.AllowAutoRedirect = false;
    hwr.Accept = "image/jpeg";
    hwr.Timeout = 200;
    hwr.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.Reload);
    //hwr.Connection = "close";
    HttpWebResponse hwr_result = (HttpWebResponse)hwr.GetResponse();
    if (hwr_result.StatusCode == HttpStatusCode.OK) {
        Diagnostics.Diags.Debug("HTTP HEAD CHECK OK [" + ImageURL + "]", HttpContext.Current.Request);
        // EXISTENCE CONFIRMED - ADD TO CACHE
        if (UseCache) {
            _ImageExists.Value.RemoveAll(ie => ie.ImageURL == ImageURL);
            _ImageExists.Value.Add(new ImageExistenceCheck() { ImageURL = ImageURL, Found = true, CacheExpiry = DateTime.Now.AddMinutes(5) });
        }
        // RETURN TRUE
        return true;
    } else if (hwr_result.StatusCode == HttpStatusCode.NotFound) {
        throw new WebException("404");
    } else {
        throw new WebException("ERROR");
    }
} catch (WebException ex) {
    if (ex.Message.Contains("404")) {
        Diagnostics.Diags.Debug("HTTP HEAD CHECK NOT FOUND [" + ImageURL + "]", HttpContext.Current.Request);
        // NON-EXISTENCE CONFIRMED - ADD TO CACHE
        if (UseCache) {
            _ImageExists.Value.RemoveAll(ie => ie.ImageURL == ImageURL);
            _ImageExists.Value.Add(new ImageExistenceCheck() { ImageURL = ImageURL, Found = false, CacheExpiry = DateTime.Now.AddMinutes(5) });
        }
        return false;
    } else {
        Diagnostics.Diags.Error(HttpContext.Current.Request, "HTTP HEAD CHECK ERROR [" + ImageURL + "]", ex);
        // ASSUME IMAGE IS OK
        return true;
    }
} catch (Exception ex) {
    Diagnostics.Diags.Error(HttpContext.Current.Request, "GENERAL CHECK ERROR [" + ImageURL + "]", ex);
    // ASSUME IMAGE IS OK
    return true;
}
I have solved this myself. The problem was indeed the number of allowed connections, which was set to 24 by default.
In my case, I am only going to perform the image check if MyHttpWebRequest.ServicePoint.CurrentConnections is less than 10.
To increase the maximum limit, set ServicePointManager.DefaultConnectionLimit to the number of concurrent connections you require.
An alternative that may help some people is to reduce the idle time, i.e. how long a connection waits before destroying itself. To change this, set MyHttpWebRequest.ServicePoint.MaxIdleTime to the timeout value in milliseconds.
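The same failure mode exists in any HTTP stack with a capped per-host connection pool. Purely as a cross-language illustration (not the author's code), here is the equivalent knob in Python's requests library, where pool_block=True makes a caller wait for a free connection rather than exceed the cap:

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Raise the per-host pool size (the analogue of ServicePointManager.DefaultConnectionLimit);
# with pool_block=True, callers wait for a free connection instead of opening extras.
adapter = HTTPAdapter(pool_maxsize=50, pool_block=True)
session.mount("http://", adapter)
session.mount("https://", adapter)

def image_exists(url: str) -> bool:
    # HEAD-check a remote image: 200 means present, 404 means missing.
    try:
        resp = session.head(url, timeout=0.2, allow_redirects=False)
        return resp.status_code == 200
    except requests.RequestException:
        return True  # assume the image is OK on other errors, as the original code does

print(image_exists("http://media.server.com/adpm/BED71AD0-2FA5-EA54-0B03-03D139E9242E/videothumb.jpg"))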
