I'm trying to build a scraper using CasperJS, but it keeps getting blocked. I read several articles saying that this can be avoided by setting a user agent, but even with a user agent set I still get blocked.
Here is my current setup:
var casper = require('casper').create({
verbose: true,
logLevel: 'debug',
colorizerType: 'Dummy',
waitTimeout: 30000, // timeout for waits (loading etc.)
exitOnError: true,
pageSettings: {
userAgent: 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5',
javascriptEnabled: true,
loadImages: true,
loadPlugins: true,
},
onError: function(msg, backtrace) {
this.exit();
}
});
casper.start().then(function() {
this.open('https://WEBSITE-URL', {
headers: {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
}
});
casper.viewport(1280, 1024);
});
// Login
casper.then(function() {
this.echo("Waiting for login form to load.");
this.echo(this.getHTML());
});
I receive this HTML after running casper:
<!DOCTYPE html><html><head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="expires" content="0">
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT">
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=972f0bd8-1861-4c7b-8459-ce880b8cf2b6&httpReferrer=%2F">
<script type="text/javascript">
(function(window){
try {
if (typeof sessionStorage !== 'undefined'){
sessionStorage.setItem('distil_referrer', document.referrer);
}
} catch (e){}
})(window);
</script>
<script type="text/javascript" src="/dstltrntmls.js" defer="">
</script>
<style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#ruxctfdwzvsxvuucdvdtdtsufa{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>
<div id="d__fFH" style="position: absolute; top: -5000px; left: -5000px;">
<object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object>
<span id="d__fF" style="font-family: Courier, serif; font-size: 72px; visibility: hidden;">The quick brown fox jumps over the lazy dog.</span></div></body>
</html>
Is there a way to work around this issue? When I try a simple GET request in Postman it returns the actual HTML, but it doesn't in CasperJS.
I have an image at the URL http://192.168.1.53/html/cam.jpg (from a Raspberry Pi) and this image is changing very fast (it is from a camera).
So I want some JavaScript on the page that reloads this image, for example every second.
My HTML is:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Pi Viewer</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, user-scalable=no, minimum-scale=1.0, maximum-scale=1.0">
</head>
<body>
<style>
img,body {
padding: 0px;
margin: 0px;
}
</style>
<img id="img" src="http://192.168.1.53/html/cam.jpg">
<img id="img1" src="http://192.168.1.53/html/cam.jpg">
<script src="script.js"></script>
</body>
</html>
And my script:
function update() {
window.alert("aa");
document.getElementById("img").src = "http://192.168.1.53/html/cam.jpg";
document.getElementById("img1").src = "http://192.168.1.53/html/cam.jpg";
setTimeout(update, 1000);
}
setTimeout(update, 1000);
The alert works, but the image does not change. (I have two images; they are the same.)
The problem is that the image src never changes, so the browser does not reload the image.
You need to convince the browser that the image is new. A good trick is to append a timestamp to the URL so that it is always considered new.
function update() {
var source = 'http://192.168.1.53/html/cam.jpg',
timestamp = (new Date()).getTime(),
newUrl = source + '?_=' + timestamp;
document.getElementById("img").src = newUrl;
document.getElementById("img1").src = newUrl;
setTimeout(update, 1000);
}
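The timestamp trick above can be factored into a small helper. This is only a sketch: the names `cacheBustedUrl` and `startRefreshing` are hypothetical, and the `_` query parameter is assumed to be ignored by the camera's web server, as in the answer's code.

```javascript
// Build a URL that the browser treats as new on every call.
// `now` is injectable only to make the function easy to test.
function cacheBustedUrl(source, now) {
  var timestamp = (now === undefined) ? Date.now() : now;
  // Use '&' if the URL already carries a query string, '?' otherwise.
  var separator = source.indexOf('?') === -1 ? '?' : '&';
  return source + separator + '_=' + timestamp;
}

// Refresh every <img> whose id is listed, then schedule the next refresh.
function startRefreshing(ids, source, intervalMs) {
  function update() {
    var url = cacheBustedUrl(source);
    ids.forEach(function (id) {
      document.getElementById(id).src = url;
    });
    setTimeout(update, intervalMs);
  }
  setTimeout(update, intervalMs);
}

// Only start in a browser; the guard keeps the file loadable elsewhere.
if (typeof document !== 'undefined') {
  startRefreshing(['img', 'img1'], 'http://192.168.1.53/html/cam.jpg', 1000);
}
```

Both images get the same URL per tick, so the browser should fetch the frame only once per refresh.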
I am trying to use prerender.io with a Meteor 1.5 app, using the prerender-node npm package:
var prerenderio = Npm.require('prerender-node')
.set('prerenderToken', token)
.set('protocol', protocol)
.set('host', host);
// Feed it to middleware! (app.use)
WebApp.connectHandlers.use(prerenderio);
I see log entries in prerender.io with status code 200. Pages are cached, but they contain only the head and an empty body.
To test, I added the meteorhacks:inject-initial package and inserted lines into the body, and that works fine.
I wonder whether the problem is with my router, which is FlowRouter.
I also placed the following in the head:
<head>
<meta charset="utf-8">
<meta name="fragment" content="!">
ALL <meta ... >
<!-- a script -->
<scripts ... >
<!-- prerenderio -->
<script> console.log(Date()); window.prerenderReady = false; </script>
</head>
To test, I flushed the prerender.io cache.
When I test https://www.toto.com?_escaped_fragment_=
I don't see any time logged in the client console. I do see the new line on prerender.io.
I ran the same test in
https://www.bing.com/webmaster/diagnostics/seo/analyzer
Now let's look at the content of the new cache line:
Server: nginx/1.4.6 (Ubuntu)
Date: Wed, 21 Jun 2017 20:16:33 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache
Content-Encoding: gzip
<!DOCTYPE html>
<html>
<head>
<script type="text/javascript" src="/packages/meteorhacks_zones/assets/utils.js?1498075872540"></script>
<script type="text/javascript" src="/packages/meteorhacks_zones/assets/before.js?1498075872540"></script>
<script type="text/javascript" src="/packages/meteorhacks_zones/assets/zone.js?1498075872540"></script>
<script type="text/javascript" src="/packages/meteorhacks_zones/assets/tracer.js?1498075872540"></script>
<script type="text/javascript" src="/packages/meteorhacks_zones/assets/after.js?1498075872540"></script>
<script type="text/javascript" src="/packages/meteorhacks_zones/assets/reporters.js?1498075872540"></script>
<link rel="stylesheet" type="text/css" class="__meteor-css__" href="/f6675d95a01ce3ac63036505eb1ace94a68b1bb7.css?meteor_css_resource=true">
<meta charset="utf-8">
<title>Toto</title>
...
<meta name="robots" content="index, follow">
...
<!-- http referrer links -->
<meta name="referrer" content="always">
<meta name="fragment" content="!">
<!-- No resize on mobiles -->
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0"> <!--320-->
<!-- a script -->
<script ... ></script>
<!-- prerenderio -->
<script> console.log(Date()); window.prerenderReady = false; </script>
</head>
<body>
<script type="text/javascript">__meteor_runtime_config__ = JSON.parse(decodeURIComponent("%7B%22meteorRelease%22%3A%22METEOR%401.5-beta.8%22%2C%22meteorEnv%22%3A%7B%22NODE_ENV%22%3A%22production%22%2C%22TEST_METADATA%22%3A%22%7B%7D%22%7D%2C%22PUBLIC_SETTINGS%22%3A%7B%22analyticsSettings%22%3A%7B%22autorun%22%3Afalse%2C%22Mixpanel%22%3A%7B%22token%22%3A%225fc6f49885cbf2ec8d80a15c0a310399%22%2C%22people%22%3Atrue%7D%2C%22Google%20Analytics%22%3A%7B%22trackingId%22%3A%22UA-60492530-2%22%7D%7D%2C%22ga%22%3A%7B%22account%22%3A%22UA-60492530-2%22%7D%7D%2C%22ROOT_URL%22%3A%22https%3A%2F%2Fwww.Toto.com%22%2C%22ROOT_URL_PATH_PREFIX%22%3A%22%22%2C%22appId%22%3A%221yyz6nr2dyysy1rxvng3%22%2C%22autoupdateVersion%22%3A%2237dc9ce25f5e0f5f3a2253bdcb3c80166a4e83d2%22%2C%22autoupdateVersionRefreshable%22%3A%2273bc2363ca210840ec3d6947a649e3388beca5c5%22%2C%22autoupdateVersionCordova%22%3A%224b9074136556412500d12284625a4f5d81ac3f3d%22%7D"));</script>
<script type="text/javascript" src="/fef4959b1bc37d5a9c2a1c09ddcb28fd7c374f10.js?meteor_js_resource=true"></script>
</body>
</html>
This is the entire body content; I did not remove any lines.
Prerender user here. We had issues with Prerender not waiting until all scripts had run to completion, so the DOM it captured was not yet fully rendered, which sounds like what you're encountering. We solved this with the prerenderReady flag described in the Prerender documentation, copied here for convenience:
Put this in your HTML:
<script> window.prerenderReady = false; </script>
When we see window.prerenderReady set to false, we'll wait until it's set to true to save the HTML. Execute the following when your page will be ready (usually after ajax calls).
window.prerenderReady = true;
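As a rough sketch of how the flag might be wired to an asynchronous load: the helper name `markPrerenderReadyWhen` is hypothetical, and it assumes the app can expose a promise that settles once the page's data has been fetched and rendered.

```javascript
// Hypothetical helper: flips window.prerenderReady once `loading` settles.
// `win` is passed in explicitly so the function can be exercised without a browser.
function markPrerenderReadyWhen(loading, win) {
  win.prerenderReady = false; // tell Prerender to wait
  function done() {
    win.prerenderReady = true; // tell Prerender the HTML can be saved
  }
  // Flip the flag on success and on failure, so a failed request
  // does not leave Prerender waiting until its own timeout.
  return loading.then(done, done);
}

// Example wiring (hypothetical): pass a promise that resolves once the
// route's data has been fetched and the template has rendered.
// markPrerenderReadyWhen(fetchAndRender(), window);
```

The inline `<script>` in the head should still set the flag to false early, since this helper may run only after the application bundle has loaded.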
I have a website up and running (and I need to support IE8).
Server: Nginx; framework: Symfony2/PHP/MySQL
The issue is simple: IE8 (8.0.6) shows an HTTP 406 Not Acceptable error on all HTML pages.
Headers (Nginx)
Cache-Control:no-cache
Connection:keep-alive
Content-Encoding:gzip
Content-Type:text/html; charset=UTF-8
Date:Mon, 25 Apr 2016 15:23:46 GMT
Server:nginx/1.6.2
Transfer-Encoding:chunked
X-Debug-Token:d7e68f
HTML (2 versions, not working)
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE" />
<meta name="robots" content="noindex, nofollow">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Htm 2</title>
<link rel="icon" type="image/x-icon" href="/favicon.ico" />
</head>
<body>
... hi
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="robots" content="noindex, nofollow">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Htm</title>
<link rel="icon" type="image/x-icon" href="/favicon.ico" />
</head>
<body>
... hi
</body>
</html>
I have read tons of stuff on that matter but could not find any clue. The previous website version worked on IE8 and ran on Apache 2.
This bug was not related to Nginx but to Symfony 2 and FOSRestBundle, which relies on the client's request headers to negotiate the response content type via its format listener.
Solution:
I changed my configuration so that the FOSRestBundle format listener is disabled for HTML pages.
fos_rest:
    format_listener:
        rules:
            - { path: '^/rest', priorities: [ 'json' ], fallback_format: json, prefer_extension: false }
            #- { path: ^/, priorities: [ 'text/html', '*/*' ], fallback_format: html, prefer_extension: true }
            - { path: '^/', stop: true }
The action code:
public ActionResult Visit(VisitModel model)
{
if (Request.HttpMethod == "GET")
return PartialView("VisitPostRedirect", visit);
// some logic...
return PartialView(visit);
}
The 'VisitPostRedirect' view:
@model VisitModel
<!doctype html>
<html lang="en">
<body onload="javascript: document.getElementById('visitPostRedirectForm').submit()">
@using (Html.BeginRouteForm("Visit", new
{
// some data...
RedirectUrl = string.Empty
}, FormMethod.Post,
new { id="visitPostRedirectForm" }))
{
@Html.HiddenFor(m => m.ReturnUrl)
// some data...
}
</body>
</html>
The 'Visit' view:
@model VisitModel
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">
<!-- ... -->
</head>
<body>
<!-- ... -->
</html>
The 'visitPostRedirectForm' form submits correctly, with:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8,pl;q=0.6
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:69
Content-Type:application/x-www-form-urlencoded
Cookie:ASP.NET_SessionId=vh3kl4zbxkonborzazuafkiw
Host: /*removed*/
Origin:/*removed*/
Referer:/*removed*/
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
but the response is:
Cache-Control:private
Content-Encoding:gzip
Content-Length:2135
Content-Type:application/json; charset=utf-8
Date:Wed, 04 Jun 2014 09:23:32 GMT
Server:Microsoft-IIS/7.5
Vary:Accept-Encoding
X-AspNet-Version:4.0.30319
X-AspNetMvc-Version:5.0
X-Powered-By:ASP.NET
Does anybody have an idea why the Content-Type of the response is 'application/json'? This causes the browser to display the HTML as raw text. The returned HTML itself is correct.
My goal is to use the Intel XDK barcode scanner with the front camera on the iPhone or iPad.
Please help.
I currently have this simple code snippet:
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=0" />
<style type="text/css">
* { -webkit-user-select:none; -webkit-tap-highlight-color:rgba(0, 0, 0, 0); }
</style>
<script src='intelxdk.js'></script>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
<script type="text/javascript">
var onDeviceReady=function(){
//hide splash screen
intel.xdk.device.hideSplashScreen();
//intel.xdk.device.scanBarcode();
};
document.addEventListener("intel.xdk.device.ready",onDeviceReady,false);
document.addEventListener("intel.xdk.device.barcode.scan", barcodeScanned, false);
function barcodeScanned(evt) {
intel.xdk.notification.beep(1);
if (evt.type == "intel.xdk.device.barcode.scan") {
if (evt.success == true) {
var url = evt.codedata;
//intel.xdk.device.showRemoteSite(url, 264, 0,56, 48)
alert(evt.codedata);
} else {
//scan cancelled
}
}
}
$(document).ready(function(){
$("#scanBtn").click(function(){
intel.xdk.device.scanBarcode();
});
});
</script>
</head>
<body>
Scan Now
</body>
</html>
The XDK does not currently support the use of the front camera. Such options will be possible when we are able to support the use of standard Cordova plugins.
Sorry, but I cannot comment on when new features will become available; Intel does not allow public disclosure of roadmaps and schedules.