How do you scrape a dynamically generated webpage in NodeJs? - web-scraping

There are sites whose DOM and contents are generated dynamically when the page loads. (Angularjs-based sites are notorious for this)
What approach do you use?
I tried both phantomjs and jsdom, but I seem unable to get the page to execute its JavaScript before I scrape.
Here's a simple jsdom example (not angularjs-based but still dynamically generated)
var env = require('jsdom').env;
exports.scrape = function(link, callback) {
var config = {
url: link,
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'
},
done: jsdomDone
};
env(config);
}
function jsdomDone(err, window) {
var info = null;
if(err) {
console.error(err);
} else {
var $ = require('jquery')(window);
console.log($('.profilePic').attr('src'));
}
}
exports.scrape('https://www.facebook.com/elcompanies');
I tried phantomjs with moderate success.
var page = new WebPage()
var fs = require('fs');
page.onLoadFinished = function() {
console.log("page load finished");
window.setTimeout(function() {
page.render('export.png');
fs.write('1.html', page.content, 'w');
phantom.exit();
}, 10000);
};
page.open("https://www.facebook.com/elcompanies", function() {
page.evaluate(function() {
});
});
Here I wait for the onLoadFinished event and even put a 10-second timer. The interesting thing is that while my export.png image capture of the page shows a fully rendered page, my 1.html doesn't show the .profilePic class element in its rightful place. It seems to be sitting in some javascript code, surrounded by some kind of "require("TimeSlice").guard(function() {bigPipe.onPageletArrive({..." block
If you can provide me a working example that scrapes the image off this page, that'd be helpful.

I've done some scraping of Facebook using nightmarejs.
Here is some code I wrote to get the content of some posts from a Facebook page.
var Nightmare = require('nightmare');
module.exports = function checkFacebook(callback) {
var nightmare = Nightmare();
Promise.resolve(nightmare
.viewport(1000, 1000)
.goto('https://www.facebook.com/login/')
.wait(2000)
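.evaluate(function(email, pwd){
// This function runs inside the page, so the credentials have to be passed in as arguments
document.querySelector('input[id="email"]').value = email
document.querySelector('input[id="pass"]').value = pwd
return true
}, facebookEmail, facebookPwd) // both assumed to be defined elsewhere in this module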
.click('#loginbutton input')
.wait(1000)
.goto('https://www.facebook.com/groups/bierconomia')
.evaluate(function(){
var posts = document.getElementsByClassName('_1dwg')
var length = posts.length
var postsContent = []
for(var i = 0; i < length; i++){
var pTag = posts[i].getElementsByTagName('p')
postsContent.push({
content: pTag[0] ? pTag[0].innerText : '',
productLink: posts[i].querySelector('a[rel = "nofollow"]') ? posts[i].querySelector('a[rel = "nofollow"]').href : '',
photo: posts[i].getElementsByClassName('_46-i img')[0] ? posts[i].getElementsByClassName('_46-i img')[0].src : ''
})
}
return postsContent
}))
.then(function(results){
log(results)
return new Promise(function(resolve, reject) {
var leanLinks = results.map(function(result){
return {
post: {
content: result.content,
productLink: extractLinkFromFb(result.productLink),
photo: result.photo
}
}
})
resolve(leanLinks)
})
})
}
The thing that I find useful with nightmare is that you can use the wait function to either wait for X ms or for a specific class to render.
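The idea behind waiting for a class to render, rather than sleeping for a fixed time, can be sketched without any browser library: poll a condition until it holds or a timeout expires. This is only an illustration of the principle; `waitFor` and its parameters are hypothetical names, not Nightmare's API.

```javascript
// Hypothetical helper (not part of Nightmare): resolves with the first truthy
// value returned by check(), or rejects once timeoutMs has elapsed.
function waitFor(check, timeoutMs, intervalMs) {
  var started = Date.now();
  return new Promise(function (resolve, reject) {
    function poll() {
      var value;
      try {
        value = check();
      } catch (err) {
        reject(err);
        return;
      }
      if (value) {
        resolve(value);
      } else if (Date.now() - started >= timeoutMs) {
        reject(new Error('waitFor: condition not met within ' + timeoutMs + ' ms'));
      } else {
        // Not ready yet: try again after a short pause
        setTimeout(poll, intervalMs);
      }
    }
    poll();
  });
}
```

In a browser context the check would typically be something like `function () { return document.querySelector('.profilePic'); }`, so the scrape continues as soon as the element exists instead of after an arbitrary delay.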

This happens because pages generated from AJAX calls load their data asynchronously, so you can't rely on onLoad events: the data may not be available yet when they fire.
In my personal opinion, the most reliable way is to trace which REST services the page calls and call them directly. Sometimes you will need values found in the HTML or values returned by other calls.
I know this may sound complicated, and in fact it is. You need to debug the page and learn what is being called. But once you know the calls, this works reliably.
By the way, Chrome Developer Tools help with this task. Just watch which calls are made in the Network tab. You can even see what was sent and received in each AJAX call.
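Once the Network tab has revealed an endpoint, calling it directly from Node might look like the sketch below. The endpoint URL and the parameter names are placeholders; the real ones have to come from observing the page, and some services will also require cookies or headers that are not shown here.

```javascript
var https = require('https');

// Builds the request URL from a base endpoint and query parameters,
// e.g. values that were found in the HTML or returned by an earlier call.
function buildEndpoint(base, params) {
  var url = new URL(base);
  Object.keys(params).forEach(function (key) {
    url.searchParams.set(key, params[key]);
  });
  return url.toString();
}

// Fetches a URL and resolves with the parsed JSON body.
function fetchJson(url) {
  return new Promise(function (resolve, reject) {
    https.get(url, { headers: { Accept: 'application/json' } }, function (res) {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status ' + res.statusCode));
        return;
      }
      var body = '';
      res.setEncoding('utf8');
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        try {
          resolve(JSON.parse(body));
        } catch (err) {
          reject(err);
        }
      });
    }).on('error', reject);
  });
}
```

A call would then be `fetchJson(buildEndpoint('https://example.com/api/items', { page: 2 }))`, where `example.com/api/items` stands in for whatever the page really requests.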

If it is a one time thing, that is, if I just want to scrape a single page once, I just use the browser and artoo-js.

I have never tried to write a page to disk using phantom, but I have two observations:
1) You are using fs.write to write to disk. In Node, fs.writeFile is an async call, so there you would need fs.writeFileSync or a callback before exiting. (PhantomJS ships its own fs module, whose fs.write is synchronous, so check which one you are actually calling.)
2) I hope you aren't expecting to write the HTML to a file, open it in a browser and see it rendered like the png you saved, because it doesn't work that way. Some objects are stored directly in DOM properties, and some values live only in JavaScript variables; none of that is persisted in the HTML.
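For Node specifically (not PhantomJS's own fs module), the difference between the synchronous and asynchronous write can be sketched as follows. The helper names and the use of the temp directory are choices made for this illustration.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Synchronous: the file is fully written when this function returns,
// so it is safe to exit the process right afterwards.
function saveSync(fileName, content) {
  var target = path.join(os.tmpdir(), fileName);
  fs.writeFileSync(target, content, 'utf8');
  return target;
}

// Asynchronous: anything that must happen after the write (such as exiting)
// belongs in the callback, not on the line after the call.
function saveAsync(fileName, content, done) {
  var target = path.join(os.tmpdir(), fileName);
  fs.writeFile(target, content, 'utf8', function (err) {
    done(err, target);
  });
}
```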

Related

Firebase reCAPTCHA has already been rendered in this element

Authenticating with Firebase using a phone number (JS) requires a reCAPTCHA verifier, which takes the ID of its container. For the container ID, I am generating a random one:
firebase_recaptcha_container: "recaptcha-container",
firebase_recaptcha_reset: function() {
if (typeof appVerifier != "undefined") {
appVerifier.reset()
appVerifier.clear()
}
let id = loadJS.firebase_recaptcha_container
let newID = loadJS.randomString(10)
$("#"+id).contents().remove()
$("#"+id).prop("id", newID)
loadJS.firebase_recaptcha_container = newID
return newID
}
Then I request the RecaptchaVerifier and, once I have it, store it in a global variable, window.appVerifier.
firebase_recaptcha: function(name_r="default") {
let promiseD = new firebase.auth.RecaptchaVerifier(name_r, {
'size': 'invisible',
'callback': function(response) {
resolve(response)
},
'expired-callback': function(r) {
console.log("expired", r)
},
'isolated' : false
});
return promiseD
},
_____________________
let container_recaptcha = $utils.firebase_recaptcha_reset()
window.appVerifier = await $utils.firebase_recaptcha(container_recaptcha)
It works totally fine the very first time. But it's an honest mistake for a user to enter the wrong phone number. So the next time, I do the same thing again and get this error while generating the RecaptchaVerifier:
reCAPTCHA has already been rendered in this element
Which sadly does not make sense, as the new element is totally different and the clear and reset methods were called as the documentation describes. I am not using any other reCAPTCHA on this page either. Refreshing the page might be a possible solution, but I really hate that. Any insight would be helpful.
Thanks!
Finally found the solution, looks like it was a stupid mistake!
Invoking firebase.auth.RecaptchaVerifier adds new reCAPTCHA scripts every time! So all that needs to be done is to call it once; it does the rest on its own.
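The "call it once" idea can be sketched with a small wrapper that builds the object on first use and returns the same instance afterwards. `once` is a hypothetical helper, and the factory below is only a stand-in for `new firebase.auth.RecaptchaVerifier(...)` so that the example runs without Firebase.

```javascript
// Hypothetical helper: wraps a factory so the underlying object is built only once.
function once(factory) {
  var created = false;
  var instance;
  return function () {
    if (!created) {
      instance = factory();
      created = true;
    }
    return instance;
  };
}

// Stand-in for constructing the real verifier; counts how often it is built.
var constructed = 0;
var getVerifier = once(function () {
  constructed += 1;
  return { container: 'recaptcha-container' };
});
```

Every retry (for example after a mistyped phone number) would then call `getVerifier()` instead of constructing a new verifier.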
This won't get fixed just by calling recaptchaVerifier.clear().
In the callback where you pass this appVerifier, you have to call clear() and then re-create the "recaptcha-container" element through a ref.
This is the element in the render method:
<div ref={recaptchaWrapperRef}>
<div id="recaptcha-container"></div>
</div>
GenerateCaptcha function:
const generateRecaptcha = () => {
appVerifier = new RecaptchaVerifier(
"recaptcha-container",
{
size: "invisible",
},
authentication
);
};
Inside submit Callback:
if (appVerifier && recaptchaWrapperRef.current) {
appVerifier.clear();
recaptchaWrapperRef.current.innerHTML = `<div id="recaptcha-container"></div>`;
}
// Initialize new reCaptcha verifier
generateRecaptcha();

How to scrape any type of website

I am working on scraping websites and have tried many technologies for it.
First I used PHP cURL as a scraping tool, and it got me some way, but then I hit a problem: PHP cURL couldn't scrape websites that use Ajax to load their content. That's what stopped me from scraping with PHP.
After some research I found tools that get past the limitation of Ajax-loaded websites and are powerful and pleasant to use: Phantom JS and Casper JS. I have scraped a lot of sites with them.
The problem I faced with these tools is that they are controlled through the command line: to run Phantom/Casper JS code, you have to run it from the command line. This is my basic problem. I want to write the code in Phantom/Casper JS and have a web page with an admin panel from which I can control these scripts. Currently I am scraping career/job listing websites, and I want to automate these tools so that they scrape the sites at regular intervals and stay up to date with the employers' sites that post new jobs.
At the moment I have separate code for each website; I manually execute each file from the command line, wait for it to finish scraping, then continue with the second one, and so on. What I want is a script in JavaScript (preferably Node JS, but not necessarily) that executes the scraper code after a specific interval and then scrapes all of the websites in the background.
I can do the automation, that's not a problem. The problem is that I am unable to connect Phantom/Casper JS with the website. I even tried Spooky JS, which connects Phantom/Casper JS with Node JS, but unfortunately it doesn't work for me and it's quite messy.
Is there any other tool that is as powerful as these two and that I can easily interact with through a web page?
Continuing my own research on scraping sites, I was unable to find a perfect solution. The most powerful one I came up with is to use the Phantom JS module with Node JS (the package required as 'phantom' in the code below).
For installation, follow the module's documentation. Phantom JS is then used asynchronously in Node JS, which makes it a lot easier to get the results, and it is easy to interact with it using Express JS on the server side and Ajax or Socket.io on the client side.
Below is the code I came up with:
const phantom = require('phantom');
const ev = require('events');
const event = new ev.EventEmitter();
var MAIN_URL,
TOTAL_PAGES,
TOTAL_JOBS,
PAGE_DATA_COUNTER = 0,
PAGE_COUNTER = 0,
PAGE_JOBS_DETAILS = [],
IND_JOB_DETAILS = [],
JOB_NUMBER = 1,
CURRENT_PAGE = 1,
PAGE_WEIGHT_TIME,
CLICK_NEXT_TIME,
CURRENT_WEBSITE,
CURR_WEBSITE_LINK,
CURR_WEBSITE_NAME,
CURR_WEBSITE_INDEX,
PH_INSTANCE,
PH_PAGE;
function InitScrap() {
// Initiate the Data
this.init = async function(url) {
MAIN_URL = url;
PH_INSTANCE = await phantom.create();
PH_PAGE = await PH_INSTANCE.createPage();
console.log("Scrapper Initiated, Please wait...")
return "success";
}
// Load the Basic Page First
this.loadPage = async function(pageLoadWait) {
var status = await PH_PAGE.open(MAIN_URL),
w;
if (status == "success") {
console.log("Page Loaded . . .");
if (pageLoadWait !== undefined && pageLoadWait !== null && pageLoadWait !== false) {
let p = new Promise(function(res, rej) {
setTimeout(async function() {
console.log("Page After 5 Seconds");
PH_PAGE.render("new.png");
TOTAL_PAGES = await PH_PAGE.evaluate(function() {
return document.getElementsByClassName("flatten pagination useIconFonts")[0].textContent.match(/\d+/g)[1];
});
TOTAL_JOBS = await PH_PAGE.evaluate(function() {
return document.getElementsByClassName("jobCount")[0].textContent.match(/\d+/g)[0];
});
res({
p: TOTAL_PAGES,
j: TOTAL_JOBS,
s: true
});
}, pageLoadWait);
})
return await p;
}
}
}
// Note: evaluatePage() and evaluateJobsDetails(), which ScrapData calls below, are not shown in this excerpt.
}
function ScrapData(opts) {
var scrap = new InitScrap();
scrap.init("https://www.google.com/").then(function(init_res) {
if (init_res == "success") {
scrap.loadPage(opts.pageLoadWait).then(function(load_res) {
console.log(load_res);
// loadPage resolves with undefined when the page fails to open or no wait time is given
if (load_res && load_res.s === true) {
scrap.evaluatePage().then(function(ev_page_res) {
console.log("Page Title : " + ev_page_res);
scrap.evaluateJobsDetails().then(function(ev_jobs_res) {
console.log(ev_jobs_res);
})
})
}
return
})
}
});
return scrap;
}
module.exports = {
ScrapData
};
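The scheduling part of the question (run each site's scraper one after another, then repeat after an interval) can be sketched with Node's child_process module. This is a minimal sketch, not the author's setup: `run`, `runInSequence` and the job list shape are hypothetical names introduced here.

```javascript
var execFile = require('child_process').execFile;

// Runs one command and resolves with its trimmed stdout.
function run(command, args) {
  return new Promise(function (resolve, reject) {
    execFile(command, args, function (err, stdout) {
      if (err) {
        reject(err);
        return;
      }
      resolve(stdout.trim());
    });
  });
}

// Runs the jobs strictly one after another and collects their outputs,
// so a second scraper only starts once the first has finished.
async function runInSequence(jobs) {
  var outputs = [];
  for (var i = 0; i < jobs.length; i++) {
    outputs.push(await run(jobs[i].command, jobs[i].args));
  }
  return outputs;
}
```

A job could be `{ command: 'phantomjs', args: ['site-a.js'] }` (file name assumed), and wrapping `runInSequence(jobs)` in `setInterval` gives the repeated run. Because nothing blocks the event loop, the same process can also serve an admin page that starts the jobs on demand.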

Trigger a button click from a URL

We need to scrape VEEC Website for the total number once a week.
As an example, for the week of 17/10/2016 - 23/10/2016 the URL returns the number Total 167,356 when the search button is clicked. We want this number to be stored in our database.
I'm using ColdFusion to generate the weekly dates as params and have been passing them as in the URL above. But I'm unable to find a query param that triggers the "Search" button click event.
I've tried like this & this but nothing seems to be working.
Any pointers?
It seems that a CSRF token is added to every form submission to prevent malicious activity. To make matters worse for you, the CSRF token changes with each form submission, not just for each user, which makes it virtually impossible to circumvent.
When I make a CFHTTP POST request to this form, I get HTML FileContent back, but there is no DB data in the results table cell placeholders. It seems that the form owner allows form submission from an HTTP request, but returns no DB data if the CSRF token cannot be validated.
It may be worth asking the website owner whether there is any kind of REST API that you can hook into.
If you want to use the headless browser PhantomJS (https://en.wikipedia.org/wiki/PhantomJS) for this, here is a script that saves the total to a text file.
After you install PhantomJS, run phantomjs.exe main.js at the command prompt.
main.js
"use strict";
var firstLoad = true;
var url = 'https://www.veet.vic.gov.au/Public/PublicRegister/Search.aspx?CreatedFrom=17%2F10%2F2016&CreatedTo=23%2F10%2F2016';
var page = require("webpage").create();
page.viewportSize = {
width: 1280,
height: 800
};
page.onCallback = function (result) {
var fs = require('fs');
fs.write('veet.txt', result, 'w');
};
page.onLoadStarted = function () {
console.log("page.onLoadStarted, firstLoad", firstLoad);
};
page.onLoadFinished = function () {
console.log("page.onLoadFinished, firstLoad", firstLoad);
if (firstLoad) {
firstLoad = false;
page.evaluate(function () {
var event = document.createEvent("MouseEvents");
event.initEvent("click", true, true);
document.querySelectorAll(".dx-vam")[3].dispatchEvent(event);
});
} else {
page.evaluate(function () {
var element = document.querySelectorAll('.dxgv')[130];
window.callPhantom(element.textContent);
});
setTimeout(function () {
page.render('veet.png');
phantom.exit();
}, 3000);
}
};
page.open(url);
The script is not perfect, and you can improve it if you're interested, but as it stands it saves the total to the file veet.txt and a screenshot to veet.png.
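The script above hard-codes one week in its URL. A small sketch of generating that URL for any week, written in plain ES5 so it should also run inside PhantomJS; `formatDate` and `buildSearchUrl` are names introduced here, and the 7-day window is inferred from the example dates.

```javascript
// Formats a Date as DD/MM/YYYY using local time.
function formatDate(d) {
  var dd = ('0' + d.getDate()).slice(-2);
  var mm = ('0' + (d.getMonth() + 1)).slice(-2);
  return dd + '/' + mm + '/' + d.getFullYear();
}

// Builds the search URL for the 7-day window that starts at weekStart.
function buildSearchUrl(weekStart) {
  // Adding 6 days through the Date constructor handles month and year rollover
  var weekEnd = new Date(weekStart.getFullYear(), weekStart.getMonth(), weekStart.getDate() + 6);
  return 'https://www.veet.vic.gov.au/Public/PublicRegister/Search.aspx' +
    '?CreatedFrom=' + encodeURIComponent(formatDate(weekStart)) +
    '&CreatedTo=' + encodeURIComponent(formatDate(weekEnd));
}
```

Calling `buildSearchUrl(new Date(2016, 9, 17))` reproduces the URL used in main.js (months are zero-based in JavaScript, so 9 is October).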

How to get a published collection's total count, regardless of a specified limit, on the client?

I'm using the meteor-paginated-subscription package in my app. On the server, my publication looks like this:
Meteor.publish("posts", function(limit) {
return Posts.find({}, {
limit: limit
});
});
And on the client:
this.subscriptionHandle = Meteor.subscribeWithPagination("posts", 10);
Template.post_list.events = {
'click #load_more': function(event, template) {
template.subscriptionHandle.loadNextPage();
}
};
This works well, but I'd like to hide the #load_more button if all the data is loaded on the client, using a helper like this:
Template.post_list.allPostsLoaded = function () {
allPostsLoaded = Posts.find().count() <= this.subscriptionHandle.loaded();
Session.set('allPostsLoaded', allPostsLoaded);
return allPostsLoaded;
};
The problem is that Posts.find().count() is returning the number of documents loaded on the client, not the number available on the server.
I've looked through the Telescope project, which also uses the meteor-paginated-subscription package, and I see code that does what I want to do:
allPostsLoaded: function(){
allPostsLoaded = this.fetch().length < this.loaded();
Session.set('allPostsLoaded', allPostsLoaded);
return allPostsLoaded;
}
But I'm not sure if it's actually working. Porting their code into mine does not work.
Finally, it does look like Mongo supports what I want to do. The docs say that, by default, cursor.count() ignores the effects of limit.
Seems like all the pieces are there, but I'm having trouble putting them together.
None of the answers do what you really want because none of them provides a reactive solution.
This package does exactly what you want and is also reactive:
publish-counts
I think you can look at the counts-by-room demo in the Meteor docs.
It shows how to publish the count of your posts on the server and read it on the client.
You can simply write this:
// server: publish the current size of your post collection
Meteor.publish("counts-by-room", function () {
var self = this;
var count = 0;
var initializing = true;
var handle = Posts.find().observeChanges({
added: function (id) {
count++;
if (!initializing)
self.changed("counts", 'postCounts', {count: count});
},
removed: function (id) {
count--;
self.changed("counts", 'postCounts', {count: count});
}
});
initializing = false;
self.added("counts", 'postCounts', {count: count});
self.ready();
self.onStop(function () {
handle.stop();
});
});
// client: declare collection to hold count object
Counts = new Mongo.Collection("counts");
// client: subscribe to the count for posts
Tracker.autorun(function () {
Meteor.subscribe("counts-by-room");
});
// client: simply use findOne, you can get the count object
Counts.findOne()
The idea of sub.loaded() is to help you with exactly this problem.
Posts.count() isn't going to return the right thing because, as you've guessed, on the client, Meteor has no way of knowing the real number of posts that live on the server. But what the client knows is how many posts it's tried to load. That's what that .loaded() tells you, and is why the line this.fetch().length < this.loaded() will tell you if there are more posts on the server or not.
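The comparison can be written as a plain function to make the reasoning explicit. The parameter names are chosen for this sketch; in the app they would come from `this.fetch().length` and `this.loaded()`.

```javascript
// loadedCount: number of documents actually present on the client
// requestedLimit: how many documents the subscription has asked for so far
// If the server returned fewer documents than were requested, there is nothing left to load.
function allPostsLoaded(loadedCount, requestedLimit) {
  return loadedCount < requestedLimit;
}
```

One consequence of this check: when the total is an exact multiple of the page size, the client only finds out that everything is loaded after requesting one more (empty) page.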
What I would do is write a Meteor server side method that retrieves the count like so:
Meteor.methods({
getPostsCount: function () {
return Posts.find().count();
}
});
Then call it on the client, in observe to make it reactive:
function updatePostCount() {
Meteor.call('getPostsCount', function (err, count) {
Session.set('postCount', count);
});
}
Posts.find().observe({
added: updatePostCount,
removed: updatePostCount
});
Although this question is old, I thought I would provide the answer that ended up working for me. I did not create the solution; I found the basis for it here (credit where credit is due): Discover Meteor
In my case I was trying to get the "size" of the database from the client side so I could determine when to hide the "load more" button. I was using template-level subscriptions. For this solution to work, you need to add the reactive-var package. Here is my code (in short):
/*on the server we define the method which returns
the number of posts in total in the database*/
if(Meteor.isServer){
Meteor.methods({
postsTotal: function() {
return PostsCollection.find().count();
}
});
}
/*In the client side we first create the reactive variable*/
if(Meteor.isClient){
Template.Posts.onCreated(function() {
var self = this;
self.totalPosts = new ReactiveVar();
});
/*then in my case, when the user clicks the load more -button,
we call the postsTotal-method and set the returned value as
the value of the totalPosts-reactive variable*/
Template.Posts.events({
'click .load-more': function (event, instance){
Meteor.call('postsTotal', function(error, result){
instance.totalPosts.set(result);
});
}
});
}
Hope this helps someone (I recommend checking the link first). For template level subscriptions, I used this as my guide Discover Meteor - template level subscriptions. This was my first stacked-post and I am just learning Meteor, so please have mercy...:D
Ouch, this post is old, but maybe it will help someone.
I had exactly the same issue. I managed to solve it with 2 simple lines...
Remember this line:
handle = Meteor.subscribeWithPagination('posts', 10);
On the client I compared handle.loaded() with Posts.find().count(): when they differ, it means that all the posts are loaded. So here is my code:
"click #nextPosts":function(event){
event.preventDefault();
handle.loadNextPage();
if(handle.loaded()!=Posts.find().count()){
$("#nextPosts").fadeOut();
}
}
I had the same problem, and using the publish-counts package didn't work with the subs-manager package. I created a package that can set a reactive server-to-client session, and keep the document count in this session. You can find an example here:
https://github.com/auweb/server-session/#getting-document-count-on-the-client-before-limit-is-applied
I'm doing something like this:
On the client:
Template.postCount.posts = function() {
return Posts.find();
};
Then you create a template:
<template name="postCount">
{{posts.count}}
</template>
Then, wherever you want to show the counter: {{> postCount}}
Much easier than any solution I have seen.

Sporadic behaviour with Session variables in Meteor

I've been scratching my head as to why this code will work some of the time, but not all (or at least most of the time). I've found that it actually does run displaying the correct content in the browser some of the time, but strangely there will be days when I'll come back to the same code, run the server (as per normal) and upon loading the page will receive an error in the console: TypeError: 'undefined' is not an object (evaluating 'Session.get('x').html')
(When I receive that error there will be times where the next line in the console will read Error - referring to the err object, and other times when it will read Object - referring the data object!?).
I'm obviously missing something about Session variables in Meteor and must be misusing them? I'm hoping someone with experience can point me in the right direction.
Thanks, in advance for any help!
Here's my dummy code:
/client/del.html
<head>
<title>del</title>
</head>
<body>
{{> hello}}
</body>
<template name="hello">
Hello World!
<div class="helloButton">{{{greeting}}}</div>
</template>
My client-side javascript file is:
/client/del.js
Meteor.call('foo', 300, function(err, data) {
err ? console.log(err) : console.log(data);
Session.set('x', data);
});
Template.hello.events = {
'click div.helloButton' : function(evt) {
if ( Session.get('x').answer.toString() === evt.target.innerHTML ) {
console.log('yay!');
}
}
};
Template.hello.greeting = function() {
return Session.get('x').html;
};
And my server-side javascript is:
/server/svr.js
Meteor.methods({
doubled: function(num) {
return num * 2;
},
foo: function(lmt) {
var count = lmt,
result = {};
for ( var i = 0; i < lmt; i++ ) {
count++;
}
count = Meteor.call('doubled', count);
result.html = "<em>" + count + "</em>";
result.answer = count;
return result;
}
});
I think it's just that the session variable won't be set yet when the client first starts up. So Session.get('x') will return undefined until your method call (foo) returns, which almost certainly won't happen before the template first draws.
However after that it will be in the session, so things will probably behave right once you refresh.
The answer is to just check if it's undefined before trying to access the variable. For example:
Template.hello.greeting = function() {
if (Session.get('x')) return Session.get('x').html;
};
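The effect of the guard can be tried outside Meteor with a stand-in for Session. `FakeSession` is purely illustrative and not reactive like the real Session; the value set below is what the question's foo(300) method would return.

```javascript
// Minimal, non-reactive stand-in for Meteor's Session, for illustration only.
var store = {};
var FakeSession = {
  get: function (key) { return store[key]; },
  set: function (key, value) { store[key] = value; }
};

// Same guard as in the helper: fall back to an empty string until the value arrives.
function greeting() {
  var x = FakeSession.get('x');
  return x ? x.html : '';
}

var beforeCall = greeting();                                  // method has not returned yet
FakeSession.set('x', { html: '<em>1200</em>', answer: 1200 });
var afterCall = greeting();                                   // value is now available
```

In Meteor itself the helper is re-run automatically when Session.set is called, which is why the guarded version renders the empty string first and the real content a moment later.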
One of the seven principles of Meteor is:
Latency Compensation. On the client, use prefetching and model simulation to make it look like you have a zero-latency connection to the database.
Because there is latency, your client will first attempt to draw the layout according to the data it has at the moment it connects. Then it makes the call and updates once the call returns. Sometimes the call responds fast enough for its result to be drawn at the same time.
Since there is a chance that the variable is not set yet, accessing it throws an exception in that case and breaks execution (the remaining functions in the call stack will not run).
There are two possible solutions to this:
Check that the variable is set when using it.
return Session.get('x') ? Session.get('x').html : '';
Make sure the variable has an initial value by setting it at the top of the script.
Session.set('x', { html: '', answer: '' });
Another approach would be to add the templates once the call responds.
Meteor.call('foo', 300, function(err, data) {
Session.set('x', data);
$('#page').html(Meteor.ui.render(function() {
return Template.someName();
}));
});

Resources