if URL contains function in puppeteer

So I am trying to have my code do something if the URL contains this: https://kith.com/throttle/queue?
After the '?' there can be anything, so I only want it to identify 'https://kith.com/throttle/queue?'
I am using puppeteer and want it to work like this:
If the URL contains 'https://kith.com/throttle/queue?', wait until it passes the queue (page.waitForNavigation({ waitUntil: 'networkidle2' }) would work for waiting until it's through the queue).
Else (if the URL doesn't contain that): do nothing and go to the next line of code.

It is hard to tell how you get this URL, but if it is the URL of the current page, you can try this:
const url = await page.evaluate(() => location.href);
if (url.startsWith('https://kith.com/throttle/queue?')) {
  // Wait for navigation.
} else {
  // Do nothing.
}
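Putting the two pieces together, a minimal sketch (the waitForNavigation options are the ones the question itself suggests; nothing here has been tested against kith.com):

const url = await page.evaluate(() => location.href);
if (url.startsWith('https://kith.com/throttle/queue?')) {
  // We are in the queue: wait until the site navigates away from it.
  await page.waitForNavigation({ waitUntil: 'networkidle2' });
}
// Otherwise (or once the queue has been passed), just continue with the next line of code.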

How can I make the command work only on a specific page

Thank you in advance for your help. How can I make the following command work only on the page named "customer" and not on other pages?
function onEdit(e) {
  if (e.range.columnStart > 9)
    return;
  e.source.getActiveSheet().getRange(e.range.rowStart, 12).setValue(new Date());
}
If you want that code to run only inside a given page, you should include that code only in that page.
But if you have a general JS file that you include everywhere and you want to add logic so it can tell which page is currently loaded when it runs, you can read window.location.href, which returns the entire URL of the page currently loaded in the viewport:
var isCurrentPageCustomer = false;
// This regex extracts the final part of the URL after the last slash (/),
// so if the URL was http://mydomain/path/to/page the match will be /page.
var match = /\/([^/]+)$/.exec(window.location.href);
if (match != null) {
  // result holds "page" without the leading slash (/)
  var result = match[1];
  // if that part (e.g. 'page') matches 'customer', switch the flag to true
  if (result === 'customer') isCurrentPageCustomer = true;
}
if (isCurrentPageCustomer)
  console.log('current page is customer.');
else
  console.log('current page is NOT customer.');
But it turned out later that you meant to run the code only when the currently active sheet (in Google Sheets) is named 'customer'. In that case I just changed your onEdit handler and used the name of the active sheet to decide whether to return from the function immediately when it doesn't match 'customer':
function onEdit(e) {
  // if the name of the active sheet is not 'customer'
  if (e.source.getActiveSheet().getName() != 'customer')
    // exit the function without executing the rest of its code
    return;
  // the rest of your code here
}

Apify cheerio scraper stops even with urls in the queue

Here is the scenario: I'm using the Cheerio Scraper to scrape a website containing real estate listings.
Each listing has a link to the next one, so before scraping the current page I add the next page to the request queue.
What always happens, at some random point, is that the scraper stops for no apparent reason, even though the next page to scrape is still in the queue (I have attached an image).
Why does this happen when there is still a pending request in the queue?
Many thanks
Here is the message I get:
2021-02-28T10:52:35.439Z INFO CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
2021-02-28T10:52:35.672Z INFO CheerioCrawler: Final request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":963,"requestsFinishedPerMinute":50,"requestsFailedPerMinute":0,"requestTotalDurationMillis":22143,"requestsTotal":23,"crawlerRuntimeMillis":27584,"requestsFinished":23,"requestsFailed":0,"retryHistogram":[23]}
2021-02-28T10:52:35.679Z INFO Cheerio Scraper finished.
Here is the request queue (screenshot omitted here).
Here is the code:
async function pageFunction(context) {
  const { $, request, log } = context;
  // The "$" property contains the Cheerio object, which is useful
  // for querying DOM elements and extracting data from them.
  const pageTitle = $('title').first().text();
  // The "request" property contains various information about the web page loaded.
  const url = request.url;
  // Use the "log" object to print information to the actor log.
  log.info('Scraping Page', { url, pageTitle });
  // Adding the next page to the queue
  var baseUrl = '...';
  if ($('div.d3-detailpager__element--next a').length > 0) {
    var nextPageUrl = $('div.d3-detailpager__element--next a').attr('href');
    log.info('Found another page', { nextUrl: baseUrl.concat(nextPageUrl) });
    context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
  }
  // My code for scraping follows here
  return { /* my scraped object */ };
}
Missing await:
await context.enqueueRequest
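In other words, the enqueue call inside the pageFunction above should be awaited. Without the await, pageFunction can finish before the new request has actually been added to the queue, at which point the crawler may conclude there is nothing left to process and shut down, which matches the log message above. A minimal sketch of the corrected block, reusing the selectors from the question:

if ($('div.d3-detailpager__element--next a').length > 0) {
  var nextPageUrl = $('div.d3-detailpager__element--next a').attr('href');
  log.info('Found another page', { nextUrl: baseUrl.concat(nextPageUrl) });
  // Awaiting ensures the request is committed to the queue before pageFunction returns.
  await context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
}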

Trigger a button click from a URL

We need to scrape the VEEC website for the total number once a week.
As an example, for the week of 17/10/2016 - 23/10/2016 the URL returns the number Total 167,356 when the search button is clicked. We want this number to be stored in our database.
I'm using ColdFusion to generate the weekly dates and have been passing them as params in the URL. But I'm unable to find a query param that triggers the "Search" button click event.
I've tried a couple of approaches, but nothing seems to be working.
Any pointers?
It seems like for every form submission a CSRF token is added, which prevents malicious activity. To make matters worse for you, the CSRF token changes for each form submission, not just for each user, which makes it virtually impossible to circumvent.
When I make a CFHTTP POST request to this form, I get HTML FileContent back, but there is no DB data within the results table cell placeholders. It seems to me that the form owner allows form submission from an HTTP request, but if the CSRF token cannot be validated, no DB data is returned.
It may be worth asking the website owner if there is any kind of REST API that you can hook into...
If you want to use the headless browser PhantomJS (https://en.wikipedia.org/wiki/PhantomJS) for this, here is a script that will save the total to a text file.
At the command prompt, after you install PhantomJS, run phantomjs.exe main.js.
main.js
"use strict";
var firstLoad = true;
var url = 'https://www.veet.vic.gov.au/Public/PublicRegister/Search.aspx?CreatedFrom=17%2F10%2F2016&CreatedTo=23%2F10%2F2016';
var page = require("webpage").create();
page.viewportSize = {
width: 1280,
height: 800
};
page.onCallback = function (result) {
var fs = require('fs');
fs.write('veet.txt', result, 'w');
};
page.onLoadStarted = function () {
console.log("page.onLoadStarted, firstLoad", firstLoad);
};
page.onLoadFinished = function () {
console.log("page.onLoadFinished, firstLoad", firstLoad);
if (firstLoad) {
firstLoad = false;
page.evaluate(function () {
var event = document.createEvent("MouseEvents");
event.initEvent("click", true, true);
document.querySelectorAll(".dx-vam")[3].dispatchEvent(event);
});
} else {
page.evaluate(function () {
var element = document.querySelectorAll('.dxgv')[130];
window.callPhantom(element.textContent);
});
setTimeout(function () {
page.render('veet.png');
phantom.exit();
}, 3000);
}
};
page.open(url);
The script is not perfect, and you can work on it if you're interested, but as it is it will save the total to a file, veet.txt, and also save a screenshot, veet.png.

Is there a way to update the URL in Flow Router without a refresh/redirect?

Is there a way to update a part of the URL reactively without using FlowRouter.go() while using React and react-layout?
I want to change the value in the document that is used to get the document from the DB. For example, if I have a route like ~/users/:username and update the username field in the document, I then have to use FlowRouter.go('profile', {data}) to direct the user to that new URL. The "old" route is gone.
Below is the working version I have, but there are two issues:
I have to use FlowRouter.go(), which is actually a full page refresh (and going back would be a 404).
I still get errors in the console because for a brief moment the reactive data for the component is actually wrong.
Relevant parts of the component are like this:
...
mixins: [ReactMeteorData],

getMeteorData() {
  let data = {};
  let user = Meteor.subscribe('user', { username: this.props.username });
  if (user.ready())
    data.user = user;
  return data;
},
...
updateName(username) {
  Users.update({ _id: this.data.user._id }, { $set: { username } }, null, (e, r) => {
    if (!e)
      FlowRouter.go('profile', { username });
  });
},
...
The route is like this:
FlowRouter.route('/users/:username', {
  name: 'profile',
  action(params) {
    ReactLayout.render(Main, { content: <UserProfile {...params} /> });
  }
});
The errors I get in the console are:
Exception from Tracker recompute function:
and
TypeError: Cannot read property '_id' of undefined

How do you scrape a dynamically generated webpage in NodeJs?

There are sites whose DOM and contents are generated dynamically when the page loads. (Angularjs-based sites are notorious for this)
What approach do you use?
I tried both phantomjs and jsdom, but it seems I am unable to get the page to execute its JavaScript before I scrape.
Here's a simple jsdom example (not angularjs-based but still dynamically generated)
var env = require('jsdom').env;

exports.scrape = function(link, callback) {
  var config = {
    url: link,
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'
    },
    done: jsdomDone
  };
  env(config);
}

function jsdomDone(err, window) {
  var info = null;
  if (err) {
    console.error(err);
  } else {
    var $ = require('jquery')(window);
    console.log($('.profilePic').attr('src'));
  }
}

exports.scrape('https://www.facebook.com/elcompanies');
I tried phantomjs with moderate success.
var page = new WebPage();
var fs = require('fs');

page.onLoadFinished = function() {
  console.log("page load finished");
  window.setTimeout(function() {
    page.render('export.png');
    fs.write('1.html', page.content, 'w');
    phantom.exit();
  }, 10000);
};

page.open("https://www.facebook.com/elcompanies", function() {
  page.evaluate(function() {
  });
});
Here I wait for the onLoadFinished event and even add a 10-second timer. The interesting thing is that while my export.png image capture shows a fully rendered page, my 1.html doesn't show the .profilePic class element in its rightful place. It seems to be sitting inside some JavaScript code, surrounded by some kind of "require("TimeSlice").guard(function() {bigPipe.onPageletArrive({..." block.
If you can provide me with a working example that scrapes the image off this page, that'd be helpful.
I've done some scraping on Facebook using nightmarejs.
Here is some code I wrote to get content from posts on a Facebook page.
var Nightmare = require('nightmare');

module.exports = function checkFacebook(callback) {
  var nightmare = Nightmare();
  Promise.resolve(nightmare
    .viewport(1000, 1000)
    .goto('https://www.facebook.com/login/')
    .wait(2000)
    .evaluate(function() {
      // facebookEmail and facebookPwd are defined elsewhere in the original code
      document.querySelector('input[id="email"]').value = facebookEmail;
      document.querySelector('input[id="pass"]').value = facebookPwd;
      return true;
    })
    .click('#loginbutton input')
    .wait(1000)
    .goto('https://www.facebook.com/groups/bierconomia')
    .evaluate(function() {
      // Collect the text, product link and photo of each post on the group page.
      var posts = document.getElementsByClassName('_1dwg');
      var length = posts.length;
      var postsContent = [];
      for (var i = 0; i < length; i++) {
        var pTag = posts[i].getElementsByTagName('p');
        postsContent.push({
          content: pTag[0] ? pTag[0].innerText : '',
          productLink: posts[i].querySelector('a[rel = "nofollow"]') ? posts[i].querySelector('a[rel = "nofollow"]').href : '',
          photo: posts[i].getElementsByClassName('_46-i img')[0] ? posts[i].getElementsByClassName('_46-i img')[0].src : ''
        });
      }
      return postsContent;
    }))
    .then(function(results) {
      // log and extractLinkFromFb are helpers from the original code
      log(results);
      return new Promise(function(resolve, reject) {
        var leanLinks = results.map(function(result) {
          return {
            post: {
              content: result.content,
              productLink: extractLinkFromFb(result.productLink),
              photo: result.photo
            }
          };
        });
        resolve(leanLinks);
      });
    });
};
The thing that I find useful with nightmare is that you can use the wait function to either wait for X ms or for a specific class to render.
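For example, both forms of wait() look like this (a minimal sketch; the selector is simply the post class used in the snippet above):

nightmare
  .wait(2000)       // wait a fixed 2000 ms
  .wait('._1dwg');  // or wait until an element matching this selector appears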
This happens because pages generated through AJAX load their data asynchronously, so you can't rely on onLoad events (the data is still not available when they fire).
In my personal opinion, the most reliable way is to trace which REST services are being called by this HTML and make direct calls to them. Sometimes you will need to use values found in the HTML or values returned by other calls.
I know this may sound complicated, and in fact it is. You kind of need to debug the page and learn what is being called. But this will work for sure.
By the way, the Chrome developer tools will help with this task. Just observe which calls are made in the Network tab. You can even see what has been sent and received in each AJAX call.
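As an illustration only (the endpoint, query parameters and field names below are hypothetical, not taken from any real site), the direct-call approach boils down to replaying the request you saw in the Network tab and parsing the JSON yourself:

// A minimal sketch of calling a JSON endpoint directly instead of scraping the HTML.
// The URL, query string and response fields are placeholders for whatever you find
// in the Network tab of the page you are scraping.
var https = require('https');

https.get('https://example.com/api/listings?page=1', function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    var data = JSON.parse(body); // the endpoint is assumed to return JSON
    console.log(data.items);     // "items" is a hypothetical field name
  });
}).on('error', function (err) {
  console.error(err);
});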
If it is a one time thing, that is, if I just want to scrape a single page once, I just use the browser and artoo-js.
I never tried to write a page to disk using phantom, but I have two observations:
1) You are using fs.write to write things to disk, but writeFile is an async call. This means you either need to change it to fs.writeFileSync or use a callback before closing phantom.
2) I hope you aren't expecting to write the HTML to a file, open it in a browser and get it rendered like the PNG you saved, because it doesn't work that way. Some state can be stored directly in DOM properties, and there are certainly values stored in JavaScript variables; those things will never be persisted.
