@page query - hiding header in @page :first seems to misplace all trailing footers (using Puppeteer) [duplicate]

I'm new to Node.js and Puppeteer. Previously I was using wkhtmltopdf, but its options are very limited.
My idea is to generate a PDF from HTML where the first page is a cover (an image with full A4 width/height). Since the footer is generated from index.js, there seems to be no way to hide it on the FIRST page of the PDF.
//Imports
const puppeteer = require('puppeteer');
//Open browser
async function startBrowser() {
  const browser = await puppeteer.launch({headless: true, args: ['--no-sandbox']});
  const page = await browser.newPage();
  return {browser, page};
}
//Close browser
async function closeBrowser(browser) {
  return browser.close();
}
//Html to pdf
async function html2pdf(url) {
  const {browser, page} = await startBrowser();
  await page.goto(url, {waitUntil: 'networkidle2'});
  await page.emulateMedia('screen');
  //Options
  await page.pdf({
    printBackground: true,
    path: 'result.pdf',
    displayHeaderFooter: true,
    footerTemplate: '<div style="width:100%;text-align:right;position:relative;top:10px;right:10px;"><img width="60px" src="data:image/..."></div>',
    margin: {top: '0px', right: '0px', bottom: '40px', left: '0px'},
    scale: 1,
    landscape: false,
    format: 'A4',
    pageRanges: ''
  });
  await closeBrowser(browser);
}
//Exec
(async () => {
  await html2pdf('file:///loc/node_pdfs/givenhtml.html');
  process.exit(0);
})();
My question is: is there any way to locate the first page's footer and hide it from the index function?
Thanks!

There are currently multiple bugs (see this question/answer or this one) that make it impossible to get this working.
This is currently only possible for headers using this trick (taken from this GitHub comment):
await page.addStyleTag({
  content: `
    body { margin-top: 1cm; }
    @page :first { margin-top: 0; }
  `,
});
This will basically hide the margin on the first page, but will not work when using a bottom margin (as also noted here).
Possible Solution
The solution I recommend is to create two PDFs, one with only the first page and no margins, and another one with the remaining pages and a margin:
await page.pdf({
  displayHeaderFooter: false,
  pageRanges: '1',
  path: 'page1.pdf',
});
await page.pdf({
  displayHeaderFooter: true,
  footerTemplate: '<div style="font-size:5mm;">Your footer text</div>',
  margin: {
    bottom: '10mm'
  },
  pageRanges: '2-', // start this PDF at page 2
  path: 'remaining-pages.pdf',
});
Depending on how often you need to perform the task, you could either merge the PDFs manually or automate the merge with a tool like easy-pdf-merge (I have not used this one myself).
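If you automate it, the merge step might look roughly like this (a sketch based on easy-pdf-merge's documented merge(sources, dest, callback) API; untested, as noted above):
const merge = require('easy-pdf-merge');
// merge the cover page and the remaining pages into a single file
merge(['page1.pdf', 'remaining-pages.pdf'], 'result.pdf', (err) => {
  if (err) throw err;
  console.log('PDFs merged into result.pdf');
});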

Small hint: easy-pdf-merge and pdf-merge have some system-tool dependencies.
I prefer pdf-lib, a plain JavaScript library that works with Buffers and has TypeScript support.
My TypeScript:
import {PDFDocument} from 'pdf-lib'
import {PDFOptions} from 'puppeteer'
...
const options: PDFOptions = {
  format: 'A4',
  displayHeaderFooter: true,
  footerTemplate: footerTemplate,
  margin: {
    top: '20mm',
    bottom: '20mm',
  },
}
const page1: Buffer = await page.pdf({
  ...options,
  headerTemplate: '<div><!-- no header hack --></div>',
  pageRanges: '1',
})
const page2: Buffer = await page.pdf({
  ...options,
  headerTemplate: headerTemplate,
  pageRanges: '2-',
})
// merge: copy the cover page, then every remaining page, into a new document
const pdfDoc = await PDFDocument.create()
const coverDoc = await PDFDocument.load(page1)
const [coverPage] = await pdfDoc.copyPages(coverDoc, [0])
pdfDoc.addPage(coverPage)
const mainDoc = await PDFDocument.load(page2)
for (let i = 0; i < mainDoc.getPageCount(); i++) {
  const [aMainPage] = await pdfDoc.copyPages(mainDoc, [i])
  pdfDoc.addPage(aMainPage)
}
const pdfBytes = await pdfDoc.save()
// Buffer for https response in my case
return Buffer.from(pdfBytes)
...
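As a side note, pdf-lib can copy all remaining pages in one call instead of the loop above (same pdf-lib API, sketch):
const mainPages = await pdfDoc.copyPages(mainDoc, mainDoc.getPageIndices())
mainPages.forEach((p) => pdfDoc.addPage(p))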

Related

Apify: How can I stop the enqueueLinks function from opening a new page in the browser

I am trying to get links from a page and then navigate to the next page by clicking a button. The issue is that I first need to add all the links on the first page to a queue, but the enqueueLinks function seems to start a new page, causing a "Node not found" error when I try to click on an element.
Any advice would be helpful!
const Apify = require('apify');
const { log } = Apify.utils;

exports.handleList = async ({ request, page }, requestQueue) => {
  await Apify.utils.enqueueLinks({
    page: page,
    requestQueue: requestQueue,
    selector: "#traziPoduzeca > tbody > tr > td > span > a",
    baseUrl: "https://www.fininfo.hr/Poduzece/[.*]",
    transformRequestFunction: (request) => {
      request.userData = {
        label: "DETAIL",
      };
      return request;
    },
  });
  log.info('about to scrape urls');
  await page.waitForTimeout(120);
  let btn_selector =
    "//div[@class='contentNav'][1]//div[@class='pagination'][1]//span[@class='current']/following-sibling::a";
  let buttonEle = await page.$x(btn_selector);
  log.info(`length of pagination is ${buttonEle.length}`);
  for (let index = 0; index < buttonEle.length; index++) {
    await page.waitForTimeout(120);
    await buttonEle[index].click();
    log.info("clicked");
    await page.waitForTimeout(200);
    await Apify.utils.enqueueLinks({
      page: page,
      requestQueue: requestQueue,
      selector: "#traziPoduzeca > tbody > tr > td > span > a",
      baseUrl: "https://www.fininfo.hr/Poduzece/[.*]",
      transformRequestFunction: (request) => {
        request.userData = {
          label: "DETAIL",
        };
        return request;
      },
    });
  }
};
It's not the enqueueLinks function that is causing the error; it's the navigation that happens after the click. You need to await the navigation if you want to stay on the same page, something like this:
const btn_selector =
  "//div[@class='contentNav'][1]//div[@class='pagination'][1]//span[@class='current']/following-sibling::a";
// log.info(`length of pagination is ${buttonEle.length}`);
for (;;) {
  // the button handle will change every time navigation happens,
  // so you need to re-evaluate it every time
  const [buttonEle] = await page.$x(btn_selector);
  if (!buttonEle) {
    log.info("no pagination element found");
    break;
  }
  await page.waitForTimeout(120);
  await Promise.all([
    page.waitForNavigation(), // this will not destroy the context of current page
    buttonEle.click(),
  ]);
  log.info("clicked");
  await page.waitForTimeout(200);
  // enqueue links...
}

How to store data from multiple pages to JSON?

Thank you for your attention. I wrote a mini project that scrapes news sites and stores the main text from each. I tried many solutions for writing JSON output from my project instead of just using console.log, but after scraping it always shows only one main text. Here is my code so you can help me get a JSON file with all three news articles.
const { Cluster } = require('puppeteer-cluster');
const fs = require('fs');
const launchOptions = {
  headless: false,
  args: [
    '--disable-gpu',
    '--disable-dev-shm-usage',
    '--disable-web-security',
    '--disable-xss-auditor',
    '--disable-accelerated-2d-canvas',
    '--ignore-certificate-errors',
    '--ignore-certificate-errors-spki-list',
    '--no-zygote',
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-webgl',
  ],
  ignoreHTTPSErrors: true,
};
(async () => {
  // Create a cluster with 2 workers
  const cluster = await Cluster.launch({
    monitor: true,
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 2,
    puppeteerOptions: launchOptions,
  });
  // Define a task (in this case: scrape the article text)
  await cluster.task(async ({ page, data: url }) => {
    await page.setRequestInterception(true);
    page.on('request', (request) => {
      if (['stylesheet', 'font', 'image', 'styles', 'other', 'media'].indexOf(request.resourceType()) !== -1) {
        request.abort();
      } else {
        request.continue();
      }
    });
    // note: waitUntil is a goto option, not a launch option
    await page.goto(url, { waitUntil: 'networkidle2' });
    const scrapedData = await page.$eval('div[class="entry-content clearfix"]', el => el.innerText);
    // this overwrites test.json on every task run, so only the last article survives
    fs.writeFileSync('test.json', JSON.stringify(scrapedData, null, 2));
  });
  // Add some pages to queue
  cluster.queue('https://www.ettelaat.com/?p=526642');
  cluster.queue('https://www.ettelaat.com/?p=526640');
  cluster.queue('https://www.ettelaat.com/?p=526641');
  // Shutdown after everything is done
  await cluster.idle();
  await cluster.close();
})();
To gather all the outputs, I had to move my fs write below cluster.close():
kanopyDB = []
.
.
.
kanopyDB = kanopyDB.concat(name);
.
.
.
await cluster.idle();
await cluster.close();
fs.writeFileSync('output.json', JSON.stringify(kanopyDB, null, 2), 'utf8');
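Put together, a minimal sketch of that pattern (assuming the same cluster setup as above; collect into an array and write once at the end):
const results = []; // collect each task's output instead of writing a file per task
await cluster.task(async ({ page, data: url }) => {
  await page.goto(url, { waitUntil: 'networkidle2' });
  const text = await page.$eval('div[class="entry-content clearfix"]', el => el.innerText);
  results.push({ url, text });
});
cluster.queue('https://www.ettelaat.com/?p=526642');
cluster.queue('https://www.ettelaat.com/?p=526640');
cluster.queue('https://www.ettelaat.com/?p=526641');
await cluster.idle();
await cluster.close();
// single write after all tasks have finished, so nothing is overwritten
fs.writeFileSync('output.json', JSON.stringify(results, null, 2), 'utf8');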

puppeteer break-word when generating pdf from html

I'm trying to create a PDF from HTML using headless Chrome and Puppeteer. Everything works fine CSS-wise and the PDF looks good. The problem now is that I want some divs not to break across pages. I thought this could be done the old-fashioned way using page-break-inside: avoid, but I had no success. I can't find anything on their GitHub API page either. Has anyone managed to do this?
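For reference, the kind of rule I tried (a minimal sketch; the .keep-together class name is made up for illustration):
/* sketch: keep each .keep-together div on a single PDF page */
@media print {
  .keep-together {
    page-break-inside: avoid; /* legacy property */
    break-inside: avoid;      /* modern equivalent */
  }
}
And the generation code: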
// assumed imports for this snippet: puppeteer and the uuid package
const puppeteer = require('puppeteer');
const { v4: uuidv4 } = require('uuid');

async function createListenPdf(html) {
  try {
    var jobId = uuidv4();
    this.browser = await puppeteer.launch();
    const page = await this.browser.newPage();
    var viewport = {
      width: 1165,
      height: 1200
    };
    await page.setViewport(viewport);
    console.log("viewport");
    page.on("console", msg => {
      for (let i = 0; i < msg.args.length; ++i) {
        console.log(`${jobId} - From page. Arg ${i}: ${msg.args[i]}`);
      }
    });
    await page.goto(`data:text/html,${html}`, {
      waitUntil: 'networkidle0'
    });
    await page.emulateMedia('print');
    var buffer = await page.pdf({
      format: "A4",
      scale: 0.7,
      printBackground: true,
    });
    console.log(`${jobId} - Done. Will return the stream`);
    return buffer;
  }
  finally {
    if (this.browser) {
      console.log(`${jobId} - Closing browser`);
      this.browser.close();
    }
  }
}

Vuefire get Firebase Image Url

I am storing relative paths to images in my Firebase database for each item I wish to display. I am having trouble getting the images to appear on the screen, as I need to fetch the image URLs asynchronously. The Firebase schema is currently as follows:
{
  items: {
    <id#1>: {
      image_loc: ...,
    },
    <id#2>: {
      image_loc: ...,
    },
  }
}
I would like to display each of these images on my page with code such as:
<div v-for="item in items">
<img v-bind:src="item.image_loc">
</div>
This does not work, as the relative location points to a place in Firebase Storage. The relevant code to get the true URL from this relative path is:
firebase.storage().ref('items').child(<the_image_loc>).getDownloadURL()
which returns a promise that resolves with the true URL. Here is my current Vue.js code:
var vue = new Vue({
  el: '.barba-container',
  data: {
    items: []
  },
  firebase: function() {
    return {
      items: firebase.database().ref().child('items'),
    };
  }
});
I have tried using computed properties, including the use of vue-async-computed, but these solutions do not seem to work as I cannot pass in parameters.
Basically, how do I display a list of elements where each element needs the result of a promise?
I was able to solve this by using the asyncComputed library for vue.js and by making a promise to download all images at once, instead of trying to do so individually.
/**
 * Returns a promise that resolves when an item has all async properties set
 */
function VotingItem(item) {
  var promise = new Promise(function(resolve, reject) {
    item.short_description = item.description.slice(0, 140).concat('...');
    if (item.image_loc === undefined) {
      resolve(item);
      return; // don't try to fetch a download URL without a location
    }
    firebase.storage().ref("items").child(item.image_loc).getDownloadURL()
      .then(function(url) {
        item.image_url = url;
        resolve(item);
      })
      .catch(function(error) {
        item.image_url = "https://placeholdit.imgix.net/~text?txtsize=33&txt=350%C3%97150&w=350&h=150";
        resolve(item);
      });
  });
  return promise;
}
var vue = new Vue({
  el: '.barba-container',
  data: {
    items: [],
    is_loading: false
  },
  firebase: function() {
    return {
      items: firebase.database().ref().child('items'),
    };
  },
  asyncComputed: {
    processedItems: {
      get: function() {
        var promises = this.items.map(VotingItem);
        return Promise.all(promises);
      },
      default: []
    }
  }
});
Lastly, I needed to use v-for="item in processedItems" in my template to render the items with image URLs attached.
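That template looks something like this (a sketch, reusing the image_url and short_description properties set in VotingItem above):
<div v-for="item in processedItems">
  <img v-bind:src="item.image_url">
  <p>{{ item.short_description }}</p>
</div>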
I was able to solve it without any extra dependencies by not adding elements to the array until the URL is resolved:
in my template:
<div v-for="foo in foos" :key="foo.bar">
<img :src="foo.src" :alt="foo.anotherbar">
...
</div>
in my component (for example inside mounted())
const db = firebase.firestore()
const storage = firebase.storage().ref()
const _this = this
db.collection('foos').get().then((querySnapshot) => {
  const foos = []
  querySnapshot.forEach((doc) => {
    foos.push(doc.data())
  })
  return Promise.all(foos.map(foo => {
    return storage.child(foo.imagePath).getDownloadURL().then(url => {
      foo.src = url
      _this.foos.push(foo)
    })
  }))
}).then(() => {
  console.log('all loaded')
})

How do I take a screenshot of a DOM element using Intern.js?

I'm using intern.js to test a web app. I can execute tests and create screenshots when they fail. I want to create a screenshot of a specific element in order to do some CSS regression testing with tools like resemble.js. Is this possible? How can I do it? Thank you!
driver.get("http://www.google.com");
WebElement ele = driver.findElement(By.id("hplogo"));
//Get entire page screenshot
File screenshot = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
BufferedImage fullImg = ImageIO.read(screenshot);
//Get the location of element on the page
Point point = ele.getLocation();
//Get width and height of the element
int eleWidth = ele.getSize().getWidth();
int eleHeight = ele.getSize().getHeight();
//Crop the entire page screenshot to get only element screenshot
BufferedImage eleScreenshot= fullImg.getSubimage(point.getX(), point.getY(), eleWidth,
eleHeight);
ImageIO.write(eleScreenshot, "png", screenshot);
//Copy the element screenshot to disk
File screenshotLocation = new File("C:\\images\\GoogleLogo_screenshot.png");
FileUtils.copyFile(screen, screenshotLocation);
Taken from here.
There isn't a built-in way to do this with Intern. The takeScreenshot method simply calls Selenium's screenshot service, which returns a screenshot of the entire page as a base64-encoded PNG. Intern's takeScreenshot converts this to a Node buffer before handing the result to the user.
To crop the image you will need to use an external library or tool such as png-crop (note that I've never used this). The code might look like the following (untested):
// assuming you've loaded png-crop as PNGCrop
var PNGCrop = require('png-crop');

var image;
var size;
var position;
return this.remote
  // ...
  .takeScreenshot()
  .then(function (imageBuffer) {
    image = imageBuffer;
  })
  .findById('element')
  .getSize()
  .then(function (elementSize) {
    size = elementSize;
  })
  .getPosition()
  .then(function (elementPosition) {
    position = elementPosition;
  })
  .then(function () {
    var config = {
      width: size.width,
      height: size.height,
      top: position.y,
      left: position.x
    };
    // need to return a Promise since PNGCrop.crop is an async method
    return new Promise(function (resolve, reject) {
      PNGCrop.crop(image, 'cropped.png', config, function (err) {
        if (err) {
          reject(err);
        }
        else {
          resolve();
        }
      });
    });
  })
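From there, the cropped file could feed the comparison step mentioned in the question. A rough sketch with resemblejs (the baseline.png path is an assumption; I haven't verified this inside an Intern run):
var fs = require('fs');
var resemble = require('resemblejs');

// compare the cropped element screenshot against a stored baseline image
resemble(fs.readFileSync('cropped.png'))
  .compareTo(fs.readFileSync('baseline.png'))
  .onComplete(function (data) {
    console.log('Mismatch: ' + data.misMatchPercentage + '%');
  });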
