Scraping infinite scroll href with Cypress - web-scraping

I'm using Cypress to scrape a site with an infinite scroll.
The site is made with React, and after the user enters a search term in an input, as they scroll more products appear on the page matching the search term entered.
The code I've got so far visits the site, runs a search, and collects all the hrefs that are currently visible.
What I'm wondering is how I can tell Cypress to scroll down further, harvesting the hrefs as it goes, and then finally write them to a JSON file.
This is the code I have so far, minus the scrolling:
const arrayOfHrefs = [];
describe('Get links', () => {
it.only('should do a product search', () => {
cy.visit('https://www.testsite.com');
cy.wait(5000);
cy.get('#product_input').type('socks');
cy.contains('socks').click(); // renders new content on the client side
cy.wait(10000);
cy.get('a').each(($a) => {
const link = $a.attr('href');
arrayOfHrefs.push(link); // grabs all visible links and pushes them to array
}).then(() => {
console.log(arrayOfHrefs)
cy.writeFile('data.json', { urls: arrayOfHrefs }) // writes array to disk
})
});
});

You did not detail what you have tried so far and what issues you're currently having regarding scrolling, but I assume scrolling down the window and then adding some logic to wait until more links become visible is sufficient.
This command scrolls down the whole window to the bottom over 5000ms:
cy.scrollTo('bottom', {duration: 5000})
Note that it's not chained off from an element like:
cy.get('#some-scrollable-element').scrollTo(...)
I googled a page that has some similar dynamic infinite scroll behaviour, maybe you could base your code on the following snippet:
describe('', () => {
before('', () => {
cy.server()
cy.route('GET', '**/blog/page/**').as('blog')
})
it('', () => {
let numberOfChildren = 4
cy.visit('http://www.drewleague.com/blog/')
for (let i = 0; i < 5; i++) {
cy.get('.posts--desktop')
.children()
.then(children => {
cy.wrap(children)
.its('length')
.should('eq', numberOfChildren)
})
cy.scrollTo('bottom', {duration: 5000})
.wait('@blog') // aliases are referenced with "@", not "#"
.then(() => numberOfChildren += 4)
}
})
})
This code scrolls to the bottom of the page 5 times. In each iteration it checks the number of dynamically added children and waits for the XHR request to finish. It's not very useful on its own, but you get the idea.
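Applied to the original question, the same scroll-then-check loop could be sketched as below. This is only a rough sketch: the helper name `harvestLinks`, its options, and the rule "stop once a pass finds no new links" are assumptions made for illustration, and a fixed pause can end the loop early if the site loads slowly. `cy` is passed in as a parameter so the helper can live outside the spec file.

```javascript
// Hypothetical helper: collects unique hrefs, scrolling to the bottom
// until a pass adds no new links or `maxScrolls` is used up.
// Selector, pause, limit, and file name are illustrative assumptions.
function harvestLinks(cy, { selector = 'a', pauseMs = 1000, maxScrolls = 50, outFile = 'data.json' } = {}) {
  const seen = new Set(); // a Set drops links that were already collected

  function pass(scrollsLeft) {
    cy.get(selector).then(($links) => {
      const before = seen.size;
      $links.toArray().forEach((el) => {
        const href = el.getAttribute('href');
        if (href) seen.add(href);
      });

      const foundNew = seen.size > before;
      if (!foundNew || scrollsLeft === 0) {
        // Nothing new appeared (or the limit was hit): write the result and stop.
        cy.writeFile(outFile, { urls: [...seen] });
        return;
      }

      cy.scrollTo('bottom');
      cy.wait(pauseMs);      // crude pause; waiting on an aliased request is more robust
      pass(scrollsLeft - 1); // queues the next cy.get() after the scroll and wait
    });
  }

  pass(maxScrolls);
}
```

Inside the test, after the search has been triggered, it would be called as `harvestLinks(cy, { selector: 'a' })`.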

Related

Adding a new item when using useSWRInfinite pushes other items out of the list

I am building a comment system where new replies are added to the start (top) of the list. The pagination is cursor-based.
At the moment, I use mutate to add the newly created comment as its own page at the front of the list:
const {
data: commentsPages,
size: commentsPagesSize,
setSize: setCommentsPagesSize,
//TODO: Not true on successive page load. But isValidating refreshes on refetches
isLoading: commentsLoading,
error: commentsLoadingError,
mutate: mutateCommentPages,
} = useSWRInfinite(
getPageKey,
([blogPostId, lastCommentId]) => BlogApi.getCommentsForBlogPost(blogPostId, lastCommentId));
<CreateCommentBox
blogPostId={blogPostId}
title="Write a comment"
onCommentCreated={(newComment) => {
const updatedPages = commentsPages?.map(page => {
const updatedPage: GetCommentsResponse = { comments: [newComment, ...page.comments], paginationEnd: page.paginationEnd };
return updatedPage;
})
mutateCommentPages(updatedPages, { revalidate: false });
}}
/>
The problem is, SWR immediately starts revalidating the list and pushes the comment at the bottom out of the data set. This behavior is kind of awkward.
Is my only choice to disable automatic revalidation completely? How would you handle this?
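A side note on the snippet itself: as posted, onCommentCreated prepends newComment to every loaded page, not just the front of the list. This does not settle the revalidation question, but if the comment should appear only once at the top, a helper along these lines would touch just the first page. The name `prependToFirstPage` is hypothetical, and the `paginationEnd: false` used for the empty case is an assumption.

```javascript
// Hypothetical helper: returns a new pages array with `newComment`
// prepended to the first page only; later pages are reused unchanged.
function prependToFirstPage(pages, newComment) {
  if (!pages || pages.length === 0) {
    // Nothing loaded yet: start a single page.
    // `paginationEnd: false` is an assumption, not taken from the API.
    return [{ comments: [newComment], paginationEnd: false }];
  }
  const [first, ...rest] = pages;
  return [{ ...first, comments: [newComment, ...first.comments] }, ...rest];
}
```

It would be used as `mutateCommentPages(prependToFirstPage(commentsPages, newComment), { revalidate: false })`.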

cypress - click iframe element while the body of the iframe change after few seconds

I have a problem when using cy.getIframeBody().find('#some-button') that the #some-button element is not yet available, because the iframe is still loading, but the body element is not empty so the .find() is triggered.
This is the custom command to get the iframe body
Cypress.Commands.add('getIframeBody', ()=> {
return cy.get('iframe.cy-external-link-container')
.its('0.contentDocument.body')
.should('not.empty')
.then(cy.wrap)
});
How can I do it without using cy.wait()?
You can add random timeouts and .should() assertions, but if they work at all the test is likely to be flaky.
The key is to repeat the query of the body element (this line)
.its('0.contentDocument.body')
until the button shows up.
So not
.its('0.contentDocument.body')
.should('not.empty')
but something like
.its('0.contentDocument.body')
.should('have.child', '#some-button') // force a retry on body query
.should() with callback will do this
Cypress.Commands.add('getIframeBodyWithSelector', (waitForSelector) => {
return cy.get('iframe.cy-external-link-container')
.its('0.contentDocument.body')
.should(body => {
expect(Cypress.$(body).has(waitForSelector).length).gt(0)
})
.then(cy.wrap)
})
it('finds some button', () => {
cy.getIframeBodyWithSelector('#some-button')
.find('#some-button') // passes
})
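To make the retry idea concrete, here is a rough plain-JavaScript model of "re-run the query until the assertion passes". It is not Cypress's actual implementation; `retryUntil` and its option names are made up for illustration.

```javascript
// Illustrative model only: polls `query` until `predicate(result)` is
// truthy or the timeout expires. The important part is that the query
// itself is re-run on every attempt, so a body element that was
// replaced while the iframe loaded is picked up again.
async function retryUntil(query, predicate, { timeout = 4000, interval = 50 } = {}) {
  const deadline = Date.now() + timeout;
  for (;;) {
    const result = query(); // fresh query on each attempt
    if (predicate(result)) return result;
    if (Date.now() >= deadline) {
      throw new Error('Timed out waiting for condition');
    }
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
}
```

Asserting only 'not.empty' corresponds to a predicate that is already true on the first attempt, so nothing is ever re-queried; asserting on the button keeps the loop going until the final body is in place.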
You can add a timeout and also add should('be.visible'). The should assertion retries until it passes or the timeout is reached, which gives the iframe time to load.
Cypress.Commands.add('getIframeBody', () => {
return cy
.get('iframe.cy-external-link-container', {timeout: 7000})
.should('be.visible')
.its('0.contentDocument.body')
.should('not.empty')
.then(cy.wrap)
})

Why is my infiniteScroll function in Apify not working?

I am trying to get out product data from a website that loads the product list as the user scrolls down. I am using Apify for this. My first thought was to see if somebody had already solved this and I found 2 useful links: How to make the Apify Crawler to scroll full page when web page have infinite scrolling? and How to scrape dynamic-loading listing and individual pages using Apify?. However, when I tried to apply the functions they mention, my Apify crawler failed to load the content.
I am using a web-scraper based on the code in the basic web-scraper repository.
The website I am trying to get data out of is in this link. For the moment I am just learning so I just want to be able to get the data out of this one page, I do not need to navigate to other pages.
The PageFunction I am using is the following:
async function pageFunction(context) {
// Establishing utility constants to use throughout the code
const { request, log, skipLinks } = context;
const $ = context.jQuery;
const pageTitle = $('title').first().text();
context.log.info('Wait for website to render')
await context.waitFor(2000)
//Creating function to scroll the page til the bottom
const infiniteScroll = async (maxTime) => {
const startedAt = Date.now();
let itemCount = $('.upcName').length;
for (;;) {
log.info(`INFINITE SCROLL --- ${itemCount} initial items loaded ---`);
// timeout to prevent infinite loop
if (Date.now() - startedAt > maxTime) {
return;
}
scrollBy(0, 99999);
await context.waitFor(1000);
const currentItemCount = $('.upcName').length;
log.info(`INFINITE SCROLL --- ${currentItemCount} items loaded after scroll ---`);
if (itemCount === currentItemCount) {
return;
}
itemCount = currentItemCount;
}
};
context.log.info('Initiating scrolling function');
await infiniteScroll(60000);
context.log.info(`Scraping URL: ${context.request.url}`);
var results = []
$(".itemGrid").each(function() {
results.push({
name: $(this).find('.upcName').text(),
product_url: $(this).find('.nombreProductoDisplay').attr('href'),
image_url: $(this).find('.lazyload').attr('data-original'),
description: $(this).find('.block-with-text').text(),
price: $(this).find('.upcPrice').text()
});
});
return results
}
I replaced the while(true){...} loop with a for(;;){...} loop because I was getting an "Unexpected constant condition. (no-constant-condition)" ESLint error.
Also, I have tried varying the magnitude of the scroll and the await periods.
In spite of all this, I cannot seem to get the crawler to get me more than 32 results.
Could someone please explain what I am doing wrong?
################ UPDATE ##################
I continued to work on this and could not make it work from the Apify platform so my original question still stands. However, I did manage to make the scroll function work by running the script from my pc.
In this particular case, you can check for the loading spinner's visibility after scrolling, instead of trying to count the number of items.
By changing your code a bit, you can make it work like this:
async function pageFunction(context) {
// Establishing utility constants to use throughout the code
const { request, log, skipLinks } = context;
const $ = context.jQuery;
const pageTitle = $('title').first().text();
context.log.info('Wait for website to render')
// wait for initial listing
await context.waitFor('.itemGrid');
context.log.info(`Scraping URL: ${context.request.url}`);
let tries = 5; // keep track of the load spinner being invisible on the page
const results = new Map(); // this ensures you only get unique items
while (true) { // eslint-disable-line
log.info(`INFINITE SCROLL --- ${results.size} initial items loaded ---`);
// when the style is set to "display: none", it's hidden aka not loading any new items
const hasLoadingSpinner = $('.itemLoader[style*="none"]').length === 0;
if (!hasLoadingSpinner && tries-- < 0) {
break;
}
// scroll to page end, you can adjust the offset if it's not triggering the infinite scroll mechanism, like `document.body.scrollHeight * 0.8`
scrollTo({ top: document.body.scrollHeight });
$(".itemGrid").each(function() {
const $this = $(this);
results.set($this.find('#upcProducto').attr('value'), {
name: $this.find('.upcName').text(),
product_url: $this.find('.nombreProductoDisplay').attr('href'),
image_url: $this.find('.lazyload').data('original'),
description: $this.find('.block-with-text').text(),
price: $this.find('.upcPrice').text()
});
});
// because of the `tries` variable, this will effectively wait at least 5 seconds to consider it not loading anymore
await context.waitFor(1000);
// scroll to top, sometimes scrolling past the end of the page does not trigger the "load more" mechanism of the page
scrollTo({ top: 0 });
}
return [...results.values()]
}
This method also works for virtual pagination, like React Virtual or Twitter results, which remove DOM nodes when they are not in the viewport.
Using timeouts is very brittle: depending on how fast or slow your scraper runs, your results will vary. You need a clear indication that the page is not delivering new items.
You can also keep track of document.body.scrollHeight, as it changes when new items are added.
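The scrollHeight idea can be sketched as a small "stall detector" that reports when a value has stopped changing for several checks in a row. The name `createStallDetector` and the threshold are assumptions for illustration, not part of Apify's API.

```javascript
// Hypothetical helper: returns a function that reports true once the
// observed value has stayed the same for `maxStalls` consecutive checks.
function createStallDetector(maxStalls = 5) {
  let last = null;
  let stalls = 0;
  return function isStalled(value) {
    if (value === last) {
      stalls += 1; // no change since the previous check
    } else {
      last = value; // the page grew (or shrank): start counting again
      stalls = 0;
    }
    return stalls >= maxStalls;
  };
}

// Inside the page function's loop (browser context), roughly:
//   const isStalled = createStallDetector(5);
//   while (!isStalled(document.body.scrollHeight)) {
//     ...scroll, collect items, await context.waitFor(1000)...
//   }
```

Requiring several unchanged checks in a row plays the same role as the `tries` counter above: one slow response does not end the loop.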

Cypress click() failed because this element is detached from the DOM in iteration

I'm trying to test my single-page application with Cypress.
The first page has multiple buttons as anchor tags which direct you to the second page (Angular routing).
On the second page I have a "back" button.
So I want my test to click on a button, wait for the second page to appear, click on "back", and repeat this for all remaining buttons.
This is my cypress test:
describe('Select products', function () {
before(() => {
cy.visit('http://localhost:4200/')
})
it('Clicking through products', function () {
// getting each anchor to click
cy.get('a[data-cy=submit]').each(
($el) => {
// click to get on next site
cy.wrap($el).click()
// click to go back
cy.contains('go back').click()
}
)
})
})
It works fine for the first run (get all buttons => click the first => go back), but after getting back to the start page, before clicking the next button, Cypress throws an error saying that click() failed because the element is detached from the DOM.
Can someone help me with this?
Thanks for any help!
cy.get('a[data-cy=submit]') must get a list of the buttons and store them to iterate via .each(), but the code within .each() navigates away from the first page - I guess that Angular destroys the original elements that .each() is trying to iterate over.
This is similar to iterating over a list and altering the list within the iteration, the loop fouls up because the list changes.
If you know how many buttons there are, this would be a better way
const buttonCount = 4;
for (let i = 0; i < buttonCount; i++) {
cy.get('a[data-cy=submit]').eq(i).click();
cy.contains('go back').click();
}
If the buttons are dynamic (you don't know the count), use
cy.get('a[data-cy=submit]').then($buttons => {
const buttonCount = $buttons.length;
for (let i = 0; i < buttonCount; i++) {
cy.get('a[data-cy=submit]').eq(i).click();
cy.contains('go back').click();
}
})
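The difference between the two approaches can be shown with a toy model in plain JavaScript, with no Cypress involved. Everything here (`createPage`, `clickAllStale`, `clickAllFresh`) is made up for illustration: the "page" simply re-creates its button objects on every render, the way a framework may re-create DOM nodes after navigation.

```javascript
// Toy model of a page that re-creates its buttons on every render.
function createPage(labels) {
  let buttons = [];
  const render = () => {
    buttons.forEach((b) => { b.attached = false; }); // old nodes become detached
    buttons = labels.map((label) => ({ label, attached: true }));
  };
  render();
  return { query: () => buttons.slice(), navigateAndReturn: render };
}

// Queries once and iterates the stored references, like .each().
function clickAllStale(page) {
  const clicked = [];
  for (const button of page.query()) {
    if (!button.attached) {
      throw new Error(`"${button.label}" is detached from the DOM`);
    }
    clicked.push(button.label);
    page.navigateAndReturn(); // re-render invalidates the stored references
  }
  return clicked;
}

// Re-queries before every click, like cy.get(...).eq(i).
function clickAllFresh(page) {
  const clicked = [];
  const count = page.query().length;
  for (let i = 0; i < count; i++) {
    const button = page.query()[i]; // fresh reference each time
    clicked.push(button.label);
    page.navigateAndReturn();
  }
  return clicked;
}
```

`clickAllStale` fails on the second button, while `clickAllFresh` gets through all of them because it never holds on to a reference across a re-render.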

Load next set of Infinite Scroll images from within Fancybox, using next arrow

I'm using Infinite Scroll (https://infinite-scroll.com/) to load a large image gallery in WordPress. I'm also using Fancybox (https://fancyapps.com/fancybox/3/) to display those images in a lightbox.
Ideally, when the lightbox opens, the user should be able to cycle through the full image gallery (not just those currently loaded). However, Fancybox only displays images that have been loaded via Infinite Scroll prior to Fancybox being triggered. To see more images, you need to close Fancybox, scroll the page to load the additional images with Infinite Scroll, then re-open Fancybox.
Is there a way to get Fancybox to display the full image gallery, and not be constrained by the 'pages' that Infinite Scroll has currently loaded?
I'm pretty much stuck on this, so any suggestions would be welcome!
// Infinite Scroll
$container.infiniteScroll({
path: '.nextLink',
append: '.masonry-brick',
history: false,
hideNav: '.pageNav',
outlayer: msnry
});
// Fancybox
$().fancybox({
selector : '[data-fancybox="images"]',
loop: false,
});
Edit: Okay, I managed to get this working with the following:
// Infinite Scroll
$container.infiniteScroll({
path: '.nextLink',
append: '.masonry-brick',
history: false,
hideNav: '.pageNav',
outlayer: msnry
});
// Fancybox
$().fancybox({
selector: '[data-fancybox="images"]',
loop: false,
beforeShow: function(instance, current) {
// When we reach the last item in current Fancybox instance, load more images with Infinite Scroll and append them to Fancybox
if (current.index === instance.group.length - 1) { // 1. Check if at end of group
// 2. Trigger infinite scroll to load next set of images
$container.infiniteScroll('loadNextPage');
// 3. Get the newly loaded set of images (.one() so handlers don't stack up each time the end is reached)
$container.one( 'load.infiniteScroll', function( event, response ) {
var $posts = $(response).find('.masonry-brick');
// 4. Set up an array to put them in
var newImages = [];
$($posts).each( function( index, element ){
// 5. Construct the objects
var a = {};
a['type'] = 'image';
a['src'] = $(this).find('a').attr('href');
// 6. Add them to the array
newImages.push(a);
});
// 7. And append to this instance
instance.addContent(newImages);
});
}
}
});
Hope this helps anyone having the same issue!

Resources