Why is my infiniteScroll function in Apify not working? - web-scraping

I am trying to extract product data from a website that loads its product list as the user scrolls down, and I am using Apify for this. My first thought was to see if somebody had already solved this, and I found two useful links: How to make the Apify Crawler to scroll full page when web page have infinite scrolling? and How to scrape dynamic-loading listing and individual pages using Apify?. However, when I tried to apply the functions they mention, my Apify crawler failed to load the content.
I am using a web-scraper based on the code in the basic web-scraper repository.
The website I am trying to get data from is at this link. For the moment I am just learning, so I only want to get the data out of this one page; I do not need to navigate to other pages.
The PageFunction I am using is the following:
async function pageFunction(context) {
    // Establishing utility constants to use throughout the code
    const { request, log, skipLinks } = context;
    const $ = context.jQuery;
    const pageTitle = $('title').first().text();
    context.log.info('Wait for website to render');
    await context.waitFor(2000);
    // Creating a function to scroll the page until the bottom
    const infiniteScroll = async (maxTime) => {
        const startedAt = Date.now();
        let itemCount = $('.upcName').length;
        for (;;) {
            log.info(`INFINITE SCROLL --- ${itemCount} initial items loaded ---`);
            // timeout to prevent an infinite loop
            if (Date.now() - startedAt > maxTime) {
                return;
            }
            scrollBy(0, 99999);
            await context.waitFor(1000);
            const currentItemCount = $('.upcName').length;
            log.info(`INFINITE SCROLL --- ${currentItemCount} items loaded after scroll ---`);
            if (itemCount === currentItemCount) {
                return;
            }
            itemCount = currentItemCount;
        }
    };
    context.log.info('Initiating scrolling function');
    await infiniteScroll(60000);
    context.log.info(`Scraping URL: ${context.request.url}`);
    const results = [];
    $('.itemGrid').each(function () {
        results.push({
            name: $(this).find('.upcName').text(),
            product_url: $(this).find('.nombreProductoDisplay').attr('href'),
            image_url: $(this).find('.lazyload').attr('data-original'),
            description: $(this).find('.block-with-text').text(),
            price: $(this).find('.upcPrice').text()
        });
    });
    return results;
}
I replaced the while(true){...} loop with a for(;;){...} because I was getting an "Unexpected constant condition (no-constant-condition)" ESLint error.
Also, I have tried varying the magnitude of the scroll and the await periods.
In spite of all this, I cannot get the crawler to return more than 32 results.
Could someone please explain what I am doing wrong?
################ UPDATE ##################
I continued to work on this and could not make it work from the Apify platform, so my original question still stands. However, I did manage to make the scroll function work by running the script from my PC.
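For reference, a minimal sketch of that local approach with plain Puppeteer (this is not the exact script; the listing URL placeholder and the scaffolding are assumptions, only the selectors come from the page function above):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    // placeholder: the listing URL from the question goes here
    await page.goto('<listing-url>', { waitUntil: 'networkidle2' });

    let itemCount = 0;
    for (;;) {
        await page.evaluate(() => window.scrollBy(0, document.body.scrollHeight));
        await page.waitForTimeout(1000);
        const currentItemCount = await page.$$eval('.upcName', (els) => els.length);
        if (currentItemCount === itemCount) break; // no new items loaded
        itemCount = currentItemCount;
    }

    const results = await page.$$eval('.itemGrid', (grids) => grids.map((grid) => ({
        name: grid.querySelector('.upcName')?.textContent,
        price: grid.querySelector('.upcPrice')?.textContent,
    })));
    console.log(results.length, 'items scraped');
    await browser.close();
})();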

In this particular case, you can check for the loading-spinner visibility after scrolling, instead of trying to count the number of items.
By changing your code a bit, you can make it like this:
async function pageFunction(context) {
    // Establishing utility constants to use throughout the code
    const { request, log, skipLinks } = context;
    const $ = context.jQuery;
    const pageTitle = $('title').first().text();
    context.log.info('Wait for website to render');
    // wait for the initial listing
    await context.waitFor('.itemGrid');
    context.log.info(`Scraping URL: ${context.request.url}`);
    let tries = 5; // keep track of the load spinner being invisible on the page
    const results = new Map(); // this ensures you only get unique items
    while (true) { // eslint-disable-line
        log.info(`INFINITE SCROLL --- ${results.size} items loaded so far ---`);
        // when the style is set to "display: none", it's hidden, i.e. not loading any new items
        const hasLoadingSpinner = $('.itemLoader[style*="none"]').length === 0;
        if (!hasLoadingSpinner && tries-- < 0) {
            break;
        }
        // scroll to the page end; you can adjust the offset if it's not triggering
        // the infinite-scroll mechanism, e.g. `document.body.scrollHeight * 0.8`
        scrollTo({ top: document.body.scrollHeight });
        $('.itemGrid').each(function () {
            const $this = $(this);
            results.set($this.find('#upcProducto').attr('value'), {
                name: $this.find('.upcName').text(),
                product_url: $this.find('.nombreProductoDisplay').attr('href'),
                image_url: $this.find('.lazyload').data('original'),
                description: $this.find('.block-with-text').text(),
                price: $this.find('.upcPrice').text()
            });
        });
        // because of the `tries` variable, this effectively waits at least
        // 5 seconds before considering the page done loading
        await context.waitFor(1000);
        // scroll back to the top; sometimes scrolling past the end of the page
        // does not trigger the "load more" mechanism
        scrollTo({ top: 0 });
    }
    return [...results.values()];
}
This method also works for virtual pagination, like React Virtual or Twitter results that remove DOM nodes when they are not in the viewport.
Using timeouts is very brittle: depending on how fast or slow your scraper is working, your results will vary, so you need a clear indication that the page is not delivering new items.
You can also keep track of document.body.scrollHeight, as it will change when there are new items.
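For instance, a minimal sketch of that scrollHeight-based stop condition (assuming the same page-function context as above, so context.waitFor is available):

// Sketch: stop when document.body.scrollHeight stops growing for a few rounds.
const scrollUntilHeightSettles = async (maxTries = 5) => {
    let lastHeight = document.body.scrollHeight;
    let tries = maxTries;
    for (;;) {
        scrollTo({ top: document.body.scrollHeight });
        await context.waitFor(1000);
        const newHeight = document.body.scrollHeight;
        if (newHeight > lastHeight) {
            lastHeight = newHeight;
            tries = maxTries; // new content arrived, reset the counter
        } else if (--tries < 0) {
            return; // height unchanged for several rounds, assume the list is done
        }
    }
};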

Related

Adding a new item when using useSWRInfinite pushes other items out of the list

I am building a comment system where new replies are added to the start (top) of the list. The pagination is cursor-based.
At the moment, I use mutate to add the newly created comment as its own page at the front of the list:
const {
    data: commentsPages,
    size: commentsPagesSize,
    setSize: setCommentsPagesSize,
    // TODO: Not true on successive page load. But isValidating refreshes on refetches
    isLoading: commentsLoading,
    error: commentsLoadingError,
    mutate: mutateCommentPages,
} = useSWRInfinite(
    getPageKey,
    ([blogPostId, lastCommentId]) => BlogApi.getCommentsForBlogPost(blogPostId, lastCommentId));

<CreateCommentBox
    blogPostId={blogPostId}
    title="Write a comment"
    onCommentCreated={(newComment) => {
        const updatedPages = commentsPages?.map(page => {
            const updatedPage: GetCommentsResponse = { comments: [newComment, ...page.comments], paginationEnd: page.paginationEnd };
            return updatedPage;
        });
        mutateCommentPages(updatedPages, { revalidate: false });
    }}
/>
The problem is, SWR immediately starts revalidating the list and pushes the comment at the bottom out of the data set. This behavior is kind of awkward.
Is my only choice to disable automatic revalidation completely? How would you handle this?
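One knob that might help, as a sketch (assuming SWR v1.1+, where useSWRInfinite accepts a revalidateFirstPage option): keep revalidation on but stop SWR from refetching the first page, which is what displaces the prepended comment:

// Sketch (assumes SWR v1.1+): options go in the third argument
useSWRInfinite(
    getPageKey,
    ([blogPostId, lastCommentId]) => BlogApi.getCommentsForBlogPost(blogPostId, lastCommentId),
    {
        revalidateFirstPage: false, // don't refetch the first page when loading more pages
        revalidateOnFocus: false,   // optional: fewer refetches that reshuffle the list
    }
);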

How to get HTML of each paragraph of a Word document with Office JS

I have MS Word documents where the table of contents, built using Title 1 to Title 4 styles, is a hierarchy of more than 100 items.
I want to use Office JS to develop an add-in that imports these documents into WordPress as a set of Pages with the same hierarchy as the table of contents.
Each WP page would contain the HTML of all the paragraphs under each title level.
Looking at the Office JS samples, I have been able to log to the console the outline levels of all paragraphs in the document, but I am stuck with getting the HTML.
I think this is probably because I misunderstand context.sync().
Here is my code:
$("#exportToCMS").click(() => tryCatch(exportToCMS));
function exportToCMS() {
return Word.run(function (context) {
// Create a proxy object for the paragraphs collection
var paragraphs = context.document.body.paragraphs;
// Queue a command to load the outline level property for all of the paragraphs
paragraphs.load("outlineLevel");
return context.sync().then(function () {
// Queue a a set of commands to get the HTML of each paragraph.
paragraphs.items.forEach((paragraph) => {
// Queue a command to get the HTML of the paragraph.
var ooxml = paragraph.getOoxml();
return context.sync().then(function () {
console.log(ooxml.value);
console.log(paragraph.outlineLevel);
});
});
});
});
}
/** Default helper for invoking an action and handling errors. */
function tryCatch(callback) {
    Promise.resolve()
        .then(callback)
        .catch(function (error) {
            console.error(error);
        });
}
If I comment out the line which logs ooxml.value, the script runs fine.
When it is uncommented, I get an "Unhandled promise rejection" error.
Broken promise chains are common when you have a context.sync inside a loop. This is also bad from a performance standpoint. Your first step to fixing your code is to get the context.sync out of the forEach loop by following the guidance in this article: Avoid using the context.sync method in loops.
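For instance, a minimal sketch of that restructuring (keeping your API calls, just queuing every getOoxml() before a single sync):

function exportToCMS() {
    return Word.run(function (context) {
        var paragraphs = context.document.body.paragraphs;
        paragraphs.load("outlineLevel");
        return context.sync().then(function () {
            // Queue ALL the getOoxml() calls first...
            var ooxmlResults = paragraphs.items.map(function (paragraph) {
                return paragraph.getOoxml();
            });
            // ...then resolve them with one sync and read the results.
            return context.sync().then(function () {
                paragraphs.items.forEach(function (paragraph, i) {
                    console.log(ooxmlResults[i].value);
                    console.log(paragraph.outlineLevel);
                });
            });
        });
    });
}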

Scraping infinite scroll href with Cypress

I'm using Cypress to scrape a site with an infinite scroll.
The site is made with React, and after the user enters a search term in an input, as they scroll more products appear on the page matching the search term entered.
The code I've got so far opens a URL, navigates to the URL and collects all the hrefs that are currently visible.
What I'm wondering is how I can tell Cypress to scroll down further, slowly harvesting all the hrefs as it scrolls down the page, and then finally writing them to a JSON file.
This is the code I have so far, minus the scrolling:
const arrayOfHrefs = [];

describe('Get links', () => {
    it.only('should do a product search', () => {
        cy.visit('https://www.testsite.com');
        cy.wait(5000);
        cy.get('#product_input').type('socks');
        cy.contains('socks').click(); // renders new content on the client side
        cy.wait(10000);
        cy.get('a').each(($a) => {
            const link = $a.attr('href');
            arrayOfHrefs.push(link); // grabs all visible links and pushes them to the array
        }).then(() => {
            console.log(arrayOfHrefs);
            cy.writeFile('data.json', { urls: arrayOfHrefs }); // writes the array to disk
        });
    });
});
You did not detail what you have tried so far and what issues you're currently having regarding scrolling, but I assume scrolling down the window and then adding some logic to wait until more links become visible is sufficient.
This command scrolls down the whole window to the bottom over 5000ms:
cy.scrollTo('bottom', {duration: 5000})
Note that it's not chained off from an element like:
cy.get('#some-scrollable-element').scrollTo(...)
I googled a page that has some similar dynamic infinite-scroll behaviour; maybe you could base your code on the following snippet:
describe('', () => {
    before('', () => {
        cy.server()
        cy.route('GET', '**/blog/page/**').as('blog')
    })
    it('', () => {
        let numberOfChildren = 4
        cy.visit('http://www.drewleague.com/blog/')
        for (let i = 0; i < 5; i++) {
            cy.get('.posts--desktop')
                .children()
                .then(children => {
                    cy.wrap(children)
                        .its('length')
                        .should('eq', numberOfChildren)
                })
            cy.scrollTo('bottom', {duration: 5000})
                .wait('@blog')
                .then(() => numberOfChildren += 4)
        }
    })
})
This code scrolls to the bottom of the page 5 times; in each iteration we check the number of dynamically added children, and we also wait until the XHR request finishes. Not very useful on its own, but you get the idea.
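Note also that cy.server() and cy.route() were deprecated in Cypress 6 and later removed, so on a current version the same idea would use cy.intercept(), roughly like this (a sketch, reusing the alias and URL pattern from above):

// Cypress 6+ sketch: cy.intercept() replaces cy.server()/cy.route()
cy.intercept('GET', '**/blog/page/**').as('blog');
// ...
cy.scrollTo('bottom', { duration: 5000 });
cy.wait('@blog'); // note the @ prefix when waiting on an alias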

Vue JS AJAX computed property

Ok, I believe I am VERY close to having my first working Vue JS application, but I keep hitting little snag after little snag. I hope this is the last one.
I am using vue-async-computed and axios to fetch a customer object from my API.
I am then passing that property to a child component and rendering it to the screen like: {{customer.fName}}.
As far as I can see, the AJAX call is being made and the response coming back is as expected; the problem is that nothing shows on the page. The customer object doesn't seem to update after the AJAX call.
Here is the profile page .vue file I'm working on:
http://pastebin.com/DJH9pAtU
The component has a computed property called "customer", and as I said, I can see in the network tab that the request is being made and there are no errors. The response is being sent to the child component here:
<app-customerInfo :customer="customer"></app-customerInfo>
within that component I am rendering the data to the page:
{{customer.fName}}
But the page shows no results. Is there a way to verify the value of the "customer" property in the inspector? Is there something obvious I am missing?
I've been using Vue for about a year and a half, and I realize the struggle that is dealing with async data loading and that good stuff. Here's how I would set up your component:
<script>
export default {
    components: {
        // your components were fine
    },
    data: () => ({ customer: {} }),
    async mounted() {
        const { data } = await this.axios.get(`/api/customer/get/${this.$route.params.id}`);
        this.customer = data;
    }
}
</script>
So what I did was initialize customer in your component's data function; then, when the component gets mounted, send an axios call to the server, and when that call returns, set this.customer to the data. And like I said in my comment above, definitely check out Vue's devtools; they make tracking down variables and events super easy!
I believe your error is a naming issue. The vue-async-computed plugin needs a new property on the Vue object:
computed: {
    customer: async function() {
        this.axios.get('/api/customer/get/' + this.$route.params.id).then(function(response){
            return(response.data);
        });
    }
}
should be:
asyncComputed: {
    async customer() {
        const res = await this.axios.get(`/api/customer/get/${this.$route.params.id}`);
        return res.data;
    }
}

Meteor mongo find does not return all the elements

I'm using Meteor and the JavaScript FullCalendar plugin.
I'm trying to get the data from a MongoDB cursor, but it seems really random whether I get the elements or not.
Here is my code:
events: function (start, end, timezone, callback) {
    var events = [];
    var eventsData = Events.find();
    eventsData.forEach(function(event) {
        events.push(event);
    });
    callback(events);
}
This sits inside Meteor.isClient, in the jQuery section.
Sometimes I get all the elements from the database and sometimes I don't.
Does anybody have an idea on how to always get them?
Thanks
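One likely cause, as a hedged guess: the query runs before the subscription delivering Events is ready, so the cursor only sees part of the collection. A minimal sketch of guarding on subscription readiness (the 'events' publication name is an assumption; replace it with your own):

// Sketch: wait for the subscription before building the calendar events.
if (Meteor.isClient) {
    var handle = Meteor.subscribe('events'); // assumed publication name
    Tracker.autorun(function () {
        if (handle.ready()) {
            var events = Events.find().fetch(); // now returns every published document
            // ...hand `events` to fullcalendar's callback here
        }
    });
}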
