Apify cheerio scraper stops even with urls in the queue - web-scraping

Here is the scenario: I'm using the Cheerio Scraper to scrape a website containing real estate listings.
Each listing has a link to the next one, so before scraping the current page I add the next page to the request queue.
What always happens, at some random point, is that the scraper stops for no apparent reason, even though the next page to scrape is still in the queue (see the attached image).
Why does this happen when there is still a pending request in the queue?
Many thanks
Here is the message I get:
2021-02-28T10:52:35.439Z INFO CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
2021-02-28T10:52:35.672Z INFO CheerioCrawler: Final request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":963,"requestsFinishedPerMinute":50,"requestsFailedPerMinute":0,"requestTotalDurationMillis":22143,"requestsTotal":23,"crawlerRuntimeMillis":27584,"requestsFinished":23,"requestsFailed":0,"retryHistogram":[23]}
2021-02-28T10:52:35.679Z INFO Cheerio Scraper finished.
Here is the request queue:
Here is the code:
async function pageFunction(context) {
  const { $, request, log } = context;
  // The "$" property contains the Cheerio object which is useful
  // for querying DOM elements and extracting data from them.
  const pageTitle = $('title').first().text();
  // The "request" property contains various information about the web page loaded.
  const url = request.url;
  // Use "log" object to print information to actor log.
  log.info('Scraping Page', { url, pageTitle });
  // Adding next page to the queue
  var baseUrl = '...';
  if ($('div.d3-detailpager__element--next a').length > 0) {
    var nextPageUrl = $('div.d3-detailpager__element--next a').attr('href');
    log.info('Found another page', { nextUrl: baseUrl.concat(nextPageUrl) });
    context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
  }
  // My code for scraping follows here
  return { /* my scraped object */ };
}

Missing await:
await context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
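In context, the enqueue step from the pageFunction above then becomes the following (same code, just awaited; without the await, pageFunction can resolve before the request has actually been added to the queue, so the crawler may see the queue as empty and shut down):

  if ($('div.d3-detailpager__element--next a').length > 0) {
    var nextPageUrl = $('div.d3-detailpager__element--next a').attr('href');
    log.info('Found another page', { nextUrl: baseUrl.concat(nextPageUrl) });
    // await ensures the request is persisted to the queue before pageFunction returns
    await context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
  }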

Related

How to make Next-Auth-session-token-dependent server queries with React Query in Next JS?

I am trying to make an API GET request, using React Query's useInfiniteQuery hook, that uses data from a Next Auth session token in the query string.
I have a callback in /api/auth/[...nextauth].ts to send extra userData to my session token.
There are two relevant pages on the client side. Let's call them /pages/index.tsx and /hooks/useApiData.ts. This is what they look like, for all intents and purposes:
// pages/index.tsx
export default function Page() {
  const {data, fetchNextPage, isLoading, isError} = useApiData()
  if (isLoading) return <main />
  return <main>
    <InfiniteScroller fetchMore={fetchNextPage}>
      {data?.pages?.map(page => page?.results?.map((item: string) => item))}
    </InfiniteScroller>
  </main>
}
// hooks/useApiData.ts
async function fetchPage(pageParam: string) {
  const response = await fetch(pageParam)
  return await response.json()
}

export default function useApiData() {
  const {data: session} = useSession()
  const init = `/api?userData=${session?.user?.userData}`
  return useInfiniteQuery('query',
    ({pageParam = init}) => fetchPage(pageParam),
    {getNextPageParam: prevPage => prevPage.next ?? undefined}
  )
}
My initial request gets sent to the API as /api?userData=undefined. The extra data is definitely making its way into the token.
I can place the data from my session in the DOM via the render function of /pages/index.tsx, so I figure the problem is something to do with custom hooks running before the session context is ready, or something like that... I don't understand the mechanics of hooks well enough to figure that out.
I've been looking for answers for a long time, and I'm surprised not to have found a single person with the same issue. These are not unpopular packages and I guess a lot of people are using them in conjunction to achieve what I'm attempting here, so I figure I must be doing something especially dumb. But what?!
How can I get the data from my Next Auth session into my React Query request? And for bonus points, why is the session data not available when the request is sent in my custom hook?
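One way to test the "session not ready yet" theory above is to hold the query back until the session has loaded. This is a hedged sketch, not the poster's code: it assumes React Query's enabled option and keeps the rest of the hook as posted.

// hooks/useApiData.ts -- sketch only: wait for the session before firing the query
export default function useApiData() {
  const {data: session} = useSession()
  const userData = session?.user?.userData
  return useInfiniteQuery(
    ['query', userData],  // key changes once userData exists
    ({pageParam = `/api?userData=${userData}`}) => fetchPage(pageParam),
    {
      enabled: !!userData,  // do not run while userData is still undefined
      getNextPageParam: prevPage => prevPage.next ?? undefined,
    }
  )
}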

Cypress not stubbing json data in intercept?

I've been searching for a solution all day, googling and StackOverflowing, but nothing appears to be working.
I've got a very simple NextJS app. On page load, I load a fact from a third party API automatically. Then a user can enter a search query, press enter, and search again based on that query. I want to create a Cypress test that checks for the functionality of that search feature.
Right now, I'm getting a timeout on cy.wait(), and it states that "No request ever occurred".
app.spec.js
import data from '../fixtures/data';

describe('Test search functionality', () => {
  it('renders new fact when search is performed', () => {
    // Visit page
    cy.visit('/');
    // Wait for page to finish loading initial fact
    cy.wait(1000);
    // Intercept call to API
    cy.intercept("GET", `${process.env.NEXT_PUBLIC_API_ENDPOINT}/jokes/search?query=Test`, {
      fixture: "data.json",
    }).as("fetchFact");
    // Type in search input
    cy.get('input').type('Test');
    // Click on search button
    cy.get('.submit-btn').click();
    // Wait for the request to be made
    cy.wait('@fetchFact').its('response.statusCode').should('eq', 200);
    cy.get('p.copy').should('contain', data.result[0].value);
  })
});
One thing I've noticed is that the data being displayed on the page is coming from the actual API response rather than the JSON file I'm attempting to stub with. None of the React code is rendered server-side either; this is all client-side.
As you can see, the test is pretty simple, and I feel like I've tried every variation of intercept, changing the order of things, etc. What could be causing this timeout? Why isn't the JSON being stubbed in place of the network request?
And of course, I figured out the issue minutes after posting this question.
I realized that Cypress doesn't play well with Next's way of handling env variables, and that I instead needed to create a cypress.env.json. I've updated my test to look like this:
import data from '../fixtures/data';

describe('Test search functionality', () => {
  it('renders new fact when search is performed', () => {
    // Visit page
    cy.visit('/');
    // Wait for page to finish loading initial fact
    cy.wait(1000);
    // Intercept call to API
    const url = `${Cypress.env('apiEndpoint')}/jokes/search?query=Test`;
    cy.intercept("GET", url, {
      fixture: "data",
    }).as("fetchFact");
    // Type in search input
    cy.get('input').type('Test');
    // Click on search button
    cy.get('.submit-btn').click();
    // Wait for the request to be made
    cy.wait('@fetchFact').its('response.statusCode').should('eq', 200);
    cy.get('p.copy').should('contain', data.result[0].value);
  })
});
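For reference, the cypress.env.json this relies on is just a JSON map of variable names to values, something like the following (the URL here is only a placeholder for whatever NEXT_PUBLIC_API_ENDPOINT pointed at):

{
  "apiEndpoint": "https://api.example.com"
}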

How to make an HttpClient GetAsync wait for a webpage that loads data asynchronously?

I'm making a snippet that sends data to a website that analyses it and then sends back results.
Is there any way to make my GetAsync wait until the website finishes its calculation before getting a "full response"?
P.S.: The await will not know whether the requested page contains any asynchronous processing (e.g. XHR calls). I already use await and ReadAsByteArrayAsync()/ReadAsStringAsync().
Thank you!
You will need something like Selenium to not only fetch the HTML of the website but to fully render the page and execute any dynamic scripts.
You can then hook into some events, wait for certain DOM elements to appear or just wait some time until the page is fully initialized.
Afterwards you can use the API of Selenium to access the DOM and extract the information you need.
Example code:
using (var driver = new ChromeDriver(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location)))
{
    driver.Navigate().GoToUrl(@"https://automatetheplanet.com/multiple-files-page-objects-item-templates/");
    var link = driver.FindElement(By.PartialLinkText("TFS Test API"));
    var jsToBeExecuted = $"window.scroll(0, {link.Location.Y});";
    ((IJavaScriptExecutor)driver).ExecuteScript(jsToBeExecuted);
    var wait = new WebDriverWait(driver, TimeSpan.FromMinutes(1));
    var clickableElement = wait.Until(ExpectedConditions.ElementToBeClickable(By.PartialLinkText("TFS Test API")));
    clickableElement.Click();
}
Source: https://www.automatetheplanet.com/webdriver-dotnetcore2/
What you're looking for here is the await operator. According to the docs:
The await operator suspends evaluation of the enclosing async method until the asynchronous operation represented by its operand completes.
Sample use within the context of an HttpClient object:
public static async Task Main()
{
    using var httpClient = new HttpClient();
    // send the HTTP GET request
    var response = await httpClient.GetAsync("my-url");
    // get the response string (note this call must also be awaited)
    // there are other `ReadAs...()` methods if the return type is not a string
    var getResult = await response.Content.ReadAsStringAsync();
}
Note that the method that encloses the await-ed code is marked as async and has a return type of Task (Task<T> would also work, depending on your needs).

I am noticing double entry (cpc and organic) for the same user? A little confused

I'm noticing double entries in Google Analytics. I have multiple occurrences where it looks like the user came from the CPC campaign (which always has a 0s session duration), but that very same user also has an entry for "organic" and all the activity is logged under that.
My site is not ranked organically for those keywords. Unless many users come to my site, leave, google my "brand name", and revisit, this doesn't make sense.
I'm a little confused. Here's the report:
preview from google analytics dashboard
Based on the additional information in your comment, that the site is a Single Page Application (SPA), you are most likely facing the problem of 'Rogue Referral'.
If this is the case, what happens is that you overwrite the location field in the Analytics hit, losing the original UTM parameters, while the referral is still sent with the hit, so Analytics recognizes the second hit as a new traffic source. One of the solutions is to store the original page URL and send it as the location, while sending the actual visited URL in the page field.
A very good article on this topic, with further tips, by Simo Ahava, is available for your help.
Also note that, since the first hit shows 0 seconds time on page as you mentioned, you might want to check whether the first visited page is sent twice, e.g. a hit on the traditional page load event plus a hit for the same page as a virtual page view.
I have come up with a solution to this problem in a Gatsby website (a SPA), by writing the main logic in the gatsby-browser.js file, inside the onRouteUpdate function.
You can use this solution in other contexts, but please note that the code needs to run at the first load of the page and at every route change.
If you want the solution to work in browsers that do not support URLSearchParams I think you can easily find a polyfill.
Function to retrieve the parameters
// return the whole parameters only if at least one of the desired parameters exists
const retrieveParams = () => {
  let storedParams;
  if ('URLSearchParams' in window) {
    // Browser supports URLSearchParams
    const url = new URL(window.location.href);
    const params = new URLSearchParams(url.search);
    const requestedParams = ['utm_source', 'utm_medium', 'utm_campaign', 'utm_content', 'gclid'];
    const hasRequestedParams = requestedParams.some((param) => {
      // true if it exists
      return !!params.get(param);
    });
    if (hasRequestedParams) {
      storedParams = params;
    }
  }
  return storedParams;
}
Create the full URL
// look at existing parameters (from previous page navigations) or retrieve new ones
const storedParams = window.storedParams || retrieveParams();
let storedParamsUrl;
if (storedParams) {
  // update window value
  window.storedParams = storedParams;
  // create the url
  const urlWithoutParams = document.location.protocol + '//' + document.location.hostname + document.location.pathname;
  storedParamsUrl = `${urlWithoutParams}?${storedParams}`;
}
Send the value to analytics (using gtag)
// gtag
gtag('config', 'YOUR_GA_ID', {
  // ... other parameters
  page_location: storedParamsUrl ?? window.location.href
});
or
gtag('event', 'page_view', {
  // ... other parameters
  page_location: storedParamsUrl ?? window.location.href,
  send_to: 'YOUR_GA_ID'
})
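Putting the three snippets together, a minimal gatsby-browser.js sketch could look like the following. This is only a sketch: it assumes retrieveParams from above is in scope, that gtag is loaded globally, and it keeps the original YOUR_GA_ID placeholder.

// gatsby-browser.js -- minimal sketch, not a drop-in implementation
// runs on the first load and on every client-side route change
export const onRouteUpdate = () => {
  // reuse params stored by a previous navigation, or read them from the current URL
  const storedParams = window.storedParams || retrieveParams();
  let storedParamsUrl;
  if (storedParams) {
    window.storedParams = storedParams;
    const urlWithoutParams = document.location.protocol + '//' + document.location.hostname + document.location.pathname;
    storedParamsUrl = `${urlWithoutParams}?${storedParams}`;
  }
  // send the page view with the original UTM/gclid parameters preserved
  window.gtag('event', 'page_view', {
    page_location: storedParamsUrl ?? window.location.href,
    send_to: 'YOUR_GA_ID'
  });
};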

Trigger a button click from a URL

We need to scrape the VEEC website for the total number once a week.
As an example, for the week of 17/10/2016 - 23/10/2016 the URL returns the number Total 167,356 when the search button is clicked. We want this number to be stored in our database.
I'm using ColdFusion to generate the weekly dates as params and have been passing them in the URL as above. But I'm unable to find a query param that triggers the "Search" button click event.
I've tried like this & this but nothing seems to be working.
Any pointers?
It seems like a CSRF token is added for every form submission, which prevents malicious activity. To make matters worse for you, the CSRF token is changed for each form submission, not just for each user, which makes it virtually impossible to circumvent.
When I make a CFHTTP POST request to this form, I get HTML FileContent back, but there is no DB data within the results table cell placeholders. It seems to me that the form owner allows form submission from an HTTP request, but if the CSRF token cannot be validated, no DB data is returned.
It may be worth asking the website owner if there is any kind of REST API that you can hook into...
If you want to use the headless browser PhantomJS (https://en.wikipedia.org/wiki/PhantomJS) for this, here is a script that will save the total to a text file.
At the command prompt, after you install PhantomJS, run phantomjs.exe main.js.
main.js
"use strict";
var firstLoad = true;
var url = 'https://www.veet.vic.gov.au/Public/PublicRegister/Search.aspx?CreatedFrom=17%2F10%2F2016&CreatedTo=23%2F10%2F2016';
var page = require("webpage").create();
page.viewportSize = {
width: 1280,
height: 800
};
page.onCallback = function (result) {
var fs = require('fs');
fs.write('veet.txt', result, 'w');
};
page.onLoadStarted = function () {
console.log("page.onLoadStarted, firstLoad", firstLoad);
};
page.onLoadFinished = function () {
console.log("page.onLoadFinished, firstLoad", firstLoad);
if (firstLoad) {
firstLoad = false;
page.evaluate(function () {
var event = document.createEvent("MouseEvents");
event.initEvent("click", true, true);
document.querySelectorAll(".dx-vam")[3].dispatchEvent(event);
});
} else {
page.evaluate(function () {
var element = document.querySelectorAll('.dxgv')[130];
window.callPhantom(element.textContent);
});
setTimeout(function () {
page.render('veet.png');
phantom.exit();
}, 3000);
}
};
page.open(url);
The script is not perfect and you can improve on it if you're interested, but as is, it will save the total to a file veet.txt and also save a screenshot veet.png.
