Proper error catching when dealing with multiple elements in Puppeteer

I am navigating through a site and have a nested structure like this:
if (deliveryTab) {
  let deliveryButton = await self.page.$(
    "button#shopping-selector-shop-context-intent-delivery"
  );
  if (deliveryButton) {
    await deliveryButton.click();
  }
}
I check that each button/element is not null before going on to the next step, but this quickly creates a deep pile of nested ifs. I was thinking of surrounding everything in one large try...catch, or of using a try...catch on every element I deal with, which makes me suspect my approach isn't good.
On a slow network, waitForSelector() with a short timeout might throw even though the element was about to load, but I am also afraid that page.$("selector") will return null on a slow network, even after waitForNavigation. Are there any elegant solutions/practices for making sure my app can handle most errors and changes in the site I am traversing?
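For what it's worth, one common pattern (a sketch only; the helper name tryWaitFor is hypothetical) is to wrap waitForSelector in a helper that turns a timeout into null, so a missing element becomes an ordinary value and early returns flatten the nesting:

async function tryWaitFor(page, selector, timeout = 5000) {
  // Wait for the selector, but return null on timeout instead of throwing,
  // so "not there yet" becomes a normal outcome rather than an exception.
  try {
    return await page.waitForSelector(selector, { timeout });
  } catch (err) {
    return null;
  }
}

async function chooseDelivery(page) {
  // Early returns replace the nested ifs; the selector is from the question.
  const deliveryButton = await tryWaitFor(
    page,
    "button#shopping-selector-shop-context-intent-delivery"
  );
  if (!deliveryButton) return false; // button never appeared
  await deliveryButton.click();
  return true;
}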

Related

Get an array from SpookyJs to Meteor

After a lot of hard work, my SpookyJS script works as it should, and I got my spoils of war: an array of values I want to use to query my collection in my Meteor app. But I have a huge problem.
I can't find a way to call any Meteor-specific methods from Spooky...
So my code is like this for the spooky.on function:
spooky.on('fun', function (courses) {
  console.log(courses);
  // Meteor.call('edxResult', courses); // doesn't work...
});
The console.log gives me the result I want:
[ 'course-v1:MITx+6.00.2x_3+1T2015',
  'HarvardX/CS50x3/2015',
  'course-v1:LinuxFoundationX+LFS101x.2+1T2015',
  'MITx/6.00.1x_5/1T2015' ]
What I need is a way to call a Meteor.method with courses as my argument, or a way to get access to the array in the current Meteor.method after SpookyJS has finished its work. (Sadly, I have no idea how to check whether Spooky is finished.)
My last idea would be to give the Meteor.method a callback function and store the array in the session or something, but that sounds like extremely bad design; there has to be a better way, I hope.
I am extremely proud of my little ghost, so any help getting it the last few pieces over the finish line would be extremely appreciated.
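As a hedged sketch of one approach (not from the original thread): Spooky's on handlers run as plain Node callbacks, outside Meteor's fiber-based environment, so a Meteor.call there generally has to be re-entered through Meteor.bindEnvironment. The event name 'fun' and method name 'edxResult' are the ones from the question:

// Sketch: wrap the handler so Meteor's environment is available inside
// the Spooky callback; the second argument handles wrapper errors.
spooky.on('fun', Meteor.bindEnvironment(function (courses) {
  console.log(courses);
  Meteor.call('edxResult', courses); // now runs inside a fiber
}, function (err) {
  console.log('bindEnvironment failed:', err);
}));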

Purely functional feedback suppression?

I have a problem that I can solve reasonably easily with classic imperative programming using state: I'm writing a co-browsing app that shares URLs between several nodes. The program has a module for communication that I call link and one for browser handling that I call browser. When a URL arrives in link, I use the browser module to tell the actual web browser to start loading it.
The actual browser will trigger the navigation detection that the incoming URL has started to load, and hence it will immediately be presented as a candidate for sending to the other side. That must be avoided, since it would create an infinite loop of link-following to the same URL, along the lines of the following (very conceptualized) pseudo-code (it's JavaScript, but please consider that a somewhat irrelevant implementation detail):
actualWebBrowser.urlListen.gotURL(function(url) {
  // Browser delivered an URL
  browser.process(url);
});

link.receivedAnURL(function(url) {
  actualWebBrowser.loadURL(url); // will eventually trigger above listener
});
What I did first was to store every incoming URL in browser and simply eat the URL immediately when it arrives, then remove it from a 'recents' list in browser after a timeout, along the lines of this:
browser.recents = {}; // <--- mutable state
browser.recentsExpiry = 40000;

browser.doSend = function(url) {
  var now = (new Date).getTime();
  link.sendURL(url); // <-- URL goes out on the network
  // Side effect, mutating module state, clumsy clean-up mechanism :(
  browser.recents[url] = now;
  setTimeout(function() { delete browser.recents[url]; }, browser.recentsExpiry);
  return true;
};

browser.process = function(url) {
  if (/* sanity checks on `url` */) {
    var now = (new Date).getTime();
    var duplicate = browser.recents[url]; // timestamp of the last send, if any
    if (!duplicate) return browser.doSend(url);
    if ((now - duplicate) > browser.recentsExpiry) {
      return browser.doSend(url);
    }
    return false;
  }
};
It works but I'm a bit disappointed by my solution because of my habitual use of mutable state in browser. Is there a "Better Way (tm)" using immutable data structures/functional programming or the like for a situation like this?
A more functional approach to handling long-lived state is to use it as a parameter to a recursive function, with one execution of the function responsible for handling a single "action" of some kind before calling itself again with the new state.
F#'s MailboxProcessor is one example of this kind of approach. However, it does depend on having the processing happen on an independent thread, which isn't the same as the event-driven style of your code.
As you identify, the setTimeout in your code complicates the state management. One way you could simplify this is to instead have browser.process filter out any timed-out URLs before it does anything else. That would also eliminate the need for the extra timeout check on the specific URL it is processing, as sketched below.
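A minimal sketch of that simplification, reusing the names from the question's code (the elided sanity checks are omitted here too):

browser.pruneRecents = function(now) {
  // Drop every entry older than the expiry window up front,
  // so no timers are needed and no per-URL age check remains.
  Object.keys(browser.recents).forEach(function(url) {
    if (now - browser.recents[url] > browser.recentsExpiry) {
      delete browser.recents[url];
    }
  });
};

browser.process = function(url) {
  var now = (new Date).getTime();
  browser.pruneRecents(now);
  if (browser.recents[url]) return false; // still within the window
  return browser.doSend(url);
};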
Even if you can't eliminate mutable state from your code entirely, you should think carefully about the scope and lifetime of that state.
For example, might you want multiple independent browsers? If so, you should think about how the recents set can be encapsulated to belong to just a single browser, so that you don't get collisions. Even if you don't need multiple ones for your actual application, this might help testability.
There are various ways you might keep the state private to a specific browser, depending in part on what features the language has available. For example, in a language with objects, a natural way would be to make it a private member of a browser object; in JavaScript, a closure does the same job.
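A sketch of the closure version, under the same assumptions as the code above (makeBrowser is a hypothetical factory name):

function makeBrowser(link, recentsExpiry) {
  var recents = {}; // private: visible only to this browser instance

  function doSend(url) {
    recents[url] = (new Date).getTime();
    link.sendURL(url);
    return true;
  }

  return {
    process: function(url) {
      var now = (new Date).getTime();
      var seen = recents[url];
      if (seen && now - seen <= recentsExpiry) return false; // duplicate
      return doSend(url);
    }
  };
}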

Update document in Meteor mini-mongo without updating server collections

In Meteor, I have a collection that the client subscribes to. In some cases, instead of publishing the documents that exist in the collection on the server, I want to send down some bogus data. Now, that's fine using the this.added function in the publish.
My problem is that I want to treat the bogus doc as if it were a real document; specifically, this gets troublesome when I want to update it. For the real docs I run RealDocs.update, but doing that on the bogus doc fails, since there is no representation of it on the server (and I'd like to keep it that way).
A collection API that allowed me to pass something like local = true would be fantastic, but I have no idea how difficult that would be to implement, and I'm not too fond of modifying the core code.
Right now I'm stuck with creating a BogusDocs = new Meteor.Collection(null), but that makes populating the collection more difficult, since I have to either hard-code fixtures in the client code or use a method to get the data from the server, and I have to make sure I call BogusDocs.update instead of RealDocs.update as soon as I'm dealing with bogus data.
Maybe I could actually insert the data on the server and make sure it's removed later, but the data really has nothing to do with the server side collection so I'd rather avoid that.
Any thoughts on how to approach this problem?
After some further investigation (the Evented Mind site), it turns out that one can modify the local collection without making calls to the server. This is done by running the same methods as usual, but on MyCollection._collection instead of just on MyCollection; MyCollection.update() thus becomes MyCollection._collection.update(). So, using a simple wrapper, one can pass the usual arguments to an update call and either update the collection as usual (which will call the server, which in turn will trigger your allow/deny rules) or add 'local' as the last argument to perform the update only in the client collection. Something like this should do it:
DocsUpdateWrapper = function() {
  // arguments is not a real array, so copy it before slicing.
  var args = Array.prototype.slice.call(arguments);
  var lastIndex = args.length - 1;
  if (args[lastIndex] === 'local') {
    // Client-only update: bypasses the server and allow/deny rules.
    Docs._collection.update.apply(Docs._collection, args.slice(0, lastIndex));
  } else {
    // Normal update: goes through the server as usual.
    Docs.update.apply(Docs, args);
  }
};
(This could of course be extended to a DocsWrapper that allows for insertions and removals too.) (I didn't try this function yet, but it should serve well as an example.)
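For example (the selector and modifier here are hypothetical), a call might look like:

// Same shape as Docs.update, plus an optional trailing 'local'.
DocsUpdateWrapper({ _id: docId }, { $set: { title: 'Draft' } }, 'local'); // client-only
DocsUpdateWrapper({ _id: docId }, { $set: { title: 'Draft' } });          // via the server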
The biggest benefit of this is, imo, that we can use the exact same calls to retrieve documents from the local collection, regardless of whether they are only local or also live on the server. By adding a simple boolean to the doc we can keep track of which documents are local-only and which are not (an improved DocsWrapper could check for that bool, so we could even omit passing the 'local' argument), so we know how to update them.
There are some people working on local storage in the browser
https://github.com/awwx/meteor-browser-store
You might be able to adapt some of their ideas to provide "fake" documents.
I would use the transform feature on the collection to make an object that knows what to do with itself (on the client). Give it the correct update method (real/bogus), then call .update on the document rather than a general one.
You can put the code from this.added into the transform process.
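A sketch of that idea, assuming a hypothetical isLocal flag on the bogus docs (the transform option on Meteor.Collection is real; the collection and method names are made up):

// The transform attaches the right update method to every fetched doc.
Docs = new Meteor.Collection('docs', {
  transform: function (doc) {
    doc.save = function (modifier) {
      if (doc.isLocal) {
        Docs._collection.update(doc._id, modifier); // bogus doc: client-only
      } else {
        Docs.update(doc._id, modifier); // real doc: goes through the server
      }
    };
    return doc;
  }
});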
You can also set up a local minimongo collection. Insert on callback
FoundAgents = new Meteor.Collection(null, Agent.transformData)
FoundAgents.remove({})

Meteor.call 'Get_agentsCloseToOffer', me, ping, (err, data) ->
  if err
    console.log JSON.stringify(err, null, 2)
  else
    _.each data, (item) ->
      FoundAgents.insert(item)
Maybe this is interesting for you as well: I created two examples with native Meteor local collections at meteorpad. The first pad shows an example with a plain reactive recordset: Sample_Publish_to_Local-Collection. The second uses the collection's .observe method to listen to data: Collection.observe().

Proper LINQ to Lucene Index<T> usage pattern for ASP.NET?

What is the proper usage pattern for LINQ to Lucene's Index<T>?
It implements IDisposable, so I figured wrapping it in a using statement would make the most sense:
IEnumerable<MyDocument> documents = null;
using (Index<MyDocument> index = new Index<MyDocument>(new System.IO.DirectoryInfo(IndexRootPath)))
{
    documents = index.Where(d => d.Name.Like("term")).ToList();
}
I am occasionally experiencing unwanted deletion of the index on disk. It seems to happen 100% of the time when multiple instances of the Index exist at the same time. I wrote a test using PLINQ to run two searches in parallel: one search works, while the other returns 0 results because the index has been emptied.
Am I supposed to use a single static instance instead?
Should I wrap it in a Lazy<T>?
Am I then opening myself up to other issues when multiple users access the static index at the same time?
I also want to re-index periodically as needed, likely using another process like a Windows service. Am I also going to run into issues if users are searching while the index is being rebuilt?
The code looks like LINQ to Lucene.
Most cases of completely cleared Lucene indexes come from a new IndexWriter created with the create parameter set to true. The code in the question does not handle indexing, so debugging this further is difficult.
Lucene.Net is thread-safe, and I expect LINQ to Lucene to exhibit the same behavior. A single static index instance would cache things in memory, but I guess you'll need to handle reloading the index after changes yourself (I do not know whether LINQ to Lucene does this for you).
There should be no problem using several searchers/readers while reindexing; Lucene is built to support that scenario. However, there can only be one writer per directory, so no other process can write documents to the index while your Windows service optimizes it.

Flex: DeepCopy of FileReference

In my project, I let users pick pictures using the FileReference class. I then load these pictures into their .data properties using the load() function. After this I perform some local manipulation and send them to the server.
What I would like to do is be able to iterate over the picked FileReferences again, load them into their .data properties, perform different manipulation, and send them to the server once again. I know that I should be able to do this from a user-invoked event; that is not an issue here.
The problem is, once a FileReference is loaded for the first time, I cannot unload it in any way, and I cannot keep the data for all the pictures in memory because they are huge.
So I guess there is only one thing I can do, which is performing a deep copy of the FileReference... Then I could load the first version, scrap it, and use the copy for the second 'run'.
I tried to use ObjectUtil.copy, but when I access e.g. .name property of the copy, it fails with:
Error #2037: Functions called in incorrect sequence, or earlier call was unsuccessful.
at flash.net::FileReference/get name()
the relevant snippet:
registerClassAlias("FileReference",FileReference);
masterFileList.addItem(FileReference(ObjectUtil.copy(fr_load.fileList[i])));
trace(masterFileList[i].name)
Is it true that there are some protected properties of the FileReference class that prevent it from being copied? If so, can I sidestep this somehow? Or is there any other solution to my overall problem?
I appreciate any hints/ideas!
I was trying to do almost exactly what you are doing, and I almost gave up after reading some of the answers, but I think I found a way to do it. I've found that if you have a FileReference object and call load() multiple times, it will work, but the main problem is that you're keeping the high-res bytes in memory after the first load. As you've mentioned, that is a big no-no.
The way to get around this is that after your first load(), you need to call the cancel() method on the FileReference. From my testing so far, it looks like that clears out the bytes in the FileReference, and load() will still work if you call it a second time later. Just a word of caution: this isn't explicitly defined behavior in the API, so it is definitely subject to change, but it may help get you where you need to go in the meantime.
Hope that helps.
You can't use ObjectUtil.copy. That method is designed for copying only data objects (VO classes).
You should create a new FileReference and copy the properties one by one. Create a function to do this.
Would copying it to a temporary file and then uploading the temporary file work? For example:

var fileRef:FileReference = new FileReference();
fileRef.browse();
// ...
var tmpFile:File = File.createTempFile();
try {
    var tmpFileStream:FileStream = new FileStream();
    tmpFileStream.open(tmpFile, FileMode.WRITE);
    trace("Opened file: " + tmpFile.nativePath);
    tmpFileStream.writeBytes(fileRef.data);
    trace("copied file");
} catch (error:Error) {
    trace("Unable to open file " + tmpFile.nativePath + "\n");
    throw error;
}
I'm thinking that the operation is completely disallowed, for good reasons. If you can duplicate a new FileReference through ActionScript code, then you'd also be able to manufacture a FileReference object through ActionScript code. Of course, that'd be a pretty bad security hole if you could force the upload of an arbitrary file.
Keeping a copy of the data in memory really isn't that bad of a solution. After all, it's temporary. The typical client computer should be able to manage a few hundred extra MB of data with no problem. It's certainly a better option than having their browser do two separate uploads, which is what your attempted solution would end up doing.
A completely different potential solution to this problem is to avoid image manipulation by Flex altogether. Flex could post the uploaded file directly to the server, and the server could do the image manipulation itself. Of course, if the manipulation is driven through user interactions, then that wouldn't work at all.
