Web scraping using Octoparse - web-scraping

I have been trying to use Octoparse to scrape data from a particular webpage.
It has a total of 361 pages with 10 data rows on each page (3,610 data points in total). However, I only get 3,260 data points.
Normally the process works fine and the event log looks like this:
[Click to Paginate] Waiting for Ajax to load...
[Click to Paginate] Ajax loaded.
Executing [Loop Item] variable list
[Loop Item] executing loop item #1
[Extract Data] Data successfully extracted
[Loop Item] executing loop item #2
[Extract Data] Data successfully extracted
[Loop Item] executing loop item #3
Executing [Loop Item] variable list
...
[Loop Item] executing loop item #10
[Extract Data] Data successfully extracted
[Loop Item] executed 10 times, exiting loop now.
Paginate
However, I have noticed that on multiple occasions the event log shows something like this:
[Click to Paginate] Waiting for Ajax to load...
[Click to Paginate] Ajax loaded.
Executing [Loop Item] variable list
[Loop Item] executing loop item #1
[Loop Item] executed 1 times, exiting loop now.
I have tried adjusting different wait times, but that did not work. Any advice on how to fix this would be much appreciated. Thanks!

Related

Vaadin 14 Lazy Loading Fetch iterations - Not what we expect

We are attempting to use the CallbackDataProvider class to perform lazy loading using a Grid component.
Our data source uses a JPA implementation with pagination.
With a page size of 20 and a query that returns 200 rows in the result set, the callback seems to perform only 2 fetches: the first for 20 rows, the second for the remaining 180 rows.
This is not what we expected; we were expecting 20 rows on each fetch, i.e. 10 fetches of 20 rows each for the 200 rows.
Is our expectation incorrect here?
With this paradigm, if there are 1000 or 2000 rows in the result set, I don't see how lazy loading is of any benefit, since fetching 980 rows on the second fetch defeats the purpose of lazy loading.
Anyone have a similar experience or is there something we are missing?
The actual buffer size of the loaded data is determined by the component's web client part; the page size is only an initial parameter. The default page size is 50, which under normal circumstances leads the Grid to load 100 items at a time. If the web client determines that the page size is too small based on its visual size, it will request a larger buffer. Page sizes as small as 20 usually do not work well.

How to count agents and save the number to a variable

I would like to count the number of agents that exit the Sink and save that number continuously to my variable "Loads" while I run the simulation.
https://imgur.com/rAUQ52n
AnyLogic suggests the following, but I can't seem to get it right.
long count() - returns the number of agents exited via this Sink block.
Hope some of you can help me out.
It should be easy... you don't even need a variable, since sink.count() always has the information you need.
But if you insist on creating one, you can also create a variable called counter of type int and, in the Sink's action, just use counter++;

Reset Date or DateTime data item in Blue Prism

I am reading a queue and using an Action stage to "Get Item Data" from the "Work Queue" business object. The purpose of my process is to prepare a report on the status of the queue items. The "Get Item Data" action expects one input, which is the queue item ID, and returns a bunch of outputs such as Key, Status, Completed DateTime, Exception DateTime, etc.
I generated Data Items for all of the outputs of the "Get Item Data" Action stage. I then created a loop to go over all the queue records, populate the generated data items, and then use the information in the data items to capture the details for my reporting.
The issue I am having is that when the loop goes to the next item in the queue, it does not entirely reset the data items. For example, if the first record in the queue was in completed status, the "Completed DateTime" data item is populated with that date and time. If the next record in the queue is an exception, it populates the "Exception DateTime" data item, which is good, but it doesn't overwrite the "Completed DateTime" data item with a blank value. It keeps the date from the previous record.
In my process, I check "Completed DateTime" and "Exception DateTime" in order to determine the status of the record and update my report. The solution I thought of is to add a Calculation stage to reset the data items, but I can't seem to reset a DateTime data item: it does not accept the empty quotes "". Any suggestions would be greatly appreciated!
FYI, one of the output items is called "Status", but it is not populated with any information. Otherwise, this would have been very easy.
Disclaimer: This may not be the ideal solution, but it'll work!
Use the Calculation Stage at the end of the loop, but as you cannot set a DateTime object to 'empty', how about you set them to an odd date? E.g. 01-01-4000 00:00:00.
After you finish your initial loop to populate the report (I assume something similar to Excel), you create another loop over your report and replace all the odd dates with empty cells. Alternatively, you write a macro to get rid of them all at once without the need to loop.
The best solution of course would be to properly populate the Status column in your queue, but this requires access to the code and permission to alter it (and time to do so).

Paginating chronologically prioritized Firebase children

tl;dr Performing basic pagination in Firebase via startAt, endAt and limit is terribly complicated; there must be an easier way.
I'm constructing an administration interface for a large number of user submissions. My initial (and current) idea is to simply fetch everything and perform pagination on the client. There is, however, a noticeable delay when fetching 2000+ records (containing 5-6 small number/string fields each), which I attribute to a data payload of over 1.5 MB.
Currently all entries are added via push, but I'm a bit lost as to how to paginate through the huge list.
To fetch the first page of data I'm using endAt with a limit of 5:
ref.endAt().limit(5).on('child_added', function(snapshot) {
  console.log(snapshot.name(), snapshot.val().name)
})
which results in the following:
-IlNo79stfiYZ61fFkx3 #46 John
-IlNo7AmMk0iXp98oKh5 #47 Robert
-IlNo7BeDXbEe7rB6IQ3 #48 Andrew
-IlNo7CX-WzM0caCS0Xp #49 Frank
-IlNo7DNA0SzEe8Sua16 #50 Jimmmy
Firstly, to figure out how many pages there are, I am keeping a separate counter that's updated whenever someone adds or removes a record.
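Something along these lines, where the recordCount location is just an example name and ref is the list reference used above:
var countRef = ref.parent().child('recordCount')   // sibling of the submission list

function adjustCount(delta) {
  // transaction() avoids clobbering concurrent updates to the counter
  countRef.transaction(function(current) {
    return (current || 0) + delta
  })
}

adjustCount(1)    // call after push()-ing a new record
adjustCount(-1)   // call after remove()-ing a record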
Secondly, since I'm using push I have no way of navigating to a specific page, because I don't know the name of the last record on a given page, so a pagination interface that jumps directly to a specific page is not currently possible.
To make it simpler I decided to simply have next/previous buttons. This, however, also presents a new problem: if I use the name of the first record in the previous result set, I can paginate to the next page using the following:
ref.endAt(null, '-IlNo79stfiYZ61fFkx3').limit(5).on('child_added', function(snapshot) {
  console.log(snapshot.name(), snapshot.val().name)
})
The result of this operation is as follows:
-IlNo76KDsN53rB1xb-K #42 William
-IlNo77CtgQvjuonF2nH #43 Christian
-IlNo7857XWfMipCa8bv #44 Jim
-IlNo78z11Bkj-XJjbg_ #45 Richard
-IlNo79stfiYZ61fFkx3 #46 John
Now I have the next page, except it's shifted one position, meaning I have to adjust my limit and ignore the last record.
To move back one page I'll have to keep a separate list on the client of every record I've received so far and figure out what name to pass to startAt.
Is there an easier way of doing this or should I just go back to fetching everything?
We're working on adding an "offset()" query to allow for easy pagination. We'll also be adding a special endpoint to allow you to read the number of children at a location without actually loading them from the server.
Both of these are going to take a bit though. In the meantime, the method you describe (or doing it all on the client) is probably your best bet.
If you have a data structure that is append-only, you could potentially also do pagination when you write the data. For example: put the first 50 in /page1, put the second 50 in /page2, etc.
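A rough sketch of that write-time approach, assuming the legacy JS API used in the question and a hypothetical rootRef under which the pages live (the caller needs to know how many records already exist, e.g. via a counter):
var PAGE_SIZE = 50

function addRecord(record, totalSoFar) {
  var pageNumber = Math.floor(totalSoFar / PAGE_SIZE) + 1   // 1-based: /page1, /page2, ...
  rootRef.child('page' + pageNumber).push(record)           // first 50 land in /page1, next 50 in /page2
}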
Another way to accomplish this is with two trees:
Tree of ids: {
  id1: id1,
  id2: id2,
  id3: id3
}
Tree of data: {
  id1: ...,
  id2: ...,
  id3: ...
}
You can then load the entire tree of ids (or a big chunk of it) to the client, and do fancy pagination with that tree of ids.
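For example, with the legacy JS API used in the question, where rootRef, the ids/data child names and the page size are all just placeholders:
var PAGE_SIZE = 5
var idsRef = rootRef.child('ids')     // tree of ids: small and cheap to load in full
var dataRef = rootRef.child('data')   // tree of the actual records

function loadPage(page) {
  idsRef.once('value', function(snap) {
    var allIds = []
    snap.forEach(function(child) { allIds.push(child.name()) })   // keeps Firebase ordering
    var pageIds = allIds.slice(page * PAGE_SIZE, (page + 1) * PAGE_SIZE)
    pageIds.forEach(function(id) {
      dataRef.child(id).once('value', function(itemSnap) {
        console.log(id, itemSnap.val().name)   // render one row of the page
      })
    })
  })
}

loadPage(0)   // first page
loadPage(1)   // second page, and so on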
Here's a hack for paginating in each direction.
// get 1-5
ref.startAt().limit(5)
// get 6-10 from 5
ref.startAt(null, '5th-firebase-id' + 1).limit(5)
// get 11-15 from 10
ref.startAt(null, '10th-firebase-id' + 1).limit(5)
Basically it's a hack for startAtExclusive(). You can append anything to the end of the id.
Also figured out endAtExclusive() for going backwards.
// get 6-10 from 11
ref.endAt(null, '11th-firebase-id'.slice(0, -1)).limit(5)...
// get 1-5 from 6
ref.endAt(null, '6th-firebase-id'.slice(0, -1)).limit(5)...
Will play with this some more, but it seems to work with push ids. Replace limit with limitToFirst or limitToLast if using Firebase queries.

Firefox Extension executeAsync Returns only 15 rows at a time

I am developing a Firefox extension which reads and writes to an SQLite database. I ran an async query to fetch 20 rows from the database, and the callback function which handles the receipt of data gets called twice: the first time it returns 15 rows and the second time it returns the last 5. Is this a standard value? If so, can this value be changed?
Yes, executeAsync will return a result after at most 15 rows and 75ms execution time. No, this cannot be changed - the thresholds are hardcoded.
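In practice this just means the callback should accumulate rows across handleResult calls and only treat the result as complete in handleCompletion. A minimal sketch (the column name "title" is made up):
var rows = []
statement.executeAsync({
  handleResult: function(aResultSet) {
    // may be invoked several times per query (roughly every 15 rows / 75ms)
    for (var row = aResultSet.getNextRow(); row; row = aResultSet.getNextRow()) {
      rows.push(row.getResultByName("title"))
    }
  },
  handleError: function(aError) {
    Components.utils.reportError("Query failed: " + aError.message)
  },
  handleCompletion: function(aReason) {
    if (aReason === Components.interfaces.mozIStorageStatementCallback.REASON_FINISHED) {
      // all 20 rows are available here, no matter how many batches they arrived in
      console.log("Got " + rows.length + " rows")
    }
  }
})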
