Can an Apify project contain several crawlers? - web-scraping

I was searching the documentation for it but wasn't able to find any related article.
I want to know if I can have several crawlers defined in a Apify project just like you can have several Spiders on Scrapy or if I have to create a new project for each new website that I like to crawl.
I would appreciate any response, thank you in advance!

Yes, you can create as many crawler instances you need/want.
It's usually a good thing to separate things like sitemap crawling, using it's own CheerioCrawler/BasicCrawler instances with specific settings and an specific queue, then the full scraper using the desired crawler, like PuppeteerCrawler, also using it's own queue if needed.
You can choose to run them in parallel with
await Promise.all([
crawler1.run(),
crawler2.run(),
]);
or one at a time, using
await crawler1.run();
await crawler2.run();
the caveat when using Promise.all is that if they are reading/writing to the same key-value store, you might have some racing conditions. If they don't share any state, you should be good to go.

Related

What is the best practice for server-side code in Meteor?

I'm new to the world of coding and Web dev and for my first real project, I've started building a quiz web app using Meteor.
Long story short, the app basically displays a random question to the user, then takes in their answer and gets feedback (it's a bit more complicated than that, but for the purposes of this question, that's the main functionality).
I've managed to get it working, but pretty much everything (apart from account creation and that kind of stuff) is done on the client-side (such as getting the random qn) - which I'd imagine is not very secure..
I'd like to move most of the calculations and operations on the server, but I don't want to publish any of the Questions collection to the client, since that means the client can essentially change it and/or view the correct answer.
So, my question is, would it be considered 'bad practice' if I don't publish anything to the client (except their user document) and basically do everything through Meteor methods (called on the client, and executed server-side)?
I've already tried implementing it and so far everything's working fine, but was just wondering whether it's good practice. Would it hurt performance in any way?
I've searched online for a while, but couldn't really find a definitive answer, hence my post here... TIA
The below example pulled right from the documentation showing how to omit fields.
// Server: Publish the `Rooms` collection, minus secret info...
Meteor.publish('rooms', function () {
return Rooms.find({}, {
fields: { secretInfo: 0 }
});
});

Import.io - Can it replace Kimonolabs

I use Kimonolabs right now for scraping data from websites that have the same goal. To make it easy, lets say these websites are online shops selling stuff online (actually they are job websites with online application possibilities, but technically it looks a lot like a webshop).
This works great. For each website an scraper-API is created that goes trough the available advanced search page to crawl all product-url's. Let's call this API the 'URL list'. Then a 'product-API' is created for the product-detail-page that scrapes all necessary elements. E.g. the title, product text and specs like the brand, category, etc. The product API is set to crawl daily using all the URL's gathered in the 'URL list'.
Then the gathered information for all product's is fetched using Kimonolabs JSON endpoint using our own service.
However, Kimonolabs will quit its service end of february 2016 :-(. So, I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product-URL's from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy proces as Kimonolabs. Only, its unclear to me if paginating the URL's necesarry for the product-API and automatically keeping it up to date are supported.
Any import.io users here that can give advice if import.io is a usefull alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. Its a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS you can write CasperJS Endpoints and integrate any logic that you need to extract your data. It has a similar goal as Kimonolabs and can solve the same problems (if not more since its programmable).
If Extracty does not solve your needs you can checkout these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that much fond of Import.io, but seems to me it allows pagination through bulk input urls. Read here.
So far not much progress in getting the whole website thru API:
Chain more than one API/Dataset It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example if I want data that is found within category pages or paginated lists. I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor.Once set up once, I would like to be able to do this in one click more automatically.
P.S. If you are somehow familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from kimonolabs...You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab select "Bulk Extract" and add your URLs after this the scheduler will run daily or weekly.

Endpoint design for a data retrieval orientated ASP.NET webapi

I am designing a system that uses asp.net webapi to serve data that is used by a number of jquery grid controls. The grids call back for the data after the page has loaded. I have a User table and a Project table. In between these is a Membership table that stores the many to many relationships.
User
userID
Username
Email
Project
projectID
name
code
Membership
membershipID
projectID
userID
My question is what is best way to describe this data and relationships as a webapi?
I have the following routes
GET: user // gets all users
GET: user/{id} // gets a single user
GET: project
GET: project/{id}
I think one way to do it would be to have:
GET: user/{id}/projects // gets all the projects for a given user
GET: project/{id}/users // gets all the users for a given project
I'm not sure what the configuration of the routes and the controllers should look like for this, or even if this is the correct way to do it.
Modern standard for that is a very simple approach called REST Just read carefully and implement it.
Like Ph0en1x said, REST is the new trend for web services. It looks like you're on the right track already with some of your proposed routes. I've been doing some REST design at my job and here are some things to think about:
Be consistent with your routes. You're already doing that, but watch out for when/if another developer starts writing routes. A user wants consistent routes for using your API.
Keep it simple. A major goal should be discoverability. What I mean is that if I'm a regular user of your system, and I know there are users and projects and maybe another entity called "goal" ... I want to guess at /goal and get a list of goals. That makes a user very happy. The less they have to reference the documentation, the better.
Avoid appending a ton of junk to the query string. We suffer from this currently at my job. Once the API gets some traction, users might want more fine grained control. Be careful not to turn the URL into something messy. Something like /user?sort=asc&limit=5&filter=...&projectid=...
Keep the URL nice and simple. Again I love this in a well design API. I can easily remember something like http://api.twitter.com. Something like http://www.mylongdomainnamethatishardtospell.com/api/v1/api/user_entity/user ... is much harder to remember and is frustrating.
Just because a REST API is on the web doesn't mean it's all that different than a normal method in client side only code. I've read arguments that any method should have no more than 3 parameters. This idea is similar to (3). If you find yourself wanting to expand, consider adding more methods/routes, not more parameters.
I know what I want in a REST API these days and that is intuition, discoverability, simplicity and to avoid having to constantly dig through complex documentation.

Can newsfeeds be versioned in Plone?

Articles in Plone go through an author-editor-publisher process. Why are feeds the exception? Is there a way to do this with feeds/news through the adding a news item process?
Feeds follow the same pattern; they will pick up on whatever criteria you have specified in e.g. a Collection. If you only show things that are "Published", that will also be the case in the associated feed.
If you're asking for a separate workflow/approval process for the feed itself, that's not how it works. (And I have problems seeing what kind of problem it would solve outside of adding more complexity and more process :)

To Develop LMS and Scorm Sequesncing Engine

We want a LMS(coded in ASP.NET/vb.net) which is able to import SCORM packages & display it to learner for viewing content. I am totally new to SCORM and have been shifted to this project. I want to know how can I access SCORM Assessment object's (Test) result, like Learner ID, passed/fail, time.
Can you please guide me what will I need to implement in ASP.NET code to accomplish my goal ?
Task that I have done so far is,
Reading a manifest zip file, unzipping the file and get all information from the file(content name,description,items and launching page) and when user clicks on a particular course a pop up window is launching the page.
I eagerly want to know what I can do next to communicate with the LMS with the APIs. Shall I need to develop my own LMS to get the result,If there is a quiz which is running, all I need to know is the no of questions attempted by the user, whether the user is pass or fail and I need to store all information in the database for individual user so that I can review the result afterwards.
So the task remaining.
Tracking mechanism to deliver the content.
SCORM/LMS sequencing engine that controls the navigation between parts of SCORM conformant course.
Please help.
SLK at codeplex provides a good starting point. However, if you are truly wanting to provide an in-house written SCORM play that is fully compliant, you have a major task ahead of you. In essence there are three party you need to fully develop:
CAM - the unzipping process, which it sounds like you have already achieved.
RTE - the javascript host for SCORM, providing the 8 specified methods. Behind this you also need to implement the SCORM object model, which SLC does help with. If you have implemented all of this, then there should be data entries on the data model that indicate completion etc.
SN - the sequencing and navigation processing. This is significantly the most complex part. I am still in the process of trying to implement this, using SLC, and it is hard. It is the completion of this that will potentially give you more information that will enable you to know what has been done.
it is also worth looking at scorm.com, who are a consultancy, but provide a lot of useful information about the scorm standard.
That is true. SCORM is one of these stadarts where you can implement as little as possible. But you will need some of Javascript with a Backend-Script (JSON to the rescue) so you can track the scorm data, and save it your database.
But let me tell you this: This is the easiest task! Making your own course-creator is a whole other beast.

Resources