I am developing a project that uses the WP REST API. After some tests, I realized that it is not retrieving all the categories I am trying to GET. To be precise, there are supposed to be 21 results, but only 10 come up. Is there some kind of restriction that I am not seeing? Are there any settings that I have to change?
Here is what I am trying:
I saw this post here, but some of the answers relied on using the WordPress JSON API, which is a plugin that is no longer available due to security concerns.

I had to look at the documentation more closely. In short, the default response returns 10 items per page. Specifying the per_page parameter changes that. So:
(Change X to the respective number)
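The request itself is missing from the answer above, so here is a minimal sketch of what it could look like. The site `https://example.com` and the helper names are placeholders of mine; `per_page` and `page` are the REST API's pagination parameters, and the API caps `per_page` at 100, so larger collections have to be fetched page by page.

```python
from urllib.parse import urlencode

MAX_PER_PAGE = 100  # the REST API rejects per_page values above 100


def categories_url(site, per_page=10, page=1):
    """Build a categories request URL with explicit pagination."""
    if not 1 <= per_page <= MAX_PER_PAGE:
        raise ValueError("per_page must be between 1 and 100")
    query = urlencode({"per_page": per_page, "page": page})
    return f"{site.rstrip('/')}/wp-json/wp/v2/categories?{query}"


def pages_needed(total_items, per_page):
    """Number of requests needed to retrieve total_items results."""
    return -(-total_items // per_page)  # ceiling division


# 21 categories fit in a single request once per_page is raised.
print(categories_url("https://example.com", per_page=21))
# With the default of 10 per page, 21 categories would need 3 requests.
print(pages_needed(21, 10))
```

The response also carries the headers X-WP-Total and X-WP-TotalPages, which tell you how many items and pages exist for the query.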


How to specify fields in the WordPress REST API v2

I'm using the WordPress REST API in my project and sending a GET request to:
and it works fine, but I want to specify which fields are returned and I don't know how. I have looked at the documentation and still can't work out how to do it. For example, using the public API:
returns only the specified fields. How do I do this with the v2 API?
Here's how to access a list of titles and excerpts using REST API v2:
As it states there, the REST API v2 returns a certain set of default fields, and if you want different ones, then you have to implement this as described in that document.
The easy solution to this issue would be to use the ACF to REST API or an equivalent plugin that can extend the REST API for you. I have used this plugin on many sites successfully.
If this is not possible then you will need to modify the response as has been outlined by other answers. You can read more about that here.
You can use ?_fields[]=title&_fields[]=excerpt
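As a rough sketch of how such a URL could be built, the helper below supports both the array form shown above and the comma-separated form (`?_fields=title,excerpt`) that the `_fields` parameter also accepts. The site `https://example.com` and the function name `posts_url` are placeholders, not part of the original answers.

```python
from urllib.parse import urlencode


def posts_url(site, fields, array_syntax=False):
    """Build a posts request URL limited to the given top-level fields."""
    base = f"{site.rstrip('/')}/wp-json/wp/v2/posts"
    if array_syntax:
        # ?_fields[]=title&_fields[]=excerpt, as in the answer above
        query = urlencode([("_fields[]", name) for name in fields], safe="[]")
    else:
        # ?_fields=title,excerpt, the comma-separated form
        query = urlencode({"_fields": ",".join(fields)}, safe=",")
    return f"{base}?{query}"


print(posts_url("https://example.com", ["title", "excerpt"]))
print(posts_url("https://example.com", ["title", "excerpt"], array_syntax=True))
```

Restricting the fields trims the response on the server side, so it reduces payload size as well as the amount of parsing the client has to do.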

Import.io - Can it replace Kimonolabs

I currently use Kimonolabs to scrape data from websites that all serve the same purpose. To keep it simple, let's say these websites are online shops (they are actually job websites with online application forms, but technically they look a lot like a webshop).
This works great. For each website, a scraper API is created that goes through the available advanced search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text, and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
The gathered information for all products is then fetched by our own service through Kimonolabs' JSON endpoint.
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced search page?
I'm tinkering with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. However, it's unclear to me whether it supports paginating the URLs necessary for the product API and automatically keeping the data up to date.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS, you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not meet your needs, you can check out these other market players that aim for similar goals:
Import.io (as you already mentioned)
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me that it allows pagination through bulk input URLs. Read here.
So far there has not been much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this in one click, more automatically.
P.S. If you are somewhat familiar with JS, you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing it myself after migrating from Kimonolabs. You can enable it for your own APIs by appending &bulkSchedule=1 to your API URL. You will then see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this, the scheduler will run daily or weekly.
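If you have many API URLs to adjust, appending the parameter can be scripted. The sketch below is generic URL handling only: the `bulkSchedule=1` parameter comes from the answer above, while the example URL and the helper name `with_query_param` are placeholders of mine.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, key, value):
    """Return url with key=value set in its query string (replacing any old value)."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL; substitute the URL of your own API.
print(with_query_param("https://example.com/api?id=abc", "bulkSchedule", "1"))
```

Parsing the query string instead of concatenating text means the result is valid whether or not the URL already has parameters, and applying the helper twice does not duplicate the flag.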

Crawling Wikipedia

I'm crawling Wikipedia using a website downloader for Windows. I looked through all the options in this tool for a way to download Wikipedia pages for a specific period, for example from 2005 until now.
Does anyone have an idea how to crawl a website for a specific period of time?
Why not download the SQL database containing all of Wikipedia?
You can then query it using SQL.
Give the Wikipedia API and your programming skills a try.
There should be no need to do web scraping; use the MediaWiki API to directly request the information you want. I'm not sure what you mean by "wikipedia pages for a specific period" - do you mean last edited at a certain time? If so, while skimming, I noticed an API call that lets you get a look at the last n revisions; just ask for the last revision and see what its date is.
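A minimal sketch of that idea, assuming the English Wikipedia endpoint and the standard `action=query` / `prop=revisions` parameters of the MediaWiki API. The helper names and the sample response are illustrative only; the sample contains made-up values and merely mirrors the default JSON layout, in which pages are keyed by page ID.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def last_revision_url(title):
    """URL asking for the timestamp of the most recent revision of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp",
        "rvlimit": 1,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def last_edited(response):
    """Extract the newest revision timestamp from a decoded JSON response."""
    page = next(iter(response["query"]["pages"].values()))
    stamp = page["revisions"][0]["timestamp"]  # ISO 8601, UTC
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Made-up response used only to show the parsing step.
sample = {
    "query": {
        "pages": {
            "1": {
                "title": "Example",
                "revisions": [{"timestamp": "2015-06-01T12:34:56Z"}],
            }
        }
    }
}

print(last_revision_url("Web crawler"))
cutoff = datetime(2005, 1, 1, tzinfo=timezone.utc)
print(last_edited(sample) >= cutoff)
```

Fetch the URL with any HTTP client, decode the JSON, and pass the result to `last_edited` to decide whether a page falls inside the period you care about.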
It depends on whether the website in question offers an archive; most don't, so there is no straightforward way to crawl a sample starting from a specific date. But you can build some intelligence into your crawler to read the page's creation date or something similar.
You can also look at the Wikipedia API at http://en.wikipedia.org/w/api.php

Bugzilla: How to get an RSS feed for bug comments?

I can see where to get an RSS feed for the BUG LIST; however, I would like to get RSS updates for modifications to current bugs, if possible.
This is quite high up when searching via Google for it, so I'm adding a bit of advertisement here:
As Bugzilla still doesn't support this I wrote a small web service supporting exactly this. You can find its source code here and a running instance here.
What you're asking for is the subject of this enhancement bug:
but no one seems to be working on it.
My first guess is that the way to do it is to add a template somewhere like template/en/default/bug/show.atom.tmpl with whatever you need. Put it in custom or an extension as needed.
If you're interested in working on it or helping someone with it, visit channel #mozwebtools on irc.mozilla.org.
Not a perfect solution, but with the resolution of bug #255606, Bugzilla now allows listing all bugs by running a search with no criteria, and you can then get the results of the search in Atom format using the link at the bottom of the list.
From the release notes for 4.2:
Configuration: A new parameter search_allow_no_criteria has been added (default: on) which allows admins to forbid queries with no criteria. This is particularly useful for large installations with several tens of thousands bugs where returning all bugs doesn't make sense and would have a performance impact on the database.
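Once you have the Atom output of such a search, reading it takes only the standard library. The sketch below assumes a standard Atom 1.0 document; the sample feed is made up for illustration, and the function name `bug_entries` is mine. Whether a given Bugzilla installation bumps `updated` on every comment is something to verify against your own instance.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def bug_entries(atom_xml):
    """Return (title, updated) pairs for every entry in an Atom feed."""
    root = ET.fromstring(atom_xml)
    return [
        (
            entry.findtext("atom:title", default="", namespaces=ATOM),
            entry.findtext("atom:updated", default="", namespaces=ATOM),
        )
        for entry in root.findall("atom:entry", ATOM)
    ]


# Made-up feed, used only to demonstrate the parsing.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Bug list</title>
  <entry>
    <title>[Bug 1] Example summary</title>
    <updated>2012-03-01T10:00:00Z</updated>
  </entry>
  <entry>
    <title>[Bug 2] Another summary</title>
    <updated>2012-03-02T11:30:00Z</updated>
  </entry>
</feed>"""

for title, updated in bug_entries(SAMPLE):
    print(updated, title)
```

Storing the `updated` value per bug between polls lets you notice which bugs changed since the last run.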

Access to old, no longer available, feed entries

I am working on a project that requires reliable access to historic feed entries which are not necessarily available in the current feed of the website. I have found several ways to access such data, but none of them give me all the characteristics I need.
Look at this as a brainstorm. I will tell you what I have found so far, and you can contribute if you have any other ideas.
Google AJAX Feed API - will limit you to 250 items
Unofficial Google Reader API - Perfect but unofficial and therefore unreliable (and perhaps quasi-illegal?). Also, the authentication seems to be tricky.
Spinn3r - Costs a lot of money
Spidering the internet archive at the site of the feed - Lots of complexity, spotty coverage, only useful as a last resort
Yahoo! Feed API or Yahoo! Search BOSS - The first looks more like an aggregator, meaning I'd need a different registration for each feed and the second should give more access to Yahoo's data but I can find no mention of feeds.
(thanks to Lou Franco) Bloglines Sync API - Besides the problem of needing an account and being designed more as an aggregator, it does not have a way to add feeds to the account. So no retrieval of arbitrary feeds. You need to manually add them through the reader first.
Other search engines/blog search/whatever?
This is a really irritating problem as we are talking about semantic information that was once out there, is still (usually) valid, yet is difficult to access reliably, freely and without limits. Anybody know any alternative sources for feed entry goodness?
Bloglines has an API to sync accounts
You have to make an account and subscribe to the feed you want to download, but then you can download based on date, which can be way in the past. I'm not sure of the terms.
The best answer I've found so far is this: Google Reader's unofficial API turns out to have a public access point for its feeds, which means there is no authentication needed. Usage is as follows:
http://www.google.com/reader/public/atom/feed/{your feed uri here}?n=1000
Replace the text in the curly braces (including the braces themselves) with the feed URI you're interested in. More information about the precise arguments can be found here:
but remember to use the /public/ URL if you don't want to deal with authentication.
