Is there an API (SOAP, JSON, XML-RPC, REST, anything) to Google Code Issues? - google-code

Is there any programmatic way to read the data stored in a Google Code Project Hosting "Issues" tracker? Ideally it'd be nice to update/add issues too, but first I need some structured form of data from the tracker. I only see the HTML pages, and an Atom feed for issue updates.

Now there is one.
http://googlecode.blogspot.com/2009/10/issue-tracker-data-api-for-project.html

There doesn't seem to be but they have promised it's coming during the Google Wave introduction. The demo is at minute 61 and your question is answered at minute 78.

Related

LinkedIn share count API (/countserv/count/share) always returns "0"

Easily seen in the JSON result from this:
https://www.linkedin.com/countserv/count/share?url=https://www.linkedin.com
Which currently returns:
IN.Tags.Share.handleCount(
{
"count":0,
"fCnt":"0",
"fCntPlusOne":"1",
"url":"https:\/\/www.linkedin.com"
});
Apparently it affects most of the LinkedIn Share buttons/counters out there on the web, including WordPress and other blogs. This has been "broken" since late last week (13 Jan 2018).
I opened a ticket with LinkedIn support. Response was to post here, as this is where the LinkedIn Developers support resides. Hoping for a response that says "Oops, we'll fix this." Or, if deliberately crippled, an announcement that says so. (Twitter made a similar move a few years ago. It was unpopular among developers, but we've moved on.)
Further to Chris Hemedingers response, this feature has now been entirely deprecated
Deprecating the inShare Counter
As you can see they have deprecated this saying:
The share count on its own doesn’t fully reflect the impact that a piece of content delivers, and we encourage publishers and other content creators to leverage the inShare plugin as a way to drive conversation and engage with members on LinkedIn.
They then link to the Documentation saying:
Share on LinkedIn plugin will no longer return share count.
This is massively inconvenient for my company as we have just finished construction of a suite of tools powered by this.
The share count service is back in operation, working as it was before. The outage was deliberate (apparently) but temporary.
As far as I know this is an undocumented API, but it's integral to the "LinkedIn Share" buttons that are used in countless websites/blogs around the world. As such, LinkedIn has no contract/obligation to keep that service running...so consumers of the service in non-LinkedIn components should beware.
Thank you for the update! That was quite frustrating to track this down. I had to research the code, request from the API itself with multiple URLs, submitted a ticket to LinkedIn... and ultimately found myself here and read this. Just a recommendation, I think it would be better to return some kind of error code than a 0. Many people actually display the count on their sites.

Import.io - Can it replace Kimonolabs

I use Kimonolabs right now for scraping data from websites that have the same goal. To make it easy, lets say these websites are online shops selling stuff online (actually they are job websites with online application possibilities, but technically it looks a lot like a webshop).
This works great. For each website an scraper-API is created that goes trough the available advanced search page to crawl all product-url's. Let's call this API the 'URL list'. Then a 'product-API' is created for the product-detail-page that scrapes all necessary elements. E.g. the title, product text and specs like the brand, category, etc. The product API is set to crawl daily using all the URL's gathered in the 'URL list'.
Then the gathered information for all product's is fetched using Kimonolabs JSON endpoint using our own service.
However, Kimonolabs will quit its service end of february 2016 :-(. So, I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product-URL's from a paginated advanced search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy proces as Kimonolabs. Only, its unclear to me if paginating the URL's necesarry for the product-API and automatically keeping it up to date are supported.
Any import.io users here that can give advice if import.io is a usefull alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. Its a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS you can write CasperJS Endpoints and integrate any logic that you need to extract your data. It has a similar goal as Kimonolabs and can solve the same problems (if not more since its programmable).
If Extracty does not solve your needs you can checkout these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that much fond of Import.io, but seems to me it allows pagination through bulk input urls. Read here.
So far not much progress in getting the whole website thru API:
Chain more than one API/Dataset It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example if I want data that is found within category pages or paginated lists. I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor.Once set up once, I would like to be able to do this in one click more automatically.
P.S. If you are somehow familiar with JS you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from kimonolabs...You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab select "Bulk Extract" and add your URLs after this the scheduler will run daily or weekly.

Crawling wikipedia

I'm going through crawling wikipedia using website downloader for windows, i was looking through the whole options in this tool to find an option to download wikipedia pages for specific period, for example from 2005 untill now.
Does anyone get any idea about crawling the website in specific period of time ?
Why not download the SQL database containing all of Wikipedia?
You can then query it using SQL.
Give a try to the Wikipedia API and your programming skills.
There should be no need to do web scraping; use the MediaWiki API to directly request the information you want. I'm not sure what you mean by "wikipedia pages for a specific period" - do you mean last edited at a certain time? If so, while skimming, I noticed an API call that lets you get a look at the last n revisions; just ask for the last revision and see what its date is.
It depends if the website in question offers the archive and mostly don't so its not possible in a straightforward way to crawl a sample started from specific date. But you can implement some intelligence in your crawler to read the page created date or something like that.
But you can also look at Wikipedia API at http://en.wikipedia.org/w/api.php

Bugzilla: How to get an rss feed for bug comments?

I can see where to get an rss feed for the BUG LIST, however I would like to get rss updates for modifications to current bugs if possible.
This is quite high up when searching via Google for it, so I'm adding a bit of advertisement here:
As Bugzilla still doesn't support this I wrote a small web service supporting exactly this. You can find its source code here and a running instance here.
What you're asking for is the subject of this enhancement bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=256718
but no one seems to be working on it.
My first guess is that the way to do it is to add a template somewhere like template/en/default/bug/show.atom.tmpl with whatever you need. Put it in custom or an extension as needed.
If you're interested in working on it or helping someone with it, visit channel #mozwebtools on irc.mozilla.org.
Not a perfect solution, but with the resolution of bug #255606, Bugzilla now allows listing all bugs, by running a search with no criteria, and you can then get the results of the search in Atom format using the link in the bottom of the list.
From the release notes for 4.2:
Configuration: A new parameter search_allow_no_criteria has been added (default: on) which allows admins to forbid queries with no criteria. This is particularly useful for large installations with several tens of thousands bugs where returning all bugs doesn't make sense and would have a performance impact on the database.

How Do I Fetch All Old Items on an RSS Feed?

I've been experimenting with writing my own RSS reader. I can handle the "parse XML" bit. The thing I'm getting stuck on is "How do I fetch older posts?"
Most RSS feeds only list the 10-25 most recent items in their XML file. How do I get ALL the items in a feed, and not just the most recent ones?
The only solution I could find was using the "unofficial" Google Reader API, which would be something like
http://www.google.com/reader/atom/feed/http://fskrealityguide.blogspot.com/feeds/posts/default?n=1000
I don't want to make my application dependent on Google Reader.
Is there any better way? I noticed that on Blogger, I can do "?start-index=1&max-results=1000", and on WordPress I can do "?paged=5". Is there any general way to fetch an RSS feed so that it gives me everything, and not just the most recent items?
RSS/Atom feeds does not allow for historic information to be retrieved. It is up to the publisher of the feed to provide it if they want such as in the blogger or wordpress examples you gave above.
The only reason that Google Reader has more information is that it remembered it from when it came up the first time.
There is some information on something like this talked about as an extension to the ATOM protocol, but I don't know if it is actually implemented anywhere.
As the other replies here mentioned, a feed may not provide archival data but historical items may be available from another source.
Archive.org’s Wayback Machine has an API to access historical content, including RSS feeds (if their bots have downloaded it). I’ve created the web tool Backfeed that uses this API to regenerate a feed containing concatenated historical items. If you'd like to discuss the implementation in detail please get in touch.
In my experience with RSS, the feed is compiled by the last X items where X is a variable. Certain Feeds may have the full list, but for bandwidth sake most places are likely limiting to just the last few items.
The likely answer for google reader having the old info, is that it is storing it on its side for users later.
Further to what David Dean said the RSS/Atom feeds will only contain what the publisher of the feed has up at that moment and someone would need to be actively collecting this informaton in order to have any historical information. Basically Google Reader was doing this for free and when you interacted with it you could retrieve this stored informaton from the google database servers.
Now that they have retired the service, to my knowledge you have two choices. You either have to start collection of this information from your feeds of interest and store the data using XML or some such, or you could pay for this data from one of the companies who sell this type of archived feed information.
I hope this information helps somebody.
Seán
Another potential solution that might not have been available when the question was originally asked and shouldn't require any specific service.
Find the URL of the RSS feed you want and use waybackpack to get the archived urls for that feed.
Use FeedReader or a similar library to pull down the archived RSS feed.
Take the URLs from each feed and scrape them as you wish. If you're going way back in time it's possible there might be some dead links.
All previous answers more or less relied on existing services to still have a copy of that feed or the feed engine to be able to provide older items dynamically.
There's though another, admittedly pro-active and rather theoretical way to do so: Let your feedreader use a caching proxy which semantically understands RSS and/or Atom feeds and caches them on a per-item base up to as many items as you configure.
If the feedreader doesn't poll feeds regularily, the proxy could fetch known feeds time-based on its own to not miss an item in highly volatile feeds like the one from User Friendly which has only one item and changes every day (or at least used to do so). Hence if the feedreadere.g. crashed or lost network connection while you are away for a few days, you might loose items in your feedreader's cache. Having the proxy to fetch those feeds regularily (e.g. from a data center instead from at home or on a server instead of a laptop) allows you to easily run the feedreader only then and when without loosing items which were posted after your feedreader fetched feeds the last time but rotated out again before you fetch them the next time.
I call that concept a Semantic Feed Proxy and I've implemented a proof of concept implementation called sfp. It's though not much more than a proof of concept and I haven't developed it further. (So I'd be happy about hints to projects with similar ideas or purposes. :-)
Why does this problem exist?
Most RSS readers need to import feeds through a live URL, which makes things harder for sites that are unindexed on Wayback Machine.
The reason why Wayback Machine feeds can be imported is that the reader can regularly poll the server for updates according to its defined TTL configuration. The reader compares the current datetime with the RSS feed posts pubDate or lastBuildDate keys in the XML response. We can't hack the machine datetime to work around the datetime resolution because the current datetime is fetched live.
I've outlined an alternative solution without Wayback below. Unfortunately, I have not been able to find a universal solution for all feed sources.
Alternative Solution(s)
In my experience, NOT ALL feeds are partial though. The XML doesn't have to specify the datetime of each post. This means the RSS Reader doesn't have a datetime to filter the feed with. An example of this feed type can be found here.
This kind of reading experience is useful when chronological order is irrelevant, and the content doesn't need to be sorted. This approach is useful for sites where ALL the content is valuable, and the linked Essays of Paul Graham is a good example.
If the site has a generic, non-chronological feed option, subscribe to that RSS instead (the preferred option).
Download the linked timestamped .rss file, strip datetimes and host the file on your own server. Note, we can implement this via an AWS Lambda.
Set up a server that fetches the RSS from live.
Strip the pubDate tags from the XML file on fetch.
Host the modified RSS on your own server.
Note
These are suboptimal solutions due to loss of orders, however, I wanted to provide a potential alternative to WaybackMachine.
In addition, some existing answers require advanced SysDesign workarounds, more prework and in some cases are outdated (Google Reader is shut down). I hope it's helpful for those who really need a solution for a complete feed list. Constructing new RSS feeds is not too hard from the original RSS file.

Resources