Scraping BRfares for train fares - web-scraping

I am looking for advice. The following website
http://brfares.com/#home
provides fares information for UK train lines. I would like to use it to build a database of travel costs for season tickets from different locations. I have never done this kind of thing before, but I have experience with Python/Bash scripting and some HTML.
Viewing the source code for a typical query, the actual fare information is not present in index.html. Can anyone provide a pointer on how to go about scraping (a new word for me) this information?

This is the URL for the query: http://brfares.com/querysimple?orig=SUY&dest=0415&rlc=
The response is a JSON object.
First you need to build a lookup table of all the location codes. You can use the following link to do that: http://brfares.com/ac_loc?term=. Query it for every letter of the alphabet and then parse the results into a unique list.
Then take the codes in pairs, execute the JSON query for each pair, parse the returned JSON, and feed the data into a database.
Now you can do whatever you want with that database.
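A rough Python sketch of that approach (the exact field names in the JSON returned by ac_loc and querysimple are assumptions here, so check a real response and adjust the keys):

    import string
    import time
    import requests

    BASE = "http://brfares.com"

    def build_code_lookup():
        """Collect location codes by querying the autocomplete endpoint for each letter."""
        codes = {}
        for letter in string.ascii_lowercase:
            resp = requests.get(f"{BASE}/ac_loc", params={"term": letter})
            resp.raise_for_status()
            for item in resp.json():
                # Assumed fields: each autocomplete entry appears to carry a code and a name.
                codes[item.get("code", item.get("value"))] = item.get("label", item.get("name"))
            time.sleep(0.5)  # be polite to the site
        return codes

    def fetch_fares(orig, dest):
        """Query the simple fares endpoint for one origin/destination pair."""
        resp = requests.get(f"{BASE}/querysimple",
                            params={"orig": orig, "dest": dest, "rlc": ""})
        resp.raise_for_status()
        return resp.json()  # parse this and insert the rows you need into your database

    if __name__ == "__main__":
        lookup = build_code_lookup()
        print(f"Found {len(lookup)} location codes")
        # Example pair taken from the query URL above:
        print(fetch_fares("SUY", "0415"))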

Related

How to retrieve resources based on different conditions using GET in RESTful api?

As per the REST framework, we can access resources using the GET method, which is fine if I know the key of my resource. For example, for getting a transaction, if I pass transaction_id then I can get my resource for that transaction. But when I want to access all transactions between two dates, how should I write my REST method using GET?
For getting a transaction by transaction_id: GET /transaction/id
For getting transactions between two dates: ???
Also, if there are other conditions I need to apply, like the latest 10 transactions or the oldest 10 transactions, how should I write my URL, since the URL is the main key in REST?
I tried looking on Google but was not able to find an approach which is completely RESTful and solves my queries, so I am posting my question here. I have a clear understanding of POST and DELETE, but if I want to do the same kind of conditional update using PUT for some resource, how do I do that?
There are collection and item resources in REST.
If you want to get a representation of an item, you usually use a unique identifier:
/books/123
/books/isbn:32t4gf3e45e67 (not a valid isbn)
or with a template:
/books/{id}
/books/isbn:{isbn}
If you want to get a representation of a collection, or a reduced collection, you use the unique identifier of the collection and add some filters to it:
/books/since:{fromDate}/to:{toDate}/
/books/?since="{fromDate}"&to="{toDate}"
The filters can go into the path or into the query string part of the URL.
In the response you should add links with these URLs (aka HATEOAS), which REST clients can follow. You should use link relations, for example IANA link relations, to describe those links, and linked data, for example schema.org, to describe the data in your representation. There are other vocabularies as well, for example GoodRelations, and of course you can write your own vocabulary for your application.
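To make the query-string style concrete, here is a minimal Flask sketch; the /transactions resource and the since/to parameter names are illustrative choices, not something prescribed by REST:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Toy in-memory data so the example is runnable.
    TRANSACTIONS = [
        {"id": 1, "date": "2014-01-05", "amount": 42.0},
        {"id": 2, "date": "2014-02-17", "amount": 13.5},
        {"id": 3, "date": "2014-03-02", "amount": 99.9},
    ]

    @app.route("/transactions/<int:tx_id>")
    def get_item(tx_id):
        """Item resource: /transactions/123"""
        for tx in TRANSACTIONS:
            if tx["id"] == tx_id:
                return jsonify(tx)
        return jsonify({"error": "not found"}), 404

    @app.route("/transactions/")
    def get_collection():
        """Collection resource with optional filters:
        /transactions/?since=2014-01-01&to=2014-02-28"""
        since = request.args.get("since", "0001-01-01")
        to = request.args.get("to", "9999-12-31")
        items = [tx for tx in TRANSACTIONS if since <= tx["date"] <= to]
        return jsonify({"items": items, "count": len(items)})

    if __name__ == "__main__":
        app.run(debug=True)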

Where can I find a public URL that returns a dataset of approx 3000 rows in JSON format for testing?

I need to put together a jsBin example that demonstrates a problem I'm having with some UI controls, which doesn't manifest itself with only a few records. I need a dataset of about 3000-5000 rows in JSON format that can be obtained via a URL by an AJAX XHR call. Can someone suggest a website with possibly government or open-source data that can be used for such testing?
P.S. It can't just be a download of a zipped file that can be expanded into a JSON text file. I need a JSON XHR response.
P.P.S. Ideally it would have 50-75 distinct values in one of the columns so I could demonstrate a grouping/aggregation issue. Data by US state or by zip code within a state would be excellent.
P.P.P.S. I've been searching the internet and found this site, now trying to figure out how to get JSON instead of XML:
http://www.sba.gov/about-sba-services/7617#city-county-state
All you have to do is this:
http://www.sba.gov/about-sba-services/7617#city-county-state/NY.json
You can find a lot of open data here: free open data
Have you looked at Freebase? There should be a query to get you that many rows, and they offer JSON responses.
EDIT: There's a similar site, DBPedia. I built this query, which will return JSON and has about 3k rows:
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3FConcept+where+%7B%5B%5D+a+%3FConcept%7D+LIMIT+3000&format=json%2Fhtml&timeout=30000&debug=on
You can go here and customize the query if you need more data.
-Ken
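If you want to pull that result set from a script rather than a browser, a short Python sketch along those lines (asking for plain json output instead of the json/html view used in the link above, and assuming the standard SPARQL JSON results layout):

    import requests

    ENDPOINT = "http://dbpedia.org/sparql"
    QUERY = "select distinct ?Concept where {[] a ?Concept} LIMIT 3000"

    resp = requests.get(ENDPOINT, params={
        "default-graph-uri": "http://dbpedia.org",
        "query": QUERY,
        "format": "json",
        "timeout": 30000,
    })
    resp.raise_for_status()
    data = resp.json()

    rows = data["results"]["bindings"]
    print(len(rows), "rows")
    print(rows[0]["Concept"]["value"])  # first concept URI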
Why not create a page with a loop that generates those records as you desire? It shouldn't be so hard.
Maybe a Java servlet.
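The same idea in Python rather than a servlet, assuming Flask is available; the field names and the made-up state values are purely illustrative:

    import random
    from flask import Flask, jsonify

    app = Flask(__name__)
    STATES = [f"STATE_{i:02d}" for i in range(50)]  # stand-in for real US states

    @app.route("/fake-data/<int:count>")
    def fake_data(count):
        # Fabricate "count" rows with a grouping column of ~50 distinct values.
        rows = [
            {
                "id": i,
                "state": random.choice(STATES),
                "value": round(random.uniform(0, 1000), 2),
            }
            for i in range(count)
        ]
        return jsonify(rows)

    if __name__ == "__main__":
        app.run(port=5000)  # then request e.g. http://localhost:5000/fake-data/3000

For a jsBin test you would still need to host it somewhere reachable and send a CORS header such as Access-Control-Allow-Origin.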

How to retrieve a particular set of information using the Topic API?

I'm a newbie with the Freebase Topic API. Currently I am trying to figure out how to retrieve a specific set of data using the Freebase Topic API.
For example, if we request particular information using the following URL
https://www.googleapis.com/freebase/v1/topic/en/nicobar_scrubfowl?filter=/common/topic/description
we get a lot of information, like "id", "property", and a "values" array containing "text", "lang", "value", etc. I don't want all of that information.
So how can I retrieve only a particular set of information using the Topic API (like only "value" from the "values" array, or only "provider", etc.)?
Thanks.
If you want that level of control, you should investigate the MQLRead API.
There's no way to filter out those parts of the Topic API response. Every property value will have at least text, lang, id, creator and timestamp.
Why is this a problem in your application? As long as you're parsing this data with a JSON parser you will be able to access any of the data you want while ignoring the rest. If you're worried about the size of the response you can ask for a GZip response.
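Along the lines of that last answer, a small Python sketch that fetches the topic and keeps only the "value" entries for one property; the response layout (property -> values -> value) is inferred from the fields the question mentions, so verify it against a real response:

    import requests

    URL = "https://www.googleapis.com/freebase/v1/topic/en/nicobar_scrubfowl"
    resp = requests.get(URL, params={"filter": "/common/topic/description"})
    resp.raise_for_status()
    topic = resp.json()

    # Keep only the value fields for the description property, ignore the rest.
    descriptions = [
        v.get("value")
        for v in topic.get("property", {})
                 .get("/common/topic/description", {})
                 .get("values", [])
    ]
    print(descriptions)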

Should I use Wordpress Transient API in this case?

I'm writing a simple Wordpress plugin for work and am wondering if using the Transients API is practical in this case, or if I should seek out another way.
The plugin's purpose is simple. I'm making a call to USZip Web Service (http://www.webservicex.net/uszip.asmx?op=GetInfoByZIP) to retrieve data. Our sales team is using a Lead Intake sheet that the plugin will run on.
I wanted to reduce the number of API calls, so I thought of setting a transient for each zip code as the key and storing the incoming data (city and zip). If the corresponding data for a given zip code already exists, then there is no need to make an API call.
Here are my concerns:
1. After a quick search, I realized that the transient data is stored in the wp_options table, and storing the data would balloon that table in no time. Would this cause a significant performance issue if the db becomes huge?
2. Is it horrible practice to create this many transient keys? It could easily become thousands in a few months' time.
If using Transient is not the best way, could you please help point me in the right direction? Thanks!
P.S. I opted for the Transients API vs the Options API. I know zip codes don't change often, but they sometimes do. I set an expiration time of 3 months.
A less-inflated solution would be:
Store a single option called uszip with a serialized array inside the option
Grab the entire array each time and simply check if the zip code exists
If it doesn't exist, grab the data and save the whole option again
You should make sure you don't hit the upper bounds of a serialized array in this table (9,000 elements) considering 43,000 zip codes exist in the US. However, you will most likely have a very localized subset of zip codes.
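The shape of that logic, sketched in Python for brevity; in the actual plugin this would be PHP using get_option()/update_option() and something like wp_remote_get(), and lookup_zip_from_api() below is a hypothetical stand-in for the USZip call:

    import json

    CACHE_FILE = "uszip.json"  # stands in for the single "uszip" option row

    def load_cache():
        try:
            with open(CACHE_FILE) as fh:
                return json.load(fh)
        except FileNotFoundError:
            return {}

    def save_cache(cache):
        with open(CACHE_FILE, "w") as fh:
            json.dump(cache, fh)

    def lookup_zip_from_api(zip_code):
        # Placeholder for the real GetInfoByZIP web service call.
        return {"city": "UNKNOWN", "state": "UNKNOWN", "zip": zip_code}

    def get_zip_info(zip_code):
        cache = load_cache()          # grab the entire array each time
        if zip_code not in cache:     # only call the API on a cache miss
            cache[zip_code] = lookup_zip_from_api(zip_code)
            save_cache(cache)         # save the whole thing back
        return cache[zip_code]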

How can I create a segment data feature?

I have a task to extend my web application to give users the ability to segment their own data (i.e. choose their own fields and add their criteria using And/Or, etc.), so I'm creating something similar to a query builder tool, but lighter. I'm not worrying about the front end for the moment; I am just trying to focus on how to do this in the back end.
My only thought so far is to store their "Segment" as an XML document (serialized in the DB) which contains all of their columns and criteria and how they map to the database. Then, when the segment is called, a mapping class deserializes this XML document, maps the fields, builds a SQL query, and returns the query results. The problem I see with this is that if the database setup changes (likely), I have a serialized XML document which knows nothing about these changes.
Has anyone tackled a similar situation?
I had a similar problem and posted a question on here with what could be a potential solution to your own issue.
Dynamic linq query with multiple/unknown criteria
See how you get on with that.
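That linked question is about Dynamic LINQ in C#, but the underlying idea carries over to any stack: keep each segment as data (field/operator/value triples plus an And/Or joiner), validate the fields against the current schema when the segment is run, and build a parameterized query from it. A rough sketch in Python, with all names hypothetical:

    # Turn a stored segment (field/operator/value criteria joined by AND/OR)
    # into a parameterized SQL WHERE clause. The whitelists guard against
    # schema drift and SQL injection.
    ALLOWED_FIELDS = {"age", "country", "signup_date"}      # refresh from the live schema
    ALLOWED_OPERATORS = {"=", "<>", "<", "<=", ">", ">="}

    def build_where(criteria, joiner="AND"):
        """criteria: list of dicts like {"field": "age", "op": ">", "value": 30}"""
        clauses, params = [], []
        for c in criteria:
            if c["field"] not in ALLOWED_FIELDS or c["op"] not in ALLOWED_OPERATORS:
                raise ValueError(f"criterion no longer matches the schema: {c}")
            clauses.append(f"{c['field']} {c['op']} %s")
            params.append(c["value"])
        return f" {joiner} ".join(clauses), params

    segment = [
        {"field": "country", "op": "=", "value": "UK"},
        {"field": "age", "op": ">", "value": 30},
    ]
    where, params = build_where(segment)
    print("SELECT * FROM customers WHERE " + where, params)

Validating against the live schema at run time is what protects you from the "database setup changes" problem: a stale criterion fails loudly instead of producing a broken query.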
