Identify web-objects from redundant URIs using HTTP requests

Identify web-objects from redundant URIs using HTTP requests - r

I am struggling with an ill-constructed web-server log file, which I want to summarize to analyse attendance of the hosted site. Unfortunately for me, the architecture of the site is messy, so that there are no indexes of the hosted objects (html pages, jpg images, pdf document, etc.) while several URIs can refer to the same page. For example :
http://www.site.fr/main.asp?page=foo.htm
http://www.site.fr/storage-tree/foo.htm
http://www.site.fr/specific.asp?id=200
http://www.site.fr/specific.asp?path=/storage-tree/foo.htm
etc. without any obvious regularities between the duplicate URIs.
How, conceptually and pratically, can I efficiently identify the pages? As I see the problem, the idea is to construct an index linking log's URIs with a unique-object identifier constructed from http requests. There are three loose constraints :
I use R for the statistical part, and would therefore prefer to use it for http processing too
logs consist in hundreds of thousands of different URIs (among which forms, search and database queries) so that rapidity might be a matter
If I want to be able to tell, even in three days or a month, that this new URI is a known previously identified page, I have store the features I use to assess that two URIs refer to the same page. Then, storage space is a matter.

This is pretty easy with httr:
library(httr)
HEAD("http://gmail.com")$url
You will probably also want to check the status_code returned by HEAD, as failures often won't be redirected.
(One advantage of using httr over RCurl here is that it automatically preserves the connection across multiple http calls to the same site, which makes things quite a bit faster)

Related

What is the expected logic of a batched POST request to a subresource

I like the idea of vectorized/batched requests similar to what the StackExchange API offers and would like to implement something for my own API, i.e. GET /users/1;2;3;4;5 would return the selected user resources with id 1 to 5.
I think this is fairly simple when reading data, but what would be the expected behavior for i.e. a POST request to a subresource?
POST /1;2;3;4;5/subresource
Would this mean:
Creation of five new subresources, assigned to each id (1:1)
Creation of a single new subresource, but assigned to each resource id (1:n)

I have a couple of concerns regarding this approach. First, resources should be uniquely addressable via certain resource locators (URIs). Using your approach however bypasses this requirement in some way IMO. This approach may also lead to other issues later on, i.e. plenty of frameworks do not allow URIs that exceed a certain character size.
Furthermore, instead of consecutive resource IDs the resource should use UUIDs instead. This will first and foremost prevent guessing attacks and also prevent logical issues on moving resources or inserting some in between.
The POST method requests that the target resource process the
representation enclosed in the request according to the resource's
own specific semantics.
In regards to HTTP POST operations, the specification clearly states that the semantics of any body received via POST is up to the service developer. So you are basically allowed to do anything within a POST request. As the semantics is totally up to you, you have to document the behavior explicitely. Not documenting the applied logic will leave a large grey-zone for service users.

How should Transient Resources be retrieved in a RESTful API

For a while I was (wrongly) thinking that a RESTful API just exposed CRUD operation to persisted entities for a web application. When you code something up in "the real world" you soon find out that this is not enough. For example, a bank account transfer doesn't have to be a persisted entity. It could be a transient resource where you POST to /transfers/ and in the payload you specify the details:
{"accountToCredit":1234, "accountToDebit":5678, "amount":10}
Using POST here makes sense because it changes the state on the server ($10 moves from one account to another every time this POST occurs).
What should happen in the case where it doesn't affect the server? The simple first answer would be to use GET. For example, you want to get a list of savings and checking accounts that have less than $100. You would then call something like GET to /accounts/searchResults?minBalance=0&maxBalance=100. What happens though if your search parameter need to use complex objects that wouldn't fit in the maximum length of a GET request.
My first thought was to use POST, but after thinking about it some more it should probably be a PUT since it isn't changing the state of the server, but from my (limited) understanding I always though of PUT as updating a resource and POST as creating a resource (like creating this search results). So which should be used in this case?
I found the following links which provide some information but it wasn't clear to me what should be used in the different cases:
Transient REST Representations
How to design RESTful search/filtering?
RESTful URL design for search

I would agree with your approach, it seems reasonable to me to use GET when searching for resources, and as said in one of your provided links, the whole point of query strings is for doing things like search. I also agree that PUT fits better when you want to update some resource in an idempotent way (no matter how many times you hit the request, the result will be the same).
So generally, I would do it as you propose. Now, if you are limited by the maximum length of GET request, then you could use POST or PUT, passing your parameters in a JSON, in a URI like:
PUT /api/search
You could see this as a "search resource" where you send new parameters. I know it seems like a workaround and you may be worried that REST is about avoiding verbs in the URIs. Well, there are few cases that it's still acceptable and RESTful to use verbs, e.g. in cases where calculation or conversion is involved to generate the result (for more about this, check this reference).
PS. I think this workaround is still RESTful, but even if it wasn't, REST isn't an obsession and an ultimate goal. Being pragmatic and keeping a clean API design might be a better approach, even if in few cases you are not RESTful.

Different representations of one resource

When i have a resource, let's say customers/3 which returns the customer object and i want to return this object with different fields, or some other changes (for example let's say i need to have include in customer object also his latest purchase (for the sake of speed i dont want to do 2 different queries)).
As i see it my options are:
customers/3/with-latest-purchase
customers/3?display=with-latest-purchase
In the first option there is distinct URI for the new representation, but is this REALLY needed? Also how do i tell the client that this URI exist?
In the second option there is GET parameter telling the server what kind of representation to return. The URI parameters can be explained through OPTIONS method and it is easier to tell client where to look for the data as all the representations are all in one place.
So my question is which of these is better (more RESTful) and/or is there some better way to do this that i do not know about?

I think what is best is to define atomic, indivisible service objects, e.g. customer and customer-latest-purchase, nice, clean, simple. Then if the client wants a customer with his latest purchases, they invoke both service calls, instead of jamming it all in one with funky parameters.
Different representations of an object is OK in Java through interfaces but I think it is a bad idea for REST because it compromises its simplicity.

There is a misconception that making query parameters look like file paths is more RESTful. The query portion of the address is included when determining a distinct URI so the second option is fine.
Is there much of a performance hit in including the latest purchase data in all customer GET requests? If not, the simplest thing would be to do that so there would neither be weird URL params or double requests. If getting the latest order is a significant hardship (which it probably shouldn't be) there is nothing wrong with adding a flag in the query string to include it.

What is the difference between POST and GET? [duplicate]

This question already has answers here:
When should I use GET or POST method? What's the difference between them?
(15 answers)
Closed 9 years ago.
I've only recently been getting involved with PHP/AJAX/jQuery and it seems to me that an important part of these technologies is that of POST and GET.
First, what is the difference between POST and GET? Through experimenting, I know that GET appends the returning variables and their values to the URL string
website.example/directory/index.php?name=YourName&bday=YourBday
but POST doesn't.
So, is this the only difference or are there specific rules or conventions for using one or the other?
Second, I've also seen POST and GET outside of PHP: also in AJAX and jQuery. How do POST and GET differ between these 3? Are they the same idea, same functionality, just utilized differently?

GET and POST are two different types of HTTP requests.
According to Wikipedia:
GET requests a representation of the specified resource. Note that GET should not be used for operations that cause side-effects, such as using it for taking actions in web applications. One reason for this is that GET may be used arbitrarily by robots or crawlers, which should not need to consider the side effects that a request should cause.
and
POST submits data to be processed (e.g., from an HTML form) to the identified resource. The data is included in the body of the request. This may result in the creation of a new resource or the updates of existing resources or both.
So essentially GET is used to retrieve remote data, and POST is used to insert/update remote data.
HTTP/1.1 specification (RFC 2616) section 9 Method Definitions contains more information on GET and POST as well as the other HTTP methods, if you are interested.
In addition to explaining the intended uses of each method, the spec also provides at least one practical reason for why GET should only be used to retrieve data:
Authors of services which use the HTTP protocol SHOULD NOT use GET based forms for the submission of sensitive data, because this will cause this data to be encoded in the Request-URI. Many existing servers, proxies, and user agents will log the request URI in some place where it might be visible to third parties. Servers can use POST-based form submission instead
Finally, an important consideration when using GET for AJAX requests is that some browsers - IE in particular - will cache the results of a GET request. So if you, for example, poll using the same GET request you will always get back the same results, even if the data you are querying is being updated server-side. One way to alleviate this problem is to make the URL unique for each request by appending a timestamp.

A POST, unlike a GET, typically has relevant information in the body of the request. (A GET should not have a body, so aside from cookies, the only place to pass info is in the URL.) Besides keeping the URL relatively cleaner, POST also lets you send much more information (as URLs are limited in length, for all practical purposes), and lets you send just about any type of data (file upload forms, for example, can't use GET -- they have to use POST plus a special content type/encoding).
Aside from that, a POST connotes that the request will change something, and shouldn't be redone willy-nilly. That's why you sometimes see your browser asking you if you want to resubmit form data when you hit the "back" button.
GET, on the other hand, should be idempotent -- meaning you could do it a million times and the server will do the same thing (and show basically the same result) each and every time.

Whilst not a description of the differences, below are a couple of things to think about when choosing the correct method.
GET requests can get cached by the browser which can be a problem (or benefit) when using ajax.
GET requests expose parameters to users (POST does as well but they are less visible).
POST can pass much more information to the server and can be of almost any length.

POST and GET are two HTTP request methods. GET is usually intended to retrieve some data, and is expected to be idempotent (repeating the query does not have any side-effects) and can only send limited amounts of parameter data to the server. GET requests are often cached by default by some browsers if you are not careful.
POST is intended for changing the server state. It carries more data, and repeating the query is allowed (and often expected) to have side-effects such as creating two messages instead of one.

If you are working RESTfully, GET should be used for requests where you are only getting data, and POST should be used for requests where you are making something happen.
Some examples:
GET the page showing a particular SO question
POST a comment
Send a POST request by clicking the "Add to cart" button.

With POST you can also do multipart mime encoding which means you can attach files as well. Also if you are using post variables across navigation of pages, the user will get a warning asking if they want to resubmit the post parameter. Typically they look the same in an HTTP request, but you should just stick to POST if you need to "POST" something TO a server and "GET" if you need to GET something FROM a server as that's the way they were intended.

The only "big" difference between POST & GET (when using them with AJAX) is since GET is URL provided, they are limited in ther length (since URL arent infinite in length).

GET vs. POST does it really really matter?

Ok, I know the difference in purpose. GET is to get some data. Make a request and get data back. POST should be used for CRUD operations other than read I believe. But when it comes down to it, does the server really care if it's receiving a GET vs. POST in the end?

According to the HTTP RFC, GET should not have any side-effects, while POST may have side-effects.
The most basic example of this is that GET is not appropriate for anything like a purchase-transaction or posting an article to a blog, while POST is appropriate for actions-that-have-consequences.
By the RFC, you can hold a user responsible for actions done by POST (such as a purchase), but not for GET actions. 'Bots always use GET for this reason.
From the RFC 2616, 9.1.1:
9.1.1 Safe Methods
Implementors should be aware that the
software represents the user in
their interactions over the Internet,
and should be careful to allow the
user to be aware of any actions they
might take which may have an
unexpected significance to themselves
or others.
In particular, the convention has
been established that the GET and
HEAD methods SHOULD NOT have the
significance of taking an action
other than retrieval. These methods
ought to be considered "safe". This
allows user agents to represent other
methods, such as POST, PUT and
DELETE, in a special way, so that the
user is made aware of the fact that
a possibly unsafe action is being
requested.
Naturally, it is not possible to
ensure that the server does not
generate side-effects as a result of
performing a GET request; in fact,
some dynamic resources consider that a
feature. The important distinction
here is that the user did not request
the side-effects, so therefore
cannot be held accountable for them.

It does if a search engine is crawling the page, since they will be making GET requests but not POST. Say you have a link on your page:
http://www.example.com/items.aspx?id=5&mode=delete
Without some sort of authorization check performed before the delete, it's possible that Googlebot could come in and delete items from your page.

Since you're the one writing the server software (presumably), then it cares if you tell it to care. If you handle POST and GET data identically, then no, it doesn't.
However, the browser definitely cares. Refreshing or clicking back to a page you got as a response to a POST pops up the little "Are you sure you want to submit data again" prompt, for example.

GET has data limit restrictions based on the sending browser:
The spec for URL length does not dictate a minimum or maximum URL length, but implementation varies by browser. On Windows: Opera supports ~4050 characters, IE 4.0+ supports exactly 2083 characters, Netscape 3 -> 4.78 support up to 8192 characters before causing errors on shut-down, and Netscape 6 supports ~2000 before causing errors on start-up

If you use a GET request to alter back-end state, you run the risk of bad things happening if a webcrawler of some kind traverses your site. Back when wikis first became popular, there were horror stories of whole sites being deleted because the "delete page" function was implemented as a GET request, with disastrous results when the Googlebot came knocking...

"Use GET if: The interaction is more like a question (i.e., it is a safe operation such as a query, read operation, or lookup)."
"Use POST if: The interaction is more like an order, or the interaction changes the state of the resource in a way that the user would perceive (e.g., a subscription to a service), or the user be held accountable for the results of the interaction."
source

You be aware of a few subtle security differences. See my question
GET versus POST in terms of security?
Essentially the important thing to remember is that GET will go into the browser history and will be transmitted through proxies in plain text, so you don't want any sensitive information, like a password in a GET.
Obvious maybe, but worth mentioning.

By HTTP specifications, GET is safe and idempotent and POST is neither. What this means is that a GET request can be repeated multiple times without causing side effects.
Even if your server doesn't care (and this is unlikely), there may be intermediate agents between your client and the server, all of whom have this expectation. For example proxies to cache data at your ISP or other providers for improved performance. THe same expectation is true for accelerators, for example, a prefetching plugin for your browser.
Thus a GET request can be cached (based on certain parameters), and if it fails, it can be automatically repeated without any expecation of harmful effects. So, really your server should strive to fulfill this contract.
On the other hand, POST is not safe, not idempotent and every agent knows not to cache the results of a POST request, or retry a POST request automatically. So, for example, a credit card transaction would never, ever be a GET request (you don't want accounts being debited multiple times because of network errors, etc).
That's a very basic take on this. For more information, you might consider the "RESTful Web Services" book by Ruby and Richardson (O'Reilly press).
For a quick take on the topic of REST, consider this post:
http://www.25hoursaday.com/weblog/2008/08/17/ExplainingRESTToDamienKatz.aspx
The funny thing is that most people debate the merits of PUT v POST. The GET v POST issue is, and always has been, very well settled. Ignore it at your own peril.

GET has limitations on the browser side. For instance, some browsers limit the length of GET requests.

I think a more appropriate answer, is you can pretty much do the same things with both. It is not so much a matter of preference, however, but a matter of correct usage. I would recommend you use you GETs and POSTs how they were intended to be used.

Technically, no. All GET does is post the stuff in the first line of the HTTP request, and POST posts stuff in the body.
However, how the "web infrastructure" treats the differences makes a world of difference. We could write a whole book about it. However, I'll give you some "best practises":
Use "POST" for when your HTTP request would change something "concrete" inside the web server. Ie, you're editing a page, making a new record, and so on. POSTS are less likely to be cached, or treated as something that's "repeatable without side-effects"
Use "GET" for when you want to "look at an object". Now, such a look might change something "behind the scenes" in terms of caching or record keeping, but it shouldn't change anything "substantial". Ie, I could repeat my GET over and over and nothing bad would happen, except for inflated hit counts. GETs should be easily bookmarkable, so a user can go back to that same object later on.
The parameters to the GET (the stuff after the ?, traditionally) should be considered "attributes to the view" or "what to view" and so on. Again, it shouldn't actually change anything: use POST for that.
And, a final word, when you POST something (for example, you're creating a new comment), have the processing for the post issue a 302 to "redirect" the user to a new URL that views that object. Ie, a POST processes the information, then redirects the browser to a GET statement to view the new state. Displaying information as a result of a POST can also cause problems. Doing the redirection is often used, and makes things work better.

Should the user be able to bookmark the resulting page? Another thing to think about is some browsers/servers incorrectly limit the GET URI length.
Edit: corrected char length restriction note - thanks ars!

It depends on the software at the server end. Some libraries, like CGI.pm in perl handles both by default. But there are situations where you more or less have to use POST instead of GET, at least for pushing data to the server. Large amounts of data (where the corresponding GET url would become too long), binary data (to avoid lots of encoding/decoding trouble), multipart files, non-parsed headers (for continuous updates pre-AJAX style...) and similar.

The server technically couldn't care one way or the other about what kind of request it receives. It will blindly execute any request coming across the wire.
Which is the problem. If you have an action that destroys or modifies data in a GET action, Google will tear your site up as it crawls through indexing.

The server usually doesn't care. But it's mostly for following good practices, as you mentioned. The client side also matter - as mentioned you cannot bookmark a POST'd page usually, and some browsers have limits on the length of the URL for really long GET queries.

Since GET is intended for specifying resource you wanna get, depending on exact software on the server side, the web server (or the load balancer in front of it) may have a size limit on GET requests to prevent Denial Of Service attacks...

Be aware that browsers may cache GET requests but will generally not cache POST requests.

Yes, it does matter. GET and POST are quite different, really.
You are right in that normally, GET is for "getting" data from the server and displaying a page, while POST is for "posting" data back to the server. Internally, your scripts get the same data whether it's GET or POST, so no, the server doesn't really care.
The main difference is GET parameters are specified in URLs, while POST is not. This is why POST is used for signup and login forms - you don't want your password in a URL. Similarly, if you're viewing different pages or displaying a specific view of some data, you normally want a unique URL.

It really does matter. I have gathered like 11 things you should know abut them.
11 things you should know about GET vs POST

No, they shouldn't except for #jbruce2112 answer and uploading files require POST.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex