Using proxy to cache expensive outgoing HTTP requests?

Using proxy to cache expensive outgoing HTTP requests? - http

I am using a fairly expensive external API (there's a cost per request) which makes testing code which uses it impractical.
In an ideal world, I would have a proxy server I would do my requests against which would cache each request (based on URL + query string) indefinitely and only hit the actual API server when I explicitly invalidate the cache for a given request. Is such a server available off the shelf with minimal configuration?
My current stack is Node.js, Docker, Nginx, PostgreSQL & AWS S3 (for non ephemeral state). I think Varnish might accomplish what I need but I'm not sure.

Varnish can and will accomplish that, but only if you build a 'test' API that returns some similar data you can play with. Your best bet if you have to save money, is to query the API a few times to get different typical responses. Once you know the ballpark of what to expect from it, create some sort of dummy API, or even some static JSON or XML files that you can use to mimic it. At that point you can test Varnish and Cache invalidation, and I'd be more than happy to help you with the syntax for that, given some examples of the code.

Related

How to cache an api response?

I'm using the api http://exchangeratesapi.io/ to get exchange rates.
Their site asks:
Please cache results whenever possible this will allow us to keep the service without any rate limits or api key requirements.
-source
Then I found this:
By default, the responses all of the requests to the exchangeratesapi.io API are cached. This allows for significant performance improvements and reduced bandwidth from your server.
-somebody's project on github, not sure if accurate
I've never cached something before and these two statements confuse me. When the API's site says to "please cache the results", it sounds like caching is something I can do in a fetch request, or somehow on the frontend. For example, some way to store the results in local storage or something. But I couldn't find anything about how to do this. I only found resources on how to force a response NOT to cache.
The second quote makes it sound like caching is something the API does itself on their servers, since they set the response to cache automatically.
How can I cache the results like the api site asks?

To clear your confusion on the conflicting statements you're referencing:
Caching just means to store the data. Examples of where the data can be stored are in memory, in some persistence layer (like Redis), or in the browser's local storage (like you mentioned). The intent behind caching can be to serve the data faster (compared to getting it from the primary data source) for future requests/fetches, and/or to save on costs for getting the same data repeatedly, among others.
For your case, the http://exchangeratesapi.io/ API is advising consumers to cache the results on their side (as you mentioned in your question, this can be in the browser's local storage, if you're calling the API front front-end code, or stored in memory or other caching mechanisms/structures on the server-side application code calling the API) to that they can avoid the need to introduce rate limiting.
The project from Github you're referencing, Laravel Exchange Rates, appears to be a PHP wrapper around the original API - so it's like a middleman between the API and a developer's PHP code. The intent is to make it easier to use the API from within PHP code, and avoid having to make raw HTTP requests to the API and avoid processing the responses; the Laravel Exchange Rates handles that for the developer.
In regards to the
By default, the responses all of the requests to the exchangeratesapi.io API are cached
statement you're asking about, it seems the library follows the advice of the API, and caches the results from the source API.
So, to sum up:
http://exchangeratesapi.io/ is the source API, and it advises consumers to cache results. If your code is going to be calling this API, you can cache the results in your own code.
The Laravel Exchange Rates PHP library is a wrapper around that source API, and does cache the results from the source API for the user. If you're using this library, you don't need to further cache.

Are "forever" and "never" the only useful caching durations?

In the presentation "Cache is King" by Steve Souders (at around 14:30), it is implied that there are in practice only two caching durations that you should use for your resources: "forever" and "never" (my own terminology).
"Forever" means that you effectively make the resource permanently immutable by setting a very high max age, such as one year. If you want to modify the resource at some point, the presentation suggests, you simply publish the modified resource at a different URL. (It is suggested that this renaming is necessary, in part or entirely, because of the large number of misconfigured proxies on the Internet.)
"Never" means that you effectively disable all forms of caching and require browsers to download the resource every time it is requested.
On the one hand, any performance advice given by the head performance engineer at Google carries weight on its own. On the other hand, HTTP caching was presumably designed with variable cache durations for a reason (not just "forever" and "never"), and changing the URL to a resource only because the resource has been modified seems to go against the spirit of HTTP.
Are "forever" and "never" the only cache durations that you should use in practice? Is this in conflict with other best practices on the web?
In addition to the typical "user with a browser" use case, I would also like to know how these principles apply to REST/hypermedia APIs.

Many people would disagree with limiting yourself to "forever" or "never" as you describe it.
For one thing, it ignores the option of allowing caching with always revalidating. In this case, if the client (or proxy) has cached the resource, it sends a conditional HTTP request. If the client/proxy has cached the latest version of the resource, then the server sends a short 304 response rather than the entire resource. If the client's (proxy) copy is out of date, then the server sends the entire resource.
With this scheme, the client will always get an up-to-date version of the resource, and if the resource doesn't change much bandwidth will be saved.
To save even more bandwidth, the client can be instructed to revalidate only when the resource is older than a certain period of time.
And if bad proxies are a problem, the server can specify that only clients and not proxies may cache the resource.
I found this document pretty concisely describes your options for caching. This page is longer but also gives some excellent information.

"It depends" really, on your use case, what you are trying to achieve, and your branding proposition.
If all you want to achieve is some bandwidth saving, you could do a total cost breakdown. Serving cost might not amount to much. Browsers are anyway pretty smart at optimizing image hits, for example, so understand your HTTP protocol. Forever, combined with versioned resource url, and url rewrite rules might be a good fit, like your Google engineer suggested.
Resource volatility is another. If you are only serving daily stock charts for example, it could safely be cached for some time but not forever.
Are your computation costs heavy? Are your users sensitive to timeliness? Is data live or fixed? For example, you might be serving airline routes, path of a hurricane, option greeks or a BI report to COO. You might want to have it cached, but the TTL will likely vary by user class, all the way down to never. Forever cannot work for live data but never might be a wrong answer too.
Degree of cooperation between the server and the client may be another factor. For example in a business operations environment where procedures can be distributed and expected to be followed, it might be worthwhile to again look at TTLs.
HTH. I doubt if there is a magical answer.

Idealy, you muste cache until the content changes, if you cannot clear/refresh the cache when content changes for any reason, you need a duration. But indeed, if you can, cache forever or do not cache. No need to refresh if you already know nothing changed.

If you know that the underlying data will be static for any length of time, caching makes sense. We have a web service that exposes data from a database that is populated by a nightly ETL job from an external source. Our RESTful web service only goes to the database when it changes. In our case, we know exactly when the data changes and we invalidate the cache right after the ETL process finishes.

Using Cloudfront to expose ElasticSearch REST API in read only (GET/HEAD)

I want to let my clients speak directly with ElasticSearch REST API, obviously preventing them from performing any data or configuration change.
I had a look at ElasticSearch REST interface and I noticed the pattern: HTTP GET requests are pretty safe (harmless queries and status of cluster).
So I thought I can use Cloudfront as a CDN/Proxy that only allows GET/HEAD methods (you can impose such restrict it in the main configuration).
So far so good, all is set up. But things don't work because I would need to open my EC2 security group to the world in order to be reachable from Cloudfront! I don't want this, really!
When I use EC2 with RDS, I can simply allow access to my EC2 security group in RDS security groups. Why can't I do this with CloudFront? Or can I?
Ideas?
edit: It's not documented, but ES accepts facets query, which involve a (JSON) body, not only with POST, but also with GET. This simply breaks HTTP recommendation (as for RFC3616) by not ignoring the body for GET request (source).
This relates because, as pointed out, exposing ES REST interface directly can lead to easy DOS attacks using complex queries. I'm still convinced though, having one less proxy is still worth it.
edit: Other option for me would be to skip CloudFront and adding a security layer as an ElasticSearch plugin as shown here

I ended coding with my own plugin. Surprisingly there was nothing quite like this around.
No proxies, no Jetty, no Tomcat.
Just a the original ES rest module and my RestFilter. Using a minimum of reflection to obtain the remote address of the requests.
enjoy:
https://github.com/sscarduzio/elasticsearch-readonlyrest-plugin

Note that even a GET request can be harmful in Elasticsearch. A query which simply takes up too much resources to compute will bring down your cluster. Facets are a good way to do this.
I'd recommend writing a simple REST API you place in front of ES so you get much more control over what hits your search cluster. If that's not an option you could consider running Nginx on your ES boxes to act as a local reverse proxy, which will give you the same control (and a whole lot more) as CloudFront does. Then you'd only have to open up Nginx to the world, instead of ES.

A way to do this in AWS would be:
Set up an Application Load Balancer in front of your ES cluster. Create a TLS cert for the ALB and serve https. Open the ES security group to the ALB.
Set up CloudFront and use the ALB as origin. Pass a custom header with a secret value (for WAF, see next point).
Set up WAF on your ALB to only allow requests that contain the custom header with the secret value. Now all requests have to go through CloudFront.
Set up a Lambda#Edge function on your CloudFront distribution to either remove the body from GET requests, or DENY such requests.
It’s quite some work, but there’s advantages over the plugin, e.g.:
CloudFront comes with free network DDOS protection
CloudFront gives your users lower latency to ES because of the fast CloudFront network and global PoP’s.
Opens many options to use CloudFront, WAF and Lamba#Edge to further protect your ES cluster.
I’m working on sample code in CDK to set all of this up. Will report back when that’s ready.

What is the difference between REST and HTTP protocols?

What is the REST protocol and what does it differ from HTTP protocol ?

REST is a design style for protocols, it was developed by Roy Fielding in his PhD dissertation and formalised the approach behind HTTP/1.0, finding what worked well with it, and then using this more structured understanding of it to influence the design of HTTP/1.1. So, while it was after-the-fact in a lot of ways, REST is the design style behind HTTP.
Fielding's dissertation can be found at http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm and is very much worth reading, and also very readable. PhD dissertations can be pretty hard-going, but this one is wonderfully well-described and very readable to those of us without a comparable level of Computer Science. It helps that REST itself is pretty simple; it's one of those things that are obvious after someone else has come up with it. (It also for that matter encapsulates a lot of things that older web developers learnt themselves the hard way in one simple style, which made reading it a major "a ha!" moment for many).
Other application-level protocols as well as HTTP can also use REST, but HTTP is the classic example.
Because HTTP uses REST, all uses of HTTP are using a REST system. The description of a web application or service as RESTful or non-RESTful relates to whether it takes advantage of REST or works against it.
The classic example of a RESTful system is a "plain" website without cookies (cookies aren't always counter to REST, but they can be): Client state is changed by the user clicking a link which loads another page, or doing GET form queries which brings results. POST form queries can change both server and client state (the server does something on the basis of the POST, and then sends a hypertext document that describes the new state). URIs describe resources, but the entity (document) describing it may differ according to content-type or language preferred by the user. Finally, it's always been possible for browsers to update the page itself through PUT and DELETE though this has never been very common and if anything is less so now.
The classic example of a non-RESTful system using HTTP is something which treats HTTP as if it was a transport protocol, and with every request sends a POST of data to the same URI which is then acted upon in an RPC-like manner, possibly with the connection itself having shared state.
A RESTful computer-readable (i.e. not a website in a browser, but something used programmatically) system would obtain information about the resources concerned by GETting URI which would then return a document (e.g. in XML, but not necessarily) which would describe the state of the resource, including URIs to related resources (hypermedia therefore), change their state through PUTting entities describing the new state or DELETEing them, and have other actions performed by POSTing.
Key advantages are:
Scalability: The lack of shared state makes for a much more scalable system (demonstrated to me massively when I removed all use of session state from a heavily hit website, while I was expecting it to give a bit of extra performance, even a long-time anti-session advocate like myself was blown-away by the massive gain from removing what had been pretty slim use of sessions, it wasn't even why I had been removing them!)
Simplicity: There are a few different ways in which REST is simpler than more RPC-like models, in particular there are only a few "verbs" that are ever possible, and each type of resource can be reasoned about in reasonable isolation to the others.
Lightweight Entities: More RPC-like models tend to end up with a lot of data in the entities sent both ways just to reflect the RPC-like model. This isn't needed. Indeed, sometimes a simple plain-text document is all that is really needed in a given case, in which case with REST, that's all we would need to send (though this would be an "end-result" case only, since plain-text doesn't link to related resources). Another classic example is a request to obtain an image file, RPC-like models generally have to wrap it in another format, and perhaps encode it in some way to let it sit within the parent format (e.g. if the RPC-like model uses XML, the image will need to be base-64'd or similar to fit into valid XML). A RESTful model would just transmit the file the same as it does to a browser.
Human Readable Results: Not necessarily so, but it is often easy to build a RESTful webservice where the results are relatively easy to read, which aids debugging and development no end. I've even built one where an XSLT meant that the entire thing could be used by humans as a (relatively crude) website, though it wasn't primarily for human-use (essentially, the XSLT served as a client to present it to users, it wasn't even in the spec, just done to make my own development easier!).
Looser binding between server and client: Leads to easier later development or moves in how the system is hosted. Indeed, if you keep to the hypertext model, you can change the entire structure, including moving from single-host to multiple hosts for different services, without changing client code at all.
Caching: For the GET operations where the client obtains information about the state of a resource, standard HTTP caching mechanisms allow both for statements that the resource won't meaningfully change until a certain date at the earliest (no need to query at all until then) or that it hasn't changed since the last query (send a couple hundred bytes of headers saying this rather than several kilobytes of data). The improvement in performance can be immense (big enough to move the performance of something from the point where it is impractical to use to the point where performance is no longer a concern, in some cases).
Availability of toolkits: Because it works at a relatively simple level, if you have a webserver you can build a RESTful system's server and if you have any sort of HTTP client API (XHR in browser javascript, HttpWebRequest in .NET, etc) you can build a RESTful system's client.
Resiliance: In particular, the lack of shared state means that a client can die and come back into use without the server knowing, and even the server can die and come back into use without the client knowing. Obviously communications during that period will fail, but once the server is back online things can just continue as they were. This also really simplifies the use of web-farms for redundancy and performance - each server acts like it's the only server there is, and it doesn't matter that its actually only dealing with a fraction of the requests from a given client.

REST is an approach that leverages the HTTP protocol, and is not an alternative to it.
http://en.wikipedia.org/wiki/Representational_State_Transfer
Data is uniquely referenced by URL and can be acted upon using HTTP operations (GET, PUT, POST, DELETE, etc). A wide variety of mime types are supported for the message/response but XML and JSON are the most common.
For example to read data about a customer you could use an HTTP get operation with the URL http://www.example.com/customers/1. If you want to delete that customer, simply use the HTTP delete operation with the same URL.
The Java code below demonstrates how to make a REST call over the HTTP protocol:
String uri =
"http://www.example.com/customers/1";
URL url = new URL(uri);
HttpURLConnection connection =
(HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "application/xml");
JAXBContext jc = JAXBContext.newInstance(Customer.class);
InputStream xml = connection.getInputStream();
Customer customer =
(Customer) jc.createUnmarshaller().unmarshal(xml);
connection.disconnect();
For a Java (JAX-RS) example see:
http://bdoughan.blogspot.com/2010/08/creating-restful-web-service-part-45.html

REST is not a protocol, it is a generalized architecture for describing a stateless, caching client-server distributed-media platform. A REST architecture can be implemented using a number of different communication protocols, though HTTP is by far the most common.

REST is not a protocol, it is a way of exposing your application, mostly done over HTTP.
for example, you want to expose an api of your application that does getClientById
instead of creating a URL
yourapi.com/getClientById?id=4
you can do
yourapi.com/clients/id/4
since you are using a GET method it means that you want to GET data
You take advantage over the HTTP methods: GET/DELETE/PUT
yourapi.com/clients/id/4 can also deal with delete, if you send a delete method and not GET, meaning that you want to dekete the record

All the answers are good.
I hereby add a detailed description of REST and how it uses HTTP.
REST = Representational State Transfer
REST is a set of rules, that when followed, enable you to build a distributed application that has a specific set of desirable constraints.
It is stateless, which means that ideally no connection should be maintained between the client and server.
It is the responsibility of the client to pass its context to the server and then the server can store this context to process the client's further request. For example, session maintained by server is identified by session identifier passed by the client.
Advantages of Statelessness:
Web Services can treat each method calls separately.
Web Services need not maintain the client's previous interaction.
This in turn simplifies application design.
HTTP is itself a stateless protocol unlike TCP and thus RESTful Web Services work seamlessly with the HTTP protocols.
Disadvantages of Statelessness:
One extra layer in the form of heading needs to be added to every request to preserve the client's state.
For security we may need to add a header info to every request.
HTTP Methods supported by REST:
GET: /string/someotherstring:
It is idempotent(means multiple calls should return the same results every time) and should ideally return the same results every time a call is made
PUT:
Same like GET. Idempotent and is used to update resources.
POST: should contain a url and body
Used for creating resources. Multiple calls should ideally return different results and should create multiple products.
DELETE:
Used to delete resources on the server.
HEAD:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The meta information contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request.
OPTIONS:
This method allows the client to determine the options and/or requirements associated with a resource, or the capabilities of a server, without implying a resource action or initiating a resource retrieval.
HTTP Responses
Go here for all the responses.
Here are a few important ones:
200 - OK
3XX - Additional information needed from the client and url redirection
400 - Bad request
401 - Unauthorized to access
403 - Forbidden
The request was valid, but the server is refusing action. The user might not have the necessary permissions for a resource, or may need an account of some sort.
404 - Not Found
The requested resource could not be found but may be available in the future. Subsequent requests by the client are permissible.
405 - Method Not Allowed
A request method is not supported for the requested resource; for example, a GET request on a form that requires data to be presented via POST, or a PUT request on a read-only resource.
404 - Request not found
500 - Internal Server Failure
502 - Bad Gateway Error

How to post a file to an image hosting service in .NET?

Scenario:
localhost receives the current HttpRequest with 3 hidden inputs and a posted file. I must then forward this form data to an external image host and get the response.

See the System.Net.WebClient and related classes. You can use them to create a request to the remote server and handle the response. Also get Fiddler to help you replicate what the browser sends.

I hate doing this. It wastes my server's bandwidth and ties up IIS threads as well as using my server's CPU. It sucks and it's worth avoiding at all cost. Many services like, one that comes to mind is fliqz, provide a mechanism such that the files are uploaded directly from the client to their server (bypassing yours) and then they make a request to your server passing it various info on the query string.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex