Symfony2: slow write to S3 from EC2 instance - symfony

I have one micro instance and an S3 bucket for dev purposes within the same region. My program has typeahead functionality that works this way:
when user types "lond", the url is:
mys3.s3-website-us-east-1.amazonaws.com/typeahead/cities/lond.json
Because it will return 404, javascript then tries:
http://mydomain.com/typeahead/cities/lond.json
This triggers the controller, which finds all cities starting with "lond", writes the result to S3, and returns the JSON results.
So the next time someone types "lond", the results will be found on S3 as a static file and no controller action will be executed.
Now all this works, but the problem is that I have a very slow write speed from EC2 to S3. When I remove
$filesystem->write($filename, json_encode($results), true) ;
from my code, the response takes ~0.7 seconds. With the write to S3 it goes up to 2 seconds, which is hardly useful. The problem is worse for fast typers, probably because of the rapid succession of writes sent to S3.
I am using KnpGaufretteBundle.
I also tried
echo json_encode($results);
$filesystem->write(....) ;
die;
to send the output to the browser first and then continue saving the file to S3, but it didn't work: the JS didn't get the response any earlier.
ob_start(), ob_end_flush().... and others didn't work either.
How can I solve it?
I was thinking of starting a background process to upload the result (though I don't know how to do that), but I think it would be too complicated.
The other idea is to use the amazon_s3 service directly and skip KnpGaufretteBundle, but I would like to avoid that.
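For what it's worth, the "background process" idea from the question can be sketched with a simple queue and worker thread. The question is PHP/Symfony, but the pattern is language-agnostic; this is a minimal Python sketch where `s3_write` is a hypothetical stand-in for the real S3 upload (e.g. Gaufrette's write call):

```python
import json
import queue
import threading

# Hypothetical stand-in for the slow S3 upload; here it just
# records what was "uploaded" so the sketch is self-contained.
uploaded = {}

def s3_write(filename, payload):
    uploaded[filename] = payload

write_queue = queue.Queue()

def uploader():
    # Background worker: drains the queue so the request thread
    # never waits on the S3 round trip.
    while True:
        item = write_queue.get()
        if item is None:          # sentinel to stop the worker
            break
        filename, payload = item
        s3_write(filename, payload)
        write_queue.task_done()

worker = threading.Thread(target=uploader, daemon=True)
worker.start()

def handle_typeahead(prefix, results):
    payload = json.dumps(results)
    # Enqueue the S3 write instead of blocking on it...
    write_queue.put((f"typeahead/cities/{prefix}.json", payload))
    # ...and return the JSON to the browser immediately.
    return payload

response = handle_typeahead("lond", ["London", "Londonderry"])
write_queue.join()  # only so the sketch is deterministic; a real worker runs on
```

The point is that the response latency returns to the ~0.7 s baseline, while the S3 write completes asynchronously.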

What you seem to be trying to do is use S3 to store cached data that is better kept in memcached or another memory-based store.
I'll give you a couple of other options. You could use both if you wanted, but option 1 is more than enough.
Use memcached (or Redis, or another key-value store) with read-through caching. You use a single endpoint which, on each request, checks whether the value exists in the cache and, if not, reads through to compute the result. At the same time, the result is stored in the cache.
Use a CloudFront distribution with a custom origin, the custom origin being your website. With this option, the result is cached at a CloudFront edge location for a duration specified by your headers. Each edge location may check the origin for the file if it does not have it. To prevent the custom origin from accidentally caching the wrong things, you may want to check for the CloudFront user agent and allow it to access only certain URLs.

AWS Lambda, Caching {proxy+}

A simple ASP.NET AWS Lambda is uploaded and functioning with several GETs like:
{proxy+}
api/foo/bar?filter=value
api/foo/barlist?limit=value
with paths tested in Postman as:
//#####.execute-api.us-west-2.amazonaws.com/Prod/{proxy+}
Now I want to enable API caching, but when I do, only the first API call gets cached and all other calls incorrectly return that first cached value.
i.e. //#####.execute-api.us-west-2.amazonaws.com/Prod/api/foo/bar?filter=value == //#####.execute-api.us-west-2.amazonaws.com/Prod/api/foo/barlist?limit=value; as far as the cache is concerned these return the same value, but they shouldn't.
How do you set up caching in API Gateway so it correctly treats these as different requests, by both path and query?
I believe you can't use {proxy+} for this, because {proxy+} is a resource/integration itself, and that is where the caching is applied. Or rather, you can (since you can cache any integration), but then you get exactly the result you're getting.
Note: I'll use the word "resource" a lot because I think of each item in API Gateway that way, but technically the AWS documentation says "integration", because it's not just the resource but the actual integration on that resource, together with its parameters (what I'll go on to call query string parameters). Apologies to the terminology police.
Put another way: if you had two resources, GET foo/bar and GET foo/barlist, then you'd be able to set caching on either or both. Caching exists at this resource level (think not of the final URL path, but of the actual resource configured in API Gateway). Unfortunately it doesn't know to break {proxy+} out into an unlimited number of paths. Strictly, it's method plus resource, so I believe you could have different cached results for GET /path and POST /path.
However, you can also choose the integration's query string parameters as cache keys. This would mean that ?filter=value and ?limit=value would be two different cache keys with two different cached responses.
Should foo/bar and foo/barlist take the same query string parameters (and you're still using {proxy+}), you'll run into the duplicate issue again.
So you may wish to do foo?action=bar&filter=value and foo?action=barlist&filter=value in that case.
You'll need to configure this, of course, for each query string parameter, which may start to diminish the ease of the {proxy+} catch-all. Terraform.io is your friend.
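To make the behavior described above concrete, here is a hedged Python model of the cache key (an illustration of the described behavior, not API Gateway's actual implementation): the key is the method plus the configured resource, plus only those query string parameters explicitly enabled as cache keys.

```python
from urllib.parse import parse_qsl

def cache_key(method, resource, query, cache_key_params):
    """Model an API Gateway-style cache key.

    Only query parameters listed in cache_key_params participate;
    with {proxy+} as the resource, everything else collapses.
    """
    params = {k: v for k, v in parse_qsl(query) if k in cache_key_params}
    return (method, resource, tuple(sorted(params.items())))

# With {proxy+} and no query-string cache keys, both requests from the
# question produce the same key -- hence the duplicate cached response:
cache_key("GET", "{proxy+}", "filter=value", set())
cache_key("GET", "{proxy+}", "limit=value", set())

# Enabling the parameters as cache keys separates the entries:
cache_key("GET", "{proxy+}", "filter=value", {"filter", "limit"})
cache_key("GET", "{proxy+}", "limit=value", {"filter", "limit"})
```

This also shows why splitting {proxy+} into explicit resources (foo/bar, foo/barlist) fixes the collision: the resource component of the key differs.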
This is something I wish was a bit more automatic/smarter as well. I use {proxy+} a lot and it really creates challenges for using their caching.

Update database with GET requests?

I've read online that you shouldn't update your database with GET requests for the following reasons:
GET request is idempotent and safe
violates HTTP spec
should always read data from a server-database
Let's say that we have built a URL shortener service. When someone clicks the link or pastes it into the browser's address bar, it will be a GET request.
So, if I want to update, on my database, the stats of a shortened link every time it's been clicked, how can I do this if GET requests are idempotent?
The only way I can think of is by calling a PUT request inside the server-side code that handles the GET request.
Is this a good practice or is there a better way to do this?
It seems as you're mixing up a few things here.
While you shouldn't use GET requests to transfer sensitive data (it is shown in the request URL and most likely logged somewhere along the way), there is nothing wrong with using them in your use case: you are only updating a variable server-side.
Just keep in mind that the request parameters are stored in the URL when using GET requests and you should be OK.
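The point above, that a GET can legitimately trigger an internal stats update, can be sketched like this (the link table and names are hypothetical):

```python
# In-memory stand-ins for the shortener's database tables.
click_counts = {}
links = {"abc123": "https://example.com/long/target"}

def handle_get(short_code):
    """Handle GET /<short_code>: redirect and count the click.

    The stats update is a server-side bookkeeping detail; from the
    client's perspective the GET still just "reads" the redirect.
    """
    target = links.get(short_code)
    if target is None:
        return 404, None
    # Server-side stats update inside the GET handler:
    click_counts[short_code] = click_counts.get(short_code, 0) + 1
    return 301, target
```

No nested PUT request is needed; the handler simply performs the update directly before returning the redirect.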

Serving static content programmatically from a Servlet - does the spec have anything available or should I roll a custom one?

I have a DB with original file names, the location of the files on disk, and metadata like the user that owns each file. The files on disk have scrambled names. When a user requests a file, the servlet checks whether he's authorized, then sends the file under its original name.
While researching the subject I've found several cases that cover this issue, but nothing specific to mine.
Essentially there are 2 solutions:
A custom servlet that handles headers and other stuff the Default Servlet containers don't: http://balusc.omnifaces.org/2009/02/fileservlet-supporting-resume-and.html
Then there is the quick and easy option of just using the default servlet and doing some path remapping. For example, in Undertow you configure the Undertow subsystem and add file handlers in standalone.xml that map http://example.com/content/ to /some/path/on/disk/with/files.
So I am leaning towards solution 1, since solution 2 is a straight path remap and I need to change file names on the fly.
I don't want to reinvent the wheel, and both solutions are non-standard, so if I decide to migrate to an app server other than WildFly, it will be problematic. Is there a better way? How would you approach this problem?
While your problem is a fairly common one there isn't necessarily a standards based solution for every possible design challenge.
I don't think solution #2 will be sufficient: what if two threads try to manipulate the file at the same time? And if someone got the link to the file, could they share it?
I've implemented something very similar to your #1 solution. The key there is that even if the link to the file got out, no one could reuse it, as it requires authorization; you would just return a 401 or 403 for the resource.
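A minimal sketch of that approach, assuming a simple lookup table for the scrambled files (names and structure are hypothetical; a real servlet would stream the file from disk and set the header on the HttpServletResponse):

```python
# Hypothetical DB records: scrambled path on disk plus original name.
db = {
    "f1": {"owner": "alice",
           "path": "/some/path/on/disk/a8f3c2.bin",
           "original_name": "report.pdf"},
}

def serve_file(file_id, user):
    """Return (status, headers, disk_path) for an authorized download."""
    record = db.get(file_id)
    if record is None:
        return 404, {}, None
    if record["owner"] != user:
        return 403, {}, None          # leaked links are useless without auth
    headers = {
        # Serve the scrambled file under its original name:
        "Content-Disposition":
            f'attachment; filename="{record["original_name"]}"',
    }
    # A real servlet would now stream record["path"] to the response.
    return 200, headers, record["path"]
```

The Content-Disposition header is what lets you rename the file on the fly, which is exactly the part a plain path remap (solution #2) cannot do.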
Another possibility depends on how you're hosted. Amazon S3 allows you to generate a signed URL with a limited time to live. That way your server isn't sending the file directly; it sends either a redirect or a URL for the front end to use. Keep the lifetime at around 15 seconds (depending on your needs) and the URL quickly becomes invalid.
I believe that the other cloud providers have a similar capability too.

Does Firebase guarantee that data set using updateValues or setValue is available in the backend as one atomic unit?

We have an application that uses base64-encoded content to transmit attachments to the backend. The backend then moves the content to Storage after some manipulation. This way we can enjoy world-class offline support and sync, and at the same time use the much cheaper Storage to hold the files in the end.
Initially we used updateChildren to set the content in one go. This worked fairly well, but then users started to upload bigger and more numerous files at the same time, resulting in silent freezing of the database on end-user devices.
We then changed the code to write the files one by one using FirebaseDatabase.getInstance().getReference("/full/uri").setValue(base64stuff), and then using updateChildren to only set the metadata.
This allowed a seemingly endless number of files (provided each is chopped into chunks of at most 9 MB), but now we're facing another problem.
Our backend uses a Firebase listener to start working once new content is available. The trigger waits for the metadata and then starts processing the attachments. It seems that even though the client device writes the files before we set the metadata, the backend usually receives the metadata before the content of the files is available. This forced us to change the backend code to stop processing and check again later whether the attachment's base64 data is available.
This works, but it is not elegant, wastes CPU cycles, and increases latency.
I haven't found anything in the docs about whether Firebase guarantees anything about the order in which data is received by the backend. It seems that everything written in one go (using setValue or updateChildren) is available in the backend as one atomic unit.
Is this correct? Can I depend on that as a fact that will not change in the future?
The way I'm going to go about this (if the assumptions above are correct) is to write the metadata first using updateChildren in the client, like this:
"/uri/of/metadata/uid/attachments/attachment_uid1" = "per attachment metadata"
"/uri/of/metadata/uid/attachments/attachment_uid2" = "per attachment metadata"
and then write each base64 chunk using updateChildren with the following payload:
"/uri/of/metadata/uid/uploaded_attachments/attachment_uid2" = true
"/uri/of/base64/content/attachment_uid" = "base64content"
I can't use setValue for any of this data, to prevent accidental overwrites that depend on the order in which the writes happen in the end.
This would allow me to listen to /uri/of/base64/content and try to start handling the metadata package every time a new attachment completes loading. The only thing needed to determine whether all files have already been uploaded is to grab the metadata and check that all attachment uids found under /attachments/ are also present under /uploaded_attachments/.
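The completeness check described above amounts to a set comparison, sketched here in Python (the paths and field names mirror the illustrative structure from the question):

```python
def all_uploaded(metadata):
    """True once every uid listed under attachments/ also appears
    under uploaded_attachments/, i.e. the package is complete."""
    expected = set(metadata.get("attachments", {}))
    uploaded = set(metadata.get("uploaded_attachments", {}))
    return expected <= uploaded
```

The backend listener on /uri/of/base64/content would run this against the metadata node on each new attachment and only start processing when it returns true.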
Writes from a single Firebase Database client are delivered to the server in the same order as they are executed on the client. They are also broadcast out to any listening clients in the same order.
There is no chance that another client will see the results of write B without seeing the results of write A (unless A was rejected by security rules).

API caching with Symfony2

How can I cache my API responses built with Symfony?
I started to dig into FosCacheBundle and the Symfony HttpCache, but I'm not sure about my use case.
You can only access the API with a token in the header, and every user gets the same data in the response for the same URL called (and with the same GET parameters).
I would like to have a cache entry for each of my URLs (including GET parameters).
Also, is it possible to reorder my GET parameters before the request is processed by my cache system, so that the system doesn't create multiple cache entries for "URL?foo=bar&foz=baz" and "URL?foz=baz&foo=bar", which return the same data?
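The parameter reordering asked about can be done by normalizing the query string before the cache lookup. The question is Symfony/PHP, but here is a minimal Python sketch of the idea (the URL is hypothetical):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalized_cache_key(url):
    """Sort the GET parameters so equivalent URLs share one cache entry."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

Running both orderings of the same parameters through this function yields an identical key, so the cache stores a single entry for them.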
Well there are multiple ways.
But the simplest is this:
If the biggest problem is database access, then just caching the compiled result in memcache or similar will go a long way. Plus, this way you stick to your already-working authentication.
In your current controller action, after authentication and before payload creation, check whether there's an entry in memcache. If not, build the payload, save it into memcache, then return it. When the next request comes along there will be no DB access, as the payload will be returned from memcache. Just don't forget to refresh the cache however often you need.
Note:
"Early optimization is the root of all evil" and "Servers are cheaper than programmer hours" are to things to keep in mind. Don't complicate your life with really advanced caching methods if you don't need to.
