I'm developing a web platform using Meteor.js that allows me to upload multiple text/plain files, such as .txt and .csv, to my Amazon S3 bucket.
Currently my platform supports multiple file uploads, sending them sequentially with a for loop and meteor-slingshot to upload directly to my S3 bucket.
What I'm trying to do:
I need to merge the files into one final file before the upload to S3 happens, so I upload a single merged file instead of several (the files must be merged into one).
My ideas:
I've thought about uploading the files to my own server first, merging them there with something like cat file1.txt file2.txt > merged_file.txt on Linux, and then retrieving the merged file somehow. But this would take longer, because the files have to be uploaded to my server first and only then can the merged file be uploaded to the S3 bucket.
Use fs.appendFile from Node to merge the files, but I don't know if that would be viable, because the files can be up to 20-25 MB and it will take some time to read them and append the data.
I think JavaScript is probably not viable for handling this directly, which is why I'm thinking about sh commands or C++ via the Node.js core.
Can you recommend a better way of doing this?
Is there a better way to handle this process?
You can do this easily in shell.
for i in file1 file2 file3
do
    cat "$i" >> combined_file
done
I would also suggest that you consider uploading them as small files, then spinning up an AWS instance to pull them down, merge them, and re-upload the result. That way you don't lose the originals, and you can automate the whole process.
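A minimal sketch of that pull-down/merge/re-upload step on the instance, assuming the AWS CLI is configured there; the bucket and prefix names are placeholders:
# pull the individual uploads down from S3 (bucket/prefix are placeholders)
aws s3 cp s3://my-bucket/uploads/ ./uploads/ --recursive

# merge the text files into a single file
cat ./uploads/*.txt > merged_file.txt

# push the merged file back up to S3
aws s3 cp merged_file.txt s3://my-bucket/merged/merged_file.txt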
I didn't find an answer for this here, so I thought someone could help:
I'm receiving a CSV file from a GET request.
I want to upload it to S3 (and then continue the pipeline..)
I'm using Airflow on the managed AMAA platform.
The problem is that when uploading to S3, the script requires a file path for the CSV file.
How can I pass a file path when running on the AMAA platform? Is the file even stored anywhere?
Do I need a middle man to store it in between?
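Not from the original post, but for illustration: the "middle man" is usually just a temporary file on the worker's local disk. A rough shell sketch of that pattern (the URL, bucket, and key are placeholders, and it assumes curl and the AWS CLI are available where the task runs, which may not hold on a managed platform):
# fetch the CSV into a temporary file so there is a real file path to work with
curl -sSf -o /tmp/report.csv "https://example.com/report.csv"

# upload the temporary file to S3, then clean it up
aws s3 cp /tmp/report.csv s3://my-bucket/incoming/report.csv
rm /tmp/report.csv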
I created a Rackspace account earlier today to serve my OpenCart images from the Rackspace CDN.
I have created a container where I will upload over 500,000 images, but I would prefer to upload them as a single compressed file, which feels more flexible.
If I upload all the images in a compressed file, how do I extract it once it is in the container, and what compression formats would work?
The answer may depend on how you are attempting to upload your file/files. Since this was not specified, I will answer your question using the CLI from a *nix environment.
Answer to your question (using curl)
Using curl, you can upload a compressed file and have it extracted using the extract-archive feature.
$ tar cf archive.tar directory_to_be_archived
$ curl -i -XPUT -H'x-auth-token: AUTH_TOKEN' https://storage101.iad3.clouddrive.com/v1/MossoCloudFS_aaa-aaa-aaa-aaa?extract-archive=tar -T ./archive.tar
You can find the documentation for this feature here: http://docs.rackspace.com/files/api/v1/cf-devguide/content/Extract_Archive-d1e2338.html
Recommended solution (using Swiftly)
Uploading and extracting that many objects using the above method might take a long time to complete. Additionally if there is a network interruption during that time, you will have to start over from the beginning.
I would recommend instead using a tool like Swiftly, which allows you to upload your files concurrently. This way, if there is a problem during the upload, you don't have to re-upload objects that have already been successfully uploaded.
An example of how to do this is as follows:
$ swiftly --auth-url="https://identity.api.rackspacecloud.com/v2.0" \
--auth-user="{username}" --auth-key="{api_key}" --region="DFW" \
--concurrency=10 put container_name -i images/
If there is a network interruption while uploading, or you have to stop/restart uploading your files, you can add the "--different" option after the 'put' in the above command. This will tell Swiftly to HEAD the object first and only upload if the time or size of the local file does not match its corresponding object, skipping objects that have already been uploaded.
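For reference, the resumable form of the command above would then look something like this (same placeholder credentials as before):
$ swiftly --auth-url="https://identity.api.rackspacecloud.com/v2.0" \
  --auth-user="{username}" --auth-key="{api_key}" --region="DFW" \
  --concurrency=10 put --different container_name -i images/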
Swiftly can be found on github here: https://github.com/gholt/swiftly
There are other clients that possibly do the same things, but I know Swiftly works, so I recommend it.
I have to upload and download files from blob storage. I found a good tutorial on uploading and downloading files, but I have some questions.
I want to create a folder structure and do operations like:
a. Fetch a particular file from folder
b. Fetch all files of a folder and its subfolders
c. Fetch name of files which are in a particular folder
d. Fetch name of files which are in a particular folder and its subfolders
Upload files to a particular folder or subfolder
What are the best practices for doing this, and should I use a queue for any of it?
What would be the performance impact of uploading large files to blob storage?
You can't really use queues for that purpose. Reasons being:
Maximum size of a message in a queue is 64 KB. What would happen if your file size is more than 64 KB?
More importantly, queues are not meant for that purpose. Queues are typically used as asynchronous communication channel between disconnected applications.
Do search around and you will find plenty of examples about uploading files in blob storage.
For uploading folders, essentially you will iterate over a folder and list all files and upload these files. Since blob storage doesn't really support folder hierarchy, you would need to name the blob by prepending the folder structure to the name of the file. For example, let's say you're uploading files from C:\images\thumbnails folder in a blob container named assets. If you're uploading a file called a.png, you can name the blob as images/thumbnails/a.png and that way you can preserve the folder structure.
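As a rough illustration of that naming scheme (not part of the original answer), here is what it might look like with the Azure CLI, using the assets container and images/thumbnails path from the example; authentication flags are omitted:
# upload a local file as a blob whose name encodes the "folder" path
az storage blob upload --container-name assets --name images/thumbnails/a.png --file a.png

# list only the blobs under that virtual folder by filtering on the name prefix
az storage blob list --container-name assets --prefix images/thumbnails/ --output table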
A couple of questions related to one theme: downloading efficiently with Rsync.
Currently, I move files from an 'upload' folder onto a local server using rsync. Files to be moved are often dumped there, and I regularly run rsync so the files don't build up. I use '--remove-source-files' to remove files that have been transferred.
1) The '--delete' options that remove destination files come in several variants that let you choose when the files are removed. Something similar would be handy for '--remove-source-files', since it seems that, by default, rsync only removes the source files after all files have been transferred, rather than after each file. Other than writing a script to make rsync transfer the files one by one (a sketch of such a script is below), is there a better way to do this?
2) Related to the same problem: if a large (single) file is transferred, it can only be deleted after the whole thing has been successfully moved. It strikes me that I might be able to use 'split' to break the file into smaller chunks, allowing each chunk to be deleted as the file downloads; is there a better way to do this?
Thanks.
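For reference, the one-file-at-a-time script mentioned in (1) might look roughly like this (the source and destination paths are placeholders; note that it copies every file into the top of the destination directory):
# transfer files one at a time so each source file is removed as soon as
# its own transfer has succeeded
find /srv/upload -type f -print0 |
while IFS= read -r -d '' f
do
    rsync --remove-source-files "$f" /srv/local/
done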
I'm working on a web application where a user uploads a list of files, which should then be immediately rsynced to a remote server. I have a list of all the local files that need to be rsynced, but they will be mixed in with other files that I do not want rsynced every time. I know rsync will only send the changed files, but this directory structure and contents will grow very large over time and the delay would not be acceptable.
I know that when doing a remote rsync, I can specify a list of remote files, e.g....
rsync "host:/path/to/file1 /path/to/file2 /path/to/file3"
... but that does not work once I remove "host:" and try to specify the files locally.
I also know I can use --files-from, but that would require me to create a file ahead of time with a list of files that I want to rsync (and then delete it afterwards). I think it'd be cleaner to just effectively say "rsync these 4 specific files to this remote server", but I can't seem to get that to work.
Is there any way to do what I'm trying to accomplish, or do I have to resort to creating a tmp file with a list in it?
Thanks!
You should be able to list the files much like in the example you gave. I did this on my machine to copy two specific files from a directory with many other files present.
rsync test.sql test2.cpp myUser#myHost:path/to/files/synced/
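As an aside (not part of the answer above): if you do end up needing --files-from, bash process substitution can stand in for the temporary list file, so nothing has to be created and deleted by hand:
# the file list is fed to rsync via process substitution instead of a temp file;
# paths in the list are relative to the source directory argument (here, ./)
rsync -av --files-from=<(printf '%s\n' file1 file2 file3 file4) ./ myUser@myHost:path/to/files/synced/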