Riak: is there an advantage to shortening bucket names?

I plan to store a whole lot of data in a Riak bucket. As the documentation shows, the bucket name is stored in the object itself.
See:
http://docs.basho.com/riak/latest/ops/building/planning/cluster/
http://docs.basho.com/riak/latest/ops/building/planning/bitcask/
Should I make my bucket and bucket type names two characters long and cryptic to save resources?
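For a rough sense of what is at stake, here is a back-of-the-envelope sketch in Python following the shape of the Bitcask capacity-planning formula from the docs linked above. The static per-key overhead constant and the key counts are placeholder assumptions; check the Bitcask planning page for the exact figure for your Riak version.

# Rough Bitcask keydir RAM estimate; Bitcask keeps bucket + key in RAM for every object.
STATIC_PER_KEY_OVERHEAD = 44.5  # bytes, placeholder -- see the Bitcask capacity planning doc

def keydir_ram_bytes(num_keys, avg_key_len, bucket_name_len, n_val=3):
    # The bucket name is paid once per replica of every key.
    per_key = STATIC_PER_KEY_OVERHEAD + bucket_name_len + avg_key_len
    return per_key * num_keys * n_val

# Shortening a 30-character bucket name to 2 characters, with 100M keys and n_val=3:
saved = keydir_ram_bytes(100_000_000, 20, 30) - keydir_ram_bytes(100_000_000, 20, 2)
print(saved / 2**30, "GiB of keydir RAM saved across the cluster")

With those made-up numbers the saving is roughly 8 GiB cluster-wide, so the bucket-name length only starts to matter at very large key counts.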

Related

DynamoDB to S3 parquet without Glue but with transformation and file naming

I would like to take a single DynamoDB table which contains a data field with JSON data. This data has a schema that is dependent on the user associated with the table entry. Let's assume that the schema is part of the entry for simplicity.
I would like to stream this data into S3 as Parquet with embedded schema, transformation (i.e. just sending the data field) and custom file naming based on the user ID.
I am using CDK v2.
What I found so far:
I can go from DynamoDB to a Kinesis stream to Firehose, but Glue is required - I don't need it, nor am I sure how I would provide it with these various "dynamic" schemas.
CDK asks for the S3 filename - I see there may be the possibility of a dynamic field in the name, but I'm not sure how I would use that (I've seen a date, for example - I would need it to be something coming from the transform Lambda).
I think that using a Kinesis stream directly in the DynamoDB config may not be what I want, and I should just use regular DynamoDB Streams, but then... would I transform the data and pass it to a Firehose? Where does file naming etc. come in?
I've read so many docs, but they all seem to deal with a standard table-to-file flow and Athena.
Summary: how can I append streaming DynamoDB data to various Parquet files and transform the data / determine the file name from a Lambda in the middle? I think I have to go from a DynamoDB Streams Lambda handler and write directly to S3, but I'm not finding much in the way of examples. I'm also concerned about buffering etc.
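For what it's worth, here is a minimal sketch of that last idea: a DynamoDB Streams Lambda handler in Python (boto3 plus pyarrow) that groups records by user, builds a Parquet file per batch with the schema embedded, and writes it to S3 under a user-specific key. The attribute names (userId, data), the target bucket, and the key layout are assumptions, and note that Parquet objects in S3 cannot be appended to, so each invocation writes a new file.

import json
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from boto3.dynamodb.types import TypeDeserializer

s3 = boto3.client("s3")
deserializer = TypeDeserializer()
BUCKET = "my-parquet-bucket"  # assumed destination bucket

def handler(event, context):
    rows_by_user = {}
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        # Convert the DynamoDB-typed image into plain Python values.
        image = record["dynamodb"]["NewImage"]
        item = {k: deserializer.deserialize(v) for k, v in image.items()}
        user_id = str(item["userId"])        # assumed attribute name
        payload = json.loads(item["data"])   # assumed to be a JSON string attribute
        rows_by_user.setdefault(user_id, []).append(payload)

    for user_id, rows in rows_by_user.items():
        table = pa.Table.from_pylist(rows)   # schema inferred per batch
        sink = pa.BufferOutputStream()
        pq.write_table(table, sink)          # Parquet embeds the schema in the file
        key = f"user={user_id}/{context.aws_request_id}.parquet"
        s3.put_object(Bucket=BUCKET, Key=key, Body=sink.getvalue().to_pybytes())

Buffering here is whatever the stream's batch size gives you; if you need larger files you would have to stage batches (e.g. in S3 or SQS) and compact them in a separate step.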

How to order a Firebase Storage file list

I cannot find any function in Firebase Storage to list all files in a specific order, like ascending or descending. I have tried ListOptions, but it supports only two arguments: maxResults and pageToken.
The Cloud Storage List API does not have the ability to sort anything by some criteria you choose. If you need the ability to query objects in a bucket, you should consider also storing information about your objects in a database that can be queried with the flexibility you require. You will need to keep that database up to date as the contents of your bucket change (perhaps using Cloud Functions triggers). This is a common thing to implement, since Cloud Storage is optimized only for storing huge amounts of data for fast retrieval at extremely low cost - it is not also trying to be a database for object metadata.
Please also see:
gsutil / gcloud storage file listing sorted date descending?
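If the bucket is small enough to page through, one workaround consistent with this answer is to list everything and sort client-side. A minimal sketch with the google-cloud-storage Python client (Firebase Storage buckets are regular Cloud Storage buckets); the bucket name and sort key are assumptions:

from google.cloud import storage

client = storage.Client()
# List every object, then sort in memory -- the List API itself only returns
# results in lexicographic order by object name.
blobs = list(client.list_blobs("my-project.appspot.com"))  # assumed bucket name
newest_first = sorted(blobs, key=lambda b: b.time_created, reverse=True)
for b in newest_first[:10]:
    print(b.time_created, b.name)

This does not scale to very large buckets, which is why the answer above recommends mirroring the metadata into a queryable database.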

Making Glue delete source data after a job

AWS Glue is great for transforming data from a raw form into whichever format you need, and keeping the source and destination data sets synchronized.
However, I have a scenario where data lands in a 'landing area' bucket from untrusted external sources, and the first ETL step needs to be a data validation step which only allows valid data to pass into the data lake, while invalid data is moved to a quarantine bucket for manual inspection.
Invalid data includes:
bad file formats/encodings
unparseable contents
mismatched schemas
even some sanity checks on the data itself
The 'landing area' bucket is not part of the data lake, it is only a temporary dead drop for incoming data, and so I need the validation job to delete the files from this bucket once it has moved them to the lake and/or quarantine buckets.
Is this possible with Glue? If the data is deleted from the source bucket, won't Glue end up removing it downstream in a subsequent update?
Am I going to need a different tool (e.g. StreamSets, NiFi, or Step Functions with AWS Batch) for this validation step, and only use Glue once the data is in the lake?
(I know I can set lifecycle rules on the bucket itself to delete the data after a certain time, like 24 hours, but in theory this could delete data before Glue has processed it, e.g. in case of a problem with the Glue job)
Please see purge_s3_path in the docs:
glueContext.purge_s3_path(s3_path, options={}, transformation_ctx="")
Deletes files from the specified Amazon S3 path recursively.
Also, make sure your AWSGlueServiceRole has s3:DeleteObject permissions
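A minimal usage sketch inside a Glue job script; the landing path is a placeholder, and the retentionPeriod option (number of hours of recent files to keep) is taken from the purge_s3_path documentation, so verify it against your Glue version:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# ... read from the landing area, validate, write to the lake and quarantine buckets ...

# Delete everything under the landing prefix once the job has moved the data on.
glueContext.purge_s3_path(
    "s3://landing-area-bucket/incoming/",   # placeholder path
    options={"retentionPeriod": 0},         # 0 = do not keep recent files
    transformation_ctx="purge_landing",
)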
Your Glue environment comes with boto3. You would be better off using the boto3 S3 client/resource to delete the landing files after you've completed processing the data via Glue.
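A sketch of that boto3 approach (bucket and prefix names are placeholders); scoping the delete to the per-run prefix the job just processed avoids removing files that land mid-run:

import boto3

s3 = boto3.resource("s3")
landing = s3.Bucket("landing-area-bucket")   # placeholder bucket name

# Batch-delete everything under the prefix this run has already copied
# to the lake / quarantine buckets.
landing.objects.filter(Prefix="incoming/run-2021-06-01/").delete()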

How do you delete an entire bucket in riak 2.0 using curl?

My buckets do not use Riak 2 object types, and I could not figure out much from the 2.0 documentation.
A bucket doesn't really exist. It is just a namespace that contains behaviors and settings to use when reading or storing keys. You can reset the bucket to default settings, but in order to remove the data you would need to delete each key.
You could use the $bucket index to get a list of the keys, and you could probably write a map reduce job that would delete all of the keys in a certain bucket. However, either of those options would be about as heavy on the cluster as a key listing.
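As a concrete illustration of the key-listing route, here is a Python sketch against the HTTP API using requests; the node address and bucket name are assumptions, and as noted above this walks the whole keyspace, so it is expensive on a busy cluster:

import requests
from urllib.parse import quote

RIAK = "http://localhost:8098"   # assumed node address
BUCKET = "mybucket"              # assumed bucket name

# keys=true returns every key in the bucket -- as heavy as any key listing.
resp = requests.get(f"{RIAK}/buckets/{BUCKET}/keys", params={"keys": "true"})
resp.raise_for_status()
for key in resp.json().get("keys", []):
    requests.delete(f"{RIAK}/buckets/{BUCKET}/keys/{quote(key, safe='')}")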

Is it possible (and wise) to add more data to the riak search index document, after the original riak object has been saved (with a precommit hook)?

I am using riak (and riak search) to store and index text files. For every file I create a riak object (the text content of the file is the object value) and save it to a riak bucket. That bucket is configured to use the default search analyzer.
I would like to store (and be able to search by) some metadata for these files. Like date of submission, size etc.
So I have asked on IRC, and also given it quite some thought.
Here are some solutions, though they are not as good as I would like:
I could have a second "metadata" object that stores the data in question (maybe in another bucket) and have it indexed, etc. But that is not a very good solution, especially if I want to be able to do combined searches like value:someword AND date:somedate.
I could put the contents of the file inside a JSON object like {"date": somedate, "value": "some big blob of text"}. This could work, but it's going to put too much load on the search indexer, as it will have to first deserialize a big JSON object (and those files are sometimes quite big).
I could write a custom analyzer/indexer that reads my file object and generates/indexes the metadata in question. The only real problem here is that I have a hard time finding documentation on how to do that. It is also probably going to be a bit of an operational PITA, as I will need to push some Erlang code to every Riak node (and remember to do that when I update the cluster, when I add new nodes, etc.). I might be wrong on this; if so, please correct me.
So the best solution for me would be if I could alter the riak search index document, and add some arbitrary search fields to it, after it gets generated. Is this possible, is this wise, and is there support for this in libraries etc.? I can certainly modify the document in question "manually", as a bucket with index documents gets automatically created, but as I said, I just don't know what's the right thing to do.
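For reference, here is what the second approach above (one JSON object holding both the metadata and the text) looks like over the HTTP API with Python's requests. The /solr/<bucket>/select endpoint is the Solr-compatible query interface of the old riak_search; the host, bucket, field names and query values are all assumptions, and this is exactly the option the question worries will overload the indexer, so it is shown only for concreteness.

import json
import requests

RIAK = "http://localhost:8098"   # assumed node address
BUCKET = "files"                 # assumed search-enabled bucket

# Store the metadata and the text blob in one indexed JSON document.
doc = {"date": "20140501", "size": 2048, "value": "some big blob of text"}
requests.put(
    f"{RIAK}/buckets/{BUCKET}/keys/report-1",
    data=json.dumps(doc),
    headers={"Content-Type": "application/json"},
)

# Combined search across the indexed fields.
resp = requests.get(
    f"{RIAK}/solr/{BUCKET}/select",
    params={"q": "value:someword AND date:20140501", "wt": "json"},
)
print(resp.json())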
