What is a bucket in Riak? I checked the documentation, but I was referred to bucket types and still could not grasp the concept of a bucket in Riak.
Any explanation? What is it, and why is it used?
I don't think there is much more to it than "a bucket is a grouping mechanism for data with some configuration assigned to it."
Quoting official docs (emphasis mine):
Buckets are used to define a virtual keyspace for storing Riak
objects. They enable you to define non-default configurations over
that keyspace concerning replication properties and other parameters.
In certain respects, buckets can be compared to tables in relational
databases or folders in filesystems, respectively. From the standpoint
of performance, buckets with default configurations are essentially
“free,” while non-default configurations, defined using bucket types,
will be gossiped around the ring using Riak’s
cluster metadata subsystem.
And from Bucket Types:
Buckets are essentially a flat namespace in Riak. They allow the same
key name to exist in multiple buckets and enable you to apply
configurations across keys.
Bucket: in certain respects, buckets can be compared to tables in relational databases or folders in filesystems.
Bucket Types
Bucket types are a new feature in Riak 2.0.
Bucket types allow groups of buckets to share configuration details. This allows Riak users and administrators to manage bucket properties more efficiently than in the older configuration system, which was based on per-bucket properties.
Let's dive in a little more:
The Using Bucket Types documentation covers the implementation, usage, and configuration of bucket types in great detail. Throughout the documentation there are code samples (e.g. Using Data Types), including code for creating the bucket types associated with each individual Riak Data Type.
Bucket types are a major improvement over the older system of bucket configuration. The ability to define a bucket configuration, and then change it if necessary, for an entire group of buckets is a powerful new way to approach data modeling. In addition, bucket types are more reliable, as buckets that have a given type (or configuration) only have their properties changed when the type is changed. Previously, it was possible to change the properties of a bucket only through client requests.
In prior versions of Riak, bucket properties were altered by clients interacting with Riak…in contrast, bucket types are an operational concept. The riak-admin bucket-type interface enables Riak users to manage bucket configurations at an operational level, without recourse to the Riak clients.
In versions of Riak prior to 2.0, all queries were made to a bucket/key pair as in the following example:
curl http://localhost:8098/buckets/my_bucket/keys/my_key
Now in Riak 2.0 with the addition of bucket types, there is an additional namespace on top of buckets and keys. The same bucket name can be associated with completely different data if it is used in accordance with a different bucket type.
curl http://localhost:8098/types/type1/buckets/my_bucket/keys/my_key
curl http://localhost:8098/types/type2/buckets/my_bucket/keys/my_key
If a request is made to a bucket/key pair without a specified bucket type, the default type will be used. The following requests are identical:
curl http://localhost:8098/buckets/my_bucket/keys/my_key
curl http://localhost:8098/types/default/buckets/my_bucket/keys/my_key
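The URL layout described above can be sketched as a small helper. This is only an illustration of the path structure: the function name `riak_key_url` and its `base` default are made up for this example, and the host and port are the ones used in the curl commands.

```python
from urllib.parse import quote


def riak_key_url(bucket, key, bucket_type=None, base="http://localhost:8098"):
    """Build the HTTP URL for an object, with or without a bucket type."""
    parts = []
    if bucket_type is not None:
        # The bucket type adds one more namespace level in front of the bucket.
        parts += ["types", quote(bucket_type, safe="")]
    parts += ["buckets", quote(bucket, safe=""), "keys", quote(key, safe="")]
    return base + "/" + "/".join(parts)


print(riak_key_url("my_bucket", "my_key"))
print(riak_key_url("my_bucket", "my_key", bucket_type="type1"))
print(riak_key_url("my_bucket", "my_key", bucket_type="type2"))
```

The last two URLs share a bucket name and a key but point at different objects, because the bucket type is part of the address.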
My C++ application needs to support caching of files downloaded from the network. I started to write a native LRU implementation when someone suggested I look at using SQLite to store an ID, a file blob (typically audio files) and the add/modify datetimes for each entry.
I have a proof of concept working well for the simple case where one client is accessing the local SQLite database file.
However, I also need to support multiple access by different processes in my application as well as support multiple instances of the application - all reading/writing to/from the same database.
I have found a bunch of posts to investigate, but I wanted to ask the experts here too: is this a reasonable use case for SQLite, and if so, what features/settings should I dig deeper into in order to support my multiple access case?
Thank you.
M.
Most filesystems are in effect databases too, and most store two or more timestamps for each file, i.e. related to the last modification and last access time allowing implementation of an LRU cache. Using the filesystem directly will make just as efficient use of storage as any DB, and perhaps more so. The filesystem is also already geared toward efficient and relatively safe access by multiple processes (assuming you follow the rules and algorithms for safe concurrent access in a filesystem).
The main advantage of SQLite may be a slightly simpler support for sorting the list of records, though at the cost of using a separate query API. Of course a DB also offers the future ability of storing additional descriptive attributes without having to encode those in the filename or in some additional file(s).
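As a rough illustration of the filesystem-based approach, here is a minimal Python sketch (the question is about C++, so this only shows the idea) that evicts the least recently accessed files until a cache directory fits a size budget. The function name `evict_lru` is made up for this example, and it assumes the filesystem actually maintains access times, which is not the case on volumes mounted with options such as `noatime`. It also does no locking between processes.

```python
import os


def evict_lru(cache_dir, max_bytes):
    """Delete least recently accessed files until the cache fits in max_bytes.

    Returns the list of file names that were removed.
    """
    entries = []
    for entry in os.scandir(cache_dir):
        if entry.is_file():
            st = entry.stat()
            entries.append((st.st_atime, st.st_size, entry.path))

    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in sorted(entries):  # oldest access time first
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(os.path.basename(path))
    return removed
```

Sorting by modification time instead (`st_mtime`) would turn this into a "least recently written" policy, which may be the safer choice where access times are unreliable.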
I saw that the Spring Cloud Data Flow tests use a HashMap to store the SpringDefinition. Is it possible to override DataFlowServerConfiguration so that streams and tasks are stored in memory, for example in the same kind of HashMap? If so, how?
I don't think it would be a trivial change. The server needs a backend to store its metadata. By default it actually uses H2 in memory, and it relies on the Spring Data JPA abstraction to give users the chance to select their RDBMS.
Storing on a different storage engine would require not only replacing all the *Repository definitions in several configuration modules, but also handling the pre-population of data that we do. It would become a bit hard to maintain this over time.
Is there a reason why a traditional RDBMS is not suitable here? Or, if you want in-memory storage, why not just go with the ephemeral approach of H2?
In the Riak documentation there are often examples showing how you could model an e-commerce datastore in a certain way. But here it is written:
In a production Riak cluster being hit by lots and lots of concurrent writes,
value conflicts are inevitable, and Riak Data Types
are not perfect, particularly in that they do not guarantee strong
consistency and in that you cannot specify the rules yourself.
From http://docs.basho.com/riak/latest/theory/concepts/crdts/#Riak-Data-Types-Under-the-Hood, last paragraph.
So, is it safe enough to use Riak as the primary datastore in an e-commerce app, or is it better to use another database with stronger consistency?
Riak out of the box
In my opinion, out of the box Riak is not safe enough to use as the primary datastore in an e-commerce app. This is because of the eventually consistent nature of Riak (and of a lot of the NoSQL solutions).
According to the CAP theorem, distributed datastores (Riak being one of them) can guarantee at most 2 of:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
Riak specifically errs on the side of Availability and Partition tolerance by providing eventual consistency for the data held in its datastore.
What Riak can do for an e-commerce app
Out of the box, Riak would be a good store for the content describing the items sold in your e-commerce app (content that is generally written once and read many times is a great use case for Riak). However, values such as:
the count of how many items are left
the money in a user's account
need to be handled carefully in a distributed datastore.
Implementing consistency in an eventually consistent datastore
There are several methods you can use, including:
Implement a serialization method when writing updates to values that need to be consistent (i.e. go through a single, controlled service that guarantees it will only update a single item sequentially). This would need to be done outside of Riak, in your API layer.
Change the replication properties of your consistent buckets so that you can 'guarantee' you never retrieve out of date data
At the bucket level, you can choose how many copies of data you want
to store in your cluster (N, or n_val), how many copies you wish to
read from at one time (R, or r), and how many copies must be written
to be considered a success (W, or w).
The above method is similar to using the strong consistency model available in the latest versions of Riak.
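The arithmetic behind tuning N, R and W can be sketched as follows. The usual rule of thumb is that every read overlaps every successful write when R + W > N. The function name is made up for this example, and the check only covers the replica-count arithmetic: it says nothing about concurrent writers or node failures, which is why the word 'guarantee' is in quotes above.

```python
def reads_see_latest_write(n_val, r, w):
    """True if every read quorum overlaps every write quorum (R + W > N)."""
    if not (1 <= r <= n_val and 1 <= w <= n_val):
        raise ValueError("r and w must be between 1 and n_val")
    return r + w > n_val


print(reads_see_latest_write(3, 2, 2))  # quorum reads and quorum writes
print(reads_see_latest_write(3, 1, 1))  # fast, but a read may miss the last write
```

With N=3, for example, R=2 and W=2 overlap in at least one replica, while R=1 and W=1 do not.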
Important note: with any of these datastores (distributed or not) you will in general:
Read the current data
Make a decision based on the current value
Change the data (decrement the Item count)
If all three of the above actions cannot be done in an atomic way (either by locking, or by failing the third step if the value was changed by something else), an e-commerce app is open to abuse. This issue also exists in traditional SQL storage solutions (which is why you have SQL transactions).
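The "fail the third step if the value was changed" variant can be sketched with a version counter. This is an in-memory toy, not Riak or SQL code: `StockItem`, `write_if_unchanged` and `buy_one` are names invented for this example, and the lock merely stands in for whatever atomic conditional write the real datastore offers.

```python
import threading


class StockItem:
    """In-memory stand-in for a stored value with a version number."""

    def __init__(self, count):
        self._lock = threading.Lock()
        self.count = count
        self.version = 0

    def read(self):
        with self._lock:
            return self.count, self.version

    def write_if_unchanged(self, new_count, expected_version):
        """Apply the write only if nobody changed the value since the read."""
        with self._lock:
            if self.version != expected_version:
                return False
            self.count = new_count
            self.version += 1
            return True


def buy_one(item, max_retries=5):
    """Read, decide, then write; retry if another writer got in between."""
    for _ in range(max_retries):
        count, version = item.read()
        if count <= 0:
            return False  # sold out
        if item.write_if_unchanged(count - 1, version):
            return True
    return False
```

A write based on a stale read is rejected instead of silently overwriting the newer value, so two buyers cannot both take the last item.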
I'm developing an online file storage service in mainly PHP and MySQL, where users will be able to upload files up to 10 - 20 GB in size.
Unregistered users will be able to upload files but not in a personal storage space, just a directory where all file uploads of unregistered users will be stored.
Registered users will get a fixed amount (that might increase in the future) of personal storage space and access to a file manager to easily manage and organize all their files. They'll also be able to set their files private (not downloadable by anyone but themselves) or public.
What would be a good possible directory set-up?
I'm thinking about a "personal" directory that will contain folders with the user's id as the folder name for each registered user.
Alongside the personal directory, there will be an "other" folder which will just contain every file that's been uploaded by unregistered users.
Both will contain uploaded files, each named after its corresponding row id (from the files table in the database).
ROOT
  FOLDER uploads
    FOLDER personal
      FOLDER 1
        FILE file_id1
        FILE file_id2
        (...)
      FOLDER 2
        FILE file_id3
        FILE file_id4
        (...)
      (...)
    FOLDER other
      FILE file_id5
      FILE file_id6
      (...)
This is the first time I'm dealing with a situation like this, and this concept is all I could come up with so far. Any suggestions are welcome!
Basically you need to address the following topics:
1. Security: From what you described, it is unclear who is allowed to read the files. If it is always "everybody reads everything", you can set up the file structure inside a web server virtual host. Otherwise, set up the folder structure in a "hidden" area and access it only via server-side scripts (e.g. copy on demand). The secure approach uses more resources, but leaves room for a technically optimized folder structure.
2. OS constraints: Each OS limits the number of items and/or files per folder. The actual limits depend on the OS-specific configuration of the file system. If I remember correctly, there are Linux setups that support 32000 items per folder. In the end the exact figure is not important; what matters is that your utilization planning does not exceed the limits on your servers. So if you plan to provide your service to 10 users, a single "other" folder will likely do; if you target a million users, you will probably need many "other" folders. If you also do not want to restrict the number of files your users can upload, you probably need a way to extend the folder per user. Personally, I apply a policy of never having more than 1000 items in a folder.
3. SEO requirements: If your service needs to be SEO compliant, it needs to present descriptive names to users, ideally without general categorization such as "Personal"/"Other". Your proposed structure may meet this requirement. However, the OS constraints may force you into a more technical physical structure (e.g. one where you chunk the item ID into groups of 3 digits and use those to make up your folder and file structure). On top of that you can implement a logical structure that converts IDs into names. However, such an implementation means file access via server-side scripts and therefore demands more resources. Alternatively, you could play with web server URL rewrites...
4. Consistency + Availability + Partition tolerance: Turning this into a real service likely requires a setup that balances these three. Separating the beast into a physical and a logical layer helps a lot here. Consistency + Availability + Partition tolerance would be dealt with at the logical layer. NoSQL (http://en.wikipedia.org/wiki/NoSQL) might be your way forward; see http://en.wikipedia.org/wiki/CAP_theorem for details on the topic.
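The "chunk the item ID into 3 digits" idea from the OS constraints and SEO points can be sketched like this. The function name, the `uploads` root and the 9-digit width are assumptions made for the example; with 3-digit chunks no directory ever holds more than 1000 entries, which matches the 1000-items policy mentioned above.

```python
def id_to_path(file_id, root="uploads", width=9, chunk=3):
    """Map a numeric ID to a nested path, e.g. 1234567 -> uploads/001/234/567."""
    digits = str(file_id).zfill(width)
    if len(digits) > width:
        raise ValueError("file_id does not fit in the configured width")
    # Split the zero-padded ID into fixed-size groups; the last group is the file name.
    parts = [digits[i:i + chunk] for i in range(0, width, chunk)]
    return "/".join([root] + parts)


print(id_to_path(1234567))
print(id_to_path(5))
```

A width of 9 covers IDs up to 999,999,999; a larger ID range would need a larger width, chosen before any files are stored.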
====================== UPDATE
From the comments we now know that you store metadata in a relational database, that you have a physical layer (files on disk) and a logical layer (access via PHP scripts), and that you base your physical file/folder layer on IDs.
This opens room to fully move any structural considerations to the relational database and maybe to improve the physical layer from the very beginning. So here are the tables of the sql database I would create:
======
users
======
id (unsigned INT, primary key)
username
password
isregisteredflag
...any other not relevant for the topic...
======
files
======
id (unsigned INT,primary key)
filename
_userid (foreign key to users.id)
createddate
fileattributes
...any other not relevant for the topic...
======
tag2file
======
_fileid (foreign key to files.id)
_tagid (foreign key to tags.id)
======
tags
======
id (unsigned INT,primary key)
tagname
Since this structure allows you to derive files from user IDs, and user IDs from files, you do not need to store that relation as part of your folder structure. You just name the files on the physical layer after files.id, which is a numeric value generated by the database. Since the ID is generated by the database, it is guaranteed to be unique. You can also now have tags, which give your users a richer categorization experience (if you do not like tags, you could do folders instead, also in the database).
Taking care of point 4 very much impacts your design. If you only address it after you have set up the whole thing, you potentially double your effort. Since everything is already set up to name files after numeric IDs, it is a very small step to store your physical files in a key-value store in a NoSQL database (rather than on the file system), which makes your system scalable as hell. This would mean employing a SQL database for meta and structure data and a NoSQL database for file content.
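For illustration, the tables above can be tried out with an in-memory SQLite database. This is a sketch under assumptions: the question uses MySQL, and the column types, NOT NULL constraints and the composite key on tag2file are guesses that the answer does not specify. The helper `files_for_user` is a made-up name.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    password TEXT NOT NULL,
    isregisteredflag INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    _userid INTEGER NOT NULL REFERENCES users(id),
    createddate TEXT,
    fileattributes TEXT
);
CREATE TABLE tags (
    id INTEGER PRIMARY KEY,
    tagname TEXT NOT NULL UNIQUE
);
CREATE TABLE tag2file (
    _fileid INTEGER NOT NULL REFERENCES files(id),
    _tagid INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (_fileid, _tagid)
);
"""


def files_for_user(conn, user_id):
    """Return (id, filename) for every file owned by the given user."""
    rows = conn.execute(
        "SELECT id, filename FROM files WHERE _userid = ? ORDER BY id",
        (user_id,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (id, username, password) VALUES (1, 'public', '')")
cur = conn.execute("INSERT INTO files (filename, _userid) VALUES ('song.mp3', 1)")
# The generated row id is what the file would be named on the physical layer.
print(cur.lastrowid, files_for_user(conn, 1))
```

The original file name lives only in the database; the name on disk is the generated ID, so renaming or re-tagging a file never touches the physical layer.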
By the way, to cover your public files I would assume you have a user "public" with ID=1. This ends up as some hardcoded data, which is generally considered ugly. However, as the "public" functionality is such a central element of your application, you can compensate by documenting it properly. Alternatively, you can add some more tables and blow up your code to cover the two different things in a 'clean' way.
In my opinion, it shouldn't actually matter which folder structure you have. Of course (as already mentioned), there are OS and FS restrictions, and you may want to spend a thought or two on scaling.
But in the end, I would recommend a more flexible approach to storage and retrieval:
Ok, files are physically stored somewhere in a file system.
But: There should be a database with meta information about the file like categories, tags, descriptions, modification dates, maybe even change revisions. Of course, it will also store the physical position of the file, which may or may not be on the same machine.
This database would be optimized for searching by those criteria. (There are a couple of libraries for semantical indexing/searching, depending on your language/framework.)
This way, you would separate the physical concerns from the logical/semantic ones. And if you or your users still want the hierarchical approach, you can always go with the category logic.
Finally, you will have a much more flexible and appealing file hosting service.
I would like to be able to create a Riak bucket over cURL. I have been searching online and can't seem to find a way to do it. I know there are ways to do it easily with the drivers, but I need to be able to do it with cURL for the SaaS application I am working on.
You would do a PUT, passing the bucket properties you want as a JSON object, e.g.
curl -v http://riak3.local:8098/riak/newBucket -X PUT -H Content-Type:application/json --data-binary '{"props":{"n_val":5}}'
The docs have more complete details.
One other thing: the important point to remember is that there is no way to 'create' a bucket explicitly with a call (via cURL or a client API).
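For anyone scripting this rather than typing the curl command, the same request can be assembled with Python's standard library. This sketch only builds the request object and does not send it; the host `riak3.local`, the bucket name and the `n_val` value are copied from the curl example, and `bucket_props_request` is a name made up here.

```python
import json
import urllib.request


def bucket_props_request(bucket, props, base="http://riak3.local:8098"):
    """Build (but do not send) the PUT request that sets bucket properties."""
    body = json.dumps({"props": props}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/riak/{bucket}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = bucket_props_request("newBucket", {"n_val": 5})
print(req.get_method(), req.full_url, req.data)
```

Passing `req` to `urllib.request.urlopen` would send it, which of course requires a reachable Riak node at that address.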
You can only create custom buckets via the call above.
The reason for that is -- buckets are simply a prefix to the keys. There is no object anywhere in the Riak system that keeps track of buckets. There is no file somewhere, no variable in memory, or anything like that. That is why the simple "list buckets" command is so expensive: Riak literally has to go through every key in the cluster, and assemble a list of buckets by looking at the key prefixes.
The only thing that exists as actual objects are buckets with non-default settings, ie, custom buckets. That's what that curl command above does -- it keeps track of some non-default settings for Riak to consult if a call ever comes in to that bucket.
Anyways, the moral of the story is: you don't need to create buckets in the normal course of operations. You can just start writing to them, and they will come into being (again, in the sense of, keys with bucket prefixes will come into being, which means they can now be iterated over by the expensive 'list buckets' call).
You only have to issue the call for custom buckets (and also, you don't want to do that too much, either, as there is a practical limit to the number of custom buckets created in a cluster, somewhere around 9000).
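The "a bucket is just a key prefix" model can be pictured with a toy in-memory store. This is not how Riak is implemented internally; `FlatKeyspace` and its methods are invented for the illustration. It only shows why no create step is needed and why listing buckets means walking every key.

```python
class FlatKeyspace:
    """Toy store where a bucket is nothing more than part of the key."""

    def __init__(self):
        self._objects = {}       # (bucket, key) -> value
        self._custom_props = {}  # only buckets with non-default settings

    def put(self, bucket, key, value):
        # No "create bucket" step: writing a key is all it takes.
        self._objects[(bucket, key)] = value

    def set_props(self, bucket, props):
        # Only custom (non-default) settings are tracked per bucket.
        self._custom_props[bucket] = dict(props)

    def list_buckets(self):
        # Expensive by design: walks every key to collect the prefixes.
        return sorted({bucket for bucket, _ in self._objects})


store = FlatKeyspace()
store.put("users", "alice", {"age": 30})
store.put("orders", "1001", {"total": 12.5})
print(store.list_buckets())
```

Setting properties on a bucket that has no keys yet does not make it show up in `list_buckets`, since in this model only stored keys carry bucket names.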
I also found that if you add a new object to a non-existent bucket, Riak will create that bucket on the fly.
Remember, buckets are automatically created when you add keys to them. There is no need to explicitly “create” a bucket (more on buckets and their properties further down the page.)
Bucket Properties and Operations