I am a beginner with ZooKeeper. I would like to know what "data" means when using the create or set command. I thought ZooKeeper does not store data, so what is this "data"? Also, are znodes created automatically, or should we create them manually using CLI commands?
ZooKeeper does store data. When you create a node, you can set this data. You can update the data using setData. It's just an array of bytes. It's up to you to define what it actually is.
However, ZooKeeper is not meant to be a database. Databases usually get larger and larger while a system is used. ZooKeeper works best when it only stores a small amount of data that doesn't grow over time; basically, only data that is used to synchronize a distributed system.
It's up to you how and when you create znodes. It's certainly easier to deploy when you create them automatically. Usually you have multiple clients; if all of them try to create the same nodes, they will run into conflicts, so make sure to handle this.
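For example, here is a minimal sketch using the kazoo Python client; the connection string, path, and payload below are illustrative assumptions, not part of the original answer.

# Minimal sketch with the kazoo client; host, path and payload are assumptions.
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

try:
    # The "data" is just a byte array you attach to the znode.
    zk.create("/app/config", b'{"feature_x": true}', makepath=True)
except NodeExistsError:
    # Another client already created the node; that's fine, just move on.
    pass

# Update the data later; the CLI's "set" maps to this call.
zk.set("/app/config", b'{"feature_x": false}')

data, stat = zk.get("/app/config")
print(data, stat.version)

zk.stop()

Catching NodeExistsError (or using a helper such as ensure_path) is one way to deal with several clients racing to create the same node.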
I am using SQLite with SQLAlchemy on an embedded Linux system (think Raspberry Pi) to do some data logging.
The DB has a schema for configuration of sensors and ADC boards. It has a measurement table with columns for timestamp, value, adc_board_id (FK), and channel_id (FK).
Measurements are saved to the DB every minute.
However, I want to update an LCD with live measurements (every second).
The LCD app is a separate process and gets data to display from the DB.
Is there a way to have a table (a special table) just for live measurements that is only RAM based?
I want it accessible via the DB (like shared memory) but never to persist (i.e. never written to disk).
NOTE: I want most of the tables to persist on disk, but specify one (or more) special tables that never get written to disk.
Would that even work with separate processes accessing the DB?
The table will be written by only one process, but may be read by many processes, possibly via a View (once I learn how to do that with SQLAlchemy).
I did see some SQLite documentation that states:
The database in which the new table is created. Tables may be created in the main database, the temp database, or in any attached database.
Maybe using the temp database could do the trick?
Would it be accessible via other processes?
Alternatively, an attached database that lives on a ramdisk (/dev/shm/...)?
Are there any other techniques with DBs/SQLite/SQLAlchemy to achieve the same outcome?
Storing an SQLite DB in memory while sharing it with another process
You can do this with a "ramdisk". This works just like a normal file-system, but uses RAM as physical storage. You can create one using the tmpfs filesystem:
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
After that you can just store your SQLite database in /mnt/ramdisk and access it with another application.
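For example, here is a minimal SQLAlchemy sketch that keeps the main database on disk and attaches a second database file on the tmpfs mount; the paths and table layout are assumptions for illustration.

# Assumed paths: main DB at /data/logger.db, ramdisk mounted at /mnt/ramdisk.
from sqlalchemy import create_engine, event, text

engine = create_engine("sqlite:////data/logger.db")

@event.listens_for(engine, "connect")
def attach_live_db(dbapi_conn, connection_record):
    # Attach the tmpfs-backed file on every new pooled connection, so the
    # "live" schema is visible next to the persistent tables but never
    # touches the SD card / disk.
    dbapi_conn.execute("ATTACH DATABASE '/mnt/ramdisk/live.db' AS live")

with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE IF NOT EXISTS live.measurement ("
        " ts REAL, value REAL, adc_board_id INTEGER, channel_id INTEGER)"
    ))
    conn.execute(text(
        "INSERT INTO live.measurement VALUES (strftime('%s','now'), 3.3, 1, 2)"
    ))

Any other process can open the same pair of files with the same engine setup for reading; as noted below, just keep the writes to a single process.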
Can two processes access the same database?
Technically yes, but in the case of SQLite you should ensure that only one process has write access. This seems to be feasible in your case.
I'd like to use pglogical to replicate a set of tables, but I want to make all of my changes downstream from the master: to avoid risk, I don't want to make any modifications to the master database. I'd also like to start using pglogical now so we get familiar with the technology and can include it across all of our databases in our next release.
I don't need constant updates, so I came up with a plan: a cron job (a rough sketch in code follows the steps) that:
Turns off streaming replication to a standby
Makes this standby a logically-replicating master (just for logical replication, no writes):
    Stop postgresql
    Copy off the data dir
    Make config changes
    Start postgresql
    Create the pglogical extension
Catches up logical replication
Makes this database a streaming standby without logical replication again:
    Stop postgresql
    Replace the data dir with the previous copy
    Revert config changes
    Start postgresql
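For what it's worth, a rough Python sketch of that cron job might look like the following; every path and command here is an assumption, the config handling is left out, and the real steps depend on your PostgreSQL version and setup.

# Rough sketch only; paths, commands and config handling are assumptions.
import shutil
import subprocess

PGDATA = "/var/lib/postgresql/data"
PGDATA_COPY = "/var/lib/postgresql/data.pre-logical"

def run(*cmd):
    subprocess.run(cmd, check=True)

def make_logical_master():
    # Stop the standby and keep a copy of the data dir so we can roll back.
    run("pg_ctl", "-D", PGDATA, "stop", "-m", "fast")
    shutil.copytree(PGDATA, PGDATA_COPY)
    # Apply the config changes that turn it into a logically-replicating
    # master (e.g. wal_level=logical, load pglogical); omitted here.
    run("pg_ctl", "-D", PGDATA, "start")
    run("psql", "-c", "CREATE EXTENSION IF NOT EXISTS pglogical;")
    # ... let the logical subscribers catch up here ...

def restore_streaming_standby():
    run("pg_ctl", "-D", PGDATA, "stop", "-m", "fast")
    shutil.rmtree(PGDATA)
    shutil.move(PGDATA_COPY, PGDATA)
    run("pg_ctl", "-D", PGDATA, "start")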
My question - does this approach even make sense? Is there some easy way to accomplish this that I'm totally missing?
I'm thinking about learning JanusGraph for my new big project, but I can't understand some things.
JanusGraph can be used like any database and supports "insert", "update", and "delete" operations, so JanusGraph will write data into Cassandra or another database to store it, right?
Where does JanusGraph store the nodes, edges, attributes, etc.? It will write these into the database, right?
Does this data have to be loaded into memory by JanusGraph, or will it be read from Cassandra all the time?
Does the data that JanusGraph reads have to be loaded into JanusGraph on every query, or will it run selects against the database to retrieve only the data I need?
Is the data retrieved from the database only what I need, or will JanusGraph read all records in the database all the time?
Should I use JanusGraph in my project in production, or should I wait until it becomes production-ready?
I'm developing a kind of social network that needs to store friendships, posts, comments, and user blocks, and I need to do some Elasticsearch searching too. In this case, what database backend should I use?
JanusGraph will write data into Cassandra or another database to store it, right?
Where does JanusGraph store the nodes, edges, attributes, etc.? It will write these into the database, right?
JanusGraph will write the data into whatever storage backend you configure it to use. This includes Cassandra. It writes this data into the underlying database using the data model roughly outlined here.
Does this data have to be loaded into memory by JanusGraph, or will it be read from Cassandra all the time?
Is the data retrieved from the database only what I need, or will JanusGraph read all records in the database all the time?
JanusGraph will only load into memory the vertices and edges that you touch during a query/traversal. So if you do something like:
graph.traversal().V().hasLabel("My Amazing Label");
Janus will read and load into memory only the vertices with that label. So you don't need to worry about initializing a graph connection and then waiting for the entire graph to be serialised into memory before you can query. Janus is a lazy reader.
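For instance, here is a minimal sketch using the gremlin_python driver against a Gremlin Server fronting JanusGraph; the server URL and the "person"/"knows" labels are assumptions, not from the original answer.

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Connect to a Gremlin Server fronting JanusGraph (URL is an assumption).
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Only vertices with this label are read from the storage backend and
# pulled into memory, never the whole graph.
people = g.V().hasLabel("person").toList()

# A two-hop friends-of-friends traversal, again touching only what it needs.
fof = g.V().has("person", "name", "alice").out("knows").out("knows").toList()

conn.close()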
Should I use JanusGraph in my project in production, or should I wait until it becomes production-ready?
That is entirely up to you and your use case. Janus is already being used in production, as can be seen here at the bottom of the page. Janus was forked from, and improves on, TitanDB, which is also used in several production systems. So if you're wondering "is it ready", I would say yes; it's clearly ready given its existing uses.
what database backend should I use?
Again, that's entirely up to you. I use Cassandra because it can scale horizontally and I find it easier to work with. It also seems to suit all different sizes of data.
I have toyed with Google Bigtable and that seems very powerful as well. However, it's only really suited for VERY big data, and it's also only available in the cloud, whereas Cassandra can be hosted locally very easily.
I have not used Janus with HBase or BerkeleyDB so I can't comment there.
It's very simple to change between backends, though (all you need to do is adjust some configs and check that your dependencies are in place), so during development feel free to play around with the backends. You only really need to commit to a backend when you go to production or are more confident about each one.
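For illustration, switching backends is mostly a matter of a few lines in the JanusGraph properties file; the hostnames and the choice of Elasticsearch for the index backend below are assumptions, not a recommendation.

# janusgraph.properties, illustrative values only
storage.backend=cql
storage.hostname=127.0.0.1
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1

Changing storage.backend (for example to hbase or berkeleyje) along with its host settings is essentially the whole switch, provided the matching driver dependencies are on the classpath.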
When considering which storage backend to use for a new project, it's important to consider what tradeoffs you'd like to make. In my personal projects, I've enjoyed using NoSQL graph databases due to the following advantages over relational DBs:
Not needing to migrate schemas increases productivity when rapidly iterating on a new project
Traversing a heavily normalized data-model is not as expensive as with JOINs in an RDBMS
Most include in-memory configurations which are great for experimenting & testing.
Support for multi-machine clusters and Partition Tolerance.
Here are sample JanusGraph and Neo4j backends written in Kotlin:
https://github.com/pm-dev/janusgraph-exploration
https://github.com/pm-dev/neo4j-exploration
The main advantage of JanusGraph is the flexibility of plugging in whichever storage backend you'd like.
First of all, I should make clear that I am a newbie, so please excuse me if I don't use the correct terminology in my question.
This is my scenario:
I need to analyze large quantities of text like tweets, comments, mails, etc. The data is currently inserted into an Amazon RDS MySQL instance as it occurs.
Later I run an R job locally using RTextTools (http://www.rtexttools.com/) over that data to produce my desired results. At this point it might be important to make clear that the R script analyzes the data and writes results back into the MySQL table, which will later be used to display them.
The issue I am having lately is that the job takes about 1 hour each time I run it, and I need to do it at least twice a day... so using my local computer is not an option anymore.
Looking for alternatives, I started to read about Amazon Elastic MapReduce (EMR), which at first sight seems to be what I need, but this is where my questions and confusion start.
I read that data for EMR should be pulled from an S3 bucket. If that's the case, then I must start storing my data as JSON or similar in an S3 bucket and not in my RDS instance, right?
At this point I read that it is a good idea to create Hive tables and then use RHive to read the data, so that RTextTools can do its job and write the results back to my RDS tables. Is this right?
And now the final and most important question: is taking all this trouble worth it versus just running an EC2 instance with R and running my R scripts there? Will I reduce computing time?
Thanks a lot for your time; any tip in the right direction will be much appreciated.
Interesting. I would like to suggest a few things.
You can certainly store data in S3, but you will first have to write your data to files (txt, CSV, JSON, etc.) and then push those files to S3; S3 stores whole objects rather than individual records. You can probably get the benefit of CloudFront deployed in front of S3 for fast retrieval of data. You can also keep using RDS; the performance difference is something you will have to analyze yourself.
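As a sketch of pushing a batch of records to S3 with boto3; the bucket name, key layout, and the use of JSON Lines are assumptions for illustration.

# Sketch: write records to a local file, then upload it as one S3 object.
import json
import boto3

records = [
    {"id": 1, "text": "a tweet to analyze"},
    {"id": 2, "text": "a comment to analyze"},
]

# Write the records to a local file first (one JSON object per line).
with open("batch-0001.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Then upload the file as a single object.
s3 = boto3.client("s3")
s3.upload_file("batch-0001.jsonl", "my-analysis-bucket", "input/batch-0001.jsonl")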
Writing results back to RDS shouldn't be any issue. EMR basically creates two EC2 instances, ElasticMapReduce-master and ElasticMapReduce-slave, which can be used to communicate with RDS.
See, I think it's worth trying an EC2 instance with R, but to reduce the computation time you might have to go with an expensive EC2 instance, or set up autoscaling and divide the task between different instances. That is just like implementing the whole parallel-computation logic yourself, whereas in the case of EMR you get all of that MapReduce logic built in. So first try EMR, and if it doesn't work out well for you, try a fresh EC2 instance with R.
Let me know how it goes, thank you.
You should consider trying EMR. S3+EMR is very much worth trying out if the 1-hour window is a constraint. For your type of processing workload, you might save cycles by using a scalable, on-demand Hadoop/Hive platform. Obviously, there are some learning, re-platforming, and ongoing cluster-management costs related to the trial and the switch; they are non-trivial. Alternatively, consider services such as Qubole, which also runs on EC2+S3 and provides higher-level (and potentially easier to use) abstractions.
Disclaimer: I am a product manager at Qubole.
I have developed a CRM for my company. Next, I would like to take that system and make it available for others to use in a hosted format, very much like salesforce.com. The question is what type of database structure I should use. I see two options:
Option 1. Each time a company signs up, I clone the master database for them.
The disadvantage of this is that I could end up with thousands of databases. That's a lot of databases to back up every night. My CRM uses cron jobs for maintenance; those jobs would have to run on all of the databases.
Each time I upgrade the system with a new feature and need to add a new column to the database, I will have to add that column to thousands of databases.
Option 2. Use only one database.
At the beginning of EVERY table, add a "CompanyID" column. To EVERY SQL statement, add "and companyid={companyid}".
The advantage of this method is the simplicity of only one database: just one database to back up each night, and just one database to update when needed.
The disadvantage is that if I get 1,000 companies signing up and each wants to store data on 100,000 leads, that's 100,000,000 rows in the leads table, which worries me.
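As an illustration of the single-database approach, here is a small sketch; SQLite stands in for whatever engine you actually use, and the table and column names are made up.

# Sketch of the single-database ("Option 2") approach; names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE lead (
        id         INTEGER PRIMARY KEY,
        company_id INTEGER NOT NULL,   -- the tenant discriminator on every table
        name       TEXT,
        email      TEXT
    )
""")
# A composite index keeps per-tenant queries fast even with very large tables.
db.execute("CREATE INDEX idx_lead_company ON lead (company_id, id)")

def leads_for_company(company_id):
    # Every query is scoped to the tenant; use bound parameters, never string
    # formatting, so one tenant can never read another's rows.
    return db.execute(
        "SELECT id, name, email FROM lead WHERE company_id = ?",
        (company_id,),
    ).fetchall()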
Does anyone know how the online hosted CRMs like salesforce.com do this?
Thanks
Wouldn't you clone the table structure for each new client database? All sheets are archived in the master database and indexed per client; each client clone is hash-verified before it can access its specific sheet, running through a host machine at the front end of the master system whose primary role is directing requests. Internal access is batched to groups of read/write slave systems. Obviously, set RAID configurations to copy both in real time and on a schedule, and balance the request load against system resources. That way you separate the security concerns from the UI and the connection to the retention schema. It seems like simplified structures and reduced policy requests cut down the request overhead in query processing; essentially a man-in-the-middle approach from the inside out.
1) Splitting your data into smaller groups in a safe, unthinking way (such as one database per grouping) is almost always best if you want to scale. In this case, unless for some reason you want to query between companies, keeping them in separate databases is best.
2) If you are updating all of the databases by hand, you are doing something wrong if you want to scale. You'd want to automate the process.
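For example, a rough sketch of automating one schema change across many per-tenant databases; the driver (PyMySQL), credentials, and the way the tenant databases are listed are all assumptions.

# Sketch: apply the same migration to every tenant database.
import pymysql

TENANT_DBS = ["crm_acme", "crm_globex", "crm_initech"]  # normally read from a registry table
MIGRATION = "ALTER TABLE lead ADD COLUMN source VARCHAR(64)"

conn = pymysql.connect(host="localhost", user="admin", password="secret")
try:
    with conn.cursor() as cur:
        for db_name in TENANT_DBS:
            cur.execute(f"USE `{db_name}`")   # switch to the tenant's database
            cur.execute(MIGRATION)            # apply the same change everywhere
    conn.commit()
finally:
    conn.close()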
3) Ultimately, salesforce.com uses this as the basis of their DB infrastructure:
http://blog.database.com/blog/2011/08/30/database-com-is-open-for-business/