Multiple Graphite Instances writing to one Whisper DB

I am creating a Graphite solution to collect test data at run time.
Can I have multiple independent Graphite instances writing to one Whisper DB?
Is this supported by carbon?
How can I do it?
Thanks, all.

Related

Possible to create pipeline that writes an SQL database to MongoDB daily?

TL;DR I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-bigquery to write a Node/Express API directly against BigQuery, or (b) create a daily job that writes my (entire) BigQuery DB over to MongoDB, and then use mongoose to write a Node/Express API with MongoDB?
I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database with a React/Node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to a BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:
Do a daily write/convert/transfer/migrate (whatever the correct verb is) from the BigQuery database to MongoDB. I already have a Node/Express API written using mongoose, connected to a MongoDB cluster, and this approach would allow me to keep that API.
Use the nodejs-bigquery library to create a Node API that is directly connected to BigQuery. My app would change from a MERN stack to a (BQ)ERN stack. I would have to re-write the Node/Express API to work with BigQuery, but I would no longer need MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database if I am looking up a single entry, since it's not meant to be used like Mongo or a SQL database (it has no indexes, so a single-row retrieval runs as a slow full table scan). Most of my API calls fetch very little data from the database.
I am not sure which approach is best. I don't know if having 2 databases for 1 web application is bad practice. I don't know if it's possible to do (1) with daily transfers from one DB to the other, and I don't know how slow BigQuery will be if I use it directly with my API. I think that if it is easy to add (1) to my data engineering workflow, it is the preferred option, but again, I am not sure.
I am going with (1). It shouldn't be too much work to write a Python script that queries tables from BigQuery, transforms the data, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), but this is much easier than writing a whole new Node/BigQuery API.
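As a rough illustration only (not the poster's actual script), a minimal sketch of such a daily transfer using the google-cloud-bigquery and pymongo packages might look like the following; the project, dataset, table names, and connection string are placeholders:

    # Minimal sketch of a daily BigQuery -> MongoDB transfer (placeholder names throughout).
    from google.cloud import bigquery
    from pymongo import MongoClient

    def transfer_table(bq_client, mongo_db, table_name):
        # Pull the table (or a transformed view of it) out of BigQuery.
        rows = bq_client.query(f"SELECT * FROM `my_project.my_dataset.{table_name}`").result()
        docs = [dict(row) for row in rows]

        # Replace the matching Mongo collection wholesale; handling incremental
        # changes (upserts keyed on an id or timestamp) is deliberately left out.
        collection = mongo_db[table_name]
        collection.delete_many({})
        if docs:
            collection.insert_many(docs)

    if __name__ == "__main__":
        bq_client = bigquery.Client(project="my_project")              # placeholder project
        mongo_db = MongoClient("mongodb://localhost:27017")["my_app"]  # placeholder URI and db
        for table in ["users", "orders"]:                              # placeholder tables
            transfer_table(bq_client, mongo_db, table)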
FWIW, in a past life I worked on a web ecommerce site that had 4 different DB back ends (Mongo, MySQL, Redis, ElasticSearch), so having more than 1 is not an issue at all, but you need to consider one of them the DB of record, i.e. if anything does not match between them, one is the source of truth and the other is suspect. In my example, Redis and ElasticSearch were nearly ephemeral: blow them away and they get recreated from the underlying MySQL and Mongo sources. Running MySQL and Mongo at the same time was a bit odd, but that was because we were doing a slow-roll migration, meaning various record types were being transitioned from MySQL over to Mongo. That process looked a bit like:
- The ORM layer writes to both MySQL and Mongo; reads still come from MySQL.
- Data is regularly compared between the two.
- After a few months with no irregularities, writes to MySQL are turned off and reads are moved to Mongo.
The end goal was no more MySQL; everything was Mongo. I ran down that tangent because it seems like you could do something similar: write to both DBs in whatever DB abstraction layer you use (ORM, DAO, or whatever else) and eventually move the reads, as appropriate, to wherever they need to go. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing is reached before sending them.
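For illustration only (the migration's actual code isn't shown here), a dual-write data-access layer along these lines is one way to do it in Python; the customers table, its columns, and the connection objects are hypothetical:

    # Sketch of the dual-write pattern: every save goes to both stores, reads stay
    # on the database of record until the cutover. Table and column names are made up.
    class CustomerDAO:
        def __init__(self, mysql_conn, mongo_db, read_from="mysql"):
            self.mysql_conn = mysql_conn        # e.g. a mysql.connector connection
            self.mongo = mongo_db["customers"]  # e.g. a pymongo database handle
            self.read_from = read_from          # flip to "mongo" after the bake-in period

        def save(self, customer):
            # Write to MySQL (still the source of truth)...
            cur = self.mysql_conn.cursor()
            cur.execute(
                "REPLACE INTO customers (id, name, email) VALUES (%s, %s, %s)",
                (customer["id"], customer["name"], customer["email"]),
            )
            self.mysql_conn.commit()
            # ...and mirror the same record into Mongo.
            self.mongo.replace_one({"_id": customer["id"]}, customer, upsert=True)

        def get(self, customer_id):
            if self.read_from == "mysql":
                cur = self.mysql_conn.cursor(dictionary=True)
                cur.execute("SELECT id, name, email FROM customers WHERE id = %s", (customer_id,))
                return cur.fetchone()
            return self.mongo.find_one({"_id": customer_id})

Buffering writes until a threshold is reached, as mentioned above, would also live in this layer.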
With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is that the data is always stale by a day.

Zookeeper data and Znode

I am a beginner with ZooKeeper. I would like to know what "data" means when using the create or set command. ZooKeeper does not store data, so what is this "data"? Also, are znodes created automatically, or should we create them manually using CLI commands?
ZooKeeper does store data. When you create a node, you can set this data. You can update the data using setData. It's just an array of bytes. It's up to you to define what it actually is.
However, ZooKeeper is not meant to be a database. Databases usually get larger and larger while a system is used. ZooKeeper works best when it only stores a small amount of data that doesn't grow in time. Basically only data that is used to synchronize a distributed system.
It's up to you how and when you create znodes. It's certainly easier to deploy when you create them automatically. Usually you have multiple clients; if all of them try to create the same nodes, they will run into conflicts. Make sure to handle this.
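As a small sketch of what that looks like from code (using the kazoo Python client, which is my choice here rather than anything from the question; the path and payload are made up):

    # A znode's "data" is just a byte payload attached at create time and updated
    # later with a set; racing creates from multiple clients must be handled.
    from kazoo.client import KazooClient
    from kazoo.exceptions import NodeExistsError

    zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder ensemble address
    zk.start()

    try:
        # Create the znode with an initial byte-array payload.
        zk.create("/app/config", b'{"feature_x": true}', makepath=True)
    except NodeExistsError:
        # Another client created it first; this is the conflict to handle.
        pass

    # Update the payload (the equivalent of setData / the CLI's `set`).
    zk.set("/app/config", b'{"feature_x": false}')

    data, stat = zk.get("/app/config")
    print(data.decode(), stat.version)

    zk.stop()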

Is migrating from RDS to Elastic MapReduce + Hive the right choice?

First of all, I must make clear that I am a newbie; please excuse me if I don't use the correct terminology in my question.
This is my scenario:
I need to analyze large quantities of text like tweets, comments, mails, etc. The data is currently inserted into an Amazon RDS MySQL instance as it occurs.
Later I run an R job locally using RTextTools (http://www.rtexttools.com/) over that data to produce my desired results. At this point it might be important to make clear that the R script analyzes the data and writes results back into the MySQL table, which will later be used to display them.
The issue I am having lately is that the job takes about 1 hour each time I run it, and I need to run it at least 2 times a day, so using my local computer is not an option anymore.
Looking for alternatives, I started to read about Amazon Elastic MapReduce, which at first sight seems to be what I need, but here start my questions and confusions about it.
I read that data for EMR should be pulled from an S3 bucket. If that's the case, then I must start storing my data as JSON or similar in an S3 bucket and not in my RDS instance, right?
At this point I read that it is a good idea to create Hive tables and then use RHive to read the data so that RTextTools can do its job and write the results back to my RDS tables. Is this right?
And now the final and most important question: is taking all this trouble worth it vs. running an EC2 instance with R and running my R scripts there? Will I reduce computing time?
Thanks a lot for your time; any tip in the right direction will be much appreciated.
Interesting. I would like to suggest a few things.
You can totally store data in S3, but you will have to first write your data to files (txt, CSV, JSON lines, etc.) and then push those files to S3; you can't load the raw records into S3 directly. You can probably get the benefit of CloudFront deployed over S3 for fast retrieval of data. You can also keep using RDS; the performance difference is something you will have to analyze yourself.
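A minimal sketch of that "write to a file, then push it to S3" step, assuming boto3 and placeholder bucket, key, and path names:

    # Export rows to a local newline-delimited JSON file, then upload it to S3
    # where EMR/Hive can read it. Bucket, key, and local path are placeholders.
    import json
    import boto3

    records = [{"id": 1, "text": "example tweet"}]   # whatever rows you pull from RDS

    with open("/tmp/tweets.json", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    s3 = boto3.client("s3")
    s3.upload_file("/tmp/tweets.json", "my-emr-input-bucket", "tweets/part-0000.json")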
Writing results back to RDS shouldn't be an issue. EMR basically creates two kinds of EC2 instances, ElasticMapReduce-master and ElasticMapReduce-slave, which can be used to communicate with RDS.
I think it's worth trying out an EC2 instance with R, but to reduce the computation time you might have to go with an expensive EC2 instance, or set up autoscaling and divide the task between different instances. That is basically implementing the whole parallel-computation logic yourself, whereas with EMR you get all of that map-reduce logic built in. So try EMR first, and if it doesn't work out well for you, try a new EC2 instance with R.
Let me know how it goes. Thank you.
You should consider trying EMR. S3+EMR is very much worth trying out if the 1-hour window is a constraint. For your type of processing workload, you might save cycles by using a scalable, on-demand Hadoop/Hive platform. Obviously, there are some learning, re-platforming, and ongoing cluster-management costs related to the trial and the switch, and they are non-trivial. Alternatively, consider services such as Qubole, which also runs on EC2+S3 and provides higher-level (and potentially easier to use) abstractions.
Disclaimer: I am a product manager at Qubole.

I'm building a salesforce.com type of CRM - what is the right database architecture?

I have developed a CRM for my company. Next I would like to take that system and make it available for others to use in a hosted format, very much like salesforce.com. The question is what type of database structure I should use. I see two options:
Option 1. Each time a company signs up, I clone the master database for them.
The disadvantage of this is that I could end up with thousands of databases. That's a lot of databases to back up every night. My CRM uses cron jobs for maintenance; those jobs would have to run on all the databases.
Each time I upgrade the system with a new feature and need to add a new column to the database, I will have to add that column to thousands of databases.
Option 2. Use only one database.
At the beginning of EVERY table add "CompanyID". In EVERY SQL statement add "and companyid={companyid}".
The advantage of this method is the simplicity of only one database. Just one database to back up each night. Just one database to update when needed.
The disadvantage is: what if I get 1,000 companies signing up and each wants to store data on 100,000 leads? That's 100,000,000 rows in the leads table, which worries me.
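To make option 2 concrete, here is a minimal sketch of the per-company filter (using Python's sqlite3 purely to keep the example self-contained; the leads table and its columns are made up):

    # Single shared database: every table carries a company_id and every query
    # filters on it, using bound parameters rather than string formatting.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, company_id INTEGER, name TEXT)")
    conn.execute("CREATE INDEX idx_leads_company ON leads (company_id)")  # index the tenant column

    def leads_for_company(conn, company_id):
        return conn.execute(
            "SELECT id, name FROM leads WHERE company_id = ?", (company_id,)
        ).fetchall()

    conn.execute("INSERT INTO leads (company_id, name) VALUES (?, ?)", (1, "Acme lead"))
    conn.execute("INSERT INTO leads (company_id, name) VALUES (?, ?)", (2, "Other lead"))
    print(leads_for_company(conn, 1))   # only company 1's rows come back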
Does anyone know how the online hosted CRMs like salesforce.com do this?
Thanks
One approach: clone a per-client table structure into each new database, with all schemas archived and indexed in the master database; each client clone is hash-verified to access its specific schema, routed through a host machine at the front end of the master system whose primary role is directing requests. Internal access is batched to read/write slave systems in groups. Obviously, set up RAID configurations to copy both in real time and on a schedule, and balance request load against available system resources. That way you separate the security-sensitive pieces from the UI and the connection to the retention schema. Simplified structures and fewer policy requests seem to cut down the overhead in query processing. Or, put simply, a man-in-the-middle approach from the inside out.
1) Splitting your data into smaller groups in a safe, unthinking way (such as one database per grouping) is almost always best if you want to scale. In this case, unless for some reason you want to query between companies, keeping them in separate databases is best.
2) If you are updating all of the databases by hand, you are doing something wrong if you want to scale. You'd want to automate the process; see the sketch below this list for one way to do that.
3) Ultimately, salesforce.com uses this as the basis of their DB infrastructure:
http://blog.database.com/blog/2011/08/30/database-com-is-open-for-business/
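Picking up on point 2, a rough sketch of automating a schema change across many per-tenant databases (option 1), assuming mysql-connector-python; the database names, credentials, and ALTER statement are placeholders, and a real setup would also track which migrations each database has already applied:

    # Apply one migration to every tenant database instead of touching them by hand.
    import mysql.connector

    TENANT_DATABASES = ["crm_acme", "crm_globex", "crm_initech"]   # e.g. read from a registry table
    MIGRATION = "ALTER TABLE leads ADD COLUMN source VARCHAR(64) NULL"

    for db_name in TENANT_DATABASES:
        conn = mysql.connector.connect(
            host="localhost", user="admin", password="secret", database=db_name  # placeholders
        )
        try:
            cur = conn.cursor()
            cur.execute(MIGRATION)
            conn.commit()
            print(f"migrated {db_name}")
        finally:
            conn.close()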

What administration tasks (if any) should I perform on a sqlite database?

I'm deploying a client application to several thousand Windows machines. Each machine will use a SQLite database as a local datastore, ultimately writing data to a remote server. So the SQLite DB won't really grow beyond a few MB over time, since data will be added and then later deleted. SQLite is supposed to be zero-administration, but are there any tasks I need to run occasionally, such as analyze / update statistics / check for consistency, etc.? Or can I assume that once the SQLite DB is there, it can be used for a couple of years with no worries and no corruption?
If there are such tasks, I'll build processes into my app to run them occasionally, but at present I'm not sure if that's necessary/recommended.
SQLite has a VACUUM command, which will compact deleted entries and reduce fragmentation and file size. The ratio of deletions to insertions really dictates how frequently this should be done. It's generally quick and not always necessary.
The best thing to do is to set up some scripts that simulate your expected production activity. Take snapshots of the file size and graph them in your favorite spreadsheet. If, after 10 years of simulated activity, there isn't a problem, then your usage patterns don't need VACUUM.
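If you do decide to build an occasional maintenance task into the app, a minimal sketch using Python's built-in sqlite3 module (the file path and schedule are placeholders) could look like this:

    # Occasional SQLite maintenance: consistency check, refresh planner stats, compact.
    import sqlite3

    def maintain(db_path="local_store.db"):
        # Autocommit mode so VACUUM can run outside a transaction.
        conn = sqlite3.connect(db_path, isolation_level=None)
        try:
            # Quick consistency check; returns [('ok',)] when nothing is wrong.
            result = conn.execute("PRAGMA quick_check").fetchall()
            if result != [("ok",)]:
                raise RuntimeError(f"sqlite quick_check failed: {result}")
            conn.execute("ANALYZE")   # refresh query-planner statistics
            conn.execute("VACUUM")    # reclaim space from deleted rows
        finally:
            conn.close()

    if __name__ == "__main__":
        maintain()   # e.g. run once a week or at application shutdown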
