How to take druid segment data backup? - bigdata

I am new to druid. In our application we use druid for timeseries data and this can go pretty large(10-20TBs).
Druid provide you facility of deep storage. But if this deep storage crashes/or not reachable then it will result in data loss and which in turn affect the analytics the application is running.
I am thinking of taking an incremental backup druid segment data to some secure location like ftp server. So if deep storage is unavailable, then they can restore the data from this ftp server.
Is there any tool/utility available in druid to incrementally backup/restore druid segment?

In general it's important to take regular snapshots of the metadata storage as this is the "index" of what's in the Deep Storage. Maybe one snapshot per day, and store them for however long you like. It's good to store them for at least a couple of weeks, in case you need to roll back for some reason.
You also need to back up new segments in deep storage when they appear. It isn't important to take consistent snapshots, just to get every file eventually.
Also see https://groups.google.com/g/druid-user/c/itfKT5vaDl8
One other note as you mentioned data loss: Deep Storage is not queried directly - queries execute on the local segment cache in, for example, the Historical process. The Deep Storage is written to at ingestion time, so you might "lose" data that can't be ingested once it's available again, but you will continue to get analytics capability as the already-loaded data is on the historicals... Just a thought haha !
I hope that helps....?!?!

Related

Possible to create pipeline that writes an SQL database to MongoDB daily?

TL:DR I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-biquery to write a Node/Express API directly with BigQuery, or (b) create a daily job that writes my (entire) BigQuery DB over to MongoDB, and then use mongoose to write a Node/Express API with MongoDB?
I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database, with a react/node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:
Do a daily write/convert/transfer/migrate (whatever the correct verb is) from BigQuery database to MongoDB. I already have a node/express API written using mongoose, connected to a MongoDB cluster, and this approach would allow me to keep that API.
Use the nodejs-biquery library to create a node API that is directly connected to BigQuery. My app would change from MERN stack (BQ)ERN stack. I would have to re-write the node/express API to work with BigQuery, but I would no longer need the MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database if I am looking for a single entry, a since its not meant to be used as Mongo or a SQL Database (it has no index, one row retrieve query run slow as full table scan). Most of my APIs calls are for very little data from the database.
I am not sure which approach is best. I don't know if having 2 databases for 1 web application is a bad practice. I don't know if it's possible to do (1) with the daily transfers from one db to the other, and I don't know how slow BigQuery will be if I use it directly with my API. I think if it is easy to add (1) to my data engineering workflow, that this is preferred, but again, I am not sure.
I am going with (1). It shouldn't be too much work to write a python script that queries tables from BigQuery, transforms, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), however this is much easier to handle than writing a whole new node/bigquery API.
FWIW in a past life, I worked on a web ecommerce site that had 4 different DB back ends. ( Mongo, MySql, Redis, ElasticSearch) so more than 1 is not an issue at all, but you need to consider one as the DB of record, IE if anything does not match between them, one is the sourch of truth, the other is suspect. For my example, Redis and ElasticSearch were nearly ephemeral - Blow them away and they get recreated from the unerlying mysql and mongo sources. Now mySql and Mongo at the same time was a bit odd and that we were dong a slow roll migration. This means various record types were being transitioned from MySql over to mongo. This process looked a bit like:
- ORM layer writes to both mysql and mongo, reads still come from MySql.
- data is regularly compared.
- a few months elapse with no irregularities and writes to MySql are turned off and reads are moved to Mongo.
The end goal was no more MySql, everything was Mongo. I ran down that tangent because it seems like you could do similar - write to both DB's in whatever DB abstraction layer you used ( ORM, DAO, other things I don't keep up to date with etc.) and eventually move the reads as appropriate to wherever they need to go. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing was reached before sending it.
With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is the data is always stale by a day.

Will Google Firestore always write to the server immediately in a mobile app with frequent writes?

Is it possible to limit the speed at which Google Firestore pushes writes made in an app to the online database?
I'm investigating the feasibility of using Firestore to store a data stream from an IoT device via a mobile device/bluetooth.
The main concern is battery cost - receive a new data packet roughly two minutes, I'm concerned about the additional battery drain that an internet round-trip every two minutes, 24hrs a day, will cost. I also would want to limit updates to wifi connections only.
It's not important for the data to be available online real-time. However it is possible for multiple sources to add to the same datastream in a 2-way sybc, (ie online DB and all devices have the merged data).
I'm currently handling that myself, but when I saw the offline capability of Datastore I hoped I could get all that functionality for free.
I know we can't directly control offline-mode in Firestore, but is there any way to prevent it from always and immediately pushing write changes online?
The only technical question I can see here has to do with how batch writes operate, and more importantly, cost. Simply put, a batch write of 100 writes is the same as writing 100 writes individually. The function is not a way to avoid the write costs of firestore. Same goes for transactions. Same for editing a document (that's a write). If you really want to avoid those costs then you could store the values for the thirty minutes and let the client send the aggregated data in a single document. Though you mentioned you need data to be immediate so I'm not sure that's an option for you. Of course, this would be dependent on what one interprets "immediate" as based off the relative timespan. In my opinion, (I know those aren't really allowed here but it's kind of part of the question) if the data is stored over months/years, 30 minutes is fairly immediate. Either way, batch writes aren't quite the solution I think you're looking for.
EDIT: You've updated your question so I'll update my answer. You can do a local cache system and choose how you update however you wish. That's completely up to you and your own code. Writes aren't really automatic. So if you want to only send a data packet every hour then you'd send it at that time. You're likely going to want to do this in a transaction if multiple devices will write to the same stream so one doesn't overwrite the other if they're sending at the same time. Other than that I don't see firestore being a problem for you.

How firebase download data (Usage) is computed?

I have a firebase project where I am getting a lot of usage in terms of Firebase download data size. Its giving me approx 30GB a month while the data size is very small ~70-80MB in my real time database.
I am wondering how can I figure out if that is really the state of the system used. Is there a way I can see logs, traffic distribution and which nodes are particularly hit in my realtime database.
Is it possible to get a list of calls being to my database which downloads the data ? It may be a case that someone in my team actually is running a script to download data and playing around with it ?
Here is a snapshot for reference:

How can I keep objects stored in firebase cloud function RAM?

My application needs to build a couple of large hashmaps before processing a user's request. Ideally I want to store these hashmaps in-memory on the machine, which means it never has to do any expensive processing and can process any incoming requests quickly.
But this doesn't work for firebase because there's a chance a user triggers a new instance which sets off the very time-consuming preprocessing step.
So, I tried designing my application to use the firebase database, and get only the data it needs from the database each time instead of holding all the data in-memory. But, since the cloud functions are downloading loads of data from the database, I have now triggered over 1.7 GB in download for this month, just by myself from testing. This goes over the quota.
There must be something I'm missing; all I want is a permanent memory storage of some hashmaps. All I want is for those hashmaps to be ready by the time the function is called with a request. It seems like such a simple requirement; how come there is no way to do this?
If you want to store data in the container that runs your Cloud Functions, you can use its local tempfs, which is actually kept in memory. But this will disappear when the container is recycled, which happens when your function hasn't been access for a while. So this local file system will have to be rebuilt whenever the container spins up.
If you want permanent storage of values you generate, consider using Google Cloud Storage. It is probably a more cost effective option, and definitively the most scalable one.

Firebase: queries on large datasets

I'm using Firebase to store user profiles. I tried to put the minimum amount of data in each user profile (following the good practices advised in the documentation about structuring data) but as I have more than 220K user profiles, it still represents 150MB when downloading as JSON all user profiles.
And of course, it will grow bigger and bigger as I intend to have a lot more users :)
I can't do queries on those user profiles anymore because each time I do that, I reach 100% Database I/O capacity and thus some other requests, performed by users currently using the app, end up with errors.
I understand that when using queries, Firebase need to consider all data in the list and thus read it all from disk. And 150MB of data seems to be too much.
So is there an actual limit before reaching 100% Database I/O capacity? And what is exactly the usefulness of Firebase queries in that case?
If I simply have small amounts of data, I don't really need queries, I could easily download all data. But now that I have a lot of data, I can't use queries anymore, when I need them the most...
The core problem here isn't the query or the size of the data, it's simply the time required to warm the data into memory (i.e. load it from disk) when it's not being frequently queried. It's likely to be only a development issue, as in production this query would likely be a more frequently used asset.
But if the goal is to improve performance on initial load, the only reasonable answer here is to query on less data. 150MB is significant. Try copying a 150MB file between computers over a wireless network and you'll have some idea what it's like to send it over the internet, or to load it into memory from a file server.
A lot here depends on the use case, which you haven't included.
Assuming you have fairly standard search criteria (e.g. you search on email addresses), you can use indices to store email addresses separately to reduce the data set for your query.
/search_by_email/$user_id/<email address>
Now, rather than 50k per record, you have only the bytes to store the email address per records--a much smaller payload to warm into memory.
Assuming you're looking for robust search capabilities, the best answer is to use a real search engine. For example, enable private backups and export to BigQuery, or go with ElasticSearch (see Flashlight for an example).

Resources