CSV/parquet to Dynamo, small file of ~500k rows, only two columns - amazon-dynamodb

I am looking for a way to upload CSV/parquet data to dynamo db without having to create a data pipeline.
I have a small (12mb parquet/30mb CSV) file which consists of two columns. It gets generated daily and the dynamo table needs a full refresh each day.
At first I decided to use AWS Athena, which was very easy to setup. But for reads, its slow (each query takes 1.5 to 4 seconds). This process may be used by others in the company in the near future, so I am now seeking something quicker.
I looked into Dynamo DB's batch write item function. But it feels extremely inefficient to make about 500000/25 calls a day to update this relatively small size table.
What is a bit frustrating is that a single call using batchwriteitem has a max size of 16mb, with 400kb per row. Which is almost the size of the file itself.
I looked into perhaps sending the data as one long row, and splitting it. But I could not find such an operation. Curious if anybody has any input on this.

Related

Firebase BigQuery Export Schema Size Difference

We have migrated all of our old Firebase BigQuery events tables to the new schema using the provided script. One thing we noticed was that the size of the daily tables increased dramatically.
For example, the data from 4/1/18 in the old schema was 3.5MM rows and 8.7 Gig. Once migrated, the new table from the same date is 32.3MM rows and 27 Gig. This is nearly 10 times larger in terms of number of rows and over 3X larger by space size.
Can someone tell me why the same data is so much larger in the new schema?
The result is that we are getting charged significantly more in BigQuery query costs when reading the tables from the new schema versus the old schema.
firebaser here
While increasing the size of the exported data definitely wasn't a goal, it is an expected side-effect of the new schema.
In the old storage format the events were stored in bundles. While I don't exactly know how the events are bundled, it was definitely always a bunch of events with their own unique and with shared properties. This meant that you frequently had to unnest the data in your query or cross join the tables with themselves, to get to the raw data, and then combine and group it again to fit your requirements.
In the new storage format, each event is stored separately. This definitely increases the storage size, since properties that were shared between events in a bundle, now are duplicated for each event. But the queries you write on the new format should be easier to read and can process the data faster, since they don't have to unnest it first.
So the larger storage size should come with a slightly faster processing speed. But I can totally imagine the sticker shock when you see the difference, and realize the improved speed doesn't always make up for that. I apologize if that is the case, and have been assured that don't have any other big schema changes planned from here on.

SQLite with large data, is it best to split

I am using SQLite because this is for cross-platform. I have about 10 tables with a small amount of data (maybe a few dozen rows each), but then I also have a set of data which might have a million or more rows.
The small dataset isn't really modified that much, just queried, but the large data set will be queried and modified frequently.
Rather than have a single SQLite database with all the tables in it, I was wondering if splitting it into two databases might be smartest.
Basically I'd have one database, lets call it "settings", with the 10 tables in it. I'd then have another database, lets call it "userdata", with the million rows.
I'll be creating a third database called "audits" where I record each change to the "userdata" database. This database is expected to grow (for a short time period).
I am just wondering if people have an opinion as to whether it is a good idea to split my data into multiple databases or if I should just have one massive one.
My thinking is the queries on the "userdata" database might be slightly more efficient since it will only have one table.
Note, this is not for long-term. It is for a short period of time. It will be queried and edited for about a week, then it is done.

Is it ok to build architecture around regular creation/deletion of tables in DynamoDB?

I have a messaging app, where all messages are arranged into seasons by creation time. There could be billions of messages each season. I have a task to delete messages of old seasons. I thought of a solution, which involves DynamoDB table creation/deletion like this:
Each table contains messages of only one season
When season becomes 'old' and messages no longer needed, table is deleted
Is it a good pattern and does it encouraged by Amazon?
ps: I'm asking, because I'm afraid of two things, met in different Amazon services -
In Amazon S3 you have to delete each item before you can fully delete bucket. When you have billions of items, it becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour'. When using SQS api you can act badly regarding SQS infrastructure (for example not polling messages) and thus could be penalized for it.
Yes, this is an acceptable design pattern, it actually follows a best practice put forward by the AWS team, but there are things to consider for your specific use case.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than this you should probably re-evaluate.
You can delete a table a DynamoDB table that still contains records, if you have a large number of records you have to regularly delete this is actually a best practice by using a rolling set of tables
Creating and deleting tables is an asynchronous operation so you do not want to have your application depend on the time it takes for these operations to complete. Make sure you create tables well in advance of you needing them. Under normal circumstances tables create in just a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size of how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know if it's going to take 2 seconds or 2 minutes or 20 minutes) but as long your solution does not depend on this sort of timing you're fine.
In fact the idea of sharding your data based on age has the potential of significantly improving the performance of your application and will definitely help you control your costs.

Storing Weighted Graph Time Series in Cassandra

I am new to Cassandra, and I want to brainstorm storing time series of weighted graphs in Cassandra, where edge weight is incremented upon each time but also updated as a function of time. For example,
w_ij(t+1) = w_ij(t)*exp(-dt/tau) + 1
My first shot involves two CQL v3 tables:
First, I create a partition key by concatenating the id of the graph and the two nodes incident on the particular edge, e.g. G-V1-V2. I do this in order to be able to use the "ORDER BY" directive on the second component of the composite keys described below, which is type timestamp. Call this string the EID, for "edge id".
TABLE 1
- a time series of edge updates
- PRIMARY KEY: EID, time, weight
TABLE 2
- values of "last update time" and "last weight"
- PRIMARY KEY: EID
- COLUMNS: time, weight
Upon each tick, I fetch and update the time and weight values stored in TABLE 2. I use these values to compute the time delta and new weight. I then insert these values in TABLE 1.
Are there any terrible inefficiencies in this strategy? How should it be done? I already know that the update procedure for TABLE 2 is not idempotent and could result in inconsistencies, but I can accept that for the time being.
EDIT: One thing I might do is merge the two tables into a single time series table.
You should avoid any kind of read-before-write when it comes to Cassandra (and any other database where you can't do a compare-and-swap operation for the write).
First of all: Which queries and query-patterns does your application have?
Furthermore I would be interested how often a new weight for each edge will be calculated and stored. Every second, hour, day?
Would it be possible to hold the last weight of each edge in memory? So you could avoid the reading before writing? Possibly some sort of lazy-loading mechanism of this value would be feasible.
If your queries will allow this data model, I would try to build a solution with a single column family.
I would avoid reading before writing in Cassandra as it really isn't a great fit. Reads are expensive, considerably more so than writes, and to sustain performance you'll need a large number of nodes for a relatively small amount of queries. What you're suggesting doesn't really lend itself to be a good fit for Cassandra, as there doesn't appear to be any way to avoid reading before you write. Even if you use a single table you will still need to fetch the last update entries to perform your write. While it certainly could be done, I think there is better tools for the job. Having said that, this would be perfectly feasible if you could keep all data in table 2 in memory, and potentially utilise the row cache. As long as table 2 isn't so large that it can fit the majority of rows in memory, your reads will be significantly faster which may make up for the need to perform a read every write. This would be quite a challenge however and you would need to ensure only the "last update time" for each row is kept in memory, and disk is rarely needed to be touched.
Anyway, another design you may want to look at is an implementation where you not only use Cassandra but also a cache in front of Cassandra to store the last updated times. This could be run alongside Cassandra or on a separate node but could be an in memory store of the last update times only, and when you need to update a row you query the cache, and write your full row to Cassandra (you could even write the last update time if you wished). You could use something like Redis to perform this function, and that way you wouldn't need to worry about tombstones or forcing everything to be stored in memory and so on and so forth.

SQlite3 Optimization: Store external file name in db? Or just have a huge number of rows?

I am a newbie with no comp sci background. So please forgive me for whatever dumb stuff I may say. I am working on a solar power monitoring project to monitor the power output of the solar power systems my company installs. I am writing a client that will query the inverter (for power output, voltage output, current output, system errors/faults, etc--which constitutes one "reading") of each of our monitoring customers every 15 minutes for as long as they have their system--which means roughly 35k readings per year per customer. So I was thinking of organizing my sqlite3 database in one of the two following ways.
(1) Have the database be two tables, one table with regular customer information (name, email, etc) and another much bigger table where each row represents one reading and includes the customer id and timestamp of reading as identifiers. Which means roughly 35k rows will be being added to this bigger table per customer per year. (Data more than two years old will be pared down and archived.)
OR
(2) Store all readings in a csv file (one csv file per customer) and store the csv file name in my table with regular customer information
This database will be serving a website (built on rails if that makes any difference for options) where customers will be able to view their power output data. I want to minimize the amount of time it will take to load their output data on login. I basically don't have a clear idea of the amount of time it would take for my computer to open and read in lines from a text file versus open, look for (based on customer id) and read in the data from a huge sqlite3 table--and therefore am having trouble knowing how to judge between the two options above. Also I'm having trouble gauging the limits of sqlite3 where it functions optimally despite having read some about it (I don't think I have the background to understand the reading I did because it seems to say 100s of millions of rows are just fine when I read other people's comments seeming to say just the opposite.). I am also open to a completely different option as I'm not married to anything right now. Whatever makes things load faster. Thanks so much in advance!
Storing the parsed data in sqlite would definitely be a timesaver if you're doing any kind of repeated data mining on it. CSV Parsing overhead would almost instantly eat up any database space/time savings you'd gain.
As for efficiency, you'd have to test it. There's no one hard fast rule that says "use this database" or "use that database". It's ALWAYS a "depends on the scenario". SQLite may be perfect for you in this case, but be useless for someone else with a slightly different workload.
SQL applications in general do very well with large data sets, as long as the columns being queried are indexed. You should keep them in the same database. It will take a huge lot less to obtain the data from the database than for parsing CSV files. Databases are created with the purpose of storing and retrieving data, CSV files are not.
I use MySQL databases with tens of millions of rows per table and queries return results in fractions of a second. SQLite might be faster.
Just make sure you create indexes for what you will be searching.
I would do option 1, but use a database server such as PostgreSQL instead of SQLite.
SQLite will lock the table on update so you may run into locking issues if you read and write to the table a lot. SQLite is better suited for single user applications on the desktop or in a smartphone.
You can easily have millions of rows without it causing any problems.

Resources