Using Firebase, I'm giving the fields easily human-readable names such as "timestamp", "last_changed", "message_direction", etc.
Are the field names part of the data exchange for every single "row"?
Meaning, would I save bandwidth by shortening the field names?
As Frank pointed out: yes, they do affect bandwidth. However, if you do the math, you'll see that the savings are rarely worth the effort and the obfuscation, and that your data payload will generally vastly eclipse the few extra characters in the keys.
Let's consider an average chat message:
{
"sender": "typicaluserid:12345678901234567890",
"timestamp": 1410193850266,
"message": "Hello world. All your base are belong to us!"
}
Okay, so first let's look at the payload for this message. This is 163 bytes, UTF-8 encoded. If I changed the keys to meaningless three-character ids (e.g. sdr, ts, and msg), then it would be 149 bytes. With some liberal padding for transaction overhead, let's say about 40 bytes, that's roughly a 7% savings, in exchange for needing a Rosetta Stone to read my data.
Next, let's consider bandwidth usage. If we have a very active, very massive chat system (i.e. we're raking in serious $$), we might have 10k active users sending an average of 50 messages per day. That means we'd be sending 200 bytes * 500k = 100MB of data per day, or around 3GB of data per month.
That means that I could support a massive chat system comfortably on Firebase's free plan without exceeding bandwidth limits.
If we take those 14 bytes off, we have 186 * 500k * 30 = 2.79GB. So not a significant difference for the effort.
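If you want to rerun that arithmetic with your own numbers, here is a minimal sketch in Python; the ~200-byte padded message size and the traffic volume are the assumptions from the text above, not measured figures.

# Back-of-envelope version of the calculation above; the padded message size
# and traffic numbers are assumptions taken from the text.
long_keys = ["sender", "timestamp", "message"]
short_keys = ["sdr", "ts", "msg"]

key_savings = sum(map(len, long_keys)) - sum(map(len, short_keys))  # 14 bytes
padded_size = 200                    # bytes per message, incl. ~40 bytes overhead
messages_per_day = 10_000 * 50       # 10k active users x 50 messages/day

for size in (padded_size, padded_size - key_savings):
    gb_per_month = size * messages_per_day * 30 / 1e9
    print(f"{size} bytes/message -> {gb_per_month:.2f} GB/month")
# 200 bytes/message -> 3.00 GB/month
# 186 bytes/message -> 2.79 GB/month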
My few cents, for the sake of "lateral thinking" and "stating the obvious": sometimes the shortening can have benefits other than bandwidth, for example readability, being easier to type, and being less likely to produce a typo.
I recommend spending some time on finding the most concise names that would still be understandable.
/ MrObvious out
I'm working on a large-scale component that generates unique/opaque tokens representing business entities. Over time there will be many billions of these records, but for the first year we're not expecting growth to exceed 2 billion individual items (probably less than 500 million).
The system itself is horizontally scaled but needs token generation to be idempotent; data integrity is maintained by using a contained but reasonably complex combination of transactional writes with embedded condition expressions AND standalone condition check write items.
The tokens themselves are UUIDs and, in the interest of efficiency, are persisted as Binary attribute values (16 bytes) rather than the string representation (36 bytes). The downside is that the data doesn't visualise nicely in query consoles, making support hard if we encounter any bugs and/or broken data. Note there is no extra code complexity, since we implement the attributevalue.Marshaler interface to bind UUID (language) types to DynamoDB Binary attributes, and similarly do the same for any composite attributes.
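Purely for illustration (in Python rather than the Go attributevalue.Marshaler mentioned above), the two representations being compared look like this:

import uuid

token = uuid.uuid4()
print(len(token.bytes))   # 16 -> what gets stored as a DynamoDB Binary attribute
print(len(str(token)))    # 36 -> the human-readable string form

composite = uuid.uuid4().bytes + uuid.uuid4().bytes
print(len(composite))     # 32 -> two UUIDs concatenated, as in the mapping attributes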
My question relates (mostly) to data size/saving. The tokens are the partition keys, and some mapping attributes are [token] -> [other token composite attributes], for example two UUIDs concatenated together into 32 bytes.
I wanted to keep really tight control over storage costs knowing that, over time, we will be spending ~$0.25/GB per month for this. My question really has three parts:
Are the PK/SK index sizes 'reserved' (i.e. padded), so that it would make no difference at all to storage cost if we compress the overall field sizes down to the minimum possible? (I read somewhere that 100 bytes is typically reserved.)
If they ARE padded, the cost savings for the data would be reasonably high, because each (tree) index node will be nearly as big as the data being mapped. (I assume a tree index is used once hashed PK has routed the query to the right server node/disk etc.)
Is there any observable query time performance benefit to compacting 36 bytes into 16 (beyond saving a few bytes across the network)? i.e. if Dynamo has to read fewer pages it'll work faster, but in practice are we talking microseconds at best?
This is a secondary concern, but is worth considering if there is a lot of concurrent access to the data. UUIDs will distribute partitions but inevitably sometimes we will have some more active partitions than others.
Are there any tools that can parse the bytes back into human-readable UUIDs (or that we can customise to inject behaviour to do this)?
This is a concern, because making things small and efficient is fine, but supporting and resolving data issues will be difficult without significant tooling investment, and (unsurprisingly) the DynamoDB console, the DynamoDB IntelliJ plugin and AWS NoSQL Workbench all garble the binary into unreadable characters.
No, the PK/SK types are not padded. There's 100 bytes of overhead per item stored.
Sending less data certainly won't hurt your performance. Don't expect a noticeable improvement though. If shorter values can keep your items at 1,024 bytes instead of 1,025 bytes then you save yourself a Write Unit during the save.
For the "garbled" binary values I assume you're looking at the base64 encoded values, which is a standard binary encoding standard which can be reversed by lots of tooling (now that you know the name of it).
There's an SQLite database being used to store static-sized data in a round-robin fashion.
For example, 100 days of data are stored. On day 101, day 1 is deleted and then day 101 is inserted.
The number of rows is the same between days. The individual fields in the rows are all integers (32-bit or less) and timestamps.
The database is stored on an SD card with poor I/O speed,
something like a read speed of 30 MB/s.
VACUUM is not allowed because it can introduce a wait of several seconds
and the writers to that database can't be allowed to wait for write access.
So the concern is fragmentation, because I'm inserting and deleting records constantly
without VACUUMing.
But since I'm deleting/inserting the same set of rows each day,
will the data get fragmented?
Is SQLite fitting day 101's data in day 1's freed pages?
And although the set of rows is the same,
the integers may be 1 byte one day and then 4 bytes another.
The database also has several indexes, and I'm unsure where they're stored
and if they interfere with the perfect pattern of freeing pages and then re-using them.
(SQLite is the only technology that can be used. Can't switch to a TSDB/RRDtool, etc.)
SQLite will reuse free pages, so you will get fragmentation (if you delete so much data that entire pages become free).
However, SD cards are likely to have a flash translation layer, which introduces fragmentation whenever you write to some random sector.
Whether the first kind of fragmentation is noticeable depends on the hardware, and on the software's access pattern.
It is not possible to make useful predictions about that; you have to measure it.
In theory, WAL mode is append-only, and thus easier on the flash device.
However, checkpoints would be nearly as bad as VACUUMs.
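One way to do that measuring, at least for the page-reuse part, is to watch SQLite's page counters across a daily cycle. A minimal sketch in Python (the file name and the rollover step are placeholders for your own code):

import sqlite3

def page_stats(conn):
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    freelist = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return page_count, freelist

conn = sqlite3.connect("roundrobin.db")   # placeholder path

before = page_stats(conn)
# ... delete day 1 and insert day 101 here, as your writers normally would ...
after = page_stats(conn)

# If page_count stays flat and freelist_count returns to roughly the same value
# after each cycle, SQLite is reusing the freed pages rather than growing the file.
print("before:", before, "after:", after)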
I would like to use Firebase to load a 2D map on my site. Fields will be loaded dynamically as the user scrolls the map, and changed map fields should be shown as well.
But I am interested in what is more efficient:
Reading a list of values even if I only need about 50% of the loaded fields (e.g. 100 loaded fields):
geoRef.startAt(null, start).endAt(null, end).on('value', callback);
maybe better:
geoRef.startAt(null, start).endAt(null, end).once('value', callback);
geoRef.startAt(null, start).endAt(null, end).on('child_changed', callback);
or reading a lot of single values (e.g. 50x):
geoRef.child(valueX).child(valueY).on('value', callback);
These reads will be triggered on every user scroll, so there will be a lot of these 50x single reads vs. 1x (with ~50% unused) list reads.
Thanks
It's impossible to answer this question specifically without details. Are we talking 1000 records that are each 10 bytes or 1 million records that are each 5MB?
Data size and network bandwidth are the consideration here, not the number of connections. Firebase holds a socket opened to the server, so the overhead of establishing TCP connections is not a concern (as it would be with multiple HTTP requests), although the time it takes a request to return from the server (latency) is.
This leaves only two considerations: how much data and how many records.
For instance, if my system contains 1002 records and I want 1000 of them, and each is 1KB in size, it's going to be faster to simply request them all at once (since this requires only the latency of waiting for one response from the server). But if I want 10 of them, requesting them separately would likely be faster.
Even more ideal would be to segment them using priorities or split them cleverly into multiple paths by category, time frame, or another context. Then I can retrieve only segments of the data as a single transaction.
For example:
/messages/today
/messages/yesterday
/messages/all_messages
Now, assuming today is measured in hundreds and the payload is 1KB, I can just grab this whole list and iterate it client side any time I'd like--not worth the energy to grab them individually. If this is my common use case, perfect.
And assuming all_messages is measured in the millions of records, each about 1KB, then to grab 100 messages from here, I'll naturally gravitate to snagging each one individually.
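As a rough way to plug your own numbers into that trade-off, here is a small sketch; the record size and counts are assumptions to replace, and it only models bytes and request counts, not latency over the already-open socket.

# Rough model of the trade-off described above; all the numbers are
# assumptions to replace with your own.
RECORD_BYTES = 1_000           # ~1KB per record, as in the example

def bulk_read(records_in_path):
    # one request, but you download everything stored under the path
    return {"requests": 1, "bytes": records_in_path * RECORD_BYTES}

def individual_reads(records_needed):
    # one request per record, but only the records you actually need
    return {"requests": records_needed, "bytes": records_needed * RECORD_BYTES}

print(bulk_read(1002), individual_reads(1000))   # wanting nearly all of them: bulk wins
print(bulk_read(1002), individual_reads(10))     # wanting 10 of them: individual reads win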
I am attempting to put a potentially large string into a rendezvous message and was curious about size constraints. I understand there is a physical limit (64mb?) to the message as a whole, but I'm curious about how some other variables could affect it. Specifically:
How big are the keys?
How the string is stored (in one field vs. multiple fields)
Any advice on any of the above topics or anything else that could be relevant would be greatly appreciated.
Note: I would like to keep the message as a raw string (as opposed to bytecode, etc).
From the Tibco docs on Very Large Messages:
Rendezvous software can transport very large messages; it divides them into small packets, and places them on the network as quickly as the network can accept them. In some situations, this behavior can overwhelm network capacity; applications can achieve higher throughput by dividing large messages into smaller chunks and regulating the rate at which it sends those chunks. You can use the performance tool to evaluate chunk sizes and send rates for optimal throughput.
This example sends one message consisting of ten million bytes. Rendezvous software automatically divides the message into packets and sends them. However, this burst of packets might exceed network capacity, resulting in poor throughput:
sender> rvperfm -size 10000000 -messages 1
In this second example, the application divides the ten million bytes into one thousand smaller messages of ten thousand bytes each, and automatically determines the batch size and interval to regulate the flow for optimal throughput:
sender> rvperfm -size 10000 -messages 1000 -auto
By varying the -messages and -size parameters, you can determine the optimal message size for your applications in a specific network. Application developers can use this information to regulate sending rates for improved performance.
As to actual limits: the add-string function takes a C-style ANSI string, so it is theoretically unbounded. But given the signature of AddOpaque,
tibrv_status tibrvMsg_AddOpaque(
tibrvMsg message,
const char* fieldName,
const void* value,
tibrv_u32 size);
which takes a u32 size, it would seem sensible to state that the limit is likely to be 4GB rather than 64MB.
That said, using Tib to transfer such large packets is likely to be a serious performance bottleneck, as it may have to buffer significant amounts of traffic while it tries to get these sorts of messages to all consumers. By default the rvd buffer is only 60 seconds, so you may find yourself suffering message loss if this is a significant amount of your traffic.
Message overhead within tibco is largely as simple as:
the fixed cost associated with each message (the header)
All the fields (type info and the field id)
Plus the cost of all variable length aspects including:
the send and receive subjects (effectively limited to 256 bytes each)
the field names. I can find no limit on the length of field names in the docs, but the smaller they are the better; better still, don't use them at all and use the numerical identifiers instead
the array/string/opaque/user defined variable length fields in the message
Note: If you use nested messages simply recurse the above.
In your case the payload will be so vast in comparison to the names (so long as they are reasonable and simple) that there is little point attempting to optimize these at all.
You may find you can gain considerable efficiency on the wire/in buffers if you transmit the strings in compressed form, either through the use of an rvrd with compression enabled or by changing your producer/consumer to use something fast but effective like deflate (or, if you're feeling esoteric, things like QuickLZ, FastLZ, LZO, etc., especially ones with fixed-memory-footprint compress/decompress engines).
You don't say which platform API you are targeting (.NET/Java/C++/C, for example), and this will colour things a little. On the wire, all string data will be 1 byte per character regardless of Java/.NET using UTF-16 by default; however, you will incur a significant translation cost placing strings into/reading them out of the message, because the underlying buffer cannot be reused in those cases and a copy (plus compaction/expansion respectively) must be performed.
If you stick to opaque byte sequences you will still have the copy overhead in the naive implementations possible through the managed wrapper APIs, but this will at least be less overhead if you have no need to work with the data as a native string.
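If you do go the deflate route suggested above and carry the result in an opaque field, here is a minimal sketch of the producer/consumer side, in Python purely for illustration; the Rendezvous send/receive calls themselves are omitted.

import zlib

# Producer side: compress the string before adding it as an opaque field.
payload = ("some large, fairly repetitive string " * 10_000).encode("utf-8")
compressed = zlib.compress(payload)
print(len(payload), "->", len(compressed), "bytes on the wire")

# Consumer side: pull the opaque bytes back out of the message and inflate.
restored = zlib.decompress(compressed).decode("utf-8")
assert restored == payload.decode("utf-8")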
The overall maximum size of a message is 64MB as was speculated in the OP. From the "Tibco Rendezvous Concepts" document:
Although the ability to exchange large data buffers is a feature of Rendezvous software, it is best not to make messages too large. For example, to exchange data up to 10,000 bytes, a single message is efficient. But to send files that could be many megabytes in length, we recommend using multiple send calls, perhaps one for each record, block or track. Empirically determine the most efficient size for the prevailing network conditions. (The actual size limit is 64 MB, which is rarely an appropriate size.)
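A transport-agnostic sketch of that "multiple send calls" recommendation (in Python for illustration; the 10,000-byte chunk size simply mirrors the rvperfm example above and would need to be tuned empirically):

def chunks(data: bytes, chunk_size: int = 10_000):
    # Yield successive slices of the payload; each one becomes its own message.
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]

big_payload = b"\x00" * 10_000_000      # the ten-million-byte example above
parts = list(chunks(big_payload))
print(len(parts))                       # 1000 chunks of <= 10,000 bytes each
# Each chunk would be sent as a separate message, typically with a sequence
# number so the consumer can reassemble them in order.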
What is the most bandwidth efficient way to unidirectionally synchronise a list of data from one server to many clients?
I have a sizeable chunk of data (perhaps 20,000 50-byte records) which I need to periodically synchronise to a series of clients over the Internet (perhaps 10,000 clients). Records may be added, removed or updated, but only at the server end.
Something similar to BitTorrent? Or even using BitTorrent, or maybe inventing a wrapper around BitTorrent.
(Assuming you pay for bandwidth on your server and not the others ...)
Ok, so we've got some detail now - perhaps 10 GB of total (uncompressed) data transferred every 3 days (roughly 1 MB to each of 10,000 clients), so that's about 100 GB per month.
That's actually not really a sizeable chunk of data these days. Whose bandwidth are you trying to save - yours, or your clients'?
Does the data perhaps compress very readily? For raw binary data it's not uncommon to achieve 50% compression, and if the data happens to have a lot of repeated patterns within it then 80%+ is possible.
That said, if you really do need a system that can just transfer the changes, my thoughts are (sketched in code after the list):
make sure you've got a well defined primary key field - use that as your key to identify each record
record a timestamp for each record to say when it last changed
have each client tell you the timestamp of the last change it knows of, so you can calculate the deltas
ensure that full downloads are possible too, in case clients get out of sync
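A minimal sketch of that delta approach, under assumptions of my own: the Record shape, the tombstone flag for deletions, and delta_since are illustrative, not a prescribed design. A tombstone (or a separate deleted-ids list) is needed because a removed record can't be inferred from timestamps alone.

from dataclasses import dataclass

@dataclass
class Record:
    key: str             # the well-defined primary key
    last_changed: int    # server-side timestamp, e.g. epoch millis
    deleted: bool        # tombstone so clients can drop removed records
    payload: bytes       # the ~50 bytes of actual data

def delta_since(records: dict, client_last_sync: int) -> list:
    """Return everything the client hasn't seen yet; a client_last_sync of 0
    degenerates into the full-download fallback."""
    return [r for r in records.values() if r.last_changed > client_last_sync]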