Many-to-many query runs slow in Windows Phone 7 emulator - SQLite

My application is using SQLite for a database. In the database, I have a many-to-many relationship. When I use the SQLite add-on/tool for Firefox, the SQL query joining the tables in the many-to-many runs pretty fast. However, when I run the same query on the emulator, it takes a very long time (5 minutes or more). I haven't even tried it on a real device yet, though.
Can someone tell me what is going on?
For example, I have 3 tables:
1. create table person (id integer, name text);
2. create table course (id integer, name text);
3. create table registration(personId integer, courseId integer);
My SQL statements that I have tried are as follows:
select *
from person, course, registration
where registration.personId = person.id and registration.courseId = course.id
And also as follows:
select *
from person inner join registration on person.id=registration.personId
inner join course on course.id=registration.courseId
I am using the SQLite client from http://wp7sqlite.codeplex.com. I have 4,800 records in the registration table, 4,000 records in the person table, and 1,000 records in the course table.
Is it my queries? Is it just the SQLite client? Is it the record size? If this problem cannot be fixed in the app, I'm afraid I'll have to host the database remotely (which means my app will have to use the internet).

Yep, it's your queries. You're not going to get away with doing what you are trying to do on a mobile device. You have to remember you aren't running on a PC, so you have to think differently about how you approach things (both code and UI). You have low memory, slow disk access, a slow-ish processor, no virtual memory, etc. You're going to have to make compromises.
I'm sure whatever you are doing is perfectly possible on the phone without needing an offsite server, but you need to be smart about it. For example, is it really necessary to load all 4,800+ records into memory at once? Almost certainly not; a user can't possibly look at all 4,800 at the same time. Forgetting database speed, just showing that number of items in a ListBox is going to kill your app performance-wise.
And even if performance were perfect, is displaying 4,800 items really a good user experience? Surely allowing the user to enter a search term would be better, and would let you filter the list to a more manageable size. Could you implement paging, so you only display the first 10 records and have the user click next for the next 10?
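For example, a minimal sketch of a paged query in SQLite (assuming the list is sorted by person name; the column aliases are just illustrative):

select person.name as personName, course.name as courseName
from person
inner join registration on person.id = registration.personId
inner join course on course.id = registration.courseId
order by person.name
limit 10 offset 0

Each click of "next" re-runs the query with the offset increased by 10, so only 10 rows are ever materialized at a time.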
You might also want to consider de-normalizing your database, so that you just have one table rather than 3. It will improve performance considerably. Yes, it goes against everything you were taught about databases in school, but like I said: phone = compromises. And remember this isn't a big mission-critical OLTP database, it's a phone app - no one cares whether your database is in 3rd normal form or not. Also remember that the more work you give the phone (chugging through data building up joins), the more battery power your app will consume.
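A minimal sketch of what that single de-normalized table might look like, assuming registrations are read far more often than they are written:

create table registration (personId integer, personName text, courseId integer, courseName text);

You duplicate the person and course names on every insert, but every read becomes a single-table query with no joins at all.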
Finally, if you absolutely think you must give the user a list of 4,800 records to scroll through, you should look at some kind of data virtualization technique, which gives the user the illusion they are scrolling through a long list, even though only a few items are actually loaded at any given time.
But the short answer is: yes, doing queries like that will be problematic, and you need to consider changing them.

By the time you start doing those joins, that's an awfully large number of records you could end up with. What is memory like during this operation?
Assuming you have tuned indexes appropriately, rather than doing this with joins, I'd try three separate queries.
Either that or consider restructuring your data so it only contains what you need in the app.
You should also look to only return the fields you need.
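On the index point: SQLite only creates indexes automatically for primary key and unique constraints, so with the schema as posted every join condition means a full table scan. A sketch of the indexes I would try first (the names are just illustrative):

create index idx_registration_personId on registration(personId);
create index idx_registration_courseId on registration(courseId);
create index idx_person_id on person(id);
create index idx_course_id on course(id);

If you instead declare id as "integer primary key" in the person and course tables, the last two indexes become unnecessary.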

Related

Can DynamoDB be used for this simple problem?

I am trying to understand the limitations of DynamoDB/NoSQL, mostly as a learning exercise. I came across a problem that is fairly simple in a relational database, but I cannot figure out how to accomplish it in DynamoDB even with full control of rebuilding the tables and indexes.
Problem: Every day everyone in an office chooses one fruit for lunch. At the end of the week, I just want a list of everyone who ate both an apple and a banana.
Example Data
I thought employee name should be the PK, day of the week should be the SK, and Fruit would be an attribute. But that doesn't seem to work, because you can't query against an attribute.
Is there a way to structure the data to make this work? Is there another tool like OpenSearch, HiveQL, or GraphQL that can help me do what I am trying to do here?
Thanks.
When you say it's "fairly simple in a relational database", what you mean is it's simple to express, not exactly simple to compute. You're pushing a lot of list intersection work to the database. As your data set grows, the response time for your query will get slower and slower. At some point the database will no longer be able to give you the answer. And while it's consuming CPU (before timing out) you're negatively impacting the load on the relational database server for other users.
With DynamoDB you can't express queries that take unbounded effort to compute, or whose performance depends so heavily on the total data set size. You have to design a query system up front that doesn't get slower and slower as the data set grows.
The DynamoDB design then depends on what you know up front. For example, do you know the question is always the intersection of apples and bananas? Then during the insert of a new lunch record, note whether the person has now eaten both, and mark them as such on a user metadata item. Use that marker later during the query phase.
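A hedged PartiQL sketch of that insert-time marking (the table, key, and attribute names here are all made up for illustration; the app would first check whether the other fruit is already recorded for the week):

UPDATE "Lunches"
SET AteAppleAndBanana = true
WHERE Employee = 'Bob' AND Entry = 'META'

The end-of-week query then only has to read the flagged metadata items instead of intersecting two lists.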
Sound like a nuisance? Well, if your data set isn't growing large and/or you don't need reliably fast query performance, then a relational database solves this problem well. Different databases for different purposes.
DynamoDB also supports Scan, not only Query.
A simple design for the table is to have the PK be the name of the person, and the attributes be numeric counters for each fruit that you increment every day:
UPDATE "FRUIT_COUNTS"
SET BANANA=BANANA + 1
WHERE Employee='Bob'
Then, at the end of the week, you can run a simple PartiQL query on the table:
SELECT * FROM "FRUIT_COUNTS"
WHERE BANANA > 0 AND APPLE > 0

Is there a best-practice limit on how many items I should keep in a single DynamoDB table?

I am setting up a Serverless application for a system and I am wondering the following:
Say that my table handles Companies. Each Company can have Invoices, roughly 6-8,000 of them. Say that I have 14 Companies; that results in roughly 112,000 items in my table.
Is it "okay" to handle it this way? I will only pay for each Get request I do, and I can query a lot of items into the same get request.
I will not fetch every single item each time I write or get items.
So, is there a recommendation for how many items I should have in a table at most? I could bake some items together, but I mainly want a general recommendation.
There is no practical limit to the number of items you can have in a table. How many items each invoice should be split into depends on your application's access patterns. You need to ask: what data does your app need, when does it need that data, how large is the data, and how often is each item updated? For example, if all the data for an invoice comes in under the 1 KB WCU and 4 KB RCU sizes, you do not write to it often, and when you read it you need all of the data in the item, then perhaps shove it all in one item. (As a worked example of those units: a 3.5 KB item costs one 4 KB read capacity unit to read, but four 1 KB write capacity units every time you rewrite it.) If the data is larger, or part of it gets written to more often, then perhaps split it up.
An example might be a package tracking app. You have the initial information about the package, size, weight, source address, destination address, etc. That could be a lot of data. When that package enters a sorting facility it is checked in. Do you want to update that entire item you already wrote? Or do you just write an item that has the same PK (item collection), but a different SK and then the info that it made it to the sorting facility? When it leaves the sorting facility, you want to write to the DB that it left, which truck it was on, etc. Same questions.
Now when you need to present the shipping information by tracking ID number (the PK), you can do a single query to DynamoDB and get the entire item collection for that tracking ID. You get all items with that ID, and your app presents much of that information on the tracking web site for the customer.
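In PartiQL terms, a minimal sketch of that read (the table and key names are made up for illustration):

SELECT * FROM "Packages"
WHERE TrackingId = '1Z999AA10123456784'

Because the WHERE clause pins the partition key, this runs as a Query over one item collection rather than a Scan of the whole table.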
So again, it really depends on the app and your access patterns, but you want to TRY to only read and write the items your app needs, when you need them, how you need them, and no more... within reason (there is such a thing as over-slicing your data). That is how, in my opinion, you will make a NoSQL database like DynamoDB the most performant and most cost-effective.
DynamoDB won't even notice 100K entries...
As mentioned by LifeOfPi, individual items should be less than 400 KB.
The question indicates a distinct lack of understanding of what/why/how to use DDB. I suggest you do some more learning. The AWS re:Invent videos around DDB are quite useful.
In a standard RDBMS, you need to know the structure from the beginning. Accessing that data is then very flexible.
DDB is the opposite: you need to understand how you'll need to access your data; the structure is less important. You should end up with something like so:
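For instance, a single-table sketch for the companies/invoices case above (the key formats here are just illustrative guesses):

PK (company)  | SK                 | attributes
--------------|--------------------|------------------------------
COMPANY#acme  | META               | company name, address, ...
COMPANY#acme  | INVOICE#2023-00017 | amount, status, due date, ...

A single query on PK = 'COMPANY#acme' then returns the company record and all of its invoices together.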
For 100K items, and for most applications, you may find Aurora Serverless to be an easier fit for your needs, especially if you have complicated searching and/or sorting needs.

DynamoDB tables per customer considering DynamoDB's advanced recovery abilities

I am deciding whether to have a table per customer, or have all customers share one table. Creating a table for every customer seems problematic, as it is just another thing to manage.
But then I thought about backing up the database. There could be a situation where a customer does not have strong IT security, or even has a disgruntled employee, and this person goes and deletes a whole bunch of the customer's crucial data.
In this scenario, if all the customers are on the same table, one couldn't just restore from a DynamoDB snapshot from 2 days ago, for instance, as all the other customers would then lose their past 2 days of data. Before the cloud this really wasn't such a prevalent consideration, IMO, because backups were not as straightforward, and offering such functionality to customers who are not tier-1 businesses wasn't really on the table.
But this functionality could be a huge selling point for my SaaS application, so now I am thinking it will be worth the hassle to have a table per customer. Is this the right line of thinking?
Sounds like a good line of thinking to me. A couple of other things you might want to consider:
Having all customer data in one table will probably be cheaper, as you can distribute RCUs and WCUs more efficiently. From your customers' point of view this might be good or bad, because one customer can burn through other customers' RCUs/WCUs (if you want to think about it like that). If you split customer data into separate tables, you can provision them independently.
Fine-grained security isn't great in DynamoDB. You can only really implement row (item) level security if the partition key of the table is an Amazon UID. If this isn't possible, you are relying on application code to protect customer data. Splitting customer data into separate tables will improve security (if you can't use item-level security).
On to your question. DynamoDB backups don't actually have to be restored into the same table. So potentially you could have all your customer data in one table which is backed up. If one customer requests a restore, you could load the backup into a new table, sync their data into the live table, and then remove the restore table. This wouldn't necessarily be easy, but you could give it a try. Also, you would be paying for all the RCUs/WCUs as you perform the sync - a cost you don't incur on a plain restore.
Hope some of that is useful.
Separate tables:
Max number of tables: it's probably a soft limit, but you'd have to contact support rather often - extra overhead for you, because they prefer to raise limits in small (reasonable) increments.
A lot more things to manage, secure, monitor etc.
There's probably a lot more RCU and WCU waste.
Just throwing another idea up in the air; I haven't tried it or considered every pro and con.
Pick up all the write ops with Lambda and write them to backup table(s). Use TTL (set to however long users can restore their stuff) to delete old entries for free. You could even vary the TTL on a per-customer basis if you e.g. provide longer backups for different price tiers of your service.
You need a good schema to avoid hot keys.
customer-id (partition key) | time-of-operation#uuid (sort key) | data, source table, etc.
----------------------------|-----------------------------------|-------------------------
E.g. this schema might be problematic if some of your customers are a lot more active than others.
Possible solution: use a known range of integers to suffix the IDs, e.g. customer-id#1, customer-id#2 ... customer-id#100. This will spread the writes, and since your app knows the range, it can still query.
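Restoring a window of operations for one customer is then a plain key-range read. A hedged PartiQL sketch (the table, key, and shard names are made up for illustration; you would repeat the query once per suffix shard):

SELECT * FROM "OpsBackup"
WHERE CustomerId = 'customer-42#1'
AND OpTime BETWEEN '2024-01-01' AND '2024-01-04'

Because the partition key is pinned and the sort key is range-bounded, this is a Query, not a Scan.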
Anyway, this is just a quick and dirty example off the top of my head.
A few pros and cons that come to mind:
Probably more expensive, unless the separate tables would have big RCU/WCU headroom.
Restoring from a regular backup might be a huge headache, e.g. which items to sync?
This is very granular: users can pick any moment in your TTL range to restore.
You can restore specific items, or revert specific ops, at very low cost if your schema allows it.
You could use the backup data to e.g. show the history of items in the front-end.

What is the advantage of using a 1-to-many relationship over adding 1 more column in this particular situation?

This is a typical situation for 1-to-many relationships: a chat-group iOS app, with a group table to record all the group-chat-related information, like group id, create time, thread title, etc.
To record the participants, of course, I would assume there is another 1:m table. So I was rather surprised to see the app just added another column called "participants" to record them, with each participant separated by a delimiter (':' to be exact). The problems with that are quite obvious: it mixes application code with SQL (e.g. there is no way to see how many groups a specific user is in with SQL alone), it violates 1NF/2NF, etc.
But they said they understood all my points; however:
as this is a mobile app, you always need Objective-C code to access the SQLite tables; you won't use SQL alone, so it's not a "big deal" to mix them together.
participants don't change often and are normally set when a group is created. If we have 100 participants, we would rather insert 1 record into the group table than 100 records into another group-participants table.
the participant data will be used when someone wants to see who is in a chat group (by several taps in the menu) and when someone joins or leaves the group, which we assume won't happen often.
So my question is: in this particular situation, what advantage will I gain if I use another 1:m table?
----- update -----
In addition to the answer I got, Renzo kindly pointed me to this discussion, which is also very helpful!
It's hard to respond to "is this design better/worse" style questions without understanding the full context. I'm going to make some assumptions based on your question.
You appear to be building a mobile application, supporting "many to many" user chat. I'm picturing something like Slack.
Your application design is using the SQLite database for local storage.
Your local SQLite database on the phone is some kind of subset of the overall application data - like a cache, only showing the data for the current user.
If all that is true, the question really comes down to style/maintainability on the one hand, and performance and scalability on the other.
From a "style" point of view, storing the data in a comma-separated value in a column is ugly. A new developer who joins the project, with a background in "regular" database design will consider it at best a hack. On the other hand, iOS developers may consider it perfectly normal.
From a performance point of view, it's probably not worth arguing about - parsing the delimited string is probably just as slow as reading/writing from the database.
From a scalability point of view, you may have a problem. If the application design needs to capture the order in which users joined the chat, or capture some kind of status (active/asleep, for instance), or provide a bit of history (user x exited at 21:20), you will almost certainly end up re-designing the database.
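For concreteness, a minimal sketch of the 1:m table under discussion (SQLite; the names and the joinedAt column are just illustrative), which makes the "how many groups is this user in" question from above a one-liner:

create table group_participant (
  groupId integer not null,
  userId integer not null,
  joinedAt text,
  primary key (groupId, userId)
);

-- how many groups is user 42 in?
select count(*) from group_participant where userId = 42;

It also gives you a natural place to hang exactly the join-order, status, and history attributes mentioned above.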

Storing messages and threads in Windows Azure Table Storage

I am designing a simple messaging service using ASP.NET MVC / Windows Azure Table Storage. I have two kinds of entities - messages and message threads. The relation between them is simple: each thread can have multiple messages, but each message is assigned to exactly one thread.
Table storage is not a relational DB, so representing relations is always a bit tricky. I need to decide between 2 approaches:
1. Having one big table for threads and one for messages, with threadId as the partition key of the message entity so that messages are partitioned by thread.
2. Dynamically creating a special table for each message thread, with threadId as the name of the table.
I tend to prefer the second because it fits better into the architecture of the rest of the service. But there will obviously be a large number of tables created in the storage account.
Do you think this may be a problem?
You could also consider having just one table that stores both Thread and Message entities. This would give you transaction support, and you could use Lucifure's hybrid approach on this table.
Creating a large number of tables may be an issue, depending on how you want to manage them. The underlying REST API for listing tables works like a query for table entities: it only returns the first 1000 tables, after which you have to use a continuation token. All of the storage explorers I've seen don't allow you to query tables by name; they simply list the first 1000 tables. If you end up with 20000 threads, it could take you a while to get to the table you want.
One way you could mitigate this is to put your message tables in their own storage account. This way, your storage account with all of your other tables won't get crowded out by all of these dynamic tables that you will be creating and possibly deleting.
Deleting is actually one of the ways in which using a separate table for each thread would be easier: to delete all of the related messages, you simply delete one table rather than iterating over each message and deleting it.
Everything else, however, will be more complicated than keeping all of the messages in one table. If this is core functionality for your app and you can dedicate enough time to developing it this way, one table per thread is probably a good idea. Otherwise, the easy way to do things is with one big table.
You may consider a hybrid approach to keep the number of tables at a manageable level, depending on your scalability needs.
My experience has been that date-based partitioning at the table level is a very effective approach and can be leveraged across the board.
For example, you could partition tables based on date, with a granularity of a day or a month. So a table name like “Thread201202” could be used for all threads started in February 2012.
Your thread id would implicitly include the “201202” and be something like “201202-myid01”, although you would not need to explicitly store that prefix in the partition key since it would be implied by the table name.
Aged threads could then be easily disposed of by deleting tables, say, more than a year old.
