How can I get a random record from a table in sqlite for current day. This is for something like "Word of the day". So, I get a random record from db for today, a different random record tomorrow.
I've seen ORDER BY RAND(20120714) LIMIT 1, which works in MySQL, but I'd like to know if it's possible to do this in SQLite.
Thanks in advance.
SQLite does not allow you to seed its random number generator.
You have to compute the random number in your own code.
This is easy only if the records have consecutive numbers, i.e., autoincrementing IDs with no gaps; otherwise you may need to change your database.
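For illustration, a minimal Python sketch (assuming a hypothetical words table whose id column is a gapless autoincrementing integer starting at 1), seeding the pick with today's date so the same record comes back all day and a different one tomorrow:

```python
import sqlite3
import random
from datetime import date

# Minimal sketch: hypothetical "words" table with a gapless autoincrementing "id".
def word_of_the_day(db_path="words.db"):
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM words").fetchone()
        # Seed with today's date: the pick is stable all day and changes tomorrow.
        rng = random.Random(date.today().toordinal())
        chosen_id = rng.randint(1, count)
        return conn.execute(
            "SELECT word FROM words WHERE id = ?", (chosen_id,)
        ).fetchone()
    finally:
        conn.close()

print(word_of_the_day())
```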
I need to create a table with the following fields:
place, date, status
My keys are: partition key - place, sort key - date.
Status can be either 0 or 1.
Table has approximately 300k rows per day and about 3 days worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data to this DDB.
I need to run the following queries (only) once per day:
#1 Return the count of all places with date = current_date - 1
#2 Return the count and list of all places with date = current_date - 1 and status = 0
Questions:
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields?
If the answer to the above question is yes: for query #2, do I need to create a GSI on date and status, with date as partition key and status as sort key?
Creating a GSI vs. using a filter expression on status for query #2: which of the two is recommended?
Running analytical queries (such as counts) is a misuse of a NoSQL database such as DynamoDB, which is designed for scalable lookup use cases.
Even if you get the Scan to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB into S3, and then run an Athena query over that data. It will be much more flexible to run various analytical queries.
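A rough sketch of that route with boto3, assuming the exported data has already been registered in Glue/Athena as a hypothetical ddb_export.places table and that a results bucket exists (all names here are placeholders):

```python
import boto3

# Sketch only: database, table, and bucket names below are hypothetical.
athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="""
        SELECT status, COUNT(*) AS cnt
        FROM ddb_export.places
        WHERE "date" = cast(current_date - interval '1' day AS varchar)
        GROUP BY status
    """,
    QueryExecutionContext={"Database": "ddb_export"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query execution id:", resp["QueryExecutionId"])
```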
The easiest thing for you to do is a full table scan once per day, filtering by yesterday's date, and as part of that keep your own client-side count of whether the status was 0 or 1. The filter is not index optimized, so it will be a true full table scan.
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to run a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to do repeated queries against the data then the export makes more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? A million rows, 100 bytes each, so that's a 100 MB table. That's 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On-Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003 per daily scan. Less than a dime a month. It'd be cheaper still if you run provisioned.
Just do the scan. :)
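If it helps, here is a minimal boto3 sketch of that daily scan, assuming the attributes are named date (string, YYYY-MM-DD) and status, and a hypothetical table name places:

```python
import boto3
from datetime import date, timedelta
from boto3.dynamodb.conditions import Attr

# Sketch only: table and attribute names are assumptions.
table = boto3.resource("dynamodb").Table("places")
yesterday = (date.today() - timedelta(days=1)).isoformat()

total = status_zero = 0
scan_kwargs = {"FilterExpression": Attr("date").eq(yesterday)}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        total += 1
        if item.get("status") == 0:   # client-side count of status = 0
            status_zero += 1
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

print(f"count for {yesterday}: {total}, with status 0: {status_zero}")
```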
Can a group count query in Amazon Neptune, or any graph database, fail due to big data?
I mean, if the count exceeds the limits of the count datatype, can there be an overflow?
Short answer
Gremlin query language semantics (as defined by the TinkerPop code) define the output of the count() function as a 64-bit long. So, yes, a count cannot exceed the range of a long.
Long answer
Having said that, let's try to calculate the amount of data you would need to insert into the DB to hit that threshold. Each entity (vertex/edge/property) in the DB has a unique ID associated with it. Let us hypothetically assume that the storage of each entity consists of just the identifier. Also, let us assume that the data type of the identifier is the most space-efficient one, i.e. a long (and not a string, which would use more space than a long).
To hit the limit of count, the DB would need to store at least 2^64 entities, each with a unique identifier, i.e. at least (2^64 * 64) bits of data, which is greater than 1000 petabytes even by this very conservative estimate.
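As a quick sanity check of that figure:

```python
# Back-of-the-envelope check: 2^64 identifiers, 64 bits (8 bytes) each.
entities = 2 ** 64
bytes_total = entities * 8
petabytes = bytes_total / 1024 ** 5
print(f"{petabytes:,.0f} PB")   # ~131,072 PB, well above 1000 PB
```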
The point is, you would need to store a huge amount of data before you hit the limit of count. If you are operating with that amount of data, a DB might not be the right storage solution for you.
I am developing an application that allows users to read books. I am using DynamoDB for storing details of the books that user reads and I plan to use the data stored in DynamoDB for calculating statistics, such as trending books, authors, etc.
My current schema looks like this:
user_id | timestamp | book_id | author_id
user_id is the partition key, and timestamp is the sort key.
The problem I am having is that, with this schema, I am only able to query the details of the books that a single user (partition key) has read. That is one of my requirements.
The other requirement is to query all the records that have been created in a certain date range, e.g. records created in the past 7 days. With this schema, I am unable to run this query.
I have looked into so many other options, and haven't figured out a way to create a schema that would allow me to run both queries.
Retrieve the records of the books read by a single user (Can be done).
Retrieve the records of books read by all the users in last x days (Unable to do it).
I do not want to run a scan, since it will be expensive. I looked into the option of using a GSI on timestamp, but it requires me to specify a hash key, and therefore I cannot query all the records created between 2 dates.
One naive solution would be to create a GSI with a constant hash key across all books and the timestamp as a range key. This will allow you to perform this type of query.
The problem with this approach is that it is likely to become a scaling bottleneck, as the same hash key means the same node. One workaround for this problem is to do sharding: create a set of hash keys (e.g. from 1 to 10) and assign a random key from this set to every book. Then when you make a query you will need to make 10 queries and merge the results. You can even make the set size dynamic, so that it scales with your data.
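A rough boto3 sketch of that write-sharding idea, with hypothetical names throughout (table reads, GSI shard-timestamp-index on shard/timestamp); pagination is omitted for brevity:

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

# Sketch only: table, index, and attribute names are assumptions.
NUM_SHARDS = 10
table = boto3.resource("dynamodb").Table("reads")

def put_read(user_id, timestamp, book_id, author_id):
    table.put_item(Item={
        "user_id": user_id,
        "timestamp": timestamp,
        "book_id": book_id,
        "author_id": author_id,
        "shard": random.randint(1, NUM_SHARDS),   # random shard key for the GSI
    })

def reads_between(start_ts, end_ts):
    items = []
    for shard in range(1, NUM_SHARDS + 1):        # one query per shard, then merge
        resp = table.query(
            IndexName="shard-timestamp-index",
            KeyConditionExpression=Key("shard").eq(shard)
                                   & Key("timestamp").between(start_ts, end_ts),
        )
        items.extend(resp["Items"])
    return items
```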
I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or Elasticsearch and do your analysis there.
One solution could be to use a GSI with two extra attributes: whenever you ingest a record, also store the date as the GSI partition key, e.g. 2017-07-02, and the time as the GSI range key, e.g. 04:22:33.000.
Maintain a separate checkpoint table that contains the process name and the timestamp of the last read. Every time you read from the main table, update the checkpoint table so that subsequent reads only fetch incremental data. If you want the last 7 days of data, go back 7 dates from the current date and get the data between then and the current time.
You can use a query spec for this, passing the date as the partition key and using a BETWEEN condition on the timestamp, which is the range key.
Calculate the date difference between the checkpoint table and the current date, and you can then fetch the data day by day.
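A sketch of what that day-by-day query could look like with boto3, assuming a hypothetical GSI date-time-index keyed on date (partition) and time (sort) on a table named reads:

```python
import boto3
from datetime import date, timedelta
from boto3.dynamodb.conditions import Key

# Sketch only: table, index, and attribute names are assumptions.
table = boto3.resource("dynamodb").Table("reads")

def reads_last_n_days(n=7):
    items = []
    for offset in range(1, n + 1):               # one query per day partition
        day = (date.today() - timedelta(days=offset)).isoformat()
        resp = table.query(
            IndexName="date-time-index",
            KeyConditionExpression=Key("date").eq(day)
                                   & Key("time").between("00:00:00.000", "23:59:59.999"),
        )
        items.extend(resp["Items"])
    return items
```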
First, let me explain my problem:
I have a table that will contain approximately 5,000,000 records per year, and these records will be kept for at least 10 years (the exact retention is not yet defined). It stores events from a production machine. I generate a report plus a dashboard displaying various relatively complex information (average number of events per 10 minutes over a month, graphs, ...) and I also want to see the records themselves. The data displayed will be, in large majority, from the last 2 months; viewing the rest of the data must always be possible, but a lower access speed is acceptable there.
I work on MariaDB v10.1.12.
The idea was to make a partition covering the last 3 months. I realize now that this is not so easy. I have not found any solution for such a partition; in fact, it is impossible to define a partition based on NOW(), CURRENT_DATE(), etc., either directly or indirectly via another computed column.
Do you have any ideas for me? Perhaps another solution than a partition.
Thank you in advance.
I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions henceforth. Yes, that, in theory, leads to an infinite number of partitions, but I predict that you will revamp the data structure within a few years.
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; fewer than that leads to asking "why bother".
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes. Putting it first would be redundant since partition pruning comes first.
More on partitioning.
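As an illustration of the layout (not your exact schema), here is a small Python script that prints such DDL for a hypothetical events table with an event_ts DATETIME column: one catch-all partition for old data, then monthly partitions and a MAXVALUE catch-all:

```python
from datetime import date

# Sketch only: "events" and "event_ts" are hypothetical names.
# Note: MariaDB requires the partitioning column to be part of every unique key,
# so event_ts would have to be included in the PRIMARY KEY.
def monthly_partitions(start_year, start_month, count):
    """Yield (partition_name, exclusive upper bound) for `count` months."""
    y, m = start_year, start_month
    for _ in range(count):
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        yield f"p{y}{m:02d}", date(ny, nm, 1).isoformat()
        y, m = ny, nm

parts = ["  PARTITION p_old VALUES LESS THAN (TO_DAYS('2016-01-01'))"]
parts += [
    f"  PARTITION {name} VALUES LESS THAN (TO_DAYS('{upper}'))"
    for name, upper in monthly_partitions(2016, 1, 12)
]
parts.append("  PARTITION p_future VALUES LESS THAN MAXVALUE")

print(
    "ALTER TABLE events\n"
    "PARTITION BY RANGE (TO_DAYS(event_ts)) (\n"
    + ",\n".join(parts)
    + "\n);"
)
```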
It sounds like a main purpose for the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "summary table(s)" of counts and subtotals over selected time intervals. Do roughly 100 rows get added up per 10-minute interval? If so, then a summary table based on 10-minute intervals would have 1/100th as many rows, and the queries against it would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
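For instance, a sketch of the 10-minute roll-up (assuming the PyMySQL driver and hypothetical names: raw table events with an event_ts DATETIME column, summary table events_summary_10min(bucket_start, event_count)):

```python
import pymysql  # assumption: PyMySQL; any MariaDB/MySQL driver works the same way

# Sketch only: connection details, table and column names are hypothetical.
conn = pymysql.connect(host="localhost", user="app", password="secret", database="factory")
with conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO events_summary_10min (bucket_start, event_count)
            SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(event_ts) / 600) * 600) AS bucket_start,
                   COUNT(*)
            FROM events
            WHERE event_ts >= %s AND event_ts < %s
            GROUP BY bucket_start
            """,
            ("2016-07-01 00:00:00", "2016-07-02 00:00:00"),  # e.g. roll up one day at a time
        )
    conn.commit()
```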
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 months' worth of data (partitioned by month), and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
Table partitioning is an advanced feature; it is not indexing, but a rearrangement of the table's data. So it is not a "duplicate": new data will simply be stored according to the predefined partitioning ranges.
You must still specify the month range criteria in your queries as usual, and you MUST create indexes on columns that are not used as the partition range. When you run a SELECT, the algorithm associated with the partitioned table will handle any merging (if required) in the background, so you can treat a partitioned table exactly like a typical table.
For more details, please check the MariaDB partitioning overview.
I output the query plan on SQLite, and it shows
0|0|0|SCAN TABLE t (~500000 rows)
I wonder what the meaning of the number (~500000) is? I guess it is the table size, but I executed the query on a small table which does not have that many rows.
Is there any official documentation about the meaning of this number? Thanks.
As the official documentation says, this is the number of rows that the database estimates will be returned.
If there is an index on a searched column, and if you have run ANALYZE, then SQLite can make an estimate based on the actual data. Otherwise, it assumes that tables contain one million rows, and that a search like column > x filters out half the rows.
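For example, here is how you could run ANALYZE and inspect the plan from Python; whether the ~N rows estimate appears in the EXPLAIN QUERY PLAN output depends on your SQLite version (newer versions print a simplified plan without it), but ANALYZE affects the estimates either way:

```python
import sqlite3

# Small demonstration using an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t(x INTEGER);
    CREATE INDEX t_x ON t(x);
    INSERT INTO t(x) VALUES (1), (2), (3);
""")

# Before ANALYZE: the planner falls back on its built-in default guesses.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE x > 1").fetchall())

conn.execute("ANALYZE")   # gather real statistics into sqlite_stat1

# After ANALYZE: the estimates are based on the actual data.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE x > 1").fetchall())
```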