I am trying to populate every row in a column with a random number ranging from 0 to the row count.
So far I have this
UPDATE table
SET column = ABS (RANDOM() % (SELECT COUNT(id) FROM table))
This does the job but produces duplicate values, which turned out to be bad. I added a Unique constraint but that just causes it to crash.
Is there a way to update a column with random unique values from certain range?
Thanks!
If you want to later read the records in a random order, you can just do the ordering at that time:
SELECT * FROM MyTable ORDER BY random()
(This will not work if you need the same order in multiple queries.)
Otherwise, you can use a temporary table to store the random mapping between the rowids of your table and the numbers 1..N.
(Those numbers are automatically generated by the rowids of the temporary table.)
CREATE TEMP TABLE MyOrder AS
SELECT rowid AS original_rowid
FROM MyTable
ORDER BY random();
UPDATE MyTable
SET MyColumn = (SELECT rowid
FROM MyOrder
WHERE original_rowid = MyTable.rowid) - 1;
DROP TABLE MyOrder;
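The temporary-table approach above can be tried end to end with Python's built-in sqlite3 module. This is only a sketch: the table layout (an `id` primary key and a `name` column) and the ten sample rows are assumptions made for illustration, while the three SQL statements are the ones from the answer.

```python
import sqlite3

# Hypothetical sample table; only MyTable and MyColumn come from the answer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (id INTEGER PRIMARY KEY, name TEXT, MyColumn INTEGER)"
)
conn.executemany(
    "INSERT INTO MyTable (name) VALUES (?)",
    [(f"row{i}",) for i in range(10)],
)

# The rowids of MyOrder (1..N) follow the random insertion order, so each
# original row is paired with exactly one number; subtracting 1 gives 0..N-1.
conn.executescript("""
    CREATE TEMP TABLE MyOrder AS
        SELECT rowid AS original_rowid
        FROM MyTable
        ORDER BY random();
    UPDATE MyTable
       SET MyColumn = (SELECT rowid
                       FROM MyOrder
                       WHERE original_rowid = MyTable.rowid) - 1;
    DROP TABLE MyOrder;
""")

values = [v for (v,) in conn.execute("SELECT MyColumn FROM MyTable ORDER BY id")]
print(values)  # some ordering of the numbers 0..9, each used once
```

For a large table, an index on MyOrder(original_rowid) would probably help the correlated subquery, since it is evaluated once per row.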
What you seem to be seeking is not simply a set of random numbers, but rather a random permutation of the numbers 1..N. This is harder to do. If you look in Knuth (The Art of Computer Programming), or in Bentley (Programming Pearls or More Programming Pearls), one suggested way is to create an array with the values 1..N, and then for each position, swap the current value with a randomly selected other value from the array. (I'd need to dig out the books to check whether it is any arbitrary position in the array, or only with a value following it in the array.) In your context, then you apply this permutation to the rows in the table under some ordering, so row 1 under the ordering gets the value in the array at position 1 (using 1-based indexing), etc.
In the 1st Edition of Programming Pearls, Column 11 Searching, Bentley says:
Knuth's Algorithm P in Section 3.4.2 shuffles the array X[1..N].
for I := 1 to N do
Swap(X[I], X[RandInt(I,N)])
where the RandInt(n,m) function returns a random integer in the range [n..m] (inclusive). That's nothing if not succinct.
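A direct transcription of that loop into Python might look like the sketch below. The function name `shuffle_in_place` is made up for this example, and the indexing is 0-based rather than the 1-based X[1..N] of the quoted pseudocode.

```python
import random


def shuffle_in_place(x, rng=random):
    """Shuffle list x by swapping each position with a random later (or same) one."""
    n = len(x)
    for i in range(n):
        # randint is inclusive on both ends, like RandInt(I, N) in the quote.
        j = rng.randint(i, n - 1)
        x[i], x[j] = x[j], x[i]
    return x


numbers = shuffle_in_place(list(range(1, 11)))
print(numbers)  # the values 1..10 in some random order
```

Note that the swap partner is drawn from the current position onward, which answers the earlier question about whether any arbitrary position may be used.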
The alternative is to have your code thrashing around when there is one value left to update, waiting until the random number generator picks the one value that hasn't been used yet. As a hit and miss process, that can take a while, especially if the number of rows in total is large.
Actually translating that into SQLite is a separate exercise. How big is your table? Is there a convenient unique key on it (other than the one you're randomizing)?
Given that you have a primary key, you can easily generate an array of structures such that each primary key is allocated a number in the range 1..N. You then use Algorithm P to permute the numbers. Then you can update the table from the primary keys with the appropriate randomized number. You might be able to do it all with a second (temporary) table in SQL, especially if SQLite supports UPDATE statements with a join between two tables. But it is probably nearly as simple to use the array to drive singleton updates. You'd probably not want a unique constraint on the random number column while this update is in progress.
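As a rough sketch of the "array drives singleton updates" idea, the following uses Python's sqlite3 module. The table `items`, its columns `pk` and `rnd`, and the sample keys are all hypothetical, and `random.shuffle` from the standard library stands in for a hand-written Algorithm P.

```python
import random
import sqlite3

# Hypothetical table: pk is the existing primary key, rnd receives the permutation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (pk INTEGER PRIMARY KEY, rnd INTEGER)")
conn.executemany(
    "INSERT INTO items (pk) VALUES (?)",
    [(k,) for k in (3, 7, 8, 15, 21, 42)],
)

# Read the keys, build the numbers 1..N, and permute them in memory.
keys = [k for (k,) in conn.execute("SELECT pk FROM items ORDER BY pk")]
numbers = list(range(1, len(keys) + 1))
random.shuffle(numbers)

# One UPDATE per primary key, all inside a single transaction.
with conn:
    conn.executemany("UPDATE items SET rnd = ? WHERE pk = ?", zip(numbers, keys))

assigned = [r for (r,) in conn.execute("SELECT rnd FROM items ORDER BY pk")]
print(assigned)  # each of 1..6 appears exactly once
```

Because no unique constraint exists on `rnd` in this sketch, the intermediate states of the update cannot trip over one.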
Related
I have a table with around 65 million rows that I'm trying to run a simple query on. The table and indexes looks like this:
CREATE TABLE E(
x INTEGER,
t INTEGER,
e TEXT,
A,B,C,D,E,F,G,H,I,
PRIMARY KEY(x,t,e,I)
);
CREATE INDEX ET ON E(t);
CREATE INDEX EE ON E(e);
The query I'm running looks like this:
SELECT MAX(t), B, C FROM E WHERE e='G' AND t <= 9878901234;
I need to run this query for thousands of different values of t and was expecting each query to run in a fraction of a second. However, the above query is taking nearly 10 seconds to run!
I tried running the query plan but only get this:
0|0|0|SEARCH TABLE E USING INDEX EE (e=?)
So this should be using the index. With a binary search I would expect at worst only 26 comparisons, which should be pretty quick.
Why is my query so slow?
Each table in a query can use one index. Since your WHERE clause looks at multiple columns, you can use a multi-column index. For these, all but the last column used from the index has to test for equality; the last one used can be used for greater than/less than.
So:
CREATE INDEX e_idx_e_t ON E(e, t);
should give you a boost.
For further reading about how SQLite uses indexes, the Query Planner documentation is a good introduction.
You're also mixing an aggregate function (MAX(t)) with columns (B and C) that aren't part of a GROUP BY. In SQLite's case, this means that it will pick the values for B and C from the row with the maximum t value; other databases usually throw an error.
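Both points can be checked on a small scale with Python's sqlite3 module. The sketch below uses a cut-down, hypothetical table named `events` with only the columns the query touches, plus four made-up rows; only the index definition and the shape of the query follow the thread.

```python
import sqlite3

# Simplified stand-in for the 65-million-row table; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (x INTEGER, t INTEGER, e TEXT, B, C)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, "G", "b100", "c100"),
        (2, 200, "G", "b200", "c200"),
        (3, 300, "G", "b300", "c300"),
        (4, 250, "H", "other", "other"),
    ],
)

# Equality column first, range column last, as described above.
conn.execute("CREATE INDEX e_idx_e_t ON events (e, t)")

query = "SELECT MAX(t), B, C FROM events WHERE e = ? AND t <= ?"

# The last field of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("G", 250))
)
result = conn.execute(query, ("G", 250)).fetchone()

print(plan)
print(result)  # B and C come from the row holding the maximum t
```

The plan text should name the two-column index, and the bare columns B and C are taken from the same row as MAX(t).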
I have a table t with around 500,000 rows. One of the columns (stringtext) contains a very long string, and I have now discovered that there are in fact only 80 distinct strings. I'd like to declutter table t by moving the strings into a separate table, s, and merely referencing them in t.
I have created a separate table of the long strings, including what is effectively an explicit row-index number using:
CREATE TEMPORARY TABLE stmp AS
SELECT DISTINCT
stringtext
FROM t;
CREATE TABLE s AS
SELECT _ROWID_ AS stringindex, stringtext
FROM stmp;
(It was creating this table that showed me there were only a few distinct strings).
How can I now replace stringtext in t with the corresponding stringindex from s?
I would try something like UPDATE t SET stringtext = (SELECT stringindex FROM s WHERE s.stringtext = t.stringtext). I would recommend first creating an index on s(stringtext), as SQLite might not be smart enough to build a temporary index. Afterwards, a VACUUM would be in order.
Untested.
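A small trial run of that suggestion, using Python's sqlite3 module, could look like this. The five sample strings, the `id` column on t, and the index name `s_stringtext` are assumptions for the sketch; the statements that build stmp and s are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, stringtext TEXT)")

# Made-up long strings with only three distinct values.
originals = ["alpha " * 50, "beta " * 50, "alpha " * 50, "gamma " * 50, "beta " * 50]
conn.executemany("INSERT INTO t (stringtext) VALUES (?)", [(s,) for s in originals])

conn.executescript("""
    CREATE TEMPORARY TABLE stmp AS
        SELECT DISTINCT stringtext FROM t;
    CREATE TABLE s AS
        SELECT _ROWID_ AS stringindex, stringtext FROM stmp;
    -- Index so the correlated subquery below does not scan s for every row of t.
    CREATE INDEX s_stringtext ON s (stringtext);
    UPDATE t
       SET stringtext = (SELECT stringindex FROM s
                         WHERE s.stringtext = t.stringtext);
    DROP TABLE stmp;
""")
conn.commit()
conn.execute("VACUUM")  # reclaim the space freed by the long strings

# Follow the reference back to confirm nothing was lost. The CAST is there
# because t.stringtext is declared TEXT, so the index is stored as text.
restored = [
    text
    for (text,) in conn.execute(
        "SELECT s.stringtext FROM t "
        "JOIN s ON s.stringindex = CAST(t.stringtext AS INTEGER) "
        "ORDER BY t.id"
    )
]
distinct_count = conn.execute("SELECT COUNT(*) FROM s").fetchone()[0]
print(distinct_count, restored == originals)
```

On the real table it may be cleaner to add a separate integer column for the reference instead of reusing the TEXT column, but that goes beyond what the answer proposes.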
With this schema:
CREATE TABLE temperatures(
sometext TEXT,
lowtemp INT,
hightemp INT,
moretext TEXT);
When I run this search:
select * from temperatures where lowtemp < 20 and hightemp > 20;
I get the correct result which is always one record (due to the specifics of the data).
Now, when I index the table:
CREATE INDEX ltemps ON temperatures(lowtemp);
CREATE INDEX htemps ON temperatures(hightemp);
The exact same query above stops providing expected results -- now I get many records, including ones where the lowtemp and hightemp obviously don't meet the comparison test.
I'm running this on the same sqlite3 database, same table. The only difference is adding the above 2 index statements after table creation.
Can someone explain how indexing influences this behavior?
SELECT col1 FROM tbl ORDER BY RAND() LIMIT 10;
This can work fine for small tables. For a big table, however, it has a serious performance problem: to generate the list of random rows, MySQL needs to assign a random number to each row and then sort them.
Even if you want only 10 random rows from a set of 100k rows, MySQL needs to sort all 100k rows and then extract only 10 of them.
My solution to this problem is to use RAND in the WHERE clause instead of the ORDER BY clause. First, calculate the fraction of the table that your desired number of result rows represents. Second, use this fraction in the WHERE clause and keep only the rows whose RAND value is less than or equal to it.
SELECT col1 FROM tbl WHERE RAND()<=0.0005;
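To see why the row count from such a filter is only approximate, here is a small Python simulation of the same idea outside the database. The function name `sample_by_fraction`, the fixed seed, and the in-memory list standing in for the table are all assumptions for this sketch; 50 wanted rows out of 100,000 gives the 0.0005 used in the query.

```python
import random


def sample_by_fraction(rows, wanted, rng):
    """Keep each row independently with probability wanted / len(rows)."""
    fraction = wanted / len(rows)
    return [row for row in rows if rng.random() <= fraction]


rng = random.Random(12345)    # fixed seed only to make the demo repeatable
table = list(range(100_000))  # stand-in for the rows of tbl
picked = sample_by_fraction(table, 50, rng)

print(50 / len(table))  # the fraction used in the filter
print(len(picked))      # close to 50, but rarely exactly 50
```

Each row is tested once and nothing is sorted, which is where the saving over ORDER BY RAND() comes from; the price is that the number of rows returned varies from run to run.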
In order to get exactly 100 rows in the result set, we can increase that number a bit and limit the query:
For example:
I have a table which I always access by 2 int fields, so I want an index to help. There are never any writes. The int fields are not unique.
What is the optimal index?
Table
MyIntA
MyIntB
SomeTextValue
The queries always look like this:
Select SomeTextValue from MyTable where MyIntA=1 and MyIntB=3
You could add an index on (MyIntA, MyIntB).
CREATE INDEX your_index_name ON MyTable (MyIntA, MyIntB);
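The question does not say which database is in use, so the following sketch assumes SQLite (through Python's sqlite3 module) purely to show how one might confirm that the two-column index is picked up. The nine sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (MyIntA INTEGER, MyIntB INTEGER, SomeTextValue TEXT)"
)
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [(a, b, f"text-{a}-{b}") for a in range(1, 4) for b in range(1, 4)],
)
conn.execute("CREATE INDEX your_index_name ON MyTable (MyIntA, MyIntB)")

query = "SELECT SomeTextValue FROM MyTable WHERE MyIntA = ? AND MyIntB = ?"

# The last field of each EXPLAIN QUERY PLAN row describes the chosen access path.
plan_details = [
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1, 3))
]
matches = conn.execute(query, (1, 3)).fetchall()

print(plan_details)
print(matches)
```

Other database systems have their own equivalent of EXPLAIN, with different output formats.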
Note: it might be preferable to make this pair of columns your primary key if the pair of columns (when considered together) contains only distinct values and there isn't another obvious choice for the primary key.
For example, if your table contains only data like this:
MyIntA MyIntB
1 1
1 2
2 1
2 2
Here both MyIntA and MyIntB when considered separately are not unique so neither of these columns individually could be used as a primary key. However, the pair (MyIntA, MyIntB) is unique, so this pair of columns could be used as a primary key.
The selectivity (the number of distinct values) of the data in the columns MyIntA and MyIntB should help you decide whether your index should be (MyIntA, MyIntB), (MyIntB, MyIntA), or just (MyIntA) or (MyIntB).
This link should help, albeit for a different RDBMS
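One simple way to look at the selectivity mentioned above is to count the distinct values in each column. The sketch below assumes SQLite via Python's sqlite3 module and uses invented data in which MyIntA has 3 distinct values and MyIntB has 100.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (MyIntA INTEGER, MyIntB INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(a, b) for a in range(1, 4) for b in range(1, 101)],
)

# Compare the distinct counts with the total row count.
total, distinct_a, distinct_b = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT MyIntA), COUNT(DISTINCT MyIntB) FROM MyTable"
).fetchone()

print(total, distinct_a, distinct_b)
```

With data shaped like this, MyIntB narrows the search far more than MyIntA does, which is the kind of information that feeds into the column-order decision.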