HBase schema design -- to make sorting easy? (OLAP)

I have 1M words in my dictionary. Whenever a user issues a query on my website, I check whether the query contains any of the words in my dictionary and increment the counter for each matching word individually. For example, if a user types in "Obama is a president" and "Obama" and "president" are in my dictionary, then I should increment the counters for "Obama" and "president" by 1.
And from time to time, I want to see the top 100 words (the most queried words). If I use HBase to store the counters, what schema should I use? -- I have not come up with an efficient one yet.
If I use the dictionary word as the row key and "counter" as the column key, then updating (incrementing) the counter is very efficient, but it is very hard to sort and return the top 100.
Can anyone give good advice? Thanks.

You can use the natural schema (row key as word and column as count) and use IHBase to get a secondary index on the count column. See https://issues.apache.org/jira/browse/HBASE-2037 for the initial implementation; the current code lives at http://github.com/ykulbak/ihbase.
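For the increment half of that natural schema, here is a minimal sketch using the happybase Python client (the client choice, table name, and column family are my assumptions, not part of the answer above):
import happybase
connection = happybase.Connection('localhost')  # assumes an HBase Thrift server on localhost
table = connection.table('word_counts')         # hypothetical table: row key = word
def record_query(query, dictionary):
    # bump the atomic HBase counter for every dictionary word in the query
    for word in query.lower().split():
        if word in dictionary:
            table.counter_inc(word.encode('utf-8'), b'stats:counter')
counter_inc maps to HBase's atomic increment, so concurrent web servers can update the same counter without any extra locking.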

From Adobe's presentation at HBaseCon 2012 (slide 28 in particular), I suggest using two tables and this sort of data structure for the row key:
Table "name" (row key = word, value = the running count):
President => 1000
Test => 900
Table "count" (row key = reversed count, then the word):
429461296:President => dummyvalue
429461396:Test => dummyvalue
The second table's row keys are derived by using Long.MAX_VALUE - count at that point in time.
As you get new words (or new counts), just add "count:word" as a row key to the count table. That way, you always have the top words returned first when you scan the table.
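A hedged sketch of how those two tables could be maintained with happybase (table names follow the slide; the column families and the delete-then-put index maintenance are my assumptions, and a real implementation would still have to handle concurrent updates to the same word):
import happybase
JAVA_LONG_MAX = 9223372036854775807  # Long.MAX_VALUE
connection = happybase.Connection('localhost')
name_table = connection.table('name')    # row key = word, holds the running count
count_table = connection.table('count')  # row key = (Long.MAX_VALUE - count):word
def bump_word(word):
    new_count = name_table.counter_inc(word.encode(), b'stats:count')
    # drop the stale index row, then write the new one
    count_table.delete(f'{JAVA_LONG_MAX - (new_count - 1)}:{word}'.encode())
    count_table.put(f'{JAVA_LONG_MAX - new_count}:{word}'.encode(), {b'stats:dummy': b'1'})
def top_words(n=100):
    # row keys sort lexicographically, so the highest counts scan first
    return [key.decode().split(':', 1)[1] for key, _ in count_table.scan(limit=n)]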

Sorting 1M longs can be done in memory, so why not keep it simple?
Store the words x, y, z issued at time t as row key t with columns word:x=1, word:y=1, word:z=1 in a table. Then use a MapReduce job to sum up the counts per word and take the top 100.
This also enables further analysis.
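The full job would normally be a MapReduce over the HBase table; as a rough illustration of the aggregation step only, here is an in-memory version in Python that scans through happybase (table and column-family names are made up):
import heapq
from collections import Counter
import happybase
queries = happybase.Connection('localhost').table('queries')  # row key = timestamp, columns word:<w> = 1
def top_100():
    totals = Counter()
    for _, columns in queries.scan(columns=[b'word']):
        for qualifier, value in columns.items():   # e.g. b'word:obama' -> b'1'
            totals[qualifier.split(b':', 1)[1].decode()] += int(value)
    return heapq.nlargest(100, totals.items(), key=lambda kv: kv[1])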

Related

Limit fulltext search in MariaDB (InnoDB)

I'm having trouble making a search on a fairly large (5 million entries) table fast.
This is InnoDB on MariaDB (10.4.25).
Structure of the table my_table is like so:
id | text
1  | some text
2  | some more text
I now have a fulltext index on "text" and search for:
SELECT id FROM my_table WHERE MATCH ('text') AGAINST ("some* tex*" IN BOOLEAN MODE);
This is not super slow, but it can yield millions of results. Retrieving them in my Java application takes forever, but I need the matching ids.
Therefore, I wanted to limit the result set up front to the ids I already know are the only relevant ones, and tried something like this (id is the primary key):
SELECT id FROM my_table WHERE id IN (1,2) AND MATCH ('text') AGAINST ("some* tex*" IN BOOLEAN MODE);
hoping that it would first narrow the search to those two ids, then apply the fulltext search, and give me the two results super quickly. Alas, that's not what happened, and I don't understand why.
How can I limit the query to only search through ids I already know, and make the query faster by doing so?
When you use a FULLTEXT (or SPATIAL) index together with some 'regular' index, the Optimizer assumes that the former will run faster, so it does that first.
Furthermore, it is nontrivial (maybe impossible) to run MATCH against a subset of a table.
Both of those conspire to say that the MATCH will happen first. (Of course, you were hoping to do the opposite.)
Is there a workaround? I doubt it, especially if there are a lot of rows with words starting with 'some' or 'tex'.
One thing to try is "+":
MATCH ('text') AGAINST ("+some* +tex*" IN BOOLEAN MODE);
Please report back whether this helped.
Hmmmm... Perhaps you want
MATCH (`text`) -- this
MATCH ('text') -- NOT this
There are two features in MariaDB:
max time spent in query
max number of rows accessed (may not apply to FULLTEXT)
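For reference, I take those two features to be the max_statement_time variable and the LIMIT ... ROWS EXAMINED clause (my reading, not confirmed above). A sketch of capping one query at two seconds from Python with mysql-connector, combined with the backtick and "+" fixes (connection details are placeholders):
import mysql.connector  # pip install mysql-connector-python
conn = mysql.connector.connect(host='localhost', user='app', password='secret', database='mydb')
cur = conn.cursor()
# SET STATEMENT applies the time cap to this one statement only
cur.execute("""
    SET STATEMENT max_statement_time=2 FOR
    SELECT id FROM my_table
    WHERE MATCH (`text`) AGAINST ('+some* +tex*' IN BOOLEAN MODE)
""")
matching_ids = [row[0] for row in cur.fetchall()]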

How can I implement a junction index in DynamoDB?

Given two DynamoDB tables: Books and Words, how can I create an index that associates the two? Specifically, I'd like to query to get all Books that contain a certain Word, and query to get all Words that appear in a specific Book.
The objective is to avoid scanning an entire table for these queries.
Based on your question I can't tell if you only care about unique words or if you want every word including duplicates. I'll assume unique words.
This can be done with a single table and a Global Secondary Index.
Create a table called BookWords with a Hash key of bookId and a Sort key of word. If you Query this table with a bookId you will get all of the unique words in that book.
Create a Global Secondary Index with a Hash key of word and a Sort key of bookId. If you Query this index with a word you will get all of the bookIds of books that contain that word.
Depending on your use case, you will probably want to normalize the words. For example, is "Word" the same as "word"?
If you want all words, not just unique words, you can use a similar approach with a few small changes. Let me know.
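A rough boto3 sketch of the two lookups (the table and attribute names follow the answer; the GSI name is assumed, and pagination is omitted):
import boto3
from boto3.dynamodb.conditions import Key
book_words = boto3.resource('dynamodb').Table('BookWords')  # hash: bookId, sort: word
def words_in_book(book_id):
    resp = book_words.query(KeyConditionExpression=Key('bookId').eq(book_id))
    return [item['word'] for item in resp['Items']]
def books_with_word(word):
    resp = book_words.query(
        IndexName='word-bookId-index',   # assumed name of the GSI (hash: word, sort: bookId)
        KeyConditionExpression=Key('word').eq(word))
    return [item['bookId'] for item in resp['Items']]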

How to make values unique in cassandra

I want to create a unique constraint in Cassandra,
as I want all the values in my columns to be unique within my column family.
ex:
name-rahul
phone-123
address-abc
Now I want that no values equal to rahul, 123, or abc can be inserted again. Searching on DataStax, I found that I can achieve this for the partition key with IF NOT EXISTS, but I am not finding a solution for keeping all three values unique.
Meaning that if
name- jacob
phone-123
address-qwe
this should also not be inserted into my database, because the phone column has the same value as the row with name rahul.
The short answer is that constraints of any type are not supported in Cassandra. They are simply too expensive, as they must involve multiple nodes, thus defeating the purpose of having eventual consistency in the first place. If you needed to make a single column unique, then there could be a solution, but not for multiple unique columns. For the same reason there is no isolation and no consistency (the C and I from ACID). If you really need to use Cassandra with this type of enforcement, then you will need to create some kind of synchronizing application layer which intercepts all requests to the database and makes sure that the values are unique and all constraints are enforced. But this won't have anything to do with Cassandra.
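For the single-column case mentioned above, a lightweight-transaction sketch with the DataStax Python driver (keyspace and table are made up, and this guards only the phone column, not all three):
from cassandra.cluster import Cluster
session = Cluster(['127.0.0.1']).connect('my_keyspace')   # assumed keyspace
result = session.execute(
    "INSERT INTO users_by_phone (phone, name, address) VALUES (%s, %s, %s) IF NOT EXISTS",
    ('123', 'rahul', 'abc'))
if not result.was_applied:   # the LWT reports whether the phone number was already taken
    print('phone number already exists, insert rejected')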
I know this is an old question and the existing answer is correct (you can't do constraints in C*), but you can solve the problem using batched creates. Create one or more additional tables, each with the constrained column as the primary key and then batch the creates, which is an atomic operation. If any of those column values already exist the entire batch will fail. For example if the table is named Foo, also create Foo_by_Name (primary key Name), Foo_by_Phone (primary key Phone), and Foo_by_Address (primary key Address) tables. Then when you want to add a row, create a batch with all 4 tables. You can either duplicate all of the columns in each table (handy if you want to fetch by Name, Phone, or Address), or you can have a single column of just the Name, Phone, or Address.

How to design DynamoDB table to facilitate searching by time ranges, and deleting by unique ID

I'm new to DynamoDB - I already have an application where the data gets inserted, but I'm getting stuck on extracting the data.
Requirements:
1. There must be a unique table per customer
2. Insert documents into the table (each doc has a unique ID and a timestamp)
3. Get X number of documents based on timestamp (ordered ascending)
4. Delete individual documents based on unique ID
So far I have created a table with a composite key (S:id, N:timestamp). However, when I come to query it, I realise that since my id is unique and I can't do a wildcard search on it, I won't be able to extract a range of items...
So, how should I design my table to satisfy this scenario?
Edit: Here's what I'm thinking:
Primary index will be composite: (s:customer_id, n:timestamp), where the customer ID will be the same within a table. This will enable me to extract data based on a time range.
Secondary index will be a hash (s:unique_doc_id), so that I can delete items using this index.
Does this sound like the correct solution? Thank you in advance.
You can satisfy the requirements like this:
Your primary key will be h:customer_id and r:unique_id. This makes sure all the elements in the table have different keys.
You will also have an attribute for timestamp and will have a Local Secondary Index on it.
You will use the LSI for requirement 3 and the BatchWriteItem API call to do batch deletes for requirement 4.
This solution doesn't require (1): all the customers can stay in the same table. (Heads up: there is a limit of 256 tables per account before you have to contact AWS to raise it.)
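A hedged boto3 sketch of requirements 3 and 4 under that key design (the table name, LSI name, and attribute names are placeholders):
import boto3
from boto3.dynamodb.conditions import Key
docs = boto3.resource('dynamodb').Table('Documents')  # hash: customer_id, range: unique_id
def docs_in_range(customer_id, start_ts, end_ts, x):
    # requirement 3: X documents in a time range, ordered ascending by timestamp
    resp = docs.query(
        IndexName='timestamp-index',   # assumed LSI (hash: customer_id, range: timestamp)
        KeyConditionExpression=Key('customer_id').eq(customer_id) &
                               Key('timestamp').between(start_ts, end_ts),
        ScanIndexForward=True,
        Limit=x)
    return resp['Items']
def delete_doc(customer_id, unique_id):
    # requirement 4: delete by the full primary key
    docs.delete_item(Key={'customer_id': customer_id, 'unique_id': unique_id})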

sqlite3 autoincrement - am I missing something?

I want to create unique order numbers for each day. So ideally, in PostgreSQL for instance, I could create a sequence and read it back for these unique numbers, because the readback both gets me the new number and is atomic. Then at close of day, I'd reset the sequence.
In sqlite3, however, I only see an autoincrement for the integer field type. So say I set up a table with an autoincrement field and insert a record to get the new number (seems like an awfully inefficient way to do it, but anyway...). When I go to read the max back, who is to say that another task hasn't gone in there and inserted ANOTHER record, causing me to read back a miss, with my number one too far advanced (and a duplicate of what the other task reads back)?
Conceptually, I require:
fast lock with wait for other tasks
increment number
retrieve number
unlock
...I just don't see how to do that with sqlite3. Can anyone enlighten me?
In SQLite, autoincrementing fields are intended to be used as actual primary keys for their records.
You should just use it as the ID for your orders table.
If you really want to have an atomic counter independent of corresponding table records, use a table with a single record.
ACID is ensured with transactions:
BEGIN IMMEDIATE;                        -- take the write lock up front so another task cannot interleave
SELECT number FROM MyTable;             -- read the current value
UPDATE MyTable SET number = number + 1; -- bump it within the same transaction
COMMIT;
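Put together in Python's built-in sqlite3 module, with the per-day numbering from the question (keying the counter by day means there is nothing to reset at close of day; table and column names are made up):
import sqlite3
from datetime import date
# isolation_level=None disables implicit transactions, so the explicit BEGIN IMMEDIATE below is the only transaction control
conn = sqlite3.connect('orders.db', isolation_level=None)
conn.execute('CREATE TABLE IF NOT EXISTS day_counter (day TEXT PRIMARY KEY, number INTEGER NOT NULL)')
def next_order_number(day=None):
    day = day or date.today().isoformat()
    conn.execute('BEGIN IMMEDIATE')   # lock -> increment -> read -> unlock, as listed in the question
    conn.execute('INSERT OR IGNORE INTO day_counter (day, number) VALUES (?, 0)', (day,))
    conn.execute('UPDATE day_counter SET number = number + 1 WHERE day = ?', (day,))
    (number,) = conn.execute('SELECT number FROM day_counter WHERE day = ?', (day,)).fetchone()
    conn.execute('COMMIT')
    return number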
ok, looks like sqlite either doesn't have what I need, or I am missing it. Here's what I came up with:
declare zorder as INTEGER PRIMARY KEY AUTOINCREMENT and zuid as INTEGER in the orders table
this means every new row gets an ascending number, starting with 1
generate a random number:
rnd = int(random.random() * 1000000) # unseeded python uses system time
create new order (just the SQL for simplicity):
'INSERT INTO orders (zuid) VALUES ('+str(rnd)+')'
find that exact order number using the random number:
'SELECT zorder FROM orders WHERE zuid = '+str(rnd)
pack away that number as the new order number (newordernum)
clobber the random number to reduce collision risks
'UPDATE orders SET zuid = 0 WHERE zorder = '+str(newordernum)
...and now I have a unique new order, I know what the correct order number is, the risk of a read collision is reduced to negligible, and I can prepare that order without concern that I'm trampling on another newly created order.
Just goes to show you why DB authors implement sequences, lol.
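For what it's worth, Python's sqlite3 cursors expose lastrowid, which hands back the AUTOINCREMENT value generated by the INSERT just issued on that connection, so the random-marker round trip can be skipped; a minimal sketch against the same orders table:
import sqlite3
conn = sqlite3.connect('orders.db')   # placeholder database file
cur = conn.execute('INSERT INTO orders (zuid) VALUES (0)')
new_order_number = cur.lastrowid      # the zorder value just assigned to this row
conn.commit()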
