Best way to store non-numeric Ids for wildcard search? [closed] - azure-data-explorer

Azure Data Explorer is a Microsoft Azure service, closed source, with good documentation, but sometimes it needs further explanation.
As it is closed source, I cannot check the code myself to understand its inner workings, so I must rely on the community.
This question is very specific to Azure Data Explorer's implementation of string columns and the way it builds indexes for them. In a general sense, I feel I'd know how to implement this in other databases.
This is a real-life scenario. I can't go into much detail, but right now we are using Postgres for this. Since the vast majority of queries hit the last 90 days, Postgres is getting pretty expensive for the 5+ years of data we have. Data Explorer seems like a perfect fit because of the cold storage feature.
In this case, in a single event (~170 bytes), I have a field holding an identifier (a device id) composed of 9 alphanumeric characters.
I want to be able to find events from a device using wildcard queries like:
*A*2
A*2
A*2*
*2
A*
That is: multiple wildcards, prefix and suffix, prefix only, and suffix only.
What is a good approach for that in Azure Data Explorer? This field has ~150 million unique values, and we get on the order of 50 million rows per day.
Our users want to do queries like: give me the events involving devices with an id matching "A*2*" in the last 90 days.
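To make that concrete, in Data Explorer I expect such a query to look something like the sketch below (using the azure-kusto-data Python client; the Events table, the DeviceId and Timestamp columns, and the cluster/database names are placeholders, and the wildcard pattern is expressed as a regex):

```python
# Minimal sketch: run a wildcard device-id query over the last 90 days.
# Table/column names (Events, DeviceId, Timestamp) and the cluster/database
# names are placeholders for this example.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://mycluster.westeurope.kusto.windows.net"  # placeholder
database = "mydb"                                           # placeholder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

# The wildcard pattern "A*2*" becomes the regular expression ^A.*2
query = r"""
Events
| where Timestamp > ago(90d)
| where DeviceId matches regex @"^A.*2"
| project Timestamp, DeviceId
"""

response = client.execute(database, query)
for row in response.primary_results[0]:
    print(row["Timestamp"], row["DeviceId"])
```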
What I'm considering doing is:
Altering the column's encoding policy to the Identifier profile, to avoid creating a term search index, as recommended in the encoding policy documentation.
But I'm not sure how that affects this use case of wildcard searches, as the documentation is very vague about what the Identifier encoding profile actually does. For example: is it good for high-cardinality columns? How does it affect performance for regex queries?
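Concretely, the change I'm considering looks like the sketch below (same Python client as above; the Events/DeviceId names are placeholders, and the type='identifier' shorthand is my reading of the encoding policy documentation):

```python
# Minimal sketch: set the Identifier encoding profile on the device-id
# column via a control command. Table/column names are placeholders, and
# the exact command shorthand follows my reading of the encoding policy docs.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://mycluster.westeurope.kusto.windows.net"  # placeholder
database = "mydb"                                           # placeholder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

# Control commands go through execute_mgmt rather than execute.
command = ".alter column Events.DeviceId policy encoding type='identifier'"
client.execute_mgmt(database, command)
```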

Related

What's the best way to do an all-column query in dynamodb? [closed]

Hi there, we have a DynamoDB table with a bunch of columns, like stack-id, email, firstname, lastname, etc., where stack-id is the hash key and email is a GSI. Pretty standard stuff.
Now we are adding a free-form search feature to our site, where a user can type anything into the search bar, say 'foobar', and we then search that string across all columns of all records; if any match is found in any column, the record is considered a match. This is easy for MySQL and the like, but not for DynamoDB.
So I came up with two potential ways to do this. The first one is sort of brute force: we make every column a GSI, then issue multiple queries, each against a specific column, and aggregate the results of all the queries. Apparently, this is not a good idea.
The second way is adding a new column that concatenates all the columns of a record together. That column then contains everything from all the other columns. We then make this column a GSI and simply query that column. Is this a good method?
I wonder if there are better ways to achieve this? Thanks in advance.
This is not a good use case for DynamoDB. A better alternative would be to push the data to something else that is better designed for the type of searching you want to do, like ElasticSearch. Indexing Amazon DynamoDB Content with Amazon Elasticsearch Service Using AWS Lambda is a good reference for that.
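A minimal sketch of that approach, assuming a DynamoDB stream on the table and a v8-style Python Elasticsearch client; the endpoint, index name, and the stack-id key attribute are placeholders:

```python
# Minimal sketch: a Lambda handler that mirrors DynamoDB stream records
# into an Elasticsearch index so free-form search runs against ES.
# The endpoint, index name, and "stack-id" key attribute are placeholders.
from boto3.dynamodb.types import TypeDeserializer
from elasticsearch import Elasticsearch, NotFoundError

es = Elasticsearch("https://search.example.com:9200")  # placeholder endpoint
deserializer = TypeDeserializer()

def handler(event, context):
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        doc_id = deserializer.deserialize(keys["stack-id"])
        if record["eventName"] == "REMOVE":
            try:
                es.delete(index="records", id=doc_id)
            except NotFoundError:
                pass
        else:
            # Convert the DynamoDB attribute-value format into plain Python.
            image = record["dynamodb"]["NewImage"]
            doc = {k: deserializer.deserialize(v) for k, v in image.items()}
            es.index(index="records", id=doc_id, document=doc)
```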

What is the performance of subqueries vs two separate select queries? [closed]

Is one generally faster than the other with SQLite file databases?
Do subqueries benefit from some kind of internal optimization, or are they handled internally as if you ran two separate queries?
I did some testing but I don't see much difference, probably because my table is too small for now (fewer than 100 records)?
It depends on many factors. Two separate queries means two requests. A request has a little overhead, but this weighs more heavily if the database is on a different server. For subqueries and joins, the data needs to be combined. Small amounts can easily be combined in memory, but if the data gets bigger, then it might not fit, causing the need to swap temporary data to disk, degrading performance.
So there is no general rule to say which one is faster. It's good to do some testing and find out about these factors. Do some tests with 'real' data, and with the amount of data you expect to have a year from now.
And keep testing in the future. A query that performs well might suddenly become slow when the environment or the amount of data changes.
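A quick way to run that kind of test against an SQLite file database is with Python's built-in sqlite3 module (the customers/orders schema here is invented purely for illustration):

```python
# Minimal sketch: time a subquery against the equivalent pair of queries
# on an SQLite database. The customers/orders schema is invented.
import sqlite3
import time

conn = sqlite3.connect("test.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")

def run_subquery():
    # One round trip: the filter runs as a subquery inside SQLite.
    return conn.execute("""
        SELECT COUNT(*) FROM orders
        WHERE customer_id IN (SELECT id FROM customers WHERE country = 'NL')
    """).fetchone()[0]

def run_two_queries():
    # Two round trips: fetch the ids first, then feed them back in.
    ids = [row[0] for row in
           conn.execute("SELECT id FROM customers WHERE country = 'NL'")]
    placeholders = ",".join("?" * len(ids)) or "NULL"
    return conn.execute(
        f"SELECT COUNT(*) FROM orders WHERE customer_id IN ({placeholders})",
        ids).fetchone()[0]

for name, fn in [("subquery", run_subquery), ("two queries", run_two_queries)]:
    start = time.perf_counter()
    result = fn()
    print(name, result, f"{time.perf_counter() - start:.4f}s")
```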

Datatable Archival SQL Server 2008 [closed]

I have a database that is 54 GB in size, and some of the tables have loads of rows in them.
Querying these tables eats up performance and it takes ages for queries to execute.
I have created indexes and such, but now I am planning to archive records in certain tables and only query them when the user wants to.
For example, let's say I have a table called Transaction and I usually don't need rows older than 3 months.
What I want to do is move all the rows of the Transaction table that are more than 3 months old into some other table, and query that table only when the user chooses "view archived transactions" in the UI.
Options I can think of:
Creating an ArchivedTransaction table in the same database (a batched move for this is sketched below), but the problem is that the size of the database will keep growing and at some point I will have to start deleting rows.
Moving rows to a different database altogether, but then how do I manage database requests? A lot of changes would be required, and I'm also not sure about performance when someone asks to view archived rows.
Adding an Archived column to the tables and checking the flag when needed, but the size issue stays the same and performance doesn't improve that much.
I am not sure which way to go; I am sure there is a much better way to handle this that I am not aware of.
Any ideas on which way to go, or anything you can suggest?
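For reference, the first option would look roughly like this batched move (a sketch with pyodbc; the connection string, table, and column names are placeholders, and it assumes an ArchivedTransaction table with the same schema):

```python
# Minimal sketch: move Transaction rows older than 3 months into
# dbo.ArchivedTransaction in batches. Connection string, table, and
# column names are placeholders; ArchivedTransaction must already exist
# with the same schema and no conflicting triggers on either table.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;")  # placeholder connection string

move_batch = """
DELETE TOP (5000) FROM dbo.[Transaction]
OUTPUT DELETED.* INTO dbo.ArchivedTransaction
WHERE TransactionDate < DATEADD(MONTH, -3, GETDATE());
"""

cursor = conn.cursor()
while True:
    cursor.execute(move_batch)   # delete a batch and copy it to the archive
    moved = cursor.rowcount
    conn.commit()
    if moved == 0:               # nothing left older than 3 months
        break
```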

storing attribute data for many product types [closed]

I am in a bit of a quandary regarding my design for a website that has to keep track of thousands of different product types (e.g. cars, speakers, baby strollers, etc.) and store attributes (e.g. turning radius, maximum output, color, etc.) for each.
My understanding is that if people are going to be searching for, say, all blue baby strollers, I'm best to have a table for each type of product (in addition to a "generic" product table that stores things common to all products, like name and brand). If it weren't for concerns about searching speed I'd be keen to use an EAV type of model, but I've been discouraged from that based on things I've read.
This seems like a common problem, but I have yet to see a good answer for how to deal with what I'm trying to do. I know there are a ton of sites out there tracking this kind of data, but I'm finding it difficult to imagine that they've got thousands of different tables in order to have a "details" table for each product type. Is it possible they're all using EAV and that the EAV-naysayers, while technically correct, are exaggerating the performance loss of using that type of model? Or are they just using one big "product" table with columns covering every attribute, and just not worrying about all the empty values?
Any help would be greatly appreciated. I just know there's a common way of doing this, and I'd really like someone to let me in on it! :)
Have you ever considered using a CQRS pattern instead? You could create a read database, with a denormalized search table, that would be optimized for search.
http://blog.fossmo.net/post/Command-and-Query-Responsibility-Segregation-(CQRS).aspx
http://martinfowler.com/bliki/CQRS.html
You could use a database like RavenDB, a document database that's supposed to be really fast for reads, to gain even better performance.
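To make the read-model idea concrete, here is a tiny sketch using SQLite from Python; the product_search table, its columns, and the example attributes are all invented for illustration:

```python
# Minimal sketch: a denormalized read-side "search" table that is updated
# from the normalized product tables on every write. Table and column
# names are invented for illustration.
import sqlite3

conn = sqlite3.connect("read_model.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS product_search (
    product_id   INTEGER PRIMARY KEY,
    product_type TEXT,
    brand        TEXT,
    color        TEXT,       -- common filterable attributes get columns
    attributes   TEXT        -- everything else flattened into one blob
)""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_type_color "
             "ON product_search (product_type, color)")

def index_product(product_id, product_type, brand, attrs):
    """Called from the write side whenever a product changes."""
    blob = " ".join(f"{k}:{v}" for k, v in attrs.items())
    conn.execute(
        "INSERT OR REPLACE INTO product_search VALUES (?, ?, ?, ?, ?)",
        (product_id, product_type, brand, attrs.get("color"), blob))
    conn.commit()

# Read side: "all blue baby strollers" becomes a plain indexed query.
index_product(1, "stroller", "Acme", {"color": "blue", "wheels": "3"})
rows = conn.execute(
    "SELECT product_id, brand FROM product_search "
    "WHERE product_type = ? AND color = ?", ("stroller", "blue")).fetchall()
print(rows)
```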

How to deal with huge amounts of data, such as grouping operations? [closed]

If a table has hundreds of millions of rows, how do you do grouping operations in SQL Server 2008?
Can anyone give me some suggestions?
Thanks!
If a table has hundreds of millions of rows, how do you do grouping operations in SQL Server 2008?
There are some misunderstandings here.
Hundreds of millions = small. A table on a project I work on grows by around 60 million rows per day, and I have a data set that grows by 600 million entries per day.
SQL has grouping operations, you know.
Now, you state "entity framework" as tag. Hereis the deal: you do not use business objects for groups as groups have no functionality anyway and are pure read only projections.
Go SQL (either direct or with a capable LINQ provider). GROUP BY is a SQL Command you may want to read up on.
If oyu need repeatable read, then a materialized view on the server may work, or inserting the data into another table for fast access. Depends a lot on the usage pattern. And make sure you actually have hardware capable enough for your required usage (which I agree many people will never understand). Given proper hardware (Exadata shines here, but it start i thin kat a quarter million USD or so per cluster) you can pull of billion row aggregations in nearly real time.
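For the materialized-view route, SQL Server's equivalent is an indexed view; here is a minimal sketch driven from Python with pyodbc (the dbo.Sales table and its columns are placeholders):

```python
# Minimal sketch: pre-aggregate a large table with an indexed view in
# SQL Server so repeated GROUP BY queries read the small materialized
# result instead. The dbo.Sales table and its columns are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;", autocommit=True)  # placeholder
cursor = conn.cursor()

# Indexed views require SCHEMABINDING, two-part names, and COUNT_BIG(*).
cursor.execute("""
CREATE VIEW dbo.DailySalesTotals WITH SCHEMABINDING AS
SELECT SaleDate,
       SUM(Amount)  AS TotalAmount,
       COUNT_BIG(*) AS RowCnt
FROM dbo.Sales
GROUP BY SaleDate;
""")
cursor.execute("""
CREATE UNIQUE CLUSTERED INDEX IX_DailySalesTotals
ON dbo.DailySalesTotals (SaleDate);
""")

# The aggregation now reads from the materialized result.
for sale_date, total, cnt in cursor.execute(
        "SELECT SaleDate, TotalAmount, RowCnt FROM dbo.DailySalesTotals "
        "WITH (NOEXPAND) ORDER BY SaleDate"):
    print(sale_date, total, cnt)
```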
