I want to show a string value as one of the measure values. My fact table has an integer column, a string column, and some foreign keys to dimension tables. I can show the integer column as a measure, but not the string column, because the Measure element in the cube schema (written in XML) requires an 'aggregator' attribute (which specifies the aggregate function applied to the measure values). Of course I understand that string values can't be aggregated, but I want to show the string value at the lowest level of the hierarchy.
I read the following article. A figure (around the middle of the page) shows a cube that contains a string value as a measure value. But that string is a Property of a dimension table, so it isn't stored in the fact table. I want to show a string value that is stored in the fact table.
A Simple Date Dimension for Mondrian Cubes
Does anyone have an idea how a string value could be shown as a measure value? Or do I have to edit Mondrian's source code?
I have had the same problem and solved it by setting the aggregator attribute in the measure tag to max.
e.g.
<Measure name="Comment" datatype="String" column="comment" caption="Comment" aggregator="max"/>
Why does it need to be a measure?
If no aggregation would naturally be applied to it and you just want the string value, it is a dimension, not a measure. Trying to force it to be a measure is not the best approach.
I think the figure you reference is just showing a drillthrough, and that the only actual measure is Turnover. The report layout is slightly misleading in terms of dimensions and measures.
You can just use the fact table again in the schema as a dimension table if for some reason you don't want to split this out into a separate physical table.
Sounds like the string may be high cardinality to the integer, possibly 1:1. Depending upon the size of your cube, this might or might not be a performance challenge. But don't try to make it a measure.
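If you go the dimension route and reuse the fact table, one option is a degenerate dimension: a Hierarchy with no Table element, so Mondrian reads the level column from the fact table itself. A rough sketch, assuming the fact table's string column is named comment (the names here are placeholders):
<Dimension name="Comment">
  <Hierarchy hasAll="true" allMemberName="All Comments">
    <!-- no <Table> element: the level column is read from the cube's fact table -->
    <Level name="Comment" column="comment" type="String" uniqueMembers="false"/>
  </Hierarchy>
</Dimension>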
Good luck!
I believe this would be easier with PostgreSQL or MongoDB, both of which I'm familiar with, but I'm using DynamoDB with my project for the sake of learning how to use it and getting comfortable with it. I've never used it before.
I want to use DynamoDB to store high scores for my typing test project. There are 4 data attributes to be stored:
name (doesn't need to be unique)
WPM
number of errors
test type (because I have 2 different kinds of typing tests)
At first, my partition key was testType, and my sort key was WPM. Then I realized that if anyone got the same WPM as a previous user, it would overwrite the previous user's data, because testType and WPM, the two key components, were identical. So ties did not work.
So, now, name is my partition key, and WPM is my sort key. In order to filter by testType, I just use JS array filter methods. This still doesn't seem optimal though for multiple reasons. For my small typing test project, I think it's ok, but I can see that it's possible for 2 people to input the same name and get the same WPM and overwrite each other.
What would be a better way to set this up with DynamoDB?
Assuming you want the top X many WPM results for a given test type:
Set the partition key to be the test type. Set the sort key as <WPM>#<username>. Make sure to zero-pad the WPM so it’s always 3 digits even if the score is below 100. That keeps it numerically sorted.
With this key structure you have a sorted list (in the sort key) of all the scores for a given test type. You can Query against the test type and use ScanIndexForward=false to get descending high scores.
Notice how multiple identical scores by different usernames won’t overwrite each other. The username can be pulled from the returned sort key or from an attribute on the item, along with other metadata about the high score event.
If you have multiple users with the same username, well, that’s kinda weird. Presumably you have an internal identifier. You can use that as the suffix in the sort key instead of the username.
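A minimal sketch of this layout with Python and boto3, assuming a table named HighScores with partition key testType and sort key scoreKey (the names are placeholders, and the 3-digit padding assumes WPM stays below 1000):
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("HighScores")  # hypothetical table name

def put_score(test_type, username, wpm, errors):
    # Zero-pad WPM so the string sort key orders the same way as the number.
    table.put_item(Item={
        "testType": test_type,                 # partition key
        "scoreKey": f"{wpm:03d}#{username}",   # sort key: <WPM>#<username>
        "wpm": wpm,
        "errors": errors,
        "name": username,
    })

def top_scores(test_type, limit=10):
    # Descending sort-key order = highest WPM first.
    resp = table.query(
        KeyConditionExpression=Key("testType").eq(test_type),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp["Items"]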
Why the following command is slow (5 mins)?
mytable | where extent_tags() contains "20210613" | count
I know this is not the best way to get the count; I could have used .show table extents and simply calculated sum(RowCount) with the summarize operator. But I am just testing. Ideally ADX should be able to search tags across extents and get the count from metadata alone: once it finds the right extent, the row count is already stored in the extent metadata anyway, so why should it take 5 minutes? By the way, the extent(s) I am interested in have the following tags:
drop-by:20210613
ingest-by:20210613
There is a datetime field in the table which I could have used to filter instead, which is what ADX generally recommends, and I can guess why: the min and max of every datetime field are stored in every extent of the table. But then the tags are also stored in every extent. So which method is more efficient: filtering on a datetime field (if available) or on tags?
a. you're correct that using .show table T extents where tags contains 'string' | ... would be much more efficient
b. as mentioned in the documentation: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/extenttagsfunction
Filtering on the value of extent_tags() performs best when one of the following string operators is used: has, has_cs, !has, !has_cs.
c. which method is more efficient, filtering on a datetime field if available or tags?
The former, especially when your filter is on a substring rather than on the full content of the tag. Tags are non-indexed metadata properties of shards (extents), not indexed data columns. Also see: https://yonileibowitz.github.io/blog-posts/datetime-columns.html
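For reference, a rough sketch of both approaches, reusing the tag value from the question (assuming the extents still carry it):
// metadata only: sum the row counts of the tagged extents
.show table mytable extents where tags contains "20210613"
| summarize TotalRows = sum(RowCount)

// same query-side filter, but with `has` instead of `contains`, per the docs above
mytable
| where extent_tags() has "20210613"
| count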
I'm struggling with DynamoDB schema design of a table storing locations. The table will have [userId, lastUpdatedTime, locationGooglePlaceId, longitude, latitude, hideOnUI(bool)]
One of the main queries is: given the user's current location (x, y) as GPS coordinates, find nearby userIds based on their longitude and latitude.
The problem is how to design an index for this purpose. The table itself can have HASH key userId and SORT key lastUpdatedTime, but what would the GSI look like? I can't seem to identify any partition key for an "equals" operation.
In SQL it'll be something like:
select * from table
where x-c <= longitude and longitude < x+c
AND y-c <= latitude and latitude < y+c
Thanks
First of all, I am not sure DynamoDB is a good fit here; it may be better to use another database, since DynamoDB does not support complex indexes.
Nonetheless, here is a design that you can try.
First, you can split your map into square blocks; every block has an id and a known position and size.
Then if you have a location and you want to find all nearby points, you can do the following.
Every point in your database will be stored in the Points table with the following keys:
BlockId (String, UUID, partition key) - id of the block this point belongs to
Latitude (Number, sort key) - latitude of the point
Longitude (Number) - a plain attribute
Now, if you know which square the user's location is in and which squares are nearby, you can run the following search against each nearby square:
BlockId = <nearby_block_id>
Latitude between (y-c, y+c)
and use a filter expression based on the Longitude attribute:
Longitude between (x-c, x+c)
It does not really matter whether you use latitude or longitude as the sort key here.
BETWEEN is a DynamoDB operator that can be used with sort key conditions or in filter expressions:
BETWEEN : Greater than or equal to the first value, and less than or
equal to the second value. AttributeValueList must contain two
AttributeValue elements of the same type, either String, Number, or
Binary (not a set type). A target attribute matches if the target
value is greater than, or equal to, the first element and less than,
or equal to, the second element. If an item contains an AttributeValue
element of a different type than the one provided in the request, the
value does not match. For example, {"S":"6"} does not compare to
{"N":"6"}. Also, {"N":"6"} does not compare to {"NS":["6", "2", "1"]}
Now the downside of this is that there can be no more than 10GB of data per partition key, so the number of points that you can put in a single square is limited. You can work around this if your squares are small enough, or if your squares have variable sizes and you use big squares for sparse areas and small squares for crowded areas, but that seems like a non-trivial project.
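A minimal sketch of the per-block query with Python and boto3, assuming the Points table above (the table name, attribute names, and the neighbour-block lookup are placeholders; Decimal is used because boto3 maps DynamoDB numbers to Decimal):
from decimal import Decimal

import boto3
from boto3.dynamodb.conditions import Key, Attr

points = boto3.resource("dynamodb").Table("Points")  # hypothetical table name

def points_near(block_ids, x, y, c):
    """Query each nearby block for points inside the (x +/- c, y +/- c) box."""
    lat_lo, lat_hi = Decimal(str(y - c)), Decimal(str(y + c))
    lon_lo, lon_hi = Decimal(str(x - c)), Decimal(str(x + c))
    results = []
    for block_id in block_ids:  # the user's own block plus its neighbours
        resp = points.query(
            KeyConditionExpression=Key("BlockId").eq(block_id)
                & Key("Latitude").between(lat_lo, lat_hi),
            # Longitude is a plain attribute, so it can only be a filter,
            # which is applied after the items are read.
            FilterExpression=Attr("Longitude").between(lon_lo, lon_hi),
        )
        results.extend(resp["Items"])
    return results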
Reading through the sqlite documentation I found the following function:
http://www.sqlite.org/lang_corefunc.html#likelihood
The likelihood(X,Y) function returns argument X unchanged. The value Y in likelihood(X,Y)
must be a floating point constant between 0.0 and 1.0, inclusive. The likelihood(X) function
is a no-op that the code generator optimizes away so that it consumes no CPU cycles during
run-time (that is, during calls to sqlite3_step()). The purpose of the likelihood(X,Y)
function is to provide a hint to the query planner that the argument X is a boolean that is
true with a probability of approximately Y. The unlikely(X) function is short-hand for
likelihood(X,0.0625).
Assuming that I know the expression will be true (return 1) 75% of the time, how would:
select likelihood(x,.75)
help the query optimizer?
The original example was this:
Consider the following schema and query:
CREATE TABLE composer(
  cid INTEGER PRIMARY KEY,
  cname TEXT
);
CREATE TABLE album(
  aid INTEGER PRIMARY KEY,
  aname TEXT
);
CREATE TABLE track(
  tid INTEGER PRIMARY KEY,
  cid INTEGER REFERENCES composer,
  aid INTEGER REFERENCES album,
  title TEXT
);
CREATE INDEX track_i1 ON track(cid);
CREATE INDEX track_i2 ON track(aid);
SELECT DISTINCT aname
FROM album, composer, track
WHERE cname LIKE '%bach%'
AND composer.cid=track.cid
AND album.aid=track.aid;
The schema is for a (simplified) music catalog application, though similar kinds of schemas come up in other situations. There is a large number of albums. Each album contains one or more tracks. Each track has a composer. Each composer might be associated with multiple tracks.
The query asks for the name of every album that contains a track with a composer whose name matches '%bach%'.
The query planner needs to choose among several alternative algorithms for this query. The best choice hinges on how well the expression "cname LIKE '%bach%'" filters the results. Let's give this expression a "filter value", a number between 1.0 and 0.0. A value of 1.0 means that cname LIKE '%bach%' is true for every row in the composer table. A value of 0.0 means the expression is never true.
The current query planner (in version 3.8.0) assumes a filter value of 1.0. In other words, it assumes that the expression is always true. The planner is assuming the worst case so that it will pick a plan that minimizes worst case run-time. That's a safe approach, but it is not optimal. The plan chosen for a filter of 1.0 is track-album-composer. That means that the "track" table is in the outer loop. For each row of track, an indexed lookup occurs on album. And then an indexed lookup occurs on composer, then the LIKE expression is run to see if the album name should be output.
A better plan would be track-composer-album. This second plan avoids the album lookup if the LIKE expression is false. The current planner would choose this second algorithm if the filter value was just slightly less than 1.0. Say 0.99. In other words, if the planner thought that the LIKE expression would be false for 1 out of every 100 rows, then it would choose the second plan. That is the correct (fastest) choice for when the filter value is large.
But in the common case of a music library, the filter value is probably much closer to 0.0 than it is to 1.0. In other words, the string "bach" is unlikely to be found in most composer names. And for values near 0.0, the best plan is composer-track-album. The composer-track-album plan is to scan the composer table once looking for entries that match '%bach%' and for each matching entry use indices to look up the track and then the album. The current 3.8.0 query planner chooses this third plan when the filter value is less than about 0.1.
The likelihood function gives the database a (hopefully) better estimate of the selectivity of a filter.
With the example query, it would look like this:
SELECT DISTINCT aname
FROM album, composer, track
WHERE likelihood(cname LIKE '%bach%', 0.05)
AND composer.cid=track.cid
AND album.aid=track.aid;
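To see what the hint actually does to the chosen plan, you can prefix both versions of the query with EXPLAIN QUERY PLAN and compare the join orders (the exact output format varies between SQLite versions):
EXPLAIN QUERY PLAN
SELECT DISTINCT aname
FROM album, composer, track
WHERE likelihood(cname LIKE '%bach%', 0.05)
AND composer.cid=track.cid
AND album.aid=track.aid;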
What are they and how do they work?
Where are they used?
When should I (not) use them?
I've heard the word over and over again, yet I don't know its exact meaning.
What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?
(Note: this is not homework; I do go to school, but they only teach us the basics of informatics.)
Wikipedia seems to have a pretty nice answer to what they are.
You should use them when you want to look up values by some index.
As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)
You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values; the hash function's speed is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case).
The tricky bit is that hash functions are usually imperfect. (Perfect hash functions exist, but tend to be very specific to particular applications and particular datasets; they're hardly ever worthwhile.) There are two approaches to dealing with this, and each requires storing the key with the value: one (open addressing) is to use a pre-determined pattern to probe onward from the location given by the hash until a free slot is found; the other (chaining) is to hang a linked list off each entry in the array (so you do a linear lookup over what is hopefully a short list). The cases of production code where I've read the source have all used chaining, with dynamic rebuilding of the hash table when the load factor gets too high.
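To make the chaining variant concrete, here is a toy sketch in Python; a real implementation would also resize the bucket array when the load factor grows, which is omitted here:
class ChainedHashTable:
    """Toy hash table using separate chaining; no resizing."""

    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The hash function maps the key to a bucket index.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # otherwise chain a new entry

    def get(self, key):
        # Linear scan over the (hopefully short) chain for this bucket.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("apple", 3)
table.put("pear", 5)
print(table.get("apple"))  # -> 3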
Good hash functions are one-way functions that produce a well-distributed value from any given input, so you get a (mostly) unique value for each input. They are also deterministic: the same input always generates the same output.
An example of a good hash function is SHA1 or SHA256.
Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.
While any of these columns could have duplicates, let's assume that no rows are exactly the same.
In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.
We could look up the user record like this from our database:
SELECT * FROM users
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';
We have to search through 4 different columns, using 4 different indexes, to find my record.
However, you could create a new "hash" column, and store the hash value of all four columns combined.
String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");
You might get a hash value like AE32ABC31234CAD984EA8.
You store this hash value as a column in the database and index on that. You now only have to search one index.
SELECT * FROM users
WHERE hash_value = 'AE32ABC31234CAD984EA8';
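To compute the value you plug into that WHERE clause, here is a small sketch with Python's hashlib, using SHA-256 as mentioned above; the delimiter is my own addition, so that different column splits ('ab' + 'c' vs 'a' + 'bc') can't produce the same concatenated string:
import hashlib

def user_hash(first_name, last_name, address, telephone_number):
    # Join with a delimiter so different column splits cannot yield the same string.
    combined = "|".join([first_name, last_name, address, telephone_number])
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

print(user_hash("Marcus", "Adams", "1234 Main St", "555-1212"))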
Once we have the id for the requested user, we can use that value to look up related data in other tables.
The idea is that the hash function offloads work from the database server.
Collisions are not likely. If two users have the same hash, it's most likely that they have duplicate data.