How to find the cardinality of all columns in Kusto?

I'm trying to find the number of distinct values in all columns for some query. I found that dcount works well, but you have to supply a specific column. I want to do this on all columns, where the column names and the number of columns are dynamic.

You'll have to explicitly include all columns of interest.
Note that any additional column you add to the query will increase its resource utilization, so if you know which columns are likely to be of high cardinality, consider including only those.
FWIW: you can generate the query (for all columns, with the caveat above) dynamically, and then invoke the result of this:
let tableName = "my_table";
let datetime_column_name = "my_datetime_column";
let lookback_period = 1h;
let column_names = toscalar(
    table(tableName)
    | getschema
    | summarize make_set(ColumnName)
);
print query = strcat(
    tableName,
    "\n| where ",
    datetime_column_name,
    " > ago(timespan(",
    lookback_period,
    "))\n| summarize dcount(",
    strcat_array(column_names, "),\ndcount("),
    ")")

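For illustration, if my_table had the hypothetical columns ColA, ColB and ColC, the printed query text would look roughly like this, and you could then run that text as its own query:
my_table
| where my_datetime_column > ago(timespan(01:00:00))
| summarize dcount(ColA),
dcount(ColB),
dcount(ColC)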

full_join by date plus one or minus one

I want to use full_join to join two tables. Below is my pseudo code:
join <- full_join(a, b, by = c("a_ID" = "b_ID" , "a_DATE_MONTH" = "b_DATE_MONTH" +1 | "a_DATE_MONTH" = "b_DATE_MONTH" -1 | "a_DATE_MONTH" = "b_DATE_MONTH"))
a_DATE_MONTH and b_DATE_MONTH are in date format "%Y-%m".
I want to do full join based on condition that a_DATE_MONTH can be one month prior to b_DATE_MONTH, OR one month after b_DATE_MONTH, OR exactly equal to b_DATE_MONTH. Thank you!
While SQL allows for (almost) arbitrary conditions in a join statement (such as a_month = b_month + 1 OR a_month + 1 = b_month), I have not found dplyr to allow the same flexibility.
The only way I have found to join in dplyr on anything other than a_column = b_column is to do a more general join and filter afterwards. Hence I recommend you try something like the following:
join <- full_join(a, b, by = c("a_ID" = "b_ID")) %>%
filter(abs(a_DATE_MONTH - b_DATE_MONTH) <= 1)
This approach still produces the same records in your final results.
It would perform worse (slower) if R did the complete full join before doing any filtering. However, dplyr is designed to use lazy evaluation, which means that (unless you do something unusual) both commands should be evaluated together (as they would be in a more complex SQL join).
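If the month columns really are "%Y-%m" character strings (as in the question), the subtraction in that filter needs a numeric month value to work on. A minimal sketch along those lines, using a hypothetical month_index() helper, might be:
library(dplyr)

# Turn a "%Y-%m" string into a running month count so that a one-month
# tolerance can be tested with simple arithmetic.
month_index <- function(x) {
  d <- as.Date(paste0(x, "-01"))
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m"))
}

join <- full_join(a, b, by = c("a_ID" = "b_ID")) %>%
  filter(abs(month_index(a_DATE_MONTH) - month_index(b_DATE_MONTH)) <= 1)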

Iterate through and conditionally append string values in a Pandas dataframe

I've got a dataframe of research participants whose IDs are stored in the format "0000.000", where the first four digits are their family ID number and the final three digits are their individual index within the family. The majority of individuals have a suffix of ".000", but some have ".001", ".002", etc.
As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly.
Those IDs that end in .000 are imported as "0000", rather than "0000.000". All others are imported correctly.
I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix.
If I were using R, I could do it like this.
df %>% mutate(StudyID = ifelse(length(StudyID) < 5,
                               paste(StudyID, ".000", sep = ""),
                               StudyID))
I've found a Python solution (below), but it's pretty janky.
row = 0
for i in df["StudyID"]:
    if len(i) < 5:
        df.iloc[row, 3] = i + ".000"
    else:
        df.iloc[row, 3] = i
    row += 1
I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time.
For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration. I only want the value currently being evaluated to change.
[i + ".000" if len(i)<5 else i for i in df["StudyID"]]
Is this possible?
As you said, your code does the trick. One other way of doing what you want that I could think of is the following:
# Start by creating a mask that gives you the index you want to change
mask = [len(i)<5 for i in df.StudyID]
# Change the value of the dataframe on the mask
df.StudyID.iloc[mask] += ".000"
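If you'd rather avoid positional indexing and chained assignment altogether, a vectorized sketch (assuming StudyID is already a string column) would be:
# Boolean Series marking the IDs that are missing the ".000" suffix
needs_suffix = df["StudyID"].str.len() < 5
# .loc-based assignment sidesteps the chained-indexing (SettingWithCopy) pitfall
df.loc[needs_suffix, "StudyID"] = df.loc[needs_suffix, "StudyID"] + ".000"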
I think by length(StudyID) you meant nchar(StudyID), as @akrun pointed out.
You can do it the dplyr way in Python using datar:
>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
... StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
StudyID
<object>
0 0000
1 0001
2 0000.000
3 0001.001
>>>
>>> df >> mutate(StudyID=if_else(
... nchar(f.StudyID) < 5,
... paste(f.StudyID, ".000", sep=""),
... f.StudyID
... ))
StudyID
<object>
0 0000.000
1 0001.000
2 0000.000
3 0001.001
Disclaimer: I am the author of the datar package.
Ultimately, I needed to do this for a few different dataframes, so I ended up defining a function to solve the problem so that I could apply it to each one.
I think the list-comprehension idea was going to become too complex, and potentially too difficult to understand when reviewed, so I stuck with a plain old for-loop.
def create_multi_index(data, col_to_split, sep="."):
    """
    Loop through the original ID column and split each ID into
    multiple parts (multi-IDs) on the defined separator.
    By default, the function assumes the unique ID is formatted like a decimal number.
    If an original ID is formatted like an integer rather than a decimal,
    the function assumes the latter half of the ID to be "000".
    The new multi-IDs are appended into new lists.
    """
    # Take a copy of the dataframe to modify
    new_df = data.copy()
    # Generate two new lists to store the new multi-index
    Family_ID = []
    Family_Index = []
    # Iterate through the IDs, split them, and allocate the pieces to the appropriate list
    for i in new_df[col_to_split]:
        parts = i.split(sep)
        Family_ID.append(parts[0])
        if len(parts) == 1:
            Family_Index.append("000")
        else:
            Family_Index.append(parts[1])
    # Return the dataframe including the new multi-index columns
    return new_df.assign(Family_ID=Family_ID,
                         Family_Index=Family_Index)
This returns a duplicate dataframe with a new column for each part of the multi-id.
When joining dataframes with this form of ID, as long as both dataframes have the multi index in the same format, these columns can be used with pd.merge as follows:
pd.merge(df1, df2, how="inner", on=["Family_ID", "Family_Index"])
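For example, applying the function to two hypothetical dataframes before the merge might look like this (the dataframe and column names are placeholders):
# df1 and df2 stand in for your actual dataframes
df1 = create_multi_index(df1, "StudyID")
df2 = create_multi_index(df2, "StudyID")
merged = pd.merge(df1, df2, how="inner", on=["Family_ID", "Family_Index"])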

Simple query in neo4j - no record if a node has degree 0

To better understand how results are formatted in Neo4j:
A simple query where node ENSG00000180447 has no neighbor:
MATCH (d:Target)-[r:Interaction]-(t:Target)
where d.uid = 'ENSG00000180447'
with d, count(t) as degree
Return d, degree
(no changes, no records)
Instead
MATCH (d:Target)
where d.uid = 'ENSG00000180447'
Return d // returns the node
MATCH (d:Target)-[r:Interaction]-(t:Target)
where d.uid = 'ENSG00000180447'
with count(t) as degree
Return degree // returns 0
I would like to get the returned node and its degree in the same query.
What is wrong with the first query?
MATCH looks for an exact pattern match, and does not find one for the node with uid = 'ENSG00000180447'. There are two ways to handle this:
1) Use OPTIONAL MATCH:
MATCH (d:Target)
WHERE d.uid = 'ENSG00000180447'
OPTIONAL MATCH (d)-[r:Interaction]-(t:Target)
RETURN d, COUNT(t) AS degree
2) Use zero length paths:
MATCH (d:Target)-[r:Interaction*0..1]-(t:Target)
where d.uid = 'ENSG00000180447'
with d, count(t) as degree
Return d, degree-1
The problem, as stdob-- points out, is that when you perform a MATCH, it only returns rows for which the match is true. So you're asking for a match between that one specific node to a :Target node using a relationship of type :Interaction. Since no such pattern exists, no rows are returned.
The SIZE() function will probably be your best bet for a concise query; you can use it to find the occurrences of a pattern. In this case, we can use it to find the number of relationships of that type to a :Target node:
MATCH (d:Target)
WHERE d.uid = 'ENSG00000180447'
RETURN d, SIZE( (d)-[:Interaction]-(:Target) ) AS degree
EDIT - explaining why your query that returns both the node and the count produces no rows.
COUNT() is an aggregation that only gets its context from the non-aggregation columns (the grouping key). On its own, COUNT() has no other context and no grouping keys, and it can handle null values:
COUNT(null) = 0.
When we perform MATCHes, we build up rows. Where a MATCH doesn't find any matches, no rows are returned:
MATCH (ele:PinkElephant)
RETURN ele
// (no changes, no records)
When we try to pair this with aggregation, we will still get no rows, because the aggregation will run for every possible row, but there are no rows to execute on:
MATCH (person:Person)-[:Halucinates]->(ele:PinkElephant)
RETURN ele, COUNT(person)
// (no changes, no records)
In this case, you're asking for rows of :PinkElephant nodes, and for each of those nodes, a count of the people who hallucinate that pink elephant.
But there are no :PinkElephant nodes. There are no rows present for COUNT() to operate on. We can't show any rows because there are no nodes present to populate them.
Even if there WERE :PinkElephant nodes in the graph, if there were no relationships to :Person nodes, the result would be the same. The match would find nothing, because the pattern you asked for (pink elephants that are hallucinated by people) doesn't exist. There are no :PinkElephant nodes that are hallucinated by a :Person, so there are no nodes to populate the ele column, so there are no rows; and if there are no rows, your COUNT() has nothing to execute on and no place to put a return value.
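For completeness, here is a sketch of the OPTIONAL MATCH fix applied to this example: anchoring on the node first keeps one row per :PinkElephant, so COUNT() has rows to operate on and can return 0 when no relationships exist.
MATCH (ele:PinkElephant)
OPTIONAL MATCH (person:Person)-[:Halucinates]->(ele)
RETURN ele, COUNT(person) AS hallucination_count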

Converting date to a varchar using "like" in pl/sql

I need to go through a few million rows of data, searching for a year that is sent as a parameter to a method. The year comes as a varchar.
This is the query I'm working with
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND to_char(cre_date, 'YYYY') = year_;
cre_date is of type DATE and year_ is of type VARCHAR.
When performing this query, it takes around 25 minutes to complete.
Does anyone know of a different approach that would speed up the execution?
Please help.
This didn't work out.
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND cre_date LIKE '%2013';
The reason might be that 'cre_date' and '%2013' are of different types.
If you have an index on (mch_code, contract, cre_date) columns, you can improve performance by doing something like:
select x, y
from a
where mch_code = 'KN'
and contract = '15KTN'
and cre_date >= to_date('01/01/'||year_, 'dd/mm/yyyy')
and cre_date < add_months(to_date('01/01/'||year_, 'dd/mm/yyyy'), 12);
Even better would be to declare the start of the year as a DATE variable prior to running the SQL, e.g.:
v_year_dt := to_date('01/01/'||year_, 'dd/mm/yyyy');
which would make the query:
select x, y
from a
where mch_code = 'KN'
and contract = '15KTN'
and cre_date >= v_year_dt
and cre_date < add_months(v_year_dt, 12);
If you don't have an index on those three columns, you could create a function-based index on (mch_code, contract, to_char(cre_date, 'yyyy')), which should help speed up your query, depending on the percentage of rows you're expecting to select. It may help even more if you added the x and y columns into the index, so that no table access was required at all.
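For reference, the function-based index mentioned above could be created along these lines (the index name is just a placeholder):
CREATE INDEX a_mch_contract_year_idx
    ON a (mch_code, contract, TO_CHAR(cre_date, 'YYYY'));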
Alternatively, you could think about partitioning the table on cre_date, monthly or yearly.
The reason your query is slow is that you're applying a function to a column on every row in your table. Let's try it another way:
SELECT X,Y
FROM A
WHERE mch_code = 'KN' AND
contract = '15KTN' AND
CRE_DATE >= TO_DATE('01/01/' || year_, 'DD/MM/YYYY') AND
CRE_DATE < TO_DATE('01/01/' || year_, 'DD/MM/YYYY') + INTERVAL '1' YEAR;
This eliminates the need to apply a function against every row in the table, and should allow any indexes on CRE_DATE to be used.
Best of luck.
You can try the EXTRACT function:
SELECT X,Y
FROM A
WHERE mch_code = 'KN'
AND contract = '15KTN'
AND EXTRACT(YEAR FROM cre_date) = year_;

How to get count of all columns of a table, which are not null using PL/SQL?

Is there any PL/SQL function that allows you to pass a table name and returns the count of all columns that don't include null values?
I have a huge number of columns and don't want to query each and every column. I'm new to PL/SQL and highly appreciate your help.
As suggested in a comment to the question, one approach to solve this is the following query:
SELECT t.table_name,
       t.num_rows,
       c.column_name,
       c.num_nulls,
       t.num_rows - c.num_nulls AS num_not_nulls,
       c.data_type,
       c.last_analyzed
FROM all_tab_cols c
JOIN sys.all_all_tables t
  ON t.owner = c.owner
 AND t.table_name = c.table_name
WHERE c.table_name LIKE 'EXT%'
  AND c.nullable = 'Y'
ORDER BY t.table_name,
         c.column_name
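Note that num_rows and num_nulls come from optimizer statistics, so they are only as current as the last statistics gathering (see last_analyzed). If you need exact counts, a small PL/SQL sketch using dynamic SQL can be adapted; the table name below is a placeholder and identifiers are assumed to be trusted:
DECLARE
    v_not_null_count NUMBER;
BEGIN
    FOR col IN (SELECT column_name
                  FROM user_tab_columns
                 WHERE table_name = 'MY_TABLE') LOOP
        -- COUNT(column) counts only the rows where that column is not null
        EXECUTE IMMEDIATE
            'SELECT COUNT(' || col.column_name || ') FROM MY_TABLE'
            INTO v_not_null_count;
        DBMS_OUTPUT.PUT_LINE(col.column_name || ': ' || v_not_null_count);
    END LOOP;
END;
/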
