Split column string with delimiters into separate columns in azure kusto - azure-data-explorer

I have a column 'Apples' in an Azure table that contains this string: "Colour:red,Size:small".
Current situation:
|-----------------------|
| Apples |
|-----------------------|
| Colour:red,Size:small |
|-----------------------|
Desired Situation:
|----------------|
| Colour | Size |
|----------------|
| Red | small |
|----------------|
Please help

I'll answer the title as I noticed many people searched for a solution.
The key here is the mv-expand operator, which expands multi-value dynamic arrays or property bags into multiple records:
datatable (str:string)["aaa,bbb,ccc", "ddd,eee,fff"]
| project splitted=split(str, ',')
| mv-expand col1=splitted[0], col2=splitted[1], col3=splitted[2]
| project-away splitted
The project-away operator lets us choose which input columns to exclude from the output.
Result:
+--------------------+
| col1 | col2 | col3 |
+--------------------+
| aaa | bbb | ccc |
| ddd | eee | fff |
+--------------------+

This query gave me the desired results:
| parse Apples with "Colour:" AppColour ", Size:" AppSize
Remember to include every delimiter that precedes each value you want to extract, e.g. ", Size:". Mind the space after the comma: the delimiter must match the data exactly, so leave the space out if your strings have none, as in the question's example.
This helped me, and I then adapted the query to my needs:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/parseoperator
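For readers who want to check the expected columns outside Kusto, the same key/value split can be sketched in Python. This is only an illustration of the logic, not part of either answer; `parse_pairs` and its parameters are hypothetical names chosen for this example.

```python
def parse_pairs(text, pair_sep=",", kv_sep=":"):
    """Split a string such as 'Colour:red,Size:small' into a dict of columns."""
    result = {}
    for pair in text.split(pair_sep):
        if not pair.strip():
            continue  # skip empty fragments, e.g. after a trailing comma
        # partition splits on the first kv_sep only, so values may contain ':'
        key, _, value = pair.partition(kv_sep)
        result[key.strip()] = value.strip()
    return result


row = parse_pairs("Colour:red,Size:small")
print(row)
```

Because keys and values are stripped, the sketch tolerates an optional space after the comma, unlike an exact string delimiter.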

Related

How to replace empty values with values from an adjacent column that needs to be separated?

Hi everyone, and sorry for my English. I need to extract the domain from the email addresses in a table. If an address ends in a country-code domain, that code must be copied into another, incomplete column that holds the country of each congress participant. The database is relatively large. An example is below.
| email | country |
| -------- | -------------- |
| naco#gmail.com | CO |
| monic45814#gmail.com | AR |
| jsalazar#chapingo.mx | |
| andresramirez#urosario.edu.co | |
| jeimy861491#hotmail.com | CL |
| jytvc#hotmail.com | |
Outcome should be
| email | country |
| -------- | -------------- |
| naco#gmail.com | CO |
| monic45814#gmail.com | AR |
| jsalazar#chapingo.mx | MX |
| andresramirez#urosario.edu.co | CO |
| jeimy861491#hotmail.com | CL |
| jytvc#hotmail.com | *NA* |
Thank you so much.
You can use str_extract to get the string after the last "." and if_else to skip rows that already have a country and rows whose e-mail does not end with a country code:
library(dplyr)    # mutate, if_else, %>%
library(stringr)  # str_extract
df %>%
  mutate(country = if_else(is.na(country) & str_extract(email, "[^.]+$") != "com", toupper(str_extract(email, "[^.]+$")), country))
A small but important PS: I would always recommend providing fake data when your example involves personal data such as e-mail addresses.
Here is a solution in base R.
Suppose:
df<-data.frame(email,country)
Then:
tld <- sub(".*\\.", "", df$email)  # text after the last "."
df$country <- ifelse(is.na(df$country) & tld != "com", toupper(tld), as.character(df$country))
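The rule that both answers apply (keep an existing country, otherwise use the upper-cased last domain label unless it is "com") can be sketched in plain Python to make the logic explicit. This is an illustrative sketch, not a replacement for the R code; `fill_country` is a hypothetical helper, and like the answers above it assumes that any ending other than "com" is a country code.

```python
def fill_country(email, country):
    """Return the existing country, or the upper-cased TLD when it is not 'com'."""
    if country:  # keep a value that is already present
        return country
    tld = email.rsplit(".", 1)[-1]  # text after the last "."
    return None if tld.lower() == "com" else tld.upper()


rows = [
    ("naco#gmail.com", "CO"),
    ("jsalazar#chapingo.mx", None),
    ("andresramirez#urosario.edu.co", None),
    ("jytvc#hotmail.com", None),
]
filled = [(email, fill_country(email, country)) for email, country in rows]
print(filled)
```

Generic endings such as "org" or "net" would also be treated as country codes here, so a real data set would probably need a list of valid codes.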

SQLite Versioning. Is it possible to use EXCEPT to show differences between rows where only one column changes?

I'm quite new to SQLite and I'm trying to use an EXCEPT statement in order to compare two tables with very similar data. The data comes from a CSV file I download daily, and within the file new rows are added and deleted, and old rows can have one or more columns change each day. I'm trying to find a way to select rows that have had a column's data change, when I am unable to predict which column's data will change.
Say for example I have:
TABLE contracts:
|ID|Description|Name|Contract Type|
|1 |Plumbing |Bob |Paper |
|2 |Cooking |Ryan|Paper |
|3 |Driving |Eric|Paper |
|4 |Dancing |Emma|Paper |
and:
TABLE updated_contracts:
|ID|Description|Name|Contract Type|
|1 |Hiking |Bob |Paper |
|2 |Cooking |Ryan|Paper |
|3 |Driving |Eric|Paper |
|4 |Dancing |Emma|Digital |
I'd like it to return:
|1 |Hiking |Bob |Paper |
|4 |Dancing |Emma|Digital |
because contract 1 has changed the description and contract 4 has changed the contract type.
Is it possible to do this in SQLite?
One way to do it is with a LEFT join of updated_contracts to contracts where the matching rows are filtered out:
select uc.*
from updated_contracts uc left join contracts c
using(id, Description, Name, `Contract Type`)
where c.id is null
EXCEPT can also be used like this:
select * from updated_contracts
except
select * from contracts
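This works only if both tables have the same number of columns. Its advantage is that EXCEPT treats NULL values as equal when it compares rows, so two rows that are identical apart from NULLs in the same columns count as unchanged, whereas the join above would not match them.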
See the demo.
Results:
| ID | Description | Name | Contract Type |
| --- | ----------- | ---- | ------------- |
| 1 | Hiking | Bob | Paper |
| 4 | Dancing | Emma | Digital |
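To try the EXCEPT query without a separate database, here is a small sketch using Python's built-in sqlite3 module and an in-memory database. The table contents are copied from the question; the column types are an assumption, since the question does not state them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contracts (ID INTEGER, Description TEXT, Name TEXT, "Contract Type" TEXT);
    CREATE TABLE updated_contracts (ID INTEGER, Description TEXT, Name TEXT, "Contract Type" TEXT);
    INSERT INTO contracts VALUES
        (1, 'Plumbing', 'Bob', 'Paper'), (2, 'Cooking', 'Ryan', 'Paper'),
        (3, 'Driving', 'Eric', 'Paper'), (4, 'Dancing', 'Emma', 'Paper');
    INSERT INTO updated_contracts VALUES
        (1, 'Hiking', 'Bob', 'Paper'), (2, 'Cooking', 'Ryan', 'Paper'),
        (3, 'Driving', 'Eric', 'Paper'), (4, 'Dancing', 'Emma', 'Digital');
""")

# Rows of updated_contracts that have no identical row in contracts
changed = conn.execute(
    "SELECT * FROM updated_contracts EXCEPT SELECT * FROM contracts ORDER BY ID"
).fetchall()
print(changed)
```

Note that this also returns rows that are new in updated_contracts, while rows deleted from it do not appear; swapping the two SELECTs lists those instead.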

Cumulative count of occurrences per value in array in Kusto

I'm looking to get the count of query param usage from the query string from page views stored in app insights using KQL. My query currently looks like:
pageViews
| project parsed=parseurl(url)
| project keys=bag_keys(parsed["Query Parameters"])
and the results look like [screenshot not preserved], with each row looking like [screenshot not preserved].
I'm looking to get the count of each value in the list when it is contained in the URL, in order to answer the question "How many times does page appear in the query string?". So the results might look like:
Page | From | ...
1000 | 67 | ...
Thanks in advance
You could try something along the following lines:
datatable(url:string)
[
"https://a.b.c/d?p1=hello&p2=world",
"https://a.b.c/d?p2=world&p3=foo&p4=bar"
]
| project parsed = parseurl(url)
| project keys = bag_keys(parsed["Query Parameters"])
| mv-expand key = ['keys'] to typeof(string)
| summarize count() by key
which returns:
| key | count_ |
|-----|--------|
| p1 | 1 |
| p2 | 2 |
| p3 | 1 |
| p4 | 1 |
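The same counting logic can be cross-checked outside Kusto with a short Python sketch that uses only the standard library. It is an illustration of the approach, not a substitute for the KQL query; the URLs are the ones from the answer's datatable.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

urls = [
    "https://a.b.c/d?p1=hello&p2=world",
    "https://a.b.c/d?p2=world&p3=foo&p4=bar",
]

key_counts = Counter()
for url in urls:
    # A set counts each key once per URL, like the keys of a property bag
    keys = {key for key, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)}
    key_counts.update(keys)

print(dict(key_counts))
```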

Get weight of words by occurrence

Maybe this belongs on math.stackexchange, but I am afraid I would get a formula as an answer that I would not understand.
I have products in our database, and I have products from different suppliers in another table.
What I want is to pair these suppliers' products with our products where possible, or at least to get a list of candidates where the match is strong.
I iterated through all the suppliers' products, split each product name on spaces, and stored every word in a table together with its number of occurrences.
The table seems like this.
+--------+-------------+---------------+-------+
| id | word | originalWord | count |
+--------+-------------+---------------+-------+
| 220950 | Tracer | Tracer | 493 |
| 220951 | Destroyer | Destroyer | 3 |
| 220952 | Avago5050 | Avago5050 | 4 |
| 220953 | mouse | mouse | 2535 |
| 220954 | TRAMYS44916 | /TRAMYS44916/ | 2 |
| 220955 | GameZone | GameZone | 16 |
| 220956 | Enduro | Enduro | 3 |
| 220957 | AVAGO | AVAGO | 10 |
| 220958 | 5050 | 5050 | 4 |
| 220959 | optical | optical | 2370 |
| 220960 | USB | USB | 6160 |
+--------+-------------+---------------+-------+
and so on. In another table I store the product id for each word.
So what I want is to determine the weight of a word from its number of occurrences.
As you can see, the word TRAMYS44916 occurs only twice, so it is almost certainly a part number and should be the heaviest word. Its weight should be 1.
Let's say the most frequent word is USB, with 6160 occurrences, so its weight should be something like 0.01, I think.
What is the best way to get the weights of all the words?
There are other tables for other suppliers, so the distribution always changes.
This reminds me of Naive Bayes text classification. To determine which product a name belongs to, you can calculate the tf-idf of all the words.
Then, if you want to match another product name, you can split it into words again and select the product id with the highest term value. You should probably set a threshold for this, because in some cases the match will not be that clear.
tf-idf = (number of occurrences of the word in the product name / number of words in the product name) * log(number of products / number of products that contain the word)
You can see how it is done in the example here (In your case the document will be the product full name): https://en.wikipedia.org/wiki/Tf–idf#Example_of_tf.E2.80.93idf
Example implementation in Java: https://guendouz.wordpress.com/2015/02/17/implementation-of-tf-idf-in-java/
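As a rough illustration of the formula above, here is a minimal Python sketch that treats each product name as a document. The three product names are made up for this example from words in the question's table, and `tf_idf` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf} dict per document (here: per product name)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Number of documents that contain each word
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical product names, for illustration only
products = [
    "Tracer Destroyer Avago5050 mouse USB",
    "Tracer GameZone Enduro AVAGO 5050 optical mouse USB",
    "Tracer optical mouse USB TRAMYS44916",
]
weights = tf_idf(products)
print(weights[2])
```

With this definition, a word that appears in every product name (such as "usb" here) gets weight 0, while a word unique to one name gets the highest weight within that name, which matches the intuition about part numbers in the question.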

Combine DataFrame rows into a new column

I am wondering if there is simple way to achieve this in Julia besides iterating over the rows in a for-loop.
I have a table with two columns that looks like this:
| Name | Interest |
|------|----------|
| AJ | Football |
| CJ | Running |
| AJ | Running |
| CC | Baseball |
| CC | Football |
| KD | Cricket |
...
I'd like to create a table where each Name in the first column is matched with a combined Interest column, as follows:
| Name | Interest |
|------|----------------------|
| AJ | Football, Running |
| CJ | Running |
| CC | Baseball, Football |
| KD | Cricket |
...
How do I achieve this?
UPDATE: OK, so after trying a few things including print_joint and grpby, I realized that the easiest way to do this would be the by() function. I'm 99% there.
by(myTable, :Name, df->DataFrame(Interest = string(df[:Interest])))
This gives me my :Interest column as "UTF8String[\"Running\"]", and I can't figure out which method I should use instead of string() (or where to typecast) to get the desired ASCIIString output.
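The missing step appears to be joining the values of each group with a separator rather than converting the whole array to a string; Julia's Base `join` function does this kind of concatenation. The Python sketch below only illustrates the group-then-join logic on the question's data and is not a Julia solution.

```python
rows = [
    ("AJ", "Football"),
    ("CJ", "Running"),
    ("AJ", "Running"),
    ("CC", "Baseball"),
    ("CC", "Football"),
    ("KD", "Cricket"),
]

# Group interests by name, keeping the order of first appearance
grouped = {}
for name, interest in rows:
    grouped.setdefault(name, []).append(interest)

# Join each group's values into a single comma-separated string
combined = [(name, ", ".join(interests)) for name, interests in grouped.items()]
print(combined)
```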

Resources