I am using Kusto. I have a function and I want to use it for each row.
I can call the function with | invoke <FUNCTION_NAME>, but how can I apply it to all rows?
Use the "extend" operator
For example:
let f = view(a:int){
a*3
};
let t = datatable(a:int) [ 1,2,3];
t
| extend b = f(a)
a  b
1  3
2  6
3  9
Related
Given a dynamic field, say milestones, with a value like {"ta": 1655859586546, "tb": 1655859586646}:
How do I produce a table with columns such as "ta", "tb", etc., whose single row contains unixtime_milliseconds_todatetime(tolong(taValue)), unixtime_milliseconds_todatetime(tolong(tbValue)), and so on?
I figured I would need to write a function that I can call, so I created this:
let f = view(a:string ){
unixtime_milliseconds_todatetime(tolong(a))
};
I can use this function with a normal column as project f(columnName).
However, in this case it's a dynamic field, and the number of items in the list is large, so I do not want to enter the fields manually. This is what I have so far:
log_table
| take 1
| evaluate bag_unpack(milestones, "m_") // This gives me fields as columns
// | project-keep m_* // This would work if I just wanted the values; however, I want f(columnValue)
| project-keep f(m_*) // This of course doesn't work, but explains the idea.
Based on the mv-apply operator: expand the milestones bag to one row per key, convert each value, re-pack the bag, and then unpack it into columns.
// Generate data sample. Not part of the solution.
let log_table = materialize(
    range record_id from 1 to 10 step 1
    | mv-apply range(1, 1 + rand(5), 1) on (
        summarize milestones = make_bag(
            pack_dictionary(
                strcat("t", make_string(to_utf8("a")[0] + toint(rand(26)))),
                1600000000000 + rand(60000000000)))));
// Solution Starts here.
log_table
| mv-apply kv = milestones on
(
extend k = tostring(bag_keys(kv)[0])
| extend v = unixtime_milliseconds_todatetime(tolong(kv[k]))
| summarize milestones = make_bag(pack_dictionary(k, v))
)
| evaluate bag_unpack(milestones)
record_id  ta  tb  tc  ...  tz
(output truncated: one row per record_id, with one column per milestone key in the sample; each populated cell holds the converted datetime, e.g. 2021-07-06T20:24:47.767Z)
I am trying to run a loop in which I count the total number of observations under the variable _merge in each file, and then count particular outcomes of _merge, such as _merge==1 and so on. I then want to calculate percentages by dividing each of those counts by the total count of _merge.
Below is my code:
/*define local list*/
local ward_names B C D E FN FS GS HE
/*loop for each dbase*/
foreach file of local ward_names {
    use "../../../cleaning/sra/output/`file'_ward_CTS_Merged.dta", clear
    count if _merge
    local ward_count=r(N)
    count if _merge==1
    local count_master=r(N)
    count if _merge==2
    local count_using=r(N)
    count if _merge==3
    local count_match=r(N)
    clear
    set obs 1
    g ward_count='ward_count'
    g count_master=`count_master'
    g count_using=`count_using'
    g count_match=`count_match'
    g ward= "`file'"
    save "../temp/`file'_collapsed_diagnostics.dta", replace
    clear
}
The code was running fine until I tried to add the total count for each ward file:
g ward_count='ward_count'
'ward_count' invalid name
Is this a syntax error or something more severe?
You need to open the reference to a local macro with a backtick ` rather than an apostrophe ':
generate ward_count = `ward_count'
EDIT:
As per @NickCox's recommendation, you can improve your code by using the tabulate command with its matcell() option to get all the counts at once:
tabulate _merge, matcell(A)
                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        master only (1) |          1       16.67       16.67
            matched (3) |          5       83.33      100.00
------------------------+-----------------------------------
                  Total |          6      100.00
matrix list A
A[2,1]
       c1
r1      1
r2      5
So you could then do the following:
generate count_master = A[1,1]
generate count_match = A[2,1]
It looks like things are going wrong on line 9 for me. Here I wish to push a new copy of the TagsTable into a dictionary. I'm aware that once a namedtuple field is recorded, it cannot be changed. However, the results baffle me, because the values do appear to change: when this code exits, mp3_tags[key].date is set to the last date, "1999_03_21", for all three dictionary keys.
So, two questions:
Is there a way to get a new TagsTable pushed into the dictionary?
Why doesn't the code fail and disallow the second (and even third) date from being written to the TagsTable.date field (since it seems to be a reference to the same namedtuple)? I thought you could not write a second value?
1 from collections import namedtuple
2 TagsTable = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])
3 mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
4 dates = ['1999_01_07', '1999_02_14', '1999_03_21']
5
6 mp3_tags = {}
7
8 for mp3file in mp3files:
9 mp3_tags[mp3file] = TagsTable
10
11 for mp3file,date_string in zip(mp3files,dates):
12 mp3_tags[mp3file].date = date_string
13
14 for mp3file in mp3files:
15 print( mp3_tags[mp3file].date )
It looks like this is the fix I was looking for:
from collections import namedtuple
mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
dates = ['1999_01_07', '1999_02_14', '1999_03_21']
mp3_tags = {}
for mp3file in mp3files:
mp3_tags[mp3file] = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])
for mp3file,date_string in zip(mp3files,dates):
mp3_tags[mp3file].date = date_string
for mp3file in mp3files:
print( mp3_tags[mp3file].date )
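For what it's worth, the reason the original version printed "1999_03_21" three times is that line 9 stored the TagsTable class itself (not an instance) under every key, so mp3_tags[mp3file].date = date_string set a class attribute shared by all three entries; the fix above works because a brand-new class is created for each key. A more conventional sketch (my own illustration, not from the original post) stores actual namedtuple instances and uses _replace to produce an updated copy:
from collections import namedtuple

TagsTable = namedtuple('TagsTable', ['title', 'date', 'subtitle', 'artist',
                                     'summary', 'length', 'duration', 'pub_date'])

mp3files = ['42-001.mp3', '42-002.mp3', '42-003.mp3']
dates = ['1999_01_07', '1999_02_14', '1999_03_21']

# One distinct instance per file; fields not known yet are filled with None.
mp3_tags = {f: TagsTable(*[None] * len(TagsTable._fields)) for f in mp3files}

# namedtuple instances are immutable, so _replace returns a new instance
# with the updated field instead of mutating the old one.
for mp3file, date_string in zip(mp3files, dates):
    mp3_tags[mp3file] = mp3_tags[mp3file]._replace(date=date_string)

for mp3file in mp3files:
    print(mp3_tags[mp3file].date)   # 1999_01_07, 1999_02_14, 1999_03_21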
I am banging my head against the wall trying to drop duplicates in a time series, based on the value of a datetime index.
My function is the following:
def csv_import_merge_T(f):
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True, names=['datetime','temp','rh'], header=0) for fp in files]
    dfT = pd.concat(dfsT)
    #print dfT.head(); print dfT.index; print dfT.dtypes
    dfT.drop_duplicates(subset=index, inplace=True)
    dfT.resample('H').bfill()
    return dfT
which is called by:
inputcsvT = ['./input_csv/A08_KI_T*.csv']
for csvnameT in inputcsvT:
    files = glob.glob(csvnameT)
    print ('___'); print (files)
    t = csv_import_merge_T(files)
print csvT
I receive the error
NameError: global name 'index' is not defined
What is wrong?
UPDATE:
The issue appears to arise when the CSV input files (which are to be concatenated) overlap.
inputcsvT = ['./input_csv/A08_KI_T*.csv'] gets files
A08_KI_T5
28/05/2015 17:00,22.973,24.021
...
08/10/2015 13:30,24.368,45.974
A08_KI_T6
08/10/2015 14:00,24.779,41.526
...
10/02/2016 17:00,22.326,41.83
and it runs correctly, whereas:
inputcsvT = ['./input_csv/A08_LR_T*.csv'] gathers
A08_LR_T5
28/05/2015 17:00,22.493,25.62
...
08/10/2015 13:30,24.296,44.596
A08_LR_T6
28/05/2015 17:00,22.493,25.62
...
10/02/2016 17:15,21.991,38.45
which leads to an error.
IIUC you can call reset_index and then drop_duplicates and then set_index again:
In [304]:
df = pd.DataFrame(data=np.random.randn(5,3), index=list('aabcd'))
df
Out[304]:
          0         1         2
a  0.918546 -0.621496 -0.210479
a -1.154838 -2.282168 -0.060182
b  2.512519 -0.771701 -0.328421
c -0.583990 -0.460282  1.294791
d -1.018002  0.826218  0.110252
In [308]:
df.reset_index().drop_duplicates('index').set_index('index')
Out[308]:
               0         1         2
index
a       0.918546 -0.621496 -0.210479
b       2.512519 -0.771701 -0.328421
c      -0.583990 -0.460282  1.294791
d      -1.018002  0.826218  0.110252
EDIT
Actually, a simpler method is to call duplicated on the index and invert it:
In [309]:
df[~df.index.duplicated()]
Out[309]:
          0         1         2
a  0.918546 -0.621496 -0.210479
b  2.512519 -0.771701 -0.328421
c -0.583990 -0.460282  1.294791
d -1.018002  0.826218  0.110252
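Applied to the function from the question, a minimal sketch might look like this (my own illustration, assuming, as in the update, that the duplicates are repeated datetime index entries; the file list is passed in as a parameter here, and the resample result is assigned back since resample does not work in place):
import glob
import pandas as pd

def csv_import_merge_T(files):
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True,
                        names=['datetime', 'temp', 'rh'], header=0)
            for fp in files]
    dfT = pd.concat(dfsT)
    # Keep only the first row for each duplicated datetime index value.
    dfT = dfT[~dfT.index.duplicated()]
    # resample returns a new frame, so assign the result back.
    dfT = dfT.resample('H').bfill()
    return dfT

files = glob.glob('./input_csv/A08_LR_T*.csv')
t = csv_import_merge_T(files)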
I'm trying to compare different ways to aggregate my data.
This is my input data with 2 elements (page,visitor):
(PAG1,V1)
(PAG1,V1)
(PAG2,V1)
(PAG2,V2)
(PAG2,V1)
(PAG1,V1)
(PAG1,V2)
(PAG1,V1)
(PAG1,V2)
(PAG1,V1)
(PAG2,V2)
(PAG1,V3)
Using a SQL query in Spark SQL with this code:
import sqlContext.implicits._
case class Log(page: String, visitor: String)
val logs = data.map(p => Log(p._1,p._2)).toDF()
logs.registerTempTable("logs")
val sqlResult= sqlContext.sql(
"""select page
,count(distinct visitor) as visitor
from logs
group by page
""")
val result = sqlResult.map(x=>(x(0).toString,x(1).toString))
result.foreach(println)
I get this output:
(PAG1,3) // PAG1 has been visited by 3 different visitors
(PAG2,2) // PAG2 has been visited by 2 different visitors
Now I would like to get the same result using DataFrames and their API, but I can't get the same output:
import sqlContext.implicits._
case class Log(page: String, visitor: String)
val logs = data.map(p => Log(p._1,p._2)).toDF()
val result = logs.select("page","visitor").groupBy("page").count().distinct
result.foreach(println)
In fact, that's what I get as output:
[PAG1,8] // just the simple page count for every page
[PAG2,4]
What you need is the DataFrame aggregation function countDistinct:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
case class Log(page: String, visitor: String)
val logs = data.map(p => Log(p._1,p._2))
  .toDF()
val result = logs.select("page","visitor")
  .groupBy('page)
  .agg('page, countDistinct('visitor))
result.foreach(println)
You can use dataframe's groupBy command twice to do so. Here, df1 is your original input.
val df2 = df1.groupBy($"page",$"visitor").agg(count($"visitor").as("count"))
This command would produce the following result:
page visitor count
---- ------ ----
PAG2 V2 2
PAG1 V3 1
PAG1 V1 5
PAG1 V2 2
PAG2 V1 2
Then use the groupBy command again to get the final result.
df2.groupBy($"page").agg(count($"visitor").as("count"))
Final output:
page count
---- ----
PAG1 3
PAG2 2
I think in the newer versions of Spark it is easier. The following is tested with 2.4.0.
1. First, create an array of sample data.
val myArr = Array(
("PAG1","V1"),
("PAG1","V1"),
("PAG2","V1"),
("PAG2","V2"),
("PAG2","V1"),
("PAG1","V1"),
("PAG1","V2"),
("PAG1","V1"),
("PAG1","V2"),
("PAG1","V1"),
("PAG2","V2"),
("PAG1","V3")
)
2. Create a DataFrame
val logs = spark.createDataFrame(myArr)
  .withColumnRenamed("_1","page")
  .withColumnRenamed("_2","visitor")
3. Now aggregate with the countDistinct Spark SQL function
import org.apache.spark.sql.{functions => F}
logs.groupBy("page").agg(
F.countDistinct("visitor").as("visitor"))
.show()
4. Expected result:
+----+-------+
|page|visitor|
+----+-------+
|PAG1| 3|
|PAG2| 2|
+----+-------+
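For completeness, here is roughly the same countDistinct aggregation from Python, as a sketch assuming an active SparkSession named spark (this is not part of the original answers):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same sample data as above: (page, visitor) pairs.
data = [("PAG1", "V1"), ("PAG1", "V1"), ("PAG2", "V1"), ("PAG2", "V2"),
        ("PAG2", "V1"), ("PAG1", "V1"), ("PAG1", "V2"), ("PAG1", "V1"),
        ("PAG1", "V2"), ("PAG1", "V1"), ("PAG2", "V2"), ("PAG1", "V3")]
logs = spark.createDataFrame(data, ["page", "visitor"])

# countDistinct counts the unique visitors per page.
logs.groupBy("page").agg(F.countDistinct("visitor").alias("visitor")).show()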
Use this if you want to display the distinct values of a column
display(sparkDF.select('columnName').distinct())