Spark: How to translate count(distinct(value)) into the DataFrame API

I'm trying to compare different ways to aggregate my data.
This is my input data, with two fields per record (page, visitor):
(PAG1,V1)
(PAG1,V1)
(PAG2,V1)
(PAG2,V2)
(PAG2,V1)
(PAG1,V1)
(PAG1,V2)
(PAG1,V1)
(PAG1,V2)
(PAG1,V1)
(PAG2,V2)
(PAG1,V3)
Running a SQL query through Spark SQL with this code:
import sqlContext.implicits._
case class Log(page: String, visitor: String)
val logs = data.map(p => Log(p._1,p._2)).toDF()
logs.registerTempTable("logs")
val sqlResult= sqlContext.sql(
"""select page
,count(distinct visitor) as visitor
from logs
group by page
""")
val result = sqlResult.map(x=>(x(0).toString,x(1).toString))
result.foreach(println)
I get this output:
(PAG1,3) // PAG1 has been visited by 3 different visitors
(PAG2,2) // PAG2 has been visited by 2 different visitors
Now I would like to get the same result using DataFrames and their API, but I can't get the same output:
import sqlContext.implicits._
case class Log(page: String, visitor: String)
val logs = data.map(p => Log(p._1,p._2)).toDF()
val result = logs.select("page","visitor").groupBy("page").count().distinct
result.foreach(println)
Instead, this is the output I get:
[PAG1,8] // just the plain row count for each page
[PAG2,4]

What you need is the DataFrame aggregation function countDistinct:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

case class Log(page: String, visitor: String)

val logs = data.map(p => Log(p._1, p._2)).toDF()

val result = logs.select("page", "visitor")
  .groupBy('page)
  .agg('page, countDistinct('visitor))

result.foreach(println)
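Since Spark 1.4 the grouping column is retained in the result automatically, so passing 'page into agg a second time isn't needed. A minimal sketch of the shorter form, reusing the logs DataFrame defined above:

import org.apache.spark.sql.functions.countDistinct

val result = logs
  .groupBy("page")
  .agg(countDistinct("visitor").as("visitor"))

result.show()
// +----+-------+
// |page|visitor|
// +----+-------+
// |PAG1|      3|
// |PAG2|      2|
// +----+-------+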

You can also use the DataFrame's groupBy command twice to do this. Here, df1 is your original input.
val df2 = df1.groupBy($"page",$"visitor").agg(count($"visitor").as("count"))
This command would produce the following result:
page  visitor  count
----  -------  -----
PAG2  V2       2
PAG1  V3       1
PAG1  V1       5
PAG1  V2       2
PAG2  V1       2
Then use the groupBy command again to get the final result.
df2.groupBy($"page").agg(count($"visitor").as("count"))
Final output:
page  count
----  -----
PAG1  3
PAG2  2
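The two steps can also be chained into a single expression. A sketch assuming the same df1, with sqlContext.implicits._ and org.apache.spark.sql.functions._ in scope:

val visitorsPerPage = df1
  .groupBy($"page", $"visitor").count()                  // one row per distinct (page, visitor) pair
  .groupBy($"page").agg(count($"visitor").as("count"))   // count those pairs per page

visitorsPerPage.show()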

In newer versions of Spark this is easier; the following was tested with 2.4.0.
1. First, create an array with the sample data.
val myArr = Array(
("PAG1","V1"),
("PAG1","V1"),
("PAG2","V1"),
("PAG2","V2"),
("PAG2","V1"),
("PAG1","V1"),
("PAG1","V2"),
("PAG1","V1"),
("PAG1","V2"),
("PAG1","V1"),
("PAG2","V2"),
("PAG1","V3")
)
2. Create a DataFrame
val logs = spark.createDataFrame(myArr)
.withColumnRenamed("_1","page")
.withColumnRenamed("_2","visitor")
3. Now aggregate with the countDistinct Spark SQL function
import org.apache.spark.sql.{functions => F}

logs.groupBy("page")
  .agg(F.countDistinct("visitor").as("visitor"))
  .show()
4. Expected result:
+----+-------+
|page|visitor|
+----+-------+
|PAG1| 3|
|PAG2| 2|
+----+-------+

Use this if you want to display the distinct values of a column
display(sparkDF.select('columnName').distinct())
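display() is a notebook helper (for example in Databricks); outside a notebook the same result can be printed with show(). A minimal sketch, assuming a DataFrame named sparkDF with a column named columnName:

// Print the distinct values of a single column (show() truncates to 20 rows by default)
sparkDF.select("columnName").distinct().show()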

Related

Filter columns by dashboard multi-select parameter

I'm trying to render a time series, but I have too many columns to show by default. To remedy this I figured I would present the user with a multi-select of all the columns and downselect the columns I render to that list, but I can't for the life of me figure out or find an answer on how to do it.
I have data with, say, columns Time, X1, X2, ..., X120, and a multi-select parameter _columns populated from: that table | getschema | project ColumnName | where ColumnName != "Time". I want to project Time plus the contents of _columns.
I can only find how to filter rows based on some column's value against the multi-select. I feel like I'm missing something very simple.
Updated
There is also a simple solution for data of the following shape. This kind of data might be created by a make-series operator with multiple aggregation functions, e.g.:
make-series series_001 = count(), series_002 = min(x), series_003 = sum(x), series_004 = avg(x), series_005 = countif(type == 1), series_006 = countif(subtype == 123) on Timestamp from ago(7d) to now() step 1d
// Data sample generation, including series creation.
// Not part of the solution.
let p_series_num = 100;
let data = materialize
(
range i from 1 to p_series_num step 1
| project series_name = strcat("series_", substring(strcat("00", i), -3))
| mv-apply range(1, 7, 1) on (summarize make_list(rand()))
| evaluate pivot(series_name, take_any(list_))
| extend Timestamp = range(now() - 6d, now(), 1d)
| project-reorder Timestamp, * granny-asc
);
// Solution starts here
// We assume the creation of a parameter named _series, in the dashboard
// Uncomment the following line when executed outside the context of the dashboard
let ['_series'] = 'series_001';
data
| project Timestamp, column_ifexists(['_series'], real(null))
| render timechart
Timestamp:  ["2022-10-08T15:59:51.4634127Z","2022-10-09T15:59:51.4634127Z","2022-10-10T15:59:51.4634127Z","2022-10-11T15:59:51.4634127Z","2022-10-12T15:59:51.4634127Z","2022-10-13T15:59:51.4634127Z","2022-10-14T15:59:51.4634127Z"]
series_001: ["0.35039128090096505","0.79027849410190631","0.023939659111657484","0.14207071795033441","0.91242141133745414","0.33016368441829869","0.50674771943297525"]
Fiddle
This solution supports multi-selection.
The original data looks something like the wide table generated by the sample below:
// Data sample generation. Not part of the solution.
let p_start_time = startofday(ago(1d));
let p_interval = 5m;
let p_rows = 15;
let p_cols = 120;
let data = materialize
(
range Timestamp from p_start_time to p_start_time + p_rows * p_interval step p_interval
| mv-expand MetricID = range(1, p_cols) to typeof(int)
| extend MetricVal = rand(), MetricName = strcat("x", tostring(MetricID))
| evaluate pivot(MetricName, take_any(MetricVal), Timestamp)
| project-reorder Timestamp, * granny-asc
);
// Solution starts here
// We assume the creation of a parameter named _series, in the dashboard
// Uncomment the following line when executed outside the context of the dashboard
let ['_series'] = dynamic(['x1', 'x3', 'x7', 'x100', 'x120']);
data
| project Timestamp, pa = pack_all()
| project Timestamp, cols = bag_remove_keys(pa, set_difference(bag_keys(pa), _series))
| evaluate bag_unpack(cols)
| render timechart
Timestamp             x1                    x100                  x120                 x3                   x7
2022-10-20T00:00:00Z  0.40517703772298719   0.86952520047109094   0.67458442932790974  0.20662741864260561  0.19230161743580523
2022-10-20T00:05:00Z  0.098438188653858671  0.14095230636982198   0.10269711129443576  0.99361020447683746  0.093624077251808144
2022-10-20T00:10:00Z  0.3779198279036311    0.095647188329308852  0.38967218915903867  0.62601873006422182  0.18486009896828509
2022-10-20T00:15:00Z  0.141551736845493     0.64623737123678782   ... etc.
Fiddle
It might be very simple if our tabular data (post creation of the series) already has a series_name column. This kind of data might be created by a make-series operator with a by clause, e.g.:
make-series count() on Timestamp from ago(7d) to now() step 1d by series_name
In that case, all we need to do is add a filter on the series name, e.g.:
// Data sample generation, including series creation.
// Not part of the solution.
let p_series_num = 100;
let data = materialize
(
range i from 1 to 1000000 step 1
| extend Timestamp = ago(rand()*7d)
,series_name = strcat("series_", substring(strcat("00", tostring(toint(rand(p_series_num)))), -3))
| make-series count() on Timestamp from ago(7d) to now() step 1d by series_name
);
// Solution starts here
// We assume the creation of a parameter named _series, in the dashboard
// Uncomment the following line when executed outside the context of the dashboard
let ['_series'] = 'series_001';
data
| where series_name == _series
| render timechart
series_name: series_001
count_:      [1434,1439,1430,1428,1422,1372,1475]
Timestamp:   ["2022-10-07T15:54:57.3677580Z","2022-10-08T15:54:57.3677580Z","2022-10-09T15:54:57.3677580Z","2022-10-10T15:54:57.3677580Z","2022-10-11T15:54:57.3677580Z","2022-10-12T15:54:57.3677580Z","2022-10-13T15:54:57.3677580Z"]
Fiddle
Here is a solution that matches the data structure in your scenario.
* It is the same as my other solution, just modified; since the source data structure is different, I posted an additional answer for learning purposes.
The original data looks something like the wide table generated by the sample below.
The code is actually very simple, leveraging column_ifexists():
// Data sample generation. Not part of the solution.
let p_start_time = startofday(ago(1d));
let p_interval = 5m;
let p_rows = 15;
let p_cols = 120;
let data = materialize
(
range Timestamp from p_start_time to p_start_time + p_rows * p_interval step p_interval
| mv-expand MetricID = range(1, p_cols) to typeof(int)
| extend MetricVal = rand(), MetricName = strcat("x", tostring(MetricID))
| evaluate pivot(MetricName, take_any(MetricVal), Timestamp)
| project-reorder Timestamp, * granny-asc
);
// Solution starts here
// We assume the creation of a parameter named _MetricName, in the dashboard
// Uncomment the following line when executed outside the context of the dashboard
let ['_MetricName'] = "x42";
data
| project Timestamp, column_ifexists(['_MetricName'], real(null))
| render timechart
Timestamp             x42
2022-10-13T00:00:00Z  0.89472385054721115
2022-10-13T00:05:00Z  0.11275174098360444
2022-10-13T00:10:00Z  0.96233152692333268
2022-10-13T00:15:00Z  0.21751913633816042
2022-10-13T00:20:00Z  0.69591667527850931
2022-10-13T00:25:00Z  0.36802228024058203
2022-10-13T00:30:00Z  0.29060518653083045
2022-10-13T00:35:00Z  0.13362332423562559
2022-10-13T00:40:00Z  0.013920161307282448
2022-10-13T00:45:00Z  0.05909880950497
2022-10-13T00:50:00Z  0.146454957813311
2022-10-13T00:55:00Z  0.318823204227693
2022-10-13T01:00:00Z  0.020087435985750794
2022-10-13T01:05:00Z  0.31110660126024159
2022-10-13T01:10:00Z  0.75531136771424379
2022-10-13T01:15:00Z  0.99289833682620265
Fiddle

Kusto: Apply function on multiple column values during bag_unpack

Given a dynamic field, say milestones, with a value like {"ta": 1655859586546, "tb": 1655859586646},
how do I print a table with columns like "ta", "tb", etc., with the single row holding unixtime_milliseconds_todatetime(tolong(taValue)), unixtime_milliseconds_todatetime(tolong(tbValue)), etc.?
I figured that I'll need to write a function that I can call, so I created this:
let f = view(a:string ){
unixtime_milliseconds_todatetime(tolong(a))
};
I can use this function with a normal column, as in project f(columnName).
However, in this case it's a dynamic field, and the number of items in the list is large, so I do not want to enter the fields manually. This is what I have so far:
log_table
| take 1
| evaluate bag_unpack(milestones, "m_") // This gives me fields as columns
// | project-keep m_* // This would work if I just wanted the values; however, I want f(columnValue)
| project-keep f(m_*) // This of course doesn't work, but explains the idea.
Based on the mv-apply operator:
// Generate data sample. Not part of the solution.
let log_table = materialize(range record_id from 1 to 10 step 1 | mv-apply range(1, 1 + rand(5), 1) on (summarize milestones = make_bag(pack_dictionary(strcat("t", make_string(to_utf8("a")[0] + toint(rand(26)))), 1600000000000 + rand(60000000000)))));
// Solution Starts here.
log_table
| mv-apply kv = milestones on
(
extend k = tostring(bag_keys(kv)[0])
| extend v = unixtime_milliseconds_todatetime(tolong(kv[k]))
| summarize milestones = make_bag(pack_dictionary(k, v))
)
| evaluate bag_unpack(milestones)
The output is a wide table with one row per record_id (1 through 10) and one column per milestone key (ta, tb, tc, td, te, tf, tg, th, ti, tk, tl, tm, to, tp, tr, tt, tu, tw, tx, tz). Each populated cell holds the converted datetime, e.g. 2021-07-06T20:24:47.767Z; cells for keys a record does not contain are empty.
Fiddle

Why is my vector not accumulating the iteration?

I have data (returned by the Splunk search shown below) that I want to send to a dataframe.
So, I am iterating through it and sending it to a vector. But my vector never keeps the data.
Dv = Vector{Dict}()
for item in reader
push!(Dv,item)
end
length(Dv)
This is what I get: length(Dv) shows that the vector never kept the data.
And I am sure this is the right way to do it; the same approach works in Python.
EDIT
This is the code that I use to access the data that I want to send to a dataframe:
results=pyimport("splunklib.results")
kwargs_oneshot = (earliest_time= "2019-09-07T12:00:00.000-07:00",
latest_time= "2019-09-09T12:00:00.000-07:00",
count=0)
searchquery_oneshot = "search index=iis | lookup geo_BST_ONT longitude as sLongitude, latitude as sLatitude | stats count by featureId | geom geo_BST_ONT allFeatures=True | head 2"
oneshotsearch_results = service.jobs.oneshot(searchquery_oneshot; kwargs_oneshot...)
# Get the results and display them using the ResultsReader
reader = results.ResultsReader(oneshotsearch_results)
for item in reader
println(item)
end
ResultsReader is a streaming reader. This means you will "consume" its elements as you iterate over them. You can convert it to an array with collect. Do not print the items before you collect.
results=pyimport("splunklib.results")
kwargs_oneshot = (earliest_time= "2019-09-07T12:00:00.000-07:00",
latest_time= "2019-09-09T12:00:00.000-07:00",
count=0)
searchquery_oneshot = "search index=iis | lookup geo_BST_ONT longitude as sLongitude, latitude as sLatitude | stats count by featureId | geom geo_BST_ONT allFeatures=True | head 2"
oneshotsearch_results = service.jobs.oneshot(searchquery_oneshot; kwargs_oneshot...)
# Get the results
reader = results.ResultsReader(oneshotsearch_results)
# collect them into an array
Dv = collect(reader)
# Now you can iterate over them without changing the result
for item in Dv
println(item)
end
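From there, getting the collected records into a dataframe is straightforward. A minimal sketch, assuming DataFrames.jl is installed and that each element of Dv is a Dict of scalar field values:

using DataFrames

# Union of all field names seen across the collected records
colnames = unique(reduce(vcat, [collect(keys(d)) for d in Dv]))

# One column per field; records that are missing a field get `missing`
df = DataFrame([c => [get(d, c, missing) for d in Dv] for c in colnames])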

is it possible to get a new instance for namedtuple pushed into a dictionary before values are known?

It looks like things are going wrong on line 9 for me. Here I wish to push a new copy of the TagsTable into a dictionary. I'm aware that once a namedtuple field is recorded, it cannot be changed. However, the results baffle me, as it looks like the values do change: when this code exits, mp3_tags[ any of the three dictionary keys ].date is set to the last date, "1999_03_21".
So, two questions:
Is there a way to get a new TagsTable pushed into the dictionary?
Why doesn't the code fail and refuse to let the second (and even third) date be written to the TagsTable.date field, since all entries seem to be references to the same namedtuple? I thought you could not write a second value.
 1  from collections import namedtuple
 2  TagsTable = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])
 3  mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
 4  dates = ['1999_01_07', '1999_02_14', '1999_03_21']
 5
 6  mp3_tags = {}
 7
 8  for mp3file in mp3files:
 9      mp3_tags[mp3file] = TagsTable
10
11  for mp3file,date_string in zip(mp3files,dates):
12      mp3_tags[mp3file].date = date_string
13
14  for mp3file in mp3files:
15      print( mp3_tags[mp3file].date )
Looks like this is the fix I was looking for:
from collections import namedtuple

mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
dates = ['1999_01_07', '1999_02_14', '1999_03_21']

mp3_tags = {}

for mp3file in mp3files:
    mp3_tags[mp3file] = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])

for mp3file,date_string in zip(mp3files,dates):
    mp3_tags[mp3file].date = date_string

for mp3file in mp3files:
    print( mp3_tags[mp3file].date )
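For what it's worth, the reason the original code never fails is that line 9 stores the TagsTable class itself (not an instance) under every key, so line 12 simply rebinds a class attribute named date that all three entries share, which is also why they all end up with the last date. A sketch that stores actual immutable instances instead, assuming placeholder defaults for the fields that aren't known yet (the defaults argument requires Python 3.7+):

from collections import namedtuple

fields = ['title','date','subtitle','artist','summary','length','duration','pub_date']
TagsTable = namedtuple('TagsTable', fields, defaults=(None,) * len(fields))

mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
dates = ['1999_01_07', '1999_02_14', '1999_03_21']

# One fresh (empty) instance per file
mp3_tags = {mp3file: TagsTable() for mp3file in mp3files}

# namedtuples are immutable, so _replace returns a new instance with date filled in
for mp3file, date_string in zip(mp3files, dates):
    mp3_tags[mp3file] = mp3_tags[mp3file]._replace(date=date_string)

for mp3file in mp3files:
    print(mp3_tags[mp3file].date)   # 1999_01_07, 1999_02_14, 1999_03_21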

LumenWorks CSV reader does not read values starting with the # symbol

I have a code snippet in C# which reads a CSV file into a list. The problem is that it is not reading records which start with the # symbol.
For instance, if I have two records like the ones below, only the sderik record is taken and the other record is missing, as it starts with the # symbol. What could be the reason?
sderik | sample1 | sample 2| sample 3
#smissingrecord | sample1 | sample 2| sample 3
using (LumenWorks.Framework.IO.Csv.CsvReader csv = new LumenWorks.Framework.IO.Csv.CsvReader(reader, true, '|'))
{
    outDataTable = Common.CommonFunction.ConvertListToDataTable(csv.ToList());
    retValue = true;
}
The # is the default comment character. Override it by including a 6th parameter to the CsvReader call that is not '#'.
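A sketch of what that call could look like. The seven-argument CsvReader overload shown here (reader, hasHeaders, delimiter, quote, escape, comment, trimming options) and the default quote/escape characters are assumptions that may differ between LumenWorks versions, so check the overloads available in your copy of the library:

// Passing a character other than '#' (here '\0') as the comment character
// stops lines starting with '#' from being treated as comments and skipped.
using (var csv = new LumenWorks.Framework.IO.Csv.CsvReader(reader, true, '|', '"', '"', '\0',
        LumenWorks.Framework.IO.Csv.ValueTrimmingOptions.UnquotedOnly))
{
    outDataTable = Common.CommonFunction.ConvertListToDataTable(csv.ToList());
    retValue = true;
}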
