Complex reduceByKey operation in Spark to compute multiple aggregate fields plus a distinct count

I have an RDD of the form (id, (a, b, c)), where c is a string that can be repeated across different ids. The id itself can also be repeated.
Now I need to aggregate by id, and the aggregation should be:
sum(a),
sum(b),
count distinct (c).
I was thinking of using reduceByKey, which I know how to use for the sums, but I have no idea how to do the distinct count.
Ideally, I was thinking of something like:
RDD.reduceByKey((x,y)=> (x._1 + y._1, x._2 + y._2, countDistinct(x._3, y._3)))
Any ideas? Thanks a lot!
* UPDATE (1) *
Best I could do at the moment is:
RDD.reduceByKey((x,y)=> (x._1 + y._1, x._2 + y._2, (x._3 + "," + y._3))).map(row => (row._1, row._2._1, row._2._2, row._2._3.split(",").distinct.length))

The best I can think of right now is to use an intermediate set and check its size afterwards. I'll think about whether there's a way to do this in one method.
RDD.aggregateByKey((0,0,Set[String]()))(
(x,y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3),
(x,y) => (x._1 + y._1, x._2 + y._2, x._3 ++ y._3)
).mapValues(x => (x._1, x._2, x._3.size))
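To make the two functions passed to aggregateByKey easier to follow, here is a minimal plain-Python sketch of the same idea (no Spark involved). The names seq_op, comb_op and aggregate_by_key and the sample data are made up for illustration; the first function folds one record into a per-key accumulator, the second merges accumulators built on different partitions.

```python
def seq_op(acc, value):
    """Fold one (a, b, c) record into the per-key accumulator."""
    total_a, total_b, seen = acc
    a, b, c = value
    return (total_a + a, total_b + b, seen | {c})


def comb_op(left, right):
    """Merge two partial accumulators built on different partitions."""
    return (left[0] + right[0], left[1] + right[1], left[2] | right[2])


def aggregate_by_key(partitions, zero):
    """Emulate aggregateByKey over a list of partitions of (id, (a, b, c))."""
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq_op(local.get(key, zero), value)
        partials.append(local)
    merged = {}
    for local in partials:
        for key, acc in local.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    # Final step mirrors mapValues: replace the set by its size.
    return {k: (a, b, len(s)) for k, (a, b, s) in merged.items()}


# Hypothetical sample data split over two partitions.
partitions = [
    [(1, (1, 10, "x")), (1, (2, 20, "y")), (2, (5, 50, "x"))],
    [(1, (3, 30, "x")), (2, (6, 60, "x"))],
]
result = aggregate_by_key(partitions, (0, 0, frozenset()))
```

Keeping the set in the accumulator means duplicates are dropped as they arrive, so nothing has to be split and de-duplicated at the end.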


Get-values from a html form in a for/do loop

I have a problem with the get-value() method in Progress 4GL.
I am trying to get all values from an HTML form.
My Progress 4GL code looks like this:
for each tt:
do k = 1 to integer(h-timeframe):
h-from [k] = get-value(string(day(tt.date)) + "#" + string(tt.fnr) + "#" + string(tt.pnr) + "_von" + string(k)).
h-to [k] = get-value(string(day(tt.date)) + "#" + string(tt.fnr) + "#" + string(tt.pnr) + "_bis" + string(k)).
h-code [k] = get-value(string(day(tt.date)) + "#" + string(tt.fnr) + "#" + string(tt.pnr) + "_code" + string(k)).
end.
end.
h-timeframe is a parameter and can be at most 10 (1-10).
tt is a temp-table and represents a week (fixed at 7 days).
It works perfectly up to a parameter value of 9. If I choose 10 (the maximum), I get a performance problem with the get-value() function.
Example when h-timeframe = 10:
As you can see, it takes a really long time to get from one get-value call to the next (h-timeframe = 10).
Example when h-timeframe = 9:
Here it is much faster.
Can anyone explain why? It is really strange and I have no idea.
P.S.: I only have this problem at 10. From 0 to 9 it works perfectly.
The performance difference is probably something external to your code snippet but, for performance, I would write it more like this:
define variable d as integer no-undo.
define variable n as integer no-undo.
define variable s as character no-undo.
for each tt:
// avoid recalculating and invoking functions N times per TT record
assign
d = day( tt.date )
n = integer( h-timeframe )
s = substitute( "&1#&2#&3_&&1&&2", d, tt.fnr, tt.pnr )
.
do k = 1 to n:
// consolidate multiple repeated operations, eliminate STRING() calls
assign
h-from [k] = get-value( substitute( s, "von", k ))
h-to [k] = get-value( substitute( s, "bis", k ))
h-code [k] = get-value( substitute( s, "code", k ))
.
end.
end.
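The two-stage substitution above (fill in the per-record parts once, leave escaped placeholders for the per-field parts) can be sketched in Python as follows. This is only an analogy to show the pattern, not Progress code; the function name field_names and the sample values are made up.

```python
def field_names(day, fnr, pnr, timeframe):
    """Build the form field names for one record using a two-stage template."""
    # Stage 1: fill in the per-record parts once; "{{}}" survives as "{}".
    template = "{}#{}#{}_{{}}{{}}".format(day, fnr, pnr)
    names = []
    for k in range(1, timeframe + 1):
        # Stage 2: fill in the suffix and the index for each field.
        names.append(tuple(template.format(suffix, k)
                           for suffix in ("von", "bis", "code")))
    return names


# Hypothetical record: day 14, fnr 7, pnr 3, two time frames.
names = field_names(14, 7, 3, 2)
```

The per-record prefix is computed once per record instead of three times per loop iteration, which is the point of the rewrite.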

PowerBI - Count Blank Values of specific Columns

My table looks a little like this. The last column is what I'm trying to figure out how to calculate. I can easily do this in Excel, but I'm not sure how to write the formula in Power BI.
I don't think you can count it without specifying the individual columns, if that is what you are looking for. I would do it something like this:
Data Missing =
COUNTBLANK([Project Title])
+ COUNTBLANK([Status])
+ COUNTBLANK([Object])
There may be a more clever way to do this, but a simple DAX expression can do the job.
CountBlanksInRow =
VAR data1blank = IF (ISBLANK(Sheet1[Data 1]), 1, 0)
VAR data2blank = IF (ISBLANK(Sheet1[Data 2]), 1, 0)
VAR data3blank = IF (ISBLANK(Sheet1[Data 3]), 1, 0)
RETURN data1blank + data2blank + data3blank
Rather than using a DAX column or measure, the best option is to create a custom column in Power Query. The code would be as below:
Number.From([Project Title] = null)
+ Number.From([Status] = null)
+ Number.From([Objective] = null)
Here below is the sample code window-
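For comparison, the same per-row logic (turn each "is this cell empty?" test into 0 or 1 and add them up) looks like this as a small Python sketch. The column names are taken from the answer above; the rows are invented sample data, and only None is treated as blank here.

```python
def count_blanks(row, columns):
    """Count how many of the given columns are blank (None) in one row."""
    # True counts as 1 when summed, much like Number.From(condition).
    return sum(row.get(col) is None for col in columns)


columns = ["Project Title", "Status", "Objective"]

# Hypothetical sample rows.
rows = [
    {"Project Title": "Alpha", "Status": "Open", "Objective": "Ship"},
    {"Project Title": "Beta", "Status": None, "Objective": None},
    {"Project Title": None, "Status": None, "Objective": None},
]
missing = [count_blanks(row, columns) for row in rows]
```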

Can I do CloseMatch of Defined Parsing grammar?

I have defined a grammar:
column = Word(alphanums + '._`')
stmt = column + Literal("(") + Group(delimitedList( column )) +Literal(")")
Now I want to match the query below using a close match:
sql = "seller(food_type,count(sellers),sum(weight),Earned_money)"
I do not want to change the grammar defined above. How do I close-match when functions are given as arguments?
result = stmt.parseString(sql)
print(result.dump())
def Review(sql):
    stmt = Getgrammar(sql)
    result = stmt.parseString(sql, parseAll=False)
    print(result.dump())
Here the variable sql is a procedure of 400-500 lines. I am automating part of a code review, and for this purpose I have written a grammar for SQL statements.
But it throws an exception even if only a small string does not match, and it terminates after that. I want it not to abort when exceptions occur, because I know that at least the parsable part is useful for reviewing database queries.
Getgrammar returns the grammar for a procedure, all of whose parts are SQL statements.
def Getgrammar(sql):
    InputParameters = delimitedList(Optional((_in | _out | _inout), '') + column +
                                    DataType)
    DeclarativeSyntax = (_declare + column + DataType + ';')
    # The outer parentheses let the expression span several lines.
    createProcedureStmt = (
        createProcedure +
        StoredProcedure.setResultsName("Procedure") +
        lpar +
        Optional(InputParameters.setResultsName("Input"), '') +
        rpar +
        Optional(_sql_security_invoker, '').setResultsName("SQLSECURITY") +
        _begin +
        ZeroOrMore(DeclarativeSyntax).setResultsName("Declare") +
        ZeroOrMore((selectStmt | setStmt | ifStmt.setResultsName("IfStmt") |
                    callStmt | updateStmt | createStmt | dropStmt | alterStmt | insertStmt
                    | deleteStmt | WhileStmt.setResultsName("WhileStmt") | createStmt) + ';') +
        _end + Optional(';', '')
    )
    return createProcedureStmt
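One way to keep going after a failure is to parse statement by statement and collect the failures instead of letting the first exception end the review. The sketch below shows that pattern with the standard library only: parse_statement is a toy regex stand-in for stmt.parseString, review is a made-up name, and splitting on ';' is a simplification that would break on semicolons inside string literals. pyparsing itself also has scanString and searchString, which scan the input for matches and skip text that does not match; whether they fit depends on the grammar.

```python
import re

# Toy stand-in for the grammar above: name(arg, arg, ...), no nested calls.
_STMT = re.compile(r"\s*([\w.`]+)\s*\(\s*([\w.`]+(?:\s*,\s*[\w.`]+)*)\s*\)\s*")


def parse_statement(stmt):
    """Return [name, [args]] or raise ValueError, like a failing parse."""
    m = _STMT.fullmatch(stmt)
    if m is None:
        raise ValueError("cannot parse: " + stmt.strip())
    return [m.group(1), [a.strip() for a in m.group(2).split(",")]]


def review(sql):
    """Parse every statement; collect failures instead of aborting."""
    parsed, failed = [], []
    for stmt in sql.split(";"):
        if not stmt.strip():
            continue
        try:
            parsed.append(parse_statement(stmt))
        except ValueError as exc:
            failed.append(str(exc))
    return parsed, failed


parsed, failed = review(
    "seller(food_type, Earned_money); seller(count(sellers)); t(a)"
)
```

With real pyparsing code the except clause would catch ParseException rather than ValueError.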

crc ip hdr checksum in verilog

I am implementing a task that I can use to obtain the checksum of a modified IP header. This is what I have:
task checksum_calc;
input [159:0] IP_hdr_data;
output [15:0] IP_chksum;
reg [19:0] IP_chksum_temp;
reg [19:0] IP_chksum_temp1;
reg [19:0] IP_chksum_temp2;
begin
IP_chksum_temp = IP_hdr_data[15:0] + IP_hdr_data[31:16] + IP_hdr_data[47:32] + IP_hdr_data[63:48] + IP_hdr_data[79:64] + IP_hdr_data[111:96] + IP_hdr_data[127:112] + IP_hdr_data[143:128] + IP_hdr_data[159:144];
IP_chksum_temp1 = IP_chksum_temp[15:0] + IP_chksum_temp[19:16];
IP_chksum_temp2 = IP_chksum_temp1[15:0] + IP_chksum_temp1[19:16];
IP_chksum = ~IP_chksum_temp2[15:0];
end
endtask
Is that correct? Or will there be timing problems due to using combinational logic?
It looks like all you are doing is a combinational logic calculation. A function is a better choice. The primary purpose of a function is to return a value that is to be used in an expression.
This is a large block of combinational logic, which in most scenarios will cause timing trouble.
It is better to run it through synthesis and a timing check to know the exact result.
One suggestion: the addition
IP_chksum_temp1 = IP_chksum_temp[15:0] + IP_chksum_temp[19:16];
can only carry into bit 16. Hence, there is no need for 20 bits in the next addition:
IP_chksum_temp2 = IP_chksum_temp1[15:0] + IP_chksum_temp1[19:16];
The registers can therefore be declared as follows (the carry slice then becomes IP_chksum_temp1[16] instead of IP_chksum_temp1[19:16]):
reg [16:0] IP_chksum_temp1;
reg [16:0] IP_chksum_temp2;
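To check the hardware result against known values, a small software reference model can help. The sketch below is a plain Python version of the ones'-complement header checksum (sum the 16-bit words, fold the carries back in, invert); the function name ip_checksum is made up, and the header words are a commonly used 20-byte example header with the checksum field set to zero. Note that the last step is a bitwise inversion, which is `~` in Verilog; `!` is a logical NOT and yields a single bit.

```python
def ip_checksum(words):
    """Ones'-complement checksum over 16-bit words.

    The checksum field itself must be zero (or left out) in `words`.
    """
    total = sum(words)
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    # Bitwise inversion, limited to 16 bits.
    return ~total & 0xFFFF


# Example header as ten 16-bit words, checksum field (index 5) zeroed.
header = [0x4500, 0x0073, 0x0000, 0x4000, 0x4011,
          0x0000, 0xC0A8, 0x0001, 0xC0A8, 0x00C7]
checksum = ip_checksum(header)

# A receiver that sums the header including the checksum should get zero.
verified = ip_checksum(header[:5] + [checksum] + header[6:])
```

A testbench could feed the same words to the task and compare the 16-bit outputs.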

Using solve and/or linsolve with the symbolic toolbox in R2010b

I asked a question a few days ago here and got an answer that seems like it would work- it involves using linsolve to find the solutions to a system of equations that are all modulo p, where p is a non-prime integer.
However, when I try to run the commands from the provided answer, or the linsolve help page, I get an error saying linsolve doesn't support arguments of type 'sym'. Is using linsolve with sym variables only possible in R2013b? I've also tried it with my school's copy, which is R2012b. Here is the code I'm attempting to execute (from the answer at the above link):
A = [0 5 4 1;1 7 0 2;8 1 0 2;10 5 1 0];
b = [2946321;5851213;2563617;10670279];
s = mod(linsolve(sym(A),sym(b)),8)
And the output is:
??? Undefined function or method 'linsolve' for input arguments of type 'sym'.
I've also tried to use the function solve for this, however even if I construct the equations represented by the matrices A and b above, I'm having issues. Here's what I'm attempting:
syms x y z q;
solve(5*y + 4*z + q == 2946321, x + 7*y + 2*q == 5851213, 8*x + y + 2*q == 2563617, 10*x + 5*y + z == 10670279,x,y,z,q)
And the output is:
??? Error using ==> char
Conversion to char from logical is not possible.
Error in ==> solve>getEqns at 169
vc = char(v);
Error in ==> solve at 67
[eqns,vars] = getEqns(varargin{:});
Am I using solve wrong? Should I just try to execute my code in R2013b to use linsolve with symbolic data types?
The Symbolic Math Toolbox has changed a lot (for the better) over the years. You might not have sym/linsolve, but does this work?
s = mod(sym(A)\sym(b),8)
That will basically do the same thing. sym/linsolve just does some extra input checking and a rank calculation to mirror the capabilities of linsolve.
You're using solve correctly for current versions, but it looks like R2010b may not understand the == operator (sym/eq) in this context. You can use the old string format to specify your equations:
eqs = {'5*y + 4*z + q = 2946321',...
'x + 7*y + 2*q = 5851213',...
'8*x + y + 2*q = 2563617',...
'10*x + 5*y + z = 10670279'};
vars = {'x','y','z','q'};
[x,y,z,q] = solve(eqs{:},vars{:})
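If no version of the toolbox is at hand, the exact (rational) solve step can be cross-checked outside MATLAB. The following is a minimal Python sketch of Gauss-Jordan elimination over exact fractions, applied to the A and b from the question. It is not how the toolbox works internally, solve_exact is a made-up name, it assumes A is square and non-singular, and it deliberately stops before the mod 8 reduction.

```python
from fractions import Fraction


def solve_exact(A, b):
    """Solve A x = b exactly; assumes A is square and non-singular."""
    n = len(A)
    # Augmented matrix [A | b] with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)]
         for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so the pivot becomes 1.
        M[col] = [v / M[col][col] for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [rv - factor * cv for rv, cv in zip(M[r], M[col])]
    return [row[n] for row in M]


A = [[0, 5, 4, 1], [1, 7, 0, 2], [8, 1, 0, 2], [10, 5, 1, 0]]
b = [2946321, 5851213, 2563617, 10670279]
x = solve_exact(A, b)

# Exact check: substituting x back must reproduce b with no rounding.
residual_ok = all(sum(a * xi for a, xi in zip(row, x)) == rhs
                  for row, rhs in zip(A, b))
```

Because everything is a Fraction, the substitution check is exact rather than a floating-point comparison.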
