Is there a way to use expressions when doing aggregations (using Datatable)? - Kibana

Say I do a count aggregation over terms, with results looking like this:
term1 345
term2 249
term3 117
I would like to show a second column representing each term's percentage of the total count, e.g.:
term1 345 3.45%
term2 249 2.47%
term3 117 2.22%
Does Kibana support something like this?
Can I do something like this for example?
{ "script" : "doc['terms'].count/(TOTAL/100)" }

I'm afraid this issue is still open. Let's hope it'll be available in the near future. +1 to make this bubble to the top.
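Not in the Kibana Datatable itself, but a possible workaround at the Elasticsearch query level: versions 7.9 and later ship a normalize pipeline aggregation with a percent_of_sum method. A hedged sketch (the field name term.keyword is an assumption, and the percentage is relative to the buckets the terms aggregation returns, not to the whole index):

{
  "size": 0,
  "aggs": {
    "by_term": {
      "terms": { "field": "term.keyword" },
      "aggs": {
        "count_share": {
          "normalize": {
            "buckets_path": "_count",
            "method": "percent_of_sum",
            "format": "0.00%"
          }
        }
      }
    }
  }
}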

Related

Pick categories of charts in HTML

I am plotting my benchmark tests with plotly and the results look as expected:
This is just a preliminary view, as the rest of the data is still being calculated. Yet it's already obvious that the current plot doesn't make much sense, since there will be far more plotted data in different segments (10-100, 100-1000, 1000+), and the smaller values are simply not visible any more (unless zoomed in).
Is there a proper way to choose which bars are displayed (by defining groups)?
There is apparently a solution with Dash (https://community.plotly.com/t/how-can-i-select-multiple-options-to-pick-from-dropdown-as-a-group/60482) which seems to be what I am looking for, but that is not a standalone HTML file which can be sent and/or exported.
Alternatively I thought about displaying it on a log scale, but the result is confusing, as it doesn't really show what I'd like to display.
This code here works and gives the result shown:
import os
import pandas as pd
import plotly.express as px

if __name__ == '__main__':
    filep = "Tests/10k-node-test/"
    data = []
    quantities = []
    # one subdirectory per record quantity, each holding a timing.csv
    for p1 in next(os.walk(filep))[1]:
        quantities.append(p1)
        df = pd.read_csv(filep + p1 + '/' + "timing.csv")
        for index, row in df.iterrows():
            if index >= 2:
                if index % 2 == 0:
                    val = row.iloc[2]        # start timestamp of a pair
                else:
                    val = row.iloc[2] - val  # end - start = insertion time
                    data.append([p1, row.iloc[1], val])
    df = pd.DataFrame(data, columns=["Records", "Iteration", "Insertion Time"])
    fig = px.bar(df, x="Records", y="Insertion Time",
                 hover_data=["Records", "Iteration", "Insertion Time"],
                 color="Insertion Time",
                 height=400,
                 log_y=True)
    fig.update_layout(barmode='stack', xaxis={'categoryorder': 'total ascending'})
    fig.write_html("plotlye.html")
The data-frame looks like this:
Records Iteration Insertion Time
0 250 3 1.137531
1 250 4 1.137239
2 250 5 1.146533
3 250 6 1.131248
4 250 7 1.123308
.. ... ... ...
189 10 95 0.123577
190 10 96 0.131645
191 10 97 0.122587
192 10 98 0.124850
193 10 99 0.126864
I am not tied to Plotly, but so far it has returned what I wanted; it's just the fine-tuning that I'm missing. If there are alternatives I'd be open to those too, as long as they convey my benchmarking results properly.
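One way to keep everything in a single standalone HTML file: give each size segment its own trace and wire up an updatemenus dropdown that toggles trace visibility. This is a hedged sketch; the segment boundaries, toy data, and output file name are assumptions, and only the Records / Insertion Time columns come from the question:

import pandas as pd
import plotly.graph_objects as go

# toy stand-in for the benchmark frame in the question
df = pd.DataFrame({
    "Records": [10, 50, 250, 500, 2500, 10000],
    "Insertion Time": [0.12, 0.40, 1.10, 2.30, 11.0, 47.0],
})

# bin the record counts into the segments mentioned above
labels = ["10-100", "100-1000", "1000+"]
df["Segment"] = pd.cut(df["Records"], bins=[0, 100, 1000, float("inf")],
                       labels=labels)

# one bar trace per segment so each can be shown/hidden independently
fig = go.Figure()
for seg in labels:
    sub = df[df["Segment"] == seg]
    fig.add_bar(x=sub["Records"].astype(str), y=sub["Insertion Time"], name=seg)

# a dropdown toggling trace visibility; updatemenus are plain Plotly layout,
# so they survive write_html with no server behind the exported page
buttons = [dict(label="All", method="update",
                args=[{"visible": [True] * len(labels)}])]
buttons += [dict(label=seg, method="update",
                 args=[{"visible": [j == i for j in range(len(labels))]}])
            for i, seg in enumerate(labels)]

fig.update_layout(updatemenus=[dict(buttons=buttons)])
fig.write_html("plotly_segments.html")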

Sqlite remove duplicates within specific time range

I know there are many questions about removing duplicates in SQL, but my case is slightly more complicated.
The data contain a Barcode column whose values repeat over a month, so entries with the same Barcode are expected. However, due to what is possibly a machine bug, the same data get recorded 2 to 3 times within a 4-5 minute timeframe. It does not happen for every entry, but it happens rather frequently.
Allow me to demonstrate with a sample table containing the same Barcode "A00000":
Barcode No Date A B C D
A00000 1499456 10/10/2019 3:28 607 94 1743 72D
A00000 1803564 10/20/2019 22:09 589 75 1677 14D
A00000 1803666 10/20/2019 22:13 589 75 1677 14D
A00000 1803751 10/20/2019 22:17 589 75 1677 14D
A00000 2084561 10/30/2019 12:22 583 86 1677 14D
A00000 2383742 11/9/2019 23:18 594 81 1650 07D
As you can see, the entries on 10/20 contain identical data; these are the duplicates that should be removed so that only one of them remains (any of them is fine, and the exact time is not the main concern). The "No" column is an arbitrary sequence number and can be safely disregarded. The other entries should remain as they are.
I know this should be done using "GROUP BY", but I am struggling with how to write the conditions. I have also tried INNER JOINing the table with itself and then removing the selected results:
T2.A = T2.B AND
T2.[Date] > T1.[Date] AND
strftime('%s',T2.[Date]) - strftime('%s',T1.[Date]) < 600
The results still seem a bit off, as some of the entries are selected twice and some are not selected at all. I am still not used to the SQL style of thinking. Any help is appreciated.
The format of the Date column complicates things a bit, but otherwise the solution is basically to use GROUP BY in the normal way. In the following I've assumed the table is named test; the CTE extracts the day part of Date (everything before the space) so that same-day recordings fall into one group:
WITH sane AS (
    SELECT *,
           substr(Date, 1, instr(Date, ' ') - 1) AS day
    FROM test
)
SELECT Barcode, max(No), Date, A, B, C, D
FROM sane
GROUP BY Barcode, day;
The use of max() also gives some determinacy: in SQLite, when a bare max() appears in the select list, the other bare columns (Date, A, B, ...) are taken from the row that supplied the maximum.
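If the goal is to delete the duplicates in place rather than select the survivors, here is a hedged sketch of the same idea as a DELETE. It assumes the Date column is stored in a format strftime('%s', ...) can parse (the m/d/yyyy h:mm strings above are not, so they would need converting first) and uses No as a proxy for recording order:

DELETE FROM test
WHERE EXISTS (
    SELECT 1
    FROM test AS earlier
    WHERE earlier.Barcode = test.Barcode
      AND earlier.A = test.A AND earlier.B = test.B
      AND earlier.C = test.C AND earlier.D = test.D
      AND earlier.No < test.No  -- keep the earliest recording
      AND strftime('%s', test.Date) - strftime('%s', earlier.Date) < 600
);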

Highlighting regions in ggplot2 barplot fulfilling a condition

I want to plot a horizontal barplot using ggplot2 and highlight regions satisfying a particular criteria.
In this case: if any "Term" for the point "E15.5-E18.5_up_down" has more than twice the number of samples compared to the point "P22-P29_up_down", or vice versa, highlight that label or region.
I have a dataframe in following format:
CLID CLSZ GOID NodeSize SampleMatch Phyper Padj Term Ont SampleKeys
E15.5-E18.5_up_down 1364 GO:0007568 289 20 0.141830716154421 1 aging BP ENSMUSG00000049932 ENSMUSG00000046352 ENSMUSG00000078249 ENSMUSG00000039428 ENSMUSG00000014030 ENSMUSG00000039323 ENSMUSG00000026185 ENSMUSG00000027513 ENSMUSG00000023224 ENSMUSG00000037411 ENSMUSG00000020429 ENSMUSG00000020897 ENSMUSG00000025486 ENSMUSG00000021477 ENSMUSG00000019987 ENSMUSG00000023067 ENSMUSG00000031980 ENSMUSG00000023070 ENSMUSG00000025747 ENSMUSG00000079017
E15.5-E18.5_up_down 1364 GO:0006397 416 3 0.999999969537913 1 mRNA processing BP ENSMUSG00000027510 ENSMUSG00000021210 ENSMUSG00000027951
P22-P29_up_down 476 GO:0007568 289 11 0.0333771791166823 1 aging BP ENSMUSG00000049932 ENSMUSG00000037664 ENSMUSG00000026879 ENSMUSG00000026185 ENSMUSG00000026043 ENSMUSG00000060600 ENSMUSG00000022508 ENSMUSG00000020897 ENSMUSG00000028702 ENSMUSG00000030562 ENSMUSG00000021670
P22-P29_up_down 476 GO:0006397 416 2 0.998137879564768 1 mRNA processing BP ENSMUSG00000024007 ENSMUSG00000039878
reduced to (only the columns necessary for plotting):
CLID SampleMatch Term
E15.5-E18.5_up_down 20 aging
P22-P29_up_down 2 mRNA processing
E15.5-E18.5_up_down 3 mRNA processing
P22-P29_up_down 11 aging
I would prefer a general approach that will work with any condition, not just the one I need for this scenario. One way I imagined is to use sapply on each CLID/Term pair and create another column storing whether the criterion is fulfilled as a boolean, but I still cannot find a way to highlight the values. What would be the most efficient way to achieve this?
Pseudo-code for my approach:
for (i in CLID) {
  for (k in CLID) {
    if (Term[i] == Term[k]) {
      # check whether the SampleMatch count for one CLID/Term pair is
      # significantly higher than for the corresponding CLID/Term pair
      condition = check(Term[i], Term[k])
      if (condition == TRUE) {
        highlight(term)
      }
    }
  }
}
In the end I want something like this (highlighting the label or column), or like this: Highlight data individually with facet_grid in R.
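For what it's worth, here is a hedged sketch of the boolean-column idea using dplyr and ggplot2. The tribble is the reduced frame from the question; the "more than twice" rule is plugged in as one example predicate, and the alpha-based highlighting plus all object names (go_df, flagged) are my assumptions:

library(dplyr)
library(ggplot2)

go_df <- tibble::tribble(
  ~CLID,                 ~SampleMatch, ~Term,
  "E15.5-E18.5_up_down", 20,           "aging",
  "P22-P29_up_down",      2,           "mRNA processing",
  "E15.5-E18.5_up_down",  3,           "mRNA processing",
  "P22-P29_up_down",     11,           "aging"
)

# flag a Term when one CLID has more than twice the samples of the other;
# swap in any other predicate here to generalise the condition
flagged <- go_df %>%
  group_by(Term) %>%
  mutate(highlight = max(SampleMatch) > 2 * min(SampleMatch)) %>%
  ungroup()

ggplot(flagged, aes(Term, SampleMatch, fill = CLID, alpha = highlight)) +
  geom_col(position = "dodge") +
  scale_alpha_manual(values = c(`FALSE` = 0.35, `TRUE` = 1)) +
  coord_flip()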

asRules(tree) R save rules

I have the following problem:
I created a decision tree in R with the rpart library, and since I have a broad list of variables, the rules form an endless list.
Using asRules(tree) from the rattle library, the result is nicer than just printing tree once it is computed.
The problem is that the set of rules is longer than the number of lines printable in the console, so I can't copy them with Ctrl+C, and when saving the result into a variable, for instance:
t <- asRules(tree)
I would expect something like:
Rule number: 1 [target=0 cover=500 (4%) prob=0.8]
var1 < 10
var2 < 2
var3 >=45
var4 >=5
the actual result is
[1] 297 242 295 126 127 124
And obviously this isn't what I am looking for.
So I see 3 possible solutions:
Increasing the limit of printable lines in the console (I don't know how to do that).
Printing to the console with a pause for a key press, so I can copy one chunk, paste it, and then press a key to get the next chunk (I don't know how to do that either).
Saving the bunch of rules into a txt file or something similar, instead of getting [1] 297 242 295 126 127 124.
Any help is very much appreciated!
Thank you!
For #3, use:
sink(file='somefile.txt')
asRules(tree)
sink()
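A base-R alternative for #3 (a sketch; the file name rules.txt is arbitrary): capture.output() grabs whatever asRules() prints and returns it as a character vector, one element per line, which can then be written to a file:

# capture the printed rules and write them out, one rule line per element
writeLines(capture.output(asRules(tree)), "rules.txt")

For #1, options(max.print = n) raises R's own print limit, though the console scrollback buffer is a separate, IDE-level setting.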

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating through the unique elements of a vector. I have a dataframe "School" with 700 different teachers. Each teacher has around 40 students.
I want to loop through each teacher, create a graph of the mean score of his/her students over time, save the graph in a folder, and automatically email that folder to the teacher.
I'm just getting started and am having trouble setting up the for-loop. In Stata, I know how to loop through each unique element in a list, but I am having trouble doing that in R. Any help would be appreciated.
School$Teacher School$Student School$ScoreNovember School$ScoreDec School$TeacherEmail
A 1 35 45 A#school.org
A 2 43 65 A#school.org
B 1 66 54 B#school.org
A 3 97 99 A#school.org
C 1 23 45 C#school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
see ?subset
School = data.frame(Teacher = c("A", "B"), ScoreNovember = 10:11, ScoreDec = 13:14)

for (teacher in unique(School$Teacher)) {
  teacher_df = subset(School, Teacher == teacher)
  MeanScoreNovember = mean(teacher_df$ScoreNovember)
  MeanScoreDec = mean(teacher_df$ScoreDec)
  # do your plot
  # send your email
}
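Filling in those two comments, a hedged sketch of the plotting half (the reshape2/ggplot2 choices, the toy School frame, and the teacher_reports folder are assumptions; the email step is left as a stub since it needs an SMTP setup, e.g. via the mailR or blastula packages):

library(reshape2)
library(ggplot2)

School <- data.frame(Teacher = c("A", "A", "B"),
                     ScoreNovember = c(35, 43, 66),
                     ScoreDec = c(45, 65, 54))

dir.create("teacher_reports", showWarnings = FALSE)

for (teacher in unique(School$Teacher)) {
  teacher_df <- subset(School, Teacher == teacher)
  # long format: one row per (month, score) so time can go on the x axis
  long <- melt(teacher_df, id.vars = "Teacher",
               measure.vars = c("ScoreNovember", "ScoreDec"),
               variable.name = "Month", value.name = "Score")
  means <- aggregate(Score ~ Month, data = long, FUN = mean)
  p <- ggplot(means, aes(Month, Score, group = 1)) +
    geom_line() + geom_point() +
    ggtitle(paste("Mean score over time, teacher", teacher))
  ggsave(file.path("teacher_reports", paste0(teacher, ".png")), p)
  # sending the email with the saved file would go here
}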
I think you have 3 questions here, which will need to be split into separate posts: how do I
Create graphs
Automatically email output
Compute a subset mean based on group
For the 3rd one, I like using the plyr package; other people will recommend the data.table or dplyr packages. You can also use aggregate from base R. To get a teacher's mean:
library(plyr)
ddply(School,.(Teacher),summarise,Nov_m=mean(ScoreNovember))
If you want per student per teacher, etc., just add more grouping columns between the brackets, like:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember))
You could do that for each score column (and then chart it). If your data were long rather than wide, you could also add the date ('November', 'Dec') as a group in the brackets, or:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember),Dec_m=mean(ScoreDec))
See if that helps with the 3rd, but look at splitting your questions up too.
