Get max marks of each student in Kusto - azure-data-explorer

Consider a table:
StudentId  Subject  Marks
1          Maths    34
1          Science  54
2          Maths    64
2          French   85
2          Science  74
I'm looking for an output where it will give (note that I'm trying to find MAX marks for each student, irrespective of the subject)
StudentId  Subject  Marks
1          Science  54
2          French   85

Use the summarize operator:
T
| summarize max(Marks) by StudentId

In addition to the query above from Avnera: if you also care about the subject in which each student received the maximum marks (which your desired output table suggests), you can use the arg_max function:
T
| summarize arg_max(Marks, Subject) by StudentId
arg_max(): https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/arg-max-aggfunction
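For readers who want to see what the arg_max query computes, here is a minimal Python sketch of the same idea: for each StudentId, keep the whole row that holds the highest Marks. The rows list and the function name top_row_per_student are made up for illustration. On a tie this sketch keeps the first row it sees; check the arg_max documentation for how Kusto handles ties.

```python
# Hypothetical in-memory copy of the question's table: (StudentId, Subject, Marks).
rows = [
    (1, "Maths", 34),
    (1, "Science", 54),
    (2, "Maths", 64),
    (2, "French", 85),
    (2, "Science", 74),
]


def top_row_per_student(rows):
    """Keep, for each StudentId, the whole row that has the highest Marks."""
    best = {}
    for student_id, subject, marks in rows:
        current = best.get(student_id)
        # Strict '>' means the first row wins when two rows tie on Marks.
        if current is None or marks > current[2]:
            best[student_id] = (student_id, subject, marks)
    return [best[student_id] for student_id in sorted(best)]


result = top_row_per_student(rows)
for row in result:
    print(row)
```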

Related

Identifying, reviewing, and deduplicating records in R

I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df<-data.frame(Record=c(1,2,3,4,5),
First=c("Ed","Sue","Ed","Sue","Ed"),
Last=c("Bee","Cord","Bee","Cord","Bee"),
Address=c(123,NA,NA,456,789),
DOB=c("12/6/1995","0056/12/5",NA,"12/5/1956","10/4/1980"))
Record First Last Address DOB
1      Ed    Bee  123     12/6/1995
2      Sue   Cord NA      0056/12/5
3      Ed    Bee  NA      NA
4      Sue   Cord 456     12/5/1956
5      Ed    Bee  789     10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
# Add a column holding the number of NA values in each row.
df$nMissing <- rowSums(is.na(df))
# Using ave(), find the smallest nMissing value for each name and keep
# only the rows that match it.
deduped_df <-
df[which(df$nMissing==ave(df$nMissing,paste(df$First,df$Last),FUN=min)),]
# If you like, remove the nMissing column from both data frames.
df$nMissing<-deduped_df$nMissing<-NULL
deduped_df
Record First Last Address DOB
1 1 Ed Bee 123 12/6/1995
4 4 Sue Cord 456 12/5/1956
5 5 Ed Bee 789 10/4/1980
Edit: Per your comment, if you also want to treat invalid DOBs as missing, start by converting the column to Date format, which turns any value that does not match the format into NA (missing data). Do this before counting the NA values so that invalid dates are included in nMissing.
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")
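The whole recipe (treat invalid dates as missing, count missing values per row, keep the rows with the fewest missing values per name) can be sketched outside R as well. The Python below is only an illustration under stated assumptions: records is a hand-made copy of the question's data with None standing in for NA, and valid_date_or_none and dedupe are hypothetical helper names, not part of any library.

```python
from datetime import datetime

# Hypothetical copy of the question's data; None stands in for R's NA.
records = [
    {"Record": 1, "First": "Ed", "Last": "Bee", "Address": 123, "DOB": "12/6/1995"},
    {"Record": 2, "First": "Sue", "Last": "Cord", "Address": None, "DOB": "0056/12/5"},
    {"Record": 3, "First": "Ed", "Last": "Bee", "Address": None, "DOB": None},
    {"Record": 4, "First": "Sue", "Last": "Cord", "Address": 456, "DOB": "12/5/1956"},
    {"Record": 5, "First": "Ed", "Last": "Bee", "Address": 789, "DOB": "10/4/1980"},
]


def valid_date_or_none(text):
    """Return text if it parses as month/day/year, otherwise None."""
    if text is None:
        return None
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return None
    return text


def dedupe(records, key_fields=("First", "Last")):
    """Keep, per name, the rows with the fewest missing values."""
    # Step 1: invalid dates become missing before anything is counted.
    cleaned = [dict(r, DOB=valid_date_or_none(r["DOB"])) for r in records]
    # Step 2: count missing values in each row.
    missing = [sum(value is None for value in r.values()) for r in cleaned]
    # Step 3: find the smallest count for each name.
    fewest = {}
    for r, n in zip(cleaned, missing):
        key = tuple(r[f] for f in key_fields)
        fewest[key] = min(n, fewest.get(key, n))
    # Step 4: keep every row that reaches its name's smallest count.
    return [
        r for r, n in zip(cleaned, missing)
        if n == fewest[tuple(r[f] for f in key_fields)]
    ]


kept = dedupe(records)
print([r["Record"] for r in kept])
```

As in the R answer, rows that tie on the number of missing values are all kept, which is why both record 1 and record 5 survive for Ed Bee.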

Find if a specific choice is in a Data Frame R

I have a Data Frame object which contains a list of possible choices. For example, an analogy of this would be:
FirstName, SurName, Subject, Grade
Brian, Smith, History, 75
Jenny, Jackson, English, 60
How would I...
1) Check to see if a certain pupil-subject combination is in my Data Frame
2) And for those who are there, extract their grade (And potentially other relevant fields)
?
Thanks so much
The only solutions I've found so far include appending the values onto the end of the Data Frame and trying to see if it is unique or not? This seems a crude and ridiculous hack?
Learn how to subset (extract) data using base R.
To subset any data frame by its rows and columns, you use [ ].
Let df be your data frame.
FirstName SurName Subject Grade
1 Brian Smith History 75
2 Jenny Jackson English 60
3 Tom Brandon Physics 50
You can subset it by its rows and columns using
df[rows,columns]
Here rows and columns can each be:
1) Index (Number/Name)
This selects one particular row and column, for example
df[2,3]
which returns the value in the second row and third column
[1] English
or
df[2,"Grade"]
returns
[1] 60
2) Range (Indices/List of Names)
This selects several rows and columns at once, for example
df[1:2,2,drop=F]
Here drop=F stops R from flattening the result into a vector, so the output stays a data.frame. It will give you this
SurName
1 Smith
2 Jackson
A range can also mean "all": leave either rows or columns empty, for example
df[,3,drop=F]
which returns all rows of the third column
Subject
1 History
2 English
3 Physics
or
df[1:2,c("Grade","Subject")]
Grade Subject
1 75 History
2 60 English
3) Logical
This subsets using a logical condition.
df[df$FirstName=="Brian",]
means: give me the rows where FirstName is Brian, with all their columns.
FirstName SurName Subject Grade
1 Brian Smith History 75
or
df[df$FirstName=="Brian",1:3]
gives the rows where FirstName is Brian, restricted to columns 1 to 3.
You can also combine several conditions
df[df$FirstName=="Brian" & df$SurName=="Smith",1:3]
output
FirstName SurName Subject
1 Brian Smith History
or combine conditions and extract a column by name
df[df$FirstName=="Brian" & df$SurName=="Smith","Grade",drop=F]
Grade
1 75
or combine conditions and extract multiple columns by name
df[df$FirstName=="Brian" & df$SurName=="Smith",c("Grade","Subject")]
Grade Subject
1 75 History
To use this in a function, do
# Return the Grade for one pupil-subject combination
# (a data frame with zero rows if there is no match).
myfunc<-function(input_var1,input_var2,input_var3)
{
df[df$FirstName==input_var1 & df$SurName==input_var2 & df$Subject==input_var3,"Grade",drop=F]
}
Run it like this
myfunc("Tom","Brandon","Physics")
I think you are looking for this:
result <- data[data$FirstName == "Brian" & data$Subject == "History", c("Grade") ]
Try subset:
con <- textConnection("FirstName,SurName,Subject,Grade\nBrian,Smith,History,75\nJenny,Jackson,English,60")
dat <- read.csv(con, stringsAsFactors=FALSE)
subset(dat, FirstName=="Brian" & SurName=="Smith" & Subject=="History", Grade)
Maybe aggregate can be helpful, too. The following code gives the mean of the grades for all pupil/subject combinations:
dat <- transform(dat, FullName=paste(FirstName, SurName), stringsAsFactors=FALSE)
aggregate(Grade ~ FullName+Subject, data=dat, FUN=mean)
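All of the answers above come down to the same two steps: build a condition that is true for the matching rows, then pull the wanted field from those rows. A rough Python sketch of that idea follows; rows and find_grades are hypothetical names used only for this illustration. An empty result means the pupil-subject combination is not in the data, which answers the first part of the question without appending anything to the data frame.

```python
# Hypothetical copy of the question's data as a list of dicts.
rows = [
    {"FirstName": "Brian", "SurName": "Smith", "Subject": "History", "Grade": 75},
    {"FirstName": "Jenny", "SurName": "Jackson", "Subject": "English", "Grade": 60},
]


def find_grades(rows, first_name, sur_name, subject):
    """Return the grades of every row matching the pupil-subject combination."""
    return [
        r["Grade"]
        for r in rows
        if r["FirstName"] == first_name
        and r["SurName"] == sur_name
        and r["Subject"] == subject
    ]


brian_history = find_grades(rows, "Brian", "Smith", "History")
jenny_history = find_grades(rows, "Jenny", "Jackson", "History")

# A non-empty list means the combination exists.
print(bool(brian_history), brian_history)
print(bool(jenny_history), jenny_history)
```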

Finding max and identify name's cell for another column

Hopefully someone can help me solve the following problem.
Here I have data about different birds and their maximum lengths:
a<-c("bird1","bird2","bird1","bird3","bird2","bird2")
b<-c(32,45,35,25,51,47)
c<-data.frame(animal=a,max=b)
animal max
1 bird1 32
2 bird2 45
3 bird1 35
4 bird3 25
5 bird2 51
6 bird2 47
My goal is to identify the name of the animal which has the maximum length. I know that max() and which.max() make it easy to find the maximum length and the corresponding cell, but how can I get the name of the animal?
Any valuable comment will be helpful for me!
The following modification of your code returns the name of the bird with the greatest length:
a<-c("bird1","bird2","bird1","bird3","bird2","bird2")
b<-c(32,45,35,25,51,47)
combined_birds<-data.frame(animal=a,max=b)
# which.max() gives the row index of the largest value; use it to pick the name.
combined_birds$animal[which.max(combined_birds$max)]

R Programming - Sum Elements of Rows with Common Values

Hello and thank you in advance for your assistance,
(PLEASE Note Comments section for additional insight: i.e. the cost column in the example below was added to this question; Simon, provides a great answer, but the cost column itself is not represented in the data response from him, although the function he provides works with the cost column)
I have a data set, let's call it 'data', which looks like this:
NAME DATE COLOR PAID COST
Jim 1/1/2013 GREEN 150 100
Jim 1/2/2013 GREEN 50 25
Joe 1/1/2013 GREEN 200 150
Joe 1/2/2013 GREEN 25 10
What I would like to do is sum the PAID (and COST) elements of the records with the same NAME value and reduce the number of rows (as in this example) to 2, such that my new data frame looks like this:
NAME DATE COLOR PAID COST
Jim 1/2/2013 GREEN 200 125
Joe 1/2/2013 GREEN 225 160
As far as the dates are concerned, I don't really care about which one survives the summation process.
I've gotten as far as rowSums(data), but I'm not exactly certain how to use it. Any help would be greatly appreciated.
aggregate is the function you are looking for:
aggregate( cbind( PAID , COST ) ~ NAME + COLOR , data = data , FUN = sum )
# NAME PAID
# 1 Jim 200
# 2 Joe 225
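To make explicit what aggregate does with cbind(PAID, COST) ~ NAME + COLOR, here is a small Python sketch of the same group-and-sum. The data list is a hand-made copy of the question's table, and sum_by_name_and_color is a hypothetical name for this illustration. DATE is dropped because it is neither a grouping column nor a summed column, which matches the asker's remark that the surviving date does not matter.

```python
# Hypothetical copy of the question's data as a list of dicts.
data = [
    {"NAME": "Jim", "DATE": "1/1/2013", "COLOR": "GREEN", "PAID": 150, "COST": 100},
    {"NAME": "Jim", "DATE": "1/2/2013", "COLOR": "GREEN", "PAID": 50, "COST": 25},
    {"NAME": "Joe", "DATE": "1/1/2013", "COLOR": "GREEN", "PAID": 200, "COST": 150},
    {"NAME": "Joe", "DATE": "1/2/2013", "COLOR": "GREEN", "PAID": 25, "COST": 10},
]


def sum_by_name_and_color(data):
    """Sum PAID and COST for each (NAME, COLOR) group."""
    totals = {}
    for row in data:
        key = (row["NAME"], row["COLOR"])
        paid, cost = totals.get(key, (0, 0))
        totals[key] = (paid + row["PAID"], cost + row["COST"])
    return totals


totals = sum_by_name_and_color(data)
for (name, color), (paid, cost) in sorted(totals.items()):
    print(name, color, paid, cost)
```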

MySQL: sum a column and return only entries with an entry in the last 10 minutes

Here's a table. The time when the query runs, i.e. now, is 2010-07-30 22:41:14.
number | person | timestamp
45 mike 2008-02-15 15:31:14
56 mike 2008-02-15 15:30:56
67 mike 2008-02-17 13:31:14
34 mike 2010-07-30 22:31:14
56 bob 2009-07-30 22:37:14
67 bob 2009-07-30 22:37:14
22 tom 2010-07-30 22:37:14
78 fred 2010-07-30 22:37:14
I'd like a query that adds up the number for each person, then displays only the totals for people who have a recent entry, say within the last 60 minutes. The difficulty seems to be that although it's possible to use AND timestamp > now( ) - INTERVAL 600, this has the effect of stopping the full sum of the number.
The results I would want from the data above are
Mike 202
tom 22
fred 78
bob is not included because his latest entry is not recent enough: it's a year old! mike is valid, although he has several old entries, because he has one recent entry. The key point is that his total still adds up his full 'number', not just the entries within the time period.
Go on, get your head round that one in a single query! And thanks,
andy.
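You want a HAVING clause, which filters the groups after the sum has been computed:
-- table1 is a placeholder table name; person, number and timestamp are the
-- columns from the question. INTERVAL needs a unit: 600 SECOND is 10 minutes.
select person, sum(number), max(timestamp)
from table1
group by person
HAVING max(timestamp) > now( ) - INTERVAL 600 SECOND;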
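To check the HAVING approach against the question's sample data, here is a runnable sketch using Python's built-in sqlite3 module. Several things in it are assumptions made for the demo: it runs on SQLite, not MySQL, so the date arithmetic uses SQLite's datetime() modifier instead of INTERVAL; the table name entries and column name ts are made up; "now" is passed in as a fixed string so the result is reproducible; and the window is the 60 minutes mentioned in the question text.

```python
import sqlite3

# In-memory SQLite database loaded with the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (number INTEGER, person TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        (45, "mike", "2008-02-15 15:31:14"),
        (56, "mike", "2008-02-15 15:30:56"),
        (67, "mike", "2008-02-17 13:31:14"),
        (34, "mike", "2010-07-30 22:31:14"),
        (56, "bob", "2009-07-30 22:37:14"),
        (67, "bob", "2009-07-30 22:37:14"),
        (22, "tom", "2010-07-30 22:37:14"),
        (78, "fred", "2010-07-30 22:37:14"),
    ],
)

# Fixed "now" taken from the question, so the demo does not depend on the clock.
NOW = "2010-07-30 22:41:14"

# SUM runs over every row of each person; HAVING then drops the people
# whose most recent entry is older than 60 minutes before NOW.
query = """
    SELECT person, SUM(number) AS total
    FROM entries
    GROUP BY person
    HAVING MAX(ts) > datetime(?, '-60 minutes')
    ORDER BY person
"""
totals = conn.execute(query, (NOW,)).fetchall()
print(totals)
conn.close()
```

Because the filter sits in HAVING and not in WHERE, mike's old entries still count towards his total of 202, while bob is dropped entirely.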
andrew - in the spirit of education, I'm not going to show the full query (actually, I'm being lazy, but don't tell anyone) :).
Basically, though, you'd have to do a subselect within your main select. Roughly, it would be:
-- table1 is a placeholder table name. DISTINCT avoids one output row per recent entry.
select distinct t1.person,
(select sum(t2.number) from table1 t2 where t2.person=t1.person) as total
from table1 t1 where t1.timestamp > now( ) - INTERVAL 600 SECOND
That is a sketch that has not been tested, but you get the gist...
jim

Resources