I have a table which looks like this:
id | timestamp | value1 | value2
---|-----------|--------|-------
1  | 09:12:37  | 1      | 1
1  | 09:12:42  | 1      | 2
1  | 09:12:41  | 1      | 3
1  | 10:52:16  | 2      | 4
1  | 10:52:18  | 2      | 5
2  | 09:33:12  | 3      | 1
2  | 09:33:15  | 3      | 2
2  | 09:33:13  | 3      | 3
I need to group by id and value1. For each group I want the row with the highest timestamp.
The result for the table above would look like this:
id | timestamp | value1 | value2
---|-----------|--------|-------
1  | 09:12:42  | 1      | 2
2  | 09:33:15  | 3      | 2
I know there is the summarize operator which would give me this:
mytable
| project id, timestamp, value1, value2
| summarize max(timestamp) by id, value1
Result:
id | max_timestamp | value1
---|---------------|-------
1  | 09:12:42      | 1
2  | 09:33:15      | 3
But I was not able to get value2 for these rows as well.
Thanks in advance
If I understand your question correctly, you should be able to use summarize arg_max():
doc: https://learn.microsoft.com/en-us/azure/kusto/query/arg-max-aggfunction
datatable(id:long, timestamp:datetime, value1:long, value2:long)
[
1, datetime(2019-03-20 09:12:37), 1, 1,
1, datetime(2019-03-20 09:12:42), 1, 2,
1, datetime(2019-03-20 09:12:41), 1, 3,
1, datetime(2019-03-20 10:52:16), 2, 4,
1, datetime(2019-03-20 10:52:18), 2, 5, // this has the latest timestamp for id == 1
2, datetime(2019-03-20 09:33:12), 3, 1,
2, datetime(2019-03-20 09:33:15), 3, 2, // this has the latest timestamp for id == 2
2, datetime(2019-03-20 09:33:13), 3, 3,
]
| summarize arg_max(timestamp, *) by id
This will result in:
| id | timestamp | value1 | value2 |
|----|-----------------------------|--------|--------|
| 2 | 2019-03-20 09:33:15.0000000 | 3 | 2 |
| 1 | 2019-03-20 10:52:18.0000000 | 2 | 5 |
I found a solution to my problem, but there might be a better one.
mytable
| project id, timestamp, value1, value2
| order by timestamp desc
| summarize max(timestamp), makelist(value2) by id, value1
Results in:
id | max_timestamp | value1 | list_value2
---|---------------|--------|----------------
1  | 09:12:42      | 1      | ["2", "3", "1"]
2  | 09:33:15      | 3      | ["2", "3", "1"]
Now you can extend the query by adding
| project max_timestamp, id, value1, list_value2[0]
to get the first element from that list. Replace '0' by any number between 0 and length(list_value2)-1 to access the other values.
One more note:
The timestamp I use is the one generated by Application Insights. In our code we call TrackTrace to log some data. If you order the rows by this timestamp, the resulting list of rows is not guaranteed to be in the same order in which the data was produced in code.
Related
I have a table some_table:
id | some_integer
---|-------------
1 | ?
2 | ?
3 | ?
How do I (and can it be done in one query) update some_table to:
Where id is 1, increase some_integer by 1
Where id is 2, decrease some_integer by 2
Where id is 3, increase some_integer by 3
To make some_table like:
id | some_integer
---|-------------
1 | ? + 1
2 | ? - 2
3 | ? + 3
Thanks!
UPDATE some_table
SET some_integer = some_integer + (CASE id WHEN 1 THEN 1 WHEN 2 THEN -2 WHEN 3 THEN 3 END)
WHERE id IN (1,2,3);
Use a CASE expression:
UPDATE yourTable
SET some_integer = CASE WHEN id = 1 THEN some_integer + 1
                        WHEN id = 2 THEN some_integer - 2
WHEN id = 3 THEN some_integer + 3 END
WHERE id IN (1, 2, 3);
I have a table like this:
COL1 | COL2
------------
1 | NULL
2 | NULL
3 | NULL
4 | NULL
How can I use SQL to update COL2 so that it holds the running total of all previous rows? Like this:
COL1 | COL2
------------
1 | 1
2 | 3
3 | 6
4 | 10
Thanks.
Got the answer from my colleague (assume the table name is abc):
UPDATE abc
SET col2 = (
    SELECT temp.t
    FROM (SELECT abc.id, SUM(def.col1) AS t
          FROM abc
          JOIN abc AS def ON def.id <= abc.id
          GROUP BY abc.id) AS temp
    WHERE abc.id = temp.id
)
Or we can use this:
REPLACE INTO abc
SELECT abc.id, r2.col1, SUM(r2.col1) AS col2
FROM abc
JOIN abc AS r2 ON r2.id <= abc.id
GROUP BY abc.id
I have multiple columns that I would like to collapse into a single column, along with a ranking column and a count column. The columns have an uneven number of rows.
Example
column 1 | column 2 | column 3 | column 4 |
1 | 2 | 3 | 4 |
1 | 2 | 3 | |
1 | 2 | | |
1 | | | |
2 | 3 | 4 | 5 |
2 | 3 | 4 | |
2 | 3 | | |
2 | | | |
What I'm trying to do is get one column with all the unique numbers, plus a ranking and a count column.
Column 1 holds all the unique numbers (here 1 to 5), organized by the ranking.
Ranking just goes from the highest count to the lowest count: in this example 2 occurs the most (7 times) and 5 the least (once), so 2 is rank number 1.
Count is simply how many times each number occurs: 2 occurs 7 times in total, 3 occurs 5 times.
Column 1 | Ranking | Count |
2 | 1 | 7 |
3 | 2 | 5 |
1 | 3 | 4 |
4 | 4 | 3 |
5 | 5 | 1 |
This is what I have tried so far, but I have a lot more work to do.
df <- read.csv("df.csv", header = TRUE, strip.white =TRUE, stringsAsFactors = FALSE)
uniquedel <- unique(df)
write.csv(uniquedel, file = "/Users/uniqueRSA.csv")
Any help you can give would be appreciated. Thanks.
Since it doesn't seem to matter where numbers are located, you can use unlist to just get all the values as a single numeric vector. table will then count occurrences for you; you can coerce it into a data.frame to give you two of the three columns you want. You can now use order to make a Ranking column, but since it's a permutation of indices instead of a rank, you'll need to order the order to get it back in the same order as your rows. All told, where df is the original data.frame:
df2 <- data.frame(table(unlist(df)))
df2$Ranking <- order(order(df2$Freq, decreasing = T))
gives you
> df2
Var1 Freq Ranking
1 1 4 3
2 2 7 1
3 3 5 2
4 4 3 4
5 5 1 5
If you want it ordered by Ranking, index it by order(df2$Ranking). There are lots of other possible ways to go about this, too. rank would be really useful, except in base it is only ascending instead of descending, and thus would also take some manipulation.
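For example, a small sketch of that manipulation: rank the negated counts so the highest count gets rank 1 (ties.method = "min" keeps tied counts on the same rank).
# rank() is ascending, so rank the negated counts: highest Freq -> rank 1
df2$Ranking <- rank(-df2$Freq, ties.method = "min")
# and, if desired, order the table so rank 1 comes first
df2[order(df2$Ranking), ]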
Data:
df <- structure(list(column.1 = c(1, 1, 1, 1, 2, 2, 2, 2), column.2 = c(2,
2, 2, NA, 3, 3, 3, NA), column.3 = c(3, 3, NA, NA, 4, 4, NA,
NA), column.4 = c(4, NA, NA, NA, 5, NA, NA, NA)), .Names = c("column.1",
"column.2", "column.3", "column.4"), row.names = c(NA, -8L), class = "data.frame")
As far as I understand, you simply want to tabulate the counts for each integer value in the original data, irrespective of the column it occurs in, and then order the table by the ranks of these counts.
# make sample data, like yours
# note your example contains missing/empty cells
df <- data.frame(matrix(sample(1:5, 4*8, replace=T),ncol=4,nrow=8))
# tabulate and rank, note ranks can be fractional in case of ties
tab <- table(unlist(df))
data.frame(tab,rank(tab))[order(rank(tab), decreasing=TRUE),]
Var1 Freq rank.tab.
1 1 3 1.0
4 4 5 2.0
2 2 6 3.0
3 3 9 4.5
5 5 9 4.5
Note, what you define as Rank seems to be the inverse of how R defines it: x < y <=> rank(x) < rank(y). I have answered the literal phrasing of your question.
You might be tempted to use:
# data.frame(tab,order(tab, decreasing=TRUE))[order(order(tab,decreasing=TRUE)),]
to reproduce your sample output; however, this does not handle ties well.
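To see the tie problem concretely, here is a small illustrative sketch (the numbers are arbitrary):
x <- c(9, 9, 6)
# order(order(...)) gives the tied values distinct, arbitrary ranks
order(order(x, decreasing = TRUE))   # 1 2 3
# rank() on the negated values keeps ties on the same rank
rank(-x, ties.method = "min")        # 1 1 3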
Another option is to use the following:
data.frame(tab,nrow(tab)-rank(tab))[order(rank(tab),decreasing=TRUE),]
Var1 Freq nrow.tab....rank.tab.
3 3 9 0.5
5 5 9 0.5
2 2 6 2.0
4 4 5 3.0
1 1 3 4.0
using your non-standard definition of rank.
Consider this dataframe:
col1 | col2
1 | 1
1 | 2
1 | 3
2 | 4
2 | 5
2 | 6
I want to add a new column, say col3, to the dataframe, with the following definition: the ith element col3[i] is the mean of all values col2[j] over all j such that col1[i] == col1[j] and i != j.
The for loop for it goes like this:
# for each row i: average col2 over all other rows j with the same col1 value
for (i in 1:length(data$col2))
{
  sum = 0
  count = 0
  for (j in 1:length(data$col1))
  {
    if (data$col1[j] == data$col1[i] && i != j)
    {
      sum = sum + data$col2[j]
      count = count + 1
    }
  }
  data$col3[i] = sum / count
}
The final table is :
col1 | col2 | col3
1 | 1 | 2.5
1 | 2 | 2
1 | 3 | 1.5
2 | 4 | 5.5
2 | 5 | 5
2 | 6 | 4.5
I could use an apply function, but that would take pretty much as long as the for loop, right? Any help with a vectorized version of this loop is appreciated.
You can use dplyr:
library(dplyr)
dat %>% group_by(col1) %>%
mutate(col3 = (sum(col2) - col2)/(n()-1))
Source: local data frame [6 x 3]
Groups: col1 [2]
col1 col2 col3
(int) (int) (dbl)
1 1 1 2.5
2 1 2 2.0
3 1 3 1.5
4 2 4 5.5
5 2 5 5.0
6 2 6 4.5
This can be done with ave from base R
df1$col3 <- with(df1, ave(col2, col1,
FUN=function(x) (sum(x)-x)/(length(x)-1)))
Or using data.table
library(data.table)
setDT(df1)[, col3 := (sum(col2)-col2)/(.N-1) , col1]
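For completeness, a minimal construction of the question's data frame (dat and df1 are simply the names the different answers above use for the same data):
# the question's example data
dat <- data.frame(col1 = c(1, 1, 1, 2, 2, 2),
                  col2 = c(1, 2, 3, 4, 5, 6))
df1 <- dat  # the ave/data.table answers refer to this same data as df1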
Thanks in advance for looking.
I have a data frame of Events(EV):
Event_ID | Person_ID | Start_Period | End_Period | Event_Type
------------------------------------------------------------
A | Person1 | 1 | 9 | Assessment
B | Person1 | 2 | 9 | Activity
C | Person1 | 3 | 6 | Assessment
D | Person2 | 3 | 6 | Activity
E | Person3 | 7 | 13 | Assessment
And I have a data frame of Person-Periods (PP)
Person_ID | Period
----------------------
Person1 | 1
Person1 | 2
Person1 | 3
Person2 | 1
Person2 | 2
Person2 | 3
Person3 | 1
Person3 | 2
Person3 | 3
I want to find out, for each period, how many activities or assessments were ongoing during that period. For example, if an event for Person1 in EV had a start period of 5 and an end period of 10, then this event should show up in periods 5, 6, 7, 8, 9 and 10 in PP. The result would look like this:
Person_ID | Period | ActivitiesFreq | AssessmentsFreq
----------------------------------------------
Person1 | 1 | 0 | 1
Person1 | 2 | 1 | 1
Person1 | 3 | 1 | 2
Person2 | 1 | 0 | 0
Person2 | 2 | 0 | 0
Person2 | 3 | 1 | 0
Person3 | 1 | 0 | 0
Person3 | 2 | 0 | 0
Person3 | 3 | 0 | 0
At the moment I'm using a for loop, which is slow. And I'm resisting a join because the full dataset has hundreds of thousands of rows. I've tried using mutate from the dplyr package:
mutate(PP, SUM(EV$Person_ID == Person_ID, EV$Start_Period <= Period, EV$End_Period >= Period))
but I get the following error:
Warning messages:
1: In mutate_impl(.data, dots) :
is.na() applied to non-(list or vector) of type 'NULL'
2: In mutate_impl(.data, dots) :
longer object length is not a multiple of shorter object length
3: In mutate_impl(.data, dots) :
longer object length is not a multiple of shorter object length
I'm open to using other packages. I think I don't quite understand something about the way mutate works.
Here's a solution using data.table v1.9.5 (current devel version). I'm using it for the new on= feature that allows joins without having to set keys:
require(data.table) # v1.9.5+
ans = setDT(df2)[df1, .(Period, Event_Type,
isBetween = Period %between% c(Start_Period, End_Period)),
by = .EACHI, on = "Person_ID", nomatch = 0L]
dcast(ans, Person_ID + Period ~ Event_Type, fun.aggregate = sum)
# Using 'isBetween' as value column. Use 'value.var' to override
# Person_ID Period Activity Assessment
# 1: Person1 1 0 1
# 2: Person1 2 1 1
# 3: Person1 3 1 2
# 4: Person2 1 0 0
# 5: Person2 2 0 0
# 6: Person2 3 1 0
# 7: Person3 1 0 0
# 8: Person3 2 0 0
# 9: Person3 3 0 0
How it works:
setDT() converts a data.frame to data.table in-place (by reference).
setDT(df2)[df1, on = "Person_ID"] performs a join operation on column Person_ID. For each row in df1, the corresponding matching rows in df2 are computed, and all columns corresponding to those matching rows are extracted.
setDT(df2)[df1, on = "Person_ID", nomatch = 0L], as you might have guessed only returns matching rows, and leaves out those rows of Person_ID in df1 where there is no match in df2.
The by = .EACHI part is quite useful and powerful argument. It helps to compute the expression we provide in j, the second argument within [], for each row in df1.
For example, consider the 2nd row of df1. Joining on Person_ID, it matches with rows 1,2,3 of df2. And by = .EACHI will execute the expression provided within .(), which will return Period = 1,2,3, Event_Type = "Activity"and isBetween = FALSE,TRUE,TRUE. Event_Type is recycled to fit the length of the longest vector (= 3).
Essentially, we are joining and computing at the same time. This is a feature (only?) in data.table, where joins are considered as extensions of subset operations. Since we can compute while subsetting and grouping, we can do exactly the same while joining as well. This is both fast and memory efficient, as the entire join doesn't have to be materialised.
To understand it better, try computing what the j expression will result in for the last row.
Then, have a look at ans, and the result should be obvious.
Then we have one last step to do, and that is to count the number of Activity and Assessment events for each Person_ID, Period and have them as separate columns. This can be done in one step using the dcast function.
The formula implies that for each Person_ID, Period, we'd like to sum() the values of isBetween, as a separate column, for each unique value of Event_Type.
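If you want to run this end to end, here is a minimal construction of the example data, assuming df1 corresponds to the question's EV table and df2 to its PP table:
# EV from the question (called df1 above)
df1 <- data.frame(
  Event_ID     = c("A", "B", "C", "D", "E"),
  Person_ID    = c("Person1", "Person1", "Person1", "Person2", "Person3"),
  Start_Period = c(1, 2, 3, 3, 7),
  End_Period   = c(9, 9, 6, 6, 13),
  Event_Type   = c("Assessment", "Activity", "Assessment", "Activity", "Assessment"),
  stringsAsFactors = FALSE
)
# PP from the question (called df2 above; setDT() converts it to a data.table in-place)
df2 <- data.frame(
  Person_ID = rep(c("Person1", "Person2", "Person3"), each = 3),
  Period    = rep(1:3, times = 3),
  stringsAsFactors = FALSE
)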
I haven't come up with a way to do this without joining datasets. Here is a dplyr-based solution using left_join to join the datasets first (I took only the three columns from EV needed for the task).
Once the dataset are joined, you can just group the dataset by Person_ID and calculate the cumulative sum of the two types of events. I threw in an arrange in case the real dataset wasn't in order by Period within Person_ID and removed the Event_Type column within mutate.
library(dplyr)
PP %>%
left_join(., select(EV, -Event_ID, -End_Period), by = c("Person_ID", "Period" = "Start_Period")) %>%
group_by(Person_ID) %>%
arrange(Period) %>%
mutate(ActivitiesFreq = cumsum(Event_Type == "Activity" & !is.na(Event_Type)),
AssessmentFreq = cumsum(Event_Type == "Assessment" & !is.na(Event_Type)),
Event_Type = NULL)
Source: local data frame [9 x 4]
Groups: Person_ID
Person_ID Period ActivitiesFreq AssessmentFreq
1 Person1 1 0 1
2 Person1 2 1 1
3 Person1 3 1 2
4 Person2 1 0 0
5 Person2 2 0 0
6 Person2 3 1 0
7 Person3 1 0 0
8 Person3 2 0 0
9 Person3 3 0 0
Here is a potential solution (a sketch of it follows the steps below):
Left join PP and EV (dplyr::left_join) on Person_ID and Period
Group by person and period: dplyr::group_by(Person_ID, Period)
Count the number of values using dplyr::summarise()
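A rough sketch along those lines, assuming EV and PP are the data frames from the question; it joins on Person_ID only (EV carries a start/end range rather than a single period column) and counts each event type whose range covers the period:
library(dplyr)
result <- PP %>%
  left_join(EV, by = "Person_ID") %>%   # every event of a person, repeated for each of their periods
  group_by(Person_ID, Period) %>%
  summarise(
    ActivitiesFreq  = sum(Event_Type == "Activity" &
                            Period >= Start_Period & Period <= End_Period, na.rm = TRUE),
    AssessmentsFreq = sum(Event_Type == "Assessment" &
                            Period >= Start_Period & Period <= End_Period, na.rm = TRUE)
  ) %>%
  ungroup()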