I have a data.table and I'm trying to subset it so I get the lowest TEST_RESULT by ID and LAB_DT:
library(data.table)  # attaches setkeyv(), which is called unqualified below
DT <- data.table::data.table(ID=c("1","1","1","2","2","3","3","3"),
                             LAB_DT=lubridate::as_date(c("1992-11-01",
                                                         "1992-11-01",
                                                         "1992-11-02",
                                                         "1992-11-04",
                                                         "1992-11-04",
                                                         "1992-11-06",
                                                         "1992-11-06",
                                                         "1992-11-08")),
                             TEST_RESULT=c(5.4,5.8,5.2,5.6,6,6,7,8))
setkeyv(DT,c("ID","LAB_DT","TEST_RESULT"))
ID LAB_DT TEST_RESULT
1: 1 1992-11-01 5.4
2: 1 1992-11-01 5.8
3: 1 1992-11-02 5.2
4: 2 1992-11-04 5.6
5: 2 1992-11-04 6.0
6: 3 1992-11-06 6.0
7: 3 1992-11-06 7.0
8: 3 1992-11-08 8.0
I have already successfully done this via my own method:
DT[,FIRST.LAB_DT:=0]
DT[, FIRST.LAB_DT := c(1L, FIRST.LAB_DT[-1]), by = .(ID,LAB_DT)]
ID LAB_DT TEST_RESULT FIRST.LAB_DT
1: 1 1992-11-01 5.4 1
2: 1 1992-11-01 5.8 0
3: 1 1992-11-02 5.2 1
4: 2 1992-11-04 5.6 1
5: 2 1992-11-04 6.0 0
6: 3 1992-11-06 6.0 1
7: 3 1992-11-06 7.0 0
8: 3 1992-11-08 8.0 1
DT[FIRST.LAB_DT==1,]
ID LAB_DT TEST_RESULT FIRST.LAB_DT
1: 1 1992-11-01 5.4 1
2: 1 1992-11-02 5.2 1
3: 2 1992-11-04 5.6 1
4: 3 1992-11-06 6.0 1
5: 3 1992-11-08 8.0 1
However, the actual data.table I'm working with has ~8e6 rows and I recently found that you can subset data.table using the keys in a much faster manner.
Would anyone know how to produce the same final output using keys?
Since you report that your current approach works, your data is apparently sorted by increasing TEST_RESULT already. In this case, you can just drop duplicates to keep the first row per group:
unique(DT, by=c("ID", "LAB_DT"))
In general, if it is not sorted, you could roll from -Inf. Like your code, this selects only one row even if there are ties for the min value:
mDT = unique(DT[, .(ID, LAB_DT)])[, TEST_RESULT := -Inf]
DT[DT[mDT, on=names(mDT), roll=-Inf, which=TRUE]]
Finally, if you only have these three columns, you can just use the min function which is optimized for speed here (see ?GForce):
DT[, .(TEST_RESULT = min(TEST_RESULT)), by=.(ID, LAB_DT)]
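If the table has more columns than these three and all of them should be kept, one common data.table idiom (not taken from the answer above, so treat it as a sketch) is to collect the row number of each group's minimum with .I and subset by it. The sample data here is the question's, deliberately shuffled so that it is not sorted by TEST_RESULT; like the rolling join, it keeps a single row per group when there are ties.

```r
library(data.table)

# The question's data with TEST_RESULT shuffled within groups (assumption:
# done only to show that no prior sorting is required)
DT <- data.table(ID = c("1", "1", "1", "2", "2", "3", "3", "3"),
                 LAB_DT = as.Date(c("1992-11-01", "1992-11-01", "1992-11-02",
                                    "1992-11-04", "1992-11-04", "1992-11-06",
                                    "1992-11-06", "1992-11-08")),
                 TEST_RESULT = c(5.8, 5.4, 5.2, 6, 5.6, 7, 6, 8))

# .I holds the row numbers of the current group in the full table;
# which.min() picks the first minimum, so ties yield one row
idx <- DT[, .I[which.min(TEST_RESULT)], by = .(ID, LAB_DT)]$V1

res <- DT[idx]
```

Whether this beats the keyed approaches on ~8e6 rows would need to be benchmarked on the real data.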
library(dplyr)
DT %>%
group_by(ID,LAB_DT) %>%
slice(which.min(TEST_RESULT))
I'm working on some code where I need to find the maximum value over a set of columns and then update that maximum value. Consider this toy example:
test <- data.table(thing1=c('AAA','BBB','CCC','DDD','EEE'),
A=c(9,5,4,2,5),
B=c(2,7,2,6,3),
C=c(6,2,5,4,1),
ttl=c(1,1,3,2,1))
where the resulting data.table looks like this:
thing1 A B C ttl
AAA    9 2 6 1
BBB    5 7 2 1
CCC    4 2 5 3
DDD    2 6 4 2
EEE    5 3 1 1
The goal is to find the column (A, B, or C) with the maximum value and replace that value by the current value minus 0.1 times the value in the ttl column (i.e. new_value=old_value - 0.1*ttl). The other columns (not containing the maximum value) should remain the same. The resulting DT should look like this:
thing1 A   B   C   ttl
AAA    8.9 2   6   1
BBB    5   6.9 2   1
CCC    4   2   4.7 3
DDD    2   5.8 4   2
EEE    4.9 3   1   1
The "obvious" way of doing this is to write a for loop and loop through each row of the DT. That's easy enough to do and is what the code I'm adapting this from did. However, the real DT is much larger than my toy example and the for loop takes some time to run, which is why I'm trying to adapt the code to take advantage of vectorization and get rid of the loop.
Here's what I have so far:
test[,max_position:=names(.SD)[apply(.SD,1,function(x) which.max(x))],.SDcols=(2:4)]
test[,newmax:=get(max_position)-ttl*.1,by=1:nrow(test)]
which produces this DT:
thing1 A B C ttl max_position newmax
AAA    9 2 6 1   A            8.9
BBB    5 7 2 1   B            6.9
CCC    4 2 5 3   C            4.7
DDD    2 6 4 2   B            5.8
EEE    5 3 1 1   A            4.9
The problem comes in assigning the value of the newmax column back to where it needs to go. I naively tried this, along with some other things, which tells me that "'max_position' not found":
test[,(max_position):=newmax,by=1:nrow(test)]
It's straightforward to solve the problem by reshaping the DT, which is the solution I have in place for now (see below), but I worry that with my full DT two reshapes will be slow as well (though presumably better than the for loop). Any suggestions on how to make this work as intended?
Reshaping solution, for reference (gather() and spread() come from the tidyr package):
test[,max_position:=names(.SD)[apply(.SD,1,function(x) which.max(x))],.SDcols=(2:4)]
test[,newmax:=get(max_position)-ttl*.1,by=1:nrow(test)]
test <- setDT(gather(test,idgroup,val,c(A,B,C)))
test[,maxval:=max(val),by='thing1']
test[val==maxval,val:=newmax][,maxval:=NULL]
test <- setDT(spread(test,idgroup,val))
With the OP's code, replace can work
test[, (2:4) := replace(.SD, which.max(.SD), max(.SD, na.rm = TRUE) - 0.1 * ttl),
by = 1:nrow(test),.SDcols = 2:4]
-output
> test
thing1 A B C ttl
1: AAA 8.9 2.0 6.0 1
2: BBB 5.0 6.9 2.0 1
3: CCC 4.0 2.0 4.7 3
4: DDD 2.0 5.8 4.0 2
5: EEE 4.9 3.0 1.0 1
In base R, this may be faster with row/column indexing
test1 <- as.data.frame(test)
m1 <- cbind(seq_len(nrow(test1)), max.col(test1[2:4], "first"))
test1[2:4][m1] <- test1[2:4][m1] - 0.1 * test1$ttl
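The same idea can be kept inside data.table by looping over the three columns rather than over the rows. This is only a sketch under the assumption that the number of candidate columns is small; it combines max.col() with data.table's set(), and the names cols, pos and rows are introduced here for illustration.

```r
library(data.table)

test <- data.table(thing1 = c("AAA", "BBB", "CCC", "DDD", "EEE"),
                   A = c(9, 5, 4, 2, 5),
                   B = c(2, 7, 2, 6, 3),
                   C = c(6, 2, 5, 4, 1),
                   ttl = c(1, 1, 3, 2, 1))

cols <- c("A", "B", "C")

# For every row, the position (within cols) of the maximum;
# ties go to the first column
pos <- max.col(as.matrix(test[, cols, with = FALSE]), ties.method = "first")

# One pass per column: update only the rows whose maximum sits in that column
for (j in seq_along(cols)) {
  rows <- which(pos == j)
  if (length(rows)) {
    set(test, i = rows, j = cols[j],
        value = test[[cols[j]]][rows] - 0.1 * test$ttl[rows])
  }
}
```

Because set() modifies the table by reference, no reshaping or row-wise grouping is involved.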
In a data set, there is a specific column that has random values which repeat at a regular interval. I want to replace these with increasing values, as shown below.
Column_B has random data
Column_A Column_B
1.5 0
0.2 1
0.3 5
4.5 6
12.5 7
1.6 0
7.8 1
1.8 5
6.9 6
11.0 7
After transformation Column_B should have
Column_A Column_B
1.5 0
0.2 1
0.3 2
4.5 3
12.5 4
1.6 0
7.8 1
1.8 2
6.9 3
11.0 4
Is there a faster way to do this rather than creating a new column and then replacing it with Column_B? Thanks.
You can use recycling to fill the column with a repeating sequence. For example, if you want the sequence to be 64 long before repeating (and the number of rows is a multiple of 64), you can use
DF$Column_B <- 0:(64 - 1L)
More generally, for patterns like your example in which each element within the repeating sequence is distinct, you can find how long the sequence is, using which, then do the same thing
seq.length = which(dt$B == dt$B[1L])[2L] - 1L
dt$B = 0:(seq.length - 1L)
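As a hedged variation on the recycling idea, rep_len() builds the repeated sequence explicitly, so it also works when the number of rows is not an exact multiple of the period. The sketch below assumes a plain data frame named df1 holding the question's data; period is a name introduced here.

```r
df1 <- data.frame(Column_A = c(1.5, 0.2, 0.3, 4.5, 12.5, 1.6, 7.8, 1.8, 6.9, 11.0),
                  Column_B = c(0, 1, 5, 6, 7, 0, 1, 5, 6, 7))

# Length of one cycle: position of the second occurrence of the first value, minus 1
period <- which(df1$Column_B == df1$Column_B[1L])[2L] - 1L

# Repeat 0, 1, ..., period - 1 until every row is filled
df1$Column_B <- rep_len(0:(period - 1L), nrow(df1))
```

This still relies on each value being distinct within one cycle, as the answer notes.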
We group by the cumulative sum of a logical vector that marks where 'Column_B' is 0 (or where the value drops relative to the previous element), then assign the within-group row sequence to 'Column_B'
library(data.table)
setDT(df1)[, Column_B := as.integer(seq_len(.N)-1), cumsum(Column_B==0)]
df1
# Column_A Column_B
# 1: 1.5 0
# 2: 0.2 1
# 3: 0.3 2
# 4: 4.5 3
# 5: 12.5 4
# 6: 1.6 0
# 7: 7.8 1
# 8: 1.8 2
# 9: 6.9 3
#10: 11.0 4
Or find the difference between adjacent elements in 'Column_B', get the cumulative sum based on that to create the group_by variable
setDT(df1)[, Column_B := as.integer(seq_len(.N)-1), cumsum(c(TRUE, diff(Column_B)< 0))]
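To make the grouping step visible, here is a small base R sketch of the same technique: cumsum() over the "cycle starts here" flag yields a group id, and ave() numbers the rows inside each group. It assumes the question's data in a data frame called df1; grp is a name introduced for illustration.

```r
df1 <- data.frame(Column_A = c(1.5, 0.2, 0.3, 4.5, 12.5, 1.6, 7.8, 1.8, 6.9, 11.0),
                  Column_B = c(0, 1, 5, 6, 7, 0, 1, 5, 6, 7))

# A new group starts at every 0, so the running total is the group id:
# 1 1 1 1 1 2 2 2 2 2
grp <- cumsum(df1$Column_B == 0)

# Number the rows within each group, starting from 0
df1$Column_B <- ave(df1$Column_B, grp, FUN = seq_along) - 1
```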
I have the following data table:
id user V1 V2 V3 V4
1: 1 1 1 1 1 0
2: 1 2 4 1 3 1
3: 1 3 0 1 6 0
4: 2 1 1 0 2 1
5: 2 2 2 1 0 0
and I perform an lapply group by id calculation:
my_data[,lapply(.SD,mean)*.SD,by=id,.SDcols=3:5]
The result is the following:
id V1 V2 V3
1: 1 1.666667 1.0 3.333333
2: 1 6.666667 1.0 10.000000
3: 1 0.000000 1.0 20.000000
4: 2 1.500000 0.0 2.000000
5: 2 3.000000 0.5 0.000000
Is there an easy data table way to include the column user from the original data table?
I have managed to do it with
cbind(my_data[,.(user)], my_data[,lapply(.SD,mean)*.SD,by=id,.SDcols=3:5])
but I really hope there is a nicer way.
I suggest you go through the vignettes. The Introduction to data.table vignette explains an important point, which I'll repeat here:
As long as j returns a list, each element of the list will become a column in the resulting data.table.
In base R, c(list, list) returns a new list with all the elements. We can simply use that existing functionality to do:
require(data.table) # v1.9.7 devel
dt[, c(list(user=user), lapply(.SD, function(x) x*mean(x))), by=id, .SDcols=V1:V4]
I'm on the current development version of data.table, v1.9.7, which has certain new features, e.g., usage of V1:V4 in .SDcols.
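For reference, a self-contained sketch of the same "j returns a list" idea applied to the question's table and its three columns. It assumes the table is called my_data as in the question and names the columns explicitly in .SDcols so that it does not depend on the development-version syntax.

```r
library(data.table)

my_data <- data.table(id   = c(1, 1, 1, 2, 2),
                      user = c(1, 2, 3, 1, 2),
                      V1   = c(1, 4, 0, 1, 2),
                      V2   = c(1, 1, 1, 0, 1),
                      V3   = c(1, 3, 6, 2, 0),
                      V4   = c(0, 1, 0, 1, 0))

# c(list, list) is still a list, so every element becomes a column:
# 'user' is carried along next to the per-group x * mean(x) columns
res <- my_data[, c(list(user = user),
                   lapply(.SD, function(x) x * mean(x))),
               by = id, .SDcols = c("V1", "V2", "V3")]
```

The result has the columns id, user, V1, V2 and V3, with the same values as the cbind() version in the question.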
We can do the assignment
my_data[,(3:5) := lapply(.SD,mean)*.SD,by=id,.SDcols=3:5]
Or instead of multiplying by .SD, we do it within the loop itself.
my_data[, (3:5) := lapply(.SD, function(x) mean(x)*x), .SDcols = 3:5, by = id]
I am stuck with a probably simple question: how to sum consecutive duplicate rows and remove all but the first row. And, if there is an NA in between two duplicates (such as 2, NA, 2), also sum them and remove all but the first entry.
So far so good, here is my sample data
ia<-c(1,1,2,NA,2,1,1,1,1,2,1,2)
time<-c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a<-as.data.frame(cbind(ia, time))
sample output
a
ia time
1 1 4.5
2 1 2.4
3 2 3.6
4 NA 1.5
5 2 1.2
6 1 4.9
7 1 6.4
8 1 4.4
9 1 4.7
10 2 7.3
11 1 2.3
12 2 4.3
Now I want to
1.) sum the "time" column of consecutive ia's - i.e., sum the time if the number 1 occurs twice or more right after each other, in my case here sum first and second row of column time to 4.5+2.4.
2.) if there is a NA in between two numbers (ia column) which are the same (i.e., ia = 2, NA, 2), then also sum all of those times.
3.) keep only the first occurrence of the ia, and delete the rest.
In the end, I would want to have something like this:
a
ia time
1 1 6.9
3 2 6.3
6 1 20.4
10 2 7.3
11 1 2.3
12 2 4.3
I found this for summing, but it does not take into account the consecutive factor
aggregate(time~ia,data=a,FUN=sum)
and I found this for deleting
a[cumsum(rle(as.numeric(a[,1]))$lengths),]
although the rle approach keeps the last entry, and I would want to keep the first. I also have no idea how to handle the NAs.
If I have a pattern of 1-NA-2, then the NA should NOT be counted with either of them; in this case the NA row should be removed.
With data.table (as RHertel suggested for na.locf):
library(data.table)
library(zoo)
setDT(a)[na.locf(ia, fromLast=T)==na.locf(ia), sum(time), cumsum(c(T,!!diff(na.locf(ia))))]
# id V1
#1: 1 6.9
#2: 2 6.3
#3: 3 20.4
#4: 4 7.3
#5: 5 2.3
#6: 6 4.3
You first need to replace sequences of NAs with the values surrounding them (if they are the same). This answer shows zoo's na.locf function, which fills in NAs with the last observation. By testing whether it's the same when you carry values backwards or forwards, you can filter out the NAs you don't want, then do the carrying forward:
library(dplyr)
library(zoo)
a %>%
filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
mutate(ia = na.locf(ia))
#> ia time
#> 1 1 4.5
#> 2 1 2.4
#> 3 2 3.6
#> 4 2 1.5
#> 5 2 1.2
#> 6 1 4.9
#> 7 1 6.4
#> 8 1 4.4
#> 9 1 4.7
#> 10 2 7.3
#> 11 1 2.3
#> 12 2 4.3
Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:
result <- a %>%
filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
mutate(ia = na.locf(ia)) %>%
mutate(change = ia != lag(ia, default = FALSE)) %>%
group_by(group = cumsum(change), ia) %>%
summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#>
#> group ia time
#> (int) (dbl) (dbl)
#> 1 1 1 6.9
#> 2 2 2 6.3
#> 3 3 1 20.4
#> 4 4 2 7.3
#> 5 5 1 2.3
#> 6 6 2 4.3
If you want to get rid of the group column, use the additional lines:
result %>%
ungroup() %>%
select(-group)
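df <- a  # 'a' is the data frame from the question
nas <- which(is.na(df$ia))
# for each NA, locate the nearest non-NA row after and before it
add.index <- sapply(nas, function(x) {
  logi <- which(as.logical(df$ia))
  aft <- logi[logi > x][1]
  fore <- tail(logi[logi < x], 1)
  # fill from the following row only if both neighbours agree
  if (df$ia[aft] == df$ia[fore]) aft else NA
})
df$ia[nas] <- df$ia[add.index]
df <- df[complete.cases(df),]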
First we determine if the NA values of the column are surrounded by the same value. If yes, the surrounding value replaces the NA. There is no problem if the data has consecutive NA values.
Next we do a standard sum by group operation. cumsum allows us to create a unique group based on changes in the numbers.
df$grps <- cumsum(c(F, !df$ia[-length(df$ia)] == df$ia[-1]))+1
aggregate(time ~ grps, df, sum)
# grps time
# 1 1 6.9
# 2 2 6.3
# 3 3 20.4
# 4 4 7.3
# 5 5 2.3
# 6 6 4.3
This is a base R approach. With packages like dplyr, zoo, or data.table different options are available as they come built with specialized functions to do what we did here.
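As one illustration of such specialized functions, here is a hedged data.table sketch of the same two steps. It assumes a reasonably recent data.table that provides nafill() and rleid(); the helper columns fwd and bwd and the object names kept and res are introduced here and are not part of any answer above.

```r
library(data.table)

a <- data.table(ia   = c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2),
                time = c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3))

# Fill NAs forwards and backwards; an NA sits between equal values
# exactly when both fills agree
a[, fwd := nafill(ia, type = "locf")]
a[, bwd := nafill(ia, type = "nocb")]

# Drop NA rows whose neighbours differ (or that have no neighbour on one side)
kept <- a[!is.na(fwd) & !is.na(bwd) & fwd == bwd]

# rleid() gives each run of consecutive equal values its own id
res <- kept[, .(ia = fwd[1L], time = sum(time)), by = .(grp = rleid(fwd))]
```

On the sample data this gives six groups whose summed times match the output requested in the question.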
I'm currently working on incorporating elements of one data table with another "master" set, by some reference column. To make things clearer, I have created some sample data:
This is the dataset I am looking to join onto another "master set".
data.frame(refID = c(1,3,4,5,7,8), value = c(3.3,3.9,4.4,8.0,1.1,2.5))
refID value
1 3.3
3 3.9
4 4.4
5 8.0
7 1.1
8 2.5
The master set:
data.frame(refID = 1:9, value = rep(0,9))
refID value
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
I am basically looking to send the column of values from the first data set into the second, but where there are gaps, let them have a value of 0. Ultimately, I am hoping to get:
Resulting Set:
refID value
1 3.3
2 0.0
3 3.9
4 4.4
5 8.0
6 0.0
7 1.1
8 2.5
9 0.0
I've played around with some stuff in the dplyr and data.table packages but can't seem to really pinpoint a good and direct way of doing it. Advice would be greatly appreciated, many thanks.
Using data.table, you can replace values from the first data.table on to the second by reference as follows:
require(data.table)
# data
DT1 = data.table(refID = c(1,3,4,5,7,8), value = c(3.3,3.9,4.4,8.0,1.1,2.5))
DT2 = data.table(refID = 1:9, value = 0)
setkey(DT2, refID)
DT2[DT1, value := i.value]
# refID value
# 1: 1 3.3
# 2: 2 0.0
# 3: 3 3.9
# 4: 4 4.4
# 5: 5 8.0
# 6: 6 0.0
# 7: 7 1.1
# 8: 8 2.5
# 9: 9 0.0
Please refer to this post for explanation.
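The update join above boils down to "find each refID's row in the master set and overwrite value there". A minimal base R sketch of that lookup, using match(), may make the mechanics clearer; the names small, master and idx are introduced here, and the code assumes every refID in the small set exists in the master set.

```r
small  <- data.frame(refID = c(1, 3, 4, 5, 7, 8),
                     value = c(3.3, 3.9, 4.4, 8.0, 1.1, 2.5))
master <- data.frame(refID = 1:9, value = rep(0, 9))

# Row of the master set that corresponds to each refID of the small set
idx <- match(small$refID, master$refID)
stopifnot(!anyNA(idx))  # assumption: no refID is missing from the master set

# Overwrite only the matched rows; the others keep their 0
master$value[idx] <- small$value
```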
The way I can think of would be to temporarily name the values val1 and val2, do a full_join so non-matching values are NA, then mutate a value column using the presence of the NAs. i.e. something like:
library(dplyr)
df_1 <- data.frame(refID = c(1,3,4,5,7,8), v1 = c(3.3,3.9,4.4,8.0,1.1,2.5))
df_2 <- data.frame(refID = 1:9, v2 = 0)
df_merged <- df_1 %>%
  full_join(df_2, by = "refID") %>%
  mutate(value = ifelse(!is.na(v1), v1, v2)) %>% # or just 0 in this case
  select(refID, value)