I can't get my head around something that looks obvious...
library(data.table)
DT1<-data.table(MyDate=as.Date(rep("2019-02-01")),MyName=c("John","Peter","Paul"),Rate=c(210,180,190))
DT2<-data.table(MyDate=seq(as.Date("2019-01-27"),as.Date("2019-02-03"),by="days"))
setkey(DT1,MyDate)
setkey(DT2,MyDate)
I would like the rates for John, Peter and Paul to be rolled forward to the end of the date range. When I do
DT1[DT2,on=.(MyDate),roll=TRUE]
I get :
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 Peter 180
10: 2019-02-03 Peter 180
While I want this :
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 John 210
10: 2019-02-02 Paul 190
11: 2019-02-02 Peter 180
12: 2019-02-03 John 210
13: 2019-02-03 Paul 190
14: 2019-02-03 Peter 180
It's obvious I'm overlooking something.
A convoluted way (found by trial and error):
DT1[DT2, on=.(MyDate <= MyDate), allow.cartesian = TRUE]
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Peter 180
8: 2019-02-01 Paul 190
9: 2019-02-02 John 210
10: 2019-02-02 Peter 180
11: 2019-02-02 Paul 190
12: 2019-02-03 John 210
13: 2019-02-03 Peter 180
14: 2019-02-03 Paul 190
The difficult part was the cross-join-esque rows you need after a matching date but not before that matching date. I think the steps below get at this issue.
Perform a rolling join for each Name, then change the MyName column around and filter for resulting unique lines.
library(magrittr)
DT1[, .SD[DT2, roll = TRUE], by = MyName][
, MyName := ifelse(is.na(Rate), NA, MyName)
][order(MyDate, MyName), .(MyDate, MyName, Rate)] %>%
unique()
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 John 210
10: 2019-02-02 Paul 190
11: 2019-02-02 Peter 180
12: 2019-02-03 John 210
13: 2019-02-03 Paul 190
14: 2019-02-03 Peter 180
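As a possible alternative to the per-name rolling join above, here is a minimal sketch that first builds every (name, date) combination with CJ() and then rolls within each name by putting MyName in front of MyDate in `on`. The helper tables `grid` and `rolled` are names made up for this illustration. Unlike the desired output, the unfiltered result has one NA row per name (not one per date) before the first rate, so the last line keeps only the carried-forward rows.

```r
library(data.table)

DT1 <- data.table(MyDate = as.Date("2019-02-01"),
                  MyName = c("John", "Peter", "Paul"),
                  Rate   = c(210, 180, 190))
DT2 <- data.table(MyDate = seq(as.Date("2019-01-27"), as.Date("2019-02-03"), by = "days"))

# Hypothetical helper table: one row per (name, date) combination
grid <- CJ(MyName = DT1$MyName, MyDate = DT2$MyDate)

# roll = TRUE applies to the last column listed in `on`, here MyDate,
# so each name's rate is rolled forward independently of the others
rolled <- DT1[grid, on = .(MyName, MyDate), roll = TRUE]

# Keep only the rows where a rate was actually carried forward
rolled[!is.na(Rate)][order(MyDate, MyName)]
```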
Related
I have the data set below. As you can see, some months are missing (my data frame should be quarterly). I need to add the two missing quarters, 2020-04-01 and 2020-07-01, but the columns GEO and NAICS should repeat the existing values; only the VALUE column should be filled with NA. Is there a way of doing this? I am using the following code, but it is not working...
REF_DATE    GEO  NAICS   VALUE
2020-01-01  AB   fishin     33
2020-01-01  AB   mining    233
2020-01-01  AB   constr     53
2020-01-01  BC   fishin    353
2020-01-01  BC   mining    253
2020-01-01  BC   constr    953
2020-10-01  AB   fishin     33
2020-10-01  AB   mining    293
2020-10-01  AB   constn    343
2020-10-01  BC   fishin    633
2020-10-01  BC   mining    363
2020-10-01  BC   constr    523
I should have these data inserted
REF_DATE    GEO  NAICS   VALUE
2020-04-01  AB   fishin     NA
2020-04-01  AB   mining     NA
2020-04-01  AB   constr     NA
2020-04-01  BC   fishin     NA
2020-04-01  BC   mining     NA
2020-04-01  BC   constr     NA
A simple solution could be
df_mod <- df
df_mod$VALUE <- NA
df_mod[df_mod$REF_DATE == "2020-01-01",1] <- as.Date("2020-04-01")
df_mod[df_mod$REF_DATE == "2020-10-01",1] <- as.Date("2020-07-01")
rbind(df, df_mod)
This returns
> rbind(df, df_mod)
REF_DATE GEO NAICS VALUE
1: 2020-01-01 AB fishin 33
2: 2020-01-01 AB mining 233
3: 2020-01-01 AB constr 53
4: 2020-01-01 BC fishin 353
5: 2020-01-01 BC mining 253
6: 2020-01-01 BC constr 953
7: 2020-10-01 AB fishin 33
8: 2020-10-01 AB mining 293
9: 2020-10-01 AB constn 343
10: 2020-10-01 BC fishin 633
11: 2020-10-01 BC mining 363
12: 2020-10-01 BC constr 523
13: 2020-04-01 AB fishin NA
14: 2020-04-01 AB mining NA
15: 2020-04-01 AB constr NA
16: 2020-04-01 BC fishin NA
17: 2020-04-01 BC mining NA
18: 2020-04-01 BC constr NA
19: 2020-07-01 AB fishin NA
20: 2020-07-01 AB mining NA
21: 2020-07-01 AB constn NA
22: 2020-07-01 BC fishin NA
23: 2020-07-01 BC mining NA
24: 2020-07-01 BC constr NA
REF_DATE GEO NAICS VALUE
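A more general approach, sketched here under assumptions, is to build the full grid of quarters, GEO and NAICS values and join the existing data onto it, so that missing quarters appear with NA in VALUE. The data below is a shortened, hypothetical stand-in for the question's table, and `all_quarters`, `grid` and `full` are illustrative names. Note that CJ() creates every combination of the unique values, so a misspelled level (such as "constn" in the original data) would produce extra rows.

```r
library(data.table)

# Hypothetical, shortened version of the data in the question
df <- data.table(
  REF_DATE = as.Date(rep(c("2020-01-01", "2020-10-01"), each = 2)),
  GEO      = "AB",
  NAICS    = rep(c("fishin", "mining"), times = 2),
  VALUE    = c(33, 233, 33, 293)
)

# Every quarter between the first and last observed date
all_quarters <- seq(min(df$REF_DATE), max(df$REF_DATE), by = "quarter")

# Full grid of quarter x GEO x NAICS combinations
grid <- CJ(REF_DATE = all_quarters, GEO = unique(df$GEO), NAICS = unique(df$NAICS))

# Join the data onto the grid; quarters without data get VALUE = NA
full <- df[grid, on = .(REF_DATE, GEO, NAICS)]
full
```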
I need to create a new column, "new_value", based on the ID "ï..record_id": all rows with the same ID should have the same value in "date_eortc".
My data1 looks like this:
data1 %>%
select( ï..record_id, dato1, galbeta_date, date_eortc)
> ï..record_id dato1 galbeta_date date_eortc
1 1 <NA> <NA> <NA>
2 1 <NA> <NA> <NA>
3 1 <NA> 2018-01-16 <NA>
.....
99 10 2018-02-07 <NA> 2017-12-27
100 10 <NA> <NA> <NA>
101 10 <NA> <NA> <NA>
102 10 <NA> 2017-12-19 <NA>
103 10 <NA> 2017-12-26 <NA>
104 10 <NA> 2017-12-29 <NA>
105 10 <NA> 2018-01-02 <NA>
106 10 <NA> <NA> <NA>
107 10 <NA> <NA> <NA>
108 11 <NA> <NA> <NA>
In this case, for all rows with "ï..record_id" = 10, date_eortc should be "2017-12-27".
So it would look like:
ï..record_id dato1 galbeta_date date_eortc
99 10 2018-02-07 <NA> 2017-12-27
100 10 <NA> <NA> 2017-12-27
101 10 <NA> <NA> 2017-12-27
102 10 <NA> 2017-12-19 2017-12-27
103 10 <NA> 2017-12-26 2017-12-27
104 10 <NA> 2017-12-29 2017-12-27
105 10 <NA> 2018-01-02 2017-12-27
106 10 <NA> <NA> 2017-12-27
107 10 <NA> <NA> 2017-12-27
108 11 <NA> <NA> <NA>
I have tried to make an ifelse statement, but it's not the right one...
data2 <- data1 %>%
mutate(new_value= ifelse(ï..record_id == ï..record_id , date_eortc, NA))
I hope it makes sense.
Thank you for your time,
Julie
We could group by ï..record_id and fill the NA elements in 'date_eortc' with the adjacent non-NA element
library(dplyr)
library(tidyr)
data1 %>%
group_by(ï..record_id) %>%
fill(date_eortc)
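By default fill() only carries values downwards, which is enough when the non-NA date is the first row of its group, as it is for ID 10 above. A minimal sketch, assuming the date can sit anywhere within a group, uses `.direction = "downup"`; the toy tibble and the plain column name `record_id` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the only known date for ID 10 is in the middle of the group
data1 <- tibble(
  record_id  = c(1, 1, 10, 10, 10, 11),
  date_eortc = as.Date(c(NA, NA, NA, "2017-12-27", NA, NA))
)

filled <- data1 %>%
  group_by(record_id) %>%
  # fill down first, then up, so rows before the known date are covered too
  fill(date_eortc, .direction = "downup") %>%
  ungroup()

filled
```

Groups without any non-NA date (IDs 1 and 11 here) stay NA.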
In an earlier question, I learned that graphs are useful to collapse these data
require(data.table)
set.seed(333)
t <- data.table(old=1002:2001, dif=sample(1:10,1000, replace=TRUE))
t$new <- t$old + t$dif; t$foo <- rnorm(1000); t$dif <- NULL
> head(t)
old new foo
1: 1002 1007 -0.7889534
2: 1003 1004 0.3901869
3: 1004 1014 0.7907947
4: 1005 1011 2.0964612
5: 1006 1007 1.1834171
6: 1007 1015 1.1397910
to obtain only those rows such that new[i] = old[i-1]. The result could then be joined into a table with users who each have their own starting points
i <- data.table(id=1:3, start=sample(1000:1990,3))
> i
id start
1: 1 1002
2: 2 1744
3: 3 1656
Specifically, when only the first n=3 steps are calculated, the solution was
> library(igraph)
> i[, t[old %in% subcomponent(g, start, "out")[1:n]], by=.(id)]
id old new foo
1: 1 1002 1007 -0.7889534
2: 1 1007 1015 1.1397910
3: 1 1015 1022 -1.2193666
4: 2 1744 1750 -0.1368320
5: 2 1750 1758 0.3331686
6: 2 1758 1763 1.3040357
7: 3 1656 1659 -0.1556208
8: 3 1659 1663 0.1663042
9: 3 1663 1669 0.3781835
When implementing this when the setup is the same but new, old, and start are POSIXct class,
set.seed(333)
u <- data.table(old=seq(from=as.POSIXct("2013-01-01"),
to=as.POSIXct("2013-01-02"), by="15 mins"),
dif=as.difftime(sample(seq(15,120,15),97,replace=TRUE),units="mins"))
u$new <- u$old + u$dif; u$foo <- rnorm(97); u$dif <- NULL
j <- data.table(id=1:3, start=sample(seq(from=as.POSIXct("2013-01-01"),
to=as.POSIXct("2013-01-01 22:00:00"), by="15 mins"),3))
> head(u)
old new foo
1: 2013-01-01 00:00:00 2013-01-01 01:00:00 -1.5434407
2: 2013-01-01 00:15:00 2013-01-01 00:30:00 -0.2753971
3: 2013-01-01 00:30:00 2013-01-01 02:30:00 -1.5986916
4: 2013-01-01 00:45:00 2013-01-01 02:00:00 -0.6288528
5: 2013-01-01 01:00:00 2013-01-01 01:15:00 -0.8967041
6: 2013-01-01 01:15:00 2013-01-01 02:45:00 -1.2145590
> j
id start
1: 1 2013-01-01 22:00:00
2: 2 2013-01-01 21:00:00
3: 3 2013-01-01 13:30:00
the command
> j[, u[old %in% subcomponent(h, V(h)$name %in% as.character(start), "out")[1:n]], by=.(id)]
Empty data.table (0 rows and 4 cols): id,old,new,foo
returns an empty data.table, which appears to be due to the inner part u[...]. I do not quite see where the problem is in this case and wonder whether anyone spots a mistake.
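This is not a diagnosis of the igraph call above. As a cross-check that sidesteps vertex names altogether, here is a minimal sketch that follows the chain (each step's `new` becomes the next step's `old`) with a plain loop. The table `steps` and the function `follow_chain` are hypothetical, and the sketch assumes each `old` value occurs at most once; since it only uses `==`, the same logic should carry over to POSIXct columns.

```r
library(data.table)

# Hypothetical toy table of steps
steps <- data.table(old = c(1, 2, 3, 5, 8), new = c(3, 3, 5, 8, 9))

# Follow the chain from `start` for at most `n` steps
follow_chain <- function(dt, start, n) {
  out <- vector("list", n)
  current <- start
  for (k in seq_len(n)) {
    row <- dt[old == current]      # the step that begins at the current point
    if (nrow(row) == 0L) break     # chain ends: no step starts here
    out[[k]] <- row
    current <- row$new             # the next step starts where this one ended
  }
  rbindlist(out)                   # NULL entries (unused slots) are skipped
}

res <- follow_chain(steps, start = 1, n = 3)
res
```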
I have the following data frame:
Date.POSIXct Date WeekDay DayCategory Hour Holidays value
1 2018-05-01 00:00:00 2018-05-01 MA MA-MI-JU 0 0 30
2 2018-05-01 01:00:00 2018-05-01 MA MA-MI-JU 1 0 80
3 2018-05-01 02:00:00 2018-05-01 MA MA-MI-JU 2 0 42
4 2018-05-01 03:00:00 2018-05-01 MA MA-MI-JU 3 0 90
5 2018-05-01 04:00:00 2018-05-01 MA MA-MI-JU 4 0 95
6 2018-05-01 05:00:00 2018-05-01 MA MA-MI-JU 5 0 5
DayCategory groups days of the week in the following way: Mondays goes to LU DayCategory. Tuesday, Wednesday and Thursdays go to MA-MI-JU DayCategory.
Friday goes to VI, Saturdays to SA and Sundays to DO Categories respectively.
I would like to find the value for the same hour in the previous day (Date) with the same DayCategory, while Holidays remains within the same group (e.g. if one instance has Holidays 0 but the previous day with the same DayCategory has 1, we should look for the one before that, and so on).
As an intermediate step, and to understand the process, I would like to add a column PreviousDaySameDayCategory with the Date of the previous day that has the same DayCategory as the corresponding row. Sometimes it will be the same date minus seven days ("LU", "VI", "SA", "DO"), but for other days it will be just one day earlier.
Reproducible data:
library(lubridate)
Date.POSIXct <- seq(as.POSIXct("2018-05-01"), as.POSIXct("2018-05-31"), "hour")
mydf <- as.data.frame(Date.POSIXct)
mydf$Date <- as.Date(substr(as.character(mydf$Date.POSIXct),1,10))
mydf$WeekDay <- substr(toupper((weekdays(mydf$Date))),1,2)
mydf$DayCategory <-as.factor(ifelse(mydf$WeekDay == "MA" | mydf$WeekDay == "MI" | mydf$WeekDay == "JU",
"MA-MI-JU", mydf$WeekDay))
mydf$Hour <- hour(mydf$Date.POSIXct)
mydf$Holidays <- c(rep(0, 24*7),rep(1, 24*7), rep(0, 24*16+1))
set.seed(123)
mydf$myvalue <- sample.int(101,size=nrow(mydf),replace=TRUE)
I have manually worked out the first days and created a vector showing what the solution should look like:
a <- rep(NA, 24)
b <- mydf$myvalue[1:24]
c <- mydf$myvalue[25:48]
d <- rep(NA, 24)
e <- rep(NA,24)
f <- rep(NA,24)
g <- rep(NA,24)
h <- rep(NA,24)
i <- mydf$myvalue[169:192]
solution <- c(a,b,c,d,e,f,g,h,i)
solution
I would appreciate any hint in the thinking process to solve this kind of problems that I face with relative frequency.
Here is a data.table solution which uses a "grouped shift()" and multiple joins to copy value from the same hour of the PreviousDaySameDayCategory.
Create reproducible data
OP's code to create reproducible data was not fully reproducible because he used the weekdays() function which returns the weekday names in the current locale (which seems to be Spanish for the OP). To be independent of the current locale, I switched to format(Date, "%u") which returns the numbers 1 to 7 for Monday to Sunday. Furthermore, the fct_collapse() from the forcats package is used to collapse the days 2, 3, and 4 (Tuesday to Thursday) into one factor level.
library(data.table)
# note that package lubridate is not required
myDT <- data.table(Date.POSIXct = seq(as.POSIXct("2018-05-01"),
as.POSIXct("2018-05-31"), "hour"))
myDT[, Date := as.Date(Date.POSIXct)]
myDT[, Weekday := format(Date, "%u")]
myDT[, DayCategory := forcats::fct_collapse(Weekday, "234" = c("2", "3", "4"))]
myDT[, hour := hour(Date.POSIXct)]
myDT[, Holidays := c(rep(0, 24 * 7), rep(1, 24 * 7), rep(0, 24 * 16 + 1))]
set.seed(123)
myDT[, myvalue := sample.int(101, size = .N, replace = TRUE)]
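The locale-independent weekday coding described above can be checked on a few dates. This small sketch uses base R only; collapsing the levels with ifelse() is a stand-in for fct_collapse(), chosen here only to avoid the extra dependency.

```r
# Tuesday, Sunday and Monday of the sample period
dates <- as.Date(c("2018-05-01", "2018-05-06", "2018-05-07"))

# "%u" gives the ISO weekday number as a string: Monday = "1", ..., Sunday = "7"
weekday_num <- format(dates, "%u")

# Collapse Tuesday to Thursday ("2", "3", "4") into a single category
day_category <- ifelse(weekday_num %in% c("2", "3", "4"), "234", weekday_num)

data.frame(dates, weekday_num, day_category)
```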
Intermediate step: PreviousDaySameDayCategory
The sample data set consists of hourly data, but in order to determine the PreviousDaySameDayCategory we need to work day-wise and thus only have to deal with the unique values of Date, DayCategory, and Holidays. The data is grouped by DayCategory and the Holidays indicator. For each group separately, the previous day is picked by lagging Date. As the result of shift() depends on the order of rows, the dataset is ordered by Date before shifting.
tmp <- unique(myDT[order(Date), .(Date, DayCategory, Holidays)])[
, .(Date, PreviousDaySameDayCategory = shift(Date)), by = .(DayCategory, Holidays)][
order(Date)]
tmp
DayCategory Holidays Date PreviousDaySameDayCategory
1: 234 0 2018-05-01 <NA>
2: 234 0 2018-05-02 2018-05-01
3: 234 0 2018-05-03 2018-05-02
4: 5 0 2018-05-04 <NA>
5: 6 0 2018-05-05 <NA>
6: 7 0 2018-05-06 <NA>
7: 1 0 2018-05-07 <NA>
8: 234 1 2018-05-08 <NA>
9: 234 1 2018-05-09 2018-05-08
10: 234 1 2018-05-10 2018-05-09
11: 5 1 2018-05-11 <NA>
12: 6 1 2018-05-12 <NA>
13: 7 1 2018-05-13 <NA>
14: 1 1 2018-05-14 <NA>
15: 234 0 2018-05-15 2018-05-03
16: 234 0 2018-05-16 2018-05-15
17: 234 0 2018-05-17 2018-05-16
18: 5 0 2018-05-18 2018-05-04
19: 6 0 2018-05-19 2018-05-05
20: 7 0 2018-05-20 2018-05-06
21: 1 0 2018-05-21 2018-05-07
22: 234 0 2018-05-22 2018-05-17
23: 234 0 2018-05-23 2018-05-22
24: 234 0 2018-05-24 2018-05-23
25: 5 0 2018-05-25 2018-05-18
26: 6 0 2018-05-26 2018-05-19
27: 7 0 2018-05-27 2018-05-20
28: 1 0 2018-05-28 2018-05-21
29: 234 0 2018-05-29 2018-05-24
30: 234 0 2018-05-30 2018-05-29
31: 234 0 2018-05-31 2018-05-30
DayCategory Holidays Date PreviousDaySameDayCategory
For days 3 and 4 (Wednesday and Thursday) the preceding Tuesday and Wednesday, resp., of the same week are picked. For day 2 (Tuesday) the Thursday of the preceding week is picked if both weeks have the same holiday indicator set. If the preceding week has a different holiday indicator, the most recent Thursday of the same holiday period is picked. This is why, e.g., 2018-05-03 is picked in row 15.
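The grouped shift() can be seen in isolation on a handful of days. In this minimal sketch the table `days` and the column `Previous` are illustrative names; the first row of each group has no predecessor and therefore gets NA.

```r
library(data.table)

# Hypothetical toy data: three days in category "234" and one Friday ("5")
days <- data.table(
  Date        = as.Date("2018-05-01") + 0:3,
  DayCategory = c("234", "234", "234", "5"),
  Holidays    = 0
)

# Rows are already ordered by Date; lag Date within each group
days[, Previous := shift(Date), by = .(DayCategory, Holidays)]
days
```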
Copying value from matching PreviousDaySameDayCategory
This is done in two steps. First, the hourly values are picked from the matching PreviousDaySameDayCategory by joining with the matching days table tmp:
tmp2 <- myDT[tmp, on = .(Date = PreviousDaySameDayCategory), .(Date = i.Date, hour, myvalue), nomatch = 0L]
tmp2
Date hour myvalue
1: 2018-05-02 0 30
2: 2018-05-02 1 80
3: 2018-05-02 2 42
4: 2018-05-02 3 90
5: 2018-05-02 4 95
---
500: 2018-05-31 19 39
501: 2018-05-31 20 1
502: 2018-05-31 21 1
503: 2018-05-31 22 101
504: 2018-05-31 23 11
Second, a new column previousValue is added to myDT by an update join; it holds the corresponding value from PreviousDaySameDayCategory:
myDT[tmp2, on = .(Date, hour), previousValue := i.myvalue]
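An update join modifies the left-hand table by reference and leaves rows without a match as NA. A minimal sketch with two made-up tables, `x` and `lookup`:

```r
library(data.table)

# Hypothetical target table: two hours on 2018-05-02, one hour on 2018-05-03
x <- data.table(Date = as.Date("2018-05-02") + c(0, 0, 1),
                hour = c(0L, 1L, 0L))

# Hypothetical lookup table with values only for 2018-05-02
lookup <- data.table(Date    = as.Date("2018-05-02"),
                     hour    = c(0L, 1L),
                     myvalue = c(30L, 80L))

# The i. prefix refers to columns of the table in the i position (lookup);
# x is updated in place, unmatched rows keep NA
x[lookup, on = .(Date, hour), previousValue := i.myvalue]
x
```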
Here, the first two days of the result are shown:
myDT[Date %between% c(as.Date("2018-05-01"), as.Date("2018-05-02"))]
Date.POSIXct Date Weekday DayCategory hour Holidays myvalue previousValue
1: 2018-05-01 00:00:00 2018-05-01 2 234 0 0 30 NA
2: 2018-05-01 01:00:00 2018-05-01 2 234 1 0 80 NA
3: 2018-05-01 02:00:00 2018-05-01 2 234 2 0 42 NA
4: 2018-05-01 03:00:00 2018-05-01 2 234 3 0 90 NA
5: 2018-05-01 04:00:00 2018-05-01 2 234 4 0 95 NA
6: 2018-05-01 05:00:00 2018-05-01 2 234 5 0 5 NA
7: 2018-05-01 06:00:00 2018-05-01 2 234 6 0 54 NA
8: 2018-05-01 07:00:00 2018-05-01 2 234 7 0 91 NA
9: 2018-05-01 08:00:00 2018-05-01 2 234 8 0 56 NA
10: 2018-05-01 09:00:00 2018-05-01 2 234 9 0 47 NA
11: 2018-05-01 10:00:00 2018-05-01 2 234 10 0 97 NA
12: 2018-05-01 11:00:00 2018-05-01 2 234 11 0 46 NA
13: 2018-05-01 12:00:00 2018-05-01 2 234 12 0 69 NA
14: 2018-05-01 13:00:00 2018-05-01 2 234 13 0 58 NA
15: 2018-05-01 14:00:00 2018-05-01 2 234 14 0 11 NA
16: 2018-05-01 15:00:00 2018-05-01 2 234 15 0 91 NA
17: 2018-05-01 16:00:00 2018-05-01 2 234 16 0 25 NA
18: 2018-05-01 17:00:00 2018-05-01 2 234 17 0 5 NA
19: 2018-05-01 18:00:00 2018-05-01 2 234 18 0 34 NA
20: 2018-05-01 19:00:00 2018-05-01 2 234 19 0 97 NA
21: 2018-05-01 20:00:00 2018-05-01 2 234 20 0 90 NA
22: 2018-05-01 21:00:00 2018-05-01 2 234 21 0 70 NA
23: 2018-05-01 22:00:00 2018-05-01 2 234 22 0 65 NA
24: 2018-05-01 23:00:00 2018-05-01 2 234 23 0 101 NA
25: 2018-05-02 00:00:00 2018-05-02 3 234 0 0 67 30
26: 2018-05-02 01:00:00 2018-05-02 3 234 1 0 72 80
27: 2018-05-02 02:00:00 2018-05-02 3 234 2 0 55 42
28: 2018-05-02 03:00:00 2018-05-02 3 234 3 0 61 90
29: 2018-05-02 04:00:00 2018-05-02 3 234 4 0 30 95
30: 2018-05-02 05:00:00 2018-05-02 3 234 5 0 15 5
31: 2018-05-02 06:00:00 2018-05-02 3 234 6 0 98 54
32: 2018-05-02 07:00:00 2018-05-02 3 234 7 0 92 91
33: 2018-05-02 08:00:00 2018-05-02 3 234 8 0 70 56
34: 2018-05-02 09:00:00 2018-05-02 3 234 9 0 81 47
35: 2018-05-02 10:00:00 2018-05-02 3 234 10 0 3 97
36: 2018-05-02 11:00:00 2018-05-02 3 234 11 0 49 46
37: 2018-05-02 12:00:00 2018-05-02 3 234 12 0 77 69
38: 2018-05-02 13:00:00 2018-05-02 3 234 13 0 22 58
39: 2018-05-02 14:00:00 2018-05-02 3 234 14 0 33 11
40: 2018-05-02 15:00:00 2018-05-02 3 234 15 0 24 91
41: 2018-05-02 16:00:00 2018-05-02 3 234 16 0 15 25
42: 2018-05-02 17:00:00 2018-05-02 3 234 17 0 42 5
43: 2018-05-02 18:00:00 2018-05-02 3 234 18 0 42 34
44: 2018-05-02 19:00:00 2018-05-02 3 234 19 0 38 97
45: 2018-05-02 20:00:00 2018-05-02 3 234 20 0 16 90
46: 2018-05-02 21:00:00 2018-05-02 3 234 21 0 15 70
47: 2018-05-02 22:00:00 2018-05-02 3 234 22 0 24 65
48: 2018-05-02 23:00:00 2018-05-02 3 234 23 0 48 101
Date.POSIXct Date Weekday DayCategory hour Holidays myvalue previousValue
Verification
The result is in line with OP's expectations
identical(myDT[, previousValue[seq_along(solution)]], solution)
[1] TRUE
OP has posted the same question in the Data Science section as well. I am including here the same solution I gave there, in case it might help others.
It is similar to Uwe's solution but with the dplyr library instead.
library(dplyr)
rankedDf <- mydf %>%
group_by(DayCategory, Hour, Holidays) %>%
arrange(Date) %>%
mutate(rowRank = order(Date), previousRowRank = order(Date) - 1) %>%
left_join(., ., by = c("previousRowRank" = "rowRank", "DayCategory", "Hour", "Holidays")) %>%
select(
Date.POSIXct = Date.POSIXct.x,
Date = Date.x,
WeekDay = WeekDay.x,
DayCategory,
Hour,
Holidays,
myvalue = myvalue.x,
PreviousDaySameDayCategory = Date.y,
PreviousValueSameDayCategory = myvalue.y
)
print.data.frame(rankedDf)
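If only the previous value is needed, the self-join above could probably be replaced by a grouped lag(). This is a sketch on hypothetical toy data, not a drop-in replacement tested against the full data set; `toy` and `PreviousValue` are illustrative names.

```r
library(dplyr)

# Hypothetical toy data: two days, two hours each, same category and holiday flag
toy <- tibble(
  Date        = as.Date("2018-05-01") + c(0, 0, 1, 1),
  Hour        = c(0, 1, 0, 1),
  DayCategory = "234",
  Holidays    = 0,
  myvalue     = c(30, 80, 67, 72)
)

res <- toy %>%
  group_by(DayCategory, Hour, Holidays) %>%
  # within each group, take the value from the previous Date
  mutate(PreviousValue = lag(myvalue, order_by = Date)) %>%
  ungroup()

res
```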
P.S. love the way Uwe changes the original sample code.
How to drop groups when there are not enough observations?
In the following reproducible example, each person (identified by name) has 10 observations:
install.packages('randomNames') # install package if required
install.packages('data.table') # install package if required
lapply(c('data.table', 'randomNames'), require, character.only = TRUE) # load packages
set.seed(1)
testDT <- data.table( date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"),10),
name = rep(randomNames(10, which.names='first'), times=1, each=10),
Y = runif(100, 5, 15),
X = rnorm(100, 2, 9))
testDT <- testDT[ X > 0]
Now I want to keep only the persons with at least 6 observations, so Gracelline, Anna, Aesha and Michael must be removed, because they have only 3, 2, 4 and 5 observations respectively.
testDT[, length(X), by=name]
name V1
1: Blake 6
2: Alexander 6
3: Leigha 8
4: Gracelline 3
5: Epifanio 7
6: Keasha 6
7: Robyn 6
8: Anna 2
9: Aesha 4
10: Michael 5
How do I do this in an automatic way (real dataset is much larger)?
Edit:
Yes it's a duplicate. :(
The last proposed method was the fastest one.
> system.time(testDT[, .SD[.N>=6], by = name])
user system elapsed
0.293 0.227 0.517
> system.time(testDT[testDT[, .I[.N>=6], by = name]$V1])
user system elapsed
0.163 0.243 0.415
> system.time(testDT[,if(.N>=6) .SD , by = name])
user system elapsed
0.073 0.323 0.399
We group by 'name', get the number of rows (.N), and if it is at least 6, we return the Subset of the Data.table (.SD).
testDT[,if(.N>=6) .SD , by = name]
# name date Y X
# 1: Blake 2010-01-01 9.820801 3.69913070
# 2: Blake 2012-01-01 9.935413 15.18999375
# 3: Blake 2013-01-01 6.862176 3.37928004
# 4: Blake 2014-01-01 13.273733 21.55350503
# 5: Blake 2015-01-01 11.684667 6.27958576
# 6: Blake 2017-01-01 6.079436 7.49653718
# 7: Alexander 2010-01-01 13.209463 4.62301612
# 8: Alexander 2012-01-01 12.829328 2.00994816
# 9: Alexander 2013-01-01 10.530363 2.66907192
#10: Alexander 2016-01-01 5.233312 0.78339246
#11: Alexander 2017-01-01 9.772301 12.60278297
#12: Alexander 2019-01-01 11.927316 7.34551569
#13: Leigha 2010-01-01 9.776196 4.99655334
#14: Leigha 2011-01-01 13.612095 11.56789854
#15: Leigha 2013-01-01 7.447973 5.33016929
#16: Leigha 2014-01-01 5.706790 4.40388912
#17: Leigha 2016-01-01 8.162717 12.87081025
#18: Leigha 2017-01-01 10.186343 12.44362354
#19: Leigha 2018-01-01 11.620051 8.30192285
#20: Leigha 2019-01-01 9.068302 16.28150109
#21: Epifanio 2010-01-01 8.390729 17.90558542
#22: Epifanio 2011-01-01 13.394404 8.45036728
#23: Epifanio 2012-01-01 8.466835 10.19156807
#24: Epifanio 2013-01-01 8.337749 5.45766822
#25: Epifanio 2014-01-01 9.763512 17.13958472
#26: Epifanio 2017-01-01 8.899895 14.89054015
#27: Epifanio 2019-01-01 14.606180 0.13357331
#28: Keasha 2013-01-01 8.253522 6.44769498
#29: Keasha 2014-01-01 12.570871 0.40402566
#30: Keasha 2016-01-01 12.111212 14.08734943
#31: Keasha 2017-01-01 6.216919 0.06878532
#32: Keasha 2018-01-01 7.454885 0.38399123
#33: Keasha 2019-01-01 6.433044 1.09828333
#34: Robyn 2010-01-01 7.396294 8.41399676
#35: Robyn 2011-01-01 5.589344 1.33792036
#36: Robyn 2012-01-01 11.422883 1.66129246
#37: Robyn 2015-01-01 12.973088 2.54144396
#38: Robyn 2017-01-01 9.100841 6.78346573
#39: Robyn 2019-01-01 11.049333 4.75902075
Or, instead of if, we can directly use .N >= 6 as the row filter inside .SD[...]
testDT[, .SD[.N>=6], by = name]
It could be a little slow, so another option would be to use .I to get the row index and then subset
testDT[testDT[, .I[.N>=6], by = name]$V1]
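To see that the variants select the same rows, here is a small sketch on made-up data with a lower threshold; `toy`, `min_n`, `keep_if` and `keep_I` are names introduced for illustration only.

```r
library(data.table)

# Hypothetical toy data: group "a" has 3 rows, "b" has 1, "c" has 2
toy <- data.table(name = c("a", "a", "a", "b", "c", "c"), X = 1:6)
min_n <- 2L

# Variant 1: return .SD only for groups that are large enough
keep_if <- toy[, if (.N >= min_n) .SD, by = name]

# Variant 2: collect the row numbers (.I) of large-enough groups, then subset once
keep_I <- toy[toy[, .I[.N >= min_n], by = name]$V1]

keep_if
keep_I
```

Both results drop group "b" and keep the five rows of groups "a" and "c".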