Repeat a record N times and create a new sequence from 1 to N in R

I want to repeat each row of a data.frame N times, where N is calculated from the difference between the values in the first and second columns of that row, so N can differ from row to row. I also need to create a new column holding the sequence from the first value to the second value of each row, increasing in steps of K. K is constant for all rows.
Ex: d1<-data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
In the above dataset, there are 5 rows. The difference between the first and second values in the first row is 6, so the sequence from 2 to 8 has 7 elements. I need to replicate the first row 7 times and create a new column with the sequence 2, 3, 4, 5, 6, 7 and 8.
I can create a dataset by using the following code.
dist<-1
rec_len<-c()
seqe<-c()
for(i in 1:nrow(d1))
{
a<-seq(d1[i,"A"],d1[i,"B"],by=dist)
rec_len<-c(rec_len,length(a))
seqe<-c(seqe,a)
}
d1$C<-rec_len
d1<-d1[rep(1:nrow(d1),d1$C),]
d1$D<-seqe
row.names(d1)<-NULL
But it takes a very long time. Is there any possibility to speed up the process?

A data.table approach is to use 1:nrow(d1) as the grouping variable, which makes the operation row-wise: create a list column holding the sequence from A to B for each row, and then unlist it, i.e.
library(data.table)
setDT(d1)[, C := B - A + 1][,
D := list(list(seq(A, B))), by = 1:nrow(d1)][,
lapply(.SD, unlist), by = 1:nrow(d1)][,
nrow := NULL][]
Which gives,
A B C D
1: 2 8 7 2
2: 2 8 7 3
3: 2 8 7 4
4: 2 8 7 5
5: 2 8 7 6
6: 2 8 7 7
7: 2 8 7 8
8: 4 6 3 4
9: 4 6 3 5
10: 4 6 3 6
11: 6 7 2 6
12: 6 7 2 7
13: 8 8 1 8
14: 1 10 10 1
15: 1 10 10 2
16: 1 10 10 3
17: 1 10 10 4
18: 1 10 10 5
19: 1 10 10 6
20: 1 10 10 7
21: 1 10 10 8
22: 1 10 10 9
23: 1 10 10 10
A B C D
Note: you can easily change K within seq, e.g.
setDT(d1)[, C := B - A + 1][,
D := list(list(seq(A, B, by = 0.2))), by = 1:nrow(d1)][,
lapply(.SD, unlist), by = 1:nrow(d1)][,
nrow := NULL][]

You could use lists and the purrr package to process each row of your data frame:
library(dplyr)
library(purrr)
data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10)) %>% # take original data frame
setNames(c("from", "to")) %>% pmap(seq) %>% # sequence from A to B
map(~tibble(value = .x)) %>% # convert each element to data frame (as_data_frame is deprecated)
map(~mutate(.,A=min(value), B=max(value))) %>% # add A and B columns
bind_rows() %>% select(A,B,value) # combine and reorder columns

Here is a base R option. We get the number of times each row is replicated by subtracting column 'A' from column 'B' and adding 1 ('i1'), store that as column 'C', and then replicate the row indices of the original dataset using 'i1'. Finally, the 'D' column is created by getting the sequence between the corresponding elements of 'A' and 'B' using Map. The output is a list, so we unlist it to make a vector.
i1 <- with(d1, B - A + 1)
d1$C <- i1
d2 <- d1[rep(seq_len(nrow(d1)), i1),]
d2$D <- unlist(Map(`:`, d1$A, d1$B))
row.names(d2) <- NULL
d2
# A B C D
#1 2 8 7 2
#2 2 8 7 3
#3 2 8 7 4
#4 2 8 7 5
#5 2 8 7 6
#6 2 8 7 7
#7 2 8 7 8
#8 4 6 3 4
#9 4 6 3 5
#10 4 6 3 6
#11 6 7 2 6
#12 6 7 2 7
#13 8 8 1 8
#14 1 10 10 1
#15 1 10 10 2
#16 1 10 10 3
#17 1 10 10 4
#18 1 10 10 5
#19 1 10 10 6
#20 1 10 10 7
#21 1 10 10 8
#22 1 10 10 9
#23 1 10 10 10
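The loop in the question is slow largely because rec_len and seqe are grown with c() on every iteration. As a minimal sketch, assuming a constant positive step k and B >= A in every row, the whole result can be built without a loop using rep() and sequence(). The names k, n and d2 are illustrative, and the small tolerance added before floor() is an assumption meant to guard against floating-point error with fractional steps.

```r
d1 <- data.frame(A = c(2, 4, 6, 8, 1), B = c(8, 6, 7, 8, 10))
k <- 1                                   # step size, constant for all rows (assumed > 0)

# number of sequence elements per row; 1e-10 is an assumed tolerance
n <- floor((d1$B - d1$A) / k + 1e-10) + 1

d2 <- d1[rep(seq_len(nrow(d1)), n), ]    # repeat row i n[i] times
d2$C <- rep(n, n)                        # replication count for each row
d2$D <- d2$A + (sequence(n) - 1) * k     # A, A + k, A + 2k, ... within each row
row.names(d2) <- NULL
d2
```

Changing k to a fractional value such as 0.25 should give the finer sequences without any other change, although C then counts sequence elements rather than B - A + 1.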

Simple example using N (case where k = 1)
library(dplyr)
# example data frame
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
# function to use (must have same column names)
f = function(d) {
A = rep(d$A, d$diff)
B = rep(d$B, d$diff)
C = seq(d$A, d$B)
data.frame(A, B, C) }
d1 %>%
mutate(diff = B - A + 1) %>% # calculate difference
rowwise() %>% # for every row
do(f(.)) %>% # apply the function
ungroup() # forget the grouping
# # A tibble: 23 x 3
# A B C
# * <dbl> <dbl> <int>
# 1 2 8 2
# 2 2 8 3
# 3 2 8 4
# 4 2 8 5
# 5 2 8 6
# 6 2 8 7
# 7 2 8 8
# 8 4 6 4
# 9 4 6 5
# 10 4 6 6
# # ... with 13 more rows
Example where you have one k for all rows (I'm using 0.25 to demonstrate)
# example data frame
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
# function to use (must have same column names)
f = function(d, k) {
A = d$A
B = d$B
C = seq(d$A, d$B, k)
data.frame(A, B, C) }
d1 %>%
rowwise() %>% # for every row
do(f(., 0.25)) %>% # apply the function using your own k
ungroup()
# # A tibble: 77 x 3
# A B C
# * <dbl> <dbl> <dbl>
# 1 2 8 2.00
# 2 2 8 2.25
# 3 2 8 2.50
# 4 2 8 2.75
# 5 2 8 3.00
# 6 2 8 3.25
# 7 2 8 3.50
# 8 2 8 3.75
# 9 2 8 4.00
# 10 2 8 4.25
# # ... with 67 more rows
Example where you have different k for each row
# example data frame
# give manually different k for each row
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
d1$k = c(0.5, 1, 2, 0.25, 1.5)
d1
# A B k
# 1 2 8 0.50
# 2 4 6 1.00
# 3 6 7 2.00
# 4 8 8 0.25
# 5 1 10 1.50
# function to use (must have same column names)
f = function(d) {
A = d$A
B = d$B
C = seq(d$A, d$B, d$k)
data.frame(A, B, C) }
d1 %>%
rowwise() %>% # for every row
do(f(.)) %>% # apply the function using different k for each row
ungroup()
# # A tibble: 25 x 3
# A B C
# * <dbl> <dbl> <dbl>
# 1 2 8 2.0
# 2 2 8 2.5
# 3 2 8 3.0
# 4 2 8 3.5
# 5 2 8 4.0
# 6 2 8 4.5
# 7 2 8 5.0
# 8 2 8 5.5
# 9 2 8 6.0
# 10 2 8 6.5
# # ... with 15 more rows


join columns recursively in R

Hello, I have a data frame of 245 columns. I want to add some sets of columns together to generate new columns, and I tried to do it recursively as follows:
cl1<-sample(1:4,10,replace=TRUE)
cl2<-sample(1:4,10,replace=TRUE)
cl3<-sample(1:4,10,replace=TRUE)
cl4<-sample(1:4,10,replace=TRUE)
cl5<-sample(1:4,10,replace=TRUE)
cl6<-sample(1:4,10,replace=TRUE)
dat<-data.frame(cl1,cl2,cl3,cl4,cl5,cl6)
My intention is to add column 1 to columns 3 and 5, and likewise column 2 to columns 4 and 6, so that in the end I obtain a data frame with two columns holding those sums.
I have programmed the following code:
revisar<- function(a){
todos = list()
i=1
j=3
l=5
k=1
while(i<=2 ){
cl<-a[,i]
cl2<-a[,j]
cl3<-a[,l]
cl[is.na(cl)] <- 0
cl2[is.na(cl2)] <- 0
cl3[is.na(cl3)] <- 0
colu<-cl+cl2+cl3
col<-cbind(colu,colu)
i<-i+1
j<-j+1
l<-l+1
k<-k+1
}
return(col)
}
It turns out that it only returns column 2 repeated twice, and I need to replicate the same thing to join those 245 columns.
I would like to know what is failing in the example.
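In the function above, col is overwritten on every pass through the while loop and is built as cbind(colu, colu), so after the last pass it holds only the second sum, twice. Below is a minimal sketch of one way to fix it, assuming the columns to add together are every n_groups-th column (1, 3, 5 and 2, 4, 6 here). The function name sum_groups and its n_groups argument are made up for illustration.

```r
set.seed(42)
dat <- as.data.frame(replicate(6, sample(1:4, 10, replace = TRUE)))
names(dat) <- paste0("cl", 1:6)

# sum_groups() is a hypothetical helper, not part of any package
sum_groups <- function(a, n_groups = 2) {
  a[is.na(a)] <- 0                              # treat NA as 0, as in the question
  out <- vector("list", n_groups)               # one slot per output column
  for (i in seq_len(n_groups)) {
    cols <- seq(i, ncol(a), by = n_groups)      # i = 1 -> 1, 3, 5; i = 2 -> 2, 4, 6
    out[[i]] <- unname(rowSums(a[, cols, drop = FALSE]))
  }
  names(out) <- paste0("s", seq_len(n_groups))
  as.data.frame(out)                            # each sum is stored, none overwritten
}

res <- sum_groups(dat)
res
```

Because each sum goes into its own list slot, the same function should also work for the 245-column case by changing n_groups, provided the columns really do alternate in that pattern.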
base R
Literal programming:
with(dat, data.frame(s1 = cl1+cl3+cl5, s2 = cl2+cl4+cl6))
# s1 s2
# 1 7 11
# 2 7 7
# 3 4 11
# 4 4 10
# 5 9 8
# 6 12 5
# 7 7 6
# 8 7 10
# 9 4 9
# 10 6 5
Programmatically,
L <- list(s1 = c(1,3,5), s2 = c(2,4,6))
out <- data.frame(lapply(L, function(z) do.call(rowSums, list(as.matrix(dat[,z])))))
out
# s1 s2
# 1 7 11
# 2 7 7
# 3 4 11
# 4 4 10
# 5 9 8
# 6 12 5
# 7 7 6
# 8 7 10
# 9 4 9
# 10 6 5
dplyr
library(dplyr)
dat %>%
transmute(
s1 = rowSums(cbind(cl1, cl3, cl5)),
s2 = rowSums(cbind(cl2, cl4, cl6))
)
or programmatically using purrr:
purrr::map_dfc(L, ~ rowSums(dat[, .]))
Data
set.seed(42)
# your `dat` above
Here is an alternative general approach:
Here we sum all odd-numbered columns -> s1 and
all even-numbered columns -> s2:
library(dplyr)
dat %>%
rowwise() %>%
mutate(s1 = sum(c_across(seq(1,ncol(dat),2)), na.rm = TRUE),
s2 = sum(c_across(seq(2,ncol(dat),2)), na.rm = TRUE))
cl1 cl2 cl3 cl4 cl5 cl6 s1 s2
<int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 3 2 3 2 7 5
2 2 4 1 4 2 3 5 11
3 2 2 2 2 1 3 5 7
4 2 4 4 3 1 4 7 11
5 2 4 4 3 2 2 8 9
6 3 3 3 2 2 2 8 7
7 2 1 1 2 1 4 4 7
8 2 4 1 3 2 3 5 10
9 3 1 1 2 3 4 7 7
10 2 4 1 3 4 4 7 11

Remove identical values if the same as previous in a time series

I have a time series:
df <- data.frame(t=1:10, x= c(5,7,8,9,5,5,5,5,4,3))
I want to remove values that are identical to the previous value to obtain:
x = c(5,7,8,9,5,4,3)
I tried:
df[unique(df$x),]
But this gives the incorrect answer.
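To see why the attempt fails: unique(df$x) returns the distinct values of x, and putting those inside df[ , ] uses them as row numbers. A short sketch comparing that with a direct comparison of each value to its predecessor (the names idx, wrong, keep and right are illustrative):

```r
df <- data.frame(t = 1:10, x = c(5, 7, 8, 9, 5, 5, 5, 5, 4, 3))

idx <- unique(df$x)        # distinct values 5 7 8 9 4 3, not row positions
wrong <- df[idx, ]         # picks rows 5, 7, 8, 9, 4, 3

# keep the first row, then every row whose x differs from the row before it
keep <- c(TRUE, df$x[-1] != df$x[-nrow(df)])
right <- df[keep, ]
right$x
```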
You can do:
df[c(1, diff(df$x)) != 0, ]
t x
1 1 5
2 2 7
3 3 8
4 4 9
5 5 5
6 9 4
7 10 3
With dplyr, you can do:
library(dplyr)
df %>%
filter(x != lag(x, default = first(x)-1))
t x
1 1 5
2 2 7
3 3 8
4 4 9
5 5 5
6 9 4
7 10 3
In base R, we can use head and tail
subset(df, c(TRUE, head(x, -1) != tail(x, -1)))
# t x
#1 1 5
#2 2 7
#3 3 8
#4 4 9
#5 5 5
#9 9 4
#10 10 3
Another base solution would be using rle.
To subset the data frame, use the run lengths: cumsum(lengths) gives the index of the last row of each run, so the repeated 5 is kept at t = 8 rather than t = 5. If you only need the x column, extract the values from rle instead. See below:
df[cumsum(rle(df$x)$lengths), ] # dataframe subset
# t x
# 1 1 5
# 2 2 7
# 3 3 8
# 4 4 9
# 8 8 5
# 9 9 4
# 10 10 3
rle(df$x)$values # vector of values
# [1] 5 7 8 9 5 4 3
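If the first row of each run is wanted instead of the last (matching the other answers), the start positions can be derived from the same rle result. This is a small sketch; first_idx and last_idx are illustrative names.

```r
df <- data.frame(t = 1:10, x = c(5, 7, 8, 9, 5, 5, 5, 5, 4, 3))

runs <- rle(df$x)
last_idx  <- cumsum(runs$lengths)             # last row of each run
first_idx <- last_idx - runs$lengths + 1      # first row of each run

first_rows <- df[first_idx, ]
first_rows
```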
Or using data.table:
library(data.table)
setDT(df)[, rn :=1:.N, by = rleid(x)][rn == 1, .(t, x)]
# t x
# 1: 1 5
# 2: 2 7
# 3: 3 8
# 4: 4 9
# 5: 5 5
# 6: 9 4
# 7: 10 3
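library(dplyr)
library(tidyr) # replace_na() comes from tidyr, not dplyr
df <- data.frame(t=1:10, x= c(5,7,8,9,5,5,5,5,4,3))
subsetVec <- df$x - lag(df$x) != 0 # TRUE where x differs from the previous value, NA for the first row
subsetVec <- replace_na(subsetVec, TRUE) # always keep the first row
df[subsetVec,]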

Dynamic select expression in function [duplicate]

I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
into this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to be able to work with any number of columns, so it would also work if I had S1, S2, S3, S4 or if there was an additional category, i.e. DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
keep <- enquo(keep)
build_exp <- function(x){
paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
parse_expr()
}
exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
sel_fun <- function(num){
df %>% select(!!keep,
!!!exps)
}
map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
df %>% select(!!keep,
!!sym(cnames[[1]]) := paste0(patterns[[1]], num),
!!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. It loops through the 'patterns' and the corresponding new column names ('cnames') using map2, gathers each set into 'long' format, renames the 'val' column to the 'cname' passed into the function, binds the columns (bind_cols) and selects the columns of interest.
stack_col <- function(dat, pat, cname, keep) {
purrr::map2(pat, cname, ~
dat %>%
dplyr::select(keep, matches(.x)) %>%
tidyr::gather(key, val, matches(.x)) %>%
dplyr::select(-key) %>%
dplyr::rename(!! .y := val)) %>%
dplyr::bind_cols(.) %>%
dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, multiple patterns reshaping can be done with data.table::melt
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
This solves your problem, although it does not fix your function:
The idea is to use gather and spread on the columns which start with the specific pattern. I therefore create a regex which matches the column names, gather all of those columns, extract the group and then rename the groups with the cnames. Finally, spread separates the values into the new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
mutate(rownumber = 1:n()) %>%
gather(new_variable, value, matches(complete_pattern)) %>%
mutate(group = str_extract(new_variable, complete_pattern),
group = str_replace_all(group, cnames),
group_number = str_extract(new_variable, "\\d+")) %>%
select(-new_variable) %>%
spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d
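For the function signature the question describes (patterns, set numbers, output names and columns to keep), a base R sketch is also possible without any quasiquotation. It assumes every column is named as a stem followed directly by the set number (S1, PR1, S2, ...). The name stack_sets and its arguments are hypothetical.

```r
df <- data.frame(obj = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
                 S1 = rep(c("a", "b"), length.out = 10), PR1 = rep(c(3, 7), length.out = 10),
                 S2 = rep(c("c", "d"), length.out = 10), PR2 = rep(c(7, 3), length.out = 10))

# stack_sets() is an illustrative helper, not from any package
stack_sets <- function(df, stems, nums, cnames, keep) {
  pieces <- lapply(nums, function(n) {
    out <- df[, c(keep, paste0(stems, n)), drop = FALSE]  # e.g. obj, S1, PR1
    names(out) <- c(keep, cnames)                         # same names in every piece
    out
  })
  do.call(rbind, pieces)                                  # stack the sets
}

res <- stack_sets(df, stems = c("S", "PR"), nums = 1:2,
                  cnames = c("Species", "PR"), keep = "obj")
res
```

Adding another category such as DS1, DS2 would then only mean extending stems and cnames.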

How to mutate multiple variables without repeating codes?

I'm trying to create new variables from existing variables like below:
a1+a2=a3, b1+b2=b3, ..., z1+z2=z3
Here is an example data frame
df <- data.frame(replicate(10,sample(1:10)))
colnames(df) <- c("a1","a2","b1","b2","c1","c2","d1","d2","e1","e2")
Here's my solution with repeated code
# a solution by base R
df$a3 <- df$a1 + df$a2
df$b3 <- df$b1 + df$b2
df$c3 <- df$c1 + df$c2
df$d3 <- df$d1 + df$d2
df$e3 <- df$e1 + df$e2
Or
# a solution by dplyr
library(dplyr)
df <- df %>%
mutate(a3 = a1+a2,
b3 = b1+b2,
c3 = c1+c2,
d3 = d1+d2,
e3 = e1+e2)
Or
# a solution by data.table
library(data.table)
DT <- data.table(df)
DT[,a3:=a1+a2][,b3:=b1+b2][,c3:=c1+c2][,d3:=d1+d2][,e3:=e1+e2]
Actually I have more than 100 variables, so I want to find a way to do so without repeating code... Although I tried to use mutate_ with standard evaluation and regular expression, I lost my way because I'm a newbie in R. Can you mutate multiple variables without repeating code?
Your data format is making this hard, so I would reshape the data like this. In general, you shouldn't encode actual data information in column names: if the difference between a1 and a2 is meaningful, it is better to have one column with the letter (a, b, c) and one column with the number (1, 2).
df$id = 1:nrow(df)
library(tidyr)
library(dplyr)
tdf = gather(df, key = key, value = value, -id) %>%
separate(key, into = c("letter", "number"), sep = 1) %>%
mutate(number = paste0("V", number)) %>%
spread(key = number, value = value)
## now data is "tidy":
head(tdf)
# id letter V1 V2
# 1 1 a 2 7
# 2 1 b 10 4
# 3 1 c 9 10
# 4 1 d 9 4
# 5 1 e 5 8
# 6 2 a 9 8
## and the operation is simple:
tdf$V3 = tdf$V1 + tdf$V2
head(tdf)
# id letter V1 V2 V3
# 1 1 a 2 7 9
# 2 1 b 10 4 14
# 3 1 c 9 10 19
# 4 1 d 9 4 13
# 5 1 e 5 8 13
# 6 2 a 9 8 17
A possible solution using data.table:
DT <- data.table(df)[, rn := .I]
DTadd3 <- dcast(melt(DT, measure.vars = 1:10)[, `:=` (let = substr(variable,1,1), rn = 1:.N), variable
][, s3 := sum(value), .(let,rn)],
rn ~ paste0(let,3), value.var = 's3', mean)
DT[DTadd3, on = 'rn'][, rn := NULL][]
which gives:
a1 a2 b1 b2 c1 c2 d1 d2 e1 e2 a3 b3 c3 d3 e3
1: 10 5 9 5 10 4 5 3 7 10 15 14 14 8 17
2: 2 6 6 8 3 8 7 1 4 7 8 14 11 8 11
3: 6 4 7 4 4 3 4 6 3 3 10 11 7 10 6
4: 1 2 4 2 9 9 3 7 10 4 3 6 18 10 14
5: 9 10 8 1 8 7 10 5 9 1 19 9 15 15 10
6: 8 8 10 6 2 5 2 4 2 6 16 16 7 6 8
7: 7 9 1 7 5 10 9 2 1 8 16 8 15 11 9
8: 5 1 2 9 7 2 1 8 5 5 6 11 9 9 10
9: 3 7 3 3 1 6 8 10 8 9 10 6 7 18 17
10: 4 3 5 10 6 1 6 9 6 2 7 15 7 15 8
A similar solution using dplyr and tidyr:
df %>%
bind_cols(., df %>%
gather(var, val) %>%
group_by(var) %>%
mutate(let = substr(var,1,1), rn = 1:n()) %>%
group_by(let,rn) %>%
summarise(s3 = sum(val)) %>%
spread(let, s3) %>%
select(-rn)
)
However, as noted by @Gregor, it is much better to transform your data into long format. The data.table equivalent of @Gregor's answer:
DT <- data.table(df)
melt(DT[, rn := .I],
variable.name = 'let',
measure.vars = patterns('1$','2$'),
value.name = paste0('v',1:2)
)[, `:=` (let = letters[let], v3 = v1 + v2)][]
which gives (first 15 rows):
rn let v1 v2 v3
1: 1 a 10 5 15
2: 2 a 2 6 8
3: 3 a 6 4 10
4: 4 a 1 2 3
5: 5 a 9 10 19
6: 6 a 8 8 16
7: 7 a 7 9 16
8: 8 a 5 1 6
9: 9 a 3 7 10
10: 10 a 4 3 7
11: 1 b 9 5 14
12: 2 b 6 8 14
13: 3 b 7 4 11
14: 4 b 4 2 6
15: 5 b 8 1 9
My data.table solution:
library(data.table)
setDT(df) # convert df to a data.table by reference so that := works
sapply(c("a", "b", "c", "d", "e"), function(ll)
df[ , paste0(ll, 3) := get(paste0(ll, 1)) + get(paste0(ll, 2))])
df[]
# a1 a2 b1 b2 c1 c2 d1 d2 e1 e2 a3 b3 c3 d3 e3
# 1: 5 2 2 6 4 1 10 7 3 9 7 8 5 17 12
# 2: 4 8 7 3 3 7 9 6 9 7 12 10 10 15 16
# 3: 10 7 6 10 1 9 4 1 2 4 17 16 10 5 6
# 4: 3 4 1 7 6 4 7 4 7 5 7 8 10 11 12
# 5: 8 3 4 2 2 2 3 3 4 10 11 6 4 6 14
# 6: 6 6 5 1 8 10 1 10 5 3 12 6 18 11 8
# 7: 2 10 8 9 5 6 2 5 10 2 12 17 11 7 12
# 8: 1 1 10 8 9 5 6 9 6 8 2 18 14 15 14
# 9: 9 5 3 5 10 3 5 2 1 6 14 8 13 7 7
# 10: 7 9 9 4 7 8 8 8 8 1 16 13 15 16 9
Or, more extensibly:
sapply(c("a", "b", "c", "d", "e"), function(ll)
df[ , paste0(ll, 3) := Reduce(`+`, mget(paste0(ll, 1:2)))])
If all of the variables fit the pattern of ending with 1 or 2, you might try:
stems = unique(gsub("[0-9]", "", names(df)))
Then sapply(stems, ...)
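The stems idea can also be written out for a plain data.frame in base R. This is a sketch that assumes every stem has exactly a "1" and a "2" column; set.seed(1) is only there to make the example reproducible.

```r
set.seed(1)
df <- data.frame(replicate(10, sample(1:10)))
colnames(df) <- c("a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2", "e1", "e2")

# derive "a", "b", ... from the column names instead of typing them
stems <- unique(gsub("[0-9]", "", names(df)))

# add a3, b3, ... in one assignment: each is <stem>1 + <stem>2
df[paste0(stems, 3)] <- lapply(stems, function(s) {
  df[[paste0(s, 1)]] + df[[paste0(s, 2)]]
})
head(df)
```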
library(tidyverse)
reduce(.init=df, .x=letters[1:5], .f = ~{
mutate(.x, '{.y}3' := get(str_c(.y, 1)) + get(str_c(.y, 2)))
})

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove the rows of every group that has fewer than 5 members. The variable "id" is the grouping variable, so I need to determine which groups in DT have fewer than 5 rows (groups "1" and "4") and then remove those rows, leaving:
x y v id
1: a 3 5 2
2: b 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: c 3 2 3
8: c 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
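Another base R variant avoids the lookup through names(tab) by computing each row's group size directly with ave(). This is a small sketch; group_size and res are illustrative names.

```r
df <- data.frame(x = rep(c("a", "b", "c"), each = 6), y = c(1, 3, 6), v = 1:9,
                 id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4))

# for every row, the number of rows sharing its id
group_size <- ave(seq_along(df$id), df$id, FUN = length)

res <- df[group_size >= 5, ]   # keep only groups with at least 5 members
res
```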
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
