R data.table how to create duplicates [duplicate] - r

This question already has answers here:
Repeat rows of a data.frame N times
(10 answers)
Closed 3 years ago.
I have:
dataDT <- data.table(A = 1:3, B = 1:3)
dataDT
A B
1: 1 1
2: 2 2
3: 3 3
I want:
dataDT <- data.table(A = c(1:3, 1:3), B = c(1:3, 1:3))
dataDT
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
i.e. create x copies of duplicate and append after the bottom row.
I've tried (results aren't what I need):
dataDT1 <- splitstackshape::expandRows(dataset = dataDT, count = 2, count.is.col = FALSE) # order not correct
dataDT1
A B
1: 1 1
2: 1 1
3: 2 2
4: 2 2
5: 3 3
6: 3 3
Also (results aren't what I need):
dataDT2 <- rbindlist(list(rep(dataDT, 2))) # it creates columns
dataDT2
A B A B
1: 1 1 1 1
2: 2 2 2 2
3: 3 3 3 3
Can anyone recommend a correct and efficient way of doing it?

You can do it with rep:
> x = 2; dataDT[rep(seq_len(nrow(dataDT)), x), ]
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
or with rbindlist and replicate:
> x = 2; rbindlist(replicate(x, dataDT, simplify = F))
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3

Related

R Data Table add rows to each group if not existing [duplicate]

This question already has answers here:
data.table equivalent of tidyr::complete()
(3 answers)
Closed 29 days ago.
I have a data table with multiple groups. Each group I'd like to fill with rows containing the values in vals if they are not already present. Additional columns should be filled with NAs.
DT = data.table(group = c(1,1,1,2,2,3,3,3,3), val = c(1,2,4,2,3,1,2,3,4), somethingElse = rep(1,9))
vals = data.table(val = c(1,2,3,4))
What I want:
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
The order of val does not necessarily have to be increasing, the values may also be appened at the beginning/end of each group.
I don't know how to approach this problem. I've thought about using rbindlist(...,fill = TRUE), but then the values will be simply appended.
I think some expression with DT[, lapply(...), by = c("group")] might be useful here but I have no idea how to check if a value already exists.
You can use a cross-join:
setDT(DT)[
CJ(group = group, val = val, unique = TRUE),
on = .(group, val)
]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
Another way to solve your problem:
DT[, .SD[vals, on="val"], by=group]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
# or
DT[CJ(group, val, unique=TRUE), on=.NATURAL]
I will just add this answer for a slightly more complex case:
#Raw Data
DT = data.table(group = c(1,1,2,2,2,3,3,3,3),
x = c(1,2,1,3,4,1,2,3,4),
y = c(2,4,2,6,8,2,4,6,8),
somethingElse = rep(1,9))
#allowed combinations of x and y
DTxy = data.table(x = c(1,2,3,4), y = c(2,4,6,8))
Here, I want to add all x,y combinations from DTxy to each group from DT, if not already present.
I've wrote a function to work for subsets.
#function to join subsets on two columns (here: x,y)
DTxyJoin = function(.SD, xy){
.SD = .SD[xy, on = .(x,y)]
return(.SD)
}
I then applied the function to each group:
#add x and y to each group if missing
DTres = DT[, DTxyJoin(.SD, DTxy), by = c("group")]
The Result:
group x y somethingElse
1: 1 1 2 1
2: 1 2 4 1
3: 1 3 6 NA
4: 1 4 8 NA
5: 2 1 2 1
6: 2 2 4 NA
7: 2 3 6 1
8: 2 4 8 1
9: 3 1 2 1
10: 3 2 4 1
11: 3 3 6 1
12: 3 4 8 1

Group variable by "n" consecutive integers in data.table

library(data.table)
DT <- data.table(var = 1:100)
I want to create a second variable, group that groups the values in var by n consecutive integers. So if n is equal to 1, it would return the same column as var. If n=2, it would return me:
var group
1: 1 1
2: 2 1
3: 3 2
4: 4 2
5: 5 3
6: 6 3
If n=3, it would return me:
var group
1: 1 1
2: 2 1
3: 3 1
4: 4 2
5: 5 2
6: 6 2
and so on. I would like to do this as flexibly as possibly.
Note that there could be repeated values:
var group
1: 1 1
2: 1 1
3: 2 1
4: 3 2
5: 3 2
6: 4 2
Here, group corresponds to n=2. Thank you!
I think we can use findInterval for this:
DT <- data.table(var = c(1L, 1:10))
n <- 2
DT[, group := findInterval(var, seq(min(var), max(var) + n, by = n))]
# var group
# <int> <int>
# 1: 1 1
# 2: 1 1
# 3: 2 1
# 4: 3 2
# 5: 4 2
# 6: 5 3
# 7: 6 3
# 8: 7 4
# 9: 8 4
# 10: 9 5
# 11: 10 5
n <- 3
DT[, group := findInterval(var, seq(min(var), max(var) + n, by = n))]
# var group
# <int> <int>
# 1: 1 1
# 2: 1 1
# 3: 2 1
# 4: 3 1
# 5: 4 2
# 6: 5 2
# 7: 6 2
# 8: 7 3
# 9: 8 3
# 10: 9 3
# 11: 10 4
(The +n in the call to seq is so that we always have a little more than we need; if we did just seq(min(.),max(.),by=n), it would be possible the highest values of var would be outside of the sequence. One could also do c(seq(min(.), max(.), by=n), Inf) for the same effect.)

Index and count unique combination of variables using R, but do NOT remove duplicates

Take this data frame for example:
DT <- data.table(A = rep(1:3, each=4),
B = rep(c(NA,1,2,4), each=3),
C = rep(1:2, 6))
I want to append a column that assign index to unique combinations of A and B, but ignore C. I also want another column that count the number of duplicates, that looks like this:
A B C Index Count
1: 1 NA 1 1 3
2: 1 NA 2 1 3
3: 1 NA 1 1 3
4: 1 1 2 2 1
5: 2 1 1 3 2
6: 2 1 2 3 2
7: 2 2 1 4 2
8: 2 2 2 4 2
9: 3 2 1 5 1
10: 3 4 2 6 3
11: 3 4 1 6 3
12: 3 4 2 6 3
I don't want to trim the data frame and (preferably)I don't want to reorder the rows.
I tried setDT, such as
setDT(DT)[,.(.I, .N), by = names(DT[,1:2])]
But the I column is not the index I want, and Column C is gone.
Thanks in advance!

Count how many times values has changed in column using R

Hi i want to count how many times value has changed in a column by the group and how many unique values was in a group, and i sort of getting what i want, but it has a NA observation which i do not want to be counted.
df <- data.frame(x=c("a",'a', "a", "b",'b', "b", "c",'c', "d")
,y=c(1,2,NA,3,3,3,2,1,5))
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(y), by=x][]
setDT(df)[, count := uniqueN(y),by=x][]
x y wanted count
1: a 1 1 3
2: a 2 2 3
3: a NA 3 3
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1`
Desired results:
x y wanted count
1: a 1 1 2
2: a 2 2 2
3: a NA 2 2
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
I tried rleid(!is.na(y)) but seems not to work as i expected. Thank you.
We can replace the NA elements with previous non-NA element (na.locf), take the rleid on that to get the 'wanted' and also get the length of unique elements that are not NA to get the 'count'
library(zoo)
setDT(df)[, c('wanted', 'count') := list(rleid(na.locf(y)), uniqueN(y, na.rm = TRUE)), x]
df
# x y wanted count
#1: a 1 1 2
#2: a 2 2 2
#3: a NA 2 2
#4: b 3 1 1
#5: b 3 1 1
#6: b 3 1 1
#7: c 2 1 2
#8: c 1 2 2
#9: d 5 1 1

Inserting a count field for each row by a grouping variable

I have a data set with observations that are both grouped and ordered (by rank). I'd like to add a third variable that is a count of the number of observations for each grouping variable. I'm aware of ways to group and count variables but I can't find a way to re-insert these counts back into the original data set, which has more rows. I'd like to get the variable C in the example table below.
A B C
1 1 3
1 2 3
1 3 3
2 1 4
2 2 4
2 3 4
2 4 4
Here's one way using ave:
DF <- within(DF, {C <- ave(A, A, FUN=length)})
# A B C
# 1 1 1 3
# 2 1 2 3
# 3 1 3 3
# 4 2 1 4
# 5 2 2 4
# 6 2 3 4
# 7 2 4 4
Here is one approach using data.table that makes use of .N, which is described in the help file to "data.table" as .N is an integer, length 1, containing the number of rows in the group.
> library(data.table)
> DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))
> DT
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 2 4
> DT[, C := .N, by = "A"]
> DT
A B C
1: 1 1 3
2: 1 2 3
3: 1 3 3
4: 2 1 4
5: 2 2 4
6: 2 3 4
7: 2 4 4

Resources