Replicating rows in data.table by column value - r

I have a dataset that is structured as follows:
data <- data.table(ID=1:10,Tenure=c(2,3,4,2,1,1,3,4,5,2),Var=rnorm(10))
ID Tenure Var
1: 1 2 -0.72892371
2: 2 3 -1.73534591
3: 3 4 0.47007030
4: 4 2 1.33173044
5: 5 1 -0.07900914
6: 6 1 0.63493316
7: 7 3 -0.62710577
8: 8 4 -1.69238758
9: 9 5 -0.85709328
10: 10 2 0.10716830
I need to replicate each row N = Tenure times, e.g. the first row should be replicated 2 times (since Tenure = 2).
I need my transformed dataset to look like the following:
setkey(data,ID)
print(data[,.(ID=rep(ID,Tenure))][data][, Indx := 1:.N, by=ID])
ID Tenure Var Indx
1: 1 2 -0.7289237 1
2: 1 2 -0.7289237 2
3: 2 3 -1.7353459 1
4: 2 3 -1.7353459 2
5: 2 3 -1.7353459 3
6: 3 4 0.4700703 1
...
...
Is there a more efficient way (a more data.table way) to do this? My way is pretty slow. I was thinking there should be a way to do this with a by-without-by merge using .EACHI?

I don't think using a key/merge is helpful here. Just expand by passing a vector of row indices:
DT <- data[rep(1:.N,Tenure)][,Indx:=1:.N,by=ID]
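The row-index idea above can be spelled out step by step. This is only a sketch: the intermediate name `idx` is made up for illustration, and `rowid()` is swapped in for the grouped `1:.N` as one possible way to number the copies without a `by`.

```r
library(data.table)

# Example data with the same shape as in the question (Var values are random)
data <- data.table(ID = 1:10,
                   Tenure = c(2, 3, 4, 2, 1, 1, 3, 4, 5, 2),
                   Var = rnorm(10))

# Repeat row i Tenure[i] times by indexing with a vector of row numbers
# (idx is a hypothetical helper name, not from the original answer)
idx <- rep(seq_len(nrow(data)), data$Tenure)
DT <- data[idx]

# Number the copies within each ID; rowid() avoids a grouped 1:.N
DT[, Indx := rowid(ID)]
```

Because no key or join is involved, the only work is building the index vector and subsetting once.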

You could try:
library(splitstackshape)
expandRows(data, "Tenure", drop = FALSE)[,Indx:=1:.N,by=ID][]
Or
library(dplyr)
library(splitstackshape)
expandRows(data, "Tenure", drop = FALSE) %>%
  group_by(ID) %>%
  mutate(Indx = row_number())
Which gives:
ID Tenure Var Indx
1: 1 2 -0.8808717 1
2: 1 2 -0.8808717 2
3: 2 3 0.5962590 1
4: 2 3 0.5962590 2
5: 2 3 0.5962590 3
6: 3 4 0.1197176 1
7: 3 4 0.1197176 2
8: 3 4 0.1197176 3
9: 3 4 0.1197176 4
10: 4 2 -0.2821739 1

Related

R Data Table add rows to each group if not existing [duplicate]

This question already has answers here:
data.table equivalent of tidyr::complete()
I have a data table with multiple groups. Each group I'd like to fill with rows containing the values in vals if they are not already present. Additional columns should be filled with NAs.
DT = data.table(group = c(1,1,1,2,2,3,3,3,3), val = c(1,2,4,2,3,1,2,3,4), somethingElse = rep(1,9))
vals = data.table(val = c(1,2,3,4))
What I want:
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
The order of val does not necessarily have to be increasing; the values may also be appended at the beginning or end of each group.
I don't know how to approach this problem. I've thought about using rbindlist(...,fill = TRUE), but then the values will be simply appended.
I think some expression with DT[, lapply(...), by = c("group")] might be useful here but I have no idea how to check if a value already exists.
You can use a cross-join:
setDT(DT)[
  CJ(group = group, val = val, unique = TRUE),
  on = .(group, val)
]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
Another way to solve your problem:
DT[, .SD[vals, on="val"], by=group]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
# or
DT[CJ(group, val, unique=TRUE), on=.NATURAL]
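The cross-join answers above build the grid from the values of val that happen to occur in DT. If vals could contain values that never appear in DT, a variant that takes them from the lookup table may be safer. The following is a sketch under that assumption; the names `grid` and `res` are hypothetical.

```r
library(data.table)

DT <- data.table(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 val = c(1, 2, 4, 2, 3, 1, 2, 3, 4),
                 somethingElse = rep(1, 9))
vals <- data.table(val = c(1, 2, 3, 4))

# Build every (group, val) pair explicitly from the lookup table
grid <- CJ(group = unique(DT$group), val = vals$val)

# Join DT onto the grid; pairs missing from DT get NA in somethingElse
res <- DT[grid, on = .(group, val)]
```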
I will just add this answer for a slightly more complex case:
#Raw Data
DT = data.table(group = c(1,1,2,2,2,3,3,3,3),
                x = c(1,2,1,3,4,1,2,3,4),
                y = c(2,4,2,6,8,2,4,6,8),
                somethingElse = rep(1,9))
#allowed combinations of x and y
DTxy = data.table(x = c(1,2,3,4), y = c(2,4,6,8))
Here, I want to add all x,y combinations from DTxy to each group from DT, if not already present.
I wrote a function that works on subsets.
#function to join subsets on two columns (here: x,y)
DTxyJoin = function(.SD, xy){
  .SD = .SD[xy, on = .(x,y)]
  return(.SD)
}
I then applied the function to each group:
#add x and y to each group if missing
DTres = DT[, DTxyJoin(.SD, DTxy), by = c("group")]
The Result:
group x y somethingElse
1: 1 1 2 1
2: 1 2 4 1
3: 1 3 6 NA
4: 1 4 8 NA
5: 2 1 2 1
6: 2 2 4 NA
7: 2 3 6 1
8: 2 4 8 1
9: 3 1 2 1
10: 3 2 4 1
11: 3 3 6 1
12: 3 4 8 1
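As a possible alternative to calling a function per group, the allowed (x, y) pairs can be repeated once per group up front and joined in a single step. This is a sketch, not the original poster's method; `groups`, `grid` and `DTres2` are names introduced here for illustration.

```r
library(data.table)

DT <- data.table(group = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
                 x = c(1, 2, 1, 3, 4, 1, 2, 3, 4),
                 y = c(2, 4, 2, 6, 8, 2, 4, 6, 8),
                 somethingElse = rep(1, 9))
DTxy <- data.table(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))

# Repeat the allowed (x, y) pairs once per group
groups <- unique(DT$group)
grid <- DTxy[rep(seq_len(nrow(DTxy)), times = length(groups))]
grid[, group := rep(groups, each = nrow(DTxy))]

# One join on all three columns; missing combinations get NA
DTres2 <- DT[grid, on = .(group, x, y)]
```

With many groups, a single join like this tends to avoid the overhead of running a join inside every `by` group, though that is worth measuring on real data.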

Create adjacency list from group info

I would like to create an adjacency list from a dataset like the following:
id group
1 1
2 1
3 1
4 2
5 2
The connected id are those who are in the same group. Therefore, I would like to get the following adjacency list:
id id2
1 2
1 3
2 1
2 3
3 1
3 2
4 5
5 4
I am struggling to figure out how to do it. In particular, I have found a solution where order does not matter (split and expand.grid by group on large data set). In my case it does matter, so I do not want those observations dropped.
Maybe something like this, using data.table:
require(data.table)
dt <- fread('id group
1 1
2 1
3 1
4 2
5 2')
dt[, expand.grid(id, id), by = group][Var1 != Var2][, -1]
# Var1 Var2
# 1: 2 1
# 2: 3 1
# 3: 1 2
# 4: 3 2
# 5: 1 3
# 6: 2 3
# 7: 5 4
# 8: 4 5
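If the column names and row order of the desired output matter, a variant with `CJ()` in place of `expand.grid()` might be used. This is a sketch; the result name `adj` is made up, and the data is built with `data.table()` instead of `fread()` only to keep the example short.

```r
library(data.table)

dt <- data.table(id = 1:5, group = c(1, 1, 1, 2, 2))

# All ordered pairs within each group (CJ sorts its inputs),
# then drop the self-pairs and the group column
adj <- dt[, CJ(id = id, id2 = id), by = group][id != id2, .(id, id2)]
```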

Index and count unique combination of variables using R, but do NOT remove duplicates

Take this data frame for example:
DT <- data.table(A = rep(1:3, each=4),
B = rep(c(NA,1,2,4), each=3),
C = rep(1:2, 6))
I want to append a column that assigns an index to each unique combination of A and B, ignoring C. I also want another column that counts the number of duplicates, so that the result looks like this:
A B C Index Count
1: 1 NA 1 1 3
2: 1 NA 2 1 3
3: 1 NA 1 1 3
4: 1 1 2 2 1
5: 2 1 1 3 2
6: 2 1 2 3 2
7: 2 2 1 4 2
8: 2 2 2 4 2
9: 3 2 1 5 1
10: 3 4 2 6 3
11: 3 4 1 6 3
12: 3 4 2 6 3
I don't want to trim the data frame and (preferably) I don't want to reorder the rows.
I tried setDT, such as
setDT(DT)[,.(.I, .N), by = names(DT[,1:2])]
But the I column is not the index I want, and Column C is gone.
Thanks in advance!
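One possible approach (a sketch, not taken from the thread) is to add both columns by reference with `:=`, using the special symbols `.GRP` for the group counter and `.N` for the group size. Assigning by reference keeps every row, keeps column C, and leaves the row order untouched.

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(c(NA, 1, 2, 4), each = 3),
                 C = rep(1:2, 6))

# .GRP numbers the (A, B) groups in order of first appearance,
# .N is the number of rows in the current group
DT[, `:=`(Index = .GRP, Count = .N), by = .(A, B)]
```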

Data.table summary statistics from n first observations per group

I'd like to use data.table to make summary statistics based on only the first n observations found for each group. I have one solution that works below but I have a nagging feeling that this might be written as a one-liner in data.table but I cannot find out how.
library(data.table)
DT <- data.table(y=1:10, grp=rep(1:2,5))
This produces
y grp
1: 1 1
2: 2 2
3: 3 1
4: 4 2
5: 5 1
6: 6 2
7: 7 1
8: 8 2
9: 9 1
10: 10 2
and I basically want to make summary statistics of y based on, say, the first two observations for each group. The following command gives me the index (by group)
DT2 <- DT[, .(idx = 1:.N, y), by=grp]
which yields
grp idx y
1: 1 1 1
2: 1 2 3
3: 1 3 5
4: 1 4 7
5: 1 5 9
6: 2 1 2
7: 2 2 4
8: 2 3 6
9: 2 4 8
10: 2 5 10
and then I can use data.table again to create the summary based on the relevant selection.
DT2[idx<3, .(my = mean(y)), by=grp]
to get
grp my
1: 1 2
2: 2 3
Is it possible to write this as a single call to data.table?
The one call solution is
DT[, .(my = mean(y[1:2])), by = grp]
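Note that `y[1:2]` pads with NA when a group has fewer than two rows, which turns the mean into NA. A sketch of a variant using `head()`, which simply returns fewer elements in that case, could look like this (the variable `n` and the result name `res` are introduced here for illustration):

```r
library(data.table)

DT <- data.table(y = 1:10, grp = rep(1:2, 5))

# Number of leading observations to summarise per group
n <- 2

# head() returns at most n elements, so short groups do not produce NA
res <- DT[, .(my = mean(head(y, n))), by = grp]
```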

Number of copies (duplicates) in R data.table

I want to add a column to a data.table which shows how many copies of each row exist. Take the following example:
library(data.table)
DT <- data.table(id = 1:10, colA = c(1,1,2,3,4,5,6,7,7,7), colB = c(1,1,2,3,4,5,6,7,8,8))
setkey(DT, colA, colB)
DT[, copies := length(colA), by = .(colA, colB)]
The output it gives is
id colA colB copies
1: 1 1 1 1
2: 2 1 1 1
3: 3 2 2 1
4: 4 3 3 1
5: 5 4 4 1
6: 6 5 5 1
7: 7 6 6 1
8: 8 7 7 1
9: 9 7 8 1
10: 10 7 8 1
Desired output is:
id colA colB copies
1: 1 1 1 2
2: 2 1 1 2
3: 3 2 2 1
4: 4 3 3 1
5: 5 4 4 1
6: 6 5 5 1
7: 7 6 6 1
8: 8 7 7 1
9: 9 7 8 2
10: 10 7 8 2
How should I do it?
I also want to know why my approach doesn't work. Isn't it true that when you group by colA and colB, the first group should contain two rows of data? I understand if "length" is not the function to use, but I cannot think of any other function to use. I thought of "nrow" but what can I pass to it?
DT[, copies := .N, by=.(colA,colB)]
# id colA colB copies
# 1: 1 1 1 2
# 2: 2 1 1 2
# 3: 3 2 2 1
# 4: 4 3 3 1
# 5: 5 4 4 1
# 6: 6 5 5 1
# 7: 7 6 6 1
# 8: 8 7 7 1
# 9: 9 7 8 2
# 10: 10 7 8 2
As mentioned in the comments, .N will calculate the length of the grouped object as defined in the by argument.
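As for why `length(colA)` returned 1: inside `j`, a column named in `by` is available as a single value per group, so its length is always 1, while non-grouping columns keep one element per row of the group. The following sketch illustrates the difference; the output column names are made up for the example.

```r
library(data.table)

DT <- data.table(id = 1:10,
                 colA = c(1, 1, 2, 3, 4, 5, 6, 7, 7, 7),
                 colB = c(1, 1, 2, 3, 4, 5, 6, 7, 8, 8))

# colA is a grouping column, so it has length 1 inside j;
# id is not, so its length equals the group size .N
chk <- DT[, .(len_by_col = length(colA), len_other = length(id), n = .N),
          by = .(colA, colB)]
```

So `length(id)` would also have given the desired count, but `.N` states the intent directly.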
