R data.table: summarise values of several rows

I have a data.table in R which looks like this one:
code gruppe proz_grouped
1: 1 2 14.751689
2: 2 2 22.063523
3: 3 2 35.441111
4: 4 2 27.743676
5: 1 3 7.575869
6: 2 3 23.420090
7: 3 3 38.513576
8: 4 3 30.490465
Is there an easy, elegant way to get the sum of proz_grouped for codes 3 and 4 within each group (gruppe)?
The result should look something like this:
code gruppe proz_grouped
1: 1 2 14.751689
2: 2 2 22.063523
3: NA 2 63.18471
5: 1 3 7.575869
6: 2 3 23.420090
7: NA 3 69.0035
Since code cannot be summarized, I would expect an NA for the code column.
Thanks

dt[, .(proz_grouped = sum(proz_grouped))
, by = .(code = replace(code, code > 2, NA), gruppe)]
# code gruppe proz_grouped
#1: 1 2 14.751689
#2: 2 2 22.063523
#3: NA 2 63.184787
#4: 1 3 7.575869
#5: 2 3 23.420090
#6: NA 3 69.004041
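For a self-contained check, the table from the question can be rebuilt and the grouping expression above run against it. This is only a sketch: the constructor and the column types (integer code and gruppe) are assumptions, since the question shows printed output rather than the code that created dt.

```r
library(data.table)

# Rebuild the example table (column types are assumed)
dt <- data.table(
  code = rep(1:4, times = 2),
  gruppe = rep(2:3, each = 4),
  proz_grouped = c(14.751689, 22.063523, 35.441111, 27.743676,
                   7.575869, 23.420090, 38.513576, 30.490465)
)

# Codes above 2 are mapped to NA only inside `by`, so dt itself is not modified
res <- dt[, .(proz_grouped = sum(proz_grouped)),
          by = .(code = replace(code, code > 2, NA), gruppe)]
print(res)
```

Because the recoding happens inside by, the original code column stays intact, which is handy if dt is needed again later.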

We can use recode (from the car package) to set codes 3 and 4 to NA and then sum by group. Note that := modifies df1 by reference.
library(data.table)
library(car)
df1[, code := recode(code, "c(3,4)=NA")
][, list(proz_grouped = sum(proz_grouped)), .(code, gruppe)]
# code gruppe proz_grouped
#1: 1 2 14.751689
#2: 2 2 22.063523
#3: NA 2 63.184787
#4: 1 3 7.575869
#5: 2 3 23.420090
#6: NA 3 69.004041
Or use %in% to change 3, 4 into NA, group by 'code', 'gruppe' and get the sum of 'proz_grouped'
df1[code %in% 3:4, code := NA][,
.(proz_grouped = sum(proz_grouped)) ,.(code, gruppe)]
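Both variants above use :=, which overwrites code in df1 by reference. If the original codes are still needed afterwards, one possible workaround is to run the same chain on a copy. A minimal sketch, assuming df1 holds the question's data (the constructor below is an assumption):

```r
library(data.table)

# Assumed reconstruction of the question's table
df1 <- data.table(
  code = rep(1:4, times = 2),
  gruppe = rep(2:3, each = 4),
  proz_grouped = c(14.751689, 22.063523, 35.441111, 27.743676,
                   7.575869, 23.420090, 38.513576, 30.490465)
)

# copy() protects df1 from the in-place update performed by :=
res <- copy(df1)[code %in% 3:4, code := NA][
  , .(proz_grouped = sum(proz_grouped)), by = .(code, gruppe)]
print(res)
```

The extra copy costs memory on large tables, so it is only worth it when df1 must stay unchanged.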

Related

R Data Table add rows to each group if not existing

I have a data table with multiple groups. Each group I'd like to fill with rows containing the values in vals if they are not already present. Additional columns should be filled with NAs.
DT = data.table(group = c(1,1,1,2,2,3,3,3,3), val = c(1,2,4,2,3,1,2,3,4), somethingElse = rep(1,9))
vals = data.table(val = c(1,2,3,4))
What I want:
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
The order of val does not have to be increasing; the values may also be appended at the beginning or end of each group.
I don't know how to approach this problem. I've thought about using rbindlist(...,fill = TRUE), but then the values will be simply appended.
I think some expression with DT[, lapply(...), by = c("group")] might be useful here but I have no idea how to check if a value already exists.
You can use a cross-join:
setDT(DT)[
CJ(group = group, val = val, unique = TRUE),
on = .(group, val)
]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
Another way to solve your problem:
DT[, .SD[vals, on="val"], by=group]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
# or
DT[CJ(group, val, unique=TRUE), on=.NATURAL]
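Note that CJ(group, val, unique = TRUE) only produces values of val that occur somewhere in DT. If vals could contain values that never appear in DT, a possible variation is to build the cross-join from vals directly. A sketch using the question's DT and vals (the names grid and res are only illustrative):

```r
library(data.table)

DT <- data.table(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 val = c(1, 2, 4, 2, 3, 1, 2, 3, 4),
                 somethingElse = rep(1, 9))
vals <- data.table(val = c(1, 2, 3, 4))

# Every group paired with every requested value, whether or not it occurs in DT
grid <- CJ(group = unique(DT$group), val = vals$val)

# Join DT onto the grid: combinations missing from DT get NA in the other columns
res <- DT[grid, on = .(group, val)]
print(res)
```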
I will just add this answer for a slightly more complex case:
#Raw Data
DT = data.table(group = c(1,1,2,2,2,3,3,3,3),
x = c(1,2,1,3,4,1,2,3,4),
y = c(2,4,2,6,8,2,4,6,8),
somethingElse = rep(1,9))
#allowed combinations of x and y
DTxy = data.table(x = c(1,2,3,4), y = c(2,4,6,8))
Here, I want to add all x,y combinations from DTxy to each group from DT, if not already present.
I wrote a function that works on subsets.
#function to join subsets on two columns (here: x,y)
DTxyJoin = function(.SD, xy){
.SD = .SD[xy, on = .(x,y)]
return(.SD)
}
I then applied the function to each group:
#add x and y to each group if missing
DTres = DT[, DTxyJoin(.SD, DTxy), by = c("group")]
The Result:
group x y somethingElse
1: 1 1 2 1
2: 1 2 4 1
3: 1 3 6 NA
4: 1 4 8 NA
5: 2 1 2 1
6: 2 2 4 NA
7: 2 3 6 1
8: 2 4 8 1
9: 3 1 2 1
10: 3 2 4 1
11: 3 3 6 1
12: 3 4 8 1
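The helper function is not strictly required; the same per-group join can probably be written inline with .SD, as in the earlier single-column answer. A sketch using the data above:

```r
library(data.table)

DT <- data.table(group = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
                 x = c(1, 2, 1, 3, 4, 1, 2, 3, 4),
                 y = c(2, 4, 2, 6, 8, 2, 4, 6, 8),
                 somethingElse = rep(1, 9))
DTxy <- data.table(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))

# For each group, join that group's rows onto the allowed (x, y) pairs;
# pairs absent from the group come back with NA in somethingElse
DTres <- DT[, .SD[DTxy, on = .(x, y)], by = group]
print(DTres)
```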

R Data Table - join but filter with update

I'm trying to figure out how to join 2 data tables and update the first but with a filter applied.
DT<-data.table(a=rep(1:3,3),b=seq(1:9))
DT
a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
7: 1 7
8: 2 8
9: 3 9
DT2 <- data.table(b=seq(1:9), c=rep(10,9))
DT2
b c
1: 1 10
2: 2 10
3: 3 10
4: 4 10
5: 5 10
6: 6 10
7: 7 10
8: 8 10
9: 9 10
I can do a basic equijoin like so
DT[DT2, on=c(b="b")]
But what I'd like to do logically is this
DT[a==3,DT2, on=c(b="b")]
but I get the following error
Error in `[.data.table`(DT, a == 3, DT2, on = c(b = "b")) :
logical error. i is not a data.table, but 'on' argument is provided.
I can reverse the order of the join and apply the filter...
DT2[DT[a==3,], on=c(b="b")]
b c a
1: 3 10 3
2: 6 10 3
3: 9 10 3
Which gives the correct rows but the column order is incorrect. That aside I'd like to update DT with c but only for the rows I've filtered in DT and that satisfy the join.
If this was SQL I would use an update with a subquery like so:
UPDATE
DT
set
c = (select c from DT2 where DT2.b = DT.B)
WHERE
DT.a=3
I seem to be going in circles with the Data table syntax - can anyone point me in the right direction?
Cheers
David
One option that avoids creating a dummy variable is to assign c only within the filtered rows:
DT[a==3, c := DT2[DT[a==3], c, on = c(b="b")]]
DT
# a b c
#1: 1 1 NA
#2: 2 2 NA
#3: 3 3 10
#4: 1 4 NA
#5: 2 5 NA
#6: 3 6 10
#7: 1 7 NA
#8: 2 8 NA
#9: 3 9 10
You can create a dummy variable a in DT2, join on both columns a and b and then Update:
DT[DT2[, c(a = 3, .SD)], c := i.c, on = c("a", "b")]
DT
# a b c
#1: 1 1 NA
#2: 2 2 NA
#3: 3 3 10
#4: 1 4 NA
#5: 2 5 NA
#6: 3 6 10
#7: 1 7 NA
#8: 2 8 NA
#9: 3 9 10
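A third possibility, shown here only as a sketch, is to keep the plain update join on b and move the filter into the assigned value. It assumes a data.table version that provides fifelse() (1.12.4 or later, as far as I know):

```r
library(data.table)

DT <- data.table(a = rep(1:3, 3), b = 1:9)
DT2 <- data.table(b = 1:9, c = rep(10, 9))

# Update join: inside j, `a` is DT's column for the matched rows
# and i.c is the matching value of c from DT2
DT[DT2, on = "b", c := fifelse(a == 3, i.c, NA_real_)]
print(DT)
```

This mirrors the SQL version fairly closely: the join plays the role of the subquery and the fifelse() condition plays the role of the WHERE clause.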

Count how many times values has changed in column using R

Hi, I want to count how many times the value in a column changes within each group, and how many unique values each group contains. I am mostly getting what I want, but the NA observation is being counted, which I do not want.
df <- data.frame(x=c("a",'a', "a", "b",'b', "b", "c",'c', "d")
,y=c(1,2,NA,3,3,3,2,1,5))
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(y), by=x][]
setDT(df)[, count := uniqueN(y),by=x][]
x y wanted count
1: a 1 1 3
2: a 2 2 3
3: a NA 3 3
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
Desired results:
x y wanted count
1: a 1 1 2
2: a 2 2 2
3: a NA 2 2
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
I tried rleid(!is.na(y)), but it does not work as I expected. Thank you.
We can replace each NA with the previous non-NA element (na.locf from zoo), take the rleid of the result to get 'wanted', and count the unique non-NA elements to get 'count'. Note that na.locf drops leading NAs by default (na.rm = TRUE), so this assumes that no group starts with an NA.
library(zoo)
setDT(df)[, c('wanted', 'count') := list(rleid(na.locf(y)), uniqueN(y, na.rm = TRUE)), x]
df
# x y wanted count
#1: a 1 1 2
#2: a 2 2 2
#3: a NA 2 2
#4: b 3 1 1
#5: b 3 1 1
#6: b 3 1 1
#7: c 2 1 2
#8: c 1 2 2
#9: d 5 1 1
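If adding zoo as a dependency is undesirable, a similar result can probably be obtained with data.table's own nafill(). This is a sketch under two assumptions: a data.table version that ships nafill() (1.12.4 or later, as far as I know), and a numeric y.

```r
library(data.table)

df <- data.table(x = c("a", "a", "a", "b", "b", "b", "c", "c", "d"),
                 y = c(1, 2, NA, 3, 3, 3, 2, 1, 5))

df[, `:=`(
  # carry the last observed value forward so an NA does not start a new run
  wanted = rleid(nafill(y, type = "locf")),
  # count distinct values, ignoring NA
  count = uniqueN(y, na.rm = TRUE)
), by = x]
print(df)
```

Unlike na.locf with its defaults, nafill() keeps the vector length, so a leading NA stays NA and would be counted as its own run.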

Replicating rows in data.table by column value

I have a dataset that is structured as following:
data <- data.table(ID=1:10,Tenure=c(2,3,4,2,1,1,3,4,5,2),Var=rnorm(10))
ID Tenure Var
1: 1 2 -0.72892371
2: 2 3 -1.73534591
3: 3 4 0.47007030
4: 4 2 1.33173044
5: 5 1 -0.07900914
6: 6 1 0.63493316
7: 7 3 -0.62710577
8: 8 4 -1.69238758
9: 9 5 -0.85709328
10: 10 2 0.10716830
I need to replicate each row N = Tenure times, e.g. the first row 2 times (since Tenure = 2).
I need my transformed dataset to look like the following:
setkey(data,ID)
print(data[,.(ID=rep(ID,Tenure))][data][, Indx := 1:.N, by=ID])
ID Tenure Var Indx
1: 1 2 -0.7289237 1
2: 1 2 -0.7289237 2
3: 2 3 -1.7353459 1
4: 2 3 -1.7353459 2
5: 2 3 -1.7353459 3
6: 3 4 0.4700703 1
...
...
Is there a more efficient (more data.table) way to do this? My way is pretty slow. I was thinking there should be a way to do this using a by-without-by merge using .EACHI?
I don't think using a key/merge is helpful here. Just expand by passing a vector of row indices:
DT <- data[rep(1:.N,Tenure)][,Indx:=1:.N,by=ID]
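The row-index approach can be checked on a reproducible version of the data. In this sketch Var is filled with fixed values instead of rnorm() (an assumption made only for reproducibility), and rowid() is used as a possible alternative to the grouped 1:.N counter:

```r
library(data.table)

data <- data.table(ID = 1:10,
                   Tenure = c(2, 3, 4, 2, 1, 1, 3, 4, 5, 2),
                   Var = (1:10) / 10)  # fixed values instead of rnorm()

# Repeat each row index Tenure times, then subset by those indices
expanded <- data[rep(seq_len(.N), Tenure)]

# rowid() numbers the rows within each ID without an explicit by
expanded[, Indx := rowid(ID)]
print(head(expanded))
```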
You could try:
library(splitstackshape)
expandRows(data, "Tenure", drop = FALSE)[,Indx:=1:.N,by=ID][]
Or
library(dplyr)
library(splitstackshape)
expandRows(data, "Tenure", drop = FALSE) %>%
group_by(ID) %>%
mutate(Indx = row_number(Tenure))
Which gives:
ID Tenure Var Indx
1: 1 2 -0.8808717 1
2: 1 2 -0.8808717 2
3: 2 3 0.5962590 1
4: 2 3 0.5962590 2
5: 2 3 0.5962590 3
6: 3 4 0.1197176 1
7: 3 4 0.1197176 2
8: 3 4 0.1197176 3
9: 3 4 0.1197176 4
10: 4 2 -0.2821739 1

imputing forward / backward

I am trying to impute some longitudinal data in this way (see below). For each individual (id), if the first values are NA, I would like to impute them using the first observed value for that individual, regardless of when that occurs. Then, I would like to impute forward based on the last value observed for each individual (see imputed below).
The values of var do not necessarily increase monotonically, and they might be character rather than numeric.
I have tried several ways to do this, but still I cannot get a satisfactory solution.
Any ideas?
id <- c(1,1,1,1,1,1,1,2,2,2,2)
time <- c(1,2,3,4,5,6,7,3,5,7,9)
var <- c(NA,NA,1,NA,2,3,NA,NA,2,3,NA)
imputed <- c(1,1,1,1,2,3,3,2,2,3,3)
dat <- data.table(id, time, var, imputed)
id time var imputed
1: 1 1 NA 1
2: 1 2 NA 1
3: 1 3 1 1
4: 1 4 NA 1
5: 1 5 2 2
6: 1 6 3 3
7: 1 7 NA 3
8: 2 3 NA 2
9: 2 5 2 2
10: 2 7 3 3
11: 2 9 NA 3
library(zoo)
dat[, newimp := na.locf(na.locf(var, FALSE), fromLast=TRUE), by = id]
dat
# id time var imputed newimp
# 1: 1 1 NA 1 1
# 2: 1 2 NA 1 1
# 3: 1 3 1 1 1
# 4: 1 4 NA 1 1
# 5: 1 5 2 2 2
# 6: 1 6 3 3 3
# 7: 1 7 NA 3 3
# 8: 2 3 NA 2 2
# 9: 2 5 2 2 2
#10: 2 7 3 3 3
#11: 2 9 NA 3 3
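For numeric data, the same forward-then-backward fill can probably be done without zoo by chaining data.table's nafill(). This is only a sketch: it assumes a data.table version that includes nafill() (1.12.4 or later, as far as I know), and nafill() may not accept character vectors, so the zoo version above remains the safer choice for character data.

```r
library(data.table)

id <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
time <- c(1, 2, 3, 4, 5, 6, 7, 3, 5, 7, 9)
var <- c(NA, NA, 1, NA, 2, 3, NA, NA, 2, 3, NA)
imputed <- c(1, 1, 1, 1, 2, 3, 3, 2, 2, 3, 3)
dat <- data.table(id, time, var, imputed)

# Step 1: carry the last observation forward (leading NAs remain NA)
# Step 2: fill the remaining leading NAs backward from the first observed value
dat[, newimp := nafill(nafill(var, type = "locf"), type = "nocb"), by = id]
print(dat)
```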
