Unexpected result using unique inside a data.table - r

Given a data.table (with version 1.9.5)
TEST <- data.table(1:20,rep(1:5,each=4, times=1))
If I run this:
TEST[unique(V2)]
I get this result:
V1 V2
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
Is it really the intended behaviour or a bug?
Or am I just not using it properly?
I was reading the "R book" and in an example they use TEST[unique(Vegetation),] and say it's intended to select a subset of rows unique for the vegetation.
I expected to get something like
V1 V2
1: 1 1
2: 5 2
3: 9 3
4: 13 4
5: 17 5
Though I understand that would need an aggregation criterion to be specified.

TEST[,unique(V2)] gives [1] 1 2 3 4 5, so TEST[unique(V2)] is the same as TEST[1:5]. Since TEST[1:5] is supposed to give you the first 5 rows and that's what you get, there is no bug.
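A quick way to check this is the minimal sketch below, which uses the TEST table from the question. The names idx, by_values and first_five are only for illustration. It compares subsetting by unique(V2) with subsetting by the row numbers 1:5.

```r
library(data.table)

TEST <- data.table(1:20, rep(1:5, each = 4, times = 1))

# unique(V2) is evaluated inside the table and yields plain integers
idx <- TEST[, unique(V2)]

# those integers are then interpreted as row numbers in i
by_values  <- TEST[unique(V2)]
first_five <- TEST[1:5]

print(idx)
print(isTRUE(all.equal(by_values, first_five)))
```

Both subsets should compare as equal, which is why the result looks like "the first five rows" rather than "one row per value of V2".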
To get your expected result, you can do this:
TEST[!duplicated(V2)]
# V1 V2
#1: 1 1
#2: 5 2
#3: 9 3
#4: 13 4
#5: 17 5
or this:
TEST[, V1[1], by = V2]
# V2 V1
#1: 1 1
#2: 2 5
#3: 3 9
#4: 4 13
#5: 5 17
or, as @Arun reminds me, there is now a data.table method for unique:
unique(TEST, by="V2")
# V1 V2
#1: 1 1
#2: 5 2
#3: 9 3
#4: 13 4
#5: 17 5

Related

Numbering groups according to value changes in a column with data.table (R)

I would like to do something very basic with data.table but I don't know how to do this!
I have this data :
test <- data.table(exo = c(1,1,1,1,1,1,1,1), number = c(1,2,3,4,5,6,7,8), remark = c("OK","OK","KO","KO","OK","OK","OK","KO"))
exo number remark
1: 1 1 OK
2: 1 2 OK
3: 1 3 KO
4: 1 4 KO
5: 1 5 OK
6: 1 6 OK
7: 1 7 OK
8: 1 8 KO
And I would like to number groups (the very simple form is test[ , indic_num := .GRP, by = .(exo, remark)]), but I would like indic_num to treat every change in remark as a new group.
So, desired output :
exo number remark indic_num
1: 1 1 OK 1
2: 1 2 OK 1
3: 1 3 KO 2
4: 1 4 KO 2
5: 1 5 OK 3
6: 1 6 OK 3
7: 1 7 OK 3
8: 1 8 KO 4
Can someone help me?
We can use rleid for remark so every change is considered as a new group.
library(data.table)
test[ , indic_num := .GRP, by = .(exo, rleid(remark))]
test
# exo number remark indic_num
#1: 1 1 OK 1
#2: 1 2 OK 1
#3: 1 3 KO 2
#4: 1 4 KO 2
#5: 1 5 OK 3
#6: 1 6 OK 3
#7: 1 7 OK 3
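To see what rleid contributes on its own, here is a small sketch on a bare vector. The remark vector below is an assumption: it follows the eight-row table printed in the question, and run_id and value_id are illustrative names. It contrasts run-length ids with ids based on the distinct values only, which is what grouping by remark alone amounts to.

```r
library(data.table)

remark <- c("OK", "OK", "KO", "KO", "OK", "OK", "OK", "KO")

# rleid starts a new id whenever the value differs from the previous one
run_id <- rleid(remark)

# one id per distinct value, numbered in order of first appearance
value_id <- match(remark, unique(remark))

print(run_id)
print(value_id)
```

The second "OK" run gets its own id with rleid, whereas the value-based id sends it back to the first group.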
With dplyr, we can use cur_group_id
library(dplyr)
library(data.table)
test %>%
group_by(exo, grp = rleid(remark)) %>%
mutate(indic_num = cur_group_id()) %>%
ungroup %>%
select(-grp)
# A tibble: 7 x 4
# exo number remark indic_num
# <dbl> <dbl> <chr> <int>
#1 1 1 OK 1
#2 1 2 OK 1
#3 1 3 KO 2
#4 1 4 KO 2
#5 1 5 OK 3
#6 1 6 OK 3
#7 1 7 OK 3
With data.table, we could also do (assuming 'exo' is ordered)
test[, indic_num := rleid(exo, remark)]
test
# exo number remark indic_num
#1: 1 1 OK 1
#2: 1 2 OK 1
#3: 1 3 KO 2
#4: 1 4 KO 2
#5: 1 5 OK 3
#6: 1 6 OK 3
#7: 1 7 OK 3

R data.table "j" reference to "by" variables very unintuitive?

I'm just doing the data.table DataCamp exercises and there is something which really disturbs my sense of logic.
Somehow columns which are referred to by the "by" operator are treated differently from other columns?
The used data table is the following:
DT
x y z
1: 2 1 2
2: 1 3 4
3: 2 5 6
4: 1 7 8
5: 2 9 10
6: 2 11 12
7: 1 13 14
When I enter DT[,sum(x),x] I would expect:
x V1
1: 2 8
2: 1 3
but I get:
x V1
1: 2 2
2: 1 1
For other columns I get the group sum as I would expect:
> DT[,sum(y),x]
x V1
1: 2 26
2: 1 23
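Inside j, a column used in by is a length-one vector holding the current group's value, so sum(x) simply returns that value. One way to get around this is to give the grouping variable a different name: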
setnames(DT[, sum(x), .(xN=x)], "xN", "x")[]
# x V1
#1: 2 8
#2: 1 3
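The behaviour can be inspected directly with the following sketch. It rebuilds DT from the printed table in the question, which is an assumption about how it was created, and lens and sums are illustrative names. It shows the length of x inside j next to the group size .N.

```r
library(data.table)

DT <- data.table(x = c(2, 1, 2, 1, 2, 2, 1),
                 y = seq(1, 13, by = 2),
                 z = seq(2, 14, by = 2))

# inside j, the grouping column x has length 1, while .N is the group size
lens <- DT[, .(len_x = length(x), n = .N), by = x]

# so the group sum of x can be rebuilt as group value times group size
sums <- DT[, .(V1 = x * .N), by = x]

print(lens)
print(sums)
```

Multiplying by .N is just one way to recover the sum. Renaming the grouping variable as shown above keeps the full column available in j instead.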

R data.table: summarise values of several rows

I have a data.table in R which looks like this one:
code gruppe proz_grouped
1: 1 2 14.751689
2: 2 2 22.063523
3: 3 2 35.441111
4: 4 2 27.743676
5: 1 3 7.575869
6: 2 3 23.420090
7: 3 3 38.513576
8: 4 3 30.490465
Is there an easy, elegant way to get the sum of proz_grouped for the codes (code) 3 and 4 by group gruppe?
The result should look something like this:
code gruppe proz_grouped
1: 1 2 14.751689
2: 2 2 22.063523
3: NA 2 63.184787
4: 1 3 7.575869
5: 2 3 23.420090
6: NA 3 69.004041
Since code cannot be summarized, I would expect an NA for the code column.
Thanks
dt[, .(proz_grouped = sum(proz_grouped))
, by = .(code = replace(code, code > 2, NA), gruppe)]
# code gruppe proz_grouped
#1: 1 2 14.751689
#2: 2 2 22.063523
#3: NA 2 63.184787
#4: 1 3 7.575869
#5: 2 3 23.420090
#6: NA 3 69.004041
We can use recode to change the values and then do the group by sum
library(data.table)
library(car)
df1[, code := recode(code, "c(3,4)=NA")
][, list(proz_grouped = sum(proz_grouped)), .(code, gruppe)]
# code gruppe proz_grouped
#1: 1 2 14.751689
#2: 2 2 22.063523
#3: NA 2 63.184787
#4: 1 3 7.575869
#5: 2 3 23.420090
#6: NA 3 69.004041
Or use %in% to change 3, 4 into NA, group by 'code', 'gruppe' and get the sum of 'proz_grouped'
df1[code %in% 3:4, code := NA][,
.(proz_grouped = sum(proz_grouped)) ,.(code, gruppe)]
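Note that := changes df1 by reference, so the original codes 3 and 4 are gone afterwards. If that matters, a possible variant is sketched below. The table is rebuilt from the question's printout, and the name res is only illustrative. It does the recoding on a copy.

```r
library(data.table)

df1 <- data.table(
  code = rep(1:4, times = 2),
  gruppe = rep(2:3, each = 4),
  proz_grouped = c(14.751689, 22.063523, 35.441111, 27.743676,
                   7.575869, 23.420090, 38.513576, 30.490465)
)

# recode on a copy so that df1 keeps its original codes
res <- copy(df1)[code %in% 3:4, code := NA][
  , .(proz_grouped = sum(proz_grouped)), by = .(code, gruppe)]

print(res)
print(df1$code)
```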

Complex restructuring of R dataframe

I have a dataframe like this:
participant v1 v2 v3 v4 v5 v6
1 4 2 NA 9 7 2
2 NA 6 8 NA NA 1
3 NA NA 5 4 NA 5
4 1 1 NA 2 3 NA
Every two consecutive variables (v1 and v2, v3 and v4, v5 and v6) belong to each other (this is what I call "count" later).
I am desperately searching for a way to get the following:
participant count v(odd numbers) v(even numbers)
1 1 4 2
1 2 NA 9
1 3 7 2
2 1 NA 6
2 2 8 NA
2 3 NA 1
3 1 NA NA
3 2 5 4
3 3 NA 5
4 1 1 1
4 2 NA 2
4 3 3 NA
As this is my first question on stackoverflow ever, I hope you understand my request. I searched a lot for similar problems (and solutions to them) but found nothing. I would very much appreciate your support.
We can use melt
library(data.table)
melt(setDT(d1), measure = list(paste0("v", seq(1, 6, by= 2)),
paste0("v", seq(2,6, by = 2))))[order(participant)]
# participant variable value1 value2
# 1: 1 1 4 2
# 2: 1 2 NA 9
# 3: 1 3 7 2
# 4: 2 1 NA 6
# 5: 2 2 8 NA
# 6: 2 3 NA 1
# 7: 3 1 NA NA
# 8: 3 2 5 4
# 9: 3 3 NA 5
#10: 4 1 1 1
#11: 4 2 NA 2
#12: 4 3 3 NA
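Since d1 is not defined in the post, here is a self-contained sketch. The construction of d1 is an assumption reconstructed from the output above, and the names long, count, odd and even are illustrative. It also names the resulting columns via variable.name and value.name.

```r
library(data.table)

# assumed input, with NA where the question's table is blank
d1 <- data.frame(participant = 1:4,
                 v1 = c(4, NA, NA, 1), v2 = c(2, 6, NA, 1),
                 v3 = c(NA, 8, 5, NA), v4 = c(9, NA, 4, 2),
                 v5 = c(7, NA, NA, 3), v6 = c(2, 1, 5, NA))

# melt the odd and even columns in parallel into two value columns
long <- melt(setDT(d1), id.vars = "participant",
             measure.vars = list(c("v1", "v3", "v5"), c("v2", "v4", "v6")),
             variable.name = "count",
             value.name = c("odd", "even"))[order(participant)]

print(long)
```

The count column then plays the role of the pair index (1 for v1/v2, 2 for v3/v4, 3 for v5/v6).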

Replace row values in data.table using 'by' and conditions

I am trying to replace certain row values in a column according to conditions in another column, within a grouping.
EDIT: edited to highlight the recursive nature of the problem.
E.g.
DT = data.table(y=rep(c(1,3), each = 3)
,v=as.numeric(c(1,2,4,4,5,8))
,x=as.numeric(rep(c(9:11),each=2)),key=c("y","v"))
DT
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 10
4: 3 4 10
5: 3 5 11
6: 3 8 11
Within each 'y', I then want to replace the values of 'x' in rows whose 'v' equals another observation of 'v' plus t (e.g. t = 3) with 2222 (or, in reality, the result of a function), giving the following result:
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 2222
4: 3 4 10
5: 3 5 11
6: 3 8 2222
I have tried the following, but to no avail.
DT[which((v-3) %in% v), x:= 2222, y][]
And it mysteriously (?) results in:
y v x
1: 1 1 9
2: 1 2 9
3: 1 4 2222
4: 3 4 2222
5: 3 5 2222
6: 3 8 2222
Running:
DT[,print(which((v-3) %in% v)), by =y]
This indicates that it does the correct indexing within the groups, but what happens from there (or the lack thereof) I don't understand.
You could try using replace (which could have some overhead because it copies the whole x)
DT[, x:=replace(x, which(v %in% (v+3)), 2222), by=y]
# y v x
#1: 1 1 9
#2: 1 2 9
#3: 1 4 2222
#4: 3 4 10
#5: 3 5 11
#6: 3 8 2222
Alternatively, you could create a logical index column and then do the assignment in the next step
DT[,indx:=v %in% (v+3), by=y][(indx), x:=2222, by=y][, indx:=NULL]
DT
# y v x
#1: 1 1 9
#2: 1 2 9
#3: 1 4 2222
#4: 3 4 10
#5: 3 5 11
#6: 3 8 2222
Or slightly modifying your own approach using .I in order to create an index
indx <- DT[, .I[which((v-3) %in% v)], by = y]$V1
DT[indx, x := 2222]
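As for why the original attempt behaves the way it does: as far as I understand it, the expression in i is evaluated once on the whole table, before any grouping by y takes place. The sketch below uses the DT from the question (global_idx and group_idx are illustrative names) to compare the row numbers obtained without grouping with those obtained per group through .I.

```r
library(data.table)

DT <- data.table(y = rep(c(1, 3), each = 3),
                 v = as.numeric(c(1, 2, 4, 4, 5, 8)),
                 x = as.numeric(rep(9:11, each = 2)),
                 key = c("y", "v"))

# condition evaluated on the whole table, which is what happens in i
global_idx <- DT[, which((v - 3) %in% v)]

# condition evaluated within each y, mapped back to row numbers by .I
group_idx <- DT[, .I[(v - 3) %in% v], by = y]$V1

print(global_idx)
print(group_idx)
```

The ungrouped index covers rows 3 to 6, which matches the rows that were overwritten in the question, while the grouped index only picks rows 3 and 6.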
