Assigning values in first rows of groups in a data.table - r

I'd like to assign values only to the first row of each group in a data.table.
For example (simplified): my data.table is DT with the following content
x v
1 1
2 2
2 3
3 4
3 5
3 6
The key of DT is x.
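For reference, a table like this could be built with, for example:
library(data.table)
DT <- data.table(x = c(1, 2, 2, 3, 3, 3), v = 1:6)
setkey(DT, x)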
I want to address the first row of every group.
This works fine: DT[, .SD[1], by=x]
x v
1 1
2 2
3 4
Now I want to set v to 0 in exactly those first rows.
But none of this is working:
DT[, .SD[1], by=x]$v <- 0
DT[, .SD[1], by=x]$v := 0
DT[, .SD[1], by=x, v:=0]
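These fail because DT[, .SD[1], by=x] returns a new aggregated table rather than a reference into DT, so nothing done to that result can update DT itself (and := is only valid inside a DT[...] call). A sketch of the underlying copy problem:
first <- DT[, .SD[1], by=x] # a new 3-row table, detached from DT
first$v <- 0                # modifies only the copy
DT                          # unchanged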
I searched the package's help and the links provided there, but I just can't get it to work.
I found notes saying this would not work, but no examples/solutions that helped me out.
I'd be very glad for any suggestions.
(I like this package very much and I don't want to go back to a data.frame, where I had this working.)
Edit:
I'd like to have a result like this:
x v
1 0
2 0
2 3
3 0
3 5
3 6
This is not working:
DT[, .SD[1], by=x] <- DT[, .SD[1], by=x][, v:=0]

Another option would be:
DT[,v:={v[1]<-0L;v}, by=x]
DT
# x v
#1: 1 0
#2: 2 0
#3: 2 3
#4: 3 0
#5: 3 5
#6: 3 6
Or
DT[DT[, .I[1], by=x]$V1, v:=0]
DT
# x v
#1: 1 0
#2: 2 0
#3: 2 3
#4: 3 0
#5: 3 5
#6: 3 6
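The inner query DT[, .I[1], by=x] returns the global row number (.I) of each group's first row, in a column auto-named V1; those row numbers are then used as i so that v is updated by reference in exactly those rows. The inner step alone, as a sketch:
DT[, .I[1], by=x]
#    x V1
# 1: 1  1
# 2: 2  2
# 3: 3  4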

With a little help from Roland's solution, it looks like you could do the following. It simply concatenates a zero with each group's values of v, dropping the first.
DT[, v := c(0L, v[-1]), by = x] ## must have the "L" after 0, as 0L
which results in
DT
# x v
# 1: 1 0
# 2: 2 0
# 3: 2 3
# 4: 3 0
# 5: 3 5
# 6: 3 6
Note: the j expression could also be written as v := c(integer(1), v[-1])

Related

mutate variable by condition using two variables in long format data.table in r

In this data.table:
dt <- data.table(id=c(1,1,1,2,2,2), time=rep(1:3,2), x=c(1,0,0,0,1,0))
dt
id time x
1: 1 1 1
2: 1 2 0
3: 1 3 0
4: 2 1 0
5: 2 2 1
6: 2 3 0
I need the following:
id time x
1: 1 1 1
2: 1 2 1
3: 1 3 1
4: 2 1 0
5: 2 2 1
6: 2 3 1
that is
if x==1 at time==1 then x=1 at times 2 and 3, by id
if x==1 at time==2 then x=1 at time 3, by id
For the first point (I guess the second one will be similar), I have tried approaches mentioned in similar questions I posted before (here and here), but none work:
dt[x==1[time == 1], x := x[time == 1], id] gives an error
setDT(dt)[, x2:= ifelse(x==1 & time==1, x[time==1], x), by=id] changes x only at time 1 (so, no real change observed)
It would be much easier to work with the data.table in wide format, but I keep facing this kind of problem in long format, and I don't want to reshape my data all the time.
Thank you!
EDIT:
The answer provided by @GregorThomas, dt[, x := cummax(x), by = id], works for the problem that I presented.
Now I ask the same question for a character variable:
dt2 <- data.table(id=c(1,1,1,2,2,2), time=rep(1:3,2), x=c('a','b','b','b','a','b'))
dt2
id time x
1: 1 1 a
2: 1 2 b
3: 1 3 b
4: 2 1 b
5: 2 2 a
6: 2 3 b
In the table above, how could the following be done:
if x=='a' at time==1 then x='a' at times 2 and 3, by id
if x=='a' at time==2 then x='a' at time 3, by id
Using the cumulative maximum function cummax:
dt[, x := cummax(x), by = id]
dt
# id time x
# 1: 1 1 1
# 2: 1 2 1
# 3: 1 3 1
# 4: 2 1 0
# 5: 2 2 1
# 6: 2 3 1
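For the character follow-up in the edit, the same carry-forward idea can be applied by cumulating an 'a' flag and overwriting where it is set — a sketch, not part of the original answer, using data.table's fifelse:
dt2[, x := fifelse(as.logical(cummax(x == 'a')), 'a', x), by = id]
dt2
#    id time x
# 1:  1    1 a
# 2:  1    2 a
# 3:  1    3 a
# 4:  2    1 b
# 5:  2    2 a
# 6:  2    3 a
Here cummax(x == 'a') stays 1 from the first 'a' onward within each id, mirroring the cummax trick for the numeric case.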

create list from columns of data table expression

Consider the following dt:
dt <- data.table(a=c(1,1,2,3),b=c(4,5,6,4))
That looks like that:
> dt
a b
1: 1 4
2: 1 5
3: 2 6
4: 3 4
Here I'm grouping each column by its unique values and counting how many times each unique value appears:
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 2
2: 2 1 5 1
3: 3 1 6 1
So 1 appears twice in dt and thus a.N is 2, the same logic goes on for the other values.
But the problem is that if these per-column aggregations of the original data.table end up with different numbers of rows, values get recycled.
For example this dt:
dt <- data.table(a=c(1,1,2,3,7),b=c(4,5,6,4,4))
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 3
2: 2 1 5 1
3: 3 1 6 1
4: 7 1 4 3
Warning message:
In as.data.table.list(jval, .named = NULL) :
Item 2 has 3 rows but longest item has 4; recycled with remainder.
That is no longer the right answer, because the b columns should now have only 3 rows, so the shorter vectors got recycled.
This is why I would like to turn the expression dt[,lapply(.SD,function(agg) dt[,.N,by=agg])] into a list whose items can have different lengths, with the item names taken from the column names of the transformed dt.
A sketch of what I mean is:
newlist
$a.agg
1 2 3 7
$a.N
2 1 1 1
$b.agg
4 5 6
$b.N
3 1 1
Or, an even better solution would be a data.table that keeps track of the source column in another column:
dt_final
agg N column
1 2 a
2 1 a
3 1 a
7 1 a
4 3 b
5 1 b
6 1 b
Get the data in long format and then aggregate by group.
library(data.table)
dt_long <- melt(dt, measure.vars = c('a', 'b'))
dt_long[, .N, .(variable, value)]
# variable value N
#1: a 1 2
#2: a 2 1
#3: a 3 1
#4: a 7 1
#5: b 4 3
#6: b 5 1
#7: b 6 1
In the tidyverse:
library(dplyr)
library(tidyr)
dt %>%
  pivot_longer(cols = everything()) %>%
  count(name, value)
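If the flat named list sketched in the question is specifically what's wanted, it can be assembled from the same per-column aggregates — a sketch (the newlist construction is my own, not part of the answers above):
newlist <- unlist(
  lapply(names(dt), function(col) {
    agg <- dt[, .N, by = col]  # counts per unique value of this column
    setNames(list(agg[[col]], agg$N), paste0(col, c('.agg', '.N')))
  }),
  recursive = FALSE
)
newlist
# $a.agg
# [1] 1 2 3 7
# $a.N
# [1] 2 1 1 1
# $b.agg
# [1] 4 5 6
# $b.N
# [1] 3 1 1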

R Alternatives to a for loop for searching through a large dataset

The goal here is to count, for each entry in column b, how many entries in column a fall within +/-1 of it (or another range, as required). A simplified version is provided:
a <- c("1231210","1231211", "1231212", "98798", "98797", "98796", "555125", "555127","555128")
b <- c("1", "2", "3", "4", "5", "6", "1231209", "98797", "555126")
df <- data.frame(a, b)
I merged this data into a data frame to simulate my actual dataset, converted the columns to numerics, and wrote the following function to get my desired output. (Note: column a need not be part of the df; it could be a separate vector, I suppose.)
df$c <- mapply(
  function(x) {
    count = 0
    for (i in df$a) {
      if (abs(i - x) <= 1) {
        count = count + 1
      }
    }
    paste0(count)
  },
  df$b
)
        a       b c
1 1231210       1 0
2 1231211       2 0
3 1231212       3 0
4   98798       4 0
5   98797       5 0
6   98796       6 0
7  555125 1231209 1
8  555127   98797 3
9  555128  555126 2
While this appears to work fine for the trial dataset, my actual dataset has over 2 million rows, which means on the order of (2 million)^2 iterations (still running after 3 hours). I was wondering if there is an alternative strategy to tackle this, preferably using base R functions only.
I'm quite new to R, and a common suggestion is to use vectorization to improve efficiency. However, I have no clue whether that is possible in this case, judging from the examples provided on the net.
Would love to hear any suggestions and feel free to point out mistakes. Thanks!
As your data is quite large, the outer and lapply approaches will be quite slow (for outer you would need 14901.2 Gb of RAM). I suggest using data.table:
require(data.table)
dt <- as.data.table(df)
dt[, id := 1:.N] # add id as maybe you have duplicated values
setkey(dt, id)
dt[, b1 := b - 1L]
dt[, b2 := b + 1L]
x <- dt[dt, on = .(a >= b1, a <= b2)] # non-equi join
x <- x[, .(c = sum(!is.na(b1))), keyby = .(id = i.id)]
dt[x, c := i.c, on = 'id']
dt
# a b id b1 b2 c
# 1: 1231210 1 1 0 2 0
# 2: 1231211 2 2 1 3 0
# 3: 1231212 3 3 2 4 0
# 4: 98798 4 4 3 5 0
# 5: 98797 5 5 4 6 0
# 6: 98796 6 6 5 7 0
# 7: 555125 1231209 7 1231208 1231210 1
# 8: 555127 98797 8 98796 98798 3
# 9: 555128 555126 9 555125 555127 2
dt[, id := NULL][, b1 := NULL][, b2 := NULL] # remove helper cols
p.s. check that a and b are converted to integers before...
Why are vectors a and b characters? They should be numeric:
a <- c(1231210,1231211, 1231212, 98798, 98797, 98796, 555125, 555127,555128)
b <- c(1, 2, 3, 4, 5, 6, 1231209, 98797, 555126)
You can simplify by using only one loop and vectorization:
unlist(lapply(b, function(x) sum(abs(a-x) <= limit)))
where limit is a variable giving the allowed difference. For limit <- 1 you get:
[1] 0 0 0 0 0 0 1 3 2
What about colSums + outer?
transform(
type.convert(data.frame(a, b), as.is = TRUE),
C = colSums(abs(outer(a, b, `-`)) <= 1)
)
output
a b C
1 1231210 1 0
2 1231211 2 0
3 1231212 3 0
4 98798 4 0
5 98797 5 0
6 98796 6 0
7 555125 1231209 1
8 555127 98797 3
9 555128 555126 2
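For the 2-million-row case, a base R approach that avoids the all-pairs comparison entirely is to sort a once and count matches with findInterval — a sketch, assuming integer-valued data:
a_sorted <- sort(as.numeric(a))
b_num <- as.numeric(b)
# count of a within [b - 1, b + 1]:
# (elements of a <= b + 1) minus (elements of a <= b - 2, i.e. < b - 1 for integers)
counts <- findInterval(b_num + 1, a_sorted) - findInterval(b_num - 2, a_sorted)
counts
# [1] 0 0 0 0 0 0 1 3 2
This is O((n + m) log n) instead of O(n * m), so it should finish in seconds rather than hours.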

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get (note the absence of the (3,2) tuple)
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick either, as it leaves the first occurrence of each duplicated value out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the value v in field f is making it confusing.
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3
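A join-based variant that keeps the helper count out of the original table is also possible — a sketch, using the question's dt and f2:
dt[dt[, .N, by = f2][N > 1], on = 'f2'][, N := NULL][]
#    f1 f2
# 1:  1  1
# 2:  2  1
# 3:  4  3
# 4:  5  3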

Add a countdown column to data.table containing rows until a special row encountered

I have a data.table with ordered, labelled data, and I want to add a column that tells me how many records remain until I get to a "special" record that resets the countdown.
For example:
DT = data.table(idx = c(1,3,3,4,6,7,7,8,9),
name = c("a", "a", "a", "b", "a", "a", "b", "a", "b"))
setkey(DT, idx)
#manually add the answer
DT[, countdown := c(3,2,1,0,2,1,0,1,0)]
Gives
> DT
idx name countdown
1: 1 a 3
2: 3 a 2
3: 3 a 1
4: 4 b 0
5: 6 a 2
6: 7 a 1
7: 7 b 0
8: 8 a 1
9: 9 b 0
See how the countdown column tells me how many rows until a row called "b".
The question is how to create that column in code.
Note that the key is not evenly spaced and may contain duplicates (so it is not very useful in solving the problem). In general the non-b names could be different, but I could add a dummy column that is just TRUE/FALSE if the solution requires this.
Here's another idea:
## Create groups that end at each occurrence of "b"
DT[, cd:=0L]
DT[name=="b", cd:=1L]
DT[, cd:=rev(cumsum(rev(cd)))]
## Count down within them
DT[, cd:=max(.I) - .I, by=cd]
# idx name cd
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
I'm sure (or at least hopeful) that a purely "data.table" solution will be posted, but in the meantime, you could make use of rle. In this case, since you're counting down rather than up, we'll use rev to reverse the "name" values before proceeding.
output <- sequence(rle(rev(DT$name))$lengths)
makezero <- cumsum(rle(rev(DT$name))$lengths)[c(TRUE, FALSE)]
output[makezero] <- 0
DT[, countdown := rev(output)]
DT
# idx name countdown
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
Here's a mix of Josh's and Ananda's solutions, in that I use rle to build the groups through which Josh's answer generates the countdown:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rep(t, t+1)]
DT[, cd:=max(.I) - .I, by=cd]
Even better: taking advantage of the fact that there's always only one b (or assuming so here), you could do one better:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rev(sequence(rev(t+1)))-1]
Edit: From the OP's comment, it seems clear that more than one b is possible, and in such cases all b's should be 0. The first step is to create groups in which each run of b's closes off the preceding a's.
DT <- data.table(idx=sample(10), name=c("a","a","a","b","b","a","a","b","a","b"))
t <- rle(DT$name)
val <- cumsum(t$lengths)[t$values == "b"]
DT[, grp := rep(seq(val), c(val[1], diff(val)))]
DT[, val := c(rev(seq_len(sum(name == "a"))),
rep(0, sum(name == "b"))), by = grp]
# idx name grp val
# 1: 1 a 1 3
# 2: 7 a 1 2
# 3: 9 a 1 1
# 4: 4 b 1 0
# 5: 2 b 1 0
# 6: 8 a 2 2
# 7: 6 a 2 1
# 8: 3 b 2 0
# 9: 10 a 3 1
# 10: 5 b 3 0
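For what it's worth, the reversed-cumsum grouping used in Josh's answer can be condensed into a single statement by grouping on the expression directly — a sketch of the same logic:
DT[, countdown := .N - seq_len(.N), by = rev(cumsum(rev(name == "b")))]
Each group ends at a b (and each extra b in a run forms its own one-row group), so the last row of every group, always a b, gets 0.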
