Run iteratively row by row in data.table - r

I have a question related to data.table in R.
For example, I have data like this:
a = data.table(c = 1:10, d = 2:11)
a[1, e := 1]
c d e
1: 1 2 1
2: 2 3 NA
3: 3 4 NA
4: 4 5 NA
5: 5 6 NA
6: 6 7 NA
7: 7 8 NA
8: 8 9 NA
9: 9 10 NA
10: 10 11 NA
Now I want to calculate the value of e row by row, with e equal to (c + d) multiplied by the e of the previous row, so the data.table must be updated row by row here.
I don't want to run a for loop here because it takes a long time. Do any of you have suggestions?

Like this?
a[-1, e := c + d]
a[, e := cumprod(e)]
# c d e
# 1: 1 2 1
# 2: 2 3 5
# 3: 3 4 35
# 4: 4 5 315
# 5: 5 6 3465
# 6: 6 7 45045
# 7: 7 8 675675
# 8: 8 9 11486475
# 9: 9 10 218243025
#10: 10 11 4583103525
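Why this works (my note, not part of the original answer): with e[1] = 1 and e[i] = (c[i] + d[i]) * e[i - 1], each e is just the running product of the c + d factors seeded with 1. So the two steps can also be collapsed into one call, with no need to pre-assign the first row:
# seed the running product with 1, then multiply up the remaining c + d factors
a[, e := cumprod(c(1, (c + d)[-1]))]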
Edit:
Here is a solution using by. However, that won't be faster than a well written for loop (e.g., using set).
a[1, f := 1]
a[, f := if (.GRP == 1) f
         else (c + d) * a[.GRP - 1, f],
  by = seq_len(nrow(a))]
Here is a solution with set:
a[1, g := 1]
for (i in 2:nrow(a)) set(a, i, "g", a[i, c + d] * a[i - 1, g])
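To check the speed claim above, a quick timing sketch (my addition, not part of the original answers; assumes the microbenchmark package is installed):
library(microbenchmark)
microbenchmark(
  cumprod  = {a[1, e := 1]; a[-1, e := c + d]; a[, e := cumprod(e)]},
  set_loop = {a[1, g := 1]
              for (i in 2:nrow(a)) set(a, i, "g", a[i, c + d] * a[i - 1, g])}
)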

Update data.table with another data.table [duplicate]

Assume that, given a data.table da, there is another data.table db which has some column names equal to the column names of da. Some of those columns with equal names also have identical content. My question is: how do I replace the entries in da whose columns match db with the content of db?
Put more simply:
require(data.table)
da <- data.table(a=1:10,b=10:1,c=LETTERS[1:10])
## All (a,b) pairs in db exist in da, but c is different.
db <- data.table(a=c(2,6,8),b=c(9,5,3),c=c('x','y','z'))
## dx takes the c values from db wherever (a, b) matches between da and db.
dx <- db[da, .(a, b, c = fifelse(is.na(c), i.c, c)), on = c('a', 'b')]
Output
> dx
a b c
1: 1 10 A
2: 2 9 x
3: 3 8 C
4: 4 7 D
5: 5 6 E
6: 6 5 y
7: 7 4 G
8: 8 3 z
9: 9 2 I
10: 10 1 J
> da
a b c
1: 1 10 A
2: 2 9 B
3: 3 8 C
4: 4 7 D
5: 5 6 E
6: 6 5 F
7: 7 4 G
8: 8 3 H
9: 9 2 I
10: 10 1 J
> db
a b c
1: 2 9 x
2: 6 5 y
3: 8 3 z
I know that the above achieves my goal, but it feels clumsy. Is there a built-in data.table way to do this?
Okay, I figured it out:
dx <- da[db, c := i.c, on = c('a', 'b')]
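One caveat (my note, not part of the original answer): := updates da by reference, so dx and da end up being the same table. A sketch using copy() if the original da must stay untouched:
# := modifies in place; copy() first if da itself should be preserved
dx <- copy(da)[db, c := i.c, on = c('a', 'b')]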

Removing rows in a R data.table with NAs in specific columns

I have a data.table with a large number of features. I would like to remove the rows that have NAs in certain features only.
Currently I am using the following to handle this:
data.joined.sample <- data.joined.sample %>%
  filter(!is.na(lat)) %>%
  filter(!is.na(long)) %>%
  filter(!is.na(temp)) %>%
  filter(!is.na(year)) %>%
  filter(!is.na(month)) %>%
  filter(!is.na(day)) %>%
  filter(!is.na(hour)) %>%
  .......
Is there a more concise way to achieve this?
str(data.joined.sample)
Classes ‘data.table’ and 'data.frame': 336776 obs. of 50 variables:
We can select those columns, use complete.cases on them to get a logical vector of complete rows, and use that to drop the rows containing NAs:
data.joined.sample[complete.cases(data.joined.sample[colsofinterest]),]
where
colsofinterest <- c("lat", "long", "temp", "year", "month", "day", "hour")
Update
Based on the OP's comments: if it is a data.table, subset the colsofinterest columns and use complete.cases:
data.joined.sample[complete.cases(data.joined.sample[, colsofinterest, with = FALSE])]
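As an aside (not from the original answers): data.table also ships an na.omit method with a cols argument, which does the same thing in one call - a sketch, assuming data.table >= 1.9.6:
library(data.table)
colsofinterest <- c("lat", "long", "temp", "year", "month", "day", "hour")
# drop only the rows with an NA in one of the listed columns
data.joined.sample <- na.omit(data.joined.sample, cols = colsofinterest)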
data.table objects, if that is in fact what you're working with, have a somewhat different syntax for the "[" function. Look through this console session:
> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
> DT[x=="a"&y==1]
x y v
1: a 1 4
> is.na(DT[x=="a"&y==1]$v) <- TRUE # make one item NA
> DT[x=="a"&y==1]
x y v
1: a 1 NA
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT[complete.cases(DT)] # note no comma
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
> DT # But that didn't modify DT; the subset above only returned a value
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT <- DT[complete.cases(DT)] # do this assignment to make permanent
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
Probably not the true "data.table way".

Rolling Join multiple columns independently to eliminate NAs

I am trying to do a rolling join in data.table that brings in multiple columns but rolls over both entire missing rows and individual NAs in particular columns, even when the row is present. By way of example, I have two tables, A and B:
library(data.table)
A <- data.table(v1 = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(6,6,6,4,4,6,4,4,4,6,4,4,4),
                t  = c(10,20,30,60,60,10,40,50,60,20,40,50,60),
                key = c("v1", "v2", "t"))
B <- data.table(v1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(4,4,6,6,4,4,6,6,4,4,6,6),
                t  = c(10,70,20,70,10,70,20,70,10,70,20,70),
                valA = c('a','a',NA,'a',NA,'a','b','a','b','b',NA,'b'),
                valB = c(NA,'q','q','q','p','p',NA,'p',NA,'q',NA,'q'),
                key = c("v1", "v2", "t"))
B
## v1 v2 t valA valB
## 1: 1 4 10 a NA
## 2: 1 4 70 a q
## 3: 1 6 20 NA q
## 4: 1 6 70 a q
## 5: 2 4 10 NA p
## 6: 2 4 70 a p
## 7: 2 6 20 b NA
## 8: 2 6 70 a p
## 9: 3 4 10 b NA
## 10: 3 4 70 b q
## 11: 3 6 20 NA NA
## 12: 3 6 70 b q
If I do a rolling join (in this case a backwards join), it rolls over all the points when a row cannot be found in B, but still includes points when the row exists but the data to be merged are NA:
B[A, roll = -Inf]
## v1 v2 t valA valB
## 1: 1 4 60 a q
## 2: 1 4 60 a q
## 3: 1 6 10 NA q
## 4: 1 6 20 NA q
## 5: 1 6 30 a q
## 6: 2 4 40 a p
## 7: 2 4 50 a p
## 8: 2 4 60 a p
## 9: 2 6 10 b NA
## 10: 3 4 40 b q
## 11: 3 4 50 b q
## 12: 3 4 60 b q
## 13: 3 6 20 NA NA
I would like to do the rolling join in such a way that it rolls over these NAs as well. For a single column, I can subset B to remove the NAs, then roll with A:
C <- B[!is.na(valA), .(v1, v2, t, valA)][A, roll=-Inf]
C
## v1 v2 t valA
## 1: 1 4 60 a
## 2: 1 4 60 a
## 3: 1 6 10 a
## 4: 1 6 20 a
## 5: 1 6 30 a
## 6: 2 4 40 a
## 7: 2 4 50 a
## 8: 2 4 60 a
## 9: 2 6 10 b
## 10: 3 4 40 b
## 11: 3 4 50 b
## 12: 3 4 60 b
## 13: 3 6 20 b
But for multiple columns, I have to do this sequentially, storing the result after each added column and repeating:
B[!is.na(valB), .(v1, v2, t, valB)][C, roll=-Inf]
## v1 v2 t valB valA
## 1: 1 4 60 q a
## 2: 1 4 60 q a
## 3: 1 6 10 q a
## 4: 1 6 20 q a
## 5: 1 6 30 q a
## 6: 2 4 40 p a
## 7: 2 4 50 p a
## 8: 2 4 60 p a
## 9: 2 6 10 p b
## 10: 3 4 40 q b
## 11: 3 4 50 q b
## 12: 3 4 60 q b
## 13: 3 6 20 q b
The end result above is the desired output, but for multiple columns it quickly becomes unwieldy. Is there a better solution?
Joins are about matching up rows. If you want to match rows multiple ways, you'll need multiple joins.
I'd use a loop, but add columns to A (rather than creating new tables C, D, ... following each join):
k = key(A)
bcols = setdiff(names(B), k)
for (col in bcols) A[, (col) :=
  B[!.(as(NA, typeof(B[[col]]))), on = col][.SD, roll = -Inf, ..col]
][]
A
v1 v2 t valA valB
1: 1 4 60 a q
2: 1 4 60 a q
3: 1 6 10 a q
4: 1 6 20 a q
5: 1 6 30 a q
6: 2 4 40 a p
7: 2 4 50 a p
8: 2 4 60 a p
9: 2 6 10 b p
10: 3 4 40 b q
11: 3 4 50 b q
12: 3 4 60 b q
13: 3 6 20 b q
B[!.(NA_character_), on="valA"] is an anti-join that drops rows with NAs in valA. The code above attempts to generalize this (since the NA needs to match the type of the column).
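An equivalent, perhaps more transparent variant of the same loop (my sketch, not from the original answer): subset B with !is.na() instead of constructing a typed NA for the anti-join, passing the join columns explicitly via on=:
k <- key(A)
for (col in setdiff(names(B), k)) {
  # drop the B rows where this value column is NA, then roll the rest onto A
  A[, (col) := B[!is.na(B[[col]])][.SD, ..col, on = k, roll = -Inf]]
}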

rbind and overwrite duplicate rows based on key variable?

> part1<-data.frame(key=c(5,6,7,8,9),x=c("b","d","a","c","b"))
> part1
key x
1 5 b # key==5,x==b
2 6 d
3 7 a
4 8 c
5 9 b
> part2<-data.frame(key=c(1,2,3,4,5), x=c("c","a","b","d","a"))
> part2
key x
1 1 c
2 2 a
3 3 b
4 4 d
5 5 a # key==5,x==a
There are more than 2 data frames, but I'll just use 2 for this example. I then use lapply to put them all in a list called dflist1 and rbind them; for this example I'll just do it manually.
dflist1 <- list(part1, part2)
final <- do.call(rbind, dflist1)
final <- final[order(final$key), ]  # sort by key
Result:
> final
key x
6 1 c
7 2 a
8 3 b
9 4 d
1 5 b #duplicate from part1
10 5 a #duplicate from part2
2 6 d
3 7 a
4 8 c
5 9 b
I want to get rid of the duplicates. It's easy to use !duplicated(), but in this case I specifically want to drop/overwrite the rows from earlier data frames - i.e., in this case the "5 b" from part1 should get dropped/overwritten by the "5 a" from part2. And if there were a part3 with a value "5 b", then the "5 a" from part2 would in turn get dropped/overwritten by the "5 b" from part3.
What I want:
key x
6 1 c
7 2 a
8 3 b
9 4 d
10 5 a #this is from part2, no more duplicate from part1
2 6 d
3 7 a
4 8 c
5 9 b
Current solution: The only thing I can think of is to add a function that flags each data frame with an extra variable, then sort and use !duplicated on that variable, as sketched below... is there an easier or more elegant solution that doesn't require flagging?
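For concreteness, the flagging workaround described above might look like this (my sketch; the .part helper column is hypothetical):
dflist1 <- list(part1, part2)
## tag each part with its position, so later parts win after sorting
flagged <- do.call(rbind, Map(function(d, i) transform(d, .part = i),
                              dflist1, seq_along(dflist1)))
flagged <- flagged[order(flagged$key, -flagged$.part), ]
final <- flagged[!duplicated(flagged$key), c("key", "x")]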
## Create many data.frames
set.seed(1)
## Normally, do this in lapply...
part1 <- data.frame(key=1:6, x=sample(letters, 6))
part2 <- data.frame(key=4:8, x=sample(letters, 5))
part3 <- data.frame(key=8:12, x=sample(letters, 5))
library(data.table)
## Collect all your "parts"
PARTS.LIST <- lapply(ls(pattern="^part\\d+"), function(x) get(x))
DT.part <- rbindlist(PARTS.LIST)
setkey(DT.part, key)
unique(DT.part, by="key")
ORIGINAL UNIQUE
--------- -----------
> DT.part > unique(DT.part, by="key")
key x key x
1: 1 l 1: 1 l
2: 2 v 2: 2 v
3: 3 k 3: 3 k
4: 4 q 4: 4 q
5: 4 i 5: 5 r
6: 5 r 6: 6 f
7: 5 w 7: 7 v
8: 6 f 8: 8 f
9: 6 o 9: 9 j
10: 7 v 10: 10 d
11: 8 f 11: 11 g
12: 8 l 12: 12 m
13: 9 j
14: 10 d
15: 11 g
16: 12 m
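As shown, unique() keeps the first row within each key, whereas the question asks for the row from the latest part. Since setkey is stable (rows within a key keep their rbind order), passing fromLast flips which duplicate survives - a sketch, assuming a data.table version with the fromLast argument (>= 1.9.8):
## keep the last occurrence per key, i.e. the value from the latest part
unique(DT.part, by = "key", fromLast = TRUE)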

Use data.table in R to add multiple columns to a data.table with = with only one function call

This is a direct expansion of this question.
I have a dataset, and I want to find all pairwise combinations of variable v within each combination of variables x and y:
library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,1,6), v=1:18)
x y v
1: a 1 1
2: a 1 2
3: a 6 3
4: a 1 4
5: a 1 5
6: a 6 6
7: b 1 7
8: b 1 8
9: b 6 9
10: b 1 10
11: b 1 11
12: b 6 12
13: c 1 13
14: c 1 14
15: c 6 15
16: c 1 16
17: c 1 17
18: c 6 18
DT[, list(new1 = t(combn(sort(v), m = 2))[, 1],
          new2 = t(combn(sort(v), m = 2))[, 2]),
   by = list(x, y)]
x y new1 new2
1: a 1 1 2
2: a 1 1 4
3: a 1 1 5
4: a 1 2 4
5: a 1 2 5
6: a 1 4 5
7: a 6 3 6
8: b 1 7 8
9: b 1 7 10
10: b 1 7 11
11: b 1 8 10
12: b 1 8 11
13: b 1 10 11
14: b 6 9 12
15: c 1 13 14
16: c 1 13 16
17: c 1 13 17
18: c 1 14 16
19: c 1 14 17
20: c 1 16 17
21: c 6 15 18
The code does what I want, but calling combn twice per group makes it slow for larger datasets. My dataset has more than 3 million rows and more than 1.3 million combinations of x and y.
Any suggestions on how to do this faster?
I would prefer something like:
DT[, list(c("new1", "new2") = t(combn(sort(v), m = 2))), by = list(x, y)]
This should work:
DT[, {
  tmp <- combn(sort(v), m = 2)
  list(new1 = tmp[1, ], new2 = tmp[2, ])
}, by = list(x, y)]
The following also works. The trick is to convert the matrix into a data.table.
DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
If necessary, just rename the columns afterwards:
r2 <- DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
setnames(r2, c("V1", "V2"), c("new1", "new2"))
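Alternatively (my sketch, not from the original answers): since setnames() renames by reference and returns the table, the renaming can be folded into the same call:
## name the new columns inside j, avoiding a separate setnames() step
DT[, setnames(as.data.table(t(combn(sort(v), m = 2))),
              c("new1", "new2")), by = list(x, y)]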
