Update data.table with another data.table [duplicate] - r

This question already has answers here:
Update subset of data.table based on join
(3 answers)
Closed 1 year ago.
Assume that, given a data.table da, there is another data.table db which has some column names equal to the column names of da. Some of those columns with equal names also have identical content. My question is: how do I replace the entries in da whose key columns match db with the content of db?
Much simpler put:
require(data.table)
da <- data.table(a=1:10,b=10:1,c=LETTERS[1:10])
## All (a,b) pairs in db exist in da, but c is different.
db <- data.table(a=c(2,6,8),b=c(9,5,3),c=c('x','y','z'))
## Data table dx will have c-column values of db$c where (a,b) matches between da and db.
dx <- db[da, .(a, b, c = fifelse(is.na(c), i.c, c)), on = c('a', 'b')]
Output
> dx
a b c
1: 1 10 A
2: 2 9 x
3: 3 8 C
4: 4 7 D
5: 5 6 E
6: 6 5 y
7: 7 4 G
8: 8 3 z
9: 9 2 I
10: 10 1 J
> da
a b c
1: 1 10 A
2: 2 9 B
3: 3 8 C
4: 4 7 D
5: 5 6 E
6: 6 5 F
7: 7 4 G
8: 8 3 H
9: 9 2 I
10: 10 1 J
> db
a b c
1: 2 9 x
2: 6 5 y
3: 8 3 z
I know that the above achieves my goal, but it feels clumsy. Is there a built-in data.table way to do this?

Okay, I figured it out:
dx <- da[db, c := i.c, on = c('a', 'b')]
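Worth noting: `:=` updates da by reference, so after that line da itself carries the new c values as well (dx and da are the same object). If the original da must stay untouched, a minimal sketch using copy():

```r
library(data.table)

da <- data.table(a = 1:10, b = 10:1, c = LETTERS[1:10])
db <- data.table(a = c(2, 6, 8), b = c(9, 5, 3), c = c('x', 'y', 'z'))

## Update a deep copy by reference; da is left unchanged
dx <- copy(da)[db, c := i.c, on = c('a', 'b')]
```

copy() forces a deep copy first, so the := in the join only touches the copy.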

Related

R: reshape a data frame when more than 2 dimensions

I am trying to cast a data frame into another one; see below for an example:
> start = data.frame(Aa = c('A','A','A','A','a','a','a','a'),Bb = c('B','B','b','b','B','B','b','b'),Cc = c('C','c','C','c','C','c','C','c'),v=c(1,2,3,4,5,6,7,8))
> start
Aa Bb Cc v
1 A B C 1
2 A B c 2
3 A b C 3
4 A b c 4
5 a B C 5
6 a B c 6
7 a b C 7
8 a b c 8
And I would like to have a data frame like this one:
1 A B 3
2 A b 7
3 a B 11
4 a b 15
5 B C 6
6 B c 8
7 b C 10
8 b c 12
9 A C 4
10 A c 6
11 a C 12
12 a c 14
Line 1 is calculated as follows: we have A-B-C -> 1 and A-B-c -> 2, so A-B -> 3.
The fact is that I can imagine a solution with some for loops over the columns, but I need it to be time efficient: I can have 100,000 rows and up to 100 columns, so I need something fast, and I don't think for loops are very efficient in R.
Do you have any ideas?
Thank you!
Perhaps you can use combn on the column names.
Here, I've used data.table for its efficient aggregation and for the convenience of rbindlist to put the data back together.
library(data.table)
setDT(start)
rbindlist(combn(names(start)[1:3], 2, FUN = function(x) {
  start[, sum(v), x]
}, simplify = FALSE))
# Aa Bb V1
# 1: A B 3
# 2: A b 7
# 3: a B 11
# 4: a b 15
# 5: A C 4
# 6: A c 6
# 7: a C 12
# 8: a c 14
# 9: B C 6
# 10: B c 8
# 11: b C 10
# 12: b c 12
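One caveat with the rbindlist() call above: the three aggregates have different column names (Aa/Bb, Aa/Cc, Bb/Cc), so they end up bound by position. If name-based binding is preferred, a sketch that renames each pair of grouping columns to generic var1/var2 (names invented here) before binding:

```r
library(data.table)

## Rebuild the example data
start <- data.table(Aa = c('A','A','A','A','a','a','a','a'),
                    Bb = c('B','B','b','b','B','B','b','b'),
                    Cc = c('C','c','C','c','C','c','C','c'),
                    v  = c(1, 2, 3, 4, 5, 6, 7, 8))

res <- rbindlist(combn(c("Aa", "Bb", "Cc"), 2, FUN = function(cols) {
  agg <- start[, .(sum_v = sum(v)), by = cols]
  setnames(agg, cols, c("var1", "var2"))  # generic names so all pieces match
  agg
}, simplify = FALSE))
```

Each piece now has columns var1, var2, sum_v, so rbindlist aligns them unambiguously.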

Run iteratively rows by rows in data.table

I have a question related to data.table in R.
For example, I have data like this:
a <- data.table(c = 1:10, d = 2:11)
a[1, e := 1]
c d e
1: 1 2 1
2: 2 3 NA
3: 3 4 NA
4: 4 5 NA
5: 5 6 NA
6: 6 7 NA
7: 7 8 NA
8: 8 9 NA
9: 9 10 NA
10: 10 11 NA
Now I want to calculate the value of e row by row, with e equal to (c + d) multiplied by the e of the previous row. So the data table must be updated row by row here.
I don't want to run a for loop here because it takes a long time. Do any of you have suggestions?
Like this?
a[-1, e := c + d]
a[, e := cumprod(e)]
# c d e
# 1: 1 2 1
# 2: 2 3 5
# 3: 3 4 35
# 4: 4 5 315
# 5: 5 6 3465
# 6: 6 7 45045
# 7: 7 8 675675
# 8: 8 9 11486475
# 9: 9 10 218243025
#10: 10 11 4583103525
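The same recurrence e[i] = (c[i] + d[i]) * e[i - 1] can also be expressed with Reduce(..., accumulate = TRUE), which makes the dependence on the previous row explicit; a sketch equivalent to the cumprod version above:

```r
library(data.table)

a <- data.table(c = 1:10, d = 2:11)

## e[1] = 1; e[i] = (c[i] + d[i]) * e[i - 1]
a[, e := Reduce(function(prev, s) prev * s, (c + d)[-1],
                init = 1, accumulate = TRUE)]
```

With init = 1 and the first (c + d) term dropped, Reduce returns one value per row, starting from 1.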
Edit:
Here is a solution using by. However, that won't be faster than a well written for loop (e.g., using set).
a[1, f := 1]
a[, f := if (.GRP == 1) f
else (c + d) * a[.GRP - 1, f] , by = seq_len(nrow(a))]
Here is a solution with set:
a[1, g := 1]
for (i in 2 : nrow(a)) set(a, i, "g", a[(i), c + d] * a[(i - 1), g])

converting a column values into different row in data.table in R [duplicate]

This question already has answers here:
R semicolon delimited a column into rows
(3 answers)
Closed 7 years ago.
I have a data.table like:
Cell Num
a 1,2,3,4,5,6
b 7,8,9
c 10,11,12
d 13,14
I need to convert it to:
Num Cell
1 a
2 a
3 a
4 a
5 a
6 a
7 b
8 b
9 b
10 c
11 c
12 c
13 d
14 d
How can we break the table into the required format?
We can use cSplit
library(splitstackshape)
cSplit(df1, "Num", ",", "long")
# Cell Num
# 1: a 1
# 2: a 2
# 3: a 3
# 4: a 4
# 5: a 5
# 6: a 6
# 7: b 7
# 8: b 8
# 9: b 9
#10: c 10
#11: c 11
#12: c 12
#13: d 13
#14: d 14
Or, as @David Arenburg mentioned:
library(data.table)
dt[, .(Num = strsplit(Num, ",")[[1]]), by = Cell]
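For completeness, a self-contained sketch of that strsplit() approach, with as.integer() added so Num comes back numeric rather than character (the input table is rebuilt here):

```r
library(data.table)

dt <- data.table(Cell = c("a", "b", "c", "d"),
                 Num  = c("1,2,3,4,5,6", "7,8,9", "10,11,12", "13,14"))

## Split each comma-separated string into one row per value
long <- dt[, .(Num = as.integer(strsplit(Num, ",")[[1]])), by = Cell]
```

The [[1]] works because within each by = Cell group, Num is a length-one character vector.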

rbind and overwrite duplicate rows based on key variable?

> part1<-data.frame(key=c(5,6,7,8,9),x=c("b","d","a","c","b"))
> part1
key x
1 5 b # key==5,x==b
2 6 d
3 7 a
4 8 c
5 9 b
> part2<-data.frame(key=c(1,2,3,4,5), x=c("c","a","b","d","a"))
> part2
key x
1 1 c
2 2 a
3 3 b
4 4 d
5 5 a # key==5,x==a
There are more than 2 dataframes but I'll just use 2 for this example. I then use lapply to put them all in a list called dflist1, then rbind them. For this example I'll just do it manually.
dflist1<-list(part1,part2)
final<-do.call(rbind,dflist1)
final<-final[order(final$key),] #sort by key
Result:
> final
key x
6 1 c
7 2 a
8 3 b
9 4 d
1 5 b #duplicate from part1
10 5 a #duplicate from part2
2 6 d
3 7 a
4 8 c
5 9 b
I want to get rid of the duplicates. It's easy to use !duplicated() but in this case I specifically want to drop/overwrite the rows from earlier dataframes - i.e., in this case the "5 b" from part1 should get dropped/overwritten by the "5 a" from part2. And if there was a part3 with a value "5 b" then the "5 a" from part2 would then get dropped/overwritten by the "5 b" from part3.
What I want:
key x
6 1 c
7 2 a
8 3 b
9 4 d
10 5 a #this is from part2, no more duplicate from part1
2 6 d
3 7 a
4 8 c
5 9 b
Current solution: The only thing I can think of is to add a function that flags each dataframe with an extra variable, then sort it and use !duplicated on that variable... is there an easier or more elegant solution that doesn't require flagging?
## Create many data.frames
set.seed(1)
## Normally, do this in lapply...
part1 <- data.frame(key=1:6, x=sample(letters, 6))
part2 <- data.frame(key=4:8, x=sample(letters, 5))
part3 <- data.frame(key=8:12, x=sample(letters, 5))
library(data.table)
## Collect all your "parts"
PARTS.LIST <- lapply(ls(pattern="^part\\d+"), function(x) get(x))
DT.part <- rbindlist(PARTS.LIST)
setkey(DT.part, key)
unique(DT.part, by="key")
ORIGINAL:
> DT.part
    key x
 1:   1 l
 2:   2 v
 3:   3 k
 4:   4 q
 5:   4 i
 6:   5 r
 7:   5 w
 8:   6 f
 9:   6 o
10:   7 v
11:   8 f
12:   8 l
13:   9 j
14:  10 d
15:  11 g
16:  12 m

UNIQUE:
> unique(DT.part, by="key")
    key x
 1:   1 l
 2:   2 v
 3:   3 k
 4:   4 q
 5:   5 r
 6:   6 f
 7:   7 v
 8:   8 f
 9:   9 j
10:  10 d
11:  11 g
12:  12 m
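Note that unique() keeps the first occurrence of each key, i.e. the value from the earliest part, while the question asks for later parts to overwrite earlier ones. unique.data.table also takes a fromLast argument; a sketch using the question's own part1/part2, where part2 wins on the duplicated key 5:

```r
library(data.table)

part1 <- data.frame(key = c(5, 6, 7, 8, 9), x = c("b", "d", "a", "c", "b"))
part2 <- data.frame(key = c(1, 2, 3, 4, 5), x = c("c", "a", "b", "d", "a"))

DT.part <- rbindlist(list(part1, part2))
## Keep the LAST occurrence of each key, so later parts overwrite earlier ones
final <- unique(DT.part, by = "key", fromLast = TRUE)
setkey(final, key)
</```

Because the parts are bound in order, "last occurrence" means "latest part", which is exactly the overwrite behaviour asked for.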

Multiple joins/merges with data.tables

I have two data.tables, DT and L:
> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9,key="x")
> L=data.table(yv=c(1L:8L,12L),lu=c(letters[8:1],letters[12]),key="yv")
> DT
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> L
yv lu
1: 1 h
2: 2 g
3: 3 f
4: 4 e
5: 5 d
6: 6 c
7: 7 b
8: 8 a
9: 12 l
I would like to independently look up the corresponding value of lu from L for column y and for column v in DT. The following syntax provides the correct result, but is cumbersome to generate and then understand at a glance later:
> L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv.1,v=yv,lu.1=lu.1,lu.2=lu)]
x y v lu.1 lu.2
1: a 1 1 h h
2: a 2 3 g f
3: a 3 6 f c
4: b 4 1 e h
5: b 5 3 d f
6: b 6 6 c c
7: c 7 1 b h
8: c 8 3 a f
9: c 9 6 NA c
(Edit: original post had L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv,v=yv.1,lu.1=lu,lu.2=lu.1)] above, which incorrectly mixed up the y and v columns and looked up values.)
In SQL this would be simple/straightforward:
SELECT DT.*, L1.lu AS lu1, L2.lu AS lu2
FROM DT
LEFT JOIN L AS L1 ON DT.y = L1.yv
LEFT JOIN L AS L2 ON DT.v = L2.yv
Is there a more elegant way to use data.table to perform multiple joins? Note that I'm joining one table to another table twice in this example, but I am also interested in joining one table to multiple different tables.
Great question. One trick is that i doesn't have to be keyed. Only x must be keyed.
There might be better ways. How about this:
> cbind( L[DT[,list(y)]], L[DT[,list(v)]], DT )
yv lu yv lu x y v
1: 1 h 1 h a 1 1
2: 3 f 2 g a 3 2
3: 6 c 3 f a 6 3
4: 1 h 4 e b 1 4
5: 3 f 5 d b 3 5
6: 6 c 6 c b 6 6
7: 1 h 7 b c 1 7
8: 3 f 8 a c 3 8
9: 6 c 9 NA c 6 9
Or, to illustrate, this is the same:
> cbind( L[J(DT$y)], L[J(DT$v)], DT )
yv lu yv lu x y v
1: 1 h 1 h a 1 1
2: 3 f 2 g a 3 2
3: 6 c 3 f a 6 3
4: 1 h 4 e b 1 4
5: 3 f 5 d b 3 5
6: 6 c 6 c b 6 6
7: 1 h 7 b c 1 7
8: 3 f 8 a c 3 8
9: 6 c 9 NA c 6 9
merge could also be used, if the following feature request were implemented:
FR#2033 Add by.x and by.y to merge.data.table
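With more recent data.table versions, the two lookups can also be written as ad-hoc on= joins, with no keys needed; a sketch that adds both looked-up columns to DT by reference:

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
L  <- data.table(yv = c(1:8, 12L), lu = c(letters[8:1], letters[12]))

## Each line is one left join: match L$yv against a different DT column
DT[, lu1 := L[.SD, lu, on = .(yv = y)]]  # look up y in L
DT[, lu2 := L[.SD, lu, on = .(yv = v)]]  # look up v in L
```

This mirrors the two LEFT JOINs in the SQL version: unmatched rows (e.g. v = 9) get NA.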
