data.table merging in R

I'm using data.table_1.9.4 and the merge function does not seem to work as expected. What am I doing wrong here?
> a
letter num_num
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
> b
letter num_num
1: a 3
2: b 4
3: c 5
4: d 6
5: e 5
6: f 5
> merge(as.data.frame(a),as.data.frame(b),by='letter',all=TRUE)
letter num_num.x num_num.y
##....Works as expected
> merge(a,b,by='letter',all=TRUE)
Error in setcolorder(dt, c(setdiff(names(dt), end), end)) :
neworder is length 2 but x has 3 columns.
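For reference, here is a minimal sketch of the same full outer merge on a reproducible copy of the two tables. It assumes a recent data.table release rather than 1.9.4, and it does not diagnose the setcolorder error above. The table contents are retyped from the printed output, and the names res and res_on are only illustrative.

```r
library(data.table)

# Tables retyped from the printed output in the question
a <- data.table(letter = letters[1:6], num_num = 1:6)
b <- data.table(letter = letters[1:6], num_num = c(3, 4, 5, 6, 5, 5))

# Full outer join on 'letter'; the clashing num_num columns get .x/.y suffixes
res <- merge(a, b, by = "letter", all = TRUE)

# A join written with on= is another route to the same pairing of rows
res_on <- b[a, on = "letter"]
```

If the merge call still fails, comparing names(a) and names(b) and the installed package version is a reasonable first check.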

Related

How to set the values of a column in an R data.table by referring the key and values in a second data.table? [duplicate]

Suppose I have a data.table in R:
> A=data.table(Col1=c(1,4,2,5,6,2,3,5,3,7))
> A
Col1
1: 1
2: 4
3: 2
4: 5
5: 6
6: 2
7: 3
8: 5
9: 3
10: 7
And a key-value data.table where
> B=data.table(Col1=c(1,2,3,4,5,6,7),Col2=c("A","B","C","D","E","F","G"))
> B
Col1 Col2
1: 1 A
2: 2 B
3: 3 C
4: 4 D
5: 5 E
6: 6 F
7: 7 G
I would like to have Col1 of data.table A reference B and create a new column in A that corresponds to the key-value pairs:
Col1 Col2
1: 1 A
2: 4 D
3: 2 B
4: 5 E
5: 6 F
6: 2 B
7: 3 C
8: 5 E
9: 3 C
10: 7 G
How can I do this in data.table? Thanks
What you are looking for is an update join (a join by reference).
This looks up each value of A$Col1 in B$Col1 and assigns the matching value of B$Col2. In the code, that value is referred to as i.Col2, because B is in the i part of the data.table syntax. It is usually the fastest way to join, but remember that it keeps only one match per row of A. So if there are multiple values of B$Col2 for the same B$Col1 value, you will get just one of them, and which one depends on how B is ordered.
A[B, Col2 := i.Col2, on = .(Col1)]
Col1 Col2
1: 1 A
2: 4 D
3: 2 B
4: 5 E
5: 6 F
6: 2 B
7: 3 C
8: 5 E
9: 3 C
10: 7 G
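When the lookup table may contain repeated keys, one way to make the update join deterministic is to de-duplicate it first. The sketch below assumes a hypothetical B with a duplicated Col1 value (the extra row with "Z" is not in the original data) and keeps the first row per key with unique().

```r
library(data.table)

A <- data.table(Col1 = c(1, 4, 2, 5, 6, 2, 3, 5, 3, 7))

# Hypothetical lookup table: the key 2 appears twice
B <- data.table(Col1 = c(1, 2, 3, 2), Col2 = c("A", "B", "C", "Z"))

# Keep the first row for each Col1 so every key maps to exactly one value
B_unique <- unique(B, by = "Col1")

# Update join: adds Col2 to A by reference; keys missing from B_unique get NA
A[B_unique, Col2 := i.Col2, on = .(Col1)]
```

Choosing which duplicate to keep (first, last, or an aggregate) is a decision about the data, so it is better made explicitly than left to row order.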
Using data.table's on argument:
library(data.table)
A[B, , on = "Col1"]
Col1 Col2
1: 1 A
2: 2 B
3: 2 B
4: 3 C
5: 3 C
6: 4 D
7: 5 E
8: 5 E
9: 6 F
10: 7 G
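Note that the result above follows the row order of B, not of A. If the original order of A matters, a sketch of the same lookup with the roles swapped is shown below; it returns a new table rather than modifying A, and the name res is only illustrative.

```r
library(data.table)

A <- data.table(Col1 = c(1, 4, 2, 5, 6, 2, 3, 5, 3, 7))
B <- data.table(Col1 = c(1, 2, 3, 4, 5, 6, 7),
                Col2 = c("A", "B", "C", "D", "E", "F", "G"))

# For each row of A (the i table), look up the matching row of B;
# the result keeps A's row order
res <- B[A, on = "Col1"]
```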

Keep all the data.table when aggregating a data.table

I would like to aggregate a data.table by a list of columns and keep all the columns at the end.
A <- c(1,2,3,4,4,6,4)
B <- c("a","b","c","d","e","f","g")
C <- c(10,11,23,8,8,1,3)
D <- c(2,3,5,9,7,8,4)
dt <- data.table(A,B,C,D)
Now I want to aggregate column B with paste(B, collapse=";") by A and C, and keep column D at the end too. Do you know a way to do it?
EDIT
This is what I obtained using dt[, newCol := toString(B), .(A, C)]:
A B C D newCol
1: 1 a 10 2 a
2: 2 b 11 3 b
3: 3 c 23 5 c
4: 4 d 8 9 d, e
5: 4 e 8 7 d, e
6: 6 f 1 8 f
7: 4 g 3 4 g
But I would like to obtain:
A B C D newCol
1: 1 a 10 2 a
2: 2 b 11 3 b
3: 3 c 23 5 c
4: 4 d 8 9 d, e
6: 6 f 1 8 f
7: 4 g 3 4 g
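One possible way to get that shape, sketched under the assumption that the first row of each (A, C) group is the one to keep: add newCol by group as above, then drop the repeated group rows with unique() and its by argument. The name res is only illustrative.

```r
library(data.table)

dt <- data.table(A = c(1, 2, 3, 4, 4, 6, 4),
                 B = c("a", "b", "c", "d", "e", "f", "g"),
                 C = c(10, 11, 23, 8, 8, 1, 3),
                 D = c(2, 3, 5, 9, 7, 8, 4))

# Collapse B within each (A, C) group, keeping every original column
dt[, newCol := toString(B), by = .(A, C)]

# Keep only the first row of each (A, C) group
res <- unique(dt, by = c("A", "C"))
```

If a different row per group should survive (for example the one with the largest D), order the table accordingly before calling unique().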

Removing rows in a R data.table with NAs in specific columns

I have a data.table with a large number of features. I would like to remove the rows that contain NA in certain features only.
Currently I am using the following to handle this:
data.joined.sample <- data.joined.sample %>%
filter(!is.na(lat)) %>%
filter(!is.na(long)) %>%
filter(!is.na(temp)) %>%
filter(!is.na(year)) %>%
filter(!is.na(month)) %>%
filter(!is.na(day)) %>%
filter(!is.na(hour)) %>%
.......
Is there a more concise way to achieve this?
str(data.joined.sample)
Classes ‘data.table’ and 'data.frame': 336776 obs. of 50 variables:
We can select those columns, use complete.cases on them to get a logical vector flagging the rows without NAs, and use that vector to keep only the complete rows:
data.joined.sample[complete.cases(data.joined.sample[colsofinterest]),]
where
colsofinterest <- c("lat", "long", "temp", "year", "month", "day", "hour")
Update
Based on the OP's comments, if it is a data.table, then subset the colsofinterest and use complete.cases
data.joined.sample[complete.cases(data.joined.sample[, colsofinterest, with = FALSE])]
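Another option is the data.table method for na.omit, which accepts a cols argument. The sketch below uses a small made-up table (dt_sample and its values are hypothetical) in place of data.joined.sample.

```r
library(data.table)

# Hypothetical stand-in for data.joined.sample
dt_sample <- data.table(lat   = c(1.5, NA, 3.2),
                        long  = c(10, 20, NA),
                        temp  = c(NA, 5, 7),
                        other = c(NA, "x", "y"))

colsofinterest <- c("lat", "long")

# Drop rows that have NA in any of the listed columns;
# NAs in the remaining columns are ignored
cleaned <- na.omit(dt_sample, cols = colsofinterest)
```

Here only the first row survives, even though its temp and other values are NA, because those columns are not in colsofinterest.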
data.table objects, if that is in fact what you're working with, have a somewhat different syntax for the "[" function. Look through this console session:
> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
> DT[x=="a"&y==1]
x y v
1: a 1 4
> is.na(DT[x=="a"&y==1]$v) <- TRUE # make one item NA
> DT[x=="a"&y==1]
x y v
1: a 1 NA
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT[complete.cases(DT)] # note no comma
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
> DT # But that didn't remove the NA, it only gave a value
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT <- DT[complete.cases(DT)] # do this assignment to make permanent
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
Probably not the true "data.table way".

Adding column to existing data table, using formula from one column and operand data from other column

I'm a beginner in R. I have the following data in a table like this:
dt
fct X Y Z fct1 Q
1: a 2 1 a:a a a:a:2
2: b 4 2 b:b b b:b:4
3: c 3 1 c:c c c:c:3
4: d 2 2 d:d d d:d:2
5: c 5 1 c:c c c:c:5
6: d 4 2 d:d d d:d:4
7: a 7 1 a:a a a:a:7
8: b 2 2 b:b b b:b:2
9: c 9 1 c:c c c:c:9
10: a 1 2 a:a a a:a:1
11: b 4 1 b:b b b:b:4
12: c 2 2 c:c c c:c:2
13: b 5 1 b:b b b:b:5
14: c 4 2 c:c c c:c:4
15: d 2 1 d:d d d:d:2
and I have a list like this:
flist
$a
[1] "X + Y"
$b
[1] "X - Y"
$c
[1] "X * Y"
$d
[1] "paste0(Z,':',fct)"
Since flist has an entry for a, whenever the value a appears in the fct column, the corresponding formula from the list should be evaluated using the values in the X and Y columns, and the result stored in a new column XY.
I tried a solution like this:
within(dt2, XY <- eval(parse(text=flist['a']))), which works, but with the obvious constraint that it only applies the formula for a.
However, this: within(dt2, XY <- eval(parse(text=flist[fct]))) doesn't work.
Even this: within(dt2, XY <- eval(parse(text=eval(parse(text=flist[fct]))))) doesn't work.
The use case is that the value in the fct column should be used to look up the formula in flist, which is then evaluated with the data in X and Y and stored in XY.
I look forward to any help.
We can use the data.table methods by specifying the logical condition in 'i' (assuming that 'fct' is of character class) and assigning (:=) the evaluated string from the list element (flist$a) to create the new column 'XY':
dt[fct == names(flist), XY := eval(parse(text=flist$a))]
dt
# fct X Y Z fct1 Q XY
# 1: a 2 1 a:a a a:a:2 3
# 2: b 4 2 b:b b b:b:4 NA
# 3: c 3 1 c:c c c:c:3 NA
# 4: d 2 2 d:d d d:d:2 NA
# 5: c 5 1 c:c c c:c:5 NA
# 6: d 4 2 d:d d d:d:4 NA
# 7: a 7 1 a:a a a:a:7 8
# 8: b 2 2 b:b b b:b:2 NA
# 9: c 9 1 c:c c c:c:9 NA
#10: a 1 2 a:a a a:a:1 3
#11: b 4 1 b:b b b:b:4 NA
#12: c 2 2 c:c c c:c:2 NA
#13: b 5 1 b:b b b:b:5 NA
#14: c 4 2 c:c c c:c:4 NA
#15: d 2 1 d:d d d:d:2 NA
Update
If there are multiple elements in 'flist'
for(j in seq_along(flist)){
dt[fct == names(flist)[j], XY := eval(parse(text= flist[[j]]))][]
}
dt
# fct X Y Z fct1 Q XY
# 1: a 2 1 a:a a a:a:2 3
# 2: b 4 2 b:b b b:b:4 2
# 3: c 3 1 c:c c c:c:3 3
# 4: d 2 2 d:d d d:d:2 NA
# 5: c 5 1 c:c c c:c:5 5
# 6: d 4 2 d:d d d:d:4 NA
# 7: a 7 1 a:a a a:a:7 8
# 8: b 2 2 b:b b b:b:2 0
# 9: c 9 1 c:c c c:c:9 9
#10: a 1 2 a:a a a:a:1 3
#11: b 4 1 b:b b b:b:4 3
#12: c 2 2 c:c c c:c:2 4
#13: b 5 1 b:b b b:b:5 4
#14: c 4 2 c:c c c:c:4 8
#15: d 2 1 d:d d d:d:2 NA
data
flist <- list(a= "X + Y")
#updated flist
flist <- list(a = "X + Y", b = "X - Y", c = "X * Y")
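If the formulas do not have to be stored as strings, a variation is to keep them as functions, which avoids eval(parse()) altogether. This is only a sketch of an alternative design, not the approach from the answer above: the function-valued flist and the four-row dt are hypothetical, and 'd' is deliberately left without an entry so its rows stay NA.

```r
library(data.table)

# Small hypothetical table with the same fct/X/Y layout
dt <- data.table(fct = c("a", "b", "c", "d"),
                 X   = c(2, 4, 3, 2),
                 Y   = c(1, 2, 1, 2))

# Formulas stored as functions instead of strings (assumption)
flist <- list(a = function(X, Y) X + Y,
              b = function(X, Y) X - Y,
              c = function(X, Y) X * Y)

# Apply each function to the rows whose fct matches its name
for (nm in names(flist)) {
  dt[fct == nm, XY := flist[[nm]](X, Y)]
}
```

A formula that returns text, such as the paste0 entry for d in the question, would not fit in a numeric XY column, so it would need a separate column or a character XY.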

Multiple joins/merges with data.tables

I have two data.tables, DT and L:
> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9,key="x")
> L=data.table(yv=c(1L:8L,12L),lu=c(letters[8:1],letters[12]),key="yv")
> DT
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> L
yv lu
1: 1 h
2: 2 g
3: 3 f
4: 4 e
5: 5 d
6: 6 c
7: 7 b
8: 8 a
9: 12 l
I would like to independently look up the corresponding value of lu from L for column y and for column v in DT. The following syntax provides the correct result, but is cumbersome to generate and then understand at a glance later:
> L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv.1,v=yv,lu.1=lu.1,lu.2=lu)]
x y v lu.1 lu.2
1: a 1 1 h h
2: a 2 3 g f
3: a 3 6 f c
4: b 4 1 e h
5: b 5 3 d f
6: b 6 6 c c
7: c 7 1 b h
8: c 8 3 a f
9: c 9 6 NA c
(Edit: original post had L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv,v=yv.1,lu.1=lu,lu.2=lu.1)] above, which incorrectly mixed up the y and v columns and looked up values.)
In SQL this would be simple/straightforward:
SELECT DT.*, L1.lu AS lu1, L2.lu AS lu2
FROM DT
LEFT JOIN L AS L1 ON DT.y = L1.yv
LEFT JOIN L AS L2 ON DT.v = L2.yv
Is there a more elegant way to use data.table to perform multiple joins? Note that I'm joining one table to another table twice in this example, but I am also interested in joining one table to multiple different tables.
Great question. One trick is that i doesn't have to be keyed. Only x must be keyed.
There might be better ways. How about this:
> cbind( L[DT[,list(y)]], L[DT[,list(v)]], DT )
yv lu yv lu x y v
1: 1 h 1 h a 1 1
2: 3 f 2 g a 3 2
3: 6 c 3 f a 6 3
4: 1 h 4 e b 1 4
5: 3 f 5 d b 3 5
6: 6 c 6 c b 6 6
7: 1 h 7 b c 1 7
8: 3 f 8 a c 3 8
9: 6 c 9 NA c 6 9
or, to illustrate, this is the same :
> cbind( L[J(DT$y)], L[J(DT$v)], DT )
yv lu yv lu x y v
1: 1 h 1 h a 1 1
2: 3 f 2 g a 3 2
3: 6 c 3 f a 6 3
4: 1 h 4 e b 1 4
5: 3 f 5 d b 3 5
6: 6 c 6 c b 6 6
7: 1 h 7 b c 1 7
8: 3 f 8 a c 3 8
9: 6 c 9 NA c 6 9
merge could also be used, if the following feature request was implemented :
FR#2033 Add by.x and by.y to merge.data.table
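Later data.table releases added an on argument to joins (and by.x/by.y to merge), so the two lookups can be sketched as two update joins without setting any keys. This assumes a data.table version that supports on=. The column names lu.1 and lu.2 follow the question, and y is stored as integer here so that both join columns share a type.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(1L, 3L, 6L),   # integer here to match the type of L$yv
                 v = 1:9)
L <- data.table(yv = c(1L:8L, 12L), lu = c(letters[8:1], letters[12]))

# First lookup: match DT$y against L$yv and copy lu into lu.1
DT[L, lu.1 := i.lu, on = .(y = yv)]

# Second lookup: match DT$v against L$yv and copy lu into lu.2
DT[L, lu.2 := i.lu, on = .(v = yv)]
```

Each line corresponds to one LEFT JOIN in the SQL version, and joining to a different lookup table only means swapping L for that table.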
