Morning everyone
In data.table I found that, in a left join, referring to a column by its bare name (i.e. without the prefix of the table it resides in) gives unexpected results, even though the column names are unique.
Dummy data:
library(data.table)
x <- data.table(a = 1:2); x
# a
# 1: 1
# 2: 2
y <- data.table(c = 1
,d = 2); y
# c d
# 1: 1 2
Left join without a table prefix when retrieving column c:
z <- y[x, on=.(c=a), .(a,c,d)]; z
# a c d
# 1: 1 1 2
# 2: 2 2 NA
The problem shows in the result above: row 2 of column c is supposed to be NA, but it shows 2.
This is only rectified when the table is named explicitly:
z <- y[x, on=.(c=a), .(a,x.c,d)]; z
# a x.c d
# 1: 1 1 2
# 2: 2 NA NA
It is perhaps worth mentioning that the x in x.c refers to the position in the x[i] syntax, which in this case is table y.
My question is: why is the explicit mention of the table necessary for a seemingly basic task? Or am I missing something? Thank you.
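The prefixes can be written out for every column to make the source of each value explicit. A minimal sketch, reusing the dummy data above and assuming the usual data.table convention that `x.` refers to the outer table (here y) and `i.` to the table passed inside the brackets (here x):

```r
library(data.table)

x <- data.table(a = 1:2)
y <- data.table(c = 1, d = 2)

# In y[x, on = .(c = a), j]:
#   x.<col> reads the column from the outer table (y),
#   i.<col> reads the column from the table passed as i (x).
z <- y[x, on = .(c = a), .(i.a, x.c, x.d)]
z
```

With the prefixes in place, the unmatched second row carries NA in both columns taken from y, while the bare name c in the question returned the join value supplied by x.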
This is something that I come across fairly often and this is the solution that I typically land on. I'm wondering if anyone has suggestions on a less verbose way to accomplish this task. Here I create an example dataframe that contains three columns. One of these columns is a parameter code rather than a parameter name.
Second I create an example of a reference dataframe that contains unique parameter codes and the associated parameter names.
My solution has been to use a 'for' loop to match the parameter name from real_param_names with the associated parameter codes in dat. I have tried to use match() and replace() but haven't quite found a way to make them work. None of the examples I've come across in old questions have quite hit the mark either, but I would happily be referred to one that does. Thanks in advance.
dat <- data.frame(site = c(1,1,2,2,3,3,4,4),
param_code = c('a','b','c','d','a','b','c','d'),
param_name = NA)
dat
real_param_names <- data.frame(param_code = c('a','b','c','d'),
param_name = c('gold', 'silver', 'mercury', 'lead'))
real_param_names
for (i in unique(dat$param_code)) {
dat$param_name[dat$param_code==i] <- real_param_names$param_name[real_param_names$param_code==i]
}
dat
This is a merge/join operation:
First, let's get rid of the unnecessary dat$param_name, since it'll be brought over in the merge:
dat$param_name <- NULL
dat
# site param_code
# 1 1 a
# 2 1 b
# 3 2 c
# 4 2 d
# 5 3 a
# 6 3 b
# 7 4 c
# 8 4 d
Now the merge:
merge(dat, real_param_names, by = "param_code", all.x = TRUE)
# param_code site param_name
# 1 a 1 gold
# 2 a 3 gold
# 3 b 1 silver
# 4 b 3 silver
# 5 c 2 mercury
# 6 c 4 mercury
# 7 d 2 lead
# 8 d 4 lead
Some good links for the concepts of joins/merges: How to join (merge) data frames (inner, outer, left, right), https://stackoverflow.com/a/6188334/3358272
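Since the question mentions trying match(), here is a minimal sketch of a lookup that replaces the loop without reordering rows. It is an alternative to merge(), not the method used in the answer above; the helper variable idx is introduced only for illustration.

```r
dat <- data.frame(site = c(1, 1, 2, 2, 3, 3, 4, 4),
                  param_code = c('a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'),
                  stringsAsFactors = FALSE)
real_param_names <- data.frame(param_code = c('a', 'b', 'c', 'd'),
                               param_name = c('gold', 'silver', 'mercury', 'lead'),
                               stringsAsFactors = FALSE)

# For each row of dat, find the position of its code in the reference table
# (idx is a hypothetical helper name); codes with no match would give NA.
idx <- match(dat$param_code, real_param_names$param_code)
dat$param_name <- real_param_names$param_name[idx]
dat
```

Unlike merge(), this keeps dat in its original row order, at the cost of handling only one lookup column at a time.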
I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
Every item has attributes a, b and c, but some items may be missing an attribute, i.e. only have a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, which is slightly more complex but would be needed if some groups are missing a, is the following. It assumes that x lists the type values in order of their levels within each group, so that if a level is less than the prior level it must be the start of a new group. The factor() call makes this work whether type is stored as character (the default in R 4.0 and later) or as a factor.
g <- cumsum(c(-1, diff(as.numeric(factor(x$type)))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then it cannot be determined whether b and c in the second group actually form a second group or are part of the first group.
I am using data.table to do a one-to-many merge. Instead of matching with all the rows, the output is showing only the last matched row for each unique value of the key.
a <- data.table(x = 1:2L, y = letters[1:4])
b <- data.table(x = c(1L,3L))
setkey(a,x)
setkey(b,x)
I want to do a many to one (b to a) join based on column x.
c <- a[b,on=.(x)]
c
# x y
# 1: 1 a
# 2: 1 c
# 3: 3 NA
However, this approach creates a new data.table called c. Instead of making a new data.table, I use the following code to add the column y to b:
b[a,y:=i.y]
Now b looks like,
b
# x y
# 1: 1 c
# 2: 3 NA
The desired output is the one in the first method (c). Is there a way of using := and output all the rows instead of the last matched row alone?
PS: The reason I want to use method 2 using := is because my data is huge and I do not want to make copies. The example I showed reflects what happens in my data.
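The difference between the two methods can be checked directly. A minimal sketch, assuming (as far as I know) that := only assigns columns in place and never changes the number of rows, so an update join cannot expand b; the names n_before and expanded are introduced only for illustration:

```r
library(data.table)

a <- data.table(x = c(1L, 2L, 1L, 2L), y = letters[1:4], key = "x")
b <- data.table(x = c(1L, 3L), key = "x")

n_before <- nrow(b)

# Update join: adds column y to b by reference; b keeps its two rows,
# so only one of the matching y values per x can be stored.
b[a, y := i.y, on = "x"]
nrow(b) == n_before

# A regular join returns one row per match and therefore builds a new table.
expanded <- a[b, on = "x"]
nrow(expanded)
```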
I have a data frame c like this
c
Freq CTM
000110100111 2 NA
110110100111 1 32.58847
111001011000 2 NA
111111111111 1 25.61041
and a data frame nona_c like this
nona_c
Freq CTM
000110100111 2 37.0642
111001011000 2 37.0642
I want to replace the NAs in the CTM column of c with the CTM values of nona_c. The rownames of nona_c (the binary strings) will always exist in c.
The output should be
mergedC
Freq CTM
000110100111 2 37.0642
110110100111 1 32.58847
111001011000 2 37.0642
111111111111 1 25.61041
I've been trying merge without success here.
mergedC <- merge(x = c, y = nona_c, by = 0, #rownames
all.y = TRUE)
A match operation might make this more straightforward:
c$CTM[is.na(c$CTM)] <- nona_c$CTM[match(rownames(c)[is.na(c$CTM)], rownames(nona_c))]
# Freq CTM id
#000110100111 2 37.06420 000110100111
#110110100111 1 32.58847 110110100111
#111001011000 2 37.06420 111001011000
#111111111111 1 25.61041 111111111111
We can do this with data.table using a join on the variable of interest. Here we are joining on the row name column. The values of "i.CTM" are assigned (:=) to the 'CTM'.
library(data.table)
setDT(c, keep.rownames=TRUE)[]
setDT(nona_c, keep.rownames=TRUE)[]
c[nona_c, CTM := i.CTM , on = "rn"]
c
# rn Freq CTM
#1: 000110100111 2 37.06420
#2: 110110100111 1 32.58847
#3: 111001011000 2 37.06420
#4: 111111111111 1 25.61041
NOTE: Row names are not retained in data.table or dplyr. So, while converting the 'data.frame' to a 'data.table', we use keep.rownames = TRUE, which stores them in a column named "rn".
If I specify n columns as a key of a data.table, I'm aware that I can join to fewer columns than are defined in that key as long as I join to the head of key(DT). For example, for n=2 :
X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
A B
1: 1 1
2: 1 1
3: 2 1
4: 2 1
5: 3 1
6: 3 2
7: 4 2
8: 4 2
9: 5 2
10: 5 2
X[J(3)]
A B
1: 3 1
2: 3 2
There I only joined to the first column of the 2-column key of X. I know I can join to both columns of the key like this:
X[J(3,1)]
A B
1: 3 1
But how do I subset using only the second column of the key (e.g. B==2), while still using binary search rather than a vector scan? I'm aware that's a duplicate of:
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
so I'd like to generalise this question to n. My data set has about a million rows, and the solution provided in the duplicate question linked above doesn't seem to be optimal.
Here is a simple function that will extract the correct unique values and return a data table to use as a key.
X <- data.table(A=rep(1:5, each=4), B=rep(1:4, each=5),
C = letters[1:20], key=c('A','B','C'))
make.key <- function(ddd, what){
# the names of the key columns
zzz <- key(ddd)
# the key columns you wish to keep all unique values
whichUnique <- setdiff(zzz, names(what))
## unique data.table (when keyed); .. means "look up one level"
ud <- lapply(ddd[, ..whichUnique], unique)
## append the `what` columns and a Cross Join of the new
## key columns
do.call(CJ, c(ud,what)[zzz])
}
X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b
I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
Adding secondary keys is on the feature request list :
FR#1007 Build in secondary keys
In the meantime we are stuck with either a vector scan or the approach used in the answer to the n=2 case linked in the question (which @mnel generalises nicely in his answer).
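The answers above predate later data.table releases. As a minimal sketch, assuming a data.table version that provides the `on=` argument and `setindex()`, a subset on the second key column alone can be written as a join without changing the key; the name res is introduced only for illustration:

```r
library(data.table)

X <- data.table(A = rep(1:5, each = 2), B = rep(1:2, each = 5),
                key = c("A", "B"))

# Secondary index on B; the primary key (A, B) and the row order are untouched.
setindex(X, B)

# Join on B only, without going through the leading key column A.
res <- X[.(2L), on = "B"]
res
```

Whether this is faster than a vector scan on a given data set is worth measuring rather than assuming.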