I am attempting to understand the join logic in data.table from the documentation and am a bit unclear. I know I could just try this and see what happens, but I would like to make sure there is no pathological case, and therefore would like to know how the logic was actually coded. When two data.table objects have a different number of key columns, for example a has 2 and b has 3, and you run c <- a[b], will a and b be merged simply on the first two key columns, or will the third column in a be automatically matched to the 3rd key column in b? An example:
require(data.table)
a <- data.table(id=1:10, t=1:20, v=1:40, key=c("id", "t"))
b <- data.table(id=1:10, v2=1:20, key="id")
c <- a[b]
This should select rows of a that match the id key column in b. For example, for id==1 in b, there are 2 rows in b and 4 rows in a that should generate 8 rows in c. This is indeed what seems to happen:
> head(c,10)
id t v v2
1: 1 1 1 1
2: 1 1 21 1
3: 1 11 11 1
4: 1 11 31 1
5: 1 1 1 11
6: 1 1 21 11
7: 1 11 11 11
8: 1 11 31 11
9: 2 2 2 2
10: 2 2 22 2
The other way to try it is to do:
d <- b[a]
This should do the same thing in reverse: for every row in a it should select the matching rows in b. Since a has an extra key column, t, that column should not be used for matching, and a join based only on the first key column, id, should be done. It seems like this is the case:
> head(d,10)
id v2 t v
1: 1 1 1 1
2: 1 11 1 1
3: 1 1 1 21
4: 1 11 1 21
5: 1 1 11 11
6: 1 11 11 11
7: 1 1 11 31
8: 1 11 11 31
9: 2 2 2 2
10: 2 12 2 2
Can someone confirm? To be clear: is the third key column of a ever used in any of the merges, or does data.table only use the first min(length(key(a)), length(key(b))) key columns of the two tables?
Good question. First the correct terminology is (from ?data.table) :
[A data.table] may have one key of one or more columns. This key can be used for row indexing instead of rownames.
So "key" (singlular) not "keys" (plural). We can get away with "keys", currently. But when secondary keys are added in future, there may then be multiple keys. Each key (singular) can have multiple columns (plural).
Otherwise you're absolutely correct. The following paragraph was improved in v1.8.2 based on feedback from others also confused. From ?data.table:
When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then many rows of x will ordinarily match to each row of i since not all of x's key columns will be joined to (a common use case). If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 to column 2, and so on) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant. The columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key.
Following comments, in v1.8.3 (on R-Forge) this now reads (changes in bold) :
When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then not all of x's key columns will be joined to (a common use case) and many rows of x will (ordinarily) match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 of i's key to column 2 of x's key, and so on for as long as the shorter key) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key. In code, the number of join columns is determined by min(length(key(x)),if (haskey(i)) length(key(i)) else ncol(i)).
Quote data.table FAQ:
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index.
merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ;
whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.
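The number-of-join-columns rule quoted from the v1.8.3 docs can be seen on a tiny pair of tables (a minimal sketch with made-up data, smaller than the question's example):

```r
library(data.table)

# x keyed on two columns, i keyed on one: only the first
# min(length(key(x)), length(key(i))) = 1 column is joined to
x <- data.table(id = c(1, 1, 2, 2), t = 1:4, v = 11:14, key = c("id", "t"))
i <- data.table(id = c(1, 2), w = c(100, 200), key = "id")

x[i]   # joins on "id" only; every t within each id matches, giving 4 rows

# the number of join columns, exactly as the docs express it:
min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i))
# 1
```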
Related
I need to update all the values of a column, using as reference another df.
The two dataframes have equal structures:
cod name dom_by
1 A 3
2 B 4
3 C 1
4 D 2
I tried to use the following line, but apparently it did not work:
df2$name[df2$dom_by==df1$cod] <- df1$name[df2$dom_by==df1$cod]
It keeps saying that replacement has 92 rows, data has 2.
(df1 has 92 rows and df2 has 2).
Although it seems like a simple problem, I still cannot solve it, even after some searches.
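The error arises because `df2$dom_by == df1$cod` compares vectors of different lengths (2 vs 92), so the logical index recycles. One way to get the intended lookup is match(), which returns, for each df2$dom_by, the row of df1 whose cod equals it. A sketch on small made-up data (the real df1 has 92 rows):

```r
# made-up lookup table and table to update, same structure as the question
df1 <- data.frame(cod = 1:4, name = c("A", "B", "C", "D"))
df2 <- data.frame(cod = 5:6, name = c("X", "Y"), dom_by = c(3, 1))

# for each dom_by, find the matching row of df1 and take its name
df2$name <- df1$name[match(df2$dom_by, df1$cod)]
df2$name
# "C" "A"
```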
I have a dataframe with two columns. One is an ID column (string), the second consists of strings several hundred characters long (DNA sequences). I want to identify the unique DNA sequences and group the unique groups together.
Using:
data$duplicates<-duplicated(data$seq, fromLast = TRUE)
I have successfully identified whether a specific row is a duplicate or not. This is not sufficient: I want to know how many duplicates I have (2, 3, etc.) and to which IDs they correspond (it is important that the ID always stays with its corresponding sequence).
Maybe something like:
for data$duplicates = TRUE... "add number in data$grouping
corresponding to the set of duplicates."
I don't know how to write the code for the last part.
I appreciate any and all help, thank you.
Edit: As an example:
df <- data.frame(ID = c("seq1","seq2","seq3","seq4","seq5"),seq= c("AAGTCA",AGTCA","AGCCTCA","AGTCA","AGTCAGG"))
I would like the output to be a new column (e.g.: df$grouping) where a numeric value is given to each unique group, so in this case:
("1","2","3","2","4")
Since df$seq is already a factor, we can just use the level number. This is given when a factor is coerced to an integer.
df$grouping = as.integer(df$seq)
df
# ID seq grouping
# 1 seq1 AAGTCA 1
# 2 seq2 AGTCA 3
# 3 seq3 AGCCTCA 2
# 4 seq4 AGTCA 3
# 5 seq5 AGTCAGG 4
If, in your real data, the seq column is not of class factor, you can still use df$grouping = as.integer(factor(df$seq)). By default the order of the groups will be alphabetical; you can modify this by giving the levels argument to factor in the order you want. For example, df$grouping = as.integer(factor(df$seq, levels = unique(df$seq))) will put the levels (and thus the grouping integers) in the order in which they first occur.
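On the question's example data, the first-occurrence ordering reproduces exactly the requested output:

```r
df <- data.frame(ID = paste0("seq", 1:5),
                 seq = c("AAGTCA", "AGTCA", "AGCCTCA", "AGTCA", "AGTCAGG"))

# levels in order of first appearance, so groups are numbered as they occur
df$grouping <- as.integer(factor(df$seq, levels = unique(df$seq)))
df$grouping
# 1 2 3 2 4
```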
If you want to see the number of rows in each group, use table, e.g.
table(df$seq)
# AAGTCA AGCCTCA AGTCA AGTCAGG
# 1 1 2 1
table(df$grouping)
# 1 2 3 4
# 1 1 2 1
sort(table(df$seq), decreasing = T)
# AGTCA AAGTCA AGCCTCA AGTCAGG
# 2 1 1 1
I have a data set where each individual has a unique person ID. I'm interested in turning these ID numbers to another set of more manageable type integer IDs.
ID <- c(59970013552, 51730213552, 1233923, 2949394, 9999999999)
Essentially, I'd like to map these IDs a new_ID, where
> new_ID
[1] 1 2 3 4 5
The reason I'm doing this is that my analysis requires as.integer(ID), and R will coerce large integers into NA. I have tried using as.integer64 from the bit64 package, but the class integer64 is not compatible with my analysis.
I've also thought to just do ID - min(ID) + 1 to get around having huge ID numbers. But this also doesn't work, because some of my larger IDs are so large that even if I subtract the min(ID) value, as.integer(ID) will still coerce them to NA.
This should be a duplicate, but I couldn't find a relevant answer, hence posting one.
We can use match
match(ID, unique(ID))
#[1] 1 2 3 4 5
Or convert ID to a factor with levels in order of appearance:
as.integer(factor(ID, levels = unique(ID)))
#[1] 1 2 3 4 5
I have a question regarding adding a column to a data.table based on info in another data.table.
This is how my data looks:
Datatable 1 (football matches)
TeamcodeHome TeamcodeAway GoalsHome GoalsAway Season
1 2 5 0 2006
Datatable 2 (cards received by football teams):
Teamcode Season Red Yellow
1 2005 1 15
2 2005 3 10
1 2006 4 16
2 2006 1 4
Now I would use the following data.table expression if I wanted to add a column based on one other column:
dt.1[dt.2, on="Teamcode", RedCards:=Red]
But now there are two variables that need to be matched, Teamcode and Season. How does this work?
The help page ?data.table says about the on parameter:
Indicate which columns in i should be joined with columns in x along
with the type of binary operator to join with. When specified, this
overrides the keys set on x and i. There are multiple ways of
specifying on argument:
As a character vector, e.g., X[Y, on=c("a", "b")]. This assumes both these columns are present in X and Y.
As a named character vector, e.g., X[Y, on=c(x="a", y="b")]. This is useful when column names to join by are different between the two
tables.
NB: X[Y, on=c("a", y="b")] is also possible if column "a" is
common between the two tables.
For convenience during interactive scenarios, it is also possible to use .() syntax as X[Y, on=.(a, b)].
(It also recommends the vignette Secondary indices and auto indexing.)
So, this would be a possible join on two columns:
dt.1[dt.2, on = .(TeamcodeHome = Teamcode, Season), RedCardsHome := Red][]
TeamcodeHome TeamcodeAway GoalsHome GoalsAway Season RedCardsHome
1: 1 2 5 0 2006 4
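The same join can also be written with the named character vector form quoted from the help page, where the name is the x column and the value is the i column. A runnable sketch on made-up rows matching the question's tables:

```r
library(data.table)

dt.1 <- data.table(TeamcodeHome = 1, TeamcodeAway = 2,
                   GoalsHome = 5, GoalsAway = 0, Season = 2006)
dt.2 <- data.table(Teamcode = c(1, 2, 1, 2),
                   Season = c(2005, 2005, 2006, 2006),
                   Red = c(1, 3, 4, 1), Yellow = c(15, 10, 16, 4))

# "Season" is common to both tables, so it needs no name in on=
dt.1[dt.2, on = c(TeamcodeHome = "Teamcode", "Season"), RedCardsHome := i.Red]
dt.1$RedCardsHome
# 4
```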
I have a column of a data.table:
DT = data.table(R=c(3,8,5,4,6,7))
Further on I have a vector of upper cluster limits for the cluster 1, 2, 3 and 4:
CP=c(2,4,6,8)
Now I want to compare each entry of R with the elements of CP considering the order of CP. The result
DT[,NoC:=c(2,4,3,2,3,4)]
shall be a column NoC in DT, whose entries are just the number of that cluster, which the element of R belongs to.
(I need the cluster number to choose a factor out of another data.table.)
For example take the 1st entry of R: 3 is not smaller than 2 (out of CP), but smaller than 4 (out of CP). So, 3 belongs to cluster 2.
Another example, take the 6th entry of R: 7 is neither smaller than 2, 4 nor 6 (out of CP), but smaller than 8 (out of CP). So, 7 belongs to cluster 4.
How can I do that without using if-clauses?
You can accomplish this using rolling joins:
data.table(CP, key="CP")[DT, roll=-Inf, which=TRUE]
# [1] 2 4 3 2 3 4
roll=-Inf performs a NOCB rolling join - Next Observation Carried Backward. That is, in the event of a value falling in a gap, the next observation will be rolled backward. Ex: 7 falls between 6 and 8. The next value, 8, will be rolled backward. We simply get the corresponding index of each match using which=TRUE.
You can just add this as a column to DT using := as you've shown.
Note that this will return the indices after ordering CP. In your example, CP is already ordered, so it returns the result as intended. If CP is not already ordered, you'll have to add an additional column and extract that column instead of using which=TRUE. But I'll leave it to you to work it out.
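Putting the pieces together as a column assignment (a sketch, assuming CP is sorted as in the question):

```r
library(data.table)

DT <- data.table(R = c(3, 8, 5, 4, 6, 7))
CP <- c(2, 4, 6, 8)

# NOCB rolling join: each R is matched to the CP value at or above it;
# which=TRUE returns the index into the keyed CP table, i.e. the cluster
DT[, NoC := data.table(CP, key = "CP")[DT, roll = -Inf, which = TRUE]]
DT$NoC
# 2 4 3 2 3 4
```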
From your description this would seem to be the code to deliver the correct answers, but Arun, a most skillful data.tablist, seems to have come up with a completely different way to fit your expectations, so I think there must be a different way of reading your requirements.
> DT[ , NoC:= findInterval(R, c(0, 2,4,6,8) , rightmost.closed=TRUE)]
> DT
R NoC
1: 3 2
2: 8 4
3: 5 3
4: 4 3
5: 6 4
6: 7 4
I'm also very puzzled that findInterval is assigning the 5th item to the 4th interval since 6 is not greater than the upper boundary of the third interval (6).
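The puzzle comes down to findInterval using left-closed intervals: with breaks c(0,2,4,6,8) the intervals are [0,2), [2,4), [4,6), [6,8], so a value equal to an upper limit falls into the interval that *starts* there. To treat the CP values as inclusive upper limits, the left.open argument (available since R 3.3.0) flips this. A small check:

```r
# 6 lands in the 4th interval because intervals are closed on the left
findInterval(6, c(0, 2, 4, 6, 8), rightmost.closed = TRUE)
# 4

# with left.open = TRUE the intervals become (0,2], (2,4], (4,6], (6,8],
# matching the question's inclusive upper cluster limits
findInterval(c(3, 8, 5, 4, 6, 7), c(0, 2, 4, 6, 8), left.open = TRUE)
# 2 4 3 2 3 4
```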