unique working incorrectly with data.table [duplicate] - r

This question already has answers here:
Extracting unique rows from a data table in R [duplicate]
(2 answers)
Closed 4 years ago.
I've discovered some interesting behavior in data.table, and I'm curious if someone can explain to me why this is happening. I'm merging two data.tables (in this MWE, one has 1 row and the other 2 rows). The merged data.table has two unique rows, but when I call unique() on the merged data.table, I get a data.table with one row. Am I doing something wrong? Or is this a bug?
Here's an MWE:
library(data.table)
X = data.table(keyCol = 1)
setkey(X, keyCol)
Y = data.table(keyCol = 1, otherKey = 1:2)
setkeyv(Y, c("keyCol", "otherKey"))
X[Y, ] # 2 unique rows
unique(X[Y, ]) # Only 1 row???
I'd expect unique(X[Y, ]) to be the same as X[Y, ] since all rows are unique, but this doesn't seem to be the case.

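The default value of the by argument of unique.data.table is key(x) (this applies to data.table versions before 1.9.8; from 1.9.8 on, the default is all columns). Therefore, if you call unique(x) on a keyed data.table in those versions, it only looks at the key columns. To override this, do: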
unique(x, by = NULL)
With by = NULL, all columns are considered. Alternatively, you can provide by = names(x).
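As a minimal sketch of the difference, the example below rebuilds the joined table from the question and passes by explicitly, so the result does not depend on which default the installed data.table version uses. The name Z and the use of on = in the join are assumptions made for illustration.

```r
library(data.table)

X <- data.table(keyCol = 1)
Y <- data.table(keyCol = 1, otherKey = 1:2)

# Z is a hypothetical name for the joined table from the question.
Z <- X[Y, on = "keyCol"]

# Deduplicate on every column: both rows differ in otherKey, so both are kept.
all_cols <- unique(Z, by = names(Z))

# Deduplicate on keyCol only: this mimics the old key-based default.
key_only <- unique(Z, by = "keyCol")

print(nrow(all_cols))
print(nrow(key_only))
```

Passing by explicitly is a reasonable habit when the same code has to run against both old and new data.table versions.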

Related

Select columns using a variable in R [duplicate]

This question already has answers here:
Extract columns from data table by numeric indices stored in a vector
(2 answers)
Closed 1 year ago.
Given a data.table, how can I select a set of columns using a variable?
Example:
df[, 1:3]
is OK, but
idx <- 1:3
df[, idx]
is not OK: column named "idx" does not exist.
How can I use idx to select the columns in the simplest possible way?
We can prefix idx with .. to select the columns in data.table, or use with = FALSE:
library(data.table)
df[, ..idx]
df[, idx, with = FALSE]
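A self-contained sketch of both forms follows. The table contents and the result names by_prefix and by_with are made up for illustration, since the question does not show how df was built.

```r
library(data.table)

# Hypothetical example data; the question does not define df.
df <- data.table(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE), d = 4:6)
idx <- 1:3

# The .. prefix tells data.table to look idx up in the calling scope
# instead of treating it as a column name.
by_prefix <- df[, ..idx]

# with = FALSE makes j behave like data.frame column indexing.
by_with <- df[, idx, with = FALSE]

print(names(by_prefix))
print(names(by_with))
```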

Insert or print a column name inside a data table call [duplicate]

This question already has answers here:
Converting multiple data.table columns to factors in R
(2 answers)
Closed 2 years ago.
I have what seems to be a rather simple problem, but I cannot solve it myself.
Can I somehow insert or print a column name within a data table call? I have something like this in mind:
col_names = c("column1","column2")
for (col in col_names){
datatable$col ...
}
or
col_names = c("column1","column2")
for (col in col_names){
datatable[,col] ...
}
What I would eventually like to do is transform the variables in certain columns into ordered factors. Since there are many columns, I'm looking for a neater way than coding the same line 20 times with the column name as the only difference.
Are you trying to print just the column name or the entire column within the data table?
You could try something like this:
col_names = c("column1","column2")
for (i in seq_along(col_names)){
# [[ ]] extracts a column by name; datatable["column1"] would be treated as a row lookup
print(datatable[[col_names[[i]]]])
}
Or if you just want the names printed.
col_names = c("column1","column2")
for (i in seq_along(col_names)){
print(col_names[[i]])
}
Also, you might want to check out the iteration chapter in R for Data Science.
Perhaps you can use lapply with .SDcols to apply a function over the columns named in col_names. You can try something like this:
library(data.table)
datatable[, (col_names) := lapply(.SD, function(x) factor(x, ordered = TRUE)),
.SDcols = col_names]
Here we apply factor(x, ordered = TRUE) to each column named in col_names, where x is the vector of values in each individual column (not the column name).
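A small runnable sketch of this conversion is shown below. The example data is invented for illustration; only the names datatable, col_names, column1 and column2 come from the question.

```r
library(data.table)

# Hypothetical example data; the column names follow the question.
datatable <- data.table(
  column1 = c("low", "high", "mid"),
  column2 = c("b", "a", "c"),
  other   = 1:3
)
col_names <- c("column1", "column2")

# (col_names) on the left-hand side is evaluated to the column names;
# .SD holds only the columns listed in .SDcols.
datatable[, (col_names) := lapply(.SD, function(x) factor(x, ordered = TRUE)),
          .SDcols = col_names]

print(sapply(datatable, function(x) class(x)[1]))
```

Without a levels argument, factor orders the levels alphabetically, so a meaningful ordering (for example low < mid < high) would need levels = passed explicitly.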

Better syntax for adding a column from other data.table [duplicate]

This question already has answers here:
Left join using data.table
(3 answers)
Assign value (stemming from configuration table) to group based on condition in column
(1 answer)
Closed 4 years ago.
I have two indexed data tables, and I want to add a column from one table to the other by index. My current approach is as follows:
A <- data.table(index = seq(6,10), a = rnorm(5))
B <- data.table(index = seq(10), b = rnorm(10))
setkey(B, index)
A[, b := B[.(A[,index]), b]]
While this gets the job done, the syntax seems a bit redundant. Is there a cleaner way to perform the same operation?
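We can do this with an update join:
A[B, b := i.b, on = .(index)]
The i. prefix makes it explicit that b is taken from B. The setkey step is not needed here, because on = names the join column directly.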
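The join can be checked end to end with a small sketch like the one below. It assumes the A and B tables from the question; set.seed and the i.b spelling are additions for reproducibility and clarity.

```r
library(data.table)

set.seed(1)  # only so the random columns are reproducible
A <- data.table(index = 6:10, a = rnorm(5))
B <- data.table(index = 1:10, b = rnorm(10))

# Update join: for rows of A that match B on index, copy B's b into A.
# i.b refers explicitly to column b of the table in the i position (B).
A[B, b := i.b, on = .(index)]

print(A)
```

Rows of B with no match in A (index 1 to 5 here) are simply ignored, so A keeps its five rows.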

R data.table multi column conversion by names [duplicate]

This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
(7 answers)
Closed 7 years ago.
Let DT be a data.table:
DT<-data.table(V1=factor(1:10),
V2=factor(1:10),
...
V9=factor(1:10))
Is there a better/simpler method to do multicolumn factor conversion like this:
DT[,`:=`(
Vn1=as.numeric(V1),
Vn2=as.numeric(V2),
Vn3=as.numeric(V3),
Vn4=as.numeric(V4),
Vn5=as.numeric(V5),
Vn6=as.numeric(V6),
Vn7=as.numeric(V7),
Vn8=as.numeric(V8),
Vn9=as.numeric(V9)
)]
Column names are totally arbitrary.
Yes, the most efficient way would probably be to run set in a for loop.
Set the desired columns to modify (you can choose all the columns by using names(DT) instead):
cols <- c("V1", "V2", "V3")
Then just run the loop
for (j in cols) set(DT, i = NULL, j = j, value = as.numeric(DT[[j]]))
Or, a bit less efficient but more readable (note the parentheses around cols, which make data.table evaluate the variable instead of creating a column named cols):
## if you choose all the names in DT, you don't need to specify the `.SDcols` parameter
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
Both should be efficient even for a big data set. You can read some more about data.table basics here
Beware of converting factors to numeric in this way, though: as.numeric on a factor returns the internal integer level codes, not the values shown as labels (see here for more details).
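The caveat about factors can be illustrated with a short sketch. The data and the name codes are invented for illustration; going through as.character is one common way to recover the labelled values and is not part of the original answer.

```r
library(data.table)

# Hypothetical data: factor labels that look like numbers.
DT <- data.table(V1 = factor(c(10, 20, 30)),
                 V2 = factor(c(5, 7, 9)),
                 V3 = letters[1:3])
cols <- c("V1", "V2")

# as.numeric on a factor gives the level codes, not the labels.
codes <- as.numeric(DT$V1)
print(codes)

# Converting via as.character first recovers the labelled values.
for (j in cols) set(DT, j = j, value = as.numeric(as.character(DT[[j]])))

print(DT)
```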

Update a vector in a data.table [duplicate]

This question already has answers here:
Using lists inside data.table columns
(2 answers)
Closed 8 years ago.
I have this:
dt = data.table(index=c(1,2), items=list(c(1,2,3),c(4,5)))
# index items
#1: 1 1,2,3
#2: 2 4,5
I want to change the dt[index==2,items] to c(6,7).
I tried:
dt[index==2, items] = c(6,7)
dt[index==2, items := c(6,7)]
One workaround is to use ifelse:
dt[,items:=ifelse(index==2,list(c(6,7)),items)]
index items
1: 1 1,2,3
2: 2 6,7
EDIT the correct answer:
dt[index==2,items := list(list(c(6,7)))]
Indeed, you'll need one more list because data.table uses list(.) to look for values to assign to columns by reference.
There are two ways to use the := operator in data.table:
The LHS := RHS form:
DT[, c("col1", "col2", ...) := list(val1, val2, ...)]
It takes a list() argument on the RHS. To add a list column, you'll need to wrap with another list (as illustrated above).
The functional form:
DT[, `:=`(col1 = val1, ## some comments
col2 = val2, ## some more comments
...)]
It is especially useful to add some comments along with the assignment.
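A runnable sketch of the list-column update from the accepted fix is given below, using the dt from the question. The comments describe how the nested list is interpreted; treat it as an illustration rather than the only way to do this.

```r
library(data.table)

dt <- data.table(index = c(1, 2), items = list(c(1, 2, 3), c(4, 5)))

# The outer list() is the usual RHS wrapper for := (one element per column).
# The inner list() is the value for the list column itself (one element per row).
dt[index == 2, items := list(list(c(6, 7)))]

print(dt)
```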
dt[index==2]$items[[1]] <- list(c(6,7))
dt
# index items
# 1: 1 1,2,3
# 2: 2 6,7
The problem is that, the way you have it set up, dt$items is a list, not a vector, so you have to use list indexing (e.g., dt$items[[1]]). But AFAIK you can't update a list element by reference, so, e.g.,
dt[index==2,items[[1]]:=list(c(6,7))]
will not work.
BTW I also do not see the point of using data.tables for this.
This worked:
dt$items[[which(dt$index==2)]] = c(6,7)
