I am trying to recode a variable using data.table. I have googled for almost 2 hours but couldn't find an answer.
Assume I have a data.table as the following:
DT <- data.table(V1=c(0L,1L,2L),
V2=LETTERS[1:3],
V4=1:12)
I want to recode V1 and V2. For V1, I want to recode 1s to 0 and 2s to 1.
For V2, I want to recode A to T, B to K, C to D.
If I use dplyr, it is simple.
library(dplyr)
DT %>%
mutate(V1 = recode(V1, `1` = 0L, `2` = 1L)) %>%
mutate(V2 = recode(V2, A = "T", B = "K", C = "D"))
But I have no idea how to do this in data.table
DT[V1==1, V1 := 0]
DT[V1==2, V1 := 1]
DT[V2=="A", V2 := "T"]
DT[V2=="B", V2 := "K"]
DT[V2=="C", V2 := "D"]
The code above is the best I could come up with, but there must be a better and more efficient way to do this.
Edit
I changed how I want to recode V2 to make my example more general.
With data.table the recode can be solved with an update on join:
DT[.(V1 = 1:2, to = 0:1), on = "V1", V1 := i.to]
DT[.(V2 = LETTERS[1:3], to = c("T", "K", "D")), on = "V2", V2 := i.to]
which converts DT to
V1 V2 V4
1: 0 T 1
2: 0 K 2
3: 1 D 3
4: 0 T 4
5: 0 K 5
6: 1 D 6
7: 0 T 7
8: 0 K 8
9: 1 D 9
10: 0 T 10
11: 0 K 11
12: 1 D 12
Edit: @Frank suggested using i.to to be on the safe side.
Explanation
The expressions .(V1 = 1:2, to = 0:1) and .(V2 = LETTERS[1:3], to = c("T", "K", "D")), resp., create lookup tables on-the-fly.
Alternatively, the lookup tables can be set-up beforehand
lut1 <- data.table(V1 = 1:2, to = 0:1)
lut2 <- data.table(V2 = LETTERS[1:3], to = c("T", "K", "D"))
lut1
V1 to
1: 1 0
2: 2 1
lut2
V2 to
1: A T
2: B K
3: C D
Then, the update joins become
DT[lut1, on = "V1", V1 := i.to]
DT[lut2, on = "V2", V2 := i.to]
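To make the behaviour of the update join concrete, here is a minimal self-contained sketch. The sample data is rebuilt with rep() so that it does not rely on recycling; everything else follows the code above.

```r
library(data.table)

# Same sample data as in the question, built explicitly with rep()
DT <- data.table(V1 = rep(0:2, 4), V2 = rep(LETTERS[1:3], 4), V4 = 1:12)

# Lookup table: old value in V1, new value in `to`
lut1 <- data.table(V1 = 1:2, to = 0:1)

# The matching rows are determined first and only then overwritten in place,
# so the order of the rows in the lookup table does not matter
DT[lut1, on = "V1", V1 := i.to]

# Rows with V1 == 0 have no match in lut1 and are left untouched
unique(DT$V1)
```

In contrast, the step-by-step version in the question only works because 1 is recoded before 2; in the reverse order, the 2s would first become 1s and then 0s.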
Edit 2: Answers to How can I use this code dynamically?
mat asked "How can I use this code dynamically?"
So, here is a modified version where the name of column to update is provided as a character variable my_var_name but the lookup tables still are created on-the-fly:
my_var_name <- "V1"
DT[.(from = 1:2, to = 0:1), on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
my_var_name <- "V2"
DT[.(from = LETTERS[1:3], to = c("T", "K", "D")), on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
There are 3 points to note:
Instead of naming the first column of the lookup table dynamically, it gets the fixed name from. This requires a join between differently named columns (foreign key join). The names of the columns to join on have to be specified via the on parameter.
The on parameter accepts character strings for foreign key joins of the form "V1==from". This string is created dynamically using paste0().
In the expression (my_var_name) := i.to, the parentheses around the variable my_var_name force data.table to use the contents of my_var_name as the column name.
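The three points above can be wrapped into a small helper. This is only a sketch: recode_col() is a hypothetical name, not a data.table function, and it assumes that the table has no columns called from or to.

```r
library(data.table)

# Hypothetical helper (not part of data.table): recode column `col` of `dt`
# by reference, mapping each value in `from` to the value at the same
# position in `to`
recode_col <- function(dt, col, from, to) {
  lut <- data.table(from = from, to = to)              # lookup table on the fly
  dt[lut, on = paste0(col, "==from"), (col) := i.to]   # foreign key update join
  invisible(dt)
}

DT <- data.table(V1 = rep(0:2, 4), V2 = rep(LETTERS[1:3], 4), V4 = 1:12)
recode_col(DT, "V1", from = 1:2, to = 0:1)
recode_col(DT, "V2", from = LETTERS[1:3], to = c("T", "K", "D"))
```

Because the update happens by reference, the function does not need to return the table for the caller to see the change.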
Dynamic code using pre-defined lookup tables
Now, while the column to recode is specified dynamically by a variable, the lookup tables are still hard-coded in the statement, which means we have stopped halfway: we also need to select the appropriate lookup table dynamically.
This can be achieved by storing the lookup tables in a list where each list element is named according to the column of DT it is supposed to recode:
lut_list <- list(
V1 = data.table(from = 1:2, to = 0:1),
V2 = data.table(from = LETTERS[1:3], to = c("T", "K", "D"))
)
lut_list
$V1
from to
<int> <int>
1: 1 0
2: 2 1
$V2
from to
<char> <char>
1: A T
2: B K
3: C D
Now, we can pick the appropriate lookup table from the list dynamically as well:
my_var_name <- "V1"
DT[lut_list[[my_var_name]], on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
Going one step further, we can recode all relevant columns of DT in a loop:
for (v in intersect(names(lut_list), colnames(DT))) {
DT[lut_list[[v]], on = paste0(v, "==from"), (v) := i.to]
}
Note that DT is updated by reference, i.e., only the affected elements are replaced in place without copying the whole object. So, the for loop is applied iteratively on the same data object. This is a speciality of data.table and will not work with data.frames or tibbles.
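A short sketch of what "by reference" means in practice (the names alias and snapshot are made up for this illustration): a plain assignment only creates a second name for the same table, whereas copy() creates an independent object.

```r
library(data.table)

DT <- data.table(V1 = rep(0:2, 2))

alias    <- DT        # plain assignment: a second name for the same table
snapshot <- copy(DT)  # copy() makes an independent deep copy

DT[V1 == 2L, V1 := 1L]  # update by reference

alias$V1     # changed along with DT
snapshot$V1  # still holds the original values
```

So if the original values are still needed after recoding, take a copy() first.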
I think this might be what you're looking for. On the left hand side of := we name the variables we want to update and on the right hand side we have the expressions we want to update the corresponding variables with.
DT[, c("V1","V2") := .(as.numeric(V1==2), sapply(V2, function(x) {if(x=="A") "T"
else if (x=="B") "K"
else if (x=="C") "D" }))]
# V1 V2 V4
#1: 0 T 1
#2: 0 K 2
#3: 1 D 3
#4: 0 T 4
#5: 0 K 5
#6: 1 D 6
#7: 0 T 7
#8: 0 K 8
#9: 1 D 9
#10: 0 T 10
#11: 0 K 11
#12: 1 D 12
Alternatively, just use recode within data.table:
library(dplyr)
DT[, c("V1","V2") := .(as.numeric(V1==2), recode(V2, "A" = "T", "B" = "K", "C" = "D"))]
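If you would rather stay within data.table, a vectorised alternative to the sapply()/if-else construct is fcase(). This is a sketch that assumes data.table 1.13.0 or later, where fcase() is available:

```r
library(data.table)

DT <- data.table(V1 = rep(0:2, 4), V2 = rep(LETTERS[1:3], 4), V4 = 1:12)

# Both right-hand sides are evaluated on the original columns
DT[, `:=`(
  V1 = as.integer(V1 == 2L),
  V2 = fcase(V2 == "A", "T",
             V2 == "B", "K",
             V2 == "C", "D")   # values matching no condition become NA
)]
```

Unlike the update join, this turns values that match none of the conditions into NA unless a default is supplied.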
mapvalues() from plyr works really well in combination with data.table.
I use it on large-ish data (50 to 400 million rows). Although I haven't benchmarked it against other approaches, I find the clear syntax is worth a lot, as it means fewer errors in complicated recode operations.
library(data.table)
library(plyr)
DT <- data.table(V1=c(0L,1L,2L),
V2=LETTERS[1:3],
V4=1:12)
DT[, V1 := mapvalues(V1, from=c(1, 2), to=c(0, 1))]
DT[, V2 := mapvalues(V2, from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))]
For more complicated recode operations, I would always first create a new variable filled with NA and use another data.table that holds the from and to vectors as columns.
A feature that in some use cases is more of a bug is that mapvalues() keeps those values of the old variable that aren't in the from argument.
This is a problem if you're sure that all the valid values are in the from vector, so that any value in the data.table that isn't in this vector should become NA instead.
DT <- data.table(V1=c(LETTERS[1:3], 'i dont want this value transfered'),
V4=1:12)
map_DT <- data.table(from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))
# NA variable to begin with is good practice because it is clearer to spot an error
DT[, V1_new := NA_character_]
DT[V1 %in% map_DT$from , V1_new := mapvalues(V1, from=map_DT$from, to=map_DT$to)][]
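For comparison, the same "unmatched values become NA" behaviour can be sketched with an update join and no plyr dependency. This example is smaller than the one above and uses the made-up value "unwanted" for the row that should not be carried over:

```r
library(data.table)

DT <- data.table(V1 = c("A", "B", "C", "unwanted"), V4 = 1:4)
map_DT <- data.table(from = c("A", "B", "C"), to = c("T", "K", "D"))

# V1_new is created by the join; rows of DT without a match in map_DT get NA
DT[map_DT, on = c(V1 = "from"), V1_new := i.to]
DT
```

Here the new column does not need to be initialised with NA first, because rows without a match are filled with NA when the column is created.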
Note that plyr is retired (it is no longer actively developed), so mapvalues() is somewhat at risk of disappearing at some point in the future. The update-join method proposed in another answer might be a better choice for that reason, although I find mapvalues() a tad clearer to read. It will most likely take many years before mapvalues() actually goes away, but it is still something to keep in mind when deciding whether to use it.
I encountered this code in a Kaggle notebook:
corrplot.mixed(corr = cor(videos[,c("category_id","views","likes",
"dislikes","comment_count"),with=F]))
videos is a data.frame.
"category_id", "views", "likes", "dislikes" and "comment_count" are columns of videos.
What does the with parameter do when selecting a subset of the columns?
As mentioned by @user20650, videos might be a data.table. In that case, though, your code should work even without with = F.
Consider this example :
library(data.table)
dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
To subset column a and b using character vector you could do
dt[, c('a', 'b'), with = F]
# a b
#1: 1 5
#2: 2 4
#3: 3 3
#4: 4 2
#5: 5 1
However, as mentioned this would work the same without with = F.
dt[, c('a', 'b')]
with = F is helpful when you have a vector of column names stored in a variable.
cols <- c('a', 'b')
dt[, cols] ##Error
dt[, cols, with = F] ##Works
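As a small sketch, these are equivalent ways to select columns whose names are stored in a variable (the result names res_with, res_sd and res_lit are just for illustration):

```r
library(data.table)

dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
cols <- c("a", "b")

res_with <- dt[, cols, with = FALSE]    # treat `cols` as a vector of column names
res_sd   <- dt[, .SD, .SDcols = cols]   # the same columns via .SD
res_lit  <- dt[, c("a", "b")]           # literal names need no with = FALSE

names(res_with)
```

Without with = FALSE, j is evaluated as an expression inside the table, so a bare cols is looked up as if it were a column.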
How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but I've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use the .. prefix on a variable that holds the column names:
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
@Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to exclude just one column from the printed output. Building on the example above, you can exclude the second column like this:
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]
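If the column to drop is known by name rather than by position, one way (a sketch; drop_cols and keep_cols are illustrative names) is to compute the names to keep with setdiff():

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)

drop_cols <- "b"                             # column(s) to leave out
keep_cols <- setdiff(names(dt), drop_cols)   # everything else

res_sd   <- dt[, .SD, .SDcols = keep_cols]
res_dots <- dt[, ..keep_cols]   # equivalent, using the .. prefix (v1.10.2+)
```

Neither call modifies dt; both return a new data.table without column b.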
I am using the data.table in R (version 3.3.2 on OS X 10.11.6) and have noticed a change in behavior from version 1.9.6 to 1.10.0 with respect to the use of the := operator and a character string for name.
I am renaming columns inside of a loop based upon the index number. Previously, I had been using eval(as.symbol("string")) on both sides of :=, but this no longer works (this was based upon answers from a previous question). Through trial and error, I figured out I needed to use ("string") on the left side and eval(as.symbol("string")) on the right side.
Here is an MCVE that demonstrates this behavior:
library(data.table)
dt <- data.table(col1 = 1:10, col2 = 11:20)
## the next lines would be inside a loop that is excluded to simplify this MCVE
colA = paste0("col", 1)
colB = paste0("col", 2)
colC = paste0("col", 3)
## Old code that worked with 1.9.6, but no longer works
dt[ , eval(as.symbol(colC)) := eval(as.symbol(colA)) + eval(as.symbol(colB))]
## New code that works with 1.10.0
dt[ , (colC) := eval(as.symbol(colA)) + eval(as.symbol(colB))]
I have looked through the data.table documentation and have not been able to figure out why this workaround works. So, here is my question:
Why do I need the eval(as.symbol("string")) on the right side, but not on the left?
From a discussion, it is now assumed that if the left-hand side of := is a single string, it is taken as the name of the column, so that, for example, dt[, "col" := 3] will also work.
There's been a fair bit of changing around with exactly when this became the default, but the full story is contained in both the previous post and the data.table news.
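On the right-hand side, by contrast, a string stays a string, so it has to be turned into a column reference explicitly. A minimal sketch using get(), which is a common alternative to eval(as.symbol()), and which assumes dt has no columns named colA or colB:

```r
library(data.table)

dt <- data.table(col1 = 1:10, col2 = 11:20)
colA <- "col1"
colB <- "col2"
colC <- "col3"

# LHS: (colC) means "the column whose name is stored in colC"
# RHS: get() looks up the columns named by colA and colB
dt[, (colC) := get(colA) + get(colB)]
```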
It may be of interest to you, however, that with
new_cols = c("j1", "j2")
dt[, (new_cols) := value] # brackets so we don't just make a new_col col
or
dt[, c("j1", "j2") := value]
it may be possible for you to achieve the above without needing a loop
library(data.table)
dt = data.table(a = c(2, 3), b = c(5, 7), c = c(11, 13))
cols1 = sapply(c("a", "b"), as.symbol)
cols2 = sapply(c("b", "c"), as.symbol)
new_cols = c("d", "e")
> print(dt)
a b c
1: 2 5 11
2: 3 7 13
dt[, (new_cols) := purrr::map2(cols1, cols2, ~ eval(.x) + eval(.y))]
a b c d e
1: 2 5 11 7 16
2: 3 7 13 10 20
I have a df of 4 columns - c("Observation.ID", "Event.Type", "Property.Damage", "Magnitude").
Magnitude values signify whether property damage is given in thousands, millions, or billions of dollars ("K","M","B").
I want to normalize Property.Damage, so I need to separately compute for the 3 groups:
update df set Property.Damage=(Property.Damage*n) where Magnitude='K'
In dplyr, I understand how to split on class, add the recomputed property damage, combine, then summarize. Surely it's possible to do this more simply, a la SQL?
Edit: I went with data.table because it feels quick/easy compared to base. E.g.:
setkey(df1, Magnitude)
df1["K", PROPDMG := PROPDMG*1e3]
df1["M", PROPDMG := PROPDMG*1e6]
df1["B", PROPDMG := PROPDMG*1e9]
You may be better off just making a lookup table and merging this back in before doing the multiplication. Something like:
df <- data.frame(propdmg=1:6, magnitude=rep(c("K","M","B"),each=2))
df
# propdmg magnitude
#1 1 K
#2 2 K
#3 3 M
#4 4 M
#5 5 B
#6 6 B
library(dplyr)
lkup <- data.frame(magnitude=c("K","M","B"),mult=c(1e3,1e6,1e9))
left_join(df, lkup) %>% mutate(result=propdmg * mult, mult=NULL)
#Joining by: "magnitude"
# propdmg magnitude result
#1 1 K 1e+03
#2 2 K 2e+03
#3 3 M 3e+06
#4 4 M 4e+06
#5 5 B 5e+09
#6 6 B 6e+09
The direct equivalent in base R would be:
transform(merge(df, lkup), result=mult * propdmg, mult=NULL)
I found that data.table was the most appealing approach. In fact, this has switched me from dplyr to data.table for split/apply/combine. Although it appears base R makes for the fewest keystrokes - I find data.table's i,j, := is less wonky parenthetically.
setkey(df1, Magnitude)
df1["K", PROPDMG := PROPDMG*1e3]
df1["M", PROPDMG := PROPDMG*1e6]
df1["B", PROPDMG := PROPDMG*1e9]
Alternatively, we can create another data.table as follows:
df2 = data.table(Magnitude = c("K", "M", "B"), mult = c(1e3, 1e6, 1e9))
and then perform an update while joining as follows:
df1[df2, PROPDMG := PROPDMG*mult, by=.EACHI, on="Magnitude"]
on= lets you perform binary-search-based subsets and joins without having to set keys. by=.EACHI evaluates the expression in j once for each row of df2.
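A self-contained sketch of this update join, using a reduced, made-up version of df1 with only the two relevant columns. As a variation on the statement above, it drops by=.EACHI and refers to the multiplier as i.mult, which makes explicit that it comes from df2:

```r
library(data.table)

df1 <- data.table(PROPDMG = c(1, 5, 3, 4, 7),
                  Magnitude = c("K", "M", "K", "B", "M"))
df2 <- data.table(Magnitude = c("K", "M", "B"), mult = c(1e3, 1e6, 1e9))

# For every row of df1 that matches a row of df2 on Magnitude,
# multiply PROPDMG by the multiplier taken from df2 (i.mult)
df1[df2, on = "Magnitude", PROPDMG := PROPDMG * i.mult]
df1
```

No key is set on either table; on= is enough to tell data.table which column to join on.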
We could use base R to do this
transform(df1, Property.Damage = Property.Damage * setNames(c(1e3,
1e6, 1e9), c("K", "M", "B"))[Magnitude])
data
df1 <- data.frame(Observation.ID = 1:5, Event.Type = LETTERS[1:5],
Property.Damage = c(1, 5, 3, 4, 7),
Magnitude = c("K", "M", "K", "B", "M"), stringsAsFactors=FALSE)
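The base R one-liner relies on indexing a named vector by a character column. Spelled out step by step on a reduced version of the data (the variable name mult is just for illustration), it looks roughly like this:

```r
df1 <- data.frame(Property.Damage = c(1, 5, 3, 4, 7),
                  Magnitude = c("K", "M", "K", "B", "M"),
                  stringsAsFactors = FALSE)

mult <- c(K = 1e3, M = 1e6, B = 1e9)   # named lookup vector

# Indexing by the character column returns one multiplier per row;
# a Magnitude that is missing from `mult` would yield NA
df1$Property.Damage <- df1$Property.Damage * unname(mult[df1$Magnitude])
df1
```

This only works as intended if Magnitude is a character column; with a factor, the indexing would use the integer codes instead of the labels.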