R data.table - new column with ':=' and keep existing columns

Is it possible to create a new column and keep a few existing columns in the same statement? e.g. creation of an "x" column and then keeping only "x" and "mpg":
dt <- data.table(mtcars)
dt[,x:=mpg]
dt[,.(x,mpg)]

If you want to do the replacement by reference using :=, then you can do
dt[, x:=mpg][, setdiff(colnames(dt), c('x', 'mpg')) := NULL]

If we need it in a single step, instead of using := to modify the original dataset, specify it with = inside list() or .():
dt[,.(x = mpg, mpg)]
Or, if it is necessary to create the column in the original dataset, the calls can be chained:
dt[, x := mpg][, .(x, mpg)]
If we want to update the columns in the original object, another option is set
set(dt[, x:= mpg], i = NULL, j = names(dt)[!names(dt) %in% c('x', 'mpg')], value = NULL)
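Either way, a quick check (a minimal sketch; the exact column order depends on which of the approaches above was run) should show that only the two columns remain:
names(dt)
# [1] "mpg" "x"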

Related

Creating new variable and save in new object in data table in R

This may be a very elementary question. I want to create a new variable in an existing data.table object and save the new object under a different name. However, I am facing a problem:
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
a <- colnames(DT)[-1] #variables to be modified
a.va <- paste(a, "x", sep = "_") #new variables name
DT2 <- DT[, (a.va) := lapply(.SD, FUN = function(x) (sum(x, na.rm = T)-x) / (.N - 1)), .SDcols = a, by = x]
However, it changes the existing object DT as well. I want to keep the original object DT intact. In my original data set there are more than 100 variables, so I cannot create each new variable one by one.
The := adds a new column by reference, so it changes the current object. If we want a new object, use copy to create a new object from 'DT', then do the assignment := on the copied object:
DT2 <- copy(DT)
DT2[, (a.va) := lapply(.SD, FUN = function(x) (sum(x, na.rm = T)-x) / (.N - 1)), .SDcols = a, by = x]
Checking:
> names(DT2)
[1] "x" "y" "v" "y_x" "v_x"
> names(DT)
[1] "x" "y" "v"

.SD and .SDcols for the i expression in data.table join

I'm trying to copy a subset of columns from Y to X based on a join, where the subset of columns is dynamic.
I can identify the columns quite easily: names(Y)[grep("xxx", names(Y))]
But when I try to use that code in the j expression, it just gives me the column names, not the values of the columns. The .SD and .SDcols constructs get pretty close, but they only apply to the x expression. I'm trying to do something like this:
X[Y, names(Y)[grep("xxx", names(Y))] := .SD, .SDcols = names(Y)[grep("xxx", names(Y))], on = .(zzz)]
Is there an equivalent set of .SD and .SDcols constructs that apply to the i expression? Or do I need to build up a string for the j expression and eval that string?
Perhaps this will help you get started:
library(data.table)
X <- as.data.table(mtcars[1:5], keep.rownames = "id")
Y <- as.data.table(mtcars, keep.rownames = "id")
cols <- c("gear", "carb")
# copy cols from Y to X based on common "id":
X[Y, (cols) := mget(cols), on = "id"]
As Frank notes in his comment, it might be safer to prefix the column names with i. to ensure the assigned columns are indeed from Y:
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]
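Tying this back to the dynamic selection in the question, the cols vector can also be built with grep before the join (a sketch, assuming Y really does have columns whose names match "xxx"):
cols <- grep("xxx", names(Y), value = TRUE)   # dynamic column subset
X[Y, (cols) := mget(paste0("i.", cols)), on = "id"]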

delete column in data.table in R based on condition [duplicate]

I'm trying to manipulate a number of data.tables in similar ways, and would like to write a function to accomplish this. I would like to pass in a parameter containing a list of columns on which the operations would be performed. This works fine when the vector of columns is declared on the left hand side of the := operator, but not if it is declared earlier (or passed into the function). The following code shows the issue.
dt = data.table(a = letters, b = 1:2, c=1:13)
colsToDelete = c('b', 'c')
dt[,colsToDelete := NULL] # doesn't work but I don't understand why not.
dt[,c('b', 'c') := NULL] # works fine, but doesn't allow passing in of columns
The error is "Adding new column 'colsToDelete' then assigning NULL (deleting it)." So clearly, it's interpreting 'colsToDelete' as a new column name.
The same issue occurs when doing something along these lines
dt[, colNames := lapply(.SD, adjustValue, y=factor), .SDcols = colNames]
I'm new to R, but rather more experienced with some other languages, so this may be a silly question.
It's basically because we allow symbols on the LHS of := to add new columns, for convenience, e.g. DT[, col := val]. So, in order to distinguish col itself being the column name from col holding the column names, we check whether the LHS is a name or an expression.
If it's a name, it adds the column with that name; if it's an expression, it gets evaluated:
DT[, col := val] # col is the column name.
DT[, (col) := val] # col gets evaluated and replaced with its value
DT[, c(col) := val] # same as above
The preferred idiom is: dt[, (colsToDelete) := NULL]
HTH
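Since the original goal was to pass the columns into a function, a minimal sketch of such a wrapper (the name delete_cols is just for illustration) could look like:
delete_cols <- function(d, cols) {
  # wrapping cols in ( ) makes data.table evaluate it instead of treating it as a column name
  d[, (cols) := NULL]
  invisible(d)
}
dt <- data.table(a = letters, b = 1:2, c = 1:13)
delete_cols(dt, c('b', 'c'))
names(dt)
# [1] "a"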
I am surprised no answer provided uses the set() function:
set(dt, j = colsToDelete, value = NULL)
This should be the easiest.
To extend on the previous answer, you can delete columns by reference by doing:
# delete columns 10 to 15
dt[ , (10:15) := NULL ]
or
# delete columns 3, 5 and 10 to 15
dt[ , (c(3,5,10:15)) := NULL ]
This code did the job for me. You need the positions of the columns to be deleted, e.g. posvec, as mentioned in ?set:
j: Column name(s) (character) or number(s) (integer) to be assigned value when column(s) already exist, and only column name(s) if they are to be created.
DT_removed_selected_col = set(DT, j = posvec, value = NULL)
Also, if you want to get posvec, you can try this:
selected_col = c('col_a','col_b',...)
selected_col = unlist(sapply(selected_col, function(x) grep(x,names(DT))))
namvec = names(selected_col) #col names
posvec = unname(selected_col) #col positions

Loop through data.table and create new columns basis some condition

I have a data.table with quite a few columns. I need to loop through them and create new columns based on some condition. Currently I am writing a separate line of code for each column. Let me explain with an example. Consider this sample data:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each = 10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20, 1, 1),
                 two = rnorm(20, 2, 1),
                 three = rnorm(20, 3, 1),
                 four = rnorm(20, 4, 1),
                 five = rnorm(20, 5, 2),
                 six = rnorm(20, 6, 2),
                 seven = rnorm(20, 7, 2),
                 total = rnorm(20, 28, 3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit and lowlimit, for a 2-sigma outlier calculation. I am doing this with:
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x), uplimit = mean(x)+1.96*sd(x), lowlimit = mean(x)-1.96*sd(x))))), by = .(town,tc)]
I then merge this DTnew data.table with my DT:
DTmerge <- merge(DT, DTnew, by= c('town','tc'))
Now, to flag the outliers, I am writing a separate line of code for each variable:
DTAoutlier <- DTmerge[ ,one.Aoutlier := ifelse (one >= one.lowlimit & one <= one.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,two.Aoutlier := ifelse (two >= two.lowlimit & two <= two.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,three.Aoutlier := ifelse (three >= three.lowlimit & three <= three.uplimit,0,1)]
Can someone help simplify this code so that:
1. I don't have to write a separate line of code for each outlier column. In this example we have only 8 variables, but what if we had 100? Would we end up writing 100 lines of code? Can this be done using a for loop? How?
2. In general, for data.table, how can we add new columns while retaining the original columns? For example, below I am taking the log of columns 3 to 10. If I don't create a new DTlog it overwrites the original columns in DT. How can I retain the original columns in DT and also have the new columns in DT?
DTlog <- DT[,(lapply(.SD,log)),by = .(town,tc),.SDcols=3:10]
Look forward to some expert suggestions.
We can do this using :=. We subset the column names that are not the grouping variables ('nm'), create a vector of names for the new columns using outer ('nm1'), then use the OP's code, unlist the output and assign (:=) it to 'nm1' to create the new columns:
nm <- names(DT)[-(1:2)]
nm1 <- c(t(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep="_")))
DT[, (nm1) := unlist(lapply(.SD, function(x) {
       Mean <- mean(x)
       SD <- sd(x)
       uplimit <- Mean + 1.96*SD
       lowlimit <- Mean - 1.96*SD
       list(Mean, SD, uplimit, lowlimit)
     }), recursive = FALSE),
   by = .(town, tc)]
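As a quick sanity check (a small sketch), there should be 4 new columns per measured variable, all present in DT after the assignment:
length(nm1)
# [1] 32
all(nm1 %in% names(DT))
# [1] TRUE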
The second part of the question involves a logical comparison between columns. One option is to subset the initial columns and the 'lowlimit'/'uplimit' columns separately, do the comparison (as these have the same dimensions) to get a logical output that can be coerced to binary with +, and then assign it to the original dataset to create the outlier columns:
m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep="_"), with = FALSE] &
        DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep="_"), with = FALSE])
DT[, paste(nm, "Aoutlier", sep=".") := as.data.frame(m1)]
Or, instead of comparing data.tables, we can use a for loop with set (which would be more efficient):
nm2 <- paste(nm, "Aoutlier", sep=".")
DT[, (nm2) := NA_integer_]
for (j in nm) {
  set(DT, i = NULL, j = paste(j, "Aoutlier", sep="."),
      value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep="_")]] &
                         DT[[j]] <= DT[[paste("uplimit", j, sep="_")]]))
}
The 'log' columns can also be created with :=
DT[,paste(nm, "log", sep=".") := lapply(.SD,log),by = .(town,tc),.SDcols=nm]
Your data should probably be in long format:
m = melt(DT, id=c("town","tc"))
Then just write your test once
m[,
is_outlier := +(abs(value-mean(value)) > 1.96*sd(value))
, by=.(town, tc, variable)]
I see no outliers in this data (according to the given definition of outlier):
m[, .N, by=is_outlier] # this is a handy alternative to table()
# is_outlier N
# 1: 0 160
How it works
melt keeps the id columns and stacks all the rest into variable (the column names) and value (the column contents).
+x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0
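A tiny illustration of that coercion:
x <- c(TRUE, FALSE, NA)
+x             # 1 0 NA
as.integer(x)  # 1 0 NA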
If you really like your data in wide format, though:
vjs = setdiff(names(DT), c("town","tc"))
DT[,
paste0(vjs,".out") := lapply(.SD, function(x) +(abs(x-mean(x)) > 1.96*sd(x)))
, by=.(town, tc), .SDcols=vjs]
For completeness, it should be noted that dplyr's mutate_each provides a handy way of tackling such problems:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate_each(funs(mean, sd,
                   uplimit = (mean(.) + 1.96*sd(.)),
                   lowlimit = (mean(.) - 1.96*sd(.)),
                   Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
                                         . <= mean(.) + 1.96*sd(.))),
              -town, -tc)
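Note that mutate_each() and funs() have since been deprecated in dplyr; a rough sketch of just the outlier flags using across() (dplyr >= 1.0) might look like this, assuming the same contiguous column range one:total:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate(across(one:total,
                ~ as.integer(.x >= mean(.x) - 1.96 * sd(.x) &
                             .x <= mean(.x) + 1.96 * sd(.x)),
                .names = "{.col}.Aoutlier")) %>%
  ungroup()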

R change column class by variable

In R I have a data.table df with an integer column X. I want to convert this column from integer to character.
This is really easy with df[, X := as.character(X)].
Now for the question: I have the name of the column (X) stored in a variable like this:
col_name <- 'X'
How do I access the column (and convert it to a character column) while only knowing the variable? I tried numerous things, all yielding either nothing useful or a column of NAs. Which syntax will get me the result I want?
library(data.table)
DT <- as.data.table(iris)
col_name <- "Petal.Length"
Wrap the LHS of := in ( so that it gets evaluated, and use list subsetting ([[) to select the column:
DT[, (col_name) := as.character(DT[[col_name]])]
class(DT[[col_name]])
#[1] "character"
We can specify it in .SDcols and assign the converted column back:
df[, (col_name) := as.character(.SD[[1L]]), .SDcols = col_name]
If there is more than one column, use lapply:
df[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]
data
df <- data.table(X = as.integer(1:5), Y = LETTERS[1:5])
col_name <- "X"
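A quick check of the multi-column version against this data (a minimal sketch, with col_names assumed to hold the columns to convert):
col_names <- c("X", "Y")
df[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]
sapply(df, class)
#           X           Y
# "character" "character"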
