Let the following data set be given:
library('data.table')
set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y = sample(10))
my.rows <- sample(1:dim(DT)[1], 3)
I want to add a new column to the data set such that, whenever a row number matches one of the values in my.rows, the entry is populated with, say, true, and with false otherwise.
I have got DT[my.rows, z:= "true"], which gives
head(DT)
x y z
1: A 2 NA
2: B 6 NA
3: C 5 true
4: D 8 NA
5: E 9 true
6: F 4 NA
but I do not know how to populate the else condition at the same time. I guess I should use some sort of inline ifelse, but I am missing the correct syntax.
We can compare 'my.rows' with the sequence of row numbers using %in% to create a logical vector, and assign it (with :=) to create the 'z' column.
DT[, z:= 1:.N %in% my.rows ]
Another option would be to create 'z' as a column of FALSE and then, using 'my.rows' as 'i', assign TRUE to the elements of 'z' that correspond to 'i'.
DT[, z:= FALSE][my.rows, z:= TRUE]
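The question also mentions an inline ifelse; a minimal sketch of that (assuming the same DT and my.rows, and using the character labels from the original attempt) would be:
# .I inside j gives the row numbers, so one pass fills "true"/"false"
DT[, z := ifelse(.I %in% my.rows, "true", "false")]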
Another option is to cbind the result of an ifelse over the row indices (note this leaves NA rather than FALSE for non-matching rows):
DT <- cbind(DT, z = ifelse(DT[, .I] %in% my.rows, TRUE, NA))
> DT
# x y z
# 1: A 2 NA
# 2: B 6 NA
# 3: C 5 TRUE
# 4: D 8 NA
# 5: E 9 TRUE
# 6: F 4 NA
# 7: G 1 TRUE
# 8: H 7 NA
# 9: I 10 NA
#10: J 3 NA
When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.
For example, in the following I am grouping by the remainder of a divided by 3, and keeping the first and last observation in each group. However, a is no longer present in .SD:
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1, .N)], by = f(a)]
f x y
1: 1 0.2597929 1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503 1.3122561
6: 0 2.6501140 1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
a x y
1: 1 0.2597929 1.0256259
2: 19 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 20 0.6600591 -0.5827745
5: 3 1.3758503 1.3122561
6: 18 2.6501140 1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row indices, extract that column ($V1), and subset the original dataset:
library(data.table)
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
# a x y
#1: 1 0.7265238 0.5631753
#2: 19 1.7110611 -0.3141118
#3: 2 0.1643566 -0.4704501
#4: 20 0.5182394 -0.1309016
#5: 3 -0.6039137 0.1349981
#6: 18 0.3094155 -1.1892190
NOTE: The values in columns 'x' and 'y' differ from the question's output because no set.seed was used.
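The same .I idiom extends to other per-group selections; for instance, a sketch that keeps only the first row of each group (same Q as above):
Q[Q[, .I[1], by = f(a)]$V1]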
I was wondering what causes the following behavior, which surprised me a bit: I defined a data.table dt_3, then defined dt_1 to be equal to dt_3. When I then used set() to replace row elements in dt_1, the corresponding elements of dt_3 were changed as well:
refcols=c("A","B")
dt_3 = data.table(A=c(1,1,3,5,6,7), B = c("x","y","z","q","w","e"), C = rep("NO",6))
dt_2 = data.table(A=c(3,5,7), B = c("z","q","x"), D=c(3,5,99))
dt_1 = dt_3
dt_3
A B C
1: 1 x NO
2: 1 y NO
3: 3 z NO
4: 5 q NO
5: 6 w NO
6: 7 e NO
for(j in refcols){
  set(dt_1, 2, j, dt_2[3, get(j)])
}
Warning messages:
1: In set(dt_1, 2, j, dt_2[3, get(j)]) :
Coerced i from numeric to integer. Please pass integer for efficiency; e.g., 2L rather than 2
2: In set(dt_1, 2, j, dt_2[3, get(j)]) :
Coerced i from numeric to integer. Please pass integer for efficiency; e.g., 2L rather than 2
dt_3
A B C
1: 1 x NO
2: 7 x NO
3: 3 z NO
4: 5 q NO
5: 6 w NO
6: 7 e NO
What is causing this and is there an easier way to subset by explicit row indices for specific columns like this?
We can use copy so that when we replace elements in one dataset, the other won't change:
dt_1 <- copy(dt_3)
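A quick way to see the aliasing is data.table's address(); a small sketch (dt_alias and dt_copy are just illustrative names):
dt_alias <- dt_3
address(dt_alias) == address(dt_3)  # TRUE: both names point to the same object, so set() on one shows up in the other
dt_copy <- copy(dt_3)
address(dt_copy) == address(dt_3)   # FALSE: copy() creates an independent table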
Regarding the second part, the intended row index is not entirely clear. If the replacement is based only on the columns:
for(j in refcols){
  set(dt_1, i = NULL, j = j, value = dt_2[[j]])
}
dt_1
# A B C
#1: 3 z NO
#2: 5 q NO
#3: 7 x NO
#4: 3 z NO
#5: 5 q NO
#6: 7 x NO
If the 2nd row of the "A" and "B" columns in 'dt_1' should be replaced by the 3rd row of the corresponding columns in 'dt_2' (based on 'refcols'):
for(j in refcols){
  set(dt_1, i = 2L, j = j, value = dt_2[[j]][3])
}
dt_1
# A B C
#1: 1 x NO
#2: 7 x NO
#3: 3 z NO
#4: 5 q NO
#5: 6 w NO
#6: 7 e NO
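If you prefer := over the loop, a single multi-column assignment should give the same result (a sketch; the .. prefix needs a reasonably recent data.table, roughly 1.10.2 or later):
# assign row 3 of dt_2's refcols columns to row 2 of dt_1, by reference
dt_1[2, (refcols) := dt_2[3, ..refcols]]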
I'd like to retain the by variables in a by-without-by operation using data.table.
I have a by-without-by that used to work (ca. 2 years ago), and now with the latest version of data.table I think the behavior must have changed.
Here's a reproducible example:
library(data.table)
dt <- data.table( by1 = letters[1:3], by2 = LETTERS[1:3], x = runif(3) )
by <- c("by1","by2")
allPermutationsOfByvars <- do.call(CJ, sapply(dt[,by,with=FALSE], unique, simplify=FALSE)) ## CJ() to form index
setkeyv(dt, by)
dt[ allPermutationsOfByvars, list( x = x ) ]
Which produces:
> dt[ allPermutationsOfByvars, list( x = x ) ]
x
1: 0.9880997
2: NA
3: NA
4: NA
5: 0.4650647
6: NA
7: NA
8: NA
9: 0.4899873
I could just do:
> cbind( allPermutationsOfByvars, dt[ allPermutationsOfByvars, list( x = x ) ] )
by1 by2 x
1: a A 0.9880997
2: a B NA
3: a C NA
4: b A NA
5: b B 0.4650647
6: b C NA
7: c A NA
8: c B NA
9: c C 0.4899873
Which indeed works, but is inelegant and possibly inefficient.
Is there an argument I'm missing or a clever stratagem to retain the by variables?
Add by = .EACHI to get the "by-without-by" aka by-EACH-element-of-I:
dt[allPermutationsOfByvars, x, by = .EACHI]
And this is how I'd have done the initial part:
allPermutationsOfByvars = dt[, do.call(CJ, unique(setDT(mget(by))))]
Finally, the on argument is usually the better choice now (vs setkey).
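For instance, a sketch of the same by-without-by written with on= instead of setkeyv (same objects as above; unmatched combinations still come back as NA):
dt[allPermutationsOfByvars, x, on = by, by = .EACHI]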
I'd like to create a variable in dt according to a lookup table k. I'm getting some unexpected results depending on how I extract the variable of interest in k.
dt <- data.table(x=c(1:10))
setkey(dt, x)
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, x)
dt[,b:=k[.BY, list(b)],by=x]
dt #unexpected results
# x b
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# 6: 6 6
# 7: 7 7
# 8: 8 8
# 9: 9 9
# 10: 10 10
dt <- data.table(x=c(1:10))
setkey(dt, x)
dt[,b:=k[.BY]$b,by=x]
dt #expected results
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
Can anyone explain why this is happening?
You don't have to use by= here at all.
First solution:
Set appropriate keys and use X[Y] syntax from data.table:
require(data.table)
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
k[dt]
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
The OP said that this creates a new data.table, which is undesirable.
Second solution:
Again, without by:
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
# solution
dt[k, b := i.b]
This does not create a new data.table and gives the solution you're expecting.
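With current data.table versions, the same update join can be written without setting keys, using on= (a sketch):
dt[k, b := i.b, on = "x"]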
To explain why the unexpected result happens:
In the first case you do dt[, b := k[.BY, list(b)], by = x]. Here, k[.BY, list(b)] itself returns a data.table. For example:
k[list(x=1), list(b)]
# x b
# 1: 1 a
So, basically, if you would do:
k[list(x=dt$x), list(b)]
That would give you the desired solution as well. As to why you get what you get with b := k[.BY, list(b)]: since the RHS returns a data.table and you're assigning it to a single column, only the first element is used and the rest is dropped. For example, do this:
dt[, c := dt[1], by=x]
# you'll get the whole column to be 1
For the second case, to understand why it works, you'll have to know the subtle difference between accessing a data.table as k[6] and as k[list(6)]:
In the first case, k[6], you are accessing the 6th row of k, which is 10 d. In the second case, you're asking for a join (J): it searches the key column for x == 6 and, since there is no such value in k, it returns 6 NA. In your case, k[.BY] passes a list, so it is a join operation, which fetches the right value.
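A small sketch of that difference (using the k defined above):
k[6]        # positional: the 6th row of k, i.e. x = 10, b = "d"
k[list(6)]  # join on the key: looks for x == 6, finds none, so b is NA
k[.(6)]     # .( ) is just an alias for list( )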
I hope this helps.
I am trying to get many lm models to work in a function, and I need to automatically drop constant columns from my data.table. Thus, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that have two values: a constant and NAs.
My reproducible code:
library(data.table)
df <- data.table(x = c(1,2,3,NA,5), y = c(1,1,NA,NA,NA), z = c(NA,NA,NA,NA,NA),
                 d = c(2,2,2,2,2))
> df
x y z d
1: 1 1 NA 2
2: 2 1 NA 2
3: 3 NA NA 2
4: NA NA NA 2
5: 5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y, which has only one unique value once NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
x y
1: 1 1
2: 2 1
3: 3 NA
4: NA NA
5: 5 NA
As seen, 'y' is still there ...
Any help?
Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)
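If the goal is to drop those columns from df by reference rather than build a new table, a sketch along the same uniqueN lines (const_cols is just an illustrative name):
const_cols <- names(df)[sapply(df, uniqueN, na.rm = TRUE) <= 1]
df[, (const_cols) := NULL]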
There is a simple solution using the base R function Filter:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
# Select only columns whose standard deviation is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.
Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
Here is an option:
df[, which(df[, unlist(sapply(.SD, function(x) length(unique(x[!is.na(x)])) > 1))]),
   with = FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table we count the number of unique non-NA values, and we keep only the columns that have more than one such value.
If you really mean dropping those columns, here is a solution:
library(data.table)
dt <- data.table(x=c(1,2,3,NA,5),
y=c(1,1,NA,NA,NA),
z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
# copy() takes a snapshot of the column names, so removing columns by reference does not disturb the loop
for (col in names(copy(dt))){
  v = var(dt[[col]], na.rm = TRUE)
  # drop the column when its variance is zero (constant) or undefined (all NA)
  if (v == 0 | is.na(v)) dt[, (col) := NULL]
}
Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
Final code:
same <- sapply( df, function(.col){ all( is.na(.col) | .col[1L] == .col ) } )
df1 <- df[,!same, with=F]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5
For removing constant columns with pandas (Python), the same idea applies.
Numeric columns:
constant_col = [const for const in df.columns if df[const].std() == 0]
print(len(constant_col))
print(constant_col)
Categorical columns:
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print(len(constant_col))
print(constant_col)
Then you drop those columns using the drop method.
library(janitor)
df %>%
  remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5