I am trying to get many lm models work in a function and I need to automatically drop constant columns from my data.table. Thus, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that have two values: a constant and NAs.
My reproducible code:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
> df
x y z d
1: 1 1 NA 2
2: 2 1 NA 2
3: 3 NA NA 2
4: NA NA NA 2
5: 5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y that only have one unique value when NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
x y
1: 1 1
2: 2 1
3: 3 NA
4: NA NA
5: 5 NA
As seen, 'y' is still there ...
Any help?
Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)
There is simple solution with function Filter in base r. It will help.
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
# Select only columns for which SD is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.
Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
Here is an option:
df[,which(df[,
unlist(
sapply(.SD,function(x) length(unique(x[!is.na(x)])) >1))]),
with=FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table we count the number of unique values different of NA. We keep only column that have more than one value.
If you really mean DROPing those columns, here is a solution:
library(data.table)
dt <- data.table(x=c(1,2,3,NA,5),
y=c(1,1,NA,NA,NA),
z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
for (col in names(copy(dt))){
v = var(dt[[col]], na.rm = TRUE)
if (v == 0 | is.na(v)) dt[, (col) := NULL]
}
Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
Final code:
same <- sapply( df, function(.col){ all( is.na(.col) | .col[1L] == .col ) } )
df1 <- df[,!same, with=F]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5
For removing constant columns,
Numeric Columns:-
constant_col = [const for const in df.columns if df[const].std() == 0]
print (len(constant_col))
print (constant_col)
Categorical Columns:-
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print (len(constant_col))
print (constant_col)
Then you drop the columns using the drop method
library(janitor)
df %>%
remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Related
I have a integer64 indexed data.table object:
library(data.table)
library(bit64)
some_data = as.integer64(c(1514772184120000026, 1514772184120000068, 1514772184120000042, 1514772184120000078,1514772184120000011, 1514772184120000043, 1514772184120000094, 1514772184120000085,
1514772184120000083, 1514772184120000017, 1514772184120000013, 1514772184120000060, 1514772184120000032, 1514772184120000059, 1514772184120000029))
#
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
Then I have a bunch of other integer64 that I need to binary-search for in the indexes of my original data.table object (x):
search_values <- some_data[(n+1):length(some_data)]
If these where native integers I could use findInterval() to solve the problem:
values_index <- findInterval(search_values, x$new_col)
but when the arguments to findInterval are integer64, I get:
Warning messages:
1: In as.double.integer64(vec) :
integer precision lost while converting to double
2: In as.double.integer64(x) :
integer precision lost while converting to double
and wrong indexes:
> values_index
[1] 10 10 10 10 10
e.g. it is not true that the entries of search_values are all larger than all entries of x$new_col.
Edit:
Desired output:
print(values_index)
9 10 6 10 1
Why?:
value_index has as many entries as search_values. For each entries of search_values, the corresponding entry in value_index gives the rank that entry of search_values would have if it where inserted inside x$new_col. So the first entry of value_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.
Maybe you want something like this:
findInterval2 <- function(y, x) {
toadd <- y[!(y %in% x$new_col)] # search_values that is not in data
x2 <- copy(x)
x2[, i := .I] # mark the original data set
x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
use.names = T, fill = T) # add missing search_values
setkey(x2, new_col) # order
x2[, index := cumsum(!is.na(i))]
x2[match(y, new_col), index]
}
# x2 is:
# a new_col i index
# 1: 0.56602278 1514772184120000011 1 1
# 2: NA 1514772184120000013 NA 1
# 3: 0.29408237 1514772184120000017 2 2
# 4: 0.28532378 1514772184120000026 3 3
# 5: NA 1514772184120000029 NA 3
# 6: NA 1514772184120000032 NA 3
# 7: 0.66844754 1514772184120000042 4 4
# 8: 0.83008829 1514772184120000043 5 5
# 9: NA 1514772184120000059 NA 5
# 10: NA 1514772184120000060 NA 5
# 11: 0.76992760 1514772184120000068 6 6
# 12: 0.57049677 1514772184120000078 7 7
# 13: 0.14406169 1514772184120000083 8 8
# 14: 0.02044602 1514772184120000085 9 9
# 15: 0.68016024 1514772184120000094 10 10
findInterval2(search_values, x)
# [1] 1 5 3 5 3
If not, then maybe you could change the code as needed.
update
look at this integer example to see that this function gives the same result as base findInterval
now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]
r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)
If I get what you want, then a quick workaround could be:
toadd <- search_values[!(search_values %in% x$new_col)] # search_values that is not in data
x[, i := .I] # mark the original data set
x <- rbindlist(list(x, data.table(new_col = toadd)),
use.names = T, fill = T) # add missing search_values
setkey(x, new_col) # order
x[, index := new_col %in% search_values] # mark where the values are
x[, index := cumsum(index)] # get indexes
x <- x[!is.na(i)] # remove added rows
x$index # should contain your desired output
Suppose that we have a data table with missing values (see example below).
library(data.table)
mat <- matrix(rnorm(50), ncol = 5)
mat[c(1,3,5,9,10,11,14,37,38)] <- NA
DT <- as.data.table(mat)
In total, we have 5 unique missing data patterns in our example (see unique(!is.na(DT))).
Suppose now further that we would like to find these patterns and identify them according to their frequency of occurrence (starting with the most frequent pattern indicated by 1).
DTna <- as.data.table(!is.na(DT))
DTna <- DTna[, n := .N, by = names(x = DTna)]
DTna <- DTna[, id := 1:nrow(x = DTna)]
DTna <- DTna[order(n, decreasing = TRUE)]
DTna <- DTna[, m := .GRP, by = eval(names(x = DT))]
Finally, observations with a particular pattern should be subsetted according to a prespecification (here e.g. 1 for the most frequent pattern).
pattern <- 1
i <- DTna[m == pattern, id]
DT[i]
In summary, I need to find observations which share the same missing data pattern and subsequently subset them according to a prespecification (e.g. the most frequent pattern). Please note that I need to subset DT instead of DTna.
Question
So far, the above code works as expected, but is there a more elegant way using data.table?
I would add a grouping column to DT to join and filter on:
DT[, nag := do.call(paste0, lapply(.SD, function(x) +is.na(x)))]
nagDT = DT[, .N, by=nag][order(-N), nagid := .I][, setorder(.SD, nagid)]
# nag N nagid
# 1: 10000 4 1
# 2: 00000 2 2
# 3: 00010 2 3
# 4: 11000 1 4
# 5: 01000 1 5
# subsetting
my_id = 1L
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0]
which gives
V1 V2 V3 V4 V5 nag
1: NA 1.3306093 -2.1030978 0.06115726 -0.2527502 10000
2: NA 0.2852518 -0.1894425 0.86698633 -0.2099998 10000
3: NA -0.1325032 -0.5201166 -0.94392417 0.6515976 10000
4: NA 0.3199076 -1.0152518 -1.61417902 -0.6458374 10000
If you want to omit the new column in the result:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"]
And to also omit the blank columns:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"][,
Filter(function(x) !anyNA(x), .SD)]
An alternative, which is undoubtedly inferior (but nonetheless provided for variety), is
DT[, patCnt := setDT(stack(transpose(DT)))[,
paste(+(is.na(values)), collapse=""), by="ind"][,
patCnt := .N, by=(V1)]$patCnt]
which returns
DT
V1 V2 V3 V4 V5 patCnt
1: NA NA -1.5062011 -0.9846015 0.12153714 1
2: 1.4176784 -0.08078952 -0.8101335 0.6437340 -0.49474613 2
3: NA -0.08410076 -1.1709337 -0.9182901 0.67985806 4
4: 0.2104999 NA -0.1458075 0.8192693 0.05217464 1
5: NA -0.73361504 2.1431392 -1.0041705 0.29198857 4
6: 0.3841267 -0.75943774 0.6931461 -1.3417511 -1.53291515 2
7: -0.8011166 0.26857593 1.1249757 NA -0.57850361 2
8: -1.5518674 0.52004986 1.6505470 NA -0.34061924 2
9: NA 0.83135928 0.9155882 0.1856450 0.31346976 4
10: NA 0.60328545 1.3042894 -0.5835755 -0.17132227 4
Then subset
DT[patCnt == max(patCnt)]
V1 V2 V3 V4 V5 patCnt
1: NA -0.08410076 -1.1709337 -0.9182901 0.6798581 4
2: NA -0.73361504 2.1431392 -1.0041705 0.2919886 4
3: NA 0.83135928 0.9155882 0.1856450 0.3134698 4
4: NA 0.60328545 1.3042894 -0.5835755 -0.1713223 4
I'm attempting to simplify my code below. I would like to cast a list of data frames from long to wide and then add several variables to each nested data frame conditional on variables contained in the dataset. The following code produces my preferred output, I would like help understanding how to reduce the steps and possibly do this all in one lapply command. I've attempted several trials combining "with" statements to no avail.
dflist <- list(data.frame(ID=(c(rep(1,10),rep(2,10))),Y=(c(rep(1,5),rep(2,5),rep(1,5),rep(2,5))), b=rnorm(20),c=rnorm(20)),
data.frame(ID=(c(rep(1,10),rep(2,10))), Y=(c(rep(1,5),rep(2,5),rep(1,5),rep(2,5))),b=rnorm(20),c=rnorm(20)),
data.frame(ID=(c(rep(1,10),rep(2,10))), Y=(c(rep(1,5),rep(2,5),rep(1,5),rep(2,5))),b=rnorm(20),c=rnorm(20)))
wide_data<-lapply(dflist, function(x) dcast(setDT(x), ID ~ Y, value.var=c('b','c'),mean))
b_flag<-lapply(wide_data, function(x) with(x,ifelse((b_1 < .30 | b_2 >.95),"Flag",NA)))
c_flag<-lapply(wide_data, function(x) with(x,ifelse((c_1 < 0) & (c_1 < 0),"Flag",NA)))
wide_data<-Map(cbind, wide_data, b_flag = b_flag)
wide_data<-Map(cbind, wide_data, c_flag = c_flag)
wide_data
I agree with you that 1 lapply would be better:
wide_data <- lapply(dflist, function(x) {
tmp <- dcast(setDT(x), ID ~ Y, value.var=c('b','c'), mean)
tmp$b_flag <- ifelse((tmp$b_1 < .30 | tmp$b_2 >.95) , "Flag", NA)
tmp$c_flag <- ifelse((tmp$c_1 < 0) & (tmp$c_2 < 0), "Flag", NA)
tmp
})
another way of approaching it without using dcast
also, the condition for c column is ambiguous in your question. Please check if it is correct and edit your question.
library('data.table')
df2 <- lapply( dflist, function(x) {
x <- setDT(x)[, .(b = mean(b), c = mean(c)), by = .(ID, Y)]
x[ , `:=` ( b_flag = ifelse( any(b[Y == 1 ] < 0.30, b[Y == 2] > 0.95), "Flag", NA_character_ ),
c_flag = ifelse( all(c < 0), "Flag", NA_character_ ) ),
by = ID ]
return( x )
} )
df2 <- rbindlist(l = df2)
df2
# ID Y b c b_flag c_flag
# 1: 1 1 0.198227292 0.57377712 Flag NA
# 2: 1 2 0.578991810 0.40128112 Flag NA
# 3: 2 1 0.578724225 0.30608932 NA NA
# 4: 2 2 0.619338292 0.35209122 NA NA
# 5: 1 1 0.321089583 -0.83979393 NA NA
# 6: 1 2 -0.341194581 0.52508394 NA NA
# 7: 2 1 0.179836568 -0.02041203 Flag NA
# 8: 2 2 0.482725899 0.17163968 Flag NA
# 9: 1 1 0.003591178 -0.30250232 Flag NA
# 10: 1 2 -0.230479093 0.01971357 Flag NA
# 11: 2 1 -0.038689389 0.35717286 Flag NA
# 12: 2 2 0.691364217 -0.37037455 Flag NA
Let the following data set be given:
library('data.table')
set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y =sample(10))
my.rows <- sample(1:dim(DT)[1], 3)
I want to add a new column to the data set such that, whenever the rows of the data set match the row numbers given by my.rows the entry is populated with, say, true, or false otherwise.
I have got DT[my.rows, z:= "true"], which gives
head(DT)
x y z
1: A 2 NA
2: B 6 NA
3: C 5 true
4: D 8 NA
5: E 9 true
6: F 4 NA
but I do not know how to automatically populate the else condition as well, at the same time. I guess I should make use of some sort of inline ifelse but I am lacking the correct syntax.
We can compare the 'my.rows' with the sequence of row using %in% to create a logical vector and assign (:=) it to create 'z' column.
DT[, z:= 1:.N %in% my.rows ]
Or another option would be to create 'z' as a column of 'FALSE', using 'my.rows' as 'i', we assign the elements in 'z' that correspond to 'i' as 'TRUE'.
DT[, z:= FALSE][my.rows, z:= TRUE]
DT <- cbind(DT,z = ifelse(DT[, .I] %in% my.rows , T, NA))
> DT
# x y z
# 1: A 2 NA
# 2: B 6 NA
# 3: C 5 TRUE
# 4: D 8 NA
# 5: E 9 TRUE
# 6: F 4 NA
# 7: G 1 TRUE
# 8: H 7 NA
# 9: I 10 NA
#10: J 3 NA
When calculating the sum of two data tables, NA+n=NA.
> dt1 <- data.table(Name=c("Joe","Ann"), "1"=c(0,NA), "2"=c(3,NA))
> dt1
Name 1 2
1: Joe 0 3
2: Ann NA NA
> dt2 <- data.table(Name=c("Joe","Ann"), "1"=c(0,NA), "2"=c(2,3))
> dt2
Name 1 2
1: Joe 0 2
2: Ann NA 3
> dtsum <- rbind(dt1, dt2)[, lapply(.SD, sum), by=Name]
> dtsum
Name 1 2
1: Joe 0 5
2: Ann NA NA
I don't want to substitute all NA's with 0. What I want is NA+NA=NA and NA+n=n to get the following result:
Name 1 2
1: Joe 0 5
2: Ann NA 3
How is this done in R?
UPDATE: removed typo in dt1
You can define your own function to act as you want
plus <- function(x) {
if(all(is.na(x))){
c(x[0],NA)} else {
sum(x,na.rm = TRUE)}
}
rbind(dt1, dt2)[,lapply(.SD, plus), by = Name]
dtsum <- rbind(dt1, dt2)[, lapply(.SD, function(x) ifelse(all(is.na(x)), as.numeric(NA), sum(x, na.rm=T))), by=Name]
(includes #Arun's suggestion)
na.rm=TRUE is very useful to remember