Treat NA as zero only when adding a number in R

When summing two data.tables, NA + n yields NA:
> dt1 <- data.table(Name=c("Joe","Ann"), "1"=c(0,NA), "2"=c(3,NA))
> dt1
   Name  1  2
1:  Joe  0  3
2:  Ann NA NA
> dt2 <- data.table(Name=c("Joe","Ann"), "1"=c(0,NA), "2"=c(2,3))
> dt2
   Name  1 2
1:  Joe  0 2
2:  Ann NA 3
> dtsum <- rbind(dt1, dt2)[, lapply(.SD, sum), by=Name]
> dtsum
   Name  1  2
1:  Joe  0  5
2:  Ann NA NA
I don't want to substitute all NAs with 0. What I want is NA+NA=NA and NA+n=n, to get the following result:
   Name  1 2
1:  Joe  0 5
2:  Ann NA 3
How is this done in R?

You can define your own function that behaves the way you want:
plus <- function(x) {
  if (all(is.na(x))) {
    c(x[0], NA)  # an NA of the same type as x
  } else {
    sum(x, na.rm = TRUE)
  }
}
rbind(dt1, dt2)[,lapply(.SD, plus), by = Name]
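Applied to the example data, this yields the desired result:
> rbind(dt1, dt2)[, lapply(.SD, plus), by = Name]
   Name  1 2
1:  Joe  0 5
2:  Ann NA 3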

dtsum <- rbind(dt1, dt2)[, lapply(.SD, function(x) ifelse(all(is.na(x)), as.numeric(NA), sum(x, na.rm = TRUE))), by = Name]
(includes @Arun's suggestion)
na.rm = TRUE is a very useful argument to remember.
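For reference, a quick illustration of what na.rm does in sum(), and why the all(is.na(x)) check is still needed:
sum(c(3, NA))                # NA
sum(c(3, NA), na.rm = TRUE)  # 3
sum(c(NA, NA), na.rm = TRUE) # 0, not NA -- hence the explicit all-NA check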

Related

Find and subset patterns in data table

Suppose that we have a data table with missing values (see example below).
library(data.table)
mat <- matrix(rnorm(50), ncol = 5)
mat[c(1,3,5,9,10,11,14,37,38)] <- NA
DT <- as.data.table(mat)
In total, we have 5 unique missing data patterns in our example (see unique(!is.na(DT))).
Suppose now further that we would like to find these patterns and identify them according to their frequency of occurrence (starting with the most frequent pattern indicated by 1).
DTna <- as.data.table(!is.na(DT))                    # TRUE = observed, FALSE = missing
DTna <- DTna[, n := .N, by = names(x = DTna)]        # frequency of each pattern
DTna <- DTna[, id := 1:nrow(x = DTna)]               # remember the original row numbers
DTna <- DTna[order(n, decreasing = TRUE)]            # most frequent patterns first
DTna <- DTna[, m := .GRP, by = eval(names(x = DT))]  # pattern id: 1 = most frequent
Finally, observations with a particular pattern should be subsetted according to a prespecification (here e.g. 1 for the most frequent pattern).
pattern <- 1
i <- DTna[m == pattern, id]
DT[i]
In summary, I need to find observations which share the same missing data pattern and subsequently subset them according to a prespecification (e.g. the most frequent pattern). Please note that I need to subset DT instead of DTna.
Question
So far, the above code works as expected, but is there a more elegant way using data.table?
I would add a grouping column to DT to join and filter on:
DT[, nag := do.call(paste0, lapply(.SD, function(x) +is.na(x)))]
nagDT = DT[, .N, by=nag][order(-N)][, nagid := .I][]  # rank patterns by frequency
#      nag N nagid
# 1: 10000 4     1
# 2: 00000 2     2
# 3: 00010 2     3
# 4: 11000 1     4
# 5: 01000 1     5
# subsetting
my_id = 1L
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0]
which gives
V1 V2 V3 V4 V5 nag
1: NA 1.3306093 -2.1030978 0.06115726 -0.2527502 10000
2: NA 0.2852518 -0.1894425 0.86698633 -0.2099998 10000
3: NA -0.1325032 -0.5201166 -0.94392417 0.6515976 10000
4: NA 0.3199076 -1.0152518 -1.61417902 -0.6458374 10000
If you want to omit the new column in the result:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"]
And to also omit the blank columns:
DT[nagDT[nagid == my_id, nag], on=.(nag), nomatch=0, !"nag"][,
   Filter(function(x) !anyNA(x), .SD)]
An alternative, which is undoubtedly inferior (but nonetheless provided for variety), is
DT[, patCnt := setDT(stack(transpose(DT)))[,
       paste(+(is.na(values)), collapse=""), by="ind"][,
       patCnt := .N, by=(V1)]$patCnt]
which returns
DT
V1 V2 V3 V4 V5 patCnt
1: NA NA -1.5062011 -0.9846015 0.12153714 1
2: 1.4176784 -0.08078952 -0.8101335 0.6437340 -0.49474613 2
3: NA -0.08410076 -1.1709337 -0.9182901 0.67985806 4
4: 0.2104999 NA -0.1458075 0.8192693 0.05217464 1
5: NA -0.73361504 2.1431392 -1.0041705 0.29198857 4
6: 0.3841267 -0.75943774 0.6931461 -1.3417511 -1.53291515 2
7: -0.8011166 0.26857593 1.1249757 NA -0.57850361 2
8: -1.5518674 0.52004986 1.6505470 NA -0.34061924 2
9: NA 0.83135928 0.9155882 0.1856450 0.31346976 4
10: NA 0.60328545 1.3042894 -0.5835755 -0.17132227 4
Then subset
DT[patCnt == max(patCnt)]
V1 V2 V3 V4 V5 patCnt
1: NA -0.08410076 -1.1709337 -0.9182901 0.6798581 4
2: NA -0.73361504 2.1431392 -1.0041705 0.2919886 4
3: NA 0.83135928 0.9155882 0.1856450 0.3134698 4
4: NA 0.60328545 1.3042894 -0.5835755 -0.1713223 4
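For what it's worth, the pattern-string idea from the first answer also gives a shorter route to the same patCnt column (a sketch; pat is a hypothetical helper column):
# build the NA-pattern string per row, then count the rows sharing each pattern
DT[, pat := do.call(paste0, lapply(.SD, function(x) +is.na(x))), .SDcols = paste0("V", 1:5)]
DT[, patCnt := .N, by = pat]
DT[patCnt == max(patCnt)]  # rows with the most frequent pattern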

Filtering data.table based on values in group of variables

I ran into a problem finding the best syntax to filter a data.table: I want to keep all rows where at least one variable in a group of variables is non-missing.
An example is below:
> dat <- data.table(a=1:5, b=c(1:3, NA, NA), c=c(NA, 1:3, NA))
> cols <- c('b', 'c')
> dat[!all(is.na(cols)), .SD, with=FALSE]
Null data.table (0 rows and 0 cols)
> dat[!is.na(b)|!is.na(c), .SD]
a b c
1: 1 1 NA
2: 2 2 1
3: 3 3 2
4: 4 NA 3
As you can see, spelling out each variable explicitly as !is.na(variable1) | !is.na(variable2) works, but I can't find a way to pass a group of variables so the whole thing is a single condition rather than everything concatenated with |.
You can use the following syntax with rowSums and .SD:
dat[dat[, rowSums(!is.na(.SD)) > 0, .SDcols = cols]]
# a b c
#1: 1 1 NA
#2: 2 2 1
#3: 3 3 2
#4: 4 NA 3
The inner part creates a logical value that looks like this:
dat[, rowSums(!is.na(.SD)) > 0, .SDcols = cols]
# [1] TRUE TRUE TRUE TRUE FALSE
Re the comment by Michael, you can also use Reduce + lapply:
dat[dat[, Reduce("+", lapply(.SD, function(x) !is.na(x))) > 0, .SDcols = cols]]
But for most of my use cases, the rowSums approach is ok and easier to read, imo.
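A purely logical variant of the same Reduce idea (a sketch), if you prefer to avoid arithmetic on logicals:
dat[dat[, Reduce("|", lapply(.SD, Negate(is.na))), .SDcols = cols]]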
You could also do this.
dat[rowSums(!is.na(dat[, cols, with=FALSE])) > 0,]
#   a  b  c
#1: 1  1 NA
#2: 2  2  1
#3: 3  3  2
#4: 4 NA  3

Unexpected result from data.table when looking up values in another table

I am trying to check if a value from a data table is present in another data table. However, I do not get a correct output:
> dt1 <- data.table(x=c(8,5,3), y=rnorm(3))
> dt2 <- data.table(a=c(1,2,3,4,5), b=rnorm(5))
> setkey(dt1,x)
> setkey(dt2,a)
>
> dt1
x y
1: 3 0.84929113
2: 5 1.33433818
3: 8 0.04170333
> dt2
a b
1: 1 2.00634915
2: 2 -1.53137195
3: 3 -1.49436741
4: 4 -1.66878993
5: 5 -0.06394713
>
> dt1[,is_present_in_dt2:=nrow(dt2[x, nomatch=0L])]
> dt1
x y is_present_in_dt2
1: 3 0.84929113 3
2: 5 1.33433818 3
3: 8 0.04170333 3
Expected result:
x y is_present_in_dt2
1: 3 0.84929113 1
2: 5 1.33433818 1
3: 8 0.04170333 0
I think this is actually more straightforward than you think. Think of it as subsetting dt1 with dt2 in the i statement.
dt1 <- data.table(x=c(8,5,3), y=rnorm(3))
dt2 <- data.table(a=c(1,2,3,4,5), b=rnorm(5))
setkey(dt1,x)
setkey(dt2,a)
dt1[dt2, presnt := 1] #Where they merge, make it a 1
dt1[!dt2, presnt := 0] #Where they don't merge, make it a 0
And the result:
x y presnt
1: 3 -0.6938894 1
2: 5 0.4891611 1
3: 8 -1.8227498 0
And another way to think of it:
overlap <- intersect(dt1$x,dt2$a)
dt1[x %in% overlap, present := 1]
dt1[!(x %in% overlap), present := 0]
The first way is much faster; the second may help in understanding the first.
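If you only want the 0/1 flag in a single step, a compact variant of the second approach (a sketch):
dt1[, is_present_in_dt2 := as.integer(x %in% dt2$a)]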

data.table: create new columns with lapply

I have a data.table and want to apply a function to each group of rows.
Normally one would do it as follows: DT[, lapply(.SD, function), by = y]
But in my case the function does not return a single atomic value but a longer vector (here, a table of counts).
Is there a chance to do something like this?
library(data.table)
set.seed(9)
DT <- data.table(x1=letters[sample(x=2L, size=6, replace=TRUE)],
                 x2=letters[sample(x=2L, size=6, replace=TRUE)],
                 y=rep(1:2, 3), key="y")
DT
# x1 x2 y
#1: a a 1
#2: a b 1
#3: a a 1
#4: a a 2
#5: a b 2
#6: a a 2
DT[, lapply(.SD, table), by = y]
# Desired Result, something like this:
# x1_a x2_a x2_b
# 3 2 1
# 3 2 1
Thanks in advance. Also: I would not mind if the result of the function had to have a fixed length.
You simply need to unlist the table and then coerce back to a list:
> DTCounts <- DT[, as.list(unlist(lapply(.SD, table))), by=y]
> DTCounts
y x1.a x2.a x2.b
1: 1 3 2 1
2: 2 3 2 1
If you do not like the dots in the names, you can sub them out:
> setnames(DTCounts, sub("\\.", "_", names(DTCounts)))
> DTCounts
y x1_a x2_a x2_b
1: 1 3 2 1
2: 2 3 2 1
Note that if not all values in a column are present for each group
(i.e., if x2=c("a", "b") when y=1, but x2=c("b", "b") when y=2),
then the above breaks.
The solution is to make the columns factors before counting.
DT[, lapply(.SD, is.factor)]  # check which columns are already factors
## convert the non-grouping columns to factors
columnsToConvert <- c("x1", "x2")  # or: setdiff(names(DT), "y")
DT <- cbind(DT[, lapply(.SD, factor), .SDcols=columnsToConvert], y=DT[, y])
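A quick illustration of why the factor conversion matters (a sketch; DT2 is hypothetical example data): with x2 as a factor, table() reports a count for every level in every group, so each group yields the same set of columns even when a level is absent.
DT2 <- data.table(x1 = c("a","a","a","a"),
                  x2 = factor(c("a","b","b","b"), levels = c("a","b")),
                  y  = c(1,1,2,2), key = "y")
DT2[, as.list(unlist(lapply(.SD, table))), by = y]
#    y x1.a x2.a x2.b
# 1: 1    2    1    1
# 2: 2    2    0    2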

Remove constant columns with or without NAs

I am trying to get many lm models to work in a function, and I need to automatically drop constant columns from my data.table. Thus, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that have two values: a constant and NAs.
My reproducible code:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA), z=c(NA,NA,NA,NA,NA),
                 d=c(2,2,2,2,2))
> df
x y z d
1: 1 1 NA 2
2: 2 1 NA 2
3: 3 NA NA 2
4: NA NA NA 2
5: 5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y, which has only one unique value once NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
x y
1: 1 1
2: 2 1
3: 3 NA
4: NA NA
5: 5 NA
As seen, 'y' is still there ...
Any help?
Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)
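If you would rather drop the constant columns by reference than build a new table, a sketch using the same uniqueN test:
drop <- names(df)[sapply(df, uniqueN, na.rm = TRUE) < 2]
df[, (drop) := NULL]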
There is also a simple solution using the base R function Filter:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA), z=c(NA,NA,NA,NA,NA),
                 d=c(2,2,2,2,2))
# Keep only the columns whose standard deviation is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.
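To see why na.rm matters here, and how the all-NA column z still gets dropped:
sd(c(1, 1, NA, NA, NA))               # NA -- NAs propagate without na.rm
sd(c(1, 1, NA, NA, NA), na.rm = TRUE) # 0  -- y is correctly seen as constant
sd(c(NA, NA), na.rm = TRUE)           # NA -- Filter discards NA results along with FALSE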
Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
Here is an option:
df[, which(df[, unlist(sapply(.SD, function(x) length(unique(x[!is.na(x)])) > 1))]),
   with = FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table we count the number of unique non-NA values, and we keep only the columns that have more than one.
If you really mean dropping those columns (removing them by reference), here is a solution:
library(data.table)
dt <- data.table(x=c(1,2,3,NA,5),
                 y=c(1,1,NA,NA,NA),
                 z=c(NA,NA,NA,NA,NA),
                 d=c(2,2,2,2,2))
for (col in names(copy(dt))) {
  v = var(dt[[col]], na.rm = TRUE)
  if (v == 0 | is.na(v)) dt[, (col) := NULL]  # zero variance or all NA: drop by reference
}
Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
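The difference: || short-circuits and returns a single value (possibly NA), while | works elementwise before all() collapses the result. On column y (a quick check):
y <- c(1, 1, NA, NA, NA)
all(is.na(y)) || all(y[1L] == y)  # FALSE || NA -> NA, so y is never flagged
all(is.na(y) | y[1L] == y)        # all(TRUE TRUE TRUE TRUE TRUE) -> TRUE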
Final code:
same <- sapply(df, function(.col) all(is.na(.col) | .col[1L] == .col))
df1 <- df[, !same, with = FALSE]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5
If you are in Python with a pandas DataFrame instead, the same idea looks like this.
Numeric columns:
constant_col = [const for const in df.columns if df[const].std() == 0]
print(len(constant_col))
print(constant_col)
Categorical columns:
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print(len(constant_col))
print(constant_col)
Then you drop the columns using the drop method: df = df.drop(columns=constant_col).
Using the janitor package:
library(janitor)
df %>%
  remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5
