I have a data.table with a column that has NAs. I want to drop rows where that column takes a particular value (which happens to be ""). However, my first attempt led me to lose rows with NAs as well:
> a = c(1,"",NA)
> x <- data.table(a);x
a
1: 1
2:
3: NA
> y <- x[a!=""];y
a
1: 1
After looking at ?`!=`, I found a one-liner that works, but it's a pain:
> z <- x[!sapply(a,function(x)identical(x,""))]; z
a
1: 1
2: NA
I'm wondering if there's a better way to do this? Also, I see no good way of extending this to excluding multiple non-NA values. Here's a bad way:
> drop_these <- function(these,where){
+ argh <- !sapply(where,
+ function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+ )
+ if (is.matrix(argh)){argh <- apply(argh,2,all)}
+ return(argh)
+ }
> x[drop_these("",a)]
a
1: 1
2: NA
> x[drop_these(c(1,""),a)]
a
1: NA
I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:
> w <- data.frame(a,stringsAsFactors=F); w
a
1 1
2
3 <NA>
> d <- w[a!="",,drop=F]; d
a
1 1
NA <NA>
To provide a solution to your question:
You should use %in%. It gives you back a logical vector.
a %in% ""
# [1] FALSE TRUE FALSE
x[!a %in% ""]
# a
# 1: 1
# 2: NA
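Since %in% does set matching, it extends naturally to dropping several non-NA values at once (the extension asked about above); a quick sketch with the example data:
x[!a %in% c(1, "")]
#     a
# 1: NA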
To find out why this is happening in data.table (as opposed to data.frame):
If you look at the data.table source code in the file data.table.R, under the function "[.data.table", there's a set of if statements that check the i argument. One of them is:
if (!missing(i)) {
# Part (1)
isub = substitute(i)
# Part (2)
if (is.call(isub) && isub[[1L]] == as.name("!")) {
notjoin = TRUE
if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
nomatch = 0L
isub = isub[[2L]]
}
.....
# "isub" is being evaluated using "eval" to result in a logical vector
# Part 3
if (is.logical(i)) {
# see DT[NA] thread re recycling of NA logical
if (identical(i,NA)) i = NA_integer_
# avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
else i[is.na(i)] = FALSE
}
....
}
To explain the discrepancy, I've pasted the important piece of code above and marked it into 3 parts.
First, why doesn't dt[a != ""] work as expected (by the OP)?
Part 1 evaluates to an object of class call. The second part of the if statement in part 2 returns FALSE. Following that, the call is "evaluated" to give c(TRUE, FALSE, NA). Then part 3 is executed, and NA is replaced with FALSE (the last line of the is.logical block).
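You can mimic that outside data.table (a plain-R sketch of the effect, not the actual internals):
i <- eval(quote(a != ""), x)  # what isub evaluates to: c(TRUE, FALSE, NA)
i[is.na(i)] <- FALSE          # part 3: NA becomes FALSE -> c(TRUE, FALSE, FALSE)
x[i]                          # only row 1 survives; the NA row is silently dropped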
Second, why does x[!(a == "")] work as expected (by the OP)?
Part 1 returns a call once again. But part 2 evaluates to TRUE and therefore sets:
1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)
That is where the magic happens: the negation has been stripped for now. And remember, this is still an object of class call, so it gets evaluated (using eval) to a logical again: (a == "") evaluates to c(FALSE, TRUE, NA).
Now, this is checked with is.logical in part 3, so NA gets replaced with FALSE and it becomes c(FALSE, TRUE, FALSE). At some point later, which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2] = c(1,3) is returned. So x[!(a == "")] basically returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:
if (notjoin) {
if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
irows = irows[irows!=0L]
# WHERE MAGIC HAPPENS (returns c(1,3))
i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL # NULL meaning all rows i.e. seq_len(nrow(x))
# Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
# column when irows contains negatives.
}
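To see the arithmetic of that step concretely, here's a plain-R reproduction (not the internals themselves):
irows <- which(c(FALSE, TRUE, FALSE))  # rows where (a == "") is TRUE: 2
seq_len(3)[-irows]                     # notjoin inverts the match: c(1, 3)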
Given that, I think there are some inconsistencies with the syntax. If I manage to get time to formulate the problem, I'll write a post soon.
Background answer from Matthew:
The behaviour with != on NA as highlighted by this question wasn't intended, thinking about it. The original intention was indeed to be different than [.data.frame w.r.t. == and NA, and I believe everyone is happy with that. For example, FAQ 2.17 has:
DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) &
ColA==ColB,]
That convenience is achieved by dint of:
DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE)]
returns NA rows for each NA
The motivation is not just convenience but speed, since each and every !, is.na, & and == are themselves vector scans with associated memory allocation of each of their results (explained in the intro vignette). So although x[is.na(a) | a!=""] is a working solution, it's exactly the type of logic I was trying to avoid needing in data.table. x[!a %in% ""] is slightly better; i.e., 2 scans (%in% and !) rather than 3 (is.na, | and !=). But really x[a != ""] should do what Frank expected (include NA) in a single scan.
New feature request filed which links back to this question:
DT[col!=""] should include NA
Thanks to Frank, Eddi and Arun. If I haven't understood correctly feel free to correct, otherwise the change will get made eventually. It will need to be done in a way that considers compound expressions; e.g., DT[colA=="foo" & colB!="bar"] should exclude rows with NA in colA but include rows where colA is non-NA but colB is NA. Similarly, DT[colA!=colB] should include rows where either colA or colB is NA but not both. And perhaps DT[colA==colB] should include rows where both colA and colB are NA (which it doesn't currently, I believe).
As you have already figured out, this is the reason:
a != ""
#[1] TRUE NA FALSE
You can do what you figured out already, i.e. x[is.na(a) | a != ""], or you could setkey on a and do the following:
setkey(x, a)
x[!J("")]
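With the example data, this keeps the NA and "1" rows; note that setkey reorders the table (NA sorts first in a key), so the result would print roughly as:
#     a
# 1: NA
# 2:  1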
Related
For example, I have a dataframe that has nothing inside, but I need it to run through the full code because it usually expects there to be data. I tried this but it did not work:
ifelse(dim(df_empty)[1]==0,rbind(Shots1B_empty,NA))
Maybe something like this:
df_empty <- data.frame(x=integer(0), y = numeric(0), a = character(0))
if(nrow(df_empty) == 0){
df_empty <- rbind(df_empty, data.frame(x=NA, y=NA, a=NA))
}
df_empty
# x y a
#1 NA NA NA
Simple question, OP, but actually pretty interesting. All the elements of your code should work, but the issue is that when you run it as is, it will return a list, not a data frame. Let me show you with an example:
growing_df <- data.frame(
A=rep(1, 3),
B=1:3,
c=LETTERS[4:6])
df_empty <- data.frame()
If we evaluate as you have written you get:
df <- ifelse(dim(df_empty)[1]==0, rbind(growing_df, NA))
with df resulting in a List:
> class(df)
[1] "list"
> df
[[1]]
[1] 1 1 1 NA
The code "worked", but the resulting class of df is wrong. It's odd because this works:
> rbind(growing_df, NA)
A B c
1 1 1 D
2 1 2 E
3 1 3 F
4 NA NA <NA>
The answer is to use if and else rather than ifelse(), just as @akrun noted in their answer. The reason is found if you dig into the documentation of ifelse():
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
Since dim(df_empty)[1] (equivalently nrow(df_empty)) has length 1, ifelse() returns a length-1 result shaped like the test, and the first element it extracts from the data frame returned by rbind() is a list. That's why if {} works but ifelse() doesn't here. rbind() normally results in a data frame, but the class of the result stored into df when assigning with ifelse() is decided based on the test element, not the resulting element. Compare that to if {} statements, whose result is decided by whatever expression is inside {}.
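A minimal illustration of that difference, using a small made-up data frame:
df <- data.frame(x = 1:2)
class(ifelse(TRUE, df, NA))  # "list" -- ifelse() returns yes[1], the first column wrapped in a list
class(if (TRUE) df else NA)  # "data.frame" -- if/else returns the object itself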
We may need if/else instead of ifelse - ifelse requires all arguments to be of the same length, which will obviously not be the case when we rbind:
Shots1B_empty <- if(nrow(df_empty) == 0) rbind(Shots1B_empty, NA)
I have 2 dataframes below,
col1_x <- c(0123,123,234,4567,77789,4578,45588,669887,7887,5547)
col2_x <- c('X1','X8','X2','X55','C12','B11','Z1','SS12','D9','F55')
a <- c(10,9,8,7,6,5,4,3,2,1)
DF1 <- cbind(col1_x,col2_x,a)
DF1 <- as.data.frame(DF1, stringsAsFactors = F)
col1_y <- c(012,123,56,55,78,5547)
col2_y <- c('X1','X8','S2','ER4','KL1','F55')
b <- c(111,222,NA,NA,555,666)
DF2 <- cbind(col1_y,col2_y,b)
DF2 <- as.data.frame(DF2, stringsAsFactors = F)
Below is the code I wrote for this.
# code1
for (i in 1:nrow(DF2)) {
if(is.na(DF2$b[i])) {} else {
DF1 <-mutate(DF1,
a = ifelse(col1_x == DF2$col1_y[i] & col2_x == DF2$col2_y[i],
DF2$b[i],a) )
}
}
# code2
if(is.na(DF2$b)) {} else {
DF1$a <- ifelse(DF1$col1_x == DF2$col1_y & DF1$col2_x == DF2$col2_y, DF2$b, DF1$a)
}
I get the warnings below when I run code2:
Warning messages:
1: In if (is.na(Y$b)) { :
the condition has length > 1 and only the first element will be used
2: In X$col1 == Y$col1 :
longer object length is not a multiple of shorter object length
3: In X$col2 == Y$col2 :
longer object length is not a multiple of shorter object length
Kindly help me fix this without using a for loop, as the iterations take a lot of time.
Note: code1 satisfies my requirement
This accomplishes what your code1 does, without the warnings:
library(dplyr)
left_join(DF1, DF2, by = c("col1_x" = "col1_y", "col2_x" = "col2_y")) %>%
mutate(a = coalesce(b, a)) %>%
select(-b)
# col1_x col2_x a
# 1 123 X1 10
# 2 123 X8 222
# 3 234 X2 8
# 4 4567 X55 7
# 5 77789 C12 6
# 6 4578 B11 5
# 7 45588 Z1 4
# 8 669887 SS12 3
# 9 7887 D9 2
# 10 5547 F55 666
If I have interpreted the results you need correctly, then this is far faster, more efficient, and safer than any implementation with for loops and base::ifelse (which can be problematic on its own).
To learn more about merges and joins like this, see How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272. Really, part of data-science-y tasks is knowing how to deal with data consistently, safely, quickly, efficiently, and ... safely. Yes, I said it twice. If there is anything in your code that might, just might, confuse one observation with another, all of your results and inferences are at-best questionable if not completely corrupted. (I'll get off my </soapbox> now.)
As for your warnings:
condition has length > 1 ....
if statements require a length-1 conditional, period. Not length 0, not length 2 or more. Length 1. Since your Y frame (actually DF2 now) has more than 1 row, this is broken.
Think of it this way: if (true) then do task 1 makes sense. if (true, false, false, true, true, true, true, false) then do task 1 does not make sense. What should happen?
One of two things is needed here:
You need if, so you should be looking at one of these (see the sketch after this list):
any(is.na(Y$b));
all(is.na(Y$b)); or
a specific one of them, such as is.na(Y$b[17]) (if there were at least 17 of them)
You need ifelse, which would work on a vector of logicals. (I don't think it's this one.)
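For instance, option 1 with the b column from DF2 in the question (each result is length 1, so it is safe inside if()):
b <- c(111, 222, NA, NA, 555, 666)  # DF2$b
any(is.na(b))  # TRUE
all(is.na(b))  # FALSE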
longer object length is not a multiple of shorter object length
This seems clear, but you don't understand why it's happening.
Consider these questions:
c(1,2) == c(1,2) is really asking c(1==1, 2==2), right? Good.
c(1,2) == 1 is really asking c(1==1, 1==2). Good.
(Neither of those would go in an if statement, btw :-)
c(1,2) == c(1,2,3,4) is confusingly not an error in R due to argument recycling. I really think it should be an error, because much of the time it is used or relied on, it is a mistake and the results are corrupted/incorrect. However, this really produces c(1==1, 2==2, 1==3, 2==4). Yup, recycling. And while it raises no warning/error, it might occasionally be useful but is more often a silent mistake. This silent recycling only works, though, when the length of one vector is a perfect multiple of the length of the other vector.
c(1,2,9) == c(1,2,3,4,5) will try to recycle as c(1==1, 2==2, 9==3, 1==4, 2==5) (and will give results for that), but ... doesn't that seem just a bit odd to you? Well, it might be okay to you, and while there might be legitimate uses of this type of recycling, it is more often than not (in my experience) a mistake in code. If you really mean this and you really know that this type of arbitrary comparison is what you want, then wrap it in suppressWarnings and don't come to me when your data results are seemingly inconsistent with the inputs.
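Running that comparison in a clean session reproduces the question's warning exactly:
c(1, 2, 9) == c(1, 2, 3, 4, 5)
# Warning message:
# In c(1, 2, 9) == c(1, 2, 3, 4, 5) :
#   longer object length is not a multiple of shorter object length
# [1]  TRUE  TRUE FALSE FALSE FALSE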
More often than not, when questions like this pop up, instead of == people should be thinking "set operations", where they need %in%. Now, think of this:
c(1,2,9) %in% c(1,2,3,4,5) yields c(TRUE, TRUE, FALSE). (Length 3, not length 5.) You're asking c("is 1 in 1:5?", "is 2 in 1:5?", "is 9 in 1:5?").
I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
dat %>%
filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
select_if(is.numeric)
}
you should probably give it a better name though
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns something of zero length; your inequality then doesn't return TRUE or FALSE, but rather logical(0). The error is telling you that the if statement cannot evaluate whether logical(0) >= 1.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") > 1) {print("no dice")}
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column: the next loop iteration (when i = 6) will be handling what was the 7th column! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
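One base-R way to sidestep that pitfall is to build the logical masks first and subset once at the end (a sketch, assuming x is the data frame passed to RawData):
keep_rows <- !apply(x, 1, function(r) any(grepl("\\.DERIVED", r)))  # rows with no .DERIVED cell
keep_cols <- sapply(x, is.numeric)                                  # numeric columns only
x[keep_rows, keep_cols, drop = FALSE]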
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the regex version "\\.DERIVED" and not ".DERIVED", which would mean "any character followed by DERIVED".
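A quick demonstration of the difference:
grepl(".DERIVED", "UNDERIVED")   # TRUE  -- the unescaped '.' matches the 'N'
grepl("\\.DERIVED", "UNDERIVED") # FALSE -- '\\.' matches only a literal dot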
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric, because the
# example code provided assumed that the column class was numeric. It will not
# detect columns of character values that happen to contain only numbers.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
I'm trying to write a function that turns empty strings into NA. A summary of one of my columns looks like this:
      a   b
 12 210 468
I'd like to change the 12 empty values to NA. I also have a few other factor columns for which I'd like to change empty values to NA, so I borrowed some stuff from here and there to come up with this:
# change nulls to NAs
nullToNA <- function(df){
# split df into numeric & non-numeric columns
a<-df[,sapply(df, is.numeric), drop = FALSE]
b<-df[,sapply(df, Negate(is.numeric)), drop = FALSE]
# Change empty strings to NA
b<-b[lapply(b,function(x) levels(x) <- c(levels(x), NA) ),] # add NA level
b<-b[lapply(b,function(x) x[x=="",]<- NA),] # change Null to NA
# Put the columns back together
d<-cbind(a,b)
d[, names(df)]
}
However, I'm getting this error:
> foo<-nullToNA(bar)
Error in x[x == "", ] <- NA : incorrect number of subscripts on matrix
Called from: FUN(X[[i]], ...)
I have tried the answer found here: Replace all 0 values to NA but it changes all my columns to numeric values.
You can directly index fields that match a logical criterion. So you can just write:
df[is_empty(df)] = NA
Where is_empty is your comparison, e.g. df == "":
df[df == ""] = NA
But note that is.null(df) won’t work, and would be weird anyway¹. I would advise against merging the logic for columns of different types, though! Instead, handle them separately.
¹ You’ll almost never encounter NULL inside a table, since that only works if the underlying vector is a list. You can create matrices and data.frames with this constraint, but then is.null(df) will never be TRUE because the NULL values are wrapped inside the list.
This worked for me
df[df == 'NULL'] <- NA
How about just:
df[apply(df, 2, function(x) x=="")] = NA
Works fine for me, at least on simple examples.
This is the function I used to solve this issue.
null_na <- function(vector){
  new_vector <- rep(NA, length(vector))
  for(i in seq_along(vector)){
    # test is.na() first: if(vector[i] == "") errors when vector[i] is NA
    if(is.na(vector[i])) new_vector[i] <- NA
    else if(vector[i] == "") new_vector[i] <- NA
    else new_vector[i] <- vector[i]
  }
  return(new_vector)
}
Just plug in the column or vector you are having an issue with.
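For example, on a single character column (bar is the question's data frame; mycol is a placeholder column name):
bar$mycol <- null_na(bar$mycol)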
I am trying to use data.table where my j function could and will return a different number of columns on each call. I would like it to behave like rbind.fill in that it fills any missing columns with NA.
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
In this case 'result' may end up with two columns: A and B. 'A' and 'B' were returned as part of the first call to 'fetch', and only 'B' was returned as part of the second. I would like the example code to return this result:
id A B
1 1 a b
2 2 <NA> b
Unfortunately, when run, I get this error:
Error in `[.data.table`(data, , fetch(.BY, .SD), by = id) :
j doesn't evaluate to the same number of columns for each group
I can do this with plyr as follows, but in my real world use case plyr is running out of memory. Each call to fetch occurs rather quickly, but the memory crash occurs when plyr tries to merge all of the data back together. I am trying to see if data.table might solve this problem for me.
result <- ddply(data, "id", fetch)
Any thoughts appreciated.
DWin's approach is good. Or you could return a list column instead, where each cell is itself a vector. That's generally a better way of handling variable-length vectors.
DT = data.table(A=rep(1:3,1:3),B=1:6)
DT
A B
1: 1 1
2: 2 2
3: 2 3
4: 3 4
5: 3 5
6: 3 6
ans = DT[, list(list(B)), by=A]
ans
A V1
1: 1 1
2: 2 2,3 # V1 is a list column. These aren't strings, the
3: 3 4,5,6 # vectors just display with commas
ans$V1[3]
[[1]]
[1] 4 5 6
ans$V1[[3]]
[1] 4 5 6
ans[,sapply(V1,length)]
[1] 1 2 3
So in your example you could use this as follows:
library(plyr)
rbind.fill(data[, list(list(fetch(.BY))), by = id]$V1)
# A B
#1 a b
#2 <NA> b
Or, just make the returned list conformant:
allcols = c("A","B")
fetch <- function(by) {
if(by == 1)
list(A=c("a"), B=c("b"))[allcols]
else
list(B=c("b"))[allcols]
}
Here are two approaches. The first roughly follows your strategy:
data[,list(A=if(.BY==1) 'a' else NA_character_,B='b'), by=id]
And the second does things in two steps:
DT <- copy(data)[,`:=`(A=NA_character_,B='b')][id==1,A:='a']
Using a by just to check for a single value seems wasteful (maybe computationally, but also in terms of clarity); of course, it could be that your application isn't really like that.
Try
data.table(A=NA, B=c("b"))
@NickAllen: I'm not sure from the comments whether you understood my suggestion. (I was posting from a mobile phone that limited my cut-paste capabilities, and I suspect my wife was telling me to stop texting to SO or she would divorce me.) What I meant was this:
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(A=NA, B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]