Check if any value in a data.frame column is null

I am trying to check whether a data.frame column has any null values so that I can skip to the next iteration of a loop. I am currently using the code below:
if (is.na(df[,relevant_column]) == TRUE ){next}
which spits out the warning:
In if (is.na(df_cell_client[, numerator]) == TRUE) { ... : the
condition has length > 1 and only the first element will be used
How do I check if any of the values are null and not just the first row?

(I assume by "null" you really mean NA, since a data.frame cannot contain NULL in that sense.)
Your problem is that if expects a single logical value, but is.na(df[,relevant_column]) returns a vector of logicals. any() reduces a vector of logicals to a single value, the "or" of all its elements.
Try:
if (any(is.na(df[,relevant_column]))) {next}
BTW: == TRUE is unnecessary. Keep it if you feel you want the clarity in your code, but I think you'll find most R code does not use that. (I've also seen something == FALSE, equally "odd/wrong", where ! something should work ... but I digress.)
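As a minimal sketch of the loop pattern, assuming a made-up data frame df and a loop over its column names (both hypothetical, not from the question), the check could look like this. anyNA() is a base R shortcut for any(is.na(x)).

```r
# Hypothetical data: column "b" contains a missing value, "a" and "c" do not.
df <- data.frame(a = c(1, 2, 3), b = c(1, NA, 3), c = c(4, 5, 6))

processed <- character(0)
for (relevant_column in names(df)) {
  # any() collapses the logical vector from is.na() into a single TRUE/FALSE
  if (any(is.na(df[, relevant_column]))) next
  processed <- c(processed, relevant_column)
}
processed  # names of the columns that were not skipped

# anyNA() is a base R shortcut for any(is.na(x))
anyNA(df[, "b"])
```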

Related

NA values as conditions on a IF statement in R

My goal is to categorize the rows on my dataset depending on the values of two different dates.
if(!exists(MY_DATA$Date_1) & exists(MY_DATA$Date_2)) {
MY_DATA$NEW_COL <- c("Category_1")
} else {
MY_DATA$NEW_COL <- c("Category_2")
}
But it isn't working, I'm currently trying a simplified version as follows:
if(!exists(MY_DATA$Date_1)){
MY_DATA$NEW_COL <- c("Category_1")
}
However, it seems that this only reads the value on the first row, and it either gives me a column with all values as Category_1 or no column at all.
Also I have tried this with is.na(), is.null() and exists().
"However, it seems that this only reads the value on the first row, and it either gives me a column with all values as Category_1 or no column at all."
This is because an if statement requires a condition of length 1. When given a longer vector, it uses only the first element to decide between TRUE and FALSE (R 4.2.0 and later raise an error instead).
The ifelse function accepts a vector as its test and returns a vector of the same length, taking each element from the yes or the no argument depending on whether the corresponding test element is TRUE or FALSE. It may be suitable for your needs.
To rephrase what was originally a comment by @r2evans: exists() checks whether a variable is already defined in the R environment. It takes a character vector of length 1 as its argument; given a longer vector, it checks only the first element.
a = 1
b = 1
exists("a")
[1] TRUE
exists(c("a", "b"))
[1] TRUE
exists(c("ab", "a", "b"))
[1] FALSE
However it's worth noting that exists() does not check if a value is inside a vector. If you are trying to check if a value is in a vector, you'll want operator %in% instead.
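The difference can be sketched as follows, assuming a hypothetical vector x; the variable name not_defined_anywhere_xyz is likewise made up and assumed not to exist in your session.

```r
x <- c(10, 20, 30)  # hypothetical vector

# exists() asks: is there a variable with this name?
exists("x")
exists("not_defined_anywhere_xyz")

# %in% asks: is this value an element of the vector?
20 %in% x
c(20, 99) %in% x  # vectorised over the left-hand side
```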
The solution will largely depend on your precise implementations.
p.s. This was originally intended as a comment, but it was too long for one.
Thanks everyone for your support, ifelse did the trick.
The following worked for me:
MY_DATA$NEW_COL <- c("Category_2")
MY_DATA$NEW_COL <- ifelse(!is.na(MY_DATA$Date_1),"Category_1","Category_2")
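For the original two-date condition, a sketch along the same lines might look like the following. It assumes that "does not exist" in the question means the date is NA, and the sample dates are invented for illustration. Note the vectorised & rather than &&, since the test covers every row.

```r
# Hypothetical sample data; NA marks a missing date
MY_DATA <- data.frame(
  Date_1 = as.Date(c("2020-01-01", NA, NA, "2020-03-01")),
  Date_2 = as.Date(c("2020-02-01", "2020-02-15", NA, NA))
)

# Category_1 where Date_1 is missing and Date_2 is present, Category_2 otherwise
MY_DATA$NEW_COL <- ifelse(is.na(MY_DATA$Date_1) & !is.na(MY_DATA$Date_2),
                          "Category_1", "Category_2")
MY_DATA$NEW_COL
```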

Using ifelse in R when one of the options produces NAs?

I want to vectorize a function that relies on checking a condition and depending on whether this condition is TRUE or FALSE, return the outcome of one of two functions, respectively. The problem is that, when the condition is FALSE, the first function cannot be evaluated. Then, ifelse returns the correct values but it also produces a warning. I would like to produce a function that does not produce warnings.
I have tried ifelse(), but it does not work. I was expecting that this command would skip the evaluation of the first function when the condition is FALSE.
Here is an illustrative piece of R code
p = c(-1,1,-1,1,-1,-1,-1,1)
ifelse(p>0, sqrt(p), p^2)
which returns
[1] 1 1 1 1 1 1 1 1
Warning message:
In sqrt(p) : NaNs produced
As you can see, the outcome is correct but, for some reason, the first function is also evaluated where the condition is FALSE. I would like to avoid this.
We can create a numeric vector and then fill the elements based on the condition put forward by 'p'
out <- numeric(length(p))
out[p > 0] <- sqrt(p[p > 0])
out[p <= 0] <- p[p <= 0]^2
With ifelse we need to have all arguments of the same length. According to ?ifelse
ifelse(test, yes, no)
A vector of the same length and attributes (including dimensions and
"class") as test and data values from the values of yes or no
What happens is that both calculations are done on the entire vector, and the result is then assembled element by element based on the test condition. For sqrt, the negative values give a warning and produce NaN. The NaN elements don't show up in the output, but the warning has already been issued. The warning is a friendly one, and can be suppressed with suppressWarnings.
Avoidance through ifelse probably isn't possible. My understanding of the ifelse process is:
1) Create a vector of values based on the expression in yes
2) Create a vector of values based on the expression in no
3) Use the result of test to decide whether each element comes from yes or no.
If an error will occur in either yes or no, ifelse will fail.
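One way to see this for yourself is to capture the warning while ifelse runs. This is only an illustrative sketch using a shortened version of the vector from the question; warnings_seen is a hypothetical name.

```r
p <- c(-1, 1, -1, 1)

warnings_seen <- character(0)
res <- withCallingHandlers(
  ifelse(p > 0, sqrt(p), p^2),
  warning = function(w) {
    # Record the message, then stop the warning from being printed
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

res            # the values are still correct
warnings_seen  # sqrt() was evaluated on the whole of p, negatives included
```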
To get around this, you need to only evaluate expressions where they will succeed. (such as in akrun's answer, a variant of which is given here for completeness)
p = c(-1,1,-1,1,-1,-1,-1,1)
condition <- p > 0
result <- numeric(length(p))
result[condition] <- sqrt(p[condition])
result[!condition] <- p[!condition]^2
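Since the question asks for a vectorized function, the same idea could be wrapped up as below. This is a sketch: sqrt_or_square is a hypothetical name, and it assumes p contains no NA values (an NA in the logical index would make the subset assignment fail).

```r
# Hypothetical helper: square root for positive values, square otherwise.
# Each expression is evaluated only on the elements where it is valid,
# so sqrt() never sees a negative number and no warning is raised.
sqrt_or_square <- function(p) {
  condition <- p > 0
  result <- numeric(length(p))
  result[condition]  <- sqrt(p[condition])
  result[!condition] <- p[!condition]^2
  result
}

sqrt_or_square(c(-2, 4, -3, 9))
```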

R error subsetting data.frame when using [[

For a data frame (data) which has one column named sulfate,
what is the difference between data[["sulfate"]] and data[[colnames(data)=="sulfate"]]?
data["sulfate"] and data[colnames(data)=="sulfate"] yield the same result, and both have class data.frame. In my case data[["sulfate"]] returns a numeric vector, but data[[colnames(data)=="sulfate"]] gives an error. Why?
First - here are some ways to achieve what you are trying to achieve:
data$sulfate
getElement(data, "sulfate")
Next a short explanation why data[[colnames(data)=="sulfate"]] does not work.
1) The expression within [[ is colnames(data)=="sulfate" which is a logical vector.
2) Function [[ accepts a single element (because it's used to select a single element) or a numeric vector in which case it is used to select elements of a nested list. For example:
a <- list(list(2,3), list(3,4))
a[[c(2,1)]]
[1] 3
The help page help(`[[`) will have more information about how it works.
3) The data.frame object in R is a list, you can confirm this by doing is.list(data). So the function [[ works the same way.
Now what happens when you pass it a logical vector instead of a single number: it gets turned into a numeric representation of 0s and 1s. For example, inspect as.numeric(colnames(data)=="sulfate").
Then the subsetting function [[ encounters a 0 entry, and subsetting with 0 throws an error saying that you are attempting to select less than one element.
data[[0]]
Notice that the error is the same as when doing data[[colnames(data)=="sulfate"]]
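To compare the variants side by side, here is a small sketch with a made-up two-column data frame (the values and the nitrate column are hypothetical). If you do want to start from the logical vector, which() converts it to a single position that [[ accepts.

```r
# Hypothetical data frame; "sulfate" is deliberately not the first column
data <- data.frame(nitrate = c(0.1, 0.2), sulfate = c(1.5, 2.5))
is_sulfate <- colnames(data) == "sulfate"

class(data[is_sulfate])     # [ accepts a logical vector and keeps the data frame
class(data[["sulfate"]])    # [[ extracts the column itself
data[[which(is_sulfate)]]   # which() gives the column position

# Passing the logical vector straight to [[ fails; capture the error message
err <- tryCatch(data[[is_sulfate]], error = function(e) conditionMessage(e))
err
```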

Explanation of subsetting

Can anyone explain what this line t[exists,][1:6,] is doing in the code below and how that subsetting works?
t<-trees
t[1,1]= NA
t[5,3]= NA
t[1:6,]
exists<-complete.cases(t)
exists
t[exists,][1:6,]
The complete.cases function checks the data frame and returns a vector of TRUE and FALSE, where TRUE indicates a row with no missing data. The vector has one element per row of t.
The t[exists,] part subsets the data so that only rows where exists is TRUE are kept; rows that have missing data are FALSE in exists and are removed. The [1:6,] then takes the first 6 of the remaining rows, i.e. the first 6 rows with no missing data.
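The two steps can be checked on the built-in trees data. In this sketch the variables are renamed to tr and keep (hypothetical names) so that they do not shadow the base functions t() and exists().

```r
tr <- trees          # built-in dataset
tr[1, 1] <- NA
tr[5, 3] <- NA

keep <- complete.cases(tr)  # one logical per row; FALSE where a row has an NA
which(!keep)                # the rows that will be dropped

first_six <- tr[keep, ][1:6, ]
rownames(first_six)         # original row labels of the rows that were kept
```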
Some background
In R, [ is a function like any other. R parses t[exists, ] as
`[`(t, exists, ) # don't forget the backticks, or the blank last argument!
Indeed you can always call [ with the backtick-and-parentheses syntax, or even crazier use it in constructions like
as.data.frame(lapply(t[exists, ], `[`, 1:6))
which, believe it or not, is (almost) equivalent to t[exists,][1:6,].
The same is true for functions like [[, $, and more exotic stuff like names<-, which is a special function to assign argument value to the names attribute of an object. We use functions like this all the time with syntax like
names(iris) <- tolower(names(iris))
without realizing that what we're really doing is
iris <- `names<-`(iris, tolower(names(iris)))
And finally, you can type
?`[`
for documentation, or type
`[`
to return the definition, just like any other function.
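A quick way to convince yourself of both claims is the sketch below. The row filter on Height and the small data frame df are made up for illustration.

```r
tr <- trees
keep <- tr$Height > 80  # hypothetical logical row filter

# Bracket syntax and function-call syntax give the same result
same <- identical(tr[keep, ], `[`(tr, keep, ))
same

# Replacement functions can be called the same way
df <- data.frame(A = 1, B = 2)
renamed <- `names<-`(df, tolower(names(df)))
names(renamed)
```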
What t[exists,][1:6,] does
The simple answer is that R parses t[exists,][1:6,] as something like:
1) Get the value of t
2) From the result of step 1, get the rows that correspond to TRUE elements of exists.
3) From the result of step 2, get rows with row numbers in the vector 1:6, i.e. rows 1 through 6
The more complicated answer is that this is handled by the parser as:
`[`(`[`(t, exists, ), 1:6, ) # yes, this has blank arguments
which a human can interpret as
temporary_variable_1 <- `[`(t, exists, )
temporary_variable_2 <- `[`(temporary_variable_1, 1:6, )
print(temporary_variable_2) # implicitly, sending an object by itself to the console will `print` that object
Interestingly, because you typically can't pass blank arguments in R, certain constructions are impossible with the bracket function, like eval(call("[", t, exists, )) which will throw an undefined columns selected error.

Simple if-else loop in R

Can someone tell me what is wrong with this if-else loop in R? I frequently can't get if-else loops to work. I get an error:
if(match('SubjResponse',names(data))==NA) {
observed <- data$SubjResponse1
}
else {
observed <- data$SubjResponse
}
Note that data is a data frame.
The error is
Error in if (match("SubjResponse", names(data)) == NA) { :
missing value where TRUE/FALSE needed
This is not a full example as we do not have the data but I see these issues:
You cannot test for NA with ==, you need is.na()
Similarly, match() returns NA when there is no match, whereas functions such as which() return a zero-length vector, which is usually tested with length()==0
I tend to write } else { on one line.
As @DirkEddelbuettel noted, you can't test NA that way. But you can make match not return NA:
By using nomatch=0 and reversing the if clause (since 0 is treated as FALSE), the code can be simplified. Furthermore, another useful coding idiom is to assign the result of the if clause, that way you won't mistype the variable name in one of the branches...
So I'd write it like this:
observed <- if(match('SubjResponse',names(data), nomatch=0)) {
data$SubjResponse # match found
} else {
data$SubjResponse1 # no match found
}
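To see why the original comparison fails, and one alternative way to write the test, consider this sketch. The data frame is hypothetical and holds only the fallback column; %in% is used here as a swapped-in alternative to match() with nomatch=0.

```r
# Hypothetical data: only the fallback column is present
data <- data.frame(SubjResponse1 = c(1, 0, 1))

m <- match("SubjResponse", names(data))
m         # NA, because there is no column with that exact name
m == NA   # NA rather than TRUE, which is what if() complains about
is.na(m)  # TRUE: the correct way to test for NA

# %in% always returns TRUE or FALSE, so it is safe inside if()
observed <- if ("SubjResponse" %in% names(data)) {
  data$SubjResponse   # match found
} else {
  data$SubjResponse1  # no match found
}
observed
```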
By the way, if you "frequently" have problems with if-else, you should be aware of two things:
The object to test must not be NA or NaN, a string that cannot be interpreted as a logical value (such as 'Hello'), or some other type that can't be coerced into a logical value. Numeric is OK: 0 is FALSE, anything else (but NA/NaN) is TRUE.
The length of the object should be exactly 1 (a scalar value). If it is longer, you get a warning and only the first element is used (R 4.2.0 and later raise an error instead). If it is shorter, you get an error.
Examples:
len3 <- 1:3
if(len3) 'foo' # WARNING (ERROR since R 4.2.0): the condition has length > 1
len0 <- numeric(0)
if(len0) 'foo' # ERROR: argument is of length zero
badVec1 <- NA
if(badVec1) 'foo' # ERROR: missing value where TRUE/FALSE needed
badVec2 <- 'Hello'
if(badVec2) 'foo' # ERROR: argument is not interpretable as logical
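If a condition might be NA or have the wrong length, one defensive option is to wrap it in isTRUE(), which is TRUE only for a single, non-missing TRUE value. This is a sketch of that idea rather than part of the answer above; x and msg are hypothetical names.

```r
isTRUE(TRUE)           # a single TRUE
isTRUE(NA)             # missing value
isTRUE(c(TRUE, TRUE))  # length greater than 1
isTRUE(logical(0))     # length zero

x <- NA
# x > 0 is NA here, which would be an error in a bare if()
msg <- if (isTRUE(x > 0)) "positive" else "not positive (or unknown)"
msg
```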
