My goal is to categorize the rows in my dataset depending on the values of two different dates.
if(!exists(MY_DATA$Date_1) & exists(MY_DATA$Date_2)) {
MY_DATA$NEW_COL <- c("Category_1")
} else {
MY_DATA$NEW_COL <- c("Category_2")
}
But it isn't working. I'm currently trying a simplified version, as follows:
if(!exists(MY_DATA$Date_1)){
MY_DATA$NEW_COL <- c("Category_1")
}
However, it seems that this only reads the value on the first row, and it either gives me a column with all values as Category_1 or no column at all.
Also I have tried this with is.na(), is.null() and exists().
You wrote that it "only reads the value on the first row". This is because an if statement requires a condition of length 1. When given a vector longer than that, older versions of R use only the first element (with a warning) to decide TRUE or FALSE; since R 4.2.0 it is an error.
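The ifelse function accepts a logical vector as its test and returns a vector of the same length, with each element taken from the yes or the no argument depending on whether the test is TRUE or FALSE. It may be suitable for your needs.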
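As a minimal sketch of that difference (the dates vector below is made-up example data, not taken from the question), is.na() yields one logical per element, and ifelse() turns that into one category per element:

```r
# Hypothetical example data: three dates, one of them missing
dates <- as.Date(c("2020-01-01", NA, "2020-03-01"))

# is.na() is vectorized: it returns one TRUE/FALSE per element
missing <- is.na(dates)

# ifelse() picks a value for every element, so the result has length 3
category <- ifelse(missing, "Category_1", "Category_2")
print(category)
```

An if statement given the same missing vector would not produce one answer per row, which is why the original attempt filled the whole column or none of it.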
To rephrase what was originally a comment by @r2evans: exists() checks whether a variable is already defined in the R environment. It takes a character vector of length 1 as its argument; given a longer vector, it checks only the first element.
a = 1
b = 1
exists("a")
[1] TRUE
exists(c("a", "b"))
[1] TRUE
exists(c("ab", "a", "b"))
[1] FALSE
However, it's worth noting that exists() does not check whether a value is inside a vector. If you are trying to check whether a value is in a vector, you'll want the %in% operator instead.
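A short sketch of the three checks side by side may help; the values vector here is a hypothetical example, and which check is appropriate depends on what "exists" is meant to express:

```r
# Hypothetical example vector with one missing value
values <- c(3, 5, NA)

# exists(): is there an object with this name?
has_object <- exists("values")

# %in%: is this value among the elements of the vector?
has_five <- 5 %in% values
has_four <- 4 %in% values

# is.na(): which elements are missing?
na_flags <- is.na(values)

print(c(has_object, has_five, has_four))
print(na_flags)
```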
The solution will largely depend on your precise implementations.
p.s. This is originally intended as a comment, but is too long as a comment.
Thanks everyone for your support, ifelse did the trick.
The following worked for me:
MY_DATA$NEW_COL <- c("Category_2")
MY_DATA$NEW_COL <- ifelse(!is.na(MY_DATA$Date_1),"Category_1","Category_2")
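For the original two-date condition, a possible sketch is shown below. It assumes that "does not exist" in the first snippet means the date is NA, and the MY_DATA frame built here is a made-up stand-in for the real data:

```r
# Hypothetical stand-in for MY_DATA
MY_DATA <- data.frame(
  Date_1 = as.Date(c(NA, "2021-05-01", NA)),
  Date_2 = as.Date(c("2021-06-01", "2021-06-02", NA))
)

# Category_1 only where Date_1 is missing AND Date_2 is present.
# The single & is the element-wise "and", so it works row by row.
MY_DATA$NEW_COL <- ifelse(
  is.na(MY_DATA$Date_1) & !is.na(MY_DATA$Date_2),
  "Category_1",
  "Category_2"
)
print(MY_DATA)
```

Because ifelse() fills every row, the preliminary assignment of "Category_2" to the whole column is not needed in this version.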
I have a data frame like the one below:
monkey = data.frame(girl = 1:10, kn = NA, boy = 5)
I want to understand what the following code means, step by step:
monkey %>%
mutate(t = ifelse(is.na(kn),.[,grepl('a',names(.))],ll))
Thank you everyone in advance for your support.
In my opinion, this is not good code, but I'll try to explain what it is doing.
is.na(kn) (in the context of monkey) returns a logical vector of whether each value in that column is NA,
with(monkey, is.na(kn))
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The . in .[,grepl(*)] refers to the data as it was at the start of this call to mutate. It would be more dplyr-canonical to use cur_data() (deprecated since dplyr 1.1.0 in favor of pick()), which is more complete: it takes into account columns mutated earlier in the same call, which . does not see (not a factor here). I believe this .[,*] code is trying to select a column dynamically based on the current data.
Why this one is bad:
1. There is no column here whose name contains "a";
2. There could be more than one column whose name contains "a", which means the yes= argument to ifelse would produce a nested frame in the new t= column;
3. The behavior of .[,*] changes if the original frame is the base-R data.frame or if it is the tibble-variant tbl_df: see monkey[,1] versus tibble(monkey)[,1].
The no= argument refers to an object ll that is not defined. Intuitively, this should fail with Error: object 'll' not found or similar, but since every element of the test= argument is TRUE, no= is not needed and so it is not evaluated. Consider ifelse(c(TRUE, TRUE), 1:2, stop("oops")) (no error) versus ifelse(c(TRUE, FALSE), 1:2, stop("oops")) (error).
Ultimately, this code is not defensive enough to be safe (base-vs-tibble variant) and its intent is unclear.
My advice when using dplyr is to use dplyr::if_else instead of base R's ifelse. For one, ifelse has some issues and limitations (e.g., How to prevent ifelse() from turning Date objects into numeric objects); for another, if_else protects you from ambiguous, inconsistent-results code such as in your question.
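The Date limitation mentioned above can be reproduced in base R alone. The sketch below uses made-up values and shows one base-R workaround (assignment by logical index); it is not the only option, and dplyr::if_else is the alternative this answer recommends:

```r
# Hypothetical example: two dates and a logical flag
d <- as.Date(c("2021-01-01", "2021-02-01"))
flag <- c(TRUE, FALSE)

# ifelse() takes its attributes from `test`, so the Date class is dropped
res_ifelse <- ifelse(flag, d, as.Date(NA))
print(class(res_ifelse))   # "numeric"

# One base-R alternative: copy the vector, then assign by logical index
res_index <- d
res_index[!flag] <- NA
print(class(res_index))    # "Date"
```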
I noticed that if I call setNames() inside ifelse(), the returned object does not preserve the names from setNames().
x <- 1:10
#no names kept
ifelse(x <5, setNames(x+1,letters[1:4]), setNames(x^3, letters[5:10]))
#names kept
setNames(ifelse(x <5, x+1,x^3), letters[1:10])
After looking at the code I realize that the second way is more concise, but I would still be interested to know why the names are not preserved when setNames() is called in ifelse(). The ifelse() documentation warns:
The mode of the result may depend on the value of test (see the examples), and the class attribute (see oldClass) of the result is taken from test and may be inappropriate for the values selected from yes and no.
Is the named list being stripped related to this warning?
It's not really specific to setNames. ifelse simply doesn't preserve names from the yes and no arguments. It would get confusing if your yes and no values had different names, so it just doesn't bother. However, according to the Value section of the help page:
A vector of the same length and attributes (including dimensions and "class") as test
Since names are stored as attributes, names are only preserved from the test parameter. Observe these simple examples:
ifelse(TRUE, c(a=1), c(x=4))
# [1] 1
ifelse(c(g=TRUE), c(a=1), c(x=4))
# g
# 1
So in your examples you need to move the names to the test condition
ifelse(setNames(x <5,letters[1:10]), x+1, x^3)
I am trying to see if the data.frame column has any null values to move to the next loop. I am currently using the code below:
if (is.na(df[,relevant_column]) == TRUE ){next}
which spits out the warning:
In if (is.na(df_cell_client[, numerator]) == TRUE) { ... : the
condition has length > 1 and only the first element will be used
How do I check if any of the values are null and not just the first row?
(I assume by "null" you really mean NA, since a data.frame cannot contain NULL in that sense.)
Your problem is that if expects a single logical, but is.na(df[,relevant_column]) is returning a vector of logicals. any reduces a vector of logicals into a single global "or" of the vector:
Try:
if (any(is.na(df[,relevant_column]))) {next}
BTW: == TRUE is unnecessary. Keep it if you feel you want the clarity in your code, but I think you'll find most R code does not use that. (I've also seen something == FALSE, equally "odd/wrong", where ! something should work ... but I digress.)
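As a sketch of how this fits into a loop (the data frame and the kept vector below are hypothetical, chosen only to illustrate the skip):

```r
# Hypothetical data frame in which one column contains an NA
df <- data.frame(x = c(1, 2, 3), y = c(4, NA, 6), z = c(7, 8, 9))

kept <- character(0)
for (relevant_column in names(df)) {
  # any() collapses the per-row logicals into a single TRUE/FALSE
  if (any(is.na(df[, relevant_column]))) next
  kept <- c(kept, relevant_column)
}
print(kept)
```

Base R also provides anyNA(x), which gives the same answer as any(is.na(x)) for a plain vector.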
Can anyone explain what this line t[exists,][1:6,] is doing in the code below and how that subsetting works?
t<-trees
t[1,1]= NA
t[5,3]= NA
t[1:6,]
exists<-complete.cases(t)
exists
t[exists,][1:6,]
The complete.cases function checks the data frame and returns a vector of TRUE and FALSE values, where TRUE indicates a row with no missing data. The vector has as many elements as there are rows in t.
The t[exists,] part subsets the data so that only rows where exists is TRUE are kept; rows that have missing data are FALSE in exists and are removed. The [1:6,] then takes the first 6 of the remaining rows.
Some background
In R, [ is a function like any other. R parses t[exists, ] as
`[`(t, exists, ) # don't forget the backticks, or the blank last argument!
Indeed you can always call [ with the backtick-and-parentheses syntax, or even crazier use it in constructions like
as.data.frame(lapply(t[exists, ], `[`, 1:6))
which, believe it or not, is (almost) equivalent to t[exists,][1:6,].
The same is true for functions like [[, $, and more exotic ones like names<-, which is a special function that assigns its value argument to the names attribute of an object. We use functions like this all the time with syntax like
names(iris) <- tolower(names(iris))
without realizing that what we're really doing is
iris <- `names<-`(iris, tolower(names(iris)))
And finally, you can type
?`[`
for documentation, or type
`[`
to return the definition, just like any other function.
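To check these equivalences yourself, here is a small sketch using the built-in trees dataset. The names df_t and keep are hypothetical stand-ins for t and exists, chosen to avoid masking the base functions of the same names:

```r
df_t <- trees                     # built-in dataset with 31 rows
df_t[1, 1] <- NA
keep <- complete.cases(df_t)

# Bracket syntax versus the same call written as an ordinary function
sugar  <- df_t[keep, ]
direct <- `[`(df_t, keep, )       # note the blank third argument
print(identical(sugar, direct))

# Replacement syntax versus the underlying `names<-` function
renamed_sugar <- df_t
names(renamed_sugar) <- tolower(names(renamed_sugar))
renamed_direct <- `names<-`(df_t, tolower(names(df_t)))
print(identical(renamed_sugar, renamed_direct))
```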
What t[exists,][1:6,] does
The simple answer is that R parses t[exists,][1:6,] as something like:
1. Get the value of t.
2. From the result of step 1, get the rows that correspond to TRUE elements of exists.
3. From the result of step 2, get rows with row numbers in the vector 1:6, i.e. rows 1 through 6.
The more complicated answer is that this is handled by the parser as:
`[`(`[`(t, exists, ), 1:6, ) # yes, this has blank arguments
which a human can interpret as
temporary_variable_1 <- `[`(t, exists, )
temporary_variable_2 <- `[`(temporary_variable_1, 1:6, )
print(temporary_variable_2) # implicitly, sending an object by itself to the console will `print` that object
Interestingly, because you typically can't pass blank arguments in R, certain constructions are impossible with the bracket function, like eval(call("[", t, exists, )) which will throw an undefined columns selected error.
Why do the if-else construct and the function ifelse() behave differently?
mylist <- list(list(a=1, b=2), list(x=10, y=20))
l1 <- ifelse(sum(sapply(mylist, class) != "list")==0, mylist, list(mylist))
l2 <-
if(sum(sapply(mylist, class) != "list") == 0){ # T: all list elements are lists
mylist
} else {
list(mylist)
}
all.equal(l1,l2)
# [1] "Length mismatch: comparison on first 1 components"
From the ifelse documentation:
‘ifelse’ returns a value with the same shape as ‘test’ which is
filled with elements selected from either ‘yes’ or ‘no’ depending
on whether the element of ‘test’ is ‘TRUE’ or ‘FALSE’.
Your test has length one, so the output is truncated to length 1.
You can also see this illustrated with a more simple example:
ifelse(TRUE, c(1, 3), 7)
# [1] 1
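To make the "same shape as test" rule concrete, here is a brief sketch with made-up values that compares a length-1 test, a length-3 test, and the if/else control structure:

```r
# Length-1 test: only the first element of `yes` is used
r1 <- ifelse(TRUE, c(1, 3), 7)

# Length-3 test: `yes` and `no` are recycled to length 3, then picked per element
r3 <- ifelse(c(TRUE, FALSE, TRUE), c(1, 3), 7)

# if/else returns whichever branch is chosen, whole and unmodified
r_if <- if (TRUE) c(1, 3) else 7

print(r1)
print(r3)
print(r_if)
```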
if (cond) { yes } else { no } is a control structure. It was designed to effect programming forks rather than to process a sequence. I think many people come from SPSS or SAS, whose authors chose "IF" to implement conditional assignment within their DATA or TRANSFORM functions, and so they expect R to behave the same way. SAS and SPSS both have implicit FOR-loops in their data steps, whereas R came from a programming tradition. R's implicit for-loops are built into the many vectorized functions (including ifelse). The lapply/sapply functions are the more R-savvy way to implement most sequential processing, although they don't succeed at lagged variable access, especially if there are any randomizing features whose "effects" get cumulatively handled.
ifelse takes an expression that builds a vector of logical values as its first argument. The second and third arguments should be vectors of the same length as the first (shorter ones are recycled), and for each element either the second or the third argument supplies the value. This is similar to the SPSS/SAS IF commands, which have an implicit by-row mode of operation.
For some reason this is marked as a duplicate of
Why does ifelse() return single-value output?
So a workaround for that question is:
a=3
yo <- ifelse(a==1, 1, list(c(1,2)))
yo[[1]]