I have a data frame like the one below:
monkey = data.frame(girl = 1:10, kn = NA, boy = 5)
I want to understand what the following code means, step by step:
monkey %>%
mutate(t = ifelse(is.na(kn),.[,grepl('a',names(.))],ll))
Thank you everyone in advance for your support.
In my opinion, this is not good code, but I'll try to explain what it is doing.
is.na(kn) (in the context of monkey) returns a logical vector indicating whether each value in that column is NA:
with(monkey, is.na(kn))
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The use of . in .[, grepl(*)] refers to the data as it was at the start of this call to mutate; it would be more dplyr-canonical to use cur_data() (deprecated in favor of pick() as of dplyr 1.1.0), which is more complete (e.g., it takes into account previously mutated columns that . does not see, though that is not a factor here). I believe this .[*] code is trying to select a column dynamically based on the current data.
Why this one is bad:
1. There is no column here whose name contains "a";
2. There could be more than one column whose name contains "a", which means the yes= argument to ifelse would produce a nested frame in the new t= column;
3. The behavior of .[,*] changes if the original frame is the base-R data.frame or if it is the tibble-variant tbl_df: see monkey[,1] versus tibble(monkey)[,1].
The no= argument refers to an object ll that is not defined. This should (intuitively) fail with Error: object 'll' not found or similar, but since every element of the test= argument is true, the no= is not needed and so it is not evaluated. Consider ifelse(c(TRUE, TRUE), 1:2, stop("oops")) (no error) versus ifelse(c(TRUE, FALSE), 1:2, stop("oops")).
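The lazy evaluation of no= can be checked in base R. This is a minimal sketch: the names all_true and mixed are illustrative, and tryCatch is used only so that the error message can be inspected instead of halting the session.

```r
# Every test is TRUE, so `no` is never needed and stop() is never reached.
all_true <- ifelse(c(TRUE, TRUE), 1:2, stop("oops"))

# One test is FALSE, so `no` must be evaluated and the error surfaces.
mixed <- tryCatch(
  ifelse(c(TRUE, FALSE), 1:2, stop("oops")),
  error = function(e) conditionMessage(e)
)

print(all_true)
print(mixed)
```

This is why the undefined ll in the question goes unnoticed: it would only be looked up if at least one kn value were not NA.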
Ultimately, this code is not defensive-enough to be safe (base-vs-tibble variant) and its intent is unclear.
My advice when using dplyr is to use dplyr::if_else instead of base R's ifelse. For one, ifelse has some issues and limitations (e.g., How to prevent ifelse() from turning Date objects into numeric objects); for another, if_else protects you from ambiguous, inconsistent-results code such as in your question.
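As an illustration of the Date limitation mentioned above, here is a base-R sketch. The dates and the fallback value are made up for the example; the last line shows one possible workaround (restoring the class by hand), not necessarily the best one.

```r
# Hypothetical data: one real date and one missing date.
d <- as.Date(c("2020-01-01", NA))
fallback <- as.Date("1999-12-31")

# ifelse() builds its result from `test`, so the Date class is lost.
raw <- ifelse(is.na(d), fallback, d)
print(class(raw))

# One workaround: convert the day counts back to Date explicitly.
fixed <- as.Date(raw, origin = "1970-01-01")
print(fixed)
```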
Related
I noticed that if I call setNames() inside ifelse(), the returned object does not preserve the names from setNames().
x <- 1:10
#no names kept
ifelse(x <5, setNames(x+1,letters[1:4]), setNames(x^3, letters[5:10]))
#names kept
setNames(ifelse(x <5, x+1,x^3), letters[1:10])
After looking at the code I realize that the second way is more concise, but I would still be interested to know why the names are not preserved when setNames() is called inside ifelse(). The ifelse() documentation warns:
The mode of the result may depend on the value of test (see the examples), and the class attribute (see oldClass) of the result is taken from test and may be inappropriate for the values selected from yes and no.
Is the stripping of the names related to this warning?
It's not really specific to setNames. ifelse simply doesn't preserve names from the yes and no arguments. It would get confusing if your TRUE and FALSE values had different names, so it just doesn't bother. However, according to the Value section of the help page, the result is
A vector of the same length and attributes (including dimensions and "class") as test
Since names are stored as attributes, names are only preserved from the test parameter. Observe these simple examples:
ifelse(TRUE, c(a=1), c(x=4))
# [1] 1
ifelse(c(g=TRUE), c(a=1), c(x=4))
# g
# 1
So in your examples you need to move the names to the test condition:
ifelse(setNames(x <5,letters[1:10]), x+1, x^3)
My goal is to categorize the rows on my dataset depending on the values of two different dates.
if(!exists(MY_DATA$Date_1) & exists(MY_DATA$Date_2)) {
MY_DATA$NEW_COL <- c("Category_1")
} else {
MY_DATA$NEW_COL <- c("Category_2")
}
But it isn't working, I'm currently trying a simplified version as follows:
if(!exists(MY_DATA$Date_1)){
MY_DATA$NEW_COL <- c("Category_1")
}
However, it seems that this only reads the value on the first row, and it either gives me a column with all values as Category_1 or no column at all.
Also I have tried this with is.na(), is.null() and exists().
However, it seems that this only reads the value on the first row, and it either gives me a column with all values as Category_1 or no column at all.
This is because an if statement requires a condition of length 1. When given a vector of length greater than 1, older versions of R read only the first element to decide between TRUE and FALSE (with a warning); since R 4.2.0 this is an error.
The ifelse function accepts a logical vector as its test and returns a vector of the same length, taking each element from yes or no according to the test. It may be suitable for your needs.
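A minimal sketch of the vectorised approach, assuming that "Category_1" means Date_1 is missing while Date_2 is present (my reading of the pseudocode in the question). The contents of MY_DATA are invented for illustration.

```r
# Hypothetical data with some missing dates.
MY_DATA <- data.frame(
  Date_1 = as.Date(c("2021-01-05", NA, NA)),
  Date_2 = as.Date(c("2021-02-01", "2021-02-15", NA))
)

# is.na() and & are vectorised, so every row gets its own decision.
MY_DATA$NEW_COL <- ifelse(
  is.na(MY_DATA$Date_1) & !is.na(MY_DATA$Date_2),
  "Category_1",
  "Category_2"
)

print(MY_DATA)
```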
Rephrasing what was originally a comment by @r2evans: exists() checks whether a variable is already defined in the R environment. It takes a character vector of length 1 as its argument; otherwise it will check only the first element.
a = 1
b = 1
exists("a")
[1] TRUE
exists(c("a", "b"))
[1] TRUE
exists(c("ab", "a", "b"))
[1] FALSE
However, it's worth noting that exists() does not check whether a value is inside a vector. If you are trying to check whether a value is in a vector, you'll want the %in% operator instead.
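To make the distinction concrete, here is a small base-R sketch. The variable names are illustrative, and surely_not_defined_xyz is assumed not to exist in your session.

```r
a <- 1

# exists() asks: is there a variable with this name?
var_defined <- exists("a")
var_missing <- exists("surely_not_defined_xyz")

# %in% asks: does this value occur in the vector?
values <- c(10, 20, NA)
has_twenty <- 20 %in% values

# is.na() asks, element by element: is this value missing?
na_flags <- is.na(values)

print(c(var_defined, var_missing, has_twenty))
print(na_flags)
```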
The solution will largely depend on your precise implementations.
P.S. This was originally intended as a comment, but it is too long for one.
Thanks everyone for your support, ifelse did the trick.
The following worked for me:
MY_DATA$NEW_COL <- c("Category_2")
MY_DATA$NEW_COL <- ifelse(!is.na(MY_DATA$Date_1),"Category_1","Category_2")
Let's create the data frame:
df <- data.frame(VarA = c(1, NA, 5), VarB = c(NA, 2, 7))
  VarA VarB
1    1   NA
2   NA    2
3    5    7
If I run a simple NA query it shows me the locations of each NA.
is.na(df)
      VarA  VarB
[1,] FALSE  TRUE
[2,]  TRUE FALSE
[3,] FALSE FALSE
Why doesn't is.numeric return the same type of data frame? It only outputs a single "FALSE".
is.numeric(df)
[1] FALSE
Is there a good explanation of data types, classes, etc. somewhere? I read about these things often but don't have a solid feel for them. I don't get the difference between a matrix and data frame, or num vs dbl. It's easy to conflate these things.
I did the Cyclismo "basic data types" tutorial but would like to dig a little deeper.
First - documentation
Let's turn to the documentation. From ?is.na:
The generic function is.na indicates which elements are missing.
So is.na is made to tell you which individual elements within an object are missing.
From ?is.numeric:
is.numeric is a more general test of an object being interpretable as numbers.
So is.numeric tells you whether an object is numeric (not whether individual elements within the object are numeric).
These are behaving exactly as documented - is.na(df) tells you which elements of the data frame are missing. is.numeric(df) tells you that df is not numeric (in fact, it is a data.frame).
Is it inconsistent?
I can see how this seems inconsistent. There are just a few is.* functions that work element-wise. is.na, is.finite, is.nan are the only ones I can think of. All the other is.* functions work on the whole object. These functions are essentially stand-ins for equality testing with == when the equality testing wouldn't work (more on this below). But once you understand the data structures a little more, they don't seem inconsistent, because they really wouldn't make sense the other way.
is.numeric makes sense the way it is
It would not make sense for is.numeric to be applied element-wise. A vector is either numeric or not in its entirety - whether or not it has missing values. If you wanted to apply the is.numeric function to each column of your data frame, you could do
sapply(df, is.numeric)
which will tell you that both columns are numeric. You could make an argument that the default behavior when is.numeric() is given a data frame should be to apply it to every column, but it's possible someone wants to make sure that something is a numeric vector, not a data.frame (or anything else), and having, say, a one-column data.frame say TRUE to is.numeric() could cause confusion and errors.
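A short sketch contrasting the whole-object test with the per-column test, using the df from the question. The variable names are illustrative.

```r
df <- data.frame(VarA = c(1, NA, 5), VarB = c(NA, 2, 7))

# One answer for the object as a whole: a data.frame is not a numeric vector.
whole_object <- is.numeric(df)

# One answer per column.
per_column <- sapply(df, is.numeric)

# The per-column result can be used to keep only the numeric columns.
numeric_only <- df[, per_column, drop = FALSE]

print(whole_object)
print(per_column)
```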
is.na makes sense the way it is
Conversely, it wouldn't make sense for is.na to not be applied element-wise. NA is a stand-in for a single value, not a complicated object like a data.frame. It wouldn't really make sense to have a "missing" data frame - you could have a missing value but there's nothing to tell you that it's a data frame. However a data.frame (or a vector, or a matrix...) can contain missing values, and is.na will tell you exactly where they are.
This is pretty much identical to how equality (or other comparisons) work. You could also check for 1s in your data frame with df == 1, or for values less than 5 with df < 5. is.na() is the recommended way to check for missing values - anything == NA returns NA, so df == NA doesn't work for that. is.na(df) is the right way to do this.
To accomplish this, is.na actually has many methods. You can see them with methods("is.na"). In my current R session, I see:
methods("is.na")
[1] is.na,abIndex-method is.na,denseMatrix-method is.na,indMatrix-method
[4] is.na,nsparseMatrix-method is.na,nsparseVector-method is.na,sparseMatrix-method
[7] is.na,sparseVector-method is.na.coxph.penalty* is.na.data.frame
[10] is.na.data.table* is.na.integer64* is.na.numeric_version
[13] is.na.POSIXlt is.na.raster* is.na.ratetable*
[16] is.na.Surv*
This shows me that all these different types of objects support an is.na() call that nicely tells me where the missing values are inside them. And if I call it on another object class, then the default (internal) method will try to handle it.
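To make the earlier comparison with == concrete, here is a sketch using the df from the question. It shows that df == NA yields only NA, whereas is.na(df) locates the missing values; the variable names are illustrative.

```r
df <- data.frame(VarA = c(1, NA, 5), VarB = c(NA, 2, 7))

# Comparing with NA never gives TRUE or FALSE, only NA.
eq_na <- df == NA

# is.na() gives a usable logical matrix.
missing_mask <- is.na(df)
n_missing <- sum(missing_mask)

# Row and column positions of the missing values.
where <- which(missing_mask, arr.ind = TRUE)

print(eq_na)
print(where)
```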
Secondary questions
I don't get the difference between a matrix and data frame, or num vs dbl. It's easy to conflate these things.
num vs dbl is not relevant to R. I'm shocked that anything directed at R beginners would mention doubles - it shouldn't. If you look at the help at ?double, it includes:
It is identical to numeric.
... as.double is a generic function. It is identical to as.numeric.
For R purposes, forget the term double and just use numeric.
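If you do run into both labels (for example, str() prints num while tibble printing shows <dbl>), the following sketch may help show that they describe the same thing. The variable names are illustrative.

```r
x <- c(1.5, 2)
n <- 3L

# typeof() reports the storage type; is.numeric() is TRUE for both.
print(typeof(x))   # storage type of a decimal vector
print(typeof(n))   # storage type of an integer vector
both_numeric <- is.numeric(x) && is.numeric(n)

# as.double() and as.numeric() give the same result.
same_conversion <- identical(as.double("3.5"), as.numeric("3.5"))
```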
I don't get the difference between a matrix and data frame
Both are rectangular - rows and columns. A matrix can only have one data type/class inside it - the whole matrix is numeric, or character, or integer, etc, with no mixing. A data.frame can have different class for each of its columns, the first column can be numeric, the second character, the third factor, etc.
Matrices are simpler and more efficient, very suitable for linear algebra operations. Data frames are much more common because it is common to have data of mixed types.
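A small sketch of the "one type per matrix" rule, with made-up values: combining numbers and strings in a matrix coerces everything to character, whereas a data frame keeps each column's type.

```r
# A matrix holds a single type, so the numbers become strings.
m <- cbind(num = c(1, 2), chr = c("a", "b"))

# A data frame keeps a separate type for each column.
d <- data.frame(num = c(1, 2), chr = c("a", "b"), stringsAsFactors = FALSE)

print(typeof(m))
print(sapply(d, class))
```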
Primarily because the test in is.numeric() applies to the whole object (so returns a single value that says whether the entire object is numeric), while is.na() applies to individual elements of the object.
The next, subtler question (which you haven't asked yet but might ask next) is: why doesn't is.numeric() return TRUE, since all the elements of the data frame are numeric? It's because data frames are internally represented as lists, and could contain elements of different types (is.numeric(as.matrix(df)) does return TRUE).
str(df)
'data.frame': 3 obs. of 2 variables:
$ VarA: num 1 NA 5
$ VarB: num NA 2 7
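The list-versus-matrix point can be checked directly. This is a sketch with the df from the question plus a hypothetical mixed-type frame (mixed) for contrast.

```r
df <- data.frame(VarA = c(1, NA, 5), VarB = c(NA, 2, 7))

# A data frame is a list of columns under the hood.
is_list <- is.list(df)

# Converted to a matrix, the all-numeric frame passes is.numeric().
mat_numeric <- is.numeric(as.matrix(df))

# With a character column, the matrix becomes character instead.
mixed <- data.frame(n = 1:2, s = c("a", "b"), stringsAsFactors = FALSE)
mixed_numeric <- is.numeric(as.matrix(mixed))

print(c(is_list, mat_numeric, mixed_numeric))
```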
The thing to consider is this: is.na tests each value that appears in a vector, whereas is.numeric checks the class of the object itself. It's apples-to-oranges in a sense. Think of it like this:
Is this object Not Available (NA)? Since the object itself exists, check each value contained in it. Is this object a number? Nope, it's a data.frame.
Can anyone explain what this line t[exists,][1:6,] is doing in the code below and how that subsetting works?
t<-trees
t[1,1]= NA
t[5,3]= NA
t[1:6,]
exists<-complete.cases(t)
exists
t[exists,][1:6,]
The complete.cases function will check the data frame and will return a vector of TRUE and FALSE where a TRUE indicates a row with no missing data. The vector will be as long as there are rows in t.
The t[exists,] part will subset the data so that only rows where exists is TRUE are kept - the rows that have missing data will be FALSE in exists and are removed. The [1:6,] will then take only the first 6 of the rows that have no missing data.
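The two steps can be traced on the built-in trees data. In this sketch I use the names tr and keep instead of t and exists, purely to avoid masking the base functions t() and exists().

```r
tr <- trees
tr[1, 1] <- NA
tr[5, 3] <- NA

# TRUE for rows without any NA; rows 1 and 5 are FALSE.
keep <- complete.cases(tr)

# First drop the incomplete rows, then take the first six that remain.
res <- tr[keep, ][1:6, ]

# The original row names show which rows survived.
print(rownames(res))
```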
Some background
In R, [ is a function like any other. R parses t[exists, ] as
`[`(t, exists, ) # don't forget the backticks, or the blank argument after the comma!
Indeed you can always call [ with the backtick-and-parentheses syntax, or even crazier use it in constructions like
as.data.frame(lapply(t[exists, ], `[`, 1:6))
which, believe it or not, is (almost) equivalent to t[exists,][1:6,].
The same is true for functions like [[, $, and more exotic stuff like names<-, which is a special function that assigns its value argument to the names attribute of an object. We use functions like this all the time with syntax like
names(iris) <- tolower(names(iris))
without realizing that what we're really doing is
iris <- `names<-`(iris, tolower(names(iris)))
And finally, you can type
?`[`
for documentation, or type
`[`
to return the definition, just like any other function.
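A brief sketch of calling these operators as ordinary functions, on a made-up named vector v. The result of `names<-` is assigned back to v here, mirroring what names(v) <- ... does.

```r
v <- c(a = 1, b = 2, c = 3)

# `[` called as a function: the same as v[2].
second <- `[`(v, 2)

# `names<-` called as a function, with the result assigned back.
v <- `names<-`(v, toupper(names(v)))

print(second)
print(v)
```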
What t[exists,][1:6,] does
The simple answer is that R parses t[exists,][1:6,] as something like:
1. Get the value of t.
2. From the result of step 1, get the rows that correspond to TRUE elements of exists.
3. From the result of step 2, get the rows at positions 1:6, i.e. rows 1 through 6.
The more complicated answer is that this is handled by the parser as:
`[`(`[`(t, exists, ), 1:6, ) # yes, this has blank arguments
which a human can interpret as
temporary_variable_1 <- `[`(t, exists, )
temporary_variable_2 <- `[`(temporary_variable_1, 1:6, )
print(temporary_variable_2) # implicitly, sending an object by itself to the console will `print` that object
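The equivalence of the bracket syntax and the nested function-call form can be checked with identical(). As an assumption for illustration, this sketch uses tr and keep in place of t and exists so that the base functions of those names are not masked.

```r
tr <- trees
tr[1, 1] <- NA
tr[5, 3] <- NA
keep <- complete.cases(tr)

# Usual bracket syntax.
bracket_form <- tr[keep, ][1:6, ]

# The same two calls written as nested function calls with blank arguments.
function_form <- `[`(`[`(tr, keep, ), 1:6, )

print(identical(bracket_form, function_form))
```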
Interestingly, because you typically can't pass blank arguments in R, certain constructions are impossible with the bracket function, like eval(call("[", t, exists, )) which will throw an undefined columns selected error.
I have two lists of lists, humanSplit and ratSplit. Elements of humanSplit have the form:
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
   humanGene                             humanReplicate alignment RNAtype
66      DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt         6     reg
68   ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt         5     reg
If you type humanSplit[[1]], it gives the data without the name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt.
ratSplit is essentially similar to humanSplit, differing only in column order. I want to apply Fisher's test to every possible pairing of replicates from humanSplit and ratSplit. I defined the following empty vectors, which I will use to store the results of my Fisher's tests:
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For Fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use geneList, which is a data.frame made by reading a file and has the form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in other part of the code. Also x and y are integers :
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this, I build the data frame eg to use in apply. Here I am basically trying to do something similar to a double for loop, calling fishertest for each pair:
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is that the arguments you provided actually resolved. Try evaluating eg$i by itself, and you'll see that it returns a vector: the corresponding column in the eg data.frame. You are passing this whole vector as the x argument, which is then used as an index. The primary reason your function erred out is that double-bracket indexing ([[) only works with single values, not vectors of length greater than 1. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; often not required for quick code, but it would have caught this mistake. Had it not been for the [[ limit, your function may have returned incorrect results. (I've been bitten by that many times!)
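A minimal sketch of the [[ behavior and of the kind of argument check mentioned above. The list lst and the helper pick_one are hypothetical, written only for this illustration.

```r
lst <- list(a = 1, b = 2, c = 3)

# A single index works as expected.
one <- lst[[2]]

# A vector index is treated as recursive indexing and fails here.
bad <- tryCatch(lst[[c(1, 2)]], error = function(e) "error")

# Hypothetical guarded accessor: reject anything but a single number.
pick_one <- function(lst, idx) {
  stopifnot(is.numeric(idx), length(idx) == 1)
  lst[[idx]]
}
guarded <- tryCatch(pick_one(lst, c(1, 2)), error = function(e) "rejected")

print(c(bad, guarded))
```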
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and then aggregate them outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find the pvalue assigned outside the function but will not update it as you want.) You are defeating one purpose of writing this as a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply, without having to overwrite variables in the parent environment, should you try to answer the question "how do I apply this function to all pairs and aggregate the results?" Try very hard to have your function not change anything outside of its own environment.
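The "return one row, aggregate outside" pattern could look like the sketch below. Everything here is a stand-in: ratCounts and humanCounts are invented yes/no counts rather than the real ratSplit and humanSplit, and fisher_pair is a hypothetical simplification of fishertest that skips the merging and getGenetype steps.

```r
# Hypothetical stand-in data: yes/no counts per replicate.
ratCounts <- list(ratA = c(3, 7), ratB = c(6, 4))
humanCounts <- list(humA = c(5, 5), humB = c(2, 8))

# Compare ONE rat replicate with ONE human replicate; return one row.
fisher_pair <- function(x, y, rats, humans) {
  stopifnot(length(x) == 1, length(y) == 1)
  tab <- rbind(rats[[x]], humans[[y]])          # 2 x 2 contingency table
  data.frame(
    ratReplicate = names(rats)[x],
    humanReplicate = names(humans)[y],
    pvalue = fisher.test(tab)$p.value,
    stringsAsFactors = FALSE
  )
}

# All pairs, one call per pair, aggregated outside the function.
eg <- expand.grid(i = seq_along(ratCounts), j = seq_along(humanCounts))
rows <- mapply(fisher_pair, eg$i, eg$j,
               MoreArgs = list(rats = ratCounts, humans = humanCounts),
               SIMPLIFY = FALSE)
result <- do.call(rbind, rows)
print(result)
```

Because the function only reads its arguments and returns a value, nothing in the calling environment is modified, and the final data frame is built in one place.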
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply this way, it is parsing the third argument and, in this example, evaluating it. Since it is simply a call to fishertest(eg$i, eg$j), which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now you can see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, whereas your error above arose earlier, inside fishertest itself
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here is mean()(c(1, 11)), which will fail for a couple of reasons: (1) mean requires an argument and is not getting one, and (2) regardless, it does not return a function itself (returning functions belongs to the "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
1. Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Both of the following versions do this:
fishertest <- function(v) {
x <- v[1]
y <- v[2]
ratReplicateName <- names(ratSplit[x])
## ...
}
or
fishertest <- function(x, y) {
if (missing(y)) {
y <- x[2]
x <- x[1]
}
ratReplicateName <- names(ratSplit[x])
## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
2. Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, à la:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
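The wrapper pattern can be tried without the real data. In this sketch, pair_code is a hypothetical two-argument function standing in for fishertest; it just encodes its two indices into one number so that the pairing is visible in the output.

```r
# Hypothetical stand-in for a two-argument function such as fishertest.
pair_code <- function(x, y) x * 10 + y

eg <- expand.grid(i = 1:2, j = 1:3)

# apply() hands each row to the wrapper as one vector; the wrapper splits it.
out <- apply(eg, 1, function(ii) pair_code(ii[1], ii[2]))

print(unname(out))
```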
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
As a side note, there are some gotchas with using apply and family that will be confusing if you don't understand them. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregate.
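The transposition gotcha is easy to see on a small made-up matrix: when the function returns a vector for each row, apply places those vectors in columns.

```r
m <- matrix(1:6, nrow = 3)          # 3 rows, 2 columns

# Each row yields a length-2 vector; apply() stores each one as a COLUMN.
res <- apply(m, 1, function(r) r * 2)
print(dim(res))

# Transposing restores the original row/column orientation.
fixed <- t(res)
print(dim(fixed))
```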
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.