What is the difference between these data frame assignments? - r

I have a data frame that looks like so:
pid tid pname
2 NA proc/boot/procnto-smp-instr
Now if I do this, I expect nothing to happen:
y[c(FALSE), "pid"] <- 10
And nothing happens (y did not change). However, if I do this:
y[c(FALSE), ]$pid <- 10
I get:
Error in `$<-.data.frame`(`*tmp*`, "pid", value = 10) :
  replacement has 1 rows, data has 0
So my question is, what's the difference between [, "col"]<- and $col<-? Why does one throw an exception? And bonus: where in the docs can I read more about this?

The error comes from the code of $<-.data.frame, which checks that the original data.frame has at least as many rows as the replacement value:
nrows <- .row_names_info(x, 2L)
if (!is.null(value)) {
N <- NROW(value)
if (N > nrows)
stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
"replacement has %d rows, data has %d"), N, nrows),
domain = NA)
[<- is a different function, and its data.frame method ([<-.data.frame) does not raise this error when no rows are selected. The generic [<- is a primitive function, which you can read more about in the R Internals manual.
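To see the two code paths side by side, here is a minimal sketch. It rebuilds the question's data frame from the printed output (the column types are an assumption) and wraps the failing assignment in tryCatch so the script keeps running:

```r
# Rebuild the example data frame (column types are assumed)
y <- data.frame(pid = 2, tid = NA, pname = "proc/boot/procnto-smp-instr")

# `[<-.data.frame`: no rows are selected, so nothing is assigned
y[FALSE, "pid"] <- 10

# y[FALSE, ]$pid <- 10 first extracts a 0-row data frame, then calls
# `$<-.data.frame` on it; that second step is the one that fails
tmp <- y[FALSE, ]
res <- tryCatch({
  tmp$pid <- 10
  "ok"
}, error = function(e) conditionMessage(e))

print(y$pid)  # unchanged
print(res)    # the "replacement has ..., data has 0" message
```

The exact wording of the message can differ between R versions and locales, so the sketch only captures it rather than matching on it.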

For one, these operations are performed by two very different functions:
y[FALSE, 'pid'] <- 10 is a call to the `[<-.data.frame` function, while
y[FALSE, ]$pid <- 10 is a call to the `$<-.data.frame` function; the error message gives you this clue. You can see just how different they are by typing their names (with back quotes, just like above). In this particular case, though, they are intended to behave the same way, and they normally do. Try y[1, 'pid'] <- 1:3 vs y[1, ]$pid <- 1:3. Your case is "special" because y[FALSE, ] returns a "strange" object: a data.frame with 0 rows and three columns. IMHO, throwing an error is the correct behavior, and this is a minor bug in the `[<-.data.frame` function, but the language developers' opinion on this subject is more important. If you want to see for yourself where the difference is, type debug(`[<-.data.frame`) and run your example.
The answer to your "bonus" question is to type ?`[<-.data.frame` and read, though it is very, very dry :(. Best.
PS. The formatting here tends to strip back quotes; the function names above need them when typed at the console, e.g. `[<-.data.frame`.

Related

Is there a consistent way to force errors on incorrect list or vector indexing

My expectation from other programming languages is that (1:4)[3:5] and list(asdf = 4, qwerty = 5)$asdg should both raise exceptions. Instead, the first silently returns c(3, 4, NA), and the second silently returns NULL (as does list(asdf = 4, qwerty = 5)[["asdg"]]).
While this sort of behavior can occasionally be useful, far more often (in my experience), it turns a minor typo, off-by-one error, or failure to rename a variable everywhere that it's used from a trigger for an immediate and easy-to-debug error, into a trigger for a truly baffling error about 20 (or 200) steps down the line, when the silently propagating NULLs or NAs finally get fed into a function or operation that is loud about them. (Of course, that's still better than the times that it doesn't produce an error at all, just garbage results.)
data.frame()[,'wrong'] gives an error, but data.frame()['wrong',] just returns NA.
What I'm looking for is a way to do vector/array/list/data.frame/etc. subscripting/member access that will reliably cause an error immediately if I use an index that is invalid. For lists, get('wrong', list()) does what I'm looking for, but that can be quite ugly at times (especially if the result is used for subscripting something else). It's usable, but something better would be nice. For vectors (and data.frame rows), even that doesn't work.
Is there a good way to do this?
I am not sure if you can change this behaviour globally, but you can handle these cases on an individual basis as needed, based on the type of the data.
For example, for vectors -
subset_values <- function(x, ind) {
  if(min(ind) > 0 && max(ind) <= length(x)) x[ind]
  else stop('Incorrect length')
}
subset_values(1:4, 3:5)
#Error in subset_values(1:4, 3:5) : Incorrect length
subset_values(1:4, -1:3)
#Error in subset_values(1:4, -1:3) : Incorrect length
subset_values(1:4, 1:3)
#[1] 1 2 3
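The same idea can be carried over to named lists. The following is only a sketch: strict_get is a hypothetical helper name, not a base R function, and it covers just the single-name case from the question.

```r
# Hypothetical helper: like x[[name]], but an unknown name is an error
strict_get <- function(x, name) {
  if (!is.character(name) || length(name) != 1L || !(name %in% names(x))) {
    stop("no element named '", name, "'")
  }
  x[[name]]
}

lst <- list(asdf = 4, qwerty = 5)
ok <- strict_get(lst, "asdf")

# The typo from the question now fails immediately instead of returning NULL
bad <- tryCatch(strict_get(lst, "asdg"), error = function(e) "error")
```

A wrapper like this has to be called explicitly; it does not change what $ or [[ do elsewhere in the session.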

Why does 'out of bounds' indexing differ between a matrix and a data.frame?

I'm sure this is kind of basic, but I'd just like to really understand the logic of R data structures here.
If I subset a matrix by index out of bounds, I get exactly that error:
m <- matrix(data = c("foo", "bar"), nrow = 1)
m[2,]
# Error in m[2, ] : subscript out of bounds
If I do the same to a data frame, however, I get all NA rows:
df <- data.frame(foo = "foo", bar = "bar")
df[2,]
# foo bar
# NA <NA> <NA>
If I subset into a non-existent data frame column I get the familiar
df[, 3]
# Error in `[.data.frame`(df, , 3) : undefined columns selected
I know (roughly) that data frame rows are weird and to be treated carefully, but I don't quite see the connection to the above behavior.
Can someone explain why R behaves in this way for non-existent df rows?
Update
To be sure, giving NA on out-of-bounds subsets, is normal R behavior for 1D vectors:
vec <- c("foo", "bar")
vec[3]
# [1] NA
So in a way, the odd one out here is matrix subsetting, not data frame subsetting, depending on where you're starting out.
Still, the different 2D subsetting behavior (m[2, ] vs df[2, ]) might strike a dense user (as I am right now) as inconsistent.
Can someone explain why R behaves in this way[?]
Short answer: No, probably not.
Longer answer:
Once upon a time I was thinking about something similar and read this thread on R-devel: Definition of [[. Basically it boils down to:
The semantics of [ and [[ don't seem to be fully specified in the Reference manual. [...] I assume that these are features, not bugs, but I can't find documentation for them
Duncan Murdoch, a former member of the R core team gives a very nice reply:
There is more documentation in the man page for Extract, but I think it is incomplete. The most complete documentation is of course the source code*, but it may not answer the question of what's intentional and what's accidental
As mentioned in the R-devel thread, the only description in the manual is 3.4.1 Indexing by vectors:
If i is positive and exceeds length(x) then the corresponding selection is NA
But, this applies to "indexing of simple vectors". Similar out of bounds indexing for "non-simple" vectors does not seem to be described. Duncan Murdoch again:
So what is a simple vector? That is not explicitly defined, and it probably should be.
Thus, it may seem like no one knows the answer to your why question.
See also "8.2.13 nonexistent value in subscript" in the excellent R Inferno by Patrick Burns, and the section "Missing/out of bounds indices" in Hadley's book.
*Source code for the [ subset operator. A search for R_MSG_subs_o_b (which corresponds to error message "subscript out of bounds") provides no obvious clue why OOB [ indexing of matrices and when using [[ give an error, whereas OOB [ indexing of "simple vectors" results in NA.
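For reference, the behaviors discussed in this question and answer can be collected in one short script. This is only a sketch that records what each form of out-of-bounds indexing does; it does not explain why.

```r
m   <- matrix(c("foo", "bar"), nrow = 1)
df  <- data.frame(foo = "foo", bar = "bar")
vec <- c("foo", "bar")

# [ on a "simple vector": NA
vec_res <- vec[3]

# [ on a data frame row: a row filled with NA
df_res <- df[2, ]

# [ on a matrix row: error
m_res <- tryCatch(m[2, ], error = function(e) "error")

# [[ on a vector or list: error
vec2_res <- tryCatch(vec[[3]], error = function(e) "error")
lst_res  <- tryCatch(list(1)[[2]], error = function(e) "error")
```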

Ordering data frame using variable as column name

I have a couple of data frames, each of which I want to order by its last column. I've been trying for a while but nothing succeeds. The main idea is to create a function so that I don't have to repeat this for each data frame. The function I'm building is this:
order_dataF = function(x){
  tCol = colnames(x[length(x)])
  print(tCol)
  #x <- x[with(x, order(-tCol),)]
  #x <- x[with(x, order(-(paste(tCol))),)]
  #x[do.call( order, x[,match(tCol,names(x))]),]
  #x <- x[order(x$tCol),]
}
The commented-out lines are the attempts I tested; none of them works as expected. I know this is because order needs the column itself rather than the variable holding its name.
tCol always gives me the last column name. When I run this function, this is the result:
[1] "TotalSearches"
Error in -(paste(tCol)) : invalid argument to unary operator
Calls: main ... [.data.frame -> with -> with.default -> eval -> eval -> order
Execution halted
I'm printing tCol to check that it really contains the last column name, and in this case it does have exactly what I need.
Perhaps this is a silly question and too easy to solve, but I cannot move forward and it is slowing me down; I'm frustrated.
I also see that this looks like a duplicate, but it is not: nobody is asking the right question (perhaps not even me), but the idea is "Order by the content of a string variable which is obtained from the data frame column names".
Generally, don't try to use with (or other "nonstandard" evaluation functions like subset) inside functions.
order_by_last_col = function(df) {
  df[order(df[, ncol(df)]), ]
}
# test
order_by_last_col(mtcars)
If using column names stored as character strings, you must use [, not $, because $ is also a non-standard evaluation shortcut, and it never evaluates the code that comes after $, it just looks for a column with that exact name. If you'd rather use names than indices (like above), do it this way with [:
order_by_last_col = function(df) {
  last_col_name = tail(names(df), 1)
  df[order(df[, last_col_name]), ]
}
Edit: Just a few more experiments to see why your initial attempts didn't work. They don't need to be in a function to not work; they just never work.
col = "wt"
mtcars$col # NULL
with(mtcars, head(col)) # "wt"
mtcars[, match(col, names(mtcars))] # this does work but is unnecessarily long
mtcars[, col] # works, easy
mtcars[[col]] # also works
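The original attempts used a minus sign to get descending order. A minimal sketch of the same name-based approach with descending order follows; order_by_last_col_desc is a hypothetical name, and it uses order's decreasing argument rather than negation, so it also works for character columns.

```r
# Hypothetical variant: order by the last column, largest values first
order_by_last_col_desc <- function(df) {
  last_col_name <- tail(names(df), 1)
  df[order(df[[last_col_name]], decreasing = TRUE), ]
}

sorted <- order_by_last_col_desc(mtcars)
head(sorted[, c("mpg", "carb")])
```

For mtcars the last column is carb, so the rows with the most carburetors come first.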

R: passing by parameter to function and using apply instead of nested loop and recursive indexing failed

I have two lists of lists, humanSplit and ratSplit. humanSplit has elements of the form:
> humanSplit[1]
$Fetal_Brain_408_AGTCAA_L001_R1_report.txt
humanGene humanReplicate alignment RNAtype
66 DGKI Fetal_Brain_408_AGTCAA_L001_R1_report.txt 6 reg
68 ARFGEF2 Fetal_Brain_408_AGTCAA_L001_R1_report.txt 5 reg
If you type humanSplit[[1]], it gives the data without name $Fetal_Brain_408_AGTCAA_L001_R1_report.txt
ratSplit is essentially similar to humanSplit, with a difference in column order. I want to apply Fisher's test to every possible pairing of replicates from humanSplit and ratSplit. I defined the following empty vectors, which I will use to store the results of my Fisher's tests:
humanReplicate <- vector(mode = 'character', length = 0)
ratReplicate <- vector(mode = 'character', length = 0)
pvalue <- vector(mode = 'numeric', length = 0)
For Fisher's test between two replicates of humanSplit and ratSplit, I define the following function. In the function I use geneList, which is a data.frame made by reading a file and has the form:
> head(geneList)
human rat
1 5S_rRNA 5S_rRNA
2 5S_rRNA 5S_rRNA
Now here is the main function, where I use a function getGenetype which I already defined in another part of the code. Also, x and y are integers:
fishertest <-function(x,y) {
ratReplicateName <- names(ratSplit[x])
humanReplicateName <- names(humanSplit[y])
## merging above two based on the one-to-one gene mapping as in geneList
## defined above.
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[x]], by.x = "rat", by.y = "ratGene")
## [here i do other manipulation with using already defined function
## getGenetype that is defined outside of this function and make things
## necessary to define following contingency table]
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
fisherTest <- fisher.test(contingencyTable)
humanReplicate <- c(humanReplicate,humanReplicateName )
ratReplicate <- c(ratReplicate,ratReplicateName )
pvalue <- c(pvalue , fisherTest$p)
}
After doing all this, I build the matrix eg to use in apply. Here I am basically trying to do something similar to a double for loop and then run the Fisher's test on each pair:
eg <- expand.grid(i = 1:length(ratSplit),j = 1:length(humanSplit))
junk = apply(eg, 1, fishertest(eg$i,eg$j))
Now the problem is, when I try to run, it gives the following error when it tries to use function fishertest in apply
Error in humanSplit[[y]] : recursive indexing failed at level 3
Rstudio points out problem in following line:
mergedHumanData <-merge(geneList,humanSplit[[y]], by.x = "human", by.y = "humanGene")
Ultimately, I want to do the following:
result <- data.frame(humanReplicate,ratReplicate, pvalue ,alternative, Conf.int1, Conf.int2, oddratio)
I am struggling with these questions:
In defining fishertest function, how should I pass ratSplit and humanSplit and already defined function getGenetype?
And how I should use apply here?
Any help would be much appreciated.
Up front: read ?apply. Additionally, the first three hits on google when searching for "R apply tutorial" are helpful snippets: one, two, and three.
Errors in fishertest()
The error message itself has nothing to do with apply. The reason it got as far as it did is that the arguments you provided actually resolved. Try evaluating eg$i by itself, and you'll see that it returns a vector: the corresponding column of the eg data.frame. You are passing this whole vector as an index (eg$i as x, and likewise eg$j as y). The primary reason your function errored out is that double-bracket indexing ([[) only works with single indices; an index vector of length greater than 1 is interpreted as recursive indexing into nested lists, hence the message. This is a great example of where production/deployed functions would need type-checking to ensure that each argument is a numeric of length 1; it is often not required for quick code but would have caught this mistake. Had it not been for the [[ limit, your function might have returned incorrect results. (I've been bitten by that many times!)
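As a small illustration of what [[ does with a longer index vector, and of the kind of argument check mentioned above, here is a sketch on toy data. The nested list and the check_index helper are made up for this example and are not part of the original code.

```r
# Toy nested list (not the real humanSplit)
nested <- list(list(a = 1, b = 2), list(a = 3))

# A single index picks one element
single <- nested[[2]]

# A vector index is applied recursively: nested[[c(1, 2)]] is nested[[1]][[2]]
recursive <- nested[[c(1, 2)]]
same <- identical(recursive, nested[[1]][[2]])

# Hypothetical guard that rejects anything but a single numeric index
check_index <- function(i) {
  stopifnot(is.numeric(i), length(i) == 1L)
  i
}
checked <- tryCatch(check_index(1:3), error = function(e) "rejected")
```

Calling such a guard at the top of fishertest for both x and y would have surfaced the problem before any indexing happened.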
BTW: your code is also incorrect in its scoped access to pvalue, et al. If you make your function return just the numbers you need and then aggregate them outside of the function, your life will simplify. (pvalue <- c(pvalue, ...) will find the pvalue assigned outside the function but will not update it as you want.) You are defeating one purpose of writing this as a function. When thinking about writing this function, try to answer only this question: "how do I compare a single rat record with a single human record?" Only after that works correctly and simply, without having to overwrite variables in the parent environment, should you try to answer the question "how do I apply this function to all pairs and aggregate the results?" Try very hard to have your function not change anything outside of its own environment.
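A minimal sketch of that shape follows: a function that handles one pair and returns a one-row data.frame, with the aggregation done outside. Everything in it is a stand-in: compare_pair is a hypothetical name, and the contingency table is filled with made-up counts because the real HnRn, HnRy, HyRn and HyRy values depend on code not shown in the question.

```r
# Hypothetical stand-in for the real comparison of one rat/human pair
compare_pair <- function(x, y) {
  # made-up counts; the real ones come from the merged gene data
  tab <- matrix(c(3 + x, 1, 1, 3 + y), nrow = 2)
  ft <- fisher.test(tab)
  # return the results instead of appending to outside variables
  data.frame(rat = x, human = y, pvalue = ft$p.value)
}

eg <- expand.grid(i = 1:2, j = 1:3)

# One call per row of eg, then bind the one-row results together
rows <- Map(compare_pair, eg$i, eg$j)
result <- do.call(rbind, rows)
```

Here Map walks the two index columns in parallel, so each call receives a single x and a single y.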
Errors in apply()
Had your function worked properly despite these errors, you would have received the following error from apply:
apply(eg, 1, fishertest(eg$i, eg$j))
## Error in match.fun(FUN) :
## 'fishertest(eg$i, eg$j)' is not a function, character or symbol
When you call apply this way, R parses the third argument and, in this example, evaluates it. Since it is simply a call to fishertest(eg$i, eg$j), which is intended to return a data.frame row (inferred from your previous question), it resolves to such, and apply then sees something akin to:
apply(eg, 1, data.frame(...))
Now you can see that apply is being handed a data.frame and not a function.
The third argument (FUN) needs to be a function itself that takes as its first argument a vector containing the elements of the row (1) or column (2) of the matrix/data.frame. As an example, consider the following contrived example:
eg <- data.frame(aa = 1:5, bb = 11:15)
apply(eg, 1, mean)
## [1] 6 7 8 9 10
# similar to your use, will not work; this error comes from mean not getting
# any arguments, your error above is because
apply(eg, 1, mean())
## Error in mean.default() : argument "x" is missing, with no default
Realize that mean is a function itself, not the return value from a function (there is more to it, but this definition works). Because we're iterating over the rows of eg (because of the 1), the first iteration takes the first row and calls mean(c(1, 11)), which returns 6. The equivalent of your code here is mean()(c(1, 11)), which will fail for a couple of reasons: (1) mean requires an argument and is not getting one, and (2) regardless, it does not return a function itself (returning functions belongs to a "functional programming" paradigm, easy in R but uncommon for most programmers).
In the example here, mean will accept a single argument which is typically a vector of numerics. In your case, your function fishertest requires two arguments (templated by my previous answer to your question), which does not work. You have two options here:
1. Change your fishertest function to accept a single vector as an argument and parse the index numbers from it. Both of the following options do this:
fishertest <- function(v) {
  x <- v[1]
  y <- v[2]
  ratReplicateName <- names(ratSplit[x])
  ## ...
}
or
fishertest <- function(x, y) {
  if (missing(y)) {
    y <- x[2]
    x <- x[1]
  }
  ratReplicateName <- names(ratSplit[x])
  ## ...
}
The second version allows you to continue using the manual form of fishertest(1, 57) while also allowing you to do apply(eg, 1, fishertest) verbatim. Very readable, IMHO. (Better error checking and reporting can be used here, I'm just providing a MWE.)
2. Write an anonymous function to take the vector and split it up appropriately. This anonymous function could look something like function(ii) fishertest(ii[1], ii[2]). This is typically how it is done for functions that either do not transform as easily as in #1 above, or for functions you cannot or do not want to modify. You can either assign this intermediary function to a variable (which makes it no longer anonymous, figure that) and pass that intermediary to apply, or just pass it directly to apply, à la:
.func <- function(ii) fishertest(ii[1], ii[2])
apply(eg, 1, .func)
## equivalently
apply(eg, 1, function(ii) fishertest(ii[1], ii[2]))
There are two reasons why many people opt to name the function: (1) if the function is used multiple times, better to define once and reuse; (2) it makes the apply line easier to read than if it contained a complex multi-line function definition.
As a side note, there are some gotchas with using apply and family that, if you don't understand them, will be confusing. Not the least of which is that when your function returns vectors, the matrix returned from apply will need to be transposed (with t()), after which you'll still need to rbind or otherwise aggregate.
This is one area where using ddply may provide a more readable solution. There are several tutorials showing it off. For a quick intro, read this; for a more in depth discussion on the bigger picture in which ddply plays a part, read Hadley's Split, Apply, Combine Strategy for Data Analysis paper from JSS.
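The transposition gotcha mentioned above is easy to see on the small eg data frame from the earlier example. This is just a sketch with a made-up summary function:

```r
eg <- data.frame(aa = 1:5, bb = 11:15)

# The function returns a vector of length 2 for each row ...
res <- apply(eg, 1, function(r) c(lo = min(r), hi = max(r)))

# ... and apply stores each result as a column, so the shape is 2 x 5
dim(res)

# Transpose to get one row per input row
res_by_row <- t(res)
dim(res_by_row)
```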

lapply fail, but function works fine for each individual input arguments

Many thanks in advance for any advices or hints.
I'm working with data frames. The simplified coding is as follows:
f<-function(name){
  x<-tapply(name$a,list(name$b,name$c),sum)
  y<-dataset[[deparse(substitute(name))]]   # line 1)
  #where dataset is an already existing list object with names the same as the
  #function argument. I would like to avoid inputting two arguments.
  z<-vector("list",n) #where n is also defined already
  for (i in 1:n){z[[i]]<-x[y[[i]],i]}       # line 2)
  ...
}
lapply(list_names,f)
The warning message is:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
and the output is incorrect. I tried debugging and found the conflict may lie in line 1) and 2). However, when I try f(name) it is perfectly fine and the output is correct. I guess the problem is in lapply and I searched for a while but could not get to the point. Any ideas? Many thanks!
The structure of the data
Thanks Joran. Checking again I found the problem might not lie in what I had described. I produce the full code as follows and you can copy-paste to see the error.
n<-4
name1<-data.frame(a=rep(0.1,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
name2<-data.frame(a=rep(0.2,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
name3<-data.frame(a=rep(0.3,20),b=rep(1:10,each=2),c=rep(1:n,each=5),
d=rep(c("a1","a2","a3","a4","a5","a6","a7","a8","a9","a91"),each=2))
#d is the name for the observations. d corresponds to b.
dataset<-vector("list",3)
names(dataset)<-c("name1","name2","name3")
dataset[[1]]<-list(c(1,2),c(1,2,3,4),c(1,2,3,4,5,10),c(4,5,8))
dataset[[2]]<-list(c(1,2,3,5),c(1,2),c(1,2,10),c(2,3,4,5,8,10))
dataset[[3]]<-list(c(3,5,8,10),c(1,2,5,7),c(1,2,3,4,5),c(2,3,4,6,9))
f<-function(name){
x<-tapply(name$a,list(name$b,name$c),sum)
rownames(x)<-sort(unique(name$d)) #the row names for
y<-dataset[[deparse(substitute(name))]]
z<-vector("list",n)
for (i in 1:n){
z[[i]]<-x[y[[i]],i]}
nn<-length(unique(unlist(sapply(z,names)))) # the number of names appeared
names_<-sort(unique(unlist(sapply(z,names)))) # the names appeared add to the matrix
# below
m<-matrix(,nrow=nn,ncol=n);rownames(m)<-names_
index<-vector("list",n)
for (i in 1:n){
index[[i]]<-match(names(z[[i]]),names_)
m[index[[i]],i]<-z[[i]]
}
return(m)
}
list_names<-vector("list",3)
list_names[[1]]<-name1;list_names[[2]]<-name2;list_names[[3]]<-name3
names(list_names)<-c("name1","name2","name3")
lapply(list_names,f)
f(name1)
The call lapply(list_names,f) fails, but f(name1) produces exactly the matrix I want. Thanks again.
Why it doesn't work
The issue is that the call stack doesn't look the same in both cases. In lapply, it looks like
[[1]]
lapply(list_names, f) # lapply(X = list_names, FUN = f)
[[2]]
FUN(X[[1L]], ...)
In the expression being evaluated, f is called FUN and its argument name is called X[[1L]].
When you call f directly, the stack is simply
[[1]]
f(name1) # f(name = name1)
Usually this doesn't matter, but with substitute it does because substitute cares about the name of the function argument, not its value. When you get to
y<-dataset[[deparse(substitute(name))]]
inside lapply it's looking for the element in dataset named X[[1L]], and there isn't one, so y is bound to NULL.
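A stripped-down sketch of the effect, using a hypothetical function show_arg that only reports what substitute sees. The exact text seen inside lapply depends on the R version (X[[1L]] in the stack above, X[[i]] in more recent versions), so the sketch does not rely on it.

```r
# Hypothetical helper: report the expression supplied for the argument
show_arg <- function(name) deparse(substitute(name))

name1 <- data.frame(a = 1)

direct  <- show_arg(name1)                            # called directly
via_lap <- lapply(list(name1 = name1), show_arg)[[1]] # called through lapply

print(direct)
print(via_lap)
```

Only the direct call yields a string that matches a name in dataset.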
A way to get it to work
The simplest way to deal with this is probably to just have f operate on character strings and pass names(list_names) to lapply. This can be accomplished fairly easily by changing the beginning of f to
f<-function(name){
passed.name <- name
name <- list_names[[name]]
x<-tapply(name$a,list(name$b,name$c),sum)
rownames(x)<-sort(unique(name$d)) #the row names for
y<-dataset[[passed.name]]
# the rest of f...
and changing lapply(list_names, f) to lapply(names(list_names),f). This should give you what you want with nearly minimal modification, but you might also consider renaming some of your variables so the word name isn't used for so many different things: the function names, the argument of f, and all the various variables containing name.
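The pattern of iterating over names rather than values can be sketched on toy data. The objects frames and lookup and the function g below are made up for illustration; they stand in for list_names, dataset and f.

```r
# Toy stand-ins for list_names and dataset
frames <- list(name1 = data.frame(a = 1:2), name2 = data.frame(a = 3:4))
lookup <- list(name1 = "first", name2 = "second")

# The function receives a name and uses it to index both lists
g <- function(nm) paste(lookup[[nm]], sum(frames[[nm]]$a))

out <- lapply(names(frames), g)
names(out) <- names(frames)  # lapply over a character vector drops the names
```

Restoring the names afterwards keeps the result indexable in the same way as the input list.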
