Should I use 'which' on filters? - r

When filtering a dataset you can use:
df[df$column==value,]
or
df[which(df$column==value),]
The first one indexes with a logical vector. The second one indexes with a vector of positions (the ones whose value is TRUE in that logical vector). Is one better than the other? I see that sometimes the first one returns a row with all values as NA...
Which of the two expressions is more correct?
Thanks!

You should (almost) always prefer the first version.
Why? Because it’s simpler. Don’t add unnecessary complexity to your code — programming is hard enough as it is, we do not want to make it even harder; and small complexities add to each other supra-linearly.
One case where you might want to use which is when your input contains NAs that you want to ignore:
df = data.frame(column = c(1, NA, 2, 3))
df[df$column == 1, ]
# 1 NA
df[which(df$column == 1), ]
# 1
However, even in this case I would not use which; instead, I would handle the presence of NAs explicitly to document that the code expects NAs and wants to handle them. The idea is, once again, to make the code as simple and self-explanatory as possible. This implies being explicit about your intent, instead of hiding it behind non-obvious functions.
That is, in the presence of NAs I would use the following instead of which:
df[! is.na(df$column) & df$column == 1, ]
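The all-NA row mentioned in the question is easier to see with more than one column. A minimal sketch, assuming a made-up data frame df2 with the hypothetical columns column and other:

```r
# Hypothetical two-column data frame; the second row has an NA in `column`.
df2 <- data.frame(column = c(1, NA, 2, 3), other = c("a", "b", "c", "d"))

# NA == 1 evaluates to NA, and an NA in a logical row index yields a row of NAs.
with_na <- df2[df2$column == 1, ]
nrow(with_na)     # 2: the matching row plus an all-NA row

# which() keeps only the positions that are TRUE, dropping NA and FALSE.
via_which <- df2[which(df2$column == 1), ]
nrow(via_which)   # 1

# Handling NA explicitly selects the same single row as which().
explicit <- df2[!is.na(df2$column) & df2$column == 1, ]
nrow(explicit)    # 1
```

The explicit version is longer, but it states in the code itself that NAs are expected and excluded.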

Related

Is there an R function for checking if a value is a legal index in a vector or list?

There are several ways to index in R, e.g. by selecting using positive integers, by excluding using negative integers, by selecting using logicals/conditions, by selecting using names (character vectors) in a named vector or list, and probably a lot of other ways besides.
What I want is a function taking two inputs ix and lst that tells me if ix makes sense as an index of lst, i.e. that lst[ix] makes sense.
I already know you can do something like
is.index <- function(ix,lst){
ans = FALSE
try({ans=all(!is.na(lst[ix]))},silent = TRUE)
return(ans)
}
But I want it to work when the list contains NAs, and when lst is a list rather than a vector it works differently. I could probably take care of both of these as special cases, but I feel that I don't know all the possible ways to index, and all the intricacies of these, so I have no way of knowing if I have nailed all the special cases.
I know the term "make sense" isn't well defined, but it would seem reasonable to me that there exists a function, or at least a somewhat easy way, of telling if an index is reasonable.
So is there a function or a simple way to do that, preferably not something requiring a try or a tryCatch statement?
EDIT: I realize that I haven't been clear in the statement of my question. Even if ix is a vector, I want to know whether lst[ix] makes sense, not a vector telling me whether lst[ix[i]] makes sense for each possible value of i. Preferably, no matter the type of ix and lst, the function should always return a single logical value, i.e. TRUE or FALSE. For example, if lst = 1:5 and ix = c(-1,2), it should return FALSE and not c(TRUE,TRUE).
Further clarification: Personally I don't like partial matching, or that it makes sense to index by non-integer doubles (I like even less that it just uses the integer part, rather than e.g. rounding to the closest integer (useful for small precision errors) or taking the floor (which makes lst[x/y] = lst[x%/%y])); but since it makes sense to R, I think it should be up to the preferences of the answerer whether to return TRUE or FALSE in these situations. The same goes for lst[0] and lst[NA], whereas since list("Cheese" = 4)["Che"] gives NA I don't think that partial matching should be accepted.
But seeing that (at least I think) requiring the answerer to make their own choices is bad practice; if I were to choose I think that all these (except partial matching) should be accepted and returned as TRUE.
Something like the following seems to do the job. It uses partial matching to match character ix with the vector or list names.
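is.index <- function(X, ix){
  if(is.character(ix)){
    # An unnamed object cannot be indexed by name
    if(is.null(names(X))) return(FALSE)
    # pmatch() accepts partial matches against the names
    !is.na(pmatch(ix, names(X)))
  }else{
    # Positive or negative positions must lie within the length of X
    abs(ix) <= length(X)
  }
}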
Test with vectors.
x <- 1:6
y <- setNames(x, letters[x])
is.index(x, 2)
is.index(x, 7)
is.index(x, -3)
is.index(y, 'a')
is.index(y, 'z')
And now with lists.
lst <- list(1:6, letters[1:4])
is.index(lst, 3)
is.index(lst, "a")
is.index(lst, -1)
But there are problems: by default, partial matching only works with the $ extractor. [[ matches names exactly unless it is called with exact = FALSE, and [ never matches partially.
lst2 <- setNames(lst, c("A", "2nd"))
is.index(lst2, "A")
is.index(lst2, "2n")
lst2$`2n` # works, partial match on "2nd"
lst2[['2n']] # returns NULL, no partial match
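As a hedged aside, the help page ?"[[" documents an exact argument for [[, so the three extractors can be compared side by side. A small sketch, rebuilding lst2 so that it runs on its own:

```r
# Same shape as lst2 above: a list with the names "A" and "2nd".
lst2 <- list(A = 1:6, `2nd` = letters[1:4])

by_dollar  <- lst2$`2n`                    # `$` partially matches "2n" to "2nd"
by_exact   <- lst2[["2n"]]                 # NULL: `[[` matches exactly by default
by_partial <- lst2[["2n", exact = FALSE]]  # partial matching switched on for `[[`
by_single  <- lst2["2n"]                   # `[` never partially matches names
```

Here by_single is a length-one list holding NULL, which is why a check based on pmatch can report TRUE for an index that [ and the default [[ do not resolve.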

How to assign an edited dataset to a new variable in R?

The title might be misleading but I have the scenario here:
half_paper <- lapply(data_set[,-1], function(x) x[x==0]<-0.5)
This line is supposed to substitute 0 for 0.5 in all of the columns except the first one.
Then I want to pass half_paper to the following line, which should rank all of the columns except the first one:
prestige_paper <-apply(half_paper[,-1],2,rank)
But I get an error and I think that I need to somehow make half_paper into a data set like data_set.
Thanks for all of your help
Your main issue ('This line is supposed to substitute 0 for 0.5 in all of the columns except the first one') can be fixed by adding another line to your anonymous function. The assignment operator <- returns the value of whatever is on its right-hand side, so your lapply was returning a value of 0.5 for each column. To fix this, add a final line to the function that returns the modified vector.
It's also worth noting that lapply returns a list. apply was substituted in for lapply in this case for consistency, but plyr::ddply may suit this specific need better.
half_mtcars <- apply(mtcars[, -1], 2, function(x) {x[x == 0] <- 0.5; return(x)})
prestige_mtcars_tail <- apply(half_mtcars, 2, rank)
prestige_mtcars <- cbind(mtcars[, 1, drop = FALSE], prestige_mtcars_tail)
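The point about <- returning its right-hand side can be checked directly. The sketch below uses hypothetical names (bad, good, toy) and also shows one possible lapply variant that keeps the result a data frame by assigning back into the selected columns; it is an illustration under those assumptions, not the only way to do it.

```r
# A function whose last expression is an assignment returns the assigned value.
bad  <- function(x) x[x == 0] <- 0.5
good <- function(x) { x[x == 0] <- 0.5; x }

v <- c(0, 2, 0, 5)
bad_result  <- bad(v)    # 0.5, not the modified vector
good_result <- good(v)   # 0.5 2.0 0.5 5.0

# Hypothetical data frame standing in for data_set; the first column is left alone.
toy <- data.frame(id = 1:4, a = c(0, 2, 0, 5), b = c(3, 0, 1, 0))
toy[-1] <- lapply(toy[-1], good)     # assigning into toy[-1] keeps a data frame

ranked <- toy
ranked[-1] <- lapply(toy[-1], rank)  # rank every column except the first
```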

what does accessing zero element in R do?

If I have a vector a <- c(3, 5, 7, 8)
and run a[1], not surprisingly I get 3,
but if I run a[0] I get numeric(0).
What does this mean?
And what does this do?
How can I use it for normal reasons?
Others have answered what x[0] does, so I thought I'd expand on why it's useful: generating test cases. It's great for making sure that your functions work with unusual data structure variants that users sometimes produce accidentally.
For example, it makes it easy to generate 0 row and 0 column data frames:
mtcars[0, ]
mtcars[, 0]
These can arise when subsetting goes wrong:
mtcars[mtcars$cyl > 10, ]
But in your testing code it's useful to flag that you're doing it deliberately.
http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Indexing-by-vectors
As you can see it says: A special case is the zero index, which has null effects: x[0] is an empty vector and otherwise including zeros among positive or negative indices has the same effect as if they were omitted.
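The quoted rule can be tried out on the vector from the question. In this sketch, col_means is a hypothetical helper added only to show a zero-row data frame being used as a test input:

```r
a <- c(3, 5, 7, 8)

empty     <- a[0]          # numeric(0): an empty vector of the same type
with_pos  <- a[c(0, 2)]    # the zero is ignored, leaving a[2]
with_neg  <- a[c(0, -1)]   # the zero is ignored, leaving a[-1]

# Hypothetical helper: mean of every column, one number per column.
col_means <- function(df) vapply(df, mean, numeric(1))

# A zero-row data frame is a cheap edge case to run the helper against.
edge_case <- col_means(mtcars[0, ])
```

With no rows to average, each entry of edge_case is NaN, which is the kind of behaviour such a test is meant to surface.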

How to access single elements in a table in R

How do I grab elements from a table in R?
My data looks like this:
V1 V2
1 12.448 13.919
2 22.242 4.606
3 24.509 0.176
etc...
I basically just want to grab elements individually. I'm getting confused with all the R terminology, like vectors, and I just want to be able to get at the individual elements.
Is there a function where I can just do like data[v1][1] and get the element in row 1 column 1?
Try
data[1, "V1"] # Row first, quoted column name second, and case does matter
Further note: Terminology in discussing R can be crucial and sometimes tricky. Using the term "table" to refer to that structure leaves open the possibility that it was a 'table'-classed, a 'matrix'-classed, or a 'data.frame'-classed object. The answer above would succeed with any of them, while @BenBolker's suggestion below would only succeed with a 'data.frame'-classed object.
There is a ton of free introductory material for beginners in R: CRAN: Contributed Documentation
?"[" pretty much covers the various ways of accessing elements of things.
Under usage it lists these:
x[i]
x[i, j, ... , drop = TRUE]
x[[i, exact = TRUE]]
x[[i, j, ..., exact = TRUE]]
x$name
getElement(object, name)
x[i] <- value
x[i, j, ...] <- value
x[[i]] <- value
x$i <- value
The second item is sufficient for your purpose
Under Arguments it points out that with [ the arguments i and j can be numeric, character or logical
So these work:
data[1,1]
data[1,"V1"]
As does this:
data$V1[1]
and keeping in mind a data frame is a list of vectors:
data[[1]][1]
data[["V1"]][1]
will also both work.
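To see that these forms agree, here is a small sketch that rebuilds the three rows shown in the question as a data frame (the name data follows the question; the matrix m is added here only for comparison):

```r
# The three rows from the question, rebuilt as a data frame.
data <- data.frame(V1 = c(12.448, 22.242, 24.509),
                   V2 = c(13.919, 4.606, 0.176))

# Five ways to reach row 1, column 1.
results <- c(data[1, 1],
             data[1, "V1"],
             data$V1[1],
             data[[1]][1],
             data[["V1"]][1])

# On a matrix the [i, j] forms still work, but $ does not.
m <- as.matrix(data)
from_matrix <- m[1, "V1"]
```

All five entries of results, and from_matrix, hold the same value, 12.448.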
So that's a few things to be going on with. I suggest you type in the examples at the bottom of the help page one line at a time (yes, actually type the whole thing in, one line at a time, and see what each line does). You'll pick things up very quickly, and typing rather than copy-pasting is an important part of committing it to memory.
Maybe not as good as the answers above, but I guess this is what you were looking for.
data[1:1,3:3] #works with positive integers
data[1:1, -3:-3] #does not select one element: gives the entire 1st row without the 3rd element
data[i:i,j:j] #given that i and j are positive integers
Indexing starts at 1, i.e.,
data[1:1,1:1] #means the top-leftmost element

Problem with data.table ifelse behavior

I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=240min"),Result],
ifelse(grepl("hlm",inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=60"),Result],
dt[which(dt[,Time] == "t=30"),Result]))
t0Value <- dt[which(dt[,Time] == "t=0"),Result]
return(dt[,Ratio:=tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Results for all of the t0Values, but what I want is a single ratio for each unique by group.
Thanks for the help.
You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) acts very weird: It produces a result with the attributes and length from test and the values from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm or maybe it is since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
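The difference between the two is easy to see on a toy input. In this sketch, yes_values and pick_tmax_label are hypothetical names, and the helper only mirrors the file-name logic from the question; it is not the asker's function.

```r
yes_values <- c(10, 20, 30)

# ifelse() returns something shaped like `test`, so a length-one test
# keeps only the first element of `yes`.
from_ifelse <- ifelse(TRUE, yes_values, 0)    # 10

# if/else returns the chosen branch as it is.
from_if <- if (TRUE) yes_values else 0        # 10 20 30

# Hypothetical helper: choose the time label from the file name with if/else.
pick_tmax_label <- function(in_file) {
  if (grepl("hhep", in_file, ignore.case = TRUE)) {
    "t=240min"
  } else if (grepl("hlm", in_file, ignore.case = TRUE)) {
    "t=60"
  } else {
    "t=30"
  }
}

label <- pick_tmax_label("run_HLM_01.csv")    # "t=60"
```

Because the test here is a single TRUE or FALSE, if/else matches the intent and does not silently truncate the selected values.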
