keep the order of the subsetting vector with data.table


I have a simple question, but I can't figure out a simple solution:
library(data.table)
plouf <- data.table(1:10,letters[1:10])
plouf[V1 %in% c(3,1),V2]
[1] "a" "c"
I would like the output to keep the order of the subsetting vector, i.e. "c" "a". What are the possibilities?
I have
sapply(c(3,1),function(x){plouf[V1 == x,V2]})
but I find it ugly.
edit
I have
setkey(plouf,V1)
plouf[c(3,1),V2]
which is surely the idiomatic way for data.table.
Still, I am curious about what other solutions exist.

Here is one option with match that can be used in data.table and in base R as well. Unlike %in%, match returns the position index of the first match and this can be used to get the corresponding elements of the other column 'V2'
plouf[, V2[match(c(3, 1), V1)]]
#[1] "c" "a"
plouf[, match(c(3, 1), V1)] # returns numeric index
#[1] 3 1
plouf[, V1 %in% c(3, 1)] # returns logical vector
#[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Because %in% returns a logical vector, using it to extract elements returns the element at each TRUE position in the table's own order, i.e. it extracts from the 1st and 3rd positions instead of the 3rd and 1st.
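As a small base R illustration of that difference (the vectors ids, vals and wanted are made up for this sketch, and the value 11 is included only to show what happens when a requested value has no match):

```r
ids    <- 1:10
vals   <- letters[1:10]
wanted <- c(3, 1, 11)                    # 11 does not occur in ids

idx <- match(wanted, ids)                # positions in the order of 'wanted'; NA when absent
in_order <- vals[idx]                    # keeps the requested order, NA for unmatched values
matched_only <- vals[idx[!is.na(idx)]]   # same, with unmatched requests dropped

by_mask <- vals[ids %in% wanted]         # logical mask: follows the order of 'ids'
```

Here in_order and matched_only follow the order of wanted, while by_mask follows the order of ids.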

Using data.table keys will accomplish what you're going for here; the "Keys and fast binary search based subset" vignette explains the usage.
library(data.table)
plouf <- data.table(1:10,letters[1:10])
## Set a key
setkey(plouf,V1)
## Use .() syntax for key subsetting to get associated values of V2
plouf[.(c(3,1)),V2]
#[1] "c" "a"
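If you would rather not set a key (setkey physically reorders the table), a join with the on= argument should give the same ordering. This is only a sketch, assuming a data.table version recent enough to support on= (1.9.6 or later, as far as I know); the explicit column names are added here for readability.

```r
library(data.table)

plouf <- data.table(V1 = 1:10, V2 = letters[1:10])

# Join the lookup values against V1 without keying the table;
# rows come back in the order of the lookup vector.
res <- plouf[.(V1 = c(3, 1)), on = "V1", V2]
res
```

The result should be "c" "a", matching the keyed version above.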

Related

Is there a way to retrieve the vectors selected by fcoalesce?

When using fcoalesce, is there any way I can retrieve the indices or names of the selected vectors?
Here is a simplified two-vector example, for the following coalesce of vectors a and b:
library(data.table)
a = c(NA,2,3,4,NA)
b = c(1,3,3,4,5)
fcoalesce(a,b)
[1] 1 2 3 4 5
I'd like to see something like this:
b,a,a,a,b
A real life example could have any number of vectors.

We can use ifelse - coalesce is simply taking the first non-NA for each row/element between two vectors/columns. Thus, create a logical condition for NA elements, and specify the 'yes', 'no' as the object names
ifelse(is.na(a), 'b', 'a')
[1] "b" "a" "a" "a" "b"

I managed to solve it by merging all vectors into a data.table (dt_combined) and coalescing them iteratively:
apply(dt_combined, 1, function(i){
(1:length(dt_combined))[ which(!is.na(i))[1] ]
})
One could also get the column names instead of the column index:
apply(dt_combined, 1, function(i){
colnames(dt_combined)[ which(!is.na(i))[1] ]
})
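For any number of vectors, one vectorised alternative is to build a logical "not NA" matrix and ask max.col for the first TRUE in each row. This is a sketch in base R, not something fcoalesce provides itself; the list name vecs is made up for the example, and it assumes the vectors have equal length. Note that a position where every vector is NA would be reported as the first vector, so check for that case separately if it can occur.

```r
a <- c(NA, 2, 3, 4, NA)
b <- c(1, 3, 3, 4, 5)
vecs <- list(a = a, b = b)                        # hypothetical container, any number of vectors

# One column per vector; TRUE where that vector has a usable value
not_na <- !do.call(cbind, lapply(vecs, is.na))

# Column index of the first TRUE in each row, i.e. the vector coalesce would pick
first_idx <- max.col(not_na, ties.method = "first")
picked <- names(vecs)[first_idx]
picked
```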

Check if value is in data frame

I'm trying to check if a specific value is anywhere in a data frame.
I know the %in% operator should allow me to do this, but it doesn't seem to work the way I would expect when applying to a whole data frame:
A = data.frame(B=c(1,2,3,4), C=c(5,6,7,8))
1 %in% A
[1] FALSE
But if I apply this to the specific column the value is in, it works the way I expect:
1 %in% A$B
[1] TRUE
What is the proper way of checking if a value is anywhere in a data frame?

You could do:
any(A==1)
#[1] TRUE
OR with Reduce:
Reduce("|", A==1)
OR
length(which(A==1))>0
OR
is.element(1,unlist(A))

To find the location of that value you can do, for example:
which(A == 1, arr.ind=TRUE)
# row col
#[1,] 1 1

Or simply
sum(A == 1) > 0
#[1] TRUE

Loop through the variables with sapply, then use any.
any(sapply(A, function(x) 1 %in% x))
[1] TRUE
or following digEmAll's comment, you could use unlist, which takes a list (data.frame) and returns a vector.
1 %in% unlist(A)
[1] TRUE

The trick to understanding why your first attempt doesn't work comes down to understanding what a data frame is: a list of vectors of equal length. 1 %in% A checks whether 1 matches one of those vectors (columns) taken as a whole, whereas what you want is to check whether any of the values inside those vectors matches.
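A short sketch of that point, using the data frame from the question (the names per_column and anywhere are just for illustration):

```r
A <- data.frame(B = c(1, 2, 3, 4), C = c(5, 6, 7, 8))

is.list(A)    # a data frame is a list ...
length(A)     # ... with one element per column

# Test each column's values separately, then combine
per_column <- sapply(A, function(col) 1 %in% col)
anywhere <- any(per_column)
```

per_column also tells you which column contains the value, which any(A == 1) does not.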

Try:
any(A == 1)
Returns FALSE or TRUE

Index a Particular Numeric Vector From a List of Vectors in R

In R, for the sake of example, I have a list composed of equal-length numeric vectors of form similar to:
list <- list(c(1,2,3),c(1,3,2),c(2,1,3))
[[1]]
[1] 1 2 3
[[2]]
[1] 1 3 2
[[3]]
[1] 2 1 3
...
Every element of the list is unique. I want to get the index number of the element x <- c(2,1,3), or any other particular numeric vector within the list.
I've attempted using match(x,list), which gives a vector full of NA, and which(list==c(1,2,3)), which gives me a "(list) object cannot be coerced to type 'double'" error. Coercing the list to different types didn't seem to make a difference for the which function. I also attempted various grep* functions, but these don't return exact numeric vector matches. Using find(c(1,2,3),list) or even some fancy sapply which %in% type functions didn't give me what I was looking for. I feel like I have a type problem. Any suggestions?
--Update--
Summary of Solutions
Thanks for your replies. The method in the comment for this question is clean and works well (via akrun).
> which(paste(list)==deparse(x))
[1] 25
The next method didn't work correctly
> which(duplicated(c(x, list(y), fromLast = TRUE)))
[1] 49
> y
[1] 1 2 3
This sounds good, but in the next block you can see the problem
> y<-c(1,3,2)
> which(duplicated(c(list, list(y), fromLast = TRUE)))
[1] 49
More fundamentally, there are only 48 elements in the list I was using.
The last method works well (via BondedDust), and I would guess it is more efficient using an apply function:
> which( sapply(list, identical, y ))
[1] 25

match works fine if you pass it the right data.
L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
match(list(c(2,1,3)), L)
#[1] 3
Beware that this works via coercing lists to character, so fringe cases will fail (with a hat-tip to @nicola):
match(list(1:3),L)
#[1] NA
even though:
1:3 == c(1,2,3)
#[1] TRUE TRUE TRUE
Although arguably:
identical(1:3,c(1,2,3))
#[1] FALSE
identical(1:3,c(1L,2L,3L))
#[1] TRUE

You can use duplicated(). If we add the matching vector to the end of the original list and set fromLast = TRUE, we will find the duplicate(s). Then we can use which() to get the index.
which(duplicated(c(list, list(c(2, 1, 3))), fromLast = TRUE))
# [1] 3
Or you could add it as the first element and subtract 1 from the result.
which(duplicated(c(list(c(2, 1, 3)), list))) - 1L
# [1] 3
Note that the type always matters with this type of comparison. When comparing integers and numerics, you will need to convert doubles to integers for this to run without issue. For example, 1:3 is not the same type as c(1, 2, 3).
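If the target may be stored as integer while the list holds doubles, two ways around the type issue are to coerce the target before an exact comparison, or to compare with all.equal, which treats integer and double values as comparable. A sketch with the example list (the names L, x, exact_idx and tolerant_idx are illustrative):

```r
L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))
x <- 1:3                                   # integer target, list elements are double

# Option 1: coerce the target, then compare exactly
exact_idx <- which(vapply(L, identical, logical(1), as.numeric(x)))

# Option 2: tolerant comparison that ignores the integer/double distinction
tolerant_idx <- which(vapply(L, function(v) isTRUE(all.equal(v, x)), logical(1)))
```

Both return the position of c(1, 2, 3) in L, whereas identical(L[[1]], x) on its own is FALSE.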

> L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
> sapply(L, identical, c(2,1,3))
[1] FALSE FALSE TRUE
> which( sapply(L, identical, c(2,1,3)) )
[1] 3
This would be slightly less restrictive in its test:
> which( sapply(L, function(x,y){all(x==y)}, c(1:3)) )
[1] 1

Try:
vapply(list,function(z) all(z==x),TRUE)
#[1] FALSE FALSE TRUE
Wrapping the above line in which() gives you the index of the matching list element.

Finding the existence of a vector within matrix within list within list

I have tried to use R to find a vector within a matrix within a list within a list. I tried to check whether the vector 'ab' exists using the following 'exists' calls, but none of them work. How can I make it work?
aa <- list(x = matrix(1,2,3), y = 4, z = 3)
colnames(aa$x) <- c('ab','bb','cb')
aa
#$x
# ab bb cb
#[1,] 1 1 1
#[2,] 1 1 1
#
#$y
#[1] 4
#
#$z
#[1] 3
exists('ab', where=aa)
#[1] FALSE
exists('ab', where=aa$x)
# Error in exists("ab", where = aa$x) : invalid 'envir' argument
exists('ab', where=colnames(aa$x))
# Error in as.environment(where) : no item called "ab" on the search list
colnames(aa$x)
#[1] "ab" "bb" "cb"

Column names are an attribute of a matrix or data.frame. So we loop over the list using sapply, get the column names (colnames), unlist the result, and check whether 'ab' is in that vector.
'ab' %in% unlist(sapply(aa, colnames))
#[1] TRUE
If we want to be more specific for a particular list element, we extract the element (aa$x), get the column names and check whether 'ab' is among them.
'ab' %in% colnames(aa$x)
#[1] TRUE
Another option is to loop through 'aa' and, if the element is a matrix, extract the 'ab' column and check whether it is a vector. Wrapping the sapply in any gives a single TRUE/FALSE output.
any(sapply(aa, function(x) if(is.matrix(x)) is.vector(x[, 'ab']) else FALSE))
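Note that the last variant would raise a "subscript out of bounds" error for a matrix that has no 'ab' column. A slightly more defensive sketch checks the column names instead of extracting the column, and also reports which list element holds it (has_col is a name made up for this example):

```r
aa <- list(x = matrix(1, 2, 3), y = 4, z = 3)
colnames(aa$x) <- c("ab", "bb", "cb")

# TRUE only for elements that are matrices with a column named 'ab'
has_col <- vapply(aa, function(el) is.matrix(el) && "ab" %in% colnames(el), logical(1))

any(has_col)              # does 'ab' exist anywhere?
names(has_col)[has_col]   # which element contains it
```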

How to pass decreasing and/or na.last argument to sort through tapply in R

I am teaching myself the basics of R and have been encountering trouble using the function tapply when passing the sort function while trying to use non-default optional arguments for sort. Here is an example of the trouble I am facing:
Given the vectors
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c","d")
I find that
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
results in the same output regardless of the logical values I assign to decreasing and na.last. In fact, the output always matches what the sort default values would give
decreasing = FALSE, na.last = NA
For the record, when inputting the above example, the output is
> tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
1 1.1 2.1
"b" "a" "c"
Let me also mention that if I define the alternate function
sort2 <- function(v) sort(v, decreasing=TRUE, na.last=TRUE);
and pass sort2 to tapply instead, I still encounter the same trouble.
I am running this code on Mac OS X 10.10.4, using R 3.2.0. Calling sort on its own, without passing it through tapply, gives the desired behavior: it responds appropriately when I alter the decreasing and na.last arguments.
Thank you in advance for any help.

I don't think you're using tapply() correctly.
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
The above line of code basically says "sort vector y, grouping by categorical vector x". Your vector x is not really a categorical vector at all; it's a numeric vector with only distinct values, plus an NA. tapply() ignores the NA index and treats each of the remaining three distinct values in x as a separate group, so it passes each of the three corresponding character strings from y to its own call of sort(). Sorting a single element obviously has no effect, which explains why your customization arguments change nothing. The result is then returned ordered by the x groups.
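To see those single-element groups directly, you can split y by x yourself; split() builds the same kind of grouping from x, dropping the NA. This is only an illustration, with groups and group_sizes as made-up names:

```r
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c", "d")

groups <- split(y, x)            # one group per distinct non-NA value of x
group_sizes <- lengths(groups)   # every group has exactly one element
group_sizes
```

With one element per group, there is nothing for decreasing or na.last to act on.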
Here's an example of how to do what I think you're trying to do:
x <- c(NA, 1, 2, 3, NA, 2, 1, 3)
g <- rep(letters[1:2], each = 4)
x
## [1] NA 1 2 3 NA 2 1 3
g
## [1] "a" "a" "a" "a" "b" "b" "b" "b"
tapply(x, g, sort, decreasing = TRUE, na.last = TRUE)
## $a
## [1] 3 2 1 NA
##
## $b
## [1] 3 2 1 NA
##
Edit: When you want to sort a vector by another vector, you can use order():
y[order(x, decreasing = TRUE, na.last = TRUE)]
## [1] "c" "a" "b" "d"
y[order(x, decreasing = FALSE, na.last = TRUE)]
## [1] "b" "a" "c" "d"
