R ordering bug? - r

R Fiddle
vals<-c(10.3,10.3,10.2,16.4,18.8,19.7,15.6,18.2,22.6,19.9,24.2,21.0,21.4,21.3,19.1,22.2,33.8,27.4,25.7,24.9,34.5,31.7,36.3,38.3,42.6,55.4,55.7,58.3,51.5,51.0,77.0)
# Standard Order
# the second and third values should be reversed
order(vals)
# ------------------------------------------------------------
# [1] 3 1 2 7 4 8 5 15 6 10 12 14 13 16 9 11 20 19 18 22 17 21 23 24 25
# [26] 30 29 26 27 28 31
# ------------------------------------------------------------
# Reverse Decreasing
# should be the same as the original, but it isn't (it's correct)
rev(order(vals, decreasing=TRUE))
# ------------------------------------------------------------
# [1] 3 2 1 7 4 8 5 15 6 10 12 14 13 16 9 11 20 19 18 22 17 21 23 24 25
# [26] 30 29 26 27 28 31
# ------------------------------------------------------------
I need some help understanding what is happening in R. I think there's a bug in order, because the two outputs above are not the same. Notice the second and third values of both outputs. Shouldn't the order be 3,3,1 or 2,2,1 or 3,2,1, depending on how order treats equal values? Regardless, the third value should have order=1.
Is my understanding correct, or am I missing something?

As per the documentation,
order returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.
i.e. order() returns a set of indices such that x[order(x)] is in increasing order, or that x[order(x,decreasing = TRUE)] is in decreasing order.
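If two values in x are identical, then either arrangement of their indices in the value returned by order is a valid answer, because both arrangements sort x equally well. According to ?order, unresolved ties are left in their original order. That is why order(vals) lists index 1 before index 2, and why reversing the decreasing result lists index 2 before index 1. Neither output is a bug.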
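A minimal sketch of the point above, using a shortened version of the question's vector (the names asc and desc are only illustrative). It checks that both permutations sort the data to the same values, even though the tied indices may come out in a different relative order:

```r
# Shortened version of the question's vector: elements 1 and 2 are tied
vals <- c(10.3, 10.3, 10.2, 16.4)

asc  <- order(vals)                          # ascending permutation
desc <- rev(order(vals, decreasing = TRUE))  # reversed descending permutation

# Both permutations put the data into the same ascending order ...
identical(vals[asc], vals[desc])

# ... even if the tied indices 1 and 2 are listed in a different relative order
asc
desc
```

If a particular tie order matters, one option is to pass a second key to order, for example order(vals, seq_along(vals)).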

Related

Unexpected behavior of order(x, na.last = FALSE) [duplicate]

I am having trouble understanding the difference between the R functions rank and order. They seem to produce the same output:
> rank(c(10,30,20,50,40))
[1] 1 3 2 5 4
> order(c(10,30,20,50,40))
[1] 1 3 2 5 4
Could somebody shed some light on this for me?
Thanks
set.seed(1)
x <- sample(1:50, 30)
x
# [1] 14 19 28 43 10 41 42 29 27 3 9 7 44 15 48 18 25 33 13 34 47 39 49 4 30 46 1 40 20 8
rank(x)
# [1] 9 12 16 25 7 23 24 17 15 2 6 4 26 10 29 11 14 19 8 20 28 21 30 3 18 27 1 22 13 5
order(x)
# [1] 27 10 24 12 30 11 5 19 1 14 16 2 29 17 9 3 8 25 18 20 22 28 6 7 4 13 26 21 15 23
rank returns a vector with the "rank" of each value: the number in the first position is 9 because x[1] is the 9th lowest value. order returns the indices that would put the initial vector x in order.
The 27th value of x is the lowest, so 27 is the first element of order(x), and if you look at rank(x), the 27th element is 1.
x[order(x)]
# [1] 1 3 4 7 8 9 10 13 14 15 18 19 20 25 27 28 29 30 33 34 39 40 41 42 43 44 46 47 48 49
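As a small illustration of the relationship described above, the following sketch uses a hypothetical four-element vector without ties (not taken from the original answer). It shows that rank and order are inverse permutations of each other when there are no ties:

```r
# Hypothetical vector without ties, chosen so that rank and order differ
x <- c(30, 10, 50, 20)

o <- order(x)  # indices of x in ascending order of value
r <- rank(x)   # position of each element in the sorted vector

# Indexing one permutation by the other recovers 1, 2, ..., length(x)
all(r[o] == seq_along(x))
all(o[r] == seq_along(x))
```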
As it turned out, the example in the question is a special case, which made things confusing. I explain below for anyone interested:
rank returns the position each element would have in an ascending list
order returns, for each position in the ascending list, the index of the original element that belongs there
I always find it confusing to think about the difference between the two, and I always think, "how can I get to order using rank"?
Starting with Justin's example:
Order using rank:
## Set up the example to match Justin's example
set.seed(1)
x <- sample(1:50, 30)
## Make a vector to store the sorted x values
xx <- integer(length(x))
## i is the index into x, ir is the rank of x[i]
i <- 0
for (ir in rank(x)) {
  i <- i + 1
  xx[ir] <- x[i]  # place x[i] at its sorted position
}
all(xx == x[order(x)])
# [1] TRUE
rank is more complicated and not necessarily an index (integer), because tied values share the average of their ranks by default:
> rank(c(1))
[1] 1
> rank(c(1,1))
[1] 1.5 1.5
> rank(c(1,1,1))
[1] 2 2 2
> rank(c(1,1,1,1))
[1] 2.5 2.5 2.5 2.5
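The fractional values above come from the default ties.method = "average". As a hedged sketch on a hypothetical vector, the other documented tie methods of rank give whole-number ranks:

```r
# Hypothetical vector with two pairs of tied values
x <- c(20, 10, 20, 10)

rank(x)                         # default "average": ties share the mean rank
rank(x, ties.method = "min")    # ties all get the lowest rank of the group
rank(x, ties.method = "max")    # ties all get the highest rank of the group
rank(x, ties.method = "first")  # ties are numbered in order of appearance
```

Only the "first" method yields a permutation of 1:length(x), which is what the loop above would need if x contained ties.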
In layman's terms, order gives, for each position in the sorted vector, the position that value had in the original vector.
For example:
a <- c(3,4,2,7,8,5,1,6)
sort(a)
# [1] 1 2 3 4 5 6 7 8
The position of 1 in a is 7. Similarly, the position of 2 in a is 3.
order(a)
# [1] 7 3 1 2 6 8 4 5
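As stated by ?order at the R prompt,
order just returns a permutation which sorts the original vector into ascending/descending order.
Suppose that we have a vector
A <- c(1,4,3,6,7,4)
A.sort <- sort(A)
then, for values that are not tied,
order(A) == match(A.sort, A)
rank(A) == match(A, A.sort)
# match() only returns the first matching position, so with the tied 4s in A
# each comparison is FALSE at one or more of the tied positions
Besides, I find that order has the following properties (not validated theoretically):
1. order(A) is a permutation of 1:length(A).
2. order(order(order(....order(A)....))): if you apply order to A an odd number of times, the result is always the same as order(A), and likewise every even number of applications gives the same result as order(order(A)).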
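The repeated-order property can be checked directly. This is a small sketch using the vector A from above; the names o1, o2 and o3 are only illustrative. Since order(A) is a permutation, applying order to it gives its inverse, so the results alternate between two vectors:

```r
A <- c(1, 4, 3, 6, 7, 4)

o1 <- order(A)   # sorting permutation of A
o2 <- order(o1)  # inverse of o1
o3 <- order(o2)  # inverse of the inverse, i.e. o1 again

# Odd numbers of applications agree with order(A)
identical(o1, o3)

# Even numbers of applications agree with the ranks, ties numbered by appearance
all(o2 == rank(A, ties.method = "first"))
```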
some observations:
set.seed(0)
x<-matrix(rnorm(10),1)
dm<-diag(length(x))
# compute rank from order and backwards:
rank(x) == col(x)%*%dm[order(x),]
order(x) == col(x)%*%dm[rank(x),]
# in special cases like this, where the sorting permutation is its own inverse
x<-cumsum(rep(c(2,0),times=5))+rep(c(0,-1),times=5)
# they are equal
order(x)==rank(x)

How does the order() function in R work for character vectors? [duplicate]


How to create new columns in data.frame based on letter AND number character objects in a column in R [duplicate]

I have a data frame with a column filled with data like so: chromosome and then base position, all in one column. I filled in the remaining columns V2 through V5 with integers just to simulate a similar data.frame.
> test
V1 V2 V3 V4 V5
1 I.1286480 9 17 25 33
2 I.1898932 10 18 26 34
3 I.11871397 11 19 27 35
4 II.1252994 12 20 28 36
5 II.18175911 13 21 29 37
6 III.10298347 14 22 30 38
7 IV.123478912 15 23 31 39
8 V.12837471234 16 24 32 40
with other data in the following columns. This is a huge data set, with 115,000 rows. I want to make two new columns, one containing the roman numerals (I, II, III, IV, V) and another column containing the number following the roman numerals. The issues I'm having trouble with are that this is a vector of character objects, so I'm not sure how to parse out the letters from the numbers. I tried using StrPos from DescTools package, but
> StrPos(test$V1, "I")
[1] 1 1 1 1 1 1 1 NA
> StrPos(test$V1, "I.")
[1] 1 1 1 1 1 1 1 NA
it returns positions of all "I"s, not just the objects with one instance of "I". I'm wondering whether substring would work? But then I have the problem of all the roman numerals being of different lengths, as well as the numbers following the roman numerals being of different lengths as well. I know there must be a simple solution to this problem, but the only things I can think up are very long for and if loops. Help me, stackoverflow, you're my only hope!
Using separate from tidyr:
library(tidyr)
separate(test, V1, into = c("chr", "pos"))
chr pos V2 V3 V4 V5
1 I 1286480 9 17 25 33
2 I 1898932 10 18 26 34
3 I 11871397 11 19 27 35
4 II 1252994 12 20 28 36
5 II 18175911 13 21 29 37
6 III 10298347 14 22 30 38
7 IV 123478912 15 23 31 39
8 V 12837471234 16 24 32 40
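For readers who prefer not to add a package, the same split can be sketched in base R with sub. The small test data frame below is a hypothetical three-row stand-in for the one in the question, and the new column names chr and pos simply mirror the tidyr answer:

```r
# Hypothetical three-row stand-in for the data frame in the question
test <- data.frame(V1 = c("I.1286480", "II.1252994", "IV.123478912"),
                   V2 = 9:11,
                   stringsAsFactors = FALSE)

# Everything before the first dot is the chromosome (roman numeral)
test$chr <- sub("\\..*$", "", test$V1)

# Everything after the first dot is the base position, stored as a number
test$pos <- as.numeric(sub("^[^.]*\\.", "", test$V1))

test
```

Both calls are vectorised over the whole column, so no for or if loops are needed even with 115,000 rows.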

rank and order in R


How do nested replacement operators work?

I was working with a list of lists today and needed to replace an element of one of the second-level lists. The way to do this seemed obvious, but I realized I wasn't actually clear on why it worked.
Here's an example:
a <- list(aa=list(aaa=1:10,bbb=11:20,ccc=21:30),bb=list(ddd=1:5))
Given this data structure, let's say I want to replace the 3rd element of the nested numeric vector aaa. I could do something like:
newvalue <- 100
a$aa$aaa[3] <- newvalue
Doing this seems obvious enough, but I couldn't explain to myself how this expression actually gets evaluated. Working with the quote function, I cobbled together some rough logic, along the lines of:
(1) Create and submit the top-level function call:
`<-`(a$aa$aaa[3],newvalue)
(2) Lazy evaluation of first argument in (1), call function '[':
`[`(a$aa$aaa,3)
(3) Proceed recursively down:
`$`(a$aa,"aaa")
(4) ...further down, call '$' again:
`$`(a,"aa")
(5) With (4) returning an actual data structure, proceed back "up the stack" substituting the returned data structures until the actual assignment is made in (1).
I guess my confusion involves some aspects of lazy evaluation and/or evaluation environments. In the example above, I've simply reassigned one element of a vector. But how is it that R keeps track of where that vector is within the greater data structure?
Cheers
I think a$aa$aaa[3] works as follows:
Single square brackets, [], return a subset of the same type as the object. For a list, a["aa"] is still a list that contains aa, so the element is not extracted from its container and you cannot work on it directly.
When you access the element using $, a$aa is equivalent here to a[["aa"]], which extracts the element itself from the current object.
The total expression a$aa$aaa[3] translates to a[["aa"]][["aaa"]][3]. This walks down a nested list ->
Take object a, extract list aa.
Extract vector aaa.
Access the third element in vector aaa.
The R evaluation:
> a
$aa
$aa$aaa
[1] 1 2 3 4 5 6 7 8 9 10
$aa$bbb
[1] 11 12 13 14 15 16 17 18 19 20
$aa$ccc
[1] 21 22 23 24 25 26 27 28 29 30
$bb
$bb$ddd
[1] 1 2 3 4 5
> a$aa
$aaa
[1] 1 2 3 4 5 6 7 8 9 10
$bbb
[1] 11 12 13 14 15 16 17 18 19 20
$ccc
[1] 21 22 23 24 25 26 27 28 29 30
> a$aa$aaa
[1] 1 2 3 4 5 6 7 8 9 10
> a[["aa"]]
$aaa
[1] 1 2 3 4 5 6 7 8 9 10
$bbb
[1] 11 12 13 14 15 16 17 18 19 20
$ccc
[1] 21 22 23 24 25 26 27 28 29 30
> a[["aa"]][["aaa"]]
[1] 1 2 3 4 5 6 7 8 9 10
> a["aa"]
$aa
$aa$aaa
[1] 1 2 3 4 5 6 7 8 9 10
$aa$bbb
[1] 11 12 13 14 15 16 17 18 19 20
$aa$ccc
[1] 21 22 23 24 25 26 27 28 29 30
> a["aa"]["aaa"]
$<NA>
NULL
Hope this helps.
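On the assignment itself: the R Language Definition (section on subset assignment) describes a nested replacement as being expanded into nested calls to replacement functions, evaluated against a temporary copy of the top-level object, with the result assigned back to a. The following is a hand-written sketch of that expansion, not the interpreter's literal code. It uses `[[<-` in place of `$<-` (equivalent for these lists) and a second, illustrative object b to hold the manual version:

```r
a <- list(aa = list(aaa = 1:10, bbb = 11:20, ccc = 21:30), bb = list(ddd = 1:5))
b <- a  # copy used for the hand-expanded version
newvalue <- 100

# The usual nested replacement
a$aa$aaa[3] <- newvalue

# Hand-expanded: innermost replacement first, each result passed outwards,
# and the rebuilt top-level list finally assigned back to the name b
b <- `[[<-`(b, "aa",
            `[[<-`(b[["aa"]], "aaa",
                   `[<-`(b[["aa"]][["aaa"]], 3, newvalue)))

identical(a, b)
```

So R does not track where the vector lives inside the larger structure. It rebuilds each enclosing level with the modified piece and rebinds the top-level name.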
