Rank function inconsistency with the expected output in R

As I read in the documentation for rank, it has a ties.method argument to specify what happens when ties occur.
For the vector c(2,3,4,4,5,6), as Matt Krause suggested (verified in the sketch after this list):
average assigns each tied element the "average" rank. The ranks would therefore be 1, 2, 3.5, 3.5, 5, 6
first lets the "earlier" entry "win", so the ranks are in numerical order (1,2,3,4,5,6)
min assigns every tied element to the lowest rank, so you get 1,2,3,3,5,6
max does the opposite: tied elements get the highest rank (1,2,4,4,5,6)
random breaks ties randomly, so you'd get either (1,2,3,4,5,6) or (1,2,4,3,5,6).
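For reference, a quick base-R check of these behaviors (the random method is omitted since its output varies):
x <- c(2, 3, 4, 4, 5, 6)
rank(x)                          # average (default): 1 2 3.5 3.5 5 6
rank(x, ties.method = "first")   # 1 2 3 4 5 6
rank(x, ties.method = "min")     # 1 2 3 3 5 6
rank(x, ties.method = "max")     # 1 2 4 4 5 6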
BUT I need this output: (1,2,3,3,4,5). What can I do to get that?
I want to use the output to fill in another matrix (X) which has 5 columns. The final output for this instance should be: (1,1,2,1,1), which means that we have 2 of the third-ranked item and one of each of the rest.
Now, if we have (2,3,4,4,5,6) as instance 1 and (2,3,3,3,4,2) as instance 2, in matrix (X), they will be converted to:
(1,1,2,1,1)
(2,3,1,0,0)
(the number of columns of matrix (X) equals the number of unique values across all instances; since all the numbers are between 2 and 6, we have 5 different values in total)
I think rank does not work correctly in this situation.

There's probably a more efficient/shorter way to compute the unique values of the union of all instances, but otherwise this is pretty much as @whuber suggested in the comments:
Test case:
instances <- list(c(2,3,4,4,5,6),c(2,3,3,3,4,2))
The only tricky part is making sure we have the full range of levels so that zeros get counted properly:
ulevs <- sort(Reduce(union, instances))  # every level appearing in any instance, sorted
f <- function(x) {
  table(factor(x, levels = ulevs))  # counts per level, so absent levels count as 0
}
Apply and convert to a matrix:
t(sapply(instances,f))
## 2 3 4 5 6
## [1,] 1 1 2 1 1
## [2,] 2 3 1 0 0
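As for the ranking itself: the output the question asks for, (1,2,3,3,4,5), is a "dense" rank, which base rank has no ties.method for. A minimal base-R sketch:
x <- c(2, 3, 4, 4, 5, 6)
match(x, sort(unique(x)))  # tied values share a rank and no ranks are skipped
# [1] 1 2 3 3 4 5
dplyr::dense_rank() implements the same ranking.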

Related

Removing extreme values in a dataframe while sorting by multiple columns in R

I have a dataframe like this:
mydf <- data.frame(A = c(40,9,55,1,2), B = c(12,1345,112,45,789))
mydf
A B
1 40 12
2 9 1345
3 55 112
4 1 45
5 2 789
I want to retain only 95% of the observations and throw out the 5% of the data that have extreme values. First, I calculate how many observations to keep:
th <- length(mydf$A) * 0.95
And then I want to remove all the rows above th (or retain the rows below th, as you wish). I need to sort mydf in ascending order, to remove only those extreme values. I tried several approaches:
mydf[order(mydf["A"], mydf["B"]),]
mydf[order(mydf$A,mydf$B),]
mydf[with(mydf, order(A,B)), ]
plyr::arrange(mydf,A,B)
but nothing works: mydf is not sorted in ascending order by the two columns at the same time. I looked here: Sort (order) data frame rows by multiple columns, but the most common solutions do not work and I don't get why.
However, if I consider only one column at a time (e.g., A), those ordering methods work, but then I don't get how to throw out the extreme values, because this:
mydf <- mydf[(order(mydf$A) < th),]
removes the second row, which has a value of 9, while my intent is to subset mydf retaining only the values below the threshold (intended in this case as a number of observations, not a value).
I can imagine it is something very simple and basic that I am missing... And probably there are nicer tidyverse approaches.
I think you want rank here, but it doesn't work on multiple columns. To work around that, note that rank(.) is equivalent to order(order(.)) (exactly so when there are no ties; with ties, order(order(.)) corresponds to ties.method = "first"):
rank(mydf$A)
# [1] 4 3 5 1 2
order(order(mydf$A))
# [1] 4 3 5 1 2
With that, we can order on both (all) columns, then order again, then compare the resulting ranks with your th value.
mydf[order(do.call(order, mydf)) < th,]
# A B
# 1 40 12
# 2 9 1345
# 4 1 45
# 5 2 789
This approach has the benefit of preserving the original order of the rows.
If you would prefer to stick with a single call to order, then you can reorder them and use head:
head(mydf[order(mydf$A, mydf$B),], th)
# A B
# 4 1 45
# 5 2 789
# 2 9 1345
# 1 40 12
though this does not preserve the original order of rows (which may or may not be important to you).
Possible approach
An alternative to your approach would be to use a dplyr ranking function such as cume_dist() or percent_rank(). These can accept a dataframe as input and return ranks / percentiles based on all columns.
library(dplyr)  # cume_dist() and percent_rank() live in dplyr

set.seed(13)
dat_all <- data.frame(
  A = sample(1:60, 100, replace = TRUE),
  B = sample(1:1500, 100, replace = TRUE)
)
nrow(dat_all)
# 100
dat_95 <- dat_all[cume_dist(dat_all) <= .95, ]
nrow(dat_95)
# 95
General cautions about quantiles
More generally, keep in mind that defining quantiles is slippery, as there are multiple possible approaches. You'll want to think about what makes the most sense given your goal. As an example, from the dplyr docs:
cume_dist(x) counts the total number of values less than or equal to x_i, and divides it by the number of observations.
percent_rank(x) counts the total number of values less than x_i, and divides it by the number of observations minus 1.
Some implications of this are that the lowest value is always 1 / nrow() for cume_dist() but 0 for percent_rank(), while the highest value is always 1 for both methods. This means different cases might be excluded depending on the method. It also means the code I provided will always remove the highest-ranking row, which may or may not match your expectations. (e.g., in a vector with just 5 elements, is the highest value "above the 95th percentile"? It depends on how you define it.)
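To make the difference concrete, a small sketch (assuming dplyr is loaded) on a vector with five distinct values:
x <- c(10, 20, 30, 40, 50)
cume_dist(x)     # 0.20 0.40 0.60 0.80 1.00  (lowest value gets 1/n)
percent_rank(x)  # 0.00 0.25 0.50 0.75 1.00  (lowest value gets 0)
Both assign exactly 1 to the maximum, so a filter of <= .95 always drops it, as noted above.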

Match each row in a table to a row in another table based on the difference between row timestamps

I have two unevenly-spaced time series that each measure separate attributes of the same system. The two series' data points are not sampled at the same times, and the series are not the same length. I would like to match each row from series A to the row of B that is closest to it in time. What I have in mind is to add a column to A that contains the index of the closest row in B. Both series have a time column measured in Unix time (e.g. 1459719755).
For example, given two datasets
a time
2 1459719755
4 1459719772
3 1459719773
b time
45 1459719756
2 1459719763
13 1459719766
22 1459719774
The first dataset should be updated to
a time index
2 1459719755 1
4 1459719772 4
3 1459719773 4
since B[1,]$time has the closest value to A[1,]$time, B[4,]$time has the closest value to A[2,]$time and A[3,]$time.
Is there any convenient way to do this?
Try something like this:
(1+ecdf(bdat$time)(adat$time)*nrow(bdat))
[1] 1 4 4
Why should this work? The ecdf function returns another function that has a value from 0 to 1. It returns the "position" in the "probability range" [0,1] of a new value in a distribution of values defined by the first argument to ecdf. The expression is really just rescaling that function's result to the range [1, nrow(bdat)]. (I think it's flipping elegant.)
Another approach would be to use approxfun on the sorted values of bdat$time, which would then get you interpolated values. These might need to be rounded; using them as indices would instead truncate to integer.
apf <- approxfun(x = sort(bdat$time), y = seq_along(bdat$time), rule = 2)
apf( adat$time)
#[1] 1.000 3.750 3.875
round( apf( adat$time))
#[1] 1 4 4
In both cases you are predicting a sorted value from its "order statistic". In the second case you should check that ties are handled in the manner you desire.
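Note that both expressions locate a neighboring index rather than guaranteeing the nearest one in time. For an exact nearest match, a minimal base-R sketch (assuming the data frames are named adat and bdat, with the time columns shown above):
adat$index <- sapply(adat$time, function(t) which.min(abs(bdat$time - t)))
adat$index
# [1] 1 4 4
For large series, a data.table rolling join with roll = "nearest" does the same thing much faster.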

Mutate Cumsum with Previous Row Value

I am trying to run a cumsum on two separate columns of a data frame. They are essentially tabulations of events for two different variables; only one variable can have an event recorded per row. The way I attacked the problem was to create a new variable holding the value 1, and to create two new columns to sum the variables' totals. This works fine, and I can get the correct total number of occurrences, but the problem is that in my current ifelse statement, if the event recorded is for variable "A", then variable "B" is assigned 0. Instead, for every row, I want the other variable to carry forward its previous value, so that I don't end up with gaps where it goes from 1 to 2, to 0, to 3.
I don't want to run summarize on this either, I would prefer to keep each recorded instance and run new columns through mutate.
CURRENT DF:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 0 1
4 1 A 3 0
DESIRED RESULT:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 2 1
4 1 A 3 1
Thanks!
You can use the property of booleans that they can be summed as ones and zeroes. Therefore, you can use the cumsum function:
DF$Total.A <- cumsum(DF$Variable == "A")
Or, as a more general approach provided by @Frank, you can do:
uv <- unique(as.character(DF$Variable))
DF[, paste0("Total.", uv)] <- lapply(uv, function(x) cumsum(DF$Variable == x))
If your factor has many levels, you can get this in one step by dummy coding and then cumsum-ing the matrix.
X <- model.matrix(~ Variable + 0, DF)  # one 0/1 indicator column per level
apply(X, 2, cumsum)                    # running total down each column
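A minimal reproduction of the example above, using the question's table as the data (the construction of DF is an assumption, since the question only shows its printed form):
DF <- data.frame(Event = 1:4, Value = 1, Variable = c("A", "A", "B", "A"))
DF$Total.A <- cumsum(DF$Variable == "A")
DF$Total.B <- cumsum(DF$Variable == "B")
DF
#   Event Value Variable Total.A Total.B
# 1     1     1        A       1       0
# 2     2     1        A       2       0
# 3     3     1        B       2       1
# 4     4     1        A       3       1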

R select multiple rows by conditional row number

I have an R dataframe like this one:
a<-c(1,2,3,4,5)
b<-c(6,7,8,9,10)
df<-data.frame(a,b)
colnames(df)<-c("a","b")
df
a b
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I would like to get the 1st, 2nd, 3rd AND 5th row of the column a, so 1 2 3 5, by selecting rows by their number.
I have tried df$a[1:3,5] but I get Error in df$a[1:3, 5] : incorrect number of dimensions.
What DOES work is c(df$a[1:3],df$a[5]) but I was wondering if there was an easier way to achieve this with R?
Your data frame has two dimensions (rows and columns). When you use the square brackets to extract values, R expects everything before the comma to indicate the rows desired, and everything after the comma to indicate the columns desired (see ?"["). Hence, df[1:3,5] means rows 1 through 3, from column 5. To turn your desired rows into a single vector, you need to concatenate: c(1:3,5). That goes before the comma; the column indicator, 1 or "a", goes after the comma. Thus, df[c(1:3,5), 1] is what you need.
For an alternative answer (that might be more appropriate to a dataframe with many more columns), df[c(1:3, 5), "a"], as suggested by @Mamoun Benghezal, would also get it done!
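Both forms, run against the df built above:
df[c(1:3, 5), 1]
# [1] 1 2 3 5
df[c(1:3, 5), "a"]
# [1] 1 2 3 5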

Determining minimum values in a vector in R

I need some help in determining more than one minimum value in a vector. Let's suppose, I have a vector x:
x<-c(1,10,2, 4, 100, 3)
and would like to determine the indexes of the 3 smallest elements, i.e. 1, 2 and 3. I need the indexes because I will be using them to access the corresponding elements in another vector. Of course, sorting will provide the minimum values, but I want to know the indexes of their actual occurrence prior to sorting.
In order to find the index try this
which(x %in% sort(x)[1:3]) # this gives you an index vector
[1] 1 3 6
This says that the first, third and sixth elements are the three lowest values in your vector; to see which values these are, try:
x[which(x %in% sort(x)[1:3])] # this gives the vector of values
[1] 1 2 3
or just
x[c(1,3,6)]
[1] 1 2 3
If you have any duplicated values you may want to select unique values first and then sort them in order to find the indexes, like this (suggested by @Jeffrey Evans in his answer):
which(x %in% sort(unique(x))[1:3])
I think you mean you want to know the indices of the 3 smallest elements? In that case you want order(x)[1:3].
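Run against the original vector, this gives:
x <- c(1, 10, 2, 4, 100, 3)
order(x)[1:3]
# [1] 1 3 6
Note the indices come back in rank order (smallest value first), and exactly 3 are returned even when values are tied.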
You can use unique to account for duplicate minimum values.
x <- c(1, 10, 2, 4, 100, 3, 1)
which(x %in% sort(unique(x))[1:3])
# [1] 1 3 6 7
Here's another way with rank that includes duplicates.
x <- c(1, 10, 2, 4, 100, 3, 3)  # the original x with a duplicate 3 appended
x
# [1]   1  10   2   4 100   3   3
which(rank(x, ties.method='min') <= 3)
# [1] 1 3 6 7
