R Sort or order with custom compare function - r

Can I pass a custom compare function to order that, given two items, indicates which one is ranked higher?
In my specific case I have the following list.
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
If we take two vectors a and b, the index of the first element i at which a[i] > b[i] or a[i] < b[i] should determine what vector comes first. In this example, scores[['d']] > scores[['a']] because scores[['d']][2] > scores[['a']][2] (note that it doesn't matter that scores[['d']][5] < scores[['a']][5]).
Comparing two of those vectors could look something like this.
compare <- function(a, b) {
# get first element index at which vectors differ
i <- which.max(a != b)
if(a[i] > b[i])
1
else if(a[i] < b[i])
-1
else
0
}
The sorted keys of scores by using this comparison function should then be d, b, a, c.
The other solutions I've found either mess with the data before ordering or introduce S3 classes and apply comparison attributes. With the former, I fail to see how to mess with my data (maybe turn it into strings? But then what about numbers above 9?); with the latter, I feel uncomfortable introducing a new class into my R package only for comparing vectors. And there doesn't seem to be any comparator parameter that I could pass to order.

Here's an attempt. I've explained every step in the comments.
compare <- function(a, b) {
  # subtract vector b from vector a
  comparison <- a - b
  # get the first non-zero difference (NA if the vectors are identical)
  result <- comparison[comparison != 0][1]
  # return 1 if a ranks higher, 2 if b ranks higher (0 if equal)
  if(is.na(result)) {return(0)} else if(result > 0) {return(1)} else {return(2)}
}
compare_list <- function(list_) {
# get combinations of all possible comparison
comparisons <- combn(length(list_), 2)
# compare all possibilities
results <- apply(comparisons, 2, function(x) {
# get the "winner"
x[compare(list_[[x[1]]], list_[[x[2]]])]
})
# get frequency table (how often a vector "won" -> this is the result you want)
fr_tab <- table(results)
# vector that is last in comparison
last_vector <- which(!(1:length(list_) %in% as.numeric(names(fr_tab))))
# return the sorted results and add the last vectors name
c(as.numeric(names(sort(fr_tab, decreasing = TRUE))), last_vector)
}
If you run the function on your example, the result is
> compare_list(scores)
[1] 4 2 1 3
I haven't dealt with the case where two vectors are identical, since you haven't explained how that should be handled.
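For a more literal take on "pass a comparator", one option is a small hand-written sort that calls the comparison function directly. The following is only a sketch: order_by_compare is a hypothetical helper name (not a base R function), it is a plain insertion sort over indices, and it assumes the comparator returns a positive number, zero, or a negative number as in the question. It makes on the order of n^2 comparator calls, so it only suits short lists.

```r
scores <- list(
  a = c(1, 1, 2, 3, 4, 4),
  b = c(1, 2, 2, 2, 3, 4),
  c = c(1, 1, 2, 2, 3, 4),
  d = c(1, 2, 3, 3, 3, 4)
)

# Comparator: 1 if a ranks higher, -1 if b ranks higher, 0 if identical
compare <- function(a, b) {
  i <- which(a != b)
  if (length(i) == 0) return(0)
  sign(a[i[1]] - b[i[1]])
}

# Hypothetical helper: insertion sort over indices, highest-ranked first
order_by_compare <- function(x, cmp) {
  idx <- seq_along(x)
  for (k in seq_along(idx)[-1]) {
    cur <- idx[k]
    j <- k - 1
    # shift entries that rank lower than the current one to the right
    while (j >= 1 && cmp(x[[idx[j]]], x[[cur]]) < 0) {
      idx[j + 1] <- idx[j]
      j <- j - 1
    }
    idx[j + 1] <- cur
  }
  idx
}

names(scores)[order_by_compare(scores, compare)]
# [1] "d" "b" "a" "c"
```

Identical vectors compare as 0 and therefore keep their original relative order.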

The native R way to do this is to introduce an S3 class.
There are two things you can do with the class. You can define a method for xtfrm that converts your list entries to numbers. That could be vectorized, and conceivably could be really fast.
But you were asking for a user defined compare function. This is going to be slow because R function calls are slow, and it's a little clumsy because nobody does it. But following the instructions in the xtfrm help page, here's how to do it:
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
# Add a class to the list
scores <- structure(scores, class = "lexico")
# Need to keep the class when subsetting
`[.lexico` <- function(x, i, ...) structure(unclass(x)[i], class = "lexico")
# Careful here: identical() might be too strict
`==.lexico` <- function(a, b) {identical(a, b)}
`>.lexico` <- function(a, b) {
a <- a[[1]]
b <- b[[1]]
i <- which(a != b)
length(i) > 0 && a[i[1]] > b[i[1]]
}
is.na.lexico <- function(a) FALSE
sort(scores)
#> $c
#> [1] 1 1 2 2 3 4
#>
#> $a
#> [1] 1 1 2 3 4 4
#>
#> $b
#> [1] 1 2 2 2 3 4
#>
#> $d
#> [1] 1 2 3 3 3 4
#>
#> attr(,"class")
#> [1] "lexico"
Created on 2021-11-27 by the reprex package (v2.0.1)
This is the opposite of the order you asked for, because by default sort() sorts in increasing order. If you really want d, b, a, c, use sort(scores, decreasing = TRUE).

Here's another, very simple solution:
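sort(sapply(scores, function(x) as.numeric(paste(x, collapse = ""))), decreasing = TRUE)
What it does: it takes each vector, "compresses" it into a single number by pasting its elements together, and then sorts those numbers in decreasing order. Note that this is only reliable when every element is a single digit and all vectors have the same length.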
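The pasting trick depends on every element being a single digit, which is the concern raised in the question. A possible alternative that keeps the values numeric is to put the vectors into a matrix and let order() use each position as a successive sort key. This is a sketch that assumes all vectors have the same length; the variable names m, cols and ord are just illustrative.

```r
scores <- list(
  a = c(1, 1, 2, 3, 4, 4),
  b = c(1, 2, 2, 2, 3, 4),
  c = c(1, 1, 2, 2, 3, 4),
  d = c(1, 2, 3, 3, 3, 4)
)

# One row per vector (assumes equal lengths); row names are the list names
m <- do.call(rbind, scores)

# Split the matrix into its columns, first position first
cols <- lapply(seq_len(ncol(m)), function(j) m[, j])

# order() sorts by the first key and breaks ties with the following keys
ord <- do.call(order, c(cols, list(decreasing = TRUE)))

sorted_keys <- rownames(m)[ord]
sorted_keys
# [1] "d" "b" "a" "c"
```

Because order() is vectorized, this avoids calling an R-level comparison function for each pair.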

Related

Applying an existing multi-argument function to multiple dataframes, row by row, with a joint output dataframe

I have a function taking four arguments,
h(a, b, c, d)
Where a and b are the i-th and the i+1-th row of df1 and c and d are the i-th and i+1-th row of df2, and the output has four variables and i-1 results.
The idea is the following: I want to apply the function h to each combination of these four arguments that shares the same i, so:
- for the first iteration it will take the 1st and 2nd row of df1 and 1st and 2nd row of df2
- for the second iteration it will take the 2nd and 3rd row of df1 and 2nd and 3rd row of df2
...
Ideally, the results would then be stored in a separate data frame with 4 columns and i-1 rows.
I tried making use of apply function and of a for loop, yet my attempts failed me. I don't necessarily need a readymade solution, a hint would be nice. Thanks!
EDIT: reproducible example:
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function (a, b, c, d) {
vector <- (a + b) / (c - d)
vector
}
I would like to get a function that uses h until b and d reach the last row of df1/df2 (they have the same number of rows), and for each such combination generate vector and add it to some new data frame as a next row.
With apply you could do something like this:
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function (a, b, c, d) {
(a + b) / (c - d)
}
apply(cbind(df1, df2), 1, function(x) h(x["a"], x["b"], x["c"], x["d"]))
[1] -1.5 -2.0 -2.5 -3.0
If h is a vectorized function (as in your example), it would be better to use:
do.call(h, cbind(df1, df2))
Of course, I am assuming that h is not really that simple; if it were, (df1$a + df1$b) / (df2$c - df2$d) would suffice.
However, I advise learning about the purrr package. It is great for this kind of situation and mainly: you can define what type of output you are expecting (with purrr::map_*) to ensure consistency and avoid unexpected results.
For multiple arguments of a dataframe, use purrr::pmap_*:
# `pmap` returns a list
purrr::pmap(cbind(df1, df2), h)
[[1]]
[1] -1.5
[[2]]
[1] -2
[[3]]
[1] -2.5
[[4]]
[1] -3
# `pmap_dbl` returns a double vector or throws an error otherwise
purrr::pmap_dbl(cbind(df1, df2), h)
[1] -1.5 -2.0 -2.5 -3.0
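The question itself describes a sliding window over consecutive rows (rows i and i+1 of both data frames), which the calls above do not cover. One way this might be done in base R is sketched below, assuming each row can be passed to h as a plain numeric vector; starts, rows and result are illustrative names. With the example h, each call returns two values, so the result has two columns rather than four.

```r
df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
df2 <- data.frame(c = c(4, 3, 2, 1), d = c(8, 7, 6, 5))
h <- function(a, b, c, d) (a + b) / (c - d)

# Index of the first row in each consecutive pair: 1 .. n-1
starts <- seq_len(nrow(df1) - 1)

# Call h on rows i and i+1 of both data frames; unlist() turns each
# one-row data frame into a plain numeric vector
rows <- lapply(starts, function(i) {
  h(unlist(df1[i, ], use.names = FALSE),
    unlist(df1[i + 1, ], use.names = FALSE),
    unlist(df2[i, ], use.names = FALSE),
    unlist(df2[i + 1, ], use.names = FALSE))
})

# Stack the per-pair results, one row per pair
result <- as.data.frame(do.call(rbind, rows))
result
#   V1 V2
# 1  3 11
# 2  5 13
# 3  7 15
```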

How to write function to calculate H index in R?

I am new to R and would like to calculate the h index.
The h index is a popular measure for quantifying scientific productivity.
Formally, if f is the function that corresponds to the number of citations for each publication, we compute the h index as follows:
First we order the values of f from the largest to the lowest value. Then, we look for the last position in which f is greater than or equal to the position (we call h this position).
For example, if we have a researcher with 5 publications A, B, C, D, and E with 10, 8, 5, 4, and 3 citations, respectively, the h index is equal to 4 because the 4th publication has 4 citations and the 5th has only 3. In contrast, if the same publications have 25, 8, 5, 3, and 3 citations, then the index is 3 because the fourth paper has only 3 citations.
Can anyone suggest a smarter way to solve this?
a <- c(10,8,5,4,3)
I expect the output of h index value as 4.
Assuming the input is already sorted, I would use this:
tail(which(a >= seq_along(a)), 1)
# [1] 4
You could, of course, put this in a little function:
h_index = function(cites) {
if(max(cites) == 0) return(0) # assuming this is reasonable
cites = cites[order(cites, decreasing = TRUE)]
tail(which(cites >= seq_along(cites)), 1)
}
a1 = c(10,8, 5, 4, 3)
a2 = c(10, 9, 7, 1, 1)
h_index(a1)
# [1] 4
h_index(a2)
# [1] 3
h_index(1)
# [1] 1
## set this to be 0, not sure if that's what you want
h_index(0)
# [1] 0
I propose a shorter + more flexible function that takes whatever numeric vector of citations you include (sorted or unsorted, with or without zeros, only zeros, etc.)
hindex <- function(x) {
  tx <- sort(x, decreasing = TRUE)
  # return the count rather than printing it, so the result can be reused
  sum(tx >= seq_along(tx))
}
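Counting works here because, once the citations are sorted in decreasing order, the values never increase while the positions do, so the comparison is TRUE for a leading run of positions and FALSE afterwards. The number of TRUE values therefore equals the last position that satisfies the condition. A small sketch with the examples from the question (hindex_sketch is just an illustrative name for the same idea):

```r
# Illustrative helper: h index as the count of papers whose citation
# count is at least their rank in the descending ordering
hindex_sketch <- function(cites) {
  sorted <- sort(cites, decreasing = TRUE)  # sort() drops NA by default
  sum(sorted >= seq_along(sorted))
}

hindex_sketch(c(10, 8, 5, 4, 3))   # 4
hindex_sketch(c(25, 8, 5, 3, 3))   # 3
hindex_sketch(c(3, 10, 4, 8, 5))   # 4 (unsorted input)
hindex_sketch(c(0, 0))             # 0
```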
A dplyr version if citation data is in a dataframe (thanks to https://stackoverflow.com/users/5313511/oelshie):
library(dplyr)
a <- data.frame(cites = c(10,8,5,4,3))
b <- a %>%
arrange(desc(cites)) %>%
summarise(h_index = sum(cites >= seq_along(cites)))
b
h_index
1 4

Finding the closest index to a value in R

I have a question about finding index values in a vector.
Let's say I have a vector as follows:
vector <- c(1,2,4,6,8,10)
And, let's say I have the value '5'. I would like to find the maximum index in "vector" such that it is less than or equal to the value 5. In the case of the example above, this index would be 3 (since 4 is less than or equal to 5). Similarly, if instead I had a vector such as:
vector <- c(1,2,4,5,6,8,10)
Then if I were to find a value less than or equal to 5, this index would now be 4 instead of 3.
However, I also want to find the first and last time this index occurs. For example, if I had a vector such as:
vector <- c(1,1,2,2,4,5,5,5,5,6,8,10)
Then the first time this index occurs would be 6 and the last time this index occurs would be 9.
Is there a short, one-line method which would allow me to perform this task? Up until now I have been using the function max(which(....)), however I find that this method is extremely inefficient for large datasets since it will literally list hundreds/thousands of values, so I would like to find a more efficient method if possible which can fit in one line.
Thanks in advance.
You can use the following code:
min(max(which(vector <= 5)), min(which(vector == 5)))
First, it finds all indices where vector is less than or equal to 5 with the which function, then it takes the maximum one.
Second, it finds all indices where vector is equal to 5 and takes the minimum.
Third, it takes the smaller of these two indices.
Thanks to all those who replied. I actually found an extremely short, one-line method to do this by downloading the BBmisc package. It has functions called which.last and which.first, and they perform the actions I need. Thanks again for taking the time to reply, I appreciate it.
You can use:
my_ind <- function(vec, num){
ind <- which.max(vec == num) # Check for equality first
if(ind == 1L && vec[1L] != num){
ind <- which.min(vec < num) - 1L
}
ind
}
my_ind(c(1,2,4,6,8,10), 5L) # 3
my_ind(c(1,2,4,5,6,8,10), 5L) # 4
my_ind(c(1,1,2,2,4,5,5,5,5,6,8,10), 5L) # 6
my_ind(c(5,8,10), 5L) # 1
my_ind(c(6,8,10), 5L) # 0 - returns 0 if all(vec > 5L)
I don't see a need for packages here. It seems like the construct which(x == max(x[x <= 5])) would work for you.
x <- c(1, 2, 4, 6, 8, 10)
which(x == max(x[x <= 5]))
# [1] 3
x <- c(1, 2, 4, 5, 6, 8, 10)
which(x == max(x[x <= 5]))
# [1] 4
x <- c(1, 1, 2, 2, 4, 5, 5, 5, 5, 6, 8, 10)
which(x == max(x[x <= 5]))
# [1] 6 7 8 9
And to find the min/max index when there are multiple indices, use head/tail.
head(which(x == max(x[x <= 5])), 1)
# [1] 6
tail(which(x == max(x[x <= 5])), 1)
# [1] 9
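If the vector is already sorted in non-decreasing order, as in all the examples in the question, findInterval() may be a cheaper option because it searches the sorted vector instead of building a long index vector with which(). The sketch below relies on that sortedness assumption; closest_index_range is a hypothetical helper name, and the left.open argument requires a reasonably recent R version.

```r
# Hypothetical helper: first and last index of the largest value <= target.
# Assumes x is sorted in non-decreasing order.
closest_index_range <- function(x, target) {
  # index of the last element that is <= target (0 if there is none)
  last_idx <- findInterval(target, x)
  if (last_idx == 0L) return(c(first = 0L, last = 0L))
  # with left.open = TRUE the result is the number of elements strictly
  # below x[last_idx]; the first occurrence sits right after them
  first_idx <- findInterval(x[last_idx], x, left.open = TRUE) + 1L
  c(first = first_idx, last = last_idx)
}

closest_index_range(c(1, 2, 4, 6, 8, 10), 5)
# first  last
#     3     3
closest_index_range(c(1, 1, 2, 2, 4, 5, 5, 5, 5, 6, 8, 10), 5)
# first  last
#     6     9
```

A result of 0 for both entries signals that every element is greater than the target.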

Find all positions of all matches of one vector of values in second vector

I need to find all positions in my vector corresponding to any of values of another vector:
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)
mymatches(needles, hay) # should give vector: 2 3 5
Is there any predefined function allowing to do this?
This should work:
which(hay %in% needles) # 2 3 5
R already has the match() function and the %in% operator, which do essentially the same thing, and both are vectorized. Your solution:
which(!is.na(match(hay, needles)))
[1] 2 3 5
or the shorter syntax which(hay %in% needles) as @jalapic showed.
With match(), if you wanted to, you could see which specific value was matched at each position...
match(hay, needles)
[1] NA 2 1 NA 2 NA
or just a logical vector of where the matches occurred:
!is.na(match(hay, needles))
[1] FALSE TRUE TRUE FALSE TRUE FALSE
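If the positions are wanted separately for each needle rather than pooled into one vector, a simple base R sketch is to loop over the needles with lapply(). The name pos_by_needle is illustrative only.

```r
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)

# One list entry per needle, named after the needle's value;
# needles that never occur get an empty integer vector
pos_by_needle <- lapply(setNames(needles, needles), function(n) which(hay == n))
pos_by_needle
# $`4`
# [1] 3
#
# $`3`
# [1] 2 5
#
# $`9`
# integer(0)

# Pooling and sorting the entries gives the same result as which(hay %in% needles)
sort(unlist(pos_by_needle, use.names = FALSE))
# [1] 2 3 5
```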
If you want to match a single integer against an integer vector, then the following code is about twice as fast for longer integer vectors:
library(microbenchmark)
library(parallel)
library(Rcpp)
library(RcppArmadillo)
cppFunction(depends = "RcppArmadillo",
'std::vector<double> findMatches(const int &x, const arma::ivec &y) {
arma::uvec temp = arma::find(y == x) + 1;
return as<std::vector<double>>(wrap(temp));
}')
x <- 1L
y <- as.integer(1:1e8)
microbenchmark(findMatches(x, y))
microbenchmark(which(y %in% x))
To match all elements of a vector we can do:
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)
unlist(lapply(FUN = findMatches, X = needles, y=hay))
# The same thing in parallel
unlist(mclapply(FUN = findMatches, X = needles, y=hay))
Benchmarking:
# on a 8 core server
hay <- as.integer(1:1e7)
needles <- sample(hay, 10)
microbenchmark(which(hay %in% needles)) # 74 milliseconds
microbenchmark(unlist(lapply(FUN = findMatches, X = needles, y=hay))) # 44 milliseconds
microbenchmark(unlist(mclapply(FUN = findMatches, X = needles, y=hay))) # 46 milliseconds
Doing in parallel will only be faster if each embarrassingly parallel task is long enough to make it worth the overhead. In this example that does not seem to be the case.

R how many element satisfy a condition?

Is there a better way to count how many elements of a result satisfy a condition?
a <- c(1:5, 1:-3, 1, 2, 3, 4, 5)
b <- c(6:-8)
u <- a > b
length(u[u == TRUE])
## [1] 7
sum does this directly, counting the number of TRUE values in a logical vector:
sum(u, na.rm=TRUE)
And of course there is no need to construct u for this:
sum(a > b, na.rm=TRUE)
works just as well. sum will return NA by default if any of the values are NA. na.rm=TRUE ignores NA values in the sum (for logical or numeric).
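A short sketch of the NA behaviour described above, using a small made-up logical vector alongside the vectors from the question. mean() on a logical vector gives the proportion of TRUE values, which can be handy next to the count.

```r
# Made-up logical vector with one missing value
u <- c(TRUE, NA, FALSE, TRUE)

sum(u)                 # NA, because one value is missing
sum(u, na.rm = TRUE)   # 2
mean(u, na.rm = TRUE)  # proportion of TRUE among the non-missing values

# The vectors from the question contain no NA, so both forms agree
a <- c(1:5, 1:-3, 1, 2, 3, 4, 5)
b <- c(6:-8)
sum(a > b)             # 7
```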
If z consists of only TRUE or FALSE, then simply
length(which(z))
I've always used table for this:
a <- c(1:5, 1:-3, 1, 2, 3, 4, 5)
b <- c(6:-8)
table(a>b)
FALSE TRUE
8 7

Resources