How to write function to calculate H index in R? - r

I am new to R and looking for calculating h index.
H index is the popular measure to quantify scientific productivity.
Formally, if f is the function that corresponds to the number of citations for each publication, we compute the h index as follows:
First we order the values of f from the largest to the lowest value. Then, we look for the last position in which f is greater than or equal to the position (we call h this position).
For example, if we have a researcher with 5 publications A, B, C, D, and E with 10, 8, 5, 4, and 3 citations, respectively, the h index is equal to 4 because the 4th publication has 4 citations and the 5th has only 3. In contrast, if the same publications have 25, 8, 5, 3, and 3 citations, then the index is 3 because the fourth paper has only 3 citations.
Can anyone suggest smarter way to solve this
a <- c(10,8,5,4,3)
I expect the output of h index value as 4.

Assuming the input is already sorted, I would use this:
tail(which(a >= seq_along(a)), 1)
# [1] 4
You could, of course, put this in a little function:
h_index = function(cites) {
if(max(cites) == 0) return(0) # assuming this is reasonable
cites = cites[order(cites, decreasing = TRUE)]
tail(which(cites >= seq_along(cites)), 1)
}
a1 = c(10,8, 5, 4, 3)
a2 = c(10, 9, 7, 1, 1)
h_index(a1)
# [1] 4
h_index(a2)
# [1] 3
h_index(1)
# [1] 1
## set this to be 0, not sure if that's what you want
h_index(0)
# [1] 0

I propose a shorter + more flexible function that takes whatever numeric vector of citations you include (sorted or unsorted, with or without zeros, only zeros, etc.)
hindex <- function(x) {
tx <- sort(x, decreasing = T)
print(sum(tx >= seq_along(tx)))
}

A dplyr version if citation data is in a dataframe (thanks to https://stackoverflow.com/users/5313511/oelshie):
a <- data.frame(cites = c(10,8,5,4,3))
b <- a %>%
arrange(desc(cites)) %>%
summarise(h_index = sum(cites >= seq_along(cites)))
b
h_index
1 4

Related

R - create a sequence of odd numbers followed by the sequence of even numbers on the same interval

So I am basically looking for a more efficient way to do this:
c(seq(1, 5, 2), seq(2, 6, 2))
Is there a simpler function built in R or some of the packages that would allow me to specify just one interval (from 1 to 6; instead of having to specify from 1 to 5 and from 2 to 6), but to sort the numbers so that all the odd numbers appear before the even ones?
You can use sequence. The first argument of the function is the length of each sequence, from is the starting point, and by is the interval.
sequence(c(3, 3), from = c(1, 2), by = 2)
#[1] 1 3 5 2 4 6
Or, as a function that fits your request:
seqOrdered <- function(from = 1, to){
n = ceiling((to - from) / 2)
sequence(c(n, n), from = c(from, from + 1), by = 2)
}
seqOrdered(1, 6)
#[1] 1 3 5 2 4 6
Just concatenate the sub-data that contains only odd numbers of the original data and the other sub-data that contains the remaining even numbers.
In the following, you can have the original data x1, which consists of 10 integers from a poisson distribution of mean 8 (rpois(n = 10, lambda = 8)), and merge the sub-data of odd numbers (x1[x1 %% 2 == 1]) and that of even numbers (x1[x1 %% 2 == 0]).
## To prepare data
x1 <- rpois(n = 10, lambda = 8)
x1
## To sort the data so that odd numbers come earlier
c(x1[x1 %% 2 == 1], x1[x1 %% 2 == 0])

R Sort or order with custom compare function

Can I pass a custom compare function to order that, given two items, indicates which one is ranked higher?
In my specific case I have the following list.
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
If we take two vectors a and b, the index of the first element i at which a[i] > b[i] or a[i] < b[i] should determine what vector comes first. In this example, scores[['d']] > scores[['a']] because scores[['d']][2] > scores[['a']][2] (note that it doesn't matter that scores[['d']][5] < scores[['a']][5]).
Comparing two of those vectors could look something like this.
compare <- function(a, b) {
# get first element index at which vectors differ
i <- which.max(a != b)
if(a[i] > b[i])
1
else if(a[i] < b[i])
-1
else
0
}
The sorted keys of scores by using this comparison function should then be d, b, a, c.
From other solutions I've found, they mess with the data before ordering or introduce S3 classes and apply comparison attributes. With the former I fail to see how to mess with my data (maybe turn it into strings? But then what about numbers above 9?), with the latter I feel uncomfortable introducing a new class into my R package only for comparing vectors. And there doesn't seem to be a sort of comparator parameter I'd want to pass to order.
Here's an attempt. I've explained every step in the comments.
compare <- function(a, b) {
# subtract vector a from vector b
comparison <- a - b
# get the first non-zero result
restult <- comparison[comparison != 0][1]
# return 1 if result == 1 and 2 if result == -1 (0 if equal)
if(is.na(restult)) {return(0)} else if(restult == 1) {return(1)} else {return(2)}
}
compare_list <- function(list_) {
# get combinations of all possible comparison
comparisons <- combn(length(list_), 2)
# compare all possibilities
results <- apply(comparisons, 2, function(x) {
# get the "winner"
x[compare(list_[[x[1]]], list_[[x[2]]])]
})
# get frequency table (how often a vector "won" -> this is the result you want)
fr_tab <- table(results)
# vector that is last in comparison
last_vector <- which(!(1:length(list_) %in% as.numeric(names(fr_tab))))
# return the sorted results and add the last vectors name
c(as.numeric(names(sort(fr_tab, decreasing = T))), last_vector)
}
If you run the function on your example, the result is
> compare_list(scores)
[1] 4 2 1 3
I haven't dealt with the case that the two vectors are identical, you haven't explained how to deal with this.
The native R way to do this is to introduce an S3 class.
There are two things you can do with the class. You can define a method for xtfrm that converts your list entries to numbers. That could be vectorized, and conceivably could be really fast.
But you were asking for a user defined compare function. This is going to be slow because R function calls are slow, and it's a little clumsy because nobody does it. But following the instructions in the xtfrm help page, here's how to do it:
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
# Add a class to the list
scores <- structure(scores, class = "lexico")
# Need to keep the class when subsetting
`[.lexico` <- function(x, i, ...) structure(unclass(x)[i], class = "lexico")
# Careful here: identical() might be too strict
`==.lexico` <- function(a, b) {identical(a, b)}
`>.lexico` <- function(a, b) {
a <- a[[1]]
b <- b[[1]]
i <- which(a != b)
length(i) > 0 && a[i[1]] > b[i[1]]
}
is.na.lexico <- function(a) FALSE
sort(scores)
#> $c
#> [1] 1 1 2 2 3 4
#>
#> $a
#> [1] 1 1 2 3 4 4
#>
#> $b
#> [1] 1 2 2 2 3 4
#>
#> $d
#> [1] 1 2 3 3 3 4
#>
#> attr(,"class")
#> [1] "lexico"
Created on 2021-11-27 by the reprex package (v2.0.1)
This is the opposite of the order you asked for, because by default sort() sorts to increasing order. If you really want d, b, a, c use sort(scores, decreasing = TRUE.
Here's another, very simple solution:
sort(sapply(scores, function(x) as.numeric(paste(x, collapse = ""))), decreasing = T)
What it does is, it takes all the the vectors, "compresses" them into a single numerical digit and then sorts those numbers in decreasing order.

Find the index of negative multiplying of two vectors in R

I have two vectors and I want to find the index of multipling of these two vectors which gets negative and also a[index]is negative and b[index] is positive. How can I find this indexin R?
a = c(1, -1, 2, 3, 4)
b =c(-1, 3, 5, 4, -5)
c = a*b
I have tried this but this is not my desire result:
> which( c <= 0)
[1] 1 2 5
the final result should be index = 1 and 5.
After reading your Question for 10 times, i think your regarded answer is 2 like Simon0101 says.
which( a < 0 & b >= 0 & c < 0)
Please review zour Question or the expected result.

Find first greater element with higher index

I have two vectors, A and B. For every element in A I want to find the index of the first element in B that is greater and has higher index. The length of A and B are the same.
So for vectors:
A <- c(10, 5, 3, 4, 7)
B <- c(4, 8, 11, 1, 5)
I want a result vector:
R <- c(3, 3, 5, 5, NA)
Of course I can do it with two loops, but it's very slow, and I don't know how to use apply() in this situation, when the indices matter. My data set has vectors of length 20000, so the speed is really important in this case.
A few bonus questions:
What if I have a sequence of numbers (like seq = 2:10), and I want to find the first number in B that is higher than a+s for every a of A and every s of seq.
Like with question 1), but I want to know the first greater, and the first lower value, and create a matrix, which stores which one was first. So for example I have a of A, and 10 from seq. I want to find the first value of B, which is higher than a+10, or lower than a-10, and then store it's index and value.
sapply(sapply(seq_along(a),function(x) which(b[-seq(x)]>a[x])+x),"[",1)
[1] 3 3 5 5 NA
This is a great example of when sapply is less efficient than loops.
Although the sapply does make the code look neater, you are paying for that neatness with time.
Instead you can wrap a while loop inside a for loop inside a nice, neat function.
Here are benchmarks comparing a nested-apply loop against nested for-while loop (and a mixed apply-while loop, for good measure). Update: added the vapply..match.. mentioned in comments. Faster than sapply, but still much slower than while loop.
BENCHMARK:
test elapsed relative
1 for.while 0.069 1.000
2 sapply.while 0.080 1.159
3 vapply.match 0.101 1.464
4 nested.sapply 0.104 1.507
Notice you save a third of your time; The savings will likely be larger when you start adding the sequences to A.
For the second part of your question:
If you have this all wrapped up in an nice function, it is easy to add a seq to A
# Sample data
A <- c(10, 5, 3, 4, 7, 100, 2)
B <- c(4, 8, 11, 1, 5, 18, 20)
# Sample sequence
S <- seq(1, 12, 3)
# marix with all index values (with names cleaned up)
indexesOfB <- t(sapply(S, function(s) findIndx(A+s, B)))
dimnames(indexesOfB) <- list(S, A)
Lastly, if you want to instead find values of B less than A, just swap the operation in the function.
(You could include an if-clause in the function and use only a single function. I find it more efficient
to have two separate functions)
findIndx.gt(A, B) # [1] 3 3 5 5 6 NA 8 NA NA
findIndx.lt(A, B) # [1] 2 4 4 NA 8 7 NA NA NA
Then you can wrap it up in one nice pacakge
rangeFindIndx(A, B, S)
# A S indxB.gt indxB.lt
# 10 1 3 2
# 5 1 3 4
# 3 1 5 4
# 4 1 5 NA
# 7 1 6 NA
# 100 1 NA NA
# 2 1 NA NA
# 10 4 6 4
# 5 4 3 4
# ...
FUNCTIONS
(Notice they depend on reshape2)
rangeFindIndx <- function(A, B, S) {
# For each s in S, and for each a in A,
# find the first value of B, which is higher than a+s, or lower than a-s
require(reshape2)
# Create gt & lt matricies; add dimnames for melting function
indexesOfB.gt <- sapply(S, function(s) findIndx.gt(A+s, B))
indexesOfB.lt <- sapply(S, function(s) findIndx.lt(A-s, B))
dimnames(indexesOfB.gt) <- dimnames(indexesOfB.gt) <- list(A, S)
# melt the matricies and combine into one
gtltMatrix <- cbind(melt(indexesOfB.gt), melt(indexesOfB.lt)$value)
# clean up their names
names(gtltMatrix) <- c("A", "S", "indxB.gt", "indxB.lt")
return(gtltMatrix)
}
findIndx.gt <- function(A, B) {
lng <- length(A)
ret <- integer(0)
b <- NULL
for (j in seq(lng-1)) {
i <- j + 1
while (i <= lng && ((b <- B[[i]]) < A[[j]]) ) {
i <- i + 1
}
ret <- c(ret, ifelse(i<lng, i, NA))
}
c(ret, NA)
}
findIndx.lt <- function(A, B) {
lng <- length(A)
ret <- integer(0)
b <- NULL
for (j in seq(lng-1)) {
i <- j + 1
while (i <= lng && ((b <- B[[i]]) > A[[j]]) ) { # this line contains the only difference from findIndx.gt
i <- i + 1
}
ret <- c(ret, ifelse(i<lng, i, NA))
}
c(ret, NA)
}

Applying function to consecutive subvectors of equal size

I am looking for a nice and fast way of applying some arbitrary function which operates on vectors, such as sum, consecutively to a subvector of consecutive K elements.
Here is one simple example, which should illustrate very clearly what I want:
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
v2 <- myapply(v, sum, group_size=3) # v2 should be equal to c(6, 15, 15)
The function should try to process groups of group_size elements of a given vector and apply a function to each group (treating it as another vector). In this example, the vector v2 is obtained as follows: (1 + 2 + 3) = 6, (4 + 5 + 6) = 15, (7 + 8) = 15. In this case, the K did not divide N exactly, so the last group was of size less then K.
If there is a nicer/faster solution which only works if N is a multiple of K, I would also appreciate it.
Try this:
library(zoo)
rollapply(v, 3, by = 3, sum, partial = TRUE, align = "left")
## [1] 6 15 15
or
apply(matrix(c(v, rep(NA, 3 - length(v) %% 3)), 3), 2, sum, na.rm = TRUE)
## [1] 6 15 15
Also, in the case of sum the last one could be shortened to
colSums(matrix(c(v, rep(0, 3 - length(v) %% 3)), 3))
As #Chase said in a comment, you can create your own grouping variable and then use that. Wrapping that process into a function would look like
myapply <- function(v, fun, group_size=1) {
unname(tapply(v, (seq_along(v)-1) %/% group_size, fun))
}
which gives your results
> myapply(v, sum, group_size=3)
[1] 6 15 15
Note this does not require the length of v to be a multiple of the group_size.
You could try this as well. This works nicely even if you want to include overlapping intervals, as controlled by by, and as a bonus, returns the intervals over which each value is derived:
library (gtools)
v2 <- running(v, fun=sum, width=3, align="left", allow.fewer=TRUE, by=3)
v2
1:3 4:6 7:8
6 15 15

Resources