How can you find the indexes of the n least values in a vector?

Is there a quick way to find the indexes of the n least values in a vector in R?
I know that to find the least value you can use which.min(c(1, 5, 6, 4)). Can this be extended?

From the comments: you can use either order
n <- 3
head(order(vect), n)
or sort with index.return = TRUE
head(sort(vect, index.return = TRUE)$ix, n)
Data:
vect <- c(1, 5, 6, 4)

which.nmin <- function(x, n){
  order(x)[seq_len(n)]
}
set.seed(123)
x <- rnorm(100)
which.nmin(x, 6)
# [1] 72 18 26 57 43 8
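As a small check of the order() approach on the question's vector, and as a possible alternative when the vector is long, here is a minimal sketch. The partial-sort variant is an assumption of mine, not something proposed in the answers: it returns the indexes in their original positions rather than ranked by value, and it can return more than n indexes when values are tied at the threshold.

```r
vect <- c(1, 5, 6, 4)
n <- 3

# Indexes of the n smallest values, ranked from smallest to largest value
idx_order <- head(order(vect), n)
idx_order
# [1] 1 4 2

# Alternative sketch: a partial sort only guarantees that position n holds
# the n-th smallest value, which is enough to get a threshold
threshold <- sort(vect, partial = n)[n]
idx_partial <- which(vect <= threshold)
idx_partial
# [1] 1 2 4
```

The first form is usually what you want when the ranking matters; the second only avoids a full sort.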

Related

Permute the position of a subset of a vector

I want to permute a subset of a vector.
For example, say I have a vector (x) and I select a random subset of the vector (e.g., 40% of its values).
What I want to do is output a new vector (x2) that is identical to (x) except the positions of the values within the random subset are randomly swapped.
For example:
x = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
random subset = 1, 4, 5, 8
x2 could be = 4, 2, 3, 8, 1, 6, 7, 5, 9, 10
Here's an example vector (x) and how I'd select the indices of a random subset of 40% of its values. Any help making (x2) would be appreciated!
x <- seq(1,10,1)
which(x%in%sample(x)[seq_len(length(x)*0.40)])
First draw a sample of proportion p from the indices, then shuffle the elements at those indices and assign them back.
f <- \(x, p=0.4) {
  r <- sample(seq_along(x), length(x)*p)
  x[r] <- sample(x[r])
  `attr<-`(x, 'subs', r) ## add attribute w/ indices that were sampled
}
set.seed(42)
f(x)
# [1] 8 2 3 4 1 5 7 10 6 9
# attr(,"subs")
# [1] 1 5 10 8
Data:
x <- 1:10
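One edge case worth keeping in mind: sample(v) treats a single number v >= 1 as "sample from 1:v", so shuffling a subset of length one by value can return something unexpected. The sketch below is a variant that shuffles the indices with sample.int() instead; permute_subset is a hypothetical helper name, not part of the answer above.

```r
# Shuffle the values of x that sit at positions idx; all other positions stay put.
# Shuffling positions via sample.int() avoids the sample(<single number>) surprise.
permute_subset <- function(x, idx) {
  shuffled <- idx[sample.int(length(idx))]
  x[idx] <- x[shuffled]
  x
}

set.seed(1)
x <- 1:10
idx <- sort(sample(seq_along(x), floor(length(x) * 0.4)))  # 40% of the positions
x2 <- permute_subset(x, idx)

# Positions outside the subset are untouched
identical(x2[-idx], x[-idx])
# The subset holds the same values, possibly in a different order
setequal(x2[idx], x[idx])
```

Both checks should return TRUE for any seed; note that a random shuffle may occasionally leave the subset in its original order.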
There is surely faster code for what you are asking, but one solution would be:
x <- seq(1,10,1)
y <- which(x%in%sample(x)[seq_len(length(x)*0.40)]) # Defined as "y" the vector of the random subset
# required libraries
library(combinat)
permutation <- permn(y) # permn() function in R generates a list of all permutations of the elements of x.
# https://www.geeksforgeeks.org/calculate-combinations-and-permutations-in-r/
permutation_sampled <- sample(permutation,1) # Sample one of the permutations.
x[y] <- permutation_sampled[[1]] # Substitute the selected permutation in x using y as the index of the elements that should be substituted.

How to make a generalized function update the value of a vector?

I have been trying to write a generalized function that multiplies each value in a row of a matrix by the value at the same position in a vector (i.e. matrix[1,1]*vector[1], matrix[1,2]*vector[2], etc.) and then sums the products. The vector always has the same length as a matrix row, so in each row the first value of the vector is multiplied by the first value of the row, and so on. The matrix is also square (it has as many rows as columns). The sum for each row should be assigned to a different, existing vector whose length equals the number of rows.
This is the matrix and vector:
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
matrix <- cbind(a,b,c,d)
      a  b c  d
[1,]  4 -1 5  7
[2,] -9  3 2  9
[3,]  2 -8 6 -2
[4,] -1  2 3  5
vector <- c(1, 2, 3, 4)
These are the basic functions that I have to generalize for the rows and columns of a matrix and a vector of length "n":
f.1 <- function() {
  (matrix[1,1]*vector[1]
   + matrix[1,2]*vector[2]
   + matrix[1,3]*vector[3]
   + matrix[1,4]*vector[4])
}
f.2 <- function() {
  (matrix[2,1]*vector[1]
   + matrix[2,2]*vector[2]
   + matrix[2,3]*vector[3]
   + matrix[2,4]*vector[4])
}
and so on...
This is the function I have written:
ncells = 4
f = function(x) {
  i = x
  result = 0
  for(j in 1:ncells) {
    result = result + vector[j] * matrix[i][j]
  }
  return(result)
}
Calling the function:
result.cell = function() {
  for(i in 1:ncells) {
    new.vector[i] = f(i)
  }
}
The vector to which this result should be assigned (i.e. new.vector) has been defined beforehand:
new.vector <- c()
I expected that the end sum for each row will be assigned to the vector in a corresponding manner (e.g. if the sums for all rows were 1, 2, 3, 4, etc. then new.vector(1, 2, 3, 4, etc) but it did not happen.
(Edit) When I do this with the basic functions, the assignment works:
new.vector[1] <- f.1()
new.vector[2] <- f.2()
This does not however work with the generalized function:
new.vector[1:ncells] <- result cell[1:ncells]
(End Edit)
I have also tried setting the length of new.vector to be equal to ncells, but I don't think it did any good:
length(new.vector) = ncells
My question is how can I make the new vector take the resulting sums of the multiplied elements of a row of a matrix by the corresponding value of a vector.
I hope I have been clear and thanks in advance!
There is no need for a loop here: we can rely on R's vectorized element-wise multiplication and then sum the rows with rowSums. Note that m and v are used as names for the matrix and vector to avoid conflict with the functions of those names.
nr <- nrow(m)
rowSums(m * matrix(rep(v, nr), nr, byrow = TRUE))
# [1] 45 39 -4 32
However, if the vector v is always going to be the column number, we can simply use the col function as our multiplier.
rowSums(m * col(m))
# [1] 45 39 -4 32
Data:
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
m <- cbind(a, b, c, d)
v <- 1:4
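Since a row-wise sum of products is a matrix-vector product, %*% gives the same numbers directly. The sketch below also shows, as far as I can tell, why the loop in the question did not work: matrix[i][j] first takes the i-th element of the matrix (a single value) and then its j-th element, which is NA for j > 1, and the assignment to new.vector inside result.cell() only changes a local copy. row_dot is a hypothetical helper name used for illustration.

```r
m <- cbind(a = c(4, -9, 2, -1),
           b = c(-1, 3, -8, 2),
           c = c(5, 2, 6, 3),
           d = c(7, 9, -2, 5))
v <- c(1, 2, 3, 4)

# Matrix-vector product: one sum of products per row
by_product <- as.vector(m %*% v)
by_product
# [1] 45 39 -4 32

# Loop-style version with the indexing fixed (m[i, ] instead of m[i][j]);
# the result is returned and assigned by the caller instead of being
# modified inside a function
row_dot <- function(m, v, i) sum(m[i, ] * v)
new.vector <- vapply(seq_len(nrow(m)), function(i) row_dot(m, v, i), numeric(1))
new.vector
# [1] 45 39 -4 32
```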

Finding n tuples from a list whose aggregation satisfies a condition

I have a list of two-element vectors. From this list, I'd like to find n vectors (not necessarily distinct) (x,y), such that the sum of the ys of these vectors is larger or equal to a number k. If multiple vectors satisfy this condition, select the one where the sum of the xs is the smallest.
For example, I'd like to find n=2 vectors (x1,y1) and (x2,y2) such that y1+y2 >= k. If there are more than just one which satisfies this condition, select the one where x1+x2 is the smallest.
I've so far only managed to set-up the following code:
X <- c(3, 2, 3, 8, 7, 7, 13, 11, 12, 12)
Y <- c(2, 1, 3, 6, 5, 6, 8, 9, 10, 9)
df <- data.frame(X, Y)
l <- list()
for (i in seq_len(nrow(df))){
  n <- as.numeric(df[i,])
  l[[i]] <- n
}
Using the values above, let's say n=1, k=9, then I'll pick the tuple (x,y)=(11,9) because even though (12,9) also matches the condition that y=k, the x is smaller.
If n=2, k=6, then I'll pick (x1,y1)=(3,3) and (x2,y2)=(3,3) because it's the smallest x1+x2 that satisfies y1+y2 >= 6.
If n=2, k=8, then I'll pick (x1,y1)=(3,3) and (x2,y2)=(7,5) because y1+y2>=8 and in the next alternative tuples (3,3) and (8,6), 3+8=11 is larger than 3+7.
I feel like a brute-force solution would be possible: all possible n-sized combinations of each vector with the rest, for each permutation calculate yTotal=y1+y2+y3... find all yTotal combinations that satisfy yTotal>=k and of those, pick the one where xTotal=x1+x2+x3... is minimal.
I definitely struggle putting this into R-code and wonder if it's even the right choice. Thank you for your help!
First, it seems from your question that you allow selecting the same vector more than once (sampling with replacement). The code below follows your brute-force approach: permutations() from the gtools package generates the index tuples, which are then filtered for sum(Y) >= k and ordered first by smallest sum(Y) and then by sum(X).
X <- c(3, 2, 3, 8, 7, 7, 13, 11, 12, 12)
Y <- c(2, 1, 3, 6, 5, 6, 8, 9, 10, 9)
n<-1
perm<-gtools::permutations(n=length(Y),r=n, repeats.allowed=T)
result<-apply(perm,1,function(x){ c(sum(Y[x]),sum(X[x])) })
dim(result) # 2 10
k=9 ## Case of n=1, k=9
keep<-which(result[1,]>=k)
result[,keep[order(result[1,keep],result[2,keep])[1]]] # 9 and 11
##### n=2 cases ##########
n<-2
perm<-gtools::permutations(n=length(Y),r=n, repeats.allowed=T)
result<-apply(perm,1,function(x){ c(sum(Y[x]),sum(X[x])) })
dim(result) # 2 100
## n=2, k=6
keep<-which(result[1,]>=6)
keep[order(result[1,keep],result[2,keep])[1]] # the 23 permutation
perm[23,] # 3 3 is (Y1,Y2)
result[,keep[order(result[1,keep],result[2,keep])[1]]] # sum(Y)=6 and sum(X)=6
## n=2, k=8
keep<-which(result[1,]>=8)
keep[order(result[1,keep],result[2,keep])[1]] # the 6 permutation
perm[6,] # 1 6 is (Y1,Y2)
result[,keep[order(result[1,keep],result[2,keep])[1]]] # sum(Y)=8 and sum(X)=10
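The question asks for the smallest sum of the xs among all tuples whose ys sum to at least k, so a variant that minimizes sum(X) directly (rather than ordering by sum(Y) first) may be closer to the stated goal. Below is a minimal base-R sketch of that brute force; best_tuples is a hypothetical name, expand.grid() stands in for gtools::permutations(), and the approach enumerates length(Y)^n tuples, so it only suits small inputs.

```r
X <- c(3, 2, 3, 8, 7, 7, 13, 11, 12, 12)
Y <- c(2, 1, 3, 6, 5, 6, 8, 9, 10, 9)

best_tuples <- function(X, Y, n, k) {
  # Every n-tuple of row indexes, repeats allowed
  idx <- as.matrix(expand.grid(rep(list(seq_along(Y)), n)))
  sumY <- rowSums(matrix(Y[idx], nrow = nrow(idx)))
  sumX <- rowSums(matrix(X[idx], nrow = nrow(idx)))
  ok <- which(sumY >= k)                 # tuples that satisfy the condition
  if (length(ok) == 0) return(NULL)      # no tuple reaches k
  best <- ok[which.min(sumX[ok])]        # smallest sum of xs among them
  list(index = unname(idx[best, ]), sumX = sumX[best], sumY = sumY[best])
}

r1 <- best_tuples(X, Y, n = 1, k = 9)   # picks (11, 9): sumX = 11, sumY = 9
r2 <- best_tuples(X, Y, n = 2, k = 6)   # picks (3, 3) twice: sumX = 6, sumY = 6
r3 <- best_tuples(X, Y, n = 2, k = 8)   # sumX = 10; several tuples tie here
```

When several tuples tie on sum(X), which.min() simply returns the first one in enumeration order.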

Implementation of skyline query or efficient frontier

I know there must be an easy answer to this but somehow I can't seem to find it...
I have a data frame with 2 numeric columns.
I would like to remove every row for which at least one other row in the data frame has larger values in both columns.
So if I have
  Col1 Col2
1    2    3
2    4    7
3    5    6
I would like to remove the first row, because the second one fulfills the property and keep only rows 2 and 3.
Thanks a lot!
That problem is called a "skyline query" by database administrators (they may have other algorithms) and an "efficient frontier" by economists.
Plotting the data can make it clear what we are looking for.
n <- 40
d <- data.frame(
  x = rnorm(n),
  y = rnorm(n)
)
# We want the "extreme" points in the following plot
par(mar=c(1,1,1,1))
plot(d, axes=FALSE, xlab="", ylab="")
for(i in 1:n) {
  polygon( c(-10,d$x[i],d$x[i],-10), c(-10,-10,d$y[i],d$y[i]),
           col=rgb(.9,.9,.9,.2))
}
The algorithm is as follows: sort the points along the first coordinate,
keep each observation unless it is worse than the last retained one.
d <- d[ order(d$x, decreasing=TRUE), ]
result <- d[1,]
for(i in seq_len(nrow(d))[-1] ) {
  if( d$y[i] > result$y[nrow(result)] ) {
    result <- rbind(result, d[i,]) # inefficient
  }
}
points(result, cex=3, pch=15)
Edit (2015-03-02): For a more efficient solution, please see Patrick Roocks' rPref, a package for "Database Preferences and Skyline Computation", (also linked to in his answer below). To show that it finds the same solution as my code here, I've appended an example using it to my original answer here.
Riffing off of Vincent Zoonekynd's enlightening response, here's an algorithm that's fully vectorized, and likely more efficient:
set.seed(100)
d <- data.frame(x = rnorm(100), y = rnorm(100))
D <- d[order(d$x, d$y, decreasing=TRUE), ]
res <- D[which(!duplicated(cummax(D$y))), ]
# x y
# 64 2.5819589 0.7946803
# 20 2.3102968 1.6151907
# 95 -0.5302965 1.8952759
# 80 -2.0744048 2.1686003
# And then, if you would prefer the rows to be in
# their original order, just do:
d[sort(as.numeric(rownames(res))), ]
# x y
# 20 2.3102968 1.6151907
# 64 2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759
Or, using the rPref package:
library(rPref)
psel(d, high(x) | high(y))
# x y
# 20 2.3102968 1.6151907
# 64 2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759
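To gain some confidence in the vectorized version, it can be compared against a direct check of the definition (a row is dropped if some other row is strictly larger in both columns). This is a sketch under the assumption that the data has no tied values, as is the case with rnorm(); with ties the two versions can disagree. The names skyline_cummax and skyline_brute are hypothetical.

```r
# Vectorized version from the answer, wrapped in a function
skyline_cummax <- function(d) {
  D <- d[order(d$x, d$y, decreasing = TRUE), ]
  D[!duplicated(cummax(D$y)), ]
}

# Direct O(n^2) check of the definition, for validation only
skyline_brute <- function(d) {
  dominated <- vapply(
    seq_len(nrow(d)),
    function(i) any(d$x > d$x[i] & d$y > d$y[i]),
    logical(1)
  )
  d[!dominated, ]
}

set.seed(1)
d <- data.frame(x = rnorm(50), y = rnorm(50))
fast <- skyline_cummax(d)
slow <- skyline_brute(d)

# Same set of rows, regardless of row order
setequal(rownames(fast), rownames(slow))
```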
Here is an sqldf solution where DF is the data frame of data:
library(sqldf)
sqldf("select * from DF a
  where not exists (
    select * from DF b
    where b.Col1 >= a.Col1 and b.Col2 > a.Col2
       or b.Col1 > a.Col1 and b.Col2 >= a.Col2
  )"
)
This question is pretty old, but there is now a newer solution. I hope it is OK to do some self-promotion here: I developed the package rPref, which computes skylines efficiently using C++ algorithms. With rPref installed, the query from the question can be written as follows (assuming that df is the name of the data set):
library(rPref)
psel(df, high(Col1) | high(Col2))
This removes only those tuples, where some other tuple is better in both dimensions.
If one requires the other tuple to be strictly better in just one dimension (and better or equal in the other dimension), use high(Col1) * high(Col2) instead.
In one line:
d <- matrix(c(2, 3, 4, 7, 5, 6), nrow=3, byrow=TRUE)
d[!apply(d,1,max)<max(apply(d,1,min)),]
     [,1] [,2]
[1,]    4    7
[2,]    5    6
Edit: In light of your clarification in jbaums' response, here's how to check both columns separately.
d <- matrix(c(2, 3, 3, 7, 5, 6, 4, 8), nrow=4, byrow=TRUE)
d[apply(d,1,min)>min(apply(d,1,max)) ,]
     [,1] [,2]
[1,]    5    6
[2,]    4    8
d <- matrix(c(2, 3, 4, 7, 5, 6), nrow=3, byrow=TRUE)
d2 <- sapply(d[, 1], function(x) x < d[, 1]) &
  sapply(d[, 2], function(x) x < d[, 2])
d2 <- apply(d2, 2, any)
result <- d[!d2, ]

Applying function to consecutive subvectors of equal size

I am looking for a nice and fast way of applying some arbitrary function which operates on vectors, such as sum, consecutively to a subvector of consecutive K elements.
Here is one simple example, which should illustrate very clearly what I want:
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
v2 <- myapply(v, sum, group_size=3) # v2 should be equal to c(6, 15, 15)
The function should try to process groups of group_size elements of a given vector and apply a function to each group (treating it as another vector). In this example, the vector v2 is obtained as follows: (1 + 2 + 3) = 6, (4 + 5 + 6) = 15, (7 + 8) = 15. In this case, K did not divide N exactly, so the last group was of size less than K.
If there is a nicer/faster solution which only works if N is a multiple of K, I would also appreciate it.
Try this:
library(zoo)
rollapply(v, 3, by = 3, sum, partial = TRUE, align = "left")
## [1] 6 15 15
or
apply(matrix(c(v, rep(NA, (-length(v)) %% 3)), 3), 2, sum, na.rm = TRUE) # pads nothing when length(v) is a multiple of 3
## [1] 6 15 15
Also, in the case of sum the last one could be shortened to
colSums(matrix(c(v, rep(0, (-length(v)) %% 3)), 3)) # pads nothing when length(v) is a multiple of 3
As @Chase said in a comment, you can create your own grouping variable and then use that. Wrapping that process into a function would look like
myapply <- function(v, fun, group_size=1) {
  unname(tapply(v, (seq_along(v)-1) %/% group_size, fun))
}
which gives your results
> myapply(v, sum, group_size=3)
[1] 6 15 15
Note this does not require the length of v to be a multiple of the group_size.
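The same grouping idea can also be written with split(), which makes the intermediate groups easy to inspect. This is a minimal sketch; group_apply is a hypothetical name, and the vapply() call assumes that fun returns a single number per group.

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8)

group_apply <- function(v, fun, group_size) {
  # Group labels 1, 1, 1, 2, 2, 2, 3, 3 for group_size = 3
  groups <- split(v, ceiling(seq_along(v) / group_size))
  unname(vapply(groups, fun, numeric(1)))
}

sums <- group_apply(v, sum, 3)
sums
# [1]  6 15 15

maxima <- group_apply(v, max, 3)
maxima
# [1] 3 6 8
```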
You could try this as well. This works nicely even if you want to include overlapping intervals, as controlled by by, and as a bonus, returns the intervals over which each value is derived:
library(gtools)
v2 <- running(v, fun=sum, width=3, align="left", allow.fewer=TRUE, by=3)
v2
1:3 4:6 7:8
  6  15  15
