Finding the index of the minimum value which is larger than a threshold in R - r

This is probably very simple, but I'm missing the correct syntax in order to simplify it.
Given a matrix, find the entry in one column which is the lowest value greater than some input parameter. Then, return the entry in a different column on that corresponding row. Not very complicated... and I've found something that works, but a more efficient solution would be greatly appreciated.
I found this link: Better way to find a minimum value that fits a condition?
which is great, but that method of finding the least entry loses the index information required to find the corresponding value in the corresponding row.
Let's say column 2 is the condition column, and column 1 is the one I want to return. Currently I've made this (note that this only works because column two is full of numbers which are less than 1):
matrix[which.max((matrix[,2]>threshhold)/matrix[,2]),1]
Any thoughts? I'm expecting that there is probably some quick and easy function which has this effect... it's just never been introduced to me haha.

rmk's answer shows the basic way to get a lot of info out of your matrix. But if you know which column you're testing for the minimum value (above your threshold), and then want to return a different value in that row, maybe something like
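incol <- df[,4] # select the column to search
outcol <- 2 # select the element of the found row you want to get
threshold <- 5
hits <- which(incol > threshold) # rows whose value exceeds the threshold
df[ hits[which.min(incol[hits])] ,outcol] # row of the smallest such value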

You could try the following. Say,
df <- matrix(sample(1:35,35),7,5)
> df
     [,1] [,2] [,3] [,4] [,5]
[1,]   18   16   27   19   31
[2,]   24    1    7   12    5
[3,]   28   35   23    4    6
[4,]   33    3   25   26   15
[5,]   14   10   11   21   20
[6,]    9    2   32   17   13
[7,]   30    8   29   22   34
Say your threshold is 5:
apply(df,2,function(x){ x[x<5] <- max(x);which.min(x)})
[1] 6 7 2 2 2
Corresponding to the values:
[1] 9 8 7 12 5
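This should give you, for each column, the row index of the smallest entry that is at least the threshold, according to the original indexing of that column. Because the test is x<5, an entry equal to 5 still counts (as in column 5); use x<=5 if the entry must be strictly greater than the threshold.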
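For the original question (one condition column, one column to return), the same idea can be written with which() and which.min() so that the row index is kept. This is only a minimal sketch: the function name value_at_min_above and its argument names are made up for illustration, and returning NA when nothing exceeds the threshold is an assumption about the desired behaviour.

```r
# Hypothetical helper: return m[row, out_col] for the row whose value in
# cond_col is the smallest one strictly greater than threshold.
value_at_min_above <- function(m, cond_col, out_col, threshold) {
  hits <- which(m[, cond_col] > threshold)    # candidate row indices
  if (length(hits) == 0) return(NA)           # nothing above the threshold
  best <- hits[which.min(m[hits, cond_col])]  # map back to the original row
  m[best, out_col]
}

# The example matrix from the answer above, entered column by column
df <- matrix(c(18, 24, 28, 33, 14,  9, 30,
               16,  1, 35,  3, 10,  2,  8,
               27,  7, 23, 25, 11, 32, 29,
               19, 12,  4, 26, 21, 17, 22,
               31,  5,  6, 15, 20, 13, 34), nrow = 7)

# Smallest value above 5 in column 2 is 8 (row 7); column 1 of that row is 30
value_at_min_above(df, cond_col = 2, out_col = 1, threshold = 5)
```

Unlike the which.max() trick in the question, this does not depend on the sign or size of the values in the condition column.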

Related

Percentile rank of column values - R

I am looking for a percentage rank for each value in a column.
It is quite easy in Excel, for example:
=RANK.EQ(A1,$A$1:$A$100,1)/COUNT($A$1:$A$100)
Returns a percent value in a new column that ranks the column I referred to above.
I have no problem finding quantile in R, but have not been able to find anything that accurately gives percentile for every single column value.
Try this using the data in your picture:
> Cost.Per.Kilo <- c(rep(c(6045170, 5412330, 3719760, 3589220), each=2),
3507400)
> Cost.Per.Kilo
[1] 6045170 6045170 5412330 5412330 3719760 3719760 3589220 3589220 3507400
> CPK.rank <- rank(Cost.Per.Kilo, ties.method="min")
> CPK.rank
[1] 8 8 6 6 4 4 2 2 1
> round(CPK.rank/length(CPK.rank) * 100)
[1] 89 89 67 67 44 44 22 22 11
In your picture you seem to have divided the ranks by 10, but there are only 9 values. That is why these percentages do not match.
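If this is needed for several columns, the calculation can be wrapped in a small function. A sketch, assuming the column has no missing values; the name percent_rank_eq is made up here and is not an existing R function:

```r
# Hypothetical helper mirroring RANK.EQ(x, range, 1) / COUNT(range):
# ascending rank, tied values share the lowest rank, divided by the count.
percent_rank_eq <- function(x) {
  rank(x, ties.method = "min") / length(x)
}

Cost.Per.Kilo <- c(rep(c(6045170, 5412330, 3719760, 3589220), each = 2),
                   3507400)
round(percent_rank_eq(Cost.Per.Kilo) * 100)
```

For a data frame, something like sapply(dat, percent_rank_eq) would apply it to every numeric column (dat being a placeholder name).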

Creating large matrices in R in reasonable time

I am working on a movie recommender that predicts a user's rating for an unseen movie. Most of the work is done and I have created a 7000x3000 matrix userRatingsNew containing 7000 users and their ratings for 3000 movies, replacing all the missing values with the predicted rating.
I was provided two other files, mapping and test, and used read.csv() to load them into matrices of the following format.
mapping is a 8,400,000x3 matrix that contains id, user, movie, where id is basically the transaction id associated with a user's rating of movie x.
test is a 8,400,000x2 matrix that contains id, rating, where rating is the user's rating for that movie associated with id. The values in the rating column are empty and I need to fill those in using the predicted values that I have already calculated.
Here is my code
writeResult <- function(userRatingsNew, mapping, test, writeToFile = FALSE){
start <- Sys.time()
result <- test
entries <- nrow(test)
for (i in 1:entries){
result[i,2] <- userRatingsNew[mapping[i,2], mapping[i,3]]
}
if (writeToFile)
write.csv(result, "result.csv", row.names=FALSE)
print(Sys.time()-start)
return(result)
}
My problem is that for i=1:100, it takes ~7 seconds. So in order to process all 8.4 million entries, it'd take ~163 hours. I tried using doMC() and implemented parallel processing, but I ran into the problem where my computer ran out of memory. What exactly can I do to speed this process up?
You can index a matrix with another matrix, as in:
M <- matrix(1:25,nc=5,nr=5)
M
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 6 11 16 21
# [2,] 2 7 12 17 22
# [3,] 3 8 13 18 23
# [4,] 4 9 14 19 24
# [5,] 5 10 15 20 25
m <- cbind(1:5,5:1)
m
# [,1] [,2]
# [1,] 1 5
# [2,] 2 4
# [3,] 3 3
# [4,] 4 2
# [5,] 5 1
M[m]
# [1] 21 17 13 9 5
So try
result[,2] <- userRatingsNew[mapping[,2:3]]
You should not need a loop.
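Here is how that might look end to end on toy data. The sizes and values below are made up; the sketch assumes, as the loop in the question does, that mapping and test list the ids in the same row order. Since read.csv() returns a data frame, the two index columns are converted with as.matrix() first.

```r
# Toy stand-ins for the real objects (sizes and values are invented)
userRatingsNew <- matrix(1:12, nrow = 4, ncol = 3)   # 4 users x 3 movies
mapping <- data.frame(id    = 101:104,
                      user  = c(1, 4, 2, 3),
                      movie = c(3, 1, 2, 2))
test <- data.frame(id = 101:104, rating = NA)

# Two-column numeric matrix of (row, column) pairs
idx <- as.matrix(mapping[, c("user", "movie")])

result <- test
result$rating <- userRatingsNew[idx]   # one vectorised lookup, no loop
result
```

Each row of idx picks exactly one cell of userRatingsNew, so the whole rating column is filled in a single call.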
A thought:
Instead of attaching the 3000-sized movie dimension directly to the 7000-sized user dimension, you can attach to each user an array of 2D data points, each holding a movie's id (its position in the movie array) and the user's rating for it. Presumably most users will not rate all 3000 films. Let's say they rate 20 movies on average, and each of those 20 entries looks up the movie name by referring to its position in the array of names; then you only need about (7000) x (20x2+20) operations, where 20x2 refers to the 20 ratings plus the references to the films, and the other 20 is the retrieval of the film names. You can compile all reports first using array positions, and attach the names afterwards by referring to an array of film names.

R Pooled DataFrame analysis

I'm trying to perform several analyses on subsets of data in a dataframe in R, and I was wondering if there is a generic way of doing this.
Say, I have a dataframe like:
     one two three four
[1,]   1   6    11   16
[2,]   2   7    12   17
[3,]   3   8    11   18
[4,]   4   9    11   19
[5,]   5  10    15   20
How could I apply some computation (e.g. cumulative counting) to the values in col "one", conditional on (grouped by) the value in col "three"?
That is, I wanna do stuff to one column, based upon grouping in another column. I can do this with loops, but I feel there might be standard ways to do this all at once.
thank you in advance!
ddply(data, .(coln), Stat) from the plyr package does the trick exactly.
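If installing a package is not an option, base R's ave() covers the same kind of grouped computation. This is a different technique from ddply, shown only as a sketch on the example data from the question; the new column names are made up:

```r
data <- data.frame(one   = 1:5,
                   two   = 6:10,
                   three = c(11, 12, 11, 11, 15),
                   four  = 16:20)

# Running count of rows within each group of `three`
data$count_in_group <- ave(data$one, data$three, FUN = seq_along)

# Running sum of `one` within each group of `three`
data$cumsum_one <- ave(data$one, data$three, FUN = cumsum)

data
```

ave() returns a vector the same length as its input, in the original row order, so the result can be attached directly as a new column.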

filling matrix with circular pattern

I want to write a function that fills an m by m matrix, where m is odd, as follows:
1) it starts from the middle cell of the matrix (for example, for a 5 by 5 matrix A the middle cell is A[3,3]) and puts the number 1 there
2) it goes one cell forward, adds 1 to the previous cell's value, and puts the result in the second cell
3) it goes down and puts 3, left 4, left 5, up 6, up 7,...
For example, the resulting matrix could look like this:
7 8 9
6 1 2
5 4 3
Could somebody help me implement this?
max_x=5
len=max_x^2
middle=ceiling(max_x/2)
A=matrix(NA,max_x,max_x)
increments=Reduce(
f=function(lhs,rhs) c(lhs,(-1)^(rhs/2+1)*rep(1,rhs)),
x=2*(1:(max_x)),
init=0
)[1:len]
idx_x=Reduce(
f=function(lhs,rhs) c(lhs,rep(c(TRUE,FALSE),each=rhs)),
1:max_x,
init=FALSE
)[1:len]
increments_x=increments
increments_y=increments
increments_x[!idx_x]=0
increments_y[idx_x]=0
A[(middle+cumsum(increments_x)-1)*(max_x)+middle+cumsum(increments_y)]=1:(max_x^2)
Gives
#> A
# [,1] [,2] [,3] [,4] [,5]
#[1,] 21 22 23 24 25
#[2,] 20 7 8 9 10
#[3,] 19 6 1 2 11
#[4,] 18 5 4 3 12
#[5,] 17 16 15 14 13
Explanation:
The vector increments denotes the steps along the path of the increasing numbers. It's either 0/+1/-1 for unchanged/increasing/decreasing row and column indices. Important here is that these numbers do not differentiate between steps along columns and rows. This is managed by the vector idx_x - it masks out increments that are either along a row (TRUE) or a column (FALSE).
The last line takes into account R's indexing logic (matrix index increases along columns).
Edit:
As per request of the OP, here is some more information about how the increments vector is calculated.
You always walk two consecutive straight lines of equal length (one column-wise, one row-wise). The length, however, increases by 1 after you have walked twice. This corresponds to the x=2*(1:(max_x)) argument together with rep(1,rhs). The first two consecutive walks are in increasing column/row direction. Then follow two in negative direction and so on (alternating). This is accounted for by (-1)^(rhs/2+1).
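The same walk (two runs of equal length, then the run length grows by one) can also be written as an explicit loop, which is slower than the vectorised version but may be easier to follow. A sketch; the function name spiral_matrix is made up for illustration:

```r
# Hypothetical loop-based version of the spiral fill described above.
spiral_matrix <- function(m) {
  stopifnot(m %% 2 == 1)              # only defined for odd sizes
  A <- matrix(NA_integer_, m, m)
  i <- j <- (m + 1) / 2               # start in the middle cell
  A[i, j] <- 1L
  d_row <- c(0, 1, 0, -1)             # direction order: right, down, left, up
  d_col <- c(1, 0, -1, 0)
  value <- 1L
  direction <- 1
  run <- 1                            # current run length
  while (value < m^2) {
    for (leg in 1:2) {                # two runs per run length
      for (step in seq_len(run)) {
        if (value >= m^2) break       # matrix is full, stop walking
        i <- i + d_row[direction]
        j <- j + d_col[direction]
        value <- value + 1L
        A[i, j] <- value
      }
      direction <- direction %% 4 + 1 # turn clockwise
    }
    run <- run + 1                    # run length grows after two runs
  }
  A
}

spiral_matrix(3)
```

For m = 3 this reproduces the 3 by 3 example from the question, and for m = 5 it gives the same matrix as the vectorised code above.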

Perform 'cross product' of two vectors, but with addition

I am trying to use R to perform an operation (ideally with similarly displayed output) such as
> x<-1:6
> y<-1:6
> x%o%y
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
[2,] 2 4 6 8 10 12
[3,] 3 6 9 12 15 18
[4,] 4 8 12 16 20 24
[5,] 5 10 15 20 25 30
[6,] 6 12 18 24 30 36
where each entry is found through addition not multiplication.
I would also be interested in creating the 36 ordered pairs (1,1) , (1,2), etc...
Furthermore, I want to use another vector like
z<-1:4
to create all the ordered triplets possible between x, y, and z.
I am using R to look into the likelihoods of the possible totals when rolling dice with varied numbers of sides.
Thank you for all your help! This site has been a big help to me. I appreciate anyone that takes the time to answer a stranger's question.
UPDATE: So I found that outer(x,y,'+') will do what I wanted first. But I still don't know how to create ordered pairs or ordered triplets.
Your first question is easily handled by outer:
outer(1:6,1:6,"+")
For the others, I suggest you try expand.grid, although there are specialized combination and permutation functions out there as well if you do a little searching.
expand.grid can answer your second question:
expand.grid(1:6,1:6)
expand.grid(1:6,1:6,1:4)
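Since the stated goal is the likelihood of each total, the rows of expand.grid can be summed and tabulated. A sketch, assuming fair dice (every ordered triplet equally likely) and the die sizes 6, 6 and 4 from the question; the column names x, y, z are chosen here for readability:

```r
# All ordered triplets for two six-sided dice and one four-sided die
rolls <- expand.grid(x = 1:6, y = 1:6, z = 1:4)

# Total shown by each triplet
totals <- rowSums(rolls)

# Probability of each total: count of triplets giving it / number of triplets
probs <- table(totals) / nrow(rolls)
round(probs, 3)
```

There are 6 * 6 * 4 = 144 triplets, and the totals range from 3 to 16.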

Resources