How to choose a row from a matrix which satisfies some criteria - R

        M1                        M2                        M3
   M1_1   M1_2   M1_diff    M2_1   M2_2   M2_diff    M3_1   M3_2   M3_diff
A  55.2   60.8   5.6        66.7   69.8   3.1        58.5   60.3   1.8
B  56.8   55.4   1.4        62.8   63.9   1.1        65.7   69.8   4.1
C  52.3   54.3   2.0        53.8   55.9   1.1        56.7   57.9   1.2
I have to find which of M1, M2, M3 is best for each of A, B, C. The criteria are that Mi_1 and Mi_2 should be as high as possible and Mi_diff as low as possible (i = 1, 2, 3). For example, for id B I have to select one M: B has the lowest diff for M2, so I chose M2 for B; M3 could have been chosen too because of its higher accuracy, but its diff is big. I cannot come up with any general algorithm to do this. We could put a cutoff on the diff values and then choose the M's; for example, with 1.5 as the lower bound for diff, M3 would be best for id B.
The data is quite big, with almost 1000 unique ids, so this cannot be done manually. I was thinking there may be some easy solution I am not seeing. Can anyone please help? I am using R for my computations.
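To make the cutoff idea concrete, here is a minimal sketch (hypothetical code, not part of the original question) that rebuilds the small table above, drops any model whose diff exceeds a chosen cutoff, and picks the remaining model with the highest mean accuracy:
# Hypothetical sketch of the cutoff idea: exclude models whose diff exceeds
# a cutoff, then pick the remaining model with the highest mean accuracy
dat <- data.frame(row.names = c("A", "B", "C"),
                  M1_1 = c(55.2, 56.8, 52.3), M1_2 = c(60.8, 55.4, 54.3), M1_diff = c(5.6, 1.4, 2.0),
                  M2_1 = c(66.7, 62.8, 53.8), M2_2 = c(69.8, 63.9, 55.9), M2_diff = c(3.1, 1.1, 1.1),
                  M3_1 = c(58.5, 65.7, 56.7), M3_2 = c(60.3, 69.8, 57.9), M3_diff = c(1.8, 4.1, 1.2))
cutoff <- 1.5
pick <- apply(dat, 1, function(row) {
  acc   <- c(mean(row[c("M1_1", "M1_2")]),
             mean(row[c("M2_1", "M2_2")]),
             mean(row[c("M3_1", "M3_2")]))
  diffs <- row[c("M1_diff", "M2_diff", "M3_diff")]
  acc[diffs > cutoff] <- -Inf      # disqualify models whose diff is too large
  if (all(is.infinite(acc))) NA else which.max(acc)
})
pick   # 1, 2 or 3 per id (NA if no model passes the cutoff)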

You just need to come up with some equation that satisfies your criteria.
For instance, as you want Mi_1 and Mi_2 to be as high as possible, but their difference Mi_diff to be as low as possible, you may want to maximize:
Mi_1 * Mi_2 / Mi_diff
You can add coefficients to this equation to increase the importance of any of the terms.
In R:
# Set RNG seed for reproducibility
set.seed(12345)

# Generate some data
num.rows <- 1000
df <- data.frame(M1_1 = runif(num.rows, 0, 100),
                 M1_2 = runif(num.rows, 0, 100),
                 M2_1 = runif(num.rows, 0, 100),
                 M2_2 = runif(num.rows, 0, 100),
                 M3_1 = runif(num.rows, 0, 100),
                 M3_2 = runif(num.rows, 0, 100))
df$M1_diff <- abs(df$M1_1 - df$M1_2)
df$M2_diff <- abs(df$M2_1 - df$M2_2)
df$M3_diff <- abs(df$M3_1 - df$M3_2)

# We call apply with 1 as the second argument,
# meaning the function will be applied to each row
res <- apply(df, 1, function(row) {
  # Our criterion, modify at will
  M1_prod <- row["M1_1"] * row["M1_2"] / row["M1_diff"]
  M2_prod <- row["M2_1"] * row["M2_2"] / row["M2_diff"]
  M3_prod <- row["M3_1"] * row["M3_2"] / row["M3_diff"]
  # Which is the maximum? Returns 1, 2 or 3
  which.max(c(M1_prod, M2_prod, M3_prod))
})
And the output:
> head(df)
M1_1 M1_2 M2_1 M2_2 M3_1 M3_2 M1_diff M2_diff M3_diff
1 72.09039 7.7756704 95.32788 43.06881 27.16464 18.089266 64.314719 52.25907 9.075377
2 87.57732 84.3713648 62.17875 86.29595 62.93161 18.878981 3.205954 24.11720 44.052625
3 76.09823 0.6813684 53.16722 25.12324 85.90863 72.700354 75.416864 28.04398 13.208273
4 88.61246 35.1184204 89.20926 76.34523 36.97298 3.062528 53.494036 12.86403 33.910451
5 45.64810 68.6061032 19.58807 69.40719 28.21637 58.466682 22.958007 49.81913 30.250311
6 16.63718 25.4086494 88.43795 73.68140 81.37349 75.001685 8.771471 14.75656 6.371807
> head(res)
[1] 2 1 3 2 1 3
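As a follow-up to the remark about adding coefficients, here is a hedged sketch (the exponents a, b and d are hypothetical tuning weights, not part of the original answer) of one way the criterion could be weighted:
# Sketch: exponents weight the terms; a larger d makes a small diff matter more,
# larger a and b make high accuracy values matter more
a <- 1; b <- 1; d <- 2
res_w <- apply(df, 1, function(row) {
  scores <- c(row["M1_1"]^a * row["M1_2"]^b / row["M1_diff"]^d,
              row["M2_1"]^a * row["M2_2"]^b / row["M2_diff"]^d,
              row["M3_1"]^a * row["M3_2"]^b / row["M3_diff"]^d)
  which.max(scores)
})
head(res_w)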

Related

Is there an efficient way to calculate percentiles from pre-aggregated data (R)?

First: This is my first question here and I'm relatively new to R, too. So I'm sorry if this is a stupid question or the wrong way to ask it.
I have a data frame like this:
df <- data.frame(Website = c("A", "A", "A", "B", "B", "B"),
                 seconds = c(1, 12, 40, 3, 5, 14),
                 visitors = c(200000, 100000, 12000, 250000, 180000, 90000))
> df
Website seconds visitors
A 1 200000
A 12 100000
A 40 12000
B 3 250000
B 5 180000
B 14 90000
How to interpret the data: Website A has 200000 visitors who have been on the website for only 1 second, 100000 visitors for 12 seconds and so on.
In reality, the data has about hundred different websites, each with seconds ranging from 0 to about 900 (and a high number of visitors respectively).
Now, I want to calculate percentiles or at least quartiles for the visiting duration (for each website).
I already found and tried this solution here: https://stackoverflow.com/a/53882909
However, this solution is very inefficient as it results in a data frame with several million rows (and a very long processing time).
My question now: Is there a faster (more efficient way) to calculate percentiles from such pre-aggregated data?
I believe this will be faster. First make a function to compute the quantiles you specify. Then split the data into a list and use sapply:
quant <- function(x, p = c(.25, .50, .75)) {
  v <- c(0, cumsum(x$visitors) / sum(x$visitors))
  s <- c(0, x$seconds)
  approx(v, s, p)$y
}
df.split <- split(df, df$Website)
p <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
stats <- t(sapply(df.split, quant, p=p))
colnames(stats) <- as.character(p)
round(stats, 1)
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# A 0.2 0.3 0.5 0.6 0.8 0.9 3.0 6.5 9.9
# B 0.6 1.2 1.9 2.5 3.1 3.7 4.3 4.8 8.8
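As a quick sanity check (not part of the original answer), one entry of this table can be reproduced by hand; the 0.7 quantile for Website A comes from interpolating seconds on the cumulative proportion of visitors:
# Manual check of the A / 0.7 entry, using the Website A rows shown above
v <- c(0, cumsum(c(200000, 100000, 12000)) / 312000)   # 0, 0.64, 0.96, 1
s <- c(0, 1, 12, 40)
approx(v, s, 0.7)$y   # approximately 3.0, matching the table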
To see better what is going on, here is a plot showing the data for Website A:
test1 <- df[1:3, ]
test1$cumvis <- cumsum(test1$visitors)
barplot(test1$seconds, test1$visitors, space = 0, xlim = c(0, 325000))
axis(1, seq(0, 300000, 50000),
     c("0", "50K", "100K", "150K", "200K", "250K", "300K"), xpd = NA)
axis(3, seq(0, sum(test1$visitors), by = 31200), seq(0, 1, by = .1), lty = 1)
lines(c(0, test1$cumvis), c(0, test1$seconds), col = "red", lwd = 2)
lines(c(0, test1$cumvis - .5 * test1$visitors, tail(test1$cumvis, 1)),
      c(0, test1$seconds, tail(test1$seconds, 1)), col = "blue", lwd = 2)
The plot shows the data as grey rectangles. The bottom x-axis shows the cumulative number of visits and the top x-axis shows the cumulative proportion. We can treat the rectangles as the distribution, or we can assume that the rectangles are a sample that approximates the underlying distribution. My suggested solution takes the red line and uses the approx function to interpolate linearly between the data points, estimating the number of seconds along that curve.
The same approach can be used with a different definition of the curve in which the data points are placed in the middle of each rectangle, the blue curve. I'll provide code for that approach as well. It is also possible to estimate the quantiles from the original data without replicating it.
First a function to estimate the quantiles along the blue line:
quant2 <- function(x, p = c(.25, .50, .75)) {
  v <- c(0, (cumsum(x$visitors) - .5 * x$visitors) / sum(x$visitors), 1)
  s <- c(0, x$seconds, tail(x$seconds, 1))
  approx(v, s, p)$y
}
p <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
stats <- t(sapply(df.split, quant2, p=p))
colnames(stats) <- as.character(p)
round(stats, 1)
#   0.1 0.2 0.3 0.4 0.5 0.6 0.7  0.8  0.9
# A 0.3 0.6 0.9 2.8 5.1 7.4 9.7 12.0 27.4
# B 1.2 2.5 3.3 3.8 4.3 4.7 6.6 10.1 13.5
The estimates are higher because the blue line is above the red line.
Finally, we can simply use the rectangles without any interpolation. Basically we set breaks at the boundaries of the data points and use those to identify which proportions fall in which groups of observations (seconds).
quant3 <- function(x, p = c(.25, .50, .75)){
  v <- c(0, cumsum(x$visitors) / sum(x$visitors))
  limits <- cut(p, breaks = v, include.lowest = TRUE, labels = x$seconds)
  as.numeric(as.character(limits))
}
p <- 0:10/10
stats <- t(sapply(df.split, quant3, p=p))
colnames(stats) <- as.character(p)
stats
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
# A 1 1 1 1 1 1 1 12 12 12 40
# B 3 3 3 3 3 5 5 5 5 14 14
So for website A, 1 second is the value for quantiles 0 - .6.
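For completeness, here is a hedged sketch (not from the original answer) of the same step-function idea without cut(), using findInterval() on the cumulative proportions:
# Sketch of an equivalent step-function quantile: count how many cumulative
# proportions lie strictly below each p, then index the corresponding seconds
quant4 <- function(x, p = c(.25, .50, .75)) {
  v <- cumsum(x$visitors) / sum(x$visitors)
  x$seconds[findInterval(p, v, left.open = TRUE) + 1]
}
t(sapply(df.split, quant4, p = 0:10/10))   # should match the quant3 table above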

Use apply() on a 1-dim vector to find the best threshold

My current mission: pick some "good" columns from an incomplete matrix, trying to remove NAs while keeping real data.
My idea: I can calculate every column's NA percentage. For a given threshold t, all columns with NA% > t will be removed. The removed columns also contain some real data, and the ratio present/missing in these columns shows the "price" of deleting them. My idea is to search for the lowest "price" that deletes as many NAs as possible, for each dataset.
I have already written my functions up to the last two steps:
myfunc1 <- function(x){
  return(sum(is.na(x)))
}
myfunc2 <- function(x){
  return(round(myfunc1(x) / length(x), 4))
}
myfunc3 <- function(t, set){
  m <- which(apply(set, MARGIN = 2, myfunc2) > t)
  missed <- sum(is.na(set[m]))
  present <- sum(!is.na(set[m]))
  return(present / missed)
}
myfunc3(0.5, setA) # worked
threshold <- seq(from = 0, to = 0.95, by = 0.05)
apply(X = threshold, MARGIN = 1, FUN = myfunc3, set = setA) # did not work; stuck here
I have 10 datasets, setA to setJ, and I want to test all thresholds from 0 to 0.95. I would like a matrix as the return value, with the 10 datasets as columns and 20 threshold rows at 0.05 intervals.
Did I do this correctly? Are there better ideas, or existing libraries that I could use?
----------edit: example-----------
setA <- data.frame(cbind(c(1, 2, 3, 4, NA, 6, 7, NA),
                         c(1, 2, NA, 4, 5, NA, NA, 8),
                         c(1, 2, 3, 4, 5, 6, NA, 8),
                         c(1, 2, 3, 4, 5, 6, 7, 8),
                         c(NA, NA, NA, 4, NA, 6, NA, NA)))
colnames(setA) <- sprintf("col%s", 1:5)
rownames(setA) <- sprintf("sample%s", 1:8)
View(setA)
myfunc1 <- function(x){
  return(sum(is.na(x)))
}
myfunc2 <- function(x){
  return(round(myfunc1(x) / length(x), 4))
}
myfunc3 <- function(t, set){
  m <- which(apply(set, MARGIN = 2, myfunc2) > t)
  missed <- sum(is.na(set[m]))
  present <- sum(!is.na(set[m]))
  return(present / missed)
}
In setA there are 8 samples, each described by 5 attributes. Unfortunately, some data are missing, so I need to delete the columns with too many NAs. First, let me calculate every column's NA percentage:
> apply(setA, MARGIN = 2, myfunc2)
col1 col2 col3 col4 col5
0.250 0.375 0.125 0.000 0.750
If I set the threshold t = 0.3, that means col2 and col5 are considered to have too many NAs and need to be deleted; the others are acceptable. If I delete these 2 columns, I also delete some real data. (I delete 7 real values and 9 NAs, 7/9 = 0.78, meaning I sacrifice 0.78 real values for every NA I delete.)
> myfunc3(0.3, setA)
[1] 0.7777778
I want to try every threshold's result and then decide.
threshold <- seq(from = 0, to = 0.9, by = 0.1)
apply(X = threshold, MARGIN = 1, FUN = myfunc3, set = setA) # does not work
I manually calculated the setA part:
threshold: 0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8  0.9
price:     1.667 1.667 1.182 0.778 0.333 0.333 0.333 0.333 NaN  NaN
In the end I want a table like:
threshold: 0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8  0.9
setA:      1.667 1.667 1.182 0.778 0.333 0.333 0.333 0.333 NaN  NaN
setB:
setC:
...
setJ:
Am I approaching the problem the right way?
-----------Edit---------------
I have already solved the problem; please close the thread.
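For reference, a minimal sketch (hypothetical, not the OP's actual solution) of how the intended thresholds-by-datasets table could be produced: sapply() iterates over a plain vector, whereas apply() needs an object with dimensions, which is why the apply(threshold, ...) calls above fail.
# Hypothetical sketch: sweep the thresholds with sapply(), and wrap an outer
# sapply() over a named list of datasets to get thresholds as rows, datasets as columns
thresholds <- seq(from = 0, to = 0.95, by = 0.05)
sets <- list(setA = setA)   # extend with setB, ..., setJ when available
price <- sapply(sets, function(s) sapply(thresholds, myfunc3, set = s))
rownames(price) <- thresholds
price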

Inner-loop behavior of a parallelized structure using R

I want to understand how my parallelization works when there is a for-loop inside the structure that I am parallelizing.
I have a routine called reg_simulation(), which generates 100 estimates (nrep = 100) of a linear regression, each of them using a different seed (seed <- seed + i).
Additionally, I wrapped the reg_simulation() routine inside par_wrapper() to run it under different configurations of the data generating process, in particular changing the number of observations (obs) and the error term variance (sigma). Finally, I parallelized this structure using pblapply.
With this setup, I am using a grid of obs = c(250, 500, 750, 1000, 2500) and
sigma = c(0.1, 0.2, 0.5, 0.8, 1), i.e. 5 values for each variable, leading to 25 combinations of the two variables. Each of these 25 combinations is then run 100 times.
Finally, here is my question: is my code
(a) running the 25 combinations in parallel but the 100 repetitions inside each of them serially, or
(b) running all 2500 models in parallel?
If the answer is (a), please let me know how you arrived at that conclusion, because I haven't been able to sort it out yet, and it probably implies that I should change my code structure.
Some additional comments: (1) The seed declaration on each iteration is important because it allows me to recover each possible combination of the data (e.g., iteration 78 (seed = 78), with sigma = 0.1 and obs = 1000). (2) I am using pblapply because I want to track the progress of my simulations.
Here are the aforementioned routines:
reg_simulation()
reg_simulation <- function(obs = 1000,
                           sigma = 0.5,
                           nrep = 10,
                           seed = 0){
  # pre-allocate the result list
  res <- vector("list", nrep)
  # for-loop over repetitions
  for (i in 1:nrep) {
    # changing seed each iteration
    seed <- seed + i
    # set seed
    set.seed(seed)
    # DGP
    x1 <- rnorm(obs, 0, sigma)
    x2 <- rnorm(obs, 0, sigma)
    y <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs, 0, 1)
    # estimate OLS
    ols <- lm(y ~ x1 + x2)
    returnlist <- list(intercept = ols$coefficients[1],
                       beta1 = ols$coefficients[2],
                       beta2 = ols$coefficients[3],
                       seed = seed)
    # save each iteration
    res[[i]] <- returnlist
  }
  return(res)
}
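A quick standalone check of the routine (hypothetical usage with small sizes, not part of the original post) could look like this:
# Hypothetical quick check: two repetitions on a small sample
one_run <- reg_simulation(obs = 100, sigma = 0.5, nrep = 2, seed = 0)
str(one_run, max.level = 2)   # a list of 2 iterations, each holding coefficients and the seed used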
par_wrapper()
### parallel wrapper
par_wrapper <- function(obs = c(250, 500, 750, 1000, 2500),
                        sigma = c(0.1, 0.2, 0.5, 0.8, 1),
                        nrep = 10,
                        nClusters = 4)
{
  require(parallel)
  require(pbapply)
  # grid of the searching space
  prs <- expand.grid(obs = obs,
                     sigma = sigma)
  nprs <- nrow(prs)
  rownames(prs) <- c(1:NROW(prs))
  # print the combinations
  print(prs)
  #### ---- PARALLEL INIT ---- ####
  ## parallel options
  cl <- makeCluster(nClusters)
  ## attaching necessary functions for internal computations
  parallel::clusterExport(cl = cl,
                          list("reg_simulation"))
  # pblapply
  par_simres <- pblapply(cl = cl,
                         X = 1:nprs,
                         FUN = function(i){
                           reg_simulation(
                             sigma = prs$sigma[i],
                             obs = prs$obs[i],
                             nrep = nrep,
                             seed = 0)})
  ## exit cluster mode
  stopCluster(cl)
  return(par_simres)
}
Using the par_wrapper() function over a grid.
# using the generated structure
res_list <- par_wrapper(
  obs = c(250, 500, 750, 1000, 2500),
  sigma = c(0.1, 0.2, 0.5, 0.8, 1),
  nrep = 100,
  nClusters = 4)
Console output.
# obs sigma
# 1 250 0.1
# 2 500 0.1
# 3 750 0.1
# 4 1000 0.1
# 5 2500 0.1
# 6 250 0.2
# 7 500 0.2
# 8 750 0.2
# 9 1000 0.2
# 10 2500 0.2
# 11 250 0.5
# 12 500 0.5
# 13 750 0.5
# 14 1000 0.5
# 15 2500 0.5
# 16 250 0.8
# 17 500 0.8
# 18 750 0.8
# 19 1000 0.8
# 20 2500 0.8
# 21 250 1.0
# 22 500 1.0
# 23 750 1.0
# 24 1000 1.0
# 25 2500 1.0
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
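For comparison, here is a hedged sketch (an assumption, not part of the original post) of how the grid could be expanded so that every one of the 2500 regressions becomes its own parallel task instead of 25 tasks of 100 serial repetitions each:
# Sketch: expand the grid over the repetition index as well, so each
# parallel task would fit a single regression
prs_full <- expand.grid(obs   = c(250, 500, 750, 1000, 2500),
                        sigma = c(0.1, 0.2, 0.5, 0.8, 1),
                        rep   = 1:100)
nrow(prs_full)   # 2500 tasks
# Inside par_wrapper(), pblapply() would then iterate over seq_len(nrow(prs_full)),
# calling reg_simulation() with nrep = 1 and a seed derived from prs_full$rep[i].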

Loop for Correlation in R

I am trying to find a way to do a nested for loop in R to get every possible correlation combination of this:
cor(y, column1* column2),
cor(y, column1* column3),
cor(y, column1* column4)
and so on
This is what I have tried so far:
for(i in 1:length(dataframe))
{
for(j in 1:length(dataframe))
{
joint_correlation(i,j)=cor(y ~ dataframe(i) * dataframe(j));
}
}
My dataframe has 115 columns like shown with a small sample:
FG_pct FGA FT FT_pct FTA GP GS GmSc MP ORB
0.625 8 0 0.00 0 1 0 6.6 28.4 2
0.500 4 0 0.00 1 2 0 2.1 17.5 0
0.000 1 0 0.00 0 3 0 1.2 6.6 1
0.500 6 0 0.00 0 4 0 3.6 13.7 1
0.500 2 0 0.00 0 5 0 0.9 7.4 1
I want to find the correlation cor(MP, column1 * column2) for every possible pair of columns substituted for column1 and column2. This way, I wouldn't have to do every single one of them separately. If possible, I would like to save the output of each correlation combination (cor(MP, column1 * column2), cor(MP, column1 * column3), cor(MP, column2 * column4), etc.) in a separate column.
This is an example of what I want:
cor(MP, FG_pct*FT_pct)
Edit: Jean-Claude Arbaut gives a better answer in the comments to this answer: use cor(df).
Here is my botched answer: using the corrgram library (which is mainly a visual tool) we can easily get all pairwise correlations in a dataset. Example:
library(corrgram)

# Example data
df <- data.frame(x = rnorm(50, 5, 5),
                 y = rnorm(50, 2, 5))
df$z <- df$x / df$y
df$abc <- df$x * df$y * df$z

# panel arguments are necessary if you want to visualize correlations
corr <- corrgram(df,
                 order = F,
                 lower.panel = panel.cor,
                 upper.panel = panel.pts,
                 text.panel = panel.txt,
                 diag.panel = panel.minmax,
                 main = "Correlation")
#call corr gives
corr
x y z abc
x 1.00000000 0.07064179 0.1402051 0.89166002
y 0.07064179 1.00000000 0.2495239 0.08024278
z 0.14020508 0.24952388 1.0000000 0.14649093
abc 0.89166002 0.08024278 0.1464909 1.00000000
There is absolutely a better way of doing this with functions and without a package, but it's early here, and if you are desperate to get the results this will probably do you fine.
P.S. Using the corrgram() function without assigning its result will give you a nice visualization of your correlations.
Assuming you want the correlation of MP with the product of every combination of two of the remaining columns.
We can find the names of the corresponding combinations using combn() on the remaining column names, which we put into an lapply:
combs <- do.call(cbind.data.frame,
                 lapply("MP", rbind, combn(names(dat)[names(dat) != "MP"], 2)))
combs
# 1 2 3
# 1 MP MP MP
# 2 FG_pct FG_pct FGA
# 3 FGA FT FT
In another lapply we subset the data on the name combinations and calculate cor(x1, x2 * x3). At the same time we store the pasted names in an attribute, to remember later what we've calculated in each iteration.
res.l <- lapply(combs, function(x) {
  `attr<-`(cor(dat[, x[1]], dat[, x[2]] * dat[, x[3]]),
           "what",
           paste0(x[1], ", ", paste(x[2], "*", x[3])))
})
Finally we unlist and setNames according to the attributes.
res <- setNames(unlist(res.l), sapply(res.l, attr, "what"))
res
Result
# MP, FG_pct * FGA MP, FG_pct * FT MP, FGA * FT
# 0.2121374 0.2829003 0.4737892
Check:
(Note that you can put the names, e.g. MP, FG_pct * FGA, directly into the cor function.)
with(dat, cor(MP, FG_pct * FGA))
# [1] 0.2121374
with(dat, cor(MP, FG_pct * FT))
# [1] 0.2829003
with(dat, cor(MP, FGA * FT))
# [1] 0.4737892
To sort, use e.g. sort(res) or rev(sort(res)).
Toy data:
set.seed(42)
dat <- as.data.frame(`colnames<-`(MASS::mvrnorm(n=1e4,
mu=c(0.425, 4.2, 0.2, 3),
Sigma=matrix(c(1, .3, .7, 0,
.3, 1, .5, 0,
.7, .5, 1, 0,
0, 0, 0, 1), nrow=4),
empirical=T), c("FG_pct", "MP", "FGA", "FT")))
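For the full 115-column data set, a hedged sketch (an assumption, not part of the original answer) of the same idea written more directly with combn() and apply():
# Sketch: cor(MP, col_i * col_j) for every pair of non-MP columns at once
other <- setdiff(names(dat), "MP")
pairs <- combn(other, 2)
res_all <- apply(pairs, 2, function(x) cor(dat$MP, dat[[x[1]]] * dat[[x[2]]]))
names(res_all) <- paste("MP,", pairs[1, ], "*", pairs[2, ])
head(sort(res_all, decreasing = TRUE))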

group and average a large numeric vector to plot

I have an R matrix which is very data dense. It has 500,000 rows. If I plot 1:500000 (x axis) against the third column of the matrix, mat[, 3], it takes too long to plot and sometimes even crashes. I've tried plot, matplot, and ggplot, and all of them take very long.
I am looking to group the data by 10 or 20, i.e. take the first 10 elements from the vector, average them, and use that as one data point.
Is there a fast and efficient way to do this?
We can use cut and aggregate to reduce the number of points plotted:
generate some data
set.seed(123)
xmat <- data.frame(x = 1:5e5, y = runif(5e5))
use cut and aggregate
xmat$cutx <- as.numeric(cut(xmat$x, breaks = 5e5/10))
xmat.agg <- aggregate(y ~ cutx, data = xmat, mean)
make plot
plot(xmat.agg, pch = ".")
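For the simple "average every block of 10" case, an alternative sketch (not part of the original answer, and assuming the vector length is an exact multiple of the block size) is to reshape the vector into a matrix and take column means:
# Sketch: block averages without cut()/aggregate(); matrix() fills by column,
# so each column holds one consecutive block of `block` values
block <- 10
y.avg <- colMeans(matrix(xmat$y, nrow = block))
plot(y.avg, pch = ".")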
more than 1 column solution:
Here, we use the data.table package to group and summarize:
generate some more data
set.seed(123)
xmat <- data.frame(x = 1:5e5,
                   u = runif(5e5),
                   z = rnorm(5e5),
                   p = rpois(5e5, lambda = 5),
                   g = rbinom(n = 5e5, size = 1, prob = 0.5))
use data.table
library(data.table)
xmat$cutx <- as.numeric(cut(xmat$x, breaks = 5e5/10))
setDT(xmat) #convert to data.table
#for each level of cutx, take the mean of each column
xmat[,lapply(.SD, mean), by = cutx] -> xmat.agg
# xmat.agg
# cutx x u z p g
# 1: 1 5.5 0.5782475 0.372984058 4.5 0.6
# 2: 2 15.5 0.5233693 0.032501186 4.6 0.8
# 3: 3 25.5 0.6155837 -0.258803746 4.6 0.4
# 4: 4 35.5 0.5378580 0.269690334 4.4 0.8
# 5: 5 45.5 0.3453964 0.312308395 4.8 0.4
# ---
# 49996: 49996 499955.5 0.4872596 0.006631221 5.6 0.4
# 49997: 49997 499965.5 0.5974486 0.022103345 4.6 0.6
# 49998: 49998 499975.5 0.5056578 -0.104263093 4.7 0.6
# 49999: 49999 499985.5 0.3083803 0.386846148 6.8 0.6
# 50000: 50000 499995.5 0.4377497 0.109197095 5.7 0.6
plot it all
par(mfrow = c(2,2))
for(i in 3:6) plot(xmat.agg[,c(1,i), with = F], pch = ".")
