r create matrix from repeat loop output - r

For each value n in some vector N, I want to compute the percentage of values that exceed n for each variable in my data frame T.
Consider the following input data frame:
T <- data.frame(A=c(0.1,0.2,0.3), B=c(0.3,0.3,0.9),C=c(1,0.5,0))
T
# A B C
# 1 0.1 0.3 1.0
# 2 0.2 0.3 0.5
# 3 0.3 0.9 0.0
I would like the output to be a matrix that looks something like this:
A B C
n=0.1 66.6 100 66.6
n=0.2 33.3 100 66.6
My current implementation is not working:
n <- 0.8
repeat {
Tlogic <- T > n
TU <- as.matrix(apply(Tlogic,2,sum))
q = NULL
for (i in seq(along=TU[,1]))
{
percent <- (TU[i]/nrow(T))*100
q = c(q, percent)
}
n <- n - 0.05;
print(n);
if(log(n) < -6) break
}

Basically you're asking, for each value n in some vector N, to compute the percentage of values in each column of T that exceed n.
You can actually do this in one line in R by moving from a solution that writes out loops to one that uses the *apply functions in R:
N <- c(0.1, 0.2)
do.call(rbind, lapply(N, function(n) c(n=n, 100*colMeans(T > n))))
# n A B C
# [1,] 0.1 66.66667 100 66.66667
# [2,] 0.2 33.33333 100 66.66667
For each value n in N, the call lapply(N, function(n) c(n=n, 100*colMeans(T > n))) computes a vector that indicates n as well as the percentage of values in each column of T that exceed n. Then do.call(rbind, ...) groups all of these together into a final output matrix.
In your case, you want N to form a decreasing sequence (by 0.05 each step) from 0.8 until log(n) < -6. You can get the N vector in this case with:
N <- seq(.8, 0, -.05)
N <- N[log(N) >= -6]
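If you want the row labels from the question (n=0.1, n=0.2) instead of an n column, here is a minimal sketch of one way to do it. It assumes the same data as above; the name dat is my own stand-in for T, chosen so the built-in shorthand T (for TRUE) is not shadowed.

```r
# Same data as in the question, renamed to `dat` (hypothetical name) so that
# the built-in shorthand T (for TRUE) is not shadowed.
dat <- data.frame(A = c(0.1, 0.2, 0.3), B = c(0.3, 0.3, 0.9), C = c(1, 0.5, 0))
N <- c(0.1, 0.2)

# sapply() returns one column of percentages per threshold;
# transpose so that thresholds become rows and variables become columns.
pct <- t(sapply(N, function(n) 100 * colMeans(dat > n)))
rownames(pct) <- paste0("n=", N)
round(pct, 1)
```

The result is a plain numeric matrix with the variables as column names, so it can be indexed as pct["n=0.2", "A"].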

Related

How to do large combinations with condition in R efficiently?

Survey shows average score of 4.2 out of 5, with sample size of 14. How do I create a dataframe that provides a combination of results to achieve score of 4.2?
I tried this but it got too big
library(tidyverse)
n <- 14
avg <- 4.2
df <- expand.grid(rep(list(c(1:5)),n))
df <- df %>%
rowwise() %>%
mutate(ave = mean(c_across(everything()))) %>%
filter(ave >= 4)
The aim is, given the limited information above, to see the distribution of combinations of individual scores: which combinations are more likely to occur, and how many low and high scores are needed to reach that average.
Thanks!
If you can tolerate doing this randomly, then
set.seed(42) # only so that you get the same results I show here
n <- 14
iter <- 1000000
scores <- integer(0)
while (iter > 0) {
tmp <- sample(1:5, size = n, replace = TRUE)
if (mean(tmp) > 4) {
scores <- tmp
break
}
iter <- iter - 1
}
mean(scores)
# [1] 4.142857
scores
# [1] 5 3 5 5 5 3 3 5 5 2 5 5 4 3
Notes:
The reason I use iter in there is to preclude the possibility of an "infinite" loop. While here it reacts rather quickly and is highly unlikely to go that far, if you change the conditions then it is possible your conditions could be infeasible or just highly improbable. If you don't need this, then remove iter and use instead while (TRUE) ...; you can always interrupt R with Escape (or whichever mechanism your IDE provides).
The reason I prefill scores with an empty vector and use tmp is so that you won't accidentally assume that scores having values means you have your average. That is, if the constraints are too tight, then you should find nothing, and therefore scores should not have values.
FYI: if you're looking for an average of 4.2, two things to note:
change the conditional to be what you need, such as looking for 4.2 ... but ...
looking for floating-point equality is going to bite you hard (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). I suggest looking within a tolerance, perhaps
tol <- 0.02
# ...
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
# ...
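where tol is some meaningful number. Don't make it too small, though: the mean of 14 integer votes is always a multiple of 1/14, and the closest achievable value to 4.2 is 59/14 ≈ 4.2143, which is about 0.0143 away. So no combination of 14 votes (of 1 to 5) can produce a mean within tol = 0.01 of 4.2, whatever the seed or iter limit: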
set.seed(42)
n <- 14
iter <- 100000
scores <- integer(0)
tol <- 0.01
while (iter > 0) {
tmp <- sample(1:5, size = n, replace = TRUE)
# if (mean(tmp) > 4) {
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
iter <- iter - 1
}
iter
# [1] 0 # <-- this means the loop exited on the iteration-limit, not something found
scores
# integer(0)
If you instead set tol = 0.02 then you will find something:
tol <- 0.02
# ...
scores
# [1] 4 4 4 4 4 5 4 5 5 5 3 4 3 5
mean(scores)
# [1] 4.214286
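To pick a sensible tol up front, you can list every mean that 14 integer votes can produce and see how close the nearest one gets to 4.2. This is just a quick sketch; possible_means, closest and gap are names I made up for illustration.

```r
n <- 14

# Every achievable mean: the total score ranges from 14 (all 1s) to 70 (all 5s).
possible_means <- ((n * 1):(n * 5)) / n

# The achievable mean nearest to the reported 4.2, and how far off it is.
closest <- possible_means[which.min(abs(possible_means - 4.2))]
gap <- abs(closest - 4.2)

closest  # 59/14, about 4.214286
gap      # about 0.0143, so tol must be larger than this
```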
You can try the code below
n <- 14
avg <- 4.2
repeat{
x <- sample(1:5, n, replace = TRUE)
if (sum(x) == round(avg * n)) break
}
and you will see
> x
[1] 5 5 5 5 5 5 4 5 5 4 1 5 1 4
> mean(x)
[1] 4.214286
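Since the question asks about the distribution of combinations rather than a single draw, here is a hedged sketch of an exact enumeration. Instead of expanding all 5^14 vote sequences, it enumerates how many respondents gave each score (far fewer cases) and counts the orderings of each pattern with a multinomial coefficient. It assumes, like the answer above, that the target total is round(4.2 * 14) = 59, and that every vote sequence is equally likely a priori; the names grid, hits and ways are my own.

```r
n <- 14
target_sum <- round(4.2 * n)  # 59

# Counts of respondents giving scores 1..4; score 5 takes whoever is left.
grid <- expand.grid(c1 = 0:n, c2 = 0:n, c3 = 0:n, c4 = 0:n)
grid <- grid[rowSums(grid) <= n, ]
grid$c5 <- n - rowSums(grid[, c("c1", "c2", "c3", "c4")])

# Keep only the count patterns whose total score hits the target.
score_sum <- drop(as.matrix(grid) %*% 1:5)
hits <- grid[score_sum == target_sum, ]

# Number of distinct vote sequences behind each pattern (multinomial coefficient).
hits$ways <- apply(hits, 1, function(k) round(factorial(n) / prod(factorial(k))))

# Most likely patterns first, under the equal-likelihood assumption.
hits <- hits[order(-hits$ways), ]
head(hits)
```

Each row of hits answers the "how many low scores and how many high scores" part directly, and ways / sum(ways) gives the relative likelihood of each pattern under that assumption.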

normal distribution calculation error - Non-numeric argument to mathematical function

I have 2 data frames and I am applying pnorm() and qnorm() to them, but I get an error during the calculation.
n <- c(0.3,0.5,0.1,0.2)
m <- c(0.1,0.4,0.5,0.3)
o <- c(0.2,0.2,0.2,0.4)
p <- c(0.3,0.1,0.3,0.3)
df1 = data.frame(n,m,o,p)
df1
n m o p
1 0.3 0.1 0.2 0.3
2 0.5 0.4 0.2 0.1
3 0.1 0.5 0.2 0.3
4 0.2 0.3 0.4 0.3
r <- c(0.2,0.4,0.1,0.3)
df2 = rbind.data.frame(r)
df2
X2 X4 X1 X3
1 0.2 0.4 0.1 0.3
b <- 0.15
result <- pnorm((qnorm(df1)+sqrt(b)*df2)/sqrt(1-b))
This produces an error:
Error in qnorm(df1) : Non-numeric argument to mathematical function
Expected output:
0.3139178 0.110853 0.1919158 0.3289671
0.5334785 0.4574897 0.1919158 0.1031127
0.0957727 0.5667216 0.1919158 0.3289671
0.2035948 0.3442989 0.4079641 0.3289671
Actually, I have these 2 data frames, df1 and df2, in Excel, along with an Excel formula that I need to convert to R:
=NORMSDIST((NORMSINV(A1)+SQRT(0.15)*H1)/SQRT(1-0.15))
Here A1 is the first value of df1 (and so on), and H1 is the first value of df2 (and so on).
What you're trying to do is: apply a function to every row in df1. To do so we need to write a function.
getDist <- function(x, b = 0.15) {
pnormInput <- as.numeric((qnorm(as.numeric(x)) + sqrt(b) * df2) / sqrt(1 - b))
pnorm(pnormInput)
}
Next we apply this function to every row in df1 (using apply).
result <- apply(df1, 1, function(x) getDist(x))
Next we have to transpose result (flip the table we got).
result <- t(result)
# [,1] [,2] [,3] [,4]
# [1,] 0.3139178 0.1108530 0.1919158 0.3289671
# [2,] 0.5334785 0.4574897 0.1919158 0.1031127
# [3,] 0.0957727 0.5667216 0.1919158 0.3289671
# [4,] 0.2035948 0.3442989 0.4079641 0.3289671
I think this is a classic case of trying to do many operations in one line and losing track of what every function is doing. My answer is essentially the same as @PoGibas', but a bit more explicit and less elegant.
I'll calculate the terms separately and then combine them again afterwards:
num1 <- apply(df1, 1, qnorm) # Apply 'qnorm' to each row of df1; the result is a matrix with one column per row of df1
num2 <- sqrt(b) * r # Multiply the vector r by the constant sqrt(b)
num <- sweep(num1, 1, num2, "+") # Add num2[j] to every element in row j of the matrix num1
den <- sqrt(1-b) # den is a constant
result <- pnorm(num/den) # num is a matrix, which is divided elementwise by the constant den
t(result)
By doing the operations step-by-step, you will often have a much easier time finding the source of an error.
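For what it's worth, the original error arises because qnorm() is handed a data frame (a list of columns) rather than a numeric vector or matrix. A minimal sketch that avoids both the row-wise apply() and the transpose is to convert df1 to a matrix first and add the scaled r to each column with sweep(). It assumes, as the expected output implies, that r[j] belongs to column j of df1.

```r
df1 <- data.frame(n = c(0.3, 0.5, 0.1, 0.2),
                  m = c(0.1, 0.4, 0.5, 0.3),
                  o = c(0.2, 0.2, 0.2, 0.4),
                  p = c(0.3, 0.1, 0.3, 0.3))
r <- c(0.2, 0.4, 0.1, 0.3)
b <- 0.15

# qnorm() keeps the dimensions of a numeric matrix, so no apply() is needed.
q <- qnorm(as.matrix(df1))

# Add sqrt(b) * r[j] to every element of column j (MARGIN = 2).
num <- sweep(q, 2, sqrt(b) * r, "+")

result <- pnorm(num / sqrt(1 - b))
round(result, 7)
```

Because the columns stay in their original orientation, no t() is required at the end.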

How to find the maximum value within a loop in R

I have an expression
qbinom(0.05, n, .47) - 1
and I want to create a loop which iterates this expression over n, for n from 20 to 200. For each iteration of this loop, the expression will produce a number. I want to take the maximum of the 181 numbers it will produce. So, something like:
for (n in 20:200) {
max(qbinom(0.05, n, .47)-1)
But I'm not sure how exactly to do this.
Thanks!
First, I will show you how to do this with a loop.
n <- 20:200
MAX = -Inf ## initialize maximum
for (i in 1:length(n)) {
x <- qbinom(0.05, n[i], 0.47) - 1
if (x > MAX) MAX <- x
}
MAX
# [1] 81
Note, I am not keeping a record of all 181 values generated. Each value is treated as a temporary value and will be overwritten in the next iteration. In the end, we only have a single value MAX.
If you also want to retain all the values, first initialize a vector to hold them.
n <- 20:200
MAX = -Inf ## initialize maximum
x <- numeric(length(n)) ## vector to hold record
for (i in 1:length(n)) {
x[i] <- qbinom(0.05, n[i], 0.47) - 1
if (x[i] > MAX) MAX <- x[i]
}
## check the first few values of `x`
head(x)
# [1] 5 5 6 6 6 7
MAX
# [1] 81
Now for the vectorized solution.
max(qbinom(0.05, 20:200, 0.47) - 1)
# [1] 81
R functions related to probability distributions are vectorized in the same fashion. For those related to binomial distributions, you can read ?rbinom for details.
Note, the vectorization is achieved via the recycling rule. For example, by specifying:
qbinom(0.05, 1:4, 0.47)
R will first do recycling:
p: 0.05 0.05 0.05 0.05
size: 1 2 3 4
prob: 0.47 0.47 0.47 0.47
then evaluate
qbinom(p[i], size[i], prob[i])
via a C-level loop.
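If you want to convince yourself of this, here is a small sketch that spells out the recycling by hand and compares it with the vectorized call. It uses qbinom's documented argument names p, size and prob; vectorized and looped are just illustrative variable names.

```r
p <- 0.05
size <- 1:4
prob <- 0.47

# What you would normally write: shorter arguments are recycled automatically.
vectorized <- qbinom(p, size, prob)

# The same thing with the recycling made explicit, one element at a time.
looped <- mapply(qbinom,
                 p = rep_len(p, length(size)),
                 size = size,
                 prob = rep_len(prob, length(size)))

all.equal(vectorized, looped)
```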
Follow-up
How would I be able to know which of the 20:200 corresponds to the maximum using the vectorization solution?
We can use
x <- qbinom(0.05, 20:200, 0.47) - 1
i <- which.max(x)
i
# [1] 179
Note, i is the position in vector 20:200. To get the n you want, you need:
(20:200)[i]
# [1] 198
The maximum is
x[i]
# [1] 81
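One caveat: which.max() returns only the first position of the maximum. Because qbinom() gives integer-valued results, several n may share the same maximum. A minimal sketch for listing all of them (n_vals, first_n and all_n are names I picked for illustration):

```r
n_vals <- 20:200
x <- qbinom(0.05, n_vals, 0.47) - 1

# First n that attains the maximum (what which.max() reports).
first_n <- n_vals[which.max(x)]

# Every n that attains the maximum, in case of ties.
all_n <- n_vals[x == max(x)]

first_n
all_n
```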

Tabular data to matrix in R

I'm trying to remove the shackles of some legacy code that we use to make decision trees in a retail setting. I got to playing with hclust in R and it's beautiful and I'd like to use it. The heavy lifting for calculating distances is done in SQL and I get an output like this:
main with dist
A A 0.00
A B 1.37
A C 0.64
B B 0
B C 0.1
C C 0
That's loaded as a data frame right now (just reading the SQL query dump), but hclust wants a matrix of distances. E.g.,:
A B C
--+-----------------
A | 0
B | 1.37 0
C | 0.64 0.1 0
My thinking is too procedural and I'm trying to do it in nested loops at the moment. Can someone point me in the direction of something more R-idiomatic to do this?
Thanks!
If you are looking for an actual distance matrix in R, try:
as.dist(xtabs(dist ~ with + main, mydf), diag = TRUE)
# A B C
# A 0.00
# B 1.37 0.00
# C 0.64 0.10 0.00
I'm presuming that the combinations of "main" and "with" are unique, otherwise xtabs would sum the "dist" values.
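Here is a self-contained sketch of the whole pipeline, from the long-format table to hclust. The data frame mydf is reconstructed by hand from the table in the question; in practice it would come from your SQL dump.

```r
# Long-format distances as in the question (reconstructed by hand).
mydf <- data.frame(main = c("A", "A", "A", "B", "B", "C"),
                   with = c("A", "B", "C", "B", "C", "C"),
                   dist = c(0, 1.37, 0.64, 0, 0.1, 0))

# xtabs() puts `with` in rows and `main` in columns, which fills the lower
# triangle; as.dist() reads exactly that triangle.
d <- as.dist(xtabs(dist ~ with + main, mydf))

# hclust() accepts the dist object directly.
hc <- hclust(d)
hc$labels
hc$merge
```

If your dump lists the pairs the other way round (upper triangle filled), swapping with and main in the formula should give as.dist() the triangle it expects.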
I would suggest changing from letters to numbers (which is straightforward using the ASCII codes) and then using the linearized indices of R matrices to access each pair in a vectorized manner.
Minimal example:
N <- 3
d <- data.frame(x = c(1,2), y = c(2,3), v = c(0.1, 0.2))
m <- matrix(0, N, N)
m[(d$y-1)*N+d$x] = d$v
The output is:
[,1] [,2] [,3]
[1,] 0 0.1 0.0
[2,] 0 0.0 0.2
[3,] 0 0.0 0.0
EDIT: To preserve arbitrary strings as row and col names, consider the following example:
codes <- c('A','B','C')
N <- 3
d <- data.frame(x = c('A','B'), y = c('B','C'), v = c(0.1, 0.2))
m <- matrix(0, N, N)
m[(vapply(d$y, function(x) which(codes == x), 0)-1)*N+
vapply(d$x, function(x) which(codes == x), 0)] = d$v
rownames(m) = codes
colnames(m) = codes
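As a possible simplification of the lookup above, match() can translate the labels to positions in one vectorized call, and a two-column index matrix (row, column) avoids the manual linear-index arithmetic. This is only a sketch of the same idea, using the same toy data.

```r
codes <- c("A", "B", "C")
d <- data.frame(x = c("A", "B"), y = c("B", "C"), v = c(0.1, 0.2))

# Square matrix with the labels as row and column names.
m <- matrix(0, length(codes), length(codes), dimnames = list(codes, codes))

# match() maps each label to its position in `codes`;
# a two-column matrix of (row, col) pairs indexes one cell per pair.
m[cbind(match(d$x, codes), match(d$y, codes))] <- d$v
m
```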

Multiply each row of a matrix by the matrix

I am a new R user and I have a kind of algorithm problem. I did some research on the web and on Stack Overflow, but couldn't find an answer.
I have a square matrix, for example:
A B C D
A 0 0 0 1
B 0 1 1 0
C 1 0 0 0
D 0 1 1 1
This matrix represents links between keywords (A, B, C and D here). A '1' (or a TRUE) means keywords are in relation. For example, the '1' on the first row means A is linked to D.
I need to find the two most linked keywords in the matrix. I know I need to compute the scalar product between each row and the initial matrix, then take the row sums and find the maximum.
But what R code would put the product of each row of my matrix with the matrix itself into a new matrix?
Thanks!
I thought I had a cleverer answer but it turns out to be slower ...
tmp1 <- function(a) {
n <- nrow(a)
aa <- apply(array(apply(a,1,"*",a),
rep(n,3)),3,rowSums)
apply(aa,2,which.max)
}
Previous solution:
tmp2 <- function(a) {
n <- nrow(a)
r <- numeric(n)
for(i in seq(n)) {
b <- rowSums(a[i,]*a)
r[i] <- which.max(b)
}
r
}
Test this on something reasonably large:
n <- 50
a <- matrix(0,nrow=n,ncol=n)
a[sample(length(a),size=n^2/5,replace=TRUE)] <- 1
all(tmp1(a)==tmp2(a)) ## TRUE
library(rbenchmark)
benchmark(tmp1(a),tmp2(a))
#      test replications elapsed relative user.self sys.self
# 1 tmp1(a)          100   4.030 9.264368     2.052     1.96
# 2 tmp2(a)          100   0.435 1.000000     0.232     0.20
You will presumably do even better if you can do it in terms of sparse matrices.
Like this?
a=matrix(c(0,0,0,1,0,1,1,0,1,0,0,0,0,1,1,1), ncol=4, byrow=TRUE)
for(i in 1:4){
b = rowSums(a[i,]*a)
print(which(b==max(b)))
}
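One thing to be aware of: a[i,]*a recycles the vector a[i,] down the columns of a, so element [r, c] becomes a[i, r] * a[r, c], which is not the dot product of row i with each other row. If what you actually want is the scalar product between every pair of rows (that is, how many link targets two keywords share), a hedged alternative is tcrossprod(). This rests on my reading of "most linked", which the question does not fully pin down; shared and best are illustrative names.

```r
a <- matrix(c(0, 0, 0, 1,
              0, 1, 1, 0,
              1, 0, 0, 0,
              0, 1, 1, 1),
            ncol = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# shared[i, j] = sum(a[i, ] * a[j, ]): the scalar product of rows i and j.
shared <- tcrossprod(a)

# Ignore each keyword's product with itself.
diag(shared) <- NA

# Row/column positions of the largest off-diagonal entry (both orders appear,
# because the matrix is symmetric).
best <- which(shared == max(shared, na.rm = TRUE), arr.ind = TRUE)
best
```

For this example the largest shared count is between B and D, which both link to B and C.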
