normal distribution calculation error - Non-numeric argument to mathematical function - r

I have two data frames and I am applying pnorm() and qnorm() to them, but I get an error during the calculation.
n <- c(0.3,0.5,0.1,0.2)
m <- c(0.1,0.4,0.5,0.3)
o <- c(0.2,0.2,0.2,0.4)
p <- c(0.3,0.1,0.3,0.3)
df1 = data.frame(n,m,o,p)
df1
n m o p
1 0.3 0.1 0.2 0.3
2 0.5 0.4 0.2 0.1
3 0.1 0.5 0.2 0.3
4 0.2 0.3 0.4 0.3
r <- c(0.2,0.4,0.1,0.3)
df2 = rbind.data.frame(r)
df2
X0.2 X0.4 X0.1 X0.3
1 0.2 0.4 0.1 0.3
b <- 0.15
result <- pnorm((qnorm(df1)+sqrt(b)*df2)/sqrt(1-b))
Instead of the expected output I get an error:
Error in qnorm(df1) : Non-numeric argument to mathematical function
Expected output:
0.3139178 0.110853 0.1919158 0.3289671
0.5334785 0.4574897 0.1919158 0.1031127
0.0957727 0.5667216 0.1919158 0.3289671
0.2035948 0.3442989 0.4079641 0.3289671
Actually, I have these two data frames, df1 and df2, in Excel, and I have an Excel formula which I need to convert into R.
=NORMSDIST((NORMSINV(A1)+SQRT(0.15)*H1)/SQRT(1-0.15))
Here A1 is the first value of df1 and H1 is the first value of df2, and so on.
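In R terms, the error comes from the fact that qnorm() accepts a numeric vector or matrix, but not a data frame. As a reference point, here is a minimal vectorized sketch of that Excel formula, assuming df1, r and b as defined above; sqrt(b) * r is added column-wise, which matches the expected output above:
num <- sweep(qnorm(as.matrix(df1)), 2, sqrt(b) * r, "+")  # NORMSINV(A1) + SQRT(0.15)*H1
pnorm(num / sqrt(1 - b))                                  # NORMSDIST(... / SQRT(1 - 0.15))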

What you're trying to do is apply a function to every row in df1. To do so, we first write a function:
getDist <- function(x, b = 0.15) {
  # df2 is taken from the global environment; x is one row of df1
  pnormInput <- as.numeric((qnorm(as.numeric(x)) + sqrt(b) * df2) / sqrt(1 - b))
  pnorm(pnormInput)
}
Next we apply this function to every row in df1 (using apply).
result <- apply(df1, 1, function(x) getDist(x))
Next we have to transpose result (flip the table we got).
result <- t(result)
# [,1] [,2] [,3] [,4]
# [1,] 0.3139178 0.1108530 0.1919158 0.3289671
# [2,] 0.5334785 0.4574897 0.1919158 0.1031127
# [3,] 0.0957727 0.5667216 0.1919158 0.3289671
# [4,] 0.2035948 0.3442989 0.4079641 0.3289671

I think this is a classic case of trying to do many operations in one line and losing track of what every function is doing. My answer is essentially the same as #PoGibas', but a bit more explicit and less elegant.
I'll calculate the terms separately and then combine them again afterwards:
num1 <- apply(df1, 1, qnorm)     # Apply 'qnorm' row-wise
num2 <- sqrt(b) * r              # Multiply the vector r by the constant sqrt(b)
num <- sweep(num1, 1, num2, "+") # Add the vector num2 row-wise to the matrix num1
den <- sqrt(1 - b)               # den is a constant
result <- pnorm(num / den)       # num is a matrix, which is divided elementwise by the constant den
t(result)
By doing the operations step-by-step, you will often have a much easier time finding the source of an error.

Related

How to create random vectors from another vector?

I am performing calculations with constants and vectors (approximate length = 100) for which I need to simulate normal distributions N (with rnorm). For constants (K, with standard deviation = KU) I use rnorm() in the standard way:
K <- 2
KU <- 0.2
set.seed(123)
KN <- rnorm(n = 3, mean = K, sd = KU)
which gives a vector of length 3 (KN):
[1] 1.887905 1.953965 2.311742
Now, I need to do the same thing with a vector (V, standard deviation VU). My first guess is to use:
V <- c(1, 2, 3)
VU <- 0.1 * V
set.seed(123)
VN <- rnorm(3, V, VU)
but only a vector of 3 elements is produced, one for each vector element:
[1] 0.9439524 1.9539645 3.4676125
This is actually just a single simulation of the vector, but I need 3 of them. One option is to generate 9 numbers instead, but then VN is a single vector of 9 elements:
[1] 0.9439524 1.9539645 3.4676125 1.0070508 2.0258575 3.5145195 1.0460916 1.7469878 2.7939441
not 3 vectors of 3 elements. What I want is VN =
[1] 0.9439524 1.0070508 1.0460916
[2] 1.9539645 2.0258575 1.7469878
[3] 3.4676125 3.5145195 2.7939441
So VN should be 3 vectors which I can subsequently use in other calculations, such as KN * VN. The solution I have found is:
set.seed(123)
VN <- as.data.frame(t(matrix(rnorm(3 * length(V), V, VU), nrow = length(V))))
but in my opinion this is a rather cumbersome expression (which I need to repeat several times in different places with rather long variable names). Is there a simpler way in base R to produce random vectors? I would like to see something like:
VN <- rnorm.vector(3, V, VU)
We can use replicate
set.seed(123)
replicate(3, rnorm(3, V, VU))
# [,1] [,2] [,3]
#[1,] 0.9439524 1.007051 1.046092
#[2,] 1.9539645 2.025858 1.746988
#[3,] 3.4676125 3.514519 2.793944
Or it could be
mapply(rnorm, n = 3, mean = V, sd = VU)
In addition to #akrun's great options, you may also use something slightly simpler than your approach:
matrix(rnorm(n * length(V), V, VU), nrow = n, byrow = TRUE) # with n <- 3 and set.seed(123) this gives the matrix below
# [,1] [,2] [,3]
# [1,] 0.9439524 1.953965 3.467612
# [2,] 1.0070508 2.025858 3.514519
# [3,] 1.0460916 1.746988 2.793944
Or you could use the MASS package, whose mvrnorm lets you sample from a multivariate normal distribution:
library(MASS)
mvrnorm(n, VU, diag(VU))
# [,1] [,2] [,3]
# [1,] 0.6650715 0.37923044 0.05590089
# [2,] 0.2574341 0.24949882 0.97045721
# [3,] -0.5218990 -0.04857971 0.49707815
where
diag(VU)
# [,1] [,2] [,3]
# [1,] 0.1 0.0 0.0
# [2,] 0.0 0.2 0.0
# [3,] 0.0 0.0 0.3
The latter option is the way to go in case you want the variance-covariance matrix not to be diagonal.
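For example, a short sketch of a non-diagonal case, assuming V as the mean vector and VU as the standard deviations from the question, with a purely illustrative common correlation of 0.5 between elements:
library(MASS)
Sigma <- 0.5 * outer(VU, VU) # covariances: 0.5 * VU[i] * VU[j]
diag(Sigma) <- VU^2          # variances on the diagonal
set.seed(123)
mvrnorm(3, mu = V, Sigma = Sigma)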

Calculating standard deviation of variables in a large list in R

I have a large list that contains 1000 lists of the same variables and same length.
My goal is to calculate mean, standard deviation, and standard error of all lists within the large list.
I have calculated mean of the variables using Reduce(), but I couldn't figure out how to do the same for standard deviation.
My list looks something like this:
large.list <- vector('list', 1000)
for (i in 1:1000) {
  large.list[[i]] <- as.data.frame(matrix(c(1:4), ncol = 2))
}
large.list
[[1]]
V1 V2
1 1 3
2 2 4
[[2]]
V1 V2
1 1 3
2 2 4
[[3]]
V1 V2
1 1 3
2 2 4
......
[[1000]]
V1 V2
1 1 3
2 2 4
To calculate mean, I do:
list.mean <- Reduce("+", large.list) / length(large.list)
list.mean
V1 V2
1 1 3
2 2 4
This is an overly simplified version of my large list, but how can I calculate list-wide standard deviation and standard error like I did for the mean?
Thank you very much in advance!
If you stay with Reduce(), you have to do a little bit of statistics:
var(x) = E(x^2) - (E(x))^2
Note that you already have E(x) as list.mean. Getting E(x^2) is also straightforward:
list.squared.mean <- Reduce("+", lapply(large.list, "^", 2)) / length(large.list)
Then variance is:
list.variance <- list.squared.mean - list.mean^2
Standard deviation is then just (note this is the population standard deviation; the sd() used in the tapply() approach below divides by n - 1, so the two approaches differ slightly for non-identical data frames):
list.sd <- sqrt(list.variance)
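The standard error the question also asks for then follows directly; a one-line sketch, assuming list.sd and large.list from above:
list.se <- list.sd / sqrt(length(large.list)) # standard error of the mean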
However, a much more efficient solution is to use tapply()
vec <- unlist(large.list, use.names = FALSE)
DIM <- dim(large.list[[1]])
n <- length(large.list)
list.mean <- tapply(vec, rep(1:prod(DIM),times = n), mean)
attr(list.mean, "dim") <- DIM
list.mean <- as.data.frame(list.mean)
list.sd <- tapply(vec, rep(1:prod(DIM),times = n), sd)
attr(list.sd, "dim") <- DIM
list.sd <- as.data.frame(list.sd)
If I may suggest an alternative, you could transform the list into a 3-dimensional array, and then use apply() to produce the output.
Here's how to transform the list (assuming dimensional regularity):
m <- do.call(cbind,lapply(large.list,as.matrix));
m <- array(m,c(nrow(m),ncol(m)/length(large.list),length(large.list)));
And here's how to use apply() on the array:
apply(m,1:2,mean);
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
apply(m,1:2,sd);
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
Here is a solution based on reshaping the list into a data.table. We are basically extracting the value at index i from each sub-list to create a single vector.
library(data.table)
ll <- unlist(large.list)
DX <- data.table(V1 = ll[c(T,F,F,F)],
                 V2 = ll[c(F,T,F,F)],
                 V3 = ll[c(F,F,T,F)],
                 V4 = ll[c(F,F,F,T)])
Then all calculations are straightforward:
mm <- DX[,lapply(.SD,mean)]
sdd <- DX[,lapply(.SD,sd)]

r create matrix from repeat loop output

For each value n in some vector N, I want to compute the percentage of values that exceed n for each variable in my data frame T.
Consider the following input data frame:
T <- data.frame(A=c(0.1,0.2,0.3), B=c(0.3,0.3,0.9),C=c(1,0.5,0))
T
# A B C
# 1 0.1 0.3 1.0
# 2 0.2 0.3 0.5
# 3 0.3 0.9 0.0
I would like the output to be a matrix that looks something like this:
A B C
n=0.1 66.6 100 66.6
n=0.2 33.3 100 66.6
My current implementation is not working:
n <- 0.8
repeat {
  Tlogic <- T > n
  TU <- as.matrix(apply(Tlogic, 2, sum))
  q = NULL
  for (i in seq(along = TU[, 1]))
  {
    percent <- (TU[i] / nrow(T)) * 100
    q = c(q, percent)
  }
  n <- n - 0.05
  print(n)
  if (log(n) < -6) break
}
Basically you're asking, for each value n in some vector N, to compute the percentage of values in each column of T that exceed n.
You can actually do this in one line in R by moving from a solution that writes out loops to one that uses the *apply functions in R:
N <- c(0.1, 0.2)
do.call(rbind, lapply(N, function(n) c(n=n, 100*colMeans(T > n))))
# n A B C
# [1,] 0.1 66.66667 100 66.66667
# [2,] 0.2 33.33333 100 66.66667
For each value n in N, the call lapply(N, function(n) c(n=n, 100*colMeans(T > n))) computes a vector that indicates n as well as the percentage of values in each column of T that exceed n. Then do.call(rbind, ...) groups all of these together into a final output matrix.
In your case, you want N to form a decreasing sequence (by 0.05 each step) from 0.8 until log(n) < -6. You can get the N vector in this case with:
N <- seq(.8, 0, -.05)
N <- N[log(N) >= -6]
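Putting the pieces together (a short usage sketch with the example data frame T from above):
N <- seq(.8, 0, -.05)
N <- N[log(N) >= -6]
result <- do.call(rbind, lapply(N, function(n) c(n = n, 100 * colMeans(T > n))))
head(result)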

Tabular data to matrix in R

I'm trying to remove the shackles of some legacy code that we use to make decision trees in a retail setting. I got to playing with hclust in R and it's beautiful and I'd like to use it. The heavy lifting for calculating distances is done in SQL and I get an output like this:
main with dist
A A 0.00
A B 1.37
A C 0.64
B B 0
B C 0.1
C C 0
That's loaded as a data frame right now (just reading the SQL query dump), but hclust wants a matrix of distances. E.g.,:
A B C
--+-----------------
A | 0
B | 1.37 0
C | 0.64 0.1 0
My thinking is too procedural and I'm trying to do it in nested loops at the moment. Can someone point me in the direction of something more R-idiomatic to do this?
Thanks!
If you are looking for an actual distance matrix in R, try:
as.dist(xtabs(dist ~ with + main, mydf), diag = TRUE)
# A B C
# A 0.00
# B 1.37 0.00
# C 0.64 0.10 0.00
I'm presuming that the combinations of "main" and "with" are unique, otherwise xtabs would sum the "dist" values.
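For completeness, a sketch of the same call on the question's data, assuming it has been read into a data frame mydf with columns main, with and dist:
mydf <- data.frame(main = c("A", "A", "A", "B", "B", "C"),
                   with = c("A", "B", "C", "B", "C", "C"),
                   dist = c(0, 1.37, 0.64, 0, 0.1, 0))
as.dist(xtabs(dist ~ with + main, mydf), diag = TRUE)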
I would suggest changing from letters to numbers (which is straightforward using the ASCII codes) and then using the linearized indices of R matrices to access each pair in a vectorized manner.
Minimal example:
N <- 3
d <- data.frame(x = c(1, 2), y = c(2, 3), v = c(0.1, 0.2))
m <- matrix(0, N, N)
m[(d$y - 1) * N + d$x] = d$v # linear index of element (row, col) is (col - 1) * N + row
The output is:
[,1] [,2] [,3]
[1,] 0 0.1 0.0
[2,] 0 0.0 0.2
[3,] 0 0.0 0.0
EDIT: To preserve arbitrary strings as row and col names, consider the following example:
codes <- c('A','B','C')
N <- 3
d <- data.frame(x = c('A','B'), y = c('B','C'), v = c(0.1, 0.2))
m <- matrix(0, N, N)
m[(vapply(d$y, function(x) which(codes == x), 0) - 1) * N +
    vapply(d$x, function(x) which(codes == x), 0)] = d$v
rownames(m) = codes
colnames(m) = codes

Pairwise Operations in R

I need to calculate pairwise, consecutive correlations for each of these date variables (there are 246 in my dataset):
Company 2009/08/21 2009/08/24 2009/08/25
A -0.0019531250 -0.0054602184 -6.274510e-03
AA -0.0063291139 -0.0266457680 -1.750199e-02
AAPL 0.0084023598 -0.0055294118 -1.770643e-04 ...
...
So that I can find cor(col1,col2), cor(col2,col3), but nothing for cor(col1,col3). I realize that if I wanted all combinations I could use the combn function, but I can't figure out how to do it for my circumstances without something inefficient like a for loop.
Approach 1
You could do:
lapply(1:(ncol(dat)-1), function(i) cor(dat[, i], dat[, i+1],
use="pairwise.complete.obs"))
Example
A dataframe with 10 variables will give you 9 consecutive correlations, i.e. columns 1-2, 2-3, 3-4 etc, if that is what you want.
dat <- replicate(10, rnorm(10))
lapply(1:(ncol(dat)-1), function(i)
cor(dat[, i], dat[, i+1], use="pairwise.complete.obs"))
Approach 2 (very concise)
Using the iris dataset as well:
dat <- iris[, 1:4]
diag(cor(dat, use="pairwise.complete.obs")[, -1])
[1] -0.1175698 -0.4284401 0.9628654
As you pointed out, combn is the way to go. Assuming your data.frame is called dat, for consecutive columns try this:
ind <- combn(ncol(dat), 2)
consecutive <- ind[ , apply(ind, 2, diff)==1]
lapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]]))
Consider this simple example:
> data(iris)
> dat <- iris[, 1:4]
> # changing colnames to see whether result is for consecutive columns
> colnames(dat) <- 1:ncol(dat)
> head(dat) # this is how the data looks like
1 2 3 4
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
>
> ind <- combn(ncol(dat), 2)
> consecutive <- ind[ , apply(ind, 2, diff)==1]
> lapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]])) # output: cor matrix
[[1]]
1 2
1 1.0000000 -0.1175698
2 -0.1175698 1.0000000
[[2]]
2 3
2 1.0000000 -0.4284401
3 -0.4284401 1.0000000
[[3]]
3 4
3 1.0000000 0.9628654
4 0.9628654 1.0000000
If you want just the correlation, use sapply
> sapply(1:ncol(consecutive), function(i) cor(dat[,consecutive[,i]])[2,1])
[1] -0.1175698 -0.4284401 0.9628654
Usually, loops in R should be avoided, but I think they sometimes have an undeserved stigma. In this case, the loop is easier for me to read than "cooler" functions. It's also fairly efficient. Any call like cor(mydata) calculates n^2 correlations, while the for loop only calculates n correlations.
x = matrix( rnorm(246*20000), nrow=246 )
out = numeric(245)
system.time( { for( i in 1:245 )
out[i] = cor(x[,i],x[,i+1]) } )
# 0.022 Seconds
system.time( diag(cor(x, use="pairwise.complete.obs")[, -1]) )
# Goes for 2 minutes and then crashes my R session
First, I'll assume your data is stored in df.
Here's what I'd do. First, create a function that, for any given column number, calculates the correlation between that column and the next one, like this:
cor.neighbour <- function(i) {
  j <- i + 1
  cr <- cor(df[, i], df[, j])
  # returning a data frame here will make sense when you see the results from lapply
  result <- data.frame(
    x = names(df)[i],
    y = names(df)[j],
    cor = cr,
    stringsAsFactors = FALSE
  )
  return(result)
}
Then, to apply it to your whole data, I would first create a vector i of all the columns I want to use (which, by the way, is all but the last column), and then use lapply to process them:
library(reshape2) # melt() is assumed to come from reshape2
i <- 1:(ncol(df) - 1)
cor.pairs <- lapply(i, cor.neighbour)
# change the list into a data frame
cor.pairs <- melt(cor.pairs, id = names(cor.pairs[[1]]))
