I'm trying to remove the shackles of some legacy code that we use to make decision trees in a retail setting. I got to playing with hclust in R and it's beautiful and I'd like to use it. The heavy lifting for calculating distances is done in SQL and I get an output like this:
main with dist
A A 0.00
A B 1.37
A C 0.64
B B 0
B C 0.1
C C 0
That's loaded as a data frame right now (just reading the SQL query dump), but hclust wants a matrix of distances, for example:
A B C
--+-----------------
A | 0
B | 1.37 0
C | 0.64 0.1 0
My thinking is too procedural and I'm trying to do it in nested loops at the moment. Can someone point me in the direction of something more R-idiomatic to do this?
Thanks!
If you are looking for an actual distance matrix in R, try:
as.dist(xtabs(dist ~ with + main, mydf), diag = TRUE)
# A B C
# A 0.00
# B 1.37 0.00
# C 0.64 0.10 0.00
I'm presuming that the combinations of "main" and "with" are unique, otherwise xtabs would sum the "dist" values.
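To show how this feeds into hclust, here is a minimal sketch that rebuilds the question's data frame by hand (the name mydf and the literal values are taken from the post; in practice the data frame would come from the SQL dump) and clusters on the resulting dist object:

```r
# Long-format distances, as in the question (upper triangle plus diagonal)
mydf <- data.frame(
  main = c("A", "A", "A", "B", "B", "C"),
  with = c("A", "B", "C", "B", "C", "C"),
  dist = c(0, 1.37, 0.64, 0, 0.1, 0)
)

# Rows come from 'with' and columns from 'main', so the values land in the
# lower triangle, which is the part as.dist() reads
d <- as.dist(xtabs(dist ~ with + main, mydf))

# Hierarchical clustering (default method is complete linkage)
hc <- hclust(d)
```

Calling plot(hc) afterwards should draw the dendrogram.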
I would suggest changing from letters to numbers (which is straightforward using the ASCII codes) and then using the linearized indices of R matrices to access each pair in a vectorwise manner.
Minimal example:
N <- 3
d <- data.frame(x = c(1,2), y = c(2,3), v = c(0.1, 0.2))
m <- matrix(0, N, N)
m[(d$y-1)*N+d$x] = d$v
The output is:
[,1] [,2] [,3]
[1,] 0 0.1 0.0
[2,] 0 0.0 0.2
[3,] 0 0.0 0.0
EDIT: To preserve arbitrary strings as row and col names, consider the following example:
codes <- c('A','B','C')
N <- 3
d <- data.frame(x = c('A','B'), y = c('B','C'), v = c(0.1, 0.2))
m <- matrix(0, N, N)
m[(vapply(d$y, function(x) which(codes == x), 0)-1)*N+
vapply(d$x, function(x) which(codes == x), 0)] = d$v
rownames(m) = codes
colnames(m) = codes
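As a possible alternative to computing linear indices by hand, the same fill can be sketched with match() and a two-column index matrix. The data here are the same toy values as above; this is only one way to do it, not necessarily faster:

```r
codes <- c("A", "B", "C")
d <- data.frame(x = c("A", "B"), y = c("B", "C"), v = c(0.1, 0.2))

# Square matrix of zeros with the codes as row and column names
m <- matrix(0, length(codes), length(codes), dimnames = list(codes, codes))

# match() turns each label into its position in 'codes';
# cbind(row, col) indexes one cell per row of d
m[cbind(match(d$x, codes), match(d$y, codes))] <- d$v
```

match() also works if x and y are factors, so the lookup does not depend on how the data frame was read in.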
I'm not exactly sure how to go about this in R. I've got a data set with 40 values, some of which repeat and I want to perform a small bootstrap on this dataset to find the mean of two or more consecutive values. For example, I randomly select a value from the dataset provided below, say the very first value is selected which is 0.2, so x1=0.2. How can I make sure that in the same for loop R is able to select the next value, x2, to be 0.2 as that is the second value in the dataset? Thus it would appear as x1=0.2 and x2=0.2.
I can't really think of a way for this to be done as it would need to be repeated for each iteration and since the sample() function selects any random value that makes it harder to pinpoint exactly which value it selected given there are repeated values.
I've provided a sample code that calculates the mean for 1 observation and I would like to get it to work for 2 consecutive observations. So then I can calculate the means individually and display them.
If anyone has any way to handle this I would appreciate it.
Thanks ahead of time.
x=c(0.20,0.20,0.21,0.21,0.21,0.20,0.19,0.18,0.16,0.10,
0.02,-0.02,0.01,0.03,0.07,0.14,0.22,0.13,0.12,
0.16,0.17,0.18,0.18,0.17,0.15,0.15,0.13,0.12,
0.10,0.08,0.06,0.04,0.03,0.02,0.03,0.05,0.34,
0.13,0.11,0.12)
B<- 500
result1<- numeric(B)
# result2<- numeric(B)
for (b in 1:B){
x1<-sample(x=x,size =1, replace=TRUE)
# x2<-
result1[b]<-x1
# result2[b]<-x2
}
mean1<- mean(result1)
# mean2<- mean(result2)
A simple approach could be:
result <- matrix(nrow = B, ncol = 2)
for (b in 1:B){
idx1 <- sample(seq_along(x), size = 1)
idx2 <- idx1 %% length(x) + 1
result[b, 1] <- x[idx1]
result[b, 2] <- x[idx2]
}
storing the results in a matrix:
> result
[,1] [,2]
[1,] 0.21 0.21
[2,] 0.12 0.20
[3,] 0.21 0.21
[4,] 0.10 0.02
[5,] 0.10 0.02
[6,] 0.21 0.20
[7,] 0.02 -0.02
[8,] -0.02 0.01
[9,] 0.21 0.20
[10,] 0.17 0.15
Sample the indices of x, then use the sampled index to subset x for result1. Use the sampled index + 1 to subset x for result2. However, you also need a wrap-around so that if you sample the last member of x, you take the first member as the "next" value.
B <- 500
result1<- numeric(B)
result2 <- numeric(B)
for(i in 1:B) {
j <- sample(seq_along(x), 1)
if(j == 40) k <- 1
else k <- j + 1
result1[i] <- x[j]
result2[i] <- x[k]
}
mean(result1)
#> [1] 0.12618
mean(result2)
#> [1] 0.13034
Note also that since R is vectorized, you don't need a loop here at all. You could just do:
result1 <- sample(seq_along(x), 500, replace = TRUE)
result2 <- result1 + 1
result2[result2 == 41] <- 1
mean(x[result1])
#> [1] 0.12568
mean(x[result2])
#> [1] 0.12596
Created on 2022-03-28 by the reprex package (v2.0.1)
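The wrap-around can also be written with the modulo operator so that the length of x is not hard-coded. The sketch below uses a short made-up vector x (not the 40 values from the question) and additionally computes the mean of each consecutive pair, which is an assumption about what is ultimately wanted:

```r
set.seed(1)
x <- c(0.20, 0.21, 0.19, 0.10, 0.02)  # short illustrative data
B <- 500

# Sample positions rather than values, so repeated values are not ambiguous
idx <- sample(length(x), B, replace = TRUE)

# Next position, wrapping from the last element back to the first
nxt <- idx %% length(x) + 1

x1 <- x[idx]
x2 <- x[nxt]
pair_means <- (x1 + x2) / 2  # mean of each consecutive pair
```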
Could you work out all the possible consecutive means and then sample from that? How about:
library(RcppRoll)
x=c(0.20,0.20,0.21,0.21,0.21,0.20,0.19,0.18,0.16,0.10,
0.02,-0.02,0.01,0.03,0.07,0.14,0.22,0.13,0.12,
0.16,0.17,0.18,0.18,0.17,0.15,0.15,0.13,0.12,
0.10,0.08,0.06,0.04,0.03,0.02,0.03,0.05,0.34,
0.13,0.11,0.12)
rollmean <- roll_mean(x,2)
r <- sample(rollmean, 500, replace= T)
hist(r)
This plots a histogram of the 500 sampled rolling means.
I am performing calculations with constants and vectors (approximate length = 100) for which I need to simulate normal distributions N (with rnorm). For constants (K, with standard deviation = KU) I use rnorm() in the standard way:
K <- 2
KU <- 0.2
set.seed(123)
KN <- rnorm(n = 3, mean = K, sd = KU)
which gives a vector of length 3 (KN):
[1] 1.887905 1.953965 2.311742
Now, I need to do the same thing with a vector (V, standard deviation VU). My first guess is to use:
V <- c(1, 2, 3)
VU <- 0.1 * V
set.seed(123)
VN <- rnorm(3, V, VU)
but only a vector of 3 elements is produced, one for each vector element:
[1] 0.9439524 1.9539645 3.4676125
This is actually the first simulation of the vector, but I need this vector 3 times. One solution is to create 9 numbers, but then VN is a single vector of 9 elements:
[1] 0.9439524 1.9539645 3.4676125 1.0070508 2.0258575 3.5145195 1.0460916 1.7469878 2.7939441
not 3 vectors of 3 elements. What I want is VN =
[1] 0.9439524 1.0070508 1.0460916
[2] 1.9539645 2.0258575 1.7469878
[3] 3.4676125 3.5145195 2.7939441
so, VN are 3 vectors which I can subsequently use in other calculations, such as KN * VN. The solution that I have found is:
set.seed(123)
VN <- as.data.frame(t(matrix(rnorm(3 * length(V), V, VU), nrow = length(V))))
but in my opinion this is a rather cumbersome expression (which I need to repeat several times in different places with rather long variable names). Is there a simpler way in base R to produce random vectors? I would like to see something like:
VN <- rnorm.vector(3, V, VU)
We can use replicate
set.seed(123)
replicate(3, rnorm(3, V, VU))
# [,1] [,2] [,3]
#[1,] 0.9439524 1.007051 1.046092
#[2,] 1.9539645 2.025858 1.746988
#[3,] 3.4676125 3.514519 2.793944
Or it could be
mapply(rnorm, n = 3, mean = V, sd = VU)
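Since the question asks for something like rnorm.vector(3, V, VU), the idea can be wrapped in a small helper. The name rnorm_vector below is hypothetical (it is not a base R function); the sketch returns one column per simulation, which is the layout shown in the question:

```r
# Hypothetical helper, not part of base R:
# n simulations of a vector with element-wise mean and sd
rnorm_vector <- function(n, mean, sd) {
  # rnorm() recycles mean and sd, and matrix() fills column by column,
  # so each column is one complete simulation of the vector
  matrix(rnorm(n * length(mean), mean, sd), nrow = length(mean))
}

V <- c(1, 2, 3)
VU <- 0.1 * V
set.seed(123)
VN <- rnorm_vector(3, V, VU)
```

With the same seed this should reproduce the replicate() result, because both draw the same nine numbers in the same order.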
In addition to @akrun's great options, you may also use something slightly simpler than your approach:
n <- 3
set.seed(123)
matrix(rnorm(n * length(V), V, VU), nrow = n, byrow = TRUE)
# [,1] [,2] [,3]
# [1,] 0.9439524 1.953965 3.467612
# [2,] 1.0070508 2.025858 3.514519
# [3,] 1.0460916 1.746988 2.793944
or the MASS package, whose mvrnorm samples from a multivariate normal distribution. Note that mvrnorm takes the mean vector and a covariance matrix, so the means are V and the standard deviations VU have to be squared:
library(MASS)
mvrnorm(n, V, diag(VU^2))
where
diag(VU^2)
# [,1] [,2] [,3]
# [1,] 0.01 0.00 0.00
# [2,] 0.00 0.04 0.00
# [3,] 0.00 0.00 0.09
The latter option is the way to go in case you want the variance-covariance matrix not to be diagonal.
I have 2 data frames and I am applying pnorm() and qnorm() to them, but I am getting an error during the calculation.
n <- c(0.3,0.5,0.1,0.2)
m <- c(0.1,0.4,0.5,0.3)
o <- c(0.2,0.2,0.2,0.4)
p <- c(0.3,0.1,0.3,0.3)
df1 = data.frame(n,m,o,p)
df1
n m o p
1 0.3 0.1 0.2 0.3
2 0.5 0.4 0.2 0.1
3 0.1 0.5 0.2 0.3
4 0.2 0.3 0.4 0.3
r <- c(0.2,0.4,0.1,0.3)
df2 = rbind.data.frame(r)
df2
X2 X4 X1 X3
1 0.2 0.4 0.1 0.3
b <- 0.15
result <- pnorm((qnorm(df1)+sqrt(b)*df2)/sqrt(1-b))
This gives an error:
Error in qnorm(df1) : Non-numeric argument to mathematical function
Expected output:
0.3139178 0.110853 0.1919158 0.3289671
0.5334785 0.4574897 0.1919158 0.1031127
0.0957727 0.5667216 0.1919158 0.3289671
0.2035948 0.3442989 0.4079641 0.3289671
Actually, I have these 2 data frames, df1 and df2, in Excel, along with an Excel formula which I need to convert into R.
=NORMSDIST((NORMSINV(A1)+SQRT(0.15)*H1)/SQRT(1-0.15))
here A1 is the df1 first value and so on and H1 is the df2 value and so on.
What you're trying to do is: apply a function to every row in df1. To do so we need to write a function.
getDist <- function(x, b = 0.15) {
pnormInput <- as.numeric((qnorm(as.numeric(x)) + sqrt(b) * df2) / sqrt(1 - b))
pnorm(pnormInput)
}
Next we apply this function to every row in df1 (using apply).
result <- apply(df1, 1, function(x) getDist(x))
Next we have to transpose result (flip the table we got).
result <- t(result)
# [,1] [,2] [,3] [,4]
# [1,] 0.3139178 0.1108530 0.1919158 0.3289671
# [2,] 0.5334785 0.4574897 0.1919158 0.1031127
# [3,] 0.0957727 0.5667216 0.1919158 0.3289671
# [4,] 0.2035948 0.3442989 0.4079641 0.3289671
I think this is a classic case of trying to do many operations in one line and losing track of what every function is doing. My answer is essentially the same as @PoGibas', but a bit more explicit and less elegant.
I'll calculate the terms separately and then combine them again afterwards:
num1 <- apply(df1, 1, qnorm) # Apply 'qnorm' row-wise; the result is a transposed matrix (one column per row of df1)
num2 <- sqrt(b) * r # Multiply vector r by the constant sqrt(b)
num <- sweep(num1, 1, num2, "+") # Add num2[j] to every element of row j of the matrix num1
den <- sqrt(1-b) # den is a constant
result <- pnorm(num/den) # num is a matrix, which is divided elementwise by the constant den
t(result)
By doing the operations step-by-step, you will often have a much easier time finding the source of an error.
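Because qnorm() and pnorm() are vectorized over matrices, the row-wise apply() and the final transpose can probably be avoided by converting df1 to a matrix first. A minimal sketch with the question's data, sweeping over columns (MARGIN = 2) so that r[j] is paired with column j of df1:

```r
df1 <- data.frame(n = c(0.3, 0.5, 0.1, 0.2),
                  m = c(0.1, 0.4, 0.5, 0.3),
                  o = c(0.2, 0.2, 0.2, 0.4),
                  p = c(0.3, 0.1, 0.3, 0.3))
r <- c(0.2, 0.4, 0.1, 0.3)
b <- 0.15

# qnorm() fails on a data frame but works elementwise on a numeric matrix
q <- qnorm(as.matrix(df1))

# Add sqrt(b) * r[j] to every element of column j, then rescale and map back
result <- pnorm(sweep(q, 2, sqrt(b) * r, "+") / sqrt(1 - b))
```

The result keeps the orientation of df1, so no t() is needed.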
For each value n in some vector N, I want to compute the percentage of values that exceed n for each variable in my data frame T.
Consider the following input data frame:
T <- data.frame(A=c(0.1,0.2,0.3), B=c(0.3,0.3,0.9),C=c(1,0.5,0))
T
# A B C
# 1 0.1 0.3 1.0
# 2 0.2 0.3 0.5
# 3 0.3 0.9 0.0
I would like the output to be a matrix that looks something like this:
A B C
n=0.1 66.6 100 66.6
n=0.2 33.3 100 66.6
My current implementation is not working:
n <- 0.8
repeat {
Tlogic <- T > n
TU <- as.matrix(apply(Tlogic,2,sum))
q = NULL
for (i in seq(along=TU[,1]))
{
percent <- (TU[i]/nrow(T))*100
q = c(q, percent)
}
n <- n - 0.05;
print(n);
if(log(n) < -6) break
}
Basically you're asking, for each value n in some vector N, to compute the percentage of values in each column of T that exceed n.
You can actually do this in one line in R by moving from a solution that writes out loops to one that uses the *apply functions in R:
N <- c(0.1, 0.2)
do.call(rbind, lapply(N, function(n) c(n=n, 100*colMeans(T > n))))
# n A B C
# [1,] 0.1 66.66667 100 66.66667
# [2,] 0.2 33.33333 100 66.66667
For each value n in N, the call lapply(N, function(n) c(n=n, 100*colMeans(T > n))) computes a vector that indicates n as well as the percentage of values in each column of T that exceed n. Then do.call(rbind, ...) groups all of these together into a final output matrix.
In your case, you want N to form a decreasing sequence (by 0.05 each step) from 0.8 until log(n) < -6. You can get the N vector in this case with:
N <- seq(.8, 0, -.05)
N <- N[log(N) >= -6]
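If the thresholds should appear as row names, as in the desired output of the question, a variant with sapply() might look like the sketch below. The data frame is named T_df here (an assumption, chosen to avoid masking the T shorthand for TRUE):

```r
T_df <- data.frame(A = c(0.1, 0.2, 0.3),
                   B = c(0.3, 0.3, 0.9),
                   C = c(1, 0.5, 0))
N <- c(0.1, 0.2)

# sapply() returns one column per threshold, so transpose to one row per n
pct <- t(sapply(N, function(n) 100 * colMeans(T_df > n)))
rownames(pct) <- paste0("n=", N)
```

The comparison is strict, so a value equal to the threshold is not counted as exceeding it.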
In R, I am trying to do the following:
df
A B Category
0.9 0.85 A
0.7 0.75 B
0.8 0.90 B
CSF <- function(df, type) {
switch(type,
"A" = qnorm(df$A, 0 , 1),
"B" = qnorm(df$B, 0 , 1)
)
}
df<-data.frame(df, value = CSF(df,df$category))
Desired result:
df
A B Category Value
0.9 0.85 A qnorm(0.9, 0, 1)*
0.7 0.75 B qnorm(0.75, 0, 1)*
0.8 0.90 B qnorm(0.9, 0, 1)*
*: real values
Error message: EXPR must be a length 1 vector
You can use the ifelse function:
df$Value <- ifelse(df$Category=="A",qnorm(df$A,0,1),qnorm(df$B,0,1))
For a more complex arrangement of categories, I would recommend breaking it out into multiple statements. Something like
df$CSF <- NA
df.split <- split(df, df$Category)
df.split$A$CSF <- qnorm(df.split$A$A, 0, 1)
df.split$B$CSF <- qnorm(df.split$B$B, 0, 1)
...
And then merge them back together
df <- unsplit(df.split, df$Category)
It doesn't have the elegance of a "single expression" but it also avoids having a huge amount of nesting to accomplish the task. You could simplify the individual expressions using within:
df.split$A <- within(df.split$A, CSF <- qnorm(A, 0, 1))
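When every category simply names the column to read from, as it does in the question, another option is to pick one value per row with matrix indexing and call qnorm() once. This is a sketch under that assumption; switch() itself cannot be used this way because it expects a single value, not a vector:

```r
df <- data.frame(A = c(0.9, 0.7, 0.8),
                 B = c(0.85, 0.75, 0.90),
                 Category = c("A", "B", "B"))

# Numeric columns only, so the matrix is not coerced to character
probs <- as.matrix(df[c("A", "B")])

# For row i, take the column whose name equals df$Category[i]
picked <- probs[cbind(seq_len(nrow(df)), match(df$Category, colnames(probs)))]

df$Value <- qnorm(picked, 0, 1)
```

This scales to more categories without nesting ifelse() calls, as long as each category has a matching column.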