Generate numbers with specific correlation [with only positive values in the output] - r

I want to obtain a dataframe with simulated values which have a specific correlation to each other.
I need to use this function, but in the returned output there are also negative values, which do not have meaning for my purposes:
COR <- function (n, xmean, xsd, ymean, ysd, correlation) {
x <- rnorm(n)
y <- rnorm(n)
z <- correlation * scale(x)[,1] + sqrt(1 - correlation^2) *
scale(resid(lm(y ~ x)))[,1]
xresult <- xmean + xsd * scale(x)[,1]
yresult <- ymean + ysd * z
data.frame(x=xresult,y=yresult)
}
Please note that my question starts from this previous post (currently closed):
another similar discussion
Is there a way to exclude from the final output all the rows which have at least one negative value? (In other words, x and y must always be positive.)
I spent many hours without any concrete result.....

Filtering rows which have at least one negative value can be done with the apply function, e.g.
df <- COR(100, 1, 1, 1, 1, 0.8)
filter <- apply(df, 1, function(x) sum(x < 0) > 0)
df <- df[!filter,]
plot(df)
First, I create a dataframe df from your function. Then, I apply the function sum(x < 0) > 0 row-wise to the dataframe (the second argument of apply, 1, tells it to go along the first dimension of the dataframe or array, i.e. the rows). This creates a logical vector that is TRUE for every row with at least one negative value. Subsetting the dataframe with the inverse of that (!filter) leaves you with all rows that have no negative values.
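The same filter can also be written without apply. The following is only a sketch: it repeats the question's COR function so that it runs on its own, and the names keep and df_pos are made up for illustration. It also prints the correlation before and after filtering, because dropping rows truncates the distribution and the remaining rows will generally no longer have exactly the requested correlation.

```r
# The question's generator, reproduced so the example runs on its own
COR <- function(n, xmean, xsd, ymean, ysd, correlation) {
  x <- rnorm(n)
  y <- rnorm(n)
  z <- correlation * scale(x)[, 1] +
    sqrt(1 - correlation^2) * scale(resid(lm(y ~ x)))[, 1]
  data.frame(x = xmean + xsd * scale(x)[, 1], y = ymean + ysd * z)
}

set.seed(1)
df <- COR(100, 1, 1, 1, 1, 0.8)

# df < 0 is a logical matrix; a row is kept when it has no negative entry
keep <- rowSums(df < 0) == 0
df_pos <- df[keep, ]

# Sample correlation before and after dropping rows
c(before = cor(df$x, df$y), after = cor(df_pos$x, df_pos$y))
```

If the target correlation has to hold for the positive-only data, filtering alone is probably not enough and the generator itself would need to change.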
UPDATE:
It seems the VineCopula package offers functions to simulate data with a given dependence structure. I did not dive deep enough into the math to fully grasp how copulas (i.e. multivariate probability distributions with uniform margins) work. Using this package, you can at least simulate from e.g. a bivariate Gaussian copula (family = 1, with par as its correlation parameter). Note that the simulated columns are uniform on [0, 1], not Gaussian.
library(VineCopula)
BC <- BiCop(family = 1, par = 0.9)
sim <- BiCopSim(N = 1000, obj = BC)
cor(sim[,1], sim[,2])
plot(sim)
You might then be able to scale the resulting matrix to achieve a certain standard deviation.
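To get strictly positive values from copula-style output, one option is to push the uniform columns through the quantile function of a positive distribution. The sketch below uses base R only and builds the Gaussian copula by hand instead of calling VineCopula. The gamma and log-normal margins, their parameters and the name sim_pos are arbitrary assumptions for illustration. The Pearson correlation of the transformed data will generally differ somewhat from rho, while the rank correlation is unaffected by the monotone transformations.

```r
set.seed(123)
n   <- 1000
rho <- 0.8

# Two standard normals with correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Uniform margins linked by a Gaussian copula
u <- pnorm(cbind(z1, z2))

# Positive margins via quantile functions (assumed choices)
sim_pos <- data.frame(
  x = qgamma(u[, 1], shape = 4, rate = 2),
  y = qlnorm(u[, 2], meanlog = 0, sdlog = 0.5)
)

cor(sim_pos$x, sim_pos$y)                       # Pearson, roughly near rho
cor(sim_pos$x, sim_pos$y, method = "spearman")  # rank correlation
```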

Related

How to write Bray-Curtis function?

I tried to write a Bray-Curtis function on my own. My data is economic and social data about different regions (each row is a different region and each column is an economic index). Here is a sample (the data is already normalized to the range 0-1, which is why the best region, the standard region, has the value 1); the real data has more regions and values:
Region= c("A", "B", "C")
Sp1 =c(0.43, 1, 0.5)
Sp2 = c(0.53, 0.12, 0.75)
...
Sp23 = c(0.97, 0.2, 1)
Sp24 = c(0.34, 0.72, 0.23)
I need a synthetic index of development, which is why I am trying to use Bray-Curtis. This is the code of my function:
bray_curtis <- function(x, na.rm = FALSE) {
return(1-(rowSums(abs(x - max(x))))/rowSums(x+max(x)))
}
gus2016_braycurtis <- as.data.frame(lapply(gus2016_norm, bray_curtis))
Formula, that i tried to implement
[1]: https://i.stack.imgur.com/LRrBb.png
What should I change? As output I need one column with the synthetic index of development for each region.
While I think you do not want to have Bray-Curtis index, I show how to get it. Your math needs fixing: your calculations had little to do with the formula you linked to (and even that linked formula botches indices, but let's ignore that and implement the formula that was intended).
BC index is a dissimilarity or similarity index calculated between two rows (or columns). When calculated just for one observation (that is, when indices are equal and the row/column is just duplicated) it will be 0 (if distance) or 1 (if similarity). The linked formula defines a similarity index. Here is a straightforward implementation of the linked formula for matrix z and doing calculations between rows. You need to switch indexing if you want to calculate similarities between columns:
N <- nrow(z) # assuming that matrix/data.frame z exists
d <- matrix(0, N, N)
for(i in 1:N) for(j in 1:N) d[i,j] <- 1 - sum(abs(z[i,]-z[j,]))/sum(z[i,]+z[j,])
## d[i,j] <- 2*sum(pmin(z[i,], z[j,]))/sum(z[i,]+z[j,]) is mathematically equivalent
This will give you a symmetric matrix with diagonal of 1 (so a lot of redundant work was done).
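As a small worked example, the loop can be run on the four columns shown in the question. The matrix z below is built by hand from those sample values. The last step, comparing each region with an "ideal" region whose indices are all 1, is only an assumption about what a one-column synthetic index could look like and is not something the linked formula states.

```r
# Sample rows from the question (regions A, B, C; only the four columns shown)
z <- rbind(
  A = c(Sp1 = 0.43, Sp2 = 0.53, Sp23 = 0.97, Sp24 = 0.34),
  B = c(1.00, 0.12, 0.20, 0.72),
  C = c(0.50, 0.75, 1.00, 0.23)
)

N  <- nrow(z)
d  <- matrix(0, N, N, dimnames = list(rownames(z), rownames(z)))
d2 <- d
for (i in 1:N) for (j in 1:N) {
  d[i, j]  <- 1 - sum(abs(z[i, ] - z[j, ])) / sum(z[i, ] + z[j, ])
  d2[i, j] <- 2 * sum(pmin(z[i, ], z[j, ])) / sum(z[i, ] + z[j, ])
}
all.equal(d, d2)   # both forms of the formula agree

# Assumption: similarity of each region to an ideal region with all indices 1
ideal <- rep(1, ncol(z))
index <- apply(z, 1, function(row) 1 - sum(abs(row - ideal)) / sum(row + ideal))
index
```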
Then there is an easy way, as the BC index has been implemented in several packages. I just show how to get it in vegan, where we need to change the dissimilarity to similarity, and I also cast the distance structure to symmetric matrix as above:
library(vegan)
d <- 1 - as.matrix(vegdist(z)) # assuming that z exists
Probably you don't want to have this, but here you see how to get it.

R function to find which of 3 variables correlates most with another value?

I am conducting a study that analyzes speakers' production and measures their average F2 values. What I need is an R function that allows me to find a relationship for these F2 values with 3 other variables, and if there is, which one is the most significant. These variables have been coded as 1, 2, or 3 for things like "yes" "no" answers or whether responses are positive, neutral or negative (1, 2, 3 respectively).
Is there a particular technique or R function/test that we can use to approach this problem? I've considered using ANOVA or a T-Test but am unsure if this will give me what I need.
A quick solution might look like this. Here, the cor function is used. Read its help page (?cor) to understand what is calculated. By default, the Pearson correlation coefficient is used. The function below returns the variable with the highest Pearson correlation with respect to the reference variable.
set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- 0.5*x + 0.5*rnorm(100)
find_max_corr <- function(vars, ref){
val <- sapply(vars, cor, y = ref)
val[which.max(val)]
}
find_max_corr(list('x' = x, 'y' = y, 'z' = z), ref)
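Note that which.max(val) picks the largest signed value, so a strong negative correlation would be ignored. A possible variant, sketched here under the assumption that the strength of the relationship matters regardless of sign, ranks by absolute value. The name find_max_abs_corr and the method argument are additions for illustration; method is simply passed on to cor, so "spearman" can be tried for ordinal codings such as 1, 2, 3.

```r
find_max_abs_corr <- function(vars, ref, method = "pearson") {
  # One correlation per candidate variable against the reference
  val <- sapply(vars, cor, y = ref, method = method)
  # Keep the sign in the result, but choose by magnitude
  val[which.max(abs(val))]
}

set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- -0.5 * x + 0.5 * rnorm(100)   # negatively related to x

res <- find_max_abs_corr(list(x = x, y = y, z = z), ref)
res
```

This only ranks the variables; it says nothing about statistical significance, for which a test such as cor.test or a model-based approach would still be needed.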

Faster way to generate large list of vectors from permuted datasets [R]

Setup For the purposes of my simulation, I'm generating a list of B=2000 elements, with each element being the output of a permutation procedure in which I first permute the rows of a 200x8000 matrix and for each column, I calculate the Kolmogorov-Smirnov test statistic between the first and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).
Question This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.
Code
B=2000
n.row=200; n.col=8000
#Generate sample data
samp.dat = matrix(rnorm(n.row*n.col),nrow=n.row)
perm.KS.list = NULL
for (b in 1:B){
#permute the rows
perm.dat.tmp = samp.dat[sample(nrow(samp.dat)),]
#Compute the permutation-based test statistics
perm.KS.list[[b]]= apply(perm.dat.tmp,2,function(y) ks.test.stat(y[1:100],y[101:200]))
}
#Modified KS-test function (from base package)
ks.test.stat <- function(x,y){
x <- x[!is.na(x)]
n <- length(x)
y <- y[!is.na(y)]
n.x <- as.double(n)
n.y <- length(y)
w <- c(x, y)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)] #exclude ties
STATISTIC <- max(abs(z))
return(STATISTIC)
}
The 1:B loop has several places to optimize, but I agree that the real consumer is that inner function. Because you're simulating your well-behaved bootstrap samples, you can make two simplifying assumptions that the general base function can't:
There aren't missing values. This obviates the is.na() adjustments
The two sides (ie, x & y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the pieces back together in the function (into w), just keep it together. The balanced sides also permit simplifications like removing the ifelse() clause. The comparison produces a bunch of 0/1s, which are rescaled to -1/1s with integer arithmetic.
The function is reduced, which saves about 25% of the time. I added integers, instead of doubles inside cumsum().
ks.test.stat.balanced <- function(w){
n <- as.integer(length(w) * .5)
# z <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
z <- cumsum((order(w) <= n)*2L - 1L) / n
# z <- z[c(which(diff(sort(w)) != 0), n + n)] #exclude ties
return( max(abs(z)) )
}
Ties shouldn't occur often with your gaussian rng, and the diff(sort(.)) is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.
If you move the equation for z into abs(), it saves a little time over all those reps. I kept it separate above, so it's easier to read.
Edit: in case of an unbalanced simulation, I'd recommend you:
still keep out the is.na,
still pass w,
still keep as much as possible in integer, not numeric, but
now include arguments n1 & n2 for the two group sizes.
Also, experiment w/ precalculating 1/n before cumsum() to avoid a lot of expensive divisions. Try to think of other math-y ways to extract calculations from an inner loop so it occurs less frequently.
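A minimal sketch of how the balanced function could be dropped into the permutation loop, using much smaller dimensions than in the question so it runs quickly. The names ks_stat_balanced, dat and perm_stats are made up here, and the division by n is moved outside max(), which gives the same value. The result is compared once against stats::ks.test as a sanity check, which should agree as long as there are no ties.

```r
ks_stat_balanced <- function(w) {
  n <- length(w) %/% 2L                      # size of each group
  z <- cumsum((order(w) <= n) * 2L - 1L)     # +1 for group 1, -1 for group 2
  max(abs(z)) / n                            # one division instead of length(w)
}

set.seed(1)
n.row <- 200; n.col <- 50; B <- 20           # reduced sizes for illustration
dat <- matrix(rnorm(n.row * n.col), nrow = n.row)

# One column of statistics per permutation: an n.col x B matrix
perm_stats <- vapply(seq_len(B), function(b) {
  perm <- dat[sample(nrow(dat)), , drop = FALSE]
  apply(perm, 2, ks_stat_balanced)
}, numeric(n.col))

# Sanity check against the base implementation on one column
ref_stat <- unname(ks.test(dat[1:100, 1], dat[101:200, 1])$statistic)
all.equal(ks_stat_balanced(dat[, 1]), ref_stat)
```

Returning a matrix from vapply instead of growing a list also avoids repeated reallocation inside the 1:B loop.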

How to work with binary constraints in linear optimization?

I have two input matrices, dt(10,3) & wt(3,3), that I need to use to find the optimal decision matrix Par(10,3) (same dimension as dt) so as to maximize an objective function. The R code below gives some direction into the problem (sample inputs are used here):
#Input Matrices
dt <- matrix(runif(300),100,3)
wt <- matrix(c(1,0,0,0,2,0,0,0,1),3,3) #weights
#objective function
Obj <- function(Par) {
P = matrix(Par, nrow = 10, byrow=F) # Reshape
X = t((dt%*%wt)[,1])%*%P[,1]
Y = t((dt%*%wt)[,2])%*%P[,2]
Z = t((dt%*%wt)[,3])%*%P[,3]
as.numeric(X+Y+Z) #maximize
}
Now I am struggling to apply the following constraints to the problem :
1) Matrix, Par can only have binary values (0 or 1)
2) rowSums(Par) = 1 (Basically a row can only have 1 in one of the three columns)
3) sum(Par[,1]) <= 5, sum(Par[,2]) <= 6, & sum(Par[,3]) <= 4 (i.e. limits on colSums(Par))
4) X/(X+Y+Z) < 0.35, & Y/(X+Y+Z) < 0.4 (X,Y,Z are defined in the objective function)
I tried coding the constraints in constrOptim, but not sure how to input binary & integer constraints. I am reading up on lpSolve, but not able to figure out. Any help much appreciated. Thanks!
I believe this is indeed a MIP so no issues with convexity. If I am correct the model can look like:
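One way such a model might be written is sketched below. It assumes that x_{ij} stands for Par[i,j], that c_{ij} = (dt %*% wt)[i,j] are nonnegative constants, that L_j are the column limits, and that the strict inequalities in constraint 4 can be relaxed to non-strict ones. Because the total X+Y+Z is positive when every row is assigned, the two ratio constraints can be multiplied through by it and become linear.

```latex
\begin{align*}
\max_{x}\quad & \sum_{i=1}^{r}\sum_{j=1}^{3} c_{ij}\,x_{ij} \\
\text{s.t.}\quad
  & \sum_{j=1}^{3} x_{ij} = 1 && \forall i \\
  & \sum_{i=1}^{r} x_{ij} \le L_j && \forall j \\
  & \sum_{i=1}^{r} c_{i1}\,x_{i1} \le 0.35 \sum_{i=1}^{r}\sum_{j=1}^{3} c_{ij}\,x_{ij} \\
  & \sum_{i=1}^{r} c_{i2}\,x_{i2} \le 0.40 \sum_{i=1}^{r}\sum_{j=1}^{3} c_{ij}\,x_{ij} \\
  & x_{ij} \in \{0,1\} && \forall i,j
\end{align*}
```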
This model can be easily transcribed into R. Note that LP/MIP solvers do not use functions for the objective and constraints (opposed to NLP solvers). In R typically one builds up matrices with the LP coefficients.
Note: I had to make the limits on the column sums much larger (I used 50,60,40).
Based on Erwin's response, I am able to formulate the model using lpSolve in R. However still struggling to add the final constraint to the model (4th constraint in my question above). Here's what I am able to code so far :
#input dimension
r <- 10
c <- 3
#input matrices
dt <- matrix(runif(r*c),r,c)
wt <- matrix(c(1,0,0,0,2,0,0,0,1),3,3) #weights
#column controller
c.limit <- c(60,50,70)
#create structure for lpSolve (make.lp, set.type, add.constraint etc. come from the lpSolveAPI package)
library(lpSolveAPI)
ncol <- r*c
lp.create <- make.lp(ncol=ncol)
set.type(lp.create, columns=1:ncol, type = c("binary"))
#create objective values
obj.vals <- as.vector(t(dt%*%wt))
set.objfn(lp.create, obj.vals)
lp.control(lp.create,sense='max')
#Add constraints to ensure sum of parameters for every row (rowSum) <= 1
for (i in 1:r){
add.constraint(lp.create, xt=c(1,1,1),
indices=c(3*i-2,3*i-1,3*i), rhs=1, type="<=")
}
#Add constraints to ensure sum of parameters for every column (colSum) <= column limit (defined above)
for (i in 1:c){
add.constraint(lp.create, xt=rep(1,r),
indices=seq(i,ncol,by=c), rhs=c.limit[i], type="<=")
}
#Add constraints to ensure sum of column objective (t((dt%*%wt)[,i])%*%P[,i] <= limits defined in the problem)
#NOT SURE HOW TO APPLY A CONSTRAINT THAT IS DEPENDENT ON THE OBJECTIVE FUNCTION
solve(lp.create)
get.objective(lp.create) #20
final.par <- matrix(get.variables(lp.create), ncol = c, byrow=T) # Reshape
Any help that can get me to the finish line is much appreciated :)
Thanks
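A possible way to add the fourth constraint, given as a sketch rather than a tested solution to the full problem: since X/(X+Y+Z) <= 0.35 is equivalent to X - 0.35*(X+Y+Z) <= 0 when the total is positive, each ratio becomes one linear row whose coefficients are built from the objective vector. The helper share_row and the names cf, obj, row_x and row_y are made up for illustration. The strict "<" is relaxed to "<=", the variable order is row-major as in the code above, and the solver part only runs if lpSolveAPI is installed.

```r
set.seed(42)
r <- 10; k <- 3
dt <- matrix(runif(r * k), r, k)
wt <- matrix(c(1, 0, 0, 0, 2, 0, 0, 0, 1), 3, 3)
cf  <- dt %*% wt              # cf[i, j]: objective contribution of Par[i, j]
obj <- as.vector(t(cf))       # row-major order: variable 3*(i-1)+j is Par[i, j]

# Coefficients of: (share of column j) - cap * (total objective) <= 0
share_row <- function(j, cap) {
  own <- rep(0, r * k)
  idx <- seq(j, r * k, by = k)   # variables belonging to column j
  own[idx] <- obj[idx]
  own - cap * obj
}
row_x <- share_row(1, 0.35)
row_y <- share_row(2, 0.40)

if (requireNamespace("lpSolveAPI", quietly = TRUE)) {
  library(lpSolveAPI)
  lp <- make.lp(nrow = 0, ncol = r * k)
  set.type(lp, columns = 1:(r * k), type = "binary")
  set.objfn(lp, obj)
  invisible(lp.control(lp, sense = "max"))
  for (i in 1:r) {
    add.constraint(lp, xt = rep(1, k), type = "=", rhs = 1,
                   indices = (k * (i - 1) + 1):(k * i))
  }
  add.constraint(lp, xt = row_x, type = "<=", rhs = 0)
  add.constraint(lp, xt = row_y, type = "<=", rhs = 0)
  status <- solve(lp)          # 0 means an optimal solution was found
  if (status == 0) print(matrix(get.variables(lp), ncol = k, byrow = TRUE))
}
```

The column-sum limits from the earlier code can be added in the same way as before; they are left out here to keep the sketch short.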

Weighted Average notation - Adjusting weights in R

I'm trying to calculate the weighted average of a statistic sample (vector) in R using this form:
The function takes a vector and the weight is adjusted according by a second parameter (1 - 3), which are:
where s is the standard deviation.
I've adjusted the weight accordingly if the parameter is 1 or 3 using else-if's, but I'm having trouble with the 2nd one given that there is criteria to meet...
I've been calculating X - xBar as a vector: m = x-mean(x)
I've been calculating s with an R function: s = sd(x)
My query is about how "the meeting of the conditions should be programmed" in the 2nd criterion. So far I have an if for each condition, but...
When calculating the weighted average (taking the top condition as an example), does each element of the x vector (m/s) need to be less than 1? Or do I need to test each element and assign a weight from the 3 conditions accordingly?
E.g. if the first element's result was less than 1, assign it a weight of 1, but if the second element's result was between 1 and 2, assign it a weight of 0.5?
I hope this makes sense. R throws a warning message saying the logic is only comparing the first element of the vector... so that's what raised the question.
Thanks in advance.
To avoid the warning message while staying reasonably efficient, you probably want to use ifelse rather than if and else, perhaps in something like
m <- mean(x)
s <- sd(x)
absstandardx <- abs( (x - m) / s )
w2 <- ifelse( absstandardx < 1, 1, ifelse( absstandardx < 2, 0.5, 0 ) )
weightedmean2 <- sum(w2 * x) / sum(w2)
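As a quick illustration, the same steps can be wrapped in a function and run on a small vector with one outlier. The function name weighted_mean_rule2 and the sample values are made up, and the thresholds (weight 1 below one standard deviation, 0.5 below two, 0 otherwise) are taken from the snippet above rather than from the original formula image.

```r
weighted_mean_rule2 <- function(x) {
  z <- abs((x - mean(x)) / sd(x))            # standardised distance from the mean
  # Element-wise weights: ifelse() tests every element, unlike if/else
  w <- ifelse(z < 1, 1, ifelse(z < 2, 0.5, 0))
  list(weights = w, value = sum(w * x) / sum(w))
}

x <- c(1, 2, 3, 4, 100)
res <- weighted_mean_rule2(x)
res$weights   # the outlier 100 gets a reduced weight
res$value     # pulled towards the bulk of the data compared with mean(x)
```

Each element gets its own weight, so the conditions are tested per element and not once for the whole vector.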

Resources