Simulating Data for SEM with psych package - r

I'm simulating data for SEM (structural equation model) with psych package. I used the code given on page 17 of Using the psych package to generate and test structural models. The code is
library(psych)
set.seed(42)
fx <- matrix(c(0.9, 0.8, 0.7, rep(0, 9), 0.7, 0.6, 0.5, rep(0, 9), 0.6, 0.5, 0.4), ncol = 3)
rownames(fx) <- paste("x", 1:9, sep="")
fy <- matrix(c(0.6, 0.5, 0.4), ncol=1)
rownames(fy) <- paste("y", 1:3, sep="")
Phi <- matrix(c(1, 0.48, 0.32, 0.4, 0.48, 1, 0.32, 0.3, 0.32, 0.32, 1, 0.2, 0.4, 0.3, 0.2, 1), ncol = 4)
twelveV <- sim.structure(fx=fx, Phi=Phi, fy=fy, n=100, raw=TRUE)
round(twelveV$model, 2)
round(twelveV$model-twelveV$r, 2)
twelveV$observed
Then I tried to use sem package to analyse the simulated data. The code is
sem.mod <- structure.sem(twelveV$model)
library(sem)
sem.fit <- sem(sem.mod, twelveV$r, 100)
This code is giving the following error message:
Error in solve.default(diag(m) - A) :
Lapack routine dgesv: system is exactly singular
I don't what is causing this error. Any idea, comment and/or help will be highly appreciated. Thanks

Ah, that error message was the bane of my life for a while.
Essentially (as I eventually gathered from the R-Help archives, specifically here, it means that there is redundant information in your matrix in that (at least) one column's information can be derived from one of the others.
I believe that this is related to collinearity, but i could be wrong on this point. In most cases, dropping the column that is most highly correlated with the others will solve the problem.
In a real application, its a sign to throw out some of your questions or measures.

Related

combining lists and dataframes in R from raster values

QUESTION EDITED FOR CLARITY AND REPRODUCIBILITY
I am trying to summarize proportions of landcover classes within many buffers contained within a list. Although it appears to be a common problem, I have not found an appropriate solution:
I have a raster stack called hab_stack with discrete values 1-6 for each of 3 layers (each layer == year). I also have locational data with >800,000 locations called dat_sf. I have extracted hab_stack raster values within a 400 m buffer around each location.
I now have a large list with ~800,000 elements (not all hab classes 1-6 are represented in each list). So I tried to create a dummy dataframe with all hab_stack values 1-6 called true_names with assigned frequency/proportion == zero for classes not represented within the buffer because I need to combine all proportions together. I have tried to accomplish this using an lapply looping structure but can't seem to get it quite right. Below is the full function and error:
sum_class <- lapply(values_hab, function(x){
true_names <- data.frame(x = 1:6, Freq = 0)
prop_df <- as.data.frame(prop.table(table(x))) %>%
mutate(x = as.numeric(x))
true_names %>%
anti_join(prop_df, by = "x") %>%
bind_rows(prop_df) %>%
arrange(x)
Error in `mutate()`:
! Problem while computing `x = as.numeric(x)`.
x `x` must be size 0 or 1, not 1659.
Run `rlang::last_error()` to see where the error occurred.
})
When I dissect the function, the error arises from the table(values_hab) argument = Error in table(values_hab) : all arguments must have the same length.
I think a hypothetical list could look something like this, where there's different numbers of NAs and not all classes are represented in each element; also, see a dataframe of my desired output below:
list <- list(c(1,1,1,2,2,2,3,3,4,4,4,NA,NA,NA,5,6),
c(1,2,3,4,NA,NA,NA,NA,4,4,4,4,NA,5,1,1)
c(5,5,5,5,5,1,2,2,2,2,NA,NA,NA,NA,NA,3))
desired_output <- data.frame(`1` = c(0.4, 0.5, 0.6, 0.5, 0.5, 0.3),
`2` = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2),
`3` = c(0.1, 0.1, 0.0, 0.1, 0.0, 0.3),
`4` = c(0.3, 0.2, 0.0, 0.1, 0.1, 0.1),
`5` = c(0.0, 0.1, 0.2, 0.2, 0.1, 0.0),
`6` = c(0.1, 0.0, 0.1, 0.0, 0.2, 0.1))
Any help is much appreciated. Best,
It looks like my function works and this was a very easy fix. dplyr::mutate was recognizing x as the entire list when in fact I wanted it to apply mutate the vector x within each list. R is still running in the background but this should have taken care of it.
sum_class_function <- function(x){
true_names <- data.frame(x = 1:6, Freq = 0)
prop_df <- as.data.frame(prop.table(table(x)))
prop_df$x <- as.numeric(prop_df$x)
temp<- true_names %>%
anti_join(prop_df, by = "x") %>%
bind_rows(prop_df) %>%
arrange(x)
return(temp)
}
sum_class <- lapply(values_hab, sum_class_function)

How to for loop using different columns of data frame?

Basically I was working on a portfolio return problem. The stock return is like:
AMZN <- c(0.1, 0.3, 0.4, 0.2)
BBY <- c(0.2, 0.4, 0.5, 0.3)
TGT <- c(-0.1, -0.3, -0.2,-0.5)
df1 <- data.frame(AMZN, BBY, TGT)
date <- c("2000-01-01","2000-02-01", "2000-03-01", "2000-04-01")
date <- as.Date(date, "%Y-%m-%d")
df1 <- cbind(date, df1)
xts <- xts(df1[,-1], order.by=df1[,1])
I want to use Return.portfolio(xts, weight) to calculate portfolio return. So
The weight is like
w1 <- c(0.2, 0.3, 0.1, 0.4)
w2 <- c(0.5, 0.1, 0.1, 0.3)
w3 <- c(0.1, 0.1, 0.4, 0.4)
Weights <- data.frame(w1, w2, w3)
Since there are several groups of weights assigned, I need to get multiple portfolio return.
The code I tried is
for (i in colnames(Weights)){
Return.portfolio(xts, (Weights[[i]]))
}
Although R does not report any error, the only thing I got is a value which i is "w3".
I think you may need to initialize a NULL object first. Maybe something like this
Return<-NULL
for (i in 1:ncol(Weights)){
Return<- cbind(Return, Return.portfolio(xts, (Weights[[i]])))
}

Non-linear Optimization solnl function error in R: 'Argument of length zero'

I am trying to implement CVaR portfolio optimisation in R. Basically trying to replicate the Matlab approach used in this paper:
https://ethz.ch/content/dam/ethz/special-interest/mtec/chair-of-entrepreneurial-risks-dam/documents/dissertation/master%20thesis/Thesis_Matthias_Kull_2014.pdf
To do this I need to perform nonlinear optimisation with nonlinear constraints.
I have tried to use the nloptr package, but found the derivative calculation for the gradient of matrices beyond me.
Instead I have opted for the NlcOptim package which formulates the constraints in the same way as the Matlab function used in the paper.
library(NlcOptim)
# ====================================================================
# Just generate arbitrary returns data and bootstrap -----------------
asset_returns <- rbind(c(0.1, 0.05, 0.05, 0.01, 0.06),
c(0.05, 0.05, 0.06, -0.01, 0.09),
c(0.025, 0.05, 0.07, 0.02, -0.1),
c(0.01, 0.05, 0.08, -0.02, -0.01),
c(0.01, 0.05, 0.08, 0.00, 0.2),
c(0.005, 0.05, 0.09, 0.005, -0.15),
c(0.01, 0.05, 0.08, 0.01, -0.01),
c(0.012, 0.05, 0.00, -0.01, -0.01),
c(0.015, 0.05, 0.00, 0.03, 0.05),
c(0.02, 0.05, -0.01, 0.04, 0.03))
# Returns for 5 assets over 10 trading periods
nAssets <- ncol(asset_returns)
nReturns <- nrow(asset_returns)
nPeriods <- 4
nSims <- 10
# BOOTSTRAP ---------------------------------------------------------
sim_period_returns <- matrix(nrow = nSims, ncol = nAssets)
for (k in 1:nSims) {# run nSims simulations
sim_returns <- matrix(nrow = nPeriods, ncol = nAssets)
sample_order <- sample(nReturns, nPeriods)
for (i in 1:nPeriods) {
sim_returns[i,] <- asset_returns[sample_order[i],]
}
sim_prices <- rbind(rep(1, nAssets), 1 + sim_returns)
for (j in 1:nAssets) {
sim_period_returns[k, j] <- prod(sim_prices[, j]) - 1
}
}
# ------------------------------------------------------------------------
# ========================================================================
# The important stuff ====================================================
returns <- sim_period_returns
alpha <- 0.95
CVaR_limit <- 0.025
UB <- 0.75
LB <- 0.05
# Inequality constraints
A <- rbind(c(rep(0, nAssets), 1, 1/((1-alpha)*nSims) * rep(1, nSims)),
cbind(- returns, -1, diag(nSims)))
b <- as.matrix(c(-CVaR_limit, rep(0, nSims)), nrow = nSims, ncol = 1)
# Equality constraints
Aeq <- c(rep(1, nAssets), 0, rep(0, nSims))
beq <- 1
# Upper and lower bounds
UB <- c(rep(UB, nAssets), Inf, rep(Inf, nSims))
LB <- c(rep(LB, nAssets), 0, rep(0, nSims))
# Initial portfolio weights
w0 <- rep(1/nAssets, nAssets)
VaR0 <- quantile(returns %*% w0, alpha, names = F)
w0 <- c(w0, VaR0, rep(0, nSims))
objective_function <- function(x) {
# objective function to minimise
return (-colMeans(returns) %*% x[1:nAssets])
}
# **********************************************
# The solnl function giving the error based on the above inputs
solnl(X = w0,
objfun = objective_function,
A = A,
B = b,
Aeq = Aeq,
Beq = beq,
lb = LB,
ub = UB)
# **********************************************
# ===================================================================
I am receiving the following error:
Error in if (eq > 0 & ineq > 0) { : argument is of length zero
I have read the package source code and tried to figure out what is causing this error, but am still at a loss.
Checking the source code and input data, I think that the error starts at line 319 on NlcOptim when the following code is called nLineareq = nrow(Aeq);By calling nrow(Aeq) in the way that you have defined Aeq it will result in NULL a few lines later the expression if (eq > 0 & ineq > 0) is evaluated resulting in the error. Regarding the error you can find an explanation in here Argument is of length zero in if statement
A quick fix could be to change the shape on Aeq by using
Aeq <- t(array(c(rep(1, nAssets), 0, rep(0, nSims))))
However by changing that I get a different error when i try to run the code
Error: object 'lambda' not found
I'm not sure if the R implementation needs a different initial conditions or the method is not converging, since in the paper, the method used for the optimization was interior-point rather than SQP as implemented in NlcOptim.

Error in nonlinear least squeares in R - Logistic and Gompertz curves

I'm working on a model for variable y, in which I intend to use time as an explanatory variable. I've chosen a Gompertz and a logistic curve as candidates, but when I try to estimate the coefficients (using both nls and nls2), I end up getting different errors (singularity or step factor reduced below 'minFactor'). I would really appreciate any help. Here is my code and a deput version of the info object.
I chose the initial values according to the criteria in http://www.metla.fi/silvafennica/full/sf33/sf334327.pdf
library(nls2)
> dput(info)
structure(list(y = c(0.308, 0.279, 0.156, 0.214, 0.224, 0.222,
0.19, 0.139, 0.111, 0.17, 0.155, 0.198, 0.811, 0.688, 0.543,
0.536, 0.587, 0.765, 0.667, 0.811, 0.587, 0.617, 0.586, 0.633,
2.231, 2.202, 1.396, 1.442, 1.704, 2.59, 2.304, 3.026, 2.7, 3.275,
3.349, 3.936, 9.212, 8.773, 6.431, 6.983, 7.169, 9.756, 10.951,
13.938, 14.378, 18.406, 24.079, 28.462, 51.461, 46.555, 39.116,
43.982, 41.722), t = 1:53), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -53L))
summary(gomp_nls <- nls2(y ~ alpha*exp(-beta*exp(-gamma*t)),
data = info,
start = list(alpha = 40, beta = 4.9, gamma = 0.02),
algorithm = "default")
)
summary(logist_nls <- nls2(y ~ alpha/(1+beta*exp(-gamma*t)),
data = info,
start = list(alpha = 40, beta = 128, gamma = 0.02),
algorithm = "default"))
)
I'd appreciate any help
The "default" algorithm for nls2 is to use nls. You want to specify "brute-force" or one of the other algorithms for finding an initial value. The starting value should be a data frame of two rows such that it will fill in the hypercube so defined with potential starting values.
It will then evaluate the residual sum of squares at each of those starting values and return the starting values at which the formula gives the least sum of squares.
If you find that the result returned by nls2 is at the boundary of the region you defined then enlarge the region and try again. (You might not need this step if the starting value returned are good enough anyways.)
Finally run nls with the starting values you found.
library(nls2)
## 1
fo1 <- y ~ alpha*exp(-beta*exp(-gamma*t))
st1 <- data.frame(alpha = c(10, 100), beta = c(1, 100), gamma = c(0.01, 0.20))
fm1.0 <- nls2(fo1, data = info, start = st1, algorithm = "brute-force")
fm1 <- nls(fo1, data = info, start = coef(fm1.0))
## 2
fo2 <- y ~ alpha/(1+beta*exp(-gamma*t))
st2 <- data.frame(alpha = c(10, 1000), beta = c(1, 10000), gamma = c(0.01, 0.20))
fm2.0 <- nls2(fo2, data = info, start = st2, algorithm = "brute-force")
fm2 <- nls(fo2, data = info, start = coef(fm2.0))
# plot both fits
plot(y ~ t, info)
lines(fitted(fm1) ~ t, info, col = "blue")
lines(fitted(fm2) ~ t, info, col = "red")
Note
Note that for the data shown these two 2-parameter exponential models fit reasonably well so if you are only interested in the range where it rises exponentially then these could be alternatives to consider. (The first one below is better because the coefficients are more similar to each other. The second one may have scaling problems.)
fm3 <- nls(y ~ a * exp(b/t), info, start = c(a = 1, b = 1))
fm4 <- nls(y ~ a * t^b, info, start = c(a = .001, b = 6))

Solve best fit polynomial and plot drop-down lines

I'm using R 3.3.1 (64-bit) on Windows 10. I have an x-y dataset that I've fit with a 2nd order polynomial. I'd like to solve that best-fit polynomial for x at y=4, and plot drop-down lines from y=4 to the x-axis.
This will generate the data in a dataframe v1:
v1 <- structure(list(x = c(-5.2549, -3.4893, -3.5909, -2.5546, -3.7247,
-5.1733, -3.3451, -2.8993, -2.6835, -3.9495, -4.9649, -2.8438,
-4.6926, -3.4768, -3.1221, -4.8175, -4.5641, -3.549, -3.08, -2.4153,
-2.9882, -3.4045, -4.6394, -3.3404, -2.6728, -3.3517, -2.6098,
-3.7733, -4.051, -2.9385, -4.5024, -4.59, -4.5617, -4.0658, -2.4986,
-3.7559, -4.245, -4.8045, -4.6615, -4.0696, -4.6638, -4.6505,
-3.7978, -4.5649, -5.7669, -4.519, -3.8561, -3.779, -3.0549,
-3.1241, -2.1423, -3.2759, -4.224, -4.028, -3.3412, -2.8832,
-3.3866, -0.1852, -3.3763, -4.317, -5.3607, -3.3398, -1.9087,
-4.431, -3.7535, -3.2545, -0.806, -3.1419, -3.7269, -3.4853,
-4.3129, -2.8891, -3.0572, -5.3309, -2.5837, -4.1128, -4.6631,
-3.4695, -4.1045, -7.064, -5.1681, -6.4866, -2.7522, -4.6305,
-4.2957, -3.7552, -4.9482, -5.6452, -6.0302, -5.3244, -3.9819,
-3.8123, -5.3085, -5.6096, -6.4557), y = c(0.99, 0.56, 0.43,
2.31, 0.31, 0.59, 0.62, 1.65, 2.12, 0.1, 0.24, 1.68, 0.09, 0.59,
1.23, 0.4, 0.36, 0.49, 1.41, 3.29, 1.22, 0.56, 0.1, 0.67, 2.38,
0.43, 1.56, 0.07, 0.08, 1.53, -0.01, 0.12, 0.1, 0.04, 3.42, 0.23,
0, 0.34, 0.15, 0.03, 0.19, 0.17, 0.2, 0.09, 2.3, 0.07, 0.15,
0.18, 1.07, 1.21, 3.4, 0.8, -0.04, 0.02, 0.74, 1.59, 0.71, 10.64,
0.64, -0.01, 1.06, 0.81, 4.58, 0.01, 0.14, 0.59, 7.35, 0.63,
0.17, 0.38, -0.08, 1.1, 0.89, 0.94, 1.52, 0.01, 0.1, 0.38, 0.02,
7.76, 0.72, 4.1, 1.36, 0.13, -0.02, 0.13, 0.42, 1.49, 2.64, 1.01,
0.08, 0.22, 1.01, 1.53, 4.39)), .Names = c("x", "y"), class = "data.frame", row.names = c(NA,
-95L))
Here's the code to plot y vs x, plot the best fit polynomial, and draw a line at y=4.
> attach(v1)
> # simple x-y plot of the data
> plot(x,y, pch=16)
> # 2nd order polynomial fit
> fit2 <- lm(y~poly(x,2,raw=TRUE))
> summary(fit2)
> # generate range of numbers for plotting polynomial
> xx <- seq(-8,0, length=50)
> # overlay best fit polynomial
>lines(xx, predict(fit2, data.frame(x=xx)), col="blue")
> # add horizontal line at y=4
> abline(h=4, col="red")
>
It's obvious from the plot that y=4 at x of around -2 and -6.5, but I'd like to actually solve the regression polynomial for those values.
Ideally, I'd like lines that drop down from the red-blue line intersections to the x-axis (i.e plot vertical ablines that terminate at the two y=4 solutions). If that's not possible, I'd be happy with good old vertical ablines that go all the way up the plot, so long as they at the proper x solution values.
This graph represents parts that will be out-of-spec when y>4, so I want to use the drop-down lines to highlight the range of x values that will produce in-spec parts.
You can use the quadratic formula to calculate the values:
betas <- coef(fit2) # get coefficients
betas[1] <- betas[1] - 4 # adjust intercept to look for values where y = 4
# note degree increases, so betas[1] is c, etc.
betas
## (Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2
## 8.7555833 6.0807302 0.7319848
solns <- c((-betas[2] + sqrt(betas[2]^2 - 4 * betas[3] * betas[1])) / (2 * betas[3]),
(-betas[2] - sqrt(betas[2]^2 - 4 * betas[3] * betas[1])) / (2 * betas[3]))
solns
## poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)1
## -1.853398 -6.453783
segments(solns, -1, solns, 4, col = 'green') # add segments to graph
Much simpler (if you can find it) is polyroot:
polyroot(betas)
## [1] -1.853398+0i -6.453783+0i
Since it returns a complex vector, you'll need to wrap it in as.numeric if you want to pass it to segments.
I absolutely understand that there is an analytical solution for this simple quadratic polynomial. The reason I show you numerical solution is that you ask this question in regression setting. Numerical solution may always be your solution in general, when you have more complicated regression curve.
In the following I will use uniroot function. If you are not familiar with it, read this short answer first: Uniroot solution in R.
This is the plot produced with your code. You are almost there. This is a root finding problem, and you may numerically use uniroot. Let's define a function:
f <- function (x) {
## subtract 4
predict(fit2, newdata = data.frame(x = x)) - 4
}
From the figure, it is clear that there are two roots, one inside [-7, -6], the other inside [-3, -1]. We use uniroot to find both:
x1 <- uniroot(f, c(-7, -6))$root
#[1] -6.453769
x2 <- uniroot(f, c(-3, -1))$root
#[1] -1.853406
Now you can drop a vertical line from these points down to x-axis:
y1 <- f(x1) + 4 ## add 4 back
y2 <- f(x2) + 4
abline(h = 0, col = 4) ## x-axis
segments(x1, 0, x1, y1, lty = 2)
segments(x2, 0, x2, y2, lty = 2)
You have a quadratic equation
0.73198 * x^2 + 6.08073 * x + 12.75558 = 4
OR
0.73198 * x^2 + 6.08073 * x + 8.75558 = 0
You can just use the quadratic formula to solve this analytically. R gives the two roots:
(-6.08073 + sqrt(6.08073^2 -4*0.73198 * 8.75558)) / (2 * 0.73198)
[1] -1.853392
(-6.08073 - sqrt(6.08073^2 -4*0.73198 * 8.75558)) / (2 * 0.73198)
[1] -6.453843
abline(v=c(-1.853392, -6.453843))
Here is one more solution, based on this
attach(v1)
fit2 = lm(y~poly(x,2,raw=TRUE))
xx = seq(-8,0, length=50)
vector1 = predict(fit2, data.frame(x=xx))
vector2= replicate(length(vector1),4)
# Find points where vector1 is above vector2.
above = vector1 > vector2
# Points always intersect when above=TRUE, then FALSE or reverse
intersect.points = which(diff(above)!=0)
# Find the slopes for each line segment.
vector1.slopes = vector1[intersect.points+1] - vector1[intersect.points]
vector2.slopes = vector2[intersect.points+1] - vector2[intersect.points]
# Find the intersection for each segment.
x.points = intersect.points + ((vector2[intersect.points] - vector1[intersect.points]) / (vector1.slopes-vector2.slopes))
y.points = vector1[intersect.points] + (vector1.slopes*(x.points-intersect.points))
#Scale x.points to the axis value of xx
x.points = xx[1] + ((x.points - 1)/(49))*(xx[50]-xx[1])
plot(xx, y = vector1, type= "l", col = "blue")
points(x,y,pch = 20)
lines(x = c(x.points[1],x.points[1]), y = c(0,y.points[1]), col='red')
lines(x = c(x.points[2],x.points[2]), y = c(0,y.points[2]), col='red')
Many solutions are already proposed, here is another one.
As obvious, we are interested to find the x values that satisfy the polynomial (quadratic) equation a_0 + a_1.x + a_2.x^2 = 4, where a_0, a_1, a_2 are the coefficients of the fitted polynomial. We can rewrite the equation as a standard quadratic equation ax^2+bx+c=0 and find the roots using Sridhar's formula using the coefficients of the fitted polynomial with polynomial regression as follows:
a <- fit2$coefficients[3]
b <- fit2$coefficients[2]
c <- fit2$coefficients[1] - 4
as.numeric((-b + sqrt(b^2-4*a*c)) / (2*a))
#[1] -1.853398
as.numeric((-b-+ sqrt(b^2-4*a*c)) / (2*a))
#[1] -6.453783
We can use some numerical methods such as Newton-Raphson to find the roots as well (although there are faster numerical methods but this will solve our purpose and it's quite fast too, takes ~160 ms on my machine), as we can see from the following code, the numerical and the theoretical solutions agree.
a <- fit2$coefficients # fitted quadratic polynomial coefficients
f <- function(x) {
as.numeric(a[1] + a[2]*x + a[3]*x^2-4)
}
df <- function(x) {
as.numeric(a[2] + 2*a[3]*x)
}
Newton.Raphson <- function(x0) {
eps <- 1e-6
x <- x0
while(TRUE) {
x <- x0 - f(x0) / df(x0)
if (abs(x - x0) < eps) {
return(x0)
}
x0 <- x
}
}
t1 <- Sys.time()
x1 <- Newton.Raphson(-10)
x2 <- Newton.Raphson(10)
x1
#[1] -6.453783
x2
#[1] -1.853398
s2
print(paste('time taken to compute the roots:' ,Sys.time() - t1))
#[1] "time taken to compute the roots: 0.0160109996795654"
points(x1, 4, pch=19, col='green')
points(x2, 4, pch=19, col='green')
abline(v=x1, col='green')
abline(v=x2, col='green')

Resources