"Manual" Factor Analysis in R - r

I am trying to follow along the Factor Analysis chapter in "Using Multivariate Statistics", by Tabachnick and Fidell.
The data, and my steps, are as follows:
# data
dat.ski <- data.frame(
  skiers = paste0("S", 1:5),
  cost   = c(32, 61, 59, 36, 62),
  lift   = c(64, 37, 40, 62, 46),
  depth  = c(65, 62, 45, 34, 43),
  powder = c(67, 65, 43, 35, 40)
)
# correlation matrix
cor.ski <- cor(dplyr::select(dat.ski, -skiers))
# eigenvalues and eigenvectors
eig.ski <- eigen(cor.ski)
The correlation matrix and eigenvalues (2.02, 1.94, 0.04 and 0.00) correspond to those in the book. The first two eigenvectors I get are (.352, -0.251, -0.626, -0.647) and (.514, -.664, .322, .280).
However, the book then says that only the first two eigenvalues are retained and the "factor analysis is re-run", which results in the following two eigenvalues*: 2.00, 1.91 and eigenvectors (-0.283, 0.177, 0.568, 0.675) and (0.651, -0.685, 0.252, 0.207). I can't work out how to reproduce these eigenvectors... if I run psych::fa(cor.ski, nfactors=2, fm="pa"), the SS loadings correspond to the new eigenvalues*.
Any help on how to return the eigenvectors as per the text will be greatly appreciated.
Thanks.

I worked this out by remembering that R lets you read the source of its functions! By looking at the definition of psych::fac, I see that the authors have actually performed 7 iterations of factor analysis, not merely "taken the first two eigenvectors and re-run FA". I also finally understand how factor analysis is performed and can tie it in with the subsequent text, which in a nutshell is:
Starting with the correlation matrix R and assuming k factors are retained:
Get the eigenvalues (L) and eigenvectors (V) of the correlation matrix R
Calculate C = sum(diag(R))
Calculate the loadings, A = V[, 1:k] %*% diag(sqrt(L[1:k])) (eqn 13.6 of the text)
Set R* = A %*% t(A) (eqn 13.5 of the text, R = AA')
Set C* = sum(diag(R*))
Update diag(R) = diag(R*)
Repeat the above steps until the maximum number of iterations is reached, or until e = abs(C - C*) falls below some threshold
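A minimal R sketch of that loop (the function name pa.iterate, the tolerance, and the iteration cap are my own choices, not from the text):
# iterated principal-axis factoring, following the steps above
pa.iterate <- function(R, k = 2, max.iter = 50, tol = 0.001) {
  C <- sum(diag(R))
  for (i in 1:max.iter) {
    e <- eigen(R)
    A <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))  # loadings (eqn 13.6)
    R.star <- A %*% t(A)                                 # reproduced R (eqn 13.5)
    C.star <- sum(diag(R.star))
    if (abs(C - C.star) < tol) break                     # e = abs(C - C*)
    diag(R) <- diag(R.star)  # replace diagonal with communalities
    C <- C.star
  }
  A
}
pa.iterate(cor.ski, k = 2)  # compare with psych::fa(cor.ski, nfactors=2, fm="pa")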

Error in runjags when fitting a multinomial model: "is partly observed and partly missing"

I am new to Jags and I'm trying to fit a multinomial model to my data.
When I run the code I get the following error: "positive.counts[1,1:9] is partly observed and partly missing".
I googled it and found that this happens because a node cannot contain both observed and missing values at the same time. In my data (see code below) there are both values and NAs in the same row. If I substitute 0 for the NAs, the model runs properly.
Does anyone have a solution to this?
Below you can find the data and the code!
Many thanks in advance,
Elisa
##########################################################################
# load jags
library(runjags)
# define the data:
data <- list("N_cases"=c(978,737, 737, 1189, 270, 268), "positive.counts" = matrix(c(649 ,567 ,464 ,821, 98, 117,203 , 133, 81, 290, 41, 26,3, 7, 4, 6, 5, 0,NA, NA, NA, NA, 20, 19,24, 15, 3, NA, 21, 15,NA, NA, 184, NA, 17, 23, NA, NA, NA, NA, 26, 17,99, 15, 1, 72, 14, 25,NA, NA, NA, NA, 28, 26),6,9), "n"= 6,"n_responses" = 9)
# define the model
model {
  mu.w <- 0
  prec <- 0.5
  for (s in 1:n_responses) {
    w[s] ~ dnorm(mu.w, prec)
    a[s] <- exp(w[s])  # positive parameter
  }
  for (i in 1:n) {
    positive.counts[i, 1:n_responses] ~ dmulti(p[i, 1:n_responses], N_cases[i])
  }
  for (i in 1:n) {
    for (s in 1:n_responses) {
      delta[i, s] ~ dgamma(a[s], 1)
    }
  }
  for (i in 1:n) {
    for (s in 1:n_responses) {
      p[i, s] <- delta[i, s] / sum(delta[i, 1:n_responses])
    }
  }
}
# run the model
n.adapt <- 1000
n.burn <- 5000
n.iters <- 10000
n.chains <- 5
n.total.samples <- 10000
n.samples.per.chain <- n.total.samples %/% n.chains
n.thin <- n.iters %/% n.samples.per.chain
if (n.thin == 0) n.thin <- 1
tomonitor <- c("a","p")
mcmc.post <- run.jags(model="multi_model.jags",
data=data,
method="parallel",
sample=n.samples.per.chain,
burnin=n.burn,
adapt=n.adapt,
n.chains=n.chains,
thin=n.thin,
monitor= tomonitor);
As you've discovered, JAGS doesn't let you model a partially observed multinomial distribution. That's a limitation. If you replace the NAs with zeroes, you're no longer saying "I have no data for this combination of row and column variables" but rather "I have observed no events for this combination of row and column variables". So the model will (probably) run and produce some output, but you're no longer modelling your observed dataset, so the model's results don't apply to your data any more, and conclusions you draw from it will be invalid.
So, you need a way of modelling multinomial data with incomplete observations. One way of doing that is to transform your data into a series of related binomial variables. For each category c[i] in your multinomial data with k categories, you create a binomial variable Y[i] ~ Bin(p[i], n) where n is the number of observations (N_cases in your data?) with that combination of predictors (rows of positive.counts in your data?) and p[i] = Prob(x >= c[i]). In other words, p[i] is the probability that an observation with the relevant set of predictor variables is in category c[i] or above. (So, by definition, p[1] = 1.)
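To make that concrete, here is a small sketch of my own using row 6 of your positive.counts (the one row with no NAs), showing the cumulative "category c or above" counts that the binomial decomposition is built on:
counts <- c(117, 26, 0, 19, 15, 23, 17, 25, 26)  # row 6 of positive.counts
N <- sum(counts)               # 268, matching N_cases[6]
Y <- rev(cumsum(rev(counts)))  # Y[c] = observations in category c or above
p.hat <- Y / N                 # empirical Prob(x >= c); p.hat[1] is 1 by construction
Each Y[c] then becomes its own binomial node in JAGS, so an unobserved cell is just a missing scalar rather than a partially observed vector.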
I'm sorry if that explanation is a little abstract, but you've given us your code but no context. I can't explain it in terms of your actual model, because I don't know what it is.
Two ways of fitting such a model in JAGS are given in this post. You can trust the author of that answer: he is Martyn Plummer, and he wrote JAGS.

Creating an Excel one-way data table in R -- Problem with my for loop

I'm trying to create an Excel-style one-way data table in R so that I can find the exponent that minimizes the errors of a coefficient in an equation. I have a for loop that produces the correct result, but it does something strange that I can't figure out.
Here is an example of the data. I'll use the Pythagorean win formula from baseball and use a for loop to find the exponent that minimizes the mean absolute error in the win projections.
## Create Data
Teams <- c("Bulls", "Sharks", "Snakes", "Dogs", "Cats")
Wins <- c(5, 3, 8, 1, 9)
Losses <- 10 - Wins
Win.Pct <- Wins/(Wins + Losses)
Points.Gained <- c(30, 50, 44, 28, 60)
Points.Allowed <- c(28, 74, 40, 92, 25)
season <- data.frame(Teams, Wins, Losses, Win.Pct, Points.Gained, Points.Allowed)
season
## Calculate Scoring Ratio
season$Score.Ratio <- with(season, Points.Gained/Points.Allowed)
## Predict Wins from Scoring Ratio
exponent <- 2
season$Predicted.Wins <- season$Score.Ratio^exponent / (1 + season$Score.Ratio^exponent)
## Calculate Mean Absolute Error
season$Abs.Error <- with(season, abs(Win.Pct - Predicted.Wins))
mae <- mean(season$Abs.Error)
mae
Here is my for loop, which looks at a range of exponent options to see if any of them are better than the exponent, 2, used above. For some strange reason, when I run the for loop, it keeps printing the table several times (many of the tables with incorrect results) until finally producing the correct table as the last one. Can anyone explain what is wrong with my for loop and why this is happening?
## Identify potential exponent options that minimize mean absolute error
exp.options <- seq(from = 0.5, to = 3, by = 0.1)
mae.results <- data.frame("Exp" = exp.options, "Results" = NA)
for(i in 1:length(exp.options)){
  win.pct <- season$Predicted.Wins
  pred.win.pct <-
    (season$Points.Gained/season$Points.Allowed)^exp.options[i] /
    (1 + (season$Points.Gained/season$Points.Allowed)^exp.options[i])
  mae.results[i,2] <- mean(abs(win.pct - pred.win.pct))
  print(mae.results)
}
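The likely culprit: print(mae.results) sits inside the loop body, so every iteration prints the partially filled table (rows not yet computed still show NA). A minimal sketch with the print moved after the loop; note that comparing against season$Win.Pct (the observed win percentage) rather than season$Predicted.Wins is presumably what was intended:
for (i in seq_along(exp.options)) {
  pred.win.pct <-
    (season$Points.Gained/season$Points.Allowed)^exp.options[i] /
    (1 + (season$Points.Gained/season$Points.Allowed)^exp.options[i])
  mae.results[i, 2] <- mean(abs(season$Win.Pct - pred.win.pct))
}
print(mae.results)  # printed once, after the table is fully filled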

Calculating cumulative hypergeometric distribution

Suppose I have 100 marbles, and 8 of them are red. I draw 30 marbles, and I want to know the probability that at least five of them are red. I am currently using http://stattrek.com/online-calculator/hypergeometric.aspx and I entered 100, 8, 30, and 5 for population size, number of successes, sample size, and number of successes in sample, respectively. So the probability I'm interested in is the cumulative probability $P(X \geq 5)$, which is 0.050 in this case. My question is, how do I calculate this in R?
I tried
> 1-phyper(5, 8, 92, 30, lower.tail = TRUE)
[1] 0.008503108
But this is very different from the previous answer.
phyper(5, 8, 92, 30) gives the probability of drawing five or fewer red marbles.
1 - phyper(5, 8, 92, 30) thus returns the probability of getting six or more red marbles
Since you want the probability of getting five or more (i.e. more than 4) red marbles, you should use one of the following:
1 - phyper(4, 8, 92, 30)
[1] 0.05042297
phyper(4, 8, 92, 30, lower.tail=FALSE)
[1] 0.05042297
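As a sanity check (my own, not from the original answer), summing the point probabilities over five to eight red marbles gives the same tail mass:
sum(dhyper(5:8, 8, 92, 30))
# [1] 0.05042297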
Why use 1 - phyper(..., lower.tail = TRUE)? It is easier to use phyper(..., lower.tail = FALSE). Even though they are mathematically equivalent, there are numerical reasons for preferring the latter: when the tail probability is tiny, subtracting from 1 throws away nearly all the floating-point precision, while computing the tail directly keeps it.
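A quick illustration of that precision point (my own, using pnorm, where the effect is easy to see far out in a tail):
1 - pnorm(10)                  # [1] 0 -- pnorm(10) rounds to exactly 1, so all precision is lost
pnorm(10, lower.tail = FALSE)  # [1] 7.619853e-24 -- computed directly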
Does that fix your problem? I believe you are passing the correct inputs to the phyper function. Is it possible that you're looking at the wrong output on the website you linked?

Converting matrix operations in Matlab to R code

I'm trying to convert Matlab code to R. I'm not familiar with Matlab matrix operations, and it appears the results from my R code do not match the results from Matlab, so any help would be greatly appreciated. The Matlab code I'm trying to convert is below (from this website):
% Mean Variance Optimizer
% S is matrix of security covariances
S = [185 86.5 80 20; 86.5 196 76 13.5; 80 76 411 -19; 20 13.5 -19 25]
% Vector of security expected returns
zbar = [14; 12; 15; 7]
% Unity vector..must have same length as zbar
unity = ones(length(zbar),1)
% Vector of security standard deviations
stdevs = sqrt(diag(S))
% Calculate Efficient Frontier
A = unity'*S^-1*unity
B = unity'*S^-1*zbar
C = zbar'*S^-1*zbar
D = A*C-B^2
% Efficient Frontier
mu = (1:300)/10;
% Plot Efficient Frontier
minvar = ((A*mu.^2)-2*B*mu+C)/D;
minstd = sqrt(minvar);
plot(minstd,mu,stdevs,zbar,'*')
title('Efficient Frontier with Individual Securities','fontsize',18)
ylabel('Expected Return (%)','fontsize',18)
xlabel('Standard Deviation (%)','fontsize',18)
Here is my attempt in R:
# S is matrix of security covariances
S <- matrix(c(185, 86.5, 80, 20, 86.5, 196, 76, 13.5, 80, 76, 411, -19, 20, 13.5, -19, 25), nrow=4, ncol=4, byrow=TRUE)
# Vector of security expected returns
zbar = c(14, 12, 15, 7)
# Unity vector..must have same length as zbar
unity <- rep(1, length(zbar))
# Vector of security standard deviations
stdevs <- sqrt(diag(S))
# Calculate Efficient Frontier
A <- unity*S^-1*unity
B <- unity*S^-1*zbar
C <- zbar*S^-1*zbar
D <- A*C-B^2
# Efficient Frontier
mu = (1:300)/10
# Plot Efficient Frontier
minvar = ((A*mu^2)-2*B*mu+C)/D
minstd = sqrt(minvar)
It appears that unity*S in Matlab is equivalent to colSums(S) in R. But I haven't been able to figure out how to calculate the equivalent of S^-1*unity in R. If I type this Matlab code in R (S^-1*unity), it calculates without error, but it gives a different answer. Because I don't understand the underlying Matlab calculation, I'm not sure how to translate it to R.
I used to do Matlab -> R conversions quite a bit a few years ago.
My general suggestion is to open two terminals side by side and convert everything line by line, checking after each line that what you got in Matlab and R is equivalent.
This document should be handy: http://mathesaurus.sourceforge.net/octave-r.html
In your case these appear to be the commands that you should have in mind:
Matrix multiplication:
Matlab: A*B
R: A %*% B
Transpose:
Matlab: A'
R: t(A)
Matrix inverse:
Matlab: inv(A) or A^-1
R: solve(A)
Don't try to convert everything at once because you will run into trouble. When the results won't match you will not be able to tell where the error is.
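Putting those correspondences together, a line-by-line translation of the frontier calculation might look like this (my own sketch, not verified against the original plot):
Sinv <- solve(S)                              # inv(S), i.e. Matlab's S^-1
A <- as.numeric(t(unity) %*% Sinv %*% unity)  # unity'*S^-1*unity
B <- as.numeric(t(unity) %*% Sinv %*% zbar)   # unity'*S^-1*zbar
C <- as.numeric(t(zbar) %*% Sinv %*% zbar)    # zbar'*S^-1*zbar
D <- A * C - B^2
mu <- (1:300) / 10
minvar <- (A * mu^2 - 2 * B * mu + C) / D
minstd <- sqrt(minvar)
plot(minstd, mu, type = "l",
     main = "Efficient Frontier with Individual Securities",
     xlab = "Standard Deviation (%)", ylab = "Expected Return (%)")
points(stdevs, zbar, pch = 8)  # individual securities, like Matlab's '*' markers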

R: mix() in mixdist package returning error

I have installed the mixdist package in R to combine distributions. Specifically, I'm using the mix() function. See documentation.
Basically, I'm getting
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
I googled the error message, but no useful results popped up.
My first argument to mix() is a data frame called data.df. It is formatted exactly like the built-in data set pike65. I also did data.df <- as.mixdata(data.df).
My second argument has two rows. It is a data frame called datapar, formatted exactly like pikepar. My pi values are 0.5 and 0.5. My mu values are 250 and 463 (based on my data set). My sigma values are 0.5 and 1.
My call to mix() looks like:
fitdata <- mix(data.df, datapar, "norm", constr = mixconstr(consigma="CCV"), emsteps = 3, print.level = 2)
The printing shows that my pi values go from 0.5 to NaN after the first iteration, and that my gradient is becoming 0.
I would appreciate any help in sorting out this error.
Thanks,
n.i.
Using the test data you linked to
library(mixdist)
time <- seq(673,723)
counts <-c(3,12,8,12,18,24,39,48,64,88,101,132,198,253,331,
419,563,781,1134,1423,1842,2505,374,6099,9343,13009,
15097,13712,9969,6785,4742,3626,3794,4737,5494,5656,4806,
3474,2165,1290,799,431,213,137,66,57,41,35,27,27,27)
data.df <- data.frame(time=time, counts=counts)
We can see that
data.mix <- as.mixdata(data.df)
startparam <- mixparam(c(699, 707), 1)
data.fit <- mix(data.mix, startparam, "norm")
gives the same error. The error appears to be closely tied to the data (so the reason this data fails could be different from the reason yours does, but this is the only example you offered up).
The problem with this data is that the probability of belonging to the two groups becomes indistinguishable at some point. When that happens, the "E" step of the algorithm cannot estimate the pi variable properly. Here
pnorm(717,707,1)
# [1] 1
pnorm(717,699,1)
# [1] 1
both are exactly 1, and this seems to be causing the error. When mix takes 1 minus this value and uses the ratio to estimate group membership, it gets NaN values, which propagate to the estimate of the proportions. When these NaN values are passed internally to nlm() to do the estimation, you get the error message
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
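The 0/0 ratio can be seen directly (my own illustration of the point above, not the package's internal code):
(1 - pnorm(717, 707, 1)) / ((1 - pnorm(717, 707, 1)) + (1 - pnorm(717, 699, 1)))
# [1] NaN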
The same error message can be replicated with
f <- function(x) sum((x - 1:length(x))^2)
nlm(f, c(10, 10))   # runs fine
nlm(f, c(10, NaN))  # reproduces the error: missing value in parameter
So it appears the mixdist package will not work in this scenario. You may wish to contact the package maintainer to see if they are aware of the problem. In the meantime, you will need to find another way to estimate the parameters of your mixture model.
Now, I am not an expert in mixture distributions, but I think @MrFlick's accepted answer is a little misleading for anyone googling the error message (although it is no doubt correct for the example he gave). The core problem is that in both your linked code and your example, the sigma values are very small compared to the mu values. I think the algorithm simply cannot find a solution with such small starting sigma values. If you increase the sigma values, you will get a solution. Using the linked code as an example:
library(mixdist)
time <- seq(673,723)
counts <- c(3, 12, 8, 12, 18, 24, 39, 48, 64, 88, 101, 132, 198, 253, 331, 419, 563, 781, 1134, 1423, 1842, 2505, 374, 6099, 9343, 13009, 15097, 13712, 9969, 6785, 4742, 3626, 3794, 4737, 5494, 5656, 4806, 3474, 2165, 1290, 799, 431, 213, 137, 66, 57, 41, 35, 27, 27, 27)
data.df <- data.frame(time=time, counts=counts)
data.mix <- as.mixdata(data.df)
startparam <- mixparam(mu = c(699,707), sigma = 1)
data.fit <- mix(data.mix, startparam, "norm") ## Leads to the error message
startparam <- mixparam(mu = c(699,707), sigma = 5) # Adjust start parameters
data.fit <- mix(data.mix, startparam, "norm")
plot(data.fit)
data.fit ### Estimates somewhat reasonable mixture distributions
# Parameters:
# pi mu sigma
# 1 0.853 699.3 4.494
# 2 0.147 708.6 2.217
The bottom line: if you can increase your starting sigma values, the mix() function might find reasonable estimates for you. You do not necessarily have to try another package.
In addition, you can get this message if you have missing data in your dataset.
From example set
data(pike65)
data(pikepar)
pike65$freq[10] <- NA
fitpike1 <- mix(pike65, pikepar, "lnorm", constr = mixconstr(consigma = "CCV"), emsteps = 3)
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
