Standardization of data in R

I am doing some PCA analysis for large spreadsheets, and I'm picking my PCs according to the loadings.
As far as I have read, since my data have different units, standardization is a must before performing the PCA.
Does the function prcomp() inherently perform standardization?
I was reading the prcomp() help file and saw this under the arguments of prcomp():
scale. a logical value indicating whether the variables should be scaled to have
unit variance before the analysis takes place. The default is FALSE for
consistency with S, but in general scaling is advisable. Alternatively, a
vector of length equal the number of columns of x can be supplied. The
value is passed to scale.
Does "scaling variables to have unit variance" mean standardization?
I am currently using this command:
prcomp(formula = ~., data=file, center = TRUE, scale = TRUE, na.action = na.omit)
Is it enough, or shall I do a separate standardization step?
Thanks,

Yes, scale = TRUE will result in all variables being scaled to have unit variance (i.e. a variance of 1, and hence a standard deviation of 1). This is the common definition of "standardise", though there are other ways to do it. center = TRUE mean-centres the data, i.e. the mean of a variable is subtracted from each observation of that variable.
When you do this (scale = TRUE, center = TRUE) instead of the PCA being on the covariance matrix of your data set, it is on the correlation matrix. Hence the PCA finds axes that explain the correlations between variables rather than their covariances.
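A minimal sketch of what this means in practice (using the built-in mtcars data rather than the asker's spreadsheet):
p1 <- prcomp(mtcars, center = TRUE, scale. = TRUE)
p2 <- prcomp(scale(mtcars))                    # standardise the columns yourself, then PCA
all.equal(abs(p1$rotation), abs(p2$rotation))  # TRUE: identical loadings (up to sign)
e <- eigen(cor(mtcars))
all.equal(p1$sdev^2, e$values)                 # PC variances are the eigenvalues of cor(x)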

If by standardization you mean that each column is divided by its standard deviation and the mean of each column is subtracted, then using scale = TRUE and center = TRUE is what you want.

Related

What does a proportional matrix look like for glmnet response variable in R?

I'm trying to use glmnet to fit a GLM that has a proportional response variable (using the family="binomial").
The help file for glmnet says that the response variable:
"For family="binomial" should be either a factor with
two levels, or a two-column matrix of counts or proportions (the second column
is treated as the target class"
But I don't really understand how I would have a two column matrix. My variable is currently just a single column with values between 0 and 1. Can someone help me figure out how this needs to be formatted so that glmnet will run it properly? Also, can you explain what the target class means?
It is a two-column matrix of counts of the positive and negative labels. For example, below we fit a model for the proportion of Claims among Holders:
library(glmnet)

data = MASS::Insurance                                     # claims data from the MASS package
y_counts = cbind(data$Holders - data$Claims, data$Claims)  # (failures, successes)
x = model.matrix(~District+Age+Group, data = data)
fit1 = glmnet(x = x, y = y_counts, family = "binomial", lambda = 0.001)
If possible, you should go back to the step before you computed your response variable and retrieve these counts. If that is not possible, you can provide a matrix of proportions (second column for the success class), but this assumes the weight, or n, is the same for all observations:
y_prop = y_counts / rowSums(y_counts)
fit2 = glmnet(x=x,y=y_prop,family="binomial",lambda=0.001)
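Following up on the point about equal weights: a hedged sketch (not verified here) of how, when the row totals differ, they can be passed through glmnet's weights argument so that the proportion-based fit carries the same information as the counts-based one:
n_trials <- rowSums(y_counts)                 # number of Holders per row
fit3 <- glmnet(x = x, y = y_prop, family = "binomial", lambda = 0.001, weights = n_trials)
cbind(coef(fit1), coef(fit3))                 # should closely match the counts-based fit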

Weights in Principal Component Analysis (PCA) using psych::principal

I am computing a Principal Component Analysis with this matrix as input, using the function psych::principal. Each column in the input data is the monthly correlations between crop yields and a climatic variable in a region (30 regions), so what I want to obtain with the PCA is to reduce the information and find similar patterns of response between regions.
pc <- principal(dat,nfactors = 9, residuals = FALSE, rotate="varimax", n.obs=NA, covar=TRUE,scores=TRUE, missing=FALSE, impute="median", oblique.scores=TRUE, method="regression")
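(To make this reproducible, a hypothetical stand-in for dat with the same layout can be built from random values; the real data are the crop-climate correlations described above.)
set.seed(1)
dat <- matrix(runif(10 * 30, -1, 1), nrow = 10, ncol = 30,
              dimnames = list(NULL, paste0("region", 1:30)))  # hypothetical 10 x 30 stand-in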
The matrix has dimensions 10*30, and the first message I get is:
The determinant of the smoothed correlation was zero. This means the objective function is not defined. Chi square is based upon observed residuals. The determinant of the smoothed correlation was zero. This means the objective function is not defined for the null model either. The Chi square is thus based upon observed correlations.
Warning messages:
1: In cor.smooth(r) : Matrix was not positive definite, smoothing was done
2: In principal(dat, nfactors = 3, residuals = F, rotate = "none", : The matrix is not positive semi-definite, scores found from Structure loadings
Nonetheless, the function seems to work; the main problem is when you check pc$weights and realize that it is equal to pc$loadings.
When the number of columns is less than or equal to the number of rows the results are coherent; however, that is not the case here.
I need the weights in order to express the score values on the same scale as the input data (correlation values).
I would really appreciate any help.
Thank you.

How to extract saved envelope values in Spatstat?

I am new to both R & spatstat and am working with the inhomogeneous pair correlation function. My dataset consists of point values spread across several time intervals.
sp77.ppp = ppp(sp77.dat$Plot_X, sp77.dat$Plot_Y, window = window77, marks = sp77.dat$STATUS)
Dvall77 = envelope((Y = dv77.ppp[dv77.ppp$marks == '2']), fun = pcfinhom,
                   r = seq(0, 20, 0.25), nsim = 999, divisor = 'd',
                   simulate = expression((rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks == '1']),
                                         (rlabel(dv77.ppp)[rlabel(dv77.ppp)$marks == '2'])),
                   savepatterns = T, savefuns = T)
I am trying to make multiple pairwise comparisons (between different time periods) and need to create a function that will go through every calculated envelope value, at each ‘r’ value, and find the min and max differences between the envelopes.
My question is: how do I find the saved envelope values? I know that savefuns = T saves all the simulated envelope values, but I can't figure out how to extract them. The summary (below) says that the values are stored. How do I call them up and extract them?
> print(Dvall77)
Pointwise critical envelopes for g[inhom](r)
and observed value for ‘(Y = dv77.ppp[dv77.ppp$marks == "2"])’
Edge correction: “iso”
Obtained from 999 evaluations of user-supplied expression
(All simulated function values are stored)
(All simulated point patterns are stored)
Alternative: two.sided
Significance level of pointwise Monte Carlo test: 2/1000 = 0.002
.......................................................................................
Math.label Description
r r distance argument r
obs {hat(g)[inhom]^{obs}}(r) observed value of g[inhom](r) for data pattern
mmean {bar(g)[inhom]}(r) sample mean of g[inhom](r) from simulations
lo {hat(g)[inhom]^{lo}}(r) lower pointwise envelope of g[inhom](r) from simulations
hi {hat(g)[inhom]^{hi}}(r) upper pointwise envelope of g[inhom](r) from simulations
.......................................................................................
Default plot formula: .~r
where “.” stands for ‘obs’, ‘mmean’, ‘hi’, ‘lo’
Columns ‘lo’ and ‘hi’ will be plotted as shading (by default)
Recommended range of argument r: [0, 20]
Available range of argument r: [0, 20]
Thanks in advance for any suggestions!
If you are looking to access the values of the summary statistic (ginhom) for each of the randomly labelled patterns, this is in principle documented in help(envelope.ppp). Admittedly this is long, and if you are new to both R and spatstat it is easy to get lost. The clue is in the Value section of the help file. The result is a data.frame with some additional classes (envelope and fv), and as the help file says:
Additionally, if ‘savepatterns=TRUE’, the return value has an
attribute ‘"simpatterns"’ which is a list containing the ‘nsim’
simulated patterns. If ‘savefuns=TRUE’, the return value has an
attribute ‘"simfuns"’ which is an object of class ‘"fv"’
containing the summary functions computed for each of the ‘nsim’
simulated patterns.
Then of course you need to know how to access an attribute in R, which is done using attr:
funs <- attr(Dvall77, "simfuns")
Then funs is a data.frame (and fv-object) with all the function values for each randomly labelled pattern.
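If you do need the individual simulated curves, here is a small sketch of getting them into a plain matrix (the columns other than r are assumed to be the simulated functions):
funs_df <- as.data.frame(funs)                       # fv objects are data frames underneath
r_vals  <- funs_df$r                                 # the distance argument r
sim_mat <- as.matrix(funs_df[, colnames(funs_df) != "r"])
apply(sim_mat, 1, range)                             # pointwise min and max over the simulated curves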
I can't really tell from your question whether you just need the values of the upper and lower curves defining the envelope. In that case you just access them like an ordinary data.frame (and there is no need to save all the individual function values in the envelope):
lo <- Dvall77$lo
hi <- Dvall77$hi
d <- hi - lo
More elegantly you can do:
d <- with(Dvall77, hi - lo)
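If the aim is then to compare envelopes from different time periods, a minimal sketch (assuming a second envelope object, say Dvall92, computed the same way for another period; that object name is hypothetical):
w77 <- with(Dvall77, hi - lo)   # pointwise envelope width for this period
w92 <- with(Dvall92, hi - lo)   # pointwise envelope width for the other (hypothetical) period
range(w77 - w92)                # min and max difference over all r values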

Is it possible to specify a range for numbers randomly generated by mvrnorm( ) in R?

I am trying to generate a random set of numbers that exactly mirror a data set that I have (to test it). The dataset consists of 5 variables that are all correlated, with different means, standard deviations and ranges (they are Likert scales added together to form one variable each). I have been able to get mvrnorm from the MASS package to create a dataset that replicates the correlation matrix with the observed number of observations (after 500,000+ iterations), and I can easily reassign means and standard deviations through z-score transformation, but I still have specific values within each variable vector that are far above or below the possible range of the scale I wish to replicate.
Any suggestions how to fix the range appropriately?
Thank you for sharing your knowledge!
To generate a sample that does "exactly mirror" the original dataset, you need to make sure that the marginal distributions and the dependence structure of the sample matches those of the original dataset.
A simple way to achieve this is with resampling:
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5) # Some dummy data
my.ind <- sample(1:nrow(my.data), nrow(my.data), replace = TRUE)
my.sample <- my.data[my.ind, ]
This will ensure that the margins and the dependence structure of the sample (closely) matches those of the original data.
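As a quick check of that claim with the dummy data above:
round(cor(my.data) - cor(my.sample), 2)   # correlations are (closely) reproduced
range(my.data); range(my.sample)          # resampled values can never leave the observed range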
An alternative is to use a parametric model for the margins and/or the dependence structure (copula). But as stated by @dickoa, this will require serious modeling effort.
Note that by using a multivariate normal distribution, you are (implicitly) assuming that the dependence structure of the original data is the Gaussian copula. This is a strong assumption, and it would need to be validated beforehand.

Predict.lm() in R - how to get nonconstant prediction bands around fitted values

So I am currently trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have a few problems really understanding the function and I do not like using functions without knowing what's happening. I found several how-to's on this subject, but only with the corresponding R-code, no real explanation.
This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
Now, what I've trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. For calculating the confidence interval I obviously need the data that the interval is for (like the number of observations, the mean of x, etc.), so that cannot be what is meant by it. But then: what does it mean?
2) interval
Type of interval calculation.
Okay, but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type="terms", which terms (default is all terms)
3a: Can I use that to get the confidence interval for one specific variable in my model? And if so, what is 3b for? If I can specify the term in 3a, it wouldn't make sense to do it again in 3b, so I guess I'm wrong again, but I cannot figure out why.
I guess some of you might think: why not just try this out? And I would (even if it would maybe not solve everything here), but right now I don't know how to. As I do not know what newdata is for, I don't know how to use it, and if I try, I do not get the right confidence interval. Somehow it is very important how you choose that data, but I just don't understand!
EDIT: I want to add that my intention is to understand how predict.lm works. By that I mean I don't understand whether it works the way I think it does, that is: it calculates y-hat (the predicted values) and then adds/subtracts the upr/lwr bounds of the interval for each of them to produce several data points (which then look like a confidence line)? Then I would understand why it is necessary to have the same length in newdata as in the linear model.
Make up some data:
d <- data.frame(x=c(1,4,5,7),
y=c(0.8,4.2,4.7,8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
Conf. and pred. intervals with new x values (extrapolation and more finely/evenly spaced than original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
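For completeness, a small sketch of those two alternatives, using the lm1 fit from the code above:
confint(lm1)                                        # 95% confidence intervals for the coefficients
pt <- predict(lm1, type = "terms", se.fit = TRUE)   # per-term contributions with standard errors
head(pt$fit)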
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.
