I use the "quantreg" library in R to estimate fully nonparametric quantile regressions on a time series basis. To get statistically significant results I try many combinations of variables and smoothing parameter values (lambda), but doing this by hand is exhausting and very time consuming. Therefore I want to apply a grid search, which is a bit hard for me to set up. To find the best smoothing values I should construct a for loop that tries every combination, and at the end I want to have the lambda values of the best model or models (those where all variables have p-values < 0.05).
For example, if I have three variables in my equation, I've written something like this:
lambdas1 <- rbind(1, 2, 3)
lambdas2 <- rbind(1, 2, 3)
lambdas3 <- rbind(1, 2, 3)
mylist <- list()
for (i in 1:3) {
  for (j in 1:3) {
    for (n in 1:3) {
      f <- try(rqss(Y ~ qss(X1, lambda = lambdas1[i]) + qss(X2, lambda = lambdas2[j]) + qss(X3, lambda = lambdas3[n]), tau = 0.05))
      sf <- summary(f)
      if ((sf[["qsstab"]]['X1', 'Pr(>F)'] < 0.05) & (sf[["qsstab"]]['X2', 'Pr(>F)'] < 0.05) & (sf[["qsstab"]]['X3', 'Pr(>F)'] < 0.05)) {
        mylist[[i]] <- f$lambdas
      }
    }
  }
}
How can I rearrange this code?
Is there any shortcut?
Any help will be appreciated.
Thank you in advance.
You can use base R's expand.grid to create a data.frame of all the possible combinations and then use apply(grid, MARGIN = 1, ...) to loop through its rows. I also simplified the part of your code that checks each p-value separately so that it uses all(p.vals < .05).
lambdas <- expand.grid(1:3, 1:3, 1:3)
check_lambdas <- function(lambdas){
  f <- try(rqss(Y ~ qss(X1, lambda = lambdas[1]) + qss(X2, lambda = lambdas[2]) + qss(X3, lambda = lambdas[3]), tau = 0.05))
  if (all(summary(f)$qsstab[, 'Pr(>F)'] < 0.05)) f$lambdas else NULL
}
apply(lambdas, 1, check_lambdas)
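If you want to collect only the lambda combinations that pass the significance check, one option (a small follow-up sketch, assuming the lambdas grid and check_lambdas above and your own Y, X1, X2, X3) is to loop with lapply so the result is always a list and then drop the NULL entries:
results <- lapply(seq_len(nrow(lambdas)),
                  function(i) check_lambdas(unlist(lambdas[i, ])))   # one result per grid row
good_lambdas <- Filter(Negate(is.null), results)                     # keep only the significant lambda sets
good_lambdas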
I tried to use princomp() and principal() to do PCA in R with the data set USArrests. However, I got two different results for the loadings/rotation and scores.
First, I centered and normalised the original data frame so it is easier to compare the outputs.
library(psych)
trans_func <- function(x){
  x <- (x - mean(x)) / sd(x)
  return(x)
}
A <- USArrests
USArrests <- apply(USArrests, 2, trans_func)
princompPCA <- princomp(USArrests, cor = TRUE)
principalPCA <- principal(USArrests, nfactors = 4, scores = TRUE, rotate = "none", scale = TRUE)
Then I got the results for the loadings and scores using the following commands:
princompPCA$loadings
principalPCA$loadings
Could you please help me explain why there is a difference, and how we can interpret these results?
At the very end of the help document of ?principal:
"The eigen vectors are rescaled by the sqrt of the eigen values to produce the component loadings more typical in factor analysis."
So principal returns the scaled loadings. In fact, principal produces a factor model estimated by the principal component method.
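As a quick check of that statement, rescaling the princomp loadings by the component standard deviations (the square roots of the eigenvalues) should reproduce the principal loadings up to sign; a minimal sketch using the objects defined above:
scaled_loadings <- princompPCA$loadings %*% diag(princompPCA$sdev)  # eigenvectors * sqrt(eigenvalues)
round(scaled_loadings, 3)                                           # compare with ...
round(unclass(principalPCA$loadings), 3)                            # ... the psych loadings (signs may flip)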
Four years later, I would like to provide a more accurate answer to this question. I use the iris data as an example.
data = iris[, 1:4]
First, do PCA via the eigen-decomposition of the covariance matrix:
eigen_res = eigen(cov(data))
l = eigen_res$values
q = eigen_res$vectors
Then the eigenvector corresponding to the largest eigenvalue gives the loadings of the first component:
q[,1]
We can treat this as a reference, or the correct answer. Now we check the results from the different R functions.
First, by function 'princomp'
res1 = princomp(data)
res1$loadings[,1]
# compare with
q[,1]
No problem, this function actually just returns the same results as 'eigen'. Now move on to 'principal'.
library(psych)
res2 = principal(data, nfactors=4, rotate="none")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
You may find they are still different. The problem is that the 'principal' function does the eigendecomposition on the correlation matrix by default. Note: PCA is not invariant under rescaling of the variables. If you modify the code as
res2 = principal(data, nfactors=4, rotate="none", cor="cov")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
Now, you will get the same results as 'eigen' and 'princomp'.
To summarize:
If you want to do PCA, you'd better use the 'princomp' function.
PCA is a special case of the factor model, or a simplified version of the factor model. It is just equivalent to eigendecomposition.
We can apply PCA to get an approximation of a factor model. It doesn't care about the specific factors, i.e. the epsilons in a factor model. So, if you change the number of factors in your model, you will get the same estimates of the loadings. This is different from maximum likelihood estimation.
If you are estimating a factor model, you'd better use the 'principal' function, since it provides more functionality, like rotation, calculating the scores by different methods, and so on.
Rescaling the loadings of a PCA model doesn't affect the results too much, since you still project the data onto the same optimal direction, i.e. the one that maximizes the variation in the resulting PC.
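As a quick illustration of that last point (a sketch reusing data, q and l from the eigen-decomposition above), the scores obtained from the unscaled eigenvector and from the rescaled loading are perfectly correlated, i.e. they point in the same direction:
centered <- scale(data, center = TRUE, scale = FALSE)
scores_unscaled <- centered %*% q[, 1]                  # project on the eigenvector
scores_scaled   <- centered %*% (sqrt(l[1]) * q[, 1])   # project on the rescaled loading
cor(scores_unscaled, scores_scaled)                     # equals 1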
The following compares eigen, princomp and principal side by side (the loops below assume DATA has two columns and that R is its correlation matrix):
ev <- eigen(R)  # R is the correlation matrix of DATA
ev$vectors %*% diag(ev$values) %*% t(ev$vectors)  # reconstructs R
pc <- princomp(scale(DATA, center = F, scale = T), cor = TRUE)
p <- principal(DATA, rotate = "none")
#eigen values
ev$values^0.5
pc$sdev
p$values^0.5
#eigen vectors - loadings
ev$vectors
pc$loadings
p$weights %*% diag(p$values^0.5)
pc$loadings %*% diag(pc$sdev)
p$loadings
#weights
ee <- diag(0, 2)
for (j in 1:2) {
  for (i in 1:2) {
    ee[i, j] <- ev$vectors[i, j] / p$values[j]^0.5
  }
}
ee
#scores
s <- as.matrix(scale(DATA, center = T, scale = T)) %*% ev$vectors
scale(s)
p$scores
scale(pc$scores)
I'm using R.
My dataset has about 40 different variables/vectors, and each has about 80 entries. I'm trying to find significant correlations; that is, I want to pick one variable and let R calculate the correlations of that variable with each of the other 39 variables.
I tried to do this with a linear model with one explanatory variable, that is: Y = a*X + b.
The lm() command then gives me an estimate of a and a p-value for that estimate. I would then go on and use one of the other variables I have for X and try again, until I find a p-value that's really small.
I'm sure this is a common problem. Is there some sort of package or function that can try all these possibilities (brute force), show them, and then maybe even sort them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
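If you are mainly interested in how y relates to the other variables, you can pull out the corresponding row of the p-value matrix and sort it (a small sketch using the objects above):
p_y <- correlations$P["y", -1]   # p-values of y against the other variables
sort(p_y)                        # smallest p-values first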
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
This will produce a correlogram plot of all the pairwise correlations.
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import the library and find the correlations (as in @Carlos's answer):
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m){
  for (j in 1:m){
    if (!is.na(correlations$P[i, j])){
      if (correlations$P[i, j] < 0.05) {
        print(paste(rownames(correlations$P)[i], "-", colnames(correlations$P)[j], ": ", correlations$P[i, j]))
      }
    }
  }
}
Warning
You should not use this for drawing any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values just by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that make adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
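If you do want a rough correction in R, one simple option (a sketch, assuming the correlations object from above) is p.adjust:
raw_p <- correlations$P["y", -1]                  # unadjusted p-values for y
adj_p <- p.adjust(raw_p, method = "bonferroni")   # or "holm", "BH", ...
adj_p[adj_p < 0.05]                               # correlations that survive the adjustment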
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
  the_data[, -1],
  function(x) cor(the_data[, 1], x),
  numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable, then you have to take the one that is most strongly correlated with y.
To do this, just use the command which.max(abs(cor(x, y))). If you want to use more than one variable in your model, then you should consider something like the lasso estimator.
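For example, with the glmnet package (a sketch, assuming the_data from the other answers, with y in the first column):
library(glmnet)
x <- as.matrix(the_data[, -1])
y <- the_data$y
cv_fit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso
coef(cv_fit, s = "lambda.min")         # non-zero coefficients = selected variables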
One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.
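If the full matrix is too much to scan by eye, you can blank out the weak correlations before writing it out (a small sketch reusing cor_result; the 0.5 cutoff is arbitrary):
strong <- cor_result
strong[abs(strong) < 0.5] <- NA   # keep only |r| >= 0.5
write.csv(strong, file = "strong_correlations.csv")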
I am writing some code to do a maximum likelihood estimation of some parameter values, and I am trying to create a surface plot of the parameter values returned by the optim function. I need to create a grid to do so, and it is this grid-creation step that is confounding me.
My MLE function looks like:
loglike <- function(par, dat, scale){
  ptp <- dat[1:length(dat) - 1]
  ptp1 <- dat[2:length(dat)]
  r <- par['r']
  k <- par['k']
  sigma <- par['sigma']
  if (scale == 'log') {
    return(sum(dnorm(log(ptp1) - log(ptp) * exp(r - (ptp / k)), mean = 0, sd = sigma, log = T)))
  }
  if (scale == 'sqrt') {
    return(sum(dnorm(sqrt(ptp1) - sqrt(ptp) * exp(r - (ptp / k)), mean = 0, sd = sigma, log = T)))
  }
  if (scale == 'linear') {
    return(sum(dnorm(ptp1 - ptp * exp(r - (ptp / k)), mean = 0, sd = sigma, log = T)))
  }
}
I have already obtained some output from optim giving me corresponding parameter values.
I have tried taking output from the optim function and putting it into the expand.grid function like:
gridlog<-expand.grid(logs[,"r"],logs[,"sigma"],logs[,"k"])
But all this does is create a large matrix filled with the same values.
The data going into the expand.grid function comes from:
logs <- list()
for (i in seq(1, 300, 0.1)){
  logs[i] <- optim(par = c(r = i, k = i, sigma = i), fn = loglike, dat = dat, scale = 'log',
                   method = 'Nelder-Mead', control = list(fnscale = -1))
}
logs <- do.call(rbind, logs)
This creates a roughly 300-row matrix of corresponding sigmas, r's and k's.
My data is:
c(100, 128.675595618645, 75.436115414503, 146.398449792328, 102.419994706974,
207.397726741841, 23.4579309898438, 42.4085746569567, 119.498216389673,
59.7845591706614, 119.37201616882, 252.047672957539, 28.3165331949818,
57.4918213065119, 311.615538092141, 8.53779749227741, 31.5382580618134,
115.617013730077, 43.6907812963781, 70.9139870053552, 123.004040266686,
132.575148404208, 114.813947981006, 115.950032495637, 120.891472762661,
97.0207348527786, 235.618894638631, 17.0936655960759, 49.4419128844531,
112.476950569973, 58.3241789008329, 80.0300102105128, 103.248819284132,
99.1968765946717, 113.905769052605, 143.181386861766, 62.962989192695,
174.054591300157, 39.9156352770331, 81.8344415290292, 176.631480374326,
51.5564038694108, 131.542259464434, 72.5981749979889, 38.9733086158719,
126.808054274927, 73.6960412245896, 62.5484608101147, 55.539355637003,
137.888502803112, 106.921926717155, 140.000738390606, 162.512046122238,
26.2949484171288, 80.4110888678422, 74.0481779531392, 33.9890286552257,
142.477859644323, 55.1820570626643, 107.242498924143, 56.8497685792794,
143.676120209843, 84.2334844367379, 67.0330079913484, 109.96246704725,
157.216290273118, 59.4585552091703, 67.2986524284706, 55.2529503291083,
38.932960005221, 62.7454169122216, 210.687014199037, 38.7348882392115,
75.6645116341029, 115.924283193145, 117.772958122253, 45.5313134644358,
112.306998515583, 38.7001172906923, 66.1308507048062, 122.516808638813,
38.8283932430479, 168.014298040365, 38.0902373313928, 117.414876109978,
168.615976661456, 66.5037228223079, 94.4482610053865, 505.254990783834,
1.05181785078369, 1.77594058056118, 4.36034444400473, 12.1485473106491,
82.2373017835424, 58.9775202042162, 132.907299665772, 51.2346939236555,
123.251093218535, 143.077217943039, 96.1524852870813)
Any help anyone could give would be greatly appreciated!!
#find optimum:
fit<-optim(par=c(r=1,k=1,sigma=1),fn=loglike,dat=dat,scale='log',
method='Nelder-Mead',control=list(fnscale=-1))
fit$par
r k sigma
0.3911590 254.4989317 0.5159761
# make grid around optimum with few selected sigma values:
rs<-seq(0.01,1,length=30)
ks<-seq(230,280,length=30)
sigmas<-c(0.25,0.5159761,0.75)
# this will contain all parameter combinations
# and the corresponding likelihood values
mlegrid <- cbind(as.matrix(expand.grid(rs, ks, sigmas)), 0)
colnames(mlegrid) <- c('r', 'k', 'sigma', 'likelihood')
for (i in 1:nrow(mlegrid)) {  # go through all combinations
  mlegrid[i, 4] <- loglike(par = mlegrid[i, 1:3], dat = dat, scale = 'log')
}
mlegrid[which.max(mlegrid[,4]),]
r k sigma likelihood
0.3855172 257.5862069 0.5159761 -74.9940496
# almost the same as from optim
# (differences due to sparse grid, more dense gives more accurate results)
# interactive plots with rgl; static versions can be made with the base `persp` function (sketched below)
library(rgl)
persp3d(x=rs,y=ks,
z=matrix(mlegrid[mlegrid[,3]==sigmas[1],4],nrow=length(rs)),col=2)
#with sigma from optim
persp3d(x=rs,y=ks,
z=matrix(mlegrid[mlegrid[,3]==sigmas[2],4],nrow=length(rs)),col=2)
persp3d(x=rs,y=ks,
z=matrix(mlegrid[mlegrid[,3]==sigmas[3],4],nrow=length(rs)),col=2)
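For completeness, a static version of the same surface with the base persp function might look like this (a sketch reusing rs, ks, sigmas and mlegrid from above):
persp(x = rs, y = ks,
      z = matrix(mlegrid[mlegrid[, 3] == sigmas[2], 4], nrow = length(rs)),
      xlab = "r", ylab = "k", zlab = "log-likelihood",
      theta = 30, phi = 20, col = "lightblue")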
I am using the code below to calculate a correlation map between two datasets. The code worked fine and produced the expected correlation map (image omitted here).
I would also like to get another map displaying how many pairs were used in the calculation of each pixel, so that I get a map of N along with the map of correlation.
As per Paul Hiemstra, this function gives both cor and N:
cor_withN = function(...) {
  cor_obj = cor.test(...)
  print(sprintf("N = %s", cor_obj$parameter + 2))
  return(data.frame(cor = cor_obj$estimate, N = cor_obj$parameter + 2))
}
cor_withN(runif(100), runif(100))
[1] "N = 100"
          cor   N
cor 0.1718225 100
When I simply replaced cor with cor_withN, I got this error:
Error in cor.test.default(...) : not enough finite observations
How can I apply this function in my code to get the two maps, one of correlation values and one of N values?
1. Error
Error in cor.test.default(...) : not enough finite observations
According to the cor.test source (http://svn.r-project.org/R/trunk/src/library/stats/R/cor.test.R), this error can appear in two cases:
You are using Pearson's correlation and have less than 3 finite pairs of observations.
You are using Kendall's or Spearman's correlation and have less than 2 pairs.
Indeed, cor.test(c(1,2), c(2,3)) causes exactly the same error, while cor(c(1,2), c(2,3)) gives an answer.
Note, that cor.test uses complete.cases(x,y) for calculations. So, look into your data - probably there are not enough pairs somewhere.
2. Function
cor returns a numeric value, while your function cor_withN returns a data.frame. So it doesn't look like you can simply replace one with the other.
As I understand it, you need just a matrix of size 1440x720 which will be plotted over the map. In this case you can use cor for the first plot, and a simple function returning the number of pairs used to calculate the correlation for the second. The function itself can be as simple as:
cor_withN <- function(...) {
  cor.test(...)$parameter + 2
}
UPDATE: After comment
If cor_withN must return NA when there are less than 3 pairs it should be modified:
cor_withN <- function(...) {
  res <- try(cor.test(...)$parameter + 2, silent = TRUE)
  ifelse(class(res) == "try-error", NA, res)
}
This function tries to compute the correlation test and, if it fails, returns NA; otherwise it returns the number of pairs.
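A quick check of its behavior (a small sketch):
cor_withN(runif(100), runif(100))   # enough pairs: returns 100
cor_withN(c(1, 2), c(2, 3))         # too few pairs: cor.test fails, so NA is returned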