library(rqPen)
n <- 60
p <- 7
rho <- .5
beta <- c(3,1.5,0,2,0,0,0)
R <- matrix(0,p,p)
for(i in 1:p){
for(j in 1:p){
R[i,j] <- rho^abs(i-j)
}
}
set.seed(1234)
x <- matrix(rnorm(n*p),n,p) %*% t(chol(R))
y <- x %*% beta + rnorm(n)
q.lasso_scad <- cv.rq.pen(x, y, tau = 0.5, lambda = NULL, penalty = "SCAD",
                          intercept = FALSE, criteria = "CV", cvFunc = "check",
                          nfolds = 10, foldid = NULL, nlambda = 100,
                          eps = 1e-04, init.lambda = 1, alg = "QICD")
q.lasso_scad
coef1 = q.lasso_scad$models[[which.min(q.lasso_scad$cv[,2])]]
coef1
I have the following output
Coefficients:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
0.0000000 0.3226967 1.8131688 -0.1971847 0.1981571 0.7715635 -0.2289284 -0.1087028 0.9713283 -0.1079333
I want to extract the coefficients only. How can I do that?
Thank you in advance.
It's a bit backwards but you can do:
as.data.frame(as.list.data.frame(coef1)$coefficients)
Result:
as.list.data.frame(coef1)$coefficients
x1 3.17487201
x2 1.15712559
x3 0.05078333
x4 2.27113756
x5 0.24893740
x6 0.00000000
x7 -0.07542964
If I understand the issue correctly, the object returned by rqPen is a list-like object with additional attributes. as.list.data.frame essentially forces coef1 into a "normal" list, which allows me to use $coefficients to extract the coefficient values. Lastly, I use as.data.frame to convert the result into a more usable object.
If you just want the values, you can replace as.data.frame with as.vector:
as.vector(as.list.data.frame(coef1)$coefficients)
Result:
[1] 3.17487201 1.15712559 0.05078333 2.27113756 0.24893740 0.00000000
[7] -0.07542964
I don't have access to R at the moment, so I can't verify that it will work, but try this:
names(coef1) <- NULL
coef1
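For what it's worth, if coef1 behaves like an ordinary fitted-model object underneath (i.e. it stores its estimates in a coefficients component, as the as.list.data.frame trick above suggests), the standard accessor may already give you what you want; a minimal sketch, not verified against every rqPen version:
coefficients(coef1)            # named coefficient vector, via the generic accessor
unname(coefficients(coef1))    # values only, names dropped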
I'm working with a database where, by company decision, I can't change the variable names.
One of the variables is named as follows:
%Variable
I fit a model (lm object) and this variable is included.
Now I want to use the predict() function, and I need to create a data frame that uses this same name for one of the columns in order to predict values. I'm doing the following:
new_x <- data.frame(X1 = 1, X2 = 0, X3 = 0, X4= 1, X5 = 0.765, `%VARIABLE` = 16.1)
predict(object = model4, newdata = new_x, level = 0.95, interval = 'confidence')
However, in the new_x data frame the last column is named X.VARIABLE instead of %VARIABLE.
How can I fix this?
The function names() works.
names(new_x)[6] <- "%VARIABLE"
new_x
X1 X2 X3 X4 X5 %VARIABLE
1 1 0 0 1 0.765 16.1
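Alternatively, you may be able to skip the renaming step entirely: data.frame() only converts %VARIABLE to X.VARIABLE because of its name checking, which the standard check.names argument switches off. A small sketch:
new_x <- data.frame(X1 = 1, X2 = 0, X3 = 0, X4 = 1, X5 = 0.765,
                    `%VARIABLE` = 16.1, check.names = FALSE)
names(new_x)  # the last column keeps the name "%VARIABLE"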
Out of curiosity, I am trying to figure out why the PLS regression coefficients obtained with pls differ from the coefficients obtained with plsRglm, ropls, or plsdepot, which all give the same results.
Here is some code to start with. I have tried playing with the scale, center, and method arguments of the plsr function, but with no success so far.
library(pls)
library(plsRglm)
library(ropls)
library(plsdepot)
data(Cornell)
pls.plsr <- plsr(
Y~X1+X2+X3+X4+X5+X6+X7,
data = Cornell,
ncomp = 3,
scale = TRUE,
center = TRUE
)
plsRglm.plsr <- plsR(
Y~X1+X2+X3+X4+X5+X6+X7,
data = Cornell,
nt = 3,
scaleX = TRUE
)
ropls.plsr <- opls(
as.matrix(Cornell[, grep("X", colnames(Cornell))]),
Cornell[, "Y"],
scaleC = "standard"
)
plsdepot.plsr <- plsreg1(
as.matrix(Cornell[, grep("X", colnames(Cornell))]),
Cornell[, "Y"],
comps = 3
)
## extract PLS regression coefficients for the PLS model with three components
coef(pls.plsr) # a
coef(plsRglm.plsr, type = "original") # b
coef(plsRglm.plsr, type = "scaled") # c
coef(ropls.plsr) # c
plsdepot.plsr$std.coefs # c
plsdepot.plsr$reg.coefs # b
Firstly, just for re-formatting purposes, we write:
library(pls)
library(plsRglm)
library(ropls)
library(plsdepot)
data(Cornell)
pls.plsr <- plsr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7,
data = Cornell,
ncomp = 3, scale = TRUE, center = TRUE)
plsRglm.plsr <- plsR(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7,
data = Cornell,
nt = 3, scaleX = TRUE)
ropls.plsr <- opls(as.matrix(Cornell[, grep("X", colnames(Cornell))]),
Cornell[, "Y"], scaleC = "standard")
plsdepot.plsr <- plsreg1(as.matrix(Cornell[, grep("X", colnames(Cornell))]),
Cornell[, "Y"], comps = 3)
That done, you may extract the coefficients in the original scale:
### ORIGINAL SCALE - plsRglm, plsdepot
coef(plsRglm.plsr, type = "original")
plsdepot.plsr$reg.coefs
Or you can have them scaled:
### SCALED - plsRglm, ropls, plsdepot
coef(plsRglm.plsr, type = "scaled")
coef(ropls.plsr)
plsdepot.plsr$std.coefs
So all methods now give the same coefficients... except for pls::plsr. Why, you may ask? The key is in the output. When you run:
coef(pls.plsr) # , , 3 comps
you see the ", , 3 comps" header. That is characteristic of a three-dimensional array, whereas the coefficients should simply be a vector. The reason is that coef is a generic function, and its method for pls::plsr models is not returning what you might expect here. To see what it is actually extracting:
pls.plsr$coefficients
matrix(pls.plsr$coefficients, ncol = 3) # or in matrix form. coef simply extracts the third column (it should not)
But you can see the same fit for all models if you examine the equivalent object in each R package:
matrix(pls.plsr$projection, ncol = 3)
plsRglm.plsr$wwetoile
plsdepot.plsr$mod.wgs
ropls.plsr@weightStarMN
Therefore, for pls::plsr you were simply not extracting the coefficients.
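If all you want from pls is a plain coefficient vector rather than the three-dimensional array, dropping the extra dimensions is usually enough; a short sketch using standard pls/coef.mvr arguments (please double-check against your pls version):
drop(coef(pls.plsr, ncomp = 3))    # 3-component coefficients as a named vector
drop(coef(pls.plsr, ncomp = 1:3))  # coefficients of the 1-, 2- and 3-component models,
                                   # comparable to matrix(pls.plsr$coefficients, ncol = 3)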
I was wondering if there is a way to avoid calling t.test() three times when comparing the three variables x1, x2, and x3, and instead call t.test() once in a way that handles each pair of variables.
For example, for x1 = rnorm(20); x2 = rnorm(20); x3 = rnorm(20), I'm now using t.test(x1, x2); t.test(x1, x3); t.test(x2, x3), but could I just call t.test() once?
Here is what I tried with no success:
t.test(cbind(x1, x2, x3))
Similar to your question on cor just now, here is the syntax for handling the pairwise calculations:
set.seed(21L)
x1 <- rnorm(20); x2 <- rnorm(20); x3 <- rnorm(20)
pcor <- function(...) {
combn(list(...),
2,
function(y) cor(y[[1]], y[[2]]),
simplify=FALSE)
}
pcor(x1, x2, x3)
pttest <- function(...) {
combn(list(...),
2,
function(a) t.test(x = a[[1]], y = a[[2]]),  # change this to whatever you want
simplify=FALSE)
}
pttest(x1, x2, x3)
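If you only need the p-values rather than the full htest objects, you can pull them out of the returned list; a small follow-on sketch:
res <- pttest(x1, x2, x3)
sapply(res, function(tt) tt$p.value)  # one p-value per pair of variables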
We can use pairwise.t.test
library(dplyr)
library(magrittr)
data(airquality)
airquality %>%
mutate(Month = factor(Month, labels = month.abb[5:9])) %>%
summarise(pval = list(pairwise.t.test(Ozone, Month, p.adj = "bonf")$p.value)) %>%
pull(pval) %>%
extract2(1)
# May Jun Jul Aug
#Jun 1.0000000000 NA NA NA
#Jul 0.0002931151 0.10225483 NA NA
#Aug 0.0001949061 0.08312222 1.000000000 NA
#Sep 1.0000000000 1.00000000 0.006969712 0.004847635
Using the OP's example
pairwise.t.test(c(x1, x2, x3), rep(paste0("x", 1:3), each = 20), p.adj = "bonf")
# Pairwise comparisons using t tests with pooled SD
#data: c(x1, x2, x3) and rep(paste0("x", 1:3), each = 20)
# x1 x2
# x2 0.486 -
# x3 1.000 0.095
data
set.seed(24)
x1 <- rnorm(20)
x2 <- rnorm(20)
x3 <- rnorm(20)
If you want to randomly pick two of the variables, try this:
s = sample(x = c("x1","x2","x3"),size = 2,replace = F)
t.test(eval(parse(text=s[1])),eval(parse(text=s[2])))
When using pairwise t-tests, the alpha level must be adjusted for multiple comparisons. The Bonferroni correction is often used in agriculture, and Holm's correction is sometimes used in medicine. Without such a correction you will find more significant differences than are really there.
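To illustrate, pairwise.t.test lets you pick the correction via its p.adjust.method argument; a small sketch reusing the simulated x1, x2, x3 from above:
pairwise.t.test(c(x1, x2, x3), rep(paste0("x", 1:3), each = 20),
                p.adjust.method = "holm")        # Holm (the default)
pairwise.t.test(c(x1, x2, x3), rep(paste0("x", 1:3), each = 20),
                p.adjust.method = "bonferroni")  # Bonferroni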
I have a data frame in which some of the variables contain the same information:
x1 = runif(1000)
x2 = runif(1000)
x3 = x1 + x2
x4 = runif(1000)
x5 = runif(1000)*0.00000001 +x4
x6 = x5 + x3
x = data.frame(x1, x2, x3, x4, x5, x6)
In a next step I want to drop all variables that are perfectly multicollinear, e.g. columns x3 and x6 (there might also be other combinations).
In Stata this is fairly easy: _rmcoll varlist
How is this efficiently done in R?
EDIT:
Note that the ultimate goal is to compute the Mahalanobis distance between observations. For this I need to drop the redundant variables, and as far as I can foresee, for this application it would not matter whether I drop x1, x2, or x3.
I don't know of a built-in convenience function, but QR decomposition will do it.
We need the data frame to be a matrix:
X <- as.matrix(x)
Use a slightly lower than default tolerance to keep the slightly-non-multicollinear column:
qr.X <- qr(X, tol=1e-9, LAPACK = FALSE)
(rnkX <- qr.X$rank) ## 4 (number of non-collinear columns)
(keep <- qr.X$pivot[seq_len(rnkX)])
## 1 2 4 5
X2 <- X[,keep]
This strictly answers your question; you might also be able to use singular value decomposition (svd()) to implement Mahalanobis distances directly on this type of data ...
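To sketch that last suggestion (one possible approach, not the only one): build a pseudo-inverse of the covariance matrix from its SVD and use it in place of solve(cov(X)), so the distance stays defined even with collinear columns:
Xc   <- scale(X, center = TRUE, scale = FALSE)   # centered data
S    <- cov(X)
sv   <- svd(S)
pos  <- sv$d > max(sv$d) * 1e-9                  # keep the "non-zero" singular values
Sinv <- sv$v[, pos, drop = FALSE] %*% diag(1 / sv$d[pos], sum(pos)) %*% t(sv$u[, pos, drop = FALSE])
d2   <- rowSums((Xc %*% Sinv) * Xc)              # squared Mahalanobis distances
head(d2)
If you prefer a packaged helper for the column-dropping step itself, caret::findLinearCombos() reports which columns it considers linear combinations of the others, though I have not compared its results with the QR approach above.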
For completeness I post the quick-and-dirty solution I was using until now. I actually think it does not perform that badly compared to other methods.
x1 = runif(1000)
x2 = runif(1000)
x3 = x1 + x2
x4 = runif(1000)
x5 = runif(1000)*0.00000001 +x4
x6 = x5 + x3
x = data.frame(x1, x2, x3, x4, x5, x6)
const = rep(1,1000)
a<-lm(const ~ ., data=x)
names(a$coefficients[!is.na(a$coefficients)])[c(-1)]
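To then actually drop the redundant columns with this approach, you would subset the data frame by those names; a small follow-on (the kept set can differ from the QR answer's, since lm uses its own tolerance for near-collinearity):
keep_cols <- names(a$coefficients[!is.na(a$coefficients)])[-1]
x_reduced <- x[, keep_cols]  # data frame without the columns lm flagged as collinear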
I have 6 classes of outcome variable and 14 predictor variables. I built the model below:
fit <- multinom(y ~ X1 + X2 + as.factor(X3) + ... + X14, data= Original)
I want to predict the probability of each outcome class for a given new data point.
X1 <- 1.6
X2 <- 4
x3 <- 15
.
.
.
X14 <- 8
dfin <- data.frame( ses = c(100, 200, 300), X1, X2, X3, ..., X14)
Then I run predict:
predict(fit, todaydata = dfin, type = "probs")
The outcome looks like:
#class1 #class2 #class3 #class4 #class5 #class6
#5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
#5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
#5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
#5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
#5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
#5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
#5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
#5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
#5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
#5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
#5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
#...
Then I change the values of the new data point by running the lines below:
X1 <- 2.7
X2 <- 5.1
x3 <- 28
.
.
.
X14 <- 2
dfin2 <- data.frame( ses = c(100, 200, 300), X1, X2, X3, ..., X14)
predict(fit, todaydata = dfin2, type = "probs")
Again, I get exactly the same probabilities.
#class1 #class2 #class3 #class4 #class5 #class6
#5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
#5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
#5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
#5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
#5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
#5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
#5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
#5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
#5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
#5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
#5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
#...
What am I doing wrong that causes the same output for the two different data frames dfin and dfin2?
My second question is: why do I get so many rows of output for a single data point?
Thanks a lot for your time!
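One thing worth double-checking, hedged since I can't run your model: predict() for nnet::multinom takes the new observations through an argument called newdata, so an argument named todaydata is not matched and is silently swallowed by .... In that case predict() falls back to the fitted values for the original training data, which would explain both the identical probabilities and the many rows (one per training observation, hence the 5541, 5546, ... row names). A minimal sketch of the intended call, reusing your data frames and keeping in mind that the column names must match the formula exactly (X3, not x3):
predict(fit, newdata = dfin, type = "probs")
predict(fit, newdata = dfin2, type = "probs")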