I have a data frame with variables, some of which contain the same information:
x1 = runif(1000)
x2 = runif(1000)
x3 = x1 + x2
x4 = runif(1000)
x5 = runif(1000) * 0.00000001 + x4
x6 = x5 + x3
x = data.frame(x1, x2, x3, x4, x5, x6)
In a next step I want to get rid of all variables that are perfectly multicollinear, e.g. columns x3 and x6 (there might also be other combinations).
In Stata this is fairly easy: _rmcoll varlist
How is this efficiently done in R?
EDIT:
Note that the ultimate goal is to compute the Mahalanobis distance between observations. For this I need to drop redundant variables. And as far as I can foresee, for this application it would not matter whether I drop x1, x2 or x3.
I don't know of a built-in convenience function, but QR decomposition will do it.
We need the data frame to be a matrix:
X <- as.matrix(x)
Use a tolerance slightly lower than the default so that the nearly-but-not-exactly collinear column is kept:
qr.X <- qr(X, tol=1e-9, LAPACK = FALSE)
(rnkX <- qr.X$rank) ## 4 (number of non-collinear columns)
(keep <- qr.X$pivot[seq_len(rnkX)])
## 1 2 4 5
X2 <- X[,keep]
This strictly answers your question; you might also be able to use singular value decomposition (svd()) to implement Mahalanobis distances directly on this type of data ...
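For example, a minimal sketch of that idea, assuming a pseudoinverse-based generalized distance is acceptable: build the pseudoinverse of the (singular) covariance matrix from the SVD and use it in place of solve() when computing the squared distances.
Xc <- scale(X, center = TRUE, scale = FALSE)        # centred data matrix
S  <- cov(Xc)                                       # (singular) covariance matrix
sv <- svd(S)
pos <- sv$d > max(sv$d) * 1e-9                      # keep only non-negligible singular values
S_pinv <- sv$v[, pos, drop = FALSE] %*%
  diag(1 / sv$d[pos], nrow = sum(pos)) %*%
  t(sv$u[, pos, drop = FALSE])                      # Moore-Penrose pseudoinverse of S
d2 <- rowSums((Xc %*% S_pinv) * Xc)                 # squared generalized distances
This gives finite distances even though cov(X) itself is not invertible.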
For completeness I post the quick-and-dirty solution I was using until now. I actually think it does not perform that badly compared to other methods.
x1 = runif(1000)
x2 = runif(1000)
x3 = x1 + x2
x4 = runif(1000)
x5 = runif(1000) * 0.00000001 + x4
x6 = x5 + x3
x = data.frame(x1, x2, x3, x4, x5, x6)
const = rep(1,1000)
a <- lm(const ~ ., data = x)
names(a$coefficients[!is.na(a$coefficients)])[c(-1)]
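To then actually drop the redundant columns, one could subset to the variables whose coefficients are not NA (a small follow-up sketch):
keep_names <- names(a$coefficients[!is.na(a$coefficients)])[-1]  # drop "(Intercept)"
x_reduced  <- x[, keep_names]                                     # the linearly independent columns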
Related
I have run several optimal variable/model selection methods from machine/statistical learning on the same folder of 58,000 randomly generated synthetic datasets (all CSV-formatted and all of the same size) in order to compare which method correctly selects the true underlying model for each dataset the most often. All of the scripts and many of the datasets can be found in my GitHub Repository for this research project.
I have already obtained the output/results I need. Each dataset's file name follows the pattern n1-n2-n3-n4, where n1 runs from 0 to 1, n2 from 3 to 15, n3 from 1 to 9, and n4 from 1 to 500. The data frame/list with the results looks like the following:
> str(BM1_models)
'data.frame': 58000 obs. of 1 variable:
$ V1: chr "0-3-1-1; X1, X2, X3" "0-3-1-2; X1, X2, X3" "0-3-1-3; X1, X2, X3" "0-3-1-4; X1, X2, X3" ...
> head(BM1_models, n = 4)
V1
1 0-3-1-1; X1, X2, X3
2 0-3-1-2; X1, X2, X3
3 0-3-1-3; X1, X2, X3
4 0-3-1-4; X1, X2, X3
> tail(BM1_models, n = 4)
V1
57997 1-15-9-497; X2, X3, X4, X9, X10, X11, X13, X14
57998 1-15-9-498; X2, X3, X5, X6, X8, X9, X10, X11, X12, X15
57999 1-15-9-499; X3, X4, X5, X6, X8, X10, X11, X12, X15
58000 1-15-9-500; X2, X4, X6, X7, X8, X10, X11
How to tell whether the ML variable/factor selection method (in this case LASSO) got a given dataset right: if n2 for that dataset is 3, the independent variables selected should be X1, X2, X3; if it is 4, the underlying structural model is X1, X2, X3, X4; and so on up to 15 (I'll explain what n1, n3, and n4 signify in a p.s. section at the bottom). So I need to write something like a count function, inside a fairly complex conditional, all wrapped in an lapply call, but I don't know exactly how.
p.s. Part1
The datasets & scripts are also available in this GitHub repository of mine which should be far easier to navigate than the first one I linked to.
p.s. Part2
To clarify, if n2 = 5 for a given dataset and the model chosen was X1, X2, X4, X5 (known as an omitted-variable model) or X1, X2, X3, X4, X5, X8, X9, etc. (known as an extraneous-variable model), it is not correct. Only a model which includes all of the variables X1 through Xn2 should be counted; every other result should not be.
p.s. Part3
n1 indicates the amount of multicollinearity between factors in the true (underlying) structural regression equation, n3 indicates the error variance, and n4 indicates which of the 500 random variations out of all possible randomly generated datasets for each set of the other 3 parameters it is (this is a Monte Carlo Simulation).
If I get this right, the idea is to check if the second part of a string of the form 'X1, X2, ..., Xn' equals what should be expected based on the first part of that same string. I think the easiest way is to write a function that makes the comparison for any single string, then sapply it over the string vector:
# testing df, only first (good) and last (bad) entry
df = data.frame(V1 = c('0-3-1-1; X1, X2, X3', '1-15-9-500; X2, X4, X6, X7, X8, X10, X11'))
good_model <- function(str) {
  str  <- unlist(strsplit(str, '; '))   # split into "n1-n2-n3-n4" and the predictor list
  desc <- str[1]                        # dataset name, e.g. "0-3-1-1"
  pred <- str[2]                        # selected predictors, e.g. "X1, X2, X3"
  n_2  <- unlist(strsplit(desc, '-'))[2]                    # the n2 field
  expt <- paste0('X', 1:as.integer(n_2), collapse = ', ')   # expected "X1, X2, ..., Xn2"
  identical(pred, expt)
}
df$good = sapply(df$V1, good_model)
df
# V1 good
# 0-3-1-1; X1, X2, X3 TRUE
# 1-15-9-500; X2, X4, X6, X7, X8, X10, X11 FALSE
Note: I assumed the character after ; in the original string was a space, if it is a <tab> then the first call to strsplit() should be updated.
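To get the counts you mentioned, a short follow-up sketch; extracting n2 as the second '-'-separated field of the name is an assumption on my part:
df$n2 <- sapply(strsplit(as.character(df$V1), '-'), `[`, 2)  # second field of the dataset name
sum(df$good)           # total number of correctly selected models
table(df$n2, df$good)  # correct vs incorrect selections, broken down by n2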
I'm trying to regress returns against FF 3-factors with a rolling window.
To do so, I have found the function roll_lm in R, but the function is only producing regression output for one of the 3 variables.
The code is described here:
Y <- as.matrix(Portfolio_returns[,2])
X1 <- as.matrix(Mydata[,2])
X2 <- as.matrix(Mydata[,3])
X3 <- as.matrix(Mydata[,4])
Five_years_Rolling_reg <- roll_lm(X1 + X2 + X3,Y,60)
When I apply the coef function, I only get output for X1 and not X2 nor X3.
What am I doing wrong?
Your problem seems to be a basic misunderstanding of how the function works. Looking at ?roll_lm:
Arguments
x
matrix or xts object. Rows are observations and columns are the independent variables.
Currently you seem to be trying to use a formula-style input (X1 + X2 + X3), which is not what the help page describes. As a result the three columns are literally added together element-wise, as if x1 = 2 and x2 = 3 gave a single value x1 + x2 = 5.
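A tiny illustration with made-up numbers of what that addition does when X1, X2, and X3 are matrices rather than formula terms:
X1 <- matrix(1:3); X2 <- matrix(4:6); X3 <- matrix(7:9)
X1 + X2 + X3   # one n x 1 matrix of element-wise sums, so roll_lm sees a single regressor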
Instead you should bind the columns together into a single matrix.
Y <- as.matrix(Portfolio_returns[,2])
X <- as.matrix(Mydata[, 2:4])
roll_lm(X, Y, 60)
Alternatively, use the model.frame, model.response, and model.matrix functions from base R, which give you the familiar formula interface.
names(Mydata)[1:4] <- c("Y", "X1", "X2", "X3")
frame <- model.frame(Y ~ X1 + X2 + X3, data = Mydata)
X <- model.matrix(Y ~ X1 + X2 + X3, data = Mydata)
roll_lm(X, model.response(frame), 60)
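For reference, a minimal self-contained sketch with simulated data (the names and sizes are made up, and it assumes roll_lm() from the roll package returns a list with a coefficients component):
library(roll)
set.seed(1)
X <- cbind(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))   # regressors as columns of one matrix
Y <- 0.5 * X[, 1] - 0.2 * X[, 2] + 0.1 * X[, 3] + rnorm(200)
fit <- roll_lm(X, matrix(Y), width = 60)
tail(fit$coefficients)   # one column per regressor, plus the intercept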
library(rqPen)
n <- 60
p <- 7
rho <- .5
beta <- c(3,1.5,0,2,0,0,0)
R <- matrix(0,p,p)
for (i in 1:p) {
  for (j in 1:p) {
    R[i, j] <- rho^abs(i - j)
  }
}
set.seed(1234)
x <- matrix(rnorm(n*p),n,p) %*% t(chol(R))
y <- x %*% beta + rnorm(n)
q.lasso_scad = cv.rq.pen(x, y, tau = 0.5, lambda = NULL, penalty = "SCAD",
                         intercept = FALSE, criteria = "CV", cvFunc = "check",
                         nfolds = 10, foldid = NULL, nlambda = 100, eps = 1e-04,
                         init.lambda = 1, alg = "QICD")
q.lasso_scad
coef1 = q.lasso_scad$models[[which.min(q.lasso_scad$cv[,2])]]
coef1
I have the following output
Coefficients:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
0.0000000 0.3226967 1.8131688 -0.1971847 0.1981571 0.7715635 -0.2289284 -0.1087028 0.9713283 -0.1079333
I want to extract the coefficients only. How can I do that?
Thank you in advance.
It's a bit backwards but you can do:
as.data.frame(as.list.data.frame(coef1)$coefficients)
Result:
as.list.data.frame(coef1)$coefficients
x1 3.17487201
x2 1.15712559
x3 0.05078333
x4 2.27113756
x5 0.24893740
x6 0.00000000
x7 -0.07542964
If I understand the issue correctly, the output from rqPen is some sort of a fancy list with additional attributes. as.list.data.frame basically forces coef1 to be a "normal" list, which allows me to use $coefficients to extract the coefficients values. Lastly, I use as.data.frame to convert it into a more usable object.
If you just want the values, you can replace as.data.frame with as.vector:
as.vector(as.list.data.frame(coef1)$coefficients)
Result:
[1] 3.17487201 1.15712559 0.05078333 2.27113756 0.24893740 0.00000000
[7] -0.07542964
I don't have access to R at the moment, so I can't verify that it will work, but try this:
names(coef1) <- NULL
coef1
Using a dataset I built a model as below:
fit <- lm(y ~ as.numeric(X1) + as.factor(x2) + log(1 + x3) + as.numeric(X4) , dataset)
Then I build new data:
X1 <- 1
X2 <- 10
X3 <- 15
X4 <- 0.5
new <- data.frame(X1, X2, X3, X4)
predict(fit, new , se.fit=TRUE)
Then I get the error below:
Error in data.frame(state_today, daily_creat, last1yr_min_hosp_icu_MDRD, :
object 'X2' is not found
What am I doing wrong? Is this because of logarithm in the model?
A great way of looking at your problem from another angle is to construct a self-contained reproducible example, with no copy/pasting. This often gives you a fresh perspective and often teases out the weirdest bugs imaginable.
As flodel and Ben have pointed out, your problem is probably due to a bad choice of variable names. I'm guessing you're using RStudio, which in my opinion uses a terrible default font for exactly this reason: I can't (easily) tell x and X apart.
Here is something similar to what you're trying to do, with all variable names correctly (un)capitalized.
xy <- data.frame(y = runif(20), x1 = runif(20), x2 = sample(1:5, 20, replace = TRUE), x3 = runif(20))
fit <- lm(y ~ as.numeric(x1) + as.factor(x2) + log(1+x3), data = xy)
predict(fit, newdata = data.frame(x1 = 1, x2 = as.factor(3), x3 = 15))
1
0.05015187
I have 6 classes of outcome variable and 14 predictor variables. I built the model below:
fit <- multinom(y ~ X1 + X2 + as.factor(X3) + ... + X14, data= Original)
And I want to predict probabilities of each class of outcome for a given new data point.
X1 <- 1.6
X2 <- 4
x3 <- 15
.
.
.
X14 <- 8
dfin <- data.frame( ses = c(100, 200, 300), X1, X2, X3, ..., X14)
Then I run predict:
predict(fit, todaydata = dfin, type = "probs")
The outcome looks like:
#class1 #class2 #class3 #class4 #class5 #class6
#5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
#5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
#5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
#5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
#5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
#5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
#5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
#5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
#5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
#5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
#5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
#...
Then I change the values of the new data point by running the lines below:
X1 <- 2.7
X2 <- 5.1
x3 <- 28
.
.
.
X14 <- 2
dfin2 <- data.frame( ses = c(100, 200, 300), X1, X2, X3, ..., X14)
predict(fit, todaydata = dfin2, type = "probs")
Again, I get exactly the same probabilities:
#class1 #class2 #class3 #class4 #class5 #class6
#5541 7.226948e-01 1.498199e-01 8.086624e-02 1.253289e-02 8.799416e-03 2.528670e-02
#5546 6.034188e-01 7.386553e-02 1.908132e-01 1.229962e-01 4.716406e-04 8.434623e-03
#5548 7.266859e-01 1.278779e-01 1.001634e-01 2.032530e-02 7.156766e-03 1.779076e-02
#5562 7.120179e-01 1.471181e-01 9.146071e-02 1.265592e-02 8.189511e-03 2.855781e-02
#5666 6.645056e-01 3.034978e-02 1.687687e-01 1.219601e-01 3.972833e-03 1.044308e-02
#5668 4.875966e-01 3.126855e-02 2.090006e-01 2.430828e-01 3.721631e-03 2.532970e-02
#5670 3.900772e-01 1.305786e-02 1.803779e-01 4.137106e-01 1.314298e-03 1.462155e-03
#5671 4.272971e-01 1.194599e-02 1.748494e-01 3.833422e-01 8.863019e-04 1.678975e-03
#5674 5.477521e-01 2.587478e-02 1.650817e-01 2.487404e-01 3.368726e-03 9.182195e-03
#5677 4.300207e-01 9.532836e-03 1.608679e-01 3.946310e-01 2.626104e-03 2.321351e-03
#5678 4.542981e-01 1.220728e-02 1.410984e-01 3.885146e-01 2.670689e-03 1.210891e-03
#...
What am I doing wrong that causes the same outcome for the two different data frames dfin and dfin2?
My second question is: why do I get so many rows of output for a single data point?
Thanks a lot for your time!