Extract co-linear column names in R

Based on the answer to this question and its script, how can I print the co-linear column names to the console?
Script:
library(corrplot)
library(caret)
x <- seq(0, 100, 1)
# colinear with x
y <- x + 2.3
# almost colinear with x / some small gaussian noise
z <- x + rnorm(mean = 0, sd = 5, n = 101)
# uncorrelated gaussian
w <- rnorm(mean = 0, sd = 1, n = 101)
# combination of z, an increasing trend, and some gaussian noise
a <- z + seq(101, 201, 1)/.33 + rnorm(mean = 0, sd = 5, n = 101)
b <- a - 2.3 # perfectly collinear with a
# this frame is made to exemplify the procedure
df <- data.frame(x = x, y = y, z = z, w = w, a=a, b=b)
corrplot(cor(df))
#drop perfectly multicollinear variables
constant <- rep(1, nrow(df))
tmp <- lm(constant ~ ., data = df)
to_keep <- tmp$coefficients[!is.na(tmp$coefficients)]
to_keep <- names(to_keep[-which(names(to_keep) == "(Intercept)")])
df_result <- df[to_keep]
corrplot(cor(df_result))

You want the variables not included in to_keep. Based on how to_keep is defined, you can write to_drop <- tmp$coefficients[is.na(tmp$coefficients)] to get the coefficients with NA values (meaning there is no estimate for the corresponding variable because it is collinear with the others). Then, to print the names of those variables, you can simply do print(names(to_drop)), as sketched below.
However, keep in mind that: 1. this is a somewhat hacky way of dropping only perfectly collinear variables, and 2. which variables out of a set of perfectly collinear ones get dropped is rather arbitrary (it depends on the order of the variables in your data).
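A minimal sketch of the two lines described above, reusing the tmp model fitted in the script:
to_drop <- tmp$coefficients[is.na(tmp$coefficients)]
print(names(to_drop)) # names of the variables dropped as perfectly collinear (their coefficients are NA)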

Related

R: How to access a 'complicated list'

I am working on an assignment that asks me to generate a list of datasets, using the code below.
## Use the make_data function to generate 25 different datasets, with mu_1 being a vector
library(caret) # for createDataPartition
library(dplyr) # for %>% and slice
x <- seq(0, 3, len = 25)
make_data <- function(a){
  n = 1000
  p = 0.5
  mu_0 = 0
  mu_1 = a
  sigma_0 = 1
  sigma_1 = 1
  y <- rbinom(n, 1, p)
  f_0 <- rnorm(n, mu_0, sigma_0)
  f_1 <- rnorm(n, mu_1, sigma_1)
  x <- ifelse(y == 1, f_1, f_0)
  test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
  list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
       test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}
dat <- sapply(x, make_data)
The code runs fine, and 'dat' appears to be a table with 2 rows and 25 columns, each cell holding its own data frame.
Now, each data frame within a cell has 2 columns.
And this is where I get stuck.
While I can get to the data frame in row 1, column 1, just fine (i.e. just use dat[1,1]), I can't reach the column of 'x' values within dat[1,1]. I've experimented with
dat[1,1]$x
dat[1,1][1]
But these only give strange results: an error or NULL.
Any idea how I can pull the column? Thanks.
dat[1, 1] is a list.
class(dat[1, 1])
#[1] "list"
So to reach x you can do
dat[1, 1]$train$x
Or
dat[1, 1][[1]]$x
As a side note, instead of having this 2 x 25 matrix as output in dat, I would actually prefer a nested list.
dat <- lapply(x, make_data)
# Access the `x` column of the `train` dataset in the first list element.
dat[[1]]$train$x
However, this is quite subjective and you can choose whichever format you like best.
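For completeness, a quick way to see what each element of that nested list contains (this assumes the lapply version just above):
str(dat[[1]], max.level = 2) # each element holds a train and a test data frame
dat[[1]]$test$x              # the test set's x column is reached the same way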

R - How to apply a function with all the possible values for a specified argument?

I would like to apply the kmeans function to a dataset.
I run it several times, increasing the number of centers each time. For each run I store the total within sum of squares in a vector, and then I plot it against the number of clusters, like so:
# Dummy data
cluster1_x <- rnorm(1000, mean = 3.5, sd = .75)
cluster1_y <- rnorm(1000, mean = 4, sd = 1.13)
cluster1 <- cbind(cluster1_x, cluster1_y)
cluster2_x <- rnorm(1000, mean = 5.2, sd = .75)
cluster2_y <- rnorm(1000, mean = .9, sd = .64)
cluster2 <- cbind(cluster2_x, cluster2_y)
cluster3_x <- rnorm(1000, mean = .68, sd = .86)
cluster3_y <- rnorm(1000, mean = 0.8, sd = 1)
cluster3 <- cbind(cluster3_x, cluster3_y)
df <- rbind(cluster1, cluster2, cluster3)
# To see the dummy clusters
# plot(df, pch = 20)
# Applying kmeans
# Vector that will be filled with the variance in the clusters
tot.within.sum.square <- rep(NA, 20)
for (nb_center in 1:20){
  tps_start <- Sys.time()
  set.seed(13)
  res.kmeans <- kmeans(df, centers = nb_center, iter.max = 30)
  tot.within.sum.square[nb_center] <- res.kmeans$tot.withinss
  tps_exec <- Sys.time() - tps_start
  print(paste0("Iteration ", nb_center, " : ", tps_exec))
}
plot(1:20, tot.within.sum.square, type = 'b', pch=20)
I would like to repeat this process 4 times, each time using a different algorithm ("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), so I want to end up with 4 different vectors of length 20, one per algorithm. Each element of a given vector is the value contained in res.kmeans$tot.withinss; for example, the 4th element of a vector is the total within sum of squares from a kmeans run with 4 centers. I could copy and paste the previous code, but I am looking for a more elegant way to achieve this.
I can somewhat get what I want using this:
sapply(algos, function(x) {
  sapply(nb_centers, function(y) kmeans(df, centers = y, algorithm = x))
})
but I am not able to store the tot.withinss value from each iteration of each algorithm in a variable.
Any help will be appreciated!
As mentioned in the comments by @Parfait,
tot.withinss <- sapply(algos, function(x) {
  sapply(nb_centers, function(y) kmeans(df, centers = y, algorithm = x)$tot.withinss)
})
will do the trick!
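For reference, a self-contained sketch of that call; here algos and nb_centers (which the snippet above assumes are already defined) are set explicitly to the values described in the question, and the four elbow curves are plotted with matplot:
algos <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
nb_centers <- 1:20
set.seed(13)
tot.withinss <- sapply(algos, function(x) {
  sapply(nb_centers, function(y) kmeans(df, centers = y, algorithm = x, iter.max = 30)$tot.withinss)
})
# tot.withinss is a 20 x 4 matrix: one column of total within sum of squares per algorithm
matplot(nb_centers, tot.withinss, type = "b", pch = 20,
        xlab = "number of centers", ylab = "total within sum of squares")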

How to extract the coefficients of various n-th degree polynomial models and store them into a 1D array?

I have a collection of sixth-degree polynomial regression models from which I want to gather only the coefficients.
I have a large dataset that contains 3 columns: the first is an arbitrary parameter that acts as a flag, the second is the input, and the third is the output.
I subsetted my dataset according to my parameters, so I have 10 smaller datasets. My models arose from these subsets.
As an example:
#-----"Dummy" Dataset-----
a = 1:100 # acts as input
b = a + rnorm(n = 100, mean = 0, sd = 20) # acts as output
df = as.data.frame(cbind(a, b))
colnames(df) = c("input", "output")
#
#-----Subsets-----
# Example of subsetting. Here I use the input values themselves as the threshold
# just for simplicity; in reality, I use the first column of my dataframe (parameter).
df_1_XlessThen50 = subset(df, input < 50)
# Second subset. In other words, for every parameter I divide that subset into two smaller ones.
df_2_XmoreThen50 = subset(df, input >= 50)
#
#-----Models-----
model_3_ab.1 = lm(output ~ poly(input, 6, raw = T), data = df_1_XlessThen50)
model_3_ab.2 = lm(output ~ poly(input, 6, raw = T), data = df_2_XmoreThen50)
My models' names follow a pattern: "model_" + parameter + "_ab." + id number.
I should clarify that the "id number" indicates which of the two models for every parameter I will consider. (These smaller datasets within every parameter are the result of subsetting according to a pre-determined threshold.)
What I have now is a collection of models like these two above for every parameter in my dataset. I have 10 parameters, hence, 20 models.
I want to gather only the coefficients of every model and store them into a matrix or dataframe. To achieve that, I tried:
parameter = c(2, 4, 6, 7, 9, 11, 33, 35, 37, 50)
myData = array()
for (i in parameter){   # loop over all parameters
  for (j in 1:2){       # loop over the pair of models for each parameter
    for (k in 1:6){     # loop over the model's coefficients
      aux = paste0("model_", i, "_ab.", j)
      aux = get(aux)
      myData[i, j, k] = aux$coefficients[k]
    }
  }
}
However, I keep getting the same error:
Error in myData[i, j, k] = aux$coefficients[k] :
incorrect number of subscripts
With this error, I can't advance into my goal, which is to write a .txt with one single column formatted as such:
A(2,1,1) = first order coefficient for the first model related to parameter 2
B(2,2,1) = second order coefficient for the first model related to parameter 2
C(2,3,1)
...
G(2,7,1)
A(2,1,2)
where in (M, N, O): M is the parameter, N is the coefficient of the N-th degree (N = 7 is the intercept), and O is either 1 or 2, respectively the first or second model in each pair of models for every parameter.
It'd be nice to get help/guidance for the whole problem, but I'll already be grateful if I can get past the part where I want to store my coefficients in a matrix using for-loops. Thanks
Here is what I mean:
set.seed(42)
a1 = 1:100 # act as input
a2 <- runif(100)
b = a1 + a2 + rnorm(n = 100, mean = 0, sd = 20) # act as output
df = data.frame(input1 = a1,
                input2 = a2,
                output = b)
df$flag <- a1 <= 50
library(reshape2)
df <- melt(df, id.vars = c("output", "flag"))
library(lme4)
df$flag_par <- interaction(df$flag, df$variable)
fits <- lmList(output ~ poly(value, 2, raw = TRUE) | flag_par, data = df)
coef(fits)
# (Intercept) poly(value, 2, raw = TRUE)1 poly(value, 2, raw = TRUE)2
#FALSE.input1 125.957730 -2.434849 0.022137337
#TRUE.input1 2.842223 1.216113 -0.006686362
#FALSE.input2 68.807752 -7.429319 26.486493218
#TRUE.input2 31.791633 -18.595105 16.608600876
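As a side note on the loop error in the question: myData <- array() creates a length-one object, so assigning to it with three subscripts fails. A minimal sketch of pre-allocating the array instead (this assumes model objects named following the question's pattern, e.g. model_2_ab.1, already exist in the workspace):
parameter <- c(2, 4, 6, 7, 9, 11, 33, 35, 37, 50)
# dimensions: parameter index, model id (1 or 2), coefficient (intercept plus 6 powers)
myData <- array(NA, dim = c(length(parameter), 2, 7))
for (i in seq_along(parameter)) {
  for (j in 1:2) {
    aux <- get(paste0("model_", parameter[i], "_ab.", j))
    myData[i, j, ] <- coef(aux)
  }
}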

lm models over all possible pairwise combinations of the columns of two matrices

I am working through a problem at the moment in R and have got stuck. I have searched around on various help lists for assistance but could not find anything - but apologies if I have missed something. A dummy example of my problem is below. I will continue to work on it, but any help would be greatly appreciated.
Thanks in advance for your time.
I have a matrix of response variables:
p <- matrix(c(rnorm(120, 1),
              rnorm(120, 1),
              rnorm(120, 1)),
            120, 3)
and two matrices of covariates:
g <- matrix(c(rep(1:3, each = 40),
              rep(3:1, each = 40),
              rep(1:3, 40)),
            120, 3)
m <- matrix(c(rep(1:2, 60),
              rep(2:1, 60),
              rep(1:2, each = 60)),
            120, 3)
For all combinations of the columns of the covariate matrices g and m I want to run these two models:
test <- function(uniq_m, uniq_g, p = p) {
  full <- lm(p ~ factor(uniq_m) * factor(uniq_g))
  null <- lm(p ~ factor(uniq_m) + factor(uniq_g))
  return(list('f' = full, 'n' = null))
}
So I want to test for an interaction between column 1 of m and column 1 of g, then column 2 of m and column 1 of g, then column 2 of m and column 2 of g...and so forth across all possible pairwise interactions. The response variable is the same each time and is a matrix containing multiple columns.
So far, I can do this for a single combination of columns:
test_1 <- test(m[ ,1], g[ ,1], p)
And I can also run the model over all columns of m and one column of g:
test_2 <- apply(m, 2, function(uniq_m) {
  test(uniq_m, g[, 1], p = p)
})
I can then get the F statistics for each response variable of each model:
sapply(summary(test_2[[1]]$f), function(x) x$fstatistic)
sapply(summary(test_2[[1]]$n), function(x) x$fstatistic)
And I can compare models for each response variable using an F-test:
d1 <- colSums(matrix(residuals(test_2[[1]]$n), nrow(g), ncol(p))^2)
d2 <- colSums(matrix(residuals(test_2[[1]]$f), nrow(g), ncol(p))^2)
F <- (d1 - d2) / (d2 / 114)
My question is how do I run the lm models over all combinations of columns from the m and the g matrix, and get the F-statistics?
While this is a dummy example, the real analysis will have a response matrix that is 700 x 8000, and the covariate matrices will be 700 x 4000 and 700 x 100 so I need something that is as fast as possible.
Hopefully this helps; this is some code a friend of mine shared with me. It may not be exactly what you need, but it might set you off in the right direction (though given this comes 9 months after you asked, it may be of no use to you specifically!):
#### this first function models the correlation and sets the text size based on the strength of the correlation
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
##### this function places a histogram of your data on the diagonal
panel.hist <- function(x, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
### read in Fisher's famous iris dataset for our example
data(iris)
head(iris)
library(corrgram)
## corrgram also gives you some nice panel options to use in pairs, but you don't necessarily need them
## e.g. panel.ellipse, panel.pie, panel.conf
library(asbio)
## asbio offers more panel options, such as a linear regression (panel.lm), etc.
### run pairs() on your data
### set the upper panel to panel.cor (the function we just wrote), and the diagonal to panel.hist
### do what you like for the lower panel; adding a smoother line isn't very informative
pairs(~ Sepal.Length + Sepal.Width + Petal.Length, data = iris,
      lower.panel = panel.lm, upper.panel = panel.cor,
      diag.panel = panel.hist, main = "pair plots of variables")
All credit to James Keating.
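For the combinatorial part of the question itself, here is one hedged sketch that simply nests apply-style loops over the columns of both covariate matrices, reusing the test() function and the F-statistic computation from the question (with matrices of the sizes mentioned, expect this brute-force approach to be slow):
F_stats <- lapply(seq_len(ncol(g)), function(i) {
  lapply(seq_len(ncol(m)), function(j) {
    fit <- test(m[, j], g[, i], p = p)
    d1  <- colSums(matrix(residuals(fit$n), nrow(g), ncol(p))^2)
    d2  <- colSums(matrix(residuals(fit$f), nrow(g), ncol(p))^2)
    (d1 - d2) / (d2 / 114) # one F value per response column, as in the question
  })
})
# F_stats[[i]][[j]] holds the F statistics for column i of g crossed with column j of m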

p-value matrix of x and y variables from anova output

I have many X and Y variables (something like 500 x 500). The following is just a small example:
yvars <- data.frame(Yv1 = rnorm(100, 5, 3), Y2 = rnorm(100, 6, 4),
                    Yv3 = rnorm(100, 14, 3))
xvars <- data.frame(Xv1 = sample(c(1, 0, -1), 100, replace = T),
                    X2 = sample(c(1, 0, -1), 100, replace = T),
                    Xv3 = sample(c(1, 0, -1), 100, replace = T),
                    D = sample(c(1, 0, -1), 100, replace = T))
I want to extract p-values and make a matrix like this:
Yv1 Y2 Yv3
Xv1
X2
Xv3
D
Here is my attempt to loop the process:
prob = NULL
anova.pmat <- function(x) {
  mydata <- data.frame(yvar = yvars[, x], xvars)
  for (i in seq(length(xvars))) {
    prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
                          data = mydata))$`Pr(>F)`[1]
  }
}
sapply(yvars, anova.pmat)
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
What could be the solution?
Edit:
For the first Y variable:
prob <- NULL
mydata <- data.frame(yvar = yvars[, 1], xvars)
for (i in seq(length(xvars))) {
  prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
                        data = mydata))$`Pr(>F)`[1]
}
prob
[1] 0.4995179 0.4067040 0.4181571 0.6291167
Edit again:
for (j in seq(length(yvars))){
  prob <- NULL
  mydata <- data.frame(yvar = yvars[, j], xvars)
  for (i in seq(length(xvars))) {
    prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
                          data = mydata))$`Pr(>F)`[1]
  }
}
This gives the same result as above!
Here is an approach that uses plyr to loop over the columns of a dataframe (treating it as a list) for each of the xvars and yvars, returning the appropriate p-value and arranging the results into a matrix. Adding the row/column names is just a finishing touch.
library("plyr")
probs <- laply(xvars, function(x) {
laply(yvars, function(y) {
anova(lm(y~x))$`Pr(>F)`[1]
})
})
rownames(probs) <- names(xvars)
colnames(probs) <- names(yvars)
Here is one solution, which consists of generating all combinations of Y- and X-variables to test (we cannot use combn here) and running a linear model in each case:
dfrm <- data.frame(y = gl(ncol(yvars), ncol(xvars), labels = names(yvars)),
                   x = gl(ncol(xvars), 1, labels = names(xvars)), pval = NA)
## little helper function to create the formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse = "~"))
## merge both datasets
full.df <- cbind.data.frame(yvars, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[, 1:2], 1,
                   function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, ncol = 3, dimnames = list(levels(dfrm$x), levels(dfrm$y)))
Side note: with high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is easier to compute the matrix of Pearson linear correlations for all pairwise comparisons and transform each r statistic into a Fisher-Snedecor F using the relation F = nu_a * r^2 / (1 - r^2), where the degrees of freedom are nu_a = (n - 2) - #{i : x_i = NA or y_i = NA} (that is, (n - 2) minus the number of pairwise missing values; if there are no missing values, this is the usual F test based on R^2 in simple regression).
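A minimal sketch of that correlation-based shortcut, assuming no missing values (so the degrees of freedom are simply n - 2); since the x columns are treated as numeric, the p-values should match the anova(lm(...)) results above (the object names n, r, Fstat and pmat are just illustrative):
n <- nrow(yvars)
r <- cor(as.matrix(xvars), as.matrix(yvars))  # 4 x 3 matrix of Pearson correlations
Fstat <- (n - 2) * r^2 / (1 - r^2)            # F statistic with 1 and n - 2 degrees of freedom
pmat <- pf(Fstat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
dim(pmat) <- dim(Fstat)                       # keep the x-by-y matrix shape
dimnames(pmat) <- dimnames(Fstat)             # x variables in rows, y variables in columns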
