When using formulas that have factors, the fitted models name the coefficients XY, where X is the name of the factor and Y is a particular level of it. I want to be able to create a formula from the names of these coefficients.
The reason: If I fit a lasso to a sparse design matrix (as I do below) I would like to create a new formula object that only contains terms for the nonzero coefficients.
require("MatrixModels")
require("glmnet")
set.seed(1)
n <- 200
Z <- data.frame(letter=factor(sample(letters,n,replace=T),letters),
x=sample(1:20,200,replace=T))
f <- ~ letter + x:letter + I(x>5):letter
X <- sparse.model.matrix(f, Z)
beta <- matrix(rnorm(dim(X)[2],0,5),dim(X)[2],1)
y <- X %*% beta + rnorm(n)
myfit <- glmnet(X,as.vector(y),lambda=.05)
fnew <- rownames(myfit$beta)[which(myfit$beta != 0)]
[1] "letterb" "letterc" "lettere"
[4] "letterf" "letterg" "letterh"
[7] "letterj" "letterm" "lettern"
[10] "lettero" "letterp" "letterr"
[13] "letters" "lettert" "letteru"
[16] "letterw" "lettery" "letterz"
[19] "lettera:x" "letterb:x" "letterc:x"
[22] "letterd:x" "lettere:x" "letterf:x"
[25] "letterg:x" "letterh:x" "letteri:x"
[28] "letterj:x" "letterk:x" "letterl:x"
[31] "letterm:x" "lettern:x" "lettero:x"
[34] "letterp:x" "letterq:x" "letterr:x"
[37] "letters:x" "lettert:x" "letteru:x"
[40] "letterv:x" "letterw:x" "letterx:x"
[43] "lettery:x" "letterz:x" "letterb:I(x > 5)TRUE"
[46] "letterc:I(x > 5)TRUE" "letterd:I(x > 5)TRUE" "lettere:I(x > 5)TRUE"
[49] "letteri:I(x > 5)TRUE" "letterj:I(x > 5)TRUE" "letterl:I(x > 5)TRUE"
[52] "letterm:I(x > 5)TRUE" "letterp:I(x > 5)TRUE" "letterq:I(x > 5)TRUE"
[55] "letterr:I(x > 5)TRUE" "letteru:I(x > 5)TRUE" "letterv:I(x > 5)TRUE"
[58] "letterx:I(x > 5)TRUE" "lettery:I(x > 5)TRUE" "letterz:I(x > 5)TRUE"
From this I would like to have a formula
~ I(letter=="d") + I(letter=="e") + ...(etc)
I checked out formula() and all.vars() to no avail. Also, writing a function to parse this is a bit of a pain because of the different types of terms that can arise. For example, for x:letter when x is a numeric value and letter is a factor, or I(x>5):letter as another annoying case.
So am I not aware of some function to convert between formula and its character representation and back again?
When I ran the code, I got something a bit different, since set.seed() had not been specified. Instead of using the variable name "letter", I used "letter_" as a convenient splitting marker:
> fnew <- rownames(myfit$beta)[which(myfit$beta != 0)]
> fnew
[1] "letter_c" "letter_d" "letter_e" "letter_f" "letter_h" "letter_k" "letter_l"
[8] "letter_o" "letter_q" "letter_r" "letter_s" "letter_t" "letter_u" "letter_v"
[15] "letter_w"
Then made the split and packaged into a character matrix:
> fnewmtx <- cbind( lapply(sapply(fnew, strsplit, split="_"), "[[", 2),
+ lapply(sapply(fnew, strsplit, split="_"), "[[", 1))
> fnewmtx
[,1] [,2]
letter_c "c" "letter"
letter_d "d" "letter"
letter_e "e" "letter"
letter_f "f" "letter"
# (remaining rows snipped)
And wrapped the paste function(s) output in as.formula() which is half of the answer to how to "convert between formula and its character representation and back." The other half is as.character()
form <- as.formula( paste("~",
paste(
paste(" I(", fnewmtx[,2], "_ ==", "'",fnewmtx[,1],"') ", sep="") ,
sep="", collapse="+")
)
) # edit: needed to add back the underscore
And the output is now an appropriate class object:
> class(form)
[1] "formula"
> form
~I(letter_ == "c") + I(letter_ == "d") + I(letter_ == "e") +
I(letter_ == "f") + I(letter_ == "h") + I(letter_ == "k") +
I(letter_ == "l") + I(letter_ == "o") + I(letter_ == "q") +
I(letter_ == "r") + I(letter_ == "s") + I(letter_ == "t") +
I(letter_ == "u") + I(letter_ == "v") + I(letter_ == "w")
I find it interesting that the as.formula conversion made the single-quotes around the letters into double-quotes.
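The round trip between character strings and a formula can be sketched in a few lines. This is a minimal illustration, assuming coefficient names of the form "letter_<level>"; the three-element fnew vector below is a made-up stand-in for the real one.

```r
# Hypothetical stand-in for the non-zero coefficient names.
fnew <- c("letter_c", "letter_d", "letter_e")

# Split each name into the variable part and the level part.
parts <- strsplit(fnew, "_", fixed = TRUE)
vars  <- vapply(parts, `[[`, character(1), 1)
levs  <- vapply(parts, `[[`, character(1), 2)

# Character -> formula: build the right-hand side as text, then convert.
terms_chr <- sprintf("I(%s_ == '%s')", vars, levs)
form <- as.formula(paste("~", paste(terms_chr, collapse = " + ")))

# Formula -> character: as.character() gives the operator and the
# right-hand side; terms() gives one label per term.
form_chr    <- as.character(form)
term_labels <- attr(terms(form), "term.labels")
```

Working from term_labels rather than from the deparsed formula avoids having to split on "+" by hand.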
Edit: Now that the problem has an additional dimension or two, my suggestion is to skip the recreation of the formula. Note that the rownames of myfit$beta are exactly the same as the column names of X, so instead use the non-zero rownames as indices to select columns in the X matrix:
> str(X[ , which( colnames(X) %in% rownames(myfit$beta)[which(myfit$beta != 0)] )] )
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:429] 9 54 91 157 166 37 55 68 117 131 ...
..@ p : int [1:61] 0 5 13 20 28 36 42 50 60 68 ...
..@ Dim : int [1:2] 200 60
..@ Dimnames:List of 2
.. ..$ : chr [1:200] "1" "2" "3" "4" ...
.. ..$ : chr [1:60] "letter_b" "letter_c" "letter_e" "letter_f" ...
..@ x : num [1:429] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
> myfit2 <- glmnet(X[ , which( colnames(X) %in% rownames(myfit$beta)[which(myfit$beta != 0)] )] ,as.vector(y),lambda=.05)
> myfit2
Call: glmnet(x = X[, which(colnames(X) %in% rownames(myfit$beta)[
which(myfit$beta != 0)])],
y = as.vector(y), lambda = 0.05)
Df %Dev Lambda
[1,] 60 0.9996 0.05
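The column selection can also be done by name directly. The sketch below uses a small dense matrix and a made-up named coefficient vector in place of the sparse X and myfit$beta, so the object contents are assumptions for illustration only.

```r
# Toy design matrix; the column names mimic model-matrix coefficient names.
X <- matrix(1:12, nrow = 3,
            dimnames = list(NULL, c("letterb", "letterc",
                                    "letterb:x", "letterc:x")))

# Hypothetical coefficients, two of them shrunk to exactly zero.
beta <- c(0.7, 0, -1.2, 0)
names(beta) <- colnames(X)

# Names of the non-zero coefficients index the columns directly.
keep  <- names(beta)[beta != 0]
X_sub <- X[, keep, drop = FALSE]
```

Indexing by name keeps the selection correct even if the column order of X changes, and drop = FALSE keeps the result a matrix when only one column survives.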
Christopher, what you are asking for appears, after some consideration and examination of sparse.model.matrix etc, to be somewhat involved. You haven't explained why you do not want to form the full sparse model matrix for X_test, so it is difficult to advise a way forward other than the two options below.
If you have a large number of observations in X_test and hence do not want to produce the full sparse matrix for use in predict() for computational reasons, it might be more expedient to split X_test into two or more chunks of samples and form the sparse model matrix for each one in turn, discarding it after use.
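The chunking idea can be sketched as follows. To stay self-contained, this uses base model.matrix as a stand-in for sparse.model.matrix, a made-up Z_test, and a column sum as a placeholder for whatever is done with each chunk (predict(), for instance); all of these are assumptions.

```r
set.seed(1)
# Hypothetical test data; the factor carries its full set of levels.
Z_test <- data.frame(
  letter = factor(sample(letters[1:4], 10, replace = TRUE),
                  levels = letters[1:4]),
  x = sample(1:20, 10, replace = TRUE)
)
f <- ~ letter + x:letter

# Assign rows to chunks of at most 4 observations.
chunk_id <- ceiling(seq_len(nrow(Z_test)) / 4)

# Build the model matrix one chunk at a time and keep only a summary,
# so no more than one chunk's matrix exists at once.
per_chunk <- lapply(split(Z_test, chunk_id), function(chunk) {
  X_chunk <- model.matrix(f, chunk)
  colSums(X_chunk)
})
col_sums <- Reduce(`+`, per_chunk)
```

Because each chunk keeps the factor's full level set, every chunk yields the same columns in the same order, which is what makes the per-chunk results combinable.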
Failing that, you will need to study code from the Matrix package in detail. Start with sparse.model.matrix and note that it then calls Matrix:::model.spmatrix and locate calls to Matrix:::fac2Sparse in that function. You will probably need to co-opt code from these functions but use a modified fac2Sparse to achieve what you want to achieve.
Sorry I cannot provide an off-the-shelf script to do this, but that is a substantial coding task. If you go down that route, check out the Sparse Model Matrices vignette in the Matrix package and get the package sources (from CRAN) to see if the functions I mention are better documented in the source code (there are no Rd files for fac2Sparse for example). You can also ask the authors of Matrix (Martin Maechler and Doug Bates) for advice, although note that both of these chaps have had a particularly heavy teaching load this term.
Good luck!
I have received a paper in which they included the R files for their empirical results. Nevertheless, I have some problems while trying to run their codes:
data <- vni$R[198:length(vni$R)]; date <- vni$Date[198:length(vni$R)]
l <- length(data)
rw_length <- 52 # 52 weeks (~ 1 year)
bound <- vector()
avr <- vector()
for (i in (rw_length+1):l) {
AVR.test <- AutoBoot.test(data[(i-rw_length):i],nboot=2000,"Normal",c(0.025, 0.975))
bound <- append(bound, AVR.test$CI.stat)
avr <- append(avr, AVR.test$test.stat)
}
lower <- bound[seq(1, length(bound), 2)]
upper <- bound[seq(2, length(bound), 2)]
results <- matrix(c(date[(rw_length+1):l],data[(rw_length+1):l],avr,upper, lower),ncol=5, dimnames = list(c(),c("Date", "Return", "AVR", "Upper", "Lower")))
And I get the following error:
Error in as.Date.numeric(e) : 'origin' must be supplied
for the results <- matrix(c(date[(rw_length+1):l],data[(rw_length+1):l],avr,upper, lower),ncol=5, dimnames = list(c(),c("Date", "Return", "AVR", "Upper", "Lower")))
My dataset is:
Date P R
1 2001-03-23 259.60 0.0000000000
2 2001-03-30 269.30 0.0366840150
3 2001-04-06 284.69 0.0555748690
4 2001-04-13 300.36 0.0535808860
5 2001-04-20 317.76 0.0563146260
...
935 2019-02-15 950.89 0.0454163960
936 2019-02-22 988.91 0.0392049380
937 2019-03-01 979.63 -0.0094283770
Could you please help me with that issue?
Thanks a lot!
Everything in a matrix must be the same class. This is often found when there's a string among numbers, where
m <- matrix(0, nr=2, nc=2)
m
# [,1] [,2]
# [1,] 0 0
# [2,] 0 0
m[1] <- "a"
m
# [,1] [,2]
# [1,] "a" "0"
# [2,] "0" "0"
In this case, you have Date (first column) and numeric (all others? no idea what AutoBoot is). And because it's trying to coerce from least-complex to most-complex (from numeric to Date), the non-Date objects are being converted.
matrix(c(Sys.Date(), 1.1))
# Error in as.Date.numeric(e) : 'origin' must be supplied
I suggest that trying to store this in a matrix is therefore fundamentally flawed. If you want to store a Date object among numbers, you have two options:
Store it as a data.frame, where each column can have its own class.
Pre-convert the "Date" data to numeric and store it as a number. This means that if/when you need the dates to be of class Date again, you'll need to as.Date(..., origin="1970-01-01").
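Both options can be sketched with a few rows. The dates and returns below are copied from the first rows of the posted dataset; the object names are made up for illustration.

```r
dates   <- as.Date(c("2001-03-23", "2001-03-30", "2001-04-06"))
returns <- c(0, 0.036684015, 0.055574869)

# Option 1: a data.frame lets each column keep its own class.
results_df <- data.frame(Date = dates, Return = returns)

# Option 2: store the dates as days since 1970-01-01 in a numeric matrix.
results_mat <- cbind(Date = as.numeric(dates), Return = returns)

# Convert back to Date when needed.
restored <- as.Date(results_mat[, "Date"], origin = "1970-01-01")
```

The data.frame is usually the more convenient of the two, since the dates stay readable and no conversion step has to be remembered.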
Suppose I have a data frame of 101 variables. I select one, called Y, as the dependent variable, and the remaining 100, called X_1, X_2, ..., X_{100}, as the independent ones.
Now I would like to create a matrix containing the 100 independent variables. Is there a direct way to do it, like the "." shorthand in a regression formula, i.e. lm(Y ~ ., _____)?
You can use the grep function to extract the column names of the data frame that belong to the independent variables, then convert that subset to a matrix. Please see the code below:
# simulation of the data frame with 100 measurements and 101 variables
n <- 100
df <- data.frame(matrix(seq_len(101 * n), ncol = 101))
names(df) <- c(paste0("X_", 1:100), "Y")
# extract matrix of Xs
m_x <- as.matrix(df[, grep("^X", names(df))])
dimnames(m_x)
Output:
[[1]]
NULL
[[2]]
[1] "X_1" "X_2" "X_3" "X_4" "X_5" "X_6" "X_7" "X_8" "X_9" "X_10" "X_11" "X_12" "X_13" "X_14" "X_15"
[16] "X_16" "X_17" "X_18" "X_19" "X_20" "X_21" "X_22" "X_23" "X_24" "X_25" "X_26" "X_27" "X_28" "X_29" "X_30"
[31] "X_31" "X_32" "X_33" "X_34" "X_35" "X_36" "X_37" "X_38" "X_39" "X_40" "X_41" "X_42" "X_43" "X_44" "X_45"
[46] "X_46" "X_47" "X_48" "X_49" "X_50" "X_51" "X_52" "X_53" "X_54" "X_55" "X_56" "X_57" "X_58" "X_59" "X_60"
[61] "X_61" "X_62" "X_63" "X_64" "X_65" "X_66" "X_67" "X_68" "X_69" "X_70" "X_71" "X_72" "X_73" "X_74" "X_75"
[76] "X_76" "X_77" "X_78" "X_79" "X_80" "X_81" "X_82" "X_83" "X_84" "X_85" "X_86" "X_87" "X_88" "X_89" "X_90"
[91] "X_91" "X_92" "X_93" "X_94" "X_95" "X_96" "X_97" "X_98" "X_99" "X_100"
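Two alternatives that avoid a name pattern are sketched below, under the assumption that Y is the only column to exclude: setdiff() on the column names, and model.matrix() with the same "." shorthand that lm() accepts. The data frame is simulated the same way as above.

```r
n  <- 100
df <- data.frame(matrix(seq_len(101 * n), ncol = 101))
names(df) <- c(paste0("X_", 1:100), "Y")

# Everything except the response, selected by name.
m_x <- as.matrix(df[, setdiff(names(df), "Y")])

# The formula route: "." expands to all columns other than Y,
# and "- 1" drops the intercept column.
mm <- model.matrix(Y ~ . - 1, data = df)
```

model.matrix() also expands factor columns into dummy variables, so the two results only coincide when all predictors are numeric, as they are here.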
I really need help with my script, I am not a professional in R.
Some background information about what I want to do.
There are two ranked lists of data (drugs, diseases). These datasets contain information about how genes change in expression.
The drugRL (drug) dataset is a ranked list. The description of the diseaseRL (disease) dataset (?diseaseRL) says it is one too, but it does not seem to be a ranked list.
What I did was take the absolute values from the diseaseRL dataset and normalize them by the range of the data (max - min of the vector for a particular disease in that dataset).
So what I have now are two lists of data frames containing the gene expression information, as ranked lists.
Some code examples, first build the needed packages:
# Compile/install packages using biocLite.
#source("https://bioconductor.org/biocLite.R")
#biocLite("DrugVsDiseasedata")
#biocLite("gespeR")
#biocLite("DrugVsDisease") # may not be needed.
Then import packages/datasets :
#import libraries
library("DrugVsDisease")#may not be needed
library("DrugVsDiseasedata")
library("cMap2data")
library("gespeR")
#import datasets
data(diseaseRL)
data(drugRL)
> class(drugRL)
[1] "matrix"
>
> class(diseaseRL)
[1] "matrix"
>
> str(drugRL)
num [1:11709, 1:1309] 1870 4059 2250 10284 8999 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:11709] "ZNF702P" "SAMD4A" "VN1R1" "ZNF419" ...
..$ : chr [1:1309] "(+)-chelidonine" "(+)-isoprenaline" "(+/-)-catechin" "(-)-MK-801" ...
>
> str(diseaseRL)
num [1:11709, 1:45] 0.01683 -0.00112 -0.00126 0.04902 0.02605 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:11709] "LINC00115" "GOT2P1" "TP73-AS1" "PIN1P1" ...
..$ : chr [1:45] "wilms-tumor" "glaucoma-open-angle" "diabetes-mellitus-type-ii" "soft-tissue-sarcoma" ...
>
Now comes the part where i created a function to normalize the datasets:
NormalizeRLData <- function(x){
data.rankedlist <- x
data.rankedlist.abs <- as.data.frame(abs(data.rankedlist))
data.rankedlist.abs.ordered <-
data.rankedlist.abs[order(data.rankedlist.abs,decreasing=T), , drop = FALSE]
data.rankedlist.abs.ordered.max <- max(data.rankedlist.abs.ordered)
data.rankedlist.abs.ordered.min <- min(data.rankedlist.abs.ordered)
data.rankedlist.abs.ordered.normalizedToOwnRange <- (data.rankedlist.abs.ordered
/
(data.rankedlist.abs.ordered.max - data.rankedlist.abs.ordered.min ))
data.rankedlist.abs.ordered.normalizedToOwnRange.ordered <-
data.rankedlist.abs.ordered.normalizedToOwnRange[order(
data.rankedlist.abs.ordered.normalizedToOwnRange,decreasing=T ), , drop =
FALSE]
return(data.rankedlist.abs.ordered.normalizedToOwnRange.ordered)
}
diseaseRL.normalized <- apply(diseaseRL,2,NormalizeRLData)
drugRL.normalized <- apply(drugRL,2,NormalizeRLData)
There are several problems I need to solve before I can proceed. I am unsure whether what I did so far can be done more efficiently, in particular with regard to the next step, which uses rank-biased overlap (RBO).
RBO is a function which is able to compare two ranked lists. I want to make use of this function for the lists of normalized data, containing data.frames of the genes (for the disease and drug ranked lists). The input of this rbo function is a named vector.
example :
> a <- c(4,2,5,5)
> b <- c(1,2,3,4)
> names(a) <- c('one','two','three','four')
> names(b) <- c('one','two','three','four')
> rbo(a,b, p = 0.95)
[1] 0.9650417
What is the most efficient way to do this? First of all, can I produce a better output than what I have at the moment to pass to the rbo function?
And second:
If not (or in a similar case), I will have one list of data.frames containing the gene information for each drug, and another for each disease. I want to run the rbo function for every drug against every disease.
I tried using sapply, but I could not get it to work properly, and I am unsure if it is the right way to go. I need to keep the names of the drugs and, for the other dataset, the names of the diseases; the gene names are important too, so that I can later check which genes, diseases and drugs interact.
I really hope someone here can shed some light on this!
P.S.: If anyone tries to help me here but has problems installing the packages, I can help. Maybe I could send an example dataset (I am not sure if I can attach anything here directly).
Best Regards,
Rick
First, your user-defined function can be made far less verbose. There is no need to cast to a data.frame, to order the values up front, or to use drop in [], because apply passes plain vectors to the function. Consider the following adjustment, where the last line is the returned value:
NormalizeRLData <- function(x){
  rnklist <- abs(x)
  normRng <- rnklist / (max(rnklist) - min(rnklist))
  normRng[order(normRng, decreasing = TRUE)]
}
diseaseRL.normalized <- apply(diseaseRL,2,NormalizeRLData)
drugRL.normalized <- apply(drugRL,2,NormalizeRLData)
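One caveat is worth checking before the normalized objects go into rbo: a matrix carries a single set of row names, so it cannot record a different gene order for each independently sorted column. The sketch below uses a made-up 3 x 2 matrix as a stand-in for diseaseRL and keeps one named vector per column in a list instead; the toy values and object names are assumptions.

```r
# Toy stand-in for diseaseRL: 3 genes x 2 conditions (values are made up).
toy <- matrix(c(0.5, -2.0,  1.0,
                3.0,  0.1, -1.5),
              nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("condA", "condB")))

NormalizeRLData <- function(x) {
  rnklist <- abs(x)
  normRng <- rnklist / (max(rnklist) - min(rnklist))
  normRng[order(normRng, decreasing = TRUE)]
}

# A list keeps one correctly named, sorted vector per column.
toy_list <- lapply(seq_len(ncol(toy)), function(k) NormalizeRLData(toy[, k]))
names(toy_list) <- colnames(toy)

order_A <- names(toy_list$condA)  # gene order for condA
order_B <- names(toy_list$condB)  # gene order for condB, which differs

# apply() binds the sorted columns into one matrix, which has no way
# to hold both orders in its row names.
toy_mat <- apply(toy, 2, NormalizeRLData)
```

With the list form, each element is already the kind of named vector that the rbo example in the question takes as input.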
Secondly, your normalized matrices (not data frames) can be run through sapply, but note that sapply does not form a cartesian product on its own: any extra argument after the function is passed whole to every call, so function(i,j) would receive the entire vector of drug column indices as j. To visit every disease/drug pair, nest one sapply inside another.
rbo needs named vectors, and matrix columns keep their row names when extracted, so this works as long as the row names still match the values in each column (columns that were sorted independently inside apply no longer satisfy this, since a matrix has only one set of row names). The return value is a matrix with one row per drug and one column per disease.
# NESTED SAPPLY OVER ALL COLUMN PAIRS
rbo_mat <- sapply(seq(ncol(diseaseRL.normalized)), function(i)
  sapply(seq(ncol(drugRL.normalized)), function(j)
    rbo(diseaseRL.normalized[,i], drugRL.normalized[,j], p = 0.95)))
# EQUIVALENT WITH VAPPLY TO [V]ERIFY TYPE AND LENGTH OF OUTPUT
rbo_mat <- vapply(seq(ncol(diseaseRL.normalized)), function(i)
  vapply(seq(ncol(drugRL.normalized)), function(j)
    rbo(diseaseRL.normalized[,i], drugRL.normalized[,j], p = 0.95), numeric(1)),
  numeric(ncol(drugRL.normalized)))
The lesser-known rapply (recursive apply) is not a substitute here, because it calls the function on each list element with a single argument, so j is never supplied. A base R alternative that does iterate over every pair is outer with a vectorized function, which returns diseases in rows and drugs in columns:
pair_rbo <- Vectorize(function(i,j) rbo(diseaseRL.normalized[,i], drugRL.normalized[,j], p = 0.95))
rbo_mat2 <- outer(seq(ncol(diseaseRL.normalized)), seq(ncol(drugRL.normalized)), pair_rbo)
TEST EXAMPLE
Because I cannot reproduce the OP's data and do not have the necessary packages, below is a working example of the approach above with random normal data, using the correlation function cor as a substitute for rbo:
set.seed(142)
mat1 <- sapply(1:10, function(i) rnorm(20))
colnames(mat1) <- LETTERS[1:10]
rownames(mat1) <- letters[1:20]
str(mat1)
# num [1:20, 1:10] 1.255 1.704 0.88 -0.582 -0.169 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:20] "a" "b" "c" "d" ...
# ..$ : chr [1:10] "A" "B" "C" "D" ...
mat2 <- sapply(1:5, function(i) rnorm(20))
colnames(mat2) <- LETTERS[1:5]
rownames(mat2) <- letters[1:20]
str(mat2)
# num [1:20, 1:5] -0.156 0.449 -0.822 -1.062 0.838 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:20] "a" "b" "c" "d" ...
# ..$ : chr [1:5] "A" "B" "C" "D" ...
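# one correlation per (mat1 column, mat2 column) pair; result is 5 x 10
corr_mat <- sapply(seq(ncol(mat1)), function(i)
  sapply(seq(ncol(mat2)), function(j) cor(mat1[,i], mat2[,j])))
corr_mat2 <- vapply(seq(ncol(mat1)), function(i)
  vapply(seq(ncol(mat2)), function(j) cor(mat1[,i], mat2[,j]), numeric(1)),
  numeric(ncol(mat2)))
# outer() puts mat1 columns in rows, so transpose to match the two above
corr_mat3 <- t(outer(seq(ncol(mat1)), seq(ncol(mat2)),
  Vectorize(function(i,j) cor(mat1[,i], mat2[,j]))))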
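A quick sanity check of any pairwise matrix is to compare it with a reference computed another way. The sketch below rebuilds the random matrices from the example, shows how sapply treats an extra argument, and checks an outer()/Vectorize() result against cor(mat1, mat2), which computes all column pairs at once. The helper name pair_cor is made up for illustration.

```r
set.seed(142)
mat1 <- sapply(1:10, function(i) rnorm(20))
colnames(mat1) <- LETTERS[1:10]
mat2 <- sapply(1:5, function(i) rnorm(20))
colnames(mat2) <- LETTERS[1:5]

# Extra arguments to sapply() are handed whole to every call:
# j is the full vector 1:5 each time, not one element per pair.
j_lengths <- sapply(1:3, function(i, j) length(j), 1:5)

# Explicit pairwise evaluation: one scalar per (i, j) combination.
pair_cor   <- function(i, j) cor(mat1[, i], mat2[, j])
corr_outer <- outer(seq_len(ncol(mat1)), seq_len(ncol(mat2)),
                    Vectorize(pair_cor))
dimnames(corr_outer) <- list(colnames(mat1), colnames(mat2))

# Reference: cor() on two matrices returns every column-pair correlation.
corr_ref <- cor(mat1, mat2)
```

cor happens to accept whole matrices, which is why it is a forgiving substitute; a function that only takes two vectors, as the rbo example in the question suggests, needs the explicit pairwise form.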
I am working on converting some of the variables in the GermanCredit data from the caret package to factors. Using factors reduces the number of variables from 62 to 21.
The problem is that I get inconsistent results in the data summary for the Purpose.X columns:
for (i in 20:30) {
print(c(colnames(GermanCredit)[i], length( which(GermanCredit[,i] == 1))))
}
[1] "Purpose.NewCar" "234"
[1] "Purpose.UsedCar" "103"
[1] "Purpose.Furniture.Equipment" "181"
[1] "Purpose.Radio.Television" "280"
[1] "Purpose.DomesticAppliance" "12"
[1] "Purpose.Repairs" "22"
[1] "Purpose.Education" "50"
[1] "Purpose.Vacation" "0"
[1] "Purpose.Retraining" "9"
[1] "Purpose.Business" "97"
[1] "Purpose.Other" "12"
and the results from the prop.table are
prop.table(table(Purpose))
NewCar UsedCar Furniture.Equipment Radio.Television
0.234 0.103 0.181 0.280
DomesticAppliance Repairs Education Vacation
0.012 0.022 0.050 0.009
Retraining Business Other
0.097 0.012 0.000
It looks like the results from Vacation-Other are rotated for some reason. Any help in figuring out why the inconsistent results would be much appreciated. Thanks.
--
Purpose is obtained with the following loop:
pcolnamerepeat <- c("CheckingAccountStatus.", "CreditHistory.", "Purpose.",
"SavingsAccountBonds.", "EmploymentDuration.", "Personal.",
"OtherDebtorsGuarantors.", "Property.",
"OtherInstallmentPlans.", "Housing.", "Job.")
for (i in pcolnamerepeat) {
rpt <- grep(i, colnames(GermanCredit))
tempfac <- factor(apply(GermanCredit[, rpt], 1, function(x) which(x == 1)))
levels(tempfac) <- substr(colnames(GermanCredit[, rpt]), nchar(i) + 1,
nchar(colnames(GermanCredit[,rpt])))
GermanCredit <- cbind(GermanCredit[-c(rpt)], tempfac)
names(GermanCredit)[length(GermanCredit)] <- substr(i, 1, nchar(i) - 1)
}
attach(GermanCredit) # Makes easy access to the columns
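A possible explanation, offered as a sketch rather than a confirmed diagnosis: factor() only creates levels for values that actually occur, so a dummy column with no 1s (Purpose.Vacation here) gets no level, and assigning the full set of labels with levels<- then shifts every later label by one position. The toy data frame below is made up to reproduce the effect; passing levels and labels to factor() explicitly avoids it.

```r
# Toy one-hot block in which column B is never set.
toy <- data.frame(Purpose.A = c(1, 0, 0, 0),
                  Purpose.B = c(0, 0, 0, 0),
                  Purpose.C = c(0, 1, 1, 0),
                  Purpose.D = c(0, 0, 0, 1))
labs <- sub("^Purpose\\.", "", names(toy))

# Column index of the 1 in each row: 1, 3, 3, 4.
idx <- apply(toy, 1, function(x) which(x == 1))

# As in the loop: the factor has levels "1", "3", "4" only, so the
# four labels are matched to the wrong positions.
bad <- factor(idx)
levels(bad) <- labs

# Fix: state all possible levels up front and label them in one step.
good <- factor(idx, levels = seq_along(labs), labels = labs)

bad_counts  <- as.vector(table(bad))
good_counts <- as.vector(table(good))
```

In the loop from the question, the equivalent change would be to build tempfac with levels = seq_along(rpt) and the substr() result as labels, instead of assigning levels afterwards.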
How to interpret the results of panel data models in R?
I estimate an adapted form of Koenker's (2004) suggestion for a quantile regression approach with panel data, for my data:
rq.fit.panel <- function(X,y,s,w,taus,lambda)
{
require(SparseM)
require(quantreg)
K <- length(w)
if(K != length(taus))
stop("length of w and taus must match")
X <- as.matrix(X)
p <- ncol(X)
n <- length(levels(as.factor(s)))
N <- length(y)
if(N != length(s) || N != nrow(X))
stop("dimensions of y,X,s must match")
Z <- as.matrix.csr(model.matrix(~as.factor(s)-1))
Fidelity <- cbind(as(w,"matrix.diag.csr") %x% X,w %x% Z)
Penalty <- cbind(as.matrix.csr(0,n,K*p),lambda*as(n,"matrix.diag.csr"))
D <- rbind(Fidelity,Penalty)
y <- c(w %x% y,rep(0,n))
a <- c((w*(1-taus)) %x% (t(X)%*%rep(1,N)),
sum(w*(1-taus)) * (t(Z) %*% rep(1,N)) + lambda * rep(1,n))
rq.fit.sfn(D,y,rhs=a)
}
bdeduc2<-read.table("dados_rq.txt", header=T)
z<-c("inter","ne","no","su","co")
X<-bdeduc2[,z]
y<-bdeduc2$scoreedu
s<-bdeduc2$uf
w<-c(0.1,0.25,0.5,0.25,0.1)
taus<-c(0.1,0.25,0.5,0.75,0.9)
lambda<-1
But I don't know how to interpret the results below:
$coef
[1] 1.02281339 -0.18750668 -0.13688807 -0.04180458 -0.01367417 1.02872440 -0.18055062 -0.13003224 -0.03829135 -0.01409369 1.03377335 -0.16649845 -0.11669812
[14] -0.03854060 -0.01438620 1.03851101 -0.15328087 -0.10440359 -0.03871744 -0.01465492 1.04330584 -0.14660960 -0.09670756 -0.03465501 -0.01430647 -0.29187982
[27] -0.21831160 -0.11295134 -0.21530494 -0.15664777 -0.13840296 -0.03224749 -0.11692122 -0.11237144 -0.15112171 -0.10385352 -0.08385934 -0.16090525 -0.30349309
[40] -0.16121494 -0.03106264 -0.16299994 -0.03182579 -0.22271685 -0.08251486 -0.29031224 -0.19680023 -0.20004209 -0.05601186 -0.21140762 -0.04254752 -0.01864703
$ierr
[1] 0
$it
[1] 16
$time
[1] 0
##summary rq
summary(rq)
Length Class Mode
coef 52 -none- numeric
ierr 1 -none- numeric
it 1 -none- numeric
time 1 -none- numeric
It looks like you fit the regression and saved it, then are trying to look at it in a new session without the quantile regression package loaded (it is giving you the list summary, not the object summary that is in the package).
Make sure that the package used to create your object is loaded, then do summary again to see if that gives you meaningful output.
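Whichever summary method ends up being used, the layout of $coef can be read off the design matrix built in rq.fit.panel: the Fidelity block has K*p columns (p slopes for each of the K quantiles, in the order of taus) followed by n columns for the individual effects, which matches the 52 = 5*5 + 27 values printed. The sketch below splits such a vector accordingly. The coef values are placeholders, not the posted estimates, and the object names are made up.

```r
z    <- c("inter", "ne", "no", "su", "co")   # columns of X in the question
taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
p <- length(z)
K <- length(taus)

# Placeholder for fit$coef (length K*p + n, with n = 27 groups in s).
coef <- seq_len(52)

# First K*p entries: one column of p slopes per quantile.
slopes <- matrix(coef[seq_len(K * p)], nrow = p, ncol = K,
                 dimnames = list(z, paste0("tau=", taus)))

# Remaining entries: one fixed effect per level of s.
fixed_effects <- coef[-seq_len(K * p)]
```

Under this reading, each column of slopes holds the coefficients of inter, ne, no, su and co at one quantile, and fixed_effects holds the penalized effects for the levels of uf.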