Running lm() on a user-defined matrix in R

I am trying to set up a function that will run lm() on a model derived from a user defined matrix in R.
The modelMatrix would be set up by the user, but would be expected to be structured as follows:
source target
1 "PVC" "AA"
2 "Aro" "AA"
3 "PVC" "Aro"
This matrix lets the user define the dependent (target) and independent (source) variables in the lm, and refers to the column names of another matrix, valuesMatrix:
PVC Aro AA
[1,] -2.677875504 0.76141471 0.006114699
[2,] 0.330537781 -0.18462039 -0.265710261
[3,] 0.609826160 -0.62470233 0.715474554
I need to take this modelMatrix and generate the relevant number of lms. E.g., in this case:
lm(AA ~ PVC + Aro)
and
lm(Aro ~ PVC)
I tried this, but as soon as the user changes the model matrix it gives the wrong model, so it seems I need to explicitly specify each independent variable according to the modelMatrix.
```
lm(as.formula(paste(unique(modelMatrix[,"target"])[1], "~ .", sep="")),
   data=data.frame(valuesMatrix))
```
Do I need to set up 2 loops (1 nested) to fetch the source and target strings and paste them into the formula, or am I overlooking some detail? Very confused.
Ideally I would like the user to be able to change the modelMatrix to include 1 or many lms and one or many independent variables for each lm. I would really appreciate your assistance as I am really hitting a wall here.
Thanks.

For your specific example this code should work -
source <- c("PVC","Aro","PVC")
target <- c("AA","AA","Aro")
modelMatrix <- data.frame(source = source, target = target)
valuesMatrix <- as.matrix(rbind(c(-2.677875504,0.76141471,0.006114699), c(0.330537781,-0.18462039,-0.265710261),
c(0.609826160,-0.62470233,0.715474554)))
colnames(valuesMatrix) <- c("PVC","Aro","AA")
unique.target <- as.character(unique(modelMatrix$target))
lm.models <- lapply(unique.target, function(x) {lm(as.formula(paste(x,"~ .", sep = "")),
data = data.frame(valuesMatrix[,colnames(valuesMatrix) %in%
c(x,as.character(modelMatrix$source[which(modelMatrix$target==x)]))]))})
You could use a for loop instead, but for loops can be too expensive, especially if your modelMatrix gets really large. It might not look as appealing, but lapply is well suited to this type of job. The only other trick was to keep just the columns required for each lm.
You can also pull out the results of each lm using:
lm.models[[1]] and lm.models[[2]]
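If subsetting the columns and relying on "~ ." feels fragile, another option is to build each formula explicitly from the modelMatrix. This is only a sketch: the data values are made up (six rows, so the fits are not saturated), and the names sources.by.target, formulas and values.df are hypothetical helpers that do not appear in the question.

```r
# Toy inputs mirroring the question (the numbers are made up)
modelMatrix <- data.frame(source = c("PVC", "Aro", "PVC"),
                          target = c("AA", "AA", "Aro"),
                          stringsAsFactors = FALSE)
set.seed(1)
valuesMatrix <- matrix(rnorm(18), ncol = 3,
                       dimnames = list(NULL, c("PVC", "Aro", "AA")))

# Group the source names by target: AA -> PVC, Aro ; Aro -> PVC
sources.by.target <- split(modelMatrix$source, modelMatrix$target)

# One explicit formula per target, e.g. AA ~ PVC + Aro
formulas <- Map(function(tgt, src) reformulate(src, response = tgt),
                names(sources.by.target), sources.by.target)

# Fit every model against the same data frame
values.df <- as.data.frame(valuesMatrix)
lm.models <- lapply(formulas, function(f) lm(f, data = values.df))
```

The result is a list named by target, so lm.models$AA and lm.models$Aro hold the two fits, and adding rows to modelMatrix adds models or predictors without touching the code.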

Related

How to automate assignment to a class from output of a function in r

I am trying desperately to automate some model testing in lme4::lmer (as I have too many to do)
The functions I use run some models, find the best one by checking their stats, and create res (which has two Lists in it).
res$res is a data frame
res$res$model is the text I need to rerun the best model (to clear it out, I only use 1)
res$fits is a List of 1, with res$fits[1] being a "Formal Class 'lmerMod' with 13 slots", the name of which is always exactly the same as models
Here's some code to make more sense:
models <- theBigList
## run function
res <- fit.func(models=models, response='bnParam1')
# show model selection table
res$res
# This is where you get the best model from above and put it in here to set it up for plotting.
#
models <- res$res$model[1]
# run function
res <- fit.func(models=models, response='bnParam1')
## model selection table
res$res
# Once you get a model where the best result is the same as the "previous" one, you copy and paste it in here to graph it.
# It will be the one with the lowest CV.R2 from the 2nd 'models'
top.fit <- res$fits$'INSERT models HERE'
#top.fit is a class lmerMod, and has the list of everything needed to be extracted and calculated to ggplot it
Normally I copy and paste the text for the best model into the space where it says 'INSERT models HERE', but I would like to automate it.
I can't seem to use models as an input, nor force it, eg as.Class or as.String, things like that, nor use other ways of referencing from a list. I am at a loss as to how to assign the right variable.
EDIT #######
So res$res is the first List in res that is a data frame, it will output something like this:
> res$res
model nPar D aic d.aic w.aic R2 cv.R2
1 Sp + (1|Spec) + SE + TC 51 3804.244 3906.244 0 1 0.6376789 0.2586369
To expand on my last sentence which is the most important. Normally the last bit of code passes the parameters to lme4::fixef like this for e.g.:
top.fit <- res$fits$"Sp + (1|Spec) + SE + TC"
This line of code retrieves that same string (which I discovered earlier, but it changes every time I run a different analysis):
models <- res$res$model[1]
> models
[1] "Sp + (1|Spec) + SE + TC"
So I'd basically like to write something like top.fit <- res$fits$models, but I assume there is some form of type incompatibility or a problem with using 'models' within the reference to the List/Class?
Single square brackets returned a list wrapping the model rather than the model itself, which wasn't working.
I solved this by referencing with double square brackets, which extract the element itself and so pass it on as the original lmerMod class it needs to be. Now I know.
top.fit <- res$fits[[1]]
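For anyone landing here later, the difference between the three ways of indexing can be seen on a small stand-in. This is a sketch under assumptions: fits is a hypothetical replacement for res$fits, and an lm fit on the built-in mtcars data stands in for the lmerMod object so that nothing beyond base R is needed.

```r
# Hypothetical stand-in for res$fits: a list whose names are model strings
fits <- list("Sp + SE" = lm(mpg ~ wt, data = mtcars))
models <- "Sp + SE"

by.dollar <- fits$models     # looks for an element literally named "models"
by.single <- fits[models]    # a list of length 1 that wraps the fit
by.double <- fits[[models]]  # the fitted model object itself

is.null(by.dollar)   # TRUE: $ does not evaluate the variable
class(by.single)     # "list"
class(by.double)     # "lm"
```

So besides res$fits[[1]], something like res$fits[[models]] should also pick the model by name, provided models holds exactly one string that matches a name in the list.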

How to create objects in R with "unknown name"?

I am trying to create (vector) objects in R without specifying the names of the objects a priori. For example, if I have a list of length 3, I want to create the objects p1 to p3, and if I have a list of length 10, the objects p1 to p10 have to be created. The length should be arbitrary and not determined a priori.
Thanks for your help!
I guess the proper way of doing that is to consider a list p = list() and then you can use p[[i]] with i as big as you wish without having specified any length.
Then once your list is filled up, you can rename it: names(p) = paste0("p",c(1:length(p)))
Finally, if you want to get all the pi variables directly accessible, you add attach(p)
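A minimal sketch of the list approach described above, with made-up contents (values is a hypothetical input, not something from the question):

```r
values <- list(rnorm(10), rnorm(20), 1:3)   # any length works

p <- list()
for (i in seq_along(values)) {
  p[[i]] <- values[[i]]                     # the list grows as needed
}
names(p) <- paste0("p", seq_along(p))

p$p3              # same element as p[["p3"]]
sapply(p, length) # one length per named element
```

Keeping everything in p means the elements can be looped over with lapply or sapply, which is usually easier than handling separate p1, p2, ... objects.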
This is kind of a hack but you can do the following
short_list <- list(rnorm(10),rnorm(20),1:3)
long_list <- c(short_list,short_list )
paste0("p",seq_along(short_list))
mapply(assign, paste0("p",seq_along(short_list)), short_list, MoreArgs = list(envir = .GlobalEnv))
result:
> p3
[1] 1 2 3
you can do the same with long_list
I don't see a statistical model for which you would need this. Better to start working with lists like short_list, or with data.frames, directly.
PS If you just want to use it for glm, you probably want to learn about formulas in R.
glm(y~., data=your_data) takes all columns in your data frame that are not named y as regressors. Maybe this helps.
assign (and maybe also attach) are often a sign that you have not yet arrived at an "Rish" version of the code.
Considering that you need this for modeling: if your $p_1, \dots, p_n$ are of the same type, you can put them into a matrix (inside a column of a data.frame; for modeling they need to be of the same length anyway):
df$matrix <- p.matrix
If you directly create the data.frame, you need to make sure the matrix is not expanded to data.frame columns:
df <- data.frame (matrix = I (matrix), ...)
Then glm (y ~ matrix, ...)  will work.
For examples of this technique see e.g. packages pls or hyperSpec or the pls paper in the Journal of Statistical Software.
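As a rough illustration of the matrix-in-a-column idea, here is a sketch with simulated data. The names p.matrix, y and fit, and all the numbers, are assumptions made up for the example.

```r
set.seed(42)
# Two predictors of equal length stored as the columns of one matrix
p.matrix <- matrix(rnorm(40), nrow = 20, ncol = 2,
                   dimnames = list(NULL, c("p1", "p2")))
y <- 1 + 2 * p.matrix[, "p1"] - p.matrix[, "p2"] + rnorm(20, sd = 0.1)

# I() keeps the matrix as a single data-frame column
df <- data.frame(y = y, matrix = I(p.matrix))
dim(df)   # 20 rows, 2 columns: y and the matrix column

# The formula names the matrix once; every column becomes a regressor
fit <- glm(y ~ matrix, data = df)
coef(fit)
```

The formula stays the same however many columns the matrix has, which is the point of the technique.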

interpreting R code function

I would like to perform pathway enrichment analyses.
I have 21 lists of significant genes, and multiple types of pathways I would like to check (i.e. check for enrichment in KEGG pathways, GO terms, complexes etc.).
I found this example of code, on an old BioC post. However, I am having trouble adapting it for myself.
Firstly,
1- what does this mean? I don't know this multiple colon syntax.
hyperg <- Category:::.doHyperGInternal
2 - I don't understand how this line works. hyperg.test is a function that needs 3 variables passed to it, correct? Is this line somehow passing genes.by.pathway, significant.genes, and all.geneIDs to hyperg.test?
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
Code that I would like to adapt
library(KEGGREST)
library(org.Hs.eg.db)
# created named list, length 449, eg:
# path:hsa00010: "Glycolysis / Gluconeogenesis"
pathways <- keggList("pathway", "hsa")
# make them into KEGG-style human pathway identifiers
human.pathways <- sub("path:", "", names(pathways))
# for demonstration, just use the first ten pathways
demo.pathway.ids <- head(human.pathways, 10)
demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids)
genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) {
demo.pathway$GENE[c(TRUE, FALSE)]
})
all.geneIDs <- keys(org.Hs.eg.db)
# chose one of these for demonstration. the first (a whole genome random
# set of 100 genes) has very little enrichment, the second, a random set
# from the pathways themselves, has very good enrichment in some pathways
set.seed(123)
significant.genes <- sample(all.geneIDs, size=100)
#significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10)
# the hypergeometric distribution is traditionally explained in terms of
# drawing a sample of balls from an urn containing black and white balls.
# to keep the arguments straight (in my mind at least), I use these terms
# here also
hyperg <- Category:::.doHyperGInternal
hyperg.test <-
function(pathway.genes, significant.genes, all.genes, over=TRUE)
{
white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
white.balls.in.urn <- length(pathway.genes)
total.balls.in.urn <- length(all.genes)
black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
balls.pulled.from.urn <- length(significant.genes)
hyperg(white.balls.in.urn, black.balls.in.urn,
balls.pulled.from.urn, white.balls.drawn, over)
}
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
print(pVals.by.pathway)
You are probably getting your error because you don't have the Category package from Bioconductor installed. I suspect this because of the triple colon operator :::. This operator is very similar to the double colon operator ::. Whereas :: gives access to exported objects from a package without loading it, ::: also gives access to non-exported objects (in this case the .doHyperGInternal function from Category, which the code assigns to hyperg). If you install the Category package, the code runs without error.
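The difference between the two operators can be checked with a package that ships with R. A small sketch, using stats and its print.lm method purely as a convenient example of an unexported object:

```r
# Exported object: reachable with :: without attaching the package
stats::sd(c(1, 2, 3))

# print.lm lives in the stats namespace but is not exported
"print.lm" %in% getNamespaceExports("stats")       # FALSE
exists("print.lm", envir = asNamespace("stats"))   # TRUE

# ::: reaches into the namespace and returns the unexported function
f <- stats:::print.lm
is.function(f)                                     # TRUE
```

Because unexported functions are internal to a package, they can change between versions without notice, so code relying on ::: may need adjusting after an update.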
With regard to the sapply statement:
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
You can break this down into separate parts to understand it. Firstly, sapply iterates over the elements of genes.by.pathway and passes each one as the first argument of hyperg.test. The arguments that follow are the two additional parameters, which are passed along unchanged on every call. This is a little unclear, and I personally recommend naming the parameters explicitly: it avoids unexpected surprises and removes the need to keep the exact same order. It is a little repetitive in this case, but a good way to avoid a silly bug (e.g. putting significant.genes after all.geneIDs).
Rewritten:
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes=significant.genes, all.genes=all.geneIDs))
Once this loop completes, sapply simplifies the output into a matrix with one column per pathway. Taking the transpose with t gives one row per pathway, which is much more user-friendly.
Generally speaking, when trying to understand complex apply statements I find it best to break them apart into smaller parts and see what the objects themselves look like.
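To see the sapply mechanics without installing anything, here is a sketch on made-up gene identifiers. Note the swap: base R's phyper() is used in place of Category:::.doHyperGInternal, so this is not the original helper and it returns only the overlap and an over-representation p-value. It also assumes every pathway gene is contained in all.genes.

```r
# Over-representation test built on base R's phyper() (an assumption,
# not the Category internal used in the question)
hyperg.test <- function(pathway.genes, significant.genes, all.genes) {
  white.drawn  <- length(intersect(significant.genes, pathway.genes))
  white.in.urn <- length(pathway.genes)
  black.in.urn <- length(all.genes) - white.in.urn
  drawn        <- length(significant.genes)
  # P(X >= white.drawn) under the hypergeometric distribution
  p <- phyper(white.drawn - 1, white.in.urn, black.in.urn, drawn,
              lower.tail = FALSE)
  c(overlap = white.drawn, p.value = p)
}

# Made-up identifiers, purely for illustration
all.geneIDs       <- as.character(1:100)
genes.by.pathway  <- list(pathA = as.character(1:10),
                          pathB = as.character(51:70))
significant.genes <- as.character(c(1:5, 91:95))

# Each list element becomes the first argument; the named arguments are
# passed unchanged on every call
res <- sapply(genes.by.pathway, hyperg.test,
              significant.genes = significant.genes,
              all.genes = all.geneIDs)
dim(res)   # 2 x 2: one column per pathway
t(res)     # one row per pathway
```

Running hyperg.test(genes.by.pathway[[1]], significant.genes, all.geneIDs) by hand gives the first column of res, which is a quick way to check what sapply is doing.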

Variables created within a function not getting stored in data frame

I am writing a function to create some predicted variables within an existing data set that I am using to run some ML models. My function looks like this:
doall <- function(x1, x2){
J48 <- J48(ML, data=df1)
#summary(J48)
X1 <- predict(J48, df1, type="class")
X2 <- predict(J48, df2, type="class")
#return(X1)
}
doall(df1$DT_predict, df2$DT_predict1)
J48 is a decision tree model (via RWeka). The code works (doall(df1$DT_predict1, df2$DT_predict1)) properly, I believe, because when I include the return function, it returns the values of X1. However, the predicted variables are not getting generated/stored in the data frames (df1 and df2). Ideally, I would like to have the dataframe names within the function, but that's the next step.
Can someone show how I can store the variables X1 and X2 within data frames df1 and df2 respectively?
Ideally your question would have a bit more information about what your data frames look like, what X1 and X2 look like, and where your data frames are stored. For my answer I am assuming your data frames are stored in the global environment, and you want to modify them through a function.
This question has to do with scoping. For an in-depth description of scoping check out this article http://adv-r.had.co.nz/Functions.html#lexical-scoping
First, by assigning your variables within a function you are assigning them in a local environment. This means that the variables you are assigning do not carry over into the global environment (what you see when you type ls()).
I believe you want to change a 'global variable' from within a function. This is done with the
<<-
operator
for instance
a <- 2
print(a)
returns 2
change_a<-function(x){
x<-x*4
}
change_a(a)
print(a)
still returns 2
while
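change_a<-function(x){
a<<-x*4
}
change_a(a)
print(a)
would return 8, because <<- assigns to the existing variable a in the enclosing (here global) environment. Note that writing x<<-x*4 would instead create a global variable x and leave a at 2.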
I think you want to use the <<- operator instead of <- to accomplish what you want.
On a related note, it is not generally considered to be best practices to assign and change global variables from within a function.
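If you would rather avoid <<-, a common alternative is to have the function return the modified data frames and reassign them in the caller. A sketch under assumptions: add_predictions is a hypothetical helper, and an lm fit on the built-in mtcars data stands in for the J48 model so the example runs without RWeka.

```r
# Hypothetical helper: fit on train, return both data frames with predictions
add_predictions <- function(train, test) {
  fit <- lm(mpg ~ wt, data = train)        # stand-in for the J48 model
  train$pred <- predict(fit, newdata = train)
  test$pred  <- predict(fit, newdata = test)
  list(train = train, test = test)         # return the modified copies
}

df1 <- mtcars[1:20, ]
df2 <- mtcars[21:32, ]

out <- add_predictions(df1, df2)
df1 <- out$train    # reassign in the caller instead of using <<-
df2 <- out$test
```

This also covers the "next step" in the question, since the data frames are passed in as arguments rather than hard-coded inside the function.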

Extract Group Regression Coefficients in R w/ PLYR

I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
The issue is not with plyr but rather in the definition of the function. You are calling a function, but not doing anything with the variable.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage should conceivably serve as variables.
But you want your x to vary over what comes before the $. As @Joran correctly pointed out, swap Sealed$hhincome with df$hhincome (and same for $squ..) and that will help.
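The same fix can be sketched in base R, which makes it easy to check that each group gets its own coefficients. The data below are simulated stand-ins for SEALED.csv (three zip codes with deliberately different slopes), and fit_one and slope are hypothetical names.

```r
# Made-up data standing in for the Sealed data frame
set.seed(7)
Sealed <- data.frame(zipcode = rep(c("10001", "10002", "10003"), each = 10),
                     square_footage = runif(30, 500, 3000),
                     stringsAsFactors = FALSE)
slope <- c("10001" = 10, "10002" = 20, "10003" = 30)
Sealed$hhincome <- 20000 +
  unname(slope[Sealed$zipcode]) * Sealed$square_footage +
  rnorm(30, sd = 500)

# The function uses its argument df, so each fit sees only one zip code
fit_one <- function(df) lm(hhincome ~ square_footage, data = df)

regressions  <- lapply(split(Sealed, Sealed$zipcode), fit_one)
coefficients <- t(sapply(regressions, coef))   # one row per zip code
coefficients
```

With plyr, the same fit_one should be usable as dlply(Sealed, .(zipcode), fit_one) followed by ldply(regressions, coef).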
