I'm trying to perform multiple imputation on a dataset in R where I have two variables, one of which needs to be the same or greater than the other one. I have set up the method and the predictive matrix, but I am having trouble understanding how to configure the post-processing. The manual (or main paper - van Buuren and Groothuis-Oudshoorn, 2011) states (section 3.5): "The mice() function has an argument post that takes a vector of strings of R commands. These commands are parsed and evaluated just after the univariate imputation function returns, and thus provide a way to post-process the imputed values." There are a couple of examples, of which the second one seems most useful:
R> post["gen"] <- "imp[[j]][p$data$age[!r[,j]]<5,i] <- levels(boys$gen)[1]"
this suggests to me that I could do:
R> ini <- mice(cbind(boys), max = 0, print = FALSE)
R> post["A"] <- "imp[[j]][p$data$B[!r[,j]]>p$data$A[!r[,j]],i] <- levels(boys$A)[boys$B]"
However, this doesn't work (when I plot A v B, I get random scatter rather than the points being confined to one half of the graph where A >= B).
I have also tried using the ifdo() function, as suggested in another Stack Overflow post:
post["A"] <- "ifdo(A < B), B"
However, it seems the ifdo() function is not yet implemented. I tried running the code suggested there for inspiration, but I'm afraid my R programming skills are not that brilliant.
So, in summary, has anyone any advice about how to implement post-processing in mice such that value A >= value B in the final imputed datasets?
Ok, so I've found an answer to my own question - but maybe this isn't the best way to do it.
In FIMD, there is a suggestion to do this kind of thing outside the imputation process, which thus gives:
R> long <- mice::complete(imp, "long", include = TRUE)
R> long$A <- with(long, ifelse(A < B, B, A))
This seems to work, so I'm happy.
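If you still want a mids object afterwards (so that with() and pool() keep working), mice provides as.mids() to convert the long data back. A minimal sketch, assuming imp is an existing mids object and A and B are the two constrained variables:

```r
library(mice)

long <- mice::complete(imp, "long", include = TRUE)  # long format, original data included
long$A <- with(long, ifelse(A < B, B, A))            # enforce A >= B in every imputation
imp_fixed <- as.mids(long)                           # convert back to a mids object
# analyses then proceed as usual, e.g. pool(with(imp_fixed, lm(...)))
```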
I am having some doubt regarding the calculation of MSE in R.
I have tried two different ways and I am getting two different results. Wanted to know which one is the correct way of finding mse.
First:
model1 <- lm(data=d, x ~ y)
rmse_model1 <- mean((d - predict(model1))^2)
Second:
mean(model1$residuals^2)
In principle, they should give you the same result. But in the first option, you should use d$x. If you just use d, the recycling rule in R will repeat predict(model1) twice (since d has two columns) and the computation will also involve d$y.
Note that it is recommended to pass na.rm = TRUE to mean, and newdata = d to predict, in the first option. This makes your code robust to missing values in your data. On the other hand, you don't need to worry about NA in the second option, as lm automatically drops NA cases. You may have a look at this thread for the potential effect of this feature: Aligning Data frame with missing values.
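A minimal sketch with a toy data frame (column names x and y as in the question) showing that, once corrected, the two computations agree even with a missing value present:

```r
# toy data with one missing response value
d <- data.frame(x = c(1, 2, 3, 5, NA), y = c(2, 4, 6, 8, 10))
model1 <- lm(x ~ y, data = d)

# first way, corrected: compare d$x (not d) with predictions on d itself
mse1 <- mean((d$x - predict(model1, newdata = d))^2, na.rm = TRUE)

# second way: lm has already dropped the row with the missing x
mse2 <- mean(model1$residuals^2)

all.equal(mse1, mse2)   # TRUE
```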
I would like to perform pathway enrichment analyses.
I have 21 lists of significant genes, and multiple types of pathways I would like to check (i.e. check for enrichment in KEGG pathways, GO terms, complexes, etc.).
I found this example of code, on an old BioC post. However, I am having trouble adapting it for myself.
Firstly,
1 - What does this mean? I don't recognize this multiple-colon syntax.
hyperg <- Category:::.doHyperGInternal
2 - I don't understand how this line works. hyperg.test is a function that needs 3 arguments passed to it, correct? Is this line somehow passing genes.by.pathway, significant.genes, and all.geneIDs to the hyperg.test?
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
Code that I would like to adapt
library(KEGGREST)
library(org.Hs.eg.db)
# created named list, length 449, eg:
# path:hsa00010: "Glycolysis / Gluconeogenesis"
pathways <- keggList("pathway", "hsa")
# make them into KEGG-style human pathway identifiers
human.pathways <- sub("path:", "", names(pathways))
# for demonstration, just use the first ten pathways
demo.pathway.ids <- head(human.pathways, 10)
demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids)
genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) {
    demo.pathway$GENE[c(TRUE, FALSE)]
})
all.geneIDs <- keys(org.Hs.eg.db)
# choose one of these for demonstration. The first (a random set of 100
# genes from the whole genome) has very little enrichment; the second, a
# random set from the pathways themselves, has very good enrichment in
# some pathways
set.seed(123)
significant.genes <- sample(all.geneIDs, size=100)
#significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10)
# the hypergeometric distribution is traditionally explained in terms of
# drawing a sample of balls from an urn containing black and white balls.
# to keep the arguments straight (in my mind at least), I use these terms
# here also
hyperg <- Category:::.doHyperGInternal
hyperg.test <- function(pathway.genes, significant.genes, all.genes, over=TRUE)
{
    white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
    white.balls.in.urn <- length(pathway.genes)
    total.balls.in.urn <- length(all.genes)
    black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
    balls.pulled.from.urn <- length(significant.genes)
    hyperg(white.balls.in.urn, black.balls.in.urn,
           balls.pulled.from.urn, white.balls.drawn, over)
}
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
print(pVals.by.pathway)
The reason you are getting your error is that it appears you don't have the Category package from Bioconductor installed. I suspect this because of the triple-colon operator :::. This operator is very similar to the double-colon operator ::. Whereas :: lets you access exported objects from a package without loading it, ::: also allows access to non-exported objects (in this case the .doHyperGInternal function from Category). If you install the Category package, the code runs without error.
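A small illustration of the difference, using a base package rather than Category (print.lm is an S3 method that stats defines but does not export):

```r
stats::sd(c(2, 4, 6))    # sd is exported from stats, so '::' works
pm <- stats:::print.lm   # print.lm is unexported, so it needs ':::'
# stats::print.lm        # would throw an error: object not exported
```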
With regard to the sapply statement:
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
You can break this down into separate parts to understand it. Firstly, sapply iterates over the elements of genes.by.pathway and passes each one to the first argument of hyperg.test. The remaining arguments are the two additional parameters. This is a little unclear, and I personally recommend naming the parameters explicitly: it avoids unexpected surprises and removes the dependence on argument order. That is a little repetitive in this case, but a good way to avoid a silly bug (e.g. putting significant.genes after all.geneIDs).
Rewritten:
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes=significant.genes, all.genes=all.geneIDs))
Once this loop completes, the sapply function simplifies the output into a matrix. The output is much more user-friendly, however, after taking the transpose with t().
Generally speaking, when trying to understand complex apply statements, I find it best to break them apart into smaller pieces and see what the objects themselves look like.
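For example, applying that advice to the statement above (assuming the objects from the question's code are in the workspace), you could first run the test on a single pathway, then make the iteration explicit with a plain loop before condensing it back into sapply:

```r
# one call, one pathway: see what a single result looks like
hyperg.test(genes.by.pathway[[1]], significant.genes, all.geneIDs)

# the same iteration written out as an explicit loop
results <- vector("list", length(genes.by.pathway))
names(results) <- names(genes.by.pathway)
for (pw in names(genes.by.pathway)) {
    results[[pw]] <- hyperg.test(genes.by.pathway[[pw]],
                                 significant.genes = significant.genes,
                                 all.genes = all.geneIDs)
}
```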
Most of the time there is more than one way to implement a solution to a specific problem, and hence there are bad solutions and good solutions. I consider robust implementations to be the ones that use for loops, while statements, lists, or any other functions and built-in types that make our life easier.
I am looking forward to seeing and understanding some examples of higher-level programming in R.
Assume a task like the following.
#IMPORT DATASET
Dataset <- read.table("blablabla\\dataset.txt", header=T, dec=".")
#TRAINING OF MODEL
Modeltrain <- lm(temperature~latitude+sea.distance+altitude, data=Dataset)
#COEFFICIENT VALUES FOR INDEPENDENT VARIABLES
Intercept <- summary(Modeltrain)$coefficients[1]
Latitude <- summary(Modeltrain)$coefficients[2]
Sea.distance <- summary(Modeltrain)$coefficients[3]
Altitude <- summary(Modeltrain)$coefficients[4]
#ASK FOR USER INPUT AND CALCULATE y
i <- 1
while (i == 1){
    #LATITUDE (Xlat)
    cat("Input latitude value please: ")
    Xlat <- readLines(con="stdin", 1)
    Xlat <- as.numeric(Xlat)
    cat(Xlat, "is the latitude value. \n")
    #LONGITUDE (Xlong)
    #CALCULATE DISTANCE FROM SEA (Xdifs)
    cat("Input longitude value please: ")
    Xlong <- readLines(con="stdin", 1)
    Xlong <- as.numeric(Xlong)
    #cat(Xlong, "\n")
    Xdifs <- min(4-Xlong, Xlat)
    cat(Xdifs, "is the calculated distance from sea value. \n")
    #ALTITUDE (Xalt)
    cat("Input altitude value please: ")
    Xalt <- readLines(con="stdin", 1)
    Xalt <- as.numeric(Xalt)
    cat(Xalt, "is the altitude value. \n")
    y <- Intercept + Latitude*Xlat + Sea.distance*Xdifs + Altitude*Xalt
    cat(y, "is the predicted temperature value.\n")
}
First of all, I would like to ask how, instead of blablabla\\dataset.txt, to set a path in a way that makes the script functional on other OSes too.
Second question: how do I automate the above process to include additional X variables, without having to add them manually in the script?
I understand the latter question probably means rewriting the whole thing, so I don't expect a full answer. As I said before, I am more interested in understanding how it could be done and doing it myself.
Thanks.
P.S. Please don't ask for a reproducible example; I can't provide much else.
For the first question, you may want to look at the file.path command. For the second, I would approach this by defining, outside the while loop, two lists, one to store the prompts (e.g. list(lat="Please enter Latitude")) and another, with identical names, to store the input values. Then another loop inside the while iterates through the names of the first list, produces the relevant prompt, and stores the response in the named slot in the second list.
If your users are happy interacting with R in such a way, then you're lucky. Else, as @Roland suggests, delegate the UI to some other technology.
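A hedged sketch of both suggestions (the prompt texts and names below are illustrative, not from the original script):

```r
# first question: file.path builds paths with the correct separator per OS
Dataset <- read.table(file.path("data", "dataset.txt"), header = TRUE, dec = ".")

# second question: prompts and answers held in two parallel named lists
prompts <- list(lat  = "Input latitude value please: ",
                long = "Input longitude value please: ",
                alt  = "Input altitude value please: ")
answers <- setNames(vector("list", length(prompts)), names(prompts))

# one inner loop replaces the repeated cat()/readLines() blocks
for (nm in names(prompts)) {
    cat(prompts[[nm]])
    answers[[nm]] <- as.numeric(readLines(con = "stdin", 1))
}
# answers$lat, answers$long, answers$alt now hold the numeric inputs
```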
I'm using the GillespieSSA package for R, which allows you to call a single instance of a stochastic simulation easily and quickly. But a single realization isn't super-useful, I'd like to be able to look at the variability that arises from chance, etc. Consider a very basic model:
library(GillespieSSA)
x0 <- c(S=499, I=1, R=0)
a <- c("0.001*{S}*{I}","0.1*{I}")
nu <- matrix(c(-1,0,
+1,-1,
0, +1),nrow=3,byrow=T)
out <- ssa(x0, a, nu, tf=100)
That puts out a profoundly complicated list, the interesting bits of which are in out$data.
My question is: how can I grab out$data for a single call of the function, tag it with a variable indicating which run it came from, and then append that data to the previous data so I end up with one big set? In crude pseudo-R, something like:
nruns <- 10
for (i in 1:nruns){
    out <- ssa(x0, a, nu, tf=100)
    data <- out$data
    run <- rep(i, times = length(data[,2]))
    data <- cbind(data, run)
}
Except that data should accumulate rather than be overwritten on each iteration. I feel like I'm close, but having hopped between several languages this week, my R loop fu, weak as it already is, is failing.
I am not sure if I understood you correctly. Do you want to do something like the following?
out <- lapply(X=1:10,FUN=function(x) ssa(x0, a, nu, tf=100)$data)
This would do 10 runs and put the resulting data matrices in a list. You could then access, e.g., the data from the second run with out[[2]].
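If you then want everything in one data set with a run column, as in the question's pseudocode, you can tag each piece before stacking. A sketch, assuming x0, a, and nu from the model above:

```r
runs <- lapply(1:10, function(i) {
    d <- ssa(x0, a, nu, tf = 100)$data
    cbind(d, run = i)              # tag every row of this run with its number
})
all.data <- do.call(rbind, runs)   # one big matrix, runs stacked vertically
```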
I'm using the library poLCA. To use the main command of the library one has to create a formula as follows:
f <- cbind(V1,V2,V3)~1
After this a command is invoked:
poLCA(f,data0,...)
V1, V2, V3 are the names of variables in the dataset data0. I'm running a simulation and I need to change the formula several times. Sometimes it has 3 variables, sometimes 4, sometimes more.
If I try something like:
f <- cbind(get(names(data0)[1]),get(names(data0)[2]),get(names(data0)[3]))~1
it works fine. But then I have to know in advance how many variables I will use. I would like to define an arbitrary vector
vars0 <- c(1,5,17,21)
and then create the formula as follows
f<- cbind(get(names(data0)[var0]))
Unfortunately I get an error. I suspect the answer may involve some form of apply, but I still don't understand very well how these functions work. Thanks in advance for any help.
Using data from the examples in ?poLCA this (possibly hackish) idiom seems to work:
library(poLCA)
vec <- c(1,3,4)
M4 <- poLCA(do.call(cbind,values[,vec])~1,values,nclass = 1)
Edit
As Hadley points out in the comments, we're making this a bit more complicated than we need. In this case values is a data frame, not a matrix, so this:
M1 <- poLCA(values[,c(1,2,4)]~1,values,nclass = 1)
generates an error, but this:
M1 <- poLCA(as.matrix(values[,c(1,2,4)])~1,values,nclass = 1)
works fine. So you can just subset the columns as long as you wrap it in as.matrix.
@DWin mentioned building the formula with paste and as.formula. I thought I'd show you what that would look like using the election dataset.
library("poLCA")
data(election)
vec <- c(1,3,4)
f <- as.formula(paste("cbind(",paste(names(election)[vec],collapse=","),")~1",sep=""))
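The resulting f can then be passed straight to poLCA; note that the exact variable names printed depend on the column order of election:

```r
print(f)   # something like cbind(MORALG, KNOWG, LEADG) ~ 1
M1 <- poLCA(f, election, nclass = 1)
```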