Mice in R - how can I understand what this command does? - r

mice_mod <-
mice(titanicData[, !names(titanicData) %in%
c('PassengerId','Name','Ticket','Cabin','Survived')],
method='rf')
mice_output <- complete(mice_mod)
I am new to R and we had a college lecture yesterday. What does this command do? I have read the online documentation and broken the command down into a series of outputs, with no joy.

The mice function imputes (approximates) missing values. In your case you are using method='rf', which means the random forest imputation algorithm is used. Since I can't reproduce your dataset, I'm using airquality, a dataset built into R that contains NA values, which can be imputed. With mice you are creating something like a prediction model. Strictly speaking it is a mids object, the class mice uses for multiply imputed datasets (see the documentation). If you want to use those imputations, you can call complete to create the filled data frame.
library(mice)
df<-airquality
mice_mod <- mice(df, method='rf')
mice_output <- complete(mice_mod)
When you compare df and mice_output, you'll see the NA values in Ozone and Solar.R got replaced.
In your example your lecturer is using all columns whose names are not in the given list of names. So he is dropping those columns from the data frame before imputing.
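To make the column selection concrete, here is a minimal sketch on a toy data frame. The data frame toy and its values are made up for illustration; only the column names and the subsetting expression come from the question.

```r
# Toy stand-in for titanicData (hypothetical values)
toy <- data.frame(
  PassengerId = 1:3,
  Name        = c("A", "B", "C"),
  Age         = c(22, NA, 35),
  Fare        = c(7.25, 71.28, NA),
  Survived    = c(0, 1, 1)
)

drop_cols <- c("PassengerId", "Name", "Ticket", "Cabin", "Survived")

# TRUE for every column whose name is NOT in drop_cols
keep <- !names(toy) %in% drop_cols
keep

# Same subsetting as in the question: keep only those columns
predictors <- toy[, keep]
names(predictors)
```

Names in the list that do not exist in the data frame (here Ticket and Cabin) are simply ignored, so only Age and Fare remain and would be passed on to mice.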
If you want more information about the algorithm: according to the documentation, it is described in
Doove, L.L., van Buuren, S., Dusseldorp, E. (2014), Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, 92-104.

Related

Amelia correlation analysis

I want to perform a correlation analysis with imputed datasets from the original dataset "freetrade" from Amelia package.
So first I loaded the data and created multiple datasets with amelia function:
library(Amelia)
library(dplyr) # provides %>% and select()
data <- freetrade %>%
select(c("country", "tariff", "pop", "gdp.pc", "intresmi", "fiveop", "usheg"))
am <- amelia(data, m=5, idvars=1)
Now I would like to compute the correlations between tariff, pop and gdp.pc. I found absolutely nothing on the Internet about how to do this, apart from micombine.cor(), which works with mice objects.
I tried transforming the imputed data sets "am" into the data type mids, since micombine.cor() only takes the data type mids:
as.mids(am)
but this only produces the error: "Imputation index .imp not found"
Do you have any methods on how to perform the correlation analysis? I would be very grateful!
You need to read the manual page for Amelia, particularly the part that tells how amelia returns its results. Trying the examples is also very useful. The example on the manual page uses the data set africa which is included in the package and seems roughly similar to yours:
am <- amelia(africa[, 3:7]) # Just using the numeric variables
cor(am$imputations[[1]]) # For the first imputed data set
lapply(am$imputations, cor) # For all five imputed data sets
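If a single pooled correlation is wanted rather than one per imputed data set, one common approach is to average on the Fisher z scale and transform back. The sketch below is only an illustration under assumptions: imps is a simulated stand-in for am$imputations, and it gives a point estimate only, without standard errors.

```r
set.seed(1)
# Hypothetical stand-in for am$imputations: a list of m = 5 completed data frames
imps <- lapply(1:5, function(i) {
  pop <- rnorm(50)
  data.frame(pop = pop, tariff = 0.5 * pop + rnorm(50))
})

# Correlation between tariff and pop within each imputed data set
r_by_imp <- sapply(imps, function(d) cor(d$tariff, d$pop))

# Pool on the Fisher z scale, then transform back to a correlation
pooled_r <- tanh(mean(atanh(r_by_imp)))
pooled_r
```

With real Amelia output, the same two steps would be applied to am$imputations in place of imps.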

How to use Dismo's predict() with a maxent model based on a dataframe

I am trying to figure out how dismo's predict function operates in terms of a model built with 'x' as a dataframe, rather than raster layers. I have successfully run models using raster layers and made prediction maps based on this.
My model is built as follows;
library(dismo)
model <- maxent(x = sightings.data, p = presence.vector)
with sightings.data being a dataframe containing the GPS locations of sightings, followed by the conditions at these times and locations. presence.vector is a vector indicating if a row is a presence or background point.
I am looking to find out;
What arguments to supply to predict given a model of this type
What predict() is capable of providing from a model such as this
The help file for predict() is not particularly detailed and the 'Species distribution modelling with R' does not successfully cover this topic (the examples just list 'cannot run this example because maxent is not available' outputs).
I have tried modelling with a dataframe containing only variables I have raster layers for, and tried predicting as I would for a model built with rasters, but I get the following error;
Error in .local(object, ...) : missing layers (or wrong names)
I have ensured the dataframe column names and the raster layers have the same names, excluding the mandatory latitude and longitude columns;
names(raster.stack) <- colnames(sightings.data[3:5])
The method I found in the code available from the following paper (Oppel et al. 2012) demonstrates that dismo's predict can produce relative values when provided with a dataframe of input variables.
> predictions <- predict(model, variables)
> str(predictions)
num [1:100] 0.635 ...
I'm still looking for an easy method to create a predicted distribution raster map from such predicted values.
If you provide dismo::maxent a dataframe, the function will recognize the first column as longitude and the second column as latitude. If the data do not follow this format, the function will not work.
In this format the sightings data does not need to include the GPS locations, so you can remove the x & y columns from sightings.data. Then you can run the model and predict to a raster stack whose layer names are identical to the column names in sightings.data.
Predict was looking for the GPS locations in your raster stack, which I'm guessing were not there.

how to find differentially methylated regions (for example with probe lasso in Champ) based on regression continuous variable ~ beta (with CpGassoc)

I ran 450K Illumina methylation chips on human samples, and want to search for the association between a continuous variable and beta, adjusted for other covariates. For this, I used the CpGassoc package in R. I would also like to search for differentially methylated regions (DMRs) based on the significant CpG sites. However, the probe lasso function in the Champ package, and also other packages for 450K DMR analyses, always assume 2 groups between which DMRs need to be found. I do not have 2 groups, only this continuous variable. Is there a way to load my output from CpGassoc into the probe lasso function from Champ? Or into another bump hunter package? I'm an MD, not a bioinformatician, so comb-p, etc. would not be possible for me.
Thank you very much for your help.
Kind regards,
Line
I have not worked with methylation data before, so take what I say with a grain of salt. Also, don't use acronyms without describing them; I'm guessing most people on this site don't know what a DMR is.
You could use the lasso from the glmnet package on your data. Suppose your continuous variable is age, and meth.dt is your methylation data.table, with one column per site holding the amount of methylation and one row per subject. I'm not sure whether methylation data is considered to be Poisson; I know RNA-seq data is. I can't get too specific, but the following code should work after adjusting it to your number of columns:
#load libraries
library(data.table)
library(glmnet)
#read in data
meth.dt <- fread("/data")
#lasso
AgeLasso <- glmnet(as.matrix(meth.dt[, 1:70999, with = FALSE]), meth.dt$Age, family = "poisson")
cv.AgeLasso <- cv.glmnet(as.matrix(meth.dt[, 1:70999, with = FALSE]), meth.dt$Age, family = "poisson")
#keep only the non-zero coefficients at lambda.1se
lasso.coefs <- coef(cv.AgeLasso, s = "lambda.1se")[, 1]
coefTranscripts <- lasso.coefs[lasso.coefs != 0]
This will give you the methylation sites that are the best predictors of your continuous variable using a parsimonious model. For additional info about glmnet see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
You might also want to ask the people over at Cross Validated. They may have some better answers. http://stats.stackexchange.com
What is your continuous variable just out of curiosity?
Let me know how you ended up solving it if you don't use this method.

How to use sample weights in R

I am planning to fit a Multi-Group Confirmatory Factor Analysis about views on ethical matters. I will compare people from the regions of Wallonia and Flanders in Belgium. My two samples need to be weighted in order to be representative of their populations in terms of age, gender, education and party choice.
Sampling weights were already provided in my dataset. I then created a variable wreg, combining weights for respondents from Wallonia and Flanders.
I am new to R, and read documentation about lavaan.survey and svydesign to learn about the code. However, I haven't yet succeeded in writing something correct. I always get error messages about the part concerning weights. Apparently the programme cannot read the sampling weights variable right.
Here is the code I used:
library(lavaan.survey)
f <- "C:/.../bges07_small.csv"
s <- read.csv(f,sep=";")
r <- s[is.na(s$flawal),]
rDesign <- svydesign(ids=~1, data=r, weights=~wreg)
model.1 <- 'ethic =~ q96_1+ q96_2 +q96_3'
fit <- cfa(model.1, data=r,ordered=c("q96_1","q96_2","q96_3"))
summary(fit, fit.measures=TRUE, modindices=FALSE,standardized=FALSE)
And this is the error message I had:
Error in 1/as.matrix(weights) :
non-numeric argument to binary operator
Any suggestion on how I should write my model with R? Thanks a lot!
From the results of summary(r$wreg), it looks like your weights column is a factor, and not a numeric vector. Make sure you've read your data in correctly and that the column doesn't contain any character-like values. You can manually convert it (going through character, because as.numeric() applied directly to a factor returns the level codes rather than the values) with
r$wreg <- as.numeric(as.character(r$wreg))
before running your model. Also, those look like very large weight values. Are you sure they are correct?
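A minimal sketch of why the conversion route matters when a weights column has ended up as a factor. The values below are made up. The decimal-comma case at the end is only one possible cause (an assumption, since the original file is not shown) of a numeric column in a semicolon-separated file being read as non-numeric.

```r
# Hypothetical weights column that was read in as a factor
wreg <- factor(c("0.85", "1.20", "0.85", "2.10"))

# as.numeric() on a factor returns the integer level codes, not the values
as.numeric(wreg)                 # 1 2 1 3

# Converting via character recovers the original numbers
as.numeric(as.character(wreg))   # 0.85 1.20 0.85 2.10

# If the file uses decimal commas, tell read.csv so the column is parsed as numeric
csv_text <- "id;wreg\n1;0,85\n2;1,20"
s <- read.csv(text = csv_text, sep = ";", dec = ",")
str(s$wreg)
```

Level codes that are mistaken for weights can look implausibly large, so it is worth checking summary() of the column after converting.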

coxph stratified by year

I think this should be something very easy, but I can't quite get my head around it.
I have the following code:
library(survival)
cox <- coxph(Surv(SURV, DEAD)~YEAR, data)
summary(cox)
but I would like to have the result split down into the individual years.
Here's what the SPSS syntax and solution would look like:
COXREG surv /STATUS=dead(1) /CONTRAST (year)=Indicator(1)
/METHOD=ENTER year /PRINT=CI(95)
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20).
EXECUTE.
and the same thing in STATA:
xi: stcox i.year
Here's the output of
str(data)
You did not show us str(data) or how to construct a reproducible example that gave "data". I suspect that "YEAR" will turn out to be a numeric vector. If it had been a factor variable you would have seen n-1 coefficients, one for each year other than the reference year (a Cox model has no intercept term, so the reference year is absorbed into the baseline hazard). You told the SPSS engine that "year" was an "INDICATOR" but you didn't offer the same courtesy to the R engine.
Try this:
data$year.ind <- factor(data$YEAR) # equivalent of SPSS INDICATOR
# or SAS /CLASS
cox.mdl <- coxph(Surv(SURV, DEAD) ~ year.ind, data = data)
as.matrix(coef(cox.mdl))
summary(cox.mdl)
R often splits computing and display of results to allow more freedom. I assume you need the predict function of coxph (?predict.coxph).
There are examples at the bottom of the documentation page, most likely you want
predict(cox, type="terms")
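To illustrate the difference between a numeric and a factor YEAR discussed in the first answer, here is a minimal sketch on simulated data. The data frame below is made up; only the column names SURV, DEAD and YEAR come from the question.

```r
library(survival)

set.seed(42)
# Simulated stand-in for the asker's data (hypothetical values)
data <- data.frame(
  SURV = rexp(300, rate = 0.1),               # follow-up time
  DEAD = rbinom(300, size = 1, prob = 0.7),   # 1 = event, 0 = censored
  YEAR = rep(c(2001, 2002, 2003), each = 100)
)

# Numeric YEAR: a single coefficient (log hazard ratio per one-year increase)
fit_num <- coxph(Surv(SURV, DEAD) ~ YEAR, data = data)

# Factor YEAR: one coefficient per year, relative to the first year
fit_fac <- coxph(Surv(SURV, DEAD) ~ factor(YEAR), data = data)

length(coef(fit_num))
names(coef(fit_fac))
exp(confint(fit_fac))  # 95% confidence limits for the hazard ratios
```

The factor version gives one row per non-reference year, which corresponds to the indicator-contrast output requested in the SPSS and Stata syntax.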
