Amelia correlation analysis - r

I want to perform a correlation analysis with imputed datasets from the original dataset "freetrade" from Amelia package.
So first I loaded the data and created multiple datasets with amelia function:
library(Amelia)
library(dplyr)  # needed for %>% and select()
data <- freetrade %>%
  select(c("country", "tariff", "pop", "gdp.pc", "intresmi", "fiveop", "usheg"))
am <- amelia(data, m = 5, idvars = 1)
Now I would like to compute correlations between tariff, pop and gdp.pc. I could not find anything on the Internet on how to do this, only micombine.cor() for the mice package.
I tried converting the imputed datasets "am" into a mids object, since micombine.cor() only accepts mids objects:
as.mids(am)
but this only produces the error: "Imputation index .imp not found"
Do you have any methods on how to perform the correlation analysis? I would be very grateful!

You need to read the manual page for Amelia, particularly the part that describes how amelia() returns its results. Trying the examples is also very useful. The example on the manual page uses the data set africa, which is included in the package and seems roughly similar to yours:
am <- amelia(africa[, 3:7]) # Just using the numeric variables
cor(am$imputations[[1]]) # For the first imputed data set
lapply(am$imputations, cor) # For all five imputed data sets
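If you also want a single pooled correlation estimate across the five imputed datasets, a common approach is to average the correlations on Fisher's z scale and transform back. This is only a minimal base-R sketch using the am object from above (the vars vector and the pooling-by-Fisher-z step are my own illustration, not part of the Amelia documentation); for pooled standard errors and p-values you would apply Rubin's rules, e.g. via micombine.cor() after converting the list of imputations to a mids object.
# Pool the tariff/pop/gdp.pc correlations across the imputations
vars <- c("tariff", "pop", "gdp.pc")
cors <- lapply(am$imputations, function(d) cor(d[, vars]))
z <- lapply(cors, atanh)                       # Fisher z-transform each matrix
pooled <- tanh(Reduce("+", z) / length(z))     # average on the z scale, transform back
pooled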

Related

Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data, so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal).
From what I've read online, Amelia is the best option for panel data so I used that like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country", polytime = 1,
                      intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
  stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
  imps <- amelia_data$imputations[[which_imp]]
  as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
library(rpart)
library(rpart.plot)
set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75 * nrow(tree_data), replace = FALSE)
I_train <- tree_data[I_index, ]
I_test <- tree_data[-I_index, ]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple of resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets to be combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that lets you quantify the uncertainty that the missing data adds to your analysis.
All of the datasets created by multiple imputation are plausible imputations; because of that uncertainty, you don't know which one is correct.
You would therefore use multiple imputation the following way:
Create your m imputed datasets
Build your tree on each imputed dataset separately
Do your analysis on each tree separately
In your final paper, you can then state how much uncertainty is caused by the missing values/imputation
This means you get, e.g., 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. And if you get completely different results for each imputed dataset, you know there is too much uncertainty caused by the missing values to give reliable results.
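A minimal sketch of that workflow, continuing from the amelia_data object above (illustrative only; the formula mirrors the question, and whether you also want a train/test split inside each imputation is a separate decision):
library(rpart)
# Fit one tree per imputed dataset, then compare them side by side.
trees <- lapply(amelia_data$imputations, function(d) {
  rpart(total_fatal ~ managerial_value, data = d)
})
lapply(trees, print)
lapply(trees, function(t) t$variable.importance)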

Mice in R - how can I understand what this command does?

mice_mod <-
mice(titanicData[, !names(titanicData) %in%
c('PassengerId','Name','Ticket','Cabin','Survived')],
method='rf')
mice_output <- complete(mice_mod)
I am new to R and we had a college lecture yesterday. What does this command do? I have read the online documentation and broke down the command to a series of outputs, with no joy.
The mice function imputes (approximates) missing values. In your case you are using method = 'rf', which means the random forest imputation algorithm is used. Since I can't reproduce your dataset, I'm using airquality, a built-in R dataset with NA values, which can be imputed the same way. What mice builds is essentially a set of imputation models; technically it returns a mids object, the class mice uses for multiply imputed datasets (see the documentation). If you want to use those imputations, you call complete() to create the filled data frame.
library(mice)
df<-airquality
mice_mod <- mice(df, method='rf')
mice_output <- complete(mice_mod)
When you compare df and mice_output, you'll see the NA values in Ozone and Solar.R got replaced.
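A quick way to check this (a minimal sketch I'm adding, not part of the original answer):
# Count missing values before and after imputation
colSums(is.na(df))
colSums(is.na(mice_output))   # should be all zero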
In your example, your lecturer is selecting all columns whose names are not in the given list of names, so he is filtering the data frame before passing it to mice.
If you want more information about the algorithm: according to the documentation it is described in
Doove, L. L., van Buuren, S., Dusseldorp, E. (2014), Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, 92-104.

How to use Dismo's predict() with a maxent model based on a dataframe

I am trying to figure out how dismo's predict function operates in terms of a model built with 'x' as a dataframe, rather than raster layers. I have successfully run models using raster layers and made prediction maps based on this.
My model is built as follows:
library(dismo)
model <- maxent(x = sightings.data, p = presence.vector)
with sightings.data being a dataframe containing the GPS locations of sightings, followed by the conditions at these times and locations. presence.vector is a vector indicating if a row is a presence or background point.
I am looking to find out:
What arguments to supply to predict given a model of this type
What predict() is capable of providing from a model such as this
The help file for predict() is not particularly detailed, and the 'Species distribution modelling with R' vignette does not really cover this topic (the examples just list 'cannot run this example because maxent is not available' outputs).
I have tried modelling with a dataframe containing only variables I have raster layers for, and tried predicting as I would for a model built with rasters, but I get the following error:
Error in .local(object, ...) : missing layers (or wrong names)
I have ensured the dataframe column names and the raster layers have the same names, excluding the mandatory latitude and longitude columns:
names(raster.stack) <- colnames(sightings.data[3:5])
The method I have found, from the code available with the paper Oppel et al. 2012, demonstrates that dismo's predict() can produce relative values when provided with a data frame of input variables.
> predictions <- predict(model, variables)
> str(predictions)
num [1:100] 0.635 ...
I'm still looking for an easy method to create a predicted distribution raster map from such predicted values.
If you provide dismo::maxent a data frame, the function will treat the first column as longitude and the second column as latitude. If the data do not follow this format, the function will not work.
In fact the sightings data do not need to include the GPS locations, so you can remove the x and y columns from sightings.data. Then you can run the model and predict to a raster stack whose layer names are identical to the column names of sightings.data.
predict() was looking for the GPS locations in your raster stack, which I'm guessing were not there.
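A minimal sketch of that workflow (illustrative only; it assumes the MaxEnt Java dependency is installed, that columns 3-5 of sightings.data are the environmental variables as in the question, and that raster.stack holds the corresponding predictor layers):
library(dismo)
library(raster)
# Drop the lon/lat columns so the model is trained on environmental variables only
env.data <- sightings.data[, 3:5]
model <- maxent(x = env.data, p = presence.vector)
# Layer names must match the training column names exactly
names(raster.stack) <- colnames(env.data)
prediction.map <- predict(model, raster.stack)   # returns a RasterLayer
plot(prediction.map)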

glm function not taking correct dataset

I have just started learning R and am working on a dataset (called ABC) which has 1470 cases. Using as.factor, I have converted the categorical variables to factors.
Dept_1 <- as.factor(ABC$Dept)
Education_1 <- as.factor(ABC$Education)
BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
After that I have split the dataset into train and test. The number of cases for both the train and test data seems correct. Then I fit a glm using the syntax below:
fit = glm(attrition~Dept_1+Education_1+BusinessTravel_1,binomial(link="logit"),train)
The fit runs, but it gets executed on the entire dataset ABC (1470 cases) instead of the train dataset (1028 records).
I am not able to understand what the issue is.
When you do this:
Dept_1 <- as.factor(ABC$Dept)
Education_1 <- as.factor(ABC$Education)
BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
you're actually creating three new variables in your global environment, not in your original data frame ABC. Because of this, when you split ABC into training and test samples, the new variables won't be affected.
When you go to fit the model, your glm call
fit = glm(attrition~Dept_1+Education_1+BusinessTravel_1,binomial(link="logit"),train)
will look for the variables listed in the formula. It won't find them in the train dataset, but it will find them in the global environment. That's why they have the original length.
What you probably wanted is
ABC$Dept_1 <- as.factor(ABC$Dept)
ABC$Education_1 <- as.factor(ABC$Education)
ABC$BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
which will create the variables in the data frame ABC.
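Putting it together, a minimal sketch of the corrected workflow (the split below is illustrative; train_index, the seed, and the 70/30 proportion are my assumptions, not from the question):
# Convert the categorical variables inside ABC itself
ABC$Dept_1 <- as.factor(ABC$Dept)
ABC$Education_1 <- as.factor(ABC$Education)
ABC$BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
# Split after the conversion, so train/test carry the new factor columns
set.seed(123)
train_index <- sample(seq_len(nrow(ABC)), size = 0.7 * nrow(ABC))
train <- ABC[train_index, ]
test  <- ABC[-train_index, ]
# Name the data argument so the formula is resolved inside train
fit <- glm(attrition ~ Dept_1 + Education_1 + BusinessTravel_1,
           family = binomial(link = "logit"), data = train)
summary(fit)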

Using survfit object's formula in survdiff call

I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset in the survival package from R. So the following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is around step 3. What I would like to do here is to be able to use information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves: i.e. in the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt to do this: I'd appreciate any help that you can give! Once I can get this sorted out I should be able to wrap a solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in a comment on the question, the formula can be extracted more directly from the survfit object with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!
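For completeness, a minimal sketch of the simplified workflow (my own illustration; it uses the data= argument rather than with()/$ so the extracted formula resolves cleanly in both calls):
library(survival)
# Fit the survival curves once, then reuse the fitted object's formula
# for the log-rank test
lung.survfit  <- survfit(Surv(time, status) ~ sex, data = lung)
lung.survdiff <- survdiff(formula(lung.survfit), data = lung)
print(lung.survfit)
plot(lung.survfit)
print(lung.survdiff)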
