How can I simply impute NA values in R using Amelia and then divide the data set into a training and testing set in a 70:30 split?

I want to impute values from a data set (14 variables, 200 observations) and then split it into a 70% training data set and a 30% testing data set.
Every time I work with Amelia to impute I get different types of error messages. I'm looking for the simplest way to have Amelia impute this entire data set.
colnames(mydata) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
mydata <- subset(mydata, select=-c(ca,thal))
I also get this error and I'm unsure of what it means:
Amelia Error Code: 36
The number of categories in the nominal variable 'chol' is greater than one-third of the observations.
Warning messages:
1: In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, :
The number of categories in one of the variables marked nominal has greater than 10 categories. Check nominal specification.

Amelia checks whether a variable marked as nominal has too many categories. It counts the unique values of the variable and compares that count to one-third of the number of rows.
For example, if you have 300 rows of data and more than 100 unique values in a column (excluding NAs), amelia will return this error. Imputing so many distinct values from so few records is nearly impossible; you might as well fill in random values. Either reconsider whether you need this column, get more data, or see if you can fill in the missing data based on domain knowledge.
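A quick way to run this check yourself before calling amelia(), using the mydata object from above:
# count unique non-NA values per column; anything above nrow(mydata)/3 will trip the check
sapply(mydata, function(x) length(unique(na.omit(x))))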
For more information on Amelia, check the vignette; if you want to read through the code, check the GitHub page. The error checks in particular (amcheck.r) are handy to read through.
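As for the simplest way to run the imputation itself, a minimal sketch follows. Which columns to pass as noms (nominal variables) is an assumption about your data; chol is deliberately left continuous, since marking it nominal is what triggers error code 36:
library(Amelia)
# m = 5 imputed data sets; noms flags the categorical columns (an assumption here)
a.out <- amelia(mydata, m = 5, noms = c("sex", "cp", "fbs", "restecg", "exang", "slope"))
imputed <- a.out$imputations[[1]] # first completed data set, ready for splitting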
Splitting your data 70/30 can be done in multiple ways. Two that I use:
library(caTools)
# set.seed for reproducibility.
set.seed(144)
split <- sample.split(dataframe$"Variable to split on", SplitRatio = 0.7)
train <- subset(dataframe, split == TRUE)
test <- subset(dataframe, split == FALSE)
or
library(caret)
# set.seed for reproducibility.
set.seed(42)
split <- createDataPartition(y = dataframe$"Variable to split on", p=0.7, list=FALSE)
train <- dataframe[split,]
test <- dataframe[-split,]
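Either way, a quick sanity check that the proportions came out right:
nrow(train) / nrow(dataframe) # should be close to 0.7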

Related

How to fix 'differing rows' error in predict function?

I am trying to use the predict function, but the output does not have the number of rows I expect. After reading about similar errors I assume something is wrong with my data.frame, but I can't figure it out.
I've tried to make sure my newdata has the same variable names as my model, but that doesn't fix it. The differing row counts correspond to the different numbers of observations: for example, I train on 50 different sets of information and I test on 39950 sets.
In both the train_data and the test_data there are 10 columns which are the samples that will be included in each calculation. The model correctly finds these and names them test_data1, test_data2, etc.
I'm sure there is something I'm missing but I can't seem to figure it out.
trainingSampleSize <- k
sample_sample[[k-1]] <- sample(1:ncol(pre$train_data), k, replace = FALSE)
train_data <- pre$train_data[,sample_sample[[k-1]]]
test_data <- pre$test_data[,sample_sample[[k-1]]]
data_lm <- data.frame(train_data, pre$train_targets)
cvFitList[[(k-1)]] <- lm(pre$train_targets ~ train_data, data_lm)
prediction[[k-1]] <- predict(cvFitList[[(k-1)]], data.frame(train_data=test_data))
My goal is to get a prediction for every set of test_data, 39950 results from predict.
I got a warning message:
'newdata' had 39950 rows but variables found have 50 rows
and prediction[[k-1]] has only 50 rows
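For the record, this warning usually means the variables in the model formula were looked up in the fitting environment rather than in newdata. A hedged sketch of one common fix, assuming the objects from the question (names are illustrative, not the poster's exact code):
# give the response and predictors stable column names in one data frame,
# and reuse those names in newdata so predict() can match them row for row
train_df <- data.frame(y = pre$train_targets, pre$train_data[, sample_sample[[k - 1]]])
fit <- lm(y ~ ., data = train_df)
test_df <- as.data.frame(pre$test_data[, sample_sample[[k - 1]]])
names(test_df) <- names(train_df)[-1] # same predictor names as in training
pred <- predict(fit, newdata = test_df) # one prediction per test row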

Library "TableOne" multiple comparisons. Calculate line by line p-values

I received a comment from a reviewer who wanted all the p-values for each line of specific variable levels in a demographic characteristics table (Table 1). Even though the request appears quite strange (and imprecise) to me, I would like to comply with the suggestion.
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
print(tableOne, nonnormal = c("bili","chol","copper","alk.phos","trig"), exact = c("status","stage"), smd = TRUE)
The output shows a single p-value per variable. I need the p-values for each level of the variables status, edema and stage, with Bonferroni correction. I went through the documentation without success.
In addition, is it correct to use chi-squared to compare sample sizes across rows?
UPDATE:
I'm not sure if my approach is correct, but I would like to share it with you. For the variable status I generated a dummy variable for each level, then calculated the chi-squared test for each against trt.
library(tableone)
## Load data
library(survival); data(pbc)
d <- pbc[,c("status", "trt")]
# Create dummy variables
d$status.0 <- ifelse(d$status==0, 1,0)
d$status.1 <- ifelse(d$status==1, 1,0)
d$status.2 <- ifelse(d$status==2, 1,0)
t <- rbind(
chisq.test(d$status.0, d$trt),
# p-value = 0.7202
chisq.test(d$status.1, d$trt),
# p-value = 1
chisq.test(d$status.2, d$trt)
#p-value = 0.7818
)
t
BONFERRONI ADJ FOR MULTIPLE COMPARISONS:
p <- t[,"p.value"]
p.adjust(p, method = "bonferroni")
This question was posted some time ago, so I suppose you have already answered the reviewer.
I don't really understand why you would compute adjusted p-values for just three variables. Adjusting p-values depends on the number of comparisons made; if you use p.adjust() with a vector of only 3 p-values, the results will not really be "adjusted" for the number of comparisons you actually made (you really did more than a dozen and a half!).
I will show how to extract all the p-values so you can compute the adjusted ones.
To extract p-values from a tableone object, there is a way using object attributes (explained first), and two quick and dirty ways (at the bottom).
To extract them, first I copy your code to create your tableOne:
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
You can see what your "tableOne" object has via attributes()
attributes(tableOne)
You can see a tableOne object usually has one table for continuous variables and one for categorical variables. You can use attributes() on them too:
attributes(tableOne$CatTable)
# you can notice $pValues
Now you know "where" the pValues are, you can extract them with attr()
attr(tableOne$CatTable, "pValues")
Something similar works for the numerical variables:
attributes(tableOne$ContTable)
# $pValues are there
attr(tableOne$ContTable, "pValues")
You have p-values for both normal and non-normal variables.
Since you specified which variables are non-normal before, you can extract both sets:
mypCont <- attr(tableOne$ContTable, "pValues") # put them in an object
nonnormal = c("bili","chol","copper","alk.phos","trig") # copied from your code
mypCont[rownames(mypCont) %in% c(nonnormal), "pNonNormal"] # extract NonNormal
"%!in%" <- Negate("%in%")
mypCont[rownames(mypCont) %!in% c(nonnormal), "pNormal"] # extract Normal
All that said, with your p-values extracted, I think there are two much more convenient quick and dirty ways to accomplish the same thing:
Quick and dirty way A: use dput() on your printed tableOne. Then search in the console for where the pValues are and copy-paste them into your script to store them in an object.
Quick and dirty way B: if you look in the tableone vignette there is an "Exporting" section; you can use print(tableOne, quote = TRUE) and then just copy and paste into a spreadsheet (LibreOffice, Excel...).
Then I would select the column with the p-values, transpose it, read it back into R to compute adjusted p-values with p.adjust(), and copy them back into the spreadsheet for journal submission.
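Putting the pieces together, a short sketch of the adjustment step, assuming the objects created above ("pApprox" is the column tableone uses for its default chi-squared tests; for simplicity this takes pNormal for all continuous variables, so swap in pNonNormal for the ones you listed as non-normal):
pCat <- attr(tableOne$CatTable, "pValues")[, "pApprox"] # categorical p-values
pCont <- attr(tableOne$ContTable, "pValues")[, "pNormal"] # continuous p-values
p.adjust(c(pCat, pCont), method = "bonferroni") # Bonferroni-adjust the full set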

Data set for regression: different response values for same combination of input variables

Hey dear stackoverflowers,
I would like to perform (multiple) regression analysis on a large customer data set, trying to predict amount spent after initial purchase based on various independent variables, observed during the first purchase.
In this data set, for the same combination of input variable values (say gender=male, age=30, income=40k, first_purchase_value=99.90), I can have multiple observations with varying y values (i.e. multiple customers share the same independent variable attributes but behave differently according to their observed y values).
Is this a problem for regression analysis, i.e. do I have to condense these observations by e.g. averaging? I am getting negative R2 values, which is why I'm asking (I know that a linear model might also just be the wrong assumption here)...
Thank you for helping me. I tried using the search function, but was unable to find similar topics (probably because the question is silly?).
Cheers!
Edit: This is the code I'm using:
library(caTools)
# set.seed for reproducibility.
set.seed(1)
spl <- sample.split(data$spent, SplitRatio = 0.75)
data_train <- subset(data, spl == TRUE)
data_test <- subset(data, spl == FALSE)
model_lm_spent <- lm(spent ~ ., data = data_train)
summary(model_lm_spent)
model_lm_predictions_spent <- predict(model_lm_spent, newdata = data_test)
SSE_spent <- sum((data_test$spent - model_lm_predictions_spent)^2)
SST_spent <- sum((data_test$spent - mean(data$spent))^2)
1 - SSE_spent/SST_spent # out-of-sample R-squared
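For what it's worth, repeated combinations of predictor values are not in themselves a problem for lm(); duplicated x rows with different y values simply contribute to the residual variance. A tiny illustration with hypothetical data, not the poster's:
set.seed(1)
d <- data.frame(x = rep(c(1, 2, 3), each = 4)) # every x value repeated 4 times
d$y <- 2 * d$x + rnorm(nrow(d)) # same x, different y
coef(lm(y ~ x, data = d)) # slope is still estimated near 2
A negative out-of-sample R2 computed as above just means the model predicts worse than the mean response, which points at the model specification rather than at the duplicated rows.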

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10, validation = "LOO", jackknife = TRUE)
It was then identified that a model with 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that this piece of code generates only the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the code below, in which newdata is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically is. Is there someone out there who can, and is willing to help me move in the right direction?
I think the problem is with the structure of the input data. Looking at ?plsr and str(yarn), which goes with the example, plsr requires a very specific data frame that I find tricky to work with: the input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
                 height = rpois(56, 10),
                 fbm = rpois(56, 10),
                 nitrogen = rpois(56, 10),
                 carbon = rpois(56, 10),
                 chl = rpois(56, 10),
                 ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation = "LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I, which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
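As a quick sanity check on the fix above, the prediction array now has one row per held-out observation:
dim(res) # 36 x 1 x 1: 36 test rows (56 - 20 training rows), one response, one component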

MICE imputation even when same data in column

Is it possible to get an imputation with the mice package even when all the observed values in a column are the same? It should then just impute with that constant value.
Example:
test<-data.frame(var1=c(2.3,2.3,2.3,2.3,2.3,NA),var2=c(5.3,5.6,5.9,6.4,4.5,NA))
miceImp<-mice(test)
testImp<-complete(miceImp)
This only imputes var2. I would like it to replace the NA in var1 with 2.3 as well.
You can use passive imputation for this. For a full explanation, see section 3.4 on page 25 of this article. As applied to constant variables, the objective here would be to set the imputation method for any constant variable x to the constant value of x. If the constant value of x is y, then the imputation method for x should be "~I(y)".
test <- data.frame(
  var1 = c(2.3, 2.3, 2.3, 2.3, 2.3, NA, 2.3),
  var2 = c(5.3, 5.6, 5.9, 6.4, 4.5, 5.1, NA),
  var3 = c(NA, 1:6))
cVars = which(sapply(test,sd,na.rm=T)==0) #determine which vars are constant (props to SimonG)
allMeans = colMeans(test,na.rm=T) #get the column means
miceImp.ini = mice(test,maxit=0,print=F) #initial mids object with no imputations
meth = miceImp.ini$method #extract the imputation method vector
meth[cVars] = paste0("~I(",allMeans[cVars],")") #set the imputation method to be a constant (the current column mean)
miceImp = mice(test,method=meth) #run the imputation with the user defined imputation methods
testImp = complete(miceImp) #extract a completed dataset
View(testImp) #take a look at it
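If it worked, the previously missing entry in var1 is filled with the constant:
testImp$var1 # the NA is now 2.3, the constant column value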
All that being said, constant values tend not to be of great use in statistics, so it might be more efficient to drop any constant variables before imputation (since imputation is such a costly process).
