Trouble with cbind in manova call in R

I'm trying to do a multivariate ANOVA with the manova function in R. My problem is that I'm trying to find a way to pass the list of dependent variables without typing them all in manually, as there are many and they have horrible names. My data are in a data frame where "unit" is the independent variable (a factor) and the rest of the columns are various numeric response variables, e.g.
unit C_pct Cln C_N_mol Cnmolln C_P_mol N_P_mol
1 C 48.22 3.88 53.92 3.99 3104.75 68.42
2 C 49.91 3.91 56.32 4.03 3454.53 62.04
3 C 50.75 3.93 56.96 4.04 3922.01 69.16
4 SH 50.72 3.93 46.58 3.84 2590.16 57.12
5 SH 51.06 3.93 43.27 3.77 2326.04 53.97
6 SH 48.62 3.88 40.97 3.71 2357.16 59.67
If I write the manova call as
fit <- manova(cbind(C_pct, Cln) ~ unit, data = plots)
it works fine, but I'd like to be able to pass a long list of columns without naming them one by one, something like
fit <- manova(cbind(colnames(plots[5:32])) ~ unit, data = plots)
or
fit <- manova(cbind(plots[,5:32]) ~ unit, data = plots)
I get the error
"Error in model.frame.default(formula = as.matrix(cbind(colnames(plots[5:32]))) ~ :
variable lengths differ (found for 'unit')
I'm sure it's because I'm using cbind wrong, but I can't figure it out. Any help is appreciated! Sorry if the formatting is rough; this is my first question posted.
EDIT: Both ways (all three, actually) work. Thanks, all!

manova, like most R modelling functions, builds its formula out of the names of the variables found in the dataset. When you pass it the colnames, however, you're passing the strings that represent those names, not the variables themselves. The function doesn't know what to do with strings in a formula, so it chokes.
You can get around this. The LHS of the formula only has to resolve to a matrix; cbind(C_pct, Cln, ...) is just one way of obtaining a matrix, by evaluating its argument names C_pct, Cln, etc. in the environment of your data frame. If you provide a matrix to start with, no evaluation is necessary.
fit <- manova(as.matrix(plots[, 5:32]) ~ unit, data=plots)
Some notes. The as.matrix is necessary because extracting columns from a data frame this way returns another data frame, which manova won't accept, so we coerce it to a matrix. Second, this works as long as you don't have an actual variable called plots inside your data frame plots: if R doesn't find a name inside your data frame, it looks next in the environment of the caller, in this case the global environment.
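A quick way to see why the coercion is needed (illustrative only, run against the same plots data frame):
class(plots[, 5:32]) # "data.frame"
class(as.matrix(plots[, 5:32])) # "matrix" "array"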
You can also create the matrix before fitting the model, with
plots$response <- as.matrix(plots[, 5:32])
fit <- manova(response ~ unit, data=plots)

You can build your formula as a string and cast it to a formula:
responses <- paste( colnames( plots )[2:6], collapse=",")
myformula <- as.formula( paste0( "cbind(", responses , ")~ unit" ) )
manova( myformula, data = plots )
Call:
manova(myformula, data = plots)
Terms:
                    unit  Residuals
resp 1               0.4        6.8
resp 2                 0          0
resp 3             220.6       21.0
resp 4               0.1        0.0
resp 5         1715135.8   377938.1
Deg. of Freedom        1          4
Residual standard errors: 1.305176 0.02708013 2.29364 0.04966555 307.3834
Estimated effects may be unbalanced


Letters group Games-Howell post hoc in R

I use the sweetpotato dataset included in the agricolae package of R:
data(sweetpotato)
This dataset contains two variables: yield (a continuous variable) and virus (a factor).
Because the Levene test is significant, I cannot assume homogeneity of variances, so I apply the Welch test in R instead of a one-way ANOVA followed by a Tukey post hoc test.
The problems come when I apply the post hoc test. For the Tukey post hoc test I use library(agricolae), which displays the superscript letters for the virus groups, so there is no problem there.
However, to perform the Games-Howell post hoc test I use library(userfriendlyscience), and while I obtain the Games-Howell output, I cannot obtain the superscript-letter comparison between virus groups that library(agricolae) provides.
The code used was the following:
library(userfriendlyscience)
data(sweetpotato)
oneway <- oneway(sweetpotato$virus, y=sweetpotato$yield, posthoc='games-howell')
oneway
I tried cld() after loading library(multcompView), but it doesn't work.
Can somebody help me?
Thanks in advance.
This functionality does not exist in userfriendlyscience at the moment. You can see which means differ, and with which p-values, by looking at the row names of the dataframe with the post-hoc test results. I'm not sure which package contains the sweetpotato dataset, but using the ChickWeight dataset that comes with R (and is used on the oneway manual page):
oneway(y=ChickWeight$weight, x=ChickWeight$Diet, posthoc='games-howell');
Yields:
### (First bit removed as it's not relevant.)
### Post hoc test: games-howell
diff ci.lo ci.hi t df p
2-1 19.97 0.36 39.58 2.64 201.38 .044
3-1 40.30 17.54 63.07 4.59 175.92 <.001
4-1 32.62 13.45 51.78 4.41 203.16 <.001
3-2 20.33 -6.20 46.87 1.98 229.94 .197
4-2 12.65 -10.91 36.20 1.39 235.88 .507
4-3 -7.69 -33.90 18.52 0.76 226.16 .873
The first three rows compare groups 2, 3 and 4 to 1: using alpha = .05, 1 and 2 have the same means, but 3 and 4 are higher. This allows you to compute the logical vector you need for multcompLetters in multcompView. Based on the example from the manual page at ?multcompView:
### Run oneway anova and store result in object 'res'
res <- oneway(y=ChickWeight$weight, x=ChickWeight$Diet, posthoc='games-howell');
### Extract dataframe with post hoc test results,
### and overwrite object 'res'
res <- res$intermediate$posthoc;
### Extract p-values and comparison 'names'
pValues <- res$p;
### Create logical vector, assuming alpha of .05
dif3 <- pValues > .05;
### Assign names (row names of post hoc test dataframe)
names(dif3) <- row.names(res);
### convert this vector to the letters to compare
### the group means (see `?multcompView` for the
### references for the algorithm):
multcompLetters(dif3);
This yields as final result:
2 3 4 1
"a" "b" "c" "abc"
This is what you need, right?
I added this functionality to userfriendlyscience, but it will be a while before this new version is on CRAN. In the meantime, you can get the source code for this update at https://github.com/Matherion/userfriendlyscience/blob/master/R/oneway.R if you want (press the 'raw' button to get an easy-to-download version of the source code).
Note that if you use this updated version, you need to set the parameter posthocLetters to TRUE, because it's FALSE by default. For example:
oneway(y=ChickWeight$weight,
x=ChickWeight$Diet,
posthoc='games-howell',
posthocLetters=TRUE);
Shouldn't it be dif3 <- pValues < .05 instead of dif3 <- pValues > .05?
That way the letters are the same if the distributions are 'the same' (that is, no evidence that they are different).
Please correct me if I'm interpreting this wrong.
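For reference, a corrected sketch (assuming, as the multcompView documentation describes, that multcompLetters expects TRUE to mean 'significantly different', so that groups sharing a letter are the ones that do not differ):
### TRUE = significantly different at alpha = .05
dif3 <- pValues < .05
names(dif3) <- row.names(res)
multcompLetters(dif3)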

GAM model error

My data frame looks like:
head(bush_status)
distance status count
0 endemic 844
1 exotic 8
5 native 3
10 endemic 5
15 endemic 4
20 endemic 3
The count data are non-normally distributed. I'm trying to fit a generalized additive model to my data in two ways so I can use anova() to see whether the p-value supports m2.
library(mgcv)
m1 <- gam(count ~ s(distance) + status, data=bush_status, family="nb")
m2 <- gam(count ~ s(distance, by=status) + status, data=bush_status, family="nb")
m1 works fine, but m2 gives the error message:
"Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons,
scale.penalty = scale.penalty, :
Can't find by variable"
This is pretty beyond me so if anyone could offer any advice that would be much appreciated!
From your comments it became clear that you passed a character variable to by in the smoother. You must pass a factor variable there. This has been a frequent gotcha for me too, and I consider it a design flaw (base R regression functions deal with character variables just fine).
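A minimal sketch of the fix (assuming the data frame from the question; with only six rows shown there won't be enough data per level to actually fit, so treat this as illustrative):
library(mgcv)
bush_status$status <- factor(bush_status$status) # 'by' variables must be factors
m2 <- gam(count ~ s(distance, by=status) + status, data=bush_status, family="nb")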

predict.lm after regression with missing data in Y

I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically this isn't a problem, but I don't know an efficient way to do it in R. Take for example this fake dataframe and regression model. I attempt to assign the predictions to a new column of the source dataframe, but because of the one missing Y value I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip<-predict.lm(model)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions with algebra, df$y_ip <- B0 + B1*df$x, or by calling the model coefficients directly, df$y_ip <- summary(model)$coefficients[1, 1] + summary(model)$coefficients[2, 1]*df$x; however, I am now working with a big model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R, although it's not necessarily obvious: the na.action argument (see ?na.exclude). With na.action=na.exclude, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit the model: the default na.action is na.omit, which simply removes non-complete cases, so the predictions below skip row 5.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
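The same NA padding applies to the other extractor functions on an na.exclude fit (illustrative):
residuals(mod2) # length 10, with NA in position 5
fitted(mod2) # likewise padded with NA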
Actually, you are not using the predict.lm function correctly.
Either way you have to pass the model itself as the first argument, here model, with or without new data. Without new data, it will only predict on the training data, thus excluding your NA row, so you need this workaround to fill the original data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or you can explicitly specify new data. Since the new x has one more row than the training x, it will fill the missing row with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)

How to apply a regression in a for loop for all the variables of a dataset while adding rows in R

That is a long question I know, but bear with me.
I have a dataset in this form:
head(TRAINSET)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 Y
1 -2.973012 -2.956570 -2.386837 -0.5861751 4e-04 0.44 0.0728 0.0307 0.0354 0.0078 0.0047 0.0100 -0.0022 0.0038 -0.005200012
2 -2.937649 -2.958624 -2.373960 -0.5636891 5e-04 0.44 0.0718 0.0323 0.0351 0.0075 0.0028 0.0095 -0.0019 0.0000 0.042085781
3 -2.984238 -2.937649 -2.428712 -0.5555258 2e-04 0.43 0.0728 0.0329 0.0347 0.0088 0.0018 0.0092 -0.0019 -0.0076 0.004577122
4 -2.976535 -2.970053 -2.443424 -0.5331107 9e-04 0.47 0.0588 0.0320 0.0331 0.0253 0.0011 0.0092 -0.0170 -0.0076 0.010515970
5 -2.979631 -2.962549 -2.468805 -0.5108256 6e-04 0.46 0.0613 0.0339 0.0333 -0.0005 -0.0006 0.0090 0.0060 -0.0058 0.058487141
6 -3.030536 -2.979631 -2.528079 -0.5024574 3e-04 0.43 0.0562 0.0333 0.0327 0.0109 -0.0006 0.0093 -0.0120 0.0000 -0.022896759
This is my Train set; it has 300 rows. The remaining 700 rows are the Test set. What I am trying to accomplish is:
For each column fit a linear model of this form : Y ~ X1.
Use the model created to get the predicted value of the Y by using the first X1 of the Test set.
After that, take the first row of the Test set and rbind it to the Train set (now the Train set is 301 rows).
Predict the value of Y using the 2nd row of X1 from the test set.
Repeat for the remaining 699 rows of the Test set.
Apply it for all the remaining variables of the datasets (X2,...,X14).
I have managed to produce accurate results when I apply code that I wrote for each variable specifically:
fittedvaluess<-NULL #empty set to fill
for(i in 1:nrow(TESTSET)){ #begin iterating over the rows of the Test set
TRAINSET<-rbind(TRAINSET,TESTSET[i,]) #add the rows to the train set
LM<-lm(Y~X1,TRAINSET) #fit the evergrowing OLS
predictd<-predict(LM,TESTSET[i+1,],type = "response") #get the predicted value
fittedvaluess<-cbind(fittedvaluess,predictd) #get the vector of the predicted values
print(cbind(i,length(TRAINSET$LHS),length(TRAINSET$DP),nrow(TRAINSET))) #to make sure it works
}
However, I want to automate this so that it repeats over the columns. I have made this:
data<-TRAINSET #cause every time i had to remake the trainset
fittedvaluesss<-NULL
for(i in 1:nrow(TESTSET)){ #begin iteration on rows of Testset
data<-rbind(data,TESTSET[i,]) # rbind the rows to the Trainset called data
for(j in 1:ncol(TESTSET)){ #iterate over the columns
LM<-lm(data$LHS~data[,j],data) #fit OLS
predictd<-predict(LM,TESTSET[i+1,j],type = "response") #get the predicted value
fittedvaluesss<-cbind(fittedvaluesss,predictd) #derive the predicted value
print(c(i,j)) #make sure it works
}
}
The results are unfortunately wrong: fittedvaluesss is a huge matrix:
dim(fittedvaluesss)
[1] 2306 3167 #Stopped around the middle of its run
Which doesn't make any sense. I have even run it with i in 1:3 and j in 1:3, and the matrix was still insanely huge. I have also tried iterating over the columns first and then over the rows: exactly the same wrong results. For some reason, on each run I was getting at least 362 values back from the predict function. I am really stuck on this problem.
Any help is highly welcome.
EDIT 1: This is also known as RECURSIVE FORECASTING in finance. It is a method of forecasting future values from a model that is refit on the current dataset as it grows.
Consider reversing your looping logic, with columns in the outer loop and rows in the inner loop. Additionally, try nested apply functions, which return structures more aligned to your needs than the for loop. Specifically, the inner vapply() returns a numeric vector of all of TESTSET's predicted values for each iterated column, and the outer sapply() binds each returned vector into a column of a matrix.
Ultimately, fittedvaluess is a matrix with dimensions nrow(TESTSET) x (ncol(TESTSET)-1). Notice, too, that the outer loop leaves out the last column, since you do not regress Y on Y.
fittedvaluess <- sapply(1:(ncol(TESTSET)-1), function(c){
col <- names(TESTSET)[[c]] # RETRIEVE COLUMN NAME FOR LM FORMULA
predictvals <- vapply(1:nrow(TESTSET), function(r){
TRAINSET <- rbind(TRAINSET, TESTSET[1:r,]) # BINDING ROWS ON AND PRIOR TO CURRENT ROW
LM <- lm(paste0("Y~", col), TRAINSET) # CONCATENATED STRING FORMULA
predictd <- predict(LM, TESTSET[r+1,], type="response")
}, numeric(1))
})
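If you want labelled output, you could add the column names afterwards (an optional touch, not from the original answer; assumes Y is the last column, as shown in the question):
colnames(fittedvaluess) <- names(TESTSET)[1:(ncol(TESTSET)-1)] # one column of predictions per predictor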
Why sapply and vapply?
Both sapply() and vapply() are wrappers around lapply(). Where sapply() (simplified lapply) can return either a vector or a matrix, vapply() (verified lapply) lets you specify the returned output --vector, list, matrix-- as well as its type and length. So vapply() requires a third argument specifying these criteria. Here, we choose a numeric vector of length one: numeric(1). Because of this pre-specification, vapply() tends to run faster than lapply() in some cases. Had we chosen the general lapply(), we would need various casts and conversions to align the list output to a matrix. In a way, we could have used nested vapply() loops!
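A tiny illustration of the difference (not from the original answer):
sapply(1:3, function(i) i^2) # simplified to a numeric vector: 1 4 9
vapply(1:3, function(i) i^2, numeric(1)) # same result, but the return type and length are verified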
I solved it by using the code below, which is a minor variation of my original code, except that I didn't use predict:
#EXPAND IT INTO DOING SO IN ALL COLUMNS
data<-TRAINSET
fittedvaluesss<-NULL
for(i in 1:nrow(TESTSET)){ #go over each row
data<-rbind(data,TESTSET[i,]) #update the dataset
for(j in 1:ncol(TESTSET)){ #repeat the following for each column
LM<-lm(data$LHS~data[,j]) #OLS reg
predictd<-coef(LM)[1]+coef(LM)[2]*TESTSET[i+1,j] #Simply apply the formula yourself A+Bx for each new iteration
#predict(LM,TESTSET[i+1,j],type = "response")
print(length(predictd)) #makes sure it is ONE value
fittedvaluesss<-c(fittedvaluesss,predictd)
print(c(i,j))
}
}
matrixa<-matrix(fittedvaluesss,15,648) #put the values in a matrix; note that the Y preds are in every row
matrixa<-t(matrixa) #transpose in order to have each Ypred from a var in a column
The reason this works is that in my initial code the predict function returned a small matrix of size 361x15 on each iteration rather than a single value: the formula was built from data$LHS and data[,j], so predict() could not find those terms in the newdata I passed and fell back to returning fitted values for the whole training set. Thus I dropped the predict function and used the coefficients themselves. This seemed to return the correct forecasts.

Predicting and calculating reliability test statistics from a repeated multiple regression model in R

I want to run MLR on my data using the lm function in R. However, I am using the data-splitting cross-validation method to assess the reliability of the model. I intend to use the sample function to randomly split the data into calibration and validation datasets in an 80:20 ratio. I want to repeat this, say, 100 times. Without setting a seed, I believe the models from the different samplings will differ. I came across the function below in a previous post here, and it solves the first part:
lst <- lapply(1:100, function(repetition) {
mod <- lm(...)
# Replace this with the code you need to train your model
return(mod)
})
save(lst, file="myfile.RData")
The concern now is: how do I validate each of these 100 models and obtain reliability test statistics such as RMSE, ME and R-squared for each of them, and hopefully confidence intervals as well?
If I can get the output as a dataframe containing the predicted values for all 100 models, I should be able to proceed from there.
Any help please?
Thanks
To quickly recap your question: it seems that you want to fit an MLR model to a large training set and then use this model to make predictions on the remaining validation set. You want to repeat this process 100 times and afterwards you want to be able to analyze the characteristics and predictions of the individual models.
To accomplish this, you could just store the temporary model information in a data structure during the model generation and prediction process. You can then retrieve and process all the information afterwards. You did not provide your own dataset in the description, so I will use one of R's built-in datasets to demonstrate how this might work:
> library(car)
> Prestige <- Prestige[,c("prestige","education","income","women")]
> Prestige[,c("income")] <- log2(Prestige[,c("income")])
> head(Prestige,n=5)
prestige education income women
gov.administrators 68.8 13.11 -0.09620212 11.16
general.managers 69.1 12.26 -0.04955335 4.02
accountants 63.4 12.77 -0.11643822 15.70
purchasing.officers 56.8 11.42 -0.11972061 9.11
chemists 73.5 14.62 -0.12368966 11.68
We start by initializing some variables. Let's say you want to create 100 models and use 80% of your data for training purposes:
nrIterations=100
totalSize <- nrow(Prestige)
trainingSize <- floor(0.80*totalSize)
We also want to create the data structure that will hold the intermediate model information. R is quite a generic high-level language in this regard, so we will just create a list of lists. This means that every list entry can itself hold another list of information, which gives us the flexibility to add whatever we need:
trainTestTuple <- vector(mode="list", length=nrIterations)
We are now ready to create our models and predictions. During every loop iteration a different random training subset is created, while the remaining data is used for testing. Next, we fit our model to the training data and then use the obtained model to make predictions on the test data. Note that we explicitly use the independent variables in order to predict the dependent variable:
for(i in 1:nrIterations)
{
trainIndices <- sample(seq_len(totalSize),size = trainingSize)
trainSet <- Prestige[trainIndices,]
testSet <- Prestige[-trainIndices,]
trainingFit <- lm(prestige ~ education + income + women, data=trainSet)
# Perform predictions on the testdata
testingForecast <- predict(trainingFit,newdata=data.frame(education=testSet$education,income=testSet$income,women=testSet$women),interval="confidence",level=0.95)
# Do whatever else you want to do (compare with actual values, calculate other stuff/metrics ...)
# ...
# add your training and testData to a tuple and add it to a list
tuple <- list(trainingFit,testingForecast) # Add whatever else you need ..
trainTestTuple[[i]] <- tuple # Add this list to the "list of lists"
}
Now, the relevant part: at the end of each iteration we put both the fitted model and the out-of-sample prediction results in a list. This list contains all the intermediate information that we want to save for the current iteration. We finish by putting this list in our list of lists.
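If you also want per-model performance metrics later (see the edit at the end of this answer), you could compute and store them inside the same loop. A sketch, to be added right after testingForecast is computed; the metric choices are illustrative assumptions, and they land in tuple positions 3 through 6 here (adjust any later extraction indices accordingly):
errors <- testSet$prestige - testingForecast[,"fit"] # out-of-sample errors
ME <- mean(errors) # mean error
MAD <- mean(abs(errors)) # mean absolute deviation
RMSE <- sqrt(mean(errors^2)) # root mean squared error
tuple <- list(trainingFit, testingForecast, ME, MAD, RMSE, summary(trainingFit)$adj.r.squared)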
Now that we are done with the modeling, we still have access to all the information we need, and we can process and analyze it any way we want. Let's take a look at the modeling and prediction results of model 50. First, we extract both the model and the prediction results from the list of lists:
> tuple_50 <- trainTestTuple[[50]]
> trainingFit_50 <- tuple_50[[1]]
> testingForecast_50 <- tuple_50[[2]]
We take a look at the model summary:
> summary(trainingFit_50)
Call:
lm(formula = prestige ~ education + log2(income) + women, data = trainSet)
Residuals:
Min 1Q Median 3Q Max
-15.9552 -4.6461 0.5016 4.3196 18.4882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -287.96143 70.39697 -4.091 0.000105 ***
education 4.23426 0.43418 9.752 4.3e-15 ***
log2(income) 155.16246 38.94176 3.984 0.000152 ***
women 0.02506 0.03942 0.636 0.526875
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.308 on 77 degrees of freedom
Multiple R-squared: 0.8072, Adjusted R-squared: 0.7997
F-statistic: 107.5 on 3 and 77 DF, p-value: < 2.2e-16
We then explicitly obtain the model R-squared and RMSE:
> summary(trainingFit_50)$r.squared
[1] 0.8072008
> summary(trainingFit_50)$sigma
[1] 7.308057
We take a look at the out of sample forecasts:
> testingForecast_50
fit lwr upr
1 67.38159 63.848326 70.91485
2 74.10724 70.075823 78.13865
3 64.15322 61.284077 67.02236
4 79.61595 75.513602 83.71830
5 63.88237 60.078095 67.68664
6 71.76869 68.388457 75.14893
7 60.99983 57.052282 64.94738
8 82.84507 78.145035 87.54510
9 72.25896 68.874070 75.64384
10 49.19994 45.033546 53.36633
11 48.00888 46.134464 49.88329
12 20.14195 8.196699 32.08720
13 33.76505 27.439318 40.09079
14 24.31853 18.058742 30.57832
15 40.79585 38.329835 43.26187
16 40.35038 37.970858 42.72990
17 38.38186 35.818814 40.94491
18 40.09030 37.739428 42.44117
19 35.81084 33.139461 38.48223
20 43.43717 40.799715 46.07463
21 29.73700 26.317428 33.15657
And finally, we obtain some more detailed results about the 2nd forecasted value and the corresponding confidence intervals:
> testingPredicted_2ndprediction <- testingForecast_50[2,1]
> testingLowerConfidence_2ndprediction <- testingForecast_50[2,2]
> testingUpperConfidence_2ndprediction <- testingForecast_50[2,3]
EDIT
After rereading, it occurred to me that you are obviously not splitting up the exact same dataset each time. You are using completely different partitions of the data during each iteration, and they should be split up in an 80/20 fashion. However, the same solution can still be applied with minor modifications.
Also: for cross-validation purposes you should probably take a look at cv.lm() (from the DAAG package).
Description from the R help:
This function gives internal and cross-validation measures of predictive accuracy for multiple linear regression. (For binary logistic regression, use the CVbinary function.) The data are randomly assigned to a number of ‘folds’. Each fold is removed, in turn, while the remaining data is used to re-fit the regression model and to predict at the deleted observations.
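A minimal usage sketch (an assumption on my part: cv.lm() lives in the DAAG package, whose documentation is quoted above; the first argument is named df in older DAAG versions and data in newer ones):
library(DAAG)
# 5-fold cross-validation of the same Prestige model
cvOut <- cv.lm(data=Prestige, form.lm=formula(prestige ~ education + income + women), m=5)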
EDIT: Reply to comment.
You can just take the means of the relevant performance metrics that you saved. For example, you can use sapply on trainTestTuple to extract the relevant elements from each sublist. sapply will return these elements as a vector, from which you can calculate the mean. This should work (assuming you stored the metrics in positions 2 through 6 of each tuple):
mean_ME <- mean(sapply(trainTestTuple,"[[",2))
mean_MAD <- mean(sapply(trainTestTuple,"[[",3))
mean_MSE <- mean(sapply(trainTestTuple,"[[",4))
mean_RMSE <- mean(sapply(trainTestTuple,"[[",5))
mean_adjRsq <- mean(sapply(trainTestTuple,"[[",6))
Another small edit: the calculation of your MAD looks rather strange. It might be a good idea to double-check that this is exactly what you want.
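For reference, a common definition (an assumption about what MAD means here, namely the mean absolute deviation of the forecasts from the actual values):
madForecast <- function(actual, predicted) mean(abs(actual - predicted), na.rm=TRUE) # hypothetical helper
madForecast(c(1, 2, 3), c(1.1, 1.9, 3.2)) # 0.1333333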
