"Error in model.frame.default(data = train, formula = cost ~ .) : variable lengths differ", but all variables are length 76? - r

I'm modeling burrito prices in San Diego to determine whether some burritos are over- or under-priced (according to the model). I'm attempting to use regsubsets() to determine the best linear model, using the BIC, on a data frame of 76 observations of 14 variables. However, I keep getting an error saying that the variable lengths differ, so the linear model cannot be fit.
I've tried rounding all the observations in the data frame to one decimal place, I've used the length() function on each variable in the data frame to make sure they're all the same length, and before I made the model I used na.omit() on the data frame to make sure no NAs were present. By the way, the original dataset can be found here: https://www.kaggle.com/srcole/burritos-in-san-diego. I cleaned it up a bit in Excel first, removing all the categorical variables that appeared after the "overall" column.
burritos <- read.csv("/Users/Jack/Desktop/R/STOR 565 R Projects/Burritos.csv")
burritos <- burritos[ ,-c(1,2,5)]
burritos <- na.exclude(burritos)
burritos <- round(burritos, 1)
library(leaps)
library(MASS)
yelp <- burritos$Yelp
google <- burritos$Google
cost <- burritos$Cost
hunger <- burritos$Hunger
tortilla <- burritos$Tortilla
temp <- burritos$Temp
meat <- burritos$Meat
filling <- burritos$Meat.filling
uniformity <- burritos$Uniformity
salsa <- burritos$Salsa
synergy <- burritos$Synergy
wrap <- burritos$Wrap
overall <- burritos$overall
variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]
test <- burritos[-variable, ]
null <- lm(cost ~ 1, data = train)
full <- regsubsets(cost ~ ., data = train) #This is where error occurs
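One likely explanation, judging from the code above: train contains a column named Cost (capital C), so in cost ~ . the term cost is not found in the data frame and is picked up from the global cost vector created earlier, which still has all 76 rows of the cleaned data, while the columns supplied by . come from the 50-row training set; hence the length mismatch. A minimal sketch of a fix, assuming the column really is called Cost:
full <- regsubsets(Cost ~ ., data = train)  # match the column name exactly
summary(full)$bic                           # BIC of the best model of each size
Dropping the loose copies of the columns (yelp, google, cost, ...) avoids this kind of accidental lookup in the global environment altogether.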

Related

Problems with raster prediction from linear model in r

I'm having problems with predicting a raster using a linear model.
First, I create my model from the data found in my polygons.
# create model
library(sf)
poly <- st_read("polygon.shp")
df <- na.omit(poly)
df <- df[df$gdp > 0 & df$ntl2 > 0 & df$pop2 > 0,]
x <- log(df$ntl2)
y <- log(df$gdp*df$pop2)
c <- df$iso
d <- data.frame(x,y,c)
m <- lm(y~x+c,data=d)
Then I want to use raster::predict to estimate an output raster:
# raster data
library(raster)
iso <- raster("iso.tif")
viirs <- raster("viirs.tif")
x <- log(viirs)
c <- iso
## predict with models
s <- stack(x,c)
predicted <- raster::predict(x,model=m)
However, I get the following response:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
object is not a matrix
I don't know what the problem is or how to fix it. My current thoughts are that it's something to do with the factors/country codes:
My model includes country codes, as I would like to include some country fixed effects. Maybe there is a problem with including these. However, even when excluding the country codes from the model and the entire data frame, I still get the same error message.
Furthermore, my model is based on regional values from the whole world and the prediction datasets only cover the extent of Turkey. Maybe this is the problem?
And here is the data:
https://drive.google.com/open?id=16cy7CJFrxQCTLhx-hXDNHJz8ej3vTEED
Perhaps it works if you do it like this:
iso <- raster("iso.tif")
viirs <- raster("viirs.tif")
s <- stack(log(viirs), iso)
names(s) <- c("x", "c")
predicted <- raster::predict(s, model=m)
It won't work if the values in df$iso and iso.tif don't match (is one a factor, and the other numeric?).
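A quick way to check that last point, as a sketch using the objects defined above (m and iso): compare the factor levels the model was fitted with against the cell values actually stored in the raster.
m$xlevels                    # factor levels the lm was fitted with (empty list if c was numeric)
unique(raster::values(iso))  # country codes actually stored in iso.tif
If one side is a set of character codes and the other is numeric, predict() has no way to line them up.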

How to run a loop inside a loop for a gam object

I am trying to predict new observations after multiple imputation. Both the newdata and the model to use are list objects. The correctness of the approach is not the issue, but rather how to use the predict function after multiple imputation when my new data is a list. Below is my code.
library(betareg)
library(mice)
library(mgcv)
data(GasolineYield)
dat1 <- GasolineYield
dat1$yield <- with(dat1,
                   ifelse(yield > 0.40 | yield < 0.17, NA, yield)) # create missing values
datim <- mice(dat1,m=30) #imputing missing values
mod1 <- with(datim,gam(yield ~ batch + emp,family=betar(link="logit"))) #fit models using gam
Creating the data set to be used for prediction:
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
The code below just tests out predict() without newdata. The problem I observed was that tp ends up saved as a 1-by-32 vector instead of a 30-by-32 matrix: print() shows all 30 sets of predictions, but I couldn't save them that way.
tot <- 0
for(i in 1:30){
  tot <- mod1$analyses[[i]]
  tp <- predict.gam(tot, type = "response")
  print(tp)
}
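One way to keep all 30 sets of predictions rather than overwriting tp on each pass, as a sketch (assuming each of the 30 analyses predicts the same 32 fitted observations):
tp <- matrix(NA, nrow = 30, ncol = 32)
for (i in 1:30) {
  tp[i, ] <- predict.gam(mod1$analyses[[i]], type = "response")
}
dim(tp) # 30 x 32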
The code below is me trying to predict new observations using newdata. Here I am just lost; I am not sure how to go about it.
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
tot <- 0
for(i in 1:30){
  tot <- mod1$analyses[[i]]
  tp <- predict.gam(tot, newdata = datsplit[[i]], type = "response")
  print(tp)
}
Can someone help me out on how best to go about it?
I finally solved the problem. Here is the solution:
datnew <- complete(datim, "long") # stack all the imputed data sets
Though I have to point out that this should be your new data set; I am assuming it is not used in building the model. My aim in opening this thread was to address the question of how to predict observations using new data after multiple imputation, i.e. using a model built on a multiply imputed data set.
datsplit <- split(datnew,datnew$.imp)
tot <- list()
tot_ <- list()
for(i in 1:30){
  for(j in 1:30){
    tot[[j]] <- predict.gam(mod1$analyses[[i]], newdata = datsplit[[j]])
  }
  tot_[[i]] <- tot
}
# flatten the lists within lists (requires purrr for flatten() and the %>% pipe)
library(purrr)
totfl <- tot_ %>% flatten()
#nrow is the number of observations to be predicted as contained in the
#newdata set (datsplit)
totn <- matrix(unlist(totfl),nrow=32)
apply(totn,1,mean) #takes the means of prediction across the 30 data set
I hope this helps those with similar questions. I once came across a question on how to predict newdata after multiple imputation; I guess this will answer some of the questions contained in that thread.

How to get Cox p-value for each gene?

If you run the following code, you will have a data frame real.dat which has 1063 samples for 20531 genes. There are two extra columns named time and event, where time is the survival time and event is 1 for death and 0 for censored.
lung.dat <- read.table("genomicMatrix_lung")
lung.clin.dat <- read.delim("clinical_data_lung")
# For clinical data, get only rows which do not have NA in column "X_EVENT"
lung.no.na.dat <- lung.clin.dat[!is.na(lung.clin.dat$X_EVENT), ]
# Getting the transpose of main lung cancer data
ge <- t(lung.dat)
# Getting a vector of all the id's in the clinical data frame without any 'NA' values
keep <- lung.no.na.dat$sampleID
# getting only the samples(persons) for which we have a value rather than 'NA' values
real.dat <- ge[ge[, 1] %in% keep, ]
# adding the 2 columns from clinical data to gene expression data
keep_again <- real.dat[, 1]
temp_df <- lung.no.na.dat[lung.no.na.dat$sampleID %in% keep_again, ]
# naming the columns into our gene expression data
col_names <- ge[1, ]
colnames(real.dat) <- col_names
dd <- temp_df[, c('X_TIME_TO_EVENT', 'X_EVENT')]
real.dat <- cbind(real.dat, dd)
# renaming the 2 new added columns
colnames(real.dat)[colnames(real.dat) == 'X_TIME_TO_EVENT'] <- 'time'
colnames(real.dat)[colnames(real.dat) == 'X_EVENT'] <- 'event'
I want to get the univariate Cox regression p-value for each gene in the above data frame. How can I get this?
You can download the data from here.
Edit: Sorry for not clarifying enough. I have already tried to get it with the coxph function from the survival library. But even for one gene, it shows the following error -
> coxph(Surv(time, event) ~ HIF3A, real.dat)
Error in fitter(X, Y, strats, offset, init, control, weights = weights, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Ran out of iterations and did not converge
That is why I did not provide a smaller reproducible example.
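One thing worth checking before fitting, assuming real.dat was built exactly as above: coxph() needs numeric covariates, and t() on a table that contains any non-numeric column (sample IDs, gene names) returns a character matrix, so the gene columns may have ended up non-numeric after the cbind(). A quick diagnostic sketch:
class(real.dat$HIF3A) # should be "numeric"
# if it is "character" or "factor", convert before fitting:
real.dat$HIF3A <- as.numeric(as.character(real.dat$HIF3A))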
Are you really going to do a univariate regression for each of the 20,531 genes?
Guessing wildly at the structure of your data (so creating a dummy set, based on the examples in the help), and guessing at what you're trying to do, here is a toy example:
library("survival")
?coxph ## to see the examples
## create dummy data
test <- list(time  = c(4,3,1,1,2,2,3),
             event = c(1,1,1,0,1,1,0),
             gene1 = c(0,2,1,1,1,0,0),
             gene2 = c(0,0,0,0,1,1,1))
## Cox PH regression
coxph(Surv(time, event) ~ gene1, test)
coxph(Surv(time, event) ~ gene2, test)
You may wish to use the following to get CIs and more information:
summary(coxph(...))
Hopefully that code is reproducible enough to help you clarify the question.
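If the plan really is one univariate model per gene, here is a sketch of looping over the gene columns and pulling out the Wald p-value from each fit (assuming the gene columns in real.dat are numeric and the only non-gene columns are time and event):
library(survival)
genes <- setdiff(colnames(real.dat), c("time", "event"))
pvals <- sapply(genes, function(g) {
  f <- as.formula(paste0("Surv(time, event) ~ `", g, "`"))
  fit <- coxph(f, data = real.dat)
  summary(fit)$coefficients[1, "Pr(>|z|)"]
})
head(sort(pvals)) # genes with the smallest univariate p-values
With 20,531 genes this will take a while, and the p-values will need multiple-testing correction afterwards (e.g. p.adjust(pvals, method = "BH")).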

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10, validation = "LOO",
                 jackknife = TRUE)
It was then identified that a model comprising 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates is the fitted values of the PLS model for the calibration/training data, rather than predictions for the test set. I discovered this because the code below, in which newdata is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically. Is there someone out there who can, and is willing to, help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
                 height = rpois(56,10),
                 fbm = rpois(56,10),
                 nitrogen = rpois(56,10),
                 carbon = rpois(56,10),
                 chl = rpois(56,10),
                 ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation = "LOO",
                 jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I(), which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
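As a quick sanity check on the sketch above (using the DF and res objects defined there): predict() on a pls model returns a three-dimensional array with one row per observation in newdata, so the first dimension should now match the number of held-out rows rather than the training rows.
dim(res)       # 36 x 1 x 1: 56 rows minus 20 training rows, one response, one value of ncomp
sum(!DF$train) # 36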

How to use one variable in regression with many independent variables in lm()

I need to reproduce this code using all of these variables.
composite <- read.csv("file.csv", header = T, stringsAsFactors = FALSE)
composite <- subset(composite, select = -Date)
model1 <- lm(indepvariable ~., data = composite, na.action = na.exclude)
composite is a data frame with 82 variables.
UPDATE:
What I have done is find a way to create an object that contains only the significantly correlated variables, to narrow down the number of independent variables.
I now have a variable, sigvars, which holds the names from a sorted correlation matrix, keeping only the variables with correlation coefficients > 0.5 or < -0.5. Here is the code:
sortedcor <- sort(cor(composite)[,1])
regvar = NULL
k = 1
for(i in 1:length(sortedcor)){
  if(sortedcor[i] > .5 | sortedcor[i] < -.5){
    regvar[k] = i
    k = k + 1
  }
}
regvar
sigvars <- names(sortedcor[regvar])
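As an aside, the same filter can be written without the loop; a one-line sketch, assuming sortedcor as defined above:
sigvars <- names(sortedcor)[abs(sortedcor) > 0.5]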
However, it is not working in my lm() function:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
Error: Error in model.frame.default(formula = data.matrix(composite[1]) ~ sigvars, : variable lengths differ (found for 'sigvars')
Think about what sigvars is for a minute...?
After sigvars <- names(sortedcor[regvar]), sigvars is a character vector of column names. Say your data have 100 rows and 5 variables come out as significant using the method you've chosen (which doesn't sound overly defensible to me). The model formula you are using will result in composite[, 1] being a vector of length 100 (100 rows) and sigvars being a character vector of length 5.
Assuming you have the variables you want to include in the model, then you could do:
form <- reformulate(sigvars, response = names(composite)[1])
model1 <- lm(form, data = composite)
or
model1 <- lm(composite[,1] ~ ., data = composite[, sigvars])
In the latter case, do yourself a favour and write the name of the dependent variable into the formula instead of composite[,1].
Also, you don't seem to have appreciated the difference between [i] and [i,j] for data frames: composite[1] takes the first component of composite but leaves it as a data frame, which data.matrix() then converts to a matrix. All you really need is the name of the dependent variable on the LHS of the formula.
The error is here:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
sigvars is just a character vector of names. A formula is usually of the form lm(var1 ~ var2 + var3 + var4); what you have instead drops the whole sigvars vector in as a single term, which is why the variable lengths differ.
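A sketch of building the formula from that character vector (equivalent to the reformulate() approach above; it assumes the response is the first column of composite):
form <- as.formula(paste(names(composite)[1], "~", paste(sigvars, collapse = " + ")))
model1 <- lm(form, data = composite)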
Hopefully that helps.
