Can't understand what predict() does in this case - R

I am looking at some code that prepares a dataframe for several prediction models to be tested later on. The general idea is to predict NormSec based on all the other columns.
Not sure what predict(dummies,newdata=data) does in this case.
I know that predict is used to predict based on an already trained fit. Why is it used in this case? The code works, just trying to understand it.
library(caret)  # provides dummyVars()
data<-read.csv(file="datatable.csv")
attach(data)
#selecting the useful columns from data table:
data<-data.frame(NormSec, Rivalry,Stars,NormFB,SeasonPart,FootballSeason,LeBron,Weekend,LastSeasonWins
,Holiday,BigGame,OverUnders,DaysSinceLast,DaysUntilNext, Weekday, Monthday, NewArena)
dummies <- dummyVars(NormSec~., data = data)
attach(dummies)
#Here is the function I don't get:
dataDescr<-predict(dummies,newdata=data)
dataDescr<-data.frame(dataDescr)
attach(dataDescr)
dummies is a dummy variable object and dataDescr (the output of predict()) is the original dataframe without the NormSec column.
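For context: dummyVars() comes from the caret package, and calling predict() on its result does not run a trained model. It applies the stored encoding recipe to newdata, expanding factor columns into 0/1 indicator columns and leaving out the response (NormSec here). The sketch below shows roughly the same transformation in base R with model.matrix() rather than caret; the toy data frame and its values are invented for illustration, and the resulting column names differ from the ones caret produces.

```r
# Toy data (made-up values); Weekday is the only factor column.
toy <- data.frame(
  NormSec = c(1.2, 0.8, 1.5),
  Stars   = c(2, 0, 3),
  Weekday = factor(c("Fri", "Sat", "Fri"))
)

# Encode every predictor; "- 1" drops the intercept so the factor
# gets one indicator column per level. The response is left out.
encoded <- model.matrix(~ . - 1, data = toy[, c("Stars", "Weekday")])
encoded <- data.frame(encoded)

print(encoded)
```

The numeric column passes through unchanged, while Weekday becomes one indicator column per level, which is the shape of data many modelling functions expect.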

Related

Dataframe subsets retain information from parent dataframe

I assume this is used as a feature in data.frame() but it has presented a lot of problems for evaluating training and test sets for some packages. For example if you utilize h2o for machine learning, import a dataset, and subset the dataframe based on some random sample of the data, the h2o model builder will have access to the FULL original dataframe with all factor levels and all data. As such, if you try something like h2o.predict(model,newdata=dataset[test,]) your prediction will simply copy the response in the dataset over (tested for a deep learning model). You can see the factor retention below:
y = as.factor(c("1","0","0","1"))
X = c(5,4,3,4)
data = data.frame(y,X)
train = c(1,4)  # row indices for the training subset
test = c(2,3)   # row indices for the test subset
trainingData = data[train,]
trainingData
levels(trainingData[,1])
[1] "0" "1"
Now, I've been able to solve the factor information retention, but I'm not sure how to remove information from the parent dataframe in the new subset. Anyone have any ideas?
EDIT: For anyone who has had the factor problem, it's as simple as applying function droplevels().
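A minimal sketch of the droplevels() fix mentioned in the edit, reusing the toy data from the question:

```r
y <- factor(c("1", "0", "0", "1"))
X <- c(5, 4, 3, 4)
data <- data.frame(y, X)

# Rows 1 and 4 both have y == "1", yet the subset keeps both levels.
trainingData <- data[c(1, 4), ]
levels_before <- levels(trainingData$y)

# droplevels() removes factor levels that no longer occur in the subset.
trainingData <- droplevels(trainingData)
levels_after <- levels(trainingData$y)

print(levels_before)
print(levels_after)
```

Note that this only addresses the unused factor levels; it does not change anything about how a package such as h2o references the parent frame.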

Survdiff p-value comparison

I am trying to run a survival analysis on a set of data I have collected. In this data frame (m3), each row is a new patient and each column is a mutation I have identified. I have made a binary data table to indicate whether each patient is positive or negative for the mutation. I can run a survfit function for each column(mutation), but I have hundreds and want to loop through them. I have written the following code, but don't think it is correct (nothing is being output).
for (i in m3[,2:256]) {survdiff(Surv(m3$Overall.Survival, m3$Status) ~ i,
data = m3)}
Once I gather this data, I want to make a table with one row per mutation (column) and the p-value from the survdiff result as a column.
I'm not sure why the for loop produces no output, and I'm even less sure how to generate the new data frame. I believe I would be subsetting it.
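A for loop in R does not auto-print the value of expressions in its body, which is the likely reason nothing appears: the survdiff() results are computed and then discarded. One way to keep them is to loop over column names and store a p-value for each. The sketch below assumes the survival package and uses invented data with two hypothetical mutation columns (mutA, mutB); the p-value is computed from the chi-square statistic that survdiff() returns.

```r
library(survival)

# Made-up stand-in for m3: survival time, event status, two binary mutations.
set.seed(1)
m3 <- data.frame(
  Overall.Survival = rexp(40, rate = 0.1),
  Status           = rbinom(40, 1, 0.7),
  mutA             = rep(c(0, 1), 20),
  mutB             = rep(c(0, 0, 1, 1), 10)
)

mutations <- c("mutA", "mutB")  # in the real data: names(m3)[2:256]

pvals <- sapply(mutations, function(col) {
  # Build the formula from the column name so data = m3 can be used.
  f   <- as.formula(paste("Surv(Overall.Survival, Status) ~", col))
  fit <- survdiff(f, data = m3)
  # Log-rank p-value: chi-square statistic with (number of groups - 1) df.
  pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
})

result <- data.frame(mutation = mutations, p.value = unname(pvals))
print(result)
```

With hundreds of mutations tested this way, some correction for multiple comparisons (for example p.adjust()) is probably worth considering.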

Error ("variables with different types from the fit") when using the predict() function in R

I have fitted a multivariate polynomial using the lm() and step() functions in R. My data has dependent variable Y and some independent variables X1 till Xn. I formatted the formula to fit as follows: Y ~ I(X1^1)+I(X1^2)+I(X2^1)+... etc. When I use the predict() function on the original data everything works, even on the validation points which weren't used for the fit. But, I have to use the predict() function on some simulated data I produced. I made sure the simulated data is in a data.frame and all the elements are of type double like the original data. I copied the column names from the original data (X1, ... ,Xn) to the simulated data. Now when I use the predict() function I get the following error:
Error: variables ‘I(X1^1)’, ‘I(X1^2)’, ‘I(X2^1)’ were specified with different types from the fit
I really don't get it. The column names are the same, the types are the same and both original and simulated data are in a data.frame. What is happening here?
Thanks in advance!!
Sorry for not providing a reproducible example. But I've found a solution. It's not very elegant, but here it is. When I coerce the data.frame with the original data to a matrix and then straight back to a data frame, some attributes and other stuff are stripped from the original data. If I now use this data.frame for the fitting process, the predict() function also works on the simulated data. The simulated data was in matrix format first and was converted to a data.frame. It's still not clear to me whether there is a more elegant way to get rid of the attr, dimensions and other stuff in the data.frame of the original data. I've tried unname() but that didn't do the job.
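Since the question has no reproducible example, the cause can only be guessed at. One situation that produces this kind of mismatch is a column that is secretly a one-column matrix carrying extra attributes, for example after scale(). The sketch below assumes that cause (the use of scale() is hypothetical, not something stated in the question) and shows how to detect such columns and strip them back to plain numeric vectors without the round trip through a matrix.

```r
# Hypothetical original data: scale() returns a one-column matrix
# with extra attributes, not a plain numeric vector.
orig <- data.frame(X1 = c(1, 2, 3, 4))
orig$X1 <- scale(orig$X1)

# Simulated data built from plain numeric vectors.
sim <- data.frame(X1 = c(0.5, 1.5))

# Compare the underlying structure of each column.
matrix_cols_before <- sapply(orig, is.matrix)
print(matrix_cols_before)
print(sapply(sim, is.matrix))

# as.numeric() drops dim and other attributes, column by column.
orig[] <- lapply(orig, as.numeric)
matrix_cols_after <- sapply(orig, is.matrix)
print(matrix_cols_after)
```

Running str() on both data frames is another quick way to spot differences that column names and typeof() do not reveal.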

Getting expected value through regression model and attach to original dataframe in R

My question is very similar to this one, but I still can't solve my problem and would like a little more help to make it clear. The original dataframe "ddf" looks like:
CONC <- c(0.15,0.52,0.45,0.29,0.42,0.36,0.22,0.12,0.27,0.14)
SPP <- c(rep('A',3),rep('B',3),rep('C',4))
LENGTH <- c(390,254,380,434,478,367,267,333,444,411)
ddf <- as.data.frame(cbind(CONC,SPECIES,LENGTH))
the regression model is constructed based on Species:
model <- dlply(ddf,.(SPP), lm, formula = CONC ~ LENGTH)
The regression model works fine and returns an individual model for each species.
What I want is the expected value and the residual from each model (one per species), added to my original dataset ddf as new columns, so the new dataset should look like:
SPP LENGTH CONC EXPECTED RESIDUAL
Firstly, I use the following code to get the expected value:
model_pre <- lapply(model,function(x)predict(x,data = ddf))
I suspect there might be some mistakes in the above code, but it actually runs! The result comes with two columns (predicted value and species). My first question is whether I can trust the result of the above code. (Does R fully understand what I am aiming to do, i.e. getting the expected value for each species' model?)
Then I used the following code to attach those data to ddf:
ddf_new <- cbind(ddf, model_pre)
This code runs as well, but here is the problem. It seems that R just attaches the model_pre result directly to the original dataframe. Since the model_pre result is not sorted the same way as the original ddf, the outcome is obviously wrong (judging by the species column in the original dataframe and in model_pre).
I used resid() with similar lapply and cbind code to get the residuals and attach them to the original ddf. The same problem occurs.
Therefore, how can I attach those results correctly, matched by species? (Please let me know if anything I am trying to explain here is unclear.)
Any help would be greatly appreciated!
There are several problems with your code. You refer to columns SPP and Conc., but columns by those names don't exist in your data frame.
Your predicted values are made on the entire dataset, not just the subset corresponding to that model (this may be intended, but seems strange with the later usage).
When you cbind a data frame to a list of data frames, does it really cbind the individual data frames?
Now to more helpful suggestions.
Why use dlply at all here? You could just fit a model with interactions that effectively fits a different regression line to each species:
fit <- lm(CONC ~ SPECIES * LENGTH, data= ddf)
fitted(fit)
predict(fit)
ddf$Pred <- fitted(fit)
ddf$Resid <- ddf$CONC - ddf$Pred
Or, if there is some other reason to really use dlply and the problem is combining two data frames that have different ordering, then either use merge or reorder the data frames to match first (see functions like order, sort.list, and match).
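If separate per-species fits are preferred over the interaction model, one base-R way to keep the results aligned with the original rows is split() followed by unsplit(), which puts each group's values back in the positions they came from. This is a sketch of that alternative rather than the dlply approach from the question; the data are the values from the question, built with data.frame() so the numeric columns stay numeric, and the rows are shuffled on purpose to show that the alignment does not depend on sorting.

```r
ddf <- data.frame(
  CONC    = c(0.15, 0.52, 0.45, 0.29, 0.42, 0.36, 0.22, 0.12, 0.27, 0.14),
  SPECIES = factor(c(rep("A", 3), rep("B", 3), rep("C", 4))),
  LENGTH  = c(390, 254, 380, 434, 478, 367, 267, 333, 444, 411)
)

# Shuffle the rows so species are no longer in contiguous blocks.
ddf <- ddf[c(10, 1, 5, 8, 2, 6, 3, 9, 4, 7), ]

# One model per species, each fitted only on that species' rows.
by_species <- split(ddf, ddf$SPECIES)
fits <- lapply(by_species, function(d) lm(CONC ~ LENGTH, data = d))

# unsplit() returns the fitted values in the original row order.
ddf$EXPECTED <- unname(unsplit(lapply(fits, fitted), ddf$SPECIES))
ddf$RESIDUAL <- ddf$CONC - ddf$EXPECTED

print(ddf)
```

The fitted values obtained this way should agree with fitted(lm(CONC ~ SPECIES * LENGTH, data = ddf)), since both describe a separate line per species.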

Cannot coerce class "amelia" to a data.frame in R

I am using the Amelia package in R to handle missing values. I get the error below when I try to train a random forest with the imputed data. I am not sure how to convert the amelia class to a data frame, which would be the right input to the randomForest function in R.
train_data<-read.csv("train.csv")
sum(is.na(train_data))
impute<- amelia(x=train_data,m=5,idvars=c("X13"), interacs=FALSE)
impute<= as.data.frame(impute)
for(i in 1:impute$m) {
model <- randomForest(Y ~X1+X2+X3+X4+X5+X6,
data= as.data.frame(impute))
}
Error in as.data.frame.default(impute) :
cannot coerce class "amelia" to a data.frame
If I use impute$imputations[[i]] as the input to randomForest, I get the error below:
model <- randomForest(Y ~X1+X2+X3+X4+X5+X6,
impute$imputations[[i]])
Error: $ operator is invalid for atomic vectors
Can anyone suggest how I can solve this problem? It would be a great help.
So, I think the first problem is this line here:
impute<= as.data.frame(impute)
Should be:
impute <- as.data.frame(impute)
Which will throw an error.
Multiple imputation replaces the data with multiple datasets, each with different replacements for the missing values. This reflects the uncertainty in those missing values predictions. By turning the Amelia object into a dataframe you are trying to make one data frame out of 5 data frames, and it's not obvious how to do this.
You might want to look into simpler forms of imputation (like imputing by the mean).
This is happening because you are trying to train on the amelia object itself, which holds information about the imputation run rather than a single data set to train on. You need to extract one of the completed data sets first. Note that complete() is a mice function; an amelia object stores its completed data sets in the imputations list:
impute <- amelia(x=train_data,m=5,idvars=c("X13"), interacs=FALSE)
impute <- impute$imputations[[1]]
impute <- as.data.frame(impute)
After this you should be able to train and predict on the data.
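To use all m imputed data sets rather than just one, a common pattern is to fit one model per data set and then combine the results. The sketch below is only an illustration under stated assumptions: the imputations list is a made-up stand-in for impute$imputations, lm() is a placeholder for randomForest() so the code runs without extra packages, and averaging the predictions is one simple way to combine them, not the only one.

```r
# Stand-in for impute$imputations: a list of m = 5 completed data frames
# (random values, purely illustrative).
set.seed(42)
imputations <- lapply(1:5, function(i) {
  data.frame(X1 = rnorm(20), X2 = rnorm(20), Y = rnorm(20))
})

# Fit one model per imputed data set and keep them all in a list,
# instead of overwriting a single "model" variable inside a for loop.
models <- lapply(imputations, function(d) lm(Y ~ X1 + X2, data = d))

# Predict with every model, then average across imputations.
newdata <- data.frame(X1 = c(0, 1), X2 = c(1, 0))
preds   <- sapply(models, predict, newdata = newdata)  # one column per model
pooled  <- rowMeans(preds)

print(pooled)
```

Keeping the models in a list also avoids the issue in the original loop, where each iteration replaced the previous model.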

Resources