glm function not taking correct dataset - r

I have just started learning R and am working on a dataset with 1470 cases, named ABC. Using as.factor, I converted the categorical variables to factors:
Dept_1 <- as.factor(ABC$Dept)
Education_1 <- as.factor(ABC$Education)
BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
After that I split the dataset into train and test; the number of cases in both the train and test data looks correct. Then I fit the model with glm using the syntax below:
fit = glm(attrition ~ Dept_1 + Education_1 + BusinessTravel_1, binomial(link = "logit"), train)
The fit runs, but it gets executed on the entire dataset ABC with 1470 cases instead of the train dataset of 1028 records.
I am not able to understand what the issue is.

When you do this:
Dept_1 <- as.factor(ABC$Dept)
Education_1 <- as.factor(ABC$Education)
BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
you're actually creating three new variables in your global environment, not in your original data frame ABC. Because of this, when you split ABC into training and test samples, the new variables won't be affected.
When you go to fit the model, your glm call
fit = glm(attrition ~ Dept_1 + Education_1 + BusinessTravel_1, binomial(link = "logit"), train)
will look for the variables listed in the formula. It won't find them in the train dataset, but it will find them in the global environment. That's why they have the original length.
What you probably wanted is
ABC$Dept_1 <- as.factor(ABC$Dept)
ABC$Education_1 <- as.factor(ABC$Education)
ABC$BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
which will create the variables in the data frame ABC.
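As a minimal sketch of the corrected workflow (assuming train is a row subset of ABC, attrition is a column of ABC, and a hypothetical 70/30 split), convert the factors inside the data frame before splitting, then point glm at the training data explicitly:
# convert categorical columns to factors inside ABC first
ABC$Dept_1 <- as.factor(ABC$Dept)
ABC$Education_1 <- as.factor(ABC$Education)
ABC$BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
# hypothetical 70/30 split, for illustration only
set.seed(1)
idx <- sample(nrow(ABC), size = 0.7 * nrow(ABC))
train <- ABC[idx, ]
test <- ABC[-idx, ]
# the formula now finds the factor columns in train, not the global environment
fit <- glm(attrition ~ Dept_1 + Education_1 + BusinessTravel_1,
           family = binomial(link = "logit"), data = train)
nobs(fit)  # should equal nrow(train)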


Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data, so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal).
From what I've read online, Amelia is the best option for panel data, so I used it like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country", polytime = 1,
                      intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
  stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
  imps <- amelia_data$imputations[[which_imp]]
  as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75*nrow(tree_data), replace=FALSE)
I_train <- tree_data[I_index,]
I_test <- tree_data[-I_index,]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple of resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that lets you quantify the uncertainty the missing data adds to your analysis.
All the datasets created by multiple imputation are plausible imputations; because of that uncertainty, you don't know which one is correct.
You would therefore use multiple imputation the following way:
1. Create your m imputed datasets.
2. Build your trees on each imputed dataset separately (see the sketch below).
3. Do your analysis on each tree separately.
4. In your final paper, you can then state how much uncertainty is caused by the missing values/imputation.
This means you get, e.g., 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. And if you get completely different results for each imputed dataset, you know there is too much uncertainty caused by the missing values to give reliable results.
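A minimal sketch of that workflow, reusing the amelia_data object and the rpart formula from the question (one tree per imputed dataset instead of rbind-ing them together):
library(rpart)
# fit one tree per imputed dataset; amelia_data$imputations is a list of data frames
trees <- lapply(amelia_data$imputations, function(d) {
  rpart(total_fatal ~ managerial_value, data = d)
})
# compare the m trees to see how much the imputation uncertainty changes the result
lapply(trees, print)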

How to use Amelia package to get a best time series model in R

I'm trying to handle the missing data in a data frame using multiple imputation, and my professor advised me to use the Amelia package. I can build the time series model, but when I try to use the lapply function to run the time series model on each imputed dataset, I get an error about the function passed to lapply.
My data frame has three variables: date, pm25, pm10. I can build an AR model for pm25.
And the imputation code is:
imp <- amelia(Exetertibble, m=50, ts = "date")
So I can get 50 imputations, and the time series model would like this:
model1 <- arima(imp$imputations$imp1$pm25, order = c(1,0,0))
Then I tried to use the lapply function:
extractcoefs <- lapply(imp$imputations, coef(model1))
This gives an error saying that coef(model1) is not a function, character, or symbol.
My aim is to combine the 50 imputations and get the best estimate of the coefficients of the time series model, but I don't know how to write a correct function there.
I also tried:
extractcoefs <- lapply(imp$imputations, coef(arima(order=c(1,0,0))))
and:
extractcoefs <- lapply(imp$imputations, arima(order=c(1,0,0)$coef))
It isn't entirely clear what you are trying to do. Look at this example for lapply:
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the list mean for each list element
lapply(x, mean)
So you give lapply a list and it applies a function to each of the list elements. In this case the function is mean().
So for this example you get the mean for a, beta and logic.
You are using lapply on imp$imputations.
You got imp from your call to the amelia() function, which returns an instance of the S3 class "amelia". This instance includes several objects; one of these is the list imputations, whose elements are the imputed datasets (in your case 50).
So lapply(imp$imputations, coef(model1)) would apply the second argument to each imputed dataset. The only problem is that your second argument isn't really a function. Also, you can't apply coef() to an imputed dataset: you must apply coef() to a model object, because it returns the coefficients from a fitted model.
I guess you want to do the following:
1. Generate your m = 50 imputed datasets.
2. Build an ARIMA model for each dataset.
3. Get the coefficients for each of these models.
You could just use a for loop over the m = 50 datasets for this.
Take this as an example:
library(Amelia)

data(africa)
imp <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc", m = 5)

# fit an AR(1) model on each imputed dataset and print its coefficients
# (order = c(1, 0, 0) added here to match the AR model from the question)
for (i in 1:length(imp$imputations)) {
  model <- arima(imp$imputations[[i]]$gdp_pc, order = c(1, 0, 0))
  coe <- coef(model)
  print(coe)
}
This gives you one set of coefficients per imputed dataset (m = 5 in the example; with your m = 50 you would get 50 results of coef(), one for each ARIMA model built on an imputed dataset).
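Since the question was specifically about lapply, here is an equivalent sketch: the second argument to lapply must be an actual function, so wrap the model fit and the coef() call in an anonymous function:
# each imputed dataset d is passed to the anonymous function in turn
coefs <- lapply(imp$imputations, function(d) {
  coef(arima(d$gdp_pc, order = c(1, 0, 0)))
})
coefs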

How to predict dependent values using fitted model in r?

I am fitting a model with:
var4pca <- lm(lg[5:415,1] ~ pcalg1$x[, 1:8] + pcalg2$x[, 1:8] + pcalg3$x[, 1:8] + pcalg4$x[, 1:8])
I now want to predict values for a validation set (83 rows). How can I do this?
I am trying to use:
pred_pca<-predict(var4pca, va)
where va is my validation set. But this returns a vector of length 411, whereas I only want length 83.
In my experience, lm is very fussy about prediction: it demands that the new data look exactly like the data used to create the model, meaning things like column names have to match. Your formula refers to pcalg1$x[, 1:8] and friends directly, so predict() re-evaluates those full-length objects instead of looking anything up in va; that is why you get 411 predictions. What typically works is to create a data frame of all the data and then take df.train and df.test as the appropriate rows of that data frame. That should do the trick. As joran says, be careful with formulas. One advantage of putting all the data into a data frame with named columns is that you can then use the formula depvar ~ . - typically much easier to write.
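A minimal sketch of that advice (with hypothetical column names), showing why predict() then returns one value per row of the new data:
set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
df.train <- df[1:80, ]
df.test <- df[81:100, ]
# the formula references only column names, so predict() can match them
# by name in the new data
fit <- lm(y ~ ., data = df.train)
pred <- predict(fit, newdata = df.test)
length(pred)  # 20: one prediction per row of df.test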

column names - xgboost predict on new data

I have never productionised an xgboost model and am concerned about how to handle predictions on fresh data, specifically when the column names do not match the trained model's sparse matrix column names - either because new columns are added or because certain columns drop out when the fresh data is converted to a sparse matrix.
What happens if I try to predict with an xgboost model on new data that has extra or missing column names? I can see this definitely occurring and would like to write code to account for it so that predictions are correct. I would prefer to avoid hacking together a solution if a more elegant one already exists.
So, specifically: if the new data's sparse matrix has different column names, then what?
My best guess is to factorise (with levels based on the training data's levels) > create the sparse matrix > then remove the non-matching columns (between the trained dataset and the new data).
I have created dummy data (in below code) as an example of prediction errors given different column names.
1st step = build the model (just for illustrative purposes; I know it's a bad build).
2nd step = resample the entire dataset, then predict (no problems: predictions match).
3rd step = sample only 10% of the rows, then predict - this produces prediction errors due to different column names.
Here's the code:
Step 1: create dummy data and a lazy xgboost model, just for illustrative purposes.
library(xgboost) # for xgboost algo
library(Matrix) # for sparse matrix
### Create dummy data
num_rows <- 100
set.seed(1234)
target <- runif(num_rows)
dummy_data <- data.frame(
  LETTER_SINGLE = sample(LETTERS, num_rows, replace = TRUE),
  DOUBLE_LETTER = paste(sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE), sep = ""),
  TRIPLE_LETTER = paste(sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE),
                        sample(LETTERS, num_rows, replace = TRUE), sep = ""),
  stringsAsFactors = FALSE
)
## STEP 1 CREATE XGBOOST MODEL AND GET PREDICTED VALUES TO COMPARE WITH FUTURE DATA CUTS.
model_data_01 <- dummy_data
target_01 <- target
# create matrix
model_01_sparse <- sparse.model.matrix(~ .-1, data = model_data_01)
# colnames model 1
colnames_trained_model <- colnames(model_01_sparse)
# train a model
xgb_fit_01 <- xgboost(data = model_01_sparse,
                      label = target_01,
                      # param = best_param,
                      nrounds = 100,
                      verbose = T)
pred_01 <- predict(xgb_fit_01,newdata=model_01_sparse)
Step 2: test whether the order of observations causes differences in predictions. Spoiler: no prediction errors occur.
## STEP 2 CREATE SHUFFLED DATA (SAME DATA SAMPLES BUT SHUFFLED) THEN PREDICT AND COMPARE.
sample_order <- sample(1:num_rows)
model_data_shuffled <- dummy_data[sample_order,]
target_shuffled <- target[sample_order]
# They are different
head(model_data_01)
head(model_data_shuffled)
# create matrix
model_shuffled_sparse <- sparse.model.matrix(~ .-1, data = model_data_shuffled)
# colnames model 1
colnames_shuffled <- colnames(model_shuffled_sparse)
pred_shuffled <- predict(xgb_fit_01,newdata=model_shuffled_sparse)
# check if predictions differ
pred_01[sample_order] - pred_shuffled
## This matched. Yay. sparse.model.matrix function must first sort alphabetically then create column names.
# due to same column names
mean(colnames_trained_model == colnames_shuffled)
Step 3: sample only a handful of rows and predict, to see whether missing columns in the sparse matrix cause prediction errors.
## STEP 2 WORKED FINE SO ONTO...
## STEP 3 RANDOMLY SAMPLE ONLY A HANDFUL OF ROWS PREDICT AND COMPARE.
sample_order_02 <- sample(1:(num_rows*0.1))
model_data_shuffled_02 <- dummy_data[sample_order_02,]
target_shuffled_02 <- target[sample_order_02]
# create matrix
model_shuffled_sparse_02 <- sparse.model.matrix(~ .-1, data = model_data_shuffled_02)
# colnames model 1
colnames_shuffled_02 <- colnames(model_shuffled_sparse_02)
pred_shuffled_02 <- predict(xgb_fit_01,newdata=model_shuffled_sparse_02)
# check if predictions differ
pred_01[sample_order_02] - pred_shuffled_02
## This did not match. Damn.
# Due to different column names
colnames_trained_model
colnames_shuffled_02
mean(colnames_trained_model == colnames_shuffled_02)
As you can see, this last attempt produces variance in the predicted values due solely to the missing column names in the sparse matrix.
I don't want to hack an ugly solution together if an elegant one exists for me to learn from.
So my question is: is there an elegant way to force the sparse model matrix column names to match those of the built model (the one used for predictions on new data)?
I have searched the web and had no luck thus far finding any best-practice solution.
If anybody could help by answering the question or pointing me in the right direction, that would be much appreciated.
What is your production environment? R, Python, Java or something else?
The idea is to use XGBoost functionality (both training and prediction) via a production environment-specific wrapper library, not directly. For example, in Python you could use the Scikit-Learn wrappers, which encapsulate feature engineering and selection tasks in a reusable sklearn.pipeline.Pipeline object. You would 1) fit the pipeline object (where the XGBoost estimator is the final step) in the development environment and serialize it to a pickle file, 2) move the pickle file from the development to the production environment, and 3) de-serialize it from the pickle file and use it for transforming new data in the production environment. This is a high-level API which completely abstracts away low-level details such as the layout of XGBoost's "internal" data matrices.
For a platform-independent solution, you could export XGBoost models (and associated data pre-processing logic) in the standardized PMML representation.
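If you stay in R, one possible approach (a sketch of my own, not from the answer above) is to rebuild every new sparse matrix against the training layout: start from an all-zero sparse matrix with the training column names and copy over whichever columns the new data produced. Columns the model never saw are dropped, and absent ones stay zero, which is how the one-hot encoding represents an unobserved level anyway:
library(Matrix)
# hypothetical helper: pad/trim a new sparse matrix to the training layout
align_columns <- function(new_sparse, train_colnames) {
  aligned <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                          dims = c(nrow(new_sparse), length(train_colnames)),
                          dimnames = list(NULL, train_colnames))
  common <- intersect(colnames(new_sparse), train_colnames)
  aligned[, common] <- new_sparse[, common]  # unknown columns stay all-zero
  aligned
}
pred_aligned <- predict(xgb_fit_01,
                        newdata = align_columns(model_shuffled_sparse_02,
                                                colnames_trained_model))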

How does randomForest() predict for new factor levels not in training data?

When I create a training set and a test set by splitting a single data frame and build a random forest using the randomForest package, the predict() function still returns an output for factor levels that are not present in the training data. While this gives no error (which was what I was looking for in the related question), my question is: on what basis does the randomForest() model predict the value? Ideally it should have thrown the following error...
Error in predict.randomForest() :
New factor levels not present in the training data
I want to know, just out of curiosity, whether the randomForest() method makes some inherent assumption about new factor levels in the test data.
Here's a reproducible example:
seq1 <- c(5,3,1,3,1,"unwanted_char",4,2,2,3,0,4,1,1,0,1,0,1)
df1 <- matrix(seq1,6)
df1 <- as.data.frame(df1)
colnames(df1) <- c("a","b","c")
train <- df1[1:4,]
test <- df1[5:6,]
Now when we create a forest using train and run predict() on test as follows...
forest1 <- randomForest(c~a+b,data=train,ntree=500)
test$prediction <- predict(forest1,test,type='response')
The test matrix contains a prediction of '1' for the last observation, which has a = 'unwanted_char' and b = '4'.
Please note: when you create the test and train data separately, the predict function throws the above-mentioned error instead of predicting.
My opinion is that this is a very bad example, but here's the answer:
Your df1 has only factor variables and 4 observations. Here, mtry will equal 1, meaning that roughly half of your trees will be based on b alone and half on a alone. When b == '4' the classification is always 1, i.e. b == '4' perfectly predicts c == 1. Similarly, a == '1' perfectly predicts c == 0.
The reason this works when you create the data in a single dataset is that the variables are factor variables whose possible levels exist in both train and test, even though the observed counts for some levels are 0 in train. Since "unwanted_char" is a possible level in train$a (although unobserved), it's not problematic for your prediction. If you create these as separate datasets, the factor variables are created distinctly and test has new levels.
That is to say that, essentially, your problem works because you do not understand how factors work in R.
I concur with Alex that this is not a good example.
Here is the answer to your question:
str(train)
If you check the structure of your train data, you will see that the variable 'a' still has all 4 levels, because the levels were assigned when you created the data frame df1.
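To make that concrete (assuming stringsAsFactors = TRUE, the pre-R-4.0 default under which the as.data.frame() call in the question creates factor columns):
# the levels were fixed when df1 was created from all 6 rows, so the
# subset train (rows 1:4) still carries the unobserved level
levels(train$a)
#> [1] "1" "3" "5" "unwanted_char"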
