How to handle errors in predict function of R?

I have a dataframe df, and I am building a machine learning model (a C5.0 decision tree) to predict the class of the column loan_approved:
Structure (not real data):
id  occupation  income   loan_approved
1   business    4214214  yes
2   business    32134    yes
3   business    43255    no
4   sailor      5642     yes
5   teacher     53335    no
6   teacher     6342     no
Process:
I randomly split the data frame into train and test sets and trained on the train set (rows 1, 2, 3, 5, 6 as train and row 4 as test)
In order to account for new categorical levels in one or more columns, I used the tryCatch function
Function:
error_free_predict <- function(x) {
  output <- tryCatch({
    predict(C50_model, newdata = test[x, ], type = "class")
  }, error = function(e) {
    "no"
  })
  return(output)
}
Applied the predict function:
test <- mutate(test, predicted_class = error_free_predict(1:NROW(test)))
Problem:
id  occupation  income   loan_approved  predicted_class
1   business    4214214  yes            no
2   business    32134    yes            no
3   business    43255    no             no
4   sailor      5642     yes            no
5   teacher     53335    no             no
6   teacher     6342     no             no
Question:
I know this is because the test data frame had a new level that was not present in the train data, but shouldn't my function work for all cases except this one?
P.S.: I did not use sapply because it was too slow

There are two parts to this problem.
The first part of the problem arises during training of the model, because categorical levels may not be evenly divided between train and test if you split randomly. In your case, say you have only one record with occupation "sailor"; it is then possible that it ends up in the test set after a random split. A model built on the train set would never have seen the occupation "sailor" and hence would throw an error. More generally, it is possible for some level of another categorical variable to end up entirely in the test set after random splitting.
So instead of splitting the data randomly between train and test, you can do stratified sampling. Code using data.table for a 70:30 split:
ind <- total_data[, sample(.I, round(0.3*.N), FALSE),by="occupation"]$V1
train <- total_data[-ind,]
test <- total_data[ind,]
This makes sure every level is divided proportionally between the train and test datasets, so you will not get a "new" categorical level in the test dataset, which could happen with random splitting.
The second part of the problem arises when the model is in production and encounters an altogether new level that was present in neither the training nor the test set. To tackle this, you can maintain a list of all levels of every categorical variable, e.g.
lvl_cat_var1 <- unique(cat_var1), lvl_cat_var2 <- unique(cat_var2), etc. Then, before predicting, check for new levels and filter:
new_lvl_data <- total_data[!(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)]
pred_data <- total_data[(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)]
then for the default prediction do:
new_lvl_data$predicted_class <- "no"
and run the full prediction on pred_data.
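Putting that second part together, a minimal runnable sketch (the column names and the "no" default are illustrative, and the real model prediction is left as a comment):

```r
# Levels the model saw during training
train <- data.frame(occupation = c("business", "teacher", "teacher"),
                    stringsAsFactors = FALSE)
known_lvls <- unique(train$occupation)

# New data arriving in production, including an unseen level ("sailor")
total_data <- data.frame(occupation = c("business", "sailor", "teacher"),
                         stringsAsFactors = FALSE)

seen <- total_data$occupation %in% known_lvls
new_lvl_data <- total_data[!seen, , drop = FALSE]
pred_data    <- total_data[seen, , drop = FALSE]

# Default prediction for rows with unseen levels ...
new_lvl_data$predicted_class <- "no"
# ... and the trained model would predict on pred_data here, e.g.
# pred_data$predicted_class <- predict(C50_model, pred_data, type = "class")
```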

I generally handle this using a loop in which any levels absent from the training data are recoded as NA. Here train is the data that you used for training the model and test is the data that will be used for prediction.
for (i in 1:ncol(train)) {
  if (is.factor(train[, i])) {
    test[, i] <- factor(test[, i], levels = levels(train[, i]))
  }
}
tryCatch is an error handling mechanism, i.e. it runs after the error has been encountered. It is not applicable unless you want to do something different once the error occurs. If you still want to run the model, this loop will take care of the new levels.
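As a quick illustration of what the loop does (toy data; the column name is made up):

```r
train <- data.frame(occupation = factor(c("business", "teacher")))
test  <- data.frame(occupation = factor(c("business", "sailor")))

# Recode each factor column in test to the training levels;
# levels unseen in training become NA
for (i in 1:ncol(train)) {
  if (is.factor(train[, i])) {
    test[, i] <- factor(test[, i], levels = levels(train[, i]))
  }
}
# test$occupation is now "business", NA -- "sailor" was not a training level
```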

Related

T test using column variable from 2 different data frames in R

I am attempting to conduct a t test in R to try and determine whether there is a statistically significant difference in salary between US and foreign born workers in the Western US. I have 2 different data frames for the two groups based on nativity, and want to compare the column variable I have on salary titled "adj_SALARY". For simplicity, say that there are 3 observations in the US_Born_west frame, and 5 in the Immigrant_West data frame.
US_born_West$adj_SALARY=30000, 25000,22000
Immigrant_West$adj_SALARY=14000,20000,12000,16000,15000
#Here is what I attempted to run:
t.test(US_born_West$adj_SALARY~Immigrant_West$adj_SALARY, alternative="greater",conf.level = .95)
However I received this error message: "Error in model.frame.default(formula = US_born_West$adj_SALARY ~ Immigrant_West$adj_SALARY) :
variable lengths differ (found for 'Immigrant_West$adj_SALARY')"
Any ideas on how I can fix this? Thank you!
US_born_West$adj_SALARY and Immigrant_West$adj_SALARY are of unequal length, so using the formula interface of t.test gives an error about that. We can pass them as individual vectors instead.
t.test(US_born_West$adj_SALARY, Immigrant_West$adj_SALARY,
alternative="greater",conf.level = .95)
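A self-contained version with the example values from the question:

```r
US_born_West   <- data.frame(adj_SALARY = c(30000, 25000, 22000))
Immigrant_West <- data.frame(adj_SALARY = c(14000, 20000, 12000, 16000, 15000))

# Two-sample (Welch) t-test, passing the columns as separate vectors
res <- t.test(US_born_West$adj_SALARY, Immigrant_West$adj_SALARY,
              alternative = "greater", conf.level = 0.95)
res$p.value
```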

Build loop to use increasing part of dataframe in R as input to function

I'm using the first principal component from a PCA analysis as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time, the model updates and produces a new forecast based on the new observation included into the model. Since PCA uses data from all observations included in the model for its calculations, I need to run also the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise, the PCA-result could reveal information about the future, and help the model produce a more accurate answer than it would have otherwise). I think a loop might be the solution, but I am struggling with how to formulate the code.
As a more specific example, consider if I have the following data.frame
data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")
> data
V1 V2 V3
1 6 15 23
2 9 11 22
3 7 13 23
4 6 12 25
5 7 13 23
At each observation date, I wish to run a PCA (function prcomp() from the stats package) on all observations up to and including that observation. So first I want to run PCA for the first two observations
pca2 <- prcomp(data[1:2,], scale = TRUE)
next I want to run PCA with the first, second and third observation as input
pca3 <- prcomp(data[1:3,], scale = TRUE)
next I want to run PCA with the first, second, third and fourth observation as input
pca4 <- prcomp(data[1:4,], scale = TRUE)
and so on, until the last run of the PCA, which includes all observations in the dataframe. For each of these "runs" of the PCA, I wish to extract the last value (though for pca2, I use both the first and second value) of the first principal component (PC1), and merge these into a final dataframe, where each monthly observation is the last value of the first principal component of PCA results for each of the runs.
The principal component outputs are:
> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
PC1 PC2
1 -1.224745 -5.551115e-17
2 1.224745 5.551115e-17
> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
PC1 PC2 PC3
1 -1.4172321 -0.2944338 6.106227e-16
2 1.8732448 -0.1215046 3.330669e-16
3 -0.4560127 0.4159384 4.163336e-16
> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
PC1 PC2 PC3
1 -1.03030993 -1.10154914 0.015457199
2 2.00769890 0.07649216 0.011670433
3 0.03301806 -0.24226508 -0.033461874
4 -1.01040702 1.26732205 0.006334242
So I want my final output to be a dataframe to look like
>final.output
PC1
1 -1.224745
2 1.224745
3 -0.4560127
4 -1.01040702
Comment: yes, it looks a bit weird with the two first values, but please don't pay too much attention to that. My point is that I wish to build a dataframe that consists of the last calculated value for the first principal component for each of the PCA runs.
I am thinking that a for.loop might be the best solution here, but I have not been successful in finding any threads that might guide me closer to a coding solution. How can I make the loop use an increasing amount of the dataframe in the calculations? Does anyone have any suggestions/tips/links? Any help on this is much appreciated!
I had a very similar approach.
PCA <- vector("list", length = nrow(data) - 1)
for (i in 1:(nrow(data) - 1)) {
  if (i == 1) j <- 1:2 else j <- i + 1
  PCA[[i]] <- as.data.frame(prcomp(data[1:(1 + i), ], scale = TRUE)$x)[j, 1]
}
unlist(PCA)
You can use a for loop. It's maybe not the most efficient solution, but it will work.
First, you create an empty list to store your results:
all_results <- list()
Next, you iterate from 2 to the number of rows of data with a loop. For each iteration, run prcomp on data[1:i,]. You can directly create your pca data frame and extract PC1 from it as a vector, then store it in the list at index i - 1:
for (i in 2:nrow(data)) {
  all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i, ], scale = TRUE)$x)$PC1
}
Now to extract all the results, you use lapply (list apply) to extract only the last element from each PC1 vector:
PC1 <- lapply(all_results, function(pca) pca[length(pca)] )
Now you convert these from a list of single elements to a vector:
PC1 <- do.call("c", PC1)
Finally, you want to stick the first value of the first analysis back on to the front of this vector:
PC1 <- c(all_results[[1]][1], PC1)
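Putting those steps together into one runnable sketch, using the data frame from the question:

```r
data <- as.data.frame(rbind(c(6, 15, 23), c(9, 11, 22), c(7, 13, 23),
                            c(6, 12, 25), c(7, 13, 23)))
names(data) <- c("V1", "V2", "V3")

all_results <- list()
for (i in 2:nrow(data)) {
  # PCA on the first i rows only, so no future information leaks in
  all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i, ], scale = TRUE)$x)$PC1
}

# Last PC1 value from each run, with the first value of the first run prepended
PC1 <- sapply(all_results, function(pca) pca[length(pca)])
final.output <- data.frame(PC1 = c(all_results[[1]][1], PC1))
```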

SVM Predict Levels not matching between test and training data

I'm trying to predict a binary classification problem dealing with recommending films.
I've got a training data set of 50 rows (movies) and 6 columns (5 movie attributes and a consensus on the film).
I then have a test data set of 20 films with the same columns.
I then run
pred<-predict(svm_model, test)
and receive
Error in predict.svm(svm_model, test) : test data does not match model !.
From similar posts, it seems that the error is because the levels don't match between the training and test datasets. This is true and I've proved it by comparing str(test) and str(train). However, both datasets come from randomly selected films and will always have different levels for their categorical attributes. Doing
levels(test$Attr1) <- levels(train$Attr1)
changes the actual column data in test, thus rendering the predictor incorrect. Does anyone know how to solve this issue?
The first half dozen rows of my training set are in the following link.
https://justpaste.it/1ifsx
You could do something like this, assuming Attr1 is a character:
Create a levels vector with the unique values of Attr1 from both test and train.
Re-create the factor on both the train and test Attr1 columns with all the levels found in step 1.
levels <- unique(c(train$Attr1, test$Attr1))
test$Attr1 <- factor(test$Attr1, levels=levels)
train$Attr1 <- factor(train$Attr1, levels=levels)
If you do not want factors, wrap part of the code in as.integer and you will get numbers instead of factors. That is sometimes handier in models like xgboost and saves on one-hot encoding.
as.integer(factor(test$Attr1, levels=levels))
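The same idea extends to every shared factor/character column in one loop; a sketch with toy data (the values for Attr1 are made up):

```r
train <- data.frame(Attr1 = c("action", "drama"), stringsAsFactors = FALSE)
test  <- data.frame(Attr1 = c("comedy", "drama"), stringsAsFactors = FALSE)

# Re-factor each column shared by train and test using the union of levels,
# so predict() sees identical level sets in both data sets
for (col in intersect(names(train), names(test))) {
  lvls <- unique(c(train[[col]], test[[col]]))
  train[[col]] <- factor(train[[col]], levels = lvls)
  test[[col]]  <- factor(test[[col]], levels = lvls)
}
```

Unlike levels(test$Attr1) <- levels(train$Attr1), this does not overwrite the underlying data; it only harmonizes the level sets.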

Multinomial logit model in R on grouped data, data conversion and mlogit set-up

I want to estimate the parameters of a multinomial logit model in R and wondered how to correctly structure my data. I’m using the “mlogit” package.
The purpose is to model people's choice of transportation mode. However, the dataset is a time series on aggregated level, e.g.:
This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:
For every individual's choice in the grouped data I make three new rows and use chid to tie these three
rows together. I now want to run :
mlogit.data(MyData, choice = "choice", chid.var = "chid", alt.var = "mode")
Is this the correct approach? Or have I misunderstood the purpose of the chid function?
It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.
The mlogit package expects data on individuals, and can accept either "wide" or "long" data. In the wide format there is one row per individual indicating the mode chosen, with separate columns for each combination of mode and mode-specific variable (time and price in your example). In the long format there are n rows per individual, where n is the number of modes, a second column containing TRUE or FALSE indicating which mode was chosen, and one additional column for each mode-specific variable. Internally, mlogit uses long-format datasets, but you can provide the wide format and have mlogit transform it for you. In this case, with just two variables, that might be the better option.
Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:
df.agg <- data.frame(month=1:4,car=c(3465,3674,3543,4334),bus=c(1543,2561,2432,1266),bicycle=c(453,234,123,524))
df.lvl <- data.frame(mode=c("car","bus","bicycle"), price=c(120,60,0), time=c(5,10,30))
get.mnth <- function(mnth) data.frame(mode=rep(names(df.agg[2:4]),df.agg[mnth,2:4]),month=mnth)
df <- do.call(rbind,lapply(df.agg$month,get.mnth))
cols <- unlist(lapply(df.lvl$mode,function(x)paste(names(df.lvl)[2:3],x,sep=".")))
cols <- with(df.lvl,setNames(as.vector(apply(df.lvl[2:3],1,c)),cols))
df <- data.frame(df, as.list(cols))
head(df)
# mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1 car 1 120 5 60 10 0 30
# 2 car 1 120 5 60 10 0 30
# 3 car 1 120 5 60 10 0 30
# 4 car 1 120 5 60 10 0 30
# 5 car 1 120 5 60 10 0 30
# 6 car 1 120 5 60 10 0 30
Now we can use mlogit(...)
library(mlogit)
fit <- mlogit(mode ~ price+time|0 , df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
# bicycle bus car
# 0.055234 0.323037 0.621729
#
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# price 0.0047375 0.0003936 12.036 < 2.2e-16 ***
# time -0.0740975 0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
# time
# -15.64069
So this suggests that reducing travel time by 1 (minute?) is worth about 15 (dollars)?
This analysis ignores the month variable. It's not clear to me how you would incorporate it, as month is neither mode-specific nor individual-specific. You could "pretend" that month is individual-specific and use a model formula like mode ~ price+time|month, but with your dataset the system is computationally singular.
To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).
There's a nice tutorial on mlogit here.
Are price and time real variables that you're trying to make a part of the model?
If not, then you don't need to "unaggregate" that data. It's perfectly fine to work with counts of the outcomes directly (even with covariates). I don't know the particulars of doing that in mlogit but with multinom, it's simple, and I imagine it's possible with mlogit:
# Assuming your original data frame is saved in "df" below
library(nnet)
response <- as.matrix(df[,c('Car', 'Bus', 'Bicycle')])
predictor <- df$Month
# Determine how the multinomial distribution parameter estimates
# are changing as a function of time
fit <- multinom(response ~ predictor)
In the above case the counts of the outcomes are used directly with one covariate, "Month". If you don't care about covariates, you could also just use multinom(response ~ 1) but it's hard to say what you're really trying to do.
Glancing at the "TravelMode" data in the mlogit package and some examples for it though, I do believe the options you've chosen are correct if you really want to go with individual records per person.

Looping a Student T-Test and Chi-Squared with Missing Data in R

I am trying to use R to run a student t-test and a chi squared test with large data sets. Since I am fairly new to R my inexperience has been preventing much success in my own code.
Both data sets have missing data and look something like this:
AA          assayX activity  assayY1 activity  assayY2 activity
chemical 1  TRUE             0                 12.2
chemical 2  TRUE             0
chemical 3                   45.2              35.6
chemical 4  FALSE            0                 0

AB          assayX activity  assayY1 activity  assayY2 activity
chemical 1  TRUE             FALSE             TRUE
chemical 2  TRUE             FALSE
chemical 3                   TRUE              TRUE
chemical 4  FALSE            FALSE             FALSE
Since it is a large data set, I am trying to write code that compares assayX to all the assayYs. I'm hoping to create a Student's t-test loop for the first data set and a chi-squared loop for the second data set. I had previously been successful creating loop code for a correlation analysis, so I based my code on that idea.
x<- na.omit(mydata1[, c(assayX)])
y<- na.omit(mydata1[, c(assayY1:assayYend)])
lapply(y, function(x)t.test(y~x))
x<-na.omit(mydata2[, c(assayX)])
y <- na.omit(mydata2[, c(assayY1:assayYend)])
lapply(y, x=x, chisq.test)
Problem with the first code is:
Invalid variable y
Problem with the second code is:
x and y must have the same length
I've done small tweaks here and there and have just got different types of errors like not enough 'y' observations and so on. I've been primarily using this site to figure out how work R, so I'm hoping you guys will have a clever little solution for a new guy.
After a long time and having gained experience in R, I can answer my own question. First, make the data file read blanks as NA:
df1 <- read.csv("data2.csv", header=T, na.strings=c("","NA"))
Then for the Student's t-test:
df1.p <- rep(NA, length(df1$Assays))
for (i in seq_along(df1$Assays)) {
  test <- t.test(df1[, i] ~ df1$assay.activity)
  df1.p[i] <- test$p.value
}
Then to add a Pearson's correlation or chi-squared test (the latter is not actually appropriate for this dataset, but just as an example):
df1.p.2 <- rep(NA, length(df1$Assays))
df1.r.2 <- rep(NA, length(df1$Assays))
for (i in seq_along(df1$Assays)) {
  test2 <- cor.test(df1$assay.activity, df1[, i], method = "pearson")
  df1.p.2[i] <- test2$p.value
  df1.r.2[i] <- test2$estimate
}
df2= cbind(df1$Assays, df1.p, df1.p.2, df1.r.2)
I then filtered it for only assays with 0.1 significance, but that wasn't the question here. If you want to know that, just ask a question and I'll post an answer there :)
I don't think your data is being passed correctly to the test. t.test has arguments for whether the data is paired or not (the default is unpaired) and for how to handle NAs, should you want to change the default. You should probably use those rather than omitting NAs up front. An example with NAs in the data:
set.seed(1)
y <- runif(30, 0, 1)
y.NA <- c(3,24,27)
y[y.NA] <- NA
x <- runif(30, 0, 1)
x.NA <- c(1,3,8,12,21)
x[x.NA] <- NA
t.test(x,y)
For chisq.test you can use the table function.
chisq.test(table(x,y))$p.value
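For the looping part of the question, a hedged sketch that runs a t-test of every assay column against a two-level activity column, letting t.test drop the NAs itself (the column names here are invented):

```r
df <- data.frame(
  activity = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  assay1   = c(12.2, NA, 0, 0, 45.2, 3.1),
  assay2   = c(0, 5.5, NA, 1.2, 35.6, 0.4)
)

assay_cols <- setdiff(names(df), "activity")
# One p-value per assay column; rows with NA are dropped test by test
p_values <- sapply(assay_cols, function(col) {
  t.test(df[[col]] ~ df$activity)$p.value
})
```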
