Wrong predictions after loading a model with Keras load_model(), as if no training had happened (only with the EMNIST dataset)

I trained my model on the EMNIST byclass dataset, loading the training and testing data and labels from .csv files, to classify digits (0-9) and letters (A-Z, a-z). The model evaluation accuracy is around 87%. But when I load the best saved model weights (.hdf5) with Keras load_model(), prediction gives me weird results, as if no training had been done. Yet if I run the model evaluation after loading the model, it still gives me 87% accuracy.
So what could be the problem when I predict on a new image and get a wrong prediction?
Thanks

One more piece of information about the issue above:
I am applying the same pre-processing to the EMNIST training/validation data that I used for the MNIST data, and the same code predicts correctly with the MNIST dataset. I switched to EMNIST to predict both letters and digits (A-Z, a-z, 0-9), since MNIST has only digits. Has anyone used the EMNIST dataset for letter and digit prediction and gotten correct predictions? Or is the EMNIST dataset still not suitable for such prediction? If so, I am surprised that I am getting an evaluation accuracy of 88%.

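It seems that the EMNIST dataset images are rotated and mirrored. You should apply the same rotation and mirroring to your own image before feeding it to your net, and don't forget bit-wise inversion and thresholding.
import cv2
import numpy as np
#assumes `model` is the trained Keras model already restored with load_model()
x1 = cv2.imread('c9.jpg')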
x2 = cv2.cvtColor(x1, cv2.COLOR_BGR2GRAY)
ret,x3 = cv2.threshold(x2, 127, 255, cv2.THRESH_BINARY)
#compute a bit-wise inversion so black becomes white and vice versa
x4 = np.invert(x3)
#make it the right size
x5 = cv2.resize(x4, (28, 28))
#rotate and flip
rows,cols = x5.shape
M = cv2.getRotationMatrix2D((cols/2,rows/2),270,1)
dst = cv2.warpAffine(x5,M,(cols,rows))
flip = cv2.flip(dst,1)
#convert to a 4D tensor to feed into our model
x6 = flip.reshape(1,28,28,1)
x7 = x6.astype('float32')
x7 /= 255
out = model.predict(x7)
print(np.argmax(out))
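As a side note on the rotate-and-flip step: in NumPy terms, a 90° clockwise rotation followed by a horizontal flip is the same as a transpose. The sketch below illustrates that equivalence on a dummy array; the helper name to_emnist_orientation is hypothetical, and it assumes (as the answer suggests) that the EMNIST images are stored rotated and mirrored relative to an upright image.

```python
import numpy as np

def to_emnist_orientation(image):
    """Rotate 90 degrees clockwise, then mirror left-right (hypothetical helper)."""
    rotated = np.rot90(image, k=-1)   # k=-1 means one clockwise quarter turn
    return np.fliplr(rotated)         # horizontal flip

# Dummy 28x28 "image" with distinct pixel values so the check is meaningful
sample = np.arange(28 * 28).reshape(28, 28)
oriented = to_emnist_orientation(sample)

# For a 2D array, the combined operation equals a plain transpose
same_as_transpose = np.array_equal(oriented, sample.T)
print(same_as_transpose)
```

If that assumption holds, sample.T would be a simpler way to put a 28x28 input into the same orientation as the training data.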

R: How to check which model of an ensemble algorithm has been selected to perform regression?

I am using the R package machisplin (it's not on CRAN) to downscale a satellite image. According to the description of the package:
The machisplin.mltps function simultaneously evaluates different combinations of the six algorithms to predict the input data. During model tuning, each algorithm is systematically weighted from 0-1 and the fit of the ensembled model is evaluated. The best performing model is determined through k-fold cross validation (k=10) and the model that has the lowest residual sum of squares of test data is chosen. After determining the best model algorithms and weights, a final model is created using the full training dataset.
My question is how can I check which model out of the 6 has been selected for the downscaling? To put it differently, when I export the downscaled image, I would like to know which algorithm (out of the 6) has been used to perform the downscaling.
Here is the code:
library(MACHISPLIN)
library(raster)
library(gbm)
evi = raster("path/evi.tif") # covariate
ntl = raster("path/ntl_1600.tif") # raster to be downscaled
##convert one of the rasters to a point dataframe to sample. Use any raster input.
ntl.points<-rasterToPoints(ntl,
fun = NULL,
spatial = FALSE)
##subset only the x and y data
ntl.points<- ntl.points[,1:2]
##Extract values to points from rasters
RAST_VAL<-data.frame(extract(ntl, ntl.points))
##merge sampled data to input
InInterp<-cbind(ntl.points, RAST_VAL)
#run an ensemble machine learning thin plate spline
interp.rast<-machisplin.mltps(int.values = InInterp,
covar.ras = evi,
smooth.outputs.only = T,
tps = T,
n.cores = 4)
#set negative values to 0
interp.rast[[1]]$final[interp.rast[[1]]$final <= 0] <- 0
writeRaster(interp.rast[[1]]$final,
filename = "path/ntl_splines.tif")
I viewed all the output parameters (please refer to Example 2 in the package description), but I couldn't find anything relevant to my question.
I have posted a question on GitHub as well. From here you can download my images.
I think this is a misunderstanding: machisplin isn't testing 6 algorithms and returning one. It tries many ensembles of the 6 and returns one ensemble. In other words, what you get is the best combination of the 6 algorithms, not one of the 6 algorithms chosen.
You will get something like "a model which is 20% algo1, 10% algo2, etc." and not "algo1 is the best and was chosen".
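To make the idea of "a weighted combination rather than a single winner" concrete, here is a minimal Python sketch. It is not MACHISPLIN's implementation: the function name, the weight grid, the normalization of weights, and the toy predictions are all assumptions made for illustration. It only shows the general pattern of trying weight combinations between 0 and 1 and keeping the blend with the lowest residual sum of squares.

```python
import itertools
import numpy as np

def best_weighted_ensemble(predictions, target, step=0.5):
    """Grid-search per-model weights in [0, 1]; keep the blend with lowest RSS.

    Hypothetical helper: `predictions` maps a model name to its predicted values.
    """
    names = list(predictions)
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_weights, best_rss = None, np.inf
    for weights in itertools.product(grid, repeat=len(names)):
        total = sum(weights)
        if total == 0:
            continue  # an all-zero weighting is not a model
        # Weighted average of the individual model predictions (assumed normalization)
        blended = sum(w * predictions[n] for w, n in zip(weights, names)) / total
        rss = float(np.sum((target - blended) ** 2))
        if rss < best_rss:
            best_weights, best_rss = dict(zip(names, weights)), rss
    return best_weights, best_rss

# Toy data: algo1 overshoots by 1, algo2 undershoots by 1
target = np.array([1.0, 2.0, 3.0])
predictions = {"algo1": target + 1.0, "algo2": target - 1.0}
weights, rss = best_weighted_ensemble(predictions, target)
print(weights, rss)
```

In this toy case neither model wins alone; the equal-weight blend fits best, which is the kind of result the answer describes.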

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival for this purpose, but I cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various Stack-related websites, but none of them provided an answer. I have also noticed a lot of confusion about what a multinomial logit regression with fixed effects is (people use different names) and about the R packages implementing it.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple choice question, each respondent makes one choice.
Respondents are asked the same question every year. There is no a priori assumption about the extent to which the choice at time t is affected by the choice at t-1.
Now imagine having panel data recording these choices. The data would look like this:
set.seed(123)
# number of observations
n <- 100
# possible choices
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * 3, 5.0, 70.5)
x2 <- sample(1:n^2, n * 3, replace = F)
# actual choice at time 1
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = T, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
x1 = x1, x2 = x2,
individual_fixed_effect = as.character(rep(1:n, years)),
time_fixed_effect = as.character(rep(1:years, each = n)),
stringsAsFactors = F)
I am new to this kind of analysis. But if I understand correctly, if I want to estimate the effects of respondents' characteristics on their choice, I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# create wide dataset
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
group.var = "time_fixed_effect",
choice = "choice",
shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed through the function strata (see this tutorial for additional details). However, it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e. I get only NAs).
library(survival)
fit <- clogit(formula("choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect)"), as.data.frame(data_mlogit))
summary(fit)
Since I was not able to find a reason for this (there must be something I am missing about the way these functions are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems able to deal explicitly with panel structures using an appropriate estimation strategy. For this reason, I decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = "individual_fixed_effect") and the group to which individuals belong (group.var = "time_fixed_effect"). In my case, the group represents the observations registered in the same year.
My formula specifies that there are no variables correlated with a specific choice, and which are randomly distributed among individuals (i.e., the variables before the |). By contrast, choices are only motivated by individual characteristics (i.e., x1 and x2).
In the help of the function mlogit, it is specified that one can use the argument panel to use panel techniques. Setting panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, i.e. rpar, is not NULL.
The argument rpar is used to specify the distribution of the random variables: i.e. the variables before the |.
The problem is that, since these variables do not exist in my case, I can't use the argument rpar and then set panel = TRUE.
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly attached to mlogit; any efficient way to perform this task would be fine for me (e.g., I am OK with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed in Stata but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.

glm summary not giving coefficients values

I'm trying to apply glm to a given dataset, but summary(model1) is not giving me the correct output. It does not give coefficient values for Estimate, Std. Error, z value, Pr(>|z|), etc.; it just gives me NA for each individual attribute.
TEXT <- c('Learned a new concept today : metamorphic testing. t.co/0is1IUs3aW','BMC Bioinformatics BioMed Central: Detecting novel ncRNAs by experimental #RNomics is not an easy task... http:/t.co/ui3Unxpx #bing #MyEN','BMC Bioinformatics BioMed Central: small #RNA with a regulatory function as a scientific ... Detecting novel… http:/t.co/wWHOEkR0vc #bing','True or false? link(#Addition, #Classification) http:/t.co/zMJuTFt8iq #Oxytocin','Biologists do have a sense of humor, especially computational bio people http:/t.co/wFZqaaFy')
NAME <- c('QSoft Consulting','Fabrice Leclerc','Sungsam Gong','Frederic','Zach Stednick')
SCREEN_NAME <-c ('QSoftConsulting','rnomics','sunggong','rnomics','jdwasmuth')
FOLLOWERS_COUNT <- c(734,1900,234,266,788)
RETWEET <- c(1,3,5,0,2)
FRIENDS_COUNT <-c(34,532,77,213,422)
STATUSES_COUNT <- c(234,643,899,222,226)
FAVOURITES_COUNT <- c(144,2677,445,930,254)
df <- data.frame(TEXT,NAME,SCREEN_NAME,RETWEET,FRIENDS_COUNT,STATUSES_COUNT,FAVOURITES_COUNT)
mydata<-df
mydata$FAVOURITES_COUNT <- ifelse( mydata$FAVOURITES_COUNT >= 445, 1, 0) #converting fav_count to binary values
Splitting data
library(caret)
split=0.60
trainIndex <- createDataPartition(mydata$FAVOURITES_COUNT, p=split, list=FALSE)
data_train <- mydata[ trainIndex,]
data_test <- mydata[-trainIndex,]
glm model
library(e1071)
model1 <- glm(FAVOURITES_COUNT~.,family = binomial, data = data_train)
summary(model1)
I want to get the p-value for further analysis. So far I think my code is right; how can I get the correct output?
A binomial distribution will only work if the dependent variable has two outcomes. You should consider a Poisson distribution when the dependent variable is a count. See here for more details: http://www.statmethods.net/advstats/glm.html
Your code for fitting the GLM is programmatically correct. However, there are a few issues:
As mentioned in the comments, for every variable that is categorical, you should use as.factor() to make it into a factor. GLM doesn't know what a "string" variable is.
As MorganBall indicated, if your data truly is count data, you may consider fitting it using a Poisson GLM, instead of converting to binary and using Logistic regression.
You indicate that you have 13 parameters and 1000 observations. While this may seem like enough data, note that some of these parameters may have very few (close to 0?) observations in them. This is a problem.
In addition, did you make sure that your data does not perfectly separate the response? If some combinations of parameters do separate the response perfectly, the maximum likelihood estimate won't converge, because the coefficients theoretically go to infinity. Practically speaking, you'll get very large standard errors for your estimates.
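The perfect-separation point can be illustrated with a small numeric sketch. It is written in Python with NumPy rather than R, uses made-up toy data, and the function name log_likelihood is hypothetical. It evaluates the logistic log-likelihood of a one-coefficient model (no intercept) on data where the sign of x determines y exactly, and shows that the log-likelihood keeps improving as the coefficient grows, so no finite maximum exists.

```python
import numpy as np

def log_likelihood(beta, x, y):
    """Logistic regression log-likelihood for a single slope, no intercept."""
    z = beta * x
    # sum of y*z - log(1 + exp(z)), computed stably with logaddexp
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

# Toy data: y is 1 exactly when x is positive, so x separates y perfectly
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

ll_small = log_likelihood(1.0, x, y)
ll_medium = log_likelihood(10.0, x, y)
ll_large = log_likelihood(100.0, x, y)
print(ll_small, ll_medium, ll_large)
```

Each larger coefficient gives a log-likelihood closer to zero, which is why an iterative fitter never settles on a finite estimate for such data.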

R - Random Forest - Importance / varImpPlot

I have an issue with the importance / varImpPlot functions for Random Forest that I hope someone can help me with.
I tried two code versions, but I am confused about the (different) results:
1.)
rffit = randomForest(price~.,data=train,mtry=x,ntree=500)
rfvalpred = predict(rffit,newdata=test)
varImpPlot(rffit)
importance(rffit)
This shows the plot and the "importance" data, but only "IncNodePurity". The values in the plot also differ from the values in the data; I tried "scale" but it did not work.
2.)
rf.analyzed_data = randomForest(price~.,data=train,mtry=x,ntree=500,importance=TRUE)
yhat.rf = predict(rf.analyzed_data,newdata=test)
varImpPlot(rf.analyzed_data)
importance(rf.analyzed_data)
In this case it no longer produces any plot, and the importance data shows "%IncMSE" and "IncNodePurity", but the "IncNodePurity" values differ from those of the first version.
Questions:
1.) Any idea why data is different for “IncNodePurity”?
2.) Any idea why no “%IncMSE” is shown in the first version?
3.) Why no plot is shown in the second version?
Many thanks!!
Ed
1) IncNodePurity is derived from the loss function, and you get that measure for free just by training the model. On the downside, it is a more unstable estimate, as results may vary from one model run to the next. It is also more biased, as it favors variables with many levels. I guess the differences you found are due to randomness.
2) VI, %IncMSE takes a little extra time to compute and is therefore optional. Roughly, all values in the data set need to be shuffled, and every OOB sample needs to be predicted once for every tree and for every variable. As the randomForest package is designed, you have to compute VI during training, so importance must be set to TRUE. Otherwise varImpPlot cannot plot it, as it has not been computed.
3) Not sure. In this code example, at least, I see both plots.
library(randomForest)
#data
X = data.frame(replicate(6,rnorm(1000)))
y = with(X, X1^2 + sin(X2*pi) + X3*X4)
train = data.frame(y=y,X=X)
#training
rf1=randomForest(y~.,data=train,importance=F)
rf2=randomForest(y~.,data=train, importance=T)
#plotting importance
varImpPlot(rf1) #plot only with IncNodePurity
varImpPlot(rf2) #bi-plot also with %IncMSE
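The shuffling idea behind %IncMSE in point 2 can be sketched in a few lines. This is a simplified Python illustration, not the randomForest package's procedure: it uses a fixed hypothetical predict function instead of per-tree OOB samples, does no scaling, and the function name permutation_importance is made up here. For each variable, it shuffles that column, re-predicts, and records how much the mean squared error increases.

```python
import numpy as np

def permutation_importance(predict, X, y, rng):
    """Increase in MSE after shuffling each column in turn (simplified sketch)."""
    base_mse = np.mean((y - predict(X)) ** 2)
    increases = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        # Shuffle one column to break its relationship with the response
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        perm_mse = np.mean((y - predict(X_perm)) ** 2)
        increases.append(perm_mse - base_mse)
    return np.array(increases)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0]                          # only the first column matters

def predict(data):
    """Stand-in for a fitted model that uses only the first column."""
    return 2.0 * data[:, 0]

importances = permutation_importance(predict, X, y, rng)
print(importances)
```

The column the model relies on shows a positive increase in error, while the unused column shows none. This extra round of predictions for every variable is the additional cost the answer refers to.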

Preprocess data in R

I'm using R to create a logistic regression classifier model.
Here is the code sample:
library(ROCR)
DATA_SET <- read.csv('E:/1.csv')
classOneCount= 4000
classZeroCount = 4000
sample.churn <- sample(which(DATA_SET$Class==1),classOneCount)
sample.nochurn <- sample(which(DATA_SET$Class==0),classZeroCount )
train.set <- DATA_SET[c(sample.churn,sample.nochurn),]
test.set <- DATA_SET[c(-sample.churn,-sample.nochurn),]
full.logit <- glm(Class~., data = train.set, family = binomial)
And it works fine, but I would like to preprocess the data to see if it improves the classification model.
What I would like to do is divide the continuous input variables into intervals. Let's say that one variable is height in centimeters, stored as a float.
Sample values of height:
183.23
173.43
163.53
153.63
193.27
and so on, and I would like to split it into, let's say, 3 different intervals: small, medium, large.
I would like to do this with all the variables in my set; there are 32 variables.
What's more, at the end I would like to see the correlation between the values of the variables (these intervals) and the classification result class.
Is this clear?
Thank you very much in advance
A classification model creates some decision boundary, and existing algorithms are rather good at estimating it. Let's assume that you have one variable, height, and a linear decision boundary. Your algorithm can then decide between which values to put the decision boundary by estimating the error on the training set. If you perform quantization and create a few intervals, your algorithm has fewer places to put the boundary (data loss). It will likely perform worse on such a cropped dataset than on the original one. Quantization could help if your learning algorithm suffers from high variance (is overfitting the data), but then you could also try getting more training examples, using a smaller set (subset) of features, or using an algorithm with regularization and increasing the regularization parameter.
There are also many open questions about how to choose the number of intervals and how to divide the data into them, for example: should all intervals be equally frequent, of equal width, or chosen so that values inside each interval are as similar as possible?
If you just want to experiment, you can use software such as the free version of RapidMiner Studio (it can read CSV and Excel files and has some quick quantization options) to convert your data.
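For a quick experiment without extra software, equal-width binning can be sketched as follows. This is a Python/NumPy illustration rather than R, it uses the five sample heights from the question, and the choice of three equal-width intervals is only one of the options mentioned above (equal-frequency bins would need different edges).

```python
import numpy as np

# Sample heights from the question (centimeters)
heights = np.array([183.23, 173.43, 163.53, 153.63, 193.27])
labels = np.array(["small", "medium", "large"])

# Three equal-width intervals between the observed min and max:
# linspace gives 4 edges; the 2 inner edges are the cut points
cut_points = np.linspace(heights.min(), heights.max(), 4)[1:-1]

# digitize returns 0, 1 or 2 depending on which interval each value falls in
bin_index = np.digitize(heights, cut_points)
binned = labels[bin_index].tolist()
print(binned)
```

The same cut points computed on the training set should be reused on the test set, so that both are binned consistently.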
