Order/number of variables in lm() causing singularities in R?
I was trying to fit a linear model using lm() in R with 12 explanatory variables and 33 observations, but the coefficients for the last three variables are not defined because of singularities. When I switched the order of the variables, the same thing happened again, even though those variables (TotalPrec_11, TotalPrec_12, TotalPrec_10) were significant before. The coefficients also differed between the two models.
ab <- lm(value ~ TotalPrec_12 + TotalPrec_11 + TotalPrec_10 + TotalPrec_9 + TotalPrec_8 + TotalPrec_7 + TotalPrec_6 + TotalPrec_5 + TotalPrec_4 + TotalPrec_3 + TotalPrec_2 + TotalPrec_1, data = aa)
summary(ab)
#Coefficients: (3 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 64.34 30.80 2.089 0.0480 *
#TotalPrec_12 19811.97 11080.14 1.788 0.0869 .
#TotalPrec_11 -16159.45 7099.89 -2.276 0.0325 *
#TotalPrec_10 -16500.62 18813.96 -0.877 0.3895
#TotalPrec_9 62662.08 51143.37 1.225 0.2329
#TotalPrec_8 665.39 36411.95 0.018 0.9856
#TotalPrec_7 -77203.59 51555.71 -1.497 0.1479
#TotalPrec_6 4830.11 19503.52 0.248 0.8066
#TotalPrec_5 6403.94 14902.77 0.430 0.6714
#TotalPrec_4 -735.73 5023.83 -0.146 0.8848
#TotalPrec_3 NA NA NA NA
#TotalPrec_2 NA NA NA NA
#TotalPrec_1 NA NA NA NA
Here is the same data with the variables in a different order:
ab1 <- lm(value ~ TotalPrec_1 + TotalPrec_2 + TotalPrec_3 + TotalPrec_9 + TotalPrec_8 + TotalPrec_7 + TotalPrec_6 + TotalPrec_5 + TotalPrec_4 + TotalPrec_11 + TotalPrec_12 + TotalPrec_10, data = aa)
summary(ab1)
#Coefficients: (3 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 63.72 54.44 1.171 0.2538
#TotalPrec_1 19611.54 19366.33 1.013 0.3218
#TotalPrec_2 -14791.44 7847.87 -1.885 0.0722 .
#TotalPrec_3 6766.60 3144.68 2.152 0.0422 *
#TotalPrec_9 28677.62 53530.82 0.536 0.5973
#TotalPrec_8 -23207.34 65965.12 -0.352 0.7282
#TotalPrec_7 -26628.10 55839.25 -0.477 0.6380
#TotalPrec_6 -28694.23 13796.80 -2.080 0.0489 *
#TotalPrec_5 46982.35 17941.89 2.619 0.0154 *
#TotalPrec_4 -26393.70 17656.70 -1.495 0.1486
#TotalPrec_11 NA NA NA NA
#TotalPrec_12 NA NA NA NA
#TotalPrec_10 NA NA NA NA
Several posts online suggest that this might be a multicollinearity problem. I ran the cor() function to check for collinearity, and no pair of variables was perfectly correlated.
I used the same set of 12 variables with other response variables, and there was no problem with singularities. So I'm not sure what is happening here or what I need to do differently to figure this out.
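For reference, this is roughly how I checked the pairwise correlations (a sketch; dropping the first column assumes value is the first column of aa):
# largest absolute pairwise correlation among the 12 predictors
cc <- cor(aa[, -1])
max(abs(cc[upper.tri(cc)]))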
Edit: here is my data:
> dput(aa)
structure(list(value = c(93, 95, 88, 90, 90, 80, 100, 80, 96,
100, 100, 100, 80, 94, 88, 76, 90, 0, 93, 100, 88, 90, 95, 71,
92, 93, 92, 100, 85, 90, 100, 100, 100), TotalPrec_1 = c(0.00319885835051536,
0.00319885835051536, 0.00319885835051536, 0.00717973057180643,
0.00717973057180643, 0.00717973057180643, 0.00464357063174247,
0.00464357063174247, 0.00464357063174247, 0.00598198547959327,
0.00598198547959327, 0.00598198547959327, 0.00380058260634541,
0.00380058260634541, 0.00380058260634541, 0.00380058260634541,
0.00364887388423085, 0.00364887388423085, 0.00364887388423085,
0.00475014140829443, 0.00475014140829443, 0.00475014140829443,
0.00475014140829443, 0.00499139120802283, 0.00499139120802283,
0.00499139120802283, 0.00499139120802283, 0.00490436097607016,
0.00490436097607016, 0.00490436097607016, 0.00623255362734198,
0.00623255362734198, 0.00623255362734198), TotalPrec_2 = c(0.00387580785900354,
0.00387580785900354, 0.00387580785900354, 0.00625309534370899,
0.00625309534370899, 0.00625309534370899, 0.00298969540745019,
0.00298969540745019, 0.00298969540745019, 0.00558579061180353,
0.00558579061180353, 0.00558579061180353, 0.00370361795648932,
0.00370361795648932, 0.00370361795648932, 0.00370361795648932,
0.00335893919691443, 0.00335893919691443, 0.00335893919691443,
0.00621500937268137, 0.00621500937268137, 0.00621500937268137,
0.00621500937268137, 0.00234323320910334, 0.00234323320910334,
0.00234323320910334, 0.00234323320910334, 0.00644989637658, 0.00644989637658,
0.00644989637658, 0.00476496992632746, 0.00476496992632746, 0.00476496992632746
), TotalPrec_3 = c(0.00418250449001789, 0.00418250449001789,
0.00418250449001789, 0.00702223135158419, 0.00702223135158419,
0.00702223135158419, 0.00427648611366748, 0.00427648611366748,
0.00427648611366748, 0.00562589056789875, 0.00562589056789875,
0.00562589056789875, 0.0037367227487266, 0.0037367227487266,
0.0037367227487266, 0.0037367227487266, 0.00477339653298258,
0.00477339653298258, 0.00477339653298258, 0.0124167986214161,
0.0124167986214161, 0.0124167986214161, 0.0124167986214161, 0.010678518563509,
0.010678518563509, 0.010678518563509, 0.010678518563509, 0.0139585845172405,
0.0139585845172405, 0.0139585845172405, 0.00741709442809224,
0.00741709442809224, 0.00741709442809224), TotalPrec_4 = c(0.00659881485626101,
0.00659881485626101, 0.00659881485626101, 0.00347008113749325,
0.00347008113749325, 0.00347008113749325, 0.00720167113468051,
0.00720167113468051, 0.00720167113468051, 0.00704727275297045,
0.00704727275297045, 0.00704727275297045, 0.00856677815318107,
0.00856677815318107, 0.00856677815318107, 0.00856677815318107,
0.00867980346083641, 0.00867980346083641, 0.00867980346083641,
0.00614343490451574, 0.00614343490451574, 0.00614343490451574,
0.00614343490451574, 0.00704662408679723, 0.00704662408679723,
0.00704662408679723, 0.00704662408679723, 0.00495137926191091,
0.00495137926191091, 0.00495137926191091, 0.00796654727309942,
0.00796654727309942, 0.00796654727309942), TotalPrec_5 = c(0.00515584181994199,
0.00515584181994199, 0.00515584181994199, 0.000977653078734875,
0.000977653078734875, 0.000977653078734875, 0.00485571753233671,
0.00485571753233671, 0.00485571753233671, 0.00477610062807798,
0.00477610062807798, 0.00477610062807798, 0.00664602871984243,
0.00664602871984243, 0.00664602871984243, 0.00664602871984243,
0.00533714797347784, 0.00533714797347784, 0.00533714797347784,
0.00265633105300366, 0.00265633105300366, 0.00265633105300366,
0.00265633105300366, 0.00200922577641904, 0.00200922577641904,
0.00200922577641904, 0.00200922577641904, 0.00172789173666387,
0.00172789173666387, 0.00172789173666387, 0.00347296684049069,
0.00347296684049069, 0.00347296684049069), TotalPrec_6 = c(0.00170362275093793,
0.00170362275093793, 0.00170362275093793, 0.000670029199682176,
0.000670029199682176, 0.000670029199682176, 0.0018315939232707,
0.0018315939232707, 0.0018315939232707, 0.00138648133724927,
0.00138648133724927, 0.00138648133724927, 0.00329410820268094,
0.00329410820268094, 0.00329410820268094, 0.00329410820268094,
0.00210500298999249, 0.00210500298999249, 0.00210500298999249,
0.000628655252512544, 0.000628655252512544, 0.000628655252512544,
0.000628655252512544, 0.000631613133009523, 0.000631613133009523,
0.000631613133009523, 0.000631613133009523, 0.000616533157881349,
0.000616533157881349, 0.000616533157881349, 0.000599739549215883,
0.000599739549215883, 0.000599739549215883), TotalPrec_7 = c(0.00124496815260499,
0.00124496815260499, 0.00124496815260499, 0.000289129035081714,
0.000289129035081714, 0.000289129035081714, 0.00089572963770479,
0.00089572963770479, 0.00089572963770479, 0.00187503395136445,
0.00187503395136445, 0.00187503395136445, 0.00070394336944446,
0.00070394336944446, 0.00070394336944446, 0.00070394336944446,
0.000733022985514253, 0.000733022985514253, 0.000733022985514253,
4.50894685855019e-06, 4.50894685855019e-06, 4.50894685855019e-06,
4.50894685855019e-06, 3.02730550174601e-05, 3.02730550174601e-05,
3.02730550174601e-05, 3.02730550174601e-05, 3.71173496205301e-06,
3.71173496205301e-06, 3.71173496205301e-06, 4.58224167232402e-05,
4.58224167232402e-05, 4.58224167232402e-05), TotalPrec_8 = c(0.000394100265111774,
0.000394100265111774, 0.000394100265111774, 0.000930351321585476,
0.000930351321585476, 0.000930351321585476, 0.000679628865327686,
0.000679628865327686, 0.000679628865327686, 0.000997507828287781,
0.000997507828287781, 0.000997507828287781, 1.77486290340312e-05,
1.77486290340312e-05, 1.77486290340312e-05, 1.77486290340312e-05,
1.63553704624064e-05, 1.63553704624064e-05, 1.63553704624064e-05,
4.31556363764685e-05, 4.31556363764685e-05, 4.31556363764685e-05,
4.31556363764685e-05, 8.14739760244265e-05, 8.14739760244265e-05,
8.14739760244265e-05, 8.14739760244265e-05, 4.07490988436621e-05,
4.07490988436621e-05, 4.07490988436621e-05, 0.000140139847644605,
0.000140139847644605, 0.000140139847644605), TotalPrec_9 = c(0.000616681878454983,
0.000616681878454983, 0.000616681878454983, 0.000742240983527154,
0.000742240983527154, 0.000742240983527154, 0.000230846126214601,
0.000230846126214601, 0.000230846126214601, 0.00132466584909707,
0.00132466584909707, 0.00132466584909707, 0.000114383190521039,
0.000114383190521039, 0.000114383190521039, 0.000114383190521039,
6.07241054240149e-05, 6.07241054240149e-05, 6.07241054240149e-05,
2.74324702331796e-05, 2.74324702331796e-05, 2.74324702331796e-05,
2.74324702331796e-05, 6.96572624292457e-06, 6.96572624292457e-06,
6.96572624292457e-06, 6.96572624292457e-06, 3.32364725181833e-05,
3.32364725181833e-05, 3.32364725181833e-05, 0.000108777909190394,
0.000108777909190394, 0.000108777909190394), TotalPrec_10 = c(0.00040393992094323,
0.00040393992094323, 0.00040393992094323, 0.00166831514798104,
0.00166831514798104, 0.00166831514798104, 0.000324568885844201,
0.000324568885844201, 0.000324568885844201, 0.000868275004904717,
0.000868275004904717, 0.000868275004904717, 1.25834640130051e-05,
1.25834640130051e-05, 1.25834640130051e-05, 1.25834640130051e-05,
7.2861012085923e-06, 7.2861012085923e-06, 7.2861012085923e-06,
0.000946254527661949, 0.000946254527661949, 0.000946254527661949,
0.000946254527661949, 0.000476793473353609, 0.000476793473353609,
0.000476793473353609, 0.000476793473353609, 0.00102826312649995,
0.00102826312649995, 0.00102826312649995, 0.00266417209059, 0.00266417209059,
0.00266417209059), TotalPrec_11 = c(0.00124716362915933, 0.00124716362915933,
0.00124716362915933, 0.00470362277701497, 0.00470362277701497,
0.00470362277701497, 0.0017967780586332, 0.0017967780586332,
0.0017967780586332, 0.000694554066285491, 0.000694554066285491,
0.000694554066285491, 0.000485763972392306, 0.000485763972392306,
0.000485763972392306, 0.000485763972392306, 0.00074231723556295,
0.00074231723556295, 0.00074231723556295, 0.000763822405133396,
0.000763822405133396, 0.000763822405133396, 0.000763822405133396,
0.00114128366112709, 0.00114128366112709, 0.00114128366112709,
0.00114128366112709, 0.000856105296406895, 0.000856105296406895,
0.000856105296406895, 0.00255026295781135, 0.00255026295781135,
0.00255026295781135), TotalPrec_12 = c(0.00380058260634541, 0.00380058260634541,
0.00380058260634541, 0.00475014140829443, 0.00475014140829443,
0.00475014140829443, 0.00412079365924, 0.00412079365924, 0.00412079365924,
0.00455283792689442, 0.00455283792689442, 0.00455283792689442,
0.00117174908518791, 0.00117174908518791, 0.00117174908518791,
0.00117174908518791, 0.00119069591164588, 0.00119069591164588,
0.00119069591164588, 0.00201585865579545, 0.00201585865579545,
0.00201585865579545, 0.00201585865579545, 0.00202310062013566,
0.00202310062013566, 0.00202310062013566, 0.00202310062013566,
0.00231692171655595, 0.00231692171655595, 0.00231692171655595,
0.00495567917823791, 0.00495567917823791, 0.00495567917823791
)), row.names = c(NA, -33L), class = c("tbl_df", "tbl", "data.frame"
))
When you have multiple predictors, singularity doesn’t necessarily mean that two variables are perfectly correlated. It means that at least one of your variables can be perfectly predicted by some combination of the other variables, even if none of those variables is a perfect predictor on its own. When you have many predictors relative to few observations, as you do, the odds of this happening increase. So you will probably need to simplify your model.
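One way to see such hidden dependencies directly is alias(), which writes each dropped coefficient as an exact linear combination of the coefficients that were kept (a quick sketch, using the ab fit from your question):
# alias() reports exact linear dependencies in a fitted lm;
# $Complete has one row per aliased (NA) coefficient
alias(ab)$Complete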
You are trying to estimate a linear model y = X %*% beta + epsilon given X and y. The model matrix X has 33 rows and 13 columns, one for the intercept and one for each numeric variable:
X <- model.matrix(ab)
dim(X)
## [1] 33 13
colnames(X)
## [1] "(Intercept)" "TotalPrec_12" "TotalPrec_11" "TotalPrec_10" "TotalPrec_9"
## [6] "TotalPrec_8" "TotalPrec_7" "TotalPrec_6" "TotalPrec_5" "TotalPrec_4"
## [11] "TotalPrec_3" "TotalPrec_2" "TotalPrec_1"
But X has rank 10, not 13:
qr(X)$rank
## [1] 10
So there is no unique least squares solution beta. lm copes by fitting a reduced model to the first set of 10 linearly independent columns of X, as indicated in your summary output. (Whether it copes or throws an error depends on its argument singular.ok. The default value is TRUE.)
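If you want to identify the dropped columns programmatically instead of scanning the summary for NAs, the pivot from the QR decomposition does it (a sketch using the X computed above; columns pivoted past the rank are the linearly dependent ones):
# the pivot reorders columns so the dependent ones come last;
# this should list the three predictors shown as NA in summary(ab)
qrX <- qr(X)
colnames(X)[qrX$pivot[(qrX$rank + 1):ncol(X)]]
# to get an error instead of a silently reduced fit:
# lm(value ~ ., data = aa, singular.ok = FALSE)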
I find it curious that changing the response makes the problem go away, given that the rank of X does not depend on y. Perhaps you changed more than just the response without realizing?
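One quick check (a sketch, assuming aa is still loaded): fit the same predictors against pure noise. Three coefficients should still come back NA, because the rank deficiency lives entirely in X:
set.seed(1)
# arbitrary response, same design matrix: the singularity should persist
coef(lm(rnorm(33) ~ . - value, data = aa))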
Related
How to integrate cut() into predict() in predicting a variable for a model I created
R programming: using the ISLR library, I want to predict someone's wage at age 35, with a model that uses cut() with values 0, 35, 45, 55, 65, 80 to divide the variable "age" into brackets. With that in mind, how should the predict() call look, given cut() and my model? Here is my code so far, prior to predict():
table(cut(age, breaks = c(0, 35, 45, 55, 65, 80))) # cut()
getfit.1 = lm(wage ~ education + cut(age, breaks = c(0,25,35,45,55,80)), data = Wage) # model with cut()
You will make your life easier if you create the categorical variable first and then use it to fit the model:
library(ISLR)
agecat <- cut(Wage$age, breaks = c(0,25,35,45,55,80))
getfit.1 <- lm(wage ~ education + agecat, data = Wage)
predict(getfit.1, data.frame(education = "2. HS Grad", agecat = "(25,35]"))
#        1
# 88.56445
Note that you must specify the education category as well to get a prediction. As a result it may be useful to get all the combinations:
cross <- expand.grid(agecat = levels(agecat), education = levels(Wage$education))
predictions <- data.frame(cross, pwage = predict(getfit.1, cross))
head(predictions)
#    agecat    education    pwage
# 1  (0,25] 1. < HS Grad 59.12711
# 2 (25,35] 1. < HS Grad 77.65516
# 3 (35,45] 1. < HS Grad 91.86200
# 4 (45,55] 1. < HS Grad 90.84853
# 5 (55,80] 1. < HS Grad 88.53072
# 6  (0,25]   2. HS Grad 70.03640
predict.MCMCglmm error -- giving broad non-granular values
I'm using MCMCglmm() from the MCMCglmm package to fit a mixed-effects model with a three-level categorical response. The goal is to get probabilities for each of the three classes of response for each row of the 5000-row test_week test set.
prior6 <- list(R=list(V=diag(2), nu=0.0001),
               G=list(G1=list(V=diag(4), nu=0.0001, alpha.mu=rep(0,4), alpha.V=diag(4) * 25),
                      G2=list(V=diag(1), nu=0.0001, alpha.mu=rep(0,1), alpha.V=diag(1) * 25),
                      G3=list(V=diag(2), nu=0.0001, alpha.mu=rep(0,2), alpha.V=diag(2) * 25),
                      G4=list(V=diag(1), nu=0.0001, alpha.mu=rep(0,1), alpha.V=diag(1) * 25),
                      G5=list(V=diag(2), nu=0.0001, alpha.mu=rep(0,2), alpha.V=diag(2) * 25),
                      G6=list(V=diag(1), nu=0.0001, alpha.mu=rep(0,1), alpha.V=diag(1) * 25)))
mix_mod_fit6 <- MCMCglmm(response ~ 1 + x + y + z,
                         random=~us(1 + x + y + z):a + us(1):d + us(1 + x):b + us(1):e + us(1 + z):c + us(1):f,
                         rcov=~us(trait):units, prior=prior6, family='categorical',
                         data=train_weeks_sample1, nitt=3850, thin=45, burnin=2500)
mixed_final_predictions6 <- predict.MCMCglmm(mix_mod_fit6, test_week, type='response', interval='prediction')
The issue arises with the predict function, which returns a 10000x1 matrix of numbers that very roughly mirror the probabilities of the 2nd and 3rd levels of response (the first 5000 rows corresponding to the 2nd level and the following 5000 to the 3rd level) after I split them into a 5000x2 matrix. I can then backsolve to get the predictions for the 1st level, but the predictions are problematic. They only predict in multiples of 1/30, as shown below (and also illustrated in the histograms in the picture):
temp6
                 0 0.0333333333333333 0.0666666666666667                0.1  0.133333333333333  0.166666666666667
              7935               3914               2199               1901               1883               2173
               0.2  0.233333333333333  0.266666666666667                0.3  0.333333333333333  0.366666666666667
              2257               2198               1991               1703               1465               1184
               0.4  0.433333333333333  0.466666666666667                0.5  0.533333333333333  0.566666666666667
               987                756                527                410                268                164
               0.6  0.633333333333333  0.666666666666667                0.7
                98                 35                 24                  6
Any insight or examples of how the predict.MCMCglmm function works, how to get predictions more granular than thirtieths, and how to get predictions for all three levels rather than just two, would be greatly appreciated! Thank you!
Note: the prior's V values were specifically selected to match the element sizes in the MCMCglmm call, and cannot be changed. However, altering the value of nu has no effect on the predictions, nor does omitting alpha.mu and alpha.V.
Current results vs. ideal results distribution example: [histograms not shown]
Fitting parameters do not converge, but nlsLM() stops
In R, I'm trying to fit a series of data to a curve given by the theory, which is:
y(x) = (1 + fitbeta * x)^(-fitgamma)
In theory this should work, and others have used this formula successfully. However, I cannot seem to make my fit parameters converge, no matter the starting conditions or options I try. Below is a reproducible example. Note that I am using the package minpack.lm, which is a nice improvement over nls(), which, by the way, throws a "Missing value or an infinity produced when evaluating the model" error at me when I try to use it (with the same settings). This may be a trivial question for most of you, but I do not know nls() at all and I'm just starting to use it. There must be something super simple I'm missing! Here's the code:
#Libraries:
library(ggplot2) #Plots
library(minpack.lm) #Non-linear formula fits
#Reproducible example parameters:
startbeta <- 0.001
startgamma <- -10
X <- c(1, seq(2,240,2)) #Vector of durations in hours
Y <- c(1,
0.999349067282531, 0.997149973857984, 0.993613390445064, 0.988771983639396, 0.982724692889081,
0.975628661286657, 0.96751072812657, 0.958414894569813, 0.948463753530251, 0.93767394420049,
0.926259971613655, 0.91433495083748, 0.901955098152661, 0.889290679032582, 0.876927340669535,
0.864697870521103, 0.852357436802833, 0.839855401239168, 0.827134255036668, 0.814227652658426,
0.801278249082419, 0.788355912487271, 0.775514097293561, 0.762891867628759, 0.750380786683852,
0.738018762182673, 0.725799137700828, 0.713720035497274, 0.701808749767634, 0.690046213599144,
0.678484705844808, 0.667111445204795, 0.655977696751697, 0.645116379924585, 0.634460211234775,
0.623985607991471, 0.613706080277076, 0.603604313599018, 0.593685433942668, 0.58395373490791,
0.574696581531438, 0.565639259757887, 0.556883924877305, 0.54829105550864, 0.539882579975057,
0.531669333311634, 0.523789998486779, 0.516140798533169, 0.508732414242052, 0.501549858546355,
0.494581375404643, 0.487806083201077, 0.481215549260729, 0.475344757534521, 0.469883620527239,
0.464505182123833, 0.459295389779093, 0.454254664927743, 0.449272635346615, 0.444353923395879,
0.439502685117945, 0.434723592424652, 0.430300205554656, 0.425950322720235, 0.421651255977861,
0.417403585324494, 0.413205553596921, 0.409056611802817, 0.404966487426596, 0.400979187396173,
0.39721419353495, 0.393559414540655, 0.389971147514211, 0.386435641037176, 0.382947750185137,
0.379497143530884, 0.376143019983175, 0.373016368099911, 0.369904788649644, 0.366813427145508,
0.363784767175811, 0.360999911892512, 0.358249758228913, 0.355539445964091, 0.352899943455576,
0.35037387237155, 0.347925865476795, 0.345529621963385, 0.343187675737988, 0.340930763575173,
0.338722557572396, 0.336557943062853, 0.334418098646777, 0.332805911075547, 0.33117666406428,
0.329516391536038, 0.327847961775104, 0.32615922691243, 0.324473380427564, 0.322807248963926,
0.321128906622371, 0.319431589984492, 0.317714245126025, 0.315983206323488, 0.314233066510948,
0.312462213805877, 0.310672914913813, 0.308902798280917, 0.307149178519641, 0.305387995487162,
0.303621881791372, 0.301859643176666, 0.300098168944162, 0.298340765140062, 0.296633192262476,
0.294981140721158, 0.293349312493173, 0.291718961012827, 0.290087159821697, 0.288466970001273)
#Plot function:
ggp <- function(b=fitbeta, g=fitgamma) {
  gg <- ggplot(data.frame(x = c(0, 240)), aes(x)) +
    stat_function(fun = function(x) (1 + b * x)^(-g), geom = "line") +
    geom_line(data=data.frame(x=X, y=Y), aes(y=y, x=x), colour="red")
  print(gg)
}
#Fit:
nlc <- nls.control(maxiter = 1000)
fit <- nlsLM(y ~ I((1 + fitbeta * x)^(-fitgamma)), #Function from NERC(1975)
             data = data.frame(y=Y, x=X),
             start = list(fitbeta = startbeta, fitgamma=startgamma),
             trace=TRUE, control=nlc)
#Get the coefficients of the fit
fitbeta <- coef(fit)[1]
fitgamma <- coef(fit)[2]
#Plot:
ggp()
The result is not too shabby (in black the fitted curve, in red the input data; plot not shown). However, if I look at the nlsLM() trace:
...
It. 64, RSS = 0.13003, Par. = -1.37045e-05 -444.272
It. 65, RSS = 0.130024, Par. = -1.33717e-05 -455.143
It. 66, RSS = 0.130014, Par. = -1.30753e-05 -465.508
It. 67, RSS = 0.130006, Par. = -1.29238e-05 -471.145
It. 68, RSS = 0.13, Par. = -1.26237e-05 -482.163
It. 69, RSS = 0.129991, Par. = -1.23554e-05 -492.681
It. 70, RSS = 0.129984, Par. = -1.22119e-05 -498.639
It. 71, RSS = 0.129979, Par. = -1.19291e-05 -510.269
It. 72, RSS = 0.12997, Par. = -1.16755e-05 -521.396
It. 73, RSS = 0.129964, Par. = -1.15447e-05 -527.488
It. 74, RSS = 0.129959, Par. = -1.12865e-05 -539.368
It. 75, RSS = 0.129951, Par. = -1.10541e-05 -550.747
It. 76, RSS = 0.129945, Par. = -1.09312e-05 -557.117
It. 77, RSS = 0.129941, Par. = -1.06886e-05 -569.567
It. 78, RSS = 0.129933, Par. = -1.04713e-05 -581.433
It. 79, RSS = 0.129928, Par. = -1.03588e-05 -587.931
It. 80, RSS = 0.129924, Par. = -1.01368e-05 -600.613
It. 81, RSS = 0.129917, Par. = -9.93656e-06 -612.764
It. 82, RSS = 0.129912, Par. = -9.82966e-06 -619.607
It. 83, RSS = 0.129908, Par. = -9.61868e-06 -632.993
It. 84, RSS = 0.129902, Par. = -9.42521e-06 -646.024
It. 85, RSS = 0.129896, Par. = -9.24949e-06 -658.345
It. 86, RSS = 0.129891, Par. = -9.14794e-06 -665.812
It. 87, RSS = 0.129888, Par. = -8.9485e-06 -680.423
It. 88, RSS = 0.129882, Par. = -8.7693e-06 -694.38
It. 89, RSS = 0.129876, Par. = -8.58442e-06 -709.315
It. 90, RSS = 0.129871, Par. = -8.41437e-06 -723.697
It. 91, RSS = 0.129866, Par. = -8.25997e-06 -737.276
It. 92, RSS = 0.129862, Par. = -8.17745e-06 -744.9
It. 93, RSS = 0.129859, Par. = -8.01476e-06 -759.809
it does not actually look like it is converging: the two parameters keep changing and actually diverge. What is happening here, and why does the fitter stop if the parameters are not stable?
This is not a solution, but some food for thought: in my experience, if you have serious errors from nls, using nlsLM is not a good fix, as it will usually give you a result, but often not a good one. An alternative approach is to run a whole bunch of optimizers and check whether they can roughly agree on a solution.
library(optimx)
fun <- function(x, b, c) (1 + b * x)^(-log(c))
fun1 <- function(pars) sum((Y - fun(X, pars["b"], pars["c"]))^2)
fits <- optimx(c(b = 0.001, c = log(10)), fun1, control = list(all.methods = TRUE))
fits
#                       b         c         value fevals gevals niter convcode  kkt1  kkt2 xtimes
# BFGS        0.009277751  2.745042  2.453159e-01     88     20    NA        0 FALSE  TRUE   0.02
# CG          0.012242506  2.300605  3.184743e-01    171     19    NA        0 FALSE  TRUE   0.00
# Nelder-Mead 0.004664550  5.396015  1.427017e-01    175     NA    NA        0 FALSE FALSE   0.00
# L-BFGS-B             NA        NA 8.988466e+307     NA     NA    NA     9999    NA    NA   0.00
# nlm         0.001000000  2.302585 1.797693e+308     NA     NA     0        0    NA    NA   0.00
# nlminb      0.001674994 58.216382  1.081060e-01    123    191    89        0  TRUE FALSE   0.00
# spg         0.019577915  2.286535  1.448918e+00      7     NA     4        3 FALSE FALSE   0.14
# ucminf      0.001000000  2.302585  2.153628e+01      1      1    NA        0 FALSE FALSE   0.00
# Rcgmin               NA        NA 8.988466e+307     NA     NA    NA     9999    NA    NA   0.00
# Rvmmin               NA        NA 8.988466e+307     NA     NA    NA     9999    NA    NA   0.00
# newuoa               NA        NA 8.988466e+307     NA     NA    NA     9999    NA    NA   0.00
# bobyqa      0.024625964  2.801075  5.831466e+00     23     NA    NA        0    NA    NA   0.00
# nmkb                 NA        NA 8.988466e+307     NA     NA    NA     9999    NA    NA   0.00
# hjkb        0.001000000  2.302585  2.153628e+01      1     NA     0     9999    NA    NA   0.00
plot(X, Y)
curve(fun(x, fits["BFGS", "b"], fits["BFGS", "c"]), add = TRUE)
curve(fun(x, fits["CG", "b"], fits["CG", "c"]), add = TRUE, col = "blue")
curve(fun(x, fits["Nelder-Mead", "b"], fits["Nelder-Mead", "c"]), add = TRUE, col = "red")
curve(fun(x, fits["nlminb", "b"], fits["nlminb", "c"]), add = TRUE, col = "green")
As you can see, most of the converging optimizers give quite different parameter estimates, and none of these fits can be considered good. I suspect that this is not merely a result of bad starting values, but that your data just doesn't conform to the model. In particular, it doesn't look like the model can represent the slightly sigmoidal shape of your data.
R nls function and starting values
I'm wondering how I can find/choose the starting values for the nls function, as I'm getting errors with any I put in. I also want to confirm that I can actually use the nls function with my data set.
data
 [1] 108 128  93  96 107 126 105 137  78 117 111 131 106 123 112  90  79 106 120
[20]  91 100 103 112 138  61  57  97 100  95  92  78
week = (1:31)
> data.fit = nls(data~M*(((P+Q)^2/P)*exp((P+Q)*week)/(1+(Q/P)*exp(-(P+Q)*week))^2), start=c(M=?, P=?, Q=?))
If we change the function a bit and use nls2 to get starting values, then we can get it to converge. The model we are using is:
log(data) = .lin1 + .lin2 * log(exp((P+Q)*week)/(1+(Q/P)*exp(-(P+Q)*week))^2) + error
In this model .lin1 = log(M*((P+Q)^2/P)), and when .lin2 = 1 it reduces to the model in the question (except for the multiplicative rather than additive error, and the fact that the parameterization is different, but when appropriately reduced it gives the same predictions). This is a 4-parameter rather than a 3-parameter model. The linear parameters are .lin1 and .lin2. We are using algorithm = "plinear", which does not require starting values for these parameters. The RHS of a plinear formula is specified as a matrix with one column for each linear parameter, giving its coefficient (which may be a nonlinear function of the nonlinear parameters). The code is:
data <- c(108, 128, 93, 96, 107, 126, 105, 137, 78, 117, 111, 131, 106, 123, 112,
          90, 79, 106, 120, 91, 100, 103, 112, 138, 61, 57, 97, 100, 95, 92, 78)
week <- 1:31
if (exists("fit2")) rm(fit2)
library(nls2)
fo <- log(data) ~ cbind(1, log((exp((P+Q)*week)/(1+(Q/P)*exp(-(P+Q)*week))^2)))
# try maxiter random starting values
set.seed(123)
fit2 = nls2(fo, alg = "plinear-random",
            start = data.frame(P = c(-10, 10), Q = c(-10, 10)),
            control = nls.control(maxiter = 1000))
# use starting value found by nls2
fit = nls(fo, alg = "plinear", start = coef(fit2)[1:2])
plot(log(data) ~ week)
lines(fitted(fit) ~ week, col = "red")
giving:
> fit
Nonlinear regression model
  model: log(data) ~ cbind(1, log((exp((P + Q) * week)/(1 + (Q/P) * exp(-(P + Q) * week))^2)))
   data: parent.frame()
       P        Q    .lin1    .lin2
 0.05974 -0.02538  5.63199 -0.87963
 residual sum-of-squares: 1.069
Number of iterations to convergence: 16
Achieved convergence tolerance: 9.421e-06
Logistic Regression: not actual output with predict function
I am new to R. When I estimate a logistic model using glm(), it does not seem to predict the response: calling the predict function gives an output that doesn't look like an actual prediction for my input.
Code:
ex2data1R <- read.csv("/media/ex2data1R.txt")
x <- ex2data1R$x
y <- ex2data1R$y
z <- ex2data1R$z
logisticmodel <- glm(z~x+y, family=binomial(link = "logit"), data=ex2data1R)
newdata = data.frame(x=c(10), y=(10))
predict(logisticmodel, newdata, type="response")
Output:
> predict(logisticmodel, newdata, type="response")
           1
1.181875e-11
Data (ex2data1R.txt):
"x","y","z"
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
Let me know, am I doing something wrong?
I'm not seeing any problem. Here are predictions for x, y = 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100:
newdata = data.frame(x=seq(30, 100, 5), y=seq(30, 100, 5))
predict(logisticmodel, newdata, type="response")
           1            2            3            4            5            6
2.423648e-06 1.861140e-05 1.429031e-04 1.096336e-03 8.357794e-03 6.078786e-02
           7            8            9           10           11           12
3.320041e-01 7.923883e-01 9.670066e-01 9.955766e-01 9.994218e-01 9.999247e-01
          13           14           15
9.999902e-01 9.999987e-01 9.999998e-01
You were predicting at x=10, y=10, which is way outside the range of your x, y values (30 - 100), but the prediction was essentially zero, which fits these results. When x and y are low (30 - 55), the prediction for z is zero. When x and y are high (75 - 100), the prediction is one (or nearly one). It may be easier to interpret the results if you round them to a few decimals:
round(predict(logisticmodel, newdata, type="response"), 5)
      1       2       3       4       5       6       7       8       9      10
0.00000 0.00002 0.00014 0.00110 0.00836 0.06079 0.33200 0.79239 0.96701 0.99558
     11      12      13      14      15
0.99942 0.99992 0.99999 1.00000 1.00000
Here is a simple way to predict a category and compare the results with your data:
predict <- ifelse(predict(logisticmodel, type="response") > .5, 1, 0)
xtabs(~predict + ex2data1R$z)
       ex2data1R$z
predict  0  1
      0 34  5
      1  6 55
We used predict() on your original data and then created a rule that picks 1 if the probability is greater than .5 and 0 if it is not. Then we used xtabs() to compare the predictions to the data. When z is 0, we correctly predict zero 34 times and incorrectly predict one 6 times. When z is 1, we correctly predict one 55 times and incorrectly predict zero 5 times. We are correct 89% of the time: (34+55)/100*100. You could explore the accuracy of prediction using .45 or .55 as the cutoff instead of .5.
In my opinion all is correct. As you can read in the R manual:
newdata - optionally, a data frame in which to look for variables with which to predict. If omitted, the fitted linear predictors are used.
If you pass a data frame with one record, it will produce a prediction only for that one. For more details, see the help page for predict.glm, or in the R console type:
?predict.glm
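A minimal sketch of the difference (using the logisticmodel fit from the question; x = 60, y = 60 is an arbitrary in-range point):
# no newdata: fitted probabilities for all training rows
head(predict(logisticmodel, type = "response"))
# one-row newdata: a single prediction for that new point
predict(logisticmodel, data.frame(x = 60, y = 60), type = "response")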
You can also use the following commands to make the confusion matrix:
predict <- ifelse(predict(logisticmodel, type="response") > .5, 1, 0)
table(predict, ex2data1R$z)