Plotting mixture of univariate normal distributions using the R package EMCluster - r
I would like to visualize a mixture of univariate normal distributions fitted by the EMCluster R package. I would like to get a set of Gaussian curves plotted for each group separately onto a histogram of my data like this:
However, there does not seem to be any plotting function in EMCluster doing that, and the EMCluster help files do not cover this topic. I am aware of other R packages having this functionality (e.g., mixtools or mclust), but these do not do the fitting job right unlike EMCluster, which as the only one fits the distribution correctly.
This is is the code used to fit the model:
# input data
ratio <- c(2.3082202,2.2682008,2.3430013,2.3131582,2.3442648,2.2866254,2.4438874,2.4360583,2.4907970,2.4785809,2.4368449,2.4589041,2.2404580,2.2568378,2.2305135,2.2316156,2.1975250,2.2003426,2.3995671,2.2367229,2.2380535,2.2695250,2.2795190,2.2804133,2.2873157,2.3025447,2.2981834,2.2774566,2.1899404,2.2675393,2.2863328,2.2702749,2.2173223,2.1615549,2.3052489,1.4739972,1.4703164,1.8989637,1.6309663,1.4742799,1.6040551,1.5125876,2.3090968,2.4032732,2.2939723,2.4932422,2.4572377,2.2887470,2.2541456,2.3912709,2.3709839,2.2961881,2.3252021,2.4311603,2.4526981,2.2712559,2.4556190,2.4480402,2.2345277,2.2832188,2.3259353,2.3486292,2.3477749,2.3160682,2.3025502,2.3645101,2.2972784,2.3914385,2.4182051,2.3144094,2.3182206,2.3032635,2.3304741,2.4386540,2.4750668,1.5558920,1.5520053,1.5679544,1.5674089,1.5318896,1.5378909,1.5780276,1.5884973,1.5535807,2.4912484,1.5725149,1.6011670,1.5669198,1.5581934,1.5688439,1.5578162,1.5593121,1.5563160,1.5468341,1.5433628,1.5141012,1.5842708,1.5632946,1.5233117,1.5867471,1.5010637,1.5865281,1.5836973,1.6140125,1.6285195,1.5592994,1.5584742,1.6113194,1.6074361,1.5850861,1.5612799,1.5905453,2.3874244,2.4045643,2.3996815,2.3605345,2.3346210,2.3865179,2.4406780,2.3688209,2.3649233,2.3668617,2.4675781,2.4767129,2.4664701,2.3560843,2.5487013,2.4604951,2.5133258,2.4387729,1.4955564,1.5082731,1.5440476,1.5911176,2.4114691,2.4188795,2.4320730,2.4896641,2.4439351,2.4726592,2.4819837,2.4053318,2.4694447,2.3929463,2.3948703,2.3993741,2.4535933,2.4556870,2.3877090,2.4466891,2.4426443,2.3879938,2.3402072,2.3448416,2.4782167,2.5233350,1.5917363,1.6355997,1.5816622,1.6164543,1.5466306,1.5081628,1.4992875,1.6306420,1.6762845,1.6866838,2.4210023,2.4406259,2.3720587,2.4113856,2.3190864,2.4250728,2.4457677,2.4429783,2.4122941,2.4424428,2.4527037,2.4966437,2.4035152,2.4137089,2.3954934,2.4212645,2.4087689,2.4105695,2.5525013,1.5290941,1.5728092,1.5543364,1.5641066,1.6006943,1.5889370,1.5321614,1.5586370,1.5415335,1.5491241,1.5541842,1.5852345,1.5902462,1.5801461,1.5930383,1.5523744,1.5575027,1.5546879,1.5726514,1.5938686,1.6222843,1.5881920,1.5889394,1.6246987,1.5791531,1.6059873,1.5860296,1.5701796,1.5763713,1.6315066,1.6206847,1.5685662,1.5892450,1.6186198,1.6242586,1.5481500,2.5555498,2.5258320,1.6016010,1.6078024,1.7627993,2.5282046,2.5283876,1.6267890,1.6268478,1.6069182,1.5574029,1.5672845,1.6389830,1.5810995,1.5809019,1.5621111,1.6275267,1.5630592,1.5361265,1.5750796,1.5447555,1.5755329,1.5092036,1.5607702,1.5812295,1.6181003,1.6067561,1.6846355,1.6251237,1.7229672,1.6319477,2.0122061,1.6008821,1.4929003,1.5977780,1.6336924,1.7014357,1.7084902,1.5106016,1.6115451,1.5218875,1.5394796,1.5841880,1.6021178,1.6286143,1.6322306,1.6479547,1.6051730,1.6903781,1.5605062,1.6159015,1.5648148,1.6410461,1.5526851,1.5951099,1.6702494,1.6863064,1.5488679,1.5722265,1.5789145,2.3835058)
ratio <- c(ratio,2.4312322,2.4034768,2.4384596,2.4436314,2.4303576,2.4266296,2.3727400,2.3769620,1.4916100,2.0513010,2.0537072,1.6218816,1.6315519,1.6242272,1.5328026,2.4791800,2.4943584,1.5233619,1.5520066,1.5211411,1.5946195,1.5954327,1.5099963,1.6025225,1.4824897,1.5543129,1.6041963,1.9676651,1.8662227,1.9669943,1.5584472,1.5664382,1.5551137,1.5563330,1.5637281,1.6058321,2.3605461,1.4467948,2.4481076,2.2974519,2.4029441,2.4499736,2.4637541,2.4147364,2.4483680,2.4730025,2.4344831,2.4246536,2.3520329,2.3551388,2.4119589,2.4621961,2.3978363,2.4289587,2.4244460,2.3955454,2.4106017,2.4766059,2.5505131,2.5904964,2.4405065,2.5502677,2.5418027,2.5946520,2.3462233,2.3910095,2.3808582,2.4900000,2.4929675,2.5957481,2.6473263,2.6821399,2.5863157,2.5706163,2.5873251,2.3470114,2.3549401,2.4447632,2.3895783,2.4598023,1.9194875,1.9098258,1.5563858,1.9655190,1.9843544,1.5075750,1.9197088,1.8606831,1.5003864,1.5420225,1.5832182,1.4812740,1.5309472,1.5425323,1.6757806,1.6147225,1.5449271,1.6736108,1.6062912,1.5309035,1.5375653,1.5230470,1.4537051,1.5208274,1.6269986,1.6722362,1.5627355,1.5736269,1.5597532,1.6075006,1.5912113,1.4857339,1.4865164,1.5915908,1.5626763,1.6001816,1.5838017,1.6547792,1.6521739,1.5541260,1.5589220,1.6041114,1.6052877,1.6188839,1.5851005,1.5210468,1.5914289,1.5612923,1.5543678,1.6405511,1.5491183,1.5760101,1.5988843,1.5453074,1.5629486,2.3506363,2.3900338,2.3692994,2.3774440,2.3778159,1.4801504,1.5272571,1.5429716,1.5863007,1.5595683,1.5373692,1.5547804,1.6523496,1.6553747,2.4441022,2.3405286,2.4331678,2.4992773,2.4313793,2.3579883,2.4041657,2.3764722,2.3899039,2.4933925,2.4881635,2.3814493,2.4341179,2.5575816,2.4556496,2.5024020,2.4669698,2.4933790,2.5047439,2.4975195,2.4015712,2.4469022,2.4567287,2.5280128,2.4781728,2.5028472,2.5189290,2.5043568,2.4434804,2.4769850,2.4082988,2.4613414,2.3871942,2.4679938,2.4711610,2.3716282,2.4665808,2.4719631,2.4671665,2.3450332,2.3711461,2.3567956,2.4531954,2.4627118,2.4688380,2.4264698,2.4634535,2.4754286,2.4152280,2.4200143,2.4366610,2.4748473,2.3998817,2.4435630,2.4505969,2.4825692,2.4722832,2.3908582,2.5265103,2.4205017,2.4975109,1.8273926,1.8191600,1.7649152,1.8441525,1.8362518,1.7798777,1.7749614,1.8514624,1.8231416,1.7980675,1.8139444,1.7813007,1.8112133,1.8506497,1.7888465,1.7981612,1.7877846,1.8448715,1.8428863,1.7778798,1.8004113,1.8293173,1.8581288,0.7568285,0.7681167,0.7569872,0.7478473,0.7586729,0.7574590,0.7707119,0.7654296,0.7560802,0.7223322,0.7634792,0.7577954,0.7630233,0.7439842,0.7501500,0.7666750,0.7503895,0.7622286,0.7635520,0.7534987,0.7599475,0.7748956,0.7590139,0.7568013,0.7586510,0.7564873,0.7576587,0.7756744,0.7637029,0.7579558,0.7735586,0.7692345,0.7611547,0.7551704,0.7715118,0.7736083,0.7635049,0.7559560,0.7667778,0.7618454,0.7648308,0.7693512,0.7653699,0.7589299,2.3590999,2.3619440,2.3519471,2.3453709,2.3486040,2.4295357,2.4219361,2.4488210,2.4213269,2.3923218,1.5402144,0.7652187,0.7650076,0.7486516,0.7643569,0.7677992,0.7586368,0.7584026,0.7662775,0.7740569,0.7713377,0.7740677,0.7686109,0.7490695,0.7681402,0.7639513,0.7896797,0.7660664,0.7720809,2.4936417,2.3834268,0.7626155,0.7566950,0.7824697,0.7735270,2.4219853,2.5196019,0.7636966,0.7570312,2.4539218,2.4338622,2.4396830,0.7733773,0.7632390,0.7697633,0.7677199,0.7563686,0.7569366,0.7785563,2.3883388,2.4508850,2.5454545,2.4078138,2.4295884,2.5461211,2.3826057,2.4070722,2.4136287,2.4216196,2.3854115,2.4173648,2.4553045,2.4356992,2.3863270,2.4346182,2.3953523,2.4102911,2.4179386,2.4251441,2.4217263,2.4586260,2.4059818,2.4069762,2.4099267,2.4387039,2.4458400,2.4573490,2.4759171,2.4772782,2.4684424,2.4434910,2.4804417,2.4192554,2.4454277,2.3955812)
ratio <- c(ratio,2.4749573,2.4724740,2.4209932,2.4249042,0.7610100,2.3671184,2.3697960,2.4023920,2.4053188,2.4351452,2.4386339,2.4094596,2.3468609,2.3155974,2.3104149,2.3627904,2.2658038,2.3885135,2.3038569,2.3350929,2.3612206,2.2781386,2.2077955,2.2325030,2.2679324,2.2823724,2.4542063,2.1998876,2.3031394,2.2881462,2.3198411,2.1884115,2.2710756,2.1827032,2.3180018,2.2993929,2.2874717,2.2829849,2.3455578,2.2977960,2.3410225,2.4081727,2.3638018,2.2795117,2.3976837,2.3932511,2.4075917,2.4015616,2.3418264,2.3349013,2.4240838,2.4173924,2.3960085,2.4031925,2.3945173,2.3856273,2.4690719,2.4686576,2.6263546,2.4718199,2.3463103,2.3652738,2.4379702,2.4796148,2.4287476,2.2620077,2.3805959,2.3458551,2.3461876,2.3171412,2.3897433,2.3912464,2.3614724,2.5442086,1.7773526,1.7178727,2.3689272,2.3640495,2.2916104,2.3356112,2.3964407,2.4015147,1.7590361,2.3414859,2.2694388,2.3567392,2.2708003,2.2956377,2.3101355,2.3125120,2.3004763,2.2705803,2.2875544,2.2712565,2.3089819,2.2945688,2.3205069,2.2732516,1.7467849,1.7259182,1.7686619,1.7095978,2.2485174,2.4754472,2.4495703,2.4074395,2.4539152,2.3435356,2.3811995,2.4362843,2.4403479,2.5180754,0.7635700,0.7702974,0.7722231,0.7511177,0.7551106,0.7601832,0.7361963,0.7794874,0.7934396,0.7709755,0.7844490,0.7596845,0.7930613,0.7504008,0.7627609,0.7532478,0.7581377,0.7705301,0.7721196,0.7567074,0.7828257,0.7780374,0.7669904,0.7812995,0.7834886,0.7889141,0.7859603,0.7682853,0.7478378,0.7673373,0.7629394,0.7715725,0.7791851,0.7666474,0.7714516,0.7496962,0.7533639,0.7579321,0.7737278,0.7853168,0.7645964,0.7725103,0.7677113,1.6156064,1.5390375,1.5211357,1.5385035,1.4920242,1.6010156,1.5269279,1.5407954,1.4766077,1.5165483,2.4101888,2.4461818,2.4832620,2.4234102,2.3891139,2.4782895,2.4514298,2.4655178,1.5269570,1.5543206,1.5397302,1.5121552,1.5089835,1.5259734,1.5593889,1.5634023,1.5243902,1.5291230,1.5649990,2.3870746,2.4165985,2.4108952,2.5178706,2.4626444,2.4056250,2.4241015,2.2767663,2.1800944,2.2449665,2.2155462,2.2102232,2.1888782,2.1780130,2.1797660,2.2164432,2.1815838,2.2745605,2.2695021,2.1916002,2.1620086,2.1604276,2.1606576,2.1582508,2.1380764,2.1869745,2.1967953,2.1466257,2.1625341,2.2056076,2.1491951,2.2865496,2.2079291,2.1408957,2.1638916,2.5895003,2.1258483,2.1444175,2.2055520,2.1909330,2.2072956,2.2051386,2.4054279,2.4282109,2.4143596,2.4389768,2.4716971,2.4614070,2.4076957,2.4289880,2.4710598,2.4395144,2.4251923,2.4528039,2.4597420,2.4047160,2.4214961,2.4052775,2.4002135,2.3944374,2.4051049,2.4450801,1.5737999,1.5879028,1.4740943,1.5251975,1.5084994,1.5176669,1.5827066,1.8177106,2.4441547,2.4143742,2.4699203,2.3762657,2.3688625,2.4603598,2.4191691,2.4546725,2.4275861,2.4220888,2.4021738,2.4979091,2.4092485,2.4321795,2.4349427,2.4462136,2.4662561,2.3542403,2.4312644,2.4911342,2.4228929,2.4431390,2.4214492,2.4497284,2.4559866,2.4404771,2.4329174,2.4118930,2.4634206,2.4609880)
ratio <- c(ratio,2.4827362,2.3897923,2.4143036,2.4375662,2.4641043,2.4222062,2.4024445,2.4035411,2.4564304,2.4355113,2.3968869,2.4147049,2.4618251,2.4281002,2.3793209,2.4370193,2.3959920,2.3882394,2.4338975,2.4221606,2.4012736,2.4163148,2.4014352,2.3909576,2.4497283,2.3379209,2.2898380,2.3337816,2.2496425,2.3140063,2.2480368,2.3382981,2.4370101,2.3976064,1.5515221,1.6095634,1.5416116,1.5449938,1.5357950,1.6159637,2.5099319,2.4895688,2.5469330,2.4921635,2.4849558,2.5936576,2.4149410,2.3313097,2.3891210,2.3958461,2.4481496,2.3889537,2.3672844,2.3638062,2.3213211,2.2783695,2.3841916,2.3915916,2.4455549,2.3920387,2.3327502,2.3814912,2.3998684,2.5413619,2.4137805,2.4694112,2.2786062,2.3302104,2.3343673,2.3937296,2.3613661,2.4505636,2.4254633,2.5044136,2.4344560,2.3947676,2.3527630,2.3999480,2.3888254,2.4736389,2.4171134,2.4438647,2.2072640,2.4066542,2.4867753,2.4228284,2.4383648,2.4733934,2.3856568,2.3750662,2.2850762,2.3953039,2.4272293,2.4479172,2.4583161,2.4372287,2.4391820,2.4068650,2.3923539,2.4295108,2.4074787,2.4039005,2.4916317,2.4349603,2.4786877,2.4942217,2.2279876,2.4491045,2.2402399,2.3116072,2.3713333,2.3000641,2.2848940,2.4239510,2.4702209,2.4698362,2.4101398,2.3967287,2.4586776,2.4489123,2.4533943,2.4532398,2.4235131,2.4177043,2.4007880,2.4442298,2.4343659,2.4388599,2.3495992,2.4192377,2.3928401)
library(EMCluster)
ret <- init.EM(as.data.frame(ratio),nclass=6,method="Rnd.EM") # fitting model
# now how to plot the distribution curves?
Thank you in advance for your advice.
EDIT:
The structure of the resulting object is this:
> str(ret,vec.len = 10)
List of 13
$ pi : num [1:6] 0.0537 0.234 0.299 0.1124 0.2225 0.0784
$ Mu : num [1:6, 1] 1.794 1.568 2.429 0.765 2.316 2.474
$ LTSigma : num [1:6, 1] 0.011002 0.001946 0.001355 0.000124 0.008215 0.006486
$ llhdval : num 435
$ nc : int [1:6] 52 252 406 119 195 35
$ class : num [1:1059] 5 5 5 5 5 5 3 3 3 3 3 3 5 5 5 5 5 5 3 5 5 5 5 5 5 ...
$ conv.iter: int 93
$ conv.eps : num 9.81e-07
$ flag : num 0
$ n : int 1059
$ p : int 1
$ nclass : num 6
$ method : chr "Rnd.EM"
- attr(*, "class")= chr "emret"
This is pretty straightforward. There's a little bit of calculation involved in getting the heights of the curves to line up reasonably, but otherwise it's pretty basic. First plot a histogram. If you want a box around it, like your example, do that. Then you'll need to call lines() 6 times to plot the 6 normals. In R, lines are just a sequence of interpolated points—(x,y) coordinates—so make a sufficiently fine-grained set of x coordinates, then compute the normal density for each component using dnorm() and the fitted parameters. You'll need to multiply those y-values by the appropriate proportion and a height adjustment factor to get the heights of the curves right. It turns out that the highest bin in your histogram is 82, which is approximately the peak of your third component, but since that represents only 30%, you need to rescale the adjustment factor by that. You may want to choose your own colors. Consider:
xs <- seq(min(ratio), max(ratio), length.out=1000)
windows()
h <- hist(ratio, breaks=seq(0, 3, by=0.02)); box()
# max(h$counts) # [1] 82
height <- 82/dnorm(ret$Mu[3,1], mean=ret$Mu[3,1], sd=sqrt(ret$LTSigma[3,1]))
height <- height/ret$pi[3]
for(i in 1:6){
lines(xs, dnorm(xs, mean=ret$Mu[i,1], sd=sqrt(ret$LTSigma[i,1]))*height*ret$pi[i],
lwd=2, col=i)
}
Related
getting predicted values from an nls() model for Fabens von Bertalanffy growth curve
I have a growth dataset based off of recaptures. There are columns with the capture length, the recapture length, and the time (in yrs) based off the recapture minus the capture. > str(data) 'data.frame': 60 obs. of 3 variables: $ sizecapture : num 40.3 43 38.3 41.5 37.6 ... $ sizerecapture: num 43 48.7 39.5 42 46.7 43.5 43.5 47.2 45.7 59.9 ... $ timeinterval : num 0.945 1.036 0.997 0.997 2.471 ... I am following Ogle 2013 vignette (http://derekogle.com/fishR/examples/oldFishRVignettes/VonBertalanffyExtra.pdf) in R, for the Fabens method of trying to derive size at age. For this method I don't need an initial age (as I don't know age at all). I am not interested in extrapolating, but for only estimating the age of individuals that I have sizes for. I can easily follow the instructions for calculating the two parameters needed to inform the nls model: the k and the Linf. My aim is to create a age at length curve with the growth data, but when I get errors when I try to fitPlot. I get the error "Error in mdl$model[[gpos[2]]] : subscript out of bounds". I have also tried curve() and get the error "Error in FVB1(x) : could not find function "FVB1". I also can't figure out how to extract the confidence intervals that fit with the predicted data. I have searched and have found some similar cases but nothing that has worked. I'll continue to research, but am I missing something very basic? Below is a subsample of the data. I'd appreciate any help. Thank you install.packages("FSA") install.packages("FSAdata") install.packages("nlstools") install.packages(car) library(FSA) library(FSAdata) library(nlstools) library(car) sizecapture <- c(40.30,43.00,38.30,41.50,37.60,41.63,41.80,38.40,40.00,41.20,37.70,41.70,43.70,41.80,42.70,44.60,45.50,44.50,45.60,44.80,47.00,49.20,44.50,45.20,46.40,46.90,49.40,61.00,36.50,42.10,43.90,43.90,46.40,45.50,47.20,64.30,43.00,59.90,39.60,36.80) sizerecapture < c(43.0,48.7,39.5,42.0,46.7,43.5,43.5,47.2,45.7,59.9,48.1,46.5,45.7,49.1,48.7,47.1,46.9,48.3,47.2,53.7,52.0,51.2,56.2,56.3,57.5,57.7,55.4,74.5,45.6,44.9,46.7,51.0,49.4,58.0,56.8,71.6,43.8,44.6,43.7,41.9) timeinterval <-c(0.9452055,1.0356164,0.9972603,0.9972603,2.4712329,0.9534247,1.1945205,2.0027397,1.3178082,4.5342466,2.1863014,0.9178082,1.1315068,2.3698630,2.0575342,1.3835616,1.1726027,1.1972603,3.1698630,1.9589041,1.0712329,0.9150685,2.5671233,2.7780822,3.2000000,2.2246575,1.9150685,4.1753425,0.9287671,1.0328767,1.3945205,2.6739726,0.9205479,3.1479452,1.9506849,1.7178082,1.0520548,3.0767123,1.3726027,1.2520548) data <- data.frame(sizecapture, sizerecapture, timeinterval) ### using Ogle 2013 to calculate Linf and k # k and Linf with(data,mean((log(sizerecapture)-log(sizecapture))/timeinterval)) #0.0676` max(data$sizerecapture) # largest size is 74.5 Fabens.sv <- list(Linf=74.5, K=0.0676) # declare the model fvb <- vbFuns("Fabens") # fit and summarize FVB1<- nls(sizerecapture~ fvb(sizecapture,timeinterval,Linf,K),start=Fabens.sv,data=data) summary(FVB1,correlation=TRUE) # confidence intervals through bootstrapping boot <- nlsBoot(FVB1, niter=500) confint(boot,plot=TRUE) # plotting a fitted line plot ages2plot <- 0:40 LCI <- UCI <- numeric(length(ages2plot)) fitPlot(FVB1, xlim=range(ages2plot)) estes <- boot$coefboot for (i in 1:length(ages2plot)) { pv <-estes[,"Linf"]*(1-exp(-ests[,"K]*(ages2plot))) LCI[i] <- quantile(pv,0.025) UCI[i] <- quantile(pv,0.975) } lines(UCI~ages2plot,type="1") lines(LCI~ages2plot,type="1") # tried to just get a visual and errors arise curve(FVB1)
How exactly are outliers removed in R boxplot and how can the same outliers be removed for further calculation (e.g. mean)?
In a boxplot I've set the option outline=FALSE to remove the outliers. Now I'd like to include points that show the mean in the boxplot. Obviously, the means calculated using mean include the outliers. How can the very same outliers be removed from a dataframe so that the calculated mean corresponds to the data shown in the boxplot? I know how outliers can be removed, but which settings are used by the outline option from boxplot internally? Unfortunately, the manual does not give any clarifications.
To answer the second part of your question, about how the outliers are choosen, it's good to remind how the boxplot is constructed: the "body" of the boxplot corresponds to the second + third quartiles of the data (= interquartile range, IQR) each whisker limit is generally calculated taking 1.5*IQR beyond the end of that body. If you take the hypothesis that your data has a normal distribution, there are this amount of data outside each whisker: 1-pnorm(qnorm(0.75)+1.5*2*qnorm(0.75)) being 0.0035. Therefore, a normal variable has 0.7% of "boxplot outliers". But this is not a very "reliable" way to detect outliers, there are packages specifically designed for this.
To remove the outliers, you must set the option outline to FALSE. Let's assume that your data are the following: data <- data.frame(a = c(seq(0,1,0.1),3)) Then, you use the boxplot function: res <- boxplot(data, outline=FALSE) In the res object, you have several pieces of information about your data. Among these, res$out gives you all the outliers. Here there is only the value 3. Thus, to compute the mean without the outliers, you can simply do: mean(data$a[!data$a %in% res$out])
If you look at the Value section of ?boxplot, you find: "List with the following components:" [...] out the values of any data points which lie beyond the extremes of the whiskers." Thus, you can assing the result of your boxplot call to an object, extract the outliers, and remove them from the original values: x <- c(-10, 1:5, 50) x # [1] -10 1 2 3 4 5 50 bx <- boxplot(x) str(bx) # List of 6 # $ stats: num [1:5, 1] 1 1.5 3 4.5 5 # $ n : num 7 # $ conf : num [1:2, 1] 1.21 4.79 # $ out : num [1:2] -10 50 # $ group: num [1:2] 1 1 # $ names: chr "1" x2 <- x[!(x %in% bx$out)] x2 # [1] 1 2 3 4 5
Obtain spline surface on R
How do I generate a b-spline surface, let's say: x=attitude$rating y=attitude$complaints z=attitude$privileges would be x and y for the spline basis. z is the set of control points.
If I understand you, you have x,y, and z data and you want to use bivariate spline interpolation on x and y, using z for the control points. You can do this with interp(...) in the akima package. library(akima) spline <- interp(x,y,z,linear=FALSE) # rotatable 3D plot of points and spline surface library(rgl) open3d(scale=c(1/diff(range(x)),1/diff(range(y)),1/diff(range(z)))) with(spline,surface3d(x,y,z,alpha=.2)) points3d(x,y,z) title3d(xlab="rating",ylab="complaints",zlab="privileges") axes3d() The plot itself is fairly uninteresting with your dataset because x, y, and x are highly correlated. EDIT response to OP's comment. If you want a b-spline surface, try out mba.surf(...) in the unfortunately named MBA package. library(MBA) spline <- mba.surf(data.frame(x,y,z),100,100) library(rgl) open3d(scale=c(1/diff(range(x)),1/diff(range(y)),1/diff(range(z)))) with(spline$xyz,surface3d(x,y,z,alpha=.2)) points3d(x,y,z) title3d(xlab="rating",ylab="complaints",zlab="privileges") axes3d()
require(rms) # Harrell's gift to the R world. # Better to keep the original names and do so within a dataframe. att <- attitude[c('rating','complaints','privileges')] add <- datadist(att) # records ranges and descriptive info on data options(datadist="add") # need these for the rms functions # rms-`ols` function (ordinary least squares) is a version of `lm` mdl <- ols( privileges ~ rcs(rating,4)*rcs(complaints,4) ,data=att) # Predict is an rms function that works with rms's particular classes pred <- Predict(mdl, 'rating','complaints') # bplot calls lattice functions; levelplot by default; this gives a "3d" plot bplot(pred, yhat~rating+complaints, lfun=wireframe) It's a crossed restricted-cubic spline model. If you have a favorite spline function you want to use instead, then by all means try it out. I've had good luck with the rcs- function. This gives a more open mesh with fewer calculated points: pred <- Predict(mdl, 'rating','complaints', np=25) bplot(pred, yhat~rating+complaints, lfun=wireframe) png() bplot(pred, yhat~rating+complaints, lfun=wireframe) dev.off() You could use the rgl methods being illustrated by jhoward. The top of str(pred) looks like: str(pred) Classes ‘Predict’ and 'data.frame': 625 obs. of 5 variables: $ rating : num 43 44.6 46.2 47.8 49.4 ... $ complaints: num 45 45 45 45 45 ... $ yhat : num 39.9 39.5 39.1 38.7 38.3 ... $ lower : num 28 28.3 27.3 25 22 ... $ upper : num 51.7 50.6 50.9 52.4 54.6 ... snipped library(rgl) open3d() with(pred, surface3d(unique(rating),unique(complaints),yhat,alpha=.2)) with(att, points3d(rating,complaints,privileges, col="red")) title3d(xlab="rating",ylab="complaints",zlab="privileges") axes3d() aspect3d(x=1,z=.05) Good illustration of the dangers of extrapolation once you realize there are no data out on the extremes of inappropriate extrapolations from that model. The rms-package has a perimeter function and the plotting functions have a perim argument to which perimeter-objects are passed.
How to extract Ecdf value from the Ecdf() return?
The answer for This question here suggest a way by applying ecdf. However I am using Ecdf() from package Hmisc for it provides a convenient way to do a ccdf(Complementary Cumulative Distribution Function) plot. (by setting the what option to '1-F') By default, Ecdf() does the plot and return a nested list containing x and y. How can I extract the y value of a certain x value? and then plot it on the original plot? FYI: > str(Ecdf(rnorm(20), lwd = 2)) List of 2 $ x: num [1:21] -1.46 -1.46 -1.18 -1.17 -1.16 ... $ y: num [1:21] 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 ... - attr(*, "N")=List of 2 ..$ n: num 20 ..$ m: num 0 At first i am considering convert this list to a data.frame using methods suggested from R List to Data Frame, but my data is huge and the rbind seems really slow.
Ecdf returns a list whereas ecdf returns a function. It's a lot easier to use the R-stats function ecdf than it is to use something tortured like: Ecdf(.)$y[ min(which(Ecdf(.)$x>val))]. If you want the value of ecdf(x=0) from an Ecdf-object then this should work: ecdf( Ecdf(rnorm(20), lwd = 2)$x ) (v=0) [1] 0.5238095 (It turns out that the formal parameter for the function returned by ecdf is "v".) But if you want the less elegant method and you already have assigned the result to an object named 'oneEcdf': oneEcdf <- Ecdf(rnorm(20), lwd = 2) oneEcdf$y[ min( which(oneEcdf$x > 0 ))] [1] 0.6
Compute projection / hat matrix via QR factorization, SVD (and Cholesky factorization?)
I'm trying to calculate in R a projection matrix P of an arbitrary N x J matrix S: P = S (S'S) ^ -1 S' I've been trying to perform this with the following function: P <- function(S){ output <- S %*% solve(t(S) %*% S) %*% t(S) return(output) } But when I use this I get errors that look like this: # Error in solve.default(t(S) %*% S, t(S), tol = 1e-07) : # system is computationally singular: reciprocal condition number = 2.26005e-28 I think that this is a result of numerical underflow and/or instability as discussed in numerous places like r-help and here, but I'm not experienced enough using SVD or QR decomposition to fix the problem, or else put this existing code into action. I've also tried the suggested code, which is to write solve as a system: output <- S %*% solve (t(S) %*% S, t(S), tol=1e-7) But still it doesn't work. Any suggestions would be appreciated. I'm pretty sure that my matrix should be invertible and does not have any co-linearities, if only because I have tried testing this with a matrix of orthogonal dummy variables, and it still doesn't work. Also, I'd like to apply this to fairly large matrices, so I'm looking for a neat general solution.
Although OP has not been active for more than a year, I still decide to post an answer. I would use X instead of S, as in statistics, we often want projection matrix in linear regression context, where X is the model matrix, y is the response vector, while H = X(X'X)^{-1}X' is hat / projection matrix so that Hy gives predictive values. This answer assumes the context of ordinary least squares. For weighted least squares, see Get hat matrix from QR decomposition for weighted least square regression. An overview solve is based on LU factorization of a general square matrix. For X'X (should be computed by crossprod(X) rather than t(X) %*% X in R, read ?crossprod for more) which is symmetric, we can use chol2inv which is based on Choleksy factorization. However, triangular factorization is less stable than QR factorization. This is not hard to understand. If X has conditional number kappa, X'X will have conditional number kappa ^ 2. This can cause big numerical difficulty. The error message you get: # system is computationally singular: reciprocal condition number = 2.26005e-28 is just telling this. kappa ^ 2 is about e-28, much much smaller than machine precision at around e-16. With tolerance tol = .Machine$double.eps, X'X will be seen as rank deficient, thus LU and Cholesky factorization will break down. Generally, we switch to SVD or QR in this situation, but pivoted Cholesky factorization is another choice. SVD is the most stable method, but too expensive; QR is satisfyingly stable, at moderate computational costs, and is commonly used in practice; Pivoted Cholesky is fast, with acceptable stability. For large matrix this one is preferred. In the following, I will explain all three methods. Using QR factorization Note that the projection matrix is permutation independent, i.e., it does not matter whether we perform QR factorization with or without pivoting. In R, qr.default can call LINPACK routine DQRDC for non-pivoted QR factorization, and LAPACK routine DGEQP3 for block pivoted QR factorization. Let's generate a toy matrix and test both options: set.seed(0); X <- matrix(rnorm(50), 10, 5) qr_linpack <- qr.default(X) qr_lapack <- qr.default(X, LAPACK = TRUE) str(qr_linpack) # List of 4 # $ qr : num [1:10, 1:5] -3.79 -0.0861 0.3509 0.3357 0.1094 ... # $ rank : int 5 # $ qraux: num [1:5] 1.33 1.37 1.03 1.01 1.15 # $ pivot: int [1:5] 1 2 3 4 5 # - attr(*, "class")= chr "qr" str(qr_lapack) # List of 4 # $ qr : num [1:10, 1:5] -3.79 -0.0646 0.2632 0.2518 0.0821 ... # $ rank : int 5 # $ qraux: num [1:5] 1.33 1.21 1.56 1.36 1.09 # $ pivot: int [1:5] 1 5 2 4 3 # - attr(*, "useLAPACK")= logi TRUE # - attr(*, "class")= chr "qr" Note the $pivot is different for two objects. Now, we define a wrapper function to compute QQ': f <- function (QR) { ## thin Q-factor Q <- qr.qy(QR, diag(1, nrow = nrow(QR$qr), ncol = QR$rank)) ## QQ' tcrossprod(Q) } We will see that qr_linpack and qr_lapack give the same projection matrix: H1 <- f(qr_linpack) H2 <- f(qr_lapack) mean(abs(H1 - H2)) # [1] 9.530571e-17 Using singular value decomposition In R, svd computes singular value decomposition. We still use the above example matrix X: SVD <- svd(X) str(SVD) # List of 3 # $ d: num [1:5] 4.321 3.667 2.158 1.904 0.876 # $ u: num [1:10, 1:5] -0.4108 -0.0646 -0.2643 -0.1734 0.1007 ... # $ v: num [1:5, 1:5] -0.766 0.164 0.176 0.383 -0.457 ... H3 <- tcrossprod(SVD$u) mean(abs(H1 - H3)) # [1] 1.311668e-16 Again, we get the same projection matrix. Using Pivoted Cholesky factorization For demonstration, we still use the example X above. ## pivoted Chol for `X'X`; we want lower triangular factor `L = R'`: ## we also suppress possible rank-deficient warnings (no harm at all!) L <- t(suppressWarnings(chol(crossprod(X), pivot = TRUE))) str(L) # num [1:5, 1:5] 3.79 0.552 -0.82 -1.179 -0.182 ... # - attr(*, "pivot")= int [1:5] 1 5 2 4 3 # - attr(*, "rank")= int 5 ## compute `Q'` r <- attr(L, "rank") piv <- attr(L, "pivot") Qt <- forwardsolve(L, t(X[, piv]), r) ## P = QQ' H4 <- crossprod(Qt) ## compare mean(abs(H1 - H4)) # [1] 6.983997e-17 Again, we get the same projection matrix.