Im kinda new to R. I have a dataset, which also includes data of family income and I have to fit a Gamma distribution to this data, using the Maximum Likelihood Estimates. It is specifically told that we need to use the package optim, and not fitdistr. So this is my code:
t1 <- sum(log(newdata$faminc))
t2 <- sum(newdata$faminc)
obs <- nrow(newdata)
lh.gamma <- function(par) {
-((par[1]-1)*t1 - par[2]*t2 - obs*par[1]*log(par[2]) - obs*lgamma(par[1]))
}
#initial guess for a = mean^2(x)/var(x) and b = mean(x) / var(x)
a1 <- (mean(newdata$faminc))^2/var(newdata$faminc)
b1 <- mean(newdata$faminc)/var(newdata$faminc)
init <- c(a1,b1)
q <- optim(init, lh.gamma, method = "BFGS")
q
Also tried filling in just values in the init vector, and including this piece of code;
dlh.gamma <- function(par){
cbind(obs*digamma(par[1])+obs*log(par[2])-t2,
obs*par[1]/par[2]-1/par[2]^2*t1)
}
and then the optim would look like :
q <- optim(init, lh.gamma, dhl.gamma, method="BFGS")
None of it 'works'. First, when I tried the code at school computers, it gave me very huge numbers for the shape and rate parameters, which was not possible. Now, trying at home, I get this:
> q <- optim(init, lh.gamma, method = "BFGS")
Error in optim(init, lh.gamma, method = "BFGS") :
non-finite finite-difference value [2]
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> q
function (save = "default", status = 0, runLast = TRUE)
.Internal(quit(save, status, runLast))
<bytecode: 0x000000000eaac960>
<environment: namespace:base>
q is not even 'created'. Except for when I include the dlh.gamma part above, but then I just get huge numbers again and no convergence.
Anybody who knows what goes wrong/what to do?
Edit:
> dput(sample(newdata$faminc, 500))
c(42.5, 87.5, 22.5, 17.5, 12.5, 30, 30, 17.5, 42.5, 62.5, 62.5,
30, 30, 150, 22.5, 30, 42.5, 30, 17.5, 8.75, 42.5, 42.5, 42.5,
62.5, 42.5, 30, 17.5, 87.5, 62.5, 150, 42.5, 150, 42.5, 42.5,
42.5, 6.25, 62.5, 87.5, 6.25, 87.5, 30, 150, 22.5, 62.5, 42.5,
150, 17.5, 42.5, 42.5, 42.5, 62.5, 22.5, 42.5, 42.5, 30, 62.5,
30, 62.5, 87.5, 87.5, 42.5, 22.5, 62.5, 22.5, 8.75, 30, 30, 17.5,
87.5, 8.75, 62.5, 30, 17.5, 22.5, 62.5, 42.5, 30, 17.5, 62.5,
8.75, 62.5, 42.5, 150, 30, 62.5, 87.5, 17.5, 62.5, 30, 62.5,
87.5, 42.5, 62.5, 30, 62.5, 42.5, 87.5, 150, 12.5, 42.5, 62.5,
42.5, 62.5, 62.5, 150, 30, 87.5, 12.5, 17.5, 42.5, 62.5, 30,
6.25, 62.5, 42.5, 12.5, 62.5, 8.75, 17.5, 42.5, 62.5, 87.5, 8.75,
62.5, 30, 62.5, 87.5, 42.5, 62.5, 62.5, 12.5, 150, 42.5, 62.5,
12.5, 62.5, 42.5, 62.5, 62.5, 87.5, 42.5, 62.5, 30, 42.5, 150,
42.5, 30, 62.5, 62.5, 87.5, 42.5, 30, 62.5, 62.5, 42.5, 42.5,
30, 62.5, 42.5, 42.5, 62.5, 62.5, 150, 42.5, 30, 42.5, 62.5,
17.5, 62.5, 17.5, 150, 8.75, 62.5, 30, 62.5, 42.5, 42.5, 22.5,
150, 62.5, 42.5, 62.5, 62.5, 22.5, 30, 62.5, 30, 150, 42.5, 42.5,
42.5, 62.5, 30, 12.5, 30, 150, 12.5, 8.75, 22.5, 30, 22.5, 30,
42.5, 42.5, 42.5, 30, 12.5, 62.5, 42.5, 30, 22.5, 42.5, 87.5,
22.5, 12.5, 42.5, 62.5, 62.5, 62.5, 30, 42.5, 30, 62.5, 30, 62.5,
12.5, 22.5, 42.5, 22.5, 87.5, 30, 22.5, 17.5, 42.5, 62.5, 17.5,
250, 150, 42.5, 30, 42.5, 30, 62.5, 17.5, 87.5, 22.5, 150, 62.5,
42.5, 6.25, 87.5, 62.5, 42.5, 30, 42.5, 62.5, 42.5, 87.5, 62.5,
150, 42.5, 30, 6.25, 22.5, 30, 42.5, 42.5, 62.5, 250, 8.75, 150,
42.5, 30, 42.5, 30, 42.5, 42.5, 30, 30, 150, 22.5, 62.5, 30,
8.75, 150, 62.5, 87.5, 150, 42.5, 30, 42.5, 42.5, 42.5, 30, 8.75,
42.5, 42.5, 30, 22.5, 62.5, 17.5, 62.5, 62.5, 42.5, 8.75, 42.5,
12.5, 12.5, 150, 42.5, 42.5, 17.5, 42.5, 62.5, 62.5, 42.5, 42.5,
30, 42.5, 62.5, 30, 62.5, 42.5, 42.5, 42.5, 22.5, 62.5, 62.5,
62.5, 22.5, 150, 62.5, 42.5, 62.5, 42.5, 30, 30, 62.5, 22.5,
62.5, 87.5, 62.5, 42.5, 42.5, 22.5, 62.5, 62.5, 30, 42.5, 42.5,
8.75, 87.5, 42.5, 42.5, 87.5, 30, 62.5, 17.5, 62.5, 42.5, 17.5,
22.5, 62.5, 8.75, 62.5, 22.5, 22.5, 22.5, 42.5, 17.5, 22.5, 62.5,
42.5, 42.5, 42.5, 42.5, 42.5, 30, 30, 8.75, 30, 42.5, 62.5, 22.5,
6.25, 30, 42.5, 62.5, 17.5, 62.5, 42.5, 8.75, 22.5, 30, 17.5,
22.5, 62.5, 42.5, 150, 87.5, 22.5, 12.5, 62.5, 62.5, 62.5, 30,
42.5, 22.5, 62.5, 87.5, 30, 42.5, 62.5, 22.5, 87.5, 30, 30, 22.5,
87.5, 87.5, 250, 30, 62.5, 250, 62.5, 42.5, 42.5, 62.5, 62.5,
42.5, 6.25, 62.5, 62.5, 62.5, 42.5, 42.5, 150, 62.5, 62.5, 30,
150, 22.5, 87.5, 30, 150, 17.5, 8.75, 62.5, 42.5, 62.5, 150,
42.5, 22.5, 42.5, 42.5, 17.5, 62.5, 17.5, 62.5, 42.5, 150, 250,
22.5, 42.5, 30, 62.5, 62.5, 42.5, 42.5, 30, 150, 150, 42.5, 17.5,
17.5, 42.5, 8.75, 62.5, 42.5, 42.5, 22.5, 150, 62.5, 30, 250,
62.5, 87.5, 62.5, 8.75, 62.5, 30, 30, 8.75, 17.5, 17.5, 150,
22.5, 62.5, 62.5, 42.5)
The faminc variable is in 1000s
Edit2:
Okay, the code is good, but now I try to fit the distribution over the histogram using the following:
x <- rgamma(500,shape=q$par[1],scale=q$par[2])
hist(newdata$faminc, prob = TRUE)
curve(dgamma(x, shape=q$par[1], scale=q$par[2]), add=TRUE, col='blue')
It just produces a flat blue line at the x-axis..
You've got some things going on I haven't been able to work out, but here's a demonstration of the estimation.
Let's start by generating some data (so we know if the optimization is working). I only changed your optimization function below, and used Nelder-Mead instead of the quasi-Newton.
set.seed(23)
a <- 2 # shape
b <- 3 # rate
require(data.table)
newdata <- data.table(faminc = rgamma(10000, a, b))
t1 <- sum(log(newdata$faminc))
t2 <- sum(newdata$faminc)
obs <- nrow(newdata)
llf <- function(x){
a <- x[1]
b <- x[2]
# log-likelihood function
return( - ((a - 1) * t1 - b * t2 - obs * a * log(1/b) - obs * log(gamma(a))))
}
# initial guess for a = mean^2(x)/var(x) and b = mean(x) / var(x)
a1 <- (mean(newdata$faminc))^2/var(newdata$faminc)
b1 <- mean(newdata$faminc)/var(newdata$faminc)
q <- optim(c(a1, b1), llf)
q$par
[1] 2.024353 3.019376
I'd say we're pretty close.
With your data:
(est <- q$par)
[1] 2.21333613 0.04243384
theoretical <- data.table(true = rgamma(10000, est[1], est[2]))
library(ggplot2)
ggplot(newdata, aes(x = faminc)) + geom_density() + geom_density(data = theoretical, aes(x = true, colour = "red")) + theme(legend.position = "none")
Not great, but reasonable for 500 obs.
Response to OP Edit 2:
You should look more closely at the functions you're using, curve accepts a function argument, not vector values:
gamma_density = function(x, a, b) ((b^a)/gamma(a)) * (x^(a - 1)) * exp(-b * x)
hist(newdata$faminc, prob = TRUE, ylim = c(0, 0.015))
curve(gamma_density(x, a = q$par[1], b = q$par[2]), add=TRUE, col='blue')
Related
I want to choose 100 houses randomly from my dataset, and find the mean value of their total price. Then repeat this action 100 times, and for each time I repeat the action, calculate the mean price. And then plot all the mean values in a histogram. This is my code (rome is the house dataset):
run <- rome[sample(1:nrow(rome), 100, replace=FALSE),]
dun <- mean(run$PRICE)
c <- replicate(100, dun)
I also tried the for loop, which I'm pretty sure I need to use here, but there are mistakes in my code:
d <- for(i in 1:100){
run <- rome[sample(1:nrow(rome), 100, replace=FALSE),]
dun <- mean(run$PRICE)
c <- replicate(100, dun)
}
And finally hist(d) , which doesn't run because of the mistakes. Can you help me?
The data (price values):
good_struct <-
c(
47,
113,
165,
104.3,
62.5,
70,
127.5,
64.5,
145,
63.5,
58.9,
65,
48,
3.5,
12.8,
17.5,
36,
41.9,
53.5,
24.5,
24.5,
55.5,
60,
51,
46,
46,
44,
54.9,
42.5,
44,
44.9,
37.9,
33,
43.9,
49.6,
52,
37.5,
50,
35.9,
42.9,
107,
112,
44.9,
55,
102,
35.5,
62.9,
39,
110,
8,
62,
85.9,
57,
110,
67.7,
89.5,
70,
74,
13,
48,
24,
53.5,
34.5,
53,
87.5,
33.5,
24,
9.6,
30,
41,
30,
38.9,
20.7,
49.9,
18.6,
39,
34,
16,
18.9,
15.2,
41.5,
53,
22,
24.9,
6.7,
32.5,
30,
59,
29.5,
26,
16.5,
39,
48.9,
33.5,
46,
54,
57.9,
37.9,
32,
31,
34,
29,
32.5,
51.9,
31,
41.8,
48,
28,
35,
46.5,
51.9,
35.4,
16,
35,
35,
36.5,
35.9,
45,
40,
35,
38,
37,
23,
25.5,
39.5,
21.5,
9,
67.5,
13.4,
12.5,
28.5,
23,
33.5,
9,
11,
30.9,
31.65,
33,
33.4,
47,
40,
46,
45.5,
57,
29.9,
30,
34,
51,
64.5,
57.5,
85.5,
61,
38,
56.5,
60.4,
51.5,
54,
69,
56,
27.9,
37.5,
32.9,
22,
29.9,
39.9,
32.6,
38.5,
21.5,
25.9,
27.5,
22.9,
31.5,
8.5,
5.5,
33,
57,
47,
43.5,
43.9,
68.5,
44.25,
61,
40,
44.5,
57,
35,
35.1,
64.5,
40,
42.6,
50,
58,
58,
55,
43,
54,
39,
45,
42,
38.9,
43.215,
26.5,
30,
29.5
)
Since replicate is a wrapper to sapply, consider adjusting the call by passing in an expression that subsets a vector then calls mean:
random_mean_prices <- replicate(
100, mean(rome$PRICE[sample(1:nrow(rome), 100, replace=FALSE)])
)
hist(random_mean_prices)
Perhaps something like this?
rome <- data.frame(PRICE = rnorm(1e6,3e5,5e4),
ID = 1:1e6)
dun = NULL
for(i in 1:100){
run <- rome[sample(1:nrow(rome), 100, replace=FALSE),]
dun <- c(dun, mean(run$PRICE))
}
hist(dun)
Why does a fixed intercept lead to a huge negative shift? See the red line.
Form the docs ?poly
Returns or evaluates orthogonal polynomials of degree 1 to degree over
the specified set of points x: these are all orthogonal to the
constant polynomial of degree 0.
Thus, I would expect the polynomial of degree 0 to be the intercept. What do I miss?
plot(df$t, df$y)
# this is working as expected
model1 <- lm(y ~ -1 + poly(t, 10, raw = TRUE), data = df)
model2 <- lm(y ~ -1 + poly(t, 10, raw = FALSE), data = df)
model3 <- lm(y ~ poly(t, 10, raw = TRUE), data = df) # raw = FALSE gives similar results
nsamples <- 1000
new_df <- data.frame(t = seq(0, 96, length.out = nsamples))
new_df$y1 <- predict(model1, newdata = new_df)
new_df$y2 <- predict(model2, newdata = new_df)
new_df$y3 <- predict(model3, newdata = new_df)
plot(new_df$t, new_df$y1, type = "l", ylim = c(-0.5, 1))
lines(new_df$t, new_df$y2, col = "red")
lines(new_df$t, new_df$y3 + 0.05, col = "blue") # offest for visibilty added!!
lines(c(0, 96), -c(mean(df$y), mean(df$y)), col = "red")
Edit: I think the question is equivalent to "what orthogonal polynomials are used (formula)?". The reference in the docs is a really old book - I can't get it. And there are a lot of different ortogonal poynomials, see e.g. Wikipedia.
Data:
df <- structure(list(t = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5,
8.5, 9.5, 10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5,
19.5, 20.5, 21.5, 22.5, 23.5, 24.5, 25.5, 26.5, 27.5, 28.5, 29.5,
30.5, 31.5, 32.5, 33.5, 34.5, 35.5, 36.5, 37.5, 38.5, 39.5, 40.5,
41.5, 42.5, 43.5, 44.5, 45.5, 46.5, 47.5, 48.5, 49.5, 50.5, 51.5,
52.5, 53.5, 54.5, 55.5, 56.5, 57.5, 58.5, 59.5, 60.5, 61.5, 62.5,
63.5, 64.5, 65.5, 66.5, 67.5, 68.5, 69.5, 70.5, 71.5, 72.5, 73.5,
74.5, 75.5, 76.5, 77.5, 78.5, 79.5, 80.5, 81.5, 82.5, 83.5, 84.5,
85.5, 86.5, 87.5, 88.5, 89.5, 90.5, 91.5, 92.5, 93.5, 94.5, 95.5),
y = c(0.00561299852289513, 0.0117183653372723, 0.0171836533727228,
0.0234367306745446, 0.0280157557853274, 0.0331856228458887, 0.0391432791728213,
0.0438700147710487, 0.048793697685869, 0.0539635647464303, 0.0586903003446578,
0.0630723781388479, 0.0681437715411128, 0.0732151649433777, 0.0780403741999015,
0.0813884785819793, 0.085425898572132, 0.0896110290497292, 0.0934022648941408,
0.0968980797636632, 0.0996061053668144, 0.103495814869522, 0.107631708517971,
0.111176760216642, 0.115017232890202, 0.119350073855244, 0.124766125061546,
0.131216149679961, 0.139586410635155, 0.148153618906942, 0.156080748399803,
0.166814377154111, 0.177006400787789, 0.189118660758247, 0.202412604628262,
0.217577548005908, 0.234318069916297, 0.249089118660758, 0.267355982274741,
0.284539635647464, 0.301477104874446, 0.316100443131462, 0.332151649433776,
0.346873461349089, 0.361792220580995, 0.376366322008863, 0.392220580994584,
0.408173313638602, 0.424224519940916, 0.439192516001969, 0.454849827671098,
0.471196454948301, 0.485622845888725, 0.500443131462334, 0.514869522402757,
0.529148202855736, 0.544559330379124, 0.559773510585918, 0.576218611521418,
0.593303791235844, 0.609010339734121, 0.623929098966027, 0.6397341211226,
0.655489906450025, 0.669768586903003, 0.68493353028065, 0.698867552929591,
0.713244707040867, 0.726095519448548, 0.74027572624323, 0.752584933530281,
0.76903003446578, 0.781486952240276, 0.794091580502216, 0.804726735598227,
0.818217626784835, 0.832742491383555, 0.845691777449532, 0.856179222058099,
0.866075824716888, 0.875923190546529, 0.886952240275726, 0.896898079763663,
0.906203840472674, 0.915755785327425, 0.923879862136878, 0.932693254554407,
0.940768094534712, 0.949187592319055, 0.956523879862137, 0.964204825209257,
0.971344165435746, 0.978532742491384, 0.986558345642541, 0.993205317577548, 1)),
class = "data.frame", row.names = c(NA, -96L))
Just think about a regression line. For (x, y) data, let xx = mean(x) and yy = mean(y). Fitting
y = b * (x - xx)
is different from fitting
y = a + b * (x - xx)
and that a (intercept) measures the vertical shift. Furthermore, it can be shown that a = yy.
I have a vector of length 370 that I would like to fit to a mixture of Gaussians. I have followed the example here: Any suggestions for how I can plot mixEM type data using ggplot2 to plot the data, but as you can see from the image link, my results are different from those in the example: Plot of a mixture of three Gaussians
Here is a snippet of the code that I used:
library(ggplot2)
library(mixtools)
gg.mixEM <- function(EM) {
require(ggplot2)
x <- with(EM,seq(min(x),max(x),len=1000))
pars <- with(EM,data.frame(comp=colnames(posterior), mu, sigma,lambda))
em.df <- data.frame(x=rep(x,each=nrow(pars)),pars)
em.df$y <- with(em.df,lambda*dnorm(x,mean=mu,sd=sigma))
ggplot(data.frame(x=EM$x),aes(x,y=..density..)) +
geom_histogram(fill=NA,color="black",bins=41)+
geom_polygon(data=em.df,aes(x,y,fill=comp),color="grey50", alpha=0.5)+
scale_fill_discrete("Component\nMeans",labels=format(em.df$mu,digits=3))+
theme_bw()
}
dput(gradesCS)
c(6.5, 22.375, 20.5, 24.25, 33.25, 24, 26.75, 30.75, 35.5, 23.5,
26.875, 24, 35.5, 29.875, 29.75, 31.25, 32.875, 33.75, 34, 29,
33, 24, 12, 26.375, 6.75, 31.25, 21.625, 32.875, 29.25, 27.125,
28.25, 26.25, 24.875, 35.5, 26.5, 37.5, 35.375, 27.5, 33, 27.5,
39.5, 34.25, 28.125, 28, 32.625, 37.625, 34.5, 29.5, 38.5, 37.5,
28.75, 38, 16, 35.75, 30, 33.5, 36, 31.125, 29.75, 32.5, 35,
24.375, 23.375, 28, 32.125, 36, 31.5, 33.5, 1.5, 30.5, 37, 29.5,
29.5, 31.125, 32.5, 20.5, 28.75, 30.25, 32.5, 28, 36, 37.5, 28.5,
35.5, 30.25, 36.375, 36, 23.25, 31.5, 25.125, 33.5, 34, 19.5,
31.75, 39.5, 33.25, 24.875, 26.75, 23.375, 34, 16.5, 37, 33.375,
31.25, 31.75, 35.5, 32, 27.5, 23.375, 20.625, 35.5, 31.5, 25.375,
24.5, 27.25, 25.25, 35.75, 24, 28.25, 33.125, 31.5, 39.5, 39.25,
24.75, 37, 25.5, 34.75, 34, 20.25, 37.625, 30.5, 32.375, 15,
32.75, 33.5, 32.75, 31.5, 29.25, 30, 37.25, 34.5, 23, 32.5, 38.25,
35.625, 33, 35, 31.125, 37, 28.125, 29.25, 31.75, 34.75, 34.625,
36.625, 15.25, 35.5, 37, 33.5, 30.875, 35, 31.625, 22.75, 31,
31.125, 25.125, 35.5, 2, 36.125, 25.25, 32.5, 28, 38.5, 35.5,
38.5, 30.5, 34, 28.125, 38, 29.25, 29.75, 33.25, 25.125, 35,
34.5, 32, 35, 26.875, 20.5, 35.5, 23.25, 26.25, 36, 35.5, 38,
39.25, 22, 38.5, 31, 35.5, 33.5, 31.5, 26, 30.375, 35.75, 29.75,
34, 37.625, 38, 35.5, 34.25, 24.375, 30, 33.75, 39.5, 36.5, 36.5,
32, 36.5, 29.75, 29.75, 25, 32, 29.25, 32.125, 31.25, 38, 33.5,
33.5, 38.5, 37.25, 31.125, 33.5, 31, 28, 29.75, 36, 36, 37, 22,
29, 36.5, 32.25, 30.75, 38.5, 24.125, 28.75, 38.25, 32.5, 34.75,
29, 30.375, 33.5, 31.25, 30, 33, 33.5, 27.5, 26.5, 30.25, 34.75,
33.5, 39, 33.25, 38.5, 27, 39.5, 34.25, 33, 35.125, 38, 31.25,
32.75, 22.75, 31.125, 34.5, 33, 37.125, 31, 18.75, 30.25, 31.75,
34, 30.75, 29, 34.5, 36, 36.5, 31.5, 26, 27.5, 27.5, 36.5, 19.75,
33, 35.125, 16, 19.75, 31.5, 38.5, 34.25, 36.5, 27, 22, 21.75,
36, 31.5, 33, 29.75, 32.5, 26.25, 33.5, 35.75, 33, 39, 35, 34.25,
28.5, 25.5, 30.5, 28, 21.25, 39.125, 22.75, 28.375, 29.125, 30,
34.125, 31.25, 32, 26.25, 36, 24.5, 30.25, 32.75, 29.625, 16,
34, 16.75, 25.25, 33, 38, 28, 24.75, 29.75, 24.5, 19.25, 32.75,
27.5, 24.75, 17.375, 25.25, 30.125, 38, 28, 35, 11.75, 27.75,
38, 28.625, 31.25, 31.25, 32, 17.25, 18.25, 32.625, 25.5, 27.5,
35.25, 35.5)
b <- gradesCS
c <- b[sample(length(b), length(b)) ]
c3 <- normalmixEM(c, lambda=NULL, mu=NULL, sigma=NULL,k=3,maxit=1000,epsilon = 1e-2)
gg.mixEM(c3)
The problem is that polygons freak out if they don't have continuous drawing space (e.g. if you end abruptly at 0, but the polygon function has not reached 0).
In the first line of the ggplot function, add extra spacing on each side of x. I'm going with 5 here, but you just need enough for the function to hit 0.
x <- with(EM,seq(min(x)-5,max(x)+5,len=1000))
In the bottom, we cut off the excess space with
coord_cartesian(xlim = c(0,42),
expand = c(0,0))
This renders the graph with your spacing, and then "zooms in" on the selected x interval.
fit_test <- normalmixEM(
test,
k = 2)
gg.mixEM <- function(EM) {
require(ggplot2)
x <- with(EM,seq(min(x)-5,max(x)+5,len=1000))
pars <- with(EM,data.frame(comp=colnames(posterior), mu, sigma,lambda))
em.df <- data.frame(x=rep(x,each=nrow(pars)),pars)
em.df$y <- with(em.df,lambda*dnorm(x,mean=mu,sd=sigma))
ggplot(data.frame(x=EM$x),aes(x,y=..density..)) +
geom_histogram(fill=NA,color="black",bins=41)+
geom_polygon(data=em.df,aes(x,y,fill=comp),color="grey50", alpha=0.5)+
scale_fill_discrete("Component\nMeans",labels=format(em.df$mu,digits=3))+
theme_bw() +
coord_cartesian(xlim = c(0,42),
expand = c(0,0))
}
gg.mixEM(fit_test)
And we get
I am new to R and I got a hold of this program I am trying to run. But I am getting error in the variable "outcome01". It could be that if I (somehow) fix this variable, there would be other similar errors. Any help is appreciated; Here is the code:
library(norm)
# Make appropriate changes in file and variable names.
setwd ("c:\\Users\\Dave Desktop\\Dropbox\\Webs\\StatPages\\More_Stuff\\Missing_Data")
x <- read.table("survrateMissingNA.dat", header = TRUE)
y <- as.matrix(x) #convert table to matrix
cat("Logistic regression using 132 cases and missing data \n\n")
print(summary(glm(formula = outcome01~survrate + gsi + avoid + intrus, binomial, data = x))) #Use original data with missing values
attach(x)
##########
## Important The following code will run m = 5 times. The data will be concatenated into ComFile and then analyzed.
# Data Augmentation using norm.R
m <- 5 #Number of imputations
k <- 9 #Number of variables in raw data file
l <- 5 #Number of variables actually used in regression
CombFile <- matrix(nrow = 0, ncol = k)
for (i in 1:m) {
s <- prelim.norm(y) #get preliminary statistics for the analysis
thetahat <-em.norm(s) #Get MLE for start value
rngseed(25672)
theta <- da.norm(s, thetahat, steps=200, showits=TRUE) # GET MLE
getparam.norm(s, theta) #Print out those results
impdata <-imp.norm(s, theta, y) #Impute the data
filename <- paste("CombFile", i, sep = "")
CombFile <- rbind(CombFile, impdata)
write(t(impdata), file = "impsurvrate", ncolumns = 9, sep = " ")
z <- data.frame(impdata)
z$outcome01 <- round(z$outcome01, digits = 0)
summary((glm(formula = outcome01~survrate + gsi + avoid + intrus, binomial, data = z))) #Use imputed data.
}
## Creating the final data file with imputed data 660 rows
nPerImp <- nrow(CombFile)/m
imps <- rep(1:m, each = nPerImp)
# Add a variable representing the imputation number.
data <- as.data.frame(cbind(imps, CombFile))
data$outcome01 <- round(data$outcome01, digits = 0)
# head(data)
attach(data)
## Set up variables to hold results
b <- matrix(NA,nrow = m, ncol = 2*l)
meanb <- numeric(l)
meanvar <- numeric(l)
varb <- numeric(l)
TT <- numeric(l)
sqrtt <- numeric(l)
t <- numeric(l)
## Run a logistic regression on each of the 5 imputed data sets and store the
## coefficients and theire standard errors.
for (i in 1:m) { # Modify following line appropriately
model <- glm(outcome01~survrate + gsi + avoid + intrus ,subset = (imps ==i), binomial, data = data)
a <- summary(model)
# print(a)
n <- 2*l
b[i,] <- a$coefficients[1:n]
}
## Calculate the coefficients, st. errors, and t values across 5 imputations
for (i in 1:l) {
meanb[i] <- mean(b[,i])
meanvar[i] <- mean((b[,i+l]^2))
varb[i] <- var(b[,i])
}
cat("\n\n\nThe mean regression coefficients are: \n\n")
print(meanb)
for (i in 1:l) {
TT[i] <- meanvar[i] + (1 + 1/5)*varb[i]
sqrtt[i] <- sqrt(TT[i])
t[i] <- meanb[i]/sqrtt[i]
}
cat("The standard errors are: \n\n")
print(sqrtt)
cat("\n The t values are: \n\n")
print(t)
Here is the data ( in survrateMissingNA.dat file):
c(1, 4.405, 17.2, 31.144, 491, 1029, 61, 20.2, 999, 2, 8.963,
17.6, 47.951, 445, 934, 32, 21, 3.85, 3, 4.778, 19.3, 32.175,
448, 944, 27, 21.1, 3.296, 4, 999, 17.1, 28.934, 482, 1005, 66,
20.3, 1.792, 5, 4.992, 24, 41.078, 417, 902, 11, 21, 3.807, 6,
5.443, 18.4, 34.571, 462, 980, 62, 21.5, 3.367, 7, 8.817, 14.4,
999, 431, 908, 3, 21.7, 4.394, 8, 7.03, 16.6, 39.076, 429, 897,
3, 21, 4.22, 9, 5.718, 19.1, 32.588, 420, 889, 36, 20.7, 3.871,
10, 5.193, 16.3, 32.291, 406, 854, 16, 20.2, 4.174, 11, 6.078,
17.9, 38.518, 407, 889, 17, 21.6, 999, 12, 4.21, 19.1, 29.783,
468, 979, 62, 21.4, 2.708, 13, 6.136, 17.3, 39.431, 488, 1048,
69, 21.2, 2.565, 14, 5.826, 17.5, 36.785, 415, 882, 19, 21.2,
4.06, 15, 5.483, 15.8, 31.511, 516, 1099, 64, 22.1, 1.609, 16,
5.817, 15.1, 34.652, 503, 1060, 74, 21.7, 2.197, 17, 999, 17,
32.257, 477, 999, 65, 20.1, 2.398, 18, 4.761, 999, 26.461, 486,
1021, 80, 19.4, 2.197, 19, 6.428, 13.8, 31.972, 427, 896, 2,
21.5, 4.22, 20, 7.245, 17, 40.661, 430, 909, 11, 20.7, 4.159,
21, 7.287, 14.8, 40.795, 430, 907, 6, 21.6, 4.382, 22, 6.994,
20.1, 41.895, 484, 1033, 68, 21.3, 999, 23, 6, 17.5, 35.948,
506, 1085, 60, 22.1, 2.197, 24, 4.08, 17.5, 26.818, 496, 1036,
79, 18.7, 1.386, 25, 5.383, 15.5, 31.189, 495, 1045, 64, 21.5,
2.197, 26, 5.692, 16.3, 28.785, 473, 1009, 55, 21.9, 3.045, 27,
5.935, 14.5, 30.922, 494, 1050, 73, 21.7, 2.197, 28, 999, 18.7,
34.836, 434, 917, 39, 21.3, 3.401, 29, 5.859, 15.6, 999, 444,
935, 4, 22.3, 999, 30, 9.774, 13.8, 46.087, 420, 898, 3, 20.8,
4.248, 31, 4.586, 17.2, 28.493, 485, 1015, 59, 20.3, 2.398, 32,
9.623, 15.2, 47.612, 419, 892, 16, 21.9, 4.304, 33, 5.077, 16.2,
30.793, 411, 865, 11, 19.3, 4.094, 34, 4.775, 15.3, 26.327, 515,
1107, 78, 21.4, 1.609, 35, 6.162, 16.6, 36.802, 460, 975, 60,
21.3, 3.135, 36, 4.845, 15.5, 28.172, 491, 1027, 66, 20.6, 2.197,
37, 6.436, 19.9, 38.555, 448, 947, 12, 22.3, 3.932, 38, 7.109,
17.1, 999, 419, 880, 8, 21, 4.248, 39, 999, 14.7, 40.729, 425,
888, 2, 21.4, 4.248, 40, 4.797, 16.4, 30.279, 401, 844, 13, 18.9,
4.06, 41, 4.775, 14.4, 25.994, 505, 1068, 68, 21.3, 1.609, 42,
4.388, 18.6, 32.477, 497, 1040, 83, 19.7, 2.485, 43, 5.222, 15.7,
31.223, 419, 893, 30, 20.2, 3.85, 44, 3.656, 24.3, 29.082, 513,
1076, 69, 21.5, 1.386, 45, 6.75, 13.8, 35.406, 429, 901, 7, 21.9,
4.22, 46, 5.327, 14.6, 33.987, 428, 896, 6, 20.7, 999, 47, 5.906,
20.2, 36.151, 443, 937, 16, 22.4, 3.871, 48, 6.107, 14.8, 31.944,
448, 932, 57, 20, 2.833, 49, 6.93, 15.9, 37.746, 501, 1073, 64,
22.3, 2.197, 50, 6.16, 14.9, 31.285, 476, 1001, 70, 21.4, 2.303)
Can you please be a bit more specific with your question, in particular, where in the code does the error occur?
With the survrateMissingNA.dat, are there any headers on it, or does it appear exactly as you posted it?
If there are no headers, this line of code will give you an error straight away: print(summary(glm(formula = outcome01~survrate + gsi + avoid + intrus, binomial, data = x)))
That line is reading data from the data.frame x, which contains the survrateMissingNA.dat data. If that data file has no headers, then outcome01 does not exist. (not does survrate, gsi avoid orintrus).
To fix this (initial error), make sure that the file contains headings.
The Problem
I'm trying to find a solution to overcome a deficient experimental design in establishing sampling points. The aim is to subset the original dataset, forcing sampling points stratification based on 2 factors with several levels.
I need a general formulation of the problem that may allow me to redefine the set of criteria levels.
Note
I've found examples of subseting tables based on criteria, the most relevant is a post from Brian Diggs but I cannot find a general way to apply that solution to my particular case.
The Dataset
My data.frame have 3 columns, sample id and two factors (f1 and f2).
Criteria are based on interval of values for f1 and f2.
dat <- structure(list(id = 1:203, f1 = c(22, 20.8, 20.7, 22, 12.1, 8,
20.6, 22, 22, 21.6, 0, 22, 21.4, 15.9, 21.2, 19.1, 12.5, 16.6,
14, 21.2, 14.7, 20.7, 20.5, 5.4, 19.1, 18.9, 22, 22, 22, 0, 0,
22, 1.3, 1, 0, 9.4, 7.9, 14.5, 0, 1.5, 0, 20.3, 18, 17.3, 1,
22, 0, 15, 17.9, 4.3, 19.5, 21.2, 21.2, 14.6, 2.3, 0, 6.7, 17.9,
9.5, 19, 21.6, 16.6, 11.7, 13.7, 1.5, 1, 7.6, 3.7, 18.5, 13.5,
20.9, 18.2, 11.5, 7.3, 6.5, 21.1, 22, 20.5, 20.5, 20, 16.2, 18.6,
22, 15.1, 14.4, 10.8, 17.1, 5.7, 15.1, 12.8, 14.5, 8.8, 16.8,
18.7, 1, 6.3, 1.8, 14.6, 22, 16.2, 12.9, 9.1, 2, 7.6, 7, 11.7,
1, 1, 9.6, 11, 2, 2, 14, 14.9, 7.8, 11.4, 8.3, 7.6, 9.1, 4.5,
18, 11.4, 3.1, 4.3, 9.3, 8.1, 1.4, 5.2, 14.7, 3.6, 5, 2.7, 10.3,
11.3, 17.9, 5.2, 1, 1.5, 13.2, 0, 1, 7.4, 1.7, 11.5, 20.2, 0,
14.7, 17, 15.2, 22, 22, 22, 17.2, 15.3, 10.9, 18.7, 11.2, 18.5,
20.3, 21, 20.8, 15, 21, 16.9, 18.5, 18.5, 10.3, 12.6, 15, 19.8,
21, 17.2, 16.3, 18.3, 10.3, 17.8, 11.2, 1.5, 1, 0, 1, 14, 19.1,
6.1, 19.2, 17.1, 14.5, 18.4, 22, 20.3, 6, 13, 18.3, 8.5, 15.3,
10.6, 7.2, 6.2, 1, 7.9, 2, 20, 16.3), f2 = c(100, 100, 92.9,
38.5, 100, 90.9, 100, 100, 100, 91.7, 0, 100, 71.4, 100, 100,
53.8, 28.6, 91.7, 100, 100, 64.3, 100, 92.9, 78.6, 100, 100,
27.3, 83.3, 14.3, 0, 0, 9.1, 23.1, 12.5, 0, 100, 81.8, 100, 0,
15.4, 0, 83.3, 100, 75, 7.1, 81.8, 0, 21.4, 84.6, 25, 80, 90.9,
100, 71.4, 50, 0, 46.2, 90.9, 14.3, 66.7, 90.9, 84.6, 46.2, 91.7,
33.3, 7.7, 71.4, 27.3, 46.2, 100, 100, 100, 60, 54.5, 46.2, 53.8,
91.7, 100, 100, 66.7, 45.5, 57.1, 15.4, 75, 75, 76.9, 53.8, 25,
90.9, 84.6, 91.7, 90.9, 100, 54.5, 23.1, 63.6, 30.8, 90.9, 92.9,
100, 92.3, 90.9, 12.5, 38.5, 15.4, 84.6, 27.3, 7.1, 75, 21.4,
7.7, 15.4, 84.6, 100, 69.2, 63.6, 64.3, 53.8, 92.3, 33.3, 11.1,
61.5, 66.7, 23.1, 85.7, 81.8, 41.7, 69.2, 76.9, 38.5, 9.1, 23.1,
85.7, 90, 100, 100, 14.3, 36.4, 84.6, 0, 7.7, 61.5, 25, 50, 100,
0, 63.6, 36.4, 76.9, 100, 100, 100, 100, 90.9, 100, 100, 100,
100, 100, 83.3, 100, 100, 100, 100, 50, 54.5, 71.4, 100, 85.7,
100, 75, 100, 76.9, 83.3, 100, 92.3, 33.3, 76.9, 33.3, 0, 40,
91.7, 100, 53.8, 100, 100, 100, 100, 100, 92.3, 76.9, 23.1, 84.6,
33.3, 100, 92.3, 46.2, 100, 9.1, 53.8, 7.7, 20, 42.9)), .Names = c("id",
"f1", "f2"), class = "data.frame", row.names = c(NA, -203L))
The expected output
Sampling points should ideally be grouped following a crossed design (it is not a complete factorial design).
For Factor f1: 0, 1-15, 30-60, 80-95, 100
For Factor f2: 0, 5-10, 15-20
I need to find points given all combinations of f1 and f2 intervals, something like this fashion:
gr <- expand.grid(f1=c('0', '1-15', '30-60', '80-95', '100'),
f2=c('0', '5-10', '15-20'))
> gr
f1 f2
1 0 0
2 1-15 0
3 30-60 0
4 80-95 0
5 100 0
6 0 5-10
7 1-15 5-10
8 30-60 5-10
9 80-95 5-10
10 100 5-10
11 0 15-20
12 1-15 15-20
13 30-60 15-20
14 80-95 15-20
15 100 15-20
The solution should split dat based on lines of gr.
This is not a complete factorial design since not all combinations will fulfill this particular criteria combination but it is important to identify NA's as well.
Any help will be appreciated. Please let me know if I'm providing sufficient information.
Use cut, to split f1 and f2 into factors based on your breakpoints, paste the factor together, and then split based on the combined factor.
dat$f1.group<-cut(dat$f1,c(0,1,15,30,60,80,90,95,100))
dat$f2.group<-cut(dat$f1,c(0,5,10,15,20))
gr<-expand.grid(levels(dat$f1.group),levels(dat$f2.group))
names(gr)<-c('f1.group','f2.group')
gr$combined = paste(gr$f1.group,gr$f2.group)
dat<-merge(gr,dat)[c('id','f1','f2','combined')]
split(dat,dat$combined)
That will get you a list of data.frame, with one element for each combo defined in gr. You can them easily sample by these strata.