I'm trying to fit a model with just one inflation and whereas the pred.zoib function worked when I included both one and zero inflation, it isn't running now that I've excluded the zero inflation - I get this error:
Error in x1[i, ] %*% b1 : non-conformable arguments
The chaffinchdata is just a data frame with various bits of information, but importantly a detection probability column that has been adjusted to remove the zeros and replace them with 0.00001s, and a distance column that holds numeric values between 0 and 131.
When using pred.zoib before with both zero and one inflation, it worked fine:
fit.zoib.chaffinch4 <- zoib(Detection_probability ~ Distance | Distance | Distance | Distance , data = chaffinchdata)
chaffinchxnew <- data.frame(Distance = seq(0,150,0.1))
pred.chaff4 <- pred.zoib(fit.zoib.chaffinch4, chaffinchxnew)
dfchaff4 <- data.frame(Distance = seq(0,150,0.1),pred.chaff4$summary)
So this all worked perfectly up until now. Then the below runs up until the pred.zoib stage.
# now set the zero inflation to FALSE, to investigate just one inflation
nozeroinf <- birddata$Detection_probability
nozeroinf <- ifelse(nozeroinf == 0, 0.00001, nozeroinf)
birddata$nozeroinfs <- nozeroinf
chaffinchdata <- filter(birddata, Species == 'Chaffinch')
fit.zoib.chaff.oneinf <- zoib(nozeroinfs ~ Distance | Distance | Distance,
data = chaffinchdata, zero.inflation = F,
one.inflation = T)
chaffxnew.oneinf <- data.frame(Distance = seq(0, 100, 0.1))
pred.chaff.oneinf <- pred.zoib(fit.zoib.chaff.oneinf, chaffxnew.oneinf)
I've tried using the distance data straight from the chaffinchdata dataset instead of creating a sequence of my own, ie.
pred.zoib(fit.zoib.chaff.oneinf, data.frame(chaffinchdata$Distance))
but that didn't work, nor did
pred.zoib(fit.zoib.chaff.oneinf, chaffinchdata)
Any help on this would be greatly appreciated!
Related
I would like to derive individual growth rates from our growth model directly, similar to this OP and this OP.
I am working with a dataset that contains the age and weight (wt) measurements for ~2000 individuals in a population. Each individual is represented by a unique id number.
A sample of the data can be found here. Here is what the data looks like:
id age wt
1615 6 15
3468 32 61
1615 27 50
1615 60 145
6071 109 209
6071 125 207
10645 56 170
10645 118 200
I have developed a non-linear growth curve to model growth for this dataset (at the population level). It looks like this:
wt~ A*atan(k*age - t0) + m
which predicts weight (wt) for a given age and has modifiable parameters A, t0, and m. I have fit this model to the dataset at the population level using a nlme regression fit where I specified individual id as a random effect and used pdDiag to specify each parameter as uncorrelated. (Note: the random effect would need to be dropped when looking at the individual level.)
The code for this looks like:
nlme.k = nlme(wt~ A*atan(k*age - t0) + m,
data = df,
fixed = A+k+t0+m~1,
random = list(id = pdDiag(A+t0+k+m~1)), #cannot include when looking at the individual level
start = c(A = 99.31,k = 0.02667, t0 = 1.249, m = 103.8), #these values are what we are using at the population level # might need to be changed for individual models
na.action = na.omit,
control = nlmeControl(maxIter = 200, pnlsMaxIter = 10, msMaxIter = 100))
I have our population level growth model (nlme.k), but I would like to use it to derive/extract individual values for each growth constant.
How can I extract individual growth constants for each id using my population level growth model (nlme.k)? Note that I don't need it to be a solution that uses nlme, that is just the model I used for the population growth model.
Any suggestions would be appreciated!
I think this is not possible due to the nature on how random effects are designed. According to this post the effect size (your growth constant) is estimated using partial pooling. This involves using data points from other groups. Thus you can not estimate the effect size of each group (your individual id).
Strictly speaking (see here) random effects are not really a part of the model at all, but more a part of the error.
However, you can estimate the R2 for all groups together. If you want it on an individual level (e.g. parameter estiamtes for id 1), then just run the same model only on all data points of this particular individual. This give you n models with n parameter sets for n individuals.
We ended up using a few loops to do this.
Note that our answer builds off a model posted in this OP if anyone wants the background script. We will also link to the published script when it is posted.
For now - this is should give a general idea of how we did this.
#Individual fits dataframe generation
yid_list <- unique(young_inds$squirrel_id)
indf_prs <- list('df', 'squirrel_id', 'A_value', 'k_value', 'mx_value', 'my_value', 'max_grate', 'hit_asymptote', 'age_asymptote', 'ind_asymptote', 'ind_mass_asy', 'converge') #List of parameters
ind_fits <- data.frame(matrix(ncol = length(indf_prs), nrow = length(yid_list))) #Blank dataframe for all individual fits
colnames(ind_fits) <- indf_prs
#Calculates individual fits for all individuals and appends into ind_fits
for (i in 1:length(yid_list)) {
yind_df <-young_inds%>%filter(squirrel_id %in% yid_list[i]) #Extracts a dataframe for each squirrel
ind_fits[i , 'squirrel_id'] <- as.numeric(yid_list[i]) #Appends squirrel i's id into individual fits dataframe
sex_lab <- unique(yind_df$sex) #Identifies and extracts squirrel "i"s sex
mast_lab <- unique(yind_df$b_mast) #Identifies and extracts squirrel "i"s mast value
Hi_dp <- max(yind_df$wt) #Extracts the largest mass for each squirrel
ind_long <- unique(yind_df$longevity) #Extracts the individual death date
#Sets corresponding values for squirrel "i"
if (mast_lab==0 && sex_lab=="F") { #Female no mast
ind_fits[i , 'df'] <- "fnm" #Squirrel dataframe (appends into ind_fits dataframe)
df_asm <- af_asm #average asymptote value corresponding to sex
df_B_guess <- guess_df[1, "B_value"] #Inital guesses for nls fits corresponding to sex and mast sex and mast
df_k_guess <- guess_df[1, "k_value"]
df_mx_guess <- guess_df[1, "mx_value"]
df_my_guess <- guess_df[1, "my_value"]
ind_asyr <- indf_asy #growth rate at individual asymptote
} else if (mast_lab==0 && sex_lab=="M") { #Male no mast
ind_fits[i , 'df'] <- "mnm"
df_asm <- am_asm
df_B_guess <- guess_df[2, "B_value"]
df_k_guess <- guess_df[2, "k_value"]
df_mx_guess <- guess_df[2, "mx_value"]
df_my_guess <- guess_df[2, "my_value"]
ind_asyr <- indm_asy
} else if (mast_lab==1 && sex_lab=="F") { #Female mast
ind_fits[i , 'df'] <- "fma"
df_asm <- af_asm
df_B_guess <- guess_df[3, "B_value"]
df_k_guess <- guess_df[3, "k_value"]
df_mx_guess <- guess_df[3, "mx_value"]
df_my_guess <- guess_df[3, "my_value"]
ind_asyr <- indm_asy
} else if (mast_lab==1 && sex_lab=="M") { #Males mast
ind_fits[i , 'df'] <- "mma"
df_asm <- am_asm
df_B_guess <- guess_df[4, "B_value"]
df_k_guess <- guess_df[4, "k_value"]
df_mx_guess <- guess_df[4, "mx_value"]
df_my_guess <- guess_df[4, "my_value"]
ind_asyr <- indf_asy
} else { #If sex or mast is not identified or identified improperlly in the data
print("NA")
} #End of if else loop
#Arctangent
#Fits nls model to the created dataframe
nls.floop <- tryCatch({data.frame(tidy(nls(wt~ B*atan(k*(age - mx)) + my, #tryCatch lets nls have alternate results instead of "code stopping" errors
data=yind_df,
start = list(B = df_B_guess, k = df_k_guess, mx = df_mx_guess, my = df_my_guess),
control= list(maxiter = 200000, minFactor = 1/100000000))))
},
error = function(e){
nls.floop <- data.frame(c(0,0), c(0,0)) #Specifies nls.floop as a dummy dataframe if no convergence
},
warning = function(w) {
nls.floop <- data.frame(tidy(nls.floop)) #Fit is the same if warning is displayed
}) #End of nls.floop
#Creates a dummy numerical index from nls.floop for if else loop below
numeric_floop <- as.numeric(nls.floop[1, 2])
#print(numeric_floop) #Taking a look at the values. If numaric floop...
# == 0, function did not converge on iteration "i"
# != 0, function did converge on rapid "i" and code will run through calculations
if (numeric_floop != 0) {
results_DF <- nls.floop
ind_fits[i , 'converge'] <- 1 #converge = 1 for converging fit
#Extracting, calculating, and appending values into dataframe
B_value <- as.numeric(results_DF[1, "estimate"]) #B value
k_value <- as.numeric(results_DF[2, "estimate"]) #k value
mx_value <- as.numeric(results_DF[3, "estimate"]) #mx value
my_value <- as.numeric(results_DF[4, "estimate"]) #my value
A_value <- ((B_value*pi)/2)+ my_value #A value calculation
ind_fits[i , 'A_value'] <- A_value
ind_fits[i , 'k_value'] <- k_value
ind_fits[i , 'mx_value'] <- mx_value
ind_fits[i , 'my_value'] <- my_value #appends my_value into df
ind_fits[i , 'max_grate'] <- adr(mx_value, B_value, k_value, mx_value, my_value) #Calculates max growth rate
}
} #End of individual fits loop
Which gives this output:
> head(ind_fits%>%select(df, squirrel_id, A_value, k_value, mx_value, my_value))
df squirrel_id A_value k_value mx_value my_value
1 mnm 332 257.2572 0.05209824 52.26842 126.13183
2 mnm 1252 261.0728 0.02810033 42.37454 103.02102
3 mnm 3466 260.4936 0.03946594 62.27705 131.56665
4 fnm 855 437.9569 0.01347379 86.18629 158.27641
5 fnm 2409 228.7047 0.04919819 63.99252 123.63404
6 fnm 1417 196.0578 0.05035963 57.67139 99.65781
Note that you need to create a blank dataframe first before running the loops.
I am working on a portfolio optimazion algorithm and part of the problem consists in generating moment matching scenario.
My choice due to its simplicity and quickness was to go through paper "An algorithm for moment-matching scenario generation with application to financial portfolio optimization" (Ponomareva, Roman and Date).
The problem is that even though the mathematics are very simple, I am stuck by the fact that some of probability weights pi are negative even though the formulas in the paper should ensure otherwise. If I put a loop to run the algorithm until it finds a positive combination it essentially runs forever.
I put the bit of code based on the paper were things get stuck:
dummy1 = 0
while (dummy1 <=0 | dummy1 >= 1) {
dummy1 = round(rnorm(1, mean = 0.5, sd = 0.25), 2)
}
diag.cov.returns = diag(cov.returns)
Z = dummy1 * sqrt (diag.cov.returns) #Vector Z according to paper formula
ZZT = Z %*% t(Z)
LLT = cov.returns - ZZT
L = chol(LLT) #cholesky decomposition to get matrix L
s = sample (1:5, 1)
F1 = 0
F2 = -1
S = (2*N*s)+3
while (((4*F2)-(3*F1*F1)) < 0) {
#Gamma = (2*s*s)*(((N*mean.fourth) - (0.75*(sum(Z^4)* (N*mean.third/sum(Z^3))^2)))/sum(L^4))
#Gamma is necessary if we want to get p from Uniform Distribution
#U = runif(s, 0, 1)
U = rgamma(s, shape = 1, scale = ((1/exp(1)):1))
#p = (s*(N/Gamma)) + ((1/(2*N*s)) - (s/(N*Gamma)))*U
p = (-log(U, base = exp(1)))
p = p/(((2*sum(p))+max(p))*N*s) #this is the array expected to have positive and bounded between 0 and 1
q1 = 1/p
pz = p
p[s+1] = (1-(2*N*sum(p))) #extra point necessary to get the 3 moment mathcing probabilities
F1 = (N*mean.third*sqrt(p[s+1]))/(sum(Z^3))
F2 = p[s+1]*(((N*mean.fourth) - (1/(2*s*s))*sum(L^4)*(sum(1/p)))/sum(Z^4))
}
alpha = (0.5*F1) + 0.5*sqrt((4*F2)-(3*F1*F1))
beta = -(0.5*F1) + 0.5*sqrt((4*F2)-(3*F1*F1))
w1 = 1/(alpha*(alpha+beta))
w2 = 1/(beta*(alpha+beta))
w0 = 1 - (1/(alpha*beta))
P = rep(pz, 2*N) #Vector with Probabilities starting from p + 3 extra probabilities to match third and fourth moments
P[(2*N*s)+1] = p[s+1]*w0
P[(2*N*s)+2] = p[s+1]*w1
P[(2*N*s)+3] = p[s+1]*w2
Unfortunately I cannot discolose the input dataset containing funds returns. However I can surely be more specific. Starting from a data.frame() containing N assets' returns (in my case there 11 funds and monthly returns from 30/01/2001 to 30/09/2020). Once the mean returns, covariance matrix, central third and fourth moments (NOT skewness and kurtosis) and the averages are computed. The algorithm follows as I have reported in the problem. The point where i get stuck is that p takes also negative values. This is a problem since the first s elements of p are later used as probabilities in P.
I hope that in this way the problem is more clear. I also want to add that in the paper the data used by the authors is reported, unfortunately to import them in R would be necessary to import them manually. However I repeat any data.frame() containing assets' returns will do.
I am doing the simulation of linear regression with R.
A regression model I consider is y_i = a + b_1 * x_1i + b_2 * x_2i + e_i.
The parameter design as follows:
x_1i ~ N(2,1), x_2i ~ Poisson(4), e_i ~ N(0, 1), theta = (a, b_1, b_2)
The following code I am doing is that I would like to generate 100 independent random samples of (y, x_1, x_2) 1000 times using the distribution which I have mentioned above, and I also want to estimate theta_hat (the estimator of theta). After getting the theta_hat, I would like to plot the distribution of estimator of a (a_hat), b_1 (b_1_hat), b_2 (b_2_hat), respectively.
## Construct 1000 x_1
x_1_1000 <- as.data.frame(replicate(n = 1000,expr = rnorm(n = 100,
mean = 2, sd = 1)))
colnames(x_1_1000) <- paste("x_1", 1:1000, sep = "_")
x_2_1000 <- as.data.frame(replicate(n = 1000,expr = rpois(n = 100,
lambda = 4)))
colnames(x_2_1000) <- paste("x_2", 1:1000, sep = "_")
error_1000 <- as.data.frame(replicate(n = 1000, expr = rnorm(n = 100,
mean = 0, sd = 1)))
colnames(error_1000) <- paste("e", 1:1000, sep = "_")
y_1000 <- as.data.frame(matrix(data = 0, nrow = 100, ncol = 1000))
y_1000 = 1 + x_1_1000 * 1 + x_2_1000*(-2) + error_1000
colnames(y_1000) <- paste("y", 1:1000, sep = "_")
######################################################################
lms <- lapply(1:1000, function(x) lm(y_1000[,x] ~ x_1_1000[,x] + x_2_1000[,x]))
theta_hat_1000 <- as.data.frame(sapply(lms, coef))
After doing the linear regression, I just store the result into lms which is a list. Because I just want the data of coefficient, I store those simulation coefficients into "theta_hat_1000" However, when I wanna plot the distribution graph, I cannot get what I want in the final. I have tried two ways to solve the problem but still being confused.
The first way I tried is that I just rename the data frame "theta_hat_1000". I have successfully renamed the column_i, where i from 1 to 1000. However, I just cannot successfully rename the rows.
rownames(theta_hat_1000[1,]) <- "ahat"
rownames(theta_hat_1000[2,]) <- "x1hat"
rownames(theta_hat_1000[3,]) <- "x2hat"
The code listed above did not show any error message but finally failed to change the row names. Thus, I have tried the following code
rownames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
This has successfully renamed. However, when I want to check there is anything store in the data frame, it reports "NULL"
theta_hat_1000$ahat
NULL
Therefore, I notice that there is something weird. Thus I have tried the second way like the following.
I have tried to unlist "theta_hat_1000" which is a list stored in my global environment. However, after doing such things, I am not getting what I want. The expected result is just getting three rows and each row with 1000 values, but the actual is that I got 3000 obs with 1 column.
The ideal result is getting three columns and each column with 1000 values and putting them into a data frame to do a further process like using ggplot to demonstrate the distribution of estimated coefficients.
I have stuck on this for several hours. It would be appreciated if anyone can help me and give me some suggestions.
This line theta_hat_1000$ahat in your code does not work, because "ahat" is a rowname not a column name in the data frame. You would get the result by calling theta_hat_1000["ahat",].
However, I understand that your desired result is actually a dataframe with 3 columns (and 1000 rows) representing the 3 parameters of your regression model (intercept, x1, x2). This line in your code as.data.frame(sapply(lms, coef)) produces a dataframe with 3 rows and 1000 columns. You can, for instance, transpose the matrix before changing it into a data frame to get 1000 rows and 3 columns.
theta_hat_1000 <- sapply(lms, coef)
theta_hat_1000 <- as.data.frame(t(theta_hat_1000))
colnames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
head(theta_hat_1000)
ahat x1hat x2hat
1 2.0259326 0.7417404 -2.111874
2 0.7827929 0.9437324 -1.944320
3 1.1034906 1.0091594 -2.035405
4 0.9677150 0.8168757 -1.905367
5 1.0518646 0.9616123 -1.985357
6 0.8600449 1.0781489 -2.017061
Now you could also call the variables with theta_hat_1000$ahat.
I would like some help with my nested loop which is not returning the values I expect. I am new to nested loops so please bear with me. I want to calculate a new independent variable for a logistic regression model which is based upon different calculations of the original variables. Specifically, I have six variables "x1...x6", and I then create three new variables (newvar1, newvar2, newvar3) by extracting a percentile from pairs of the original variables. From these three new variables I then combine them via subtraction to form a final new variable which forms the independent variable for a logistic regression model. The value of that final variable is then evaluated by the AIC of the logistic regression model.
I need to determine the optimal combination of percentile values which form newvar2, newvar2, and newvar3 which gives me the best logistic regression model. To do this I have attempted to create a three level nested like this:
df <- data.frame(x1 <- rnorm(100),
x2 <- rnorm(100),
x3 <- rnorm(100),
x4 <- rnorm(100),
x5 <- rnorm(100),
x6 <- rnorm(100),
y <- as.factor(runif(100)<=.70))
n = 1
AIC = NULL
for (i in 0.1:n){
for (j in 0.1:n){
for (k in 0.1:n){
df$newvar1 <-apply(df[,1:2], 1, quantile, probs = i, na.rm = T)
df$newvar2 <-apply(df[,3:4], 1, quantile, probs = j, na.rm = T)
df$newvar3 <-apply(df[,5:6], 1, quantile, probs = k, na.rm = T)
df$finalvar <- df$newvar1 - df$newvar2 - df$newvar3
model <- glm(y ~ finalvar, data = df, family = "binomial")
AIC[i] <- as.numeric(model$aic)
}
}
}
I would like to provide a sequence of 11 values (0, 0.1, 0.2....0.9,1) to the "probs" argument of the quantile function, and I would like to get the AIC for each of the possible quantile parameter estimations (11*11*11). Thus the AIC variable in the end should be a numeric vector of 121 values. However, when I run the above code I get an empty numeric value for AIC. How can I get this code the run properly and supply me the values for all possible 121 models?
Thanks!
EDIT: this isn't the solution but provides part of the answer I think. in my previous code "n" was less than one so it was only performing a single iteration, (obviously) "n" needs to greater than one. The reason it was less than 1 before is that the "probs" argument to quantile requires a value betwee 0 and 1. The over come this, the parameter passed to the argument probs is now divided by 10. Now with AIC[1] i can get a vector of 10, but I still don"t understand how to get the full 10*10*10 (or 11*11*11) representing all combinations.
New code:
n = 10
AIC = NULL
for (i in 1:n){
for (j in 1:n){
for (k in 1:n){
df$newvar1 <-apply(df[,1:2], 1, quantile, probs = i/10, na.rm = T)
df$newvar2 <-apply(df[,3:4], 1, quantile, probs = j/10, na.rm = T)
df$newvar3 <-apply(df[,5:6], 1, quantile, probs = k/10, na.rm = T)
df$finalvar <- df$newvar1 - df$newvar2 - df$newvar3
model <- glm(y ~ finalvar, data = df, family = "binomial")
AIC[i] <- as.numeric(model$aic)
}
}
}
First of all, AICis an R function so I've changed the name to aic.
Second, in your code's innermost loop you index by i only, when you have 3 indices. So maybe this is what you really need.
n = 10
aic = array(0, dim = c(n, n, n)) # changed
for(...)
for(...)
for(...){
[...]
aic[i, j, k] <- as.numeric(model$aic) # changed
}
I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when using the each set up with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
Euclidean: Error in hclust(distance, method = method.hclust) :
must have n >= 2 objects to cluster
Correlation: Error in cor(x, method = "pearson", use = use.cor) :
supply both 'x' and 'y' or a matrix-like 'x'.
Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust now runs fine. Why is it that I'm getting these errors for pvclust at 3 dimensions and below but not for hclust? Furthermore, why do the errors disappear when I use a data set above 4 dimensions?
At the end of function pvclust we see a line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
nboot = nboot, method.dist = method.dist, use.cor = use.cor,
method.hclust = method.hclust, store = store, weight = weight)
then digging deeper we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
nboot, store, weight = F)
{
n <- nrow(data)
size <- round(n * r, digits = 0)
....
smpl <- sample(1:n, size, replace = TRUE)
suppressWarnings(distance <- dist.pvclust(data[smpl,
], method = method.dist, use.cor = use.cor))
....
}
also note, that the default value of parameter r for function pvclust is r=seq(.5,1.4,by=.1). Well, actually as we can see this value is being changed somewhere:
Bootstrap (r = 0.33)...
so what we get is size <- round(3 * 0.33, digits =0) which is 1, finally data[smpl,] has only 1 row, which is less than 2. After correction of r it returns some error which possibly is harmless and output is given too:
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
Let me know if the results is satisfactory.