Simulating data using existing data and probability - r

I have measured multiple attributes (height, species, crown width, condition etc) for about 1500 trees in a city. Using remote sensing techniques I also have the heights for the rest of the 9000 trees in the city. I want to simulate/generate/estimate the missing attributes for these unmeasured trees by using their heights.
From the measured data I can obtain proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships for the species, species-condition relationship and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes too using probability theory. So for a height of say 25m its more likely to be a Cedar (height range 5 - 30 m) rather than a Mulberry tree (height range 2 -8 m) and more likely to be a cedar (50% of population) than an oak (same height range but 2% of population) and hence will have a crown width of 10m and have a health condition of 95% (based on the distributions for cedar trees in my measured data). But also I am expecting some of the other trees of 25m to be given oak, just less frequently than cedar based on the proportion in population.
Is there a way to do this using probability theory in R preferably utilising Bayesian or machine learning methods?
Im not asking for someone to write the code for me - I am fairly experienced with R. I just want to be pointed in the right direction i.e. a package that does this kind of thing neatly.
Thanks!

Because you want to predict a categorical variable, i.e. the species, you should consider using a tree regression, a method which can be found in the R packages rpart and RandomForest. These models excel when you have a discrete number of categories and you need to slot your observations into those categories. I think those packages would work in your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent) which can also predict categorical outcomes; unfortunately multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
If you want to then predict the individual values for individual trees in your species, first run a regression of all of your measured variables, including species type, on the measured trees. Then take the categorical labels that you predicted and predict out-of-sample for the unmeasured trees where you use the categorical labels as predictors for the unmeasured variable of interest, say tree height. That way the regression will predict the average height for that species/dummy variable, plus some error and incorporating any other information you have on that out-of-sample tree.
If you want to use a Bayesian method, you consider using a hierarchical regression to model these out-of-sample predictions. Sometimes hierarchical models do better at predicting as they tend to be fairly conservative. Consider looking at the package Rstanarm for some examples.

I suggest you looking over Bayesian Networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over casual relationships between variables. Bayesian Network structure can be specified by-hand or learned from data by a algorithm.
R has several implementations of Bayesian Networks with bnlearn being one of them: http://www.bnlearn.com/
Please see a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/

For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model the height and width as a joint Gaussian distribution; dunno about condition. Anyway with a joint distribution for variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the conditional distribution of the other variables conditional on height (and it will also be a mixture distribution). Given the conditional mixture, you can sample from it as usual: pick a bump with frequency equal to its mixing weight, and then sample from the selected bump.
Sounds like a good problem. Good luck and have fun.

Related

Extracting linear term from a polynomial predictor in a GLM

I am relatively new to both R and Stack overflow so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC when compared to other models, etc:
My goal is to characterize population trends of study organisms over the study period. I have created marginal effects plots by using the model to predict on a new dataset where all covariates are constant except year (shaded areas are 80% and 95% credible intervals for posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e. say a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, and the resulting polynomial coefficients are not easily interpretable. I have tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit and have directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE only takes a few minutes to run). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than setting raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are (1) what is going on here? and (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing response to year, or otherwise get at a value corresponding to population trend?
I feel like this should be relatively simple and I apologize if I'm overlooking something obvious. Please let me know if there is further code that I should share for more clarity–I didn't want to make the initial post crazy long, but happy to show specific predictions from different models or anything else. Thank you for any help.

Need help in understanding the below plot generated using randomForestExplainer library in R

I am not able to understand what this plot depicts and what we mean by the relation between rankings.Let me add a few more things to make it a bit precise. This plot is generated by using plot_importance_rankings function available as part of randomForestExplainer package.
A sample code which generates this code is as
plot_importance_rankings(var_importance_frame)
where var_importance_frame contains important variables which we get as from
var_importance_frame <- measure_importance(rf_model)
Here rf_model is the trained random forest model. A sample example can be found at this link RandomForestExplainer - sample example
The randomForestExplainer package implements several measures to assess a given variable's importance in the random forests models. On this plot, you have mean_min_depth (average minimum depth of a variable across all trees), accuracy_decrease (accuracy loss by randomly permuting on a given variable), gini_decrease (average gain of purity by splitting on a given variable), no_of_nodes (number of nodes that split on a given variable across trees), times_a_root (number of times a given variable is used as the root of trees). Ideally you'd want these importance measures to be somewhat consistent, in that a variable measured as of high importance by one metric is also measured high by the others. What this plot is showing is this as a sanity check. In your case the variable importances are largely consistent and positively correlated. Each dot on the scatter plot represents a variable.

Function to produce a single metric to compare the shape of two distributions (predictions vs actuals)

I am assessing the accuracy of a model that predicts count data.
My actual data has quite an unusual distribution - although I have a large amount of data, the shape is unlike any standard distributions (poisson, normal, negative binomial etc.).
As part of my assessment, I want a metric for how well the distribution of the predictions match the distribution of actual data. I've tried using standard model performance metrics, such as MAE or RMSE, but they don't seem to capture how well the predictions match the expected distribution.
My initial idea was to split the predictions into deciles, and calculate what proportion fall in each decile. This would be a very rough indication of the underlying distribution. I would then calculate the same for my 'actuals' and sum the absolute differences between the proportions.
This works to some extent, but feels a bit clunky, and the split into deciles feels arbitrary. Is there a function in R to produce a single metric for how well two distributions match?

random forest regression output calculation

Hi this is a purely theoretical question which i cant get my head around ( and could be completely wrong)
With random forest regressions - you grow n number of trees, each tree uses a subset of the data and in some cases a subset of the available variables to predict the dependent variable. the average of these n number of trees is taken to give us a predicted value. however, is there any need to look at the distribution of predictions at the individual tree level? are we able to obtain a number that provides some certainty of the overall predicted value? i would assume that a more consistent number being produced at the individual tree level would be preferred than a wide variety of numbers?
Thanks in advance
This method of determining variable importance has some drawbacks. For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels. Methods such as partial permutations and growing unbiased trees can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.

Variable sample size per cluster/group in mixed effects logistic regression

I am attempting to run mixed effects logistic regression models, yet am concerned about the variable samples sizes in each cluster/group, and also the very low number of "successes" in some models.
I have ~ 700 trees distributed across 163 field plots (i.e., the cluster/group), visited annually from 2004-11. I am fitting separate mixed effects logistic regression models (hereafter GLMMs) for each year of the study to compare this output to inference from a shared frailty model (i.e., survival analysis with random effect).
The number of trees per plot varies from 1-22. Also, some years have a very low number of "successes" (i.e., diseased trees). For example, in 2011 there were only 4 successes out of 694 "failures" (i.e., healthy trees).
My questions are: (1) is there a general rule for the ideal number of samples|group when the inference focus is only on estimating the fixed effects in the GLMM, and (2) are GLMMs stable when there is such an extreme difference in the ratio of successes:failures.
Thank you for any advice or suggestions of sources.
-Sarah
(Hi, Sarah, sorry I didn't answer previously via e-mail ...)
It's hard to answer these questions in general -- you're stuck
with your data, right? So it's not a question of power analysis.
If you want to make sure that your results will be reasonably
reliable, probably the best thing to do is to run some simulations.
I'm going to show off a fairly recent feature of lme4 (in the
development version 1.1-1, on Github), which is to simulate
data from a GLMM given a formula and a set of parameters.
First I have to simulate the predictor variables (you wouldn't
have to do this, since you already have the data -- although
you might want to try varying the range of number of plots,
trees per plot, etc.).
set.seed(101)
## simulate number of trees per plot
## want mean of 700/163=4.3 trees, range=1-22
## by trial and error this is about right
r1 <- rnbinom(163,mu=3.3,size=2)+1
## generate plots and trees within plots
d <- data.frame(plot=factor(rep(1:163,r1)),
tree=factor(unlist(lapply(r1,seq))))
## expand by year
library(plyr)
d2 <- ddply(d,c("plot","tree"),
transform,year=factor(2004:2011))
Now set up the parameters: I'm going to assume year is a fixed
effect and that overall disease incidence is plogis(-2)=0.12 except
in 2011 when it is plogis(-2-3)=0.0067. The among-plot standard deviation
is 1 (on the logit scale), as is the among-tree-within-plot standard
deviation:
beta <- c(-2,0,0,0,0,0,0,-3)
theta <- c(1,1) ## sd by plot and plot:tree
Now simulate: year as fixed effect, plot and tree-within-plot as
random effects
library(lme4)
s1 <- simulate(~year+(1|plot/tree),family=binomial,
newdata=d2,newparams=list(beta=beta,theta=theta))
d2$diseased <- s1[[1]]
Summarize/check:
d2sum <- ddply(d2,c("year","plot"),
summarise,
n=length(tree),
nDis=sum(diseased),
propDis=nDis/n)
library(ggplot2)
library(Hmisc) ## for mean_cl_boot
theme_set(theme_bw())
ggplot(d2sum,aes(x=year,y=propDis))+geom_point(aes(size=n),alpha=0.3)+
stat_summary(fun.data=mean_cl_boot,colour="red")
Now fit the model:
g1 <- glmer(diseased~year+(1|plot/tree),family=binomial,
data=d2)
fixef(g1)
You can try this many times and see how often the results are reliable ...
As Josh said, this is a better questions for CrossValidated.
There are no hard and fast rules for logistic regression, but one rule of thumb is 10 successes and 10 failures are needed per cell in the design (cluster in this case) times the number continuous variables in the model.
In your case, I would think the model, if it converges, would be unstable. You can examine that by bootstrapping the errors of the estimates of the fixed effects.

Resources