I have a problem deriving a meaningful generalized linear regression model. The response variable depends on several predictors, some of which are factors and others numeric (elevation, slope, depth to permafrost (numeric), exposure (rank, numeric), permeability (factor)). The data fall into seven groups corresponding to geographic location. The predictor values within each group are identical, so only the response variable (erosion rate) varies. There are 1700 data points.
I would like to set up a generalized linear regression model to test the influence of the predictors on the erosion rate. Because the data are spatially autocorrelated, the analysis should draw 30 samples from each group, fit the GLM, and repeat this, say, 10,000 times. I would then like to compare the spread of the coefficients. Any help is highly appreciated!
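A minimal sketch of that resampling scheme, assuming a data frame dat with hypothetical column names (erosion, elevation, slope, permafrost, exposure, permeability, group) and a Gaussian family as a placeholder:

set.seed(1)
n_iter <- 10000

coefs <- replicate(n_iter, {
  # draw 30 rows at random from each of the seven groups
  idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$group),
                       sample, size = 30))
  fit <- glm(erosion ~ elevation + slope + permafrost +
               exposure + permeability,
             data = dat[idx, ], family = gaussian())
  coef(fit)  # caution: a factor level absent from a subsample would
             # change the length of this vector
})

# one row per coefficient, one column per iteration: summarize the spread
apply(coefs, 1, quantile, probs = c(0.025, 0.5, 0.975))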
I am relatively new to both R and Stack Overflow, so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC compared to other models, etc.:
My goal is to characterize population trends of study organisms over the study period. I have created marginal effects plots by using the model to predict on a new dataset where all covariates are constant except year (shaded areas are 80% and 95% credible intervals for posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e. say a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, and the resulting polynomial coefficients are not easily interpretable. I have tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit and have directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE only takes a few minutes to run). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than setting raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are (1) what is going on here? and (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing response to year, or otherwise get at a value corresponding to population trend?
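For intuition on question (1), a small sketch with plain lm() (not brms) shows that the two bases describe the same curve; the raw basis is just numerically terrible when the covariate is a calendar year (columns of order 1e3 and 1e6, almost perfectly correlated), which plausibly explains both the crawling sampler and the different predictions from a non-converged raw-basis model:

set.seed(1)
year  <- 1990:2020
count <- rpois(length(year), lambda = exp(0.02 * (year - 2005)))

fit_orth <- lm(count ~ poly(year, 2))              # orthogonal basis (default)
fit_raw  <- lm(count ~ poly(year, 2, raw = TRUE))  # raw powers of year

all.equal(fitted(fit_orth), fitted(fit_raw))  # TRUE: identical fitted curve
coef(fit_raw)   # enormous, strongly correlated coefficients
coef(fit_orth)  # well-scaled coefficients on the orthogonal basis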
I feel like this should be relatively simple, and I apologize if I'm overlooking something obvious. Please let me know if there is further code I should share for clarity; I didn't want to make the initial post overly long, but I'm happy to show specific predictions from different models or anything else. Thank you for any help.
I know that when random forest (RF) is used for classification, AUC is normally used to assess the quality of the classification on test data. However, I have no idea which metric to use to assess the quality of regression with RF. I now want to use RF for regression analysis, e.g. using a matrix with several hundred samples and features to predict the (numeric) concentration of chemicals.
The first step is to run randomForest to build the regression model, with y as a continuous numeric variable. How can I tell whether the model is good or not, based on the mean of squared residuals and % Var explained? Sometimes my % Var explained is negative.
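For concreteness, a small sketch using the Boston housing data as a stand-in for the chemical data; both quantities are computed from the out-of-bag samples:

library(randomForest)
library(MASS)  # Boston data, a placeholder for your own
set.seed(42)

fit <- randomForest(medv ~ ., data = Boston)
print(fit)        # reports "Mean of squared residuals" and "% Var explained"
tail(fit$mse, 1)  # OOB mean squared error after the last tree
tail(fit$rsq, 1)  # OOB pseudo R-squared; negative means worse than
                  # simply predicting the mean of y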
Afterwards, assuming the model is fine and/or can be applied directly to the test data, I obtain predicted values. How can I then assess whether those predicted values are good or not? I have read that some people calculate an accuracy (formula: 1 - abs(predicted - actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset; are there other ways to assess the accuracy of the predicted values?
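One option, sketched below with hypothetical stand-in vectors, is to use error measures that never divide by the actual value (RMSE, MAE), or a symmetric percentage error whose denominator is the average magnitude of the two values:

actual    <- c(0, 0.5, 2.1, 0, 3.7)     # placeholder test-set values
predicted <- c(0.1, 0.4, 2.0, 0.2, 3.5)

rmse <- sqrt(mean((predicted - actual)^2))
mae  <- mean(abs(predicted - actual))

# symmetric MAPE: a zero in `actual` alone does not divide by zero
# (it is undefined only when prediction and actual are both zero)
smape <- mean(2 * abs(predicted - actual) /
                (abs(predicted) + abs(actual)))

c(rmse = rmse, mae = mae, smape = smape)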
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function, which can be used to measure how much each variable contributes to the model's accuracy. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measures. One uses a permutation of the out-of-bag data to test the accuracy of the model; the other uses the Gini index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
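For example (note that the permutation measure is only available if the forest was fit with importance = TRUE):

library(randomForest)
library(MASS)
set.seed(1)

fit <- randomForest(medv ~ ., data = Boston, importance = TRUE)
importance(fit, type = 1)  # permutation measure (%IncMSE for regression)
importance(fit, type = 2)  # impurity measure (IncNodePurity)
varImpPlot(fit)            # plots both measures side by side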
Beyond that, one more simple check you may do, really more of a sanity check than anything else, is to compare against what is called the best constant model. The best constant model always outputs the same value, the mean of all responses in the test data set, and can be regarded as the crudest model possible. Compare the average performance of your random forest against the best constant model on a given test set. If your RF model does not outperform the best constant model by a comfortable margin, say a factor of 3-5, then your RF model is not very good.
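A sketch of that comparison on a held-out test set, again with the Boston data as a placeholder:

library(randomForest)
library(MASS)
set.seed(2)

idx   <- sample(nrow(Boston), 400)
train <- Boston[idx, ]
test  <- Boston[-idx, ]

fit <- randomForest(medv ~ ., data = train)
mse_rf       <- mean((predict(fit, newdata = test) - test$medv)^2)
mse_constant <- mean((mean(test$medv) - test$medv)^2)  # best constant model
mse_constant / mse_rf  # a ratio well above 1 means the forest adds real signal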
I have conducted a logistic regression on a binary dependent variable and 5 independent variables. The dataframe I drew these variables from is survey data asking whether a person voted for or against a policy change (the binary dependent variable), with the other variables being questions about income, location and other personal information that could inform whether they would vote for or against the policy.
Having conducted the regression, I'd now like to calculate the predicted probability that each person would have voted yes/no to see how informative those variables are. In total my dataframe has information on 3000 people and I'd like to calculate the predicted probability of voting for/against for every single row/person.
What methods are available for doing so?
Appreciate the help!
You can use the predict function in order to calculate the predicted probabilities.
predict(model, newdata, type="response")
Here model is your fitted logistic regression (the result of the glm() call) and newdata is a data frame containing all the variables used in the model, with one row for each individual for whom you want a probability.
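A minimal sketch, assuming a data frame survey with hypothetical columns vote (0/1), income, and location:

# fit the logistic regression (column names are placeholders)
fit <- glm(vote ~ income + location, data = survey, family = binomial)

# predicted probability of voting yes, one value per row
survey$p_yes <- predict(fit, newdata = survey, type = "response")
head(survey[, c("vote", "p_yes")])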
I have measured multiple attributes (height, species, crown width, condition etc) for about 1500 trees in a city. Using remote sensing techniques I also have the heights for the rest of the 9000 trees in the city. I want to simulate/generate/estimate the missing attributes for these unmeasured trees by using their heights.
From the measured data I can obtain the proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships for each species, species-condition relationships and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes using probability theory. So a tree with a height of, say, 25 m is more likely to be a cedar (height range 5-30 m) than a mulberry (height range 2-8 m), and more likely to be a cedar (50% of the population) than an oak (same height range but 2% of the population), and hence will be assigned a crown width of 10 m and a health condition of 95% (based on the distributions for cedar trees in my measured data). But I also expect some of the other 25 m trees to be assigned oak, just less frequently than cedar, based on the proportions in the population.
Is there a way to do this using probability theory in R preferably utilising Bayesian or machine learning methods?
I'm not asking for someone to write the code for me; I am fairly experienced with R. I just want to be pointed in the right direction, i.e. a package that does this kind of thing neatly.
Thanks!
Because you want to predict a categorical variable, i.e. the species, you should consider a classification tree, a method available in the R packages rpart and randomForest. These models excel when you have a discrete number of categories and need to slot observations into them. I think those packages would work for your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent), which can likewise predict categorical outcomes; unfortunately, multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
If you then want to predict the individual values for individual trees, first fit a regression on the measured trees using all of your measured variables, including species. Then take the species labels you predicted and use them as predictors when predicting out-of-sample for the unmeasured trees, for a variable of interest such as crown width. That way the regression predicts the average for that species/dummy variable, plus some error, while incorporating any other information you have on that out-of-sample tree.
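A rough sketch of that two-step procedure with randomForest, assuming data frames measured and unmeasured with hypothetical columns species, height and crown_width:

library(randomForest)
set.seed(3)

# step 1: learn species from height on the measured trees, then label
# the unmeasured ones (species must be a factor to get classification)
species_fit <- randomForest(species ~ height, data = measured)
unmeasured$species <- predict(species_fit, newdata = unmeasured)

# step 2: predict a remaining attribute from height plus imputed species
crown_fit <- randomForest(crown_width ~ height + species, data = measured)
unmeasured$crown_width <- predict(crown_fit, newdata = unmeasured)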
If you want to use a Bayesian method, consider a hierarchical regression to model these out-of-sample predictions. Hierarchical models sometimes predict better because they tend to be fairly conservative. Look at the rstanarm package for examples.
I suggest looking at Bayesian networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over causal relationships between variables. The Bayesian network structure can be specified by hand or learned from data by an algorithm.
R has several implementations of Bayesian Networks with bnlearn being one of them: http://www.bnlearn.com/
Please see a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/
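A minimal sketch with bnlearn, assuming a data frame measured with hypothetical columns species, height, crown_width and condition; table CPDs require discrete variables, so the continuous ones are binned first:

library(bnlearn)

# bin the continuous attributes so every node is discrete
disc <- discretize(measured[, c("height", "crown_width", "condition")],
                   method = "interval", breaks = 4)
d <- data.frame(species = measured$species, disc)

dag <- hc(d)           # learn the structure by hill climbing
fit <- bn.fit(dag, d)  # estimate the conditional probability tables

# approximate P(species | height in the tallest bin) by sampling
tall <- levels(d$height)[4]
sim  <- cpdist(fit, nodes = "species", evidence = (height == tall))
prop.table(table(sim))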
For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model the height and width as a joint Gaussian distribution; dunno about condition. Anyway with a joint distribution for variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the conditional distribution of the other variables conditional on height (and it will also be a mixture distribution). Given the conditional mixture, you can sample from it as usual: pick a bump with frequency equal to its mixing weight, and then sample from the selected bump.
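A toy sketch of that idea, with invented per-species height distributions and mixing weights:

species <- c("cedar", "oak", "mulberry")
weight  <- c(0.50, 0.02, 0.48)  # mixing weights = proportions in the data
mu      <- c(18, 18, 5)         # mean heights in metres (made up)
sigma   <- c(6, 6, 1.5)         # height standard deviations (made up)

# posterior P(species | height = 25) by Bayes' rule on the mixture
h    <- 25
post <- weight * dnorm(h, mu, sigma)
post <- post / sum(post)
round(setNames(post, species), 3)  # cedar dominates, oak appears rarely

# sample a species for this tree; its other attributes would then be
# drawn from the chosen species' conditional distribution
sample(species, 1, prob = post)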
Sounds like a good problem. Good luck and have fun.
I have mixed data (both quantitative and categorical) predicting a quantitative variable. I converted the categorical variables into factors before feeding them into a glm model in R. Most of the categorical variables have more than 150 levels. When I feed them to the glm model, it fails with memory problems because of these high-cardinality factors. One could set a threshold and accept only variables with up to a certain number of levels, but I need to include the factors with many levels in the model. Is there a methodology for addressing this issue?
Edit: The dataset has 120000 rows and 50 columns. When the data is expanded with model.matrix there are 4772 columns.
If you have a lot of data, the easiest thing to do is sample from your matrix/data frame, then run the regression.
From sampling theory, we know that the standard error of a proportion p is sqrt(p(1-p)/n). If you have 150 levels and the observations are spread evenly across them, each level has a proportion of about 1/150 ≈ .0067, so you would want to estimate proportions as small as .005 or so from your sample. If we take a 10,000-row sample, the standard error of one of those factor levels is roughly:
sqrt((.005*.995)/10000) = 0.0007053368
That adds very little extra variance to your regression estimates. Especially for exploratory analysis, sampling the rows of your data, say a 12,000-row sample, should still give you plenty of data for stable estimates while keeping estimation feasible. Reducing the number of rows by a factor of 10 should also let R fit the model without running out of memory. Win-win.
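A minimal sketch, with dat standing in for the 120,000 x 50 data frame and y ~ . as a placeholder formula:

set.seed(4)
sub <- dat[sample(nrow(dat), 12000), ]
sub <- droplevels(sub)  # drop factor levels absent from the sample

fit <- glm(y ~ ., data = sub, family = gaussian())
head(coef(summary(fit)))  # spot-check the first few coefficients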