R zeroinfl model

I am fitting a zero-inflated negative binomial GLM to some insect count data in R. My problem is how to get R to read my species data as one stacked column so as to preserve the zero inflation. If I subtotal the counts and import them into R as a single row titled Abundance, I lose the zeros and the model doesn't work. So far I have tried to:
stack the data myself (80 columns * 47 rows, i.e. 3760 rows after stacking manually), and you can imagine how slow R gets when running the pscl zeroinfl() command on it (it takes 20 minutes on my computer, although it did finish).
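For reference, the stacking itself can be done in one call, e.g. with tidyr; counts_wide and the identifier column Site below are hypothetical names standing in for the actual 80-column data frame:

library(tidyr)
## Reshape the 80 species columns into a single long Abundance column,
## keeping all the zero counts.
counts_long <- pivot_longer(counts_wide,
                            cols      = -Site,        # everything except the sample identifier
                            names_to  = "Species",
                            values_to = "Abundance")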
The next problem concerns spatial correlation: certain samplers sampled from the same medium, which violates independence. Can I just put medium in as a factor in the model?

3760 rows take 20 minutes with pscl? My goodness, I have been battling 30,000 rows :) That's why my pscl calculation never finished...
However, I then worked with a GLMM including nested random effects (lme/gamm) and a negative binomial distribution, setting theta to a low value so that the distribution itself handles the zero inflation. I think whether this works depends on the degree of zero inflation; in my case it was 44% zeros and the residuals looked rather good.
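A minimal sketch of that approach, assuming a long-format data frame insects with hypothetical columns Abundance, Species and Medium:

library(mgcv)
## gamm() does not estimate theta, so it is fixed at a deliberately low value
## to let the heavy-tailed negative binomial absorb the excess zeros.
fit <- gamm(Abundance ~ Species,
            random = list(Medium = ~1),   # random intercept for sampling medium
            family = negbin(0.5),
            data   = insects)
summary(fit$gam)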

Related

Simulating data using existing data and probability

I have measured multiple attributes (height, species, crown width, condition, etc.) for about 1500 trees in a city. Using remote sensing techniques I also have the heights of the remaining 9000 trees in the city. I want to simulate/generate/estimate the missing attributes for these unmeasured trees using their heights.
From the measured data I can obtain the proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships for each species, species-condition relationships, and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes using probability theory. For example, a tree of 25 m is more likely to be a cedar (height range 5-30 m) than a mulberry (height range 2-8 m), and more likely to be a cedar (50% of the population) than an oak (same height range but 2% of the population), and hence would be assigned a crown width of 10 m and a health condition of 95% (based on the distributions for cedar trees in my measured data). But I also expect some of the other 25 m trees to be assigned oak, just less frequently than cedar, based on the proportions in the population.
Is there a way to do this in R using probability theory, preferably with Bayesian or machine learning methods?
I'm not asking anyone to write the code for me; I am fairly experienced with R. I just want to be pointed in the right direction, i.e. a package that does this kind of thing neatly.
Thanks!
Because you want to predict a categorical variable, i.e. the species, you should consider a tree-based classifier, available in the R packages rpart and randomForest. These models excel when you have a discrete number of categories and need to slot your observations into those categories, so I think those packages would work well in your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent), which can also predict categorical outcomes; unfortunately multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
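As a hedged sketch of the random forest route, assuming hypothetical data frames measured (with a species factor and a numeric height) and unmeasured (with height only):

library(randomForest)
## Train a species classifier on the measured trees...
rf <- randomForest(species ~ height, data = measured, ntree = 500)
## ...then predict species (or class probabilities) for the remote-sensed trees.
unmeasured$species_hat <- predict(rf, newdata = unmeasured)
species_prob           <- predict(rf, newdata = unmeasured, type = "prob")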
If you then want to predict individual values for the individual trees within a species, first run a regression of your measured variables, including species type, on the measured trees. Then take the categorical labels you predicted and predict out-of-sample for the unmeasured trees, using those labels as predictors for the unmeasured variable of interest, say crown width. That way the regression will predict the average value for that species/dummy variable, plus some error, while incorporating any other information you have on that out-of-sample tree.
If you want to use a Bayesian method, consider a hierarchical regression to model these out-of-sample predictions. Hierarchical models often predict well because they tend to be fairly conservative. Look at the rstanarm package for some examples.
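A rough illustration of that hierarchical route, with hypothetical column names and a species-level random intercept and slope on height:

library(rstanarm)
fit <- stan_lmer(crown_width ~ height + (height | species), data = measured)
## Posterior predictive draws for the unmeasured trees, assuming they now carry
## a height column and the species label predicted in the first stage.
pred <- posterior_predict(fit, newdata = unmeasured)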
I suggest looking at Bayesian networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over causal relationships between variables. The network structure can be specified by hand or learned from the data by an algorithm.
R has several implementations of Bayesian networks, bnlearn being one of them: http://www.bnlearn.com/
There is a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/
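A small sketch of what that might look like with bnlearn, assuming a data frame trees of discretised factors (table CPDs need discrete variables; all column names here are hypothetical):

library(bnlearn)
dag <- hc(trees)            # learn the network structure by hill-climbing
fit <- bn.fit(dag, trees)   # estimate the conditional probability tables
## Predict the most likely species for new trees from the available evidence
## (e.g. a discretised height class), using likelihood weighting.
pred <- predict(fit, node = "species", data = new_trees, method = "bayes-lw")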
For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model the height and width as a joint Gaussian distribution; dunno about condition. Anyway with a joint distribution for variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the conditional distribution of the other variables conditional on height (and it will also be a mixture distribution). Given the conditional mixture, you can sample from it as usual: pick a bump with frequency equal to its mixing weight, and then sample from the selected bump.
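A compact sketch of that mixture idea, treating height and crown width as jointly Gaussian within each species (all column names hypothetical):

## Fit one Gaussian "bump" per species, weighted by the species proportion.
fit_bumps <- function(measured) {
  lapply(split(measured, measured$species), function(d) list(
    weight = nrow(d) / nrow(measured),
    mu     = colMeans(d[, c("height", "crown")]),
    sigma  = cov(d[, c("height", "crown")])
  ))
}

## Sample a species and a crown width for one unmeasured tree of height h.
sample_given_height <- function(h, bumps) {
  ## Conditional species weights: mixing weight times the Gaussian density of h.
  dens <- sapply(bumps, function(b)
    b$weight * dnorm(h, b$mu["height"], sqrt(b$sigma["height", "height"])))
  sp <- sample(names(bumps), 1, prob = dens / sum(dens))
  b  <- bumps[[sp]]
  ## Crown width given height is again Gaussian within the selected bump.
  slope    <- b$sigma["crown", "height"] / b$sigma["height", "height"]
  cond_mu  <- b$mu["crown"] + slope * (h - b$mu["height"])
  cond_var <- b$sigma["crown", "crown"] - slope * b$sigma["crown", "height"]
  list(species = sp, crown = rnorm(1, cond_mu, sqrt(cond_var)))
}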
Sounds like a good problem. Good luck and have fun.

Regression model fails with factors having a large number of levels

I have mixed data (both quantitative and categorical) predicting a quantitative variable. I converted the categorical data into factors before feeding it into a glm model in R. Most of my categorical variables have more than 150 levels, and when I feed them to the glm model it fails with memory issues because of these many-level factors. I could set a threshold and accept only variables with up to a certain number of levels, but I need to include the factors with more levels in the model. Is there a methodology to address this issue?
Edit: The dataset has 120000 rows and 50 columns. When the data is expanded with model.matrix there are 4772 columns.
If you have a lot of data, the easiest thing to do is sample from your matrix/data frame, then run the regression.
Given sampling theory, we know that the standard error of a proportion p is sqrt(p(1-p)/n). So if you have 150 levels, and assuming the observations are evenly distributed across them, we would want to be able to estimate proportions as small as 0.005 or so from your data set. If we take a 10,000-row sample, the standard error of one of those factor levels is roughly:
sqrt((.005*.995)/10000) = 0.0007053368
That's really not much additional variance added to your regression estimates. Especially when you are doing exploratory analysis, sampling the rows of your data, say a 12,000-row sample, should still give you plenty of data to estimate quantities while making estimation feasible. Reducing your rows by a factor of 10 should also let R do the estimation without running out of memory. Win-win.
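In code the idea is simply (df and the response y are placeholders for your data frame and outcome):

set.seed(1)
idx <- sample(nrow(df), 12000)        # roughly a tenth of the 120,000 rows
fit <- glm(y ~ ., data = df[idx, ])   # fit the model on the sample only
summary(fit)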

Heteroscedasticity and autocorrelation problems in multiple regression analysis

I am running a multiple linear regression model using the lm function in R to study the impact of some characteristics on gene expression levels.
My data matrix contains one continuous dependent variable (the gene expression level) and 50 explanatory variables, which are the counts of these characteristics on each gene; many of these counts are zeros.
I checked all of the regression assumptions and found two issues: heteroscedasticity and autocorrelation. The latter is not serious. I wonder whether multiple linear regression is appropriate here, and whether there are other regression techniques that could solve these problems.
I used a stepwise method and got just 11 significant variables among the 50. But when I checked for heteroscedasticity, it still appears, as shown below. The sample size is 15,000 genes (15,000 rows and 50 columns).
Updated image, with weights added to the lm call, per the comments.
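For what it is worth, a hedged sketch of the two usual remedies touched on above, assuming a hypothetical data frame genes with the response column expr_level, the 50 count predictors, and analyst-chosen precision weights w:

## Weighted least squares (the "weights added to the lm call" route).
fit_wls <- lm(expr_level ~ ., data = genes, weights = w)

## Alternatively, keep ordinary least squares but report
## heteroscedasticity-consistent standard errors.
library(lmtest)
library(sandwich)
fit_ols <- lm(expr_level ~ ., data = genes)
coeftest(fit_ols, vcov. = vcovHC(fit_ols, type = "HC3"))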

How to deal with replicate weights equal to zero in R?

I have zero values for some replicate and sampling weights. Therefore, when I use svycoxph from the “survey” package, I get the error message “Invalid weights, must be >0”. I think one way might be to exclude these observations, but I wonder if there is a way to keep them for the Cox proportional hazards model?
Thanks!
Julia
Zero replicate weights are perfectly normal -- with jackknife replicates they are the observations from the PSU being left out; with bootstrap replicates they come from PSUs that appeared zero times in a particular resample. Zero sampling weights, on the other hand, don't make a lot of sense and probably indicate observations that aren't in the sample or aren't in the sampling frame (personally, I'd code these as NA).
The coxph() function in the survival package, which svycoxph() calls, can't handle negative or zero weights. For replicate weights, svycoxph() adds a small value (1e-10) to the weights to avoid problems with zeros.
However, svycoxph() can't handle zero sampling weights. You probably want to remove these before constructing the design object.
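A minimal sketch of that, with hypothetical names (data frame mydata and columns psu, strat, pweight, time, status, x):

library(survey)
library(survival)
## Drop observations with zero sampling weights before building the design.
dat <- subset(mydata, pweight > 0)
des <- svydesign(ids = ~psu, strata = ~strat, weights = ~pweight, data = dat)
fit <- svycoxph(Surv(time, status) ~ x, design = des)
summary(fit)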

Imbalanced training dataset and regression model

I have a large dataset (>300,000 observations) that represents the distance (RMSD) between proteins. I'm building a regression model (random forest) that is supposed to predict the distance between any two proteins.
My problem is that I'm more interested in close matches (short distances), but my data distribution is highly skewed: the majority of the distances are large. I don't really care how well the model predicts large distances; I want to make sure it predicts the distances of close matches as accurately as possible. However, when I train the model on the full data its performance isn't good. So what is the best sampling strategy to ensure that the model predicts close-match distances as accurately as possible, while not stratifying the data too much, since unfortunately this skewed distribution reflects the real-world distribution on which I am going to validate and test the model?
The following is my data distribution, where the first column is the distance and the second column is the number of observations in that distance range:
Distance Observations
0 330
1 1903
2 12210
3 35486
4 54640
5 62193
6 60728
7 47874
8 33666
9 21640
10 12535
11 6592
12 3159
13 1157
14 349
15 86
16 12
The first thing I would try here is building a regression model for the log of the distance, since this compresses the range of larger distances. If you're using a generalised linear model this is the log link function; for other methods you can just do it manually by estimating a regression function f of your inputs x and exponentiating the result:
y = exp( f(x) )
Remember to train on the log of the distance for each pair.
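A hedged sketch of that with a random forest, where pairs is a hypothetical data frame containing the predictors and an rmsd column (log1p/expm1 are used because some distances are exactly zero):

library(randomForest)
train <- pairs
train$log_rmsd <- log1p(train$rmsd)   # log(1 + d) avoids log(0) for exact zeros
train$rmsd     <- NULL
rf <- randomForest(log_rmsd ~ ., data = train)
## Back-transform predictions to the original distance scale.
pred_rmsd <- expm1(predict(rf, newdata = new_pairs))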
Popular techniques for dealing with imbalanced distributions in regression include:
Random over/under-sampling (see the sketch after this list).
The Synthetic Minority Oversampling Technique for Regression (SMOTER), which has an R package implementation.
The Weighted Relevance-based Combination Strategy (WERCS), which has a GitHub repository of R code implementing it.
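As an illustration of the first option, a simple under-sampling sketch (pairs, rmsd and the 5-unit cutoff are placeholders):

set.seed(1)
close  <- pairs[pairs$rmsd <  5, ]               # the range you care about
far    <- pairs[pairs$rmsd >= 5, ]
far_ds <- far[sample(nrow(far), nrow(close)), ]  # down-sample the majority range
train  <- rbind(close, far_ds)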
PS: the table you show suggests a classification problem rather than a regression problem.
As previously mentioned, I think what might help in your case is the Synthetic Minority Over-Sampling Technique for Regression (SMOTER).
If you're a Python user, I'm currently working on improving my implementation of the SMOGN algorithm, a variant of SMOTER: https://github.com/nickkunz/smogn
There are also a few examples on Kaggle where SMOGN has been applied to improve prediction results: https://www.kaggle.com/aleksandradeis/regression-addressing-extreme-rare-cases
