ANN - Constraints - Independence - constraints

I try to model time-series of plant biophysiological responses to various ecosystem drivers using ANN. During the night, I would like to remove some variables' influences, because science has proved that they should not have an influence on the processes I model, while their influence is important during the day. I thought about modifying my loss function by adding penalties targeting these variables during the night potentially with independent statistical tests.
Is this a legit way to solve this problem? Is there a more conventional way to impose independence at given parameter subspaces in Machine Learning?
Thanks in advance

Related

glmmLasso function won't finish running

I am trying to build a Mixed Model Lasso model using glmmLasso in RStudio. However, I am looking for some assistance.
I have the equation of my model as follows:
glmmModel <- glmmLasso(outcome ~ year + married ,list(ID=~1), lambda = 100, family=gaussian(link="identity"), data=data1,control = list(print.iter=TRUE))
where outcome is a continuous variable, year is the year the data was collected, and married is a binary indicator (1/0) of whether or not the subject is married. I eventually would like to include more covariates in my model, but for the purpose of successfully first getting this to run, right now I am just attempting to run a model with these two covariates. My data1 dataframe is 48000 observations and 57 variables.
When I click run, however, the model runs for many hours (48+) without stopping. The only feedback I am getting is "ITERATION 1," "ITERATION 2," etc... Is there something I am missing or doing wrong? Please note, I am running on a machine with only 8 GB RAM, but I don't think this should be the issue, right? My dataset (48000 observations) isn't particularly large (at least I don't think so). Any advice or thoughts would be appreciated on how I can fix this issue. Thank you!
This is too long to be a comment, but I feel like you deserve an answer to this confusion.
It is not uncommon to experience "slow" performance. In fact in many glmm implementations it is more common than not. The fact is that Generalized Linear Mixed Effect models are very hard to estimate. For purely gaussian models (no penalizer) a series of proofs gives us the REML estimator, which can be estimated very efficiently, but for generalized models this is not the case. As such note that the Random Effect model matrix can become absolutely massive. Remember that for every random effect, you obtain a block-diagonal matrix so even for small sized data, you might have a model matrix with 2000+ columns, that needs to go through optimization through PIRLS (inversions and so on).
Some packages (glmmTMB, lme4 and to some extend nlme) have very efficient implementations that abuse the block-diagonality of the random effect matrix and high-performance C/C++ libraries to perform optimized sparse-matrix calculations, while the glmmLasso (link to source) package uses R-base to perform all of it's computations. No matter how we go about it, the fact that it does not abuse sparse computations and implements it's code in R, causes it to be slow.
As a side-note, my thesis project had about 24000~ observations, with 3 random effect variables (and some odd 20 fixed effects). The fitting process of this dataset could take anywhere between 15 minutes to 3 hours, depending on the complexity, and was primarily decided by the random effect structure.
So the answer from here:
Yes glmmLasso will be slow. It may take hours, days or even weeks depending on your dataset. I would suggest using a stratified (or/and clustered) subsample across independent groups, fit the model using a smaller dataset (3000 - 4000 maybe?), to obtain initial starting points, and "hope" that these are close to the real values. Be patient. If you think neural networks are complex, welcome to the world of generalized mixed effect models.

Is there an R package with which I can model the effects of competition on ideal free distribution?

I am a university student working on a research project, because of our local lockdown I cannot go into the field to collect observation data, I am therefore looking for an R package that will allow me to model the effects of competition when testing for ideal free distribution (IFD).
To give you a better idea of what I am looking for I have described the project in more detail below.
In my original dataset (which I received i.e., I did not collect the data myself) I have two patches (A,B) which received random treatments of food input (1:1, 2:1, 5:1). Under the ideal free distribution hypothesis species should distribute into the patches in accordance with the treatment ratios. This is not the case.
Under normal circumstances I would go into the field and observe behaviour of individuals in the patches to see if dominance affects distribution. Since we are in a lockdown I am unable to do so. I am hoping that there is a package out there that would allow me to model this scenario and help me investigate how competition affects IFD.
I have already found two packages called coexist and EcoVirtual but they model coexistence and extinction dynamics, whereas I want to investigate how competition might alter distribution between profitable patches when there is variation in the level of competition.
I am fairly new to R and creating my own package is beyond my skillset at this point, so I would appreciate the help.
I hope this makes sense and thanks in advance.
Wow, that's an odd place to find another researcher of IFD. I do not believe there are packages on R specifically about IFD. Its too specific and most models are relatively simple to estimate using common tests. For example, the input-matching rule you mentioned can be tested using a simple run-of-the-mill t-test, already included in base R.
What you have is not a coding problem per say, or even an statistical one. It is a biological problem. What ratio would you expect when animals are ideal (full knowledge of the environment), free (no movement costs), but with the presence of competition? Is this ratio equal to the ratio in your dataset? Sutherland,1983 suggests animals would undermatch.
I would love to discuss this at depth, given my PhD was in IFD, but I fear you hit the wrong forum.

Best alternative for stepwise regression in R

I know that there dosens of similar questions/answers, and lots of papers. But please read till the end.
Non-statisticians tend to use stepwise regressions which is strongly argued by statisticians. This is stomething that I don't understand, but I just obey them. "Ok this is not a good way to do your modelling".
Here is (was) my model:
b <- lmer(metric1~a+b+c+d+e+f+g+h+i+j+k+l+(1|X/Y) + (1|Z), data = dataset)
drop1 (b, test="Chisq")
(Just a small note: Watch out for the random effects in my model; random effects are Year, Month, Sampling.location; one of my variables is 1/0: I allready log-transformed my variables)
I am trying to find a exploratory model (with drop1 to reach final model) and evaluating it with my biological knowledge to see if the dependent ("metric" in this case) seems to be responding variables. I will repeat this process with 100 metrics just to evaulate which metrics seems to be responding environmental variables.
I was in the search for an acceptable model instead of stepwise according to the suggestions of statistics gurus.
However, there are lots of alternatives. I read alot, but still feel myself lost. Some say Lasso, some say elastic modelling, some say ridge regression... Which one fits for my purpose?
Any advise for a better alternative and an easy model or a help page for dummies, or examples (that could be better) would be much appreciated.
Thanks in advance.

The first principal component has almost all the information, but it does not seem to be the best indicator for classification

I have a feature vector of 180 elements, and have applied a PCA on it. The problem is that the first pc has a high variance, but according to this biplot diagram for pc1 vs pc2, it seems that this is happening because of an outlier. Which is strange to me.
Apparently the first PC is not the best indicator for classification here.
Here is also the biplot diagram for pc2 vs pc3:
I am using R for this. Any suggestion why is this happening and how I can solve this? Should I remove the outliers? If yes what is the best way to do so by R.
--Edit
I am using prcomp(features.df, center= TRUE, scale = TRUE) to normalize the data.
Even without the outlier, PCA may be entirely nonsensical if your goal is classification aka "discrimination" ((the term being completely "politized" is rare nowaday in the statistical context)).
That's why "they" invented "crimcoords" as different but related to the "prin.coords" where the latter are stats slang for 'principal coordinates' (related to your principal components).
"Crimcoords" seem no longer easy to find on the web; in the last century every good statistician knew +- what they were. A good reference seems Gnanadesikan's monography "Methods for Statistical Data Analysis of Multivariate Observations" (1st edition 1977, 2nd ed 1997; Wiley).
And Ram Gnanadesikan was already very much aware of the problem of outliers and so mentioned "robust" methods.
Nowadays, the "standard" R package for robust multivariate statistics is 'rrcov' (by Valentin Todorov)... a modern version of the topic (I think allowing "lasso" type regularization) is package 'rrlda' with main function rrlda() indeed allowing both robust and Lasso (L1) penalization.

Statistical comparison of machine learning algorithm

I am working in machine learning. I am stuck in one of the thing.
I want to compare 4 machine learning techniques among 10 datasets. After performing experiment i got Area Under Curve value. After this i have applied Analysis of variance test which shows there is a significant difference between 4 machine learning techniques.
Now my problem is that which test will conclude that particular algorithm perform well compared to other algorithm and i want only one winner among the machine learning techniques.
A classifier's quality can be measured by the F-Score which measures the test's accuracy. Comparing these respective scores will give you a simple measure.
However, if you want to measure whether the difference between the classifiers' accuracies is significant, you can try the Bayesian Test or, if classifiers are trained once, McNemar's test.
There are other possibilities and the papers On Comparing Classifiers: Pitfalls to Avoid and a
Recommended Approach and Approximate Statistical Tests for Comparing
Supervised Classification Learning Algorithms are probably worth reading.
If you are gathering performance metrics (ROC,accuracy,sensitivity,specificity...) from identicially resampled data sets then you can perform statistical tests using paired comparisons. Most statistical software impliment Tukeys Range test (ANOVA). https://en.wikipedia.org/wiki/Tukey%27s_range_test. A formal treatment of this material is here: http://epub.ub.uni-muenchen.de/4134/1/tr030.pdf. This is the test I like to use for the purpose you discuss, although there are others and people have varying opinions.
You will still have to choose how you will sample based on your data (k-fold), repeated (k-fold), bootstrap, leave one out, repeated training test splits. Bootstrap methods tend to give you the tightest confidence intervals after leave one out; but leave one out might not be an option if your data is huge.
That being said you may also need to consider the problem domain. False positives may be an issue in classification. You may need to consider other metrics to choose the best performer for the domain. AUC might not always be the best model for a specific domain. For instance a credit card company may not want to deny a transaction to customers, we need a very low false positive on fraud classification.
You may also want to consider implementation. If a logistic regression performs near as well it may be a better choice over a more complicated implementation of a random forest. Are there legal implications to model use (Fair Credit Reporting Act...)?
A common sense approach is to begin with something like RF or Gradient boosted trees to get an empirical sense of a performance ceiling. Then build simpler models and use the simpler model that performs reasonabley well compared to the ceiling.
Or you could combine all your models using something like LASSO... or some other model.

Resources