Is there an R package with which I can model the effects of competition on ideal free distribution? - r

I am a university student working on a research project. Because of our local lockdown I cannot go into the field to collect observational data, so I am looking for an R package that will allow me to model the effects of competition when testing for ideal free distribution (IFD).
To give you a better idea of what I am looking for I have described the project in more detail below.
In my original dataset (which I received, i.e., I did not collect the data myself) I have two patches (A, B) which received random treatments of food input (1:1, 2:1, 5:1). Under the ideal free distribution hypothesis, individuals should distribute themselves between the patches in accordance with the treatment ratios. This is not the case in my data.
Under normal circumstances I would go into the field and observe behaviour of individuals in the patches to see if dominance affects distribution. Since we are in a lockdown I am unable to do so. I am hoping that there is a package out there that would allow me to model this scenario and help me investigate how competition affects IFD.
I have already found two packages called coexist and EcoVirtual but they model coexistence and extinction dynamics, whereas I want to investigate how competition might alter distribution between profitable patches when there is variation in the level of competition.
I am fairly new to R and creating my own package is beyond my skillset at this point, so I would appreciate the help.
I hope this makes sense and thanks in advance.

Wow, that's an odd place to find another researcher of IFD. I do not believe there are R packages specifically about IFD. It's too specific, and most models are relatively simple to estimate using common tests. For example, the input-matching rule you mentioned can be tested using a simple, run-of-the-mill t-test, already included in base R.
What you have is not a coding problem per se, or even a statistical one. It is a biological problem. What ratio would you expect when animals are ideal (full knowledge of the environment) and free (no movement costs), but in the presence of competition? Is this ratio equal to the ratio in your dataset? Sutherland (1983) suggests animals would undermatch.
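For illustration, here is a minimal sketch of that kind of test in base R, using made-up counts (the data frame ifd and its columns are hypothetical, not the original dataset). Under strict input matching, the mean of log(n_A / n_B) should equal the log of the input ratio, and a mean below it would indicate undermatching.

## Minimal sketch of the input-matching test mentioned above (illustration only).
set.seed(1)
ifd <- data.frame(
  n_A   = rpois(20, 30),   # individuals counted in patch A per observation period
  n_B   = rpois(20, 18),   # individuals counted in patch B
  ratio = 2                # food input ratio A:B for this treatment (2:1)
)

## Observed log patch-use ratios vs. the expected value under input matching.
log_use   <- log(ifd$n_A / ifd$n_B)
log_input <- log(unique(ifd$ratio))

## One-sample t-test; a mean below log(ratio) suggests undermatching (Sutherland 1983).
t.test(log_use, mu = log_input)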
I would love to discuss this at depth, given my PhD was in IFD, but I fear you hit the wrong forum.

Related

How to control for an independent variable in logistic regression?

I am trying to predict the Spotify popularity score using a range of machine learning algorithms in the R caret package, including logistic regression. The aim is to predict track popularity from audio features, e.g. danceability, energy, etc. The problem I have is that Spotify is not transparent about how the popularity score is calculated, but I know it is based on a number of things, including play counts and how recently the track was released. That means the number of days since release will have an impact on the popularity score, so I have included days_released as an independent variable in my modelling to try to control for it.
So, I have 50 variables (days_released being one of them). I am using the rfe function in caret to perform feature selection, but for every algorithm days_released is the only variable selected. Does anyone have any advice or recommended reading on how to overcome this problem? I want to predict popularity and explore which track features have a significant relationship with popularity, controlling for days_released.
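For reference, a rough sketch of the setup described above (the data frame tracks and its column names are assumptions, not the real data):

## Illustrative rfe setup; 'tracks' is a hypothetical data frame with a numeric
## popularity column, days_released, and the audio features.
library(caret)

predictors <- subset(tracks, select = -popularity)   # 50 candidate features
outcome    <- tracks$popularity

ctrl <- rfeControl(functions = rfFuncs,   # one of caret's built-in ranking function sets
                   method    = "cv",
                   number    = 5)

## rfe evaluates several subset sizes but keeps whichever performs best, so a
## single dominant predictor such as days_released can still win on its own.
fit <- rfe(x = predictors, y = outcome,
           sizes = c(5, 10, 20, 30),
           rfeControl = ctrl)

predictors(fit)   # variables retained by the best-performing subset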
Do I take the days_released variable out altogether?
Do I leave it in but force rfe to select more than one feature?
Any help would be much appreciated! Thanks in advance!

The first principal component has almost all the information, but it does not seem to be the best indicator for classification

I have a feature vector of 180 elements and have applied PCA to it. The problem is that the first PC has a high variance, but according to the biplot for PC1 vs PC2 this seems to be happening because of an outlier, which is strange to me.
Apparently the first PC is not the best indicator for classification here.
Here is also the biplot diagram for PC2 vs PC3:
I am using R for this. Any suggestions why this is happening and how I can solve it? Should I remove the outliers? If yes, what is the best way to do so in R?
Edit: I am using prcomp(features.df, center = TRUE, scale = TRUE) to normalize the data.
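For what it's worth, a quick way to see which observations drive PC1 from that prcomp fit (the 3-standard-deviation cutoff below is an arbitrary illustration, not a recommendation):

## Inspect the PC1 scores and flag extreme observations.
pca    <- prcomp(features.df, center = TRUE, scale. = TRUE)
z1     <- scale(pca$x[, 1])        # standardized PC1 scores
which(abs(z1) > 3)                 # candidate outliers on PC1

## Refitting without those rows shows how much they alone shape PC1:
## pca2 <- prcomp(features.df[abs(z1) <= 3, ], center = TRUE, scale. = TRUE)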
Even without the outlier, PCA may be entirely nonsensical if your goal is classification, a.k.a. "discrimination" (the term, having become thoroughly politicized, is rare nowadays in the statistical context).
That's why "they" invented "crimcoords" as different but related to the "prin.coords" where the latter are stats slang for 'principal coordinates' (related to your principal components).
"Crimcoords" seem no longer easy to find on the web; in the last century every good statistician knew +- what they were. A good reference seems Gnanadesikan's monography "Methods for Statistical Data Analysis of Multivariate Observations" (1st edition 1977, 2nd ed 1997; Wiley).
And Ram Gnanadesikan was already very much aware of the problem of outliers and so mentioned "robust" methods.
Nowadays, the "standard" R package for robust multivariate statistics is 'rrcov' (by Valentin Todorov)... a modern version of the topic (I think allowing "lasso" type regularization) is package 'rrlda' with main function rrlda() indeed allowing both robust and Lasso (L1) penalization.

Point pattern similarity and comparison

I recently started to work with a huge dataset provided by a medical emergency service. I have approximately 25,000 spatial points of incidents.
I have been searching books and the internet for quite some time and am getting more and more confused about what to do and how to do it.
The points are, of course, very clustered. I calculated the K, L and G functions for them, and these confirm serious clustering.
I also have a population point dataset - one point for every citizen - that is clustered similarly to the incidents dataset (incidents happen to people, so there is a strong link between these two datasets).
I want to compare these two datasets to figure out if they are similarly distributed. I want to know if there are places where there are more incidents relative to the population. In other words, I want to use the population dataset to explain the intensity and then figure out if the incident dataset corresponds to that intensity. The assumption is that incidents should occur at random with respect to the population.
I want to get a plot of the region showing where there are more or fewer incidents than expected if the incidents were happening to people at random.
How would you do it with R?
Should I use Kest or Kinhom to calculate the K function?
I read the documentation, but still don't understand the basic difference between them.
I tried using Kcross, but as I figured out, one of the two datasets used should be CSR (completely spatially random).
I also found Kcross.inhom, should I use that one for my data?
How can I get a plot (image) of incident deviations relative to the population?
I hope I asked clearly.
Thank you for taking the time to read my question, and even more thanks if you can answer any of it.
Best regards!
Jernej
I do not have time to answer all your questions in full, but here are some pointers.
DISCLAIMER: I am a coauthor of the spatstat package and the book Spatial Point Patterns: Methodology and Applications with R [1], so I have a preference for using these (and I genuinely believe they are the best tools for your problem).
Conceptual issue: How big is your study region, and does it make sense to treat the points as distributed anywhere in the region, or are they confined to the road network?
For now I will assume they can be located anywhere.
A simple approach would be to estimate the population density using density.ppp and then fit a Poisson model to the incidents, with the population density as the intensity, using ppm. This would probably be a reasonable null model, and if it fits the data well you can basically say that incidents happen "completely at random in space when controlling for the uneven population density". More information on density.ppp and ppm is in chapters 6 and 9 of [1], respectively, and of course in the spatstat help files.
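A minimal sketch of that approach, assuming the incidents and the population are already ppp objects called incidents_ppp and population_ppp on the same window (names are placeholders):

## Null model: incident intensity proportional to population density.
library(spatstat)

## Smooth the population points into an intensity surface.
popden <- density(population_ppp, sigma = bw.diggle(population_ppp))

## Poisson model with the (log) population density as an offset; in practice
## you may need to guard against regions where popden is zero (log(0)).
fit <- ppm(incidents_ppp ~ offset(log(popden)))

fit                 # fitted proportionality constant (on the log scale)
diagnose.ppm(fit)   # residual diagnostics for the fitted model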
If you use summary statistics like the K/L/G/F/J-functions, you should always use the inhomogeneous (inhom) versions to take the population density into account. This is covered in chapter 7 of [1].
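Continuing the sketch above, the inhomogeneous K function can take the fitted intensity directly (again, object names are placeholders):

## Inhomogeneous K function for the incidents, using the fitted intensity.
lambda_hat <- predict(fit)                       # fitted intensity surface (im)
Ki <- Kinhom(incidents_ppp, lambda = lambda_hat)
plot(Ki)   # compare the estimate with the Poisson reference curve

## Envelope test of the same null model (optional, can be slow):
## plot(envelope(fit, Kinhom, nsim = 39))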
It could also be interesting to look at the relative risk (relrisk) if you combine all your points into a marked point pattern with two types (background and incidents). See chapter 14 of [1].
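A sketch of that, under the same assumptions about object names:

## Combine population and incidents into a bivariate marked point pattern.
combined <- superimpose(background = population_ppp,
                        incident   = incidents_ppp)

## Spatially varying probability of the "incident" type; high values indicate
## places with more incidents than the background population would suggest.
rr <- relrisk(combined)
plot(rr)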
Unfortunately, only chapters 3, 7 and 9 of [1] are available as free-to-download sample chapters, but I hope you have access to it at your library or have the option of buying it.

R - Fit many distributions to sample, visualize, and sort by g.o.f. test

Is there any package in R that allows fitting many pdfs to some sample data at the same time, plots all the fits together with the sample histogram, and then allows sorting the fits by some goodness-of-fit criterion such as Kolmogorov-Smirnov, Anderson-Darling, chi-squared, ...? Something similar to what the commercial software EasyFit does?
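One package that covers most of this is fitdistrplus (a suggestion, not something named in the question); a minimal sketch with simulated data:

## Fit several candidate distributions, overlay them on the histogram, and
## tabulate KS / AD / Cramer-von Mises statistics plus AIC and BIC.
library(fitdistrplus)

set.seed(42)
x <- rgamma(500, shape = 2, rate = 0.5)   # example data standing in for the sample

fits <- list(
  gamma   = fitdist(x, "gamma"),
  lnorm   = fitdist(x, "lnorm"),
  weibull = fitdist(x, "weibull")
)

denscomp(fits, legendtext = names(fits))   # all fitted densities over the histogram
gofstat(fits, fitnames = names(fits))      # g.o.f. statistics to sort the fits by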
UPDATE
I've received valuable comments on my initial question. Specifically, the AIC stands out as a metric that allows comparing pdfs with differing numbers of parameters. However, the AIC also has limitations. It would therefore be interesting to come up with (or find) some sort of summary stating the pros and cons of all the g.o.f. tests for model selection. Many of these topics are familiar to statisticians but might not be to practitioners, who must make many g.o.f. decisions on a daily basis for practical problems and would find such a summary very useful.
Any suggestions are welcome.
Thanks!

Technique to obfuscate clustered data and preserve privacy in R

Background
I have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. Under no circumstances can this information be released.
As is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. I can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. That is also unacceptable.
To help me with this question, you don't have to be familiar with replicate weights -- just think of them as a few columns of strongly-correlated clustered data; the sketch below shows what such columns can look like.
I understand that if I want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; I just want to make that guessing game less precise. On the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
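For concreteness, a rough sketch of how such replicate weight columns are typically generated with the survey package (the data frame survey_df and its variables are made up; 'location' stands for the confidential geographic cluster):

## Build a design clustered on the confidential location, then derive
## bootstrap replicate weights from it.
library(survey)

des <- svydesign(ids = ~location, weights = ~wt, data = survey_df)

rep_des <- as.svrepdesign(des, type = "bootstrap",
                          replicates = 50, compress = FALSE)

repw <- weights(rep_des)   # replicate weights, one column per bootstrap replicate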
Request
I am looking for a technique that
prevents public-use-file users from easily deducing the shared geographic location from the correlations between my replicate weight variables
does not obliterate the correlations between my columns of data (the replicate weights variables)
can be implemented on an R data.frame object without a major time investment
I say "shared" because the evil user might not know where the location is, but they might know if two survey respondents are from the same location -- an unacceptable possibility.
What I have tried
I don't really want to re-invent the wheel here. I am looking for R syntax, an R package, or anything else that would be relatively straightforward to implement. I've found one, two, three, four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.
I can do simple things like add and subtract random values to my replicate weight columns according to a normal distribution, but I'd prefer to rely on the work of someone who understands privacy issues better than I do.
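A sketch of that simple perturbation, just to make it concrete (illustrative only, not a vetted disclosure-limitation method; repwt is assumed to be a data.frame of replicate weight columns):

## Add normal noise scaled to each column's spread, then check how much of the
## correlation structure survives.
set.seed(7)
noise_sd <- 0.05   # noise as a fraction of each column's standard deviation

repwt_obf <- as.data.frame(
  lapply(repwt, function(w) w + rnorm(length(w), mean = 0, sd = noise_sd * sd(w)))
)

## Compare correlation matrices before and after perturbation.
round(cor(repwt) - cor(repwt_obf), 3)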
Thanks!
I have written this nine-step tutorial to walk through the process in an attempt to answer my own question. I am not an expert in the field of privacy/confidentiality and would love to hear both feedback about this idea and other ideas. Thanks!
http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html
