GLM for overdispersed count data, negative residual trends - r

I have been trying to analyze count data of shark detections and how they have changed across different periods of time over several years. I have y = number of detections for an event and x = covid_period, a factor with three levels (before, during, after), as well as diel_period (day/night), sex (male or female), and year (to check whether the covid period makes a difference relative to other years in which there was no lockdown). Since all my explanatory variables are categorical, I have been trying to run a GLM with family = quasipoisson.
glm_quasi <- glm(num_detections ~ covid_period + year + SEX + diel_period + animal_id,
                 family = quasipoisson, data = nursesharks)
My residuals/qqnorm plots indicate this is not a good model.
In essence I want to know whether there were more or fewer detections of female sharks during the day in the covid period of 2020. Am I choosing the right model?
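If the quasipoisson fit is poor, a negative binomial GLM is a common alternative for overdispersed counts and may be worth comparing. A minimal sketch, assuming the nursesharks data frame and column names from the question (note that with repeated detections per shark, animal_id would arguably be better handled as a random effect, e.g. via lme4::glmer.nb):
# Sketch only: negative binomial alternative; the three-way interaction
# targets the female/daytime/covid-period contrast asked about above.
library(MASS)
glm_nb <- glm.nb(num_detections ~ covid_period * SEX * diel_period + year,
                 data = nursesharks)
summary(glm_nb)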

Related

Time series forecasting of outcome variable based on current performance of outcome variable in R

I have a very large dataset (~55,000 datapoints) for chicken crops. Chickens are grown over a ~35-day period. The dataset covers 10 sheds of ~20,000 chickens each. In the sheds are weighing platforms, and as chickens step on them they send the recorded weight to a server. They send continuously from day 0 to the final day.
The variables I have are: House (a number, House 1 up to House 10), Weight (measured in grams, to 5 decimal places) and Day (a continuous number, e.g. 12 noon on day 0 is 0.5, whereas 23.3 means a third of the way through day 23, i.e. 8 AM; because the data are sent continuously, these values can be very precise).
I want to construct either a Time Series Regression model or an ML model so that if I take a new crop, as data is sent by the sensors, the model can make a prediction for what the end weight will be. Then as that crop cycle finishes it can be added to the training data and repeat.
Currently I'm using this very simple Weight vs. Time model, but eventually I would include things like temperature, water and food consumption, humidity, etc.
I've run regression analyses on the datasets to determine the relationship between time and weight (it's likely quadratic, see image attached) and tried using randomForest in R to create a model. The test model seemed to work well in that its MAPE was similar to the training value, but that was obtained by taking out one house and using it as the test set.
Potentially what I've tried so far is completely the wrong methodology but this is a new area so I'm really not sure of the best approach.
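For what it's worth, the leave-one-house-out validation described above can be scripted directly. A minimal sketch, where the data frame name chickens and the mape helper are assumptions, and Weight, Day and House are the columns described in the question:
# Sketch only: hold out one house as the test set and score a random
# forest by MAPE, mirroring the validation described above.
library(randomForest)

mape <- function(actual, pred) 100 * mean(abs((actual - pred) / actual))

train <- subset(chickens, House != 10)  # hypothetical: House 10 held out
test  <- subset(chickens, House == 10)

rf <- randomForest(Weight ~ Day, data = train, ntree = 200)
mape(test$Weight, predict(rf, test))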

Is It Appropriate to Conduct Interrupted Time Series (ITS) Analysis or Repeated-Measures Panel Analysis When Intervention Start Dates Vary?

I am attempting to estimate the causal effect of intervention receipt (i.e., enrollment in a case management program) on a set of count outcomes (i.e., monthly visits to the doctor). Individuals enroll in the case management program at different points in time (e.g., an individual can enroll in the program anytime between 01/2017 and 01/2022). I have count data on the number of monthly doctor visits for each client for each of the 24 months prior to program enrollment and the 24 months following program enrollment. I want to estimate whether the number of doctor visits decreases following enrollment in the case management program.
Most of the interrupted time series (ITS) research for count data (e.g., negative binomial count models using tscount in R) I have come across uses population-level interventions which occur at one discrete time-point (e.g., July 1, 2018) instead of individual-level interventions which occur at varying time-points (e.g., one client enrolls on July 1, 2018; another client enrolls on January 1, 2019). I would appreciate any guidance on how to explore this question going forward (e.g., is an ITS design where intervention start dates vary across individuals even appropriate analytically or would some version of a repeated-measures panel approach with an intervention dummy be more appropriate)? Thanks!
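One way to frame the staggered starts is to re-align each client on event time (months since their own enrollment) and fit a repeated-measures count model with an enrollment dummy. A minimal sketch, with all names hypothetical (visits, doctor_visits, post, months_since_enrollment, client_id):
# Sketch only: negative binomial mixed model on event time; "post" is an
# enrollment dummy, and its interaction with event time allows both a
# level change and a slope change at enrollment, ITS-style.
library(glmmTMB)
m <- glmmTMB(doctor_visits ~ post * months_since_enrollment + (1 | client_id),
             family = nbinom2, data = visits)
summary(m)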

Should I use Friedman test or Mixed Model for my data in R? Nested or not?

My response variable is the proportion of range exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the difference in proportion of range exposed (DPRE) between the historical period and future greenhouse-gas emission scenarios (a measure of the increase/decrease in the percentage of range exposed): this means my response variable runs from -1 to 1 (where +1 implies the range will experience a +100% increase in the proportion of exposure: from 0% in the historical period to 100% in the future scenario).
As said, I am analyzing these differences for all terrestrial mammals (5311 species), across different scenarios and for two time periods: the near future (means over 2021-2040) and the far future (means over 2081-2100).
So, my explanatory variables are:
3 scenarios of greenhouse-gas emissions (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5311 individuals.
I am not very expert in statistics, so I'm not sure which of the two suggestions I received to follow:
A Friedman test with Species as blocks (though I should then somehow build a nested model, with RCPs as groups nested within TimePeriods, or a sort of two-way Friedman with RCP and TimePeriod as the two factors).
A linear mixed model with RCP*TimePeriod as fixed effects and (TimePeriod | Species) as random effects; a sketch of this specification follows.
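A minimal sketch of that second suggestion, assuming a long-format data frame (here called exposure, a hypothetical name) with columns DPRE, RCP, TimePeriod and Species:
# Sketch only: RCP-by-TimePeriod interaction as fixed effects, with a
# random TimePeriod slope per species.
library(lme4)
m <- lmer(DPRE ~ RCP * TimePeriod + (TimePeriod | Species), data = exposure)
summary(m)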
I ran normality checks, and none of the distributions turned out to be normal, which is why I was advised to use the Friedman test instead of ANOVA; I also ran pairwise Wilcoxon rank-sum tests, and there I found significant differences between NF and FF for all RCPs.
I should add that I ran 3 Wilcoxon tests, one for each RCP, so maybe a third option would be to create 3 different models, one for each RCP, but this would also depart from the standard repeated-measures setting of the Friedman test.
One last consideration: I have to run another model where the response variable is the difference in proportion of subrange exposed. The other explanatory variables are maintained, but in this case the analysis is not global: it takes into consideration the differences that may exist across 14 IUCN biomes. So every analysis is made across RCPs, for NF and FF, and for every biome. Should I create and run 14 (biomes) x 3 (RCPs) x 2 (time periods) = 84 models in this case? Or a sort of doubly nested (TimePeriods and Biomes) model, as sketched below?
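Rather than 84 separate fits, one option might be a single mixed model with Biome added to the fixed-effects interaction. A sketch only, where DPSE (the subrange response) and Biome are assumed column names in the same hypothetical exposure data frame:
# Sketch only: one model across biomes instead of 84 separate fits.
library(lme4)
m2 <- lmer(DPSE ~ RCP * TimePeriod * Biome + (TimePeriod | Species),
           data = exposure)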
If necessary I can provide the large dataframe.

R: lmer coding for a (random) discontinuous time for all subjects with multiple treatments

I have a set of data from a psychological experiment in which subjects were randomly assigned to one of four treatment conditions and their wellbeing w was measured on six different occasions. The exact day of measurement on each occasion differs slightly from subject to subject. The first measurement occasion for all subjects is day zero.
I analyse this with lmer:
model.a <- lmer(w ~ day * treatment + (day | subject),
                REML = FALSE,
                data = exper.data)
Following a simple visual inspection of the change-trajectories of subjects, I'd now like to include (and examine the effect of including) the possibility that the slope of the line for each subject changes at a point mid-way between measurement occasion 3 and 4.
I'm familiar with modelling an alteration in slope by including an additional time variable in the lmer specification. The approach is described in chapter 6 ('Modeling non-linear change') of the book Applied Longitudinal Data Analysis by Singer and Willett (2003). Following their advice, each measurement for each subject now carries an additional variable called latter.day. For measurements up to occasion 3, latter.day is zero; for later measurements, latter.day encodes the number of days after day 40 (the point at which I'd like to allow the slope change).
What I cannot see is how to adapt the lmer code from the Singer and Willett examples to my own problem, which includes the same point of slope change for all subjects as well as a between-subjects factor (treatment). I'd appreciate help writing the specification for lmer.
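A minimal sketch under the setup described above (reusing day, latter.day, treatment and subject): the piecewise term enters as a second time variable, interacted with treatment, with both time slopes varying by subject.
# Sketch only: piecewise growth model with a common knot at day 40,
# following the Singer & Willett composite-slope approach.
library(lme4)
model.b <- lmer(w ~ (day + latter.day) * treatment +
                  (day + latter.day | subject),
                REML = FALSE, data = exper.data)
anova(model.a, model.b)  # does allowing the slope change improve fit?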

How do I code a Mixed effects model for abalone growth in Aquaculture nutrition with nested individuals

I am a biologist working in aquaculture nutrition research, and until recently I hadn't paid much attention to the power of statistics. The usual method of analysis had been to run an ANOVA on the final weights of animals given various treatments and, boom, you have a result. I have tried to improve on this by designing an experiment that tracks individuals' growth over time, but I am having a really hard time understanding which model suits the data I have.
A simplified explanation of my experiment: I have 900 abalone/snails sourced from a single cohort (spawned/born at the same time). I individually marked each abalone (id) and recorded a length and weight at Time 0. The animals were then randomly assigned to 1 of 6 treatment diets, each treatment replicated n = 5 times (n = 150 abalone per treatment). The layout resembles a randomized block design in which each treatment appears only once within each block, and each treatment-block combination is assigned to an independent tank holding n = 30 abalone. Abalone were fed a known amount of feed for 90 days before being weighed and measured again (Time 1). They are now back in their tanks for another 90 days before the experiment concludes.
From my understanding:
fixed effects - Time, Treatment
nested random effects - replicate, id
My raw data are entered in long format, with each row a unique animal-by-time observation and columns for Time (0 or 1), Replicate (1-5), Treatment (1-6), Sex (M or F), Animal ID (1-900), Length (mm), Weight (g) and Condition Factor (Weight/Length^2.99 * 5655).
I have taken columns from my raw data and converted them to factors and vectors before using the new variables to create a data frame.
# Convert columns of the long-format data to factors/vectors
id        <- as.factor(data.long[, 5])   # Animal ID (1-900)
time      <- as.factor(data.long[, 1])   # Time (0 or 1)
replicate <- as.factor(data.long[, 2])   # Replicate (1-5)
treatment <- as.factor(data.long[, 3])   # Treatment (1-6); as.factor so diets aren't treated as numeric
weight    <- as.vector(data.long[, 7])   # Weight (g)
length    <- as.vector(data.long[, 6])   # Length (mm)
cf        <- as.vector(data.long[, 10])  # Condition factor
My data frame is currently in the following structure:
df1<-data.frame(time,replicate,treatment,id,weight,length,cf)
I am struggling to understand how to nest my individual abalone within replicates. I could convert the weight data to change from initial values, but I think the nlme package already accounts for this change when the model is coded correctly. I could also create another measure, specific growth rate, for each animal at Time 1, but this would not allow the Time factor to be used.
lme(weight ~ time * treatment, random = ~ 1 | id, method = "ML", data = df1)
I would like to structure a mixed-effects model so that my code takes individual animal variability into account when testing for differences in weight at Time 1 between treatments.
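A minimal sketch of the nesting, using the df1 built above: in nlme, individuals nested within replicate tanks are written with the / operator in the random-effects formula.
# Sketch only: random intercepts for replicate tanks and for individual
# animals nested within them.
library(nlme)
m <- lme(weight ~ time * treatment,
         random = ~ 1 | replicate/id,
         method = "ML", data = df1)
summary(m)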
