How to compare temperature data over a period of time - r

My aim is to evaluate the effect of a treatment applied to a canopy on microclimate data, compared to a control. Therefore I put three data loggers in the canopy at each of 5 sites and for each variant ("treatment applied" vs. "control"). Data is averaged every 5 minutes over a period of 217 days. The logged data looks like this:
Timepoint,Time,Celsius(°C),Humidity(%rh),dew point(°C)
1,27/03/2019 17:02:39,23.5,37.5,8.2
2,27/03/2019 17:07:39,23.5,36.5,7.8
3,27/03/2019 17:12:39,23.5,36.5,7.8
4,27/03/2019 17:17:39,24.0,37.5,8.6
5,27/03/2019 17:22:39,23.5,36.0,7.6
6,27/03/2019 17:27:39,23.0,37.0,7.5
7,27/03/2019 17:32:39,22.5,34.5,6.1
8,27/03/2019 17:37:39,22.5,34.5,6.1
Records are summarized daily to obtain the mean/max/min temperature for each of the 217 days. Regardless of the site, I want to determine the effect of the applied treatment and to expose the differences over time.
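For concreteness, a sketch of that daily summary in R; the data frame name logs and the cleaned column names Time and Celsius are assumptions:
# Assuming the logger file has been read into 'logs' with (renamed) columns
# Time (e.g. "27/03/2019 17:02:39") and Celsius (temperature)
logs$Day <- as.Date(logs$Time, format = "%d/%m/%Y %H:%M:%S")
daily <- aggregate(Celsius ~ Day, data = logs,
                   FUN = function(x) c(mean = mean(x), min = min(x), max = max(x)))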
I was told that time series analysis doesn't work here. I tried to apply linear regression to the data (inspired by this paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234436), but since the control does not affect the treatment I discarded this approach.
So my question is: which method would be the proper way to analyse this microclimatic data in R?

You can try running a linear regression with Time as a function of humidity and temperature (Celsius) for the control and the treatment separately, and then compare the slopes of the two models for each site. If you get a larger slope for your treatment than for your control, this indicates a response to the treatment; the larger the difference between the slopes, the stronger the response to the treatment.
The models would go something like this (for a single site):
fit_control   <- lm(Time ~ Celsius + Humidity, data = ControlData)
fit_treatment <- lm(Time ~ Celsius + Humidity, data = TreatmentData)
Then you can start working with the coefficients and derive results from their differences and from the general slope of the regression line for each site. After that, you can even combine the results by averaging the coefficients of the 5 control regressions and comparing them to the average of the 5 treatment regressions (since the model is linear, this should be statistically valid). A sketch of this comparison is shown below.
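A minimal sketch of that comparison for one site, assuming Time has been converted to a numeric value (e.g. days since the start of logging) and using the two fitted models above; the list names are placeholders:
# Difference in coefficients between treatment and control for one site
coef(fit_treatment) - coef(fit_control)

# Collect one fitted model per site in a list (hypothetical names) and
# average the coefficients across the 5 sites
control_fits   <- list(site1 = fit_control)      # ...add the remaining sites
treatment_fits <- list(site1 = fit_treatment)
rowMeans(sapply(control_fits, coef))
rowMeans(sapply(treatment_fits, coef))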

Related

Extremely wide confidence interval for a significant coefficient in a GLMM logistic regression. Due to my approach? Or something else?

I have a concern with a GLMM I am running and I would be very grateful if you could help me out.
I am modelling the factors that cause a frog species to make either a type 1 or type 2 calls. I am using a GLMM logistic regression. The data from this were generated from recordings of individuals in frog choruses of various sizes. For each male in the dataset, I randomly chose 100 of his calls, and then determined if they were type 1 or 2 (type 1 call =0, type 2 call =1). So each frog is represented by the same number of calls (100 calls), and some frogs are represented in several choruses of different sizes (total n= 12400). The response variable is whether each call in the dataset is type 1 or 2, and my fixed effects are: the size of the chorus a frog is calling in (2,3,4,5,6), the body condition of the frog (residuals from an LM of mass on body length), and standardized body length (SVL) (body length and body condition score are not correlated so no VIF issues). I included frog ID and the chorus ID as random intercepts.
Model results
The model looks fine, and the coefficients seem sensible; they are about what I expect. The only thing that worries me is that, when I calculate the 95% CIs for the coefficients, the one for body condition has a huge range (-9.7 to 6.3; see screenshot). This seems extreme, and even when exponentiated it is still extreme (0 to 492). Is this reasonable?
This variable was involved in a significant interaction with chorus size; does this explain a wide CI? Or does this suggest my approach is flawed? Instead of having each male equally represented in the dataset by 100 calls in each chorus he is in, should I instead collapse that down to a proportion (e.g. proportion of type 2 calls out of the 100 randomly selected calls for each male) and model this as a poisson regression or something? Is the way I’m doing my logistic regression a reasonable approach? I have run model checks and everything and they all seem to point to logistic regression being suitable for my data, at least as I have set it up currently.
Thanks for any help you can provide!
Values I get after standardizing condition:
2.5 % 97.5 %
.sig01 2.0948132676 3.1943483
.sig02 0.0000000000 2.0980214
(Intercept) -3.1595281536 -1.2902779
chorus_size 0.8936643930 1.0418465
cond_resid -0.8872467384 0.5746653
svl -0.0865697646 1.2413117
chorus_size:cond_resid -0.0005998784 0.1383067
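For reference, a minimal sketch of the model as described and of how such an interval table can be produced with lme4; the data frame calls and the column names call_type, chorus_size, cond_resid, svl, frog_id and chorus_id are assumptions:
library(lme4)

# Binary call type modelled with a chorus size x condition interaction, SVL,
# and random intercepts for frog and chorus (hypothetical column names)
m <- glmer(call_type ~ chorus_size * cond_resid + svl +
             (1 | frog_id) + (1 | chorus_id),
           data = calls, family = binomial)

# Profile confidence intervals like the table above
confint(m)   # method = "profile" is the default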

How to create and analyze a time series with variable test frequency in R

Here is a short description of the problem I am trying to solve: I have test data for multiple variables (weight, thickness, absorption, etc.) that are taken at varying intervals over time - no set schedule, sometimes a test a day, sometimes days might pass between tests. I want to detect trends in each of these and alert stakeholders when any parameter is trending up/down more than a certain amount. I first fit a linear model between each variable's raw data and test time (converted to days or weeks since a fixed date) and created a table with the slope for each variable, so the stakeholders can view one table for all variables and quickly see if any of them raises concern. The issue is that the data for most variables is very noisy. Someone suggested using time series functions, separating noise and seasonality from the trends, and studying the trend component for a cleaner analysis. I started to look into this and already have a couple of concerns/questions:
1. Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals?
2. If one gets over the issue in #1 above, decomposes the data, and separates out the trend (i.e. removes in particular the random variation/noise), how would you then get a slope metric from it? Namely, if I wanted to fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable? Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test dates/times, say converted to weeks or days from a fixed date)?
Finally, is there a better way of accomplishing what I explained above? I am only looking for general trends over time - say over 3 months of data, not day to day trends.
Time series models are generally used to see whether previous observations of a variable influence future observations: you model under the assumption that the previous observations are able to predict the future ones. That is the reason most (not all) time series models require evenly spaced instances of training data. If your data is not only very noisy but also not collected on a regular basis, then you should seriously consider whether time series analysis is the appropriate modelling choice.
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals.
What you can do is create an aggregate by increasing the time bucket (shifting from daily data to a weekly average, for instance) so that every unit of time has an instance of training data. Following your final comment, you could instead average the observations over the last 3 months of data.
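A minimal sketch of such an aggregation in base R, assuming a data frame tests with a Date column date and a numeric column value (both names are assumptions):
# Bucket irregular test dates into calendar weeks and average within each week
tests$week <- cut(tests$date, breaks = "week")
weekly     <- aggregate(value ~ week, data = tests, FUN = mean)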
If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (ie. take out particularly the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable?
In the simplest case of a linear model, the independent variable is the unit of time corresponding to the prediction you are trying to make. However, this is not always regarded as a time series model.
In the case of an autoregressive model, the independent variable would be the previous observation of the series you are trying to predict, i.e. y(t) modelled as a function of y(t-1), for instance multiplied by a coefficient. I encourage you to read Forecasting: Principles and Practice, which is an excellent book on the matter.
Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
The decompose() function returns a list that includes trend: a vector of the estimated trend component, each element corresponding to its respective time value.
Let's create an example time series with a linear trend:
# Ten daily observations with an approximately linear trend
df <- data.frame(
  date = seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-10"), by = 1)
)
df$value <- jitter(seq(from = 1, to = nrow(df), by = 1))

# Build a ts object and extract the trend component of the decomposition
time_series <- ts(df$value, frequency = 5)
df$trend <- decompose(time_series)$trend
> df
date value trend
1 2021-01-01 0.9170296 NA
2 2021-01-02 1.8899565 NA
3 2021-01-03 3.0816892 2.992256
4 2021-01-04 4.0075589 4.042486
5 2021-01-05 5.0650478 5.046874
6 2021-01-06 6.1681775 6.051641
7 2021-01-07 6.9118942 7.074260
8 2021-01-08 8.1055282 8.041628
9 2021-01-09 9.1206522 NA
10 2021-01-10 9.9018900 NA
As you see, the trend component is already an estimate of the dependent variable at the corresponding time. In decompose(), the trend estimate is based on a centered moving average, which is why the values at the edges are NA.
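To get a slope metric from the trend component (your second question), one option is simply to regress the extracted trend on the original time axis, for example continuing the sketch above:
# Regress the trend component on days since the first observation;
# rows where the trend is NA are dropped automatically by lm()
df$days  <- as.numeric(df$date - min(df$date))
trend_lm <- lm(trend ~ days, data = df)
coef(trend_lm)["days"]   # estimated slope per day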

Should I use Friedman test or Mixed Model for my data in R? Nested or not?

My response variable is the Proportion of Range Exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the Difference of Proportion of Range Exposed (DPRE) from the historical period to future greenhouse gas emission scenarios (a measure of the increase/decrease in the percentage of range exposed). This means my response variable goes from -1 to 1 (where +1 implies that the range will experience a +100% increase in the proportion exposed: from 0% in the historical period to 100% in the future scenario).
As said, I am analyzing these differences for all terrestrial mammals (5311 species), across different scenarios and for two time periods: near future (means over 2021-2040) and far future (means over 2081-2100).
So, my explanatory variables are:
3 scenarios of greenhouse gas emissions (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5311 individuals.
I am not very experienced in statistics, so I'm not sure which of the two suggestions I received to follow:
Friedman test with Species as blocks (but in which I should somehow build a nested model, with RCPs as groups nested within TimePeriods; or a sort of two-way Friedman, with RCP and TimePeriod as the two different factors).
Linear mixed model with RCP*TimePeriod as fixed effects and (TimePeriod | Species) as random effects (a minimal sketch of this specification is shown below).
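A minimal sketch of that second suggestion, here written with lme4; the choice of package and the names dat, DPRE, RCP, TimePeriod and Species are assumptions, not a recommendation of this model over the alternatives:
library(lme4)

# Linear mixed model: RCP x TimePeriod fixed effects, random intercept and
# TimePeriod slope per species (hypothetical column names)
m <- lmer(DPRE ~ RCP * TimePeriod + (TimePeriod | Species), data = dat)
summary(m)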
I ran tests and all the distributions turned out to be non-normal, which is why I was advised to use Friedman instead of ANOVA; I also ran pairwise Wilcoxon rank-sum tests and in this case found significant differences between NF and FF for all RCPs.
I should say I ran 3 Wilcoxon tests, one for each RCP, so maybe a third option would be to create 3 different models, one for each RCP, but this would also move away from the standard "repeated measures" analysis of the Friedman test.
Last consideration: I have to run another model where the response variable is the Difference of Proportion of Subrange Exposed. The other explanatory variables are maintained, but the analysis is no longer global: it takes into consideration the differences that could be present across 14 IUCN biomes. So every analysis is made across RCPs, for NF and FF and for all biomes. Should I create and run 14 (biomes) x 3 (RCPs) x 2 (time periods) = 84 models in this case? Or a sort of doubly nested (time periods and biomes) model?
If necessary I can provide the large dataframe.

Regression model for a continuous dependent variable with count independent variables

I am currently working on a project where I have to estimate the average processing time of different work items (tasks).
I have the following panel data:
My sample size is n=2000 individual workers, and T=10 (each time interval is a four week period)
Independent variables: 51 different work items. I have count data for each work item (# of times it is performed by each worker over a four-week period)
Dependent variable: total working hours of the worker (over a four-week period)
The goal of my analysis is to find the regression coefficients (which are estimates of the average completion time of each work item). I may also include other regressors (besides the work-item counts), such as experience, age, etc., in my model.
y = B0 + B1*X1 + ... + Bk*Xk + e, where y = total working hours and Xj = # of times work item j is performed
Issues:
Right now, I finished cleaning and processing the data and I performed some exploratory data analysis.
Some work items have a lot of zeros (the work item is only performed once or twice by several workers in the time period).
From the VIFs, I can see that there is imperfect multicollinearity among the independent variables; some have a VIF of 5 to 6.
Questions:
Any advice on how I should specify my model?
Looking at boxplots and eliminating outliers for each regressor, I see that some regressors are highly skewed (due to lots of zeros).
I also plotted each regressor against the total completion time to see if there is any linear relation. Some look linear; others look more like a quadratic relation.
Is there any way to deal with the multicollinearity aside from eliminating the regressors that have a high VIF? I ask because I need to estimate the coefficient of each work item.
Should I set the intercept to 0? I know for sure that when ALL the regressors are 0 (the # of work items are all 0), I should have zero total working hours.
I would also welcome any advice/things that I should look into for this problem. Thanks!
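As a point of reference for the intercept question, a minimal sketch in R; work_counts and total_hours are hypothetical names for a data frame with one row per worker-period, the 51 count columns and the total hours:
# '0 +' drops the intercept, so that all-zero work-item counts imply
# zero predicted working hours (hypothetical data frame / column names)
fit <- lm(total_hours ~ 0 + ., data = work_counts)
summary(fit)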

How do I code a Mixed effects model for abalone growth in Aquaculture nutrition with nested individuals

I am a biologist working in aquaculture nutrition research and until recently I hadn't paid much attention to the power of statistics. The usual method of analysis had been to run an ANOVA on the final weights of animals given various treatments and, boom, you have a result. I have tried to improve my results by designing an experiment that can track individuals' growth over time, but I am having a really hard time trying to understand which model to use for the data I have.
For a simplified explanation of my experiment: I have 900 abalone/snails sourced from a single cohort (spawned/born at the same time). I individually marked each abalone (id) and recorded a length and weight at Time 0. The animals were then randomly assigned to 1 of 6 treatment diets (n=30 abalone per treatment per tank), each replicated n=5 times (n=150 abalone per treatment in total). Each replicate looks like a randomized block design where each treatment is replicated only once within each block and each is assigned to an independent tank with n=30 abalone per tank. Abalone were fed a known amount of feed for 90 days before being weighed and measured again (Time 1). They are back in their tanks for another 90 days before the experiment concludes.
From my understanding:
fixed effects - Time, Treatment
nested random effects - replicate, id
My raw data is entered in long format, with each row being a unique animal and columns for Time (0 or 1), Replicate (1-5), Treatment (1-6), Sex (M or F), Animal ID (1-900), Length (mm), Weight (g), and Condition Factor (Weight/Length^2.99 * 5655).
I have used columns from my raw data and converted them to factors and vectors before using the new variables to create a data frame.
# Convert the relevant columns of data.long to factors and numeric vectors
id        <- as.factor(data.long[, 5])
time      <- as.factor(data.long[, 1])
replicate <- as.factor(data.long[, 2])
treatment <- data.long[, 3]
weight    <- as.vector(data.long[, 7])
length    <- as.vector(data.long[, 6])
cf        <- as.vector(data.long[, 10])
My data frame is currently in the following structure:
df1<-data.frame(time,replicate,treatment,id,weight,length,cf)
I am struggling to understand how to nest my individual abalone within replicates. I could convert the weight data to the change from the initial value, but I think the nlme package already accounts for this change when the model is coded correctly. I could also create another measure, Specific Growth Rate, for each animal at Time 1, but this would not allow the Time factor to be used.
lme(weight ~ time*treatment, random = ~1 | id, method = "ML", data = df1)
I would like to structure a mixed effects model so that my code takes into account the individual animal variability to detect statistical differences in their weight at Time 1 between treatments.
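For the nesting question, one possible sketch with nlme is to nest id within replicate in the grouping formula, building on the call above; whether this random-effects structure fits the design is a judgement call, not a definitive specification:
library(nlme)

# Animals (id) nested within replicate tanks; fixed effects as before
fit <- lme(weight ~ time * treatment,
           random = ~ 1 | replicate/id,
           method = "ML", data = df1)
summary(fit)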
