Regression model for a continuous dependent variable with count independent variables - r

I am currently working on a project where I have to estimate the average processing time of different work items (tasks).
I have the following panel data:
My sample size is n=2000 individual workers, and T=10 (each time interval is a four week period)
Independent variables: 51 different work items. I have count data for each work item (# of times they are performed by each worker over a four week period)
Dependent variable: Total Working Hour of the worker (over a 4 week period)
The goal of my analysis is to find the regression coefficients (which are estimates of the average completion time of each work item). I may also include other regressors (besides the work-item counts) such as experience, age, etc. in my model.
y = B0 + B1*X1 + ... + Bk*Xk + e, where y is total working hours and X1, ..., Xk are the work-item counts
Issues:
Right now, I finished cleaning and processing the data and I performed some exploratory data analysis.
Some work items have a lot of zeros (the work item is only performed once or twice by several workers in the time period).
From the VIF values, I can see that there is imperfect multicollinearity among the independent variables; some independent variables have VIFs of 5 to 6.
Questions:
Any advice on how I should specify my model?
Looking at boxplots and eliminating outliers for each regressor, I see that some regressors are highly skewed (due to the many zeros).
I also plot each regressor against the total completion time to see if there is any linear relation. Some look linear; others look more like a quadratic relation.
Is there any way to deal with the multicollinearity aside from eliminating the regressors with high VIF? I ask because I need to estimate the coefficient of each work item.
Should I set the intercept to 0? I know for sure that when ALL the regressors are 0 (all work-item counts are 0), total working hours should be zero.
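For concreteness, the two specifications I am weighing would look roughly like this in R (the data frame and column names are placeholders):
# work_counts: one row per worker-period, containing the 51 item counts and total_hours
fit_int   <- lm(total_hours ~ ., data = work_counts)      # with intercept
fit_noint <- lm(total_hours ~ 0 + ., data = work_counts)  # intercept forced to zero
summary(fit_noint)
car::vif(fit_int)  # the VIFs mentioned above (car package)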
I would also welcome any advice/things that I should look into for this problem. Thanks!

Related

Should I use Friedman test or Mixed Model for my data in R? Nested or not?

I have my response variable, which is the Proportion of Range Exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the Difference of Proportion of Range Exposed (DPRE) from the historical period to future greenhouse gas emission scenarios (a measure of the increase/decrease in the percentage of range exposed): this means my response variable goes from -1 to 1 (where +1 implies that the range will experience a +100% increase in the proportion of exposure: from 0% in the historical period to 100% in the future scenario).
As said, I am analyzing these differences for all terrestrial mammals (5311 species), across different scenarios and for two time periods: the near future (means of 2021-2040) and the far future (means of 2081-2100).
So, my explanatory variables are:
3 scenarios of greenhouse gas emissions (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5311 individuals.
I am not very expert in statistics, so I'm not sure which of the two suggestions I received to follow:
Friedman test with Species as blocks (but in which I should somehow do a nested model, with RCPs as groups, nested within TimePeriods; or a sort of two way Friedman, with RCP and TimePeriod as the two different factors).
Linear Mixed Models with RCP*TimePeriod as fixed effects, and (TimePeriod | Species ) as random effects.
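For reference, suggestion 2 would look roughly like this in lme4 syntax (only a sketch; the data frame name and the DPRE column name are placeholders):
library(lme4)
# DPRE: difference of proportion of range exposed; one row per species x RCP x time period
m <- lmer(DPRE ~ RCP * TimePeriod + (TimePeriod | Species), data = dpre_data)
summary(m)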
I ran t-tests, and all the distributions turned out to be non-normal; this is why I was advised to use Friedman instead of ANOVA. I ran pairwise Wilcoxon rank-sum tests, and in that case I found significant differences between NF and FF for all RCPs.
I should say that I ran 3 Wilcoxon tests, one for every RCP, so maybe a third option would be to create 3 different models, one for every RCP, but this would also move away from the standard "repeated measures" analysis of the Friedman test.
Last consideration: I have to run another model, where the response variable is the Difference of Proportion of Subrange Exposed. The other explanatory variables are maintained, but in this case the analysis is not global; it takes into consideration the differences that could be present across 14 IUCN Biomes. So every analysis is made across RCPs, for NF and FF, and for all Biomes. Should I create and run 14 (biomes) x 3 (RCPs) x 2 (time periods) = 84 models in this case? Or a sort of double-nested (Time Periods and Biomes) model?
If necessary I can provide the large dataframe.

How to compare temperature data over a period of time

My aim is to evaluate the effect of a treatment (on microclimate data) applied to a canopy compared to a control. Therefore I placed three data loggers in the canopy at each of 5 sites, for each variant ("treatment applied" vs. "control"). Data is averaged every 5 minutes over a period of 217 days. The logged data look like this:
Timepoint,Time,Celsius(°C),Humidity(%rh),dew point(°C)
1,27/03/2019 17:02:39,23.5,37.5,8.2
2,27/03/2019 17:07:39,23.5,36.5,7.8
3,27/03/2019 17:12:39,23.5,36.5,7.8
4,27/03/2019 17:17:39,24.0,37.5,8.6
5,27/03/2019 17:22:39,23.5,36.0,7.6
6,27/03/2019 17:27:39,23.0,37.0,7.5
7,27/03/2019 17:32:39,22.5,34.5,6.1
8,27/03/2019 17:37:39,22.5,34.5,6.1
Records are summarized daily to obtain the mean/max/min temperature for each of the 217 days. Regardless of the site, I want to determine the effect of the treatment applied and to expose the differences over time.
I was told that Time Series Analysis doesn't work here. I tried to apply linear regression (inspired by this paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234436) to the data, but since the control does not affect the treatment I discarded this approach.
So my question is: which method would be the proper way to analyse this microclimatic data in R?
You can try running linear regression with Time as a function of humidity and Celsius for the control and the treatment separately, and then compare the slopes of both models for each site. Naturally, if you get a higher slope on your treatment than on your control, this indicates a response to the treatment: the higher the delta between the slopes, the stronger the response to the treatment.
The model would go something like this (for a single site):
lm(Time~Celsius+Humidity, data = ControlData)
lm(Time~Celsius+Humidity, data = TreatmentData)
Then you can start playing with the coefficients and derive results from the differences and from the general slope of the regression line for each site. After that, you can even combine the results by averaging the coefficients of the 5 control regressions and compare them to the average of the 5 treatment regressions (since the model is linear, this should be statistically valid).
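As a rough sketch of that comparison (reusing the data names from the model above; not a definitive recipe):
# Fit the same model to the control and treatment data for one site
fit_ctrl  <- lm(Time ~ Celsius + Humidity, data = ControlData)
fit_treat <- lm(Time ~ Celsius + Humidity, data = TreatmentData)
# Difference in fitted coefficients; larger deltas suggest a stronger treatment response
coef(fit_treat) - coef(fit_ctrl)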

How to understand RandomForestExplainer output (R package)

I have the following code, which basically tries to predict Species from the iris data using randomForest. What I'm really interested in is finding the best features (variables) that explain the species classification. I found the package randomForestExplainer to be the best suited to serve the purpose.
library(randomForest)
library(randomForestExplainer)
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE)
importance_frame <- randomForestExplainer::measure_importance(forest)
randomForestExplainer::plot_multi_way_importance(importance_frame, size_measure = "no_of_nodes")
The result of the code is the following plot:
Based on the plot, the key measures that explain why Petal.Length and Petal.Width are the best factors are these (the explanations are based on the vignette):
mean_min_depth – mean minimal depth calculated in one of three ways specified by the parameter mean_sample,
times_a_root – total number of trees in which Xj is used for splitting the root node (i.e., the whole sample is divided into two based on the value of Xj),
no_of_nodes – total number of nodes that use Xj for splitting (it is usually equal to no_of_trees if trees are shallow),
It's not entirely clear to me why higher times_a_root and no_of_nodes are better, and why lower mean_min_depth is better.
What is the intuitive explanation for that?
The vignette information doesn't help.
You would like a statistical model or measure to be a balance between "power" and "parsimony". The randomForest is designed internally to do penalization as its statistical strategy for achieving parsimony. Furthermore, the number of variables selected in any given sample will be less than the total number of predictors. This allows model building when the number of predictors exceeds the number of cases (rows) in the dataset. Early splitting or classification rules can be applied relatively easily, but subsequent splits become increasingly difficult to meet criteria of validity. "Power" is the ability to correctly classify items that were not in the subsample, for which a proxy, the so-called OOB or "out-of-bag" items, is used. The randomForest strategy is to do this many times to build up a representative set of rules that classify items under the assumption that the out-of-bag samples will be a fair representation of the "universe" from which the whole dataset arose.
The times_a_root measure would fall into the category of measuring the "relative power" of a variable compared to its "competitors". The times_a_root statistic measures the number of times a variable is "at the top" of a decision tree, i.e., how likely it is to be chosen first in the process of selecting split criteria. The no_of_nodes statistic measures the number of times the variable is chosen at all as a splitting criterion among all of the subsampled trees.
From:
?randomForest # to find the names of the object leaves
forest$ntree
[1] 500
... we can get a denominator for assessing the meaning of the roughly 200 values on the y-axis of the plot. About 2/5ths of the sampled trees had Petal.Length as the top split criterion, while another 2/5ths had Petal.Width as the top variable selected as the most important one. About 75 of 500 had Sepal.Length, while only about 8 or 9 had Sepal.Width (note that it's a log scale). In the case of the iris dataset, the subsamples would have ignored at least one of the variables in each subsample, so the maximum possible value of times_a_root would have been less than 500. Scores of around 200 are pretty good in this situation, and we can see that both of these variables have comparable explanatory ability.
The no_of_nodes statistic totals up the number of nodes, across all trees, in which that variable was used for splitting, remembering that the number of nodes is constrained by the penalization rules.
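If you want the raw numbers behind the plot rather than reading them off the log-scaled axes, you can inspect importance_frame directly (a sketch, assuming the measure columns named in the vignette: mean_min_depth, no_of_nodes, times_a_root):
# Sort variables by mean minimal depth (lower values mean the variable tends to split earlier)
importance_frame[order(importance_frame$mean_min_depth),
                 c("variable", "mean_min_depth", "no_of_nodes", "times_a_root")]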

R: lmer coding for a (random) discontinuous time for all subjects with multiple treatments

I have a set of data that came from a psychological experiment where subjects were randomly assigned to one of four treatment conditions and their wellbeing w measured on six different occasions. The exact day of measurement on each occasion differs slightly from subject to subject. The first measurement occasion for all subjects is day zero.
I analyse this with lmer :
model.a <- lmer(w ~ day * treatment + (day | subject),
REML=FALSE,
data=exper.data)
Following a simple visual inspection of the change-trajectories of subjects, I'd now like to include (and examine the effect of including) the possibility that the slope of the line for each subject changes at a point mid-way between measurement occasion 3 and 4.
I'm familiar with modeling the alteration in slope by including an additional time-variable in the lmer specification. The approach is described in chapter 6 ('Modeling non-linear change') of the book Applied Longitudinal Data Analysis by Singer and Willett (2005). Following their advice, for each measurement, for each subject, there is now an additional variable called latter.day. For measurements up to measurement 3, the value of latter.day is zero; for later measurements, latter.day encodes the number of days after day 40 (which is the point at which I'd like to include the possible slope-change).
What I cannot see is how to adjust the lmer coding of the examples in the Singer and Willett cases to suit my own problem ... which includes the same point-of-slope-change for all subjects as well as a between-subjects factor (treatment). I'd appreciate help on how to write the specification for lmer.
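For what it's worth, the piecewise specification I have in mind, using latter.day as defined above, would look roughly like this (an untested sketch only):
# day: days since the first measurement; latter.day: 0 up to the change point, then days after day 40
model.b <- lmer(w ~ (day + latter.day) * treatment + (day + latter.day | subject),
                REML=FALSE,
                data=exper.data)
anova(model.a, model.b)  # likelihood-ratio comparison against the single-slope model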

How do I code a Mixed effects model for abalone growth in Aquaculture nutrition with nested individuals

I am a biologist working in aquaculture nutrition research, and until recently I hadn't paid much attention to the power of statistics. The usual method of analysis had been to run ANOVA on final weights of animals given various treatments and boom, you have a result. I have tried to improve my results by designing an experiment that could track individuals' growth over time, but I am having a really hard time trying to understand which model to use for the data I have.
For a simplified explanation of my experiment: I have 900 abalone/snails which were sourced from a single cohort (spawned/born at the same time). I individually marked each abalone (id) and recorded a length and weight at Time 0. The animals were then randomly assigned to 1 of 6 treatment diets (n=30 abalone per treatment), each replicated n=5 times (n=150 abalone per replicate). Each replicate looks like a randomized block design where each treatment is replicated only once within each block, and each is assigned to an independent tank with n=30 abalone per tank (one treatment per tank). Abalone were fed a known amount of feed for 90 days before being weighed and measured again (Time 1). They are back in their homes for another 90 days before concluding the experiment.
From my understanding:
fixed effects - Time, Treatment
nested random effects - replicate, id
My raw data are entered in long format, with each row being a unique animal and columns for Time (0 or 1), Replicate (1-5), Treatment (1-6), Sex (M or F), Animal ID (1-900), Length (mm), Weight (g), and Condition Factor (Weight/Length^2.99*5655).
I have used columns from my raw data and converted them to factors and vectors before using the new variables to create a data frame.
id<-as.factor(data.long[,5])
time<-as.factor(data.long[,1])
replicate<-as.factor(data.long[,2])
treatment<-data.long[,3]
weight<-as.vector(data.long[,7])
length<-as.vector(data.long[,6])
cf<-as.vector(data.long[,10])
My data frame is currently in the following structure:
df1<-data.frame(time,replicate,treatment,id,weight,length,cf)
I am struggling to understand how to nest my individual abalone within replicates. I can convert the weight data to change from initial, but I think the package nlme already accounts for this change when coded correctly. I could also create another measure, Specific Growth Rate, for each animal at Time 1, but this would not allow the Time factor to be used.
lme(weight ~ time*treatment, random=~1 | id, method="ML", data=df1)
I would like to structure a mixed effects model so that my code takes into account the individual animal variability to detect statistical differences in their weight at Time 1 between treatments.
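For reference, the nesting I am trying to express would, I think, be written with nlme's / operator, roughly like this (a sketch only; whether replicate should enter this way is part of my question):
library(nlme)
# Random intercepts for replicate (tank block), with individual abalone (id) nested within replicate
model.1 <- lme(weight ~ time * treatment,
               random = ~ 1 | replicate/id,
               method = "ML",
               data = df1)
summary(model.1)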