I'm working through a project that is due imminently, and I have almost no experience with R, so I was wondering if someone could lay out, step by step, how to bootstrap my data. I have data comparing the diets of 2 carnivore species; each species feeds on 16 different prey items (almost all the same for both). I want to test whether the observed values (the frequency of each prey item) differ significantly from randomly generated values (1000 bootstrap iterations). I have very little idea of how to do this.
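For reference, the kind of thing I think I need is something like the following, though I'm really not sure it's right. It is a minimal sketch assuming my counts are in a data frame called diet with columns species, prey and count (those names are just placeholders), and it uses a Monte Carlo chi-squared test that compares the observed table with 1000 randomly generated tables (not strictly a bootstrap, but similar in spirit):

# 2 x 16 table of prey frequencies per species
tab <- xtabs(count ~ species + prey, data = diet)

# compare the observed table against 1000 randomly generated tables
set.seed(1)
chisq.test(tab, simulate.p.value = TRUE, B = 1000)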
I have a very large dataset (~55,000 data points) for chicken crops. Chickens are grown over a ~35-day period. The dataset covers 10 sheds of ~20,000 chickens each. Each shed contains weighing platforms, and as chickens step on them the recorded weight is sent to a server; data is sent continuously from day 0 to the final day.
The variables I have are: House (a number from 1 to 10), Weight (in grams, to 5 decimal places) and Day (a decimal between two integers, e.g. 12 noon on day 0 would be 0.5, while 23.3 means a third of the way through day 23, i.e. 8 AM; because the data is sent continuously, these values can be very precise).
I want to construct either a time-series regression model or an ML model so that, for a new crop, as data comes in from the sensors the model can predict what the end weight will be. Then, as each crop cycle finishes, it can be added to the training data, and so on.
Currently I'm using a very simple Weight vs. Time model, but eventually I would include things like temperature, water and food consumption, humidity, etc.
I've run regression analyses on the data sets to determine the relationship between time and weight (it's likely quadratic, see image attached) and tried using randomForest in R to create a model. The test model seemed to work well in that the test MAPE was similar to the training MAPE, but that was from taking out one house and using it as the test set.
Possibly what I've tried so far is completely the wrong methodology, but this is a new area for me, so I'm really not sure of the best approach.
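For what it's worth, my current attempt looks roughly like the sketch below (chicks is a placeholder name for my data frame, with the House, Weight and Day columns described above, and House 10 held out as the test set):

library(randomForest)

train <- subset(chicks, House != 10)
test  <- subset(chicks, House == 10)

# simple quadratic baseline: weight as a function of day
lm_fit <- lm(Weight ~ Day + I(Day^2), data = train)

# random forest on the same predictor
set.seed(1)
rf_fit <- randomForest(Weight ~ Day, data = train, ntree = 500)

# MAPE on the held-out house
pred <- predict(rf_fit, newdata = test)
mean(abs(test$Weight - pred) / test$Weight) * 100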
My response variable is the Proportion of Range Exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the Difference in Proportion of Range Exposed (DPRE) between the historical period and future greenhouse gas emission scenarios (a measure of how much the percentage of range exposed increases or decreases). This means my response variable goes from -1 to 1, where +1 implies that the range will experience a +100% increase in the proportion exposed (from 0% in the historical period to 100% in the future scenario).
As said, I am analyzing these differences for all terrestrial mammals (5,311 species), across different scenarios and for two time periods: near future (means over 2021-2040) and far future (means over 2081-2100).
So, my explanatory variables are:
3 greenhouse gas emission scenarios (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5,311 species.
I am not an expert in statistics, so I'm not sure which of the two suggestions I received to follow:
A Friedman test with Species as blocks (though I would somehow need a nested design, with RCPs as groups nested within Time Periods, or a sort of two-way Friedman with RCP and TimePeriod as the two factors).
A Linear Mixed Model with RCP*TimePeriod as fixed effects and (TimePeriod | Species) as random effects (roughly as sketched below).
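As I understand it, the second suggestion would look something like this in lme4 (just a sketch; dpre_data and the column names are placeholders for my actual data frame):

library(lme4)

# DPRE is the response; RCP, TimePeriod and Species are factors
m <- lmer(DPRE ~ RCP * TimePeriod + (TimePeriod | Species), data = dpre_data)
summary(m)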
I ran t-tests and all the distributions turned out to be non-normal; this is why I was advised to use Friedman instead of ANOVA. I also ran pairwise Wilcoxon rank-sum tests, and in that case I found significant differences between NF and FF for all RCPs.
I should say that I ran 3 Wilcoxon tests, one for each RCP, so maybe a third option would be to create 3 different models, one per RCP, but that would also move away from the standard "repeated measures" analysis of the Friedman test.
Last consideration: I also have to run another model, where the response variable is the Difference in Proportion of Subrange Exposed. The other explanatory variables are kept, but in this case the analysis is not global: it takes into account the differences that may exist across 14 IUCN biomes. So every analysis is made across RCPs, for NF and FF, and for all biomes. Should I create and run 14 (biomes) x 3 (RCPs) x 2 (time periods) = 84 models in this case? Or a sort of doubly nested (Time Periods and Biomes) model?
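To make the second option concrete, the single-model alternative I have in mind is something along these lines (again only a sketch with placeholder names; DPSE here stands for the Difference in Proportion of Subrange Exposed):

# Biome added as a further fixed factor; random intercept per species
# (or (TimePeriod | Species), as above)
m_biome <- lmer(DPSE ~ RCP * TimePeriod * Biome + (1 | Species),
                data = subrange_data)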
If necessary I can provide the large dataframe.
I am unsure which test to use in R. Here is an oversimplified version of the sampling procedure, with easy-looking numbers:
We have 20 patches of the same size in a field.
Inside these patches, we look for 50 different species (10 species of grass, and 40 species of insect).
Every time we find a species of grass, we record its coverage on a rough logarithmic scale from 1 to 4.
Every time we find a species of insect, we count them and record their abundance on a rough logarithmic scale from 1 to 4.
So my data sort of looks like this:
My problem is, how do I test which species are significantly associated? How do I detect clusters? Multivariate analysis? Half weight index? Bootstrap?
I'm not exactly gifted when it comes to statistics, so any help would be greatly appreciated!
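To show the kind of multivariate approach I have been wondering about (e.g. clustering species by how similarly they occur across patches), here is a sketch; abund is a placeholder for my 20 patches x 50 species matrix of the 1-4 scores:

library(vegan)

# distances between species (transpose so species are rows), then cluster
d <- vegdist(t(abund), method = "bray")
plot(hclust(d, method = "average"))

# ordination of the patches as another way to look for structure
ord <- metaMDS(abund)
plot(ord)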
I'm currently doing randomisation tests (RTs) for a single-case design in R. A bit of context about my phases:
I am doing an AB design, which was staggered using a multiple baseline design across 5 participants.
Here is my phase layout over 9 weeks, with 7 data points per week:
AA/BBBBBB
A = baseline; B = intervention; / = changeover phase with 7 possible start points
I have figured out how to calculate the randomisation test for all participants as a whole, but I'm struggling to do the randomisation test for each individual participant, as the function asks for a minimum phase length rather than the possible start points in a .txt file.
I'm using a Monte Carlo approach, drawing 1,000 randomisations for the distribution. For the RTs, I'm using the following code in R, based on a minimum of 14 baseline data points and a minimum of 42 intervention data points:
pvalue.random(design="AB",statistic="B-A",number=1000,limit=14)
For these calculations, R asks for the minimum number of data points per phase (limit). Technically, the minimum number of data points for phase B is 42. However, as the changeover phase is only 7 days, there is no way to also have a 42-point baseline, so I've set the minimum phase length to 14 to reflect the 2-week baseline. Is this right?
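For reference, my counts of data points per phase work out as follows (simple arithmetic, based on 7 data points per week and the AA/BBBBBB layout above):

per_week <- 7
2 * per_week   # 14 baseline (A) data points, the value I pass to limit
6 * per_week   # 42 intervention (B) data points
9 * per_week   # 63 data points per participant in total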
I am currently working on a project where I have to estimate the average processing time of different work items (tasks).
I have the following panel data:
My sample size is n = 2,000 individual workers, and T = 10 (each time interval is a four-week period).
Independent variables: 51 different work items. I have count data for each work item (the number of times it is performed by each worker over a four-week period).
Dependent variable: total working hours of the worker (over the same four-week period).
The goal of my analysis is to find the regression coefficients (which are estimates of the average completion time of each work item). I may also include other regressors (besides the work-item counts) such as experience, age, etc. in my model.
y = B0 + B1*X1 + ... + Bk*Xk + e, where y = total working hours and the X's = # of each work item performed.
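In R, I'm currently specifying this baseline model roughly as follows (a sketch; dat and the column names hours and item_1 ... item_51 are placeholders for my panel data):

# build the formula hours ~ item_1 + ... + item_51
items <- paste0("item_", 1:51)
f <- reformulate(items, response = "hours")

fit <- lm(f, data = dat)
summary(fit)$coefficients   # estimates of the average time per work item (in hours)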
Issues:
Right now, I have finished cleaning and processing the data and have performed some exploratory data analysis.
Some work items have a lot of zeros (the work item is performed only once or twice by several workers in the time period).
From the VIFs, I can see that there is imperfect multicollinearity among the independent variables; some have VIFs of 5 to 6.
Questions:
Any advice on how I should specify my model?
When I look at boxplots and eliminate outliers for each regressor, I see that some regressors are highly skewed (due to the large number of zeros).
I also plotted each regressor against the total completion time to see whether there is a linear relation. Some do look linear; others look more like a quadratic relation.
Is there any way to deal with the multicollinearity other than eliminating the regressors with high VIF? I ask because I need to estimate the coefficient of each work item.
Should I set the intercept to 0? I know for sure that when ALL the regressors are 0 (all work-item counts are 0), total working hours should be zero. (See the sketches after these questions.)
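To make the last two questions concrete, here are the two sketches I've been experimenting with (same placeholder names as above; I'm not sure either is the right direction):

# (a) dropping the intercept, since zero work items should mean zero hours
fit0 <- lm(hours ~ 0 + ., data = dat[, c("hours", items)])

# (b) ridge regression, to keep all 51 coefficients despite the collinearity
library(glmnet)
x <- as.matrix(dat[, items])
y <- dat$hours
cv <- cv.glmnet(x, y, alpha = 0, intercept = FALSE)   # alpha = 0 gives ridge
coef(cv, s = "lambda.min")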
I would also welcome any advice/things that I should look into for this problem. Thanks!