Fitting a Poisson GLM in R with aggregated count data - r

I have a dataset of the number of stranded turtles reported at a variety of locations along the Queensland coast of Australia. What I would like to find out is the number of stranded turtles that are NOT reported at each of these locations. In order to estimate that number, I have collected data on the frequency with which a turtle is reported to a stranding location; i.e. how often is a single turtle stranding reported more than one time at about 20 points along the coast? So I have count data which indicates the number of turtles that are reported to a stranding location one time, two times, or three or more times. Ultimately I would like to relate these data to covariates such as local population density and distance to the nearest road, in order to predict the "zero reporting" incidence for the rest of the coastal areas as well.
My data should look something like this, then:
loc<-c("A","B","C")
rep1<-c(51,24,10)
rep2<-c(4,8,3)
rep3ormore<-c(2,1,0)
pop<-c(50,1000,100)
turtle <- cbind.data.frame(loc, rep1, rep2, rep3ormore, pop)
There are other possible covariates, but I'll keep it simple for now! I think this should be able to be done using a Poisson distribution, but I'm having trouble wrapping my head around how to do it.
Additionally, in certain instances I don't have exact numbers for the turtles that have been reported, but instead I have categories; 4-6, 7-10, >10, etc. If there's a way to model that possibility, that would be great as well!
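One possible starting point for the setup described above, sketched under strong assumptions: the number of reports per stranded turtle at a location is treated as Poisson with mean lambda, and turtles with zero reports are never observed, which gives a zero-truncated Poisson over the categories one, two, and three or more. The helper names negloglik and estimate_unreported are hypothetical, each location is fitted separately, and covariates such as pop are not modeled here.

```r
# Data from the question
turtle <- data.frame(
  loc = c("A", "B", "C"),
  rep1 = c(51, 24, 10),
  rep2 = c(4, 8, 3),
  rep3ormore = c(2, 1, 0),
  pop = c(50, 1000, 100)
)

# Negative log-likelihood for one location: multinomial over the categories
# 1, 2, 3+ with zero-truncated Poisson cell probabilities (an assumption).
negloglik <- function(log_lambda, n1, n2, n3plus) {
  lambda <- exp(log_lambda)
  p_pos <- 1 - dpois(0, lambda)                       # P(at least one report)
  p1 <- dpois(1, lambda) / p_pos
  p2 <- dpois(2, lambda) / p_pos
  p3 <- ppois(2, lambda, lower.tail = FALSE) / p_pos  # P(three or more)
  -(n1 * log(p1) + n2 * log(p2) + n3plus * log(p3))
}

# Estimate lambda, then scale the observed turtles by P(0) / P(>0)
estimate_unreported <- function(n1, n2, n3plus) {
  fit <- optimize(negloglik, interval = c(-10, 5),
                  n1 = n1, n2 = n2, n3plus = n3plus)
  lambda <- exp(fit$minimum)
  n_seen <- n1 + n2 + n3plus
  p0 <- dpois(0, lambda)
  c(lambda = lambda, unreported = n_seen * p0 / (1 - p0))
}

est <- t(mapply(estimate_unreported,
                turtle$rep1, turtle$rep2, turtle$rep3ormore))
rownames(est) <- turtle$loc
print(est)
```

The same likelihood idea could in principle be extended to the binned counts (4-6, 7-10, >10) by summing Poisson probabilities over each bin, but that is not shown here.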

Related

Time series forecasting of outcome variable based on current performance of outcome variable in R

I have a very large dataset (~55,000 datapoints) for chicken crops. Chickens are grown over ~35 day period. The dataset covers 10 sheds of ~20,000 chickens each. In the sheds are weighing platforms and as chickens step on them they send the weight recorded to a server. They are sending continuously from day 0 to the final day.
The variables I have are: House (as a number, House 1 up to House 10), Weight (measured in grams, to 5 decimal places) and Day (measured as a fractional number, e.g. 12 noon on day 0 would be 0.5, whereas 23.3 is roughly a third of the way through day 23 (about 8AM); as this data is sent continuously the numbers can be very precise).
I want to construct either a Time Series Regression model or an ML model so that if I take a new crop, as data is sent by the sensors, the model can make a prediction for what the end weight will be. Then as that crop cycle finishes it can be added to the training data and repeat.
Currently I'm using this very simple Weight VS Time model, but eventually would include things like temperature, water and food consumption, humidity etc.
I've run regression analyses on the data sets to determine the relationship between time and weight (it's likely quadratic, see image attached) and tried using randomForest in R to create a model. The test model seemed to work well, in that the test MAPE value was similar to the training value, but that was by taking out one house and using that as the test.
Potentially what I've tried so far is completely the wrong methodology but this is a new area so I'm really not sure of the best approach.
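The hold-out-one-house check described above can be repeated for every house in turn. The following is only a sketch of that idea: the data frame sim is simulated (made-up growth curve and noise, not the real sensor data), the helper mape is hypothetical, and a plain quadratic lm is swapped in for randomForest to keep the example in base R.

```r
set.seed(1)

# Simulated stand-in for the real data: 10 houses, days 0 to 35
sim <- expand.grid(House = 1:10, Day = seq(0, 35, by = 0.5))
mean_weight <- 40 + 20 * sim$Day + 1.5 * sim$Day^2   # assumed quadratic growth
sim$Weight <- mean_weight * exp(rnorm(nrow(sim), sd = 0.05))

# Mean absolute percentage error, in percent
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100

# Leave one house out at a time: fit on the other nine, score on the held-out one
houses <- sort(unique(sim$House))
cv_mape <- sapply(houses, function(h) {
  train <- sim[sim$House != h, ]
  test <- sim[sim$House == h, ]
  fit <- lm(Weight ~ poly(Day, 2), data = train)
  mape(test$Weight, predict(fit, newdata = test))
})
names(cv_mape) <- paste0("House", houses)
print(round(cv_mape, 2))
```

Averaging cv_mape over all houses gives a less optimistic picture than a single train/test split, since every house serves as unseen data once.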

Should I use Friedman test or Mixed Model for my data in R? Nested or not?

My response variable is the Proportion of Range Exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the Difference of Proportion of Range Exposed (DPRE) between the historical period and future greenhouse gas emission scenarios (a measure of the increase/decrease in the percentage of range exposed), so my response variable goes from -1 to 1 (where +1 implies that the range will experience a +100% increase in the proportion of exposure: from 0% in the historical period to 100% in the future scenario).
As said, I am analyzing these differences for all terrestrial mammals (5311 species), across different scenarios and for two time periods: near future (means of 2021-2040) and far future (means of 2081-2100).
So, my explanatory variables are:
3 scenarios of greenhouse gas emissions (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5311 individuals.
I am not an expert in statistics, so I'm not sure which of the two suggestions I received to follow:
Friedman test with Species as blocks (but in which I should somehow do a nested model, with RCPs as groups, nested within TimePeriods; or a sort of two way Friedman, with RCP and TimePeriod as the two different factors).
Linear Mixed Models with RCP*TimePeriod as fixed effects, and (TimePeriod | Species ) as random effects.
I ran t-tests, and none of the distributions turned out to be normal, which is why I was advised to use Friedman instead of ANOVA; I also ran pairwise Wilcoxon rank sum tests, and found significant differences between NF and FF for all RCPs.
I should say that I ran 3 Wilcoxon tests, one for each RCP, so maybe a third option would be to create 3 different models, one for each RCP, but this would also depart from the standard "repeated measures" analysis of the Friedman test.
One last consideration: I have to run another model, where the response variable is the Difference of Proportion of Subrange Exposed. The other explanatory variables are maintained, but the analysis is not global: it takes into consideration the differences that could be present across 14 IUCN Biomes. So every analysis is made across RCPs, for NF and FF, and for all Biomes. Should I create and run 14 (biomes) x 3 (RCPs) x 2 (Time Periods) = 84 models in this case? Or a sort of doubly nested (Time Periods and Biomes) model?
If necessary I can provide the large dataframe.

Understanding TSA::periodogram()

I have some data sampled at regular intervals that looks sinusoidal and I would like to determine the frequency of the wave. To that end I obtained R and loaded the TSA package, which contains a function named 'periodogram'.
In an attempt to understand how it works I created some data as follows:
x<-.0001*1:260
This could be interpreted to be 260 samples with an interval of .0001 seconds
Frequency=80
The frequency could be interpreted to be 80Hz so there should be about 125 points per wave period
y<-sin(2*pi*Frequency*x)
I then do:
foo=TSA::periodogram(y)
In the resulting periodogram I would expect to see a sharp spike at the frequency that corresponds to my data - I do see a sharp spike but the maximum 'spec' value has a frequency of 0.007407407, how does this relate to my frequency of 80Hz?
I note that there is variable foo$bandwidth with a value of 0.001069167 which I also have difficulty interpreting.
If there are better ways of determining the frequency of my data I would be interested - my experience with R is limited to one day.
The periodogram is computed from the time series without knowledge of your actual sampling interval. This results in frequencies that are limited to the normalized [0,0.5] range (cycles per sample). To obtain a frequency in Hertz that takes the sampling interval into account, you simply need to multiply by the sampling rate. In your case, the spike at a normalized frequency of 0.007407407, combined with a sampling rate of 10,000Hz, corresponds to a frequency of ~74Hz.
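The conversion can be sketched in a few lines. This example uses spec.pgram from base R as a stand-in for TSA::periodogram (an assumption made to avoid the extra package; tapering and detrending are switched off to get a raw periodogram), and the variable names sampling_rate, peak_norm and peak_hz are introduced here for illustration. If both functions pad the series the same way, the peak should land on the same normalized frequency as in the question.

```r
# Signal from the question: 260 samples, 0.0001 s apart, 80 Hz tone
sampling_rate <- 10000               # 1 / 0.0001 s, in Hz
x <- 0.0001 * 1:260
y <- sin(2 * pi * 80 * x)

# Raw periodogram; frequencies are returned in cycles per sample
sp <- spec.pgram(y, taper = 0, detrend = FALSE, plot = FALSE)

peak_norm <- sp$freq[which.max(sp$spec)]  # normalized frequency of the spike
peak_hz <- peak_norm * sampling_rate      # the same spike expressed in Hz

print(c(normalized = peak_norm, hertz = peak_hz))
```

Alternatively, passing ts(y, frequency = sampling_rate) instead of the bare vector makes spec.pgram report sp$freq directly in Hz.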
Now, that's not quite 80Hz (the original tone frequency), but you have to keep in mind that a periodogram is a frequency spectrum estimate, and its frequency resolution is limited by the number of input samples. In your case you are using 260 samples, so the frequency resolution is on the order of 10,000Hz/260 or ~38Hz. Since 74Hz is well within 80 +/- 38Hz, it is a reasonable result. To get a better frequency estimate you would have to increase the number of samples.
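To see the effect of the sample count on the estimate, the same tone can be analysed at two lengths. This is a sketch using base R's spec.pgram rather than TSA::periodogram (an assumption), and peak_hz is a hypothetical helper defined only for this example.

```r
sampling_rate <- 10000   # Hz, from the 0.0001 s sampling interval
tone_hz <- 80            # true frequency of the simulated tone

# Location of the periodogram peak, in Hz, for a tone of n samples
peak_hz <- function(n) {
  y <- sin(2 * pi * tone_hz * (1:n) / sampling_rate)
  sp <- spec.pgram(y, taper = 0, detrend = FALSE, plot = FALSE)
  sp$freq[which.max(sp$spec)] * sampling_rate
}

short_est <- peak_hz(260)    # the length used in the question
long_est <- peak_hz(2600)    # ten times as many samples

print(c(short = short_est, long = long_est,
        resolution_short = sampling_rate / 260,
        resolution_long = sampling_rate / 2600))
```

With ten times as many samples the spacing between candidate frequencies shrinks by roughly a factor of ten, so the peak can sit much closer to the true 80Hz.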
Note that the periodogram of a sinusoidal tone will typically spike near the tone frequency and decay on either side (a phenomenon caused by the limited number of samples used for the estimation, often called spectral leakage) until the value can be considered comparatively negligible. The foo$bandwidth variable is expressed in the same normalized units, so it corresponds to 0.001069167*10000Hz ~ 10.7Hz. It describes the effective bandwidth of the spectral estimate itself, which for a raw periodogram is a fraction of the spacing between adjacent frequencies, rather than a frequency above which the input signal contains less energy.

Suggestions for clustering methods

I have two time series of meteorological measurements (i.e., X and Y). Both X and Y time series were constructed using daily measurements over a period of one year. By plotting X time series versus Y times series as a scatterplot and connecting all the points by date in ascending order, a closed loop is obtained representing the annual cycle. I have measurements at N locations and thus I have N loops (i.e., annual cycles) which I want to cluster to find those that have similar shapes.
With so many clustering methods, I am not sure which one will be more appropriate to use for this analysis (initially I was thinking to use self-organizing maps).
Thank you very much for any suggestions.
Unless you have too many time series, I suggest starting with hierarchical clustering. It's easy to interpret because of the dendrogram.
For similarity, a cyclic version of DTW may be good, assuming that there is some delay between different locations.
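A minimal sketch of the hierarchical clustering suggestion, using only base R. The loops here are simulated (three roughly circular and three elongated annual cycles, purely made up), and plain Euclidean distance between the day-aligned X and Y values is swapped in for the cyclic DTW mentioned above, so it assumes there is no delay between locations.

```r
set.seed(42)
n_loc <- 6
n_day <- 365
theta <- 2 * pi * (seq_len(n_day) - 1) / n_day

# One row per location: 365 X values followed by 365 Y values
loops <- t(sapply(seq_len(n_loc), function(i) {
  stretch <- if (i <= 3) 1 else 3          # circular vs elongated loops
  x <- stretch * cos(theta) + rnorm(n_day, sd = 0.05)
  y <- sin(theta) + rnorm(n_day, sd = 0.05)
  c(x, y)
}))
rownames(loops) <- paste0("loc", seq_len(n_loc))

d <- dist(loops)                       # Euclidean distance between loops
hc <- hclust(d, method = "average")    # agglomerative clustering
groups <- cutree(hc, k = 2)            # cut the dendrogram into two clusters
print(groups)
```

Calling plot(hc) draws the dendrogram; a DTW-based dissimilarity could replace dist() by supplying any object of class "dist" to hclust.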

How to structure stratified data for Poisson regression

I'm trying to use R to conduct Poisson regression on some data that I have. The current structure of the data is as follows:
Data is stratified based on three occupations. There are four levels of income in the data. Within each stratum, for each level of income there is
the number of workplace accidents that have occurred, and
the total man months observed.
Here's an example of the setup. The number in parentheses is the total man months observed and the number not in parentheses is the number of workplace accidents.
My question is how do I set up this data and perform a Poisson regression on the effect of income level on the occurrence of workplace accidents? Ideally I would like to adjust for occupation and find out the effect of only income, but as a starting point, I'm not sure how to set it up as a Poisson regression problem at all. I thought about doing something like dividing the number of injuries by the months of observation, but then that gives non-integer values so I assume that's not the right thing to do.
To reiterate, predictor: income level; response variable: workplace accidents.
BTW, it would be very easy to separate the parentheses numbers and put them into their own column, if that would make sense to do.
I'd really appreciate any suggestions on how to set this up. I am sure other statisticians are working with similarly structured data and might like to gain some insight as well. Thanks so much!
@thelatemail might be correct in thinking this is better suited for stats.stackexchange.com, but here is some R code. That data is in wide format and you need to restructure it to long format. (And you will not want to include the totals columns.) After converting the first four columns to a long format where you have 'occupation' and 'level' as factor-class variables, and accident 'counts' and exposure 'months' as numeric columns, you could use this call to glm:
fit <- glm(counts ~ level + occupation + offset(log(months)), data = dfrm, family = "poisson")
The offset needs to be log()-ed to agree with the logged counts created by the default link function for the poisson-family.
(You cannot really expect us to redo that data entry task, now can you?)
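For illustration only, here is how the wide-to-long step and the glm call might fit together. The numbers below are made up (they are not the table from the question), and the column names counts.1 ... months.4 and the occupation labels are hypothetical placeholders for however the real table is entered.

```r
# Hypothetical wide-format table: one row per occupation, with accident
# counts and man-months for each of four income levels (made-up numbers)
wide <- data.frame(
  occupation = c("occ1", "occ2", "occ3"),
  counts.1 = c(12, 20, 30), months.1 = c(1000, 900, 800),
  counts.2 = c(9, 15, 26),  months.2 = c(1100, 950, 900),
  counts.3 = c(7, 11, 17),  months.3 = c(1200, 1000, 850),
  counts.4 = c(4, 8, 12),   months.4 = c(1000, 1050, 800)
)

# Long format: one row per occupation x income level
long <- reshape(
  wide, direction = "long",
  varying = list(paste0("counts.", 1:4), paste0("months.", 1:4)),
  v.names = c("counts", "months"),
  timevar = "level", times = 1:4, idvar = "occupation"
)
long$level <- factor(long$level)
long$occupation <- factor(long$occupation)

# Poisson regression of counts on income level, adjusted for occupation,
# with log(man-months) as the exposure offset
fit <- glm(counts ~ level + occupation + offset(log(months)),
           data = long, family = poisson)

rate_ratios <- exp(coef(fit))   # rates relative to the baseline levels
print(round(rate_ratios, 3))
```

Because of the offset, the exponentiated level coefficients can be read as accident rates per man-month relative to the first income level, holding occupation fixed.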
