Using a point process model for prediction in R

I am analysing ambulance incident data. The dataset covers three years and has roughly 250000 incidents.
Preliminary analysis indicates that the incident distribution is related to population distribution.
Fitting a point process model using spatstat supports this, with broad agreement in a partial residual plot.
However, it is believed that the incident pattern diverges from this population-related trend during the "social hours", that is, Friday and Saturday nights and public holidays.
I want to take subsets of the data and see how they differ from the gross picture. How do I account for the difference in intensity due to the smaller number of points inherent in a subset of the data?
Or is there a way to directly use my fitted model for the gross picture?
It is difficult to provide data as there are privacy issues, and with the size of the dataset it's hard to simulate the situation. I am not by any means a statistician, hence I am floundering a bit here. I have a copy of
"Spatial Point Patterns: Methodology and Applications with R", which is very useful.
I will try to explain my methodology so far with pseudocode:
amb_ppp <- ppp(x = the_ambulance_data$x_coord, y = the_ambulance_data$y_coord,
               window = the_window)                     # ~250,000 incidents
census_ppp <- ppp(x = census_data$x, y = census_data$y,
                  window = the_window)                  # ~1.3m census points
Best bandwidth for the density surface by visual inspection seemed to be bw.scott. This was used to fit a density surface for the points.
inc_density <- density(amb_ppp, sigma = bw.scott)
pop_density <- density(census_ppp, sigma = bw.scott)
fit0 <- ppm(amb_ppp ~ 1)              # null (homogeneous intensity) model
fit_pop <- ppm(amb_ppp ~ pop_density) # intensity as a function of population density
partials <- parres(fit_pop, "pop_density")
Plotting the partial residuals shows that the agreement with the linear fit is broadly acceptable, with some areas of 'wobble'.
What I am thinking of doing next:
nested_day_hour_pts <- the_ambulance_data %>%   # requires dplyr and tidyr
  group_by(day_of_week, hour_of_day) %>%
  select(x_coord, y_coord) %>%
  nest()
Taking one of these list items and creating a ppp, say fri_2300hr_ppp, for example:
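One hypothetical way to build that pattern, assuming nest() stored the coordinates in its default list-column data and that day_of_week/hour_of_day are coded as "Fri" and 23:
fri_2300 <- nested_day_hour_pts %>%
  filter(day_of_week == "Fri", hour_of_day == 23) %>%
  pull(data) %>% .[[1]]
fri_2300hr_ppp <- ppp(x = fri_2300$x_coord, y = fri_2300$y_coord, window = the_window)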
fri23.den <- density(fri_2300hr_ppp, sigma = bw.scott)
fit_fri23 <- ppm(fri_2300hr_ppp ~ pop_density)
How do I then compare this ppp or density with the broader model? I can run characteristic tests such as those for dispersion and clustering. Can I compare the partial residuals of fit_pop and fit_fri23?
How do I control for the effect of the number of points on the density? I have 250k points overall versus maybe 8000 points in the subset. I'm thinking maybe quantiles of the density surface?

Attach marks to the ambulance data representing the subsets/categories of interest (e.g. 'busy' vs 'non-busy'). For an informal or nonparametric analysis, use tools like relrisk, or use density.splitppp after separating the different types of points with split.ppp. For a formal analysis (taking into account the sample sizes and so on) you should fit several candidate models to the same data, one model having a busy/non-busy effect and another model having no such effect, then use anova.ppm to test formally whether there is a busy/non-busy effect. See Chapter 14 of the book mentioned.
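A minimal sketch of that workflow in spatstat, reusing the (renamed) objects from the question; the is_social vector is a hypothetical logical indicator, one entry per incident, flagging Friday/Saturday nights and public holidays:
marks(amb_ppp) <- factor(ifelse(is_social, "busy", "nonbusy"))
# Informal / nonparametric comparison
rr <- relrisk(amb_ppp)                             # spatially varying probability of 'busy'
dens <- density(split(amb_ppp), sigma = bw.scott)  # separate intensity estimate per type
# Formal comparison: is there a busy/non-busy effect beyond population?
fit_nomark <- ppm(amb_ppp ~ pop_density)
fit_mark <- ppm(amb_ppp ~ pop_density * marks)
anova(fit_nomark, fit_mark, test = "Chi")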

Related

Use of svyglm and svydesign with R for multistage stratified cluster design

I have a complicated data set which comes from a multistage stratified cluster design. I had originally analysed this using glm, but now realise that I have to use svyglm. I'm not quite sure how best to model the data using svyglm, and I was wondering if anyone could help shed some light.
I am attempting to see the effect that a variety of covariates taken at time 1 have on a binary outcome taken at time 2.
The sampling strategy was as follows: state -> urban/rural -> district -> subdistrict -> village. Within each village, individuals were randomly selected, with each of these having an id (uniqid).
I have a variable in the df for each of these stages of the sampling strategy. I also have the following variables: outcome, age, sex, income, marital_status, urban_or_rural_area, uniqid, weights. The formula that I want for my regression equation is outcome ~ age + sex + income + marital_status + urban_or_rural_area. Weights are coded by the weights variable. I set the family to binomial(link = logit).
If anyone has any idea how such an approach could be coded in R with svyglm, I would be most appreciative. I'm quite confused as to what should be supplied as id, fpc and nest. Do I have to specify all levels of the stratified design, or just some?
Any direction, or resources which explain this well would be massively appreciated.
You don't really give enough information about the design: which of the geographical units are strata and which are clusters? For example, my guess is that you sample both urban and rural areas in all states, and that you don't sample all villages, but I don't know whether you sample all districts or subdistricts. I also don't know whether your overall sampling fraction is large or small (and so whether the with-replacement approximation is OK).
Let's pretend you sample just some districts, so districts are your Primary Sampling Units, and that the overall sampling fraction of people is small. The design command is
your_design <- svydesign(id = ~district, weights = ~weights,
                         strata = ~interaction(state, urban_or_rural_area, drop = TRUE),
                         data = your_data_frame)
That is, the strata are combinations of state and urban/rural and any combinations that aren't in your data set don't exist in the population (maybe some states are all-rural or all-urban). Within each stratum you have districts, and only some of these appear in the sample. In your geographical hierarchy, districts are then the first level that is sampled rather than exhaustively enumerated.
You don't need fpc unless you want to specify the full multistage design without replacement.
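For completeness, a hedged sketch of what that fuller specification could look like, assuming just two sampled stages (districts, then people) and two hypothetical columns: n_districts (the number of districts in each stratum) and n_people (the number of eligible people in each sampled district). The column names are made up for illustration:
full_design <- svydesign(id = ~district + uniqid,
                         strata = ~interaction(state, urban_or_rural_area, drop = TRUE),
                         fpc = ~n_districts + n_people,
                         weights = ~weights,
                         data = your_data_frame)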
The nest option is not about how the survey was done but is about how variables are coded. The US National Center for Health Statistics (bless their hearts) set up a lot of designs that have many strata and two primary sampling units per stratum. They call these primary sampling units 1 and 2; that is, they reuse the names 1 and 2 in every stratum. The svydesign function is set up to expect different sampling unit names in different strata, and to verify that each sampling unit name appears in just one stratum, as a check against data errors. This check has to be disabled for NCHS surveys and perhaps some others that also reuse sampling unit names. You can always leave out the nest option at first; svydesign will tell you if it might be needed.
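A toy illustration of that coding issue with made-up data: the PSU labels 1 and 2 are reused in both strata, so nest = TRUE is needed to tell svydesign they are distinct clusters.
toy <- data.frame(stratum = rep(c("A", "B"), each = 4),
                  psu = rep(c(1, 1, 2, 2), 2),
                  w = 1)
toy_design <- svydesign(id = ~psu, strata = ~stratum, weights = ~w,
                        data = toy, nest = TRUE)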
Finally, the models:
svyglm(outcome ~ age + sex + income + marital_status + urban_or_rural_area,
       design = your_design, family = quasibinomial)
Using binomial or quasibinomial will give identical answers, but using binomial will give you a harmless warning about non-integer weights. If you use quasibinomial, the harmless warning is suppressed.

Interpreting a pattern in a residual plot produced by gam.check()

I'm working on creating a model that examines the effect of ocean characteristics on fishing outcomes. I have spatial data on a 0.5 degree grid and I created the following model:
gam(asinh(yvar) ~ s(lat, lon, bs = "sos") + s(xvar1) +
      s(xvar2) + s(xvar3), data = dat, method = "REML")
The QQ plot and histogram of residuals look okay. However, gam.check() produces an odd pattern in the residuals plot. I know that the points should be scattered around 0, but instead there is a distinct line of points. Can anyone provide some insight on how to interpret this plot?
Those will be either all the 0s (most likely) or the 1s/smallest value in your original data. You don't say what these data are, but as you mention fishing outcomes it is highly likely that they have some natural lower bound, and this line in the residuals is made up of all the observations that take that lower bound (before transformation).
As you don't say exactly what your data are, it is difficult to comment further on how to proceed (this may not be an issue, or you may need to drop the transform you used and instead fit a GLM or other non-Gaussian response), but:
such patterns are common in ecological/biological data, and
transforming the response almost invariably does not work well for ecological data.
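If, say, the response is a non-negative catch-like quantity with a point mass at zero (an assumption about the data, not something stated in the question), one hedged alternative is to model it on the original scale with a Tweedie family in mgcv rather than transforming:
library(mgcv)
# Tweedie family with estimated power parameter accommodates a spike at zero
# plus a continuous positive part, avoiding the asinh transform
fit <- gam(yvar ~ s(lat, lon, bs = "sos") + s(xvar1) + s(xvar2) + s(xvar3),
           data = dat, family = tw(), method = "REML")
gam.check(fit)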

Reduce range of function for functional PCA in R - Functional Data Analysis

I have discrete measurements of river flow spanning 22 years. As river flow is naturally continuous, I have attempted to fit a function to the data.
library(fda)
set.seed(1)
### 3 years of flow data
base = c(1,1,1,1,1,2,2,1,2,2,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,6,5,5,4,4,4,3,4,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1)
year1 = sapply(base, function(x){x + runif(1)})
year2 = sapply(base, function(x){x + runif(1)})
year3 = sapply(base, function(x){x + runif(1)})
flow.mat = matrix(c(year1, year2, year3), ncol = 3)
Whilst Fourier basis systems are recommended for periodic data, the true data do not exhibit a strongly repeating pattern (ignore the data simulation for this assumption). They also contain important extreme values. Therefore, I attempted to fit a B-spline basis system to the data.
sp.basis <- create.bspline.basis(c(1, length(base)), norder = 6, nbasis = 15)
sb.fd <- smooth.basis(1:length(base), flow.mat, sp.basis)$fd
Ultimately, I intend to use the flow data as a covariate in a regression model with a monthly interval. This poses an issue because I fitted annual functions to the data, as this provided an improved fit compared with monthly functions, given the data's lack of temporal independence.
Therefore, I was wondering if it was possible for me to subset the generated functions, selecting a month at a time.
I suspect this is not possible; if so, is it possible to run an fPCA on subsetted data, as I intend to use the fPCA scores as the covariate in the model?
So far I have been completely unsuccessful in running a subsetted fPCA. Instead, I have been obtaining annual scores via the following:
pca.flow <- pca.fd(sb.fd, 2)
Without getting into much sophistication, I just plotted your data and made a polynomial fit. I used a degree-4 polynomial because the series is a wave with 3 ups and downs (4 is one more than the number of extrema of the fitting curve). As a matter of fact, degree 5 or more did not give a significant improvement.
What about doing the same for your 22-year time series?
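A minimal sketch of that polynomial fit, applied to the simulated year1 series from the question (the degree of 4 follows the reasoning above):
t <- seq_along(year1)
fit4 <- lm(year1 ~ poly(t, 4))           # degree-4 polynomial in time
plot(t, year1, pch = 16, xlab = "day", ylab = "flow")
lines(t, fitted(fit4), col = "red", lwd = 2)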

Normalizing data in R using a population raster

I have two pixel images that I created using spatstat: one is a density image created from a set of points (using the function density.ppp), and the other is a pixel image created from a population raster. I am wondering if there is a way to use the population raster to normalize the density image. Basically, I have a dataset of 10000+ cyber attack origin locations in the US, which I hope to investigate for spatial patterns using spatstat. However, the obvious problem is that areas of higher population have more cyber attack origins because there are more people. I would like to use the population raster to correct for that. Any ideas would be appreciated.
As the comment by @RHA says: the first solution is to simply divide by the intensity.
I don't have your data so I will make some that might seem similar. The Chorley dataset has two types of cancer cases. I will make an estimate of the intensity of lung cancer and use it as your given population density. Then a density estimate of the larynx cases serves as your estimate of the cyber attack intensity:
library(spatstat)
# Split into list of two patterns
tmp <- split(chorley)
# Generate fake population density
pop <- density(tmp$lung)
# Generate fake attack locations
attack <- tmp$larynx
# Plot the intensity of attacks relative to population
plot(density(attack)/pop)
Alternatively, you could use the inverse population density as weights in density.ppp:
plot(density(attack, weights = 1/pop[attack]))
This might be the preferred way, where you basically say that an attack occurring at e.g. a place with population density 10 only "counts" half as much as an attack occurring at a place with density 5.
I'm not sure what exactly you want to do with your analysis, but maybe you should consider fitting a simple Poisson model with ppm and seeing how your data diverge from the proposed model, to understand the behaviour of the attacks.
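A hedged sketch of that last suggestion, continuing with the fake Chorley-based objects above; the population surface pop enters the log-linear intensity as a spatial covariate:
fit <- ppm(attack ~ pop)
fit                  # inspect the fitted coefficient for pop
diagnose.ppm(fit)    # residual diagnostics showing where the data diverge from the model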

Cox regression in MATLAB

I know there is a COXPHFIT function in MATLAB to do Cox regression, but I have trouble understanding how to apply it.
1) How do I compare two groups of samples with survival data in days (survdays), censoring (cens) and some predictor value (x)? The groups are defined by a logical variable groups, and they have different numbers of samples.
2) What is the baseline parameter in coxphfit? I did read the docs, but how should I choose the baseline properly?
It would be great if you know a site with good, detailed examples on medical survival data. I found only the MathWorks demo, which does not even mention coxphfit.
Do you perhaps know of another third-party function for Cox regression?
UPDATE: The r tag was added since the answer I got is for R.
With survival analysis, the hazard function is the instantaneous death rate.
In these analyses, you are typically measuring what effect something has on this hazard function. For example, you may ask "does swallowing arsenic increase the rate at which people die?". A background hazard is the level at which people would die anyway (without swallowing arsenic, in this case).
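For reference (standard notation, not part of the original answer), the Cox proportional-hazards model writes the hazard for an individual with covariate vector $x$ as
$$ h(t \mid x) = h_0(t)\,\exp(x^\top \beta), $$
where $h_0(t)$ is exactly this baseline (background) hazard.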
If you read the docs for coxphfit carefully, you will notice that that function tries to calculate the baseline hazard; it is not something that you enter.
baseline — The X values at which to compute the baseline hazard.
EDIT: MATLAB's coxphfit function doesn't obviously work with grouped data. If you are happy to switch to R, then the analysis is a one-liner.
library(survival)
# Create some data
n <- 20
dfr <- data.frame(
  survdays = runif(n, 5, 15),
  cens = runif(n) < .3,
  x = rlnorm(n),
  groups = rep(c("first", "second"), each = n / 2)
)
# The Cox PH analysis
summary(coxph(Surv(survdays, cens) ~ x / groups, dfr))
ANOTHER EDIT: That baseline parameter to MATLAB's coxphfit appears to be a normalising constant. R's coxph function doesn't have an equivalent parameter. I looked in Statistical Computing by Michael Crawley, and it seems to suggest that the baseline hazard isn't important, since it cancels out when you calculate the likelihood of an individual dying. See Chapter 33, and pp. 615-616 in particular. My knowledge of how the model works isn't deep enough to explain the discrepancy between the MATLAB and R implementations; perhaps you could ask on the statistics Stack Exchange site (Cross Validated).
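To make the cancellation explicit (standard theory, not specific to either implementation): the partial likelihood only compares covariate effects among the individuals still at risk at each event time,
$$ L(\beta) = \prod_{i:\,\delta_i = 1} \frac{\exp(x_i^\top \beta)}{\sum_{j \in R(t_i)} \exp(x_j^\top \beta)}, $$
where $\delta_i = 1$ marks an observed death and $R(t_i)$ is the risk set at time $t_i$; the baseline hazard $h_0(t)$ does not appear, so it is not needed to estimate $\beta$.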
