Normalizing data in R using a population raster

I have two pixel images that I created using spatstat: a density image created from a set of points (using density.ppp), and a pixel image created from a population raster. I am wondering if there is a way to use the population raster to normalize the density image. I have a dataset of 10,000+ cyber attack origin locations in the US, which I hope to investigate for spatial patterns using spatstat. The obvious problem is that areas of higher population have more cyber attack origins simply because there are more people, and I would like to use the population raster to correct for that. Any ideas would be appreciated.

As the comment by @RHA says, the first solution is to simply divide by the population intensity.
I don't have your data so I will make some that might seem similar. The Chorley dataset has two types of cancer cases. I will make an estimate of the intensity of lung cancer and use it as your given population density. Then a density estimate of the larynx cases serves as your estimate of the cyber attack intensity:
library(spatstat)
# Split into list of two patterns
tmp <- split(chorley)
# Generate fake population density
pop <- density(tmp$lung)
# Generate fake attack locations
attack <- tmp$larynx
# Plot the intensity of attacks relative to population
plot(density(attack)/pop)
Alternatively, you could use the inverse population density as weights in density.ppp:
plot(density(attack, weights = 1/pop[attack]))
This might be the preferred way, where you basically say that an attack occurring at e.g. a place with population density 10 only "counts" half as much as an attack occurring at a place with density 5.
I'm not sure what exactly you want to do with your analysis, but maybe you should consider fitting a simple Poisson model with ppm and seeing how your data diverge from the proposed model, to understand the behaviour of the attacks.
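A minimal sketch of that last suggestion, reusing the same fake data as above. Using log(pop) as an offset (attack intensity proportional to population) and a linear trend in x and y as the alternative model are illustrative assumptions, not something taken from the question:

```r
library(spatstat)

# Same stand-in data as above: lung cases play the role of population,
# larynx cases play the role of attacks
tmp <- split(chorley)
pop <- density(tmp$lung, positive = TRUE)  # keep values strictly positive so log(pop) is finite
attack <- unmark(tmp$larynx)

# Baseline: attack intensity proportional to population density
fit0 <- ppm(attack ~ offset(log(pop)))

# Alternative (an assumption for illustration): an extra spatial trend on top
fit1 <- ppm(attack ~ offset(log(pop)) + x + y)

# Likelihood ratio test: does the trend explain anything beyond population?
anova(fit0, fit1, test = "LR")
```

If the test favours fit1, the attacks deviate from a pattern that simply follows population, which is the kind of structure the question is after.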

How to calculate NME (Normalized Mean Error) between ground-truth and predicted landmarks when some ground-truth points have no corresponding prediction?

I am studying some facial landmark detection models and notice that many of them use NME (Normalized Mean Error) as the performance metric.
The formula is straightforward: it calculates the L2 distance between the ground-truth points and the model's predictions, then divides it by a normalizing factor, which varies from dataset to dataset.
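For reference, a common way to write the metric. The symbols are assumptions, since the original formula is not reproduced here: $N$ landmarks, ground-truth point $g_i$, predicted point $p_i$, and a dataset-specific normalizing distance $d$ (for example the inter-ocular distance):

$$\mathrm{NME} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lVert p_i - g_i \rVert_2}{d}$$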
However, when applying this formula to a landmark detector that someone else developed, I have to deal with a non-trivial situation: some detectors may not generate enough landmarks for some input images (perhaps because of NMS, a limitation inherent in the model, image quality, etc.). Thus some ground-truth points might have no corresponding point in the prediction result.
So how should I handle this? Should I add such missing points to a "failure result set", use the failure rate (FR) to measure the model, and ignore them when calculating NME?
Suppose the output of your neural network is, for example, a 10x1 vector holding your points as [x1, y1, x2, y2, ..., x5, y5]. This vector has a fixed length, because it is determined by the number of output neurons in your model.
If points appear to be missing (say you see 4 of 5 points), it is because some predicted points fall outside the image width and height, or have negative coordinates, e.g. [-0.1, -0.2, 0.5, 0.7, ...]. Here the first two values cannot be seen on the image, so they look missing, but they are still in the vector and you can calculate NME.
With some custom neural nets, genuinely missing points are possible; in that case the missing values are replaced by the points with the largest error.
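The approach proposed in the question (score only the matched landmarks and report the missing ones separately) can be sketched as follows. The function name, the NA encoding for missing predictions, and the toy coordinates are all assumptions made for illustration:

```r
# gt, pred: N x 2 matrices of (x, y) landmarks; a missing prediction is a row of NA
# d: normalizing distance (dataset-specific, e.g. inter-ocular distance)
nme_with_missing <- function(gt, pred, d) {
  missing <- !complete.cases(pred)         # rows with no predicted landmark
  err <- sqrt(rowSums((pred - gt)^2)) / d  # normalized L2 error per landmark
  list(
    nme = mean(err[!missing]),             # NME over matched landmarks only
    missing_rate = mean(missing)           # share of landmarks with no prediction
  )
}

gt   <- rbind(c(0, 0), c(10, 0), c(5, 5))
pred <- rbind(c(3, 4), c(NA, NA), c(5, 5))
res  <- nme_with_missing(gt, pred, d = 10)
res
```

Reporting the two numbers side by side avoids rewarding a detector that lowers its NME simply by dropping the hard landmarks.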

Using envfit (vegan) to calculate species scores

I am running an NMDS and have a few questions regarding the envfit() function in the vegan package. I have read the documentation for this function and numerous posts on SO and others about vegan, envfit(), and species scores in general.
I have seen both envfit() and wascores() used to calculate species scores for ordination techniques. By default, metaMDS() uses wascores(). This uses weighted averaging, which I understand. I am having a harder time understanding envfit(). Do envfit() and wascores() yield the same results? Is wascores() preferable given that it is the default? I realize that in some situations wascores() might not be an option (i.e. negative values), as mentioned in this post: How to get 'species score' for ordination with metaMDS()?
Given that envfit() and wascores() both seem to be used for species scores, they should yield similar results, right? I am hoping that we could do a proof of this here...
The following shows species scores determined using metaMDS() using the default wascores():
library(vegan)
data(varespec)
ord <- metaMDS(varespec)
species.scores <- as.data.frame(scores(ord, "species"))
species.scores
wascores() makes sense to me: it uses weighted averaging. There is a good explanation of weighted averaging for species scores in Analysis of Ecological Data by McCune and Grace (2002) p. 150.
Could somebody help me break down envfit()?
species.envfit <- envfit(ord, varespec, choices = c(1,2), permutations = 999)
species.scores.envfit <- as.data.frame(scores(species.envfit, display = "vectors"))
species.scores.envfit
"The values that you see in the table are the standardised coefficients from the linear regression used to project the vectors into the ordination. These are directions for arrows of unit length." - comment from Plotted envfit vectors not matching NMDS scores
^Could somebody please show me what linear model is being run here and what standardized value is being extracted?
species.scores
species.scores.envfit
These values are very different from each other. What am I missing here?
This is my first SO post, please have mercy. I would have asked a question on some of the other relevant threads, but I am the dregs of SO and don't even have the reputation to comment.
Thanks!
Q: Do wascores() and envfit() give the same result?
No, they do not give the same result, as they are doing two quite different things. In this answer I have explained how envfit() works. wascores() takes the coordinates of the points in the NMDS space and computes the mean on each dimension, weighting observations by the abundance of the species at each point. Hence the species score returned by wascores() is a weighted centroid in the NMDS space for each species, where the weights are the abundances of the species. envfit() fits vectors that point in the direction of increasing abundance. This implies a plane over the NMDS ordination, where abundance increases linearly as you move parallel to the arrow from any point on the plane. wascores(), in contrast, are best thought of as optima, where the abundance declines as you move away from the weighted centroid, although I think this analogy is looser than, say, with a CA ordination.
The issue of wascores() not being an option arises only if you pass in standardised data; as the answer you linked to shows, this would imply negative weights, which doesn't work. Typically one doesn't standardise species abundances. There are transformations that we do apply (converting to proportions, square root or log transformations, normalizing the data to the interval 0-1), but these wouldn't give you negative abundances, so you're less likely to run into that issue.
envfit() in an NMDS is not necessarily a good thing as we wouldn't expect abundances to vary linearly over the ordination space. The wascores() are better as they imply non-linear abundances, but they are a little hackish in NMDS. ordisurf() is a better option in general as it adds a GAM (smooth) surface instead of the plane implied by the vectors, but you can't show more than one or a few surfaces on the ordination, whereas you can add as many species WA scores or arrows as you want.
The basic issue here is the assumption that envfit() and wascores() should give the same results. There is no reason to assume that, as these are fundamentally different approaches to computing "species scores" for NMDS, and each comes with its own assumptions, advantages and disadvantages.
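To make the "weighted centroid" description concrete, here is a small sketch that reproduces one wascores() row by hand and then calls envfit() on the same species. The choice of species (Cladstel) and the seed are arbitrary assumptions for illustration:

```r
library(vegan)
data(varespec)

set.seed(1)
ord  <- metaMDS(varespec, trace = FALSE)
site <- scores(ord, display = "sites")  # site coordinates in NMDS space

# Weighted averages as computed by wascores() (no axis expansion)
wa <- wascores(site, varespec)

# The same thing by hand for one species: abundance-weighted mean of site scores
w <- varespec[, "Cladstel"]
by_hand <- colSums(site * w) / sum(w)
all.equal(unname(by_hand), unname(wa["Cladstel", ]))

# envfit() instead fits a direction of increasing abundance for the same species
envfit(ord, varespec[, "Cladstel", drop = FALSE], permutations = 0)
```

Note that the species scores stored by metaMDS() are, as far as I know, computed from the transformed community data and expanded by default, so they need not equal `wa` exactly.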

Poorly fitting curve in natural log regression

I'm fitting a logarithmic curve to 20+ data sets using the equation
y = intercept + coefficient * ln(x)
Generated in R via
output$curvePlot <- renderPlot({
  x <- medianX
  y <- medianY
  # lad() (least absolute deviations) comes from the L1pack package
  Estimate    <- lad(formula = y ~ log(x), method = "EM")
  logEstimate <- lad(formula = y ~ log(x), method = "EM")
  # Draw an invisible line to set up the axes, then the fitted curve and the data
  plot(x, predict(Estimate), type = 'l', col = 'white')
  lines(x, predict(logEstimate), col = 'red')
  points(x, y)
  # Build the equation label from the rounded coefficients
  cf <- round(coef(logEstimate), 1)
  eq <- paste0("y = ", cf[1],
               ifelse(sign(cf[2]) == 1, " + ", " - "), abs(cf[2]),
               " * ln(x) from 0 to ", xmax)
  mtext(eq, 3, line = -2, col = "red")
  output$summary <- renderPrint(summary(logEstimate))
  output$calcCurve <-
    renderPrint(round(cf[2] * log(input$calcFeet) + cf[1]))
})
The curve consistently "crosses twice" on the data; fitting too low at low/high points on the X axis, fitting too high at the middle of the X axis.
I don't really understand where to go from here. Am I missing a factor or using the wrong curve?
The dataset is about 60,000 rows long, but I condensed it into medians. Medians were selected due to unavoidable outliers in the data, particularly a thick left tail, caused by our instrumentation.
x,y
2,6.42
4,5.57
6,4.46
8,3.55
10,2.72
12,2.24
14,1.84
16,1.56
18,1.33
20,1.11
22,0.92
24,0.79
26,0.65
28,0.58
30,0.34
32,0.43
34,0.48
36,0.38
38,0.37
40,0.35
42,0.32
44,0.21
46,0.25
48,0.24
50,0.25
52,0.23
Full methodology for context:
Samples of dependent variable, velocity (ft/min), were collected at
various distances from fan nozzle with a NIST-calibrated hot wire
anemometer. We controlled for instrumentation accuracy by subjecting
the anemometer to a weekly test against a known environment, a
pressure tube with a known aperture diameter, ensuring that
calibration was maintained within +/- 1%, the anemometer’s published
accuracy rating.
We controlled for fan alignment with the anemometer down the entire
length of the track using a laser from the center of the fan, which
aimed no more than one inch from the center of the anemometer at any
distance.
While we did not explicitly control for environmental factors, such as
outdoor air temperature, barometric pressure, we believe that these
factors will have minimal influence on the test results. To ensure
that data was collected evenly in a number of environmental
conditions, we built a robot that drove the anemometer down the track
to a different distance every five minutes. This meant that data would
be collected at every independent variable position repeatedly, over
the course of hours, rather than at one position over the course of
hours. As a result, a 24 hour test would measure the air velocity at
each distance over 200 times, allowing changes in temperature as the
room warmed or cooled throughout the day to address any confounding
environmental factors by introducing randomization.
The data was collected via Serial port on the hot wire anemometer,
saving a timestamped CSV that included fields: Date, Time, Distance
from Fan, Measured Temperature, and Measured Velocity. Analysis on the
data was performed in R.
Testing: To gather an initial set of hypotheses, we took the median of
air velocity at each distance. The median was selected, rather than
the mean, as outliers are common in data sets measuring physical
quantities. As air moves around the room, it can cause the airflow to
temporarily curve away from the anemometer. This results in outliers
on the low end that do not reflect the actual variable we were trying
to measure. It’s also the case that, sometimes, the air velocity at a
measured distance appears to “puff,” or surge and fall. This is
perceptible by simply standing in front of the fan, and it happens on
all fans at all distances, to some degree. We believe the most likely
cause of this puffing is due to eddy currents and entrainment of the
surrounding air, temporarily increasing airflow. The median result
absolves us from worrying about how strong or weak a “puff” may feel,
and it helps limit the effects on air speed of the air curving away
from the anemometer, which does not affect actual air velocity, but
only measured air velocity. With our initial dataset of medians, we
used logarithmic regression to calculate a curve to match the data and
generated our initial velocity profiles at set distances. To validate
that the initial data was accurate, we ran 10 Monte Carlo folding
simulations at 25% of the data set and ensured that the generated
medians were within a reasonable value of each other.
Validation: Fans were run every three months and the Monte Carlo
folding simulations were observed. If the error rate was <5% from our
previous test, we validated the previous test.
There is no problem with the code itself, you found the best possible fit using a logarithmic curve. I double-checked using Mathematica, and I obtain the same results.
The problem seems to reside in your model. From the data you provided and the description of its origin, the logarithmic function might not be the best model for your measurements. The description indicates that the velocity must be finite at x=0 and must slowly tend towards 0 as x goes to infinity. However, the negative logarithmic function is infinite at x=0 and becomes negative after a while.
I am not a physicist, but my intuition would tend towards using the inverse-square law or the exponential function. I tested both, and the exponential function gives far better results.
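A rough way to reproduce that comparison in R with the medians from the question. This sketch uses ordinary least squares (lm and nls) rather than the lad() fit from the question, and the exponential form y = a * exp(-b * x) with these starting values is an assumption:

```r
# Medians from the question
x <- seq(2, 52, by = 2)
y <- c(6.42, 5.57, 4.46, 3.55, 2.72, 2.24, 1.84, 1.56, 1.33, 1.11, 0.92, 0.79,
       0.65, 0.58, 0.34, 0.43, 0.48, 0.38, 0.37, 0.35, 0.32, 0.21, 0.25, 0.24,
       0.25, 0.23)

# Logarithmic model: y = a + b * ln(x)
fit_log <- lm(y ~ log(x))

# Exponential decay: y = a * exp(-b * x); start values from a log-linear fit
init <- coef(lm(log(y) ~ x))
fit_exp <- nls(y ~ a * exp(-b * x),
               start = list(a = exp(init[[1]]), b = -init[[2]]))

# Compare residual sums of squares (smaller is better)
rss_log <- sum(resid(fit_log)^2)
rss_exp <- sum(resid(fit_exp)^2)
c(log = rss_log, exp = rss_exp)

# Visual check: data with both fitted curves
plot(x, y)
lines(x, fitted(fit_log), col = "red")
lines(x, fitted(fit_exp), col = "blue")
```

The plot also makes the "crosses twice" pattern of the logarithmic fit easy to see next to the exponential curve.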

Using a Point Process model for Prediction

I am analysing ambulance incident data. The dataset covers three years and has roughly 250000 incidents.
Preliminary analysis indicates that the incident distribution is related to population distribution.
Fitting a point process model using spatstat agrees with this, with broad agreement in a partial residual plot.
However, it is believed that the trend diverges from this population related trend during the "social hours", that is Friday, Saturday night, public holidays.
I want to take subsets of the data and see how they differ from the gross picture. How do I account for the difference in intensity due to the smaller number of points inherent in a subset of the data?
Or is there a way to directly use my fitted model for the gross picture?
It is difficult to provide data as there are privacy issues, and with the size of the dataset it's hard to simulate the situation. I am not by any means a statistician, hence I am floundering a bit here. I have a copy of "Spatial Point Patterns: Methodology and Applications with R", which is very useful.
I will try with pseudocode to explain my methodology so far..
# ~250k incidents and ~1.3m census points (column names are placeholders)
pts_250k <- ppp(the_ambulance_data$x_coord, the_ambulance_data$y_coord, window = the_window)
census_pts <- ppp(census_data$x, census_data$y, window = the_window)
Best bandwidth for the density surface by visual inspection seemed to be bw.scott. This was used to fit a density surface for the points.
inc_density <- density(pts_250k, bw.scott)
pop_density <- density(census_pts, bw.scott)
# ppm() is fitted to the point pattern itself, not to its density image
fit0 <- ppm(pts_250k ~ 1)
fit_pop <- ppm(pts_250k ~ pop_density)
partials <- parres(fit_pop, "pop_density")
Plotting the partial residuals shows that the agreement with the linear fit is broadly acceptable, with some areas of 'wobble'..
What I am thinking of doing next:
the_ambulance_data %>% group_by(day_of_week, hour_of_day) %>%
select(x_coord, y_coord) %>% nest() -> nested_day_hour_pts
Taking one of these list items and creating a ppp, say fri_2300hr_ppp;
fri23.den <- density(fri_2300hr_ppp, bw.scott)
fit_fri23 <- ppm(fri_2300hr_ppp ~ pop_density)
How do I then compare this ppp or density with the broader model? I can do characteristic tests such as dispersion, clustering.. Can I compare the partial residuals of fit_pop and fit_fri23?
How do I control for the effect of the number of points on the density - i.e. I have 250k points versus maybe 8000 points in the subset. I'm thinking maybe quantiles of the density surface?
Attach marks to the ambulance data representing the subset/categories of interest (eg 'busy' vs 'non-busy'). For an informal or nonparametric analysis, use tools like relrisk, or use density.splitppp after separating the different types of points using split.ppp. For a formal analysis (taking into account the sample sizes etc etc) you should fit several candidate models to the same data, one model having a busy/nonbusy effect and another model having no such effect, then use anova.ppm to test formally whether there is a busy/nonbusy effect. See Chapter 14 of the book mentioned.
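A minimal sketch of that workflow on simulated data. The population image, the random busy/nonbusy labels, and the particular pair of candidate models are all assumptions for illustration; with the real data the marks would come from the day and hour of each incident:

```r
library(spatstat)
set.seed(42)

# Hypothetical stand-ins for the real data: a population image on the unit
# square and incidents whose intensity follows it
pop_density <- as.im(function(x, y) 100 + 300 * x, W = owin())
incidents <- rpoispp(pop_density)

# Mark each incident with its category (assigned at random here)
marks(incidents) <- factor(sample(c("busy", "nonbusy"), npoints(incidents),
                                  replace = TRUE))

# Informal look: kernel-smoothed intensity per category
dens_by_type <- density(split(incidents))

# Formal comparison: same population effect for both categories vs.
# a population effect that differs by category
fit_same <- ppm(incidents ~ marks + log(pop_density))
fit_diff <- ppm(incidents ~ marks * log(pop_density))
anova(fit_same, fit_diff, test = "LR")
```

Because both categories are fitted in one model, the `marks` term absorbs the difference in the number of points, so the test is about the spatial pattern and not about sample size.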

R - simulate data for probability density distribution obtained from kernel density estimate

First off, I'm not entirely sure if this is the correct place to be posting this, as perhaps it should go in a more statistics-focussed forum. However, as I'm planning to implement this with R, I figured it would be best to post it here. Apologies if I'm wrong.
So, what I'm trying to do is the following. I want to simulate data for a total of 250,000 observations, assigning each a continuous (non-integer) value in line with a kernel density estimate derived from empirical (discrete) data, with original values ranging from -5 to +5. Here's a plot of the distribution I want to use.
It's quite essential to me that I don't simulate the new data based on the discrete probabilities, but rather the continuous ones as it's really important that a value can be say 2.89 rather than 3 or 2. So new values would be assigned based on the probabilities depicted in the plot. The most frequent value in the simulated data would be somewhere around +2, whereas values around -4 and +5 would be rather rare.
I have done quite a bit of reading on simulating data in R and about how kernel density estimates work, but I'm really not moving forward at all. So my question basically entails two steps - how do I even simulate the data (1) and furthermore, how do I simulate the data using this particular probability distribution (2)?
Thanks in advance, I hope you guys can help me out with this.
With your underlying discrete data, create a kernel density estimate on as fine a grid as you wish (i.e., as "close to continuous" as needed for your application (within the limits of machine precision and computing time, of course)). Then sample from that kernel density, using the density values to ensure that more probable values of your distribution are more likely to be sampled. For example:
Fake data, just to have something to work with in this example:
set.seed(4396)
dat = round(rnorm(1000,100,10))
Create kernel density estimate. Increase n if you want the density estimated on a finer grid of points:
dens = density(dat, n=2^14)
In this case, the density is estimated on a grid of 2^14 points, with distance mean(diff(dens$x))=0.0045 between each point.
Now, sample from the kernel density estimate: We sample the x-values of the density estimate, and set prob equal to the y-values (densities) of the density estimate, so that more probable x-values will be more likely to be sampled:
kern.samp = sample(dens$x, 250000, replace=TRUE, prob=dens$y)
Compare dens (the density estimate of our original data) (black line), with the density of kern.samp (red):
plot(dens, lwd=2)
lines(density(kern.samp), col="red",lwd=2)
With the method above, you can create a finer and finer grid for the density estimate, but you'll still be limited to density values at grid points used for the density estimate (i.e., the values of dens$x). However, if you really need to be able to get the density for any data value, you can create an approximation function. In this case, you would still create the density estimate--at whatever bandwidth and grid size necessary to capture the structure of the data--and then create a function that interpolates the density between the grid points. For example:
dens = density(dat, n=2^14)
dens.func = approxfun(dens)
x = c(72.4588, 86.94, 101.1058301)
dens.func(x)
[1] 0.001689885 0.017292405 0.040875436
You can use this to obtain the density distribution at any x value (rather than just at the grid points used by the density function), and then use the output of dens.func as the prob argument to sample.
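As a sketch of that last step, the first option below draws candidate values anywhere in the estimated range and weights them with dens.func. The second option is a different technique, the smoothed bootstrap, added here as an alternative: it resamples the data and adds Gaussian noise with the KDE bandwidth, which draws directly from a Gaussian kernel estimate. The number of candidates (1e6) is an arbitrary assumption:

```r
set.seed(4396)
dat <- round(rnorm(1000, 100, 10))
dens <- density(dat, n = 2^14)
dens.func <- approxfun(dens)

# Option 1: candidate values anywhere in the estimated range, weighted by the
# interpolated density
cand  <- runif(1e6, min(dens$x), max(dens$x))
samp1 <- sample(cand, 250000, replace = TRUE, prob = dens.func(cand))

# Option 2 (smoothed bootstrap): resample the data and add Gaussian kernel
# noise with the bandwidth that density() selected
samp2 <- sample(dat, 250000, replace = TRUE) + rnorm(250000, 0, dens$bw)

# Compare both samples with the original density estimate
plot(dens, lwd = 2)
lines(density(samp1), col = "red")
lines(density(samp2), col = "blue")
```

The second option needs no grid at all, but it only matches the estimate when density() was run with its default Gaussian kernel.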
