Disclaimer: I'm very new to spatstat and spatial point modeling in general... please excuse my naivete.
I have recently tried using spatstat to fit and simulate spatial point patterns related to weather phenomena, where the spatial pattern represents a set of eye-witness reports (for example, reports of hail occurrence) and the observation window and covariate are based on some meteorological parameter (e.g. the window is the area where moisture is at least X, and the moisture variable is additionally passed as a covariate when fitting the model).
moistureMask <- owin(mask = moisture > X)    # observation window: the high-moisture area
moistureVar  <- im(moisture)                 # moisture as a pixel image covariate
obsPPP <- ppp(x = obsX, y = obsY, window = moistureMask)
myModel <- ppm(obsPPP ~ moistureVar)
### then simulate
mySim <- simulate(myModel, nsim = 10)
My questions are the following:
1. Is it possible (or, more importantly, even valid) to take a ppm trained on one day, with a specific moisture variable and mask, and apply it to another day with a different moisture field and mask? I had considered using the update function to swap out the window and covariate in the trained model, but haven't actually tried it yet. If the answer is yes, it's a little unclear to me how to actually do this programmatically.
2. Is it possible to do an online update of the ppm with additional data? For example, could the model be trained on data from different days (each with its own window and covariate) iteratively, similar to how many machine learning models are trained on blocks of training data? Say I have 10 years of daily data which I'd like to use to train the model, and another 10 years of moisture variables over which I'd like to simulate point patterns. I considered the update function here as well, but it was unclear whether the new model would be based ONLY on the new data, or on a combination of the original and new data.
Please let me know if I'm heading in completely the wrong direction with this. References and resources appreciated.
If you have fitted a model using ppm and you update it by specifying new data and/or new covariates, then the new data replace the old data; the updated model's parameters are determined using only the new data that you gave when you called update.
The syntax for the update command is described in the online help for update.ppm (the method for the generic update for an object of class ppm).
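For example, a minimal sketch of that replacement behaviour, where newPPP is a hypothetical point pattern observed on the new day (see ?update.ppm for exactly how the arguments are matched):
refit <- update(myModel, newPPP)   # refits the same trend, using ONLY newPPP;
                                   # the original obsPPP plays no part in the new fit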
It seems that what you really want to do is to fit a point process model to many replicate datasets, each dataset consisting of a predictor moistureVar and a point pattern obsPPP. In that case, you should use the function mppm which fits a point process model to replicated data.
To do this, first make a list A containing the moisture regions for each day, and another list B containing the hail report location patterns for each day. That is, A[[1]] is the moisture region for day 1, and B[[1]] is the point pattern of hail report locations for day 1, and so on. Then do
h <- hyperframe(moistureVar=A, obsPPP=B)
m <- mppm(obsPPP ~ moistureVar, data=h)
This will fit a single point process model to the full set of data.
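For example, a rough sketch of assembling the two lists from daily inputs; moistureByDay, hailX, hailY and studyWindow are hypothetical objects (a list of daily moisture matrices, lists of daily report coordinates, and a common observation window):
A <- list(); B <- list()
for (day in seq_along(moistureByDay)) {
  moist <- moistureByDay[[day]]
  A[[day]] <- im(moist > X)                        # logical image: TRUE in the high-moisture region
  B[[day]] <- ppp(x = hailX[[day]], y = hailY[[day]],
                  window = studyWindow)            # common observation window for every day
}
A and B can then be passed to hyperframe() and mppm() as above. In practice you would also supply matching xrange/yrange to im() and to the window so that the images and the report coordinates share the same coordinate system.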
Finally, can I point out that the model
obsPPP ~ moistureVar
is very simple, because moistureVar is a binary predictor. The model will simply say that the intensity of hail reports takes one value inside the high-moisture region, and another value outside that region. As an alternative, you could consider using the moisture content (e.g. humidity) itself as a predictor variable.
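For example, a minimal sketch of that alternative, reusing the hypothetical moistureByDay list from the sketch above to build continuous covariate images:
Acont <- lapply(moistureByDay, im)             # continuous moisture images, one per day
h2 <- hyperframe(moistureVar = Acont, obsPPP = B)
m2 <- mppm(obsPPP ~ moistureVar, data = h2)    # log-intensity now varies with the moisture value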
See Chapters 9 and 16 of the spatstat book for more detail.
Related
I am in the process of creating a neural network with the aim of predicting tomorrow's temperature in my area. I have loaded the data, normalized it, divided it into a train and a test set, created an NN using the neuralnet library, and predicted the temperatures in the test set with 78% accuracy.
My system parts are named as follows:
neural network <- nn
neural network output <- nn.results
Data <- data frame with all the inputs (Humidity, air pressure, averages of previous temperatures, etc.)
Predicted data <- results$predicted (for the test set)
What function/code would I use to predict tomorrow's temperature, rather than just predicting values in the test set?
I hope the question is not too stupid; any help would be much appreciated. Sorry that there is not much code, but with neural networks it is difficult to give just snippets.
Thank you!
Edit: I think the predict function is the way to go; however, it requests "new data", whereas I would like to predict solely from old data. Would special formatting of the test set do the trick?
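For example, would something along these lines be the right shape? (A rough sketch assuming the neuralnet package; the column names and values are placeholders, and tomorrow_row would hold today's measurements normalised with the same min/max as the training data.)
library(neuralnet)
# one-row data frame of today's (normalised) inputs -- placeholder names and values
tomorrow_row <- data.frame(Humidity = 0.64, AirPressure = 0.52, PrevTempAvg = 0.71)
pred <- compute(nn, tomorrow_row)   # newer versions of neuralnet also offer predict(nn, tomorrow_row)
pred$net.result                     # the prediction, still on the normalised scale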
Is it possible to misuse JAGS as a tool for generating data from a model with known parameters? I need to sample data points from a predefined model in order to do a simulation study and test the power of a model I have developed in R.
Unfortunately, the model is somewhat tricky (a hierarchical structure with AR and VAR components) and I was not able to simulate the data directly in R.
While searching the internet, I found a blog post where the data were generated in JAGS using the data{} block. In that post, the author then estimated the model directly in JAGS. Since I have my model in R, I would like to transfer the simulated data back to R without needing a model{} block. Is this possible?
Best,
win
There is no particular reason that you need to use the data block for generating data in this way - the model block can just as easily work in 'reverse' to generate data based on fixed parameters. Just specify the parameters as 'data' to JAGS, and monitor the simulated data points (and run for as many iterations as you need datasets - which might only be 1!).
Having said that, in principle you can simulate data using either the data or model blocks (or a combination of both), but you need to have a model block (even if it is a simple and unrelated model) for JAGS to run. For example, the following uses the data block to simulate some data:
txtstring <- '
data{
  for(i in 1:N){
    Simulated[i] ~ dpois(i)
  }
}
model{
  fake <- 0
}
#monitor# Simulated
#data# N
'
library('runjags')
N <- 10
Simulated <- coda::as.mcmc(run.jags(txtstring, sample=1, n.chains=1, summarise=FALSE))
Simulated
The only real difference is that the data block is updated only once (at the start of the simulation), whereas the model block is updated at each iteration. In this case we only take 1 sample so it doesn't matter, but if you wanted to generate multiple realisations of your simulated data within the same JAGS run you would have to put the code in the model block. [There might also be other differences between data and model blocks but I can't think of any offhand].
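For instance, a sketch of the model-block version of the same toy example, returning several realisations from a single run (each monitored iteration is one simulated dataset):
txtstring2 <- '
model{
  for(i in 1:N){
    Simulated[i] ~ dpois(i)
  }
}
#monitor# Simulated
#data# N
'
library('runjags')
N <- 10
sims <- coda::as.mcmc(run.jags(txtstring2, sample=100, n.chains=1, summarise=FALSE))
dim(sims)   # 100 rows (realisations) by 10 columns (Simulated[1] ... Simulated[10])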
Note that you will get the data back out of JAGS in a different format (a single vector with names giving the indices of any arrays within the monitored data), so some legwork might be required to get that back to a list of vectors / arrays / whatever in R. Edit: unless R2jags provides some utility for this - I'm not sure as I don't use that package.
Using a model block to run a single MCMC chain that simulates multiple datasets would be problematic because MCMC samples are typically correlated. (Each subsequent sample is drawn using the previous sample). For a simulation study, you would want to generate independent samples from your distribution. The way to go would be to use the data or model block recursively, e.g. in a for loop, which would ensure that your samples are independent.
I am just starting out with using R to segment a customer database I have for an ecommerce retail business, and I am seeking some guidance about the best approach for this exercise.
I have searched the topics already posted here and tried them out myself, like dist() and hclust(). However, I keep running into one issue or another and, being new to R, have not been able to overcome them.
Here is a brief description of my problem.
I have approximately 480K records of customers who have made a purchase so far. The data contains the following columns:
email id
gender
city
total transactions so far
average basket value
average basket size (no. of items purchased in one transaction)
average discount claimed per transaction
No of days since the user first purchased
Average duration between two purchases
No of days since last transaction
The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance as to how to do this successfully without running into problems like the size of the sample or the data type of columns?
Read this to learn how to subset data frames. When you try to define d, it looks like you're providing way too much data, which might be fixed by subsetting your table first. If that isn't enough, you might want to take a random sample of your data instead of using all of it. Suppose you know that columns 4 through 10 of your data frame called cust_data contain numerical data; then you might try this:
cust_data2 <- cust_data[, 4:10]
d <- dist(cust_data2)
For variables with large values, you may want to log-transform them; just experiment and see what makes sense. I really am not sure about this, and it's just a suggestion. Maybe choosing a more appropriate clustering method or distance metric would be better.
Finally, when you run hclust, you need to pass in the distance object d, not the original data set.
h <- hclust(d, "ave")
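Putting those pieces together, a rough sketch (the sample size, column indices and number of clusters are just assumptions to experiment with):
set.seed(42)
idx <- sample(nrow(cust_data), 5000)            # dist() on all ~480K rows is not feasible
samp <- scale(cust_data[idx, 4:10])             # numeric columns only, standardised
d <- dist(samp)
h <- hclust(d, "ave")
segments <- cutree(h, k = 5)                    # cut the tree into 5 candidate segments
table(segments)                                 # how many sampled customers fall in each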
Sadly your data does not contain any attributes that indicate what types of items/transactions did NOT result in a sale.
I am not sure if clustering is the way to go here.
Here are some ideas:
First split your data into a training set (say 70%) and a test set.
Set up a simple linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.
fit <- lm(averagebasketvalue ~ ., data = trainset)
Run the model on the training set, determine significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.
Check your model on the test set by calculating R-squared and the sum of squared errors (SSE) there. You can use the predict() function; the calls will look like
fitpred <- predict(fit, newdata = testset)
SSE <- sum((testset$averagebasketvalue - fitpred)^2)
SST <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
1 - SSE/SST   # R-squared on the test set
Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
This can go on for a while... play with the model to try to get better R-squared and SSEs.
I think a tree-based model (rpart) might also work well here.
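A rough sketch of that alternative, using the same trainset as above:
library(rpart)
tree <- rpart(averagebasketvalue ~ ., data = trainset)  # regression tree on all attributes
printcp(tree)                       # inspect the complexity-parameter table before pruning
plot(tree); text(tree, cex = 0.7)   # visualise the splits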
Then you might change to cluster analysis at a later time.
What statistical methods are out there that will estimate the probability density of data as it arrives over time?
I need to estimate the pdf of a multivariate dataset; however, new data arrives over time and as the data arrives the density estimation must update.
What I have been using so far is kernel density estimation, storing a buffer of the data and computing a new kernel density estimate with every batch of new data; however, I can no longer keep up with the amount of data that needs to be stored. Therefore, I need a method that keeps track of the overall pdf/density estimate rather than the individual data points. Any suggestions would be really helpful. I work in Python, but since this is a fairly general problem, algorithm suggestions in any language would also be helpful.
Scipy's KDE implementation (scipy.stats.gaussian_kde) includes a loop that accumulates the density estimate one data point at a time rather than all at once. It is nested inside an "if there are more evaluation points than data" branch, but you could probably re-purpose it for your needs.
if m >= self.n:
    # there are more points than data, so loop over data
    for i in range(self.n):
        diff = self.dataset[:, i, newaxis] - points
        tdiff = dot(self.inv_cov, diff)
        energy = sum(diff * tdiff, axis=0) / 2.0
        result = result + exp(-energy)
In this case, you could store the result of your KDE as result, and each time you get a new point you could just calculate the new Gaussian contribution and add it to your result. Data can be dropped as needed; you are only storing the KDE.
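To sketch the bookkeeping concretely, an illustrative univariate example in R (a fixed bandwidth and evaluation grid are assumed here, which is the price of discarding the raw data, since past kernels cannot be re-scaled later):
# Running sum of Gaussian kernels: only the grid values are stored, never the raw data.
grid <- seq(0, 10, length.out = 512)   # fixed evaluation grid (an assumption)
kernel_sum <- numeric(length(grid))    # accumulated kernel contributions
n_seen <- 0
bw <- 0.3                              # fixed bandwidth (cannot be re-tuned later)

add_point <- function(x) {
  kernel_sum <<- kernel_sum + dnorm(grid, mean = x, sd = bw)
  n_seen <<- n_seen + 1
}

current_density <- function() kernel_sum / n_seen   # normalised density on the grid

add_point(2.5); add_point(3.1); add_point(4.0)
plot(grid, current_density(), type = "l")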
I am aiming to better predict the buying habits of a company's customer base according to several customer attributes (demographics, past purchase categories, etc.). I have a data set of about 100,000 returning customers, including the time interval since their last purchase (the dependent variable in this study) along with several attributes (both continuous and categorical).
I plan on doing a survival analysis on each segment (segments being defined as having similar time intervals across observations) to help understand likely time intervals between purchases. The problem I am encountering is how best to define these segments, i.e. groupings of attributes such that the time interval is sufficiently different between segments and similar within segments. I believe that building a decision tree is the best way to do this, presumably using recursive partitioning.
I am new to R and have poked around with the party package's mob command; however, I am confused about which variables to include in the model and which to include for partitioning (command: mob(y ~ x1 + ... + xk | z1 + ... + zk), x being model variables and z being partitioning variables). I simply want to build a tree from the set of attributes, so I suppose I want to partition on all of them? Not sure. I have also tried the rpart command, but I either get no tree or a tree with hundreds of thousands of nodes, depending on the cp level.
If anyone has any suggestions, I'd appreciate it. Sorry for the novel and thanks for the help.
From the documentation at ?mob:
MOB is an algorithm for model-based recursive partitioning yielding a tree with fitted models associated with each terminal node.
It's asking for model variables because it will build a model at every terminal node (e.g. linear, logistic) after splitting on the partition variables. If you want to partition without fitting models to the terminal nodes, the function I've used is ctree (also in the party package).
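For example, a minimal sketch (purchase_interval and cust are hypothetical names for your response variable and customer data frame):
library(party)
fit <- ctree(purchase_interval ~ ., data = cust)  # partition on all attributes
plot(fit)                                         # terminal nodes are candidate segments
table(where(fit))                                 # how many customers land in each node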