I have panel data set up for a survival model.
Some observations have missing values, and the time intervals are not constant.
Here is an example:
t    value
5    5
10   8
15   12
18   NA
20   3
25   9
30   15
35   21
As you can see, the t intervals are mostly 5 units, but there is a record at t = 18 whose value is missing. I want to interpolate the value column with respect to the column t in R.
Do you have any suggestions?
It would be better if the method supports non-linear interpolation.
P.S.
The data set is fairly large, so generating a panel with small time steps is not feasible on my hardware.
I can handle Python as well, but R is more convenient because the interpolation happens mid-analysis in R.
In R
Use na_interpolation from the imputeTS package and pass the desired options, for example:
library(imputeTS)
df$value <- na_interpolation(df$value, option = "spline")
In Python
Use pandas.Series.interpolate with one of its methods, for example quadratic:
import pandas as pd
df['value'] = df['value'].interpolate(method = 'quadratic', limit_direction = 'both')
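Or use sklearn.impute.KNNImputer, for example:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors = 2, weights = 'distance')
# include t so that neighbours are found by distance in t; keep only the imputed value column
df['value'] = imputer.fit_transform(df[['t', 'value']])[:, 1]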
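Note that interpolating by row position and interpolating with respect to t give different answers when the spacing is irregular. The following is only a minimal sketch of linear interpolation against t, using the standard library and the example data from the question; the function name interpolate_by_t is made up for illustration, and it assumes t is sorted with no duplicates. By row position the linear estimate at t = 18 would be the midpoint of 12 and 3 (7.5), whereas weighting by t gives roughly 6.6.

```python
from bisect import bisect_left

def interpolate_by_t(t, values):
    """Fill None entries by linear interpolation against t (not row position).

    Assumes t is sorted in increasing order with no duplicates.
    """
    known_t = [ti for ti, vi in zip(t, values) if vi is not None]
    known_v = [vi for vi in values if vi is not None]
    filled = []
    for ti, vi in zip(t, values):
        if vi is not None:
            filled.append(vi)
            continue
        j = bisect_left(known_t, ti)  # index of the first known point to the right
        if j == 0 or j == len(known_t):
            filled.append(None)  # outside the known range: no extrapolation
            continue
        t0, v0 = known_t[j - 1], known_v[j - 1]
        t1, v1 = known_t[j], known_v[j]
        filled.append(v0 + (ti - t0) / (t1 - t0) * (v1 - v0))
    return filled

t = [5, 10, 15, 18, 20, 25, 30, 35]
value = [5, 8, 12, None, 3, 9, 15, 21]
result = interpolate_by_t(t, value)
print(result[3])  # value at t = 18, interpolated between (15, 12) and (20, 3)
```

The same idea carries over to the tools above: in R, approx() and spline() take the x and y vectors separately, and in pandas the index-based methods use the index values, so setting t as the index first should make the interpolation respect t.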
I want to run a Lagrange multiplier test on a panel dataset of the following type:
UGA Date Sales Nb_AM Nb_BX ......
A 01/2017 1 4 14
A 02/2017 8 5 17
A 03/2017 26 2 24
B 01/2017 3 3 35
B 02/2017 5 10 42
B 03/2017 8 24 2
I want to use lm.LMtests(). However, according to the R documentation, it needs an argument of type "listw", and I have no idea what to supply in my case. Could you help me?
For the moment my code is the following:
fusion2 <- read_excel("C:/Users/david/OneDrive/Bureau/Master data/Mémoire data analyst/Bases de données/Fusion/fusion.xlsx")
modeleam <- Sales ~ Nb_AM + Nb_BX +
  Total_PdS_sensibilisés_aux_événement_AM + Mails_AM_ouvert +
  Mails_AM_non_ouvert + Total_PdS_sensibilisés_aux_RP_AM +
  Total_PdS_sensibilisés_aux_Staff_AM + Total_PdS_sensibilisés_aux_Congrés_AM +
  Total_PdS_sensibilisés_aux_Opportunités_AM
mcoam <-lm(modeleam, data=fusion2)
lagrangeam <- lm.LMtests(mcoam, ,test="all")
Thanks in advance
Once you are familiar with the subject matter, it's pretty basic. This test was created for spatial statistics, and a listw object is nothing more than a description of neighbour dependence, that is, how strongly one value could potentially be influenced by the values of its neighbours.
For that you need, for example, a simple feature object with the geometries of a landscape or a city, so that each value can be assigned to a specific polygon. From this pattern you can create a neighbourhood (nb object) and then neighbourhood weights (listw object).
Small tutorial:
library(spdep); library(sf)
#Get your data
shape_and_data <- st_read("your/shape")
#Create your neighbourhood, nb-object
data_nb <- poly2nb(shape_and_data)
#Create the neighbour weights, listw-object
data_listw <- nb2listw(data_nb)
#Calculate
lm.LMtests(lm(...), listw = data_listw, test = "all")
This is a really basic example. For creating the neighbourhood (nb-object) you can choose different methods and for the weights (listw) there are also several methods.
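To make the idea of neighbour weights concrete, here is a small sketch, not spdep itself, of row-standardised weights, which as far as I know is the default style ("W") of nb2listw: each region's neighbours share a total weight of 1. The adjacency below is invented purely for illustration, and the function name is hypothetical.

```python
def row_standardised_weights(neighbours):
    """Map each region to {neighbour: weight}, with weights summing to 1 per region."""
    weights = {}
    for region, nbs in neighbours.items():
        if not nbs:
            # A region without neighbours has no defined row-standardised weights.
            raise ValueError(f"region {region!r} has no neighbours")
        weights[region] = {nb: 1.0 / len(nbs) for nb in nbs}
    return weights

# Hypothetical adjacency: A touches B and C, B touches A, C touches A.
neighbours = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
weights = row_standardised_weights(neighbours)
print(weights["A"])
```

In the panel data from the question, the regions would be the UGA units, and the adjacency would have to come from their actual geography.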
Hope it helped a bit,
Loubert
I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here my data for download https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0 observations), especially one whose mean is nearly identical to that of the second cluster. This only appears when specifically requesting a univariate, equal-variance model. Using, for example, modelNames="V" or leaving it at the default does not produce this problem.
This thread: "Cluster contains no observations" has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there is quite a lot of data there (see the spike in the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value, breaks = 50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust and other Gaussian mixture models (GMMs) are probabilistic: they estimate the mean and variance of each cluster as well as the probability that each point belongs to each cluster. This is unlike k-means, which provides a hard assignment. The likelihood of the model is built from the mixture density, i.e. for each data point the proportion-weighted sum of the cluster densities at that point; you can check the details in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are near but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point near the means of clusters 1 and 2 will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, same example, everything is assigned to 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
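The posterior probabilities above can be reproduced by hand, which shows why cluster 1 ends up empty. The sketch below is an illustration only: it takes the means and the shared variance from the summary printed in the question and the proportions from fit$parameters$pro, which come from slightly different runs, so the numbers only approximately match the predict() output.

```python
import math

def responsibilities(x, means, variance, proportions):
    """Posterior probability of each component for x under an equal-variance GMM."""
    # With a shared variance the Gaussian normalising constant is identical
    # for every component, so it cancels and is omitted here.
    weighted = [p * math.exp(-(x - m) ** 2 / (2 * variance))
                for m, p in zip(means, proportions)]
    total = sum(weighted)
    return [w / total for w in weighted]

means = [1381.391, 1381.715, 1574.335, 1851.667]  # from the summary in the question
variance = 7466.189                               # shared variance ("E" model)
proportions = [0.28565736, 0.42933294, 0.25445342, 0.03055627]  # fit$parameters$pro

z = responsibilities(1350, means, variance, proportions)
hard_label = z.index(max(z)) + 1  # 1-based, like the classification above
print([round(v, 4) for v in z], hard_label)
```

Because components 1 and 2 have almost the same mean and exactly the same variance, their densities are nearly equal everywhere, so the ratio of their posteriors is close to the ratio of their mixing proportions (about 0.29 to 0.43). Component 2 therefore wins the hard assignment across the range of the data.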
To answer your question: no, you did not do anything wrong. It is a limitation, at least of this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that have members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
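Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with equal (spherical) variances, consider using fuzzy k-means instead:
library(ClusterR)
# fit_kmeans is not defined in the original post; assumed here: fuzzy k-means with 3 clusters
fit_kmeans <- KMeans_rcpp(matrix(data$value, ncol = 1), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim = range(data$value), ylim = c(0, 4), ylab = "cluster", yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1, col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))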
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
I am having trouble generating posterior predictions using posterior_survfit(). I am trying to predict for a new data frame, but the function ignores it and instead uses values from the dataset I used to fit the model. The variables in the model are New.Treatment (categorical, 6 treatments), Openness (a continuous light index; min = 2.22, mean = 6.903221, max = 10.54), subplot_by_site (categorical, 720 sites) and New.Species.name (categorical, 165 species). My new data frame has 94 rows, but posterior_survfit() returns 3017800 rows. Help, please!
head(nd)
New.Treatment Openness
1 BE 5
2 BE 6
3 BE 7
4 BE 8
5 BE 9
6 BE 10
fit= stan_surv(formula = Surv(days, Status_surv) ~ New.Treatment*Openness + (1 |subplot_by_site)+(1|New.Species.name),
data = dataset,
basehaz = "weibull",
chains=4,
iter = 2000,
cores =4 )
Post=posterior_survfit(fit, type="surv",
newdata=nd5)
head(Post)
id cond_time time median ci_lb ci_ub
1 1 NA 62.0000 0.9626 0.9623 1.0000
2 1 NA 69.1313 0.9603 0.9600 0.9997
3 1 NA 76.2626 0.9581 0.9579 0.9696
4 1 NA 83.3939 0.9561 0.9557 0.9665
5 1 NA 90.5253 0.9541 0.9537 0.9545
6 1 NA 97.6566 0.9522 0.9517 0.9526
##Here some reproducible code to explain my problem:
library(rstanarm)
data_NHN<- expand.grid(New.Treatment = c("A","B","C"), Openness = c(seq(2, 11, by=0.15)))
data_NHN$subplot_by_site=c(rep("P1",63),rep("P2",60),rep("P3",60))
data_NHN$Status_surv=sample(0:1,183, replace=TRUE)
data_NHN$New.Species.name=c(rep("sp1",10),rep("sp2",40),rep("sp1",80),rep("sp2",20),rep("sp1",33))
data_NHN$days=sample(10, size = nrow(data_NHN), replace = TRUE)
nd_t<- expand.grid(New.Treatment = c("A","B","C"), Openness = c(seq(2, 11, by=1)))
mod= stan_surv(formula = Surv(days, Status_surv) ~ New.Treatment+Openness + (1 |subplot_by_site)+(1|New.Species.name),
data =data_NHN,
basehaz = "weibull",
chains=4,
iter = 30,
cores =4)
summary(mod)
pos=posterior_survfit(mod, type="surv",
newdataEvent=nd_t,
times = 0)
head(pos)
# I am interested in predicting values for specific Openness values
# (nd_t = 20 rows), but instead I am getting values for each point in time
# (pos = 18300 rows)
Operating System: Mac OS Catalina 10.15.6
R version: 4.0
rstan version: 2.21.2
rstanarm Version: rstanarm_2.21.2
Any suggestions on why it is not working? It is also not clear how to plot the effect of one variable in the interaction as the other changes, together with the associated uncertainty (i.e. a marginal effects plot). In my example, I am interested in getting the values at specific "Openness" values and not at each specific time, as appears in the posterior results. TIA.
I can only give you a partial answer. You should be aware that this is pretty bleeding-edge. The arXiv paper (dated Feb 2020) says
"Hopefully by the time you are reading this, the functionality will be available in the stable release on the Comprehensive R Archive Network (CRAN)"
but so far that's not true; it's not even in the master GitHub branch, so I used remotes::install_github("stan-dev/rstanarm@feature/survival") to install it from source.
The proximal problem is that you should specify the new data frame as newdata, not newdataEvent. There seem to be a lot of mismatches between the master and this branch, and between the docs and the code ... newdataEvent is used in the older method for stanjm models, but not for stansurv models. You can look at the code here, or use formals(rstanarm:::posterior_survfit.stansurv). Unfortunately, because this method has an (unused, unchecked) ... argument, that means that any misnamed arguments will be silently ignored.
The next problem is that if you specify the new data in your example as newdata you'll get
Error: The following variables are missing from the data: subplot_by_site, New.Species.name
That is, there doesn't seem to be an obvious way to generate a population-level posterior prediction. (Setting the random effects grouping variables to NA isn't allowed.) If you want to do this, you could either:
- expand your newdata to include all combinations of the grouping variables in your data set, and average the results across levels yourself;
- post an issue on GitHub or contact the maintainers ...
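The "expand and average" option can be sketched as follows. This is only an illustration of the bookkeeping in plain Python, not rstanarm code: the helper names are hypothetical, the grouping levels are the ones from the reproducible example, and the predictions are made-up placeholder numbers standing in for whatever the model would return for each expanded row.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

def expand_newdata(newdata, group_levels):
    """Cross every newdata row with every combination of grouping levels."""
    names = list(group_levels)
    rows = []
    for row in newdata:
        for combo in product(*(group_levels[n] for n in names)):
            rows.append({**row, **dict(zip(names, combo))})
    return rows

def average_over_groups(rows, predictions, keys):
    """Average predictions over rows that share the same covariate values."""
    buckets = defaultdict(list)
    for row, pred in zip(rows, predictions):
        buckets[tuple(row[k] for k in keys)].append(pred)
    return {k: mean(v) for k, v in buckets.items()}

newdata = [{"New.Treatment": "A", "Openness": 2},
           {"New.Treatment": "B", "Openness": 2}]
group_levels = {"subplot_by_site": ["P1", "P2", "P3"],
                "New.Species.name": ["sp1", "sp2"]}
expanded = expand_newdata(newdata, group_levels)  # 2 rows x 3 sites x 2 species

# Placeholder predictions, NOT model output: one made-up number per expanded row.
placeholder_predictions = [0.5 + 0.01 * i for i in range(len(expanded))]
averaged = average_over_groups(expanded, placeholder_predictions,
                               ["New.Treatment", "Openness"])
print(averaged)
```

Two caveats: with real output you would also group by time, since there is one prediction per row per time point, and a plain average weights every site and species equally, which may or may not be what you want.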
I have a two-dimensional point pattern (no marks) and I am trying to test for clustering in the presence of spatial inhomogeneity using envelopes and the inhomogeneous pair correlation function. I am estimating an inhomogeneous intensity function for the data using the density.ppp function. Here is some sample data:
x y
1 533.03 411.58
2 468.39 622.92
3 402.86 530.94
4 427.13 616.81
5 495.20 680.62
6 566.61 598.99
7 799.03 585.16
8 1060.09 544.23
9 144.66 747.40
10 138.14 752.92
11 449.49 839.15
12 756.45 713.72
13 741.01 728.41
14 760.22 740.28
15 802.34 756.21
16 799.04 764.89
17 773.81 771.97
18 768.41 720.07
19 746.14 754.11
20 815.40 765.14
There are ~1700 data points overall
Here is my code:
library("spatstat")
WT <- read.csv("Test.csv")
colnames(WT) <- c("x","y")
#determine bounding window
win <- ripras(WT)
unitname(win) <- c("micrometer")
#convert to ppp data class
WT.ppp <- as.ppp(WT, win)
plot(WT.ppp)
#estimate intensity function using cross validation
I <- density.ppp(WT.ppp,sigma=bw.diggle(WT.ppp),adjust=0.3,kernal="epanechnikov")
plot(I)
#predetermined r values for PCF
radius <- seq(from = 0, to = 50, by = 0.5)
#use envelopes to test the null hypothesis (ie. inhomogenous poisson process)
PCF_envelopes <- envelope(WT.ppp,divisor="d", pcfinhom,r = radius,nsim=10,simulate=expression(rpoispp(I)) )
When I run rpoispp(I), I get the following error:
Error in sample.int(npix, size = ni, replace = TRUE, prob = lpix) :
negative probability
I can't seem to figure out what the issue is....any suggestions?
Thanks for your help!
This is happening because the image I contains some negative values, probably very small values but negative. You can check that by computing range(I) or min(I) or any(I < 0).
The help for density.ppp says that the result may contain negative values (very small ones) due to numerical error. To remove these, you need to set positive=TRUE in the call to density.ppp.
By the way, the argument kernel has been mis-spelt in the code above. Also the vector r is too coarsely spaced - you would be better to leave this argument un-specified. Also you don't need to type density.ppp, just density.
I am having difficulty understanding what the multiclass.roc parameters should look like.
Here is a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
>
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem, I know that 'predictor' should contain probabilities (one per observation). However, in my case I have 3 classes, so my predictor is a list of rows that each have 3 columns (or a sublist of 3 values) corresponding to the probability of each class.
Does anyone know what my 'predictor' should look like, as opposed to what it currently looks like?
The pROC package is not really designed to handle this case, where you get multiple predictions (a probability for each class). Typically you would assess your P(class = 1):
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.
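To illustrate the "one class probability at a time" idea, here is a small sketch that computes a one-vs-rest AUC per class from the six rows shown in the question. It is not what multiclass.roc computes (as far as I know, that averages pairwise AUCs following Hand and Till), and the auc helper is written just for this example. With only six observations the numbers say nothing about the real model.

```python
def auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        return None  # undefined unless both groups are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The six observations from head() in the question.
response = [3, 3, 3, 3, 1, 3]
probs = {
    1: [1.013755e-04, 1.904435e-11, 6.445101e-23, 1.238355e-04, 9.027254e-01, 1.365667e-01],
    2: [3.713862e-02, 3.153587e-02, 1.119782e-11, 2.882145e-02, 7.259787e-07, 4.034372e-01],
    3: [0.96276001, 0.96846413, 1.00000000, 0.97105472, 0.09727389, 0.45999610],
}

# For class k, treat "response == k" as the positive label and P(class = k) as the score.
one_vs_rest = {k: auc(p, [r == k for r in response]) for k, p in probs.items()}
print(one_vs_rest)
```

Class 2 comes out as undefined here simply because none of the six rows shown belongs to class 2.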