Transform data to be fitted by pammtools package - r

I am trying to use the pammtools package to fit piece-wise exponential additive mixed models: https://adibender.github.io/pammtools/index.html. To fit the model, I first need to transform the data with the as_ped function from pammtools: https://adibender.github.io/pammtools/reference/as_ped.html. I don't know how to make it work for my dataset. Here is a toy dataset:
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
time=c(0, 500, 1000, 1500,
0, 500, 1000, 1500, 2000, 2500,
0, 500, 1000, 1500),
event=c(0, 1, 0, 1,
0, 1, 0, 1, 0, 1,
0, 0, 0, 1),
tdc=c(2, 3, 4, 6,
2, 1, 2, 2, 3, 5,
4, 2, 1, 2))
It seems that the as_ped function requires a single row per id (no repeated rows for the same id). However, my data are longitudinal, so each id takes up several rows. How can I make this work? Any help will be greatly appreciated!
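One way to see what the model needs: a piece-wise exponential data set has one row per subject and interval, with an interval start and end, an event indicator for that interval, and an offset equal to the log of the time at risk. The sketch below builds that layout by hand in base R from the toy data. It does not call as_ped. The column names tstart, tend, ped_status and offset follow the naming used in the pammtools documentation, and carrying forward the tdc value observed at the start of each interval is an assumption about how the covariate should be aligned.

```r
df <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  time  = c(0, 500, 1000, 1500, 0, 500, 1000, 1500, 2000, 2500, 0, 500, 1000, 1500),
  event = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1),
  tdc   = c(2, 3, 4, 6, 2, 1, 2, 2, 3, 5, 4, 2, 1, 2)
)

# Sort so that lagging within id is well defined
df <- df[order(df$id, df$time), ]

# Shift a vector down by one position (first element becomes NA)
lag1 <- function(v) c(NA, head(v, -1))

# Interval start and covariate value at the interval start, per subject
df$tstart    <- ave(df$time, df$id, FUN = lag1)
df$tdc_start <- ave(df$tdc,  df$id, FUN = lag1)

# Drop each subject's first row: it only marks the start of follow-up
keep <- df[!is.na(df$tstart), ]

ped <- data.frame(
  id         = keep$id,
  tstart     = keep$tstart,
  tend       = keep$time,
  ped_status = keep$event,                      # event indicator for the interval
  offset     = log(keep$time - keep$tstart),    # log time at risk (assumed layout)
  tdc        = keep$tdc_start                   # assumption: value at interval start
)
print(ped)
```

The pammtools documentation also describes a concurrent() term for time-dependent covariates inside the as_ped formula, which may be the more direct route; the hand-built version above is only meant to show the target shape.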

Related

Plot histogram for each group dplyr

Let's consider a very simple data frame containing four groups:
cat <- c(1, 0, 0, 1, 2, 1, 2, 3, 2, 1, 3)
var <- c(10, 5, 3, 2, 5, 1, 2, 10, 50, 2, 30)
df <- data.frame(cat, var)
What I would like to do is use dplyr to plot the distribution of values within each of those four categories.
I have the feeling that it can easily be done with group_by, but I'm not sure how. Do you know how I can do it?
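A minimal sketch of one way to get a histogram per category. It uses base R rather than dplyr, so it runs without extra packages; the 2 x 2 panel layout is an arbitrary choice for four groups.

```r
cat <- c(1, 0, 0, 1, 2, 1, 2, 3, 2, 1, 3)
var <- c(10, 5, 3, 2, 5, 1, 2, 10, 50, 2, 30)
df <- data.frame(cat, var)

# One vector of values per category
groups <- split(df$var, df$cat)

# Draw the histograms on a 2 x 2 grid, one panel per category
op <- par(mfrow = c(2, 2))
for (g in names(groups)) {
  hist(groups[[g]], main = paste("cat =", g), xlab = "var")
}
par(op)  # restore the previous plotting layout
```

If ggplot2 is available, mapping var to x in geom_histogram() and adding facet_wrap(~ cat) should give a comparable one-panel-per-group layout.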

Complicated filtering of data frame without loops

I have a big data frame with positions, timestamps, trip ids etc.
I would like a simple way, avoiding double loops, to filter the data frame and keep only some of the rows.
For all rows that share the same combination of trip_id and stop_id, I want to keep the row where the speed was first equal to zero: either take the minimum timestamp where the speed is zero, or simply take the first row where the speed is zero, since the frame is ordered by timestamp.
So in the example below, I would like to find the top three rows (the real data frame has many more rows) and keep only the second row, where the speed first was zero.
Is there a way to do this without any loops?
trip_id.x stop_id latitude.x longitude.x bearing speed timestamp vehicle id
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 30 0.2 1541399400 9031005990005424
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 0 0 1541399401 9031005990005424
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 0 0 1541399402 9031005990005424
55700000048910300 9022005000050006 58.416879999999999 15.624510000000001 30 0.5 1541400000 9031005990005424
Edit:
Here is the dput() of a longer example with a simpler format of the data I have:
structure(list(trip_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3), stop_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1,
1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3,
3, 3), speed = c(5, 0, 0, 5, 2, 0, 0, 2, 4, 0, 0, 4, 5, 0, 0,
5, 2, 0, 0, 2, 4, 0, 0, 4, 5, 0, 0, 5, 2, 0, 0, 2, 4, 0, 0, 4
), timestamp = c(1, 2, 3, 4, 101, 102, 103, 104, 201, 202, 203,
204, 301, 302, 303, 304, 401, 402, 403, 404, 501, 502, 503, 504,
601, 602, 603, 604, 701, 702, 703, 704, 801, 802, 803, 804)), row.names = c(NA,
-36L), class = c("tbl_df", "tbl", "data.frame"))
And the wanted output:
structure(list(trip_id = c(1, 1, 2, 2, 2, 3, 3, 3), stop_id = c(1,
3, 1, 2, 3, 1, 2, 3), speed = c(0, 0, 0, 0, 0, 0, 0, 0), timestamp = c(2,
202, 302, 402, 502, 602, 702, 802)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
Edit: I am trying to change the code to include conditions. I tried case_when and if but can't get it to work:
df_arrival_z <- df %>%
group_by(trip_id, stop_id) %>%
filter(speed == 0)
# Check if there is any rows where speed is zero
if (nrow(filter(speed == 0)) > 0){
# Take the first row if there is rows with zero
filter(speed == 0) %>% slice(1)
}
if (nrow(filter(speed == 0)) == 0){
# Take the middle point if there is no rows with speed = 0
slice(nrow%/%2)
}
Without desired output I can't be sure what you expect, but try this and let me know:
library(dplyr)
df %>%
group_by(trip_id, stop_id) %>%
filter(speed == 0) %>%
slice(1)
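For the conditional version asked about in the last edit (first zero-speed row if there is one, otherwise the middle row of the group), a sketch along these lines might work. The sample data is made up for illustration, and using max(1L, n() %/% 2L) as the "middle" index is an assumption so that single-row groups still return a row.

```r
library(dplyr)

# Made-up sample: group (trip 1, stop 2) never reaches speed 0
df <- data.frame(
  trip_id   = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  stop_id   = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1),
  speed     = c(5, 0, 0, 5, 2, 3, 4, 2, 0, 0, 1),
  timestamp = 1:11
)

df_arrival <- df %>%
  group_by(trip_id, stop_id) %>%
  slice(
    if (any(speed == 0)) {
      which(speed == 0)[1]      # first row with speed zero
    } else {
      max(1L, n() %/% 2L)       # otherwise the middle row of the group
    }
  ) %>%
  ungroup()

print(df_arrival)
```

The condition is evaluated once per group inside slice(), so no explicit loop or separate nrow() check is needed.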

Getting (maybe manually) confidence interval of fits after using multi-way clustering package (multiwayvcov)

I am interested in plotting fits with confidence intervals after using a two-way clustering package (multiwayvcov).
Here is my reproducible data.
rm(list=ls(all=TRUE))
library(lmtest)
library(multiwayvcov)
dv<-c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0)
int1<-c(0.0123, 0.3428, 0.2091, 0.8325, 0.7113, 0.7401, 0.6009, 0.5062, 0.4841, 0.8912, 0.3850, 0.2463, 0.0625, 0.5374, 0.1984)
int2<-c(0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0)
cont<-c(3, 1, 2, 4, 6, 7, 1, 4, 3, 2, 4, 3, 6, 1, 3)
cluster1<-c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
cluster2<-c(1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2)
mydata<-as.data.frame(cbind(dv, int1, int2, cont, cluster1, cluster2))
This is my non-clustered model:
result_lm <- lm(dv~int1+int2+cont,data=mydata)
To get clustered results using "cluster1" and "cluster2", I use functions from the "lmtest" and "multiwayvcov" packages as follows.
cluster_vcov<-cluster.vcov(result_lm, ~cluster1+cluster2)
result_2c<-coeftest(result_lm, cluster_vcov)
Here, "cluster_vcov" is just a variance-covariance matrix and "result_2c" is just an atomic vector. Thus, I am not able to use the "predict" function to plot fits on a new dataset ("datagrid") such as
grid <- seq(0,1,.2)
datagrid <- data.frame(int1=rep(grid,2),
int2=c(rep(0,length(grid)),
rep(1,length(grid))))
datagrid$cont<-mean(mydata$cont, na.rm=T)
Before moving on to what I have done, here is something similar to what I would like to have eventually.
library(ggplot2)
fits <- predict(result_lm,newdata=datagrid,interval="confidence")
plotdata <- data.frame(fits,datagrid)
plotdata$int2 <- plotdata$int2==1
ggplot(plotdata,aes(x=int1,y=fit,ymin=lwr,ymax=upr,color=int2)) +
geom_line(aes(linetype = int2)) + geom_ribbon(alpha=.2) + theme(legend.position="none") +
scale_color_manual(values=c("red", "darkgreen")) + scale_linetype_manual(values=c("dashed", "solid"))
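The result is a plot of the fitted lines with confidence ribbons for the two values of int2 (figure not reproduced here).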
Because "result_2c" is not an object that can be passed directly to "predict", I decided to construct a data frame myself as follows.
d_twc_result<-data.frame(matrix(0, nrow =4, ncol = 4) )
colnames(d_twc_result) <- c("Estimate","Std. Error","t value", "Pr(>|t|)")
rownames(d_twc_result) <-c("(Intercept)", "int1","int2", "cont")
for (j in 1:4){
for (i in 1:4){
d_twc_result[i, j]<-result_2c[i,j]
}
}
Then, using "d_twc_result$Estimate", I generate a vector that corresponds to the "fits" that one would get from running "predict".
fits<-c(1:12)
for (i in 1:12){
fits[i]<-d_twc_result$Estimate[1]+
d_twc_result$Estimate[2]*datagrid$int1[i]+
d_twc_result$Estimate[3]*datagrid$int2[i]+
d_twc_result$Estimate[4]*datagrid$cont[i]
}
Yet, I was still not able to construct the vectors "lwr" and "upr", which require residuals or standard errors. Where I am actually stuck is that it seems impossible to get residuals or standard errors, because there are no observations of "dv" in the dataset "datagrid".
Nevertheless, "predict" works with "datagrid", so I guess that I poorly understand how "predict" works or the concept of a fit.
I would highly appreciate any help getting "lwr" and "upr" (or a correction if my understanding of the concept of a fit is wrong). Thanks in advance for any comments.
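The interval around a fitted value does not need residuals for the new rows: the standard error of a fit at covariate vector x is sqrt(x' V x), where V is the coefficient variance-covariance matrix. A sketch of that calculation is below. The helper name ci_from_vcov is hypothetical, the t critical value with the model's residual degrees of freedom is an assumption (other choices are defensible with clustered errors), and vcov(result_lm) is used only as a stand-in so the block runs without multiwayvcov; the clustered matrix from cluster.vcov would be passed as V instead.

```r
dv   <- c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0)
int1 <- c(0.0123, 0.3428, 0.2091, 0.8325, 0.7113, 0.7401, 0.6009, 0.5062,
          0.4841, 0.8912, 0.3850, 0.2463, 0.0625, 0.5374, 0.1984)
int2 <- c(0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0)
cont <- c(3, 1, 2, 4, 6, 7, 1, 4, 3, 2, 4, 3, 6, 1, 3)
mydata <- data.frame(dv, int1, int2, cont)

result_lm <- lm(dv ~ int1 + int2 + cont, data = mydata)

grid <- seq(0, 1, 0.2)
datagrid <- data.frame(int1 = rep(grid, 2),
                       int2 = rep(c(0, 1), each = length(grid)),
                       cont = mean(mydata$cont))

# Hypothetical helper: fitted values and confidence limits from any vcov matrix V
ci_from_vcov <- function(model, newdata, V, level = 0.95) {
  X    <- model.matrix(delete.response(terms(model)), newdata)
  fit  <- drop(X %*% coef(model))
  se   <- sqrt(rowSums((X %*% V) * X))          # sqrt of diag(X V X')
  crit <- qt(1 - (1 - level) / 2, df = df.residual(model))
  data.frame(fit = fit, lwr = fit - crit * se, upr = fit + crit * se)
}

V  <- vcov(result_lm)   # stand-in; replace with cluster_vcov for clustered intervals
ci <- ci_from_vcov(result_lm, datagrid, V)
print(ci)
```

With the ordinary vcov this reproduces what predict(..., interval = "confidence") returns, which is a useful sanity check before swapping in the clustered matrix. The resulting fit, lwr and upr columns can be combined with datagrid and plotted as in the question.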

Add accuracy to data frame based on several predicted values and known actual values

I have a data frame
testdf <- data.frame(predicted1 = c(1, 0, 1, 3, 2, 1, 1, 0, 1, 0), predicted2 = c(1, 0, 2, 2, 2, 1, 1, 0, 0, 0), predicted3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), actual = c(1, 0, 2, 3, 2, 1, 1, 1, 0, 0))
I want to add another column to this data frame which tells me the total percentage accuracy when looking at all predicted values. So for example, row 1 of this would have an accuracy of 100%, because all prediction columns predicted the correct value (1).
How can this be done?
Thanks!
We can compare the prediction columns with 'actual', take the rowMeans, multiply by 100, and round if needed:
round(100*rowMeans(testdf[1:3] == testdf$actual), 2)
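To attach the result as a new column, as the question asks, the same expression can be assigned back to the data frame. This is a small sketch; the column name accuracy and the pred_cols helper vector are just illustrative choices.

```r
testdf <- data.frame(
  predicted1 = c(1, 0, 1, 3, 2, 1, 1, 0, 1, 0),
  predicted2 = c(1, 0, 2, 2, 2, 1, 1, 0, 0, 0),
  predicted3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  actual     = c(1, 0, 2, 3, 2, 1, 1, 1, 0, 0)
)

pred_cols <- c("predicted1", "predicted2", "predicted3")

# Comparing a data frame with a vector of length nrow() checks every
# prediction column against 'actual' row by row; rowMeans gives the hit rate
testdf$accuracy <- round(100 * rowMeans(testdf[pred_cols] == testdf$actual), 2)
print(testdf)
```

Selecting the prediction columns by name rather than by position keeps the calculation correct if columns are later reordered.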

Computing iterations in R to calculate the rate of population increase

I've tried to calculate the rate of population increase, denoted r, which is obtained from:
sum(e^(r*x)*lx*mx) = 1
I know the values of x, lx, and mx, but r has to be found iteratively so that the sum equals one.
This is the code I wrote (or tried to write). It is not correct: it returns values for the sum but not for r.
I don't know what's wrong. I'd appreciate any clue to solve this.
Thanks.
lx <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
mx <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22)
radj = sum((exp(-r*x))*lx*mx)
for (radj in 0:1) {
repeat { radj <- sum((exp(-r*x))*lx*mx)
print(radj)
if (radj < 1) break ()} }
Try this:
root <- uniroot( f = function(r) sum(exp(r*x)*lx*mx) - 1, interval = c(-1, 0))
root$root
> [1] -0.8340894
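Note that the question's code uses exp(-r*x) while its formula is written with e^(r*x). With the negative exponent, the root has the same magnitude and the opposite sign. A sketch of that variant, with a check that the solution really makes the sum equal to one; the search interval c(0, 1) and the tighter tol are choices made here, not part of the original answer.

```r
lx <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
mx <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
x  <- 1:22

# Target function: zero exactly when sum(exp(-r * x) * lx * mx) equals 1
f <- function(r) sum(exp(-r * x) * lx * mx) - 1

# f(0) > 0 and f(1) < 0, so the interval brackets a root
sol   <- uniroot(f, interval = c(0, 1), tol = 1e-9)
r_hat <- sol$root

print(r_hat)                              # estimated rate of increase
print(sum(exp(-r_hat * x) * lx * mx))     # should be very close to 1
```

uniroot replaces the hand-written repeat loop: it needs only a function of r and an interval on which that function changes sign.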
