I'm trying to use the 'relsurv' package in R to compare the survival of a cohort to national life tables. The code below reproduces my problem using the example from relsurv but with different life-table data. I've used just two years and two ages in the life-table data below; the actual data is much larger but gives the same error. The error is 'invalid ratetable argument', but I've formatted the table as per the example life tables 'slopop' and 'survexp.us'.
library(survival)
library(relsurv)
data(rdata) # example data from relsurv
raw = read.table(header=T, stringsAsFactors = F, sep=' ', text='
Year Age sex qx
1980 30 1 0.00189
1980 31 1 0.00188
1981 30 1 0.00191
1981 31 1 0.00191
1980 30 2 0.00077
1980 31 2 0.00078
1981 30 2 0.00076
1981 31 2 0.00074
')
ages = c(30, 31) # in years
years = c(1980, 1981)
rtab = array(data=NA, dim=c(length(ages), 2, length(years))) # set up blank array: ages, sexes, years
for (y in unique(raw$Year)){
  for (s in 1:2){
    rtab[ , s, y - min(years) + 1] = -1 * log(1 - subset(raw, Year == y & sex == s)$qx) / 365.24 # yearly death probability transformed to a daily hazard (see the ratetable help)
  }
}
attributes(rtab)$dimnames[[1]] = as.character(ages)
attributes(rtab)$dimnames[[2]] = c('male','female')
attributes(rtab)$dimnames[[3]] = as.character(years)
attributes(rtab)$dimid <- c("age", "sex", 'year')
attributes(rtab)$dim <- c(length(ages), 2, length(years))
attributes(rtab)$factor = c(0,0,1)
attributes(rtab)$type = c(2,1,4)
attributes(rtab)$cutpoints[[1]] = ages*365.24 # must be in days
attributes(rtab)$cutpoints[[2]] = NULL
attributes(rtab)$cutpoints[[3]] = as.date(paste("1Jan", years, sep='')) # must be date
attributes(rtab)$class = "ratetable"
# example from relsurv
rsmul(Surv(time,cens) ~ sex+as.factor(agegr)+
ratetable(age=age*365.24, sex=sex, year=year),
data=rdata, ratetable=rtab, int=1)
Try using the transrate function from the relsurv package to reformat the data. That should give you a compatible dataset.
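For instance, a minimal sketch using the raw data above (the dimnames and yearlim values are my assumptions based on that data; transrate() expects one matrix per sex of yearly death probabilities, with ages as rows and calendar years as columns):
men   <- matrix(raw$qx[raw$sex == 1], nrow = 2, ncol = 2,
                dimnames = list(c("30", "31"), c("1980", "1981")))
women <- matrix(raw$qx[raw$sex == 2], nrow = 2, ncol = 2,
                dimnames = list(c("30", "31"), c("1980", "1981")))
rtab2 <- transrate(men, women, yearlim = c(1980, 1981), int.length = 1)
is.ratetable(rtab2) # should be TRUE for a table built this way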
Regards,
Josh
Three things to add:
You should set attributes(rtab)$factor = c(0,1,0), since sex (the second dimension) is a factor (i.e., doesn't change over time).
A good way to check whether something is a valid rate table is to use the is.ratetable() function. is.ratetable(rtab, verbose = TRUE) will even return a message stating what was wrong.
Check the result of is.ratetable() without verbose first, because the verbose version can report a spurious problem for a perfectly valid rate table.
The rest of this answer is about that false report.
If the type attribute isn't given, is.ratetable will calculate it using the factor attribute; you can see this by just printing the function. However, it seems to do so incorrectly. It uses type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0), where fac is attributes(rtab)$factor.
But the next section, which checks the type attribute if it's provided, says the only valid values are 1, 2, 3, and 4. It's impossible to get 1 from the code above.
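Evaluating the quoted expression for each possible factor value makes the problem concrete (a quick sketch):
fac <- c(0, 1, 2)
1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0)
# [1] 2 5 4 -- so it can never produce 1 (or 3)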
For example, let's examine the slopop ratetable provided with the relsurv package.
library(relsurv)
data(slopop)
is.ratetable(slopop)
# [1] TRUE
is.ratetable(slopop, verbose = TRUE)
# [1] "wrong length for cutpoints 3"
I think this is where your rate table is being hung up.
I'm fairly new to R and I'm struggling to calculate the mean of a single column. RStudio returns the same warning for several different approaches (described further below). I have searched the existing questions, but either they did not ask what I was looking for, or the solution did not work with my data.
My data has different studies as rows and study quality ratings with multiple sub-points as columns.
A simplified version looks like this:
> dd <- data.frame(authoryear = c("Smith, 2020", "Meyer, 2019", "Lim, 2019", "Lowe, 2018"),
+ stqu1 = c(1, 3, 2, 4),
+ stqu2 = c(8, 3, 9, 9),
+ stqu3 = c(1, 1, 1, 2))
> dd
authoryear stqu1 stqu2 stqu3
1 Smith, 2020 1 8 1
2 Meyer, 2019 3 3 1
3 Lim, 2019 2 9 1
4 Lowe, 2018 4 9 2
I calculated the sums of the study quality ratings for each study by rowSums and created a new column in my data frame called "stqu_sum".
Like so:
dd$stqu_sum <- rowSums(subset(dd, select = c(stqu1, stqu2, stqu3)), na.rm = TRUE)
Now I would like to calculate the mean and standard deviation of stqu_sum over all the studies (rows). I googled and found many different ways to do this, but no matter what I try, I get the same warning which I don't know how to fix.
Things I have tried:
#defining stqu_sum as numeric
dd[, stqu_sum := as.numeric(stqu_sum)]
#colMeans
colMeans(dd, select = stqu_sum, na.rm = TRUE)
#sapply
sapply(dd, function(dd) c( "Stand dev" = sd(stqu_sum),
"Mean"= mean(stqu_sum,na.rm=TRUE),
"n" = length(stqu_sum),
"Median" = median(stqu_sum),
))
#data.table
dd[, .(mean_stqu = mean("stqu_sum"), sd_stqu = sd("stqu_sum")),.(variable, value)]
All of these return the same message: object 'stqu_sum' not found. However, the stqu_sum column is shown in the header of my data frame, as seen above.
Can anyone help me to fix this or show me another way to do this?
I hope this is detailed enough. Please let me know if I should add any information.
Thank you in advance!
Is this what you're after? Mean and SD for stqu_sum, using dplyr:
library(dplyr)
dd_summary <- dd %>%
  summarise(mean = mean(stqu_sum),
            SD = sd(stqu_sum))
Gives:
> dd_summary
mean SD
1 11 3.366502
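For completeness, the same summary in base R, with no packages needed:
mean(dd$stqu_sum) # 11
sd(dd$stqu_sum)   # 3.366502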
With data.table, we don't need to quote the column names. Note that dd must be converted to a data.table first, and the .(variable, value) grouping from your attempt has to go, since dd has no such columns:
library(data.table)
setDT(dd) # convert dd to a data.table in place
dd[, .(mean_stqu = mean(stqu_sum), sd_stqu = sd(stqu_sum))]
#    mean_stqu  sd_stqu
# 1:        11 3.366502
I am writing to this forum because I cannot find a solution to my problem. I am trying to represent graphically the median catching time (MCT) of mosquitoes that my team and I have collected (I am currently in an internship studying malaria in Ivory Coast). The MCT is the time by which 50% of the total malaria vectors were caught on humans.
For example, we collected this sample:
Hour of collection / Mosquitoes number:
20H-21H = 1
21H-22H = 1
22H-23H = 2
23H-00H = 2
00H-01H = 13
01H-02H = 10
02H-03H = 15
03H-04H = 15
04H-05H = 8
05H-06H = 10
06H-07H = 6
Here the cumulated total is 83 mosquitoes, so I take the median rank to be (83+1)/2 = 42 (and I cannot even find this number in R), giving a median catching time at 2 am (02H-03H).
Therefore, I have tried the "boxplot" function with different parameters, but I cannot get what I want to represent: I get one box per hour of collection, when I want a representation of the cumulated counts over the collection times. Also, the times used in R are "20H-21H" = 20, "21H-22H" = 21, etc.
I have found an article (Nicolas Moiroux, 2012) which presents the median catching time along with the kind of boxplot I would like to have. I copy the image of the cited boxplot:
[image: Boxplot_Moiroux2012]
Thank you in advance for your help, and I hope that my grammar is fine (I speak and write mainly in French, my mother tongue).
Kind Regards,
Edouard
PS: Regarding the code I have used with this set of data, here it is (with "Eff" = number of mosquitoes and "Heure" = time of collection):
sum(Eff)
as.factor(Heure)
tapply(Eff,Heure,median)
tapply(Heure,Eff,median)
boxplot(Eff,horizontal=T)
boxplot(Heure~Eff)
boxplot(Eff~Heure)
(My skills on R are not very sharp...)
You need a trick here, since you have counts per hour rather than one time value per mosquito caught.
First convert your time bins to a numeric proxy, then expand it into a vector with one entry per mosquito, and then boxplot that vector (with a custom axis).
txt <- "20H-21H = 1
21H-22H = 1
22H-23H = 2
23H-00H = 2
00H-01H = 13
01H-02H = 10
02H-03H = 15
03H-04H = 15
04H-05H = 8
05H-06H = 10
06H-07H = 6"
dat <- read.table(text = txt, sep = "=", header = FALSE)
colnames(dat) <- c("collect_time", "nb_mosquito")
# make a continuous numerical proxy for time
dat$collect_time_num <- 1:nrow(dat)
# get values of proxy according to your data
tvals <- rep(dat$collect_time_num, dat$nb_mosquito)
# plot
boxplot(tvals, horizontal = T, xaxt = "n")
axis(1, labels = as.character(dat$collect_time), at = dat$collect_time_num)
outputs the following plot (a horizontal boxplot of tvals, labelled with the collection hours):
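If you also want the MCT itself rather than reading it off the plot, the same expanded vector gives it directly (a sketch; note that collect_time may carry padding spaces from sep = "="):
mct_bin <- median(tvals)  # median position in the expanded series: bin 7
dat$collect_time[mct_bin] # the 02H-03H bin, i.e. an MCT around 2 am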
So, while lag and lead in dplyr are great, I want to simulate a timeseries of something like population growth. My old school code would look something like:
tdf <- data.frame(time=1:5, pop=50)
for(i in 2:5){
tdf$pop[i] = 1.1*tdf$pop[i-1]
}
which produces
time pop
1 1 50.000
2 2 55.000
3 3 60.500
4 4 66.550
5 5 73.205
I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).
But, something like
tdf <- data.frame(time=1:5, pop=50) %>%
mutate(pop = 1.1*lag(pop))
which would have been my first guess just produces
time pop
1 1 NA
2 2 55
3 3 55
4 4 55
5 5 55
I feel like I'm missing something obvious.... what is it?
Note - this is a trivial example - my real examples use multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so, the tidyverse is proving to be a powerful tool in bringing my simulations together.
Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version written:
library(dplyr)
data.frame(time = 1:5, pop = 50) %>%
  mutate(pop = Reduce(function(x, y) {x * 1.1}, pop, accumulate = TRUE))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
or with purrr,
library(purrr)
data.frame(time = 1:5, pop = 50) %>%
  mutate(pop = accumulate(pop, ~ .x * 1.1))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
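Equivalently (a small sketch), accumulate() also accepts an explicit starting value, so you can accumulate over the growth rates themselves:
accumulate(rep(1.1, 4), `*`, .init = 50)
## [1] 50.000 55.000 60.500 66.550 73.205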
If the starting value of pop is, say, 50, then pop = 50 * 1.1^(0:4) gives the whole series: the starting value followed by the next four. With your code, you could do:
library(dplyr)
data.frame(time=1:5, pop=50) %>%
  mutate(pop = pop * 1.1^(1:n() - 1))
Or,
base = 50
data.frame(time=1:5) %>%
mutate(pop = base * 1.1^(1:n()-1))
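As a quick sanity check (a sketch), the closed form reproduces the loop's output exactly:
all.equal(50 * 1.1^(0:4), c(50, 55, 60.5, 66.55, 73.205))
## [1] TRUE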
Purrr's accumulate function can handle time-varying indices, if you pass them
to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working correctly. The trick here is that accumulate() can work on list as well as vector columns. You can use the tidyr function nest() to group columns into a list vector containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo, simulating logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use this to simulate multiple replicates for a given model using dpylr+purrr+tidyr.
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
# Declare the population growth function. Note: the first two arguments
# have to be .x (the prior state: population plus parameter values) and
# .y (the current row's parameter values and population).
# This example function is a Ricker population growth model.
logistic_growth = function(.x, .y, growth, comp) {
pop = .x$pop[1]
growth = .y$growth[1]
comp = .y$comp[1]
# Note: this uses the state from .x, and the parameter values from .y.
# The first observation will use the first entry in the vector for .x and .y
new_pop = pop*exp(growth - pop*comp)
.y$pop[1] = new_pop
return(.y)
}
# Starting parameters: the number of time steps to simulate, the initial population
# size, and the ecological parameters (growth rate and intraspecific competition rate)
n_steps = 100
pop_init = 1
growth = 0.5
comp = 0.05
#First test: fixed growth rates
test1 = data_frame(time = 1:n_steps,pop = pop_init,
growth=growth,comp =comp)
# Here, the combination of group_by() and nest() splits the data into individual
# time points and gathers all the parameters into a new list column called state.
# ungroup() removes the grouping structure, then accumulate() runs the function
# over the sequence of states. Finally, unnest() transforms it all back into a
# data frame.
out1 = test1 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This is the same example, except I drew the growth rates from a normal distribution
# with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = data_frame(time = 1:n_steps,pop = pop_init,
growth=rnorm(n_steps, growth,0.1),comp=comp)
out2 = test2 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This demonstrates how to use this approach to simulate replicates using dplyr
# Note the crossing function creates all combinations of its input values
test3 = crossing(rep = 1:10, time = 1:n_steps,pop = pop_init, comp=comp) %>%
mutate(growth=rnorm(n_steps*10, growth,0.1))
out3 = test3 %>%
group_by(rep,time)%>%
nest(pop, growth, comp,.key = state)%>%
group_by(rep)%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
print(qplot(time, pop, data=out1)+
geom_line() +
geom_point(data= out2, col="red")+
geom_line(data=out2, col="red")+
geom_point(data=out3, col="red", alpha=0.1)+
geom_line(data=out3, col="red", alpha=0.1,aes(group=rep)))
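For a single time-varying parameter you can skip nest() entirely and accumulate over the rate vector itself; a lighter-weight sketch of the same Ricker update, under the same parameter values as test2:
growth_t <- rnorm(n_steps, growth, 0.1) # time-varying growth rates
pop_t <- accumulate(growth_t[-1],
                    function(pop, g) pop * exp(g - pop * comp),
                    .init = pop_init)   # pop_t[1] is the initial population
head(pop_t)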
The problem here is that dplyr runs this as a set of vector operations rather than evaluating the terms one at a time. Here, 1.1*lag(pop) is interpreted as "calculate the lagged values for all of pop, then multiply them all by 1.1". Since you set pop = 50, the lagged values for all the steps were 50.
dplyr does have some helper functions for sequential evaluation; the standard functions cumsum, cumprod, etc. work, and there are a few new ones (see ?cummean), all of which work within dplyr. In your example, you could simulate the model with:
tdf <- data.frame(time=1:5, pop=50, growth_rate = c(1, rep(1.1, times=4))) %>%
  mutate(pop = pop*cumprod(growth_rate))
  time     pop growth_rate
1    1  50.000         1.0
2    2  55.000         1.1
3    3  60.500         1.1
4    4  66.550         1.1
5    5  73.205         1.1
Note that I added growth rate as a column here, and I set the first growth rate to 1. You could also specify it like this:
tdf <- data.frame(time=1:5, pop=50, growth_rate = 1.1) %>%
  mutate(pop = pop*cumprod(lag(growth_rate, default=1))) # lag, so no growth is applied in the first step
This keeps the growth rate in its own column while reproducing the same series, and makes it explicit that the growth rate refers to growth in the current time step relative to the previous one, without having to hand-set the first entry.
There are limits to how many different simulations you can do this way, but it should be feasible to construct a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.
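For instance, a sketch with a hypothetical time-varying growth-rate column (the values are made up for illustration):
library(dplyr)
set.seed(1)
data.frame(time = 1:5, pop = 50,
           growth_rate = c(1, runif(4, min = 1.05, max = 1.15))) %>%
  mutate(pop = pop * cumprod(growth_rate))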
What about the map functions, i.e.
library(dplyr)
library(purrr)
tdf <- data_frame(time=1:5)
tdf %>% mutate(pop = map_dbl(time, ~ 50 * 1.1^(.x - 1)))
(The exponent is time - 1 so that the series starts at 50, matching the loop above. Note this is again the closed form, not a true recursion.)
My dataset looks something like this
ID YOB ATT94 GRADE94 ATT96 GRADE96 ATT98 .....
1 1975 1 12 0 NA
2 1985 1 3 1 5
3 1977 0 NA 0 NA
4 ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that = 1 if an individual is attending school when they are 19/20 years old, e.g. if YOB = 1988 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!) and struggle to get anything other than an error from any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong. I changed your code a bit, just to add another column to the long-format data table. Here is what I did in the end:
df %>%
  melt(id = c("ID", "DOB")) %>%
  tbl_df() %>%
  mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
so it looks something like e.g.
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
On my phone so I can't check this right now but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column but when the condition does not hold it will be equal to NA
As @Greg Snow mentions, this approach assumes that the column was already created and is equal to zero initially. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
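Equivalently, a one-line sketch that sets the 0s and 1s in a single step:
df$dummy <- as.integer(df$DOB == 1988 & df$ATT98 == 1)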
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately the reshape2 package does exactly that. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)
# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998
# Melt the data into a long format and perform requested analysis
df %>%
melt(id = "DOB") %>%
tbl_df() %>%
mutate(dummy = ifelse((value - DOB) %in% c(19,20), 1, 0)) # parentheses matter: %in% binds tighter than binary minus
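If you then want a single flag per person, one sketch is to add a row id (hypothetical, since the sample data has none) and take any() over each person's melted rows:
df$id <- seq_len(nrow(df)) # hypothetical person id
df %>%
  melt(id = c("id", "DOB")) %>%
  group_by(id) %>%
  summarise(dummy = as.integer(any((value - DOB) %in% c(19, 20))))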
@Warner shows a way to create the variable (or at least the 1's; the assumption is that the column has already been set to 0). Another approach is to not explicitly create a dummy variable, but to have it created for you in the model syntax (what you asked for is essentially an interaction). If running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)
In SQLite I would like to find the standard deviation of the first differences of a (logged) series that I define with GROUP BY. My data provider gives me a daily price series, but I would like to find annualized daily volatility: the standard deviation of daily returns (first differences of the natural log of the series) over each year. I can bring the data into R and use ddply(), but I would like to do this entirely in SQLite. I tried the difference() function from the RSQLite.extfuns package, but my usage is wrong: I expected it to work like diff() in R, and I can't find much documentation.
This generates some data.
stocks <- 5
years <- 5
list.n <- as.list(rep(252, stocks * years))
list.mean <- as.list(rep(0, stocks * years))
list.sd <- as.list(abs(runif(stocks * years, min = 0, max = 0.1)))
list.po <- as.list(runif(n = stocks, min = 25, max = 100))
list.ret <- mapply(rnorm, n = list.n, mean = list.mean, sd = list.sd, SIMPLIFY = F)
my.price <- function(po, ret) po * exp(cumsum(ret))
list.price <- mapply(my.price, po = list.po, ret = list.ret, SIMPLIFY = F)
gvkey <- rep(seq(stocks), each = 252 * years)
day <- rep(seq(252), times = stocks * years)
fyr <- rep(seq(years), times = stocks, each = 252)
data.dly <- data.frame(gvkey, fyr, day, p = unlist(list.price))
Here is how I would do it with ddply() and the result.
# I could do this easily with ddply and subset
library(plyr)
data.dly <- ddply(data.dly, .(gvkey, fyr), transform, vol = sd(diff(log(p))))
data.ann <- subset(data.dly, day == 252)
head(data.ann)
gvkey fyr day p vol
252 1 1 252 86.08568 0.077287182
504 1 2 252 43.32113 0.066741862
756 1 3 252 68.69734 0.084419564
1008 1 4 252 75.37267 0.006003969
1260 1 5 252 17.53583 0.083688727
1512 2 1 252 168.44656 0.035959492
And here is my (failed) SQLite attempt and error.
# but I can't figure it out in SQLite
library(RSQLite)
library(RSQLite.extfuns)
db <- dbConnect(SQLite())
init_extensions(db)
[1] TRUE
dbWriteTable(db, name = "data_dly", value = data.dly)
[1] TRUE
temp <- dbGetQuery(db, "SELECT stdev(difference(log(p))) FROM data_dly GROUP BY gvkey, fyr ORDER BY gvkey, fyr, day")
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: wrong number of arguments to function difference())
Does difference() need a comma separated list of numbers? Can I do this entirely in SQLite? Or do I need to perform in R? Thanks!
Try this where data.dly is the data frame in the post:
library(sqldf)
out <- sqldf("select A.gvkey, A.fyr, stdev(log(A.p) - log(B.p)) vol
from `data.dly` A join `data.dly` B
where A.day = B.day + 1
and A.gvkey = B.gvkey
and A.fyr = B.fyr
group by A.gvkey, A.fyr")
This gives:
> head(out)
gvkey fyr vol
1 1 1 0.09312510
2 1 2 0.01905447
3 1 3 0.01651095
4 1 4 0.06962667
5 1 5 0.05243940
6 2 1 0.03039751
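As a quick cross-check against the ddply result above (a sketch; for the same simulated data the two should agree up to floating-point error):
all.equal(out$vol, data.ann$vol)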
The difference() SQL function takes two character arguments and has a different meaning from R's diff command.
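You can confirm its two-argument signature directly (a sketch, using the db connection from the question):
dbGetQuery(db, "SELECT difference('Smith', 'Smyth')") # takes two strings, not a numeric series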
Retrieve the data with an SQL command, then do the per-group stats in R. (Note: the GROUP BY in your query would collapse each group to a single row; you only need ORDER BY.)
temp <- dbGetQuery(db, "SELECT gvkey, fyr, p FROM data_dly ORDER BY gvkey, fyr, day")
aggregate(p ~ gvkey + fyr, data = temp, FUN = function(x) sd(diff(log(x))))