How do I calculate (age-specific) mortality rates in R?

Using data in the following form, in which ways can I calculate the (age-specific) mortality rate in the R programming language?
head(data)
## age gender zone Class misc bonus duration death cost
## 1 0 M 1 4 12 1 0.1753 0 0
## 2 4 M 3 6 9 1 0.0000 1 0
## 3 5 F 3 3 18 1 0.4548 0 0
## 4 5 F 4 1 25 1 0.1726 0 0
## 5 6 F 2 1 26 1 0.1808 0 0
## 6 9 F 3 3 8 1 0.5425 0 0
That is, for each age I want to calculate the number of deaths and divide by the total number of exposed individuals in that particular age. I tried the following:
n <- length(data$age);
rate <- c(1:n);
for (i in 1:n){
rate[i] <- sum(subset(data, age == i)$death)/ length(subset(data, age == i))
}
But this was useless: obviously not all ages from 1 to n are present in the dataset. I am looking for a program along the lines of the above that will do the job.

Because the variable death only takes on the value of zero or one, you can calculate the age-specific mortality in one line of code.
tapply(data$death, data$age, mean)
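As a minimal sketch of why the one-liner works, the example below uses hypothetical toy data (the `toy` data frame and its values are made up for illustration) and compares `tapply` with a corrected version of the loop from the question. The loop iterates over the ages that actually occur and counts rows with `nrow()`, since `length()` on a data frame returns the number of columns rather than the number of rows.

```r
# Hypothetical toy data: one row per individual, death is 0 or 1
toy <- data.frame(age   = c(0, 4, 5, 5, 6, 9, 9),
                  death = c(0, 1, 0, 1, 0, 0, 0))

# One-line version: the mean of a 0/1 variable is deaths / individuals
rate_tapply <- tapply(toy$death, toy$age, mean)

# Loop version: iterate over the ages that occur and count rows with nrow()
ages <- sort(unique(toy$age))
rate_loop <- numeric(length(ages))
for (i in seq_along(ages)) {
  rows <- toy[toy$age == ages[i], ]
  rate_loop[i] <- sum(rows$death) / nrow(rows)
}
names(rate_loop) <- ages
```

Both vectors hold one rate per observed age, so ages missing from the data never cause a division by zero.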

You can get most of the way there with table(). If we assume that all those not dying are present for 100% of the time (a year, say), and that those dying are present for 1/2 of the time, then we have enough information to calculate exposure from these data. I'm not sure what your duration column is, since you haven't really described the data.
# cheap version of your data:
DF <- data.frame(age = c(0,4,5,5,6,9), death = c(0,1,0,0,0,0))
(DAT <- table(DF$death, DF$age))
#     0 4 5 6 9
#   0 1 0 2 1 1
#   1 0 1 0 0 0
# weight these two rows for components of exposure:
Exposure <- colSums(DAT * c(1,.5))
# rates are the ratio of death counts in each age to exposure to risk in each age:
Rates <- DAT["1",] / Exposure
If you then go on to calculate a lifetable, this is the so-called Mx or mx column.
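If the duration column does record the fraction of the year each individual was under observation (an assumption; the question does not say so), exposure could be taken from it directly instead of using the half-year rule. A minimal sketch with hypothetical values:

```r
# Hypothetical toy data; duration is ASSUMED to be the fraction of a year at risk
toy <- data.frame(age      = c(5, 5, 5, 9, 9),
                  duration = c(1.0, 0.5, 0.25, 1.0, 0.5),
                  death    = c(0, 1, 0, 0, 1))

deaths   <- tapply(toy$death, toy$age, sum)     # death counts per age
exposure <- tapply(toy$duration, toy$age, sum)  # person-years at risk per age
rates    <- deaths / exposure                   # age-specific rates
```

An age whose total duration is zero would give an infinite or undefined rate, so such ages need separate handling.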

Related

Transforming longitudinal data for time-to-event analysis in R

I am trying to reformat longitudinal data for a time to event analysis. In the example data below, I simply want to find the earliest week that the result was “0” for each ID.
The specific issue I am having is how to handle patients who don't convert to 0 and had either all 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but I realize that if someone does not convert to 0, this excludes them (in this example, patient J is excluded). J should instead be present with a 1 in the second column and an 8 in the third column.
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
In keeping with the posting guidelines here, I did try to solve this myself, and I am able to get what I need, though very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the desired dataset, but the code is terrible and not sustainable as I work with other datasets (i.e. datasets with different timeframes, since I have to manually put in maxs[3]==8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 value if one is present; otherwise it returns the index of the maximum value of week.
We then apply it to each group.
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
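To see how select_row behaves on its own, here is a small sketch on two hypothetical vectors (the input values are made up for illustration). `which.max()` applied to a logical vector returns the position of the first TRUE, which is what picks out the earliest 0.

```r
select_row <- function(result, week) {
  if (any(result == 0)) which.max(result == 0) else which.max(week)
}

# A patient who converts: the first 0 is at position 3
converted <- select_row(result = c(1, 1, 0, 2, 0), week = 0:4)

# A patient who never converts: fall back to the row with the latest week
never_converted <- select_row(result = c(1, 2, 1, 1), week = 0:3)
```

Here `converted` is 3 and `never_converted` is 4, so the fallback depends only on the weeks present in the group, not on a hard-coded final week.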

Creating a rank variable by group for a binary response in R

I have a dataset that includes year, month, and day variables and a dichotomous variable for heat wave (Yes=1, No=0). I am trying to rank the heat waves from earliest date to latest date within each year. If the heat wave variable is 0, then I want the rank variable to be 0 as well.
This is what my data looks like and I would like to create the rank variable:
dd <- read.table(text="Year Month Day HW Rank
1999 5 2 0 0
1999 6 1 1 1
1999 6 5 1 2
2000 9 14 1 1
2000 9 15 0 0
2000 10 1 1 2
2001 7 3 0 0
2001 7 24 1 1
2001 8 12 1 2
2001 8 13 1 3", header=T)
Using dplyr you can do
library(dplyr)
dd %>% group_by(Year) %>% arrange(Year, Month, Day) %>%
mutate(Rank2=ifelse(HW!=0, cumsum(HW), 0))
Basically we just "add up" the number of HW flags you have for each given year.
In base R, assuming that the data.frame has been sorted properly by year, you can use ave and cumsum like this:
dd$Rank2 <- with(dd, ave(HW, Year, FUN=function(x) cumsum(x) * x))
Here, ave groups the data by year and applies the function cumsum(x) * x to each group of HW values. Multiplying by x sets the result to 0 wherever HW is 0, regardless of the current value of cumsum. with is used to reduce typing.
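The cumsum(x) * x trick can be checked on a single hypothetical group of flags (the vector below is made up for illustration):

```r
# One hypothetical year of heat-wave flags, already in date order
hw <- c(1, 0, 1, 1, 0, 1)

running <- cumsum(hw)   # 1 1 2 3 3 4: running count of heat waves so far
ranked  <- running * hw # 1 0 2 3 0 4: zeroed wherever the flag itself is 0
```

Rows with a flag of 0 keep a rank of 0, and the running count is unaffected by them because they add nothing to the cumulative sum.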
In the first part, we compute grouped cumsum like in lmo's solution. Then, we use replace to substitute 0 in the indices where dd$HW is 0
replace(x = ave(dd$HW, dd$Year, FUN = cumsum), list = dd$HW == 0, values = 0)
# [1] 0 1 2 1 0 2 0 1 2 3
Here is an option using data.table
library(data.table)
setDT(dd)[, Rank2 := cumsum(HW)*HW, Year]
dd$Rank2
#[1] 0 1 2 1 0 2 0 1 2 3

match and add the cluster number to the original data

I am using the regular method to do a Hierarchical Clustering project.
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.98)
mydata.df <- as.data.frame(inspect(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
groups <- cutree(fit, k=10)
groups
congestion cough ear eye fever flu fluzonenon medicare painpressure physical pink ppd pressure
1 2 3 4 5 6 5 5 5 7 4 8 5
rash screening shot sinus sore sports symptoms throat uti
5 5 6 1 9 7 5 9 10
What I want is to put the group number back into a new column in the original data.
I've looked at approximate string matching within single list - r
Because the df here is a document matrix, what I got after df <- t(data.frame(mydata.df.scale,cutree(fit,k=10))) is a matrix like
df[1:5,1:5]
congestion cough ear eye fever
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 0
Since eye has group number 4 in the cutree output above, I want to add the number 4 to the new column in the 4th row.
Note that in this case a single document can be mapped to two items in the same group.
df[23,17:21]
sinus sore sports symptoms throat
0 1 0 0 1
Instead of putting the number back directly, I use the 0-1 matrix:
label_back <-t(data.frame(mydata.df,cutree(fit,k=10)))
row.names(label_back) <- NULL
#label_back<-label_back[1:(nrow(label_back)-1),]# the last line is the sum
groups.df<-as.data.frame(groups)
groups.df$label<-rownames(groups.df)
for (i in 1:length((colnames(label_back)))){
ind<-which(colnames(label_back)[i]==groups.df$label) # match names and return index
label_back[,i]=groups.df$groups[ind]*label_back[,i] # multiply the 0-1 column by its group number
}
Then I find the max value in each row, because some rows contain more than one value.
data_group<-rep(0,nrow(data))
for (i in 1:nrow(data)){
data_group[i]<-max(unique(label_back[i,]))
}
data$group<-data_group
I am looking for a more elegant way.
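One possible vectorised sketch of the two loops above, assuming a 0-1 matrix with one row per document and one column per term, and a named vector of group numbers such as cutree returns. The matrix `m` and the `groups` values below are hypothetical stand-ins, not the poster's data.

```r
# Hypothetical 0-1 document-by-term matrix (rows = documents, columns = terms)
m <- matrix(c(0, 0, 0, 1,
              0, 1, 0, 0,
              0, 0, 0, 0,
              1, 0, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("congestion", "cough", "ear", "eye")))

# Hypothetical group numbers, named by term as cutree() output would be
groups <- c(congestion = 1, cough = 2, ear = 3, eye = 4)

# Multiply each column by its term's group number, matching by column name
labelled <- sweep(m, 2, groups[colnames(m)], `*`)

# Take the largest group number in each row; 0 means no clustered term occurred
doc_group <- apply(labelled, 1, max)
```

Like the original loop, this keeps only the largest group number when a document contains terms from several groups, which is a design choice rather than a requirement.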

R: How to sum two separate values of two variables?

I have data with 7320 observations of 3 variables: two age groups and the contact number between them. For example:
ageGroup ageGroup1 mij
0 0 0.012093847617507
0 1 0.00510485237464309
0 2 0.00374919082969427
0 3 0.00307241431437433
0 4 0.00254487083293498
0 5 0.00213734013959765
0 6 0.00182565778959543
0 7 0.00159036659169942
1 0 0.00475097494199872
1 1 0.00748329237103462
1 2 0.00427123298868537
1 3 0.00319622224196792
1 4 0.00287522072903812
1 5 0.00257773394696414
1 6 0.00230322568677366
1 7 0.00205265986733139
and so on up to age group 86. I have to calculate the mean contact number (mij) between age groups. For example, ageGroup = 0 contacts ageGroup1 = 1 with one mij value, and ageGroup = 1 contacts ageGroup1 = 0 with another. I need to sum these two values and divide by 2 to get the average between them. Could you give me a hint on how to do that across the whole dataset?
Use ddply from the plyr package (assuming your data frame is data)
library(plyr)
ddply(data,.(ageGroup,ageGroup1),summarize,sum.mij=sum(mij))
ageGroup ageGroup1 sum.mij
1 0 0 0.012093848
2 0 1 0.005104852
3 0 2 0.003749191
4 0 3 0.003072414
5 0 4 0.002544871
6 0 5 0.002137340
7 0 6 0.001825658
8 0 7 0.001590367
9 1 0 0.004750975
10 1 1 0.007483292
11 1 2 0.004271233
12 1 3 0.003196222
13 1 4 0.002875221
14 1 5 0.002577734
15 1 6 0.002303226
16 1 7 0.002052660
I think I see what you're trying to do here. You want to treat interactions between the two ageGroup columns as being non-directional and get the mean interaction? The code below should do this using base R functions.
Note that since the example dataset is truncated, it will only give a correct answer for the group with index 01. However if you run with the full dataset, it should work for all interactions.
# Create the data frame
df=read.table(header=T,text="
ageGroup,ageGroup1,mij
0,0,0.012093848
0,1,0.005104852
0,2,0.003749191
0,3,0.003072414
0,4,0.002544871
0,5,0.00213734
0,6,0.001825658
0,7,0.001590367
1,0,0.004750975
1,1,0.007483292
1,2,0.004271233
1,3,0.003196222
1,4,0.002875221
1,5,0.002577734
1,6,0.002303226
1,7,0.00205266
",sep=",")
df
# Using the strSort function from this SO answer:
# http://stackoverflow.com/questions/5904797/how-to-sort-letters-in-a-string-in-r
strSort <- function(x)
sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")
# Label each of the i-j interactions and j-i interactions with an index ij
# e.g. anything in ageGroup=1 interacting with ageGroup1=0 OR ageGroup=0 interacting with ageGroup1=1
# are labelled with index 01
df$ind=strSort(paste(df$ageGroup,df$ageGroup1,sep=""))
# Use the tapply function to get mean interactions for each group as suggested by Paul
tapply(df$mij,df$ind,mean)
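One caveat with pasting the two labels together and sorting the characters: it is only unambiguous for single-digit groups. With groups running up to 86, the pairs (1, 23) and (12, 3) would both end up with the index 123. A minimal alternative sketch builds the key from the smaller and larger group number with a separator between them. The `contacts` data frame, the `pair` column and the mij values for the two-digit groups are hypothetical.

```r
# Hypothetical data including two-digit age groups
contacts <- data.frame(ageGroup  = c(0, 1, 1, 12, 23, 3),
                       ageGroup1 = c(1, 0, 23, 3, 1, 12),
                       mij       = c(0.005104852, 0.004750975, 0.2, 0.4, 0.6, 0.8))

# Direction-free key: smaller group first, larger second, with a separator
contacts$pair <- paste(pmin(contacts$ageGroup, contacts$ageGroup1),
                       pmax(contacts$ageGroup, contacts$ageGroup1),
                       sep = "-")

# Mean contact number for each unordered pair of age groups
pair_mean <- tapply(contacts$mij, contacts$pair, mean)
```

Diagonal pairs such as 0-0 appear only once in the data, so their mean is simply the original value.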

Reshaping wide dataset in interval format

I am working on a "wide" dataset, and now I would like to use a specific package (-msSurv-, for non-parametric multistate models) which requires data in interval form.
My current dataset is characterized by one row for each individual:
dat <- read.table(text = "
id cohort t0 s1 t1 s2 t2 s3 t3
1 2 0 1 50 2 70 4 100
2 1 0 2 15 3 100 0 0
", header=TRUE)
where cohort is a time-fixed covariate, and s1-s3 correspond to the values that a time-varying covariate s = 1,2,3,4 takes over time (they are the distinct states visited by the individual over time). Calendar time is defined by t1-t3, and ranges from 0 to 100 for each individual.
So, for instance, individual 1 stays in state = 1 up to calendar time = 50, then he stays in state = 2 up to time = 70, and finally he stays in state = 4 up to time 100.
What I would like to obtain is a dataset in "interval" form, that is:
id cohort t.start t.stop start.s end.s
1 2 0 50 1 2
1 2 50 70 2 4
1 2 70 100 4 4
2 1 0 15 2 3
2 1 15 100 3 3
I hope the example is sufficiently clear, otherwise please let me know and I will try to further clarify.
How would you automate this reshaping? Bear in mind that I have a relatively large number of (simulated) individuals, around 1 million.
Thank you very much for any help.
I think I understand. Does this work?
require(data.table)
dt <- data.table(dat, key=c("id", "cohort"))
dt.out <- dt[, list(t.start=c(t0,t1,t2), t.stop=c(t1,t2,t3),
start.s=c(s1,s2,s3), end.s=c(s2,s3,s3)),
by = c("id", "cohort")]
# id cohort t.start t.stop start.s end.s
# 1: 1 2 0 50 1 2
# 2: 1 2 50 70 2 4
# 3: 1 2 70 100 4 4
# 4: 2 1 0 15 2 3
# 5: 2 1 15 100 3 0
# 6: 2 1 100 0 0 0
If the output you show is indeed right and is what you require, then you can obtain with two more lines (not the best way probably, but it should nevertheless be fast)
# remove rows where start.s and end.s are both 0
dt.out <- dt.out[start.s > 0 | end.s > 0]
# replace end.s values with corresponding start.s values where end.s == 0
# this works with pmax(start.s, end.s) because end.s >= start.s otherwise
dt.out[, end.s := pmax(start.s, end.s)]
dt.out
# id cohort t.start t.stop start.s end.s
# 1: 1 2 0 50 1 2
# 2: 1 2 50 70 2 4
# 3: 1 2 70 100 4 4
# 4: 2 1 0 15 2 3
# 5: 2 1 15 100 3 3
