The typical preparation steps for mstate involve converting "wide" format data (one row per 'patient') into "multi-state" format data (multiple rows per 'patient', one for each possible transition in the multi-state model).
For example, data in wide format:
library(mstate)
data(ebmt4)
ebmt <- ebmt4
> head(ebmt)
id rec rec.s ae ae.s recae recae.s rel rel.s srv srv.s year agecl proph match
1 1 22 1 995 0 995 0 995 0 995 0 1995-1998 20-40 no no gender mismatch
2 2 29 1 12 1 29 1 422 1 579 1 1995-1998 20-40 no no gender mismatch
3 3 1264 0 27 1 1264 0 1264 0 1264 0 1995-1998 20-40 no no gender mismatch
4 4 50 1 42 1 50 1 84 1 117 1 1995-1998 20-40 no gender mismatch
5 5 22 1 1133 0 1133 0 114 1 1133 0 1995-1998 >40 no gender mismatch
6 6 33 1 27 1 33 1 1427 0 1427 0 1995-1998 20-40 no no gender mismatch
Is converted to multi-state format:
tmat <- transMat(x = list(c(2, 3, 5, 6), c(4, 5, 6), c(4, 5, 6), c(5, 6), c(), c()), names = c("Tx", "Rec", "AE", "Rec+AE", "Rel", "Death"))
msebmt <- msprep(data = ebmt, trans = tmat, time = c(NA, "rec", "ae", "recae", "rel", "srv"), status = c(NA, "rec.s", "ae.s", "recae.s", "rel.s", "srv.s"), keep = c("match", "proph", "year", "agecl"))
> head(msebmt)
An object of class 'msdata'
Data:
id from to trans Tstart Tstop time status match proph year agecl
1 1 1 2 1 0 22 22 1 no gender mismatch no 1995-1998 20-40
2 1 1 3 2 0 22 22 0 no gender mismatch no 1995-1998 20-40
3 1 1 5 3 0 22 22 0 no gender mismatch no 1995-1998 20-40
4 1 1 6 4 0 22 22 0 no gender mismatch no 1995-1998 20-40
5 1 2 4 5 22 995 973 0 no gender mismatch no 1995-1998 20-40
6 1 2 5 6 22 995 973 0 no gender mismatch no 1995-1998 20-40
But what if my original dataset has time-varying covariates (i.e. long format) and I want to convert it into multi-state format? All of the tutorials I have found online only cover converting initially wide data to multi-state data (not initially long data); for example, the mstate package vignette.
So, let's say I have the data df below, where id identifies a 'patient', (start, stop] gives the time period, state is the state the patient is in at the end of the time period, and tv.cov is their time-varying covariate (assumed constant over the time period). Note that only patient id=5 has three entries, and that this patient's tv.cov changes.
id start stop state tv.cov
1 0 1 1 1
2 0 4 1 2
3 0 7 1 1
4 0 10 1 5
5 0 6 1 4
5 6 10 2 10
5 10 15 3 12
Assuming the basic "illness-death" transition model:
tmat <- mstate::trans.illdeath(names = c("healthy", "sick", "death"))
> tmat
to
from healthy sick death
healthy NA 1 2
sick NA NA 3
death NA NA NA
How can I prep df into multi-state format?
As a hack, should I set up the data in "wide" format, convert it into "multi-state" format using msprep, and then join another data frame onto it that contains the time-varying covariates for each patient at each time interval?
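One possible way to build the multi-state rows directly from the long data is sketched below in base R. This is not an mstate feature: the helper column from and the result name ms_long are made up for illustration, the transition matrix is typed in by hand to match the tmat printed above, and the sketch assumes that the state occupied during an interval is the state recorded at the end of the patient's previous interval (state 1 for the first interval).

```r
# Long-format data from the question
df <- data.frame(
  id     = c(1, 2, 3, 4, 5, 5, 5),
  start  = c(0, 0, 0, 0, 0, 6, 10),
  stop   = c(1, 4, 7, 10, 6, 10, 15),
  state  = c(1, 1, 1, 1, 1, 2, 3),
  tv.cov = c(1, 2, 1, 5, 4, 10, 12)
)

# Illness-death transition matrix, written out by hand so that the sketch
# does not need mstate installed
tmat <- matrix(c(NA,  1,  2,
                 NA, NA,  3,
                 NA, NA, NA),
               nrow = 3, byrow = TRUE,
               dimnames = list(from = c("healthy", "sick", "death"),
                               to   = c("healthy", "sick", "death")))

# Assumption: the state occupied during an interval is the state at the end
# of the previous interval of the same patient (state 1 for the first one)
df$from <- ave(df$state, df$id, FUN = function(s) c(1, head(s, -1)))

# One output row per interval and per transition allowed out of `from`
ms_long <- do.call(rbind, lapply(seq_len(nrow(df)), function(r) {
  to <- unname(which(!is.na(tmat[df$from[r], ])))
  if (length(to) == 0) return(NULL)  # absorbing state: no transition at risk
  data.frame(id     = df$id[r],
             from   = df$from[r],
             to     = to,
             trans  = unname(tmat[df$from[r], to]),
             Tstart = df$start[r],
             Tstop  = df$stop[r],
             time   = df$stop[r] - df$start[r],
             status = as.integer(to == df$state[r]),  # 1 if this transition occurred
             tv.cov = df$tv.cov[r])
}))
rownames(ms_long) <- NULL
ms_long
```

Each row keeps the tv.cov value of its own interval, so no separate join is needed. The result is a plain data frame rather than an msdata object; a counting-process model such as survival::coxph(Surv(Tstart, Tstop, status) ~ tv.cov + strata(trans), data = ms_long) could in principle be fitted on it, although the toy data here are far too small for that.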
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that contains a 1 for each change in region. This compares the current value of region with the previous value (using lag). Note that if the previous value is NA (as it is for the first row of a given i), it is counted as no change.
Same approach is taken for urban. Then, summarize totaling up all the changes for each i. I left in these temporary variables so you can examine if you are getting the results desired.
Edit: If you wish to remove rows that have NA for region or urban you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
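As a cross-check that does not need dplyr, the same count can be sketched in base R on the i = 4 rows from the question. The helper name count_changes and the data frame d4 are made up for this example. Unlike drop_na(region, urban), the helper drops NAs separately for each column; for these rows the two approaches coincide.

```r
# region and urban for i = 4, copied from the question (17 rows, 1979-1996)
d4 <- data.frame(
  i      = 4,
  region = c(1, NA, 4, 4, rep(1, 11), NA, 4),
  urban  = c(1, NA, rep(1, 13), NA, 1)
)

# Hypothetical helper: drop NAs, then count positions where a value differs
# from the one before it
count_changes <- function(x) {
  x <- x[!is.na(x)]
  sum(head(x, -1) != tail(x, -1))
}

change_reg   <- tapply(d4$region, d4$i, count_changes)
change_urban <- tapply(d4$urban,  d4$i, count_changes)
data.frame(i = names(change_reg),
           change_reg = as.vector(change_reg),
           change_urban = as.vector(change_urban))
```

For i = 4 this gives 3 region changes (1 to 4, 4 to 1, 1 to 4) and 0 urban changes, matching the expected output in the question.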
I am attempting to use Random Forest. The training data has 7000 observations with 12 variables. These variables include both categorical and continuous variables. When I submit the code I receive the following warning:
Warning message: In randomForest.default(m, y, ...) : The response has five or fewer unique values. Are you sure you want to do regression?
The data is structured as such:
CustomerId CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
15634602 619 France Female 42 2 0 1 1 1 101348.88 1
15647311 608 Spain Female 41 1 83807.86 1 0 1 112542.58 0
15619304 502 France Female 42 8 159660.8 3 1 0 113931.57 1
15701354 699 France Female 39 1 0 2 0 0 93826.63 0
15737888 850 Spain Female 43 2 125510.82 1 1 1 79084.1 0
15574012 645 Spain Male 44 8 113755.78 2 1 0 149756.71 1
15592531 822 France Male 50 7 0 2 1 1 10062.8 0
15656148 376 Germany Female 29 4 115046.74 4 1 0 119346.88 1
15792365 501 France Male 44 4 142051.07 2 0 1 74940.5 0
Based on research, I have attempted to change variables to factors, but this has not corrected the issue.
The random forest model code that I am using is as follows:
rfModel=randomForest(Exited~.,data=train)
I have been unable to proceed past the warning to this point.
Try converting the outcome variable to a factor. For example, if the outcome variable in train is y (in the question it is Exited),
then run this before fitting the model:
train$y <- as.factor(train$y)
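Applied to the question's own column names, the fix might look like the sketch below. The small train data frame is typed in from a few columns of the sample rows in the question and is only there to make the example self-contained; the model call is guarded so the snippet still runs when the randomForest package is not installed.

```r
# A few columns of the sample rows from the question (illustration only)
train <- data.frame(
  CreditScore = c(619, 608, 502, 699, 850, 645, 822, 376, 501),
  Geography   = factor(c("France", "Spain", "France", "France", "Spain",
                         "Spain", "France", "Germany", "France")),
  Age         = c(42, 41, 42, 39, 43, 44, 50, 29, 44),
  Balance     = c(0, 83807.86, 159660.8, 0, 125510.82,
                  113755.78, 0, 115046.74, 142051.07),
  Exited      = c(1, 0, 1, 0, 0, 1, 0, 1, 0)
)

# A numeric 0/1 response is treated as a regression target; a factor
# response makes randomForest fit a classification forest instead
train$Exited <- as.factor(train$Exited)

if (requireNamespace("randomForest", quietly = TRUE)) {
  rfModel <- randomForest::randomForest(Exited ~ ., data = train)
  print(rfModel$type)  # reports whether a classification or regression forest was fitted
}
```

Converting only the predictors to factors does not help, because the warning is about the response. An identifier column such as CustomerId is usually best left out of the formula as well.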
After a merging process, I got a data frame that looks like:
df <- data.frame(trip=c(315,328,422,422,458,652,652,652,699),
catch_kg=c(10,8,12,2,26,4,18,14,11),
age_1=c(0,0,0,0,0,0,0,0,0),
age_2=c(2,1,7.5,7.5,8,11,11,11,13),
id=c(1,2,3,3,4,5,5,5,6))
trip catch_kg age_1 age_2 id
315 10 0 2 1
328 8 0 1 2
422 12 0 7.5 3
422 2 0 7.5 3
458 26 0 8 4
652 4 0 11 5
652 18 0 11 5
652 14 0 11 5
699 11 0 13 6
where trip represents the fishing trip, catch_kg the amount of caught fish (in kg), age_1 and age_2 are the number of individuals in each trip and per age group, and id represents the haul identity in each trip.
In some fishing trips I have more than 1 haul - this can be accessed in the id column, where trips with more than 1 haul have the same id number. For instance: trip number 422 has two hauls (id=3).
At this very moment, for a trip with more than 1 haul, I have that the number of individuals within each age group is equally divided by the number of hauls that appears within that specific trip. For example, in trip 422 I have a total of 15 individuals, but since there are 2 hauls, this number was divided by 2 leading to 7.5 individuals per haul.
What I would like, however, is to compute the number of individuals within each age group as a proportion of the total catch in each haul group.
Thus, at the end I would like to have a data frame that looks like:
trip catch_kg age_1 age_2 id
315 10 0 2 1
328 8 0 1 2
422 12 0 13 3
422 2 0 2 3
458 26 0 8 4
652 4 0 4 5
652 18 0 16 5
652 14 0 13 5
699 11 0 13 6
This is basically a rule of three calculation, where for trip 422 (2 hauls), for instance, I would have the following calculation:
haul1: 12*(7.5 + 7.5)/(12 + 2) = 13 individuals
haul2: 2*(7.5 + 7.5)/(12 + 2) = 2 individuals
Is there an easy way to compute these calculations?
Any help would be much appreciated.
-M
You could use dplyr to help with this
library(dplyr)
df %>% group_by(trip) %>%
mutate(age_2=catch_kg/sum(catch_kg)*sum(age_2))
# trip catch_kg age_1 age_2 id
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 315 10 0 2.000000 1
# 2 328 8 0 1.000000 2
# 3 422 12 0 12.857143 3
# 4 422 2 0 2.142857 3
# 5 458 26 0 8.000000 4
# 6 652 4 0 3.666667 5
# 7 652 18 0 16.500000 5
# 8 652 14 0 12.833333 5
# 9 699 11 0 13.000000 6
Not sure exactly what rounding rule you were using to get to integer counts of individuals, but you'd likely run into trouble with parts not adding up to wholes in more complicated scenarios.
Another solution using data.table:
library(data.table)
setDT(df)
df[, age_2 := catch_kg * sum(age_2) / sum(catch_kg), trip]
# trip catch_kg age_1 age_2 id
#1: 315 10 0 2.000000 1
#2: 328 8 0 1.000000 2
#3: 422 12 0 12.857143 3
#4: 422 2 0 2.142857 3
#5: 458 26 0 8.000000 4
#6: 652 4 0 3.666667 5
#7: 652 18 0 16.500000 5
#8: 652 14 0 12.833333 5
#9: 699 11 0 13.000000 6
If you want you can round age_2 with round(): age_2 := round(catch_kg * sum(age_2) / sum(catch_kg))
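For completeness, the same proportional split can be sketched in base R with ave(), together with a check that the total per trip is preserved before rounding. The column name age_2_new and the object names tot_before and tot_after are made up for this example.

```r
df <- data.frame(trip     = c(315, 328, 422, 422, 458, 652, 652, 652, 699),
                 catch_kg = c(10, 8, 12, 2, 26, 4, 18, 14, 11),
                 age_1    = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
                 age_2    = c(2, 1, 7.5, 7.5, 8, 11, 11, 11, 13),
                 id       = c(1, 2, 3, 3, 4, 5, 5, 5, 6))

# ave() returns the per-trip sum repeated on every row of that trip
trip_age   <- ave(df$age_2,    df$trip, FUN = sum)
trip_catch <- ave(df$catch_kg, df$trip, FUN = sum)

# Share of the trip's catch taken in this haul, times the trip's individuals
df$age_2_new <- df$catch_kg / trip_catch * trip_age

# The per-trip totals are unchanged by the redistribution
tot_before <- tapply(df$age_2,     df$trip, sum)
tot_after  <- tapply(df$age_2_new, df$trip, sum)
all.equal(tot_before, tot_after)
```

The totals agree before rounding; after round() they are no longer guaranteed to, which is the caveat raised in the first answer.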
Imagine you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it this way:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
This allows me to complete my ultimate goal of compiling a bar chart in the way I want it:
barplot(con_P)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array!
Through another question here on this forum, I learned that the following will help me to overcome this issue:
con_P <- lapply(con, function(x) x/sum(x))
However, if I now run
barplot(con_P)
R does not create a barplot: Error in -0.01 * height : non-numeric argument to binary operator. I assume this is because con_P is not an array!
My question is what to do now (how would I transform con_P in the second example into an array?). Secondly, how can I make the entire step of creating prop.tables and then a bar chart more efficient? Any help is much appreciated.
This can be done by converting the columns to factors with the levels specified. In the second example the Water column contains only 1s, so table() returns results of different lengths and apply() gives back a list instead of a matrix. Setting the levels to 0:1 makes every table the same length; we then convert to proportions with prop.table and draw the barplot:
barplot(prop.table(sapply(df[2:4],
function(x) table(factor(x, levels=0:1))),2))
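The difference between the two approaches can be seen in a small self-contained sketch that uses the three drink columns of the second data set. The object names drinks, con_list and con_mat are made up for this example.

```r
# Wine, Beer and Water from the second data set (Water is all 1s)
drinks <- data.frame(Wine  = c(0, 1, 0, 1, 1, 1, 0, 0, 0, 0),
                     Beer  = c(1, 0, 0, 0, 1, 1, 1, 1, 1, 1),
                     Water = rep(1, 10))

# Without fixed levels the Water table has one cell and the others have two,
# so apply() cannot simplify the results and returns a list
con_list <- apply(drinks, 2, table)
is.list(con_list)

# With levels fixed at 0:1 every table has two cells and sapply() returns a
# 2 x 3 matrix, which prop.table() and barplot() both accept
con_mat <- sapply(drinks, function(x) table(factor(x, levels = 0:1)))
con_P   <- prop.table(con_mat, 2)
con_P
# barplot(con_P) works at this point because con_P is a numeric matrix
```

The Water column now has an explicit count of 0 for level 0, which is what keeps the result rectangular.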
Reproducing your data:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
con <-lapply(df[,c(2:4)], table)
con_P <- lapply(con, function(x) x/sum(x))
You can use reshape2 to melt the data:
library(reshape2)
df <- melt(con_P)
Now, if you want to use ggplot2, you can use df to draw the bar plot:
library(ggplot2)
ggplot(df, aes(x = L1, y = value, fill = factor(Var1) )) +
geom_bar(stat= "identity") +
theme_bw()
If you want to use barplot you can reshape the data.frame into an array:
array <- acast( df, Var1~L1)
array[is.na(array)] <- 0
barplot(array)
I am trying to figure out a way to loop through my data frame and replace any AGE value greater than 200 with a decimal value.
Here is my code:
for (i in data$AGE) if (i > 199) i <- i*.01-2
Here is a head() sample of my data frame:
AGE LOC RACE SEX WORKREL PROD1 ICD10 INJ_ST DTH_YEAR DTH_MONTH DTH_DAY ACC_YEAR ACC_MONTH ACC_DAY
1 26 5 1 1 0 1290 V865 UT 2003 1 1 2002 12 31
2 20 1 7 2 0 1899 X47 HI 2003 1 1 2003 1 1
3 202 1 2 2 0 1598 W75 FL 2003 1 1 2003 1 1
4 86 5 1 2 0 1807 W18 FL 2003 1 1 2002 12 14
5 203 1 2 1 0 1598 W75 GA 2003 1 1 2003 1 1
6 79 0 1 2 2 921 X49 MA 2003 1 1 NA NA NA
So basically, if the value of AGE is greater than 200, then I want to multiply that value by .01 and then subtract 2.
The reason is that any value of 200 or greater is an age in months.
I'm not a Stats or R genius so my humble thanks in advance for all advice.
data$AGE[data$AGE> 200] <- data$AGE[data$AGE > 200] * 0.01 - 2
You can do this reasonably elegantly with within and replace:
data <- within(data, AGE <- replace(AGE, AGE > 200, AGE[AGE>200] * 0.01-2))
Or using data.table for memory efficiency and syntax elegance
library(data.table)
DT <- as.data.table(data)
# make sure that AGE is numeric not integer
DT[,AGE:= as.numeric(AGE)]
DT[AGE>200, AGE := AGE *0.01 -2]
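To see why the loop in the question has no effect, and how a vectorised ifelse() compares, here is a minimal sketch using the six AGE values from the head() sample. The name unchanged is made up for the example.

```r
data <- data.frame(AGE = c(26, 20, 202, 86, 203, 79))

# The loop from the question: `i` is a copy of each element, so assigning to
# it never modifies data$AGE
for (i in data$AGE) if (i > 199) i <- i * 0.01 - 2
unchanged <- data$AGE

# Vectorised alternative: the condition is evaluated element by element and
# only values above 200 are transformed
data$AGE <- ifelse(data$AGE > 200, data$AGE * 0.01 - 2, data$AGE)
data$AGE
```

After the loop, data$AGE is still the original vector; after ifelse(), 202 and 203 become approximately 0.02 and 0.03, and the other values are left alone.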