Create missing observations in panel data - r

I am working on panel data with a unique case identifier and a column for the time points of the observations (long format). There are both time-constant variables and time-varying observations:
id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2
For my model I now need data with complete records per id for each time point. In other words, if an observation is missing I still need to put in a row with id, time, time-constant variables, and NA for the observed variables (as would be the line (102, 2, "female", NA) in the above example). So my question is:
How can I find out if a row with unique combination of id and time already exists in my dataset?
If not, how can I add this row, carry over time-constant variables and fill the observations with NA?
Would be great if someone could shed some light on this.
Thanks a lot in advance!
EDIT
Thank you everyone for your replies. Here is what I finally did, which is a mix of several suggested approaches. The thing is that I have several time-varying variables (obs1-obsn) per row and I did not get dcast to accommodate that - value.name does not take more than one argument.
# create all possible combinations of id and time
iddat <- expand.grid(id = unique(dataset$id), time = c(1996, 1999, 2002, 2005, 2008, 2011))
iddat <- iddat[order(iddat$id, iddat$time), ]
# add combinations to existing data; combinations missing so far become NA
dataset_new <- merge(dataset, iddat, all = TRUE, by = c("id", "time"))
# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)
# merge back time-constant variables from original data
# (keep id for the merge, and deduplicate so each id contributes a single
# row; otherwise the merge multiplies rows and creates duplicates)
temp <- unique(dataset[c("id", "tc1", "tc2", "tc3")])
dataset_new <- merge(dataset_new, temp, by = "id")
# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
rm(temp)
rm(iddat)
All the best and thanks again, Matt
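For readers on newer tooling: the whole expand-and-carry step above can be done in one call with tidyr::complete(), using nesting() to keep the time-constant variables tied to id. A sketch using the example data from the question (column names assumed to match):

```r
library(tidyr)

df <- data.frame(
  id   = c(101, 101, 101, 102, 102, 103),
  time = c(1, 2, 3, 1, 3, 1),
  tc1  = c("male", "male", "male", "female", "female", "male"),
  obs1 = c(4, 5, 3, 6, 2, 2)
)

# nesting(id, tc1) keeps tc1 glued to its id; every id/tc1 pair is then
# crossed with all observed time values, and missing obs1 become NA
df_complete <- complete(df, nesting(id, tc1), time)
```

This produces one row per id and time (9 rows here), with tc1 carried over and obs1 set to NA for the missing combinations.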

You could create an empty dataset and then merge in the records in which you have matches.
# Create dataset. For your actual data, you would replace c(1:3) with
# c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
id <- rep(c(1:3), each = 3)
time <- rep(c(1:3), 3)
df <- data.frame(id,time)
test <- df[c(1,3,5,7,9),]
test$tc1 <- c("male", "male", "female", "male", "male")
test$obs1 <-c(4,5,3,6,2)
merge(df, test, by = c("id", "time"), all.x = TRUE)
The result:
id time tc1 obs1
1 1 1 male 4
2 1 2 <NA> NA
3 1 3 male 5
4 2 1 <NA> NA
5 2 2 female 3
6 2 3 <NA> NA
7 3 1 male 6
8 3 2 <NA> NA
9 3 3 male 2

There are probably more elegant ways, but here's one option. I'm assuming that you need all combinations of id and time but not tc1 (i.e. tc1 is tied to id).
# your data
df <- read.table(text = " id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2", header = TRUE)
First cast your data to wide format to introduce NAs, then convert back to long.
library('reshape2')
df_wide <- dcast(
df,
id + tc1 ~ time,
value.var = "obs1",
fill = NA
)
df_long <- melt(
df_wide,
id.vars = c("id","tc1"),
variable.name = "time",
value.name = "obs1"
)
# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
id tc1 time obs1
1 101 male 1 4
4 101 male 2 5
7 101 male 3 3
2 102 female 1 6
5 102 female 2 NA
8 102 female 3 2
3 103 male 1 2
6 103 male 2 NA
9 103 male 3 NA
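One caveat with this route: reshape2's melt() returns the former wide-format column names as a factor, so time comes back as a factor rather than a number. A small sketch of the conversion you may want before modelling:

```r
# after melt(), time holds the old column names as a factor, e.g.:
df_long <- data.frame(time = factor(c("1", "2", "3")))

# convert via character first; as.integer() on a factor directly
# would return the internal level codes, not the values
df_long$time <- as.integer(as.character(df_long$time))
```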

Related

Applying custom formats to numeric variables in R

I have a df that uses numbers to refer to categorical data and a CSV that defines the categories (1=Smoker, 2=Non-Smoker, etc). In SAS, I was able to convert the format CSV into a format file and apply these formats to the variables:
data want;
set have;
formatted = put(varX,custFormat.);
run;
This would provide me with the output:
varX formatted
1 Smoker
2 Non-Smoker
3 Occasional Smoker
1 Smoker
Given that I have a csv with all the formats, I could bring this in and merge to my R df to have the formats in a new column:
print(have)
varX
1
2
3
1
print(format.file)
formatIndex group
1 Smoker
2 Non-Smoker
3 Occasional Smoker
11 Female
12 Male
13 Unknown
df.format <- merge(have, format.file, by.x = "varX", by.y = "formatIndex")
print(df.format)
varX group
1 Smoker
2 Non-Smoker
3 Occasional Smoker
1 Smoker
The issue with a join approach is I often want to apply the same formats for many columns (i.e. varX, varY, and varZ all use different formatIndex). Is there a similar method of applying formats to variables as SAS has?
You could use plyr::mapvalues inside dplyr's across().
Example:
df <- data.frame(V1 = c(1,2,3,4),
V2 = c(2,3,1,3))
V1 V2
1 1 2
2 2 3
3 3 1
4 4 3
liste_format <- data.frame(ID = c(1,2,3,4),
group = c("Smoker","Non-Smoker","Occasional Smoker","Unknown"))
ID group
1 1 Smoker
2 2 Non-Smoker
3 3 Occasional Smoker
4 4 Unknown
library(dplyr)
df |>
mutate(across(V1:V2,
~ plyr::mapvalues(.,
from = liste_format$ID,
to = liste_format$group,
warn_missing = FALSE),
.names = "format_{.col}"))
V1 V2 format_V1 format_V2
1 1 2 Smoker Non-Smoker
2 2 3 Non-Smoker Occasional Smoker
3 3 1 Occasional Smoker Smoker
4 4 3 Unknown Occasional Smoker
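If you'd rather avoid the plyr dependency, a plain named-vector lookup does the same job. A sketch with the same example data (the fmt helper name is my own):

```r
df <- data.frame(V1 = c(1, 2, 3, 4),
                 V2 = c(2, 3, 1, 3))
liste_format <- data.frame(ID = c(1, 2, 3, 4),
                           group = c("Smoker", "Non-Smoker",
                                     "Occasional Smoker", "Unknown"))

# build a lookup vector: names are the codes, values are the labels
fmt <- setNames(as.character(liste_format$group), liste_format$ID)

# index by the (character) codes; the same vector serves any column
df$format_V1 <- unname(fmt[as.character(df$V1)])
df$format_V2 <- unname(fmt[as.character(df$V2)])
```

Codes absent from the lookup simply come back as NA, which mirrors SAS leaving a value unformatted.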

R: function to detect time-invariant variables

Suppose we have the following dataset:
Lines <- "id time sex Age
1 1 male 90
1 2 male 91
1 3 male 92
2 1 female 87
2 2 female 88
2 3 female 89
3 1 male 50
3 2 male 51
3 3 male 52
4 1 female 54
4 2 female 55
4 3 female 56"
dat <- read.table(text = Lines, header = TRUE)
I would like to create a function that scans all columns in the dataset except id and time and retrieves a string mentioning which variables are not time-invariant (change for every time period). In this example, it would be Age.
With dplyr here is an option -
library(dplyr)
dat %>%
group_by(id) %>%
summarise(across(-time, ~all(. != lag(.), na.rm = TRUE))) %>%
select(where(~is.logical(.) && all(.))) %>%
names
#[1] "Age"
Within each id, for every column except time, we return TRUE if each value differs from the previous one. We then return the names of the columns where this holds for every id.
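A base R alternative, for comparison. Note it is slightly looser than the dplyr version above: it flags a column as time-varying if it changes at all within any id, rather than requiring a change at every period. A sketch using the first two ids from the question's data:

```r
dat <- read.table(text = "id time sex Age
1 1 male 90
1 2 male 91
1 3 male 92
2 1 female 87
2 2 female 88
2 3 female 89", header = TRUE)

cols <- setdiff(names(dat), c("id", "time"))
varying <- sapply(cols, function(nm) {
  # TRUE if the column takes more than one value within any id
  any(tapply(dat[[nm]], dat$id, function(v) length(unique(v)) > 1))
})
names(varying)[varying]
#> "Age"
```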

R statistics, panel data and NAs: replacing NA value in vector with a specific row in another vector using panel data

My apologies for a poorly formulated question. I am new to R, to programming, and to posting questions.
I am working with panel data. I have two context-varying variables: cat (category, ranging from 1 to 4; an individual may have gambled in, say, 3 out of 4 possible places) and d.stake (the amount of money staked in a given category). cat and d.stake are nested within the individual (id), the context-independent variable.
I wish to compute difference scores between the amounts staked in the different categories.
I have created four variables: two lag variables (ldstake and ldstake2) and two difference scores (diff1 = d.stake - ldstake; diff2 = d.stake - ldstake2), using the code
df.3$ldstake <- c(NA, df.3$d.stake[-nrow(df.3)])
df.3$ldstake[which(!duplicated(df.3$id))] <- NA
df.3$ldstake2 <- c(NA, df.3$ldstake[-nrow(df.3)])
df.3$ldstake2[which(!duplicated(df.3$id))] <- NA
df.3 <- df.3 %>%
mutate(diff1 = d.stake - ldstake,
diff2 = d.stake - ldstake2)
This give me the following dataframe:
id cat d.stake ldstake ldstake2 diff1 diff2
1 1 50 NA NA NA NA
1 2 60 50 NA 10 NA
1 3 55 60 50 -5 5
2 1 34 NA NA NA NA
2 2 74 34 NA 40 NA
2 4 12 74 34 -62 22
However, I wish to replace the first row of diff1 (the NA) for each individual with the third row of diff2 from each individual (See example below).
id cat d.stake ldstake ldstake2 diff1 diff2
1 1 50 NA NA !5! NA
1 2 60 50 NA 10 NA
1 3 55 60 50 -5 !5!
2 1 34 NA NA *22* NA
2 2 74 34 NA 40 NA
2 4 12 74 34 -62 *22*
Is this possible? I would be grateful for a script that replaces the first NA value of diff1 with the value of diff2 from that individual's last observation (the third or last row). Furthermore, if there is a script that would do this automatically (that is, create the difference scores cat2-cat1, cat3-cat2, and cat3-cat1), I would be grateful for any help.
All the best,
Tony
Here is one possibility based on something else I had been working on this past week.
library(tidyverse)
df_wide <- df %>%
pivot_wider(id_cols = id, names_from = cat, values_from = d.stake) %>%
as.data.frame(.)
data.frame(id = df_wide$id, combn(df_wide[-1], 2, function(x) x[,1]-x[,2])) %>%
setNames(c("id", apply(combn(names(df_wide[-1]), 2), 2, paste0, collapse = "-"))) %>%
pivot_longer(cols = -id, names_to = "cats", values_to = "diff") %>%
drop_na()
Output
# A tibble: 6 x 3
id cats diff
<dbl> <chr> <dbl>
1 1 1-2 -10
2 1 1-3 -5
3 1 2-3 5
4 2 1-2 -40
5 2 1-4 22
6 2 2-4 62
Data
df <- data.frame(
id = c(1,1,1,2,2,2),
cat = c(1,2,3,1,2,4),
d.stake = c(50,60,55,34,74,12)
)
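If all you need is the literal replacement described in the question (each id's first diff1 filled with that id's last diff2), a small dplyr sketch, assuming a df.3 with the structure shown:

```r
library(dplyr)

df.3 <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  diff1 = c(NA, 10, -5, NA, 40, -62),
  diff2 = c(NA, NA, 5, NA, NA, 22)
)

df.3 <- df.3 %>%
  group_by(id) %>%
  # put each id's last diff2 value into its first diff1 slot
  mutate(diff1 = ifelse(row_number() == 1, last(diff2), diff1)) %>%
  ungroup()
```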

How can I create an incremental ID column based on whenever one of two variables are encountered?

My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient, and that each patient will have information that is not age or surgery afterward, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
# Use a tibble and get rid of factors.
library(dplyr)
library(tidyr)  # for spread() below
dfTest <- as_tibble(testdat) %>%
mutate_all(as.character)
# A little dplyr magic to find the start of each new patient, then assign an id.
dfTest <- dfTest %>%
mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", TRUE, FALSE)) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery'))))
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4

merge two dataframes by nearest preceding date while aggregating

I am trying to match two datasets by nearest preceding date, by group.
So, within a group, I would like to add the variables of the second dataset (d2) to the first (d1) whenever the date in d1 is the nearest date on or before the date in d2. If two rows of d2 are matched to one row of d1, I would like to take the larger of the values. (There will always be at least one date in d1 on or before each date in d2, within each group.)
Here is an example, which hopefully makes it clearer
d1 = data.frame(id=c(1,1,1,2,2),
ref=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07" )))
d1
# id ref
# 1 1 2013-12-07
# 2 1 2014-12-07
# 3 1 2015-12-07
# 4 2 2013-11-07
# 5 2 2014-11-07
d2 = data.frame(id=c(1,1,2),
date=as.Date(c("2014-05-07","2014-12-05", "2015-11-05")),
x1 = factor(c(1,2,2), ordered = TRUE),
x2 = factor(c(2, NA ,2), ordered=TRUE))
d2
# id date x1 x2
# 1 1 2014-05-07 1 2
# 2 1 2014-12-05 2 <NA>
# 3 2 2015-11-05 2 2
With the expected outcome
output = data.frame(id=c(1,1,1,2,2),
ref=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2013-11-07", "2014-11-07" )),
x1 = c(2, NA, NA, NA, 2),
x2 = c(2, NA, NA, NA, 2))
output
# id ref x1 x2
# 1 1 2013-12-07 2 2
# 2 1 2014-12-07 NA NA
# 3 1 2015-12-07 NA NA
# 4 2 2013-11-07 NA NA
# 5 2 2014-11-07 2 2
So, for example, the first two observations of d2 (id = 1, dates 2014-05-07 and 2014-12-05) are matched to the earlier date 2013-12-07 in d1. As two rows are matched to one row in d1, the highest level is selected.
I could do this in base R by looping the following calculations through each group, but I was hoping for something more efficient.
I would love to see a data.table approach (but I am limited to R v3.1 and data.table v1.9.4). Thanks
real dataset:
d1: rows 1M / 100K groups
d2: rows 11K / 4K groups
# for one group
x = d1[d1$id==1, ]
y = d2[d2$id==1, ]
id = apply(outer(x$ref, y$date, "-"), 2, which.min)
temp = cbind(y, ref=x$ref[id])
# aggregate variables by ref
temp = merge(aggregate(x1 ~ ref, data=temp, max),
aggregate(x2 ~ ref, data=temp, max)
)
merge(x, temp, all=T)
ps: I had looked at How to match by nearest date from two data frames? and Join data.table on exact date or if not the case on the nearest less than date with no success.
You can do this using dplyr:
d2$ind <- 0
library(dplyr)
out <- d1 %>% full_join(d2,by=c("id","ref"="date")) %>%
arrange(id,ref) %>%
mutate(ind=cumsum(ifelse(is.na(ind),1,ind))) %>%
group_by(ind) %>%
summarise(ref=min(ref),x1=max(x1,na.rm=TRUE),x2=max(x2,na.rm=TRUE))
### A tibble: 5 x 4
## ind ref x1 x2
## <dbl> <date> <fctr> <fctr>
##1 1 2013-12-07 2 2
##2 2 2014-12-07 NA NA
##3 3 2015-12-07 NA NA
##4 4 2013-11-07 NA NA
##5 5 2014-11-07 2 2
We first add a column of indicators to d2 and set those to zero. Then, we perform a full outer join between d1 and d2. Those rows in d1 will have ind of NA. We sort by id and ref (i.e., the date), and we replace the NA entries of ind with 1 and perform a cumsum. This results in:
id ref x1 x2 ind
1 1 2013-12-07 <NA> <NA> 1
2 1 2014-05-07 1 2 1
3 1 2014-12-05 2 <NA> 1
4 1 2014-12-07 <NA> <NA> 2
5 1 2015-12-07 <NA> <NA> 3
6 2 2013-11-07 <NA> <NA> 4
7 2 2014-11-07 <NA> <NA> 5
8 2 2015-11-05 2 2 5
From this we can easily see that we can group by ind and summarise appropriately to get your result.
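Since the question asks about data.table: here is a rolling-join sketch of the same idea. It uses keyed joins (which predate the on= syntax), so it should in principle run on older versions, but I have not verified it against v1.9.4; treat it as a sketch.

```r
library(data.table)

d1 <- data.table(id = c(1, 1, 1, 2, 2),
                 ref = as.Date(c("2013-12-07", "2014-12-07", "2015-12-07",
                                 "2013-11-07", "2014-11-07")))
d2 <- data.table(id = c(1, 1, 2),
                 date = as.Date(c("2014-05-07", "2014-12-05", "2015-11-05")),
                 x1 = factor(c(1, 2, 2), ordered = TRUE),
                 x2 = factor(c(2, NA, 2), ordered = TRUE))

d1[, ref1 := ref]   # keep a copy of d1's date; the key column takes i's values in the join
d2[, ref := date]   # the rolling join needs a shared key column name
setkey(d1, id, ref)
setkey(d2, id, ref)

# roll = TRUE: each d2 row is matched to the d1 row with the largest ref <= date
m <- d1[d2, roll = TRUE]

# take the highest level per matched d1 row, then merge back onto d1
agg <- m[, .(x1 = max(x1, na.rm = TRUE), x2 = max(x2, na.rm = TRUE)),
         by = .(id, ref1)]
out <- merge(d1, agg, by = c("id", "ref1"), all.x = TRUE)
```

The aggregation relies on max() being defined for ordered factors, which is why x1 and x2 are created with ordered = TRUE in the question's data.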
