R: function to detect time-invariant variables - r

Suppose we have the following dataset:
Lines <- "id time sex Age
1 1 male 90
1 2 male 91
1 3 male 92
2 1 female 87
2 2 female 88
2 3 female 89
3 1 male 50
3 2 male 51
3 3 male 52
4 1 female 54
4 2 female 55
4 3 female 56"
dat <- read.table(text = Lines, header = TRUE)
I would like to create a function that scans all columns in the dataset except id and time and retrieves a string mentioning which variables are not time-invariant (change for every time period). In this example, it would be Age.

With dplyr here is an option -
library(dplyr)
dat %>%
group_by(id) %>%
summarise(across(-time, ~all(. != lag(.), na.rm = TRUE))) %>%
select(where(~is.logical(.) && all(.))) %>%
names
#[1] "Age"
Within each id except time column return TRUE if the current value is different than the previous value for all the values. We can return the column name where all the values are TRUE for every id.

Related

How to make the values of one column the main column names using spread()

I need to run a chi-square test so I need the levels of one column (gender) to be the column names for the output of different variables. Here's some data:
test <- data.frame(gender = as.character(sample(c('male','female'),10, replace = T)),
test1 = sample(c(1:10)),
test2 = sample(1:5,10 , replace = T))
> test
gender test1 test2
1 female 2 2
2 male 9 1
3 male 4 4
4 female 8 1
5 female 5 4
6 female 3 3
7 female 7 3
8 female 1 1
9 male 10 2
10 male 6 2
I've used the following line of code with dplyr::spread() but it's giving me an error:
test %>% spread(gender,test1)
Error: Each row of output must be identified by a unique combination of keys.
I've followed all the examples that dplyr provides using gather() and spread() but nothing is working. If you have any tips please let me know. Here's my desired outcome:
> goal
male female
1 10 3
2 1 4
3 5 10
4 3 9
5 6 7
We can create a sequence column grouped by gender to make a unique identifier and then use `spread
library(dplyr)
library(tidyr)
test %>%
select(-test2) %>%
group_by(gender) %>%
mutate(rn = row_number()) %>%
spread(gender, test1)

R statistics, panel data and NAs: replacing NA value in vector with a specific row in another vector using panel data

My apologies for a poorly formulated question. I am new to R and to programming and for posting questions.
I am working with panel data. I have two context varying variables: cat (category that ranges from 1 to 4, where an individual have gambled in 3 out of 4 possible places) and d.stake = amount of money staked in a given category. Cat and d.stake are nested within the individual (id) (context independent variable).
I wish to make a difference score between the different categories amount staked in different categories.
I have created four variables. Two of them lag is a lag variable (ldstake and ldstake2) and two variables with difference scores (diff1 = stake - ldstake; diff2 stake - ldstake2), using the code
df.3$ldstake <- c(NA, df.3$d.stake[-nrow(df.3)])
df.3$ldstake[which(!duplicated(df.3$id))] <- NA
df.3$ldstake2 <- c(NA, df.3$ldstake[-nrow(df.3)])
df.3$ldstake2[which(!duplicated(df.3$id))] <- NA
df.3 <- df.3 %>%
mutate(diff1 = d.stake - ldstake,
diff2 = d.stake - ldstake2)
This give me the following dataframe:
id cat d.stake ldstake ldstake2 diff1 diff2
1 1 50 NA NA NA NA
1 2 60 50 NA 10 NA
1 3 55 60 50 -5 5
2 1 34 NA NA NA NA
2 2 74 34 NA 40 NA
2 4 12 74 34 -62 22
However, I wish to replace the first row of diff1 (the NA) for each individual with the third row of diff2 from each individual (See example below).
id cat d.stake ldstake ldstake2 diff1 diff2
1 1 50 NA NA !5! NA
1 2 60 50 NA 10 NA
1 3 55 60 50 -5 !5!
2 1 34 NA NA *22* NA
2 2 74 34 NA 40 NA
2 4 12 74 34 -62 *22*
Is this possible? I would be grateful to receive a script where I can replace the first NA value with the value of diff2 and last value for the individual (third or last observation). Furthermore, if there is a script that would do this automatically (that is create the difference score between cat2-1 cat3-2 and cat3-1) I would be grateful to receive any help.
All the best,
Tony
Here is one possibility based on something else I had been working on this past week.
library(tidyverse)
df_wide <- df %>%
pivot_wider(id_cols = id, names_from = cat, values_from = d.stake) %>%
as.data.frame(.)
data.frame(id = df_wide$id, combn(df_wide[-1], 2, function(x) x[,1]-x[,2])) %>%
setNames(c("id", apply(combn(names(df_wide[-1]), 2), 2, paste0, collapse = "-"))) %>%
pivot_longer(cols = -id, names_to = "cats", values_to = "diff") %>%
drop_na()
Output
# A tibble: 6 x 3
id cats diff
<dbl> <chr> <dbl>
1 1 1-2 -10
2 1 1-3 -5
3 1 2-3 5
4 2 1-2 -40
5 2 1-4 22
6 2 2-4 62
Data
df <- data.frame(
id = c(1,1,1,2,2,2),
cat = c(1,2,3,1,2,4),
d.stake = c(50,60,55,34,74,12)
)

How do I combine row entries for the same patient ID# in R while keeping other columns and NA values?

I need to combine some of the columns for these multiple IDs and can just use the values from the first ID listing for the others. For example here I just want to combine the "spending" column as well as the heart attack column to just say whether they ever had a heart attack. I then want to delete the duplicate ID#s and just keep the values from the first listing for the other columns:
df <- read.table(text =
"ID Age Gender heartattack spending
1 24 f 0 140
2 24 m na 123
2 24 m 1 58
2 24 m 0 na
3 85 f 1 170
4 45 m na 204", header=TRUE)
What I need:
df2 <- read.table(text =
"ID Age Gender ever_heartattack all_spending
1 24 f 0 140
2 24 m 1 181
3 85 f 1 170
4 45 m na 204", header=TRUE)
I tried group_by with transmute() and sum() as follows:
df$heartattack = as.numeric(as.character(df$heartattack))
df$spending = as.numeric(as.character(df$spending))
library(dplyr)
df = df %>% group_by(ID) %>% transmute(ever_heartattack = sum(heartattack, na.rm = T), all_spending = sum(spending, na.rm=T))
But this removes all the other columns! Also it turns NA values into zeros...for example I still want "NA" to be the value for patient ID#4, I don't want to change the data to say they never had a heart attack!
> print(dfa) #This doesn't at all match df2 :(
ID ever_heartattack all_spending
1 1 0 140
2 2 1 181
3 2 1 181
4 2 1 181
5 3 1 170
6 4 0 204
Could you do this?
aggregate(
spending ~ ID + Age + Gender,
data = transform(df, spending = as.numeric(as.character(spending))),
FUN = sum)
# ID Age Gender spending
#1 1 24 f 140
#2 3 85 f 170
#3 2 24 m 181
#4 4 45 m 204
Some comments:
The thing is that when aggregating you don't give clear rules how to deal with data in additional columns that differ (like heartattack in this case). For example, for ID = 2 why do you retain heartattack = 1 instead of heartattack = na or heartattack = 0?
Your "na"s are in fact not real NAs. That leads to spending being a factor column instead of a numeric column vector.
To exactly reproduce your expected output one can do
df %>%
mutate(
heartattack = as.numeric(as.character(heartattack)),
spending = as.numeric(as.character(spending))) %>%
group_by(ID, Age, Gender) %>%
summarise(
heartattack = ifelse(
any(heartattack %in% c(0, 1)),
max(heartattack, na.rm = T),
NA),
spending = sum(spending, na.rm = T))
## A tibble: 4 x 5
## Groups: ID, Age [?]
# ID Age Gender heartattack spending
# <int> <int> <fct> <dbl> <dbl>
#1 1 24 f 0 140
#2 2 24 m 1 181
#3 3 85 f 1 170
#4 4 45 m NA 204
This feels a bit "hacky" on account of the rules not being clear which heartattack value to keep. In this case we
keep the maximum value of heartattack if heartattack contains either 0 or 1.
return NA if heartattack does not contain 0 or 1.

How can I create an incremental ID column based on whenever one of two variables are encountered?

My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient and that each patient will have a information that is not age or surgery afterward, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
# Use a tibble and get rid of factors.
dfTest = as_tibble(testdat) %>%
mutate_all(as.character)
# A little dplyr magic to see find if the start of a new patient, then give them an id.
dfTest = dfTest %>%
mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", T, F)) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery'))))
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4

Create missing observations in panel data

I am working on panel data with a unique case identifier and a column for the time points of the observations (long format). There are both time-constant variables and time-varying observations:
id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2
For my model I now need data with complete records per id for each time point. In other words, if an observation is missing I still need to put in a row with id, time, time-constant variables, and NA for the observed variables (as would be the line (102, 2, "female", NA) in the above example). So my question is:
How can I find out if a row with unique combination of id and time already exists in my dataset?
If not, how can I add this row, carry over time-constant variables and fill the observations with NA?
Would be great if someone could shed some light on this.
Thanks a lot in advance!
EDIT
Thank you everyone for your replies. Here is what I finally did, which is mix of several suggested approaches. The thing is that I have several time-varying variables (obs1-obsn) per row and I did not get dcast to accomodate for that - value.name does not take more than argument.
# create all possible permutations of id and year
iddat = expand.grid(id = unique(dataset$id), time = (c(1996,1999,2002,2005,2008,2011)))
iddat <- iddat[order(iddat$id, iddat$time), ]
# add permutations to existing data, combinations so far missing are NA
dataset_new <- merge(dataset, iddat, all.x=TRUE, all.y=TRUE, by=c("id", "time"))
# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)
# merge back time-constant variables from original data
temp <- dataset[c("tc1", "tc2", "tc3")]
dataset_new <- merge(dataset_new, temp, by=c("id"))
# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
dataset_new <- unique(dataset_new) # some rows are duplicates after last merge, no idea why
rm(temp)
rm(iddat)
All the best and thanks again, Matt
You could create an empty dataset and then merge in the records in which you have matches.
# Create dataset. For you actual data ,you would replace c(1:3) with
# c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
id <- rep(c(1:3), each = 3)
time <- rep(c(1:3), 3)
df <- data.frame(id,time)
test <- df[c(1,3,5,7,9),]
test$tc1 <- c("male", "male", "female", "male", "male")
test$obs1 <-c(4,5,3,6,2)
merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)
The result:
id time tc1 obs1
1 1 1 male 4
2 1 2 <NA> NA
3 1 3 male 5
4 2 1 <NA> NA
5 2 2 female 3
6 2 3 <NA> NA
7 3 1 male 6
8 3 2 <NA> NA
9 3 3 male 2
There are probably more elegant ways, but here's one option. I'm assuming that you need all combinations of id and time but not tc1 (i.e. tc1 is tied to id).
# your data
df <- read.table(text = " id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2", header = TRUE)
First cast your data to wide format to introduce NAs, then convert back to long.
library('reshape2')
df_wide <- dcast(
df,
id + tc1 ~ time,
value.var = "obs1",
fill = NA
)
df_long <- melt(
df_wide,
id.vars = c("id","tc1"),
variable.name = "time",
value.name = "obs1"
)
# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
id tc1 time obs1
1 101 male 1 4
4 101 male 2 5
7 101 male 3 3
2 102 female 1 6
5 102 female 2 NA
8 102 female 3 2
3 103 male 1 2
6 103 male 2 NA
9 103 male 3 NA

Resources