I have a simple data structure with an id and a time-series indicator (prd). I would like to create a dummy variable for follow-up visits, "fup", which equals 0 if a patient has no more visits and 1 if a patient has more visits in the future.
How would I go about doing this?
id<- c(1, 1, 1, 2, 3, 3)
prd <- c(1, 2, 3, 1, 1, 2)
df <- data.frame(id=id, prd=prd)
Desired output:
id prd fup
1 1 1 1
2 1 2 1
3 1 3 0
4 2 1 0
5 3 1 1
6 3 2 0
We can check if the current row is the last row in each group. In base R,
df$fup <- with(df, ave(prd, id, FUN = function(x) seq_along(x) != length(x)))
df
# id prd fup
#1 1 1 1
#2 1 2 1
#3 1 3 0
#4 2 1 0
#5 3 1 1
#6 3 2 0
Similarly in dplyr,
library(dplyr)
df %>% group_by(id) %>% mutate(fup = +(row_number() != n()))
and data.table
library(data.table)
setDT(df)[, fup := +(seq_along(prd) != .N), by = id]
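For completeness, the same "is there a next row in the group" check can be sketched with dplyr's lead(), which returns NA on the last row of each group; this is an equivalent alternative, not a faster method:
df %>%
  group_by(id) %>%
  mutate(fup = +(!is.na(lead(prd)))) %>%
  ungroup()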
I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would love to find those individuals that have at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column, "index", which is 1 if two or more conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.
You can create a vector of conditions, and then use %in% and sum to count the number of occurrences in each group. Use + (or ifelse) to convert logical into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)). (This works here because every code in the data is one of the three conditions.)
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# # A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1
You can check the conditions using the intersect() function and test whether the resulting vector has at least the required length (e.g., 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))
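The same threshold count can also be sketched in base R with ave() over row indices; conditions is the vector defined above, and as.character() guards against code being a factor:
df$index <- ave(seq_along(df$id), df$id, FUN = function(i)
  +(length(intersect(as.character(df$code[i]), conditions)) >= 2))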
I have a quite big dataframe and I'm trying to add a new variable that is the running sum of the three previous rows, grouped by ID. The first three rows per ID should be 0. Here's what it should look like.
ID Var1 VarNew
1 2 0
1 2 0
1 3 0
1 0 7
1 4 5
1 1 7
Here's an example dataframe
ID <- c(1, 1, 1, 1, 1, 1)
Var1 <- c(2, 2, 3, 0, 4, 1)
df <- data.frame(ID, Var1)
You can use any package that has a rolling-sum function with a window size of 3 and lag the result, for example zoo::rollsumr.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(VarNew = lag(zoo::rollsumr(Var1, 3, fill = 0), default = 0)) %>%
ungroup
# ID Var1 VarNew
# <dbl> <dbl> <dbl>
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
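A data.table-only sketch of the same window-then-lag idea, using frollsum for the rolling sum and shift for the lag:
library(data.table)
setDT(df)[, VarNew := shift(frollsum(Var1, 3, fill = 0), fill = 0), by = ID]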
You can use stats::filter in ave (qualified, so dplyr's filter cannot mask it; note also that the function should operate on the group x, not on the whole column).
df$VarNew <- ave(df$Var1, df$ID, FUN=function(x) c(0, 0, 0,
  stats::filter(head(x, -1), c(1, 1, 1), sides=1)[-1:-2]))
df
df
# ID Var1 VarNew
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
or using cumsum of the lagged values in combination with head and tail.
df$VarNew <- ave(df$Var1, df$ID, FUN=function(x) {
  y <- cumsum(c(0, head(x, -1)))
  c(0, 0, 0, tail(y, -3) - head(y, -3))})
The runner package also helps; for multiple IDs, wrap the call in group_by(ID).
library(runner)
df %>% mutate(var_new = sum_run(Var1, k = 3, na_pad = TRUE, lag = 1))
ID Var1 var_new
1 1 2 NA
2 1 2 NA
3 1 3 NA
4 1 0 7
5 1 4 5
6 1 1 7
The NAs can easily be replaced with 0 if desired, as shown below.
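For instance, a sketch with tidyr::replace_na (assuming the tidyr package is available):
library(tidyr)
df %>% mutate(var_new = sum_run(Var1, k = 3, na_pad = TRUE, lag = 1),
              var_new = replace_na(var_new, 0))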
I have the following data frame ordered by name and time.
set.seed(100)
df <- data.frame('name' = c(rep('x', 6), rep('y', 4)),
'time' = c(rep(1, 2), rep(2, 3), 3, 1, 2, 3, 4),
'score' = c(0, sample(1:10, 3), 0, sample(1:10, 2), 0, sample(1:10, 2))
)
> df
name time score
1 x 1 0
2 x 1 4
3 x 2 3
4 x 2 5
5 x 2 0
6 x 3 1
7 y 1 5
8 y 2 0
9 y 3 5
10 y 4 8
In df$score there are zeros followed by an unknown number of actual values, e.g. df[1:4, ], and sometimes df$name changes between two rows where df$score == 0, e.g. df[6:7, ].
I want to change df$time where df$score != 0. Specifically, I want to assign the time value of the closest row above with df$score == 0, provided df$name matches.
The following code gives the desired output, but my data has millions of rows, so this solution is very inefficient.
score_0 <- append(which(df$score == 0), dim(df)[1] + 1)
for(i in 1:(length(score_0) - 1)) {
df$time[score_0[i]:(score_0[i + 1] - 1)] <-
ifelse(df$name[score_0[i]:(score_0[i + 1] - 1)] == df$name[score_0[i]],
df$time[score_0[i]],
df$time[score_0[i]:(score_0[i + 1] - 1)])
}
> df
name time score
1 x 1 0
2 x 1 4
3 x 1 3
4 x 1 5
5 x 2 0
6 x 2 1
7 y 1 5
8 y 2 0
9 y 2 5
10 y 2 8
Here score_0 gives the indices where df$score == 0. We see that df$time[2:4] are now all equal to 1, and that of df$time[6:7] only the first changed, because the second has df$name == 'y' while the closest row above with df$score == 0 has df$name == 'x'. The last two rows also changed correctly.
You can do it like this:
library(dplyr)
df %>%
  group_by(name) %>%
  mutate(ID = cumsum(score == 0)) %>%
  group_by(name, ID) %>%
  mutate(time = head(time, 1)) %>%
  ungroup() %>%
  select(name, time, score) %>%
  as.data.frame()
# name time score
# 1 x 1 0
# 2 x 1 8
# 3 x 1 10
# 4 x 1 6
# 5 x 2 0
# 6 x 2 5
# 7 y 1 4
# 8 y 2 0
# 9 y 2 5
# 10 y 2 9
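The same cumulative-zero-count grouping can be sketched in base R with ave(), avoiding the dplyr dependency:
grp <- with(df, ave(score == 0, name, FUN = cumsum))
df$time <- ave(df$time, df$name, grp, FUN = function(t) t[1])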
Solution using dplyr and data.table:
library(data.table)
library(dplyr)
df %>%
mutate(
chck = score == 0,
chck_rl = ifelse(score == 0, lead(rleid(chck)), rleid(chck))) %>%
group_by(name, chck_rl) %>% mutate(time = first(time)) %>%
ungroup() %>%
select(-chck_rl, -chck)
Output:
# A tibble: 10 x 3
name time score
<chr> <dbl> <int>
1 x 1 0
2 x 1 2
3 x 1 9
4 x 1 7
5 x 2 0
6 x 2 1
7 y 1 8
8 y 2 0
9 y 2 2
10 y 2 3
Solution only using data.table:
library(data.table)
setDT(df)[, chck_rl := ifelse(score == 0, shift(rleid(score == 0), type = "lead"),
rleid(score == 0))][, time := first(time), by = .(name, chck_rl)][, chck_rl := NULL]
Output:
name time score
1: x 1 0
2: x 1 2
3: x 1 9
4: x 1 7
5: x 2 0
6: x 2 1
7: y 1 8
8: y 2 0
9: y 2 2
10: y 2 3
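A more compact data.table sketch of the cumulative-zero-count idea from the first answer, grouping directly by an expression in by:
setDT(df)[, time := time[1L], by = .(name, cumsum(score == 0))]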
I'm using R and I'm really at a loss right now. I have data like this:
df <- data.frame(
group = c(2, 2, 2, 1, 1, 0, 0, 1, 1, 0, 1, 0),
grade = c(2, 4, 3, 1, 3, 2, 5, 1, 1, 2, 3, 1)
)
I want to have it like this:
group0 group1 group2
1 1 3 0
2 2 0 1
3 0 2 1
4 0 0 1
5 1 0 0
6 0 0 0
I've been trying for hours using subset, tapply, table, for loops, and whatnot, but I can't seem to figure it out. I'd be really happy if someone could help me; I can't help but think I'm missing something really easy and obvious.
How can I produce my target output?
Edit: solved, see below. Thanks for finding a fitting title, by the way!
You can do something like this with dplyr and tidyr:
df %>%
count(group, grade) %>%
mutate(group = paste0('group', group)) %>%
spread(group, n, fill = 0)
# A tibble: 5 x 4
grade group0 group1 group2
* <int> <dbl> <dbl> <dbl>
1 1 1 3 0
2 2 2 0 1
3 3 0 2 1
4 4 0 0 1
5 5 1 0 0
If you don't want the additional 'grade' column, you can do:
df %>%
count(group, grade) %>%
mutate(group = paste0('group', group)) %>%
spread(group, n, fill = 0) %>%
select(-grade)
group0 group1 group2
* <dbl> <dbl> <dbl>
1 1 3 0
2 2 0 1
3 0 2 1
4 0 0 1
5 1 0 0
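spread() is superseded in current tidyr; the same reshape can be sketched with pivot_wider() (assuming tidyr >= 1.0):
library(tidyr)
df %>%
  count(group, grade) %>%
  pivot_wider(names_from = group, values_from = n,
              names_prefix = "group", values_fill = 0) %>%
  select(-grade)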
Alternatively, consider a base R approach using: by for grouping, aggregate for counts, setNames for group## column names, and Reduce for chain merge of dataframes:
# DATAFRAME LIST BY EACH GROUP
grp_list <- by(df, df$group, function(d) setNames(aggregate(.~grade, d, FUN=length),
c("grade", paste0("group",max(d$group)))))
# CHAIN MERGE (OUTER JOIN)
final_df <- Reduce(function(x,y) merge(x,y, by="grade", all=TRUE), grp_list)
# FILL NA WITH ZEROS
final_df[is.na(final_df)] <- 0
final_df
# grade group0 group1 group2
# 1 1 1 3 0
# 2 2 2 0 1
# 3 3 0 2 1
# 4 4 0 0 1
# 5 5 1 0 0
And to remove grade, use transform after chain merge or directly on final_df:
final_df <- transform(Reduce(function(x,y) merge(x,y, by="grade", all=TRUE), grp_list),
grade = NULL)
final_df <- transform(final_df, grade = NULL)
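For this particular task there is also a compact base R alternative, since the desired output is just a contingency table of grade by group:
out <- as.data.frame.matrix(table(df$grade, df$group))
names(out) <- paste0("group", names(out))
out # grades end up as row names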
Let's say I have the following simple data.frame:
ID value
1 1 3
2 2 4
3 1 5
4 3 3
My desired output is below: a value is either added to the cumulative sum, or the sum is updated with the latest value of an ID that has already appeared.
ID value cumsum
1 1 3 3
2 2 4 7
3 1 5 9
4 3 3 12
In row 3, the new value forms an updated cumsum (7-3+5=9). Row 4 adds a new value to cumsum because the ID was not used before (4+5+3=12).
This produces the desired outcome for your example:
df<-read.table(header=T, text="ID value
1 1 3
2 2 4
3 1 5
4 3 3")
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(value = value - lag(value, default = 0L)) %>%
ungroup %>% mutate(cumsum = cumsum(value))
# # A tibble: 4 x 3
# ID value cumsum
# <int> <int> <int>
# 1 1 3 3
# 2 2 4 7
# 3 1 2 9
# 4 3 3 12
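The same lag-difference trick in a base R sketch, for reference:
df$adj <- with(df, ave(value, ID, FUN = function(x) x - c(0, head(x, -1))))
df$cumsum <- cumsum(df$adj)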
I used data.table for the cumsum. Calculating the cumulative mean is a bit trickier because the number of observations is not adjusted by just using cummean.
library(data.table)
library(zoo) # for na.locf
dt = data.table(id = c(1, 2, 1, 3), value = c(3, 4, 5, 3))
dt[, tmp := value-shift(value, n = 1L, type = "lag", fill = 0), by = c("id")]
#CUMSUM
dt[, cumsum := cumsum(tmp)]
#CUMMEAN WITH UPDATED N
dt[value != tmp, skip := 1:.N]
dt[, skip := na.locf(skip, na.rm = FALSE)]
dt[is.na(skip), skip := 0]
dt[, cummean := cumsum(tmp)/(seq_along(tmp)-skip)]
Output is:
id value tmp cumsum skip cummean
1: 1 3 3 3 0 3.0
2: 2 4 4 7 0 3.5
3: 1 5 2 9 1 4.5
4: 3 3 3 12 1 4.0
Edit: Changed lag function to data.table's shift function.
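As a sanity check on the cummean logic, a brute-force sketch that recomputes, at every row, the mean of each id's latest value (O(n^2), for verification only):
check <- sapply(seq_len(nrow(dt)), function(i)
  mean(dt[seq_len(i), last(value), by = id]$V1))
check
# 3.0 3.5 4.5 4.0 -- matches the cummean column above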