I have a fairly large dataframe and I'm trying to add a new variable that is the running sum of the three previous rows, grouped by ID. The first three rows per ID should be 0. Here's what it should look like.
ID Var1 VarNew
1 2 0
1 2 0
1 3 0
1 0 7
1 4 5
1 1 7
Here's an example dataframe
ID <- c(1, 1, 1, 1, 1, 1)
Var1 <- c(2, 2, 3, 0, 4, 1)
df <- data.frame(ID, Var1)
You can use any package that has a rolling-calculation function with a window size of 3 and then lag the result, for example zoo::rollsumr.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(VarNew = lag(zoo::rollsumr(Var1, 3, fill = 0), default = 0)) %>%
ungroup
# ID Var1 VarNew
# <dbl> <dbl> <dbl>
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
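The same lag-the-rolling-sum idea works outside zoo as well. A minimal sketch with data.table (assuming data.table >= 1.12.0, which provides frollsum):
library(data.table)
# rolling sum over the current window of 3, then shifted down one row within each ID;
# fill = 0 pads both the incomplete windows and the lagged first row
setDT(df)[, VarNew := shift(frollsum(Var1, 3, fill = 0), fill = 0), by = ID][]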
You can use stats::filter in ave (calling it as stats::filter avoids the clash with dplyr::filter, and using x instead of df$Var1 keeps it correct when there are multiple IDs).
df$VarNew <- ave(df$Var1, df$ID, FUN = function(x)
  c(0, 0, 0, stats::filter(head(x, -1), c(1, 1, 1), sides = 1)[-1:-2]))
df
# ID Var1 VarNew
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
or using cumsum in combination with head and tail: the sum of the previous three rows is the difference between the cumulative sum one row back and the cumulative sum four rows back (a leading 0 makes the first window work out).
df$VarNew <- ave(df$Var1, df$ID, FUN = function(x) {
  y <- c(0, cumsum(x))
  c(0, 0, 0, head(tail(y, -3) - head(y, -3), -1))
})
The runner package also helps:
library(runner)
df %>% mutate(var_new = sum_run(Var1, k = 3, na_pad = TRUE, lag = 1))
ID Var1 var_new
1 1 2 NA
2 1 2 NA
3 1 3 NA
4 1 0 7
5 1 4 5
6 1 1 7
The NAs can then easily be replaced with 0 if desired.
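For example, a small sketch assuming dplyr is loaded alongside runner (the example data has a single ID; with several IDs, add group_by(ID) first):
library(dplyr)
library(runner)
df %>%
  mutate(var_new = coalesce(sum_run(Var1, k = 3, na_pad = TRUE, lag = 1), 0))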
I have some data similar to that below.
df <- data.frame(id = 1:5, tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"))
df
# id tags
# 1 1 A,B,AB,C
# 2 2 C
# 3 3 AB,E
# 4 4 <NA>
# 5 5 B,C
I'd like to create a new dummy variable column for each tag in the "tags" column, resulting in a dataframe like the following:
correct_df <- data.frame(id = 1:5,
tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"),
A = c(1, 0, 0, 0, 0),
B = c(1, 0, 0, 0, 1),
C = c(1, 1, 0, 0, 1),
E = c(0, 0, 1, 0, 0),
AB = c(1, 0, 1, 0, 0)
)
correct_df
# id tags A B C E AB
# 1 1 A,B,AB,C 1 1 1 0 1
# 2 2 C 0 0 1 0 0
# 3 3 AB,E 0 0 0 1 1
# 4 4 <NA> 0 0 0 0 0
# 5 5 B,C 0 1 1 0 0
One of the challenges is ensuring that the "A" column has 1 only for the "A" tag, so that it doesn't have 1 for the "AB" tag, for example. The following won't work for this reason, since "A" gets 1 for the "AB" tag:
df <- df %>%
mutate(A = ifelse(grepl("A", tags, fixed = T), 1, 0))
df
# id tags A
# 1 1 A,B,AB,C 1
# 2 2 C 0
# 3 3 AB,E 1 < Incorrect
# 4 4 <NA> 0
# 5 5 B,C 0
Another challenge is doing this programmatically. I can probably deal with a solution that manually creates a column for each tag, but a solution that doesn't assume which tag columns need to be created beforehand is best, since there can potentially be many different tags. Is there some relatively simple solution that I'm overlooking?
Does this work:
library(tidyr)
library(dplyr)
df %>%
  separate_rows(tags) %>%
  mutate(A = case_when(tags == 'A' ~ 1, TRUE ~ 0),
         B = case_when(tags == 'B' ~ 1, TRUE ~ 0),
         C = case_when(tags == 'C' ~ 1, TRUE ~ 0),
         E = case_when(tags == 'E' ~ 1, TRUE ~ 0),
         AB = case_when(tags == 'AB' ~ 1, TRUE ~ 0)) %>%
  group_by(id) %>%
  mutate(tags = toString(tags)) %>%
  group_by(id, tags) %>%
  summarise(across(A:AB, sum))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 5 x 7
# Groups: id [5]
id tags A B C E AB
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A, B, AB, C 1 1 1 0 1
2 2 C 0 0 1 0 0
3 3 AB, E 0 0 0 1 1
4 4 NA 0 0 0 0 0
5 5 B, C 0 1 1 0 0
Here's a solution:
library(dplyr)
library(stringr)
library(magrittr)
library(tidyr)
#Data
df <- data.frame(id = 1:5, tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"))
#Separate into rows
df %<>% mutate(t2 = tags) %>% separate_rows(t2, sep = ",")
#Create a presence/absence column
df %<>% mutate(pa = 1)
#Pivot wider and use the presence/absence
#column as entries; fill with 0 if absent
df %<>% pivot_wider(names_from = t2, values_from = pa, values_fill = 0)
df
# # A tibble: 5 x 8
# id tags A B AB C E `NA`
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 A,B,AB,C 1 1 1 1 0 0
# 2 2 C 0 0 0 1 0 0
# 3 3 AB,E 0 0 1 0 1 0
# 4 4 NA 0 0 0 0 0 1
# 5 5 B,C 0 1 0 1 0 0
Edit: updated the code so that it retains the tags column.
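Note that the NA tag becomes its own column after pivoting. If that is unwanted, an optional cleanup sketch (assuming the column is literally named "NA", as the backticks in the printed tibble suggest):
df %<>% select(-any_of("NA"))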
I have data in the form of a count table of successes and trials, but for modeling I need these data in a disaggregated trial-level table.
How do I get from this:
dplyr::tibble(
user_id = c(1,2),
success = c(3,4),
trials = c(9, 10)
)
To this:
dplyr::tibble(
user_id = c(rep(1, 9), rep(2, 10)),
success = c(rep(1, 3),rep(0, 6), rep(1, 4), rep(0, 6))
)
We can uncount based on 'trials', then, grouped by 'user_id', convert 'success' to binary using a logical condition on row_number() (assuming the count table from the question is stored as df1):
library(dplyr)
library(tidyr)
df1 %>%
uncount(trials) %>%
group_by(user_id) %>%
mutate(success = +(row_number() <= first(success))) %>%
ungroup
# A tibble: 19 x 2
# user_id success
# <dbl> <int>
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 0
# 5 1 0
# 6 1 0
# 7 1 0
# 8 1 0
# 9 1 0
#10 2 1
#11 2 1
#12 2 1
#13 2 1
#14 2 0
#15 2 0
#16 2 0
#17 2 0
#18 2 0
#19 2 0
Or with base R using Map and stack
stack(setNames(Map(function(x, y) rep(1:0, c(x, y)),
df1$success, df1$trials - df1$success), df1$user_id))[2:1]
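The stack() result has the generic columns ind and values (reordered by [2:1] above), so a short follow-up sketch to restore the original names and types could be:
out <- stack(setNames(Map(function(x, y) rep(1:0, c(x, y)),
                          df1$success, df1$trials - df1$success), df1$user_id))[2:1]
names(out) <- c("user_id", "success")
out$user_id <- as.numeric(as.character(out$user_id))  # ind is a factor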
I have a longitudinal dataframe with a lot of missing values that looks like this.
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
cond = c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0)
var = c(1, NA , 2, 0,NA, NA, 3, NA,0, NA, 2, NA, 1,NA,NA)
df = data.frame(ID, date, cond,var)
I would like to carry forward the last observation based on two conditions:
1) when cond = 0, it should carry forward the higher value of the variable of interest.
2) when cond = 1, it should carry forward the lower value of the variable of interest.
Does anyone have an idea on how I could do this in an elegant way?
The final dataset should look like this
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
cond = c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0)
var = c(1, 1 , 2, 0, 0, NA, 3, 3, 0, 0,2,2,2,2,2)
final = data.frame(ID, date, cond,var)
So far I was able to carry forward the last observation, but I was unable to impose the conditions
library(dplyr)
library(zoo)
df <- df %>%
  group_by(ID) %>%
  mutate(var = na.locf(var, na.rm = FALSE))
Any suggestion is welcome.
This is a use case for purrr::accumulate2:
library(dplyr)
library(purrr)
df %>%
  group_by(ID) %>%
  mutate(d = unlist(accumulate2(var, cond[-1], function(z, x, y)
    if (y) min(z, x, na.rm = TRUE) else max(z, x, na.rm = TRUE))))
# A tibble: 15 x 5
# Groups: ID [3]
ID date cond var d
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 1 1
2 1 2 0 NA 1
3 1 3 0 2 2
4 1 4 1 0 0
5 1 5 0 NA 0
6 2 1 0 NA NA
7 2 2 0 3 3
8 2 3 0 NA 3
9 2 4 1 0 0
10 2 5 0 NA 0
11 3 1 0 2 2
12 3 2 0 NA 2
13 3 3 0 1 2
14 3 4 0 NA 2
15 3 5 0 NA 2
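To see how the accumulate2() call walks each group, here is a tiny standalone illustration on the values of ID 1 (z is the running value, x the next value of var, y the next element of the shifted cond):
library(purrr)
accumulate2(c(1, NA, 2, 0, NA), c(0, 0, 1, 0),
            function(z, x, y) if (y) min(z, x, na.rm = TRUE) else max(z, x, na.rm = TRUE))
# returns 1, 1, 2, 0, 0 as a list (hence the unlist() above), matching column d for ID 1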
I think this is what you are after, if I understand correctly:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
cond = c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0)
var = c(1, NA , 2, 0,NA, NA, 3, NA,0, NA, 2, NA, 1,NA,NA)
df = data.frame(ID, date, cond,var)
Using case_when you can do some conditional checks. I'm unsure whether you mean to return the minimum across the whole "ID" field, but this will look at the condition and then lag or lead to find a non-missing value:
library(dplyr)
df %>%
mutate(var_imput = case_when(
cond == 0 & is.na(var)~lag(x = var, n = 1, default = NA),
cond == 1 & is.na(var)~lead(x = var, n = 1, default = NA),
TRUE~var
))
Which yields:
ID date cond var var_imput
1 1 1 0 1 1
2 1 2 0 NA 1
3 1 3 0 2 2
4 1 4 1 0 0
5 1 5 0 NA 0
6 2 1 0 NA NA
7 2 2 0 3 3
8 2 3 0 NA 3
9 2 4 1 0 0
10 2 5 0 NA 0
11 3 1 0 2 2
12 3 2 0 NA 2
13 3 3 0 1 1
14 3 4 0 NA 1
15 3 5 0 NA NA
If you want to group by ID then you could generate an impute table by ID, then join it with the original table like this:
library(tidyr)  # for gather()
# Generate impute table
input_table <- df %>%
group_by(ID) %>%
summarise(min = min(var, na.rm = T),
max = max(var, na.rm = T)) %>%
gather(cond, value, -ID) %>%
mutate(cond = ifelse(cond == "min", 0, 1))
# Join and impute missing
df %>%
left_join(input_table,by = c("ID", "cond")) %>%
mutate(var_imput = ifelse(is.na(var), value, var))
I have a simple data structure with an id and a time-series indicator (prd). I would like to create a dummy variable for follow-up visits, "fup", which is equal to 0 if a patient has no more visits and 1 if the patient has more visits in the future.
How would I go about doing this?
id<- c(1, 1, 1, 2, 3, 3)
prd <- c(1, 2, 3, 1, 1, 2)
df <- data.frame(id=id, prd=prd)
Desired output:
id prd fup
1 1 1 1
2 1 2 1
3 1 3 0
4 2 1 0
5 3 1 1
6 3 2 0
We can check if the current row is the last row in each group. In base R,
df$fup <- with(df, ave(prd, id, FUN = function(x) seq_along(x) != length(x)))
df
# id prd fup
#1 1 1 1
#2 1 2 1
#3 1 3 0
#4 2 1 0
#5 3 1 1
#6 3 2 0
Similarly in dplyr,
library(dplyr)
df %>% group_by(id) %>% mutate(fup = +(row_number() != n()))
and data.table
library(data.table)
setDT(df)[, fup := +(seq_along(prd) != .N), by = id]
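An equivalent way to read "has more visits in the future" is that a later row exists within the group. A small sketch of that with the dplyr already loaded above (lead() returns NA at the last row of each group):
df %>%
  group_by(id) %>%
  mutate(fup = +(!is.na(lead(prd)))) %>%
  ungroup()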
I'm using R and I'm really at a loss right now. I have data like this:
df <- data.frame(
group = c(2, 2, 2, 1, 1, 0, 0, 1, 1, 0, 1, 0),
grade = c(2, 4, 3, 1, 3, 2, 5, 1, 1, 2, 3, 1)
)
I want to have it like this:
group0 group1 group2
1 1 3 0
2 2 0 1
3 0 2 1
4 0 0 1
5 1 0 0
6 0 0 0
I've been trying for hours using subset, tapply, table, for loops and whatnot, but I can't seem to figure it out. I'd be really happy if someone could help me; I can't help but think I'm missing something really easy and obvious.
How can I produce my target output?
Edit: solved, see the answers below.
You can do something like this with dplyr and tidyr:
df %>%
count(group, grade) %>%
mutate(group = paste0('group', group)) %>%
spread(group, n, fill = 0)
# A tibble: 5 x 4
grade group0 group1 group2
* <int> <dbl> <dbl> <dbl>
1 1 1 3 0
2 2 2 0 1
3 3 0 2 1
4 4 0 0 1
5 5 1 0 0
If you don't want the additional 'grade' column, you can do:
df %>%
count(group, grade) %>%
mutate(group = paste0('group', group)) %>%
spread(group, n, fill = 0) %>%
select(-grade)
group0 group1 group2
* <dbl> <dbl> <dbl>
1 1 3 0
2 2 0 1
3 0 2 1
4 0 0 1
5 1 0 0
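spread() has since been superseded by pivot_wider(); a minimal sketch of the same reshape in current tidyr would be:
df %>%
  count(group, grade) %>%
  mutate(group = paste0("group", group)) %>%
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
  select(-grade)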
Alternatively, consider a base R approach using: by for grouping, aggregate for counts, setNames for group## column names, and Reduce for chain merge of dataframes:
# DATAFRAME LIST BY EACH GROUP
grp_list <- by(df, df$group, function(d) setNames(aggregate(.~grade, d, FUN=length),
c("grade", paste0("group",max(d$group)))))
# CHAIN MERGE (OUTER JOIN)
final_df <- Reduce(function(x,y) merge(x,y, by="grade", all=TRUE), grp_list)
# FILL NA WITH ZEROS
final_df[is.na(final_df)] <- 0
final_df
# grade group0 group1 group2
# 1 1 1 3 0
# 2 2 2 0 1
# 3 3 0 2 1
# 4 4 0 0 1
# 5 5 1 0 0
And to remove grade, use transform after chain merge or directly on final_df:
final_df <- transform(Reduce(function(x,y) merge(x,y, by="grade", all=TRUE), grp_list),
grade = NULL)
final_df <- transform(final_df, grade = NULL)
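For completeness: since these counts are just a cross-tabulation of grade by group, base table() gets there directly (a sketch; as.data.frame.matrix turns the contingency table back into a data frame):
as.data.frame.matrix(table(df$grade, paste0("group", df$group)))
#   group0 group1 group2
# 1      1      3      0
# 2      2      0      1
# 3      0      2      1
# 4      0      0      1
# 5      1      0      0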