I have some patient data, where the individual patients change treatment groups over time. My goal is to visualize the sequence of group changes and aggregate this data into a "sequence profile" for each treatment group.
For each treatment group I would like to show when it generally occurs in the treatment cycle (say, more towards the beginning or the end). To account for the differing sequence lengths, I would like to standardize these profiles between 0 (very beginning) and 1 (end).
I would like to find an efficient way to prepare and visualize this data.
Minimal Example
Structure of Data
library(dplyr)
library(purrr)
library(ggplot2)
# minimal data
cj_df_raw <- tibble::tribble(
~id, ~group,
1, "A",
1, "B",
2, "A",
2, "B",
2, "A"
)
# compute "intervals" for each person [start, end]
cj_df_raw %>%
group_by(id) %>%
mutate(pos = row_number(),
len = length(id),
start = (pos - 1) / len,
end = pos / len) %>%
filter(group == "A")
#> # A tibble: 3 x 6
#> # Groups: id [2]
#> id group pos len start end
#> <dbl> <chr> <int> <int> <dbl> <dbl>
#> 1 1 A 1 2 0 0.5
#> 2 2 A 1 3 0 0.333
#> 3 2 A 3 3 0.667 1
(So ID 1 was in group A for the first 50% of their sequence, and ID 2 was in group A for the first 33% and the last 33% of their sequence. This means that 2 IDs were in group A between 0-33% of the sequence, 1 between 33-50%, 0 between 50-66%, and 1 above 66%.)
This is the result I would like to achieve, but I am missing a way to transform my data effectively.
Desired outcome
profile_treatment_a <- tibble::tribble(
~x, ~y,
0, 0L,
0.33, 2L,
0.5, 1L,
0.66, 0L,
1, 1L,
1, 0L
)
profile_treatment_a %>%
ggplot(aes(x, y)) +
geom_step(direction = "vh") +
expand_limits(x = c(0, 1), y = 0)
(Ideally the area under the curve would be shaded)
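One way to get that shading, as a hedged sketch: draw one rectangle per interval with geom_rect() underneath the step line. This assumes the profile_treatment_a tibble from above; with direction = "vh", the height over [x, next x) is the y of the following point, which is what the lead() calls capture.
library(dplyr)
library(ggplot2)
# pair each breakpoint with the next one; the interval's height is the
# y value of the following point (matching geom_step(direction = "vh"))
rects <- profile_treatment_a %>%
  transmute(xmin = x, xmax = lead(x), ymax = lead(y)) %>%
  filter(!is.na(xmax))
ggplot() +
  geom_rect(data = rects,
            aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = ymax),
            fill = "grey70", alpha = 0.6) +
  geom_step(data = profile_treatment_a, aes(x, y), direction = "vh") +
  expand_limits(x = c(0, 1), y = 0)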
Ideal solution: via ggridges
The goal of the visualization would be to compare the "sequence profiles" of many treatment groups at the same time. If I could prepare the data accordingly, I would like to use the ggridges package for a striking visual comparison of the treatment groups.
library(ggridges)
data.frame(group = rep(letters[1:2], each=20),
mean = rep(2, each=20)) %>%
mutate(count = runif(nrow(.))) %>%
ggplot(aes(x=count, y=group, fill=group)) +
geom_ridgeline(stat="binline", binwidth=0.5, scale=0.9)
You could build helper intervals and then just plot a histogram. Since each patient is in either group A or B, both groups together sum to 100%. With these helper intervals you could also easily switch to other geoms.
library(tidyverse, warn.conflicts = FALSE)
library(ggplot2)
# create sample data
set.seed(42)
id <- 1:10 %>% map(~ rep(x = .x, times = runif(n = 1, min = 1, max = 6))) %>%
unlist()
group <- sample(x = c("A", "B"), size = length(id), replace = TRUE) %>%
as_factor()
df <- tibble(id, group)
glimpse(df)
#> Observations: 37
#> Variables: 2
#> $ id <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
#> $ group <fct> A, B, B, A, A, B, B, A, A, B, B, A, B, B, A, B, A, B, A,...
# tidy data
df <- df %>%
group_by(id) %>%
mutate(from = (row_number() - 1) / n(),
to = row_number() / n()) %>%
ungroup() %>%
rowwise() %>%
mutate(list = seq(from + 1/60, to, 1/60) %>% list()) %>%
unnest()
# plot
df %>%
ggplot(aes(x = list, fill = group)) +
geom_histogram(binwidth = 1/60) +
ggthemes::theme_hc()
Created on 2018-09-16 by the reprex package (v0.2.0).
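The same helper intervals can also feed the ggridges comparison asked for; a hedged sketch mirroring the binline snippet from the question (assuming the df built above; the binwidth here is a free choice, and geom_density_ridges rescales the bin heights so the groups stay comparable):
library(ggridges)
# one ridge per group; stat = "binline" bins the helper points into counts
df %>%
  ggplot(aes(x = list, y = group, fill = group)) +
  geom_density_ridges(stat = "binline", binwidth = 1/20, scale = 0.9)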
My attempt at an answer: although it is probably not the nicest/fastest/most efficient way, I think it might help you in your efforts.
library(data.table)
# compute "intervals" for each person [start, end]
df <- cj_df_raw %>%
group_by(id) %>%
mutate(pos = row_number(),
len = length(id),
from = (pos - 1) / len,
to = pos / len,
value = 1)
dt <- as.data.table(df)
setkey(dt, from, to)
#create intervals
dt.interval <- data.table(from = seq( from = 0, by = 0.01, length.out = 100),
to = seq( from = 0.01, by = 0.01, length.out = 100))
#perform overlap join on intervals
dt2 <- foverlaps( dt.interval, dt, type = "within", nomatch = NA)[, sum(value), by = c("i.from", "group")]
# some melting and casting to fill in '0' on empty intervals
dt3 <- melt( dcast(dt2, ... ~ group, fill = 0), id.vars = 1 )
#plot
ggplot( dt3 ) +
geom_step( aes( x = i.from, y = value, color = variable ) ) +
facet_grid( .~variable )
Related
I am trying to identify which groups in a column contain a specific sequence length of non-zero numbers. In the basic example below, where the goal is to find the groups with a sequence length of 5, only group b would be correct.
set.seed(123)
df <- data.frame(
id = seq(1:40),
grp = sort(rep(letters[1:4], 10)),
x = c(
c(0, sample(1:10, 3), rep(0, 6)),
c(0, 0, sample(1:10, 5), rep(0, 3)),
c(rep(0, 6), sample(1:10, 4)),
c(0, 0, sample(1:10, 3), 0, sample(1:10, 2), 0, 0))
)
One limited solution, using cumsum below, is to count the non-zero values, but it does not work when there are breaks in the sequence: with a specific length of 5, group d is incorrectly included.
library(dplyr)
df %>%
group_by(grp) %>%
mutate(cc = cumsum(x != 0)) %>% filter(cc == 5) %>% distinct(grp)
Desired output for the example of a sequence length of 5, would identify only group b, not d.
You may use rle to find runs of consecutive non-zero numbers for each group.
library(dplyr)
find_groups <- function(x, n) {
tmp <- rle(x != 0)
any(tmp$lengths[tmp$values] >= n)
}
#apply the function for each group
df %>%
group_by(grp) %>%
dplyr::filter(find_groups(x, 5)) %>%
ungroup %>%
distinct(grp)
# grp
# <chr>
#1 b
In data.table:
library(data.table)
# rleid(!x) starts a new run id at every switch between zero and
# non-zero, so .N == 5 flags runs of exactly five rows per group
setDT(df)[, .N == 5, .(grp, rleid(!x))][(V1), .(grp)]
grp
1: b
Split the sequence breaks into different groups by adding cumsum(x == 0) to the group_by statement. Then filter for the groups that contain 5 non-zero rows.
library(dplyr)
df %>%
group_by(grp, cumsum(x == 0)) %>%
filter(sum(x != 0) == 5) %>%
ungroup() %>%
distinct(grp)
#> # A tibble: 1 × 1
#> grp
#> <chr>
#> 1 b
I want to calculate the mean and standard deviation for subgroups of every column in my dataset.
The membership of the subgroups is based on the values in the column of interest and these subgroups are specific to each column of interest.
# Example data
set.seed(1)
library(data.table)
df <- data.frame(baseline = runif(100), `Week0_12` = runif(100), `Week12_24` = runif(100))
So for column baseline, a row may be assigned to a different subgroup than for column Week0_12.
I can of course create these 'subgroup columns' manually for each column and then calculate the statistics for each column by column subgroup:
df$baseline_subgroup <- ifelse(df$baseline < 0.2, "subgroup_1", "subgroup_2")
df <- as.data.table(df)
df[, .(mean = mean(baseline), sd = sd(baseline)), by = baseline_subgroup]
Giving this output:
baseline_subgroup mean sd
1: subgroup_2 0.58059314 0.22670071
2: subgroup_1 0.09793105 0.05317809
Doing this for every column separately is too much repetition, especially given that I have many columns in my actual data.
df$Week0_12_subgroup <- ifelse(df$Week0_12 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week0_12), sd = sd(Week0_12 )), by = Week0_12_subgroup ]
df$Week12_24_subgroup <- ifelse(df$Week12_24 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week12_24), sd = sd(Week12_24)), by = Week12_24_subgroup ]
What is a more elegant approach to do this?
Here's a tidyverse method that gives an easy-to-read and easy-to-plot output:
library(tidyverse)
set.seed(1)
df <- data.frame(baseline = runif(100),
`Week0_12` = runif(100),
`Week12_24` = runif(100))
df2 <- df %>%
summarize(across(everything(), list(mean_subgroup1 = ~mean(.x[.x < 0.2]),
sd_subgroup1 = ~sd(.x[.x < 0.2]),
mean_subgroup2 = ~mean(.x[.x >= 0.2]),
sd_subgroup2 = ~sd(.x[.x >= 0.2])))) %>%
pivot_longer(everything(), names_pattern = '^(.*)_(.*)_(.*$)',
names_to = c('time', 'measure', 'subgroup')) %>%
pivot_wider(names_from = measure, values_from = value)
df2
#> # A tibble: 6 x 4
#> time subgroup mean sd
#> <chr> <chr> <dbl> <dbl>
#> 1 baseline subgroup1 0.0979 0.0532
#> 2 baseline subgroup2 0.581 0.227
#> 3 Week0_12 subgroup1 0.117 0.0558
#> 4 Week0_12 subgroup2 0.594 0.225
#> 5 Week12_24 subgroup1 0.121 0.0472
#> 6 Week12_24 subgroup2 0.545 0.239
ggplot(df2, aes(time, mean, group = subgroup)) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd, color = subgroup),
width = 0.1) +
geom_point() +
theme_minimal(base_size = 16)
Created on 2022-07-14 by the reprex package (v2.0.1)
You could apply a subgroup function across each column, i.e.:
library(data.table)
# summarize one column: assign each value to a subgroup based on that
# column's own values, then compute mean and sd per subgroup
subgroup <- function(x) {
  dt <- data.table(value = x,
                   subgroup = ifelse(x < 0.2, "subgroup_1", "subgroup_2"))
  dt[, .(mean = mean(value), sd = sd(value)), by = subgroup]
}
# lapply keeps each column as a vector (apply would coerce the data
# frame to a matrix) and collects the summary tables in a named list
summaries <- lapply(df, subgroup)
You can create a custom function and apply it using .SD, i.e.
library(data.table)
f1 <- function(x){
i_mean <- mean(x);
i_sd <- sd(x);
list(Avg = i_mean, standard_dev = i_sd)
}
setDT(df)[, unlist(lapply(.SD, f1), recursive = FALSE), by = baseline_subgroup][]
baseline_subgroup baseline.Avg baseline.standard_dev Week0.12.Avg Week0.12.standard_dev Week12.24.Avg Week12.24.standard_dev
1: subgroup_2 0.5950020 0.22556590 0.5332555 0.2651810 0.5467046 0.2912027
2: subgroup_1 0.1006693 0.04957005 0.5947161 0.2645519 0.5137543 0.3213723
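The above groups every column by baseline_subgroup only. If, as in the question, each column should be grouped by its own values, here is a hedged melt-based sketch (assuming the 0.2 cutoff and the three original measurement columns):
library(data.table)
# long format: one row per (variable, value); each variable is then
# grouped by its own subgroup assignment
melt(as.data.table(df),
     measure.vars = c("baseline", "Week0_12", "Week12_24"))[
  , .(mean = mean(value), sd = sd(value)),
  by = .(variable, subgroup = ifelse(value < 0.2, "subgroup_1", "subgroup_2"))]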
I want to summarize a table creating new columns using different mathematical operations and conditions.
I am using data.table because I am used to this package but I accept recommendations on different ones if any (maybe dplyr?).
This is an example data frame:
id <- c(rep("A", 6), rep("B", 6), rep("C",6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30,6))
lon <- c(rep(0, 6), rep(180, 6), rep(270,6))
hight <- c(rep(seq(0,100, 20),3))
var1 <- rnorm(18, 50, 50)
df <- data.frame(id, lat, lon, hight, var1)
library(data.table)
setDT(df)
Besides the typical mathematical operations, such as mean, sd, and median, I would like to create a new column showing the value of var1 under a specific condition, such as hight == 0 or hight == 100.
df.new <- df[, .(
"var1_avg" = mean(var1, na.rm = T),
"var1_sd" = sd(var1, na.rm = T),
"var1_median" = median(var1, na.rm = T),
"var1_min" = min(var1),
#here I have the problems:
"var1_0" =df[which(hight == 0),
"var1"],
"var1_100" =df[which(hight == 100),
"var1"]
), by = c("lat", "lon")]
I understand the concept of the error:
Error in `[.data.table`(df, , .(var1_avg = mean(var1, na.rm = T), var1_sd = sd(var1, :
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
But I cannot find an efficient solution to get my df.new.
Here is a data.table version that seems more efficient than the proposed tidyverse approach:
library(data.table)
set.seed(123)
id <- c(rep("A", 6), rep("B", 6), rep("C",6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30,6))
lon <- c(rep(0, 6), rep(180, 6), rep(270,6))
hight <- c(rep(seq(0,100, 20),3))
var1 <- rnorm(18, 50, 50)
df <- data.table(id, lat, lon, hight, var1, key=c("lat", "lon"))
df[, .(
"var1_avg" = mean(var1, na.rm = T),
"var1_sd" = sd(var1, na.rm = T),
"var1_median" = median(var1, na.rm = T),
"var1_min" = min(var1),
"var1_0"= var1[hight==0],
"var1_100"= var1[hight==100]
), by = c("lat", "lon")]
#> lat lon var1_avg var1_sd var1_median var1_min var1_0 var1_100
#> 1: -30 270 52.28133 62.36118 62.78635 -48.33086 70.03857 -48.33086
#> 2: 45 0 72.35764 47.75012 54.99490 21.97622 21.97622 135.75325
#> 3: 50 180 47.06030 45.22337 47.85380 -13.25306 73.04581 67.99069
Created on 2022-04-04 by the reprex package (v2.0.1)
This will calculate the summary statistics e.g. mean or sd for every point (lat, lon) regardless of hight:
library(tidyverse)
id <- c(rep("A", 6), rep("B", 6), rep("C", 6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30, 6))
lon <- c(rep(0, 6), rep(180, 6), rep(270, 6))
hight <- c(rep(seq(0, 100, 20), 3))
var1 <- rnorm(18, 50, 50)
df <- data.frame(id, lat, lon, hight, var1)
df %>%
group_by(lat, lon) %>%
summarise(
var1_avg = mean(var1, na.rm = TRUE),
var1_sd = sd(var1, na.rm = TRUE),
var1_median = median(var1, na.rm = TRUE)
) %>%
left_join(
df %>% filter(hight == 100) %>% transmute(lat, lon, var1_100 = var1)
) %>%
left_join(
df %>% filter(hight == 0) %>% transmute(lat, lon, var1_0 = var1)
)
#> `summarise()` has grouped output by 'lat'. You can override using the `.groups`
#> argument.
#> Joining, by = c("lat", "lon")
#> Joining, by = c("lat", "lon")
#> # A tibble: 3 × 7
#> # Groups: lat [3]
#> lat lon var1_avg var1_sd var1_median var1_100 var1_0
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -30 270 90.6 67.0 81.6 181. 5.51
#> 2 45 0 43.3 40.5 49.6 36.6 -30.1
#> 3 50 180 34.9 47.0 25.3 24.6 0.705
Created on 2022-04-04 by the reprex package (v2.0.0)
Essentially, I am trying to make a bar chart using dplyr where there are several columns, say A, B, and C.
Each column holds a value, 0 or 1, classifying whether the row corresponds to that type.
I am trying to make a bar chart using ggplot that shows the number of rows that contain a true value in each column. Any advice, at least on the syntax I'd follow?
Example:
A 1 1 1 0 0 0
B 0 0 0 1 0 0
C 0 0 0 0 1 1
I want to show the frequency of each, but as if those three were columns
Edit: I should note that I am trying to pull these from a larger data set, e.g. A, B, C, D, E, F, G, H..., but I only want A, B, and C.
Try this
library(dplyr)
library(ggplot2)
library(tibble)
df <- as.data.frame(
rbind(
A = c(1, 1, 1, 0, 0, 0),
B = c(0, 0, 0, 1, 0, 0),
C = c(0, 0, 0, 0, 1, 1),
D = c(0, 0, 0, 0, 0, 0),
E = c(0, 0, 0, 0, 0, 0)
))
df %>%
# NOTE: name of id variable should not start with "v" or "V"
# Otherwise the select will not work.
rownames_to_column(var = "type") %>%
mutate(count = rowSums(select(., starts_with("V")), na.rm = TRUE)) %>%
select(type, count) %>%
filter(type %in% c("A", "B", "C")) %>%
ggplot(aes(type, count, fill = type)) +
geom_col() +
guides(fill = FALSE)
Created on 2020-03-15 by the reprex package (v0.3.0)
Update
First of all, both the solution by @Chris and the one by @Jonathan are much cleaner and clearer than my approach, and both are more efficient. In terms of efficiency, the base R solution by @Chris is by far the most efficient (and not only in terms of programmer's efficiency). The results show that the base R solution gives a speedup over the tidyverse solutions by a factor of ~10. Whether this matters depends on the size of the dataset, among other things.
Here are the results:
I simply put the different solutions into functions (I only did some renaming) and ran a microbenchmark. I also added a fourth function which adapts the code by @Chris to allow for flexible names.
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)
# example data
df <- as.data.frame(
rbind(
A = c(1, 1, 1, 0, 0, 0),
B = c(0, 0, 0, 1, 0, 0),
C = c(0, 0, 0, 0, 1, 1),
D = c(0, 0, 0, 0, 0, 0),
E = c(0, 0, 0, 0, 0, 0)
))
# Tidyverse 1 using select & rowSums
sum_rows1 <- function(df) {
df %>%
# NOTE: name of id variable should not start with "v" or "V"
# Otherwise the select will not work.
rownames_to_column(var = "type") %>%
filter(type %in% c("A", "B", "C")) %>%
mutate(count = rowSums(select(., starts_with("V")), na.rm = TRUE)) %>%
select(type, count)
}
# Tidyverse 2 using pivot_longer
sum_rows2 <- function(df) {
df %>%
#Transpose the data
t() %>%
#Convert it as data.frame
as.data.frame() %>%
#Get data from wide to long format
pivot_longer(cols = everything(),
names_to = "type",
values_to = "value") %>%
#Filter to stay only with letters A, B, C
filter(type %in% c("A","B","C")) %>%
#group by var (i.e., letters)
group_by(type) %>%
#Get the sum of values per letter
summarize(count = sum(value))
}
# base R 1 with fixed names
sum_rows3 <- function(df) {
sum1 <- apply(t(df)[,1:3], 2, sum)
data.frame(type = LETTERS[1:3], count = sum1)
}
# base R 2 with flexible names
sum_rows4 <- function(df, cols) {
sum1 <- apply(t(df)[, cols], 2, sum)
data.frame(type = names(sum1), count = sum1)
}
(df1 <- sum_rows1(df))
#> type count
#> 1 A 3
#> 2 B 1
#> 3 C 2
(df2 <- sum_rows2(df))
#> # A tibble: 3 x 2
#> type count
#> <chr> <dbl>
#> 1 A 3
#> 2 B 1
#> 3 C 2
(df3 <- sum_rows3(df))
#> type count
#> A A 3
#> B B 1
#> C C 2
(df4 <- sum_rows4(df, c("A","B","C")))
#> type count
#> A A 3
#> B B 1
#> C C 2
# Benchmark the solutions
microbenchmark::microbenchmark(sum_rows1(df), sum_rows2(df), sum_rows3(df), sum_rows4(df, c("A","B","C")))
#> Unit: microseconds
#> expr min lq mean median uq
#> sum_rows1(df) 4239.5 4619.60 6079.313 6072.20 6771.15
#> sum_rows2(df) 3658.1 4085.55 5309.038 5225.95 5939.90
#> sum_rows3(df) 301.6 383.15 540.001 437.55 539.10
#> sum_rows4(df, c("A", "B", "C")) 302.6 387.05 533.977 469.05 546.40
#> max neval
#> 11238.7 100
#> 13808.2 100
#> 5018.6 100
#> 4106.9 100
Created on 2020-03-16 by the reprex package (v0.3.0)
Here is another solution using tidyverse that uses two great functions (pivot_longer and summarize) to organize the data and build the desired plot.
library(tidyverse)
df %>%
#Transpose the data
t() %>%
#Convert it as data.frame
as.data.frame() %>%
#Get data from wide to long format
pivot_longer(cols = everything(),
names_to = "var",
values_to = "value") %>%
#Filter to stay only with letters A, B, C
filter(var %in% c("A","B","C")) %>%
#group by var (i.e., letters)
group_by(var) %>%
#Get the sum of values per letter
summarize(sum = sum(value)) %>%
#ggplot with geom_col (i.e., columns plot)
ggplot(aes(x = var,
y = sum,
fill = var)) +
geom_col()
A simple base R solution is this, using @stefan's data:
First, calculate the sums for each row in df by transposing it (flipping rows into columns and vice versa) using t() as well as apply(), with MARGIN = 2 for the rows in df that have become columns in t(df), and sum for the sums:
sum1 <- apply(t(df)[,1:3], 2, sum)
Then create a dataframe with the relevant sequence of upper-case letters as the first variable and sum1 as the second variable:
sum2 <- data.frame(types = LETTERS[1:3], sum1)
And finally plot your barplot using sum2 as input data:
# note: the constant fill vector in geom_col() overrides the fill mapped in aes()
ggplot(sum2, aes(types, sum1, fill = types)) +
  geom_col(fill = c("#009E00", "#F0E300", "#0066B2"))
I have a dataframe of 96074 obs. of 31 variables.
The first two variables are id and the date; then I have 9 columns with measurements (three different KPIs, each with three different time properties), then various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform from wide to long, in order to do graphs with ggplot using facets for example.
If I had a df with just one variable and its three time scans, I would have no problem using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor, time of class Date, and the values of class numeric (I should add that I don't have negative measures in these variables).
What I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
gathering all of the variables in a single column
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
creating a new column with the name of the KPI, without the time specification
plotdf %>% unite(value2, key, value) %>%
#creating a new variable with the full name of the KPI attaching the value at the end
mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
#spreading
But spread creates rows with NAs.
To replace them, at first I used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After I replaced the NAs, my idea was to use slice(1) to select only the first row of each couple id/date, then do some manipulation with separate/unite to have the output I desired.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
library(tidyverse) # str_extract comes from stringr, attached via tidyverse
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
select(-key) %>%
spread(type,value)
gives
id time age per_i sum_d sum_i
1 1 2009-01-01 15days_old 0.8132301 0.8888928 0.077532040
2 1 2009-01-01 1day_old -2.0993199 2.8817133 3.047894196
3 1 2009-01-01 5days_old -0.4626151 -1.0002926 0.327102000
4 1 2009-01-02 15days_old 0.4089618 -1.6868523 0.866412133
5 1 2009-01-02 1day_old 0.8181313 3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)
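For completeness, a hedged sketch of the same reshape with tidyr's pivot_longer (tidyr >= 1.0, newer than the gather/spread above): the ".value" sentinel sends the KPI part of each column name to its own column and keeps the age part as a key, all in one call.
library(tidyr)
# split names like "sum_d_15days_old" into KPI ("sum_d") and age ("15days_old");
# ".value" spreads each KPI into its own value column
df %>%
  pivot_longer(-c(id, time),
               names_pattern = "(.*)_(\\d+days?_old)",
               names_to = c(".value", "age"))
# yields columns id, time, age, sum_d, sum_i, per_i_d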