Gather 3 different detections of three different variables - R

I have a dataframe of 96074 obs. of 31 variables.
The first two variables are id and the date; then there are 9 columns with measurements (three different KPIs, each with three different time properties), followed by various technical and geographical variables. A small reproducible example:
df <- data.frame(
  id = rep(1:3, 3),
  time = rep(as.Date('2009-01-01') + 0:2, each = 3),
  sum_d_1day_old = rnorm(9, 2, 1),
  sum_i_1day_old = rnorm(9, 2, 1),
  per_i_d_1day_old = rnorm(9, 0, 1),
  sum_d_5days_old = rnorm(9, 0, 1),
  sum_i_5days_old = rnorm(9, 0, 1),
  per_i_d_5days_old = rnorm(9, 0, 1),
  sum_d_15days_old = rnorm(9, 0, 1),
  sum_i_15days_old = rnorm(9, 0, 1),
  per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform it from wide to long, in order to make graphs with ggplot using facets, for example.
If I had a df with just one variable and its three time scans, I would have no problem using gather:
plotdf <- df %>%
  gather(sum_d, value,
         c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
         factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
  id = rep(1:3, 3),
  time = rep(as.Date('2009-01-01') + 0:2, each = 3),
  sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
  values_sum_d = rnorm(9, 2, 1),
  sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
  values_sum_i = rnorm(9, 2, 1),
  per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
  values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor, time of class Date, and the values of class numeric (I should add that I don't have negative measures in these variables).
What I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
#gathering all of the variables in a single column
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
                    sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
#creating a new column with the name of the KPI, without the time specification
plotdf %>% unite(value2, key, value) %>%
  #creating a new variable with the full name of the KPI, attaching the value at the end
  mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
  #spreading
But spread creates rows with NAs.
To replace them, at first I used
group_by(id, date) %>%
  fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
  fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought I could replace the NAs in the original df with a dummy value and then turn the dummy values back into NAs afterwards, but there may be a more efficient solution to the whole problem.
After replacing the NAs, my idea was to use slice(1) to select only the first row of each id/date pair, then do some manipulation with separate/unite to get the output I want.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
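For reference, the dummy-value round trip I had in mind would look roughly like this (just a sketch; it assumes that -999 never occurs as a real measurement):
# replace the NAs with a dummy value before reshaping
df_dummy <- df %>%
  mutate_at(vars(starts_with("per_i_d")), ~ replace(., is.na(.), -999))
# ...run the gather / unite / spread pipeline on df_dummy instead of df...
# ...and finally turn every -999 in the reshaped result back into NA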

library(tidyverse)  # gather/spread/mutate, plus str_extract() via stringr
df %>%
  gather(key, value, -id, -time) %>%
  mutate(type = str_extract(key, '[a-z]+_[a-z]'),
         age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
  select(-key) %>%
  spread(type, value)
gives
  id       time        age      per_i      sum_d       sum_i
1  1 2009-01-01 15days_old  0.8132301  0.8888928 0.077532040
2  1 2009-01-01   1day_old -2.0993199  2.8817133 3.047894196
3  1 2009-01-01  5days_old -0.4626151 -1.0002926 0.327102000
4  1 2009-01-02 15days_old  0.4089618 -1.6868523 0.866412133
5  1 2009-01-02   1day_old  0.8181313  3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
  gather(key, value, -id, -time) %>%
  mutate(type = str_extract(key, '[a-z]+_[a-z]'),
         age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
         info = paste(age, type, sep = "_")) %>%
  select(-key) %>%
  gather(key, value, -id, -time, -age, -type) %>%
  unite(dummy, type, key) %>%
  spread(dummy, value)
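With a recent tidyr (>= 1.0), the same reshaping can also be sketched in a single step via pivot_longer() and its ".value" sentinel, assuming the column-name pattern above:
# one-step sketch: ".value" keeps sum_d / sum_i / per_i_d as separate value columns
df %>%
  pivot_longer(cols = -c(id, time),
               names_to = c(".value", "age"),
               names_pattern = "(.*)_(\\d+days?_old)")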

Related

How to sum every numeric column that starts with the same name except the last 2 characters, in R?

I have a dataframe which contains >100 columns; some are numeric, some are not.
All variables ending with "_f" or "_m" are numeric variables and I would like to sum all the pairs that start with the same pattern but end with "_f" or "_m".
Here is an example of variable names in my dataframe:
xxxxxxxxxxxxx_age1_f
xxxxxxxxxxxxx_age1_m
xxxxxxxxxxxxx_age2_f
xxxxxxxxxxxxx_age2_m
xxxxxxxxxxxxx_age3_f
xxxxxxxxxxxxx_age3_m
yyyyyyyyyy_age1_f
yyyyyyyyyy_age1_m
yyyyyyyyyy_age2_f
yyyyyyyyyy_age2_m
yyyyyyyyyy_age3_f
yyyyyyyyyy_age3_m
yyyyyyyyyy_age4_f
yyyyyyyyyy_age4_m
yyyyyyyyyy_age5_f
yyyyyyyyyy_age5_m
zzzzzzzzzzzzzzzzzzzz_age1_f
zzzzzzzzzzzzzzzzzzzz_age1_m
zzzzzzzzzzzzzzzzzzzz_age2_f
zzzzzzzzzzzzzzzzzzzz_age2_m
zzzzzzzzzzzzzzzzzzzz_age3_f
zzzzzzzzzzzzzzzzzzzz_age3_m
text_var_11
text_var_222
text_var_33333
(I'm abstracting the names here with x, y, z to make my question clearer; they are not really named like that.)
My first solution would be to sum each pair using dplyr::mutate() like this:
mutate( ... ) %>%
mutate( yyyyyyyyyy_age2 = yyyyyyyyyy_age2_f + yyyyyyyyyy_age2_m) %>%
mutate( yyyyyyyyyy_age3 = yyyyyyyyyy_age3_f + yyyyyyyyyy_age3_m) %>%
mutate( ... ) %>%
This will work, but there must be a more intelligent way to do it without repeating the same line for every variable pair.
After looking for a solution, the closest I found was this
Sum all columns whose names start with a pattern, by group
However the proposed solution doesn't work in my case for 2 reasons:
the substr() is not applicable to my problem since the lengths of the variable names change
this method assumes I only have variables I want to sum, while in my case I have many other variables that don't end with "_f" or "_m" and don't need to be summed (and cannot be summed since some are text)
I suppose the solution could be modified to apply to my case but I'm not sure how.
EDIT: here is sample data created with dput
structure(list(Groups = c("xx", "xx", "xx"), xxxxx_age0_f = c(8,
0, 7), xxxxx_age0_m = c(5, 0, 0), xxxxx_age1_f = c(1,
0, 0), xxxxx_age1_m = c(3, 2, 0), xxxxx_age2_f = c(0,
0, 2), xxxxx_age2_m = c(0, 1, 0), zzzz_age0_f = c(4,
2, NA), zzzz_age0_m = c(3, 6, NA), zzzz_age1_f = c(0,
0, NA), zzzz_age1_m = c(2, 0, NA), zzzz_age2_f = c(4,
1, NA), zzzz_age2_m = c(3, 1, NA)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
ps: this simplified example has only one categorical variable, while I have hundreds.
Updated, with OP's actual example data:
inner_join(
  dat,
  dat %>%
    select(Groups, ends_with(c("_f", "_m"))) %>%
    pivot_longer(cols = !Groups) %>%
    mutate(name = gsub("_.$", replacement = "", name)) %>%
    group_by(Groups, name) %>%
    summarize(value = sum(value, na.rm = TRUE)) %>%
    pivot_wider(id_cols = "Groups", names_from = "name"),
  by = "Groups"
)
Previous example, prior to OP's actual example.
If your table looks like this:
dat
# A tibble: 2 x 9
zzzzzzzzzzzzzzzzzzzz_age1_f zzzzzzzzzzzzzzzzzzzz_age1_m zzzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ text_var_11 text_var_222 text_var_33333
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 -0.709 1.26 1.03 1.36 -0.140 -0.595 f o x
2 -0.202 0.164 -1.28 -1.48 -0.380 0.874 a p m
Then, you can do this:
dat %>%
  select(ends_with(c("_f", "_m"))) %>%
  pivot_longer(cols = everything()) %>%
  mutate(name = gsub("_.$", replacement = "", name)) %>%
  group_by(name) %>%
  summarize(value = sum(value, na.rm = TRUE))
To get this:
name value
<chr> <dbl>
1 zzzzzzzzzzzzzzzzzzzz_age1 0.510
2 zzzzzzzzzzzzzzzzzzzz_age2 -0.371
3 zzzzzzzzzzzzzzzzzzzz_age3 -0.240
Now, I don't know what your desired output structure is, but your mutate attempt above suggests you want to column-bind these new aggregate columns. This could be done easily by taking that interim result above and wrapping it in bind_cols, like this:
bind_cols(
  dat,
  dat %>%
    select(ends_with(c("_f", "_m"))) %>%
    pivot_longer(cols = everything()) %>%
    mutate(name = gsub("_.$", replacement = "", name)) %>%
    group_by(name) %>%
    summarize(value = sum(value, na.rm = TRUE)) %>%
    pivot_wider(names_from = "name", values_from = "value")
)
This does not solve your whole problem, but maybe you could try something along these lines:
df <- df %>%
  rowwise()  # apply the sums within each row instead of down the columns
for (i in c("age1", "age2", "age3")) {
  df <- df %>%
    mutate(!!i := sum(c_across(contains(i)), na.rm = TRUE))
}
df <- df %>% ungroup()
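Alternatively, a small base-R sketch that sums each exact _f/_m pair by its stem (assuming the dput sample above is stored in dat, as in the earlier answer, and that every "_f" column has a matching "_m" column):
# stems = the _f/_m column names with the trailing suffix stripped off
stems <- unique(sub("_[fm]$", "", grep("_[fm]$", names(dat), value = TRUE)))
# one new column per stem, summing its _f and _m counterparts row by row
dat_sums <- dat
for (s in stems) {
  dat_sums[[s]] <- rowSums(dat[paste0(s, c("_f", "_m"))], na.rm = TRUE)
}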

Changing the factor level based on the value in a grouped variable using the dplyr and forcats packages

I am trying to change the levels of a factor based on values coming from another variable. I will show it with an example. I have the following table:
library(tidyverse)
set.seed(1)
df = tibble(
  group = factor(rep(c("a", "b", "c", "d"), each = 5)),
  x = c(rnorm(5, 0, 1), rnorm(5, 0, 2), rnorm(5, 0, 1.5), rnorm(5, 0, 3))
)
I would like to reorder the levels of the group factor by decreasing standard deviation of the variable x.
I managed to get it like this:
lev = df %>% group_by(group) %>%
  summarise(sd = sd(x)) %>%
  arrange(desc(sd))
df = df %>% mutate(group = fct_relevel(group, as.character(lev$group)))
However, I don't like this solution because it requires creating an auxiliary lev table, which I would like to avoid. Does anyone know how to achieve this in a simpler, more transparent way that fits the usual dplyr semantics?
What you are looking for is forcats::fct_reorder():
df = df %>% mutate(group = fct_reorder(group, x, sd, .desc = TRUE))
df %>% group_by(group) %>% summarise(sd=sd(x))

Sum only the numeric columns of specific rows in two data frames to add noise

I'm trying to implement an oversampling function in R, meaning to replicate samples (the rows in my data frame) and then add some noise only to the newly created samples. The reason why these answers do not work
(Add two data frames together based on matching column names, How to merge and sum two data frames)
is that I also have factor columns, which obviously can't be "noised". My idea would be to remove the non-numerical columns, add the noise and then add them back, but I don't know how. Any other idea to achieve this goal is welcome, too.
# generate the initial dataframe
library(tidyverse)
train_set <- data.frame(
  Numeric1 = runif(20, 0, 1),
  Numeric2 = runif(20, 0, 1),
  Numeric3 = runif(20, 0, 1),
  Numeric4 = runif(20, 0, 1),
  Numeric5 = runif(20, 0, 1),
  Numeric6 = runif(20, 0, 1),
  Numeric7 = runif(20, 0, 1),
  Numeric8 = runif(20, 0, 1),
  Numeric9 = runif(20, 0, 1),
  Numeric10 = runif(20, 0, 1),
  Factor_column = factor(rep("Factor1", 20))  # explicit factor() so the column stays a factor on R >= 4.0
)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))
# replicate each row twice
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# identify the added rows
Rownames <- row.names(subset(train_oversampled, grepl("\\.", row.names(train_oversampled))))
nRows <- nrow(subset(train_oversampled, grepl("\\.", row.names(train_oversampled))))
# which columns are numeric?
Headers <- train_oversampled %>% select_if(is.numeric) %>% names(.)
nHeaders <- train_oversampled %>% select_if(is.numeric) %>% ncol(.)
# create a new dataframe with the values to add and the dimensions of the selection above
noise <- data.frame(
  replicate(nHeaders, rnorm(nRows, mean = 0, sd = 0.0005))
)
row.names(noise) <- Rownames
names(noise) <- Headers
# add noise to the oversampled data frame
# does not work due to factor column
bind_rows(
  train_oversampled %>% add_rownames(),
  noise %>% add_rownames()
) %>%
  group_by(rowname) %>%
  summarise_all(sum, na.rm = TRUE)
Any ideas on how to add the values from noise to the corresponding rows and columns in train_oversampled?
I think this gives you what you're after. I keep only the numeric columns when binding the rows, do the group-by and summarise, and then bind the factor columns back in.
bind_rows(
  train_oversampled %>% select_if(is.numeric) %>% add_rownames(),
  noise %>% add_rownames()
) %>%
  group_by(rowname) %>%
  summarise_all(sum, na.rm = TRUE) %>%
  bind_cols(train_oversampled %>% select_if(is.factor))
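A more direct sketch of the idea from the question (work only on the numeric columns of the new rows and add the noise in place), reusing Rownames and Headers from above:
# add the noise values straight onto the numeric columns of the replicated rows;
# noise has the same row and column names, so the indices line up
train_oversampled_noisy <- train_oversampled
train_oversampled_noisy[Rownames, Headers] <-
  train_oversampled_noisy[Rownames, Headers] + noise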

Combine rows with duplicate identifiers while adding additional columns

Here's a simple example of what I'm looking for:
Before:
data.frame(
  Name = c("pusheen", "pusheen", "puppy"),
  Species = c("feline", "feline", "doggie"),
  Activity = c("snacking", "napping", "playing"),
  Start = c(1, 2, 3),
  End = c(11, 12, 13)
)
After:
data.frame(
  Name = c("pusheen", "puppy"),
  Species = c("feline", "doggie"),
  Activity1 = c("snacking", "playing"),
  Start1 = c(1, 3),
  End1 = c(11, 13),
  Activity2 = c("napping", NA),
  Start2 = c(2, NA),
  End2 = c(12, NA)
)
How do I do this in R or Excel? Thanks!
This can be done using pivot_wider from the tidyr package.
library(tidyr)
library(dplyr)
library(magrittr)
df <- df %>%
  group_by(Name) %>%
  mutate(num = row_number()) %>% # Create a counter by group
  ungroup() %>%
  pivot_wider(
    id_cols = c("Name", "Species"),
    names_from = num,
    values_from = c("Activity", "Start", "End"),
    names_sep = "")
If you want the result ordered as in your sample output, we can add an additional select statement. I used str_sub from the stringr package to pull out the last character from each column name, and then sorted the names from there. This method of ordering columns should generalise to any number of activities.
library(stringr)
df %>%
  select(Name, Species, names(df)[order(str_sub(names(df), -1))])

R: Tidy Aggregation of Sequence Data and Visualization of Stepfunctions

I have some patient data, where the individual patients change treatment groups over time. My goal is to visualize the sequence of group changes and aggregate this data into a "sequence profile" for each treatment group.
For each treatment group I would like to show when it generally occurs in the treatment cycle (say, rather at the beginning or at the end). To account for differing sequence lengths, I would like to standardize these profiles between 0 (very beginning) and 1 (end).
I am looking for an efficient way to prepare and visualize this data.
Minimal Example
Structure of Data
library(dplyr)
library(purrr)
library(ggplot2)
# minimal data
cj_df_raw <- tibble::tribble(
  ~id, ~group,
  1, "A",
  1, "B",
  2, "A",
  2, "B",
  2, "A"
)
# compute "intervals" for each person [start, end]
cj_df_raw %>%
  group_by(id) %>%
  mutate(pos = row_number(),
         len = length(id),
         start = (pos - 1) / len,
         end = pos / len) %>%
  filter(group == "A")
#> # A tibble: 3 x 6
#> # Groups: id [2]
#> id group pos len start end
#> <dbl> <chr> <int> <int> <dbl> <dbl>
#> 1 1 A 1 2 0 0.5
#> 2 2 A 1 3 0 0.333
#> 3 2 A 3 3 0.667 1
(So Id 1 was in group A for the first 50% of their sequence, and Id 2 was in group A for the first 33% and the last 33% of their sequence. This means that 2 Ids were in group A between 0-33% of the sequence, 1 between 33-50%, 0 between 50-66%, and 1 above 66%.)
This is the result I would like to achieve, but I am missing a way to transform my data effectively.
Desired outcome
profile_treatmen_a <- tibble::tribble(
  ~x, ~y,
  0, 0L,
  0.33, 2L,
  0.5, 1L,
  0.66, 0L,
  1, 1L,
  1, 0L
)
profile_treatmen_a %>%
  ggplot(aes(x, y)) +
  geom_step(direction = "vh") +
  expand_limits(x = c(0, 1), y = 0)
(Ideally the area under the curve would be shaded)
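A rough sketch of one way such shading could be drawn (one grey rectangle per interval behind the step; with direction = "vh" the height over each interval is simply lead(y)):
profile_treatmen_a %>%
  mutate(xend = lead(x), ytop = lead(y)) %>%
  filter(!is.na(xend)) %>%
  ggplot() +
  geom_rect(aes(xmin = x, xmax = xend, ymin = 0, ymax = ytop), fill = "grey80") +
  geom_step(data = profile_treatmen_a, aes(x, y), direction = "vh") +
  expand_limits(x = c(0, 1), y = 0)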
Ideal solution: via ggridges
The goal of the visualization would be to compare the "sequence-profile" of many treatment-groups at the same time. If I could prepare the data accordingly, I would like to use the ggridges-package for a striking visual comparison of the treatment groups.
library(ggridges)
data.frame(group = rep(letters[1:2], each = 20),
           mean = rep(2, each = 20)) %>%
  mutate(count = runif(nrow(.))) %>%
  ggplot(aes(x = count, y = group, fill = group)) +
  geom_ridgeline(stat = "binline", binwidth = 0.5, scale = 0.9)
You could build helper intervals and then just plot a histogram. Since each patient is either in group A or group B, the two groups together sum up to 100%. With these helper intervals you could also easily switch to other geoms.
library(tidyverse, warn.conflicts = FALSE)
library(ggplot2)
# create sample data
set.seed(42)
id <- 1:10 %>% map(~ rep(x = .x, times = runif(n = 1, min = 1, max = 6))) %>%
  unlist()
group <- sample(x = c("A", "B"), size = length(id), replace = TRUE) %>%
  as_factor()
df <- tibble(id, group)
glimpse(df)
#> Observations: 37
#> Variables: 2
#> $ id <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
#> $ group <fct> A, B, B, A, A, B, B, A, A, B, B, A, B, B, A, B, A, B, A,...
# tidy data
df <- df %>%
  group_by(id) %>%
  mutate(from = (row_number() - 1) / n(),
         to = row_number() / n()) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(list = seq(from + 1/60, to, 1/60) %>% list()) %>%
  unnest()
# plot
df %>%
  ggplot(aes(x = list, fill = group)) +
  geom_histogram(binwidth = 1/60) +
  ggthemes::theme_hc()
Created on 2018-09-16 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
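Building on the same helper-interval data, a sketch of the ggridges comparison the question aims for (assuming the df with the list column created above):
library(ggridges)
# ridgeline histograms of the helper intervals, one ridge per treatment group
df %>%
  ggplot(aes(x = list, y = group, fill = group)) +
  geom_density_ridges(stat = "binline", binwidth = 1/60, scale = 0.9)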
My attempt at an answer... although it is probably not the nicest/fastest/most efficient way, I think it might help you in your efforts.
library(data.table)
# compute "intervals" for each person [start, end]
df <- cj_df_raw %>%
  group_by(id) %>%
  mutate(pos = row_number(),
         len = length(id),
         from = (pos - 1) / len,
         to = pos / len,
         value = 1)
dt <- as.data.table(df)
setkey(dt, from, to)
# create intervals
dt.interval <- data.table(from = seq(from = 0, by = 0.01, length.out = 100),
                          to = seq(from = 0.01, by = 0.01, length.out = 100))
# perform overlap join on intervals
dt2 <- foverlaps(dt.interval, dt, type = "within", nomatch = NA)[, sum(value), by = c("i.from", "group")]
# some melting and casting to fill in '0' on empty intervals
dt3 <- melt(dcast(dt2, ... ~ group, fill = 0), id.vars = 1)
# plot
ggplot(dt3) +
  geom_step(aes(x = i.from, y = value, color = variable)) +
  facet_grid(. ~ variable)
