Restructure data - get mean based on multiple grouping columns

Restructure data - get mean based on multiple grouping columns - r

library(tidyverse)
set.seed(10)
dat <- data.frame(age = sample(14:79, size = 15, replace = TRUE),
sex = sample(c("m", "f"), size = 15, replace = TRUE),
region = sample(c("A", "B", "C"), size = 15, replace = TRUE),
var1 = runif(15, min = 0, max=100)) %>%
mutate(agegrp = cut(age, breaks = c(-Inf, 20, 50, 70, Inf), labels = c("<= 20", "21-50", "51-70", ">70")))
The table looks like this:
age sex region var1 agegrp
1 79 m A 64.669975 >70
2 78 m C 92.789062 >70
3 23 m A 69.626845 21-50
4 25 m C 5.074013 21-50
5 60 f C 10.340510 51-70
6 36 f B 90.294240 21-50
7 23 m A 12.769088 21-50
8 27 f A 43.892321 21-50
9 35 f B 99.793467 21-50
10 40 f C 94.284903 21-50
11 25 m A 98.829001 21-50
12 55 m A 98.007185 51-70
13 43 f A 37.491168 21-50
14 68 m A 90.051414 51-70
15 76 f B 13.567239 >70
Unfortunately, our customer needs the data to be in a pretty weird format like this:
split value var1_mean
1 agegrp 21-50 61.3
2 agegrp 51-70 66.1
3 agegrp >70 57.0
4 sex m 55.7
5 sex f 66.5
6 region A 64.4
7 region B 67.9
8 region C 50.6
I can easily do this using the following code, but it is very unelegant:
age <- dat %>% group_by(agegrp) %>%
summarise(var1_mean = mean(var1)) %>%
mutate(value = agegrp,
split = "agegrp") %>%
select(split, value, var1_mean)
sex <- dat %>% group_by(sex) %>%
summarise(var1_mean = mean(var1)) %>%
mutate(value = "sex",
split = "sex") %>%
select(split, value, var1_mean)
region <- dat %>% group_by(region) %>%
summarise(var1_mean = mean(var1)) %>%
mutate(value = "region",
split = "region") %>%
select(split, value, var1_mean)
rbind(age, sex, region)
Is there a way to make this easier without "stacking" several tables manually (maybe using dplyr)?

You can pivot to long and summarise or, alternatively, iterate over the vars of interest:
library(dplyr)
library(purrr)
library(tidyr)
dat %>%
pivot_longer(-c(var1, age), names_to = "split") %>%
group_by(split, value) %>%
summarise(var1_mean = mean(var1))
Or:
map_df(set_names(c("agegrp", "sex", "region")), ~ dat %>%
group_by(across(.x)) %>%
summarise(var1_mean = mean(var1)) %>%
rename(value = .x), .id = "split")
# A tibble: 8 × 3
split value var1_mean
<chr> <chr> <dbl>
1 agegrp <= 20 54.6
2 agegrp 21-50 44.7
3 agegrp 51-70 46.4
4 sex f 37.4
5 sex m 55.5
6 region A 67.3
7 region B 47.9
8 region C 26.1

Loop through columns and aggregate, then rowbind the results:
do.call(rbind,
lapply(c("agegrp", "sex", "region"), function(i){
cbind(split = i,
setNames(aggregate(as.formula(paste("age ~", i)), mean, data = dat),
c("value", "var1_mean")))
}))
# split value var1_mean
# 1 agegrp <= 20 20.00000
# 2 agegrp 21-50 33.30000
# 3 agegrp 51-70 60.50000
# 4 sex f 39.62500
# 5 sex m 39.71429
# 6 region A 35.75000
# 7 region B 39.00000
# 8 region C 43.60000
Note: output is different, probably because the seeded data and shown data do not match.

Related

Select a group of n elements by position of the max in a data frame in r by group

I have a dataframe with group, value and columns based in rollaply mean of the last n values, just like that:
library(dplyr); library(zoo)
df = data.frame( group = c(rep(1,5), rep(2,5)),
value = c(23,14,53,12,56,32,65,76,36,74)) %>%
group_by(group) %>%
mutate(
roll1 = rollapplyr(value, 1, mean, fill = NA, na.rm = T, partial = F),
roll2 = rollapplyr(value, 2, mean, fill = NA, na.rm = T, partial = F),
roll3 = rollapplyr(value, 3, mean, fill = NA, na.rm = T, partial = F)
)
df
group value roll1 roll2 roll3
1 1 23 23 NA NA
2 1 14 14 18.5 NA
3 1 53 53 33.5 30
4 1 12 12 32.5 26.3
5 1 56 56 34 40.3
6 2 32 32 NA NA
7 2 65 65 48.5 NA
8 2 76 76 70.5 57.7
9 2 36 36 56 59
10 2 74 74 55 62
The 'rolln' column represents the average of the last n values.
Then I would like to summarize in a new dataframe which group of values provided the highest average. Remembering that the roll3 column, for example, has a set of 3 values.
I tried to use which.max function, but without success. The position of NA's in the final data.frame isn't important
Thanks in advance

I'd love to see a more concise solution, but this seems to work:
library(tidyverse)
df %>%
pivot_longer(starts_with("roll"), values_to = "avg") %>%
filter(!is.na(avg)) %>%
group_by(group, name) %>%
filter(slider::slide_dbl(avg, max, .after = 2) == max(avg)) %>% # EDIT #2
#filter(avg == max(avg) |
# lead(avg, default = 0) == max(avg) |
# lead(avg, 2, default = 0) == max(avg)) %>%
mutate(items = n() + 1 - parse_number(name)) %>% # EDIT
slice(items:n()) %>%
mutate(row = row_number()) %>%
select(-avg, -items) %>%
pivot_wider(names_from = name, values_from = value)
Result
group row roll1 roll2 roll3
<dbl> <int> <dbl> <dbl> <dbl>
1 1 1 56 12 53
2 1 2 NA 56 12
3 1 3 NA NA 56
4 2 1 76 65 76
5 2 2 NA 76 36
6 2 3 NA NA 74

How to transform two-timepoint wide data to long format by including all time points between

How to transform wide data to long format by including all time points by a step length of 1?
Illustration of what I need
Simulated data
library(tidyverse)
df = tibble(
id = c("a", "b", "c", "d", "e", "f"),
time1 = c(0,1,2,3,4,5),
time2 = c(3,2,6,7,5,9))
My own solution that gave an error
df %>%
mutate(
timepoint = str_c(seq(time1, time2, 1), ",", collapse ="")) %>%
separate_rows(timepoint, sep = ",")

You have to add rowwise:
df %>%
rowwise() %>%
mutate(timepoint = str_c(seq(time1, time2), collapse = ",")) %>%
separate_rows(timepoint, sep = ",", convert = TRUE)
Other solutions, in base R:
timepoint <- with(df, mapply(`:`, time1, time2))
data.frame(id = rep(df$id, lengths(timepoint)),
timepoint = unlist(timepoint))
In tidyverse, with map2 + seq (also works with :):
library(tidyverse)
df %>%
transmute(id, timepoint = map2(time1, time2, seq)) %>%
unnest(timepoint)
output
id timepoint
1 a 0
2 a 1
3 a 2
4 a 3
5 b 1
6 b 2
7 c 2
8 c 3
9 c 4
10 c 5
11 c 6
12 d 3
13 d 4
14 d 5
15 d 6
16 d 7
17 e 4
18 e 5
19 f 5
20 f 6
21 f 7
22 f 8
23 f 9

R rowwise replace the first instance of the minimum

How can I do the following:
replace all values < 6 with NA,
if there is only one NA in the row, replace the first instance of the minimum value with -99?
Some data that includes an ID variable and a total column:
library(tidyverse)
df <- data.frame(id = c(1,2,3,4,5), a = c(10,12,4,17,3), b = c(9,12,3,20,6), c = c(2,2,10,10,10), d = c(12,16,12,10,12))
df$total <- apply(df[,c(2:5)], 1, sum)
Giving
id a b c d total
1 10 9 2 12 33
2 12 12 2 16 42
3 4 3 10 12 29
4 17 20 10 10 57
5 3 6 10 12 31
My desired output is
id a b c d total
1 10 -99 NA 12 33
2 -99 12 NA 16 42
3 NA NA 10 12 29
4 17 20 10 10 57
5 NA -99 10 12 31
My attempt
df_mod <- df %>%
# Make <6 NA
mutate(
across(
.cols = 'a':'total',
~case_when(
.x < 6 ~ as.numeric(NA),
TRUE ~ .x
)
)
) %>%
# Add a count of NAs
rowwise() %>%
mutate(Count_NA = sum(is.na(cur_data()))) %>%
ungroup()
# Transpose and get row minimum
df_mod2 <- t(df_mod[,-c(1,ncol(df_mod))]) %>%
apply(., 2, function(a){
min <- min(a, na.rm = TRUE)
}
) %>%
cbind(df_mod, .) %>%
rename(., min = .) %>%
tibble(.)
# If count_NA = 1 replace the first instance of min
df_mod2 %>%
rowwise() %>%
mutate(
across(
.cols = 'a':'total',
~case_when(
Count_NA == 1 & .x == min ~ replace(.x, first(match(min, .x)), -99),
TRUE ~ .x)
)
) %>%
select(-'Count_NA', -'min')
Which gives the following
id a b c d total
1 10 -99 NA 12 33
2 -99 -99 NA 16 42
3 NA NA 10 12 29
4 17 20 10 10 57
5 NA -99 10 12 31
Thanks

If you're willing to pivot rather than work rowwise, then this solution will work.
library(dplyr)
df %>%
pivot_longer(names_to = 'col',
values_to = 'val',
-c(id, total)) %>%
group_by(id) %>%
mutate(val2 = rank(val, ties.method = 'first'),
val = ifelse(val < 6, NA , val),
val = ifelse(sum(is.na(val)) == 1 & val2 == 2, -99, val)) %>%
select(-val2) %>%
pivot_wider(names_from = col,
values_from = val) %>%
relocate(total, .after = "d")
Here's the result:
# A tibble: 5 × 6
# Groups: id [5]
id a b c d total
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 -99 NA 12 33
2 2 -99 12 NA 16 42
3 3 NA NA 10 12 29
4 4 17 20 10 10 57
5 5 NA -99 10 12 31

It is not clear what you mean by '2nd' minimum value because you replace minimum value. You can use data.table:
library(data.table)
setDT(df)[
,
(cols) := transpose(
lapply(
transpose(lapply(.SD, function(x) fifelse(x < 6, NA_real_, x))),
function(x) if(sum(is.na(x)) == 1) replace(x, which.min(x), -99) else x
)
),
.SDcols = setdiff(names(df), "id")
]

percentage change between consective columns

Looking for clever ways to automatically create percentage change columns between consecutive variables with the same prefix. I would also like to create similar variables for every 2nd variable. We can assume these variables represent years and we would like percent change between each year and every second year.
library(tidyverse)
df <- data.frame(xx = c(1, 2, 3),
a_12 = c(10, 20, 20),
a_13 = c(30, 40, 10),
a_14 = c(23, 34, 56),
a_15 = c(25, 34, 56),
a_16 = c(23, 34, 56))
df
# xx a_12 a_13 a_14 a_15 a_16
# 1 1 10 30 23 23 23
# 2 2 20 40 34 34 34
# 3 3 20 10 56 56 56
What I want is to automate the following:
df %>%
mutate(a_diff_12_13 = ((a_13 - a_12)/a_12)*100,
a_diff_13_14 = ((a_14 - a_13)/a_13)*100,
a_diff_14_15 = ((a_15 - a_14)/a_14)*100,
#ALSO EVERY SECOND YEAR
a_diff_12_14 = ((a_14 - a_12)/a_12)*100,
a_diff_14_16 = ((a_16 - a_14)/a_14)*100)
# xx a_12 a_13 a_14 a_15 a_16 a_diff_12_13 a_diff_13_14 a_diff_14_15 a_diff_12_14 a_diff_14_16
# 1 1 10 30 23 25 23 200 -23.3 8.7 130 0
# 2 2 20 40 34 34 34 100 -15.0 0.0 70 0
# 3 3 20 10 56 56 56 -50 460.0 0.0 180 0
Thanks

try it this way
df %>%
pivot_longer(-xx) %>%
group_by(xx) %>%
mutate(Diff = (value / lag(value) - 1) * 100) %>%
pivot_wider(id_cols = xx, names_from = name, values_from = Diff, names_prefix = "diff_") %>%
left_join(df) %>%
select(xx, starts_with("a"), everything())
Joining, by = "xx"
# A tibble: 3 x 11
# Groups: xx [3]
xx a_12 a_13 a_14 a_15 a_16 diff_a_12 diff_a_13 diff_a_14 diff_a_15 diff_a_16
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 30 23 25 23 NA 200 -23.3 8.70 -8.00
2 2 20 40 34 34 34 NA 100 -15. 0 0
3 3 20 10 56 56 56 NA -50 460. 0 0
or you can do it
n_lag <- seq(1,2)
df_long <- df %>%
pivot_longer(-xx) %>%
group_by(xx)
tmp <- map_dfc(
n_lag,
~ transmute(df_long, !!paste0("diff_", .x) := (value / lag(value, .x) - 1) * 100)) %>%
select(starts_with("diff_"))
bind_cols(df_long, tmp) %>%
pivot_wider(
id_cols = xx,
names_from = name,
values_from = starts_with("diff_")) %>%
left_join(df) %>%
select(xx, starts_with("a"), everything())

Concatenate pairs of columns for entire dataframe

I am working with genetic data and I need to concatenate pairs of columns. The data I have has the major and minor alleles in separate columns (e.g., allele1a, allele1b, allele2a, allele2b, etc. etc.). I need a way to pairs of columns for the entire data frame. I included a sample below, but my data has 1.7 million pairs (so I have 3.4 million columns right now), so it will not work if I need to name each column. I will change the column names later. Any guidance is appreciated if there is a way to do this in R. I have tried to create a sequence and paste them, something like:
df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)
i <- seq.int(1, length(ped), by = 2L)
df <- paste0(df[i], df[i+1])
but that did not work. I want it to go from:
id var1 var2 var3 var4 var5 var6
1 1 A T C C A A
2 2 A T C C A A
3 3 A T C C A A
4 4 A T C C A A
5 5 A T C C A A
6 6 A T C C A A
7 7 A T C C A A
8 8 A T C C A A
9 9 A T C C A A
10 10 A T C C A A
11 11 A A C G A G
12 12 A A C G A G
13 13 A A C G A G
14 14 A A C G A G
15 15 A A C G A G
16 16 A A C G A G
17 17 A A C G A G
18 18 A A C G A G
19 19 A A C G A G
20 20 A A C G A G
to:
id var1 var2 var3
1 1 AT CC AA
2 2 AT CC AA
3 3 AT CC AA
4 4 AT CC AA
5 5 AT CC AA
6 6 AT CC AA
7 7 AT CC AA
8 8 AT CC AA
9 9 AT CC AA
10 10 AT CC AA
11 11 AA CG AG
12 12 AA CG AG
13 13 AA CG AG
14 14 AA CG AG
15 15 AA CG AG
16 16 AA CG AG
17 17 AA CG AG
18 18 AA CG AG
19 19 AA CG AG
20 20 AA CG AG
edit:
Thank you!!! I was able to adapt two of the answers for my data and #akrun's ran a little faster. I created a subset of my data with 100 rows and 100,000 columns and the results are below:
microbenchmark(
+ {
+ new <- ped %>%
+ gather(key = V, value = value, -id) %>%
+ mutate(V = str_extract(V, "\\d+") %>% as.numeric()) %>%
+ group_by(id) %>%
+ mutate(pair = ceiling(V / 2)) %>%
+ group_by(id, pair) %>%
+ summarise(combined = paste(value, collapse = "")) %>%
+ mutate(V_combo = paste0("V", pair)) %>%
+ select(-pair) %>%
+ spread(key = V_combo, value = combined) %>%
+ select(id, paste0("V", seq(1, ncol(.)-1, 1)))
+ },
+ {
+ out <- ped[1]
+ new_cols <- paste0("V", seq(1, (ncol(ped)-1)/2))
+
+ out[new_cols] <- lapply(seq(2, ncol(ped)-1, 2),
+ function(i) do.call(paste0, ped[i:(i+1)]))
+ },
+ times = 1
+ )
Unit: seconds
expr min lq mean median uq max neval
camille 250.30901 250.30901 250.30901 250.30901 250.30901 250.30901 1
akrun 23.52434 23.52434 23.52434 23.52434 23.52434 23.52434 1
>
> new <- data.frame(new, stringsAsFactors = FALSE)
> identical(new, out)
[1] TRUE

We can create a loop to subset the columns along with the adjacent column, paste it together withdo.call` and assign it as new columns to the new dataset
out <- df[1]
out[paste0("var", 1:3)] <- lapply(seq(2, ncol(df), 2),
function(i) do.call(paste0, df[i:(i+1)]))

Here's a tidyverse way designed to scale fairly well. Instead of hard-coding that you want to pair columns 1 & 2, 3 & 4, and 5 & 6, I'm reshaping to long data to get a variable number, grouping those into pairs by dividing the variable number by 2, collapsing the letters in each pair, and reshaping back to wide. This way, you can do the same procedure on any even number of columns.
library(tidyverse)
...
Filtering for ID 1 to show a glimpse of this:
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
filter(id == 1)
#> # A tibble: 6 x 4
#> # Groups: id [1]
#> id var value pair
#> <int> <dbl> <chr> <dbl>
#> 1 1 1 A 1
#> 2 1 2 T 1
#> 3 1 3 C 2
#> 4 1 4 C 2
#> 5 1 5 A 3
#> 6 1 6 A 3
Then collapsing strings as a summarizing value for each combination of ID and pair:
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
group_by(id, pair) %>%
summarise(combined = paste(value, collapse = ""))
#> # A tibble: 60 x 3
#> # Groups: id [?]
#> id pair combined
#> <int> <dbl> <chr>
#> 1 1 1 AT
#> 2 1 2 CC
#> 3 1 3 AA
#> 4 2 1 AT
#> 5 2 2 CC
#> 6 2 3 AA
#> 7 3 1 AT
#> 8 3 2 CC
#> 9 3 3 AA
#> 10 4 1 AT
#> # ... with 50 more rows
And using spread to get back into a wide format.
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
group_by(id, pair) %>%
summarise(combined = paste(value, collapse = "")) %>%
mutate(var_combo = paste0("var", pair)) %>%
select(-pair) %>%
spread(key = var_combo, value = combined) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: id [6]
#> id var1 var2 var3
#> <int> <chr> <chr> <chr>
#> 1 1 AT CC AA
#> 2 2 AT CC AA
#> 3 3 AT CC AA
#> 4 4 AT CC AA
#> 5 5 AT CC AA
#> 6 6 AT CC AA
Created on 2018-11-07 by the reprex package (v0.2.1)

Using tidyverse, you can compose the modifying expressions ahead of time, then pass them all to transmute in bulk. This solution uses column names and is therefore robust to the column ordering: if you shuffle your allele columns, this should still give you the same answer.
library( tidyverse )
# Create expressions of the form allele1 = str_c(allele1a, allele1b)
v <- str_c("allele",1:3) %>% set_names %>%
map( ~glue::glue("str_c({.}a, {.}b)") ) %>% map( rlang::parse_expr )
df %>% transmute( id = id, !!!v )
# # A tibble: 20 x 4
# id allele1 allele2 allele3
# <int> <chr> <chr> <chr>
# 1 1 AT CC AA
# 2 2 AT CC AA
# 3 3 AT CC AA
# 4 4 AT CC AA
# ...
I modified your data to closer match your description:
df <- data_frame(id = seq(1,20),
allele1a = rep("A", 20),
allele1b = c(rep("T", 10), rep("A", 10)),
allele2a = rep("C", 20),
allele2b = c(rep("C", 10), rep("G", 10)),
allele3a = rep("A", 20),
allele3b = c(rep("A", 10), rep("G", 10)))

using base r you could do:
a <- seq(2,ncol(df),2)
b <- paste0(unlist(df[a]),unlist(df[a+1]))
d <- data.frame(matrix(b,nrow(df)))
result <- cbind(df[1],d)
This can also be written in a one line:
(dat = data.frame(matrix(paste0(unlist(df[a<-seq(2,ncol(df),2)]),unlist(df[a+1])),nrow(df))))
X1 X2 X3
1 AT CC AA
2 AT CC AA
3 AT CC AA
4 AT CC AA
5 AT CC AA
6 AT CC AA
7 AT CC AA
8 AT CC AA
9 AT CC AA
10 AT CC AA
11 AA CG AG
12 AA CG AG
13 AA CG AG
14 AA CG AG
15 AA CG AG
16 AA CG AG
17 AA CG AG
18 AA CG AG
19 AA CG AG
20 AA CG AG
Then cbind it with the id column:
cbind(df[1],dat)

df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)
df2 <- data.frame(id = df[,1], var1 = paste(df[,2], df[,3], sep = ""),
var2 = paste(df[,4], df[,5], sep = ""),
var3 = paste(df[,6], df[,7], sep = ""))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Restructure data - get mean based on multiple grouping columns - r

Related

Select a group of n elements by position of the max in a data frame in r by group

How to transform two-timepoint wide data to long format by including all time points between

R rowwise replace the first instance of the minimum

percentage change between consective columns

Concatenate pairs of columns for entire dataframe

Categories

Resources