I have a dataframe that looks like this:
x <- data.frame(cell = c(0,0,1,1,2,2), group = c('wt', 'mut', 'wt', 'mut', 'wt', 'mut'), val1 = c(4,5,7,1,7,8), val2 = c(1,4,2,3,5,6))
x
cell group val1 val2
1 0 wt 4 1
2 0 mut 5 4
3 1 wt 7 2
4 1 mut 1 3
5 2 wt 7 5
6 2 mut 8 6
For each cell, I would like to calculate the log ratio of the values between the wt and mut groups. For example, the final dataframe would look like this:
cell val1_ratio val2_ratio
1 0 -0.223 -1.386
2 1 1.950 -0.405
3 2 -0.134 -0.182
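We can group by cell and divide the first value by the second within each group, then take the log. This assumes the wt row comes before the mut row in every cell: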
library(dplyr)
x %>%
group_by(cell) %>%
summarise(across(starts_with('val'),
~ log(.x[1]/.x[2]), .names = '{.col}_ratio'))
Output:
# A tibble: 3 × 3
cell val1_ratio val2_ratio
<dbl> <dbl> <dbl>
1 0 -0.223 -1.39
2 1 1.95 -0.405
3 2 -0.134 -0.182
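The division above treats the first row of each cell as wt and the second as mut. If the row order cannot be guaranteed, one possible variant selects the values by their group label instead. This is only a sketch and assumes exactly one wt and one mut row per cell; the rows of cell 0 are swapped here on purpose to show that the result does not change:

```r
library(dplyr)

# Same data as in the question, but with the two rows of cell 0 swapped
x <- data.frame(
  cell  = c(0, 0, 1, 1, 2, 2),
  group = c("mut", "wt", "wt", "mut", "wt", "mut"),
  val1  = c(5, 4, 7, 1, 7, 8),
  val2  = c(4, 1, 2, 3, 5, 6)
)

ratios <- x %>%
  group_by(cell) %>%
  summarise(across(starts_with("val"),
                   # pick wt and mut by label rather than by position
                   ~ log(.x[group == "wt"] / .x[group == "mut"]),
                   .names = "{.col}_ratio"))
ratios
```

If a cell had several wt or mut rows, the lambda would return more than one value per group, so the one-row-per-label assumption matters.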
A data.table option:
df <-
data.frame(
cell = c(0, 0, 1, 1, 2, 2),
group = c('wt', 'mut', 'wt', 'mut', 'wt', 'mut'),
val1 = c(4, 5, 7, 1, 7, 8),
val2 = c(1, 4, 2, 3, 5, 6)
)
library(data.table)
setDT(df)[, log(.SD[1] / .SD[2]), by = cell, .SDcols = c("val1", "val2")]
#> cell val1 val2
#> 1: 0 -0.2231436 -1.3862944
#> 2: 1 1.9459101 -0.4054651
#> 3: 2 -0.1335314 -0.1823216
Created on 2021-10-29 by the reprex package (v2.0.1)
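For comparison, the same log ratios can be computed in base R without any packages by putting the wt and mut rows side by side with merge(). This is a rough sketch, not taken from the answers above; the names wide and res are made up for illustration:

```r
x <- data.frame(
  cell  = c(0, 0, 1, 1, 2, 2),
  group = c("wt", "mut", "wt", "mut", "wt", "mut"),
  val1  = c(4, 5, 7, 1, 7, 8),
  val2  = c(1, 4, 2, 3, 5, 6)
)

# One row per cell, with the wt and mut values in separate columns
wide <- merge(x[x$group == "wt", ], x[x$group == "mut", ],
              by = "cell", suffixes = c("_wt", "_mut"))

res <- data.frame(
  cell       = wide$cell,
  val1_ratio = log(wide$val1_wt / wide$val1_mut),
  val2_ratio = log(wide$val2_wt / wide$val2_mut)
)
res
```

Because the rows are matched on cell and selected by label, this version does not depend on the row order within a cell.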
I have data with several rows per id and time that contain -Inf values, and I would like to use dplyr (tidyverse) to calculate the proportion of -Inf values per id and time.
This is my data:
dt <- data.frame(id = rep(1:3, each = 4),
time = rep(1:4, times = 3),
x = c(1, 2, 1, -Inf, 2, -Inf,1, 1, 5, 1, 2, 1),
y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))
In the real data I have more than 100 columns, but to keep it simple I include only x and y.
The expected result:
id time n
2 1 2 0.5
3 1 3 0.5
4 1 4 1.0
5 2 1 0.5
6 2 2 0.5
7 2 3 0.5
The idea is to use some specific columns to generate a vector according to a specific calculation function.
After searching I found the rowwise() function, but it did not help. Here is my attempt:
dt %>%
group_by(id,time) %>%
summarise(n = across(x:y, ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))
dt %>%
group_by(id,time) %>%
rowwise() %>%
summarise(n = across(everything(), ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))
dt %>%
rowwise() %>%
summarise(n = across(everything(), ~mean(is.infinite(x) & x < 0, na.rm=TRUE)))
# same results:
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 12 x 3
# Groups: id [3]
id time n$x $y
<int> <int> <dbl> <dbl>
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 1 1
5 2 1 0 0
6 2 2 1 1
7 2 3 0 0
8 2 4 0 0
9 3 1 0 0
10 3 2 0 0
11 3 3 0 0
12 3 4 0 0
Could you help me to generate this vector n?
I think I understand better what you're aiming to do here. across isn't needed (as it's more for modifying columns in place). Either rowwise or group_by would work:
library(dplyr)
dt <- data.frame(id = rep(1:3, each = 4),
time = rep(1:4, times = 3),
x = c(1, 2, 1, -Inf, 2, -Inf,1, 1, 5, 1, 2, 1),
y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))
dt %>%
group_by(id, time) %>%
summarise(n = mean(c(is.infinite(x), is.infinite(y)))) %>%
filter(n != 0)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 3
#> # Groups: id [2]
#> id time n
#> <int> <int> <dbl>
#> 1 1 2 0.5
#> 2 1 3 0.5
#> 3 1 4 1
#> 4 2 1 0.5
#> 5 2 2 0.5
#> 6 2 3 0.5
Here's a possible way of doing the calculation across any number of columns after grouping (by making a quick function to check the negative and the infinite value):
library(dplyr)
dt <- data.frame(id = rep(1:3, each = 4),
time = rep(1:4, times = 3),
x = c(1, 2, 1, -Inf, 2, -Inf,1, 1, 5, 1, 2, 1),
y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2),
z = sample(c(1, 2, -Inf), 12, replace = TRUE))
is_minus_inf <- function(x) is.infinite(x) & x < 0
dt %>%
group_by(id, time) %>%
mutate(n = mean(is_minus_inf(c_across(everything()))))
#> # A tibble: 12 × 6
#> # Groups: id, time [12]
#> id time x y z n
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 2 2 0
#> 2 1 2 2 -Inf -Inf 0.667
#> 3 1 3 1 -Inf 2 0.333
#> 4 1 4 -Inf -Inf 1 0.667
#> 5 2 1 2 -Inf 1 0.333
#> 6 2 2 -Inf 5 2 0.333
#> 7 2 3 1 -Inf -Inf 0.667
#> 8 2 4 1 2 2 0
#> 9 3 1 5 1 1 0
#> 10 3 2 1 2 1 0
#> 11 3 3 2 2 2 0
#> 12 3 4 1 2 -Inf 0.333
(Or even simpler, use mutate(n = mean(c_across(everything()) == -Inf, na.rm = TRUE)) and no new checking function is needed)
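When every id/time combination is a single row, as in the sample data, the same proportion can also be sketched in base R with rowMeans() on a logical matrix. The helper name value_cols is made up for illustration, and the approach assumes all non-key columns are numeric:

```r
dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

# Every column except the keys, so this scales to 100+ value columns
value_cols <- setdiff(names(dt), c("id", "time"))

# Comparing a data frame with -Inf gives a logical matrix;
# rowMeans() then returns the share of TRUE values per row
dt$n <- rowMeans(dt[value_cols] == -Inf, na.rm = TRUE)

# Keep only rows with at least one -Inf, as in the expected result
res <- dt[dt$n != 0, c("id", "time", "n")]
res
```

If an id/time pair could span several rows, a grouped summary like the ones above would be needed instead.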
How about this solution? It appears to give the desired output and scales to any number of columns.
First I "melt" the columns x and y into long format and then summarise over id and time:
dt %>%
reshape2::melt(id = c("id", "time")) %>%
group_by(id, time) %>%
summarise(count_neg_inf = mean(value == -Inf, na.rm = TRUE))
regards,
Samuel
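The same reshape-then-summarise idea can be sketched with tidyr::pivot_longer() in place of reshape2::melt(). This is an alternative, not the code from the answer above; the column names variable and value are chosen here for illustration:

```r
library(dplyr)
library(tidyr)

dt <- data.frame(id = rep(1:3, each = 4),
                 time = rep(1:4, times = 3),
                 x = c(1, 2, 1, -Inf, 2, -Inf, 1, 1, 5, 1, 2, 1),
                 y = c(2, -Inf, -Inf, -Inf, -Inf, 5, -Inf, 2, 1, 2, 2, 2))

res <- dt %>%
  # stack every non-key column into a single value column
  pivot_longer(cols = -c(id, time),
               names_to = "variable", values_to = "value") %>%
  group_by(id, time) %>%
  # share of -Inf values per id/time combination
  summarise(n = mean(value == -Inf, na.rm = TRUE), .groups = "drop")
res
```

Adding filter(n != 0) at the end would reproduce the six rows shown in the expected result.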
I am trying to create a weighted average for each week, across multiple columns. My data looks like this:
week <- c(1,1,1,2,2,3)
col_a <- c(1,2,2,4,2,7)
col_b <- c(4,2,3,1,2,5)
col_c <- c(4,2,3,2,2,4)
dfreprex <- data.frame(week,col_a,col_b,col_c)
week col_a col_b col_c
1 1 1 4 4
2 1 2 2 2
3 1 2 3 3
4 2 4 1 2
5 2 2 2 2
6 3 7 5 4
weightsreprex <- data.frame(county = c("col_a", "col_b", "col_c")
, weights = c(.3721, .3794, .2485))
How do I weight each column and then get the mean? Is there a simpler way than just multiplying each column by its weight in a new column (col_a_weighted) and then taking the rowmean of the weighted columns only?
I have tried weighted.mean, rowMeans, group_by and summarise.
We may use %*% for matrix multiplication:
dfreprex$wtmean <- as.matrix(dfreprex[,-1]) %*% as.matrix(weightsreprex[, 2])
dfreprex
week col_a col_b col_c wtmean
1 1 1 4 4 2.8837
2 1 2 2 2 2.0000
3 1 2 3 3 2.6279
4 2 4 1 2 2.3648
5 2 2 2 2 2.0000
6 3 7 5 4 5.4957
We might also use crossprod
crossprod(t(as.matrix(dfreprex[,-1])), as.matrix(weightsreprex[, 2]))
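The matrix product relies on the weights being in the same order as the columns and on the weights summing to 1 (which they do here). A slightly more defensive sketch matches the weights to the columns by name and divides by the total weight; the vector name w is an assumption made for illustration:

```r
dfreprex <- data.frame(week  = c(1, 1, 1, 2, 2, 3),
                       col_a = c(1, 2, 2, 4, 2, 7),
                       col_b = c(4, 2, 3, 1, 2, 5),
                       col_c = c(4, 2, 3, 2, 2, 4))
weightsreprex <- data.frame(county  = c("col_a", "col_b", "col_c"),
                            weights = c(.3721, .3794, .2485))

# Named weight vector, so columns are selected by name instead of position
w <- setNames(weightsreprex$weights, weightsreprex$county)

# Weighted sum per row, divided by the total weight to get a weighted mean
wtmean <- drop(as.matrix(dfreprex[names(w)]) %*% w) / sum(w)
wtmean
```

With this form the rows of weightsreprex could be in any order, and weights that do not sum to 1 would still give a proper mean.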
You can use stats::weighted.mean() here:
library(tidyverse)
dfreprex <- structure(list(week = c(1, 1, 1, 2, 2, 3), col_a = c(1, 2, 2, 4, 2, 7), col_b = c(4, 2, 3, 1, 2, 5), col_c = c(4, 2, 3, 2, 2, 4)), class = "data.frame", row.names = c(NA, -6L))
weightsreprex <- data.frame(county = c("col_a", "col_b", "col_c"), weights = c(.3721, .3794, .2485))
dfreprex %>%
rowwise() %>%
mutate(wt_avg = weighted.mean(c(col_a, col_b, col_c), weightsreprex$weights))
#> # A tibble: 6 × 5
#> # Rowwise:
#> week col_a col_b col_c wt_avg
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 4 4 2.88
#> 2 1 2 2 2 2
#> 3 1 2 3 3 2.63
#> 4 2 4 1 2 2.36
#> 5 2 2 2 2 2
#> 6 3 7 5 4 5.50
Created on 2022-11-09 with reprex v2.0.2
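The answers above return one weighted mean per row. If the goal is a single figure per week, which is one reading of the question, the row-level values could be averaged within each week afterwards. The following base R sketch rests on that assumption; the name weekly is made up for illustration:

```r
dfreprex <- data.frame(week  = c(1, 1, 1, 2, 2, 3),
                       col_a = c(1, 2, 2, 4, 2, 7),
                       col_b = c(4, 2, 3, 1, 2, 5),
                       col_c = c(4, 2, 3, 2, 2, 4))
weightsreprex <- data.frame(county  = c("col_a", "col_b", "col_c"),
                            weights = c(.3721, .3794, .2485))

# Row-level weighted mean across the three columns
dfreprex$wt_avg <- apply(dfreprex[c("col_a", "col_b", "col_c")], 1,
                         weighted.mean, w = weightsreprex$weights)

# Assumption: the weekly figure is the plain mean of the row-level values
weekly <- aggregate(wt_avg ~ week, data = dfreprex, FUN = mean)
weekly
```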
I have a dataframe like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
# Limits for desired cumulative sum (CumSum)
maxCumSum <- 8
minCumSum <- 0
What I would like to calculate is a cumulative sum of value by group (grp) within the values of maxCumSum and minCumSum. The respective table dt2 should look something like this:
grp t value CumSum
a 1 -1 0
a 2 5 5
a 3 9 8
a 4 -15 0
a 5 6 6
b 1 5 5
b 2 1 6
b 3 7 8
b 4 -11 0
b 5 9 8
Think of CumSum as a water storage which has a certain maximum capacity and whose level cannot sink below zero.
The normal cumsum obviously does not do the trick, since it has no upper or lower limit. Does anyone have a suggestion for how to achieve this? In the real dataframe there are of course more than 2 groups and far more than 5 time points.
Many thanks!
You can write a function that adds each value to the running total and clamps the result to the range between minCumSum and maxCumSum, then apply it step by step with Reduce(accumulate = TRUE):
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
library(dplyr)
maxCumSum <- 8
minCumSum <- 0
f <- function(x, y) max(min(x + y, maxCumSum), minCumSum)
df %>%
group_by(grp) %>%
mutate(CumSum = Reduce(f, value, 0, accumulate = TRUE)[-1])
#> # A tibble: 10 × 4
#> # Groups: grp [2]
#> grp t value CumSum
#> <chr> <int> <dbl> <dbl>
#> 1 a 1 -1 0
#> 2 a 2 5 5
#> 3 a 3 9 8
#> 4 a 4 -15 0
#> 5 a 5 6 6
#> 6 b 1 5 5
#> 7 b 2 1 6
#> 8 b 3 7 8
#> 9 b 4 -11 0
#> 10 b 5 9 8
Created on 2022-07-04 by the reprex package (v2.0.1)
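The same clamped running total can be wrapped in a small base R function that takes the limits as arguments instead of reading them from global variables. This is a sketch; bounded_cumsum and its parameters lower, upper and init are names invented here:

```r
# Running sum that is clamped to [lower, upper] after every step
bounded_cumsum <- function(x, lower, upper, init = 0) {
  step <- function(level, change) max(min(level + change, upper), lower)
  # accumulate = TRUE keeps every intermediate level; drop the initial value
  Reduce(step, x, init, accumulate = TRUE)[-1]
}

df <- data.frame(grp = c(rep("a", 5), rep("b", 5)),
                 t = c(1:5, 1:5),
                 value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))

# ave() applies the function within each group and keeps the row order
df$CumSum <- ave(df$value, df$grp,
                 FUN = function(v) bounded_cumsum(v, lower = 0, upper = 8))
df
```

Passing the limits explicitly makes it straightforward to use different capacities for different datasets.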
This is a follow-up to the question "How to add a row to a dataframe modifying only some columns".
After solving this question I wanted to apply the solution provided by stefan to a larger dataframe with group_by:
My dataframe:
df <- structure(list(test_id = c(1, 1, 1, 1, 1, 1, 1, 1), test_nr = c(1,
1, 1, 1, 2, 2, 2, 2), region = c("A", "B", "C", "D", "A", "B",
"C", "D"), test_value = c(3, 1, 1, 2, 4, 2, 4, 1)), class = "data.frame", row.names = c(NA,
-8L))
test_id test_nr region test_value
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 1 2 A 4
6 1 2 B 2
7 1 2 C 4
8 1 2 D 1
I now want to add a new row to each group with this code, which gives an error:
df %>%
group_by(test_nr) %>%
add_row(test_id = .$test_id[1], test_nr = .$test_nr[1], region = "mean", test_value = mean(.$test_value))
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
My expected output would be:
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
I have tried so far:
library(tidyverse)
df %>%
group_by(test_nr) %>%
group_split() %>%
map_dfr(~ .x %>%
add_row(!!! map(.[4], mean)))
test_id test_nr region test_value
<dbl> <dbl> <chr> <dbl>
1 1 1 A 3
2 1 1 B 1
3 1 1 C 1
4 1 1 D 2
5 NA NA NA 1.75
6 1 2 A 4
7 1 2 B 2
8 1 2 C 4
9 1 2 D 1
10 NA NA NA 2.75
How could I modify columns 1 to 3 to place my values there?
I actually recently made a little helper function for exactly this. The idea
is to use group_modify() to take the group data, and
bind_rows() the summary statistics calculated with summarise().
This is what it looks like in code:
add_summary_rows <- function(.data, ...) {
group_modify(.data, function(x, y) bind_rows(x, summarise(x, ...)))
}
And here’s how that would work with your data:
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(
test_id = c(1, 1, 1, 1, 1, 1, 1, 1),
test_nr = c(1, 1, 1, 1, 2, 2, 2, 2),
region = c("A", "B", "C", "D", "A", "B", "C", "D"),
test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)
df %>%
group_by(test_id, test_nr) %>%
add_summary_rows(
region = "MEAN",
test_value = mean(test_value)
)
#> # A tibble: 10 x 4
#> # Groups: test_id, test_nr [2]
#> test_id test_nr region test_value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 1 A 3
#> 2 1 1 B 1
#> 3 1 1 C 1
#> 4 1 1 D 2
#> 5 1 1 MEAN 1.75
#> 6 1 2 A 4
#> 7 1 2 B 2
#> 8 1 2 C 4
#> 9 1 2 D 1
#> 10 1 2 MEAN 2.75
You can combine your two approaches:
df %>%
split(~test_nr) %>%
map_dfr(~ .x %>%
add_row(test_id = .$test_id[1],
test_nr = .$test_nr[1],
region = "mean",
test_value = mean(.$test_value)))
You could achieve your target with this Base R one-liner:
merge( df, aggregate( df, by = list( df$test_nr ), FUN = mean ), all = TRUE )[ , 1:4 ]
aggregate provides the rows you need, and merge inserts them into the right places of your dataframe. You don't need the last column of the combined dataframe, so keep only the first four columns. The code produces some warnings for the region column, which can be disregarded. The region column of the added rows is NA rather than showing the function name (MEAN).
Making it a little more generic:
f <- "mean"
df1 <- merge( df, aggregate( df, by = list( df$test_id, df$test_nr ),
FUN = f ), all = TRUE )[ , 1:4 ]
df1$region[ is.na( df1$region ) ] <- toupper( f )
Here you also aggregate by test_id, the function can be changed in one place, and its name is written to the region column:
> df1
test_id test_nr region test_value
1 1 1 A 3.00
2 1 1 B 1.00
3 1 1 C 1.00
4 1 1 D 2.00
5 1 1 MEAN 1.75
6 1 2 A 4.00
7 1 2 B 2.00
8 1 2 C 4.00
9 1 2 D 1.00
10 1 2 MEAN 2.75
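A base R variant that avoids the warnings splits the data by group, appends one summary row to each piece, and binds the pieces back together. This is a sketch under the assumption that only test_value needs summarising; add_mean_row is a hypothetical helper name:

```r
df <- data.frame(
  test_id    = c(1, 1, 1, 1, 1, 1, 1, 1),
  test_nr    = c(1, 1, 1, 1, 2, 2, 2, 2),
  region     = c("A", "B", "C", "D", "A", "B", "C", "D"),
  test_value = c(3, 1, 1, 2, 4, 2, 4, 1)
)

# Append a single MEAN row to one group's rows
add_mean_row <- function(g) {
  rbind(g, data.frame(test_id    = g$test_id[1],
                      test_nr    = g$test_nr[1],
                      region     = "MEAN",
                      test_value = mean(g$test_value)))
}

# Split by both keys, extend each piece, and stack the pieces again
pieces <- split(df, list(df$test_id, df$test_nr), drop = TRUE)
out <- do.call(rbind, lapply(pieces, add_mean_row))
rownames(out) <- NULL
out
```

Since mean() is only applied to test_value, no warnings are raised for the character column.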
I have a dataframe somewhat similar to the one below (df). I need to add a new column indicating the ratio of the largest value for each row (= largest value in row divided by sum of all values in the row). The output should look similar to df1.
df <- data.frame('x' = c(1, 4, 1, 4, 1), 'y' = c(4, 6, 5, 2, 3), 'z' = c(5, 3, 2, 3, 2))
df1 <- data.frame('x' = c(1, 4, 1, 4, 1), 'y' = c(4, 6, 5, 2, 3), 'z' = c(5, 3, 2, 3, 2), 'ratio' = c(0.5, 0.462, 0.625, 0.444, 0.5))
Thank you!
Here is a solution using dplyr:
library(dplyr)
df %>%
rowwise() %>%
mutate(max_value = max(x,y,z),
sum_values = sum(x,y,z),
ratio = max_value / sum_values) #%>%
# select(-max_value, -sum_values) # uncomment this line (and the %>% above) to get df1 as in your question
# A tibble: 5 x 6
x y z max_value sum_values ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 5 5 10 0.5
2 4 6 3 6 13 0.462
3 1 5 2 5 8 0.625
4 4 2 3 4 9 0.444
5 1 3 2 3 6 0.5
library(tidyverse)
df %>%
rowwise() %>%
mutate(MAX = max(x,y,z, na.rm = TRUE ),
SUM = sum(x,y,z, na.rm = TRUE),
ratio = MAX / SUM)
# A tibble: 5 x 6
x y z MAX SUM ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 5 5 10 0.5
2 4 6 3 6 13 0.462
3 1 5 2 5 8 0.625
4 4 2 3 4 9 0.444
5 1 3 2 3 6 0.5
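With many columns, listing x, y, z by hand becomes tedious. A possible rowwise variant uses c_across() with a column range; the sketch below assumes all columns from x to z are numeric:

```r
library(dplyr)

df <- data.frame(x = c(1, 4, 1, 4, 1),
                 y = c(4, 6, 5, 2, 3),
                 z = c(5, 3, 2, 3, 2))

res <- df %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into a vector
  mutate(ratio = max(c_across(x:z)) / sum(c_across(x:z))) %>%
  ungroup()
res
```

The ungroup() call at the end drops the rowwise grouping so that later steps operate on whole columns again.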
Another option with rowSums and pmax
library(dplyr)
library(purrr)
df %>%
mutate(ratio = reduce(., pmax)/rowSums(.))
# x y z ratio
#1 1 4 5 0.5000000
#2 4 6 3 0.4615385
#3 1 5 2 0.6250000
#4 4 2 3 0.4444444
#5 1 3 2 0.5000000
Or in base R
df$ratio <- do.call(pmax, df)/rowSums(df)
Additional solution
df$ratio <- apply(df, 1, function(x) max(x, na.rm = TRUE) / sum(x, na.rm = TRUE))
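The base R one-liners above use every column of df, so they break if the dataframe also holds non-numeric columns, and running them twice would include the ratio column itself. A hedged sketch that restricts the calculation to numeric columns and ignores missing values follows; the id column and the NA are added here purely for illustration:

```r
df <- data.frame(id = c("a", "b", "c"),
                 x  = c(1, 4, NA),
                 y  = c(4, 6, 5),
                 z  = c(5, 3, 2))

# Keep only the numeric columns for the calculation
num <- df[vapply(df, is.numeric, logical(1))]

# c(num, na.rm = TRUE) builds the argument list for pmax();
# pmax() gives the row-wise maximum, rowSums() the row-wise total
df$ratio <- do.call(pmax, c(num, na.rm = TRUE)) / rowSums(num, na.rm = TRUE)
df
```

Computing num before assigning ratio also keeps the new column out of the calculation.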