Calculate proportions within groups in a data frame in R - r

I have the following data frame in R. For this experiment I was testing the survival of cells at several times with 2 treatments, and 2 replicates for each treatment. I want to calculate the percentage of cells alive at each time for each treatment/replicate.
For example, for Treat 1 Rep 1 it would be 500/500, 470/500, 100/500, 20/500, for Treat 2 Rep 1 it would be 430/430, 420/430, 300/430, 100/430
Thanks!
x <- data.frame("treatment" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                "rep" = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
                "Time" = c(0, 30, 60, 180, 0, 30, 60, 180, 0, 30, 60, 180, 0, 30, 60, 180),
                "cells_alive" = c(500, 470, 100, 20, 476, 310, 99, 2, 430, 420, 300, 100, 489, 451, 289, 4))

We can group by 'treatment' and 'rep', then calculate the proportion ('prop') by dividing 'cells_alive' by the value of 'cells_alive' where 'Time' is 0:
library(dplyr)
x1 <- x %>%
  group_by(treatment, rep) %>%
  mutate(prop = cells_alive / cells_alive[Time == 0])
Output:
x1
# A tibble: 16 x 5
# Groups:   treatment, rep [4]
#    treatment   rep  Time cells_alive    prop
#        <dbl> <dbl> <dbl>       <dbl>   <dbl>
#  1         1     1     0         500 1
#  2         1     1    30         470 0.94
#  3         1     1    60         100 0.2
#  4         1     1   180          20 0.04
#  5         1     2     0         476 1
#  6         1     2    30         310 0.651
#  7         1     2    60          99 0.208
#  8         1     2   180           2 0.00420
#  9         2     1     0         430 1
# 10         2     1    30         420 0.977
# 11         2     1    60         300 0.698
# 12         2     1   180         100 0.233
# 13         2     2     0         489 1
# 14         2     2    30         451 0.922
# 15         2     2    60         289 0.591
# 16         2     2   180           4 0.00818
Or with match:
x %>%
  group_by(treatment, rep) %>%
  mutate(prop = cells_alive / cells_alive[match(0, Time)])
Or, if 'Time' is already ordered within each group, take the first element:
x %>%
  group_by(treatment, rep) %>%
  mutate(prop = cells_alive / first(cells_alive))
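A base R equivalent is also possible with ave; a sketch, assuming rows are ordered by 'Time' within each treatment/rep group so that the Time == 0 row comes first:

```r
# Sample data from the question
x <- data.frame(treatment = rep(1:2, each = 8),
                rep = rep(rep(1:2, each = 4), 2),
                Time = rep(c(0, 30, 60, 180), 4),
                cells_alive = c(500, 470, 100, 20, 476, 310, 99, 2,
                                430, 420, 300, 100, 489, 451, 289, 4))

# Divide each group's counts by its first value (the Time == 0 baseline,
# assuming rows are sorted by Time within each treatment/rep group)
x$prop <- x$cells_alive /
  ave(x$cells_alive, x$treatment, x$rep, FUN = function(v) v[1])
```

ave returns a vector the same length as its input, so the division stays row-wise with no grouping package needed.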

create a new variable based on other factors using R

So I have this dataframe, and I aim to add a new variable based on the others:
Qi Age c_gen
 1  56    13
 2  43    15
 5  31     6
 3  67     8
I want to create a variable called c_sep such that:
if Qi==1 or Qi==2, c_sep takes a random number between (c_gen + 6) and Age;
if Qi==3 or Qi==4, c_sep takes a random number between (Age - 15) and Age;
and 0 otherwise,
so my data would look something like this:
Qi Age c_gen c_sep
 1  56    13    24
 2  43    15    13
 5  31     6     0
 3  67     8    40
Any ideas, please?
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = TRUE)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1, 2)] <- apply(dat[dat$Qi %in% c(1, 2), ], 1,
                                        \(row) sample((row["c_gen"] + 6):row["Age"], 1))
dat$c_sep[dat$Qi %in% c(3, 4)] <- apply(dat[dat$Qi %in% c(3, 4), ], 1,
                                        \(row) sample((row["Age"] - 15):row["Age"], 1))
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing it more than twice, you might want to put this in a function, depending on your requirements.
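As a sketch of that suggestion (the helper's name is made up): wrapping the draw in a function also sidesteps a known sample() surprise, namely that sample(v, 1) draws from 1:v when the candidate range collapses to a single number v.

```r
# Illustrative helper: draw one integer from lo..hi inclusive.
# Indexing the candidate vector via sample.int stays correct even when
# lo == hi, where sample(lo:hi, 1) would instead draw from 1:lo.
sample_between <- function(lo, hi) {
  vals <- seq(lo, hi)
  vals[sample.int(length(vals), 1)]
}

sample_between(20, 20)  # always 20, whereas sample(20:20, 1) may return any of 1:20
```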
Try this:
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2,
                   sapply(1:nrow(df),
                          \(x) sample(seq(df$c_gen[x] + 6, df$Age[x]), 1)),
                   ifelse(df$Qi == 3 | df$Qi == 4,
                          sapply(1:nrow(df),
                                 \(x) sample(seq(df$Age[x] - 15, df$Age[x]), 1)),
                          0))
Output:
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
  ~Qi, ~Age, ~c_gen,
  1, 56, 13,
  2, 43, 15,
  5, 31, 6,
  3, 67, 8
)
df |>
  rowwise() |>
  mutate(c_sep = case_when(
    Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
    between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
    TRUE ~ 0
  )) |>
  ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)

Sum Blocks of Positive Values in R

I have a large data set, 150k rows, ~11 MB in size. Each row contains an hourly measure of profit, which can be positive, negative, or zero. I am trying to calculate a new variable equal to the profit of each positive "block." Hopefully this is self-explanatory in the data set below.
"Profit" is the input variable. I can get the next two columns but can't solve for "profit_block". Any help would be much appreciated!
dat <- data.frame(profit = c(20, 10, 5, 10, -20, -100, -40, 500, 27, -20),
                  indic_pos = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
                  cum_profit = c(20, 30, 35, 45, 0, 0, 0, 500, 527, 0),
                  profit_block = c(45, 45, 45, 45, 0, 0, 0, 527, 527, 0))
   profit indic_pos cum_profit profit_block
1      20         1         20           45
2      10         1         30           45
3       5         1         35           45
4      10         1         45           45
5     -20         0          0            0
6    -100         0          0            0
7     -40         0          0            0
8     500         1        500          527
9      27         1        527          527
10    -20         0          0            0
I've found the following post below very helpful, but I can't quite conform it to my need here. Thanks again.
Related URL: Assigning a value to each range of consecutive numbers with same sign in R
We can use rleid (from data.table) to create a group based on the sign of the column, i.e. adjacent elements with the same sign form a single group, and then get the max of the 'cum_profit':
library(dplyr)
library(data.table)
dat %>%
  group_by(grp = rleid(sign(profit))) %>%
  mutate(profit_block2 = max(cum_profit)) %>%
  ungroup %>%
  select(-grp)
Output:
# A tibble: 10 x 5
#    profit indic_pos cum_profit profit_block profit_block2
#     <dbl>     <dbl>      <dbl>        <dbl>         <dbl>
#  1     20         1         20           45            45
#  2     10         1         30           45            45
#  3      5         1         35           45            45
#  4     10         1         45           45            45
#  5    -20         0          0            0             0
#  6   -100         0          0            0             0
#  7    -40         0          0            0             0
#  8    500         1        500          527           527
#  9     27         1        527          527           527
# 10    -20         0          0            0             0
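A base R sketch of the same idea, building the run ids with cumsum over sign changes instead of rleid:

```r
# Input and intermediate column from the question
dat <- data.frame(profit = c(20, 10, 5, 10, -20, -100, -40, 500, 27, -20),
                  cum_profit = c(20, 30, 35, 45, 0, 0, 0, 500, 527, 0))

# Start a new group id wherever the sign of profit changes
grp <- cumsum(c(TRUE, diff(sign(dat$profit)) != 0))

# Each block's value is the max of cum_profit within its run
dat$profit_block2 <- ave(dat$cum_profit, grp, FUN = max)
```

This gives the same run grouping that rleid(sign(profit)) produces, with no extra packages.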

Copy value from row, based on match of another row

Having a rough time approaching this problem with a large dataset. Essentially there are multiple rows for the same item. However, only one of the items contains the required value. I need to copy that value to all matching items.
Eg. below, I need item 100 to have a cost of 1203 for every row.
df = data.frame("item" = c(100, 100, 100, 105, 105, 102, 102, 102),
                "cost" = c(1203, 0, 0, 66, 0, 1200, 0, 0))
> df
item cost
1 100 1203
2 100 0
3 100 0
4 105 66
5 105 0
6 102 1200
7 102 0
8 102 0
Like so:
df_wanted = data.frame("item" = c(100, 100, 100, 105, 105, 102, 102, 102),
"cost" = c(1203, 1203, 1203, 66, 66, 1200, 1200, 1200))
> df_wanted
item cost
1 100 1203
2 100 1203
3 100 1203
4 105 66
5 105 66
6 102 1200
7 102 1200
8 102 1200
Below is my attempt, which I think is an inefficient method:
for (row in 1:length(df$cost)) {
  if (df$cost[row] == 0) {
    df$cost[row] = df$cost[row - 1]
  }
}
Here is one option. After grouping by 'item', subset the 'cost' where 'cost' is not 0 and select the first element:
library(dplyr)
df %>%
  group_by(item) %>%
  mutate(cost = first(cost[cost != 0]))
# A tibble: 8 x 2
# Groups:   item [3]
#    item  cost
#   <dbl> <dbl>
# 1   100  1203
# 2   100  1203
# 3   100  1203
# 4   105    66
# 5   105    66
# 6   102  1200
# 7   102  1200
# 8   102  1200
Looks like you want to group by item and then replace 0 in cost with the last non-zero value. In each group, cummax(which(cost != 0)) will give the index of the last non-zero value.
library(dplyr)
df %>%
  group_by(item) %>%
  mutate(cost = cost[cummax(which(cost != 0))]) %>%
  ungroup()
# A tibble: 8 x 2
#    item  cost
#   <dbl> <dbl>
# 1   100  1203
# 2   100  1203
# 3   100  1203
# 4   105    66
# 5   105    66
# 6   102  1200
# 7   102  1200
# 8   102  1200
Base R equivalent is
transform(df, cost = ave(cost, item, FUN = function(x) x[cummax(which(x != 0))]))
What I ended up going with, after revisiting this problem, is a left_join(). That makes more sense to me intuitively, though it may not be the best solution.
The original DF below.
df = tibble("item" = as.factor(c(100, 100, 100, 105, 105, 102, 102, 102)),
            "cost" = c(1203, 0, 0, 66, 0, 0, 1200, 0))
> df
# A tibble: 8 x 2
item cost
<fct> <dbl>
1 100 1203
2 100 0
3 100 0
4 105 66
5 105 0
6 102 0
7 102 1200
8 102 0
Create an 'index' of item-value pairs:
df_index <- df %>%
  group_by(item) %>%
  arrange(-cost) %>%
  slice(1)
> df_index
# A tibble: 3 x 2
# Groups: item [3]
item cost
<fct> <dbl>
1 100 1203
2 102 1200
3 105 66
Finally, join the dataframes by item to fill in the empty row values.
df_joined <- df %>%
  left_join(df_index, by = "item")
> df_joined
# A tibble: 8 x 3
item cost.x cost.y
<fct> <dbl> <dbl>
1 100 1203 1203
2 100 0 1203
3 100 0 1203
4 105 66 66
5 105 0 66
6 102 0 1200
7 102 1200 1200
8 102 0 1200

In R, how can I look up a value in one column based on an interval in adjoining columns that includes values I want to match?

In R, I have a reference table (dataframe) with three columns. Below is an example:
reftable <- data.frame(
  X_lower = c(0, 101, 181, 231, 280, 300, 340, 390, 500),
  X_upper = c(100, 180, 230, 279, 299, 339, 389, 499, 600),
  Percentile = c(2, 3, 4, 6, 8, 11, 15, 20, 25))
# X_lower X_upper Percentile
# 0 100 2
# 101 180 3
# 181 230 4
# etc.
I have a separate dataframe, scores, with specific values for X, and I want to use the reference table to look up the percentile rank associated with each value.
scores <- data.frame(
  X = c(58, 127, 175, 245, 300, 90, 405, 284, 330),
  PercRank = NA)
# X PercRank
# 58 ?
# 127 ?
# 175 ?
# 245 ?
# etc.
I've tried using match or findInterval but can't find a solution. I've searched through existing questions; if this has been asked before, I must not be hitting on the right search terms.
You can try:
scores$PercRank <- sapply(scores$X, function(x) {
  i <- which(reftable$X_upper >= x)[1]
  reftable$Percentile[i]
})
> scores
X PercRank
1 58 2
2 127 3
3 175 3
4 245 6
5 300 11
6 90 2
7 405 20
8 284 8
9 330 11
Because reftable is ordered, you only need to find the first upper bound at or above your X.
1) sqldf An SQL left join can be used:
library(sqldf)
scores$PercRank <- NULL
sqldf("select s.X, r.Percentile as PercRank
       from scores as s
       left join reftable as r on s.X between r.X_lower and r.X_upper")
giving:
X PercRank
1 58 2
2 127 3
3 175 3
4 245 6
5 300 11
6 90 2
7 405 20
8 284 8
9 330 11
2) findInterval A base alternative is findInterval:
transform(scores, percRank = with(reftable, Percentile[ findInterval(X, X_lower) ]))
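To see why this works: findInterval(X, X_lower) returns, for each X, the index of the last lower bound that does not exceed it, so indexing Percentile with that result picks the matching row (assuming every X falls inside some interval):

```r
# Lower bounds from the question's reftable (must be sorted ascending)
X_lower <- c(0, 101, 181, 231, 280, 300, 340, 390, 500)

# For each X, the index of the rightmost lower bound <= X
findInterval(c(58, 127, 300), X_lower)
#> [1] 1 2 6
```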
1) An option would be foverlaps from data.table, using the keyed scores1 and reftable created in the 'data' section below:
library(data.table)
scores$PercRank <- foverlaps(scores1, reftable)[order(rn)]$Percentile
scores$rn <- NULL
scores
# X PercRank
#1 58 2
#2 127 3
#3 175 3
#4 245 6
#5 300 11
#6 90 2
#7 405 20
#8 284 8
#9 330 11
2) Or use a non-equi join
setDT(scores)[reftable, PercRank := Percentile, on = .(X >= X_lower, X <= X_upper)]
scores
# X PercRank
#1: 58 2
#2: 127 3
#3: 175 3
#4: 245 6
#5: 300 11
#6: 90 2
#7: 405 20
#8: 284 8
#9: 330 11
3) Or with fuzzyjoin
library(fuzzyjoin)
library(dplyr)
fuzzy_left_join(scores, reftable, by = c("X" = "X_lower", "X" = "X_upper"),
match_fun = list(`>=`, `<=`)) %>%
select(X, Percentile)
# X Percentile
#1 58 2
#2 127 3
#3 175 3
#4 245 6
#5 300 11
#6 90 2
#7 405 20
#8 284 8
#9 330 11
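4) On dplyr 1.1.0 or later, the same non-equi join can also be written without fuzzyjoin via join_by(); a sketch:

```r
library(dplyr)  # join_by() with inequality conditions needs dplyr >= 1.1.0

reftable <- data.frame(
  X_lower = c(0, 101, 181, 231, 280, 300, 340, 390, 500),
  X_upper = c(100, 180, 230, 279, 299, 339, 389, 499, 600),
  Percentile = c(2, 3, 4, 6, 8, 11, 15, 20, 25))
scores <- data.frame(X = c(58, 127, 175, 245, 300, 90, 405, 284, 330))

# Keep rows where X falls inside the [X_lower, X_upper] interval
res <- left_join(scores, reftable,
                 by = join_by(X >= X_lower, X <= X_upper)) %>%
  select(X, PercRank = Percentile)
```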
data
scores <- data.frame(
  X = c(58, 127, 175, 245, 300, 90, 405, 284, 330))
scores$rn <- seq_len(nrow(scores))
scores1 <- data.table(X_lower = scores$X, X_upper = scores$X, rn = scores$rn)
setDT(reftable)  # reftable must be a data.table before keys can be set
setkeyv(scores1, c("X_lower", "X_upper"))
setkeyv(reftable, c("X_lower", "X_upper"))

Conditional value filtering

Sample data
library(dplyr)
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  day = c(3, 8, 14, 29, 4, 6, 8, 1, 4, 9),
  value = c(75, 101, 115, 120, 110, 106, 122, 100, 128, 140))
The idea behind the question:
Select the smallest day for each ID subject and multiply the value feature by 1.3 (ID 1 - day 3 - value 75, ID 2 - day 4 - value 110, ID 3 - day 1 - value 100).
Then compare that newly created value with other values that have the same ID, but have different day number.
For example:
The smallest day number for ID 1 is 3. Then multiply the value of that row by 1.3 (75 * 1.3 = 97.5). Compare the newly created value (97.5) with the values (101, 115, 120) that have the same ID of 1, answering TRUE or FALSE for whether each value is greater than the new value.
Repeat that as well for ID 2 and 3.
library(dplyr)
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  day = c(3, 8, 14, 29, 4, 6, 8, 1, 4, 9),
  value = c(75, 101, 115, 120, 110, 106, 122, 100, 128, 140))
df %>%
  group_by(ID) %>%
  mutate(v = value[day == min(day)] * 1.3,
         flag = value > v) %>%
  ungroup()
# # A tibble: 10 x 5
# ID day value v flag
# <dbl> <dbl> <dbl> <dbl> <lgl>
# 1 1 3 75 97.5 FALSE
# 2 1 8 101 97.5 TRUE
# 3 1 14 115 97.5 TRUE
# 4 1 29 120 97.5 TRUE
# 5 2 4 110 143 FALSE
# 6 2 6 106 143 FALSE
# 7 2 8 122 143 FALSE
# 8 3 1 100 130 FALSE
# 9 3 4 128 130 FALSE
#10 3 9 140 130 TRUE
If you want to flag IDs with at least one TRUE flag you can create flag2 like this:
df %>%
  group_by(ID) %>%
  mutate(v = value[day == min(day)] * 1.3,
         flag = value > v,
         flag2 = max(flag)) %>%
  ungroup()
# # A tibble: 10 x 6
# ID day value v flag flag2
# <dbl> <dbl> <dbl> <dbl> <lgl> <int>
# 1 1 3 75 97.5 FALSE 1
# 2 1 8 101 97.5 TRUE 1
# 3 1 14 115 97.5 TRUE 1
# 4 1 29 120 97.5 TRUE 1
# 5 2 4 110 143 FALSE 0
# 6 2 6 106 143 FALSE 0
# 7 2 8 122 143 FALSE 0
# 8 3 1 100 130 FALSE 1
# 9 3 4 128 130 FALSE 1
#10 3 9 140 130 TRUE 1
Or extract the IDs as a vector:
df %>%
  group_by(ID) %>%
  mutate(v = value[day == min(day)] * 1.3,
         flag = value > v) %>%
  ungroup() -> df2

df2 %>%
  filter(flag == TRUE) %>%
  distinct(ID) %>%
  pull(ID)
#[1] 1 3
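A base R sketch of the same logic with ave, assuming the rows can be ordered so the earliest day comes first within each ID:

```r
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  day = c(3, 8, 14, 29, 4, 6, 8, 1, 4, 9),
  value = c(75, 101, 115, 120, 110, 106, 122, 100, 128, 140))

df <- df[order(df$ID, df$day), ]  # earliest day first within each ID

# Each ID's threshold: first (earliest-day) value times 1.3, recycled by ave
df$v <- ave(df$value, df$ID, FUN = function(v) v[1] * 1.3)
df$flag <- df$value > df$v

unique(df$ID[df$flag])  # IDs with at least one TRUE flag
```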