I have a data set that looks like this:
Sample data can be generated with:
ID <-c("1", "1", "1","1","2", "2")
Test_date <-c(-15, -8,7, 12,-3,2)
Test_Result<-c(100, 98, 78,99, 65,89)
Sample.data <- data.frame(ID, Test_date, Test_Result)
I need to use the Test_Result at the largest negative Test_date as the baseline. Progress is then calculated as Test_Result divided by the baseline Test_Result. What should I do?
The final result should look something like this:
Many Thanks.
Try it this way:
library(tidyverse)
df %>%
  group_by(ID) %>%
  # keep the positive dates plus the largest negative date per ID
  filter(Test_date > 0 | Test_date == max(Test_date[Test_date < 0])) %>%
  # after the filter, the smallest remaining Test_date is the baseline row
  mutate(progress = ifelse(Test_date > 0,
                           Test_Result / Test_Result[which.min(Test_date)],
                           NA_real_)) %>%
  right_join(df) %>%
  arrange(ID, Test_date) %>%
  ungroup()
Joining, by = c("ID", "Test_date", "Test_Result")
# A tibble: 6 x 4
ID Test_date Test_Result progress
<chr> <dbl> <dbl> <dbl>
1 1 -15 100 NA
2 1 -8 98 NA
3 1 7 78 0.796
4 1 12 99 1.01
5 2 -3 65 NA
6 2 2 89 1.37
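If you'd rather avoid the filter/right_join round trip, the same idea fits in a single mutate. This is an untested sketch of the same approach, assuming every ID has at least one negative Test_date, as in the sample data:
df %>%
  group_by(ID) %>%
  # the baseline is the Test_Result at the largest negative Test_date
  mutate(baseline = Test_Result[Test_date == max(Test_date[Test_date < 0])],
         progress = ifelse(Test_date > 0, Test_Result / baseline, NA_real_)) %>%
  ungroup()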
In my view, this kind of "by group" operation is most easily accomplished with the dplyr or data.table packages:
ID <-c("1", "1", "1","2", "2")
Test_date <-c(-15, -8,7, -3,2)
Test_Result<-c(100, 98, 78,65,89)
Sample.data <- data.frame(ID, Test_date, Test_Result)
big_neg <- function(x) which(x == max(x[x < 0]))  # index of the largest negative value
library(dplyr)
Sample.data %>%
  group_by(ID) %>%
  mutate(Progress = Test_Result / Test_Result[big_neg(Test_date)])
#> # A tibble: 5 x 4
#> # Groups: ID [2]
#> ID Test_date Test_Result Progress
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 -15 100 1.02
#> 2 1 -8 98 1
#> 3 1 7 78 0.796
#> 4 2 -3 65 1
#> 5 2 2 89 1.37
library(data.table)
dat <- data.table(Sample.data)
dat[, Progress := Test_Result / Test_Result[big_neg(Test_date)], by=ID][]
#> ID Test_date Test_Result Progress
#> 1: 1 -15 100 1.0204082
#> 2: 1 -8 98 1.0000000
#> 3: 1 7 78 0.7959184
#> 4: 2 -3 65 1.0000000
#> 5: 2 2 89 1.3692308
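One caveat (my addition, not part of the original answer): if some ID had no negative Test_date at all, max(x[x < 0]) would warn and return -Inf, and the subsetting would then fail inside mutate. A hedged, safer drop-in replacement for big_neg:
big_neg_safe <- function(x) {
  neg <- which(x < 0)                         # positions of the negative values
  if (length(neg) == 0) return(NA_integer_)   # no baseline -> Progress becomes NA
  neg[which.max(x[neg])]                      # position of the largest negative value
}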
Let's say I have a dataframe like this:
df <- tibble(
  ID = c(1, 1, 1, 1, 1),
  v1 = c(3, 5, 1, 0, 1),
  v2 = c(10, 6, 1, 20, 23),
  Time = as.POSIXct(c("1900-01-01 10:00:00", "1900-01-01 11:00:00",
                      "1900-01-01 13:00:00", "1900-01-01 16:00:00",
                      "1900-01-01 20:00:00"))
) %>% group_by(ID)
# A tibble: 5 x 4
# Groups: ID [1]
ID v1 v2 Time
<dbl> <dbl> <dbl> <dttm>
1 1 3 10 1900-01-01 10:00:00
2 1 5 6 1900-01-01 11:00:00
3 1 1 1 1900-01-01 13:00:00
4 1 0 20 1900-01-01 16:00:00
5 1 1 23 1900-01-01 20:00:00
In words, this is a simple time series for a single ID with two values, v1 and v2, per time point.
As is quite common in machine learning, I want to aggregate the last n timesteps into one feature vector. Each previous timestep should carry a time reference in hours indicating when that data point occurred. For the first row, where no previous timestep is available, the data should be filled with zeros.
Let's make an example with n = 2, that is, I want to aggregate the current timestep (t2) and the previous one (t1) together:
# A tibble: 5 x 6
ID v1_t1 v2_t1 time_t1 v1_t2 v2_t2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 NA 3 10
2 1 3 10 1 5 6
3 1 5 6 2 1 1
4 1 1 1 3 0 20
5 1 0 20 4 1 23
I want to keep this as generic as possible, so that both n and the number of data columns can change. Any idea how to do this?
Thanks :)
Using dplyr::lag and dplyr::across you could do:
library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)
df %>%
  group_by(ID) %>%
  mutate(time_t1 = lubridate::hour(Time) - lag(lubridate::hour(Time))) %>%
  mutate(across(c(v1, v2), .fns = list(t2 = ~.x, t1 = ~lag(.x, default = 0)))) %>%
  select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups: ID [1]
#> ID time_t1 v1_t2 v1_t1 v2_t2 v2_t1
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 3 0 10 0
#> 2 1 1 5 3 6 10
#> 3 1 2 1 5 1 6
#> 4 1 3 0 1 20 1
#> 5 1 4 1 0 23 20
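One thing to be aware of (my note, not the answerer's): lubridate::hour() extracts only the hour of day, so the difference would be wrong if consecutive timestamps crossed midnight. A sketch using difftime instead, assuming Time is POSIXct as in the example:
df %>%
  group_by(ID) %>%
  mutate(time_t1 = as.numeric(difftime(Time, lag(Time), units = "hours")))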
UPDATE: Here is a more generic approach which uses function factories to create lists of functions that can then be passed to the .fns argument of across. I haven't tested the more general case, but it should work for any n (number of lags to include) and for any number of data columns.
library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)
fun_factory1 <- function(n) {
  function(x) {
    lubridate::hour(x) - lag(lubridate::hour(x), n = n)
  }
}
fun_factory2 <- function(n) {
  function(x) {
    lag(x, n = n, default = 0)
  }
}
n <- 2
fns1 <- lapply(seq(n - 1), fun_factory1)
names(fns1) <- paste0("t", seq(n - 1))
fns2 <- lapply(seq(n) - 1, fun_factory2)
names(fns2) <- paste0("t", seq(n))
df %>%
  group_by(ID) %>%
  mutate(across(Time, .fns = fns1)) %>%
  mutate(across(c(v1, v2), .fns = fns2)) %>%
  select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups: ID [1]
#> ID Time_t1 v1_t1 v1_t2 v2_t1 v2_t2
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 3 0 10 0
#> 2 1 1 5 3 6 10
#> 3 1 2 1 5 1 6
#> 4 1 3 0 1 20 1
#> 5 1 4 1 0 23 20
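Note that the data columns come out with t1 as the current step and t2 as the lagged one, which is inverted relative to the desired output in the question; reversing the names (an untested one-line tweak) restores that convention:
names(fns2) <- paste0("t", rev(seq(n)))  # t2 = current step, t1 = one step back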
The sample data frame:
no <- rep(1:5, each=2)
type <- rep(LETTERS[1:2], times=5)
set.seed(4)
value <- round(runif(10, 10, 30))
df <- data.frame(no, type, value)
df
no type value
1 1 A 22
2 1 B 10
3 2 A 16
4 2 B 16
5 3 A 26
6 3 B 15
7 4 A 24
8 4 B 28
9 5 A 29
10 5 B 11
Now what I want is to calculate the percentage value of each type (A or B) and create separate columns. The desired output is something like this:
no pct_A pct_B total_value
1 1 68.75000 31.25000 32
2 2 50.00000 50.00000 32
3 3 63.41463 36.58537 41
4 4 46.15385 53.84615 52
5 5 72.50000 27.50000 40
What I have tried so far (this gives the right output, but the process seems very sub-optimal):
df %>%
  group_by(no) %>%
  mutate(total_value = sum(value)) -> df
df %>%
  mutate(pct_A = ifelse(type == 'A', (value / total_value) * 100, 0),
         pct_B = ifelse(type == 'B', (value / total_value) * 100, 0)) %>%
  group_by(no) %>%
  summarise(pct_A = sum(pct_A),
            pct_B = sum(pct_B)) %>%
  ungroup() %>%
  merge(df) %>%
  distinct(no, .keep_all = T) %>%
  select(-type, -value)
Is there any better way to do that? Especially using dplyr?
I looked for other answers too, but none helped. This one came closest:
R Create new column of values based on the factor levels of another column
You could do it in base R using aggregate. Note that this returns proportions rather than percentages:
do.call(data.frame, aggregate(value ~ no, df, \(x) c(proportions(x), sum(x)))) |>
  setNames(c('no', 'pct_A', 'pct_B', 'total_value'))
# no pct_A pct_B total_value
# 1 1 0.6875000 0.3125000 32
# 2 2 0.5000000 0.5000000 32
# 3 3 0.6341463 0.3658537 41
# 4 4 0.4615385 0.5384615 52
# 5 5 0.7250000 0.2750000 40
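To match the percentage scale of the desired output, you could simply scale the proportions; a small tweak of the same call:
do.call(data.frame, aggregate(value ~ no, df, \(x) c(proportions(x) * 100, sum(x)))) |>
  setNames(c('no', 'pct_A', 'pct_B', 'total_value'))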
For each no we can calculate the sum and the ratio (prop.table(value) is value / sum(value)), then reshape the data to wide format.
library(dplyr)
library(tidyr)
df %>%
  group_by(no) %>%
  mutate(total_value = sum(value),
         value = prop.table(value) * 100) %>%
  ungroup() %>%
  pivot_wider(names_from = type, values_from = value, names_prefix = 'pct_')
# no total_value pct_A pct_B
# <int> <dbl> <dbl> <dbl>
#1 1 32 68.8 31.2
#2 2 32 50 50
#3 3 41 63.4 36.6
#4 4 52 46.2 53.8
#5 5 40 72.5 27.5
Here are two more ways to do this.
We could use purrr::map_dfc. However, setting up the correct column names is kind of cumbersome:
library(dplyr)
library(purrr)
df %>%
  group_by(no) %>%
  summarise(total_value = sum(value),
            map_dfc(unique(type) %>% set_names(., paste0("pct_", .)),
                    ~ sum((type == .x) * value) / total_value * 100))
#> # A tibble: 5 x 4
#> no total_value pct_A pct_B
#> <int> <dbl> <dbl> <dbl>
#> 1 1 32 68.8 31.2
#> 2 2 32 50 50
#> 3 3 41 63.4 36.6
#> 4 4 52 46.2 53.8
#> 5 5 40 72.5 27.5
Alternatively we can use dplyover::over (disclaimer: I'm the maintainer), which allows us to create names on the fly in an across-like way:
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
  group_by(no) %>%
  summarise(total_value = sum(value),
            over(dist_values(type),  # alternatively `unique(type)`
                 ~ sum((type == .x) * value) / total_value * 100,
                 .names = "pct_{x}"))
#> # A tibble: 5 x 4
#> no total_value pct_A pct_B
#> <int> <dbl> <dbl> <dbl>
#> 1 1 32 68.8 31.2
#> 2 2 32 50 50
#> 3 3 41 63.4 36.6
#> 4 4 52 46.2 53.8
#> 5 5 40 72.5 27.5
Created on 2021-09-17 by the reprex package (v2.0.1)
Performance-wise, both approaches should be faster than data-rectangling approaches such as pivot_wider (but I haven't tested this specific scenario).
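If you want to check that claim yourself, here is a minimal benchmarking sketch (my addition, assuming the bench package is installed; the two pipelines are copied from the answers above):
library(bench)
bench::mark(
  pivot_wider = df %>%
    group_by(no) %>%
    mutate(total_value = sum(value),
           value = prop.table(value) * 100) %>%
    ungroup() %>%
    tidyr::pivot_wider(names_from = type, values_from = value,
                       names_prefix = "pct_"),
  over = df %>%
    group_by(no) %>%
    summarise(total_value = sum(value),
              over(dist_values(type),
                   ~ sum((type == .x) * value) / total_value * 100,
                   .names = "pct_{x}")),
  check = FALSE  # the two results differ in column order
)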
I have one data frame and one lookup table. What I want is to compare df_dat$value with df_lookup$threshold. If a value falls within the threshold range, create a new column transfer in df_dat whose values are linearly interpolated from the transfer column in df_lookup.
library(dplyr)
df_lookup <- tribble(
  ~threshold, ~transfer,
  0, 0,
  100, 15,
  200, 35
)
df_lookup
#> # A tibble: 3 x 2
#> threshold transfer
#> <dbl> <dbl>
#> 1 0 0
#> 2 100 15
#> 3 200 35
df_dat <- tribble(
  ~date, ~value,
  "2009-01-01", 0,
  "2009-01-02", 30,
  "2009-01-06", 105,
  "2009-01-09", 150
)
df_dat
#> # A tibble: 4 x 2
#> date value
#> <chr> <dbl>
#> 1 2009-01-01 0
#> 2 2009-01-02 30
#> 3 2009-01-06 105
#> 4 2009-01-09 150
I can do it manually like this, but I'm wondering whether there is an automatic way based on the values in the df_lookup table. Thank you.
df_dat %>%
  mutate(transfer = case_when(value > 0 & value < 100 ~ 0 + (value - 0) * (15 - 0) / (100 - 0),
                              value >= 100 & value < 200 ~ 15 + (value - 100) * (35 - 15) / (200 - 100),
                              TRUE ~ 0))
#> # A tibble: 4 x 3
#> date value transfer
#> <chr> <dbl> <dbl>
#> 1 2009-01-01 0 0
#> 2 2009-01-02 30 4.5
#> 3 2009-01-06 105 16
#> 4 2009-01-09 150 25
You can use approx:
df_dat %>% mutate(transfer = with(df_lookup, approx(threshold, transfer, value))$y)
## A tibble: 4 x 3
# date value transfer
# <chr> <dbl> <dbl>
#1 2009-01-01 0 0
#2 2009-01-02 30 4.5
#3 2009-01-06 105 16
#4 2009-01-09 150 25
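Note that approx() returns NA for values outside the threshold range (the default is rule = 1); if you would rather clamp to the nearest endpoint, rule = 2 does that:
df_dat %>%
  mutate(transfer = with(df_lookup, approx(threshold, transfer, value, rule = 2))$y)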
Another option is a data.table rolling join: m holds the slope from each threshold to the next, and roll = Inf matches each value to the largest threshold at or below it, so x.m * (i.value - x.threshold) + x.transfer reconstructs the linear interpolation (giving 0, 4.5, 16, 25 for the sample values):
df_lookup[, m := (transfer - shift(transfer, -1L)) / (threshold - shift(threshold, -1L))]
df_dat[, tx :=
  df_lookup[df_dat, on = c("threshold" = "value"), roll = Inf,
            x.m * (i.value - x.threshold) + x.transfer]
]
data:
library(data.table)
df_lookup <- fread("threshold, transfer
0, 0
100, 15
200, 35")
df_dat <- fread('date, value
"2009-01-01", 0
"2009-01-02", 30
"2009-01-06", 105
"2009-01-09", 150')
I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. In particular,
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1)))
doesn't work, because ifelse() returns a result the same length as its test (length one here), so the c(1, 0) and c(0, 1) alternatives are truncated to their first elements.
The version below does work, but it is clunky, with an intermediate summary data frame. Can we improve on this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
  data_frame(pair = rep(1:3, each = 2),
             unit = rep(1:2, 3),
             score = rnorm(6))
# figure out who won in each pair
summary_df <-
  dat %>%
  group_by(pair) %>%
  summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
  left_join(dat, summary_df, "pair") %>%
  mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
Maybe related: Weird group_by + mutate + which.max behavior
You could do:
dat %>%
  group_by(pair) %>%
  mutate(won = score == max(score),
         winner = unit[won == TRUE])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1
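An equivalent formulation closer to the question's summary_df logic, computing the winner directly with which.max (an untested variant of the above):
dat %>%
  group_by(pair) %>%
  mutate(winner = unit[which.max(score)],
         won = as.numeric(unit == winner))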
Using rank (within a pair the loser ranks 1 and the winner 2, hence the - 1):
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector with 'won alternatives' :
dat %>%
  group_by(pair) %>%
  mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
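To see why the indexing works: the comparison is coerced to 0 or 1, shifting which two elements of c(0, 1, 0) get picked:
c(0, 1, 0)[1:2 + FALSE]  # c(0, 1): unit 2 won
c(0, 1, 0)[1:2 + TRUE]   # c(1, 0): unit 1 won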
I would like to replace NAs in numeric columns using some variation of mutate_if and replace_na if possible, but can't figure out the syntax.
df <- tibble(
  first = c("a", NA, "b"),
  second = c(NA, 2, NA),
  third = c(10, NA, NA)
)
#> # A tibble: 3 x 3
#> first second third
#> <chr> <dbl> <dbl>
#> 1 a NA 10.0
#> 2 <NA> 2.00 NA
#> 3 b NA NA
Final result should be:
#> # A tibble: 3 x 3
#> first second third
#> <chr> <dbl> <dbl>
#> 1 a 0 10.0
#> 2 <NA> 2.00 0
#> 3 b 0 0
My attempts look like:
df %>% mutate_if(is.numeric , replace_na(., 0) )
#>Error: is_list(replace) is not TRUE
df %>% mutate_if(is.numeric , replace_na, replace = 0)
# A tibble: 3 x 3
# first second third
# <chr> <dbl> <dbl>
#1 a 0 10.0
#2 NA 2.00 0
#3 b 0 0
The mutate_if-based solution mentioned in another answer relies on a superseded function in dplyr. The suggested alternative is the across() function. Here is a solution using it:
df %>%
  mutate(
    across(where(is.numeric), ~ replace_na(.x, 0))
  )
# A tibble: 3 × 3
first second third
<chr> <dbl> <dbl>
1 a 0 10
2 NA 2 0
3 b 0 0
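An equivalent alternative, if you prefer dplyr::coalesce over tidyr::replace_na (purely a matter of taste):
df %>%
  mutate(across(where(is.numeric), ~ coalesce(.x, 0)))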