dplyr: divide all values in group by group's first value - r

My df looks something like this:
ID Obs Value
1 1 26
1 2 13
1 3 52
2 1 1,5
2 2 30
Using dplyr, I want to add an additional column Col, which is the result of dividing every value in the column Value by the group's first value in that column.
ID Obs Value Col
1 1 26 1
1 2 13 0,5
1 3 52 2
2 1 1,5 1
2 2 30 20
How do I do that?

After grouping by 'ID', use mutate to create a new column by dividing the 'Value' by the first of 'Value'
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Col = Value/first(Value))
If the first 'Value' is 0 and we don't want to use it, then subset the 'Value' with a logical expression and then take the first of that
df1 %>%
group_by(ID) %>%
mutate(Col = Value/first(Value[Value != 0]))
Or in base R
df1$Col <- with(df1, Value/ave(Value, ID, FUN = function(x) x[1]))
NOTE: The comma in 'Value' suggests it is a character column. If so, first replace the comma with a decimal point (.), convert the column to numeric, and then do the division. The conversion can also be done while reading the data.
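A minimal sketch of that conversion, assuming 'Value' arrived as character with decimal commas (the data frame below is a hypothetical reconstruction of the question's data):

```r
library(dplyr)

# Hypothetical input: 'Value' read as character with decimal commas
df1 <- data.frame(ID = c(1, 1, 1, 2, 2),
                  Obs = c(1, 2, 3, 1, 2),
                  Value = c("26", "13", "52", "1,5", "30"),
                  stringsAsFactors = FALSE)

res <- df1 %>%
  # replace the decimal comma with a point, then convert to numeric
  mutate(Value = as.numeric(sub(",", ".", Value, fixed = TRUE))) %>%
  group_by(ID) %>%
  mutate(Col = Value / first(Value)) %>%
  ungroup()

res$Col
```

When reading from a file, read.table() and read.csv2() accept dec = "," so the column is parsed as numeric from the start.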

Or, without creating an additional column:
library(tidyverse)
df = data.frame(ID=c(1,1,1,2,2), Obs=c(1,2,3,1,2), Value=c(26, 13, 52, 1.5, 30))
df %>%
group_by(ID) %>%
mutate_at('Value', ~./first(.))
#> # A tibble: 5 x 3
#> # Groups: ID [2]
#> ID Obs Value
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 2 0.5
#> 3 1 3 2
#> 4 2 1 1
#> 5 2 2 20
### OR ###
df %>%
group_by(ID) %>%
mutate_at('Value', function(x) x/first(x))
#> # A tibble: 5 x 3
#> # Groups: ID [2]
#> ID Obs Value
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 2 0.5
#> 3 1 3 2
#> 4 2 1 1
#> 5 2 2 20
Created on 2020-01-04 by the reprex package (v0.3.0)
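In dplyr 1.0.0 and later, mutate_at() is superseded by across(). A sketch of the same in-place division with across(), using the same data as above:

```r
library(dplyr)

df <- data.frame(ID = c(1, 1, 1, 2, 2),
                 Obs = c(1, 2, 3, 1, 2),
                 Value = c(26, 13, 52, 1.5, 30))

res <- df %>%
  group_by(ID) %>%
  # overwrite Value with Value divided by the group's first Value
  mutate(across(Value, ~ .x / first(.x))) %>%
  ungroup()

res$Value
```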

Related

Filtering every positive value for every negative in R

I have a dataset with financial data. Sometimes, a product gets refunded, resulting in a negative count of the product (so the money gets returned). I want to conditionally filter these rows out of the dataset.
Example:
library(tidyverse)
set.seed(1)
df <- tibble(
count = sample(c(-1,1),80,replace = TRUE,prob=c(.2,.8)),
id = rep(1:4,20)
)
df %>%
group_by(id) %>%
summarize(total = sum(count))
# A tibble: 4 x 2
id total
<int> <dbl>
1 1 10
2 2 14
3 3 16
4 4 10
id = 1 has 15 positive counts and 5 negative ones (15 - 5 = 10), so I want to keep 10 of the positive rows for id = 1.
id = 2 has 17 positive counts and 3 negative ones (17 - 3 = 14), so I want to keep 14 of the positive rows for id = 2.
In the end, this condition should hold: nrow(df) == sum(df$count)
Unfortunately, a filtering join such as anti_join() will remove all the rows. For some reason I cannot think of another option to filter the tibble.
Thanks for helping me!
You can "uncount" using the total column to get the number of repeats of each row.
df %>%
group_by(id) %>%
summarize(total = sum(count)) %>%
uncount(total) %>%
mutate(count = 1)
#> # A tibble: 50 x 2
#> id count
#> <int> <dbl>
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> # ... with 40 more rows
Created on 2022-10-21 with reprex v2.0.2
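If the original rows (and any other columns they carry) need to survive, a filter-based variant is one option. The sketch below uses a small hypothetical tibble, df_small, rather than the seeded data, and assumes it does not matter which positive rows are dropped (here the last ones in each group go) and that every id has a non-negative net count.

```r
library(dplyr)

# Hypothetical data: id 1 nets to 2, id 2 nets to 1
df_small <- tibble(
  id    = c(1, 1, 1, 1, 2, 2, 2),
  count = c(1, -1, 1, 1, 1, 1, -1)
)

kept <- df_small %>%
  group_by(id) %>%
  mutate(total = sum(count)) %>%      # net count per id
  filter(count > 0) %>%               # drop the refund rows
  filter(row_number() <= total) %>%   # keep as many positives as the net count
  ungroup() %>%
  select(-total)

nrow(kept) == sum(df_small$count)
```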

Subsetting first Observation per id and date in r

I want to subset the first date per observation per id. For example, just get the rows for the first date in which observations A and B appeared. If we have the following dataset:
df =
id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A
the outcome should look like this:
df =
id date Observation
1 3 A
1 2 B
2 5 B
2 3 A
thanks
If you don't mind the order being different, it can be accomplished using dplyr by grouping then slicing:
library(tidyverse)
df <- read_table("id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A")
df %>%
group_by(id, Observation) %>%
slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, Observation [4]
#> id date Observation
#> <dbl> <dbl> <chr>
#> 1 1 3 A
#> 2 1 2 B
#> 3 2 3 A
#> 4 2 5 B
Created on 2021-04-12 by the reprex package (v1.0.0)
library(dplyr)
df %>%
group_by(id, Observation) %>%
slice(1) %>%
ungroup()
# OR
df %>%
group_by(id, Observation) %>%
filter(row_number() == 1) %>%
ungroup()
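Note that slice(1) and row_number() == 1 take the first row in the data's current order, which matches the earliest date only if the rows happen to be ordered that way. A sketch that picks the minimum date explicitly with slice_min() (dplyr 1.0.0 or later), on a hypothetical reordering of the question's data:

```r
library(dplyr)

# Same rows as in the question, but deliberately not sorted by date
df <- tibble(
  id          = c(1, 1, 1, 2, 2, 2),
  date        = c(3, 8, 2, 5, 9, 3),
  Observation = c("A", "B", "B", "B", "A", "A")
)

first_dates <- df %>%
  group_by(id, Observation) %>%
  # one row per group: the one with the smallest date
  slice_min(date, n = 1, with_ties = FALSE) %>%
  ungroup()

first_dates
```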

How to subtract two columns using index in tidyverse

I have a dataframe
df <- tibble(row1= c(1,2,3,4,5),
row2=c(2,3,4,5,6))
How do I subtract the two columns using their index (not their names)? I would like this to work:
df %>% mutate(diff= select(1)-select(2))
But the universe is not on my side....
select needs a data argument, as its usage is
select(.data, ...)
Also, as select returns a data.frame/tibble as output, we can get the vector with [[
library(dplyr)
df %>%
mutate(diff = select(., 1)[[1]] - select(., 2)[[1]])
-output
# A tibble: 5 x 3
# row1 row2 diff
# <dbl> <dbl> <dbl>
#1 1 2 -1
#2 2 3 -1
#3 3 4 -1
#4 4 5 -1
#5 5 6 -1
or instead use pull to return the vector
df %>%
mutate(diff = pull(., 1) - pull(., 2))
What about using select like below?
> df %>% mutate(diff = do.call(`-`,select(.,1:2)))
# A tibble: 5 x 3
row1 row2 diff
<dbl> <dbl> <dbl>
1 1 2 -1
2 2 3 -1
3 3 4 -1
4 4 5 -1
5 5 6 -1
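Another possibility, sketched here under the assumption that the magrittr pipe (%>%) is in use, is to index the piped data frame directly with [[, since `.` refers to the whole tibble:

```r
library(dplyr)

df <- tibble(row1 = c(1, 2, 3, 4, 5),
             row2 = c(2, 3, 4, 5, 6))

res <- df %>%
  # .[[i]] extracts the i-th column of the piped tibble as a vector
  mutate(diff = .[[1]] - .[[2]])

res$diff
```

The `.` placeholder is specific to %>%; it is not available with the native |> pipe.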

filter all rows smaller than x with all following values also smaller than x

I am looking for a concise way to filter a data.frame for all rows smaller than a value x with all following values also smaller than x. I found a way but it is somewhat verbose. I tried to do it with dplyr::cumall and cumany, but was not able to figure it out.
Here is a small reprex including my actual approach. Ideally I would only have one filter line or mutate + filter, but with the current approach it takes two rounds of mutate/filter.
library(dplyr)
# Original data
tbl <- tibble(value = c(100,100,100,10,10,5,10,10,5,5,5,1,1,1,1))
# desired output:
# keep only rows where value is at most 5 and ...
# no value after that is larger than 5
tbl %>%
mutate(id = row_number()) %>%
filter(value <= 5) %>%
mutate(id2 = lead(id, default = max(id) + 1) - id) %>%
filter(id2 == 1)
#> # A tibble: 7 x 3
#> value id id2
#> <dbl> <int> <dbl>
#> 1 5 9 1
#> 2 5 10 1
#> 3 5 11 1
#> 4 1 12 1
#> 5 1 13 1
#> 6 1 14 1
#> 7 1 15 1
Created on 2020-04-20 by the reprex package (v0.3.0)
You could combine cummin with a cummax that is computed on the reversed vector and then reversed back:
tbl %>% filter(rev(cummax(rev(value))) <= 5 & cummin(value) <= 5)
# A tibble: 7 x 1
value
<dbl>
1 5
2 5
3 5
4 1
5 1
6 1
7 1
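To see why this works, it can help to look at the intermediate vector. In the sketch below, suffix_max (a name introduced here for illustration) holds, for each position, the largest value from that position to the end:

```r
value <- c(100, 100, 100, 10, 10, 5, 10, 10, 5, 5, 5, 1, 1, 1, 1)

# reverse, take the running maximum, reverse back:
# element i is max(value[i:length(value)])
suffix_max <- rev(cummax(rev(value)))

# a row qualifies when nothing from it onwards exceeds 5
keep <- suffix_max <= 5

which(keep)
value[keep]
```

Because suffix_max[i] is never smaller than value[i], the condition suffix_max <= 5 already implies value <= 5 for the kept rows.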
A base R option is to use subset + rle
tblout <- subset(tbl,
with(rle(value<=5 & c(0,diff(value))<=0),
rep(lengths>1 & values,lengths)))
such that
> tblout
# A tibble: 7 x 1
value
<dbl>
1 5
2 5
3 5
4 1
5 1
6 1
7 1

Is it possible to dynamically mutate columns (values in a column have other column names)

So I have one column of a dataframe which contains a value, which is equal to a different column name. For each row, I want to change the value of the column that is named.
df <- tibble(.rows = 6) %>% mutate(current_stage = c("Stage-1", "Stage-1", "Stage-2", "Stage-3", "Stage-4", "Stage-4"), `Stage-1` = c(1,1,1,2,4,5), `Stage-2` = c(40,50,20,10,15,10), `Stage-3` = c(1,2,3,4,5,6), `Stage-4` = c(NA, 1, NA, 2, NA, 3))
A tibble: 6 x 5
current_stage `Stage-1` `Stage-2` `Stage-3` `Stage-4`
<chr> <dbl> <dbl> <dbl> <dbl>
Stage-1 1 40 1 NA
Stage-1 1 50 2 1
Stage-2 1 20 3 NA
Stage-3 2 10 4 2
Stage-4 4 15 5 NA
Stage-4 5 10 6 3
So in the first row, I would want to edit the value in the Stage-1 column because the current_stage column has Stage-1. I've tried using !!rlang::sym:
df %>% mutate(!!rlang::sym(current_stage) := 15)
but I get the error: Error in is_symbol(x) : object 'current_stage' not found.
Is this even possible to do? Or should I just bite the bullet and write a different function?
Within the tidyverse, I think using a long format with gather is the easiest way as suggested by Jack Brookes:
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(stage, value, -current_stage, -rowid) %>%
mutate(value = if_else(stage == current_stage, 15, value)) %>%
spread(stage, value)
#> # A tibble: 6 x 6
#> rowid current_stage `Stage-1` `Stage-2` `Stage-3` `Stage-4`
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Stage-1 15 40 1 NA
#> 2 2 Stage-1 15 50 2 1
#> 3 3 Stage-2 1 15 3 NA
#> 4 4 Stage-3 2 10 15 2
#> 5 5 Stage-4 4 15 5 15
#> 6 6 Stage-4 5 10 6 15
Created on 2019-05-20 by the reprex package (v0.2.1)
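In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same reshape with those functions, using the question's data and the same replacement value of 15:

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tibble(
  current_stage = c("Stage-1", "Stage-1", "Stage-2", "Stage-3", "Stage-4", "Stage-4"),
  `Stage-1` = c(1, 1, 1, 2, 4, 5),
  `Stage-2` = c(40, 50, 20, 10, 15, 10),
  `Stage-3` = c(1, 2, 3, 4, 5, 6),
  `Stage-4` = c(NA, 1, NA, 2, NA, 3)
)

res <- df %>%
  rowid_to_column() %>%
  # long format: one row per (rowid, stage) pair
  pivot_longer(starts_with("Stage-"), names_to = "stage", values_to = "value") %>%
  # overwrite only the cell whose column name matches current_stage
  mutate(value = if_else(stage == current_stage, 15, value)) %>%
  # back to one column per stage
  pivot_wider(names_from = stage, values_from = value)

res
```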
