counter for unique values found along the vector - r

First of all, let me say that I have searched a lot for an answer to this basic question, but none of the answers I found seems to do the job. If this specific question already has an answer, please excuse me.
I want to count the occurrence of behaviours in my data.
mydata <- data.frame(BH=c(
"sniff","explore","walking","explore","walking","trotting","sniff","explore","trotting","trotting","walking","walking","walking","watch","walking","trotting","watch","walking","walking","walking"))
and the output has to be like this
myoutput <- data.frame(
BH=c(
"sniff","explore","walking","explore","walking","trotting","sniff","explore","trotting","trotting","walking","walking","walking","watch","walking","trotting","watch","walking","walking","walking"),
mycount=c(
1,2,3,3,3,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5))
I have experimented using ave and n_distinct from dplyr package, but I only get the count of a given behaviour, not the cumulative count.
Any help or hint on how to solve this problem would be appreciated.
Stef

This is easy with a group-by operation and cumsum. I like using package data.table.
library(data.table)
setDT(mydata)
mydata[, mycount := c(1, rep(0, .N - 1)), by = BH] # flag the first occurrence of each behaviour with 1
mydata[, mycount := cumsum(mycount)] # running total of the first occurrences
all.equal(setDF(mydata), myoutput)
#[1] TRUE
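The same first-occurrence idea can also be sketched in base R, assuming no extra packages are wanted: !duplicated() flags the first time each behaviour appears, and cumsum() turns those flags into a running count of distinct values. This is an illustrative alternative, not part of the answer above.

```r
# Example data from the question
mydata <- data.frame(BH = c(
  "sniff", "explore", "walking", "explore", "walking", "trotting", "sniff",
  "explore", "trotting", "trotting", "walking", "walking", "walking", "watch",
  "walking", "trotting", "watch", "walking", "walking", "walking"))

# TRUE at the first occurrence of each behaviour, FALSE afterwards;
# cumsum() then counts how many distinct behaviours have been seen so far
mydata$mycount <- cumsum(!duplicated(mydata$BH))

mydata$mycount
```

Because cumsum() on a logical vector returns integers, mycount is an integer column here rather than a double one.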

Here is a solution with tidyverse - not as concise as Roland's solution, but it works.
library(tidyverse)
x <- mydata |>
mutate(rn = row_number())
x |>
group_by(BH) |>
mutate(id = cur_group_id()) |>
ungroup() |>
pivot_wider(names_from = BH,
values_from = id,
values_fill = 0) |>
mutate(across(
sniff:watch, ~ cumsum(.x) > 0, .names = "{.col}_temp"),
mycount = rowSums(across(ends_with('_temp')))
) |>
dplyr::select(c(rn:watch, mycount)) |>
right_join(x, by = 'rn') |>
pivot_longer(-c(rn, mycount, BH)) |>
filter(value !=0) |>
dplyr::select(BH, mycount)
#> # A tibble: 20 × 2
#> BH mycount
#> <chr> <dbl>
#> 1 sniff 1
#> 2 explore 2
#> 3 walking 3
#> 4 explore 3
#> 5 walking 3
#> 6 trotting 4
#> 7 sniff 4
#> 8 explore 4
#> 9 trotting 4
#> 10 trotting 4
#> 11 walking 4
#> 12 walking 4
#> 13 walking 4
#> 14 watch 5
#> 15 walking 5
#> 16 trotting 5
#> 17 watch 5
#> 18 walking 5
#> 19 walking 5
#> 20 walking 5

Related

Difference by subgroup using R

I have the following dataset:
I want to calculate the difference between values according to the subgroups. Nevertheless, subgroup 1 must come first. Thus 10-0=10; 0-20=-20; 30-31=-1. I want to perform it using R.
I know that it would be something like this, but I do not know how to put the sub_group into the code:
library(tidyverse)
df %>%
group_by(group) %>%
summarise(difference= diff(value))
Edited answer after OP's comment:
The OP clarified that the data are not sorted by sub_group within every group. Therefore, I added the arrange after group_by. The OP further clarified that the value of sub_group == 1 always should be the first term of the difference.
Below I demonstrate how to achieve this in an example with 3 sub_groups within every group. The code rests on the assumption that the lowest value of sub_group is 1, so that after sorting it comes first in every group. I drop each group's first sub_group after computing the difference.
library(tidyverse)
df <- tibble(group = rep(LETTERS[1:3], each = 3),
sub_group = rep(1:3, 3),
value = c(10,0,5,0,20,15,30,31,10))
df
#> # A tibble: 9 × 3
#> group sub_group value
#> <chr> <int> <dbl>
#> 1 A 1 10
#> 2 A 2 0
#> 3 A 3 5
#> 4 B 1 0
#> 5 B 2 20
#> 6 B 3 15
#> 7 C 1 30
#> 8 C 2 31
#> 9 C 3 10
df |>
group_by(group) |>
arrange(group, sub_group) |>
mutate(value = first(value) - value) |>
slice(2:n())
#> # A tibble: 6 × 3
#> # Groups: group [3]
#> group sub_group value
#> <chr> <int> <dbl>
#> 1 A 2 10
#> 2 A 3 5
#> 3 B 2 -20
#> 4 B 3 -15
#> 5 C 2 -1
#> 6 C 3 20
Created on 2022-10-18 with reprex v2.0.2
P.S. (from the original answer)
In the example data, you show the wrong difference for group C. It should read -1. I am convinced that most people here would appreciate it if you could post your example data using code, or at least as text which can be copied, instead of a picture.
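For comparison, the same "sub_group 1 minus every other sub_group" logic can be sketched in base R. The data below are the answer's example values, deliberately shuffled within each group to show why sorting comes first; the column name difference is an illustrative choice, and the sketch keeps the assumption that 1 is the lowest sub_group in every group.

```r
# Same values as the example above, but unsorted within each group
df <- data.frame(
  group     = rep(c("A", "B", "C"), each = 3),
  sub_group = rep(c(3, 1, 2), times = 3),
  value     = c(5, 10, 0, 15, 0, 20, 10, 30, 31)
)

# Sort so that sub_group 1 is the first row of every group
df <- df[order(df$group, df$sub_group), ]

# Within each group, subtract every value from the group's first value
df$difference <- ave(df$value, df$group, FUN = function(v) v[1] - v)

# Drop the sub_group 1 rows, whose difference with themselves is always 0
res <- df[df$sub_group != 1, c("group", "sub_group", "difference")]
res
```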

Continuing a sequence into NAs using dplyr

I am trying to figure out a dplyr specific way of continuing a sequence of numbers when there are NAs in that column.
For example I have this dataframe:
library(tibble)
dat <- tribble(
~x, ~group,
1, "A",
2, "A",
NA_real_, "A",
NA_real_, "A",
1, "B",
NA_real_, "B",
3, "B"
)
dat
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 NA A
#> 4 NA A
#> 5 1 B
#> 6 NA B
#> 7 3 B
I would like this one:
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 3 A
#> 4 4 A
#> 5 1 B
#> 6 2 B
#> 7 3 B
When I try this I get a warning which makes me think I am probably approaching this incorrectly:
library(dplyr)
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(n))
#> Warning in seq_len(n): first element used of 'length.out' argument
#> Warning in seq_len(n): first element used of 'length.out' argument
#> # A tibble: 7 × 4
#> # Groups: group [2]
#> x group n new_seq
#> <dbl> <chr> <int> <int>
#> 1 1 A 4 1
#> 2 2 A 4 2
#> 3 NA A 4 3
#> 4 NA A 4 4
#> 5 1 B 3 1
#> 6 NA B 3 2
#> 7 3 B 3 3
It's easier if you do it in one go. Your approach is not 'wrong'; it is just that seq_len needs a single integer and you are giving it a vector (n), so seq_len uses only the first value and warns about it.
dat %>%
group_by(group) %>%
mutate(x = seq_len(n()))
Note that row_number might be even easier here:
dat %>%
group_by(group) %>%
mutate(x = row_number())
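The seq_len() behaviour behind the warning can be reproduced outside of dplyr. The sketch below uses a hypothetical vector n that stands in for the n column of group A (one copy of the group size per row) and records the warning instead of printing it.

```r
# Stand-in for the 'n' column inside group A: the group size repeated per row
n <- c(4L, 4L, 4L, 4L)

# Collect warning messages while still letting seq_len() return its result
msgs <- character(0)
res <- withCallingHandlers(
  seq_len(n),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

res    # the sequence is built from n[1] only
msgs   # the "first element used" warning seen in the question

# Passing a single value gives the same sequence without a warning
one_value <- seq_len(n[1])
```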
We could use rowid (from data.table) directly if the intention is just to create a sequence and the group size is only an intermediate column
library(data.table)
library(dplyr)
dat %>%
mutate(new_seq = rowid(group))
The issue with using a column after it is created is that it is no longer a single value but a vector with one entry per row, as shown in @Maël's post. If we need to do that, use first, since seq_len is not vectorized (although the intermediate column is not needed here at all)
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(first(n)))
A base R option using ave (which works in a similar way to group_by in dplyr)
> transform(dat, x = ave(x, group, FUN = seq_along))
x group
1 1 A
2 2 A
3 3 A
4 4 A
5 1 B
6 2 B
7 3 B

Unexpected dplyr::bind_rows() behavior

Short Version:
I'm encountering an error with dplyr::bind_rows() which I don't understand. I want to split my data based on some condition (e.g. a == 1), operate on one part (e.g. b = b * 10), and bind it back to the other part using dplyr::bind_rows() in a single pipe chain. It works fine if I provide the first input to the two parts explicitly, but if instead I pipe them in with . it complains about the data type of argument 2.
Here's a MRE of the issue:
library(tidyverse)
# sim data
d <- tibble(a = 1:4, b = 1:4)
# works when 'd' is supplied directly to bind_rows()
bind_rows(d %>% filter(a == 1),
d %>% filter(!a == 1) %>% mutate(b = b * 10))
#> # A tibble: 4 x 2
#> a b
#> <int> <dbl>
#> 1 1 1
#> 2 2 20
#> 3 3 30
#> 4 4 40
# fails when 'd' is piped in to bind_rows()
d %>%
bind_rows(. %>% filter(a == 1),
. %>% filter(!a == 1) %>% mutate(b = b * 10))
#> Error: Argument 2 must be a data frame or a named atomic vector.
Long Version:
If I capture what the bind_rows() call is getting as input as a list() instead, I can see that two unexpected (to me) things are happening.
Instead of evaluating the pipe chains I provided, it seems to just capture them as a functional sequence.
I can see that the input (.) is invisibly being provided in addition to the two explicit arguments, so I get 3 items instead of 2 in the list.
# capture intermediate values for diagnostics
d %>%
list(. %>% filter(a == 1),
. %>% filter(!a == 1) %>% mutate(b = b * 10))
#> [[1]]
#> # A tibble: 4 x 2
#> a b
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#>
#> [[2]]
#> Functional sequence with the following components:
#>
#> 1. filter(., a == 1)
#>
#> Use 'functions' to extract the individual functions.
#>
#> [[3]]
#> Functional sequence with the following components:
#>
#> 1. filter(., !a == 1)
#> 2. mutate(., b = b * 10)
#>
#> Use 'functions' to extract the individual functions.
This leads me to the following inelegant solution where I solve the first problem by piping to the inner function which seems to force evaluation correctly (for reasons I don't understand) and then solve the second problem by subsetting the list prior to performing the bind_rows() operation.
# hack solution to force eval and clean duplicated input
d %>%
list(filter(., a == 1),
filter(., !a == 1) %>% mutate(b = b * 10)) %>%
.[-1] %>%
bind_rows()
#> # A tibble: 4 x 2
#> a b
#> <int> <dbl>
#> 1 1 1
#> 2 2 20
#> 3 3 30
#> 4 4 40
Created on 2022-01-24 by the reprex package (v2.0.1)
It seems like it might be related to this issue, but I can't quite see how. It would be great to understand why this is happening and find a way to code this without the need to assign intermediate variables or do this weird hack to subset the intermediate list.
EDIT:
Knowing this was related to curly braces ({}) enabled me to find a few more helpful links:
1, 2, 3
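If we want to use ., wrap the call in curly braces ({}) so that the piped input is not also inserted as the first argument, and wrap each . in braces as well so that . %>% ... is evaluated instead of being captured as a functional sequence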
library(dplyr)
d %>%
{
bind_rows({.} %>% filter(a == 1),
{.} %>% filter(!a == 1) %>% mutate(b = b * 10))
}
-output
# A tibble: 4 × 2
a b
<int> <dbl>
1 1 1
2 2 20
3 3 30
4 4 40
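Another way to sidestep the placeholder rules altogether, sketched here in base R only, is to move the split-and-bind step into a small function so that the piped input is bound to an ordinary argument and can be reused as often as needed. The helper name split_scale_bind is hypothetical and not from the thread, and the native pipe assumes R 4.1 or later.

```r
# Same toy data as in the question, as a plain data frame
d <- data.frame(a = 1:4, b = 1:4)

# Hypothetical helper: the input is an ordinary argument, so it can be
# referenced twice without any placeholder semantics getting in the way
split_scale_bind <- function(x) {
  untouched <- x[x$a == 1, ]    # rows that stay as they are
  scaled    <- x[x$a != 1, ]    # rows to operate on
  scaled$b  <- scaled$b * 10
  rbind(untouched, scaled)
}

res <- d |> split_scale_bind()
res
```

The trade-off is one named helper in exchange for a pipe chain that stays free of nested braces.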

Mutate All columns in a list of tibbles

Let's suppose I have the following list of tibbles:
a_list_of_tibbles <- list(
a = tibble(a = rnorm(10)),
b = tibble(a = runif(10)),
c = tibble(a = letters[1:10])
)
Now I want to map them all into a single dataframe/tibble, which is not possible due to the differing column types.
How would I go about this?
I have tried this, but I want to get rid of the for loop
for(i in 1:length(a_list_of_tibbles)){
a_list_of_tibbles[[i]] <- a_list_of_tibbles[[i]] %>% mutate_all(as.character)
}
Then I run:
map_dfr(.x = a_list_of_tibbles, .f = as_tibble)
We could do the computation within the map - use across instead of the suffix _all (which is getting deprecated) to loop over the columns of the dataset
library(dplyr)
library(purrr)
map_dfr(a_list_of_tibbles,
~.x %>%
mutate(across(everything(), as.character) %>%
as_tibble))
-output
# A tibble: 30 × 1
a
<chr>
1 0.735200825884485
2 1.4741501589461
3 1.39870958697574
4 -0.36046362308853
5 -0.893860999301402
6 -0.565468636033674
7 -0.075270267983768
8 2.33534260196058
9 0.69667906338348
10 1.54213170143702
# … with 20 more rows
Another alternative is to use:
library(tidyverse)
map_depth(a_list_of_tibbles, 2, as.character) %>%
bind_rows()
#> # A tibble: 30 × 1
#> a
#> <chr>
#> 1 0.0894618169853206
#> 2 -1.50144637645091
#> 3 1.44795821718513
#> 4 0.0795342912030257
#> 5 -0.837985570593029
#> 6 -0.050845557103668
#> 7 0.031194556366589
#> 8 0.0989551909839589
#> 9 1.87007290229274
#> 10 0.67816212007413
#> # … with 20 more rows
Created on 2021-12-20 by the reprex package (v2.0.1)
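If the goal is only to get rid of the for loop, the same conversion can be sketched in base R with lapply(). This uses plain data frames instead of tibbles; the helper name as_character_cols and the seed are illustrative assumptions, not part of the answers above.

```r
set.seed(1)  # arbitrary seed, only to make the random columns reproducible

a_list_of_frames <- list(
  a = data.frame(a = rnorm(10)),
  b = data.frame(a = runif(10)),
  c = data.frame(a = letters[1:10])
)

# Convert every column of one data frame to character;
# assigning into df[] keeps the data frame structure
as_character_cols <- function(df) {
  df[] <- lapply(df, as.character)
  df
}

# Apply the conversion to each element, then stack the results row-wise
combined <- do.call(rbind, lapply(a_list_of_frames, as_character_cols))
str(combined)
```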

assigning id values from values, not names, with purrr::map_dfr

I think this question is related to Using map_dfr and .id for list names and list of list names but not identical ...
I often use map_dfr for a case where I want to use the value of each argument, not its name, as the .id variable. Here's a silly example: I am computing the mean of mtcars$mpg raised to the second, fourth, and sixth power:
library(tidyverse)
list(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
## name x
## <chr> <dbl>
## 1 1 439.
## 2 2 262350.
## 3 3 198039783.
I would like the name variable to be 2, 4, 6 instead of 1, 2, 3. I can hack this by including setNames(., .) in the pipeline:
list(2,4,6) %>%
setNames(., .) %>%
map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
but I wonder if there is a more idiomatic approach I'm missing?
As for the suggestion of using something like ~ tibble(name=., ...): nice, but slightly less convenient for the case where the mapping function already returns a tibble, because we have to add an otherwise unnecessary tibble() call:
list(2, 4, 6) %>%
map_dfr(~ tibble(name=.,
broom::tidy(lm(mpg~cyl, data=mtcars, offset=rep(., nrow(mtcars))))))
OK, I think I found this shortly before posting (so I'll answer). This answer points out that tibble::lst() is a self-naming list function, so as long as we use tibble::lst(2,4,6) instead of list(2,4,6), it Just Works, e.g.
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
This can work too:
library(tidyverse)
## Ben Bolker's answer
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="power")
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
list(2, 4, 6) %>% map_df(~ tibble(power = as.character(.x) , x = mean(mtcars$mpg^.)))
#> # A tibble: 3 x 2
#> power x
#> <chr> <dbl>
#> 1 2 439.
#> 2 4 262350.
#> 3 6 198039783.
#another option
seq(2, 6, 2) %>% map2_df(rerun(length(.), mtcars$mpg), ~ c(x = as.character(.x), mean = round(mean(.y^.x), 0)))
#> # A tibble: 3 x 2
#> x mean
#> <chr> <chr>
#> 1 2 439
#> 2 4 262350
#> 3 6 198039783
Created on 2021-06-06 by the reprex package (v2.0.0)
This is also possible; however, it would not have been my first choice, since a plain map would suffice:
library(purrr)
list(2, 4, 6) %>%
pmap_dfr(~ tibble(power = c(...), x = map_dbl(c(...), ~ mean(mtcars$mpg ^ .x))))
# A tibble: 3 x 2
power x
<dbl> <dbl>
1 2 439.
2 4 262350.
3 6 198039783.
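The self-naming idea behind lst() can also be sketched without the tidyverse: name the vector with its own values, then use those names as the identifier column. The variable names below are illustrative; mtcars ships with base R.

```r
powers <- c(2, 4, 6)

# Name each element after its own value: names become "2", "4", "6"
named_powers <- setNames(powers, powers)

# One mean per power; the names carry the values through to the result
res <- data.frame(
  name = names(named_powers),
  x    = unname(vapply(named_powers,
                       function(p) mean(mtcars$mpg^p),
                       numeric(1)))
)
res
```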
