Continuing a sequence into NAs using dplyr - r

I am trying to figure out a dplyr-specific way of continuing a sequence of numbers when there are NAs in that column.
For example, I have this data frame:
library(tibble)
dat <- tribble(
  ~x,       ~group,
  1,        "A",
  2,        "A",
  NA_real_, "A",
  NA_real_, "A",
  1,        "B",
  NA_real_, "B",
  3,        "B"
)
dat
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 NA A
#> 4 NA A
#> 5 1 B
#> 6 NA B
#> 7 3 B
I would like this one:
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 3 A
#> 4 4 A
#> 5 1 B
#> 6 2 B
#> 7 3 B
When I try this I get a warning which makes me think I am probably approaching this incorrectly:
library(dplyr)
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(n))
#> Warning in seq_len(n): first element used of 'length.out' argument
#> Warning in seq_len(n): first element used of 'length.out' argument
#> # A tibble: 7 × 4
#> # Groups: group [2]
#> x group n new_seq
#> <dbl> <chr> <int> <int>
#> 1 1 A 4 1
#> 2 2 A 4 2
#> 3 NA A 4 3
#> 4 NA A 4 4
#> 5 1 B 3 1
#> 6 NA B 3 2
#> 7 3 B 3 3

It's easier if you do it in one go. Your approach is not 'wrong'; it's just that seq_len() expects a single integer, and you are giving it a whole vector (the column n), so seq_len() falls back to using only its first value and warns about it.
dat %>%
group_by(group) %>%
mutate(x = seq_len(n()))
Note that row_number might be even easier here:
dat %>%
group_by(group) %>%
mutate(x = row_number())
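If you'd rather keep the existing non-missing values and only fill in the NAs, coalesce() can combine the two. This is a small sketch of my own, and it assumes that wherever x is not NA it already equals its within-group position, as in the example data:
library(dplyr)
dat %>%
  group_by(group) %>%
  mutate(x = coalesce(x, row_number())) %>%  # keep x where present, fill NAs with the row index
  ungroup()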

We could use data.table::rowid directly if the intention is to create a sequence and the group size is just an intermediate column:
library(data.table)
library(dplyr)
dat %>%
mutate(new_seq = rowid(group))
The issue with using a column right after it is created is that it is no longer a single value but a full vector, as shown in @Maël's post. If we do need to go that route, wrap it in first(), since seq_len() is not vectorised (and the intermediate column is not really needed here anyway):
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(first(n)))
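As a small aside of mine, not part of the original answer: if the goal is to overwrite x in place rather than add a new column, the same rowid() call works inside mutate():
dat %>%
  mutate(x = rowid(group))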

A base R option using ave (which works in a similar way to group_by in dplyr):
> transform(dat, x = ave(x, group, FUN = seq_along))
x group
1 1 A
2 2 A
3 3 A
4 4 A
5 1 B
6 2 B
7 3 B
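The same helper also drops straight into a dplyr pipe without group_by(). This is a small variation of mine, not part of the original answer:
library(dplyr)
dat %>%
  mutate(x = ave(x, group, FUN = seq_along))  # ave() applies seq_along within each group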

Related

Difference by subgroup using R

I have the following dataset:
I want to calculate the difference between values according to the subgroups, and the value of subgroup 1 must always come first in the subtraction: thus 10-0=10, 0-20=-20, 30-31=-1. I want to do this in R.
I know that it would be something like this, but I do not know how to put the sub_group into the code:
library(tidyverse)
df %>%
group_by(group) %>%
summarise(difference= diff(value))
Edited answer after OP's comment:
The OP clarified that the data are not sorted by sub_group within every group. Therefore, I added the arrange after group_by. The OP further clarified that the value of sub_group == 1 always should be the first term of the difference.
Below I demonstrate how to achieve this in an example with 3 sub_groups within every group. The code rests on the assumption that the lowest value of sub_group is 1. I drop each group's first sub_group row after taking the difference.
library(tidyverse)
df <- tibble(group = rep(LETTERS[1:3], each = 3),
             sub_group = rep(1:3, 3),
             value = c(10, 0, 5, 0, 20, 15, 30, 31, 10))
df
#> # A tibble: 9 × 3
#> group sub_group value
#> <chr> <int> <dbl>
#> 1 A 1 10
#> 2 A 2 0
#> 3 A 3 5
#> 4 B 1 0
#> 5 B 2 20
#> 6 B 3 15
#> 7 C 1 30
#> 8 C 2 31
#> 9 C 3 10
df |>
group_by(group) |>
arrange(group, sub_group) |>
mutate(value = first(value) - value) |>
slice(2:n())
#> # A tibble: 6 × 3
#> # Groups: group [3]
#> group sub_group value
#> <chr> <int> <dbl>
#> 1 A 2 10
#> 2 A 3 5
#> 3 B 2 -20
#> 4 B 3 -15
#> 5 C 2 -1
#> 6 C 3 20
Created on 2022-10-18 with reprex v2.0.2
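If the rows cannot be relied on to arrive with sub_group == 1 sorted first, the same difference can be taken by indexing on sub_group directly instead of using first(). This is a sketch of mine, a variation on the answer above:
df |>
  group_by(group) |>
  mutate(value = value[sub_group == 1] - value) |>  # subtract each value from the sub_group 1 value
  filter(sub_group != 1) |>                          # drop the reference row itself
  ungroup()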
P.S. (from the original answer)
In the example data, you show the wrong difference for group C; it should read -1. I am convinced that most people here would appreciate it if you could post your example data as code, or at least as text that can be copied, rather than as a picture.

How to add a row to each group and assign values

I have this tibble:
library(tibble)
library(dplyr)
df <- tibble(id = c("one", "two", "three"),
             A = c(1, 2, 3),
             B = c(4, 5, 6))
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 two 2 5
3 three 3 6
I want to add a row to each group AND assign values to the new row, but using a function: here the new row in each group should get A = 4 and B = the group's first value of column B, i.e. first(B). Desired output:
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
I have tried so far:
If I add a row to an ungrouped tibble with add_row, this works perfectly:
df %>%
add_row(A=4, B=4)
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 two 2 5
3 three 3 6
4 NA 4 4
If I try to use add_row on a grouped tibble, it does not work:
df %>%
group_by(id) %>%
add_row(A=4, B=4)
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
According to the post Add row in each group using dplyr and add_row(), we could use group_modify, and this works great:
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=4, .x))
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 4
5 two 2 5
6 two 4 4
I want to assign to column B the first value of column B (or any other function of it: min(B), max(B), etc.), but this does not work:
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=first(B), .x))
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'first': object 'B' not found
library(tidyverse)
df <- tibble(id = c("one", "two", "three"),
             A = c(1, 2, 3),
             B = c(4, 5, 6))
df %>%
group_by(id) %>%
summarise(add_row(cur_data(), A = 4, B = first(cur_data()$B)))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: id [3]
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
Or
df %>%
group_by(id) %>%
group_split() %>%
map_dfr(~ add_row(.,id = first(.$id), A = 4, B = first(.$B)))
#> # A tibble: 6 × 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
Created on 2022-01-02 by the reprex package (v2.0.1)
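A side note of mine, not from the answer above: in dplyr >= 1.1.0, cur_data() is superseded and summarise() warns when an expression returns more than one row per group, so the same idea would be written with reframe() and pick(). A sketch, assuming dplyr >= 1.1.0:
df %>%
  group_by(id) %>%
  reframe(add_row(pick(A, B), A = 4, B = first(B)))  # pick() supplies the group's A/B columns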
Maybe this is an option
library(dplyr)
df %>%
group_by(id) %>%
summarise( A=c(A,4), B=c(B,first(B)) ) %>%
ungroup
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 6 x 3
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
According to the documentation of group_modify, if you use a formula you must use ". or .x to refer to the subset of rows of .tbl for the given group"; that's why you pass .x into the add_row call. To be entirely consistent, you have to do the same inside the first() call as well.
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=first(.x$B), .x))
# A tibble: 6 x 3
# Groups: id [3]
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
Using first(.$B) gives the same result, since . and .x both refer to the current group's rows. Note that first(df$B), by contrast, would always return the first value of the full column.
A possible solution:
library(tidyverse)
df <- tibble(id = c("one", "two", "three"),
             A = c(1, 2, 3),
             B = c(4, 5, 6))
df %>%
group_by(id) %>%
slice(rep(1,2)) %>% mutate(A = if_else(row_number() > 1, first(df$B), A)) %>%
ungroup
#> # A tibble: 6 × 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5

How to Create an Iterative Formula to Calculate a Z-Score in R?

I have a number of large data frames that have the following basic format, where the final two rows are a mean (d) and standard deviation (e) - although these are calculated elsewhere.
  a b c
a 4 3 4
b 3 2 6
c 2 1 8
d 3 2 6
e 1 1 2
I would like to create an iterative function that converts each raw data point into a z-score via the mean and sd value in d and e per column. The formula I would like to apply is ((x-mean)/SD).
The result would be the following:
   a  b  c
a  1  1 -1
b  0  0  0
c -1 -1  1
I don't mind if this is added to the end, created as a new dataframe or the data is converted.
Thanks!
Here is one approach; note that I do not use the mean/sd provided in the data but re-calculate them on the fly.
Also note that usually the data should be in a tidy representation, which in your case would mean that a, b, c end up in a single column and the mean/sd are either calculated on the fly or kept in separate columns (note that this would require reshaping the data, not shown here).
# your input data
raw_data <- data.frame(
  a = c(4, 3, 2, 3, 1),
  b = c(3, 2, 1, 2, 1),
  c = c(4, 6, 8, 6, 2),
  row.names = c("a", "b", "c", "d", "e")
)
raw_data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
#> d 3 2 6
#> e 1 1 2
# remove the mean/sd values
data <- raw_data[!rownames(raw_data) %in% c("d", "e"), ]
data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
# quick way to recalculate the values
means <- apply(data, 2, mean)
means
#> a b c
#> 3 2 6
sds <- apply(data, 2, sd)
sds
#> a b c
#> 1 1 2
z_scores <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
z_scores
#> a b c
#> a 1 1 -1
#> b 0 0 0
#> c -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
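If you do want to read the mean and sd from the precomputed d and e rows instead of recalculating them, here is a minimal base R sketch of mine (it assumes those rows are always labelled "d" and "e"; z_from_de is just an illustrative name):
z_from_de <- function(df) {
  vals  <- df[!rownames(df) %in% c("d", "e"), ]  # the raw data points
  means <- unlist(df["d", ])                     # row d holds the column means
  sds   <- unlist(df["e", ])                     # row e holds the column sds
  as.data.frame(scale(as.matrix(vals), center = means, scale = sds))
}
z_from_de(raw_data)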
Edit / Full Code
The following code is a bit longer but most of it is spent on getting the data into the right (long/tidy) format.
If you have any questions, feel free to use the comments.
Note that the tidyverse is really helpful, but might need some time to get used to. The code used here is mostly dplyr (included in the tidyverse).
If you understand the functions %>% (the pipe), group_by(), mutate(), summarise(), and pivot_longer()/pivot_wider(), you've got everything.
library(tidyverse)
# use your original dataset again
raw_data <- data.frame(
  a = c(4, 3, 2, 3, 1),
  b = c(3, 2, 1, 2, 1),
  c = c(4, 6, 8, 6, 2),
  row.names = c("a", "b", "c", "d", "e")
)
### 1) Turn the data into a nicer format
# match-table how to rename the variables
var_match <- c(d = "mean", e = "sd")
# convert the raw data into a nicer format, first we do some minor changes
# (variable names, etc)
data_mixed <- raw_data %>%
# have the rownames as explicit variable
rownames_to_column("metric") %>%
# nicer printing etc
as_tibble() %>%
# replace variable names with mean/sd
mutate(metric = ifelse(metric %in% c("d", "e"),
var_match[metric], metric))
data_mixed
#> # A tibble: 5 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 4
#> 2 b 3 2 6
#> 3 c 2 1 8
#> 4 mean 3 2 6
#> 5 sd 1 1 2
# separate the dataset into two:
# data holds the values
# data_vars holds the metrics mean and sd
data <- data_mixed %>% filter(!metric %in% var_match) %>% select(-metric)
data_vars <- data_mixed %>% filter(metric %in% var_match)
data
#> # A tibble: 3 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 4 3 4
#> 2 3 2 6
#> 3 2 1 8
data_vars
#> # A tibble: 2 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 mean 3 2 6
#> 2 sd 1 1 2
# turn the value dataset into its longer form, makes it easier to work with it later
data_long <- data %>%
pivot_longer(everything(), names_to = "var", values_to = "val")
data_long
#> # A tibble: 9 x 2
#> var val
#> <chr> <dbl>
#> 1 a 4
#> 2 b 3
#> 3 c 4
#> 4 a 3
#> 5 b 2
#> 6 c 6
#> 7 a 2
#> 8 b 1
#> 9 c 8
# turn the metric dataset into another long form, allowing easy combination in the next step
data_vars2 <- data_vars %>%
pivot_longer(-metric, names_to = "var", values_to = "val") %>%
pivot_wider(var, names_from = metric, values_from = val)
data_vars2
#> # A tibble: 3 x 3
#> var mean sd
#> <chr> <dbl> <dbl>
#> 1 a 3 1
#> 2 b 2 1
#> 3 c 6 2
# combine the datasets
data_all <- left_join(data_long, data_vars2, by = "var")
data_all
#> # A tibble: 9 x 4
#> var val mean sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 1
#> 2 b 3 2 1
#> 3 c 4 6 2
#> 4 a 3 3 1
#> 5 b 2 2 1
#> 6 c 6 6 2
#> 7 a 2 3 1
#> 8 b 1 2 1
#> 9 c 8 6 2
## 2) calculate the z-score
# now comes the actual number crunching!
# per variable var (a, b, c) compute the variable val_z as the z-score
data_res <- data_all %>%
group_by(var) %>%
mutate(val_z = (val - mean) / sd)
data_res
#> # A tibble: 9 x 5
#> # Groups: var [3]
#> var val mean sd val_z
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 4 3 1 1
#> 2 b 3 2 1 1
#> 3 c 4 6 2 -1
#> 4 a 3 3 1 0
#> 5 b 2 2 1 0
#> 6 c 6 6 2 0
#> 7 a 2 3 1 -1
#> 8 b 1 2 1 -1
#> 9 c 8 6 2 1
## 3) make the results more readable
# lastly pivot the results to its original form
data_res_wide <- data_res %>%
select(var, val_z) %>%
group_by(var) %>%
mutate(id = 1:n()) %>% # needed for easier identification of values
pivot_wider(id, names_from = var, values_from = val_z)
data_res_wide
#> # A tibble: 3 x 4
#> id a b c
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 1 -1
#> 2 2 0 0 0
#> 3 3 -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
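For completeness, an addition of mine rather than part of the answer: if you prefer to stay in the original wide layout, across() computes the same z-scores in one step, again recalculating mean/sd on the fly (this assumes dplyr >= 1.0.0 for across()):
raw_data %>%
  rownames_to_column("metric") %>%
  filter(!metric %in% c("d", "e")) %>%                 # drop the mean/sd rows
  mutate(across(a:c, ~ (.x - mean(.x)) / sd(.x)))      # z-score each column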

select only rows with duplicate id and specific value from another column in R

I have the following data with ID and value:
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2","1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2","2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5","2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1","2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
If you notice, there are multiple values for the same id. What I'd like to do is keep the values 3 and 6 only when the same ID has both of them. For example, ID "1103-5" has both 3 and 6, so it should be in the list, but "2347-2" should not.
I'm using R
One method I tried is the following, but it gives me every row whose value is 3 or 6.
d <- data.frame(id, value)
group36 <- d[d$value == 3 | d$value == 6,]
and
d %>% group_by(id) %>% filter(3 == value | 6 == value)
The output should be like this:
id value
1103-5 6
1103-5 3
1104-2 6
1104-2 3
1104-4 6
1104-4 3
1106-2 6
1106-2 3
1106-3 6
1106-3 3
2294-1 3
2294-1 6
2294-2 3
2294-2 6
2294-3 3
2294-3 6
2294-4 3
2294-4 6
2294-5 3
2294-5 6
d<-group_by(d,id)
filter(d,any(value==3),any(value==6))
This gives you all the IDs where there is both a value of 3 (somewhere) AND a value of 6 (somewhere). Mind you, your data contains some IDs with THREE values; in those cases, if both 3 and 6 are present, the ID will be included in the result.
If you then want to exclude the remaining rows that do not equal 3 or 6, add this:
filter(d,value==3 | value==6)
If you want to do both at once (keep only the IDs that have both 3 and 6, and drop their other values), combine the conditions:
filter(d,any(value==3),any(value==6),value==3 | value==6)
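Putting those pieces together in a single pipe (just a rephrasing of the filters above using %in%, not a different method):
library(dplyr)
d %>%
  group_by(id) %>%
  filter(all(c(3, 6) %in% value),  # keep IDs that have both a 3 and a 6
         value %in% c(3, 6)) %>%   # and drop their other values
  ungroup()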
Not sure if this is what you want. We can filter the rows equal to either 3 or 6, then convert from long to wide format and keep only the columns that have both 3 and 6. After that, convert back to long format.
library(dplyr)
library(tidyr)
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2",
"1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2",
"2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5",
"2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1",
"2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
d <- data.frame(id, value)
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.)))
#> # A tibble: 2 x 11
#> rows `1103-5` `1104-2` `1104-4` `1106-2` `1106-3` `2294-1` `2294-2`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 6 6 6 6 3 3
#> 2 2 3 3 3 3 3 6 6
#> # ... with 3 more variables: `2294-3` <dbl>, `2294-4` <dbl>,
#> # `2294-5` <dbl>
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.))) %>%
select(-rows) %>%
gather(id, value)
#> # A tibble: 20 x 2
#> id value
#> <chr> <dbl>
#> 1 1103-5 6
#> 2 1103-5 3
#> 3 1104-2 6
#> 4 1104-2 3
#> 5 1104-4 6
#> 6 1104-4 3
#> 7 1106-2 6
#> 8 1106-2 3
#> 9 1106-3 6
#> 10 1106-3 3
#> 11 2294-1 3
#> 12 2294-1 6
#> 13 2294-2 3
#> 14 2294-2 6
#> 15 2294-3 3
#> 16 2294-3 6
#> 17 2294-4 3
#> 18 2294-4 6
#> 19 2294-5 3
#> 20 2294-5 6
Created on 2018-07-01 by the reprex package (v0.2.0.9000).

tidyr - unique way to get combinations (using tidyverse only)

I want to get all unique pairwise combinations of the values of a string column of a data frame, ideally using the tidyverse.
Here is a dummy example:
library(tidyverse)
a <- letters[1:3] %>%
tibble::as_tibble()
a
#> # A tibble: 3 x 1
#> value
#> <chr>
#> 1 a
#> 2 b
#> 3 c
tidyr::crossing(a, a) %>%
magrittr::set_colnames(c("words1", "words2"))
#> # A tibble: 9 x 2
#> words1 words2
#> <chr> <chr>
#> 1 a a
#> 2 a b
#> 3 a c
#> 4 b a
#> 5 b b
#> 6 b c
#> 7 c a
#> 8 c b
#> 9 c c
Is there a way to remove 'duplicate' combinations here? That is, the output in this example would be the following:
#> # A tibble: 3 x 2
#> words1 words2
#> <chr> <chr>
#> 1 a b
#> 2 a c
#> 3 b c
I was hoping there would be a nice purrr::map or filter approach that I could pipe into to complete the above.
EDIT: There are similar questions to this one, e.g. here, as pointed out by @Sotos. Here I am specifically looking for tidyverse (purrr, dplyr) ways to complete the pipeline I have set up; the other answers use various other packages that I do not want to include as dependencies.
I wish there were a better way, but I usually use this...
library(tidyverse)
df <- tibble(value = letters[1:3])
df %>%
expand(value, value1 = value) %>%
filter(value < value1)
# # A tibble: 3 x 2
# value value1
# <chr> <chr>
# 1 a b
# 2 a c
# 3 b c
Something like this?
tidyr::crossing(a, a) %>%
magrittr::set_colnames(c("words1", "words2")) %>%
rowwise() %>%
mutate(words1 = sort(c(words1, words2))[1], # sort order of words for each row
words2 = sort(c(words1, words2))[2]) %>%
filter(words1 != words2) %>% # remove word combinations with itself
unique() # remove duplicates
# A tibble: 3 x 2
words1 words2
<chr> <chr>
1 a b
2 a c
3 b c
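For what it's worth, naming the inputs inside crossing() gives the same result without set_colnames() or the rowwise() pass. This is a variation of mine on the answers above, assuming plain alphabetical ordering of the strings is what you want:
library(dplyr)
tidyr::crossing(words1 = a$value, words2 = a$value) %>%  # all 9 ordered pairs, with named columns
  filter(words1 < words2)                                # keep each unordered pair once, drop self-pairs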
