R - Mutate column based on another column - r

Using R:
For the dataframe:
A<-c(3,3,3,3,1,1,2,2,2,2,2)
df<-data.frame(A)
How do you add a column such that the output is the same as:
A<-c(3,3,3,3,1,1,2,2,2,2,2)
df<-data.frame(A)
B<-c(1,1,1,0,1,0,1,1,0,0,0)
mutate(df,B)
In other words, is there a formula for column 'B' such that it looks at column 'A' and lists '1' 3 times, then puts a '0', and so on?
So - the desired output (given column 'A') is:
Thank you.

Here I start a new group each time A changes, then within each group put a 1 in B for the first A rows and a 0 for the rest.
(If the values of A are distinct for each group, you could replace the first two lines with group_by(A), but it's unclear whether that's a fair assumption.)
library(dplyr)
df %>%
mutate(group = cumsum(A != lag(A, default = 0))) %>%
group_by(group) %>%
mutate(B = 1 * (row_number() <= A)) %>%
ungroup()
result
# A tibble: 11 x 3
A group B
<dbl> <int> <dbl>
1 3 1 1
2 3 1 1
3 3 1 1
4 3 1 0
5 1 2 1
6 1 2 0
7 2 3 1
8 2 3 1
9 2 3 0
10 2 3 0
11 2 3 0
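The run-numbering idea in this answer can also be sketched without dplyr. The following is only an illustration in base R, assuming the same vector A as in the question; the names run_id and pos are made up for the example.

```r
A <- c(3, 3, 3, 3, 1, 1, 2, 2, 2, 2, 2)

# Start a new run whenever A differs from the previous value
# (the first element always starts a run).
run_id <- cumsum(c(TRUE, A[-1] != A[-length(A)]))

# Position of each row within its run (1, 2, 3, ...).
pos <- ave(seq_along(A), run_id, FUN = seq_along)

# 1 for the first A rows of each run, 0 afterwards.
B <- as.numeric(pos <= A)
B
```

Here run_id plays the role of the group column and pos the role of row_number() in the dplyr pipeline above.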

After grouping by 'A' and a run-length id, use rep to repeat 1 and 0 within each group: 1 is repeated first(A) times, and 0 is repeated for the remaining n() - first(A) rows.
library(dplyr)
library(data.table)
df %>%
group_by(A, grp = rleid(A)) %>%
mutate(B = rep(c(1, 0), c(first(A), n() - first(A)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 11 x 2
# A B
# <dbl> <dbl>
# 1 3 1
# 2 3 1
# 3 3 1
# 4 3 0
# 5 1 1
# 6 1 0
# 7 2 1
# 8 2 1
# 9 2 0
#10 2 0
#11 2 0
Or using rle from base R
# rbind() interleaves the counts so each one pairs up with the alternating 1, 0
with(rle(df$A), rep(rep(c(1, 0), length(values)), c(rbind(values, lengths - values))))
#[1] 1 1 1 0 1 0 1 1 0 0 0
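Another way to use rle in base R, sketched here as an alternative rather than as part of the original answer, is to compare each row's position within its run against the run's value. It assumes the same df as in the question; the name B is only for illustration.

```r
df <- data.frame(A = c(3, 3, 3, 3, 1, 1, 2, 2, 2, 2, 2))

B <- with(rle(df$A), {
  # sequence(lengths) counts 1, 2, 3, ... within each run;
  # rep(values, lengths) spreads each run's value over its rows.
  as.numeric(sequence(lengths) <= rep(values, lengths))
})
B
```

This avoids having to line up two count vectors for rep, because every row is tested on its own.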

Related

How to count groupings of elements in base R or dplyr using multiple conditions?

I am trying to count the number of elements by groupings, subject to the condition that each grouping code ("Group") is > 0. Suppose we start with the below output DF generated via the code immediately beneath:
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 1
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 1
7 X 0 1
8 X 0 1
9 B 0 1
10 R 0 1
11 R 2 2
12 R 2 2
13 X 3 3
14 X 3 3
15 X 3 3
library(dplyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>% group_by(Element) %>% mutate(reSeq = match(Group, unique(Group)))
Instead, I would like the reSeq column to calculate and output as shown below with explanations to the right:
Element Group reSeq reSeq explanation
<chr> <dbl> <int>
1 R 0 1 1st instance of R (ungrouped)(Group = 0 means not grouped)
2 R 0 2 2nd instance of R (ungrouped)(Group = 0 means not grouped)
3 X 0 1 1st instance of X (ungrouped)(Group = 0 means not grouped)
4 X 1 2 2nd instance of X (grouped by Group = 1)
5 X 1 2 2nd instance of X (grouped by Group = 1)
6 X 0 3 3rd instance of X (ungrouped)
7 X 0 4 4th instance of X (ungrouped)
8 X 0 5 5th instance of X (ungrouped)
9 B 0 1 1st instance of B (ungrouped)
10 R 0 3 3rd instance of R (ungrouped)
11 R 2 4 4th instance of R (grouped by Group = 2)
12 R 2 4 4th instance of R (grouped by Group = 2)
13 X 3 6 6th instance of X (grouped by Group = 3)
14 X 3 6 6th instance of X (grouped by Group = 3)
15 X 3 6 6th instance of X (grouped by Group = 3)
Any recommendations for doing this? If possible, starting with the dplyr code I use above because I am fairly familiar with it.
If we use rowid from data.table, we can skip a couple of steps:
library(dplyr)
library(data.table)
library(tidyr)
myDF %>%
mutate(reSeq = rowid(Element) * NA^!(Group == 0 |!duplicated(Group))) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
-output
# A tibble: 15 × 3
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 2
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 3
7 X 0 4
8 X 0 5
9 B 0 1
10 R 0 3
11 R 2 4
12 R 2 4
13 X 3 6
14 X 3 6
15 X 3 6
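The NA^!(...) expression in the code above may look cryptic. As a small base R illustration (the vectors x and keep are made up for the example): NA^0 evaluates to 1 and NA^1 to NA, so multiplying by NA^(!keep) leaves values unchanged where keep is TRUE and turns the others into NA.

```r
x <- 1:5
keep <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# !keep is FALSE (0) where we keep the value, so NA^0 = 1 there;
# it is TRUE (1) elsewhere, so NA^1 = NA there.
masked <- x * NA^(!keep)
masked
```

In the answer, the resulting NA values are what fill() later overwrites with the last kept value within each Element.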
Below is what I managed to cobble together. Maybe there's a cleaner solution? Here's the code:
library(dplyr)
library(tidyr)
myDF %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup()%>%
mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt,0)) %>%
mutate(reSeq = na_if(reSeq, 0)) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
And here's the output:
# A tibble: 15 x 4
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 3
7 X 0 5 4
8 X 0 6 5
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 6
14 X 3 8 6
15 X 3 9 6

How to count data frame elements grouped by multiple conditions in dplyr?

I am trying to use dplyr to count elements grouped by multiple conditions (columns) in a data frame. In the below example (dataframe output is at the top (except that I manually inserted the 2 right-most columns to explain what I am trying to do), and R code is underneath), I am trying to count the joint groupings of the Element and Group columns. My multiple condition grouping attempt is eleGrpCnt. Any recommendations for the correct way to do this in dplyr? I thought that group_by a combined (Element, Group) would work.
Element Group origOrder eleCnt eleGrpCnt desired explanation
<chr> <dbl> <int> <int> <int> <comment> <comment>
1 B 0 1 1 1 1 1st grouping of B where Group = 0
2 R 0 2 1 1 1 1st grouping of R where Group = 0
3 R 1 3 2 1 2 2nd grouping of R where Group = 1
4 R 1 4 3 2 2 2nd grouping of R where Group = 1
5 B 0 5 2 2 1 1st grouping of B where Group = 0
6 X 2 6 1 1 1 1st grouping of X where Group = 2
7 X 2 7 2 2 1 1st grouping of X where Group = 2
8 X 0 8 3 1 2 2nd grouping of X where Group = 0
9 X 0 9 4 2 2 2nd grouping of X where Group = 0
10 X -1 10 5 1 3 3rd grouping of X where Group = -1
library(dplyr)
myData6 <-
data.frame(
Element = c("B","R","R","R","B","X","X","X","X","X"),
Group = c(0,0,1,1,0,2,2,0,0,-1)
)
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
group_by(Element, Group) %>%
mutate(eleGrpCnt = row_number())%>%
ungroup()
If you group by element then the numbers you are looking for are simply the matches of Group against the unique values of Group:
library(dplyr)
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
group_by(Element) %>%
mutate(eleGrpCnt = match(Group, unique(Group)))
#> # A tibble: 10 x 5
#> # Groups: Element [3]
#> Element Group origOrder eleCnt eleGrpCnt
#> <chr> <dbl> <int> <int> <dbl>
#> 1 B 0 1 1 1
#> 2 R 0 2 1 1
#> 3 R 1 3 2 2
#> 4 R 1 4 3 2
#> 5 B 0 5 2 1
#> 6 X 2 6 1 1
#> 7 X 2 7 2 1
#> 8 X 0 8 3 2
#> 9 X 0 9 4 2
#> 10 X -1 10 5 3
Created on 2022-09-11 with reprex v2.0.2
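To see why match(Group, unique(Group)) numbers the groupings in order of first appearance, here is a minimal base R sketch using the same data as myData6. Using ave for the per-Element step is a choice made for this illustration, not part of the answer above.

```r
Element <- c("B", "R", "R", "R", "B", "X", "X", "X", "X", "X")
Group   <- c(0, 0, 1, 1, 0, 2, 2, 0, 0, -1)

# For a single Element: unique() keeps values in order of first appearance,
# and match() returns each value's position in that list.
x_groups <- Group[Element == "X"]          # 2 2 0 0 -1
x_counts <- match(x_groups, unique(x_groups))

# The same idea applied within every Element at once.
eleGrpCnt <- ave(Group, Element, FUN = function(g) match(g, unique(g)))
eleGrpCnt
```

Because unique() preserves the order of first appearance, no sorting by Group is involved.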
Here's one approach; I'm sorting by Group value but if you want to change the order to match original appearance order we could add a step.
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
arrange(Element, Group) %>%
group_by(Element) %>%
mutate(eleGrpCnt = cumsum(Group != lag(Group, default = -999))) %>%
ungroup() %>%
arrange(origOrder)
# A tibble: 10 × 5
Element Group origOrder eleCnt eleGrpCnt
<chr> <dbl> <int> <int> <int>
1 B 0 1 1 1
2 R 0 2 1 1
3 R 1 3 2 2
4 R 1 4 3 2
5 B 0 5 2 1
6 X 2 6 1 3
7 X 2 7 2 3
8 X 0 8 3 2
9 X 0 9 4 2
10 X -1 10 5 1

How to count the cumulative number of subgroupings using dplyr?

I'm trying to count the cumulative number of subgroupings using dplyr, as illustrated and explained in the image below. I am trying to solve for Flag2 in the image. Any recommendations for how to do this?
Beneath the image I also have the reproducible code that runs all columns up through Flag1 which works fine.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
Group = c(0,0,1,1,2,2,0,3,3,0,0,0)
)
excelCopy <- myData %>%
group_by(Element) %>%
mutate(Element_Count = row_number()) %>%
mutate(Flag1 = case_when(Group > 0 ~ match(Group, unique(Group)),TRUE ~ Element_Count)) %>%
ungroup()
print.data.frame(excelCopy)
Using row_number and setting 0 values to NA
library(dplyr)
excelCopy |>
group_by(Element, Group) |>
mutate(Flag2 = ifelse(Group == 0, NA, row_number()))
Element Group Element_Count Flag1 Flag2
<chr> <dbl> <int> <int> <int>
1 A 0 1 1 NA
2 B 0 1 1 NA
3 B 1 2 2 1
4 B 1 3 2 2
5 B 2 4 3 1
6 B 2 5 3 2
7 A 0 2 2 NA
8 C 3 1 1 1
9 C 3 2 1 2
10 C 0 3 3 NA
11 C 0 4 4 NA
12 C 0 5 5 NA

R dplyr filter direct and indirect sequences of string

Assume I have the table below. And I want to find all id where A is directly or indirectly followed by B.
A->B is the direct sequence for id 1, while A->B is the indirect sequence for id 2, id 3, and id 4.
have <- tibble(id=c(rep(1,2),rep(2,4),rep(3,3),rep(4,4))
,sequence=c('A','B','A','D','C','B','D','A','C','B','D','A','B'))
have
# A tibble: 13 × 2
id sequence
<dbl> <chr>
1 1 A
2 1 B
3 2 A
4 2 D
5 2 C
6 2 B
7 3 D
8 3 A
9 3 C
10 4 B
11 4 D
12 4 A
13 4 B
For A->B direct sequences, I do the following. But I don't think the same logic applies to indirect sequences unless I use regular expressions on the concatenated sequence.
have %>% group_by(id) %>%
dplyr::mutate(process_seq = paste(lag(sequence), '->', sequence)) %>%
dplyr::filter(process_seq == 'A -> B')
want
# A tibble: 4 × 2
id sequence-type
<dbl> <chr>
1 1 direct
2 2 indirect
3 3 indirect
4 4 indirect
ABs = have %>%
group_by(id) %>%
mutate(rownum = row_number(),
letternum = match(sequence, LETTERS[1:26])) %>%
filter(sequence == "A" | sequence == "B") %>%
mutate(dif_row = rownum - lag(rownum),
dif_let = letternum - lag(letternum)) %>%
filter(!is.na(dif_row)) %>%
mutate(has_direct_link = max(dif_row==1),
has_indirect_link = max(dif_row >1 & dif_let == 1),
has_reverse_link = max(dif_row >1 & dif_let < 0)) %>%
select(id, starts_with("has_")) %>%
distinct()
res = have %>% left_join(ABs) %>%
mutate(has_no_link = as.integer(is.na(has_direct_link))) %>%
mutate_if(is.numeric,coalesce,0)
> res
# A tibble: 13 x 6
id sequence has_direct_link has_indirect_link has_reverse_link has_no_link
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 1 0 0 0
2 1 B 1 0 0 0
3 2 A 0 1 0 0
4 2 D 0 1 0 0
5 2 C 0 1 0 0
6 2 B 0 1 0 0
7 3 D 0 0 0 1
8 3 A 0 0 0 1
9 3 C 0 0 0 1
10 4 B 1 0 1 0
11 4 D 1 0 1 0
12 4 A 1 0 1 0
13 4 B 1 0 1 0
#deschen's answer above is elegant, but somewhat incomplete: I don't think the "definition" of an indirect link is correct (every direct link is also an indirect link). But my answer could probably be improved by deschen.
Here's one approach:
library(tidyverse)
have %>%
group_by(id) %>%
mutate(direct = if_else(sequence == 'A' & lead(sequence) == 'B', 1, 0)) %>%
mutate(indirect = any(sequence == 'A') & any(sequence == 'B')) %>%
filter(any(direct == 1) | indirect == TRUE) %>%
ungroup()
which gives:
# A tibble: 10 x 4
id sequence direct indirect
<dbl> <chr> <dbl> <lgl>
1 1 A 1 TRUE
2 1 B 0 TRUE
3 2 A 0 TRUE
4 2 D 0 TRUE
5 2 C 0 TRUE
6 2 B 0 TRUE
7 4 B 0 TRUE
8 4 D 0 TRUE
9 4 A 1 TRUE
10 4 B 0 TRUE
You can of course deselect the created direct/indirect columns.
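If the order of A and B matters, one label per id can be computed from the positions of the two letters. The sketch below is base R and rests on assumptions not confirmed by the question: "direct" means some A is immediately followed by B, and "indirect" means some B occurs later than some A with other elements in between. The function name classify is hypothetical.

```r
have <- data.frame(
  id = c(rep(1, 2), rep(2, 4), rep(3, 3), rep(4, 4)),
  sequence = c("A", "B", "A", "D", "C", "B", "D", "A", "C", "B", "D", "A", "B")
)

# Hypothetical helper: label one id's sequence.
classify <- function(s) {
  a <- which(s == "A")
  b <- which(s == "B")
  if (length(a) == 0 || length(b) == 0) return("none")
  if (any((a + 1) %in% b)) return("direct")   # B right after an A
  if (min(a) < max(b)) return("indirect")     # B somewhere after an A
  "none"
}

res <- sapply(split(have$sequence, have$id), classify)
res
```

Under these assumptions id 3 is labelled "none", because its sample rows (D, A, C) contain no B, and id 4 is labelled "direct" because its last two rows are A then B.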

dplyr: streamline creating matching and absolute difference variables

I have a dataset of friendships and characteristics of each individual. I'm trying to create variables that indicate whether the two friends match on the binary measures, and what their absolute difference is on the continuous measures.
I can do this easily, but I was wondering if there is a different way to do it that is more streamlined than my method given that I have ~60 variables to do this with.
Sample Data:
dat <- read.table(text = "id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y
1 6 0 2 4 1 3 0 2 1 0 2
2 7 0 2 4 0 5 0 2 3 1 4
3 8 1 4 4 1 2 0 4 2 1 1
4 9 0 2 3 1 2 0 3 2 0 1
5 10 1 2 4 0 4 1 4 1 0 4", header = TRUE)
Here's what I've done:
dat <- dat %>%
mutate(sex_match = case_when(male.x == male.y ~ 1,
TRUE ~ 0),
fight_match = case_when(everfight.x == everfight.y ~ 1,
TRUE ~ 0),
smoke_diff = abs(smoke.x - smoke.y),
drink_diff = abs(drink.x - drink.y),
grades_diff = abs(grades.x - grades.y))
This gives me exactly what I want:
id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y sex_match fight_match smoke_diff drink_diff grades_diff
1 6 0 2 4 1 3 0 2 1 0 2 1 0 0 3 1
2 7 0 2 4 0 5 0 2 3 1 4 1 0 0 1 1
3 8 1 4 4 1 2 0 4 2 1 1 0 1 0 2 1
4 9 0 2 3 1 2 0 3 2 0 1 1 0 1 1 1
5 10 1 2 4 0 4 1 4 1 0 4 1 1 2 3 0
However, I'm wondering if there's a way to do this with a loop or apply that identifies corresponding variables and creates the matching and absolute difference variables shown in the sample output above.
UPDATE
I ended up using most of Jon's answer and one part of akrun's. Here's what worked best for me:
non_binary <- dat %>% select(., contains(".x")) %>%
select(., -id.x) %>%
select_if(~!all(. %in% 0:1)) %>%
rename_with(~str_remove(., '.x')) %>%
names()
dat %>%
pivot_longer(-c(id.x:id.y),
names_to = c("var", ".value"),
names_pattern = "(.+).(.+)") %>%
mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
select(-c(var:y)) %>%
pivot_wider(names_from = col_name, values_from = match)
Thanks to both of you!
Here's a tidyr/dplyr approach. First I reshape to a long format with a row for each id/variable combination, with columns for each version. Then I can compare those for every pair at once, and reshape wide.
library(dplyr); library(tidyr)
non_binary <- c("smoke", "drink", "grades")
dat %>%
pivot_longer(-c(id.x:id.y),
names_to = c("var", ".value"),
names_pattern = "(.+).(.+)") %>%
mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
select(-c(var:y)) %>%
pivot_wider(names_from = col_name, values_from = match)
Result, which could be appended to original data:
# A tibble: 5 x 7
id.x id.y male_match smoke_diff drink_diff everfight_match grades_diff
<int> <int> <int> <int> <int> <int> <int>
1 1 6 1 0 3 0 1
2 2 7 1 0 1 0 1
3 3 8 0 0 2 1 1
4 4 9 1 1 1 0 1
5 5 10 1 2 3 1 0
We can use across from dplyr, with stringr for the name handling. Loop across the .x columns of 'male' and 'everfight', get the values of the corresponding .y columns, and compare them to create the binary match columns. Do the same for the other columns, taking the absolute difference instead. In .names, build the new column names with str_replace.
library(dplyr)
library(stringr)
dat %>%
mutate(across(c(male.x, everfight.x ),
~ +(. == get(str_replace(cur_column(), 'x$', 'y'))),
.names = "{str_replace(.col, '.x', '_match')}"),
across(c(smoke.x, drink.x, grades.x),
~
abs(. - get(str_replace(cur_column(), 'x$', 'y'))),
.names = "{str_replace(.col, '.x', '_diff')}"))
-output
id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y male_match everfight_match smoke_diff drink_diff grades_diff
1 1 6 0 2 4 1 3 0 2 1 0 2 1 0 0 3 1
2 2 7 0 2 4 0 5 0 2 3 1 4 1 0 0 1 1
3 3 8 1 4 4 1 2 0 4 2 1 1 0 1 0 2 1
4 4 9 0 2 3 1 2 0 3 2 0 1 1 0 1 1 1
5 5 10 1 2 4 0 4 1 4 1 0 4 1 1 2 3 0
Or we may do this in a single across as well (note that every new column then gets the '_diff' suffix, including the binary matches)
dat %>%
mutate(across(ends_with('.x'), ~ {
other <- get(str_replace(cur_column(), 'x$', 'y'))
if(all(. %in% c(0, 1)) ) +(. == other) else abs(. - other)
}, .names = "{str_replace(.col, '.x', '_diff')}"))
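Since the question asked about a loop, here is a rough base R sketch of the same idea. It assumes, like the single-across version above, that a variable is binary when it only takes the values 0 and 1 in both columns; the names stems, new_cols and result are made up for the example.

```r
dat <- read.table(text = "id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y
1 6 0 2 4 1 3 0 2 1 0 2
2 7 0 2 4 0 5 0 2 3 1 4
3 8 1 4 4 1 2 0 4 2 1 1
4 9 0 2 3 1 2 0 3 2 0 1
5 10 1 2 4 0 4 1 4 1 0 4", header = TRUE)

# Variable stems that have an .x column, excluding the id pair.
stems <- setdiff(sub("\\.x$", "", grep("\\.x$", names(dat), value = TRUE)), "id")

new_cols <- list()
for (s in stems) {
  x <- dat[[paste0(s, ".x")]]
  y <- dat[[paste0(s, ".y")]]
  if (all(c(x, y) %in% 0:1)) {
    # Binary measure: 1 if the pair matches, 0 otherwise.
    new_cols[[paste0(s, "_match")]] <- as.integer(x == y)
  } else {
    # Continuous measure: absolute difference.
    new_cols[[paste0(s, "_diff")]] <- abs(x - y)
  }
}

result <- cbind(dat, as.data.frame(new_cols))
result
```

With many variables, the 0/1 check may misclassify a continuous measure that happens to contain only 0s and 1s, so an explicit list of binary stems may be safer.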

Resources