working with integer ranges in dplyr


I have a tibble that encodes when each of 300 counties experienced a (potentially) recurrent event. The "shape of the data" is:
county event_start event_end
A 3 6
A 12 20
A 71 80
B 1 3
B 19 30
...
Some helpful characteristics here:
There is no missing data.
No county has two events that overlap (event_start_2 is always greater than event_end_1 for two events)
Within county, the events are sorted.
I want to reshape the data to be more like this:
county day event
A 1 no
A 2 no
A 3 yes
A 4 yes
A 5 yes
A 6 yes
A 7 no
...
I can imagine how to do this with a bunch of for loops and such. But is there a dplyrish way to do it?

One option is to build the sequence between each pair of 'event_start' and 'event_end' values with map2, unnest the resulting list column to expand the data, use complete to fill in the missing 'day' values within each county, and then replace the NA elements of 'event' with 'no':
library(tidyverse)
df1 %>%
transmute(county, day = map2(event_start, event_end, seq), event = 'yes') %>%
unnest(day) %>%
group_by(county) %>%
complete(day = seq_len(max(day))) %>%
mutate(event = replace(event, is.na(event), 'no'))
# A tibble: 110 x 3
# Groups: county [2]
# county day event
# <chr> <int> <chr>
# 1 A 1 no
# 2 A 2 no
# 3 A 3 yes
# 4 A 4 yes
# 5 A 5 yes
# 6 A 6 yes
# 7 A 7 no
# 8 A 8 no
# 9 A 9 no
#10 A 10 no
# ... with 100 more rows
data
df1 <- structure(list(county = c("A", "A", "A", "B", "B"), event_start = c(3L,
12L, 71L, 1L, 19L), event_end = c(6L, 20L, 80L, 3L, 30L)), .Names = c("county",
"event_start", "event_end"), class = "data.frame", row.names = c(NA,
-5L))
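A slightly shorter variant of the same idea, offered as a sketch rather than a definitive answer: complete() takes a fill argument, so the 'no' values can be supplied while the missing days are added, without a separate replace() step. The object name result is only illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same example data as above
df1 <- data.frame(
  county      = c("A", "A", "A", "B", "B"),
  event_start = c(3L, 12L, 71L, 1L, 19L),
  event_end   = c(6L, 20L, 80L, 3L, 30L)
)

result <- df1 %>%
  # one integer vector of event days per row, stored in a list column
  transmute(county, day = map2(event_start, event_end, seq), event = "yes") %>%
  # one row per county/day
  unnest(day) %>%
  group_by(county) %>%
  # add the days without an event and label them "no" in the same step
  complete(day = seq_len(max(day)), fill = list(event = "no")) %>%
  ungroup()
```

As in the answer above, each county is only filled up to its own last event day; if every county should cover the same range, the upper bound would need to be passed explicitly.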

Related

Rearrangement columns of a table in R

I have the following table that I want to modify
Debt2017 Debt2018 Debt2019 Cash2017 Cash2018 Cash2019 Year Other
2 4 3 5 6 7 2018 x
3 8 9 7 9 9 2017 y
So that the result is the following
Debt Cash FLAG After Other
2 5 0 x
3 7 1 x
8 9 1 y
9 9 1 y
Basically, I want to change the data so that I have the different years in different rows, eliminating the values for the year indicated in the column "Year" and adding a FLAG that tells me whether the data indicated in the row is from a previous (0) or following (1) year (with respect to the year indicated in the column "Year").
Furthermore, I also want to keep the column "Other".
Does anybody know how to do it in R?

library(dplyr)
library(tidyr)
df %>%
pivot_longer(Debt2017:Cash2019,
names_to = c(".value", "Year2"),
names_pattern = "(\\D+)(\\d+)") %>%
filter(Year != Year2) %>%
mutate(flag = +(Year2 > Year))
# # A tibble: 4 × 6
# Year Other Year2 Debt Cash flag
# <int> <chr> <chr> <int> <int> <int>
# 1 2018 x 2017 2 5 0
# 2 2018 x 2019 3 7 1
# 3 2017 y 2018 8 9 1
# 4 2017 y 2019 9 9 1
Data
df <- structure(list(Debt2017 = 2:3, Debt2018 = c(4L, 8L), Debt2019 = c(3L, 9L),
Cash2017 = c(5L, 7L), Cash2018 = c(6L, 9L), Cash2019 = c(7L, 9L),
Year = 2018:2017, Other = c("x", "y")), class = "data.frame", row.names = c(NA, -2L))

Create combinations by group and sum

I have data of names within an ID number along with a number of associated values. It looks something like this:
structure(list(id = c("a", "a", "b", "b"), name = c("bob", "jane",
"mark", "brittney"), number = c(1L, 2L, 1L, 2L), value = c(1L,
2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -4L))
# id name number value
# 1 a bob 1 1
# 2 a jane 2 2
# 3 b mark 1 1
# 4 b brittney 2 2
I would like to create all the combinations of name, regardless of how many there are, and paste them together separated with commas, and sum their number and value within each id. The desired output from the example above is then:
structure(list(id = c("a", "a", "a", "b", "b", "b"), name = c("bob",
"jane", "bob, jane", "mark", "brittney", "mark, brittney"), number = c(1L,
2L, 3L, 1L, 2L, 3L), value = c(1L, 2L, 3L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -6L))
# id name number value
# 1 a bob 1 1
# 2 a jane 2 2
# 3 a bob, jane 3 3
# 4 b mark 1 1
# 5 b brittney 2 2
# 6 b mark, brittney 3 3
Thanks all!

You could use group_modify() + add_row():
library(dplyr)
df %>%
group_by(id) %>%
group_modify( ~ .x %>%
summarise(name = toString(name), across(c(number, value), sum)) %>%
add_row(.x, .)
) %>%
ungroup()
# # A tibble: 6 × 4
# id name number value
# <chr> <chr> <int> <int>
# 1 a bob 1 1
# 2 a jane 2 2
# 3 a bob, jane 3 3
# 4 b mark 1 1
# 5 b brittney 2 2
# 6 b mark, brittney 3 3

You can create pairwise indices using combn() and expand the data frame with these using slice(). Then just group by these row pairs and summarise. I'm assuming you want pairwise combinations but this can be adapted for larger sets if needed. Some code to handle groups < 2 is included but can be removed if these don't exist in your data.
library(dplyr)
library(purrr)
df1 %>%
group_by(id) %>%
slice(c(combn(seq(n()), min(n(), 2)))) %>%
mutate(id2 = (row_number()-1) %/% 2) %>%
group_by(id, id2) %>%
summarise(name = toString(name),
across(where(is.numeric), sum), .groups = "drop") %>%
select(-id2) %>%
bind_rows(df1 %>%
group_by(id) %>%
filter(n() > 1), .) %>%
arrange(id) %>%
ungroup()
# A tibble: 6 × 4
id name number value
<chr> <chr> <int> <int>
1 a bob 1 1
2 a jane 2 2
3 a bob, jane 3 3
4 b mark 1 1
5 b brittney 2 2
6 b mark, brittney 3 3
Edit:
To adapt for all possible combinations you can iterate over the values up to the max group size. Using edited data which has a couple of rows added to the first group:
map_df(seq(max(table(df2$id))), ~
df2 %>%
group_by(id) %>%
slice(c(combn(seq(n()), .x * (.x <= n())))) %>%
mutate(id2 = (row_number() - 1) %/% .x) %>%
group_by(id, id2) %>%
summarise(name = toString(name),
across(where(is.numeric), sum), .groups = "drop")
) %>%
select(-id2) %>%
arrange(id)
# A tibble: 18 × 4
id name number value
<chr> <chr> <int> <int>
1 a bob 1 1
2 a jane 2 2
3 a sophie 1 1
4 a jeremy 2 2
5 a bob, jane 3 3
6 a bob, sophie 2 2
7 a bob, jeremy 3 3
8 a jane, sophie 3 3
9 a jane, jeremy 4 4
10 a sophie, jeremy 3 3
11 a bob, jane, sophie 4 4
12 a bob, jane, jeremy 5 5
13 a bob, sophie, jeremy 4 4
14 a jane, sophie, jeremy 5 5
15 a bob, jane, sophie, jeremy 6 6
16 b mark 3 5
17 b brittney 4 6
18 b mark, brittney 7 11
Data for df2:
df2 <- structure(list(id = c("a", "a", "a", "a", "b", "b"), name = c("bob",
"jane", "sophie", "jeremy", "mark", "brittney"), number = c(1L,
2L, 1L, 2L, 3L, 4L), value = c(1L, 2L, 1L, 2L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-6L))

A data.table option
library(data.table)
setDT(df)[
,
lapply(
.SD,
function(x) {
unlist(
lapply(
seq_along(x),
combn,
x = x,
function(v) {
if (is.character(v)) toString(v) else sum(v)
}
)
)
}
),
id
]
gives
id name number value
1: a bob 1 1
2: a jane 2 2
3: a bob, jane 3 3
4: b mark 1 1
5: b brittney 2 2
6: b mark, brittney 3 3

Using tidyverse to clean up rank-choice survey

I have survey data in R that looks like this, where I've presented people with two groups of actions - High and Low - and asked them to rank each action. Each group contains unique actions, marked by the letter (6 actions in total).
id A_High B_High C_High D_Low E_Low F_Low
001 5 2 1 6 4 3
002 6 4 3 5 2 1
003 3 1 6 2 4 5
004 6 5 2 1 3 4
I need a new df that looks like the one below, where each High action is assigned a new numeric rank (between 0 and 3) corresponding to the number of Low action items that were ranked below that High action.
For example, a person with id 001 ranked A_High at number 5, B_High at 2, and C_High at 1. A_High's new rank would be 1 (since only 1 Low action, D_Low is ranked below A_High), B_High's new rank would be 3 (since all 3 Low actions were ranked below B_High), and C_High's new rank would be 3 (since all 3 Low actions were ranked below C_High).
id A_High_rank B_High_rank C_High_rank
001 1 3 3
002 0 1 1
003 2 3 0
004 0 0 2
I have a sense that this can be done with if/else statements but suspect that there should be a far more efficient way of achieving this with tidyverse. In the real dataset, I have 1000+ rows and 12 actions (6 High and 6 Low). I would appreciate any help on this.
Thanks!
Data:
"id A_High B_High C_High D_Low E_Low F_Low
001 5 2 1 6 4 3
002 6 4 3 5 2 1
003 3 1 6 2 4 5
004 6 5 2 1 3 4"

A base R option is to loop over the 'High' columns, take the rowSums of the logical matrix obtained by checking whether each 'High' value is less than the 'Low' columns, and rename the output columns by appending _rank as a suffix
out <- cbind(df1[1], sapply(df1[2:4],
function(x) rowSums(x < df1[endsWith(names(df1), 'Low')])))
names(out)[-1] <- paste0(names(out)[-1], "_rank")
-output
out
# id A_High_rank B_High_rank C_High_rank
#1 1 1 3 3
#2 2 0 1 1
#3 3 2 3 0
#4 4 0 0 2
Or using dplyr
library(dplyr)
df1 %>%
transmute(id, across(ends_with('High'),
~ rowSums(. < select(df1, ends_with('Low'))), .names = '{.col}_rank'))
# id A_High_rank B_High_rank C_High_rank
#1 1 1 3 3
#2 2 0 1 1
#3 3 2 3 0
#4 4 0 0 2
data
df1 <- structure(list(id = 1:4, A_High = c(5L, 6L, 3L, 6L), B_High = c(2L,
4L, 1L, 5L), C_High = c(1L, 3L, 6L, 2L), D_Low = c(6L, 5L, 2L,
1L), E_Low = c(4L, 2L, 4L, 3L), F_Low = c(3L, 1L, 5L, 4L)),
class = "data.frame", row.names = c(NA,
-4L))

After much suffering, this is the tidyverse solution I came up with. This was fun!
library(tidyverse)
data %>%
pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
filter(High < Low) %>%
group_by(`High Variables`, `id`) %>%
summarise(Count = n()) %>%
pivot_wider(names_from = `High Variables`, values_from = Count) %>%
arrange(id)
Translation:
The first two lines create two pairs of columns and leave id untouched. Each pair has two columns, one with the original column names and the other with the values. Each pair of columns represents either High or Low.
Then I filtered the rows, keeping only those where Low was greater than High. Then I counted how many were left for each id and each High variable, and pivoted back to the wide format.
Now I just have to figure out how to turn those NAs into 0s.
Here's the output:
> data %>%
+ pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
+ pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
+ filter(High < Low) %>%
+ group_by(`High Variables`, `id`) %>%
+ summarise(Count = n()) %>%
+ pivot_wider(names_from = `High Variables`, values_from = Count) %>%
+ arrange(id)
`summarise()` regrouping output by 'High Variables' (override with `.groups` argument)
# A tibble: 4 x 4
id A_High B_High C_High
<int> <int> <int> <int>
1 1 1 3 3
2 2 NA 1 1
3 3 2 3 NA
4 4 NA NA 2
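One possible way to avoid the NAs in the output above, sketched under the assumption that the data look like the example in the question: instead of filtering rows out, sum the logical comparison within each id and High variable, so that a count of zero survives as 0. The column names high_var, low_var and n_low_below are made up for this sketch. (pivot_wider() also has a values_fill argument that could be used with the filter-based version.)

```r
library(dplyr)
library(tidyr)

# Example data from the question
data <- data.frame(
  id     = 1:4,
  A_High = c(5L, 6L, 3L, 6L),
  B_High = c(2L, 4L, 1L, 5L),
  C_High = c(1L, 3L, 6L, 2L),
  D_Low  = c(6L, 5L, 2L, 1L),
  E_Low  = c(4L, 2L, 4L, 3L),
  F_Low  = c(3L, 1L, 5L, 4L)
)

ranks <- data %>%
  pivot_longer(ends_with("_High"), names_to = "high_var", values_to = "high") %>%
  pivot_longer(ends_with("_Low"), names_to = "low_var", values_to = "low") %>%
  group_by(id, high_var) %>%
  # number of Low actions ranked below (numerically greater than) this High action
  summarise(n_low_below = sum(high < low), .groups = "drop") %>%
  pivot_wider(names_from = high_var, values_from = n_low_below,
              names_glue = "{high_var}_rank")
```

Because no rows are dropped before counting, every id keeps all of its High columns.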

Aggregate Two Variables by One Date

I have the following dataset I am working with:
day descent_cd
<int> <chr>
1 26 B
2 19 W
3 19 B
4 16 B
5 1 W
6 2 W
7 2 B
8 2 B
9 3 W
10 3 W
# … with 1,283 more rows
In short: the "day" variable is the day of the month. "Descent_cd" is race (black or white).
I am trying to organize it so that I get a column for "B" and a column for "W", both sorted by the total arrests made that day ... meaning: counting all the "B"s for day "1" and the same for "W", and so on through the rest of the month.
I ultimately want to do this as a geom_ridge graph.

Is this what you are looking for?
library(tidyverse)
#sample data
df <- tibble::tribble(
~day, ~descent_cd,
26L, "B",
19L, "W",
19L, "B",
16L, "B",
1L, "W",
2L, "W",
2L, "B",
2L, "B",
3L, "W",
3L, "W"
)
df %>%
group_by(day, descent_cd) %>%
summarise(total_arrest = n()) %>% #calculate number of arrests per day per descent_cd
pivot_wider(names_from = descent_cd, values_from = total_arrest) %>% #create columns W and B
mutate(W = if_else(is.na(W),as.integer(0),W), #replace NAs with 0 (meaning 0 arrests that day)
B = if_else(is.na(B),as.integer(0),B)) %>%
arrange(desc(W + B)) #arrange df in descending order of total arrests per day
# A tibble: 6 x 3
# Groups: day [6]
day W B
<int> <int> <int>
1 2 1 2
2 3 2 0
3 19 1 1
4 1 1 0
5 16 0 1
6 26 0 1
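A more compact sketch of the same aggregation, assuming only the two columns shown in the question: count() replaces the group_by() + summarise() pair, and the values_fill argument of pivot_wider() supplies the zeros directly. The object name daily is illustrative.

```r
library(dplyr)
library(tidyr)

# Sample rows from the question
df <- data.frame(
  day        = c(26L, 19L, 19L, 16L, 1L, 2L, 2L, 2L, 3L, 3L),
  descent_cd = c("B", "W", "B", "B", "W", "W", "B", "B", "W", "W")
)

daily <- df %>%
  # arrests per day and descent_cd, stored in column n
  count(day, descent_cd) %>%
  # one column per descent_cd; days with no arrests for a group get 0
  pivot_wider(names_from = descent_cd, values_from = n, values_fill = 0L) %>%
  # busiest days first
  arrange(desc(W + B))
```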

How to deduplicate based upon an interval between dates in same column

I have a table that looks something like this:
ID Date Type
1 2019/03/12 A
1 2019/03/12 A
2 2019/01/07 A
2 2019/04/20 B
3 2019/02/09 C
4 2019/01/19 A
4 2019/01/23 A
I want to deduplicate this table by ID, but only if the span between the dates listed is greater than 7 days. If it is less than 7 days, then I want to keep the earliest date.
Want:
ID Date Type
1 2019/03/12 A
2 2019/01/07 A
2 2019/04/20 B
3 2019/02/09 C
4 2019/01/19 A
I'm just struggling with where to start conceptually.

One option is to convert 'Date' to Date class (ymd from lubridate is used here), group by 'ID', and keep the first row of each group plus any row whose 'Date' is at least 7 days after the previous row's
library(dplyr)
library(lubridate)
df1 %>%
mutate(Date = ymd(Date)) %>%
group_by(ID) %>%
filter(c(TRUE, diff(Date) >= 7))
# A tibble: 5 x 3
# Groups: ID [4]
# ID Date Type
# <int> <date> <chr>
#1 1 2019-03-12 A
#2 2 2019-01-07 A
#3 2 2019-04-20 B
#4 3 2019-02-09 C
#5 4 2019-01-19 A
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 4L, 4L), Date = c("2019/03/12",
"2019/03/12", "2019/01/07", "2019/04/20", "2019/02/09", "2019/01/19",
"2019/01/23"), Type = c("A", "A", "A", "B", "C", "A", "A")),
class = "data.frame", row.names = c(NA,
-7L))
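The diff() comparison above looks at consecutive rows, so it relies on the dates being in ascending order within each ID. A hedged sketch that sorts first, in case the real table is not already ordered (base as.Date() is used here instead of lubridate, and the name deduped is illustrative):

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  ID   = c(1L, 1L, 2L, 2L, 3L, 4L, 4L),
  Date = c("2019/03/12", "2019/03/12", "2019/01/07", "2019/04/20",
           "2019/02/09", "2019/01/19", "2019/01/23"),
  Type = c("A", "A", "A", "B", "C", "A", "A")
)

deduped <- df1 %>%
  mutate(Date = as.Date(Date, format = "%Y/%m/%d")) %>%
  # make sure each ID's rows are in date order before differencing
  arrange(ID, Date) %>%
  group_by(ID) %>%
  # always keep the first row; keep later rows only if 7+ days after the previous row
  filter(c(TRUE, diff(Date) >= 7)) %>%
  ungroup()
```

Note that each row is compared with the row immediately before it, not with the last row that was kept, so a long chain of dates a few days apart would collapse to its first entry.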
