R slide window through tibble

I have a simple question that I cannot find a solution for.
Also, I didn't find an answer that I understand.
Imagine I have this data frame:
(ts <- tibble(
a = LETTERS[1:10],
b = c(rep(1, 5), rep(2,5))
))
# A tibble: 10 x 2
a b
<chr> <dbl>
1 A 1
2 B 1
3 C 1
4 D 1
5 E 1
6 F 2
7 G 2
8 H 2
9 I 2
10 J 2
What I want is simple. I want to build a df in which column b indexes a sliding window of size n over column a.
The output can be something like this:
# A tibble: 8 x 2
b a
<dbl> <chr>
1 1 A B
2 1 B C
3 1 C D
4 1 D E
5 2 F G
6 2 G H
7 2 H I
8 2 I J
I don't care if column a contains an array (nested values).
I just need a new data frame based on the sliding window.
Since this operation will run in a relational database, I'd like a function compatible with DBI/PostgreSQL.
Any help is appreciated.
Thanks in advance

We can group by 'b', create the new column based on the lead of 'a', remove the NA rows with na.omit
library(dplyr)
ts %>%
group_by(b) %>%
mutate(a2 = lead(a)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
# A tibble: 8 x 3
# b a a2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 B C
#3 1 C D
#4 1 D E
#5 2 F G
#6 2 G H
#7 2 H I
#8 2 I J
If lead doesn't work, then just remove the first element and append NA at the end in the mutate step
ts %>%
group_by(b) %>%
mutate(a2 = c(a[-1], NA)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
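For a window wider than two, one option is to build the windows explicitly. The following is a minimal base R sketch rather than a DBI-compatible solution. The helper name slide_by_group is hypothetical, and it assumes the rows are already in the desired order within each value of b.

```r
# Hypothetical helper: every length-n window of column `a` within each value of `b`.
# Assumes `df` has columns `a` and `b` and is already ordered as desired.
slide_by_group <- function(df, n = 2) {
  keys <- unique(df$b)
  pieces <- lapply(keys, function(k) {
    v <- df$a[df$b == k]
    if (length(v) < n) return(character(0))   # group too short for one window
    starts <- seq_len(length(v) - n + 1)
    vapply(starts,
           function(i) paste(v[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  data.frame(b = rep(keys, lengths(pieces)),
             a = unlist(pieces),
             stringsAsFactors = FALSE)
}

ts <- data.frame(a = LETTERS[1:10],
                 b = rep(c(1, 2), each = 5),
                 stringsAsFactors = FALSE)

res2 <- slide_by_group(ts, n = 2)  # pairs, as in the question
res3 <- slide_by_group(ts, n = 3)  # triples
```

With n = 2 this gives the eight pairs shown in the question, with each window pasted into a single string.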

Related

Extract rows where value appears in any of multiple columns

Let's say I have two data.frames
name_df = read.table(text = "player_name
a
b
c
d
e
f
g", header = T)
game_df = read.table(text = "game_id winner_name loser_name
1 a b
2 b a
3 a c
4 a d
5 b c
6 c d
7 d e
8 e f
9 f a
10 g f
11 g a
12 f e
13 a d", header = T)
name_df contains a unique list of all the winner_name or loser_name values in game_df. I want to create a new data.frame that has, for each person in name_df, one row for every game in which that name (e.g. a) appears in either the winner_name or loser_name column.
So I essentially want to merge game_df with name_df, but the key column (name) can appear in either winner_name or loser_name.
So, for just a and b the final output would look something like:
final_df = read.table(text = "player_name game_id winner_name loser_name
a 1 a b
a 2 b a
a 3 a c
a 4 a d
a 9 f a
a 11 g a
a 13 a d
b 1 a b
b 2 b a
b 5 b c", header = T)
We can loop over the 'player_name' values in 'name_df' and filter the rows of 'game_df' where either 'winner_name' or 'loser_name' matches
library(dplyr)
library(purrr)
map_dfr(setNames(name_df$player_name, name_df$player_name),
~ game_df %>%
filter(winner_name %in% .x|loser_name %in% .x), .id = 'player_name')
Or if there are many columns, use if_any
map_dfr(setNames(name_df$player_name, name_df$player_name),
~ {
nm1 <- .x
game_df %>%
filter(if_any(c(winner_name, loser_name), ~ . %in% nm1))
}, .id = 'player_name')
Dedicated to our dear teacher and mentor @akrun
I think we can also make use of the add_row() function you first taught me the other day. Unbelievable!!!
library(dplyr)
library(purrr)
library(tibble)
game_df %>%
rowwise() %>%
mutate(player_name = winner_name) %>%
group_split(game_id) %>%
map_dfr(~ add_row(.x, game_id = .x$game_id, winner_name = .x$winner_name,
loser_name = .x$loser_name, player_name = .x$loser_name)) %>%
arrange(player_name) %>%
relocate(player_name)
# A tibble: 26 x 4
player_name game_id winner_name loser_name
<chr> <int> <chr> <chr>
1 a 1 a b
2 a 2 b a
3 a 3 a c
4 a 4 a d
5 a 9 f a
6 a 11 g a
7 a 13 a d
8 b 1 a b
9 b 2 b a
10 b 5 b c
# ... with 16 more rows
This can be directly expressed in SQL:
library(sqldf)
sqldf("select *
from name_df
left join game_df on winner_name = player_name or loser_name = player_name")
Without using purrr: I think this is an appropriate use case for tidyr::unite with the argument remove = F, where we first unite the winners' and losers' names and then use tidyr::separate_rows to split the new column into rows.
library(tidyr)
library(dplyr)
game_df %>% unite(Player_name, winner_name, loser_name, remove = F, sep = ', ') %>%
separate_rows(Player_name) %>%
relocate(Player_name) %>%
arrange(Player_name)
# A tibble: 26 x 4
Player_name game_id winner_name loser_name
<chr> <int> <chr> <chr>
1 a 1 a b
2 a 2 b a
3 a 3 a c
4 a 4 a d
5 a 9 f a
6 a 11 g a
7 a 13 a d
8 b 1 a b
9 b 2 b a
10 b 5 b c
# ... with 16 more rows
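The same reshaping idea can also be sketched in base R by stacking two copies of game_df, one keyed by the winner and one keyed by the loser. This is only an illustration. It assumes no game has the same name as both winner and loser, which would otherwise produce a duplicated row.

```r
game_df <- read.table(text = "game_id winner_name loser_name
1 a b
2 b a
3 a c
4 a d
5 b c
6 c d
7 d e
8 e f
9 f a
10 g f
11 g a
12 f e
13 a d", header = TRUE, stringsAsFactors = FALSE)

# One copy of every game per participant: first keyed by winner, then by loser
long <- rbind(
  data.frame(player_name = game_df$winner_name, game_df, stringsAsFactors = FALSE),
  data.frame(player_name = game_df$loser_name, game_df, stringsAsFactors = FALSE)
)

# Sort by player, then by game, and reset the row names
result <- long[order(long$player_name, long$game_id), ]
rownames(result) <- NULL
```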
A Base R approach :
result <- do.call(rbind, lapply(name_df$player_name, function(x)
cbind(playername = x,
subset(game_df, winner_name == x | loser_name == x))))
rownames(result) <- NULL
result
# playername game_id winner_name loser_name
#1 a 1 a b
#2 a 2 b a
#3 a 3 a c
#4 a 4 a d
#5 a 9 f a
#6 a 11 g a
#7 a 13 a d
#8 b 1 a b
#...
#...

Mutate new column with unique values for each list

I have a list of data frames, and I wish to add a new column whose values are unique across all elements of the list. For example, I want to add a column named id that counts up from 1.
Naturally, on a dataframe I would do this:
dat %>% mutate(id = row_number())
For a list, I would do this:
dat <- list(data.frame(x=c("a", "b" ,"c", "d", "e" ,"f" ,"g") ), data.frame(y=c("p", "lk", "n", "m", "g", "f", "t")))
dat %>% map(~ mutate(., id = row_number()))
And I would get an output like so:
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 1
2 lk 2
3 n 3
4 m 4
5 g 5
6 f 6
7 t 7
However, how would I add a new column id such that the row number continues from the first list element?
Expected output:
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 8
2 lk 9
3 n 10
4 m 11
5 g 12
6 f 13
7 t 14
An option is to bind them into a single dataset, create the 'id' with row_number(), split by 'grp', loop over the list and remove any columns that have all NA values
library(dplyr)
library(purrr)
dat %>%
bind_rows(.id = 'grp') %>%
mutate(id = row_number()) %>%
group_split(grp) %>%
map(~ .x %>%
select(where(~ any(!is.na(.))), -grp))
-output
#[[1]]
# A tibble: 7 x 2
# x id
# <chr> <int>
#1 a 1
#2 b 2
#3 c 3
#4 d 4
#5 e 5
#6 f 6
#7 g 7
#[[2]]
# A tibble: 7 x 2
# y id
# <chr> <int>
#1 p 8
#2 lk 9
#3 n 10
#4 m 11
#5 g 12
#6 f 13
#7 t 14
Or an easier approach is to unlist (assuming single column), get the sequence, add a new column with map2
map2(dat, relist(seq_along(unlist(dat)), skeleton = dat),
~ .x %>% mutate(id = .y))
Or using a for loop
dat[[1]]$id <- seq_len(nrow(dat[[1]]))
for(i in seq_along(dat)[-1]) dat[[i]]$id <-
seq(tail(dat[[i-1]]$id, 1) + 1, length.out = nrow(dat[[i]]), by = 1)
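Another way to express the running offset, sketched here in base R under the assumption that each list element is a data frame, is to compute how many rows precede each element with cumsum and add that offset to a local row number. The names n_rows, offsets and dat_id are only illustrative.

```r
dat <- list(
  data.frame(x = c("a", "b", "c", "d", "e", "f", "g"), stringsAsFactors = FALSE),
  data.frame(y = c("p", "lk", "n", "m", "g", "f", "t"), stringsAsFactors = FALSE)
)

# Number of rows in each element, and the number of rows that come before it
n_rows  <- vapply(dat, nrow, integer(1))
offsets <- cumsum(c(0L, head(n_rows, -1)))

# Add the offset to the within-element row number
dat_id <- Map(function(d, off) {
  d$id <- off + seq_len(nrow(d))
  d
}, dat, offsets)
```

This avoids binding the data frames together, so it also works when the elements have several columns of different types.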

dplyr mutate: create column using first occurrence of another column

I was wondering if there's a more elegant way of taking a dataframe, grouping by x to see how many x's occur in the dataset, then mutating to find the first occurrence of every x (y)
test <- data.frame(x = c("a", "b", "c", "d",
"c", "b", "e", "f", "g"),
y = c(1,1,1,1,2,2,2,2,2))
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 c 2
6 b 2
7 e 2
8 f 2
9 g 2
Current Output
output <- test %>%
group_by(x) %>%
summarise(count = n())
x count
<fct> <int>
1 a 1
2 b 2
3 c 2
4 d 1
5 e 1
6 f 1
7 g 1
Desired Output
x count first_seen
<fct> <int> <dbl>
1 a 1 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
6 f 1 2
7 g 1 2
I can filter the test dataframe for the first occurrences then use a left_join but was hoping there's a more elegant solution using mutate?
# filter for first occurrences of y
right <- test %>%
group_by(x) %>%
filter(y == min(y)) %>%
slice(1) %>%
ungroup()
# bind to the output dataframe
left_join(output, right, by = "x")
We can use first after grouping by 'x' to create a new column, use that also in group_by and get the count with n()
library(dplyr)
test %>%
group_by(x) %>%
group_by(first_seen = first(y), .add = TRUE) %>% # `.add` replaces the deprecated `add` in dplyr >= 1.0.0
summarise(count = n())
# A tibble: 7 x 3
# Groups: x [7]
# x first_seen count
# <fct> <dbl> <int>
#1 a 1 1
#2 b 1 2
#3 c 1 2
#4 d 1 1
#5 e 2 1
#6 f 2 1
#7 g 2 1
I have a question: why not keep it simple? For example:
test %>%
group_by(x) %>%
summarise(
count = n(),
first_seen = first(y)
)
#> # A tibble: 7 x 3
#> x count first_seen
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 b 2 1
#> 3 c 2 1
#> 4 d 1 1
#> 5 e 1 2
#> 6 f 1 2
#> 7 g 1 2
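Note that first(y) depends on the row order, whereas the asker's own workaround filters on min(y). If the rows might not be sorted by y, taking the minimum is the safer reading of "first seen". A minimal base R sketch of that variant, with res as an illustrative name:

```r
test <- data.frame(x = c("a", "b", "c", "d", "c", "b", "e", "f", "g"),
                   y = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                   stringsAsFactors = FALSE)

# One row per distinct x, in sorted order
res <- data.frame(x = sort(unique(test$x)), stringsAsFactors = FALSE)

# Count of occurrences and smallest y per x, looked up by name
res$count      <- as.vector(table(test$x)[res$x])
res$first_seen <- as.vector(tapply(test$y, test$x, min)[res$x])
```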

Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a dataframe using dplyr select, similar to dropping a single column?
df <- tibble(
a = LETTERS[1:5],
b = 1:5,
c = list('bob', 'cratchit', 'rules!','and', 'tiny tim too"')
)
df %>%
select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a doable workaround, but I wanted to know if it can be done with select_if.
df %>%
select(-which(map(df,class) == 'list'))
Use Negate
df %>%
select_if(Negate(is.list))
# A tibble: 5 x 2
a b
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
There is also purrr::negate that would give the same result.
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
# a b
# <chr> <int>
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
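In dplyr 1.0.0 and later, select_if() is superseded, and the same selection can be written as select(df, where(~ !is.list(.x))). For comparison, here is a minimal base R sketch that builds the logical mask explicitly. It uses a plain data.frame with a list column instead of a tibble, and is_list_col is an illustrative name.

```r
df <- data.frame(a = LETTERS[1:5], b = 1:5, stringsAsFactors = FALSE)
df$c <- list("bob", "cratchit", "rules!", "and", "tiny tim too")  # list column

# TRUE for each column that is a list
is_list_col <- vapply(df, is.list, logical(1))

# Keep only the non-list columns
res <- df[!is_list_col]
```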

Finding Maximumth value of currently mutating variable in dplyr

While trying to work out the question "Identify duplicates of one value with different values in another column", I felt I was close to a solution but could not finish it: when I use max(ID) in the code below, dplyr's mutate refers to the maximum of the pre-mutation values, not the post-mutation ones (as a recursive update would).
The objective is to assign a new unique ID value to the rows where the current Address does not match the previous Address of the same ID value.
The code I tried:
df <- read.table(text = 'ID Address
1 X
1 X
1 Y
2 Z
2 Z
3 A
3 B
4 C
4 D
4 E
5 F
5 F
5 F
', header= T, stringsAsFactors = F)
df %>% group_by(ID) %>% mutate(flag = ifelse(lag(Address)==Address,F,T)) %>%
mutate(flag = ifelse(is.na(flag),F,flag)) %>% ungroup() %>%
mutate(newID = ifelse(flag | is.na(flag), max(ID)+1,ID))%>%
select(ID = newID,Address)
Received Output:
# A tibble: 13 x 2
ID Address
<dbl> <chr>
1 1 X
2 1 X
3 6 Y
4 2 Z
5 2 Z
6 3 A
7 6 B
8 4 C
9 6 D
10 6 E
11 5 F
12 5 F
13 5 F
Expected Output:
ID Address
1 X
1 X
6 Y
2 Z
2 Z
3 A
7 B
4 C
8 D
9 E
5 F
5 F
5 F
Any help would be appreciated!
Edit:
Ideal code: here I would have liked to refer to newID, the variable currently being mutated.
df %>% group_by(ID) %>% mutate(flag = ifelse(lag(Address)==Address,F,T)) %>%
mutate(flag = ifelse(is.na(flag),F,flag)) %>% ungroup() %>%
mutate(newID = ifelse(flag | is.na(flag), max(newID)+1,ID))%>%
select(ID = newID,Address)
One problem is that max(ID) + 1 gives a constant value. The second problem is ifelse itself, which picks its 'yes' and 'no' values position by position and so cannot hand out a running sequence to the flagged rows only. In the solution below, we replace max(ID) + 1 with max(ID) + seq_len(sum(flag)) and use replace instead of ifelse
df %>%
group_by(ID) %>%
mutate(flag = lag(Address, default = Address[1])!= Address) %>%
ungroup() %>%
mutate(newID = replace(ID, flag, max(ID) + seq_len(sum(flag))))%>%
select(ID = newID,Address)
# A tibble: 13 x 2
# ID Address
# <dbl> <chr>
# 1 1 X
# 2 1 X
# 3 6 Y
# 4 2 Z
# 5 2 Z
# 6 3 A
# 7 7 B
# 8 4 C
# 9 8 D
#10 9 E
#11 5 F
#12 5 F
#13 5 F
In addition, the two ifelse statements that create the 'flag' can be replaced by a single statement, as done above with lag(Address, default = Address[1]) != Address.
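To see the replace step in isolation, here is a minimal base R sketch of the same logic. It assumes the rows of each ID are contiguous, as in the example data. The names same_id, diff_addr and new_id are only illustrative.

```r
df <- read.table(text = "ID Address
1 X
1 X
1 Y
2 Z
2 Z
3 A
3 B
4 C
4 D
4 E
5 F
5 F
5 F", header = TRUE, stringsAsFactors = FALSE)

n <- nrow(df)

# Compare each row with the one before it; the first row has no predecessor
same_id   <- c(FALSE, df$ID[-1] == df$ID[-n])
diff_addr <- c(FALSE, df$Address[-1] != df$Address[-n])
flag      <- same_id & diff_addr

# Each flagged row gets its own new ID, continuing after the current maximum
new_id <- replace(df$ID, flag, max(df$ID) + seq_len(sum(flag)))
```

Because replace fills the flagged positions in order, the k-th flagged row receives max(ID) + k, which is what the constant max(ID) + 1 inside ifelse could not provide.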
