Is there a way to create a key without using rowwise()?
Any pointer is much appreciated.
df <- tibble(grp1=rev(LETTERS[1:5]),grp2=letters[11:15],grp3=LETTERS[1:5],
value=rnorm(5,10,10))
df %>% rowwise %>% mutate(key=paste(sort(c(grp1, grp3)), collapse="")) %>% ungroup()
grp1 grp2 grp3 value key
<chr> <chr> <chr> <chr> <chr>
1 E k A -3.73984194875213 AE
2 D l B 3.25846392371014 BD
3 C m C 3.62405652088127 CC
4 B n D 6.41520621902784 BD
5 A o E 20.1892413026407 AE
Update: the tibble contains multiple character vectors, but the key should be generated from columns grp1 and grp3.
Using purrr::pmap_chr:
library(tidyverse)
df %>% mutate(key=pmap_chr(.[c("grp1","grp3")],~paste(sort(c(...)), collapse="")))
# # A tibble: 5 x 5
# grp1 grp2 grp3 value key
# <chr> <chr> <chr> <chr> <chr>
# 1 E k A 22.0150932758833 AE
# 2 D l B 2.24725610156698 BD
# 3 C m C -6.2414882455089 CC
# 4 B n D 22.5699168856552 BD
# 5 A o E -6.21443670571301 AE
In base R you could do:
transform(df, key=mapply(function(...) paste(sort(c(...)), collapse=""), grp1, grp3))
Here is a vectorized option using pmin/pmax. Take the min and max of columns 'grp1' and 'grp3' for each row with pmin/pmax and concatenate them with str_c.
library(dplyr)
library(stringr)
df %>%
mutate(key = str_c(pmin(grp1, grp3), pmax(grp1, grp3)))
# A tibble: 5 x 5
# grp1 grp2 grp3 value key
# <chr> <chr> <chr> <dbl> <chr>
#1 E k A 24.7 AE
#2 D l B 5.66 BD
#3 C m C 16.3 CC
#4 B n D 5.88 BD
#5 A o E -9.22 AE
data
df <- tibble(grp1=rev(LETTERS[1:5]),grp2=letters[11:15],grp3=LETTERS[1:5],
value=rnorm(5,10,10))
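NOTE: cbind converts its inputs to a matrix, and a matrix can hold only a single class, so the numeric 'value' column becomes character. Converting the result to a tibble with as_tibble doesn't change the class back automatically. Instead, create the data with tibble/data.frame directly rather than going through cbind.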
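For comparison, the same pmin/pmax idea can be sketched without any packages. This is only an illustration, assuming two character columns and the session's default string ordering; the data frame is a small stand-in for df built with data.frame instead of tibble.

```r
# Stand-in for df from the question (base R only, no tibble needed)
df <- data.frame(
  grp1 = rev(LETTERS[1:5]),
  grp2 = letters[11:15],
  grp3 = LETTERS[1:5],
  stringsAsFactors = FALSE
)

# pmin/pmax compare the two columns element-wise, so no per-row loop is needed
df$key <- paste0(pmin(df$grp1, df$grp3), pmax(df$grp1, df$grp3))
df$key
```

This only covers the two-column case; with three or more key columns a per-row sort (as in the pmap_chr or mapply answers) is still needed.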
Another way is to use mutate, without rowwise, but with a vectorised version of your function, like this:
library(dplyr)
# create a function and vectorise it
f = function(x, y) paste(sort(c(x, y)), collapse="")
f = Vectorize(f)
# use the function
df %>% mutate(key = f(grp1, grp3))
# # A tibble: 5 x 5
# grp1 grp2 grp3 value key
# <chr> <chr> <chr> <chr> <chr>
# 1 E k A -4.41213449814982 AE
# 2 D l B 10.4314736952111 BD
# 3 C m C 5.69345098226371 CC
# 4 B n D 4.39266020802413 BD
# 5 A o E 22.0623810028979 AE
Let's say I have two data.frames:
name_df = read.table(text = "player_name
a
b
c
d
e
f
g", header = T)
game_df = read.table(text = "game_id winner_name loser_name
1 a b
2 b a
3 a c
4 a d
5 b c
6 c d
7 d e
8 e f
9 f a
10 g f
11 g a
12 f e
13 a d", header = T)
name_df contains a unique list of all the winner_name and loser_name values in game_df. I want to create a new data.frame that has, for each person in name_df, one row for every game in which that name (e.g. a) appears in either the winner_name or loser_name column.
So I essentially want to merge game_df with name_df, but the key column (name) can appear in either winner_name or loser_name.
So, for just a and b the final output would look something like:
final_df = read.table(text = "player_name game_id winner_name loser_name
a 1 a b
a 2 b a
a 3 a c
a 4 a d
a 9 f a
a 11 g a
a 13 a d
b 1 a b
b 2 b a
b 5 b c", header = T)
We can loop over the elements of 'player_name' in 'name_df' and, for each one, filter the rows of 'game_df' where either 'winner_name' or 'loser_name' matches.
library(dplyr)
library(purrr)
map_dfr(setNames(name_df$player_name, name_df$player_name),
~ game_df %>%
filter(winner_name %in% .x|loser_name %in% .x), .id = 'player_name')
Or if there are many columns, use if_any
map_dfr(setNames(name_df$player_name, name_df$player_name),
~ {
nm1 <- .x
game_df %>%
filter(if_any(c(winner_name, loser_name), ~ . %in% nm1))
}, .id = 'player_name')
Dedicated to our dear teacher and mentor @akrun
I think we can also make use of the add_row() function you first taught me the other day. Unbelievable!!!
library(dplyr)
library(purrr)
library(tibble)
game_df %>%
rowwise() %>%
mutate(player_name = winner_name) %>%
group_split(game_id) %>%
map_dfr(~ add_row(.x, game_id = .x$game_id, winner_name = .x$winner_name,
loser_name = .x$loser_name, player_name = .x$loser_name)) %>%
arrange(player_name) %>%
relocate(player_name)
# A tibble: 26 x 4
player_name game_id winner_name loser_name
<chr> <int> <chr> <chr>
1 a 1 a b
2 a 2 b a
3 a 3 a c
4 a 4 a d
5 a 9 f a
6 a 11 g a
7 a 13 a d
8 b 1 a b
9 b 2 b a
10 b 5 b c
# ... with 16 more rows
This can be directly expressed in SQL:
library(sqldf)
sqldf("select *
from name_df
left join game_df on winner_name = player_name or loser_name = player_name")
Without using purrr: I think this is an appropriate use case for tidyr::unite with the argument remove = F. We first unite the winners' and losers' names into a new column and then use tidyr::separate_rows to split that column into rows.
library(tidyr)
library(dplyr)
game_df %>% unite(Player_name, winner_name, loser_name, remove = F, sep = ', ') %>%
separate_rows(Player_name) %>%
relocate(Player_name) %>%
arrange(Player_name)
# A tibble: 26 x 4
Player_name game_id winner_name loser_name
<chr> <int> <chr> <chr>
1 a 1 a b
2 a 2 b a
3 a 3 a c
4 a 4 a d
5 a 9 f a
6 a 11 g a
7 a 13 a d
8 b 1 a b
9 b 2 b a
10 b 5 b c
# ... with 16 more rows
A Base R approach :
result <- do.call(rbind, lapply(name_df$player_name, function(x)
cbind(playername = x,
subset(game_df, winner_name == x | loser_name == x))))
rownames(result) <- NULL
result
# playername game_id winner_name loser_name
#1 a 1 a b
#2 a 2 b a
#3 a 3 a c
#4 a 4 a d
#5 a 9 f a
#6 a 11 g a
#7 a 13 a d
#8 b 1 a b
#...
#...
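Another base R sketch, offered only as an illustration: instead of looping over names, stack game_df twice, once keyed by the winner and once keyed by the loser. The small game_df below is a made-up subset, and `long` is a hypothetical name. Unlike the left join, players in name_df who never played would not appear in the result.

```r
# Made-up subset of game_df for illustration
game_df <- data.frame(
  game_id     = 1:4,
  winner_name = c("a", "b", "a", "c"),
  loser_name  = c("b", "a", "c", "d"),
  stringsAsFactors = FALSE
)

# One copy of every game keyed by its winner, one copy keyed by its loser
long <- rbind(
  cbind(player_name = game_df$winner_name, game_df),
  cbind(player_name = game_df$loser_name, game_df)
)

# Order by player, then by game, and reset the row names
long <- long[order(long$player_name, long$game_id), ]
rownames(long) <- NULL
long
```

Each game contributes exactly two rows, so the result has twice as many rows as game_df.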
I have a simple question that I cannot find a solution for.
Also, I didn't find an existing answer that I understand.
Imagine I have this data frame:
(ts <- tibble(
  a = LETTERS[1:10],
  b = c(rep(1, 5), rep(2, 5))
))
# A tibble: 10 x 2
a b
<chr> <dbl>
1 A 1
2 B 1
3 C 1
4 D 1
5 E 1
6 F 2
7 G 2
8 H 2
9 I 2
10 J 2
What I want is simple: a df in which column b indexes a sliding window of size n over column a.
The output can be something like this:
# A tibble: 8 x 2
b a
<dbl> <chr>
1 1 A B
2 1 B C
3 1 C D
4 1 D E
5 2 F G
6 2 G H
7 2 H I
8 2 I J
I don't care if column a ends up containing an array (nested values).
I just need a new data frame based on the sliding window.
Since this operation will run in a relational database, I'd like a function compatible with DBI and PostgreSQL.
Any help is appreciated.
Thanks in advance
We can group by 'b', create the new column based on the lead of 'a', remove the NA rows with na.omit
library(dplyr)
ts %>%
group_by(b) %>%
mutate(a2 = lead(a)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
# A tibble: 8 x 3
# b a a2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 B C
#3 1 C D
#4 1 D E
#5 2 F G
#6 2 G H
#7 2 H I
#8 2 I J
If lead doesn't work, then just remove the first element and append NA at the end in the mutate step
ts %>%
group_by(b) %>%
mutate(a2 = c(a[-1], NA)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
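The lead-based answers cover a window of size 2. For a general window size n, a minimal in-memory sketch in base R could look like the following. It is not translated to SQL, so it does not address the database requirement; `ts_df`, `n` and `windows` are names chosen for illustration.

```r
# Stand-in for the question's data (named ts_df to avoid masking stats::ts)
ts_df <- data.frame(
  a = LETTERS[1:10],
  b = rep(c(1, 2), each = 5),
  stringsAsFactors = FALSE
)
n <- 3  # window size, assumed for illustration

# For each group of b, paste together every run of n consecutive values of a
windows <- lapply(split(ts_df$a, ts_df$b), function(x) {
  n_windows <- max(length(x) - n + 1, 0)
  vapply(seq_len(n_windows),
         function(i) paste(x[i:(i + n - 1)], collapse = " "),
         character(1))
})

# Flatten the per-group windows back into a two-column data frame
result <- data.frame(
  b = rep(names(windows), lengths(windows)),
  a = unlist(windows, use.names = FALSE),
  stringsAsFactors = FALSE
)
result
```

Note that b comes back as character here because it is taken from the names of the split list.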
I am trying to find the best way to iterate through each column of a data frame, group by that column, and produce a summary.
Here is my attempt:
library(tidyverse)
data = data.frame(
a = sample(LETTERS[1:3], 100, replace=TRUE),
b = sample(LETTERS[1:8], 100, replace=TRUE),
c = sample(LETTERS[3:15], 100, replace=TRUE),
d = sample(LETTERS[16:26], 100, replace=TRUE),
value = rnorm(100)
)
myfunction <- function(x) {
groupVars <- select_if(x, is.factor) %>% colnames()
results <- list()
for(i in 1:length(groupVars)) {
results[[i]] <- x %>%
group_by_at(.vars = vars(groupVars[i])) %>%
summarise(
n = n()
)
}
return(results)
}
test <- myfunction(data)
The function returns:
[[1]]
# A tibble: 3 x 2
a n
<fct> <int>
1 A 37
2 B 34
3 C 29
...
...
...
My question is, is this the best way to do this? Is there a way to avoid using a for loop? Can I use purrr and map somehow to do this?
Thank you
An option is to use map
library(tidyverse)
map(data[1:4], ~data.frame(x = {{.x}}) %>% count(x))
#$a
## A tibble: 3 x 2
# x n
# <fct> <int>
#1 A 39
#2 B 32
#3 C 29
#
#$b
## A tibble: 8 x 2
# x n
# <fct> <int>
#1 A 14
#2 B 11
#3 C 16
#4 D 10
#5 E 12
#6 F 10
#7 G 13
#8 H 14
#...
The output is a list. Note that I have ignored the last column of data, as it doesn't seem to be relevant here.
If you want columns in the list data.frames to be named according to the columns from your original data, we can use imap
imap(data[1:4], ~tibble(!!.y := {{.x}}) %>% count(!!sym(.y)))
#$a
## A tibble: 3 x 2
# a n
# <fct> <int>
#1 A 23
#2 B 35
#3 C 42
#
#$b
## A tibble: 8 x 2
# b n
# <fct> <int>
#1 A 15
#2 B 10
#3 C 13
#4 D 5
#5 E 19
#6 F 9
#7 G 13
#8 H 16
#...
Or making use of tibble::enframe (thanks @camille)
imap(data[1:4], ~enframe(.x, value = .y) %>% count(!!sym(.y)))
You could reshape the data and group by both the column and the letter. This gives you one dataframe instead of a list of them, but you could get the list if you really want it with split.
set.seed(123)
library(tidyverse)
data = data.frame(
a = sample(LETTERS[1:3], 100, replace=TRUE),
b = sample(LETTERS[1:8], 100, replace=TRUE),
c = sample(LETTERS[3:15], 100, replace=TRUE),
d = sample(LETTERS[16:26], 100, replace=TRUE),
value = rnorm(100)
)
data %>%
pivot_longer(cols = -value, names_to = "column", values_to = "letter") %>%
group_by(column, letter) %>%
summarise(n = n())
#> # A tibble: 35 x 3
#> # Groups: column [4]
#> column letter n
#> <chr> <fct> <int>
#> 1 a A 33
#> 2 a B 32
#> 3 a C 35
#> 4 b A 8
#> 5 b B 11
#> 6 b C 12
#> 7 b D 14
#> 8 b E 8
#> 9 b F 17
#> 10 b G 16
#> # … with 25 more rows
Created on 2019-10-30 by the reprex package (v0.3.0)
You can simply call:
apply(data, 2,table)
You can drop the last list element if you want.
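A variation on the table idea, sketched under the assumption that the grouping columns are exactly the non-numeric ones: select those columns first, then build one count table per column. It selects by "not numeric" rather than is.factor because data.frame() keeps strings as character by default since R 4.0. The names `group_cols`, `counts` and `level` are illustrative.

```r
set.seed(42)
data <- data.frame(
  a = sample(LETTERS[1:3], 50, replace = TRUE),
  b = sample(LETTERS[1:8], 50, replace = TRUE),
  value = rnorm(50)
)

# Pick the grouping columns: everything that is not numeric
group_cols <- names(data)[!vapply(data, is.numeric, logical(1))]

# One count table per grouping column, returned as a named list of data frames
counts <- lapply(data[group_cols], function(col) {
  as.data.frame(table(level = col), responseName = "n")
})
counts$a
```

Because the numeric column is excluded up front, there is no extra list element to drop afterwards.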
I want to take multiple lagged values of multiple columns in R.
How do I use mutate_at to get the same results as below? Let's say the real example has 30 columns, so it doesn't make sense to write out the lag formula 30 times for each time period.
df <- data_frame(time_col = 1:26, col_1 = letters, col_2 = rev(letters))
df %>% mutate(col_1_lag_1 = lag(col_1, n = 1, order_by = time_col),
col_2_lag_1 = lag(col_2, n = 1, order_by = time_col),
col_1_lag_2 = lag(col_1, n = 2, order_by = time_col),
col_2_lag_2 = lag(col_2, n = 2, order_by = time_col))
I think it should be something like this, but I don't know how to specify both sets of parameters:
df <- data_frame(time_col = 1:26, col_1 = letters, col_2 = rev(letters))
df %>% mutate_at(vars(col_1, col_2), funs(lag, lag), n = 1, n = 2, order_by = time_col)
A solution with help from purrr.
library(dplyr)
library(purrr)
df <- data_frame(time_col = 1:26, col_1 = letters, col_2 = rev(letters))
map_dfc(1:2, function(x){
df2 <- df %>% transmute_at(vars(starts_with("col")),
funs(lag(., n = x, order_by = time_col)))
return(df2)
}) %>%
bind_cols(df, .) %>%
set_names(c(names(df), paste0("col_", 1:2, "_lag_", rep(1:2, each = 2))))
# # A tibble: 26 x 7
# time_col col_1 col_2 col_1_lag_1 col_2_lag_1 col_1_lag_2 col_2_lag_2
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 a z NA NA NA NA
# 2 2 b y a z NA NA
# 3 3 c x b y a z
# 4 4 d w c x b y
# 5 5 e v d w c x
# 6 6 f u e v d w
# 7 7 g t f u e v
# 8 8 h s g t f u
# 9 9 i r h s g t
# 10 10 j q i r h s
# # ... with 16 more rows
Here is an alternative purrr solution using a nested map_dfc and quasiquotation syntax
bind_cols(
df,
map_dfc(c("col_1", "col_2"), function(i) map_dfc(c(1, 2), function(n)
df %>%
transmute(!!paste0(i, "_lag_", n, collapse = "") := lag(!!rlang::sym(i), n = n, order_by = time_col)))))
## A tibble: 26 x 7
# time_col col_1 col_2 col_1_lag_1 col_1_lag_2 col_2_lag_1 col_2_lag_2
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 a z NA NA NA NA
# 2 2 b y a NA z NA
# 3 3 c x b a y z
# 4 4 d w c b x y
# 5 5 e v d c w x
# 6 6 f u e d v w
# 7 7 g t f e u v
# 8 8 h s g f t u
# 9 9 i r h g s t
#10 10 j q i h r s
## ... with 16 more rows
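For reference, the same column-by-lag expansion can be sketched in base R with two nested loops. This assumes the rows are already sorted by time_col; `lag_vec` is a hypothetical helper written for this example, not a dplyr function. The ordering argument of dplyr::lag is named order_by, which this sketch sidesteps by relying on the existing row order.

```r
# Shortened stand-in for df (5 rows instead of 26)
df <- data.frame(
  time_col = 1:5,
  col_1 = letters[1:5],
  col_2 = rev(letters[1:5]),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift x down by n positions, padding the start with NA
lag_vec <- function(x, n) c(rep(NA, n), x)[seq_along(x)]

# Add one new column per (column, lag) combination
for (col in c("col_1", "col_2")) {
  for (n in 1:2) {
    df[[paste0(col, "_lag_", n)]] <- lag_vec(df[[col]], n)
  }
}
df
```

With 30 columns, only the vector of column names passed to the outer loop needs to change.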
I have data in a data frame where one column is a list. This is an example:
rand_lets <- function(){
sample(letters[1:26], runif(sample(1:10, 1), min=5, max=12))
}
example_data <- data.frame(ID = seq(1:5),
location = LETTERS[1:5],
observations = I(list(rand_lets(),
rand_lets(),
rand_lets(),
rand_lets(),
rand_lets())))
I am looking for an elegant tidyverse approach to unlist the list column so that each element in the list is separated into a new column. For example the first row would look like this:
ID location observations observations.1 observations.2 observations.3 observations.4 observations.5 observations.6 observations.7 observations.8 observations.9
1 A "y" "b" "m" "u" "x" "j" "t" "i" "v" "w"
Of course the list entries may have different lengths, so empty cells should be NA.
How could this be done?
If you want to keep your data in "long" format, you can do:
example_data %>% unnest(observations)
ID location observations
1 1 A e
2 1 A x
3 1 A w
...
44 5 E u
45 5 E o
46 5 E z
To spread the data to "wide" format, as in your example, you can do:
library(stringr)
example_data %>% unnest(observations) %>%
group_by(location) %>%
mutate(counter=paste0("Obs_", str_pad(1:n(),2,"left","0"))) %>%
spread(counter, observations)
ID location Obs_01 Obs_02 Obs_03 Obs_04 Obs_05 Obs_06 Obs_07 Obs_08 Obs_09 Obs_10 Obs_11
* <int> <fctr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A e x w c s j k t z <NA> <NA>
2 2 B k u d h z x <NA> <NA> <NA> <NA> <NA>
3 3 C v z m o s f n c r u b
4 4 D z i m s a v n r e t x
5 5 E f b g h a d u o z <NA> <NA>
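If tidyr is not available, the widening step can also be sketched in base R by padding every list entry with NA up to the longest length. This is an illustration only; the three-row example_data and the names `max_len`, `padded` and `wide` are made up for the example.

```r
# Small stand-in for example_data with a list column of unequal lengths
example_data <- data.frame(ID = 1:3, location = LETTERS[1:3],
                           stringsAsFactors = FALSE)
example_data$observations <- list(c("y", "b", "m"), "k", c("v", "z"))

# Indexing past the end of a vector yields NA, which pads the short entries
max_len <- max(lengths(example_data$observations))
padded <- do.call(rbind, lapply(example_data$observations,
                                function(x) x[seq_len(max_len)]))
colnames(padded) <- paste0("obs_", seq_len(max_len))

# Bind the padded matrix back onto the identifying columns
wide <- cbind(example_data[c("ID", "location")],
              as.data.frame(padded, stringsAsFactors = FALSE))
wide
```

The result has one row per ID and one obs_ column per position, with NA where a row had fewer observations.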