Take an example dataframe like so (the real dataframe has more columns):
df <- data.frame(A = seq(1, 3, 1),
B = seq(4, 6, 1))
I can use pivot_longer to collect my columns of interest (A and B) like so:
library(dplyr)
library(tidyr)
df <- df %>%
pivot_longer(cols = c("A", "B"), names_to = "Letter", values_to = "Number")
df
Letter Number
<chr> <dbl>
1 A 1
2 B 4
3 A 2
4 B 5
5 A 3
6 B 6
Now let's say I have another column C in my dataframe, making it no longer tidy:
C <- seq(7, 12, 1)
df_2 <- data.frame(df, C)
df_2
Letter Number C
1 A 1 7
2 B 4 8
3 A 2 9
4 B 5 10
5 A 3 11
6 B 6 12
I want to use pivot_longer again to make df_2 tidy and get this output:
data.frame(Letter = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Number = seq(1, 12, 1))
Letter Number
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
Using the same strategy creates an error though:
df_2 %>%
pivot_longer(cols = "C", names_to = "Letter", values_to = "Number")
Error: Failed to create output due to bad names.
* Choose another strategy with `names_repair`
Setting names_repair to minimal runs but doesn't produce the output I want.
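The error occurs because df_2 already has columns named Letter and Number, so pivoting C into columns with those same names creates duplicates. Pivot C on its own and then bind the two results: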
library(tidyverse)
df <- data.frame(A = seq(1, 3, 1),
B = seq(4, 6, 1))
df <- df %>%
pivot_longer(cols = c("A", "B"), names_to = "Letter", values_to = "Number")
C <- seq(7, 12, 1)
df_2 <- data.frame(C)
df_2 <- df_2 %>% pivot_longer(cols = C, names_to = "Letter", values_to = "Number")
df_result <- rbind(df, df_2)
Output
> df_result
# A tibble: 12 x 2
Letter Number
<chr> <dbl>
1 A 1
2 B 4
3 A 2
4 B 5
5 A 3
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
Maybe try this if it is helpful:
library(tidyverse)
#Code
df_2 %>% pivot_longer(everything()) %>%
arrange(name) %>% group_by(name) %>%
filter(!duplicated(value))
Output:
# A tibble: 12 x 2
# Groups: name [3]
name value
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
We could do this easily with stack
library(dplyr)
# setNames() is base R (stats); set_names() is not exported by dplyr
stack(df_2)[2:1] %>%
distinct() %>%
setNames(c("Letter", "Number"))
-output
# Letter Number
#1 A 1
#2 A 2
#3 A 3
#4 B 4
#5 B 5
#6 B 6
#7 C 7
#8 C 8
#9 C 9
#10 C 10
#11 C 11
#12 C 12
Or an option with unnest/enframe
library(tidyr)
library(tibble)
unclass(df_2) %>%
enframe(name = "Letter", value = "Number") %>%
unnest(c(Number)) %>%
distinct
Or using melt
library(reshape2)
melt(df_2) %>%
distinct()
Or in a single line in base R
unique(stack(df_2)[2:1])
I have a list of data frames, and I want to add a new column to each one that holds a unique row number. For example, I want to add a column named id that counts up from 1.
Naturally, on a dataframe I would do this:
dat %>% mutate(id = row_number())
For a list, I would do this:
dat %>% map(~ mutate(., id = row_number()))
And I would get output like so:
dat <- list(data.frame(x=c("a", "b" ,"c", "d", "e" ,"f" ,"g") ), data.frame(y=c("p", "lk", "n", "m", "g", "f", "t")))
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 1
2 lk 2
3 n 3
4 m 4
5 g 5
6 f 6
7 t 7
But how would I add a new column id such that the row numbering continues from the first list element into the next?
Expected output:
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 8
2 lk 9
3 n 10
4 m 11
5 g 12
6 f 13
7 t 14
An option is to bind them into a single dataset, create the 'id' with row_number(), split by 'grp', loop over the list and remove any columns that have all NA values
library(dplyr)
library(purrr)
dat %>%
bind_rows(.id = 'grp') %>%
mutate(id = row_number()) %>%
group_split(grp) %>%
map(~ .x %>%
select(where(~ any(!is.na(.))), -grp))
-output
#[[1]]
# A tibble: 7 x 2
# x id
# <chr> <int>
#1 a 1
#2 b 2
#3 c 3
#4 d 4
#5 e 5
#6 f 6
#7 g 7
#[[2]]
# A tibble: 7 x 2
# y id
# <chr> <int>
#1 p 8
#2 lk 9
#3 n 10
#4 m 11
#5 g 12
#6 f 13
#7 t 14
Or an easier approach is to unlist (assuming single column), get the sequence, add a new column with map2
map2(dat, relist(seq_along(unlist(dat)), skeleton = dat),
~ .x %>% mutate(id = .y))
Or using a for loop
dat[[1]]$id <- seq_len(nrow(dat[[1]]))
for(i in seq_along(dat)[-1]) dat[[i]]$id <-
seq(tail(dat[[i-1]]$id, 1) + 1, length.out = nrow(dat[[i]]), by = 1)
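The same running numbering can also be sketched in base R without a loop, by computing how many rows precede each list element and adding that offset to a local row index. This is only an illustration under the assumption that every element of dat is a data frame; add_running_id is a hypothetical helper name, not part of any package.

```r
# Example list taken from the question
dat <- list(
  data.frame(x = c("a", "b", "c", "d", "e", "f", "g")),
  data.frame(y = c("p", "lk", "n", "m", "g", "f", "t"))
)

# Number of rows in each list element
n_rows <- vapply(dat, nrow, integer(1))

# Number of rows that come before each element (0 for the first one)
offsets <- cumsum(c(0L, head(n_rows, -1)))

# Hypothetical helper: local row index shifted by the rows seen so far
add_running_id <- function(d, offset) {
  d$id <- offset + seq_len(nrow(d))
  d
}

result <- Map(add_running_id, dat, offsets)
result
```

Because the offsets come from nrow() rather than from the previous element's id column, this should also work when the elements have different numbers of rows.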
Let's say I have the following data frame:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
# A tibble: 2 x 3
user first last
<chr> <dbl> <dbl>
1 A 1 6
2 B 4 9
And want to create a tibble that looks like:
bind_rows(tibble(user = 'A', weeks = 1:6),
tibble(user = 'B', weeks = 4:9))
# A tibble: 12 x 2
user weeks
<chr> <int>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9
How could I go about doing this? I have tried:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9)) %>%
group_by(user) %>%
mutate(weeks = first:last)
I wonder if I should try a combination of complete, map, or nest?
One option is unnest after creating a sequence
library(dplyr)
library(purrr)
library(tidyr) # unnest() comes from tidyr
df1 %>%
transmute(user, weeks = map2(first, last, `:`)) %>%
unnest(weeks)
# A tibble: 12 x 2
# user weeks
# <chr> <int>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
# 7 B 4
# 8 B 5
# 9 B 6
#10 B 7
#11 B 8
#12 B 9
Or another option is rowwise
df1 %>%
rowwise %>%
transmute(user, weeks = list(first:last)) %>%
unnest(weeks)
Or without any packages
stack(setNames(Map(`:`, df1$first, df1$last), df1$user))
Or otherwise written as
stack(setNames(do.call(Map, c(f = `:`, df1[-1])), df1$user))
data
df1 <- tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
One option involving dplyr and tidyr could be:
df %>%
uncount(last - first + 1) %>%
group_by(user) %>%
transmute(weeks = first + 1:n() - 1)
user weeks
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9
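The "repeat each row, then offset from first" idea can also be sketched in base R. This assumes R 4.0.0 or later, where sequence() accepts a from argument; the data frame mirrors the one in the question, and out is just an illustrative name.

```r
# Data from the question, as a plain data frame
df1 <- data.frame(
  user  = c("A", "B"),
  first = c(1, 4),
  last  = c(6, 9),
  stringsAsFactors = FALSE
)

# Number of weeks to generate for each user
n <- df1$last - df1$first + 1

out <- data.frame(
  user  = rep(df1$user, times = n),       # repeat each user n times
  weeks = sequence(n, from = df1$first),  # one run per user, starting at 'first'
  stringsAsFactors = FALSE
)
out
```

On older R versions, unlist(Map(seq, df1$first, df1$last)) should produce the same weeks vector.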
I have data:
rowID incidentID participant.type
1 1 A
2 1 B
3 2 A
4 3 A
5 3 B
6 3 C
7 4 B
8 4 C
And I would like to end up with:
rowID incident participant.type participant.type.1 participant.type.2
1 1 A B
2 2 A
3 3 A B C
4 4 B C
I tried the spread but can't achieve one line per incident; I don't think I have a way of creating a key-value pair so I wonder if there is some other method for doing this.
Before using spread(), you need to create a proper key argument.
df %>% select(-rowID) %>%
group_by(incidentID) %>%
mutate(id = 1:n()) %>%
spread(id, participant.type)
# incidentID `1` `2` `3`
# <int> <fct> <fct> <fct>
# 1 1 A B NA
# 2 2 A NA NA
# 3 3 A B C
# 4 4 B C NA
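In current tidyr, spread() is superseded by pivot_wider(), so the same idea (build a within-group counter, then spread on it) might be written as below. This is a sketch rather than the answerer's code; the data frame is re-created from the question, and the names_prefix value is chosen here only to get readable column names.

```r
library(dplyr)
library(tidyr)

# Data re-created from the question
df <- data.frame(
  rowID = 1:8,
  incidentID = c(1, 1, 2, 3, 3, 3, 4, 4),
  participant.type = c("A", "B", "A", "A", "B", "C", "B", "C"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  select(-rowID) %>%
  group_by(incidentID) %>%
  mutate(id = row_number()) %>%   # position of each participant within its incident
  ungroup() %>%
  pivot_wider(
    names_from   = id,
    values_from  = participant.type,
    names_prefix = "participant.type."
  )
wide
```

Incidents with fewer participants get NA in the unused columns, as in the spread() output above.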
Since your grouping is based on the row order within the incidentID column, the following simple solution will also work.
It just filters the data frame by position within each group and then joins the pieces at the end.
It is probably not the most efficient solution computationally, but it is easy to understand.
library(tidyverse)
df <-
tribble(
~rowID, ~incidentID, ~participant.type,
1, 1, "A",
2, 1, "B",
3, 2, "A",
4, 3, "A",
5, 3, "B",
6, 3, "C",
7, 4, "B",
8, 4, "C")
df_1 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==1)
df_2 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==2) %>%
rename(participant.type.1 = participant.type)
df_3 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==3) %>%
rename(participant.type.2 = participant.type)
full_join(df_1, full_join(df_2, df_3))
Result:
Joining, by = "incidentID"
Joining, by = "incidentID"
# A tibble: 4 x 4
# Groups: incidentID [?]
incidentID participant.type participant.type.1 participant.type.2
<dbl> <chr> <chr> <chr>
1 1 A B NA
2 2 A NA NA
3 3 A B C
4 4 B C NA
Here's my solution:
df %>%
select(-rowID) %>%
group_by(incidentID) %>%
nest() %>%
mutate(data = map_chr(data, ~str_c(.x$participant.type, collapse = '_'))) %>%
separate(data, paste0('participant.type.', 0:2)) %>%
mutate_at(2:4, ~replace_na(.x, ''))
We can use reshape2::dcast for this
reshape2::dcast(df, insidentID ~ participant.type)
# insidentID A B C
# 1 1 <NA> B <NA>
# 2 8 <NA> B <NA>
# 3 12 <NA> <NA> C
# 4 16 A <NA> <NA>
# 5 24 <NA> B <NA>
# 6 27 <NA> B C
# 7 29 <NA> <NA> C
with the data
set.seed(123)
df <- data.frame(insidentID = sample(0:30, 8L, replace = TRUE),
participant.type = sample(LETTERS[1:3], 8L, replace = TRUE),
stringsAsFactors = FALSE)
df
# insidentID participant.type
# 1 8 B
# 2 24 B
# 3 12 C
# 4 27 B
# 5 29 C
# 6 1 B
# 7 16 A
# 8 27 C
The 'related question' link provided by @markus shows a variety of other solutions, including what appears to be the most concise in a tidyverse format:
df %>%
group_by(incidentID) %>%
mutate(rn = paste0("newcolumn",row_number())) %>%
spread(rn, participant.type)
gives:
incidentID newcolumn1 newcolumn2 newcolumn3
<int> <fct> <fct> <fct>
1 1 A B NA
2 2 A NA NA
3 3 A B C
4 4 B C NA
With a df like the one below:
df <- data.frame(
name = rep(c("A", "B", "C"),2),
type = c("10", "10", "10","20", "20", "20"),
val = c(1,2,3,4,5,6)
)
> df
name type val
1 A 10 1
2 B 10 2
3 C 10 3
4 A 20 4
5 B 20 5
6 C 20 6
>
I need to add the val of all records with name C to the val of the records with name A for the corresponding type, under a new name AC. I need one output that keeps the original A and C rows and one without them. The expected outputs are:
output1
name type val
1 A 10 1
2 B 10 2
3 C 10 3
4 AC 10 4
5 A 20 4
6 B 20 5
7 C 20 6
8 AC 20 10
output2
name type val
1 AC 10 4
2 B 10 2
4 AC 20 10
5 B 20 5
>
I would prefer a dplyr-based solution.
Here is one way,
library(dplyr)
df %>%
mutate(new = as.integer(name %in% c('A', 'C'))) %>%
group_by(type, new) %>%
summarise(name = paste0(name, collapse = ''), val = sum(val)) %>%
ungroup() %>%
select(-new)
# A tibble: 4 × 3
# type name val
# <fctr> <chr> <dbl>
#1 10 B 2
#2 10 AC 4
#3 20 B 5
#4 20 AC 10
To get the other output then,
df %>%
mutate(new = as.integer(name %in% c('A', 'C'))) %>%
group_by(type, new) %>%
summarise(name = paste0(name, collapse = ''), val = sum(val)) %>%
ungroup() %>%
select(-new) %>%
filter(nchar(name) > 1) %>%
bind_rows( df) %>%
arrange(val)
# A tibble: 8 × 3
# type name val
# <fctr> <chr> <dbl>
#1 10 A 1
#2 10 B 2
#3 10 C 3
#4 10 AC 4
#5 20 A 4
#6 20 B 5
#7 20 C 6
#8 20 AC 10
Here is another (requires tidyr as well as dplyr)
df1 <- df %>% group_by(type) %>%
summarise(AC=sum(val[name %in% c("A","C")]),B=val[name=="B"]) %>%
gather(key=name,value=val,-type) %>%
arrange(type)
Here is one option using data.table
library(data.table)
rbindlist(list(df, setDT(df)[, .(name = "AC", val = sum(val[as.character(name) %chin%
c("A", "C")])) , .(type)][, names(df), with = FALSE]))[order(type, name)]
# name type val
#1: A 10 1
#2: B 10 2
#3: C 10 3
#4: AC 10 4
#5: A 20 4
#6: B 20 5
#7: C 20 6
#8: AC 20 10
Or with dplyr
library(dplyr)
df %>%
filter(name %in% c("A", "C")) %>%
group_by(type) %>%
summarise(name = 'AC', val = sum(val)) %>%
full_join(df, ., by = c("name", "type", "val")) %>% # full_join() has no 'on' argument; join on all shared columns
arrange(type, val)
# name type val
#1 A 10 1
#2 B 10 2
#3 C 10 3
#4 AC 10 4
#5 A 20 4
#6 B 20 5
#7 C 20 6
#8 AC 20 10
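For the second requested output (A and C collapsed into AC, with no separate A and C rows), one possible dplyr sketch is to relabel the names first and then sum within each type. This assumes name is a character column (hence stringsAsFactors = FALSE) and dplyr 1.0.0 or later for the .groups argument; output2 is just an illustrative name.

```r
library(dplyr)

# Data from the question, with name kept as character
df <- data.frame(
  name = rep(c("A", "B", "C"), 2),
  type = c("10", "10", "10", "20", "20", "20"),
  val  = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

output2 <- df %>%
  # relabel both A and C as AC so they fall into the same group
  mutate(name = if_else(name %in% c("A", "C"), "AC", name)) %>%
  group_by(type, name) %>%
  summarise(val = sum(val), .groups = "drop")
output2
```

The rows come back sorted by type and then name, so AC appears before B within each type.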
I am having trouble repeating the rows of my real data using dplyr. There is already another post here, repeat-rows-of-a-data-frame, but it has no dplyr solution.
I wonder what the dplyr solution would be. I tried the code below,
but it failed with this error:
Error: wrong result size (16), expected 4 or 1
library(dplyr)
df <- data.frame(column = letters[1:4])
df_rep <- df%>%
mutate(column=rep(column,each=4))
Expected output
>df_rep
column
#a
#a
#a
#a
#b
#b
#b
#b
#*
#*
#*
Using the uncount function will solve this problem as well. The column count indicates how often a row should be repeated.
library(tidyverse)
df <- tibble(letters = letters[1:4])
df
# A tibble: 4 x 1
letters
<chr>
1 a
2 b
3 c
4 d
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
uncount(count)
# A tibble: 11 x 1
letters
<chr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
I was looking for a similar (but slightly different) solution. Posting here in case it's useful to anyone else.
In my case, I needed a more general solution that allows each letter to be repeated an arbitrary number of times. Here's what I came up with:
library(tidyverse)
df <- data.frame(letters = letters[1:4])
> df
letters
1 a
2 b
3 c
4 d
Let's say I want 2 A's, 3 B's, 2 C's and 4 D's:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <int>
1 a 1
2 a 2
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 d 1
9 d 2
10 d 3
11 d 4
If you don't want to keep the count column:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count)) %>%
select(letters)
# A tibble: 11 x 1
# Groups: letters [4]
letters
<fctr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
If you want the count to reflect the number of times each letter is repeated:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count)) %>%
mutate(count = max(count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <dbl>
1 a 2
2 a 2
3 b 3
4 b 3
5 b 3
6 c 2
7 c 2
8 d 4
9 d 4
10 d 4
11 d 4
This is rife with peril if the data.frame has other columns (there, I said it!), but the do block will allow you to generate a derived data.frame within a dplyr pipe (though, ceci n'est pas un pipe):
library(dplyr)
df <- data.frame(column = letters[1:4], stringsAsFactors = FALSE)
df %>%
do( data.frame(column = rep(.$column, each = 4), stringsAsFactors = FALSE) )
# column
# 1 a
# 2 a
# 3 a
# 4 a
# 5 b
# 6 b
# 7 b
# 8 b
# 9 c
# 10 c
# 11 c
# 12 c
# 13 d
# 14 d
# 15 d
# 16 d
As @Frank suggested, a much better alternative could be
df %>% slice(rep(1:n(), each=4))
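The slice() call works by repeating row indices, and the same indexing idea can be sketched in base R for comparison. The data frame is the one from this answer; idx and df_rep are illustrative names.

```r
df <- data.frame(column = letters[1:4], stringsAsFactors = FALSE)

# Each row index repeated 4 times: 1 1 1 1 2 2 2 2 ...
idx <- rep(seq_len(nrow(df)), each = 4)

# drop = FALSE keeps the result a data frame even with a single column
df_rep <- df[idx, , drop = FALSE]
rownames(df_rep) <- NULL  # reset the row names created by repeated indexing
df_rep
```

To repeat each row a different number of times, a vector can be passed through times instead, for example rep(seq_len(nrow(df)), times = c(2, 3, 2, 4)).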
I did a quick benchmark to show that uncount() is a lot faster than expand()
# for the pipe
library(magrittr)
# create some test data
df_test <-
tibble::tibble(
letter = letters,
row_count = sample(1:10, size = 26, replace = TRUE)
)
# benchmark
bench <- microbenchmark::microbenchmark(
expand = df_test %>%
dplyr::group_by(letter) %>%
tidyr::expand(row_count = seq(1:row_count)),
uncount = df_test %>%
tidyr::uncount(row_count)
)
# plot the benchmark
ggplot2::autoplot(bench)