Lengthen data frame with duplicate names - r

I have a data.frame that contains duplicate column names that I want to lengthen. I don't want to fix the names because they correspond to values in my future column. I am trying to use pivot_longer but it throws an error.
Error: Can't transform a data frame with duplicate names.
I looked at the documentation for the function and used the "names_repair" argument to get around the issue but it didn't help.
I also found this issue on tidyverse's github but I'm not sure what's going on in there.
Here's my code:
library(dplyr)
library(tidyr)
df %>%
mutate_all(as.character) %>%
pivot_longer(-a, names_to = "Names", values_to = "Values", names_repair = "minimal")
Is there a way to do this?
Desired output:
a Names Values
<chr> <chr> <chr>
1 1 b 4
2 1 c a
3 1 c d
4 2 b 5
5 2 c b
6 2 c e
7 3 b 6
8 3 c c
9 3 c f
Sample data:
df <- setNames(data.frame(c(1,2,3),
c(4,5,6),
c("a","b","c"),
c("d","e","f"),
stringsAsFactors = FALSE),
c("a","b","c","c"))

The problem is not pivot_longer, which can be used on data.frames containing columns with the same name; mutate can't. So we need to convert the numeric columns to character either by (i) using base R or (ii), if you want to stay in the larger tidyverse, purrr::modify_at (after all, a data.frame is always a list). After that it's just a regular call to pivot_longer.
df <- setNames(data.frame(c(1,2,3),
c(4,5,6),
c("a","b","c"),
c("d","e","f"),
stringsAsFactors = FALSE),
c("a","b","c","c"))
library(dplyr)
library(tidyr)
# Alternatively use base R to transform cols to character
# df[,c("a", "b")] <- lapply(df[,c("a", "b")], as.character)
df %>%
purrr::modify_at(c("a","b"), as.character) %>%
pivot_longer(-a,
names_to = "Names",
values_to = "Values")
#> # A tibble: 9 x 3
#> a Names Values
#> <chr> <chr> <chr>
#> 1 1 b 4
#> 2 1 c a
#> 3 1 c d
#> 4 2 b 5
#> 5 2 c b
#> 6 2 c e
#> 7 3 b 6
#> 8 3 c c
#> 9 3 c f
Created on 2021-02-23 by the reprex package (v0.3.0)
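For comparison, here is a minimal base R sketch of the same reshaping that does not rely on tidyr at all. It assumes, as in the sample data, that the id column is called a and that every other column should be stacked; the helper names chr, value_idx, vals and long are made up for illustration. Column names are read from names(df) directly, because subsetting with df[...] may rename duplicated columns.

```r
# Sample data from the question, with two columns both named "c"
df <- setNames(data.frame(c(1, 2, 3),
                          c(4, 5, 6),
                          c("a", "b", "c"),
                          c("d", "e", "f"),
                          stringsAsFactors = FALSE),
               c("a", "b", "c", "c"))

# Convert every column to character; lapply treats the data.frame as a list,
# so duplicate names are not a problem here
chr <- lapply(df, as.character)

# Positions of the columns to stack (everything except the id column "a")
value_idx <- which(names(df) != "a")

# Character matrix with one column per stacked variable
vals <- do.call(cbind, chr[value_idx])

long <- data.frame(
  a      = rep(chr[[1]], each = length(value_idx)),
  Names  = rep(names(df)[value_idx], times = nrow(df)),
  Values = as.vector(t(vals)),  # transpose so values are read row by row
  stringsAsFactors = FALSE
)
long
```

The result has the same nine rows as the pivot_longer output above, as a plain data.frame rather than a tibble.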

Related

tidyr "pivot_longer": repeat column gives object not found error

I'm starting with a data frame with 5 columns: one treatment column, T_type, and four outcome variable columns, A, B, C and D. I'm trying to stack the outcome variables so I end up with one column for values, another with the names of the four outcome variables and then a column with the treatment names repeated down along the stacked columns. It's what's shown in the R help page for pivot_longer in the relig_income example and pretty much what Jason was trying to do here: dplyr `pivot_longer()` object not found but it's right there?
I get the same sort of error Jason was getting with pivot_longer and have no idea why. Here's what's happening.
dd <- as.data.frame(matrix(rpois(32, 4), nrow = 8))
names(dd) <- LETTERS[1:4]
dd <- data.frame(dd, T_type = rep(c("M", "P"), each = 4))
dd
A B C D T_type
1 3 5 5 4 M
2 7 5 2 2 M
3 2 3 3 10 M
4 3 3 2 3 M
5 8 3 4 3 P
6 4 4 5 1 P
7 6 4 2 6 P
8 9 4 3 6 P
So now I try pivot_longer.
dd %>% pivot_longer(-T_type, cols = A:D, names_to = "response", values_to = "y_obs")
Error in build_longer_spec(data, !!cols, names_to = names_to, values_to = values_to, :
object 'T_type' not found
Re-arranging the columns in dd so T_type is before columns A to D doesn't help.
I'd be grateful if someone could tell me what's going on here and how I can get pivot_longer to do the job.
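You need to drop -T_type from the pivot_longer call. In a %>% pipeline the data frame is already passed as the first argument, and cols = A:D already selects the columns to stack, so the extra -T_type is not treated as a column selection and T_type is looked up as an ordinary object, which does not exist.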
dd %>% pivot_longer(cols = A:D, names_to = "response", values_to = "y_obs")
Output
# A tibble: 32 x 3
# T_type response y_obs
# <chr> <chr> <int>
# 1 M A 7
# 2 M B 4
# 3 M C 4
# 4 M D 3
# 5 M A 8
# 6 M B 3
# 7 M C 5
# 8 M D 3
# 9 M A 4
# 10 M B 6
# ... with 22 more rows
Try this :
dd %>%
gather("response", "y_obs", -T_type)
Or :
dd %>% pivot_longer(names_to = "response", values_to = "y_obs", -T_type)
Or :
dd %>% pivot_longer(names_to = "response", values_to = "y_obs", A:D)
You already specify the range of columns, A to D, so T_type is not among the columns being pivoted.
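To make the point about the two selection styles concrete, here is a small sketch on a fixed data frame (the values are made up so the result is reproducible, unlike the rpois data in the question). It assumes tidyr 1.0.0 or later is installed and shows that selecting A:D and excluding T_type are two ways of naming the same columns.

```r
library(tidyr)

# Small fixed example with the same column layout as in the question
dd <- data.frame(A = c(3, 7), B = c(5, 5), C = c(5, 2), D = c(4, 2),
                 T_type = c("M", "P"))

# Select the columns to stack by range ...
long <- pivot_longer(dd, cols = A:D,
                     names_to = "response", values_to = "y_obs")

# ... or by excluding the column that should stay as it is
same <- pivot_longer(dd, cols = -T_type,
                     names_to = "response", values_to = "y_obs")

long
```

Either way there is exactly one column selection, passed as cols; the data frame itself comes first (or through the pipe).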

How to lag a specific column of a data frame in R

Input
(Say d is the data frame below.)
a b c
1 5 7
2 6 8
3 7 9
I want to shift the contents of column b one position down and put an arbitrary number in the first position in b. How do I do this? I would appreciate any help in this regard. Thank you.
I tried c(6,tail(d["b"],-1)) but it does not produce (6,5,6).
Output
a b c
1 6 7
2 5 8
3 6 9
Use head instead
df$b <- c(6, head(df$b, -1))
# a b c
#1 1 6 7
#2 2 5 8
#3 3 6 9
You could also use lag in dplyr
library(dplyr)
df %>% mutate(b = lag(b, default = 6))
Or shift in data.table
library(data.table)
setDT(df)[, b:= shift(b, fill = 6)]
A dplyr solution uses lag with an explicit default argument, if you prefer:
library(dplyr)
d <- tibble(a = 1:3, b = 5:7, c = 7:9)
d %>% mutate(b = lag(b, default = 6))
#> # A tibble: 3 x 3
#> a b c
#> <int> <dbl> <int>
#> 1 1 6 7
#> 2 2 5 8
#> 3 3 6 9
Created on 2019-12-05 by the reprex package (v0.3.0)
Here is a solution similar to the head approach by @Ronak Shah
df <- within(df, b <- c(runif(1), head(b, -1)))
where a uniformly distributed random number is placed in the first position of column b and the remaining values are shifted down by one:
> df
a b c
1 1 0.6644704 7
2 2 5.0000000 8
3 3 6.0000000 9
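A more general approach with dplyr::lag works for any lag offset (use dplyr::lead for leads). Do not group by b itself, or every group holds a single row and the lag is always NA:
library(dplyr)
d <- data.frame(a=c(1,2,3),b=c(5,6,7),c=c(7,8,9))
d1 <- d %>% arrange(b) %>%
mutate(b1= dplyr::lag(b, n = 1, default = NA))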
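The head idea from the first answer generalises to any offset without extra packages. The following is only a sketch: shift_down is a hypothetical helper name, not a function from base R or dplyr, and it assumes the offset n is between 0 and the length of the vector.

```r
# Hypothetical helper: shift x down by n positions, padding the top with `fill`
shift_down <- function(x, n = 1L, fill = NA) {
  stopifnot(n >= 0L, n <= length(x))
  c(rep(fill, n), head(x, length(x) - n))
}

d <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7), c = c(7, 8, 9))

# Shift column b by one position and put 6 in the first slot
d$b <- shift_down(d$b, n = 1, fill = 6)
d
```

With n = 1 and fill = 6 this reproduces the requested column b of 6, 5, 6.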

How to Pass column name in group by from a variable

I want to extract the maximum value of a column for each group of a data frame.
The column names are stored in variables that I want to pass to group_by, but it fails.
I have below data frame:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
Column values in Variables below:
columnselected <- c("Value")
groupbycol <- c("Gene")
My Code is :
df %>% group_by(groupbycol) %>% top_n(1, columnselected)
This code gives an error. Expected output:
Gene Value
A 12
B 6
C 1
D 4
You need to convert column names to symbol using sym and then evaluate them using !!
library(dplyr)
df %>% group_by(!!sym(groupbycol)) %>% top_n(1, !!sym(columnselected))
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
We can use group_by_at, which accepts column names as strings, so no additional package is needed
library(dplyr)
df %>%
group_by_at(groupbycol) %>%
top_n(1, !! as.name(columnselected))
# A tibble: 4 x 2
# Groups: Gene [4]
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
NOTE: There would be many dupes for this post :=)
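For readers on a recent dplyr, the same result can be sketched without sym or the _at variants. This assumes dplyr 1.0.0 or later, where across(), all_of(), slice_max() and the .data pronoun are available; the variable name result is only for illustration.

```r
library(dplyr)

df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')

columnselected <- "Value"
groupbycol <- "Gene"

result <- df %>%
  # all_of() selects the grouping column(s) named in the character vector
  group_by(across(all_of(groupbycol))) %>%
  # .data[[...]] looks up a column by the name stored in a string
  slice_max(.data[[columnselected]], n = 1) %>%
  ungroup()

result
```

Like top_n, slice_max keeps ties by default, so a group can return more than one row if its maximum occurs more than once.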

Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a dataframe using dplyr select, similar to dropping a single column?
df <- tibble(
a = LETTERS[1:5],
b = 1:5,
c = list('bob', 'cratchit', 'rules!','and', 'tiny tim too"')
)
df %>%
select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a workable workaround, but I wanted to know if it can be done with select_if.
df %>%
select(-which(map(df,class) == 'list'))
Use Negate
df %>%
select_if(Negate(is.list))
# A tibble: 5 x 2
a b
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
There is also purrr::negate that would give the same result.
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
# a b
# <chr> <int>
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
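With dplyr 1.0.0 or later, the scoped select_if is superseded by select combined with where(), so the same idea can be sketched as follows (assuming that dplyr version; the name kept is only for illustration):

```r
library(dplyr)

df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list("bob", "cratchit", "rules!", "and", "tiny tim too")
)

# where() takes a predicate; the lambda keeps every column that is not a list
kept <- df %>%
  select(where(~ !is.list(.x)))

kept
```

where(Negate(is.list)) is an equivalent spelling if you prefer to avoid the lambda.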

Repeating rows of data.frame in dplyr [duplicate]

I am having trouble repeating rows of my real data using dplyr. There is already another post here, repeat-rows-of-a-data-frame, but it has no dplyr solution.
I would like to know how to do this in dplyr.
My attempt failed with this error:
Error: wrong result size (16), expected 4 or 1
library(dplyr)
df <- data.frame(column = letters[1:4])
df_rep <- df%>%
mutate(column=rep(column,each=4))
Expected output
>df_rep
column
#a
#a
#a
#a
#b
#b
#b
#b
#*
#*
#*
Using the uncount function will solve this problem as well. The column count indicates how often a row should be repeated.
library(tidyverse)
df <- tibble(letters = letters[1:4])
df
# A tibble: 4 x 1
letters
<chr>
1 a
2 b
3 c
4 d
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
uncount(count)
# A tibble: 11 x 1
letters
<chr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
I was looking for a similar (but slightly different) solution. Posting here in case it's useful to anyone else.
In my case, I needed a more general solution that allows each letter to be repeated an arbitrary number of times. Here's what I came up with:
library(tidyverse)
df <- data.frame(letters = letters[1:4])
df
> df
letters
1 a
2 b
3 c
4 d
Let's say I want 2 A's, 3 B's, 2 C's and 4 D's:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <int>
1 a 1
2 a 2
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 d 1
9 d 2
10 d 3
11 d 4
If you don't want to keep the count column:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count)) %>%
select(letters)
# A tibble: 11 x 1
# Groups: letters [4]
letters
<fctr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
If you want the count to reflect the number of times each letter is repeated:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count)) %>%
mutate(count = max(count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <dbl>
1 a 2
2 a 2
3 b 3
4 b 3
5 b 3
6 c 2
7 c 2
8 d 4
9 d 4
10 d 4
11 d 4
This is rife with peril if the data.frame has other columns (there, I said it!), but the do block will allow you to generate a derived data.frame within a dplyr pipe (though, ceci n'est pas un pipe):
library(dplyr)
df <- data.frame(column = letters[1:4], stringsAsFactors = FALSE)
df %>%
do( data.frame(column = rep(.$column, each = 4), stringsAsFactors = FALSE) )
# column
# 1 a
# 2 a
# 3 a
# 4 a
# 5 b
# 6 b
# 7 b
# 8 b
# 9 c
# 10 c
# 11 c
# 12 c
# 13 d
# 14 d
# 15 d
# 16 d
As @Frank suggested, a much better alternative could be
df %>% slice(rep(1:n(), each=4))
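The slice call works because repeating rows is just indexing with repeated row numbers. A minimal base R sketch of that idea, with a hypothetical counts vector giving a different number of repeats per row, might look like this:

```r
df <- data.frame(column = letters[1:4], stringsAsFactors = FALSE)

# Hypothetical per-row repeat counts: 2 a's, 3 b's, 2 c's, 4 d's
counts <- c(2, 3, 2, 4)

# Row numbers, each repeated as often as its count
idx <- rep(seq_len(nrow(df)), times = counts)

# drop = FALSE keeps the result a data.frame even with a single column
df_rep <- df[idx, , drop = FALSE]
rownames(df_rep) <- NULL
df_rep
```

Using each = 4 instead of times = counts gives the fixed four-fold repetition asked for in the question.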
I did a quick benchmark to show that uncount() is a lot faster than expand()
# for the pipe
library(magrittr)
# create some test data
df_test <-
tibble::tibble(
letter = letters,
row_count = sample(1:10, size = 26, replace = TRUE)
)
# benchmark
bench <- microbenchmark::microbenchmark(
expand = df_test %>%
dplyr::group_by(letter) %>%
tidyr::expand(row_count = seq(1:row_count)),
uncount = df_test %>%
tidyr::uncount(row_count)
)
# plot the benchmark
ggplot2::autoplot(bench)
