How to Pass column name in group by from a variable - r

I want to extract the maximum value of a column for each group of a data frame.
The column name is stored in a variable that I want to pass to the group-by condition, but it is failing.
I have the data frame below:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
The column names are stored in variables:
columnselected <- c("Value")
groupbycol <- c("Gene")
My code is:
df %>% group_by(groupbycol) %>% top_n(1, columnselected)
This code gives an error. The expected output is:
Gene Value
A 12
B 6
C 1
D 4

You need to convert the column names to symbols using sym and then evaluate them with !!
library(dplyr)
df %>% group_by(!!sym(groupbycol)) %>% top_n(1, !!sym(columnselected))
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4

We can use group_by_at together with as.name, without needing an additional package
library(dplyr)
df %>%
group_by_at(groupbycol) %>%
top_n(1, !! as.name(columnselected))
# A tibble: 4 x 2
# Groups: Gene [4]
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
NOTE: There would be many dupes for this post :=)
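If you are on dplyr 1.0.0 or later (an assumption about your version), the same result can also be sketched without explicit tidy-eval, combining across(all_of()) for the grouping column with the .data pronoun and slice_max():
library(dplyr)
# A minimal sketch, assuming dplyr >= 1.0.0:
# all_of() selects the columns named by the strings in groupbycol,
# and .data[[...]] looks up the column named in columnselected.
df %>%
group_by(across(all_of(groupbycol))) %>%
slice_max(.data[[columnselected]], n = 1)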

Related

Lengthen data frame with duplicate names

I have a data.frame that contains duplicate column names that I want to lengthen. I don't want to fix the names because they correspond to values in my future column. I am trying to use pivot_longer but it throws an error.
Error: Can't transform a data frame with duplicate names.
I looked at the documentation for the function and used the "names_repair" argument to get around the issue but it didn't help.
I also found this issue on tidyverse's GitHub, but I'm not sure what's going on in there.
Here's my code:
library(dplyr)
library(tidyr)
df %>%
mutate_all(as.character) %>%
pivot_longer(-a, names_to = "Names", values_to = "Values", names_repair = "minimal")
Is there a way to do this?
Desired output:
a Names Values
<chr> <chr> <chr>
1 1 b 4
2 1 c a
3 1 c d
4 2 b 5
5 2 c b
6 2 c e
7 3 b 6
8 3 c c
9 3 c f
Sample data:
df <- setNames(data.frame(c(1,2,3),
c(4,5,6),
c("a","b","c"),
c("d","e","f"),
stringsAsFactors = F),
c("a","b","c","c"))
The problem is not pivot_longer: it can be used on data.frames containing columns with the same name, but mutate can't. So we need to transform the columns to character either (i) using base R or (ii), if you want to stay in the larger tidyverse, with purrr::modify_at (after all, a data.frame is just a list). After that it's just a regular call to pivot_longer.
df <- setNames(data.frame(c(1,2,3),
c(4,5,6),
c("a","b","c"),
c("d","e","f"),
stringsAsFactors = F),
c("a","b","c","c"))
library(dplyr)
library(tidyr)
# Alternatively use base R to transform cols to character
# df[,c("a", "b")] <- lapply(df[,c("a", "b")], as.character)
df %>%
purrr::modify_at(c("a","b"), as.character) %>%
pivot_longer(-a,
names_to = "Names",
values_to = "Values")
#> # A tibble: 9 x 3
#> a Names Values
#> <chr> <chr> <chr>
#> 1 1 b 4
#> 2 1 c a
#> 3 1 c d
#> 4 2 b 5
#> 5 2 c b
#> 6 2 c e
#> 7 3 b 6
#> 8 3 c c
#> 9 3 c f
Created on 2021-02-23 by the reprex package (v0.3.0)
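If you would rather not enumerate which columns to convert, a minimal base R variant (an extension of the commented-out line above, not part of the original answer) converts every column before pivoting:
library(dplyr)
library(tidyr)
# Convert all columns to character; this works even with duplicate names
df[] <- lapply(df, as.character)
df %>%
pivot_longer(-a, names_to = "Names", values_to = "Values")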

How to summarise across different types of variables with dplyr::c_across()

I have data with different types of variables. Some are character, some factors, and some numeric, like below:
df <- data.frame(a = c("tt", "ss", "ss", NA), b=c(2,3,NA,1), c=c(1,2,NA, NA), d=c("tt", "ss", "ss", NA))
I'm trying to count the number of missing values per observation using c_across in dplyr
However, c_across doesn't seem to be able to combine different types of values, as the error message below shows:
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across())))
Error: Problem with `summarise()` input `NAs`.
x Can't combine `a` <factor> and `b` <double>.
ℹ Input `NAs` is `sum(is.na(c_across()))`.
ℹ The error occurred in row 1.
Indeed, if I include only numeric variables, it works.
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(b:c))))
Same thing if I include only character variables
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(c(a,d)))))
I could solve the issue without using c_across like below, but I have lots of variables, so it's not very practical.
df %>%
rowwise() %>%
summarise(NAs = is.na(a)+is.na(b)+is.na(c)+is.na(d))
I could use the traditional apply approach, like below, but I'd like to solve this using dplyr.
apply(df, 1, function(x)sum(is.na(x)))
Any suggestions as to how to compute the number of missing values, row-wise, efficiently, and using dplyr?
I would suggest this approach. The issue is due to two things: first, the different types of variables in your data frame, and second, that you need a key variable for the rowwise-style task. So in the next code we first transform the variables into a common type, then create an id based on the row number. We use that as the input for rowwise(), and then we can use c_across(). Here is the code (I have used your df data):
library(tidyverse)
#Code
df %>%
mutate_at(vars(everything()),funs(as.character(.))) %>%
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise: id
a b c d id NAs
<chr> <chr> <chr> <chr> <int> <int>
1 tt 2 1 tt 1 0
2 ss 3 2 ss 2 0
3 ss NA NA ss 3 2
4 NA 1 NA NA 4 3
And we can avoid mutate_at() by using the newer across() inside mutate() to standardize the variables:
#Code 2
df %>%
mutate(across(a:d,~as.character(.))) %>%
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise: id
a b c d id NAs
<chr> <chr> <chr> <chr> <int> <int>
1 tt 2 1 tt 1 0
2 ss 3 2 ss 2 0
3 ss NA NA ss 3 2
4 NA 1 NA NA 4 3
A much faster option is to avoid rowwise and c_across altogether and use rowSums:
library(dplyr)
df %>%
mutate(NAs = rowSums(is.na(.)))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 3
If we want to select only certain columns, e.g. the numeric ones:
df %>%
mutate(NAs = rowSums(is.na(select(., where(is.numeric)))))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 1
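If you prefer tidyselect over the . pronoun, a small sketch (assuming dplyr >= 1.0.0, which provides across()) achieves the same result:
library(dplyr)
# across() returns a data frame of logicals, which rowSums() adds up per row
df %>%
mutate(NAs = rowSums(across(everything(), is.na)))
# Restricted to the numeric columns only
df %>%
mutate(NAs = rowSums(across(where(is.numeric), is.na)))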

Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a data frame using dplyr select, similar to dropping a single column?
df <- tibble(
a = LETTERS[1:5],
b = 1:5,
c = list('bob', 'cratchit', 'rules!','and', 'tiny tim too"')
)
df %>%
select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a workable workaround, but I wanted to know if it can be done with select_if.
df %>%
select(-which(map(df,class) == 'list'))
Use Negate
df %>%
select_if(Negate(is.list))
# A tibble: 5 x 2
a b
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
There is also purrr::negate that would give the same result.
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
# a b
# <chr> <int>
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
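In dplyr 1.0.0 and later (an assumption about your version, where select_if is superseded), the same result can be sketched with select() and where():
library(dplyr)
# Keep every column that is not a list column
df %>%
select(!where(is.list))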

Implementation of LIFO in converting comma separated row values into separate row values

I have a data frame where individual row values are comma separated, and I want to split those comma-separated values into individual rows. Using this SO post I was able to convert the comma-separated string values into individual rows, but it places the first value of the string in the first resulting row, and I want the inverse of this, i.e. the first string value should become the last row value.
# create data
library(tidyverse)
d <- data_frame(
col1 = c("1,2,3")
)
Dataframe
# # A tibble: 1 x 1
# col1
# <chr>
# 1 1,2,3
# tidy data
separate_rows(d, col1, convert = TRUE)
Current Output
# # A tibble: 3 x 1
# col1
# <int>
# 1
# 2
# 3
Desired Output
# tidy data
separate_rows(d, col1, convert = TRUE)
# # A tibble: 3 x 1
# col1
# <int>
# 3
# 2
# 1
Split the column on commas, reverse the vector, construct a data frame.
Sample data:
> d = data.frame(col1=c("23,34,99","9,3,2"),stringsAsFactors=FALSE)
> d
col1
1 23,34,99
2 9,3,2
Do:
> data.frame(col1=do.call(c,lapply(strsplit(d$col1,","),rev)))
col1
1 99
2 34
3 23
4 2
5 3
6 9
We can take the separated output and select the rows in reverse order using slice
library(tidyverse)
separate_rows(d, col1, convert = TRUE) %>%
slice(n():1)
# col1
# <int>
#1 3
#2 2
#3 1
For multiple rows, taking @Spacedman's example
d = data.frame(col1=c("23,34,99","9,3,2"),stringsAsFactors=FALSE)
The above solution would give
separate_rows(d, col1, convert = TRUE) %>%
slice(n():1)
# A tibble: 6 x 1
col1
<int>
#1 2
#2 3
#3 9
#4 99
#5 34
#6 23
However, in case the OP needs to reverse the values within each row separately, we can create a group column with row_number and then reverse each row's values separately, as suggested by @Sotos:
d %>%
mutate(group = row_number()) %>%
separate_rows(col1, convert = TRUE) %>%
group_by(group) %>%
slice(n():1) %>%
ungroup() %>%
select(-group)
# A tibble: 6 x 1
# col1
# <int>
#1 99
#2 34
#3 23
#4 2
#5 3
#6 9
You can use stri_reverse from the stringi package to reverse the string prior to separating, i.e.
library(tidyverse)
d %>%
mutate(col1 = stringi::stri_reverse(col1)) %>%
separate_rows(col1)
which gives
A tibble: 3 x 1
col1
<chr>
1 3
2 2
3 1
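Note that stri_reverse reverses the whole string character by character, so it only works here because every value is a single digit; a value such as "23" would come back as "32". A more general sketch (assuming the tidyverse is loaded, so purrr and tidyr are available) reverses the elements after splitting instead:
library(tidyverse)
# Split each row on commas, reverse the pieces, re-join, then separate into rows
d %>%
mutate(col1 = map_chr(strsplit(col1, ","), ~ paste(rev(.x), collapse = ","))) %>%
separate_rows(col1, convert = TRUE)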

R: unique observations conditionally on other variable - from rows into additional columns

I am new to R. I am struggling to find a suitable solution for the following problem:
My dataframe looks approximately like this:
ID Att
1 a
1 b
1 c
2 d
3 e
3 f
4 g
I would like to convert it into a new df of the following form:
ID Att_1 Att_2 ... Att_n
1 a b c
2 d N/A N/A
3 e f N/A
4 g N/A N/A
Where the number of columns depends on the maximum count of unique 'Att' values per 'ID' (here three). The generation of the number of columns in the new data frame (i.e. 'n') should be automated and based on this count:
max_ID_count <- table(df$ID)
n <- max(max_ID_count)
Thanks a lot!
We can create a sequence column and then spread
library(tidyverse)
df1 %>%
group_by(ID) %>%
mutate(rn = paste0("Att_", row_number())) %>%
spread(rn, Att)
# A tibble: 4 x 4
# Groups: ID [4]
# ID Att_1 Att_2 Att_3
# <int> <chr> <chr> <chr>
#1 1 a b c
#2 2 d <NA> <NA>
#3 3 e f <NA>
#4 4 g <NA> <NA>
Or with dcast from data.table
library(data.table)
dcast(setDT(df1), ID ~ paste0("Att_", rowid(ID)), value.var = "Att")
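Since spread is superseded in tidyr 1.0.0 and later (an assumption about the installed version), an equivalent sketch with pivot_wider looks like this:
library(dplyr)
library(tidyr)
# Number the 'Att' values within each ID, then widen into Att_1, Att_2, ...
df1 %>%
group_by(ID) %>%
mutate(rn = paste0("Att_", row_number())) %>%
ungroup() %>%
pivot_wider(names_from = rn, values_from = Att)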
