Identify subsets containing only repeats of an expression - r

I have a dataset like so:
df<-data.frame(x=c("A","A","A","A", "B","B","B","B","B",
"C","C","C","C","C","D","D","D","D","D"),
y= as.factor(c(rep("Eoissp2",4),rep("Eoissp1",5),"Eoissp1","Eoisp4","Automerissp1","Automerissp2","Acharias",rep("Eoissp2",3),rep("Eoissp1",2))))
I want to identify the subsets of x whose values of y all contain the expression Eois. A, B, and D should therefore be returned in a vector, because every value of y in A, B, and D contains Eois, while C contains a mix of levels (e.g. Eois, Automeris, and Acharias). For this example the output would be:
output<- c("A", "B", "D")

Using new df:
> df %>% filter(str_detect(y,"Eois")) %>% group_by(x) %>% distinct(y) %>%
count() %>% filter(n==1) %>% select(x)
# A tibble: 2 x 1
# Groups: x [2]
x
<fct>
1 A
2 B
(Answer below uses the original df posted by the question author.)
Using the pipe function in magrittr & functions from dplyr:
> df %>% group_by(x) %>% distinct(y)
# A tibble: 7 x 2
# Groups: x [3]
x y
<fct> <fct>
1 A plant1a
2 B plant1b
3 C plant1a
4 C plant2a
5 C plant3a
6 C plant4a
7 C plant5a
Then you can roll up the results like this:
> results <- df %>% group_by(x) %>% distinct(y) %>%
count() %>% filter(n==1) %>% select(x)
> results
# A tibble: 2 x 1
# Groups: x [2]
x
<fct>
1 A
2 B
If you know your original data frame is always going to come with the x's in order, you can drop the group_by part.

A dplyr-based solution could be:
library(dplyr)
df %>% group_by(x) %>%
filter(grepl("Eoiss", y)) %>%
mutate(y = sub("\\d+", "", y)) %>%
filter(n() >1 & length(unique(y)) == 1) %>%
select(x) %>% unique(.)
# A tibble: 3 x 1
# Groups: x [3]
# x
# <fctr>
#1 A
#2 B
#3 D
Data
df<-data.frame(x=c("A","A","A","A", "B","B","B","B","B",
"C","C","C","C","C","D","D","D","D","D"),
y= as.factor(c(rep("Eoissp2",4),
rep("Eoissp1",5),"Eoissp1","Eoisp4","Automerissp1","Automerissp2",
"Acharias",rep("Eoissp2",3),rep("Eoissp1",2))))
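Since the question asks for groups in which every row contains Eois, the condition can also be tested directly with all() per group. The following is a minimal base R sketch under that reading of the question, not a replacement for the answers above. The name all_eois is hypothetical, and y is kept as a character vector here rather than a factor.

```r
# Same data as in the question, with y left as character
df <- data.frame(
  x = c(rep("A", 4), rep("B", 5), rep("C", 5), rep("D", 5)),
  y = c(rep("Eoissp2", 4), rep("Eoissp1", 5),
        "Eoissp1", "Eoisp4", "Automerissp1", "Automerissp2", "Acharias",
        rep("Eoissp2", 3), rep("Eoissp1", 2))
)

# For each level of x, check whether every y value contains "Eois"
all_eois <- tapply(grepl("Eois", df$y), df$x, all)

# Keep the names of the groups where the check is TRUE
output <- names(all_eois)[all_eois]
output
#> [1] "A" "B" "D"
```

Group C is dropped because three of its rows (Automerissp1, Automerissp2, Acharias) do not match the pattern.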

Related

How to 'summarize' variable which mixed by 'numeric' and 'character'

Here is a data frame, data, shown below. How can I transform it into wished_data? Thanks!
library(tidyverse)
data <- data.frame(category=c('a','b','a','b','a'),
values=c(1,'A','2','4','B'))
#below code can't work
data %>% group_by(category ) %>%
summarize(sum=if_else(is.numeric(values)>0,sum(is.numeric(values)),paste0(values)))
#below is the wished result
wished_data <- data.frame(category=c('a','a','b','b'),
values=c('3','B','A','4'))
Mixing numeric and character variables in a column is not tidy. Consider giving each type their own column, for example:
data %>%
mutate(letters = str_extract(values, "[A-Z]"),
numbers = as.numeric(str_extract(values, "\\d"))) %>%
group_by(category) %>%
summarise(values = sum(numbers, na.rm = TRUE),
letters = na.omit(letters))
category values letters
<chr> <dbl> <chr>
1 a 3 B
2 b 4 A
In R, arithmetic on strings does not make sense: "1+1" is not "2", and is.numeric("1") gives FALSE. A workaround is to convert the column to a list, or to give each type its own column.
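The point about strings and numbers can be checked with a few lines of base R. This is only an illustrative sketch using the values column from the question; the names nums and is_num are hypothetical.

```r
values <- c("1", "A", "2", "4", "B")

# A character vector is never numeric, even if it holds digits
is.numeric(values)
#> [1] FALSE

# Convert explicitly; entries that are not numbers become NA
# (suppressWarnings hides the "NAs introduced by coercion" warning)
nums <- suppressWarnings(as.numeric(values))
is_num <- !is.na(nums)

# Numbers can now be summed, and the remaining strings kept apart
sum(nums[is_num])
#> [1] 7
values[!is_num]
#> [1] "A" "B"
```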
I'd create a separate column to group numeric values in a category separately from characters.
data %>%
mutate(num_check = grepl("[0-9]", values)) %>%
group_by(category, num_check) %>%
summarize(sum = ifelse(
unique(num_check),
as.character(sum(as.numeric(values))),
unique(values)
), .groups = "drop")
#> # A tibble: 4 × 3
#> category num_check sum
#> <chr> <lgl> <chr>
#> 1 a FALSE B
#> 2 a TRUE 3
#> 3 b FALSE A
#> 4 b TRUE 4
Here is a bit of a messy answer,
library(dplyr)
bind_rows(data %>%
filter(is.na(as.numeric(values))),
data %>%
mutate(values = as.numeric(values)) %>%
group_by(category) %>%
summarise(values = as.character(sum(values, na.rm = TRUE)))) %>%
arrange(category)
category values
#1 a B
#2 a 3
#3 b A
#4 b 4

dplyr concatenate column by variable value

I can concatenate one column of a data.frame with the code below when I know the column name.
However, what if the column name is stored in a variable?
A further question: how can I specify columns by the value of a variable? (!!sym()?)
Here are test code:
> library(dplyr)
> packageVersion("dplyr")
[1] ‘1.0.7’
> df <- data.frame(x = 1:3, y = c("A", "B", "A"))
> df %>%
group_by(y) %>%
summarise(z = paste(x, collapse = ","))
# A tibble: 2 x 2
y z
<chr> <chr>
1 A 1,3
2 B 2
I have a variable a with the value "x". How can I do the same summarise using it? Passing a directly just pastes the string itself:
> a <- "x"
> df %>%
group_by(y) %>%
summarise(z = paste(a, collapse = ","))
# A tibble: 2 x 2
y z
<chr> <chr>
1 A x
2 B x
Solution-1: use !!sym()
> a <- "x"
> df %>%
group_by(y) %>%
summarise(z = paste(!!sym(a), collapse = ","))
# A tibble: 2 x 2
y z
<chr> <chr>
1 A 1,3
2 B 2
Solution-2: Assign the column to new variable
> df %>%
group_by(y) %>%
rename(new_col = a) %>%
summarise(z = paste(new_col, collapse = ","))
# A tibble: 2 x 2
y z
<chr> <chr>
1 A 1,3
2 B 2
Are there any other ways to do the job?
Similar questions can be found at https://stackoverflow.com/a/15935166/2530783 and https://stackoverflow.com/a/50537209/2530783.
Here are some other options -
Use .data -
library(dplyr)
a <- "x"
df %>% group_by(y) %>% summarise(z = toString(.data[[a]]))
# y z
# <chr> <chr>
#1 A 1, 3
#2 B 2
get
df %>% group_by(y) %>% summarise(z = toString(get(a)))
as.name
df %>% group_by(y) %>% summarise(z = toString(!!as.name(a)))
toString(x) is equivalent to paste(x, collapse = ", "), so it separates values with a comma followed by a space (hence "1, 3" above rather than "1,3").
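For the follow-up about specifying several columns by the value of a variable, one further option is across() with all_of(). This is a minimal sketch assuming dplyr 1.0.0 or later. The extra column w and the vector cols are hypothetical additions for illustration.

```r
library(dplyr)

# Question's data plus a hypothetical second column w
df <- data.frame(x = 1:3, w = c("p", "q", "r"), y = c("A", "B", "A"))

# Column names held in a character vector
cols <- c("x", "w")

# all_of() selects the columns named in cols;
# the lambda is applied to each selected column within each group
res <- df %>%
  group_by(y) %>%
  summarise(across(all_of(cols), ~ paste(.x, collapse = ",")))

res$x
#> [1] "1,3" "2"
res$w
#> [1] "p,r" "q"
```

With a single name in cols this gives the same values as the .data[[a]] version, except that the result column keeps its original name instead of being called z.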

R tibble: Group by column A, keep only distinct values in column B and C and sum values in column C

I want to group by column A and then sum the values in column C over the distinct combinations of columns B and C. Is it possible to do this inside the summarise call?
I know it is possible by calling distinct() before the aggregation. What about something like this:
Data:
df <- tibble(A = c(1,1,1,2,2), B = c('a','b','b','a','a'), C=c(5,10,10,15,15))
My try that doesn't work:
df %>%
group_by(A) %>%
summarise(sumC=sum(distinct(B,C) %>% select(C)))
Desired output:
A sumC
1 15
2 15
You could use duplicated
df %>%
group_by(A) %>%
summarise(sumC = sum(C[!duplicated(B)]))
## A tibble: 2 x 2
# A sumC
# <dbl> <dbl>
#1 1 15
#2 2 15
Or with distinct
df %>%
group_by(A) %>%
distinct(B, C) %>%
summarise(sumC = sum(C))
## A tibble: 2 x 2
# A sumC
# <dbl> <dbl>
#1 1 15
#2 2 15
A different possibility could be:
df %>%
group_by(A, B, C) %>%
slice(1) %>%
group_by(A) %>%
summarise(sumC = sum(C))
A sumC
<dbl> <dbl>
1 1 15
2 2 15
Or a twist on @Maurits Evers' answer:
df %>%
distinct(A, B, C) %>%
group_by(A) %>%
summarise(sumC = sum(C))
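To stay inside summarise() while de-duplicating on both B and C (the duplicated(B) version above looks at B only), one possibility is to call duplicated() on the pair of columns. This is a sketch rather than an established idiom; it relies on duplicated() having a data.frame method in base R.

```r
library(dplyr)

df <- tibble(A = c(1, 1, 1, 2, 2),
             B = c("a", "b", "b", "a", "a"),
             C = c(5, 10, 10, 15, 15))

# Within each group, drop rows whose (B, C) pair has already been seen,
# then sum what is left of C
res <- df %>%
  group_by(A) %>%
  summarise(sumC = sum(C[!duplicated(data.frame(B, C))]))

res$sumC
#> [1] 15 15
```

On this data the result matches the other answers. The two approaches would differ if a group had the same B with different C values, since duplicated(B) alone would keep only the first of them.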

tidyverse - preferred way to turn a named vector into a data.frame/tibble

Using the tidyverse a lot, I often face the challenge of turning named vectors into a data.frame/tibble with the columns being the names of the vector.
What is the preferred/tidyverse-y way of doing this?
EDIT: This is related to this and this GitHub issue.
So I want:
require(tidyverse)
vec <- c("a" = 1, "b" = 2)
to become this:
# A tibble: 1 × 2
a b
<dbl> <dbl>
1 1 2
I can do this via e.g.:
vec %>% enframe %>% spread(name, value)
vec %>% t %>% as_tibble
Use case example:
require(tidyverse)
require(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
Which gives
# A tibble: 2 × 3
a b c
<chr> <chr> <chr>
1 1 2 <NA>
2 1 <NA> 3
This is now directly supported using bind_rows (introduced in dplyr 0.7.0):
library(tidyverse)
vec <- c("a" = 1, "b" = 2)
bind_rows(vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
This quote from https://cran.r-project.org/web/packages/dplyr/news.html explains the change:
bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
With this change, it means that the following line in the use case example:
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
can be rewritten as
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(bind_rows)
which is also equivalent to
txt %>% map(read_xml) %>% map(xml_attrs) %>% { bind_rows(!!! .) }
The equivalence of the different approaches is demonstrated in the following example:
library(tidyverse)
library(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
temp <- txt %>% map(read_xml) %>% map(xml_attrs)
# x, y, and z are identical
x <- temp %>% map_df(~t(.) %>% as_tibble)
y <- temp %>% map_df(bind_rows)
z <- bind_rows(!!! temp)
identical(x, y)
#> [1] TRUE
identical(y, z)
#> [1] TRUE
z
#> # A tibble: 2 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 1 2 <NA>
#> 2 1 <NA> 3
The idiomatic way would be to splice the vector with !!! within a tibble() call so the named vector elements become column definitions :
library(tibble)
vec <- c("a" = 1, "b" = 2)
tibble(!!!vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
Created on 2019-09-14 by the reprex package (v0.3.0)
This works for me: c("a" = 1, "b" = 2) %>% t() %>% tbl_df()
Interestingly you can use the as_tibble() method for lists to do this in one call. Note that this isn't best practice since this isn't an exported method.
tibble:::as_tibble.list(vec)
as_tibble(as.list(c(a=1, b=2)))
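Another exported option, assuming tibble 3.0.0 or later is installed, is as_tibble_row(), which converts a named vector into a one-row tibble. A minimal sketch:

```r
library(tibble)

vec <- c(a = 1, b = 2)

# Each named element becomes a column; the result has exactly one row
res <- as_tibble_row(vec)

names(res)
#> [1] "a" "b"
nrow(res)
#> [1] 1
```

Unlike tibble:::as_tibble.list(), this does not reach into the package's internals.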

dplyr: Difference between unique and distinct

The number of resulting rows seems to differ between distinct and unique. The data set I am working with is huge; I hope the code is clear enough to follow.
dt2a <- select(dt, mutation.genome.position,
mutation.cds, primary.site, sample.name, mutation.id) %>%
group_by(mutation.genome.position, mutation.cds, primary.site) %>%
mutate(occ = nrow(.)) %>%
select(-sample.name) %>% distinct()
dim(dt2a)
[1] 2316382 5
## Using unique instead
dt2b <- select(dt, mutation.genome.position, mutation.cds,
primary.site, sample.name, mutation.id) %>%
group_by(mutation.genome.position, mutation.cds, primary.site) %>%
mutate(occ = nrow(.)) %>%
select(-sample.name) %>% unique()
dim(dt2b)
[1] 2837982 5
This is the file I am working with:
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/CosmicMutantExport.tsv.gz
dt = fread(fl)
This appears to be a result of the group_by. Consider this case:
dt<-data.frame(g=rep(c("a","b"), each=3),
v=c(2,2,5,2,7,7))
dt %>% group_by(g) %>% unique()
# Source: local data frame [4 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
dt %>% group_by(g) %>% distinct()
# Source: local data frame [2 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 b 2
dt %>% group_by(g) %>% distinct(v)
# Source: local data frame [4 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
When you use distinct() without indicating which variables to make distinct, it appears to use the grouping variable.
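To avoid depending on how distinct() treats grouping variables in a particular dplyr version, one option is to drop the grouping first and name the columns to de-duplicate on. This is a sketch of that workaround using the toy data above; the name res is hypothetical.

```r
library(dplyr)

dt <- data.frame(g = rep(c("a", "b"), each = 3),
                 v = c(2, 2, 5, 2, 7, 7))

# ungroup() removes the grouping, and naming g and v makes the
# de-duplication key explicit
res <- dt %>%
  group_by(g) %>%
  ungroup() %>%
  distinct(g, v)

nrow(res)
#> [1] 4
```

This gives the same four rows as the unique() call in the example.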

Resources