How to calculate overlap between different categories in R

I have searched the forum but have not found the answer I need.
I have the following dataset:
Dataset
The important columns are TGEClass and peptide:
I would like to calculate the overlap between the different TGE classes.
I used calculate.overlap(TGE) from VennDiagram, but that does not give me the desired result.
The R code with a dummy dataset:
# A simple single-set diagram
C1 <- as.data.frame(letters[1:10])
C2 <- as.data.frame(letters[1:10])
data =cbind(C1,C2)
overlap <- calculate.overlap(data)
overlap = as.data.frame(overlap)
The result:
a1 a2 a3
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
The desired result will look like this:
TGEClass
Desired Result
10 genes are expressed in both TGE classes
50 genes in only alternative
60 genes in only short
It is basically a Venn diagram, but in table format.
Please note that each gene has a different number of TGE class categories.
I am very new to R, so any help will be greatly appreciated.
Thanks very much,
Ishack

The output of VennDiagram::calculate.overlap() is not very convenient for later use (as.data.frame() only worked here by luck, because both vectors have the same length).
You can compute the overlap yourself with the tidyverse and return the summary:
library(tidyverse)
list(
"Cardiome" = letters[1:10],
"SuperSet" = letters[8:24]
) %>%
map2_dfr(., names(.), ~tibble::enframe(.x) %>% mutate(group=.y)) %>%
add_count(value) %>%
group_by(value) %>%
summarise(group2 = ifelse(n()==2, "both", group)) %>%
count(group2)
#> # A tibble: 3 x 2
#> group2 n
#> <chr> <int>
#> 1 both 3
#> 2 Cardiome 7
#> 3 SuperSet 14
If you want to stick with the output of VennDiagram::calculate.overlap(), you can use something like:
library(tidyverse)
overlap <- VennDiagram::calculate.overlap(
x = list(
"Cardiome" = letters[1:10],
"SuperSet" = letters[8:24]
)
)
# calculate.overlap() returns a1 and a2 (the full input sets) and a3 (their intersection).
# Drop the positional index added by enframe() so rows are matched by value, not by position.
map2_dfr(overlap, names(overlap), ~tibble::enframe(.x) %>% mutate(group=.y)) %>%
select(-name) %>%
spread(group, group) %>%
mutate(a1_only = !is.na(a1) & is.na(a2),
a2_only = !is.na(a2) & is.na(a1),
both = !is.na(a2) & !is.na(a1)) %>%
summarise_at(c("a1_only", "a2_only", "both"), sum) %>%
gather(group, number, everything())
#> # A tibble: 3 x 2
#> group number
#> <chr> <int>
#> 1 a1_only 7
#> 2 a2_only 14
#> 3 both 3
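The same Venn-style counts can also be sketched in base R with setdiff() and intersect(). This is only an illustration: the data frame `tge` and its values are hypothetical stand-ins for the asker's dataset, and only the column names peptide and TGEClass come from the question.

```r
# Hypothetical long-format data: one row per peptide/class pair
tge <- data.frame(
  peptide  = c("p1", "p2", "p3", "p3", "p4", "p5", "p5"),
  TGEClass = c("alternative", "alternative", "alternative", "short",
               "short", "alternative", "short"),
  stringsAsFactors = FALSE
)

# One vector of unique peptides per class (list names sort alphabetically)
sets <- lapply(split(tge$peptide, tge$TGEClass), unique)

# Peptides seen in only one class, and peptides shared by both
only_first  <- setdiff(sets[[1]], sets[[2]])
only_second <- setdiff(sets[[2]], sets[[1]])
both        <- intersect(sets[[1]], sets[[2]])

# Assemble the counts into a small summary table
overlap_table <- data.frame(
  group = c("both", names(sets)),
  n     = c(length(both), length(only_first), length(only_second))
)
print(overlap_table)
```

This sketch handles exactly two classes. With more classes, each pairwise or higher-order intersection would have to be computed separately.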

Related

How to sum values from one column based on specific conditions from other column in R?

I have a dataset that looks something like this:
df <- data.frame(plot = c("A","A","A","A","A","B","B","B","B","B","C","C","C","C","C"),
species = c("Fagus","Fagus","Quercus","Picea", "Abies","Fagus","Fagus","Quercus","Picea", "Abies","Fagus","Fagus","Quercus","Picea", "Abies"),
value = sample(100, size = 15, replace = TRUE))
head(df)
plot species value
1 A Fagus 53
2 A Fagus 48
3 A Quercus 5
4 A Picea 25
5 A Abies 12
6 B Fagus 12
Now, I want to create a new data frame containing per plot values for share.conifers and share.broadleaves by basically summing the values with conditions applied for species. I thought about using case_when but I am not sure how to write the syntax:
df1 <- df %>% share.broadleaves = case_when(plot = plot & species = "Fagus" or species = "Quercus" ~ FUN="sum")
df1 <- df %>% share.conifers = case_when(plot = plot & species = "Abies" or species = "Picea" ~ FUN="sum")
I know this is not right, but I would like something like this.
Using dplyr/tidyr:
First assign each species to a type, then sum the values per plot and type, and finally pivot the types into columns.
library(dplyr)
library(tidyr)
df |>
mutate(type = case_when(species %in% c("Fagus", "Quercus") ~ "broadleaves",
species %in% c("Abies", "Picea") ~ "conifers")) |>
group_by(plot, type) |>
summarise(share = sum(value)) |>
ungroup() |>
pivot_wider(values_from = "share", names_from = "type", names_prefix = "share.")
Output:
# A tibble: 3 × 3
plot share.broadleaves share.conifers
<chr> <int> <int>
1 A 159 77
2 B 53 42
3 C 204 63
I am not sure if you want to sum or get the share, but the code could easily be adapted to whatever goal you have.
One way could just be summarizing by plot and species:
library(dplyr)
df |>
group_by(plot, species) |>
summarize(share = sum(value))
If you really want to get the share of a specific species per plot you could also do:
df |>
group_by(plot) |>
summarize(share_certain_species = sum(value[species %in% c("Fagus", "Quercus")]) / sum(value))
which gives:
# A tibble: 3 × 2
plot share_certain_species
<chr> <dbl>
1 A 0.546
2 B 0.583
3 C 0.480
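If both shares are wanted at once, one possible base R sketch uses xtabs() and prop.table(). The data below are made up with fixed values so the result is reproducible (the question uses sample()), and the lookup vector `type_of` is a hypothetical helper, not part of the original answers.

```r
# Small deterministic example (values are invented for illustration)
df <- data.frame(
  plot    = c("A", "A", "A", "B", "B", "B"),
  species = c("Fagus", "Picea", "Abies", "Quercus", "Fagus", "Picea"),
  value   = c(30, 10, 10, 20, 20, 60),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from species to type
type_of <- c(Fagus = "broadleaves", Quercus = "broadleaves",
             Abies = "conifers", Picea = "conifers")
df$type <- unname(type_of[as.character(df$species)])

# Sum value per plot and type, then turn each row into proportions
totals <- xtabs(value ~ plot + type, data = df)
shares <- prop.table(totals, margin = 1)
print(round(shares, 2))
```

Each row of `shares` sums to 1, so the conifer share is simply one minus the broadleaf share when every species falls into one of the two types.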

Calculate the frequency of species occurrence across sites

I need to calculate the relative frequency of species occurrence across sites. Let's say species a was found in 5 out of 8 sampling sites; its relative frequency is then 62.5 %. How can I do this in R, ideally using dplyr?
Dummy example:
d <-data.frame(site = c(1,1,2,2,3,3,4,4),
species = c('a','b', 'a','b', 'a','d', 'a', 'e'))
I know that I can calculate the sum of unique sites by counting distinct ones:
d %>%
group_by(site) %>%
summarize(n_sites = n_distinct(site))
I can get the count of each species' occurrences using this:
d %>%
count(species)
But how can I get the relative frequency of occurrence of each species across sites?
Desired output:
species freq
a 100 # species a is present in each plot
b 50 # b occurs in half of plots
d 25 # d&e occur only in 1 out of 4 plots
e 25
We can use
library(dplyr)
# Count distinct sites per species, then divide by the total number of sites
d |> group_by(species) |>
summarise(freq = n_distinct(site)) |>
mutate(freq = freq/n_distinct(d$site)*100)
Output
# A tibble: 4 × 2
species freq
<chr> <dbl>
1 a 100
2 b 50
3 d 25
4 e 25
I would break this into two steps, as follows:
d %>%
group_by(species) %>%
# Step 1: count distinct sites per species
summarise(sites_by_species = n_distinct(site)) %>%
# Step 2: divide by the total number of sites
mutate(frequency = 100 * sites_by_species / n_distinct(d$site))
Output of which is:
# A tibble: 4 × 3
species sites_by_species frequency
<chr> <int> <dbl>
1 a 4 100
2 b 2 50
3 d 1 25
4 e 1 25
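Since the data are already grouped by species, n_distinct(site) inside summarize() would count the sites per species rather than the overall total, so I used length(unique(d$site)) for the total number of sites. Note that n() counts rows, so this assumes each species is recorded at most once per site.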
library(dplyr)
d %>% group_by(species) %>% summarize(freq = n()*100/length(unique(d$site)))
Or more lengthy (trying to stay in dplyr as much as possible):
d %>%
mutate(sites_n = n_distinct(site)) %>%
group_by(species) %>%
summarize(freq = n()*100/max(sites_n))
Output
# A tibble: 4 × 2
species freq
<chr> <dbl>
1 a 100
2 b 50
3 d 25
4 e 25
d %>%
count(species) %>%
mutate(freq=n/n_distinct(d$site)*100) %>%
select(-n)
species freq
1 a 100
2 b 50
3 d 25
4 e 25
I needed to use d$site since site is no longer available through the pipe after the use of count.
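Several of the answers above count rows per species (n() or count()), which matches the desired output only when each species appears at most once per site. A small base R sketch of the difference, using a made-up variant of the dummy data in which species a is recorded twice at site 1:

```r
# Variant of the dummy data: species "a" appears twice at site 1 (invented)
d <- data.frame(
  site    = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
  species = c("a", "a", "b", "a", "b", "a", "d", "a", "e"),
  stringsAsFactors = FALSE
)

n_sites <- length(unique(d$site))

# Counting rows per species overcounts the duplicated record
by_rows <- table(d$species) / n_sites * 100

# Counting distinct sites per species is robust to duplicates
by_sites <- tapply(d$site, d$species,
                   function(s) length(unique(s))) / n_sites * 100

print(by_rows)
print(by_sites)
```

With duplicated records the row-based frequency for species a exceeds 100 %, while the distinct-site version stays at 100 %.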

Row mean of two matching columns with same name but differ by: '_1' and '_2'

Let's say I have the dataframe:
z = data.frame(col_1 = c(1,2,3,4), col_2 = c(3,4,5,6))
col_1 col_2
1 1 3
2 2 4
3 3 5
4 4 6
I want to take columns whose names differ only by the numeric suffix, e.g. '_1' and '_2', and compute their pairwise row mean. In reality I have a big dataframe with many such pairs, and they are not in a convenient order, so I am looking for a solution that works in that case too.
So the output should look like this:
col
1 2
2 3
3 4
4 5
With the column name given as the same as the column pair but with the additional label removed.
Any help would be great thanks.
Here is a base R option using list2DF + split.default + rowMeans
list2DF(lapply(split.default(z,gsub("_\\d+","",names(z))),rowMeans))
which gives
col
1 2
2 3
3 4
4 5
Try this tidyverse approach. With separate() you can extract the base name, and by reshaping you can reach the desired output. Here is the code:
library(dplyr)
library(tidyr)
#Data
z = data.frame(col_1 = c(1,2,3,4), col_2 = c(3,4,5,6))
#Code
z1 <- z %>% mutate(id=1:n()) %>%
pivot_longer(-id) %>%
separate(name,c('var1','var2'),sep='_') %>%
group_by(id,var1) %>% summarise(Mean=mean(value)) %>%
pivot_wider(names_from = var1,values_from=Mean) %>% ungroup() %>% select(-id)
Output:
# A tibble: 4 x 1
col
<dbl>
1 2
2 3
3 4
4 5
Here is a purrr oriented solution:
library(purrr)
library(stringr)
split.default(z, str_remove(names(z), "[:digit:]+$")) %>% map_dfc(rowMeans)
#> # A tibble: 4 x 1
#> col_
#> <dbl>
#> 1 2
#> 2 3
#> 3 4
#> 4 5
It works even if z is:
z <- data.frame(col_1 = c(1,2,3,4),
col_2 = c(3,4,5,6),
anothercol_1 = c(1,2,3,4),
anothercol_2 = c(3,4,5,6))
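To check the multi-pair, unordered case the question describes, here is a small base R sketch. The column values are invented, and the regular expression is anchored with `$` so that only a trailing "_<number>" is stripped, which is a slight variation on the patterns used in the answers above.

```r
# Pairs deliberately out of order; values are made up for illustration
z <- data.frame(
  col_2        = c(3, 4, 5, 6),
  anothercol_1 = c(1, 2, 3, 4),
  col_1        = c(1, 2, 3, 4),
  anothercol_2 = c(5, 6, 7, 8)
)

# Strip the trailing "_<number>" so paired columns share one key
base_names <- sub("_\\d+$", "", names(z))

# Group columns by key and average each group row-wise
means <- as.data.frame(lapply(split.default(z, base_names), rowMeans))
print(means)
```

Because the grouping is done by name, the position of the columns in the data frame does not matter.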

Check if values of one dataframe exist in another dataframe in exact order

I have one dataframe of data and multiple "reference" dataframes. I'm trying to automate checking whether the values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as in the reference dataframes. The columns shown below are the ones that matter, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group 1 would score 1.0 (100% correct) because all its values in A match the values, and the order of values, in Reference_A. Group 2 would score 0.0 for B, because its values are out of order compared to Reference_B, and 0.66 for C, because 2 of 3 values match the values and order in Reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
Update: Thank you to everyone who has provided solutions! They work well for the toy dataset but have not yet been adaptable to datasets with more columns. As I wrote above, the columns listed are the important ones; I'd prefer not to drop the other columns unless necessary.
We may also do this with mget, which returns a list of the reference data.frames: bind them row-wise, bind the result column-wise to the data (df1 here), and take the grouped mean of a logical vector
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
bind_rows() %>%
bind_cols(df1) %>%
group_by(group, type = type...1) %>%
summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
This is another tidyverse solution. Here, I am adding a counter (i.e. rowname) to both reference and data. Then I join them together on type and rowname. At the end, I summarize them on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)
list(Reference_A, Reference_B, Reference_C) %>%
map(., rownames_to_column) %>%
bind_rows %>%
left_join({Dataframe %>%
group_split(type) %>%
map(., rownames_to_column) %>%
bind_rows},
. , by=c("type", "rowname")) %>%
group_by(type) %>%
dplyr::summarise(group = head(group,1),
score = sum(value.x == value.y)/n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
nest_by(type, .key = "ref") %>%
ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
nest_by(group, type, .key = "data") %>%
left_join(Reference, by = "type") %>%
mutate(
score = purrr::map2_dbl(data, ref, ~ {
if (length(.x) == 0 || length(.y) == 0) return(numeric(0))
if (length(.x) != length(.y)) return(0)
sum((is.na(.x) & is.na(.y)) | .x == .y) / length(.x)
})
) %>%
select(-data, -ref) %>%
ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667
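The position-wise comparison can also be sketched in base R, working on the value vectors directly so that no other columns of the data need to be dropped. The helper `score_in_order` is hypothetical, and the choice to pad the shorter vector with NA (so that missing or extra rows count as mismatches) is an assumption, not something the question specifies. For simplicity the sketch scores per type only.

```r
# Hypothetical helper: share of positions where x matches ref in order
score_in_order <- function(x, ref) {
  n <- max(length(x), length(ref))
  if (n == 0) return(NA_real_)
  # Pad the shorter vector with NA so missing or extra rows count as misses
  length(x) <- n
  length(ref) <- n
  mean(!is.na(x) & !is.na(ref) & x == ref)
}

Dataframe <- data.frame(
  group = c(1, 1, 1, 2, 2, 2, 2, 2),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Dolores", "Elsie",
            "Maeve", "Charlotte", "Bernard"),
  stringsAsFactors = FALSE
)

# Reference values keyed by type, in the required order
references <- list(
  A = c("Teddy", "William", "Lars"),
  B = c("Elsie", "Dolores"),
  C = c("Maeve", "Hale", "Bernard")
)

# Compare observed values to the matching reference, one score per type
observed <- split(Dataframe$value, Dataframe$type)
scores <- mapply(score_in_order, observed, references[names(observed)])
print(round(scores, 2))
```

If the same type can occur in several groups, the split would need to use both group and type, for example via interaction(Dataframe$group, Dataframe$type, drop = TRUE).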

Create a list of all values of a variable grouped by another variable in R

I have a data frame that contains two variables, like this:
df <- data.frame(group=c(1,1,1,2,2,3,3,4),
type=c("a","b","a", "b", "c", "c","b","a"))
> df
group type
1 1 a
2 1 b
3 1 a
4 2 b
5 2 c
6 3 c
7 3 b
8 4 a
I want to produce a table showing for each group the combination of types it has in the data frame as one variable e.g.
group alltypes
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
The output would always list the types in the same order (e.g. groups 2 and 3 get the same result) and there would be no repetition (e.g. group 1 is not "a, b, a").
I tried doing this using dplyr and summarize, but I can't work out how to get it to meet these two conditions - the code I tried was:
> df %>%
+ group_by(group) %>%
+ summarise(
+ alltypes = paste(type, collapse=", ")
+ )
# A tibble: 4 × 2
group alltypes
<dbl> <chr>
1 1 a, b, a
2 2 b, c
3 3 c, b
4 4 a
I also tried turning type into a set of individual counts, but not sure if that's actually useful:
> df %>%
+ group_by(group, type) %>%
+ tally %>%
+ spread(type, n, fill=0)
Source: local data frame [4 x 4]
Groups: group [4]
group a b c
* <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0
2 2 0 1 1
3 3 0 1 1
4 4 1 0 0
Any suggestions would be greatly appreciated.
I think you were very close. You could call the sort and unique functions to make sure your result adheres to your conditions as follows:
df %>% group_by(group) %>%
summarize(type = paste(sort(unique(type)),collapse=", "))
returns:
# A tibble: 4 x 2
group type
<dbl> <chr>
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
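The same sort(unique()) idea also works without dplyr. A minimal base R sketch using aggregate(), where `collapse_types` is a hypothetical helper name:

```r
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 4),
  type  = c("a", "b", "a", "b", "c", "c", "b", "a"),
  stringsAsFactors = FALSE
)

# Deduplicate, sort, and collapse the types of one group into a string
collapse_types <- function(x) paste(sort(unique(x)), collapse = ", ")

# Apply the helper per group and rename the result column
result <- aggregate(type ~ group, data = df, FUN = collapse_types)
names(result)[2] <- "alltypes"
print(result)
```

Sorting before collapsing is what makes groups 2 and 3 produce identical strings, which matters if the result is later used as a grouping key.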
To expand on Florian's answer this could be extended to generating an ordered list based on values in your data set. An example could be determining the order of dates:
library(lubridate)
library(tidyverse)
# Generate random dates
set.seed(123)
Date = ymd("2018-01-01") + sort(sample(1:200, 10))
A = ymd("2018-01-01") + sort(sample(1:200, 10))
B = ymd("2018-01-01") + sort(sample(1:200, 10))
C = ymd("2018-01-01") + sort(sample(1:200, 10))
# Combine to data set
data = bind_cols(as.data.frame(Date), as.data.frame(A), as.data.frame(B), as.data.frame(C))
# Get order of dates for each row
data %>%
mutate(D = Date) %>%
gather(key = Var, value = D, -Date) %>%
arrange(Date, D) %>%
group_by(Date) %>%
summarize(Ord = paste(Var, collapse=">"))
Somewhat tangential to the original question but hopefully helpful to someone.
