Calculate the frequency of species occurrence across sites - r

I need to calculate the relative frequency of species occurrence across sites. Let's say species a was found in 5 out of 8 sampling sites; its relative frequency is then 62.5%. How can I do this in R, ideally using dplyr?
Dummy example:
d <-data.frame(site = c(1,1,2,2,3,3,4,4),
species = c('a','b', 'a','b', 'a','d', 'a', 'e'))
I know that I can calculate the sum of unique sites by counting distinct ones:
d %>%
group_by(site) %>%
summarize(n_sites = n_distinct(site))
I can get the number of occurrences of each individual species using this:
d %>%
count(species)
But how can I get the frequency of occurrence of each species across sites?
Desired output:
species freq
a 100 # species a is present in each plot
b 50 # b occurs in half of plots
d 25 # d&e occur only in 1 out of 4 plots
e 25

We can use
library(dplyr)
d |> group_by(species) |>
# count the distinct sites where each species occurs
summarise(freq = n_distinct(site)) |>
# divide by the total number of sites (not the number of species)
mutate(freq = freq/n_distinct(d$site)*100)
Output
# A tibble: 4 × 2
species freq
<chr> <dbl>
1 a 100
2 b 50
3 d 25
4 e 25

I would break this into two steps, as follows:
d%>%
group_by(species)%>%
# Step 1: count sites by species
summarise(sites_by_species=n_distinct(site))%>%
# Step 2; divide by total number of sites
mutate(frequency=100*sites_by_species/n_distinct(d$site))
Output of which is:
# A tibble: 4 × 3
species sites_by_species frequency
<chr> <int> <dbl>
1 a 4 100
2 b 2 50
3 d 1 25
4 e 1 25

Since we already group by species, n_distinct(site) inside summarize() would count the sites within each species rather than the total number of sites, so I used length(unique(d$site)) for the total.
library(dplyr)
d %>% group_by(species) %>% summarize(freq = n()*100/length(unique(d$site)))
Or more lengthy (trying to stay in dplyr as much as possible):
d %>%
mutate(sites_n = n_distinct(site)) %>%
group_by(species) %>%
summarize(freq = n()*100/max(sites_n))
Output
# A tibble: 4 × 2
species freq
<chr> <dbl>
1 a 100
2 b 50
3 d 25
4 e 25
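One caveat, as far as I can tell: n() counts rows, so the versions above only give the right answer when each species is listed at most once per site. A minimal sketch for data with repeated records, assuming a made-up data frame d_dup (not from the question) in which species a appears twice at site 1:

```r
library(dplyr)

# Hypothetical data: species 'a' is recorded twice at site 1
d_dup <- data.frame(site    = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
                    species = c('a', 'a', 'b', 'a', 'b', 'a', 'd', 'a', 'e'))

# Total number of distinct sites, computed once up front
n_sites <- n_distinct(d_dup$site)

freq_dup <- d_dup %>%
  distinct(site, species) %>%                  # one row per species-site pair
  count(species, name = "n_sites_present") %>% # sites where each species occurs
  mutate(freq = 100 * n_sites_present / n_sites) %>%
  select(species, freq)

freq_dup
```

Dropping duplicates with distinct() first means the later row count is a count of sites, so species a still comes out at 100 rather than 125.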

d %>%
count(species) %>%
mutate(freq=n/n_distinct(d$site)*100) %>%
select(-n)
species freq
1 a 100
2 b 50
3 d 25
4 e 25
I needed to use d$site since site is no longer available through the pipe after the use of count.
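For a quick cross-check without dplyr, a base R sketch along these lines should give the same numbers. It builds a species-by-site presence table, so repeated records of a species at one site are only counted once. The names presence and freq are just illustrative:

```r
d <- data.frame(site    = c(1, 1, 2, 2, 3, 3, 4, 4),
                species = c('a', 'b', 'a', 'b', 'a', 'd', 'a', 'e'))

# Species x site table; TRUE where the species was seen at least once
presence <- table(d$species, d$site) > 0

# Share of sites (columns) in which each species (row) is present
freq <- 100 * rowSums(presence) / ncol(presence)
freq
```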

Related

R tables with proportion

ID <- c(1,2,3,4,5,6,7,8)
Hospital <- c("A","A","A","A","B","B","B","B")
risk <- c("Low","Low","High","High","Low","Low","High","High")
retest <- c(1,0,1,1,1,1,0,1)
df <- data.frame(ID, Hospital, risk, retest)
# freq. table
df %>% group_by(risk, Hospital) %>%
summarise(n=n())%>%
spread(Hospital,n)
# A tibble: 2 × 3
# Groups: risk [2]
risk A B
<chr> <int> <int>
1 High 2 2
2 Low 2 2
#freq. table of retest by risk and Hospital
df %>%
group_by(risk, Hospital) %>%
#summarise(n=n()) %>%
summarise(retestsum = sum(retest))%>%
spread(Hospital, retestsum)
# A tibble: 2 × 3
# Groups: risk [2]
risk A B
<chr> <dbl> <dbl>
1 High 2 1
2 Low 1 2
I want to get the proportions of retest by Hospital and by risk categories.
For example, for Hospital A, low risk: 1 person retested / 2 people = 50%.
Need to create A% B% columns to get the final result of the table below.
Please help me get the prop. columns and also (n=x) part in the final table.
Just divide the second table's numeric values by those of the first. Fortunately elementwise division does not destroy the structure if the two tibbles have the same dimensions:
library(dplyr)
library(tidyr)
# counts per risk x Hospital (df is the data frame from the question)
d2 <- df %>% group_by(risk, Hospital) %>%
summarise(n=n())%>%
spread(Hospital,n)
`summarise()` has grouped output by 'risk'. You can override using the `.groups` argument.
# number retested per risk x Hospital
d3 <- df %>%
group_by(risk, Hospital) %>%
#summarise(n=n()) %>%
summarise(retestsum = sum(retest))%>%
spread(Hospital, retestsum)
You can deliver a proportion or a percentage
# proportion
> d3[-1]/d2[-1]
A B
1 1.0 0.5
2 0.5 1.0
#percentage
> 100*d3[-1]/d2[-1]
A B
1 100 50
2 50 100
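If the "(n=x)" part is wanted in the same cell, one possible sketch is to compute the count and the percentage in a single summarise() and paste them together before reshaping. The label format "50% (n=2)" is my assumption, since the target table is not shown in the question:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID       = 1:8,
  Hospital = c("A", "A", "A", "A", "B", "B", "B", "B"),
  risk     = c("Low", "Low", "High", "High", "Low", "Low", "High", "High"),
  retest   = c(1, 0, 1, 1, 1, 1, 0, 1)
)

prop_tbl <- df %>%
  group_by(risk, Hospital) %>%
  summarise(n   = n(),                 # people in the cell
            pct = 100 * mean(retest),  # share retested, in percent
            .groups = "drop") %>%
  mutate(label = sprintf("%.0f%% (n=%d)", pct, n)) %>%  # assumed label format
  select(risk, Hospital, label) %>%
  pivot_wider(names_from = Hospital, values_from = label)

prop_tbl
```

Because retest is coded 0/1, mean(retest) is the proportion retested, so there is no need to build and divide two separate tables.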

Cumulative sum of unique values based on multiple criteria

I've got a df with multiple columns containing information of species sightings over the years in different sites, therefore each year might show multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise but as I would like to keep all columns I was trying to avoid using it.
Let's say the columns I'm interested to work with at the moment are as follows:
df <- data.frame("Country" = LETTERS[1:5], "Site"=LETTERS[6:10], "species"=1:5, "Year"=1981:2010)
I would like to calculate:
1- The cumulative sum of the records in which a species has been documented within each site creating a new column "Spsum".
2- The number of different years that each species has been seen on a particular site, this could be done as cumulative sum as well, on a new column "nYear".
For example, if species 1 has been recorded 5 times in 1981, and 2 times in 1982 in Site G, Spsum would show 7 (cumulative sum of records) whereas nYear would show 2 as it was spotted over two different years. So far I've got this, but nYear is displaying 0s as a result.
Df1 <- df %>%
filter(Year>1980)%>%
group_by(Country, Site, Species, Year) %>%
mutate(nYear = n_distinct(Year[Species %in% Site]))%>%
ungroup()
Thanks!
This could help, without the need for a join.
df %>% arrange(Country, Site, species, Year) %>%
filter(Year>1980) %>%
group_by(Site, species) %>%
mutate(nYear = length(unique(Year))) %>%
# running record count within each Site/species group (rowid() is data.table, not dplyr)
mutate(spsum = row_number())
# A tibble: 30 x 6
# Groups: Site, species [5]
Country Site species Year nYear spsum
<chr> <chr> <int> <int> <int> <int>
1 A F 1 1981 6 1
2 A F 1 1986 6 2
3 A F 1 1991 6 3
4 A F 1 1996 6 4
5 A F 1 2001 6 5
6 A F 1 2006 6 6
7 B G 2 1982 6 1
8 B G 2 1987 6 2
9 B G 2 1992 6 3
10 B G 2 1997 6 4
# ... with 20 more rows
If the table contains multiple records per Country+Site+species+Year combination, I would first aggregate those and then calculate the cumulative counts from that. The counts can then be joined back to the original table.
Something along these lines:
cumulative_counts <- df %>%
count(Country, Site, species, Year) %>%
group_by(Country, Site, species) %>%
arrange(Year) %>%
mutate(Spsum = cumsum(n), nYear = row_number())
df %>%
left_join(cumulative_counts)
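To check the aggregate-then-join idea against the example from the question (5 records in 1981 and 2 in 1982 for one species at Site G), here is a small sketch. The obs data frame is made up for illustration, and I spell out the join keys instead of relying on the defaults:

```r
library(dplyr)

# Hypothetical records: species 1 at Site G, 5 sightings in 1981 and 2 in 1982
obs <- data.frame(Country = "B", Site = "G", species = 1,
                  Year = c(rep(1981, 5), rep(1982, 2)))

cumulative_counts <- obs %>%
  count(Country, Site, species, Year) %>%    # one row per year with its record count
  group_by(Country, Site, species) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(Spsum = cumsum(n),                  # cumulative number of records
         nYear = row_number()) %>%           # cumulative number of distinct years
  ungroup() %>%
  select(-n)

# Join back so every original row (and column) is kept
result <- obs %>%
  left_join(cumulative_counts, by = c("Country", "Site", "species", "Year"))

result
```

The 1982 rows end up with Spsum 7 and nYear 2, which matches the numbers described in the question.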

Check if values of one dataframe exist in another dataframe in exact order

I have one dataframe of data and multiple "reference" dataframes. I'm trying to automate checking whether the values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as in the reference dataframes. The columns shown below are the important ones, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group 1 would score 1.0 (100% correct) because all its values in A match the values and order of values of A in Reference_A. However, group 2 would score 0.0 for type B because the values in B are out of order compared to Reference_B, and 0.66 for type C because 2/3 values in C match the values and order of values in Reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
Update: Thank you to everyone who has provided solutions! These solutions are great for the toy dataset, but I have not yet been able to adapt them to datasets with more columns. Again, as I wrote in my post, the columns listed above are the important ones; I'd prefer not to drop the other columns unless necessary.
We may also do this with mget to return a list of data.frames, bind them together, and do a group by mean of logical vector
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
bind_rows() %>%
bind_cols(df1) %>%
group_by(group, type = type...1) %>%
summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
This is another tidyverse solution. Here, I am adding a counter (i.e. rowname) to both reference and data. Then I join them together on type and rowname. At the end, I summarize them on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)
list(`Reference A`, `Reference B`, `Reference C`) %>%
map(., rownames_to_column) %>%
bind_rows %>%
left_join({Dataframe %>%
group_split(type) %>%
map(., rownames_to_column) %>%
bind_rows},
. , by=c("type", "rowname")) %>%
group_by(type) %>%
dplyr::summarise(group = head(group,1),
score = sum(value.x == value.y)/n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
nest_by(type, .key = "ref") %>%
ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
nest_by(group, type, .key = "data") %>%
left_join(Reference, by = "type") %>%
mutate(
score = purrr::map2_dbl(data, ref, ~ {
if (length(.x) == 0 || length(.y) == 0) return(numeric(0))
if (length(.x) != length(.y)) return(0)
sum((is.na(.x) & is.na(.y)) | .x == .y) / length(.x)
})
) %>%
select(-data, -ref) %>%
ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667
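Regarding the update about extra columns, a sketch that does not depend on column positions: number the rows within each group/type in the data and within each type in the references, then join on type plus that position. The extra column and the pos / ref_value names are assumptions for illustration only:

```r
library(dplyr)

Dataframe <- data.frame(
  group = c(1, 1, 1, 2, 2, 2, 2, 2),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Dolores", "Elsie",
            "Maeve", "Charlotte", "Bernard"),
  extra = 1:8   # hypothetical unrelated column, carried along untouched
)

Reference <- bind_rows(
  data.frame(type = "A", value = c("Teddy", "William", "Lars")),
  data.frame(type = "B", value = c("Elsie", "Dolores")),
  data.frame(type = "C", value = c("Maeve", "Hale", "Bernard"))
) %>%
  group_by(type) %>%
  mutate(pos = row_number()) %>%   # position within each reference
  ungroup() %>%
  rename(ref_value = value)

scores <- Dataframe %>%
  group_by(group, type) %>%
  mutate(pos = row_number()) %>%   # position within each group/type
  ungroup() %>%
  left_join(Reference, by = c("type", "pos")) %>%
  group_by(group, type) %>%
  # a position with no reference value counts as a mismatch
  summarise(score = mean(!is.na(ref_value) & value == ref_value),
            .groups = "drop")

scores
```

Note that this only scores the positions present in the data; if a reference is longer than the matching data group, the missing tail is not penalised, so that case would need its own rule.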

How to calculate overlap between different categories in R

I have read around the forum but I have not found my desired answer.
I have the following dataset:
Dataset
The important columns are TGEClass and peptide:
I would like to calculate the overlap between the different TGEclasses
I used calculate.overlap(TGE) from VennDiagram but that does not give me the desired result;
The R code with a dummy dataset:
# A simple single-set diagram
C1 <- as.data.frame(letters[1:10])
C2 <- as.data.frame(letters[1:10])
data =cbind(C1,C2)
overlap <- calculate.overlap(data)
overlap = as.data.frame(overlap)
The result:
a1 a2 a3
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
The desired result will look like this:
TGEClass
Desired Result
10 genes are expressed in both TGE classes
50 genes in only alternative
60 genes in only short
It is basically a Venn diagram, but in table format.
Please note that each gene has a different number of TGE class categories.
I am very new to R so any help will be greatly appreciated.
Thanks very much,
Ishack
The output of VennDiagram::calculate.overlap() is not very convenient for later use (here using as.data.frame you just got lucky as both vectors are of same size).
You can actually use tidyverse to compute it yourself, and return the summary:
library(tidyverse)
list(
"Cardiome" = letters[1:10],
"SuperSet" = letters[8:24]
) %>%
map2_dfr(., names(.), ~tibble::enframe(.x) %>% mutate(group=.y)) %>%
add_count(value) %>%
group_by(value) %>%
summarise(group2 = ifelse(n()==2, "both", group)) %>%
count(group2)
#> # A tibble: 3 x 2
#> group2 n
#> <chr> <int>
#> 1 both 3
#> 2 Cardiome 7
#> 3 SuperSet 14
If you want to stick with the output of VennDiagram::calculate.overlap(), you can use something like:
library(tidyverse)
overlap <- VennDiagram::calculate.overlap(
x = list(
"Cardiome" = letters[1:10],
"SuperSet" = letters[8:24]
)
);
map2_dfr(overlap, names(overlap), ~tibble::enframe(.x) %>% mutate(group=.y)) %>%
spread(group, group) %>%
mutate(a1_only = !is.na(a1) & is.na(a2),
a2_only = !is.na(a2) & is.na(a1),
both = !is.na(a2) & !is.na(a1)) %>%
summarise_at(c("a1_only", "a2_only", "both"), sum) %>%
gather(group, number, everything())
#> # A tibble: 3 x 2
#> group number
#> <chr> <int>
#> 1 a1_only 10
#> 2 a2_only 17
#> 3 both 0
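For just two sets, the same three numbers can also be sketched with base R set operations, which may be easier to verify by hand. The vectors reuse the dummy sets from above, and overlap_tbl is just an illustrative name:

```r
cardiome <- letters[1:10]
superset <- letters[8:24]

# Venn-style summary as a table: shared elements and elements unique to each set
overlap_tbl <- data.frame(
  group = c("both", "Cardiome only", "SuperSet only"),
  n = c(length(intersect(cardiome, superset)),
        length(setdiff(cardiome, superset)),
        length(setdiff(superset, cardiome)))
)

overlap_tbl
```

With real data, the two vectors would be something like the unique peptides per TGE class, for example obtained by splitting the peptide column by class.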

How to use mutate iteratively over multiple rows in r

I am trying to calculate the percent difference in ht between all possible pairs of data, per group of individuals, as well as the time difference between the ht measures. This is my data:
hc1<- data.frame(id= c(1,1,1,2,2,2,3,3),
testoccasion= c(1,2,3,1,2,3,1,2),
ht= c(0.2,0.1,0.8,0.9,1.0,0.5,0.4,0.8),
time= c(5,4,8,5,6,5,2,1))
This is my code.
library(dplyr)
a<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht))/lag(ht)*100) %>%
mutate(t = time-lag(time))
b<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht,2))/lag(ht,2)*100) %>%
mutate(t = time-lag(time,2))
c<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht,3))/lag(ht,3)*100) %>%
mutate(t = time-lag(time,3))
diff<-rbind(a,b,c)
diff<-na.omit(diff)
I am curious how I can make this code shorter. I want to be able to find the difference across all possible pairs of ht, for all test occasions, where the number of test occasions differs between individual ids. It would be great if I didn't have to do it iteratively like this, because I have a huge dataset. Thanks!
We can use map to loop the n used in lag
library(tidyverse)
map_df(1:3, ~
hc1 %>%
group_by(id) %>%
arrange(id, testoccasion) %>%
mutate(fd = (ht -lag(ht, .x))/lag(ht, .x) * 100,
t = time -lag(time, .x))) %>%
na.omit
# A tibble: 7 x 6
# Groups: id [3]
# id testoccasion ht time fd t
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 2 0.1 4 -50 -1
#2 1 3 0.8 8 700 4
#3 2 2 1 6 11.1 1
#4 2 3 0.5 5 -50 -1
#5 3 2 0.8 1 100 -1
#6 1 3 0.8 8 300. 3
#7 2 3 0.5 5 -44.4 0
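If the maximum number of test occasions is not known in advance, an alternative sketch is to pair every row with every earlier row of the same id through a self-join, so no lag distance has to be hard-coded. The _prev column names are my own choice, and on a very large dataset the intermediate join may get big, so treat this as a starting point:

```r
library(dplyr)

hc1 <- data.frame(id           = c(1, 1, 1, 2, 2, 2, 3, 3),
                  testoccasion = c(1, 2, 3, 1, 2, 3, 1, 2),
                  ht           = c(0.2, 0.1, 0.8, 0.9, 1.0, 0.5, 0.4, 0.8),
                  time         = c(5, 4, 8, 5, 6, 5, 2, 1))

# Copy of the data with renamed columns, representing the earlier measurement
prev <- hc1 %>%
  rename(testoccasion_prev = testoccasion, ht_prev = ht, time_prev = time)

pairs <- merge(hc1, prev, by = "id") %>%        # all row pairs within each id
  filter(testoccasion_prev < testoccasion) %>%  # keep each pair once, later vs earlier
  mutate(fd = (ht - ht_prev) / ht_prev * 100,   # percent difference in ht
         t  = time - time_prev)                 # time difference

pairs
```

For the toy data this gives the same 7 pairs as the map_df() version, with the earlier occasion kept alongside each row so it is clear which pair a difference belongs to.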
