How to indicate a higher hierarchy level using dplyr?

I have a dataframe (df) in which each row represents the start (Start) and the end (End) of a specific habitat (Habitat) within a transect (Transect) and site (Site) in meters. It is important to note that the length of the transects varies within and among sites. As an example:
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df
Site Transect Habitat Start End
1 A 1 X 0.0 2.8 # Habitat `X` is between the meters 0 and 2.8
2 A 1 Y 2.8 3.4 # Habitat `Y` is between the meters 2.8 and 3.4
3 A 1 X 3.4 5.0 # Habitat `X` is between the meters 3.4 and 5.0
4 A 1 Z 5.0 10.0 # Habitat `Z` is between the meters 5 and 10.0
5 A 2 Z 0.0 1.5
6 A 2 Y 1.5 5.0
7 A 2 X 5.0 8.0
8 A 2 Z 8.0 12.0
9 A 2 X 12.0 15.0
10 B 1 X 0.0 2.0
11 B 1 Z 2.0 5.0
12 B 1 X 5.0 7.5
13 B 1 Y 7.5 20.0
14 B 2 Z 0.0 4.0
15 B 2 X 4.0 8.0
16 B 2 Y 8.0 12.0
17 B 2 Z 12.0 15.0
In this example, habitat X appears twice in transect 1 of site A. We can also see that the total lengths of transects 1 and 2 in site A are 10 and 15 m, respectively. In site B, the total lengths of transects 1 and 2 are 20 and 15 m, respectively.
What I want is to calculate, per Site and Transect, the percentage of the transect length (in meters) that each Habitat represents. For example, in site A and transect 1, habitat X covers 4.4 meters of a total transect length of 10 meters. In site A and transect 2, habitat X covers 6 meters of a total transect length of 15 meters.
To this aim, the first thing I do is calculate the length (Length) in meters of each habitat record (= row):
df$Length <- df$End - df$Start
Then I want to calculate, by site and transect, the percentage of the total transect length that each habitat covers. I tried this:
df2 <- as.data.frame(df %>% group_by(Site, Transect, Habitat) %>% summarise(Percentage = (sum(Length)/max(End))*100))
I want to replace max(End) with an expression that represents the total length of the transect. Right now max(End) is the last meter (End) at which a specific habitat was present. How can I get the maximum value of End within a specific Site and Transect, rather than within a specific Habitat?
How can I do it? My desired output would be this:
Site Transect Habitat Percentage
1 A 1 X 44.0
2 A 1 Y 6.0
3 A 1 Z 50.0
4 A 2 X 40.0
5 A 2 Y 23.3
6 A 2 Z 36.7
7 B 1 X 22.5
8 B 1 Y 62.5
9 B 1 Z 15.0
10 B 2 X 26.7
11 B 2 Y 26.7
12 B 2 Z 46.7
Does anyone know how to do it?
Thanks in advance!

With dplyr, when you have different levels of hierarchy that need managing, you may need multiple group_by() statements. In the code below, I use group_by(Site, Transect, Habitat) to calculate the total length of each habitat in the Site and Transect and then group_by(Site, Transect) to calculate the percentage.
library(dplyr)
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df %>%
mutate(length = End-Start) %>%
group_by(Site, Transect, Habitat) %>%
summarise(tot_length = sum(length)) %>%
group_by(Site, Transect) %>%
mutate(percentage = 100*tot_length/sum(tot_length))
#> `summarise()` has grouped output by 'Site', 'Transect'. You can override using
#> the `.groups` argument.
#> # A tibble: 12 × 5
#> # Groups: Site, Transect [4]
#> Site Transect Habitat tot_length percentage
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 A 1 X 4.4 44
#> 2 A 1 Y 0.6 6
#> 3 A 1 Z 5 50
#> 4 A 2 X 6 40
#> 5 A 2 Y 3.5 23.3
#> 6 A 2 Z 5.5 36.7
#> 7 B 1 X 4.5 22.5
#> 8 B 1 Y 12.5 62.5
#> 9 B 1 Z 3 15
#> 10 B 2 X 4 26.7
#> 11 B 2 Y 4 26.7
#> 12 B 2 Z 7 46.7
Created on 2023-02-16 by the reprex package (v2.0.1)
In your code from above, when you are calculating the percentage, your data are still grouped by Habitat, so the percentage you are calculating is within the Habitat rather than across habitats within Site and Transect pairs.
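If you would rather keep the max(End) idea from the question, one option is to compute it at the transect level before grouping by habitat. This is only a sketch: it assumes every transect starts at 0 and has no gaps, so that max(End) equals the total transect length. Transect_length is a column name introduced here for illustration.

```r
library(dplyr)

df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start = c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End = c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

res <- df %>%
  mutate(Length = End - Start) %>%
  # Transect level: last meter recorded in each Site/Transect pair
  # (assumed to equal the total transect length)
  group_by(Site, Transect) %>%
  mutate(Transect_length = max(End)) %>%
  # Habitat level: habitat meters relative to the transect length
  group_by(Site, Transect, Habitat) %>%
  summarise(Percentage = 100 * sum(Length) / first(Transect_length),
            .groups = "drop")

res
```

If transects can have gaps or start above 0, dividing by sum(Length) per Site and Transect, as in the answer above, is the safer choice.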

Related

How to include factor levels when they are missing in some replicas in R?

I have a dataframe df with the percentages (df$Porcentage) of different habitats (df$Habitat) for specific tracks (df$Replica) in different sites (df$Site). As an example:
df <- data.frame(Site = c("A","A","A","A","A","B","B","B","B","B","B"),
Replica = c(1,1,1,2,2,1,1,1,2,2,2),
Habitat = c("X","Y","Z","X","Y","X","Y","M","X","M","Z"),
Porcentage = c(46,38,16,40,60,20,60,20,35,55,10))
df
Site Replica Habitat Porcentage
1 A 1 X 46
2 A 1 Y 38
3 A 1 Z 16
4 A 2 X 40
5 A 2 Y 60
6 B 1 X 20
7 B 1 Y 60
8 B 1 M 20
9 B 2 X 35
10 B 2 M 55
11 B 2 Z 10
Here, for example, the percentage of habitat X at site A is 46, while there is no habitat M (which is present in site B). The sum of all values for a specific Replica within a site is 100. For example, in site A, for Replica == 2, habitat X and Y sum 100, which means there is no other habitat present in this replica/track.
I want to calculate both the mean percentage (Mean_Porcentage) of each habitat per Site and the standard error (SE) of the mean. The mean is calculated across Replica, since for each Site I have repeated measures (= Replica).
df %>% group_by(Site,Habitat) %>% summarise(Mean_Porcentage = mean(Porcentage),SE = sd(Porcentage)/sqrt(length(Porcentage)))
# A tibble: 7 × 4
# Groups: Site [2]
Site Habitat Mean_Porcentage SE
<chr> <chr> <dbl> <dbl>
1 A X 43 3
2 A Y 49 11
3 A Z 16 NA
4 B M 37.5 17.5
5 B X 27.5 7.5
6 B Y 60 NA
7 B Z 10 NA
The problem is that Mean_Porcentage, and thus SE, are not calculated properly. For example, in site A habitat Z is present in Replica == 1 (= 16%) but not in Replica == 2 (= 0%). Thus, the mean percentage (Mean_Porcentage) of Z in site A should be 8 (= [16% + 0%]/2) and SE should be 8. Also, habitat M is not present in site A but it is in site B, so I want habitat M to appear for site A with a Mean_Porcentage of 0.
My desired result should be a dataframe as this:
df2
Site Habitat Mean_Porcentage SE
1 A M 0.0 0
2 A X 43.0 3
3 A Y 49.0 11
4 A Z 8.0 8
5 B M 37.5 17.5
6 B X 27.5 7.5
7 B Y 30.0 30
8 B Z 5.0 5
Does anyone know how to do it?
Thanks in advance!
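You can try xtabs and proportions. Because the percentages within each replica sum to 100, each site's row proportions equal the mean over its replicas, with absent habitats counted as 0.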
proportions(xtabs(Porcentage ~ Site + Habitat, df), 1) * 100
# Habitat
#Site M X Y Z
# A 0.0 43.0 49.0 8.0
# B 37.5 27.5 30.0 5.0
as.data.frame(proportions(xtabs(Porcentage ~ Site + Habitat, df), 1) * 100)
# Site Habitat Freq
#1 A M 0.0
#2 B M 37.5
#3 A X 43.0
#4 B X 27.5
#5 A Y 49.0
#6 B Y 30.0
#7 A Z 8.0
#8 B Z 5.0
Adding the SE as described in the question (the \(x) lambda shorthand needs R 4.1 or later):
x <- as.data.frame(proportions(xtabs(Porcentage ~ Site + Habitat, df), 1) * 100)
x <- merge(x, aggregate(cbind(SE = Porcentage) ~ Site + Habitat, df, \(x) sd(x)/sqrt(length(x))), all.x=TRUE)
i <- is.na(x$SE)
x$SE[i] <- x$Freq[i]
x
# Site Habitat Freq SE
#1 A M 0.0 0.0
#2 A X 43.0 3.0
#3 A Y 49.0 11.0
#4 A Z 8.0 8.0
#5 B M 37.5 17.5
#6 B X 27.5 7.5
#7 B Y 30.0 30.0
#8 B Z 5.0 5.0
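Note that filling the missing SE with the mean only gives the right value here because each site has exactly two replicas. A more general sketch, assuming the tidyr package is acceptable, is to add the missing Site/Replica/Habitat combinations explicitly as 0 and then summarise as in the question:

```r
library(dplyr)
library(tidyr)

df <- data.frame(Site = c("A","A","A","A","A","B","B","B","B","B","B"),
                 Replica = c(1,1,1,2,2,1,1,1,2,2,2),
                 Habitat = c("X","Y","Z","X","Y","X","Y","M","X","M","Z"),
                 Porcentage = c(46,38,16,40,60,20,60,20,35,55,10))

res <- df %>%
  # Cross every observed Site/Replica pair with every habitat;
  # habitats absent from a replica get a percentage of 0
  complete(nesting(Site, Replica), Habitat, fill = list(Porcentage = 0)) %>%
  group_by(Site, Habitat) %>%
  summarise(Mean_Porcentage = mean(Porcentage),
            SE = sd(Porcentage) / sqrt(n()),
            .groups = "drop")

res
```

nesting(Site, Replica) keeps only the Site/Replica pairs that actually occur in the data, so no extra replicas are invented.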

Select varying number of top_n for different groups using dplyr

I have the following dataframe and would prefer to solve this problem with dplyr.
For each zone I want at least two values, and values > 4.0 are preferred.
Therefore, for zone 10 all values (being > 4.0) are kept. For zone 20, the top two values are picked, and similarly for zone 30.
zone <- c(rep(10,4), rep(20, 4), rep(30, 4))
set.seed(1)
value <- c(4.5,4.3,4.6, 5,5, rep(3,7)) + round(rnorm(12, sd = 0.1),1)
df <- data.frame(zone, value)
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 2.9
7 20 3.0
8 20 3.1
9 30 3.1
10 30 3.0
11 30 3.2
12 30 3.0
The desired output is as follows
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 3.1
7 30 3.1
8 30 3.2
I thought of using top_n but it picks the same number for each zone.
You could dynamically calculate n in top_n
library(dplyr)
df %>% group_by(zone) %>% top_n(max(sum(value > 4), 2), value)
# zone value
# <dbl> <dbl>
#1 10 4.4
#2 10 4.3
#3 10 4.5
#4 10 5.2
#5 20 5
#6 20 3.1
#7 30 3.1
#8 30 3.2
Alternatively, filter within each zone, keeping the two largest values plus any value above 4:
library(dplyr)
df %>%
group_by(zone) %>%
filter(row_number(-value) <= 2 | value > 4)

Summarizing columns using a vector with dplyr

I want to calculate the mean of certain columns (names stored in a vector), while grouping against a column. Here is a reproducible example:
Cities <- c("London","New_York")
df <- data.frame(Grade = c(rep("Bad",2),rep("Average",4),rep("Good",4)),
London = seq(1,10,1),
New_York = seq(11,20,1),
Shanghai = seq(21,30,1))
> df
Grade London New_York Shanghai
1 Bad 1 11 21
2 Bad 2 12 22
3 Average 3 13 23
4 Average 4 14 24
5 Average 5 15 25
6 Average 6 16 26
7 Good 7 17 27
8 Good 8 18 28
9 Good 9 19 29
10 Good 10 20 30
The output I want:
> df %>% group_by(Grade) %>% summarise(London = mean(London), New_York = mean(New_York))
# A tibble: 3 x 3
Grade London New_York
<fct> <dbl> <dbl>
1 Average 4.5 14.5
2 Bad 1.5 11.5
3 Good 8.5 18.5
I would like to select the columns named in the vector Cities (without typing their names) inside summarise, while keeping their original names in the output.
You can do:
df %>%
group_by(Grade) %>%
summarise_at(vars(one_of(Cities)), mean)
Grade London New_York
<fct> <dbl> <dbl>
1 Average 4.5 14.5
2 Bad 1.5 11.5
3 Good 8.5 18.5
From documentation:
one_of(): Matches variable names in a character vector.
vars() can also take a character vector of column names directly. The select helpers (matches(), starts_with(), ends_with()) are for cases where there is a pattern to match. In the current implementation vars() is more general: it can select columns and also deselect them (with -).
library(dplyr)
df %>%
group_by(Grade) %>%
summarise_at(vars(Cities), mean)
# A tibble: 3 x 3
# Grade London New_York
# <fct> <dbl> <dbl>
#1 Average 4.5 14.5
#2 Bad 1.5 11.5
#3 Good 8.5 18.5
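In dplyr 1.0.0 and later, summarise_at() is superseded by across(), and all_of() makes it explicit that Cities is an external character vector rather than a column. A minimal sketch of the same summary in that style:

```r
library(dplyr)

Cities <- c("London", "New_York")
df <- data.frame(Grade = c(rep("Bad", 2), rep("Average", 4), rep("Good", 4)),
                 London = seq(1, 10, 1),
                 New_York = seq(11, 20, 1),
                 Shanghai = seq(21, 30, 1))

res <- df %>%
  group_by(Grade) %>%
  # all_of() errors if a name in Cities is not a column of df
  summarise(across(all_of(Cities), mean))

res
```

The columns keep their original names, and Shanghai is dropped because it is not listed in Cities.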

Fastest way to loop through columns and calculate IQR by group, then calculate proportion IQR for each group based on reference group?

I have a large dataset (about 12,000 columns) that looks like this
> df
ID Group val1 val2 val3
1 01 a 3 3 3
2 02 a 4 4 4
3 03 b 6 6 7
4 04 c 10 10 19
5 05 b 2 2 2
6 06 b 4 4 4
7 07 c 8 8 8
8 08 c 12 12 12
I want to loop through each column and get an IQR for each Group.
Then, for each column and group, I want to calculate a deltaIQR relative to group a.
For example
delta IQR of B = ( IQR of group B - IQR of Group A) / IQR of Group A
delta IQR of C = (IQR of group C - IQR of Group A) / IQR of Group A
What is the most efficient way to do this?
I attempted a dplyr summarise by Group solution but the df is too big. And also I need to calculate quantiles first, etc. So it gets more unwieldy...
Using the dplyr solution before brings in some errors
df %>%
group_by(Group) %>%
summarise_at(vars(matches('val')), IQR) %>%
rename_at(-1, ~ paste0(., "_IQR")) %>%
mutate_at(vars(matches('val')), list(delta= ~ (. - .[1])/.[1]))
In my actual dataset
> temp
v6599_IQR v6599_IQR_delta v1554_IQR v1554_IQR_delta
1 0.00191803 0.000000e+00 0.001794153 0.000000e+00
2 0.62698976 3.258926e+02 1.722508234 9.590677e+02
3 0.00191803 7.235440e-15 0.001794153 4.641005e-14
4 0.00191803 -3.617720e-14 2.155928869 1.200642e+03
There seems to be an error in the deltaIQR for rows 3 and 4: for the first column, the delta IQR for both rows should be 0.
Update:
To calculate deltaIQR I am using dplyr.
library(dplyr)
df %>%
group_by(Group) %>%
summarise_at(vars(matches('val')), IQR) %>%
rename_at(-1, ~ paste0(., "_IQR")) %>%
mutate_at(vars(matches('val')), list(delta= ~ (. - .[1])/.[1]))
#> # A tibble: 3 x 7
#> Group val1_IQR val2_IQR val3_IQR val1_IQR_delta val2_IQR_delta val3_IQR_delta
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a 0.5 0.5 0.5 0 0 0
#> 2 b 2 2 2.5 3 3 4
#> 3 c 2 2 5.5 3 3 10
Thanks to akrun for his comment on dplyr solution
Looping through columns to calculate IQR can be done in base:
sapply(df[,3:5], function(x) tapply(x, df$Group, IQR))
#> val1 val2 val3
#> a 0.5 0.5 0.5
#> b 2.0 2.0 2.5
#> c 2.0 2.0 5.5
Data:
df <- read.table(text="ID Group val1 val2 val3
01 a 3 3 3
02 a 4 4 4
03 b 6 6 7
04 c 10 10 19
05 b 2 2 2
06 b 4 4 4
07 c 8 8 8
08 c 12 12 12", header=T)
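The deltaIQR can also be computed in base R from the IQR matrix, which may scale better to many columns. This is a sketch that assumes group a is the reference and that the value columns are columns 3 to 5, as in the sample data. Values such as 7.2e-15 in the question's output are floating-point round-off rather than real differences, so compare with a tolerance (for example all.equal()) or round the result.

```r
df <- read.table(text = "ID Group val1 val2 val3
01 a 3 3 3
02 a 4 4 4
03 b 6 6 7
04 c 10 10 19
05 b 2 2 2
06 b 4 4 4
07 c 8 8 8
08 c 12 12 12", header = TRUE)

# Groups in rows, value columns in columns
iqr <- sapply(df[, 3:5], function(x) tapply(x, df$Group, IQR))

# Subtract the reference row (group a) from every row, then divide by it
ref <- iqr["a", ]
delta <- sweep(sweep(iqr, 2, ref, "-"), 2, ref, "/")

# Round away floating-point noise before inspecting
round(delta, 10)
```

The reference row itself comes out as 0 in every column.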

Mean of variable of alters in R

I have a network data set of a school, where I have the level of depression for each respondent. The data looks like this:
id depression friendid_1 friendid_2 friendid_3 friendid_4
1 1.0 7 3 6 5
2 0.6 6 4 NA NA
3 0.0 1 4 5 7
4 1.8 9 3 8 2
I want to add a variable to the data that is the mean depression of the respondent's network (so averaging the depression level of all the alters who also exist in this data as respondents).
Any help would be great!
With this type of "connected" problem I like to use the igraph package to treat the data as a graph/network. So with your sample data
dd<-read.table(text="id depression friendid_1 friendid_2 friendid_3 friendid_4
1 1.0 7 3 6 5
2 0.6 6 4 NA NA
3 0.0 1 4 5 7
4 1.8 9 3 8 2", header=TRUE)
We can create a graph of your data with
library(igraph)
library(dplyr) #for select
library(tidyr) #for gather
gg <- dd %>% select(-depression) %>%
gather(friend, friend_id, -id) %>%
select(-friend) %>%
na.omit() %>%
graph_from_data_frame(directed=FALSE) %>% #this assumes friendships are mutual
simplify()
gg <- set_vertex_attr(gg, "depression", V(gg)[as.character(dd$id)], dd$depression) #index vertices by name, not by position
plot(gg)
Then you can loop over all the vertices and calculate the mean depression of the neighbors
sapply(V(gg), function(v) {
mean(neighbors(gg, v)$depression, na.rm=TRUE)
})
# 1 2 3 4 7 6 9 5 8
# 0.0 1.8 1.4 0.3 0.5 0.8 1.8 0.5 1.8
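If you only need the value for the respondents themselves and want to avoid the extra dependency, a base R sketch using match() is shown below. Unlike the igraph version, it does not assume friendships are mutual: it averages only over the friends each respondent named. alter_mean is a column name introduced here for illustration.

```r
dd <- read.table(text = "id depression friendid_1 friendid_2 friendid_3 friendid_4
1 1.0 7 3 6 5
2 0.6 6 4 NA NA
3 0.0 1 4 5 7
4 1.8 9 3 8 2", header = TRUE)

friend_cols <- grep("^friendid_", names(dd), value = TRUE)

# For each respondent, look up the depression score of every named friend;
# friends who are not respondents (or are NA) are dropped by na.rm = TRUE
dd$alter_mean <- apply(dd[friend_cols], 1, function(friends) {
  mean(dd$depression[match(friends, dd$id)], na.rm = TRUE)
})

dd[, c("id", "depression", "alter_mean")]
```

A respondent with no friends in the data gets NaN, which you may want to recode as NA.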
