Create a temporary group in dplyr group_by - r

I would like to group all members of the same genus together for some summary statistics, but would like to maintain their full names in the original dataframe. I know that I could change their names or create a new column in the original dataframe, but I am looking for a more elegant solution. I would like to implement this in R with the dplyr package.
Example data here https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Aeb176981-1909-4d6d-ac07-3406e4efc43f
I would like to group all clams of the genus Macoma as one group, "Macoma sp.", ideally creating this grouping within the following pipeline, perhaps before the group_by(site_code, species_scientific) step:
summary <- data %>%
  group_by(site_code, species_scientific) %>%
  summarize(mean_size = mean(width_mm))
Note that there are multiple Macoma xxx species, and multiple other genera that I want to pool in the same way.

We can replace the elements of species_scientific that contain the substring 'Macoma' (detected with str_detect) with 'Macoma', use that as the grouping column, and get the mean:
library(dplyr)
library(stringr)
data %>%
  mutate(species_scientific = replace(species_scientific,
                                      str_detect(species_scientific, "Macoma"),
                                      "Macoma")) %>%
  group_by(site_code, species_scientific) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
Output:
# A tibble: 97 × 3
site_code species_scientific mean_size
<chr> <chr> <dbl>
1 H_01_a Clinocardium nuttallii 33.9
2 H_01_a Macoma 41.0
3 H_01_a Protothaca staminea 37.3
4 H_01_a Saxidomus gigantea 56.0
5 H_01_a Tresus nuttallii 100.
6 H_02_a Clinocardium nuttallii 35.1
7 H_02_a Macoma 41.3
8 H_02_a Protothaca staminea 38.0
9 H_02_a Saxidomus gigantea 54.7
10 H_02_a Tresus nuttallii 50.5
# … with 87 more rows
If the intention is to keep only the first word in 'species_scientific':
data %>%
  group_by(genus = str_remove(species_scientific, "\\s+.*"), site_code) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
Output:
# A tibble: 97 × 3
genus site_code mean_size
<chr> <chr> <dbl>
1 Clinocardium H_01_a 33.9
2 Clinocardium H_02_a 35.1
3 Clinocardium H_03_a 37.5
4 Clinocardium H_04_a 48.2
5 Clinocardium H_05_a 37.6
6 Clinocardium H_06_a 38.7
7 Clinocardium H_07_a 40.2
8 Clinocardium L_01_a 44.4
9 Clinocardium L_02_a 54.8
10 Clinocardium L_03_a 61.1
# … with 87 more rows
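Since the question mentions several genera to pool, here is a hedged sketch that collapses any listed genus to "Genus sp." while leaving other species untouched. The toy tibble below stands in for the linked dataset; the column names are taken from the question, and genera_to_pool is an assumed helper vector you would fill in yourself:

```r
library(dplyr)
library(stringr)

# Toy stand-in for the linked clam data (column names taken from the question)
data <- tibble(
  site_code = rep("H_01_a", 4),
  species_scientific = c("Macoma nasuta", "Macoma inquinata",
                         "Saxidomus gigantea", "Tresus nuttallii"),
  width_mm = c(40, 42, 56, 100)
)

# Genera to pool into "<Genus> sp." -- extend for the other genera you mentioned
genera_to_pool <- c("Macoma", "Tresus")

pooled <- data %>%
  mutate(genus = str_extract(species_scientific, "^\\w+"),
         species_scientific = if_else(genus %in% genera_to_pool,
                                      str_c(genus, " sp."),
                                      species_scientific)) %>%
  group_by(site_code, species_scientific) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = "drop")
pooled
```

With the toy data, the two Macoma rows collapse into a single "Macoma sp." group, matching the labelling the question asked for.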

Related

How to extract the timestamp difference of the first 2 equal values from 2 data frames

I am working with time series data. I have 2 different time series, each with 2 columns and a different number of rows.
df_1=read.table("data_1")
df_2=read.table("data_2")
I would like to compare the values of df_1$V2 (the second column) with the values in df_2$V2; where they are equal, calculate the time difference between them (df_2$V1[j] - df_1$V1[i]).
Here is my code:
vect <- c()
horo <- c()
j <- 1
for (i in 2:nrow(df_1)) {
  for (j in 1:nrow(df_2)) {
    if (df_1$V2[i] == df_2$V2[j]) {
      calc <- abs(df_2$V1[j] - df_1$V1[i])
      vect <- append(vect, calc)
    }
  }
}
The problems are:
there can be many elements in df_2$V2 that are equal to df_1$V2[i], and I only want the first value;
I know that in my data if (for example) df_1$V2[1] == df_2$V2[8], then on the next iteration there is no need to compare df_1$V2[2] with the first 8 values of df_2$V2, and I can start comparing from df_2$V2[9];
it takes too much time because of the for loop, so is there another way to do it?
Thank you for your help!
data example:
df_1=
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851
....
df_2=
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942
You've mentioned that your for-loop is slow; it's generally advisable to avoid writing your own for-loops in R and to let built-in vectorisation handle things efficiently.
Here's a non-for-loop-dependent solution using the popular dplyr package from the tidyverse.
Read in data
First, let's read in your data for the sake of reproducibility. Note that I've added names to your data, because unnamed data is confusing and hard to work with.
require(vroom) # useful package for flexible data reading
df_1 <- vroom(
"timestamp value
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851")
#> Rows: 20 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_2 <- vroom(
"timestamp value
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942")
#> Rows: 17 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Comparing time differences for matching values
Let's go through the solution step-by-step:
Add id for each row of df_1
We'll need this later to remove unwanted values.
require(dplyr)
#> Loading required package: dplyr
df_1 <- mutate(df_1, id = paste0("id_", row_number()) |>
  ## For the sake of neatness, we'll make id an ordered factor
  ## that's ordered by its current arrangement
  ordered() |>
  forcats::fct_inorder())  # fct_inorder() comes from forcats, not dplyr
df_1 <- relocate(df_1, id)
head(df_1)
#> # A tibble: 6 × 3
#> id timestamp value
#> <ord> <dbl> <dbl>
#> 1 id_1 15.9 2633
#> 2 id_2 15.9 2664
#> 3 id_3 15.9 2699
#> 4 id_4 15.9 3068
#> 5 id_5 15.9 4256
#> 6 id_6 15.9 5444
Join rows from df_2 on matching values
joined <- left_join(df_1, df_2, by = "value", suffix = c(".1", ".2"))
head(joined)
#> # A tibble: 6 × 4
#> id timestamp.1 value timestamp.2
#> <ord> <dbl> <dbl> <dbl>
#> 1 id_1 15.9 2633 16.0
#> 2 id_1 15.9 2633 16.0
#> 3 id_2 15.9 2664 NA
#> 4 id_3 15.9 2699 NA
#> 5 id_4 15.9 3068 16.0
#> 6 id_5 15.9 4256 NA
Get the first returned value for each value in df_1
We can do this by grouping by our id column, then just getting the first() row from each group.
joined <- group_by(joined, id) # group by row identifiers
summary <- summarise(joined, across(everything(), first))
head(summary)
#> # A tibble: 6 × 4
#> id timestamp.1 value timestamp.2
#> <ord> <dbl> <dbl> <dbl>
#> 1 id_1 15.9 2633 16.0
#> 2 id_2 15.9 2664 NA
#> 3 id_3 15.9 2699 NA
#> 4 id_4 15.9 3068 16.0
#> 5 id_5 15.9 4256 NA
#> 6 id_6 15.9 5444 NA
Get time difference
A simple case of using mutate() to subtract timestamp.1 from timestamp.2:
times <- mutate(summary, time_diff = timestamp.2 - timestamp.1) |>
relocate(value, .after = id) # this is just for presentation
## You may want to remove rows with no time diff?
filter(times, !is.na(time_diff))
#> # A tibble: 8 × 5
#> id value timestamp.1 timestamp.2 time_diff
#> <ord> <dbl> <dbl> <dbl> <dbl>
#> 1 id_1 2633 15.9 16.0 0.0167
#> 2 id_4 3068 15.9 16.0 0.0178
#> 3 id_7 6632 15.9 16.0 0.0238
#> 4 id_10 10196 15.9 16.0 0.0329
#> 5 id_12 12572 16.0 16.0 0.0219
#> 6 id_15 16136 16.0 16.0 0.0339
#> 7 id_18 18624 16.0 16.0 0.0322
#> 8 id_19 18659 16.0 16.1 0.0175
Created on 2022-10-25 with reprex v2.0.2
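If you prefer to stay in base R, match() already implements the "first equal value" rule without any loop: it returns the index of the first occurrence of each value, or NA when there is none. A minimal sketch on toy slices of the data (reusing the V1/V2 column names from the question); note it does not exploit the second observation about resuming the search where the previous match left off, but it is fully vectorised:

```r
# Toy slices of df_1 / df_2, reusing the V1 (time) / V2 (value) names
df_1 <- data.frame(V1 = c(15.942627, 15.943421, 15.943422),
                   V2 = c(2633, 3068, 4256))
df_2 <- data.frame(V1 = c(15.959306, 15.960350, 15.961223),
                   V2 = c(2633, 2633, 3068))

idx <- match(df_1$V2, df_2$V2)       # index of the FIRST occurrence, NA if none
time_diff <- df_2$V1[idx] - df_1$V1  # NA propagates for unmatched rows
result <- data.frame(value = df_1$V2, time_diff = time_diff)
result[!is.na(result$time_diff), ]
```

For value 2633 this picks the first of the two candidate rows in df_2, giving the same 0.0167 difference as the join-based answer above.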

Is there any function that gives the changes between columns?

I have a df that looks like this.
head(dfhigh)
rownames 2015Y 2016Y 2017Y 2018Y 2019Y 2020Y 2021Y
1 Australia 29583.7403 48397.383 45220.323 68461.941 39218.044 20140.351 29773.188
2 Austria* 1294.5092 -8400.973 14926.164 5511.625 2912.795 -14962.963 5855.014
3 Belgium* -24013.3111 68177.596 -3057.153 27119.084 -9208.553 13881.481 22955.298
4 Canada 43852.7732 36061.859 22764.156 37653.521 50141.784 23174.006 59693.992
5 Chile* 20507.8407 12249.294 6128.716 7735.778 12499.238 8385.907 15251.538
6 Czech Republic 465.2137 9814.496 9517.948 11010.423 10108.914 9410.576 5805.084
I want to calculate the changes between years, so instead of the values, the table has the percentage of change (obviously deleting 2015Y).
Try this, using (current - previous) / previous * 100:
lst <- list()
nm <- names(dfhigh)[-1]
for (i in 1:(length(nm) - 1)) {
  lst[[i]] <- (dfhigh[[nm[i + 1]]] - dfhigh[[nm[i]]]) / dfhigh[[nm[i]]] * 100
}
ans <- do.call(cbind, lst)
colnames(ans) <- paste("ch_of", nm[-1])
ans
You can change the formula to calculate the percentage however you want.
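The loop above can also be written as a single vectorised matrix operation in base R. A sketch on toy data with the column layout assumed from the question (drop = FALSE keeps the result a matrix even with few year columns):

```r
# Toy data with the layout assumed from the question
dfhigh <- data.frame(rownames = c("Australia", "Canada"),
                     "2015Y" = c(100, 200),
                     "2016Y" = c(150, 100),
                     "2017Y" = c(300, 50),
                     check.names = FALSE)

m <- as.matrix(dfhigh[-1])  # numeric year columns only
ans <- (m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]) /
  m[, -ncol(m), drop = FALSE] * 100
colnames(ans) <- paste("ch_of", colnames(m)[-1])
ans
```

Each output column is the percentage change from the year to its left, so the first year column is implicitly dropped, as the question requires.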
You could also use a tidyverse solution.
library(tidyverse)
df %>%
  pivot_longer(!rownames) %>%
  group_by(rownames) %>%
  mutate(value = 100 * value / lag(value) - 100) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)
# # A tibble: 6 × 8
# rownames `2015Y` `2016Y` `2017Y` `2018Y` `2019Y` `2020Y` `2021Y`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Australia NA 63.6 -6.56 51.4 -42.7 -48.6 47.8
# 2 Austria* NA -749. -278. -63.1 -47.2 -614. -139.
# 3 Belgium* NA -384. -104. -987. -134. -251. 65.4
# 4 Canada NA -17.8 -36.9 65.4 33.2 -53.8 158.
# 5 Chile* NA -40.3 -50.0 26.2 61.6 -32.9 81.9
# 6 CzechRepublic NA 2010. -3.02 15.7 -8.19 -6.91 -38.3

Calculating mean age by group in R

I have the following data: https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv
I'm trying to determine how to calculate the mean age of members of Congress by year (termstart) for each party (Republican and Democrat).
I was hoping for some help on how to go about doing this. I am a beginner in R and I'm just playing around with the data.
Thanks!
Try this approach: filter for the required parties and then summarise. After that, you can reshape to wide format in order to have both parties for each individual date. Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
#Data
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv',stringsAsFactors = F)
#Code
newdf <- df %>%
  filter(party %in% c('R', 'D')) %>%
  group_by(termstart, party) %>%
  summarise(MeanAge = mean(age, na.rm = TRUE)) %>%
  pivot_wider(names_from = party, values_from = MeanAge)
Output:
# A tibble: 34 x 3
# Groups: termstart [34]
termstart D R
<chr> <dbl> <dbl>
1 1947-01-03 52.0 53.0
2 1949-01-03 51.4 54.6
3 1951-01-03 52.3 54.3
4 1953-01-03 52.3 54.1
5 1955-01-05 52.3 54.7
6 1957-01-03 53.2 55.4
7 1959-01-07 52.4 54.7
8 1961-01-03 53.4 53.9
9 1963-01-09 53.3 52.6
10 1965-01-04 52.3 52.2
# ... with 24 more rows
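For completeness, the same grouped mean can be computed in base R with aggregate(), without any extra packages. A sketch on toy rows shaped like the congress CSV (the termstart, party, and age column names come from that file):

```r
# Toy rows shaped like the congress CSV (termstart, party, age columns)
df <- data.frame(
  termstart = c("1947-01-03", "1947-01-03", "1947-01-03", "1949-01-03"),
  party     = c("D", "R", "I", "D"),
  age       = c(52, 53, 60, 51)
)

# Keep only the two major parties, then take the mean age per term and party
mean_age <- aggregate(age ~ termstart + party,
                      data = subset(df, party %in% c("D", "R")),
                      FUN = mean)
mean_age
```

This gives the long (one row per term/party) form; the pivot_wider() step in the answer above is what spreads the two parties into separate columns.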

Named vector "by" arguments for `dplyr::_join` functions [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I am writing a function to dplyr::_join two dataframes by different columns, with the column name of the first dataframe dynamically specified as a function argument. I believe I need to use rlang quasiquotation/metaprogramming, but I haven't been able to get a working solution. I'd appreciate any suggestions!
library(dplyr)
library(rlang)
library(palmerpenguins)
# Create a smaller dataset
penguins <-
penguins %>%
group_by(species) %>%
slice_head(n = 4) %>%
ungroup()
# Create a colors dataset
penguin_colors <-
tibble(
type = c("Adelie", "Chinstrap", "Gentoo"),
color = c("orange", "purple", "green")
)
# Without function --------------------------------------------------------
# Join works with character vectors
left_join(
penguins, penguin_colors, by = c("species" = "type")
)
#> # A tibble: 12 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Chinst… Dream 46.5 17.9 192 3500
#> 6 Chinst… Dream 50 19.5 196 3900
#> 7 Chinst… Dream 51.3 19.2 193 3650
#> 8 Chinst… Dream 45.4 18.7 188 3525
#> 9 Gentoo Biscoe 46.1 13.2 211 4500
#> 10 Gentoo Biscoe 50 16.3 230 5700
#> 11 Gentoo Biscoe 48.7 14.1 210 4450
#> 12 Gentoo Biscoe 50 15.2 218 5700
#> # … with 3 more variables: sex <fct>, year <int>, color <chr>
# Join works with data-variable and character vector
left_join(
penguins, penguin_colors, by = c(species = "type")
)
#> # A tibble: 12 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Chinst… Dream 46.5 17.9 192 3500
#> 6 Chinst… Dream 50 19.5 196 3900
#> 7 Chinst… Dream 51.3 19.2 193 3650
#> 8 Chinst… Dream 45.4 18.7 188 3525
#> 9 Gentoo Biscoe 46.1 13.2 211 4500
#> 10 Gentoo Biscoe 50 16.3 230 5700
#> 11 Gentoo Biscoe 48.7 14.1 210 4450
#> 12 Gentoo Biscoe 50 15.2 218 5700
#> # … with 3 more variables: sex <fct>, year <int>, color <chr>
# Join does NOT work with character vector and data-variable
left_join(
penguins, penguin_colors, by = c(species = type)
)
#> Error in standardise_join_by(by, x_names = x_names, y_names = y_names): object 'type' not found
# With function -----------------------------------------------------------
# Version 1: Without tunneling
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c(var = "type")
)
}
add_colors(penguins, species)
#> Error: Join columns must be present in data.
#> x Problem with `var`.
add_colors(penguins, "species")
#> Error: Join columns must be present in data.
#> x Problem with `var`.
# Version 2: With tunneling
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c("{{var}}" = "type")
)
}
add_colors(penguins, species)
#> Error: Join columns must be present in data.
#> x Problem with `{{var}}`.
add_colors(penguins, "species")
#> Error: Join columns must be present in data.
#> x Problem with `{{var}}`.
# Version 2: With tunneling and glue syntax
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c("{{var}}" := "type")
)
}
add_colors(penguins, species)
#> Error: `:=` can only be used within a quasiquoted argument
add_colors(penguins, "species")
#> Error: `:=` can only be used within a quasiquoted argument
Created on 2020-10-05 by the reprex package (v0.3.0)
Here are related resources I consulted:
using `rlang` quasiquotation with `dplyr::_join` functions
https://dplyr.tidyverse.org/reference/join.html
https://speakerdeck.com/lionelhenry/interactivity-and-programming-in-the-tidyverse
https://dplyr.tidyverse.org/articles/programming.html
Thank you for your advice.
library(dplyr)
left_join(
penguins, penguin_colors, by = c(species = "type")
)
The reason the above works is that in by we are creating a named vector like this:
c(species = "type")
#species
# "type"
You can also do that via setNames:
setNames('type', 'species')
but notice that passing species without quotes fails:
setNames('type', species)
Error in setNames("type", species) : object 'species' not found
So create a named vector with setNames and pass a character value to the function.
add_colors <- function(data, var) {
  left_join(
    data, penguin_colors, by = setNames('type', var)
  )
}
add_colors(penguins, 'species')
To add to Ronak's solution, you can also achieve this without quotes using ensym.
Example:
add_colors <- function(data, var) {
  left_join(
    data, penguin_colors, by = set_names("type", nm = ensym(var))
  )
}
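As a quick self-contained check (toy tables standing in for the penguins data), the ensym() version accepts the column either bare or quoted, because ensym() captures a symbol from either form:

```r
library(dplyr)
library(rlang)

# Toy stand-ins for the penguins tables
penguin_colors <- tibble(type  = c("Adelie", "Gentoo"),
                         color = c("orange", "green"))
toy <- tibble(species = c("Adelie", "Gentoo"))

add_colors <- function(data, var) {
  # ensym() captures `var` as a symbol; set_names() coerces it to the
  # name of the join-by vector, i.e. c(species = "type")
  left_join(data, penguin_colors, by = set_names("type", nm = ensym(var)))
}

res_bare   <- add_colors(toy, species)    # bare column name
res_quoted <- add_colors(toy, "species")  # quoted name also works
```

Both calls produce the same joined tibble with the color column attached.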

Subsetting data set to only retain the mean

Please see the attached image of the dataset.
What are the different ways to retain only a single value for each 'Month'? I've got a bunch of data points and would only need to retain, say, the mean value.
Many thanks
A different way of using the aggregate() function.
> aggregate(Temp ~ Month, data=airquality, FUN = mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
library(tidyverse)
library(lubridate)
# example data from airquality:
aq <- as_tibble(airquality)  # as_data_frame() is deprecated in favour of as_tibble()
aq$mydate <- lubridate::ymd(paste0(2018, "-", aq$Month, "-", aq$Day))
> aq
# A tibble: 153 x 7
Ozone Solar.R Wind Temp Month Day mydate
<int> <int> <dbl> <int> <int> <int> <date>
1 41 190 7.40 67 5 1 2018-05-01
2 36 118 8.00 72 5 2 2018-05-02
3 12 149 12.6 74 5 3 2018-05-03
aq %>%
  group_by("Month" = month(mydate)) %>%
  summarize("Mean_Temp" = mean(Temp, na.rm = TRUE))
summarize() can apply multiple summary functions at once:
aq %>%
  group_by("Month" = month(mydate)) %>%
  summarize("Mean_Temp" = mean(Temp, na.rm = TRUE),
            "Num" = n(),
            "SD" = sd(Temp, na.rm = TRUE))
# A tibble: 5 x 4
Month Mean_Temp Num SD
<dbl> <dbl> <int> <dbl>
1 5.00 65.5 31 6.85
2 6.00 79.1 30 6.60
3 7.00 83.9 31 4.32
4 8.00 84.0 31 6.59
5 9.00 76.9 30 8.36
Lubridate Cheatsheet
A data.table answer:
# load libraries
library(data.table)
library(lubridate)
setDT(dt)
dt[, .(meanValue = mean(value, na.rm =TRUE)), by = .(monthDate = floor_date(dates, "month"))]
Where dt has at least columns value and dates.
We can group by the index of the dataset and use that in aggregate (from base R) to get the mean:
aggregate(dat, index(dat), FUN = mean)
NB: Here we assumed that the dataset is in xts or zoo format. If the dataset has a Month column, then use
aggregate(dat, list(dat$Month), FUN = mean)
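To make the Month-column form concrete, here is a runnable toy example (the dat data frame and its Temp column are assumptions standing in for the poster's data; selecting dat["Temp"] avoids averaging the Month column itself):

```r
# Assumed toy data with a Month column and one value column
dat <- data.frame(Month = c(5, 5, 6),
                  Temp  = c(60, 70, 80))

# Mean Temp within each Month
monthly <- aggregate(dat["Temp"], by = list(Month = dat$Month), FUN = mean)
monthly
```

Naming the by-list entry (list(Month = ...)) is what gives the grouping column a readable name in the output.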
