Conditional statements of rows and columns - r

I am trying to write a rather complex conditional statement in R and need help with it. I have a massive dataset (~1,450,000 values).
I am trying to shave down the dataset with R using a condition: if multiple rows have the same value in column "date" AND the same value in column "PageName", THEN give the averages of columns "sst", "lat", and "long" for those rows.
The code I have Frankensteined together so far is:
Combines_averages <- if(Combine_12$date == Combine_12$PageName{aggregate(Combine_12$sst, Combine_12$`location-lat`, Combine_12$`location-long`)}
Data:
Example_Data
# A tibble: 3 x 7
animal_id `location-lat` `location-long` date sst month PageName
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Alpha-626 30.5 -79.5 3/14/2020 22.2 3 ABCD
2 Bravo-522 30.5 -79.5 3/14/2020 22.6 3 ABCD
3 Charlie-389 30.5 -79.5 3/13/2020 22.4 3 BCAD
4 Delta-720 30.5 -79.5 3/16/2020 22.8 3 CADB
5 Echo-550 30.5 -79.5 3/14/2020 22.2 3 ABCD
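One way to read the requirement is as a grouped mean rather than an if statement: treat every distinct (date, PageName) pair as a group and average the numeric columns within it. The following is a minimal base R sketch under that assumption; the data frame is rebuilt by hand from the example rows, and the result name Combined_averages is made up for illustration.

```r
# Toy data rebuilt from the example rows in the question
Combine_12 <- data.frame(
  animal_id = c("Alpha-626", "Bravo-522", "Charlie-389", "Delta-720", "Echo-550"),
  `location-lat` = c(30.5, 30.5, 30.5, 30.5, 30.5),
  `location-long` = c(-79.5, -79.5, -79.5, -79.5, -79.5),
  date = c("3/14/2020", "3/14/2020", "3/13/2020", "3/16/2020", "3/14/2020"),
  sst = c(22.2, 22.6, 22.4, 22.8, 22.2),
  PageName = c("ABCD", "ABCD", "BCAD", "CADB", "ABCD"),
  check.names = FALSE  # keep the hyphenated column names as they are
)

# One output row per distinct (date, PageName) pair,
# holding the mean of each numeric column within that pair
Combined_averages <- aggregate(
  Combine_12[c("sst", "location-lat", "location-long")],
  by = Combine_12[c("date", "PageName")],
  FUN = mean
)
Combined_averages
```

Rows whose (date, PageName) pair occurs only once are kept unchanged, since the mean of a single value is that value.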

Related

How can I calculate mean values for each day of a year from a time series data set in R?

I have a data set containing climatic data taken hourly from 01-01-2007 to 31-12-2021.
I want to calculate the mean value for a given variable (e.g. temperature) for each day of the year (1:365).
My dataset looks something like this:
dia prec_h tc_h um_h v_d vm_h
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2007-01-01 0.2 22.9 89 42 3
2 2007-01-01 0.4 22.8 93 47 1.9
3 2007-01-01 0 22.7 94 37 1.3
4 2007-01-01 0 22.6 94 38 1.6
5 2007-01-01 0 22.7 95 46 2.3
[...]
131496 2021-12-31 0.0 24.7 87 47 2.6
( "[...]" stands for sequence of data from 2007 - 2014).
I first calculated daily mean temperature for each of my entry dates as follows:
md$dia<-as.Date(md$dia,format = "%d/%m/%Y")
m_tc<-aggregate(tc_h ~ dia, md, mean)
This returned a data frame with the mean temperature for each date in the analyzed years.
Now, I want to calculate the mean temperature for each day of the year from this data frame, i.e: mean temperature for January 1st up to December 31st.
Thus, I need to end up with a data frame with 365 rows, but I don't know how to do such calculation. Can anyone help me out?
Also, there is a complication: I have 4 leap years in my data frame. Any recommendations on how to deal with them?
Thankfully
First simulate a data set with the relevant columns and number of rows, then aggregate by day giving m_tc.
As for the question, create an auxiliary variable mdia by formatting the dates column as month-day only. Compute the means grouping by mdia. The result is a data.frame with 366 rows and 2 columns as expected.
set.seed(2022)
# number of rows in the question
n <- 131496L
dia <- seq(as.Date("2007-01-01"), as.Date("2021-12-31"), by = "1 day")
md <- data.frame(
dia = sort(sample(dia, n, TRUE)),
tc_h = round(runif(n, 0, 40), 1)
)
m_tc <- aggregate(tc_h ~ dia, md, mean)
mdia <- format(m_tc$dia, "%m-%d")
final <- aggregate(tc_h ~ mdia, m_tc, mean)
str(final)
#> 'data.frame': 366 obs. of 2 variables:
#> $ mdia: chr "01-01" "01-02" "01-03" "01-04" ...
#> $ tc_h: num 20.2 20.4 20.2 19.6 20.7 ...
head(final, n = 10L)
#> mdia tc_h
#> 1 01-01 20.20741
#> 2 01-02 20.44143
#> 3 01-03 20.20979
#> 4 01-04 19.63611
#> 5 01-05 20.69064
#> 6 01-06 18.89658
#> 7 01-07 20.15992
#> 8 01-08 19.53639
#> 9 01-09 19.52999
#> 10 01-10 19.71914
Created on 2022-10-18 with reprex v2.0.2
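You can pass your data along with the pipe (%>%) from the magrittr package and calculate the mean values with dplyr. Grouping by dia, as below, gives one mean per calendar date; for one mean per day of the year, group by format(dia, "%m-%d") instead.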
library(dplyr); library(magrittr)
tcmean<-md %>% group_by(dia) %>% summarise(m_tc=mean(tc_h))
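On the leap-year part of the question, one option is to drop 29 February before averaging, which leaves exactly 365 rows. This is an assumption about what is wanted, not something the answers above do, and the data below are simulated. The name final_365 is made up for illustration.

```r
set.seed(1)

# Simulated daily means, one row per calendar date from 2007 to 2021
dia <- seq(as.Date("2007-01-01"), as.Date("2021-12-31"), by = "1 day")
m_tc <- data.frame(dia = dia, tc_h = round(runif(length(dia), 0, 40), 1))

# Month-day key shared by the same day across years
m_tc$mdia <- format(m_tc$dia, "%m-%d")

# Assumption: 29 February is simply discarded
no_leap <- m_tc[m_tc$mdia != "02-29", ]

# Mean over the years for each remaining day of the year
final_365 <- aggregate(tc_h ~ mdia, data = no_leap, FUN = mean)
nrow(final_365)
```

Keeping 29 February instead gives 366 rows, with that one day averaged over the leap years only.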

Create a temporary group in dplyr group_by

I would like to group all members of the same genus together for some summary statistics, but would like to maintain their full names in the original dataframe. I know that I could change their names or create a new column in the original dataframe, but I am looking for a more elegant solution. I would like to implement this in R with the dplyr package.
Example data here https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Aeb176981-1909-4d6d-ac07-3406e4efc43f
I would like to group all clams of the genus Macoma as one group, "Macoma sp.", ideally creating this grouping within the following code, perhaps before the group_by(site_code, species_scientific):
summary <- data %>%
group_by(site_code, species_scientific) %>%
summarize(mean_size = mean(width_mm))
Note that there are multiple Macoma xxx species and multiple other species that I want to group as is.
We may modify species_scientific by replacing the elements that contain the substring 'Macoma' (detected with str_detect) with 'Macoma', use that as the grouping column, and get the mean:
library(dplyr)
library(stringr)
data %>%
mutate(species_scientific = replace(species_scientific,
str_detect(species_scientific, "Macoma"), "Macoma")) %>%
group_by(site_code, species_scientific) %>%
summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 97 × 3
site_code species_scientific mean_size
<chr> <chr> <dbl>
1 H_01_a Clinocardium nuttallii 33.9
2 H_01_a Macoma 41.0
3 H_01_a Protothaca staminea 37.3
4 H_01_a Saxidomus gigantea 56.0
5 H_01_a Tresus nuttallii 100.
6 H_02_a Clinocardium nuttallii 35.1
7 H_02_a Macoma 41.3
8 H_02_a Protothaca staminea 38.0
9 H_02_a Saxidomus gigantea 54.7
10 H_02_a Tresus nuttallii 50.5
# … with 87 more rows
If the intention is to keep only the first word in 'species_scientific':
data %>%
group_by(genus = str_remove(species_scientific, "\\s+.*"), site_code) %>%
summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 97 × 3
genus site_code mean_size
<chr> <chr> <dbl>
1 Clinocardium H_01_a 33.9
2 Clinocardium H_02_a 35.1
3 Clinocardium H_03_a 37.5
4 Clinocardium H_04_a 48.2
5 Clinocardium H_05_a 37.6
6 Clinocardium H_06_a 38.7
7 Clinocardium H_07_a 40.2
8 Clinocardium L_01_a 44.4
9 Clinocardium L_02_a 54.8
10 Clinocardium L_03_a 61.1
# … with 87 more rows
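The same "temporary group" idea can be sketched in base R: the genus is computed as a separate vector and used only as a grouping key, so the original data frame is never modified. The toy values below are hypothetical, and the names clams and genus_means are made up for illustration; only the column names come from the question.

```r
# Hypothetical toy data using the column names from the question
clams <- data.frame(
  site_code = c("H_01_a", "H_01_a", "H_01_a", "H_01_a"),
  species_scientific = c("Macoma nasuta", "Macoma inquinata",
                         "Saxidomus gigantea", "Saxidomus gigantea"),
  width_mm = c(40, 42, 55, 57)
)

# Genus = everything before the first whitespace; clams itself is untouched
genus <- sub("\\s+.*", "", clams$species_scientific)

# Mean width per site and genus
genus_means <- aggregate(
  clams$width_mm,
  by = list(site_code = clams$site_code, genus = genus),
  FUN = mean
)
names(genus_means)[3] <- "mean_size"
genus_means
```

After this runs, clams still has its three original columns and full species names.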

Extract value from previous row based on a condition

I have a dataset that looks as follows:
data <- tribble(
~Date, ~Ticker, ~Close, ~Open,
"1989-09-11","COND",77.3292,77.3292,
"1989-09-12","COND",77.4435,77.4435,
"1989-09-13","COND",76.3118,76.3118,
"1989-09-14","COND",75.5309,75.6344,
"1989-09-15","COND",75.6598,75.4675)
# A tibble: 5 x 4
Date Ticker Close Open
<chr> <chr> <dbl> <dbl>
1 1989-09-11 COND 77.3 77.3
2 1989-09-12 COND 77.4 77.4
3 1989-09-13 COND 76.3 76.3
4 1989-09-14 COND 75.5 75.6
5 1989-09-15 COND 75.7 75.5
The issue with it is that until a certain date, the closing price is identical to the opening price. What I'm trying to do is write a function that checks whether the opening and closing prices are the same and, if so, replaces the opening price with the closing price from the previous row. If applied to the above data, it would transform the data as follows:
# A tibble: 5 x 4
Date Ticker Close Open
<chr> <chr> <dbl> <dbl>
1 1989-09-11 COND 77.3 NA
2 1989-09-12 COND 77.4 77.3
3 1989-09-13 COND 76.3 77.4
4 1989-09-14 COND 75.5 75.6
5 1989-09-15 COND 75.7 75.5
I tried to do it with an if statement, but I'm running into problems as soon as I try to get the value from the previous row in the "Close" column to the current "Open" value.
In dplyr, it's a simple mutate with lag.
library(dplyr)
data %>%
mutate(Open = if_else(Open == Close, lag(Close), Open))
## A tibble: 5 x 4
# Date Ticker Close Open
# <chr> <chr> <dbl> <dbl>
#1 1989-09-11 COND 77.3 NA
#2 1989-09-12 COND 77.4 77.3
#3 1989-09-13 COND 76.3 77.4
#4 1989-09-14 COND 75.5 75.6
#5 1989-09-15 COND 75.7 75.5
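For comparison, a base R sketch of the same idea without dplyr: shift the Close column down by one position to get the previous row's value, then substitute it wherever Open equals Close. The data frame name prices and the helper vector prev_close are made up for illustration.

```r
# Same values as the tribble in the question
prices <- data.frame(
  Date = c("1989-09-11", "1989-09-12", "1989-09-13", "1989-09-14", "1989-09-15"),
  Ticker = "COND",
  Close = c(77.3292, 77.4435, 76.3118, 75.5309, 75.6598),
  Open = c(77.3292, 77.4435, 76.3118, 75.6344, 75.4675)
)

# Previous row's Close: prepend NA and drop the last element
prev_close <- c(NA, head(prices$Close, -1))

# Replace Open only where it is identical to Close
prices$Open <- ifelse(prices$Open == prices$Close, prev_close, prices$Open)
prices
```

With several tickers in one table, the shift would have to be done within each ticker (for example by adding group_by(Ticker) before the mutate in the dplyr version), otherwise the first row of one ticker picks up the last close of another.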

A quick way to rename multiple columns with unique names using dplyr

I am a beginner R user, currently learning the tidyverse way. I imported a dataset which is a time series of monthly indexed consumer prices over a period of four years. The imported headings on the monthly CPI columns are displayed in R as five-digit numbers (as characters). Here is a short mockup recreation of what it looks like...
df <- tibble(`Product` = c("Eggs", "Chicken"),
`44213` = c(35.77, 36.77),
`44244` = c(39.19, 39.80),
`44272` = c(40.12, 43.42),
`44303` = c(41.09, 41.33)
)
# A tibble: 2 x 5
# Product `44213` `44244` `44272` `44303`
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
I want to change the column headings (44213 etc) to dates that make more sense to me (still as characters). I understand, using dplyr, to do it the following way:
df <- df %>% rename("Jan17" = `44213`, "Feb17" = `44244`,
"Mar17" = `44272`, "Apr17" = `44303`)
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
The problem is that my actual dataset contains 48 such columns (months) to rename, so it is a lot of work to type out. I looked at other replace and set_names functions, but these seem to apply the same change to every column name rather than provide new unique names like I am looking for.
(I realise dates as columns is not good practice and would need to shift these to rows before proceeding with any analysis... or maybe this must be a prior step to renaming?)
Trust I expressed my question sufficiently. Would love to learn a quicker solution using dplyr or be directed to where one can be found. Thank you for your time.
We can use !!! with rename by passing a named vector
library(dplyr)
library(stringr)
df1 <- df %>%
rename(!!! setNames(names(df)[-1], str_c(month.abb[1:4], 17)))
-output
df1
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
Or use rename_with
df %>%
rename_with(~str_c(month.abb[1:4], 17), -1)
If the column names should be generated by converting the numbers to formatted dates:
nm1 <- format(as.Date(as.numeric(names(df)[-1]), origin = '1896-01-01'), '%b%y')
df %>%
rename_with(~ nm1, -1)
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
Or use some arbitrary but sequential names (paste0 takes no sep argument, so it is dropped here):
names(df)[2:ncol(df)] <- paste0('col_', 1:(ncol(df)-1))
## A tibble: 2 x 5
# Product col_1 col_2 col_3 col_4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
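For the full 48-column case, the new names can be generated rather than typed. The sketch below assumes the columns are contiguous months in order, starting at January 2017 and running through December 2020; that range is a guess based on the "Jan17" example and the four-year span mentioned in the question. The data frame df48 is a made-up stand-in with placeholder values.

```r
# Stand-in for the real data: Product plus 48 monthly columns of placeholders
df48 <- data.frame(
  Product = c("Eggs", "Chicken"),
  matrix(0, nrow = 2, ncol = 48)
)

# month.abb (12 values) is recycled against the 48 year suffixes:
# "Jan17", ..., "Dec17", "Jan18", ..., "Dec20"
new_names <- paste0(month.abb, rep(17:20, each = 12))

# Rename everything except the first column
names(df48)[-1] <- new_names
names(df48)[1:5]
```

The same new_names vector could be passed to rename_with(~ new_names, -1) to stay within dplyr.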

Aggregate with multiple duplicates and calculate their mean

Assume we have a DF with duplicates in their respective UserIDs but with different names, which of course can be duplicated as well.
DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))
The aim is to aggregate and calculate the mean and standard deviation of the UserID's and their names respectively. A desired output example:
UserID Name Class Scoring_mean Scoring_std
101 Ed Junior 12.5 3
101 Hank Junior 24.67 11.62
102 Sandy High 24.75 6.29
102 Jessica High 24.25 1.5
Hence my question:
What are the options to aggregate the Names based on the UserID, without the loss of information (Hank being coerced into Ed etc. as with summarise() or mutate() )
In my way of thinking, R has to check which Name corresponds to the UserID, and if a match; aggregate and calculate mean & standard deviation, but I'm not able to get this working in R with dplyr.
At the same time I couldn't find any other post that is somewhat related to this question, as in:
How to calculate the mean of specific rows in R?
Subtract pairs of columns based on matching column
Calculating mean when 2 conditions need met in R
average between duplicated rows in R
Here's a tidyverse option that uses some reshaping to create one column of scores and then some grouping in order to get the summary stats:
DF <- data.frame(
ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26),
Other_Scores=c(15,9,34,23,43,23,34,23,23)
)
library(tidyverse)
DF %>%
gather(score_type, score, Scoring, Other_Scores) %>% # reshape score columns
group_by(ID, Name, Class) %>% # group by combinations
summarise(scoring_mean = mean(score), # get summary stats
scoring_sd = sd(score)) %>%
ungroup() # forget the grouping
# # A tibble: 4 x 5
# ID Name Class scoring_mean scoring_sd
# <dbl> <fct> <fct> <dbl> <dbl>
# 1 101. Ed Junior 12.5 3.00
# 2 101. Hank Junior 24.7 11.6
# 3 102. Jessica Mid 24.2 1.50
# 4 102. Sandy High 24.8 6.29
What about computing your summary stats and then joining the results to your initial data frame? Like so:
DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))
library(dplyr)
DF2 <- DF %>% group_by(Name) %>%
summarise(scoring_mean=mean(Scoring), scoring_sd = sd(Scoring)) %>%
left_join(DF[,c(1,2,3)], by="Name")
Giving:
# A tibble: 9 x 5
Name scoring_mean scoring_sd ID Class
<fct> <dbl> <dbl> <dbl> <fct>
1 Ed 13.0 2.83 101. Junior
2 Ed 13.0 2.83 101. Junior
3 Hank 16.0 3.46 101. Junior
4 Hank 16.0 3.46 101. Junior
5 Hank 16.0 3.46 101. Junior
6 Jessica 25.5 0.707 102. Mid
7 Jessica 25.5 0.707 102. Mid
8 Sandy 21.0 1.41 102. High
9 Sandy 21.0 1.41 102. High
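The join above yields nine rows because each summary row is matched back to every original row with that name. If one row per ID/Name/Class combination is wanted, a base R sketch is to aggregate with a function that returns both statistics and then flatten the result. This uses the Scoring column only, which is an assumption about which scores should be summarised; the name stats is made up for illustration.

```r
DF <- data.frame(
  ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
  Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
  Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
  Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26)
)

# FUN returns two values per group, so Scoring becomes a two-column matrix
stats <- aggregate(
  Scoring ~ ID + Name + Class,
  data = DF,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

# Flatten the matrix column into Scoring.mean and Scoring.sd
stats <- do.call(data.frame, stats)
stats
```

Grouping on ID, Name and Class together keeps Ed and Hank apart even though they share an ID.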
