How to combine winter months of two consecutive years in R

I have count data for several species spanning several years. I want to look at the abundance dynamics of each species over the winter season of each year. The problem is that the winter season spans two years: November and December, plus January of the next year. I want to combine the abundance of each species for the winter months spanning two consecutive years and then do some analysis. For example, in the first round I want to subset Nov-Dec of 2005 and Jan of 2006 and analyse that, then in the second round subset Nov-Dec of 2006 and Jan of 2007 and repeat the same analysis, and so on. How can I do this in R?
Here is an example of the data
date species year month day abundance temp
9/3/2005 A 2005 9 3 3 19
9/15/2005 B 2005 9 15 30 16
10/4/2005 A 2005 10 4 24 12
11/6/2005 A 2005 11 6 32 14
12/8/2005 A 2005 12 8 15 13
1/3/2005 A 2006 1 3 64 19
1/4/2006 B 2006 1 4 2 13
2/10/2006 A 2006 2 10 56 12
2/8/2006 A 2006 1 3 34 19
3/9/2006 A 2006 1 3 64 19

I would convert your date column to a Date class (possibly with lubridate) and remove the year, month and day columns, as they are redundant.
Then make a new column with the seasonal year (defined as the calendar year, unless the month is January, in which case it is the previous year). A further column defining each row's season is made with case_when().
library(dplyr)
library(lubridate)

# convert the date column to Date format
df$date <- mdy(df$date)

# add the seasonal-year and season columns
df <- mutate(df,
  season_year = ifelse(month(date) == 1, year(date) - 1, year(date)),
  season = case_when(
    month(date) %in% c(2, 3, 4)   ~ "Spring",
    month(date) %in% c(5, 6, 7)   ~ "Summer",
    month(date) %in% c(8, 9, 10)  ~ "Autumn",
    month(date) %in% c(11, 12, 1) ~ "Winter",
    TRUE ~ NA_character_
  ))
# date species abundance temp season_year season
# 1 2005-09-03 A 3 19 2005 Autumn
# 2 2005-09-15 B 30 16 2005 Autumn
# 3 2005-10-04 A 24 12 2005 Autumn
# 4 2005-11-06 A 32 14 2005 Winter
# 5 2005-12-08 A 15 13 2005 Winter
# 6 2005-01-03 A 64 19 2004 Winter
# 7 2006-01-04 B 2 13 2005 Winter
# 8 2006-02-10 A 56 12 2006 Spring
# 9 2006-02-08 A 34 19 2006 Spring
# 10 2006-03-09 A 64 19 2006 Spring
Then you can group_by() and/or filter() your data for further analysis:
df %>%
  group_by(season_year) %>%
  filter(season == "Winter") %>%
  summarise(count = sum(abundance))
# # A tibble: 2 x 2
# season_year count
# <dbl> <int>
# 1 2004 64
# 2 2005 49
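Since the question asks about each species separately, the same pipe can also be grouped by species. A minimal sketch, assuming the df built above:
# winter totals per season_year and species
df %>%
  filter(season == "Winter") %>%
  group_by(season_year, species) %>%
  summarise(count = sum(abundance))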

A data.table solution:
First create a lookup table with from/to dates and the season-year, then perform an overlap join using foverlaps().
library( data.table )
Sample data:
dt <- fread("date species year month day abundance temp
9/3/2005 A 2005 9 3 3 19
9/15/2005 B 2005 9 15 30 16
10/4/2005 A 2005 10 4 24 12
11/6/2005 A 2005 11 6 32 14
12/8/2005 A 2005 12 8 15 13
1/3/2005 A 2006 1 3 64 19
1/4/2006 B 2006 1 4 2 13
2/10/2006 A 2006 2 10 56 12
2/8/2006 A 2006 1 3 34 19
3/9/2006 A 2006 1 3 64 19", header = TRUE)
Create a lookup table.
Here you define the names and the start and end dates of the seasons; adjust these to your own needs. Since you want to analyse the seasons individually, I advise keeping unique season names (here based on the start year of the season).
dt.season <- data.table( from = seq( as.Date("1999-02-01"), length.out = 100, by = "3 month"),
to = seq( as.Date("1999-05-01"), length.out = 100, by = "3 month") - 1 )
dt.season[, season := paste0( c( "spring", "summer", "autumn", "winter" ), "-", year( from ) )]
setkey( dt.season, from, to )
head(dt.season,6)
# from to season
# 1: 1999-02-01 1999-04-30 spring-1999
# 2: 1999-05-01 1999-07-31 summer-1999
# 3: 1999-08-01 1999-10-31 autumn-1999
# 4: 1999-11-01 2000-01-31 winter-1999
# 5: 2000-02-01 2000-04-30 spring-2000
# 6: 2000-05-01 2000-07-31 summer-2000
Then perform the join:
#set dt$date as dates
dt[, date := as.Date(date, format = "%m/%d/%Y")]
#create dummy variables to join on
dt[, `:=`( from = date, to = date)]
#create an overlap join, and clean the dummies used for the join
foverlaps( dt, dt.season)[, `:=`(from = NULL, to = NULL, i.from = NULL, i.to = NULL)][]
# season date species year month day abundance temp
# 1: autumn-2005 2005-09-03 A 2005 9 3 3 19
# 2: autumn-2005 2005-09-15 B 2005 9 15 30 16
# 3: autumn-2005 2005-10-04 A 2005 10 4 24 12
# 4: winter-2005 2005-11-06 A 2005 11 6 32 14
# 5: winter-2005 2005-12-08 A 2005 12 8 15 13
# 6: winter-2004 2005-01-03 A 2006 1 3 64 19
# 7: winter-2005 2006-01-04 B 2006 1 4 2 13
# 8: spring-2006 2006-02-10 A 2006 2 10 56 12
# 9: spring-2006 2006-02-08 A 2006 1 3 34 19
# 10: spring-2006 2006-03-09 A 2006 1 3 64 19
You can now easily group, summarise, and analyse by season.
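For instance, a minimal sketch of a per-species winter summary, assuming the dt and dt.season objects created above:
# keep the joined result, then aggregate abundance by season and species
res <- foverlaps( dt, dt.season )
res[ grepl( "winter", season ), .( count = sum( abundance ) ), by = .( season, species ) ]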

I'd think the easiest way would be to consider that winter 2006 consists of Nov and Dec 2006 plus Jan 2007. You could then add a column: winterid <- ifelse(data$month %in% c(11,12), data$year, ifelse(data$month == 1, data$year-1, "notwinter")).
You are now able to subset on the successive winter seasons. Adapt according to your notation.
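A minimal sketch of that idea, assuming the data frame is called data and has numeric year and month columns as in the example data:
# tag each row with the start year of its winter, or "notwinter"
data$winterid <- ifelse(data$month %in% c(11, 12), data$year,
                        ifelse(data$month == 1, data$year - 1, "notwinter"))

# first round: Nov-Dec 2005 plus Jan 2006
winter2005 <- subset(data, winterid == "2005")

# repeat the same analysis for every winter
for (w in setdiff(unique(data$winterid), "notwinter")) {
  winter_w <- subset(data, winterid == w)
  # ... run the analysis on winter_w ...
}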

Related

Sum up with the next line into a new column

I'm having some trouble figuring out how to create a new column with the sum of 2 subsequent cells.
I have:
df1 <- tibble(Years = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
              Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
Now, I want a new column where the first line is the sum 1+2, the second line is the sum 1+2+3, the third line is the sum 1+2+3+4, and so on.
As 1, 2, 3, 4... are hypothetical values, I need to measure the absolute growth from one decade to the next in order to later create a new variable measuring the percentage change between decades.
library(tibble)
df1 <- tibble(Years = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
              Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
  mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
This assumes the last row is to be repeated; otherwise the fill() part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
  mutate(x = lead(cumsum(Values))) %>%
  fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
Base R:
df1$cumsum <- cumsum(df1$Values)
Using dplyr:
library(dplyr)
df1 <- df1 %>%
  mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
setDT(df1)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
Or a base R option following a similar idea:
transform(
  df1,
  newCol = c(cumsum(Values)[-1], sum(Values))
)

How to lump sum the number of days of data spanning several years?

I have data similar to this. I would like to lump sum the days (I'm not sure "lump sum" is the correct word) and create a new column "date", so that the new column cumulatively counts the days across the 3 years of data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result was wrong and it's also too long. It doesn't count February correctly, since February has only 28 days. Are there any shorter ways?
cday <- function(data, syear = 2011, smonth = 1, sday = 1){
  year <- data[1]
  month <- data[2]
  day <- data[3]
  cmonth <- c(0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  date <- (year - syear) * 365 + sum(cmonth[1:month]) + day
  for(yr in c(syear:year)){
    if(yr == year){
      if(yr %% 4 == 0 && month > 2){ date <- date + 1 }
    } else {
      if(yr %% 4 == 0){ date <- date + 1 }
    }
  }
  return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated; look for tools that handle this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
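For example, a rough lubridate equivalent of the base R lines above (a sketch, reusing the same df):
library(lubridate)
# build the Date column from the year/month/day parts
df$date <- make_date(df$year, df$month, df$day)
# day of the year, and days since 2010-12-31 (so 2011-01-01 is day 1)
df$julian_day <- yday(df$date)
df$days_since_2010 <- as.integer(df$date - as.Date("2010-12-31"))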

How to replace missing values with previous year's binned mean

I have a data frame as below. The bin columns (Bin_p1, Bin_f1) are calculated with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how do I calculate the mean for the previous two years within the same bin and replace the NA values with it?
For example: at row 5 the value of p1 is NA and f1 is 30, with corresponding bin 7. Replace the NA with the previous 2 years' mean for the same bin (7), i.e.
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1, but this should do the trick anyway.
library(dplyr)
df %>%
  arrange(year) %>%
  mutate_at(c("p1", "f1"), "as.double") %>%
  group_by(Bin_p1) %>%
  mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
  group_by(Bin_f1) %>%
  mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
  ungroup()
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
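As a possible alternative to mean(rev(lag(f1))[1:2]), here is a sketch that simply averages the last two non-missing values within each bin; it reproduces the example output above but has not been checked against other edge cases:
df %>%
  arrange(year) %>%
  group_by(Bin_p1) %>%
  # replace missing f1 with the mean of the last two observed f1 values in the bin
  mutate(f1 = ifelse(is.na(f1), mean(tail(f1[!is.na(f1)], 2)), f1)) %>%
  group_by(Bin_f1) %>%
  mutate(p1 = ifelse(is.na(p1), mean(tail(p1[!is.na(p1)], 2)), p1)) %>%
  ungroup()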

How to remove subjects with missing yearly observations in R?

num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432
I have data for various subjects over 5 years. I need to remove all subjects that are missing any of the years from 2011 to 2015. How can I accomplish this, so that in the given data only subject A is left?
Using data.table, a solution might look something like this:
library(data.table)
dt <- as.data.table(df)
dt[, keep := identical(unique(year), 2011:2015), by = Name][keep == TRUE, ][, keep := NULL]
# num Name year age X
#1: 1 A 2011 68 116292
#2: 1 A 2012 69 46132
#3: 1 A 2013 70 7042
#4: 1 A 2014 71 -100425
#5: 1 A 2015 72 6493
This is more strict in that it requires that the unique years be exactly equal to 2011:2015. If there is a 2016, for example, that person would be excluded.
A less restrictive solution would be to check that 2011:2015 is in your unique years. This should work:
dt[, keep := all(2011:2015 %in% unique(year)), by = Name][keep == TRUE, ][, keep := NULL]
Thus, if for example, A had a 2016 year and a 2010 year it would still keep all of A. But if anyone is missing a year in 2011:2015 this would exclude them.
Using base R & aggregate:
Same option, but using aggregate from base R:
agg <- aggregate(df$year, by = list(df$Name), FUN = function(x) all(2011:2015 %in% unique(x)))
df[df$Name %in% agg[agg$x == T, 1] ,]
Here is a slightly more straightforward tidyverse solution (using dplyr and tidyr).
First, expand the dataframe to include all combinations of Name + year:
df %>% complete(Name, year)
# A tibble: 20 x 5
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
6 B 2011 2 20 -8484
7 B 2012 NA NA NA
8 B 2013 NA NA NA
9 B 2014 NA NA NA
10 B 2015 NA NA NA
...
Then extend the pipe to group by "Name", and filter to keep only those with 0 NA values:
df %>% complete(Name, year) %>%
  group_by(Name) %>%
  filter(sum(is.na(age)) == 0)
# A tibble: 5 x 5
# Groups: Name [1]
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
Just check which names have the right number of entries.
## Reproduce your data
df = read.table(text=" num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432",
header=TRUE)
Tab = table(df$Name)
Keepers = names(Tab)[which(Tab == 5)]
df[df$Name %in% Keepers,]
num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
Here is a somewhat different approach using tidyverse packages:
library(tidyverse)
df <- read.table(text = " num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432")
df2 <- spread(data = df, key = Name, value = year)
x <- colSums(df2[, 4:7], na.rm = TRUE) > 10000
df3 <- select(df2, num, age, X, c(4:7)[x])
df4 <- na.omit(df3)
All steps can of course be constructed as one single pipe with the %>% operator.
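A sketch of what that single pipe could look like, assuming tidyr's spread() and the same column positions (4:7) used above; the dot refers to the spread result at that step:
# spread, keep the Name columns whose year totals pass the check, then drop NA rows
df %>%
  spread(key = Name, value = year) %>%
  select(num, age, X, c(4:7)[colSums(.[, 4:7], na.rm = TRUE) > 10000]) %>%
  na.omit()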

Replace duplicate values using multiple conditions in R

I am new to R. I have the following data (an example) as a CSV file, and I want to replace any duplicate values with zero or a letter if they occur on consecutive days within the same year and month. I only need to keep one average.
Year Month Day Average
2013 8 28 2.3
2013 8 29 2.3
2013 8 30 1.7
2013 8 31 1.7
2014 8 7 3
2014 8 6 3
2014 8 8 3
2014 8 9 3
2014 9 11 5.8
2014 9 12 5.8
2014 9 13 5.8
The result I expect is something like this
Year Month Day Average
2013 8 28 2.3
2013 8 29 0
2013 8 30 1.7
2013 8 31 0
2014 8 7 3
2014 8 6 0
2014 8 8 0
2014 8 9 0
2014 9 11 5.8
2014 9 12 0
2014 9 13 0
I would also like to be able to delete the rows that contain the duplicate values that were replaced, like this:
Year Month Day Average
2013 8 28 2.3
2013 8 30 1.7
2014 8 7 3
2014 9 11 5.8
I need two files: one with the duplicate values replaced by zero or a letter, and another containing only the averages without the duplicate values.
Thank you in advance!!
Using dplyr for the data frame manipulation, lubridate for the date manipulation, and diff() to find consecutive repeated values.
Note that I've also sorted the dates to keep the earliest one, which means the result does not exactly match the example solution.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
df1 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 29 2.3
2013 8 30 1.7
2013 8 31 1.7
2014 8 7 3
2014 8 6 3
2014 8 8 3
2014 8 9 3
2014 9 11 5.8
2014 9 12 5.8
2014 9 13 5.8",
header = T)
df2 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 29 0
2013 8 30 1.7
2013 8 31 0
2014 8 7 3
2014 8 6 0
2014 8 8 0
2014 8 9 0
2014 9 11 5.8
2014 9 12 0
2014 9 13 0",
header = T)
df3 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 30 1.7
2014 8 7 3
2014 9 11 5.8",
header = T)
df2 <- df1 %>%
  mutate(date = ymd(paste(Year, Month, Day, sep = "-"))) %>%
  arrange(date) %>%
  mutate(is_consecutive_average = c(FALSE, diff(Average) == 0)) %>%
  mutate(is_consecutive_day = c(FALSE, diff(date) == 1)) %>%
  mutate(Average = Average * !(is_consecutive_average & is_consecutive_day)) %>%
  select(-is_consecutive_average, -is_consecutive_day, -date)
df2
## Year Month Day Average
## 1 2013 8 28 2.3
## 2 2013 8 29 0.0
## 3 2013 8 30 1.7
## 4 2013 8 31 0.0
## 5 2014 8 6 3.0
## 6 2014 8 7 0.0
## 7 2014 8 8 0.0
## 8 2014 8 9 0.0
## 9 2014 9 11 5.8
## 10 2014 9 12 0.0
## 11 2014 9 13 0.0
df3 <- df2 %>%
filter(Average != 0)
df3
## Year Month Day Average
## 1 2013 8 28 2.3
## 2 2013 8 30 1.7
## 3 2014 8 6 3.0
## 4 2014 9 11 5.8
Here's a data.table solution:
Read in the data (here text is assumed to hold the CSV shown in the question)
data <- readr::read_csv(
  text,
  col_names = TRUE,
  trim_ws = TRUE
)
library( data.table )
setDT( data )
Convert the date values to a nicer format, and sort
data[ , date := as.Date( paste0( Year, "-", Month, "-", Day ) ) ]
setorder( data, date )
Create new columns for previous date and average values
data[ , prev.date := shift( date, 1L, type = "lag" ) ]
data[ , prev.average := shift( Average, 1L, type = "lag" ) ]
Mark the points where a new "group" should be created, based on your criteria. Also mark the very first record as the start of a new group, since we can assume that it is.
data[ , group := 0L
][ as.integer( date - prev.date ) > 1L |
Average != prev.average, group := 1L
][ 1L, group := 1L ]
Get your first desired output by replacing particular values with zeros
data[ group != 1L, Average := 0 ]
first.output <- data[ , .( date, Average ) ]
head( first.output, 3 )
date Average
1: 2013-08-28 2.3
2: 2013-08-29 0.0
3: 2013-08-30 1.7
Now mark the groups as unique numbers
data[ , group := cumsum( group ) ]
And get your second output by aggregating to maximum "Average" value (which will be the only one not equal to zero), and the minimum "date" value (the first in that group):
second.output <- data[ , .( date = min( date ),
Average = max( Average ) ),
by = group ][ , .( date, Average ) ]
head( second.output, 3 )
date Average
1: 2013-08-28 2.3
2: 2013-08-30 1.7
3: 2014-08-06 3.0
NOTE: you could likely get second.output by simply removing rows with a zero "Average" value from the first.output, but it would remove any groups where the "Average" really is zero, so I think this method is safer.
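The simpler route mentioned in the note would just filter first.output; a one-line sketch, with the caveat that any group whose real Average is zero would be dropped:
# riskier shortcut: drop the zeroed rows from the first output
second.output.simple <- first.output[ Average != 0 ]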
