Replace duplicate values using multiple conditions in R

I am new to R and have the following data (an example) in a CSV file. I want to replace duplicate values with zero or a letter when they occur on consecutive days within the same year and month, so that only one average is kept.
Year Month Day Average
2013 8 28 2.3
2013 8 29 2.3
2013 8 30 1.7
2013 8 31 1.7
2014 8 7 3
2014 8 6 3
2014 8 8 3
2014 8 9 3
2014 9 11 5.8
2014 9 12 5.8
2014 9 13 5.8
The result I expect is something like this
Year Month Day Average
2013 8 28 2.3
2013 8 29 0
2013 8 30 1.7
2013 8 31 0
2014 8 7 3
2014 8 6 0
2014 8 8 0
2014 8 9 0
2014 9 11 5.8
2014 9 12 0
2014 9 13 0
I would also like to be able to delete the rows whose duplicate values were replaced, like this:
Year Month Day Average
2013 8 28 2.3
2013 8 30 1.7
2014 8 7 3
2014 9 11 5.8
I need two files: one with the duplicate values replaced by zero or a letter, and another with only the averages, without the duplicate rows.
Thank you in advance!!

This uses dplyr for the data.frame manipulation, lubridate for date handling, and diff to find consecutive repeated values.
Note that I've also sorted the rows by date so that the earliest day in each run is kept. As a result the output does not exactly match the example solution: 2014-08-06 is kept rather than 2014-08-07.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
df1 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 29 2.3
2013 8 30 1.7
2013 8 31 1.7
2014 8 7 3
2014 8 6 3
2014 8 8 3
2014 8 9 3
2014 9 11 5.8
2014 9 12 5.8
2014 9 13 5.8",
header = T)
df2 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 29 0
2013 8 30 1.7
2013 8 31 0
2014 8 7 3
2014 8 6 0
2014 8 8 0
2014 8 9 0
2014 9 11 5.8
2014 9 12 0
2014 9 13 0",
header = T)
df3 <- read.table(
text = "
Year Month Day Average
2013 8 28 2.3
2013 8 30 1.7
2014 8 7 3
2014 9 11 5.8",
header = T)
df2 <- df1 %>%
mutate(date = ymd(paste(Year, Month, Day, sep = "-"))) %>%
arrange(date) %>%
mutate(is_consecutive_average = c(FALSE, diff(Average) == 0)) %>%
mutate(is_consecutive_day = c(FALSE, diff(date) == 1)) %>%
mutate(Average = Average * !(is_consecutive_average & is_consecutive_day)) %>%
select(-is_consecutive_average, -is_consecutive_day, -date)
df2
## Year Month Day Average
## 1 2013 8 28 2.3
## 2 2013 8 29 0.0
## 3 2013 8 30 1.7
## 4 2013 8 31 0.0
## 5 2014 8 6 3.0
## 6 2014 8 7 0.0
## 7 2014 8 8 0.0
## 8 2014 8 9 0.0
## 9 2014 9 11 5.8
## 10 2014 9 12 0.0
## 11 2014 9 13 0.0
df3 <- df2 %>%
filter(Average != 0)
df3
## Year Month Day Average
## 1 2013 8 28 2.3
## 2 2013 8 30 1.7
## 3 2014 8 6 3.0
## 4 2014 9 11 5.8
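Since the question asks for two files, here is a minimal base R sketch of the same idea that also writes both results to disk. It flags a row as a duplicate when it repeats the previous row's Average on the very next calendar day, and drops rows by that flag instead of by `Average != 0`. The file names and the use of `tempdir()` are assumptions for illustration only.

```r
# Data from the question, in its original (unsorted) order.
df1 <- data.frame(
  Year    = c(2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014, 2014),
  Month   = c(8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9),
  Day     = c(28, 29, 30, 31, 7, 6, 8, 9, 11, 12, 13),
  Average = c(2.3, 2.3, 1.7, 1.7, 3, 3, 3, 3, 5.8, 5.8, 5.8)
)

# Build real dates and sort by them so "previous row" means "previous day".
date <- as.Date(paste(df1$Year, df1$Month, df1$Day, sep = "-"))
ord  <- order(date)
df1  <- df1[ord, ]
date <- date[ord]

# TRUE when a row has the same Average as the row before it AND is exactly
# one day later. The first row can never be a duplicate.
is_dup <- c(FALSE, diff(df1$Average) == 0 & as.numeric(diff(date)) == 1)

zeroed <- df1
zeroed$Average[is_dup] <- 0   # first requested output: duplicates set to zero
kept <- df1[!is_dup, ]        # second requested output: duplicate rows removed

# Hypothetical output paths; replace with your own file names.
zeroed_path <- file.path(tempdir(), "averages_zeroed.csv")
kept_path   <- file.path(tempdir(), "averages_unique.csv")
write.csv(zeroed, zeroed_path, row.names = FALSE)
write.csv(kept, kept_path, row.names = FALSE)
```

Because `kept` is built from the flag, a day whose Average is genuinely zero would survive in the second file.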

Here's a data.table solution:
Read in the data
# `text` is assumed to hold the path to your CSV file (it is not defined in this answer)
data <- readr::read_csv(
text,
col_names = TRUE,
trim_ws = TRUE
)
library( data.table )
setDT( data )
Convert the date values to a nicer format, and sort
data[ , date := as.Date( paste0( Year, "-", Month, "-", Day ) ) ]
setorder( data, date )
Create new columns for previous date and average values
data[ , prev.date := shift( date, 1L, type = "lag" ) ]
data[ , prev.average := shift( Average, 1L, type = "lag" ) ]
Mark the points where a new "group" should be created, based on your criteria. Also mark the very first record as the start of a new group, since we can assume that it is.
data[ , group := 0L
][ as.integer( date - prev.date ) > 1L |
Average != prev.average, group := 1L
][ 1L, group := 1L ]
Get your first desired output by replacing particular values with zeros
data[ group != 1L, Average := 0 ]
first.output <- data[ , .( date, Average ) ]
head( first.output, 3 )
date Average
1: 2013-08-28 2.3
2: 2013-08-29 0.0
3: 2013-08-30 1.7
Now mark the groups as unique numbers
data[ , group := cumsum( group ) ]
And get your second output by aggregating to maximum "Average" value (which will be the only one not equal to zero), and the minimum "date" value (the first in that group):
second.output <- data[ , .( date = min( date ),
Average = max( Average ) ),
by = group ][ , .( date, Average ) ]
head( second.output, 3 )
date Average
1: 2013-08-28 2.3
2: 2013-08-30 1.7
3: 2014-08-06 3.0
NOTE: you could likely get second.output by simply removing rows with a zero "Average" value from the first.output, but it would remove any groups where the "Average" really is zero, so I think this method is safer.
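To illustrate the caveat in the note, here is a small sketch on made-up data in which one run of days has a true Average of zero. The data, the column name `new_group`, and the object names are hypothetical. The grouping rule is the one described above: a row starts a new group unless it repeats the previous value on the next day.

```r
library(data.table)

# Hypothetical data: the run on 2014-09-11..12 really has an Average of 0.
dt <- data.table(
  date    = as.Date(c("2014-08-06", "2014-08-07", "2014-09-11", "2014-09-12")),
  Average = c(3, 3, 0, 0)
)

# A row starts a new group if it is the first row, if there is a gap of more
# than one day, or if the value differs from the previous row.
dt[, new_group := is.na(shift(date)) |
                  as.integer(date - shift(date)) > 1L |
                  Average != shift(Average)]
dt[, group := cumsum(new_group)]

# Aggregating by group keeps the run whose Average is truly zero.
by_group <- dt[, .(date = min(date), Average = Average[1L]), by = group]

# Zeroing the duplicates and then filtering on Average != 0 loses that run.
zeroed    <- copy(dt)[new_group == FALSE, Average := 0]
by_filter <- zeroed[Average != 0]

nrow(by_group)   # 2 groups
nrow(by_filter)  # 1 row: the zero-valued group has disappeared
```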

Related

Sum up with the next line into a new column

I'm having some trouble figuring out how to create a new column with the sum of two subsequent cells.
I have :
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now I want a new column where the first row is the sum 1+2, the second row is the sum 1+2+3, the third row is the sum 1+2+3+4, and so on.
The values 1, 2, 3, 4... are hypothetical. I need to measure the absolute growth from one decade to the next so that I can later create a new variable measuring the percentage change between decades.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
> library(data.table)
> setDT(df1)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
df1,
newCol = c(cumsum(Values)[-1],sum(Values))
)
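The question also mentions wanting the absolute growth and, later, the percentage change from one decade to the next. None of the answers covers that step, so here is a minimal base R sketch of one way to do it with `diff()`. The column names `abs_change` and `pct_change` are made up for illustration, and the sketch assumes the change is measured on `Values` itself rather than on the cumulative sum.

```r
df1 <- data.frame(
  Years  = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
  Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# Absolute change from the previous row; the first row has no predecessor.
df1$abs_change <- c(NA, diff(df1$Values))

# Percentage change relative to the previous row's value.
previous <- c(NA, head(df1$Values, -1))
df1$pct_change <- 100 * df1$abs_change / previous

df1
```

The first row is `NA` in both new columns because there is no earlier decade to compare against.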

Averaging a monthly time series with incomplete observations

I have the following dataset:
id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
I would like to create monthly averages of observation_value. In those cases that there are no values for a certain month, I would like to fill in the data with the average between the months where I have data.
Using the data in the Note at the end -- we have added a second id -- convert to zoo using column 1 to split by and column 2 as the index with yearmon class. Also in the same statement aggregate using mean over year/month giving the zoo object z. Then convert to ts which will fill in the missing months with NA and then convert back to zoo and use na.approx to fill in the NAs (or use na.spline or na.locf depending on what you want). fortify.zoo(zz) and fortify.zoo(zz, melt = TRUE) can be used to convert zoo objects to data frames.
library(zoo)
z <- read.zoo(dat, FUN = as.yearmon, index = 2, split = 1, aggregate = mean)
zz <- na.approx(as.zoo(as.ts(z)))
giving
> zz
1 2
Feb 2015 5.5 5.5
Mar 2015 24.0 24.0
Apr 2015 18.5 18.5
May 2015 13.0 13.0
Jun 2015 7.5 7.5
Jul 2015 2.0 2.0
Aug 2015 5.5 5.5
Sep 2015 9.0 9.0
Oct 2015 10.0 10.0
Nov 2015 11.0 11.0
Dec 2015 12.0 12.0
Note
Lines <- "id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
2 2015-02-23 5
2 2015-02-24 6
2 2015-03-01 24
2 2015-07-16 2
2 2015-09-28 9
2 2015-12-05 12"
dat <- read.table(text = Lines, header = TRUE)
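For readers who prefer not to depend on zoo, the same two steps (monthly means, then linear interpolation over the missing months) can be sketched in base R with `tapply()` and `approx()`. This is only an illustration for a single id; the helper `month_index` and the object names are assumptions, not part of the answer above.

```r
dat <- data.frame(
  observation_date  = as.Date(c("2015-02-23", "2015-02-24", "2015-03-01",
                                "2015-07-16", "2015-09-28", "2015-12-05")),
  Observation_value = c(5, 6, 24, 2, 9, 12)
)

# Step 1: mean per calendar month, for the months that have observations.
month_key <- format(dat$observation_date, "%Y-%m")
means     <- tapply(dat$Observation_value, month_key, mean)
observed  <- as.Date(paste0(names(means), "-01"))

# Step 2: every month between the first and last observed month.
all_months <- seq(min(observed), max(observed), by = "month")

# Use a running month number as the x axis so each month is one step apart,
# whatever its length in days.
month_index <- function(d) {
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m"))
}

# Step 3: linear interpolation between the observed monthly means.
filled <- approx(x = month_index(observed), y = as.numeric(means),
                 xout = month_index(all_months))$y

result <- data.frame(month = format(all_months, "%Y-%m"), value = filled)
result
```

For this data the interpolated values agree with the zoo output shown above (for example 18.5, 13.0 and 7.5 for April to June).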

How to combine winter months of two consecutive years

I have count data for several species spanning several years. I want to look at the abundance dynamics of each species over the winter season only, for each year. The problem is that the winter season spans two years: November and December of one year and January of the next. I want to combine the abundance of each species over the winter months spanning two consecutive years and run some analysis. For example, in the first round I want to subset Nov-Dec 2005 and Jan 2006 and analyse that; in the second round I want to subset Nov-Dec 2006 and Jan 2007 and repeat the same analysis, and so on. How can I do this in R?
Here is an example of the data
date species year month day abundance temp
9/3/2005 A 2005 9 3 3 19
9/15/2005 B 2005 9 15 30 16
10/4/2005 A 2005 10 4 24 12
11/6/2005 A 2005 11 6 32 14
12/8/2005 A 2005 12 8 15 13
1/3/2005 A 2006 1 3 64 19
1/4/2006 B 2006 1 4 2 13
2/10/2006 A 2006 2 10 56 12
2/8/2006 A 2006 1 3 34 19
3/9/2006 A 2006 1 3 64 19
I convert your date column to a date class (possibly with lubridate) and remove the year month day columns as they are redundant.
Then make a new column with the seasonal year (defined as the year, unless the month is Jan, then it is the previous year). A further column is made with case_when that defines the row's season.
library(dplyr)
library(lubridate)
# converts to date format
df$date <- mdy(df$date)
# drop the redundant year, month and day columns
df <- select(df, -year, -month, -day)
# add in columns
df <- mutate(df,
season_year = ifelse(month(date) == 1, year(date) - 1, year(date)),
season = case_when(
month(date) %in% c(2, 3, 4) ~ "Spring",
month(date) %in% c(5, 6, 7) ~ "Summer",
month(date) %in% c(8, 9, 10) ~ "Autumn",
month(date) %in% c(11, 12, 1) ~ "Winter",
TRUE ~ NA_character_
))
# date species abundance temp season_year season
# 1 2005-09-03 A 3 19 2005 Autumn
# 2 2005-09-15 B 30 16 2005 Autumn
# 3 2005-10-04 A 24 12 2005 Autumn
# 4 2005-11-06 A 32 14 2005 Winter
# 5 2005-12-08 A 15 13 2005 Winter
# 6 2005-01-03 A 64 19 2004 Winter
# 7 2006-01-04 B 2 13 2005 Winter
# 8 2006-02-10 A 56 12 2006 Spring
# 9 2006-02-08 A 34 19 2006 Spring
# 10 2006-03-09 A 64 19 2006 Spring
Then you can group_by() and/or filter() your data for further analysis:
df %>%
group_by(season_year) %>%
filter(season == "Winter") %>%
summarise(count = sum(abundance))
# # A tibble: 2 x 2
# season_year count
# <dbl> <int>
# 1 2004 64
# 2 2005 49
data.table solution:
first create a lookup-table with from-to dates and the season-year, then perform an overlap-join using foverlaps
library( data.table )
sample data
dt <- fread("date species year month day abundance temp
9/3/2005 A 2005 9 3 3 19
9/15/2005 B 2005 9 15 30 16
10/4/2005 A 2005 10 4 24 12
11/6/2005 A 2005 11 6 32 14
12/8/2005 A 2005 12 8 15 13
1/3/2005 A 2006 1 3 64 19
1/4/2006 B 2006 1 4 2 13
2/10/2006 A 2006 2 10 56 12
2/8/2006 A 2006 1 3 34 19
3/9/2006 A 2006 1 3 64 19", header = TRUE)
create a lookup-table
Here you define the names, start and end of the seasons. Adjust to your own needs. Since you want to analyse the seasons individually, I advise keeping the season names unique (here: based on the start year of the season).
dt.season <- data.table( from = seq( as.Date("1999-02-01"), length.out = 100, by = "3 month"),
to = seq( as.Date("1999-05-01"), length.out = 100, by = "3 month") - 1 )
dt.season[, season := paste0( c( "spring", "summer", "autumn", "winter" ), "-", year( from ) )]
setkey( dt.season, from, to )
head(dt.season,6)
# from to season
# 1: 1999-02-01 1999-04-30 spring-1999
# 2: 1999-05-01 1999-07-31 summer-1999
# 3: 1999-08-01 1999-10-31 autumn-1999
# 4: 1999-11-01 2000-01-31 winter-1999
# 5: 2000-02-01 2000-04-30 spring-2000
# 6: 2000-05-01 2000-07-31 summer-2000
and perform join
#set dt$date as dates
dt[, date := as.Date(date, format = "%m/%d/%Y")]
#create dummy variables to join on
dt[, `:=`( from = date, to = date)]
#create an overlap join, and clean the dummies used for the join
foverlaps( dt, dt.season)[, `:=`(from = NULL, to = NULL, i.from = NULL, i.to = NULL)][]
# season date species year month day abundance temp
# 1: autumn-2005 2005-09-03 A 2005 9 3 3 19
# 2: autumn-2005 2005-09-15 B 2005 9 15 30 16
# 3: autumn-2005 2005-10-04 A 2005 10 4 24 12
# 4: winter-2005 2005-11-06 A 2005 11 6 32 14
# 5: winter-2005 2005-12-08 A 2005 12 8 15 13
# 6: winter-2004 2005-01-03 A 2006 1 3 64 19
# 7: winter-2005 2006-01-04 B 2006 1 4 2 13
# 8: spring-2006 2006-02-10 A 2006 2 10 56 12
# 9: spring-2006 2006-02-08 A 2006 1 3 34 19
# 10: spring-2006 2006-03-09 A 2006 1 3 64 19
You can now easily group/sum/analyse by season
I'd think the easiest way would be to consider that 2006 winter consists of Nov, Dec 2006 and Jan 2007, you could add a column winterid <- ifelse(data$month %in% c(11,12), data$year, ifelse(data$month == 1, data$year-1, "notwinter")).
You are now able to subset on the successive winter seasons. Adapt according to your notation.
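As a rough sketch of that last suggestion, the following base R code builds the winter id, totals abundance per winter and species, and splits the data so the same analysis can be repeated for each winter. The data here is made up for illustration (it is not the question's data), and `NA` is used instead of the "notwinter" label so the id column stays numeric.

```r
# Hypothetical data covering two winters.
df <- data.frame(
  species   = c("A", "A", "A", "B", "A", "A", "A"),
  year      = c(2005, 2005, 2006, 2006, 2006, 2006, 2007),
  month     = c(11, 12, 1, 1, 6, 12, 1),
  abundance = c(32, 15, 64, 2, 40, 10, 5)
)

# Nov/Dec belong to the winter of their own year, Jan to the previous year's
# winter, and every other month is not winter (NA).
df$winter <- ifelse(df$month %in% c(11, 12), df$year,
                    ifelse(df$month == 1, df$year - 1, NA))

winter_df <- df[!is.na(df$winter), ]

# Total abundance per winter and species.
totals <- aggregate(abundance ~ winter + species, data = winter_df, FUN = sum)

# One data frame per winter, ready for repeating the same analysis on each.
by_winter <- split(winter_df, winter_df$winter)
names(by_winter)  # "2005" "2006"
```

Each element of `by_winter` can then be passed to whatever analysis function you use, for example with `lapply()`.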

how to replace missing values with previous year's binned mean

I have a data frame as shown below.
The bin columns (Bin_p1 and Bin_f1) were calculated with the cut function, like this:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how can I calculate the mean over the previous two years and use it to replace the NA values within the same bin?
For example, in row 5 the value of p1 is NA and f1 is 30, with corresponding bin 7. The NA should be replaced with the mean of the previous two years' values for the same bin (7), i.e.
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
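To see what `mean(rev(lag(f1))[1:2])` does inside one group, here is a small base R walk-through. The helper `lag1` is a stand-in for `dplyr::lag()` written for this sketch, and the vectors are the f1 values of the Bin_p1 == 2 group from the example, plus a reordered variant.

```r
# Base R stand-in for dplyr::lag(x): shift values down by one position.
lag1 <- function(x) c(NA, head(x, -1))

# f1 values for the rows with Bin_p1 == 2, sorted by year; the NA is last.
f1 <- c(16, 17, NA)

lag1(f1)                      # NA 16 17
rev(lag1(f1))[1:2]            # 17 16 (the two values just before the NA)
mean(rev(lag1(f1))[1:2])      # 16.5

# If the missing value is NOT the last row of its group, the same expression
# no longer picks the two preceding values.
f1_mid <- c(16, NA, 17)
mean(rev(lag1(f1_mid))[1:2])  # NA
```

So the approach appears to rely on the missing value being the most recent row in its bin after `arrange(year)`; data where that does not hold would need a different rule.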

Taking Average and Median by Month and then Ordering by Date and Factor in R

Let's suppose I have the following data:
set.seed(123)
Dates <- c("2013-10-07","2013-10-14","2013-11-21","2013-11-28" , "2013-12-04" , "2013-12-11","2013-01-18","2013-01-18")
Dates.New <- c(Dates,Dates)
Values <- sample(seq(1:10),16,replace = TRUE)
Factor <- c(rep("Group 1",8),rep("Group 2",8))
df <- data.frame(Dates.New,Values,Factor)
df[sample(1:nrow(df)),]
This returns
Dates.New Values Factor
4 2013-11-28 9 Group 1
1 2013-10-07 3 Group 1
5 2013-12-04 10 Group 1
13 2013-12-04 7 Group 2
11 2013-11-21 10 Group 2
8 2013-01-18 9 Group 1
7 2013-01-18 6 Group 1
9 2013-10-07 6 Group 2
6 2013-12-11 1 Group 1
14 2013-12-11 6 Group 2
16 2013-01-18 9 Group 2
3 2013-11-21 5 Group 1
2 2013-10-14 8 Group 1
15 2013-01-18 2 Group 2
12 2013-11-28 5 Group 2
10 2013-10-14 5 Group 2
What I am trying to do here is find the monthly average and median for both of my factors, then order each group by month in a new data frame. The new data frame would have the median and average for months 10, 11, 12 and 1 for Group 1 bundled together, and the next 4 rows would have the median and average for the same months for Group 2 bundled together as well. I am open to packages. Thanks!
Here is a data.table solution. The question seems to be looking for both mean and median. See if this suits your need.
library(zoo); library(data.table)
setDT(df)[, list(Mean = mean(Values),
Median = median(Values)),
by = list(Factor, as.yearmon(Dates.New))][order(Factor, as.yearmon)]
# Factor as.yearmon Mean Median
# 1: Group 1 Jan 2013 7.5 7.5
# 2: Group 1 Oct 2013 5.5 5.5
# 3: Group 1 Nov 2013 7.0 7.0
# 4: Group 1 Dec 2013 5.5 5.5
# 5: Group 2 Jan 2013 5.5 5.5
# 6: Group 2 Oct 2013 5.5 5.5
# 7: Group 2 Nov 2013 7.5 7.5
# 8: Group 2 Dec 2013 6.5 6.5
Like this?
df$Dates.New <- as.Date(df$Dates.New)
library(zoo) # for as.yearmon(...)
result <- aggregate(Values~as.yearmon(Dates.New)+Factor,df,mean)
names(result)[1] <- "Year.Mon"
result
# Year.Mon Factor Values
# 1 Jan 2013 Group 1 7.5
# 2 Oct 2013 Group 1 5.5
# 3 Nov 2013 Group 1 7.0
# 4 Dec 2013 Group 1 5.5
# 5 Jan 2013 Group 2 5.5
# 6 Oct 2013 Group 2 5.5
# 7 Nov 2013 Group 2 7.5
# 8 Dec 2013 Group 2 6.5
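The `aggregate` answer above returns only the mean. A minimal base R sketch that returns both the mean and the median in one call, ordered by group and then month, might look like this. It uses a subset of the values printed in the question; the `Month` column and the `Values.Mean` / `Values.Median` names are artifacts of this sketch.

```r
# October and November rows for both groups, taken from the printed data.
df <- data.frame(
  Dates.New = as.Date(c("2013-10-07", "2013-10-14", "2013-11-21", "2013-11-28",
                        "2013-10-07", "2013-10-14", "2013-11-21", "2013-11-28")),
  Values    = c(3, 8, 5, 9, 6, 5, 10, 5),
  Factor    = rep(c("Group 1", "Group 2"), each = 4)
)

# Year-month key; "YYYY-MM" strings sort chronologically.
df$Month <- format(df$Dates.New, "%Y-%m")

# The summary function returns two numbers, so aggregate() stores them in a
# matrix column called Values.
result <- aggregate(Values ~ Month + Factor, data = df,
                    FUN = function(v) c(Mean = mean(v), Median = median(v)))

# Flatten the matrix column into ordinary columns Values.Mean / Values.Median.
result <- do.call(data.frame, result)

# Group 1 months first, then Group 2 months.
result <- result[order(result$Factor, result$Month), ]
result
```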
