How to conditionally update a table in R?

My table looks like this:
# Year Month WaterYear
# 1993 3
# 2000 4
# 2013 10
# 2015 6
# 2000 7
# 2008 12
# 2008 9
# 2012 10
# 2000 11
# 2000 12
I am trying to update this table by setting WaterYear to Year + 1 for rows where Month falls between October and December.
I am working in R and hoping to find the easiest way to do this.

A simple ifelse() call will do the trick.
Using your data:
# Create data
Year <- c(1993, 2000, 2013, 2015, 2000, 2008, 2008, 2012, 2000, 2000)
Month <- c(3, 4, 10, 6, 7, 12, 9 ,10, 11, 12)
WaterYear <- rep("",length(Year))
dat <- data.frame(Year, Month, WaterYear)
# If Month is greater than or equal to 10, set WaterYear to Year + 1;
# otherwise keep the existing (empty) value
dat$WaterYear <- ifelse(dat$Month >= 10, dat$Year + 1, dat$WaterYear)
This results in:
Year Month WaterYear
1993 3
2000 4
2013 10 2014
2015 6
2000 7
2008 12 2009
2008 9
2012 10 2013
2000 11 2001
2000 12 2001

We can also do
i1 <- dat$Month >=10
dat$WaterYear[i1] <- dat$Year[i1] + 1
dat
# Year Month WaterYear
#1 1993 3
#2 2000 4
#3 2013 10 2014
#4 2015 6
#5 2000 7
#6 2008 12 2009
#7 2008 9
#8 2012 10 2013
#9 2000 11 2001
#10 2000 12 2001
Or using data.table, convert the 'data.frame' to 'data.table' (setDT(dat)), specify the logical condition in 'i' (Month >= 10), and assign (:=) the 'Year' + 1 to 'WaterYear'
library(data.table)
setDT(dat)[Month >=10, WaterYear := as.character(Year + 1)]
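If the blank entries are not needed, a shorter variant is possible. The sketch below assumes (this is not stated in the question) that rows outside October to December should simply carry the calendar year, so WaterYear can stay numeric. It relies on R treating TRUE/FALSE as 1/0 in arithmetic. The name dat2 is only used here to avoid overwriting dat.

```r
# Same Year and Month values as in the question
dat2 <- data.frame(
  Year  = c(1993, 2000, 2013, 2015, 2000, 2008, 2008, 2012, 2000, 2000),
  Month = c(3, 4, 10, 6, 7, 12, 9, 10, 11, 12)
)

# (Month >= 10) is logical and is coerced to 1/0 when added to Year.
# Assumption: months 1-9 keep the calendar year instead of an empty string.
dat2$WaterYear <- dat2$Year + (dat2$Month >= 10)
dat2
```

Keeping the column numeric avoids the character coercion that happens when numbers are mixed with "" in ifelse().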

Related

Remove duplicate year rows by groups

I have a data.table of the following form:
data <- data.table(group = rep(1:3, each = 4),
year = c(2011:2014, rep(2011:2012, each = 2),
2012, 2012, 2013, 2014), value = 1:12)
This is only an excerpt of my data.
Group 2 has two values each for 2011 and 2012, and group 3 has two values for 2012. I want to keep only the first row for each duplicated year within a group.
In effect, my data.table would become the following:
data <- data.table(group = c(rep(1, 4), rep(2, 2), rep(3, 3)),
year = c(2011:2014, 2011, 2012, 2012, 2013, 2014),
value = c(1:5, 7, 9, 11, 12))
How can I achieve this? Thanks in advance.
Try this data.table option with duplicated
> data[!duplicated(cbind(group, year))]
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12
For data.tables, you can pass the by argument to unique:
library(data.table)
unique(data, by = c('group', 'year'))
# group year value
#1: 1 2011 1
#2: 1 2012 2
#3: 1 2013 3
#4: 1 2014 4
#5: 2 2011 5
#6: 2 2012 7
#7: 3 2012 9
#8: 3 2013 11
#9: 3 2014 12
Using base R
subset(data, !duplicated(cbind(group, year)))
One solution would be to use distinct from dplyr like so:
library(dplyr)
data %>%
distinct(group, year, .keep_all = TRUE)
Output:
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12
This should also do the trick:
library(dplyr)
data %>%
  group_by(group, year) %>%
  # keep the first row of each (group, year) combination
  filter(row_number() == 1) %>%
  ungroup()
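As a small extension of the duplicated() approach, here is a sketch on a plain data.frame (called df_dups purely for illustration) that also shows fromLast = TRUE, in case the last row of each group/year pair is wanted instead of the first. The data are the same as in the question.

```r
df_dups <- data.frame(
  group = rep(1:3, each = 4),
  year  = c(2011:2014, rep(2011:2012, each = 2), 2012, 2012, 2013, 2014),
  value = 1:12
)

# Only these two columns decide whether a row counts as a duplicate
key <- df_dups[c("group", "year")]

first_rows <- df_dups[!duplicated(key), ]                   # keep first occurrence
last_rows  <- df_dups[!duplicated(key, fromLast = TRUE), ]  # keep last occurrence

first_rows$value  # 1 2 3 4 5 7 9 11 12
last_rows$value   # 1 2 3 4 6 8 10 11 12
```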

Create incremental column year based on id and year column in R

I have the dataframe below and I want to create the 'create_col' column, presumably with some kind of seq() call, using the 'year' column as the start of the sequence within each id. How could I do that?
id <- c(1,1,2,3,3,3,4)
year <- c(2013, 2013, 2015,2017,2017,2017,2011)
create_col <- c(2013,2014,2015,2017,2018,2019,2011)
Ideal result:
id year create_col
1 1 2013 2013
2 1 2013 2014
3 2 2015 2015
4 3 2017 2017
5 3 2017 2018
6 3 2017 2019
7 4 2011 2011
You can add row_number() - 1 to the minimum year within each id:
library(dplyr)
df %>%
group_by(id) %>%
mutate(create_col = min(year) + row_number() - 1)
# id year create_col
# <dbl> <dbl> <dbl>
#1 1 2013 2013
#2 1 2013 2014
#3 2 2015 2015
#4 3 2017 2017
#5 3 2017 2018
#6 3 2017 2019
#7 4 2011 2011
data
df <- data.frame(id, year)
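For readers who prefer to avoid a dplyr dependency, the same idea can be sketched in base R with ave(), which applies a function within each group and returns a vector in the original row order. The name df_inc is only illustrative.

```r
id <- c(1, 1, 2, 3, 3, 3, 4)
year <- c(2013, 2013, 2015, 2017, 2017, 2017, 2011)
df_inc <- data.frame(id, year)

# Within each id: start at the minimum year and count up by one per row
df_inc$create_col <- ave(
  df_inc$year, df_inc$id,
  FUN = function(y) min(y) + seq_along(y) - 1
)
df_inc$create_col  # 2013 2014 2015 2017 2018 2019 2011
```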

Create groups based on time period

How can I create a new grouping variable for my data based on 5-year steps?
So from this:
group <- c(rep("A", 7), rep("B", 10))
year <- c(2008:2014, 2005:2014)
dat <- data.frame(group, year)
group year
1 A 2008
2 A 2009
3 A 2010
4 A 2011
5 A 2012
6 A 2013
7 A 2014
8 B 2005
9 B 2006
10 B 2007
11 B 2008
12 B 2009
13 B 2010
14 B 2011
15 B 2012
16 B 2013
17 B 2014
To this:
> dat
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
7 A 2014 2010_2014
8 B 2005 2005_2009
9 B 2006 2005_2009
10 B 2007 2005_2009
11 B 2008 2005_2009
12 B 2009 2005_2009
13 B 2010 2010_2014
14 B 2011 2010_2014
15 B 2012 2010_2014
16 B 2013 2010_2014
17 B 2014 2010_2014
I guess I could use cut(dat$year, breaks = ??) but I don't know how to set the breaks.
Here is one way of doing it:
start <- floor(dat$year / 5) * 5
dat$period <- paste(start, start + 4, sep = "_")
The trick is that floor(year / x) * x gives the largest multiple of x that does not exceed the year, which is the first year of its period.
Here is a version that should work more generally:
x <- 5
yearstart <- 2000
start <- floor((dat$year - yearstart) / x) * x + yearstart
dat$period <- paste(start, start + x - 1, sep = "_")
You can use yearstart to make sure that, for example, the year 2000 is the first year of a period even when 2000 is not a multiple of x.
cut should do the job if you create actual Date objects from your 'year' column.
## convert 'year' column to dates
yrs <- paste0(dat$year, "-01-01")
yrs <- as.Date(yrs)
## create cuts of 5 years and add them to data.frame
dat$period <- cut(yrs, "5 years")
## create desired factor levels
library(lubridate)
lvl <- as.Date(levels(dat$period))
lvl <- paste(year(lvl), year(lvl) + 4, sep = "_")
levels(dat$period) <- lvl
head(dat)
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
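Since the question asks how to set the breaks for cut() on the numeric years directly, here is a minimal sketch of one way to do that. It assumes periods should start at multiples of the bin width (5 here) and uses right = FALSE so that each bin includes its first year and excludes the first year of the next bin. The names dat_cut, width, lo and hi are illustrative.

```r
group <- c(rep("A", 7), rep("B", 10))
year <- c(2008:2014, 2005:2014)
dat_cut <- data.frame(group, year)

width <- 5
# Lower edge of the first bin and upper edge of the last bin,
# both rounded to multiples of `width`
lo <- floor(min(dat_cut$year) / width) * width
hi <- (floor(max(dat_cut$year) / width) + 1) * width
breaks <- seq(lo, hi, by = width)

# One label per bin: first year and last year of the bin
starts <- head(breaks, -1)
labels <- paste(starts, starts + width - 1, sep = "_")

dat_cut$period <- cut(dat_cut$year, breaks = breaks,
                      labels = labels, right = FALSE)
head(dat_cut)
```

The result is a factor, so as.character() can be applied if plain strings are preferred.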

How do I format row.names of an R table?

Consider this set of dates, x:
set.seed(1234)
x <- sample(1980:2010, 100, replace = T)
x <- strptime(x, '%Y')
x <- strftime(x, '%Y')
The following is a distribution of the years of those dates:
> table(x)
x
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1994
4 4 3 3 6 4 3 4 5 12 1 1 1 2
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
9 4 2 1 4 4 2 1 4 1 4 3 4 3
2010
1
Now say I want to group them by decade. For this, I use the cut function:
> table(cut(x, seq(1980, 2010, 10)))
Error in cut.default(x, seq(1980, 2010, 10)) : 'x' must be numeric
Ok, so let's force x to numeric:
> table(cut(as.numeric(x), seq(1980, 2010, 10)))
(1.98e+03,1.99e+03] (1.99e+03,2e+03] (2e+03,2.01e+03]
45 28 23
Now, as you can see, the row.names (the bin labels) of that table are in scientific notation. How do I force them into plain notation? I've tried wrapping the whole command above in format, formatC and prettyNum, but all those do is format the frequencies.
Thanks to joran for pointing the way to the answer. I'll elaborate on it here for the record:
Changing cut's dig.lab parameter from the default of 3 to 4 solved both this particular mock-up and my real problem:
> table(cut(as.numeric(x), seq(1980, 2010, 10), dig.lab = 4))
(1980,1990] (1990,2000] (2000,2010]
45 28 23
By the way, in order for 1980 to be counted one should include the include.lowest argument:
> table(cut(as.numeric(x), seq(1980, 2010, 10), dig.lab = 4, include.lowest = T))
[1980,1990] (1990,2000] (2000,2010]
49 28 23
Now it sums to 100! :)
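Note that with the default right = TRUE each bin is closed on the right, so 1990 is counted together with the 1980s. A sketch of the left-closed alternative follows, using a small made-up vector (yrs is hypothetical, not the sampled data above) and with the breaks extended to 2020 so that 2010 still falls inside a bin.

```r
# Hypothetical years chosen to sit on the bin edges
yrs <- c(1980, 1985, 1990, 1999, 2000, 2010)

# right = FALSE gives bins of the form [1980,1990), [1990,2000), ...
bins <- cut(yrs, breaks = seq(1980, 2020, 10), right = FALSE, dig.lab = 4)
table(bins)
```

With these six values, each of the first two bins holds two years and each of the last two holds one.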
This doesn't exactly answer the question you asked, but shows you a possible alternative: use the fact that there is a cut.Date method:
set.seed(1234)
x <- sample(1980:2010, 100, replace = T)
x <- strptime(x, '%Y')
out <- table(cut(x, "10 years"))
out
#
# 1980-01-01 1990-01-01 2000-01-01 2010-01-01
# 48 25 26 1
Here, we also get what I would consider the "correct" values for each bin.
As a crude justification of my statement about "correct" values, consider the values we get when we manually calculate based on table:
y <- strftime(x, '%Y')
Tab <- table(y)
Tab
# y
# 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1994 1995 1996
# 4 4 3 3 6 4 3 4 5 12 1 1 1 2 9 4
# 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2010
# 2 1 4 4 2 1 4 1 4 3 4 3 1
sum(Tab[grepl("198", names(Tab))])
# [1] 48
sum(Tab[grepl("199", names(Tab))])
# [1] 25
sum(Tab[grepl("200", names(Tab))])
# [1] 26
sum(Tab[grepl("201", names(Tab))])
# [1] 1
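If the years are available as plain numbers, the same per-decade counts can be obtained without cut() or grepl() by integer division. This is a minimal sketch on a small made-up vector (demo_years is hypothetical, not the sampled data above).

```r
demo_years <- c(1980, 1985, 1990, 1999, 2000, 2010)

# Integer-divide by 10 and multiply back to get the first year of each decade
decade <- (demo_years %/% 10) * 10
table(decade)
```

Here the table has the names 1980, 1990, 2000 and 2010 with counts 2, 2, 1 and 1.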

Change year variable in dataframe depending on season

I have a dataframe that looks like following:
set.seed(50)
df <- data.frame(Month=c(sort(sample(1:12, 10)),
sort(sample(1:12, 10)),
sort(sample(1:12, 10))),
Year=c(rep(2007, 10),
rep(2010, 10),
rep(2011, 10)))
Head of df:
Month Year
1 1 2007
2 3 2007
3 4 2007
4 5 2007
5 6 2007
6 7 2007
I need to recode the year variable depending on the season. For example, if the month is January and the year is 2013, the year should be recoded to 2012/2013. In general, for January-June of 2013 the year should be recoded to 2012/2013, and for July-December it should be recoded to 2013/2014.
df should therefore be recoded as below. Note that some months are missing and some years are missing:
set.seed(50)
df <- data.frame(Month=c(sort(sample(1:12, 10)),
sort(sample(1:12, 10)),
sort(sample(1:12, 10))),
Year=c(rep(2007, 10),
rep(2010, 10),
rep(2011, 10)),
Year.Seasonal=c(rep('2006/2007', 5),
rep('2007/2008', 5),
rep('2009/2010', 6),
rep('2010/2011', 9),
rep('2011/2012', 5)))
Head of recoded df:
Month Year Year.Seasonal
1 1 2007 2006/2007
2 3 2007 2006/2007
3 4 2007 2006/2007
4 5 2007 2006/2007
5 6 2007 2006/2007
6 7 2007 2007/2008
What is the best way of doing this?
df <- within(df,season <- paste(Year - (Month <= 6),
Year + (Month > 6),sep="/"))
head(df)
Month Year season
1 1 2007 2006/2007
2 3 2007 2006/2007
3 4 2007 2006/2007
4 5 2007 2006/2007
5 6 2007 2006/2007
6 7 2007 2007/2008
Here is a solution using ifelse(): if Month is less than 7, the row belongs to the season that started in the previous year; otherwise it belongs to the season that starts in the current year. paste() joins the two years.
df$Year.Seasonal<-ifelse(df$Month<7,
paste(df$Year-1,df$Year,sep="/"),paste(df$Year,df$Year+1,sep="/"))
> head(df)
Month Year Year.Seasonal
1 1 2007 2006/2007
2 3 2007 2006/2007
3 4 2007 2006/2007
4 5 2007 2006/2007
5 6 2007 2006/2007
6 7 2007 2007/2008
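Both answers hard-code July as the first month of a season. As a sketch, the logic can be wrapped in a small helper with the start month as a parameter. The function name season_label and its start_month argument are hypothetical and not part of either answer.

```r
# Label each (year, month) pair with the season it falls in, e.g. "2006/2007".
# A season starts in `start_month` and ends the month before it a year later.
season_label <- function(year, month, start_month = 7) {
  # Months before start_month belong to the season that began the previous year
  start <- year - (month < start_month)
  paste(start, start + 1, sep = "/")
}

season_label(c(2007, 2007, 2013), c(1, 7, 6))
# "2006/2007" "2007/2008" "2012/2013"
```

It can be applied to the data frame with df$Year.Seasonal <- season_label(df$Year, df$Month).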
