Create groups based on time period - r

How can I create a new grouping variable for my data based on 5-year steps?
So from this:
group <- c(rep("A", 7), rep("B", 10))
year <- c(2008:2014, 2005:2014)
dat <- data.frame(group, year)
group year
1 A 2008
2 A 2009
3 A 2010
4 A 2011
5 A 2012
6 A 2013
7 A 2014
8 B 2005
9 B 2006
10 B 2007
11 B 2008
12 B 2009
13 B 2010
14 B 2011
15 B 2012
16 B 2013
17 B 2014
To this:
> dat
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
7 A 2014 2010_2014
8 B 2005 2005_2009
9 B 2006 2005_2009
10 B 2007 2005_2009
11 B 2008 2005_2009
12 B 2009 2005_2009
13 B 2010 2010_2014
14 B 2011 2010_2014
15 B 2012 2010_2014
16 B 2013 2010_2014
17 B 2014 2010_2014
I guess I could use cut(dat$year, breaks = ??) but I don't know how to set the breaks.

Here is one way of doing it:
dat$period <- paste(min <- floor(dat$year/5)*5, min+4,sep = "_")
I guess the trick here is to get the biggest whole number smaller than your year with the floor(year/x)*x function.
Here is a version that should work generally:
x <- 5
yearstart <- 2000
dat$period <- paste(min <- floor((dat$year-yearstart)/x)*x+yearstart,
min+x-1,sep = "_")
You can use yearstart to ensure e.g. year 2000 is the first in a group for when x is not a multiple of it.

cut should do the job if you create actual Date objects from your 'year' column.
## convert 'year' column to dates
yrs <- paste0(dat$year, "-01-01")
yrs <- as.Date(yrs)
## create cuts of 5 years and add them to data.frame
dat$period <- cut(yrs, "5 years")
## create desired factor levels
library(lubridate)
lvl <- as.Date(levels(dat$period))
lvl <- paste(year(lvl), year(lvl) + 4, sep = "_")
levels(dat$period) <- lvl
head(dat)
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014

Related

Extract all possible combinations of rows with unique values in a variable

I am trying to perform a meta-analysis on a dataset in which multiple authors have multiple studies which might cause bias. Therefore, I want to extract all the possible combinations of rows, in which any Author appears once.
Sample data:
sample <- data.frame(Author = c('a','a','b','b','c'),
Year = c('2020','2016', '2020','2010','2005'),
Value = c(3,1,2,4,5),
UniqueName = c('a 2020', 'a 2016', 'b 2020', 'b 2010', 'c 2005'))
Sample:
Author Year Value UniqueName
1 a 2020 3 a 2020
2 a 2016 1 a 2016
3 b 2020 2 b 2020
4 b 2010 4 b 2010
5 c 2005 5 c 2005
And would like to extract all possible combinations of rows (in this case, 4 possibilities) where each Author appears once.
> output1
Author Year Value UniqueName
1 a 2020 3 a 2020
2 b 2020 2 b 2020
3 c 2005 5 c 2005
> output2
Author Year Value UniqueName
1 a 2016 1 a 2016
2 b 2020 2 b 2020
3 c 2005 5 c 2005
> output3
Author Year Value UniqueName
1 a 2016 1 a 2016
2 b 2010 4 b 2010
3 c 2005 5 c 2005
> output4
Author Year Value UniqueName
1 a 2020 3 a 2020
2 b 2010 4 b 2010
3 c 2005 5 c 2005
At the end, I will perform the analyses on these 4 different extracted dataframes, but I don't know how to get them in a less manual way.
Maybe a less hacky way exists, but I seem to have a working solution.
My idea was to split your dataframe on authors and brute force the combinations of unique rows with expand.grid. Then with lapply creating a list of data.frames with the indexes of rows.
Here is the code:
splitsample <- split(sample, sample$Author)
outputs_rows <- expand.grid(lapply(splitsample, \(x) seq_len(nrow(x))))
names_authors <- colnames(outputs_rows)
outputs <- lapply(seq_len(nrow(outputs_rows)),
function(row) {
df <- data.frame()
for (aut in names_authors) {
df <- rbind(df, splitsample[[aut]][outputs_rows[row, aut], ])
}
return(df)
})
outputs
And the result looks like this:
> outputs
[[1]]
Author Year Value UniqueName
1 a 2020 3 a 2020
3 b 2020 2 b 2020
5 c 2005 5 c 2005
[[2]]
Author Year Value UniqueName
2 a 2016 1 a 2016
3 b 2020 2 b 2020
5 c 2005 5 c 2005
[[3]]
Author Year Value UniqueName
1 a 2020 3 a 2020
4 b 2010 4 b 2010
5 c 2005 5 c 2005
[[4]]
Author Year Value UniqueName
2 a 2016 1 a 2016
4 b 2010 4 b 2010
5 c 2005 5 c 2005
I hope this helped you.

subtract specific row und rename it

it is possible to subtract certain rows and rename them?
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","b","c","a","b","c", "a", "b", "c")
value <- c(2,2,10,3,3,12,4,4,16)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
And this is how the result should look:
year
category
value
2005
a
2
2005
b
2
2005
c
4
2006
a
3
2006
b
3
2006
c
12
2007
a
4
2007
b
4
2007
c
16
2005
c-b
2
2006
c-b
9
2007
c-b
12
You can use group_modify:
library(tidyverse)
df %>%
group_by(year) %>%
group_modify(~ add_row(.x, category = "c-b", value = .x$value[.x$category == "c"] - .x$value[.x$category == "b"]))
# A tibble: 12 x 3
# Groups: year [3]
year category value
<dbl> <chr> <dbl>
1 2005 a 2
2 2005 b 2
3 2005 c 10
4 2005 c-b 8
5 2006 a 3
6 2006 b 3
7 2006 c 12
8 2006 c-b 9
9 2007 a 4
10 2007 b 4
11 2007 c 16
12 2007 c-b 12
See substract() function.
Example:
substracted_df<-substr(df,df$category=="c")
If you want to know which rows are you dealing with, use which()
rows<-which(df$category=="c")
substracted_df<-df[rows, ]
You can rename each desired row as
row.names(substracted_df)<-c("Your desired row names")

replace NA with previous 2 years values

i have 2 df's ,in df1 we have NA values which needs to be replaced with mean of previous 2 years Average_f1
eg. in df1 - for row 5 year is 2015 and bin - 5 and we need to replace previous 2 years mean for same bin from df2 (2013&2014) and for row-7 we have only 1 year value
df1 df2
year p1 bin year bin_p1 Average_f1
2013 20 1 2013 5 29.5
2013 24 1 2014 5 16.5
2014 10 2 2015 NA 30
2014 11 2 2016 7 12
2015 NA 5
2016 10 3
2017 NA 7
output
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 **23** 5
2016 10 3
2017 **12** 7
Thanks in advance

How to conditionally update a table in R?

My table looks like this:
# Year Month WaterYear
# 1993 3
# 2000 4
# 2013 10
# 2015 6
# 2000 7
# 2008 12
# 2008 9
# 2012 10
# 2000 11
# 2000 12
I am trying to update this table by computing WaterYear equals Year+1 where months range between October and December.
I am working on R and hoping to find the easiest way to make it work.
Simple ifelse function will do the trick.
From your data.
# Create data
Year <- c(1993, 2000, 2013, 2015, 2000, 2008, 2008, 2012, 2000, 2000)
Month <- c(3, 4, 10, 6, 7, 12, 9 ,10, 11, 12)
WaterYear <- rep("",length(Year))
dat <- data.frame(Year, Month, WaterYear)
# If month is greater or equal to 10 change it to Year +1,
# otherwise keep it as it is
dat$WaterYear <- ifelse(dat$Month >=10, Year+1, WaterYear)
Results in
Year Month WaterYear
1993 3
2000 4
2013 10 2014
2015 6
2000 7
2008 12 2009
2008 9
2012 10 2013
2000 11 2001
We can also do
i1 <- dat$Month >=10
dat$WaterYear[i1] <- dat$Year[i1] + 1
dat
# Year Month WaterYear
#1 1993 3
#2 2000 4
#3 2013 10 2014
#4 2015 6
#5 2000 7
#6 2008 12 2009
#7 2008 9
#8 2012 10 2013
#9 2000 11 2001
#10 2000 12 2001
Or using data.table, convert the 'data.frame' to 'data.table' (setDT(dat)), specify the logical condition in 'i' (Month >= 10), and assign (:=) the 'Year' + 1 to 'WaterYear'
library(data.table)
setDT(dat)[Month >=10, WaterYear := as.character(Year + 1)]

Merge 2 resulting vectors into 1 data frame using R

I have a df like this
Month <- c('JAN','JAN','JAN','JAN','FEB','FEB','MAR','APR','MAY','MAY')
Category <- c('A','A','B','C','A','E','B','D','E','F')
Year <- c(2014,2015,2015,2015,2014,2013,2015,2014,2015,2013)
Number_Combinations <- c(3,2,3,4,1,3,6,5,1,1)
df <- data.frame(Month ,Category,Year,Number_Combinations)
df
Month Category Year Number_Combinations
1 JAN A 2014 3
2 JAN A 2015 2
3 JAN B 2015 3
4 JAN C 2015 4
5 FEB A 2014 1
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
9 MAY E 2015 1
10 MAY F 2013 1
I have another df that I got from the above dataframe with a condition
df1 <- subset(df,Number_Combinations > 2)
df1
Month Category Year Number_Combinations
1 JAN A 2014 3
3 JAN B 2015 3
4 JAN C 2015 4
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
Now I want to create a table reporting the month, the total number of rows for the month in df and the total number of for the month in df1
Desired Output would be
Month Number_Month_df Number_Month_df1
1 JAN 4 3
2 FEB 2 1
3 MAR 1 1
4 APR 1 1
5 MAY 2 0
While I used table(df) and table(df1) and tried merging but not getting the desired result. Could someone please help me in getting the above dataframe?
We get the table of the 'Month' column from both 'df' and 'df1', convert to 'data.frame' (as.data.frame), merge by the 'Var1', and change the column names accordingly.
res <- merge(as.data.frame(table(df$Month)),
as.data.frame(table(df1$Month)), by='Var1')
colnames(res) <- c('Month', 'Number_Month_df', 'Number_Month_df1')
res <- data.frame(Number_Month_df=sort(table(df$Month),T),
Number_Month_df1=sort(table(df1$Month),T))
res$Month <- rownames(res)

Resources