How to modify a column based on a condition in a time series? - r

I have data on animal territories by month (1 = January etc.) for multiple individuals:
year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2
I want to add a column that has a 1 if two consecutive months exceed some value e.g. 10. One wrinkle is that my data can run over one year for a single id.
year month terr_size id new_col
2018 1 20 1 1
2018 2 30 1 1
2019 1 5 1 0
2019 2 10 1 0
2018 3 20 2 0
2018 5 25 2 1
2018 6 20 2 1
2018 7 20 2 1
2019 1 10 2 0
2019 2 5 2 0
2019 3 20 2 1
2019 4 30 2 1

This can be expressed compactly using a single left join in a single SQL statement.
Using the input shown in the Note at the end, left join DF to itself so that each row is matched with the rows for the same id in the previous and the following month. Set new_col to 1 if the row and at least one of its matched rows both have terr_size greater than or equal to 10. If there is no matched row, coalesce sets new_col to 0.
library(sqldf)
sqldf("
select a.*,
coalesce(max(a.terr_size >= 10 and b.terr_size >= 10), 0)
new_col
from DF a
left join DF b on
a.id = b.id and
(12 * b.year + b.month = 12 * a.year + a.month + 1 or
12 * b.year + b.month = 12 * a.year + a.month - 1)
group by a.rowid")
giving:
year month terr_size id new_col
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 0
6 2018 5 25 2 1
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 1
12 2019 4 30 2 1
Note
The input and output shown in the question are not consistent so to be clear we assumed this:
Lines <- "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2 "
DF <- read.table(text = Lines, header = TRUE)
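For readers without sqldf, the same month-index idea (12 * year + month) can be sketched in base R. This is only an illustration, not part of the answer above: the names threshold, idx, big_keys, has_prev and has_next are made up here, and it assumes each id has at most one row per month.

```r
DF <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2", header = TRUE)

threshold <- 10                      # assumed cut-off, same as in the SQL
idx <- 12 * DF$year + DF$month       # running month number across years
big <- DF$terr_size >= threshold     # rows at or above the threshold

# "id month-number" keys of all rows at or above the threshold
big_keys <- paste(DF$id, idx)[big]

# does the same id have such a row one month earlier / one month later?
has_prev <- paste(DF$id, idx - 1) %in% big_keys
has_next <- paste(DF$id, idx + 1) %in% big_keys

DF$new_col <- as.integer(big & (has_prev | has_next))
DF
```

Because the month number runs continuously across years, December and the following January come out as neighbours without any special case.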

Your data:
df <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 2 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2 ", header = TRUE)
The idea is to create a date variable first.
Then you create two copies of your data by changing the dates one month ahead and one month back.
R is efficient memory-wise for this kind of operation, so you won't have a problem.
You will just take the space for one additional column. It doesn't actually replicate the whole dataframe.
Then you can join the new columns to the original dataframe.
You then apply the condition you needed.
I created a magic_number variable for that.
At the end, I selected only the original columns plus the one you needed.
library(dplyr)
library(lubridate)
# the threshold number
magic_number <- 10
# create date variable
df <- df %>% mutate(date = make_date(year, month))
# [p]revious month
dfp <- df %>% transmute(id, date = date - months(1), terr_size_p = terr_size)
# [n]ext month
dfn <- df %>% transmute(id, date = date + months(1), terr_size_n = terr_size)
# join by id and date
df <- df %>%
left_join(dfp, by = c("id", "date")) %>%
left_join(dfn, by = c("id", "date"))
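# for new_col to be 1, terr_size must be over the threshold, and so must the previous or the next month
# (use | row by row rather than any(), which collapses the whole column to one value;
# a missing neighbouring month counts as not over the threshold)
df <- df %>%
mutate(new_col = as.numeric(terr_size > magic_number &
(coalesce(terr_size_p > magic_number, FALSE) | coalesce(terr_size_n > magic_number, FALSE))))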
# remove variables if there is no more use for them
df <- df %>% select(-terr_size_p, -terr_size_n, -date)
df
Result:
year month terr_size id new_col
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 1
6 2018 2 25 2 1
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 1
12 2019 4 30 2 1
(The result is not exactly the same because your initial data and expected results do not correspond at row 5.)
This solution handles the December-January issue we talked about in the comments.
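One thing to watch for when combining the two neighbour comparisons inside mutate(): any() reduces everything it is given to a single TRUE/FALSE, whereas | compares element by element. A small sketch with made-up vectors p and n (standing in for "previous month over threshold" and "next month over threshold", with NA where there is no neighbouring row):

```r
p <- c(NA, TRUE, FALSE)   # hypothetical: previous month over the threshold?
n <- c(FALSE, NA, FALSE)  # hypothetical: next month over the threshold?

collapsed   <- any(p, n)                     # one value for the whole column
elementwise <- p | n                         # one value per row, NA can remain
na_as_false <- (p %in% TRUE) | (n %in% TRUE) # treat a missing neighbour as FALSE

collapsed
elementwise
na_as_false
```

So for a per-row flag the elementwise form is the one that matches the stated rule, and it is worth deciding explicitly what a missing neighbouring month should mean.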

I'm not exactly sure what the rule is, because your output doesn't follow the rule you describe (e.g. lines 1 and 5 have no other month for comparison yet you put a 1, line 6 is separated by 2 months, and you put a 1 in line 11 whereas line 12 was <10).
I assumed the most complicated scenario, so you can remove the extra conditions you don't need:
Put a 1 if the territory size stayed >10 for two consecutive months including the current one (or if it is the first recorded month and it's >10), for each individual.
df <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2", header = TRUE)
Using dplyr and lag:
library(dplyr)
df %>% arrange(id, year,month) %>%
dplyr::mutate(newcol=case_when(is.na(lag(month))==TRUE & terr_size>10~1,
lag(id)!=id & terr_size>10~1,
id==lag(id) & year-lag(year)==0 & month-lag(month)==1 & terr_size>10 & lag(terr_size)>10~1,
id==lag(id) & year-lag(year)==1 & lag(month)-month==11 & terr_size>10 & lag(terr_size)>10~1,
TRUE~0))
output:
year month terr_size id newcol
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 1
6 2018 5 25 2 0
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 0
12 2019 4 30 2 1

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this dataframe (or create a new dataframe), which retains the same column names, but repeats each value 12 times (corresponding to the month of that year) and repeat the yearly value 12 times in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the kronecker product with a vector of 12 ones and then convert back to a data.frame. The as.data.frame can be omitted if a matrix result is ok.
as.data.frame(as.matrix(df) %x% rep(1, 12))
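As far as I can tell the Kronecker product does not carry the column names through, so if keeping year, xxx and yyy matters, an alternative sketch is to repeat the row indices instead. Plain base R; df2 is just the name used in the question's mock-up.

```r
year <- c(2016, 2017, 2018)
xxx <- c(1, 2, 3)
yyy <- c(4, 5, 6)
df <- data.frame(year, xxx, yyy)

# repeat each row index 12 times, then subset the rows with it
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]
rownames(df2) <- NULL   # tidy up row names such as "1.1", "1.2", ...
head(df2, 13)
```

Row indexing keeps the column names and column types as they were in df.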

Count the occurrences of accidents until the next accident

I have the following data frame and I would like to create the "OUTPUT_COLUMN".
Explanation of columns:
ID is the identification number of the policy
ID_REG_YEAR is the identification number per Registration Year
CALENDAR_YEAR is the year in which the policy has exposure
NUMBER_OF_RENEWALS is the number of times the policy has been renewed
ACCIDENT indicates whether an accident occurred
KEY TO THE DATASET: ID_REG_YEAR and CALENDAR_YEAR
Basically, if column NUMBER_OF_RENEWALS = 0 then OUTPUT_COLUMN = 100. Any row with no earlier accident should also contain 100 (e.g. rows 13, 16, 17). If an accident occurred, I would like to count the number of renewals until the next accident.
ID ID_REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN
1 A A_2015 2015 0 YES 100
2 A A_2015 2016 0 YES 100
3 A A_2016 2016 1 YES 0
4 A A_2016 2017 1 YES 0
5 A A_2017 2017 2 NO 1
6 A A_2017 2018 2 NO 1
7 A A_2018 2018 3 NO 2
8 A A_2018 2019 3 NO 2
9 A A_2019 2019 4 YES 0
10 A A_2019 2020 4 YES 0
11 B B_2015 2015 0 NO 100
12 B B_2015 2016 0 NO 100
13 B B_2016 2016 1 NO 100
14 C C_2013 2013 0 NO 100
15 C C_2013 2014 0 NO 100
16 C C_2014 2014 1 NO 100
17 C C_2014 2015 1 NO 100
18 C C_2015 2015 2 YES 0
19 C C_2015 2016 2 YES 0
20 C C_2016 2016 3 NO 1
21 C C_2016 2017 3 NO 1
22 C C_2017 2017 4 NO 2
23 C C_2017 2018 4 NO 2
24 C C_2018 2018 5 YES 0
25 C C_2018 2019 5 YES 0
26 C C_2019 2019 6 NO 1
27 C C_2019 2020 6 NO 1
28 C C_2020 2020 7 NO 2
Here is a dplyr solution. First, obtain a separate column for the registration year, which will be used to calculate renewals since prior accident (assumes this is years since last accident). Then, create a column to contain the year of the last accident after grouping by ID. Using fill this value will be propagated. The final outcome column will be set as either 100 (if no prior accident, or NUMBER_OF_RENEWALS is zero) vs. the registration year - last accident year.
library(dplyr)
library(tidyr) # separate() and fill() come from tidyr
df %>%
separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
group_by(ID) %>%
mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
fill(LAST_ACCIDENT, .direction = "down") %>%
mutate(OUTPUT_COLUMN_2 = ifelse(
is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0, 100, REG_YEAR - LAST_ACCIDENT
))
Output
ID ID_REG REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN LAST_ACCIDENT OUTPUT_COLUMN_2
<chr> <chr> <int> <int> <int> <chr> <int> <int> <dbl>
1 A A 2015 2015 0 YES 100 2015 100
2 A A 2015 2016 0 YES 100 2015 100
3 A A 2016 2016 1 YES 0 2016 0
4 A A 2016 2017 1 YES 0 2016 0
5 A A 2017 2017 2 NO 1 2016 1
6 A A 2017 2018 2 NO 1 2016 1
7 A A 2018 2018 3 NO 2 2016 2
8 A A 2018 2019 3 NO 2 2016 2
9 A A 2019 2019 4 YES 0 2019 0
10 A A 2019 2020 4 YES 0 2019 0
# … with 18 more rows
Note: If you want to use your policy number (NUMBER_OF_RENEWALS) and not go by the year, you can do something similar. Instead of adding a column with the last accident year, you can include the last accident policy. Then, your output column could reflect the policy number instead of year (to consider the possibility that one or more years could be skipped).
df %>%
separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = T) %>%
group_by(ID) %>%
mutate(LAST_ACCIDENT_POLICY = ifelse(ACCIDENT == "YES", NUMBER_OF_RENEWALS, NA_integer_)) %>%
fill(LAST_ACCIDENT_POLICY, .direction = "down") %>%
mutate(OUTPUT_COLUMN_2 = ifelse(
is.na(LAST_ACCIDENT_POLICY) | NUMBER_OF_RENEWALS == 0, 100, NUMBER_OF_RENEWALS - LAST_ACCIDENT_POLICY
))
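To make the "fill down" step less of a black box, here is a rough base R sketch of the same logic on a small made-up table (one row per registration year, so simpler than the real data). The data frame policies, the helper fill_down and the column name OUTPUT are invented for this illustration only.

```r
# made-up miniature of the policy data
policies <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B"),
  REG_YEAR = c(2015, 2016, 2017, 2018, 2015, 2016),
  NUMBER_OF_RENEWALS = c(0, 1, 2, 3, 0, 1),
  ACCIDENT = c("YES", "YES", "NO", "NO", "NO", "NO")
)

# carry the last non-missing value forward (NA until the first one appears)
fill_down <- function(x) {
  last_seen <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  c(NA, x)[last_seen + 1L]
}

# year of the accident on accident rows, NA elsewhere
acc_year <- ifelse(policies$ACCIDENT == "YES", policies$REG_YEAR, NA)

# fill down within each ID
policies$LAST_ACCIDENT <- ave(acc_year, policies$ID, FUN = fill_down)

# 100 if there is no prior accident or no renewal yet, otherwise years since the accident
policies$OUTPUT <- ifelse(
  is.na(policies$LAST_ACCIDENT) | policies$NUMBER_OF_RENEWALS == 0,
  100,
  policies$REG_YEAR - policies$LAST_ACCIDENT
)
policies
```

Policy B never has an accident, so its LAST_ACCIDENT stays NA and every row gets 100.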

Create a new column with max values using the identifier column within a pipeline

I am trying to clean up some old code and convert over to "tidy". I am trying to create a new column of data within a pipeline that is the maximum age of individual fish. Let's represent the columns of interest as:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
# which looks like this:
fish_1
year fishid agei
1 2012 a 1
2 2012 a 2
3 2015 b 1
4 2015 b 2
5 2015 b 3
6 2013 c 1
7 2013 c 2
8 2013 c 3
9 2013 c 4
10 2012 d 1
11 2012 d 2
12 2015 e 1
13 2015 e 2
14 2015 e 3
What I'm trying to do is create a new column agec that is the maximum age for each individual fish repeated however many number of times is required to fill the rows for each fish.
The desired output would be:
fish_2 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3),
agec = c(2,2,3,3,3,4,4,4,4,2,2,3,3,3))
# Which looks like:
fish_2
year fishid agei agec
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
The way I had done this in the past was to use a plyr::ddply() call to create a new dataframe and then merge with fish like this:
caps = plyr::ddply(fish_1, c('fishid'), plyr::summarize, agec=max(agei))
fish = merge(fish_1, caps, by='fishid')
fish
fishid year agei agec
1 a 2012 1 2
2 a 2012 2 2
3 b 2015 1 3
4 b 2015 2 3
5 b 2015 3 3
6 c 2013 1 4
7 c 2013 2 4
8 c 2013 3 4
9 c 2013 4 4
10 d 2012 1 2
11 d 2012 2 2
12 e 2015 1 3
13 e 2015 2 3
14 e 2015 3 3
I'm hoping someone can help me achieve this data structure concisely within a pipeline. All of the similar questions I have found have been very verbose and not specific to this issue. I am new to using tidyverse but I'm having trouble getting the group_by() function (to replace the ddply() call) within a pipe, and I'm hoping there is a simpler way.
UPDATE
For those interested, it appears both answers below are correct. The reason I struggled was that I was already doing other data manipulations within my pipeline, and I tried to create the agec column within an earlier call to dplyr::mutate(). You can refer to my comment on @Thomas's answer to see the error of my ways. Hope this helps.
Try dplyr instead of plyr
library(dplyr)
fish_1 %>%
group_by(fishid) %>%
mutate(agec = max(agei))
You can use group_by from dplyr to group your fish IDs and then simply call mutate (dplyr as well) with max:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
fish_1 %>%
group_by(fishid) %>%
mutate(agec = max(agei))
# A tibble: 14 x 4
# Groups: fishid [5]
year fishid agei agec
<dbl> <chr> <dbl> <dbl>
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
An option with data.table
library(data.table)
setDT(fish_1)[, agec := max(agei, na.rm = TRUE), fishid]
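If neither dplyr nor data.table is at hand, the same per-group maximum can be sketched in base R with ave(), which returns a vector as long as its input with the group result repeated for every row of the group:

```r
fish_1 <- data.frame(
  year = c(2012, 2012, 2015, 2015, 2015, 2013, 2013, 2013, 2013, 2012, 2012, 2015, 2015, 2015),
  fishid = c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e'),
  agei = c(1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3)
)

# maximum of agei within each fishid, repeated on every row of that fish
fish_1$agec <- ave(fish_1$agei, fish_1$fishid, FUN = max)
fish_1
```

Unlike merge(), this keeps the original row and column order.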

Counting the distinct values for each day and group and inserting the value in an array in R

I want to transform the data below to give me an association array with the count of each unique id in each group for each day. So, for example, from the data below
Year Month Day Group ID
2014 04 26 1 A
2014 04 26 1 B
2014 04 26 2 B
2014 04 26 2 C
2014 05 12 1 B
2014 05 12 2 E
2014 05 12 2 F
2014 05 12 2 G
2014 05 12 3 G
2014 05 12 3 F
2015 05 19 1 F
2015 05 19 1 D
2015 05 19 2 E
2015 05 19 2 G
2015 05 19 2 D
2015 05 19 3 A
2015 05 19 3 E
2015 05 19 3 B
I want to make an array that gives:
[1] (04/26/2014)
Grp 1 2 3
1 0 1 0
2 1 0 0
3 0 0 0
[2] (05/12/2014)
Grp 1 2 3
1 0 0 1
2 0 0 2
3 1 2 0
[3] (05/19/2015)
Grp 1 2 3
1 0 1 0
2 1 0 1
3 0 1 0
The 'Grp' label is just there to indicate the group number. I know how to count the distinct values in the table overall, but I'm trying to use for loops to also fill in the appropriate count for each day: for example, counting the unique IDs that are present in both group 1 and group 2 on 04/26/2014 and inserting that number in the group 1 / group 2 cell of the association matrix for that day. Any help would be appreciated.
I don't quite understand how you get the second one, but you can try this
dd <- read.table(header = TRUE, text = "Year Month Day Group ID
2014 04 26 1 A
2014 04 26 1 B
2014 04 26 2 B
2014 04 26 2 C
2014 05 12 1 B
2014 05 12 2 E
2014 05 12 2 F
2014 05 12 2 G
2014 05 12 3 G
2014 05 12 3 F
2015 05 19 1 F
2015 05 19 1 D
2015 05 19 2 E
2015 05 19 2 G
2015 05 19 2 D
2015 05 19 3 A
2015 05 19 3 E
2015 05 19 3 B")
dd <- within(dd, {
date <- as.Date(apply(dd[, 1:3], 1, paste0, collapse = '-'))
Group <- factor(Group)
Year <- Month <- Day <- NULL
})
Eg, for the first one
sp <- split(dd, dd$date)[[1]]
tbl <- table(sp$ID, sp$Group)
`diag<-`(crossprod(tbl), 0)
# 1 2 3
# 1 0 1 0
# 2 1 0 0
# 3 0 0 0
And do them all at once
lapply(split(dd, dd$date), function(x) {
cp <- crossprod(table(x$ID, x$Group))
diag(cp) <- 0
cp
})
# $`2014-04-26`
#
# 1 2 3
# 1 0 1 0
# 2 1 0 0
# 3 0 0 0
#
# $`2014-05-12`
#
# 1 2 3
# 1 0 0 0
# 2 0 0 2
# 3 0 2 0
#
# $`2015-05-19`
#
# 1 2 3
# 1 0 1 0
# 2 1 0 1
# 3 0 1 0
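Why crossprod() gives the shared counts: table(ID, Group) is a 0/1 membership matrix with one row per ID and one column per group, so t(tbl) %*% tbl multiplies each pair of group columns and sums, which counts the IDs present in both. A small sketch for the first day only, with the rows typed in by hand (the names day and shared are made up here):

```r
# the four rows for 2014-04-26, with Group as a factor so that group 3 still gets a column
day <- data.frame(
  Group = factor(c(1, 1, 2, 2), levels = 1:3),
  ID = c("A", "B", "B", "C")
)

tbl <- table(day$ID, day$Group)  # rows = IDs, columns = groups, entries 0/1
shared <- crossprod(tbl)         # t(tbl) %*% tbl: IDs shared by each pair of groups
diag(shared) <- 0                # a group compared with itself is not of interest
shared
```

Only B appears in both group 1 and group 2, hence the single 1 in those two off-diagonal cells. This assumes each ID occurs at most once per group per day; repeated rows would inflate the counts.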
A possible solution with dplyr and tidyr will be as follows:
library(dplyr)
library(tidyr)
df$date <- as.Date(paste(df$Year, df$Month, df$Day, sep = '-'))
df %>%
expand(date, Group) %>%
left_join(., df) %>%
group_by(date, Group) %>%
summarise(nID = n_distinct(ID)) %>%
split(., .$date)
Resulting output:
$`2014-04-26`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2014-04-26 1 2
2 2014-04-26 2 2
3 2014-04-26 3 1
$`2014-05-12`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2014-05-12 1 1
2 2014-05-12 2 3
3 2014-05-12 3 2
$`2015-05-19`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2015-05-19 1 2
2 2015-05-19 2 3
3 2015-05-19 3 3

Create a new variable to epidemiological week

I have a data frame with a week column and a year column (87 weeks in total). I need to create a new column (weekseq) with a number that identifies each week sequentially from first to last. I don't know how to do this. Can someone help me?
Example:
id week month year yearweek weekseq
1 1 1 2014 2014/1
1 1 1 2013 2013/1
1 2 1 2014 2014/2
1 2 1 2013 2013/2
1 3 1 2014 2014/3
1 3 1 2013 2013/3
1 4 1 2014 2014/4
1 4 1 2013 2013/4
1 5 1 2014 2014/5
1 5 1 2013 2013/5
1 6 2 2014 2014/6
1 6 2 2013 2013/6
1 7 2 2014 2014/7
1 7 2 2013 2013/7
1 8 2 2014 2014/8
1 8 2 2013 2013/8
1 9 2 2014 2014/9
1 9 2 2013 2013/9
1 10 3 2014 2014/10
1 10 3 2013 2013/10
1 11 3 2014 2014/11
1 11 3 2013 2013/11
1 12 3 2014 2014/12
1 12 3 2013 2013/12
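This solution requires the 'dplyr' and 'plyr' packages:
library(plyr) # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
# Coerce into a tibble (tbl_df() is deprecated in current dplyr; as_tibble() replaces it)
datatbl <- as_tibble(data)
# Arrange, giving more weight to year than week
datatbl <- arrange(datatbl, year, month, week)
# Create a new column that numbers the arranged rows sequentially within each id
seqtbl <- ddply(datatbl, .(id), transform, sequence = seq_along(id))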
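A base R sketch of the same idea, assuming the goal is a running index over the distinct year/week pairs (so rows that share a year and week get the same weekseq). The small data frame weeks and the helper key are made up for illustration.

```r
# made-up excerpt: three weeks in each of two years, in mixed order
weeks <- data.frame(
  id = 1,
  week = c(1, 1, 2, 2, 3, 3),
  year = c(2014, 2013, 2014, 2013, 2014, 2013)
)

# a sortable key: year first, then week (week numbers never exceed 53)
key <- weeks$year * 100 + weeks$week

# position of each row's key among the sorted distinct keys
weeks$weekseq <- match(key, sort(unique(key)))
weeks
```

Here 2013 week 1 gets 1 and 2014 week 1 gets 4. With real data, weeks that are absent from the table are skipped rather than counted.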
