Add a column to a data frame according to a rule in R

My dataset is:
unit date total
1 2019-04-02 7
1 2020-01-01 5
2 2019-12-01 10
2 2020-01-03 2
3 2019-09-01 3
3 2020-03-03 3
I would like to add a 'category' column that is 'high' if any value in 'total' is greater than or equal to 10 for a given 'unit', and 'low' otherwise:
unit date total category
1 2019-04-02 7 low
1 2020-01-01 5 low
2 2019-12-01 10 high
2 2020-01-03 2 high
3 2019-09-01 3 low
3 2020-03-03 3 low
I have tried many things such as:
df$category <- "low"
for (i in df$unit) {
  if (rowSums(df$total >= 10) > 0) {
    df$category <- "high"
  }
}
but none worked. Can you please advise?

Compute the max value of 'total' in each group and then assign the category from it. Here is the code:
library(dplyr)
#Code
dfnew <- df %>%
  group_by(unit) %>%
  mutate(category = ifelse(max(total, na.rm = TRUE) >= 10, 'High', 'Low'))
Output:
# A tibble: 6 x 4
# Groups: unit [3]
unit date total category
<int> <chr> <int> <chr>
1 1 2019-04-02 7 Low
2 1 2020-01-01 5 Low
3 2 2019-12-01 10 High
4 2 2020-01-03 2 High
5 3 2019-09-01 3 Low
6 3 2020-03-03 3 Low
Some data used:
#Data
df <- structure(list(unit = c(1L, 1L, 2L, 2L, 3L, 3L), date = c("2019-04-02",
"2020-01-01", "2019-12-01", "2020-01-03", "2019-09-01", "2020-03-03"
), total = c(7L, 5L, 10L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA,
-6L))

Does this work?
> library(dplyr)
> df %>% group_by(unit) %>% mutate(category = case_when(max(total) >= 10 ~ 'high', TRUE ~ 'low'))
# A tibble: 6 x 4
# Groups: unit [3]
unit date total category
<dbl> <dttm> <dbl> <chr>
1 1 2019-04-02 00:00:00.000 7 low
2 1 2020-01-01 00:00:00.000 5 low
3 2 2019-12-01 00:00:00.000 10 high
4 2 2020-01-03 00:00:00.000 2 high
5 3 2019-09-01 00:00:00.000 3 low
6 3 2020-03-03 00:00:00.000 3 low

One base R option using ave, e.g.,
transform(
  df,
  category = c("Low", "High")[ave(total >= 10, unit, FUN = any) + 1]
)
which gives
unit date total category
1 1 2019-04-02 7 Low
2 1 2020-01-01 5 Low
3 2 2019-12-01 10 High
4 2 2020-01-03 2 High
5 3 2019-09-01 3 Low
6 3 2020-03-03 3 Low
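The indexing works because ave(total >= 10, unit, FUN = any) returns a per-row TRUE/FALSE flag for the whole group, and adding 1 turns that into a 1/2 index into c("Low", "High"). A minimal illustration:
c("Low", "High")[c(FALSE, TRUE, TRUE) + 1]
#[1] "Low"  "High" "High"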
Data
> dput(df)
structure(list(unit = c(1L, 1L, 2L, 2L, 3L, 3L), date = c("2019-04-02",
"2020-01-01", "2019-12-01", "2020-01-03", "2019-09-01", "2020-03-03"
), total = c(7L, 5L, 10L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA,
-6L))

For each unit you can check whether any value is greater than or equal to 10 and assign the category value accordingly.
library(dplyr)
df %>%
  group_by(unit) %>%
  mutate(category = if (any(total >= 10)) 'high' else 'low')
# unit date total category
# <int> <chr> <int> <chr>
#1 1 2019-04-02 7 low
#2 1 2020-01-01 5 low
#3 2 2019-12-01 10 high
#4 2 2020-01-03 2 high
#5 3 2019-09-01 3 low
#6 3 2020-03-03 3 low
The same logic can be implemented in base R:
df$category <- with(df, ave(total, unit, FUN = function(x)
                       if (any(x >= 10)) 'high' else 'low'))
and data.table:
library(data.table)
setDT(df)[, category := if(any(total >= 10)) 'high' else 'low', unit]
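For completeness, the loop in the question fails because rowSums() expects a matrix or data frame while df$total >= 10 is a plain logical vector, and because df$category <- "high" assigns to the whole column rather than to the rows of the current unit. A corrected loop would look something like this (a sketch; the grouped one-liners above are still preferable):
df$category <- "low"
for (u in unique(df$unit)) {
  if (any(df$total[df$unit == u] >= 10)) {
    df$category[df$unit == u] <- "high"
  }
}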

Related

Identify unique values within a multivariable subset

I have data that look like these:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site, i.e.:
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
  group_by(Subject, Site) %>%
  mutate(Want = match(Date, unique(Date))) %>%
  ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
                          FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))
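As to why have1 and have2 differ: unlist(tapply(...)) concatenates the per-group results in the order of the INDEX factor combinations, not in the original row order, so swapping the INDEX columns changes the concatenation order (note, too, that the wrapper masks base::rle). To stay in base R while preserving row order, split/unsplit puts each group's result back in place. A sketch, assuming val as in the dput above:
idx <- interaction(val$Subject, val$Site, drop = TRUE)
val$Want <- unsplit(lapply(split(val$Date, idx),
                           function(x) cumsum(!duplicated(x))), idx)
val$Want
#[1] 1 1 2 1 2 2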

Remove rows based on date and time where no update actually occurs but keep the first instance

I've come up against a wall in trying to resolve this and hope somebody can help. I'm trying to implement a way to filter this dataset, which reflects time-stamped bike station occupancy data.
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
2 2 01/04/2020 04:40:07 11
3 2 01/04/2020 04:50:15 10
4 2 01/04/2020 04:57:10 10
5 2 01/04/2020 05:07:19 9
6 2 01/04/2020 05:19:38 10
7 2 01/04/2020 05:29:47 10
8 2 01/04/2020 06:43:54 11
I want to remove the rows where there is no change in Bike.Availability and only keep the first instance.
I would like the resulting dataset to look as follows:
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
2 2 01/04/2020 04:50:15 10
3 2 01/04/2020 05:07:19 9
4 2 01/04/2020 05:19:38 10
5 2 01/04/2020 06:43:54 11
I've converted the timestamp:
bike_data$Time <- as.POSIXct(bike_data$Time,format="%Y-%m-%d %H:%M:%S")
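(Note: that format string doesn't match the timestamps shown above, e.g. "01/04/2020 04:31:16", so the conversion yields NA. Assuming day/month/year ordering, the format would need to be:)
bike_data$Time <- as.POSIXct(bike_data$Time, format = "%d/%m/%Y %H:%M:%S")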
And I've tried different variations of:
library(dplyr)
bike_data %>%
  group_by(Time) %>%
  arrange(Bike.Availability) %>%
  top_n(1)
Any help or feedback would be greatly appreciated.
We group by 'ID' and the run-length id of 'Bike.Availability', i.e. rleid() creates a grouping index based on runs of adjacent equal elements of 'Bike.Availability'; then we take the first row of each group with slice_head(n = 1).
library(dplyr)
library(data.table)
bike_data %>%
  group_by(ID, grp = rleid(Bike.Availability)) %>%
  slice_head(n = 1) %>%
  ungroup %>%
  select(-grp)
-output
# A tibble: 5 x 3
# ID Time Bike.Availability
# <int> <chr> <int>
#1 2 01/04/2020 04:31:16 11
#2 2 01/04/2020 04:50:15 10
#3 2 01/04/2020 05:07:19 9
#4 2 01/04/2020 05:19:38 10
#5 2 01/04/2020 06:43:54 11
Grouping by the 'Time' column would create groups with a single observation each (based on the values shown in 'Time', every timestamp is distinct), therefore top_n(1) returns the original dataset instead of subsetting it.
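A quick check of this, assuming bike_data as in the dput below:
library(dplyr)
bike_data %>% count(Time)  # every timestamp occurs exactly once, so each group has one row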
data
bike_data <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
Time = c("01/04/2020 04:31:16",
"01/04/2020 04:40:07", "01/04/2020 04:50:15", "01/04/2020 04:57:10",
"01/04/2020 05:07:19", "01/04/2020 05:19:38", "01/04/2020 05:29:47",
"01/04/2020 06:43:54"), Bike.Availability = c(11L, 11L, 10L,
10L, 9L, 10L, 10L, 11L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
A solution using dplyr alone. Check with ifelse whether each row equals the row above it, recode the leading NA to 0, and then filter.
library(dplyr)
bike_data %>%
  mutate(same = ifelse(Bike.Availability == lag(Bike.Availability), 1, 0)) %>%
  mutate(same = ifelse(is.na(same), 0, same)) %>%
  filter(same == 0) %>%
  select(-same)
Output:
ID Time Bike.Availability
1 2 01/04/2020 04:31:16 11
3 2 01/04/2020 04:50:15 10
5 2 01/04/2020 05:07:19 9
6 2 01/04/2020 05:19:38 10
8 2 01/04/2020 06:43:54 11
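The lag comparison can also be done directly inside filter, avoiding the helper column; the first row is kept because its lag is NA (a sketch of the same idea):
library(dplyr)
bike_data %>%
  filter(is.na(lag(Bike.Availability)) | Bike.Availability != lag(Bike.Availability))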

Trying to find occurrences of ID that meets sequential conditions in R

I'm trying to return a logical vector based on whether a person meets one set of conditions and ALSO meets another set of conditions later on. I'm using a data frame that looks like so:
Person.Id Year Term
250 1 3
250 1 1
250 2 3
300 1 3
511 2 1
300 1 5
700 2 3
What I want to return is a logical vector indicating whether a person, like ID 250, has year 1 and term 3, AND later has year 2 term 3. So a person that only has year 1 term 3 or year 1 term 5 will return false. Solutions in dplyr preferred! I feel like this is simple and I'm just missing something. I initially tried this code, but all it returned was a blank df:
df2 <- df1 %>%
  group_by(Person.Id) %>%
  filter((year == 1 & term == 3) & (year == 2 & term == 3))
Are you looking for something like this?
require(dplyr)
df %>%
  group_by(Person.Id) %>%
  mutate(count = sum((year == 1 & term == 3) | (year == 2 & term == 3))) %>%
  mutate(count2 = if_else(count == 2, T, F))
# A tibble: 7 x 5
# Groups: Person.Id [4]
Person.Id year term count count2
<int> <int> <int> <int> <lgl>
1 250 1 3 2 TRUE
2 250 1 1 2 TRUE
3 250 2 3 2 TRUE
4 300 1 3 1 FALSE
5 511 2 1 0 FALSE
6 300 1 5 1 FALSE
7 700 2 3 1 FALSE
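A minor simplification of the same logic: if_else(count == 2, T, F) is just count == 2, so the two mutate calls can collapse into one (a sketch):
df %>%
  group_by(Person.Id) %>%
  mutate(count2 = sum((year == 1 & term == 3) | (year == 2 & term == 3)) == 2)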
Maybe this can help:
#Data
Data <- structure(list(Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L,
700L), Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L), Term = c(3L, 1L,
3L, 3L, 1L, 5L, 3L)), row.names = c(NA, -7L), class = "data.frame")
#Flags
cond1 <- Data$Year == 1 & Data$Term == 3
cond2 <- Data$Year == 2 & Data$Term == 3
#Replace
Data$Flag1 <- 0
Data$Flag1[cond1] <- 1
Data$Flag2 <- 0
Data$Flag2[cond2] <- 1
#Filter
library(dplyr)
Data %>% group_by(Person.Id) %>% filter(Flag1 == 1 | Flag2 == 1)
# A tibble: 4 x 5
# Groups: Person.Id [3]
Person.Id Year Term Flag1 Flag2
<int> <int> <int> <dbl> <dbl>
1 250 1 3 1 0
2 250 2 3 0 1
3 300 1 3 1 0
4 700 2 3 0 1
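Regarding the blank data frame from the original attempt: inside filter, (year==1 & term==3) & (year==2 & term==3) is evaluated row by row, and no single row can have Year 1 and Year 2 at once. Wrapping each condition in any() checks them across the whole group instead. A sketch, using the column names from the dput above:
library(dplyr)
Data %>%
  group_by(Person.Id) %>%
  filter(any(Year == 1 & Term == 3) & any(Year == 2 & Term == 3)) %>%
  ungroup()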

How do I create an index variable for unique values of X within a group Y?

I have the following table:
id_question id_event num_events
2015012713 49508 1
2015012711 49708 1
2015011523 41808 3
2015011523 44008 3
2015011523 44108 3
2015011522 41508 3
2015011522 43608 3
2015011522 43708 3
2015011521 39708 1
2015011519 44208 1
The third column gives the count of events by question. I want to create a variable that indexes the events within each question, but only where there are multiple events per question. It would look something like this:
id_question id_event num_events index_event
2015012713 49508 1
2015012711 49708 1
2015011523 41808 3 1
2015011523 44008 3 2
2015011523 44108 3 3
2015011522 41508 3 1
2015011522 43608 3 2
2015011522 43708 3 3
2015011521 39708 1
2015011519 44208 1
How can I do that?
We can use tidyverse to create an 'index_event' after grouping by 'id_question'. If the number of rows is greater than 1 (n() > 1), we take the sequence of rows (row_number()); the default option in case_when is NA.
library(dplyr)
df1 %>%
  group_by(id_question) %>%
  mutate(index_event = case_when(n() > 1 ~ row_number()))
# A tibble: 10 x 4
# Groups: id_question [6]
# id_question id_event num_events index_event
# <int> <int> <int> <int>
# 1 2015012713 49508 1 NA
# 2 2015012711 49708 1 NA
# 3 2015011523 41808 3 1
# 4 2015011523 44008 3 2
# 5 2015011523 44108 3 3
# 6 2015011522 41508 3 1
# 7 2015011522 43608 3 2
# 8 2015011522 43708 3 3
# 9 2015011521 39708 1 NA
#10 2015011519 44208 1 NA
Or with data.table: we use rowid on 'id_question' and blank out the elements where 'num_events' is 1 by multiplying with NA^ (making use of NA^0 = 1 and NA^1 = NA).
library(data.table)
setDT(df1)[, index_event := rowid(id_question) * NA^(num_events == 1)]
Or using base R: another option builds the sequence from the frequency table of 'id_question' and blanks out elements the same way as in the previous case.
df1$index_event <- with(df1, sequence(table(id_question)) * NA^(num_events == 1))
df1$index_event
#[1] NA NA 1 2 3 1 2 3 NA NA
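The NA^ trick works because NA^0 is defined as 1 while NA^1 is NA, so multiplying the row index by NA^(num_events == 1) keeps it where num_events > 1 and turns it into NA otherwise:
NA^c(0, 1)
#[1]  1 NA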
data
df1 <- structure(list(id_question = c(2015012713L, 2015012711L, 2015011523L,
2015011523L, 2015011523L, 2015011522L, 2015011522L, 2015011522L,
2015011521L, 2015011519L), id_event = c(49508L, 49708L, 41808L,
44008L, 44108L, 41508L, 43608L, 43708L, 39708L, 44208L), num_events = c(1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
If num_events is 1 you can return NA; otherwise create a row index within each id_question.
This can be done in base R :
df$index_event <- with(df, ave(num_events == 1, id_question,
                               FUN = function(x) replace(seq_along(x), x, NA)))
df
# id_question id_event num_events index_event
#1 2015012713 49508 1 NA
#2 2015012711 49708 1 NA
#3 2015011523 41808 3 1
#4 2015011523 44008 3 2
#5 2015011523 44108 3 3
#6 2015011522 41508 3 1
#7 2015011522 43608 3 2
#8 2015011522 43708 3 3
#9 2015011521 39708 1 NA
#10 2015011519 44208 1 NA
dplyr:
library(dplyr)
df %>%
  group_by(id_question) %>%
  mutate(index_event = if_else(num_events == 1, NA_integer_, row_number()))
Or data.table:
library(data.table)
setDT(df)
df[, index_event := ifelse(num_events == 1, NA_integer_, seq_len(.N)), id_question]
data
df <- structure(list(id_question = c(2015012713L, 2015012711L, 2015011523L,
2015011523L, 2015011523L, 2015011522L, 2015011522L, 2015011522L,
2015011521L, 2015011519L), id_event = c(49508L, 49708L, 41808L,
44008L, 44108L, 41508L, 43608L, 43708L, 39708L, 44208L), num_events = c(1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L)),class = "data.frame",row.names = c(NA, -10L))

Grouping dates by periods and id using R

I have a list of events by ID and would like to group them in two week periods. The two weeks should start whenever the first event occurs for each ID. The grouped event data should look something like the following,
ID Date Group
<dbl> <date> <dbl>
1 2018-01-01 1
1 2018-01-02 1
1 2018-01-02 1
1 2018-02-01 2
1 2018-03-01 3
2 2018-01-01 4
2 2018-04-01 5
dat = structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), Date = structure(c(17532,
17533, 17533, 17563, 17591, 17532, 17622), class = "Date"), Group = c(1L,
1L, 1L, 2L, 3L, 4L, 5L)), .Names = c("ID", "Date", "Group"), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
I was originally thinking of lagging by ID and filtering for events that happen within a two-week period, but there may be many events that correspond to a single two-week period.
You can use cut and seq to round down to the nearest two-week cutoff, then group_indices to make an increasing index:
dat %>%
  group_by(ID) %>%
  mutate(g = cut(Date, seq(first(Date), max(Date) + 14, by = "2 weeks")) %>% as.character) %>%
  ungroup %>%
  mutate(g = group_indices(., ID, g))
# A tibble: 7 x 4
ID Date Group g
<int> <date> <int> <int>
1 1 2018-01-01 1 1
2 1 2018-01-02 1 1
3 1 2018-01-02 1 1
4 1 2018-02-01 2 2
5 1 2018-03-01 3 3
6 2 2018-01-01 4 4
7 2 2018-04-01 5 5
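As a side note, calling group_indices() with arguments inside a pipe is deprecated in dplyr 1.0+; a similar increasing index can be built with cur_group_id() (a sketch, assuming dplyr >= 1.0):
library(dplyr)
dat %>%
  group_by(ID) %>%
  mutate(win = as.character(cut(Date, seq(first(Date), max(Date) + 14, by = "2 weeks")))) %>%
  group_by(ID, win) %>%
  mutate(g = cur_group_id()) %>%
  ungroup() %>%
  select(-win)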
Get the difference of adjacent 'Date's with difftime, specifying the units as "weeks"; check if the difference is greater than 2 and take the cumulative sum:
dat %>%
  mutate(GroupNew = cumsum(abs(difftime(Date, lag(Date, default = first(Date)),
                                        units = "weeks")) > 2) + 1)
# A tibble: 7 x 4
# ID Date Group GroupNew
# <int> <date> <int> <dbl>
#1 1 2018-01-01 1 1
#2 1 2018-01-02 1 1
#3 1 2018-01-02 1 1
#4 1 2018-02-01 2 2
#5 1 2018-03-01 3 3
#6 2 2018-01-01 4 4
#7 2 2018-04-01 5 5
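Note this computes the lag across the whole frame rather than within each ID; it works here because the jump from the last row of ID 1 back to the first row of ID 2 is itself longer than two weeks (the abs() handles the negative difference). If IDs could overlap in time, the same idea can be applied per ID and then combined into a global index. A sketch, assuming dplyr >= 1.0 for cur_group_id():
library(dplyr)
dat %>%
  group_by(ID) %>%
  mutate(step = cumsum(abs(difftime(Date, lag(Date, default = first(Date)),
                                    units = "weeks")) > 2)) %>%
  group_by(ID, step) %>%
  mutate(GroupNew = cur_group_id()) %>%
  ungroup() %>%
  select(-step)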
