I have a data frame as follows :
id <- c(1, 2, 3, 4, 5)
week1 <- c(234,567456, 134123, 13412421, 2345245)
week2 <- c(4234,5123456, 454123, 12342421, 8394545)
week3 <- c(1234, 234124, 12348, 9348522, 134534)
data <- data.frame(id, week1, week2, week3)
I would like to find the percent change between week1 and week2, and then week2 and week3 etc (my dataframe is much larger with about 27 columns).
I tried:
data$change1 <- (data$week2-data$week1)*100/data$week1
However, this would be tedious to repeat for every pair of columns in a larger dataset.
Try the following:
library(tidyverse)
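# Reshape the question's data frame (named `data`) from wide to long
df <- gather(data, key = 'week', value = 'value', -id)
# Strip the 'week' prefix so the week number can be sorted numerically
df$week <- as.integer(gsub('week', '', df$week))
df %>%
  group_by(id) %>%
  arrange(week) %>%
  mutate(perc_change = (value - lag(value, 1)) / lag(value, 1) * 100)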
# A tibble: 15 x 4
# Groups: id [5]
id week value perc_change
<dbl> <int> <dbl> <dbl>
1 1 1 234 NA
2 2 1 567456 NA
3 3 1 134123 NA
4 4 1 13412421 NA
5 5 1 2345245 NA
6 1 2 4234 1709.
7 2 2 5123456 803.
8 3 2 454123 239.
9 4 2 12342421 -7.98
10 5 2 8394545 258.
11 1 3 1234 -70.9
12 2 3 234124 -95.4
13 3 3 12348 -97.3
14 4 3 9348522 -24.3
15 5 3 134534 -98.4
This works reasonably well, but assumes that there is an observation every week, or else your percent change will be based on the last available week (so, if week 3 is missing, the value for week 4 will be a week on week change with week 2 as basis).
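One way around that caveat is to make the gaps explicit before calling lag(). The following is only a sketch: it assumes tidyr's complete() and full_seq() are available, and it uses a small made-up data set (`long`, not from the question) in which id 1 has no week 3.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: id 1 is missing week 3
long <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2),
  week  = c(1, 2, 4, 1, 2, 3, 4),
  value = c(100, 150, 300, 10, 20, 30, 60)
)

result <- long %>%
  # Add a row with value = NA for every id/week combination that is absent
  complete(id, week = full_seq(week, 1)) %>%
  group_by(id) %>%
  arrange(week, .by_group = TRUE) %>%
  # lag() now always refers to the previous calendar week
  mutate(perc_change = (value - lag(value)) / lag(value) * 100) %>%
  ungroup()

result
```

With the gap filled in, week 4 for id 1 gets NA instead of a change computed against week 2.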
(Edit: replaced substr with gsub)
Sense checking:
For row 6, you see id 1. This is week 2 with a value of 4234. In week 1, id 1 had a value of 234. The relative change is
(4234-234)/234
[1] 17.09402
That is a ratio of about 17.09, i.e. roughly 1709%, which matches the perc_change shown for row 6.
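If the result is wanted in the original wide layout (change1, change2, ...), the same calculation can also be done without reshaping. This is a different technique from the one above, sketched in base R under the assumption that every column after id is a week column and that the columns are already in chronological order.

```r
id    <- c(1, 2, 3, 4, 5)
week1 <- c(234, 567456, 134123, 13412421, 2345245)
week2 <- c(4234, 5123456, 454123, 12342421, 8394545)
week3 <- c(1234, 234124, 12348, 9348522, 134534)
data  <- data.frame(id, week1, week2, week3)

# Matrix of week values only (assumes every column except id is a week)
weeks <- as.matrix(data[, -1])

# Each week (columns 2..k) compared with the week before it (columns 1..k-1)
previous <- weeks[, -ncol(weeks), drop = FALSE]
current  <- weeks[, -1, drop = FALSE]
change   <- (current - previous) / previous * 100
colnames(change) <- paste0("change", seq_len(ncol(change)))

result <- cbind(data, change)
result
```

Because the subtraction is vectorised over whole columns, this scales to 27 week columns without any extra code.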
If I had:
person_ID visit_date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit_date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to Date class and group by 'person_ID'. Within each group we create a binary column that is 1 if the difference between the next and the current visit_date is less than 90 days, and 0 otherwise (including the last visit, which has no next one). Using this column, we keep the corresponding next 'visit_date' where the value is 1.
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
mutate(visit_date = mdy(visit_date)) %>%
group_by(person_ID) %>%
mutate(i1 = replace_na(+(difftime(lead(visit_date),
visit_date, units = 'day') < 90), 0),
date = case_when(as.logical(i1)~ lead(visit_date)), i1 = NULL ) %>%
ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
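The same result can be reached without the helper column. This is only a sketch of a variant, assuming the column names from the question (person_ID, visit_date) and using as.Date() in place of lubridate; `next_visit` is a temporary name introduced here for illustration.

```r
library(dplyr)

df1 <- data.frame(
  person_ID  = c(1, 1, 1, 2, 3, 3, 3),
  visit_date = c("2/25/2001", "2/27/2001", "4/2/2001", "3/18/2004",
                 "9/22/2004", "10/27/2004", "5/15/2008")
)

result <- df1 %>%
  mutate(visit_date = as.Date(visit_date, format = "%m/%d/%Y")) %>%
  group_by(person_ID) %>%
  mutate(
    next_visit = lead(visit_date),
    # Keep the next visit only if it falls within 90 days; when there is
    # no next visit the condition is NA, so the result is NA as well
    date = if_else(as.numeric(next_visit - visit_date) < 90,
                   next_visit, as.Date(NA)),
    next_visit = NULL
  ) %>%
  ungroup()

result
```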
I am trying to use dplyr::lag to determine the number of days that have passed for each event since the initial event but I am getting unexpected behavior.
Example, very simple data:
df <- data.frame(id = c("1", "1", "1", "1", "2", "2"),
date= c("4/1/2020", "4/2/2020", "4/3/2020", "4/4/2020", "4/17/2020", "4/18/2020"))
df$date <- as.Date(df$date, format = "%m/%d/%Y")
id date
1 1 4/1/2020
2 1 4/2/2020
3 1 4/3/2020
4 1 4/4/2020
5 2 4/17/2020
6 2 4/18/2020
What I was hoping to do was create a new column, days_since_first_event, holding the number of days between the initial event for each id and each subsequent date. This is the code I tried, followed by the output I expected:
df <- df %>%
group_by(id) %>%
mutate(days_since_first_event = as.numeric(date - lag(date)))
id date days_since_first_event
1 1 4/1/2020 0
2 1 4/2/2020 1
3 1 4/3/2020 2
4 1 4/4/2020 3
5 2 4/17/2020 0
6 2 4/18/2020 1
But instead I get this output
# A tibble: 6 x 3
# Groups: id [2]
id date days_since_first_event
<chr> <date> <dbl>
1 1 2020-04-01 NA
2 1 2020-04-02 1
3 1 2020-04-03 1
4 1 2020-04-04 1
5 2 2020-04-17 NA
6 2 2020-04-18 1
Any suggestions on what I'm doing wrong?
The first n values of lag() within each group get a default value, because there is no 'older' data for them. The default value is NA, hence the NA in your results.
Furthermore, date - lag(date) only yields the difference between consecutive events, not the difference from the first event, which is why every other row shows 1.
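To measure from the first event instead, subtract the earliest date in each group rather than the previous row. A minimal sketch using the data from the question, assuming the earliest date per id is the initial event:

```r
library(dplyr)

df <- data.frame(
  id   = c("1", "1", "1", "1", "2", "2"),
  date = c("4/1/2020", "4/2/2020", "4/3/2020", "4/4/2020",
           "4/17/2020", "4/18/2020")
)
df$date <- as.Date(df$date, format = "%m/%d/%Y")

result <- df %>%
  group_by(id) %>%
  # Date - Date gives a difftime in days; min(date) is the first event per id
  mutate(days_since_first_event = as.numeric(date - min(date))) %>%
  ungroup()

result
```

If the rows are guaranteed to be sorted by date within id, first(date) could be used in place of min(date).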
I have a dataframe that contains id (with duplicates), date (with duplicates) and value. The values are recorded on consecutive days. What I want is to group the dataframe by id and by blocks of n consecutive days, and find the mean of the values in each block, returning NA if the last block does not contain n days.
id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2
. . .
. . .
. . .
20 2012-2-6 10
Desired output with n = 3 consecutive days:
id date value group_n_consecutive_days mean_n_consecutive_days
1 2016-10-5 2 1 2
1 2016-10-6 3 1 2
1 2016-10-7 1 1 2
1 2016-10-8 2 2 NA
1 2016-10-9 5 2 NA
2 2013-10-6 2 1 4
.
.
.
.
20 2012-2-6 10 6 25
The data in the question is sorted and consecutive within id, so we assume that this is always the case. Also, where the question refers to duplicate dates, we assume this means that different id values can have the same date, but that within an id the dates are unique and consecutive. Now, using the data shown reproducibly in Note 2 at the end, group by id and compute the group numbers using gl. Then, grouping by id and group_no, take the mean of each group of 3, or NA for smaller groups.
library(dplyr)
DF %>%
group_by(id) %>%
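mutate(group_no = as.integer(gl(n(), 3, n()))) %>%  # as.integer(): c() keeps factors as factors in R >= 4.1
group_by(group_no, .add = TRUE) %>%  # `.add` replaces the deprecated `add` in dplyr >= 1.0.0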
mutate(mean = if (n() == 3) mean(value) else NA) %>%
ungroup
giving:
# A tibble: 6 x 5
id date value group_no mean
<int> <date> <int> <int> <dbl>
1 1 2016-10-05 2 1 2
2 1 2016-10-06 3 1 2
3 1 2016-10-07 1 1 2
4 1 2016-10-08 2 2 NA
5 1 2016-10-09 5 2 NA
6 2 2013-10-06 2 1 NA
Note 1
An alternative to gl(...) could be cumsum(rep(1:3, length = n()) == 1), and an alternative to if (n() == 3) mean(value) else NA could be mean(head(c(value, NA, NA), 3)).
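A quick base R check that these alternatives agree, using the five values for id 1 from the question. The helper names mean_if and mean_head are made up for this illustration, and as.integer() is used to pull the integer codes out of the factor that gl() returns.

```r
value <- c(2, 3, 1, 2, 5)  # the values for id 1 in the question
n <- length(value)

# Two ways to number blocks of 3 consecutive rows: 1 1 1 2 2
group_gl     <- as.integer(gl(n, 3, n))
group_cumsum <- cumsum(rep(1:3, length = n) == 1)

# Two ways to get the mean of a full block and NA for an incomplete one
mean_if   <- function(v) if (length(v) == 3) mean(v) else NA_real_
mean_head <- function(v) mean(head(c(v, NA, NA), 3))

means_if   <- tapply(value, group_gl, mean_if)
means_head <- tapply(value, group_gl, mean_head)

identical(group_gl, group_cumsum)
all.equal(means_if, means_head)
```

The head() version works because padding with two NA values leaves a full block of 3 untouched, while any shorter block picks up at least one NA and so has an NA mean.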
Note 2
The input data in reproducible form was assumed to be:
Lines <- "id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)
Sample data:
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34))
id year month new.employee
1 A 2014 1 4
2 A 2014 2 6
3 A 2015 1 2
4 A 2015 2 6
5 B 2014 1 23
6 B 2014 2 2
7 B 2015 1 5
8 B 2015 2 34
Desired outcome:
desired_df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34),
new.employee.rank=c(1,1,2,2,2,2,1,1))
id year month new.employee new.employee.rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
The ranking rule is: within each year, I use month 2 to rank the number of new employees between A and B (largest first). I then need to assign those ranks to month 1 as well, i.e. the month 1 ranking in each year must equal the month 2 ranking in the same year.
I tried this code to get rankings for each month and each year:
library(data.table)
df1 <- data.table(df1)
df1[,rank:=rank(new.employee), by=c("year","month")]
If anyone can roll the rank value within the column, so that the rank for month 1 is replaced by the rank for month 2, that might be a solution.
You've tried a data.table solution, so here's how I would do this using data.table:
library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
# id year month new.employee new.employee.rank
# 1: A 2014 1 4 1
# 2: A 2014 2 6 1
# 3: A 2015 1 2 2
# 4: A 2015 2 6 2
# 5: B 2014 1 23 2
# 6: B 2014 2 2 2
# 7: B 2015 1 5 1
# 8: B 2015 2 34 1
This is similar in spirit to the dplyr solution: it ranks the ids per year using month 2 only, then joins those ranks back to the original data set on year and id. I'm using data.table V1.9.6+ here.
Here's a dplyr-based solution. The idea is to reduce the data to the parts you want to compare, make the comparison, then join the results back into the original data set, expanding it to fill all of the relevant slots. Note the edits to your code for creating the sample data.
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=rep(c(2014,2014,2015,2015), 2),
month=rep(c(1,2), 4),
new.employee=c(4,6,2,6,23,2,5,34))
library(dplyr)
df1 %>%
# Reduce the data to the slices (months) you want to compare
filter(month==2) %>%
# Group the data by year, so the comparisons are within and not across years
group_by(year) %>%
# Create a variable that indicates the rankings within years in descending order
mutate(rank = rank(-new.employee)) %>%
# To prepare for merging, reduce the new data to just that ranking var plus id and year
select(id, year, rank) %>%
# Use left_join to merge the new data (.) with the original df, expanding the
# new data to fill all rows with id-year matches
left_join(df1, .) %>%
# Order the data by id, year, and month to make it easier to review
arrange(id, year, month)
Output:
Joining by: c("id", "year")
id year month new.employee rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
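The join can also be avoided by looking up each row's id among the month 2 rows of the same year. This is a sketch of a variant rather than the approach above, and it assumes each id appears exactly once in month 2 of every year.

```r
library(dplyr)

df1 <- data.frame(
  id           = rep(c("A", "B"), each = 4),
  year         = rep(c(2014, 2014, 2015, 2015), 2),
  month        = rep(c(1, 2), 4),
  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34)
)

result <- df1 %>%
  group_by(year) %>%
  mutate(
    # Rank the month 2 rows in descending order, then give every row the
    # rank of the month 2 row that has the same id
    new.employee.rank =
      rank(-new.employee[month == 2])[match(id, id[month == 2])]
  ) %>%
  ungroup()

result
```

The row order of df1 is preserved, so no final arrange() is needed.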