I have a data frame as follows :
id <- c(1, 2, 3, 4, 5)
week1 <- c(234,567456, 134123, 13412421, 2345245)
week2 <- c(4234,5123456, 454123, 12342421, 8394545)
week3 <- c(1234, 234124, 12348, 9348522, 134534)
data <- data.frame(id, week1, week2, week3)
I would like to find the percent change between week1 and week2, and then week2 and week3 etc (my dataframe is much larger with about 27 columns).
I tried:
data$change1 <- (data$week2-data$week1)*100/data$week1
However this would be extensive with a larger dataset.
Try the following:
library(tidyverse)
df <- gather(df, key='week', value='value', -id)
df$week <- as.integer(as.character((gsub('week', '', df$week))))
df %>% group_by(id) %>% arrange(week) %>% mutate(perc_change = (value-lag(value,1))/lag(value,1)*100)
# A tibble: 15 x 4
# Groups: id [5]
id week value perc_change
<dbl> <int> <dbl> <dbl>
1 1 1 234 NA
2 2 1 567456 NA
3 3 1 134123 NA
4 4 1 13412421 NA
5 5 1 2345245 NA
6 1 2 4234 1709.
7 2 2 5123456 803.
8 3 2 454123 239.
9 4 2 12342421 -7.98
10 5 2 8394545 258.
11 1 3 1234 -70.9
12 2 3 234124 -95.4
13 3 3 12348 -97.3
14 4 3 9348522 -24.3
15 5 3 134534 -98.4
This works reasonably well, but assumes that there is an observation every week, or else your percent change will be based on the last available week (so, if week 3 is missing, the value for week 4 will be a week on week change with week 2 as basis).
(Edit: replaced substr with gsub)
Sense checking:
For row 6, you see id 1. This is week 2 with a value of 4234. In week 1, id 1 had a value of 234. The difference is
(4234-234)/234
[1] 17.09402
So, that is aligned.
Related
If I had:
person_ID visit date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert the 'visit_date' to Date class, grouped by 'person_ID', create a binary column that returns 1 if the difference between the current and next visit_date is less than 90 or else 0, using this column, get the correponding next visit_date' where the value is 1
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
mutate(visit_date = mdy(visit_date)) %>%
group_by(person_ID) %>%
mutate(i1 = replace_na(+(difftime(lead(visit_date),
visit_date, units = 'day') < 90), 0),
date = case_when(as.logical(i1)~ lead(visit_date)), i1 = NULL ) %>%
ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector Bcontaining all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1,2,3,NA, 5, NA, 1,2,NA,4,5,6), var1 = c(1,1,1,NA,1,NA,1,1,NA,1,1,1)
Note that Brepresents a dataframe with multiple variables and the NAs in Expected are rows of NAs in the dataframe. Also the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If in the output you need Signal as NA for missing combination you can use
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
ungroup %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1
I have a tibble containing time series of various blood parameters like CRP over the course of several days. The tibble is tidy, with each time series in one column, as well as a column for the day of measurement. The tibble contains another column with a day of infection. I want to replace each blood parameter with NA if the Day variable is greater-equal than the InfectionDay. Since I have a lot of variables, I'd like to have a function which accepts the column name dynamically and creates a new column name by appending "_censored" to the old one. I've tried the following:
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !!colname, NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
Running this, I expected
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 3
2 2 3 2 2
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
but I get
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 CRP
2 2 3 2 CRP
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
You can add sym() to the column name in mutate to convert to symbol before evaluating
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !! sym(colname), NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
We can select columns on which we want to apply the function (cols) and use mutate_at which will also automatically rename the columns. Added an extra column in the data to show renaming.
library(dplyr)
cols <- c("CRP", "CRP1")
data %>%
mutate_at(cols, list(censored = ~replace(., Day >= InfectionDay, NA)))
# A tibble: 5 x 6
# Day InfectionDay CRP CRP1 CRP_censored CRP1_censored
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 3 3 3 3 3
#2 2 3 2 2 2 2
#3 3 3 5 5 NA NA
#4 4 3 4 4 NA NA
#5 5 3 1 1 NA NA
data
data <- tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1), CRP1 = c(3,2,5,4,1))
I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as.tibble(data.frame(id=rep(1:3,each=2),
item_code=c("A","A","B","B","B","Z"),
score=rep(1,6)))
additional_rows <- as.tibble(data.frame(item_code=c("A","Z"),
score=c(6,6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>% group_by(id) %>% do(add_row(additional_rows %>%
filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 6 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
df %>%
join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
join(additional_rows) %>%
subset(!duplicated(.)), type = "full") %>%
arrange(id, item_code, -score)
Not sure if its the best way, but it works
Edit: to get the score in the same order added the other arrange terms
Edit 2: alright, there should now be no duplicated rows added from the additional rows as per your comment
I've tried creating a variable that represents the lagged version of another variable relative to the whole change of the variable within the group.
Let's use this example dataframe:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I've tried the following, but while lagging the variable works, I am not able to create a new variable that shows the lag of the variable relative to the whole change for the player:
result <-
+ game_data %>%
+ group_by(player) %>%
+ mutate(
+ lag_score = score - dplyr::lag(score, n=1, default = NA),
+ lag_score_relative = lag_score/sum(lag_score))
result
# A tibble: 9 x 5
# Groups: player [3]
player level score lag_score lag_score_relative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 NA NA
2 1 2 150 150 NA
3 1 3 170 20 NA
4 2 1 80 NA NA
5 2 2 100 20 NA
6 2 3 110 10 NA
7 3 1 75 NA NA
8 3 2 100 25 NA
9 3 3 0 -100 NA
For example, for player 1 it should be in
Level 1: NA/170 = NA
Level 2: 150/170
Level 3: 20/170
Thanks in advance, I hope anyone can help.
If you sum the lagged scores you include an NA. The sum then returns NA. You divide by NA which in the end returns NA for every value. To avoid this just set the na.rm argument to TRUE in your call of sum and NAs do not get included in the sum:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3),
score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data %>%
group_by(player) %>%
mutate(
lag_score = score - dplyr::lag(score, n=1, default = NA),
lag_score_relative = lag_score/sum(lag_score, na.rm = TRUE))