Avoiding the use of for loop for cumsum - r

First generating some sample data:
doy <- rep(1:365,times=2)
year <- rep(2000:2001,each=365)
set.seed(1)
value <-runif(min=0,max=10,365*2)
doy.range <- c(40,50,60,80)
thres <- 200
df <- data.frame(cbind(doy,year,value))
What I want to do is the following:
For df$year == 2000, starting from each doy in doy.range (for example, 40), start adding up
df$value and find the df$doy at which the cumulative sum of df$value first becomes >= thres. The same should be done for every year.
Here's my long for loop to achieve this:
# create a matrix to store results
mat <- matrix(, nrow = length(doy.range)*length(unique(year)),ncol=3)
mat[,1] <- rep(unique(year),each=4)
mat[,2] <- rep(doy.range,times=2)
for(i in unique(df$year)){
dat <- df[df$year== i,]
for(j in doy.range){
dat1 <- dat[dat$doy >= j,]
dat1$cum.sum <-cumsum(dat1$value)
day.thres <- dat1[dat1$cum.sum >= thres,"doy"][1] # gives me the doy of the year where cumsum of df$value becomes >= thres
mat[mat[,2] == j & mat[,1] == i,3] <- day.thres
}
}
This loop fills the third column of my matrix with the doy at which the cumulative sum of value first reaches thres.
However, I really want to avoid the loops. Is there a way to do this with less code?

If I understand correctly you can use dplyr. Assume a threshold of 200:
library(dplyr)
df %>% group_by(year) %>%
filter(doy >= 40) %>%
mutate(CumSum = cumsum(value)) %>%
filter(CumSum >= 200) %>%
top_n(n = -1, wt = CumSum)
which yields
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 78 2000 3.899895 201.4864
2 75 2001 9.205178 204.3171
The verbs used are self-explanatory I guess. If not, let me know.
For different doy create a function and use lapply:
f <- function(doy.range) {
df %>% group_by(year) %>%
filter(doy >= doy.range) %>%
mutate(CumSum = cumsum(value)) %>%
filter(CumSum >= 200) %>%
top_n(n = -1, wt = CumSum)
}
lapply(doy.range, f)
[[1]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 78 2000 3.899895 201.4864
2 75 2001 9.205178 204.3171
[[2]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 89 2000 2.454885 200.2998
2 91 2001 6.578281 200.6544
[[3]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 98 2000 4.100841 200.5048
2 102 2001 7.158333 200.3770
[[4]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 120 2000 6.401010 204.9951
2 120 2001 5.884192 200.8252

The idea is to create a function that based on a given (starting) doy and threshold gets you the relevant info. Then apply this function to different combinations of starting doys and thresholds and get a dataset back with all relevant info:
# create example data
doy <- rep(1:365,times=2)
year <- rep(2000:2001,each=365)
set.seed(1)
value <-runif(min=0,max=10,365*2)
df <- data.frame(doy,year,value)
library(dplyr)
library(purrr)
# function (inputs: dr for doy range and t for threshold)
f = function(dr, t) {
df %>%
filter(doy >= dr) %>% # keep rows at or after the given starting doy
group_by(year) %>% # for each year
mutate(CumSumValue = cumsum(value)) %>% # get the cumulative sum of value
filter(CumSumValue >= t) %>% # keep rows at or above the given threshold
slice(1) %>% # keep the first row
ungroup() %>% # forget the grouping
select(-value) %>% # remove unnecessary variable
mutate(doy_input=dr, thres_input=t) %>% # add the input info as columns
select(doy_input, thres_input, year, doy, CumSumValue) # rearrange columns
}
# input doy and threshold
doy.range <- c(40,50,60,80)
thres <- 200
# map those vectors to the function
map2_df(doy.range, thres, f)
# # A tibble: 8 x 5
# doy_input thres_input year doy CumSumValue
# <dbl> <dbl> <int> <int> <dbl>
# 1 40 200 2000 78 201.4864
# 2 40 200 2001 75 204.3171
# 3 50 200 2000 89 200.2998
# 4 50 200 2001 91 200.6544
# 5 60 200 2000 98 200.5048
# 6 60 200 2001 102 200.3770
# 7 80 200 2000 120 204.9951
# 8 80 200 2001 120 200.8252
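For comparison, the same idea can be sketched in base R without dplyr. This is only a minimal illustration under a few assumptions: the helper name first_doy_at_threshold and the small toy data set are made up for the example, and the rows are assumed to be sorted by doy within each year.

```r
# Hypothetical helper: first doy (at or after `start`) at which the running
# total of `value` reaches `thres`; gives NA if the threshold is never reached.
first_doy_at_threshold <- function(doy, value, start, thres) {
  keep <- doy >= start
  hit <- which(cumsum(value[keep]) >= thres)[1]
  doy[keep][hit]
}

# Small made-up data set so the result can be checked by hand
toy <- data.frame(
  year  = rep(c(2000, 2001), each = 5),
  doy   = rep(1:5, times = 2),
  value = c(1, 2, 3, 4, 5,  5, 4, 3, 2, 1)
)

# One row per combination of starting doy and year
grid <- expand.grid(start = c(1, 3), year = unique(toy$year))
grid$doy_reached <- mapply(function(y, s) {
  d <- toy[toy$year == y, ]
  first_doy_at_threshold(d$doy, d$value, start = s, thres = 6)
}, grid$year, grid$start)
grid
```

With the real data, toy would be replaced by df, the start values by doy.range, and 6 by thres.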

Related

Iterating over listed data frames within a piped purrr anonymous function call

Using purrr::map and the magrittr pipe, I am trying to generate a new column with values equal to a substring of the name of an existing column.
I can illustrate what I'm trying to do with the following toy dataset:
library(tidyverse)
library(purrr)
test <- list(tibble(geoid_1970 = c(123, 456),
name_1970 = c("here", "there"),
pop_1970 = c(1, 2)),
tibble(geoid_1980 = c(234, 567),
name_1980 = c("here", "there"),
pop_1970 = c(3, 4))
)
Within each listed data frame, I want a column equal to the relevant year. Without iterating, the code I have is:
data <- map(test, ~ .x %>% mutate(year = as.integer(str_sub(names(test[[1]][1]), -4))))
Of course, this returns a year of 1970 in both listed data frames, which I don't want. (I want 1970 in the first and 1980 in the second.)
In addition, it's not piped, and my attempt to pipe it throws an error:
data <- test %>% map(~ .x %>% mutate(year = as.integer(str_sub(names(.x[[1]][1]), -4))))
# > Error: Problem with `mutate()` input `year`.
# > x Input `year` can't be recycled to size 2.
# > ℹ Input `year` is `as.integer(str_sub(names(.x[[1]][1]), -4))`.
# > ℹ Input `year` must be size 2 or 1, not 0.
How can I iterate over each listed data frame using the pipe?
Try:
test %>% map(~.x %>% mutate(year = as.integer(str_sub(names(.x[1]), -4))))
[[1]]
# A tibble: 2 x 4
geoid_1970 name_1970 pop_1970 year
<dbl> <chr> <dbl> <int>
1 123 here 1 1970
2 456 there 2 1970
[[2]]
# A tibble: 2 x 4
geoid_1980 name_1980 pop_1970 year
<dbl> <chr> <dbl> <int>
1 234 here 3 1980
2 567 there 4 1980
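A likely reason the original attempt failed: `[[1]]` extracts the first column as a bare vector, which carries no names, whereas `[1]` keeps a one-column data frame whose name is still available. A small base R sketch of the difference, using a made-up data frame d and substring() in place of str_sub():

```r
d <- data.frame(geoid_1970 = c(123, 456),
                name_1970  = c("here", "there"))

# `[[` returns the column itself; a plain numeric vector has no names
names(d[[1]][1])

# `[` returns a one-column data frame, so the column name survives
nm <- names(d[1])
nm

# Last four characters of the column name, converted to integer
year <- as.integer(substring(nm, nchar(nm) - 3))
year
```

When names() returns NULL, the substring step yields a zero-length result, which matches the "must be size 2 or 1, not 0" message in the error above.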
We can get the 'year' with parse_number
library(dplyr)
library(purrr)
map(test, ~ .x %>%
mutate(year = readr::parse_number(names(.)[1])))
-output
#[[1]]
# A tibble: 2 x 4
# geoid_1970 name_1970 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 123 here 1 1970
#2 456 there 2 1970
#[[2]]
# A tibble: 2 x 4
# geoid_1980 name_1980 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 234 here 3 1980
#2 567 there 4 1980

How to find the patient with at least two HbA1c values more than 6.5 at least two years apart using R

I have a dataset that contains the patient ID, the year in which HbA1c was measured, and the HbA1c value.
(1) I want to find each PATIENT_ID that has at least two measurements of HbA1c >= 6.5 taken at least two years apart. The result should contain the PATIENT_ID, the date of the first HbA1c >= 6.5, and that HbA1c value.
(2) I also want, for each patient, the PATIENT_ID, the date, and the HbA1c value of the row that starts the final run of HbA1c >= 6.5 (that is, the row from which all later HbA1c values for that patient are >= 6.5).
Thank you very much.
df <- data.frame(PATIENT_ID=c(1,1,1,2,2,2,3,3),
TEST_DATE=c(2001,2001,2006,2002,2004,2006,2001,2006),
HbA1c=c(6.4,6.5,6.3,6.7,6.1,6.9,6.7,7.0))
The expected output:
(1)
df <- data.frame(PATIENT_ID=c(2,3),
TEST_DATE=c(2002,2001),
HbA1c=c(6.7,6.7))
(2)
df <- data.frame(PATIENT_ID=c(2,3),
TEST_DATE=c(2006,2001),
HbA1c=c(6.9,6.7))
library(dplyr)
df %>%
arrange(TEST_DATE) %>%
filter(HbA1c >= 6.5) %>%
group_by(PATIENT_ID) %>%
filter(diff(range(TEST_DATE)) >= 2) %>%
slice(1) %>%
ungroup()
# # A tibble: 2 x 3
# PATIENT_ID TEST_DATE HbA1c
# <dbl> <dbl> <dbl>
# 1 2 2002 6.7
# 2 3 2001 6.7
df %>%
arrange(TEST_DATE) %>%
group_by(PATIENT_ID) %>%
filter(rev(cumall(rev(HbA1c >= 6.5)))) %>%
slice(1) %>%
ungroup()
# # A tibble: 2 x 3
# PATIENT_ID TEST_DATE HbA1c
# <dbl> <dbl> <dbl>
# 1 2 2006 6.9
# 2 3 2001 6.7
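The rev(cumall(rev(...))) idiom marks the trailing run of TRUE values, i.e. the positions after which every later value also satisfies the condition. A rough base R equivalent, shown on a made-up logical vector ok for a single hypothetical patient, could look like this:

```r
# TRUE/FALSE flags in chronological order for one hypothetical patient
ok <- c(TRUE, FALSE, TRUE, TRUE)

# A position belongs to the trailing run if no FALSE occurs at or after it:
# count the FALSEs while walking from the end, then flip the result back.
in_trailing_run <- rev(cumsum(rev(!ok)) == 0)
in_trailing_run

# Index of the first element of the trailing run (NA if the last flag is FALSE)
first_of_run <- which(in_trailing_run)[1]
first_of_run
```

Here the third element starts the final run, which corresponds to the row that slice(1) picks after the filter.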
Using dplyr :
library(dplyr)
Answer 1 :
df %>%
group_by(PATIENT_ID) %>%
slice(if (sum(HbA1c >= 6.5) >= 2) which.max(HbA1c >= 6.5) else 0) # counts values >= 6.5; the two-year gap is not checked here
# PATIENT_ID TEST_DATE HbA1c
# <dbl> <dbl> <dbl>
#1 2 2002 6.7
#2 3 2001 6.7
Answer 2 :
df %>%
group_by(PATIENT_ID) %>%
mutate(temp = cumsum(HbA1c < 6.5)) %>%
slice(if(last(HbA1c) >= 6.5) which.max(temp == max(temp) & HbA1c >= 6.5) else 0) %>%
select(-temp)
# PATIENT_ID TEST_DATE HbA1c
# <dbl> <dbl> <dbl>
#1 2 2006 6.9
#2 3 2001 6.7
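The first task also requires the qualifying measurements to be at least two years apart. A minimal base R sketch of that check, using the sample data from the question (the intermediate names high, span, ids and first_high are only illustrative):

```r
df <- data.frame(PATIENT_ID = c(1, 1, 1, 2, 2, 2, 3, 3),
                 TEST_DATE  = c(2001, 2001, 2006, 2002, 2004, 2006, 2001, 2006),
                 HbA1c      = c(6.4, 6.5, 6.3, 6.7, 6.1, 6.9, 6.7, 7.0))

# Keep only measurements at or above 6.5, in chronological order per patient
high <- df[df$HbA1c >= 6.5, ]
high <- high[order(high$PATIENT_ID, high$TEST_DATE), ]

# Years between the first and last qualifying measurement of each patient
span <- tapply(high$TEST_DATE, high$PATIENT_ID, function(d) diff(range(d)))
ids  <- as.numeric(names(span)[span >= 2])

# First qualifying row of every patient whose span is at least two years
first_high <- high[high$PATIENT_ID %in% ids & !duplicated(high$PATIENT_ID), ]
first_high
```

Patient 1 has only one value >= 6.5, so its span is 0 and it is dropped.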

How to group by a fixed number of rows in dplyr? [duplicate]

I have a data frame:
set.seed(123)
x <- sample(10)
y <- x^2
my.df <- data.frame(x, y)
The result is this:
> my.df
x y
1 3 9
2 8 64
3 4 16
4 7 49
5 6 36
6 1 1
7 10 100
8 9 81
9 2 4
10 5 25
What I want is to group the rows into blocks of n consecutive rows and compute the mean, sum, or any other summary on each block. Something like this for n=5:
my.df %>% group_by(5) %>% summarise(sum = sum(y), mean = mean(y))
The expected output would be something like:
# A tibble: 2 x 2
sum mean
<dbl> <dbl>
1 174 34.8
2 211 42.2
Of course, the number of rows in the data frame could be 15, 20, 100, whatever. I still want to group the data every n rows.
How can I do this?
We can use rep or gl to create the grouping variable
library(dplyr)
my.df %>%
group_by(grp = as.integer(gl(n(), 5, n()))) %>%
#or with rep
# group_by(grp = rep(row_number(), length.out = n(), each = 5))
summarise(sum = sum(y), mean = mean(y))
# A tibble: 2 x 3
# grp sum mean
# <int> <dbl> <dbl>
#1 1 174 34.8
#2 2 211 42.2
Another option could be:
my.df %>%
group_by(x = ceiling(row_number()/5)) %>%
summarise_all(list(sum = sum, mean = mean))
x sum mean
<dbl> <dbl> <dbl>
1 1 174 34.8
2 2 211 42.2
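The grouping index itself needs no package. As a sketch, integer division of the zero-based row position by n gives one block id per n rows; the data below are typed in from the printed my.df above, and n, grp and res are names chosen for the example:

```r
my.df <- data.frame(x = c(3, 8, 4, 7, 6, 1, 10, 9, 2, 5))
my.df$y <- my.df$x^2

n <- 5
# Rows 1..n get id 1, rows n+1..2n get id 2, and so on
grp <- (seq_len(nrow(my.df)) - 1) %/% n + 1

sums  <- tapply(my.df$y, grp, sum)
means <- tapply(my.df$y, grp, mean)
res <- data.frame(grp  = as.integer(names(sums)),
                  sum  = as.vector(sums),
                  mean = as.vector(means))
res
```

If the number of rows is not a multiple of n, the last block is simply shorter.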

dplyr: passing a grouped tibble to a custom function

(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.
> data
village A Z Y
<chr> <int> <int> <dbl>
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700
For example, I would like to calculate the mean of Y only using Z==z by villages. In this case, I want to have (500 + 400)/2 = 450 for village "a" and 700 for village "b".
Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.
z <- 1 # z takes 0 or 1
data %>%
group_by(village) %>% # grouping by village
summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable
Y_hat_village <- function(data_village, z){
# This function takes a part of tibble (`data_village`) and a variable `z`
# Calculate the mean for a specific z in a village
data_z <- data_village %>% filter(Z==get("z"))
return(mean(data_z$Y))
}
However, I found that . passes the entire tibble rather than the current group, so the code above returns the same value for all groups.
There are a couple things you can simplify. One is in your function: since you're passing in a value z to the function, you don't need to use get("z"). You have a z in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val.
library(tidyverse)
z_val <- 1
Y_hat_village2 <- function(data, z) {
data_z <- data %>% filter(Z == z)
return(mean(data_z$Y))
}
You can make the function call on each group using do, which will get you a list-column, and then unnesting that column. Again note that I'm passing in the variable z_val to the argument z.
df %>%
group_by(village) %>%
do(y_hat = Y_hat_village2(., z = z_val)) %>%
unnest(y_hat)
#> # A tibble: 2 x 2
#> village y_hat
#> <chr> <dbl>
#> 1 a 450
#> 2 b 700
However, do is being deprecated in favor of purrr::map, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a column of data frames called data, then map over that column and again supply z = z_val. When you unnest the y_hat column, you still have the original data as a nested column, since you wanted access to the rest of the columns still.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 450
#> 2 b <tibble [2 × 3]> 700
Just to check that everything works okay, I also passed in z = 0 to check for 1. scoping issues, and 2. that other values of z work.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 800
#> 2 b <tibble [2 × 3]> 300
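The same "one sub-data-frame per group, plus an extra argument" pattern can also be sketched in base R with split() and sapply(). This is only an illustration: the function name mean_y_for_z is made up, and the data are the five rows shown in the question.

```r
df <- data.frame(village = c("a", "a", "a", "b", "b"),
                 A = 1,
                 Z = c(1, 1, 0, 0, 1),
                 Y = c(500, 400, 800, 300, 700))

# Receives the rows of ONE village and the z value to condition on
mean_y_for_z <- function(data_village, z) {
  mean(data_village$Y[data_village$Z == z])
}

z_val <- 1
# split() produces one data frame per village; z is passed through sapply()
res <- sapply(split(df, df$village), mean_y_for_z, z = z_val)
res
```

The result is a named numeric vector with one entry per village.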
As an extension/modification to @patL's answer, you can also wrap the tidyverse solution within purrr::map to return a list of two tibbles, one for each z value:
z <- c(0, 1);
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 800.
#2 b 300.
#
#[[2]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 450.
#2 b 700.
Sample data
df <- read.table(text =
" village A Z Y
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700 ", header = T)
You can use dplyr to accomplish it:
library(dplyr)
df %>%
group_by(village) %>%
filter(Z == 1) %>%
summarise(Y_village = mean(Y))
## A tibble: 2 x 2
# village Y_village
# <chr> <dbl>
#1 a 450
#2 b 700
To get all columns:
df %>%
group_by(village) %>%
filter(Z == 1) %>%
mutate(Y_village = mean(Y)) %>%
distinct(village, A, Z, Y_village)
## A tibble: 2 x 4
## Groups: village [2]
# village A Z Y_village
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 450
#2 b 1 1 700
data
df <- tibble(village = c("a", "a", "a", "b", "b"),
A = rep(1, 5),
Z = c(1, 1, 0, 0, 1),
Y = c(500, 400, 800, 300, 700))

Error while using the forcats relevel function

I have a dataframe with X, Y coordinate values and corresponding ID values in Val.
df1 <- data.frame(X=rnorm(1000,0,1), Y=rnorm(1000,0,1),
ID=paste(rep("ID", 1000), 1:1000, sep="_"),
Type=rep("ID",1000),
Val=c(rep(c('Type1','Type2'),300),
rep(c('Type3','Type4'),200)))
Adding the missing IDs for the existing X,Y values in df1.
dat2 <- data.frame(Type=rep('D',8),
Val=paste(rep("D", 8),
sample(1:2,8,replace=T), sep="_"))
dat2 <- cbind(df1[sample(1:1000,80),1:3],dat2)
df1 <- rbind(df1, dat2)
Looking at the frequency of ID values.
df1 %>% count(Val)
# # A tibble: 6 x 2
# Val n
# <fctr> <int>
# 1 Type1 300
# 2 Type2 300
# 3 Type3 200
# 4 Type4 200
# 5 D_1 60
# 6 D_2 20
I am interested in only two IDs for further analysis, and the rest can be lumped into a single catch-all level. With the help of the fct_other function, I have recoded them into Other, and the frequency looks as expected.
df1 %>% mutate(Val=fct_other(Val,keep=c('D_1','D_2'))) %>% count(Val)
# # A tibble: 3 x 2
# Val n
# <fctr> <int>
# 1 D_1 60
# 2 D_2 20
# 3 Other 1000
Because fct_other puts "Other" as the last factor level and I want it first, I used fct_relevel from the same package.
df1 %>% mutate(Val=fct_other(Val,keep=c('Type5','Type6'))) %>%
mutate(Val=fct_relevel(Val,'Other'))%>%
count(Val)
# # A tibble: 1 x 2
# Val n
# <fctr> <int>
# 1 Other 1080
But it is giving unexpected results. Any idea on what might have gone wrong?
Update:
The problem was that I was trying to keep levels ('Type5', 'Type6') that do not exist in Val.
df1 %>% mutate(Val=fct_other(Val,keep=c('D_1','D_2'))) %>%
mutate(Val=fct_relevel(Val,'Other'))%>% count(Val)
# # A tibble: 3 x 2
# Val n
# <fctr> <int>
# 1 Other 1000
# 2 D_1 30
# 3 D_2 50
When I then tried to keep only the unique X,Y pairs, the selected levels went missing:
df1 %>% mutate(Val=fct_other(Val,keep=c('D_1','D_2'))) %>%
mutate(Val=fct_relevel(Val,'Other'))%>%
arrange(Val) %>% filter(!duplicated(.[,c("X","Y")])) %>% count(Val)
# # A tibble: 1 x 2
# Val n
# <fctr> <int>
# 1 Other 1000
Relevelling after the extraction of unique values does the job:
df1 %>% mutate(Val=fct_other(Val,keep=c('D_1','D_2'))) %>%
arrange(Val) %>% filter(!duplicated(.[,c("X","Y")])) %>%
mutate(Val=fct_relevel(Val,'Other')) %>%
arrange(Val) %>% count(Val)
# # A tibble: 3 x 2
# Val n
# <fctr> <int>
# 1 Other 920
# 2 D_1 30
# 3 D_2 50
Is this an efficient way of doing it?
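A possible explanation for the difference, sketched in base R on a made-up three-row data frame: duplicated() keeps the first occurrence of each X,Y pair, and sorting by a factor follows the level order, so whichever level comes first is the one that survives. relevel() stands in for fct_relevel here.

```r
d <- data.frame(X = c(1, 2, 1),
                Y = c(1, 2, 1),
                Val = factor(c("Other", "Other", "D_1"),
                             levels = c("D_1", "Other")))

# Case 1: D_1 is the first level, so it sorts first and survives de-duplication
a <- d[order(d$Val), ]
a <- a[!duplicated(a[, c("X", "Y")]), ]

# Case 2: "Other" is moved to the front before sorting, so the "Other" row
# with the same X,Y pair comes first and the D_1 row is dropped as a duplicate
d2 <- d
d2$Val <- relevel(d2$Val, ref = "Other")
b <- d2[order(d2$Val), ]
b <- b[!duplicated(b[, c("X", "Y")]), ]

table(a$Val)
table(b$Val)
```

Under that reading, de-duplicating while the levels of interest still sort first and relevelling afterwards is what preserves them.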
