R - Sum total time of multiple overlapping and/or discontinuous periods - r

For example, these are the days on which certain types of roles are present in an office:
type day_in day_out
A    1      10
A    5      15
A    31     35
B    5      15
C    10     20
C    45     55
D    41     50
I want the number of days the office is occupied. There is a continuous office presence from days 1 to 20, 31 to 35, and 41 to 55, so the answer I want is 40 days.
I have a solution based on pivoting the data and setting flags on the days when the state switches between occupied and unoccupied, using a for loop to cycle through each row. But I came to this solution reluctantly after failing to work out a vectorized approach.
Is there a vectorized way to do the operation from my for loop? Or any ideas for different algorithms would also be welcome.
My solution with example data is below:
library(dplyr)
library(tidyr)
df_raw <- read.table(
header = TRUE,
text = "
type day_in day_out
A 1 10
A 5 15
A 31 35
B 5 15
C 10 20
C 45 55
D 41 50
"
)
# occupancy from day 1 to 20, 31 to 35 & 41 to 55 = 40 days
# Unoccupied for 15 days
df <- df_raw %>%
tidyr::pivot_longer(cols = c(day_in, day_out), names_to = "in_out", values_to = "day") %>%
arrange(day)
# Create these columns to prevent warning "Unknown or uninitialised column" later
df$current_types <- NA
df$flag <- NA
# Loop to create flags on day when occupancy switches from occupied to unoccupied or vice-versa
for (rown in 1:nrow(df)) {
df$current_types[rown] <- if (rown == 1) {
df$type[rown]
} else {
if (df$in_out[rown] == "day_in") {
paste(df$current_types[rown - 1], df$type[rown], collapse = " ")
} else {
trimws(gsub(paste0("\\s?", df$type[rown], "\\s?"), " ", df$current_types[rown - 1]))
}
}
# if there are no current type then unoccupied. It may or may not be occupied again afterwards.
df$flag[rown] <- if (rown == 1 | (df$in_out[rown] == "day_out" & nchar(df$current_types[rown]) == 0)) {
1
} else {
if (df$in_out[rown] == "day_in" & nchar(df$current_types[rown - 1]) == 0) 1 else 0
}
}
# Then filter the flags, "pivot" to get each occupancy start and end in one row and sum the total days occupied
df %>%
filter(flag == 1) %>%
mutate(
start = if_else(in_out == "day_out" & lag(in_out) == "day_in", dplyr::lag(day), NULL),
stop = if_else(in_out == "day_out", day, NULL)
) %>%
filter(in_out == "day_out") %>%
summarise(days_occupied = sum(stop - start + 1))

You can generate day sequences for each role and count the number of unique days:
length(unique(unlist(apply(df_raw[, c('day_in', 'day_out')],
1,
function(x) seq(x[1], x[2])))))
Or using pipes:
df_raw[, c('day_in', 'day_out')] %>%
apply(1, function(x) seq(x[1], x[2])) %>%
unlist %>%
unique %>%
length

Another simple solution would be to create a vector with the size of your timespan and flag all occupied days and count them afterwards.
df <- data.frame(
type = c("A","A","A","B","C","C","D"),
day_in = c(1,5,31,5,10,45,41),
day_out = c(10,15,35,15,20,55,50))
occupation <- rep(0, max(df$day_out))
for(i in 1:nrow(df)){
occupation[df[i,'day_in']:df[i,'day_out']] <- 1
}
# 40
sum(occupation)
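If the day range is large, materializing every single day can be avoided. Below is a minimal sketch of an interval-merging alternative, assuming inclusive integer days as in the question. The names o, prev_end and grp are only for illustration. The idea is to sort by day_in, start a new block whenever an interval begins after the running maximum of all earlier day_out values, and then sum the block lengths.

```r
df_raw <- data.frame(
  type    = c("A", "A", "A", "B", "C", "C", "D"),
  day_in  = c(1, 5, 31, 5, 10, 45, 41),
  day_out = c(10, 15, 35, 15, 20, 55, 50)
)

# Sort intervals by their start day
o <- df_raw[order(df_raw$day_in), ]

# Latest end day seen among all earlier intervals (-Inf for the first one)
prev_end <- c(-Inf, head(cummax(o$day_out), -1))

# A new occupied block starts when an interval begins after every earlier one ended
grp <- cumsum(o$day_in > prev_end)

# Collapse each block to its first and last day, then count days inclusively
starts <- tapply(o$day_in, grp, min)
ends   <- tapply(o$day_out, grp, max)
days_occupied <- sum(ends - starts + 1)
days_occupied
```

On the example data the blocks are 1 to 20, 31 to 35 and 41 to 55, which gives 40 days, the same as the answers above. There is no explicit loop, and memory use depends on the number of intervals rather than on the length of the timespan.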

Related

How to do rowsums on a select set of columns containing a string and a number in R?

I have a list of column names that look like this...
colnames(dat)
1 subject
2 e.type
3 group
4 boxnum
5 edate
6 file.name
7 fr
8 active
9 inactive
10 reward
11 latency.to.first.active
12 latency.to.first.inactive
13 act0.600
14 act600.1200
15 act1200.1800
16 act1800.2400
17 act2400.3000
18 act3000.3600
19 inact0.600
20 inact600.1200
21 inact1200.1800
22 inact1800.2400
23 inact2400.3000
24 inact3000.3600
25 rew0.600
26 rew600.1200
27 rew1200.1800
28 rew1800.2400
29 rew2400.3000
30 rew3000.3600
I want to get the row sums for the columns named act#, inact#, and rew#.
This works...
for (row in 1:nrow(dat)) {
dat[row, "active"] = rowSums(dat[row,c(13:18)])
dat[row, "inactive"] = rowSums(dat[row,c(19:24)])
dat[row, "reward"] = rowSums(dat[row,c(25:30)])
}
But I don't want to hard-code it, since the number of columns for the 3 sections may change. How can I do this without hard-coding the column indexes?
Also, when I tried searching for column names containing "act", the match also included the "active" column.
sub_dat <- dat[, 13:30]
result <- sapply(split.default(sub_dat, substr(names(sub_dat), 1, 3)), rowSums)
dat[, c('active', 'inactive', 'reward')] <- result
Easy-peasy with select() and matches() from the tidyverse. Anchor the patterns with ^, otherwise "act[0-9]" also matches the inact columns (e.g. "inact0.600" contains "act0").
library(tidyverse)
dat %>%
  mutate(
    sum_act = rowSums(select(., matches("^act[0-9]"))),
    sum_inact = rowSums(select(., matches("^inact[0-9]"))),
    sum_rew = rowSums(select(., matches("^rew[0-9]")))
  )
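The same pattern-based selection can be sketched in base R with grep(). The two-row data frame below is made up purely for illustration, and the patterns vector is a hypothetical helper. The ^ anchor keeps "act" from matching the inact columns, and requiring a digit keeps it from matching "active".

```r
# Toy data: the three total columns plus two time bins per section
dat <- data.frame(
  active = NA, inactive = NA, reward = NA,
  act0.600 = c(1, 2),   act600.1200 = c(3, 4),
  inact0.600 = c(5, 6), inact600.1200 = c(7, 8),
  rew0.600 = c(9, 10),  rew600.1200 = c(11, 12)
)

# Target column name -> regular expression for the columns to add up
patterns <- c(active = "^act[0-9]", inactive = "^inact[0-9]", reward = "^rew[0-9]")

for (target in names(patterns)) {
  cols <- grep(patterns[[target]], names(dat))
  # drop = FALSE keeps a data frame even if only one column matches
  dat[[target]] <- rowSums(dat[, cols, drop = FALSE])
}
dat[, c("active", "inactive", "reward")]
```

Because the columns are found by name, the code keeps working when the number of time bins per section changes.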
I made an example how it could be done:
t <- data.frame(c(1,2,3),c("a","b","c"))
colnames(t) <- c("num","char")
#with function append() you make a list of rows that fulfill your logical argument
whichRows <- append(which(t$char == "a"),which(t$char == "b"))
sum(t$num[whichRows])
or if I misunderstood you and you want to sum for every column separately then:
sum(t$num[which(t$char == "a")])
sum(t$num[which(t$char == "b")])

How to apply function with multiple conditions on multiple columns to get new conditional columns in R

Hello all a R noob here,
I hope you guys can help me with the following.
I need to transform multiple columns in my dataset into new columns based on the values in the original columns, and I need to do this multiple times. For the first transformation I use columns 1, 2, 3, and if certain conditions are met the output is a new column with a 1 or a 0; for the second transformation I use columns 4, 5, 6 and the output should also be a 1 or a 0. I have to do this 18 times. I already wrote a function which successfully does the transformation if I enter the variables manually, but I would like to apply this function to all the desired columns at once. My desired output would be 18 new columns with 0's and 1's. Finally I will make a last column which will display a 1 if any of the 18 columns is a 1 and a 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by="day"), 12),
visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by="day"), 12),
reason1 = sample(3,12, replace = T),
reason2 = sample(3,12, replace = T),
reason3 = sample(3,12, replace = T))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(dis))] <= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, 0)
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(admis))] <= df[eval(substitute(vis2))] & df[eval(substitute(dis))] >= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, xnew)
return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I enter the variable names manually it works, but I can't make it work for all the variables at once. I tried to make string vectors with all the admission dates, discharge dates and reasons and tried to transform them with mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply, but there you have to define an X = ..., which I think I cannot use because I have multiple columns that I want to pass in. Please correct me if I am wrong!
Also I considered using a for loop, but I don't know how to use that with multiple conditions.
Any help would be greatly appreciated!
You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
return(xnew)
}
Now create new columns -
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
df[admiss],df[discharge],df[reason])
#Additional column will be 1 if any of the value in the new column is 1.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df
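As a side note, if every discharge date is on or after its admission date (true in the example, where discharge = admission + 10), the two conditions together reduce to "the admission date falls between visit 1 and visit 2, with reason 2". Below is a small check of that claim with made-up dates. admissdate_simple is a hypothetical helper, not part of the answer above.

```r
# Function from the answer: condition 1, then condition 2 on top of it
admissdate <- function(admis, dis, rsn, vis1, vis2) {
  xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
  ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
}

# Simplified form, valid only under the assumption dis >= admis
admissdate_simple <- function(admis, rsn, vis1, vis2) {
  as.integer(admis >= vis1 & admis <= vis2 & rsn == 2)
}

vis1  <- as.Date("1995-01-01")
vis2  <- as.Date("1997-01-01")
admis <- as.Date(c("1994-12-25", "1995-06-01", "1996-12-28",
                   "1997-02-01", "1995-06-01"))
dis   <- admis + 10
rsn   <- c(2, 2, 2, 2, 1)

original   <- admissdate(admis, dis, rsn, vis1, vis2)
simplified <- admissdate_simple(admis, rsn, vis1, vis2)
all(original == simplified)
```

The five rows cover an admission before visit 1, one fully inside the window, one that starts inside and ends after visit 2, one after visit 2, and one with a different reason. If discharge dates can precede admission dates in the real data, the two-step version from the answer should be kept.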

How can I create new column in data frame by aggregating rows?

I have a large (~200k rows) dataframe that is structured like this:
df <-
data.frame(c(1,1,1,1,1), c('blue','blue','blue','blue','blue'), c('m','m','m','m','m'), c(2016,2016,2016,2016,2016),c(3,4,5,6,7), c(10,20,30,40,50))
colnames(df) <- c('id', 'color', 'size', 'year', 'week','revenue')
Let's say it is currently week 7, and I want to compare the trailing 4 week average of revenue to the current week's revenue. What I would like to do is create a new column for that average when all of the identifiers match.
df_new <-
data.frame(1, 'blue', 'm', 2016,7,50, 25 )
colnames(df_new) <- c('id', 'color', 'size', 'year', 'week','revenue', 't4ave')
How can I accomplish this efficiently? Thank you for the help
Good question. for loops are pretty inefficient, but since you do have to check the conditions of prior entries, this is the only solution I can think of (mind you, I'm also an intermediate at R):
key_cols <- c('id', 'color', 'size', 'year')
df$diff <- NA_real_
for (i in seq_len(nrow(df))) {
  if (i > 4) {
    prev <- (i - 4):(i - 1)
    # the identifiers of the previous 4 rows must all match the current row
    # (week is not compared, since it differs from row to row)
    same_group <- all(sapply(key_cols, function(k) all(df[[k]][prev] == df[[k]][i])))
    if (same_group) {
      # average of the previous 4 entries' revenues
      avg <- mean(df$revenue[prev])
      # difference between this entry and that trailing average
      df$diff[i] <- df$revenue[i] - avg
    }
  }
}
This code will probably be slow on ~200k rows, but it should work, assuming the rows are sorted by week within each group. If this is a one-time thing, then it should be okay. Otherwise, hopefully others will be able to advise.
A solution using dplyr and zoo. The idea is to group by the variables that are the same, such as id, color, size, and year. After that, use rollmean to calculate the rolling mean of revenue. Use na.pad = TRUE and align = "right" so that each window ends at the current week and the first rows are padded with NA. Finally, use lag to shift the results by one row, so that each row holds the average of the four weeks before it.
library(dplyr)
library(zoo)
df2 <- df %>%
group_by(id, color, size, year) %>%
mutate(t4ave = rollmean(revenue, 4, na.pad = TRUE, align = "right")) %>%
mutate(t4ave = lag(t4ave))
df2
# A tibble: 5 x 7
# Groups: id, color, size, year [1]
id color size year week revenue t4ave
<dbl> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 1 blue m 2016 3 10 NA
2 1 blue m 2016 4 20 NA
3 1 blue m 2016 5 30 NA
4 1 blue m 2016 6 40 NA
5 1 blue m 2016 7 50 25
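For comparison, here is a sketch of the same trailing average in base R, without zoo. It assumes the rows are already sorted by week within each group. trailing_mean is a hypothetical helper built on cumsum(), and ave() applies it per group.

```r
# Mean of the k values before each position (NA where fewer than k exist)
trailing_mean <- function(x, k = 4) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > k) {
    cs <- cumsum(c(0, x))   # cs[j] is the sum of x[1..(j - 1)]
    idx <- (k + 1):n
    out[idx] <- (cs[idx] - cs[idx - k]) / k
  }
  out
}

df <- data.frame(id = 1, color = "blue", size = "m", year = 2016,
                 week = 3:7, revenue = c(10, 20, 30, 40, 50))

# Apply the helper within each id/color/size/year group
df$t4ave <- ave(df$revenue, df$id, df$color, df$size, df$year,
                FUN = trailing_mean)
df
```

For week 7 this gives (10 + 20 + 30 + 40) / 4 = 25, matching the t4ave column in the output above.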

How to change a proportion of values in a data.frame

I'm trying to find out how to change the values in a data.frame based on specific proportions. This is an example of the values in the data.frame where class values (with different counts) are grouped by the field "id":
> head(pts)
id class
1 245 10
2 522 10
3 522 10
4 522 10
And this is an example of the proportions:
id class perc%
245 10 100
522 10 50
522 20 50
My objective is to be able to select the values for each "id" and change them according to the "perc%" field, e.g. if I have 100 values for id=522 then change 50 values to class=10 and then 50 values to class=20 (perc%=50).
I've tried subsetting the data.frame or making conditional selections but can't find a way to basically join the "perc%" with the counts of values per "id".
Thanks in advance.
I think I see better what you are trying to do now. This code may not be very elegant, but it should get the job done. The "percentages" dataframe (df.percentages) represents the second table you described; note that I am renaming "perc%" to "perc".
library(dplyr)
colnames(df.percentages) <- c("id","class","perc")
p.check <- df.percentages %>% group_by(id) %>% summarize(sum(perc))
colnames(p.check) <- c("id","sumperc")
not100 <- which(p.check$sumperc != 100)
if(length(not100) != 0)
{
print(paste("ID",p.check$id[not100],"does not add up to 100%"))
}
rm(p.check)
ids <- unique(df.percentages$id)
for(i in 1:length(ids))
{
print("")
print(paste("Processing ID:",ids[i]))
classes.to.reassign <- pts %>% filter(id == ids[i])
if(nrow(classes.to.reassign) == 0)
{
print(paste("Could not find ID",ids[i],"in pts dataframe!"))
next
}
class.rows <- df.percentages %>%
filter(id == ids[i]) %>%
mutate(rows = as.integer(round(nrow(classes.to.reassign) * (perc / 100))))
if(nrow(classes.to.reassign) < nrow(class.rows))
{
print(paste("Cannot split", nrow(classes.to.reassign), "classes into", nrow(class.rows), "segments for ID:", ids[i]))
next
}
# rounding can leave the row counts off by a few; add the signed difference to the last class
if(sum(class.rows$rows) != nrow(classes.to.reassign))
{class.rows$rows[nrow(class.rows)] <- class.rows$rows[nrow(class.rows)] + (nrow(classes.to.reassign) - sum(class.rows$rows))}
class.rows <- class.rows %>%
mutate(cumrows = cumsum(as.integer(rows)))
print(paste("Total rows for ID",ids[i],"=",nrow(classes.to.reassign)))
cur.row <- 1
for(c in 1:nrow(class.rows))
{
last.row <- class.rows$cumrows[c]
print(paste("Assigning class",class.rows$class[c],"to rows",cur.row,"-",last.row))
classes.to.reassign$class[cur.row:last.row] <- class.rows$class[c]
cur.row <- last.row + 1
}
if(i == 1)
{pts.new <- classes.to.reassign}
else
{pts.new <- rbind(pts.new, classes.to.reassign)}
rm(classes.to.reassign, class.rows)
}
pts <- pts.new
View(pts)
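The core of the loop above (turn percentages into row counts, then assign classes in order) can be sketched more compactly with rep(). This is only an illustration under the same assumptions as the answer: percentages sum to 100 per id and any rounding remainder goes to the last class. assign_by_share and the shares table are hypothetical names.

```r
# Build a vector of n class labels split according to perc (in percent)
assign_by_share <- function(n, classes, perc) {
  counts <- round(n * perc / 100)
  # give the last class whatever is left so the counts add up to n
  last <- length(counts)
  counts[last] <- n - sum(counts[-last])
  rep(classes, times = counts)
}

pts <- data.frame(id = c(245, 522, 522, 522, 522), class = 10)
shares <- data.frame(id    = c(245, 522, 522),
                     class = c(10, 10, 20),
                     perc  = c(100, 50, 50))

for (cur in unique(shares$id)) {
  s <- shares[shares$id == cur, ]
  rows <- which(pts$id == cur)
  if (length(rows) > 0) {
    pts$class[rows] <- assign_by_share(length(rows), s$class, s$perc)
  }
}
pts
```

With four rows for id 522 and a 50/50 split, the first two rows get class 10 and the last two get class 20. If the assignment should be random rather than in row order, the result of assign_by_share() could be passed through sample() first.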

dplyr idiom for summarize() a filtered-group-by, and also replace any NAs due to missing rows

I am computing a dplyr::summarize across a dataframe of sales data.
I do a group-by (S,D,Y), then within each group, compute medians and means for weeks 5..43, then merge those back into the parent df. Variable X is sales. X is never NA (i.e. there are no explicit NAs anywhere in df), but if there is no data (as in, no sales) for that S,D,Y and set of weeks, there will simply be no row with those values in df (take it that means zero sales for that particular set of parameters). In other words, impute X=0 in any structurally missing rows (but I hope I don't need to melt/cast the original df, to avoid bloat. Similar to cast(fill....,add.missing=T) or caret::preProcess()).
Two questions about my code idiom:
Is it better to use summarize than dplyr::filter, because filter physically drops rows so I have to assign the results to df.tmp then left-join it back to the original df (as below)? Also, big subsetting expressions repeated on every single line of summarize computations make the code harder to read.
Should I worry (or not) about caching the rows or logical indices of the subsetting operation, in the general case where I might be computing say n=20 new summary variables?
Not all combinations of S,D,Y-groups and filter (for those weeks) have rows, so how to get the summarize to replace NA on any missing rows? Currently I do as below.
Sorry both the code and dataset are proprietary, but here's the code idiom, and below is code you should run first to generate sample-data:
# Compute median, mean of X across wks 5..43, for that set of S,D,Y-values
# Issue a) filter() or repeatedly use subset() within each calculation?
df.tmp <- df %.% group_by(S,D,Y) %.% filter(Week>=5 & Week<=43) %.%
summarize(ysd_med543_X = median(X),
ysd_mean543_X = mean(X)
) %.% ungroup()
# Issue b) how to replace NAs in groups where the group_by-and-filter gave empty output?
# can you merge this code with the summarize above?
df <- left_join(df, df.tmp, copy=F)
newcols <- match(c('ysd_mean543_X','ysd_med543_X'), names(df))
df[!complete.cases(df[,newcols]), newcols] <- c(0.0,0.0)
and run this first to generate sample-data:
set.seed(1234)
rep_vector <- function(vv, n) {
unlist(as.vector(lapply(vv, function(...) {rep(...,n)} )))
}
n=7
m=3
df = data.frame(S = rep_vector(10:12, n), D = 20:26,
Y = rep_vector(2005:2007, n),
Week = round(52*runif(m*n)),
X = 4e4*runif(m*n) + 1e4 )
# Now drop some rows, to model structurally missing rows
I <- sort(sample(1:nrow(df),0.6*nrow(df)))
df = df[I,]
require(dplyr)
I don't think this has anything to do with the feature you've linked under comments (because IIUC that feature has to do with unused factor levels). Once you filter your data, IMO summarise should not (or rather can't?) be including them in the results (with the exception of factors). You should clarify this with the developers on their project page.
I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first followed by group_by + summarise. Else, you'll be filtering for each group, which is unnecessary. That is:
df.tmp <- df %.% filter(Week>=5 & Week<=43) %.% group_by(S,D,Y) %.% ...
This is just so that you're aware of it for any future cases.
IMO, it's better to use mutate here instead of summarise, as it'll remove the need for left_join, IIUC. That is:
df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
md_X = median(X[Week >=5 & Week <= 43]),
mn_X = mean(X[Week >=5 & Week <= 43]))
Here, still we've the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here. So, you'll have to use ifelse, once again IIUC. But that'd be a little nicer if mutate supports expressions.
What I've in mind is something like:
df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
{ tmp = Week >= 5 & Week <= 43;
md_X = ifelse(length(tmp), median(X[tmp]), 0),
mn_X = ifelse(length(tmp), mean(X[tmp]), 0)
})
So, we'll have to workaround in this manner probably:
df.tmp = df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43)
df.tmp %.% mutate(md_X = ifelse(tmp[1L], median(X), 0),
mn_X = ifelse(tmp[1L], mean(X), 0))
Or to put things together:
df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43,
md_X = ifelse(tmp[1L], median(X), 0),
mn_X = ifelse(tmp[1L], mean(X), 0))
# S D Y Week X tmp md_X mn_X
# 1 10 20 2005 6 22107.73 TRUE 22107.73 22107.73
# 2 10 23 2005 32 18751.98 TRUE 18751.98 18751.98
# 3 10 25 2005 33 31027.90 TRUE 31027.90 31027.90
# 4 10 26 2005 0 46586.33 FALSE 0.00 0.00
# 5 11 20 2006 12 43253.80 TRUE 43253.80 43253.80
# 6 11 22 2006 27 28243.66 TRUE 28243.66 28243.66
# 7 11 23 2006 36 20607.47 TRUE 20607.47 20607.47
# 8 11 24 2006 28 22186.89 TRUE 22186.89 22186.89
# 9 11 25 2006 15 30292.27 TRUE 30292.27 30292.27
# 10 12 20 2007 15 40386.83 TRUE 40386.83 40386.83
# 11 12 21 2007 44 18049.92 FALSE 0.00 0.00
# 12 12 26 2007 16 35856.24 TRUE 35856.24 35856.24
which doesn't require df.tmp.
HTH
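For reference, the "compute within the week window, fall back to 0 for groups with no rows in the window" idea can also be sketched in base R with ave(), which avoids both the temporary data frame and the join. The toy data and the helper names x_win and summarise_or_zero are made up for illustration. The output column names follow the question.

```r
df <- data.frame(S = c(10, 10, 10, 10, 11),
                 D = c(20, 20, 20, 20, 21),
                 Y = c(2005, 2005, 2005, 2005, 2006),
                 Week = c(6, 30, 40, 50, 0),
                 X = c(100, 300, 800, 999, 50))

# Mask values outside weeks 5..43 so they do not enter the statistics
in_window <- df$Week >= 5 & df$Week <= 43
x_win <- ifelse(in_window, df$X, NA)

# Apply f to the non-missing values, or return 0 if the group has none
summarise_or_zero <- function(f) {
  function(v) if (all(is.na(v))) 0 else f(v, na.rm = TRUE)
}

grp <- interaction(df$S, df$D, df$Y, drop = TRUE)
df$ysd_med543_X  <- ave(x_win, grp, FUN = summarise_or_zero(median))
df$ysd_mean543_X <- ave(x_win, grp, FUN = summarise_or_zero(mean))
df
```

Every row of a group receives that group's statistic, including rows whose own week lies outside the window, and the group with no in-window rows gets 0 for both columns.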

Resources