I have a dataframe with a timestamp and a numeric variable, and the data is recorded once every hour. Ultimately, I'd like to know the mean number of hours that the variable is at or below a certain value. For example, what is the average number of hours that var1 is at or below 4? There are missing timestamps in the dataframe, so if the time is not consecutive the sum needs to restart.
In the example data frame the columns HoursBelow5 and RunningGroup were generated 'by hand'; if I could create these columns programmatically, I could filter to remove the RunningGroups that were associated with var1 values greater than 4 and then use dplyr::slice to get the maximum HoursBelow5 per group. I could then find the mean of these values.
So, in this approach I would need to create the restarting cumulative sum HoursBelow5, which restarts when the condition var1 < 5 is not met, or when the timestamp is not consecutive hours. I could then use ifelse statements to create the RunningGroup variable. Is this possible? I may be lacking the jargon to find the procedure. cumsum and lag seemed promising, but I have yet to construct a procedure that does the above.
Or, there may be a smarter way to do this using the timestamp.
edit: result incorporating code from answer below
df1 <- df %>%
  group_by(group = data.table::rleid(var1 > 4),
           group1 = cumsum(ts - lag(ts, default = first(ts)) > 3600)) %>%
  mutate(temp = row_number() * (var1 <= 4)) %>%
  ungroup() %>%
  filter(var1 <= 4) %>%
  select(ts, var1, temp)
df2 <- df1 %>%
  mutate(temp2 = ifelse(temp == 1, 1, 0),
         newgroup = cumsum(temp2))
df3 <- df2 %>%
  group_by(newgroup) %>%
  slice(which.max(temp))
mean(df3$temp)
# example dataframe with desired output columns to then get actual output
df <- structure(list(ts = structure(c(-2208967200, -2208963600, -2208960000,
-2208956400, -2208952800, -2208949200, -2208945600, -2208942000,
-2208938400, -2208934800, -2208931200, -2208927600, -2208924000,
-2208913200, -2208909600, -2208906000, -2208902400, -2208898800,
-2208895200, -2208891600, -2208888000, -2208884400, -2208880800,
-2208877200, -2208852000, -2208848400, -2208844800, -2208841200,
-2208837600, -2208834000, -2208830400, -2208826800, -2208823200,
-2208819600, -2208816000, -2208812400, -2208808800, -2208805200,
-2208801600), class = c("POSIXct", "POSIXt"), tzone = ""), var1 = c(1L,
3L, 4L, 5L, 4L, 3L, 5L, 6L, 7L, 8L, 3L, 2L, 2L, 2L, 3L, 3L, 2L,
2L, 1L, 1L, 1L, 1L, 4L, 4L, 3L, 9L, 3L, 3L, 3L, 2L, 2L, 3L, 4L,
5L, 3L, 2L, 1L, 2L, 3L), HoursBelow5 = c(1L, 2L, 3L, 0L, 1L,
2L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 1L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 0L, 1L, 2L,
3L, 4L, 5L), RunningGroup = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L,
6L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
10L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 14L, 14L, 14L,
14L, 14L), NotContinuous = c("", "", "", "", "", "", "", "",
"", "", "", "", "", "NC", "", "", "", "", "", "", "", "", "",
"", "NC", "", "", "", "", "", "", "", "", "", "", "", "", "",
"")), row.names = c(NA, -39L), class = "data.frame")
One way, using dplyr and data.table::rleid, could be
library(dplyr)
df %>%
  group_by(group = data.table::rleid(var1 > 4),
           group1 = cumsum(ts - lag(ts, default = first(ts)) > 3600)) %>%
  mutate(temp = row_number() * (var1 <= 4)) %>%
  ungroup() %>%
  select(ts, var1, HoursBelow5, temp)
# ts var1 HoursBelow5 temp
# <dttm> <int> <int> <int>
# 1 1900-01-01 12:46:46 1 1 1
# 2 1900-01-01 13:46:46 3 2 2
# 3 1900-01-01 14:46:46 4 3 3
# 4 1900-01-01 15:46:46 5 0 0
# 5 1900-01-01 16:46:46 4 1 1
# 6 1900-01-01 17:46:46 3 2 2
# 7 1900-01-01 18:46:46 5 0 0
# 8 1900-01-01 19:46:46 6 0 0
# 9 1900-01-01 20:46:46 7 0 0
#10 1900-01-01 21:46:46 8 0 0
# … with 29 more rows
The temp column is the one generated programmatically; HoursBelow5 is kept as-is for comparison purposes. If you also need RunningGroup, you could use group and group1 together, as sketched below.
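For example, one way to collapse group and group1 into a single run id could be to pass both to data.table::rleid() (a sketch only; RunningGroup2 is just an illustrative name):
df %>%
  group_by(group = data.table::rleid(var1 > 4),
           group1 = cumsum(ts - lag(ts, default = first(ts)) > 3600)) %>%
  mutate(temp = row_number() * (var1 <= 4)) %>%
  ungroup() %>%
  mutate(RunningGroup2 = data.table::rleid(group, group1)) %>%
  select(ts, var1, RunningGroup, RunningGroup2, temp)
Note this will not reproduce the hand-made RunningGroup numbering exactly for the var1 > 4 rows (a consecutive run above 4 gets one id rather than one id per row), but those groups are the ones that get filtered out anyway.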
I received some data from a colleague who is working with animal observations recorded in several transects. However, my colleague reused the same four ID codes for the transects: 1, 7, 13 and 19. I would like to replace the repeated IDs with unique IDs; the transect_id column in the example data below shows the desired result.
Here's the corresponding code:
example_data<-structure(list(ID_Transect = c(1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L,
7L, 7L, 7L, 7L, 13L, 13L, 13L, 13L, 13L, 13L, 19L, 19L, 19L,
19L, 19L, 19L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 7L, 7L, 7L, 7L),
transect_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-36L))
We can also do
library(data.table)
setDT(example_data)[, transect_id := rleid(ID_Transect)]
You can use data.table rleid -
example_data$transect_id <- data.table::rleid(example_data$ID_Transect)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6
In base R we can use rle -
with(rle(example_data$ID_Transect), rep(seq_along(values), lengths))
Or diff + cumsum -
cumsum(c(TRUE, diff(example_data$ID_Transect) != 0))
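As a quick sanity check (a sketch), all three approaches should return the same grouping on example_data:
r1 <- data.table::rleid(example_data$ID_Transect)
r2 <- with(rle(example_data$ID_Transect), rep(seq_along(values), lengths))
r3 <- cumsum(c(TRUE, diff(example_data$ID_Transect) != 0))
all(r1 == r2) && all(r2 == r3)
# should be TRUE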
I am trying to copy over a solution from Excel Solver into R but am not sure where to start.
The problem: choose 5 options (rows) for each hour that maximize the sum of "Score", without picking the same group more than 2 times across the hours.
In other words: maximize the score, with criteria:
1. rows within the same group get picked a maximum of 2 times.
2. exactly 5 rows within each hour get picked.
I think it is easier to explain this by showing the data:
Data:
group,hour,Score
a,1,1000
a,2,1231
b,1,12312
b,2,6438
c,1,3033
c,2,6535
d,1,4283
d,2,4957
e,1,9507
e,2,5115
f,1,1914
f,2,9278
g,1,5362
g,2,8408
h,1,4640
h,2,4296
j,1,8115
j,2,1143
aa,1,3242
aa,2,3695
bb,1,3908
bb,2,2540
cc,1,6438
cc,2,2170
dd,1,6497
dd,2,3327
ee,1,5067
ee,2,6614
ff,1,5140
ff,2,9858
gg,1,8061
gg,2,2316
hh,1,7848
hh,2,3525
jj,1,8259
jj,2,9014
a,3,31100
b,3,111100
c,3,87200
d,3,60700
e,3,50600
f,3,74300
g,3,97400
h,3,28900
j,3,25900
aa,3,55600
bb,3,38200
cc,3,58500
dd,3,51300
ee,3,84000
ff,3,83700
gg,3,74200
hh,3,19700
jj,3,62800
Data in dput format.
df1 <-
structure(list(group = structure(c(1L, 1L, 3L, 3L,
5L, 5L, 7L, 7L, 9L, 9L, 11L, 11L, 13L, 13L, 15L,
15L, 17L, 17L, 2L, 2L, 4L, 4L, 6L, 6L, 8L, 8L,
10L, 10L, 12L, 12L, 14L, 14L, 16L, 16L, 18L,
18L, 1L, 3L, 5L, 7L, 9L, 11L, 13L, 15L, 17L,
2L, 4L, 6L, 8L, 10L, 12L, 14L, 16L, 18L),
.Label = c("a", "aa", "b", "bb", "c", "cc",
"d", "dd", "e", "ee", "f", "ff", "g", "gg",
"h", "hh", "j", "jj"), class = "factor"),
hour = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), Score = c(1000L,
1231L, 12312L, 6438L, 3033L, 6535L, 4283L, 4957L,
9507L, 5115L, 1914L, 9278L, 5362L, 8408L, 4640L,
4296L, 8115L, 1143L, 3242L, 3695L, 3908L, 2540L,
6438L, 2170L, 6497L, 3327L, 5067L, 6614L, 5140L,
9858L, 8061L, 2316L, 7848L, 3525L, 8259L, 9014L,
31100L, 111100L, 87200L, 60700L, 50600L, 74300L,
97400L, 28900L, 25900L, 55600L, 38200L, 58500L,
51300L, 84000L, 83700L, 74200L, 19700L, 62800L)),
class = "data.frame", row.names = c(NA, -54L))
You can use the lpSolve package for this optimization problem with binary decision variables and linear constraints:
library(lpSolve)
library(data.table) # for pivoting data and shifting coef of constraints

d <- dcast(df1, group ~ hour, value.var = "Score")  # one row per group, one column per hour
nr <- nrow(d)       # number of groups
nc <- ncol(d) - 1L  # number of hours

# decision variables follow unlist(d[, 2:4]): all groups for hour 1, then hour 2, then hour 3
# "each group picked at most 2 times": one constraint row per group,
# with a 1 at that group's variable in every hour
m1 <- matrix(c(1, 1, 1, rep(0, nr * nc - 3L)), ncol = nc, byrow = TRUE)
max2constr <- do.call(rbind, shift(m1, 0L:(nr - 1), fill = 0))

# "exactly 5 rows picked per hour": one constraint row per hour,
# with a 1 at every group's variable for that hour
m2 <- matrix(c(rep(1, nr), rep(0, (nc - 1) * nr)), ncol = nc)
choose5constr <- do.call(rbind, shift(m2, seq(0, by = nr, length.out = nc), fill = 0))

ans <- lp("max",
          unlist(d[, 2:4]),  # objective: Score of each group/hour cell
          rbind(max2constr, choose5constr),
          c(rep("<=", nrow(max2constr)), rep("=", nrow(choose5constr))),
          c(rep(2, nrow(max2constr)), rep(5, nrow(choose5constr))),
          all.bin = TRUE)
ans$objval
soln <- matrix(ans$solution, nrow = nr, dimnames = list(d$group, names(d)[-1L]))
Objective value = 552826
soln output:
1 2 3
a 0 0 0
aa 0 0 0
b 1 0 1
bb 0 0 0
c 0 0 1
cc 0 0 0
d 0 0 0
dd 0 0 0
e 1 0 0
ee 0 1 1
f 0 1 0
ff 0 1 1
g 0 1 1
gg 1 0 0
h 0 0 0
hh 0 0 0
j 1 0 0
jj 1 1 0
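To sanity-check the solution against the two constraints (a sketch, using the d, ans and soln objects above):
all(rowSums(soln) <= 2)   # each group picked at most twice
all(colSums(soln) == 5)   # exactly five rows picked per hour
sum(unlist(d[, 2:4]) * ans$solution)   # should match ans$objval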
I am working on a data frame with three columns labelled as id, time1 and time2. A sample is:
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L),
time1 = c(12L, 5L, 3L, 5L, 6L, 30L, 3L, 30L, 7L, 2L, 17L, 5L, 8L, 3L, 22L, 5L, 15L, 4L, 7L, 23L),
time2 = c(23L, 23L, 23L, 23L, 23L, 22L, 22L, 22L, 22L, 22L, 25L, 25L, 25L, 25L, 25L, 24L, 24L, 24L, 24L, 24L)
),
.Names = c("id", "time1","time2"),
class = "data.frame",
row.names = c(NA,-20L)
)
I am using R and I am trying to subset this data and replace column time2 with a new column based on the following criteria:
Sum the values of time1 for each id until the sum is greater than or equal to the corresponding value of time2 for that id.
Replace the cells in time1 where the summations terminate with the corresponding time2 value for each id.
Column time2 is to be replaced with a new column labelled as status which consists of 0's and 1's. That is, status takes on 1 for the non-replaced values of time1 and 0 for all the replaced values of time1.
In summary, I am expecting to see something like this:
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
time1 = c(12L, 5L, 3L, 23L, 22L, 17L, 5L, 25L, 5L, 15L, 24L),
status=c(1L,1L,1L,0L,0L,1L,1L,0L,1L,1L,0L)
),
.Names = c("id", "time1","status"),
class = "data.frame",
row.names = c(NA,-11L)
)
I greatly appreciate any help on this.
We can do the following:
library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(
    status = as.numeric(cumsum(time1) < time2),
    time1 = ifelse(status == 1, time1, time2)) %>%
  group_by(id, status) %>%
  mutate(n = 1:n()) %>%
  ungroup() %>%
  filter(status == 1 | (status == 0 & n == 1)) %>%
  select(-n, -time2)
## A tibble: 11 x 3
# id time1 status
# <int> <int> <dbl>
# 1 1 12 1.
# 2 1 5 1.
# 3 1 3 1.
# 4 1 23 0.
# 5 2 22 0.
# 6 3 17 1.
# 7 3 5 1.
# 8 3 25 0.
# 9 4 5 1.
#10 4 15 1.
#11 4 24 0.
Explanation: We group rows by id, then calculate the cumulative sum of the time1 entries, flagging rows where cumsum(time1) < time2 with status = 1 and the rest with status = 0; time1 entries are replaced with the corresponding time2 entry when status == 0. Lastly we need to remove the excess status = 0 rows; to do so, we regroup by id and status, number rows consecutively, and keep only the first status = 0 row per id.
This is a question about array and data frame manipulation and calculation, in the
context of models for log odds in contingency tables. The closest question I've found to this is How can i calculate odds ratio in many table, but mine is more general.
I have a data frame representing a 3-way frequency table, of size 5 (litter) x 2 (treatment) x 3 (deaths).
"Freq" is the frequency in each cell, and deaths is the response variable.
Mice <-
structure(list(litter = c(7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L,
11L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 7L, 7L, 8L,
8L, 9L, 9L, 10L, 10L, 11L, 11L), treatment = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), deaths = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("0", "1",
"2+"), class = "factor"), Freq = c(58L, 75L, 49L, 58L, 33L, 45L,
15L, 39L, 4L, 5L, 11L, 19L, 14L, 17L, 18L, 22L, 13L, 22L, 12L,
15L, 5L, 7L, 10L, 8L, 15L, 10L, 15L, 18L, 17L, 8L)), .Names = c("litter",
"treatment", "deaths", "Freq"), row.names = c(NA, 30L), class = "data.frame")
From this, I want to calculate the log odds for adjacent categories of the last variable (deaths)
and have these values in a data frame with factors litter (5), treatment (2), and contrast (2), as detailed below.
The data can be seen in xtabs() form:
mice.tab <- xtabs(Freq ~ litter + treatment + deaths, data=Mice)
ftable(mice.tab)
deaths 0 1 2+
litter treatment
7 A 58 11 5
B 75 19 7
8 A 49 14 10
B 58 17 8
9 A 33 18 15
B 45 22 10
10 A 15 13 15
B 39 22 18
11 A 4 12 17
B 5 15 8
From this, I want to calculate the (adjacent) log odds of 0 vs. 1 and 1 vs. 2+ deaths, which is easy in array format,
odds1 <- log(mice.tab[,,1]/mice.tab[,,2]) # contrast 0:1
odds2 <- log(mice.tab[,,2]/mice.tab[,,3]) # contrast 1:2+
odds1
treatment
litter A B
7 1.6625477 1.3730491
8 1.2527630 1.2272297
9 0.6061358 0.7156200
10 0.1431008 0.5725192
11 -1.0986123 -1.0986123
But, for analysis, I want to have these in a data frame with factors litter, treatment, and contrast,
and a column 'logodds' containing the entries in the odds1 and odds2 tables, suitably strung out.
More generally, for an I x J x K table, where the last factor is the response, my desired result
is a data frame of IJ(K-1) rows, with adjacent log odds in a 'logodds' column, and ideally, I'd like
to have a general function to do this.
Note that if T is the 10 x 3 matrix of frequencies shown by ftable(), the calculation is essentially
log(T) %*% matrix(c(1, -1, 0,
                    0, 1, -1), ncol = 2)
followed by reshaping and labeling.
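To make the target shape concrete, one possible (if clunky) way to string odds1 and odds2 out into the kind of data frame I mean could be something like this sketch (longodds and the contrast labels are just illustrative):
longodds <- rbind(
  data.frame(as.data.frame.table(odds1, responseName = "logodds"), contrast = "0:1"),
  data.frame(as.data.frame.table(odds2, responseName = "logodds"), contrast = "1:2+")
)
I'd like a cleaner, general version of this for an I x J x K table.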
Can anyone help with this?
This is a re-post of "R: For() loop checking if date is between two dates in separate object", changed to incorporate a minimal mock/test data set following the suggestions of Henrik and Metrics. Thanks to them.
I have two large datasets, and both contain columns of date/time fields. My first dataset has a single date; the second has two dates. In short, I am trying to find all dates from the first data set that fall between the two dates of the second, and then find an average value. In order to provide clarity, I have created a mock minimal data set using values rather than dates.
The head() of my first mock data set is below – as well as the dput() output. The data is specific to an individual noted by the IndID column.
IndID MockDate RandNumber
1 1 5 1.862084
2 1 3 1.103154
3 1 5 1.373760
4 1 1 1.497397
5 1 1 1.319488
6 1 3 2.120354
actData <- structure(list(IndID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), MockDate = c(5L, 3L, 5L, 1L, 1L, 3L, 4L,
2L, 2L, 5L, 2L, 1L, 5L, 3L, 5L, 3L, 5L, 3L, 5L, 1L, 5L, 3L, 5L,
5L, 2L, 3L, 1L, 4L, 3L, 3L), RandNumber = c(1.862083679, 1.103154127,
1.37376001, 1.497397482, 1.319487885, 2.120353884, 1.895660195,
1.150411874, 2.61036961, 1.99354158, 1.547706758, 1.941501873,
1.739226419, 2.455590044, 2.907382515, 2.110502618, 2.076187012,
2.507527308, 2.167657681, 1.662405916, 2.428807116, 2.04699653,
1.937335768, 1.456518889, 1.948952907, 2.104325112, 2.311519732,
2.092650229, 2.109051215, 2.089144475)), .Names = c("IndID",
"MockDate", "RandNumber"), class = "data.frame", row.names = c(NA,
-30L))
The head() of my 2nd mock data set is below – as well as the dput() output.
IndID StartTime EndTime
1 1 4 5
2 1 7 11
3 1 6 9
4 1 7 9
5 1 6 10
6 1 2 12
clstrData <- structure(list(IndID.1 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), StartTime = c(4L, 7L,
6L, 7L, 6L, 2L, 6L, 4L, 3L, 5L, 2L, 5L, 7L, 3L, 4L, 3L, 2L, 5L,
5L), EndTime = c(5L, 11L, 9L, 9L, 10L, 12L, 8L, 13L, 5L, 13L,
9L, 9L, 17L, 6L, 8L, 6L, 9L, 15L, 7L)), .Names = c("IndID",
"StartTime", "EndTime"), row.names = c(NA, 19L), class = "data.frame")
The second dataset has two number fields representing a start and end time. As above, these data are also specific to an individual noted by the IndID column.
I need to average the 'RandNumber' from dataset one for all the instances when 'MockDate' is between 'StartTime' and 'EndTime' of the second dataset for each unique IndID. Thus, 'RandNumber' values should only be averaged if 1) the corresponding 'MockDate' falls within 'StartTime' and 'EndTime' and 2) the IndID of both rows is the same.
I started by creating a function to ID if MockDate is between StartTime and EndTime
is.between <- function(x, a, b) {
x > a & x < b
}
Testing that function works for a single value
is.between(actData[1,3], clstrData[,2], clstrData[,3])
But I cannot figure out how to loop this over all rows and then find the mean. My for() loop beginnings are below.
YesNo <- list()
for (i in 1:nrow(actData)) {
YesNo[[i]] <- is.between(actData[1,3], clstrData[,2], clstrData[,3])
}
YesNo[[3]]
This for() loop gives the same result for all rows…
Hope to create...
clstrData$NEWcolum <- mean RandNum for each row.
Thanks, and as always any suggestions are greatly appreciated!
Assuming your machine can handle the data size, you can:
merge the two data frames on the ID, then
group accordingly (i.e., by IndID, StartTime and EndTime), and
compute the mean for those rows where MockDate falls between StartTime and EndTime
Here is some code using data.table
library(data.table)
DT.clstr <- data.table(clstrData, key="IndID")
DT.act <- data.table(actData, key="IndID")
# Adjust to `<=` if needed
ComputedDT <-
merge(DT.clstr, DT.act, allow.cartesian=TRUE)[
MockDate > StartTime & MockDate < EndTime
, list(Mean=mean(RandNumber))
, by=list(IndID, StartTime, EndTime)
]
Results
ComputedDT
IndID StartTime EndTime Mean
1: 1 2 12 1.671002
2: 2 4 13 2.176799
3: 2 2 9 2.244702
4: 3 3 6 1.978828
5: 3 4 8 1.940887
6: 3 2 9 2.033104
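If you then want those means attached to every row of the original clstrData (rows with no qualifying MockDate get NA), one possible follow-up, sketched with the objects above, is a left join on the grouping columns (clstrWithMean is just an illustrative name):
clstrWithMean <- merge(DT.clstr, ComputedDT,
                       by = c("IndID", "StartTime", "EndTime"),
                       all.x = TRUE)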
Thanks to Ricardo Saporta for earlier thoughts.
However, constructing a long conditional in my for() loop was the best option for me, although not as fast as data.table.
Using the data above, the code below is what I ended up constructing.
clstrData$meanAct <- rep(NA, nrow(clstrData))
for (i in 1:nrow(clstrData)) {
  clstrData$meanAct[i] <- mean(actData$RandNumber[actData$IndID == clstrData$IndID[i] &
                                 is.between(actData$MockDate, clstrData$StartTime[i], clstrData$EndTime[i])])
}
head(clstrData)
tail(clstrData)
Where there is no corresponding value between the StartTime and EndTime, NaNs are produced.
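If you prefer NA over NaN in those rows, a small follow-up (a sketch) could be:
clstrData$meanAct[is.nan(clstrData$meanAct)] <- NA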