I've been trying to combine two for loops into a single loop.
Loop 1:
Unique.Order.Comment <- unique(df2$Rebuilt.Order.Comment)
length(Unique.Order.Comment)
#loop for the calculations
for (i in 1:length(Unique.Order.Comment)) {
#a <- i-11
#c[i] <- print(sum(n.Cases.per.month$nCases[a:i]))
a <- subset.data.frame(Rebuilt.Data, Rebuilt.Order.Comment == Unique.Order.Comment[i])
assign(Unique.Order.Comment[i],a)
}
Loop 2:
#loop for the calculations
c <- rep(0, nrow(BR))
for (ii in 1:nrow(BR)) {
if (ii < 12){
print(0)
}else {
a <- ii-11
c[ii] <- print(sum(BR$Number.Cases.Authorised[a:ii]))
}
}
c <- data.frame(c)
c <- c %>%
rename(
n.Seen.Cum = c
)
#View(c)
BR <- cbind(BR,c)
In Loop 2, BR needs to be replaced by the data frame for Unique.Order.Comment[i].
What I believe (or hope) the combined loop should look like is shown below.
But I get the error message: Error in rep(0, nrow(Unique.Order.Comment[i])) : invalid 'times' argument
(What I think it should look like)
Unique.Order.Comment <- unique(df2$Rebuilt.Order.Comment)
length(Unique.Order.Comment)
#loop for the calculations
for (i in 1:length(Unique.Order.Comment)) {
#a <- i-11
#c[i] <- print(sum(n.Cases.per.month$nCases[a:i]))
a <- subset.data.frame(Rebuilt.Data, Rebuilt.Order.Comment == Unique.Order.Comment[i])
assign(Unique.Order.Comment[i],a)
#loop for the calculations
c <- rep(0, nrow(Unique.Order.Comment[i]))
for (ii in 1:nrow(Unique.Order.Comment[i])) {
if (ii < 12){
print(0)
}else {
a <- ii-11
c[ii] <- print(sum(Unique.Order.Comment[i]$Number.Cases.Authorised[a:ii]))
}
}
c <- data.frame(c)
c <- c %>%
rename(
n.Seen.Cum = c
)
#View(c)
Unique.Order.Comment[i] <- cbind(Unique.Order.Comment[i],c)
}
Edit example data:
dput(Unique.Order.Comment)
c("CN", "DM", "DR", "FF", "PG", "HN", "SK", "GI", "GYN", "BR",
"UR", "LYMPH", "HPB", "BST", "ENDOC", "PAEDGI", "CT", "PERI",
"NEURO", "MOHS", "ICC", "RE", "PAED", "MN", "EMR", "PR", "LBX",
"HAEM", "CTT", "UGI", "NEUR", "URGI", "GYNAE")
dput(head(Rebuilt.Data))
structure(list(Rebuilt.Order.Comment = c("BR", "BR", "BR", "BR",
"BR", "BR"), Period.Received = c("2019-01", "2019-02", "2019-03",
"2019-04", "2019-05", "2019-06"), Number.Cases.Received = c(838L,
730L, 778L, 832L, 574L, 626L), Number.Cases.Authorised = c(680L,
587L, 896L, 715L, 761L, 554L), Number.Cases.Authorised.Less7Days = c(550L,
343L, 520L, 389L, 393L, 374L), Number.Cases.Authorised.Less10.Days = c(628L,
475L, 723L, 595L, 555L, 474L), Percentage.Authorsied.Less7Days = c(0.808823529411765,
0.584327086882453, 0.580357142857143, 0.544055944055944, 0.516425755584757,
0.675090252707581), Percentage.Authorsied.Less10Days = c(0.923529411764706,
0.809199318568995, 0.806919642857143, 0.832167832167832, 0.729303547963206,
0.855595667870036), Avg.TaT.for.Authorised.Cases = structure(c(5.26470588235294,
8.74616695059625, 8.34709821428571, 8.09370629370629, 12.826544021025,
6.22021660649819), class = "difftime", units = "days"), MDM.Received = c(2L,
13L, 2L, NA, NA, 5L), MDM.Received.Avg.TAT = structure(c(5, 29.2307692307692,
0.5, NA, NA, 5.4), class = "difftime", units = "days"), So.Received = c(NA,
1L, NA, 1L, NA, 2L), So.Received.Avg.TAT = structure(c(NA, 14,
NA, 9, NA, 54), class = "difftime", units = "days")), row.names = c(NA,
6L), class = "data.frame")
If I place print(Unique.Order.Comment[i]) before the second, separate loop, I get:
"CN"
In theory, the first loop subsets the data by each value in the unique list of Order.Comment (which it can do).
The second loop then calculates a cumulative sum, which gets cbind-ed onto the subsetted data.
First, it is easier to help if you provide a small example along with your expected output. You can share your original data with the columns that are not relevant to the question removed, or create a fake dataset that resembles your original data.
Second, I think you are overcomplicating this. It is never a good idea to create multiple datasets in your global environment. They are difficult to manage and unnecessarily pollute the global environment. You can use lists instead.
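As a minimal sketch of the list approach (the toy data frame below is made up for illustration and has only two of the real columns), split() returns one data frame per order comment in a single named list. It also hints at why the combined loop failed: Unique.Order.Comment[i] is a character string such as "CN", not a data frame, so nrow() on it returns NULL.

```r
# Toy data standing in for Rebuilt.Data (values are illustrative only)
toy <- data.frame(
  Rebuilt.Order.Comment = c("BR", "BR", "CN"),
  Number.Cases.Authorised = c(680L, 587L, 12L),
  stringsAsFactors = FALSE
)

# One list element per order comment, instead of one global object each
by_comment <- split(toy, toy$Rebuilt.Order.Comment)

names(by_comment)          # "BR" "CN"
nrow(by_comment[["BR"]])   # 2

# A character string has no rows, which is what broke rep(0, nrow(...))
is.null(nrow("CN"))        # TRUE
```

Each subset can then be reached by name, for example by_comment[["BR"]], without assign().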
In this case I don't think we need to split the data into separate list elements at all, since several packages can perform rolling calculations. For example, below I have used the zoo package, which has the rollsumr function.
library(dplyr)
library(zoo)
df <- df %>%
group_by(Rebuilt.Order.Comment) %>%
mutate(n.Seen.Cum = rollsumr(Number.Cases.Authorised, 12, fill = 0)) %>%
ungroup()
df
# Rebuilt.Order.Comment Period.Received Number.Cases.Authorised n.Seen.Cum
# <chr> <chr> <int> <int>
# 1 BR 2019-01 680 0
# 2 BR 2019-02 587 0
# 3 BR 2019-03 896 0
# 4 BR 2019-04 715 0
# 5 BR 2019-05 761 0
# 6 BR 2019-06 554 0
# 7 BR 2019-07 843 0
# 8 BR 2019-08 815 0
# 9 BR 2019-09 704 0
#10 BR 2019-10 939 0
#11 BR 2019-11 834 0
#12 BR 2019-12 880 9208
#13 BR 2020-01 801 9329
#14 BR 2020-02 610 9352
#15 BR 2020-03 853 9309
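If adding zoo is not an option, the same trailing 12-row window can be sketched in base R. This is an alternative technique, not a description of what rollsumr does internally; roll_sum_right is a hypothetical helper name, and the "CN" rows are invented to show the per-group behaviour.

```r
# Hypothetical helper: sum of the last k values, `fill` where fewer than k exist
roll_sum_right <- function(x, k, fill = 0) {
  n <- length(x)
  out <- rep(fill, n)
  if (n >= k) {
    cs <- cumsum(c(0, x))
    # window ending at position j is cs[j + 1] - cs[j - k + 1]
    out[k:n] <- cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]
  }
  out
}

cases <- c(680, 587, 896, 715, 761, 554, 843, 815,
           704, 939, 834, 880, 801, 610, 853)
toy <- data.frame(
  Rebuilt.Order.Comment = c(rep("BR", 15), rep("CN", 3)),
  Number.Cases.Authorised = c(cases, 5, 6, 7)   # "CN" rows are made up
)

# ave() applies the helper within each order comment
toy$n.Seen.Cum <- ave(
  toy$Number.Cases.Authorised,
  toy$Rebuilt.Order.Comment,
  FUN = function(v) roll_sum_right(v, 12)
)

tail(toy$n.Seen.Cum[toy$Rebuilt.Order.Comment == "BR"], 4)
# 9208 9329 9352 9309
```

Groups with fewer than 12 rows, like the invented "CN" group, simply keep the fill value.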
I think I see what you are aiming for, but I may have missed something. Let me know, and I can edit.
From what I can tell, you only need one loop, and instead of assign()ing a bunch of dataframes, you can iteratively build a summary table.
edit
The other answer here is quite elegant! I'm updating my answer based on your new comments just for fun. Not sure why we have different n.Seen.Cum values...
df2 <- structure(list(
Rebuilt.Order.Comment = c("BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR" ),
Period.Received = c("2019-01", "2019-02", "2019-03", "2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09", "2019-10", "2019-11", "2019-12", "2020-01", "2020-02", "2020-03"),
Number.Cases.Authorised = c(680L, 587L, 896L, 715L, 761L, 554L, 843L, 815L, 704L, 939L, 834L, 880L, 801L, 610L, 853L),
n.Seen.Cum = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9208, 9329, 9352, 9309)),
row.names = c(NA, 15L), class = "data.frame")
# This will hold results
output <- list()
# Loop over this vector
Unique.Order.Comment <- unique(df2$Rebuilt.Order.Comment)
for(comment in Unique.Order.Comment){
# Temporary dataframe that is subset of 'df2'
temp <- df2[df2$Rebuilt.Order.Comment == comment,]
# We can do arithmetic with dates that have days
temp$Period.Received2 <- as.Date(paste(temp$Period.Received, "-01", sep=""))
# Calculate cumsum after 333 days have passed
temp$n.Seen.cum2 <- ifelse(
test = temp$Period.Received2 - min(temp$Period.Received2) > 333,
yes = cumsum(temp$Number.Cases.Authorised),
no = NA)
# better
output[[comment]] <- temp
# quick and dirty
# assign(x = comment, value = temp)
}
output[[1]]
#> Rebuilt.Order.Comment Period.Received Number.Cases.Authorised n.Seen.Cum
#> 1 BR 2019-01 680 0
#> 2 BR 2019-02 587 0
#> 3 BR 2019-03 896 0
#> 4 BR 2019-04 715 0
#> 5 BR 2019-05 761 0
#> 6 BR 2019-06 554 0
#> 7 BR 2019-07 843 0
#> 8 BR 2019-08 815 0
#> 9 BR 2019-09 704 0
#> 10 BR 2019-10 939 0
#> 11 BR 2019-11 834 0
#> 12 BR 2019-12 880 9208
#> 13 BR 2020-01 801 9329
#> 14 BR 2020-02 610 9352
#> 15 BR 2020-03 853 9309
#> Period.Received2 n.Seen.cum2
#> 1 2019-01-01 NA
#> 2 2019-02-01 NA
#> 3 2019-03-01 NA
#> 4 2019-04-01 NA
#> 5 2019-05-01 NA
#> 6 2019-06-01 NA
#> 7 2019-07-01 NA
#> 8 2019-08-01 NA
#> 9 2019-09-01 NA
#> 10 2019-10-01 NA
#> 11 2019-11-01 NA
#> 12 2019-12-01 9208
#> 13 2020-01-01 10009
#> 14 2020-02-01 10619
#> 15 2020-03-01 11472
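The two columns agree at row 12 and diverge afterwards because cumsum() keeps accumulating from the first row, whereas n.Seen.Cum only covers the latest 12 rows. A quick check at row 13, using the Number.Cases.Authorised values from df2:

```r
cases <- c(680L, 587L, 896L, 715L, 761L, 554L, 843L, 815L,
           704L, 939L, 834L, 880L, 801L, 610L, 853L)

running_total <- cumsum(cases)[13]   # rows 1 to 13
window_total  <- sum(cases[2:13])    # the 12 rows ending at row 13

running_total   # 10009, as in n.Seen.cum2
window_total    # 9329, as in n.Seen.Cum
```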
If you have multiple years and want the cumulative sum to reset, update the test parameter in ifelse() to include some max number of days.
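One possible reading of "reset" is restarting the total each calendar year. The sketch below does that with ave() grouped by year instead of changing the ifelse() test, so it is a different technique from the one just described; the four rows are an excerpt of df2.

```r
period <- c("2019-11", "2019-12", "2020-01", "2020-02")
cases  <- c(834, 880, 801, 610)

# Year is the first four characters of "YYYY-MM"
year <- substr(period, 1, 4)

# cumsum() restarts whenever the year changes
cum_by_year <- ave(cases, year, FUN = cumsum)
cum_by_year
# 834 1714 801 1411
```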
Anonymised example subset of a much larger dataset (now edited to show an option with multiple competing types):
structure(list(`Sample File` = c("A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C"),
Marker = c("X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Z",
"Z", "Z", "Z", "Z", "q", "q", "q", "q"), Allele = c(19, 20,
22, 23, 18, 18.2, 19, 19.2, 20, 12, 13, 14, 15, 16, 10, 10.2,
11, 12), Size = c(249.15, 253.13, 260.64, 264.68, 366, 367.81,
369.97, 372.02, 373.95, 91.65, 95.86, 100, 104.24, 108.38,
177.51, 179.4, 181.42, 185.49), Height = c(173L, 1976L, 145L,
1078L, 137L, 62L, 1381L, 45L, 1005L, 38L, 482L, 5766L, 4893L,
19L, 287L, 36L, 5001L, 50L), Type = c("minusone", "allele",
"minusone", "allele", "ambiguous", "minushalf", "allele",
"minushalf", "allele", "minustwo", "ambiguous", "allele",
"allele", "plusone", "minusone", "minushalf", "allele", "plusone"
), LUS = c(11.75, 11.286, 13.375, 13.5, 18, 9, 19, 10, 20,
12, 11, 14, 15, 16, 9.5, NA, 11, 11.5)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -18L), groups = structure(list(
`Sample File` = c("A", "A", "B", "C"), Marker = c("X", "Y",
"Z", "q"), .rows = structure(list(1:4, 5:9, 10:14, 15:18), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), .drop = TRUE))
I want to look up values based on the classification $Type.
"minustwo" means I want to look up the "Allele", "Height" and "LUS"
values for the row with "Allele" equal to the current row plus two,
with the same Sample File and Marker.
"minusone" means the same but for "Allele" equal to the current row plus one.
"minushalf" means the same but for "Allele" equal to the current row plus 0.2, except that the decimal values here are quarter steps (12.1, 12.2, 12.3, 13, 13.1, etc.), so I have a helper function plusTwoBP() for this.
"plusone" means the same but for "Allele" equal to the current row minus one.
"allele" or "ambiguous" don't need to do anything.
Ideal output:
# A tibble: 18 × 10
# Rowwise: Sample File, Marker
`Sample File` Marker Allele Size Height Type LUS ParentHeight ParentAllele ParentLUS
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 A X 19 249. 173 minusone 11.8 1976 20 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 1078 23 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 1381 19 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 1005 20 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 5766 14 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 4893 15 15
15 C q 10 178. 287 minusone 9.5 5001 11 11
16 C q 10.2 179. 36 minushalf NA 5001 11 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 5001 11 11
I have a rather belaboured way of doing it:
# eg for minustwo
sampleData %>%
filter(Type == "minustwo") %>%
rowwise() %>%
mutate(ParentHeight = sampleData$Height[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
ParentAllele = sampleData$Allele[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
ParentLUS = sampleData$LUS[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)]) %>%
right_join(sampleData)
I then have to redo that for each of my Types.
My real dataset has thousands of rows, so this ends up a little slow but manageable. More to the point, I want to learn a better way to do it. In particular, the sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker part seems like it should be doable with grouping, so I must be missing a trick there.
I have tried using group_map() but I've clearly not understood it correctly:
sampleData$ParentHeight <- sampleData %>%
group_by(`Sample File`, `Marker`) %>%
group_map(.f = \(.x, .y) {
pmap_dbl(.l = .x, .f = \(Allele, Height, Type, ...){
if(Type == "allele" | Type == "ambiguous") { return(0)
} else if (Type == "plusone") {
return(.x$Height[.x$Allele == round(Allele - 1, 1)])
} else if (Type == "minushalf") {
return(.x$Height[.x$Allele == round(plustwoBP(Allele), 1)])
} else if (Type == "minusone") {
return(.x$Height[.x$Allele == round(Allele + 1, 1)])
} else if (Type == "minustwo") {
return(.x$Height[.x$Allele == round(Allele + 2, 1)])
} else { stop("unexpected peak type") }
})}) %>% unlist()
This initially seems to work, but on investigation it is not respecting both layers of grouping, so it brings in matches from the wrong Marker. Additionally, here I'm assigning the output to a new column in the data frame, but if I instead wrap a mutate() around this so that I can create all three new columns in one go, the group_map() no longer works at all.
I also considered using complete() to hugely extend the data frame with all possible values of Allele (including the x.0, x.1, x.2, x.3 variants), then using lag() to select the corresponding rows, then dropping the spare rows. This seems like it would make the data frame enormous in the interim.
To summarise
This works, but it feels ugly and like I'm missing a more elegant and obvious solution. How would you approach this?
You can create two versions of Allele: one identical to the original Allele, and one adjusted according to Type (minusone, minustwo, etc.).
Then do a self left join on that adjusted version of Allele (together with Sample File and Marker):
sampleData = sampleData %>% group_by(`Sample File`,Marker) %>% mutate(id = Allele) %>% ungroup()
left_join(
sampleData %>%
mutate(id = case_when(
Type=="minusone"~id+1,
Type=="minustwo"~id+2,
Type=="plusone"~id-1,
Type=="minushalf"~ceiling(id))),
sampleData %>% select(-c(Size,Type)),
by=c("Sample File", "Marker", "id"),
suffix = c("", ".parent")
) %>% select(-id)
Output:
# A tibble: 18 × 10
`Sample File` Marker Allele Size Height Type LUS Allele.parent Height.parent LUS.parent
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <dbl> <int> <dbl>
1 A X 19 249. 173 minusone 11.8 20 1976 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 23 1078 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 19 1381 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 20 1005 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 14 5766 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 15 4893 15
15 C q 10 178. 287 minusone 9.5 11 5001 11
16 C q 10.2 179. 36 minushalf NA 11 5001 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 11 5001 11
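The same self-join idea can also be sketched in base R with merge(). This is only an illustration on a small excerpt of the sample data: SampleFile stands in for `Sample File`, the "minushalf" case is left out, and the Parent* column names follow the question, not the .parent suffix used above.

```r
# Small excerpt of the sample data; SampleFile stands in for `Sample File`
peaks <- data.frame(
  SampleFile = c("A", "A", "B", "B", "B", "B"),
  Marker     = c("X", "X", "Z", "Z", "Z", "Z"),
  Allele     = c(19, 20, 12, 14, 15, 16),
  Height     = c(173L, 1976L, 38L, 5766L, 4893L, 19L),
  Type       = c("minusone", "allele", "minustwo", "allele", "allele", "plusone"),
  stringsAsFactors = FALSE
)

# Offset from a peak's Allele to its parent's Allele; other types give NA
offset <- c(minusone = 1, minustwo = 2, plusone = -1)
peaks$id <- peaks$Allele + unname(offset[peaks$Type])

# Lookup table of candidate parents, keyed by their own Allele
parents <- data.frame(
  SampleFile   = peaks$SampleFile,
  Marker       = peaks$Marker,
  id           = peaks$Allele,
  ParentAllele = peaks$Allele,
  ParentHeight = peaks$Height
)

# all.x = TRUE keeps peaks that have no parent (their Parent* columns are NA)
res <- merge(peaks, parents, by = c("SampleFile", "Marker", "id"), all.x = TRUE)
res <- res[order(res$SampleFile, res$Marker, res$Allele), ]
res[, c("SampleFile", "Marker", "Allele", "Type", "ParentAllele", "ParentHeight")]
```

Because the join keys include SampleFile and Marker, a parent can only come from the same sample and marker, which is the grouping the question was trying to express.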
I am not 100% sure how to formulate my question because I don't know the formal names for what I am trying to do with my dataset. Based on previous questions, there appears to be some way to address what I am trying to do, but I am unable to make the logical jump from their problem to my own.
I have attached a sample of my data here.
The first thing I did with my data was add a column indicating which species (sps) are predators (coded as 1) and which species are prey (coded as 0).
#specify which are predators and prey
d1 = d1 %>%
group_by(sps) %>% #grouped by species
mutate(pp=ifelse(sps %in% c("MUXX", "MUVI","MEME"), 1,0)) #mutate to specify predators as 1 and prey as 0
My data is structured as such:
head(d1) #visualize the first few lines of the data
# A tibble: 6 x 8
# Groups: sps [4]
ID date km culv.id type sps time pp
<int> <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <dbl>
1 2012-06-19 80 A DCC MICRO 2:19 0
2 2012-06-21 80 A DCC MUXX 23:23 1
3 2012-07-15 80 A DCC MAMO 11:38 0
4 2012-07-20 80 A DCC MICRO 22:19 0
5 2012-07-29 80 A DCC MICRO 23:03 0
6 2012-08-07 80 A DCC PRLO 2:04 0
Here is also the output for dput(head(d1)):
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
I also converted the time and date using the following code:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
The (most) relevant columns are date, time, and pp (where 1 = predator species and 0 = prey species).
I am now trying to figure out how to extract the following information (average +/- std):
average time between prey-prey observations
average time between prey-predator observations
average time between predator-predator observations
average time between predator-prey observations
To put one of these examples (#2) into words:
What is the average time between when a prey species (pp = 0) is first seen followed by a predator species (pp = 1)?
I am trying to figure out how to do this for my dataset overall first. I think that once I figure out how to do that, it should be fairly straightforward to restrict the data.
Here is a data.table (and lubridate) version that might address your problem:
Using a selection of your posted data (posted at bottom), with a slight modification to your datetime creation so that the format works with data.table:
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
Convert to a data table:
d1 <- as.data.table(d1)
Calculate the time difference from each observation to the first later observation of the same class (pp == pp), from prey to predator (pp > pp), and from predator to prey (pp < pp).
d1$class.class <- d1[d1, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
d1$prey.pred <-d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
d1$pred.prey <- d1[d1, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
Gives you a column for each:
> head(d1[, 7:ncol(d1)])
time pp datetime class.class prey.pred pred.prey
1: 2:19 0 2012-06-19 02:19:00 26.388194 days 2.877778 days NA days
2: 23:23 1 2012-06-21 23:23:00 74.177083 days NA days 23.51042 days
3: 11:38 0 2012-07-15 11:38:00 5.445139 days 50.666667 days NA days
4: 22:19 0 2012-07-20 22:19:00 9.030556 days 45.221528 days NA days
5: 23:03 0 2012-07-29 23:03:00 8.125694 days 36.190972 days NA days
6: 2:04 0 2012-08-07 02:04:00 1.911111 days 28.065278 days NA days
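To make the cross-class joins concrete, here is a base R sketch of what the prey.pred and pred.prey columns are meant to hold, assuming "first" means the earliest later observation (the join above relies on row order for that). days_to_next is a hypothetical helper, and only the first four rows of the data are used.

```r
obs <- data.frame(
  datetime = as.POSIXct(
    c("2012-06-19 02:19", "2012-06-21 23:23",
      "2012-07-15 11:38", "2012-07-20 22:19"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  ),
  pp = c(0, 1, 0, 0)
)

# Hypothetical helper: days from each row of class `from` to the earliest
# later row of class `to`; NA for other rows or when no such row exists
days_to_next <- function(datetime, pp, from, to) {
  vapply(seq_along(datetime), function(i) {
    if (pp[i] != from) return(NA_real_)
    later <- datetime[datetime > datetime[i] & pp == to]
    if (length(later) == 0) return(NA_real_)
    as.numeric(difftime(min(later), datetime[i], units = "days"))
  }, numeric(1))
}

obs$prey.pred <- days_to_next(obs$datetime, obs$pp, from = 0, to = 1)
obs$pred.prey <- days_to_next(obs$datetime, obs$pp, from = 1, to = 0)
obs
```

On these four rows the first prey row gets about 2.88 days and the predator row about 23.51 days, matching the first two rows of the output above; rows 3 and 4 are NA only because the excerpt contains no later predator.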
And you can get summary statistics as you like:
d1[, .(mean.same.class = mean(class.class, na.rm = TRUE),
       sd.same.class = sd(class.class, na.rm = TRUE),
       mean.prey.pred = mean(prey.pred, na.rm = TRUE),
       sd.prey.pred = sd(prey.pred, na.rm = TRUE),
       mean.pred.prey = mean(pred.prey, na.rm = TRUE),
       sd.pred.prey = sd(pred.prey, na.rm = TRUE)),
   by = sps]
sps mean.same.class sd.same.class mean.prey.pred sd.prey.pred mean.pred.prey sd.pred.prey
1: MICRO 7.886237 days 8.0547631 18.80733 days 15.504646 NaN days NA
2: MUXX 42.073611 days 45.4011658 NaN days NA 13.01366 days 9.315697
3: MAMO 5.445139 days NA 50.66667 days NA NaN days NA
4: PRLO 2.475694 days 0.7984414 26.62708 days 2.033914 NaN days NA
5: LEAM 2.897222 days NA 10.11597 days NA NaN days NA
Libraries: data.table, lubridate
Data:
> dput(d1)
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15200L, 15201L, 15199L, 15177L, 15178L, 15204L, 15205L
), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20",
"2012-07-29", "2012-08-07", "2012-08-08", "2012-08-09", "2012-08-13",
"2012-08-13", "2012-08-25", "2012-08-27", "2012-09-04", "2012-09-09",
"2012-09-11", "2012-09-14", "2012-09-23", "2012-09-26", "2012-09-27"
), km = c(80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L,
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC", "DCC",
"DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX",
"MAMO", "MICRO", "MICRO", "PRLO", "MICRO", "PRLO", "MICRO", "MICRO",
"LEAM", "MICRO", "MUXX", "MICRO", "MICRO", "MUXX", "MICRO", "MICRO",
"MICRO"), time = c("2:19", "23:23", "11:38", "22:19", "23:03",
"2:04", "23:56", "23:06", "0:04", "0:46", "0:51", "22:23", "3:38",
"21:08", "0:40", "2:55", "22:09", "20:46", "3:20"), pp = c(0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0)), class = "data.frame", .Names = c("ID",
"date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA,
-19L))
Edit:
I'm not really sure about mixing tidyverse and data.table ideologies, but you could potentially do what you described in comments using do. For example, make a modified version of df:
d1 <- as.data.table(d1)
d1$datetime <- as.POSIXct(strptime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M'))
d1Mod <- d1
d1Mod$km[10:nrow(d1Mod)] <- 90
Then, define the data.table bit as a function:
foo <- function(df_) {
df_$class.class <- df_[df_, difftime(x.datetime, i.datetime, units = "days"),
on = .(datetime > datetime, pp == pp), mult = "first"]
df_$prey.pred <-df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp > pp ), mult = "first"]
df_$pred.prey <- df_[df_, x.datetime - i.datetime,
on = .(datetime > datetime, pp < pp), mult = "first"]
return(df_)
}
Running d1 %>% group_by(km) %>% do(foo(as.data.table(.))) gets you the same output as in the original answer above (since all km values are 80). If you run it on the modified d1Mod you get an output that looks like it has been grouped by km:
> d1Mod %>%
+ group_by(km) %>%
+ do(foo(as.data.table(.)))
# A tibble: 19 x 12
# Groups: km [2]
ID date km culv.id type sps time pp datetime class.class prey.pred pred.prey
<int> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dttm> <time> <time> <time>
1 1 2012-06-19 80 A DCC MICRO 2:19 0 2012-06-19 02:19:00 26.3881944 days 2.877778 days NA days
2 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23:00 NA days NA days 23.510417 days
3 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38:00 5.4451389 days NA days NA days
4 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19:00 9.0305556 days NA days NA days
5 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03:00 8.1256944 days NA days NA days
6 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04:00 1.9111111 days NA days NA days
7 9 2012-08-08 80 A DCC MICRO 23:56 0 2012-08-08 23:56:00 0.9652778 days NA days NA days
8 10 2012-08-09 80 A DCC PRLO 23:06 0 2012-08-09 23:06:00 3.0402778 days NA days NA days
9 11 2012-08-13 80 A DCC MICRO 0:04 0 2012-08-13 00:04:00 NA days NA days NA days
10 12 2012-08-13 90 A DCC MICRO 0:46 0 2012-08-13 00:46:00 12.0034722 days 22.119444 days NA days
11 13 2012-08-25 90 A DCC LEAM 0:51 0 2012-08-25 00:51:00 2.8972222 days 10.115972 days NA days
12 14 2012-08-27 90 A DCC MICRO 22:23 0 2012-08-27 22:23:00 12.9479167 days 7.218750 days NA days
13 15200 2012-09-04 90 A DCC MUXX 3:38 1 2012-09-04 03:38:00 9.9701389 days NA days 5.729167 days
14 15201 2012-09-09 90 A DCC MICRO 21:08 0 2012-09-09 21:08:00 1.1472222 days 4.240972 days NA days
15 15199 2012-09-11 90 A DCC MICRO 0:40 0 2012-09-11 00:40:00 12.8951389 days 3.093750 days NA days
16 15177 2012-09-14 90 A DCC MUXX 2:55 1 2012-09-14 02:55:00 NA days NA days 9.801389 days
17 15178 2012-09-23 90 A DCC MICRO 22:09 0 2012-09-23 22:09:00 2.9423611 days NA days NA days
18 15204 2012-09-26 90 A DCC MICRO 20:46 0 2012-09-26 20:46:00 0.2736111 days NA days NA days
19 15205 2012-09-27 90 A DCC MICRO 3:20 0 2012-09-27 03:20:00 NA days NA days NA days
However, you'll have to do some checking to make sure that the calculations are actually doing what you need. I don't have example output or actual km/year info to check these results against (read: I don't know what I'm looking at!).
Note also that I think arrange is irrelevant for the operations here, considering that the datetime gets sorted in the function.
I'll use the sample posted in the comments as an example:
d1 = structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 8L), date = c("2012-06-19", "2012-06-21", "2012-07-15", "2012-07-20", "2012-07-29", "2012-08-07" ), km = c(80L, 80L, 80L, 80L, 80L, 80L), culv.id = c("A", "A", "A", "A", "A", "A"), type = c("DCC", "DCC", "DCC", "DCC", "DCC", "DCC"), sps = c("MICRO", "MUXX", "MAMO", "MICRO", "MICRO", "PRLO" ), time = c("2:19", "23:23", "11:38", "22:19", "23:03", "2:04" ), pp = c(0, 1, 0, 0, 0, 0)), .Names = c("ID", "date", "km", "culv.id", "type", "sps", "time", "pp"), row.names = c(NA, 6L ), class = "data.frame")
We add the datetime column just as you specified:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE)
First, add a column recording the class (prey or predator) of the previous observation, and another with the time between consecutive observations (we remove the first row because it has no previous observation). Note that timedif is a numeric value giving the number of days. The seque label is the current pp followed by the previous pp, so "10" means a predator seen after a prey.
d1 = d1 %>% mutate(prev = lag(pp))
d1 = d1 %>% mutate(timedif = as.numeric(difftime(as.POSIXct(datetime), lag(as.POSIXct(datetime)), units = "days")))
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
At this point, your table looks like
> d1
ID date km culv.id type sps time pp datetime prev timedif seque
1 2 2012-06-21 80 A DCC MUXX 23:23 1 2012-06-21 23:23 0 2.877778 10
2 3 2012-07-15 80 A DCC MAMO 11:38 0 2012-07-15 11:38 1 23.510417 01
3 4 2012-07-20 80 A DCC MICRO 22:19 0 2012-07-20 22:19 0 5.445139 00
4 5 2012-07-29 80 A DCC MICRO 23:03 0 2012-07-29 23:03 0 9.030556 00
5 8 2012-08-07 80 A DCC PRLO 2:04 0 2012-08-07 02:04 0 8.125694 00
After that, just take the wanted statistics for each group by using
avg = d1 %>% group_by(seque) %>% summarise(mean(timedif))
sdevs = d1 %>% group_by(seque) %>% summarise(sd(timedif))
We obtain
>avg
# A tibble: 3 x 2
seque `mean(timedif)`
<fctr> <dbl>
1 00 7.533796
2 01 23.510417
3 10 2.877778
> sdevs
# A tibble: 3 x 2
seque `sd(timedif)`
<fctr> <dbl>
1 00 1.864554
2 01 NA
3 10 NA
Note that the standard deviation is not computed because we only have one observation in the sample dataset for these categories.
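For reference, the same consecutive-observation calculation can be sketched in base R with readable labels in place of the "10"/"01" codes. The label strings are illustrative choices, and units = "days" is set explicitly so the result does not depend on the automatically chosen difftime unit.

```r
dt <- as.POSIXct(
  c("2012-06-19 02:19", "2012-06-21 23:23", "2012-07-15 11:38",
    "2012-07-20 22:19", "2012-07-29 23:03", "2012-08-07 02:04"),
  format = "%Y-%m-%d %H:%M", tz = "UTC"
)
pp <- c(0, 1, 0, 0, 0, 0)

n <- length(dt)
class_name <- ifelse(pp == 1, "predator", "prey")

pairs <- data.frame(
  # previous observation first, current observation second
  seque   = paste(class_name[-n], class_name[-1], sep = "-"),
  timedif = as.numeric(difftime(dt[-1], dt[-n], units = "days"))
)

avg   <- aggregate(timedif ~ seque, data = pairs, FUN = mean)
sdevs <- aggregate(timedif ~ seque, data = pairs, FUN = sd)
avg
```

Here "prey-predator" corresponds to the "10" code above and "predator-prey" to "01".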
I have a data frame that looks as the following:
system Id initial final
665 9 16001 6070 6071
683 10 16001 6100 6101
696 11 16001 6101 6113
712 10 16971 6150 6151
715 11 16971 6151 6163
4966 7 4118 10238 10242
5031 9 4118 10260 10278
5088 10 4118 10279 10304
5115 11 4118 10305 10317
structure(list(system = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L,
11L), Id = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L,
4118L, 4118L), initial = c(6070, 6100, 6101, 6150, 6151, 10238,
10260, 10279, 10305), final = c(6071, 6101, 6113, 6151, 6163,
10242, 10278, 10304, 10317)), .Names = c("system", "Id", "initial",
"final"), row.names = c(665L, 683L, 696L, 712L, 715L, 4966L,
5031L, 5088L, 5115L), class = "data.frame")
I would like to get a new data frame with the next structure
Id system length initial final
1 16001 9,10,11 3 6070 6113
2 16971 10,11 2 6150 6163
3 4118 7 1 10238 10242
4 4118 9,10,11 3 10260 10317
structure(list(Id = c(16001L, 16971L, 4118L, 4118L), system = structure(c(3L,
1L, 2L, 3L), .Label = c("10,11", "7", "9,10,11"), class = "factor"),
length = c(3L, 2L, 1L, 3L), initial = c(6070L, 6150L, 10238L,
10260L), final = c(6113, 6163, 10242, 10317)), .Names = c("Id",
"system", "length", "initial", "final"), class = "data.frame", row.names = c(NA,
-4L))
Rows should be grouped by Id, and within each Id by runs in which the "system" field increases by exactly one from row to row. I would also like to list the distinct "system" values in each group and how many of them there are. Finally, I need a column with the first "initial" and another with the last "final" of each group.
Is it possible to do that in R?
Thanks.
You could use data.table. Convert "data.frame" to "data.table" (setDT), create a grouping variable "indx" by taking the difference of adjacent elements of "system" (diff(system)), cumsum the logical vector, use "Id" and "indx" as grouping variable to get the statistics.
library(data.table)
setDT(df)[,list(system=toString(system), length=.N, initial=initial[1L],
final=final[.N]), by=list(Id,indx=cumsum(c(TRUE, diff(system)!=1)))][,
indx:=NULL][]
# Id system length initial final
#1: 16001 9, 10, 11 3 6070 6113
#2: 16971 10, 11 2 6150 6163
#3: 4118 7 1 10238 10242
#4: 4118 9, 10, 11 3 10260 10317
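The grouping trick is easier to see in isolation. A small base R sketch on the "system" column of the sample data: a new run starts wherever the value did not increase by exactly 1, and cumsum() turns those starts into run numbers.

```r
system <- c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L, 11L)

# TRUE marks the first row of a new run
new_run <- c(TRUE, diff(system) != 1)

# Running count of run starts gives one id per run
indx <- cumsum(new_run)
indx
# 1 1 1 2 2 3 4 4 4

# The systems that fall into each run
runs <- unname(as.vector(tapply(system, indx, toString)))
runs
# "9, 10, 11" "10, 11" "7" "9, 10, 11"
```

On its own, indx would merge two runs that happen to continue across an Id boundary, which is why Id is used as a grouping variable as well.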
Or, based on @jazzurro's comment about using the first/last functions from dplyr,
library(dplyr)
df %>%
group_by(indx=cumsum(c(TRUE, diff(system)!=1)), Id) %>%
summarise(system=toString(system), length=n(),
initial=first(initial), final=last(final))
A solution without data.table, but plyr:
library(plyr)
func = function(subdf)
{
bool = c(diff(subdf$system),1)==1
ldply(split(subdf, bool), function(u){
data.frame(system = paste(u$system, collapse=','),
Id = unique(u$Id),
length = nrow(u),
initial= head(u,1)$initial,
final = tail(u,1)$final)
})
}
ldply(split(df, df$Id), func)
# .id system length Id initial final
#1 FALSE 7 1 4118 10238 10242
#2 TRUE 9,10,11 3 4118 10260 10317
#3 TRUE 9,10,11 3 16001 6070 6113
#4 TRUE 10,11 2 16971 6150 6163
I have a list of items with 2 dates (start date and end date) and duration in days (end date - start date). I want to cut them into bins to show the number of "active items" in each bin, i.e. if start date <= bin date and end date > bin date, the item should be counted in the bin.
Item StartDate EndDate Duration
Machine1 2005/01/21 2011/03/29 2258
Machine2 2004/05/12 2012/05/08 2918
Machine3 2004/10/15 2005/09/10 330
Machine4 2004/08/30 2011/08/02 2528
Machine5 2005/06/06 2010/12/03 2006
Machine6 2004/05/11 2007/03/17 1040
Machine7 2005/08/09 2011/05/30 2120
Machine8 2005/01/06 2012/06/07 2709
Machine9 2005/06/13 2008/08/28 1172
Machine10 2005/06/28 2010/04/08 1745
Machine11 2004/11/09 2007/05/14 916
Machine12 2005/05/26 2012/09/16 2670
Machine13 2004/05/28 2009/06/09 1838
Machine14 2005/01/06 2012/05/25 2696
Machine15 2005/08/20 2012/02/11 2366
Machine16 2004/08/02 2011/10/23 2638
Machine17 2004/08/10 2009/03/15 1678
Machine18 2005/05/08 2006/04/17 344
Machine19 2005/08/26 2006/07/24 332
Machine20 2004/03/30 2006/05/07 768
Bin counts that I want to produce:
2004/01/01 0
2005/01/01 9
2006/01/01 19
2007/01/01 16
2008/01/01 14
2009/01/01 13
2010/01/01 11
2011/01/01 9
2012/01/01 5
2013/01/01 0
As you can see, the totals of the bins do not add up to the total number of items, as you would expect with a traditional histogram.
I can do this with some verbose code, but I'm sure there must be some short way, using cut or split. I'm aware that the bin labels are off by one according to my definition above, but let's ignore that for now.
One way is:
#turn dates to actual dates
DF$StartDate <- as.Date(DF$StartDate, "%Y/%m/%d")
DF$EndDate <- as.Date(DF$EndDate, "%Y/%m/%d")
binDF[,1] <- as.Date(binDF[,1], "%Y/%m/%d")
counts <- colSums(sapply(binDF[,1], function(x) {DF$StartDate <= x & DF$EndDate > x}))
#> counts
#[1] 0 9 19 16 14 13 11 9 5 0
And as a complete dataframe:
resDF <- data.frame(dates = binDF[,1], counts = counts, stringsAsFactors = F)
#> resDF
# dates counts
#1 2004-01-01 0
#2 2005-01-01 9
#3 2006-01-01 19
#4 2007-01-01 16
#5 2008-01-01 14
#6 2009-01-01 13
#7 2010-01-01 11
#8 2011-01-01 9
#9 2012-01-01 5
#10 2013-01-01 0
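If no binDF exists yet, the bin dates can be generated with seq(). A small sketch on three of the machines (Machine3, Machine6 and Machine20), keeping the same "start <= bin and end > bin" rule:

```r
start <- as.Date(c("2004/10/15", "2004/05/11", "2004/03/30"), "%Y/%m/%d")
end   <- as.Date(c("2005/09/10", "2007/03/17", "2006/05/07"), "%Y/%m/%d")

# One bin date per year, 2004-01-01 to 2008-01-01
bins <- seq(as.Date("2004-01-01"), as.Date("2008-01-01"), by = "year")

# For each bin date, count items that have started and not yet ended
counts <- vapply(
  seq_along(bins),
  function(i) sum(start <= bins[i] & end > bins[i]),
  integer(1)
)

data.frame(dates = bins, counts = counts)
# counts: 0 3 2 1 0
```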
The dataframes DF and binDF:
DF <- structure(list(Item = c("Machine1", "Machine2", "Machine3", "Machine4",
"Machine5", "Machine6", "Machine7", "Machine8", "Machine9", "Machine10",
"Machine11", "Machine12", "Machine13", "Machine14", "Machine15",
"Machine16", "Machine17", "Machine18", "Machine19", "Machine20"
), StartDate = c("2005/01/21", "2004/05/12", "2004/10/15", "2004/08/30",
"2005/06/06", "2004/05/11", "2005/08/09", "2005/01/06", "2005/06/13",
"2005/06/28", "2004/11/09", "2005/05/26", "2004/05/28", "2005/01/06",
"2005/08/20", "2004/08/02", "2004/08/10", "2005/05/08", "2005/08/26",
"2004/03/30"), EndDate = c("2011/03/29", "2012/05/08", "2005/09/10",
"2011/08/02", "2010/12/03", "2007/03/17", "2011/05/30", "2012/06/07",
"2008/08/28", "2010/04/08", "2007/05/14", "2012/09/16", "2009/06/09",
"2012/05/25", "2012/02/11", "2011/10/23", "2009/03/15", "2006/04/17",
"2006/07/24", "2006/05/07"), Duration = c(2258L, 2918L, 330L,
2528L, 2006L, 1040L, 2120L, 2709L, 1172L, 1745L, 916L, 2670L,
1838L, 2696L, 2366L, 2638L, 1678L, 344L, 332L, 768L)), .Names = c("Item",
"StartDate", "EndDate", "Duration"), class = "data.frame", row.names = c(NA,
-20L))
binDF <- structure(list(V1 = c("2004/01/01", "2005/01/01", "2006/01/01",
"2007/01/01", "2008/01/01", "2009/01/01", "2010/01/01", "2011/01/01",
"2012/01/01", "2013/01/01"), V2 = c(0L, 9L, 19L, 16L, 14L, 13L,
11L, 9L, 5L, 0L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-10L))
I have a huge data file in long format; part of it is supplied below. Each ID can have several rows, where status is the final status. However, I need to do the analysis with time-varying covariates and hence need to create two new time variables and update the status variable. I've been struggling with this for some time now and cannot figure out how to do it efficiently, as there can be as many as four rows per ID. The time-varying variable is NUM.AFTER.DIAG. If NUM.AFTER.DIAG==0 then it is easy: time1=0 and time2=STATUSDATE. However, when NUM.AFTER.DIAG==1 I need to make a new row where time1=0, time2=DOB-DATE.DIAG and NUM.AFTER.DIAG=0, and also make sure STATUS="B". The second row would then have time1 equal to time2 from the previous row and time2=STATUSDATE-DATE.DIAG-time1 from this row. Likewise, if there are more rows, the successive rows need to be subtracted from each other. Also, if NUM.AFTER.DIAG==0 but there are multiple rows, all extra rows can be deleted.
Any ideas for an efficient solution to this?
I've looked at John Fox's unfold command, but it assumes that all the intervals are in wide format to begin with.
Edit: The table as requested. As for the censor variable: "D"=event (death)
structure(list(ID = c(187L, 258L, 265L, 278L, 281L, 281L, 283L,
283L, 284L, 291L, 292L, 292L, 297L, 299L, 305L, 305L, 311L, 311L,
319L, 319L, 319L, 322L, 322L, 329L, 329L, 333L, 333L, 333L, 334L,
334L), STATUS = c("D", "B", "B", "B", "B", "B", "D", "D", "B",
"B", "B", "B", "D", "D", "D", "D", "B", "B", "B", "B", "B", "D",
"D", "B", "B", "D", "D", "D", "D", "D"), STATUSDATE = structure(c(11153,
15034, 15034, 15034, 15034, 15034, 5005, 5005, 15034, 15034,
15034, 15034, 6374, 5005, 7562, 7562, 15034, 15034, 15034, 15034,
15034, 7743, 7743, 15034, 15034, 4670, 4670, 4670, 5218, 5218
), class = "Date"), DATE.DIAG = structure(c(4578, 4609, 4578,
4487, 4670, 4670, 4517, 4517, 4640, 4213, 4397, 4397, 4397, 4487,
4213, 4213, 4731, 4731, 4701, 4701, 4701, 4397, 4397, 4578, 4578,
4275, 4275, 4275, 4456, 4456), class = "Date"), DOB = structure(c(NA,
13010, NA, NA, -1082, -626, 73, 1353, 13679, NA, 1626, 3087,
-626, -200, 2814, 3757, 1930, 3787, 6740, 13528, 14167, 5462,
6557, 7865, 9235, -901, -504, -108, -535, -78), class = "Date"),
NUM.AFTER.DIAG = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 2, 3, 1, 2, 1, 2, 0, 0, 0, 0, 0)), .Names = c("ID",
"STATUS", "STATUSDATE", "DATE.DIAG", "DOB", "NUM.AFTER.DIAG"), row.names = c(NA,
30L), class = "data.frame")
EDIT: I did come up with a solution, although probably not very efficient.
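library(plyr)  # provides ddply(); p is the data frame shown above
u1 <- ddply(p, .(ID), function(x) {
  if (all(x$NUM.AFTER.DIAG == 0)) {
    # nothing changes after diagnosis: keep a single row for the ID
    x$time1 <- 0
    x$time2 <- as.numeric(x$STATUSDATE - x$DATE.DIAG)
    x <- x[1, ]
  } else {
    # duplicate the first row so there is an interval starting at 0
    x <- rbind(x, x[1, ])
    x <- x[order(x$DOB), ]
    u <- max(x$NUM.AFTER.DIAG)
    x$NUM.AFTER.DIAG <- 0:u
    # create both columns at full length before filling parts of them
    x$time1 <- 0
    x$time2 <- NA_real_
    x$time2[1:u] <- as.numeric(x$DOB[2:(u + 1)] - x$DATE.DIAG[2:(u + 1)])
    x$time2[u + 1] <- as.numeric(x$STATUSDATE[u] - x$DATE.DIAG[u])
    x$time1[2:(u + 1)] <- x$time2[1:u]
    x$STATUS[1:u] <- "B"
  }
  x
})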
OK, I've tried something, but I'm not sure I understand your transformation process entirely, so let me know if there are any mistakes. In general, ddply will be slow (even with .parallel = TRUE) when there are many individuals, mainly because at the end it has to bring the data sets of all individuals together and rbind (or rbind.fill) them, which takes a long time for a large number of data.frame objects.
So here's a suggestion, where dat.orig is your toy data set:
I would first split the task in two:
1) NUM.AFTER.DIAG == 0
2) NUM.AFTER.DIAG != 0
1) If NUM.AFTER.DIAG == 0, there is not much to do apart from calculating time2 and keeping only the first row when an ID occurs more than once (like ID 333):
## erase multiple occurrences
dat <- dat.orig[!(duplicated(dat.orig$ID) & dat.orig$NUM.AFTER.DIAG == 0), ]
dat0 <- dat[dat$NUM.AFTER.DIAG == 0, ]
dat0$time1 <- 0
dat0$time2 <- difftime(dat0$STATUSDATE, dat0$DATE.DIAG, units = "days")
time.na <- is.na(dat0$DOB)
dat0$time1[time.na] <- dat0$time2[time.na] <- NA
> dat0
    ID STATUS STATUSDATE  DATE.DIAG        DOB NUM.AFTER.DIAG time1      time2
1  187      D 2000-07-15 1982-07-15       <NA>              0    NA    NA days
3  265      B 2011-03-01 1982-07-15       <NA>              0    NA    NA days
4  278      B 2011-03-01 1982-04-15       <NA>              0    NA    NA days
5  281      B 2011-03-01 1982-10-15 1967-01-15              0     0 10364 days
7  283      D 1983-09-15 1982-05-15 1970-03-15              0     0   488 days
10 291      B 2011-03-01 1981-07-15       <NA>              0    NA    NA days
11 292      B 2011-03-01 1982-01-15 1974-06-15              0     0 10637 days
13 297      D 1987-06-15 1982-01-15 1968-04-15              0     0  1977 days
14 299      D 1983-09-15 1982-04-15 1969-06-15              0     0   518 days
15 305      D 1990-09-15 1981-07-15 1977-09-15              0     0  3349 days
17 311      B 2011-03-01 1982-12-15 1975-04-15              0     0 10303 days
26 333      D 1982-10-15 1981-09-15 1967-07-15              0     0   395 days
29 334      D 1984-04-15 1982-03-15 1968-07-15              0     0   762 days
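To see what the duplicated() filter in part 1) does, here is a small sketch on made-up data (the data frame toy and its values are hypothetical). A row is dropped only when its ID has already appeared and its NUM.AFTER.DIAG is 0, so repeated rows of an ID with NUM.AFTER.DIAG > 0 are kept.

```r
# Toy data, invented for this illustration: ID 1 has a redundant second row,
# IDs 2 and 3 have rows that must be kept.
toy <- data.frame(ID = c(1, 1, 2, 3, 3),
                  NUM.AFTER.DIAG = c(0, 0, 1, 1, 2))

# TRUE for rows that repeat an earlier ID and have NUM.AFTER.DIAG == 0
drop <- duplicated(toy$ID) & toy$NUM.AFTER.DIAG == 0

toy_kept <- toy[!drop, ]
print(toy_kept)  # rows 1, 3, 4 and 5 remain
```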
2) is a little trickier, but all you actually have to do is insert one more row and calculate the time variables:
## create subset with relevant observations
dat.unfold <- dat[dat$NUM.AFTER.DIAG != 0, ]
## compute time differences
time1 <- difftime(dat.unfold$DOB, dat.unfold$DATE.DIAG, units = "days")
time1[time1 < 0] <- 0
time2 <- difftime(dat.unfold$STATUSDATE, dat.unfold$DATE.DIAG, units = "days")
## calculate indices for individuals (daply() comes from the plyr package)
n.obs <- daply(dat.unfold, .(ID), function(z) max(z$NUM.AFTER.DIAG) + 1)
df.new <- data.frame(ID = rep(unique(dat.unfold$ID), times = n.obs))
rle.new <- rle(df.new$ID)
ind.last <- cumsum(rle.new$lengths)
ind.first <- !duplicated(df.new$ID)
ind.first.w <- which(ind.first)
ind.second <- ind.first.w + 1
ind2.to.last <- unlist(sapply(seq_along(ind.second),
function(z) ind.second[z]:ind.last[z]))
## insert time variables
df.new$time2 <- df.new$time1 <- NA
df.new$time1[ind.first] <- 0
df.new$time1[!ind.first] <- time1
df.new$time2[!ind.first] <- time2
df.new$time2[ind2.to.last - 1] <- time1
This gives me:
> df.new
    ID time1 time2
1  258     0  8401
2  258  8401 10425
3  284     0  9039
4  284  9039 10394
5  319     0  2039
6  319  2039  8827
7  319  8827  9466
8  319  9466 10333
9  322     0  1065
10 322  1065  2160
11 322  2160  3346
12 329     0  3287
13 329  3287  4657
14 329  4657 10456
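The index bookkeeping is easier to follow on two individuals only. The sketch below replays the same steps for IDs 258 and 319, with the time differences typed in by hand from the output above. The object name toy.new is hypothetical, and lapply is used in place of sapply so the result is always a list before unlist.

```r
ids   <- c(258, 319)
n.obs <- c(2, 4)                        # max(NUM.AFTER.DIAG) + 1 per ID
time1 <- c(8401, 2039, 8827, 9466)      # DOB - DATE.DIAG, one per original row
time2 <- c(10425, 10333, 10333, 10333)  # STATUSDATE - DATE.DIAG, one per original row

toy.new      <- data.frame(ID = rep(ids, times = n.obs))
ind.first    <- !duplicated(toy.new$ID)  # first (inserted) row of each ID
ind.last     <- cumsum(n.obs)            # last row of each ID
ind.second   <- which(ind.first) + 1     # second row of each ID
ind2.to.last <- unlist(lapply(seq_along(ind.second),
                              function(z) ind.second[z]:ind.last[z]))

toy.new$time1 <- NA_real_
toy.new$time2 <- NA_real_
toy.new$time1[ind.first]  <- 0      # every ID starts at 0
toy.new$time1[!ind.first] <- time1  # later rows start at the change times
toy.new$time2[!ind.first] <- time2  # provisional end: the status date
toy.new$time2[ind2.to.last - 1] <- time1  # each row ends where the next starts
print(toy.new)
```

Only the last row of each ID keeps STATUSDATE - DATE.DIAG as its end time; every earlier row is overwritten with the start time of the row after it.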
This should work for the STATUS variable and the other variables in a similar fashion.
Once both steps work separately, you just need one rbind step at the end.
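A minimal sketch of that final step, assuming both parts have been reduced to the same columns and that the time variables are stored as plain numbers (for example via as.numeric on the difftime values). The object names part0, part1 and combined are hypothetical; the values are a few rows copied from the outputs above.

```r
# A few rows from step 1 (NUM.AFTER.DIAG == 0) and step 2 (unfolded IDs)
part0 <- data.frame(ID = c(281, 283), time1 = c(0, 0), time2 = c(10364, 488))
part1 <- data.frame(ID = c(258, 258), time1 = c(0, 8401), time2 = c(8401, 10425))

# A single rbind of two data frames, then restore a sensible row order
combined <- rbind(part0, part1)
combined <- combined[order(combined$ID, combined$time1), ]
rownames(combined) <- NULL
print(combined)
```

Because rbind is called once on two data frames, rather than once per individual, the cost that makes ddply slow does not arise here.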