I have a huge data file in long format; parts of it are supplied below. Each ID can have several rows, and STATUS is the final status. However, I need to do the analysis with time-varying covariates, so I have to create two new time variables and update the status variable. I've been struggling with this for some time now and cannot figure out how to do it efficiently, as there can be as many as four rows per ID.
The time-varying variable is NUM.AFTER.DIAG. If NUM.AFTER.DIAG == 0 then it is easy: time1 = 0 and time2 = STATUSDATE - DATE.DIAG. However, when NUM.AFTER.DIAG == 1, I need to make a new row where time1 = 0, time2 = DOB - DATE.DIAG and NUM.AFTER.DIAG = 0, and also make sure STATUS = "B". The second row would then have time1 = time2 from the previous row and time2 = STATUSDATE - DATE.DIAG. Equally, if there are more rows, the successive rows' dates need to be subtracted from each other. Also, if NUM.AFTER.DIAG == 0 but there are multiple rows, all extra rows can be deleted.
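To illustrate with ID 258 from the data below (a single NUM.AFTER.DIAG == 1 row; DOB - DATE.DIAG = 8401 days, STATUSDATE - DATE.DIAG = 10425 days), the result I am after is two rows: time1 = 0, time2 = 8401 with STATUS = "B", followed by time1 = 8401, time2 = 10425.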
Any ideas for an efficient solution to this?
I've looked at John Fox's unfold command, but it assumes that all the intervals are in wide format to begin with.
Edit: The table, as requested. As for the censor variable: "D" = event (death).
structure(list(ID = c(187L, 258L, 265L, 278L, 281L, 281L, 283L,
283L, 284L, 291L, 292L, 292L, 297L, 299L, 305L, 305L, 311L, 311L,
319L, 319L, 319L, 322L, 322L, 329L, 329L, 333L, 333L, 333L, 334L,
334L), STATUS = c("D", "B", "B", "B", "B", "B", "D", "D", "B",
"B", "B", "B", "D", "D", "D", "D", "B", "B", "B", "B", "B", "D",
"D", "B", "B", "D", "D", "D", "D", "D"), STATUSDATE = structure(c(11153,
15034, 15034, 15034, 15034, 15034, 5005, 5005, 15034, 15034,
15034, 15034, 6374, 5005, 7562, 7562, 15034, 15034, 15034, 15034,
15034, 7743, 7743, 15034, 15034, 4670, 4670, 4670, 5218, 5218
), class = "Date"), DATE.DIAG = structure(c(4578, 4609, 4578,
4487, 4670, 4670, 4517, 4517, 4640, 4213, 4397, 4397, 4397, 4487,
4213, 4213, 4731, 4731, 4701, 4701, 4701, 4397, 4397, 4578, 4578,
4275, 4275, 4275, 4456, 4456), class = "Date"), DOB = structure(c(NA,
13010, NA, NA, -1082, -626, 73, 1353, 13679, NA, 1626, 3087,
-626, -200, 2814, 3757, 1930, 3787, 6740, 13528, 14167, 5462,
6557, 7865, 9235, -901, -504, -108, -535, -78), class = "Date"),
NUM.AFTER.DIAG = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 2, 3, 1, 2, 1, 2, 0, 0, 0, 0, 0)), .Names = c("ID",
"STATUS", "STATUSDATE", "DATE.DIAG", "DOB", "NUM.AFTER.DIAG"), row.names = c(NA,
30L), class = "data.frame")
EDIT: I did come up with a solution, although probably not a very efficient one.
library(plyr)

u1 <- ddply(p, .(ID), function(x) {
  if (all(x$NUM.AFTER.DIAG == 0)) {
    ## single interval from diagnosis to status date; keep only the first row
    x$time1 <- 0
    x$time2 <- x$STATUSDATE - x$DATE.DIAG
    x <- x[1, ]
  } else {
    ## duplicate the first row, then fill in interval boundaries from the DOB values
    x <- rbind(x, x[1, ])
    x <- x[order(x$DOB), ]
    u <- max(x$NUM.AFTER.DIAG)
    x$NUM.AFTER.DIAG <- 0:u
    x$time1[1] <- 0
    x$time2[1:u] <- x$DOB[2:(u + 1)] - x$DATE.DIAG[2:(u + 1)]
    x$time2[u + 1] <- x$STATUSDATE[u] - x$DATE.DIAG[u]
    x$time1[2:(u + 1)] <- x$time2[1:u]
    x$STATUS[1:u] <- "B"
  }
  x
})
Ok, I've tried something, but I'm not sure I understand your transformation process entirely, so let me know if there are any mistakes. In general, ddply will be slow (even when .parallel = TRUE) when there are many individuals, mainly because at the end it has to bring the data sets of all individuals together and rbind (or rbind.fill) them, which takes forever for a multitude of data.frame objects.
So here's a suggestion, where dat.orig is your toy data set:
I would first split the task in two:
1) NUM.AFTER.DIAG == 0
2) NUM.AFTER.DIAG > 0
1) It seems that if NUM.AFTER.DIAG == 0, apart from calculating time2 and extracting the first row when an ID occurs more than once (like ID 333), there is not much to do in part 1):
library(plyr)  # for daply() used in part 2

## erase multiple occurrences
dat <- dat.orig[!(duplicated(dat.orig$ID) & dat.orig$NUM.AFTER.DIAG == 0), ]
dat0 <- dat[dat$NUM.AFTER.DIAG == 0, ]
dat0$time1 <- 0
dat0$time2 <- difftime(dat0$STATUSDATE, dat0$DATE.DIAG, units = "days")
time.na <- is.na(dat0$DOB)
dat0$time1[time.na] <- dat0$time2[time.na] <- NA
> dat0
ID STATUS STATUSDATE DATE.DIAG DOB NUM.AFTER.DIAG time1 time2
1 187 D 2000-07-15 1982-07-15 <NA> 0 NA NA days
3 265 B 2011-03-01 1982-07-15 <NA> 0 NA NA days
4 278 B 2011-03-01 1982-04-15 <NA> 0 NA NA days
5 281 B 2011-03-01 1982-10-15 1967-01-15 0 0 10364 days
7 283 D 1983-09-15 1982-05-15 1970-03-15 0 0 488 days
10 291 B 2011-03-01 1981-07-15 <NA> 0 NA NA days
11 292 B 2011-03-01 1982-01-15 1974-06-15 0 0 10637 days
13 297 D 1987-06-15 1982-01-15 1968-04-15 0 0 1977 days
14 299 D 1983-09-15 1982-04-15 1969-06-15 0 0 518 days
15 305 D 1990-09-15 1981-07-15 1977-09-15 0 0 3349 days
17 311 B 2011-03-01 1982-12-15 1975-04-15 0 0 10303 days
26 333 D 1982-10-15 1981-09-15 1967-07-15 0 0 395 days
29 334 D 1984-04-15 1982-03-15 1968-07-15 0 0 762 days
2) is a little trickier, but all you actually have to do is insert one more row and calculate the time variables:
## create subset with relevant observations
dat.unfold <- dat[dat$NUM.AFTER.DIAG != 0, ]
## compute time differences
time1 <- difftime(dat.unfold$DOB, dat.unfold$DATE.DIAG, units = "days")
time1[time1 < 0] <- 0
time2 <- difftime(dat.unfold$STATUSDATE, dat.unfold$DATE.DIAG, units = "days")
## calculate indices for individuals
n.obs <- daply(dat.unfold, .(ID), function(z) max(z$NUM.AFTER.DIAG) + 1)
df.new <- data.frame(ID = rep(unique(dat.unfold$ID), times = n.obs))
rle.new <- rle(df.new$ID)
ind.last <- cumsum(rle.new$lengths)
ind.first <- !duplicated(df.new$ID)
ind.first.w <- which(ind.first)
ind.second <- ind.first.w + 1
ind2.to.last <- unlist(sapply(seq_along(ind.second),
function(z) ind.second[z]:ind.last[z]))
## insert time variables
df.new$time2 <- df.new$time1 <- NA
df.new$time1[ind.first] <- 0
df.new$time1[!ind.first] <- time1
df.new$time2[!ind.first] <- time2
df.new$time2[ind2.to.last - 1] <- time1
this gives me:
> df.new
ID time1 time2
1 258 0 8401
2 258 8401 10425
3 284 0 9039
4 284 9039 10394
5 319 0 2039
6 319 2039 8827
7 319 8827 9466
8 319 9466 10333
9 322 0 1065
10 322 1065 2160
11 322 2160 3346
12 329 0 3287
13 329 3287 4657
14 329 4657 10456
This should work for the STATUS variable and the other variables in a similar fashion.
When both steps are working separately, you just have to do one rbind step at the end.
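For the toy data, that final combine might look like this minimal sketch (it keeps only the shared ID/time columns; in practice you would first carry STATUS and the other variables into df.new as just described):
dat0$time1 <- as.numeric(dat0$time1)
dat0$time2 <- as.numeric(dat0$time2)   # difftime -> numeric so the columns match
dat.final <- rbind(dat0[, c("ID", "time1", "time2")],
                   df.new[, c("ID", "time1", "time2")])
dat.final <- dat.final[order(dat.final$ID, dat.final$time1), ]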
I've been trying to combine the two for loops below into a single loop.
Loop 1:
Unique.Order.Comment <- unique(df2$Rebuilt.Order.Comment)
length(Unique.Order.Comment)
#loop for the calculations
for (i in 1:length(Unique.Order.Comment)) {
#a <- i-11
#c[i] <- print(sum(n.Cases.per.month$nCases[a:i]))
a <- subset.data.frame(Rebuilt.Data, Rebuilt.Order.Comment == Unique.Order.Comment[i])
assign(Unique.Order.Comment[i],a)
}
Loop 2:
#loop for the calculations
c <- rep(0, nrow(BR))
for (ii in 1:nrow(BR)) {
if (ii < 12){
print(0)
}else {
a <- ii-11
c[ii] <- print(sum(BR$Number.Cases.Authorised[a:ii]))
}
}
c <- data.frame(c)
c <- c %>%
rename(
n.Seen.Cum = c
)
#View(c)
BR <- cbind(BR,c)
The BR needs to be Unique.Order.Comment[i] in Loop 2.
What I believe/hope it would look like is shown below.
But I get the error message Error in rep(0, nrow(Unique.Order.Comment[i])) : invalid 'times' argument.
(What I think it should look like:)
Unique.Order.Comment <- unique(df2$Rebuilt.Order.Comment)
length(Unique.Order.Comment)
#loop for the calculations
for (i in 1:length(Unique.Order.Comment)) {
#a <- i-11
#c[i] <- print(sum(n.Cases.per.month$nCases[a:i]))
a <- subset.data.frame(Rebuilt.Data, Rebuilt.Order.Comment == Unique.Order.Comment[i])
assign(Unique.Order.Comment[i],a)
#loop for the calculations
c <- rep(0, nrow(Unique.Order.Comment[i]))
for (ii in 1:nrow(Unique.Order.Comment[i])) {
if (ii < 12){
print(0)
}else {
a <- ii-11
c[ii] <- print(sum(Unique.Order.Comment[i]$Number.Cases.Authorised[a:ii]))
}
}
c <- data.frame(c)
c <- c %>%
rename(
n.Seen.Cum = c
)
#View(c)
Unique.Order.Comment[i] <- cbind(Unique.Order.Comment[i],c)
}
Edit: example data:
dput(Unique.Order.Comment)
c("CN", "DM", "DR", "FF", "PG", "HN", "SK", "GI", "GYN", "BR",
"UR", "LYMPH", "HPB", "BST", "ENDOC", "PAEDGI", "CT", "PERI",
"NEURO", "MOHS", "ICC", "RE", "PAED", "MN", "EMR", "PR", "LBX",
"HAEM", "CTT", "UGI", "NEUR", "URGI", "GYNAE")
dput(head(Rebuilt.Data))
structure(list(Rebuilt.Order.Comment = c("BR", "BR", "BR", "BR",
"BR", "BR"), Period.Received = c("2019-01", "2019-02", "2019-03",
"2019-04", "2019-05", "2019-06"), Number.Cases.Received = c(838L,
730L, 778L, 832L, 574L, 626L), Number.Cases.Authorised = c(680L,
587L, 896L, 715L, 761L, 554L), Number.Cases.Authorised.Less7Days = c(550L,
343L, 520L, 389L, 393L, 374L), Number.Cases.Authorised.Less10.Days = c(628L,
475L, 723L, 595L, 555L, 474L), Percentage.Authorsied.Less7Days = c(0.808823529411765,
0.584327086882453, 0.580357142857143, 0.544055944055944, 0.516425755584757,
0.675090252707581), Percentage.Authorsied.Less10Days = c(0.923529411764706,
0.809199318568995, 0.806919642857143, 0.832167832167832, 0.729303547963206,
0.855595667870036), Avg.TaT.for.Authorised.Cases = structure(c(5.26470588235294,
8.74616695059625, 8.34709821428571, 8.09370629370629, 12.826544021025,
6.22021660649819), class = "difftime", units = "days"), MDM.Received = c(2L,
13L, 2L, NA, NA, 5L), MDM.Received.Avg.TAT = structure(c(5, 29.2307692307692,
0.5, NA, NA, 5.4), class = "difftime", units = "days"), So.Received = c(NA,
1L, NA, 1L, NA, 2L), So.Received.Avg.TAT = structure(c(NA, 14,
NA, 9, NA, 54), class = "difftime", units = "days")), row.names = c(NA,
6L), class = "data.frame")
If I place print(Unique.Order.Comment[i]) before the second, separate loop, I get:
"CN"
In theory, the first loop subsets the data based upon a unique list of Order.Comment values (which it does).
Then it computes a cumulative sum, which gets cbind()-ed onto the subsetted data.
First, it is easier to help if you provide a small example along with your expected output. You can share your original data with the columns that are not necessary to the question removed, or create a fake dataset which is similar to your original data.
Second, I think you are overcomplicating this. It is never a good idea to create multiple datasets in your global environment: they are very difficult to manage and they unnecessarily pollute the global environment. You can use lists instead.
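As a minimal, hypothetical sketch of that list pattern (object names taken from your code):
# instead of assign(): one list, with one data frame per Rebuilt.Order.Comment
data.list <- split(Rebuilt.Data, Rebuilt.Data$Rebuilt.Order.Comment)
# run any per-group computation with lapply, then recombine once at the end
data.list <- lapply(data.list, function(d) transform(d, n.Rows = nrow(d)))
result <- do.call(rbind, data.list)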
In this case I don't think we even need to split the data into separate list elements, as there are packages that can perform rolling calculations by group. For example, below I have used the zoo package, which has the rollsumr function.
library(dplyr)
library(zoo)
df <- df %>%
group_by(Rebuilt.Order.Comment) %>%
mutate(n.Seen.Cum = rollsumr(Number.Cases.Authorised, 12, fill = 0)) %>%
ungroup
df
# Rebuilt.Order.Comment Period.Received Number.Cases.Authorised n.Seen.Cum
# <chr> <chr> <int> <int>
# 1 BR 2019-01 680 0
# 2 BR 2019-02 587 0
# 3 BR 2019-03 896 0
# 4 BR 2019-04 715 0
# 5 BR 2019-05 761 0
# 6 BR 2019-06 554 0
# 7 BR 2019-07 843 0
# 8 BR 2019-08 815 0
# 9 BR 2019-09 704 0
#10 BR 2019-10 939 0
#11 BR 2019-11 834 0
#12 BR 2019-12 880 9208
#13 BR 2020-01 801 9329
#14 BR 2020-02 610 9352
#15 BR 2020-03 853 9309
I think I see what you are aiming for, but I may have missed something. Let me know, and I can edit.
From what I can tell, you only need one loop, and instead of assign()ing a bunch of dataframes, you can iteratively build a summary table.
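(As an aside, the error you hit comes from Unique.Order.Comment[i] being a character string rather than a data frame: nrow() of a string is NULL, and rep(0, NULL) fails with invalid 'times'.)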
Edit:
The other answer here is quite elegant! I'm updating my answer based on your new comments just for fun. Not sure why we have different n.Seen.Cum values...
df2 <- structure(list(
Rebuilt.Order.Comment = c("BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR" ),
Period.Received = c("2019-01", "2019-02", "2019-03", "2019-04", "2019-05", "2019-06", "2019-07", "2019-08", "2019-09", "2019-10", "2019-11", "2019-12", "2020-01", "2020-02", "2020-03"),
Number.Cases.Authorised = c(680L, 587L, 896L, 715L, 761L, 554L, 843L, 815L, 704L, 939L, 834L, 880L, 801L, 610L, 853L),
n.Seen.Cum = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9208, 9329, 9352, 9309)),
row.names = c(NA, 15L), class = "data.frame")
# This will hold results
output <- list()
# Loop over this vector
Unique.Order.Comment <- unique(df2$Rebuilt.Order.Comment)
for(comment in Unique.Order.Comment){
# Temporary dataframe that is subset of 'df2'
temp <- df2[df2$Rebuilt.Order.Comment == comment,]
# We can do arithmetic with dates that have days
temp$Period.Received2 <- as.Date(paste(temp$Period.Received, "-01", sep=""))
# Calculate cumsum after 333 days have passed
temp$n.Seen.cum2 <- ifelse(
test = temp$Period.Received2 - min(temp$Period.Received2) > 333,
yes = cumsum(temp$Number.Cases.Authorised),
no = NA)
# better
output[[comment]] <- temp
# quick and dirty
# assign(x = comment, value = temp)
}
output[[1]]
#> Rebuilt.Order.Comment Period.Received Number.Cases.Authorised n.Seen.Cum
#> 1 BR 2019-01 680 0
#> 2 BR 2019-02 587 0
#> 3 BR 2019-03 896 0
#> 4 BR 2019-04 715 0
#> 5 BR 2019-05 761 0
#> 6 BR 2019-06 554 0
#> 7 BR 2019-07 843 0
#> 8 BR 2019-08 815 0
#> 9 BR 2019-09 704 0
#> 10 BR 2019-10 939 0
#> 11 BR 2019-11 834 0
#> 12 BR 2019-12 880 9208
#> 13 BR 2020-01 801 9329
#> 14 BR 2020-02 610 9352
#> 15 BR 2020-03 853 9309
#> Period.Received2 n.Seen.cum2
#> 1 2019-01-01 NA
#> 2 2019-02-01 NA
#> 3 2019-03-01 NA
#> 4 2019-04-01 NA
#> 5 2019-05-01 NA
#> 6 2019-06-01 NA
#> 7 2019-07-01 NA
#> 8 2019-08-01 NA
#> 9 2019-09-01 NA
#> 10 2019-10-01 NA
#> 11 2019-11-01 NA
#> 12 2019-12-01 9208
#> 13 2020-01-01 10009
#> 14 2020-02-01 10619
#> 15 2020-03-01 11472
If you have multiple years and want the cumulative sum to reset, update the test parameter in ifelse() to include some max number of days.
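A minimal sketch of that adjustment (the 698-day upper bound is purely illustrative; pick whatever window fits your data):
days.elapsed <- temp$Period.Received2 - min(temp$Period.Received2)
temp$n.Seen.cum2 <- ifelse(
  test = days.elapsed > 333 & days.elapsed <= 698,  # cap is hypothetical
  yes = cumsum(temp$Number.Cases.Authorised),
  no = NA)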
I have the following dataframe:
test_df <- structure(list(system = c("A", "B", "B", "C", "D", "B", "B",
"C", "B", "B", "A", "D", "D", "B", "E", NA, NA, "B", "A", "D"
), type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L), start_date = structure(c(16567,
16604, 16324, 16595, 16111, 17597, 16784, 16648, 16121, 16549,
16438, 16484, 15997, 16488, 16708, 16121, 16327, 16329, 17010,
16342), class = "Date"), end_date = structure(c(16995, 16984,
16661, 16909, 16414, 17843, 16990, 16853, 16323, 16751, 16622,
16665, 16154, 16624, 16839, 16251, 16456, 16456, 17134, 16458
), class = "Date"), event_duration = c(428, 380, 337, 314, 303,
246, 206, 205, 202, 202, 184, 181, 157, 136, 131, 130, 129, 127,
124, 116)), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
test_df
#> system type start_date end_date event_duration
#> 1 A 2 2015-05-12 2016-07-13 428
#> 2 B 2 2015-06-18 2016-07-02 380
#> 3 B 2 2014-09-11 2015-08-14 337
#> 4 C 2 2015-06-09 2016-04-18 314
#> 5 D 1 2014-02-10 2014-12-10 303
#> 6 B 1 2018-03-07 2018-11-08 246
#> 7 B 1 2015-12-15 2016-07-08 206
#> 8 C 2 2015-08-01 2016-02-22 205
#> 9 B 1 2014-02-20 2014-09-10 202
#> 10 B 2 2015-04-24 2015-11-12 202
#> 11 A 2 2015-01-03 2015-07-06 184
#> 12 D 2 2015-02-18 2015-08-18 181
#> 13 D 1 2013-10-19 2014-03-25 157
#> 14 B 2 2015-02-22 2015-07-08 136
#> 15 E 1 2015-09-30 2016-02-08 131
#> 16 <NA> 1 2014-02-20 2014-06-30 130
#> 17 <NA> 2 2014-09-14 2015-01-21 129
#> 18 B 2 2014-09-16 2015-01-21 127
#> 19 A 1 2016-07-28 2016-11-29 124
#> 20 D 1 2014-09-29 2015-01-23 116
For each system, I would like to plot differently colored sequences of segments, for each type of event, starting at start_date and ending at end_date. For example, for system A, I would like to plot two sequences of segments:
one, corresponding to events of type 1, containing a single segment, which starts at 2016-07-28 and ends at 2016-11-29
another, corresponding to events of type 2, containing two segments, one starting at 2015-01-03 and ending at 2015-07-06, and another starting at 2015-05-12 and ending at 2016-07-13. As you see, events in a sequence can overlap. I'm not sure how to make sure that the user can still distinguish among events: maybe one could use arrows, or vertical bars or whatever, to show the start and the end of an event.
Ideally, the plots for each system should be in different facets, because I believe that having all of them in the same plot would make a complete mess (of course, the actual data frame is much bigger than this sample data frame).
For system B, I would have 3 segments corresponding to events of type 1 and 5 corresponding to events of type 2. And so on. How can I create the plot I desire? I would prefer a ggplot2 solution.
One option is to use jittering to avoid overplotting the start and end points. Whether this works well will depend on how many segments you're trying to plot.
To ensure you're able to adjust the y-direction of the lines the same amount, you can add the jitter to the df itself and use that to plot segments:
library(ggplot2)

test_df$jitter <- jitter(test_df$type, amount = 0.25)
ggplot(test_df) +
  geom_segment(aes(x = start_date, xend = end_date, y = jitter, yend = jitter)) +
  facet_wrap(~system) +
  scale_y_continuous(breaks = c(1, 2), labels = c(1, 2)) +
  theme(panel.grid.minor.y = element_blank())
You could also use start and end indicators as you suggested, to help emphasize the ends of the segments, but this may just add more noise if the number of segments is large.
ggplot(test_df) +
  geom_point(aes(x = start_date, y = jitter), size = 1) +
  geom_segment(aes(x = start_date, xend = end_date, y = jitter, yend = jitter),
               arrow = arrow(angle = 30, length = unit(1.25, "mm"),
                             ends = "last", type = "closed")) +
  facet_wrap(~system) +
  scale_y_continuous(breaks = c(1, 2), labels = c(1, 2)) +
  theme(panel.grid.minor.y = element_blank())
Let's say we have two tables:
A table of budgets:
Item Budget
A 900
B 350
C 100
D 0
bDT = structure(list(Item = c("A", "B", "C", "D"), Budget = c(900L,
350L, 100L, 0L)), .Names = c("Item", "Budget"), row.names = c(NA,
-4L), class = "data.frame")
and a table of expected expenses by item per date.
Item Date Expense
A 2017-08-24 850
B 2017-08-18 300
B 2017-08-11 50
C 2017-08-18 50
C 2017-08-11 100
D 2017-08-01 500
expDF = structure(list(Item = c("A", "B", "B", "C", "C", "D"), Date = structure(c(17402,
17396, 17389, 17396, 17389, 17379), class = "Date"), Expense = c(850L,
300L, 50L, 50L, 100L, 500L)), .Names = c("Item", "Date", "Expense"
), row.names = c(NA, -6L), class = "data.frame")
I'm looking to summarize the amount we can spend per item per date like this:
Item Date Spend
A 8/24/2017 850
B 8/18/2017 300
B 8/11/2017 50
C 8/18/2017 50
C 8/11/2017 50
D 8/1/2017 0
This works:
library(data.table)
setDT(bDF); setDT(expDF)
expDF[bDF, on = .(Item),
      Spending := pmin(
        Expense,
        pmax(0, Budget - cumsum(shift(Expense, fill = 0)))
      ),
      by = .EACHI]
Item Date Expense Spending
1: A 2017-08-24 850 850
2: B 2017-08-18 300 300
3: B 2017-08-11 50 50
4: C 2017-08-18 50 50
5: C 2017-08-11 100 50
6: D 2017-08-01 500 0
How it works
cumsum(shift(Expense, fill = 0)) is prior spending**
max(0, Budget - prior spending) is remaining budget
min(Expense, remaining budget) is current spending
The data.table syntax x[i, on=, j, by=.EACHI] is a join. In this case j takes the form v := expr, which adds a new column to x. See ?data.table for details.
** Well, "prior" in ordering of the table. I'll ignore the OP's weird reversed dates.
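To see the arithmetic, walk through item C (Budget 100; expenses 50, then 100 in table order):
# prior spending:   cumsum(shift(c(50, 100), fill = 0))  ->  c(0, 50)
# remaining budget: pmax(0, 100 - c(0, 50))              ->  c(100, 50)
# current spending: pmin(c(50, 100), c(100, 50))         ->  c(50, 50)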
I have data which is stored in a data frame as follows:
CST M QS
501 1204315 1
501 1204324 1
501 6041541 2
501 7508226 1
501 7509677 1
501 7514622 1
503 7511601 2
507 6961200 2
507 7514100 2
507 7522029 1
509 7512374 1
509 7516799 2
511 419110 0.5
511 6000832 5
511 6960800 3.33
511 7010000 2
511 7508229 2.5
511 7508307 2
511 7515126 2
Now, I would like to reshape this data: based on CST, the different M values have to be stored in different columns. The columns should be dynamic, since a CST can have only 1 M, or 10, or 20, or more; that many columns have to be generated, named M1, M2, M3 and so on. A Total column should also be displayed, holding sum(QS) per CST.
Sample output is as below
cst M1 M2 M3 M4 M5 M6 M7 Total
501 1204315 1204324 6041541 7508226 7509677 7514622 7
503 7511601 2
507 6961200 7514100 7522029 5
509 7512374 7516799 3
511 419110 6000832 6960800 7010000 7508229 7508307 7515126 17.33
I have used transpose(), but it just converts rows to columns and does not produce the expected output.
I have also tried the data.table call dt[, sum(QS), by = "CST"], but I am not able to display the M values along with it in different columns.
Thanks in advance.
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'CST', we get the sequence of rows (paste0("M", 1:.N)) and the sum of "QS", and assign (:=) the output to create two columns ("N" and "Total"). Then we use dcast to convert from long to wide format.
library(data.table)
setDT(df1)[, c("N", "Total") := list(paste0("M", 1:.N), sum(QS)), CST]
dcast(df1, CST+Total~N, value.var='M')
# CST Total M1 M2 M3 M4 M5 M6 M7
#1: 501 7.00 1204315 1204324 6041541 7508226 7509677 7514622 NA
#2: 503 2.00 7511601 NA NA NA NA NA NA
#3: 507 5.00 6961200 7514100 7522029 NA NA NA NA
#4: 509 3.00 7512374 7516799 NA NA NA NA NA
#5: 511 17.33 419110 6000832 6960800 7010000 7508229 7508307 7515126
Or we can use the same methodology with dplyr/tidyr:
library(dplyr)
library(tidyr)
df1 %>%
group_by(CST) %>%
mutate(Total=sum(QS), N=row_number()) %>%
select(-QS) %>%
spread(N, M)
Update
If we need the columns in the right order, we can convert "N" to a factor with the levels specified:
setDT(df2)[, c("N", "Total") := list(paste0("M", 1:.N), sum(QS)), CST]
df2[, N:= factor(N, levels=unique(N))]
dcast(df2, CST+Total~N, value.var="M")
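(With N left as a character column, dcast sorts the new columns lexicographically, so "M10" would land before "M2" as soon as a CST has ten or more M values, which df2 below does for CST 501; fixing the factor levels keeps the natural M1, M2, ..., M12 order.)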
data
df2 <- structure(list(CST = c(501L, 501L, 501L, 501L, 501L,
501L, 501L,
501L, 501L, 501L, 501L, 501L, 503L, 507L, 507L, 507L, 509L, 509L,
511L, 511L, 511L, 511L, 511L, 511L, 511L), M = c(1204315L, 1204324L,
6041541L, 7508226L, 7509677L, 7434399L, 7843392L, 7834393L, 8343999L,
3439242L, 3434323L, 7514622L, 7511601L, 6961200L, 7514100L, 7522029L,
7512374L, 7516799L, 419110L, 6000832L, 6960800L, 7010000L, 7508229L,
7508307L, 7515126L), QS = c(1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 1, 1, 2, 0.5, 5, 3.33, 2, 2.5, 2, 2)),
.Names = c("CST",
"M", "QS"), class = "data.frame", row.names = c(NA, -25L))
I have my data in the form of a data.table given below
structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA,
NA, 1), inv = c(593, 823, 668, 640, 593, 745), GU = c(36, 94,
57, 105, 48, 67), RUTL = c(100, NA, 173, NA, NA, 7)), .Names = c("atp",
"len", "inv", "GU", "RUTL"), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000320788>)
I need to form 4 new columns: csi_begin, csi_end, IRQ and csi_order. When atp == 1, the values of csi_begin and csi_end depend directly on the inv and GU values.
But when atp is not equal to 1, csi_begin and csi_end depend on the inv and GU values and on the IRQ value of the previous row.
The value of IRQ depends on the csi_order of that row if atp == 1, and is 0 otherwise; the csi_order value depends on the csi_begin value from two rows earlier.
I have written the conditions with the help of a for loop. Below is the code:
lostsales <- function(transit)
{
  if (transit$atp == 1)
  {
    transit$csi_begin[i] <- transit$inv[i]
    transit$csi_end[i] <- transit$csi_begin[i] - transit$GU[i]
  }
  else
  {
    transit$csi_begin[i] <- transit$inv[i] + transit$IRQ[i - 1]
    transit$csi_end[i] <- transit$csi_begin[i] - transit$GU[i]
  }
  if (transit$csi_begin[i - 2] != NA)
  {
    transit$csi_order[i] <- transit$csi_begin[i - 2]
  }
  else
  {
    transit$csi_order[i] <- 0
  }
  if (transit$atp == 1)
  {
    transit$IRQ[i] <- transit$csi_order[i] - transit$RUTL[i]
  }
  else
  {
    transit$IRQ[i] <- 0
  }
}
Can anyone help me with how to do efficient looping over data.tables, using setkey? My data set is very large, so I cannot use a for loop; the runtime would be far too high.
Adding the desired outcome to your example would be very helpful, as I'm having trouble following the if/then logic. But I took a stab at it anyway:
library(data.table)
# Example data:
dt <- structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA, NA, 1), inv = c(593, 823, 668, 640, 593, 745), GU = c(36, 94, 57, 105, 48, 67), RUTL = c(100, NA, 173, NA, NA, 7)), .Names = c("atp", "len", "inv", "GU", "RUTL"), row.names = c(NA, -6L), class = c("data.table", "data.frame"), .internal.selfref = "<pointer: 0x0000000000320788>")
# Add a row number:
dt[,rn:=.I]
# Use this function to get the value from a previous (shiftLen is negative) or future (shiftLen is positive) row:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
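# A quick illustration of rowShift() (results shown as comments; sanity check only):
# rowShift(1:5, -1)  ->  NA  1  2  3  4   (previous row's value)
# rowShift(1:5,  1)  ->   2  3  4  5  NA  (next row's value)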
# My attempt to follow the seemingly circular if/then rules:
lostsales2 <- function(transit) {
# If atp==1, set csi_begin to inv and csi_end to csi_begin - GU:
transit[atp==1, `:=`(csi_begin=inv, csi_end=inv-GU)]
# Set csi_order to the value of csi_begin from two rows prior:
transit[, csi_order:=rowShift(csi_begin,-2)]
# Set csi_order to 0 if csi_begin from two rows prior was NA
transit[is.na(csi_order), csi_order:=0]
# Initialize IRQ to 0
transit[, IRQ:=0]
# If ATP==1, set IRQ to csi_order - RUTL
transit[atp==1, IRQ:=csi_order-RUTL]
# If ATP!=1, set csi_begin to inv + IRQ value from previous row, and csi_end to csi_begin - GU
transit[atp!=1, `:=`(csi_begin=inv+rowShift(IRQ,-1), csi_end=inv+rowShift(IRQ,-1)-GU)]
return(transit)
}
lostsales2(dt)
## atp len inv GU RUTL rn csi_begin csi_end csi_order IRQ
## 1: 1 2 593 36 100 1 593 557 0 -100
## 2: 0 NA 823 94 NA 2 NA NA 0 0
## 3: 1 3 668 57 173 3 668 611 593 420
## 4: 0 NA 640 105 NA 4 640 535 0 0
## 5: 0 NA 593 48 NA 5 593 545 668 0
## 6: 1 1 745 67 7 6 745 678 640 633
Is this output close to what you were expecting?