I have a bit of a question about obtaining a previous value from one column and putting it into another column. I have a data.table object (which can easily be converted into an xts object) as follows:
The dput output is as follows:
structure(list(Time = structure(c(1122855314, 1122855315, 1122855316,
1122855317, 1122855318, 1122855319, 1122855320, 1122955811, 1122955812,
1122955813, 1122955814, 1123027212, 1123027213, 1123027214, 1123027215,
1123027216, 1123027217), class = c("POSIXct", "POSIXt"), tzone = "Australia/Melbourne"),
`Inventory_{t}` = c(0, 2, 2, 2, 5, 8, 3, 7, 6, 6, 1, 0, 1,
1, 3, 3, 3), `Inventory_{t-1}` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `Delta Inventory_{t-1}` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Time",
"Inventory_{t}", "Inventory_{t-1}", "Delta Inventory_{t-1}"), row.names = c(NA,
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00000000028b0788>)
I would like to 'fill in' the "Inventory_{t-1}" such that it takes the value which was in "Inventory_{t}" one second earlier and puts it into that cell. Similarly, for "Delta Inventory_{t-1}" I want it to be equal to Delta Inventory_{t-1} = Inventory_{t-1} - Inventory_{t-2}
I should also note that at the start of each new day, the initial values for "Inventory_{t-1}" and "Delta Inventory_{t-1}" must be 0.
With this information, I would like to get a new data.table/xts object which looks like this:
structure(list(Time = structure(c(1122855314, 1122855315, 1122855316,
1122855317, 1122855318, 1122855319, 1122855320, 1122955811, 1122955812,
1122955813, 1122955814, 1123027212, 1123027213, 1123027214, 1123027215,
1123027216, 1123027217), class = c("POSIXct", "POSIXt"), tzone = "Australia/Melbourne"),
`Inventory_{t}` = c(0, 2, 2, 2, 5, 8, 3, 7, 6, 6, 1, 0, 1,
1, 3, 3, 3), `Inventory_{t-1}` = c(0, 0, 2, 2, 2, 5, 8, 0,
7, 6, 6, 0, 0, 1, 1, 3, 3), `Delta Inventory_{t-1}` = c(0,
0, 2, 0, 0, 3, 3, 0, 7, -1, 0, 0, 0, 1, 0, 2, 0)), .Names = c("Time",
"Inventory_{t}", "Inventory_{t-1}", "Delta Inventory_{t-1}"), row.names = c(NA,
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00000000028b0788>)
The things is, this issue is very straightforward for me to solve if I use loops, but since I have so much data I was hoping for a much faster way to do this, so if anyone can help me out with this I'd really appreciate it, thanks in advance.
This can be solved using the shift() function. The OP has requested to restart the calculation anew every day. This is accomplished by the by = parameter:
z[, `:=`(`Inventory_{t-1}` = shift(`Inventory_{t}`, fill = 0),
`Delta Inventory_{t-1}` = shift(`Inventory_{t}`, fill = 0) -
shift(`Inventory_{t}`, n = 2L, fill = 0)), by = .(Day = as.Date(Time))][]
Time Inventory_{t} Inventory_{t-1} Delta Inventory_{t-1}
1: 2005-08-01 10:15:14 0 0 0
2: 2005-08-01 10:15:15 2 0 0
3: 2005-08-01 10:15:16 2 2 2
4: 2005-08-01 10:15:17 2 2 0
5: 2005-08-01 10:15:18 5 2 0
6: 2005-08-01 10:15:19 8 5 3
7: 2005-08-01 10:15:20 3 8 3
8: 2005-08-02 14:10:11 7 0 0
9: 2005-08-02 14:10:12 6 7 7
10: 2005-08-02 14:10:13 6 6 -1
11: 2005-08-02 14:10:14 1 6 0
12: 2005-08-03 10:00:12 0 0 0
13: 2005-08-03 10:00:13 1 0 0
14: 2005-08-03 10:00:14 1 1 1
15: 2005-08-03 10:00:15 3 1 0
16: 2005-08-03 10:00:16 3 3 2
17: 2005-08-03 10:00:17 3 3 0
Related
I have been struggling for a while now, and I am just missing a step. I hope you can help with this final step.
Reprex
structure(list(record_id = c(110001, 110001, 110001, 110001,
110001, 110001, 110001, 110001, 110001, 110021, 110021, 110021,
110021, 110021, 110021, 110021, 110021, 110021, 110021, 110021,
110021, 110021), day_count = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1,
2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14), previous_treatment = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0
), treatment = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), interruption_streak = c(1, 2, 3, 4, 5,
6, 7, 8, 9, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), row.names = c(NA,
-22L), groups = structure(list(record_id = c(110001, 110021),
.rows = structure(list(1:9, 10:22), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Explanation
This is just an excerpt of the main dataset in which I have listed per participant per day how they were treated.
Here, you can see two study participants record_id 110001 and 110021
To count how many days their treatment was interrupted, i have created a count_streak function interruption_streak
This is a function of treatment: if treatment = 0, then start counting until treatment > 0.
Both treatment and previous_treatment can be 0 (no treatment) or 1,2,3 (treatment A,B,C)
However, as you can see in record_id 110001, you can't really call the first streak an interruption, as prior to day 1, he didn't receive any treatment at all previous_treatment = 0. Same goes for the first streak of 110021.
The second streak of 110021 is the only valid one which I would like to consider as an interruption and keep in the dataset:
at day 5, it went from previous_treatment = 1 to treatment = 0.
Question
I would like to delete all streaks which started with a previous_treatment = 0 and keep all streaks which started with a previous_treatment > 0.
Thanks a lot in advance
You were close. Does this suffice?
df %>% group_by(record_id) %>%
filter(cumsum(previous_treatment) > 0)
record_id day_count previous_treatment treatment interruption_streak
<dbl> <dbl> <dbl> <dbl> <dbl>
1 110021 5 1 0 1
2 110021 6 0 0 2
3 110021 7 0 0 3
4 110021 8 0 0 4
5 110021 9 0 0 5
6 110021 10 0 0 6
7 110021 11 0 0 7
8 110021 12 0 0 8
9 110021 13 0 0 9
10 110021 14 0 0 10
I got data like this
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1), drug_2 = c(0, 1, 1, 1, 1, 0,
1, 0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA, -12L
))
I would like to get the cumulative count of each column for each id and get the data like this
structure(list(id2 = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1_b = c(0,
0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 2), drug_2_b = c(0, 1, 2, 3, 4,
0, 5, 0, 0, 1, 0, 2)), class = "data.frame", row.names = c(NA,
-12L))
You can get a cumulative sum with cumsum.
To split data.frame into subsets, you can use split and then lapply cumsum over the list of the data.frames and again over the list of the columns, or you can use the ave function which does exactly that:
data = structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1), drug_2 = c(0, 1, 1, 1, 1, 0,
1, 0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA, -12L
))
data[-1] = ave(data[-1], data$id, FUN=cumsum)
edit:
I assumed that the cumulative sum is requested (as per instructions) and that there is a mistake in the example data. If the example data is correct, then the condition is If the count is zero, don't do cumulative sum and leave at zero or ifelse(x == 0, 0, cumsum(x)) (as per #r2evans). However, this construct doesn't work when applied for the data.frame. A more complex helper function is required:
data[-1] = ave(data[-1], data$id, FUN=function(x){
y = cumsum(x)
y[x == 0] = 0
y
})
We can now compare it with the requested (renamed) data:
result = structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), drug_1 = c(0,
0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 2), drug_2 = c(0, 1, 2, 3, 4,
0, 5, 0, 0, 1, 0, 2)), class = "data.frame", row.names = c(NA,
-12L))
identical(data, result)
Base R,
ave(df$drug_2, df$id, FUN = function(z) ifelse(z == 0, z, cumsum(z)))
# [1] 0 1 2 3 4 0 5 0 0 1 0 2
Edit Simplified the solution after reading r2evans' approach.
You could use
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with("drug"),
~ifelse(.x == 0, 0, cumsum(.x)))) %>%
ungroup()
This returns
# A tibble: 12 x 3
id drug_1 drug_2
<dbl> <dbl> <dbl>
1 1 0 0
2 1 0 1
3 1 0 2
4 1 0 3
5 1 0 4
6 1 1 0
7 1 2 5
8 2 0 0
9 2 0 0
10 2 1 1
11 2 0 0
12 2 2 2
Base R solution:
# Resolve the names of vectors we want to cumulatively sum:
# drug_vec_names => character vector
drug_vec_names <- grep( "^drug\\_", colnames(df), value = TRUE)
# Resolve the names of vectors we want to keep:
# not_drug_vec_names => character vector
not_drug_vec_names <- names(df)[!(names(df) %in% drug_vec_names)]
# Calculate the result: res => data.frame
res <- setNames(
cbind(
df[,not_drug_vec_names],
replace(
ave(
df[,drug_vec_names],
df[,not_drug_vec_names],
FUN = cumsum
),
df[,drug_vec_names] == 0,
0
)
),
c(not_drug_vec_names, drug_vec_names)
)
If you have binary values (1/0) in drug columns, you can multiply the cumulative sum with itself to get 0 for 0 values.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with('drug'), ~cumsum(.) * .)) %>%
ungroup
# id drug_1 drug_2
# <dbl> <dbl> <dbl>
# 1 1 0 0
# 2 1 0 1
# 3 1 0 2
# 4 1 0 3
# 5 1 0 4
# 6 1 1 0
# 7 1 2 5
# 8 2 0 0
# 9 2 0 0
#10 2 1 1
#11 2 0 0
#12 2 2 2
I got two data sets of different lengths. I want to create a new column in the dataset which got more rows based on filtering a specific column from the shorter df. I am getting a waring " Longer object length is not a multiple of shorter object length". And the result is also not correct. I tried to created a smaller example datasets and tried the same code and its working with correct results. I am not sure why on my original data the results are not correct and I am getting the warning.
The example datasets are
structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
the second df
structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
I tried this on both my original datasets where I get the warning and these dummy dfs where no warning and correct results.
data2 <- data2 %>%
mutate(d_Activity = ifelse(code2 %in% data1$code & activity == 1, 1,0))
Actually, you are doing it wrong way. Let me explain-
In sample data it is working because larger df have rows (20) which is multiple of rows in smaller df (10).
So in you syntax what you are doing is, to check one complete vector with another complete vector (column of another df), because R normally works in vectorised way of operations.
the correct way of matching one to many is through purrr::map where each individual value in first argument (code2 here) operates with another vector i.e. df1$code which is not in argument of map.
df1 <- structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
library(tidyverse)
df2 %>%
mutate(d_Activity = map(code2, ~ +(.x %in% df1$code[df1$activity == 1])))
#> id2 code2 d_Activity
#> 1 1 2 0
#> 2 2 5 0
#> 3 3 11 0
#> 4 4 15 0
#> 5 5 9 0
#> 6 6 18 0
#> 7 7 21 0
#> 8 8 3 1
#> 9 9 27 0
#> 10 10 55 0
#> 11 11 2 0
#> 12 12 5 0
#> 13 13 11 0
#> 14 14 15 0
#> 15 15 3 1
#> 16 16 18 0
#> 17 17 21 0
#> 18 18 3 1
#> 19 19 27 0
#> 20 20 55 0
Created on 2021-06-17 by the reprex package (v2.0.0)
I would like to create a column in a data.frame, placing the first time that the year appears in each id.
That is, I have this data:
example <- structure(list(id = structure(c(1, 2, 3, 4, 5), class = "numeric"),
`2007` = c(0, 0, 0, 0, 0), `2008` = c(0, 0, 0, 0, 1), `2009` = c(1,
0, 0, 0, 0), `2010` = c(1, 0, 1, 0, 1), `2011` = c(0, 0,
0, 0, 0), `2012` = c(1, 0, 1, 1, 1), `2013` = c(1, 0, 1,
0, 1), `2014` = c(1, 1, 1, 1, 0), `2015` = c(1, 1, 0, 0,
0), `2016` = c(1, 1, 1, 0, 1)), row.names = c(NA, 5L), class = "data.frame")
And I would like to get the following:
example2 <- structure(list(id = structure(c(1, 2, 3, 4, 5), class = "numeric"),
`2007` = c(0, 0, 0, 0, 0), `2008` = c(0, 0, 0, 0, 1), `2009` = c(1,
0, 0, 0, 0), `2010` = c(1, 0, 1, 0, 1), `2011` = c(0, 0,
0, 0, 0), `2012` = c(1, 0, 1, 1, 1), `2013` = c(1, 0, 1,
0, 1), `2014` = c(1, 1, 1, 1, 0), `2015` = c(1, 1, 0, 0,
0), `2016` = c(1, 1, 1, 0, 1), situation = c(2009, 2014,
2010, 2012, 2008)), row.names = c(NA, 5L), class = "data.frame")
Is it possible to do that ? Every help is welcome. Thanks.
Try this:
#Code
example$situation <- apply(example[,-1],1,function(x) names(x)[min(which(x==1))])
Output:
example
id 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 situation
1 1 0 0 1 1 0 1 1 1 1 1 2009
2 2 0 0 0 0 0 0 0 1 1 1 2014
3 3 0 0 0 1 0 1 1 1 0 1 2010
4 4 0 0 0 0 0 1 0 1 0 0 2012
5 5 0 1 0 1 0 1 1 0 0 1 2008
Or with dplyr and tidyr reshaping and merging:
library(dplyr)
library(tidyr)
#Code
example <- example %>%
left_join(
example %>% pivot_longer(-1) %>%
group_by(id) %>%
summarise(situation=name[min(which(value==1))])
)
Same output.
I have two data frames:
> df1
2013-04-1 2013-04-2 2013-04-3 2013-04-4 2013-04-5 2013-04-6 2013-04-7 2013-04-8 2013-04-9 2013-04-10 2013-04-11
bin_1 32 489 32 32 364 19 312 0 0 0 346
bin_2 8 346 8 0 98 8 12 12 46 364 346
bin_3 9 98 346 46 9 312 6 1912 0 489 0
bin_4 4 12 9 12 0 12 0 987 9 19 12
bin_5 0 0 8 8 0 0 312 6 312 12 4
df1 contains 5 rows (bins) and 23 columns (date)
> df2
orange apple pear banana watermelon lemon
2013-04-1 1 1 1 1 0 1
2013-04-2 1 1 0 1 0 0
2013-04-3 1 1 1 1 0 1
2013-04-4 0 1 0 1 1 1
2013-04-5 1 0 0 0 1 1
df2 contains 23 rows(date) and 6 columns (types of fruits)
So now, I want to concentrate these 2 dfs into 1 big data frame that contains all the information, like:
> df3
orange apple pear banana watermelon lemon
bin_1 ? ? ? ? ? ?
bin_2 ? ? ? ? ? ?
bin_3 ? ? ? ? ? ?
bin_4 ? ? ? ? ? ?
bin_5 ? ? ? ? ? ?
But how can i concentrate the data? So for example,
on 2013-04-1,
bin_1 contains 32 fruits, bin_2 contains 8 fruits, ..., bin_5 contains 0 fruits (based on df1)
only orange, apple, pear, banana, and lemon are available (based on df2)
Q. I want my df3 to contain concentrate information, like bin_1 on average contain x amount of oranges, ...etc .How can I model this?
Code:
> dput(df1)
structure(list(`2013-04-1` = c(32, 8, 9, 4, 0), `2013-04-2` = c(489,
346, 98, 12, 0), `2013-04-3` = c(32, 8, 346, 9, 8), `2013-04-4` = c(32,
0, 46, 12, 8), `2013-04-5` = c(364, 98, 9, 0, 0), `2013-04-6` = c(19,
8, 312, 12, 0), `2013-04-7` = c(312, 12, 6, 0, 312), `2013-04-8` = c(0,
12, 1912, 987, 6), `2013-04-9` = c(0, 46, 0, 9, 312), `2013-04-10` = c(0,
364, 489, 19, 12), `2013-04-11` = c(346, 346, 0, 12, 4), `2013-04-12` = c(0,
9, 12, 46, 489), `2013-04-13` = c(32, 8, 19, 46, 0), `2013-04-14` = c(0,
987, 12, 0, 6), `2013-04-15` = c(0, 346, 4, 346, 0), `2013-04-16` = c(0,
1912, 1912, 12, 364), `2013-04-17` = c(12, 98, 32, 32, 1912),
`2013-04-18` = c(12, 12, 12, 0, 346), `2013-04-19` = c(9,
46, 98, 312, 4), `2013-04-20` = c(32, 987, 46, 9, 312), `2013-04-21` = c(4,
98, 12, 32, 12), `2013-04-22` = c(19, 0, 4, 346, 0), `2013-04-23` = c(1912,
364, 0, 0, 489)), row.names = c("bin_1", "bin_2", "bin_3",
"bin_4", "bin_5"), class = "data.frame")
> dput(df2)
structure(list(orange = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0), apple = c(1, 1, 1, 1, 0, 1,
0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0), pear = c(1,
0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
0), banana = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 0), watermelon = c(0, 0, 0, 1, 1, 0, 1, 1,
1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0), lemon = c(1, 0,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0
)), row.names = c("2013-04-1", "2013-04-2", "2013-04-3", "2013-04-4",
"2013-04-5", "2013-04-6", "2013-04-7", "2013-04-8", "2013-04-9",
"2013-04-10", "2013-04-11", "2013-04-12", "2013-04-13", "2013-04-14",
"2013-04-15", "2013-04-16", "2013-04-17", "2013-04-18", "2013-04-19",
"2013-04-20", "2013-04-21", "2013-04-22", "2013-04-23"), class = "data.frame")