Extract tibble_df and text message from activitylog object - r

I have the code below
library(bupaR)
library(daqapo)
hospital<-hospital
hospital %>%
rename(start = start_ts,
complete = complete_ts) -> hospital
hospital %>%
convert_timestamps(c("start","complete"), format = dmy_hms) -> hospital
hospital %>%
activitylog(case_id = "patient_visit_nr",
activity_id = "activity",
resource_id = "originator",
timestamps = c("start", "complete")) -> hospital
hospital %>%
detect_time_anomalies()
which gives
*** OUTPUT ***
For 5 rows in the activity log (9.43%), an anomaly is detected.
The anomalies are spread over the activities as follows:
# A tibble: 3 × 3
activity type n
<chr> <chr> <int>
1 Registration negative duration 3
2 Clinical exam zero duration 1
3 Trage negative duration 1
Anomalies are found in the following rows:
# Log of 10 events consisting of:
3 traces
3 cases
5 instances of 3 activities
5 resources
Events occurred from 2017-11-21 11:22:16 until 2017-11-21 19:00:00
# Variables were mapped as follows:
Case identifier: patient_visit_nr
Activity identifier: activity
Resource identifier: originator
Timestamps: start, complete
# A tibble: 5 × 10
patient_visit_nr activity originator start complete triagecode specialization .order durat…¹ type
<dbl> <chr> <chr> <dttm> <dttm> <dbl> <chr> <int> <dbl> <chr>
1 518 Registration Clerk 12 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 1 -23 nega…
2 518 Registration Clerk 6 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 2 -23 nega…
3 518 Registration Clerk 9 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 3 -23 nega…
4 520 Trage Nurse 17 2017-11-21 13:43:16 2017-11-21 13:39:00 5 URG 4 -4.27 nega…
5 528 Clinical exam Doctor 1 2017-11-21 19:00:00 2017-11-21 19:00:00 3 TRAU 5 0 zero…
# … with abbreviated variable name ¹​duration
From this output I would like to extract the text message
For 5 rows in the activity log (9.43%), an anomaly is detected.
The anomalies are spread over the activities as follows:
and, in a separate object, the tibble
activity type n
<chr> <chr> <int>
1 Registration negative duration 3
2 Clinical exam zero duration 1
3 Trage negative duration 1

The string you are trying to obtain is a message, and although it is possible to capture a message, it's not that straightforward. You can, however, easily regenerate it by emulating a couple of lines from inside the function.
If you store the result of detect_time_anomalies:
anomalies <- hospital %>% detect_time_anomalies()
Then you can generate the message like this:
paste0("For ", nrow(anomalies), " rows in the activity log (",
round(nrow(anomalies)/nrow(hospital) * 100, 2),
"%), an anomaly is detected.")
#> [1] "For 5 rows in the activity log (9.43%), an anomaly is detected."
Similarly, you can obtain the output table like this:
anomalies %>%
group_by(activity, type) %>%
summarize(n = n()) %>%
arrange(desc(n))
#> # A tibble: 3 x 3
#> activity type n
#> <chr> <chr> <int>
#> 1 Registration negative duration 3
#> 2 Clinical exam zero duration 1
#> 3 Trage negative duration 1
Created on 2022-12-13 with reprex v2.0.2
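For completeness, here is a minimal sketch of capturing a message directly, using only base R. The function noisy_summary() is a hypothetical stand-in, not part of daqapo, and capture_messages() is a helper name invented for this example. The sketch only applies if the text really is emitted with message(), as the answer above says.

```r
# Hypothetical stand-in: reports via message() and returns a data frame
noisy_summary <- function(x) {
  message("For ", length(x), " rows, an anomaly is detected.")
  data.frame(value = x)
}

# Evaluate an expression, collecting any messages instead of printing them
capture_messages <- function(expr) {
  msgs <- character()
  value <- withCallingHandlers(
    expr,
    message = function(m) {
      msgs <<- c(msgs, conditionMessage(m))  # store the message text
      invokeRestart("muffleMessage")         # stop it reaching the console
    }
  )
  # message() appends a newline, so trim it off
  list(value = value, messages = trimws(msgs))
}

res <- capture_messages(noisy_summary(c(3, 1, 4)))
res$messages  # character vector with the captured text
res$value     # the value the function returned
```

If a function prints with cat() or print() rather than message(), capture.output() would be the tool to try instead.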

Related

Sum unique occurrences per night and create a new data frame in R

I have studied prey deliveries in a breeding owl and want to score the number of prey items delivered to the nestlings during the night. I define night as from 21 to 5. How can I make a new data frame with the number of prey items each night per location ID, based on this 24/7 observation dataset? In the new data frame, I wish to have the following columns: ID (A & B), No_prey_during_night (the sum of prey items), and Time (date, e.g. 4/6 to 5/6), with a unique row per night per ID.
https://drive.google.com/file/d/1y5VCoNWZCmYbyWCktKfMSBqjOIaLeumQ/view?usp=sharing. I have done it in Excel so far, but that is very time-consuming. I would be happy to get help with a simple script I could use in R.
To take into account the fact that a night begins and ends on different dates, you could first assign all the morning hours to the prior day. The final label (the Time column in your question) then includes the next day. If the year of the data collection has a Feb 29, make sure the year is correct (I used 2022).
library(dplyr)
library(lubridate)
read.csv("Tot_prey_example.csv") %>%
mutate(time = make_datetime(year = 2022, month = Month, day = Day, hour = Hour),
night_time = if_else(between(Hour, 0, 5), time - days(1), time),
night_date = floor_date(night_time, unit = "day"),
night = Hour <= 5 | Hour >= 21) %>%
filter(night) %>%
group_by(ID, night_date) %>%
summarise(No_prey_during_night = sum(n), .groups = "drop") %>%
mutate(next_day = night_date + days(1),
Time = glue::glue("{day(night_date)}/{month(night_date)} to {day(next_day)}/{month(next_day)}")) %>%
select(ID, No_prey_during_night, Time)
#> # A tibble: 88 × 3
#> ID No_prey_during_night Time
#> <chr> <int> <glue>
#> 1 A 12 4/6 to 5/6
#> 2 A 22 5/6 to 6/6
#> 3 A 20 6/6 to 7/6
#> 4 A 14 7/6 to 8/6
#> 5 A 14 8/6 to 9/6
#> 6 A 27 9/6 to 10/6
#> 7 A 22 10/6 to 11/6
#> 8 A 18 11/6 to 12/6
#> 9 A 22 12/6 to 13/6
#> 10 A 25 13/6 to 14/6
#> # … with 78 more rows
Created on 2022-05-18 by the reprex package (v2.0.1)
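To check the night-assignment logic by hand, here is a small sketch on invented rows (not the asker's file). It assumes the same column names as above (ID, Month, Day, Hour, n) and the year 2022. The name obs is made up for the example.

```r
library(dplyr)
library(lubridate)

# Invented toy observations, chosen so every case is covered
obs <- tibble::tribble(
  ~ID, ~Month, ~Day, ~Hour, ~n,
  "A",      6,    4,    20,  5,  # evening before the night starts: dropped
  "A",      6,    4,    22,  2,  # night starting 4/6
  "A",      6,    5,     3,  1,  # early morning: still the night starting 4/6
  "A",      6,    5,    12,  4,  # daytime: dropped
  "A",      6,    5,    23,  3   # night starting 5/6
)

nights <- obs %>%
  filter(Hour >= 21 | Hour <= 5) %>%
  mutate(date = make_date(2022, Month, Day),
         # morning hours belong to the night that started the day before
         night_date = if_else(Hour <= 5, date - days(1), date)) %>%
  group_by(ID, night_date) %>%
  summarise(No_prey_during_night = sum(n), .groups = "drop")

nights
```

The 22:00 and 03:00 rows end up in the same night (2 + 1 = 3 prey items), which is the behaviour the question asks for.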
You can do something like this:
library(dplyr)
library(lubridate)
read.csv("Tot_prey_example.csv") %>%
# create initial datetime variable, `night`
mutate(night = lubridate::make_datetime(2021, Month,Day,Hour)) %>%
# filter to nighttime hours
filter(Hour>=21 | Hour<=5) %>%
# flip datetime variable to the next day if hour is >=21
mutate(night = if_else(Hour>=21,night + 60*60*24, night)) %>%
# now group by the date part of `night`
group_by(ID,Night_No = as.Date(night)) %>%
# summarize the sum of prey
summarize(
No_prey_during_night = sum(n),
No_deliveries_during_night = sum(PreyDelivery)
) %>%
# replace the Night_No with a character variable showing both dates
mutate(Night_No = paste0(Night_No-1, "-", Night_No))
Output:
# A tibble: 88 × 4
# Groups: ID [2]
ID Night_No No_prey_during_night No_deliveries_during_night
<chr> <chr> <int> <int>
1 A 2021-06-04-2021-06-05 12 5
2 A 2021-06-05-2021-06-06 22 6
3 A 2021-06-06-2021-06-07 20 5
4 A 2021-06-07-2021-06-08 14 6
5 A 2021-06-08-2021-06-09 14 5
6 A 2021-06-09-2021-06-10 27 5
7 A 2021-06-10-2021-06-11 22 4
8 A 2021-06-11-2021-06-12 18 6
9 A 2021-06-12-2021-06-13 22 6
10 A 2021-06-13-2021-06-14 25 5
# … with 78 more rows

Comparing Dates Across Multiple Variables

I'm attempting to figure out the number of days between games and whether that has an impact on wins/losses. This is the information I'm starting with:
schedule:
Home  Away  Home_Final  Away_Final  Date
DAL   OAK   30          35          9/1/2015
KC    PHI   21          28          9/2/2015
This is the result I'd like to get:
Home  Away  Home_Final  Away_Final  Date      Home_Rest  Away_Rest  Adv   Adv_Days  Adv_Won
DAL   OAK   30          35          9/1/2015  null       null       null  null      null
KC    PHI   21          28          9/2/2015  null       null       null  null      null
DAL   PHI   28          7           9/9/2015  8          7          1     1         1
OAK   KC    14          21          9/9/2015  8          7          1     1         0
'Home_Rest' = The home team's number of days between games
'Away_Rest' = The away team's number of days between games
'Adv' = True/False: one side had a rest advantage
'Adv_Days' = The size of the advantage in days
'Adv_Won' = The side with the advantage won
Here is what I've tried. I was able to count the days between games for one team, but I can't wrap my head around how to do that once I bring in all the other teams.
library(tidyverse)
library(lubridate)
team_post <- schedule %>% filter(home == 'PHI' | visitor == 'PHI')
day_dif = interval(lag(ymd(team_post$date)), ymd(team_post$date))
team_post <- team_post %>% mutate(days_off = time_length(day_dif, "days"))
You can extend this to all teams using a grouped mutate. See the docs for group_by().
Something like
schedule %>%
group_by(vars_to_group_by) %>%
mutate(new_var = expr_to_calculate_new_var)
In future, it would be helpful if you included code to recreate a minimal dataset for your example.
The problem is that before you can calculate differences between dates, you must put your dataframe in a friendlier format. The Date applies to both teams, that is, one value applies to two columns in the dataframe, which makes it difficult to treat the two teams uniformly.
We'll add an id (row number) to the schedule dataframe, as a primary key, so it becomes easy to identify the rows later on.
schedule <- tidyr::tribble(
~Home, ~Away, ~Home_Final, ~Away_Final, ~Date,
"DAL", "OAK", 30, 35, "9/1/2015",
"KC", "PHI", 21, 28, "9/2/2015",
"DAL", "PHI", 28, 7, "9/9/2015",
"OAK", "KC", 14, 21, "9/9/2015"
)
schedule <- schedule %>% mutate(id = row_number())
> schedule
# A tibble: 4 x 6
Home Away Home_Final Away_Final Date id
<chr> <chr> <dbl> <dbl> <chr> <int>
1 DAL OAK 30 35 9/1/2015 1
2 KC PHI 21 28 9/2/2015 2
3 DAL PHI 28 7 9/9/2015 3
4 OAK KC 14 21 9/9/2015 4
Now we'll place your dataframe in a more 'relational' format.
schedule_relational <-
rbind(
schedule %>%
transmute(
id,
Team = Home,
Role = "Home",
Final = Home_Final,
Date
),
schedule %>%
transmute(
id,
Team = Away,
Role = "Away",
Final = Away_Final,
Date
)
)
> schedule_relational
# A tibble: 8 x 5
id Team Role Final Date
<int> <chr> <chr> <dbl> <chr>
1 1 DAL Home 30 9/1/2015
2 2 KC Home 21 9/2/2015
3 3 DAL Home 28 9/9/2015
4 4 OAK Home 14 9/9/2015
5 1 OAK Away 35 9/1/2015
6 2 PHI Away 28 9/2/2015
7 3 PHI Away 7 9/9/2015
8 4 KC Away 21 9/9/2015
How about that!
Now it becomes easy to calculate the difference between dates of games for each team:
schedule_relational <-
schedule_relational %>%
group_by(Team) %>%
arrange(mdy(Date)) %>% # sort chronologically; sorting the text would put "10/..." before "9/..."
mutate(Rest = mdy(Date) - mdy(lag(Date))) %>%
ungroup()
> schedule_relational
# A tibble: 8 x 6
id Team Role Final Date Rest
<int> <chr> <chr> <dbl> <chr> <drtn>
1 1 DAL Home 30 9/1/2015 NA days
2 1 OAK Away 35 9/1/2015 NA days
3 2 KC Home 21 9/2/2015 NA days
4 2 PHI Away 28 9/2/2015 NA days
5 3 DAL Home 28 9/9/2015 8 days
6 4 OAK Home 14 9/9/2015 8 days
7 3 PHI Away 7 9/9/2015 7 days
8 4 KC Away 21 9/9/2015 7 days
Observe that the appropriate function to convert dates in character format is mdy(), because your dates are in month/day/year format.
We're very close to a solution! Now all we have to do is to pivot your data back to the wider format. We'll join back the data on the home team and away team by using the id as our unique key.
result <-
schedule_relational %>%
pivot_wider(
names_from = Role,
values_from = c(Team, Final, Rest),
names_glue = "{Role}_{.value}"
)
> result
# A tibble: 4 x 8
id Date Home_Team Away_Team Home_Final Away_Final Home_Rest Away_Rest
<int> <chr> <chr> <chr> <dbl> <dbl> <drtn> <drtn>
1 1 9/1/2015 DAL OAK 30 35 NA days NA days
2 2 9/2/2015 KC PHI 21 28 NA days NA days
3 3 9/9/2015 DAL PHI 28 7 8 days 7 days
4 4 9/9/2015 OAK KC 14 21 8 days 7 days
We'll adjust column names and ordering, and make the final calculations now.
result_final <-
result %>%
transmute(
Home = Home_Team,
Away = Away_Team,
Home_Final,
Away_Final,
Date,
Home_Rest,
Away_Rest,
Adv = as.integer(Home_Rest != Away_Rest),
Adv_Days = as.integer(abs(Home_Rest - Away_Rest)), # size of the rest advantage in days
Adv_Won = as.integer(Home_Rest > Away_Rest & Home_Final > Away_Final | Away_Rest > Home_Rest & Away_Final > Home_Final)
)
> result_final
# A tibble: 4 x 10
Home Away Home_Final Away_Final Date Home_Rest Away_Rest Adv Adv_Days Adv_Won
<chr> <chr> <dbl> <dbl> <chr> <drtn> <drtn> <int> <int> <int>
1 DAL OAK 30 35 9/1/2015 NA days NA days NA NA NA
2 KC PHI 21 28 9/2/2015 NA days NA days NA NA NA
3 DAL PHI 28 7 9/9/2015 8 days 7 days 1 1 1
4 OAK KC 14 21 9/9/2015 8 days 7 days 1 1 0
It would be interesting if, instead of reducing Adv and Adv_Won to yes/no (discrete) values, you tracked the difference in days of rest and the difference in score. That way you could also correlate the results in terms of magnitude.
I've made the code step by step, so you can see intermediate results and understand it better. You may later coalesce some of the statements if you want.
There may be more convoluted solutions, but this is very clear to read and understand.
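As a sketch of the magnitude idea mentioned above: the two 9/9/2015 games are re-entered by hand below with rest stored as plain numbers of days, and the column names Rest_Diff and Score_Diff are invented for this example.

```r
library(dplyr)

# The two games that have rest information, typed in from the result above
games <- tibble::tribble(
  ~Home, ~Away, ~Home_Final, ~Away_Final, ~Home_Rest, ~Away_Rest,
  "DAL", "PHI",          28,           7,          8,          7,
  "OAK", "KC",           14,          21,          8,          7
)

games_signed <- games %>%
  mutate(
    Rest_Diff  = Home_Rest - Away_Rest,    # positive: home team had more rest
    Score_Diff = Home_Final - Away_Final   # positive: home team won
  )

games_signed
```

With a full season of games, something like cor(games_signed$Rest_Diff, games_signed$Score_Diff) would then relate the size of the rest advantage to the size of the win. Two rows are far too few for that to mean anything here.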

Extract data from table with condition

I have data. Here's an example:
A tibble: 1,296 x 4
id treatmentstart protocoltype PDL1_date
<dbl> <chr> <chr> <chr>
1 1111 05/11/2020 Chemoradiation 05/03/2020
2 22222 03/03/2021 Chemo plus PD-1 plus CTLA-4 01/03/2020
3 333333 08/04/2018 Anti-VEGF plus Chemo NA
4 444444 07/06/2019 Chemoradiation 03/08/2018
5 555555 09/12/2020 Chemo plus PDl-1 07/11/2020
6 666666 05/06/2018 PD-1 08/02/2017
7 666666 07/07/2018 Chemotherapy 08/02/2017
8 777777 07/05/2019 Chemotherapy 06/03/2020
9 999999 08/08/2018 Chemoradiation 08/05/2020
10 999999 12/07/2017 PDL-1 08/05/2020
As you can see, some of the IDs are repeated, but with different treatments (type of protocol).
I need to extract the IDs that meet the following conditions:
The test date (PDL1_date) must be earlier than the treatment start date, and the treatment type must include "PD1" or "PDL1". If an ID has multiple treatments, I need to compare the treatment dates, take the earliest one, and compare it with the test date: if the test is earlier, the ID qualifies; otherwise it does not.
In conclusion: only those who have a test date before a certain type of treatment ("PD1" or "PDL1") and have not received any other treatment before the test date should be selected. Here is an example of what should come up:
A tibble: 1,296 x 4
id treatmentstart protocoltype PDL1_date
<dbl> <chr> <chr> <chr>
1 22222 03/03/2021 Chemo plus PD-1 plus CTLA-4 01/03/2020
2 555555 09/12/2020 Chemo plus PDl-1 07/11/2020
6 666666 05/06/2018 PD-1 08/02/2017
So 1111, 444444, and 777777 are excluded by the treatment condition (they did not receive any PD1/PDL1); 333333 has no PDL1_date; 999999 received a PD-1/PDL-1 treatment, but it was before PDL1_date, and also received another treatment before PDL1_date.
I have tried dplyr filter (PDL1_date<treatmentstart), but I am stuck on comparing ID with same ID.
Please help.
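You should share reproducible data so that we can help you, but maybe this will get you started:
library(dplyr)
data %>%
  # the dates are stored as text; parse them before comparing (day/month/year assumed)
  mutate(treatmentstart = as.Date(treatmentstart, format = "%d/%m/%Y")) %>%
  filter(grepl("PD-1|PD[Ll]-1", protocoltype)) %>%
  group_by(id) %>%
  filter(treatmentstart == min(treatmentstart)) %>%
  ungroup()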
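A fuller sketch of the condition described in the question, using the ten example rows typed in by hand. It rests on two assumptions: the dates are day/month/year, and "no other treatment before the test" means the earliest treatment of any kind must start after PDL1_date. The helper columns start, test, and is_pd are names invented for the example.

```r
library(dplyr)
library(lubridate)

data <- tibble::tribble(
  ~id,     ~treatmentstart, ~protocoltype,                 ~PDL1_date,
  1111,    "05/11/2020",    "Chemoradiation",              "05/03/2020",
  22222,   "03/03/2021",    "Chemo plus PD-1 plus CTLA-4", "01/03/2020",
  333333,  "08/04/2018",    "Anti-VEGF plus Chemo",        NA_character_,
  444444,  "07/06/2019",    "Chemoradiation",              "03/08/2018",
  555555,  "09/12/2020",    "Chemo plus PDl-1",            "07/11/2020",
  666666,  "05/06/2018",    "PD-1",                        "08/02/2017",
  666666,  "07/07/2018",    "Chemotherapy",                "08/02/2017",
  777777,  "07/05/2019",    "Chemotherapy",                "06/03/2020",
  999999,  "08/08/2018",    "Chemoradiation",              "08/05/2020",
  999999,  "12/07/2017",    "PDL-1",                       "08/05/2020"
)

selected <- data %>%
  mutate(start = dmy(treatmentstart),   # assumption: day/month/year
         test  = dmy(PDL1_date),
         # matches PD-1, PD1, PDL-1, PDL1, PDl-1, ...
         is_pd = grepl("PD-?L?-?1", protocoltype, ignore.case = TRUE)) %>%
  group_by(id) %>%
  filter(!is.na(test),        # a test date exists
         min(start) > test,   # no treatment of any kind before the test
         is_pd) %>%           # keep only the PD-1/PD-L1 rows
  ungroup() %>%
  select(id, treatmentstart, protocoltype, PDL1_date)

selected
```

On these rows the result contains IDs 22222, 555555, and 666666, which agrees with the expected output in the question.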

R Filter by sequence of events

I have event log data. For a reproducible example, let's use the data from eventdataR
eventdataR::patients
## look at patient 1 sequence
eventdataR::patients %>% dplyr::filter(patient == '1')
# A tibble: 12 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 1 r1 1 start 2017-01-02 11:41:53 1
2 Triage and A~ 1 r2 501 start 2017-01-02 12:40:20 2
3 Blood test 1 r3 1001 start 2017-01-05 08:59:04 3
4 MRI SCAN 1 r4 1238 start 2017-01-05 21:37:12 4
5 Discuss Resu~ 1 r6 1735 start 2017-01-07 07:57:49 5
6 Check-out 1 r7 2230 start 2017-01-09 17:09:43 6
7 Registration 1 r1 1 complete 2017-01-02 12:40:20 7
8 Triage and A~ 1 r2 501 complete 2017-01-02 22:32:25 8
9 Blood test 1 r3 1001 complete 2017-01-05 14:34:27 9
10 MRI SCAN 1 r4 1238 complete 2017-01-06 01:54:23 10
11 Discuss Resu~ 1 r6 1735 complete 2017-01-07 10:18:08 11
12 Check-out 1 r7 2230 complete 2017-01-09 19:45:45 12
In the above example, we can see the sequence of handling for patient 1 over a period of time. We can imagine that different patients would have different sequences or went through different number of sequences.
Now let's say I'm interested in a specific sequence and want to know which patients had gone through this specific sequence. How can I filter this dataset by this specific sequence so that I can get to know who these patients are?
The traces function can help me identify the unique sequences and their frequencies
patients %>% traces
# A tibble: 7 x 3
trace absolute_frequen~ relative_frequen~
<chr> <int> <dbl>
1 Registration,Triage and Assessment,X-Ray,Discuss R~ 258 0.516
2 Registration,Triage and Assessment,Blood test,MRI ~ 234 0.468
3 Registration,Triage and Assessment,Blood test,MRI ~ 2 0.004
4 Registration,Triage and Assessment,X-Ray 2 0.004
5 Registration,Triage and Assessment 2 0.004
6 Registration,Triage and Assessment,X-Ray,Discuss R~ 1 0.002
7 Registration,Triage and Assessment,Blood test 1 0.002
Let's say I'm interested in the sequence from row 5, that is, patients who had exclusively the sequence Registration -> Triage and Assessment. There are 2 patients who had this sequence.
It seems the library doesn't provide a ready-made function to extract this. At least from this doc page, https://www.bupar.net/subsetting.html#trace_length, it's not available.
Basically, given an exhaustive sequence, return all the patients who went through exactly this sequence.
In fact, if I can rebuild the trace and map it back to the original dataset, that should allow for a simple dplyr::filter. But this may not be ideal if I'm interested in an open-ended sequence, for example, finding all patients who started with Registration -> Triage and Assessment, followed by any sequence.
Here's my long-winded attempt
# get trace for each patient
patient_trace <- as_tibble(patients) %>% group_by(patient) %>% dplyr::filter(registration_type == 'complete') %>%
summarise(trace = paste(handling, collapse = ","), n = n())
# identify the sequence trace of interest
trace_summary <- patients %>% traces
# here we want to see patients who had the sequence from row 5
res <- patients %>%
dplyr::filter(patient %in% c(patient_trace %>% dplyr::filter(trace %in% trace_summary$trace[5]) %>% .$patient)) %>%
dplyr::filter(registration_type == 'complete') %>%
arrange(patient, time)
# A tibble: 4 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 499 r1 499 complete 2018-05-01 22:57:38 1
2 Triage and As~ 499 r2 999 complete 2018-05-04 23:53:27 3
3 Registration 500 r1 500 complete 2018-05-02 01:28:23 2
4 Triage and As~ 500 r2 1000 complete 2018-05-05 07:16:02 4
You can filter them with dplyr :
library(dplyr)
req_sequence <- c('Registration', 'Triage and Assessment')
eventdataR::patients %>%
group_by(patient) %>%
filter(all(handling == req_sequence)) %>%
filter(registration_type == 'complete') %>%
ungroup
# handling patient employee handling_id registration_type time .order
# <fct> <chr> <fct> <chr> <fct> <dttm> <int>
#1 Registration 499 r1 499 complete 2018-05-01 22:57:38 3220
#2 Registration 500 r1 500 complete 2018-05-02 01:28:23 3221
#3 Triage and Assessment 499 r2 999 complete 2018-05-04 23:53:27 3720
#4 Triage and Assessment 500 r2 1000 complete 2018-05-05 07:16:02 3721
To be sure of the output and to avoid any recycling effect, we can filter registration_type == 'complete' first and also check that length(req_sequence) equals the number of rows for the patient id.
eventdataR::patients %>%
filter(registration_type == 'complete') %>%
group_by(patient) %>%
filter(length(req_sequence) == n() && all(handling == req_sequence)) %>%
ungroup
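The open-ended case from the question (patients who start with a given sequence, followed by anything) can be sketched in the same style. The toy log below is invented, is assumed to be already reduced to one row per completed activity and sorted by time within each patient, and starts_with_seq() is a helper name made up for this example.

```r
library(dplyr)

# Invented toy log: one row per completed activity, in time order
toy_log <- tibble::tribble(
  ~patient, ~handling,
  "1",      "Registration",
  "1",      "Triage and Assessment",
  "1",      "X-Ray",
  "2",      "Registration",
  "2",      "Triage and Assessment",
  "3",      "Registration"
)

req_prefix <- c("Registration", "Triage and Assessment")

# TRUE when x begins with every element of prefix, in order
starts_with_seq <- function(x, prefix) {
  length(x) >= length(prefix) &&
    all(as.character(x[seq_along(prefix)]) == prefix)
}

matches <- toy_log %>%
  group_by(patient) %>%
  filter(starts_with_seq(handling, req_prefix)) %>%
  ungroup()

unique(matches$patient)
```

Patients 1 and 2 are kept and patient 3 is dropped. The length check plays the same role as the n() check above: it prevents recycling from producing a false match on a short trace.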

Creating a new Data.Frame from variable values

I am currently working on a task that requires me to query a list of stocks from an SQL db.
The problem is that it is a list where there are 1:n stocks traded per date. I want to calculate the share of each stock in the portfolio on a given day (see example) and pass it to a new data frame. In other words, date x occurs 2 times (once for stock A and once for stock B), and I want to pull it together so that date x occurs only once, with the new values.
'data.frame': 1010 obs. of 5 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Date : Date, format: "2019-11-22" "2019-11-21" "2019-11-20" "2019-11-19" ...
$ Close: num 52 51 50.1 50.2 50.2 ...
$ Volume : num 5415 6196 3800 4784 6189 ...
$ Stock_ID : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"), Close=c(50,55,56,10,11,12,200),Volume=c(100,110,150,60,70,80,30),Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
*cannot transfer the date to a date variable in this example
I would like to have a new dataframe that generates the Value traded per day, the weight of each stock, and the daily returns per day, while keeping the number of stocks variable.
I hope I translated the issue properly so that I can receive help.
Thank you!
I think the easiest way to do this would be to use the dplyr package. You may need to read some documentation, but the mutate and group_by functions may be able to do what you want. mutate lets you modify the current dataframe by either adding a new column or changing the existing data.
Let's start with a reproducible dataset
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"),
Close=c(50,55,56,10,11,12,200),
Volume=c(100,110,150,60,70,80,30),
Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
library(magrittr)
library(dplyr)
dat2 <- RawInput %>%
group_by(Date, Stock_ID) %>% #this example only has one stock type but i imagine you want to group by stock
mutate(CloseMean=mean(Close),
CloseSum=sum(Close),
VolumeMean=mean(Volume),
VolumeSum=sum(Volume)) #what ever computation you need to do with
#multiple stock values for a given date goes here
dat2 %>% select(Stock_ID, Date, CloseMean, CloseSum, VolumeMean,VolumeSum) %>% distinct() #dat2 will still be the same size as dat, thus use the distinct() function to reduce it to unique values
# A tibble: 7 x 6
# Groups: Date, Stock_ID [7]
Stock_ID Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2017-22-11 50 50 100 100
2 1 2017-22-12 55 55 110 110
3 1 2017-22-13 56 56 150 150
4 2 2017-22-11 10 10 60 60
5 2 2017-22-12 11 11 70 70
6 2 2017-22-13 12 12 80 80
7 3 2017-22-11 200 200 30 30
The data set that you provided has only unique Stock_ID and Date combinations, so nothing was actually done with the data. However, if you remove Stock_ID from the grouping, you can see how this would work
dat2 <- RawInput %>%
group_by(Date) %>%
mutate(CloseMean=mean(Close),
CloseSum=sum(Close),
VolumeMean=mean(Volume),
VolumeSum=sum(Volume))
dat2 %>% select(Date, CloseMean, CloseSum, VolumeMean,VolumeSum) %>% distinct()
# A tibble: 3 x 5
# Groups: Date [3]
Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <dbl> <dbl> <dbl> <dbl>
1 2017-22-11 86.7 260 63.3 190
2 2017-22-12 33 66 90 180
3 2017-22-13 34 68 115 230
After reading your first reply: you will have to be specific about how you are trying to calculate the weight. Also define your end result.
I'm going to assume weight is just the percentage of the total cost, and that the end result shows, for each date, the weight per stock. In other words, a matrix of dates and stock IDs
library(tidyr)
RawInput %>%
group_by(Date) %>%
mutate(weight=Close/sum(Close)) %>%
select(Date, weight, Stock_ID) %>%
spread(key = "Stock_ID", value = "weight", fill = 0)
# A tibble: 3 x 4
# Groups: Date [3]
Date `1` `2` `3`
<fct> <dbl> <dbl> <dbl>
1 2017-22-11 0.192 0.0385 0.769
2 2017-22-12 0.833 0.167 0
3 2017-22-13 0.824 0.176 0
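The question also asks for the value traded per day and daily returns. Here is a rough sketch under stated assumptions: value is taken to be Close * Volume, weight is each stock's share of that day's value (a different definition from the Close-based weight above), and the return is the simple day-over-day change in Close. The dates are invented valid dates, since the ones in RawInput do not parse, and prices, Value, Weight, and Return are names chosen for the example.

```r
library(dplyr)

# Same shape as RawInput, but with invented, valid dates
prices <- data.frame(
  Date = as.Date(c("2017-11-22", "2017-11-23", "2017-11-24",
                   "2017-11-22", "2017-11-23", "2017-11-24")),
  Close = c(50, 55, 56, 10, 11, 12),
  Volume = c(100, 110, 150, 60, 70, 80),
  Stock_ID = factor(c(1, 1, 1, 2, 2, 2))
)

daily <- prices %>%
  mutate(Value = Close * Volume) %>%           # assumed definition of value traded
  group_by(Date) %>%
  mutate(Weight = Value / sum(Value)) %>%      # share of that day's total value
  group_by(Stock_ID) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(Return = Close / lag(Close) - 1) %>%  # simple return; NA on the first day
  ungroup()

daily
```

Because the grouping is by Date for the weights and by Stock_ID for the returns, the number of stocks per day can vary without changing the code.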
