I have the following data frames:
structure(list(id = c(1, 2, 3, 4, 5), time = structure(c(1484092800,
1485907200, 1490227200, 1490918400, 1491955200), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
id time
<dbl> <dttm>
1 1 2017-01-11 00:00:00
2 2 2017-02-01 00:00:00
3 3 2017-03-23 00:00:00
4 4 2017-03-31 00:00:00
5 5 2017-04-12 00:00:00
structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5), time = structure(c(1466553600,
1465948800, 1453420800, 1485302400, 1433030400, 1421712000, 1453852800,
1485302400, 1485993600, 1517529600, 1400544000, 1434067200, 1466985600,
1497484800, 1390003200, 1516060800, 1464825600, 1497916800, 1527638400,
1454025600, 1390608000, 1421712000, 1466467200, 1453852800, 1485820800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), score = c(3,
2, 5, 4, 5, 24.2, 24.8, 25.4, 26, 26.6, 36.2, 36.8, 37.4, 38,
38.6, 44, 44.6, 45.2, 45.8, 46.4, 59, 59.6, 60.2, 60.8, 61.4)), row.names = c(NA,
-25L), class = c("tbl_df", "tbl", "data.frame"))
id time score
<dbl> <dttm> <dbl>
1 1 2016-06-22 00:00:00 3
2 1 2016-06-15 00:00:00 2
3 1 2016-01-22 00:00:00 5
4 1 2017-01-25 00:00:00 4
5 1 2015-05-31 00:00:00 5
6 2 2015-01-20 00:00:00 24.2
7 2 2016-01-27 00:00:00 24.8
8 2 2017-01-25 00:00:00 25.4
9 2 2017-02-02 00:00:00 26
10 2 2018-02-02 00:00:00 26.6
# … with 15 more rows
I would like to get, for each row of df, the score from sdf whose time is closest to the time in df. The match also has to respect the id. I already tried this approach from "Join two data frames in R based on closest timestamp":
d <- function(x,y) abs(x-y) # define the distance function
idx <- sapply( df$time, function(x) which.min( d(x,sdf$time) ))
cbind(df,sdf[idx,-1,drop=FALSE])
id time time score
1 1 2017-01-11 2017-01-25 4
2 2 2017-02-01 2017-02-02 26
3 3 2017-03-23 2017-02-02 26
4 4 2017-03-31 2017-02-02 26
5 5 2017-04-12 2017-06-15 38
But this does not take the id into account. I tried to incorporate the id, but it did not work. Any ideas? Thank you in advance :)
We can join the data frames by id, calculate the time difference, and keep the observation with the minimal time difference for each id (here df1 is df and df2 is sdf from the question; see Data below):
library(tidyverse)
df2 %>%
left_join(df1, by = "id") %>%
mutate(time_dif = abs(time.x - time.y)) %>%
group_by(id) %>%
filter(time_dif == min(time_dif))
# A tibble: 5 x 5
# Groups: id [5]
id time.x score time.y time_dif
<dbl> <dttm> <dbl> <dttm> <drtn>
1 1 2017-01-25 00:00:00 4 2017-01-11 00:00:00 14 days
2 2 2017-02-02 00:00:00 26 2017-02-01 00:00:00 1 days
3 3 2017-06-15 00:00:00 38 2017-03-23 00:00:00 84 days
4 4 2017-06-20 00:00:00 45.2 2017-03-31 00:00:00 81 days
5 5 2017-01-31 00:00:00 61.4 2017-04-12 00:00:00 71 days
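If two scores for an id are equally close, the filter above keeps both rows. A minimal sketch of a variant that keeps exactly one row per id, using slice_min() with with_ties = FALSE; the toy tables targets and scores and their values are made up for illustration and are not the question's data:

```r
library(dplyr)

# Hypothetical toy data: one target time per id, several scored times per id
targets <- tibble(
  id   = c(1, 2),
  time = as.POSIXct(c("2017-01-11", "2017-02-01"), tz = "UTC")
)
scores <- tibble(
  id    = c(1, 1, 2, 2),
  time  = as.POSIXct(c("2016-06-22", "2017-01-25", "2017-01-25", "2017-02-02"),
                     tz = "UTC"),
  score = c(3, 4, 25.4, 26)
)

closest <- scores %>%
  # Both tables have a `time` column, so give each a readable suffix
  inner_join(targets, by = "id", suffix = c("_score", "_target")) %>%
  # Absolute gap in days, as a plain number
  mutate(time_dif = abs(as.numeric(difftime(time_score, time_target,
                                            units = "days")))) %>%
  group_by(id) %>%
  # Keep the single closest row per id, even when there are ties
  slice_min(time_dif, n = 1, with_ties = FALSE) %>%
  ungroup()

closest
```

Which of several tied rows is kept depends on row order, so sort the scores first if that matters.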
Data
df1 <- structure(list(id = c(1, 2, 3, 4, 5), time = structure(c(1484092800,
1485907200, 1490227200, 1490918400, 1491955200), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
df2 <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5), time = structure(c(1466553600,
1465948800, 1453420800, 1485302400, 1433030400, 1421712000, 1453852800,
1485302400, 1485993600, 1517529600, 1400544000, 1434067200, 1466985600,
1497484800, 1390003200, 1516060800, 1464825600, 1497916800, 1527638400,
1454025600, 1390608000, 1421712000, 1466467200, 1453852800, 1485820800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), score = c(3,
2, 5, 4, 5, 24.2, 24.8, 25.4, 26, 26.6, 36.2, 36.8, 37.4, 38,
38.6, 44, 44.6, 45.2, 45.8, 46.4, 59, 59.6, 60.2, 60.8, 61.4)), row.names = c(NA,
-25L), class = c("tbl_df", "tbl", "data.frame"))
Related
I have a dataframe games_h. This is just a snippet of the table; the full table has many teams and is sorted by date, team, and game number. I am trying to create a weighted rolling average grouped by team, with the most recent game weighted more heavily than the game before it. So the average would be (Game_1 * 1 + Game_2 * 2) / 3, or equivalently weights that sum to 1 with the same ratio, weights = c(1 - .667, .667).
dput(games_h)
structure(list(GameId = c(16, 16, 37, 37, 57, 57), GameDate = structure(c(17905,
17905, 17916, 17916, 17926, 17926), class = "Date"), NeutralSite = c(0,
0, 0, 0, 0, 0), AwayTeam = c("Virginia Cavaliers", "Virginia Cavaliers",
"Florida State Seminoles", "Florida State Seminoles", "Syracuse Orange",
"Syracuse Orange"), HomeTeam = c("Boston College Eagles", "Boston College Eagles",
"Boston College Eagles", "Boston College Eagles", "Boston College Eagles",
"Boston College Eagles"), Team = c("Virginia Cavaliers", "Boston College Eagles",
"Florida State Seminoles", "Boston College Eagles", "Syracuse Orange",
"Boston College Eagles"), Home = c(0, 1, 0, 1, 0, 1), Score = c(83,
56, 82, 87, 77, 71), AST = c(17, 6, 12, 16, 11, 13), TOV = c(10,
8, 9, 13, 11, 11), STL = c(5, 4, 4, 6, 6, 5), BLK = c(6, 0, 4,
4, 1, 0), Rebounds = c(38, 18, 36, 33, 23, 23), ORB = c(7, 4,
16, 10, 7, 6), DRB = c(31, 14, 20, 23, 16, 17), FGA = c(55, 57,
67, 55, 52, 45), FGM = c(33, 22, 28, 27, 29, 21), X3FGM = c(8,
7, 8, 13, 11, 9), X3FGA = c(19, 25, 25, 21, 26, 22), FTA = c(14,
9, 24, 28, 15, 23), FTM = c(9, 5, 18, 20, 8, 20), Fouls = c(16,
12, 25, 20, 19, 19), Game_Number = 1:6, Count = c(1, 1, 1, 1,
1, 1)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), groups = structure(list(HomeTeam = "Boston College Eagles",
.rows = structure(list(1:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))
Here is an example output of the score column.
Here is my failed attempt. The function works correctly, but I cannot apply it to the entire dataframe by group.
weighted_avg<-function(x, wt1, wt2) {
rs1 = rollsum(x,1,align = "right")
rs2 = rollsum(x,2,align = "right")
rs1=rs1[-1]
rs3 = rs2 - rs1
weighted_avg= ((rs3 * wt2)+ (rs1*wt1))/(wt1+wt2)
return(weighted_avg)
}
weighted_avg(csum$Score_Y, 2, 1)
apply(csum$Score_Y , 2, weighted_avg, wt1 = 2, wt2=1)
test<-csum %>%
group_by(Team)%>%
group_map(across(c(Score:Fouls), weighted_avg(.x$Team, 2, 1) ))
test<-csum %>%
group_by(Team)%>%
group_walk(across(c(Score:Fouls),weighted_avg(.~,2,1) ))
Here are some notes about the code:
I used the slider::slide_dbl function. First we specify the vector for which we would like to compute the moving average, here Score.
As we need a sliding window of length 2, I used the .before argument of slide_dbl so that the previous value and the current value are both used to calculate the moving average.
I also set the .complete argument to TRUE to make sure the moving average is only calculated when a previous value exists. In other words, there is no moving average for the first row.
For more info, check the documentation for the slider package.
library(tidyverse)
library(slider)
df %>%
group_by(HomeTeam) %>%
summarise(Example = c(NA, slide_dbl(Score, .before = 1, .complete = TRUE,
.f = ~ (.x[1] * 1 + .x[2] * 2) / 3)))
`summarise()` has grouped output by 'HomeTeam'. You can override using the `.groups` argument.
# A tibble: 7 × 2
# Groups: HomeTeam [1]
HomeTeam Example
<chr> <dbl>
1 Boston College Eagles NA
2 Boston College Eagles NA
3 Boston College Eagles 65
4 Boston College Eagles 73.3
5 Boston College Eagles 85.3
6 Boston College Eagles 80.3
7 Boston College Eagles 73
To calculate the moving average for all numeric columns, you could try:
df %>%
group_by(HomeTeam) %>%
summarise(across(where(is.numeric), ~ c(NA, slide_dbl(., .before = 1, .complete = TRUE,
.f = ~ (.x[1] * 1 + .x[2] * 2) / 3)))) %>%
ungroup()
`summarise()` has grouped output by 'HomeTeam'. You can override using the `.groups` argument.
# A tibble: 7 × 21
HomeTeam GameId NeutralSite Home Score AST TOV STL BLK Rebounds ORB DRB FGA FGM
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Boston C… NA NA NA NA NA NA NA NA NA NA NA NA NA
2 Boston C… NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Boston C… 16 0 0.667 65 9.67 8.67 4.33 2 24.7 5 19.7 56.3 25.7
4 Boston C… 30 0 0.333 73.3 10 8.67 4 2.67 30 12 18 63.7 26
5 Boston C… 37 0 0.667 85.3 14.7 11.7 5.33 4 34 12 22 59 27.3
6 Boston C… 50.3 0 0.333 80.3 12.7 11.7 6 2 26.3 8 18.3 53 28.3
7 Boston C… 57 0 0.667 73 12.3 11 5.33 0.333 23 6.33 16.7 47.3 23.7
# … with 7 more variables: X3FGM <dbl>, X3FGA <dbl>, FTA <dbl>, FTM <dbl>, Fouls <dbl>,
# Game_Number <dbl>, Count <dbl>
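For a window of just two games, the same weighting can also be sketched without slider, using dplyr::lag() inside a grouped mutate(). The games table, its values and the Score_wavg column name below are hypothetical; note that, unlike the output above, this aligns each average with the current game rather than shifting it down by one row:

```r
library(dplyr)

# Hypothetical toy data: scores per team, already in game order
games <- tibble(
  Team  = c("A", "A", "A", "B", "B"),
  Score = c(56, 87, 71, 83, 82)
)

weighted <- games %>%
  group_by(Team) %>%
  # Previous game gets weight 1, current game gets weight 2.
  # lag() returns NA for the first game of each team, so that row stays NA.
  mutate(Score_wavg = (lag(Score) * 1 + Score * 2) / 3) %>%
  ungroup()

weighted
```

Grouping by Team keeps the first game of one team from borrowing the last score of another.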
I am trying to find the sum of each column in the data frame df_count (cars, buses, trucks) over the time window given by each row of the data frame start_end.
For example, row 1 of start_end ranges from 2021-06-12 00:15:00 to 2021-06-12 00:55:00.
I want the sum of cars (for example) over the rows of df_count whose timestamp in column 1 falls between these two times (rows 5 to 12).
df_count <- structure(list(date = structure(c(1623456000, 1623456300, 1623456600,
1623456900, 1623457200, 1623457500, 1623457800, 1623458100, 1623458400,
1623458700, 1623459000, 1623459300, 1623459600, 1623459900, 1623460200,
1623460500, 1623460800, 1623461100, 1623461400, 1623461700, 1623462000,
1623462300, 1623462600, 1623462900, 1623463200, 1623463500, 1623463800,
1623464100, 1623464400, 1623464700), tzone = "UTC", class = c("POSIXct",
"POSIXt")), cars = c(45, 45, 45, 52, 52, 52, 46, 46, 46, 34,
34, 34, 29, 29, 29, 36, 36, 36, 17, 17, 17, 18, 18, 18, 14, 14,
14, 3, 3, 3), buses = c(4, 4, 4, 7, 7, 7, 5, 5, 5, 4, 4, 4, 5,
5, 5, 4, 4, 4, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1), trucks = c(3,
3, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2,
2, 2, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
start_end <- structure(list(start_co2_plume = c("2021-06-12 00:15:00", "2021-06-12 00:55:00",
"2021-06-12 01:15:00", "2021-06-12 01:30:00", "2021-06-12 02:00:00",
"2021-06-12 02:25:00", "2021-06-12 03:00:00", "2021-06-12 03:20:00",
"2021-06-12 03:45:00", "2021-06-12 03:55:00", "2021-06-12 04:20:00",
"2021-06-12 04:35:00", "2021-06-12 04:50:00", "2021-06-12 05:10:00",
"2021-06-12 05:40:00", "2021-06-12 05:50:00", "2021-06-12 06:00:00",
"2021-06-12 06:10:00", "2021-06-12 06:25:00", "2021-06-12 06:35:00",
"2021-06-12 06:45:00", "2021-06-12 06:55:00", "2021-06-12 08:10:00",
"2021-06-12 08:30:00", "2021-06-12 08:55:00", "2021-06-12 09:45:00",
"2021-06-12 10:05:00", "2021-06-12 10:35:00", "2021-06-12 11:05:00",
"2021-06-12 11:25:00"), end_co2_plume = c("2021-06-12 00:55:00",
"2021-06-12 01:15:00", "2021-06-12 01:30:00", "2021-06-12 02:00:00",
"2021-06-12 02:25:00", "2021-06-12 03:00:00", "2021-06-12 03:20:00",
"2021-06-12 03:35:00", "2021-06-12 03:55:00", "2021-06-12 04:10:00",
"2021-06-12 04:35:00", "2021-06-12 04:50:00", "2021-06-12 05:10:00",
"2021-06-12 05:30:00", "2021-06-12 05:50:00", "2021-06-12 06:00:00",
"2021-06-12 06:10:00", "2021-06-12 06:25:00", "2021-06-12 06:35:00",
"2021-06-12 06:45:00", "2021-06-12 06:55:00", "2021-06-12 07:10:00",
"2021-06-12 08:30:00", "2021-06-12 08:55:00", "2021-06-12 09:10:00",
"2021-06-12 10:05:00", "2021-06-12 10:25:00", "2021-06-12 10:50:00",
"2021-06-12 11:25:00", "2021-06-12 11:45:00")), row.names = c(NA,
30L), class = "data.frame")
The code below produces the desired output. The time zone of the dates has to be assumed, so I assumed they are all in the same time zone (UTC).
library(purrr)
library(dplyr)
# Convert dates in start_end from character vectors to date classes
# Assumes the times are in the same time zone
start_end <- start_end %>% mutate(start_date = as.POSIXct(start_co2_plume, tz = "UTC"),
end_date = as.POSIXct(end_co2_plume, tz = "UTC"))
# For each row in start_end, subset df_count to the rows whose dates fall in
# the interval defined by the start_date and end_date values for that row.
# For each automobile column, sum the values and add an index to tell us which
# interval it came from.
results <-
pmap(list(start_end$start_date, start_end$end_date, 1:nrow(start_end)),
function(start, end, ind) {
df_count %>%
filter((date >= start) & (date < end)) %>%
select(-date) %>%
summarise(across(everything(), sum)) %>%
mutate(interval_id = ind,
start = start,
end = end)
})
# Combine into a single data.frame
results %>% bind_rows()
#> # A tibble: 30 x 6
#> cars buses trucks interval_id start end
#> <dbl> <dbl> <dbl> <int> <dttm> <dttm>
#> 1 362 44 26 1 2021-06-12 00:15:00 2021-06-12 00:55:00
#> 2 121 19 13 2 2021-06-12 00:55:00 2021-06-12 01:15:00
#> 3 108 12 9 3 2021-06-12 01:15:00 2021-06-12 01:30:00
#> 4 105 12 15 4 2021-06-12 01:30:00 2021-06-12 02:00:00
#> 5 48 5 5 5 2021-06-12 02:00:00 2021-06-12 02:25:00
#> 6 3 1 1 6 2021-06-12 02:25:00 2021-06-12 03:00:00
#> 7 0 0 0 7 2021-06-12 03:00:00 2021-06-12 03:20:00
#> 8 0 0 0 8 2021-06-12 03:20:00 2021-06-12 03:35:00
#> 9 0 0 0 9 2021-06-12 03:45:00 2021-06-12 03:55:00
#> 10 0 0 0 10 2021-06-12 03:55:00 2021-06-12 04:10:00
#> # ... with 20 more rows
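The half-open window used in the filter above (start included, end excluded) is what stops a reading at a shared boundary from being counted in two consecutive intervals. A minimal sketch of that rule on made-up data, using rowwise() instead of pmap(); the counts and intervals tables and their values are hypothetical:

```r
library(dplyr)

# Hypothetical toy data: one reading every 5 minutes
counts <- tibble(
  date = as.POSIXct("2021-06-12 00:00:00", tz = "UTC") + 300 * 0:5,
  cars = c(1, 2, 3, 4, 5, 6)
)
# Two back-to-back windows that share the 00:15 boundary
intervals <- tibble(
  start = as.POSIXct(c("2021-06-12 00:05:00", "2021-06-12 00:15:00"), tz = "UTC"),
  end   = as.POSIXct(c("2021-06-12 00:15:00", "2021-06-12 00:30:00"), tz = "UTC")
)

sums <- intervals %>%
  rowwise() %>%
  # start <= date < end: the 00:15 reading belongs to the second window only
  mutate(cars = sum(counts$cars[counts$date >= start & counts$date < end])) %>%
  ungroup()

sums
```

With these values the first window sums the 00:05 and 00:10 readings and the second sums 00:15 through 00:25, so every reading is counted at most once.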
I have time series data from 2 sensors recording independently. They were started at different times and record data at different intervals: sensor 1 records every second, while sensor 2 records every 2 seconds. I want to combine both datasets into a single dataframe in order to plot them with ggplot. Can someone help me out? If there are better options than ggplot and dataframes, please let me know. Thanks for the help. I included sample data below (not the actual data; let me know if I did not include the right sample):
dput(reading1)
structure(list(time = structure(c(-2209030842, -2209030841, -2209030840,
-2209030839, -2209030838, -2209030837, -2209030836, -2209030835,
-2209030834, -2209030833, -2209030832, -2209030831, -2209030830,
-2209030829, -2209030828, -2209030827, -2209030826, -2209030825,
-2209030824), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`reading 1` = c(0.004, 0.003, 0.003, 0.013, 0.021, 0.008,
0.004, 0.005, 0.004, 0.007, 0.003, 0.004, 0.002, 0.003, 0.004,
0.004, 0.005, 0.001, 0.003)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -19L))
dput(reading2)
structure(list(Date = structure(c(-2209031012, -2209031009, -2209031007,
-2209031005, -2209031003, -2209030982, -2209030981, -2209030976,
-2209030974, -2209030972, -2209030970, -2209030949, -2209030882,
-2209030879, -2209030877, -2209030875, -2209030873, -2209030871,
-2209030850, -2209030849, -2209030838, -2209030816, -2209030814,
-2209030811, -2209030808, -2209030806, -2209030804, -2209030783,
-2209030782, -2209030780, -2209030778, -2209030775, -2209030773,
-2209030771, -2209030750, -2209030749, -2209030747, -2209030742,
-2209030740, -2209030738, -2209030717, -2209030705, -2209030684,
-2209030683, -2209030681, -2209030679, -2209030676, -2209030674,
-2209030672, -2209030651, -2209030650, -2209030648, -2209030646,
-2209030644, -2209030641, -2209030639), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), `reading 2` = c(8, 8, 8, 8, 8, 6,
6, 8, 8, 8, 8, 6, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6,
5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -56L))
You could combine them using dplyr like this:
combined <- bind_rows(
reading1 %>% rename(reading = `reading 1`) %>% mutate(sensor = 1),
reading2 %>% rename(reading = `reading 2`, time = Date) %>%
mutate(sensor = 2)) %>%
arrange(time)
combined
#> # A tibble: 75 x 3
#> time reading sensor
#> <dttm> <dbl> <dbl>
#> 1 1899-12-31 12:16:28 8 2
#> 2 1899-12-31 12:16:31 8 2
#> 3 1899-12-31 12:16:33 8 2
#> 4 1899-12-31 12:16:35 8 2
#> 5 1899-12-31 12:16:37 8 2
#> 6 1899-12-31 12:16:58 6 2
#> 7 1899-12-31 12:16:59 6 2
#> 8 1899-12-31 12:17:04 8 2
#> 9 1899-12-31 12:17:06 8 2
#> 10 1899-12-31 12:17:08 8 2
#> # ... with 65 more rows
Having your data in long format like this allows for easier plotting, for example:
library(ggplot2)
ggplot(combined, aes(time, reading, color = factor(sensor))) +
geom_line(size = 1) +
theme_bw(base_size = 16) +
scale_color_brewer(palette = "Set1", name = "Sensor")
Created on 2022-05-19 by the reprex package (v2.0.1)
I have two tables: the first has stress scores recorded at various time points, and the second has the date of treatment. I want to get the stress scores before and after treatment for each participant who has received the treatment. I also want a column that tells me when the stress score was recorded relative to the treatment (before or after). I do not know where to begin or what my code should look like.
score.dt = data.table(
participant.index = c(1, 1, 1, 3, 4, 4, 13, 21, 21, 25, 37, 40, 41, 41, 41, 43, 43, 43, 44),
repeat.instance = c(2, 3, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1),
date.recorded = c(
'2017-07-13',
'2017-06-26',
'2018-09-17',
'2016-04-14',
'2014-03-24',
'2016-05-30',
'2018-06-20',
'2014-08-03',
'2015-07-06',
'2014-12-17',
'2014-09-05',
'2013-06-10',
'2015-10-04',
'2016-11-04',
'2016-04-18',
'2014-02-13',
'2013-05-24',
'2014-09-10',
'2014-11-25'
),
subscale = c(
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress"
),
score = c(18, 10, 18, 36, 16, 30, 28, 10, 12, 40, 16, 12, 10, 14, 6, 32, 42, 26, 18)
)
date.treatment.dt = data.table (
participant.index = c(1, 4, 5, 6, 8, 10, 11, 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26),
date.treatment = c(
'2018 - 06 - 27',
'2001 - 07 - 16',
'2009 - 12 - 09',
'2009 - 05 - 20',
'2009 - 07 - 22',
'2008-07 - 02',
'2009 - 11 - 25',
'2009 - 09 - 16',
'1991 - 07 - 30',
'2016 - 05 - 25',
'2012 - 07 - 25',
'2007 - 03 - 19',
'2012 - 01 - 25',
'2011 - 09 - 21',
'2000 - 03 - 06',
'2001 - 09 - 25',
'1999 - 12 - 20',
'1997 -07 - 28',
'2002 - 03 - 12',
'2008 - 01 - 23'
))
The desired output columns are something like this:
score.date.dt = c("candidate.index.x", "repeat.instance", "subscale", "score", "date.treatment", "date.recorded", "score.before.treatment", "score.after.treatment", "months.before.treatment", "months.after.treatment")
Here the column months.before.treatment indicates how many months before treatment the stress score was measured, and months.after.treatment indicates how many months after treatment the stress score was measured.
In your example set, only four individuals with stress scores have any rows in the treatment table (participants 1, 4, 21, and 25). Only one of these, participant 1, has both a pre-treatment stress measure and a post-treatment stress measure.
Here is one way to produce the information you need:
inner_join(score.dt,date.treatment.dt, by="participant.index") %>%
group_by(participant.index, date.treatment) %>%
summarize(pre_treatment = min(date.recorded[date.recorded<=date.treatment]),
post_treatment = max(date.recorded[date.recorded>=date.treatment])) %>%
pivot_longer(cols = -(participant.index:date.treatment), names_to = "period", values_to = "date.recorded") %>%
left_join(score.dt, by=c("participant.index", "date.recorded" )) %>%
mutate(period=str_extract(period,".*(?=_)"),
months = abs(as.numeric(date.treatment-date.recorded))/(365.25/12)) %>%
pivot_wider(id_cols = participant.index:date.treatment, names_from = period, values_from=c(date.recorded, subscale, months,score))
Output:
participant.index date.treatment date.recorded_pre date.recorded_post subscale_pre subscale_post months_pre months_post score_pre score_post
<dbl> <date> <date> <date> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 2018-06-27 2017-06-26 2018-09-17 stress stress 12.0 2.69 10 18
2 4 2001-07-16 NA 2016-05-30 NA stress Inf 178. NA 30
3 21 2000-03-06 NA 2015-07-06 NA stress Inf 184. NA 12
4 25 2002-03-12 NA 2014-12-17 NA stress Inf 153. NA 40
Note: you will have to fix the date inputs in the two source tables first, like this:
# first, correct your date.treatment column and convert it to Date
date.treatment.dt[, date.treatment := as.Date(str_replace_all(date.treatment," ",""), "%Y-%m-%d")]
# second, similarly fix the date column in your stress score table
score.dt[,date.recorded := as.Date(date.recorded,"%Y-%m-%d")]
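The same clean-up can be sketched in base R without stringr, assuming the only problem is stray whitespace around the hyphens; the raw vector below reuses three of the strings from the question:

```r
# Three of the irregularly spaced date strings from the question
raw <- c("2018 - 06 - 27", "2008-07 - 02", "1997 -07 - 28")

# Drop every run of whitespace, then parse as year-month-day
clean <- as.Date(gsub("\\s+", "", raw), format = "%Y-%m-%d")

clean
```

Any string that still fails to parse comes back as NA, so anyNA(clean) is a quick check that nothing was silently lost.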
It seems like there are a few parts to what you're asking. First, you need to merge the two tables together. Here I use dplyr::inner_join(), which automatically detects that participant.index is the only column in common and merges on it, discarding records found in only one of the tables. Second, we convert both date columns to a date format to enable the calculation of elapsed months.
library(tidyverse)
library(data.table)
library(lubridate)
score.dt <- structure(list(participant.index = c(1, 1, 1, 3, 4, 4, 13, 21, 21, 25, 37, 40, 41, 41, 41, 43, 43, 43, 44), repeat.instance = c(2, 3, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1), date.recorded = c("2017-07-13", "2017-06-26", "2018-09-17", "2016-04-14", "2014-03-24", "2016-05-30", "2018-06-20", "2014-08-03", "2015-07-06", "2014-12-17", "2014-09-05", "2013-06-10", "2015-10-04", "2016-11-04", "2016-04-18", "2014-02-13", "2013-05-24", "2014-09-10", "2014-11-25"), subscale = c("stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress"), score = c(18, 10, 18, 36, 16, 30, 28, 10, 12, 40, 16, 12, 10, 14, 6, 32, 42, 26, 18)), row.names = c(NA, -19L), class = c("data.table", "data.frame"))
date.treatment.dt <- structure(list(participant.index = c(1, 4, 5, 6, 8, 10, 11, 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26), date.treatment = c("2018 - 06 - 27", "2001 - 07 - 16", "2009 - 12 - 09", "2009 - 05 - 20", "2009 - 07 - 22", "2008-07 - 02", "2009 - 11 - 25", "2009 - 09 - 16", "1991 - 07 - 30", "2016 - 05 - 25", "2012 - 07 - 25", "2007 - 03 - 19", "2012 - 01 - 25", "2011 - 09 - 21", "2000 - 03 - 06", "2001 - 09 - 25", "1999 - 12 - 20", "1997 -07 - 28", "2002 - 03 - 12", "2008 - 01 - 23")), row.names = c(NA, -20L), class = c("data.table", "data.frame"))
inner_join(date.treatment.dt, score.dt) %>%
mutate(across(contains("date"), as_date)) %>%
mutate(months.after = interval(date.treatment, date.recorded) %/% months(1)) %>%
mutate(months.before = 0 - months.after)
#> Joining, by = "participant.index"
#> participant.index date.treatment repeat.instance date.recorded subscale
#> 1: 1 2018-06-27 2 2017-07-13 stress
#> 2: 1 2018-06-27 3 2017-06-26 stress
#> 3: 1 2018-06-27 6 2018-09-17 stress
#> 4: 4 2001-07-16 1 2014-03-24 stress
#> 5: 4 2001-07-16 2 2016-05-30 stress
#> 6: 21 2000-03-06 1 2014-08-03 stress
#> 7: 21 2000-03-06 2 2015-07-06 stress
#> 8: 25 2002-03-12 1 2014-12-17 stress
#> score months.after months.before
#> 1: 18 -11 11
#> 2: 10 -12 12
#> 3: 18 2 -2
#> 4: 16 152 -152
#> 5: 30 178 -178
#> 6: 10 172 -172
#> 7: 12 184 -184
#> 8: 40 153 -153
Created on 2022-04-05 by the reprex package (v2.0.1)
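To see what the interval arithmetic does in isolation, here is a small sketch using the treatment date and the last recording date of participant 1 from the output above; the variable names are only for illustration:

```r
library(lubridate)

treatment <- as_date("2018-06-27")
recorded  <- as_date("2018-09-17")

# Number of whole calendar months that fit between treatment and recording
months_after <- interval(treatment, recorded) %/% months(1)

months_after
```

Integer division by months(1) counts only completed months, which is why a gap of two months and three weeks is reported as 2.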
I have this data:
structure(list(new_col = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5,
6, 7, 1, 2, 3, 4, 5, 6, 7), DATE = structure(c(1608249600, 1608249600,
1608249600, 1608249600, 1608249600, 1608249600, 1608249600, 1608336000,
1608336000, 1608336000, 1608336000, 1608336000, 1608336000, 1608336000,
1608422400, 1608422400, 1608422400, 1608422400, 1608422400, 1608422400,
1608422400), tzone = "UTC", class = c("POSIXct", "POSIXt")),
HOSP_COUNT = c(582, 931, 1472, 2175, 2791, 3024, 2310, 588,
932, 1477, 2186, 2810, 3051, 2330, 590, 932, 1479, 2188,
2817, 3060, 2335)), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame"))
HOSP_COUNT is a cumulative variable, but I need a daily measure. Something like this:
structure(list(X1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21), new_col = c(1, 2, 3, 4,
5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7), DATE = c("12/18/2020",
"12/18/2020", "12/18/2020", "12/18/2020", "12/18/2020", "12/18/2020",
"12/18/2020", "12/19/2020", "12/19/2020", "12/19/2020", "12/19/2020",
"12/19/2020", "12/19/2020", "12/19/2020", "12/20/2020", "12/20/2020",
"12/20/2020", "12/20/2020", "12/20/2020", "12/20/2020", "12/20/2020"
), HOSP_COUNT = c(582, 931, 1472, 2175, 2791, 3024, 2310, 6,
1, 5, 11, 19, 27, 20, 2, 1, 2, 2, 7, 9, 15)), row.names = c(NA,
-21L), class = c("tbl_df", "tbl", "data.frame"))
So I need a new column. The formula is:
(daily HOSP_COUNT of new_col 1 on the second date, 12/19/2020) = (cumulative HOSP_COUNT of new_col 1 on 12/19/2020) - (cumulative HOSP_COUNT of new_col 1 on the first day, 12/18/2020)
This should work for you:
library(dplyr)
df %>%
group_by(new_col) %>%
mutate(new_count = HOSP_COUNT - lag(HOSP_COUNT),
new_count = ifelse(is.na(new_count), HOSP_COUNT, new_count))
where new_count holds your corrected daily counts:
# A tibble: 20 x 4
# Groups: new_col [7]
new_col DATE HOSP_COUNT new_count
<dbl> <dttm> <dbl> <dbl>
1 1 2020-12-18 00:00:00 582 582
2 2 2020-12-18 00:00:00 931 931
3 3 2020-12-18 00:00:00 1472 1472
4 4 2020-12-18 00:00:00 2175 2175
5 5 2020-12-18 00:00:00 2791 2791
6 6 2020-12-18 00:00:00 3024 3024
7 7 2020-12-18 00:00:00 2310 2310
8 1 2020-12-19 00:00:00 588 6
9 2 2020-12-19 00:00:00 932 1
10 3 2020-12-19 00:00:00 1477 5
11 4 2020-12-19 00:00:00 2186 11
12 5 2020-12-19 00:00:00 2810 19
13 6 2020-12-19 00:00:00 3051 27
14 7 2020-12-19 00:00:00 2330 20
15 1 2020-12-20 00:00:00 590 2
16 2 2020-12-20 00:00:00 932 0
17 3 2020-12-20 00:00:00 1479 2
18 4 2020-12-20 00:00:00 2188 2
19 5 2020-12-20 00:00:00 2817 7
20 6 2020-12-20 00:00:00 3060 9