I have a dataframe like so:
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
2019-07-01 100
2019-08-01 120
The problem is that nothing happened in May and June, so there is no data for those months. Plotting this as a bar chart leaves empty space on the x-axis.
Is there some way to "fill" the missing months with the last known value, like so?
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
**2019-05-01 80** <--
**2019-06-01 80** <--
2019-07-01 100
2019-08-01 120
We can use complete to expand the data to all months and fill to carry the last known value forward:
library(dplyr)
library(tidyr)
df1 %>%
  complete(Month = seq(min(Month), max(Month), by = '1 month')) %>%
  fill(CumulativeSum)
-output
# A tibble: 7 x 2
# Month CumulativeSum
# <date> <int>
#1 2019-02-01 40
#2 2019-03-01 70
#3 2019-04-01 80
#4 2019-05-01 80
#5 2019-06-01 80
#6 2019-07-01 100
#7 2019-08-01 120
data
df1 <- structure(list(Month = structure(c(17928, 17956, 17987, 18078,
18109), class = "Date"), CumulativeSum = c(40L, 70L, 80L, 100L,
120L)), row.names = c(NA, -5L), class = "data.frame")
Here is a base R option using cummax
transform(
  data.frame(
    Month = seq(min(df1$Month), max(df1$Month), by = "1 month"),
    CumulativeSum = -Inf
  ),
  CumulativeSum = cummax(replace(CumulativeSum, Month %in% df1$Month, df1$CumulativeSum))
)
which gives
Month CumulativeSum
1 2019-02-01 40
2 2019-03-01 70
3 2019-04-01 80
4 2019-05-01 80
5 2019-06-01 80
6 2019-07-01 100
7 2019-08-01 120
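Either way, the filled data frame now has one row per month, so a bar chart no longer shows gaps. A minimal ggplot2 sketch (df_filled is a hypothetical name for whichever filled result you keep):
library(ggplot2)

# df_filled: the output of either approach above (hypothetical name)
ggplot(df_filled, aes(x = Month, y = CumulativeSum)) +
  geom_col() +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m")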
Related
I want to group animals based on consecutive months they were found within the same burrow, but also split up those groups if the months were not consecutive.
#Input Data
burrow.data <- read.csv(...)  # file argument omitted; the data prints as follows
Animal Burrow Date
1 027 B0961 2022-03-01
2 027 B0961 2022-04-26
3 033 1920 2021-11-02
4 033 1955 2022-03-29
5 033 1955 2022-04-26
6 063 B0540 2021-04-21
7 063 B0540 2022-01-04
8 063 B0540 2022-03-01
9 101 B0021 2020-11-23
10 101 B0021 2020-12-23
11 101 B0021 2021-11-04
12 101 B0021 2022-01-06
13 101 B0021 2022-02-04
14 101 B0021 2022-03-03
#Expected Output
Animal Burrow grp Date.Start Date.End
1 033 1920 1 2021-11-02 2021-11-02
2 033 1955 1 2022-03-29 2022-04-26
3 101 B0021 1 2020-11-23 2020-12-23
4 101 B0021 2 2022-01-06 2022-03-03
5 063 B0540 1 2021-04-21 2022-03-01
6 027 B0961 1 2022-03-01 2022-04-26
I used code from another post: Group consecutive dates in R
And wrote:
burrow.input <- burrow.data[order(burrow.data$Date),]
burrow.input$grp <- ave(as.integer(burrow.input$Date), burrow.input[-4], FUN = function(z) cumsum(c(TRUE, diff(z)>1)))
burrow.input
out <- aggregate(Date ~ Animal + Burrow + grp, data = burrow.input, FUN = function(z) setNames(range(z), c("Start", "End")))
out <- do.call(data.frame,out)
out[,4:5] <- lapply(out[,4:5], as.Date, origin = "1970-01-01")
out
The code keeps grouping 101 into a single group instead of two groups broken up by a date gap (See below).
How can I fix this?
Animal Burrow grp Date.Start Date.End
1 033 1920 1 2021-11-02 2021-11-02
2 033 1955 1 2022-03-29 2022-04-26
3 101 B0021 1 2020-11-23 2022-03-03
4 063 B0540 1 2021-04-21 2022-03-01
5 027 B0961 1 2022-03-01 2022-04-26
Group the data by Animal, Burrow and a grouping variable that changes each time the date jumps by more than one month. Here as.yearmon converts the date to a yearmon object, which internally is the year plus 0 for January, 1/12 for February, ..., 11/12 for December, so multiply that by 12 and check whether the difference from the prior value is greater than 1. Take the cumulative sum of that to generate the grouping variable. Finally summarize, sort, and remove the grouping variable that was added.
library(dplyr)
library(zoo)
burrow.data %>%
  group_by(Animal, Burrow,
           diff = cumsum(c(1, diff(12 * as.yearmon(Date)) > 1))) %>%
  summarize(Date.start = first(Date), Date.end = last(Date), .groups = "drop") %>%
  arrange(Burrow) %>%
  select(-diff)
giving:
# A tibble: 7 × 4
Animal Burrow Date.start Date.end
<int> <chr> <chr> <chr>
1 33 1920 2021-11-02 2021-11-02
2 33 1955 2022-03-29 2022-04-26
3 101 B0021 2020-11-23 2021-11-04
4 101 B0021 2022-01-06 2022-03-03
5 63 B0540 2021-04-21 2022-01-04
6 63 B0540 2022-03-01 2022-03-01
7 27 B0961 2022-03-01 2022-04-26
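To see what the grouping variable does, here is a small standalone illustration of the yearmon arithmetic on hypothetical dates (converted with as.Date; note that in burrow.data above Date is stored as character):
library(zoo)

# the first two dates are in consecutive months, the third follows a longer gap
d <- as.Date(c("2021-01-15", "2021-02-10", "2021-06-05"))
months_elapsed <- diff(12 * as.numeric(as.yearmon(d)))
months_elapsed                    # 1 4: whole months between successive dates
cumsum(c(1, months_elapsed > 1))  # 1 1 2: a new group starts after the gap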
Note
The input data in reproducible form is:
burrow.data <-
structure(list(Animal = c(27L, 27L, 33L, 33L, 33L, 63L, 63L,
63L, 101L, 101L, 101L, 101L, 101L, 101L), Burrow = c("B0961",
"B0961", "1920", "1955", "1955", "B0540", "B0540", "B0540", "B0021",
"B0021", "B0021", "B0021", "B0021", "B0021"), Date = c("2022-03-01",
"2022-04-26", "2021-11-02", "2022-03-29", "2022-04-26", "2021-04-21",
"2022-01-04", "2022-03-01", "2020-11-23", "2020-12-23", "2021-11-04",
"2022-01-06", "2022-02-04", "2022-03-03")), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12", "13", "14"))
I am working in R. I have a data frame that consists of Sampling Date and water temperature. I have provided a sample dataframe below:
Date Temperature
2015-06-01 11
2015-08-11 13
2016-01-12 2
2016-07-01 12
2017-01-08 4
2017-08-13 14
2018-03-04 7
2018-09-19 10
2019-08-24 8
Due to the erratic nature of the sampling dates (a result of the sampler's ability to reach the site), I cannot classify years in the usual way from January 1 to December 31. Instead I am using the beginning of the sampling period as the start of a year: here a year starts June 1st and ends May 31st, so that I can accurately compare the years to one another. Thus I want the four years to have the following labels
Year_One = "2015-06-01" - "2016-05-31"
Year_Two = "2016-06-01" - "2017-05-31"
Year_Three = "2017-06-01" - "2018-05-31"
Year_Four = "2018-06-01" - "2019-08-24"
My goal is to create an additional column with these labels but have thus far been unable to do so.
I create two columns, year1 and year2, with two different approaches. The year2 approach requires that all periods start June 1st and end May 31st (in your example Year_Four ends 2019-08-24), so it may not be exactly what you need:
library(tidyverse)
library(lubridate)
dt$Date <- as.Date(dt$Date)
dt %>%
  mutate(year1 = case_when(between(Date, as.Date("2015-06-01"), as.Date("2016-05-31")) ~ "Year_One",
                           between(Date, as.Date("2016-06-01"), as.Date("2017-05-31")) ~ "Year_Two",
                           between(Date, as.Date("2017-06-01"), as.Date("2018-05-31")) ~ "Year_Three",
                           between(Date, as.Date("2018-06-01"), as.Date("2019-08-24")) ~ "Year_Four",
                           TRUE ~ "0")) %>%
  mutate(year2 = paste0(year(Date - months(5)), "/", year(Date - months(5)) + 1))
The output:
# A tibble: 9 x 4
Date Temperature year1 year2
<date> <dbl> <chr> <chr>
1 2015-06-01 11 Year_One 2015/2016
2 2015-08-11 13 Year_One 2015/2016
3 2016-01-12 2 Year_One 2015/2016
4 2016-07-01 12 Year_Two 2016/2017
5 2017-01-08 4 Year_Two 2016/2017
6 2017-08-13 14 Year_Three 2017/2018
7 2018-03-04 7 Year_Three 2017/2018
8 2018-09-19 10 Year_Four 2018/2019
9 2019-08-24 8 Year_Four 2019/2020
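If you prefer not to hard-code each period, a hedged sketch that derives the label from the date itself (the column names start_year and year_label are made up here; like year2, this assumes every year runs June 1st to May 31st, so the last row would become Year_5 rather than being folded into Year_Four):
library(dplyr)
library(lubridate)

dt %>%
  mutate(start_year = year(Date %m-% months(5)),  # shift back 5 months so June 1st starts the year
         year_label = paste0("Year_", start_year - min(start_year) + 1))
Using %m-% instead of plain subtraction avoids NAs when the shifted date would land on a day that doesn't exist (e.g. July 31st minus 5 months).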
Use strftime to get the years, then make a factor whose levels are the unique values. I'd recommend numbers instead of words because they can be generated automatically; otherwise, use labels=c("one", "two", ...).
d <- within(d, {
  year <- strftime(Date, "%Y")
  year <- paste("Year", factor(year, labels = seq(unique(year))), sep = "_")
})
# Date temperature year
# 1 2017-06-01 1 Year_1
# 2 2017-09-01 2 Year_1
# 3 2017-12-01 3 Year_1
# 4 2018-03-01 4 Year_2
# 5 2018-06-01 5 Year_2
# 6 2018-09-01 6 Year_2
# 7 2018-12-01 7 Year_2
# 8 2019-03-01 8 Year_3
# 9 2019-06-01 9 Year_3
# 10 2019-09-01 10 Year_3
# 11 2019-12-01 11 Year_3
# 12 2020-03-01 12 Year_4
# 13 2020-06-01 13 Year_4
Data:
d <- structure(list(Date = structure(c(17318, 17410, 17501, 17591,
17683, 17775, 17866, 17956, 18048, 18140, 18231, 18322, 18414
), class = "Date"), temperature = 1:13), class = "data.frame", row.names = c(NA,
-13L))
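The labels above follow calendar years; if you need the June 1st to May 31st years from the question instead, a hedged base R adaptation (start_year and year_jun are names made up here):
# p$mon is 0-based, so 5 corresponds to June; January-May (mon < 5)
# are counted with the June of the previous year
p <- as.POSIXlt(d$Date)
start_year <- p$year + 1900 - (p$mon < 5)
d$year_jun <- paste0("Year_", start_year - min(start_year) + 1)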
I have the below-mentioned dataframe in R.
DF
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using the above dataframe, I want to break down the count and the sum by month and by time bucket, based on the Datetime column.
Required Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
You can bin the hours of the day by using hour from the lubridate package and then cut from base R, before summarizing with dplyr.
Here, I am assuming that your Datetime column is actually in a date-time format and not just a character string or factor. If it is only a character string or factor, first convert it with DF$Datetime <- as.POSIXct(as.character(DF$Datetime)).
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 5.99, 11.99, 14.99, 19.99, 24))
levels(DF$bins) <- c("00:00 to 05:59", "06:00 to 11:59", "12:00 to 14:59",
"15:00 to 19:59", "20:00 to 23:59")
newDF <- DF %>%
  group_by(bins, .drop = FALSE) %>%
  summarise(Count = length(Value), Total = sum(Value))
This gives the following result:
newDF
#> # A tibble: 5 x 3
#> bins Count Total
#> <fct> <int> <dbl>
#> 1 00:00 to 05:59 2 45
#> 2 06:00 to 11:59 0 0
#> 3 12:00 to 14:59 1 20
#> 4 15:00 to 19:59 3 35
#> 5 20:00 to 23:59 1 15
And if you want to add January as a first row (though I'm not sure how much sense this makes in this context) you could do:
newDF %>%
  summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>%
  bind_rows(newDF)
#> # A tibble: 6 x 3
#> bins Count Total
#> <chr> <int> <dbl>
#> 1 January 7 115
#> 2 00:00 to 05:59 2 45
#> 3 06:00 to 11:59 0 0
#> 4 12:00 to 14:59 1 20
#> 5 15:00 to 19:59 3 35
#> 6 20:00 to 23:59 1 15
Incidentally, the reproducible version of the data I used for this was:
structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = ""), Value = c(10,
20, 25, 20, 10, 15, 15)), class = "data.frame", row.names = c(NA,
-7L))
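If your real data spans more than one month and you want the breakdown per month as well as per time bucket, a hedged extension of the same idea (the "%b-%y" month label format is an assumption, and DF$bins is the factor created above):
library(dplyr)

DF %>%
  mutate(Month = format(Datetime, "%b-%y")) %>%
  group_by(Month, bins, .drop = FALSE) %>%
  summarise(Count = length(Value), Total = sum(Value), .groups = "drop")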
I have several days of heart rate data for every second of the day (with random missing gaps of data) like this:
structure(list(TimePoint = structure(c(1523237795, 1523237796,
1523237797, 1523237798, 1523237799, 1523237800, 1523237801, 1523237802,
1523237803, 1523237804), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
HR = c(80L, 83L, 87L, 91L, 95L, 99L, 102L, 104L, 104L, 103L
)), row.names = c(NA, 10L), class = "data.frame")
------------------------------
TimePoint HR
1 2018-04-09 01:36:35 80
2 2018-04-09 01:36:36 83
3 2018-04-09 01:36:37 87
4 2018-04-09 01:36:38 91
5 2018-04-09 01:36:39 95
6 2018-04-09 01:36:40 99
7 2018-04-09 01:36:41 102
8 2018-04-09 01:36:42 104
9 2018-04-09 01:36:43 104
10 2018-04-09 01:36:44 103
.
.
.
I would like to apply scale(center = TRUE, scale = TRUE) to the data to normalize across participants.
However, I don't want to normalize across all the available days at once, but within each 24-hour period.
So if a participant has 3 days of data, the HR will be scaled to a z-distribution 3 separate times, once for each respective day.
I am having trouble doing this successfully.
# read csv
DF = read.csv(x)
# convert the timestamp (stored as day.month.year hour:min:sec) into class POSIXct
x2 = as.POSIXct(DF[,1], format = '%d.%m.%Y %H:%M:%S', tz = "UTC") %>% data.frame()
# rename column
colnames(x2)[1] = "TimePoint"
# add the participant HR data to this dataframe
x2$HR = DF[,2]
# break time stamps into 60 minute windows
by60 = cut(x2$TimePoint, breaks = "60 min")
# get the average HR per 60 min window
DF_Sum = aggregate(HR ~ by60, FUN=mean, data=x2)
# add weekday /hours for future plot visualization
DF_Sum$WeekDay = wday(DF_Sum$by60, label = T)
DF_Sum$Hour = hour(DF_Sum$by60)
I am able to split the data by timeseries and average the HR by hour but I cannot seem to add the scale function properly.
Help appreciated.
Create time intervals of 24 hours for each patient, group_by patient and time intervals, then calculate the scaled HR for each group.
library(dplyr)
df %>%
  # remove the following mutate and replace ID in group_by by the ID column's name in your data set
  mutate(ID = 1) %>%
  group_by(ID, Int = cut(TimePoint, breaks = "24 hours")) %>%
  mutate(HR_sc = scale(HR, center = TRUE, scale = TRUE))
# A tibble: 10 x 5
# Groups: ID, Int [1]
TimePoint HR ID Int HR_sc
<dttm> <int> <dbl> <fct> <dbl>
1 2018-04-09 01:26:35 80 1 2018-04-09 01:00:00 -1.63
2 2018-04-09 01:28:16 83 1 2018-04-09 01:00:00 -1.30
3 2018-04-09 01:29:57 87 1 2018-04-09 01:00:00 -0.860
4 2018-04-09 01:31:38 91 1 2018-04-09 01:00:00 -0.419
5 2018-04-09 01:33:19 95 1 2018-04-09 01:00:00 0.0221
6 2018-04-09 01:33:20 99 1 2018-04-09 01:00:00 0.463
7 2018-04-09 01:35:01 102 1 2018-04-09 01:00:00 0.794
8 2018-04-09 01:36:42 104 1 2018-04-09 01:00:00 1.01
9 2018-04-09 01:38:23 104 1 2018-04-09 01:00:00 1.01
10 2018-04-09 01:39:59 103 1 2018-04-09 01:00:00 0.905
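For reference, a rough base R equivalent of the same idea (this assumes your full data frame has an ID column identifying the participant; drop df$ID from the ave() call if there is only one participant):
# cut() builds 24-hour bins; ave() applies scale() within each
# participant x bin combination
df$Int   <- cut(df$TimePoint, breaks = "24 hours")
df$HR_sc <- ave(df$HR, df$ID, df$Int, FUN = function(x) as.numeric(scale(x)))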
I am looking to take a dataframe which has data ordered through time and aggregate up to the hourly level, and place the data into a separate dataframe. It's best explained with an example:
tradeData dataframe:
Time Amount
2014-05-16 14:00:05 10
2014-05-16 14:00:10 20
2014-05-16 14:08:15 30
2014-05-16 14:23:09 51
2014-05-16 14:59:54 84
2014-05-16 15:09:45 94
2014-05-16 15:24:41 53
2014-05-16 16:30:51 44
The dataframe above contains the data I would like to aggregate. Below is the dataframe into which I would like to insert it:
HourlyData dataframe:
Time Profit
2014-05-16 00:00:00 100
2014-05-16 01:00:00 200
2014-05-16 02:00:00 250
...
2014-05-16 14:00:00 30
2014-05-16 15:00:00 -50
2014-05-16 16:00:00 67
...
2014-05-16 23:00:00 -8
I would like to aggregate the data in the tradeData dataframe and place it in the correct place in the hourlyData dataframe as below:
New hourlyData dataframe:
Time Profit Amount
2014-05-16 00:00:00 100 0
2014-05-16 01:00:00 200 0
2014-05-16 02:00:00 250 0
...
2014-05-16 14:00:00 30 0
2014-05-16 15:00:00 -50 195 (10+20+30+51+84)
2014-05-16 16:00:00 67 147 (94+53)
2014-05-16 17:00:00 20 44
...
2014-05-16 23:00:00 -8 0
Using the solution provided by Akrun below, I was able to get a solution for most instances. However, there appears to be an issue when an event occurs within the last hour of the day, as below:
TradeData
Time Amount
2014-08-15 22:09:07 11037.778
2014-08-15 23:01:33 13374.724
2014-08-20 23:25:40 133373.000
HourlyData
Time Amount
2014-08-15 23:00:00 11037.778 (correct)
2014-08-18 00:00:00 0 (incorrect)
2014-08-21 00:00:00 133373 (correct)
The code appears to skip the data for the second trade in the tradeData dataframe when aggregating into the hourlyData dataframe. This seems to happen for trades that occur in the last hour of a Friday, because (I imagine) no hourly data exists for Saturday at 12 AM, i.e. Friday 11 PM + 1 hour. It works for trades occurring in the last hour of Monday to Thursday.
Any ideas on how to adjust the algo? Please let me know if anything is unclear.
Thanks
Mike
Try
library(dplyr)
res <- left_join(df2,
                 df %>%
                   group_by(hour = as.POSIXct(cut(Time, breaks = 'hour')) + 3600) %>%
                   summarise(Amount = sum(Amount)),
                 by = c('Time' = 'hour'))
res$Amount[is.na(res$Amount)] <- 0
res
# Time Profit Amount
#1 2014-05-16 00:00:00 100 0
#2 2014-05-16 01:00:00 200 0
#3 2014-05-16 02:00:00 250 0
#4 2014-05-16 14:00:00 30 0
#5 2014-05-16 15:00:00 -50 195
#6 2014-05-16 16:00:00 67 147
#7 2014-05-16 23:00:00 -8 0
Or using data.table
library(data.table)
DT <- data.table(df)
DT2 <- data.table(df2)
DT1 <- DT[, list(Amount = sum(Amount)),
          by = (Time = as.POSIXct(cut(Time, breaks = 'hour')) + 3600)]
setkey(DT1, Time)
DT1[DT2][is.na(Amount), Amount:=0][]
# Time Amount Profit
#1: 2014-05-16 00:00:00 0 100
#2: 2014-05-16 01:00:00 0 200
#3: 2014-05-16 02:00:00 0 250
#4: 2014-05-16 14:00:00 0 30
#5: 2014-05-16 15:00:00 195 -50
#6: 2014-05-16 16:00:00 147 67
#7: 2014-05-16 23:00:00 0 -8
Update
Based on the weekends info,
indx <- with(df, as.numeric(format(Time, '%H')) == 23 &
               as.numeric(format(Time, '%S')) > 0 & format(Time, '%a') == 'Fri')
grp <- with(df, as.POSIXct(cut(Time, breaks = 'hour')))
grp[indx] <- grp[indx] + 3600 * 49
grp[!indx] <- grp[!indx] + 3600
df$Time <- grp
df %>%
  group_by(Time) %>%
  summarise(Amount = sum(Amount))  # in the example dataset, it is just 3 rows
# Time Amount
#1 2014-08-15 23:00:00 11037.78
#2 2014-08-18 00:00:00 13374.72
#3 2014-08-21 00:00:00 133373.00
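An alternative that avoids special-casing weekends is to snap each trade to the next timestamp that actually exists in the hourly data instead of adding exactly one hour. A hedged base R sketch (it assumes df2$Time is sorted in increasing order and that every trade has a later or equal timestamp available in df2):
# ceiling each trade time to the next full hour, as in the solutions above
target <- as.POSIXct(cut(df$Time, breaks = "hour")) + 3600

# index of the first hourly timestamp >= target; subtracting one second keeps
# exact matches, while gaps (e.g. weekends) roll forward to the next available row
idx <- findInterval(target - 1, df2$Time) + 1

agg <- aggregate(df$Amount, by = list(Time = df2$Time[idx]), FUN = sum)
names(agg)[2] <- "Amount"
res <- merge(df2, agg, by = "Time", all.x = TRUE)
res$Amount[is.na(res$Amount)] <- 0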
data
df <- structure(list(Time = structure(c(1400263205, 1400263210, 1400263695,
1400264589, 1400266794, 1400267385, 1400268281, 1400272251), class = c("POSIXct",
"POSIXt"), tzone = ""), Amount = c(10L, 20L, 30L, 51L, 84L, 94L,
53L, 44L)), .Names = c("Time", "Amount"), row.names = c(NA, -8L
), class = "data.frame")
df2 <- structure(list(Time = structure(c(1400212800, 1400216400, 1400220000,
1400263200, 1400266800, 1400270400, 1400295600), class = c("POSIXct",
"POSIXt"), tzone = ""), Profit = c(100L, 200L, 250L, 30L, -50L,
67L, -8L)), .Names = c("Time", "Profit"), row.names = c(NA, -7L
), class = "data.frame")
newdata
df <- structure(list(Time = structure(c(1408158000, 1408334400, 1408593600
), tzone = "", class = c("POSIXct", "POSIXt")), Amount = c(11037.778,
13374.724, 133373)), .Names = c("Time", "Amount"), row.names = c(NA,
-3L), class = "data.frame")