I am not good at R and not sure how to rearrange and subset time series data. Sorry if this question sounds stupid.
I have time series data of sea tides, with four values per day (and some missing values): two values for high tide and two for low tide. The date and the times are given in the same column but in different rows. I want to subset the data to daytime only (from 7:00 AM to 7:00 PM), not night. Then I want the data arranged in three columns: i) Date, ii) Time and iii) Tide. For Tide, I only need the minimum and maximum values. Here is an example of the data and the desired arrangement. For each date, the data is arranged in three rows, as in the example.
1/1/2011 Low High Low NA
Time 2:58 AM 9:38 AM 5:19 PM NA
Tide 1.2 m 2.2 m 0.6 m NA
1/2/2011 High Low High Low
Time 2:07 AM 4:22 AM 10:19 AM 6:07 PM
Tide 1.4 m 1.3 m 2.3 m 0.4 m
Date Time Tide
1/1/2011 17:19 0.6
1/1/2011 9:38 2.2
1/2/2011 2:07 1.4
1/2/2011 18:07 0.4
The input, DF, is assumed to be as in the Note below.
g, the grouping vector, has one element per row of DF and equals c(1, 1, 1, 2, 2, 2, ...). Alternate ways to compute g would be n <- nrow(DF); g <- gl(n, 3, n) or n <- nrow(DF); g <- rep(1:n, each = 3, length.out = n).
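For the six-row input in the Note at the end, each computation gives the same grouping (a quick check):
n <- nrow(DF)                       # 6 rows in the example input
cumsum(grepl("\\d", DF$V1))         # 1 1 1 2 2 2 -- a new group starts at each date row
gl(n, 3, n)                         # the same groups, as a factor
rep(1:n, each = 3, length.out = n)  # the same groups, as an integer vector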
We then use by to split DF into the groups defined by g and apply the indicated anonymous function to each group.
The anonymous function combines the date and the times in the current group to create the date/times dt, making use of the fact that the common date is x[1, 1] and the times, prior to clean-up, are in x[2, -1].
Using dt and the tides in x[3, -1] (prior to clean-up), it computes each of the three columns, arranging them into a data frame. A commented-out line removes NA values; uncomment it if you want that. The data frame obtained so far is then subset to the 7 AM to 7 PM period, reduced to the two rows holding the minimum and maximum tide, and sorted by time.
Finally, do.call("rbind", ...) puts the groups together into one overall data frame.
No packages are used.
g <- cumsum(grepl("\\d", DF$V1))  # new group at each row whose first field contains a digit, i.e. the date rows
Long <- do.call("rbind", by(DF, g, function(x) {
  # combine the group's common date (x[1, 1]) with its times (x[2, -1])
  dt <- as.POSIXct(paste(x[1, 1], as.matrix(x[2, -1])), format = "%m/%d/%Y %I:%M %p")
  X <- data.frame(Date = as.Date(dt),
                  Time = format(dt, "%H:%M"),
                  Tide = as.numeric(sub("m", "", as.matrix(x[3, -1]))),
                  stringsAsFactors = FALSE)
  # X <- na.omit(X)  # uncomment to drop NA rows
  X <- subset(X, Time >= "07:00" & Time <= "19:00")   # daytime only
  X <- X[c(which.min(X$Tide), which.max(X$Tide)), ]   # min and max tide
  X[order(X$Time), ]
}))
giving the following -- note that the third row in the question's output is not between 7 AM and 7 PM, so the output here necessarily differs:
> Long
Date Time Tide
1.2 2011-01-01 09:38 2.2
1.3 2011-01-01 17:19 0.6
2.3 2011-01-02 10:19 2.3
2.4 2011-01-02 18:07 0.4
Note: The input DF is assumed to be as follows in reproducible form:
Lines <- "1/1/2011,Low,High,Low,NA
Time,2:58 AM,9:38 AM,5:19 PM,NA
Tide,1.2 m,2.2 m,0.6 m,NA
1/2/2011,High,Low,High,Low
Time,2:07 AM,4:22 AM,10:19 AM,6:07 PM
Tide,1.4 m,1.3 m,2.3 m,0.4 m"
DF <- read.table(text = Lines, sep = ",", as.is = TRUE)
If the list is not too long, this would be simpler to do in a spreadsheet, simply by mapping cells and filtering. But one way to do it in R with zoo and the tidyverse is the following.
Assuming that the original data frame has its columns named C1:C5:
C1 C2 C3 C4 C5
<chr> <chr> <chr> <chr> <chr>
1 1/1/2010 Low High Low <NA>
2 Time 2:58 AM 9:38 AM 5:19 PM <NA>
3 Tide 1.2 2.2 0.6 <NA>
4 1/2/2011 High Low High Low
5 Time 2:07 AM 4:22 AM 10:19 AM 6:07 PM
6 Tide 1.4 1.3 2.3 0.4
library(tidyverse)
library(lubridate)
library(zoo)

DF <- DF %>%
  mutate(Date = as.Date(gsub("Tide|Time", "", C1), format = "%d/%m/%Y"))

DF <- DF %>%
  mutate(Date = na.locf(Date, na.rm = TRUE),
         C1 = gsub("[[:digit:]]|\\/", "", C1),
         Type = if_else(nchar(C1) == 0, "TideType", C1)) %>%
  select(Date, Type, C2:C5) %>%
  gather(oColumn, Value, -c(Date, Type)) %>%
  spread(key = Type, value = Value) %>%
  select(Date, Time, Tide) %>%
  filter(complete.cases(.))

DF <- DF %>%
  mutate(Time = ymd_hm(paste(Date, Time, sep = " ")),
         Tide = as.numeric(Tide))

DF <- DF %>%
  mutate(DayNight = Time %within%
           interval(as.POSIXlt(Date) + (7 * 60 * 60), as.POSIXlt(Date) + (19 * 60 * 60))) %>%
  filter(DayNight == TRUE) %>%
  select(-DayNight) %>%
  group_by(Date) %>%
  filter(Tide == max(Tide) | Tide == min(Tide))
DF
Source: local data frame [4 x 3]
Groups: Date [2]
Date Time Tide
<date> <dttm> <dbl>
1 2010-01-01 2010-01-01 09:38:00 2.2
2 2010-01-01 2010-01-01 17:19:00 0.6
3 2011-02-01 2011-02-01 10:19:00 2.3
4 2011-02-01 2011-02-01 18:07:00 0.4
Note that "Date" is a Date type of Object and "Time" is a Posixct type of Date-Time Object. You might want to convert "Time" into a vector of minutes.
Related
I'm trying to remove redundant rows from a gigantic dataset. For the same individual, at the same location, on the same day, I want to keep just one detection per 10-minute time range. So, for example, if individual 1 stays around station 6 for 20 minutes, instead of 200 or so detections I just want to keep 2, one for every 10 minutes he's there. Here is an example dataset:
datetime <- c("2020-12-30 23:03:24", "2020-12-30 23:04:25", "2020-12-28 23:06:20",
              "2020-12-26 12:02:10", "2020-12-26 12:07:26", "2020-12-26 12:10:07",
              "2018-05-11 05:02:05", "2018-05-11 05:03:07", "2018-05-11 05:13:25",
              "2018-05-11 05:14:27")
dt <- as.POSIXct(datetime, format = "%Y-%m-%d %H:%M:%S")
i <- c('ind1', 'ind1', 'ind1', 'ind2', 'ind2', 'ind2', 'ind1', 'ind1', 'ind1', 'ind4')
l <- c('station1', 'station1', 'station1', 'station2', 'station2', 'station3',
       'station1', 'station1', 'station2', 'station6')
stack <- data.frame(dt, i, l)
The expected resulting dataframe:
dt i l
1 2020-12-30 23:03:24 ind1 station1
2 2020-12-30 23:04:25 ind1 station1
3 2020-12-28 23:06:20 ind1 station1
4 2020-12-26 12:02:10 ind2 station2
5 2020-12-26 12:07:26 ind2 station2
6 2020-12-26 12:10:07 ind2 station3
7 2018-05-11 05:02:05 ind1 station1
8 2018-05-11 05:03:07 ind1 station1
9 2018-05-11 05:13:25 ind1 station2
10 2018-05-11 05:14:27 ind4 station6
Here is what I have tried to code so far:
#Separate date and time
stack <- tidyr::separate(stack, dt, c("date", "time"), sep = " ")
#Merge columns location (l), individual (i) and date (date)
data_set_merged <- stack%>%
unite("Merged_sample", c("i", "l", "date"), remove=FALSE)
#Order dataset chronologically
data_set_merged %>% arrange(ymd(data_set_merged$date))
#Count number of minutes between every group of detection
data_set_merged$time<-as.POSIXct(as.character(data_set_merged$time), format="%H:%M:%S")
value <-diff(data_set_merged$time)
#Add NA value at the end, since no difference between last value and nothing
Adding_NA_value <- append(value , "NA")
New_data_frame_with_column<- data_set_merged %>%
dplyr::mutate (Time_intervall_seconds = Adding_NA_value)
#Group_by sample, site, year and day and select the observations with > 10 min (= less than 600 seconds) replicates
Final_data_frame <- New_data_frame_with_column %>%
group_by(Merged_sample)%>%
filter (Time_intervall_seconds>= 600)
This code deletes more than just the redundant data: if two rows are less than 600 seconds apart, it deletes both instead of just one, so I'm losing information. I also don't know if it is only calculating the time difference within each group; I don't want to delete detections across individuals, or across locations for the same individual, I really only want it for the same location + individual + date.
I tried ordering the data chronologically to get a time interval that makes sense, but then the interval is not calculated by group anymore.
I'm still quite new at R and I don't know where to go from here. Any help would be so welcome. Thanks!
This will block the data into 10-minute intervals:
library(slider)
library(dplyr)
library(tidyr)

# block() expects an ascending index, so sort by time first
stack <- arrange(stack, dt)
block(stack, stack$dt, period = "minute", every = 10)
Odd as it sounds, it looks like the way to do this is to block the intervals, then convert the result back to a data frame, retaining the blocking in a column. Then you can simply group by (i, l, bucket) and sample one row from each. You probably want to set the origin option in the block statement so you know where the cuts land; I just let them fall where they may.
f <- function(x) {
  block(x, x$dt, period = "minute", every = 10) %>% tibble()
}
fr <- f(stack) %>%
  mutate(bucket = row_number()) %>%
  unnest()
fr %>%
  group_by(i, l, bucket) %>%
  group_modify(~ sample_n(.x, 1))
# A tibble: 7 x 4
# Groups: i, l, bucket [7]
i l bucket dt
<chr> <chr> <int> <dttm>
1 ind1 station1 1 2018-05-11 05:03:07
2 ind1 station1 5 2020-12-28 23:06:20
3 ind1 station1 6 2020-12-30 23:04:25
4 ind1 station2 2 2018-05-11 05:13:25
5 ind2 station2 3 2020-12-26 12:07:26
6 ind2 station3 4 2020-12-26 12:10:07
7 ind4 station6 2 2018-05-11 05:14:27
I have a data.frame that doesn't account for leap years (i.e., all years are 365 days). I would like to repeat the last-day-of-February value during each leap year. The DF in my code below holds a fake data set; I intentionally remove the leap-day value in DF_NoLeapday. I would like to add a leap-day value to DF_NoLeapday by repeating the value of the last day of February in a leap year (in our example it would be the Feb 28, 2004 value). I would prefer a general solution that can be applied to many years of data.
set.seed(55)
DF <- data.frame(date = seq(as.Date("2003-01-01"), to= as.Date("2005-12-31"), by="day"),
A = runif(1096, 0,10),
Z = runif(1096,5,15))
DF_NoLeapday <- DF[!(format(DF$date,"%m") == "02" & format(DF$date, "%d") == "29"), ,drop = FALSE]
We can use complete on the 'date' column, which is already Date class, to expand the rows and fill in the missing dates:
library(dplyr)
library(tidyr)
out <- DF_NoLeapday %>%
complete(date = seq(min(date), max(date), by = '1 day'))
dim(out)
#[1] 1096 3
out %>%
filter(date >= '2004-02-28', date <= '2004-03-01')
# A tibble: 3 x 3
# date A Z
# <date> <dbl> <dbl>
#1 2004-02-28 9.06 9.70
#2 2004-02-29 NA NA
#3 2004-03-01 5.30 7.35
By default the other columns' values are filled with NA; if we need a different value, it can be set within complete via its fill argument.
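For example, to put 0 in the new rows instead of NA (a sketch on the same data):
out0 <- DF_NoLeapday %>%
  complete(date = seq(min(date), max(date), by = '1 day'),
           fill = list(A = 0, Z = 0))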
If we need the previous values instead, we use fill:
out <- out %>%
fill(A, Z)
out %>%
filter(date >= '2004-02-28', date <= '2004-03-01')
# A tibble: 3 x 3
# date A Z
# <date> <dbl> <dbl>
#1 2004-02-28 9.06 9.70
#2 2004-02-29 9.06 9.70
#3 2004-03-01 5.30 7.35
I have a data set with values every minute and I want to calculate the average value for every hour. I have tried using group_by(), filter() and summarise() from the dplyr package to reduce the data to hourly values. When I use only these functions I am able to get the mean value for every hour, but only per month, and I want it for each day.
> head(DF)
datetime pw cu year m d hr min
1 2017-08-18 14:56:00 0.0630341 1.94065 2017 8 18 14 53
2 2017-08-18 14:57:00 0.0604653 1.86771 2017 8 18 14 57
3 2017-08-18 14:58:00 0.0601318 1.86596 2017 8 18 14 58
4 2017-08-18 14:59:00 0.0599276 1.83761 2017 8 18 14 59
5 2017-08-18 15:00:00 0.0598998 1.84177 2017 8 18 15 0
I had to use a for loop to reduce my table. I wrote the following to do it:
# start holds the first hour, e.g. start <- as.POSIXct("2018-01-01 00:00:00")
datetime <- c()
eg_bf <- c()
for (i in 1:8760) {
  hour <- start + 3600
  hourly <- DF %>%                   # use a temporary result so DF is not overwritten
    filter(datetime >= start & datetime < hour) %>%
    summarise(eg = mean(pw))
  datetime <- append(datetime, start)
  eg_bf <- append(eg_bf, hourly$eg)
  start <- hour
}
new_DF <- data.frame(datetime, eg_bf)
So, I was able to get my new data set with the mean value for every hour of the year:
datetime eg_bf
1 2018-01-01 00:00:00 0.025
2 2018-01-01 01:00:00 0.003
3 2018-01-01 02:00:00 0.002
4 2018-01-01 03:00:00 0.010
5 2018-01-01 04:00:00 0.015
The problem I'm facing is that it takes a lot of time to run. The idea is to add this calculation to a Shiny UI, so every time I make a change it must recalculate quickly. Any idea how to improve this calculation?
You can try this: use make_datetime from the lubridate package to build a new date_time column from the year, month, day and hour columns of your dataset, then group and summarise on the new column.
library(dplyr)
library(lubridate)
df %>%
mutate(date_time = make_datetime(year, m, d, hr)) %>%
group_by(date_time) %>%
summarise(eg_bf = mean(pw))
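If the year/month/day/hour columns were not already in the data, a variant sketch (using lubridate::floor_date, which is not part of the answer above) groups directly on the truncated timestamp:
df %>%
  group_by(date_time = floor_date(datetime, unit = "hour")) %>%
  summarise(eg_bf = mean(pw))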
@Adam Gruer's answer provides a nice solution for the date variable that should solve your question. The calculation of the mean per hour does work with just dplyr, though:
df %>%
group_by(year, m, d, hr) %>%
summarise(test = mean(pw))
# A tibble: 2 x 5
# Groups: year, m, d [?]
year m d hr test
<int> <int> <int> <int> <dbl>
1 2017 8 18 14 0.0609
2 2017 8 18 15 0.0599
You said in your question:
When I use only these functions I am able to get the mean value for every hour but only every month and I want it for each day.
What did you do differently?
Even if you've found your answer, I believe this is worth mentioning:
If you're working with a lot of data and speed is an issue, then you might want to see if you can use data.table instead of dplyr.
You can see with a simple benchmarking how much faster data.table is:
library(dplyr)
library(lubridate)
library(data.table)
library(microbenchmark)
set.seed(123)
# dummy data, one year, one entry per minute
# first as data frame
DF <- data.frame(datetime = seq(as.POSIXct("2018-01-01 00:00:00"),
as.POSIXct("2019-01-02 00:00:00"), 60),
pw = runif(527041)) %>%
mutate(year = year(datetime), m=month(datetime),
d=day(datetime), hour = hour(datetime))
# save it as a data.table
dt <- as.data.table(DF)
# transformation with dplyr
f_dplyr <- function(){
DF %>%
group_by(year, m, d, hour) %>%
summarize(eg_bf = mean(pw))
}
# transformation with data.table
f_datatable <- function() {
dt[, mean(pw), by=.(year, m, d, hour)]
}
# benchmarking
microbenchmark(f_dplyr(), f_datatable())
#
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f_dplyr() 41.240235 44.075019 46.85497 45.64998 47.95968 76.73714 100 b
# f_datatable() 9.081295 9.712694 12.53998 10.55697 11.33933 41.85217 100 a
Check out this post, which tells a lot: data.table vs dplyr: can one do something well the other can't or does poorly?
As I understand it, you have a data frame of 365 * 24 * 60 rows. The code below returns the result instantly. The outcome is mean(pw) grouped by every hour of the year.
remove(list = ls())
library(dplyr)
library(lubridate)
library(purrr)
library(tibble)
date_time <- seq.POSIXt(
as.POSIXct("2018-01-01"),
as.POSIXct("2019-01-01"),
by = "1 min"
)
n <- length(date_time)
data <- tibble(
date_time = date_time,
pw = runif(n),
cu = runif(n),
ye = year(date_time),
mo = month(date_time),
da = day(date_time),
hr = hour(date_time)
)
grouped <- data %>%
group_by(
ye, mo, da, hr
) %>%
summarise(
mean_pw = mean(pw)
)
I have a csv file that is written like this
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50
I'd like R to produce something like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980
1/7/1980 30
Then I would like R to bring the last observation forward like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980 25
1/7/1980 30
I'd like two separate data.tables created: one with just the actual data, and another with the last observation brought forward.
Thanks for all the help!
Edit: I also need any NAs that are produced to be changed to 0.
You could also use tidyverse:
library(tidyverse)
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data) %>%
replace(., is.na(.), 0)
First 10 rows:
# A tibble: 104 x 2
Date Data
<date> <dbl>
1 1980-01-01 0
2 1980-01-02 0
3 1980-01-03 0
4 1980-01-04 0
5 1980-01-05 25
6 1980-01-06 25
7 1980-01-07 30
8 1980-01-08 30
9 1980-01-09 30
10 1980-01-10 30
I've used as a starting point the 1st day of the month and year of the minimum date, and as the maximum the maximum date; this can of course be adjusted as needed.
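For instance, to start at the exact first observation instead of the first of its month (a variant, not part of the answer above):
df %>%
  mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
  complete(Date = seq(min(Date), max(Date), by = "day")) %>%
  fill(Data)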
EDIT: @Sotos has an even better suggestion for a more concise approach (by better usage of the format argument):
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data)
The solution is:
create a data.frame with successive dates
merge it with your original data.frame
use the na.locf function from zoo to carry your data forward
Here is the code. I use lubridate to work with dates (and zoo for as.yearmon).
library(lubridate)
library(zoo)
df$Date <- mdy(df$Date)
successive <- data.frame(Date = seq(as.Date(as.yearmon(df$Date[1])), df$Date[length(df$Date)], by = "days"))
successive is the vector of successive dates. Now the merging:
result <- merge(df, successive, all.y = TRUE, by = "Date")
And the forward propagation:
result$Data <- na.locf(result$Data, na.rm = FALSE)
Date Data
1 1980-01-05 25
2 1980-01-06 25
3 1980-01-07 30
4 1980-01-08 30
5 1980-01-09 30
6 1980-01-10 30
7 1980-01-11 30
8 1980-01-12 30
9 1980-01-13 30
10 1980-01-14 30
11 1980-01-15 30
12 1980-01-16 30
13 1980-01-17 30
14 1980-01-18 30
15 1980-01-19 30
16 1980-01-20 30
17 1980-01-21 30
18 1980-01-22 30
19 1980-01-23 30
20 1980-01-24 30
21 1980-01-25 30
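The question's edit also asks for any remaining NAs to be replaced with 0; with this result that is one base R line (a sketch):
result$Data[is.na(result$Data)] <- 0  # dates with no prior observation become 0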
The data:
df <- read.table(text = "Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50", header = T)
Assuming that the result should start at the first of the month of the first date and end at the last date, and that the input data frame is DF, shown reproducibly in the Note at the end: convert DF to a zoo object z, create a grid of dates g, merge them to give zoo objects z0 (with zero filling) and zz (with na.locf filling), and optionally convert back to data frames, or else just leave them as is so you can use zoo for further processing.
library(zoo)
z <- read.zoo(DF, header = TRUE, format = "%m/%d/%Y")
g <- seq(as.Date(as.yearmon(start(z))), end(z), "day")
z0 <- merge(z, zoo(, g), fill = 0) # zero filled
zz <- na.locf0(merge(z, zoo(, g))) # na.locf filled
# optional
DF0 <- fortify.zoo(z0) # zero filled
DF2 <- fortify.zoo(zz) # na.locf filled
data.table
The question mentions data tables, and if that refers to the data.table package, then add:
library(data.table)
DT0 <- data.table(DF0) # zero filled
DT2 <- data.table(DF2) # na.locf filled
Variations
I wasn't clear on whether the question was asking for a zero-filled answer and an na.locf-filled answer, or just an na.locf-filled answer whose remaining NA values are zero-filled, but assumed the former case. (1) If you want to fill the NAs that are left in the na.locf-filled answer, then add:
zz[is.na(zz)] <- 0
(2) If you want to end at the end of the last month rather than at the last date, replace end(z) with as.Date(as.yearmon(end(z)), frac = 1).
(3) If you want to start at the first date rather than the first of the month of the first date, replace as.Date(as.yearmon(start(z))) with start(z).
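Concretely, variations (2) and (3) change only the line that builds the date grid (a sketch of each):
g2 <- seq(as.Date(as.yearmon(start(z))), as.Date(as.yearmon(end(z)), frac = 1), "day")  # (2) end at month end
g3 <- seq(start(z), end(z), "day")  # (3) start and end at the data's own dates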
As an alternative to (3), to start at the first date and end at the last date, we could simply convert to ts and back. Note that we need to restore the Date class on the second line below, since the ts class cannot handle Date class directly.
z2.na <- as.zoo(as.ts(z))
time(z2.na) <- as.Date(time(z2.na))
zz20 <- replace(z2.na, is.na(z2.na), 0) # zero filled
zz2 <- na.locf0(z2.na) # na.locf filled
Note
Lines <- "
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50"
DF <- read.table(text = Lines, header = TRUE)
I have the following two data frames:
Date <- seq(as.Date("2013/1/1"), by = "day", length.out = 46)
x <-data.frame(Date)
x$discharge <- c("1000","1100","1200","1300","1400","1200","1300","1300","1200","1100","1200","1200","1100","1400","1200","1100","1400","1000","1100","1200","1300","1400","1200","1300","1300","1200","1100","1200","1200","1100","1400","1200","1100","1400","1000","1100","1200","1300","1400","1200","1300","1300","1200","1100","1200","1200")
x$discharge <- as.numeric(x$discharge)
And
Date_from <- c("2013-01-01","2013-01-15","2013-01-21","2013-02-10")
Date_to <- c("2013-01-07","2013-01-20","2013-01-25","2013-02-15")
y <- data.frame(Date_from,Date_to)
y$concentration <- c("1.5","2.5","1.5","3.5")
y$Date_from <- as.Date(y$Date_from)
y$Date_to <- as.Date(y$Date_to)
y$concentration <- as.numeric(y$concentration)
I am trying to calculate the average discharge from the daily discharges in data frame x for each row in data frame y, based on the date range Date_from to Date_to in data frame y. Notice that there is a gap in the measurements in data frame y between 2013-01-08 and 2013-01-14, and between 2013-01-26 and 2013-02-09. This gap is due to the fact that no measurements were taken during this time, and it is causing me headaches, as I was using the following code to calculate the average discharge for each date range in y:
rng <- cut(x$Date, breaks = c(y$Date_from, max(y$Date_to)), include.lowest = TRUE)
range <- cbind(x, rng)
discharge <- aggregate(cbind(mean = x$discharge) ~ rng, FUN = mean)
However, if you check the ranges in data frame range, the range for 2013-01-01 to 2013-01-07 is extended up to 2013-01-14, but I only need it to 2013-01-07 and then a break until the next range begins on 2013-01-15.
You can try a tidyverse approach:
library(tidyverse)
y %>%
split(seq_along(1:nrow(.))) %>%
map(~filter(x, between(Date, .$Date_from, .$Date_to)) %>%
summarise(Mean=mean(discharge))) %>%
bind_rows() %>%
bind_cols(y,.)
Date_from Date_to concentration Mean
1 2013-01-01 2013-01-07 1.5 1214.286
2 2013-01-15 2013-01-20 2.5 1166.667
3 2013-01-21 2013-01-25 1.5 1300.000
4 2013-02-10 2013-02-15 3.5 1216.667
Using only this part of the code, you can see the values and the groups:
y %>%
split(seq_along(1:nrow(.))) %>%
map(~filter(x, between(Date, .$Date_from, .$Date_to)))
Here's a base answer:
helper <- merge(x, y)
helper <- helper[helper$Date >= helper$Date_from & helper$Date <= helper$Date_to, ]
aggregate(helper$discharge,
list(Date_from = helper$Date_from,
Date_to = helper$Date_to),
FUN = 'mean')
Date_from Date_to x
1 2013-01-01 2013-01-07 1214.286
2 2013-01-15 2013-01-20 1166.667
3 2013-01-21 2013-01-25 1300.000
4 2013-02-10 2013-02-15 1216.667