Creating a 4-hr time interval using a reference column in R - r

I want to create a 4-hrs interval using a reference column from a data frame. I have a data frame like this one:
species<-"ABC"
ind<-rep(1:4,each=24)
hour<-rep(seq(0,23,by=1),4)
depth<-runif(length(ind),1,50)
df<-data.frame(cbind(species,ind,hour,depth))
df$depth<-as.numeric(df$depth)
What I would like is to create a new column (without changing the information or dimensions of the original data frame) that could look at my hour column (reference column) and based on that value will give me a 4-hrs time interval. For example, if the value from the hour column is between 0 and 3, then the value in new column will be 0; if the value is between 4 and 7 the value in the new column will be 4, and so on... In excel I used to use the floor/ceiling functions for this, but in R they are not exactly the same. Also, if someone has an easier suggestion for this using the original date/time data that could work too. In my original script I used the function as.POSIXct to get the date/time data, and from there my hour column.
I appreciate your help,

what about taking the column of hours, converting it to integers, and using integer division to get the floor? something like this
# convert hour to integer (hour is currently a col of factors)
i <- as.numeric(levels(df$hour))[df$hour]
# make new column
df$interval <- (i %/% 4) * 4

Expanding on my comment, since I think you're ultimately looking for actual dates at some point...
Some sample hourly data:
set.seed(1)
mydata <- data.frame(species = "ABC",
ind = rep(1:4, each=24),
depth = runif(96, 1, 50),
datetime = seq(ISOdate(2000, 1, 1, 0, 0, 0),
by = "1 hour", length.out = 96))
list(head(mydata), tail(mydata))
# [[1]]
# species ind depth datetime
# 1 ABC 1 14.00992 2000-01-01 00:00:00
# 2 ABC 1 19.23407 2000-01-01 01:00:00
# 3 ABC 1 29.06981 2000-01-01 02:00:00
# 4 ABC 1 45.50218 2000-01-01 03:00:00
# 5 ABC 1 10.88241 2000-01-01 04:00:00
# 6 ABC 1 45.02109 2000-01-01 05:00:00
#
# [[2]]
# species ind depth datetime
# 91 ABC 4 12.741841 2000-01-04 18:00:00
# 92 ABC 4 3.887784 2000-01-04 19:00:00
# 93 ABC 4 32.472125 2000-01-04 20:00:00
# 94 ABC 4 43.937191 2000-01-04 21:00:00
# 95 ABC 4 39.166819 2000-01-04 22:00:00
# 96 ABC 4 40.068132 2000-01-04 23:00:00
Transforming that data using cut and format:
mydata <- within(mydata, {
hourclass <- cut(datetime, "4 hours") # Find the intervals
hourfloor <- format(as.POSIXlt(hourclass), "%H") # Display just the "hour"
})
list(head(mydata), tail(mydata))
# [[1]]
# species ind depth datetime hourclass hourfloor
# 1 ABC 1 14.00992 2000-01-01 00:00:00 2000-01-01 00:00:00 00
# 2 ABC 1 19.23407 2000-01-01 01:00:00 2000-01-01 00:00:00 00
# 3 ABC 1 29.06981 2000-01-01 02:00:00 2000-01-01 00:00:00 00
# 4 ABC 1 45.50218 2000-01-01 03:00:00 2000-01-01 00:00:00 00
# 5 ABC 1 10.88241 2000-01-01 04:00:00 2000-01-01 04:00:00 04
# 6 ABC 1 45.02109 2000-01-01 05:00:00 2000-01-01 04:00:00 04
#
# [[2]]
# species ind depth datetime hourclass hourfloor
# 91 ABC 4 12.741841 2000-01-04 18:00:00 2000-01-04 16:00:00 16
# 92 ABC 4 3.887784 2000-01-04 19:00:00 2000-01-04 16:00:00 16
# 93 ABC 4 32.472125 2000-01-04 20:00:00 2000-01-04 20:00:00 20
# 94 ABC 4 43.937191 2000-01-04 21:00:00 2000-01-04 20:00:00 20
# 95 ABC 4 39.166819 2000-01-04 22:00:00 2000-01-04 20:00:00 20
# 96 ABC 4 40.068132 2000-01-04 23:00:00 2000-01-04 20:00:00 20
Note that your new "hourclass" variable is a factor and the new "hourfloor" variable is character, but you can easily change those, even during the within stage.
str(mydata)
# 'data.frame': 96 obs. of 6 variables:
# $ species : Factor w/ 1 level "ABC": 1 1 1 1 1 1 1 1 1 1 ...
# $ ind : int 1 1 1 1 1 1 1 1 1 1 ...
# $ depth : num 14 19.2 29.1 45.5 10.9 ...
# $ datetime : POSIXct, format: "2000-01-01 00:00:00" "2000-01-01 01:00:00" ...
# $ hourclass: Factor w/ 24 levels "2000-01-01 00:00:00",..: 1 1 1 1 2 2 2 2 3 3 ...
# $ hourfloor: chr "00" "00" "00" "00" ...

tip number 1, don't use cbind to create a data.frame with differing type of columns, everything gets coerced to the same type (in this case factor)
findInterval or cut would seem appropriate here.
df <- data.frame(species,ind,hour,depth)
# copy
df2 <- df
df2$fourhour <- c(0,4,8,12,16,20)[findInterval(df$hour, c(0,4,8,12,16,20))]

Though there is probably a simpler way, here is one attempt.
Make your data.frame not using cbind first though, so hour is not a factor but numeric
df <- data.frame(species,ind,hour,depth)
Then:
df$interval <- factor(findInterval(df$hour,seq(0,23,4)),labels=seq(0,23,4))
Result:
> head(df)
species ind hour depth interval
1 ABC 1 0 23.11215 0
2 ABC 1 1 10.63896 0
3 ABC 1 2 18.67615 0
4 ABC 1 3 28.01860 0
5 ABC 1 4 38.25594 4
6 ABC 1 5 30.51363 4
You could also make the labels a bit nicer like:
cutseq <- seq(0,23,4)
df$interval <- factor(
findInterval(df$hour,cutseq),
labels=paste(cutseq,cutseq+3,sep="-")
)
Result:
> head(df)
species ind hour depth interval
1 ABC 1 0 23.11215 0-3
2 ABC 1 1 10.63896 0-3
3 ABC 1 2 18.67615 0-3
4 ABC 1 3 28.01860 0-3
5 ABC 1 4 38.25594 4-7
6 ABC 1 5 30.51363 4-7

Related

calculate number of frost change days (number of days) from the weather hourly data in r

I have to calculate the following data Number of frost change days**(NFCD)**** as weekly basis.
That means the number of days in which minimum temperature and maximum temperature cross 0°C.
Let's say I work with years 1957-1980 with hourly temp.
Example data (couple of rows look like):
Date Time (UTC) temperature
1957-07-01 00:00:00 5
1957-07-01 03:00:00 6.2
1957-07-01 05:00:00 9
1957-07-01 06:00:00 10
1957-07-01 07:00:00 10
1957-07-01 08:00:00 14
1957-07-01 09:00:00 13.2
1957-07-01 10:00:00 15
1957-07-01 11:00:00 15
1957-07-01 12:00:00 16.3
1957-07-01 13:00:00 15.8
Expected data:
year month week NFCD
1957 7 1 1
1957 7 2 5
dat <- data.frame(date=c(rep("A",5),rep("B",5)), time=rep(1:5, times=2), temp=c(1:5,-2,1:4))
dat
# date time temp
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 4
# 5 A 5 5
# 6 B 1 -2
# 7 B 2 1
# 8 B 3 2
# 9 B 4 3
# 10 B 5 4
aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
# date temp
# 1 A FALSE
# 2 B TRUE
(then rename temp to NFCD)
Using the data from r2evans's answer you can also use tidyverse logic:
library(tidyverse)
dat %>%
group_by(date) %>%
summarize(NFCD = min(temp) < 0 & max(temp) > 0)
which gives:
# A tibble: 2 x 2
date NFCD
<chr> <lgl>
1 A FALSE
2 B TRUE

Clustering dates into buckets of 30 days

I have a dataframe with 5 columns. One of these columns contain dates which I want to cluster for further analysis.
I want to create a new column that generates a number so that:
Anything within the last 30 day daterange gets a "1"
Anything within the last 30 to 60 days daterange gets a "2", and so on...
The dates are in a format
%Y-%m-%d
What would be the best way to do this?
You could try something like:
library(tidyverse)
df <-
sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 12)
df %>%
enframe(name = NULL) %>%
mutate(
target_date = as.Date('2000/01/01'),
date_diff = as.numeric(target_date - value),
what_you_want = ceiling(date_diff / 30)
)
# A tibble: 12 x 4
value target_date date_diff what_you_want
<date> <date> <dbl> <dbl>
1 1999-08-02 2000-01-01 152 6
2 1999-04-09 2000-01-01 267 9
3 1999-03-05 2000-01-01 302 11
4 1999-07-06 2000-01-01 179 6
5 1999-06-12 2000-01-01 203 7
6 1999-11-14 2000-01-01 48 2
7 1999-04-03 2000-01-01 273 10
8 1999-07-27 2000-01-01 158 6
9 1999-08-12 2000-01-01 142 5
10 1999-03-06 2000-01-01 301 11
11 1999-12-11 2000-01-01 21 1
12 1999-06-15 2000-01-01 200 7

R Max of Same Date, Previous Date, and Previous Hour Value

A couple basic data manipulations. I searched with different wordings and couldn't find much.
I have data structured as below. In reality the hourly data is continuous, but I just included 4 lines as an example.
start <- as.POSIXlt(c('2017-1-1 1:00','2017-1-1 2:00','2017-1-2 1:00','2017-1-2 2:00'))
values <- as.numeric(c(2,5,4,3))
df <- data.frame(start,values)
df
start values
1 2017-01-01 01:00:00 2
2 2017-01-01 02:00:00 5
3 2017-01-02 01:00:00 4
4 2017-01-02 02:00:00 3
I would like to add a couple columns that:
1) Show the max of the same day.
2) Show the max of the previous day.
3) Show the value of one previous hour.
The goal is to have an output like:
MaxValueDay <- as.numeric(c(5,5,4,4))
MaxValueYesterday <- as.numeric(c(NA,NA,5,5))
PreviousHourValue <- as.numeric(c(NA,2,NA,4))
df2 <- data.frame(start,values,MaxValueDay,MaxValueYesterday,PreviousHourValue)
df2
start values MaxValueDay MaxValueYesterday PreviousHourValue
1 2017-01-01 01:00:00 2 5 NA NA
2 2017-01-01 02:00:00 5 5 NA 2
3 2017-01-02 01:00:00 4 4 5 NA
4 2017-01-02 02:00:00 3 4 5 4
Any help would be greatly appreciated. Thanks
A solution using dplyr, magrittr, and lubridate packages:
library(dplyr)
library(magrittr)
library(lubridate)
df %>%
within(MaxValueDay <- sapply(as.Date(start), function (x) max(df$values[which(x==as.Date(start))]))) %>%
within(MaxValueYesterday <- MaxValueDay[sapply(as.Date(start)-1, match, as.Date(start))]) %>%
within(PreviousHourValue <- values[sapply(start-hours(1), match, start)])
# start values MaxValueDay MaxValueYesterday PreviousHourValue
# 1 2017-01-01 01:00:00 2 5 NA NA
# 2 2017-01-01 02:00:00 5 5 NA 2
# 3 2017-01-02 01:00:00 4 4 5 NA
# 4 2017-01-02 02:00:00 3 4 5 4

Splitting time stamps into two columns

I have a column with ID and for each ID several even dates. I want to create two columns with rows for each id one column with the first date and the other with the next consecutive date. The next row for the ID should have the entry in the previous row second column and the next consecutive date for this ID. An example:
This is the data I have
id date
1 1 2015-01-01
2 1 2015-01-18
3 1 2015-08-02
4 2 2015-01-01
5 2 2015-01-13
6 3 2015-01-01
This is data I want
id date1 date2
1 1 2015-01-01 2015-01-18
2 1 2015-01-18 2015-08-02
3 1 2015-08-02 NA
4 2 2015-01-01 2015-01-13
5 2 2015-01-13 NA
6 3 2015-01-01 NA
Using dplyr:
library(dplyr)
df %>% group_by(id) %>%
mutate(date2 = lead(date))
id date date2
(int) (fctr) (fctr)
1 1 2015-01-01 2015-01-18
2 1 2015-01-18 2015-08-02
3 1 2015-08-02 NA
4 2 2015-01-01 2015-01-13
5 2 2015-01-13 NA
6 3 2015-01-01 NA
Using data.table, you can do as follow:
require(data.table)
DT[, .(date1 = date, date2 = shift(date, type = "lead")), by = id]
Or simply (also mentioned by #docendodiscimus)
DT[, date2 := shift(date, type = "lead"), by = id]
Also, if you are interested on making a recursive n columns (edited, taking advantage of #docendodiscimus comment to simplify the code)
i = 1:5
DT[, paste0("date", i+1) := shift(date, i, type = "lead"), by = id]
Base R solution using transform() and ave():
transform(df,date1=date,date2=ave(date,id,FUN=function(x) c(x[-1L],NA)),date=NULL);
## id date1 date2
## 1 1 2015-01-01 2015-01-18
## 2 1 2015-01-18 2015-08-02
## 3 1 2015-08-02 <NA>
## 4 2 2015-01-01 2015-01-13
## 5 2 2015-01-13 <NA>
## 6 3 2015-01-01 <NA>
The above line of code produces a copy of the data.frame. The return value can be assigned over the original df, assigned to a new variable, or passed as an argument/operand to a function/operator. If you want to modify it in-place, which would be a more efficient way to overwrite df, you can do this:
df$date2 <- ave(df$date,df$id,FUN=function(x) c(x[-1L],NA));
colnames(df)[colnames(df)=='date'] <- 'date1';
df;
## id date1 date2
## 1 1 2015-01-01 2015-01-18
## 2 1 2015-01-18 2015-08-02
## 3 1 2015-08-02 <NA>
## 4 2 2015-01-01 2015-01-13
## 5 2 2015-01-13 <NA>
## 6 3 2015-01-01 <NA>
df$date2 = ifelse(df$id==c(df$id[-1],-1), c(df$date[-1],NA), NA)

How to fill missing data (not NA value) with value 0?

My data are like:
Date Value
00:00 10
01:00 8
02:00 1
04:00 4
...
Some data are missing if the value=0. My question is how to fill these data back in. Like, after 02:00 17, fill in a row 03:00 0.
I have done some search, but only found solutions to replace NAs with 0. In my case, my data are not even showing in the data frame. Is there a way to check whether there's a gap between adjacent data?
Here is an approach using data.table:
library(data.table)
data = data.frame(Date=as.Date(c('2015-03-20','2015-03-24','2015-03-25','2015-03-28')),
Value=c(1,2,3,4))
# Date Value
#1 2015-03-20 1
#2 2015-03-24 2
#3 2015-03-25 3
#4 2015-03-28 4
dt = data.table(Date=seq(min(data$Date), max(data$Date), by='days'))
setkey(setDT(data), Date)[dt][!data, Value:=0][]
# Date Value
#1: 2015-03-20 1
#2: 2015-03-21 0
#3: 2015-03-22 0
#4: 2015-03-23 0
#5: 2015-03-24 2
#6: 2015-03-25 3
#7: 2015-03-26 0
#8: 2015-03-27 0
#9: 2015-03-28 4
It is basically a join on the resampling table - setkey(setDT(data), Date)[dt] - you want (you have to define it, here this is dt). Then you replace values not present in your original dataset with 0's - [!data, Value:=0]
Two simple ways I can think of in base r:
s <- format(seq(s <- as.POSIXct('2000-01-01'), s + 3.6e4, by = 'hour'), '%H:%M')
# [1] "00:00" "01:00" "02:00" "03:00" "04:00" "05:00" "06:00" "07:00" "08:00"
# [10] "09:00" "10:00"
ss <- s[c(1:3, 5)]
dd <- data.frame(hour = ss, value = c(10, 8, 1, 4), stringsAsFactors = FALSE)
# hour value
# 1 00:00 10
# 2 01:00 8
# 3 02:00 1
# 4 04:00 4
I made two sample vectors, s are the times you want to fill in, and ss are the times in your data frame that you have. presumably you already have both of these, so then you can
create a data frame with the sequence of times you want and merge the two with all = TRUE so that there are no duplicates; then replace the NA with 0
dm <- data.frame(hour = s)
out <- merge(dm, dd, all = TRUE)
# hour value
# 1 00:00 10
# 2 01:00 8
# 3 02:00 1
# 4 03:00 NA
# 5 04:00 4
# 6 05:00 NA
# 7 06:00 NA
# 8 07:00 NA
# 9 08:00 NA
# 10 09:00 NA
# 11 10:00 NA
out[is.na(out)] <- 0
# hour value
# 1 00:00 10
# 2 01:00 8
# 3 02:00 1
# 4 03:00 0
# 5 04:00 4
# 6 05:00 0
# 7 06:00 0
# 8 07:00 0
# 9 08:00 0
# 10 09:00 0
# 11 10:00 0
or you can give the exact hours you want in a vector or take the set difference between the times you want and the times you have and order the results:
## giving the times explicitly
out <- rbind(dd, data.frame(hour = sprintf('%02s:00', c(3, 5:10)), value = 0))
## or more programmatically:
out <- rbind(dd, data.frame(hour = setdiff(s, dd$hour),
value = 0))
out[order(out$hour), ]
# hour value
# 1 00:00 10
# 2 01:00 8
# 3 02:00 1
# 5 03:00 0
# 4 04:00 4
# 6 05:00 0
# 7 06:00 0
# 8 07:00 0
# 9 08:00 0
# 10 09:00 0
# 11 10:00 0

Resources