R Max of Same Date, Previous Date, and Previous Hour Value - r

A couple basic data manipulations. I searched with different wordings and couldn't find much.
I have data structured as below. In reality the hourly data is continuous, but I just included 4 lines as an example.
start <- as.POSIXlt(c('2017-1-1 1:00','2017-1-1 2:00','2017-1-2 1:00','2017-1-2 2:00'))
values <- as.numeric(c(2,5,4,3))
df <- data.frame(start,values)
df
start values
1 2017-01-01 01:00:00 2
2 2017-01-01 02:00:00 5
3 2017-01-02 01:00:00 4
4 2017-01-02 02:00:00 3
I would like to add a couple columns that:
1) Show the max of the same day.
2) Show the max of the previous day.
3) Show the value of one previous hour.
The goal is to have an output like:
MaxValueDay <- as.numeric(c(5,5,4,4))
MaxValueYesterday <- as.numeric(c(NA,NA,5,5))
PreviousHourValue <- as.numeric(c(NA,2,NA,4))
df2 <- data.frame(start,values,MaxValueDay,MaxValueYesterday,PreviousHourValue)
df2
start values MaxValueDay MaxValueYesterday PreviousHourValue
1 2017-01-01 01:00:00 2 5 NA NA
2 2017-01-01 02:00:00 5 5 NA 2
3 2017-01-02 01:00:00 4 4 5 NA
4 2017-01-02 02:00:00 3 4 5 4
Any help would be greatly appreciated. Thanks

A solution using dplyr, magrittr, and lubridate packages:
library(dplyr)
library(magrittr)
library(lubridate)
df %>%
within(MaxValueDay <- sapply(as.Date(start), function (x) max(df$values[which(x==as.Date(start))]))) %>%
within(MaxValueYesterday <- MaxValueDay[sapply(as.Date(start)-1, match, as.Date(start))]) %>%
within(PreviousHourValue <- values[sapply(start-hours(1), match, start)])
# start values MaxValueDay MaxValueYesterday PreviousHourValue
# 1 2017-01-01 01:00:00 2 5 NA NA
# 2 2017-01-01 02:00:00 5 5 NA 2
# 3 2017-01-02 01:00:00 4 4 5 NA
# 4 2017-01-02 02:00:00 3 4 5 4

Related

calculate number of frost change days (number of days) from the weather hourly data in r

I have to calculate the following data Number of frost change days**(NFCD)**** as weekly basis.
That means the number of days in which minimum temperature and maximum temperature cross 0°C.
Let's say I work with years 1957-1980 with hourly temp.
Example data (couple of rows look like):
Date Time (UTC) temperature
1957-07-01 00:00:00 5
1957-07-01 03:00:00 6.2
1957-07-01 05:00:00 9
1957-07-01 06:00:00 10
1957-07-01 07:00:00 10
1957-07-01 08:00:00 14
1957-07-01 09:00:00 13.2
1957-07-01 10:00:00 15
1957-07-01 11:00:00 15
1957-07-01 12:00:00 16.3
1957-07-01 13:00:00 15.8
Expected data:
year month week NFCD
1957 7 1 1
1957 7 2 5
dat <- data.frame(date=c(rep("A",5),rep("B",5)), time=rep(1:5, times=2), temp=c(1:5,-2,1:4))
dat
# date time temp
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 4
# 5 A 5 5
# 6 B 1 -2
# 7 B 2 1
# 8 B 3 2
# 9 B 4 3
# 10 B 5 4
aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
# date temp
# 1 A FALSE
# 2 B TRUE
(then rename temp to NFCD)
Using the data from r2evans's answer you can also use tidyverse logic:
library(tidyverse)
dat %>%
group_by(date) %>%
summarize(NFCD = min(temp) < 0 & max(temp) > 0)
which gives:
# A tibble: 2 x 2
date NFCD
<chr> <lgl>
1 A FALSE
2 B TRUE

How to calculate number of hours from a fixed start point that varies among levels of a variable

The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Datetime= ymd_hms(c("2016-08-21 00:00:00","2016-08-24 08:00:00","2016-08-23 12:00:00","2016-08-29 03:00:00","2016-08-27 23:00:00","2016-09-02 02:00:00","2016-09-01 12:00:00","2016-09-09 04:00:00","2016-09-01 12:00:00","2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate for each row, the number of hours (Hours_since_begining) since the first time that the individual was detected.
I would expect something like that (It can contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_begining
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this :
library(tidyverse)
# first get min datetime by ID
min_datetime_id <- df1 %>% group_by(ID) %>% summarise(min_datetime=min(Datetime))
# join with df1 and compute time difference
df1 <- df1 %>% left_join(min_datetime_id) %>% mutate(Hours_since_beginning= as.numeric(difftime(Datetime, min_datetime,units="hours")))

Adding different value to different parts of a dataframe without loop

I want to speed up this calculation I need to do on specific part of the dataframe, this is an example data
days <- c("01.01.2018","01.01.2018","01.01.2018",
"02.01.2018","02.01.2018","02.01.2018",
"03.01.2018","03.01.2018","03.01.2018")
time <- c("00:00:00","01:00:00","02:00:00",
"00:00:00","01:00:00","02:00:00",
"00:00:00","01:00:00","02:00:00")
a <- c(1,2,3,
1,2,3,
1,2,3)
b <- c(1,2,3,
5,6,7,
10,11,12)
results <- NA
df1 <- data.frame(days,time,a,results)
df2 <- data.frame(days,time,b)
I need to add the value from df2$b at 00:00:00 of each day to the same entire day values in df1$a and store it in results.
Right now I'm doing it like this:
ndays <- unique(df1$days)
for(i in 1:length(ndays)) {
factor <- df2[(df2$days == ndays[i] & df2$time == "00:00:00"),]$b
df1[df1$days == ndays[i],]$results <- df1[df1$days == ndays[i],]$a + factor
}
The problem is, I've got huge dataframes with lot of days and cycling them one by one is slow. Is there a fastest way to do so?
edit: This is the filled results column after the cycle
df1
days time a results
1 01.01.2018 00:00:00 1 2 # results = a + df$b # 01.01.2018 00:00:00
2 01.01.2018 01:00:00 2 3 # results = a + df$b # 01.01.2018 00:00:00
3 01.01.2018 02:00:00 3 4 # results = a + df$b # 01.01.2018 00:00:00
4 02.01.2018 00:00:00 1 6 # results = a + df$b # 02.01.2018 00:00:00
5 02.01.2018 01:00:00 2 7 # results = a + df$b # 02.01.2018 00:00:00
6 02.01.2018 02:00:00 3 8 # results = a + df$b # 02.01.2018 00:00:00
7 03.01.2018 00:00:00 1 11 # results = a + df$b # 03.01.2018 00:00:00
8 03.01.2018 01:00:00 2 12 # results = a + df$b # 03.01.2018 00:00:00
9 03.01.2018 02:00:00 3 13 # results = a + df$b # 03.01.2018 00:00:00
You can do this with a merge instead of a for loop which will be much faster. In the below answer I'm also using data.table, a fast version of data.frames that are very useful when working with large tables.
# install.packages("data.table") # Uncomment if necessary
library(data.table)
df1 <- data.frame(days,time,a) # You don't need to create the result column yet
df2 <- data.frame(days,time,b)
df1 <- data.table(df1)
df2 <- data.table(df2)
# Merge the two tables on the days column
df3 <- merge(df1, df2[time=="00:00:00"], by="days")
# This is your result
answer <- df3[, .(days, time=time.x, a, results=a+b)]
Output:
> answer
days time a results
1: 01.01.2018 00:00:00 1 2
2: 01.01.2018 01:00:00 2 3
3: 01.01.2018 02:00:00 3 4
4: 02.01.2018 00:00:00 1 6
5: 02.01.2018 01:00:00 2 7
6: 02.01.2018 02:00:00 3 8
7: 03.01.2018 00:00:00 1 11
8: 03.01.2018 01:00:00 2 12
9: 03.01.2018 02:00:00 3 13
One solution using dplyr could be as below. The approach of the solution is to:
1) First filter all time other than 00:00:00 from the df2
2) Then inner_join both df1 and df2 on days. This will enable to select value of b from df2 to every matching day in merged dataframe. Finally add a and b to find result.
df1 <- data.frame(days,time,a,results, stringsAsFactors = FALSE)
df2 <- data.frame(days,time,b, stringsAsFactors = FALSE)
library(dplyr)
df2 %>%
filter(time == "00:00:00") %>%
inner_join(df1, by="days") %>%
mutate(time = time.y, results = a+b) %>%
select( days, time, a, b, results)
#Result:
days time a b results
1 01.01.2018 00:00:00 1 1 2
2 01.01.2018 01:00:00 2 1 3
3 01.01.2018 02:00:00 3 1 4
4 02.01.2018 00:00:00 1 5 6
5 02.01.2018 01:00:00 2 5 7
6 02.01.2018 02:00:00 3 5 8
7 03.01.2018 00:00:00 1 10 11
8 03.01.2018 01:00:00 2 10 12
9 03.01.2018 02:00:00 3 10 13
transform(merge(df1,aggregate(b~days,df2,function(x)x[1])),results=a+b)
days time a results b
1 01.01.2018 00:00:00 1 2 1
2 01.01.2018 01:00:00 2 3 1
3 01.01.2018 02:00:00 3 4 1
4 02.01.2018 00:00:00 1 6 5
5 02.01.2018 01:00:00 2 7 5
6 02.01.2018 02:00:00 3 8 5
7 03.01.2018 00:00:00 1 11 10
8 03.01.2018 01:00:00 2 12 10
9 03.01.2018 02:00:00 3 13 10
One thing to note. This assumes time in df2 is arranged chronologically and that the first value for any given day is of time 00:00:00.

Finding Difference Between Dates in Long format

I have a dataset in long form for start and end date. for each id you will see multiple start and end dates.
I need to find the difference between the first end date and second start date. I am not sure how to use two rows to calculate the difference. Any help is appreciated.
df=data.frame(c(1,2,2,2,3,4,4),
as.Date(c( "2010-10-01","2009-09-01","2014-01-01","2014-02-01","2009-01-01","2013-03-01","2014-03-01")),
as.Date(c("2016-04-30","2013-12-31","2014-01-31","2016-04-30","2014-02-28","2013-05-01","2014-08-31")));
names(df)=c('id','start','end')
my output would look like this:
df$diff=c(NA,1,1,NA,NA,304, NA)
Here's an attempt in base R that I think does what you want:
df$diff <- NA
split(df$diff, df$id) <- by(df, df$id, FUN=function(SD) c(SD$start[-1], NA) - SD$end)
df
# id start end diff
#1 1 2010-10-01 2016-04-30 NA
#2 2 2009-09-01 2013-12-31 1
#3 2 2014-01-01 2014-01-31 1
#4 2 2014-02-01 2016-04-30 NA
#5 3 2009-01-01 2014-02-28 NA
#6 4 2013-03-01 2013-05-01 304
#7 4 2014-03-01 2014-08-31 NA
Alternatively, in data.table it would be:
setDT(df)[, diff := shift(start,n=1,type="lead") - end, by=id]
Here's an alternative using the popular dplyr package:
library(dplyr)
df %>%
group_by(id) %>%
mutate(diff = difftime(lead(start), end, units = "days"))
# id start end diff
# (dbl) (date) (date) (dfft)
# 1 1 2010-10-01 2016-04-30 NA days
# 2 2 2009-09-01 2013-12-31 1 days
# 3 2 2014-01-01 2014-01-31 1 days
# 4 2 2014-02-01 2016-04-30 NA days
# 5 3 2009-01-01 2014-02-28 NA days
# 6 4 2013-03-01 2013-05-01 304 days
# 7 4 2014-03-01 2014-08-31 NA days
You can wrap diff in as.numeric if you want.
Again with base R, you can do the following:
df$noofdays <- as.numeric(as.difftime(df$end-df$start, units=c("days"), format="%Y-%m-%d"))

Creating a 4-hr time interval using a reference column in R

I want to create a 4-hrs interval using a reference column from a data frame. I have a data frame like this one:
species<-"ABC"
ind<-rep(1:4,each=24)
hour<-rep(seq(0,23,by=1),4)
depth<-runif(length(ind),1,50)
df<-data.frame(cbind(species,ind,hour,depth))
df$depth<-as.numeric(df$depth)
What I would like is to create a new column (without changing the information or dimensions of the original data frame) that could look at my hour column (reference column) and based on that value will give me a 4-hrs time interval. For example, if the value from the hour column is between 0 and 3, then the value in new column will be 0; if the value is between 4 and 7 the value in the new column will be 4, and so on... In excel I used to use the floor/ceiling functions for this, but in R they are not exactly the same. Also, if someone has an easier suggestion for this using the original date/time data that could work too. In my original script I used the function as.POSIXct to get the date/time data, and from there my hour column.
I appreciate your help,
what about taking the column of hours, converting it to integers, and using integer division to get the floor? something like this
# convert hour to integer (hour is currently a col of factors)
i <- as.numeric(levels(df$hour))[df$hour]
# make new column
df$interval <- (i %/% 4) * 4
Expanding on my comment, since I think you're ultimately looking for actual dates at some point...
Some sample hourly data:
set.seed(1)
mydata <- data.frame(species = "ABC",
ind = rep(1:4, each=24),
depth = runif(96, 1, 50),
datetime = seq(ISOdate(2000, 1, 1, 0, 0, 0),
by = "1 hour", length.out = 96))
list(head(mydata), tail(mydata))
# [[1]]
# species ind depth datetime
# 1 ABC 1 14.00992 2000-01-01 00:00:00
# 2 ABC 1 19.23407 2000-01-01 01:00:00
# 3 ABC 1 29.06981 2000-01-01 02:00:00
# 4 ABC 1 45.50218 2000-01-01 03:00:00
# 5 ABC 1 10.88241 2000-01-01 04:00:00
# 6 ABC 1 45.02109 2000-01-01 05:00:00
#
# [[2]]
# species ind depth datetime
# 91 ABC 4 12.741841 2000-01-04 18:00:00
# 92 ABC 4 3.887784 2000-01-04 19:00:00
# 93 ABC 4 32.472125 2000-01-04 20:00:00
# 94 ABC 4 43.937191 2000-01-04 21:00:00
# 95 ABC 4 39.166819 2000-01-04 22:00:00
# 96 ABC 4 40.068132 2000-01-04 23:00:00
Transforming that data using cut and format:
mydata <- within(mydata, {
hourclass <- cut(datetime, "4 hours") # Find the intervals
hourfloor <- format(as.POSIXlt(hourclass), "%H") # Display just the "hour"
})
list(head(mydata), tail(mydata))
# [[1]]
# species ind depth datetime hourclass hourfloor
# 1 ABC 1 14.00992 2000-01-01 00:00:00 2000-01-01 00:00:00 00
# 2 ABC 1 19.23407 2000-01-01 01:00:00 2000-01-01 00:00:00 00
# 3 ABC 1 29.06981 2000-01-01 02:00:00 2000-01-01 00:00:00 00
# 4 ABC 1 45.50218 2000-01-01 03:00:00 2000-01-01 00:00:00 00
# 5 ABC 1 10.88241 2000-01-01 04:00:00 2000-01-01 04:00:00 04
# 6 ABC 1 45.02109 2000-01-01 05:00:00 2000-01-01 04:00:00 04
#
# [[2]]
# species ind depth datetime hourclass hourfloor
# 91 ABC 4 12.741841 2000-01-04 18:00:00 2000-01-04 16:00:00 16
# 92 ABC 4 3.887784 2000-01-04 19:00:00 2000-01-04 16:00:00 16
# 93 ABC 4 32.472125 2000-01-04 20:00:00 2000-01-04 20:00:00 20
# 94 ABC 4 43.937191 2000-01-04 21:00:00 2000-01-04 20:00:00 20
# 95 ABC 4 39.166819 2000-01-04 22:00:00 2000-01-04 20:00:00 20
# 96 ABC 4 40.068132 2000-01-04 23:00:00 2000-01-04 20:00:00 20
Note that your new "hourclass" variable is a factor and the new "hourfloor" variable is character, but you can easily change those, even during the within stage.
str(mydata)
# 'data.frame': 96 obs. of 6 variables:
# $ species : Factor w/ 1 level "ABC": 1 1 1 1 1 1 1 1 1 1 ...
# $ ind : int 1 1 1 1 1 1 1 1 1 1 ...
# $ depth : num 14 19.2 29.1 45.5 10.9 ...
# $ datetime : POSIXct, format: "2000-01-01 00:00:00" "2000-01-01 01:00:00" ...
# $ hourclass: Factor w/ 24 levels "2000-01-01 00:00:00",..: 1 1 1 1 2 2 2 2 3 3 ...
# $ hourfloor: chr "00" "00" "00" "00" ...
tip number 1, don't use cbind to create a data.frame with differing type of columns, everything gets coerced to the same type (in this case factor)
findInterval or cut would seem appropriate here.
df <- data.frame(species,ind,hour,depth)
# copy
df2 <- df
df2$fourhour <- c(0,4,8,12,16,20)[findInterval(df$hour, c(0,4,8,12,16,20))]
Though there is probably a simpler way, here is one attempt.
Make your data.frame not using cbind first though, so hour is not a factor but numeric
df <- data.frame(species,ind,hour,depth)
Then:
df$interval <- factor(findInterval(df$hour,seq(0,23,4)),labels=seq(0,23,4))
Result:
> head(df)
species ind hour depth interval
1 ABC 1 0 23.11215 0
2 ABC 1 1 10.63896 0
3 ABC 1 2 18.67615 0
4 ABC 1 3 28.01860 0
5 ABC 1 4 38.25594 4
6 ABC 1 5 30.51363 4
You could also make the labels a bit nicer like:
cutseq <- seq(0,23,4)
df$interval <- factor(
findInterval(df$hour,cutseq),
labels=paste(cutseq,cutseq+3,sep="-")
)
Result:
> head(df)
species ind hour depth interval
1 ABC 1 0 23.11215 0-3
2 ABC 1 1 10.63896 0-3
3 ABC 1 2 18.67615 0-3
4 ABC 1 3 28.01860 0-3
5 ABC 1 4 38.25594 4-7
6 ABC 1 5 30.51363 4-7

Resources