Adding different values to different parts of a dataframe without a loop - r

I want to speed up a calculation I need to do on a specific part of a dataframe. Here is some example data:
days <- c("01.01.2018","01.01.2018","01.01.2018",
          "02.01.2018","02.01.2018","02.01.2018",
          "03.01.2018","03.01.2018","03.01.2018")
time <- c("00:00:00","01:00:00","02:00:00",
          "00:00:00","01:00:00","02:00:00",
          "00:00:00","01:00:00","02:00:00")
a <- c(1,2,3,
       1,2,3,
       1,2,3)
b <- c(1,2,3,
       5,6,7,
       10,11,12)
results <- NA
df1 <- data.frame(days,time,a,results)
df2 <- data.frame(days,time,b)
I need to add the value of df2$b at 00:00:00 of each day to all of that day's values in df1$a and store the result in results.
Right now I'm doing it like this:
ndays <- unique(df1$days)
for(i in 1:length(ndays)) {
  factor <- df2[(df2$days == ndays[i] & df2$time == "00:00:00"),]$b
  df1[df1$days == ndays[i],]$results <- df1[df1$days == ndays[i],]$a + factor
}
The problem is, I've got huge dataframes with a lot of days, and cycling through them one by one is slow. Is there a faster way to do this?
Edit: this is the filled results column after the loop:
df1
days time a results
1 01.01.2018 00:00:00 1 2 # results = a + df2$b at 01.01.2018 00:00:00
2 01.01.2018 01:00:00 2 3 # results = a + df2$b at 01.01.2018 00:00:00
3 01.01.2018 02:00:00 3 4 # results = a + df2$b at 01.01.2018 00:00:00
4 02.01.2018 00:00:00 1 6 # results = a + df2$b at 02.01.2018 00:00:00
5 02.01.2018 01:00:00 2 7 # results = a + df2$b at 02.01.2018 00:00:00
6 02.01.2018 02:00:00 3 8 # results = a + df2$b at 02.01.2018 00:00:00
7 03.01.2018 00:00:00 1 11 # results = a + df2$b at 03.01.2018 00:00:00
8 03.01.2018 01:00:00 2 12 # results = a + df2$b at 03.01.2018 00:00:00
9 03.01.2018 02:00:00 3 13 # results = a + df2$b at 03.01.2018 00:00:00

You can do this with a merge instead of a for loop, which will be much faster. In the answer below I'm also using data.table, a fast enhancement of data.frame that is very useful when working with large tables.
# install.packages("data.table") # Uncomment if necessary
library(data.table)
df1 <- data.frame(days,time,a) # You don't need to create the result column yet
df2 <- data.frame(days,time,b)
df1 <- data.table(df1)
df2 <- data.table(df2)
# Merge the two tables on the days column
df3 <- merge(df1, df2[time=="00:00:00"], by="days")
# This is your result
answer <- df3[, .(days, time=time.x, a, results=a+b)]
Output:
> answer
days time a results
1: 01.01.2018 00:00:00 1 2
2: 01.01.2018 01:00:00 2 3
3: 01.01.2018 02:00:00 3 4
4: 02.01.2018 00:00:00 1 6
5: 02.01.2018 01:00:00 2 7
6: 02.01.2018 02:00:00 3 8
7: 03.01.2018 00:00:00 1 11
8: 03.01.2018 01:00:00 2 12
9: 03.01.2018 02:00:00 3 13
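If the tables are very large, one further option (not part of the answer above, just a hedged sketch) is a data.table update join, which adds the new column to df1 by reference instead of building a merged copy. It assumes the same df1 and df2 data.tables created above, and relies on each day having exactly one 00:00:00 row in df2:
# Sketch: add results to df1 in place with a data.table update join.
# i.b refers to the b column of the lookup table df2[time == "00:00:00"].
df1[df2[time == "00:00:00"], on = "days", results := a + i.b]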

One solution using dplyr could be as below. The approach is to:
1) First filter out all times other than 00:00:00 from df2.
2) Then inner_join df1 and df2 on days. This attaches the value of b from df2 to every matching day in the merged dataframe. Finally, add a and b to get the result.
df1 <- data.frame(days,time,a,results, stringsAsFactors = FALSE)
df2 <- data.frame(days,time,b, stringsAsFactors = FALSE)
library(dplyr)
df2 %>%
  filter(time == "00:00:00") %>%
  inner_join(df1, by = "days") %>%
  mutate(time = time.y, results = a + b) %>%
  select(days, time, a, b, results)
#Result:
days time a b results
1 01.01.2018 00:00:00 1 1 2
2 01.01.2018 01:00:00 2 1 3
3 01.01.2018 02:00:00 3 1 4
4 02.01.2018 00:00:00 1 5 6
5 02.01.2018 01:00:00 2 5 7
6 02.01.2018 02:00:00 3 5 8
7 03.01.2018 00:00:00 1 10 11
8 03.01.2018 01:00:00 2 10 12
9 03.01.2018 02:00:00 3 10 13

transform(merge(df1, aggregate(b ~ days, df2, function(x) x[1])), results = a + b)
days time a results b
1 01.01.2018 00:00:00 1 2 1
2 01.01.2018 01:00:00 2 3 1
3 01.01.2018 02:00:00 3 4 1
4 02.01.2018 00:00:00 1 6 5
5 02.01.2018 01:00:00 2 7 5
6 02.01.2018 02:00:00 3 8 5
7 03.01.2018 00:00:00 1 11 10
8 03.01.2018 01:00:00 2 12 10
9 03.01.2018 02:00:00 3 13 10
One thing to note: this assumes that time in df2 is arranged chronologically, so that the first value taken for any given day is the one at 00:00:00.
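If that ordering assumption might not hold, a hedged base R variant is to pick each day's 00:00:00 row explicitly instead of taking the first value per day; a sketch using the same df1 and df2:
# Sketch: select each day's 00:00:00 row explicitly, then merge and add.
transform(merge(df1, subset(df2, time == "00:00:00", select = c(days, b))),
          results = a + b)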

Related

calculate number of frost change days (number of days) from the weather hourly data in r

I have to calculate the number of frost change days (NFCD) on a weekly basis.
That means the number of days in which minimum temperature and maximum temperature cross 0°C.
Let's say I work with years 1957-1980 with hourly temp.
Example data (couple of rows look like):
Date Time (UTC) temperature
1957-07-01 00:00:00 5
1957-07-01 03:00:00 6.2
1957-07-01 05:00:00 9
1957-07-01 06:00:00 10
1957-07-01 07:00:00 10
1957-07-01 08:00:00 14
1957-07-01 09:00:00 13.2
1957-07-01 10:00:00 15
1957-07-01 11:00:00 15
1957-07-01 12:00:00 16.3
1957-07-01 13:00:00 15.8
Expected data:
year month week NFCD
1957 7 1 1
1957 7 2 5
dat <- data.frame(date=c(rep("A",5),rep("B",5)), time=rep(1:5, times=2), temp=c(1:5,-2,1:4))
dat
# date time temp
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 4
# 5 A 5 5
# 6 B 1 -2
# 7 B 2 1
# 8 B 3 2
# 9 B 4 3
# 10 B 5 4
aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
# date temp
# 1 A FALSE
# 2 B TRUE
(then rename temp to NFCD)
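For completeness, a minimal sketch of that last step, assuming the aggregate result is stored in a variable called res (a name not used in the original answer):
# Sketch: store the aggregated flags and rename the temp column to NFCD.
res <- aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
names(res)[names(res) == "temp"] <- "NFCD"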
Using the data from r2evans's answer you can also use tidyverse logic:
library(tidyverse)
dat %>%
  group_by(date) %>%
  summarize(NFCD = min(temp) < 0 & max(temp) > 0)
which gives:
# A tibble: 2 x 2
date NFCD
<chr> <lgl>
1 A FALSE
2 B TRUE
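The original question also asks for weekly counts rather than a per-day flag. A hedged sketch of that aggregation step, assuming the full hourly data sits in a data frame called temps with columns datetime (POSIXct) and temperature (placeholder names, not from the post), and leaving the exact definition of "week" (week of year vs. week of month) to the reader:
library(dplyr)
library(lubridate)

# Sketch: flag frost-change days, then count them per year/month/week.
temps %>%
  group_by(date = as.Date(datetime)) %>%
  summarize(frost_change = min(temperature) <= 0 && max(temperature) > 0) %>%
  group_by(year = year(date), month = month(date), week = week(date)) %>%
  summarize(NFCD = sum(frost_change), .groups = "drop")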

Min and max value based on another column and combine those in r

So I basically have a while loop function that creates 1's in the "algorithm_column" based on the highest percentages in the "percent" column, until a certain total percentage is reached (90% or something). The rest of the rows, which are not taken into account, get a value of 0 in the "algorithm_column" (Create while loop function that takes next largest value until condition is met).
I want to show, based on what the loop function found, the min and max times of the column "timeinterval" (the min is where the 1's start and the max is the last row with a 1; the 0's are out of scope), and then finally create a time interval from this.
So for the data below, I want to create another column, let's say "total_time", holding the span from the min time 09:00 (this is where the 1's start in the algorithm_column) until 11:15, which gives a time interval of 02:15 hours in the "total_time" column.
algorithm
# pc4 timeinterval stops percent idgroup algorithm_column
#1 5464 08:45:00 1 1.3889 1 0
#2 5464 09:00:00 5 6.9444 2 1
#3 5464 09:15:00 8 11.1111 3 1
#4 5464 09:30:00 7 9.7222 4 1
#5 5464 09:45:00 5 6.9444 5 1
#6 5464 10:00:00 10 13.8889 6 1
#7 5464 10:15:00 6 8.3333 7 1
#8 5464 10:30:00 4 5.5556 8 1
#9 5464 10:45:00 7 9.7222 9 1
#10 5464 11:00:00 6 8.3333 10 1
#11 5464 11:15:00 5 6.9444 11 1
#12 5464 11:30:00 8 11.1111 12 0
I have multiple pc4 groups, so it should look at every group and calculate a total_time for each group respectively.
I've got this function, but I'm not sure whether it is what I need.
test <- function(x) {
  ind <- x[["algorithm$algorithm_column"]] == 0
  Mx <- max(x[["timeinterval"]][ind], na.rm = TRUE)
  ind <- x[["algorithm$algorithm_column"]] == 1
  Mn <- min(x[["timeinterval"]][ind], na.rm = TRUE)
  list(Mn, Mx) ## or return(list(Mn, Mx))
}
test(algorithm)
Here is a dplyr solution.
library(dplyr)
algorithm %>%
  mutate(tmp = cumsum(c(0, diff(algorithm_column) != 0))) %>%
  filter(algorithm_column == 1) %>%
  group_by(pc4, tmp) %>%
  summarise(first = first(timeinterval),
            last = last(timeinterval)) %>%
  select(-tmp)
## A tibble: 1 x 3
## Groups: pc4 [1]
# pc4 first last
# <int> <fct> <fct>
#1 5464 09:00:00 11:15:00
Data.
algorithm <- read.table(text = "
pc4 timeinterval stops percent idgroup algorithm_column
1 5464 08:45:00 1 1.3889 1 0
2 5464 09:00:00 5 6.9444 2 1
3 5464 09:15:00 8 11.1111 3 1
4 5464 09:30:00 7 9.7222 4 1
5 5464 09:45:00 5 6.9444 5 1
6 5464 10:00:00 10 13.8889 6 1
7 5464 10:15:00 6 8.3333 7 1
8 5464 10:30:00 4 5.5556 8 1
9 5464 10:45:00 7 9.7222 9 1
10 5464 11:00:00 6 8.3333 10 1
11 5464 11:15:00 5 6.9444 11 1
12 5464 11:30:00 8 11.1111 12 0
", header = TRUE)

How to calculate number of hours from a fixed start point that varies among levels of a variable

The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1 <- data.frame(
  ID = c(1,2,1,2,1,2,1,2,1,2),
  Datetime = ymd_hms(c("2016-08-21 00:00:00","2016-08-24 08:00:00",
                       "2016-08-23 12:00:00","2016-08-29 03:00:00",
                       "2016-08-27 23:00:00","2016-09-02 02:00:00",
                       "2016-09-01 12:00:00","2016-09-09 04:00:00",
                       "2016-09-01 12:00:00","2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate, for each row, the number of hours (Hours_since_beginning) since the first time that the individual was detected.
I would expect something like this (it may contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_beginning
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# first get min datetime by ID
min_datetime_id <- df1 %>% group_by(ID) %>% summarise(min_datetime=min(Datetime))
# join with df1 and compute time difference
df1 <- df1 %>% left_join(min_datetime_id) %>% mutate(Hours_since_beginning= as.numeric(difftime(Datetime, min_datetime,units="hours")))
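Equivalently (a minor variant, not in the original answer), the join can be skipped by computing each group's minimum inside mutate:
library(dplyr)

# Sketch: subtract each ID's first detection time directly within the group.
df1 %>%
  group_by(ID) %>%
  mutate(Hours_since_beginning =
           as.numeric(difftime(Datetime, min(Datetime), units = "hours"))) %>%
  ungroup()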

R Max of Same Date, Previous Date, and Previous Hour Value

A couple of basic data manipulations. I searched with different wordings and couldn't find much.
I have data structured as below. In reality the hourly data is continuous, but I just included 4 lines as an example.
start <- as.POSIXlt(c('2017-1-1 1:00','2017-1-1 2:00','2017-1-2 1:00','2017-1-2 2:00'))
values <- as.numeric(c(2,5,4,3))
df <- data.frame(start,values)
df
start values
1 2017-01-01 01:00:00 2
2 2017-01-01 02:00:00 5
3 2017-01-02 01:00:00 4
4 2017-01-02 02:00:00 3
I would like to add a couple columns that:
1) Show the max of the same day.
2) Show the max of the previous day.
3) Show the value of one previous hour.
The goal is to have an output like:
MaxValueDay <- as.numeric(c(5,5,4,4))
MaxValueYesterday <- as.numeric(c(NA,NA,5,5))
PreviousHourValue <- as.numeric(c(NA,2,NA,4))
df2 <- data.frame(start,values,MaxValueDay,MaxValueYesterday,PreviousHourValue)
df2
start values MaxValueDay MaxValueYesterday PreviousHourValue
1 2017-01-01 01:00:00 2 5 NA NA
2 2017-01-01 02:00:00 5 5 NA 2
3 2017-01-02 01:00:00 4 4 5 NA
4 2017-01-02 02:00:00 3 4 5 4
Any help would be greatly appreciated. Thanks
A solution using dplyr, magrittr, and lubridate packages:
library(dplyr)
library(magrittr)
library(lubridate)
df %>%
  within(MaxValueDay <- sapply(as.Date(start), function(x) max(df$values[which(x == as.Date(start))]))) %>%
  within(MaxValueYesterday <- MaxValueDay[sapply(as.Date(start) - 1, match, as.Date(start))]) %>%
  within(PreviousHourValue <- values[sapply(start - hours(1), match, start)])
# start values MaxValueDay MaxValueYesterday PreviousHourValue
# 1 2017-01-01 01:00:00 2 5 NA NA
# 2 2017-01-01 02:00:00 5 5 NA 2
# 3 2017-01-02 01:00:00 4 4 5 NA
# 4 2017-01-02 02:00:00 3 4 5 4
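For larger hourly data, a hedged data.table sketch of the same three columns (not from the original answer). It assumes the start and values vectors defined in the question; note that data.table needs POSIXct rather than POSIXlt, and the previous-day/previous-hour lookups return NA when that timestamp is absent, matching the expected output:
library(data.table)

# Sketch: same three columns with data.table; start must be POSIXct.
dt <- data.table(start = as.POSIXct(start), values = values)
dt[, day := as.Date(start, tz = "")]   # day in the local time zone
dt[, MaxValueDay := max(values), by = day]
daymax <- unique(dt[, .(day, MaxValueDay)])
dt[, MaxValueYesterday := daymax$MaxValueDay[match(day - 1, daymax$day)]]
dt[, PreviousHourValue := values[match(start - 3600, start)]]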

Summations by conditions on another row dealing with time

I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. In this case I have the volume of 2 incident types at every given minute over two days. I want to create a column which, for each row, adds all the incidents of one type that occurred before it and of the other type that occurred after it. SUMIF from Excel comes to mind, but I'm not sure how to port that over to R.
EDIT: added set.seed and easier numbers.
I have the following data set:
library(data.table) # setDT() below comes from data.table
set.seed(42)
master_min =
  setDT(
    data.frame(master_min = seq(
      from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
      to = as.POSIXct("2016-1-2 23:00", tz = "America/New_York"),
      by = "min"
    ))
  )
incident1= round(runif(2821, min=0, max=10))
incident2= round(runif(2821, min=0, max=10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1 values that occurred before that row's timestamp and all the incident2 values that occurred after that row's timestamp? It would be great to get a data.table solution, or failing that dplyr, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
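Since the question mentions a preference for data.table, the same exclusive before/after sums can also be written as a single assignment by reference; a minimal sketch, assuming master_min is the data.table built in the question:
library(data.table)

# Sketch: exclusive "before" sum of incident1 plus exclusive "after" sum of incident2.
master_min[, new_column := (cumsum(incident1) - incident1) +
                           (rev(cumsum(rev(incident2))) - incident2)]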
Update: the following two lines can do the job:
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the example a bit to show a more comprehensive structure:
library(data.table)
master_min <-
  setDT(
    data.frame(master_min = seq(
      from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
      to = as.POSIXct("2016-1-1 0:09", tz = "America/New_York"),
      by = "min"
    ))
  )
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0
