R sum two columns with condition on third column - r

I have a data frame like:
user_name started_at session_time_min task_completed timediff
ABC 2018-03-02 18:00:00 1 3 NA
ABC 2018-03-02 19:00:00 1036 18 1
ABC 2018-03-03 12:00:00 6 10 17
ABC 2018-03-04 21:00:00 0 1 33
ABC 2018-03-05 16:00:00 143 61 19
ABC 2018-03-05 18:00:00 12 18 2
ABC 2018-03-05 19:00:00 60 94 1
ABC 2018-03-05 20:00:00 20 46 1
ABC 2018-03-09 15:00:00 0 1 91
I want to sum session_time_min and task_completed with the previous row if timediff = 1.
I want output like:
user_name started_at session_time_min task_completed
ABC 2018-03-02 18:00:00 1037 21
ABC 2018-03-03 12:00:00 6 10
ABC 2018-03-04 21:00:00 0 1
ABC 2018-03-05 16:00:00 143 61
ABC 2018-03-05 18:00:00 92 158
ABC 2018-03-09 15:00:00 0 1
Any help will be highly appreciated.

You could use a for loop, especially if you want to stay in base R. Note that you need to track the first row of each run, since several consecutive rows can have timediff equal to 1:
head_i <- 1                     # index of the first row of the current group
drop <- logical(nrow(data))
for (i in 2:nrow(data)) {
  if (!is.na(data$timediff[i]) && data$timediff[i] == 1) {
    # fold this row's totals into the first row of its group
    data$session_time_min[head_i] <- data$session_time_min[head_i] + data$session_time_min[i]
    data$task_completed[head_i] <- data$task_completed[head_i] + data$task_completed[i]
    drop[i] <- TRUE
  } else {
    head_i <- i                 # a timediff other than 1 starts a new group
  }
}
data <- data[!drop, c("user_name", "started_at", "session_time_min", "task_completed")]
This loop walks through the rows; whenever a row's timediff is 1, its session_time_min and task_completed are folded into the first row of the current run, and the folded-in rows are then dropped.

Make a group counter using cumsum and then use that to subset the identifier columns and rowsum the value columns:
grp <- cumsum(!dat$timediff %in% 1)
#[1] 1 1 2 3 4 5 5 5 6
cbind(
  dat[match(unique(grp), grp), c("user_name","started_at")],
  rowsum(dat[c("session_time_min","task_completed")], grp)
)
# user_name started_at session_time_min task_completed
#1 ABC 2018-03-02 18:00:00 1037 21
#3 ABC 2018-03-03 12:00:00 6 10
#4 ABC 2018-03-04 21:00:00 0 1
#5 ABC 2018-03-05 16:00:00 143 61
#6 ABC 2018-03-05 18:00:00 92 158
#9 ABC 2018-03-09 15:00:00 0 1
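For comparison, the same run-length grouping can be written with dplyr (a sketch, rebuilding the example data frame as `dat`):

```r
library(dplyr)

# the example data, with timediff as in the question
dat <- data.frame(
  user_name = "ABC",
  started_at = c("2018-03-02 18:00:00", "2018-03-02 19:00:00", "2018-03-03 12:00:00",
                 "2018-03-04 21:00:00", "2018-03-05 16:00:00", "2018-03-05 18:00:00",
                 "2018-03-05 19:00:00", "2018-03-05 20:00:00", "2018-03-09 15:00:00"),
  session_time_min = c(1, 1036, 6, 0, 143, 12, 60, 20, 0),
  task_completed = c(3, 18, 10, 1, 61, 18, 94, 46, 1),
  timediff = c(NA, 1, 17, 33, 19, 2, 1, 1, 91)
)

out <- dat %>%
  mutate(grp = cumsum(!timediff %in% 1)) %>%   # new group whenever timediff is not 1
  group_by(grp) %>%
  summarise(user_name = first(user_name),
            started_at = first(started_at),
            session_time_min = sum(session_time_min),
            task_completed = sum(task_completed),
            .groups = "drop") %>%
  select(-grp)
```

The `cumsum(!timediff %in% 1)` trick is the same group counter as above; `%in%` treats the leading NA as "not 1", so it starts a new group.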

Related

How to calculate number of hours from a fixed start point that varies among levels of a variable

The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1 <- data.frame(ID = c(1,2,1,2,1,2,1,2,1,2),
                  Datetime = ymd_hms(c("2016-08-21 00:00:00","2016-08-24 08:00:00",
                                       "2016-08-23 12:00:00","2016-08-29 03:00:00",
                                       "2016-08-27 23:00:00","2016-09-02 02:00:00",
                                       "2016-09-01 12:00:00","2016-09-09 04:00:00",
                                       "2016-09-01 12:00:00","2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate for each row, the number of hours (Hours_since_begining) since the first time that the individual was detected.
I would expect something like that (It can contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_begining
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# first get the min datetime by ID
min_datetime_id <- df1 %>%
  group_by(ID) %>%
  summarise(min_datetime = min(Datetime))
# join with df1 and compute the time difference
df1 <- df1 %>%
  left_join(min_datetime_id, by = "ID") %>%
  mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime, units = "hours")))
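The join can also be skipped by computing the group minimum inside a grouped mutate (a sketch, shown here on the first three rows of df1):

```r
library(dplyr)
library(lubridate)

df1 <- data.frame(ID = c(1, 2, 1),
                  Datetime = ymd_hms(c("2016-08-21 00:00:00",
                                       "2016-08-24 08:00:00",
                                       "2016-08-23 12:00:00")))

# min(Datetime) is evaluated per ID group, so no separate lookup table is needed
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(Hours_since_beginning =
           as.numeric(difftime(Datetime, min(Datetime), units = "hours"))) %>%
  ungroup()
```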

R vlookup combined with if...and

As a beginner in R, I'm facing trouble with an issue that is complex to me.
I want to add a new column containing a "1" when data$Date falls between (or exactly on) lookup$Begin and lookup$End. Identification_no is the key for both data sets.
If data$Date is not between lookup$Begin and lookup$End, there should be a "0" in the new column.
The two data frames have different numbers of observations.
Here's my basic data frame:
> data
# A tibble: 6 x 2
Date Identification_no
* <date> <dbl>
1 2018-08-25 13
2 2018-02-03 54
3 2018-09-01 31
4 2018-11-10 54
5 2018-08-04 60
6 2018-07-07 58
Here's my lookup data frame:
> lookup
# A tibble: 6 x 3
Begin End Identification_no
* <date> <date> <dbl>
1 2017-01-26 2017-01-26 53
2 2017-01-26 2017-01-26 53
3 2017-01-26 2017-01-26 53
4 2017-01-26 2017-01-26 53
5 2017-01-26 2017-01-26 53
6 2017-01-26 2017-01-26 53
Thanks for your inputs in advance.
EDIT: new sample data
> data
# A tibble: 6 x 2
Date Identification_no
<date> <dbl>
1 2018-08-25 13
2 2018-02-03 54
3 2018-09-01 31
4 2018-11-10 54
5 2018-08-04 60
6 2018-07-07 58
> lookup
# A tibble: 6 x 3
Begin End Identification_no
<date> <date> <dbl>
1 2018-08-20 2018-08-27 13
2 2018-09-01 2018-09-08 53
3 2018-01-09 2018-01-23 20
4 2018-10-16 2018-10-30 4
5 2017-12-22 2017-12-29 54
6 2017-10-31 2017-11-07 66
Result using the method described below:
> final
Begin End Identification_no match_col
1: 2018-08-25 2018-08-25 13 1
2: 2018-02-03 2018-02-03 54 0
3: 2018-09-01 2018-09-01 31 0
4: 2018-11-10 2018-11-10 54 0
5: 2018-08-04 2018-08-04 60 0
6: 2018-07-07 2018-07-07 58 0
Works perfectly fine - thanks for your solution.
Best regards,
Paul
Could do:
library(data.table)
setDT(data)[, Date := as.Date(Date)]
setDT(lookup)[, `:=` (Begin = as.Date(Begin), End = as.Date(End), match_col = 1)]
final <- unique(lookup, by = c("Begin", "End", "Identification_no"))[
  data, on = .(Begin <= Date, End >= Date, Identification_no)][
  is.na(match_col), match_col := 0]
On your example dataset, this would give:
final
Begin End Identification_no match_col
1: 2018-08-25 2018-08-25 13 0
2: 2018-02-03 2018-02-03 54 0
3: 2018-09-01 2018-09-01 31 0
4: 2018-11-10 2018-11-10 54 0
5: 2018-08-04 2018-08-04 60 0
6: 2018-07-07 2018-07-07 58 0
... but only because there's really no match.
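For comparison, the same 0/1 flag can be computed in base R with a row-wise check (a sketch; it assumes Date, Begin and End are already Date class, and it will be slower than the non-equi join on large data):

```r
# small versions of the question's data and lookup frames
data <- data.frame(Date = as.Date(c("2018-08-25", "2018-02-03", "2018-09-01")),
                   Identification_no = c(13, 54, 31))
lookup <- data.frame(Begin = as.Date(c("2018-08-20", "2017-12-22")),
                     End = as.Date(c("2018-08-27", "2017-12-29")),
                     Identification_no = c(13, 54))

# for each row of data, check whether any lookup window with the same key contains Date
hit <- vapply(seq_len(nrow(data)), function(i) {
  any(lookup$Identification_no == data$Identification_no[i] &
      lookup$Begin <= data$Date[i] &
      lookup$End >= data$Date[i])
}, logical(1))
data$match_col <- as.integer(hit)
```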

Fill in missing rows for dates by group [duplicate]

This question already has answers here:
Efficient way to Fill Time-Series per group
(2 answers)
Filling missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 4 years ago.
I have a data table like this, just much bigger:
customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
time <- as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                  "2017-02-01", "2017-04-01", "2017-05-01",
                  "2017-06-01", "2017-01-01", "2017-04-01",
                  "2017-05-01"), format = "%Y-%m-%d")
tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)
library(data.table)
my_data <- data.table(customer_id, account_id, time, tenor, variable_x)
customer_id account_id time tenor variable_x
1 11 2017-01-01 1 87
1 11 2017-05-01 2 90
1 11 2017-06-01 3 100
2 55 2017-02-01 1 120
2 55 2017-04-01 2 130
2 55 2017-05-01 3 150
2 55 2017-06-01 4 12
3 38 2017-01-01 1 13
3 38 2017-04-01 2 15
3 38 2017-05-01 3 14
in which I should observe, for each customer_id, account_id pair, monthly observations from 2017-01-01 to 2017-06-01, but for some pairs some dates in this 6-month sequence are missing. I would like to fill in those missing dates so that each customer_id, account_id pair has observations for all 6 months, with NA for tenor and variable_x. That is, it should look like this:
customer_id account_id time tenor variable_x
1 11 2017-01-01 1 87
1 11 2017-02-01 NA NA
1 11 2017-03-01 NA NA
1 11 2017-04-01 NA NA
1 11 2017-05-01 2 90
1 11 2017-06-01 3 100
2 55 2017-01-01 NA NA
2 55 2017-02-01 1 120
2 55 2017-03-01 NA NA
2 55 2017-04-01 2 130
2 55 2017-05-01 3 150
2 55 2017-06-01 4 12
3 38 2017-01-01 1 13
3 38 2017-02-01 NA NA
3 38 2017-03-01 NA NA
3 38 2017-04-01 2 15
3 38 2017-05-01 3 14
3 38 2017-06-01 NA NA
I tried creating a sequence of dates from 2017-01-01 to 2017-06-01 by using
ts = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), by = "month")
and then merge it to the original data with
ts = data.table(ts)
colnames(ts) = "time"
merged <- merge(ts, my_data, by="time", all.x=TRUE)
but it is not working. Do you know how to add such rows with dates for each customer_id, account_id pair?
We can do a join. Create the sequence of 'time' from min to max by '1 month', expand the dataset grouped by 'customer_id' and 'account_id', and then join on those columns together with 'time':
ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
my_data[my_data[, .(time = ts1), .(customer_id, account_id)],
        on = .(customer_id, account_id, time)]
# customer_id account_id time tenor variable_x
# 1: 1 11 2017-01-01 1 87
# 2: 1 11 2017-02-01 NA NA
# 3: 1 11 2017-03-01 NA NA
# 4: 1 11 2017-04-01 NA NA
# 5: 1 11 2017-05-01 2 90
# 6: 1 11 2017-06-01 3 100
# 7: 2 55 2017-01-01 NA NA
# 8: 2 55 2017-02-01 1 120
# 9: 2 55 2017-03-01 NA NA
#10: 2 55 2017-04-01 2 130
#11: 2 55 2017-05-01 3 150
#12: 2 55 2017-06-01 4 12
#13: 3 38 2017-01-01 1 13
#14: 3 38 2017-02-01 NA NA
#15: 3 38 2017-03-01 NA NA
#16: 3 38 2017-04-01 2 15
#17: 3 38 2017-05-01 3 14
#18: 3 38 2017-06-01 NA NA
Or using tidyverse
library(tidyverse)
distinct(my_data, customer_id, account_id) %>%
mutate(time = list(ts1)) %>%
unnest %>%
left_join(my_data)
Or with complete from tidyr
my_data %>%
complete(nesting(customer_id, account_id), time = ts1)
A different data.table approach:
my_data2 <- my_data[, .(time = seq(as.Date("2017/01/01"), as.Date("2017/06/01"),
by = "month")), by = list(customer_id, account_id)]
merge(my_data2, my_data, all.x = TRUE)
customer_id account_id time tenor variable_x
1: 1 11 2017-01-01 1 87
2: 1 11 2017-02-01 NA NA
3: 1 11 2017-03-01 NA NA
4: 1 11 2017-04-01 NA NA
5: 1 11 2017-05-01 2 90
6: 1 11 2017-06-01 3 100
7: 2 55 2017-01-01 NA NA
8: 2 55 2017-02-01 1 120
9: 2 55 2017-03-01 NA NA
10: 2 55 2017-04-01 2 130
11: 2 55 2017-05-01 3 150
12: 2 55 2017-06-01 4 12
13: 3 38 2017-01-01 1 13
14: 3 38 2017-02-01 NA NA
15: 3 38 2017-03-01 NA NA
16: 3 38 2017-04-01 2 15
17: 3 38 2017-05-01 3 14
18: 3 38 2017-06-01 NA NA

Summations by conditions on another row dealing with time

I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. In this case I have the volume of 2 incident types at every given minute over two days. I want to create a column which, for each row, adds all the incidents that occurred before and after it, by type. SUMIF from Excel comes to mind, but I'm not sure how to port that over to R:
EDIT: ADDED set.seed and easier numbers
I have the following data set:
set.seed(42)
master_min <- setDT(
  data.frame(master_min = seq(
    from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
    to = as.POSIXct("2016-1-2 23:00", tz = "America/New_York"),
    by = "min"
  ))
)
incident1 <- round(runif(2821, min = 0, max = 10))
incident2 <- round(runif(2821, min = 0, max = 10))
master_min <- head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1 values that occurred before that row's timestamp and all the incident2 values that occurred after that row's timestamp? A data.table solution would be great, or dplyr otherwise, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
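Since the question asks for a data.table solution, the two exclusive cumulative sums can also be combined in a single `:=` step (a sketch on the five example rows, incident columns only):

```r
library(data.table)

# same five incident columns as in the BEFORE table
master_min <- data.table(incident1 = c(9, 9, 3, 8, 6),
                         incident2 = c(6, 5, 5, 6, 9))

master_min[, new_column :=
  cumsum(incident1) - incident1 +              # incident1 strictly before this row
  rev(cumsum(rev(incident2))) - incident2]     # incident2 strictly after this row
```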
Update:
The following two lines can also do the job (both sums exclude the current row):
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the example a bit to show a more comprehensive structure:
library(data.table)
master_min <- setDT(
  data.frame(master_min = seq(
    from = as.POSIXct("2016-1-1 0:00", tz = "America/New_York"),
    to = as.POSIXct("2016-1-1 0:09", tz = "America/New_York"),
    by = "min"
  ))
)
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply the transformations (subtracting the current row from the forward cumsum so both sums exclude the row itself, matching the expected output):
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 0 37
2: 2016-01-01 00:01:00 7 2 1 35
3: 2016-01-01 00:02:00 5 7 8 28
4: 2016-01-01 00:03:00 1 1 13 27
5: 2016-01-01 00:04:00 9 4 14 23
6: 2016-01-01 00:05:00 9 8 23 15
7: 2016-01-01 00:06:00 1 9 32 6
8: 2016-01-01 00:07:00 8 2 33 4
9: 2016-01-01 00:08:00 4 4 41 0
10: 2016-01-01 00:09:00 5 0 45 0

Creating a 4-hr time interval using a reference column in R

I want to create a 4-hrs interval using a reference column from a data frame. I have a data frame like this one:
species<-"ABC"
ind<-rep(1:4,each=24)
hour<-rep(seq(0,23,by=1),4)
depth<-runif(length(ind),1,50)
df<-data.frame(cbind(species,ind,hour,depth))
df$depth<-as.numeric(df$depth)
What I would like is to create a new column (without changing the information or dimensions of the original data frame) that looks at my hour column (the reference column) and, based on that value, gives me a 4-hr time interval. For example, if the value in the hour column is between 0 and 3, the value in the new column will be 0; if the value is between 4 and 7, the value in the new column will be 4, and so on... In Excel I used to use the floor/ceiling functions for this, but in R they are not exactly the same. Also, if someone has an easier suggestion using the original date/time data, that could work too. In my original script I used as.POSIXct to get the date/time data, and from there my hour column.
I appreciate your help,
What about taking the column of hours, converting it to integers, and using integer division to get the floor? Something like this:
# convert hour to integer (hour is currently a col of factors)
i <- as.numeric(levels(df$hour))[df$hour]
# make new column
df$interval <- (i %/% 4) * 4
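If hour had been stored as numeric in the first place (e.g. by building the data frame without cbind), the conversion step disappears and the binning is a one-liner:

```r
hour <- 0:23                        # numeric hours, no factor conversion needed
interval <- (hour %/% 4) * 4        # 0,0,0,0,4,4,4,4, ..., 20,20,20,20
```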
Expanding on my comment, since I think you're ultimately looking for actual dates at some point...
Some sample hourly data:
set.seed(1)
mydata <- data.frame(species = "ABC",
                     ind = rep(1:4, each = 24),
                     depth = runif(96, 1, 50),
                     datetime = seq(ISOdate(2000, 1, 1, 0, 0, 0),
                                    by = "1 hour", length.out = 96))
list(head(mydata), tail(mydata))
# [[1]]
# species ind depth datetime
# 1 ABC 1 14.00992 2000-01-01 00:00:00
# 2 ABC 1 19.23407 2000-01-01 01:00:00
# 3 ABC 1 29.06981 2000-01-01 02:00:00
# 4 ABC 1 45.50218 2000-01-01 03:00:00
# 5 ABC 1 10.88241 2000-01-01 04:00:00
# 6 ABC 1 45.02109 2000-01-01 05:00:00
#
# [[2]]
# species ind depth datetime
# 91 ABC 4 12.741841 2000-01-04 18:00:00
# 92 ABC 4 3.887784 2000-01-04 19:00:00
# 93 ABC 4 32.472125 2000-01-04 20:00:00
# 94 ABC 4 43.937191 2000-01-04 21:00:00
# 95 ABC 4 39.166819 2000-01-04 22:00:00
# 96 ABC 4 40.068132 2000-01-04 23:00:00
Transforming that data using cut and format:
mydata <- within(mydata, {
  hourclass <- cut(datetime, "4 hours")               # find the intervals
  hourfloor <- format(as.POSIXlt(hourclass), "%H")    # display just the "hour"
})
list(head(mydata), tail(mydata))
# [[1]]
# species ind depth datetime hourclass hourfloor
# 1 ABC 1 14.00992 2000-01-01 00:00:00 2000-01-01 00:00:00 00
# 2 ABC 1 19.23407 2000-01-01 01:00:00 2000-01-01 00:00:00 00
# 3 ABC 1 29.06981 2000-01-01 02:00:00 2000-01-01 00:00:00 00
# 4 ABC 1 45.50218 2000-01-01 03:00:00 2000-01-01 00:00:00 00
# 5 ABC 1 10.88241 2000-01-01 04:00:00 2000-01-01 04:00:00 04
# 6 ABC 1 45.02109 2000-01-01 05:00:00 2000-01-01 04:00:00 04
#
# [[2]]
# species ind depth datetime hourclass hourfloor
# 91 ABC 4 12.741841 2000-01-04 18:00:00 2000-01-04 16:00:00 16
# 92 ABC 4 3.887784 2000-01-04 19:00:00 2000-01-04 16:00:00 16
# 93 ABC 4 32.472125 2000-01-04 20:00:00 2000-01-04 20:00:00 20
# 94 ABC 4 43.937191 2000-01-04 21:00:00 2000-01-04 20:00:00 20
# 95 ABC 4 39.166819 2000-01-04 22:00:00 2000-01-04 20:00:00 20
# 96 ABC 4 40.068132 2000-01-04 23:00:00 2000-01-04 20:00:00 20
Note that your new "hourclass" variable is a factor and the new "hourfloor" variable is character, but you can easily change those, even during the within stage.
str(mydata)
# 'data.frame': 96 obs. of 6 variables:
# $ species : Factor w/ 1 level "ABC": 1 1 1 1 1 1 1 1 1 1 ...
# $ ind : int 1 1 1 1 1 1 1 1 1 1 ...
# $ depth : num 14 19.2 29.1 45.5 10.9 ...
# $ datetime : POSIXct, format: "2000-01-01 00:00:00" "2000-01-01 01:00:00" ...
# $ hourclass: Factor w/ 24 levels "2000-01-01 00:00:00",..: 1 1 1 1 2 2 2 2 3 3 ...
# $ hourfloor: chr "00" "00" "00" "00" ...
Tip number 1: don't use cbind to create a data.frame with columns of differing types; everything gets coerced to the same type (in this case factor).
findInterval or cut would seem appropriate here.
df <- data.frame(species,ind,hour,depth)
# copy
df2 <- df
df2$fourhour <- c(0,4,8,12,16,20)[findInterval(df$hour, c(0,4,8,12,16,20))]
Though there is probably a simpler way, here is one attempt.
Build your data.frame without cbind first, so that hour is numeric rather than a factor:
df <- data.frame(species,ind,hour,depth)
Then:
df$interval <- factor(findInterval(df$hour,seq(0,23,4)),labels=seq(0,23,4))
Result:
> head(df)
species ind hour depth interval
1 ABC 1 0 23.11215 0
2 ABC 1 1 10.63896 0
3 ABC 1 2 18.67615 0
4 ABC 1 3 28.01860 0
5 ABC 1 4 38.25594 4
6 ABC 1 5 30.51363 4
You could also make the labels a bit nicer like:
cutseq <- seq(0,23,4)
df$interval <- factor(
  findInterval(df$hour, cutseq),
  labels = paste(cutseq, cutseq + 3, sep = "-")
)
Result:
> head(df)
species ind hour depth interval
1 ABC 1 0 23.11215 0-3
2 ABC 1 1 10.63896 0-3
3 ABC 1 2 18.67615 0-3
4 ABC 1 3 28.01860 0-3
5 ABC 1 4 38.25594 4-7
6 ABC 1 5 30.51363 4-7
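If the original POSIXct timestamps are still available, lubridate's floor_date can bin them directly, which covers the "easier suggestion using the original date/time data" part of the question (a sketch with a hypothetical datetime vector dt):

```r
library(lubridate)

dt <- ymd_hms("2020-01-01 00:00:00") + hours(0:23)    # hypothetical hourly timestamps
# floor each timestamp to the start of its 4-hour bin, then pull out the hour
interval_start <- hour(floor_date(dt, unit = "4 hours"))
```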
