As a beginner in R, I'm struggling with an issue that is complex for me.
I want to add a new column containing a "1" when data$Date falls between lookup$Begin and lookup$End (inclusive). Identification_no is the key for both data sets.
If data$Date is not between lookup$Begin and lookup$End, the new column should contain a "0".
The two data frames have different numbers of observations.
Here's my basic data frame:
> data
# A tibble: 6 x 2
Date Identification_no
* <date> <dbl>
1 2018-08-25 13
2 2018-02-03 54
3 2018-09-01 31
4 2018-11-10 54
5 2018-08-04 60
6 2018-07-07 58
Here's my lookup data frame:
> lookup
# A tibble: 6 x 3
Begin End Identification_no
* <date> <date> <dbl>
1 2017-01-26 2017-01-26 53
2 2017-01-26 2017-01-26 53
3 2017-01-26 2017-01-26 53
4 2017-01-26 2017-01-26 53
5 2017-01-26 2017-01-26 53
6 2017-01-26 2017-01-26 53
Thanks for your inputs in advance.
EDIT: new sample data
> data
# A tibble: 6 x 2
Date Identification_no
<date> <dbl>
1 2018-08-25 13
2 2018-02-03 54
3 2018-09-01 31
4 2018-11-10 54
5 2018-08-04 60
6 2018-07-07 58
> lookup
# A tibble: 6 x 3
Begin End Identification_no
<date> <date> <dbl>
1 2018-08-20 2018-08-27 13
2 2018-09-01 2018-09-08 53
3 2018-01-09 2018-01-23 20
4 2018-10-16 2018-10-30 4
5 2017-12-22 2017-12-29 54
6 2017-10-31 2017-11-07 66
Result using the method described below:
> final
Begin End Identification_no match_col
1: 2018-08-25 2018-08-25 13 1
2: 2018-02-03 2018-02-03 54 0
3: 2018-09-01 2018-09-01 31 0
4: 2018-11-10 2018-11-10 54 0
5: 2018-08-04 2018-08-04 60 0
6: 2018-07-07 2018-07-07 58 0
Works perfectly fine - thanks for your solution.
Best regards,
Paul
Could do:
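library(data.table)
# Convert both tables to data.table by reference and make sure the date columns are Dates
setDT(data)[, Date := as.Date(Date)]
setDT(lookup)[, `:=` (Begin = as.Date(Begin), End = as.Date(End), match_col = 1)]
# Non-equi join that keeps every row of data; rows without a matching interval get match_col = 0
final <- unique(lookup, by = c("Begin", "End", "Identification_no"))[
  data, on = .(Begin <= Date, End >= Date, Identification_no)][
  is.na(match_col), match_col := 0]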
On your example dataset, this would give:
final
Begin End Identification_no match_col
1: 2018-08-25 2018-08-25 13 0
2: 2018-02-03 2018-02-03 54 0
3: 2018-09-01 2018-09-01 31 0
4: 2018-11-10 2018-11-10 54 0
5: 2018-08-04 2018-08-04 60 0
6: 2018-07-07 2018-07-07 58 0
... but only because none of the rows in that sample actually matches.
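For comparison, the same flag can be sketched in base R without a join, which also keeps the original Date column instead of the Begin/End columns that the join output carries. This is only a minimal sketch, not the method used above; the two data frames are retyped by hand from the edited sample data in the question.

```r
# Sample data retyped from the question (second, edited sample)
data <- data.frame(
  Date = as.Date(c("2018-08-25", "2018-02-03", "2018-09-01",
                   "2018-11-10", "2018-08-04", "2018-07-07")),
  Identification_no = c(13, 54, 31, 54, 60, 58)
)
lookup <- data.frame(
  Begin = as.Date(c("2018-08-20", "2018-09-01", "2018-01-09",
                    "2018-10-16", "2017-12-22", "2017-10-31")),
  End = as.Date(c("2018-08-27", "2018-09-08", "2018-01-23",
                  "2018-10-30", "2017-12-29", "2017-11-07")),
  Identification_no = c(13, 53, 20, 4, 54, 66)
)

# For each row of data: 1 if any lookup row with the same key has
# Begin <= Date <= End, otherwise 0
data$match_col <- vapply(seq_len(nrow(data)), function(i) {
  hit <- lookup$Identification_no == data$Identification_no[i] &
    lookup$Begin <= data$Date[i] &
    lookup$End >= data$Date[i]
  as.integer(any(hit))
}, integer(1))

data
```

Because every row of data is compared with every row of lookup, this scales poorly; for large tables the data.table join is likely the better choice.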
I'm trying to identify periods/episodes of exposure to a drug from prescriptions. If two prescriptions are separated by more than 30 days, the later one starts a new period/episode of exposure. Prescriptions can overlap for some time or be consecutive. Gaps are not accumulated: if the gaps between several consecutive prescriptions add up to more than 30 days but no single gap exceeds 30 days, it is not considered a new episode.
I have data like this:
library(data.table)
id = c(rep(1,3), rep(2,6), rep(3,5))
start = as.Date(c("2017-05-10", "2017-07-28", "2017-11-23", "2017-01-27", "2017-10-02", "2018-05-14", "2018-05-25", "2018-11-26", "2018-12-28", "2016-01-01", "2016-03-02", "2016-03-20", "2016-04-25", "2016-06-29"))
end = as.Date(c("2017-07-27", "2018-01-28", "2018-03-03", "2017-04-27", "2018-05-13", "2018-11-14", "2018-11-25", "2018-12-27", "2019-06-28", "2016-02-15", "2016-03-05", "2016-03-24", "2016-04-29", "2016-11-01"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2017-05-10 2017-07-27
2: 1 2017-07-28 2018-01-28
3: 1 2017-11-23 2018-03-03
4: 2 2017-01-27 2017-04-27
5: 2 2017-10-02 2018-05-13
6: 2 2018-05-14 2018-11-14
7: 2 2018-05-25 2018-11-25
8: 2 2018-11-26 2018-12-27
9: 2 2018-12-28 2019-06-28
10: 3 2016-01-01 2016-02-15
11: 3 2016-03-02 2016-03-05
12: 3 2016-03-20 2016-03-24
13: 3 2016-04-25 2016-04-29
14: 3 2016-06-29 2016-11-01
I calculated the difference between each start and the previous row's end (last_diffdays):
DT[, last_diffdays := start-shift(end, n=1L), by = .(id)][is.na(last_diffdays), last_diffdays := 0][]
id start end last_diffdays
1: 1 2017-05-10 2017-07-27 0 days
2: 1 2017-07-28 2018-01-28 1 days
3: 1 2017-11-23 2018-03-03 -66 days
4: 2 2017-01-27 2017-04-27 0 days
5: 2 2017-10-02 2018-05-13 158 days
6: 2 2018-05-14 2018-11-14 1 days
7: 2 2018-05-25 2018-11-25 -173 days
8: 2 2018-11-26 2018-12-27 1 days
9: 2 2018-12-28 2019-06-28 1 days
10: 3 2016-01-01 2016-02-15 0 days
11: 3 2016-03-02 2016-03-05 16 days
12: 3 2016-03-20 2016-03-24 15 days
13: 3 2016-04-25 2016-04-29 32 days
14: 3 2016-06-29 2016-11-01 61 days
This shows when an overlap happens (negative values) or not (positive values). I think an ifelse/fcase statement here would be a bad idea and I'm not comfortable doing it.
I think a good output for this job would be something like:
id start end last_diffdays noexp_days period
1: 1 2017-05-10 2017-07-27 0 days 0 1
2: 1 2017-07-28 2018-01-28 1 days 1 1
3: 1 2017-11-23 2018-03-03 -66 days 0 1
4: 2 2017-01-27 2017-04-27 0 days 0 1
5: 2 2017-10-02 2018-05-13 158 days 158 2
6: 2 2018-05-14 2018-11-14 1 days 1 2
7: 2 2018-05-25 2018-11-25 -173 days 0 2
8: 2 2018-11-26 2018-12-27 1 days 1 2
9: 2 2018-12-28 2019-06-28 1 days 1 2
10: 3 2016-01-01 2016-02-15 0 days 0 1
11: 3 2016-03-02 2016-03-05 16 days 16 1
12: 3 2016-03-20 2016-03-24 15 days 15 1
13: 3 2016-04-25 2016-04-29 32 days 32 2
14: 3 2016-06-29 2016-11-01 61 days 61 3
I manually calculated the days without exposure (noexp_days) since the previous prescription.
I don't know if I'm on the right path, but I think I need to calculate the noexp_days variable and then compute cumsum(noexp_days > 30) + 1.
If there is a better solution that I'm missing, or another approach I haven't considered, I would appreciate hearing about it.
Thanks in advance for any help! :)
Try:
library(data.table)
DT[, noexp_days := pmax(as.integer(last_diffdays), 0)]
DT[, period := cumsum(noexp_days > 30) + 1, id]
DT
# id start end last_diffdays noexp_days period
# 1: 1 2017-05-10 2017-07-27 0 days 0 1
# 2: 1 2017-07-28 2018-01-28 1 days 1 1
# 3: 1 2017-11-23 2018-03-03 -66 days 0 1
# 4: 2 2017-01-27 2017-04-27 0 days 0 1
# 5: 2 2017-10-02 2018-05-13 158 days 158 2
# 6: 2 2018-05-14 2018-11-14 1 days 1 2
# 7: 2 2018-05-25 2018-11-25 -173 days 0 2
# 8: 2 2018-11-26 2018-12-27 1 days 1 2
# 9: 2 2018-12-28 2019-06-28 1 days 1 2
#10: 3 2016-01-01 2016-02-15 0 days 0 1
#11: 3 2016-03-02 2016-03-05 16 days 16 1
#12: 3 2016-03-20 2016-03-24 15 days 15 1
#13: 3 2016-04-25 2016-04-29 32 days 32 2
#14: 3 2016-06-29 2016-11-01 61 days 61 3
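One caveat with comparing each start only to the immediately preceding end: if a short prescription is nested inside a longer one, the preceding end can be earlier than the date the patient is actually covered until. That situation does not occur in the sample data, so the following is only a sketch under that assumption. It uses made-up dates and a hypothetical helper column, covered_until, that tracks the latest end seen so far within each id.

```r
library(data.table)

# Made-up data: the second prescription is nested inside the first
DT2 <- data.table(
  id    = 1,
  start = as.Date(c("2017-01-01", "2017-02-01", "2017-04-01", "2017-09-01")),
  end   = as.Date(c("2017-06-30", "2017-02-10", "2017-04-10", "2017-09-30"))
)
setorder(DT2, id, start)

# Latest end date among all earlier prescriptions of the same id
# (cummax() is not defined for Date objects, hence the numeric conversion)
DT2[, covered_until := shift(cummax(as.numeric(end))), by = id]

# Days without exposure before each prescription; 0 for the first row and for overlaps
DT2[, noexp_days := fifelse(is.na(covered_until), 0,
                            pmax(as.numeric(start) - covered_until, 0))]

# A new period starts whenever a single gap exceeds 30 days
DT2[, period := cumsum(noexp_days > 30) + 1L, by = id]
DT2[]
```

Here the third prescription starts 50 days after the second one ends, but the patient is still covered by the first one, so it stays in period 1; only the fourth prescription opens period 2.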
I have an annual data set that I would like to break into 10-day intervals. For example, I would like to subset 2010-12-26 to 2011-01-04 and create a home range using the x and y values for those dates, then take the next 9 days plus one overlapping date shared with the previous subset. In this case the overlapping date would be 2011-01-04, giving 2011-01-04 to 2011-01-13. Is there a good way to do this?
#Example dataset
library(lubridate)
date <- seq(dmy("26-12-2010"), dmy("15-01-2013"), by = "days")
df <- data.frame(date = date,
x = runif(752, min = 60000, max = 80000),
y = runif(752, min = 800000, max = 900000))
> df
date x y
1 2010-12-26 73649.16 894525.6
2 2010-12-27 69005.21 898233.7
3 2010-12-28 64982.90 873692.6
4 2010-12-29 64592.93 841055.2
5 2010-12-30 60475.99 854524.3
6 2010-12-31 79206.43 879468.2
7 2011-01-01 76692.40 830569.6
8 2011-01-02 70378.51 834338.2
9 2011-01-03 74977.73 820568.0
10 2011-01-04 63023.47 899482.3
11 2011-01-05 77046.80 886369.0
12 2011-01-06 68751.91 841074.7
13 2011-01-07 65471.34 888525.3
14 2011-01-08 61138.68 855039.5
15 2011-01-09 65660.66 880227.2
16 2011-01-10 75526.36 838478.6
17 2011-01-11 64485.74 808947.7
18 2011-01-12 61405.69 887784.1
19 2011-01-13 70561.86 847634.7
20 2011-01-14 69234.98 840012.1
21 2011-01-15 75539.43 817132.5
22 2011-01-16 74227.28 839230.4
23 2011-01-17 74548.59 855006.3
24 2011-01-18 72020.71 815036.7
25 2011-01-19 70814.50 883029.6
26 2011-01-20 76924.65 817289.5
27 2011-01-21 60556.21 807427.2
Thank you for your time.
What about this?
res <- lapply(
seq(0, nrow(df), by = 10),
function(k) df[max(k, 1):min(k + 10, nrow(df)), ]
)
which gives
> head(res)
[[1]]
date x y
1 2010-12-26 63748.27 856758.7
2 2010-12-27 73774.90 860222.6
3 2010-12-28 68893.24 804194.7
4 2010-12-29 79791.86 810624.5
5 2010-12-30 60073.50 809016.0
6 2010-12-31 74020.15 883304.9
7 2011-01-01 67144.95 889235.3
8 2011-01-02 67205.20 810514.2
9 2011-01-03 68518.68 882730.7
10 2011-01-04 70442.87 892934.1
[[2]]
date x y
10 2011-01-04 70442.87 892934.1
11 2011-01-05 65466.26 855725.2
12 2011-01-06 70034.79 879770.8
13 2011-01-07 60195.42 888653.4
14 2011-01-08 65208.12 883176.8
15 2011-01-09 63040.52 821902.3
16 2011-01-10 62302.66 815025.1
17 2011-01-11 77662.53 829474.5
18 2011-01-12 64802.65 809961.7
19 2011-01-13 71812.61 810755.1
20 2011-01-14 63086.30 820029.9
[[3]]
date x y
20 2011-01-14 63086.30 820029.9
21 2011-01-15 75548.71 806966.7
22 2011-01-16 68572.89 847679.0
23 2011-01-17 71408.65 889490.2
24 2011-01-18 73507.84 815559.7
25 2011-01-19 76854.50 899108.6
26 2011-01-20 79138.08 858537.1
27 2011-01-21 73960.14 898957.3
28 2011-01-22 75048.41 864425.6
29 2011-01-23 61059.20 857558.3
30 2011-01-24 67455.03 853017.1
[[4]]
date x y
30 2011-01-24 67455.03 853017.1
31 2011-01-25 72727.70 891708.8
32 2011-01-26 73230.11 836404.6
33 2011-01-27 67719.05 815528.3
34 2011-01-28 65139.66 826289.8
35 2011-01-29 65145.94 818736.4
36 2011-01-30 74206.03 839014.2
37 2011-01-31 77259.35 855653.0
38 2011-02-01 77809.65 836912.6
39 2011-02-02 62744.02 831549.0
40 2011-02-03 79594.93 873313.6
[[5]]
date x y
40 2011-02-03 79594.93 873313.6
41 2011-02-04 78942.86 825001.1
42 2011-02-05 61346.88 871578.5
43 2011-02-06 68526.18 863300.7
44 2011-02-07 76920.15 844180.0
45 2011-02-08 73023.08 823092.4
46 2011-02-09 64287.09 804682.7
47 2011-02-10 71377.16 829219.8
48 2011-02-11 68930.80 814626.6
49 2011-02-12 70780.95 831549.8
50 2011-02-13 73740.99 895868.0
[[6]]
date x y
50 2011-02-13 73740.99 895868.0
51 2011-02-14 79846.05 844586.6
52 2011-02-15 66559.60 835943.0
53 2011-02-16 68522.99 837633.2
54 2011-02-17 65898.75 891364.4
55 2011-02-18 73809.44 842797.9
56 2011-02-19 73336.53 821166.5
57 2011-02-20 72780.91 883200.6
58 2011-02-21 73240.81 864142.2
59 2011-02-22 78855.11 868599.6
60 2011-02-23 69236.04 845566.6
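Note that every chunk after the first one above spans 11 rows (for example rows 10 to 20). If each window should contain exactly 10 dates and share only its first date with the previous window, as in the 2011-01-04 to 2011-01-13 example from the question, a minimal sketch could step forward by 9 rows instead. It assumes one row per date, sorted by date; width, step, starts and windows are names introduced here for illustration.

```r
# Daily dates as in the question (x and y omitted for brevity)
df <- data.frame(date = seq(as.Date("2010-12-26"), as.Date("2013-01-15"), by = "day"))

width <- 10            # rows (days) per window
step  <- width - 1     # advance 9 rows so consecutive windows share one date

# Start rows 1, 10, 19, ...; stop before the last row so that no window
# consists only of the date shared with the previous window
starts <- seq(1, nrow(df) - 1, by = step)

windows <- lapply(starts, function(s) {
  df[s:min(s + width - 1, nrow(df)), , drop = FALSE]
})

range(windows[[1]]$date)  # 2010-12-26 to 2011-01-04
range(windows[[2]]$date)  # 2011-01-04 to 2011-01-13
```

The last window may be shorter than 10 rows when the number of dates does not fit the step exactly.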
An alternative solution using the dplyr package, applicable when you want groups of n dates instead of groups of 10. It assumes one row per date, as in your example. Note that the groups it produces do not overlap.
library(lubridate)
dt <- seq(dmy("26-12-2010"), dmy("15-01-2013"), by = "days")
df <- data.frame(date = dt,
x = runif(752, min = 60000, max = 80000),
y = runif(752, min = 800000, max = 900000))
library(dplyr)
n <- 10
df |>
arrange(date) |>
mutate(id = 0:(nrow(df) - 1),
group = id %/% n + 1) |>
group_by(group) |>
group_split() |>
head(n=2)
#> [[1]]
#> # A tibble: 10 x 5
#> date x y id group
#> <date> <dbl> <dbl> <int> <dbl>
#> 1 2010-12-26 70488. 884674. 0 1
#> 2 2010-12-27 74133. 888636. 1 1
#> 3 2010-12-28 66635. 838681. 2 1
#> 4 2010-12-29 67931. 808998. 3 1
#> 5 2010-12-30 68032. 868329. 4 1
#> 6 2010-12-31 76891. 826684. 5 1
#> 7 2011-01-01 70793. 890401. 6 1
#> 8 2011-01-02 60427. 846447. 7 1
#> 9 2011-01-03 69902. 886152. 8 1
#> 10 2011-01-04 64253. 859245. 9 1
#>
#> [[2]]
#> # A tibble: 10 x 5
#> date x y id group
#> <date> <dbl> <dbl> <int> <dbl>
#> 1 2011-01-05 74260. 844636. 10 2
#> 2 2011-01-06 75631. 807722. 11 2
#> 3 2011-01-07 74443. 840540. 12 2
#> 4 2011-01-08 78903. 811777. 13 2
#> 5 2011-01-09 78531. 894333. 14 2
#> 6 2011-01-10 79310. 812625. 15 2
#> 7 2011-01-11 71701. 801691. 16 2
#> 8 2011-01-12 63254. 854752. 17 2
#> 9 2011-01-13 72813. 837910. 18 2
#> 10 2011-01-14 62718. 877568. 19 2
Created on 2021-07-05 by the reprex package (v2.0.0)
I have a txt file like this:
[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]
I would like to read it into R as a table like this:
seller_id  product_id  buyer_id  sale_date   quantity  price
7          11          49        2019-01-21  5         3330
13         32          6         2019-02-10  9         1089
50         47          4         2019-01-06  1         1343
1          22          2         2019-03-03  9         7677
How can I write R code to do this? Thanks very much for your time.
An easier option is fromJSON
library(jsonlite)
library(janitor)
library(dplyr)  # provides %>% and as_tibble()
fromJSON(txt = "file1.txt") %>%
as_tibble %>%
row_to_names(row_number = 1) %>%
type.convert(as.is = TRUE)
Output:
# A tibble: 4 x 6
# seller_id product_id buyer_id sale_date quantity price
# <int> <int> <int> <chr> <int> <int>
#1 7 11 49 2019-01-21 5 3330
#2 13 32 6 2019-02-10 9 1089
#3 50 47 4 2019-01-06 1 1343
#4 1 22 2 2019-03-03 9 7677
You will need to parse the JSON arrays into a data frame. Perhaps something like this:
# Get string
str <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'
df_list <- jsonlite::parse_json(str)
do.call(rbind, lapply(df_list[-1], function(x) {
setNames(as.data.frame(x), unlist(df_list[1]))}))
#> seller_id product_id buyer_id sale_date quantity price
#> 1 7 11 49 2019-01-21 5 3330
#> 2 13 32 6 2019-02-10 9 1089
#> 3 50 47 4 2019-01-06 1 1343
#> 4 1 22 2 2019-03-03 9 7677
Created on 2020-12-11 by the reprex package (v0.3.0)
Some base R options using:
gsub + read.table
read.table(
text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)),
sep = ",",
header = TRUE
)
gsub + read.csv
read.csv(text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)))
which gives
seller_id product_id buyer_id sale_date quantity price
1 7 11 49 2019-01-21 5 3330
2 13 32 6 2019-02-10 9 1089
3 50 47 4 2019-01-06 1 1343
4 1 22 2 2019-03-03 9 7677
Data
s <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'
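The tibble output above shows sale_date as <chr>, and read.csv likewise reads it as text (or as a factor in R versions before 4.0). If a real date column is wanted, one extra conversion step is enough. A minimal sketch building on the read.csv variant, with the string s repeated so that it runs on its own; the name sales is introduced here for illustration.

```r
s <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'

# Turn "]," into line breaks, strip quotes and brackets, then read the result as CSV
sales <- read.csv(text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)))

# sale_date is read as text; ISO dates (YYYY-MM-DD) need no format string
sales$sale_date <- as.Date(sales$sale_date)

str(sales)
```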
I have a (cut-down) table containing the following pieces of information relating to a credit application process:
Date of Application
Email Address
The table can contain the same email address multiple times but with a different application date (it can be assumed that the same person has applied multiple times).
I would like to add a third column that tells me how many other applications have been seen with the same email address in the 90 days prior to application date.
How would I do this in R? Creating a summary by email address would be straightforward, but the 90-day condition is the tricky part for me.
Coming from SAS I'd sort the table by email address and then use a lag function but any help with R would be massively helpful.
Thanks for reading.
A reproducible example would have been pretty helpful here, but here's my best shot without it. What you're asking for can be done in many ways. The simplest to program is probably a for loop over the rows of the data.
library(data.table)
library(lubridate)
set.seed(124)
emails <- 'None'
dates <- ymd('1900/01/01')
n_email = 500
for(i in seq_len(n_email)) {
n <- rpois(1, 3) + 1
d <- sample(seq(ymd('2018/01/01'), ymd('2019/09/01'), by = 'day'), n)
emails <- c(emails, rep(as.character(i), n))
dates <- c(dates, d)
}
dat <- data.table(emails, dates)
dat <- dat[order(emails, dates)]
dat[,counts := 0][]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 0
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 0
#> 1943: 99 2018-07-07 0
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 0
#> 1946: None 1900-01-01 0
for(i in 1:nrow(dat)) {
diffs = difftime(dat[i,dates], dat[emails == dat[i,emails],dates], units = 'days')
count = sum(diffs < 90 & diffs > 0)
dat[i, counts := count]
}
dat[]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 1
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 1
#> 1943: 99 2018-07-07 2
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 1
#> 1946: None 1900-01-01 0
dat[emails %in% dat[counts > 3,emails]][order(emails, dates)]
#> emails dates counts
#> 1: 396 2018-05-27 0
#> 2: 396 2018-07-10 1
#> 3: 396 2018-10-02 1
#> 4: 396 2019-02-13 0
#> 5: 396 2019-04-21 1
#> 6: 396 2019-04-22 2
#> 7: 396 2019-04-27 3
#> 8: 396 2019-05-02 4
#> 9: 396 2019-06-13 4
#> 10: 496 2018-03-06 0
#> 11: 496 2019-01-31 0
#> 12: 496 2019-04-08 1
#> 13: 496 2019-06-10 1
#> 14: 496 2019-06-24 2
#> 15: 496 2019-07-11 2
#> 16: 496 2019-07-23 3
#> 17: 496 2019-08-25 4
#> 18: 56 2018-11-16 0
#> 19: 56 2019-02-27 0
#> 20: 56 2019-04-09 1
#> 21: 56 2019-04-13 2
#> 22: 56 2019-04-25 3
#> 23: 56 2019-05-13 4
#> emails dates counts
Created on 2019-09-18 by the reprex package (v0.3.0)
However, here's a more concise and efficient way to do it that makes more use of data.table's capabilities. Note that this approach doesn't require pre-sorting the data:
library(data.table)
library(lubridate)
set.seed(124)
emails <- 'None'
dates <- ymd('1900/01/01')
n_email = 500
for(i in seq_len(n_email)) {
n <- rpois(1, 3) + 1
d <- sample(seq(ymd('2018/01/01'), ymd('2019/09/01'), by = 'day'), n)
emails <- c(emails, rep(as.character(i), n))
dates <- c(dates, d)
}
dat <- data.table(emails, dates)
dat <- dat[sample(seq_len(nrow(dat)))]
dat
#> emails dates
#> 1: 70 2018-12-21
#> 2: 416 2018-10-02
#> 3: 289 2018-12-14
#> 4: 87 2018-03-02
#> 5: 441 2018-12-08
#> ---
#> 1942: 365 2018-01-25
#> 1943: 200 2019-02-02
#> 1944: 14 2019-03-20
#> 1945: 166 2018-06-20
#> 1946: 161 2018-02-07
dat[order(dates),
    counts := sapply(1:.N, FUN = function(i) {
      if(i == 1) return(0)
      x = c(0, diff(dates))
      days = 0
      place = i
      ret = 0
      while(days < 90 & place > 1) {
        if(x[place] + days < 90) ret = ret + 1
        days = days + x[place]
        place = place - 1
      }
      ret
    }),
    emails][order(emails, dates)]
#> emails dates counts
#> 1: 1 2018-06-16 0
#> 2: 1 2019-02-15 0
#> 3: 10 2018-09-08 0
#> 4: 10 2018-09-26 1
#> 5: 10 2019-02-05 0
#> ---
#> 1942: 99 2018-07-03 1
#> 1943: 99 2018-07-07 2
#> 1944: 99 2019-02-07 0
#> 1945: 99 2019-04-09 1
#> 1946: None 1900-01-01 0
dat[emails %in% dat[counts > 3,emails]][order(emails, dates)]
#> emails dates counts
#> 1: 396 2018-05-27 0
#> 2: 396 2018-07-10 1
#> 3: 396 2018-10-02 1
#> 4: 396 2019-02-13 0
#> 5: 396 2019-04-21 1
#> 6: 396 2019-04-22 2
#> 7: 396 2019-04-27 3
#> 8: 396 2019-05-02 4
#> 9: 396 2019-06-13 4
#> 10: 496 2018-03-06 0
#> 11: 496 2019-01-31 0
#> 12: 496 2019-04-08 1
#> 13: 496 2019-06-10 1
#> 14: 496 2019-06-24 2
#> 15: 496 2019-07-11 2
#> 16: 496 2019-07-23 3
#> 17: 496 2019-08-25 4
#> 18: 56 2018-11-16 0
#> 19: 56 2019-02-27 0
#> 20: 56 2019-04-09 1
#> 21: 56 2019-04-13 2
#> 22: 56 2019-04-25 3
#> 23: 56 2019-05-13 4
#> emails dates counts
Created on 2019-09-18 by the reprex package (v0.3.0)
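Another way to express the 90-day condition is a non-equi self-join in data.table, which avoids explicit loops. This is only a sketch on made-up data: the column names email and app_date, the helper column window_start and the result column prior_90d are all assumptions, and whether the 90th day itself counts is a choice (the loop above uses a strict < 90, this sketch includes day 90).

```r
library(data.table)

# Made-up applications (hypothetical column names)
apps <- data.table(
  email    = c("a@x.com", "a@x.com", "a@x.com", "b@x.com", "b@x.com"),
  app_date = as.Date(c("2019-01-01", "2019-02-15", "2019-06-01",
                       "2019-03-01", "2019-03-10"))
)
apps[, window_start := app_date - 90]

# Self-join: for each application, count the rows with the same email whose
# date is strictly earlier and at most 90 days back
counts <- apps[apps,
               on = .(email, app_date < app_date, app_date >= window_start),
               .N, by = .EACHI]

# by = .EACHI returns one count per application, in the original row order
apps[, prior_90d := counts$N]
apps[, window_start := NULL][]
```

On this sample only the second application of each email has an earlier application within 90 days, so prior_90d is 0, 1, 0, 0, 1.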
I have a data frame like:
user_name started_at session_time_min task_completed timediff
ABC 2018-03-02 18:00:00 1 3 NA
ABC 2018-03-02 19:00:00 1036 18 1
ABC 2018-03-03 12:00:00 6 10 17
ABC 2018-03-04 21:00:00 0 1 33
ABC 2018-03-05 16:00:00 143 61 19
ABC 2018-03-05 18:00:00 12 18 2
ABC 2018-03-05 19:00:00 60 94 1
ABC 2018-03-05 20:00:00 20 46 1
ABC 2018-03-09 15:00:00 0 1 91
I want to add session_time_min and task_completed to the previous row whenever timediff = 1.
The desired output looks like this:
user_name started_at session_time_min task_completed
ABC 2018-03-02 18:00:00 1037 21
ABC 2018-03-03 12:00:00 6 10
ABC 2018-03-04 21:00:00 0 1
ABC 2018-03-05 16:00:00 143 61
ABC 2018-03-05 18:00:00 92 158
ABC 2018-03-09 15:00:00 0 1
Any help will be highly appreciated.
You could use a for loop to help you out especially if you want to use base R.
for (i in 1:nrow(data)) {
  if (is.na(data[i,5])){
    data[i+1,3] <- data[i+1,3] + data[i,3]
    data[i+1,4] <- data[i+1,4] + data[i,4]
  } else {}
}
data <- na.omit(data)
This code loops over each row of your data frame and checks whether the value in column 5 (timediff) is NA. If it is, the values in columns 3 and 4 (the two columns you want to sum) are added to the row below (row i+1), and na.omit then drops the NA row.
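Note that the loop above reacts to NA in timediff rather than to timediff == 1, so on the sample data it only merges the first two rows. A minimal base R sketch of a loop that instead folds every row with timediff equal to 1 into the nearest preceding row that was kept might look like this. The data frame is retyped from the question, and keep and last_kept are helper names introduced here for illustration.

```r
# Sample data retyped from the question
dat <- data.frame(
  user_name = "ABC",
  started_at = c("2018-03-02 18:00:00", "2018-03-02 19:00:00", "2018-03-03 12:00:00",
                 "2018-03-04 21:00:00", "2018-03-05 16:00:00", "2018-03-05 18:00:00",
                 "2018-03-05 19:00:00", "2018-03-05 20:00:00", "2018-03-09 15:00:00"),
  session_time_min = c(1, 1036, 6, 0, 143, 12, 60, 20, 0),
  task_completed = c(3, 18, 10, 1, 61, 18, 94, 46, 1),
  timediff = c(NA, 1, 17, 33, 19, 2, 1, 1, 91)
)

keep <- rep(TRUE, nrow(dat))  # rows that survive the merge
last_kept <- 1                # index of the row currently collecting sums

for (i in seq_len(nrow(dat))[-1]) {
  if (!is.na(dat$timediff[i]) && dat$timediff[i] == 1) {
    # Fold this row into the last kept row and mark it for removal
    dat$session_time_min[last_kept] <- dat$session_time_min[last_kept] + dat$session_time_min[i]
    dat$task_completed[last_kept] <- dat$task_completed[last_kept] + dat$task_completed[i]
    keep[i] <- FALSE
  } else {
    last_kept <- i
  }
}

res <- dat[keep, c("user_name", "started_at", "session_time_min", "task_completed")]
res
```

Each merged group keeps the started_at value of its first row, which matches the desired output in the question.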
Make a group counter using cumsum and then use that to subset the identifier columns and rowsum the value columns:
grp <- cumsum(!dat$timediff %in% 1)
#[1] 1 1 2 3 4 5 5 5 6
cbind(
dat[match(unique(grp), grp), c("user_name","started_at")],
rowsum(dat[c("session_time_min","task_completed")], grp)
)
# user_name started_at session_time_min task_completed
#1 ABC 2018-03-0218:00:00 1037 21
#3 ABC 2018-03-0312:00:00 6 10
#4 ABC 2018-03-0421:00:00 0 1
#5 ABC 2018-03-0516:00:00 143 61
#6 ABC 2018-03-0518:00:00 92 158
#9 ABC 2018-03-0915:00:00 0 1