To be honest, I couldn't come up with a decent title for this.
Basically, I have a data frame:
ID Qty BasePrice Total
1 2 30 50
1 1 20 20
2 4 5 15
For each line I want to calculate the following:
Result = (Qty * BasePrice) - Total
That part is easy to do in R. However, I also want to sum the results within each ID.
Sample Output:
ID Qty BasePrice Total Results
1 2 30 50 10
1 1 20 20 10
2 4 5 15 5
For instance, for ID=1, the value is ((2*30)-50)+((1*20)-20) = 10 + 0 = 10.
Any idea how I can achieve this?
Thanks!
We can group by 'ID' and take the sum of the product of 'Qty' and 'BasePrice' minus 'Total':
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Result = sum((Qty * BasePrice) - Total))
# A tibble: 3 x 5
# Groups: ID [2]
# ID Qty BasePrice Total Result
# <int> <int> <int> <int> <int>
#1 1 2 30 50 10
#2 1 1 20 20 10
#3 2 4 5 15 5
data
df1 <- structure(list(ID = c(1L, 1L, 2L), Qty = c(2L, 1L, 4L), BasePrice = c(30L,
20L, 5L), Total = c(50L, 20L, 15L)), class = "data.frame", row.names = c(NA,
-3L))
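If you would rather avoid the dplyr dependency, the same grouped sum can be sketched in base R with ave(), which applies a function per group and returns the result in the original row positions. This is only an illustration on the sample data from the question, not a replacement for the answer above.

```r
# Sample data from the question
df1 <- data.frame(
  ID        = c(1L, 1L, 2L),
  Qty       = c(2L, 1L, 4L),
  BasePrice = c(30L, 20L, 5L),
  Total     = c(50L, 20L, 15L)
)

# Per-row difference, then summed within each ID and repeated on every row of that ID
row_diff   <- df1$Qty * df1$BasePrice - df1$Total
df1$Result <- ave(row_diff, df1$ID, FUN = sum)

df1
```

Use aggregate() or dplyr::summarise() instead if you want one row per ID rather than the sum repeated on each row.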
Related
I have data that look like these:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
I'd like to create an order variable for the unique dates within each Subject and Site, i.e.
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
group_by(Subject, Site) %>%
mutate(Want = match(Date, unique(Date))) %>%
ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))
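On why the two tapply() calls disagree: tapply() returns an array with one cell per combination of grouping levels, and unlist() walks those cells in the array's storage order, so the values come back in group order rather than row order. ave() writes each group's result back into the original row positions, so the order of the grouping columns stops mattering. A rough sketch with the question's cumsum(!duplicated()) helper, renamed here to date_order (a made-up name, chosen so it does not mask base::rle):

```r
val <- data.frame(
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L),
  Site    = c(2L, 2L, 2L, 1L, 1L, 1L),
  Date    = c("2020-01-01", "2020-01-01", "2020-01-02",
              "2020-01-02", "2020-01-03", "2020-01-03")
)

# date_order is a hypothetical name for the helper from the question
date_order <- function(x) cumsum(!duplicated(x))

# Group the row indices, then look the dates up inside each group
rows   <- seq_len(nrow(val))
want_a <- ave(rows, val$Site, val$Subject,
              FUN = function(i) date_order(val$Date[i]))
want_b <- ave(rows, val$Subject, val$Site,
              FUN = function(i) date_order(val$Date[i]))

want_a
want_b
```

Both calls should give the same vector, aligned with the rows of val.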
I'm trying to return a logical vector based on whether a person meets one set of conditions and ALSO meets another set of conditions later on. I'm using a data frame that looks like so:
Person.Id Year Term
250 1 3
250 1 1
250 2 3
300 1 3
511 2 1
300 1 5
700 2 3
What I want to return is a logical vector that indicates true/false if person ID 250 has year 1 and term 3, AND later has year 2 term 3. So a person that only has year 1 term 3 or year 1 term 5 will return false. Solutions in dplyr preferred! I feel like this is simple and I'm just missing something. I initially tried this code but all it returned was a blank df:
df2 <- df1 %>%
group_by(Person.Id) %>%
filter((year==1 & term==3) & (year==2 & term==3))
Are you looking for something like this?
require(dplyr)
df %>%
group_by(Person.Id) %>%
mutate(count=sum((year==1 & term==3) | (year==2 & term==3))) %>%
mutate(count2 = count == 2)
# A tibble: 7 x 5
# Groups: Person.Id [4]
Person.Id year term count count2
<int> <int> <int> <int> <lgl>
1 250 1 3 2 TRUE
2 250 1 1 2 TRUE
3 250 2 3 2 TRUE
4 300 1 3 1 FALSE
5 511 2 1 0 FALSE
6 300 1 5 1 FALSE
7 700 2 3 1 FALSE
Maybe this can help:
#Data
Data <- structure(list(Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L,
700L), Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L), Term = c(3L, 1L,
3L, 3L, 1L, 5L, 3L)), row.names = c(NA, -7L), class = "data.frame")
#Flags
cond1 <- Data$Year==1 & Data$Term==3
cond2 <- Data$Year==2 & Data$Term==3
#Replace
Data$Flag1 <- 0
Data$Flag1[cond1]<-1
Data$Flag2 <- 0
Data$Flag2[cond2]<-1
#Filter
library(dplyr)
Data %>% group_by(Person.Id) %>% filter(Flag1==1 | Flag2==1)
# A tibble: 4 x 5
# Groups: Person.Id [3]
Person.Id Year Term Flag1 Flag2
<int> <int> <int> <dbl> <dbl>
1 250 1 3 1 0
2 250 2 3 0 1
3 300 1 3 1 0
4 700 2 3 0 1
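For the logical vector the question asks for, one possible sketch in base R is to check each condition per person with ave() and any(), then combine the two. This assumes that "later" simply means the Year 2 row exists alongside the Year 1 row (a higher Year value), not that it appears further down in the data; if row order matters, you would need to compare positions as well. The column name Both is made up for the example.

```r
Data <- data.frame(
  Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L, 700L),
  Year      = c(1L, 1L, 2L, 1L, 2L, 1L, 2L),
  Term      = c(3L, 1L, 3L, 3L, 1L, 5L, 3L)
)

# Row-level conditions
first_cond  <- Data$Year == 1 & Data$Term == 3
second_cond <- Data$Year == 2 & Data$Term == 3

# Does the person have at least one row meeting each condition?
has_first  <- as.logical(ave(first_cond,  Data$Person.Id, FUN = any))
has_second <- as.logical(ave(second_cond, Data$Person.Id, FUN = any))

# TRUE only when the same person meets both (assumption: Year 2 counts as "later")
Data$Both <- has_first & has_second
Data
```

Unlike counting matching rows and testing for exactly 2, this does not break when a person has repeated Year 1 / Term 3 rows.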
dfin <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 1 0 20 20
1 2 1 20 20
Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME.
dfout <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 2 1 20 20
Using RStudio.
One option is to sort by 'TIME', group by 'STUDY' and 'ID', and filter out the duplicated 0 values in 'CYCLE':
library(dplyr)
dfin %>%
arrange(STUDY, ID, TIME) %>%
group_by(STUDY, ID) %>%
filter(!(duplicated(CYCLE) & CYCLE == 0))
# A tibble: 2 x 5
# Groups: STUDY, ID [2]
# STUDY ID CYCLE TIME VALUE
# <int> <int> <int> <int> <int>
#1 1 1 0 10 50
#2 1 2 1 20 20
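Also, if there are many duplicates for 0 and we want to remove only the row where 'TIME' is also the maximum (note that this also drops a lone CYCLE == 0 row when it holds the group's maximum TIME):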
dfin %>%
group_by(STUDY, ID) %>%
filter(!(TIME == max(TIME) & CYCLE == 0))
Or using base R
dfin1 <- dfin[with(dfin, order(STUDY, ID, TIME)), ]
dfin1[!(duplicated(dfin1[c("STUDY", "ID", "CYCLE")]) & dfin1$CYCLE == 0), ]
# STUDY ID CYCLE TIME VALUE
#1 1 1 0 10 50
#3 1 2 1 20 20
data
dfin <- structure(list(STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L,
0L, 1L), TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-3L))
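Another way to look at it, sketched in base R: for each STUDY/ID, find the smallest TIME among the CYCLE == 0 rows and keep only that one, while leaving all other rows alone. This does not depend on sort order and also copes with more than two CYCLE == 0 rows per group. The names time0, min_time0 and keep are just for the example, and ties on the minimum TIME would all be kept.

```r
dfin <- data.frame(
  STUDY = c(1L, 1L, 1L),
  ID    = c(1L, 1L, 2L),
  CYCLE = c(0L, 0L, 1L),
  TIME  = c(10L, 20L, 20L),
  VALUE = c(50L, 20L, 20L)
)

# TIME for CYCLE == 0 rows, NA elsewhere
time0 <- ifelse(dfin$CYCLE == 0, dfin$TIME, NA_integer_)

# Smallest CYCLE == 0 TIME per STUDY/ID (NA when the group has no such row)
min_time0 <- ave(time0, dfin$STUDY, dfin$ID,
                 FUN = function(t) if (all(is.na(t))) NA_integer_ else min(t, na.rm = TRUE))

# Keep every non-zero CYCLE row, plus the earliest CYCLE == 0 row per group
keep  <- dfin$CYCLE != 0 | dfin$TIME == min_time0
dfout <- dfin[keep, ]
dfout
```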
I have data with two fields, ID and Time:
ID Time
1 12
1 15
1 16
2 12
2 11
I want to increment a counter when the difference between Time and the previous Time is less than 2 (for example) within the same ID; otherwise it should stay at the same value, and it should restart at 1 when the ID changes.
Desired output:
ID Time ID_SESSION
1 12 1
1 15 1
1 16 2
2 12 1
2 11 1
I need this in dplyr/sparklyr, for a Spark implementation in R.
A one-liner using base R,
with(df, ave(Time, ID, FUN = function(i)cumsum(c(TRUE, diff(i) <= 2))))
#[1] 1 1 2 1 2
Maybe we need
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(ID_SESSION = (lag(c(FALSE, diff(Time) > 2), default= FALSE)) + 1)
Or in a one-liner with data.table
library(data.table)
setDT(df1)[, ID_SESSION := shift(c(FALSE, diff(Time) > 2), fill = FALSE) + 1, ID]
df1
# ID Time ID_SESSION
#1: 1 12 1
#2: 1 15 1
#3: 1 16 2
#4: 2 12 1
#5: 2 11 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Time = c(12L, 15L,
16L, 12L, 11L)), class = "data.frame", row.names = c(NA, -5L))
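The rule in the question is a little ambiguous (the last row has a negative gap and keeps its session), so here is a sketch under one explicit assumption: the counter goes up only when Time moves forward by less than 2 compared with the previous row of the same ID. That reading reproduces the desired output, and because it uses a running cumsum() it keeps counting past 2 on longer groups. It is base R only, so it would still need translating for sparklyr.

```r
df1 <- data.frame(
  ID   = c(1L, 1L, 1L, 2L, 2L),
  Time = c(12L, 15L, 16L, 12L, 11L)
)

# Assumption: a new session starts when 0 < (Time - previous Time) < 2 within an ID
session_id <- function(t) {
  gap <- diff(t)
  cumsum(c(FALSE, gap > 0 & gap < 2)) + 1
}

df1$ID_SESSION <- ave(df1$Time, df1$ID, FUN = session_id)
df1
```

If the intended rule is different (for example, absolute gaps, or "greater than 2"), only the condition inside session_id needs to change.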
I have a list of events by ID and would like to group them in two week periods. The two weeks should start whenever the first event occurs for each ID. The grouped event data should look something like the following,
ID Date Group
<dbl> <date> <dbl>
1 2018-01-01 1
1 2018-01-02 1
1 2018-01-02 1
1 2018-02-01 2
1 2018-03-01 3
2 2018-01-01 4
2 2018-04-01 5
dat = structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), Date = structure(c(17532,
17533, 17533, 17563, 17591, 17532, 17622), class = "Date"), Group = c(1L,
1L, 1L, 2L, 3L, 4L, 5L)), .Names = c("ID", "Date", "Group"), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
I was originally thinking of lagging by ID and filtering for events that happen within a two week period, but there may be many events that correspond to a single two week period.
You can use cut and seq to bin each date into two-week intervals that start at the first event for each ID, then group_indices to make an increasing index:
dat %>%
group_by(ID) %>%
mutate(g = cut(Date, seq(first(Date), max(Date) + 14, by="2 weeks")) %>% as.character) %>%
ungroup %>%
mutate(g = group_indices(., ID, g))
# A tibble: 7 x 4
ID Date Group g
<int> <date> <int> <int>
1 1 2018-01-01 1 1
2 1 2018-01-02 1 1
3 1 2018-01-02 1 1
4 1 2018-02-01 2 2
5 1 2018-03-01 3 3
6 2 2018-01-01 4 4
7 2 2018-04-01 5 5
Get the difference of adjacent 'Date's with difftime specifying the unit as "week", check if the difference is greater than 2, and get the cumulative sum
dat %>%
mutate(GroupNew = cumsum(abs(difftime(Date, lag(Date,
default = first(Date)), units = "weeks")) > 2) + 1)
# A tibble: 7 x 4
# ID Date Group GroupNew
# <int> <date> <int> <dbl>
#1 1 2018-01-01 1 1
#2 1 2018-01-02 1 1
#3 1 2018-01-02 1 1
#4 1 2018-02-01 2 2
#5 1 2018-03-01 3 3
#6 2 2018-01-01 4 4
#7 2 2018-04-01 5 5
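If the two-week windows should be anchored at each ID's first event, as the question describes, the binning can also be done with plain integer arithmetic: count the days since the ID's first date, integer-divide by 14, and number the distinct ID/bin pairs in order of appearance. A minimal base R sketch on the question's dates; the names days_since_first, bin, key and g are made up for the example.

```r
dat <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 1L, 2L, 2L),
  Date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-02", "2018-02-01",
                   "2018-03-01", "2018-01-01", "2018-04-01"))
)

# Days elapsed since the first event of each ID
day_num          <- as.integer(dat$Date)
days_since_first <- day_num - ave(day_num, dat$ID, FUN = min)

# 0 for the first 14 days, 1 for the next 14, and so on
bin <- days_since_first %/% 14

# Running index over distinct ID/bin pairs, in order of first appearance
key   <- paste(dat$ID, bin)
dat$g <- match(key, unique(key))
dat
```

Unlike comparing each date with the previous row, this keeps two events in the same group only when they fall in the same 14-day window counted from the ID's first event.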