Remove duplicates based on multiple conditions - r

I have some individuals that are listed twice because they have received multiple degrees. I am trying to keep only the rows with the latest degree-granting date. Below are examples of the current output and the desired output.
people  | g_date     | wage | quarter
personA | 2009-01-01 | 100  | 20201
personA | 2009-01-01 | 100  | 20202
personA | 2010-01-01 | 100  | 20201
personA | 2010-01-01 | 100  | 20202
personB | 2012-01-01 | 50   | 20201
personB | 2012-01-01 | 50   | 20202
personB | 2012-01-01 | 50   | 20203
Desired output
people  | g_date     | wage | quarter
personA | 2010-01-01 | 100  | 20201
personA | 2010-01-01 | 100  | 20202
personB | 2012-01-01 | 50   | 20201
personB | 2012-01-01 | 50   | 20202
personB | 2012-01-01 | 50   | 20203
I have used the code below, but it removes all the rows so that there is only one row per person.
df <- df[order(df$g_date), ]
df <- df[!duplicated(df$people, fromLast = TRUE), ]
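The snippet above keys duplicated() on people alone, so only the single most recent row per person survives and the other quarters are dropped. A minimal base R sketch that also keys on quarter (the example data is reconstructed here for reproducibility):

df <- data.frame(
  people  = rep(c("personA", "personB"), c(4, 3)),
  g_date  = c("2009-01-01", "2009-01-01", "2010-01-01", "2010-01-01",
              "2012-01-01", "2012-01-01", "2012-01-01"),
  wage    = c(100, 100, 100, 100, 50, 50, 50),
  quarter = c(20201, 20202, 20201, 20202, 20201, 20202, 20203)
)

# Order by g_date, then keep the last occurrence of each (people, quarter) pair,
# i.e. the row with the latest degree-granting date for that person and quarter
df <- df[order(df$g_date), ]
df <- df[!duplicated(df[c("people", "quarter")], fromLast = TRUE), ]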

Another option is to use group_by() together with slice_max(), like this:
library(dplyr)
df %>%
  group_by(people, quarter) %>%
  slice_max(order_by = g_date, n = 1)
#> # A tibble: 5 × 4
#> # Groups: people, quarter [5]
#> people g_date wage quarter
#> <chr> <chr> <dbl> <int>
#> 1 personA 2010-01-01 100 20201
#> 2 personA 2010-01-01 100 20202
#> 3 personB 2012-01-01 50 20201
#> 4 personB 2012-01-01 50 20202
#> 5 personB 2012-01-01 50 20203
Created on 2022-12-15 with reprex v2.0.2
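One note on ties: if a person could have two rows sharing the maximum g_date within a quarter, slice_max() keeps all tied rows by default. A sketch of forcing exactly one row per group, using slice_max()'s with_ties argument:

df %>%
  group_by(people, quarter) %>%
  slice_max(order_by = g_date, n = 1, with_ties = FALSE) %>%
  ungroup()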

A base R alternative: aggregate() computes the maximum g_date per person, and merging that back keeps only the rows whose g_date matches it.
merge(df, aggregate(. ~ people, df[1:2], max))
#> people g_date wage quarter
#> 1 personA 2010-01-01 100 20201
#> 2 personA 2010-01-01 100 20202
#> 3 personB 2012-01-01 50 20201
#> 4 personB 2012-01-01 50 20202
#> 5 personB 2012-01-01 50 20203

Update (thanks to @Villalba, removed first answer):
We could first group and arrange, then filter:
library(dplyr)
library(lubridate)
df %>%
  group_by(people, quarter) %>%
  mutate(g_date = ymd(g_date)) %>%
  arrange(g_date, .by_group = TRUE) %>%
  filter(row_number() == n())
people g_date wage quarter
<chr> <date> <int> <int>
1 personA 2010-01-01 100 20201
2 personA 2010-01-01 100 20202
3 personB 2012-01-01 50 20201
4 personB 2012-01-01 50 20202
5 personB 2012-01-01 50 20203

Extract tibble_df and text message from activitylog object

I have the code below
library(bupaR)
library(daqapo)
library(dplyr)      # rename()
library(lubridate)  # dmy_hms()

hospital <- hospital  # the demo data used in the question
hospital %>%
  rename(start = start_ts,
         complete = complete_ts) -> hospital
hospital %>%
  convert_timestamps(c("start", "complete"), format = dmy_hms) -> hospital
hospital %>%
  activitylog(case_id = "patient_visit_nr",
              activity_id = "activity",
              resource_id = "originator",
              timestamps = c("start", "complete")) -> hospital
hospital %>%
  detect_time_anomalies()
which gives
*** OUTPUT ***
For 5 rows in the activity log (9.43%), an anomaly is detected.
The anomalies are spread over the activities as follows:
# A tibble: 3 × 3
activity type n
<chr> <chr> <int>
1 Registration negative duration 3
2 Clinical exam zero duration 1
3 Trage negative duration 1
Anomalies are found in the following rows:
# Log of 10 events consisting of:
3 traces
3 cases
5 instances of 3 activities
5 resources
Events occurred from 2017-11-21 11:22:16 until 2017-11-21 19:00:00
# Variables were mapped as follows:
Case identifier: patient_visit_nr
Activity identifier: activity
Resource identifier: originator
Timestamps: start, complete
# A tibble: 5 × 10
patient_visit_nr activity originator start complete triagecode specialization .order durat…¹ type
<dbl> <chr> <chr> <dttm> <dttm> <dbl> <chr> <int> <dbl> <chr>
1 518 Registration Clerk 12 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 1 -23 nega…
2 518 Registration Clerk 6 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 2 -23 nega…
3 518 Registration Clerk 9 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 3 -23 nega…
4 520 Trage Nurse 17 2017-11-21 13:43:16 2017-11-21 13:39:00 5 URG 4 -4.27 nega…
5 528 Clinical exam Doctor 1 2017-11-21 19:00:00 2017-11-21 19:00:00 3 TRAU 5 0 zero…
# … with abbreviated variable name ¹​duration
From this output I would like to extract the text message
For 5 rows in the activity log (9.43%), an anomaly is detected.
The anomalies are spread over the activities as follows:
and, as another object, the tibble
activity type n
<chr> <chr> <int>
1 Registration negative duration 3
2 Clinical exam zero duration 1
3 Trage negative duration 1
The string you are trying to obtain is a message, and though it is possible to capture a message, it's not that straightforward. You can easily generate it by emulating a couple of lines within the function though.
If you store the result of detect_time_anomalies:
anomalies <- hospital %>% detect_time_anomalies()
Then you can generate the message like this:
paste0("For ", nrow(anomalies), " rows in the activity log (",
round(nrow(anomalies)/nrow(hospital) * 100, 2),
"%), an anomaly is detected.")
#> [1] "For 5 rows in the activity log (9.43%), an anomaly is detected."
Similarly, you can obtain the output table like this:
anomalies %>%
  group_by(activity, type) %>%
  summarize(n = n()) %>%
  arrange(desc(n))
#> # A tibble: 3 x 3
#> activity type n
#> <chr> <chr> <int>
#> 1 Registration negative duration 3
#> 2 Clinical exam zero duration 1
#> 3 Trage negative duration 1
Created on 2022-12-13 with reprex v2.0.2
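If you would rather capture the printed text than rebuild it, here is a minimal sketch; it assumes (as stated above) that the line is emitted via message(), so it goes to the message stream that capture.output(type = "message") collects:

# Capture the message stream while still keeping the returned anomalies table
msgs <- capture.output(
  anomalies <- hospital %>% detect_time_anomalies(),
  type = "message"
)
msgs[1]  # expected to hold the "For 5 rows in the activity log ..." line

Inspect msgs first, since the exact number and order of message lines the function emits is not guaranteed here.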

How do I add only the last element of one vector to the last element of another vector?

I have a dataframe with hundreds of different investments (represented by the "id" column), their cashflows, and market value. The following example demonstrates the data that I'm working with:
df <- data.frame(
  date = c("2020-01-31", "2020-02-29", "2020-03-31", "2020-02-29", "2020-03-31", "2020-04-30", "2020-05-31"),
  id = c("alpha", "alpha", "alpha", "bravo", "bravo", "bravo", "bravo"),
  cashflow = c(-100, 20, 4, -50, 8, 12, 8),
  market_value = c(100, 90, 80, 50, 110, 120, 115)
)
I ultimately want to calculate the IRR per investment. However, before I can do that, I need to add only the last market value number to the corresponding cashflow. I don't care about any market values before that. In this case, the last cashflow for "alpha" investment must be 84 (i.e., 80 market value + 4 cashflow) and the last cashflow for "bravo" investment must be 123 (i.e., 115 market value + 8 cashflow).
Desired output:
id      cashflow
alpha       -100
alpha         20
alpha         84
bravo        -50
bravo          8
bravo         12
bravo        123
Thanks!
I'm not too sure what final output you want, but here's how you'd take just the last row per id.
library(dplyr)
df %>%
  mutate(total = cashflow + market_value) %>%
  group_by(id) %>%
  slice_max(order_by = date) %>%
  ungroup()
#> # A tibble: 2 × 5
#> date id cashflow market_value total
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2020-03-31 alpha 4 80 84
#> 2 2020-05-31 bravo 8 115 123
Created on 2022-07-22 by the reprex package (v2.0.1)
EDIT - just seen what I think is your desired output, how's this?
df %>%
  group_by(id) %>%
  mutate(
    cashflow = if_else(row_number() == n(), cashflow + market_value, cashflow)
  )
#> # A tibble: 7 × 4
#> # Groups: id [2]
#> date id cashflow market_value
#> <chr> <chr> <dbl> <dbl>
#> 1 2020-01-31 alpha -100 100
#> 2 2020-02-29 alpha 20 90
#> 3 2020-03-31 alpha 84 80
#> 4 2020-02-29 bravo -50 50
#> 5 2020-03-31 bravo 8 110
#> 6 2020-04-30 bravo 12 120
#> 7 2020-05-31 bravo 123 115
Created on 2022-07-22 by the reprex package (v2.0.1)
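Since the stated end goal is an IRR per investment, here is a minimal sketch of how the adjusted cashflows could feed that calculation. It treats the cashflows as equally spaced periods and solves for the NPV root with base R's uniroot(); no dedicated IRR function from any package is assumed, and periodic_irr is a hypothetical helper named here for illustration.

library(dplyr)

# Hypothetical helper: periodic IRR as the rate at which the NPV of the cashflows is zero
periodic_irr <- function(cf) {
  npv <- function(r) sum(cf / (1 + r)^(seq_along(cf) - 1))
  uniroot(npv, interval = c(-0.99, 10))$root
}

df %>%
  group_by(id) %>%
  mutate(cashflow = if_else(row_number() == n(), cashflow + market_value, cashflow)) %>%
  summarise(irr_per_period = periodic_irr(cashflow))

For unevenly spaced dates you would want an XIRR-style calculation that discounts by actual day counts instead.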

How to do a floor_date() in dbplyr

I'm trying to aggregate minute-level time series data to hourly level via averaging.
In order to do that I want to calculate an hour column that has the day and hour that the reading occurred in. Then I can do a simple group_by summarise. For instance, my tbl_df looks like:
# Database: Microsoft SQL Server 13.00.4001[<SERVER>/<Project>]
eGauge time Channel End_Use Metric Circuit Reading mean_lag
<int> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 30739 2018-07-06 20:04:00.000 8.0 Clothes Washer P <NA> 0.000033333 60
2 30739 2018-07-06 20:13:00.000 3.0 Clothes Dryer P <NA> 0.000833333 60
3 30739 2018-07-06 21:16:00.000 6.0 Cooktop P <NA> 0.000050000 60
4 30739 2018-07-06 21:00:00.000 3.0 Clothes Dryer P <NA> 0.000833333 60
5 30739 2018-07-06 21:46:00.000 8.0 Clothes Washer P <NA> 0.000016667 60
6 30739 2018-07-07 02:06:00.000 3.0 Clothes Dryer P <NA> 0.001016667 1
7 30739 2018-07-07 08:52:00.000 1.0 Service Mains P <NA> 1.814516667 1
8 30739 2018-07-07 08:52:00.000 3.0 Clothes Dryer P <NA> 0.001050000 1
9 30739 2018-07-07 08:52:00.000 4.0 Central AC P <NA> 0.043000000 1
10 30739 2018-07-07 08:52:00.000 5.0 Oven P <NA> 0.021333333 1
and I would like a new column like this: 2018-07-06 20:00:00.000 (the reading's timestamp floored to the start of its hour).
Normally I would use floor_date(time, "hour") from lubridate, or even str_replace(time, ".{2}(?=:[^:]*$)", "00"), but neither works for me over my SQL Server connection.
Any idea how this is done in R? The answer must be R code, preferably dplyr code, such as:
# NOT WORKING
my_table %>%
  mutate(time_hour = floor_date(time, "hour"))
or
# NOT WORKING
my_table %>%
  mutate(time_hour = DATEADD('hour', DATEDIFF('hour', 0, time), 0))
my_table %>%
  mutate(time_hour = DATEADD(sql("hour"), DATEDIFF(sql("hour"), 0, time), 0))
# WORKS BUT NEEDS IMPROVEMENT
my_table %>%
  mutate(hour = "hour",
         time_hour = DATEADD(hour, DATEDIFF(hour, 0, time), 0)) %>%
  select(-hour)
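One pattern that is sometimes used with dbplyr (a sketch, not verified against this particular server): pass the whole T-SQL expression through verbatim with dbplyr's sql(), so that hour is never quoted as a string literal. Here time is assumed to be the literal column name on the server.

library(dplyr)
library(dbplyr)

# The string inside sql() is injected into the generated query as-is,
# so dbplyr does not translate or quote any part of it
my_table %>%
  mutate(time_hour = sql("DATEADD(hour, DATEDIFF(hour, 0, time), 0)"))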

time differences for multiple events for same ID in R

I'm new to Stack Overflow and looked at similar posts, but couldn't find a solution that captures time differences for multiple events for the same ID.
What I've got:
Time<-c('2016-10-04','2016-10-18', '2016-10-04','2016-10-18','2016-10-19','2016-10-28','2016-10-04','2016-10-19','2016-10-21','2016-10-22', '2017-01-02', '2017-03-04')
Value<-c(0,1,0,1,0,0,0,1,0,1,1,0)
StoreID<-c('a','a','b','b','c','c','d','d','a','a','d','c')
Unit<-c(1,1,2,2,5,5,6,6,1,1,6,5)
Helper<-c('a1','a1','b2','b2','c5','c5','d6','d6','a1','a1','d6','c5')
The helper column is the StoreID and Unit combined because I couldn't figure out how to group by both Store ID and the Unit. I want to sort the data to show when the unit was disabled (value =0) and enabled again (value =1).
Ultimately, I'd want:
Store_ID Unit Helper Time(v=0) Time(v=1) Time2(v=0) Time 2(v=1)
a 1 a1 2016-10-04 2016-10-18 2016-10-21 2016-10-22
b 2 b2 2016-10-04 2016-10-18
c 5 c5 2016-10-19 2016-10-28 2017-03-04
d 6 d6 2016-10-04 2016-10-19
Any thoughts?
I'm thinking something in dplyr but am stumped about where to go further.
Create a Headers column that combines the Value column with a row number to distinguish duplicates, then spread to wide format. I didn't use the helper column; I grouped by StoreID and Unit instead.
library(dplyr)
library(tidyr)
df <- data.frame(StoreID, Unit, Time, Value)
df %>%
  group_by(StoreID, Unit, Value) %>%
  mutate(Headers = sprintf('Time %s (v=%s)', row_number(), Value)) %>%
  ungroup() %>%
  select(-Value) %>%
  spread(Headers, Time)
# A tibble: 4 x 7
# StoreID Unit `Time 1 (v=0)` `Time 1 (v=1)` `Time 2 (v=0)` `Time 2 (v=1)` `Time 3 (v=0)`
#* <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 a 1 2016-10-04 2016-10-18 2016-10-21 2016-10-22 NA
#2 b 2 2016-10-04 2016-10-18 NA NA NA
#3 c 5 2016-10-19 NA 2016-10-28 NA 2017-03-04
#4 d 6 2016-10-04 2016-10-19 NA 2017-01-02 NA
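With current tidyr, spread() is superseded; a sketch of the same reshape with pivot_wider() (column order may differ, the content should match):

library(dplyr)
library(tidyr)

df %>%
  group_by(StoreID, Unit, Value) %>%
  mutate(occurrence = row_number()) %>%   # distinguishes repeated (StoreID, Unit, Value) rows
  ungroup() %>%
  pivot_wider(
    names_from = c(occurrence, Value),
    values_from = Time,
    names_glue = "Time {occurrence} (v={Value})"
  )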

How can I reshape data from long to wide

** Sample data added after comment**
What I have:
pmts <- data.frame(stringsAsFactors = FALSE,
                   name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
                   pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
                   pmt_date = c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17")
)
#> name pmt_amount pmt_date
#> 1 johndoe 550 9/1/16
#> 2 johndoe 550 11/1/16
#> 3 janedoe 995 12/15/16
#> 4 foo 375 1/5/17
#> 5 foo 375 3/5/17
#> 6 foo 375 5/5/17
What I am looking to achieve:
read.table(header = T, text =
"name pmt_amount first_pmt second_pmt third_pmt
johndoe 550 9/1/16 11/1/16 NA
janedoe 995 12/15/16 NA NA
foo 375 1/5/17 3/5/17 5/5/17"
)
#> name pmt_amount first_pmt second_pmt third_pmt
#> 1 johndoe 550 9/1/16 11/1/16 <NA>
#> 2 janedoe 995 12/15/16 <NA> <NA>
#> 3 foo 375 1/5/17 3/5/17 5/5/17
** End of update**
I have a large dataset with payment information for different products. Some of these products have a pay-in-full option as well as a two-pay and three-pay option. I need to create fields that would be First_Payment, Second_Payment, and Third_Payment and would populate NA in the respective fields if there was only one or two payments.
I've tried a couple options and the best workaround I have thus far is this:
pmts %>%
  group_by(Email, Name, Amount, Form.Title) %>%
  summarise(First_Payment = min(Payment.Date),
            Second_Payment = median(Payment.Date),
            Last_Payment = max(Payment.Date)) -> pmts
This obviously is not ideal, as it makes up a payment date for the 2-pay plans, and I would have to instruct the end user to ignore this field and just look at the 1st and 3rd fields.
I also tried to summarise with partial sorts like this:
n <- length(pmts$Payment.Date)
sort(pmts$Payment.Date,partial=n-1)[n-1]
However, if there weren't three payments for the person, it would take the n-1 date from the entire data set and apply it to all other fields.
Ideally, if it was a pay-in-full, the First_Payment field would have the date and the 2nd/3rd fields would say NA. The 2-pay would have 1st and 2nd dates and the 3rd field would say NA. And finally the 3-pay would have all 3 dates.
The end users here are not super data savvy so I'm trying to make this as easy to interpret as possible. Any suggestions would be tremendously appreciated. Thank you!
Using data.table, this is a simple one-liner:
library(data.table) #v1.9.8+
dcast(setDT(pmts), name + pmt_amount ~ rowid(pmt_amount))
# Using 'pmt_date' as value column. Use 'value.var' to override
# name pmt_amount 1 2 3
# 1: foo 375 1/5/17 3/5/17 5/5/17
# 2: janedoe 995 12/15/16 NA NA
# 3: johndoe 550 9/1/16 11/1/16 NA
dcast() converts from long to wide and accepts expressions; rowid() just adds a row counter per pmt_amount.
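The "Using 'pmt_date' as value column" message can be silenced by naming the value column explicitly; the same call with value.var spelled out:

dcast(setDT(pmts), name + pmt_amount ~ rowid(pmt_amount), value.var = "pmt_date")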
You can use tidyr for this.
library(dplyr)
library(tidyr)
pmts <- tibble(
  name = c("johndoe", "johndoe", "janedoe", "foo", "foo", "foo"),
  pmt_amount = c(550L, 550L, 995L, 375L, 375L, 375L),
  pmt_date = lubridate::mdy(c("9/1/16", "11/1/16", "12/15/16", "1/5/17", "3/5/17", "5/5/17"))
)
pmts
#> # A tibble: 6 x 3
#> name pmt_amount pmt_date
#> <chr> <int> <date>
#> 1 johndoe 550 2016-09-01
#> 2 johndoe 550 2016-11-01
#> 3 janedoe 995 2016-12-15
#> 4 foo 375 2017-01-05
#> 5 foo 375 2017-03-05
#> 6 foo 375 2017-05-05
pmts_long <- pmts %>%
  group_by(name) %>%
  arrange(name, pmt_date) %>%
  mutate(pmt = row_number()) %>%
  ungroup() %>%
  complete(name, nesting(pmt)) %>%
  fill(pmt_amount, .direction = "down")
pmts_long
#> # A tibble: 9 x 4
#> name pmt pmt_amount pmt_date
#> <chr> <int> <int> <date>
#> 1 foo 1 375 2017-01-05
#> 2 foo 2 375 2017-03-05
#> 3 foo 3 375 2017-05-05
#> 4 janedoe 1 995 2016-12-15
#> 5 janedoe 2 995 NA
#> 6 janedoe 3 995 NA
#> 7 johndoe 1 550 2016-09-01
#> 8 johndoe 2 550 2016-11-01
#> 9 johndoe 3 550 NA
pmts_wide <- pmts_long %>%
  gather("key", "val", -name, -pmt_amount, -pmt) %>%
  unite(pmt_number, key, pmt) %>%
  spread(pmt_number, val)
pmts_wide
#> # A tibble: 3 x 5
#> name pmt_amount pmt_date_1 pmt_date_2 pmt_date_3
#> * <chr> <int> <date> <date> <date>
#> 1 foo 375 2017-01-05 2017-03-05 2017-05-05
#> 2 janedoe 995 2016-12-15 NA NA
#> 3 johndoe 550 2016-09-01 2016-11-01 NA
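With current tidyr, gather()/spread() are superseded; a sketch of the same result with pivot_wider(), assuming pmt_date is already a Date as in the tibble above:

pmts %>%
  group_by(name, pmt_amount) %>%
  arrange(pmt_date, .by_group = TRUE) %>%
  mutate(pmt = row_number()) %>%   # 1st, 2nd, 3rd payment per person
  ungroup() %>%
  pivot_wider(
    names_from = pmt,
    values_from = pmt_date,
    names_prefix = "pmt_date_"
  )

Missing payments are filled with NA by default, so one-, two-, and three-payment customers all line up in the same three date columns.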
