how to sum conditional functions to grouped rows in R

So I have the following data frame:
customerid  payment_month  payment_date  bill_month  charges
1           January        22            January     30
1           February       15            February    21
1           March          2             March       33
1           May            4             April       43
1           May            4             May         23
1           June           13            June        32
2           January        12            January     45
2           February       15            February    56
2           March          2             March       67
2           April          4             April       65
2           May            4             May         54
2           June           13            June        68
3           January        25            January     45
3           February       26            February    56
3           March          30            March       67
3           April          1             April       65
3           June           1             May         54
3           June           1             June        68
(The full data has many more ids.) I want to calculate payment efficiency using the following formula:
efficiency = (amount paid not late / total bill amount) * 100
"Not late" means paying no later than the 21st day of the bill's month (paying January's bill on the 22nd of January is considered late).
I want to calculate the efficiency of each customer, with the expected output of:
customerid  effectivity
1           59.90
2           100
3           37.46
I have tried the following code to calculate it for one id, and it works, but I want to apply it across all ids and summarise the result into one column (effectivity) with one row per id. I have tried group_by, aggregate and ifelse, but nothing works. What should I do?
# late payments for customer 1: paid in a different month, or after the 21st
df1 <- filter(df, (payment_month != bill_month & customerid == 1) |
                  (payment_month == bill_month & payment_date > 21 & customerid == 1))
# all payments for customer 1
df2 <- filter(df, customerid == 1)
x <- sum(df1$charges)  # total late charges
y <- sum(df2$charges)  # total charges
100 - (x / y) * 100
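For reference, a sketch that rebuilds the sample as a data frame (values transcribed from the table above), so the snippets below can be run:
df <- data.frame(
  customerid = rep(1:3, each = 6),
  payment_month = c("January", "February", "March", "May", "May", "June",
                    "January", "February", "March", "April", "May", "June",
                    "January", "February", "March", "April", "June", "June"),
  payment_date = c(22, 15, 2, 4, 4, 13, 12, 15, 2, 4, 4, 13, 25, 26, 30, 1, 1, 1),
  bill_month = rep(c("January", "February", "March", "April", "May", "June"), 3),
  charges = c(30, 21, 33, 43, 23, 32, 45, 56, 67, 65, 54, 68, 45, 56, 67, 65, 54, 68))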

An option using dplyr:
library(dplyr)
df %>%
  group_by(customerid) %>%
  summarise(
    effectivity = sum(charges[payment_date <= 21 & payment_month == bill_month]) /
      sum(charges) * 100,
    .groups = "drop")
## A tibble: 3 x 2
#  customerid effectivity
#       <int>       <dbl>
#1          1        59.9
#2          2       100
#3          3        37.5

The same idea with the month comparison made explicit:
df %>%
  group_by(customerid) %>%
  mutate(pay_month_number = match(payment_month, month.name),
         bill_month_number = match(bill_month, month.name),
         # not late: paid within the bill's month, no later than the 21st
         nolate = pay_month_number == bill_month_number & payment_date <= 21) %>%
  summarise(efficiency = sum(charges[nolate]) / sum(charges) * 100)
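For comparison, a base R sketch of the same per-customer calculation (assuming the df built from the sample above):
# TRUE where the bill was paid in its own month, no later than the 21st
ontime <- with(df, payment_month == bill_month & payment_date <= 21)
# on-time charges over total charges, per customer
round(100 * tapply(df$charges * ontime, df$customerid, sum) /
            tapply(df$charges, df$customerid, sum), 2)
#     1      2      3
# 59.89 100.00  37.46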

Related

how best to calculate this share of a total

Below is the sample data. The goal is first to create a column that contains the total employment for that quarter; second, to create a new column that shows the area's relative share; finally, the item vexing me is to determine whether the total with suppress = 0 represents over 50% of the quarter's total. I can do this easily in Excel, but I want to do it in R so I have something I can replicate year after year.
The desired result is below.
area <- rep(c("001","005","007","009","011","013","015","017","019","021","023","027","033"), times = 2)
year <- rep("2021", 26)
qtr  <- rep(c("01", "02"), each = 13)
employment <- c(2,4,6,8,11,10,12,14,16,18,20,22,30,3,5,8,9,12,9,24,44,33,298,21,26,45)
suppress <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0)
testitem <- data.frame(year, qtr, area, employment, suppress)
For the first quarter of 2021, the total is 173. The suppress = 1 rows account for only 24 of 173, hence the TRUE in the 50percent column. If those rows summed to 173/2 or more, it would say FALSE. For the second quarter, the suppress = 1 rows account for 310 of the 537 total, which is over 50%.
For the total column I am showing the computation or ingredients; ideally it would show a value such as .0115 in place of 2/173.
year  qtr  area  employment  suppress  total   50percent
2021  01   001   2           0         =2/173  TRUE
2021  01   005   4           0         =4/173  TRUE
.....
2021  02   001   3           0         =3/537  FALSE
2021  02   005   5           0         =5/537  FALSE
For example:
library(dplyr)
testitem %>%
  group_by(year, qtr) %>%
  mutate(
    total = employment / sum(employment),
    over_half = sum(employment[suppress == 0]) > (0.5 * sum(employment))
  )
Gives:
# A tibble: 26 × 7
# Groups: year, qtr [2]
year qtr area employment suppress total over_half
<chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
1 2021 01 001 2 0 0.0116 TRUE
2 2021 01 005 4 0 0.0231 TRUE
3 2021 01 007 6 0 0.0347 TRUE
4 2021 01 009 8 1 0.0462 TRUE
5 2021 01 011 11 0 0.0636 TRUE
6 2021 01 013 10 0 0.0578 TRUE
7 2021 01 015 12 0 0.0694 TRUE
8 2021 01 017 14 0 0.0809 TRUE
9 2021 01 019 16 1 0.0925 TRUE
10 2021 01 021 18 0 0.104 TRUE
# … with 16 more rows
# ℹ Use `print(n = ...)` to see more rows
I think you'll want to use group_by() and mutate() here.
library(dplyr)
testitem |>
  ## grouping by year and quarter;
  ## sums will be calculated over areas
  group_by(year, qtr) |>
  ## this could be more terse, but gets the job done
  mutate(total_sum = sum(employment),
         ## this uses the total_sum column that was just created
         total_prop = employment / total_sum,
         ## leveraging the 0/1 coding of suppress
         suppress_sum = sum(suppress * employment),
         suppress_prop = suppress_sum / total_sum,
         fifty = (1 - suppress_prop) > 0.5)
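If you only need one row per quarter rather than per-area proportions, a summarise() sketch of the same logic:
library(dplyr)
testitem %>%
  group_by(year, qtr) %>%
  summarise(total = sum(employment),
            suppressed = sum(employment[suppress == 1]),
            # TRUE when the unsuppressed employment is over half the total
            over_half = (total - suppressed) > 0.5 * total,
            .groups = "drop")
#  year  qtr   total suppressed over_half
#  2021  01      173         24 TRUE
#  2021  02      537        310 FALSE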

Creating a subset of a dataset based on multiple conditions in R

I want to extract the past 3 weeks' data for each household_id and channel combination. These past 3 weeks are determined by mala_fide_week and mala_fide_year: for each household_id and channel combination, I want the records that fall before that week.
Below is the dataset.
For example, Household_id 100, channel A: the mala_fide_week is 42 and mala_fide_year is 2021, so the past three records are those before week 42 of 2021, computed from the week and year columns.
For the Household_id 100 and channel B combination, there are only two records before mala_fide_week and mala_fide_year.
For Household_id 101 and channel C, two years are involved, 2019 and 2020.
The final dataset will be as below.
Household_id 102 is not considered, as its week and year are greater than its mala_fide_week and mala_fide_year.
I have tried multiple options but am not getting through. Any help is much appreciated!
sample dataset:
data <- data.frame(
  Household_id = c(100,100,100,100,100,100,101,101,101,101,102,102),
  channel = c("A","A","A","A","B","B","C","C","C","C","D","D"),
  duration = c(12,34,567,67,34,67,98,23,56,89,73,76),
  mala_fide_week = c(42,42,42,42,42,42,5,5,5,5,30,30),
  mala_fide_year = c(2021,2021,2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
  week = c(36,37,38,39,22,23,51,52,1,2,38,39),
  year = c(2021,2021,2021,2021,2020,2020,2019,2019,2020,2020,2021,2021))
I think you first need to obtain an absolute week number, week + 52 * year, then filter on it. slice_tail() gets the last three rows of each group.
library(dplyr)
data |>
  filter(week + 52 * year <= mala_fide_week + 52 * mala_fide_year) |>
  group_by(Household_id, channel) |>
  arrange(year, week, .by_group = TRUE) |>
  slice_tail(n = 3)
# A tibble: 8 x 7
# Groups: Household_id, channel [3]
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 A 34 42 2021 37 2021
2 100 A 567 42 2021 38 2021
3 100 A 67 42 2021 39 2021
4 100 B 34 42 2021 22 2020
5 100 B 67 42 2021 23 2020
6 101 C 23 5 2020 52 2019
7 101 C 56 5 2020 1 2020
8 101 C 89 5 2020 2 2020
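The same logic as a data.table sketch, if you prefer (note that week + 52 * year is an approximation that ignores 53-week years):
library(data.table)
# keep rows at or before the mala_fide week, then take the last 3 per group
setDT(data)[week + 52 * year <= mala_fide_week + 52 * mala_fide_year
            ][order(year, week), tail(.SD, 3), by = .(Household_id, channel)]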

arrange one below the other every 2 columns from data frame in R

Hi, I have a df as below which shows dates and their respective values:
date  1_val  date  2_val  . . . .  date  n_val
2014  23     2014  33     . . . .  2014  34
2015  22     2016  12     . . . .  2016  99
I have tried hard-coding to arrange the columns one below the other.
For columns 1 and 2:
library(reshape2)
a <- 1
b <- 2
names_2 <- df[, c(a, b)]
colnames(names_2)[1] <- "Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all), ]
names_2 <- melt(names_2, id = colnames(names_2)[1])
samp_out <- names_2
For columns 3 and 4:
a <- 3
b <- 4
names_2 <- df[, c(a, b)]
colnames(names_2)[1] <- "Date"
names_2 <- names_2[!apply(is.na(names_2) | names_2 == "", 1, all), ]
names_2 <- melt(names_2, id = colnames(names_2)[1])
samp_out1 <- names_2
and so on up to the nth pair:
df1= rbind(samp_out,samp_out1,......samp_out_n)
output
date variable value
2014 1_val 23
2015 1_val 22
2014 2_val 33
2016 2_val 12
.
.
2014 n_val 34
2016 n_val 99
Thanks in advance
The function melt in the package data.table does that:
library(data.table)
melt(df, id = "Date", measure = patterns("_val"))
You can specify the name of the variable to pivot on (Date in this case) and a pattern matching the variables whose values you want to keep. You can also supply a vector with all the variable names instead.
> DT <- data.table(Date = c(2014,2013), `1_val` = c(33, 32), Date = c(2014, 2013), `2_val` = c(65, 34))
> DT
Date 1_val Date 2_val
1: 2014 33 2014 65
2: 2013 32 2013 34
> melt(DT, id = "Date", measure = patterns("_val"))
Date variable value
1: 2014 1_val 33
2: 2013 1_val 32
3: 2014 2_val 65
4: 2013 2_val 34
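If each value column should keep its own date column (rather than using the first Date as the single id), melt also accepts one pattern per output column; a sketch on the same DT:
> melt(DT, measure = patterns("^Date", "_val"), value.name = c("date", "value"))
   variable date value
1:        1 2014    33
2:        1 2013    32
3:        2 2014    65
4:        2 2013    34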
You can use stack from base R:
setNames(data.frame(stack(df[c(TRUE, FALSE)])[1],
                    stack(df[c(FALSE, TRUE)])),
         c('date', 'value', 'variable'))
#  date value variable
#1 2014    33    1_val
#2 2013    32    1_val
#3 2014    65    2_val
#4 2013    34    2_val
Define the untidy rectangle
library(magrittr)
csv <- "date,1_val,date,2_val,date,3_val
2014,23,2014,33,2014,34
2015,22,2016,12,2016,99"
Read into a data frame, then transform into a long/eav rectangle.
ds_eav <- csv %>%
  readr::read_csv() %>%
  tibble::rownames_to_column(var = "height") %>%
  tidyr::gather(key = key, value = value, -height)
output:
# A tibble: 12 x 4
key index value height
<chr> <int> <int> <int>
1 date 1 2014 1
2 date 1 2015 2
3 value 1 23 1
4 value 1 22 2
5 date 2 2014 1
6 date 2 2016 2
7 value 2 33 1
8 value 2 12 2
9 date 3 2014 1
10 date 3 2016 2
11 value 3 34 1
12 value 3 99 2
Identify which rows are dates/values. Then shift up dates' index by 1.
ds_eav <- ds_eav %>%
  dplyr::mutate(
    index_val  = sub("^(\\d+)_val$" , "\\1", key),
    index_date = sub("^date_(\\d+)$", "\\1", key),
    index_date = dplyr::if_else(key == "date", "0", index_date),
    key        = dplyr::if_else(grepl("^date(_\\d+)*", key), "date", "value"),
    index      = dplyr::if_else(key == "date", index_date, index_val),
    index      = as.integer(index),
    index      = index + dplyr::if_else(key == "date", 1L, 0L)
  ) %>%
  dplyr::select(key, index, value, height)
Follow the advice of @jarko-dubbeldam and use spread/gather on the last step too:
ds_eav %>%
  tidyr::spread(key = key, value = value)
output:
# A tibble: 6 x 4
index height date value
* <int> <int> <int> <int>
1 1 1 2014 23
2 1 2 2015 22
3 2 1 2014 33
4 2 2 2016 12
5 3 1 2014 34
6 3 2 2016 99
You can use paste0(index, "_val") to get your exact output. But I'd prefer to keep them as integers, so you can do math on them if necessary (eg, max()).
edit 1: incorporate the advice & corrections of @jarko-dubbeldam and @hnskd.
edit 2: use rownames_to_column() in case the input isn't a balanced rectangle (eg, one column doesn't have all the rows).

Aggregation on 2 columns while keeping two unique in R

So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want the Staff ID and Date to be unique in each row, while summing 'Days' and taking the mean of 'Result'.
I can't work out how to do this in R; I'm sure I need several aggregations, but I keep getting results different from what I'm aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1, 1, 1, 2, 2),
                 Result = c(50, 75, 60, 20, 11),
                 Date = c(2007, 2006, 2007, 2009, 2009),
                 Days = c(4, 5, 3, 3, 2))
df %>%
  group_by(Staff, Date) %>%
  summarise(Result = floor(mean(Result)),
            Days = sum(Days)) %>%
  data.frame
  Staff Date Result Days
1     1 2006     75    5
2     1 2007     55    7
3     2 2009     15    5
You can aggregate on two variables by using a formula and then merge the two aggregates:
merge(aggregate(Result ~ Staff + Date, data = df, mean),
      aggregate(Days ~ Staff + Date, data = df, sum))
  Staff Date Result Days
1     1 2006   75.0    5
2     1 2007   55.0    7
3     2 2009   15.5    5
Here is another option with data.table:
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
#   Staff Date Result Days
#1:     1 2007     55    7
#2:     1 2006     75    5
#3:     2 2009     15    5

Percentile for multiple groups of values in R

I'm using R for my data analysis.
I'm looking for the code to achieve the output shown below.
I need a single piece of code to do this, as I have over 500 groups and 24 months in my actual data; the sample below has only 2 groups and 2 months.
This is a sample of my data:
Date Group Value
1-Jan-16 A 10
2-Jan-16 A 12
3-Jan-16 A 17
4-Jan-16 A 20
5-Jan-16 A 12
5-Jan-16 B 56
1-Jan-16 B 78
15-Jan-16 B 97
20-Jan-16 B 77
21-Jan-16 B 86
2-Feb-16 A 91
2-Feb-16 A 44
3-Feb-16 A 93
4-Feb-16 A 87
5-Feb-16 A 52
5-Feb-16 B 68
1-Feb-16 B 45
15-Feb-16 B 100
20-Feb-16 B 81
21-Feb-16 B 74
And this is the output I'm looking for.
Month  Year  Group  Minimum Value  5th Percentile  10th Percentile  50th Percentile  90th Percentile  Max Value
Jan    2016  A
Jan    2016  B
Feb    2016  A
Feb    2016  B
Considering dft as your input, you can try:
library(dplyr)
library(lubridate)  # for month() and year()
dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y")) %>%
  mutate(mon = month(Date),
         yr = year(Date)) %>%
  group_by(mon, yr, Group) %>%
  mutate(minimum = min(Value),
         maximum = max(Value),
         q95 = quantile(Value, 0.95)) %>%
  select(minimum, maximum, q95) %>%
  unique()
which gives:
    mon    yr Group minimum maximum   q95
  <int> <int> <chr>   <int>   <int> <dbl>
1     1  2016     A      10      20  19.4
2     1  2016     B      56      97  94.8
3     2  2016     A      44      93  92.6
4     2  2016     B      45     100  96.2
and add more variables as per your need.
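To get every percentile the question asks for in one pass, a summarise() sketch along the same lines (one row per month/group instead of mutate() + unique()):
library(dplyr)
library(lubridate)
dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y"),
         mon = month(Date, label = TRUE),
         yr = year(Date)) %>%
  group_by(mon, yr, Group) %>%
  summarise(minimum = min(Value),
            p05 = quantile(Value, 0.05),
            p10 = quantile(Value, 0.10),
            p50 = quantile(Value, 0.50),
            p90 = quantile(Value, 0.90),
            maximum = max(Value),
            .groups = "drop")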
