Calculating the delta between multiple variables grouped by user ids - r

How might I calculate the delta between multiple variables grouped by user ids in a "long" data frame?
Data format:
d1 <- data.frame(
id = rep(c(1, 2, 3, 4, 5), each = 2),
purchased = c(rep(c(T, F), 3), F, T, T, F),
product = rep(c("A", "B"), 5),
grade = c(1, 2, 1, 2, 2, 3, 7, 5, 1, 2),
rate = c(10, 12, 10, 12, 12, 14, 22, 18, 10, 12),
fee = rep(c(1, 2), 5))
This is my roundabout solution:
dA <- d1 %>%
filter(product == "A")
dB <- d1 %>%
filter(product == "B")
d2 <- inner_join(dA, dB, by = "id", suffix = c(".A", ".B"))
d3 <- d2 %>%
mutate(
purchased = if_else(purchased.A == T, "A", "B"),
dGrade = grade.B - grade.A,
dRate = rate.B - rate.A,
dFee = fee.B - fee.A) %>%
select(id, purchased:dFee)
All of this just seems terribly inefficient and complex. Is tidyr::spread or another dplyr/tidyr function appropriate here? (I couldn't get anything else to work)...

We can do this with gather/spread: reshape the data from 'wide' to 'long' with gather, group by 'id' and 'Var', take the 'product' flagged by the logical column 'purchased', compute the difference of 'Val' between product 'B' and product 'A', and spread the result back from 'long' to 'wide'.
library(dplyr)
library(tidyr)
gather(d1, Var, Val, grade:fee) %>%
group_by(id, Var) %>%
summarise(purchased = product[purchased],
Val = Val[product == 'B'] - Val[product == 'A'])%>%
spread(Var, Val)
# id purchased fee grade rate
# <dbl> <fctr> <dbl> <dbl> <dbl>
#1 1 A 1 1 2
#2 2 A 1 1 2
#3 3 A 1 1 2
#4 4 B 1 -2 -4
#5 5 A 1 1 2
The OP's output ('d3') is
d3
# id purchased dGrade dRate dFee
#1 1 A 1 2 1
#2 2 A 1 2 1
#3 3 A 1 2 1
#4 4 B -2 -4 1
#5 5 A 1 2 1
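In current tidyr, gather and spread are superseded, and for a fixed pair of products the reshape can be skipped altogether. The following is a minimal sketch rather than the answerer's method. It assumes every id has exactly one "A" row, one "B" row, and exactly one purchased row, and it reuses the OP's output names (dGrade, dRate, dFee).

```r
library(dplyr)

d1 <- data.frame(
  id = rep(1:5, each = 2),
  purchased = c(rep(c(TRUE, FALSE), 3), FALSE, TRUE, TRUE, FALSE),
  product = rep(c("A", "B"), 5),
  grade = c(1, 2, 1, 2, 2, 3, 7, 5, 1, 2),
  rate = c(10, 12, 10, 12, 12, 14, 22, 18, 10, 12),
  fee = rep(c(1, 2), 5),
  stringsAsFactors = FALSE
)

# One row per id: the purchased product plus the B-minus-A deltas.
# Assumption: each id has exactly one "A" row and one "B" row.
deltas <- d1 %>%
  group_by(id) %>%
  summarise(
    purchased = product[purchased],
    dGrade = grade[product == "B"] - grade[product == "A"],
    dRate = rate[product == "B"] - rate[product == "A"],
    dFee = fee[product == "B"] - fee[product == "A"]
  )

deltas
```

If an id had more than one row per product, the subsetting would return several values, so this only fits data shaped like the example.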

Related

Sum While melting columns in R

Is there a way to melt two columns and take their sums as the value? For example:
df <- data.frame(A = c("x", "y", "z"), B = c(1, 2, 3), Cat1 = c(1, 4, 3), New2 = c(4, 4, 4))
Expected output
New_Col Sum
Cat1 8
New2 12
Or using base R with colSums after selecting the columns of interest and then convert the named vector to data.frame with stack
stack(colSums(df[c("Cat1", "New2")]))[2:1]
ind values
1 Cat1 8
2 New2 12
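With dplyr and tidyr, select both columns by name (starts_with('Cat') alone would miss New2):
library(dplyr)
library(tidyr)
df %>%
  summarise(across(c(Cat1, New2), sum)) %>%
  pivot_longer(everything(), names_to = 'New_Col', values_to = 'Sum')
# A tibble: 2 × 2
New_Col Sum
<chr> <dbl>
1 Cat1 8
2 New2 12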

Aggregating / rolling up specific records "into" subsequent records

I am trying to aggregate records with a specific type into subsequent records.
I have a dataset similar to the following:
df_initial <- data.frame("Id" = c(1, 2, 3, 4, 5),
"Qty" = c(105, 110, 100, 115, 120),
"Type" = c("A", "B", "B", "A", "A"),
"Difference" = c(30, 34, 32, 30, 34))
After sorting on the Id field, I'd like to aggregate records of Type = "B" into the next record of type = "A".
In other words, I'm looking to create df_new, which adds the Qty and Difference values for Ids 2 and 3 into the Qty and Difference values for Id 4, and flags Id 4 as being adjusted (in the field AdjustedFlag).
df_new <- data.frame("Id" = c(1, 4, 5),
"Qty" = c(105, 325, 120),
"Type" = c("A", "A", "A"),
"Difference" = c(30, 96, 34),
"AdjustedFlag" = c(0, 1, 0))
I'd greatly appreciate any advice or ideas about how to do this in R, preferably using data.table.
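A data.table solution (df_initial is a plain data frame, so convert it with setDT first):
library(data.table)
setDT(df_initial)
# Each run of "B" rows shares a group with the next "A" row
df_initial[, .(
  Id = Id[.N], Qty = sum(Qty),
  Difference = sum(Difference),
  AdjustedFlag = +(.N > 1)
), by = .(grp = rev(cumsum(rev(Type == "A"))))
][, grp := NULL][]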
# Id Qty Difference AdjustedFlag
# <num> <num> <num> <int>
# 1: 1 105 30 0
# 2: 4 325 96 1
# 3: 5 120 34 0
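The by expression in that answer is compact, so here is a small base R sketch of how the grouping vector is built for the example's Type column. The variable names (is_a, from_end, grp, qty_by_grp) are only illustrative.

```r
type <- c("A", "B", "B", "A", "A")
qty <- c(105, 110, 100, 115, 120)

is_a <- type == "A"            # TRUE FALSE FALSE TRUE TRUE
from_end <- cumsum(rev(is_a))  # number of "A" rows seen, scanning from the last row
grp <- rev(from_end)           # back in original row order: 3 2 2 2 1

# Rows 2, 3 and 4 share group 2, so the two "B" rows are summed into Id 4.
qty_by_grp <- tapply(qty, grp, sum)
qty_by_grp
```

Counting from the end is what attaches each "B" row to the following "A" row; a plain cumsum(is_a) would attach it to the preceding one.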
This can be solved by creating a new grouping variable that puts each run of "B" rows in the same group as the next "A" row, and then aggregating by that variable.
Instead of having
A B B A A
that new grouping variable should look something like this:
1 2 2 2 3
This is not a data.table solution, but the same logic could be applied there:
library(tidyverse)
df_initial |>
mutate(
type2 = ifelse(Type == "A", as.numeric(factor(Type)), 0),
type2 = cumsum(type2),
type2 = ifelse(Type == "B", NA, type2)
) |>
fill(type2, .direction = "up") |>
group_by(type2) |>
summarise(
id = max(Id),
Qty = sum(Qty),
Difference = sum(Difference),
AdjustedFlag = as.numeric(n() > 1)
)
#> # A tibble: 3 × 5
#> type2 id Qty Difference AdjustedFlag
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 105 30 0
#> 2 2 4 325 96 1
#> 3 3 5 120 34 0
Using tidyverse
df_initial %>%
mutate(gn = if_else(lag(Type, default = 'A') == 'B' | Type == 'B', 'B', Type),
gr = cumsum(lag(gn, default = 'A') != gn),
adjusted = if_else(lag(Type, default = 'A') == 'B' | Type == 'B', 1, 0)) %>%
group_by(gr) %>%
summarise(Id = last(Id),
Qty = sum(Qty),
Type = 'A',
Difference = sum(Difference),
Adjusted_flg = max(adjusted)) %>% ungroup()
Here we create an interim dataset that looks like:
  Id Qty Type Difference gn gr adjusted
1  1 105    A         30  A  0        0
2  2 110    B         34  B  1        1
3  3 100    B         32  B  1        1
4  4 115    A         30  B  1        1
5  5 120    A         34  A  2        0
We then use this to build the final table inside summarise. The gr column identifies each group of rows, which is why we group_by it.

Selecting cases based on 2 variables

I am sorry if it seems like a foolish question but I want to ask how to select cases that have the same id and index
This is an example of my dataframe:
df1<-structure(list(id = c(10, 10, 10, 11, 11, 11), pnum = c(1,
2, 3, 1, 2, 3), index = c(1, 2, 2, 1, 1, 1)), class = "data.frame", row.names = c(NA,
-6L))
Also, if an id and index have the same values across all pnums:
df2<-structure(list(id = c(10, 10, 10, 11, 11, 11), pnum = c(1,
2, 3, 1, 2, 3), index = c(1, 1, 2, 2, 2, 2)), class = "data.frame", row.names = c(NA,
-6L))
I need to select cases that have the same id and index
End table should be this:
for df1
id pnum index
11 1 1
11 2 1
11 3 1
Also when id and index belong to the same group:
df2 outcome
id pnum index
10 1 2
10 2 2
10 3 2
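We can use subset from base R. Note that id == index compares the two values literally, so on df1 as posted (ids 10 and 11, index 1 or 2) it matches no rows:
subset(df1, id == index)
# [1] id    pnum  index
# <0 rows> (or 0-length row.names)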
Or with filter
library(dplyr)
df1 %>%
filter(id == index)
For the second case, maybe we can use
df2 %>%
group_by(id) %>%
filter(n_distinct(index) > 1) %>%
mutate(index = 2)
We can select the ids that have only one unique index value.
library(data.table)
setDT(df1)[, .SD[uniqueN(index) == 1], id]
# id pnum index
#1: 11 1 1
#2: 11 2 1
#3: 11 3 1
For df2 this returns:
setDT(df2)[, .SD[uniqueN(index) == 1], id]
# id pnum index
#1: 11 1 2
#2: 11 2 2
#3: 11 3 2
We can translate this to dplyr as:
df1 %>% group_by(id) %>% filter(n_distinct(index) == 1)
and in base R:
subset(df1, ave(index, id, FUN = function(x) length(unique(x))) == 1)
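To see why the base R version works, here is a short sketch of what ave returns for df1 before the comparison with 1. The names n_index and kept are illustrative only.

```r
df1 <- data.frame(
  id = c(10, 10, 10, 11, 11, 11),
  pnum = c(1, 2, 3, 1, 2, 3),
  index = c(1, 2, 2, 1, 1, 1)
)

# ave() applies FUN within each id and recycles the result to every row of that id,
# so each row carries the number of distinct index values of its id.
n_index <- ave(df1$index, df1$id, FUN = function(x) length(unique(x)))
n_index  # 2 2 2 1 1 1

# Keep only rows whose id has a single distinct index value.
kept <- df1[n_index == 1, ]
kept
```

Because ave keeps one value per original row, the result can be used directly as a row filter without a join.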

Transpose dataset and counting occurrences

The original dataset contains survey data in wide form (one column per question)
Original dataset
T Q1 Q2 Q3
M1 3 5 4
M1 3 1 3
M1 1 3 1
M2 4 4 2
M2 2 2 3
M2 5 5 5
Where T is the type of respondents and Q1--Q3 are the questions, and the cell value corresponds to their agreement level on a 1--5 Likert scale.
Wanted dataset
T Q A1 A2 A3 A4 A5
M1 Q1 1 0 2 0 0
M2 Q1 0 1 0 1 1
M1 Q2 1 0 1 0 1
M2 Q2 0 1 0 1 1
M1 Q3 1 0 1 1 0
M2 Q3 0 1 1 0 1
Where A1--A5 are the possible answers (1--5 Likert) and the cell value contains the frequency of these answers for each group M1 and M2.
How to get from the Original dataset to the Wanted dataset?
One way would be to use dplyr and tidyr:
library(dplyr)
library(tidyr)
df <- data.frame(Type = c('M1', 'M1', 'M1', 'M2', 'M2', 'M2'),
Q1 = c(3, 3, 1, 4, 2, 5),
Q2 = c(5, 1, 3, 4, 2, 5),
Q3 = c(4, 3, 1, 2, 3, 5))
df %>%
gather(key = 'Q', value = 'A', -Type) %>%
group_by(Type, Q, A) %>%
summarize(Count = n()) %>%
mutate(A = paste0('A', A)) %>%
spread(key = A, value = Count, fill = 0) %>%
arrange(Q, Type)
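The same count-then-widen idea can be written with the newer tidyr verbs. This is a sketch under the assumption of a reasonably recent tidyr (one where pivot_wider accepts names_sort); the result name wide is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Type = c("M1", "M1", "M1", "M2", "M2", "M2"),
  Q1 = c(3, 3, 1, 4, 2, 5),
  Q2 = c(5, 1, 3, 4, 2, 5),
  Q3 = c(4, 3, 1, 2, 3, 5)
)

wide <- df %>%
  pivot_longer(-Type, names_to = "Q", values_to = "A") %>%  # one row per answer
  count(Type, Q, A) %>%                                     # frequency of each answer
  pivot_wider(
    names_from = A, values_from = n,
    names_prefix = "A",   # columns become A1 ... A5
    names_sort = TRUE,    # order the new columns by answer value
    values_fill = 0       # answers that never occur get a count of 0
  ) %>%
  arrange(Q, Type)

wide
```

For M1 and Q1 the answers are 3, 3, 1, so that row has A1 = 1 and A3 = 2.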
I used tidyverse functions to solve your problem. Notice that I had to create row identifiers, because gather and spread are not always symmetric.
library(tidyverse)
# Data
x <- data.frame(
T = c("M1", "M1", "M1", "M2", "M2", "M2"),
Q1 = c(3, 3, 1, 4, 2, 5),
Q2 = c(5, 1, 3, 4, 2, 5),
Q3 = c(4, 3, 1, 2, 3, 5)
)
# Modification
gather(x, key, A, -T) %>%
group_by(T, key, A) %>%
mutate(row_id = 1:n()) %>%
ungroup() %>%
spread(A, A, fill = 0, sep = "") %>%
select(-row_id)

merge or mutate a summary (dplyr)

I am always unsure how to retrieve a summary with dplyr.
Let us suppose I have a summary of individuals and households.
dta = rbind(c(1, 1, 45),
c(1, 2, 47),
c(2, 1, 24),
c(2, 2, 26),
c(3, 1, 67),
c(4, 1, 20),
c(4, 2, 21),
c(4, 3, 7)
)
dta = as.data.frame(dta)
colnames(dta) = c('householdid', 'id', 'age')
householdid id age
1 1 45
1 2 47
2 1 24
2 2 26
3 1 67
4 1 20
4 2 21
4 3 7
Imagine I want to calculate the number of persons in each household and the mean age per household, and then reuse this information in the original dataset.
dta %>%
group_by(householdid) %>%
summarise( nhouse = n(), meanAgeHouse = mean(age) ) %>%
merge(., dta, all = T)
I often use merge, but it is sometimes slow when the dataset is huge.
Is it possible to use mutate instead of merge?
dta %>% group_by(householdid) %>% mutate( nhouse = n(), meanAgeHouse = mean(age) )
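A grouped mutate computes each summary per household and repeats it on every row of that household, so no join is needed. The sketch below uses the values from the printed table in the question; the result name with_mutate is illustrative.

```r
library(dplyr)

dta <- data.frame(
  householdid = c(1, 1, 2, 2, 3, 4, 4, 4),
  id = c(1, 2, 1, 2, 1, 1, 2, 3),
  age = c(45, 47, 24, 26, 67, 20, 21, 7)
)

with_mutate <- dta %>%
  group_by(householdid) %>%
  mutate(
    nhouse = n(),             # household size, repeated on each member's row
    meanAgeHouse = mean(age)  # household mean age, repeated likewise
  ) %>%
  ungroup()                   # drop the grouping so later steps are not per household

with_mutate
```

Unlike the summarise-then-merge route, the rows keep their original order and count.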
