Count unique total for combined multiple values - r

I have a dataset that records the products associated with certain accounts. I want to summarise the total number of accounts for a specific set of products, only counting each account number once, no matter how many products they have. So the total for this sample would be 4. (a + b + c + d)
Account | Product
--------+--------
a       | 1
a       | 2
b       | 1
c       | 1
c       | 2
d       | 3
The code I have tried so far is
filter(Product == 1 | 2 | 3) %>%
summarise(total = n_distinct(), .groups = Account)
This gives message Error in summarise_verbose(.groups, caller_env()) :
object 'Account' not found
I also tried
filter(Product == 1 | 2 | 3) %>%
summarise(total = n_distinct(Account))
But this doesn't reduce the number of rows properly - I'm still getting 300,000 rows when I should get 70,000 based on other data I have. Is there a way of counting the (alphanumeric) account numbers once and once only, no matter what the products are?

In the absence of a minimal example of your data, I suppose you want to count the distinct elements by group after filtering:
Data %>%
  filter(Product %in% c(1, 2, 3)) %>%
  group_by(Account) %>%
  summarise(
    total = n_distinct(Product)
  )

You were close
df %>%
  group_by(Account) %>%
  summarise(
    total = n_distinct(Product)
  )
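If the goal is a single overall total (4 for the sample above) rather than a per-account figure, a minimal sketch, reusing the Data placeholder from the answer above, is to count distinct accounts without grouping:

library(dplyr)

Data %>%
  filter(Product %in% c(1, 2, 3)) %>%
  summarise(total = n_distinct(Account))
# total is 4 for the sample data

If the pipeline still returns one row per group, the data frame is most likely still grouped from an earlier group_by(); adding ungroup() before the summarise() is the usual fix.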

Related

Transition probabilities for entire table

I have df with the following structure:
sid step1 step2 step3 ... step30
The sid is an id and the steps are pages visited on a website, where:
sids have a minimum of two steps
sids have a maximum of thirty steps
there are no duplicate sequential pages (i.e. no page refreshes)
the steps are all string types
I essentially want to create an overall transition table: for every unique page, a row giving the transition probability to every other possible page.
I have around ~3k unique pages so I don't know if this will be computationally feasible.
I would also be okay with passing a few pages as an argument for the matrix, so it's not a 3000x3000 matrix but maybe a 1x3000 or 5x3000. In fact, I would prefer to start with this and scale up until it crashes.
Starting with the concept
To build a transition matrix, it is often easy to first build a matrix of counts. The counts can then be divided to produce transition probabilities.
To produce something like:
            | to_site_A | to_site_B | ...
------------+-----------+-----------+-----
from_site_A |
from_site_B |  counts
from_site_C |
...
It might be simpler to first produce:
from   | to     | count
-------+--------+-------
site_A | site_B |
site_A | site_C |
...
This is the same information, just arranged differently.
And to do this, it is probably easier if you rearrange your current data into a structure like this:
from   | to
-------+-------
site_A | site_B
site_A | site_C
...
So
Step 1: get data into long-thin structure of transitions
Step 2: count all pairwise transitions
Step 3: pivot or rearrange counts into square matrix
Step 1, rearrange data to long thin
You probably want something like this:
df_from_1_to_2 = df %>%
select(from = step1, to = step2) %>%
filter(!is.na(to))
df_from_2_to_3 = df %>%
select(from = step2, to = step3) %>%
filter(!is.na(to))
...
df_from_29_to_30 = df %>%
select(from = step29, to = step30) %>%
filter(!is.na(to))
long_list = rbind(df_from_1_to_2,
                  df_from_2_to_3,
                  ...
                  df_from_29_to_30)
No, this is not the most efficient way to approach this (in code or memory management), but we shall focus on the approach.
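As an aside (not part of the original answer), a sketch of a more compact way to do Step 1, assuming the columns are literally named step1 through step30 and unused steps are NA, is to reshape once with pivot_longer from tidyr:

library(dplyr)
library(tidyr)

long_list = df %>%
  pivot_longer(starts_with("step"),                 # step1 ... step30 into one column
               names_to = "step", values_to = "page") %>%
  mutate(step = as.integer(sub("step", "", step))) %>%
  arrange(sid, step) %>%                            # keep pages in visit order
  group_by(sid) %>%
  mutate(from = page, to = lead(page)) %>%          # pair each page with the next one
  ungroup() %>%
  filter(!is.na(from), !is.na(to)) %>%
  select(from, to)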
Step 2, count all pairwise transitions
This is now straightforward:
pairwise_count = long_list %>%
group_by(from, to) %>%
summarise(count = n())
Step 3, pivot or rearrange counts into square matrix
This step is just changing how the data is presented, and may not even be necessary depending on your application.
For rearranging this type of data, I suggest pivot_wider from the tidyr package:
count_matrix = pivot_wider(
  data = pairwise_count,
  names_from = to,
  names_prefix = "to_",   # so the columns read to_site_A, to_site_B, ...
  values_from = count
)
Edit: getting probabilities instead of counts
There are multiple points at which you could swap from counts to probabilities, one place to do it would be during step 2:
pairwise_count = long_list %>%
group_by(from, to) %>%
summarise(count = n())
pairwise_prob = pairwise_count %>%
group_by(from) %>%
mutate(from_count = sum(count)) %>%
mutate(prob = count / from_count) %>%
select(from, to, prob)
You can then use pairwise_prob in step 3 rather than pairwise_count.
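As a small follow-on sketch (not in the original answer), pairwise_prob can be spread into a base R matrix, and each row should then sum to 1:

prob_wide = pairwise_prob %>%
  ungroup() %>%
  pivot_wider(names_from = to,
              names_prefix = "to_",
              values_from = prob,
              values_fill = 0)             # transitions never observed get probability 0

prob_matrix = as.matrix(prob_wide[, -1])   # drop the 'from' column
rownames(prob_matrix) = prob_wide$from
rowSums(prob_matrix)                       # each row should be (approximately) 1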

R Dplyr sub-setting

I need to calculate min, max and mean by customer after sub-setting the population for primary contacts. To do this, I need to drop observations within a customer group if contact == relation and amount < 25. But the tricky part is: if contact == relation and amount == amount, I need to keep both observations regardless of the amount (this accounts for ties, where we cannot define the primary contact).
If contact == relation, one can think of this as a household.
Each customer can be comprised of multiple households, so I've included contacts with NULL relationship values.
Sample Data
customer <- c(1,1,1,1,2,2,2,3,3,3,3)
contact <- c(1234,2345,3456,4567,5678,6789,7890,8901,9012,1236,2346)
relationship <- c(2345,1234,"","",6789,5678,"",9012,8901,2346,1236)
amount <- c(26,22,40,12,15,15,70,35,15,25,25)
score <- c(500,300,700,600,400,600,700,650,300,600,700)
creditinfoaggtestdata1 <- data.frame(customer,contact,relationship,amount,score)
Expected Outcome
As a point of reference, if I do not drop the appropriate contacts prior to calculating min, max and mean, by customer, I get an output table as follows:
I assume the requirement "contact = relation and amount = amount" means across different rows within the same customer group. Here's a dplyr solution:
# Create a contact-relationship id where direction doesn't matter
df <- creditinfoaggtestdata1 %>%
rowwise() %>%
mutate(id = paste0(min(contact, relationship), max(contact, relationship)))
# Filter new ID's where duplicates in amounts exist per customer group
dups <- df %>%
group_by(customer, id, amount) %>%
summarise(count = n()) %>%
filter(count > 1) %>%
ungroup() %>%
select(customer, id)
# Use an inner join to select only the contact-relationship combinations from above
a <- df %>%
filter(amount < 25) %>%
inner_join(dups, by=c("customer", "id"))
# Combine with >= 25 data
b <- df %>%
filter(amount >= 25)
c <- rbind(a, b)
c %>%
group_by(customer) %>%
summarise(min_score = min(score), max_score = max(score), avg_score = mean(score))
Output:
customer min_score max_score avg_score
<dbl> <dbl> <dbl> <dbl>
1 1 500 700 600
2 2 400 700 567.
3 3 600 700 650
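A more compact sketch of the same idea (checked only against this sample data): build the household id once, then keep a row when its amount is at least 25 or when it ties with another row in the same household:

library(dplyr)

creditinfoaggtestdata1 %>%
  rowwise() %>%
  mutate(id = paste0(min(contact, relationship), max(contact, relationship))) %>%
  group_by(customer, id) %>%
  filter(amount >= 25 | n_distinct(amount) < n()) %>%   # ties are kept regardless of amount
  group_by(customer) %>%
  summarise(min_score = min(score), max_score = max(score), avg_score = mean(score))

This gives the same summary table as above.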

Is there a better solution? Receiving "longer object length is not a multiple of shorter object length"

This is my example.
user_id <- sample(seq(1,100),5000, TRUE)
friend_id <- sample(seq(1,100),5000, TRUE)
friends <- data.frame(user_id, friend_id)
friends <- friends %>%
filter(!user_id == friend_id)
friends <- friends %>% arrange(user_id) %>% distinct()
user_id <- sample(seq(1,100),10000, TRUE)
page_id <- sample(seq(1000,2000),10000, TRUE)
pages <- data.frame(user_id, page_id)
pages <- arrange(pages, user_id) %>% distinct()
popular <- friends %>%
left_join(pages, by = c("friend_id" = "user_id")) %>%
group_by(user_id, page_id) %>%
summarize(likes = n()) %>%
arrange(-likes) %>%
filter(!page_id %in% pages[pages$user_id == user_id,]$page_id)
My goal is to count the number of likes for each of the pages that a user's friend has liked. The last step is giving me this warning:
50: In pages$user_id == user_id : longer object length is not a
multiple of shorter object length
My goal in the last step is to filter out any page that the user has liked.
1) If I group by a column and then apply filter, will it apply to each of the grouped data frames separately? In other words, is it like having a for loop that says for (group in tbl) apply filter?
2) Will user_id give me the user_id according to each group? I guess this is an extension of 1.
3) I think it gives me the warning since pages$user_id is long and user_id is just one value. Is there a better solution or a more appropriate solution?
Is this what you are looking for:
pages_agg <- pages %>%
group_by(user_id) %>%
summarise(likes = n())
left_join(friends, pages_agg, by = c("friend_id" = "user_id")) %>%
head()
user_id friend_id likes
1 1 44 107
2 1 76 90
3 1 36 116
4 1 4 110
5 1 57 93
6 1 32 96
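On question 3: inside the grouped filter(), user_id evaluates to all of the group's values while pages$user_id is a vector of thousands of values, so the == comparison recycles the shorter vector and triggers the length warning. A sketch of the original goal (count how many of a user's friends liked each page, excluding pages the user already liked) that avoids that comparison by doing the exclusion with an anti_join:

library(dplyr)

popular <- friends %>%
  left_join(pages, by = c("friend_id" = "user_id")) %>%   # pages liked by each friend
  filter(!is.na(page_id)) %>%                             # drop friends with no liked pages
  count(user_id, page_id, name = "likes") %>%             # likes per (user, page) pair
  anti_join(pages, by = c("user_id", "page_id")) %>%      # drop pages the user already liked
  arrange(user_id, desc(likes))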

How to calculate weighted sums of rows based on value in another column

I searched around a lot trying to find an answer for this. It seems like a relatively simple and common question, and I'm surprised I didn't find an answer, but perhaps I am just not searching for the correct keywords.
I would like to calculate a weighted sum of some columns in three rows based on a value in another column. I think it makes more sense if you look at the dummy table below.
INDIVIDUAL <- c("A","A","A","A","A","A","B","B","B","B","B","B")
BEHAVIOR <- c("Smell", "Dig", "Eat", "Smell", "Dig", "Eat","Smell", "Dig", "Eat","Smell", "Dig", "Eat")
FOOD <- c("a", "a", "a","b","b","b", "a", "a", "a","b","b","b")
TIME <- c(2,4,7,6,1,2,9,0,4,3,7,6)
sample <- data.frame(Individual=INDIVIDUAL, Behavior=BEHAVIOR, Food=FOOD, Time=TIME)
Each individual spends a certain amount of time Smelling, Digging, and Eating each food item. I would like to weight and sum these three times to have one overall time per food item. Smelling is the lowest weight, eating is the highest. So basically I want a time interacting with each food item: Time per FoodA = (EatA) + (0.5*DigA) + (0.33*SmellA).
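For example, for individual A and food a in the dummy data above, that works out to 7 + 0.5*4 + 0.33*2 = 9.66.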
After extensive web browsing the best idea I could come up with was this:
sample %>%
group_by(Individual, Food) %>%
mutate(TIME = ((fullsum$BEHAVIOR == "EAT")
+(.5*(fullsum$BEHAVIOR == "DIG")
+(.33*(fullsum$BEHAVIOR == "SMELL")))))
But it doesn't work and I get this error: Error in mutate_impl(.data, dots) : incompatible size (2195), expecting 1 (the group size) or 1.
Any advice or direction to where this question has been answered already would be greatly appreciated!
FINAL RESULT
I modified fexjoo's suggestion to account for missing values and the result matches up with the values I calculated manually in Excel, so it looks like this is the winner. There may be a tidier way to remove the NAs from each of the columns but I'm ok with this.
library(dplyr)
library(tidyr)

sample %>%
  spread(Behavior, Time) %>%
  mutate(Eat = coalesce(Eat, 0)) %>%
  mutate(Dig = coalesce(Dig, 0)) %>%
  mutate(Smell = coalesce(Smell, 0)) %>%
  mutate(TIME = Eat + .5*Dig + .33*Smell)
Try this
sample %>%
  group_by(Individual, Food) %>%
  mutate(TIME = (Behavior == "Eat") +
           .5 * (Behavior == "Dig") +
           .33 * (Behavior == "Smell"))
My suggestion:
library(tidyr)
sample %>%
spread(Behavior, Time) %>%
mutate(TIME = Eat + .5*Dig + .33*Smell)
The result is:
Individual Food Dig Eat Smell TIME
1 A a 4 7 2 9.66
2 A b 1 2 6 4.48
3 B a 0 4 9 6.97
4 B b 7 6 3 10.49
You could do:
sample %>%
  mutate(weights = case_when(.$Behavior == "Smell" ~ 0.33,
                             .$Behavior == "Dig" ~ 0.5,
                             .$Behavior == "Eat" ~ 1)) %>%
  group_by(Food, Individual) %>%
  summarise(WeightedTime = sum(weights*Time))
Which gives:
Food Individual WeightedTime
<fctr> <fctr> <dbl>
1 a A 9.66
2 a B 6.97
3 b A 4.48
4 b B 10.49
You could create a column with the weights based on the Behavior column:
library(dplyr)
sample$weights <-
  case_when(
    sample$Behavior == "Smell" ~ 0.33,
    sample$Behavior == "Dig" ~ 0.5,
    sample$Behavior == "Eat" ~ 1
  )

sample %>%
  group_by(Individual, Food) %>%
  summarise(time = sum(Time * weights))

Randomly subset each group to satisfy conditions

Looking to reduce resource allocation by looping through each resource's name, looking at the accounts assigned to that person, selecting one at random, and replacing that person's name with NA.
reproducible example:
Accts <- paste0("Acc", 1:200)
Value <- c(500, 2000, 5000, 1000)
AccountDF <- data.frame(Accts, Value)
AccountDF$Owner[1:200] <- NA
AccountDF$Owner[1:23] <- "Jeff"
AccountDF$Owner[24:37] <- "Alex"
AccountDF$Owner[38:61] <- "Steph"
AccountDF$Owner[62:111] <- "Matt"
AccountDF$Owner[112:141] <- "David"
library(dplyr)
OwnerDF <- AccountDF %>%
  group_by(Owner) %>%
  summarise(Count = n(),
            TotalValue = sum(Value)) %>%
  filter(!is.na(Owner))
Where I got so far:
for (p in 1:nrow(OwnerDF)){
while (AccountDF$Count[p] > 22){
AccountDF %>%
filter(Owner == OwnerDF$Owner[p]) %>%
sample_n(1)
}
}
I've heard that for loops are unnecessary. I'm sure this can be done with the purrr package and pmap or something like that. I am still learning.
I would like to iterate through OwnerDF and check whether each person "owns" too many accounts. If yes, pick one of their accounts at random from the original account list, replace the owner's name with NA, subtract 1 from their count, and continue on.
Lastly, after figuring this out, I would like to see if it can be done with multiple conditions, like while (Count > 22 & Value > 40000), or maybe two while loops. The objective is to reduce each person's "owned" accounts below a certain count threshold and their total value below a certain dollar threshold.
To select random accounts, just make a random var and sort on it, taking the first N accounts that meet your conditions:
set.seed(1)
res = AccountDF %>%
mutate(r = runif(n())) %>%
arrange(r) %>%
group_by(Owner) %>%
mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>%
select(-r)
# Test that it worked...
res %>%
filter(!is.na(newOwner)) %>%
group_by(newOwner) %>%
summarise(Count = n(), TotalValue = sum(Value))
# A tibble: 5 x 3
# newOwner Count TotalValue
# <chr> <int> <dbl>
# 1 Alex 14 27000
# 2 David 18 37000
# 3 Jeff 18 39500
# 4 Matt 18 39500
# 5 Steph 17 36500
An extension mentioned by the OP in a comment:
Another question for you. Say I have a threshold for each value and count, and if someone has a low count but high value, I want to take a random account from their high value accounts, if they have a high count and low value, I want to take low value accounts away from them. How can I do this from a random perspective?
I'd probably assign a real-valued score to each observation, like...
s = scale(f(x))
where f is some function based on the conditions you mentioned (high count, high value or both), maybe as simple as x when you want to bias towards the low values and -x when you want to bias towards the high values.
Then, add on some noise and sort using the result as above:
r = s + rnorm(length(s))
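A minimal sketch of that idea plugged into the pipeline above (the choice of f is an assumption; here f(x) = Value pushes high-value accounts toward the end of the sort, so they are the most likely to be dropped, and f(x) = -Value would bias the other way):

set.seed(1)
res2 = AccountDF %>%
  mutate(s = as.numeric(scale(Value)),   # real-valued score, s = scale(f(x)) with f(x) = Value
         r = s + rnorm(n())) %>%         # add noise so the selection is still partly random
  arrange(r) %>%
  group_by(Owner) %>%
  mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>%
  select(-s, -r)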
