I have a df with the following structure:
sid step1 step2 step3 . . . . . step30
The sid is an id and the steps are steps through a webpage, where:
sids have a minimum of two steps
sids have a maximum of thirty steps
there are no duplicate sequential pages (i.e. page refreshes)
the steps are all string object types
Essentially, I want to create a total transition probability: for every unique page, I get a table/matrix which has a transition probability for every single possible page.
I have ~3k unique pages, so I don't know if this will be computationally feasible.
I would also be okay with passing a few pages as an argument for the matrix, so it's not a 3000x3000 matrix but maybe a 1x3000 or 5x3000. In fact, I would prefer to start with this and scale up until it crashes.
Starting with the concept
To build a transition matrix, it is often easy to first build a matrix of counts. The counts can then be divided to produce transition probabilities.
To produce something like:
            | to_site_A | to_site_B | ...
------------+-----------+-----------+-----
from_site_A |
from_site_B | counts
from_site_C |
...
It might be simpler to first produce:
from   | to     | count
-------+--------+-------
site_A | site_B |
site_A | site_C |
...
This is the same information, just arranged differently.
And to do this, it is probably easier if you rearrange your current data into a structure like this:
from   | to
-------+-------
site_A | site_B
site_A | site_C
...
So
Step 1: get data into long-thin structure of transitions
Step 2: count all pairwise transitions
Step 3: pivot or rearrange counts into square matrix
Step 1, rearrange data to long thin
You probably want something like this:
df_from_1_to_2 = df %>%
select(from = step1, to = step2) %>%
filter(!is.na(to))
df_from_2_to_3 = df %>%
select(from = step2, to = step3) %>%
filter(!is.na(to))
...
df_from_29_to_30 = df %>%
select(from = step29, to = step30) %>%
filter(!is.na(to))
long_list = rbind(df_from_1_to_2,
df_from_2_to_3,
...
df_from_29_to_30)
No, this is not the most efficient way to approach this (in code or memory management), but we shall focus on the approach; a more compact alternative is sketched below.
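For what it's worth, here is a sketch of a more compact way to build the same long-thin structure, assuming df has one row per sid and the columns step1 through step30 appear in order:
library(dplyr)
library(tidyr)

long_list = df %>%
  # one row per (sid, step), keeping the original step order within each sid
  pivot_longer(starts_with("step"), names_to = "step", values_to = "from") %>%
  group_by(sid) %>%
  mutate(to = lead(from)) %>%   # pair each page with the page visited next
  ungroup() %>%
  filter(!is.na(from), !is.na(to)) %>%
  select(from, to)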
Step 2, count all pairwise transitions
This is now straightforward:
pairwise_count = long_list %>%
group_by(from, to) %>%
summarise(count = n())
Step 3, pivot or rearrange counts into square matrix
This step is just changing how the data is presented, and may not even be necessary depending on your application.
For rearranging this type of data, I suggest pivot_wider from the tidyr package:
count_matrix = pivot_wider(
  data = pairwise_count,
  names_from = to,
  names_prefix = "to_",
  values_from = count,
  values_fill = 0
)
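If, as mentioned in the question, you want to start with only a handful of "from" pages rather than the full 3000x3000 matrix, you can filter before pivoting. A sketch (the page names here are placeholders):
count_matrix_small = pairwise_count %>%
  filter(from %in% c("page_A", "page_B")) %>%   # keep a few from-pages, e.g. a 2x3000 matrix
  pivot_wider(names_from = to,
              names_prefix = "to_",
              values_from = count,
              values_fill = 0)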
Edit: getting probabilities instead of counts
There are multiple points at which you could swap from counts to probabilities; one place to do it would be during step 2:
pairwise_count = long_list %>%
group_by(from, to) %>%
summarise(count = n())
pairwise_prob = pairwise_count %>%
group_by(from) %>%
mutate(from_count = sum(count)) %>%
mutate(prob = count / from_count) %>%
select(from, to, prob)
You can then use pairwise_prob in step 3 rather than pairwise_count.
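As a quick sanity check (a sketch), every from-page's transition probabilities should sum to 1:
pairwise_prob %>%
  group_by(from) %>%
  summarise(total_prob = sum(prob))   # each total_prob should be 1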
Related
I have a long dataset in which there are duplicated entries whose data I need to merge, e.g. paste values together.
In my case, I have a database of scientific articles: the strongest unique identifiers are the DOI and the article title, but the first may be missing in one of the copies, and the second may have slight phonetic/graphic differences that are easy to spot for humans but not programmatically (e.g. one copy uses β and the other plain beta).
A "match" are two articles that share at least one of the two columns. That is, I need a way to dplyr::group_by by the DOI OR the article title (usual group_by uses an AND logic).
The only solution that comes to my mind is to repeat the aggregation twice, for each column. Not very efficient given the large number of records.
Example:
imagine an input like:
df <- data.frame(
ID = c(1, NA, 2, 2),
Title = c('A', 'A', 'beta', 'β'),
to.join = 1:4
)
After (OR-)grouping and summarising:
df %>%
group_by_OR(ID, Title) %>% # dummy function
summarise(
ID = na.omit(ID)[1],
Title = Title[1],
joined = paste(to.join, collapse = ', '))
I should get something like this:
ID Title joined
1 1 A 1, 2
2 2 beta 3, 4
That is, the data was grouped by the title for the first group and by the id for the second.
I don't think you can avoid having to group the data twice, but we can do it sequentially; that way we can be as efficient as possible.
library(dplyr)
df_aggregated <- df %>%
group_by(ID) %>%
arrange(Title) %>%
summarise(Title = first(Title),
to.join = paste0(to.join, collapse=", ")) %>%
group_by(Title) %>%
arrange(ID) %>%
summarise(ID = first(ID),
to.join = paste0(to.join, collapse=", ")) %>%
select(ID, Title, joined=to.join) %>%
as.data.frame()
Now, df_aggregated is:
ID Title joined
1 1 A 1, 2
2 2 beta 3, 4
Eventually I found a solution, thanks also to @dario.
First I group by (normalised) Title and impute the missing DOIs if at least one of the copies has one. Then I ungroup and create a new unique ID, using the DOI if present and the Title for those entries where no copy has one.
Finally I group and summarize by this ID.
This way the computationally heavy summarising step is done only once.
library(dplyr)
library(stringr)

records %>%
mutate(
uID = str_to_lower(Title) %>% str_remove_all('[^\\w\\d]+') # Improve matching between slightly different copies
) %>%
group_by(uID) %>%
mutate(DOI = na.omit(DOI)[1]) %>%
ungroup() %>%
mutate(
uID = ifelse(is.na(DOI), uID, DOI)
) %>%
group_by(uID) %>%
summarise(...) # various stuff here.
I'm working with some survey data and I want to summarize responses from everyone, and responses from members in a single table.
The best way I can think of to translate this to Starwars is that I want to know how many characters total have any one eye color, and how many female characters have that eye color. For simplicity, I limited the population to blue and brown eyes.
I can run two separate queries, one to show just the females:
starwars %>%
filter(eye_color %in% c("brown","blue")) %>%
count(eye_color, gender) %>%
filter(gender == "female") %>%
mutate(percent = n / sum(n) * 100,
percent = sprintf("%.0f%%", percent))
And one to show all characters regardless of gender:
starwars %>%
filter(eye_color %in% c("brown","blue")) %>%
count(eye_color) %>%
mutate(percent = n / sum(n) * 100,
percent = sprintf("%.0f%%", percent))
But I'd like to spit these out as a single table. Is there a better approach to that than just pasting the two resulting tibbles together?
I still don't know of a good way to group data where groups overlap in dplyr without repeating data, so I think combining the data from two different pipelines is fine. If you want to eliminate code duplication, you could write a helper function. Here's one such example:
library(dplyr)
library(purrr)

plus_margin <- function(data, filters, fun = identity, .id = "id") {
  stopifnot(is.list(filters))
  stopifnot(!is.null(names(filters)))
  stopifnot(all(sapply(filters, is.function)))
  stopifnot(is.function(fun))
  bind_rows(
    map_dfr(filters, ~ data %>% .x %>% fun, .id = .id),
    data %>% fun %>% mutate(!!.id := "all")   # !! so the column is named after the value of .id
  )
}
Then you could call it with something like:
starwars %>%
  filter(eye_color %in% c("brown","blue")) %>%
  plus_margin(
    list(feminine = . %>% filter(gender == "feminine")),
    . %>% count(eye_color) %>%
      mutate(percent = n / sum(n) * 100,
             percent = sprintf("%.0f%%", percent))
  )
Which returns
id eye_color n percent
<chr> <chr> <int> <chr>
1 feminine blue 6 55%
2 feminine brown 5 45%
3 all blue 19 48%
4 all brown 21 52%
The idea is that you pass in a list of filters to subset the data by. These filters should be functions that take the data and subset it in some way. The list should be named, and the names will be used as values in the resulting "id" column. Here we use the magrittr syntax . %>% ... to create anonymous functions. We then need to pass in a function to apply to each of the subsets.
But at the end of the day, the joining is still happening with bind_rows. Maybe someone else will suggest a better way.
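For reference, the direct version of "pasting the two tibbles together" is just a bind_rows of the two pipelines from the question, tagged with an id column. A sketch (depending on your dplyr version the gender value may be "female" rather than "feminine"):
library(dplyr)

bind_rows(
  feminine = starwars %>%
    filter(eye_color %in% c("brown", "blue"), gender == "feminine") %>%
    count(eye_color),
  all = starwars %>%
    filter(eye_color %in% c("brown", "blue")) %>%
    count(eye_color),
  .id = "id"
) %>%
  group_by(id) %>%
  mutate(percent = sprintf("%.0f%%", n / sum(n) * 100)) %>%   # percentages within each id group
  ungroup()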
My goal is to better target prospects at a higher call success rate, based on time of day and prior history.
I have created a "Prodprobability" column showing the probability of a PropertyID answering the phone at that hour for the history of calls. Instead of merely omitting Property ID 233303.13 from any calls, I want to retarget them into hour 13 or hour 16 (the sample data doesn't show but the probability of pickup at those hours are 100% and 25% respectively).
So, moving forward, based on hour of day, and history of that prospect picking up the phone or not during that hour, I'd like to re-target every prospect during the hours they're most likely to pick up.
sample data
EDIT: I guess I need a formula to do this: If "S425=0", I want to search for where "A425" has the highest probability in the S column, and return the hour and probability for that "PropertyID". Hopefully that makes sense.
EDIT: sample data returns this
The question here would be: are you dead set on creating a 'model', or would an automation work for you?
I would suggest ordering the dataframe by the probability of picking up the call every hour (so you can give the more probable leads first) and then further sorting by the number of calls on that day.
Something along the lines of:
require(dplyr)
todaysCall = df %>%
dplyr::group_by(propertyID) %>%
dplyr::summarise(noOfCalls = n())
hourlyCalls = df %>%
dplyr::filter(hour == format(Sys.time(),"%H")) %>%
dplyr::left_join(todaysCall) %>%
dplyr::arrange(desc(Prodprobability),noOfCalls)
Essentially, getting the probability of pickups is what models are all about, and you already seem to have that information.
Alternate solution
Get top 5 calling times for each propertyID:
top5Times = df %>%
dplyr::filter(Prodprobability != 0) %>%
dplyr::group_by(propertyID) %>%
dplyr::arrange(desc(Prodprobability)) %>%
dplyr::slice(1:5L) %>%
dplyr::ungroup()
Get alternate calling time for cases with zero Prodprobability:
zeroProb = df %>%
dplyr::filter(Prodprobability == 0)
alternateTimes = df %>%
dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
dplyr::filter(Prodprobability != 0) %>%
dplyr::arrange(propertyID,desc(Prodprobability))
Best calling hour for cases with zero probability at given time:
#Identifies the zero prob cases; can be by hour or at a particular instant
zeroProb = df %>%
dplyr::filter(Prodprobability == 0)
#Gets the highest calling probability and corresponding closest hour if probability is same for more than one timeslot
bestTimeForZero = df %>%
dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
dplyr::filter(Prodprobability != 0) %>%
dplyr::group_by(propertyID) %>%
dplyr::arrange(desc(Prodprobability),hour) %>%
dplyr::slice(1L) %>%
dplyr::ungroup()
Returning the same number of records as in the original df:
zeroProb = df %>%
dplyr::filter(Prodprobability == 0) %>%
dplyr::group_by(propertyID) %>%
dplyr::summarise(total = n())
bestTimesList = lapply(1:nrow(zeroProb),function(i){
limit = zeroProb$total[i]
bestTime = df %>%
dplyr::filter(propertyID == zeroProb$propertyID[i]) %>%
dplyr::arrange(desc(Prodprobability)) %>%
dplyr::slice(1:limit)
return(bestTime)
})
bestTimeDf = bind_rows(bestTimesList)
Note: You can combine the filter statements; I have written them separate to highlight what each step does.
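For example (same logic as above, just with the two filters merged into one call):
bestTimeForZero = df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID, Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()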
I have a df (test) like this
Now if you look at the data, the 6 to 10 combination is available in the second period but not in the first. Hence when I use this code:
a_summary <- test %>%
group_by(from, to) %>%
summarize(avg = mean(share, na.rm = T)) %>%
ungroup() %>%
spread(from, avg, fill = 0)
The output comes out like this:
Now, look at the 10 to 6 cell: it has a value of 1 because the 10 to 6 combination only exists one time. But when I take the average, I would like to consider all combinations in each period, so the expected value of that 10 to 6 cell is .5, and the overall column and row sums of the matrix should be 1.
a_summary <- test %>%
group_by(from, to) %>%
summarize(count = sum(n, na.rm = T)) %>%
ungroup() %>%
spread(from, count, fill = 0)
This will give you the counts of all combinations. Now you can normalize this matrix by dividing by sum(test$n) or by using prop.table(), as sketched below.
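A sketch of that normalization, assuming test has the columns from, to and n used above:
count_wide = test %>%
  group_by(from, to) %>%
  summarize(count = sum(n, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, count, fill = 0)

count_mat = as.matrix(count_wide[, -1])   # drop the "to" column
rownames(count_mat) = count_wide$to
prop.table(count_mat)                     # divides every cell by the grand total, i.e. sum(test$n) here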
Looking to reduce resource allocation by looping through each resource's name, looking at the accounts assigned to that person's name, selecting one at random, and replacing that person's name with NA.
reproducible example:
Accts <- paste0("Acc", 1:200)
Value <- c(500, 2000, 5000, 1000)
AccountDF <- data.frame(Accts, Value)
AccountDF$Owner[1:200] <- NA
AccountDF$Owner[1:23] <- "Jeff"
AccountDF$Owner[24:37] <- "Alex"
AccountDF$Owner[38:61] <- "Steph"
AccountDF$Owner[62:111] <- "Matt"
AccountDF$Owner[112:141] <- "David"
library(dplyr)
OwnerDF <- AccountDF %>%
group_by(Owner) %>%
summarise(Count = n(),
TotalValue = sum(Value)) %>%
filter(!is.na(Owner))
Here's how far I got:
for (p in 1:nrow(OwnerDF)){
while (AccountDF$Count[p] > 22){
AccountDF %>%
filter(Owner == OwnerDF$Owner[p]) %>%
sample_n(1)
}
}
I've heard that for loops are unnecessary. I'm sure this can be done with the purrr package and pmap or something like that. I am still learning.
I would like to iterate through the OwnerDF and look at whether that person "owns" too many accounts. If yes, look at the original account list and select a random one and replace the owner's name with NA, remove 1 from their count, and continue on.
Lastly, after figuring this out I would like to see if it can be done with multiple conditions, like while(Count > 22 & Value > $40,000), or maybe two while loops. The objective is to reduce each person's "owned" accounts to below a certain threshold and reduce the $$ to below a certain threshold.
To select random accounts, just make a random var and sort on it, taking the first N accounts that meet your conditions:
set.seed(1)
res = AccountDF %>%
mutate(r = runif(n())) %>%
arrange(r) %>%
group_by(Owner) %>%
mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>%
select(-r)
# Test that it worked...
res %>%
filter(!is.na(newOwner)) %>%
group_by(newOwner) %>%
summarise(Count = n(), TotalValue = sum(Value))
# A tibble: 5 x 3
# newOwner Count TotalValue
# <chr> <int> <dbl>
# 1 Alex 14 27000
# 2 David 18 37000
# 3 Jeff 18 39500
# 4 Matt 18 39500
# 5 Steph 17 36500
An extension mentioned by the OP in a comment:
Another question for you. Say I have a threshold for each value and count, and if someone has a low count but high value, I want to take a random account from their high value accounts, if they have a high count and low value, I want to take low value accounts away from them. How can I do this from a random perspective?
I'd probably assign a real-valued score to each observation, like...
s = scale(f(x))
where f is some function based on the conditions you mentioned (high count, high value or both), maybe as simple as x when you want to bias towards the low values and -x when you want to bias towards the high values.
Then, add on some noise and sort using the result as above:
r = s + rnorm(length(s))
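A sketch of that idea, continuing the example above and (purely for illustration) using f(x) = Value, so that high-value accounts are the most likely to be freed up:
set.seed(1)
res2 = AccountDF %>%
  group_by(Owner) %>%
  mutate(s = as.numeric(scale(Value)),      # real-valued score, here based on Value only
         r = s + rnorm(n()),                # add noise so the selection stays random
         newOwner = replace(Owner, rank(r) > 22, NA)) %>%  # keep the 22 lowest noisy scores per owner
  ungroup() %>%
  select(-s, -r)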