Randomly subset each group to satisfy conditions

I'm looking to reduce resource allocation by looping through each resource's name, looking at the accounts assigned to that person, selecting one at random, and replacing that person's name with NA.
Reproducible example:
Accts <- paste0("Acc", 1:200)
Value <- c(500, 2000, 5000, 1000)
AccountDF <- data.frame(Accts, Value)
AccountDF$Owner[1:200] <- NA
AccountDF$Owner[1:23] <- "Jeff"
AccountDF$Owner[24:37] <- "Alex"
AccountDF$Owner[38:61] <- "Steph"
AccountDF$Owner[62:111] <- "Matt"
AccountDF$Owner[112:141] <- "David"
library(dplyr)
OwnerDF <- AccountDF %>%
  group_by(Owner) %>%
  summarise(Count = n(),
            TotalValue = sum(Value)) %>%
  filter(!is.na(Owner))
Here's where I've got so far:
for (p in 1:nrow(OwnerDF)){
  while (OwnerDF$Count[p] > 22){
    # the sampled row is never assigned anywhere, so this loop
    # doesn't modify AccountDF (and never terminates)
    AccountDF %>%
      filter(Owner == OwnerDF$Owner[p]) %>%
      sample_n(1)
  }
}
I've heard that for loops are often unnecessary in R. I'm sure this can be done with the purrr package and pmap or something like that; I am still learning.
I would like to iterate through OwnerDF and look at whether each person "owns" too many accounts. If so, look at the original account list, select a random one of that person's accounts, replace the owner's name with NA, remove 1 from their count, and continue on.
Lastly, after figuring this out, I would like to see if it can be done with multiple conditions, like while(Count > 22 & Value > 40000), or maybe two while loops. The objective is to reduce each person's "owned" accounts below a certain threshold and their total $$ below a certain threshold.
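For reference, a working version of the loop described above might look like this minimal sketch; it assigns the sampled row back into the data, which the attempt above never does (the threshold of 22 is the one from the question):
set.seed(1)
tmp <- AccountDF   # work on a copy so the vectorized answer below still sees the original data
for (p in 1:nrow(OwnerDF)) {
  # while this owner still has too many accounts, un-assign one at random
  while (sum(tmp$Owner == OwnerDF$Owner[p], na.rm = TRUE) > 22) {
    idx <- which(tmp$Owner == OwnerDF$Owner[p])
    tmp$Owner[sample(idx, 1)] <- NA
  }
}
table(tmp$Owner)   # each owner now holds at most 22 accounts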

To select random accounts, just make a random var and sort on it, taking the first N accounts that meet your conditions:
set.seed(1)
res = AccountDF %>%
  mutate(r = runif(n())) %>%
  arrange(r) %>%
  group_by(Owner) %>%
  mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>%
  select(-r)
# Test that it worked...
res %>%
  filter(!is.na(newOwner)) %>%
  group_by(newOwner) %>%
  summarise(Count = n(), TotalValue = sum(Value))
# A tibble: 5 x 3
#   newOwner Count TotalValue
#   <chr>    <int>      <dbl>
# 1 Alex        14      27000
# 2 David       18      37000
# 3 Jeff        18      39500
# 4 Matt        18      39500
# 5 Steph       17      36500
An extension mentioned by the OP in a comment:
Another question for you: say I have a threshold for each value and count. If someone has a low count but high value, I want to take a random account from their high-value accounts; if they have a high count and low value, I want to take low-value accounts away from them. How can I do this from a random perspective?
I'd probably assign a real-valued score to each observation, like...
s = scale(f(x))
where f is some function based on the conditions you mentioned (high count, high value or both), maybe as simple as x when you want to bias towards the low values and -x when you want to bias towards the high values.
Then, add on some noise and sort using the result as above:
r = s + rnorm(length(s))
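For instance, to preferentially take away high-value accounts, a minimal sketch (assuming f(x) = Value, so that high-value rows sort last and are replaced first; the thresholds are the same as above):
set.seed(1)
s <- scale(AccountDF$Value)                  # higher value => larger score
r_noisy <- as.numeric(s) + rnorm(length(s))  # add noise for randomness
res2 <- AccountDF %>%
  mutate(r = r_noisy) %>%
  arrange(r) %>%                             # low-value accounts tend to come first
  group_by(Owner) %>%
  mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>%
  select(-r)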

Related

R Dplyr sub-setting

I need to calculate min, max, and mean by customer after subsetting the population for primary contacts. To do this, I need to drop observations within a customer group if contact == relation and amount < 25. But the tricky part is: if contact == relation and amount == amount (i.e., the amounts are tied), I need to keep both observations regardless of the amount (this accounts for ties, where we cannot define the primary contact).
If contact == relation, one can think of this as a household.
Each customer can be comprised of multiple households, so I've included contacts with NULL relationship values.
Sample Data
customer <- c(1,1,1,1,2,2,2,3,3,3,3)
contact <- c(1234,2345,3456,4567,5678,6789,7890,8901,9012,1236,2346)
relationship <- c(2345,1234,"","",6789,5678,"",9012,8901,2346,1236)
amount <- c(26,22,40,12,15,15,70,35,15,25,25)
score <- c(500,300,700,600,400,600,700,650,300,600,700)
creditinfoaggtestdata1 <- data.frame(customer, contact, relationship, amount,
                                     score, stringsAsFactors = FALSE)  # keep relationship as character
Expected Outcome
As a point of reference, if I do not drop the appropriate contacts prior to calculating min, max, and mean by customer, I get an output table as follows (shown as an image in the original post).
I assume the requirement "contact = relation and amount = amount" means across different rows within the same customer group. Here's a dplyr solution:
# Create a contact-relationship id where direction doesn't matter
df <- creditinfoaggtestdata1 %>%
  rowwise() %>%
  mutate(id = paste0(min(contact, relationship), max(contact, relationship)))
# Find ids where duplicate amounts exist within a customer group (the ties)
dups <- df %>%
  group_by(customer, id, amount) %>%
  summarise(count = n()) %>%
  filter(count > 1) %>%
  ungroup() %>%
  select(customer, id)
# Use an inner join to keep only the tied contact-relationship combinations
a <- df %>%
  filter(amount < 25) %>%
  inner_join(dups, by = c("customer", "id"))
# Combine with the amount >= 25 data
b <- df %>%
  filter(amount >= 25)
c <- rbind(a, b)
c %>%
  group_by(customer) %>%
  summarise(min_score = min(score), max_score = max(score), avg_score = mean(score))
Output:
  customer min_score max_score avg_score
     <dbl>     <dbl>     <dbl>     <dbl>
1        1       500       700      600
2        2       400       700      567.
3        3       600       700      650

Conditionally remove certain rows in R

I have a very large data frame with fish species captured as one of the columns. Here is a very shortened example:
ID = seq(1,50,1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = as.data.frame(cbind(ID, common))
I want to remove any species that make up less than a certain percentage of the data. For the example here say I want to remove all species that make up less than 30% of the data:
library(dplyr)
nrow(filter(dat, common == "bass")) #22 rows -> 22/50 -> 44%
nrow(filter(dat, common == "jack")) #12 rows -> 12/50 -> 24%
nrow(filter(dat, common == "snapper")) #16 rows -> 16/50 -> 32%
Here, jacks make up less than 30% of the rows, so I want to remove all the rows with jacks (or all species with less than 15 rows). This is easy to do here, but in reality I have over 700 fish species in my data frame and I want to throw out all species that make up less than 1% of the data (which in my case would be less than 18,003 rows). Is there a streamlined way to do this without having to filter out each species individually?
I imagine perhaps some kind of loop that says if the number of rows for common name = "x" is less than 18003, remove those rows...
You may also do it in one pipe:
library(dplyr)
dat %>%
  mutate(percentage = n()) %>%               # total number of rows
  group_by(common) %>%
  mutate(percentage = n() / percentage) %>%  # each species' share of the total
  filter(percentage > 0.3) %>%
  select(-percentage)
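A shorter equivalent, assuming a dplyr version that provides add_count (0.7 and later):
dat %>%
  add_count(common) %>%             # adds a column n holding each species' row count
  filter(n > 0.3 * nrow(dat)) %>%   # keep species above the 30% threshold
  select(-n)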
One way to approach this is to first create a summary table, then filter based on the summary stat. There are probably more direct ways to accomplish the same thing.
library(dplyr)
set.seed(914) # so you get the same results from sample()
ID = seq(1,50,1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = as.data.frame(cbind(ID, common)) # same as your structure, but I ended up with different species mix
summ.table <- dat %>%
  group_by(common) %>%
  summarize(number = n()) %>%
  mutate(pct = number/sum(number))
summ.table
# # A tibble: 3 x 3
# common number pct
# <fct> <int> <dbl>
# 1 bass 18 0.36
# 2 jack 18 0.36
# 3 snapper 14 0.28
include <- summ.table$common[summ.table$pct > .3]
dat.selected = filter(dat, common %in% include)
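As for a more direct way to accomplish the same thing: the summary table can be skipped entirely with a grouped filter, as in this sketch of the same 30% rule:
dat.selected <- dat %>%
  group_by(common) %>%
  filter(n() / nrow(dat) > 0.3) %>%   # each species' share of all rows
  ungroup()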

Filtering a Data Frame with Very specific Requirements

First, I am not a developer and have little experience with R, so please forgive me. I have tried to get this done on my own, but have run out of ideas for filtering a data frame using the 'filter' command.
The data frame has about a dozen or so columns, one of which is Grp (meaning Group). This is a FIFA soccer dataset, so the Group in this context means the general position the player is in (Defense, Midfield, Goalkeeper, Forward).
I need to filter this data frame to provide me this exact information:
the Top 4 Defense Players
the Top 4 Midfield Players
the Top 2 Forwards
the Top 1 Goalkeeper
What do I mean by "Top"? It's arranged by Growth within each Grp, and Growth is just a number, so the Top 4 would be something like 22, 21, 21, 20 (a value could in fact be repeated for different players). The Growth column is the difference between the Potential column and the Overall column, so again just a simple subtraction.
#Create a subset of the data frame
library(dplyr)
library(ggplot2)   # needed for the chart below
fifa2 <- fifa %>%
  select(Club, Name, Position, Overall, Potential, Contract.Valid.Until2,
         Wage2, Value2, Release.Clause2, Grp) %>%
  arrange(Club)
#Add a column for growth potential
fifa2$Growth <- fifa2$Potential - fifa2$Overall
head(fifa2)
#Find Southampton Players
ClubName <- filter(fifa2, Club == "Southampton") %>%
  group_by(Grp) %>%
  arrange(desc(Growth), .by_group = TRUE) %>%
  top_n(4)
ClubName
ClubName2 <- ggplot(ClubName, aes(x = forcats::fct_reorder(Name, Grp),
                                  y = Growth, fill = Grp)) +
  geom_bar(stat = "identity", colour = "black") +
  coord_flip() + xlab("Player Names") + ylab("Unfilled Growth Potential") +
  ggtitle("Southampton Players, Grouped by Position")
ClubName2
That produces a chart with the Top 4 players in each position (top_n(4)), but I need it filtered further, per the logic described above. How can I achieve this? Getting rows by Grp name is fairly easy with dplyr, but I don't see how to filter down to the 4-4-2-1 that I need. Any help appreciated.
Sample output from fifa2 & ClubName (showing the data sorted by top_n(4)) was posted as screenshots in the original question.
This might not be the most elegant solution, but hopefully it works :)
# create dummy data
data_test = data.frame(grp = sample(c("def", "mid", "goal", "front"), 30, replace = T), growth = rnorm(30, 100,10), stringsAsFactors = F)
# create a reference table giving the number of players needed per grp
desired_n = data.frame(grp = c("def", "mid", "goal", "front"), top_n_desired = c(4,4,1,2), stringsAsFactors = F)
# > desired_n
# grp top_n_desired
# 1 def 4
# 2 mid 4
# 3 goal 1
# 4 front 2
# group and arrange, then look up the desired number of players in the reference table and select them
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
  slice(1:desired_n$top_n_desired[which(first(grp) == desired_n$grp)]) %>%
  arrange(grp)
# A bit more readable, but you have to create an additional column in your dataframe
# create an additional column with the desired number for each player's grp
data_test = merge(data_test, desired_n, by = "grp", all.x = TRUE)
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
  slice(1:first(top_n_desired)) %>%
  arrange(grp)

R dplyr summarise date gaps

I have data on a set of students and the semesters they were enrolled in courses.
ID = c(1,1,1,
2,2,
3,3,3,3,3,
4)
The semester variable "Date" is coded as the year followed by 20 for spring, 30 for summer, and 40 for fall, so the Date value 201430 is the summer semester of 2014:
Date = c(201220,201240,201330,
201340,201420,
201120,201340,201420,201440,201540,
201640)
Enrolled<-data.frame(ID,Date)
I'm using dplyr to group the data by ID and summarise various aspects of each student's enrollment history:
Enrollment.History <- dplyr::select(Enrolled, ID, Date) %>%
  group_by(ID) %>%
  summarise(Total.Semesters = n_distinct(Date),
            First.Semester = min(Date))
I'm trying to get a measure of the number of enrollment gaps each student has, as well as the size of the largest enrollment gap. The data frame should end up looking like this:
Enrollment.History$Gaps<-c(2,0,3,0)
Enrollment.History$Biggest.Gap<-c(1,0,7,0)
print(Enrollment.History)
I'm just trying to figure out the best way to code those gap variables. Is it better to turn that Date variable into an ordered factor? I hope this has a simple solution.
Since you are not dealing with real dates in a standard format, you can instead make use of factors to compute the gaps.
First you need to define a vector of all possible year/semester combinations ("Dates") in the correct order (this is important!).
all_semesters <- c(sapply(2011:2016, paste0, c(20,30,40)))
Then, you can create a new factor variable, arrange the data by ID and Date, and finally compute the maximum difference between two semesters:
Enrolled %>%
  mutate(semester = factor(Date, levels = all_semesters)) %>%
  group_by(ID) %>%
  arrange(Date) %>%
  summarise(max_gap = max(c(0, diff(as.integer(semester)) - 1), na.rm = TRUE))
## A tibble: 4 × 2
#     ID max_gap
#  <dbl>   <dbl>
#1     1       1
#2     2       0
#3     3       7
#4     4       0
I used max(c(0, ...)) in the summarise, because otherwise you would end up with -Inf for IDs with a single entry.
Similarly, you could also achieve this by using match instead of a factor:
Enrolled %>%
  mutate(semester = match(Date, all_semesters)) %>%
  group_by(ID) %>%
  arrange(Date) %>%
  summarise(max_gap = max(c(0, diff(semester) - 1), na.rm = TRUE))
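The OP's expected output also included the number of gaps; a possible extension of the same pipeline (a sketch, reusing all_semesters from above):
Enrolled %>%
  mutate(semester = match(Date, all_semesters)) %>%
  group_by(ID) %>%
  arrange(Date) %>%
  summarise(Gaps = sum(diff(semester) > 1),              # how many breaks in enrollment
            Biggest.Gap = max(c(0, diff(semester) - 1))) # size of the largest break
For a student with a single semester, diff(semester) is empty, so Gaps is 0 and the max(c(0, ...)) guard again yields 0.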

How to calculate weighted sums of rows based on value in another column

I searched around a lot trying to find an answer for this. It seems like a relatively simple and common question, and I'm surprised I didn't find an answer; perhaps I am just not searching for the correct keywords.
I would like to calculate a weighted sum of some columns in three rows based on a value in another column. I think it makes more sense if you look at the dummy table below.
INDIVIDUAL <- c("A","A","A","A","A","A","B","B","B","B","B","B")
BEHAVIOR <- c("Smell", "Dig", "Eat", "Smell", "Dig", "Eat","Smell", "Dig", "Eat","Smell", "Dig", "Eat")
FOOD <- c("a", "a", "a","b","b","b", "a", "a", "a","b","b","b")
TIME <- c(2,4,7,6,1,2,9,0,4,3,7,6)
sample <- data.frame(Individual=INDIVIDUAL, Behavior=BEHAVIOR, Food=FOOD, Time=TIME)
Each individual spends a certain amount of time smelling, digging, and eating each food item. I would like to weight and sum these three times to get one overall time per food item, with smelling weighted lowest and eating highest. So basically I want a total time interacting with each food item: Time per FoodA = (EatA) + (0.5*DigA) + (0.33*SmellA).
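For example, individual A with food a spends 7 eating, 4 digging, and 2 smelling, so the weighted time would be 7 + 0.5*4 + 0.33*2 = 9.66.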
After extensive web browsing the best idea I could come up with was this:
sample %>%
  group_by(Individual, Food) %>%
  mutate(TIME = ((fullsum$BEHAVIOR == "EAT")
                 + (.5*(fullsum$BEHAVIOR == "DIG")
                 + (.33*(fullsum$BEHAVIOR == "SMELL")))))
But it doesn't work and I get this error: Error in mutate_impl(.data, dots) : incompatible size (2195), expecting 1 (the group size) or 1.
Any advice or direction to where this question has been answered already would be greatly appreciated!
FINAL RESULT
I modified fexjoo's suggestion to account for missing values and the result matches up with the values I calculated manually in Excel, so it looks like this is the winner. There may be a tidier way to remove the NAs from each of the columns but I'm ok with this.
library(dplyr)
library(tidyr)
sample %>%   # column names adapted to the dummy data above
  spread(Behavior, Time) %>%
  mutate(Eat = coalesce(Eat, 0)) %>%
  mutate(Dig = coalesce(Dig, 0)) %>%
  mutate(Smell = coalesce(Smell, 0)) %>%
  mutate(TIME = Eat + .5*Dig + .33*Smell)
Try this (the error comes from referencing fullsum$BEHAVIOR, a different data frame, inside mutate; use the Behavior column of the piped data instead):
sample %>%
  group_by(Individual, Food) %>%
  mutate(TIME = ((Behavior == "Eat") + (.5*(Behavior == "Dig"))
                 + (.33*(Behavior == "Smell"))))
# note this gives each row's weight; multiply by Time and sum to get a weighted total
My suggestion:
library(tidyr)
sample %>%
  spread(Behavior, Time) %>%
  mutate(TIME = Eat + .5*Dig + .33*Smell)
The result is:
  Individual Food Dig Eat Smell  TIME
1          A    a   4   7     2  9.66
2          A    b   1   2     6  4.48
3          B    a   0   4     9  6.97
4          B    b   7   6     3 10.49
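Note that spread() is superseded in newer versions of tidyr; a pivot_wider equivalent is sketched below, where values_fill = 0 also covers the missing-behavior NAs that the final version above patches with coalesce:
library(dplyr)
library(tidyr)
sample %>%
  pivot_wider(names_from = Behavior, values_from = Time,
              values_fill = 0) %>%   # fill absent behavior/food combinations with 0
  mutate(TIME = Eat + .5*Dig + .33*Smell)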
You could do:
sample %>%
  mutate(weights = case_when(Behavior == "Smell" ~ 0.33,
                             Behavior == "Dig" ~ 0.5,
                             Behavior == "Eat" ~ 1)) %>%
  group_by(Food, Individual) %>%
  summarise(WeightedTime = sum(weights*Time))
Which gives:
    Food Individual WeightedTime
  <fctr>     <fctr>        <dbl>
1      a          A         9.66
2      a          B         6.97
3      b          A         4.48
4      b          B        10.49
You could create a column with the weights based on the Behavior column:
library(dplyr)
sample$weights <-
case_when(
sample$Behavior == "Smell" ~ 0.33,
sample$Behavior == "Dig" ~ 0.5,
sample$Behavior == "Eat" ~ 1
)
sample %>%
  group_by(Individual, Food) %>%
  summarise(time = sum(Time * weights))
