I'm interested in filtering out data based on a set of rules.
I have a dataset that contains play data for all games in which a team had a .8 win probability at some point. What I'd like to do is find that point in which the win probability reached .8 and remove every play thereafter until the next game data begins. The dataset contains numerous games so once a game ends data from a new one begins in which the win probability goes back to around .5.
Here are the relevant columns and each row is a play in the game:
game_id = unique num for each game
team = team that will eventually get an .8 win prob
play_id = num that increases (but not necessarily in sequential order, for some reason) after each play
win_per = num showing the team's win probability at the start of that recorded play
Example df
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
So in this small example, I have recorded 5 plays each for two teams (a and b) who both obtained a win_per of at least .8 at some point in the game. In both example cases, I would want to have all the plays removed AFTER they attained this .8 mark, regardless of whether the win_per kept rising or fell back below .8.
So team a would have the final two rows of data removed (win_per == .81 and .85) and team b would have the final row removed (win_per == .77).
I'm imagining running a for loop that checks if the team in any row is the same team as the prior row, and if so, finds a win_per >= .8 with the lowest play_id (as this would be the first time the team reached .8) and then somehow removes the rest of the rows following that match UNTIL the team != prior row's team.
Of course, you might know a better way as well. Thank you so much for helping me out!
No need to use a loop; the whole selection can be performed in a single pipeline using the dplyr package:
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
library(dplyr)
#group by team and game
#find the first row that reaches .80 and store its index in a temp column
#keep rows 1 through that row
#remove temp column
df %>% group_by(team, game_id) %>%
  mutate(g80 = min(which(win_per >= 0.80))) %>%
  slice(1:first(g80)) %>%
  select(-g80)
# A tibble: 7 x 4
# Groups: team [2]
game_id team play_id win_per
<dbl> <fct> <dbl> <dbl>
1 122 a 1 0.5
2 122 a 5 0.6
3 122 a 22 0.86
4 144 b 45 0.54
5 144 b 47 0.43
6 144 b 55 0.47
7 144 b 58 0.81
Here is a base R way using cumsum in ave (with >= so that a win_per of exactly .8 also counts as reaching the mark):
subset(df, ave(win_per >= 0.8, game_id, FUN = function(x) c(0, cumsum(x)[-length(x)])) == 0)
# game_id team play_id win_per
#1 122 a 1 0.50
#2 122 a 5 0.60
#3 122 a 22 0.86
#6 144 b 45 0.54
#7 144 b 47 0.43
#8 144 b 55 0.47
#9 144 b 58 0.81
and using a similar concept in dplyr:
library(dplyr)
df %>% group_by(game_id) %>% filter(lag(cumsum(win_per >= 0.8) == 0, default = TRUE))
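To make the cumsum idea above easier to follow, here is a minimal base R sketch that spells out the intermediate steps. The helper name keep_through_first_hit and its threshold argument are made up for illustration; they are not part of either answer.

```r
# Example data from the question.
df <- data.frame(
  game_id = c(122, 122, 122, 122, 122, 144, 144, 144, 144, 144),
  team    = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  play_id = c(1, 5, 22, 25, 34, 45, 47, 55, 58, 66),
  win_per = c(.5, .6, .86, .81, .85, .54, .43, .47, .81, .77)
)

# Hypothetical helper: TRUE for every play up to and including the first
# play at or above the threshold, FALSE for every play after it.
keep_through_first_hit <- function(win_per, threshold = 0.8) {
  hit <- win_per >= threshold                 # plays at or above the mark
  hits_before <- c(0, head(cumsum(hit), -1))  # hits seen strictly before each play
  hits_before == 0
}

# One game on its own, to see the flags.
keep_one <- keep_through_first_hit(df$win_per[df$game_id == 122])

# All games: ave() applies the helper within each game_id.
keep_all <- as.logical(ave(df$win_per, df$game_id, FUN = keep_through_first_hit))
trimmed <- df[keep_all, ]
```

Because the flags depend on row order, this assumes the rows of each game are already in playing order, as they are in the example.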
I am measuring electric current (µA) over a certain time interval (s) for 4 different channels (chan_n) and this is how my data looks:
dat
s µA chan_n
<dbl> <dbl> <chr>
0.00 -0.03167860 1
0.02 -0.03136610 1
0.04 -0.03118490 1
0.06 -0.03094740 1
0.08 -0.03065360 1
0.10 -0.03047860 1
0.12 -0.03012230 1
0.14 -0.02995980 1
0.16 -0.02961610 1
... ... ...
My end goal is to get the current at a certain time after the peak value. Therefore I first get the time points at which the maximum appears for each channel:
BaselineTime <- dat %>%
group_by(chan_n) %>%
slice(which.max(µA)) %>% # get max current values
transmute(s = s + 30) # add 30 to the timepoints at which the max value appears
chan_n s
<chr> <dbl>
1 539.84
2 540.00
3 539.82
4 539.80
But if I use BaselineTime to filter for my current values, I get two NAs:
BaselineVal <- right_join(dat, BaselineTime, by =c("chan_n","s"))
s µA chan_n
<dbl> <dbl> <chr>
540.00 0.00364974 2
539.80 0.00610948 4
539.84 NA 1
539.82 NA 3
I checked whether the time values exist for channels 1 and 3, and they do. Also, if I create a data frame manually by hardcoding the time values and use it for filtering, it works just fine.
So why isn't it working? I would be very happy for any suggestions or explanations.
I think it might have something to do with the decimal places, as for channels 2 and 4 there is a 0 in the last decimal place.
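The decimal-place suspicion is at least plausible: values that print identically can differ in their last binary digits after arithmetic, and a join on a numeric column needs exact equality. The sketch below uses made-up values, not the measured data, so whether this is what happens with s + 30 here is an assumption. It shows the effect and one possible workaround, an integer key in hundredths of a second.

```r
# Two numbers that print the same but are not bit-identical.
a <- 0.1 + 0.2
b <- 0.3

exact_match    <- a == b                     # exact comparison, as a join would do
tolerant_match <- isTRUE(all.equal(a, b))    # comparison with a small tolerance

# Possible workaround: match on an integer key (hundredths of a second)
# instead of on the raw double.
key_a <- as.integer(round(a * 100))
key_b <- as.integer(round(b * 100))
key_match <- key_a == key_b
```

With real data, such a key column would be added to both tables before joining, so that the join runs on integers instead of doubles.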
Untested as the sample data isn't suitable for testing. I would try something like this:
dat %>%
  group_by(chan_n) %>%
  mutate(
    is_peak = row_number() == which.max(µA),
    post_peak = lag(is_peak, n = 30, default = FALSE)
  )
This will give a TRUE in the new post_peak column 30 rows after the peak, so you can trivially ... %>% filter(post_peak) or do whatever you need to with the result.
If you need more help than this, please share some data that illustrates the problem better, e.g., 10 rows each of 2 chan_n groups with the goal of finding the row 3 after the peak (and that row existing in the data).
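As a rough illustration of the row-offset idea on data of the shape requested above, here is a base R sketch with invented numbers (2 channels, 10 rows each, looking 3 rows past the peak). The column is named uA instead of µA only to keep the example ASCII, and the values are not the asker's measurements.

```r
# Synthetic data, invented for illustration.
dat <- data.frame(
  chan_n = rep(c("1", "2"), each = 10),
  s      = rep(seq(0, by = 0.02, length.out = 10), times = 2),
  uA     = c(1, 2, 5, 9, 7, 6, 4, 3, 2, 1,   # channel 1 peaks at row 4
             0, 1, 3, 4, 8, 6, 5, 3, 2, 1)   # channel 2 peaks at row 5
)
offset <- 3  # number of rows after the peak

# For each channel, pick the row `offset` positions after the maximum.
post_peak_rows <- do.call(rbind, lapply(split(dat, dat$chan_n), function(ch) {
  target <- which.max(ch$uA) + offset
  if (target > nrow(ch)) return(ch[0, ])     # no such row in this channel
  ch[target, , drop = FALSE]
}))
```

Selecting by row position avoids matching on the numeric time column at all, at the cost of assuming evenly spaced samples.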
I'm looking to include a group statement within my for loop and I'm having difficulty finding any details on how to do this properly.
The example below calculates the Extra, Outstanding and current columns within my loop. I'm trying to group by id so that the loop restarts with every id. My current code:
dat <- tibble(
id = c("A","A","A","A","A","A","B","B"),
rn= c(1,2,3,4,5,6,1,2),
current = c(100,0,0,0,0,0,500,0),
paid = c(10,12,12,13,13,13,20,20),
pct_extra = c(.02,.05,.05,.07, .03, .01, .09,.01),
Extra = NA,
Outstanding = NA)
for(i in 1:nrow(dat)){
dat$Extra[i] <- dat$current[i]*dat$pct_extra[i]
dat$Outstanding[i] <- dat$current[i] - dat$paid[i] - dat$Extra[i]
if(i < nrow(dat)){
dat$current[(i+1)] <- dat$Outstanding[i]}}
I saw other posts with this same question and they seem to revert to using dplyr. So my first attempt was:
for(i in 1:nrow(dat)){
dat%>%
group_by(id)%>%
mutate(Extra=pct_extra*(current-paid),
Outstanding=current-paid-Extra,
current=if_else(rn==1,current,lag(Outstanding)))}
But this attempt didn't actually calculate the Extra, Outstanding and current columns, which I guess is because I'm not using the loop statement properly.
Does anyone have any suggestions/references on how I can include a group statement into my for loop?
Thanks!
A few things.
First, for loops (surrounding dplyr pipes) are generally not necessary with dplyr grouping, and this is no exception (though we will use your for loop in a "single group at a time" way).
Second, even if a loop were needed, you loop over i and never use i, so you're doing the same calculation on all rows, nrow(dat) times.
Third, you aren't storing the results.
My first attempt (after realizing the rolling nature of this) was to try to adapt slider::slide to it, but unfortunately I couldn't get it to work.
In older dplyr, I would use dat %>% group_by(id) %>% do({...}), but do has been superseded in favor of across and multi-row summarize (which I could not figure out how to make do this).
So then I realized that your for loop works fine, it just needs to be applied one group at a time.
func <- function(z) {
for (i in seq_len(nrow(z))) {
z$Extra[i] <- z$current[i]*z$pct_extra[i]
z$Outstanding[i] <- z$current[i] - z$paid[i] - z$Extra[i]
if (i < nrow(z)) {
z$current[(i+1)] <- z$Outstanding[i]
}
}
z
}
library(dplyr)
library(tidyr) # nest, unnest
library(purrr) # map, can be done with base::Map as well
dat %>%
group_by(id) %>%
nest(quux = -id) %>%
mutate(quux = map(quux, func)) %>%
unnest(quux) %>%
ungroup()
# # A tibble: 8 x 7
# id rn current paid pct_extra Extra Outstanding
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 100 10 0.02 2 88
# 2 A 2 88 12 0.05 4.4 71.6
# 3 A 3 71.6 12 0.05 3.58 56.0
# 4 A 4 56.0 13 0.07 3.92 39.1
# 5 A 5 39.1 13 0.03 1.17 24.9
# 6 A 6 24.9 13 0.01 0.249 11.7
# 7 B 1 500 20 0.09 45 435
# 8 B 2 435 20 0.01 4.35 411.
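For comparison, the same "one group at a time" idea can be sketched in base R with split, lapply and rbind, without tidyr or purrr. This is an alternative under the same assumptions as the answer above, not a replacement for it; roll_one_group is simply the answer's func under a more descriptive, made-up name.

```r
dat <- data.frame(
  id          = c("A", "A", "A", "A", "A", "A", "B", "B"),
  rn          = c(1, 2, 3, 4, 5, 6, 1, 2),
  current     = c(100, 0, 0, 0, 0, 0, 500, 0),
  paid        = c(10, 12, 12, 13, 13, 13, 20, 20),
  pct_extra   = c(.02, .05, .05, .07, .03, .01, .09, .01),
  Extra       = NA_real_,
  Outstanding = NA_real_
)

# Rolling calculation for the rows of a single id.
roll_one_group <- function(z) {
  for (i in seq_len(nrow(z))) {
    z$Extra[i] <- z$current[i] * z$pct_extra[i]
    z$Outstanding[i] <- z$current[i] - z$paid[i] - z$Extra[i]
    if (i < nrow(z)) {
      z$current[i + 1] <- z$Outstanding[i]  # carry the balance to the next row
    }
  }
  z
}

# Split by id, run the loop inside each piece, and stack the pieces again.
res <- do.call(rbind, lapply(split(dat, dat$id), roll_one_group))
rownames(res) <- NULL
```

Because each id is processed separately, the first row of B keeps its own starting current of 500 and does not inherit the last Outstanding value of A.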
My data frame has 3 columns: the first contains names of different partners and the second contains different levels (all partners have all levels). The third column contains the volume of that particular level with that partner. It looks like this:
Partner | Level | Volume
a | 32 | 213
b | 32 | 450
c | 24 | 56
a | 24 | 213
I want to add a column that displays the share of the particular level in a partner. For example, the additional column in the above example would read 50% for both rows with partner a and 100% for the rest. I tried a for loop like this
for (i in 1:nrow(df)) {
for (a in partners) {
if (df$Partner[i] == a) {
df$Share[i] <- df$Volume[i] / filter(aggregate(.~Partner, df, sum), Partner %in% i)$Volume
break
}
}
}
There is no error, but the data frame remains unchanged. What am I doing wrong, and is there a better way to do this? I'll be performing this on a very large dataset.
Here is a solution in base R :
df <- data.frame(Partner = c("a", "b", "c", "a"), Level = c(32, 32, 24, 23), Volume = c(213, 450, 56, 213))
df$Share <- sapply(1:nrow(df), function(k) df$Volume[k] / sum(df$Volume[df$Partner[k] == df$Partner]))*100
df
Partner Level Volume Share
1 a 32 213 50
2 b 32 450 100
3 c 24 56 100
4 a 23 213 50
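A fairly straightforward base R solution would be to compute a frequency table of Partner, divide 1 by each frequency, and then merge it with the original data frame. Note that this matches the volume share only when all rows of a partner have the same Volume, as they do in the example: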
Share <- as.data.frame(1/table(df$Partner))
names(Share) <- c("Partner", "Share")
df <- merge(df, Share, by = "Partner")
#### OUTPUT ####
Partner Level Volume Share
1 a 32 213 0.5
2 a 24 213 0.5
3 b 32 450 1.0
4 c 24 56 1.0
Another, cleaner solution might be to use dplyr:
library(dplyr)
df %>%
group_by(Partner) %>%
mutate(Share = 1/n())
#### OUTPUT ####
# A tibble: 4 x 4
# Groups: Partner [3]
Partner Level Volume Share
<chr> <int> <int> <dbl>
1 a 32 213 0.5
2 b 32 450 1
3 c 24 56 1
4 a 24 213 0.5
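If the volumes within a partner differ, 1/n() no longer equals the share of volume. A minimal base R sketch that divides each Volume by the partner total is shown below; the last Volume is changed to 639 on purpose (an invented value) so that the two rows of partner a get different shares.

```r
df <- data.frame(
  Partner = c("a", "b", "c", "a"),
  Level   = c(32, 32, 24, 24),
  Volume  = c(213, 450, 56, 639)  # 639 is invented to make the shares unequal
)

# ave() returns, for every row, the total Volume of that row's Partner.
partner_total <- ave(df$Volume, df$Partner, FUN = sum)

# Share of each level within its partner, in percent.
df$Share <- 100 * df$Volume / partner_total
```

This is vectorised, so it should scale to large data frames better than a row-by-row loop.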
Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
> df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
Employee = sample(1:20,20),
n_Projects = sample(50:100, 20, replace=TRUE),
Performance = round(sample(1:100,20,replace=TRUE)/100,2),
stringsAsFactors = FALSE)
> df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to have a weight equal to the n_Projects value over the total n_Projects for that boss.
For example, for Boss 1 we have a total of 604 n_Projects, where Employee 1 has a weighted Performance of 0.13 (78/604 * 0.97 = 0.13), Employee 3 a weighted Performance of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these weighted Performance values is the Boss performance, which for Boss 1 is 0.52. So, the final output should look like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
In addition to this problem, can you give me any feedback on my code specifically, or any recommendation to improve my data-manipulation skills? (You can see in my profile that I have asked a lot of questions like this, but I'm still not able to solve them on my own.)
We can get the sum of the product of 'n_Projects' and 'Performance' and divide it by 'total_projects':
library(dplyr)
df %>%
group_by(Boss) %>%
summarise(total_projects = sum(n_Projects),
Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance/total_projects)
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and @akrun's answer:
You must have received the following error message:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
## Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you made for Weight_Project does not yield a single value for each Boss, but 7. summarise is there to summarise several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise it into a single value.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
group_by(Boss) %>%
mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the number of total rows in df, but sum(n_Projects) is calculated for each Boss value thanks to group_by.
Once you have a project weight for each row (which depends on the boss), you can calculate the weighted mean, which is a mean and thus a summary value, with summarise.
A more compact way that still shows the weighted calculation would be:
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))
# Reordering to minimise parentheses, which is @akrun's answer
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
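The quantity computed above is a weighted mean, so base R's weighted.mean can serve as a cross-check. The sketch below retypes the 20 rows printed in the question (the original came from sample() without a seed, so it cannot be regenerated) and should be read as a check under that assumption, not as the preferred solution.

```r
# Rows copied from the data frame printed in the question.
df <- data.frame(
  Boss        = c(3, 1, 1, 2, 2, 2, 2, 1, 2, 3, 1, 1, 2, 2, 3, 3, 1, 2, 1, 1),
  Employee    = c(8, 3, 11, 5, 2, 9, 19, 20, 17, 14, 1, 7, 6, 10, 16, 4, 18, 15, 12, 13),
  n_Projects  = c(79, 59, 76, 85, 75, 66, 85, 79, 79, 77, 78, 72, 62, 53, 91, 98, 63, 90, 80, 97),
  Performance = c(0.57, 0.18, 0.43, 0.12, 0.10, 0.60, 0.36, 0.65, 0.90, 0.41,
                  0.97, 0.52, 0.69, 0.97, 0.94, 0.63, 0.95, 0.33, 0.48, 0.07)
)

by_boss <- split(df, df$Boss)

result <- data.frame(
  Boss           = as.integer(names(by_boss)),
  total_Projects = vapply(by_boss, function(g) sum(g$n_Projects), numeric(1)),
  # weighted.mean(x, w) computes sum(x * w) / sum(w)
  Performance    = vapply(by_boss, function(g) weighted.mean(g$Performance, g$n_Projects),
                          numeric(1))
)
```

The totals and weighted means agree with the tibble shown in the first answer (604, 595, 345 and roughly 0.518, 0.475, 0.649).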
Some sample data first:
set.seed(123)
dat <- data.frame(day= 1:50 ,demand = sample(0:17, size = 50,replace = T),supply = sample(2:9, size = 50,replace = T))
reservoir <- 200
I have demand and supply data from day 1 to day 50,
and a fourth column which is the difference between supply and demand:
dat$balance <- dat$supply - dat$demand
I want to calculate another column called net deficit. Here's the logic:
If, for a given day, Demand > Supply, a deficit exists. However, this deficit can be met by the reservoir, and hence the net deficit column gets zero.
If Supply > Demand, the excess supply is added to the reservoir (only if reservoir < 200). If the reservoir is at its full capacity (200), the excess supply is discarded.
If Demand > Supply and the reservoir is zero, then the net deficit column gets the difference between Demand and Supply.
For example, starting with day 1, there was a deficit (balance) of 3. This deficit is met by the reservoir (making it 197), and the net deficit is zero.
Day 2: the balance is -9, which is borrowed from the reservoir (making it 188), and the net deficit is zero again.
Day 3: there is an excess of 1, which is used to fill the reservoir (since reservoir < 200); the net deficit gets a value of 0 and the reservoir becomes 189.
Day 4: there is a deficit of 13 in balance, which is met by the reservoir. The reservoir further reduces to 176.
I hope this is clear.
If at some point the reservoir becomes 0, the deficit cannot be compensated and therefore the net deficit gets the value of dat$balance.
The solution is basically using a for loop to construct the reservoir vector based on the balance each day. The provided sample did not actually manage to empty the reservoir in 50 days, so I made it longer (but this means the numbers are not the same as the 50 day example). You can then simply bind the vector as a column to your data, and set the net_deficit column to zero while reservoir is positive.
library(tidyverse)
set.seed(123)
dat <- tibble(
day = 1:100,
demand = sample(0:17, size = 100,replace = T),
supply = sample(2:9, size = 100,replace = T)
)
balance <- dat$supply - dat$demand
reservoir <- rep(200, nrow(dat))
reservoir[1] <- reservoir[1] + balance[1]
for (day in 2:nrow(dat)){
reservoir[day] <- reservoir[day - 1] + balance[day]
}
out <- dat %>%
bind_cols(balance = balance, reservoir = reservoir) %>%
mutate(net_deficit = ifelse(reservoir >= 0, 0, reservoir))
out[61:70, ]
# A tibble: 10 x 6
day demand supply balance reservoir net_deficit
<int> <int> <int> <int> <dbl> <dbl>
1 61 11 6 - 5 3.00 0
2 62 1 4 3 6.00 0
3 63 6 7 1 7.00 0
4 64 4 4 0 7.00 0
5 65 14 4 -10 - 3.00 - 3.00
6 66 8 6 - 2 - 5.00 - 5.00
7 67 14 7 - 7 -12.0 -12.0
8 68 14 3 -11 -23.0 -23.0
9 69 14 5 - 9 -32.0 -32.0
10 70 7 4 - 3 -35.0 -35.0
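The loop above lets the reservoir rise past 200 and fall below zero, whereas the rules in the question cap it at 200 and stop it at 0. A possible variant that enforces both limits is sketched below. The function name simulate_reservoir and its arguments are made up for illustration, and the handling of a day on which the reservoir covers only part of the deficit (the unmet remainder is reported) is an assumption, since the question does not spell that case out.

```r
# Hypothetical helper: walks through the daily balance (supply - demand),
# keeping the reservoir between 0 and `capacity`.
simulate_reservoir <- function(balance, capacity = 200, start = capacity) {
  n <- length(balance)
  reservoir   <- numeric(n)
  net_deficit <- numeric(n)
  level <- start
  for (i in seq_len(n)) {
    level <- level + balance[i]
    if (level > capacity) level <- capacity  # surplus beyond capacity is discarded
    if (level < 0) {
      net_deficit[i] <- level                # unmet part, negative like dat$balance
      level <- 0                             # reservoir cannot go below empty
    }
    reservoir[i] <- level
  }
  data.frame(balance = balance, reservoir = reservoir, net_deficit = net_deficit)
}

# The four days described in the question.
first_days <- simulate_reservoir(c(-3, -9, 1, -13))

# A small made-up case that hits both the cap and the empty reservoir.
small <- simulate_reservoir(c(5, -8, -6, -4, 3), capacity = 10)
```

For the four days described in the question this gives reservoir levels of 197, 188, 189 and 176 with a net deficit of zero throughout. On real data it would be called as simulate_reservoir(dat$supply - dat$demand).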