My data frame has 3 columns: the first contains the names of different partners, the second contains different levels (all partners have all levels), and the third contains the volume of that particular level with that partner. It looks like this:
Partner | Level | Volume
a | 32 | 213
b | 32 | 450
c | 24 | 56
a | 24 | 213
I want to add a column that displays the share of each level within its partner. For example, the additional column in the above example would read 50% for both rows with partner a and 100% for the rest. I tried a for loop like this:
for (i in 1:nrow(df)) {
for (a in partners) {
if (df$Partner[i] == a) {
df$Share[i] <- df$Volume[i] / filter(aggregate(.~Partner, df, sum), Partner %in% i)$Volume
break
}
}
}
There is no error, but the data frame remains unchanged. What am I doing wrong, and is there a better way to do this? I'll be performing this on a very large dataset.
Here is a solution in base R :
df <- data.frame(Partner = c("a", "b", "c", "a"), Level = c(32, 32, 24, 24), Volume = c(213, 450, 56, 213))
# For each row, divide its volume by the total volume of its partner
df$Share <- sapply(seq_len(nrow(df)), function(k) df$Volume[k] / sum(df$Volume[df$Partner[k] == df$Partner])) * 100
df
Partner Level Volume Share
1 a 32 213 50
2 b 32 450 100
3 c 24 56 100
4 a 24 213 50
A fairly straightforward base R solution would be to compute a frequency table of Partner, divide 1 by each frequency, and then merge it with the original dataframe
Share <- as.data.frame(1/table(df$Partner))
names(Share) <- c("Partner", "Share")
df <- merge(df, Share, by = "Partner")
#### OUTPUT ####
Partner Level Volume Share
1 a 32 213 0.5
2 a 24 213 0.5
3 b 32 450 1.0
4 c 24 56 1.0
Another, cleaner solution might be to use dplyr:
library(dplyr)
df %>%
group_by(Partner) %>%
mutate(Share = 1/n())
#### OUTPUT ####
# A tibble: 4 x 4
# Groups: Partner [3]
Partner Level Volume Share
<chr> <int> <int> <dbl>
1 a 32 213 0.5
2 b 32 450 1
3 c 24 56 1
4 a 24 213 0.5
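Note that the table() and n() versions compute 1 / (number of rows per partner). That matches the expected output here only because both rows for partner a have the same volume. If the share should be weighted by Volume, as the question describes, a minimal sketch with base R's ave() could look like this (the data frame is rebuilt from the question's example, and the Share column name is kept from the answers above):

```r
df <- data.frame(Partner = c("a", "b", "c", "a"),
                 Level   = c(32, 32, 24, 24),
                 Volume  = c(213, 450, 56, 213))

# ave() returns, for every row, the sum of Volume within that row's Partner,
# so the division is done once over the whole column instead of row by row
df$Share <- df$Volume / ave(df$Volume, df$Partner, FUN = sum)
df$Share
```

Since nothing here loops over rows in R code, this should also behave reasonably on a large dataset, though that is worth timing on the real data.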
I have a large dataset ~1M rows with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
# Generate Example Customer Data
set.seed(1)
n_customers <- 10
customer_df <-
tibble(id = c(1:n_customers),
score = sample(50:80, n_customers, replace = TRUE))
# Rating Map
rating_map <- tibble(
max = c(
47.0,
53.0,
57.0,
60.5,
63.0,
65.5,
67.3,
69.7,
71.7,
74.0,
76.3,
79.0,
82.5,
85.5,
100.00
),
rating = c(15:1)
)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <-
customer_df %>%
mutate(rating = map(.x = score,
.f = ~max(select(filter(rating_map, .x < max),rating))
)
) %>%
unnest(rating)
The problem I'm having is that while it works, it is extremely inefficient. If you set n_customers to 100,000 in the above code, you can get a sense of how long it takes to run.
customer_df
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency in the code to be my use of the purrr::map() function. So my question is how I could replicate the above results without using the map() function?
Thanks!
customer_df$rating <- length(rating_map$max) -
  cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster. It takes about 1/20th of a second on 1M rows, which, compared with over an hour, works out to a more than 72,000x speedup.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
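To see what right = FALSE changes, here is a small sketch on a shortened set of breaks. The breaks and scores vectors are made up for illustration and only cover the first few rows of the rating map:

```r
breaks <- c(47, 53, 57, 60.5)
scores <- c(50, 53, 56.9, 57)

# right = FALSE: intervals are [47, 53), [53, 57), [57, 60.5),
# so a score equal to a break falls into the interval that starts there
left_closed <- cut(scores, breaks = breaks, labels = FALSE, right = FALSE)

# default (right = TRUE): intervals are (47, 53], (53, 57], (57, 60.5],
# so a score equal to a break stays in the interval that ends there
right_closed <- cut(scores, breaks = breaks, labels = FALSE)

left_closed   # 1 2 2 3
right_closed  # 1 1 2 2
```

Only the scores that sit exactly on a break (53 and 57) differ between the two calls, which is the mismatch the edit above refers to.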
We could do a non-equi join
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
-output
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
findInterval(customer_df$score, rating_map$max)
-output
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
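As a quick sanity check, the cut and findInterval versions can be compared on a larger sample. This is only a sketch: the breaks vector copies the max column of rating_map, and the scores are simulated in the same 50 to 80 range as the question's example.

```r
set.seed(1)
breaks <- c(47, 53, 57, 60.5, 63, 65.5, 67.3, 69.7, 71.7,
            74, 76.3, 79, 82.5, 85.5, 100)
scores <- sample(50:80, 1e5, replace = TRUE)

# Both treat the intervals as closed on the left and open on the right
via_cut <- length(breaks) - cut(scores, breaks = breaks, labels = FALSE, right = FALSE)
via_findInterval <- length(breaks) - findInterval(scores, breaks)

all(via_cut == via_findInterval)
```

The two only diverge outside the range of the breaks: for a score below 47 or at 100 and above, cut returns NA, whereas findInterval returns 0 or 15, giving a rating of 15 or 0 respectively. Which behaviour is preferable depends on how out-of-range scores should be treated.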
I'm interested in filtering out data based on a set of rules.
I have a dataset that contains play data for all games in which a team had a .8 win probability at some point. What I'd like to do is find that point in which the win probability reached .8 and remove every play thereafter until the next game data begins. The dataset contains numerous games so once a game ends data from a new one begins in which the win probability goes back to around .5.
Here are the relevant columns and each row is a play in the game:
game_id = unique num for each game
team = team that will eventually get an .8 win prob
play_id = num that is increased (but not necessarily in sequential order for some reason) after each play
win_per = num showing what the team's win probability was at the start of that recorded play
Example df
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
So in this small example, I have recorded 5 plays for each of two teams (a and b), both of which reached a win_per of at least .8 at some point in the game. In both cases, I want all the plays removed AFTER the team attained this .8 mark, regardless of whether win_per kept rising or fell back below .8.
So team a would have the final two rows of data removed (win_per == .81 and .85) and team b would have the final row removed (win_per == .77).
I'm imagining running a for loop that checks if the team in any row is the same team as in the prior row and, if so, finds the row with win_per >= .8 and the lowest play_id (as this would be the first time the team reached .8), then somehow removes the rows following that match UNTIL the team != prior row's team.
Of course, you might know a better way as well. Thank you so much for helping me out!
No need to use a loop, that whole selection can be performed in 1 line using the dplyr package:
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
library(dplyr)
# group by team and game
# find the index of the first row that reaches 0.80 and store it in a temporary column
# keep rows 1 up to and including that row
# remove the temporary column
df %>% group_by(team, game_id) %>%
  mutate(g80 = min(which(win_per >= 0.80))) %>%
  slice(seq_len(first(g80))) %>%
  select(-g80)
# A tibble: 7 x 4
# Groups: team [2]
game_id team play_id win_per
<dbl> <fct> <dbl> <dbl>
1 122 a 1 0.5
2 122 a 5 0.6
3 122 a 22 0.86
4 144 b 45 0.54
5 144 b 47 0.43
6 144 b 55 0.47
7 144 b 58 0.81
Here is a base R way using cumsum in ave
subset(df, ave(win_per >= 0.8, game_id, FUN = function(x) c(0, cumsum(x)[-length(x)])) == 0)
# game_id team play_id win_per
#1 122 a 1 0.50
#2 122 a 5 0.60
#3 122 a 22 0.86
#6 144 b 45 0.54
#7 144 b 47 0.43
#8 144 b 55 0.47
#9 144 b 58 0.81
and using the similar concept in dplyr
library(dplyr)
df %>% group_by(game_id) %>% filter(lag(cumsum(win_per >= 0.8) == 0, default = TRUE))
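The cumsum idea behind both versions can be traced step by step on a single game. The sketch below uses the win_per values of game 122 from the example; the names hit, seen_before and keep are only for illustration:

```r
win_per <- c(.5, .6, .86, .81, .85)

# TRUE on every play where the threshold is met
hit <- win_per >= 0.8                        # FALSE FALSE TRUE TRUE TRUE

# How many threshold plays happened strictly before each play
seen_before <- c(0, head(cumsum(hit), -1))   # 0 0 0 1 2

# Keep a play only if the threshold had not been reached earlier,
# which keeps the first play at or above .8 and drops everything after it
keep <- seen_before == 0                     # TRUE TRUE TRUE FALSE FALSE
win_per[keep]
```

Shifting the cumulative sum by one position is what keeps the play that first reaches .8; without the shift that play would be dropped as well.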
I have a dataset that looks like this:
> df
teams people entries
1 A Team 6fd1 49
2 A Team 1df5 4
3 A Team 2hgt 19
4 A Team 8akt 4
5 A Team sdf9 19
6 B Team asc1 42
7 B Team abm8 32
8 B Team plo9 38
9 B Team 90la 5
10 B Team 8uil 23
> dput(df)
structure(list(teams = c("A Team", "A Team", "A Team", "A Team",
"A Team", "B Team", "B Team", "B Team", "B Team", "B Team"),
people = c("6fd1", "1df5", "2hgt", "8akt", "sdf9", "asc1",
"abm8", "plo9", "90la", "8uil"), entries = c(49, 4, 19, 4,
19, 42, 32, 38, 5, 23)), .Names = c("teams", "people", "entries"
), row.names = c(NA, -10L), class = "data.frame")
I am able to get, for each team, the share of people needed to reach 75% of the team's entries by doing this, albeit messy and probably not the best way:
# sorted df and added cumulative percentage/sum and row number per team
> df
teams people entries cumulative_sum cumulative_perc number
1 A Team 6fd1 49 49 51.57895 1
3 A Team 2hgt 19 68 71.57895 2
5 A Team sdf9 19 87 91.57895 3
2 A Team 1df5 4 91 95.78947 4
4 A Team 8akt 4 95 100.00000 5
7 B Team abm8 89 89 45.17766 1
6 B Team asc1 42 131 66.49746 2
8 B Team plo9 38 169 85.78680 3
10 B Team 8uil 23 192 97.46193 4
9 B Team 90la 5 197 100.00000 5
# from this view, each team has 3/5 people (60%) reaching the minimum 75%
# entries, and using ddply, we can get that
ddply(df, 'teams', summarise,
marker = min(which(cumulative_perc > 75)),
total = NROW(teams),
seventyfive = marker/total)
teams marker total seventyfive
1 A Team 3 5 0.6
2 B Team 3 5 0.6
While that works, I want to count only the fraction of the third person's entries that is actually needed to reach 75% of the team's entries. For example, for A Team, 75% of their entries is 72 (rounded up), so only 4 of the third person's 19 entries are needed, giving that team 2.21/5 instead of 3/5.
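df %>%
  arrange(teams, desc(entries)) %>%  # largest contributors first, as in the sorted view
  group_by(teams) %>%
  summarise(seventyfive = {
    tmp1 <- ceiling(0.75 * sum(entries))  # entries needed to reach 75%
    tmp2 <- sum(cumsum(entries) < tmp1)   # people counted in full
    tmp2 + (tmp1 - sum(entries[seq_len(tmp2)])) / entries[tmp2 + 1]
  })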
# A tibble: 2 x 2
# teams seventyfive
# <chr> <dbl>
# 1 A Team 2.21
# 2 B Team 2.78
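tmp1 is 75% of the team's entries (rounded up), while tmp2 is the largest number of people whose cumulative entries still stay below that target. The final line then adds the fraction of the next person's entries needed to reach it. This assumes the rows are sorted by entries in descending order within each team.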
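The arithmetic for A Team can be checked by hand in base R. This is a sketch using A Team's entries sorted in descending order; target, full and partial are illustrative names standing in for tmp1, tmp2 and the fractional term:

```r
entries <- c(49, 19, 19, 4, 4)             # A Team, largest first

target <- ceiling(0.75 * sum(entries))     # 0.75 * 95 = 71.25, rounded up to 72
full   <- sum(cumsum(entries) < target)    # cumsum is 49, 68, 87, ... so 2 people

# Entries still missing after the first two people, as a share of the third person's entries
partial <- (target - sum(entries[seq_len(full)])) / entries[full + 1]   # (72 - 68) / 19

full + partial
```

The result is 2 + 4/19, which rounds to the 2.21 shown for A Team.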
lead() gives you the value from the next row within the current group.
The approach below keeps the one row per team where the entries still missing to reach the 75% target amount to a fraction (between 0 and 1) of the next person's entries.
df %>%
group_by(teams) %>%
arrange(teams, -entries) %>%
mutate(delta = (ceiling(0.75 * sum(entries)) - cumsum(entries)) / lead(entries),
marker = row_number() + delta) %>%
filter(delta >= 0 & delta <= 1) %>%
select(teams, marker)
# A tibble: 2 x 2
# Groups: teams [2]
teams marker
<chr> <dbl>
1 A Team 2.21
2 B Team 2.78
I need to create a sample from my dataframe, and to do so I am using the code below.
name <- sample(c("Adam","John","Henry","Mike"),100,rep = TRUE)
area <- sample(c("run","develop","test"),100,rep = TRUE)
id <- sample(100:200,100,rep = FALSE)
mydata <- as.data.frame(cbind(id,area,name))
qcsample <- mydata %>%
group_by(area) %>%
nest() %>%
mutate(n = c(20, 15, 15)) %>%
mutate(samp = map2(data, n, sample_n)) %>%
select(area, samp) %>%
unnest(samp)
Now, I am getting these results.
table(qcsample$area)
develop run test
15 15 20
--
table(qcsample$name)
Adam Henry John Mike
9 9 16 16
I would like to create a sample that has more or less the same number of rows for each name, e.g. Adam - 12, Henry - 12, John - 13, Mike - 13.
How can I achieve that? Can I somehow request that the sample be equally distributed?
Also, in this example I used the function sample_n and specified the number of samples.
I am anticipating that sometimes a given group will not contain the required number of rows. In my example I am taking 20 samples from the area called "test", but sometimes there will be only, let's say, 10 rows containing "test". The total number is 50, so if there are only 10 "test" rows, the code has to automatically increase the others, giving "test" - 10, "run" - 20 and "develop" - 20. This can happen to any of the areas, so I need to test whether there are enough rows to create the sample and increase the other areas if not. If the difference is only 1, it can be added to any of the remaining areas; if the difference is 3, we add 1 to one area and 2 to the other.
How could I check that, taking into account all the possibilities? I believe there are eight permutations in this case.
Thanks in advance A.
If you are using made up data then you can create a minimum amount of each row and then create filler to get you up to the total:
set.seed(42)
names <- c("Adam", "John", "Henry", "Mike")
areas <- c("run", "develop", "test")
totalrows <- 100
minname <- 22 # No fewer than 22 of each name (set near the threshold of 25 to test)
minarea <- 30 # No less than 30 of each area (less randomness the higher these are)
qcsample <- data.frame(
name=sample(c(rep(names, minname), sample(names, totalrows-length(names)*minname, replace=T))),
area=sample(c(rep(areas, minarea), sample(areas, totalrows-length(areas)*minarea, replace=T))),
id=sample(99+(1:totalrows))
)
This results in:
R> table(qcsample$name)
Adam Henry John Mike
23 28 24 25
R> table(qcsample$area)
develop run test
37 31 32
Notice that the count of name to area isn't constrained:
R> table(qcsample[,-3])
area
name develop run test
Adam 5 11 7
Henry 11 8 9
John 10 7 7
Mike 11 5 9
R>
Using a loop as suggested by @r2evans:
library(dplyr)
set.seed(42)
mydata <- data.frame(
name = sample(c("Adam","John","Henry","Mike"), 100, rep = TRUE),
area = sample(c("run","develop","test"), 100, rep = TRUE),
id = sample(100:200, 100, rep = FALSE)
)
Nsamples <- 50
mysample <- data.frame(sample_n(mydata, Nsamples))
minname <- 11 # max is 50/4 -> 12
minarea <- 15 # max is 50/3 -> 16
# the test you were asking about
while( (min(table(mysample$name)) < minname) || (min(table(mysample$area)) < minarea) ) {
mysample <- data.frame(sample_n(mydata, Nsamples))
}
This results in:
R> table(mysample$name)
Adam Henry John Mike
13 15 11 11
R> table(mysample$area)
develop run test
15 17 18
And, like before, there's no minimum of name to area.
R> table(mysample[-3])
area
name develop run test
Adam 4 3 6
Henry 2 6 7
John 4 4 3
Mike 5 4 2
If you needed to enforce a minimum number for each name/area combination, add this to the test:
while(... || (min(table(mysample[-3])) < some_min)) {
BTW, the number of permutations, as you can see by the table, is the number of names times the number of areas.
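The shortfall rule from the question (if one area has too few rows, raise the quotas of the others so the total stays at 50) could be sketched roughly as follows. adjust_quota is a hypothetical helper, not part of dplyr, and handing out the extra rows one at a time to the smallest remaining quota is just one possible choice:

```r
# Hypothetical helper: cap each quota at the rows available, then hand the
# shortfall to groups that still have spare rows, one row at a time.
adjust_quota <- function(wanted, available) {
  quota <- pmin(wanted, available)
  shortfall <- sum(wanted) - sum(quota)
  while (shortfall > 0) {
    spare <- which(available - quota > 0)
    if (length(spare) == 0) break            # not enough rows overall
    # pick the group with the smallest current quota among those with spare rows
    pick <- spare[which.min(quota[spare])]
    quota[pick] <- quota[pick] + 1
    shortfall <- shortfall - 1
  }
  quota
}

# The question's scenario: 20 "test" rows wanted but only 10 available
wanted    <- c(test = 20, run = 15, develop = 15)
available <- c(test = 10, run = 40, develop = 50)
adjust_quota(wanted, available)
```

For this scenario the result is test = 10, run = 20, develop = 20. In practice the available counts would come from something like table(mydata$area), and the adjusted quotas would need to be matched to the areas by name before being used as sample sizes.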
Here's another thought.
Depending on your desired end-size, it might over-create the number of samples so that it can reduce some name/area pairs to bring the total down.
Let's say that you want to end up with a total of 50 rows:
final_size <- 50
For completeness, here are the sets from which we'll choose:
avail_names <- c("Adam", "John", "Henry", "Mike")
avail_areas <- c("run", "develop", "test")
and the minimum we need to create for Adam,run (etc) in order to certainly end up with no less than final_size rows:
size_per_namearea <- ceiling(final_size / (length(avail_names) * length(avail_areas)))
Ok, generate at least as many (likely more than) the number of rows we need:
library(dplyr)
library(tidyr)  # for crossing()
set.seed(20180920)
qcsample <- crossing(tibble(rownum = seq_len(size_per_namearea)),
                     tibble(name = avail_names),
                     tibble(area = avail_areas)) %>%
  group_by(name, area) %>%
  mutate(id = sample(100, size = n(), replace = FALSE))
qcsample
# # A tibble: 60 x 4
# # Groups: name, area [12]
# rownum name area id
# <int> <chr> <chr> <int>
# 1 1 Adam run 59
# 2 1 Adam develop 51
# 3 1 Adam test 23
# 4 1 John run 71
# 5 1 John develop 5
# 6 1 John test 24
# 7 1 Henry run 4
# 8 1 Henry develop 29
# 9 1 Henry test 79
# 10 1 Mike run 77
# # ... with 50 more rows
Verify we have identical sample sizes for each name/area:
xtabs(~ name + area, data = qcsample) %>%
stats::addmargins()
# area
# name develop run test Sum
# Adam 5 5 5 15
# Henry 5 5 5 15
# John 5 5 5 15
# Mike 5 5 5 15
# Sum 20 20 20 60
If we just did head(final_size), we would know in advance which names get cut short, which undermines the randomness of your sampling a little. That is why I added rownum up front: arranging by rownum plus a random jitter keeps every row with rownum below max(rownum) and then a random subset of the rows with rownum equal to max(rownum). Each name/area pair therefore ends up with either max(rownum)-1 or max(rownum) rows, so your tallies never differ by more than 1.
reducedsample <- arrange(qcsample, rownum + runif(n())) %>%
head(final_size) %>%
select(-rownum)
reducedsample %>%
xtabs(~ name + area, data = .) %>%
stats::addmargins()
# area
# name develop run test Sum
# Adam 4 4 5 13
# Henry 5 4 4 13
# John 4 4 4 12
# Mike 4 4 4 12
# Sum 17 16 17 50
I'm relatively new to R and still learning. I have the following data frame, named data:
ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016
I am looking to count the number of people (in this case only two unique individuals) who passed their tests after multiple attempts (passing is defined as 65 or over). So the final product would return a list of unique IDs who needed multiple attempts before their test score hit 65. This would tell me that approx. 66% of the clients in this data frame required multiple test sessions before getting a passing grade.
Below is my idea or concept more or less, I've framed it as an if statement
If ID appears twice
count how often it appears, until TEST GRADE >= 65
ifelse(duplicated(data$ID), count(ID), NA)
I'm struggling with the second piece where I want to say, count the occurrence of ID until grade >=65.
The other option I see is some sort of loop. Below is my attempt
for (i in data$ID) {
  duplicated(data$ID)
  count(data$ID)
  # here is where something would say: until grade >= 65
}
Again the struggle comes in how to tell R to stop counting when grade hits 65.
Appreciate the help!
You can use data.table:
library(data.table)
dt <- fread(" ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
# count the number of attempts per ID, then keep only the successful attempts
dt <- dt[, N := .N, by = ID][grade >= 65]
# proportion of successful test takers who needed more than one attempt
length(dt[N > 1]$ID) / length(dt$ID)
[1] 0.6666667
Another option, though the other two work just fine:
library(dplyr)
dat2 <- dat %>%  # dat is the data frame from the question
group_by(ID) %>%
summarize(
multiattempts = n() > 1 & any(grade < 65),
maxgrade = max(grade)
)
dat2
# Source: local data frame [3 x 3]
# ID multiattempts maxgrade
# <int> <lgl> <int>
# 1 1 TRUE 73
# 2 2 TRUE 76
# 3 3 FALSE 66
sum(dat2$multiattempts) / nrow(dat2)
# [1] 0.6666667
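The question also asks how to count attempts until the grade first reaches 65. A minimal base R sketch of that count is below. It assumes the rows are already in chronological order within each ID, and the attempts name is only for illustration:

```r
df <- data.frame(ID    = c(1, 1, 1, 2, 2, 3),
                 grade = c(56, 63, 73, 41, 76, 66))

# For each ID, the position of the first grade of 65 or more
# (NA if that person never passed)
attempts <- tapply(df$grade, df$ID, function(g) which(g >= 65)[1])
attempts

# Share of people who needed more than one attempt to pass
mean(attempts > 1, na.rm = TRUE)
```

For the example data this gives 3, 2 and 1 attempts for IDs 1, 2 and 3, so two of the three people needed more than one attempt.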
Here is a method using the aggregate function and subsetting that returns the maximum score for testers who took the test more than once, considering their attempts from the second test onward.
multiTestMax <- aggregate(grade~ID, data=df[duplicated(df$ID),], FUN=max)
multiTestMax
ID grade
1 1 73
2 2 76
To get the number of rows, you can use nrow:
nrow(multiTestMax)
[1] 2
or the proportion of all test takers
nrow(multiTestMax) / length(unique(df$ID))
data
df <- read.table(header=T, text="ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")