I have a dataset that has several hundred variables with hundreds of observations. Each observation has a unique identifier, and is associated with one of approximately 50 groups. It looks like so (the variables I'm not concerned about have been ignored below):
ID Group Score
1 10 400
2 11 473
3 12 293
4 13 382
5 14 283
6 11 348
7 11 645
8 13 423
9 10 434
10 10 124
I would like to calculate an adjusted mean for each observation that needs to use the N-count for each Group, the sum of Scores for that Group, as well as the means for the Scores of each group. (So, in the example above, the N-count for Group 11 is three, the sum is 1466, and the mean is 488.67, and I would use these numbers only on IDs 2, 6, and 7).
I've been fiddling with plyr, and am able to extract the n-counts and means as follows (accounting for missing Scores and Group values):
new_data <- ddply(main_data, "Group", summarise, N = sum(!is.na(Score)), mean = mean(Score, na.rm = TRUE))
I'm stuck, though, on how to get the sum of the scores for a particular group, and then how to calculate the adjusted means either within the main_data set or a new dataset. Any help would be appreciated.

Here is the plyr way.
ddply(main_data, .(Group), summarize, N = sum(!is.na(Score)), mean = mean(Score, na.rm = TRUE), total = sum(Score, na.rm = TRUE))
Group N mean total
1 10 3 319.3333 958
2 11 3 488.6667 1466
3 12 1 293.0000 293
4 13 2 402.5000 805
5 14 1 283.0000 283
Check out the dplyr package.
main_data %>% group_by(Group) %>% summarize(n = n(), mean = mean(Score, na.rm = TRUE), total = sum(Score, na.rm = TRUE))
Source: local data frame [5 x 4]
Group n mean total
1 10 3 319.3333 958
2 11 3 488.6667 1466
3 12 1 293.0000 293
4 13 2 402.5000 805
5 14 1 283.0000 283
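To use those group figures on each observation, as the question asks, one option is to attach them to main_data row by row instead of collapsing to one row per group. Below is a minimal base R sketch using ave(), with the example data retyped from the question. The new column names (N, total, group_mean) are placeholders, and the adjusted-mean formula itself is not given in the question, so it is left out.

```r
# Example data retyped from the question
main_data <- data.frame(
  ID    = 1:10,
  Group = c(10, 11, 12, 13, 14, 11, 11, 13, 10, 10),
  Score = c(400, 473, 293, 382, 283, 348, 645, 423, 434, 124)
)

# ave() returns a vector as long as its input, repeating each
# group's result on every row that belongs to that group
main_data$N <- ave(main_data$Score, main_data$Group,
                   FUN = function(x) sum(!is.na(x)))
main_data$total <- ave(main_data$Score, main_data$Group,
                       FUN = function(x) sum(x, na.rm = TRUE))
main_data$group_mean <- main_data$total / main_data$N

# Rows for Group 11 (IDs 2, 6 and 7) now carry N, total and group_mean
main_data[main_data$Group == 11, ]
```

With dplyr, swapping summarize() for mutate() after group_by(Group) should give the same row-level columns.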


R dplyr: How do I apply a less than / greater than mapping table across a large dataset efficiently?

I have a large dataset ~1M rows with, among others, a column that has a score for each customer record. The score is between 0 and 100.
What I'm trying to do is efficiently map the score to a rating using a rating table. Each customer receives a rating between 1 and 15 based on the customer's score.
# Generate Example Customer Data
n_customers <- 10
customer_df <-
tibble(id = c(1:n_customers),
score = sample(50:80, n_customers, replace = TRUE))
# Rating Map
rating_map <- tibble(
max = c(
rating = c(15:1)
The best code that I've come up with to map the rating table onto the customer score data is as follows.
customer_df <-
customer_df %>%
mutate(rating = map(.x = score,
.f = ~max(select(filter(rating_map, .x < max),rating))
) %>%
The problem I'm having is that while it works, it is extremely inefficient. If you set n = 100k in the above code, you can get a sense of how long it takes to work.
# A tibble: 10 x 3
id score rating
<int> <int> <int>
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
I need to speed up the code because it's currently taking over an hour to run. I've identified the inefficiency in the code to be my use of the purrr::map() function. So my question is how I could replicate the above results without using the map() function?
customer_df$rating <- length(rating_map$max) -
  cut(customer_df$score, breaks = rating_map$max, labels = FALSE, right = FALSE)
This produces the same output and is much faster. It takes about 1/20th of a second on 1M rows, which, against a run of over an hour, works out to a speedup of more than 72,000x.
It seems like this is a good use case for the base R cut function, which assigns values to a set of intervals you provide.
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
In this case you want the lowest rating for the highest score, hence the subtraction of the cut term from the length of the breaks.
EDIT -- added right = FALSE because you want the intervals to be closed on the left and open on the right. Now matches your output exactly; previously had different results when the value matched a break.
We could do a non-equi join with data.table
library(data.table)
setDT(rating_map)[customer_df, on = .(max > score), mult = "first"]
max rating id
<int> <int> <int>
1: 74 5 1
2: 53 13 2
3: 56 13 3
4: 50 14 4
5: 51 14 5
6: 78 4 6
7: 72 6 7
8: 60 12 8
9: 63 10 9
10: 67 9 10
Or another option in base R is with findInterval
customer_df$rating <- nrow(rating_map) -
findInterval(customer_df$score, rating_map$max)
> customer_df
id score rating
1 1 74 5
2 2 53 13
3 3 56 13
4 4 50 14
5 5 51 14
6 6 78 4
7 7 72 6
8 8 60 12
9 9 63 10
10 10 67 9
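To see why the findInterval lookup reproduces the row-by-row rule from the question, here is a small self-contained check. The rating map in it (rating_map_hyp) is hypothetical and is not the table from the question, whose break values are not shown above. It also indexes the rating column directly instead of subtracting from the number of rows, a small variation that does not rely on the ratings being a consecutive n:1 sequence.

```r
# Hypothetical rating map, NOT the values from the question
rating_map_hyp <- data.frame(
  max    = c(20, 40, 60, 80, 101),
  rating = 5:1
)
score <- c(5, 20, 39, 60, 100)

# findInterval() counts how many breaks are <= each score, so the
# next row holds the first break strictly greater than the score
rating_fast <- rating_map_hyp$rating[findInterval(score, rating_map_hyp$max) + 1]

# Reference: the slow rule from the question, applied one score at a time
rating_slow <- sapply(score, function(s) {
  max(rating_map_hyp$rating[s < rating_map_hyp$max])
})

all(rating_fast == rating_slow)
```

Two edge cases are worth checking on real data: a score below the first break gets the highest rating here (whereas cut() would return NA for it), and a score at or above the last break gets NA.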

Calculation within a pipe between different rows of a data frame

I have a tibble with a column of different numbers. I wish to calculate for every one of them how many others before them are within a certain range.
For example, let's say that range is 200; in the tibble below, the result for the 5th number would be 2, that is, the cardinality of the list {816, 705}, whose numbers are above 872-1-200 = 671 but below 872.
I have thought of something along these lines:
for every row of the tibble, evaluate between(X, Y) on theTibble$number_list;
then sum the returned logical vector.
I have been told that using loops is less efficient.
Is there a clean way to do this within a pipe without using loops?
Not the way you asked for it, but you can use a bit of linear algebra. It should be more efficient and simpler than a loop.
number_list <- c(248,650,705,816,872,991,1156,1157,1180,1277)
m <- matrix(number_list, nrow = length(number_list), ncol = length(number_list))
d <- (t(m) - number_list)
cutoff <- 200
# I used setNames to name the result, but you do not need to
# We count inclusive of 0 in case of ties
setNames(colSums(d >= 0 & d < cutoff) - 1, number_list)
Which gives you the following named vector.
248 650 705 816 872 991 1156 1157 1180 1277
0 0 1 2 2 2 1 2 3 3
Here is another way that is pipe-able, using rollapply() from the zoo package.
library(dplyr)
library(zoo)
cutoff <- 200
# df is the tibble holding the number_list column
df %>%
  mutate(count = rollapply(number_list,
                           width = seq_along(number_list),
                           function(x) sum((tail(x, 1) - head(x, -1)) <= cutoff),
                           align = "right"))
Which gives you another column.
# A tibble: 10 x 2
number_list count
<int> <int>
1 248 0
2 650 0
3 705 1
4 816 2
5 872 2
6 991 2
7 1156 1
8 1157 2
9 1180 3
10 1277 3
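If the column is sorted in increasing order, as in this example, the same counts can also be obtained without building an n x n matrix or a growing window. The sketch below is an alternative not taken from the answers above: it uses base R findInterval() and counts earlier values that are strictly closer than cutoff, like the matrix answer. Adjust the boundary if a difference of exactly cutoff should count.

```r
number_list <- c(248, 650, 705, 816, 872, 991, 1156, 1157, 1180, 1277)
cutoff <- 200

# Assumes number_list is sorted in increasing order.
# findInterval(v, number_list) counts the elements that are <= v,
# i.e. the earlier values that lie at least `cutoff` below each number
too_far <- findInterval(number_list - cutoff, number_list)

# Everything before position i, minus the ones that are too far away
count <- seq_along(number_list) - 1L - too_far
count
```

Inside a pipe, the same expression should work in mutate(), for example with row_number() in place of seq_along(number_list).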

Return values with matching conditions in r

I would like to return values with matching conditions in another column based on a cut score criterion. If the cut scores are not available in the variable, I would like to grab closest larger value. Here is a snapshot of dataset:
ids <- c(1,2,3,4,5,6,7,8,9,10)
scores.a <- c(512,531,541,555,562,565,570,572,573,588)
scores.b <- c(12,13,14,15,16,17,18,19,20,21)
data <- data.frame(ids, scores.a, scores.b)
> data
ids scores.a scores.b
1 1 512 12
2 2 531 13
3 3 541 14
4 4 555 15
5 5 562 16
6 6 565 17
7 7 570 18
8 8 572 19
9 9 573 20
10 10 588 21
cuts <- c(531, 560, 571)
I would like to grab the scores.b value corresponding to the first cut score, which is 13. Then, grab the scores.b value corresponding to the second cut score (560); it is not in scores.a, so I would like to use the scores.a value 562 (the closest larger value to 560), whose corresponding value is 16. Lastly, for the third cut score (571), I would like to get 19, which corresponds to the closest larger value (572).
Here is what I would like to get.
cut.1 13
cut.2 16
cut.3 19
Any thoughts?
We can use a rolling join with data.table
library(data.table)
setDT(data)[data.table(cuts = cuts), .(ids = ids, cuts, scores.b),
            on = .(scores.a = cuts), roll = -Inf]
# ids cuts scores.b
#1: 2 531 13
#2: 5 560 16
#3: 8 571 19
Or another option is findInterval from base R after changing the sign and taking the reverse
with(data, scores.b[rev(nrow(data) + 1 - findInterval(rev(-cuts), rev(-scores.a)))])
#[1] 13 16 19
This doesn't remove the other columns, but this illustrates correct results better
df1 <- data[match(seq_along(cuts), findInterval(data$scores.a, cuts)), ]
rownames(df1) <- paste("cuts", seq_along(cuts), sep = ".")
> df1
ids scores.a scores.b
cuts.1 2 531 13
cuts.2 5 562 16
cuts.3 8 572 19
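For readers who find the findInterval indexing hard to follow, the rule "take the first scores.a that is at least the cut" can be written out directly. This is a plain base R sketch and not one of the answers above; it assumes scores.a is sorted in increasing order, and it is slower on large data because it scans the vector once per cut.

```r
scores.a <- c(512, 531, 541, 555, 562, 565, 570, 572, 573, 588)
scores.b <- c(12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
cuts <- c(531, 560, 571)

# For each cut, find the position of the first score that is >= the cut.
# which(...)[1] is NA when no score reaches the cut, so the result is NA too
idx <- sapply(cuts, function(k) which(scores.a >= k)[1])

result <- setNames(scores.b[idx], paste("cut", seq_along(cuts), sep = "."))
result
```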

group_by() summarise() and weights percentages - R

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:
> df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
Employee = sample(1:20,20),
n_Projects = sample(50:100, 20, replace=TRUE),
Performance = round(sample(1:100,20,replace=TRUE)/100,2),
stringsAsFactors = FALSE)
> df
Boss Employee n_Projects Performance
1 3 8 79 0.57
2 1 3 59 0.18
3 1 11 76 0.43
4 2 5 85 0.12
5 2 2 75 0.10
6 2 9 66 0.60
7 2 19 85 0.36
8 1 20 79 0.65
9 2 17 79 0.90
10 3 14 77 0.41
11 1 1 78 0.97
12 1 7 72 0.52
13 2 6 62 0.69
14 2 10 53 0.97
15 3 16 91 0.94
16 3 4 98 0.63
17 1 18 63 0.95
18 2 15 90 0.33
19 1 12 80 0.48
20 1 13 97 0.07
The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: each Performance value has to have a weight equal to the n_Projects value over the total n_Projects for that boss.
For example, for Boss 1 we have a total of 604 n_Projects, where Employee 1 has a weighted Performance of 0.13 (78/604 * 0.97 = 0.13), Employee 3 a weighted Performance of 0.02 (59/604 * 0.18 = 0.02), and so on. The sum of these weighted Performance values is the Boss performance, which for Boss 1 is 0.52. So, the final output should be like this:
Boss total_Projects Performance
1 604 0.52
2 340 0.18 #the values for boss 2 are invented
3 230 0.43 #the values for boss 3 are invented
However, I'm still struggling with this:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
In addition to this problem, can you give me any feedback about this problem (my code, specifically) or any recommendation to improve data-manipulations skills? (you can see in my profile that I have asked a lot of questions like this, but still I'm not able to solve them on my own)
We can take the sum of the products of 'n_Projects' and 'Performance' and divide it by 'total_projects'
df %>%
group_by(Boss) %>%
summarise(total_projects = sum(n_Projects),
Weight_Project = sum(n_Projects * Performance)/total_projects)
# or
# Weight_Project = n_Projects %*% Performance/total_projects)
# A tibble: 3 x 3
# Boss total_projects Weight_Project
# <int> <int> <dbl>
#1 1 604 0.518
#2 2 595 0.475
#3 3 345 0.649
Adding some more details about what you did and @akrun's answer:
You must have received the following error message:
df %>%
group_by(Boss) %>%
summarise(total_Projects = sum(n_Projects),
Weight_Project = n_Projects/sum(total_Projects))
## Error in summarise_impl(.data, dots) :
## Column `Weight_Project` must be length 1 (a summary value), not 7
This tells you that the calculation you made for Weight_Project does not yield a single value for each Boss, but 7 values. summarise is there to summarise several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise the result into a single value.
Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps:
df %>%
group_by(Boss) %>%
mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>%
summarise(weighted_mean_performance = sum(Weight_Performance * Performance))
The mutate statement preserves the number of total rows in df, but sum(n_Projects) is calculated for each Boss value thanks to group_by.
Once each row has a project weight (which depends on the boss), you can calculate the weighted mean, which is a single summary value, with summarise.
A more compact way that still shows the weighted calculation would be:
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))
# Reordering to minimise parentheses, which is @akrun's answer
df %>%
group_by(Boss) %>%
summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))
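The quantity computed above is an ordinary weighted mean, so base R's weighted.mean() can serve as a cross-check. The sketch below is not part of either answer; it uses only the Boss 1 and Boss 3 rows retyped from the question's printed data (df_sub is a made-up name for that subset) and groups with split() so that it runs without dplyr.

```r
# Boss 1 and Boss 3 rows retyped from the data frame printed in the question
df_sub <- data.frame(
  Boss        = c(1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3),
  n_Projects  = c(59, 76, 79, 78, 72, 63, 80, 97, 79, 77, 91, 98),
  Performance = c(0.18, 0.43, 0.65, 0.97, 0.52, 0.95, 0.48, 0.07,
                  0.57, 0.41, 0.94, 0.63)
)

# weighted.mean(x, w) computes sum(x * w) / sum(w) for each boss
res <- sapply(split(df_sub, df_sub$Boss), function(g) {
  weighted.mean(g$Performance, g$n_Projects)
})
round(res, 3)
```

Inside summarise(), the equivalent call would be weighted.mean(Performance, n_Projects).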

Top-n-Box (Likert Scale) by factor groups in a dataframe

I have the following dataframe, which is the result of a cluster analysis with ten 7-point Likert attitude scales for specific product benefits (see the 'variable' column). Here, n is the number of persons stating a specific value for each benefit and cum is the total number of persons in each cluster. n2 is the share of each answer among all answers per cluster, in percent (n2 = n / cum * 100).
Now, I want to create a new column, aggregating / summing up the top-n (indicated in 'value' column) percent (indicated in n2) for each benefit, e.g. a new column "Top-3-Box" with e.g. a value of 46.5 for rows 1-7/Benefit.1 (which is the sum of the n2 of the rows with the top-3 value 7,6,5). It would be great if there would be a solution for this, which is instantly applicable in dplyr.
Please see the dataframe below:
cluster variable value n cum n2
<int> <chr> <dbl> <int> <int> <dbl>
1 1 Benefit.1 1 11 86 12.8
2 1 Benefit.1 2 11 86 12.8
3 1 Benefit.1 3 6 86 7
4 1 Benefit.1 4 18 86 20.9
5 1 Benefit.1 5 16 86 18.6
6 1 Benefit.1 6 14 86 16.3
7 1 Benefit.1 7 10 86 11.6
8 1 Benefit.10 1 10 86 11.6
9 1 Benefit.10 2 13 86 15.1
10 1 Benefit.10 3 8 86 9.3
# ... with 40 more rows
I highly appreciate your support!
We can do a group-by sum of 'n2', subsetting the rows whose 'value' is one of the top 3 levels (5, 6, 7)
df1 %>%
  group_by(cluster, variable) %>%
  mutate(percent = sum(n2[value %in% 5:7]))
If the rows are already ordered by 'value' within each 'cluster', 'variable' group, then we can just take the last three 'n2'
df1 %>%
  group_by(cluster, variable) %>%
  mutate(percent = sum(tail(n2, 3)))
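The Top-3-Box figure from the question (46.5 for Benefit.1, the sum of n2 for values 7, 6 and 5) can be checked with a small base R sketch. Only the seven Benefit.1 rows shown in the question are retyped here; top_n and top_box are illustrative names, and the code assumes the top of the scale is the largest value present in the data.

```r
# Benefit.1 rows of cluster 1, retyped from the question
df1 <- data.frame(
  cluster  = 1L,
  variable = "Benefit.1",
  value    = 1:7,
  n2       = c(12.8, 12.8, 7, 20.9, 18.6, 16.3, 11.6)
)
top_n <- 3

# Keep n2 only for the top_n highest scale values, then sum it within
# each cluster/variable group and repeat that sum on every row
df1$top_box <- ave(
  ifelse(df1$value > max(df1$value) - top_n, df1$n2, 0),
  df1$cluster, df1$variable,
  FUN = sum
)
df1
```

Changing top_n to 2 would give a Top-2-Box column in the same way.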
