In my dataframe, I have multiple columns with student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values, and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values).
One issue is that some students have grades that tie for the top 2 grades in a given row. For example, Aaron has a high score of 42, but then there are two scores that tie for the second highest (i.e., 36).
df <- structure(list(
  Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
              "Timothy", "Grant", "Chloe", "Judy", "Justin"),
  ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096,
         98432, 19487, 94029),
  Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
  Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
  Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
  Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame"))
I know that I can use pivot_longer to do this, which allows me to arrange by group, then take the top 2 values for each student. This works fine, but I would like a more efficient way with tidyverse, rather than having to pivot back and forth.
What I Tried
df %>%
pivot_longer(-c(Student, ID)) %>%
group_by(Student, ID) %>%
arrange(desc(value), .by_group = TRUE) %>%
slice_head(n = 2) %>%
pivot_wider(names_from = name, values_from = value) %>%
ungroup() %>%
mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))
I also know that if I wanted to sum all the columns on each row, then I could use rowSums, as I made use of above. However, I am unsure how to do rowSums of just the top 2 values in the 4 quiz columns.
Expected Output
# A tibble: 10 × 7
Student ID Quiz2 Quiz3 Quiz1 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 42 36 NA NA 78
2 Charlotte 61755 NA NA 41 39 80
3 Chloe 98432 49 NA 42 NA 91
4 Grant 34096 48 44 NA NA 92
5 James 87311 NA 36 NA 43 79
6 Judy 19487 NA 42 NA 41 83
7 Justin 94029 NA 37 NA 41 78
8 Katie 55323 NA 32 NA 46 78
9 Olivia 94839 NA 43 NA 40 83
10 Timothy 38209 38 38 NA NA 76
df %>%
rowwise() %>%
mutate(Quiz_Total = sum(sort(c(Quiz1, Quiz2, Quiz3, Quiz4), decreasing = TRUE)[1:2])) %>%
ungroup()
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78
With base R, select just the quiz result columns and you can treat them like a matrix. Use apply to sort each row in decreasing order and keep the first two elements, then use colSums to total them.
df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2]))
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 31 42 36 36 78
#> 2 James 87311 25 33 36 43 79
#> 3 Charlotte 61755 41 34 34 39 80
#> 4 Katie 55323 10 22 32 46 78
#> 5 Olivia 94839 35 23 43 40 83
#> 6 Timothy 38209 19 38 38 38 76
#> 7 Grant 34096 27 48 44 43 92
#> 8 Chloe 98432 42 49 42 35 91
#> 9 Judy 19487 15 23 42 41 83
#> 10 Justin 94029 20 30 37 41 78
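To see why colSums (rather than rowSums) is the right reducer here, it may help to look at what apply returns. This is only a minimal sketch; the matrix `m` is a stand-in holding the first three students' quiz scores from the question:

```r
# Stand-in matrix: rows are Aaron, James and Charlotte, columns are Quiz1..Quiz4
m <- matrix(c(31, 42, 36, 36,
              25, 33, 36, 43,
              41, 34, 34, 39),
            nrow = 3, byrow = TRUE)

# For each row, keep the two largest values.
# apply() binds the per-row results as columns, giving a 2 x 3 matrix.
top2 <- apply(m, 1, function(x) sort(x, decreasing = TRUE)[1:2])
dim(top2)

# Summing each column therefore gives one total per original row
totals <- colSums(top2)
totals
```

Each column of `top2` holds one student's top two scores, which is why the totals come out of colSums.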
You do not have to do pivot_wider. Note that the longer format is the tidy format. Just do pivot_longer and left_join:
df %>%
left_join(pivot_longer(., -c(Student, ID)) %>%
group_by(Student, ID) %>%
summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))
# A tibble: 10 x 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 31 42 36 36 78
2 James 87311 25 33 36 43 79
3 Charlotte 61755 41 34 34 39 80
4 Katie 55323 10 22 32 46 78
5 Olivia 94839 35 23 43 40 83
6 Timothy 38209 19 38 38 38 76
7 Grant 34096 27 48 44 43 92
8 Chloe 98432 42 49 42 35 91
9 Judy 19487 15 23 42 41 83
10 Justin 94029 20 30 37 41 78
Yet another solution, based on tidyverse:
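df %>%
  rowwise() %>%
  mutate(
    # Multiply by 1 for the two highest-ranked quizzes and by NA for the rest;
    # ties.method = "last" resolves ties in favour of the earlier column
    Quiz = list(c_across(starts_with("Quiz")) *
      if_else(rank(c_across(starts_with("Quiz")), ties.method = "last") > 2, 1, NA_real_)),
    across(matches("\\d$"), ~ NULL),
    total = sum(Quiz, na.rm = TRUE)
  ) %>%
  unnest_wider(Quiz, names_sep = "")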
#> # A tibble: 10 × 7
#> Student ID Quiz1 Quiz2 Quiz3 Quiz4 total
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Aaron 30016 NA 42 36 NA 78
#> 2 James 87311 NA NA 36 43 79
#> 3 Charlotte 61755 41 NA NA 39 80
#> 4 Katie 55323 NA NA 32 46 78
#> 5 Olivia 94839 NA NA 43 40 83
#> 6 Timothy 38209 NA 38 38 NA 76
#> 7 Grant 34096 NA 48 44 NA 92
#> 8 Chloe 98432 42 49 NA NA 91
#> 9 Judy 19487 NA NA 42 41 83
#> 10 Justin 94029 NA NA 37 41 78
(A bit messy) Base R Solution:
# Store the names of quiz columns as a vector: quiz_colnames => character vector
quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)
# Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]
# Store an Idx based on the ID: Idx => integer vector:
Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))
# Split-Apply-Combine to calculate the top 2 quizes: res => data.frame
res <- data.frame(
  do.call(
    rbind,
    lapply(
      split(df, Idx),
      function(x){
        # Extract the top 2 quiz vectors: top_2_quizes => named integer vector
        top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
        # Calculate the quiz columns not used: remainder_quiz_cols => character vector
        remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
        # Nullify the remaining quizes: x => data.frame
        x[, remainder_quiz_cols] <- NA_integer_
        # Calculate the resulting data.frame: data.frame => env
        transform(
          x,
          Total = sum(top_2_quizes)
        )[,c(non_quiz_colnames, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
      }
    )
  ),
  row.names = NULL,
  stringsAsFactors = FALSE
)
Try this base R approach, which also produces the NAs:
cbind( df[,1:2], t( sapply( seq_along( 1:nrow( df ) ), function(x){
ord <- order( unlist( df[x,3:6] ) )[1:2]; arow <- df[x,3:6];
arow[ord] <- NA; ttl <- rowSums( arow[-ord], na.rm=T );
cbind( arow,Total=ttl ) } ) ) )
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78
As @akrun suggested above, collapse is another efficient possibility. radixorder provides an integer ordering vector; applying it twice gives each value's rank within its row, so only the top 2 values in each row are kept while the others are replaced with NA. Then, rowSums is used to get the totals for each row.
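library(collapse)

ftransform(gvr(df, "Student|ID"),
           # Apply the function to each row (MARGIN = 1) of the quiz columns
           dapply(
             gvr(df, "^Quiz"),
             MARGIN = 1,
             FUN = function(x)
               replace(x, radixorder(radixorder(x)) %in% 1:2, NA)
           )) %>%
  ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))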
# A tibble: 10 × 7
Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron 30016 NA 42 NA 36 78
2 James 87311 NA NA 36 43 79
3 Charlotte 61755 41 NA NA 39 80
4 Katie 55323 NA NA 32 46 78
5 Olivia 94839 NA NA 43 40 83
6 Timothy 38209 NA NA 38 38 76
7 Grant 34096 NA 48 44 NA 92
8 Chloe 98432 NA 49 42 NA 91
9 Judy 19487 NA NA 42 41 83
10 Justin 94029 NA NA 37 41 78
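The double ordering used above is a way of computing ranks. Here is a base R sketch of the same idea, using order() as a stand-in for collapse's radixorder() (swapped in for illustration only, it is not the code from the answer) on Aaron's four scores:

```r
x <- c(31, 42, 36, 36)   # Aaron's Quiz1..Quiz4 scores

# order(order(x)) gives each element's rank (1 = smallest);
# tied values are ranked by position, earlier first
r <- order(order(x))
r

# Blank out the two lowest-ranked scores and sum what remains
kept <- replace(x, r %in% 1:2, NA)
kept
sum(kept, na.rm = TRUE)
```

Because the earlier of two tied values gets the lower rank, Quiz3 is dropped and Quiz4 is kept for Aaron, which matches the output above.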
Thank you for taking a look at my question!
I have the following (dummy) data for patient performance on 3 tasks:
patient_df = data.frame(id = seq(1:5),
age = c(30,72,46,63,58),
education = c(11, 22, 18, 12, 14),
task1 = c(21, 28, 20, 24, 22),
task2 = c(15, 15, 10, 11, 14),
task3 = c(82, 60, 74, 78, 78))
> patient_df
id age education task1 task2 task3
1 1 30 11 21 15 82
2 2 72 22 28 15 60
3 3 46 18 20 10 74
4 4 63 12 24 11 78
5 5 58 14 22 14 78
I also have the following (dummy) lookup table for age and education-based cutoff values to define a patient's performance as impaired or not impaired on each task:
cutoffs = data.frame(age = rep(seq(from = 35, to = 70, by = 5), 2),
education = c(rep("<16", 8), rep(">=16",8)),
task1_cutoff = c(rep(24, 16)),
task2_cutoff = c(11,11,11,11,10,10,10,10,9,13,13,13,13,12,12,11),
task3_cutoff = c(rep(71,8), 70, rep(74,2), rep(73, 5)))
> cutoffs
age education task1_cutoff task2_cutoff task3_cutoff
1 35 <16 24 11 71
2 40 <16 24 11 71
3 45 <16 24 11 71
4 50 <16 24 11 71
5 55 <16 24 10 71
6 60 <16 24 10 71
7 65 <16 24 10 71
8 70 <16 24 10 71
9 35 >=16 24 9 70
10 40 >=16 24 13 74
11 45 >=16 24 13 74
12 50 >=16 24 13 73
13 55 >=16 24 13 73
14 60 >=16 24 12 73
15 65 >=16 24 12 73
16 70 >=16 24 11 73
My goal is to create 3 new variables in patient_df that indicate whether or not a patient is impaired on each task with a binary indicator. For example, for id=1 in patient_df, their age is <=35 and their education is <16 years, so the cutoff value for task1 would be 24, the cutoff value for task2 would be 11, and the cutoff value for task3 would be 71, such that scores below these values would denote impairment.
I would like to do this for each id by referencing the age and education-associated cutoff value in the cutoff dataset, so that the outcome would look something like this:
> goal_patient_df
id age education task1 task2 task3 task1_impaired task2_impaired task3_impaired
1 1 30 11 21 15 82 1 1 0
2 2 72 22 28 15 60 0 0 1
3 3 46 18 20 10 74 1 1 0
4 4 63 12 24 11 78 1 0 0
5 5 58 14 22 14 78 1 0 0
In actuality, my patient_df has 600+ patients and there are 7+ tasks each with age- and education-associated cutoff values, so a 'clean' way of doing this would be greatly appreciated! My only alternative that I can think of right now is writing a TON of if_else statements or case_whens which would not be incredibly reproducible for anyone else who would use my code :(
Thank you in advance!
I would recommend putting both your lookup table and patient_df dataframe in long form. I think that might be easier to manage with multiple tasks.
Your education column is numeric; so converting to character "<16" or ">=16" will help with matching in lookup table.
fuzzy_inner_join matches the data against the lookup table so that task and education match exactly (==), while age must fall between age_low and age_high, the range of ages you specify for each lookup table row.
Finally, impaired is calculated comparing the values from the two data frames for the particular task.
Please note that id 1 is missing from the output, because that patient's age falls outside the age ranges in the lookup table. You can add more rows to that table to address this.
library(tidyverse)
library(fuzzyjoin)

cutoffs_long <- cutoffs %>%
pivot_longer(cols = starts_with("task"), names_to = "task", values_to = "cutoff_value", names_pattern = "task(\\d+)") %>%
mutate(age_low = age,
age_high = age + 4) %>%
select(-age)

patient_df %>%
pivot_longer(cols = starts_with("task"), names_to = "task", values_to = "patient_value", names_pattern = "(\\d+)") %>%
mutate(education = ifelse(education < 16, "<16", ">=16")) %>%
fuzzy_inner_join(cutoffs_long, by = c("age" = "age_low", "age" = "age_high", "education", "task"), match_fun = list(`>=`, `<=`, `==`, `==`)) %>%
mutate(impaired = +(patient_value < cutoff_value))
# A tibble: 12 x 11
id age education.x task.x patient_value education.y task.y cutoff_value age_low age_high impaired
<int> <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <int>
1 2 72 >=16 1 28 >=16 1 24 70 74 0
2 2 72 >=16 2 15 >=16 2 11 70 74 0
3 2 72 >=16 3 60 >=16 3 73 70 74 1
4 3 46 >=16 1 20 >=16 1 24 45 49 1
5 3 46 >=16 2 10 >=16 2 13 45 49 1
6 3 46 >=16 3 74 >=16 3 74 45 49 0
7 4 63 <16 1 24 <16 1 24 60 64 0
8 4 63 <16 2 11 <16 2 10 60 64 0
9 4 63 <16 3 78 <16 3 71 60 64 0
10 5 58 <16 1 22 <16 1 24 55 59 1
11 5 58 <16 2 14 <16 2 10 55 59 0
12 5 58 <16 3 78 <16 3 71 55 59 0
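If it helps to see the range match spelled out, the same logic can be sketched in base R with merge() followed by a filter on the age range. This is an illustration rather than a replacement for fuzzy_inner_join; `patients` and `lookup` are small hypothetical tables shaped like the long-format data above:

```r
# Hypothetical long-format inputs (task 2 only, ids 3 and 4)
patients <- data.frame(id = c(3, 4), age = c(46, 63),
                       education = c(">=16", "<16"),
                       task = "2", patient_value = c(10, 11))
lookup <- data.frame(education = c(">=16", "<16"),
                     task = "2",
                     age_low = c(45, 60), age_high = c(49, 64),
                     cutoff_value = c(13, 10))

# Exact match on education and task, then keep rows whose age falls in range
joined <- merge(patients, lookup, by = c("education", "task"))
joined <- joined[joined$age >= joined$age_low & joined$age <= joined$age_high, ]

# 1 = impaired (score below the cutoff), 0 = not impaired
joined$impaired <- as.integer(joined$patient_value < joined$cutoff_value)
joined[order(joined$id), c("id", "task", "patient_value", "cutoff_value", "impaired")]
```

A patient whose age matches no range simply drops out at the filter step, which is the same reason id 1 is absent from the output above.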
I have a data frame with a lot of company information, separated by an id variable. I want to sort by one of the variables, separately for every id. Let's take this example:
df <- structure(list(id = c(110, 110, 110, 90, 90, 90, 90, 252, 252
), var1 = c(26, 21, 54, 10, 18, 9, 16, 54, 39), var2 = c(234,
12, 43, 32, 21, 19, 16, 34, 44)), .Names = c("id", "var1", "var2"
), row.names = c(NA, -9L), class = "data.frame")
Which looks like this
id var1 var2
1 110 26 234
2 110 21 12
3 110 54 43
4 90 10 32
5 90 18 21
6 90 9 19
7 90 16 16
8 252 54 34
9 252 39 44
Now, I want to sort the data frame by var1 within each id. The easiest solution I can think of is using the apply function, like this:
> apply(df, 2, sort)
id var1 var2
[1,] 90 9 12
[2,] 90 10 16
[3,] 90 16 19
[4,] 90 18 21
[5,] 110 21 32
[6,] 110 26 34
[7,] 110 39 43
[8,] 252 54 44
[9,] 252 54 234
However, this is not the output I am seeking. The correct output should be,
id var1 var2
1 110 21 12
2 110 26 234
3 110 54 43
4 90 9 19
5 90 10 32
6 90 16 16
7 90 18 21
8 252 39 44
9 252 54 34
Group by id and sort by var1 column and keep original id column order.
Any idea how to sort like this?
Note. As mentioned by Moody_Mudskipper, there is no need to use tidyverse and can also be done easily with base R:
df[order(ordered(df$id, unique(df$id)), df$var1), ]
A one-liner tidyverse solution w/o any temp vars:
df %>% arrange(ordered(id, unique(id)), var1)
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
Explanation of why apply(df, 2, sort) does not work
What you were trying to do is to sort each column independently. apply runs over the specified dimension (2 in this case which corresponds to columns) and applies the function (sort in this case).
apply tries to further simplify the results, in this case to a matrix. So you are getting back a matrix (not a data.frame) where each column is sorted independently. For example this row from the apply call:
# [1,] 90 9 12
does not even exist in the original data.frame.
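A tiny example may make the difference concrete. This is a sketch on a made-up two-row data frame `d` (not the data from the question):

```r
d <- data.frame(id = c(2, 1), var1 = c(10, 20))

# Sorting every column on its own pairs id 1 with var1 10,
# a combination that never appears in d
s <- apply(d, 2, sort)
s
is.matrix(s)

# order() computes one permutation and applies it to whole rows,
# so id 1 stays paired with var1 20
d[order(d$id), ]
```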
Another base R option using order and match
df[with(df, order(match(id, unique(id)), var1, var2)), ]
# id var1 var2
#2 110 21 12
#1 110 26 234
#3 110 54 43
#6 90 9 19
#4 90 10 32
#7 90 16 16
#5 90 18 21
#9 252 39 44
#8 252 54 34
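The match(id, unique(id)) part is what preserves the original group order. A short sketch of what it computes, using the id column from the question:

```r
id <- c(110, 110, 110, 90, 90, 90, 90, 252, 252)

# unique() keeps values in order of first appearance
unique(id)

# match() turns each id into its position in that list,
# so 110 sorts first, then 90, then 252
key <- match(id, unique(id))
key
```

Passing `key` as the first argument to order() sorts by first appearance of the group, and var1 then breaks ties within each group.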
We can convert the id to factor in order to split while preserving the original order. We can then loop over the list and order, and rbind again, i.e.
df$id <- factor(df$id, levels = unique(df$id))
new_df <- do.call(rbind, lapply(split(df, df$id), function(i) i[order(i$var1), ]))
new_df
# id var1 var2
#110.2 110 21 12
#110.1 110 26 234
#110.3 110 54 43
#90.6 90 9 19
#90.4 90 10 32
#90.7 90 16 16
#90.5 90 18 21
#252.9 252 39 44
#252.8 252 54 34
NOTE: You can reset the rownames by rownames(new_df) <- NULL
In base R we could use split<- :
split(df,df$id) <- lapply(split(df,df$id), function(x) x[order(x$var1),] )
or as @Markus suggests :
split(df, df$id) <- by(df, df$id, function(x) x[order(x$var1),])
output in either case :
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
With the following tidyverse pipe, the question's output is reproduced.
df %>%
mutate(tmp = cumsum(c(0, diff(id) != 0))) %>%
group_by(id) %>%
arrange(tmp, var1) %>%
select(-tmp)
## A tibble: 9 x 3
## Groups: id [3]
# id var1 var2
# <dbl> <dbl> <dbl>
#1 110 21 12
#2 110 26 234
#3 110 54 43
#4 90 9 19
#5 90 10 32
#6 90 16 16
#7 90 18 21
#8 252 39 44
#9 252 54 34
Let's say I have got a data.frame like the following:
df = read.table(text = 'A B
11 98
11 87
11 999
11 22
12 34
12 34
12 44
12 98
17 77
17 67
17 87
17 66
33 6
33 45
33 12
33 10', header = TRUE)
I need to group df by col A and select only a given number of rows based on the following vector:
n_rows = c(2, 3, 4, 2)
So that the first group will have only 2 rows (no matter their order), the second group 3 rows, etc...
Here my expected output:
11 98
11 87
12 34
12 34
12 44
17 77
17 67
17 87
17 66
33 6
33 45
I tried to do the trick with dplyr by doing the following:
df %>%
group_by(A) %>%
top_n(n = n_rows, wt =B)
but I got the following error:
Error: n must be a scalar integer
Any suggestion?
Another base R option,
do.call(rbind, Map(function(x, y) x[seq(y),], split(df, df$A), n_rows))
which gives,
11.1 11 98
11.2 11 87
12.5 12 34
12.6 12 34
12.7 12 44
17.9 17 77
17.10 17 67
17.11 17 87
17.12 17 66
33.13 33 6
33.14 33 45
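One thing to watch with the Map() call above: split() returns the groups in sorted order of A, so n_rows has to line up with that order. It does for the question's data, where A is already sorted. A sketch of the difference, using a hypothetical data frame `d` whose groups are not sorted:

```r
# Hypothetical data where the groups are not already in sorted order
d <- data.frame(A = c(33, 33, 11, 11, 11), B = 1:5)
n_rows <- c(1, 2)   # intended: 1 row for group 33, 2 rows for group 11

# split() orders its pieces by sorted group value, not by appearance
names(split(d, d$A))

# Splitting on a factor with explicit levels keeps the order of appearance
f <- factor(d$A, levels = unique(d$A))
res <- do.call(rbind, Map(function(x, y) x[seq_len(y), ], split(d, f), n_rows))
res
```

seq_len(y) is used here instead of seq(y) only so that a count of 0 would return no rows.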
Here's a possibility, splitting first the data.frame then using map2:
df %>% split(.$A) %>%
map2(n_rows, head) %>%
bind_rows()
# A B
# 1 11 98
# 2 11 87
# 3 12 34
# 4 12 34
# 5 12 44
# 6 17 77
# 7 17 67
# 8 17 87
# 9 17 66
# 10 33 6
# 11 33 45
If order doesn't matter you don't need top_n, head works just fine (and faster), else just replace head with top_n.
Here is also a tidy solution, a few characters longer but maybe more satisfying as you don't separate things of the same "kind" but rather work completely inside of the data.frame (same output).
df %>% nest(data = B) %>%
mutate(data = map2(data, n_rows, head)) %>%
unnest(data)
In base R, you can do something like:
df2 <- data.frame()
for (i in seq_along(unique(df$A))) {
df2 <- rbind(df2, df[df$A == unique(df$A)[i], ][1:n_rows[i], ])
}
> df2
1 11 98
2 11 87
5 12 34
6 12 34
7 12 44
9 17 77
10 17 67
11 17 87
12 17 66
13 33 6
14 33 45
Here is an option with top_n
df %>%
split(., .$A) %>%
map2_df(., n_rows, ~ top_n(., .y, wt = .$B))
If we are not looking for top_n, then another option is slice
df %>%
group_by(A) %>%
nest() %>%
ungroup() %>%
mutate(newcol = map2(data, n_rows, ~ .x %>% slice(seq(.y)))) %>%
select(-data) %>%
unnest(newcol)
Ok, first of all let me generate some sample data:
A_X01 <- c(34, 65, 23, 43, 22)
A_X02 <- c(2, 4, 7, 8, 3)
B_X01 <- c(24, 45, 94, 23, 54)
B_X02 <- c(4, 2, 4, 9, 1)
C_X01 <- c(34, 65, 876, 45, 87)
C_X02 <- c(123, 543, 86, 87, 34)
Var <- c(3, 5, 7, 2, 3)
DF <- data.frame(A_X01, A_X02, B_X01, B_X02, C_X01, C_X02, Var)
What I want to do is apply an equation to the concurrent columns of A and B for both X01 and X02, with a third column "Var" used in the equation.
So far I have been doing this the following way:
DF$D_X01 <- (DF$A_X01 + DF$B_X01) * DF$Var
DF$D_X02 <- (DF$A_X02 + DF$B_X02) * DF$Var
My desired output is as follows:
A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
1 34 2 24 4 34 123 3 174 18
2 65 4 45 2 65 543 5 550 30
3 23 7 94 4 876 86 7 819 77
4 43 8 23 9 45 87 2 132 34
5 22 3 54 1 87 34 3 228 12
As you'll appreciate this is a lot of lines of code to do something fairly simple. Meaning at present my scripts are rather long (as I have multiple columns in the actual dataset)!
One of the apply functions must be the way to go but I can't seem to get my head around it for concurrent columns. I did think about using lapply but how would I get this to work for the two lists of columns and for the right columns to be added together?
I've looked around and can't seem to find a way to do this which must be a fairly common problem?
Original question was a bit confusing so have updated with a desired output and some extra conditions.
Try this
indx <- gsub("\\D", "", grep("A_X|B_X", names(DF), value = TRUE)) # Retrieving indexes
indx2 <- DF[grep("A_X|B_X", names(DF))] # Considering only the columns of interest
DF[paste0("D_X", unique(indx))] <-
sapply(unique(indx), function(x) rowSums(indx2[which(indx == x)])*DF$Var)
# A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
# 1 34 2 24 4 34 123 3 174 18
# 2 65 4 45 2 65 543 5 550 30
# 3 23 7 94 4 876 86 7 819 77
# 4 43 8 23 9 45 87 2 132 34
# 5 22 3 54 1 87 34 3 228 12
You may also try
indxA <- grep("^A", colnames(DF))
indxB <- grep("^B", colnames(DF))
f1 <- function(x,y,z) (x+y)*z
DF[sprintf('D_X%02d', indxA)] <- Map(f1 , DF[indxA], DF[indxB], list(DF$Var))
# A_X01 A_X02 B_X01 B_X02 C_X01 C_X02 Var D_X01 D_X02
#1 34 2 24 4 34 123 3 174 18
#2 65 4 45 2 65 543 5 550 30
#3 23 7 94 4 876 86 7 819 77
#4 43 8 23 9 45 87 2 132 34
#5 22 3 54 1 87 34 3 228 12
Or you could use mapply
DF[sprintf('D_X%02d', indxA)] <- mapply(`+`, DF[indxA],DF[indxB])*DF$Var
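The Map and mapply versions pair the A and B columns by position, so they rely on both sets being in the same order. If that is not guaranteed, one possible variation is to pair them by suffix instead. This is a minimal sketch on a cut-down copy of the sample data (first two rows, C columns omitted), assuming every A_ column has a B_ column with the same suffix:

```r
DF <- data.frame(A_X01 = c(34, 65), A_X02 = c(2, 4),
                 B_X01 = c(24, 45), B_X02 = c(4, 2),
                 Var = c(3, 5))

# Suffixes shared by the A_ and B_ columns, e.g. "X01", "X02"
suffix <- sub("^A_", "", grep("^A_", names(DF), value = TRUE))

# Look up each pair by name so column order does not matter
for (s in suffix) {
  DF[[paste0("D_", s)]] <- (DF[[paste0("A_", s)]] + DF[[paste0("B_", s)]]) * DF$Var
}
DF
```

The new columns are named from the suffix itself, so this does not depend on the column positions happening to be 1 and 2.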