I have data showing every student's score, and I have to find out who is in third place. I have to make a list of test scores and a list of students' names.
If two or more people share the score that occupies third place, the output must show all of their names. I still have no idea how to solve this problem.
Example:
names = c('Alex', 'Joy', 'Cindy', 'Lily')
score = c(80, 80,100,90)
Output:
'Students in the third place: Alex, Joy'.
We can use slice_max, which keeps ties by default (with_ties = TRUE), and then filter for the minimum score among the rows it returns:
library(dplyr)
df1 %>%
  slice_max(n = 3, order_by = score) %>%
  filter(score == min(score))
Output:
  names score
1  Alex    80
2   Joy    80
If we need the output as a formatted string:
df1 %>%
  slice_max(n = 3, order_by = score) %>%
  filter(score == min(score)) %>%
  pull(names) %>%
  {glue::glue("Students in the third place: {toString(.)}")}
Students in the third place: Alex, Joy
Data:
df1 <- data.frame(names, score)
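One solution with rank. With the default ties.method = "average", the two students tied for third place both get rank 3.5, so >= 3 picks them up in this example:
df$names[rank(-df$score) >= 3]
[1] "Alex" "Joy"
That condition also matches anyone ranked below third place. If the data contain lower ranks, ties.method = "min" gives every student tied for third the rank 3, so an exact comparison works:
df$names[rank(-df$score, ties.method = "min") == 3]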
Data:
df <- data.frame(
names = c('Alex', 'Joy', 'Cindy', 'Lily'),
score = c(80, 80,100,90)
)
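For reuse, the rank-based idea can be wrapped in a small function. This is only a sketch: nth_place is a hypothetical name that does not appear in the answers above, and it assumes competition ranking (ties.method = "min"), so two students tied for first place would leave second place empty.

```r
# Hypothetical helper: names of everyone whose score holds place n.
# Assumes "competition" ranking, where tied scores share the lowest rank.
nth_place <- function(names, score, n = 3) {
  names[rank(-score, ties.method = "min") == n]
}

names <- c('Alex', 'Joy', 'Cindy', 'Lily')
score <- c(80, 80, 100, 90)

third <- nth_place(names, score, 3)
cat("Students in the third place:", toString(third), "\n")
```

If a dense ranking is wanted instead (no gaps after ties), the comparison could be made against match(score, sort(unique(score), decreasing = TRUE)).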
I need to operate on columns selected by their names. In the following reproducible example, for each column whose name ends with 'x', I create a new column that multiplies the respective variable by 2:
library(dplyr)
set.seed(8)
id <- seq(1,700, by = 1)
a1_x <- runif(700, 0, 10)
a1_y <- runif(700, 0, 10)
a2_x <- runif(700, 0, 10)
df <- data.frame(id, a1_x, a1_y, a2_x)
# Create variables manually: for every column that ends with "x", create one column that multiplies the respective column by 2
df <- df %>%
  mutate(a1_x_new = a1_x * 2,
         a2_x_new = a2_x * 2)
Since I'm working with several columns, I need to automate this process. Does anybody know how to achieve this? Thanks in advance!
Try this:
df %>% mutate(
  across(ends_with("x"), ~ .x * 2, .names = "{.col}_new")
)
Thanks @RicardoVillalba for the correction.
You could use transmute and across to generate the new columns for those column names ending in "x". Then, use rename_with to add the "_new" suffix and bind_cols back to the original data frame.
library(dplyr)
df <- df %>%
  transmute(across(ends_with("x"), ~ . * 2)) %>%
  rename_with(., ~ paste0(.x, "_new")) %>%
  bind_cols(df, .)
Result:
head(df)
  id     a1_x      a1_y     a2_x  a1_x_new  a2_x_new
1  1 4.662952 0.4152313 8.706219  9.325905 17.412438
2  2 2.078233 1.4834044 3.317145  4.156466  6.634290
3  3 7.996580 1.4035441 4.834126 15.993159  9.668252
4  4 6.518713 7.0844794 8.457379 13.037426 16.914759
5  5 3.215092 3.5578827 8.196574  6.430184 16.393149
6  6 7.189275 5.2277208 3.712805 14.378550  7.425611
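For comparison, the same result can probably be had without dplyr. The sketch below is a base R alternative, not what either answer uses; x_cols is just an illustrative variable name, and the pattern "x$" assumes the suffix is the literal last character of the column name.

```r
set.seed(8)
df <- data.frame(id   = 1:700,
                 a1_x = runif(700, 0, 10),
                 a1_y = runif(700, 0, 10),
                 a2_x = runif(700, 0, 10))

# Names of the columns that end in "x"
x_cols <- grep("x$", names(df), value = TRUE)

# Add one "<name>_new" column per match, holding twice the original value
df[paste0(x_cols, "_new")] <- df[x_cols] * 2

names(df)
```

The original columns are left in place, so this corresponds to the mutate/across version rather than the transmute one.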
I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] - var1[my_dates == max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates)-lag(max(my_dates), n=1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
mutate(value_1month = var1[my_dates==max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
mutate(month1 = max(my_dates) %m-%months(1))
#This gives me the second to latest value
df2 <- df1 %>%
mutate(var1_1month = var1[my_dates==max(my_dates) %m-%months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
mutate(diff_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-%months(1)])
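mutate requires each new column to have either length 1 or the same length as the number of rows in the data (or in the group). Subsetting with a condition can return a vector of a different length, including length zero when no date matches. We may need ifelse or case_when to return a value for every row: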
library(dplyr)
library(lubridate)
df1 %>%
  mutate(diff_1month = case_when(my_dates == max(my_dates) ~
                                   my_dates %m-% months(1)))
NOTE: Without a reproducible example, the column types and values are not clear.
Based on the OP's update, we may arrange by date first, take the last two values of var1, and compute their difference:
df1 %>%
  arrange(my_dates) %>%
  mutate(dif_1month = diff(tail(var1, 2)))
    .   my_dates var1 dif_1month
1 XYZ 2021-08-01    3         -1
2 XYZ 2021-09-01    2         -1
3 XYZ 2021-10-01    1         -1
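The question also asks about three months prior and similar gaps. One way to generalise, sketched here in base R, is to look up the value by date rather than by position. diff_k_months is a hypothetical helper name, the country column name is chosen here for readability, and the sketch assumes every date falls on the first of the month (seq() with a month step can roll over at month ends, unlike %m-%).

```r
df1 <- data.frame(
  country  = "XYZ",
  my_dates = as.Date(c("2021-10-01", "2021-09-01", "2021-08-01")),
  var1     = c(1, 2, 3)
)

# Hypothetical helper: latest value minus the value k calendar months earlier.
# Returns NA if no row falls exactly on that earlier date.
diff_k_months <- function(dates, values, k = 1) {
  latest  <- max(dates)
  earlier <- seq(latest, by = paste0("-", k, " month"), length.out = 2)[2]
  values[match(latest, dates)] - values[match(earlier, dates)]
}

# The single result is recycled to every row, as in the answer above
df1$dif_1month <- diff_k_months(df1$my_dates, df1$var1, 1)
df1$dif_2month <- diff_k_months(df1$my_dates, df1$var1, 2)
```

Matching on the date means a missing month yields NA instead of silently comparing against whatever row happens to come second.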
I have a data frame in which the first column indicates the work (manager, employee or worker), the second indicates whether the person works at night or not and the last is a household code (if two individuals share the same code then it means that they share the same house).
# Here is the reproducible data:
PCS <- c("worker", "manager","employee","employee","worker","worker","manager","employee","manager","employee")
work_night <- c("Yes","Yes","No", "No","No","Yes","No","Yes","No","Yes")
HHnum <- c(1,1,2,2,3,3,4,4,5,5)
df <- data.frame(PCS,work_night,HHnum)
My problem is that I would like to have a new data frame with households instead of individuals. I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS", I have new categories based on the combination of answers: manager+worker = "I"; manager+employee = "II"; employee+employee = "VI"; worker+worker = "III"; etc.
For the variable "work_night", I would like to apply a score: if both answered Yes the score is 2, if one answered Yes the score is 1, and if both answered No the score is 0.
To be clear, I would like my data frame to look like this:
HHnum  PCS    work_night
1      "I"    2
2      "VI"   0
3      "III"  1
4      "II"   1
5      "II"   1
How can I do this in R using dplyr? I know that I need group_by() but then I don't know what to use.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference dataframe (i.e., combos) in case you had more categories than 3, which is then joined with the main dataframe (i.e., df_new) to bring in the PCS roman numerals.
library(dplyr)
library(tidyr)
# Create a dataframe with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
as.data.frame() %>%
dplyr::mutate(PCS = as.roman(row_number()))
# Create another dataframe with the columns reversed (will make it easier to join to the main dataframe).
combos2 <- data.frame(V1 = c(combos$V2), V2 = c(combos$V1), PCS = c(combos$PCS)) %>%
dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group.
# Then, put the PCS into 2 columns to join together with "combos" df.
df_new <- df %>%
  dplyr::group_by(HHnum) %>%
  dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
  dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
  dplyr::ungroup() %>%
  tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
  dplyr::rename("V1" = 3, "V2" = 4) %>%
  dplyr::left_join(combos, by = c("V1", "V2")) %>%
  unique() %>%
  dplyr::select(HHnum, PCS, work_night)
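If the roman numerals must follow the mapping given in the question rather than being generated by row number, an explicit lookup may be simpler. The following base R sketch assumes exactly the four combinations the question lists (any other pair would come out as NA); pcs_codes, key and households are illustrative names.

```r
PCS <- c("worker", "manager", "employee", "employee", "worker",
         "worker", "manager", "employee", "manager", "employee")
work_night <- c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes")
HHnum <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(PCS, work_night, HHnum)

# Lookup keyed by the household's jobs, sorted alphabetically and joined with "+".
# Only the combinations stated in the question are filled in.
pcs_codes <- c("manager+worker"    = "I",
               "employee+manager"  = "II",
               "worker+worker"     = "III",
               "employee+employee" = "VI")

# One key per household, e.g. "manager+worker" for household 1
key <- tapply(as.character(df$PCS), df$HHnum,
              function(x) paste(sort(x), collapse = "+"))

households <- data.frame(
  HHnum      = as.numeric(names(key)),
  PCS        = unname(pcs_codes[as.vector(key)]),
  # Number of household members who answered "Yes"
  work_night = as.vector(tapply(df$work_night == "Yes", df$HHnum, sum))
)
```

Sorting the jobs before pasting makes the key independent of row order, so "worker, manager" and "manager, worker" map to the same code without a reversed copy of the lookup.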
I am trying to accomplish the following:
1. Group data by id.
2. Remove any rows after '3' occurs.
3. Find the closest '1', '2' or NA that precedes '3' and only keep that row.
My data:
data <- data.frame(
id=c(1,1,1,1,1, 2,2,2,2, 3,3,3),
a=c(NA,1,2,3,3, NA,3,2,3, 1,5,3))
Desired output:
desired <- data.frame(
id=c(1,2,3), a=c(2,NA,1))
For steps 1-2, I have tried:
data %>% group_by(id) %>% slice(if(first(a) == 3))
but that seems quite off.
Thank you.
This breaks the problem into separate steps:
data %>%
  group_by(id) %>%
  filter(row_number() < first(which(a == 3))) %>%  # drop the first 3 and everything after it
  filter(a %in% c(1, 2, NA)) %>%                   # only keep 1, 2 or NA
  filter(row_number() == n())                      # choose the last row in each group
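The same three steps can be sketched in base R as a per-group function. last_before_three is a hypothetical name, and the sketch assumes that a group with no 3, or with nothing suitable before the first 3, should yield NA (which is then indistinguishable from a genuine NA match).

```r
data <- data.frame(
  id = c(1,1,1,1,1, 2,2,2,2, 3,3,3),
  a  = c(NA,1,2,3,3, NA,3,2,3, 1,5,3))

# Hypothetical helper: for one group's values, return the last 1, 2 or NA
# that appears before the first 3.
last_before_three <- function(a) {
  first3 <- which(a == 3)[1]            # position of the first 3 (NA if none)
  if (is.na(first3)) return(NA_real_)
  before <- a[seq_len(first3 - 1)]      # values ahead of the first 3
  keep   <- before[before %in% c(1, 2, NA)]
  if (length(keep) == 0) NA_real_ else keep[length(keep)]
}

a_by_id <- split(data$a, data$id)
result  <- data.frame(id = as.numeric(names(a_by_id)),
                      a  = unname(sapply(a_by_id, last_before_three)))
```

On the example data this reproduces the desired data frame from the question.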
I'm trying to subset a dataframe in R. It contains several categories. The first few rows for each category need to be removed. The number of rows to remove is inconsistent, but there is a row that indicates the cutoff. How do I remove everything above the cutoff (including that row) for each group?
Example data:
category <- c(rep("A", 3), rep("B", 5), rep("C", 4))
info <- as.character(c("Junk", "Border", "Useful",
"This", "is", "Useless", "Border", "Yes please",
"Unwanted", "Row", "Border", "Required"))
example_df <- data.frame(category, info)
example_df$row_number <- 1:nrow(example_df)
I can extract the row numbers of the border and the start of each group:
border_rows <- which(example_df$info == "Border")
start_rows <- example_df %>%
group_by(category) %>%
slice(1)
start_rows <- start_rows$row_number
I've tried the following, but this only removes the first two rows (i.e. the ones that need to be removed for group A).
for(i in 1:length(border_rows)) {
new_df <- example_df[-(start_rows[i]:border_rows[i]), ]
}
You can do this with the dplyr package, keeping only the rows that come after the "Border" row in each group:
library(dplyr)
example_df %>%
  group_by(category) %>%
  filter(row_number() > which(info == "Border")) %>%
  ungroup()
# A tibble: 3 x 2
  category info
  <fct>    <fct>
1 A        Useful
2 B        Yes please
3 C        Required
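The filter above relies on each category having exactly one "Border" row. A base R sketch that relaxes this uses a running count of borders per category; borders_seen and kept are illustrative names. Under this approach a category with no border loses all of its rows, and with several borders everything after the first one (other than border rows) is kept.

```r
category <- c(rep("A", 3), rep("B", 5), rep("C", 4))
info <- c("Junk", "Border", "Useful",
          "This", "is", "Useless", "Border", "Yes please",
          "Unwanted", "Row", "Border", "Required")
example_df <- data.frame(category, info)

# Within each category, count how many "Border" rows have been seen so far
borders_seen <- ave(as.integer(example_df$info == "Border"),
                    example_df$category, FUN = cumsum)

# Keep rows that come after a border, excluding the border row itself
kept <- example_df[borders_seen > 0 & example_df$info != "Border", ]
```

This also avoids the loop from the question, which overwrote new_df on every iteration and so only reflected the last subset it computed.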