Random sample of rows with "at least one from each" condition - r

So I've got a dataset that looks something like this:
a b c
23 34 Falcons
14 9 Hawks
2 18 Eagles
3 21 Eagles
22 8 Falcons
11 4 Hawks
And I know I can do a random subset of rows with a nested conditional, but what I want to do is create a random subset that takes at least one of each available value in column 'c'.
So possible correct subsets would be
23 34 Falcons
14 9 Hawks
3 21 Eagles
or
11 4 Hawks
2 18 Eagles
22 8 Falcons
[in no particular order], but something like:
14 9 Hawks
2 18 Eagles
3 21 Eagles
would NOT work, because 'Falcons' is not represented. Is there an easy way to do this in R?

Use group_by and sample_n functions in the dplyr package.
text1 <- "a b c
23 34 Falcons
14 9 Hawks
2 18 Eagles
3 21 Eagles
22 8 Falcons
11 4 Hawks"
dat <- read.table(text=text1, head=T, as.is=T)
library(dplyr)
dat %>% group_by(c) %>% sample_n(1)
# Source: local data frame [3 x 3]
# Groups: c [3]
# a b c
# (int) (int) (chr)
# 1 3 21 Eagles
# 2 22 8 Falcons
# 3 11 4 Hawks
UPDATE: You can write a function to do the sampling.
sample_df <- function(df) {
df.r <- sample(1:nrow(df), 1)
return(sample_n(df, df.r))
}
dat %>% group_by(c) %>% do(sample_df(.))

You can specify the n for each group here (use 1s if you only want a data frame with nrows == number of groups
dd <- read.table(header = TRUE, text = 'a b c
23 34 Falcons
14 9 Hawks
2 18 Eagles
3 21 Eagles
22 8 Falcons
11 4 Hawks', stringsAsFactors = FALSE)
(n <- setNames(c(1,2,1), unique(dd$c)))
# Falcons Hawks Eagles
# 1 2 1
set.seed(1)
dd[as.logical(ave(dd$c, dd$c, FUN = function(x)
sample(rep(c(FALSE, TRUE), c(length(x) - n[x[1]], n[x[1]]))))), ]
# a b c
# 1 23 34 Falcons
# 2 14 9 Hawks
# 4 3 21 Eagles
# 6 11 4 Hawks
Putting this into a function to automate some other things for you
sample_each <- function(data, var, n = 1L) {
lvl <- table(data[, var])
n1 <- setNames(rep_len(n, length(lvl)), names(lvl))
n0 <- lvl - n1
idx <- ave(as.character(data[, var]), data[, var], FUN = function(x)
sample(rep(0:1, c(n0[x[1]], n1[x[1]]))))
data[!!(as.numeric(idx)), ]
}
sample_each(dd, 'c', n = c(1,2,1))
# a b c
# 1 23 34 Falcons
# 3 2 18 Eagles
# 5 22 8 Falcons
# 6 11 4 Hawks
sample_each(mtcars, 'gear', 1)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
# Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
# Maserati Bora 15.0 8 301.0 335 3.54 3.57 14.60 0 1 5 8
sample_each(mtcars, 'gear', c(2,2,5))
# mpg cyl disp hp drat wt qsec vs am gear carb
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
# Mazda RX4 Wag1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# Hornet Sportabout1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# Merc 2801 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

Related

Read multiple CSV in different paths and bind them together for each path in R

I have one folder for each subject in my study containing 2-5 csv data which i would like to bind together. For each subject/folder the names of the data are the same.
I want to bind the data for each subject and would like to create a loop. As I have 230 different subjects doing that manually with rbind would be overhelming. The foldername is the subjectID
Any Ideas?
FolderStructure:
subject1/day1.csv
subject1/day2.csv
subject1/day3.csv
subject2/day1.csv
subject2/day2.csv
subject3/day1.csv
subject3/day2.csv
subject3/day3.csv
...
I'll demo using dplyr::bind_rows, though it'll work just as well with data.table::rbindlist. The base-R variant do.call(rbind, ...) does not work as directly since it doesn't have the .id=/idcol= easy option (some elbow-grease can work around this).
list_of_files ## make this however you want
# [1] "sub1/day1.csv" "sub1/day2.csv" "sub2/day1.csv" "sub2/day2.csv"
alldat <- lapply(setNames(nm=list_of_files), read.csv)
### fake data for demonstration
# alldat <- setNames(replicate(4, mtcars[sample(32,3),], simplify=FALSE), list_of_files)
lapply(split(alldat, sub("/.*", "", names(alldat))), dplyr::bind_rows, .id = "subj")
# $sub1
# subj mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 230 sub1/day1.csv 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Lincoln Continental sub1/day1.csv 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# Merc 450SE sub1/day1.csv 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# Porsche 914-2 sub1/day2.csv 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Fiat 128 sub1/day2.csv 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Toyota Corona sub1/day2.csv 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# $sub2
# subj mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 sub2/day1.csv 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# Fiat X1-9 sub2/day1.csv 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Lincoln Continental sub2/day1.csv 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# Merc 240D sub2/day2.csv 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 450SLC sub2/day2.csv 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
# Hornet 4 Drive sub2/day2.csv 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
You now have a list, one element per subject.
You can solve this problem easily by using read_csv from readr package and by obtaining paths of files in each of subjectID folders.
Let's suppose I have three folders subject1, subject2, subject3 which has csv files with same columns but with different rows.
library(readr)
subjects <- 3
paths <- sort(list.files(recursive = TRUE))
paths
[1] "subject1/cars1.csv" "subject1/cars2.csv" "subject1/cars3.csv"
[4] "subject2/cars1.csv" "subject2/cars2.csv" "subject3/cars1.csv"
[7] "subject3/cars2.csv" "subject3/cars3.csv"
So we can see subject1 folder has 3 csv files, subject2 has 2 csv files and sunject3 folder has 3 csv files. All we need to group these paths to csv files for each subjects folder and give each group of paths to read_csv
for (i in seq_len(subjects)) {
subj <- paste0("subject", i)
subject_paths <- paths[grepl(subj, paths)]
file_name <- paste0("subject_", i, "_binded")
assign(file_name, readr::read_csv(subject_paths, id = "paths"))
}
This will create three binded dataframe each for each subject folder.
> subject_1_binded
# A tibble: 15 × 4
paths mpg cyl disp
<chr> <dbl> <dbl> <dbl>
1 subject1/cars1.csv 27.3 4 79
2 subject1/cars1.csv 21 6 160
3 subject1/cars1.csv 13.3 8 350
4 subject1/cars1.csv 19.2 8 400
5 subject1/cars1.csv 10.4 8 460
6 subject1/cars2.csv 14.7 8 440
7 subject1/cars2.csv 18.7 8 360
8 subject1/cars2.csv 30.4 4 95.1
9 subject1/cars2.csv 15.5 8 318
10 subject1/cars2.csv 16.4 8 276.
11 subject1/cars3.csv 21.5 4 120.
12 subject1/cars3.csv 15.5 8 318
13 subject1/cars3.csv 19.7 6 145
14 subject1/cars3.csv 14.3 8 360
15 subject1/cars3.csv 21.4 4 121
> subject_2_binded
# A tibble: 15 × 4
paths mpg cyl disp
<chr> <dbl> <dbl> <dbl>
1 subject2/cars1.csv 27.3 4 79
2 subject2/cars1.csv 21 6 160
3 subject2/cars1.csv 13.3 8 350
4 subject2/cars1.csv 19.2 8 400
5 subject2/cars1.csv 10.4 8 460
6 subject2/cars2.csv 14.7 8 440
7 subject2/cars2.csv 18.7 8 360
8 subject2/cars2.csv 30.4 4 95.1
9 subject2/cars2.csv 15.5 8 318
10 subject2/cars2.csv 16.4 8 276.
11 subject2/cars3.csv 21.5 4 120.
12 subject2/cars3.csv 15.5 8 318
13 subject2/cars3.csv 19.7 6 145
14 subject2/cars3.csv 14.3 8 360
15 subject2/cars3.csv 21.4 4 121
> subject_3_binded
# A tibble: 15 × 4
paths mpg cyl disp
<chr> <dbl> <dbl> <dbl>
1 subject3/cars1.csv 27.3 4 79
2 subject3/cars1.csv 21 6 160
3 subject3/cars1.csv 13.3 8 350
4 subject3/cars1.csv 19.2 8 400
5 subject3/cars1.csv 10.4 8 460
6 subject3/cars2.csv 14.7 8 440
7 subject3/cars2.csv 18.7 8 360
8 subject3/cars2.csv 30.4 4 95.1
9 subject3/cars2.csv 15.5 8 318
10 subject3/cars2.csv 16.4 8 276.
11 subject3/cars3.csv 21.5 4 120.
12 subject3/cars3.csv 15.5 8 318
13 subject3/cars3.csv 19.7 6 145
14 subject3/cars3.csv 14.3 8 360
15 subject3/cars3.csv 21.4 4 121
So In your case for your 230 subjects you will have 230 binded data.
Note the use of assign() in the last line of for-loop, for your case it would create 230 data.frame (more appropriately, tbl_df) object in your global environment.

Problem with user created function to set the upper and lower values from 2 columns in data frame

I'm trying to create a user created function to set the upper and lower values from 2 columns in data frame, but it doesn't work.
I'm using the mtcars data. I want to create a data frame from the function using the variables gear and carb. The function puts in the gear, carb, and mtcars as arguments. I want to the function to create a data frame in an object called mtcars_post_function_dataset that has the variables gear and carb such that gear has the smaller of the 2 variable values and carb has the larger of the 2 variable values. In addition, I want it to create 2 variables called gear_pre_storage and carb_pre_storage which has the original values of gear and carb from the original mtcars data frame.
When I tried to make this function, called function_set_upper_and_lower_range_values, and I try to run it with the data of interest, I get the follow output (which is not the desired result):
mtcars_post_function_dataset <-
function_set_upper_and_lower_range_values(gear, carb, mtcars)
> Error in as.name(lower_value_post) : object 'lower_value_post' not found
I'm not sure what to do. Please advise. Thanks ahead of time for any help.
Here is the code I used for the function:
# creates function_set_upper_and_lower_range_values
# ---- NOTE: creates function
function_set_upper_and_lower_range_values <-
# ---- NOTE: enter in 2 variables and dataset, function will set data so that lower_value >= upper_value for a given pair of values in 2 columns
# ---- NOTE: important for confidence interval / credible interval tests
# ---- NOTE: lower_value == lower value in range being evaluated
# ---- NOTE: upper_value == lower value in range being evaluated
# ---- NOTE: object to put function return should be data frame object
# ---- NOTE: function_range_check_specific == function name
function(lower_value, upper_value, dataset_name)
{
# ---- NOTE: turns function inputs into strings
lower_value_colmn <- deparse(substitute(lower_value))
upper_value_colmn <- deparse(substitute(upper_value))
nm1 <- deparse(substitute(dataset_name))
# ---- NOTE: # turns dataset_name into data frame
set_upper_and_lower_range_values_construction_funct_object_A <-
data.frame(
dataset_name
)
# ---- NOTE: adds column used for merging, and then turns data into data frame
set_upper_and_lower_range_values_construction_funct_object_A <- tibble::rowid_to_column(set_upper_and_lower_range_values_construction_funct_object_A, "merging_column")
# ---- NOTE: turns data into data frame
set_upper_and_lower_range_values_construction_funct_object_A <- data.frame(set_upper_and_lower_range_values_construction_funct_object_A)
# ---- NOTE: selects variables of interest
set_upper_and_lower_range_values_construction_funct_object_B <-
set_upper_and_lower_range_values_construction_funct_object_A %>%
select(
lower_value_colmn,
upper_value_colmn,
merging_column
)
# ---- NOTE: turns object into data frame
set_upper_and_lower_range_values_construction_funct_object_B <-
data.frame(set_upper_and_lower_range_values_construction_funct_object_B)
# ---- NOTE: # transforms values of interest into numeric form
set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre <- as.numeric(as.character(set_upper_and_lower_range_values_construction_funct_object_B[[lower_value_colmn]]))
set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre <- as.numeric(as.character(set_upper_and_lower_range_values_construction_funct_object_B[[upper_value_colmn]]))
# ---- NOTE: # sets data so lower_value_pre <= upper_value_pre
# ---- NOTE: ## creates storage variables
set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage <- set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre
set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage <- set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre
# ---- NOTE: ## creates upper_value_colmn
set_upper_and_lower_range_values_construction_funct_object_B$upper_value_post <-
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) > (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage,
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) < (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage,
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) == (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage,
NA
)))
# ---- NOTE: ## creates lower_value_post
set_upper_and_lower_range_values_construction_funct_object_B$lower_value_post <-
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) > (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage,
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) < (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage,
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) == (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage,
NA
)))
# ---- NOTE: creates set_upper_and_lower_range_values_construction_funct_object_C to be used to combine variables of interest to main dataset
set_upper_and_lower_range_values_construction_funct_object_C <- data.frame(set_upper_and_lower_range_values_construction_funct_object_A)
# ---- NOTE: drops specific columns
set_upper_and_lower_range_values_construction_funct_object_C <-
dplyr::select(set_upper_and_lower_range_values_construction_funct_object_C, -c(lower_value_colmn, upper_value_colmn))
# ---- NOTE: selects specific data from set_upper_and_lower_range_values_construction_funct_object_B for combining dataset
set_upper_and_lower_range_values_construction_funct_object_D <-
merge(set_upper_and_lower_range_values_construction_funct_object_C,
set_upper_and_lower_range_values_construction_funct_object_B,
by.x = "merging_column",
by.y = "merging_column",
all.x = TRUE,
all.y = FALSE,
no.dups = TRUE)
# ---- NOTE: turns object into data frame
set_upper_and_lower_range_values_construction_funct_object_D <- data.frame(set_upper_and_lower_range_values_construction_funct_object_D)
# ---- NOTE: changes colnames
names(set_upper_and_lower_range_values_construction_funct_object_D)[as.name(lower_value_post)] <- lower_value_colmn
names(set_upper_and_lower_range_values_construction_funct_object_D)[as.name(upper_value_post)] <- upper_value_colmn
names(set_upper_and_lower_range_values_construction_funct_object_D)[as.name(lower_value_pre_storage)] <- paste(lower_value_colmn, "pre_storage", sep = "_")
names(set_upper_and_lower_range_values_construction_funct_object_D)[as.name(upper_value_pre_storage)] <- paste(upper_value_colmn, "pre_storage", sep = "_")
# ---- NOTE: returns proper variable
return(set_upper_and_lower_range_values_construction_funct_object_D)
}
In the function, the lines at the end are the problematic i.e. as.name is trying to convert an object that doesn't exist in the global env i.e lower_value_post, upper_value_post are column names in that set_upper_and_lower_range_values_construction_funct_object_D data.frame object.
names(set_upper_and_lower_range_values_construction_funct_object_D)[as.name(lower_value_post)] <- lower_value_colmn
names(set_upper_and_lower_range_values_construction_funct_object_D)[as.name(upper_value_post)] <- upper_value_colmn
names(set_upper_and_lower_range_values_construction_funct_object_D)[as.name(lower_value_pre_storage)] <- paste(lower_value_colmn, "pre_storage", sep = "_")
names(set_upper_and_lower_range_values_construction_funct_object_D)[as.name(upper_value_pre_storage)] <- paste(upper_value_colmn, "pre_storage", sep = "_")
As the OP wanted to rename those specific columns, in base R an option would to be match the column name with that vector of column names to get the position index of column.
i1 <- match(c("lower_value_post", "upper_value_post", "lower_value_pre_storage", "upper_value_pre_storage"), names(set_upper_and_lower_range_values_construction_funct_object_D))
and use that index to extract those column names and assign (<-) with new vector of column names in that order
names(set_upper_and_lower_range_values_construction_funct_object_D)[i1] <-
c(lower_value_colmn, upper_value_colmn, paste0(c(lower_value_colmn, upper_value_colmn), "_pre_storage"))
-full function with changes made
function_set_upper_and_lower_range_values <-
# ---- NOTE: enter in 2 variables and dataset, function will set data so that lower_value >= upper_value for a given pair of values in 2 columns
# ---- NOTE: important for confidence interval / credible interval tests
# ---- NOTE: lower_value == lower value in range being evaluated
# ---- NOTE: upper_value == lower value in range being evaluated
# ---- NOTE: object to put function return should be data frame object
# ---- NOTE: function_range_check_specific == function name
function(lower_value, upper_value, dataset_name)
{
# ---- NOTE: turns function inputs into strings
lower_value_colmn <- deparse(substitute(lower_value))
upper_value_colmn <- deparse(substitute(upper_value))
nm1 <- deparse(substitute(dataset_name))
# ---- NOTE: # turns dataset_name into data frame
set_upper_and_lower_range_values_construction_funct_object_A <-
data.frame(
dataset_name
)
# ---- NOTE: adds column used for merging, and then turns data into data frame
set_upper_and_lower_range_values_construction_funct_object_A <- tibble::rowid_to_column(set_upper_and_lower_range_values_construction_funct_object_A, "merging_column")
# ---- NOTE: turns data into data frame
set_upper_and_lower_range_values_construction_funct_object_A <- data.frame(set_upper_and_lower_range_values_construction_funct_object_A)
# ---- NOTE: selects variables of interest
set_upper_and_lower_range_values_construction_funct_object_B <-
set_upper_and_lower_range_values_construction_funct_object_A %>%
select(
lower_value_colmn,
upper_value_colmn,
merging_column
)
# ---- NOTE: turns object into data frame
set_upper_and_lower_range_values_construction_funct_object_B <-
data.frame(set_upper_and_lower_range_values_construction_funct_object_B)
# ---- NOTE: # transforms values of interest into numeric form
set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre <- as.numeric(as.character(set_upper_and_lower_range_values_construction_funct_object_B[[lower_value_colmn]]))
set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre <- as.numeric(as.character(set_upper_and_lower_range_values_construction_funct_object_B[[upper_value_colmn]]))
# ---- NOTE: # sets data so lower_value_pre <= upper_value_pre
# ---- NOTE: ## creates storage variables
set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage <- set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre
set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage <- set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre
# ---- NOTE: ## creates upper_value_colmn
set_upper_and_lower_range_values_construction_funct_object_B$upper_value_post <-
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) > (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage,
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) < (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage,
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) == (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage,
NA
)))
# ---- NOTE: ## creates lower_value_post
set_upper_and_lower_range_values_construction_funct_object_B$lower_value_post <-
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) > (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage,
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) < (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage,
ifelse(((set_upper_and_lower_range_values_construction_funct_object_B$upper_value_pre_storage) == (set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage)), set_upper_and_lower_range_values_construction_funct_object_B$lower_value_pre_storage,
NA
)))
# ---- NOTE: creates set_upper_and_lower_range_values_construction_funct_object_C to be used to combine variables of interest to main dataset
set_upper_and_lower_range_values_construction_funct_object_C <- data.frame(set_upper_and_lower_range_values_construction_funct_object_A)
# ---- NOTE: drops specific columns
set_upper_and_lower_range_values_construction_funct_object_C <-
dplyr::select(set_upper_and_lower_range_values_construction_funct_object_C, -c(lower_value_colmn, upper_value_colmn))
# ---- NOTE: selects specific data from set_upper_and_lower_range_values_construction_funct_object_B for combining dataset
set_upper_and_lower_range_values_construction_funct_object_D <-
merge(set_upper_and_lower_range_values_construction_funct_object_C,
set_upper_and_lower_range_values_construction_funct_object_B,
by.x = "merging_column",
by.y = "merging_column",
all.x = TRUE,
all.y = FALSE,
no.dups = TRUE)
# ---- NOTE: turns object into data frame
set_upper_and_lower_range_values_construction_funct_object_D <- data.frame(set_upper_and_lower_range_values_construction_funct_object_D)
# ---- NOTE: changes colnames
i1 <- match(c("lower_value_post", "upper_value_post", "lower_value_pre_storage", "upper_value_pre_storage"), names(set_upper_and_lower_range_values_construction_funct_object_D))
names(set_upper_and_lower_range_values_construction_funct_object_D)[i1] <-
c(lower_value_colmn, upper_value_colmn, paste0(c(lower_value_colmn, upper_value_colmn), "_pre_storage"))
# ---- NOTE: returns proper variable
return(set_upper_and_lower_range_values_construction_funct_object_D)
}
-testing
function_set_upper_and_lower_range_values(gear, carb, mtcars)
merging_column mpg cyl disp hp drat wt qsec vs am gear carb lower_value_pre upper_value_pre carb_pre_storage gear_pre_storage carb
1 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 4 4 4 4 4
2 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 4 4 4 4 4
3 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 1 1 4 4
4 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3 1 1 3 3
5 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 3 2 2 3 3
6 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 3 1 1 3 3
7 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 3 4 4 3 4
8 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4 2 2 4 4
9 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4 2 2 4 4
10 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4 4 4 4 4
11 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 4 4 4 4 4
12 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 3 3 3 3 3
13 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 3 3 3 3 3
14 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 3 3 3 3 3
15 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 3 4 4 3 4
16 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 3 4 4 3 4
17 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 3 4 4 3 4
18 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4 1 1 4 4
19 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4 2 2 4 4
20 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4 1 1 4 4
21 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 3 1 1 3 3
22 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 3 2 2 3 3
23 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3 2 2 3 3
24 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 3 4 4 3 4
25 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 3 2 2 3 3
26 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4 1 1 4 4
27 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5 2 2 5 5
28 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 5 2 2 5 5
29 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 5 4 4 5 5
30 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 5 6 6 5 6
31 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 5 8 8 5 8
32 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4 2 2 4 4
gear
1 4
2 4
3 1
4 1
5 2
6 1
7 3
8 2
9 2
10 4
11 4
12 3
13 3
14 3
15 3
16 3
17 3
18 1
19 2
20 1
21 1
22 2
23 2
24 3
25 2
26 1
27 2
28 2
29 4
30 5
31 5
32 2
>

tidyverse function to `mutate_sample`?

I'm looking to mutate a column for a random sample, e.g., mutate_sample. Does anyone know whether there is a dplyr/other tidyverse verb for this? Below is a reprex for the behavior I am looking for and an attempt to functionalize (which isn't running because I'm struggling with quasiquotation in if_else).
library(dplyr)
library(tibble)
library(rlang)
# Setup -------------------------------------------------------------------
group_size <- 10
group_n <- 1
my_cars <-
mtcars %>%
rownames_to_column(var = "model") %>%
mutate(group = NA_real_, .after = model)
# Code to create mutated sample -------------------------------------------
group_sample <-
my_cars %>%
filter(is.na(group)) %>%
slice_sample(n = group_size) %>%
pull(model)
my_cars %>%
mutate(group = if_else(model %in% group_sample, group_n, group)) %>%
head()
#> model group mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 NA 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive NA 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout NA 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant NA 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Function to create mutated sample ---------------------------------------
#
# Note: doesn't run because of var in if_else
# mutate_sample <- function(data, var, id, n, value) {
# # browser()
# sample <-
# data %>%
# filter(is.na({{var}})) %>%
# slice_sample(n = n) %>%
# pull({{id}})
#
# data %>%
# mutate(var = if_else({{id}} %in% sample, value, {{var}}))
# }
#
# mutate_sample(my_cars, group, model, group_size, group_n)
Created on 2020-10-21 by the reprex package (v0.3.0)
Looking through SO, I found this related post:
Mutate column as input to sample
I think you could achieve your goal with this two options.
With dplyr:
mtcars %>% mutate(group = sample(`length<-`(rep(group_n, group_size), n())))
or with base R:
mtcars[sample(nrow(mtcars), group_size), "group"] <- group_n
If you need an external function to handle it, you could go with:
mutate_sample <- function(.data, .var, .size, .value) {
mutate(.data, {{.var}} := sample(`length<-`(rep(.value, .size), n())))
}
mtcars %>% mutate_sample(group, group_size, group_n)
or
mutate_sample_rbase <- function(.data, .var, .size, .value) {
.data[sample(nrow(.data), size = min(.size, nrow(.data))),
deparse(substitute(.var))] <- .value
.data
}
mtcars %>% mutate_sample(group, group_size, group_n)
Note that if .size is bigger than the number of rows of .data, .var will be a constant equal to .value.
EDIT
If you're interested in keeping the old group, I suggest you another way to handle the problem:
library(dplyr)
# to understand this check out ?sample
resample <- function(x, ...){
x[sample.int(length(x), ...)]
}
# this is to avoid any error in case you choose a size bigger than the available rows to select in one group
resample_max <- function (x, size) {
resample(x, size = min(size, length(x)))
}
mutate_sample <- function(.data, .var, .size, .value) {
# creare column if it doesnt exist
if(! deparse(substitute(.var)) %in% names(.data)) .data <- mutate(.data, {{.var}} := NA)
# replace missing values randomly keeping existing non-missing values
mutate(.data, {{.var}} := replace({{.var}}, resample_max(which(is.na({{.var}})), .size), .value))
}
group_size <- 10
mtcars %>%
mutate_sample(group, group_size, 1) %>%
mutate_sample(group, group_size, 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb group
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 2
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 2
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 NA
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 2
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 1
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 1
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 1
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 2
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 1
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
Notice that this solution works even with grouped_df class (what you get after a dplyr::group_by): from each group [made by dplyr::group_by] a sample of .size units will be selected.
mtcars %>%
group_by(am) %>%
mutate_sample(group, 10, 1) %>%
ungroup() %>%
count(group)
#> # A tibble: 2 x 2
#> group n
#> <dbl> <int>
#> 1 1 20 # two groups, each with 10!
#> 2 NA 12

Select N rows above and below match [duplicate]

This question already has answers here:
Returning above and below rows of specific rows in r dataframe
(5 answers)
Closed 1 year ago.
I would like to select N rows above and below a match.
I'm trying the command:
mtcars[which(mtcars$vs == 1) + c(-1:1), ]
It returns the follow warning:
Warning message:
In which(mtcars$vs == 1) + c(-1:1):
longer object length is not a multiple of shorter object length
We can write a short function to return all elements of vec that match val or that are within n elements (either direction):
newfun <- function(vec, val, n) {
rows <- sapply(which(vec==val), function(x) seq(x-n, x+n, 1))
rows <- unique(sort(rows[rows>0 & rows<length(vec)]))
return(vec[rows])
}
For example:
newfun(mtcars$vs, 1, 2)
Before adding the desired range of indices to your focal index (the result from which), you need to repeat each value to the length of your range.
# set the number of values to select, before and after each focal index
n <- 1
# create a range of (relative) indices
i <- -n:n
# repeat focal indices
# add range of n prior and following indices
ix <- rep(which(mtcars$vs == 1), each = length(i)) + i
# select unique indices, truncated to the relevant range of rows,...
unique(ix[ix > 0 & ix <= nrow(mtcars)])
# [1] 2 3 4 5 6 7 8 9 10 11 12 17 18 19 20 21 22 25 26 27 28 29 31 32
# ...which then can be used to subset data
mtcars[unique(ix[ix > 0 & ix <= nrow(mtcars)]), ]
This seems to be a simple question but is not as trivial as presumably expected.
The issue is that which(mtcars$vs == 1) returns a vector rather than a single value:
[1] 3 4 6 8 9 10 11 18 19 20 21 26 28 32
If another vector -1:1 (which is c(-1L, 0L, 1L)) is added to it, the normal R rules for operations on vectors of unequal lengths apply: The recycling rule says
Any short vector operands are extended by recycling their values until
they match the size of any other operands.
Therefore the shorter vector -1:1 will be recycled to the length of which(mtcars$vs == 1), i.e.,
rep(-1:1, length.out = length(which(mtcars$vs == 1)))
[1] -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0
Therefore, the result of
which(mtcars$vs == 1) + -1:1
is the element-wise sum of the elements of both vectors where the shorter vector has been recycled to match the length of the longer vector.
[1] 2 4 7 7 9 11 10 18 20 19 21 27 27 32
which is propably not what the OP has expected.
In addition, we get the
Warning message:
In which(mtcars$vs == 1) + -1:1 :
longer object length is not a multiple of shorter object length
because which(mtcars$vs == 1) has length 14 and -1:1 has length 3.
Solution using outer()
In order to select the N rows above and below each matching row, we need to add -N:N to each row number returned by which(mtcars$vs == 1):
outer(which(mtcars$vs == 1), -1:1, `+`)
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 5 6 7
[4,] 7 8 9
[5,] 8 9 10
[6,] 9 10 11
[7,] 10 11 12
[8,] 17 18 19
[9,] 18 19 20
[10,] 19 20 21
[11,] 20 21 22
[12,] 25 26 27
[13,] 27 28 29
[14,] 31 32 33
Now, we have an array of all row numbers. Unfortunately, it cannot be used directly for subsetting because it contains duplicates and there are row numbers which do not exist in mtcars. So the the result has to be "post-processed" before it can be used for subsetting.
library(magrittr) # piping used for clarity
rn <- outer(which(mtcars$vs == 1), -1:1, `+`) %>%
as.vector() %>%
unique() %>%
Filter(function(x) x[1 <= x & x <= nrow(mtcars)], .)
rn
[1] 2 3 4 5 6 7 8 9 10 11 12 17 18 19 20 21 22 25 26 27 28 29 31 32
mtcars[rn, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Select the top N values by group

This is in response to a question asked on the r-help mailing list.
Here are lots of examples of how to find top values by group using sql, so I imagine it's easy to convert that knowledge over using the R sqldf package.
An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it'd be nice to show some different ways to treat ties.
mpg cyl disp hp drat wt qsec vs am gear carb ranks
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 2.0
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1.0
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.0
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.0
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1.0
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 1.5
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1.5
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 3.0
How to find the top or bottom (maximum or minimum) N records per group?
This seems more straightforward using data.table as it performs the sort while setting the key.
So, if I were to get the top 3 records in sort (ascending order), then,
require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]
does it.
And if you want the descending order
d[, tail(.SD, 3), by=cyl] # Thanks #MatthewDowle
Edit: To sort out ties using mpg column:
d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]
# cyl mpg disp hp drat wt qsec vs am gear carb rank
# 1: 4 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1 11
# 2: 4 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2 1
# 3: 4 21.5 120.1 97 3.70 2.465 20.01 1 0 3 1 8
# 4: 4 21.4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
# 5: 6 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1 7
# 6: 6 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4 1
# 7: 6 17.8 167.6 123 3.92 3.440 18.90 1 0 4 4 2
# 8: 8 14.3 360.0 245 3.21 3.570 15.84 0 0 3 4 7
# 9: 8 10.4 472.0 205 2.93 5.250 17.98 0 0 3 4 14
# 10: 8 10.4 460.0 215 3.00 5.424 17.82 0 0 3 4 5
# 11: 8 13.3 350.0 245 3.73 3.840 15.41 0 0 3 4 3
# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]
dplyr does the trick
mtcars %>%
arrange(desc(mpg)) %>%
group_by(cyl) %>% slice(1:2)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
5 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
6 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Just sort by whatever (mpg for example, question is not clear on this)
mt <- mtcars[order(mtcars$mpg), ]
then use the by function to get the top n rows in each group
d <- by(mt, mt["cyl"], head, n=4)
If you want the result to be a data.frame:
Reduce(rbind, d)
Edit:
Handling ties is more difficult, but if all ties are desired:
by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])
Another approach is to break ties based on some other information, e.g.,
mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)
There are at least 4 ways to do this thing, however,each has some difference.
We using u_id to group and using lift value to order/sort
1 dplyr traditional way
library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)
and if you switch the order of arrange(desc(lift)) and group_by(u_id) the result is essential the same.And if there is tie for equal lift value,it will slice to make sure each group has no more than 10 values, if you only have 5 lift value in the group, it will only gives you 5 results for that group.
2 dplyr topN way
library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)
this one if you have tie in lift value, say 15 same lift for the same u_id, you will got all 15 observations
3 data.table tail way
library(data.table)
final_subset = data.table(final_subset,key = "lift")
top10_final_subset3 = final_subset[,tail(.SD,10),,by = c("u_id")]
It has the same row numbers as the first way, however, there are some rows are different, I guess they are using diff random algorithm dealing with tie.
4 data.table .SD way
library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]
This way is the most "uniform" way,if in a group there are only 5 observation it will repeat value to make it to 10 observations and if there are ties it will still slice and only hold for 10 observations.
If there were a tie at the fourth position for mtcars$mpg then this should return all the ties:
top_mpg <- mtcars[ mtcars$mpg >= mtcars$mpg[order(mtcars$mpg, decreasing=TRUE)][4] , ]
> top_mpg
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Since there is a tie at the 3-4 position you can test it by changing 4 to a 3, and it still returns 4 items. This is logical indexing and you might need to add a clause that removes the NA's or wrap which() around the logical expression. It's not much more difficult to do this "by" cyl:
Reduce(rbind, by(mtcars, mtcars$cyl,
function(d) d[ d$mpg >= d$mpg[order(d$mpg, decreasing=TRUE)][4] , ]) )
#-------------
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Incorporating my suggestion to #Ista:
Reduce(rbind, by(mtcars, mtcars$cyl, function(d) d[ d$mpg <= sort( d$mpg )[3] , ]) )
You can write a function that splits the database by a factor, orders by another desired variable, extract the number of rows you want in each factor (category) and combine these into a database.
top<-function(x, num, c1,c2){
sorted<-x[with(x,order(x[,c1],x[,c2],decreasing=T)),]
splits<-split(sorted,sorted[,c1])
df<-lapply(splits,head,num)
do.call(rbind.data.frame,df)}
x is the dataframe;
num is the number of number of rows you would like to see;
c1 is the column number of the variable you would like to split by;
c2 is the column number of the variable you would like to rank by or handle ties.
Using the mtcars data, the function extracts the 3 heaviest cars (mtcars$wt is the 6th column) in each cylinder class (mtcars$cyl is the 2nd column)
top(mtcars,3,2,6)
mpg cyl disp hp drat wt qsec vs am gear carb
4.Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
4.Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
4.Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
6.Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6.Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
6.Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
8.Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
8.Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
8.Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
You can also easily get the lightest in a class by changing head in the lapply function to tail OR by removing the decreasing=T argument in the order function which will return it to its default, decreasing=F.
Since dplyr 1.0.0, the slice_max()/slice_min() functions were implemented:
mtcars %>%
group_by(cyl) %>%
slice_max(mpg, n = 2, with_ties = FALSE)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
5 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
6 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
The documentation on with_ties parameter:
Should ties be kept together? The default, TRUE, may return more rows
than you request. Use FALSE to ignore ties, and return the first n
rows.
I prefer #Ista solution, cause needs no extra package and is simple.
A modification of the data.table solution also solve my problem, and is more general.
My data.frame is
> str(df)
'data.frame': 579 obs. of 11 variables:
$ trees : num 2000 5000 1000 2000 1000 1000 2000 5000 5000 1000 ...
$ interDepth: num 2 3 5 2 3 4 4 2 3 5 ...
$ minObs : num 6 4 1 4 10 6 10 10 6 6 ...
$ shrinkage : num 0.01 0.001 0.01 0.005 0.01 0.01 0.001 0.005 0.005 0.001 ...
$ G1 : num 0 2 2 2 2 2 8 8 8 8 ...
$ G2 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ qx : num 0.44 0.43 0.419 0.439 0.43 ...
$ efet : num 43.1 40.6 39.9 39.2 38.6 ...
$ prec : num 0.606 0.593 0.587 0.582 0.574 0.578 0.576 0.579 0.588 0.585 ...
$ sens : num 0.575 0.57 0.573 0.575 0.587 0.574 0.576 0.566 0.542 0.545 ...
$ acu : num 0.631 0.645 0.647 0.648 0.655 0.647 0.619 0.611 0.591 0.594 ...
The data.table solution needs order on i to do the job:
> require(data.table)
> dt1 <- data.table(df)
> dt2 = dt1[order(-efet, G1, G2), head(.SD, 3), by = .(G1, G2)]
> dt2
G1 G2 trees interDepth minObs shrinkage qx efet prec sens acu
1: 0 FALSE 2000 2 6 0.010 0.4395953 43.066 0.606 0.575 0.631
2: 0 FALSE 2000 5 1 0.005 0.4294718 37.554 0.583 0.548 0.607
3: 0 FALSE 5000 2 6 0.005 0.4395753 36.981 0.575 0.559 0.616
4: 2 FALSE 5000 3 4 0.001 0.4296346 40.624 0.593 0.570 0.645
5: 2 FALSE 1000 5 1 0.010 0.4186802 39.915 0.587 0.573 0.647
6: 2 FALSE 2000 2 4 0.005 0.4390503 39.164 0.582 0.575 0.648
7: 8 FALSE 2000 4 10 0.001 0.4511349 38.240 0.576 0.576 0.619
8: 8 FALSE 5000 2 10 0.005 0.4469665 38.064 0.579 0.566 0.611
9: 8 FALSE 5000 3 6 0.005 0.4426952 37.888 0.588 0.542 0.591
10: 2 TRUE 5000 3 4 0.001 0.3812878 21.057 0.510 0.479 0.615
11: 2 TRUE 2000 3 10 0.005 0.3790536 20.127 0.507 0.470 0.608
12: 2 TRUE 1000 5 4 0.001 0.3690911 18.981 0.500 0.475 0.611
13: 8 TRUE 5000 6 10 0.010 0.2865042 16.870 0.497 0.435 0.635
14: 0 TRUE 2000 6 4 0.010 0.3192862 9.779 0.460 0.433 0.621
By some reason, it does not order the way pointed (probably because ordering by the groups). So, another ordering is done.
> dt2[order(G1, G2)]
G1 G2 trees interDepth minObs shrinkage qx efet prec sens acu
1: 0 FALSE 2000 2 6 0.010 0.4395953 43.066 0.606 0.575 0.631
2: 0 FALSE 2000 5 1 0.005 0.4294718 37.554 0.583 0.548 0.607
3: 0 FALSE 5000 2 6 0.005 0.4395753 36.981 0.575 0.559 0.616
4: 0 TRUE 2000 6 4 0.010 0.3192862 9.779 0.460 0.433 0.621
5: 2 FALSE 5000 3 4 0.001 0.4296346 40.624 0.593 0.570 0.645
6: 2 FALSE 1000 5 1 0.010 0.4186802 39.915 0.587 0.573 0.647
7: 2 FALSE 2000 2 4 0.005 0.4390503 39.164 0.582 0.575 0.648
8: 2 TRUE 5000 3 4 0.001 0.3812878 21.057 0.510 0.479 0.615
9: 2 TRUE 2000 3 10 0.005 0.3790536 20.127 0.507 0.470 0.608
10: 2 TRUE 1000 5 4 0.001 0.3690911 18.981 0.500 0.475 0.611
11: 8 FALSE 2000 4 10 0.001 0.4511349 38.240 0.576 0.576 0.619
12: 8 FALSE 5000 2 10 0.005 0.4469665 38.064 0.579 0.566 0.611
13: 8 FALSE 5000 3 6 0.005 0.4426952 37.888 0.588 0.542 0.591
14: 8 TRUE 5000 6 10 0.010 0.2865042 16.870 0.497 0.435 0.635
data.table way for picking the lowest 3 mpg per group:
data("mtcars")
setDT(mtcars)[order(mpg), head(.SD, 3), by = "cyl"]
# start with the mtcars data frame (included with your installation of R)
mtcars
# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )
# choose whether you want to find the minimum or maximum
find.maximum <- FALSE
# create a simple data frame with only two columns
x <- mtcars
# order it based on
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]
# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# done!
# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]
# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.

Resources