Sample random rows in dataframe with probability - r

How can I take a random sample (with or without replacement) with given probabilities?
I am trying to extract a random sample of rows from the iris data frame, but with this split of species:
80% versicolor and 20% virginica
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

You could try this in base R:
f.sample <- function(a, percent) a[sample(nrow(a), nrow(a)*percent, replace = TRUE),]
f.sample(iris[iris$Species=="versicolor",], 0.8)
f.sample(iris[iris$Species=="virginica",], 0.2)
You can set the replace argument accordingly.
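For example, exposing replace as an argument and combining the two draws into one data frame (a small sketch building on f.sample above; smp is just an illustrative name):
f.sample <- function(a, percent, replace = FALSE) a[sample(nrow(a), nrow(a) * percent, replace = replace), ]
smp <- rbind(f.sample(iris[iris$Species == "versicolor", ], 0.8),
             f.sample(iris[iris$Species == "virginica", ], 0.2))
table(droplevels(smp)$Species)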

I seem to have a different understanding than the other answerers.
The following function should produce an 80/20 dataset regardless of the group sizes in the original data set.
foo <- function(DF, n = 50, group_var, groups, probs, replace = FALSE) {
# subset relevant groups & split
DF <- DF[DF[[group_var]] %in% groups, ]
DF <- split(DF, as.character(DF[[group_var]]))
DF <- DF[match(names(DF), groups)]
# sample the number of observations per group (this requires replace = TRUE)
smpl <- sample(groups, size = n, replace = TRUE, prob = probs)
# subset random rows per group according to group size
DF <- Map(function(x,y) x[sample(1:nrow(x), y, replace = replace),], DF, c(table(smpl)))
# combine and clean up
DF <- do.call(rbind, DF)
DF <- DF[sample(nrow(DF)),] # not really necessary
row.names(DF) <- NULL # not really necessary
DF
}
foo(iris, 50, "Species", c("versicolor", "virginica"), c(0.8, 0.2))
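A quick check of the resulting split (a sketch; res is an illustrative name, and the exact counts vary between draws because the group sizes are themselves sampled):
set.seed(42)
res <- foo(iris, 50, "Species", c("versicolor", "virginica"), c(0.8, 0.2))
prop.table(table(droplevels(res)$Species))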

We can make use of quosures from the development version of dplyr (soon to be released as 0.6.0) to create the function
library(tidyverse)
f.sample <- function(dat, colN, value, perc){
colN <- enquo(colN)
value <- quo_name(enquo(value))
dat %>%
filter(UQ(colN) == UQ(value)) %>%
sample_frac(perc) %>%
droplevels
}
f.sample(iris, Species, versicolor, 0.8)
f.sample(iris, Species, virginica, 0.2)
#Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#20 6.0 2.2 5.0 1.5 virginica
#9 6.7 2.5 5.8 1.8 virginica
#15 5.8 2.8 5.1 2.4 virginica
#10 7.2 3.6 6.1 2.5 virginica
#12 6.4 2.7 5.3 1.9 virginica
#49 6.2 3.4 5.4 2.3 virginica
#22 5.6 2.8 4.9 2.0 virginica
#34 6.3 2.8 5.1 1.5 virginica
#2 5.8 2.7 5.1 1.9 virginica
#44 6.8 3.2 5.9 2.3 virginica
enquo() provides functionality similar to substitute(): it takes the input argument and converts it to a quosure, while quo_name() converts it to a string; within filter/group_by/summarise/mutate the quosures are evaluated by unquoting (!! or UQ()).
Based on the comments below, we modified the function so that it would work for other cases
f.sample2 <- function(dat, colN, values, perc){
colN <- enquo(colN)
dat %>%
filter(UQ(colN) %in% values) %>%
droplevels %>%
nest(-UQ(colN)) %>%
.$data %>%
setNames(values) %>%
Map(sample_frac, ., perc) %>%
bind_rows(.id = quo_name(colN))
}
res <- f.sample2(iris, Species, c("versicolor", "virginica"), c(0.8, 0.2))
prop.table(table(res$Species))
#versicolor virginica
# 0.8 0.2
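For reference, in current dplyr (>= 1.0) the embrace operator {{ }} replaces the explicit enquo()/!! pattern, and slice_sample(prop = ) can replace sample_frac(); a minimal sketch of the same function (f.sample_new is an illustrative name, and here the value is passed as a string):
f.sample_new <- function(dat, colN, value, perc) {
  dat %>%
    filter({{ colN }} == value) %>%
    slice_sample(prop = perc) %>%
    droplevels()
}
f.sample_new(iris, Species, "versicolor", 0.8)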

Related

for loop to change columns with a specified unique length to factor in multiple dataframes

I have several dataframes for which I need to fix the classes of multiple columns before I can proceed. Because the dataframes all have the same variables but the classes seemed to differ from one dataframe to the other, I figured I would go for a for loop and specify the unique length upon which a column should be coded as factor or numeric.
I tried the following for factor:
dataframes <- list(dataframe1, dataframe2, dataframe2, dataframe3)
for (i in dataframes){
cols.to.factor <-sapply(i, function(col) length(unique(col)) < 6)
i[cols.to.factor] <- apply(i[cols.to.factor] , factor)
}
Now the code runs, but it doesn't change anything. What am I missing?
Thanks for the help in advance!
The instruction
for(i in dataframes)
extracts i from the list dataframes, and the loop then changes that copy, which is never reassigned to the original list. A way to correct the problem is:
for (i in seq_along(dataframes)){
x <- dataframes[[i]]
cols.to.factor <-sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor] , factor)
dataframes[[i]] <- x
}
An equivalent lapply based solution is
dataframes <- lapply(dataframes, \(x){
cols.to.factor <- sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor], factor)
x
})
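The question also asks about coercing the remaining columns to numeric; a minimal extension of the same lapply idea, assuming the non-factor columns really are numeric-like (numbers stored as character or factor):
dataframes <- lapply(dataframes, \(x){
  cols.to.factor <- sapply(x, function(col) length(unique(col)) < 6)
  x[cols.to.factor] <- lapply(x[cols.to.factor], factor)
  # go via as.character() first so existing factors don't become their integer codes
  x[!cols.to.factor] <- lapply(x[!cols.to.factor], function(col) as.numeric(as.character(col)))
  x
})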
library(tidyverse)
# example data
list(
iris,
iris %>% mutate(Sepal.Length = Sepal.Length %>% as.character())
) %>%
# unify column classes
map(~ .x %>% mutate(across(everything(), as.character))) %>%
# optional joining if wished
bind_rows() %>%
mutate(Species = Species %>% as.factor()) %>%
as_tibble()
#> # A tibble: 300 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <chr> <chr> <chr> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 290 more rows
Created on 2021-10-05 by the reprex package (v2.0.1)

Select values in a table conditional to an external table

I'd like to select the first N values of each variable (column) in a data set, where N varies by column and row and is given in another table. An example below with the iris data:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
## Create a fake external table
ext.tab <- data.table(species=c("setosa","versicolor", "virginica" ),N1=c(1:3),N2=c(3:5),N3=c(5:7),N4=c(7:9))
head(ext.tab)
species N1 N2 N3 N4
1: setosa 1 3 5 7
2: versicolor 2 4 6 8
3: virginica 3 5 7 9
Now for Iris setosa, I'd like to get the single maximum value (N1 in ext.tab) of column 1 (Sepal.Length in the iris data), then the three largest values (N2 in ext.tab) for column 2 (Sepal.Width), then the five largest values (N3) for column 3 (Petal.Length), and so forth, then move on to Iris versicolor and do the same.
The result can be either a table or a list for each species, with either the values themselves or the row indices for each variable (column). Any idea of a fast way to implement this?
Here is a tidyverse approach using a custom function. The function takes the variable and group names as character scalars and the number of maximum values as a numeric. Inside the function is a dplyr pipeline using the .data pronoun. Then I reshaped ext.tab to long form and applied get_maximum() row-wise.
library(tidyverse)
get_maximum <- \(.x, .group, .n_max, .dat) {
.dat %>%
filter(Species == .group) %>%
arrange(desc(.data[[.x]])) %>%
slice(seq_len(.n_max)) %>%
pull(.data[[.x]])
}
dat <- as_tibble(ext.tab) %>%
pivot_longer(-species) %>%
mutate(name = recode(
name,
N1 = "Sepal.Length",
N2 = "Sepal.Width",
N3 = "Petal.Length",
N4 = "Petal.Width"
)) %>%
rowwise() %>%
mutate(max_num = list(
get_maximum(name, species, value, iris)
)) %>%
ungroup()
If you need the unique maximum values, you can add distinct() inside the custom function.
get_maximum_unique <- \(.x, .group, .n_max, .dat) {
.dat %>%
filter(Species == .group) %>%
distinct(.data[[.x]]) %>%
arrange(desc(.data[[.x]])) %>%
slice(seq_len(.n_max)) %>%
pull(.data[[.x]])
}
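To use the unique-value variant, swap it into the same pipeline (a sketch reusing dat from above; dat_unique is an illustrative name):
dat_unique <- dat %>%
  select(-max_num) %>%
  rowwise() %>%
  mutate(max_num = list(get_maximum_unique(name, species, value, iris))) %>%
  ungroup()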
Here is an option using data.table. I have taken the liberty of renaming the columns.
cols <- setdiff(names(ext.tab), "Species")
iris[ext.tab, on=.(Species), by=.EACHI,
.(.(mapply(function(x, n) -head(sort(-x, partial=n), n),
x=mget(cols), n=mget(paste0("i.", cols)), SIMPLIFY=FALSE)))]$V1
data:
library(data.table)
iris <- as.data.table(iris)
ext.tab <- data.table(Species=c("setosa", "versicolor", "virginica"),
Sepal.Length=c(1:3),
Sepal.Width=c(3:5),
Petal.Length=c(5:7),
Petal.Width=c(7:9))
output:
[[1]]
[[1]]$Sepal.Length
[1] 5.8
[[1]]$Sepal.Width
[1] 4.4 4.2 4.1
[[1]]$Petal.Length
[1] 1.9 1.9 1.7 1.7 1.7
[[1]]$Petal.Width
[1] 0.4 0.4 0.6 0.4 0.5 0.4 0.4
[[2]]
[[2]]$Sepal.Length
[1] 7.0 6.9
[[2]]$Sepal.Width
[1] 3.4 3.3 3.2 3.2
[[2]]$Petal.Length
[1] 5.1 4.8 4.9 5.0 4.9 4.8
[[2]]$Petal.Width
[1] 1.7 1.6 1.6 1.8 1.5 1.5 1.6 1.5
[[3]]
[[3]]$Sepal.Length
[1] 7.7 7.9 7.7
[[3]]$Sepal.Width
[1] 3.8 3.8 3.6 3.4 3.4
[[3]]$Petal.Length
[1] 6.4 6.3 6.7 6.9 6.7 6.6 6.1
[[3]]$Petal.Width
[1] 2.5 2.5 2.4 2.5 2.4 2.4 2.3 2.3 2.3
Short explanation:
Perform a left join: iris[ext.tab, on=.(Species)]
by=.EACHI means for each row of ext.tab
x=mget(cols) gets the columns in iris
mget(paste0("i.", cols)) gets the number of values required for each column
-head(sort(-x, partial=n), n) performs a partial sort and extracts the first n values
SIMPLIFY=FALSE and .(.( )) are simply required to return the results as a list
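Since the result is an unnamed list in the row order of ext.tab, you may want to name it by species for easier access; a small convenience sketch reusing the call from above (res is an illustrative name):
res <- iris[ext.tab, on=.(Species), by=.EACHI,
            .(.(mapply(function(x, n) -head(sort(-x, partial=n), n),
                       x=mget(cols), n=mget(paste0("i.", cols)), SIMPLIFY=FALSE)))]$V1
names(res) <- ext.tab$Species
res[["setosa"]]$Sepal.Width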

Succinct subsetting across multiple columns in R

Say I have a massive dataframe, and in multiple columns I have an extremely large list of unique codes that I want to use to select certain rows and subset the original dataframe. There are around 1000 codes, and the codes I want all follow one another. For example, I have about 30 columns that contain codes, and I only want to keep rows that have codes 100 to 120 in ANY of these columns.
There's a long way to do this, which is something like
new_dat <- df[which(df$codes==100 | df$codes==101 | df$codes1==100
and I repeat this for every single possible code for every one of the columns that can contain these codes. Is there a way to do this in a more convenient fashion?
I want to try solving this with dplyr's select function, but I'm having trouble seeing if it works for my case out of the box.
Take the iris dataset. Say I wanted all rows that contain a value between 4.0 and 5.0 in any column whose name contains the word Sepal.
#this only goes for 4.0
brand_new_df <- select(filter(iris, Sepal.Length ==4.0 | Sepal.Width == 4.0))
but what I want is something like
brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))
Is there a dplyr way to do this?
A corresponding across() version of @RonakShah's answer:
library(dplyr)
iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)
or
iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)
From vignette("colwise"):
Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(), and there’s no direct replacement for any_vars().
So you need something like rowSums(...) > 0 to achieve the effect of any_vars().
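Later dplyr releases (1.0.4 and up) added if_any(), which now covers the any_vars() case directly; a minimal sketch:
iris %>% filter(if_any(contains('Sepal'), ~ between(.x, 4, 5)))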
You can use filter_at:
library(dplyr)
iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
#5 4.6 3.4 1.4 0.3 setosa
#6 5.0 3.4 1.5 0.2 setosa
#7 4.4 2.9 1.4 0.2 setosa
#....
Base R:
# Subset: keep rows where any "codes" column falls in [100, 120]
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols,
function(x) {
df2[, x] >= 100 & df2[, x] <= 120
})) > 0, ]
# Data:
tmp <- data.frame(x1 = rnorm(999, mean = 100, sd = 2))
df <-
setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)
One option could be:
iris %>%
filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 4.6 3.4 1.4 0.3 1
6 5.0 3.4 1.5 0.2 1
7 4.4 2.9 1.4 0.2 1
8 4.9 3.1 1.5 0.1 1
9 4.8 3.4 1.6 0.2 1
10 4.8 3.0 1.4 0.1 1
library(dplyr)
df <- iris
# value to look for
val <- 4
# find columns
cols <- which(colSums(df == val , na.rm = TRUE) > 0L)
# filter rows
iris %>% filter_at(cols, any_vars(.==val))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.5 2.3 4.0 1.3 versicolor
3 6.0 2.2 4.0 1.0 versicolor
4 6.1 2.8 4.0 1.3 versicolor
5 5.5 2.5 4.0 1.3 versicolor
6 5.8 2.6 4.0 1.2 versicolor
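The same row selection can also be written in base R, reusing cols and val from above (a small sketch):
iris[rowSums(iris[cols] == val, na.rm = TRUE) > 0, ]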

Use an apply function with user defined function that adds variables to data frame

I have defined a function which dynamically creates a new variable in a data frame. The input to the function is a string, which is then pasted together with other strings to create variable names that already exist in the data frame; these are then compared using case_when within mutate. The output of the function is the data frame with the new variable appended at the end. I want to apply this function to a vector of inputs and create multiple new columns in the data frame. I have used the iris data set to create a function very similar to what I am doing.
func <- function(x) {
a <- paste0("Sepal.", x)
b <- paste0("Petal.", x)
x <- iris %>%
mutate(
!!(paste0("Compare.", x)) :=
case_when(
iris[[a]] > iris[[b]] ~ "Sepal",
iris[[a]] < iris[[b]] ~ "Petal",
TRUE ~ "Equal"
)
)
return(x)
}
inputVector <- c("Length", "Width")
I want to apply this function to inputVector and have one data frame returned which would have the two new variables in it, preferably without loops. I'm looking to use something like
iris <- lapply(inputVector, func)
But this returns a list of two data frames. I also know that I could use <<- in func, but I would like to avoid this.
A slight change in the way the output is structured:
func <- function(x) {
a <- paste0("Sepal.", x)
b <- paste0("Petal.", x)
x1 <- iris %>%
mutate(
!!(paste0("Compare.", x)) :=
case_when(
iris[[a]] > iris[[b]] ~ "Sepal",
iris[[a]] < iris[[b]] ~ "Petal",
TRUE ~ "Equal"
)
)
return(x1[[paste0('Compare.', x)]])
}
inputVector <- c("Length", "Width")
op <- iris
op[,paste0('Compare.',inputVector)] <- lapply(inputVector, func)
Output
> head(op)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Compare.Length Compare.Width
1 5.1 3.5 1.4 0.2 setosa Sepal Sepal
2 4.9 3.0 1.4 0.2 setosa Sepal Sepal
3 4.7 3.2 1.3 0.2 setosa Sepal Sepal
4 4.6 3.1 1.5 0.2 setosa Sepal Sepal
5 5.0 3.6 1.4 0.2 setosa Sepal Sepal
6 5.4 3.9 1.7 0.4 setosa Sepal Sepal
A simpler option is to only generate the new columns in the function by substituting transmute for mutate, iterating and joining them into a data frame with map_dfc, and then using bind_cols to add them to the original:
library(tidyverse)
func <- function(x) {
a <- sym(paste0("Sepal.", x)) # these need to be quosures to refer to variables
b <- sym(paste0("Petal.", x))
iris %>% transmute(
!!paste0("Compare.", x) := case_when(
!!a > !!b ~ "Sepal", # unquote quosures
!!a < !!b ~ "Petal",
TRUE ~ "Equal"
)
)
}
inputVector <- c("Length", "Width")
iris %>% bind_cols(map_dfc(inputVector, func)) %>% head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Compare.Length
#> 1 5.1 3.5 1.4 0.2 setosa Sepal
#> 2 4.9 3.0 1.4 0.2 setosa Sepal
#> 3 4.7 3.2 1.3 0.2 setosa Sepal
#> 4 4.6 3.1 1.5 0.2 setosa Sepal
#> 5 5.0 3.6 1.4 0.2 setosa Sepal
#> 6 5.4 3.9 1.7 0.4 setosa Sepal
#> Compare.Width
#> 1 Sepal
#> 2 Sepal
#> 3 Sepal
#> 4 Sepal
#> 5 Sepal
#> 6 Sepal
A slightly more elegant approach is to structure the function to take a vector, moving all the iteration inside. You could use the approach above, or just generate the vectors and assemble them into a data frame at the end:
func2 <- function(x) {
columns <- map_dfc(x, function(y){
a <- paste0("Sepal.", y)
b <- paste0("Petal.", y)
column <- list(case_when(
iris[[a]] > iris[[b]] ~ "Sepal", # base notation is simpler than quosures
iris[[a]] < iris[[b]] ~ "Petal",
TRUE ~ "Equal"
))
names(column) <- paste0("Compare.", y)
column
})
iris %>% bind_cols(columns)
}
func2(inputVector) %>% tail()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 145 6.7 3.3 5.7 2.5 virginica
#> 146 6.7 3.0 5.2 2.3 virginica
#> 147 6.3 2.5 5.0 1.9 virginica
#> 148 6.5 3.0 5.2 2.0 virginica
#> 149 6.2 3.4 5.4 2.3 virginica
#> 150 5.9 3.0 5.1 1.8 virginica
#> Compare.Length Compare.Width
#> 145 Sepal Sepal
#> 146 Sepal Sepal
#> 147 Sepal Sepal
#> 148 Sepal Sepal
#> 149 Sepal Sepal
#> 150 Sepal Sepal
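For comparison, the same columns can be built in base R without any tidy evaluation; a minimal sketch (comp is an illustrative name):
comp <- lapply(inputVector, function(x) {
  a <- iris[[paste0("Sepal.", x)]]
  b <- iris[[paste0("Petal.", x)]]
  ifelse(a > b, "Sepal", ifelse(a < b, "Petal", "Equal"))
})
names(comp) <- paste0("Compare.", inputVector)
head(cbind(iris, as.data.frame(comp)))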
Playing around just a little bit more, I found an answer to this question. Instead of having func output the full data frame, I changed it to output just the new column and then used do.call and lapply:
func <- function(x) {
a <- paste0("Sepal.", x)
b <- paste0("Petal.", x)
x <- iris %>%
transmute(
!!(paste0("Compare.", x)) :=
case_when(
iris[[a]] > iris[[b]] ~ "Sepal",
iris[[a]] < iris[[b]] ~ "Petal",
TRUE ~ "Equal"
)
)
return(x)
}
do.call(cbind, c(iris, lapply(inputVector, func)))
I am definitely open to alternative solutions, as I imagine that this is perhaps not the best.
Your result is two data frames (as you indicated):
iris[[1]] is the first dataframe
iris[[2]] is the second dataframe.
You can use merge to merge both data frames into one, like this:
comp.iris <- lapply(inputVector, func)
comp.iris <- merge(comp.iris[[1]], comp.iris[[2]], sort = FALSE)
Hope it helps.

Creating a data frame based on a simple VLOOKUP in R?

df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'))
The new data frame (dfnew) needs all 5 columns from "iris" extracted, for all the rows with the petal lengths specified in "x".
I've tried it this way, but it doesn't seem to work:
dfnew <- df$Petal.Length[x]
Using dplyr:
> library(dplyr)
> data(iris)
> (dfnew <- iris %>% filter(Petal.Length %in% c('1.7', '1.9', '3.5')) )
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.4 3.9 1.7 0.4 setosa
2 5.7 3.8 1.7 0.3 setosa
3 5.4 3.4 1.7 0.2 setosa
4 5.1 3.3 1.7 0.5 setosa
5 4.8 3.4 1.9 0.2 setosa
6 5.1 3.8 1.9 0.4 setosa
7 5.0 2.0 3.5 1.0 versicolor
8 5.7 2.6 3.5 1.0 versicolor
It's worth noting that this is what you are technically asking for with "VLOOKUP", but the comment from phiver might actually be what you want.
df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'), X = c('X','Y','Z'))
df.new <- merge(df, x, by = 'Petal.Length')
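For completeness, the filtering step can also be done in base R (a small sketch; numeric literals work here because Petal.Length is numeric):
dfnew <- df[df$Petal.Length %in% c(1.7, 1.9, 3.5), ]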
