I'd like to select the first N values of each variables (columns) in a data set, where N varies by column and row and are given in an other table. An example below with the iris data:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
## Create a fake external table
ext.tab <- data.table(species=c("setosa","versicolor", "virginica" ),N1=c(1:3),N2=c(3:5),N3=c(5:7),N4=c(7:9))
head(ext.tab)
species N1 N2 N3 N4
1: setosa 1 3 5 7
2: versicolor 2 4 6 8
3: virginica 3 5 7 9
Now for Iris setosa, I'd like to get the first maximum value (N1 in ext.tab) of column 1 ('sepal.length' in iris data), then the three max values (N2 in ext.tab) for column 2 (sepal.width), then the five max values (N3) for column 3 (petal.length) and so forth. Then moving to the Iris versicolor and do the same.
The result can be either a table or a list for each species with the values themselves or row indices for each variable (column). Any idea of a fast way to implement that?
Here is a tidyverse approach using a custom function. The function takes the variable and group names as character scalar and number of maximum values as numeric. Inside the function is a dplyr pipeline using .data pronoun. Then, I reshaped ext.tab to long form and applied get_maximum() row-wise.
library(tidyverse)
get_maximum <- \(.x, .group, .n_max, .dat) {
.dat %>%
filter(Species == .group) %>%
arrange(desc(.data[[.x]])) %>%
slice(seq_len(.n_max)) %>%
pull(.data[[.x]])
}
dat <- as_tibble(ext.tab) %>%
pivot_longer(-species) %>%
mutate(name = recode(
name,
N1 = "Sepal.Length",
N2 = "Sepal.Width",
N3 = "Petal.Length",
N4 = "Petal.Width"
)) %>%
rowwise() %>%
mutate(max_num = list(
get_maximum(name, species, value, iris)
)) %>%
ungroup()
If you need the unique maximum values, you can add distinct() inside the custom function.
get_maximum_unique <- \(.x, .group, .n_max, .dat) {
.dat %>%
filter(Species == .group) %>%
distinct(.data[[.x]]) %>%
arrange(desc(.data[[.x]])) %>%
slice(seq_len(.n_max)) %>%
pull(.data[[.x]])
}
Here is an option using data.table. I have taken the liberty of renaming the column names.
cols <- setdiff(names(ext.tab), "Species")
iris[ext.tab, on=.(Species), by=.EACHI,
.(.(mapply(function(x, n) -head(sort(-x, partial=n), n),
x=mget(cols), n=mget(paste0("i.", cols)), SIMPLIFY=FALSE)))]$V1
data:
library(data.table)
iris <- as.data.table(iris)
ext.tab <- data.table(Species=c("setosa", "versicolor", "virginica"),
Sepal.Length=c(1:3),
Sepal.Width=c(3:5),
Petal.Length=c(5:7),
Petal.Width=c(7:9))
output:
[[1]]
[[1]]$Sepal.Length
[1] 5.8
[[1]]$Sepal.Width
[1] 4.4 4.2 4.1
[[1]]$Petal.Length
[1] 1.9 1.9 1.7 1.7 1.7
[[1]]$Petal.Width
[1] 0.4 0.4 0.6 0.4 0.5 0.4 0.4
[[2]]
[[2]]$Sepal.Length
[1] 7.0 6.9
[[2]]$Sepal.Width
[1] 3.4 3.3 3.2 3.2
[[2]]$Petal.Length
[1] 5.1 4.8 4.9 5.0 4.9 4.8
[[2]]$Petal.Width
[1] 1.7 1.6 1.6 1.8 1.5 1.5 1.6 1.5
[[3]]
[[3]]$Sepal.Length
[1] 7.7 7.9 7.7
[[3]]$Sepal.Width
[1] 3.8 3.8 3.6 3.4 3.4
[[3]]$Petal.Length
[1] 6.4 6.3 6.7 6.9 6.7 6.6 6.1
[[3]]$Petal.Width
[1] 2.5 2.5 2.4 2.5 2.4 2.4 2.3 2.3 2.3
Short explanation:
Perform a left join iris[ext.tab, on=.(Species),
by=.EACHI means for each row of ext.tab
x=mget(cols) gets the columns in iris
mget(paste0("i.", cols)) gets the number of values required for each column
-head(sort(-x, partial=n), n) performs a partial sort and extract the first n values
SIMPLIFY=FALSE and .(.( )) are simply required to return the results as a list
Related
I have several dataframes for which I need to fix the classes of multiple columns, before I can proceed. Because the dataframes all have the same variables but the classes seemed to differ from one dataframe to the other, I figured I would go for a 'for loop'and specify the unique length upon which a column should be coded as factor or numeric.
I tried the following for factor:
dataframes <- list(dataframe1, dataframe2, dataframe2, dataframe3)
for (i in dataframes){
cols.to.factor <-sapply(i, function(col) length(unique(col)) < 6)
i[cols.to.factor] <- apply(i[cols.to.factor] , factor)
}
now the code runs, but it doesn't change anything. What am I missing?
Thanks for the help in advance!
The instruction
for(i in dataframes)
extracts i from the list dataframes and the loop changes the copy, that is never reassigned to the original. A way to correct the problem is
for (i in seq_along(dataframes)){
x <- dataframes[[i]]
cols.to.factor <-sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor] , factor)
dataframes[[i]] <- x
}
An equivalent lapply based solution is
dataframes <- lapply(dataframes, \(x){
cols.to.factor <- sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor], factor)
x
})
library(tidyverse)
# example data
list(
iris,
iris %>% mutate(Sepal.Length = Sepal.Length %>% as.character())
) %>%
# unify column classes
map(~ .x %>% mutate(across(everything(), as.character))) %>%
# optional joining if wished
bind_rows() %>%
mutate(Species = Species %>% as.factor()) %>%
as_tibble()
#> # A tibble: 300 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <chr> <chr> <chr> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 290 more rows
Created on 2021-10-05 by the reprex package (v2.0.1)
I'm currently trying to automate a data task that requires taking in a list of column names in string format, then summing those columns (rowwise). i.e., suppose there is some list as follows:
> list
[1] "colname1" "colname2" "colname3"
How would I go about passing in this list to some function like sum() in tidyverse? That is, I would like to run something like the following:
df <- df %>%
rowwise %>%
mutate(new_var = sum(list))
Any suggestions would be greatly, greatly appreciated. Thanks.
You could use rowSums here. For example:
library(dplyr)
mycols <- colnames(iris)[3:4]
mycols
[1] "Petal.Length" "Petal.Width"
Then:
iris %>%
mutate(new_var = rowSums(.[, mycols])) %>%
head()
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.6
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.7
5 5.0 3.6 1.4 0.2 setosa 1.6
6 5.4 3.9 1.7 0.4 setosa 2.1
You can pass the vector of column names in c_across.
library(dplyr)
df <- df %>% rowwise() %>% mutate(new_var = sum(c_across(list)))
df
Say I have a massive dataframe and in multiple columns I have an extremely large list of unique codes and I want to use these codes to select certain rows to subset the original dataframe. There are around 1000 codes and the codes I want all follow after each other. For example I have about 30 columns that contain codes and I only want to take rows that have codes 100 to 120 in ANY of these columns .
There's a long way to do this which is something like
new_dat <- df[which(df$codes==100 | df$codes==101 | df$codes1==100
and I repeat this for every single possible code for everyone of the columns that can contain these codes. Is there a way to do this in a more convenient fashion?
I want to try solving this with dplyr's select function, but I'm having trouble seeing if it works for my case out of the box
Take the iris dataset
Say I wanted all rows that contain the value 4.0-5.0 in any columns that contains the word Sepal in the column name.
#this only goes for 4.0
brand_new_df <- select(filter(iris, Sepal.Length ==4.0 | Sepal.Width == 4.0))
but what I want is something like
brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))
Is there a dplyr way to do this?
A corresponding across() version from #RonakShah's answer:
library(dplyr)
iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)
or
iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)
From vignette("colwise"):
Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(), and there’s no direct replacement for any_vars().
So you need something like rowSums(...) > 0 to achieve the effect of any_vars().
You can use filter_at :
library(dplyr)
iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
#5 4.6 3.4 1.4 0.3 setosa
#6 5.0 3.4 1.5 0.2 setosa
#7 4.4 2.9 1.4 0.2 setosa
#....
Base R:
# Subset:
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols,
function(x) {
df2[, x] >= 100 & df2[, x] <= 120
})) == length(cols), ]
# Data:
tmp <- data.frame(x1 <- rnorm(999, mean = 100, sd = 2))
df <-
setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)
One option could be:
iris %>%
filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 4.6 3.4 1.4 0.3 1
6 5.0 3.4 1.5 0.2 1
7 4.4 2.9 1.4 0.2 1
8 4.9 3.1 1.5 0.1 1
9 4.8 3.4 1.6 0.2 1
10 4.8 3.0 1.4 0.1 1
library(dplyr)
df <- iris
# value to look for
val <- 4
# find columns
cols <- which(colSums(df == val , na.rm = TRUE) > 0L)
# filter rows
iris %>% filter_at(cols, any_vars(.==val))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.5 2.3 4.0 1.3 versicolor
3 6.0 2.2 4.0 1.0 versicolor
4 6.1 2.8 4.0 1.3 versicolor
5 5.5 2.5 4.0 1.3 versicolor
6 5.8 2.6 4.0 1.2 versicolor
df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'))
The new data frame (dfnew) needs all 5 columns from "iris" extracted, for all the rows with the petal lengths specified in "x".
I've tried it this way, but it doesn't seem to work:
dfnew <- df$Petal.Length[x]
Using dplyr:
> library(dplyr)
> data(iris)
> (dfnew <- iris %>% filter(Petal.Length %in% c('1.7', '1.9', '3.5')) )
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.4 3.9 1.7 0.4 setosa
2 5.7 3.8 1.7 0.3 setosa
3 5.4 3.4 1.7 0.2 setosa
4 5.1 3.3 1.7 0.5 setosa
5 4.8 3.4 1.9 0.2 setosa
6 5.1 3.8 1.9 0.4 setosa
7 5.0 2.0 3.5 1.0 versicolor
8 5.7 2.6 3.5 1.0 versicolor
It's worth noting that this is what you are technically asking for with "VLOOKUP", but the comment from phiver might actually be what you want.
df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'), X = c('X','Y','Z'))
df.new <- merge(df, x, by = 'Petal.Length')
How can I take a random sample (with or without replacement) but with given probabilities?
I am trying to extract a random sample of rows in iris data frame but with this condition of species:
80% versicolor and 20% virginica
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You could try this in base R:
f.sample <- function(a, percent) a[sample(nrow(a), nrow(a)*percent, replace = TRUE),]
f.sample(iris[iris$Species=="versicolor",], 0.8)
f.sample(iris[iris$Species=="virginica",], 0.2)
You can set the replace argument accordingly.
I seem to have a different understanding than the other answerers.
The following function should produce a 80/20 dataset regardless of the group sizes in the original data set.
foo <- function(DF, n = 50, group_var, groups, probs, replace = FALSE) {
# subset relevant groups & split
DF <- DF[DF[[group_var]] %in% groups, ]
DF <- split(DF, as.character(DF[[group_var]]))
DF <- DF[match(names(DF), groups)]
# sample number of observations per group (this requires replace= TRUE)
smpl <- sample(groups, size = n, replace = TRUE, prob = probs)
# subset random rows per group according to group size
DF <- Map(function(x,y) x[sample(1:nrow(x), y, replace = replace),], DF, c(table(smpl)))
# combine and clean up
DF <- do.call(rbind, DF)
DF <- DF[sample(nrow(DF)),] # not really necessary
row.names(DF) <- NULL # not really necessary
DF
}
foo(iris, 50, "Species", c("versicolor", "virginica"), c(0.8, 0.2))
We can make use of the quosures from the devel version of dplyr (soon to be released 0.6.0) for creating the function
library(tidyverse)
f.sample <- function(dat, colN, value, perc){
colN <- enquo(colN)
value <- quo_name(enquo(value))
dat %>%
filter(UQ(colN) == UQ(value)) %>%
sample_frac(perc) %>%
droplevels
}
f.sample(iris, Species, versicolor, 0.8)
f.sample(iris, Species, virginica, 0.2)
#Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#20 6.0 2.2 5.0 1.5 virginica
#9 6.7 2.5 5.8 1.8 virginica
#15 5.8 2.8 5.1 2.4 virginica
#10 7.2 3.6 6.1 2.5 virginica
#12 6.4 2.7 5.3 1.9 virginica
#49 6.2 3.4 5.4 2.3 virginica
#22 5.6 2.8 4.9 2.0 virginica
#34 6.3 2.8 5.1 1.5 virginica
#2 5.8 2.7 5.1 1.9 virginica
#44 6.8 3.2 5.9 2.3 virginica
The enquo does similar functionality as substitute by taking the input arguments and convert it to quosure, while quo_name convert to string, and within the filter/group_by/summarise/mutate the quosures are evaluated by unquoting (!! or UQ)
Based on the comments below, we modified the function so that it would work for other cases
f.sample2 <- function(dat, colN, values, perc){
colN <- enquo(colN)
dat %>%
filter(UQ(colN) %in% values) %>%
droplevels %>%
nest(-UQ(colN)) %>%
.$data %>%
setNames(values) %>%
Map(sample_frac, ., perc) %>%
bind_rows(.id = quo_name(colN))
}
res <- f.sample2(iris, Species, c("versicolor", "virginica"), c(0.8, 0.2))
prop.table(table(res$Species))
#versicolor virginica
# 0.8 0.2