How to randomly sample multiple consecutive rows of a dataframe in R? - r

I've a dataframe with 100 rows and 20 columns and want to randomly sample 5 times 10 consecutive rows, e.g. 10:19, 25:34, etc. With: sample_n( df, 5 ) I'm able to extract 5 unique, randomly sampled rows, but don't know how to sample consecutive rows. Any help? Thanks!

df <- mtcars
df$row_nm <- seq(nrow(df))
set.seed(7)
sample_seq <- function(n, N) {
i <- sample(seq(N), size = 1)
ifelse(
test = i + (seq(n) - 1) <= N,
yes = i + (seq(n) - 1),
no = i + (seq(n) - 1) - N
)
}
replica <- replicate(n = 5, sample_seq(n = 10, N = nrow(df)))
# result
lapply(seq(ncol(replica)), function(x) df[replica[, x], ])
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 17
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 18
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 21
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 22
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 23
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 24
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 25
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 26
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 27
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#>
#> [[3]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#>
#> [[4]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 29
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 30
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#>
#> [[5]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 9
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
Created on 2022-01-24 by the reprex package (v2.0.1)

You could so something like:
#sample data
df <- data.table(value = 1:100000)
#function which sampled consecutive rows (x = dataframe, rows = nr of consecutive rows, nr = amount of times you want to sample consecutive rows)
sample_fun <- function(x, rows, nr){
#maximum number which can be sampled
numbers <- 1:(nrow(x) - rows)
#randomly sample 5 numbers
sampled.numbers <- sample(numbers, nr)
#convert to vector (5 consecutive)
sampled.rows <- lapply(sampled.numbers, function(x){seq(x, (x+rows-1), 1)})
sampled.rows <- do.call(c, sampled.rows)
#sample and return
result <- x[sampled.rows,]
return(result)
}
sample_fun(x = df, rows = 5, nr = 2)

You don't mention if this can include replacement (i.e. if you sample 10:19, can you then also sample 15:24?). You also don't mention if you can sample anything above row 91, which would mean that sample of 10 gets cut off (i.e. 98,99,100 would only be 3 consecutive rows unless you want that to loop back to row 1). Assuming you can sample any value with replacement, the solution can be done in one line:
sapply(sample(1:100,5),function(x){seq(x,x+9)})
This applies the sequence function to each of 5 individually sampled numbers. The output will be a matrix, where each column is a sample of 10 consecutive rows, but as noted, these will potentially overlap, or go above 100.
If you want a solution where the rows will not overlap at all, and avoiding values over 100, without making values above 91 less likely to be sampled, this actually gets kind of trick, but I think the code below should work. You cant just sample from 1:91 without affect probability of your random sample, because then this means a value like 100 actually only has a 1/91 probability of being sampled (the sample value has to be 91), where other values don't involve that same constraint. This solution makes it so all rows are equally likely to be sampled.
Rows=c(1:100,1:100)
SampleRows=matrix(0,nrow=10,ncol=5)
for(i in 1:ncol(SampleRows)){
SampledValue=sample(Rows,1)
RowsIndex=min(which(Rows==SampledValue))
Sequence=Rows[RowsIndex:(RowsIndex+9)]
SampleRows[,i]=Sequence
Rows=Rows[!(Rows %in% Sequence)]
}
This approach creates a vector that sequences from 1:100, repeated twice (variable Rows), you'll see why this is important in a bit. For each of 5 iterations (corresponding to 5 samples), we take a sampled value from Rows, which will be a number 1:100, we then figure out where that number is in Rows, and take all 9 values next to it. In the first sample this will always be 10 consecutive numbers (e.g. 20:29). But then we remove those sampled values from Rows. If we happen to get the next sample as a value that would lead to overlap (like 18), then instead it samples (18,19,30,31,32,33,34...) since 20:29 have been removed. We need to do 1:100 twice in Rows, so that if we sample a value like 99, it resets from 100, back to 1.
If you want your output in a vector,throw in this at the end
sort(as.vector(SampleRows))
Let me know if this works for the needs of your problem.

Related

Calculate row sums exclude the first n columns

I need to calculate row sums for a data frame except for the first 5 columns. The output will consist of these first 5 columns and the row sums.
I tried this:
df1$rowsums <- rowSums(df1[,-c(1:5)], na.rm= T)
But I get this error message:
Error in rowSums(df1[, c(1:5)], na.rm = T) : 'x' must be numeric
without data my guess is, that the columns you are using are not numeric. Then it will be hard to calculate the rowsum. Make sure, that columns you use for summing (except 1:5) are indeed numeric, then the following code should work:
library(tidyverse)
df2 <- df1[,-c(1:5)] %>%
rowwise() %>%
mutate(rowsum = sum(c_across(everything()), na.rm = T))
df_result <- cbind(df1[,c(1:5)], df2$rowsum)
EDIT: I added na.rm = T (dont know if necessary). And you might want to rename the resulting "df2$rowsum" column of the resulting df_result dataframe this can be done using
df_result <- df_result %>% rename(rowsum_name = "df2$rowsum")
You could select the columns except the first 5 by -c(1:5) and use rowSums like this (I use mtcars as an example):
library(dplyr)
mtcars %>%
mutate(rowsums = select(., -c(1:5)) %>%
rowSums(na.rm = TRUE))
#> mpg cyl disp hp drat wt qsec vs am gear carb rowsums
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 28.080
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 28.895
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 27.930
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 27.655
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 25.460
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 28.680
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 26.410
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 30.190
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 33.050
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 30.740
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 31.340
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 27.470
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 27.330
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 27.780
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 30.230
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 30.244
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 29.765
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 28.670
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 28.135
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 28.735
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 27.475
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 25.390
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 25.735
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 26.250
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 25.895
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 27.835
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 26.840
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 27.413
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 27.670
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 30.270
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 32.170
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 29.380
Created on 2022-07-09 by the reprex package (v2.0.1)

R SQL templating with glue_sql, ability to dynamically drop where clause

TLDR
I would like to be able to template SQL queries and run them in R. The glue package and DBI work great, but I can't figure out a way to template statements. In other words, is there a way to do something like this (borrowing from jinja):
SELECT * FROM mtcars
{% if length( {make} ) > 0 %}
WHERE make IN( {make*}
{% end %}
Additional Detail
DBI and glue work great for a single use case, but often I want to reuse the same general SQL code with a few different variations of WHERE clauses and things like that. Often I want the WHERE to be "off". in some of the use cases and not in others (e.g. for WHERE IN() it defaults to all values, for WHERE x >= y it doesn't apply the conditional at all, etc.).
The only solution I can find is to evaluate inputs in R as discussed here, and then pass a default vector or the input. This approach works in some use cases and not at all in others. I think it makes it harder to generalize and has a performance hit in my most common use case - when I want a query with a parameter that passes values to a WHERE IN() clause, but defaults to all values. If the table is evolving (i.e. all values changes over time) then I need to first run a query to get all values, then input them if the user doesn't provide values. That can be expensive on larger tables and prohibitive if it's in a user experience (shiny).
library(DBI)
library(glue)
library(dplyr, warn.conflicts = F)
# Setup local DB ####
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_df <- tibble::rownames_to_column(mtcars, var = "make")
str(mtcars_df)
#> 'data.frame': 32 obs. of 12 variables:
#> $ make: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
DBI::dbWriteTable(con, "mtcars", mtcars_df)
# Example query ####
sql <- glue::glue_sql("SELECT * FROM mtcars WHERE make IN( {make*} )", make = c("Fiat X1-9", "Datsun 710"), .con = con)
DBI::dbGetQuery(con, sql)
#> make mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> 2 Fiat X1-9 27.3 4 79 66 4.08 1.935 18.90 1 1 4 1
# Templating ####
sql <- "SELECT * FROM mtcars WHERE make IN( {make*} )"
sql_template <- tempfile(fileext = ".sql")
readr::write_file(sql, sql_template)
read_sql <- function(file, ..., .con, .envir = parent.frame()){
sql <- readr::read_file(file)
sql <- glue::glue_sql(sql, ..., .con = .con, .envir = .envir)
}
# SQL files can be templated and called from R
sql <- read_sql(sql_template, make = c("Fiat X1-9", "Datsun 710"), .con = con)
DBI::dbGetQuery(con, sql)
#> make mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> 2 Fiat X1-9 27.3 4 79 66 4.08 1.935 18.90 1 1 4 1
# All {values} must be provided, errors out
sql <- read_sql(sql_template, .con = con)
#> Error in eval(parse(text = text, keep.source = FALSE), envir): object 'make' not found
# Doesn't return anything
sql <- read_sql(sql_template, make = DBI::SQL(""), .con = con)
print(sql)
#> <SQL> SELECT * FROM mtcars WHERE make IN( )
DBI::dbGetQuery(con, sql)
#> [1] make mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
# Can't make the entire where clause a parameter either without doing a lot of escapes and basically defeating the purppose of glue
sql <- glue::glue_sql("SELECT * FROM mtcars {makes}", makes = "WHERE make IN('Fiat X1-9', 'Datsun 710')", .con = con)
print(sql)
#> <SQL> SELECT * FROM mtcars 'WHERE make IN(''Fiat X1-9'', ''Datsun 710'')'
DBI::dbGetQuery(con, sql)
#> make mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# Get all values first
all_makes <- DBI::dbGetQuery(con, "SELECT DISTINCT make FROM mtcars") %>% dplyr::pull(make)
sql <- read_sql(sql_template, make = all_makes, .con = con)
DBI::dbGetQuery(con, sql)
#> make mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# Templating with a conditional####
sql <- "SELECT * FROM mtcars WHERE cyl >= {cyl} "
sql_template <- tempfile(fileext = ".sql")
readr::write_file(sql, sql_template)
read_sql <- function(file, ..., .con, .envir = parent.frame()){
sql <- readr::read_file(file)
sql <- glue::glue_sql(sql, ..., .con = .con, .envir = .envir)
}
# No way to use the all values approach since it's a one sided conditional
sql <- read_sql(sql_template, cyl = 8, .con = con)
DBI::dbGetQuery(con, sql)
#> make mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 2 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 3 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 4 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 5 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 6 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 7 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 8 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 9 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 10 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 11 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 12 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 13 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 14 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
You can use multiple arguments to glue::glue_sql:
# con <- dbConnect(...)
make <- c()
glue::glue_sql(
"select * from mtcars",
if (length(make)) " where make in ({make*})" else "",
.con = con)
# <SQL> select * from mtcars
make <- c("Fiat X1-9", "Datsun 710")
glue::glue_sql( # unchanged
"select * from mtcars",
if (length(make)) " where make in ({make*})" else "",
.con = con)
# <SQL> select * from mtcars where make in ('Fiat X1-9', 'Datsun 710')
Using jinjar v0.2.0:
query <- "SELECT * FROM mtcars
{% if length(make) > 0 -%}
WHERE make IN ({{ quote_sql(make) }})
{%- endif %}"
jinjar::render(query, make = c()) |> writeLines()
#> SELECT * FROM mtcars
jinjar::render(query, make = c("Fiat X1-9", "Datsun 710")) |> writeLines()
#> SELECT * FROM mtcars
#> WHERE make IN ('Fiat X1-9', 'Datsun 710')
Created on 2022-06-24 by the reprex package (v2.0.1)

How do I speed up this specific for loop?

I've looked at other threads and tried to apply it to my code but have had no luck.
CDR3_post_challenge_unique_clonecount$participant_per_cdr3aa <- as.numeric(CDR3_post_challenge_unique_clonecount$cdr3aa)
participant_list <- unique(CDR3_post_challenge_unique_clonecount$cdr3aa)
for (c in participant_list)
{
CDR3_post_challenge_unique_clonecount$participant_per_cdr3aa[CDR3_post_challenge_unique_clonecount$cdr3aa == c] <- length(unique(CDR3_post_challenge_unique_clonecount$PartID[CDR3_post_challenge_unique_clonecount$cdr3aa == c]))
}
Here is a bit of the dataframe:
cdr3aa clonecount PartID
CAAGRAARGGSVPHWFDPF 1 S-1
CAALADSGSQTDAFDIA 1 S-1
CAFHAAYGSQHGLDVW 1 S-1
CAGGLAWLVDDW 1 S-1
CAGRWFFPW 1 S-1
CAGVKNGRGMDVW 1 S-1
I think you can replace the for loop with
CDR3_post_challenge_unique_clonecount$per3 <-
as.integer(
ave(CDR3_post_challenge_unique_clonecount$PartID,
CDR3_post_challenge_unique_clonecount$cdr3aa,
FUN = function(z) length(unique(z)))
)
I'll demonstrate with mtcars, using the follow analogs:
mtcars --> CDR3_post_challenge_unique_clonecount
cyl --> cdr3aa, the categorical variable in which we want to count PartID
drat --> PartID, the thing we want to count (uniquely) within each cdr3aa
mtcars$drat_per_cyl <- ave(mtcars$drat, mtcars$cyl, FUN = function(z) length(unique(z)))
mtcars
# mpg cyl disp hp drat wt qsec vs am gear carb drat_per_cyl
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 5
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 5
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 10
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 11
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 5
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 11
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 10
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 5
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 5
# Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 11
# Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 11
# Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 11
# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 11
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 11
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 11
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 10
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 10
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 10
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 10
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 11
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 11
# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 11
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 11
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 10
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 10
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 10
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 11
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 5
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 11
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 10
Notes:
ave is a little brain-dead in that the class of the return value is always the same as the class of the first argument. This means that one cannot count unique "character" and expect to get an integer, it is instead returned as a string. It's because of this that I wrap ave in as.integer(.).
ave returns a vector the same length as the input, with values corresponding 1-for-1 (meaning the order is relevant and preserved). In my example of mtcars, this means that it is effectively doing something like this:
ind4 <- which(mtcars$cyl == 4L)
ind4
# [1] 3 8 9 18 19 20 21 26 27 28 32
length(unique(mtcars$drat[ind4]))
# [1] 10
ind6 <- which(mtcars$cyl == 6L)
ind6
# [1] 1 2 4 6 10 11 30
length(unique(mtcars$drat[ind6]))
# [1] 5
### ...
but it will place the return value 10 in the ind4 positions of the return value. For example, because of my ind6, the return value will start with
c(5, 5, .., 5, .., 5, .., .., .., 5, 5, .., .....)
Because of ind4, it will contain
c(.., .., 10, .., .., .., .., 10, 10, .....)
(And same for cyl==8L.)

PCA analysis: getting error in dim desc(): not convenient data

I have conducted PCA on a set of data using prcomp. As a final step I am trying to use the dimdesc() function from FactoMineR to obtain p-values that identify the most significantly associated variables with my principal components.
The data frame has seven variables all of which are numerical and there are no missing values. The names are standard names such as "RCH_Home" (just in case the names could be problematic).
I write the following function:
res.desc <- dimdesc(df_PCA, axes = c(1:2), proba = 0.05)
And get the following error message:
Error in dimdesc(df_PCA, axes = c(1:2), proba = 0.05) : non convenient data
Any idea what might be going on?
Thanks!!!!
You should use the PCA function in sostitution of the prcomp
Below an example of PCA with FactoMineR.
library(FactoMineR)
library(factoextra)
library(paran)
data(cars)
mtcars_pca<-cars_pca<-PCA(mtcars)
If you want to check the percentage of variance, you can do this:
mtcars_pca$eig
> mtcars_pca$eig
eigenvalue percentage of variance cumulative percentage of variance
comp 1 6.60840025 60.0763659 60.07637
comp 2 2.65046789 24.0951627 84.17153
comp 3 0.62719727 5.7017934 89.87332
comp 4 0.26959744 2.4508858 92.32421
comp 5 0.22345110 2.0313737 94.35558
comp 6 0.21159612 1.9236011 96.27918
comp 7 0.13526199 1.2296544 97.50884
comp 8 0.12290143 1.1172858 98.62612
comp 9 0.07704665 0.7004241 99.32655
comp 10 0.05203544 0.4730495 99.79960
comp 11 0.02204441 0.2004037 100.00000
Cos2 stands for squared cosine and is an index for the quality representation of both variables and individuals. The closer this value is to one, the better the quality.
mtcars_pca$var$cos2
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
mpg 0.8685312 0.0006891117 0.031962249 1.369725e-04 0.0023634487
cyl 0.9239416 0.0050717032 0.019276287 1.811054e-06 0.0007642822
disp 0.8958370 0.0064482423 0.002370993 1.775235e-02 0.0346868281
hp 0.7199031 0.1640467049 0.012295659 1.234773e-03 0.0651697911
drat 0.5717921 0.1999959326 0.016295731 1.970035e-01 0.0013361275
wt 0.7916038 0.0542284172 0.073281663 1.630161e-02 0.0012578888
qsec 0.2655437 0.5690984542 0.101947952 1.249426e-03 0.0060588455
vs 0.6208539 0.1422249798 0.115330572 1.244460e-02 0.0803189801
am 0.3647715 0.4887450097 0.026555457 2.501834e-04 0.0018011675
gear 0.2829342 0.5665806069 0.052667265 1.888829e-02 0.0005219259
carb 0.3026882 0.4533387304 0.175213444 4.333912e-03 0.0291718181
res.desc <- dimdesc(mtcars_pca, axes = c(1:2), proba = 0.05)
> head(res.desc)
$Dim.1
$quanti
correlation p.value
cyl 0.9612188 2.471950e-18
disp 0.9464866 2.804047e-16
wt 0.8897212 9.780198e-12
hp 0.8484710 8.622043e-10
carb 0.5501711 1.105272e-03
qsec -0.5153093 2.542578e-03
gear -0.5319156 1.728737e-03
am -0.6039632 2.520665e-04
drat -0.7561693 5.575736e-07
vs -0.7879428 8.658012e-08
mpg -0.9319502 9.347042e-15
attr(,"class")
[1] "condes" "list "
$Dim.2
$quanti
correlation p.value
gear 0.7527155 6.712704e-07
am 0.6991030 8.541542e-06
carb 0.6733043 2.411011e-05
drat 0.4472090 1.028069e-02
hp 0.4050268 2.147312e-02
vs -0.3771273 3.335771e-02
qsec -0.7543861 6.138696e-07
attr(,"class")
[1] "condes" "list "
$call
$call$num.var
[1] 1
$call$proba
[1] 0.05
$call$weights
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
$call$X
Dim.1 mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 -0.6572132031 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag -0.6293955058 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 -2.7793970426 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive -0.3117707086 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 1.9744889419 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant -0.0561375337 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 3.0026742880 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D -2.0553287289 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 -2.2874083842 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 -0.5263812077 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C -0.5092054932 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 2.2478104359 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 2.0478227622 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 2.1485421615 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 3.8997903717 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 3.9541231097 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 3.5929719882 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 -3.8562837567 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic -4.2540325032 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla -4.2342207436 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona -1.9041678566 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 2.1848507430 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 1.8633834347 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 2.8889945733 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 2.2459189274 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 -3.5739682964 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 -2.6512550541 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa -3.3857059882 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 1.3729574238 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino -0.0009899207 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 2.6691258658 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E -2.4205931001 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
it should be use the function from the same package.
If you want to choose how many dimensions you need, you can do this by the param package
library(paran)
cars_paran<-paran(mtcars, graph = TRUE)

How to search for string within variable labels of a data frame, and return a vector with all these variables in R

I have a data frame with a large amount of math and science related items, and I want all math related variables removed.
Variable names has no consistent naming for neither math nor science, so it's hard to search and select based variable name. However, the variable labels are descriptive of what the variable represents. I essentially want all variables with labels that contain the word "math" removed. I tried the following code:
library(dplyr)
library(Hmisc)
# Sample data frame:
M <- c(1, 2)
S <- c(3, 4)
old_df <- data.frame(M, S)
label(old_df$M) <- "My Mathematics Variable"
label(old_df$S) <- "My Science Variable"
#dplyr syntax:
new_df <- old_df %>% select( -contains(hmisc::label(.) == "MATH" ) )
using the Hmisc::label()-function to retrieve a vector with labels.
Sample code of the label()-function:
> label(old_df)
M S
"My Mathematics Variable" "My Science Variable"
> str(label(old_df))
Named chr [1:2] "My Mathematics Variable" "My Science Variable"
- attr(*, "names")= chr [1:2] "M" "S"
I need a what to search through the label items and find the string "math" within. I tried coerce to a matrix and data frame, but I still can't figure out how to search and retrive the variable names. Any suggestions that will get this to work is welcome.
You mean something like this? (UPDATED to more closely map grepl to your example.)
library(Hmisc)
library(dplyr)
Hmisc::label(mtcars$mpg) <- "Miles per Gallon" # grepl WILL catch this
Hmisc::label(mtcars$hp) <- "Not here" # nope
Hmisc::label(mtcars$qsec) <- "MILES all caps here" # nope unless you ignore_case = TRUE
Hmisc::label(mtcars$drat) <- "later in the label Miles is here" # yepp
mtcars %>% select_if(.predicate = !(grepl("Miles", Hmisc::label(.), ignore.case = TRUE)))
#> cyl disp hp wt qsec vs am gear carb
#> Mazda RX4 6 160.0 110 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 6 160.0 110 2.875 17.02 0 1 4 4
#> Datsun 710 4 108.0 93 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 6 258.0 110 3.215 19.44 1 0 3 1
#> Hornet Sportabout 8 360.0 175 3.440 17.02 0 0 3 2
#> Valiant 6 225.0 105 3.460 20.22 1 0 3 1
#> Duster 360 8 360.0 245 3.570 15.84 0 0 3 4
#> Merc 240D 4 146.7 62 3.190 20.00 1 0 4 2
#> Merc 230 4 140.8 95 3.150 22.90 1 0 4 2
#> Merc 280 6 167.6 123 3.440 18.30 1 0 4 4
#> Merc 280C 6 167.6 123 3.440 18.90 1 0 4 4
#> Merc 450SE 8 275.8 180 4.070 17.40 0 0 3 3
#> Merc 450SL 8 275.8 180 3.730 17.60 0 0 3 3
#> Merc 450SLC 8 275.8 180 3.780 18.00 0 0 3 3
#> Cadillac Fleetwood 8 472.0 205 5.250 17.98 0 0 3 4
#> Lincoln Continental 8 460.0 215 5.424 17.82 0 0 3 4
#> Chrysler Imperial 8 440.0 230 5.345 17.42 0 0 3 4
#> Fiat 128 4 78.7 66 2.200 19.47 1 1 4 1
#> Honda Civic 4 75.7 52 1.615 18.52 1 1 4 2
#> Toyota Corolla 4 71.1 65 1.835 19.90 1 1 4 1
#> Toyota Corona 4 120.1 97 2.465 20.01 1 0 3 1
#> Dodge Challenger 8 318.0 150 3.520 16.87 0 0 3 2
#> AMC Javelin 8 304.0 150 3.435 17.30 0 0 3 2
#> Camaro Z28 8 350.0 245 3.840 15.41 0 0 3 4
#> Pontiac Firebird 8 400.0 175 3.845 17.05 0 0 3 2
#> Fiat X1-9 4 79.0 66 1.935 18.90 1 1 4 1
#> Porsche 914-2 4 120.3 91 2.140 16.70 0 1 5 2
#> Lotus Europa 4 95.1 113 1.513 16.90 1 1 5 2
#> Ford Pantera L 8 351.0 264 3.170 14.50 0 1 5 4
#> Ferrari Dino 6 145.0 175 2.770 15.50 0 1 5 6
#> Maserati Bora 8 301.0 335 3.570 14.60 0 1 5 8
#> Volvo 142E 4 121.0 109 2.780 18.60 1 1 4 2

Resources