Applying function using dplyr and setting output as columns in dataframe - r

I have a huge dataframe, and I am applying a function with multiple outputs to one of its columns. I would like to add these outputs as columns in the dataframe.
Example function:
measure <- function(x){ # useless function for illustrative purposes
one <- x+1
two <- x^2
three <- x/2
m <- c(one,two,three)
names(m) <- c('Plus1','Square','Half')
return(m)
}
My current method, which is very inefficient:
a <- mtcars %>% group_by(cyl) %>% mutate(Plus1 = measure(wt)[1], Square = measure(wt)[2],
Half = measure(wt)[3]) %>% as.data.frame()
Output:
head(a,15)
mpg cyl disp hp drat wt qsec vs am gear carb Plus1 Square Half
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.62 3.875 4.215
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.62 3.875 4.215
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.32 4.190 4.150
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.62 3.875 4.215
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 4.44 4.570 5.070
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 3.62 3.875 4.215
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 4.44 4.570 5.070
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 3.32 4.190 4.150
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 3.32 4.190 4.150
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.62 3.875 4.215
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 3.62 3.875 4.215
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 4.44 4.570 5.070
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 4.44 4.570 5.070
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 4.44 4.570 5.070
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 4.44 4.570 5.070
Is there a more efficient way to do this? My actual function has 13 outputs, and it is taking a very long time to apply to my large dataframe. Please help!

There could be various ways to solve this; however, one option is to return a tibble from the function, split the dataframe by group, calculate the statistics for each group, and bind the results together.
library(tidyverse)
measure <- function(x){
tibble(Plus1 = x+1,Square = x^2,Half = x/2)
}
bind_cols(mtcars %>% arrange(cyl),
mtcars %>%
group_split(cyl) %>%
map_df(~measure(.$wt)))
# mpg cyl disp hp drat wt qsec vs am gear carb Plus1 Square Half
#1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.320 5.382400 1.1600
#2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4.190 10.176100 1.5950
#3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4.150 9.922500 1.5750
#4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 3.200 4.840000 1.1000
#5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 2.615 2.608225 0.8075
#6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 2.835 3.367225 0.9175
#....
This calls measure only once per group, irrespective of the number of values returned, unlike the original attempt, where it was called n times to extract n values.
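If you are on dplyr 1.0 or later, I believe an unnamed expression that returns a data frame inside mutate() is spliced into separate columns, so the tibble-returning measure() above can be used directly in a grouped pipeline; unlike the bind_cols/arrange approach it also keeps the original row order. A minimal sketch:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  mutate(measure(wt)) %>%   # Plus1, Square and Half are spliced in as columns
  ungroup()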

Related

apply function to dataframe's chosen column whilst grouping by another chosen column

I would like to apply the below function (cut.at.n.tile) to a data frame (some_data_frame) whilst grouping by a chosen column (e.g. SomeGroupingColumn) and choosing the target column (e.g. ChosenColumn). I tried using sapply() without success - see code below. Any input very much appreciated. Apologies for this not being fully replicable/self contained ...
cut.at.n.tile <- function(X, n = 7) {
cut(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE),
labels = 1:n, include.lowest = TRUE)
}
some_data_frame$SeasonTypeNumber = sapply(split(some_data_frame['ChosenColumn'], SomeGroupingColumn), cut.at.n.tile)
There are a few problems here.
some_data_frame['ChosenColumn'] always returns a single-column data.frame, not a vector which your function requires. I suggest switching to some_data_frame[['ChosenColumn']].
SomeGroupingColumn looks like it should be a column (hence the name) in the data, but it is not referenced within a frame. Perhaps some_data_frame[['SomeGroupingColumn']].
You need to ensure that the breaks= used are unique. For example,
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# Error in cut.default(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE), :
# 'breaks' are not unique
If we debug that function, we see
X
# [1] 360.0 360.0 275.8 275.8 275.8 472.0 460.0 440.0 318.0 304.0 350.0 400.0 351.0 301.0
quantile(X, probs = (0:n)/n, na.rm = TRUE)
# 0% 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 100%
# 275.8000 275.8000 303.1429 336.2857 354.8571 371.4286 442.8571 472.0000
where 275.8 is repeated. This can happen based on nuances in the raw data, and you can't really predict when it will occur.
Since we'll likely have multiple groups, all of the subvectors' levels= (cut returns a factor) must be the same length, though admittedly a 1 in one group is unlikely to mean the same thing as a 1 in another group.
Since in this case we can never be certain which n-tile a number strictly belongs to (is 275.8 in the first or the second n-tile?), we can only adjust one of the dupes and accept the imperfection. I suggest cumsum(duplicated(.) * 1e-9): the premise is that it adds an iota to each value that is a dupe, rendering it no longer a dupe. It is possible that adding 1e-9 to one value will make it a dupe of the next, so we can be a little OCD and repeat this until we have no duplicates.
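To see that adjustment in action on the quantiles above (just an illustration of the mechanism; the nudged break differs from its neighbour only by that iota):
brks <- quantile(subset(mtcars, cyl == 8)$disp, probs = (0:7)/7, na.rm = TRUE)
any(duplicated(brks))
# [1] TRUE
while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9 * dupes)
any(duplicated(brks))
# [1] FALSE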
sapply is unlikely to return a vector; it is much (almost certainly) more likely to return a list (if the groups are not perfectly balanced) or a matrix (if they are perfectly balanced). We cannot simply unlist, since the order of the unlisted vectors will likely not match the order of the source data.
We can use `split<-`, or we can use a few other techniques (dplyr and/or data.table).
Updated function, and demonstration with mtcars:
cut.at.n.tile <- function(X, n = 7) {
brks <- quantile(X, probs = (0:n)/n, na.rm = TRUE)
while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9*dupes)
cut(X, breaks = brks, labels = 1:n, include.lowest = TRUE)
}
base R
ret <- lapply(split(mtcars[['disp']], mtcars[['cyl']]), cut.at.n.tile)
mtcars[["newcol"]] <- NA # create an empty column
split(mtcars[['newcol']], mtcars[['cyl']]) <- ret
mtcars
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 4
# Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 1
# Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
# Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 1
# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 7
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 7
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 6
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 5
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 3
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
Validation:
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
# Levels: 1 2 3 4 5 6 7
subset(mtcars, cyl == 8)$newcol
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(newcol = cut.at.n.tile(disp)) %>%
ungroup()
# # A tibble: 32 × 12
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 5
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 6
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 5
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 7
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 7
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 4
# # … with 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
data.table
library(data.table)
as.data.table(mtcars)[, newcol := cut.at.n.tile(disp), by = .(cyl)][]
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <fctr>
# 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# 5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# 7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# 8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# 9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# 10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# ---
# 23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# 24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# 25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# 26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# 27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# 28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# 29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# 30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# 31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# 32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6

How to randomly sample multiple consecutive rows of a dataframe in R?

I have a dataframe with 100 rows and 20 columns and want to randomly sample 10 consecutive rows 5 times, e.g. 10:19, 25:34, etc. With sample_n(df, 5) I'm able to extract 5 unique, randomly sampled rows, but I don't know how to sample consecutive rows. Any help? Thanks!
df <- mtcars
df$row_nm <- seq(nrow(df))
set.seed(7)
sample_seq <- function(n, N) {
i <- sample(seq(N), size = 1)
ifelse(
test = i + (seq(n) - 1) <= N,
yes = i + (seq(n) - 1),
no = i + (seq(n) - 1) - N
)
}
replica <- replicate(n = 5, sample_seq(n = 10, N = nrow(df)))
# result
lapply(seq(ncol(replica)), function(x) df[replica[, x], ])
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 17
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 18
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 19
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 21
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 22
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 23
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 24
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 25
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 26
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 27
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#>
#> [[3]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#>
#> [[4]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 28
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 29
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 30
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 31
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 32
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
#>
#> [[5]]
#> mpg cyl disp hp drat wt qsec vs am gear carb row_nm
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 7
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 8
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 9
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 10
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 11
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 12
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 14
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 15
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 16
Created on 2022-01-24 by the reprex package (v2.0.1)
You could do something like:
#sample data
library(data.table)
df <- data.table(value = 1:100000)
#function which samples consecutive rows (x = dataframe, rows = nr of consecutive rows, nr = number of times you want to sample consecutive rows)
sample_fun <- function(x, rows, nr){
#maximum number which can be sampled
numbers <- 1:(nrow(x) - rows)
#randomly sample nr starting rows
sampled.numbers <- sample(numbers, nr)
#expand each start into `rows` consecutive row numbers
sampled.rows <- lapply(sampled.numbers, function(x){seq(x, (x+rows-1), 1)})
sampled.rows <- do.call(c, sampled.rows)
#sample and return
result <- x[sampled.rows,]
return(result)
}
sample_fun(x = df, rows = 5, nr = 2)
You don't mention if this can include replacement (i.e. if you sample 10:19, can you then also sample 15:24?). You also don't mention if you can sample anything above row 91, which would mean the sample of 10 gets cut off (i.e. 98, 99, 100 would only be 3 consecutive rows unless you want it to loop back to row 1). Assuming you can sample any value with replacement, the solution can be done in one line:
sapply(sample(1:100,5),function(x){seq(x,x+9)})
This applies the sequence function to each of 5 individually sampled numbers. The output will be a matrix, where each column is a sample of 10 consecutive rows, but as noted, these will potentially overlap, or go above 100.
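If you do want starts near the end to wrap back around to row 1 (as in the first answer) instead of running past 100, one hypothetical tweak is to take each sequence modulo the number of rows:
# wrap row indices above 100 back to the start (assumes a 100-row data frame)
sapply(sample(1:100, 5), function(x) ((seq(x, x + 9) - 1) %% 100) + 1)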
If you want a solution where the rows will not overlap at all, avoids values over 100, and does not make values above 91 less likely to be sampled, this actually gets kind of tricky, but I think the code below should work. You can't just sample from 1:91 without affecting the probability of your random sample, because then a value like 100 only has a 1/91 probability of being sampled (the sampled start has to be exactly 91), a constraint other values don't have. This solution makes it so all rows are equally likely to be sampled.
Rows=c(1:100,1:100)
SampleRows=matrix(0,nrow=10,ncol=5)
for(i in 1:ncol(SampleRows)){
SampledValue=sample(Rows,1)
RowsIndex=min(which(Rows==SampledValue))
Sequence=Rows[RowsIndex:(RowsIndex+9)]
SampleRows[,i]=Sequence
Rows=Rows[!(Rows %in% Sequence)]
}
This approach creates a vector that sequences from 1:100, repeated twice (variable Rows); you'll see why this is important in a bit. For each of 5 iterations (corresponding to 5 samples), we take a sampled value from Rows, which will be a number from 1:100; we then figure out where that number is in Rows and take the 9 values next to it. In the first sample this will always be 10 consecutive numbers (e.g. 20:29). But then we remove those sampled values from Rows. If we happen to get a next sample that would lead to overlap (like 18), then instead it samples (18, 19, 30, 31, 32, 33, 34...) since 20:29 have been removed. We need 1:100 twice in Rows so that if we sample a value like 99, the sequence wraps from 100 back to 1.
If you want your output in a vector, throw this in at the end:
sort(as.vector(SampleRows))
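To pull the sampled rows out of the data frame itself (mirroring the lapply step in the first answer, and assuming your 100-row data frame is called df), something like:
blocks <- lapply(seq_len(ncol(SampleRows)), function(i) df[SampleRows[, i], ])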
Let me know if this works for the needs of your problem.

Optimize/Vectorize Database Query with R

I am attempting to use R to query a large database. Due to the size of the database, I have written the query to fetch 100 rows at a time. My code looks something like:
library(RJDBC)
library(DBI)
library(tidyverse)
options(java.parameters = "-Xmx8000m")
drv<-JDBC("driver name", "driver path.jar")
conn<-
dbConnect(
drv,
"database info",
"username",
"password"
)
query<-"SELECT * FROM some_table"
hc<-tibble()
res<-dbSendQuery(conn,query)
repeat{
chunk<-dbFetch(res,100)
if(nrow(chunk)==0){break}
hc<-bind_rows(hc,chunk)
print(nrow(hc))
}
Basically, I would like to write something that does the same thing, but via a combination of a function and lapply. In theory, given the way R processes data via loops, using lapply will speed up the query. Some understanding of the dbFetch function may help, specifically how, in the repeat loop, it doesn't just keep selecting the first 100 rows over and over.
I have tried the following, but nothing works:
df_list <- lapply(query , function(x) dbGetQuery(conn, x))
hc<-tibble()
res<-dbSendQuery(conn,query)
test_query<-function(x){
chunk<-dbFetch(res,100)
if(nrow(chunk)==0){break}
print(nrow(hc))
}
bind_rows(lapply(test_query,res))
Consider following the example in the dbFetch docs that checks the completed status of the fetch with dbHasCompleted (dbFetch keeps a cursor into the result set, so each call returns the next chunk rather than the first 100 rows again). Then, for memory efficiency, build a list of data frames/tibbles with lapply and row-bind once outside the loop.
rs <- dbSendQuery(con, "SELECT * FROM some_table")
run_chunks <- function(i, res) {
# base::transform OR dplyr::mutate
# base::tryCatch => for empty chunks depending on chunk number
chunk <- tryCatch(transform(dbFetch(res, 100), chunk_no = i),
error = function(e) NULL)
return(chunk)
}
while (!dbHasCompleted(rs)) {
# PROVIDE SUFFICIENT NUMBER OF CHUNKS (table rows / fetch rows)
df_list <- lapply(1:5, run_chunks, res=rs)
}
# base::do.call(rbind, ...) OR dplyr::bind_rows(...)
final_df <- do.call(rbind, df_list)
Demonstration with in-memory SQLite database of mtcars:
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
run_chunks <- function(i, res) {
chunk <- dbFetch(res, 10)
return(chunk)
}
rs <- dbSendQuery(con, "SELECT * FROM mtcars")
while (!dbHasCompleted(rs)) {
# PROVIDE SUFFICIENT NUMBER OF CHUNKS (table rows / fetch rows)
df_list <- lapply(1:5, function(i)
print(run_chunks(i, res=rs))
)
}
do.call(rbind, df_list)
dbClearResult(rs)
dbDisconnect(con)
Output (5 chunks of 10, 10, 10, 2, and 0 rows, followed by the full 32 rows from the final rbind):
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 2 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 3 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 4 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
# 5 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
# 6 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 7 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 8 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 9 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# 10 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# 2 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
# 3 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
# 4 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
# 5 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# 6 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# 7 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# 8 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# 9 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
# 10 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
# 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# [1] mpg cyl disp hp drat wt qsec vs am gear carb
# <0 rows> (or 0-length row.names)
do.call(rbind, df_list)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
# 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
# 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
# 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
# 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
# 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
# 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
# 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
# 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The following works well, as it allows the user to customize the size and number of chunks. Ideally, the function would be vectorized somehow.
I explored getting the number of rows to automatically set the chunk number, but I couldn't find any methods without actually needing to perform the query first (one possible workaround is sketched after the solution below). Adding a large number of chunks doesn't add a ton of extra processing time. The performance improvement over the repeat approach depends on the size of the data, but the bigger the data, the bigger the improvement.
Chunks of n = 1000 seem to consistently produce the best results. Any suggestions to these points would be much appreciated.
Solution:
library(RJDBC)
library(DBI)
library(dplyr)
library(tidyr)
res<-dbSendQuery(conn,"SELECT * FROM some_table")
##chunk_size * chunk_number must be greater than the total number of rows N
chunk_size<-1000
chunk_number<-150
run_chunks<-
function(chunk_number, res, chunk_size) {
chunk <-
tryCatch(
dbFetch(res, chunk_size),
error = function(e) NULL
)
if(!is.null(chunk)){
return(chunk)
}
}
dat<-
bind_rows(
lapply(
1:chunk_number,
run_chunks,
res,
chunk_size
)
)
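Regarding deriving the number of chunks automatically: one option (just a sketch, assuming the same some_table and a backend where COUNT(*) is cheap) is to run a count query first and size chunk_number from it:
# hypothetical: get the row count up front, then derive the number of chunks
row_count <- dbGetQuery(conn, "SELECT COUNT(*) AS n FROM some_table")[[1]]
chunk_size <- 1000
chunk_number <- ceiling(row_count / chunk_size)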

Filter values in multiple dataframes

I cannot get my head around this. I have a dataset which contains a data.frame per day for 3 years, so I have a list with 1000 dataframes.
I want to filter all dataframes as in the example below. I know I could easily filter first (or use rbindlist) and then do the split, but I am looking for a way to apply a filter function to multiple dataframes. Can you help me? The code below does not work, but I hope it helps make clear what I want to achieve.
dflist <- mtcars %>%
split(.$cyl)
lapply(dflist, function(x) dplyr::filter(x[["mpg"]] > 10))
filter() works on a data.frame/tbl_df. In the code above, we instead extract a vector (x[["mpg"]]) and apply filter() to the resulting logical vector, which errors:
library(tidyverse)
filter(mtcars$mpg > 10)
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "logical"
We need to apply filter() on the data.frame itself:
map(dflist, ~ .x %>%
filter(mpg > 10))
#$`4`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#9 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#10 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#11 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
#$`6`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#4 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#5 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#6 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#7 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#$`8`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#2 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#3 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#4 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#5 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#6 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#9 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#10 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#11 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#12 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#13 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#14 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Or using lapply
lapply(dflist, function(x) x %>%
filter(mpg > 10))
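As the question already notes, if the filter can happen before the split, doing it once up front is also an option (a minimal sketch, with mtcars standing in for the daily data):
library(dplyr)
dflist <- mtcars %>%
  filter(mpg > 10) %>%
  split(.$cyl)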

Automatically print message when data.frame updates

I'd like for R to print the dimensions of my updated dataframes to show how each of my steps influences the number of observations remaining. Typically my dataset size changes because there are NAs. I can always quickly check dimensions:
a = data.frame(a = c(1,2), b = c(3, NA))
na.omit(a) %>% dim
but I would want to do two things in that one line:
na.omit(a) %>% size
a <- na.omit(a)
since code gets messy with many dim() calls.
Is there a way to toggle R's output? Something like
a <- na.omit(a)
message: dimensions are now m x n
Edit: Added code (below the original code) to only display a notification when the object changes.
If you just want the dimensions to print after every line that you execute, you could write a task callback:
tc <- addTaskCallback(function(...){message("Dimensions are: ", paste(dim(a), collapse = " "));return(TRUE)})
# When you want to remove the callback
removeTaskCallback(tc)
Here is an example session of it in use. You could probably modify the task callback to only print if things change.
> a <- mtcars
>
> tc <- addTaskCallback(function(...){message("Dimensions are: ", paste(dim(a), collapse = " "));return(TRUE)})
Dimensions are: 32 11
>
> a
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Dimensions are: 32 11
> a <- a[1:5,]
Dimensions are: 5 11
>
> # When you want to remove the callback
> removeTaskCallback(tc)
[1] TRUE
If you want to receive notifications only when the object changes, we can write a little bit more to actually monitor the object and display the message only when appropriate. Here is the code I'll use:
# #object_name - Character string. The object name you want to monitor
# #example tc <- addTaskCallback(monitor_object("mtcars"))
monitor_object <- function(object_name){
object_dim <- NULL
if(exists(object_name)){
object_dim <- dim(get(object_name))
}
f <- function(...){
new_dim <- NULL
if(exists(object_name)){
new_dim <- dim(get(object_name))
}
if(!identical(new_dim, object_dim)){
msg <- paste0(object_name,
" changed.\nOld dimensions: ",
paste(object_dim, collapse = " "),
"\nNew dimensions: ",
paste(new_dim, collapse = " "))
object_dim <<- new_dim
message(msg)
}
return(TRUE)
}
return(f)
}
tc <- addTaskCallback(monitor_object("mtcars"))
# When you want to remove the callback
removeTaskCallback(tc)
And an example session...
> tc <- addTaskCallback(monitor_object("mtcars"))
> mtcars <- mtcars[,-1]
mtcars changed.
Old dimensions: 32 11
New dimensions: 32 10
> head(mtcars)
cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 6 225 105 2.76 3.460 20.22 1 0 3 1
> mtcars <- mtcars[1:5,]
mtcars changed.
Old dimensions: 32 10
New dimensions: 5 10
I've actually expanded upon this code because it was a fun little exercise and made it so you can specify the function you want to use to monitor the object. I made a gist for the function that can be found here.
One option would be to write a new function like so:
omitter<-function(x){
x<-na.omit(x)
print(paste0("The dimensions are now ",dim(x)[1]," x ", dim(x)[2]))
return(x)
}
This would print the message and store the new object:
y<-omitter(data.frame(a=c(1:5,NA)))
> y
a
1 1
2 2
3 3
4 4
5 5
Edit: based on @DavidKlotz's comment above, you could use message() in place of print().
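A hypothetical generalization of the same idea is a small wrapper that applies any transformation and reports the resulting dimensions, so it is not tied to na.omit:
# hypothetical helper: apply any transformation f and report the new dimensions
with_dims <- function(x, f, ...){
  x <- f(x, ...)
  message("Dimensions are now ", nrow(x), " x ", ncol(x))
  x
}
y <- with_dims(data.frame(a = c(1:5, NA)), na.omit)
# Dimensions are now 5 x 1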
