This question already has answers here:
Returning above and below rows of specific rows in r dataframe
(5 answers)
Closed 1 year ago.
I would like to select N rows above and below a match.
I'm trying the command:
mtcars[which(mtcars$vs == 1) + c(-1:1), ]
It returns the follow warning:
Warning message:
In which(mtcars$vs == 1) + c(-1:1):
longer object length is not a multiple of shorter object length
We can write a short function to return all elements of vec that match val or that are within n elements (either direction):
newfun <- function(vec, val, n) {
rows <- sapply(which(vec==val), function(x) seq(x-n, x+n, 1))
rows <- unique(sort(rows[rows>0 & rows<length(vec)]))
return(vec[rows])
}
For example:
newfun(mtcars$vs, 1, 2)
Before adding the desired range of indices to your focal index (the result from which), you need to repeat each value to the length of your range.
# set the number of values to select, before and after each focal index
n <- 1
# create a range of (relative) indices
i <- -n:n
# repeat focal indices
# add range of n prior and following indices
ix <- rep(which(mtcars$vs == 1), each = length(i)) + i
# select unique indices, truncated to the relevant range of rows,...
unique(ix[ix > 0 & ix <= nrow(mtcars)])
# [1] 2 3 4 5 6 7 8 9 10 11 12 17 18 19 20 21 22 25 26 27 28 29 31 32
# ...which then can be used to subset data
mtcars[unique(ix[ix > 0 & ix <= nrow(mtcars)]), ]
This seems to be a simple question but is not as trivial as presumably expected.
The issue is that which(mtcars$vs == 1) returns a vector rather than a single value:
[1] 3 4 6 8 9 10 11 18 19 20 21 26 28 32
If another vector -1:1 (which is c(-1L, 0L, 1L)) is added to it, the normal R rules for operations on vectors of unequal lengths apply: The recycling rule says
Any short vector operands are extended by recycling their values until
they match the size of any other operands.
Therefore the shorter vector -1:1 will be recycled to the length of which(mtcars$vs == 1), i.e.,
rep(-1:1, length.out = length(which(mtcars$vs == 1)))
[1] -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0
Therefore, the result of
which(mtcars$vs == 1) + -1:1
is the element-wise sum of the elements of both vectors where the shorter vector has been recycled to match the length of the longer vector.
[1] 2 4 7 7 9 11 10 18 20 19 21 27 27 32
which is propably not what the OP has expected.
In addition, we get the
Warning message:
In which(mtcars$vs == 1) + -1:1 :
longer object length is not a multiple of shorter object length
because which(mtcars$vs == 1) has length 14 and -1:1 has length 3.
Solution using outer()
In order to select the N rows above and below each matching row, we need to add -N:N to each row number returned by which(mtcars$vs == 1):
outer(which(mtcars$vs == 1), -1:1, `+`)
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 5 6 7
[4,] 7 8 9
[5,] 8 9 10
[6,] 9 10 11
[7,] 10 11 12
[8,] 17 18 19
[9,] 18 19 20
[10,] 19 20 21
[11,] 20 21 22
[12,] 25 26 27
[13,] 27 28 29
[14,] 31 32 33
Now, we have an array of all row numbers. Unfortunately, it cannot be used directly for subsetting because it contains duplicates and there are row numbers which do not exist in mtcars. So the the result has to be "post-processed" before it can be used for subsetting.
library(magrittr) # piping used for clarity
rn <- outer(which(mtcars$vs == 1), -1:1, `+`) %>%
as.vector() %>%
unique() %>%
Filter(function(x) x[1 <= x & x <= nrow(mtcars)], .)
rn
[1] 2 3 4 5 6 7 8 9 10 11 12 17 18 19 20 21 22 25 26 27 28 29 31 32
mtcars[rn, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Related
I would like to apply the below function (cut.at.n.tile) to a data frame (some_data_frame) whilst grouping by a chosen column (e.g. SomeGroupingColumn) and choosing the target column (e.g. ChosenColumn). I tried using sapply() without success - see code below. Any input very much appreciated. Apologies for this not being fully replicable/self contained ...
cut.at.n.tile <- function(X, n = 7) {
cut(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE),
labels = 1:n, include.lowest = TRUE)
}
some_data_frame$SeasonTypeNumber = sapply(split(some_data_frame['ChosenColumn'], SomeGroupingColumn), cut.at.n.tile)
There are a few problems here.
some_data_frame['ChosenColumn'] always returns a single-column data.frame, not a vector which your function requires. I suggest switching to some_data_frame[['ChosenColumn']].
SomeGroupingColumn looks like it should be a column (hence the name) in the data, but it is not referenced within a frame. Perhaps some_data_frame[['SomeGroupingColumn']].
You need to ensure that the breaks= used are unique. For example,
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# Error in cut.default(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE), :
# 'breaks' are not unique
If we debug that function, we see
X
# [1] 360.0 360.0 275.8 275.8 275.8 472.0 460.0 440.0 318.0 304.0 350.0 400.0 351.0 301.0
quantile(X, probs = (0:n)/n, na.rm = TRUE)
# 0% 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 100%
# 275.8000 275.8000 303.1429 336.2857 354.8571 371.4286 442.8571 472.0000
where 275.8 is repeated. This can happen based on nuances in the raw data, and you can't really predict when it will occur.
Since we'll likely have multiple groups, all of the subvectors' levels= (since cut returns a factor) must be the same length, though admittedly 1 in one group is unlikely to be the same as 1 in another group.
Since in this case we can never be certain which n-tile a number strictly applies (in 275.8 in the first or second n-tile?), we can only adjust one of the dupes and accept the imperfection. I suggest a cumsum(duplicated(.)*1e-9): the premise is that it adds an iota to each value that is a dupe, rendering it no-longer a dupe. It is possible that adding 1e-9 to one value will make it a dupe of the next ... so we can be a little OCD by repeatedly doing this until we have no duplicates.
sapply is unlikely to return a vector, much (almost "certainly") more likely to return a list (if the groups are not perfectly balanced) or a matrix (perfectly balanced). We cannot simply unlist, since the order of the unlisted vectors will likely not be the order of the source data.
We can use `split<-`, or we can use a few other techniques (dplyr and/or data.table)
Updated function, and demonstration with mtcars:
cut.at.n.tile <- function(X, n = 7) {
brks <- quantile(X, probs = (0:n)/n, na.rm = TRUE)
while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9*dupes)
cut(X, breaks = brks, labels = 1:n, include.lowest = TRUE)
}
base R
ret <- lapply(split(mtcars[['disp']], mtcars[['cyl']]), cut.at.n.tile)
mtcars[["newcol"]] <- NA # create an empty column
split(mtcars[['newcol']], mtcars[['cyl']]) <- ret
mtcars
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 4
# Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 1
# Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
# Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 1
# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 7
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 7
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 6
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 5
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 3
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
Validation:
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
# Levels: 1 2 3 4 5 6 7
subset(mtcars, cyl == 8)$newcol
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(newcol = cut.at.n.tile(disp)) %>%
ungroup()
# # A tibble: 32 × 12
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 5
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 6
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 5
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 7
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 7
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 4
# # … with 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
data.table
library(data.table)
as.data.table(mtcars)[, newcol := cut.at.n.tile(disp), by = .(cyl)][]
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <fctr>
# 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# 5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# 7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# 8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# 9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# 10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# ---
# 23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# 24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# 25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# 26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# 27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# 28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# 29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# 30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# 31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# 32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
I've looked at other threads and tried to apply it to my code but have had no luck.
CDR3_post_challenge_unique_clonecount$participant_per_cdr3aa <- as.numeric(CDR3_post_challenge_unique_clonecount$cdr3aa)
participant_list <- unique(CDR3_post_challenge_unique_clonecount$cdr3aa)
for (c in participant_list)
{
CDR3_post_challenge_unique_clonecount$participant_per_cdr3aa[CDR3_post_challenge_unique_clonecount$cdr3aa == c] <- length(unique(CDR3_post_challenge_unique_clonecount$PartID[CDR3_post_challenge_unique_clonecount$cdr3aa == c]))
}
Here is a bit of the dataframe:
cdr3aa clonecount PartID
CAAGRAARGGSVPHWFDPF 1 S-1
CAALADSGSQTDAFDIA 1 S-1
CAFHAAYGSQHGLDVW 1 S-1
CAGGLAWLVDDW 1 S-1
CAGRWFFPW 1 S-1
CAGVKNGRGMDVW 1 S-1
I think you can replace the for loop with
CDR3_post_challenge_unique_clonecount$per3 <-
as.integer(
ave(CDR3_post_challenge_unique_clonecount$PartID,
CDR3_post_challenge_unique_clonecount$cdr3aa,
FUN = function(z) length(unique(z)))
)
I'll demonstrate with mtcars, using the follow analogs:
mtcars --> CDR3_post_challenge_unique_clonecount
cyl --> cdr3aa, the categorical variable in which we want to count PartID
drat --> PartID, the thing we want to count (uniquely) within each cdr3aa
mtcars$drat_per_cyl <- ave(mtcars$drat, mtcars$cyl, FUN = function(z) length(unique(z)))
mtcars
# mpg cyl disp hp drat wt qsec vs am gear carb drat_per_cyl
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 5
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 5
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 10
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 11
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 5
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 11
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 10
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 5
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 5
# Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 11
# Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 11
# Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 11
# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 11
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 11
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 11
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 10
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 10
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 10
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 10
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 11
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 11
# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 11
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 11
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 10
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 10
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 10
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 11
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 5
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 11
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 10
Notes:
ave is a little brain-dead in that the class of the return value is always the same as the class of the first argument. This means that one cannot count unique "character" and expect to get an integer, it is instead returned as a string. It's because of this that I wrap ave in as.integer(.).
ave returns a vector the same length as the input, with values corresponding 1-for-1 (meaning the order is relevant and preserved). In my example of mtcars, this means that it is effectively doing something like this:
ind4 <- which(mtcars$cyl == 4L)
ind4
# [1] 3 8 9 18 19 20 21 26 27 28 32
length(unique(mtcars$drat[ind4]))
# [1] 10
ind6 <- which(mtcars$cyl == 6L)
ind6
# [1] 1 2 4 6 10 11 30
length(unique(mtcars$drat[ind6]))
# [1] 5
### ...
but it will place the return value 10 in the ind4 positions of the return value. For example, because of my ind6, the return value will start with
c(5, 5, .., 5, .., 5, .., .., .., 5, 5, .., .....)
Because of ind4, it will contain
c(.., .., 10, .., .., .., .., 10, 10, .....)
(And same for cyl==8L.)
I am attempting to use R to query a large database. Due to the size of the database, I have written the query to fetch 100 rows at a time My code looks something like:
library(RJDBC)
library(DBI)
library(tidyverse)
options(java.parameters = "-Xmx8000m")
drv<-JDBC("driver name", "driver path.jar")
conn<-
dbConnect(
drv,
"database info",
"username",
"password"
)
query<-"SELECT * FROM some_table"
hc<-tibble()
res<-dbSendQuery(conn,query)
repeat{
chunk<-dbFetch(res,100)
if(nrow(chunk)==0){break}
hc<-bind_rows(hc,chunk)
print(nrow(hc))
}
Basically, I would like write something that does the same thing, but via the combination of function and lapply. In theory, given the way R processes data via loops, using lapply will speed up query. Some understanding of the dbFetch function may help. Specifically, how in the repeat loop it doesn't just keep selecting the first initial 100 rows.
I have tried the following, but nothing works:
df_list <- lapply(query , function(x) dbGetQuery(conn, x))
hc<-tibble()
res<-dbSendQuery(conn,query)
test_query<-function(x){
chunk<-dbFetch(res,100)
if(nrow(chunk)==0){break}
print(nrow(hc))
}
bind_rows(lapply(test_query,res))
Consider following the example in dbFetch docs that checks for completed status of fetch, dbHasCompleted. Then, for memory efficiency build a list of data frames/tibbles with lapply then row bind once outside the loop.
rs <- dbSendQuery(con, "SELECT * FROM some_table")
run_chunks <- function(i, res) {
# base::transform OR dplyr::mutate
# base::tryCatch => for empty chunks depending on chunk number
chunk <- tryCatch(transform(dbFetch(res, 100), chunk_no = i),
error = function(e) NULL)
return(chunk)
}
while (!dbHasCompleted(rs)) {
# PROVIDE SUFFICIENT NUMBER OF CHUNKS (table rows / fetch rows)
df_list <- lapply(1:5, run_chunks, res=rs)
}
# base::do.call(rbind, ...) OR dplyr::bind_rows(...)
final_df <- do.call(rbind, df_list)
Demonstration with in-memory SQLite database of mtcars:
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
run_chunks <- function(i, res) {
chunk <- dbFetch(res, 10)
return(chunk)
}
rs <- dbSendQuery(con, "SELECT * FROM mtcars")
while (!dbHasCompleted(rs)) {
# PROVIDE SUFFICIENT NUMBER OF CHUNKS (table rows / fetch rows)
df_list <- lapply(1:5, function(i)
print(run_chunks(i, res=rs))
)
}
do.call(rbind, df_list)
dbClearResult(rs)
dbDisconnect(con)
Output (5 chunks of 10 rows, 10 rows, 10 rows, 2 rows, 0 rows, and full 32 rows)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 2 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 3 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 4 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
# 5 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
# 6 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 7 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 8 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 9 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# 10 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# 2 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
# 3 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
# 4 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
# 5 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# 6 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# 7 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# 8 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# 9 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
# 10 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
# 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# [1] mpg cyl disp hp drat wt qsec vs am gear carb
# <0 rows> (or 0-length row.names)
do.call(rbind, df_list)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
# 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
# 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
# 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
# 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
# 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
# 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
# 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
# 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The following works well, as it allows the user to customize the size and number of chunks. Ideally, the function would be Vectorized somehow.
I explored getting the number of rows to automatically set the chunk number, but I couldn't find any methods without actually needing to perform the query first. Adding a large number of chunks doesn't add a ton of extra process time. The performance improvement over the repeat approach depends on the size of the data, but the bigger the data the bigger the performance improvement.
Chunks of n = 1000 seem to consistently produce the best results. Any suggestions to these points would be much appreciated.
Solution:
library(RJDBC)
library(DBI)
library(dplyr)
library(tidyr)
res<-dbSendQuery(conn,"SELECT * FROM some_table")
##Multiplied together need to be greater than N
chunk_size<-1000
chunk_number<-150
run_chunks<-
function(chunk_number, res, chunk_size) {
chunk <-
tryCatch(
dbFetch(res, chunk_size),
error = function(e) NULL
)
if(!is.null(chunk)){
return(chunk)
}
}
dat<-
bind_rows(
lapply(
1:chunk_number,
run_chunks,
res,
chunk_size
)
)
I have made a plot using ggplot2 geom_histogram from a data frame. See sample below and link to the ggplot histogram Need to label each geom_vline with the factors using a nested ddply function and facet wrap
I now need to make a data frame that contains the summarized data used to generate the ggplot above.
Sector2 Family Year Length
BUN Acroporidae 2010 332.1300496
BUN Poritidae 2011 141.1467966
BUN Acroporidae 2012 127.479
BUN Acroporidae 2013 142.5940556
MUR Faviidae 2010 304.0405
MUR Faviidae 2011 423.152
MUR Pocilloporidae 2012 576.0295
MUR Poritidae 2013 123.8936667
NTH Faviidae 2010 60.494
NTH Faviidae 2011 27.427
NTH Pocilloporidae 2012 270.475
NTH Poritidae 2013 363.4635
To get values actually plotted you can use function ggplot_build() where argument is your plot.
p <- ggplot(mtcars,aes(mpg))+geom_histogram()+
facet_wrap(~cyl)+geom_vline(data=data.frame(x=c(20,30)),aes(xintercept=x))
pg <- ggplot_build(p)
This will make list and one of sublists is named data. This sublist contains dataframe with values used in plot, for example, for histrogramm it contains y values (the same as count). If you use facets then column PANEL shows in which facet values are used. If there are more than one geom_ in your plot then data will contains dataframes for each - in my example there is one dataframe for histogramm and another for vlines.
head(pg$data[[1]])
y count x ndensity ncount density PANEL group ymin ymax
1 0 0 9.791667 0 0 0 1 1 0 0
2 0 0 10.575000 0 0 0 1 1 0 0
3 0 0 11.358333 0 0 0 1 1 0 0
4 0 0 12.141667 0 0 0 1 1 0 0
5 0 0 12.925000 0 0 0 1 1 0 0
6 0 0 13.708333 0 0 0 1 1 0 0
xmin xmax
1 9.40000 10.18333
2 10.18333 10.96667
3 10.96667 11.75000
4 11.75000 12.53333
5 12.53333 13.31667
6 13.31667 14.10000
head(pg$data[[2]])
xintercept PANEL group xend x
1 20 1 1 20 20
2 30 1 1 30 30
3 20 2 2 20 20
4 30 2 2 30 30
5 20 3 3 20 20
6 30 3 3 30 30
layer_data is designed precisely for this :
layer_data(p, 1)
It will give you the data of the first layer, same as ggplot_build(p)$data[[1]].
Its source code is indeed precisely:
function (plot, i = 1L) ggplot_build(plot)$data[[i]]
While the other answers get you close, if you are looking for the actual data that was passed to ggplot(), you can use:
ggplot_build(p)$plot$data
require(tidyverse)
p <- ggplot(mtcars,aes(mpg))+geom_histogram()+
facet_wrap(~cyl)+geom_vline(data=data.frame(x=c(20,30)),aes(xintercept=x))
pg <- ggplot_build(p)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pg$plot$data
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Created on 2019-03-04 by the reprex package (v0.2.1)
While that isn't useful for an un-modified data frame, if you are piping through a series of mutate()'s or summarize()'s before you get to the ggplot, this can be useful after the fact to show the data.
Load the purrr package and write my_plot %>% pluck("data")
I want to parse_factor then fct_recode several variables in a dataframe. The levels (and their recode values) are stored in named strings.
How can I use those to implement what I want?
Note that in my case, I cannot simply use mutate, because I have several variables to which I want to apply the recoding.
Below is an example of what I thought would work (but does not).
library(tidyverse)
#> ── Attaching packages ────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.4
#> ✔ tidyr 0.8.0 ✔ stringr 1.3.0
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ───────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
gear_levels <- c("tri" = 3, "quad" = 4, "six" = 6, `NA` = 8)
mtcars %>%
mutate_at("gear", parse_factor, levels = gear_levels) %>%
mutate_at("gear", fct_recode, !!! gear_levels)
#> Warning: 5 parsing failures.
#> row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 27 NA value in level set 5 row 2 28 NA value in level set 5 col 3 29 NA value in level set 5 expected 4 30 NA value in level set 5 actual 5 31 NA value in level set 5
#> Error: Can't use `!!!` on atomic vectors in non-quoting functions
As per lionel's comment, this is what coercing to list looks like. Note that you need to supply a character vector to fct_recode and that you have to replace the names after as.character. I'm not sure exactly how your desired levels are stored.
Also your supplied levels don't match those in mtcars$gear, in case you didn't realise.
library(tidyverse)
gear_levels <- c("tri" = 3, "quad" = 4, "six" = 6, `NA` = 8)
gear_recode <- as.list(as.character(gear_levels))
names(gear_recode) <- names(gear_levels)
mtcars %>%
mutate_at(vars(gear), parse_factor, levels = gear_levels) %>%
mutate_at(vars(gear), fct_recode, !!! gear_recode)
#> Warning: 5 parsing failures.
#> row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 27 NA value in level set 5 row 2 28 NA value in level set 5 col 3 29 NA value in level set 5 expected 4 30 NA value in level set 5 actual 5 31 NA value in level set 5
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 quad 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 quad 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 quad 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 tri 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 tri 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 tri 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 tri 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 quad 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 quad 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 quad 4
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 quad 4
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 tri 3
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 tri 3
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 tri 3
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 tri 4
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 tri 4
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 tri 4
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 quad 1
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 quad 2
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 quad 1
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 tri 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 tri 2
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 tri 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 tri 4
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 tri 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 quad 1
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 <NA> 2
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 <NA> 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 <NA> 4
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 <NA> 6
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 <NA> 8
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 quad 2
Created on 2018-03-16 by the reprex package (v0.2.0).