I cannot get my head around this. I have a dataset that contains one data.frame per day for 3 years, so I have a list of 1000 data frames.
I want to filter all of the data frames as in the example below. I know I could easily filter first (or use rbindlist) and then do the split, but I would like a way to apply a filter function to multiple data frames. Can you help me? The code below does not work, but I hope it makes clear what I want to achieve.
dflist <- mtcars %>%
split(.$cyl)
lapply(dflist, function(x) dplyr::filter(x[["mpg"]] > 10))
filter works on a data.frame/tbl_df. In your code, x[["mpg"]] extracts a vector, so filter ends up being applied to the logical vector produced by the comparison:
library(tidyverse)
filter(mtcars$mpg > 10)
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "logical"
We need to apply filter to the data.frame itself:
map(dflist, ~ .x %>%
filter(mpg > 10))
#$`4`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#9 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#10 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#11 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
#$`6`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#4 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#5 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#6 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#7 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#$`8`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#2 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#3 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#4 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#5 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#6 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#9 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#10 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#11 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#12 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#13 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#14 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Or using lapply
lapply(dflist, function(x) x %>%
filter(mpg > 10))
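For comparison, the same idea can be sketched in base R without dplyr. The threshold of 20 and the name filtered are chosen for illustration only (the original example uses 10, which keeps every row of mtcars).

```r
# Split mtcars by cylinder count, as in the question
dflist <- split(mtcars, mtcars$cyl)

# Index each data frame with a logical vector built from its own mpg column
filtered <- lapply(dflist, function(x) x[x$mpg > 20, , drop = FALSE])

# Rows kept per group
sapply(filtered, nrow)
#  4  6  8 
# 11  3  0 
```

The anonymous function receives one whole data frame at a time, which is the same point made above: the filtering step has to see the data frame, not a column pulled out of it.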
My query is: May I use summarise after group_by like this:
mydataset %>%
arrange(grouping_variable,ordering_variable) %>%
group_by(grouping_variable) %>% summarise(answer = another_variable[j])
I expect to see the jth ranked row in each group when I do the above.
I think the above is correct but not mentioned in the documentation.
I ran the following experiment to determine this.
Here is the whole data set:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% as.data.frame
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
2 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
5 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
6 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
7 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
8 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
9 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
10 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
11 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
12 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
13 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
15 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
16 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
17 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
18 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
19 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
20 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
21 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
22 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
23 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
24 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
25 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
26 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
27 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
28 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
30 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
31 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
32 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Here is the first row (j=1) in each group:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = mpg[1])
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.4
2 6 17.8
3 8 10.4
Here is the second row in each group (j=2):
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = mpg[2])
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.5
2 6 18.1
3 8 10.4
>
On the help page, it says that the way to do this is as follows:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = first(mpg))
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.4
2 6 17.8
3 8 10.4
>
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = nth(mpg,2))
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.5
2 6 18.1
3 8 10.4
>
I don't remember which website I read the first approach on. It is not mentioned on the help page, which is why I am asking here.
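As a cross-check that does not rely on dplyr, the same lookup can be sketched in base R: sort first, then index the jth element within each group. The names sorted, j, and jth are illustrative only. This is a sketch of the behaviour shown in the experiment above, not a statement about what the dplyr documentation guarantees.

```r
# Sort by group, then by the ordering variable (same as arrange(cyl, mpg))
sorted <- mtcars[order(mtcars$cyl, mtcars$mpg), ]

# Take the jth mpg value within each cyl group
j <- 2
jth <- tapply(sorted$mpg, sorted$cyl, function(v) v[j])
jth
#    4    6    8 
# 21.5 18.1 10.4 

# Indexing past the end of a group yields NA rather than an error
tapply(sorted$mpg, sorted$cyl, function(v) v[20])
#  4  6  8 
# NA NA NA 
```

The values for j = 2 match the summarise(answer = mpg[2]) and nth(mpg, 2) results above. Plain indexing silently returns NA when a group has fewer than j rows, which is worth keeping in mind for small groups.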
I would like to add new columns to an existing data frame. The column names are generated in a FOR loop so that they are numerically sequential. Here is the code:
NewColumn <- paste("return_date", as.character(i), sep = "_")
When I display NewColumn, this is what I want:
[1] "return_date_2"
When I execute:
mutate(Cima, NewColumn = "01-01-01")
The name of the column is: NewColumn
I can rename it, but is there a way to avoid this step?
Why does R not recognize that NewColumn holds a string?
Do you have to use mutate in your code?
If not, replace mutate(Cima, NewColumn = "01-01-01") with Cima[NewColumn] <- "01-01-01"
This happens because mutate treats the left-hand side of the equals sign as the literal column name. You can get around it with the code below:
library(dplyr)
library(rlang)
i <- 1
NewColumn <- paste("return_date", as.character(i), sep = "_")
> mutate(mtcars, !!NewColumn := 5)
mpg cyl disp hp drat wt qsec vs am gear carb return_date_1
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 5
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 5
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 5
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 5
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 5
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 5
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 5
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 5
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 5
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 5
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 5
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 5
16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 5
17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 5
18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 5
19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 5
20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 5
21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 5
22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 5
23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 5
24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 5
25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 5
26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 5
27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 5
29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 5
30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 5
31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 5
32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 5
Take a look at this question to understand it better:
Use dynamic variable names in `dplyr`
You can also check Advanced R by Hadley Wickham, which covers the bang-bang operator (!!) and what it does.
https://adv-r.hadley.nz/
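Since the column names in the question are generated inside a for loop, here is a minimal base-R sketch of that loop using the bracket assignment suggested above. The data frame df (the first rows of mtcars) stands in for Cima, and the range 1:3 is an assumption for illustration.

```r
# Stand-in for the asker's data frame
df <- head(mtcars, 3)

for (i in 1:3) {
  # Build the name as a string, e.g. "return_date_2"
  new_column <- paste("return_date", i, sep = "_")
  # [[<- uses the string stored in new_column as the column name
  df[[new_column]] <- "01-01-01"
}

names(df)[12:14]
# [1] "return_date_1" "return_date_2" "return_date_3"
```

Bracket assignment evaluates new_column as an ordinary variable, so no renaming step is needed afterwards.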
I've used curly-curly with group_by and summarise as described in the rlang announcement. But I can't get it to work when mutating a variable in place. What's the best way to do this currently with dplyr?
Say I want to supply an unquoted column name and have it mutated, here's a toy example function that doesn't work:
my_fun <- function(dat, var_name){
dat %>%
mutate({{var_name}} = 1)
}
my_fun(mtcars, cyl)
What should that mutate line be to change any column in mtcars to be a constant?
You need to use := instead of = if you want to use curly-curly to specify a name on the left-hand side of an assignment in mutate:
my_fun <- function(dat, var_name){
dat %>%
mutate({{var_name}} := 1)
}
Which allows:
my_fun(mtcars, cyl)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 1 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 21.0 1 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 1 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 1 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 1 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 18.1 1 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 14.3 1 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 24.4 1 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 22.8 1 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 19.2 1 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 17.8 1 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 16.4 1 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 17.3 1 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 15.2 1 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 10.4 1 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 10.4 1 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 14.7 1 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 32.4 1 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 30.4 1 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 33.9 1 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 21.5 1 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 15.5 1 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 15.2 1 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 13.3 1 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 19.2 1 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 27.3 1 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 26.0 1 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 30.4 1 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 15.8 1 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 19.7 1 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 15.0 1 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 21.4 1 121.0 109 4.11 2.780 18.60 1 1 4 2
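A related sketch, assuming dplyr 1.0 or later: the same := form also accepts a glue-style string on the left-hand side, which lets the function create a new column derived from the supplied name instead of overwriting it. The function name double_col and the _doubled suffix are made up for illustration.

```r
library(dplyr)

# Adds a column named "<var_name>_doubled" holding twice the original values
double_col <- function(dat, var_name) {
  dat %>%
    mutate("{{ var_name }}_doubled" := {{ var_name }} * 2)
}

out <- double_col(mtcars, cyl)
head(out[, c("cyl", "cyl_doubled")], 3)
```

Here {{ var_name }} appears twice: inside the string it supplies the column's name, and on the right-hand side it supplies the column's values.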
I am attempting to use R to query a large database. Due to the size of the database, I have written the query to fetch 100 rows at a time. My code looks something like:
library(RJDBC)
library(DBI)
library(tidyverse)
options(java.parameters = "-Xmx8000m")
drv<-JDBC("driver name", "driver path.jar")
conn<-
dbConnect(
drv,
"database info",
"username",
"password"
)
query<-"SELECT * FROM some_table"
hc<-tibble()
res<-dbSendQuery(conn,query)
repeat{
chunk<-dbFetch(res,100)
if(nrow(chunk)==0){break}
hc<-bind_rows(hc,chunk)
print(nrow(hc))
}
Basically, I would like to write something that does the same thing, but via a combination of a function and lapply. In theory, given the way R processes data via loops, using lapply should speed up the query. Some understanding of the dbFetch function may also help, specifically how the repeat loop avoids selecting the same initial 100 rows every time.
I have tried the following, but nothing works:
df_list <- lapply(query , function(x) dbGetQuery(conn, x))
hc<-tibble()
res<-dbSendQuery(conn,query)
test_query<-function(x){
chunk<-dbFetch(res,100)
if(nrow(chunk)==0){break}
print(nrow(hc))
}
bind_rows(lapply(test_query,res))
Consider following the example in the dbFetch docs, which uses dbHasCompleted to check whether the fetch has completed. Then, for memory efficiency, build a list of data frames/tibbles with lapply and row-bind once outside the loop.
rs <- dbSendQuery(con, "SELECT * FROM some_table")
run_chunks <- function(i, res) {
# base::transform OR dplyr::mutate
# base::tryCatch => for empty chunks depending on chunk number
chunk <- tryCatch(transform(dbFetch(res, 100), chunk_no = i),
error = function(e) NULL)
return(chunk)
}
while (!dbHasCompleted(rs)) {
# PROVIDE SUFFICIENT NUMBER OF CHUNKS (table rows / fetch rows);
# OTHERWISE THE NEXT PASS OF THE while LOOP OVERWRITES df_list
df_list <- lapply(1:5, run_chunks, res=rs)
}
# base::do.call(rbind, ...) OR dplyr::bind_rows(...)
final_df <- do.call(rbind, df_list)
Demonstration with in-memory SQLite database of mtcars:
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
run_chunks <- function(i, res) {
chunk <- dbFetch(res, 10)
return(chunk)
}
rs <- dbSendQuery(con, "SELECT * FROM mtcars")
while (!dbHasCompleted(rs)) {
# PROVIDE SUFFICIENT NUMBER OF CHUNKS (table rows / fetch rows)
df_list <- lapply(1:5, function(i)
print(run_chunks(i, res=rs))
)
}
do.call(rbind, df_list)
dbClearResult(rs)
dbDisconnect(con)
Output (5 chunks of 10 rows, 10 rows, 10 rows, 2 rows, 0 rows, and full 32 rows)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 2 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 3 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 4 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
# 5 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
# 6 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 7 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 8 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 9 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# 10 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# 2 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
# 3 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
# 4 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
# 5 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# 6 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# 7 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# 8 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# 9 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
# 10 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
# 2 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# [1] mpg cyl disp hp drat wt qsec vs am gear carb
# <0 rows> (or 0-length row.names)
do.call(rbind, df_list)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
# 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
# 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
# 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
# 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
# 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
# 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
# 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
# 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
# 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
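If the number of chunks is not known in advance, one alternative is to let dbHasCompleted drive the loop and append each chunk to a list. This is a minimal sketch against the same in-memory SQLite copy of mtcars, not part of the answer above. The names chunks and final_df and the chunk size of 10 are illustrative. It also shows why repeated fetches do not return the same rows: the result object from dbSendQuery keeps its position, and each dbFetch call continues from where the previous one stopped.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

rs <- dbSendQuery(con, "SELECT * FROM mtcars")

# Collect chunks in a list; each dbFetch call returns the next rows
chunks <- list()
while (!dbHasCompleted(rs)) {
  chunks[[length(chunks) + 1]] <- dbFetch(rs, n = 10)
}

dbClearResult(rs)
dbDisconnect(con)

# Row-bind once, outside the loop
final_df <- do.call(rbind, chunks)
nrow(final_df)
# [1] 32
```

Growing a list and binding once avoids copying the accumulated data on every iteration, which is what bind_rows(hc, chunk) inside the repeat loop does.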
The following works well, as it allows the user to customize the size and number of chunks. Ideally, the function would be vectorized somehow.
I explored getting the number of rows to set the number of chunks automatically, but I couldn't find a method that doesn't require performing the query first. Adding a large number of chunks doesn't add much extra processing time. The performance improvement over the repeat approach depends on the size of the data: the bigger the data, the bigger the improvement.
Chunks of n = 1000 seem to consistently produce the best results. Any suggestions on these points would be much appreciated.
Solution:
library(RJDBC)
library(DBI)
library(dplyr)
library(tidyr)
res<-dbSendQuery(conn,"SELECT * FROM some_table")
## chunk_size * chunk_number must be at least N, the number of rows the query returns
chunk_size<-1000
chunk_number<-150
run_chunks<-
function(chunk_number, res, chunk_size) {
chunk <-
tryCatch(
dbFetch(res, chunk_size),
error = function(e) NULL
)
if(!is.null(chunk)){
return(chunk)
}
}
dat<-
bind_rows(
lapply(
1:chunk_number,
run_chunks,
res,
chunk_size
)
)
I have way too many variables to list them manually inside a rowMeans(cbind()) call. Naturally, I tried to pass them packed in a single character vector, but it's not working. I tried eval, .., and mget, yet none of them seems to do the trick:
column_names <- as.vector(summary$variables) #this is where I take the column names from (characters)
dataset[ , means := rowMeans( cbind( eval(column_names) ) , na.rm=TRUE )]
Thanks
You need to use .SD and .SDcols to specify the relevant columns; here is a minimal reproducible example based on mtcars
library(data.table)
dt <- as.data.table(mtcars)
col_names <- c("mpg", "disp", "drat")
dt[, mean := rowMeans(.SD), .SDcols = col_names]
dt
#mpg cyl disp hp drat wt qsec vs am gear carb mean
#1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 61.63333
#2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 61.63333
#3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 44.88333
#4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 94.16000
#5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 127.28333
#6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 81.95333
#7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 125.83667
#8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 58.26333
#9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 55.84000
#10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 63.57333
#11: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 63.10667
#12: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 98.42333
#13: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 98.72333
#14: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 98.02333
#15: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 161.77667
#16: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 157.80000
#17: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 152.64333
#18: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 38.39333
#19: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 37.01000
#20: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 36.40667
#21: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 48.43333
#22: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 112.08667
#23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 107.45000
#24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 122.34333
#25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 140.76000
#26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 36.79333
#27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 50.24333
#28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 43.09000
#29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 123.67333
#30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 56.10667
#31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 106.51333
#32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 48.83667
#mpg cyl disp hp drat wt qsec vs am gear carb mean
So in your case, something like:
dataset[ , means := rowMeans(.SD, na.rm = TRUE), .SDcols = column_names]
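To see the effect of na.rm = TRUE, here is a small sketch with made-up data. The table dt, its columns a, b, c, and the vector cols are illustrative only. It also hints at why the original attempt fails: eval(column_names) simply returns the character vector of names, so rowMeans never receives the numeric columns.

```r
library(data.table)

# Toy table with missing values; column c is deliberately left out of the mean
dt <- data.table(a = c(1, NA, 3), b = c(3, 4, NA), c = c(5, 6, 7))
cols <- c("a", "b")

# .SD is restricted to the columns named in cols
dt[, means := rowMeans(.SD, na.rm = TRUE), .SDcols = cols]
dt$means
# [1] 2 4 3
```

Each row's mean is taken over the non-missing values of a and b only, so rows 2 and 3 fall back to the single value that is present.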