Export "epiR" Output to Tables - r

Good morning,
I am using the "epiR" package to assess test accuracy.
https://search.r-project.org/CRAN/refmans/epiR/html/epi.tests.html
library(epiR)
library(dplyr)
library(tidyr)
## Generate a data set listing test results and true disease status:
dis <- c(rep(1, times = 744), rep(0, times = 842))
tes <- c(rep(1, times = 670), rep(0, times = 74),
rep(1, times = 202), rep(0, times = 640))
dat.df02 <- data.frame(dis, tes)
tmp.df02 <- dat.df02 %>%
mutate(dis = factor(dis, levels = c(1,0), labels = c("Dis+","Dis-"))) %>%
mutate(tes = factor(tes, levels = c(1,0), labels = c("Test+","Test-"))) %>%
group_by(tes, dis) %>%
summarise(n = n())
tmp.df02
## View the data in conventional 2 by 2 table format:
pivot_wider(tmp.df02, id_cols = c(tes), names_from = dis, values_from = n)
rval.tes02 <- epi.tests(tmp.df02, method = "exact", digits = 2,
conf.level = 0.95)
summary(rval.tes02)
The object's class is listed as "epi.tests". I would like to export the summary statistics to a table (e.g. with gtsummary or flextable).
As summary() is a base R function, I am struggling to do this. Can anyone help? Thank you very much.

The epi.tests function has been edited so it writes the results out to a data frame (instead of a list). This will simplify export to gtsummary or flextable. epiR version 2.0.50 to be uploaded to CRAN shortly.

This was not quite as straightforward as I expected.
It appears that summary(), when applied to an object x of class epi.tests, simply prints x$detail. x$detail is a list of data.frames, with the statistic names held as the names of the list elements rather than in a column. That last bit makes things slightly more complicated than they would otherwise have been.
A potential tidyverse solution is
library(tidyverse)
lapply(
names(rval.tes02$detail),
function(x) {
as_tibble(rval.tes02$detail[[x]]) %>%
add_column(statistic=x, .before=1)
}
) %>%
bind_rows()
# A tibble: 18 × 4
statistic est lower upper
<chr> <dbl> <dbl> <dbl>
1 ap 0.550 0.525 0.574
2 tp 0.469 0.444 0.494
3 se 0.901 0.877 0.921
4 sp 0.760 0.730 0.789
5 diag.ac 0.826 0.806 0.844
6 diag.or 28.7 21.5 38.2
7 nndx 1.51 1.41 1.65
8 youden 0.661 0.607 0.710
9 pv.pos 0.768 0.739 0.796
10 pv.neg 0.896 0.872 0.918
11 lr.pos 3.75 3.32 4.24
12 lr.neg 0.131 0.105 0.163
13 p.rout 0.450 0.426 0.475
14 p.rin 0.550 0.525 0.574
15 p.tpdn 0.240 0.211 0.270
16 p.tndp 0.0995 0.0789 0.123
17 p.dntp 0.232 0.204 0.261
18 p.dptn 0.104 0.0823 0.128
This is a tibble containing the same information as summary(rval.tes02), which you should be able to pass on to gtsummary or flextable. Unusually, the broom package doesn't have a tidy() verb for epi.tests objects.
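For readers who prefer to avoid the tidyverse, the same reshaping can be sketched in base R. The list below is a hypothetical stand-in for rval.tes02$detail (two entries only, with values copied from the output above), so treat it as an illustration of the idea rather than the package's own export method:

```r
# Hypothetical stand-in for rval.tes02$detail: a named list of one-row data frames
detail <- list(
  se = data.frame(est = 0.901, lower = 0.877, upper = 0.921),
  sp = data.frame(est = 0.760, lower = 0.730, upper = 0.789)
)

# Turn each list name into a 'statistic' column, then stack the rows
tab <- do.call(rbind, lapply(names(detail), function(nm) {
  cbind(statistic = nm, detail[[nm]])
}))
rownames(tab) <- NULL
tab

# flextable::flextable(tab)  # assumes the flextable package is installed
```

The result is an ordinary data frame, which is the kind of input table packages generally accept.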

Related

Loop on several variables with the same suffix in R

I have a database which looks like this but with many more rows and columns.
Several variables (x, y, z) are measured at different times (1, 2, 3).
df <-
tibble(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
y1 = rnorm(10),
y2 = rnorm(10),
y3 = rnorm(10),
z1 = rnorm(10),
z2 = rnorm(10),
z3 = rnorm(10),
)
I am trying to create dummy variables from the variables with the same suffix (measured at the same time) like this:
df <- df %>%
mutate(var1= ifelse(x1>0 & (y1<0.5 |z1<0.5),0,1)) %>%
mutate(var2= ifelse(x2>0 & (y2<0.5 |z2<0.5),0,1)) %>%
mutate(var3= ifelse(x3>0 & (y1<0.5 |z3<0.5),0,1))
I am used to coding in SAS or Stata, so I would like to use a function or a loop because I have many more variables in my database.
But I think I don't have the right approach in R to deal with this.
Thank you very much for your help !
{dplyover} makes this kind of operation easy (disclaimer: I'm the maintainer), given that your desired output contains a typo:
I think you want to use all variables with the same digit (1, 2, 3 and so on) in each calculation:
df <- df %>%
mutate(var1= ifelse(x1>0 & (y1<0.5 |z1<0.5),0,1)) %>%
mutate(var2= ifelse(x2>0 & (y2<0.5 |z2<0.5),0,1)) %>%
mutate(var3= ifelse(x3>0 & (y3<0.5 |z3<0.5),0,1))
If that is the case, we can use dplyover::over to apply the same function over a vector. Here we construct the vector with cut_names("^[a-z]{1}"), which removes the leading letter from each variable name and returns the unique remainders, here c("1", "2", "3"). We can then construct the variable names using a special syntax: .("x{.x}"). In the first iteration .x evaluates to the first element of our vector, so this returns the object x1 (not a string!), which we can use inside the function argument of over.
library(dplyr)
library(dplyover) # Only on GitHub: https://github.com/TimTeaFan/dplyover
df %>%
mutate(over(cut_names("^[a-z]{1}"),
~ ifelse(.("x{.x}") > 0 & (.("y{.x}") < 0.5 | .("z{.x}") < 0.5), 0, 1),
.names = "var{x}"
))
#> # A tibble: 10 x 12
#> x1 x2 x3 y1 y2 y3 z1 z2 z3 var1
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.690 0.550 0.911 0.203 -0.111 0.530 -2.09 0.189 0.147 0
#> 2 -0.238 1.32 -0.145 0.744 1.05 -0.448 2.05 -1.04 1.50 1
#> 3 0.888 0.898 -1.46 -1.87 -1.14 1.59 1.91 -0.155 1.46 0
#> 4 -2.78 -1.34 -0.486 -0.0674 0.246 0.141 0.154 1.08 -0.319 1
#> 5 -1.20 0.835 1.28 -1.32 -0.674 0.115 0.362 1.06 0.515 1
#> 6 0.622 -0.713 0.0525 1.79 -0.427 0.819 -1.53 -0.885 0.00237 0
#> 7 -2.54 0.0197 0.942 0.230 -1.37 -1.02 -1.55 -0.721 -1.06 1
#> 8 -0.434 1.97 -0.274 0.848 -0.482 -0.422 0.197 0.497 -0.600 1
#> 9 -0.316 -0.219 0.467 -1.97 -0.718 -0.442 -1.39 -0.877 1.52 1
#> 10 -1.03 0.226 2.04 0.432 -1.02 -0.535 0.954 -1.11 0.804 1
#> # ... with 2 more variables: var2 <dbl>, var3 <dbl>
Alternatively we can use dplyr::across and use cur_column(), get() and gsub() to alter the name of the column on the fly. To name the new variables correctly we use gsub() in the .names argument of across and wrap it in curly braces {} to evaluate the expression.
library(dplyr)
df %>%
mutate(across(starts_with("x"),
~ {
cur_c <- dplyr::cur_column()
ifelse(.x > 0 & (get(gsub("x","y", cur_c)) < 0.5 | get(gsub("x","z", cur_c)) < 0.5), 0, 1)
},
.names = '{gsub("x", "var", .col)}'
))
#> # A tibble: 10 x 12
#> x1 x2 x3 y1 y2 y3 z1 z2 z3 var1
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.423 -1.42 -1.15 -1.54 1.92 -0.511 -0.739 0.501 0.451 1
#> 2 -0.358 0.164 0.971 -1.61 1.96 -0.675 -0.0188 -1.88 1.63 1
#> 3 -0.453 -0.758 -0.258 -0.449 -0.795 -0.362 -1.81 -0.780 -1.90 1
#> 4 0.855 0.335 -1.36 0.796 -0.674 -1.37 -1.42 -1.03 -0.560 0
#> 5 0.436 -0.0487 -0.639 0.352 -0.325 -0.893 -0.746 0.0548 -0.394 0
#> 6 -0.228 -0.240 -0.854 -0.197 0.884 0.118 -0.0713 1.09 -0.0289 1
#> 7 -0.949 -0.231 0.428 0.290 -0.803 2.15 -1.11 -0.202 -1.21 1
#> 8 1.88 -0.0980 -2.60 -1.86 -0.0258 -0.965 -1.52 -0.539 0.108 0
#> 9 0.221 1.58 -1.46 -0.806 0.749 0.506 1.09 0.523 1.86 0
#> 10 0.0238 -0.389 -0.474 0.512 -0.448 0.178 0.529 1.56 -1.12 1
#> # ... with 2 more variables: var2 <dbl>, var3 <dbl>
Created on 2022-06-08 by the reprex package (v2.0.1)
You could restructure your data along the principles of tidy data (see e.g. https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).
Here is a conversion to a long format using the tidyverse:
library(tidyverse)
df <-
df |>
pivot_longer(everything()) |>
separate(name, c("var", "time"), sep = "(?=[0-9])") |>
pivot_wider(id_cols = "time",
names_from = "var",
names_prefix = "var_",
values_from = "value",
values_fn = list) |>
unnest(-time) |>
mutate(new_var = ifelse(var_x > 0 & (var_y < 0.5 | var_z < 0.5), 0, 1))
df
You would probably want to keep the data in a long format, but if you want, you can pivot_wider and get back to the format you started with. E.g.
df |>
pivot_wider(values_from = c(starts_with("var_"), "new_var"),
names_from = "time",
values_fn = list) |>
unnest(everything())
As you suggested, a solution using a loop is definitely possible.
# times as unique non-alphabetical parts of column names
times <- unique(gsub('[[:alpha:]]', '', names(df)))
for (time in times) {
# column names for current time
xyz <- paste0(c('x', 'y', 'z'), time)
df[[paste0('var', time)]] <-
ifelse(df[[xyz[1]]]>0 & (df[[xyz[2]]]<.5 | df[[xyz[3]]]<.5), 0, 1)
}
Another way I can think of is transforming the data into a 3D array (observation × time × variable) so that you can actually do the computation for all times at once.
times <- unique(gsub('[[:alpha:]]', '', names(df)))
df.arr <- sapply(c('x', 'y', 'z'),
function(var) as.matrix(df[, paste0(var, times)]),
simplify='array')
new.vars <- ifelse(df.arr[, , 1]>0 & (df.arr[, , 2]<0.5 | df.arr[, , 3]<0.5), 0, 1)
colnames(new.vars) <- paste0('var', times)
cbind(df, new.vars)
Here, sapply creates a matrix from the columns of measurements for each variable at different times and stacks them into a 3D array.
If you trust (or ensure) correct ordering of columns in the data frame, instead of using sapply you can create the array just by modifying the object's dimensions. I didn't do any benchmarking, but I guess this could be the most computationally efficient solution (if it should matter).
df.arr <- as.matrix(df)
dim(df.arr) <- c(dim(df.arr) / c(1, 3), 3)
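As a sanity check on the dimension trick, here is a small self-contained sketch. It assumes the columns are ordered x1, x2, x3, y1, ..., z3 (the data frame here is made up for the check) and compares one column of the array-based result with the explicit per-time computation:

```r
set.seed(42)
n <- 5

# Columns must be ordered x1, x2, x3, y1, ..., z3 for the reshape to be valid
df <- as.data.frame(matrix(rnorm(n * 9), nrow = n))
names(df) <- paste0(rep(c("x", "y", "z"), each = 3), 1:3)

df.arr <- as.matrix(df)
dim(df.arr) <- c(n, 3, 3)   # observation x time x variable

# Third index picks the variable: 1 = x, 2 = y, 3 = z
new.vars <- ifelse(df.arr[, , 1] > 0 & (df.arr[, , 2] < 0.5 | df.arr[, , 3] < 0.5), 0, 1)
colnames(new.vars) <- paste0("var", 1:3)

# Explicit computation for time 2, for comparison
check2 <- ifelse(df$x2 > 0 & (df$y2 < 0.5 | df$z2 < 0.5), 0, 1)
stopifnot(all(new.vars[, "var2"] == check2))
```

If the columns were ordered by time instead (x1, y1, z1, x2, ...), the second and third dimensions would swap roles and the indexing would need to change accordingly.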

Append output from iterative mapped function

This is a follow-up to a previous question: Read functions as text and use for plotting
The output of the mapped function...
data %>%
bind_cols(
map(.x = models,.f = text_model) %>%
set_names(models) %>%
bind_rows(.id = "model")
)
...generates a data frame with the results of each function written to a separate column (with the function included in the column headers).
However, it would be best to have the output from each function appended such that all results are included in the same column with a separate column to keep track of which function ("model001", "model002",..."model500") generated the results.
How can the code from the previous question (Read functions as text and use for plotting) be adjusted to write the results in this manner?
Edit: Someone suggested Read functions as text and use for plotting as an answer, but this post is a follow-up to that one asking about how the output can be written to a single column (rather than a separate column for each function).
Given the other answer, we can pivot the data
data %>%
bind_cols(
map(.x = models,.f = text_model) %>%
set_names(models_names) %>%
bind_rows(.id = "model")
) %>%
pivot_longer(cols = model1:model2,names_to = "model")
# A tibble: 200 x 6
A B C D model value
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 0.833 0.538 0.647 1.65 model1 22.9
2 0.833 0.538 0.647 1.65 model2 57.9
3 2.07 1.20 -0.748 -2.04 model1 35.3
4 2.07 1.20 -0.748 -2.04 model2 70.3
5 0.880 -0.199 1.08 1.04 model1 29.2
6 0.880 -0.199 1.08 1.04 model2 64.2
7 0.252 0.400 1.45 -0.0645 model1 15.6
8 0.252 0.400 1.45 -0.0645 model2 50.6
9 0.746 0.0943 -1.00 1.44 model1 20.4
10 0.746 0.0943 -1.00 1.44 model2 55.4
# ... with 190 more rows

How to efficiently import multiple excel tables located in one sheet into an R list?

Task
I am trying to import tables situated in a single excel sheet into an R object as efficiently as possible (list will be fine, as I can take the rest of the calculations from there).
Nuance
The tables are actually Excel ranges, not Excel tables, but they are structured and look like tables: here is an example of an Excel range that should be imported as a table in R:
Ranges (in table form) are not all the same length and can be situated anywhere in the same sheet.
Reproducible Example
Here you can find a toy example (.xlsx file) to play with:
What I have tried
Here is the code that I have written to import Excel tables into R. This is an inefficient method, as it requires converting all Excel ranges into tables before running this code to import them into a list in R:
library(purrr)
library(XLConnect)
wb <- loadWorkbook("example.xlsx")
tables <- map(1:100,function(x) tryCatch(readTable(wb,
sheet = "Sheet1",
table = paste0("Table",x)),
error = function(e) NA)
)
Question
Is there a better (more efficient) way of importing ranges in one Excel sheet into an R structure, taking the Excel file as given and running all computations/transformations in R? Any packages are welcome!
Thank you very much in advance.
I'm not sure this is the best way, but to solve a similar problem in one of my projects I wrote some utility functions to deal with it. You can see those functions here
The logic behind the splits is that whenever there is a row or column that contains only NA, a split is created at that row or column. This process is repeated a set number of times (the complexity argument).
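The splitting rule can be illustrated with a minimal base-R sketch. The indicator vector is hypothetical (TRUE where a column holds any data, FALSE where it is entirely NA); the sketch only shows how run-length encoding turns it into blocks of column indices and is not a replacement for the full functions:

```r
# Hypothetical indicator: TRUE where a column holds any data, FALSE where it is all NA
col_has_data <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)

runs   <- rle(col_has_data)
ends   <- cumsum(runs$lengths)       # last index of each run
starts <- ends - runs$lengths + 1    # first index of each run

# Keep only the runs that contain data; each one is a block of column indices
blocks <- Map(seq, starts[runs$values], ends[runs$values])
blocks
```

Each element of blocks could then be used to subset the raw data frame by column; applying the same idea to rows gives the second direction of the split.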
Anyway, if you load all the functions I wrote, you can use the code below:
Read Data
library(tidyverse)
table_raw<- readxl::read_excel("example.xlsx",col_names = FALSE,col_types = "text")
Display data Shape
# This is a custom function I wrote
display_table_shape(table_raw)
Split data into separate data frames.
split_table <- table_raw %>%
split_df(complexity = 2) # another custom function I wrote
After the original data frame is split, you can do more processing using for loop or map functions.
Data Cleaning
map(split_table, function(df){
df <- df[-1,]
set_1row_colname(df) %>% # another function I wrote
mutate_all(as.numeric)
})
Result
[[1]]
# A tibble: 8 x 4
aa bb cc dd
<dbl> <dbl> <dbl> <dbl>
1 0.197 0.321 0.265 0.0748
2 0.239 0.891 0.0308 0.453
3 0.300 0.779 0.780 0.213
4 0.132 0.138 0.612 0.0362
5 0.834 0.697 0.879 0.571
6 0.956 0.807 0.741 0.936
7 0.359 0.536 0.0902 0.764
8 0.403 0.315 0.593 0.840
[[2]]
# A tibble: 4 x 4
aa bb cc dd
<dbl> <dbl> <dbl> <dbl>
1 0.136 0.347 0.603 0.542
2 0.790 0.672 0.0808 0.795
3 0.589 0.338 0.837 0.00968
4 0.513 0.766 0.553 0.189
[[3]]
# A tibble: 8 x 4
aa bb cc dd
<dbl> <dbl> <dbl> <dbl>
1 0.995 0.105 0.106 0.530
2 0.372 0.306 0.190 0.609
3 0.508 0.987 0.585 0.233
4 0.0800 0.851 0.215 0.761
5 0.471 0.603 0.740 0.106
6 0.395 0.0808 0.571 0.266
7 0.908 0.739 0.245 0.141
8 0.534 0.313 0.663 0.824
[[4]]
# A tibble: 14 x 4
aa bb cc dd
<dbl> <dbl> <dbl> <dbl>
1 0.225 0.993 0.0382 0.412
2 0.280 0.202 0.823 0.664
3 0.423 0.616 0.377 0.857
4 0.289 0.298 0.0418 0.410
5 0.919 0.932 0.882 0.668
6 0.568 0.561 0.600 0.832
7 0.341 0.210 0.351 0.0863
8 0.757 0.962 0.484 0.677
9 0.275 0.0845 0.824 0.571
10 0.187 0.512 0.884 0.612
11 0.706 0.311 0.00610 0.463
12 0.906 0.411 0.215 0.377
13 0.629 0.317 0.0975 0.312
14 0.144 0.644 0.906 0.353
The functions you need to load
# utility function to get rle as a named vector
vec_rle <- function(v){
temp <- rle(v)
out <- temp$values
names(out) <- temp$lengths
return(out)
}
# utility function to map table with their columns/rows in a bigger table
make_df_index <- function(v){
table_rle <- vec_rle(v)
# run lengths are stored as names, so convert them back to integers
divide_points <- c(0, cumsum(as.integer(names(table_rle))))
# pair the first index of each run with its last index
table_index <- map2(head(divide_points + 1, -1),
divide_points[-1],
~.x:.y)
return(table_index[table_rle])
}
# split a large table in one direction if there are blank columns or rows
split_direction <- function(df,direction = "col"){
if(direction == "col"){
col_has_data <- unname(map_lgl(df,~!all(is.na(.x))))
df_mapping <- make_df_index(col_has_data)
out <- map(df_mapping,~df[,.x])
} else if(direction == "row"){
row_has_data <- df %>%
mutate_all(~!is.na(.x)) %>%
as.matrix() %>%
apply(1,any)
df_mapping <- make_df_index(row_has_data)
out <- map(df_mapping,~df[.x,])
}
return(out)
}
# split a large table into smaller tables if there are blank columns or rows
# if you still see entirely blank rows or columns, increase complexity
split_df <- function(df,showWarning = TRUE,complexity = 1){
if(showWarning){
warning("Please don't use first row as column names.")
}
out <- split_direction(df,"col")
for(i in 1 :complexity){
out <- out %>%
map(~split_direction(.x,"row")) %>%
flatten() %>%
map(~split_direction(.x,"col")) %>%
flatten()
}
return(out)
}
#display the rough shape of table in a sheet with multiple tables
display_table_shape <- function(df){
colnames(df) <- 1:ncol(df)
out <- df %>%
map_df(~as.numeric(!is.na(.x))) %>%
gather(key = "x",value = "value") %>%
mutate(x = as.numeric(x)) %>%
group_by(x) %>%
mutate(y = -row_number()) %>%
ungroup() %>%
filter(value == 1) %>%
ggplot(aes(x = x, y = y,fill = value)) +
geom_tile(fill = "skyblue3") +
scale_x_continuous(position = "top") +
theme_void() +
theme(legend.position="none",
panel.border = element_rect(colour = "black", fill=NA, size=2))
return(out)
}
# set first row as column names for a data frame and remove the original first row
set_1row_colname <- function(df){
colnames(df) <- as.character(df[1,])
out <- df[-1,]
return(out)
}
I had a similar problem; this is how I solved it. Note that it loses some of the benefit of yusuzech's answer, in that it requires you to specify the ranges of interest. On the flip side, it may be more efficient to code and more adaptable to different situations.
# specify the ranges you want to import from the excel sheet
v_ranges <- c("A3:F54", "H3:M54", "O3:T54", "V3:AA54", "AC3:AH54")
# specify the names of the dataframes
v_names <- c("21Q3", "21Q2", "21Q1", "20Q4", "20Q3")
# specify sheet and path
v_path_file <- "my_path/my_excel_file.xlsx"
v_sheet <- "my_sheet_name"
# define the import function, with v_ranges as your ranges of interest, v_path_file as the excel workbook you want to import from, and v_sheet the sheet name of the file
f_import_excel_by_range <- function(.x) {
janitor::clean_names(
readxl::read_excel(v_path_file,
sheet = v_sheet,
range = .x,
col_names = TRUE, na = c(" ", "NA"), trim_ws = TRUE, skip = 1)
)
}
my_file_name <-
purrr::map(v_ranges, f_import_excel_by_range) %>%
purrr::set_names(paste0("my_file_name_",v_names))
# extract databases to the environment
base::invisible(base::list2env(my_file_name, .GlobalEnv))
I believe this function can be improved by including the path and file as well as the sheet name in the function. If I sleuth that out, I will edit. Feedback welcome.
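One possible shape for that improvement, as a sketch only: the function name f_import_excel_by_range2 and its argument names are made up here, and the skip argument is left out because readxl documents it as ignored when range is given.

```r
# Hypothetical variant that takes the path and sheet as arguments
# instead of reading them from the global environment
f_import_excel_by_range2 <- function(range, path, sheet) {
  janitor::clean_names(
    readxl::read_excel(path,
                       sheet = sheet,
                       range = range,
                       col_names = TRUE, na = c(" ", "NA"), trim_ws = TRUE)
  )
}

# Usage, with v_ranges, v_names, v_path_file and v_sheet defined as above:
# my_tables <- lapply(v_ranges, f_import_excel_by_range2,
#                     path = v_path_file, sheet = v_sheet)
# names(my_tables) <- paste0("my_file_name_", v_names)
```

Passing the path and sheet explicitly makes the function reusable across workbooks without editing global variables.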

Produce a matrix from every row (tidyverse)

I am trying to implement analyses across a posterior of matrices. What I start with is a tibble of k^2 columns, where k is the dimension of the matrix. The ith row forms the matrix of the ith iteration.
So, for example for a 3x3 matrix, this is:
set.seed(12)
n <- 1000
z1z1 <- rnorm(n, 5, 1)
z2z2 <- rnorm(n, 5, 1)
z3z3 <- rnorm(n, 5, 1)
z1z2 <- rnorm(n, 0, 1)
z1z3 <- rnorm(n, 0, 1)
z2z3 <- rnorm(n, 0, 1)
post3 <- as_tibble(matrix(c(z1z1, z1z2, z1z3,
z1z2, z2z2, z2z3,
z1z3, z2z3, z3z3),
ncol = 9))
post3
Giving:
# A tibble: 1,000 x 9
V1 V2 V3 V4 V5 V6 V7 V8 V9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.52 -0.618 2.96 -0.618 2.48 -0.634 2.96 -0.634 5.98
2 6.58 -0.827 0.0909 -0.827 5.52 -1.84 0.0909 -1.84 6.20
3 4.04 1.48 -1.66 1.48 6.58 0.166 -1.66 0.166 5.58
4 4.08 -1.01 0.809 -1.01 5.49 0.607 0.809 0.607 6.55
5 3.00 0.582 -0.485 0.582 6.20 0.0765 -0.485 0.0765 6.38
6 4.73 0.718 1.97 0.718 4.00 -0.147 1.97 -0.147 4.35
7 4.68 -0.372 0.572 -0.372 4.65 -1.68 0.572 -1.68 3.83
8 4.37 -0.809 0.883 -0.809 3.96 0.985 0.883 0.985 4.97
9 4.89 0.405 0.686 0.405 6.02 0.252 0.686 0.252 6.29
10 5.43 0.124 0.199 0.124 5.75 0.354 0.199 0.354 4.20
# ... with 990 more rows
Where this is the matrix in the first iteration:
k <- sqrt(length(post3))
matrix(post3[1,], nrow = k)
[,1] [,2] [,3]
[1,] 3.519432 -0.618137 2.962622
[2,] -0.618137 2.479522 -0.6338298
[3,] 2.962622 -0.6338298 5.977552
I am then working along this posterior to calculate the dominance of the first eigenvector:
post3 %>%
rowwise %>%
mutate(
pre_eig = list(eigen(matrix(c(V1, V2, V3, V4, V5, V6, V7, V8, V9), nrow = k))),
dom = pre_eig[[1]][1] / sum(pre_eig[[1]][1:k])) %>%
select('dom')
Giving:
# A tibble: 1,000 x 1
dom
<dbl>
1 0.676
2 0.437
3 0.462
4 0.427
5 0.414
6 0.504
7 0.474
8 0.429
9 0.394
10 0.383
# ... with 990 more rows
What I would like to do is make this script versatile so that it can take posteriors for any value of k. The issue I am having is in how to define the matrix without having to hand write all the column names - when applying this to 2000x2000 matrices I don't want to write out V1, V2, V3... V4000000!
I tried a few things (including ...eigen(matrix(c(paste0('V', 1:(k^2))), nrow = k)))..., which I think is not working because it wants V1, V2... rather than "V1", "V2"...) and I am all out of ideas. How do I get it to automatically take the column names from the posterior tibble?
I would then be able to use the exact same piece of script for example on post3 <- as_tibble(matrix(c(z1z1, z1z2, z1z2, z2z2), ncol = 4))...
You can avoid naming all the columns explicitly if you gather each row's values into key-value pairs:
library(tidyr)
post3 %>%
# add row ID (so that results can be sorted back into original order)
mutate(row.id = seq(1, n())) %>%
# convert each row to long format, with values sorted from 1st to k^2th column
gather(position, value, -row.id) %>%
mutate(position = as.numeric(gsub("^V", "", position))) %>%
arrange(row.id, position) %>%
select(-position) %>%
# group by row ID & calculate
group_by(row.id) %>%
summarise(pre_eig = list(eigen(matrix(value, nrow = k))[["values"]]),
dom = pre_eig[[1]][1] / sum(pre_eig[[1]][1:k])) %>%
ungroup() %>%
# sort results in original order
arrange(row.id) %>%
select(dom)
The results should be the same as before:
# A tibble: 1,000 x 1
dom
<dbl>
1 0.676
2 0.437
3 0.462
4 0.427
5 0.414
6 0.504
7 0.474
8 0.429
9 0.394
10 0.383
# ... with 990 more rows
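A base-R alternative, sketched here on a tiny made-up posterior rather than on post3, is to work row by row with apply(). This also avoids naming columns and should work for any k, though it has not been benchmarked against the approach above:

```r
# Two hypothetical posterior draws of a 2 x 2 symmetric matrix, one draw per row
post <- rbind(c(3, 0, 0, 1),
              c(2, 1, 1, 2))
k <- sqrt(ncol(post))

# For each row: rebuild the k x k matrix, take its eigenvalues,
# and divide the first (largest) by their sum
dom <- apply(post, 1, function(r) {
  ev <- eigen(matrix(r, nrow = k))$values
  ev[1] / sum(ev)
})
dom
```

For a tibble such as post3, the same call would presumably be apply(as.matrix(post3), 1, ...), with k computed from ncol(post3).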

Calculating qchisq on a sparklyr tbl

I need to use the qchisq function on a column of a sparklyr data frame.
The problem is that the qchisq function does not seem to be implemented in Spark. If I am reading the error message below correctly, sparklyr tried to execute a function called "QCHISQ", which exists neither in Hive SQL nor in Spark.
In general, is there a way to run arbitrary functions that are not implemented in Hive or Spark, with sparklyr? I know about spark_apply, but haven't figured out how to configure it yet.
> mydf = data.frame(beta=runif(100, -5, 5), pval = runif(100, 0.001, 0.1))
> mydf_tbl = copy_to(con, mydf)
> mydf_tbl
# Source: table<mydf> [?? x 2]
# Database: spark_connection
beta pval
<dbl> <dbl>
1 3.42 0.0913
2 -1.72 0.0629
3 0.515 0.0335
4 -3.12 0.0717
5 -2.12 0.0253
6 1.36 0.00640
7 -3.33 0.0896
8 1.36 0.0235
9 0.619 0.0414
10 4.73 0.0416
> mydf_tbl %>% mutate(se = sqrt(beta^2/qchisq(pval)))
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'QCHISQ'.
This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 49
As you noted you can use spark_apply:
mydf_tbl %>%
spark_apply(function(df)
dplyr::mutate(df, se = sqrt(beta^2/qchisq(pval, df = 12))))
# # Source: table<sparklyr_tmp_14bd5feacf5> [?? x 3]
# # Database: spark_connection
# beta pval X3
# <dbl> <dbl> <dbl>
# 1 1.66 0.0763 0.686
# 2 0.153 0.0872 0.0623
# 3 2.96 0.0485 1.30
# 4 4.86 0.0349 2.22
# 5 -1.82 0.0712 0.760
# 6 2.34 0.0295 1.10
# 7 3.54 0.0297 1.65
# 8 4.57 0.0784 1.88
# 9 4.94 0.0394 2.23
# 10 -0.610 0.0906 0.246
# # ... with more rows
but fair warning - it is embarrassingly slow. Unfortunately you don't have an alternative here, short of writing your own Scala / Java extensions.
In the end I've used a horrible hack, which works fine for this case.
Another solution would have been to write a User Defined Function (UDF), but sparklyr doesn't support it yet: https://github.com/rstudio/sparklyr/issues/1052
This is the hack I've used. In short, I precompute a qchisq table, upload it as a sparklyr object, then join. If I compare this with results calculated on a local data frame, I get a correlation of r=0.99999990902236146617.
#' @param n number of significant digits to use
> check_precomputed_strategy = function(n) {
chisq = data.frame(pval=seq(0, 1, 1/(10**(n)))) %>%
mutate(qval=qchisq(pval, df=1, lower.tail = FALSE)) %>%
mutate(pval_s = as.character(round(as.integer(pval*10**n),0)))
chisq %>% head %>% print
chisq_tbl = copy_to(con, chisq, overwrite=T)
mydf = data.frame(beta=runif(100, -5, 5), pval = runif(100, 0.001, 0.1)) %>%
mutate(se1 = sqrt(beta^2/qchisq(pval, df=1, lower.tail = FALSE)))
mydf_tbl = copy_to(con, mydf)
mydf_tbl.up = mydf_tbl %>%
mutate(pval_s=as.character(round(as.integer(pval*10**n),0))) %>%
left_join(chisq_tbl, by="pval_s") %>%
mutate(se=sqrt(beta^2 / qval)) %>%
collect %>%
filter(!duplicated(beta))
mydf_tbl.up %>% head %>% print
mydf_tbl.up %>% filter(complete.cases(.)) %>% nrow %>% print
mydf_tbl.up %>% filter(complete.cases(.)) %>% select(se, se1) %>% cor
}
> check_precomputed_strategy(4)
pval qval pval_s
1 0.00000000000000000000000 Inf 0
2 0.00010000000000000000479 15.136705226623396570 1
3 0.00020000000000000000958 13.831083619091122827 2
4 0.00030000000000000002793 13.070394140069462097 3
5 0.00040000000000000001917 12.532193305401813532 4
6 0.00050000000000000001041 12.115665146397173402 5
# A tibble: 6 x 8
beta pval.x se1 myvar pval_s pval.y qval se
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 3.42 0.0913 2.03 1. 912 0.0912 2.85 2.03
2 -1.72 0.0629 0.927 1. 628 0.0628 3.46 0.927
3 0.515 0.0335 0.242 1. 335 0.0335 4.52 0.242
4 -3.12 0.0717 1.73 1. 716 0.0716 3.25 1.73
5 -2.12 0.0253 0.947 1. 253 0.0253 5.00 0.946
6 1.36 0.00640 0.498 1. 63 0.00630 7.46 0.497
[1] 100
se se1
se 1.00000000000000000000 0.99999990902236146617
se1 0.99999990902236146617 1.00000000000000000000
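The lookup-table idea can be tried locally without Spark. The sketch below is an approximation of the approach under stated assumptions: it uses match() in place of the join, rounds the key instead of truncating it with as.integer(), and uses three p-values picked by hand:

```r
n <- 4  # number of decimal digits kept in the lookup key

# Precompute qchisq for every p-value on a grid of 10^-n
lookup <- data.frame(key = 0:10^n)
lookup$qval <- qchisq(lookup$key / 10^n, df = 1, lower.tail = FALSE)

# Hand-picked p-values with at most n decimals
pval <- c(0.0913, 0.0629, 0.0335)
key  <- as.integer(round(pval * 10^n))

# Look up the precomputed quantile and compare with the direct computation
approx_q <- lookup$qval[match(key, lookup$key)]
exact_q  <- qchisq(pval, df = 1, lower.tail = FALSE)
max(abs(approx_q - exact_q))
```

For p-values with more than n decimals the lookup is only approximate, and the error grows for very small p-values, where the quantile function is steepest.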

Resources